What is data engineering?
Data engineering enables data users across the enterprise with clean, quality data they can trust, so they can drive better business insights and actions. Data engineering is the result of technology disruption in what we used to call big data. Overall, the industry is moving toward data management environments that deliver insights from AI and machine learning while leveraging the cloud for agility. And while the amount of data in these environments is still “big” (in fact, AI and ML need massive amounts of data), the technologies that used to manage big data just aren’t big enough for this evolutionary step.
Here’s how Gartner defines data engineering: “Data engineering is the practice of making the appropriate data accessible and available to various data consumers (including data scientists, data analysts, business analytics and business users). It is a discipline that involves collaboration across business and IT.”
As organizations look to modernize their analytics environments, data engineering is on the rise. Here’s a look at how we got here and what you need to know about data engineering.
The toughest challenge for AI and advanced analytics is not AI itself; it’s data management at scale. The scale of data has far exceeded the technologies that traditionally managed it. Hadoop, MapReduce, YARN, and HDFS are among the key technologies that enabled organizations to handle the high volumes and wide varieties of data we call big data. Compute, storage, and big data management were all closely tied together to drive data and analytics success from data lakes and data warehouses.
The adoption of cloud and the advent of technologies such as Spark, serverless, and Kafka have ushered in the era of big data engineering, effectively uncoupling storage and compute and enabling faster processing of multi-latency, petabyte-scale data with auto-scaling and auto-tuning.
Cloud: Cloud has been one of the biggest disruptors of big data. By separating storage and compute, making it easy to scale and tune servers, and bringing huge cost savings, it has transformed how data engineering pipelines are processed at scale.
Spark: Another major disruptor has been Apache Spark, which has grown rapidly over the last few years. Spark is a distributed processing engine for big data engineering workloads at petabyte scale, enabling machine learning and analytics. Speed is Spark’s biggest advantage: it can be up to 100x faster than Hadoop for large-scale data processing [Source: Databricks].
Serverless: Serverless capability enables enterprises to build applications composed of microservices that run in response to events, auto-scale for you, and only charge you when they run. This lowers the total cost of maintaining your apps, enabling you to build more logic, faster.
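The event-driven shape of a serverless function can be sketched in a few lines. The handler below mimics an AWS Lambda-style signature, but the event payload and field names are hypothetical, chosen only to show the model: no server to manage, one function invoked per event.

```python
import json

def handler(event, context=None):
    """Lambda-style handler: invoked once per event, billed per run.
    The payload shape (items with qty/price in cents) is illustrative."""
    body = json.loads(event["body"])
    # Prices are in integer cents to keep the arithmetic exact.
    total_cents = sum(item["qty"] * item["price_cents"] for item in body["items"])
    return {"statusCode": 200,
            "body": json.dumps({"order_total_cents": total_cents})}

# Simulate one incoming event, as the platform would deliver it.
event = {"body": json.dumps({"items": [{"qty": 2, "price_cents": 999},
                                       {"qty": 1, "price_cents": 500}]})}
response = handler(event)
```

The platform, not your code, handles scaling: a thousand concurrent events simply mean a thousand concurrent invocations of `handler`.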
Kafka: Apache Kafka is an event streaming technology capable of handling trillions of events a day, and it has evolved from a messaging queue into a full-fledged event streaming platform.
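Kafka’s core abstraction is an append-only log per topic, with each consumer group tracking its own read offset. The toy class below is not the Kafka API; it is a pure-Python sketch of that log-and-offset model, which is what lets Kafka replay events and serve many independent consumers from the same stream.

```python
from collections import defaultdict

class ToyEventLog:
    """Toy sketch of Kafka's model: append-only logs keyed by topic,
    with per-(group, topic) offsets. Not the real Kafka client API."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered list of events
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def produce(self, topic, event):
        # Producers only ever append; the log is immutable history.
        self.topics[topic].append(event)

    def consume(self, group, topic, max_events=100):
        # Each consumer group reads from its own committed offset.
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_events]
        self.offsets[(group, topic)] += len(batch)  # commit the new offset
        return batch

log = ToyEventLog()
log.produce("clicks", {"user": "a", "page": "/home"})
log.produce("clicks", {"user": "b", "page": "/pricing"})

first = log.consume("analytics", "clicks")   # reads both events
second = log.consume("analytics", "clicks")  # nothing new after commit
fresh = log.consume("audit", "clicks")       # a new group replays from offset 0
```

Because offsets belong to the consumer group rather than the log, the hypothetical "audit" group above replays the full stream independently of "analytics", which is exactly what makes event streaming more than a message queue.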
Cloud, Spark, serverless, and Kafka, among other technologies, have made Hadoop and big data near-obsolete when it comes to data and analytics. Heavy adoption of these technologies by prominent providers such as Microsoft Azure, Amazon Web Services (AWS), and Databricks furthered the evolution of big data to data engineering.
While cloud, Spark, serverless, and Kafka are essential technologies of data engineering, data engineers, data scientists, and data analysts are quintessential user personas of data engineering. To understand the impact of data engineering on AI and analytics, let’s look at it from the vantage point of these data users.
Lines of business (sales, finance, marketing, supply chain, etc.) need to answer key questions such as:
Further, data scientists are spending 80% of their time preparing data instead of building models, so they’re asking:
Similarly, data analysts do not have the right data for business insights to help drive actions, and they want to know:
Data engineers help data scientists and data analysts find the right data, make it available in their environment, make sure the data is trusted and that sensitive data is masked, ensure they spend less time on data preparation, and operationalize data engineering pipelines.
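One of those tasks, masking sensitive data, can be illustrated with a short sketch. This is a generic approach (one-way hashing of sensitive fields), not a specific product’s feature, and the field names are illustrative assumptions.

```python
import hashlib

def mask_record(record, sensitive_fields=("email", "ssn")):
    """Sketch: replace sensitive values with a truncated one-way hash,
    so analysts can still join and count on them without seeing raw
    values. Field names here are hypothetical."""
    masked = dict(record)
    for field in sensitive_fields:
        if masked.get(field) is not None:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
            masked[field] = digest[:12]  # short, stable, irreversible token
    return masked

row = {"customer": "Acme", "email": "jane@acme.com", "spend": 1200}
safe = mask_record(row)
```

Because the same input always hashes to the same token, downstream joins and aggregations still work on the masked column, while the raw value never leaves the pipeline.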
In fact, data engineering is one of the hottest jobs in the tech industry [Source: LinkedIn and Dice]. The data scientist gets much attention as an important role in the age of analytics. Equally important, but with less fanfare, is the role of data engineer. The data scientist finds meaning and insights in data. Data engineers design and build the data ecosystem that is essential to analytics [Source: Eckerson]. There are 4x more jobs for data engineers than for data scientists [Source: Datanami].
Enterprises must take a platform and AI-driven approach for end-to-end data engineering instead of stitching together piecemeal solutions. The platform needs to support all technologies that led to the emergence of data engineering: cloud, Spark, serverless, and Kafka.
According to Databricks’ research, very few enterprise AI projects are successful, mainly due to a lack of data [Source: Databricks/Google research]. Despite massive investment in data and analytics initiatives, many organizations report difficulties in bringing them into production. Yet another report states that data users spend 80% of their time preparing data before they can use it for analysis or modeling [Source: CrowdFlower]. The common thread is the need for clean data that enterprises can trust for their AI and analytics projects, and that is exactly what end-to-end data engineering delivers.
The Informatica data engineering portfolio provides the following capabilities:
I invite you to visit the following resources to learn how data engineering can help deliver AI and analytics success at your enterprise.
 Gartner, “Give Up Controlling Your Data: How to Overcome the Limits of Conventional Data Management Wisdom,” by Ted Friedman, 16 September 2019