The Rise of Big Data Engineering: Cloud, AI & Analytics Success

Last Published: Dec 23, 2021
Vamshi Sriperumbudur


What is data engineering?

Data engineering enables data users across the enterprise with clean, quality data they can trust, so they can drive better business insights and actions. Data engineering is the result of technology disruption in what we used to call big data. Overall, the industry is moving toward data management environments that deliver insights from AI and machine learning while leveraging the cloud for agility. And while the amount of data in these environments is still “big” (in fact, AI and ML need massive amounts of data), the technologies once used to manage big data just aren’t big enough for this evolutionary step.

Here’s how Gartner defines data engineering: “Data engineering is the practice of making the appropriate data accessible and available to various data consumers (including data scientists, data analysts, business analytics and business users). It is a discipline that involves collaboration across business and IT.”[1]

As organizations look to modernize their analytics environments, data engineering is on the rise. Here’s a look at how we got here and what you need to know about data engineering.

How big data has evolved to data engineering

The toughest challenge for AI and advanced analytics is not AI itself; it’s data management at scale. The scale of data has far exceeded the technologies that traditionally managed it. Hadoop, MapReduce, YARN, and HDFS are among the key technologies that enabled organizations to handle high volumes and wide varieties of data, i.e., big data. Compute, storage, and big data management were all closely tied together to drive data and analytics success from data lakes and data warehouses.

The adoption of cloud and the advent of technologies such as Spark, serverless, and Kafka have ushered in the era of big data engineering, effectively decoupling storage from compute and enabling faster processing of multi-latency, petabyte-scale data with auto-scaling and auto-tuning.

Cloud: Cloud has been one of the biggest disruptors of big data. By separating storage and compute, making it easy to scale and tune servers, and bringing huge cost savings, it has changed how data engineering pipelines are processed at scale.

Spark: Another major disruptor has been Apache Spark, which has grown rapidly over the last few years. Spark is a distributed processing engine for big data engineering workloads at petabyte scale, enabling machine learning and analytics. Speed is Spark’s biggest advantage: it can be 100x faster than Hadoop for large-scale data processing [Source: Databricks].

Serverless: Serverless capability enables enterprises to build applications composed of microservices that run in response to events, auto-scale for you, and charge you only when they run. This lowers the total cost of maintaining your apps, enabling you to build more logic, faster.

Kafka: Apache Kafka is an event streaming technology capable of handling trillions of events a day. It has evolved from a messaging queue into a full-fledged event streaming platform.
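To make the difference between a messaging queue and an event stream a little more concrete, here is a minimal sketch in plain Python. It is a toy, in-memory model and not the Kafka client API: the EventLog class, its method names, and the sample order events are all hypothetical. It illustrates the general idea behind event streaming, where events are kept in an append-only log and each group of consumers tracks its own position (offset), so several readers can process the same events independently and even replay them.

```python
from collections import defaultdict


class EventLog:
    """Toy in-memory append-only log (hypothetical; not the Kafka client API)."""

    def __init__(self):
        self._events = []                  # events are retained, never removed on read
        self._offsets = defaultdict(int)   # next position to read, per consumer group

    def append(self, event):
        """Append an event and return its offset (its position in the log)."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group, max_events=10):
        """Return the next unread events for `group` and advance its offset."""
        start = self._offsets[group]
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's read position, e.g. back to 0 to replay history."""
        self._offsets[group] = offset


log = EventLog()
for order_id in (101, 102, 103):
    log.append({"type": "order_placed", "order_id": order_id})

# Two independent consumer groups read the same events at their own pace.
billing = log.poll("billing", max_events=2)
analytics = log.poll("analytics")

# Replaying from the start is possible because reads do not delete events.
log.seek("billing", 0)
replayed = log.poll("billing")
```

In a classic queue, a message is typically gone once one consumer has taken it. In the sketch above, the billing and analytics groups each see every order, which is what lets streaming pipelines feed many downstream uses from one source.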

Cloud, Spark, serverless, and Kafka, among other technologies, have made Hadoop and big data near-obsolete when it comes to data and analytics. Heavy adoption of these technologies by prominent providers such as Microsoft Azure, Amazon Web Services (AWS), and Databricks furthered the evolution of big data to data engineering.

Data engineering user personas – and the AI and analytics challenge

While cloud, Spark, serverless, and Kafka are essential technologies of data engineering, data engineers, data scientists, and data analysts are quintessential user personas of data engineering. To understand the impact of data engineering on AI and analytics, let’s look at it from the vantage point of these data users.

Lines of business (sales, finance, marketing, supply chain, etc.) need to answer key questions such as:

  • How can data help me predict what will happen?
  • How can data help me understand what has happened?
  • How can my staff collaborate better and prepare data more easily?

Further, data scientists spend 80% of their time preparing data rather than building models, so they’re asking:

  • How will I find the right data for my modeling?
  • How will I make this data available in my ML environment?
  • How can I ensure I trust the data for my modeling?
  • Can I simplify data prep so I can spend more time on modeling?
  • How can I deploy and operationalize my ML models into production?

Similarly, data analysts often lack the right data to produce business insights that drive action, and they want to know:

  • How will I find the right data for my business insights?
  • How will I make this data available in my data lake?
  • How can I ensure I trust the data?
  • Can I simplify data prep so I can spend more time on analysis?
  • How can I easily collaborate with my peers and IT for ongoing changes?

Data engineer to the rescue

Data engineers help data scientists and data analysts find the right data, make it available in their environment, make sure the data is trusted and that sensitive data is masked, ensure they spend less time on data preparation, and operationalize data engineering pipelines.

In fact, data engineering is one of the hottest jobs in the tech industry [Source: LinkedIn and Dice]. The data scientist gets much attention as an important role in the age of analytics. Equally important, but with less fanfare, is the role of data engineer. The data scientist finds meaning and insights in data. Data engineers design and build the data ecosystem that is essential to analytics [Source: Eckerson]. There are 4x more jobs for data engineers than for data scientists [Source: Datanami].

7 critical capabilities of data engineering

Enterprises must take a platform-based, AI-driven approach to end-to-end data engineering instead of stitching together piecemeal solutions. The platform needs to support all the technologies that led to the emergence of data engineering: cloud, Spark, serverless, and Kafka.

  1. Discover the right dataset with an intelligent data catalog
  2. Bring the right data into your data lake or ML environment with mass ingestion
  3. Operationalize your data pipelines with enterprise-class data integration
  4. Process real-time data at scale with AI-powered stream processing
  5. Desensitize confidential information with intelligent data masking
  6. Ensure trusted data is available for insights with intelligent data quality at scale
  7. Simplify data prep and enable collaboration with enterprise-class data preparation
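As a rough illustration of what capabilities 5 and 6 mean in practice, the sketch below applies a simple quality check and then masks a sensitive field using only the Python standard library. It is not how any particular product implements these capabilities. The field names, the REQUIRED_FIELDS and SENSITIVE_FIELDS settings, the helper functions, and the sample records are all assumptions made for the example.

```python
import hashlib

REQUIRED_FIELDS = ("customer_id", "email", "amount")  # assumed schema
SENSITIVE_FIELDS = ("email",)                          # assumed sensitive columns


def quality_issues(record):
    """Return a list of human-readable problems found in one record."""
    issues = [f"missing {name}" for name in REQUIRED_FIELDS
              if record.get(name) in (None, "")]
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        issues.append("negative amount")
    return issues


def mask(value):
    """Replace a value with a short, deterministic SHA-256 token.

    Deterministic tokens keep joins on the masked column possible.
    Unsalted hashing is for illustration only and is not strong
    protection on its own.
    """
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def prepare(records):
    """Split records into (clean_and_masked, rejected_with_reasons)."""
    clean, rejected = [], []
    for record in records:
        issues = quality_issues(record)
        if issues:
            rejected.append((record, issues))
            continue
        masked = dict(record)  # copy so the raw record is left untouched
        for name in SENSITIVE_FIELDS:
            masked[name] = mask(masked[name])
        clean.append(masked)
    return clean, rejected


raw = [
    {"customer_id": 1, "email": "ana@example.com", "amount": 42.5},
    {"customer_id": 2, "email": "", "amount": 10.0},
    {"customer_id": 3, "email": "raj@example.com", "amount": -5},
]
clean, rejected = prepare(raw)
```

Only the first sample record passes the checks and reaches the clean set, with its email replaced by a token. The other two are set aside along with the reason they failed, so someone can fix them at the source rather than letting them flow silently into a model or report.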

Why data engineering is critical to AI and analytics success

According to Databricks’ research, very few AI projects in the enterprise are successful, mainly due to lack of data [Source: Databricks/Google research]. Despite massive investment in data and analytics initiatives, many organizations report difficulties in bringing them into production. Yet another report states that data users spend 80% of their time preparing data before they can use it for analysis or modeling [Source: CrowdFlower]. The common thread in all of these findings is the need for good, clean data that enterprises can trust for their AI and analytics projects, and that is exactly what end-to-end data engineering brings.

Informatica supports end-to-end data engineering for AI and analytics

Informatica data engineering portfolio provides the following capabilities and includes:

Learn more about how Informatica Data Engineering can help your organization

I invite you to visit the following resources to learn how data engineering can help deliver AI and analytics success at your enterprise.

[1] Gartner, “Give Up Controlling Your Data: How to Overcome the Limits of Conventional Data Management Wisdom,” by Ted Friedman, 16 September 2019

First Published: Nov 20, 2019