Have you heard the roar of a Ferrari? The sound is unique by design, the result of decades of engineering and refinement that connect the driver to the car. In the same way, a data processing engine is the central component that connects data users to their data.
In the context of enterprise data management, data processing engines take data pipelines, abstract the business logic (simple or complex), and process the data on frameworks such as Apache Spark in an optimized way, in streaming or batch mode, on-premises or in the cloud.
There are many data engines in the marketplace to use, but just like shopping for a car, you look for key capabilities and differentiators that will get you from point A to point B optimally.
At Informatica, we’ve been building data processing engines for a very long time (25 years!). Over the years, we have learned what it takes to build best-in-class, enterprise-ready data engines that support various data workloads, on-premises or in the cloud.
8 data processing engine concepts
Based on our experience, these are the 8 data processing engine concepts that you should understand when evaluating data platforms:
- Validation: Many design tools generate an XML or JSON representation of a data pipeline. The data engine typically revalidates the pipeline definition and replaces placeholder parameters with the real values supplied at runtime. If the data pipeline refers to reusable pipeline components, or mapplets, they are also expanded.
- Optimization: Design tools let you build data pipelines using simple step-by-step processes. The data processing engine needs to let the pipeline stay understandable and easily maintainable in the design tool while translating it into code that runs optimally in that engine. For example, if the data pipeline reads data from a relational table and then applies a filter, it is optimal to "push" the filter "down" to the relational database. This simple optimization has the following benefits:
  - Reading from the relational table is much faster because you are reading a subset of the data
  - A relational database engine would most likely use a database index to allow faster reads
  - This eliminates the need to move data between "read" and "filter" steps by combining them into one step
- Code Generation & Pushdown: Once the pipeline has been validated and optimized, the data pipeline needs to be translated into optimized code to support transactional, database, big data, and analytical workloads. To support various compute workloads, the data processing engine offers two modes of code translation: “native” and “ecosystem pushdown.” With Informatica’s native-mode capabilities, the data processing engine provides its own execution environment. Ecosystem “pushdown” mode translates the data pipeline into another abstraction for execution, for example, Spark or Spark stream processing. In the next blog, we will dive deeper into ecosystem pushdown optimization.
- Resource Acquisition: Without proper resource acquisition upfront, the data pipeline execution may fail, resulting in wasted compute resources and missed SLAs. If the data processing engine is using Informatica’s native execution mode, it will reserve resources where the engine is running, for example, on Linux or Windows. If it’s in pushdown mode, the data processing engine will acquire the required resources from the ecosystem, for example, Amazon Redshift, Azure SQL, a relational database, or Spark. In a streaming scenario, where the workload runs continuously, the resource strategy must be elastic and take into consideration the incoming streaming data.
- Runtime: At this point our data processing pipeline is validated, optimized, and translated, and the required resources are acquired. We need to ship the code and run it. The data processing engine must be very efficient in performing low-level data operations. This means it must be very efficient at storing data in memory, minimizing data marshalling and unmarshalling, managing memory buffers, etc. For example, Informatica’s native engine is highly tuned for efficient run-time processing, and Apache Spark uses Project Tungsten to achieve efficiency.
- Monitoring: The data processing engine must provide progress and health-related data while the job is running. Monitoring must provide meaningful data that can be consumed by a monitoring UI, CLI, or API. Monitoring differs subtly between batch and streaming workloads. For example, because a streaming workload runs continuously, you will be monitoring the volume of data processed rather than the number of jobs run.
- Error Handling: The data processing engine needs to detect error conditions and clean up resource allocations, temporary files, etc. Error handling can be done at the data engine level, where all processing follows the same handling, or at the data pipeline level, where each pipeline has its own error handling instructions. Just like monitoring, there is a difference in how errors are handled between batch and streaming workloads. When an error occurs with a batch workload, the job can be restarted, and the data can be processed in the next workload invocation. In real-time streaming mode, the data might not be available at restart.
- Statistics Collection: When the job completes, the data processing engine must record various statistics such as status, total runtime, per-transformation runtime, and the amount of resources requested and used. This information is useful for future optimization, especially by the "Resource Acquisition" step.
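To make the Validation concept concrete, here is a minimal Python sketch of what an engine might do with a JSON pipeline definition: check its basic shape, expand reusable components, and substitute runtime parameters. The JSON schema, the `$$NAME` placeholder syntax, the `MAPPLETS` library, and the function names are all assumptions made up for illustration, not Informatica's actual format or API.

```python
import json
import re

# Hypothetical pipeline definition, as a design tool might export it.
PIPELINE_JSON = """
{
  "name": "orders_daily",
  "steps": [
    {"type": "read", "table": "$$SRC_TABLE"},
    {"type": "mapplet", "ref": "clean_orders"},
    {"type": "write", "path": "$$TARGET_PATH"}
  ]
}
"""

# Hypothetical library of reusable pipeline components (mapplets).
MAPPLETS = {
    "clean_orders": [
        {"type": "filter", "condition": "amount > 0"},
        {"type": "dedupe", "keys": ["order_id"]},
    ]
}

# Assumed placeholder syntax: $$ followed by an upper-case name.
PLACEHOLDER = re.compile(r"\$\$([A-Z_]+)")


def resolve(value, params):
    """Replace $$NAME placeholders in strings, recursing into lists and dicts."""
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in params:
                raise ValueError(f"missing runtime parameter: {name}")
            return str(params[name])
        return PLACEHOLDER.sub(substitute, value)
    if isinstance(value, list):
        return [resolve(item, params) for item in value]
    if isinstance(value, dict):
        return {key: resolve(item, params) for key, item in value.items()}
    return value


def prepare(pipeline_json, params, mapplets):
    """Revalidate the definition, expand mapplets, then bind runtime parameters."""
    pipeline = json.loads(pipeline_json)
    if not pipeline.get("steps"):
        raise ValueError("pipeline has no steps")
    expanded = []
    for step in pipeline["steps"]:
        if "type" not in step:
            raise ValueError(f"step without a type: {step}")
        if step["type"] == "mapplet":
            if step["ref"] not in mapplets:
                raise ValueError(f"unknown mapplet: {step['ref']}")
            expanded.extend(mapplets[step["ref"]])  # inline the reusable steps
        else:
            expanded.append(step)
    pipeline["steps"] = resolve(expanded, params)
    return pipeline


ready = prepare(
    PIPELINE_JSON,
    {"SRC_TABLE": "sales.orders", "TARGET_PATH": "/data/out"},
    MAPPLETS,
)
print([step["type"] for step in ready["steps"]])
```

Failing fast on a missing parameter or an unknown mapplet is the point of this stage: it is far cheaper to reject a pipeline here than after resources have been acquired.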
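The filter pushdown described under Optimization can be sketched in the same spirit. The snippet below assumes a hypothetical plan format (a list of step dictionaries) and merges any filter that directly follows a relational read into that read's `WHERE` clause, using an in-memory SQLite table as a stand-in for the relational database. Conditions are treated as trusted SQL fragments here; a real engine would parse and validate expressions rather than concatenate strings.

```python
import sqlite3


def push_down_filters(steps):
    """Merge each filter that directly follows a read into that read step."""
    optimized = []
    for step in steps:
        previous = optimized[-1] if optimized else None
        if step["type"] == "filter" and previous and previous["type"] == "read":
            # Combine with any predicates already pushed into the read.
            conditions = previous.get("where", []) + [step["condition"]]
            optimized[-1] = {**previous, "where": conditions}
        else:
            optimized.append(step)
    return optimized


def to_sql(read_step):
    """Render a read step, including pushed-down predicates, as a SELECT."""
    sql = f"SELECT * FROM {read_step['table']}"
    if read_step.get("where"):
        sql += " WHERE " + " AND ".join(f"({c})" for c in read_step["where"])
    return sql


# Hypothetical three-step plan: read, filter, write.
plan = [
    {"type": "read", "table": "orders"},
    {"type": "filter", "condition": "amount > 100"},
    {"type": "write", "path": "out.csv"},
]
optimized = push_down_filters(plan)

# Stand-in relational source with three rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 50.0), (2, 150.0), (3, 300.0)]
)

# The database now does the filtering; only matching rows leave it.
rows = conn.execute(to_sql(optimized[0])).fetchall()
print(to_sql(optimized[0]))
print(len(optimized), "steps instead of", len(plan))
```

After the rewrite, the separate filter step is gone, so only the rows that satisfy the predicate are ever transferred out of the database.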
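Error Handling and Statistics Collection often live in the same wrapper around the runtime. The following sketch, again with hypothetical names (`run_pipeline`, the `stats` fields, the scratch-directory convention), times each transformation, records the final status, and removes temporary files whether the job succeeds or fails. It illustrates engine-level handling for a batch job only; streaming recovery is a separate problem.

```python
import shutil
import tempfile
import time
from pathlib import Path


def run_pipeline(name, transformations, rows):
    """Run (label, function) pairs in order and return (result, stats).

    Each function receives the current rows and a scratch directory
    and returns the new rows.
    """
    stats = {"pipeline": name, "status": "running", "per_transformation_s": {}}
    scratch = Path(tempfile.mkdtemp(prefix=f"{name}_"))
    started = time.perf_counter()
    result = None
    current = None
    try:
        for label, fn in transformations:
            current = label
            step_started = time.perf_counter()
            rows = fn(rows, scratch)
            stats["per_transformation_s"][label] = (
                time.perf_counter() - step_started
            )
        result = rows
        stats["status"] = "succeeded"
    except Exception as exc:
        # Record what failed and where, instead of losing the context.
        stats["status"] = "failed"
        stats["failed_step"] = current
        stats["error"] = f"{type(exc).__name__}: {exc}"
    finally:
        # Clean up temporary files on both the success and failure paths.
        shutil.rmtree(scratch, ignore_errors=True)
        stats["scratch_removed"] = not scratch.exists()
        stats["total_runtime_s"] = time.perf_counter() - started
    return result, stats


def double(rows, scratch):
    return [value * 2 for value in rows]


def failing_step(rows, scratch):
    (scratch / "partial.txt").write_text("half-written output")
    raise RuntimeError("simulated failure")


result, stats = run_pipeline(
    "demo", [("double", double), ("failing_step", failing_step)], [1, 2, 3]
)
print(stats["status"], stats["failed_step"], stats["scratch_removed"])
```

The collected `stats` dictionary is the kind of record a later run could consult, for instance to size resource requests based on previous runtimes.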
We’ve covered quite a few concepts for data processing engines, and you’ve gotten a glimpse of how central a data processing engine is to a data platform. In the upcoming blogs, we will dive into details on two compute concepts – push-down optimization and serverless compute – which are key capabilities of a data engine.