Data Processing Engines Part 2: Pushing the Pedal to the Metal With Pushdown Optimization
Updated October 2020.
In the previous blog, we discussed how data processing engines are central to a data platform. In this blog, we’ll dive into the characteristics of pushdown optimization.
If you’ve read the data engine blog series, so far, we described working with abstract definition of a data pipeline. That is, once the data pipeline is validated and optimized, the data pipeline needs to be translated into optimized code to support transactional, database, big data, and analytical workloads. In the previous blog we described data process engine capabilities and introduced “native” and “pushdown” mode concepts. Let’s look at these two concepts in detail.
- Native mode: In this mode, the data engine provides its own execution environment whereby the data pipeline is translated into optimized Informatica “native” code that executes the pipeline logic. For example, the data pipeline may integrate and cleanse data from a CRM and ERP systems and load to a cloud data warehouse. In this case, all data processing, resource allocating, and grid capabilities happen on the Informatica “native” engine, either running on Linux or Windows.
- With pushdown mode, the data pipeline specification is translated into another abstraction for execution, such as a database. Take the above example of integrating CRM and ERP systems, here, the engine translates the data pipeline to an optimized SQL statement and “pushes” the business transformation logic to a database, where the underlying database or application handles the allocation of resources and scheduling. The same technique applies to Spark or Spark streaming processing engines. To efficiently run the workload, a data processing engine must allow the following types of pushdown optimizations:
- In source pushdown optimization, the data processing engine analyzes the data pipeline from source to target and inquires if the given data source connector has pushdown capabilities supported by the data source. For example, if a data pipeline reads data from a relational table and then applies a filter, the data processing engine will determine it is better to “push” the filter “down” to the source. For a detailed discussion on the connectivity capabilities that a data platform must provide, check out the blog “Why Is Connectivity Foundational for a Data Platform?”
- The second optimization technique is called “full” pushdown. Take for example, a data pipeline that reads source data, applies business logic, and writes the output to the same database or application as the source. The process execution of these steps in the data pipeline is translated into optimized code of the underlying database or application, reducing the amount of data being moved.
- Lastly, a data process engine has “ecosystem” pushdown. Ecosystem pushdown is similar in concept to full pushdown but leverages open source frameworks like Spark or public clouds like Azure, AWS, or GCP. These ecosystems typically provide mechanisms to process large amounts of data, either batch or streaming, that a data processing engine must be able to leverage. For example, if the data resides on a Hadoop cluster, the pipeline can be translated into Apache Spark code. If a data pipeline reads from Amazon S3 and write to Amazon Redshift, the code can be translated into a combination of S3, Redshift, and Spark code. Another example could have a data pipeline fetch data from Microsoft Dynamics CRM, stage it in Azure Data Lake Storage (ADLS), and then load data into Azure SQL Data Warehouse and be translated into a combination of Dynamics CRM, Azure Databricks Spark, ADLS, and SQL Data Warehouse code.
In a multi-cloud ecosystem, there may be a requirement to move from one ecosystem to another. For a smooth transition to the targeted ecosystem, most, if not all, the data pipeline business logic can be reused – the only changes needed are to the source and target in the pipeline.
Once a data pipeline is defined, it must be able to run on any ecosystem, database, or application just by changing the source and the target without requiring any major change to your core business logic.
This is the power of abstraction provided by a data platform.
5 considerations in choosing pushdown optimization
To sum up, a data processing engine needs to look at the following to determine whether to use pushdown capability or not:
- Data pipeline specification – Does the pipeline meet the capabilities of the underlying application, database, or ecosystem to run in pushdown mode?
- Resources – Evaluate the amount of data to be processed to understand the resources required.
- Location – You need to understand the data location and partitioning information to process the data optimally.
- Speed – Evaluate the data pipeline execution times using various available options such as source, full, or ecosystem pushdown.
- Impact – Evaluate the stress on the underlying source and target, especially if they are operational.
In the next blog, we will spend time understanding how serverless compute is relevant to a data processing engine.