Improve Cloud Data Integration Performance and Reduce Costs with Advanced Pushdown Optimization

Jan 19, 2022 |
Arkapravo Chakraborty

Principal Product Manager

In your cloud-based analytics journey, you’re likely using ETL (extract, transform, load) or ELT (extract, load, transform) to move data from your transaction-based business systems (such as ERP or CRM) to a cloud-native data lake and/or data warehouse. If so, I’m sure you’re also looking for ways to save time, improve performance, and reduce costs.

Whether you use ETL or ELT for your data pipeline, you need to transform your data at some point to make sure even the transactional data is analytics ready. The key difference with ETL is that extraction and transformation happen before loading the data into the data warehouse. With ELT, data is loaded into a data lake or data warehouse staging area first, and then some transformation logic is used to push the transformed data into the data warehouse.

Transformation of data is a vital step, regardless of whether you choose ETL or ELT. The data you extract is in a state that is either optimized for transaction updates or designed for object-oriented representation of entities, which is often not useful for analytics consumption. That said, chances are that you have been using some (oftentimes very compute-intensive) transformation logic to remodel your transactional data in a state that data scientists, ML engineers, and data analysts can use.

This is where advanced pushdown optimization can help.

What Is Advanced Pushdown Optimization?

Advanced pushdown optimization (APDO) is a cost and performance optimization technique where, instead of running the transformation logic inside the third-party ETL engine, the logic is translated into SQL or cloud ecosystem-specific commands and pushed down to the cloud data warehouse. This eliminates the time-consuming tasks of pulling it out of the cloud infrastructure, transforming it, and then moving it back in the cloud again. And the reduced overall processing time drops the data egress cost, resulting in a significant cost-benefit for the customer.

Not to mention, cloud is here to stay. In a recent Accenture survey, 90% of business and IT executives agree that to be agile and resilient, their organizations need to fast forward their digital transformation with cloud at its core1. We believe that using APDO can help you succeed in your cloud-based data management initiatives.
 

Figure 1: APDO is applied when transporting data from the data lake to the data warehouse or within the data warehouse Figure 1: APDO is applied when transporting data from the data lake to the data warehouse or within the data warehouse

 

So, how does APDO work?

Let’s consider the following use case, where data is being transported from Azure Data Lake Storage Gen2 (ADLS Gen2), the source, to Snowflake, the target. One way to approach this is to extract and pull the data from ADLS Gen2 to a third-party ETL engine, perform all transformations, and then move (again) the data back to the data warehouse. This requires a two-way movement of data from and to your cloud infrastructure, resulting in significant network and data I/O costs.
 

Figure 2: APDO during data transportation from ADLS to Snowflake Figure 2: APDO during data transportation from ADLS to Snowflake

 

With APDO, all the underlying transformations get translated into ecosystem-specific commands and SQL statements, and then get executed in the underlying cloud infrastructure itself. Here, you do not have to move the data out of the cloud infrastructure to transform the data.

Below are the sample SQL statements and mapping that got triggered behind the scenes and executed within the cloud platform itself:
 

Figure 3: Generated SQL statements for APDO to be executed on the underlying cloud infrastructure Figure 3: Generated SQL statements for APDO to be executed on the underlying cloud infrastructure

 

For a deeper dive on how APDO helps lower costs and speed up data processing, check out this short video.

Key Benefits of Informatica’s Advanced Pushdown Optimization

While cost reduction due to zero data egress remains one of the major benefits of using APDO, there are other advantages as well:

  • No vendor lock-in: Unlike other prominent ELT vendors, Informatica’s APDO capability is not hard-wired for any specific cloud data warehouse vendor. In fact, we support all major cloud platforms and key data warehouses, including Amazon Redshift, Microsoft Azure Synapse Analytics, Snowflake, Databricks, and Google BigQuery. This helps you remain loosely coupled with the underlying infrastructure, so if you ever want to switch from your cloud and/or data warehouse providers throughout your data journey, you can do so seamlessly.
  • Enhanced performance: Since there is no data transportation in and out of your cloud infrastructure and processing is done using elastic cloud compute, performance will improve significantly. In our internal testing, we found APDO to be up to 50X faster than traditional ETL for some use cases.
  • Multiple connectors: As stated, Informatica’s APDO supports all major cloud platforms, data lakes, and data warehouses. This is a big deal because it ensures extensive connectivity support.
  • No-code enablement: Enabling APDO within Informatica Cloud Data Integration is simple and does not require complex commands. All you have to do is select "advanced pushdown” as the runtime choice from a simple drop-down option in the GUI.
  • Transformation support: Informatica supports all the commonly used transformations in pushdown form (e.g., filters, sequence, custom query, and upserts) and some advanced transformation rules like router, union, and sorter. We also support patterns, such as slowly changing dimension (SCD), which contains relatively static data that can change slowly but unpredictably, rather than according to a regular schedule.2

Key Use Cases for APDO

There are two major use cases for APDO. Both allow you to maximize the value of your existing investment in a cloud data warehouse by harnessing its scalable processing power:

1. Ecosystem pushdown: This covers the data movement and transformation between a data lake and a data warehouse. In Figure 4, Informatica reads the data from data lakes such as Amazon S3 or ADLS and uses cloud ecosystem commands to move data into the cloud data warehouse.
 

Figure 4: Informatica’s ecosystem APDO showing copy commands to copy data from the data lake (Amazon S3) to the data warehouse (Snowflake) Figure 4: Informatica’s ecosystem APDO showing copy commands to copy data from the data lake (Amazon S3) to the data warehouse (Snowflake)

 

2. Cloud data warehouse pushdown: In this use case, data movement happens within a cloud data warehouse such as Amazon Redshift, Azure Synapse, Google BigQuery, or Snowflake. As shown in Figure 5, the entire transformation logic gets translated into SQL statements and executed on the cloud data warehouse.

Figure 5: Informatica’s cloud data warehouse APDO showing SQL statements to transfer data from staging area to the warehouse Figure 5: Informatica’s cloud data warehouse APDO showing SQL statements to transfer data from staging area to the warehouse

 

See Advanced Pushdown Optimization in Action

If you are currently using Informatica Cloud Data Integration or want to bring scalable, cost-efficient, productive integration to your data architecture, APDO is essential for meeting your cloud data management needs.

Ready to dive in? Get started with a 30-day trial.

 

1Accenture Technology Trends 2021

2 https://en.wikipedia.org/wiki/Slowly_changing_dimension