Why Advanced Pushdown Optimization Is a Must-Have Capability for Your Cloud Analytics Journey

Jun 24, 2021 |
Mamun Reza

Data Engineering, Product Marketing

Organizations need advanced pushdown optimization in their cloud analytics journey

The widespread adoption of cloud databases and transactional reporting systems across the business landscape has led to exponential growth in the volume and variety of data moving through organizations. As a result, today’s organizations are struggling to integrate more data in less time and at less cost. They’re also trying to decode the right data integration strategy to effectively manage terabytes of data with enough flexibility and adaptability to cope with future growth.

In this post, we’ll cover a technique or methodology that best leverages the performance capabilities of the cloud data warehouses to optimize the use of available resources without incurring undue configuration and management burdens – ELT or pushdown optimization (PDO). We will also dig deeper and talk about Informatica Advanced Pushdown Optimization (APDO).

Evolution of ETL and ELT

In the early days of data warehousing, large companies that ran substantial transactions and had large user bases used extract transform load (ETL) processes to consolidate transactional data across all their systems for reporting and analysis. Fast forward to today: ETL remains at the core of data warehousing, though with vastly expanded functionality, applications, and variants. With the advent of modern cloud-based data warehouses (such as Amazon Redshift, Azure Synapse or Google BigQuery) that offer near-endless storage capabilities and scalable processing power, the traditional concept of ETL is giving way to extract load transform (ELT) – an approach where you run transformations right in the data warehouse. Pushdown optimization is a method that works with that approach. Modern data integration software now includes a pushdown optimization mode that allows users to choose when to use ELT and push the transformation logic down to the database engine with a click of a button.

Why Should Pushdown Optimization Matter to Your Organization?

Before we delve further into pushdown optimization, let us try to understand why you need it. Here are some typical challenges we hear from customer CIOs and CDOs:

  • I can’t predict my cloud costs, let alone manage them.
  • My organization’s default answer for all problems is to add more compute hours.
  • I don’t have visibility into who is using what service and by how much.
  • It’s costing me a fortune just to get my data into and out of my cloud data warehouse and cloud service provider.

But how did your organization land up here?

Chances are multi-cloud is the default state of your organization’s technology landscape today. Organizations use a variety of IaaS, PaaS, and SaaS products to deliver cost-effective and flexible applications required for agile and innovative operations. However, pushing data back and forth between public cloud and on-premises compute resources, or (under certain circumstances) between regions can easily result in cost overruns. Data transfer fees are one of the cloud’s biggest hidden costs, hurting enterprises and small businesses alike. Many enterprises pay millions (or tens of millions) of dollars annually in data transfer fees, and data transfer costs are well documented. Any time data moves in or out of the cloud, there’s a cost associated. It is therefore best to rearchitect and recode a percentage of mappings to avoid these associated data transfer costs.

What’s the reason behind these charges?

The answer isn’t just the solid profit margins; it’s also about making switching providers unappealing. A recent survey by Wakefield Research shows 78% of IT decision makers believe they are not getting the most out of the cloud and all its resources because of vendor lock-in concerns.

The story doesn’t end there. In addition to the ingress and egress charges, if you run a job, you are also paying for the time the job runs on the platform. So, in this new usage-based pricing world, the better performing technology is also the cheaper technology.

To address these challenges, you need a solution that:

  • Lets you pick the right processing for the job, while limiting computer hours and data transfer costs
  • Has cost-control capabilities that offer you increased visibility and management
  • Lets you pay for what you use with flexible consumption-based pricing

Informatica’s end-to-end cloud-native data management solution empowers with all the above capabilities. Now that we’ve covered how pushdown optimization can benefit your organization, let’s look at how it works.

What Is Pushdown Optimization?

Pushdown optimization is a performance tuning technique where the transformation logic is converted into SQL and pushed toward either the source database or target database, or both. Let us try and understand it better with a few examples from Informatica cloud-native data integration for the three different kinds of pushdown optimization techniques.

Look at the below sample mapping for an e-commerce organization where one of the category heads is trying to analyze the buying patterns of customers with Gross Merchandise Value (GMV) greater than $100K. We’ll use this to illustrate each type of pushdown optimization.

Pushdown optimization illustration

When the mapping is triggered with no pushdown optimization enabled, all the records from the source table are read. Afterward, a filter is applied on top of that data and then the filtered data is sorted based on customer ID before loading into the target.

Individual source and target results diagram

So here, all 107 records were read from the source. After transformation logic is applied, 15 records are loaded into the target.

Now, let’s understand how the data is processed when full pushdown optimization method is enabled.

Full pushdown optimization – In this method, Informatica pushes as much transformation logic as possible to the target database. If the entire logic cannot be pushed to the target database, it performs source-side pushdown and rest of the intermediate transformation logic which cannot be pushed to any database is processed by the Informatica Data Integration Service. To take advantage of full pushdown optimization, the source and target databases connections should be same.

When the mapping completes, the source/target results are as below:

Completed mapping with source/target results.

Here, only target database details are mentioned because the entire logic is pushed completely to target database.

At the back end, a direct INSERT statement is applied on the target table selecting the data from source table with all the other transformation logic of filter and sorting applied.

When Should You Use Pushdown Optimization or ELT?

The advent of ELT doesn’t mean it is going to completely replace ETL anytime soon. There is still a need for ETL in organizations, and those who can find an effective way to utilize both realize the distinct capabilities each brings. While ETL is the traditional choice, ELT is more scalable, allows for the preservation of data in the raw state, and provides greater use case flexibility over time. At the end of the day, each enterprise is going to have its own set of needs and processes that it must meet to succeed. The question of whether to use ETL or ELT is not, in fact, so black-and-white. ELT makes a lot of sense when a company needs to handle large amounts of data, while ETL is perfectly acceptable for smaller jobs.

Here are three instances where it makes sense to use ELT:

  1. When speed is critical: Data is ingested much faster with ELT because there’s no need for an ETL server, and transformation and loading can happen simultaneously in the target database.
  2. When more raw information is better: In cases when having refined, transformed data sets are not as important as having a large volume of raw data so it can be mined for hidden patterns, ELT automatically becomes the preferred approach, speeding up the loading process at the cost of delivering raw data.
  3. When using high-end processing engines: Modern cloud data warehouse appliances and databases offer native support for parallel processing. This allows your ELT processes to take advantage of more processing power for greater scalability.

Using Informatica’s Cloud Data Integration to Do More with Less

Informatica’s Cloud Data Integration, built on the Intelligent Data Management Cloud, helps you manage your costs, increase resource productivity, and simplify data integration across multi-cloud and multi-hybrid environments using its Optimization Engine. Informatica’s Optimization Engine is a capability that sends your data processing work to the most cost-effective option, whether that’s cloud ecosystem pushdown, cloud data warehouse pushdown, traditional ETL, or Spark serverless processing. However, it also empowers you to choose the best processing option for job based on your need.

Using Informatica’s Cloud Data Integration choose the best processing option for job based on your need

Here are three ways Informatica’s Optimization Engine processes data in a cloud data warehouse and data lake.

1. Pushdown: In this mode, Informatica sends the work to:

a. A cloud ecosystem like AWS, GCP, and Azure. In this case, you don’t pay for any data transfer charges and you experience performance gains resulting in fewer compute hours. Fewer compute hours in turn results in saving cost.

Cloud ecosystem

b. A data warehouse, like Amazon Redshift, Azure Synapse, Google Big Query, and Snowflake. This method lets you maximize the value of your existing investments in the cloud data warehouse by harnessing their scalable processing power. This eliminates any additional data transfer cost.

A cloud data warehouse lets you maximize the value of your existing investments by harnessing their scalable processing power.

2. Spark Processing – Informatica sends the work to a large number of Spark servers. This method is typically used in cases of large data volumes and complex requirements where the work cannot be pushed down. Based on benchmark test results, we’ve found Informatica Spark cluster to be faster than Spark, while using less hardware. This translates into fewer compute hours.

3. Traditional ETL – This is the traditional technique where the work is done on an Informatica server. For small volumes of data, Informatica can run the same job faster using the Informatica Secure Agent with traditional ETL. This method too results in fewer compute hours.

While data transfer costs can run into millions for some organizations, it’s also important to consider cost overruns due to compute hours – in other words, the cost associated with the number of servers and amount of time the servers are used. Informatica’s Optimization Engine sends the work to the most cost-efficient option, saving both data transfer costs and also compute hours. The true value that Informatica provides is that it empowers you with not just one or two, but all these capabilities and lets you adapt based on your requirements.

Informatica Advanced Pushdown Optimization Versus Traditional Pushdown Optimization

Informatica Advanced Pushdown Optimization (Advanced PDO) is specially designed to support  modern cloud data warehouse and lake patterns, including traditional cloud data warehouse patterns currently in use by many vendors. Advanced PDO supports these traditional patterns as well as emerging cloud data warehouse and data lake patterns involving more transformations, endpoint applications, and use cases. Let us look at some of the key differences between Pushdown Optimization (PDO) and Advanced Pushdown Optimization.

  • PDO’s appeal is primarily performance and resource driven (lower core usage, faster processing) but data processing still happens on customer network. However, in a cloud world, alongside performance and resources, cost is a key factor. And as discussed above, Advanced PDO can significantly reduce cost overruns.
  • PDO is SQL-based while Advanced PDO uses SQL as well as native ecosystem commands
  • PDO is applied to database use cases with the same source and target systems while Advanced PDO is not limited to one endpoint and it can go across different source/target systems and across data warehouses and data lakes.

Informatica’s advanced pushdown optimization capabilities are suited to any organization that wants to bring scalable, cost-efficient, and extremely productive ELT integration to its data architecture. The bottom line is, you absolutely need ELT, but you also need the option to not use it. It helps to have a tool like Informatica Cloud Data Integration that allows you to choose an option that best suits the workload. To learn about data integration using Advanced Pushdown Optimization watch this video.