If you have been using a data warehouse in your organization or are planning to build one, chances are you have already heard about ETL and ELT. But, let’s quickly recap to set the stage for the rest of this discussion.
ETL and ELT refer to whether you transform (“T”) the data before loading (“L”) it into a data warehouse or after loading it. The “E” refers to extracting the data from its original source – traditionally transactional databases like those used by ERP or CRM applications, but nowadays these sources also include social, web, and machine data. In general, the “E” refers to extracting the data created by your organization’s main business functions.
Traditionally, when most organizations had on-premises data warehouses, the difference between ETL and ELT was mainly about which computing power you use to transform the data.
Note that the transformation of data as a conceptual step is vital and inevitable regardless of whether you choose ETL or ELT. This is because the data you extract is in a state either: 1) optimized for transaction updates or 2) the result of how the object-oriented representations of various user actions in modern applications are mapped to the underlying data structures. The point is: The extracted data is not suited for enterprise analytics, as that’s not the primary use case of the application that created this data.
In some design approaches, this step can be called something else. For example, in data lake analyses, this step may be called data preparation. While data preparation is a separate category of tools that allows data scientists and engineers to define such transformation rules visually/interactively and then apply them over large data, at its core, the resulting action is data transformation. In data science/ML projects, data scientists define the rules and data engineering teams operationalize them. Even here, the terms are different, but it’s still about making data ready for analysis.
The ELT approach assumes you are using some extraction/ingestion steps to extract raw data from its source into a landing layer. This step is common between both the approaches.
The difference between ETL and ELT is in what happens next – and it becomes starker as the data warehouse moves to the cloud. As mentioned above, traditionally, it was mainly about using a bigger server in the same data center, or, where such a computation was suitable for SQL, using more efficient SQL processing on the data warehouse application. But, with the data warehouse in the cloud, multiple factors are at play:
You may wonder why you’d use traditional ETL in a cloud data warehouse at all. Why not always use ELT? The next section goes a little deeper into this topic.
In an enterprise data warehouse implementation, ELT is necessary, but not sufficient. The present emphasis on ELT by some companies propagates a myth that ELT is all you need for your cloud data warehouse.
A typical enterprise cloud data warehouse implementation includes very complex sets of data flows and transformations. There are various reasons why you should look beyond ELT before finalizing your implementation and tools:
With ELT-only vendors, the only way to implement any data flow logic is by invoking the commands offered by the underlying application. If your requirements cannot be represented by the underlying SQL, you end up having to write your own code.
Such purpose-built logic requires significant rework when the cloud data warehouse needs to be changed or replaced. The cloud data warehouse world is still undergoing significant changes. And it is common for organizations to switch cloud vendors. While you may or may not switch the entire cloud deployment, it is also common to implement different pieces of analytics using different vendors or technologies. We have also seen customers who used data warehouse specific functions having to do daunting rework because the cloud data warehouse vendor they switched to did not support those functions.
Then, there is the unpredictability of subscription use. Here’s a common scenario: You open up subscription to cloud services and suddenly usage explodes. Then one day you get a bill way beyond what you had imagined because of runaway use of compute resources. Sound familiar? Most users don’t fully understand how the pricing works, and, more importantly, it is difficult to enforce usage discipline for many users.
With an ELT-only solution, your entire enterprise data warehouse is fully hardwired to only use the endpoint application’s commands. If the vendor pricing structure changes or charges increase, you have no option but to absorb the cost.
Use cases evolve. What may work well via ELT today may need a different approach tomorrow if the requirements change. In such cases, it always helps to have the flexibility to apply a different execution approach to your data flow.
In summary, for enterprise cloud data lake and warehouse use cases, you need a platform like Informatica Intelligent Cloud Services (IICS) that provides both:
Next, we’ll look at how Informatica Advanced Pushdown Optimization helps you implement ELT patterns, while at the same time future-proofing your work as cloud data warehouse technology and approaches evolve.
The cloud data warehouse world is still evolving with new vendors bringing in new approaches. A platform that lets you develop your data flows at a logical level while offering you a choice among multiple runtime options, including ELT, future-proofs your investments in data. You’ll get a lot of flexibility as use cases evolve, cloud vendors’ pricing structures change, and you explore newer ways to support modern analytics.
At present, there are multiple recommended data load patterns prevalent among cloud data warehouse vendors. In addition, new technologies are still emerging to make analytics better. Consider the following examples:
We expect the variations described above to expand soon and ways of using data warehouse semantics to evolve. This is why Informatica’s Advanced Pushdown Optimization (PDO) becomes critical instead of proprietary ELT.
PDO offers all the benefits of ELT, but it also keeps your data flow definitions at a logical/abstract level. This helps you avoid a lock-in to any specific implementation. For example, if you switch the target of an Informatica “mapping” (data flow definition) from your existing cloud data warehouse endpoint to a new one, most of your data flow definition remains the same. When you run it the next time, it gets optimized for the newly configured endpoint.
This flexibility means you can push down data flow logic where suitable and use different execution options in other cases. It also helps you avoid getting locked in into any proprietary technology or implementation paradigm.
Before we wrap up, let’s look at a few myths around ELT that we need to dispel.
ELT does not eliminate transformations. You do not want to get raw data into the hands of the enterprise consumers. It still needs all the cleansing, consolidations, code lookups and general calculations as before. However, ELT allows you to do the transformations using the compute resources of the cloud data warehouse – without having to take data out of it.
ELT is certainly better for doing typical data warehousing processing using the cloud application’s compute resources, but it does not eliminate that processing.
A cloud data warehouse application provides elasticity by allowing you to expand or shrink compute or storage as needed – whether using ELT, ETL or a different process. Even here, both ETL and ELT can benefit from such features.
This is more about having an on-prem/installed ETL tool versus a cloud tool. It’s not the ELT that reduces maintenance, it’s the way a tool is provisioned. A cloud-based integration tool reduces maintenance regardless of whether it uses ETL or ELT. On the other hand, a poorly designed or hand-coded ELT requires a lot of maintenance.
This is a bit nuanced. The “E” and “L” part of ELT is good for loading data into data lakes.
But, for this data to be analyzed across the enterprise by different departments and roles, further transformation or data curation is required. And, it is too costly and error-prone to leave it to individual users to do.
In summary, you absolutely need ELT, but you also need the option to not use it. It helps to have a tool like IICS that not only lets you work on abstract definitions of your data flow, but also allows you to choose a runtime option that best suits the workload.
Using Informatica, you can choose from any of the runtime options and services to take a “horses for courses” approach. Moreover, you can also govern such data by using business glossary, data catalog and data lineage features that span across your entire data flow – from your source of data to the data analytics. You can truly have an enterprise cloud data management implementation where you define your business terms and analytics dashboards, implement the data lake and data warehouse needed for those using a vendor-agnostic definition, and govern it all using data catalog and lineage.