Leverage the Power of Cloud Data Integration-Elastic With Local Deployment
Co-authored by: Srinivasan Desikan
Running data integration in the cloud has many benefits, including increased operational efficiency, flexibility, scalability and faster time to market. It also comes with its own set of challenges, such as latency, processing, choosing the right architecture and complexity. To help, you can build your data pipelines or bring your own code for processing structured or unstructured data into the cloud. When the workload is difficult to predict, Informatica’s Cloud Data Integration-Elastic (CDI-E) is the perfect solution.
CDI-E reduces the time spent on activities that are not central to the business by freeing administrators from infrastructure provisioning and management. As you onboard new workloads or as data volume increases, CDI-E automatically provisions compute resources as necessary and adapts to the changing demand.
Advanced features of CDI-E include:
- Bring-your-own Scala or Python code in an Informatica data pipeline
- Ability to build self-tuning pipelines using CLAIRE® Tuning
- Incremental file load
- Spark-native processing for complex hierarchical data, with support for machine learning (ML) by consuming ML models within your data pipeline
- Out-of-the-box NoSQL connectors
- Support for 250+ connectors with Advanced Mapping
- Data quality, data profiling and data preview
CDI-E Deployment Options
Informatica’s CDI-E provides support for single cloud, multi-cloud, and hybrid cloud environments. Depending on the level of ownership desired, security requirements and in-house operational expertise available, IT teams can opt for one of the four available deployment options:
- Advanced serverless
- Fully managed cluster
- Customer managed cluster
- Local cluster
Introducing CDI-E: Local Deployment
One of the challenges of implementing an advanced cloud integration solution is the dependency on other groups: the infrastructure team, the information security team, cloud service provider support and so on. A static or local cluster lets you start small and simple while still using all the features and functionality of CDI-E. While the value of a CDI-E fully managed cluster is undisputed, many organizations can benefit from the unique functionality and powerful Spark engine processing of CDI-E for their data lake and data warehouse use cases with local deployment. In fact, they can get started within hours.
How CDI-E Local Deployment Is Different From Fully Managed Cluster
In a CDI-E local cluster, everything runs on the secure agent node. A CDI-E auto-scaling cluster requires a secure agent, a master node and one or more worker nodes. Figure 2 illustrates how it works.
CDI-E local cluster leverages the benefits of both Kubernetes and Apache Spark. Although Spark requires substantial computing resources (CPU, memory, disk I/O), Informatica’s local cluster offers the flexibility to execute complex containerized Spark applications on any machine with a minimum of 4 CPU cores and 16 GB of memory.
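As a quick pre-flight check against the 4-core, 16 GB minimum, something like the following Python sketch could be run on the intended secure agent host. It is not an Informatica tool: the function names `meets_minimum` and `detect_host` are made up for illustration, and the memory probe assumes a Linux-like host where `os.sysconf` exposes the physical page count.

```python
import os

# Minimums quoted in the article for a CDI-E local cluster host.
MIN_CPU_CORES = 4
MIN_MEMORY_GB = 16


def meets_minimum(cpu_cores: int, memory_gb: float) -> bool:
    """Return True when both the core count and the memory meet the minimums."""
    return cpu_cores >= MIN_CPU_CORES and memory_gb >= MIN_MEMORY_GB


def detect_host() -> tuple[int, float]:
    """Best-effort detection of logical cores and physical memory (in GiB).

    Assumption: a Linux-like host. Where os.sysconf or the page-count
    names are unavailable, memory is reported as 0.0 instead of guessed.
    """
    cores = os.cpu_count() or 0
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
        memory_gb = page_size * page_count / 1024 ** 3
    except (AttributeError, ValueError, OSError):
        memory_gb = 0.0
    return cores, memory_gb


if __name__ == "__main__":
    cores, memory_gb = detect_host()
    print(f"cores={cores}, memory={memory_gb:.1f} GiB, "
          f"meets minimum: {meets_minimum(cores, memory_gb)}")
```

Treat the result as a rough guide: a machine sold as 16 GB usually reports slightly less usable memory to the operating system, so a borderline failure on memory deserves a manual look before being taken at face value.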
Below is the machine configuration required for running workloads with a local Kubernetes cluster.
How to Get Started With Local Deployment
Before you start running workloads on a local cluster, be aware of the permissions needed and the format of the staging and logging locations in a cluster configuration. For more information, see the documentation on cloud permissions. Figure 3 shows the general process for getting started:
We ran rigorous experiments to see how performance and speedup change with data volume and the number of CPU cores. We found that performance improved drastically as we processed larger data loads with more CPU cores.
Organizations often need to process data for their cloud data lake use cases while the data volume grows over a very short period and the SLA stays the same. Because a local cluster uses the same host where the secure agent is running, increasing the number of CPU cores on the secure agent host lets the organization process the growing amount of data within the same SLA. With more CPU cores available, the local cluster (K8s) can accommodate a higher number of executors, which increases parallelism. The graph below demonstrates the speedup as the number of CPU cores available to the local cluster increases.
To process more than 10 GB of data, we recommend a machine with at least 8 cores.
Note: Test data: TPC-H Lineitem, up to 500 GB.
The above example (Figure 4) is for simple mappings with no or only light transformations. Let’s now look at mappings that require heavy transformations such as aggregators, joiners, expressions and routers. Let’s say you need to run a quarterly pricing summary report to determine the amount of business that was billed, shipped and returned. The Pricing Summary Report query provides a summary pricing report for all TPC-H Lineitem rows shipped as of a given date. The date is within 60 to 120 days of the greatest ship date contained in the datastore. Figure 5 shows that with a high volume of data (500 GB), processing is up to 4.5 times faster with 16 cores.
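For readers unfamiliar with the Pricing Summary Report, its core logic is a filter on ship date followed by a grouped aggregation. The sketch below is plain Python over a handful of made-up rows, not the CDI-E mapping used in the tests. The `LineItem` class, the `pricing_summary` function and the sample values are illustrative, and the 90-day offset is just one value picked from the query's allowed range.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class LineItem:
    """Hypothetical, trimmed-down Lineitem row (only the columns the report needs)."""
    returnflag: str
    linestatus: str
    quantity: float
    extendedprice: float
    discount: float
    tax: float
    shipdate: date


def pricing_summary(items, cutoff):
    """Aggregate rows shipped on or before `cutoff`, grouped by (returnflag, linestatus)."""
    totals = defaultdict(lambda: {
        "sum_qty": 0.0, "sum_base_price": 0.0,
        "sum_disc_price": 0.0, "sum_charge": 0.0, "count_order": 0,
    })
    for item in items:
        if item.shipdate > cutoff:
            continue  # filter step: keep only rows shipped as of the cutoff date
        disc_price = item.extendedprice * (1 - item.discount)
        group = totals[(item.returnflag, item.linestatus)]
        group["sum_qty"] += item.quantity
        group["sum_base_price"] += item.extendedprice
        group["sum_disc_price"] += disc_price
        group["sum_charge"] += disc_price * (1 + item.tax)
        group["count_order"] += 1
    # Add the average quantity and return the groups in sorted key order.
    summary = {}
    for key in sorted(totals):
        group = dict(totals[key])
        group["avg_qty"] = group["sum_qty"] / group["count_order"]
        summary[key] = group
    return summary


# Made-up sample rows, for illustration only.
rows = [
    LineItem("A", "F", 10, 1000.0, 0.5, 0.0, date(1998, 1, 1)),
    LineItem("A", "F", 20, 2000.0, 0.0, 0.0, date(1998, 2, 1)),
    LineItem("N", "O", 5, 500.0, 0.0, 0.0, date(1998, 3, 1)),
    LineItem("N", "O", 7, 700.0, 0.0, 0.0, date(1998, 12, 1)),  # too recent, filtered out
]
greatest_shipdate = max(row.shipdate for row in rows)
cutoff = greatest_shipdate - timedelta(days=90)
summary = pricing_summary(rows, cutoff)
for key, group in summary.items():
    print(key, group)
```

In a mapping, the filter, the expression that computes the discounted price and the aggregator would each be a separate transformation, which is why this query is a reasonable stand-in for a heavier workload than a straight copy.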
Note: Linear speedup is observed with 8 and 16 cores when compared with 4 cores.
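To make "speedup" and "linear" concrete: speedup is conventionally the baseline run time divided by the run time on the larger machine, and scaling is linear when that ratio keeps pace with the ratio of core counts. The Python sketch below illustrates the arithmetic, assuming the 4-core run is the baseline. The run times are invented so that the 16-core case works out to 4.5 times; they are not measurements from the tests.

```python
def speedup(baseline_seconds: float, seconds: float) -> float:
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / seconds


def parallel_efficiency(speedup_value: float, baseline_cores: int, cores: int) -> float:
    """Speedup divided by the increase in cores; 1.0 means exactly linear scaling."""
    return speedup_value / (cores / baseline_cores)


# Hypothetical run times in seconds, keyed by core count (illustrative only).
run_times = {4: 3600.0, 8: 1800.0, 16: 800.0}
baseline_cores = 4

for cores, seconds in run_times.items():
    s = speedup(run_times[baseline_cores], seconds)
    e = parallel_efficiency(s, baseline_cores, cores)
    print(f"{cores} cores: speedup {s:.2f}x, efficiency {e:.3f}")
```

With these numbers, 16 cores give a 4.5 times speedup on 4 times the cores, an efficiency just above 1.0, which is what a chart would show as linear or slightly better than linear scaling.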
With local clusters, you do not compromise on concurrency. If different users need to run queries concurrently, the system spins up several executors to handle the requests. Such workloads are also supported on CDI-E local cluster.
While users are designing the mappings for different queries, they need to preview the data to ensure accuracy. One of the best features for reviewing code at design time and avoiding failures in production is Midstream Data Preview. CDI-E local cluster supports preview without a drop in performance.
In the TPCH-Q1 mapping above (Figure 8), the user submitted a data preview on the Aggregator transformation, which produced the sample output illustrated below:
The chart below shows the time taken by Data Preview jobs submitted against TPCH-Q1 (Pricing Summary Report), TPCH-Q8 (Market Share Query) and TPCH-Q10 (Returned Item Reporting Query).
TPCH-Q8 and TPCH-Q10 are very complex queries with 8 to 15 transformations in the pipeline.
Note: Starting a CDI-E local cluster for the first time takes between 3 and 4 minutes. The local cluster times out after 5 minutes of idle time. Subsequent restarts of the cluster take less than a minute.
*TPCH Data: We used industry-standard TPC-H Lineitem data for our tests.
Next Steps
It’s clear the CDI-E local deployment option offers simplicity in onboarding and provides rich and unique functionalities for data lake and data warehouse use cases, without sacrificing performance. Ready to try it out? Learn how to set up a local cluster or sign up for a 30-day trial today.