Boost Productivity and Optimize Performance of Data Integration Tasks with CLAIRE® Tuning

Apr 06, 2022 | Vinay Bachappanavar, Senior Product Manager

The Benefits of Performance Tuning of ETL Processes

Because the runtime environment can be unpredictable, a data integration job that performs well at design time can behave differently once it runs in production. Workloads fluctuate, requirements change, data experts may leave your organization, and allocated resources might be reassigned elsewhere. Optimizing data integration tasks is a continuous effort, and a full-time job, unless you leverage artificial intelligence (AI) and machine learning (ML) techniques to implement best practices automatically.

With the right data platform, you can automate the process of adjusting to workload patterns, recommending the right mix of resources, and pre-setting the parameters. Auto-tuning capabilities make your systems and teams more efficient by improving the performance of the data processing engine, in both cost and time, bit by bit.

What Is Performance Tuning?

Performance tuning refers to the iterative process of assessing the problem, identifying the bottleneck, modifying resources to remove the bottleneck, and measuring the outcome. In extract, transform, load (ETL) processes, performance tuning plays a key role in achieving objectives such as meeting service level agreements (SLAs), scaling linearly, and consuming resources optimally. Performance tuning is increasingly perceived as the sole responsibility of ETL developers or data engineers.

While performance tuning has its upsides, doing it manually has downsides: it does not scale, it can be frustrating, and it can itself become a huge bottleneck. Manual tuning has other challenges as well. For example, it:

  • Takes considerable time
  • Requires deep expertise of the underlying technologies
  • Requires constant reassessment of performance due to frequently changing workloads
  • Involves complex interdependence of configuration properties
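To make the last point concrete, here is a minimal sketch of how two Spark executor settings interact. It assumes a deliberately simplified packing model (an executor fits on a node only if both its cores and its memory fit, with no memory overhead) and a hypothetical 16-core, 64 GB worker node; the function name and the numbers are illustrative, not taken from any product.

```python
def executors_per_node(node_cores, node_memory_gb, executor_cores, executor_memory_gb):
    """Simplified model: each executor needs both its cores and its memory."""
    by_cores = node_cores // executor_cores
    by_memory = node_memory_gb // executor_memory_gb
    return min(by_cores, by_memory)

# Hypothetical worker node: 16 cores, 64 GB of memory.
# spark.executor.cores = 2, spark.executor.memory = 8g
small = executors_per_node(16, 64, executor_cores=2, executor_memory_gb=8)
# Doubling only spark.executor.cores halves the executors that fit,
# which leaves half of the node's memory unused unless memory is raised too.
large = executors_per_node(16, 64, executor_cores=4, executor_memory_gb=8)

print(small, large)
```

Changing one setting in isolation shifts the balance of the other, which is why these properties usually have to be tuned together.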

How CLAIRE® Tuning Boosts Productivity and Performance in Data Integration Tasks

Let’s start with an example—processing big data in Spark. While the most efficient engine out there, Spark can also be the most complex for performance tuning with over 20 runtime properties that can affect processing speeds. And to make things even more challenging, tweaking one property can affect the other properties. Until now.

To solve this problem, we developed CLAIRE Tuning in Informatica Cloud Data Integration-Elastic (CDI-E). CLAIRE Tuning helps your data integration jobs automatically optimize themselves. This innovative service performs every step of tuning: recording the mapping task runtime, changing resources via Spark runtime properties, and measuring performance until the task is optimized. How? CLAIRE Tuning uses hill-climbing optimization techniques to iteratively improve performance. In fact, over 15 different Spark runtime values are tuned iteratively until a minimum task runtime is reached.

Figure 1: Example of Spark property values and performance improvement.
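For readers unfamiliar with hill climbing, the idea can be sketched in a few lines of Python. This is a generic illustration of the technique, not CLAIRE Tuning's implementation: the two Spark properties, the step sizes, and the simulated_runtime function (a toy stand-in for actually running and timing a mapping task) are all assumptions made for the example.

```python
def simulated_runtime(config):
    """Hypothetical stand-in for running a mapping task and timing it (seconds)."""
    cores = config["spark.executor.cores"]
    partitions = config["spark.sql.shuffle.partitions"]
    return 60 + 5 * (cores - 4) ** 2 + 0.001 * (partitions - 200) ** 2

# Assumed step size for each tuned property.
STEPS = {"spark.executor.cores": 1, "spark.sql.shuffle.partitions": 50}

def neighbors(config):
    """Yield configs that differ from `config` by one step in one property."""
    for key, step in STEPS.items():
        for delta in (-step, step):
            value = config[key] + delta
            if value >= 1:
                yield {**config, key: value}

def hill_climb(start, measure, max_iterations=100):
    """Move to the fastest neighboring config until no neighbor is faster."""
    current, current_runtime = start, measure(start)
    for _ in range(max_iterations):
        timed = [(measure(candidate), candidate) for candidate in neighbors(current)]
        best_runtime, best = min(timed, key=lambda pair: pair[0])
        if best_runtime >= current_runtime:
            break  # no neighbor improves on the current config: a local minimum
        current, current_runtime = best, best_runtime
    return current, current_runtime

start = {"spark.executor.cores": 1, "spark.sql.shuffle.partitions": 500}
best_config, best_runtime = hill_climb(start, simulated_runtime)
print(best_config, best_runtime)
```

Each iteration changes one property by one step, keeps the change only if the measured runtime drops, and stops when no single change helps. On real workloads the measurements are noisy and the minimum found may be local rather than global.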

 

And when CLAIRE Tuning is combined with an auto-scaling serverless cluster, you do not have to configure cluster nodes, Spark executors, or their memory and cores for different workloads. CLAIRE Tuning works not only with traditional structured data but also with semi-structured data (JSON, XML, Parquet, etc.). We understand some users may want to retain the ability to tune manually. So, we present the Spark property values and performance improvement as shown in Figure 1. The user can accept the configuration or simply edit it to override the recommended values for the mapping task as shown in Figure 2.

Figure 2: Example of tuning recommendation for mapping task.

 

There are two types of CLAIRE Tuning, each suited to a different deployment environment:

  • Active tuning
  • Passive tuning

In active tuning, the developer creates an auto-tuning request in the mapping task canvas for a specific mapping. The CLAIRE service executes the mapping task a set number of times for sampling, varying attributes such as runtime configurations and data size, and recording the runtimes and other performance attributes. The user can configure the number of sample runs as shown in Figure 3. Our ML model uses this data to recommend runtime properties for the best results. The user can simply apply the tuning properties for them to take effect on that specific task. Active tuning can also help eliminate job failures caused by resource contention by optimally combining and tweaking multiple parameters. Active tuning is best suited to mapping tasks that can be run iteratively for this purpose.

Figure 3: Example of configuring the sample number of runs.
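The sampling loop behind active tuning can be pictured roughly as follows. This is a sketch under stated assumptions, not the product's logic: the property names (executor_cores, executor_memory_gb) and the simulated_task function are hypothetical, and picking the fastest sampled run is a simple stand-in for the ML model's recommendation.

```python
import itertools
import random

def simulated_task(config):
    """Hypothetical stand-in for one timed run of a mapping task (seconds)."""
    return 300 / config["executor_cores"] + 5 * abs(config["executor_memory_gb"] - 8)

def active_tuning(candidate_values, run_task, sample_runs, seed=0):
    """Run the task `sample_runs` times with varied configs and keep the fastest."""
    keys = list(candidate_values)
    grid = [dict(zip(keys, values))
            for values in itertools.product(*candidate_values.values())]
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    chosen = rng.sample(grid, min(sample_runs, len(grid)))
    samples = [(run_task(config), config) for config in chosen]
    best_runtime, best_config = min(samples, key=lambda sample: sample[0])
    return best_config, best_runtime, samples

candidates = {"executor_cores": [1, 2, 4, 8], "executor_memory_gb": [4, 8, 16]}
best_config, best_runtime, samples = active_tuning(
    candidates, simulated_task, sample_runs=5
)
print(best_config, best_runtime)
```

A larger number of sample runs covers more of the configuration space at the cost of more executions, which is the trade-off the sample-run setting exposes.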

 

In passive tuning, the user enables continuous tuning in the CLAIRE Tuning wizard for scheduled mapping tasks as shown in Figure 4. Every time the mapping runs after that, performance attributes are recorded and fed into the ML model to determine optimized settings that are applied to the mapping tasks. Passive tuning is better suited to production jobs where user intervention is restricted.

Figure 4: Example of continuous tuning in the CLAIRE Tuning wizard.
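Conceptually, passive tuning is a feedback loop: record each scheduled run, then update the recommendation. The sketch below illustrates that loop with a hypothetical PassiveTuner class that ranks configurations by their average observed runtime. The averaging rule is an assumption standing in for the ML model, and the class only ranks configurations that have actually been run.

```python
from collections import defaultdict
from statistics import mean

class PassiveTuner:
    """Hypothetical recorder that learns from scheduled runs as they happen."""

    def __init__(self):
        self.history = defaultdict(list)  # config (as a sorted tuple) -> runtimes

    def record_run(self, config, runtime_seconds):
        """Store the observed runtime for the config used by this run."""
        self.history[tuple(sorted(config.items()))].append(runtime_seconds)

    def recommend(self):
        """Return the config with the lowest mean runtime, or None if no runs yet."""
        if not self.history:
            return None
        best = min(self.history, key=lambda key: mean(self.history[key]))
        return dict(best)

tuner = PassiveTuner()
observed_runs = [
    ({"executor_cores": 2}, 100.0),
    ({"executor_cores": 4}, 80.0),
    ({"executor_cores": 2}, 110.0),
    ({"executor_cores": 4}, 95.0),
]
for config, runtime in observed_runs:
    tuner.record_run(config, runtime)

recommended = tuner.recommend()
print(recommended)
```

Because the history grows with every scheduled run, the recommendation can shift as workloads change, without anyone launching dedicated tuning runs.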

 

You can also combine both techniques. For example, use active tuning to bring the performance to an acceptable level and then use passive tuning to improve upon that.

Performance optimization in data integration pipelines is incredibly important and complex. Most organizations rely on expert developers and performance engineers, a potential bottleneck that can cost both time and money. Informatica CLAIRE Tuning makes use of AI/ML models so your developers can build self-tuning tasks that boost productivity and optimize performance for your data integration tasks.

Next Steps

Optimizations on top of open-source Spark and Kubernetes ensure that CDI-E automatically manages your infrastructure and uses resources efficiently. If you have data on the cloud and want to perform data integration tasks faster and more efficiently, learn more about CDI-E or sign up for the 30-day trial today.