This blog is co-authored by Mohammed Morshed, Staff Performance Engineer, Informatica.
Data Integration in the Cloud: An Overview
Modernizing data integration infrastructure is a complex and continuous process. Traditional databases and on-premises storage can’t keep up with the amount of data generated every second. The result? Many large enterprises struggle with data silos that limit insight, scalability and security.
To address this challenge, most organizations have either already moved to the cloud or plan to do so soon. According to Flexera’s State of the Cloud Report, 90% of business and IT executives agree that to be agile and resilient, their organizations need to fast forward their digital transformation with cloud at its core. Public cloud adoption continues to accelerate, driving up cloud spend for organizations of all sizes.1
Moving to the cloud makes it easier for organizations to improve data performance. It also strengthens data governance, enables actionable analytics and makes it easier to scale. According to the IDC Global Chief Data Officer (CDO) Engagement Survey 2021, organizations with a high level of data maturity can generate 250% more value from their data.2
Cloud data integration is a key component of any modern data analytics strategy and an anchor for your digital transformation. With cloud data integration, organizations can bring together different data sources and derive greater business value. With unified cloud data lakes or cloud data warehouses, data is accessible to all relevant users and applications. But despite moving to the cloud, enterprises may face unexpected cloud bills. How can an organization advance its cloud modernization journey and still reduce cloud costs?
Cloud Cost Overruns: A Key Cloud Data Integration Challenge
Companies see the benefits of cloud integration in their data centers, but reducing cloud costs is a challenge. In fact, 80% of organizations overspend their cloud budgets due to ineffective cloud cost management.3 The main culprit? Inefficient infrastructure usage.
For example, poorly designed data landing zones lead to long data processing times, and failing to choose the right network topology leads to unnecessary ingress/egress charges. Left unmanaged, these issues can derail your cloud data integration.
How Cloud Data Integration-Elastic (CDI-E) Improves Cloud Cost Savings
Informatica Cloud Data Integration-Elastic (CDI-E) addresses these cloud cost challenges. CDI-E provisioning helps you manage your valuable compute resources. It supports cloud cost optimization across your entire cloud environment.
With CDI-E, you can work on any Informatica-optimized Spark engine on Kubernetes. You can also process any volume of data with any concurrency.
CDI-E features AI-powered auto tuning and auto scaling. This lets you scale data clusters and tune for optimal performance in real time. Auto tuning compensates for jobs that are less well designed, while auto scaling handles jobs that demand more resources, especially as data volumes grow and service level agreements shift.
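Informatica doesn’t publish the internals of its auto scaler, but the general idea behind demand-based scaling can be sketched in a few lines of Python. This is an illustrative model only; the `desired_nodes` function, the tasks-per-node ratio and the 1–20 node bounds are assumptions for the example, not CDI-E’s actual logic:

```python
def desired_nodes(pending_tasks: int, tasks_per_node: int,
                  min_nodes: int = 1, max_nodes: int = 20) -> int:
    """Illustrative demand-based scaler: size the cluster to the backlog.

    pending_tasks / tasks_per_node approximates how many worker nodes
    the queued work needs; the result is clamped to the configured bounds.
    """
    needed = -(-pending_tasks // tasks_per_node)  # ceiling division
    return max(min_nodes, min(needed, max_nodes))

# Heavy backlog: scale up to the 20-node maximum.
print(desired_nodes(pending_tasks=140, tasks_per_node=7))  # 20
# Queue nearly drained: scale back down to 2 nodes.
print(desired_nodes(pending_tasks=10, tasks_per_node=7))   # 2
```

A real scheduler also has to account for scale-up latency and node warm-up, which is exactly where the performance differences discussed below come from.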
The Relationship Between Cloud Environment and Cloud Costs
CDI-E supports two types of clusters: static and auto scaling. Both cluster types work in different ways to help you reduce cloud costs and manage cost overruns.
Static clusters have a fixed cost that depends on how long the cluster runs. Static clusters can be useful when cost overruns occur due to:
- Slower execution: infrastructure cost is proportional to usage time
- Underutilized provisioned resources: unused capacity wastes budget
- Delayed use of provisioned resources: you pay for the delay period
Static clusters are helpful for improving performance and reducing cost overruns. Static clusters help when batch jobs have a fixed resource demand. An example is when the same set of jobs run from time to time.
Auto scaling clusters help optimize resource usage and costs. They scale a cluster up and down to match resource demand. Auto scaling is helpful when a service in production sees increased load at specific times. An auto scaling cluster can be useful when cost overruns occur due to:
- Slower execution: infrastructure cost accrues for as long as jobs run
- Delayed scale-up: jobs wait for capacity and run slower, compounding the cost above
- Delayed scale-down: you pay for capacity that is no longer needed
With auto scaling, cost varies with the demand from workloads submitted to the cluster. When fewer nodes are needed, auto scaling clusters offer significant cost benefits. When workloads consistently demand enough capacity to keep an auto scaling cluster scaled up, a static cluster performs better.
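The trade-off between the two cluster types can be made concrete with a small back-of-the-envelope model. The $0.50 per node-hour rate and the workload shape below are hypothetical, chosen purely to illustrate why a bursty workload favors auto scaling:

```python
NODE_HOUR_RATE = 0.50  # hypothetical $/node-hour

def static_cost(nodes: int, hours: float) -> float:
    """A static cluster bills its full, fixed size for the whole run time."""
    return nodes * hours * NODE_HOUR_RATE

def autoscaling_cost(intervals) -> float:
    """An auto scaling cluster bills only the nodes active in each interval.

    intervals: list of (nodes, hours) pairs describing the scaling profile.
    """
    return sum(nodes * hours * NODE_HOUR_RATE for nodes, hours in intervals)

# Bursty workload: full 20-node demand for 1 hour, then 4 nodes for 7 hours.
print(static_cost(20, 8))                   # 80.0: pay for 20 nodes all day
print(autoscaling_cost([(20, 1), (4, 7)]))  # 24.0: pay only for what runs
```

If the workload instead needed all 20 nodes for the full 8 hours, both models would bill the same node-hours, and the static cluster would win on performance by avoiding scale-up delays.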
When it comes to cloud modernization, CDI-E with static or auto scaling helps you save money and compute resources.
CDI-E Case Study: Reduce Cloud Costs and Improve Performance
Learn how CDI-E is improving cloud data performance while reducing cloud costs at a large pharmaceutical company:
Static Cluster Use Case: For provisioning resources
The company needed to process about one terabyte of Apache Parquet data. In this case, they used a 20-node static cluster that can run 140 Spark executors. Open-source Spark took over 2.5 minutes to create the executors and make full use of the cluster’s provisioned resources. When the team switched to Informatica CDI-E, the full cluster was in use in about one minute, roughly 2.5 times faster than open-source Spark (Figure 1). The result: improved performance and reduced cloud costs.
Figure 1: Spark executors on a 20-node static cluster.
Auto Scaling Cluster Use Case: For scaling up and down
The company needed to process a large amount of data, but not all of their processes or mappings demand the cluster’s entire capacity. In fact, for processing one terabyte of data, many mappings needed less than 50% of the provisioned resources. As a result, our team recommended an auto scaling cluster with 1-20 nodes instead of a static 20-node cluster. Informatica CDI-E with auto scaling brought up all 20 nodes in about 14 minutes, while open-source Spark took almost 38 minutes to reach only 15 nodes. In other words, CDI-E scaled roughly three times faster and ultimately reached 100% of the configured capacity, compared with only 75% for open-source Spark, which never used the cluster’s configured maximum (Figure 2).
Figure 2: Number of worker nodes on auto scaling cluster.
Efficient auto scaling doesn’t just mean scaling up. When resources aren’t required, it’s just as important to scale down. In this use case, three mappings required only about 40% of the cluster’s maximum capacity when they ran after the cluster had scaled up to its maximum. CDI-E scaled down to 50% of capacity, while open-source Spark didn’t scale down at all (Figure 3). This shows that the CDI-E scheduler is cost-effective, too.
Figure 3: Number of worker nodes on auto scaling cluster.
These simple examples show the performance and cloud cost benefits of CDI-E. But complex use cases show off CDI-E’s impact on cloud cost savings even more. When faced with two complex business processes for analyzing monthly and quarterly store sales, the company’s job execution in CDI-E was up to 3x faster than open-source Spark (Figure 4).
These innovations and optimizations enable CDI-E to make better use of cloud infrastructure, resulting in optimal resource usage, better performance and lower cloud bills. Costs can be reduced further by using spot instances as worker nodes instead of on-demand instances. Here’s additional information on how cloud data integration can help reduce cloud costs.
Figure 4: Complex business cases analysis.
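To make the spot-instance savings concrete, here is a rough sketch of the arithmetic. The on-demand rate and the 70% spot discount are hypothetical figures for illustration; actual spot pricing varies by cloud provider, region and instance type, and spot capacity can be reclaimed at any time, so spot nodes suit interruption-tolerant worker roles:

```python
ON_DEMAND_RATE = 0.50  # hypothetical $/node-hour
SPOT_DISCOUNT = 0.70   # hypothetical 70% discount vs. on-demand pricing

def worker_node_cost(nodes: int, hours: float, use_spot: bool) -> float:
    """Cost of a pool of worker nodes at on-demand or spot pricing,
    rounded to whole cents."""
    rate = ON_DEMAND_RATE * (1 - SPOT_DISCOUNT) if use_spot else ON_DEMAND_RATE
    return round(nodes * hours * rate, 2)

# Same 10-node, 4-hour workload priced both ways.
print(worker_node_cost(10, 4, use_spot=False))  # 20.0
print(worker_node_cost(10, 4, use_spot=True))   # 6.0
```

Combined with auto scaling, the savings compound: fewer node-hours are billed, and each node-hour is cheaper.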
CDI-E enhances cloud cost optimization, data performance and your ability to manage compute resources. If you want to reduce cloud costs while performing data integration tasks more efficiently, visit Cloud Data Integration Elastic or sign up for the 30-day trial today.