Co-authored by Harish R.G.
Informatica launched the Informatica Intelligent Cloud Services (IICS) Cloud Data Integration Elastic service in the summer of 2019. IICS Cloud Data Integration Elastic enables your IT organization to process data integration tasks without managing servers or requiring additional big data expertise. Data integration solutions are deployed using IICS Cloud Data Integration Elastic to process large volumes of big data in the cloud. An ephemeral Kubernetes cluster provides the serverless Spark engine to process the integration mapping tasks.
The Informatica Secure Agent spins up a Kubernetes cluster that runs serverless Spark jobs, which divide your data loads and process them in parallel. Based on your SLAs, job size, and mapping complexity, you can choose the size of the Kubernetes cluster (number of nodes) and the machine type (CPU:memory configuration).
By default, we have enabled autotuning in Cloud Data Integration Elastic to run your mappings with optimal performance. However, for advanced users, we do provide a mechanism to optimize the tuning further. This blog discusses the various Spark settings that impact the performance of workloads and how they can be tuned further for better performance.
Running Spark jobs on Cloud Data Integration Elastic can leave CPU and memory resources on the cluster underutilized. For example, on a cluster node with 8 cores and 32 GB RAM, the Spark executor tasks may not use all 8 cores and all 32 GB of memory. The causes of underutilization fall broadly into three categories.
1. System resource allocation.
Cloud Data Integration Elastic reserves system resources for the Spark shuffle service, the Docker daemon, and Kubernetes services. The Spark shuffle service supports dynamic allocation, which adds or removes Spark executors dynamically based on workload.
2. Default Spark settings.
a) spark.executor.cores – defines the number of CPU cores allocated to each executor; the default is 2. For non-CPU-intensive workloads, two cores per executor process is often an excessive allocation and can leave CPU on the cluster underutilized.
b) spark.executor.memory – defines the amount of memory used by each executor process; the default is 6 GB. For most workloads, unless a mapping uses particularly memory-intensive expressions or transformations, this is higher than necessary and can leave memory underutilized.
c) spark.driver.memory – defines the amount of memory for the driver process, where the SparkContext is initialized; the default is 4 GB. This can also be reduced to make more memory available for executor processes.
3. Ratio of CPU to memory available on the Cloud Data Integration Elastic cluster worker nodes.
Cloud Data Integration Elastic clusters can run on the instance type of your choice. The ratio of CPU to memory on these machines plays an important role in determining how many Spark executor instances can be launched.
a) Memory-optimized instances, such as the m4 and m5 EC2 instance types, where CPU:memory is 1:4.
In a memory-optimized instance like m4.2xlarge (8 vCPUs, 32 GB RAM), the default Cloud Data Integration Elastic settings yield about 2 executor instances per node.
b) Compute-optimized instances, such as the c4 and c5 EC2 instance types, where CPU:memory is 1:2.
In a compute-optimized instance like c4.2xlarge (8 vCPUs, 15 GB RAM), the default Cloud Data Integration Elastic settings yield only about half as many executor instances per node as m4.2xlarge. This is because the default value of the Spark parameter spark.executor.memory allocates 6 GB per executor, and on top of this, spark.driver.memory reserves another 4 GB. Even though enough CPU cores are available, we cannot spin up more executors because we are limited by the memory available on the instance (15 GB).
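To see why memory rather than CPU becomes the bottleneck on compute-optimized nodes, the arithmetic can be sketched as a rough back-of-the-envelope calculation. This is an illustrative simplification only: it ignores memory-overhead factors and the resources reserved for the shuffle service, Docker, and Kubernetes, so real executor counts will be lower than these upper bounds.

```python
def executors_per_node(node_cores, node_mem_gb,
                       executor_cores=2, executor_mem_gb=6,
                       driver_mem_gb=4):
    """Rough upper bound on Spark executors that fit on one worker node.

    Uses the Cloud Data Integration Elastic defaults (2 cores and 6 GB per
    executor, 4 GB driver). Ignores system reservations and memory overhead,
    so real counts are lower.
    """
    by_cpu = node_cores // executor_cores
    # Assume the driver lands on this node and reserves its memory first.
    by_mem = (node_mem_gb - driver_mem_gb) // executor_mem_gb
    return int(min(by_cpu, by_mem))

# m4.2xlarge: 8 vCPUs, 32 GB -> CPU allows 4 executors, memory allows 4
print(executors_per_node(8, 32))  # 4
# c4.2xlarge: 8 vCPUs, 15 GB -> CPU allows 4, but memory allows only 1
print(executors_per_node(8, 15))  # 1
```

The c4.2xlarge case shows the imbalance: the node's cores could host four default-sized executors, but after the driver's 4 GB only one 6 GB executor fits in 15 GB of RAM.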
Scenario 1: Memory optimized instances
The two Cloud Data Integration Elastic Spark parameters spark.kubernetes.executor.cpu.per.task and spark.kubernetes.driver.cpu.per.core control the CPU requested for each executor and driver. Example values include 500m, 750m, and 1000m. A value of 1000m directs Cloud Data Integration Elastic to assign each executor/driver 1 vCPU (in AWS terminology). Setting it lower than 1000m (say, 500m) means each executor/driver is given half a vCPU. A lower (more aggressive) value therefore allows more executors to be scheduled and raises overall CPU utilization.
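As an illustration, the two CPU-request parameters discussed above could be set as custom Spark properties along these lines (the 500m value mirrors the recommendation later in this post; how the properties are supplied depends on your mapping task configuration):

```properties
# Request half a vCPU (500 millicores, in Kubernetes notation) per executor
# task and per driver core, allowing roughly twice as many executors to be
# scheduled on the same memory-optimized node.
spark.kubernetes.executor.cpu.per.task=500m
spark.kubernetes.driver.cpu.per.core=500m
```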
Scenario 2: Compute optimized instances
In compute-optimized EC2 instance types, the CPU-to-memory ratio is 1:2. In such instance types, even though you may have enough CPU cores, overall executor allocation is limited by the memory available on the instance. The default Spark setting of 6 GB for spark.executor.memory must be reduced in such cases to launch a greater number of executor instances. This should be done in conjunction with setting spark.kubernetes.executor.cpu.per.task and spark.kubernetes.driver.cpu.per.core to 500m.
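A combined configuration for compute-optimized nodes might look like the following sketch. The reduced memory values are illustrative assumptions, not tested recommendations; the right figures depend on your mapping's memory profile:

```properties
# Compute-optimized nodes (CPU:memory 1:2): shrink each executor's memory
# footprint so the node's RAM, not its cores, is no longer the bottleneck.
spark.executor.memory=3G                          # down from the 6 GB default (illustrative)
spark.driver.memory=2G                            # down from the 4 GB default (illustrative)
spark.kubernetes.executor.cpu.per.task=500m       # half a vCPU per executor task
spark.kubernetes.driver.cpu.per.core=500m         # half a vCPU per driver core
```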
Based on our internal performance tests, optimal usage of cluster computing resources can be achieved by setting spark.kubernetes.executor.cpu.per.task and spark.kubernetes.driver.cpu.per.core to 500m. In general, this is a sound strategy, since most tasks do not need a full CPU unless they are specifically CPU-bound.
With this more aggressive resource scheduling (500m as the value for cpu.per.task), Cloud Data Integration Elastic can spin up more executor instances on the same cluster configuration, thereby utilizing system CPU and memory more efficiently.
With the Cloud Data Integration Elastic cluster now able to spin up more executor tasks, each Spark job can execute faster than before, leading to performance gains. In some of our internal tests, we observed up to 12% improvement in job execution times.
IICS Cloud Data Integration Elastic has solutions for all kinds of big data users. We made it easy for users to run their Spark tasks by providing auto tuning out of the box. For advanced users who want more control over performance, the above best practices and recommendations help you tune and optimize your cloud data integration performance.
All performance claims mentioned in this blog were either observed in our development environment or shared with us by customers. Your results may vary, as many factors influence performance.