In the big data world of Hadoop and NoSQL, the spotlight has shifted to big data processing engines. Today, many different big data engines are available, making it difficult for organizations to determine the right processing engine for their big data integration needs. Spark is a new and exciting technology that’s showing a lot of promise by enabling new use cases such as machine learning and graph processing, as well as simplifying development and providing scale for big data. Informatica has always embraced open source innovation for its products and will continue to leverage and extend open source technologies. Informatica already uses Spark for graph processing. However, our latest Big Data Management product runs batch ETL processing 2 to 3 times faster than Spark by using Informatica Blaze on YARN. You will find more details on this and on Informatica Blaze later in this blog.
When you evaluate big data processing engines for data integration performance, consider the following key points before making a decision:
- Performance: Performance is a key factor to evaluate for any big data processing engine. Look out for the following areas before picking any big data engine:
  - Concurrency: Running jobs concurrently is very common in data integration scenarios. Make sure that your big data integration tool performs well under concurrent load. For example, Spark has some limitations with respect to concurrent job execution. Spark starts a new YARN application with every job. This means that if you run 10 mappings, Spark will start 10 YARN applications, and resources are not reused between applications. Therefore, any data integration tool that packages only Spark will have the same issue.
  - Memory Utilization: Another hot topic today is memory utilization. There are use cases for in-memory processing engines such as Spark, but not every use case is a fit for Spark processing. Spark needs a lot of memory. Much like traditional RDBMSs, Spark loads a process into memory and keeps it there until further notice, for the sake of caching. If Spark runs on Hadoop YARN alongside other resource-demanding services, or if the data is too big to fit entirely into memory, Spark’s performance can degrade. Spark is well suited to use cases such as interactive queries and machine learning processes that need to pass over the same data many times, but batch use cases such as data integration (e.g. ETL) are different. For batch ETL, adding more memory does not always help. An engine that works well for in-memory use cases won’t always work well for high-volume, high-throughput batch ETL processing.
  - Performance Benchmark: Request a performance benchmark from software vendors. Keep in mind that you’re evaluating big data software for big data management use cases, so the evaluation must cover performance and scalability. TPC benchmarks are the best way to provide a vendor-neutral evaluation of performance and price-to-performance ratio.
- Layer of Abstraction: We have seen the big data landscape change tremendously within a few years, and it continues to change even today. For example, the development of YARN for resource management and the migration from the MapReduce programming framework to Spark as the new processing engine were both fairly recent developments. Today, Spark is the talk of the town. Spark might be a great processing engine now, but tomorrow another engine could step into the spotlight. Therefore, you must determine which vendor provides a layer of abstraction so that you are not tied to a particular processing engine. For example, a code-generating tool is tied to a specific processing engine because it generates code only for the engine it supports, so you must understand what the generated code does before you can migrate to another processing engine. By asking the simple question of whether a software vendor supports an abstraction layer, you can future-proof your big data management platform against changing big data technologies.
- Breadth of Functionality: When you begin your big data journey, look at the entire journey. Today, big data is much more than just big data integration. For a big data project to succeed, a software vendor must talk to you about the whole big data management framework, which consists of big data integration, big data governance, and big data security. Breadth of functionality is an important factor when talking to vendors. A vendor who claims support for only a processing engine like Spark has already defeated any chance of success for your big data project. Look for big data software that provides functionality for your entire company, from the business analyst who needs to profile data on Hadoop or provide data quality rules or governance for the big data landscape, to the developer who needs to parse complex files or build complex transformations. Remember that big data management is much more than whether a tool can perform join operations and aggregate data. Ask how the tool or platform handles different data formats, parsing, standardizing, normalizing, data quality, data matching, and data masking. Can the processing engine dynamically manage a variety of data types (e.g. Timestamp with Timezone)? Ask for a detailed list of the functionality that you might encounter or need in your big data implementation.
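The layer-of-abstraction point above can be made concrete with a minimal Python sketch. Every name in it (`Mapping`, `RowAtATimeEngine`, `StepAtATimeEngine`, the sample rows) is hypothetical and is not an Informatica, Spark, or Hadoop API. It only illustrates the idea that a pipeline defined once as logical steps can be handed to interchangeable engines without being rewritten.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]
Step = Callable[[Row], Row]


@dataclass(frozen=True)
class Mapping:
    """Engine-independent pipeline: an ordered sequence of row transformations."""
    name: str
    steps: Tuple[Step, ...]


class RowAtATimeEngine:
    """Hypothetical engine: pushes each row through every step before the next row."""
    name = "row-at-a-time"

    def run(self, mapping: Mapping, rows: Iterable[Row]) -> List[Row]:
        out = []
        for row in rows:
            for step in mapping.steps:
                row = step(row)
            out.append(row)
        return out


class StepAtATimeEngine:
    """Hypothetical engine: applies one step to the whole data set, then the next."""
    name = "step-at-a-time"

    def run(self, mapping: Mapping, rows: Iterable[Row]) -> List[Row]:
        data = list(rows)
        for step in mapping.steps:
            data = [step(row) for row in data]
        return data


def trim_name(row: Row) -> Row:
    return {**row, "name": row["name"].strip()}


def upper_country(row: Row) -> Row:
    return {**row, "country": row["country"].upper()}


# The mapping is written once and contains no engine-specific code.
customer_mapping = Mapping("standardize_customers", (trim_name, upper_country))
sample_rows = [{"name": " Ada ", "country": "uk"}, {"name": "Lin", "country": "sg"}]

# Switching engines means passing a different object, not changing the mapping.
results = {
    engine.name: engine.run(customer_mapping, sample_rows)
    for engine in (RowAtATimeEngine(), StepAtATimeEngine())
}
```

In this toy setup both engines return the same rows for the same mapping, which is the property an abstraction layer is meant to preserve when the engine underneath changes.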
Be wary of performance benchmarks that do not consider all three critical criteria above. For example, a software vendor recently funded a performance benchmark study comparing a two-year-old version of Informatica Big Data Edition, which supported MapReduce, with their own software supporting Spark. The study was conducted without considering the key points noted above. Unfortunately, this benchmark is misleading for the following reasons:
- The benchmark compares a disk-I/O-intensive MapReduce engine against the in-memory Spark engine on a single memory-optimized Amazon instance (VM).
- The vendor chose a custom benchmark instead of the industry-standard TPC benchmarks, which provide a vendor-neutral evaluation of performance and price-to-performance ratio. In the big data space, where terabytes and petabytes are the norm, this benchmark processes only ~2 GB of source data on a single memory-optimized VM, which hides the cost of the shuffle phase.
- The use case was executed using only 12 million records on a cluster with 4 CPUs, 30.5 GB Memory, and 200 GB Storage, which is hardly representative of a real-world big data environment.
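A quick back-of-the-envelope check, using only the figures quoted above, shows why such a setup says little about large-scale behavior. The sketch assumes decimal gigabytes and treats the "~2 GB" figure as exactly 2 GB, so the results are rough estimates rather than exact values.

```python
# Figures quoted in the benchmark description above.
source_data_gb = 2.0        # "~2 GB of source data" (approximate)
vm_memory_gb = 30.5         # memory of the single VM
record_count = 12_000_000   # "12 million records"

# Share of the VM's memory that the whole source data set would occupy.
memory_fraction = source_data_gb / vm_memory_gb

# Rough average record size, assuming 1 GB = 10**9 bytes.
bytes_per_record = source_data_gb * 1e9 / record_count

print(f"Source data is about {memory_fraction:.1%} of available memory")
print(f"Average record size is roughly {bytes_per_record:.0f} bytes")
```

With the entire input occupying well under a tenth of the machine's memory, the data can stay in RAM from start to finish, which favors an in-memory engine and does not exercise the disk and network costs that dominate at terabyte scale.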
At Informatica, we made a strategic decision to accommodate all three critical criteria when building and optimizing big data engines for our customers’ real-world workloads. As we explore the big data use cases from our customers and the technologies that are driving the big data world, one reality emerges: just like polyglot persistence, we also have a need for polyglot engines. Polyglot (literally, using multiple languages) implies that no single technology, whether for storage, language, processing, or your favorite technology area, can or will solve all problems well. A polyglot engine is able to use the right processing engine for a given data processing task.
As a leader in the data integration space for 23 years, Informatica builds on its years of innovation atop open source technologies to deliver a highly scalable, performant, and flexible big data platform that addresses real-world demands for data integration, data governance, and data security. With Hadoop 2.0, YARN allowed multiple, complex distributed applications to run in a multi-tenant Hadoop platform. Informatica Big Data Management Version 10 includes our new big data engine, “Blaze”.
Blaze is a data processing engine, unique in the industry, that is integrated with YARN to provide intelligent data pipelining, job partitioning, job recovery, and scalability. It is optimized to deliver high-performance, scalable data processing by leveraging Informatica’s cluster-aware data integration technology.
YARN, which provides the capability to build custom application frameworks on top of Hadoop to support multiple processing models, allowed Informatica’s data transformation engine to be integrated natively with Hadoop. Informatica Blaze is built using a memory-based data exchange framework that runs natively on YARN without depending on MapReduce, Hive, or Spark, and it addresses the functional gaps of each of those processing engines. Informatica Blaze extends data processing capabilities on Hadoop and complements Informatica’s Big Data Management solution, which supports multiple processing paradigms, such as MapReduce, Hive on Tez, Informatica Blaze, and Spark, to execute each workload on the best possible processing engine. This is the polyglot approach described above.
The following figure shows how Informatica Big Data Management integrates with its proprietary big data engine, Blaze, and with other big data engines:
By combining the best of open source technology, YARN, and 23 years of data management experience, Informatica Blaze adds to Informatica’s optimized support for multiple data processing frameworks by delivering flexible, scalable, high-performance data processing on Hadoop. This support ultimately provides organizations with an end-to-end platform for optimized big data integration. Blaze also provides cluster abstraction: it is not tied to the Hadoop platform alone. Blaze requires:
- Resource Management
- Distributed File System
- Cluster Management
As long as any cluster provides these functionalities, Blaze can run on top of it. This means Blaze can be tuned to work on Mesos and other resource managers.
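The cluster abstraction described above amounts to a capability check. The following minimal Python sketch uses entirely hypothetical names (`REQUIRED_CAPABILITIES`, `missing_capabilities`, `can_host_engine`, and the two example clusters) and does not reflect how Blaze actually probes a cluster. It only shows the rule that a cluster qualifies when it offers all three listed services.

```python
# The three capabilities listed above, expressed as hypothetical identifiers.
REQUIRED_CAPABILITIES = frozenset({
    "resource_management",
    "distributed_file_system",
    "cluster_management",
})


def missing_capabilities(cluster_capabilities):
    """Return the required capabilities that a cluster does not provide."""
    return REQUIRED_CAPABILITIES - set(cluster_capabilities)


def can_host_engine(cluster_capabilities):
    """A cluster qualifies only if none of the required capabilities is missing."""
    return not missing_capabilities(cluster_capabilities)


# Two made-up cluster descriptions for illustration.
full_cluster = {
    "resource_management",
    "distributed_file_system",
    "cluster_management",
    "interactive_sql",  # extra capabilities do not matter
}
partial_cluster = {"resource_management"}

full_ok = can_host_engine(full_cluster)
partial_gaps = sorted(missing_capabilities(partial_cluster))
```

The check depends on what a cluster provides rather than on which product provides it, which is why the engine is described as not being tied to one platform.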
I would like to share with you some performance benchmark results comparing our new engine, Blaze, with Spark and MapReduce. These benchmarks adhere to the three criteria I discussed earlier and, of course, will vary based on hardware and software configurations. We used the standard TPC-DS queries for these benchmark tests, running Informatica Big Data Management V10.0 on a 13-node Cloudera V5.4.2 cluster. Each node has 2 CPUs, 12 cores, and 64 GB of memory. TPC-DS is the new decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance.
From these tests it is quite apparent that Informatica Blaze runs on average 11 to 20 times faster than MapReduce and 2 to 3 times faster than Spark.
Key Benefits of Blaze
- Performance:
  - Running Concurrent Jobs: Blaze follows a multi-tenant architecture. This means that multiple jobs are served by a single Blaze instance, which provides better resource usage and sharing among jobs. Even if we start 100 mappings, we start only one YARN application.
  - Resource Utilization:
    - High-performance native C++ code: Informatica Blaze is written in native C++ and so avoids the memory overheads associated with Java. Spark is written in Scala, which runs on the same JVM as Java. The JVM introduces overhead for storing objects in memory and requires garbage collection tuning for data-intensive applications. For example, “a simple 4 byte string becomes over 48 bytes in total in the JVM object model!”
    - Optimized Data Shuffle: Informatica Blaze uses an optimized, fault-tolerant Data Exchange Framework (DEF daemon) for the shuffle phase. The shuffle phase is very common in data integration use cases: any use case involving a Reduce operation requires one. With the Blaze engine, the DEF daemons can shuffle data in memory as well as on disk, without losing the ability to recover. This is a critical capability for big data processing on a Hadoop cluster, and it is difficult to achieve. Blaze is the only big data engine that supports a fault-tolerant in-memory shuffle phase, which is a unique advantage. If resources are available in a big Hadoop cluster, the shuffle phase can happen in memory with Blaze to achieve higher performance. In Spark, the shuffle happens on disk.
- Layer of Abstraction: Blaze is part of Big Data Management, and the Informatica smart executor can pick Blaze for the use cases it suits best.
- Breadth of Functionality: Blaze runs the native Informatica engine on top of YARN. This gives us the flexibility to support all data integration transformations, including advanced data integration, data profiling, data quality, parsing, and masking transformations.
- Logging and Monitoring: Informatica Blaze offers its own Blaze console. The Blaze console runs on top of the Apache Application TimeLine Server. This helps in monitoring and debugging.
- Consistent Behavior: Informatica Blaze produces consistent results whether a job runs in Hadoop mode or non-Hadoop mode.
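The concurrency difference described in this post (one YARN application per job versus one shared, multi-tenant instance) can be shown with a toy model. The classes below are hypothetical and do not call any YARN, Spark, or Blaze API. They only count how many applications each launch model would start for the same batch of mappings.

```python
class PerJobLauncher:
    """Toy model of the per-job behavior described above: every submission starts a new application."""

    def __init__(self):
        self.applications_started = 0

    def submit(self, mapping_name):
        self.applications_started += 1
        return f"app-{self.applications_started}"


class SharedInstanceLauncher:
    """Toy model of a multi-tenant engine: the first submission starts one application, later ones reuse it."""

    def __init__(self):
        self.applications_started = 0
        self._app_id = None

    def submit(self, mapping_name):
        if self._app_id is None:
            self.applications_started += 1
            self._app_id = "app-1"
        return self._app_id


mappings = [f"mapping_{i}" for i in range(100)]

per_job = PerJobLauncher()
shared = SharedInstanceLauncher()
per_job_ids = {per_job.submit(m) for m in mappings}
shared_ids = {shared.submit(m) for m in mappings}

print("per-job model:", per_job.applications_started, "applications")
print("shared model:", shared.applications_started, "application")
```

The model ignores scheduling, memory, and failure handling. It captures only the bookkeeping difference that lets a shared instance reuse resources across jobs.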
A successful big data strategy depends on implementing a big data management solution that scales for big data integration, big data governance, and big data security. It is important for any company starting on its big data journey to evaluate software vendors thoroughly and understand whether they can deliver maximum productivity and reuse (e.g. using dynamic templates); optimal performance and scalability that doesn’t require a complete code rewrite every time a new engine emerges in the big data ecosystem (e.g. by using an abstraction layer to select the best engine for each job, such as Blaze or Spark, rather than a code generator); and a universal data catalog (e.g. Informatica Live Data Map). These are just a few key criteria you should consider when evaluating a big data management solution. There are many other criteria to consider, which you can read about in the white paper, “How Big Data Management Turns Petabytes into Profits”.
At Informatica, we have the advantage of our smart executor to optimize Big Data Management workloads. This executor dynamically selects the best engine based on cost and use cases. We leverage the best of open source technologies and add our own innovation atop these projects. Informatica Big Data Management supports a variety of engines such as Blaze, Spark, Tez, and MapReduce and has the ability to support future engines without having to rebuild or refactor your data pipelines, while providing a breadth of functionality for data quality, data governance, and data security and providing a layer of abstraction to support a polyglot ecosystem.
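As a rough illustration of what selecting the best engine based on cost and use cases could look like, here is a minimal Python sketch. It is not the smart executor’s actual logic: `EngineProfile`, `select_engine`, the workload labels, and every cost value are assumptions made up for this example, and the numbers are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EngineProfile:
    """Hypothetical description of an engine: per-workload cost scores, lower is better."""
    name: str
    relative_cost: Dict[str, float]


def select_engine(workload: str, engines: List[EngineProfile]) -> EngineProfile:
    """Pick the cheapest engine among those that support the workload."""
    candidates = [e for e in engines if workload in e.relative_cost]
    if not candidates:
        raise ValueError(f"no engine supports workload {workload!r}")
    return min(candidates, key=lambda e: e.relative_cost[workload])


# Placeholder cost scores, invented for illustration only.
engines = [
    EngineProfile("Blaze", {"batch_etl": 1.0}),
    EngineProfile("Spark", {"batch_etl": 2.5, "graph_processing": 1.0}),
    EngineProfile("MapReduce", {"batch_etl": 15.0}),
]

etl_choice = select_engine("batch_etl", engines).name
graph_choice = select_engine("graph_processing", engines).name
```

Because the pipeline names a workload rather than an engine, supporting a future engine in this sketch means adding one more `EngineProfile` entry while the pipelines stay unchanged.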