This blog was co-authored by Nauman Fakhar, Director of ISV Solutions at Databricks.
Apache Hadoop was born as an on-premises platform. Most of the use cases for early commercial Hadoop vendors focused on on-premises implementations of the open source data analytics platform. Eventually, Hadoop-as-a-Service, meaning Hadoop running in the cloud, became increasingly popular.
However, the Hadoop-as-a-Service model ran into challenges of its own.
With the acceleration of data engineering and AI workloads moving into the cloud, customers' expectations of the underlying data platform also evolved.
Cloud providers now offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. With PaaS, analytical engines such as Apache Spark come ready to use, with a general-purpose configuration and upgrade management system. Long-running Hadoop clusters are no longer needed for most of the jobs in today's data pipelines.
With fast-changing trends and ecosystems, Informatica plays a critical role as an abstraction layer for customers. Customers can choose any technology and any vendor to process and store their data using Informatica Data Engineering Integration.
Informatica customers can use simple drag-and-drop functionality to build complex data pipelines against any big data vendor and technology. When they move to a different vendor or distribution, the pipelines continue to work without any code changes. In this way, Informatica Data Engineering Integration customers can future-proof their big data management platform against changing big data technologies.
Databricks is a managed, cloud native, unified analytics platform built on Apache Spark. Databricks is also the creator of Delta Lake, which allows customers to create reliable and performant data lakes on their cloud of choice.
Informatica and Databricks have partnered to help organizations realize big data value sooner by making ingestion and preparation of data for analysis and machine learning easier. This integration dramatically increases productivity across the organization.
Data engineers, data scientists, and administrators don’t need to spend time configuring and optimizing clusters and manually maintaining or scaling the data platform. Instead, data engineers can spend time building data pipelines for machine learning and analytics. And because Informatica Data Engineering Integration offers a visual paradigm for expressing data engineering workloads, organizations that don’t have Python, Scala, R or SQL programming language skill sets can still leverage the power and scale of Databricks from a GUI-based environment.
For customers who are looking to migrate from the traditional Hadoop architecture to a cloud-native platform like Databricks, this article highlights the issues and benefits of changing trends in big data architecture. The article also suggests several best practices for migrating to cloud and serverless technologies.
In long-running Hadoop clusters, YARN manages capacity and job orchestration, and it requires users to learn complex configurations to balance the capacity and performance needs of multiple users.
A cluster in Databricks is a lightweight concept that can be created on demand very quickly by leveraging the native elasticity and scale of the underlying cloud.
This on-demand model relieves users of the operational burden of managing capacity in shared, long-running clusters: users easily spin up elastic clusters that automatically expand or shrink with workload demand and can shut down automatically during quiet periods. This lets Databricks users focus on analytics instead of operations.
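As a sketch, such a cluster can be described declaratively. The JSON below follows the shape of a Databricks Clusters API request; the cluster name, node type, and runtime version are placeholder values to adjust for your cloud and workload:

```json
{
  "cluster_name": "etl-pipeline-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 30
}
```

With `autoscale` set, Databricks adds or removes workers between the two bounds as load changes, and `autotermination_minutes` shuts the cluster down after a period of inactivity, so idle capacity is not billed.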
On Hadoop, HDFS is used as the storage layer. HDFS is a distributed file system that is tightly coupled to compute.
Databricks leverages cloud-native storage such as S3 on AWS or ADLS on Azure, which leads to an elastic, decoupled compute-storage architecture. Such an architecture allows users to scale compute independently of storage and relieves them from having to capacity plan their storage needs or deal with scalability limits of HDFS name nodes.
Data Lake, SQL and NoSQL
Hadoop includes engines such as Hive, open source Spark, and HBase. While these engines had their merits as first-generation big data products, they aren't well suited to building a reliable and performant cloud-native data lake today.
Databricks includes the Databricks Runtime, a Databricks implementation of Apache Spark that is much more performant, scalable, and enterprise-ready than open source Spark.
Databricks also includes Delta Lake, which enables users to build reliable and performant data lakes on cloud storage. Support for transactional pipelines, automatic caching, and data clustering techniques makes it possible to build a truly enterprise-grade data lake.
For NoSQL capability, Databricks integrates with cloud-native services.
For more information on architecture changes, refer to the Databricks documentation.
Customers who plan to switch from Hadoop to Databricks should be aware of the following key changes:
Sqoop: Sqoop is not available on Databricks. Customers should instead use Data Engineering Integration mass ingestion to ingest data into any cloud storage layer that Databricks supports, or use Informatica's JDBC V2 connector for Databricks to ingest data directly into Delta Lake.
Hive: Hive is a SQL layer over HDFS that lets you access data on HDFS through a SQL representation. Customers migrating from Hadoop to Databricks should migrate their Hive datasets to Delta Lake.
Databricks Delta Lake: Delta Lake provides ACID transactions, versioning, and schema enforcement to Spark data sources.
Just as Data Engineering Integration users use Hadoop to access data on Hive, they can use Databricks to access data on Delta Lake.
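For illustration, migrating an existing Parquet dataset or Hive table to Delta Lake can be done with Spark SQL on Databricks. The statements below are a sketch; the path and table names are placeholders:

```sql
-- Convert an existing Parquet dataset to Delta in place
CONVERT TO DELTA parquet.`/mnt/datalake/events`;

-- Or copy a migrated Hive table into a new Delta table
CREATE TABLE events_delta
USING DELTA
AS SELECT * FROM hive_events;
```

`CONVERT TO DELTA` rewrites only the table's transaction metadata, not the underlying Parquet files, so in-place conversion is typically fast even for large datasets.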
Hadoop customers who use NoSQL with HBase on Hadoop can migrate to Azure Cosmos DB, or DynamoDB on AWS, and use Data Engineering Integration connectors to process the data. This is a sound architectural strategy as customers expect a cloud-native, managed, and elastic alternative to HBase when migrating NoSQL workloads from Hadoop to cloud.
Customers can use Informatica transformations with drag-and-drop functionality, relieving developers of the need to write code to process data. Informatica transformations are compatible with any Hadoop or non-Hadoop vendors, making it easier for customers to switch between vendors and technologies. Customers need to follow some best practices while migrating to Databricks:
Hive engine: Jobs that were configured to run with the Hive engine must be updated to run with Databricks. Refer to the Informatica documentation to learn how to update your Hive jobs.
Sequence Generator transformation: Use UUID4 or the monotonically_increasing_id() function in Spark.
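For instance, a Sequence Generator that produced surrogate keys can be replaced with UUIDs, which need no coordination across distributed workers. A minimal Python sketch (the function name is illustrative):

```python
import uuid

def surrogate_key() -> str:
    """Generate a globally unique surrogate key.

    Unlike a central sequence, UUID generation requires no coordination
    between distributed workers. (Within Spark itself,
    monotonically_increasing_id() serves a similar purpose, but its
    values are only unique within a single DataFrame.)
    """
    return str(uuid.uuid4())

# Each call yields an independent 36-character key
keys = [surrogate_key() for _ in range(1000)]
assert len(set(keys)) == 1000  # no collisions in practice
```

The trade-off is that UUIDs are random strings rather than compact, ordered integers, which matters if downstream systems rely on key ordering.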
Job concurrency on a shared cluster behaves differently on Databricks than on Hadoop. Hadoop uses YARN, which includes a job scheduler and resource pools to orchestrate jobs. YARN also launches a new Spark driver for each Spark job to allow job recovery and concurrency on one cluster. But the resources available to YARN are limited by the overall capacity of the cluster. This may result in increased job completion times, resource contention, missed SLAs, and the operational burden of dividing limited capacity among multiple competing workloads.
Databricks is a cloud-native product that relieves customers from waiting for cluster resources. With Databricks, a single job is allowed to consume all resources on the cluster, significantly improving job performance and reducing operational risk.
Since a Databricks cluster is backed by the elasticity of the underlying cloud, it’s a much lighter weight and agile component in comparison to a monolithic fixed-capacity long running YARN-Hadoop cluster. As a best practice, you should design your architecture to segregate independent data engineering pipelines into their own clusters. This allows for a “pay for what you use” elastic model and results in both minimal operational burden and lower TCO as no resources are wasted (Databricks clusters can automatically shut down once jobs are finished).
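For example, the Databricks Jobs API lets each pipeline run on its own short-lived cluster that is created for the run and terminated when it finishes. Below is a sketch of a run-submission payload; names, sizes, and the notebook path are placeholders:

```json
{
  "run_name": "nightly-ingest",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4
  },
  "notebook_task": {
    "notebook_path": "/pipelines/nightly_ingest"
  }
}
```

Because the `new_cluster` exists only for the duration of the run, each pipeline pays for exactly the capacity it uses and cannot contend with other pipelines for resources.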
Some existing pipelines may require you to run multiple jobs on the same cluster. While this is possible on Databricks, there are some guidelines to keep in mind.
With the evolution of cloud-based big data pipelines, Informatica plays a critical role in future-proofing the data engineering platform.
Informatica and Databricks together provide an efficient way to process your data and help reduce compute costs with the auto-scaling capabilities of Databricks. Customers can design some of the most advanced data pipelines with no coding and minimal cluster maintenance. For customers migrating from Hadoop to a Spark-based compute engine like Databricks, there are architectural changes to consider, as outlined in the sections above. Refer to the Informatica documentation for help with creating a Databricks connection and running mappings and workflows on Databricks from Data Engineering Integration.
To learn more, watch the on-demand webinar: Building Intelligent Data Pipelines for AI/ML Projects.