The benefit of leveraging Informatica’s Intelligent Data Management Cloud for Delta Lake on Databricks

Jul 13, 2021 | Avadhoot Patwardhan

Accelerate your data engineering pipeline development on Databricks and govern your Delta Lake with Informatica’s Intelligent Data Management Cloud.

Informatica recently announced advanced ingestion and integration capabilities for Databricks Delta with its summer 2021 release. The new capabilities help data analysts and data scientists move large amounts of data into Databricks Delta Lake for AI, machine learning, and data science projects.

What is Delta Lake?

Delta Lake is an open format storage layer that delivers reliability, security, and performance on your data lake — for both streaming and batch operations. By replacing data silos with a single home for structured, semi-structured, and unstructured data, Delta Lake is the foundation of a cost-effective, highly scalable lakehouse.

Without fast access to accurate and prepared datasets, data teams are challenged to build accurate AI and ML models. In addition, inaccurate or incomplete data can skew results and undermine confidence in AI and ML projects.

If you are a data scientist or data analyst who works with petabytes of data and runs complex analyses on it, you have probably heard of Databricks Delta. With the availability of open, cost-effective, unified cloud data infrastructure, businesses can access all of the data within their organization. This data is useful for a variety of downstream applications, but we’d like to focus on two types of data consumers: data analysts and data scientists who are developing new machine learning models to reliably forecast important business outcomes.

Advantages of Delta Lake

One of the most important requirements for data scientists and data analysts solving business-critical problems is access to accurate, curated, and reliable data. Databricks Delta plays a critical role in meeting this demand with its multi-cloud presence and support for ACID (atomicity, consistency, isolation, durability) transactions. In an era where machine learning is poised to disrupt every industry, the data lakehouse is a modern data management architecture that dramatically simplifies enterprise data infrastructure and accelerates innovation.

Previously, organizations kept reporting and ML modeling use cases isolated, so separate data lake and cloud data warehouse deployments were common. However, growing overlap between reporting and ML modeling has brought the two categories together. It is now more advantageous to source data for both use cases from a single data lakehouse, such as Databricks Delta.

To build a data lakehouse with Delta, organizations need to figure out how to bring in data from on-premises or legacy applications. They also need to address important questions such as:

  • Is there a way to perform such migrations easily and efficiently, without causing significant disruption to the business?
  • Is there an automated, self-service platform to perform ingestion and integration workflows, or do we need to invest in hand coding and run a multi-year migration project?

There are many options available to move your data to the cloud, but each option comes with its own advantages and limitations. It’s wise to pick a solution that is proven, industry-leading, comprehensive, and easy to connect to all possible data sources. With Informatica’s summer 2021 release, we announced a new connector for Databricks Delta that helps enterprises build reliable data lakehouses on Delta using Informatica’s Intelligent Data Management Cloud.

Informatica’s latest integration with Databricks Delta, delivered through the self-service Intelligent Data Management Cloud (IDMC), allows users to ingest data into Delta at scale with a wizard-driven experience, and then transform and enrich that data in Delta using out-of-the-box transformations and functions.

Why Informatica Intelligent Data Management Cloud for Databricks?

Faster access to accurate and prepared datasets is critical for enterprise analytics to deliver better business outcomes. Informatica and Databricks partnered to provide a scalable data and machine learning solution with faster data discovery, ingestion, and preparation that accelerates development and increases model accuracy.

The Informatica partnership with Databricks brings together the best of both platforms. The following capabilities help organizations rapidly build best-in-class AI, ML, and analytics projects to drive meaningful business insights.

  • Data Discovery – Informatica’s Enterprise Data Catalog provides UI-based capabilities for discovering and profiling data and for tracking the lineage of Delta tables and ADLS Gen2 data, working with Databricks’ managed and optimized platform for running Spark jobs.
  • Data Ingestion – Informatica’s Cloud Mass Ingestion efficiently ingests huge volumes of data from a variety of sources, such as streaming and IoT devices, files of any size, and database or data warehouse tables, as well as incremental change data, into Databricks Delta for deeper insights using AI and ML.
  • Data Integration – Informatica’s Intelligent Data Management Cloud supports more than 200 sources from which enterprises can ingest data using a wizard-driven experience. They can then create mappings in our mapping designer to enrich, transform, and load clean, curated data into Delta Lake securely and at scale.
  • Data Governance – Data democratization requires trust, which is achieved only through enterprise data governance. Informatica has a comprehensive product portfolio that is deeply aligned with Databricks and designed to help enterprises deliver data that is consistent, trusted, and governed. It also empowers organizations to manage and protect data assets in accordance with enterprise data policies and regulations such as GDPR and CCPA.
  • Data Quality – Informatica Data Quality ensures clean, complete, consistent, and ready-to-use data for AI and machine learning initiatives on Delta Lake. It features standardization, matching, worldwide address cleansing, and versatile data quality management for all AI and ML projects on Delta Lake.
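
To make the "incremental change data" mentioned under Data Ingestion more concrete, here is a minimal conceptual sketch in plain Python of how a batch of captured changes can be applied to a keyed table. It is not Informatica's or Delta Lake's implementation: the `Change` record, the `apply_changes` helper, and the sample data are all hypothetical names invented for illustration. On Delta Lake itself, the equivalent upsert is typically expressed with a MERGE statement rather than hand-written code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One hypothetical change record captured from a source table."""
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # new column values; None for deletes


def apply_changes(target: Dict[int, dict], changes: Iterable[Change]) -> Dict[int, dict]:
    """Return a new table state with the changes applied in order."""
    # Copy every row so the caller's table is left untouched.
    result = {k: dict(v) for k, v in target.items()}
    for change in changes:
        if change.op == "delete":
            result.pop(change.key, None)  # deleting a missing key is a no-op
        elif change.op in ("insert", "update"):
            # Upsert: lay the new values over any existing row with this key.
            merged = dict(result.get(change.key, {}))
            merged.update(change.row or {})
            result[change.key] = merged
        else:
            raise ValueError(f"unknown operation: {change.op!r}")
    return result


# Illustrative data only.
customers = {1: {"name": "Ada", "city": "London"}, 2: {"name": "Lin", "city": "Pune"}}
batch = [
    Change("update", 1, {"city": "Paris"}),
    Change("insert", 3, {"name": "Sam", "city": "Austin"}),
    Change("delete", 2),
]
updated = apply_changes(customers, batch)
```

Only the rows that changed travel from source to target, which is why change-data ingestion tends to be cheaper than reloading whole tables on every run.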

The joint Informatica and Databricks solution enables organizations to build and iterate on machine learning models faster to address rapid go-to-market demands. As the reference architecture below shows, the joint solution accelerates data engineering pipelines for AI and analytics.

Informatica and Databricks reference architecture

Build an efficient Delta Lake with Informatica’s Intelligent Data Management Cloud

With its summer 2021 release, Informatica is providing new connectivity for Databricks Delta that helps customers source data from Delta tables in their Informatica mappings. With these new capabilities, you can easily ingest data from various cloud and on-premises sources (applications, databases, files, streaming, or IoT) and move this data into Delta Lake. We’ve simplified the UI so that it is very easy to configure: after just a few steps, you’re up and running and ready to move data into Delta Lake at scale.

Data scientists and data analysts can now broaden their use cases and run complex AI and ML models and predictive analytics with more data in Delta.

Informatica’s Intelligent Data Management Cloud also provides Amazon S3 and Microsoft ADLS Gen2 connectivity that may be used to enrich complex files before they are used by Databricks Delta to create tables. Check out more information on the Informatica S3 connector.

Informatica’s new Intelligent Data Management Cloud connector for Databricks Delta helps customers build new integration pipelines with a user-friendly Informatica GUI. Some of the key features of this connector include:

  1. Read/Write – read data from Databricks Delta tables and views and use it seamlessly in integration mappings. Provides support for DML commands.
  2. Datatypes – supports native Databricks Delta datatypes.
  3. Authentication – authenticate via token.
  4. Advanced properties – gives the user the flexibility to choose the Databricks runtime cluster.
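
As a rough illustration of what token-based authentication against Databricks generally looks like, the sketch below builds (without sending) a REST request that carries a personal access token as a bearer token. This is not the Informatica connector's code, and how the connector uses the token internally is not described here. The `build_cluster_list_request` helper, the workspace URL, and the token value are hypothetical placeholders; the clusters list endpoint is used only as an example of an authenticated call.

```python
import urllib.request


def build_cluster_list_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated Databricks REST API request."""
    if not token:
        raise ValueError("a personal access token is required")
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    # Databricks REST APIs accept a personal access token as a bearer token.
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values, not real credentials or a real workspace.
request = build_cluster_list_request(
    "https://example.cloud.databricks.com/", "dapi-EXAMPLE"
)
```

In practice the token should come from a secrets store or an environment variable rather than being written into source code.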

With Informatica’s new Databricks Delta integration capability, our joint customers can uncover meaningful insights using AI and machine learning at scale. 

Next Steps

  1. To try Informatica’s Intelligent Data Management Cloud with the new Databricks Delta connector, sign up for a free trial
  2. Learn more about Informatica solutions for Databricks
  3. Watch our virtual Summer 2021 launch event to learn more about major releases and product innovations