AI/ML Needs Data Management – Informatica and Databricks Deep Dive

Last Published: Dec 23, 2021
Sumeet Agrawal

VP, Product Management

With all the buzz about AI/ML technologies, it can be hard to separate the hype from projects that actually deliver business value. At Informatica, we believe that the data landscape is growing so complex that you simply won’t be able to derive value from your data without AI and machine learning. But just for the sake of argument, let’s look at the reasons behind some claims that AI isn’t ready to deliver value:

  • Data scientists spend 80% of their time preparing data – and only 20% on modeling
  • Data is coming in at high volume, high velocity, from a variety of sources – producing data silos
  • Enterprise data cannot be provisioned if it lacks governance
  • Productivity is lost in tedious data pipelines designed to move data into a lake or a warehouse
  • Data engineers spend too much time on capacity planning for big data processing

All these reasons argue for better data management. Data management holds the key to a successful AI/ML project.

Informatica and Databricks partnership

AI and machine learning with Informatica and Databricks

Informatica and Databricks have partnered to help organizations more quickly realize big data value by making the ingestion and preparation of data for analysis and machine learning easier. This integration dramatically increases productivity across the organization. How? Data engineers, data scientists, and administrators won’t need to spend time configuring and optimizing clusters and manually maintaining the data platform. Instead, your data teams can spend their time building data pipelines for machine learning and analytics to ultimately turn data into profits.

This partnership combines the power of Informatica and Databricks, providing an end-to-end solution to build and process data pipelines, ingest data, and prepare data for analytics and machine learning. At the same time, the integration provides visibility into end-to-end data lineage. As a result, you’re able to:

  • Develop high-speed ingestion of hybrid data into a managed Delta Lake
    The seamless integration between Databricks and Informatica enables data engineers to quickly ingest high volumes of data from multiple hybrid sources into a data lake with high reliability and performance.
  • Make it easier to create high-volume data pipelines for data at scale
    With an easy drag-and-drop user interface that pushes processing down to an optimized Apache Spark implementation in the cloud, customers experience faster and lower-cost development of high-volume data pipelines.
  • Enable data scientists to discover the right datasets for model training
    Data scientists will build more accurate models based on the right dataset and can verify lineage of the data used for model creation and analytics. End-to-end lineage addresses compliance with GDPR and other regulations.

How does this integration work?

Informatica and Databricks provide faster and easier data discovery, ingestion and preparation for data teams to accelerate analytics. Here is a reference architecture of the Informatica and Databricks integration:

Follow these four steps for data management:

Step 1: Data Ingestion: Use Informatica Data Engineering Integration (formerly known as Big Data Management) or Informatica Intelligent Cloud Services for at-scale ingestion to Databricks Delta Lake. Informatica can connect to more than 200 sources to bring data into Delta Lake.

Step 2: Data Enrichment: Informatica Data Engineering Integration (formerly known as Big Data Management) can help data engineers prepare data for AI/ML projects. Informatica provides an easy drag-and-drop interface that lets data engineers push jobs down to Spark on Databricks.

Step 3: AI/ML Model Creation: Once you have the cleaned and integrated data, you can use Databricks to create AI/ML models.

Step 4: Operationalization: Use Informatica Data Engineering Integration (formerly known as Big Data Management) to operationalize the AI/ML models by running them against full datasets.

At any given point, Informatica Enterprise Data Catalog captures the end-to-end lineage, supporting regulatory compliance.
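
To make the four steps concrete, here is a minimal sketch in plain Python of how data might flow from ingestion through scoring while a lineage trail is recorded along the way. It is an illustration only: it does not use any Informatica or Databricks API, and every name in it (Dataset, ingest, enrich, train, operationalize, run_pipeline, DEMO_SOURCES) is hypothetical. The "model" is a toy threshold rule on transaction amounts, standing in for a real fraud detection model, and an in-memory list stands in for a Delta Lake table.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean
from typing import Callable


@dataclass
class Dataset:
    """A named batch of records plus the lineage trail that produced it."""
    name: str
    rows: list[dict]
    lineage: list[str] = field(default_factory=list)


def ingest(sources: dict[str, list[dict]]) -> Dataset:
    """Step 1: land records from several sources in one table, tagging each row with its source."""
    rows = [dict(row, source=src) for src, batch in sources.items() for row in batch]
    return Dataset("raw_transactions", rows, [f"ingest:{src}" for src in sources])


def enrich(raw: Dataset) -> Dataset:
    """Step 2: drop records with no amount and normalize amounts to float."""
    clean = [dict(r, amount=float(r["amount"]))
             for r in raw.rows if r.get("amount") is not None]
    return Dataset("clean_transactions", clean, raw.lineage + ["enrich:drop_null_amount"])


def train(clean: Dataset) -> Callable[[dict], bool]:
    """Step 3: fit a toy model that flags amounts above twice the mean amount."""
    threshold = 2 * mean(r["amount"] for r in clean.rows)
    return lambda row: row["amount"] > threshold


def operationalize(model: Callable[[dict], bool], full: Dataset) -> Dataset:
    """Step 4: run the model against the full dataset and record that in the lineage."""
    scored = [dict(r, flagged=model(r)) for r in full.rows]
    return Dataset("scored_transactions", scored, full.lineage + ["score:threshold_model"])


def run_pipeline(sources: dict[str, list[dict]]) -> Dataset:
    """Chain the four steps; the returned lineage lists every stage that touched the data."""
    clean = enrich(ingest(sources))
    return operationalize(train(clean), clean)


# Made-up sample data for illustration only.
DEMO_SOURCES = {
    "crm": [{"id": 1, "amount": "10"}, {"id": 2, "amount": None}],
    "erp": [{"id": 3, "amount": "20"}, {"id": 4, "amount": "300"}],
}

if __name__ == "__main__":
    result = run_pipeline(DEMO_SOURCES)
    print([r["id"] for r in result.rows if r["flagged"]])
    print(result.lineage)
```

Because each stage appends to the lineage list instead of overwriting it, the final dataset carries a record of every source and transformation behind it. That is the same idea, in miniature, as the end-to-end lineage described above.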

These four simple steps can help you be successful in your AI/ML projects. To learn more – and to see how Informatica and Databricks can solve a fraud detection use case – register for our webinar.


First Published: Jun 06, 2019