Data Discovery and Lineage Simplified for Cloud Analytics with Informatica and Databricks

Dec 08, 2019
Dharma Kuthanur

Vice President, Product Marketing


We all know the old maxim “garbage in, garbage out.” That is as true for the most sophisticated AI and machine-learning models as it is for basic reporting. And it points to a deeper truth that gets lost in all the hype around AI-powered analytics and applications: the hardest part of AI isn’t AI, it’s data management at scale and the infrastructure required to support it. That means infrastructure to ingest high volumes of hybrid data at high speed, build data pipelines that process that data at scale, and provide data discovery and lineage across the enterprise.

Informatica and Databricks have partnered to provide an end-to-end solution to address this challenge. This solution helps organizations ingest data, build and process data pipelines, and prepare data for analytics and machine learning, allowing them to accelerate the development of data pipelines for AI/ML projects and deliver faster time-to-value. To fully deliver on this promise, you need the ability to find and use the right data at the right time, and have trust and confidence in that data. And you need this kind of visibility into the data at every stage of the data pipeline. That’s why an intelligent data catalog like Informatica’s Enterprise Data Catalog (EDC) is an essential, foundational part of the Informatica-Databricks solution. Here’s the reference architecture of the Informatica-Databricks solution with Informatica EDC for data cataloging, discovery and lineage:

Discover, understand, and curate data across your data engineering pipeline

Now let’s take a deeper look at how Informatica’s Enterprise Data Catalog provides you with data discovery and lineage across the entire data engineering pipeline. There’s been an explosion in the volume and variety of data that needs to be ingested, processed, and analyzed for business insights. You need broad metadata connectivity to any type of data source, so you can scan and catalog all of this data for easy discovery through a simple Google-like search.

The next step is understanding and trusting the data. With the scale and complexity of the modern data environment, it’s impossible to do this manually. You need AI/ML-powered data curation that automatically identifies domains (e.g., names, phone numbers) and entities (e.g., customers, sales orders), learns from user tagging of data fields, identifies similar data, and associates business terms and definitions with physical datasets to add rich business context to the data. This automated data curation and enrichment gives you a huge head start in turning your data into valuable business insights.
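
To make the idea of automated domain identification concrete, here is a minimal sketch in Python. It is not how EDC works internally; it simply tags a column with a domain when most of its values match a pattern. The patterns, the `infer_domain` function, the threshold, and the sample columns are all hypothetical and chosen for illustration only.

```python
import re

# Hypothetical patterns for two illustrative domains.
# A real catalog would rely on far richer rules and ML models.
DOMAIN_PATTERNS = {
    "phone_number": re.compile(r"^\+?[\d\s().-]{7,20}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def infer_domain(values, threshold=0.75):
    """Return the first domain whose pattern matches at least
    `threshold` of the non-empty values, or None if none qualifies."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    for domain, pattern in DOMAIN_PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.match(v))
        if matches / len(non_empty) >= threshold:
            return domain
    return None

# Hypothetical sample columns, as a metadata scan might sample them.
columns = {
    "contact": ["+1 650-555-0100", "(415) 555-0199", "650.555.0123"],
    "mail": ["ana@example.com", "bo@example.org", "cy@example.net", "n/a"],
    "notes": ["call back", "sent invoice", ""],
}
tags = {name: infer_domain(vals) for name, vals in columns.items()}
```

The threshold lets a column be tagged even when a few values are dirty, which is the kind of tolerance any automated curation needs on real data.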

You need to have trust and confidence in the data that you use to derive business insights. That requires a thorough understanding of data lineage, i.e., understanding where your data is coming from and how it gets transformed at every step of the data pipeline. Again, broad metadata connectivity and AI are essential to providing business-friendly summary views and detailed views of the lineage. This is critical for addressing compliance, auditing, and data governance requirements. The graphic below illustrates how Informatica EDC provides end-to-end data lineage for the entire data engineering pipeline all the way from source systems to Informatica Data Engineering Integration to tables in Delta Lake:
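
Separately from the graphic, the core idea of lineage can be sketched as a walk over a dependency graph. The snippet below is an illustrative assumption, not EDC's implementation: the dataset names, the `UPSTREAM` mapping, and both functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it is derived from.
UPSTREAM = {
    "delta.sales_summary": ["pipeline.cleaned_orders"],
    "pipeline.cleaned_orders": ["crm.customers", "erp.orders"],
    "crm.customers": [],
    "erp.orders": [],
}

def trace_upstream(dataset, upstream=UPSTREAM):
    """Return every dataset that `dataset` depends on, nearest first (breadth-first)."""
    seen, order = {dataset}, []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

def source_systems(dataset, upstream=UPSTREAM):
    """Return the upstream datasets that have no parents of their own."""
    return [d for d in trace_upstream(dataset, upstream) if not upstream.get(d)]

lineage = trace_upstream("delta.sales_summary")
sources = source_systems("delta.sales_summary")
```

Here `lineage` plays the role of a detailed view (every hop back from the target table), while `sources` is closer to a summary view that shows only the originating systems.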

As you leverage these powerful AI-driven capabilities, you should not forget the importance of the data knowledge that is distributed across your organization. Collaboration and social curation capabilities, such as the ability to certify datasets, provide ratings and reviews, and automate change notifications on datasets of interest, allow you to combine the best of AI and human expertise to fully understand your data. Finally, you also need to understand the quality of the data by viewing profiling statistics, data quality rules, and scorecards. With the combined power of all of these capabilities, Informatica EDC enables you to comprehensively discover and understand data across your entire data pipeline.
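
As a rough illustration of what profiling statistics and a simple quality score look like, here is a minimal Python sketch. It is an assumption for illustration only and does not reflect how EDC computes profiles or scorecards; the function names and the sample column are hypothetical.

```python
def profile_column(values):
    """Basic profiling statistics for one column of raw values."""
    total = len(values)
    non_null = [v for v in values if v not in (None, "")]
    return {
        "row_count": total,
        "null_count": total - len(non_null),
        "distinct_count": len(set(non_null)),
    }

def completeness_score(values):
    """Share of non-empty values, as a percentage rounded to one decimal."""
    stats = profile_column(values)
    if stats["row_count"] == 0:
        return 0.0
    filled = stats["row_count"] - stats["null_count"]
    return round(100 * filled / stats["row_count"], 1)

# Hypothetical sample column with two missing values.
country = ["US", "DE", "", "US", None]
stats = profile_column(country)
score = completeness_score(country)
```

A scorecard is essentially a set of such scores tracked per dataset, so a reader can judge at a glance whether a column is complete enough to use.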

To learn more about this and view a demo of how Informatica EDC enables data discovery and lineage as part of the Informatica-Databricks solution, watch the webinar: Data Discovery and Lineage Simplified for Cloud Analytics.