The Importance of Provenance and Data Lineage
The connection between good data and good food
This is a blog post about data lineage, but I am going to start by talking about food. We are what we eat, or at least that's what the purveyors of fine foods would have us believe. As consumers, how do we decide what's good for us to eat? First, we scan the label for key facts such as calorie content, nutritional content, organic versus conventionally grown, and processed versus natural. If all of that looks good, we carefully scan the list of ingredients to see what goes into the product and check whether it meets our requirements and expectations. And if we are really picky, we want to know where the ingredients come from and who supplies them. High-end grocery stores, like Whole Foods, make this process interesting and colorful by almost telling a story about what's on their shelves: the farmers cultivating organic brown rice, the happy chickens that lay their eggs, and the serene pastures on which their animals graze.
Lineage for Tracking and Understanding Problems
What about the producers and brands who are selling these food products? How do they ensure these products meet their quality requirements and the requirements of their brand? That again is inextricably dependent on what goes into these products. If anything goes awry, like a case of food poisoning involving processed meat, the food producers need to have a way to quickly track and understand the source of the problem and what needs to be done to fix it. The timely FDA alerts we get in such cases depend on having reliable and detailed information on what goes into these products, where they came from, and how they were processed. The same goes for reducing costs, streamlining their supply chain, or improving product quality.
You will start noticing a common theme. The provenance of things, or lineage, matters—for the food we eat, the electronic gadgets we use, or the clothes we wear. And it matters for the data that powers our analytics and business decisions. That is becoming more and more critical as there is more data, more types of data, and more distributed data across cloud and on-premises in the modern data environment. At the same time, there is more business pressure to get timely access to relevant data, have trust and confidence in the data, and have policies and processes in place to govern appropriate use of the data and ensure compliance with external regulations. Having an end-to-end view of lineage is emerging as a critical foundational requirement to support all data-driven business initiatives.
The Importance of Data Lineage Across All Data-Driven Business Priorities
For AI and analytics, data lineage helps analysts and data scientists develop a better understanding of the data and drive business insights based on trusted data.
Data is fluid (as it should be), and as data moves across the organization, data governance should ensure consistent and appropriate governance policies are applied to the data. Data lineage enables this by helping clarify availability, ownership, security, and quality of the data as it flows across the organization.
Regulatory compliance requires more than just producing static reports. Regulations also mandate implementation of data lineage to demonstrate where the data originated, trace its journey through the systems in the organization, and show how it changed along the way. Data lineage also helps organizations take a proactive approach to identifying and fixing gaps in the required data.
On the data security front, collection of sensitive data exposes organizations to regulatory and business liabilities. Data lineage helps manage this by tracking and identifying risks in the data flows and checking if the appropriate controls are in place.
For IT operations, data lineage helps understand the impact of data changes on downstream analytics and applications, understand the risk of change to business processes, and take a more proactive approach to change management. It also helps drive operational efficiency and cost reduction by eliminating duplicate data and data silos.
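To make the IT operations point concrete, here is a minimal Python sketch of downstream impact analysis. It assumes lineage has already been captured as simple (upstream, downstream) pairs; the asset names and the `downstream_impact` helper are hypothetical illustrations, not part of any particular product.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream asset, downstream asset).
EDGES = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "warehouse.dim_customer"),
    ("erp.orders", "warehouse.fact_orders"),
    ("warehouse.dim_customer", "bi.churn_dashboard"),
    ("warehouse.fact_orders", "bi.revenue_report"),
    ("warehouse.dim_customer", "bi.revenue_report"),
]


def build_graph(edges):
    """Map each asset to the list of assets that read from it directly."""
    graph = defaultdict(list)
    for upstream, downstream in edges:
        graph[upstream].append(downstream)
    return graph


def downstream_impact(graph, changed):
    """Return every asset reachable from `changed`, via breadth-first search."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for asset in downstream_impact(graph, "crm.customers"):
        print(asset)
```

With these sample edges, a change to `crm.customers` touches the staging table, the customer dimension, and both BI assets, while a change to `erp.orders` touches only the orders fact table and the revenue report. Running the same traversal over reversed edges would give the upstream trace that compliance reporting asks for.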
How to Overcome the Challenges of Deriving End-to-End Data Lineage
In a complex modern data environment, understanding end-to-end data lineage is not a trivial task. It requires metadata connectivity across the entire data landscape—across cloud and on-premises databases, ETL tools, BI tools, and enterprise applications. It requires the ability to automatically stitch together lineage from all of these sources including the ability to extract and infer lineage from the metadata. Data lineage will often have to be automatically derived from different types of code—ETL jobs, SQL scripts and stored procedures—to understand how data gets transformed in each step. Lineage views have to be presented at different levels ranging from business and logical views to detailed field-level views with the ability to drill down into transformation logic. When there is no direct ability to extract lineage, it requires the ability to indirectly infer lineage through AI/ML-powered intelligent capabilities like data similarity and data relationship discovery. Informatica’s AI-powered Enterprise Data Catalog delivers automated, granular end-to-end data lineage across cloud and on-premises with these capabilities.
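As a rough illustration of inferring lineage from data similarity, the Python sketch below scores pairs of columns by how much their distinct values overlap (Jaccard similarity) and flags high-scoring pairs as candidate links. This is a simplified stand-in chosen for illustration, not a description of how Enterprise Data Catalog works; the column names, sample values, `candidate_links` helper, and 0.8 threshold are all assumptions.

```python
def jaccard(values_a, values_b):
    """Jaccard similarity of the distinct values in two columns (0.0 to 1.0)."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def candidate_links(columns, threshold=0.8):
    """Return (column, column, score) for pairs whose overlap meets the threshold.

    `columns` maps a hypothetical fully qualified column name to sample values.
    """
    names = sorted(columns)
    links = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = jaccard(columns[left], columns[right])
            if score >= threshold:
                links.append((left, right, round(score, 2)))
    return links


if __name__ == "__main__":
    # Hypothetical sample values profiled from three columns.
    samples = {
        "crm.customers.email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
        "warehouse.dim_customer.email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],
        "erp.orders.sku": ["SKU-1", "SKU-2", "SKU-3"],
    }
    for left, right, score in candidate_links(samples):
        print(left, right, score)
```

Value overlap only suggests that two columns are related. It does not say which one is the source, so in practice a candidate link like this would still need confirmation from metadata, load timestamps, or a data steward before it is treated as lineage.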
Learn how Enterprise Data Catalog can help you discover and inventory data across your organization.
Hear from our enterprise customer Rabobank about how data lineage helps them address different business use cases. Watch the video here.