What Is Data Lineage?

Data lineage is essentially the provenance for data: an ongoing and continuously updated record of where data originates, how it moves through the organization, how it gets transformed, where it’s stored, who accesses it, and other key metadata.
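As a rough illustration of what such a record might hold, here is a minimal Python sketch. The `LineageEvent` class, its field names, and the `provenance` helper are hypothetical and not taken from any particular product; the sketch also assumes each dataset has exactly one upstream source, which real lineage rarely guarantees.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    """One hop in a dataset's history (all field names are illustrative)."""
    source: str          # where the data came from
    target: str          # where the data was stored
    transformation: str  # what was done to it on the way
    accessed_by: str     # who or what performed the step
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def provenance(events, dataset):
    """Walk backwards from `dataset` and return its hops, origin first.

    Assumes a single upstream source per target (a simplification).
    """
    by_target = {e.target: e for e in events}
    path, seen, current = [], set(), dataset
    while current in by_target and current not in seen:
        seen.add(current)  # guard against accidental cycles
        event = by_target[current]
        path.append(event)
        current = event.source
    return list(reversed(path))


events = [
    LineageEvent("crm.contacts", "staging.contacts", "copy", "etl_service"),
    LineageEvent("staging.contacts", "warehouse.customers", "deduplicate", "etl_service"),
]
trail = provenance(events, "warehouse.customers")
origin = trail[0].source  # the system where the data originated
```

Keeping each hop as an immutable record makes the history append-only, which suits an ongoing, continuously updated log.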

Data lineage answers the question, “Where is this data coming from and where is it going?” It is typically presented as a visual representation of data flow that tracks data from its origin to its destination. It explains the different processes involved in the data flow and their dependencies. Metadata management is critical to capturing enterprise data flow and presenting data lineage.

What Questions Does Data Lineage Answer?

Data lineage clarifies the availability, ownership, security, and quality of data as it flows across the organization. This ensures that you can generate trustworthy answers to questions like these:

  • What data sources should we use to develop new customer experience initiatives?
  • What data in my enterprise needs to be brought into compliance with industry regulations?
  • Where do we have risk exposure that needs to be mitigated?
  • How does data quality change across multiple lineage hops?
  • What data should we migrate to a cloud data lake, and how will this move affect which users?
  • How can data scientists improve confidence in the data they need for advanced analytics?


Why Is Data Lineage Important?

Data lineage is essential to data governance, regulatory compliance, data quality, data analytics, data science, and data privacy and security. The ability to map and verify how data has been accessed and changed is key to generating a detailed record of where specific data originated, how it was changed, and how it is used. This is as valuable for finding and fixing gaps in necessary data as it is for responding to reporting requirements and audit requests for regulatory compliance. It also increases security by tracking and identifying risks in data flows and indicating whether appropriate controls are in place.

Data lineage helps organizations take a proactive approach to identifying and fixing gaps in the required data. On the data security front, collecting sensitive data exposes organizations to regulatory and business liabilities. Data lineage helps manage this exposure by showing how sensitive data flows throughout your organization so you can ensure that you have proper controls in place.

For IT operations, data lineage helps understand the impact of data changes on downstream analytics and applications, understand the risk of change to business processes, and take a more proactive approach to change management. It also helps drive operational efficiency and cost reduction by eliminating duplicate data and data silos.
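Impact analysis of this kind can be pictured as a traversal of a lineage graph. The sketch below is only an illustration: the asset names and the `DOWNSTREAM` mapping are made up, and a real catalog would hold far richer metadata than a plain dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
DOWNSTREAM = {
    "orders_db.orders": ["staging.orders"],
    "staging.orders": ["warehouse.fact_sales"],
    "warehouse.fact_sales": ["bi.revenue_dashboard", "ml.churn_features"],
    "crm.contacts": ["warehouse.dim_customer"],
}


def impacted_assets(changed, edges):
    """Return every asset downstream of `changed`, in breadth-first order."""
    seen = {changed}
    queue = deque([changed])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:  # visit each asset once
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


result = impacted_assets("staging.orders", DOWNSTREAM)
print(result)
```

Here a change to `staging.orders` reaches the sales fact table and both of its consumers, while the unrelated customer dimension is left out.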

In addition, data lineage helps with successful cloud data migrations and modernization initiatives that drive transformation by visualizing how different data objects and data flows are related and connected. This deeper understanding makes it easier for data architects to predict how moving or changing data will affect the data itself as well as the downstream processes and applications that depend on it, and also to validate the changes.

Data Lineage Best Practices

Here are a few things to consider when planning and implementing data lineage initiatives.

Automate data lineage extraction

Many organizations today still capture lineage manually in Microsoft Excel files. That practice is poorly suited to a dynamic, agile environment in which data and data lineage are always changing.

Include the source of metadata in data lineage

ETL software, BI tools, relational database management systems and modeling tools, enterprise applications, and custom applications all create their own data about your data. This metadata is key to understanding where your data has been and how it has been used.

Involve owners of metadata sources in verifying data lineage

The owners of the tools and applications that create metadata about your data know better than anyone else how timely, accurate, and relevant the metadata is.

Plan progressive extraction of the metadata and data lineage

Trace the path data takes through your systems and extract the metadata and data lineage from each of those systems in order. This makes it easier to map out the connections, relationships, and dependencies among systems and within the data.
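One way to derive such an extraction order is a topological sort over the systems. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the system names and the `READS_FROM` mapping are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical systems, each mapped to the systems it reads data from.
READS_FROM = {
    "crm_app": [],
    "etl_tool": ["crm_app"],
    "warehouse": ["etl_tool"],
    "bi_tool": ["warehouse"],
}


def extraction_order(reads_from):
    """List systems so that every source comes before its consumers."""
    return list(TopologicalSorter(reads_from).static_order())


order = extraction_order(READS_FROM)
print(order)
```

If the systems formed a cycle, `static_order()` would raise `graphlib.CycleError`, which is itself a useful signal that the assumed data path needs a closer look.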

Validate end-to-end lineage progressively

Start by validating high-level connections between systems. Then drill down into the connected datasets, followed by data elements, before finally validating the transformation-level documentation.
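The coarse-to-fine order described here could be sketched as follows. The level names mirror the paragraph above, but the `validate_progressively` function and the sample checks are assumptions made for illustration only.

```python
# Validation levels, from coarsest to finest, as described in the text.
LEVELS = ["system", "dataset", "element", "transformation"]


def validate_progressively(checks):
    """Validate level by level and stop at the first level with failures.

    `checks` maps a level name to a list of (description, passed) pairs.
    Returns (levels fully validated, failures at the first failing level).
    """
    validated = []
    for level in LEVELS:
        failures = [desc for desc, ok in checks.get(level, []) if not ok]
        if failures:
            return validated, failures  # no point drilling deeper yet
        validated.append(level)
    return validated, []


checks = {
    "system": [("crm_app -> warehouse link exists", True)],
    "dataset": [("warehouse.customers maps to crm.contacts", True)],
    "element": [("customers.email maps to contacts.email", False)],
    "transformation": [("email is lowercased in the ETL step", True)],
}
validated, failures = validate_progressively(checks)
```

Stopping at the first failing level reflects the idea that fine-grained documentation is not worth reviewing until the coarser connections it depends on are confirmed.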

Use an enterprise-class data catalog

For granular, end-to-end lineage across cloud and on-premises, use an intelligent, automated, enterprise-class data catalog. AI and ML capabilities enable an enterprise-class data catalog to automatically stitch together lineage from all your enterprise sources, including the ability to extract and infer lineage from the metadata.


4 Data Lineage Techniques to Start Using Now

Ensure you have a breadth of metadata connectivity

For end-to-end data lineage, you need to be able to scan all your enterprise data sources across multi-cloud and on-premises environments, from legacy and mainframe systems to custom-coded enterprise applications. AI-powered discovery capabilities can streamline the process of identifying connected systems using metadata from ETL software and describing lineage from custom applications that don’t allow direct access to metadata.

Take advantage of AI and machine learning

AI and machine learning (ML) capabilities can infer data lineage when it’s impractical or impossible to capture it by other means. The underlying assumption is that similar data tends to have similar lineage. When there is no direct way to extract data lineage (for example, when data is moved manually through FTP or using code), AI-powered data similarity discovery enables you to “infer” data lineage by finding like datasets across sources. AI and ML capabilities also enable data relationship discovery, which is essential for impact analysis.
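To give a feel for similarity-based inference, the sketch below compares column values with Jaccard similarity. This is a deliberately simple stand-in rather than how any specific AI-powered product works; the column names, sample values, and the 0.7 threshold are all assumptions.

```python
def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def infer_lineage_candidates(columns, threshold=0.7):
    """Return column pairs whose values overlap enough to suggest a link."""
    names = sorted(columns)
    candidates = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = jaccard(columns[left], columns[right])
            if score >= threshold:
                candidates.append((left, right, round(score, 2)))
    return candidates


# Hypothetical column samples; the second was delivered by a manual FTP drop.
columns = {
    "crm.contacts.email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
    "ftp_drop.leads.email_addr": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],
    "hr.staff.email": ["y@corp.com", "z@corp.com"],
}
candidates = infer_lineage_candidates(columns)
print(candidates)
```

A high score only flags a candidate relationship. It says nothing about which dataset is upstream, so inferred links like these would still need review by the data owners.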

AI-powered data lineage capabilities can help you understand not only data flow relationships, but also “control” relationships, such as joins and logical-to-physical models. For example, deleting a column that is used in a join can impact a report that depends on that join. An AI-powered solution that infers joins can provide end-to-end data lineage that enables a more complete impact analysis, even when these relationships are not documented.
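The join example can be made concrete with a small sketch that labels each lineage edge as either a data flow or a control relationship. The edge list, the `join:` naming convention, and the `impacted` function are hypothetical.

```python
# Hypothetical lineage edges: (source, target, kind).
# "flow" edges carry data; "control" edges shape it (for example, join keys).
EDGES = [
    ("orders.customer_id", "join:orders_customers", "control"),
    ("customers.id", "join:orders_customers", "control"),
    ("join:orders_customers", "report.revenue_by_region", "control"),
    ("orders.amount", "report.revenue_by_region", "flow"),
    ("customers.region", "report.revenue_by_region", "flow"),
]


def impacted(changed, edges, include_control=True):
    """Return assets affected by changing `changed`, sorted by name."""
    kinds = {"flow", "control"} if include_control else {"flow"}
    frontier, seen = [changed], set()
    while frontier:
        node = frontier.pop()
        for src, dst, kind in edges:
            if src == node and kind in kinds and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return sorted(seen)


flow_only = impacted("orders.customer_id", EDGES, include_control=False)
with_control = impacted("orders.customer_id", EDGES)
```

With flow edges alone, deleting `orders.customer_id` appears harmless because the column never lands in the report. Once the join is included, the report shows up as impacted.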

Extract deep metadata and lineage from complex data sources

Enterprises are often challenged to gain end-to-end visibility into data lineage across a complex data landscape that includes hundreds of data sources—from on-premises databases, data warehouses, and mainframe systems to SaaS applications and multi-cloud environments. For comprehensive data lineage, you should use an AI-powered solution that automatically extracts the most granular metadata from a wide array of complex enterprise systems, including ETL software, SQL scripts, programming languages, code from stored procedures, and applications that are considered “black boxes.”

Provide different capabilities to different users

Giving your business and technical users the right type and level of detail about their data helps them understand it and feel more confident in it. In addition to increasing their productivity, this improves collaboration by linking business views of data with the underlying logical and detailed technical information.


Data Lineage Customer Success Stories

Insurance firm AIA Singapore needed to provide users across the enterprise with a single, clear understanding of customer information and other business data. Data lineage helps them discover and understand data in context and keep its quality high to optimize sales, drive decision-making, and control costs.

An industry-leading auto manufacturer implemented a data catalog to track data lineage for greater flexibility and agility in reacting to market disruptions and opportunities. As a result, it’s easier for product and marketing managers to find relevant data on market trends and trust the results of their self-service reporting – thus reaching actionable insights 70% faster.

Data Lineage: Catalyst for Digital Transformation

Informatica’s AI-powered data lineage solution includes a data catalog with advanced scanning and discovery capabilities to ensure you capture all the relevant metadata about all of your data from all of your data sources and provide detailed, end-to-end data lineage across cloud and on-premises. Find out more about why data lineage is critical and how to use it to drive growth and transformation with our eBook, “AI-Powered Data Lineage: The New Business Imperative.”


Additional Data Lineage Resources

Blog: The Importance of Provenance and Lineage

Video: Automated End-to-End Data Lineage for Compliance at Rabobank