Harness Automated Inferred Data Lineage to Accelerate Responsible AI Outcomes

Last Published: Dec 02, 2024
Mark Kettles

Senior Product Marketing Manager

Data & AI Governance and Privacy

As part of its July 2024 product launch, Informatica announces the debut of its inferred data lineage capability. AI-powered data lineage is at the forefront of the Cloud Data Governance and Catalog (CDGC) offering, which allows users to track and view data lineage from origin to consumption across even the most fragmented and complex data landscapes.

In the modern enterprise landscape, typified by multi-cloud environments and stringent regulatory requirements, managing the lifecycle and integrity of data is pivotal to keeping pace in the new age of AI systems. AI-powered data lineage tools can enable you to maintain high data quality and security standards, reduce data redundancies and streamline operations. These tools can ensure compliance with expanding global regulations on data and AI and strengthen your data governance framework.

AI-Powered Data Lineage for the Digital Age

Data originates from heterogeneous sources scattered across the organization, including on-premises systems, hybrid IT and multi-cloud environments. Enterprises might be dealing with hundreds or even thousands of such data sources, adding up to tens of millions of data objects. Informatica’s 2024 CDO Insights survey found that 41% of respondents already struggle with 1,000 or more sources, and 79% expect that number to increase in 2024. At this volume and scale, manually tracing data’s route through system infrastructure is effectively impossible.

With the advent of AI-powered, automated data scanning, organizations can now track data movement and sharing with unparalleled precision. This technological leap allows for the extraction of metadata, offering a clearer picture of the intricate relationships within your data ecosystem.

Investing in cutting-edge AI technologies requires more than advanced tools; it demands enterprise-wide visibility and intelligent data governance capabilities. Today, businesses stand at the forefront of innovation, equipped with the means to unlock unprecedented value from their data assets.

How AI-Powered Data Lineage Provides Business Value 

Intelligent Automation: AI helps to automatically discover and visualize data flows between systems, reducing manual effort and increasing accuracy. End-to-end lineage is derived by automatically scanning and extracting metadata from a variety of data sources, such as cloud platforms, BI tools, databases, multi-vendor ETL and data science tools, enterprise applications, file formats, SQL dialects and stored procedures. This helps to provide comprehensive visibility of data throughout its journey.

Scalability: AI can handle and analyze large volumes of data and complex data lineages that would be impractical to manage manually. AI-powered data lineage can automatically scale up or down the resources needed to process the data based on the load, ensuring efficient data lineage tracking without manual intervention.

Quality: AI models can predict and identify potential data quality issues or inefficiencies in data pipelines. Data lineage can visualize the end-to-end path data takes from its origin and evaluate data accuracy and quality — leading to better data intelligence and insights.

Enhanced Analysis: AI offers capabilities like semantic recognition and anomaly detection, which can discern patterns and inconsistencies in data usage and flow. Interactively trace data flow through data lineage views at any level, from business-friendly to system-level views.

Compliance: Understand where sensitive data, such as personally identifiable information (PII) and intellectual property (IP), resides to help mitigate risk exposure and avoid fines and remediation penalties. Additionally, use automated, granular data lineage to support transparency and reporting for regulatory compliance mandates, such as BCBS 239, GDPR and the EU AI Act. Organizations can also extract deep metadata from complex enterprise systems and parse code in stored procedures to create comprehensive audit trails quickly, for faster reporting during audits and inquiries.
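To make the compliance point concrete, the sketch below shows one way lineage can help locate sensitive data: a PII tag on a source column is pushed along column-level lineage edges to every downstream column. This is a minimal illustration, not Informatica's implementation. The column names, the `lineage` mapping and both helper functions are hypothetical.

```python
from collections import deque

# Hypothetical column-level lineage: source column -> columns derived from it.
lineage = {
    "crm.customers.email": ["staging.customers.email"],
    "staging.customers.email": [
        "warehouse.dim_customer.email_hash",
        "marketing.campaign_list.email",
    ],
    "crm.customers.signup_date": ["warehouse.dim_customer.signup_date"],
}

# Hypothetical starting point: columns a steward has already tagged as PII.
pii_columns = {"crm.customers.email"}


def propagate_pii(lineage, tagged):
    """Return the tagged columns plus every column downstream of them."""
    flagged = set(tagged)
    queue = deque(tagged)
    while queue:
        column = queue.popleft()
        for target in lineage.get(column, []):
            if target not in flagged:
                flagged.add(target)
                queue.append(target)
    return flagged


def systems_holding_pii(flagged):
    """Group flagged columns by the system prefix before the first dot."""
    return sorted({column.split(".", 1)[0] for column in flagged})


flagged = propagate_pii(lineage, pii_columns)
print(systems_holding_pii(flagged))
```

The propagation is deliberately conservative: a derived column such as a hash may no longer count as PII, so the flagged set is best treated as a review list for a data steward rather than a final classification.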

Inferred Data Lineage Drives Greater AI Data Readiness 

Due to technological limitations or security constraints, complete lineage may not be visible after metadata extraction in many enterprises. This gap can be bridged with inferred data lineage, in which data flows and relationships are analyzed to make educated deductions about how data moves through processes, transformations and storage locations. For example, if a data pipeline extracts data from a source database, performs some transformations and then loads it into a target data warehouse, the inferred data lineage would show the flow from source to target, even if there isn’t explicit documentation for each step.

The system infers the lineage based on the observed patterns and dependencies, filling in gaps where documentation might be missing and providing a more complete picture of data movement across an organization. This is critical to ensure that your data is AI-ready.
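The extract, transform and load example above can be sketched in a few lines. The code assumes that the only evidence available is a log of which datasets each job read and wrote, then derives lineage edges from that log and recovers the source-to-target path. The job names, dataset names and both functions are hypothetical, and this is only an illustration of inference from observed dependencies, not a description of how CDGC works internally.

```python
from collections import defaultdict

# Hypothetical observations: datasets each pipeline job was seen to read and write.
observed_runs = [
    {"job": "extract_orders", "reads": ["crm_db.orders"], "writes": ["staging.orders_raw"]},
    {"job": "clean_orders", "reads": ["staging.orders_raw"], "writes": ["staging.orders_clean"]},
    {"job": "load_orders", "reads": ["staging.orders_clean"], "writes": ["warehouse.fact_orders"]},
]


def infer_edges(runs):
    """Infer a lineage edge from every dataset a job read to every dataset it wrote."""
    edges = defaultdict(set)
    for run in runs:
        for source in run["reads"]:
            for target in run["writes"]:
                edges[source].add(target)
    return edges


def find_path(edges, source, target, _seen=None):
    """Depth-first search for one inferred path from source to target, or None."""
    seen = _seen if _seen is not None else set()
    if source == target:
        return [source]
    seen.add(source)
    for nxt in sorted(edges.get(source, ())):
        if nxt not in seen:
            rest = find_path(edges, nxt, target, seen)
            if rest:
                return [source] + rest
    return None


edges = infer_edges(observed_runs)
path = find_path(edges, "crm_db.orders", "warehouse.fact_orders")
print(" -> ".join(path))
```

No step in the log documents the link between `crm_db.orders` and `warehouse.fact_orders` directly. The end-to-end flow emerges only from chaining the observed dependencies.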

Inferred data lineage plays an important role in managing AI governance responsibly by contributing to:

  1. Transparency:
    • Showing how data flows through all the processes, models and algorithms across the enterprise’s network.
    • Ensuring that key stakeholders like data stewards, auditors and regulators understand the origin of data used in AI systems.
  2. Model Explainability:
    • Providing context for AI model inputs, building trust and reliability for AI governance.
    • Understanding data sources for improving model interpretability, identifying critical features and their impact on model predictions.
  3. Bias Detection and Mitigation:
    • Helping to identify biases, such as discriminatory patterns, by tracing data lineage.
    • Supporting fairness by adjusting models based on lineage insights to reduce identified biases.
  4. Data Privacy and Security:
    • Highlighting data movement across systems and providing insights for access controls.
    • Helping to protect sensitive information and comply with privacy regulations; for example, ensuring sensitive data like PII is handled securely.
  5. Accountability:
    • Holding stakeholders accountable for data handling, as regulators and internal auditors rely on lineage to verify compliance for reporting.
    • Aiding data stewards in responsibly managing data with compliance checks.
  6. Risk Assessment:
    • Helping to evaluate risks associated with AI models, data sources and transformations.
    • Assessing the potential harm of using AI models.

Informatica Cloud Data Governance and Catalog Inferred Data Lineage 

AI-powered data lineage as part of Informatica Cloud Data Governance and Catalog (CDGC) enhances the process of tracking data as it flows through various systems and transformations in an IT ecosystem, integrating with existing data management environments to provide deeper insights and a more granular level of control over data assets. Thus, modern enterprises can understand the origin, movement, characteristics and quality of data in their systems more completely, transforming its value to the business.

Capabilities include:

  • Supporting the identification of, and bridging gaps in, existing lineage with a rule-based interpretation of data flow.
  • Accelerating lineage completion to improve transparency into the data sources behind AI, supporting responsible AI management.
  • Automatically detecting dataflow between user-provided source and target at the dataset and data element level.
  • Allowing object filtering for more precise matching.
  • Allowing user curation of generated dataflows.
  • Making accepted dataflows visible as lineage in the CDGC application.

As shown in Figure 1, Metadata Command Center can link catalog sources and construct data lineage based on object name matching or user-defined rules. Source and target catalog sources can be selected to link and create data lineage, and source and target schemas can be chosen to restrict lineage inference to specific subsets of data objects within the data sources, as illustrated in Figure 2. Data stewards can manage the automatic acceptance of linked lineage assets directly.
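The general idea of name-based linking with schema filtering and steward curation can be illustrated with a short sketch. It is not the CDGC API: the `Column` type, the `normalize` rule, the `propose_links` function and the sample objects are all hypothetical, and the product's actual matching rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    schema: str
    table: str
    name: str


def normalize(identifier: str) -> str:
    # Hypothetical matching rule: ignore case, surrounding spaces and hyphens.
    return identifier.strip().lower().replace("-", "_")


def propose_links(sources, targets, source_schemas=None, target_schemas=None):
    """Propose source-to-target links where table and column names match.

    Passing a set of schema names restricts matching to those schemas.
    """
    def in_scope(col, schemas):
        return schemas is None or col.schema in schemas

    # Index in-scope target columns by their normalized (table, column) name.
    by_name = {}
    for tgt in targets:
        if in_scope(tgt, target_schemas):
            key = (normalize(tgt.table), normalize(tgt.name))
            by_name.setdefault(key, []).append(tgt)

    links = []
    for src in sources:
        if in_scope(src, source_schemas):
            key = (normalize(src.table), normalize(src.name))
            for tgt in by_name.get(key, []):
                links.append({"source": src, "target": tgt, "status": "proposed"})
    return links


sources = [
    Column("sales", "Orders", "order_id"),
    Column("sales", "Orders", "amount"),
    Column("hr", "Employees", "ssn"),
]
targets = [
    Column("dw", "orders", "ORDER_ID"),
    Column("dw", "orders", "amount"),
    Column("dw", "employees", "ssn"),
    Column("archive", "orders", "order_id"),
]

links = propose_links(sources, targets, source_schemas={"sales"}, target_schemas={"dw"})

# Curation step: a steward accepts one proposed link; only accepted links
# would then be published as lineage.
for link in links:
    if link["source"].name == "order_id":
        link["status"] = "accepted"
accepted = [link for link in links if link["status"] == "accepted"]
```

Restricting the schemas keeps the `hr` and `archive` objects out of the candidate set, which is the kind of precision the schema selection described above is meant to provide. Keeping a `status` on each link separates what was inferred from what a steward has confirmed.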

Figure 1: Identifying and bridging gaps in existing lineage with name-based linking of data flow.

Figure 2: Configuring data lineage linking in Informatica Cloud Data Governance and Catalog.

Key Benefits

  • Simplify and accelerate time to value and completeness for trusted data flows through end-to-end visibility, strengthening responsible AI governance with CLAIRE® copilot automation.
  • Accelerate GenAI adoption by providing explainability and improving data reliability, reducing the risk of user error in manually built custom lineage.
  • Improve data steward productivity by reducing manual overhead, linking data directly to its source to determine relevance, and eliminating the time-consuming process of building custom lineage by hand.

Summary

Informatica’s inferred data lineage as part of CDGC can help organizations deliver the data that builds confidence in their analytics and AI models, improves customer experience programs, helps ensure regulatory compliance with industry policies, accelerates cloud modernization initiatives and much more.

Making strategic decisions about digital transformation while remaining compliant with an array of regulations, including new AI controls, requires trustworthy data. Investing in an intelligent, enterprise-scale data catalog that is designed for multi-cloud and on-premises environments and encompasses these capabilities can empower organizations to succeed in the new age of AI systems.

Business users can enhance governance and privacy, deepen data analytics, transition to the cloud and augment the customer experience with greater ease and assurance. Concurrently, your IT teams and data analysts can refine change management, improve operational efficiency, reinforce data security and enhance responsible AI governance.

Read about how Informatica Intelligent Data Management Cloud (IDMC) can support AI data readiness with greater simplicity and productivity innovations as part of the July 2024 product release.

First Published: Sep 05, 2024