Simplifying Data Lake Complexity Issues with Cloud Data Governance and Catalog

Last Published: Aug 16, 2023

Arun G Krishnan

Technical Marketing Consultant

The world is moving to a data-driven digital economy, increasing the need for agile, data-driven decision making. But the advent of data lakes and lake houses to help store large amounts of data has made it harder to get the data you need. More data in more places creates complexity and makes it challenging to find, understand, trust and extract value from your data. A lack of common business terms and availability of up-to-date data to make critical decisions adds to the difficulty. These challenges negatively affect the data community — the data engineers, data scientists, data stewards, business stewards, and other key personas — who need access to governed data to do their day-to-day job.

And it’s not only data access that’s become a challenge. Governing the variety and volume of data has become an uphill task. And because data lives everywhere, you cannot apply data governance in isolation, or only to portions of the organization.

Data Governance Challenges with Data Lake

Data lakes are very flexible and support most analytical, data science and AI/ML use cases. They also complicate data governance as data pipelines grow more complex. The data community still needs to know where its data lives and who can access it, and must be able to measure its overall health. Teams cannot troubleshoot data issues effectively without a way to track data lineage. In a vast data lake, enforcing data governance will not scale without automation. Data enters a data lake through multiple entry points, so the data community needs a catalog to give that data context. To use and manage data systematically, organizations need to know what kind of data they have, what it is being used for and who is using it.

Cloud Data Governance and Catalog

Informatica Cloud Data Governance and Catalog (CDGC) helps organizations understand, analyze, interpret and govern large volumes of data. CDGC extracts metadata from the organization's source systems and displays it, along with its attributes, in a comprehensive catalog. Data users can organize the metadata, view how data flows from one system to another, and see relationships and links between the data assets. You can document data assets, add business context to technical data assets, and apply your security and compliance requirements to govern the data.

Catalog of Catalogs

The business glossary is a main component of CDGC. A collection of domains, subdomains, business terms, processes and policies, the glossary defines important concepts within an organization. It helps organizations create a common knowledge base of business terms, concepts, and metrics, which allows people to communicate and collaborate without ambiguity.

Figure 1. Example of business glossary

You can link different types of data in the glossary to understand the meaning, location and details of data flow. For example, you can link business metadata (business descriptions, classifications, etc.) to technical metadata (schemas, tables, columns, file names, etc.). This allows you to enrich the technical assets with metadata and relationships that you can use for governance. You can also increase the value of cataloged metadata through the active collaboration of data users in an organization, which will enhance data trust.
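To make the idea of linking business metadata to technical metadata concrete, here is a minimal Python sketch. It is not the CDGC API; the class names, fields and sample values (BusinessTerm, TechnicalAsset, crm_db and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TechnicalAsset:
    """Hypothetical technical metadata entry: one column in one table."""
    system: str
    table: str
    column: str


@dataclass
class BusinessTerm:
    """Hypothetical glossary entry with business context and linked assets."""
    name: str
    description: str
    classification: str
    linked_assets: set = field(default_factory=set)


def link(term: BusinessTerm, asset: TechnicalAsset) -> None:
    """Attach a technical asset to a business term."""
    term.linked_assets.add(asset)


def terms_for_asset(glossary: list, asset: TechnicalAsset) -> list:
    """Return the sorted names of all business terms linked to an asset."""
    return sorted(t.name for t in glossary if asset in t.linked_assets)


# Illustrative data only.
customer_email = BusinessTerm(
    name="Customer Email",
    description="Primary email address used to contact a customer.",
    classification="PII",
)
crm_column = TechnicalAsset("crm_db", "customers", "email_addr")
link(customer_email, crm_column)

glossary = [customer_email]
print(terms_for_asset(glossary, crm_column))  # ['Customer Email']
```

Once the link exists, anyone looking at the email_addr column can see that it carries a business meaning and a PII classification, which is the kind of enrichment governance rules can build on.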

Figure 2. Sample Approval Workflow

You can use the metadata review and approval workflow mechanism in CDGC to send any changes made to the metadata to a Governance Administrator for review and approval. The approval workflow should be carefully set up to align with the organization’s data culture.
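The review step can be pictured as a small state machine. The sketch below is an assumption about how such a workflow might be modeled, not a description of how CDGC implements it; the role string, status values and function names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ChangeRequest:
    """Hypothetical record of a proposed metadata change awaiting review."""
    asset: str
    field_name: str
    new_value: str
    requested_by: str
    status: Status = Status.PENDING


def review(request: ChangeRequest, reviewer_role: str, approve: bool) -> ChangeRequest:
    """Approve or reject a pending change; only a governance administrator may review."""
    if reviewer_role != "governance_administrator":
        raise PermissionError("only a governance administrator can review changes")
    if request.status is not Status.PENDING:
        raise ValueError("request has already been reviewed")
    request.status = Status.APPROVED if approve else Status.REJECTED
    return request


# Illustrative data only.
request = ChangeRequest(
    asset="Customer Email",
    field_name="description",
    new_value="Primary email address used to contact a customer.",
    requested_by="data_steward_1",
)
review(request, "governance_administrator", approve=True)
print(request.status.value)  # approved
```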

Need for Catalog

Organizations start their governance journeys for a number of reasons, including:

To comply with regulatory requirements
To address data quality issues
To support analytics

Whatever the reason, any metadata captured in the tool should be trustworthy and kept up to date. To ensure a successful implementation, be sure to identify and assign governance administrators.

Now, let’s look at two advanced CDGC capabilities, automated metadata scanning and data lineage, and how they address today’s data lake issues.

To increase business agility and optimize costs, organizations are migrating their on-premises data lakes to the cloud. It’s easy to see why. But manual data discovery is unsustainable in a complex, petabyte-scale data landscape where data resides across a multi-cloud environment. CDGC uses AI to automatically scan and catalog enterprise metadata at scale, from cloud data stores (Azure Data Lake Storage Gen2, Amazon Redshift, Snowflake), BI tools (Power BI, Tableau), ETL tools (Azure Data Factory, PowerCenter), third-party data catalogs, and more. You can also schedule source scans and set how often metadata is extracted. CDGC provides real-time visibility into the actual state of data compared with its cataloged state.
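Comparing the actual state of data with its cataloged state amounts to a diff between two snapshots of metadata. The following Python sketch shows that idea for a single table. It is a simplified illustration under assumed inputs (plain dictionaries of column name to type), not how CDGC scans sources, and the function name diff_schema is hypothetical.

```python
def diff_schema(cataloged: dict, scanned: dict) -> dict:
    """Compare cataloged column types with freshly scanned ones for one table."""
    cataloged_cols, scanned_cols = set(cataloged), set(scanned)
    return {
        # Columns present in the source but missing from the catalog.
        "added": sorted(scanned_cols - cataloged_cols),
        # Columns the catalog still lists but the source no longer has.
        "removed": sorted(cataloged_cols - scanned_cols),
        # Columns present in both whose type differs.
        "changed": sorted(
            c for c in cataloged_cols & scanned_cols if cataloged[c] != scanned[c]
        ),
    }


# Illustrative data only: what the catalog recorded vs. what a new scan found.
cataloged = {"id": "int", "email": "varchar", "signup_date": "date"}
scanned = {"id": "bigint", "email": "varchar", "country": "varchar"}

drift = diff_schema(cataloged, scanned)
print(drift)
# {'added': ['country'], 'removed': ['signup_date'], 'changed': ['id']}
```

Running a comparison like this on every scheduled scan is one simple way to flag where the catalog has fallen out of date.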

Data lineage is a visual representation of the flow of data across the systems in an organization. Lineage depicts how the data flows from the system of its origin to the final consumption layer.

Figure 3. Sample Data Lineage

You can view data lineage for technical assets in CDGC. You can use this view to:

Understand the systems that access the data and the extent of its usage.
Know the upstream systems that contribute to the data, and the downstream systems that make use of the data.
See quality improvements to the data along the way and identify data issues or discrepancies.
View changes to data, such as standardization of formats.
Understand the security and privacy processes that apply to the data, such as data sensitivity. You can determine whether privacy policies have been applied at the right stage of the data flow and are aligned with the governance principles of your organization.

Read more about other data lineage use cases in our eBook, AI-Powered Data Lineage: The New Business Imperative.

Figure 4. Overlay Options in Lineage

Lineage helps track data movements, making it easier to troubleshoot when data pipelines break. This is becoming a common problem as modern data stacks evolve to accommodate complex data use cases. CDGC gives you the flexibility to display lineage with multiple overlay options, allowing you to easily identify failures within a data lake. If you create a business report that uses the data, you can see the upstream data sets from which the data is derived. Similarly, if you modify the data, you can tell whether a downstream impact results from that change.

You can show both upstream sources and downstream dependencies, from raw data all the way to dashboards. This allows you to pinpoint how a change will potentially impact users, business processes, and reports, increasing trust and confidence in the data.
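Upstream and downstream analysis can be thought of as graph traversal. The sketch below models lineage as a directed graph and walks it in both directions. It is a generic illustration, not CDGC's lineage engine; the asset names in LINEAGE and the helper functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it directly.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["mart.sales"],
    "staging.customers": ["mart.sales"],
    "mart.sales": ["dashboard.revenue"],
}


def downstream(graph: dict, start: str) -> set:
    """Return every asset reachable from start (breadth-first search)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph: dict) -> dict:
    """Flip every edge so that targets point back to their sources."""
    reversed_graph = {}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph.setdefault(target, []).append(source)
    return reversed_graph


def upstream(graph: dict, start: str) -> set:
    """Return every asset that start depends on, directly or indirectly."""
    return downstream(reverse(graph), start)


# Impact analysis: what is affected if staging.orders changes?
print(sorted(downstream(LINEAGE, "staging.orders")))
# ['dashboard.revenue', 'mart.sales']

# Root-cause analysis: which raw sources feed the revenue dashboard?
print(sorted(a for a in upstream(LINEAGE, "dashboard.revenue") if a.startswith("raw.")))
# ['raw.customers', 'raw.orders']
```

The same traversal serves both questions in the text: walking forward shows which reports a change could affect, and walking backward shows where a broken report's data came from.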

We will discuss advanced CDGC capabilities such as data profiling, classification, data quality, automatic glossary association, and more in our upcoming blogs.

Learn More About How Cloud Data Governance and Catalog Can Help You Manage the Multiverse of Data Explosion Madness

Learn more about Informatica’s comprehensive data governance and privacy solutions. Download the Extract Value from Your Data with AI-Powered Data Discovery eBook.

Watch the webinar, “Meet the Experts: Empowering Business With Cloud Data Governance & Catalog”

Discover how you can deliver trusted data for your analytics and AI initiatives with an agile cloud data governance and catalog solution. Join an upcoming Live Demo.

First Published: Jul 28, 2022
