Machine Learning Data Catalog Automation: Using ML-augmented data catalogs in enterprise data management

Dec 21, 2022 | Dharma Kuthanur, Vice President, Product Marketing

What is a Machine Learning Data Catalog?

A machine learning data catalog utilizes advanced algorithms and techniques to automate capabilities including data discovery, metadata extraction, data cataloging, data classification, data curation and data lineage. This type of data catalog is sometimes abbreviated as “ML data catalog” or “MLDC.”

Automation is critical for managing large, complex data estates. An ML data catalog can help boost productivity and accelerate data-driven business outcomes by automating or augmenting common data management tasks at scale.

A machine learning data catalog enhances common data management processes, including data discovery, classification, curation and lineage.

Why Does an ML Data Catalog Matter?

Managing growing volumes of data is a challenge faced by large enterprises everywhere. Data distributed across a wide range of sources and applications increases the challenge. A great starting point to address this challenge is data discovery. Data discovery helps you identify, classify and inventory your data across complex, fragmented data landscapes. Additionally, data curation is necessary to provide business context to the data. With data discovery and curation, you can better harness the power of your data for analytics and artificial intelligence (AI) initiatives.

Companies today need to use their data assets effectively to drive business value. AI-led automation through an ML data catalog helps you keep up, even when you’re dealing with thousands of data sets. Without it, you’re faced with impractical, time-consuming and error-prone alternatives.

Modern data catalogs apply ML to automatically scan data and metadata. ML helps discover data structure, content and relationships at scale. ML-augmented data catalogs can also streamline and automate common data curation processes, including data tagging, classification and the association of business glossary terms with technical data assets. This automation frees data stewards from tedious, repetitive tasks so they can focus on higher-value work. It also means data scientists and other data consumers can access and understand the data they need.

An ML-augmented data catalog helps ensure your team is working with trusted data that can enhance business value with more impactful, data-driven decisions.

How Does an ML Data Catalog Work?

A robust machine learning data catalog uses an ML-based data discovery engine. With it, you can scan and inventory your data assets from heterogeneous sources across on-premises and cloud environments. Modern data catalogs can automatically:

  • Extract metadata from data assets
  • Tag and classify data
  • Discover relationships among data
  • Deliver intelligent recommendations to users
  • Profile data to assess data quality
  • Infer data lineage when lineage can’t be extracted
  • Associate business glossary terms to technical data assets
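As a rough illustration of the first and fifth items in the list above (metadata extraction and profiling), here is a minimal sketch that uses only Python's built-in sqlite3 module. It is not how any particular catalog product works. The `customers` table, its rows and the function names `extract_metadata` and `profile_column` are hypothetical, and a production catalog would scan many source types rather than a single in-memory database.

```python
import sqlite3


def extract_metadata(conn: sqlite3.Connection) -> dict:
    """Return column-level technical metadata for every user table."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [
            {"column": name, "type": col_type, "primary_key": bool(pk)}
            for _, name, col_type, _, _, pk in columns
        ]
    return catalog


def profile_column(conn: sqlite3.Connection, table: str, column: str) -> dict:
    """Compute simple quality indicators: row count, null rate, distinct values."""
    total, non_null, distinct = conn.execute(
        f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}") '
        f'FROM "{table}"'
    ).fetchone()
    null_rate = 0.0 if total == 0 else (total - non_null) / total
    return {"rows": total, "null_rate": null_rate, "distinct": distinct}


# Hypothetical sample data, used only to exercise the two functions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers (email, country) VALUES (?, ?)",
    [
        ("a@example.com", "MY"),
        (None, "MY"),
        ("c@example.com", "US"),
        ("d@example.com", None),
    ],
)

catalog = extract_metadata(conn)
email_profile = profile_column(conn, "customers", "email")
print(catalog)
print(email_profile)
```

A high null rate or an unexpectedly low distinct count in a profile like this is the kind of signal a steward might use to flag a column for a data quality review.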

ML-augmented data catalogs learn from users over time. With predictive data intelligence, you can manually classify data with user-specific tags, and the catalog will then automatically infer those tags for similar data. From there, users can accept or reject the catalog’s inferences. The catalog interprets this feedback to refine its future recommendations. This helps predict next-best actions for your data. It also means less manual effort is required from your team.
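The accept/reject loop described above can be sketched in a few lines. This is a deliberately simplified stand-in, not an actual ML model and not any vendor's algorithm: it scores similarity by token overlap (Jaccard) between column names, treats accepted suggestions as new examples and suppresses rejected ones. The class name `TagSuggester`, the threshold of 0.3 and the column names are all assumptions made for illustration.

```python
import re


def tokens(name: str) -> set:
    """Split a column name such as 'cust_email' into lowercase tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}


def jaccard(a: set, b: set) -> float:
    """Share of tokens the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class TagSuggester:
    def __init__(self, threshold: float = 0.3):
        self.threshold = threshold
        self.examples = {}     # tag -> list of column names carrying that tag
        self.rejected = set()  # (column, tag) pairs a user has rejected

    def tag(self, column: str, tag: str) -> None:
        """Record a manual tag, or an accepted suggestion, as an example."""
        self.examples.setdefault(tag, []).append(column)

    def suggest(self, column: str) -> list:
        """Return (tag, score) pairs at or above the threshold, best first."""
        scored = []
        for tag, columns in self.examples.items():
            if (column, tag) in self.rejected:
                continue  # never repeat a suggestion the user rejected
            score = max(jaccard(tokens(column), tokens(c)) for c in columns)
            if score >= self.threshold:
                scored.append((tag, score))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    def feedback(self, column: str, tag: str, accepted: bool) -> None:
        """Accepted suggestions become examples; rejected ones are suppressed."""
        if accepted:
            self.tag(column, tag)
        else:
            self.rejected.add((column, tag))


suggester = TagSuggester()
suggester.tag("customer_email", "contact_info")          # manual tag
print(suggester.suggest("cust_email"))                   # inferred from the manual tag
suggester.feedback("cust_email", "contact_info", accepted=True)
print(suggester.suggest("cust_phone"))                   # now reachable via cust_email
```

In this toy version, `cust_phone` only receives a suggestion after the user accepts the tag on `cust_email`, which is the sense in which feedback refines later recommendations.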

Key Features of Machine Learning Data Catalogs

An ML-augmented data catalog can help data stewards, data analysts, data scientists and other data consumers. It improves productivity by streamlining and/or automating critical tasks. ML data catalogs can also bring operational efficiencies across the organization.

Key capabilities and features of a machine learning data catalog include:

  • Automated metadata extraction
  • Automated data discovery
  • Semantic search
  • Data recommendations
  • Domain and entity recognition
  • Automated data tagging and classification
  • Data profiling
  • Inferred lineage and relationships
  • Automated association of glossary terms to technical data assets
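To make the last capability in the list above more concrete, the following sketch proposes a business glossary term for a technical column name using plain string similarity from Python's standard difflib module. Real catalogs are likely to use richer signals (data content, usage, learned models); the function name `match_glossary`, the 0.6 cutoff and the glossary entries are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name and treat underscores and hyphens as spaces."""
    return name.lower().replace("_", " ").replace("-", " ").strip()


def match_glossary(column: str, glossary: list, min_score: float = 0.6):
    """Return (term, score) for the closest glossary term, or None if too weak."""
    best_term, best_score = None, 0.0
    for term in glossary:
        score = SequenceMatcher(None, normalize(column), normalize(term)).ratio()
        if score > best_score:
            best_term, best_score = term, score
    if best_term is None or best_score < min_score:
        return None
    return best_term, best_score


# Hypothetical glossary and column names.
glossary = ["Customer Name", "Order Date"]
for column in ["CUSTOMER_NAME", "order_dt", "xyz"]:
    print(column, "->", match_glossary(column, glossary))
```

Returning the score along with the term lets a steward review low-confidence matches instead of having them applied silently.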

Benefits of Machine Learning Data Catalogs

ML data catalogs can provide many benefits for data-driven organizations. These include the ability to:

Enable data analysts and data scientists to find, assess and use relevant data for value-creating analytics and AI initiatives

ML data catalogs can leverage relationship metadata to give data analysts and data scientists 360-degree views of data via knowledge graphs. Users can run quick searches to discover and understand enterprise data and meaningful data relationships. They can also automatically discover related data sets based on technical, business, usage-based and semantic relationships. Through automated data profiling, ML data catalogs let users quickly evaluate data quality, so they can identify and assess relevant data assets. The catalog can also support the progressive discovery of other data sets of interest to fuel analytics and AI initiatives.

Quickly identify and classify sensitive data to help mitigate risk exposure

ML data catalogs provide the tools you need to detect and classify sensitive data across vast data landscapes. Data stewards can identify and mitigate potential data exposure risk with insights into data sharing activity through data lineage. This capability is critical to policy and regulation compliance efforts.
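A very simple way to picture sensitive data classification is pattern matching over sampled column values. The sketch below is a rule-based stand-in rather than an ML classifier, and the two patterns, the labels `email` and `us_ssn`, and the 80 percent match threshold are assumptions chosen for illustration only.

```python
import re

# Hypothetical patterns; a real classifier would cover many more data domains.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def classify_column(values, min_match_rate: float = 0.8) -> list:
    """Return labels whose pattern matches enough of the non-empty values."""
    sample = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not sample:
        return []
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.fullmatch(v))
        if hits / len(sample) >= min_match_rate:
            labels.append(label)
    return labels


# Hypothetical sampled values from three columns.
print(classify_column(["a@example.com", "b@example.org", None]))
print(classify_column(["123-45-6789", "987-65-4321"]))
print(classify_column(["hello", "world"]))
```

Requiring a match rate rather than a single hit keeps one stray value from labeling an entire column as sensitive.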

Advance data literacy across the organization by providing business context for data at scale

ML-augmented data catalogs help organizations democratize data. They provide a trusted foundation for data use. Data consumers can use natural-language search to find the most relevant data. ML data catalogs help users better understand their data. They do this through capabilities like automatic data profiling and data lineage. Intelligent data catalogs help improve data trust and transparency. That’s because they provide the rich business context data consumers need. Data intelligence empowers data consumers to make impactful decisions with confidence.

Improve the productivity of data stewards, allowing them to focus on more valuable work

An ML-augmented data catalog can help data stewards reduce the amount of time and effort spent on tedious manual processes that can’t scale. Intelligent data catalogs help boost productivity by augmenting and automating data curation-related tasks. These tasks include profiling and classifying data and assigning business glossary terms to technical assets. Data stewards can also leverage metadata knowledge graphs to help accelerate and/or automate tasks. These include determining data lineage and identifying sensitive data. By spending less time on these processes, data stewards can focus on more in-depth analysis and higher-value work demanded by CDOs and business leaders.

Machine Learning Data Catalog Challenges

Some machine learning data catalogs have fundamental limitations that, if not addressed, reduce their effectiveness. These include:

Limited Connectivity

With the size and complexity of today’s enterprises, it’s important to have a tool that can intelligently inventory data. You also need to be able to inventory metadata across various sources and applications. Ensure that your machine learning data catalog has broad and deep connectivity across cloud and on-premises systems and applications. Some data catalog solutions are vendor-specific. This can limit the efficacy of the solution. A catalog of catalogs — a data catalog with universal metadata connectivity — provides a centralized, comprehensive view of all your data and is essential for getting value from this data.

Constrained Metadata Capabilities

Many machine learning data catalogs can scan and extract specific types of metadata. But they lack comprehensive capabilities. To discover all your critical data, your data catalog should be able to scan across a wide range of business, technical, operational and usage metadata.

Lacking End-to-End Data Lineage

Data lineage visually represents how data flows from its origin to its destination. This indicates how data changes along its journey. Many data catalogs aren’t capable of tracing end-to-end lineage across systems or when data moves from on-premises to the cloud.
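One way to think about end-to-end lineage is as a directed graph in which each edge records that one asset feeds another. The sketch below walks such a graph backward from a report to every upstream source. The asset names, the edge list and the function name `upstream` are hypothetical, and in practice the hard part is collecting those edges across systems, which this example simply takes as given.

```python
from collections import deque


def upstream(edges, target: str) -> list:
    """Return every asset that feeds `target`, nearest first (breadth-first)."""
    parents = {}
    for source, destination in edges:
        parents.setdefault(destination, []).append(source)

    seen, order, queue = {target}, [], deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:  # guard against cycles and repeat visits
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical lineage edges: (source asset, destination asset).
edges = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "warehouse.customer_orders"),
    ("staging.orders", "warehouse.customer_orders"),
    ("warehouse.customer_orders", "report.revenue"),
]

print(upstream(edges, "report.revenue"))
```

If the edge from a staging table to the warehouse were missing, the traversal would stop at the warehouse table, which illustrates why lineage that breaks at system boundaries gives an incomplete picture.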

Not Scalable

Some data catalogs are limited in the number of objects they can scan. If your machine learning data catalog can’t scan at least tens of millions of objects, you will not be able to view and manage all your enterprise data.

ML Data Catalog Use Cases

At enterprise scale, it’s virtually impossible to manually perform and manage critical processes. Errors are introduced and precious time is wasted — increasing opportunity costs. Automating common data management processes allows data professionals to avoid mundane, time-consuming tasks. Instead, it lets them focus on leveraging data to deliver business value. A machine learning data catalog can support many use cases, such as:

  • Data governance of cloud data warehouses / data lakes
  • Data discovery and lineage
  • Common business understanding of business terms and policies
  • Sensitive data classification and management
  • Policy compliance
  • Data quality monitoring and improvement
  • Master data governance
  • AI and analytics governance

ML Data Catalog Examples

Organizations across industries depend on machine learning data catalogs to support their strategic business initiatives — from accelerating innovation to improving customer experience. Here are a few success stories:

Celcom Advances Data Governance to Accelerate Innovation

Celcom is a Malaysian telecommunications company with more than 8 million subscribers. They provide the most extensive mobile coverage in Malaysia. They needed a data-driven solution to help meet their strategic goals of growing the business through innovation and developing digital products and services to realize the potential of 5G.

Celcom maximized the value of their vast subscriber and service data by implementing a machine learning data catalog and a data governance solution. Data across their complex data landscape was frequently being extracted and combined for reporting and data science purposes. The data catalog helped preserve data lineage and the integrity of data as it moved around the enterprise. It also helped them clarify data ownership. With the ML data catalog, Celcom minimized the time-consuming manual processes that were required for regulatory reporting. They cite one example in which a report that previously took more than 150 person-hours to create now takes only five hours.

Celcom is leveraging data catalog and data governance capabilities to enable access to trusted data. This provides insights and reporting to support data-driven decision making. The organization successfully created a foundation to develop new digital services, improve customer experience and efficiently comply with regulatory requirements.

Improving Provider Data and Patient Privacy at L.A. Care Health Plan

L.A. Care Health Plan is one of the largest U.S.-based publicly operated health plans. It serves nearly 2.2 million members. Its mission is to provide access to quality healthcare for Los Angeles County’s most vulnerable residents. When the Affordable Care Act was passed, demand for health insurance coverage increased. L.A. Care grew eight-fold and the amount of patient information they protect, govern and manage soared.

To improve population health, the company wanted to better leverage its data for analytics. An ML-based data catalog and metadata system of record was built. This helps them discover the location of both protected health information (PHI) and personally identifiable information (PII). The system also helps them understand how the information moves across the enterprise. Data quality and governance solutions were implemented as well. Their ML data catalog helps them:

  • Improve provider data quality
  • Ensure data security
  • Enhance patient privacy

Now, L.A. Care’s data consumers have access to governed, high-quality data. This data provides insights into social determinants of health, such as socioeconomic status, education, neighborhood and physical environment, employment and social support networks, all of which affect overall health. These insights help enable better patient outcomes and future advancements in population health.

Why Informatica?

Informatica's Intelligent Data Management Cloud™ (IDMC) with machine learning data catalog services is the industry’s most comprehensive, AI-powered data management platform. Powered by CLAIRE®, IDMC leverages broad and deep metadata connectivity to automate data management tasks, allowing organizations to drive value with the data fueling their analytics, AI and data-driven business outcomes.

Helpful Resources:

  • What is a data catalog?
  • Watch our Back to Basics: Data Catalog webinar series to learn more about the fundamentals of data cataloging.
  • Read about the must-have features of an enterprise-scale data catalog.
  • Learn why a data catalog is essential for digital transformation initiatives.
  • See why Informatica was named a Leader in the IDC MarketScape: Worldwide Data Catalog Software 2022 Vendor Assessment, Aug 2022 | Doc #US48395622.