Table of Contents
- Data Quality Importance
- What Is Data Quality?
- Business Impact of Data Quality
- Overcoming Data Quality Challenges
- CLAIRE GPT for Data Quality
- Exploring Data Assets & Quality
- Assessing Data Assets
- Recommended Data Quality Rules
- Diagnostic Data Quality Insights
- Examining Data Quality Records
- Key Metadata Quality KPIs
- Next Steps
Why Is Data Quality So Important?
The data landscape supporting business is ever-changing, growing more complex and diverse. Modern decentralized concepts such as data mesh and data fabric are gaining favor among data teams. With more people influencing data pipelines, data sources and the tech stack, maintaining high data quality is crucial. We believe that now, more than ever, there is a need for self-service, automated operationalization and the ability to quickly detect even the smallest changes in data and metadata. Can conventional methods and tools efficiently handle this complexity? The likely answer is no. This viewpoint is supported by our annual CDO insights survey, “The Rise of Generative AI in Business - Transforming Data Strategies,” which highlights some key takeaways:
Data is a roadblock: 99% of respondents cite data- or technology-related obstacles to realizing their data strategy. These obstacles include:
- 38% are grappling with an increasing volume and variety of data
- 41% already struggle with 1,000+ sources and 79% expect that number to increase in 2024.
- 30% cited being unable to scale data delivery when and where needed.
Investment is on the rise, as data leaders are investing heavily in data management to make the most of changing data environments, including generative AI.
AI is top of mind: 45% of respondents have already implemented generative AI, and another 54% plan to do so soon. Per the survey, the top generative AI challenges are as follows:
- Quality of Data (42%)
- Data Privacy and Protection (40%)
- AI ethics (38%)
- Quantity of domain specific data for training and tuning of LLMs (38%)
- AI Governance (36%)
What Is Data Quality?
Finding a definition of data quality is harder than it sounds: there is no single, fixed definition, and it can vary from context to context. A relatively general definition comes from the latest edition of DAMA International’s “Data Management Body of Knowledge,” which states, “Data quality is the planning, implementation and control of activities that apply quality management techniques to data to ensure it is fit for consumption and meets the needs of the data consumers.”1
The Business Impact of Data Quality
Before diving into the impact of high-quality data, let us first discuss what bad data quality means and its consequences. Bad data is any data that is inaccurate, inconsistent, incomplete or stale and therefore does not represent reality. This kind of data can lead to misguided strategies and poor decisions.
Good data quality, by contrast, brings real business value. High-quality data is crucial for business success because it helps avoid the pitfalls of inaccuracy and misinformation.
At Informatica, we see the value of data quality as the measurable benefits you get from trusted data, like boosting revenue, cutting costs and managing risks.
When we say business value, we’re talking about the value and risk associated with any piece of information an organization has, whether it’s data, metadata or any documentation around it. Let us consider a few examples:
- Online retail companies almost always interact with customers through phone, email, address and payment information like credit or debit card transactions, net banking or modern-day mechanisms like UPI. It’s crucial that these details are accurate and consistent. Any errors here could lead to financial risks.
- Many organizations handle sensitive information, such as PII (personally identifiable information), PCI or PHI (protected health information). Poor data quality can occur when there’s inadequate coverage or when important data elements and data sources aren’t associated with the right policies. Companies may be exposed to fines for violating regulations defined in GDPR (Articles 4 and 9) if such data isn’t properly “scanned,” “identified” and “protected.”
- Data quality issues impacted a large bank’s reporting under the Home Mortgage Disclosure Act of 1975 (HMDA), which is a critical business task. This occurred because, often, the income recorded in the bank’s loan application register (LAR) did not match the paperwork, resulting in inconsistencies.
- It is often said, “Garbage in, Garbage out.” This holds true for machine learning (ML) models. Data quality of training and evaluation data impacts the performance, deployment costs and reliability of the models. As a result, tech companies working with state-of-the-art models may face questionable returns on investment (ROI) without proper data quality and governance in place.
- In a similar vein, innovative companies go through system modernization programs. These projects can go over budget or even come to a halt when the state of the data is inconsistent, and there are many sources for the same information eventually leading to confusion and lack of trust in data.
- The 1-10-100 principle: a data defect that costs $1 to prevent costs roughly $10 to correct and $100 to leave unfixed, so early identification and remediation is the cheapest option and returns the highest value.2
Data Quality Management: A Challenge for Organizations
The main challenges revolve around achieving the following:
1. Automating manual data management processes. In a simplified view, the conventional data quality processes include the following activities:
a. Define standards adhering to the domain: This is typically done by subject matter experts (SMEs) of the domain/business and is often manual.
b. Convert business rules to technical rules: This involves mapping rules to data sources, which can be an elaborate process requiring substantial documentation.
c. Identify and report issues in the right business context.
d. Fix invalid data: Address errors either at the source or at the point of data consumption.
2. Reducing unknown issues: These are issues that go unnoticed until data consumption. Characteristics of such silent issues include:
a. Occurring in complex data landscapes.
b. Arising in environments where there is no end-to-end view of the various data flows and pipelines. Consider a few examples:
i. A report shows stale data due to a failed ETL job.
ii. Sensitive PII ends up in an unauthorized system.
3. Enabling a proactive approach to data quality.
a. Reverse the reactive approach of dealing with the bad data after it has spread to different places.
b. Foster proactive data quality through data observability.
c. Automate data remediation processes to minimize human intervention.
4. Defining clear ownership of data.
a. Identifying the right stakeholders and owners for data is an important aspect of data governance. A critical data source without a stakeholder is, in effect, a data quality problem, so it is important to identify critical data assets that lack stakeholders.
b. Clear ownership helps in fixing data quality issues much faster.
5. Enhancing visibility into data flows and the current state of data.
a. It is important for an organization to understand where the data is stored, what it’s about, who owns it and how it flows through the systems.
b. This understanding helps uncover sensitive data, define ownership and ensure transparency.
CLAIRE GPT for Data Quality
At this juncture, let’s introduce Informatica CLAIRE® GPT and its data quality capabilities. CLAIRE GPT helps users achieve self-service data quality in a complex data environment by supporting them in the following ways:
- Discovery of assets of interest
- Assessment of quality through different means
- Recommendation and automation of data quality for data governance programs
(Want to see CLAIRE GPT in action? Check out the demonstration.)
CLAIRE GPT simplifies data quality by allowing diverse users to engage in data governance and data quality management. We see CLAIRE for data quality being used not just within our comprehensive platform, Intelligent Data Management Cloud (IDMC), but potentially elsewhere in the future.
Discover and Explore Assets of Interest: Gaining Visibility into the State of Data
CLAIRE GPT helps users have visibility into the current state of data, helping them to answer questions such as:
1. Where does the data come from?
2. What information does the data contain?
3. What are the critical elements associated with it?
4. What are the related assets?
5. What are the data profiles for its elements?
6. Are there any sensitive elements?
7. What is the quality of the data elements?
8. Is the data quality high?
9. Is the data quality score outdated?
10. Who owns the data assets?
11. How does the data flow and what does it impact?
12. Has the data asset been assigned to a stakeholder?
With CLAIRE GPT, users can quickly get answers to all the questions above by simply asking in natural language. These answers provide a clear understanding of various aspects of data assets and their importance in terms of data quality, value and business risk. For example, if users want to discover assets from a schema without DQ scores or the score is stale, they can just ask CLAIRE GPT, “Show the tables from CRM_SLS_OPS with stale DQ scores or without DQ scores.”
As seen in Figure 1, CLAIRE GPT responds with a list of assets that meet these criteria, along with other information, such as:
- Who created the asset
- When the asset was created (for example, November 26, 2024)
- The asset type, whether it is a table, a view, a file or something else
- Where the asset resides, including the asset path as well as the source, like Snowflake
- Existing scores, if any

Figure 1. CLAIRE discovers the datasets of interest.
Users also have the option to download or copy content for reporting purposes. When downloading, the data is saved in an MS Excel worksheet, complete with headers. This downloaded sheet may be shared with other stakeholders for further evaluation.
Assessing Data Quality for a Data Asset
Once you identify a few assets of interest, you can assess their data quality through different means. Let us explore some common methods.
- Bird’s-eye view: Obtain an overview of the data through data profiles. These profiles provide essential data elements, a statistical summary based on data class (numeric or categorical), missing values, unique values, frequently occurring values, outliers and anomalies. (A minimal sketch of this kind of profiling follows this list.)
- Data quality scores: Check for existing data quality scores, both subjective and objective.
- Criteria-based examination: Analyze the data using specific criteria.
- Coverage summary: Review the data’s coverage by checking for specific policies related to the data elements or understanding the sensitive data elements.
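To make the first of these methods concrete, here is a minimal sketch of the kind of statistics a data profile aggregates, written as plain SQL against a customer table. The table and column names (CUST_MST, CUST_ID, CUST_FRST_NM, DOB) are illustrative assumptions; CLAIRE GPT computes its profiles internally, so this is a conceptual sketch rather than the product’s implementation.
```sql
-- Minimal profiling sketch (hypothetical column names):
-- row count, missing values, distinct values and a numeric range
-- approximate what a data profile aggregates per data element.
SELECT
    COUNT(*)                       AS total_rows,
    COUNT(*) - COUNT(CUST_FRST_NM) AS missing_first_names,
    COUNT(DISTINCT CUST_ID)        AS unique_customer_ids,
    MIN(DOB)                       AS earliest_dob,
    MAX(DOB)                       AS latest_dob
FROM CUST_MST;
```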
Continuing with our previous example, after discovering a few assets of interest, the user decides to look at the data profile for one of the datasets. To do this, the user simply needs to ask CLAIRE GPT: “Provide a data profile of CUST_MST” and CLAIRE responds with the data summary as shown in Figure 2.
Figure 2. CLAIRE GPT responds with the data profiles for the critical data elements (CDE)
It not only provides a detailed profile but also surfaces key insights from the statistics; these CLAIRE-generated insights from the data summary are shown in Figure 3 below.

Figure 3. CLAIRE generated insights from the data summary.
Using the insights and the data profiles, we can understand different aspects of the data such as missing values, frequent values and outliers. Notice the outlier in the “date of birth” column in Figure 4 below.

Figure 4. CLAIRE identified the outliers in key data elements
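As a rough illustration of how such an outlier could be confirmed once spotted, the validity check below flags implausible dates of birth. The table and column names (CUST_MST, DOB) and the date bounds are assumptions for illustration; CLAIRE derives outliers statistically from the profile rather than from hand-written checks like this one.
```sql
-- Hypothetical validity check: flag dates of birth outside a
-- plausible range. Any rows returned are candidate outliers.
SELECT *
FROM CUST_MST
WHERE DOB < DATE '1900-01-01'
   OR DOB > CURRENT_DATE;
```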
As shown in Figure 5, CLAIRE suggests the next best action for the user, which assists the user with one of the most likely next steps. In this case, it is about asking for recommendations for data quality rules.

Figure 5. CLAIRE suggests the next best action for the user
Recommendation of Data Quality Rules
CLAIRE GPT’s data quality capabilities can recommend data quality rules for the data elements of a dataset. Let us walk through the recommendation process with an example. Suppose a user has identified an asset of interest and reviewed its data profiles for any obvious issues. They may be unsure which data quality rules to apply to the data elements. In this case, the user can type in a request for recommendations or simply choose the suggested prompt, “Recommend data quality rules.”

Figure 6. CLAIRE recommended data quality rules
Note that the next best actions, or suggested prompts, relate to accepting these recommendations. When presented with a response, the user may choose to ignore or accept some of the recommendations using the suggested prompts.
Acceptance of a Recommendation
Upon accepting a recommendation, CLAIRE GPT generates a data quality rule occurrence, a business representation of a data quality rule applied to one or more specific data elements, as illustrated in Figure 6. It also provides a link for the user to review it further within the Cloud Data Governance and Quality application. Continuing with our example, once the user decides to accept a recommendation, they can instruct CLAIRE GPT by simply saying, “Accept DQREC_CUST_FRST_NM_Accuracy_Customer First name Check.” This generates a DQ rule occurrence along with the associated link for further action, as shown in Figure 7.

Figure 7. Accepted DQ rule occurrence
Upon clicking the link, the user is taken to the Cloud Data Governance and Quality application (see Figure 8) for further review and modifications. Note that the primary data element is the one identified in the recommendation.

Figure 8. Cloud Data Governance and Quality application
Diagnostic Data Quality
CLAIRE GPT data quality features can help identify the possible root causes of poor data quality for a data asset. It attempts to provide insight in the following areas:
- Which data elements contributed to the lowest scores?
- Which data quality (DQ) dimensions contributed to the lowest scores?
- Which DQ rules contributed to the lowest scores?
- Were there any upstream applications with poor data quality feeding into the asset of interest?
- In the given sample, were there records that violated any rules?
The current release focuses on the first three of these.
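To illustrate the logic behind the first of these questions, the sketch below ranks data elements by their average score. The dq_scores table and its columns are entirely hypothetical; CLAIRE GPT answers such questions from its own metadata, so this is only a conceptual sketch.
```sql
-- Hypothetical dq_scores table: one row per (data_element, dq_dimension,
-- dq_rule, score). Rank data elements by average score, lowest first.
SELECT data_element,
       AVG(score) AS avg_score
FROM dq_scores
GROUP BY data_element
ORDER BY avg_score ASC
LIMIT 5;
```
The same grouping applied to dq_dimension or dq_rule would answer the second and third questions.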
Understanding Data Quality Through Data Examination
Users can seek out records that violate specific criteria. Let us illustrate this with another example. Users begin by examining the data summary of a table as seen in Figure 9.

Figure 9. Understanding DQ from data profiles
On further analysis, the user focused on the QTY column and found three outliers: 50, 20 and 10, as shown in Figure 10.

Figure 10. Outliers identified from the data profiles
The user can now examine the records containing the outliers with a very simple prompt like “Show the records from SLS_TXN for the quantities of 50, 20 and 10.” The response will include not only the rows matching the criteria but also an explanation. Refer to Figure 11.

Figure 11. Examining the rows containing the outliers.
When the user expands the explanation, they will see representative SQL code that fetched the records of interest, along with a simpler explanation (refer to Figure 12) for less technical users: “I retrieved records from the 'SLS_TXN' table where the quantity matches any of the given values — 50, 20 or 10. This result set includes all such records.”

Figure 12. Explanation of the data examination.
Also, notice the suggested next best actions for this, which include creating a mapping for the above query. This can help users in automating and operationalizing such tasks.
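Based on the quoted explanation, the representative SQL would be along these lines (a sketch; the exact query CLAIRE GPT generates may differ):
```sql
-- Fetch the transaction records whose quantity matches any of the
-- outlier values identified in the profile.
SELECT *
FROM SLS_TXN
WHERE QTY IN (50, 20, 10);
```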
Data quality examination can be more complex as well. Using the same dataset as an example, let us explore another type of question and its response: “Show the transactions where the quantity is less than 1 or the quantity is at least 5 times the average quantity value.”
Unlike the previous question, this generalizes the data examination. Refer to Figure 13.

Figure 13. Data Examination using a general condition.
Now, when the explanation is expanded, as shown in Figure 14, we can see the SQL code as well as a business-friendly explanation of what was done to retrieve the rows.

Figure 14. Explanation of the response seen above.
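For reference, a query expressing that condition might look like the following sketch (again, the actual SQL CLAIRE GPT generates may differ):
```sql
-- Rows where the quantity is below 1, or at least 5 times the
-- average quantity across the table.
SELECT *
FROM SLS_TXN
WHERE QTY < 1
   OR QTY >= 5 * (SELECT AVG(QTY) FROM SLS_TXN);
```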
Business-Friendly Metadata Quality KPIs
We understand that many less technical users, particularly those with deep subject matter expertise in data, will be using CLAIRE GPT. These users speak the language of business assets. To meet their needs, CLAIRE GPT is designed to understand the business terminology used by our data governance practitioners. Some commonly referenced business assets used in the context of Informatica’s data governance and catalog are as follows:
- System: Refers to a high-level data container.
- Dataset: A logical representation of an identifiable collection of data elements.
- Business Glossary: Includes terms, metrics, domains and sub-domains. This enterprise glossary allows everyone to access and understand key concepts and definitions, and it helps align various activities (which often use a different name for the same concept) to these central definitions. The glossary commonly defines critical data elements.
- Policies: A documented set of principles or standards that guide a business purpose or requirement, including any usage restrictions.
- Process: A documented series of repeatable steps to achieve a desired outcome. While policies and processes are often used together, they differ in that policies state the rules and processes state the steps that can be taken to complete a given activity related to those policies.
- Regulations: Designed to capture and describe external regulations such as CCPA or GDPR.
- Business rule: Includes data quality rule templates and data quality rule occurrences.
Data governance users and data quality owners may examine the data and metadata quality aspects of systems or business datasets. Let us examine a sequence of example user prompts (some unrelated to one another) and responses.
To discover the “Systems” related to concepts like CRM or sales operations, users can ask CLAIRE GPT to “Show the system about customer relationship management and sales operations,” as shown in Figure 15.

Figure 15. Discover business assets of interest.
After assigning stakeholders to datasets in a system, the user can explore the datasets that still lack assigned stakeholders. Datasets without stakeholders indicate unclear ownership, which is crucial for metadata quality. To identify and report such gaps, the user simply needs to prompt CLAIRE GPT with, “Show the datasets in CRM and Sales Ops without stakeholders,” as illustrated in Figure 16.

Figure 16. Understanding business datasets without stakeholders.
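Conceptually, such a check is an anti-join over the governance metadata. The sketch below uses entirely hypothetical catalog tables (catalog_datasets, dataset_stakeholders); CLAIRE GPT resolves the natural-language prompt against its own metadata model, so this only illustrates the underlying logic.
```sql
-- Hypothetical catalog model: list datasets with no assigned stakeholder.
SELECT d.dataset_name
FROM catalog_datasets d
LEFT JOIN dataset_stakeholders s
       ON s.dataset_id = d.dataset_id
WHERE s.dataset_id IS NULL;
```
The same anti-join pattern underlies the checks that follow, such as finding data elements without data quality rules.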
Users can also check for data elements in a dataset that are not associated with any data quality rules, as shown in Figure 17, below.

Figure 17. Understanding data elements without any data quality rule.
Users can also check for data quality rules created in a certain time frame. (See Figure 18.)

Figure 18. Discover the DQ rules created in the last quarter.
Another variation would be to understand how many data quality rules were executed in a timeframe. Refer to Figure 19, below.

Figure 19. Explore how many DQ rules were executed last quarter.
Similarly, metadata quality questions can be complex, such as “Show the data quality rules which are created by Sidhartha Sarat Bardoloye, but without stakeholders.” (See Figure 20.)

Figure 20. Explore which DQ rules were created by a certain user.
Users can ask for recommendations of data quality rules for business datasets, too. (See Figure 21.)

Figure 21. CLAIRE recommending DQ rules for business datasets.
Next Steps
Informatica CLAIRE GPT represents a transformative leap in how organizations approach data quality and governance. By harnessing the power of generative AI, CLAIRE GPT simplifies complex tasks such as discovering data assets of interest, assessing data (including metadata) quality and automating rule recommendations. CLAIRE GPT’s ability to deliver actionable insights through conversational interfaces makes it an essential tool for modern data ecosystems characterized by complexity and rapid growth.
The roadmap for CLAIRE data quality aims to ensure the following:
- Advanced discovery and assessment: Enhancing the understanding of data and metadata quality, including data quality for AI, by introducing the objectivity dimension (which measures data independence, bias and data drift given a baseline).
- Automated processes: Implementing auto-assessment, issue identification, data validation and cleaning.
- Enhanced recommendations: Providing stronger recommendations through industry accelerators.
- Data quality rules: Facilitating the generation and evaluation of data quality rules.
- Proactive data quality: Offering proactive approaches to data quality.
- Ease of use: Ensuring easy access and usability across platforms such as IDMC, BI tools, AI tools, browsers and enterprise social tools such as MS Teams or Slack.
Are you ready to transform your data quality experience?
Discover how CLAIRE GPT can help you explore and assess the quality of metadata and data, generating trusted insights and ensuring compliance. Start your journey towards efficient, effortless and reliable data quality management today.