Delivering Trusted Insights from AI: Why You Need Integrated Data and AI Governance

Jul 26, 2021 |
Dharma Kuthanur

A recent McKinsey survey on the state of AI in 2020 offered some interesting insights. First, it confirms what we have been seeing over the last couple of years: organizations are increasing their use of AI both to grow revenues and to lower costs. The data from the study goes even further, pointing to clear evidence of a positive impact on value at the enterprise level: an impressive 22% of respondents reported that at least 5% of their organizations’ EBIT was attributable to AI. The study also confirms that the global pandemic accelerated the trend toward becoming more data-driven; growth in the use of AI is both an indicator of and a beneficiary of this trend. Notably, 61% of high-performing companies increased their investments in AI during the pandemic.

Interestingly, the same study points to some lurking challenges that may impede AI adoption amid all the rosy indicators of value driven by AI. Respondents from companies that have adopted AI more aggressively were also more likely to report that their models “misperformed” during the pandemic. Of course, it’s not surprising that the organizations using AI most aggressively are also the ones most likely to report issues. But it’s important to note the underlying reason for this “misperformance”: rapid market shifts during COVID invalidated underlying AI model assumptions, which in turn triggered the need to re-evaluate both the input data and the models. A recent Forbes article highlighted how COVID kicked off a flurry of model building and summarized an academic study which found that all of the several hundred models it evaluated had “fatal flaws.” These flaws fell into two general categories: data (small datasets that didn’t represent the population being studied) and transparency, or the lack of it (limited disclosure of data sources, modeling techniques, and potential sources of bias). Not surprisingly, these findings underscore the paramount importance of data (its scale, quality, lineage, and fitness for purpose) and the need for transparency (visibility, shared understanding, explainability, and performance against business metrics).

The Importance of Data in AI/ML Initiatives

As the use of AI models becomes more pervasive across industries, in many domains these types of issues can have an outsized impact on consumers and citizens, and ultimately erode trust in AI. With that in mind, let’s take a deeper look at both types of problems. As the CEO of Databricks, one of our partners, said at an Informatica World event, “The hardest part of AI isn’t the AI, it’s the data.” Too often, issues arise from not paying enough attention to the data being used across the stages of the AI model lifecycle, such as training, validation, and production. A recent Google paper on the importance of data quality for AI observed that “data is the most under-valued and de-glamorized aspect of AI.” Some of the more common data issues that bedevil AI/ML initiatives are:

  • Not enough data: As the example above pointed out, using data that’s too small and not representative of the universe being analyzed is a common problem.
  • Poor data quality: Basic data quality issues such as duplicate data, lack of standardization, inconsistent formats, etc. can have a huge impact on the AI model.
  • Bias in the data: High-profile stories of racial and/or gender bias in applications such as health care treatment and recruiting have received a lot of press in recent years. While there can be many causes, bias in the input data used to train the models directly contributes to bias in machine-learning models, which learn from that data.
  • Inappropriate use of data: The data used has to be appropriate for the business objective being targeted by the AI model and be compliant with relevant policies and regulations. If customer data is being used to develop a targeted marketing campaign, is it in compliance with opt-in policies and regulations like GDPR?
  • Data drift: Data pipelines for AI/ML models often use data from disparate sources. The structure and quality of that data can change over time for many reasons; this is what’s referred to as “data drift.”
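To make the last issue concrete, data drift in a numeric feature can be quantified with a simple metric such as the Population Stability Index (PSI), which compares the distribution of a baseline (training) sample against a recent production sample. The sketch below is a minimal illustration, not a production implementation; the bin count and the usual 0.1/0.25 rule-of-thumb thresholds are conventions, not requirements:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a recent sample of the same
    numeric feature. Rule of thumb: < 0.1 stable, 0.1-0.25 moderate
    drift, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # clamp values outside the baseline range into the edge bins
            idx = min(max(int((v - lo) / (hi - lo) * bins), 0), bins - 1)
            counts[idx] += 1
        # small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Example: a production sample shifted away from the baseline
baseline = [i / 100 for i in range(1000)]        # roughly uniform on [0, 10)
production = [i / 100 + 4 for i in range(1000)]  # same shape, shifted by 4
psi = population_stability_index(baseline, production)
print(psi)  # well above the 0.25 "significant drift" threshold
```

A check like this, run on each feature feeding a model, is one way a governance process can detect drift before it silently degrades model performance.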

The good news is that many of these data problems can be addressed with data intelligence that empowers data consumers with the context and understanding required to ensure appropriate use of data. Organizations have to adopt modern data governance practices that enable them to define policies governing data quality and the appropriate use of data for different business needs, and to automate tracking and reporting on how they are performing against key business metrics. Even AI-centric data issues such as bias can in many cases be codified into business policies, with relevant data quality rules applied to the data and monitored over time.
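As a minimal sketch of what “codifying policies into data quality rules” can look like, the fragment below evaluates a few declarative rules against records and reports a pass rate per rule. The rule names, field names, and thresholds here are hypothetical, chosen purely for illustration:

```python
import re

# Hypothetical data quality policy: each rule is a (name, predicate) pair
RULES = [
    ("email_format", lambda r: re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email", "")) is not None),
    ("age_in_range", lambda r: isinstance(r.get("age"), int)
        and 0 <= r["age"] <= 120),
    ("country_standardized", lambda r: r.get("country") in {"US", "DE", "IN"}),
]

def quality_report(records):
    """Fraction of records passing each rule -- a metric that can be
    tracked over time and compared against policy thresholds."""
    return {name: sum(1 for r in records if pred(r)) / len(records)
            for name, pred in RULES}

records = [
    {"email": "a@example.com", "age": 34, "country": "US"},
    {"email": "not-an-email", "age": 34, "country": "Germany"},
]
print(quality_report(records))
# {'email_format': 0.5, 'age_in_range': 1.0, 'country_standardized': 0.5}
```

Monitoring these pass rates over time is what turns one-off data cleansing into an ongoing governance practice.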

Other Key Considerations

But is data governance alone enough? What about other types of issues that impact AI model performance? Without getting into the details of how models are developed, some key considerations are:

  • Shared visibility and understanding: Are you able to document the AI models so that there is a shared understanding of what models are available, their inputs and outputs, data dependencies, ownership, and deployment status? Are you able to map the model definitions to deployed models and track data usage and lineage across the deployed models?
  • Tracking and monitoring performance: Do you have the ability to track and monitor key metrics like model performance (against business KPIs), bias scores, input data quality, data drift, etc.? Are you able to define policies to set acceptable thresholds for these key metrics and trigger relevant action when needed?
  • AI model explainability: Are you able to explain the model’s recommendations? If the model is not performing as expected (e.g., bias scores above a certain level), are you able to explain why? In some verticals (such as financial services), AI model explainability is needed for regulatory reporting. In other cases, model explainability may be needed to instill confidence in users and drive adoption.
  • Managing risks: Do you have the governance framework required to manage different types of AI risks? These could range from regulatory compliance (e.g., GDPR, CCPA) to ethical issues and reputational risks (e.g., racial/gender/age bias). The European Union’s recently proposed AI regulations, based on the level of risk for different AI systems, have heightened the sensitivity and awareness around this topic. For high-risk AI systems, the proposed regulation explicitly requires providers of AI systems to establish “appropriate data governance and management practices.” Are you able to manage these risks without impeding agility and innovation with too much control?
  • Avoiding redundant efforts: As the use of AI/ML models becomes pervasive, are you able to promote re-use and avoid duplicated efforts? For instance, multiple teams selling different products may need similar next-best-offer recommendation models. Can you effectively democratize trusted AI along with trusted data?
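Several of the considerations above share a threshold-and-trigger pattern: define an acceptable limit for each monitored metric, then raise an alert when a deployed model breaches it. The sketch below illustrates that pattern in isolation; the metric names and limits are illustrative assumptions, not a real product API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    metric: str
    limit: float
    breached: Callable[[float, float], bool]  # (value, limit) -> out of bounds?

# Illustrative governance policies for a deployed model
POLICIES = [
    Policy("accuracy", 0.90, lambda v, lim: v < lim),    # must stay above
    Policy("bias_score", 0.05, lambda v, lim: v > lim),  # must stay below
    Policy("input_psi", 0.25, lambda v, lim: v > lim),   # data drift limit
]

def evaluate(metrics: dict) -> list:
    """Return the names of breached policies so downstream tooling can
    alert model owners or gate the deployment."""
    return [p.metric for p in POLICIES
            if p.metric in metrics and p.breached(metrics[p.metric], p.limit)]

alerts = evaluate({"accuracy": 0.87, "bias_score": 0.02, "input_psi": 0.31})
print(alerts)  # ['accuracy', 'input_psi']
```

Separating the policy definitions from the evaluation logic is what lets non-developers set thresholds while automated tooling enforces them, the same division of labor a governance framework aims for.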

While the nature and scope of these problems differ from what we outlined earlier for data, we can see common themes in what is needed to adequately manage them: documentation, the ability to connect that documentation to what’s being developed and deployed, the ability to set policies, to track and monitor compliance with those policies, to trigger actions or alerts when necessary, and to foster collaboration across different teams. These needs can be addressed through a governance framework, but clearly that framework must be comprehensive, going beyond just data to encompass governance of AI models as well.

Given how critical the data is to model performance, you cannot govern AI models effectively unless you have an integrated solution for governing both the AI models and the data that fuels them. Interested in learning more? Join us for our upcoming Cloud Data Intelligence Summit, where you will hear diverse perspectives from analysts, customers, and Informatica executives on the topic of agile data, analytics, and AI governance in the cloud.