I travel a lot for work, and one of the most rewarding parts of travel is meeting with customers and listening to their stories. Very often I’ll hear a story that offers particular insight into the state of the industry – and that I want to share with you in my blog.
I was in San Francisco recently to have lunch with Joe, the AI leader of a strategic customer. During the meal he told me he had a story about AI and machine learning (ML) that he thought had good lessons for Informatica – and for anyone working now with data and AI.
“Building algorithms for our customers is a key part of my job,” he started. “We are trying to help our customers grow, and leveraging AI and ML helps them focus their investment where they can make an impact. We were running a strategic pilot project for one of our customers. And our goal was to use AI to survey our customer’s business data and then provide an operations plan with recommendations for where to start offering a new service and how many sales reps to hire to make it a success.”
Joe runs a very strong, and expensive, team of data scientists, including five AI/ML experts and 10 data engineers. Because this project was for a strategic customer, they got their best people involved. Three data engineers each focused on one geography – Americas, EMEA and APJ – and aggregated as much information as possible: a few hundred tables covering customers, orders and market data. Two ML specialists then spent several days refining the model and coming up with recommendations. The model clearly called for increasing investment in EMEA – a departure from the investments the company had made in recent years.
The team went into the meeting convinced they had hit it out of the park. But when the presentation was finished, the execs told them: “You are way off. This recommendation doesn’t make any sense.”
Joe’s team went back and regrouped, trying to figure out how they had gone so wrong. After some investigation, they realized that when they merged 150 EMEA data sources, about 30% of the records in a key area were duplicates: the same record appeared multiple times, with different values due to currency conversions and different descriptions due to different languages. These duplicates badly tilted the results.
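To make the failure mode concrete, here is a minimal sketch of the kind of deduplication Joe’s team would have needed. The records, order IDs, and exchange rates below are entirely made up for illustration – the point is only that the same order can appear under different currencies and languages, and naive merging counts it twice.

```python
# Illustrative only: the same order shows up twice, once in EUR and
# once in USD, with descriptions in different languages.
FX_TO_USD = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}  # made-up rates

records = [
    {"order_id": "A-100", "amount": 1000.0, "currency": "EUR", "desc": "Dienstleistung"},
    {"order_id": "A-100", "amount": 1100.0, "currency": "USD", "desc": "Service"},
    {"order_id": "B-200", "amount": 500.0,  "currency": "GBP", "desc": "Licence"},
]

def dedupe(rows):
    """Normalize amounts to a single currency, then keep one row per order_id."""
    seen = {}
    for r in rows:
        usd = round(r["amount"] * FX_TO_USD[r["currency"]], 2)
        # First occurrence wins; later currency/language variants are duplicates.
        seen.setdefault(r["order_id"], {"order_id": r["order_id"], "amount_usd": usd})
    return list(seen.values())

clean = dedupe(records)  # two unique orders remain instead of three raw rows
```

Real pipelines would match on fuzzier keys than an exact order ID, but even this toy version shows why a 30% duplicate rate can swing a regional recommendation.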
Joe and his team fixed the data, removed the duplicates, and readjusted the model. They were convinced they now had the right recommendation, which was to increase investment in the Americas and focus on mid-market.
But again, when the presentation was finished, the audience was shaking their heads. Focusing on mid-market was a valid strategy, but all indications to date were that investing in the G2000 would be the right move.
Joe and his team asked for permission to validate their recommendation. They looked carefully into the key model features leading to the conclusion. It wasn’t easy to see how the model had reached its recommendation, but the team could clearly see that renewal level was a key metric in the model’s calculations. Looking more closely at the renewal rate data, the team was shocked. The renewal data was missing for many records, and somewhere in the data pipeline each null renewal value had become a zero. This skewed the model significantly.
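A toy illustration (not Joe’s actual pipeline, and with invented numbers) shows how silently coercing missing values to zero distorts a feature like average renewal rate:

```python
# None marks customers with no renewal data at all.
renewal_rates = [0.9, 0.85, None, 0.95, None, 0.8]

def mean_dropping_nulls(values):
    """Average over only the records where renewal is actually known."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known)

def mean_nulls_as_zero(values):
    """What the broken pipeline effectively computed: null -> 0."""
    coerced = [0.0 if v is None else v for v in values]
    return sum(coerced) / len(coerced)

correct = mean_dropping_nulls(renewal_rates)  # 0.875
skewed = mean_nulls_as_zero(renewal_rates)    # about 0.583
```

A model fed the skewed figure would conclude renewals are far weaker than they really are – “no data” and “zero renewals” are very different facts.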
Once again, they fixed the model and went back to present.
Did it work out this time? Sadly, no. In fact, my lunch partner confessed, the team goes back three to five times, on average, before getting it right.
My lunch friend was telling me this because he realized something of critical importance: Data quality and data engineering are really the hard part of AI.
At this year’s Informatica World, the theme of the show was all about AI and data. For our Market Perspectives sessions, we had a number of analysts, customers, and partners speak about how AI and machine learning are changing the industry. But for such a cutting-edge topic, a key part of the message was about the importance of tried-and-true data management disciplines like data quality.
AI and machine learning are really data-hungry processes. They need vast amounts of data in order to run and deliver meaningful insights. But, as my friend’s company discovered, data quality and data preparation become paramount whenever AI and ML are involved.
As my friend told me, “AI can be the smartest way to get a dumb answer.”
Ali Ghodsi, CEO of Databricks, said something similar at Informatica World when he remarked that “The hardest part of AI isn’t the AI, it’s the data.”
When it comes to ensuring that you get smart answers from your artificial intelligence, data quality, data cleansing, and data preparation turn out to be just as important as the sophistication of your algorithms.
Learn best practices for modernizing your cloud data warehouse and data lake at our Data for AI & Analytics VIP Summit on 4/16/20. Check out the agenda and exciting speaker lineup.