Machine Learning Needs Data Quality
Data quality is not a new problem. For the past few decades, though, the quality of data was considered mainly in the context of data warehousing and operational systems. However, with the ever-increasing volume and variety of data, data lakes that store structured and unstructured data at scale, and the resurgence of machine learning, data analysts and data engineers are starting to ask, “How do I ensure data is fit for purpose in this alternative paradigm?”
When it comes to machine learning, data cleansing is a vital step. Incomplete, inconsistent, duplicate, and missing data can drastically degrade the performance of machine learning solutions (such as prediction or clustering), which can lead to the business losing confidence in the results and making misinformed decisions. That is why data must be cleansed before a dataset can be used effectively to train a model.
So how do existing data quality tools help with machine learning?
First, the tools profile data to discover and assess the data’s content, structure, and anomalies. This profiling step identifies incomplete, noisy and inconsistent data in the data sets and helps define the corresponding data processing and cleansing steps, as in these examples:
Missing values – Machine learning depends on data, and any missing values in the data set require a mitigating strategy (Do you delete the records? Replace with dummy values? Use the mean, median, mode or k-nearest neighbor as a replacement?). Although any of these approaches may work, they can potentially lead to the loss of essential details or introduce bias.
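As a minimal sketch of these options (assuming a pandas DataFrame with hypothetical annual_spend and orders columns), the snippet below contrasts dropping records, filling with the median, and k-nearest-neighbor imputation via scikit-learn:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical customer data with missing values in "annual_spend"
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "annual_spend": [1200.0, None, 860.0, None, 430.0],
    "orders": [14, 3, 9, 1, 5],
})

# Option 1: delete records with missing values (risks losing essential detail)
dropped = df.dropna(subset=["annual_spend"])

# Option 2: replace with a summary statistic such as the median (may introduce bias)
median_filled = df.assign(
    annual_spend=df["annual_spend"].fillna(df["annual_spend"].median())
)

# Option 3: k-nearest-neighbor imputation based on similar records
knn = KNNImputer(n_neighbors=2)
knn_filled = df.copy()
knn_filled[["annual_spend", "orders"]] = knn.fit_transform(df[["annual_spend", "orders"]])

print(dropped, median_filled, knn_filled, sep="\n\n")
```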
Outliers and default values – The accuracy of data sourced from a customer relationship management (CRM) system depends on how accurately the user entered the data. According to a recent Salesforce study, this means 20% of records are essentially useless. One cause of this inaccuracy is a user failing to change a default value (such as an opportunity create date of 01-01-00 or the first option from a drop-down menu). Another is an algorithm that gives outliers disproportionate weight; in a similar fashion, a disproportionate number of default values can also skew the results. Deciding which outliers to remove from or include in your model depends on the use case (for example, if you are looking for fraudulent activity, you are looking for outliers).
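A hedged sketch of this kind of check, using hypothetical created_date and deal_size fields, might flag records that still carry a system default date and apply a simple interquartile-range rule to surface outliers for review rather than dropping them automatically:

```python
import pandas as pd

deals = pd.DataFrame({
    "created_date": pd.to_datetime(["2000-01-01", "2023-05-17", "2000-01-01", "2023-06-02"]),
    "deal_size": [5000.0, 4800.0, 250000.0, 5200.0],
})

# Flag records still carrying the system default date (a likely untouched default)
default_date = pd.Timestamp("2000-01-01")
deals["suspect_default"] = deals["created_date"].eq(default_date)

# Flag outliers with a simple interquartile-range rule; whether to keep them
# depends on the use case (e.g., fraud detection is looking *for* outliers)
q1, q3 = deals["deal_size"].quantile([0.25, 0.75])
iqr = q3 - q1
deals["outlier"] = ~deals["deal_size"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(deals)
```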
Duplicates – If duplicate data in a single system is an issue, the problem is compounded when data is sourced from multiple systems. For example, is the Jim Smith in the CRM system the same person as the James Smyth in the billing system? Or the James J Smith in the customer service system? Because duplicate data can lead to overfitting in the model, and removing duplicates from large data sets can be especially time-consuming and problematic, it's important to identify an effective procedure to detect and remove them.
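A minimal sketch of fuzzy duplicate detection, using only Python's standard-library difflib and a hypothetical list of names pulled from different systems, might look like this:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer names sourced from different systems
records = ["Jim Smith", "James Smyth", "James J Smith", "Maria Garcia"]

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two normalized names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare every pair and flag likely duplicates above a tunable threshold
THRESHOLD = 0.7
candidates = [
    (a, b, round(similarity(a, b), 2))
    for a, b in combinations(records, 2)
    if similarity(a, b) >= THRESHOLD
]
print(candidates)  # candidate pairs to route to human review or survivorship rules
```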
Standardization – These issues range from simple to complex. A simple standardization process may involve converting all text entries to the same case (ALL CAPS, lower case, Sentence case, or Title Case). More complex processes might involve reconciling all the variations of a company name such as “Pacific Gas & Electric” (“Pacific Gas and Electric”, “PG&E”, “PGE”), or of how the color “Black” is coded (“Blk”, “K”), and these are far more challenging and time-consuming to hand-code. Other examples of standardization include product dimensions and units of measure.
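As an illustrative sketch (with hypothetical company and color columns and a hand-maintained mapping), a simple standardization pass could normalize case and map known variants to a canonical value:

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["Pacific Gas and Electric", "PG&E", "PGE", "pacific gas & electric"],
    "color": ["Blk", "K", "black", "BLACK"],
})

# Canonical mappings for known variants; unknown values pass through unchanged
COMPANY_MAP = {
    "pacific gas and electric": "Pacific Gas & Electric",
    "pacific gas & electric": "Pacific Gas & Electric",
    "pg&e": "Pacific Gas & Electric",
    "pge": "Pacific Gas & Electric",
}
COLOR_MAP = {"blk": "Black", "k": "Black", "black": "Black"}

df["company_std"] = df["company"].str.strip().str.lower().map(COMPANY_MAP).fillna(df["company"])
df["color_std"] = df["color"].str.strip().str.lower().map(COLOR_MAP).fillna(df["color"])
print(df)
```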
Once you have a clear and accurate picture of your data and the shape you need it in for modeling, you can move on to defining data cleansing rules. These rules can validate that data is syntactically and semantically accurate, automatically fix and standardize data, and generate exception reports. An exception reporting process helps address and correct weaknesses in the data and exposes them for further profiling and analysis. Furthermore, as more data sources come on stream, these data cleansing rules can be reused again and again.
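One way to sketch such rules, assuming a pandas DataFrame and entirely hypothetical column names, is a small reusable rule set that validates each record, applies an automatic fix, and writes the failures to an exception report:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "email": ["a@example.com", "not-an-email", "b@example.com"],
    "country": ["us", "US", "Canada"],
})

# Reusable rule set: each rule returns a boolean Series marking valid rows
RULES = {
    "email_has_at_sign": lambda df: df["email"].str.contains("@", na=False),
    "country_not_blank": lambda df: df["country"].str.strip().ne(""),
}

# Automatic standardization fix applied before validation
customers["country"] = customers["country"].str.strip().str.upper()

# Exception report: one row per failed rule per record
exceptions = []
for rule_name, rule in RULES.items():
    failed = customers.loc[~rule(customers), ["customer_id"]].assign(failed_rule=rule_name)
    exceptions.append(failed)

exception_report = pd.concat(exceptions, ignore_index=True)
print(exception_report)  # feeds further profiling, analysis, and correction
```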
Finally, data quality should not be a one-off “set it and forget it” exercise. To maintain high levels of model performance, you must continuously monitor and manage data quality against all targets and across all sources.
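A minimal, hypothetical sketch of that ongoing monitoring might compute a few data quality metrics on each new load and compare them against agreed thresholds:

```python
import pandas as pd

# Hypothetical thresholds agreed with the business for each metric
THRESHOLDS = {"completeness": 0.98, "uniqueness": 0.99}

def quality_metrics(df: pd.DataFrame, key: str) -> dict:
    """Compute simple completeness and uniqueness scores for a data load."""
    completeness = 1.0 - df.isna().mean().mean()  # share of non-null cells
    uniqueness = df[key].nunique() / len(df)      # share of distinct key values
    return {"completeness": completeness, "uniqueness": uniqueness}

def check(df: pd.DataFrame, key: str) -> list:
    """Return the metrics that fell below their thresholds on this load."""
    metrics = quality_metrics(df, key)
    return [m for m, score in metrics.items() if score < THRESHOLDS[m]]

load = pd.DataFrame({"customer_id": [1, 2, 2], "email": ["a@x.com", None, "c@x.com"]})
print(check(load, key="customer_id"))  # e.g. ['completeness', 'uniqueness'] -> alert the team
```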
These are just a few ways current data quality tools can cleanse data so it can be used effectively for machine learning. So, before you start your next machine learning initiative, consider the following questions and how data quality tools can help:
Are you spending more time cleansing data than tuning models?
Are you using complex models to adjust for low-quality data?
Are the results of your models trustworthy?
One way you can find the answers to these questions and more is to attend our Data Lake Governance Virtual Summit. This is a great opportunity to learn about market-leading capabilities that you can use across on-premises and multi-cloud environments to accelerate your digital transformation initiatives.
Register today for an event in your region:
- North America on November 19
- EMEA on November 20
- APJ on November 21