Understanding Data Generalization & Advanced De-identification Techniques
Broadly, data de-identification is a set of privacy-preserving techniques that let your organization adjust what is available to data scientists. These techniques, such as data generalization, allow you to manage risk and tune what’s available based on the analysis, model, or use. Understanding the different techniques will help you decide which ones are right for your use case.
What is data generalization?
Data generalization replaces a data value with a less precise one, using one of a few different techniques. Done well, it preserves data utility while protecting against some types of attacks that could lead to re-identification of individuals or unintentionally reveal private information.
Data generalization, also known as blurring, transforms one value into a more imprecise one. This can be done in various ways, including binning (where values within a range are all converted to that range), or providing a less specific value. For instance, a date of birth could be blurred to become a month of birth. A specific value, such as $14, could be expressed as a range, such as $10-$20.
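The two examples above (a dollar amount binned into a range, and a date of birth blurred to a month) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names bin_value and blur_date_to_month are hypothetical and chosen only for this example.

```python
from datetime import date


def bin_value(value, width):
    """Return the (low, high) bounds of the fixed-width bin containing value."""
    low = (value // width) * width
    return (low, low + width)


def blur_date_to_month(d):
    """Drop the day of a date, keeping only the year and month."""
    return f"{d.year:04d}-{d.month:02d}"


# $14 falls in the $10-$20 range when the bin width is 10.
print(bin_value(14, 10))                       # (10, 20)

# A full date of birth becomes a month of birth.
print(blur_date_to_month(date(1990, 7, 23)))   # 1990-07
```

The bin width and the date granularity are the knobs here: wider bins and coarser dates give more privacy and less precision.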
Main forms of generalization
There are two main forms of generalization, automated and declarative:
- Automated generalization blurs values until the dataset reaches a specified value of k, where k is the minimum number of records that must be indistinguishable from one another (k-anonymity, described later in this article). This option can offer the best tradeoff between privacy and accuracy, as an algorithm applies the minimum amount of distortion required to achieve the stated value of k. Because there are several ways to reach a given k, you can specify which values are of most interest for your use case, and those values are blurred least.
- Declarative generalization allows you to specify the bin sizes up front. For example, you might always round to whole months. Sometimes this method results in discarding outliers, which can distort the data and introduce bias. It’s also important to understand that applying declarative generalization doesn’t necessarily result in k-anonymity. Even so, it’s a good practice to apply declarative generalization as a default so the recipient of the de-identified data only sees the level of detail that they require.
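The difference between the two forms can be sketched on a single column of ages. In the sketch below, generalize_ages applies a bin width chosen up front (declarative), while auto_generalize tries candidate widths from finest to coarsest and stops at the first one where every bin holds at least k records (a deliberately simplified stand-in for automated generalization; real algorithms are more sophisticated and work across several columns). All function names, the candidate widths, and the sample ages are assumptions made for illustration.

```python
from collections import Counter


def generalize_ages(ages, width):
    """Declarative: map each age to a fixed-width bin label chosen up front."""
    labels = []
    for age in ages:
        low = (age // width) * width
        labels.append(f"{low}-{low + width - 1}")
    return labels


def smallest_group(labels):
    """Size of the smallest group of records sharing the same label."""
    return min(Counter(labels).values())


def auto_generalize(ages, k, widths=(1, 5, 10, 20, 50, 100)):
    """Automated (simplified): use the finest candidate width that reaches k."""
    for width in widths:
        labels = generalize_ages(ages, width)
        if smallest_group(labels) >= k:
            return width, labels
    raise ValueError("no candidate width reaches the requested k")


ages = [23, 24, 27, 31, 33, 38, 41, 45, 46]

# Declarative 5-year bins leave one person alone in a bin, so k is only 1.
print(smallest_group(generalize_ages(ages, 5)))   # 1

# Automated search settles on 10-year bins to reach k = 3.
width, labels = auto_generalize(ages, k=3)
print(width)                                      # 10
```

Note how the declarative call gives a predictable level of detail but no guarantee about k, which matches the caveat above.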
Understanding identifiers
There are two main types of identifiers: direct and quasi identifiers. A direct identifier, absent any other information, can identify an individual in a dataset and allow data about that individual to be linked. However, direct identifiers may or may not be unique. For example, in the table below, customer ID, email address, and credit card number are all unique and therefore enable you to single out an individual.
The size of the dataset matters as well. For example, names may be unique in a small dataset, but multiple individuals may share the same name in a large one. Names are nevertheless considered a direct identifier, even though they’re not always unique, because they often allow for identification.
Quasi identifiers don’t enable you to identify an individual in a dataset on their own, but they can be used to identify individuals when combined. Quasi identifiers have two important properties:
- Their combination can be unique in a dataset.
- Quasi identifiers are likely to be present in other available datasets (or become so in the future), which allows datasets to be linked.
An individual’s name, gender, address, and ZIP code are likely to be available from other sources, such as voter registration lists, so these pieces of data can help identify individuals. Deciding which values are direct or quasi identifiers can be challenging because it requires understanding what data is available now or may become available in the future, which can be tricky to determine.
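Both properties of quasi identifiers can be demonstrated on toy data. The sketch below first measures how many records are unique under different column combinations, then links the records to a second, public-style list on those columns. Every record, name, and column here is made up for illustration, and unique_fraction and link are hypothetical helper names.

```python
from collections import Counter

# Toy "de-identified" records: direct identifiers have already been removed.
records = [
    {"gender": "F", "zip": "02139", "dob": "1985-03-12", "diagnosis": "asthma"},
    {"gender": "F", "zip": "02139", "dob": "1990-11-02", "diagnosis": "diabetes"},
    {"gender": "M", "zip": "02139", "dob": "1985-03-12", "diagnosis": "flu"},
    {"gender": "M", "zip": "02141", "dob": "1972-06-30", "diagnosis": "migraine"},
]

# Toy external list of the kind that might be publicly available.
public = [
    {"name": "Alice Example", "gender": "F", "zip": "02139", "dob": "1985-03-12"},
    {"name": "Bob Example", "gender": "M", "zip": "02141", "dob": "1972-06-30"},
]


def unique_fraction(rows, columns):
    """Fraction of rows whose combination of values in columns is unique."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


def link(rows, external, columns):
    """Return (name, diagnosis) pairs where an external row matches exactly one record."""
    index = {}
    for row in rows:
        index.setdefault(tuple(row[c] for c in columns), []).append(row)
    matches = []
    for person in external:
        candidates = index.get(tuple(person[c] for c in columns), [])
        if len(candidates) == 1:
            matches.append((person["name"], candidates[0]["diagnosis"]))
    return matches


quasi = ["gender", "zip", "dob"]
print(unique_fraction(records, ["gender"]))  # 0.0
print(unique_fraction(records, quasi))       # 1.0
print(link(records, public, quasi))
```

No single column singles anyone out in this toy data, but the combination is unique for every record, and that is exactly what makes the join against the external list succeed.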
Example
In 2006, Netflix published a dataset containing the film ratings of roughly 500,000 subscribers. Netflix believed that the data was anonymous, but researchers from the University of Texas at Austin were able to link the data with publicly available ratings from the Internet Movie Database (IMDb) to re-identify Netflix subscribers. This is an example of failing to correctly identify and protect quasi identifiers.
Understanding direct and quasi identifiers gives us a baseline to talk about pseudonymous data. Pseudonymous data is data that isn’t directly identifying but can be used in conjunction with other data to identify an individual. Therefore, removing direct identifiers can (in most cases) render data pseudonymous.
Masking identifiers and more
Masking is effective at obscuring direct identifiers but used alone may be insufficient to protect against the risk of re-identification. Indeed, individuals might still be identified through unique combinations of other information known about them.
For example, while most individuals have a unique combination of date of birth (DoB), ZIP code, and gender, far fewer are unique if you clip the ZIP code to just the first few digits, generalize the DoB to the month or year of birth, and redact the gender. Using multiple masking techniques, including generalization, can produce a k-anonymous output dataset. k-anonymity is a property of the dataset in which every record is indistinguishable from at least k-1 others with respect to its quasi identifiers.
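The k of a dataset can be measured directly: group the records by their quasi-identifier values and take the size of the smallest group. The sketch below does this before and after applying the three steps described above (clipping the ZIP code, generalizing the DoB to a year, redacting gender). The sample rows and the helper names k_of and generalize_row are assumptions for illustration only.

```python
from collections import Counter


def k_of(rows, quasi_identifiers):
    """Largest k for which rows are k-anonymous: the smallest group size."""
    groups = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return min(groups.values())


def generalize_row(row):
    """Clip ZIP to 3 digits, keep only the birth year, and redact gender."""
    return {
        **row,
        "zip": row["zip"][:3] + "**",
        "dob": row["dob"][:4],
        "gender": "*",
    }


rows = [
    {"dob": "1985-03-12", "zip": "02139", "gender": "F"},
    {"dob": "1985-09-01", "zip": "02141", "gender": "M"},
    {"dob": "1990-01-20", "zip": "02139", "gender": "F"},
    {"dob": "1990-07-04", "zip": "02138", "gender": "F"},
]
quasi = ["dob", "zip", "gender"]

# Every raw record is unique, so the raw data is only 1-anonymous.
print(k_of(rows, quasi))                                # 1

# After generalization each record shares its values with one other record.
print(k_of([generalize_row(r) for r in rows], quasi))   # 2
```

A check like this is worth running after any generalization step, since, as noted earlier, declarative choices alone do not guarantee a particular k.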
In more complex datasets, more advanced generalization algorithms let you prioritize the values that need to stay precise. For example, if you are working with data to perform a gender pay gap analysis, you need to retain gender and generalize other details into fine-grained ranges.
Why would we want to use data generalization?
Data generalization abstracts personal data so that the personally identifying detail is removed. This enables you to analyze the data you’re gathering without compromising the privacy of the individuals in your dataset. There are different ways to generalize data, and you want to use the method that makes the most sense for your use case. Sometimes the most appropriate course is to apply masking to direct identifiers, while in other cases you want to generalize values so that the data retains its analytic signal.
No single approach is a silver bullet for maintaining privacy, which is why you need to understand different techniques, such as tokenization, redaction, and pseudonymization, and apply them as appropriate to maintain the greatest data utility without unduly compromising privacy.