Data anonymization alters data across systems so that it can't be traced back to a specific individual, while preserving the data's format and referential integrity. It is one of several approaches organizations can use to comply with stringent data privacy laws that require the protection of personally identifiable information (PII) such as contact information, health records, or financial details.

Why is data anonymization important?

Data anonymization can help companies keep PII private by masking sensitive attributes, even as they derive business value from the data for customer support, analytics, test data, supplier outsourcing, and more.

What are the key benefits of data anonymization?

Data anonymization is a way to demonstrate that your company recognizes and enforces its responsibility for protecting sensitive, personal, and confidential data in an environment of increasingly complex data privacy mandates that may vary based on where you and your global customers are located.

Customers who entrust their sensitive data to companies will consider a breach of that data a breach of their trust as well, and take their business elsewhere as a result. Indeed, one industry survey found that 85% of consumers will not do business with a company if they have concerns about its security practices, and just 25% of respondents believe most companies handle their PII responsibly.

In addition to protecting companies against potential loss of trust and market share, data anonymization is a defense against data breach and insider abuse risks that result in regulatory non-compliance. The fine for a GDPR violation, for example, can reach €10 million or 2% of global annual turnover for lower-tier infringements, and €20 million or 4% for the most serious ones, whichever is greater. Even a single complaint can trigger a costly and time-consuming audit. When the equally stringent requirements of the California Consumer Privacy Act (CCPA) go into effect on January 1, 2020, they will carry risks of fines and litigation as well as the day-to-day time and costs of responding to consumer requests about the use of their PII. Because California is the largest economy in the U.S. and the fifth largest in the world, its legislation is seen as a blueprint for other states and nations seeking to enforce data privacy regulations.
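The "whichever is greater" rule for GDPR fines can be illustrated with a short calculation. This is only a sketch: the function name and the tier flag are hypothetical, it assumes the fixed amounts pair with the percentages as €10 million with 2% and €20 million with 4%, and it computes the upper limit of a fine, not the amount a regulator would actually impose.

```python
def gdpr_fine_cap(annual_turnover_eur: float, upper_tier: bool = True) -> float:
    """Return the maximum possible fine in euros for the chosen tier.

    Assumption: the upper tier is EUR 20 million or 4% of global annual
    turnover, the lower tier is EUR 10 million or 2%, whichever is greater.
    """
    fixed_cap, percent = (20_000_000, 4) if upper_tier else (10_000_000, 2)
    return max(fixed_cap, annual_turnover_eur * percent / 100)


# A company with EUR 1 billion in turnover: 4% exceeds the fixed EUR 20 million.
print(gdpr_fine_cap(1_000_000_000))
# A company with EUR 100 million in turnover, lower tier: the fixed amount applies.
print(gdpr_fine_cap(100_000_000, upper_tier=False))
```

For large companies the percentage dominates, which is why the turnover-based limit is usually the one that matters at enterprise scale.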

But data anonymization is not simply about avoiding risk; it also improves data governance and data quality. With clean, trusted data, you can optimize applications and resources, protect privacy in big data analytics, and accelerate cloud workloads, all of which drive digital transformation by opening up safe data for use in creating new business value.

What data should be anonymized?

The rigorous requirements of the GDPR provide a useful benchmark for the data types to protect, regardless of whether a company stores or processes PII about EU citizens. The GDPR defines personal information as "any information relating to an identified or identifiable data subject," which includes the following:

  • Basic identity information such as name, address, and ID numbers
  • Web data such as location, IP address, cookie data, and RFID tags
  • Health and genetic data
  • Biometric data
  • Racial or ethnic data
  • Political opinions
  • Sexual orientation

When the CCPA goes into effect, it will cover even broader classes of personal data. Your company is responsible for protecting any information that "identifies, relates to, describes, is capable of being associated with, or may reasonably be linked, directly or indirectly, with a particular consumer or household" if it conducts business with Californians and meets any of the following criteria:

  • Has $25 million or more in gross revenue
  • Reaches 50,000 or more households or devices
  • Derives at least half its annual revenue from selling PII

Depending on your business, the data types involved could be anything from vehicle identification numbers (VINs) to data streaming from cellular towers or IoT-enabled household smart devices.

Many companies must also comply with industry-specific regulations. Independence Health Group, a US health insurance company, is an example of how to successfully apply data anonymization for healthcare regulations. Independence Health Group is subject to HIPAA, which tightly regulates the handling of Americans' protected health information (PHI). The company must protect the PHI of its 8.3 million insureds, both to avoid the high cost of healthcare data breach fines and remediation, and to safeguard its customers’ well-being and trust. However, the insurer also needs to be able to collaborate with outside data processing partners and allow both in-house and outsourced developers to test applications on relevant data.

To build and test high-quality applications and process data without the risk of unauthorized access, Independence Health Group uses Dynamic Data Masking to anonymize a broad array of data ranging from names, birthdates, and Social Security Numbers to diagnoses and billing records.

Are there alternatives to data anonymization?

Persistent data masking for anonymization

Data masking can be used for anonymization or pseudonymization. It replaces data elements with similar-looking proxy data, typically using characters that preserve the format an application requires, so the application can work with the masked results. Persistent data masking is typically used for anonymization. Dynamic data masking, by contrast, is reversible: it transforms data on the fly based on user role and context, which secures real-time transactional systems and makes data privacy and compliance more flexible to implement and maintain.
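The role-based, on-the-fly behavior of dynamic masking can be sketched as follows. Everything named here is an assumption for illustration (the role name, the field names, and the rule of showing only the last four characters); it does not describe any particular product. The stored record is never altered, and only the view returned to the caller changes.

```python
def mask_keep_last(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every letter or digit except the last `visible` ones.

    Separators such as '-' are kept so the result preserves the format.
    """
    total = sum(ch.isalnum() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isalnum():
            seen += 1
            out.append(ch if seen > total - visible else mask_char)
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical roles that are allowed to see unmasked values.
PRIVILEGED_ROLES = {"claims_adjuster"}


def view_record(record: dict, role: str, sensitive_fields=("ssn", "phone")) -> dict:
    """Return the record as the given role should see it."""
    if role in PRIVILEGED_ROLES:
        return dict(record)
    return {
        key: mask_keep_last(value) if key in sensitive_fields else value
        for key, value in record.items()
    }


record = {"name": "Ana", "ssn": "123-45-6789"}
print(view_record(record, "developer"))        # masked view
print(view_record(record, "claims_adjuster"))  # original values
```

Because the original values stay in storage, this kind of masking is reversible by design: changing the caller's role is enough to reveal them.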

Persistently masked data contains no references to the original information and cannot be unmasked, which potentially lowers the risk of improper data exposure. Persistent masking is most commonly used to create test data, to protect highly sensitive data, or to support research and development on sensitive projects.
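A minimal sketch of irreversible, format-preserving masking, assuming a simple character-class rule that the text does not prescribe: digits become random digits, letters become random letters of the same case, and separators are kept. The function name is hypothetical. No mapping back to the original is stored, so the output cannot be unmasked.

```python
import secrets
import string


def mask_persistently(value: str) -> str:
    """Replace each character with a random one of the same class."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(secrets.choice(string.digits))
        elif ch.isalpha():
            alphabet = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(secrets.choice(alphabet))
        else:
            # Keep separators so downstream format checks still pass.
            out.append(ch)
    return "".join(out)


print(mask_persistently("123-45-6789"))  # random digits in the same layout
```

Note that purely random replacement gives a different result each time, so the same original value will not map to the same masked value across tables. Preserving referential integrity would need a consistent, deterministic scheme, which this sketch does not attempt.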

Dynamic data masking for pseudonymization

Data pseudonymization also replaces personally identifying data fields in a record with alternate proxy values. Unlike anonymization, it does not remove all potential identifiers from the data and is reversible, so re-identification is possible for anyone who has additional details that can connect or restore the pseudonym to the original data.

For example, if you have a data set of employee names, email addresses, phone numbers, and salaries, original values may still be discovered through an inference attack that looks for revealing patterns across these fields. Alternatively, anyone with access to the encryption keys, or to similar data transformation controls, could reverse the proxy values to their original state and “unmask” the pseudonymized data.
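The direct route to re-identification can be sketched with a lookup table standing in for the "additional details" that connect a pseudonym to its original value. The class name, method names, and token format are hypothetical; real systems may use encryption keys or other reversible transformations instead.

```python
import secrets


class PseudonymVault:
    """Maps original values to random tokens and keeps the reverse mapping."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def pseudonymize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same pseudonym.
        if value not in self._token_by_value:
            token = "user_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def reidentify(self, token: str) -> str:
        # Anyone who can call this (or read the table) can restore the original.
        return self._value_by_token[token]


vault = PseudonymVault()
token = vault.pseudonymize("ana@example.com")
print(token)                    # a random token such as user_...
print(vault.reidentify(token))  # ana@example.com
```

The protection therefore depends entirely on who can reach the vault. The inference attack described above is a separate risk that remains even when the vault is well guarded, because the unmasked fields in the same record can still narrow down who the person is.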

Because data can be re-identified either indirectly or directly, data pseudonymization should not be used in situations where you need complete disassociation between the individual’s identity and their data. Only data anonymization fully strips the data of possible identifying information. On the upside, pseudonymization can offer manageable risk when there are legitimate use cases for data being restored to original values later. See the GDPR definition of pseudonymization in Article 4(5).

Data encryption

Data encryption is another form of data protection that uses algorithms to scramble cleartext data into an unreadable form. The scrambled data loses its original format and is unusable until it is decrypted. Data encryption is useful for data at rest and in motion, such as in storage or over network links, where data usability is not an immediate requirement. Unlike anonymization, data encryption is reversible; encrypted data can be restored by someone who has the key for the corresponding decryption algorithm. This makes it imperative to use a strong encryption algorithm that cannot be easily cracked, and to safeguard access to keys associated with the data.

Encryption is widely used to protect files in transit or at rest, and it offers flexibility when the data may later need to be re-identified, for example, to link successful clinical trial results back to the specific patients for further follow-up.
