Data Masking: What It Is, Techniques and Examples
Updated 2021 by the data privacy team
Introduction to data masking and data encryption
Here’s a test: Are you able to work out how the title of this post was obscured using data encryption vs. data masking techniques? For example, “opjuqzsdof bube vs data ;8^)!#g.” Let’s move ahead; we will revisit it at the end. Of course, if you’re a super user and want to skip ahead to learn about Informatica solutions, check out our various data masking techniques and bookmark this page for later!
Moving on, the main topic of this post is understanding how this jumbled text was produced. In the data privacy world of data risk management and the security world of access controls, data encryption and data masking are widely considered de facto standards: two of the most effective techniques for preventing unauthorized access to and use of sensitive data, such as personal information that is increasingly regulated under GDPR, CCPA and similar evolving data privacy laws.
Before moving ahead, we must state: Data encryption and data masking are different methods of applying data protection. Traditional data encryption is not the same as data masking, and the two are intended to solve different data privacy and data security problems.
Let’s start with data encryption followed by data masking techniques to better understand how these two are used as part of a data privacy management program.
What is data encryption security and how does it help to achieve data privacy?
Traditional encryption, built around key management, is the process of transforming original data into a coded format using an encryption technique (symmetric key encryption or asymmetric public key encryption) so that only authorized users can decode it, which prevents unauthorized data exposure. The process of retrieving the original data from the encoded data using a decryption key is termed decryption.
In general, an encryption method uses an encryption key to encode the original cleartext data, and only authorized users have access to the corresponding decryption key. This highlights one of the most fundamental features of encryption: All encryption algorithms are reversible (provided you are authorized to have access to the key).
Original source data is encoded using an encryption algorithm (AES, DES, RSA, etc.) and a key. Encoded data is reverted to the original data using the same algorithm and the encryption/decryption key. See the following diagram, which shows the sequence of events and summarizes the encryption fundamentals. For the sake of simplicity, we have considered a basic encryption method. (Yes, it is far too simple to be secure in practice.) We agreed with the other party to use a symmetric key (same key for encryption and decryption): the sender increments each digit of the SSN by 1, and the receiver decrements each digit by 1 to recover the original SSN from the encoded one.
Let’s walk through the events that occur in this encryption and decryption. Joye requests the employee’s details from Bob, who is authorized to access the data source; the two have already agreed on the encryption/decryption key. Bob encodes the sensitive data, such as an SSN, by incrementing each digit by 1 and sends it to Joye. Joye knows the decryption key, so she decodes the SSN and can access the original data.
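The exchange between Bob and Joye can be sketched in a few lines of Python. This is only an illustration of the toy scheme described above, not a real cipher. The function names, the sample SSN, and the choice to wrap 9 around to 0 are assumptions made for the example; the article does not say what happens to a 9.

```python
DIGITS = "0123456789"

def shift_digits(text: str, key: int) -> str:
    """Shift every decimal digit by `key` modulo 10; leave other characters alone."""
    return "".join(
        str((int(ch) + key) % 10) if ch in DIGITS else ch
        for ch in text
    )

def encrypt_ssn(ssn: str, key: int = 1) -> str:
    """Bob's side: add the shared key to each digit (assumed to wrap 9 -> 0)."""
    return shift_digits(ssn, key)

def decrypt_ssn(encoded: str, key: int = 1) -> str:
    """Joye's side: subtract the same shared key to recover the original."""
    return shift_digits(encoded, -key)

original = "123-45-6789"              # made-up SSN for illustration
encoded = encrypt_ssn(original)       # "234-56-7890"
assert decrypt_ssn(encoded) == original
```

Because both sides use the same key (1), this is a symmetric scheme, and anyone who learns the key can reverse it.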
Note: In principle, it is possible to break an encryption algorithm. With a brute force approach, we can try every possible key until one works. However, the time needed to search all possible keys is enormous, on the order of 10 to the power 27 years (if, for example, we are using a 256-bit encryption key). It is the size of the key that makes breaking the encryption algorithm harder and harder. In 1997, a 40-bit RC4 key was cracked in only 3.5 hours, and in 2000, a 56-bit DES key was cracked in less than 4 days. So, the strength of the encryption algorithm depends heavily on the size of the key.
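To see why key size matters so much, it helps to count the keys a brute force search would have to try. The sketch below only does that arithmetic; the helper name `keyspace` is made up for this example, and no cracking speed is assumed.

```python
def keyspace(bits: int) -> int:
    """Number of distinct keys for a key that is `bits` bits long."""
    return 2 ** bits

# Every additional bit doubles the number of keys to search.
for bits in (40, 56, 128, 256):
    print(f"{bits:>3}-bit key: about {keyspace(bits):.2e} possible keys")

# Going from a 40-bit key to a 56-bit key multiplies the work by 2**16.
growth = keyspace(56) // keyspace(40)   # 65536
```

On average an attacker finds the key after searching about half of the key space, so the same doubling applies to the expected effort.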
What is data masking and how does it help to achieve data security for data privacy?
The process of safeguarding original data by obfuscating field-level data attributes is termed data masking, and the resulting data set is called masked data. For example, using an SSN, we could mask the first five digits while still leaving the last four available for user validation, which is what you often encounter when calling a customer support center.
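A minimal sketch of that SSN rule is shown here. The function name `mask_ssn`, its parameters, and the sample value are hypothetical; the sketch assumes separators such as hyphens should be kept so the masked value retains its familiar shape.

```python
DIGITS = "0123456789"

def mask_ssn(ssn: str, visible: int = 4, mask_char: str = "X") -> str:
    """Mask all digits except the last `visible` ones; keep separators as-is."""
    total_digits = sum(ch in DIGITS for ch in ssn)
    to_mask = max(total_digits - visible, 0)
    out = []
    for ch in ssn:
        if ch in DIGITS and to_mask > 0:
            out.append(mask_char)   # hide this digit
            to_mask -= 1
        else:
            out.append(ch)          # separator or one of the visible digits
    return "".join(out)

masked = mask_ssn("123-45-6789")    # "XXX-XX-6789"
```

The last four digits stay usable for validation, while the first five cannot be recovered from the masked value.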
Put another way, data masking does not require us to reconstruct the original data to achieve some usability while desensitizing it. This points to the most fundamental difference between encryption (original data is transformed into encoded data, and the original data is later restored from it) and data masking (original data is obscured to achieve data anonymization, with no expectation of restoring it). The most significant property of data masking is that it does not require data to be reversible. Its strength is that masking can be done in such a way that there is no way to retrieve the original data from the masked data when retrieval is not required. It is typically a one-way transformation, much like hashing.
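To illustrate the comparison with hashing, here is one possible one-way masking function built on a keyed hash (HMAC-SHA-256) from the Python standard library. Using HMAC, the secret value, and the 16-character truncation are assumptions for this sketch; the article does not prescribe a specific algorithm.

```python
import hashlib
import hmac

# Hypothetical secret for the demo; a real system would store this securely.
SECRET = b"demo-only-secret"

def mask_one_way(value: str, secret: bytes = SECRET) -> str:
    """Return a fixed-length token derived from `value` that cannot be decoded."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]   # shortened for readability

token_a = mask_one_way("123-45-6789")
token_b = mask_one_way("123-45-6789")
assert token_a == token_b              # same input, same token
assert "6789" != token_a[-4:] or True  # the token is unrelated to the digits
```

Because the same input always gives the same token, masked records can still be joined or counted, yet there is no decryption step that turns a token back into an SSN.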
Let’s better understand data masking methodology with the following example. Suppose we have two different types of users: administrator and business analyst, and the system- and data access-level privileges are different for each of them. Administrators can see and edit original source data. For business analysts, an SSN is not relevant (an employee ID might be enough to maintain referential integrity), and the system is designed so that outside working hours (9 a.m. to 5 p.m.), the business analyst cannot see or access original data such as a bank account number or SSN.
To keep it simple, we will use the simplest data masking methodology (sometimes called applying a data masking rule): For purposes of this example, replace the original data with “XXXXX” if the user is not authorized to see it. The above diagram demonstrates what data masking does. For the business analyst, the SSN is masked (XXXXX is displayed instead of the original data) and other sensitive information is masked outside business hours. Even though this is the most rudimentary form of data masking, the fundamental concept is the same: Obscure data from unauthorized users by applying a data masking rule or data masking algorithm, and the masking can be irreversible (the original data need not be retrievable from the masked data).
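The administrator and business analyst scenario can be sketched as a small masking rule. The field names, role strings, and the `view_record` function are hypothetical, and the sketch assumes that any role other than administrator is treated like the business analyst.

```python
from datetime import time

MASK = "XXXXX"
WORK_START, WORK_END = time(9, 0), time(17, 0)   # 9 a.m. to 5 p.m.

def view_record(record: dict, role: str, now: time) -> dict:
    """Return the copy of `record` that a user with `role` may see at `now`."""
    if role == "administrator":
        return dict(record)            # administrators see original data
    visible = dict(record)
    visible["ssn"] = MASK              # SSN is never shown to analysts
    in_working_hours = WORK_START <= now < WORK_END
    if not in_working_hours:
        visible["bank_account"] = MASK # hidden outside working hours
    return visible

employee = {"employee_id": 42, "ssn": "123-45-6789", "bank_account": "9876543210"}
daytime_view = view_record(employee, "business_analyst", time(10, 30))
evening_view = view_record(employee, "business_analyst", time(20, 0))
```

The stored record is never changed here; the rule only decides what each user sees at request time, which is why the employee ID remains available for referential integrity.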
Before concluding this article, let’s go back to the test at the start and see how the jumbled text was built. Take “opjuqzsdof bube”: reverse this string and shift each letter back by one in the alphabet, and we get the original string, “data encryption.” Because the steps can be undone, this behaves like encryption. Similarly, “data ;8^)!#g” is the masked version of the original string “data masking”; the masked characters cannot be converted back. So, let’s summarize data encryption and data masking:
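The reversible half of the puzzle can be checked with a short script. The function names are made up for this example, and only lowercase letters are shifted, which is all the puzzle string needs.

```python
def shift_letters(text: str, offset: int) -> str:
    """Shift lowercase letters by `offset` positions, wrapping around the alphabet."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        else:
            out.append(ch)   # spaces and other characters are left alone
    return "".join(out)

def encode(plain: str) -> str:
    """Reverse the string, then shift each letter forward by one."""
    return shift_letters(plain[::-1], 1)

def decode(coded: str) -> str:
    """Shift each letter back by one, then reverse the string."""
    return shift_letters(coded, -1)[::-1]

assert decode("opjuqzsdof bube") == "data encryption"
assert encode("data encryption") == "opjuqzsdof bube"
```

No comparable `decode` can be written for “data ;8^)!#g”, because the masked characters carry no information about the letters they replaced.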
Data encryption: Original cleartext source data is encoded (intermediate encrypted data). From this encoded data, the original data can be retrieved. Data encryption is useful for data at rest or in motion where real-time usability is not required.
Data masking: Original data is masked (obscured), and the results can be permanent (no need to reverse the masking). Data masking is a very fine-grained security approach to protecting field-level data attributes. It can leave the data highly portable for data records where confidential personally identifiable information (PII) may be redacted, while leaving other non-personally identifying data types open for use.
As an example, you could still run analytics on these records, such as trend analysis to understand credit health within a geographic region, without exposing PII.
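A small sketch of such an analysis follows. The records, field names, and scores are invented for illustration; the point is only that the PII columns are already masked while the region and credit score remain usable.

```python
from collections import defaultdict
from statistics import mean

# Made-up masked records: PII is redacted, non-identifying fields are intact.
records = [
    {"name": "XXXXX", "ssn": "XXXXX", "region": "Northeast", "credit_score": 710},
    {"name": "XXXXX", "ssn": "XXXXX", "region": "Northeast", "credit_score": 650},
    {"name": "XXXXX", "ssn": "XXXXX", "region": "Southwest", "credit_score": 720},
]

def average_score_by_region(rows: list) -> dict:
    """Group masked records by region and average their credit scores."""
    scores_by_region = defaultdict(list)
    for row in rows:
        scores_by_region[row["region"]].append(row["credit_score"])
    return {region: mean(scores) for region, scores in scores_by_region.items()}

summary = average_score_by_region(records)
```

The analysis never touches the name or SSN fields, so it works the same on masked data as it would on the originals.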
While the definitions can often overlap depending on the use case (such as data anonymization, data pseudonymization, or even hybrids such as format-preserving data encryption), this should give you a general idea of how to approach your next data privacy, data security, or data governance project.
Want to learn more about data masking such as the difference between persistent data masking, typically used for data anonymization, and dynamic data masking, which can be applied based on certain real-time contextual use cases? Be sure to check out our data masking solutions, along with data privacy solutions, which enable data protection to be orchestrated using these techniques.