Data Redaction: What It Is and When to Use It
What is Data Redaction?
Data redaction is a data masking technique that enables you to mask (redact) data by removing or substituting all or part of the field value. This helps protect sensitive personally identifying data.
Data Redaction Techniques
One of the first methods to protect sensitive information was column-based security, which ensures that a sensitive column is not exposed to a user without the proper privileges. This method, while effective, can cause problems for the calling application (such as a BI tool), which expects a certain number of columns to be returned from the query.
Redaction was one of the first methods to protect sensitive data while still returning a column value. Some redaction techniques are referred to as ‘simple masking’ because they are one-way substitution schemes. The most common use of redaction is to ‘redact’ the entire column and replace it with a constant. In this method, the query returns the proper number of columns, but the actual value in the column is replaced with a constant. For example, when applying redaction to a Social Security Number (SSN), the result might be ‘N/A’ or ‘XXX-XX-XXXX.’
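As a minimal sketch of full redaction, the snippet below replaces every SSN with a constant while leaving the column in place. The function name `redact_full`, the row layout, and the sample values are made up for illustration and are not taken from any particular product.

```python
def redact_full(value, constant="XXX-XX-XXXX"):
    """Replace the whole value with a constant, whatever it was."""
    return constant


# Hypothetical sample rows; the SSNs are placeholders, not real numbers.
rows = [
    {"name": "Susan", "ssn": "123-45-6789"},
    {"name": "Cathy", "ssn": "987-65-4321"},
]

# The "ssn" column is still present in every row, so a caller that
# expects two columns keeps working; only the value changes.
redacted_rows = [{**row, "ssn": redact_full(row["ssn"])} for row in rows]
```

Because every row receives the same constant, nothing about the original value can be recovered from the output.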
Another redaction technique uses a ‘look-up’ to find a value to put in the resulting column, instead of a constant. For example, a column ‘FirstName’ might have a value of “Susan,” and the look-up would pick a name from a random list and replace it with “Cathy.”
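A look-up redaction could be sketched as follows. The list `REPLACEMENT_NAMES` and the function `redact_lookup` are hypothetical, and excluding the original value from the candidates is a design choice added here so that the real name is never returned by chance.

```python
import random

# Hypothetical list of stand-in names used only for this illustration.
REPLACEMENT_NAMES = ["Cathy", "Maria", "Joan", "Priya", "Elena"]


def redact_lookup(value, choices=REPLACEMENT_NAMES, rng=None):
    """Return a randomly chosen stand-in that differs from the original."""
    rng = rng or random
    candidates = [name for name in choices if name != value]
    return rng.choice(candidates)


redacted_name = redact_lookup("Susan")
```

The replacement is drawn at random on each call, so the same input may map to different outputs across runs.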
Often, part of a sensitive value is useful on its own. With partial redaction, that part can be shown without exposing the entire value.
Putting Data Redaction Into Practice
We’ve all been on the phone when we are asked to verify ourselves using the ‘last 4 digits’ of our SSN. In this case, it is likely that the person asking you that question can only see the last 4 digits of your SSN. So, the redacted SSN (last 4 digits) has value in the verification process, but has been redacted enough to not be a direct identifier. In data privacy terms, we have turned a direct identifier (SSN) into an indirect or quasi identifier (SSN last 4 digits).
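The ‘last 4 digits’ case can be sketched in a few lines. The function name `redact_ssn_keep_last4` and the output format are assumptions for illustration; the sample SSN is a placeholder.

```python
def redact_ssn_keep_last4(ssn):
    """Mask the first five digits of an SSN and keep the last four."""
    digits = [c for c in ssn if c.isdigit()]
    if len(digits) != 9:
        raise ValueError("expected a 9-digit SSN")
    return "XXX-XX-" + "".join(digits[-4:])


masked_ssn = redact_ssn_keep_last4("123-45-6789")  # "XXX-XX-6789"
```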
Because it is not a direct identifier, we often are asked another question, like the last 4 digits of our phone number, and the same methodology applies. This technique of partial redaction goes beyond the SSN and phone number. Some examples are below:
- DOB: Simply replacing the ‘Date of Birth’ field with the ‘Year’ or the ‘Month/Year’ is a form of redaction. This could also be accomplished with other privacy techniques like generalization.
- Credit Card Number: With the first 6 digits of a credit card, the credit card provider can be determined (Visa, Amex…). By providing only the first 6 digits and redacting the rest, we have provided some analytical value, but have not exposed the actual credit card number.
- SKU: A clothing store might create an eight-character SKU with the first two characters representing the product category (such as t-shirts or jeans), the next two representing the style (such as slim fit or regular fit), and the next two representing the product color (such as “RE” for red or “BL” for blue). Depending on the use case, you could expose individual attributes of the product by redacting everything but the desired characters, without exposing the actual individual product.
- Healthcare ICD-10 codes: ICD-10 codes, which are used by physicians and other healthcare providers to classify and code all diagnoses, symptoms and procedures, contain key information in the first few characters. For example, S86.011D is the code for a “strain of the right Achilles tendon.” By keeping the first letter and redacting the rest, we retain some information without revealing the full meaning of the code.
In the example above, the letter “S” designates that the diagnosis relates to “Injuries, poisoning and certain other consequences of external causes.” The first three characters of the ICD-10 code above (S86) would reveal “Injury of muscle, fascia and tendon at lower leg level.” That is more information but, again, not the full code, and the exact injury is not revealed.
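The partial redaction examples above can be sketched with one small helper that keeps a prefix and masks the rest. The names `keep_prefix` and `dob_to_year`, the mask character, and the sample values (including the card number, a commonly used test number) are assumptions for illustration.

```python
from datetime import date


def keep_prefix(value, keep, mask_char="X"):
    """Keep the first `keep` characters; mask every later letter or digit."""
    head, tail = value[:keep], value[keep:]
    # Separators such as "." or "-" are left alone so the format stays readable.
    masked_tail = "".join(mask_char if c.isalnum() else c for c in tail)
    return head + masked_tail


def dob_to_year(dob):
    """Reduce a date of birth to its year only."""
    return dob.year


card = keep_prefix("4111111111111111", 6)   # "411111XXXXXXXXXX"
icd_chapter = keep_prefix("S86.011D", 1)    # "SXX.XXXX"
icd_category = keep_prefix("S86.011D", 3)   # "S86.XXXX"
birth_year = dob_to_year(date(1985, 7, 14)) # 1985
```

How many characters to keep is the main design decision: each extra character adds analytical value and also narrows down the original value.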
Key Considerations When Redacting Data
When designing a data privacy strategy, data redaction is often considered as a first step. This entails reviewing your sensitive data, and determining:
- What sensitive data should be de-identified with redaction?
- Which redaction technique should be used (full, partial, lookup)?
- Once redacted, does this field still maintain the data value/utility for analysis downstream?
When Not to Use Data Redaction
While data redaction can be incredibly powerful, it is also important to note when it is not the correct de-identification technique.
Redaction is typically not reversible. For example, if you have redacted all but the last 4 digits of an SSN and, after some analysis, decide you need the full, actual SSN, it cannot be recovered. If this reversal or re-identification to the real value is needed, other privacy techniques, such as tokenization, should be used.
If the redacted field is a unique or direct identifier, or a unique key in database terms, partial redaction can remove the ‘uniqueness.’ Depending on the use of the field, that can be problematic. For example, suppose a report rolls up transactions by 8-digit account number, and the account number is redacted to its first 4 digits. In this case, 12347777 and 12348888 both become 1234, so transactions for both accounts roll up under the same number, which is not the desired behavior.
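The roll-up problem can be reproduced with a short sketch. The transaction amounts and the helper `roll_up` are invented for illustration; only the two account numbers come from the example above.

```python
from collections import defaultdict

# Hypothetical transactions: (account number, amount).
transactions = [("12347777", 50.0), ("12348888", 20.0), ("12347777", 5.0)]


def roll_up(transactions, key=lambda account: account):
    """Sum amounts per account, after applying `key` to the account number."""
    totals = defaultdict(float)
    for account, amount in transactions:
        totals[key(account)] += amount
    return dict(totals)


by_account = roll_up(transactions)
# {"12347777": 55.0, "12348888": 20.0}

by_redacted = roll_up(transactions, key=lambda account: account[:4])
# {"1234": 75.0} -- two distinct accounts have merged into one group
```

A quick check before redacting a key column is to compare the number of distinct values before and after redaction; if the count drops, groupings and joins on that column will change.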
Use of data redaction should be a key component to any data privacy strategy.