There comes a time when data needs to be shared -- whether to evaluate a matter for research purposes, to test the functionality of a new application, or for any number of other business purposes. To protect the sensitivity or confidentiality of shared data, it often needs to be sanitized before it can be distributed and analyzed.
A popular and effective method for sanitizing data is called data anonymization. Also known as data masking, data cleansing, data obfuscation or data scrambling, data anonymization is the process of replacing the contents of identifiable fields (such as IP addresses, usernames, Social Security numbers and zip codes) in a database so records cannot be associated with a specific individual, project or company. Unlike the concept of confidentiality, which often means the subjects' identities are known but will be protected by the person evaluating the data, in anonymization, the evaluator does not know the subjects’ identities.
Thus, the anonymization process allows for the dissemination of detailed data, which permits usage by various entities while providing some level of privacy for sensitive information.
Data anonymization techniques
There are a number of data anonymization techniques that can be used, including data encryption, substitution, shuffling, number and date variance, and nulling out specific fields or data sets.
Data encryption is an anonymization technique that replaces sensitive data with encrypted data. The process provides effective data confidentiality, but also transforms data into an unreadable format. For example, once data encryption is applied to the fields containing usernames, "JohnDoe" may become "@Gek1ds%#$". Data encryption is suitable from an anonymization perspective, but it’s often not as suitable for practical use. Other business requirements such as data input validation or application testing may require a specific data type -- such as numbers, cost, dates or salary -- and when the encrypted data is put to use, it may appear to be the wrong data type to the system trying to use it.
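The data type problem described above can be illustrated without a full cipher. The following is a minimal sketch that uses a keyed hash (HMAC-SHA256 from Python's standard library) as a stand-in for encryption; unlike real encryption, it is one-way and cannot be decrypted. The key, the function names and the digits-only check are all hypothetical, chosen only to show how a masked salary stops looking like a number to the system that consumes it.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would be stored securely.
SECRET_KEY = b"example-key-not-for-production"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of the value as 64 hex characters."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def looks_like_whole_number(text: str) -> bool:
    """Stand-in for an application's input validation on a numeric field."""
    return text.isdigit()

salary = "52000"
masked_salary = pseudonymize(salary)

print(masked_salary)
print(looks_like_whole_number(salary))         # the original passes validation
print(looks_like_whole_number(masked_salary))  # the masked value is hex text
```

The masked value is always 64 characters of hexadecimal text regardless of the input, which is why a field expecting a salary, date or cost may reject it.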
Substitution consists of replacing the contents of a database column with data from a predefined list of fictitious values of a similar data type, so the data cannot be traced to the original subject. Shuffling is similar to substitution, except the anonymized data is derived from the column itself.
Both methods have their pros and cons, depending on the size of the database in use. In the substitution process, the integrity of the information remains intact (unlike the information resulting from the encryption process). But substitution can pose a challenge if the records consist of a million usernames, because an effective substitution requires a list with at least as many entries as there are records to substitute. In the shuffling process, the integrity of the data also remains intact, and the replacement values are easy to obtain since they come from the existing column itself. But shuffling can be an issue if the number of records is small, because there are few possible arrangements and the original one is easier to guess.
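The two techniques can be sketched side by side. This is a minimal illustration in Python, not the behavior of any particular tool; the function names, the sample usernames and the replacement list are all made up for the example.

```python
import random

def substitute(values, replacements, rng):
    """Replace each value with a distinct entry from a predefined list."""
    if len(replacements) < len(values):
        # The list must be at least as long as the column being replaced.
        raise ValueError("replacement list is shorter than the column")
    return rng.sample(replacements, k=len(values))

def shuffle_column(values, rng):
    """Return the column's own values in a random order."""
    shuffled = list(values)  # copy so the original column is untouched
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(42)  # fixed seed so the sketch is repeatable
usernames = ["jdoe", "asmith", "bwong", "cpatel"]
fake_names = ["user_a", "user_b", "user_c", "user_d", "user_e"]

print(substitute(usernames, fake_names, rng))
print(shuffle_column(usernames, rng))
```

With only four rows, a shuffle can leave some values in their original positions, which hints at why small tables are a poor fit for shuffling.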
Number and date variance are useful data anonymization techniques for numeric and date columns. The algorithm modifies each value in a column by some random percentage of its real value, altering the data enough that it can no longer be traced back to the original.
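As a rough sketch of variance, the Python below shifts numbers by a random percentage and dates by a random number of days. The 10% and 30-day limits, the function names and the sample data are assumptions made for illustration; expressing date variance as a day offset is also an assumption, since a percentage of a calendar date has no obvious meaning.

```python
import random
from datetime import date, timedelta

def vary_number(value, max_pct, rng):
    """Shift a number by a random percentage in [-max_pct, +max_pct]."""
    factor = 1 + rng.uniform(-max_pct, max_pct) / 100
    return round(value * factor, 2)

def vary_date(value, max_days, rng):
    """Shift a date by a random whole number of days in [-max_days, +max_days]."""
    return value + timedelta(days=rng.randint(-max_days, max_days))

rng = random.Random(7)  # fixed seed so the sketch is repeatable
salaries = [52000.00, 61500.00, 87250.00]
hire_dates = [date(2010, 3, 15), date(2012, 11, 2)]

print([vary_number(s, 10, rng) for s in salaries])
print([vary_date(d, 30, rng) for d in hire_dates])
```

A wider range hides the original values better but distorts totals and averages more, so the limits are a trade-off between privacy and usefulness.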
Nulling out consists of simply removing sensitive data by deleting it from the shared data set. While this is a simple technique, it may not be suitable if an evaluation needs to be performed on the data or the fictitious form of the data. For example, it would be difficult to query customer accounts if vital information such as customer name, address and other contact details has been replaced with null values.
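A short Python sketch shows both the simplicity of nulling out and its cost. The record layout, field names and `null_out` helper are hypothetical.

```python
def null_out(records, sensitive_fields):
    """Return copies of the records with the sensitive fields set to None."""
    return [
        {key: (None if key in sensitive_fields else value)
         for key, value in record.items()}
        for record in records
    ]

customers = [
    {"id": 1, "name": "John Doe", "address": "1 Main St", "balance": 120.50},
    {"id": 2, "name": "Jane Roe", "address": "9 Oak Ave", "balance": 87.00},
]

shared = null_out(customers, {"name", "address"})
print(shared[0])

# Looking a customer up by name no longer works on the shared copy.
print([r for r in shared if r["name"] == "John Doe"])  # prints []
```

Non-sensitive columns such as the balance survive, so aggregate analysis is still possible even though lookups by name are not.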
Data anonymization tools
I have often used anonymization when working with various IT vendors for troubleshooting purposes. Data generated from log servers, for example, cannot be distributed in its original format, so instead traceable information is anonymized using log management software. By initiating the anonymize function in the software, I am able to protect data in our logs, replacing identifying data such as usernames, IP addresses, domain names, etc. with fictional values that maintain the same word length and data type. For example, a variable originally defined as “firstname.lastname@example.org” will get converted into “email@example.com”. This allows us to share log data with our vendors without revealing confidential or personal information from our network.
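To give a sense of what such a function does, here is a minimal Python sketch that replaces IP addresses and email addresses in a log line with fictional values of the same length and character type. It is not the method used by any particular log management product; the regular expressions, function names and sample log line are assumptions for illustration.

```python
import random
import re
import string

# Simplified patterns; real logs may need stricter or additional ones.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def scramble(token, rng):
    """Swap letters for random lowercase letters and digits for random digits.

    Punctuation and length are kept. Scrambled IP octets may exceed 255.
    """
    out = []
    for ch in token:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)

def anonymize_line(line, mapping, rng):
    """Replace each email and IP address with a consistent fictional value."""
    def replace(match):
        token = match.group(0)
        if token not in mapping:
            mapping[token] = scramble(token, rng)
        return mapping[token]
    line = EMAIL_RE.sub(replace, line)
    return IP_RE.sub(replace, line)

mapping = {}  # remembers replacements so repeated values stay consistent
rng = random.Random(0)
log_line = "login user=jdoe@corp.example.com src=192.168.1.20"
print(anonymize_line(log_line, mapping, rng))
```

Reusing the same `mapping` across lines means a given address always maps to the same fictional value, so a vendor can still follow one host or user through the log without learning who it is.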
Some interesting tools in the data anonymization space are Anonymous Data 1.11 by Urban Software and Anonimatron, which is available on SourceForge.net. Both tools are freeware and can run on a Windows-based platform, while Anonimatron can also operate on Linux and Apple OS X systems.
In addition, I have worked with many IT security professionals who prefer to create custom scripts against files to anonymize their data.
Whatever your choice for data anonymization, the goal remains the same: to anonymize sensitive information. Although these anonymization techniques and tools do not fully guarantee anonymity in all situations, they provide an effective process to protect personal information and assist in preserving privacy.
With the growing need to share data for research purposes and the legal implications involved if due diligence is not properly conducted when releasing information, many organizations are now discovering the necessity and the benefits of data anonymization.
About the author:
Kellep Charles (@kellepc) is an information security analyst with over 15 years of experience in the areas of incident response, computer forensics, security assessments, malware analysis and security operations. He is completing his Doctorate in Information Assurance at Capitol College with a concentration in Artificial Neural Networks (ANN) and Human Computer Interaction-Security (HCISec).
Charles is the creator of SecurityOrb.com (@SecurityOrb), an information security & privacy knowledge-based blogsite. He holds the following industry certifications: Certified Information Systems Security Professional (CISSP), Cisco Certified Network Associate (CCNA), Certified Information Systems Auditor (CISA), National Security Agency - INFOSEC Assessment Methodology (NSA-IAM) and Information Technology Infrastructure Library version 3 (ITILv3).