One of the mechanisms to safeguard personally identifiable information (PII)* is to anonymize it. This means removing or obfuscating any identifying information about an individual in a dataset so that it can't be learned or deduced, while still allowing valid analysis of the dataset. If data is identifiable when it's collected, it will still be identifiable when it is stored or analyzed unless steps are taken to anonymize it. Anonymization can normally be attempted during collection, retention and disclosure, but any solution will be a balance between anonymity and dataset value, the goal being anonymity with minimal information loss.
Database anonymization is a lot harder than it may seem because even a combination of non-personal data can be exploited to deduce who a particular record belongs to. For example, even if a dataset has had individual customer names and email addresses removed, research shows that about half the U.S. population can be identified from just three pieces of information: date of birth, gender and place of residence. Successful identification rises to over 85% if the place is given as a ZIP code. Date of birth, gender and place would provide useful information for an advertising campaign, but together they could enable a salesperson to re-associate a customer with his or her purchase records (a re-identification disclosure). If these purchases happened to be for drugs to treat a particular illness, the salesperson could deduce that the customer had that illness (a predictive disclosure), resulting in a breach of his or her privacy.
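To make the re-identification risk concrete, the following minimal sketch measures what fraction of records are unique on a chosen combination of attributes. The records, field names and the `unique_fraction` helper are hypothetical, invented purely for illustration; a record that is unique on these attributes is one that anyone who knows those attributes could single out.

```python
from collections import Counter

# Hypothetical records with names and email addresses already removed.
records = [
    {"dob": "1975-03-14", "gender": "F", "zip": "02139", "purchase": "insulin"},
    {"dob": "1975-03-14", "gender": "F", "zip": "02139", "purchase": "bandages"},
    {"dob": "1982-11-02", "gender": "M", "zip": "02139", "purchase": "vitamins"},
    {"dob": "1990-07-21", "gender": "F", "zip": "94105", "purchase": "antacid"},
]

# Attributes that are not identifying on their own but may be in combination.
QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def unique_fraction(rows, keys):
    """Return the fraction of rows whose combination of `keys` occurs only once."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if counts[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


fraction = unique_fraction(records, QUASI_IDENTIFIERS)
print(f"{fraction:.0%} of records are unique on {QUASI_IDENTIFIERS}")
```

Running the same check on progressively smaller sets of attributes gives a rough sense of which fields contribute most to the risk in a particular dataset.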
Successful anonymization is difficult because seemingly unrelated attributes can, when put together, increase the risk of re-identification. There are a number of privacy models and statistical disclosure control techniques for data anonymization, such as k-anonymity, l-diversity and t-closeness, but the science, possibly even art, of data anonymization is still in its infancy. Most techniques trade privacy protection against the accuracy of the analysis they allow. For example, generalizing an attribute, where it's replaced by a less specific value such as age group instead of date of birth, is good practice, but limits the level of analysis that can be performed.
Another problem is background knowledge: an attacker may already know additional information about a specific individual, or may be able to use well-known facts, probability theory, correlational knowledge, demographic information and public records to make inferences about him or her, or even to reconstruct his or her personal data.
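As an illustration of generalization and of the k-anonymity model mentioned above, the sketch below replaces date of birth with a ten-year age band and then reports the size of the smallest group of records sharing the same attribute values (the dataset is k-anonymous for that k). The records, the fixed reference date and the helper names `age_group` and `k_anonymity` are assumptions made for the example, not part of any standard library or tool.

```python
from collections import Counter
from datetime import date


def age_group(dob, today, width=10):
    """Generalize a date of birth to an age band such as '30-39'."""
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymity(rows, keys):
    """Return the size of the smallest group of rows sharing the same `keys` values."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())


TODAY = date(2009, 9, 1)  # fixed reference date so the example is reproducible

# Hypothetical records, invented for illustration.
records = [
    {"dob": date(1975, 3, 14), "gender": "F"},
    {"dob": date(1978, 6, 2), "gender": "F"},
    {"dob": date(1971, 12, 30), "gender": "F"},
    {"dob": date(1983, 1, 9), "gender": "M"},
    {"dob": date(1986, 8, 17), "gender": "M"},
]

# Replace the exact date of birth with the less specific age band.
generalized = [
    {"age_group": age_group(r["dob"], TODAY), "gender": r["gender"]}
    for r in records
]

print("k before generalization:", k_anonymity(records, ("dob", "gender")))
print("k after generalization: ", k_anonymity(generalized, ("age_group", "gender")))
```

The cost of the higher k is visible in the data itself: any analysis that needs exact ages can no longer be run on the generalized records.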
There are also a number of situations where permanent anonymization of data would create practical difficulties. For example, if an individual withdrew midway through a study, it may be necessary to identify his or her records in the database in order to delete them. In cases such as this, depersonalization may be the only option. Depersonalization means that identifying information, such as the individual's name, is stored separately in an identification database that is linked to the research database holding the sensitive data.
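The depersonalization arrangement described above can be sketched as two separate stores joined only by a random key. This is a simplified, in-memory illustration under stated assumptions: the field names, the `depersonalize` and `withdraw` helpers and the sample participants are all hypothetical, and a real deployment would keep the identification store in a separately secured system.

```python
import secrets


def depersonalize(rows, identifying_fields):
    """Split rows into an identification store and a research store.

    The two stores are linked only by a random `subject_id`.
    """
    identification_db = {}
    research_db = []
    for row in rows:
        subject_id = secrets.token_hex(8)  # random key with no meaning of its own
        identification_db[subject_id] = {f: row[f] for f in identifying_fields}
        research = {f: v for f, v in row.items() if f not in identifying_fields}
        research["subject_id"] = subject_id
        research_db.append(research)
    return identification_db, research_db


def withdraw(name, identification_db, research_db):
    """Remove a participant from both stores; return the remaining research rows."""
    ids = {sid for sid, ident in identification_db.items() if ident["name"] == name}
    for sid in ids:
        del identification_db[sid]
    return [r for r in research_db if r["subject_id"] not in ids]


# Hypothetical study participants, invented for illustration.
participants = [
    {"name": "Alice Example", "email": "alice@example.com", "diagnosis": "asthma"},
    {"name": "Bob Example", "email": "bob@example.com", "diagnosis": "diabetes"},
]

id_db, research_db = depersonalize(participants, ("name", "email"))
research_db = withdraw("Alice Example", id_db, research_db)
print(len(id_db), "participant(s) remain;", len(research_db), "research record(s) remain")
```

Because the link between the two stores still exists, the research records remain re-identifiable by anyone with access to both, which is why this approach is weaker than permanent anonymization.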
The only truly effective way to prevent privacy breaches is to remove analytically valuable information from the dataset, so carefully consider whether any sensitive data needs to be included at all. You could use the 18 types of identifier specified under the Health Insurance Portability and Accountability Act (HIPAA), which include names and geographical subdivisions, as a basis for deciding which fields need additional safeguards if they are required in a dataset.
* PII is any piece of information that can potentially be used to uniquely identify, contact or locate a single person, such as a Social Security number, email address, credit card number or fixed IP address.
This was first published in September 2009