
Can data anonymization ensure privacy of Web application user data?

There are many regulations requiring an organization to protect the personally identifiable information (PII) that it may collect. In this tip, Michael Cobb explains why it may not be too early for data anonymization techniques to help protect Web application user data.

What is data anonymization and is it a concept that enterprises should employ to ensure the security and privacy of Web application user data?
Many laws and regulations require an organization to protect the personally identifiable information (PII) it collects. PII is any piece of information that can potentially be used to uniquely identify, contact or locate a single person, such as a Social Security number, email address, credit card number or fixed IP address. Web applications most often collect this type of information when a user either buys something from the Web site or registers to use the site's services. But this is not the only type of data that Web applications collect and store about their users. Products purchased, pages visited and advertisements clicked are just some of the many statistics often collected about a visitor. Although the majority of organizations do a good job of securing this data from attackers, users' privacy can still be put at risk when the data is analyzed.

As an example, let's take a pharmaceutical company that sells drugs on the Internet. The marketing department may want to mine the collected user data in order to fashion a new advertising campaign. To prevent privacy breaches through data inference, it is critical that this data be anonymized before it is analyzed. Data anonymization allows analysis to take place, but ensures that no sensitive information can be learned about a specific individual. The process is a lot harder than it may seem: even a combination of non-personal data can be exploited to deduce whom a record belongs to.

Using our example, even if the dataset given to the marketing department has had individual customer names and email addresses removed, research shows that about half the U.S. population can be identified from just three pieces of information: date of birth, gender and place. If the place is known at ZIP code level, the figure rises to 85%. Date of birth, gender and place would provide useful information for an advertising campaign, but taken together they could enable someone in the department to re-associate a customer with his or her purchase records, causing what is called a re-identification disclosure. If these purchases were for drugs to treat a particular illness, that person could deduce that the customer had the illness, resulting in a predictive disclosure and a breach of the customer's privacy.
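To make the re-identification risk concrete, the sketch below measures how many records in a dataset are singled out by a combination of seemingly non-personal fields. The sample records, the field names (dob, gender, zip) and the function unique_fraction are all hypothetical and chosen only for illustration; this is a rough check under those assumptions, not a complete risk assessment.

```python
from collections import Counter

# Hypothetical purchase records with names and email addresses already removed.
records = [
    {"dob": "1970-03-14", "gender": "F", "zip": "02139", "purchase": "drug A"},
    {"dob": "1970-03-14", "gender": "F", "zip": "02139", "purchase": "drug B"},
    {"dob": "1982-11-02", "gender": "M", "zip": "02139", "purchase": "drug A"},
    {"dob": "1955-07-21", "gender": "F", "zip": "10001", "purchase": "drug C"},
]

# Assumed set of fields that could identify a person when combined.
QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Return the fraction of rows whose combination of `keys` occurs only once."""
    if not rows:
        return 0.0
    # Count how many rows share each combination of quasi-identifier values.
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    # A row is at risk when nobody else shares its combination.
    unique = sum(1 for row in rows if counts[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)
```

In this sample, two of the four records have a combination of date of birth, gender and ZIP code that nobody else shares, so unique_fraction(records) returns 0.5. Anyone who already knows those three facts about a customer could look up that customer's purchases.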

When analyzing Web application data, it is important to anonymize it first and to consider carefully whether any sensitive data needs to be included at all. Unfortunately, data anonymization is still in its infancy. Disguising or hiding certain data in the original dataset can provide general privacy protection while still allowing reasonably accurate analysis. Instead of providing date of birth, for example, an alternative could be to use age groups. However, the only effective way to prevent disclosures like the one above is to remove analytically valuable information from the dataset, so stronger privacy comes at some cost to the analysis. Finally, another important warning: real customer data should never be used when testing a new system.
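The age-group idea mentioned above can be sketched as follows. The field names (name, email, dob), the ten-year band width and the helper functions age_group and anonymize are assumptions made for illustration; a real dataset may need different bands and additional fields removed or generalized.

```python
from datetime import date


def age_group(dob, today, width=10):
    """Map a date of birth to a coarse age band such as '30-39'."""
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(row, today):
    """Drop direct identifiers and replace the exact date of birth with an age band."""
    out = {k: v for k, v in row.items() if k not in ("name", "email", "dob")}
    out["age_group"] = age_group(row["dob"], today)
    return out


# Hypothetical customer record, used only to show the transformation.
customer = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "dob": date(1970, 3, 14),
    "gender": "F",
    "purchase": "drug A",
}
anonymized = anonymize(customer, today=date(2007, 11, 1))
```

Here the anonymized record keeps gender and purchase but carries only the band "30-39" instead of the exact birth date. Wider bands make individuals harder to single out, at the price of less precise analysis.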


This was last published in November 2007
