
Unmasking data masking techniques in the enterprise

Test and development environments can't use live data and keep it secure. That's where data masking comes in. Michael Cobb examines the principles behind data masking and why security pros should endorse its use to keep production data secure.

Company networks need regular patching; so do the applications that run on them. Combine network and application patches with version upgrades and you can easily see how important a test environment is to ensure all these changes don't break the corporate infrastructure.

In a test environment, however, using live production data is rarely permissible. Its use is often restricted by privacy legislation and security policies, and for good reason. In a production environment, strict access controls can be put in place, and the user interface provides a controlled, managed view of the data. Data security in non-production systems is typically less robust, to allow speed and flexibility during development and testing. Moreover, far more employees, such as developers and system engineers, need privileged, low-level access to the data. Clearly this type of environment doesn't meet the legal requirements for securing sensitive data.

Yet to be effective, a test environment has to use realistic data, and if an application handles half a million users and several million transactions, it's impossible to create the required volume of dummy data manually.

So how do you get hold of test data to put a newly patched or upgraded application through its paces before rolling it out to the live system? Many test teams turn to data masking to provide realistic data for non-production environments.

Data masking -- also known as data obfuscation, de-identification, depersonalization or data scrubbing -- aims to remove all identifiable and distinguishing characteristics to render data anonymous yet still usable and, most importantly, remove the risk of exposing or leaking sensitive information. The concept of data masking was first introduced in the early 1990s in an effort to provide development teams with meaningful test data without exposing sensitive information. Recent legal and compliance requirements, along with improved masking techniques that make it easier to recreate large data sets, have increased interest among enterprises.


To be effective from a security perspective, data masking techniques must preserve the privacy of individual records by changing the data so the actual values cannot be determined or reverse-engineered. The most common data masking techniques include encryption, shuffling, masking, substitution, variance and nulling. Shuffling randomly moves the data in a column between rows, while substitution replaces a column of data with information that looks similar but is completely unrelated to the real details, such as replacing all male names with names drawn from a random list. The variance technique can be used on numeric and date columns and modifies each value by some random percentage of its real value.
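As a rough illustration, the shuffling, substitution and variance techniques described above might be sketched in Python as follows. The table layout, column names and sample values are invented for the example; a real masking tool would operate against the database itself.

```python
import random

def shuffle_column(rows, column):
    """Shuffle one column's values across rows, breaking the link
    between each value and its original record."""
    values = [row[column] for row in rows]
    random.shuffle(values)
    for row, value in zip(rows, values):
        row[column] = value

def substitute_column(rows, column, replacements):
    """Replace each value with a random entry from an unrelated but
    realistic-looking list (e.g. a list of common first names)."""
    for row in rows:
        row[column] = random.choice(replacements)

def vary_column(rows, column, pct=0.10):
    """Adjust each numeric value by a random percentage (here up to
    +/-10%) of its real value."""
    for row in rows:
        delta = row[column] * random.uniform(-pct, pct)
        row[column] = round(row[column] + delta, 2)

# Invented sample data for demonstration
employees = [
    {"name": "Alice", "salary": 52000.0},
    {"name": "Bob", "salary": 61000.0},
]
substitute_column(employees, "name", ["John", "Mark", "Peter"])
vary_column(employees, "salary")
```

Note that shuffling preserves the exact set of original values (only their assignment to rows changes), whereas substitution and variance alter the values themselves; which property you want depends on the column being masked.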

True data masking, however, is a complex feat: it seeks to deliver anonymous yet usable test data that still has the look and feel of the original information, and a string of random, meaningless text is usually not sufficient. Encryption, for example, converts data into binary characters, so the data no longer looks realistic when plugged into applications, and it doesn't suit reports or printed output either. Substitution data, such as street names, can be hard to find in large quantities, while shuffling is only really viable on large data sets, and even then can still reveal sensitive data. For example, the highest salary in the HR database -- probably that of the CEO -- will still be visible, but would appear as if it were another employee's salary; anyone with access to the data could guess that the highest salary is likely the CEO's, so the information would leak out by inference.

While the variance approach offers a reasonable disguise for such data, it's vital that the range and distribution of values are kept within viable limits; it shouldn't create employees who are 150 years old, for example. Free-format text such as memos and notes is practically impossible to sanitize with any sort of data masking, so it has to be replaced with dummy text such as Lorem Ipsum.
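One simple way to keep variance output within viable limits, as described above, is to clamp the varied value to a plausible range. A minimal sketch, with the working-age bounds chosen arbitrarily for illustration:

```python
import random

def vary_age(age, pct=0.15, lo=18, hi=70):
    """Apply a random variance of up to +/-15%, then clamp the result
    to a plausible working-age range so masked records stay believable
    (no 150-year-old employees)."""
    varied = age + age * random.uniform(-pct, pct)
    return min(hi, max(lo, round(varied)))
```

Clamping slightly distorts the distribution at the edges of the range, but it guarantees that no masked value is obviously impossible, which matters when testers eyeball the data.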

Regardless of the data masking method used, it is critical that data structures and the relationships between database rows, columns and tables are preserved through each masking operation. For example, if the key to the employees table is EMPLOYEE_NUMBER, changes to it must trigger identical changes in all other related tables. Some data items have a structure that carries internal meaning, such as the checksum on a credit card number. The only way to sanitize this type of data is to shuffle it so that no row contains its original value but each data item remains internally valid; replacing it with a random collection of digits would cause any validity checks to fail and prevent testing of database updates. As you can see, there are plenty of issues to consider in order to get data masking right.
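An alternative to shuffling checksummed values, sketched here as an illustration rather than a recommendation, is to generate replacement digits and then append a freshly computed check digit, so the masked number still validates. Credit card numbers use the Luhn algorithm for their check digit; the number lengths and digits below are invented test data.

```python
import random

def luhn_check_digit(payload):
    """Compute the Luhn check digit for a string of digits
    (the checksum scheme used by credit card numbers)."""
    total = 0
    # Double every second digit, starting from the rightmost
    # digit of the payload
    for i, d in enumerate(reversed(payload)):
        n = int(d)
        if i % 2 == 0:
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return str((10 - total % 10) % 10)

def mask_card_number(card):
    """Replace a card number with random digits of the same length,
    ending in a valid Luhn check digit so internal validity checks
    in the application under test still pass."""
    payload = "".join(random.choice("0123456789")
                      for _ in range(len(card) - 1))
    return payload + luhn_check_digit(payload)
```

The same principle applies to the referential-integrity point above: whatever transformation is applied to a key such as EMPLOYEE_NUMBER must be deterministic (or recorded in a mapping table) so the identical replacement can be applied in every related table.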

Thankfully there are a growing number of data masking products aimed at automating the sanitization of large datasets. One of the top five data masking vendors according to Forrester Research Inc. is Camouflage Software Inc., which offers the Camouflage Data Masking Lifecycle Management Suite. Other vendors include DataGuise Inc. and Original Software Ltd, while Oracle Corp. offers a Data Masking Pack for its database applications. IBM meanwhile has developed a software tool called Masking Gateway for Enterprises (MAGEN), which uses optical character recognition and screen scraping to identify and cover up confidential data before it reaches a user's screen.

Data masking, when done correctly, not only demonstrates due diligence regarding compliance with data privacy legislation, but can also be an effective strategy for reducing the risk of data exposure from inside and outside an organization. It should be considered a best practice for any non-production databases and other testing environments, as it enables realistic data to be used for testing, training and software development, including off-site or cross-border projects.

About the author:
Michael Cobb, CISSP-ISSAP is the founder and managing director of Cobweb Applications Ltd., a consultancy that offers IT training and support in data security and analysis. He co-authored the book IIS Security and has written numerous technical articles for leading IT publications. Mike is the guest instructor for several Security Schools and, as a site expert, answers user questions on application security and platform security.

This was last published in August 2010
