data masking

By Michael Cobb

What is data masking?

Data masking is a method of creating a structurally similar but inauthentic version of an organization's data that can be used for purposes such as software testing and user training. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required.

Although most organizations have stringent security controls in place to protect production data in storage and in business use, that same data is sometimes used in operations that are less secure. The issue is often compounded if these operations are outsourced and the organization has less control over the environment. In the wake of compliance legislation, most organizations are no longer comfortable exposing real data unnecessarily.

Data masking replaces original values in a data set with randomized data using various data shuffling and manipulation techniques. The obfuscated data maintains the unique characteristics of the original data so that tests and analyses yield results comparable to those from the original data set.

How does data masking work?

Encryption is the best way to securely store and transfer sensitive data. Unfortunately, encrypted data is difficult to query and analyze. For example, you cannot filter users based on age if their date of birth is encrypted. As a result, organizations need another way to keep data secure and private when it is being used for research, development and testing.

Data masking, which is also called data sanitization, keeps sensitive information private by making it unrecognizable but still usable. This lets developers, researchers and analysts use a data set without exposing the data to any risk.

Data masking is different from encryption. Encrypted data can be decrypted and returned to its original state with the correct encryption key. With masked data, there is no key or algorithm that recovers the original values. Masking generates a characteristically accurate but fictitious version of a data set that has no value to hackers. When masking is done correctly, the data cannot be reverse engineered, and statistical outputs cannot be used to identify individuals.

As with data encryption, not every data field needs to be masked, although some fields must be completely hidden.

[Figure: How a cryptographic key works when encrypting a message.]

Why is data masking important?

Various data protection standards and regulations require that businesses and other organizations protect personally identifiable information, or PII, and protected health information and keep it confidential. These standards and regulations include the following:

California Consumer Privacy Act (CCPA)
General Data Protection Regulation (GDPR)
Health Insurance Portability and Accountability Act (HIPAA)
Payment Card Industry Data Security Standard (PCI DSS)

These regulations and standards play an important role in establishing appropriate levels of data protection and preventing unauthorized users from accessing the data. However, they also make it challenging for companies that want to analyze or share their data with others. Data masking reduces the risks of sensitive data being exposed and lets enterprises comply with various standards and regulations while handling regulated data.

Data masking techniques

A variety of data management techniques can be used to mask or anonymize PII and other private and sensitive data depending on the data type. These masking methods include the following:

Scrambling

Scrambling randomly reorders alphanumeric characters to obscure the original content. For example, a customer complaint ticket number of 3429871 in a production environment could appear as 9183247 in a test environment after being scrambled. Although scrambling is easy to implement, it only works on certain types of data. Data obfuscated this way is also less secure than data masked with other techniques.
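As a rough illustration of scrambling, the following Python sketch reorders the characters of a ticket number. The function name `scramble` and the fixed seed are assumptions made for this example only; a real masking tool would not use a predictable seed.

```python
import random

def scramble(value: str, rng: random.Random) -> str:
    """Return the characters of value in a random order."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

# Fixed seed only so the demo is repeatable (an assumption for illustration).
rng = random.Random(42)
ticket = "3429871"
masked_ticket = scramble(ticket, rng)

# Same digits as the original, in a different order. A shuffle can
# occasionally return the input unchanged, so real tools usually re-check.
print(masked_ticket)
```

Because the output contains exactly the same characters as the input, short or low-entropy values are easy to guess, which is one reason scrambling is considered weaker than other techniques.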

Substitution

This technique replaces the original data with another value from a supply of credible values. Lookup tables are often used to provide alternative values to the original, sensitive data. The values must pass rule constraints and preserve the original characteristics of the data.

It is harder to apply substitution than scrambling, but it can be applied to several data types and provides good security. For example, credit card numbers can be substituted with numbers that pass card provider validation rules.
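A minimal sketch of substitution for card numbers might look like the following. It assumes that "card provider validation rules" means the Luhn checksum, which is only one part of real card validation; the function names are hypothetical.

```python
import random

def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a number that lacks its last digit."""
    total = 0
    # Walk right to left, doubling every other digit starting with the rightmost.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def luhn_valid(number: str) -> bool:
    """Return True if the final digit matches the Luhn check digit."""
    return luhn_check_digit(number[:-1]) == number[-1]

def substitute_card(original: str, rng: random.Random) -> str:
    """Replace a card number with a random one of equal length that passes Luhn."""
    body = "".join(str(rng.randrange(10)) for _ in range(len(original) - 1))
    return body + luhn_check_digit(body)

rng = random.Random(7)  # fixed seed for a repeatable demo only
fake_card = substitute_card("4111111111111111", rng)
print(fake_card, luhn_valid(fake_card))
```

A production substitution routine would normally also respect issuer prefixes and draw replacement values from a managed lookup table rather than generating them ad hoc.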

Shuffling

Values within a column, such as user surnames, are shuffled to randomly reorder them. For example, if customer surnames are shuffled, the results look accurate but won't reveal any personal information. However, it is essential that the shuffling algorithm is kept secure so it cannot be used to reverse-engineer the data masking process.
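Shuffling can be sketched in a few lines. The example below assumes rows are held as Python dictionaries and uses made-up sample data; `shuffle_column` is a hypothetical helper, not part of any masking product.

```python
import random

def shuffle_column(rows, column, rng):
    """Return copies of rows with the values of one column randomly reassigned."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

customers = [
    {"id": 1, "surname": "Ng"},
    {"id": 2, "surname": "Okafor"},
    {"id": 3, "surname": "Silva"},
    {"id": 4, "surname": "Novak"},
]

rng = random.Random(3)  # fixed seed for a repeatable demo only
shuffled = shuffle_column(customers, "surname", rng)
for row in shuffled:
    print(row)
```

The set of surnames is unchanged, so column-level statistics survive, but each surname is no longer tied to its original row. On small tables, some rows may keep their original value by chance.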

Date aging

This method increases or decreases a date field by a specific date range. Again, the range value used must be kept secure.
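As a sketch of date aging, the following Python snippet shifts dates by a fixed number of days. The offset value and the name `SECRET_OFFSET_DAYS` are assumptions for illustration; in practice the offset would be stored securely rather than in source code.

```python
from datetime import date, timedelta

# Hypothetical offset; it must be kept secret, because anyone who knows it
# can shift the masked dates back to the originals.
SECRET_OFFSET_DAYS = -120

def age_date(d: date, offset_days: int) -> date:
    """Shift a date forward (positive) or backward (negative) by offset_days."""
    return d + timedelta(days=offset_days)

opened = date(1990, 5, 17)
closed = date(1990, 6, 1)
aged_opened = age_date(opened, SECRET_OFFSET_DAYS)
aged_closed = age_date(closed, SECRET_OFFSET_DAYS)

print(aged_opened)                 # 1990-01-17
print(aged_closed - aged_opened)   # interval between the two dates is preserved
```

Applying the same offset to every date in a record keeps intervals such as account age intact, which is often what tests depend on.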

Variance

A variance is applied to a number or date field. This approach is often used for masking financial and transaction value and date information. The variance algorithm modifies each number or date in a column by a random percentage of its real value. For instance, a column of employees' salaries could have a variance of plus or minus 5% applied to it. This would provide a reasonable disguise for the data while maintaining the range and distribution of salaries within existing limits.
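The salary example can be sketched as follows. This assumes a uniform random variance of up to plus or minus 5% per value; the function name and sample salaries are made up for illustration.

```python
import random

def apply_variance(values, pct, rng):
    """Shift each value by a random percentage in the range [-pct, +pct]."""
    return [round(v * (1 + rng.uniform(-pct, pct)), 2) for v in values]

salaries = [48000.0, 52500.0, 61000.0, 75000.0]
rng = random.Random(11)  # fixed seed for a repeatable demo only
masked_salaries = apply_variance(salaries, 0.05, rng)

for real, masked in zip(salaries, masked_salaries):
    print(real, "->", masked)
```

Each masked salary stays within 5% of the real figure, so the overall range and distribution remain plausible while exact values are hidden.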

Masking out

Masking out obscures only part of a value, typically by replacing characters with a fixed mask character. It is commonly applied to credit card numbers so that only the last four digits remain visible.
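A minimal sketch of masking out, assuming the mask character is "X" and that separators such as hyphens should be left in place (both are choices made for this example):

```python
def mask_out(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Replace all but the last `visible` characters with mask_char.

    Non-alphanumeric separators (spaces, hyphens) are left in place.
    """
    keep_from = max(len(value) - visible, 0)
    return "".join(
        ch if (i >= keep_from or not ch.isalnum()) else mask_char
        for i, ch in enumerate(value)
    )

print(mask_out("4111-1111-1111-1111"))  # XXXX-XXXX-XXXX-1111
print(mask_out("4111111111111111"))     # XXXXXXXXXXXX1111
```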

Nullifying

Nullifying replaces the real values in a data column with a null value, completely removing the data from view. Although this sort of deletion is simple to implement, the nullified column cannot be used in queries or analysis. As a result, it can degrade the integrity and quality of the data set for development and testing environments.
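The trade-off with nullifying can be seen in a short sketch. The rows and the `nullify` helper below are made up for illustration.

```python
def nullify(rows, column):
    """Return copies of rows with the given column replaced by None."""
    return [{**row, column: None} for row in rows]

customers = [
    {"id": 1, "surname": "Ng", "ssn": "000-00-0001"},
    {"id": 2, "surname": "Okafor", "ssn": "000-00-0002"},
]

nullified = nullify(customers, "ssn")

# The column still exists, but nothing can be filtered or joined on it.
matches = [row for row in nullified if row["ssn"] == "000-00-0001"]
print(nullified[0])
print(len(matches))  # 0
```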

[Figure: Examples of data masking techniques. Data masking takes different forms based on the technique used.]

Types of data masking

The process of masking data can be initiated in different ways depending on where and when the data is needed. Various types of data masking include the following:

Static data masking creates a separate masked data set from a production database that can be used in non-production environments, such as research, development and modeling. The masked data values must generate test and analytical results that mirror those of the original data and persist over time to ensure accurate and repeatable results.
Dynamic data masking provides role-based security, particularly in production systems, when users require real data. In response to a request for data, dynamic masking transforms, obscures or blocks access to sensitive information fields in real time based on the user's role. For example, when a financial call center operator handles a customer query, sensitive fields such as birth date, Social Security number and credit score are masked unless the operator has the privileges required to view those fields. Because dynamic masking streams data from the production environment in response to each request, it avoids the need to store the masked data in a separate database. However, if the data is located across multiple systems, masking consistency can be an issue.
On-the-fly data masking lets development teams read and mask a small subset of production data directly into a test environment. Data is masked as it is copied from one environment to another, so it is never present in an unmasked form in the target environment or the target database's transaction log. This approach eliminates delays incurred when a staging environment is used to prepare data. This makes it ideal for continuous software development environments.
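The role-based behavior of dynamic data masking can be sketched in Python as follows. The role names, privilege table and field names are assumptions based on the call center example; real products typically enforce this inside the database or a proxy rather than in application code.

```python
# Fields treated as sensitive in this sketch (taken from the call center example).
SENSITIVE_FIELDS = {"birth_date", "ssn", "credit_score"}

# Hypothetical mapping of roles to the sensitive fields they may see.
ROLE_PRIVILEGES = {
    "operator": set(),
    "supervisor": {"birth_date", "credit_score"},
}

def view_record(record: dict, role: str) -> dict:
    """Return the record as the given role is allowed to see it."""
    allowed = ROLE_PRIVILEGES.get(role, set())
    return {
        field: (value if field not in SENSITIVE_FIELDS or field in allowed else "****")
        for field, value in record.items()
    }

customer = {
    "name": "Helen",
    "birth_date": "1985-02-14",
    "ssn": "000-00-0001",
    "credit_score": 712,
}

print(view_record(customer, "operator"))
print(view_record(customer, "supervisor"))
```

The stored record is never altered; masking is applied to each response, which is why no separate masked database is needed.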

Data masking challenges

There are several challenges involved with data masking:

Complicated. Data masking is not a simple, one-step activity. The data must be transformed to eliminate the risk of exposing sensitive information through inference attacks. At the same time, the system must maintain the complexity and unique characteristics of the original unmasked data, such as frequency distribution. This ensures that queries and analysis yield the same results as they would from the original.

Referential integrity. Masked data must maintain referential integrity across systems and databases. Deterministic data masking achieves this by always replacing a given value with the same masked value, which keeps the masked output consistent across databases and servers. For example, the name "Helen" is always replaced by the value "Denise" wherever it occurs in the data being masked, which preserves primary and foreign keys and the relationships between them.

This may sound straightforward, but most databases are normalized for performance, and sensitive data gets stored in a variety of tables across one or more databases. Structured text, such as Extensible Markup Language (XML), can contain values that must be obfuscated, so they have to be parsed and masked.
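One common way to get deterministic masking is to derive the replacement from a keyed hash of the original value. The sketch below assumes an HMAC-SHA256 key shared by every masking job and a small, made-up lookup list of replacement names; neither detail comes from a specific product.

```python
import hashlib
import hmac

# Hypothetical lookup list; a real one would be much larger to limit collisions.
REPLACEMENT_NAMES = ["Denise", "Carol", "Priya", "Marta", "Aiko"]

def deterministic_mask(value: str, key: bytes) -> str:
    """Map a value to the same replacement every time for a given key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]

key = b"example-masking-key"  # placeholder; keep the real key secret

# The same input yields the same output in every table and database.
orders_name = deterministic_mask("Helen", key)
billing_name = deterministic_mask("Helen", key)
print(orders_name == billing_name)  # True
```

With a short lookup list, different originals can map to the same replacement, so columns that act as unique keys need a collision-free scheme instead.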

Governance. For data used in software testing, the masking process must ensure data still matches internal rules governing the data attributes, such as zip codes or bank account numbers. If care is not taken here, the application will fail an update test because the data validation checks fail.

To overcome these challenges, database administrators must do a detailed review of the data being masked. Others involved with the data -- developers, testers, data scientists and security teams -- should contribute to the review. Input from these stakeholders ensures the appropriate masking techniques are used, data sources of valid replacement values are generated, referential integrity is maintained across all systems and the masked data maintains the characteristics of the original data.

Data masking best practices

Data masking is an important part of how organizations comply with privacy regulations while still being able to mine data for valuable insights. However, the ever-growing volume of structured and unstructured data that organizations collect has increased the complexity of data masking at scale. To remain compliant, it is essential to follow best practices:

Organize and track. Enterprise data is spread across a variety of technologies and locations in different databases, tables and columns. To ensure the correct data is protected, it's important to locate and classify sensitive data that requires protection.
Consider unstructured data. Images, PDFs, and XML- and text-based files must be protected. Images of items, such as passports, driver's licenses, receipts, checks and contracts, must be replaced with fake alternatives. Optical character recognition can be used to detect and mask sensitive content in files.
Secure. Access given to masked data must comply with security policies related to roles, locations and permissions.
Assess. Test the results of data masking techniques to ensure they provide the correct security level and that query results are comparable to those from the original data.
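The assessment step can be partly automated. The following sketch compares summary statistics of an original and a masked numeric column; the 5% tolerance, the choice of statistics and the sample figures are all assumptions for illustration.

```python
import statistics

def relative_diff(a: float, b: float) -> float:
    """Relative difference between a and b, measured against a (a must be nonzero)."""
    return abs(a - b) / abs(a)

def compare_stats(original, masked, tolerance=0.05):
    """Report whether each statistic of the masked data is within tolerance."""
    checks = {
        "mean": statistics.mean,
        "median": statistics.median,
        "stdev": statistics.stdev,
    }
    return {
        name: relative_diff(fn(original), fn(masked)) <= tolerance
        for name, fn in checks.items()
    }

original_salaries = [50000, 62000, 71000, 58000]
masked_salaries_check = [51000, 60500, 72500, 57000]

report = compare_stats(original_salaries, masked_salaries_check)
print(report)
```

A check like this covers only query comparability. Confirming that the masked data provides the correct security level requires a separate review.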

Data masking use cases

Data masking is used in many industries to drive innovation and improve services. For example, the banking and finance industries use it to develop and test new systems as well as improve fraud detection algorithms.

While data masking is used for different reasons, the main drivers are data security and personal data privacy.

Data breach prevention. Data obfuscation techniques are an important defense against the impact of data breaches. Even if a network's defenses fail to prevent an attacker from exfiltrating sensitive data, the attacker won't be able to associate correctly masked or encrypted data with individuals.
Privacy by design. Implementing data masking as part of the application development, testing and analysis lifecycles lets important data be shared both internally and externally while remaining compliant with data privacy regulations.
Role-based access control. Many tasks require access to some, but not all, of a subject's data attributes. Data masking ensures employees can perform their duties without seeing data they are not authorized to view.
Faster, safer test data. Masked data retains the integrity and quality needed for testing without compromising the actual data. Automated data masking can be integrated with existing systems to remove time-consuming manual steps. This way, test data can be secured and released from a production environment quickly and without disruption. Data masking ensures production data is not exposed by more relaxed nonproduction environments or third-party developer access.

[Figure: Data management best practices that help prevent data breaches.]

Many database vendors offer features and services that simplify data masking and integrate it with existing systems.

Find out more about keeping data safe beyond data masking in our comprehensive guide to data security.

This was last updated in September 2022
