
Understanding tokenization: What is tokenization and when to use it

Tokenization protects sensitive data to reduce the compliance burden.

ISM April 2012 Issue

What's the best way to protect sensitive data from being stolen? Remove it entirely: If data is not present in a system, it can’t be stolen. A steady stream of data breaches demonstrates that IT systems are under attack and underscores how even firms with good security knowledge get it wrong. Just ask Sony, RSA, HBGary Federal, or Stratfor. Information security is hard to do, and the complexity of IT operations hinders our ability to protect data.

So why don't we just get rid of sensitive data? No, I am not saying you should delete all your data in order to protect it. I am saying replace sensitive information with something that, if stolen, doesn’t matter. That’s precisely what tokenization does. Tokenization technology removes sensitive data and replaces it with a worthless token. IT systems use the token placeholder as a reference, continuing to function as before, but the risk of leaking information is greatly reduced.

There has been quite a buzz around tokenization in heavily regulated industries such as payment processing. Credit card numbers are a principal target for attackers, with repeated thefts leading the payment card industry to mandate the Data Security Standard (PCI DSS) for all merchants and payment processors. The cost of compliance is significant: these requirements force expensive changes to network setup, security controls and daily operations. All companies subject to data security regulations are looking for ways to both simplify controls and -- directly or indirectly -- reduce the cost of compliance.

Tokenization technology: Common use cases
It’s for these reasons we have seen a dramatic uptake in tokenization technology in response to PCI DSS, and are beginning to see adoption in response to HIPAA and state-mandated personal information protection. Under these use cases, sensitive information is replaced with a token, which acts as a reference to the real data. The sensitive data is kept in a different location, in a heavily secured system that can only be accessed under special circumstances. Tokenization stops the proliferation of sensitive data across IT systems without harming business processes or applications. Once credit card numbers, Social Security numbers and intellectual property are removed, they're inaccessible, and a hacker can't steal what's not there.

Conceptually, tokenization has been around for decades. In the last couple of years, merchants and payment processors have implemented tokenization as a means of protecting payment data. It helps secure payment data by removing it from systems that don't require access to the data, reducing the complexity of audits and saving money in the process. Information systems can be operated with fewer controls and restrictions, as you've reduced risk by removing the data that attackers want.

Tokenization helps with data security issues, but the technology is most commonly purchased to address compliance concerns. The three most common uses of the technology are:

  • Payment card data security: Substitution of primary account numbers (a.k.a. credit card numbers) and related information with a single token to represent a financial transaction. In-house tokenization systems are used to support the special needs of back-office business processing systems, to enhance performance or to reduce per-transaction costs.
  • Tokenization as a Service: Third-party service providers provide tokens, store sensitive data on customers' behalf, and completely remove all sensitive data from the customer site. Tokenization as a Service is most commonly used with payment card services because it dramatically reduces PCI audit costs.
  • Protecting PII: Information associated with passports, national identity documents, Social Security numbers, driver's license numbers, 401(k), pension and medical benefits data.

We are beginning to see tokenization applied to more complex data sets, such as medical and patient data, for HIPAA and HITECH compliance. There is a genuine need to protect this information, and with fines being levied for improper data controls, organizations are beginning to take these regulations seriously. Today the lack of adoption is largely due to tokenization's technical limitations in handling multiple types of complex data used by many different audiences. In the payment space, a token represents a single credit card used for a single financial transaction. Medical data is far more complex, with information derived from the raw data in multiple ways. Worse, insurance companies, doctors, nurses, medical technicians, government health organizations and clinicians all need access to select portions of that data.

How tokenization works
Think of a subway train or arcade where you use tokens instead of cash. In these systems you purchase tokens and then use a token of limited value to play games or ride the train. While the token has a one-to-one relationship with the currency used to purchase it, the tokens only have value in that system, and nowhere else. Data tokens are similar in that they replace a valuable item, but in data security applications the token is purely a reference without any intrinsic value: its worth is solely in the reference back to the original data.

Tokenization works by creating a random marker -- the token -- and then substituting it for the sensitive data within files and databases, so the marker represents that specific value. As the tokens are arbitrary random values, they can be created in any form, for any data type, that the user chooses. This means the tokens can “look” and “act” just like the data they replace. For example, a token that replaces a credit card number can be created as a 16-digit number that passes a Luhn check, but it’s not a real card number and can’t be used to make purchases. This is critical, as it means the applications, databases and other associated systems do not need to be altered to accept the tokens. The beauty of tokenization is that the risks surrounding sensitive data are addressed and business systems continue to function without costly changes.
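To make the "looks and acts like a card number" point concrete, here is a minimal Python sketch of a random 16-digit token that passes the Luhn check. The function names are hypothetical and not taken from any product. The sketch also assumes, for simplicity, that nothing else is required of the token; a real system would at least have to make sure a token cannot be mistaken for a live card number.

```python
import secrets


def passes_luhn(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    # Walk the digits right to left, doubling every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def luhn_check_digit(payload: str) -> str:
    """Find the single trailing digit that makes payload + digit pass Luhn."""
    for d in "0123456789":
        if passes_luhn(payload + d):
            return d
    raise AssertionError("unreachable: exactly one digit always fits")


def random_card_like_token(length: int = 16) -> str:
    """Build a random digit string of the given length that passes Luhn."""
    payload = "".join(secrets.choice("0123456789") for _ in range(length - 1))
    return payload + luhn_check_digit(payload)


token = random_card_like_token()
```

Because the token has the same length, character set and checksum as a card number, a database column or validation routine written for card numbers should accept it unchanged.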

The following diagram shows the basic architecture of a token system and illustrates a typical transaction.

Basic tokenization architecture

  1. The application collects or generates a piece of sensitive data.
  2. The data is immediately sent to the tokenization server -- it is not stored locally.
  3. The tokenization server generates the random or semi-random token. The sensitive value and the token are stored in a highly secure and restricted database (usually encrypted).
  4. The tokenization server returns the token to the application.
  5. The application stores the token, rather than the original value. The token is used for most transactions with the application.
  6. When the sensitive value is needed, an authorized application or user can request it, or request the token server to use the real data on their behalf. The value is never stored in databases local to the calling application. Token database access is highly restricted, dramatically limiting potential exposure.
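The six steps above can be sketched in a few lines of Python. This is an illustration only: the `TokenVault` class, the in-memory dictionaries and the `order-1001` key are all hypothetical stand-ins, and the sketch assumes one token per sensitive value. A real token server would sit behind a network interface with an encrypted, access-controlled database.

```python
import secrets


class TokenVault:
    """In-memory stand-in for the tokenization server and its vault."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Assumption: a value that was already tokenized gets the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # Step 3: generate a random token and store the pair in the vault.
        token = secrets.token_hex(8)
        while token in self._token_to_value:  # regenerate on collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        # Step 4: return only the token to the caller.
        return token

    def detokenize(self, token: str) -> str:
        # Step 6: look up the original value (authorization omitted here).
        return self._token_to_value[token]


vault = TokenVault()
application_db = {}

card_number = "4111111111111111"      # step 1: the application collects it
token = vault.tokenize(card_number)   # step 2: sent straight to the server
application_db["order-1001"] = token  # step 5: only the token is stored
```

Note that `application_db` never holds the card number; the only way back to it is through the vault.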

What is a token?
A token is a random string used as a surrogate or proxy for some other data. There is no direct mathematical relationship between the original value and the random token, so the original data cannot be determined from the token. The association between the token and the original value is maintained in a database -- called a token vault -- that provides both security for the original data and the relationship between real data and tokens. Outside of this vault, there is no other way to connect the two values.

The preferred method to create tokens is with a random number or alphanumeric value. Completely random tokens offer the greatest security, as the content cannot be reverse engineered. Some vendors use sequence generators or hashing to generate tokens, but both are subject to different forms of attack and can be compromised.
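As one illustration of why hashing is weaker than a random token, the following Python sketch guesses a card number back from an unsalted hash. It is a sketch under stated assumptions, not a description of any vendor's scheme: it assumes the "token" is a plain SHA-256 of the card number and that the attacker already knows the leading digits and the last four, which leaves only a small space to search. The function names are hypothetical.

```python
import hashlib


def hash_token(card_number: str) -> str:
    """Unsalted SHA-256 used as a 'token' (the weak approach shown here)."""
    return hashlib.sha256(card_number.encode("ascii")).hexdigest()


def recover_card_number(token, prefix, last_four, length=16):
    """Try every value for the unknown middle digits until the hash matches."""
    unknown = length - len(prefix) - len(last_four)
    if unknown <= 0:
        raise ValueError("no unknown digits left to search")
    for n in range(10 ** unknown):
        candidate = f"{prefix}{n:0{unknown}d}{last_four}"
        if hash_token(candidate) == token:
            return candidate
    return None


stolen_token = hash_token("4111111111111111")
# The attacker knows the first eight digits and the last four.
guess = recover_card_number(stolen_token, "41111111", "1111")
```

A random token offers no such shortcut, because there is nothing to compute: the only link between token and card number is the entry in the vault.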

Tokens are an arbitrary random value, so you can choose any data type or length you want. In many cases, customers choose tokens that are combined with a portion of the original data -- say the last four digits of a credit card number -- with the remainder being a random value. In this way, tokenization can be used without breaking existing applications.
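A token that keeps the last four digits might be built along these lines. This is a minimal Python sketch with a hypothetical function name; it assumes the input is a card number that may contain spaces or hyphens, and it makes no attempt to satisfy a Luhn check.

```python
import secrets


def token_preserving_last_four(card_number: str) -> str:
    """Replace all but the last four digits with random digits."""
    digits = card_number.replace(" ", "").replace("-", "")
    random_part = "".join(
        secrets.choice("0123456789") for _ in range(len(digits) - 4)
    )
    return random_part + digits[-4:]


token = token_preserving_last_four("4111 1111 1111 1234")
```

An application that shows customers "card ending in 1234" can keep doing so from the token alone, without ever requesting the real number.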

Encryption vs tokenization
Some of you may be asking how tokenization differs from encryption, and if tokenization provides advantages. To answer those questions, it’s important to first contrast the two technologies. Tokenization is a method of replacing sensitive data with non-sensitive placeholders: tokens. Encryption is a method of protecting data by scrambling it into an unreadable form. It’s a systematic encoding process that is only reversible with the right key. Correctly implemented, encryption is nearly impossible to break, and the original data cannot be recovered without the key.

Format-preserving encryption (FPE) is a viable alternative to tokenization, as it accomplishes the same goals, but there are two crucial differences. First, tokens are not directly reversible and encrypted values are. If the encryption system is poorly implemented or the algorithm succumbs to a mathematical attack, the encrypted data will be compromised. Second, encryption has more moving parts, is more complicated and is therefore harder to audit for proper deployment. It’s much easier to make mistakes and it’s much harder to identify those mistakes after they occur.

Ultimately, attackers are smart enough to go after the weakest part of either system, targeting the encryption key store or the token vault. Accessing these systems is the only practical way to break security. If the systems are set up in such a way that key management or de-tokenization services are not available, theoretically, both are equally -- and very -- secure. In practice, the complexity of the encryption system leaves more chance for errors, and the ubiquity of key services makes unwanted decryption more likely.

Tokenization servers
Tokenization servers provide several key services to support the tokenization of sensitive data, secure storage, data references and de-tokenization requests. When evaluating a tokenization server -- or service -- here are the key areas for evaluation:

  • Tokenization requests: How does the tokenization server get inserted into your data processing workflow? How fast can it process requests?
  • Authentication: How does the tokenization system integrate with your identity and access management systems? Can the token server provide strong separation of duties between the administration, tokenization and de-tokenization users?
  • Token storage: Tokens are typically stored in a relational database, and the contents of that database need to be kept safe from unauthorized access to the database or any of its data files. Encryption is the first line of defense, so reviewing the encryption method and key management facility is critical in understanding if the credit cards in the token vault are secure.
  • De-tokenization: Some merchants, for re-payment or customer support efforts, need access to the original credit card number. This is called a ‘de-tokenization’ request, as authorized users submit a token and get a credit card number in return. De-tokenization requests are the most critical to data security, so it’s important to understand the interface that makes these requests, which users are allowed to make them, how these events are logged, and how the de-tokenization interface is installed and secured.
  • Integration and deployment: How does the token server integrate with your systems, how is it secured and how is it maintained? Key areas to investigate are the application interfaces, scalability, failover and administration.
  • Migration: You need to swap existing sensitive data with tokens. There are many ways to do this, and not every vendor does it well. Investigate how your vendor performs this task, how long it will take, and if it can be done concurrently with on-going transaction processing.
  • Price: Some vendors sell tokenization platforms under software licensing models, charging for the number of CPUs. Others charge on a per-transaction basis, with the cost per transaction going down as the volume increases.
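The authentication and de-tokenization questions above come down to two checks on every request: is this caller allowed, and was the request recorded? The Python sketch below shows one way that could look. The `GuardedVault` class, the `"detokenize"` role name and the log fields are all hypothetical; a real token server would tie into your identity and access management system rather than take roles as an argument.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class DetokenizationDenied(Exception):
    """Raised when the caller lacks the de-tokenization role."""


@dataclass
class GuardedVault:
    records: dict                                  # token -> original value
    audit_log: list = field(default_factory=list)  # one entry per request

    def detokenize(self, token: str, user: str, roles: set) -> str:
        allowed = "detokenize" in roles
        # Record every request, granted or denied, before acting on it.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "token": token,
            "allowed": allowed,
        })
        if not allowed:
            raise DetokenizationDenied(f"{user} may not de-tokenize")
        return self.records[token]


vault = GuardedVault({"tok_1": "4111111111111111"})
support_result = vault.detokenize("tok_1", "alice", {"detokenize"})
```

Keeping the role check and the log entry in the same code path means a denied request still leaves a trace, which is the kind of behavior worth asking a vendor about.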

Tokenization as a service
A growing trend is tokenization provided as a third-party service. It’s now the number one option for merchants within the largest market for tokenization: payment processing. When provided as a service -- assuming tokenization is properly implemented as a black box where no sensitive data or security controls are exposed -- customers remove all sensitive payment data storage from their systems. Credit cards are collected and tokens issued without the customer ever touching payment data. Services related to the original data, like repayments, are handled at the third-party provider’s facility, eliminating the need for de-tokenization. The customer is never exposed to the data or any of the security measures around the token data vault. For PCI DSS compliance, this method provides the maximum possible scope reduction.

Streamlining compliance
If you are considering tokenization, we can assume you want to reduce exposure of sensitive data while saving some money by curtailing security requirements across your IT operation. Tokenization is in and of itself a complex system with its own set of security requirements, so it’s worth stressing that the implementation is critical to security, and in no way should it be taken lightly. You’ll need to address a handful of questions to determine if tokenization is right for you: Does this meet my business requirements? Is it better to use an in-house application or choose a service provider? Which applications need token services, and how hard will they be to set up?

However, the way tokenization works is very simple, and the value proposition is straightforward. By removing sensitive information from all of the systems that don’t require it, you’ve dramatically reduced the risk to that data. Tokenization helps secure data by reducing the likelihood it will be stolen, and reduces the work necessary to secure systems and meet regulatory compliance.

About the author:
Adrian Lane is CTO and security strategist for information security research and analysis firm Securosis. Send comments on this article to [email protected].
