Inside DLP: Full-suite products, DLP lite, content analysis

Data loss prevention (DLP) can be a confusing technology. Security expert Rich Mogull discusses the difference between DLP and DLP lite, as well as the ins and outs of content analysis.

Data loss prevention (DLP) may seem confusing and complex, but that's more the result of a relatively generic-sounding term combined with vendor marketing programs glomming onto any term they think will sell a product.

Data loss prevention technology itself is straightforward once it's broken down, even accounting for the differences between dedicated tools and DLP-lite features. Let's take a look inside DLP technology and its characteristics.

Defining DLP

Content analysis is the defining characteristic of DLP. If a tool doesn't include content analysis -- however basic -- it isn't DLP.

Our definition of full-suite DLP reads:

"Products that, based on central policies, identify, monitor and protect data at rest, in motion and in use, through deep content analysis."

This encapsulates the three defining characteristics of the technology:

  • Deep content analysis
  • Broad content coverage across multiple platforms and locations
  • Central policy management

Partial-suite tools include deep content analysis and central policy (and incident) management, but only on a single platform (such as endpoints). DLP-lite tools include some basic content analysis, but typically lack dedicated workflow and sacrifice broad coverage. It's therefore easiest to describe a full DLP tool, because that gives you the knowledge you need to evaluate the other options, which are subsets of it.

Content analysis

Content analysis is a three-part process:

  1. You first capture the data,
  2. Then you crack the file format or reconstruct the traffic, and
  3. Last, you perform analysis using one or more of a variety of techniques to identify policy violations.

Nearly all DLP tools capture contextual data as well -- such as source and destination -- to also include it in analysis. Since we'll discuss how to monitor and capture data with collectors in the technical architecture, let's start with file cracking.

File cracking is an unofficial term for parsing the textual data out of a source so it can be passed on to the content analysis engine. The analysis engines need to work with text, but many of the file and data formats we use on a daily basis -- such as Office documents or PDF files -- are binary data. The file cracker takes a file, determines the format, then uses a parser to extract all the text. Some tools support hundreds of file types, including complex situations like documents embedded in other documents and then wrapped up in a .zip file. It's the job of the collector to assemble the file and pass it on for cracking. This can be as simple as handing over a stored file, or it may be more complex, such as pulling out a document streaming to a cloud service over HTTP.
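
As a rough illustration of the cracking step, here is a minimal Python sketch. It assumes only two "formats": .zip archives (detected by their leading magic bytes and unpacked recursively) and UTF-8 text. The names `crack` and `make_zip` are hypothetical, and a commercial file cracker would dispatch to a dedicated parser for each of the hundreds of formats it supports.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # leading bytes of a non-empty .zip archive


def make_zip(files: dict[str, bytes]) -> bytes:
    """Demo helper: build an in-memory .zip from {name: content}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def crack(name: str, data: bytes, depth: int = 0, max_depth: int = 5) -> list[tuple[str, str]]:
    """Return (path, text) pairs for every piece of text found in `data`."""
    if depth > max_depth:
        return []  # guard against deeply nested archives
    if data.startswith(ZIP_MAGIC):
        found = []
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                for member in archive.namelist():
                    if member.endswith("/"):
                        continue  # skip directory entries
                    found += crack(f"{name}/{member}", archive.read(member),
                                   depth + 1, max_depth)
        except zipfile.BadZipFile:
            return []  # looked like a zip but could not be opened
        return found
    try:
        return [(name, data.decode("utf-8"))]
    except UnicodeDecodeError:
        return []  # unknown binary: a real cracker would try other parsers here


# A document wrapped in an archive that is itself inside another archive.
inner = make_zip({"memo.txt": b"quarterly numbers"})
outer = make_zip({"notes.txt": b"hello", "attachments.zip": inner})
extracted = crack("upload.zip", outer)
```

Each extracted pair keeps the path through the nested containers, so a later policy match can be traced back to the embedded document it came from.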

Once the file is opened up, the content analysis engine evaluates the text and looks for policy matches. On occasion, the tools will look for a binary match, as opposed to a textual one, for data like audio and video files, but the textual analysis is where the real innovation is.

Seven content analysis techniques are commonly available:

  1. Rules/regular expressions use textual analysis to find matching patterns, such as the structure of a credit card or Social Security number. Some of these rules and regular expressions can be quite complex to minimize false positives. This is the technique we see most commonly in DLP-lite features. While it can work well, it's prone to false positives, especially in large environments. You would be surprised at how many things match the format of a valid credit card number.
  2. Database fingerprinting (exact data matching) pulls data from a database and looks only for matches of that specified data. Thus, you could load it with hash values of your customers' credit card numbers and stop seeing false positives when your employees order decorative teacups off a website with their personal cards. Database fingerprinting dramatically reduces false positives but works only when you have a good data source. Because of its system requirements, it usually can't run on endpoints when the data set is large.
  3. Partial document matching takes a source file, parses out the text and then looks for subsets of that text. It usually creates a series of overlapping hashes that allow you to do things like identify a single paragraph cut out of a protected document and pasted into a Web mail session. Like database fingerprinting, depending on the size of your data set, it may not work well on endpoints due to performance requirements. However, it can handle very large sets when running on a server or appliance.
  4. Binary file matching creates a hash of a binary file. This is the technique used for protecting nontextual data, but it is the most prone to false negatives because even a minor change to a file produces a different hash value that no longer matches.
  5. Statistical analysis is a newer technique that uses machine learning or other techniques to analyze a set of known "protected" data and known "clean" data to create rules for near-matches. This is similar to antispam (and based on the same math). Most other techniques require you to know exactly what to protect, whereas statistical analysis allows you to protect things that resemble known sensitive data but might not be a direct match. The tradeoff is that it is the technique most prone to false positives.
  6. Conceptual/lexicon analysis uses a combination of dictionaries, rules and other analyses to protect information classes that resemble a concept. For example, this could include finding indications of insider trading or job hunting by employees on the corporate network. It is the weakest of the techniques, due to the less clearly defined nature of a concept.
  7. Categories are prebuilt rule sets for common data types -- such as credit card numbers or healthcare data -- that many organizations want to protect. They allow you to kick start your DLP project without having to build all your policies by hand, and you can tune them over time to better fit your requirements.
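
The first three techniques in the list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: the card pattern, the Luhn checksum filter, the salted SHA-256 fingerprints and the four-word hashing window are all choices made for the example, and every function name is hypothetical.

```python
import hashlib
import re

# Technique 1: a pattern for 13 to 19 digits with optional space/hyphen separators.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used here to discard digit runs that only look like cards."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# Technique 2: match only values whose hash appears in a known data source.
def fingerprint(value: str, salt: bytes) -> str:
    return hashlib.sha256(salt + value.encode()).hexdigest()


def exact_matches(text: str, known_hashes: set[str], salt: bytes) -> list[str]:
    return [card for card in find_card_numbers(text)
            if fingerprint(card, salt) in known_hashes]


# Technique 3: overlapping hashes over sliding windows of words.
def shingle_hashes(text: str, window: int = 4) -> set[str]:
    words = re.findall(r"\w+", text.lower())
    count = max(len(words) - window + 1, 1)
    return {hashlib.sha1(" ".join(words[i:i + window]).encode()).hexdigest()
            for i in range(count)}


def overlap(protected_hashes: set[str], candidate: str, window: int = 4) -> float:
    """Fraction of the candidate's word windows that appear in the protected document."""
    candidate_hashes = shingle_hashes(candidate, window)
    return len(candidate_hashes & protected_hashes) / len(candidate_hashes)
```

In this sketch the pattern alone flags any Luhn-valid number, including an employee's personal card, while `exact_matches` reports only numbers whose fingerprints were loaded from the customer data source. `overlap` returns a score near 1.0 when a passage was lifted from the protected document, even if the rest of that document is absent.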

A data loss prevention policy will combine one or more of these techniques with contextual data and additional rules, such as severity count or per-business unit requirements.
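
One way to picture such a policy is as a small data structure that wraps a content analysis technique in contextual conditions. The sketch below is hypothetical throughout: the `Policy` and `Event` fields, the Social Security number pattern and the example host names are assumptions made for illustration, and it reads the additional rules as a severity label, a match-count threshold and a business-unit scope.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(text: str) -> list[str]:
    """Stand-in detector; any content analysis technique could be plugged in."""
    return SSN_PATTERN.findall(text)


@dataclass
class Event:
    text: str          # content extracted by the file cracker
    source_unit: str   # contextual data: who sent it
    destination: str   # contextual data: where it is going


@dataclass
class Policy:
    name: str
    detector: Callable[[str], list[str]]
    min_matches: int = 1    # count threshold
    severity: str = "low"
    business_units: set[str] = field(default_factory=set)        # empty means all units
    allowed_destinations: set[str] = field(default_factory=set)  # never flagged


def evaluate(policy: Policy, event: Event) -> Optional[str]:
    """Return the policy's severity if the event violates it, else None."""
    if policy.business_units and event.source_unit not in policy.business_units:
        return None
    if event.destination in policy.allowed_destinations:
        return None
    if len(policy.detector(event.text)) < policy.min_matches:
        return None
    return policy.severity


bulk_ssn = Policy(
    name="bulk-ssn",
    detector=find_ssns,
    min_matches=2,
    severity="high",
    business_units={"hr"},
    allowed_destinations={"payroll.example.com"},
)
```

With this policy, a single Social Security number in an HR message is ignored, two or more sent to Web mail are flagged as high severity, and the same content sent to the approved payroll host passes.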

About the author:
Rich Mogull has nearly 20 years of experience in information security, physical security, and risk management. Prior to founding independent information security consulting firm Securosis, he spent seven years at Gartner Inc., most recently as a vice president, where he advised thousands of clients, authored dozens of reports and was consistently rated as one of Gartner's top international speakers. He is one of the world's premier authorities on data security technologies, including DLP, and has covered issues ranging from vulnerabilities and threats to risk management frameworks and major application security.

This was last published in November 2014
