Defining DLP

The rash of information thefts, security breaches and data loss incidents in recent years has driven the development of a new breed of products designed to prevent sensitive data from making its way out of enterprise networks. But there is a lot of confusion in the industry about what constitutes a DLP product, how these products work and how to deploy them. In this video, viewers will get a primer on the basics of DLP, including:

  • Defining DLP: what it is and what it can do
  • Architecture
  • Which features to look for and which to avoid

About the speaker:
Rich Mogull has more than 17 years of experience in information security, physical security, and risk management. He is the founder of independent information security consulting firm Securosis.

Read the full text transcript from this video below. Please note the full transcript is for reference only and may include limited inaccuracies. To suggest a transcript correction, contact  

Defining DLP

Hi, this is Rich Mogull. Today we're going to talk about what is, a little unfortunately, one of the more confusing security technologies on the market today: DLP, otherwise known as data loss prevention. And I say that it's kind of confusing because there have been a lot of terms used for DLP. Data leak prevention. Information loss prevention. Information leak prevention. Extrusion prevention. Extrusion protection. All sorts of other terms out there. We have everybody from encryption vendors calling themselves DLP to anybody that blocks a USB port on a laptop. Which pretty much makes most of the operating systems DLP.

The reality is that DLP is one of the most important security innovations we've seen in recent times.  Because it's one of the very few technologies that is actually finally taking a focus on our information itself, on our actual content and on our actual data. Because of that, it's going to have very big implications for the future.

Now I'm not going to say that this is exactly going to solve all your problems today. It's still very much an adolescent technology and has a lot of room to mature. But what we're going to do over the next period of time here is talk about what DLP is. How does DLP work? What can you expect from it? What can you not expect from it? A lot of it [will be] making sure you have the right expectations in place. And then what are the top things to look for in a product to be implemented in your own organization.

Now I've been covering DLP for a while now, probably five or six years. And one of the hard things is having a consistent definition as to what DLP is. So I've gone ahead and written up a definition, which is... products that, based on central policies, identify, monitor and protect data at rest, in motion and in use through deep content analysis. Now that definition alone concisely defines what you need to look for in terms of your DLP architecture.  So let's break this out into a few areas.

The first thing of any DLP solution is that you have to have central policies or some sort of centralized management. And the reason is that when you want to go ahead and protect your data, you don't want to have to set things up in eight different locations for the same piece of data. If you want to protect credit card numbers, you don't want to have to write a policy over here and a policy over here and a policy over here and a policy over here to protect that. You should be able to centralize that and push it down throughout your organization to whatever security technologies are actually implementing the DLP.

Now from there, let's look at the three key things that DLP needs to do. It needs to identify the data. It needs to monitor the use of that data. And then, ideally, you want it also to be able to protect that data. And you want to be able to do that when it's at rest, when it's in your storage infrastructure. When it's in motion, when it's being moved throughout your organization. And when it's in use. And the only way you can do that is through deep content analysis. The ability to actually dig into the files and understand what's there.

So I want to talk a little bit about what that deep content analysis is and what are the things you want to look for there. Then we're going to break out those different technology architecture areas. And then we're going to conclude again with some of the things you should look for, the top things to look for when you're selecting a solution.

Now when we talk deep content analysis, there are two sides of this. Actually three sides of this. There is, how do you gain access to the content? There is what kind of context is around that content. So sending, receiving. Where is it going? Where has it been? And the last is, what's the content itself?

Now the contextual stuff as we call it, which is who is the sender or the recipient for email, for example. Where is the file stored if we're looking at data at rest, for example. Those sorts of things. That's actually pretty easy to get. We can have a lot of tools that can look at the metadata. Pretty much every DLP solution on the market can track your basic context. Now it's important because that's going to help you build your rule. Because remember, a piece of data pretty much doesn't mean anything. 

What we're really talking about is protecting information which is a piece of data with a value. And something only has value depending on where it is and how it's being used at that point in time. So is John allowed to send Mary a particular piece of information? We need to know who John is. We need to know who Mary is. We need to know what that piece of data is. So that's what the context on the outside can give us.

Now the next thing we have to do is actually gain access to the content itself. Now if it's moving across the network in plain text, that's pretty easy. If it's a file on a server or even a file that is part of an attachment on an email being sent, it gets a little bit harder. And so what we need to do there is something called file cracking. File cracking is when you basically can take a particular file and you're actually able to understand what kind of file that is and dig down into it and actually look at the information.

So if I say, "The quick brown fox jumped over the lazy dog," and that's inside a Microsoft Word file, you can't just read it, because it's a binary kind of file. The same goes if it's in a ZIP file, or if it's embedded in a PDF file or something. So file cracking is the way that you can go through. And most of the tools can open up hundreds of different file types to actually look at what's inside.

Now the next part of file cracking is you have files that are embedded within other files. So, for example, you can take part of an Excel spreadsheet, embed it in a Word document, throw that into an Adobe PDF file and then zip that up into a ZIP file. It's not all that unusual. So you actually need to be able to interpret all of those different things. You need to open up the ZIP file, find the PDF, open that up [and] find the individual content in there. And so this is the very first step of doing content analysis within a DLP solution.
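As a rough illustration of that unwrapping step, here is a minimal Python sketch. It is not how any particular DLP product works: it only understands ZIP containers through the standard library, whereas the tools described above handle hundreds of formats such as Word and PDF. The names `crack` and `make_zip` and the sample file names are hypothetical.

```python
import io
import zipfile


def crack(name, data, depth=0, max_depth=5):
    """Yield (path, text) for each payload found inside nested ZIP containers."""
    if depth > max_depth:
        return  # give up on suspiciously deep nesting
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                if member.endswith("/"):
                    continue  # skip directory entries
                yield from crack(f"{name}/{member}", archive.read(member),
                                 depth + 1, max_depth)
    else:
        # Assumption: anything that is not a ZIP is treated as plain text here.
        yield name, data.decode("utf-8", errors="replace")


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member, payload in files.items():
            archive.writestr(member, payload)
    return buffer.getvalue()


inner = make_zip({"plan.txt": b"The quick brown fox"})
outer = make_zip({"inner.zip": inner, "readme.txt": b"hello"})
results = dict(crack("outer.zip", outer))
```

The recursion is the point of the sketch: each container is opened, and whatever is found inside is handed back to the same function until only readable content is left.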

Now the next step is the actual content analysis itself. By deep content analysis, we're not just talking about keywords. Okay, maybe you do set up a keyword policy.  But that's not really what we're talking about here. Deep content analysis tends to fall into just a few major areas, a few major techniques.

Now you can use what we call rules based. Rules based is basically like regular expressions. So it is, "I am looking for something that resembles a credit card number." It has the right number of digits and the Luhn check works and it's the right size and those sorts of things. And even if you have dashes in it or out of it, a good well-written regular expression will be able to figure out that that is a credit card number. So it's a simple way that you can define a policy for basic kinds of content or what we call described data. You're describing what that content looks like.
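A rule of that kind could be sketched in Python as follows. This is only an illustration under simple assumptions: the pattern accepts 13 to 19 digits with optional spaces or dashes, and the checksum is the standard Luhn algorithm used on payment card numbers. The names `CARD_PATTERN`, `luhn_ok` and `find_card_numbers` are made up for the example.

```python
import re

# 13 to 19 digits, each optionally followed by a space or dash.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text):
    """Return the normalized digit strings that look like valid card numbers."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits
```

For example, `find_card_numbers("Card: 4111-1111-1111-1111 thanks")` finds the well-known test number with or without the dashes, while a number that fails the checksum is ignored.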

Well the problem in the example I just gave is, how do I know the difference between a credit card number when an employee is purchasing something on Amazon, and a credit card number that is from one of our customers out of our customer database? And to solve that problem, we have what we call database fingerprinting, otherwise known as exact data matching.

In this case, instead of looking for that general rule that describes that credit card number, you can actually go ahead and do a dump of your database. Create hashes of all of that information in there and then monitor only for those credit card numbers for your customers. Or better yet, only the credit card numbers of your customers from your customer database that have that credit card number as well as the customer name. Maybe as well as the customer address. Whatever rules you want to build. Not that any of you should have unencrypted credit card numbers lying around anyway. It's a PCI violation. But it does give us a better ability to protect information directly out of our databases. So the first one we have is rules-based, mostly regular expressions. And now we're moving into the database fingerprinting.
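The idea of hashing a database dump and matching only against those hashes might look something like this in Python. It is a sketch under stated assumptions, not a vendor's algorithm: plain SHA-256 over a normalized value stands in for whatever hashing a real product uses, the "database dump" is a one-row list, and the names `fingerprint`, `build_index` and `find_customer_cards` are hypothetical.

```python
import hashlib
import re


def fingerprint(value):
    """Hash a normalized value (whitespace and dashes removed, lowercased)."""
    normalized = re.sub(r"[\s-]", "", value).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_index(rows):
    """Map card-number hash to customer-name hash for each dumped row."""
    return {fingerprint(card): fingerprint(name) for name, card in rows}


# Any run of 13 to 19 digits, optionally separated by spaces or dashes.
CANDIDATE = re.compile(r"\d(?:[ -]?\d){12,18}")


def find_customer_cards(text, index, require_name=False):
    """Return card-like strings whose hash appears in the customer index."""
    words = re.findall(r"[A-Za-z]+", text)
    # Hash single words and adjacent word pairs so "Mary Jones" can match.
    seen_names = {fingerprint(word) for word in words}
    seen_names |= {fingerprint(a + b) for a, b in zip(words, words[1:])}
    hits = []
    for match in CANDIDATE.finditer(text):
        card_hash = fingerprint(match.group())
        if card_hash in index and (not require_name or index[card_hash] in seen_names):
            hits.append(match.group())
    return hits


# Stand-in for a dump of the customer database (made-up row).
index = build_index([("Mary Jones", "4111 1111 1111 1111")])
```

With this index, an employee's own card number in an Amazon order produces no hit, while the customer's number does. Setting `require_name=True` narrows it further to messages that carry the card number together with the customer name.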

But what about unstructured data? What if we have a sensitive engineering plan we want to protect? Well for that, we use a content analysis technique known as partial document matching. That is, I can load a document up and then most of these tools use a technique called overlapping hashes. Where basically I can say, "I want to protect this entire document. But I also want to protect if somebody cuts out a paragraph and pastes that paragraph into an instant message." How can you do that?

Well this technique does use the overlapping hash techniques and everything else. So it'll go ahead and figure out actually if somebody copied just that paragraph out and pasted it into something else. And then tried to send that outside of the organization -- really powerful for some of those unstructured documents. Usually you can set things up where you just have a directory and any document in that directory is scanned and loaded up into the system. And you can exclude things like the header with your company address on it so that you don't go ahead and get a lot of false positives off of that.
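One simple way to picture overlapping hashes is to hash every run of five consecutive words in the protected document, then see how many of the hashed runs in an outgoing message also appear in that set. The Python below is a minimal sketch of that idea only. The window size of five, the use of SHA-1, and the names `shingles` and `overlap` are assumptions for illustration, and real products may do this quite differently.

```python
import hashlib
import re


def shingles(text, size=5):
    """Hash every overlapping run of `size` words in the text."""
    words = re.findall(r"\w+", text.lower())
    return {
        hashlib.sha1(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(max(len(words) - size + 1, 0))
    }


def overlap(protected_hashes, candidate_text, size=5):
    """Fraction of the candidate's word runs that come from the protected text."""
    candidate = shingles(candidate_text, size)
    if not candidate:
        return 0.0
    return len(candidate & protected_hashes) / len(candidate)


# Made-up "engineering plan" standing in for a registered document.
protected = shingles(
    "The turbine housing uses a titanium alloy rated for nine hundred degrees. "
    "Final assembly takes place at the northern plant in the third quarter."
)

pasted = overlap(protected, "hey check this out: the turbine housing uses a "
                            "titanium alloy rated for nine hundred degrees")
unrelated = overlap(protected, "lunch at noon on friday works for everyone on the team")
```

Because the windows overlap, a single pasted sentence still shares most of its word runs with the original, even though the message around it is different.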

Now the next couple of techniques start getting kind of interesting. Now they tend to lead to much higher false positives. Everything I've described is fairly straightforward. False positives probably mean somebody really did cut and paste that or somebody really did use a credit card number. Well another technique we can use is statistical techniques. So this is the Bayesian analysis, machine learning, the other kinds of things that we tend to use in our anti-spam solutions, but kind of flipped around.

So there what I'm doing is taking a whole bunch of documents and I'm saying, "I want you to scan all of these. And I want you to protect anything that looks like it resembles the documents in these repositories." So it does all sorts of advanced math stuff that's well beyond my ability to understand. Now you are going to get more false positives because anything that kind of resembles that stuff, and this is not a perfect science here, is going to trigger alerts. You might get false negatives. You might miss things. Because we're dealing with some more loosely defined criteria now. But I think it starts to become pretty powerful to at least maybe give us a first pass on finding some of the information we wouldn't maybe otherwise build a rule to protect.
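To give a feel for the statistical approach, here is a very small naive Bayes classifier in Python. Naive Bayes is just one common Bayesian technique, chosen here as an assumption. The training sentences, the class name `NaiveBayes` and the two labels are invented for the example, and a real system would train on whole repositories of documents.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Two-class word-count model with add-one (Laplace) smoothing."""

    def __init__(self):
        self.counts = {"sensitive": Counter(), "ordinary": Counter()}
        self.docs = Counter()

    def train(self, label, text):
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def score(self, text):
        """Log-odds that the text is sensitive; positive means it looks sensitive."""
        vocab = set(self.counts["sensitive"]) | set(self.counts["ordinary"])
        log_odds = math.log(self.docs["sensitive"] / self.docs["ordinary"])
        for label, sign in (("sensitive", 1), ("ordinary", -1)):
            total = sum(self.counts[label].values())
            for tok in tokens(text):
                probability = (self.counts[label][tok] + 1) / (total + len(vocab))
                log_odds += sign * math.log(probability)
        return log_odds


model = NaiveBayes()
model.train("sensitive", "turbine blade tolerance drawing titanium alloy spec")
model.train("sensitive", "alloy heat treatment spec for turbine housing")
model.train("ordinary", "lunch menu for friday team meeting")
model.train("ordinary", "parking lot closed friday for repairs")

looks_sensitive = model.score("new turbine alloy spec attached")
looks_ordinary = model.score("friday lunch meeting moved")
```

Nothing here is an exact match. The score only says the new text resembles one pile of documents more than the other, which is exactly why this family of techniques produces both false positives and false negatives.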

And then the last major kind of technique around this is what we call conceptual-based. Which is I have something like insider trading. And what I do is build a special dictionary, a lexicon. It's not just words, it's proximities, it's phrases and other things of common things that can represent the concept of insider trading. And then I can go ahead and generate alerts if I see anything like that.
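A lexicon like that could be sketched in Python as a set of phrases plus proximity rules. Everything in this example is hypothetical: the phrases, the word lists, the eight-word window and the names `INSIDER_TRADING` and `concept_hits` are invented to show the shape of the idea, not taken from any real insider trading dictionary.

```python
import re

INSIDER_TRADING = {
    "phrases": ["before it goes public", "keep this between us"],
    "proximity": [
        # (label, first term set, second term set, max distance in words)
        ("trade near news", {"buy", "sell"}, {"earnings", "merger", "announcement"}, 8),
    ],
}


def concept_hits(text, lexicon):
    """Return the phrases and proximity rules from the lexicon that the text triggers."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    hits = [phrase for phrase in lexicon["phrases"] if phrase in lowered]
    for label, first, second, window in lexicon["proximity"]:
        first_positions = [i for i, word in enumerate(words) if word in first]
        second_positions = [i for i, word in enumerate(words) if word in second]
        if any(abs(i - j) <= window for i in first_positions for j in second_positions):
            hits.append(label)
    return hits
```

So "You should sell your shares before the earnings announcement" triggers the proximity rule, while "We will sell cookies at the bake sale" does not, even though both contain the word "sell".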

Now the first three techniques are what most people are using for DLP today. They're building basic rules. They're going ahead and loading up information from their databases. And they're doing partial document matching. And some even actually are still doing rules-based stuff that's just little keywords or a keyword and regular expression kind of a thing tied together. These other two techniques are just starting to emerge. But I think they are going to be very exciting in terms of where we see DLP head. But, again, realistically that's not what most organizations are deploying today.

So by now what we've covered is basically the content analysis side of this. That is what are the content analysis techniques. A little bit of an overview of how those work. We've got a lot of other information we can provide you guys about that if you want to know the nitty gritty details. We've talked about the file-cracking portion of it.  And we talked about getting the context, the metadata information. What's the metadata for that file? Or if it's email, what's the metadata for that network packet or something in storage?

Now let's actually break down and talk about the actual DLP architectures themselves. Now most organizations who acquire DLP tend to start with the network. And it's usually my recommendation to start at the network. And the reason is we have basically two major kinds of DLP. Endpoint DLP, which is where you have it installed on your actual system. Or network DLP, where you're scanning things in the network.

Now if we just do endpoint DLP, we can do some pretty interesting stuff.  But the problem is you're going to miss a lot. Because you're not going to be able to get an agent on every single endpoint necessarily. And so if a contractor comes in, doesn't have that agent on their desktop and then sends something out of the organization that they grabbed off the server, endpoint DLP will never deal with that unmanaged system. It only works with managed systems. So because of that, we see a lot of organizations starting with network first.

Now with network DLP, basically there are a couple of different architectures that you use. The most basic is passive monitoring, as we call it. And this usually involves connecting the network DLP product. Typically it's either an appliance or it's a server dedicated to this. So it's a Dell 1U rack or blade server or something. Connecting that to a SPAN port on your Web gateway or basically on your exterior-facing connections, where all of your data goes out.

Now from a performance standpoint, you actually don't need as much as you might think. Even if you're running a full gig of Ethernet traffic across there, the DLP solution really only cares about communications traffic. So it's going to watch for Web and FTP and other things. Hopefully it will watch for channels where somebody tries to run the Web protocol over different kinds of ports. Those sorts of things. So you have that network monitoring going on.

For a small organization, well let's just say there isn't a DLP tool on the market that won't meet your needs. Even large organizations probably only need about 300 to 500 megs of bandwidth to monitor that communications traffic for any particular egress point.

And so you're connected to that port and you're monitoring that traffic. That's great.  You generate alerts when somebody emails or IMs or FTPs something out that they shouldn't. But that's not necessarily going to help you if you actually want to stop anything. So we see a couple of different blocking architectures. One is that you keep that DLP appliance in passive monitoring mode, and then it links up with your Web gateway. So it will link up to whatever that firewall is. Typically, you're using something called the ICAP protocol.

Now Web gateways, if you're doing URL filtering or anything, what they actually do is proxy the traffic. So I can go to make a connection to the outside world as a user.  That connection actually stops at that Web gateway. And then that Web gateway can spool up all the information before it sends it to the outside. And as part of that, it sends it over to the DLP appliance to do a quick check. And the DLP appliance says, "Yes allow this," or "No, don't allow this traffic." So that'll even work for things like somebody going to a webmail account, for example. And I've seen demos of this.
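The spool-then-check flow can be sketched in a few lines of Python. To be clear about the assumptions: this does not speak ICAP, the "policy" is a single made-up pattern for something that looks like a U.S. Social Security number, and `dlp_verdict`, `gateway_forward` and the status codes returned are hypothetical stand-ins for the gateway and the DLP appliance.

```python
import re

SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # hypothetical policy pattern


def dlp_verdict(body):
    """Stand-in for the DLP appliance: answer 'allow' or 'block' for a payload."""
    text = body.decode("utf-8", errors="replace")
    return "block" if SSN_LIKE.search(text) else "allow"


def gateway_forward(chunks, send_upstream, check=dlp_verdict):
    """Stand-in for the proxy: spool the outbound request, ask for a verdict, then act."""
    body = b"".join(chunks)  # nothing leaves until the whole request is spooled
    if check(body) == "block":
        return 403
    send_upstream(body)
    return 200


sent = []
allowed = gateway_forward([b"hello ", b"world"], sent.append)
blocked = gateway_forward([b"ssn 123-45", b"-6789"], sent.append)
```

The second call shows why the gateway spools first: the sensitive value is split across two chunks, so checking each chunk on its own would miss it.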

Now the ICAP thing is interesting but it causes a problem. What if you wanted to have a secure connection? So if I go to my bank, I have an SSL connection. And I know that I'm at that bank because I have that encrypted connection. Well there are now proxies that have been developed. The DLP vendors don't make these. These are actually from somebody else that you probably might already have in your organization, where I'm actually going ahead and doing that proxy and I'm breaking that encryption. Now it's going to set off alarm bells on the user's desktop. So what you can actually do is configure that for your users. It's not going to work for anybody whose system you don't manage.

But for the systems you manage, you can just throw the certificate down there and the users won't be bothered with these kinds of alerts. And then you can actually sniff SSL or encrypted traffic, which is great. Because the early DLP solutions couldn't do this. So if I went someplace encrypted, they're not going to be able to monitor anything I do. And that was one of the big holes. So now we have the ability to do that, to set things up properly.

And that's pretty much it on that basic network side for the synchronous protocols.  Things like FTP and HTTP and instant messaging. Just passively monitor or you integrate with some sort of a proxy. And there are actually other proxies for instant messaging and other protocols.

Email's a little different. Email by its very nature is always proxied. And so every major DLP solution now, what they do is actually include an MTA, a mail transport agent, in their device. And what happens is when you go to send your mail outside the organization, the mail goes first up to the email server. And then it'll go to the DLP solution for analysis. The DLP says, "Oop, I don't want to let this out. I'm going to send it back to the email server." Put it in a quarantine or something along those lines so that an administrator can evaluate that.

So email's a lot easier. It's built into most of the network-based products today. Any one that's worth buying is going to have this capability within the product. And it's pretty straightforward in terms of how to set it up. Your mail admins, you're going to need to work with them to make sure that they understand, because you can't do this without their help. And they need to set up the mail routes appropriately to account for this.

Now the next major area as I talked about is DLP on the endpoint. So we can do all this monitoring on the network side. We can do blocking of different things as long as we can get the proxies in the right place and with email. Well on the endpoint side it's a little bit different. And so for the endpoint, we see a huge range of capabilities in what an endpoint agent can do. But I like to break this down out into a couple of basic areas.

If I look at what I want to protect on an endpoint, the first thing I probably want to protect is making sure somebody is not putting something on portable storage. So we look at things like USB blocking. I might also want to provide DLP protection when this laptop is not on my network anymore but it's out someplace else in somebody else's network, to make sure people can't send sensitive information out that my web gateway's going to miss. So I want that network monitored. I might also want to know what sensitive data is on my laptop. That's what we call content discovery, or endpoint content discovery, which we'll get into. And the last thing is maybe I want to put protections in for things like cut and paste.

That's what we look for in an endpoint DLP solution. You want to have central policies. You want to have those down at the endpoints still. And we're looking at a couple layers. Where we can get the file system level so we can scan stored content. And we can also monitor when content is being saved, if it's being burned onto a DVD, written to a USB device, whatever it is. We want to monitor network communications, including outside communications. Hey, maybe we even just want to monitor printing. So when somebody goes to print something, that's actually network communication, and we put restrictions around that. And then we want to monitor what I call data in use, which is cutting and pasting and putting things into applications which perhaps are not approved for that piece of data.

Now this is the earliest part of this. Now one of the problems with the endpoint is that, even with all of these capabilities that I just talked about, there's only a limited amount of processing power in a laptop. You're not running an eight-core appliance on your network. And because of that, you don't always have the ability to enforce all of the same policies that you might be able to enforce on a stand-alone on the network side itself. With something that's fully dedicated to that.  At least you're not going to be able to enforce all of those because of performance reasons.

Because of that, it might be nice to have your policies adjust when your system is on the network versus when it's off the network. So for example, when you're on the network maybe you are supporting that database fingerprinting. Or maybe that's always supported on email since email's always going to go through your server. And then maybe when you unplug from that network you go, "Oop, I can't really do the database fingerprinting any more. But now I can at least generate alerts using a rule that describes a credit card number," if that's what I'm trying to protect. So you want that flexibility of where is the device. Where is the endpoint. And be able to have that agent know where it is and adjust its capabilities.
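In code, that kind of location awareness might come down to something as simple as the Python sketch below. The check names, the `POLICY` table and the `EndpointAgent` class are all hypothetical, and the way a real agent decides whether it is on the corporate network is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical check names; a real agent would map these to analysis engines.
POLICY = {
    "on_network": ["card_number_rule", "database_fingerprint"],
    "off_network": ["card_number_rule"],
}


@dataclass
class EndpointAgent:
    policy_server_reachable: bool

    def checks_for(self, channel):
        """Pick the checks to run for a channel based on where the endpoint is."""
        # Assumption: email always routes through the corporate mail server,
        # so the heavier server-side check still applies to it.
        if channel == "email" or self.policy_server_reachable:
            return POLICY["on_network"]
        return POLICY["off_network"]


in_office = EndpointAgent(policy_server_reachable=True)
on_the_road = EndpointAgent(policy_server_reachable=False)
```

Off the network, `on_the_road.checks_for("web")` falls back to the lightweight rule only, while email keeps the full set because it still passes through the server.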

Now I'm going to be blunt.  A lot of what I just talked about isn't in all of the agents out there. As a matter of fact, it's not in many of the agents out there. But it's a rapidly evolving technology. By the time you watch this video and you read the articles that go along with it, the odds are extremely high that more and more products are going to have those capabilities. And I guarantee every single vendor is building those kinds of things in.

We talked about the network side. We talked about the endpoint side. The last area is content discovery. With content discovery, what we're talking about is scanning our storage infrastructure looking for sensitive data. In my mind, this is actually more useful in many cases than network monitoring. Think about the amount of risk you can reduce by just knowing where your most sensitive data is. And knowing within a reasonable time period, not always in real-time but within a reasonable time period, if data is moved to someplace it shouldn't be.

Now the way we do this is a couple of techniques. One is remote scanning. Remote scanning is simply connecting to a fileshare that's on a server. Now it doesn't matter if it's a SAN or a NAS or a regular old file server. If there's an open fileshare, then all you do is set your DLP appliance, you point it at it and it's going to go ahead and it's going to scan all the files.
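Stripped down to its essentials, remote scanning is a walk over every file under a share with a content check applied to each one. Here is a minimal Python sketch under some loud assumptions: the share is already mounted as a local path, every file is treated as plain text (no file cracking), and the only policy is a simple 16-digit card-like pattern. The names `CARD_LIKE` and `scan_share` and the sample files are hypothetical.

```python
import os
import re
import tempfile

CARD_LIKE = re.compile(r"\b(?:\d[ -]?){15}\d\b")  # 16 digits, optional separators


def scan_share(root):
    """Return the paths under `root` whose contents match the pattern."""
    findings = []
    for folder, _subdirs, files in os.walk(root):
        for filename in files:
            path = os.path.join(folder, filename)
            try:
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    if CARD_LIKE.search(handle.read()):
                        findings.append(path)
            except OSError:
                continue  # unreadable file; a real scanner would log this
    return sorted(findings)


# Demo against a temporary directory standing in for a mounted fileshare.
with tempfile.TemporaryDirectory() as share:
    os.makedirs(os.path.join(share, "finance"))
    with open(os.path.join(share, "finance", "customers.csv"), "w", encoding="utf-8") as f:
        f.write("Mary Jones,4111 1111 1111 1111\n")
    with open(os.path.join(share, "readme.txt"), "w", encoding="utf-8") as f:
        f.write("Nothing sensitive here.\n")
    found = [os.path.basename(path) for path in scan_share(share)]
```

Every file has to be read across the network to be checked, which is where the performance cost of this approach comes from.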

Now as you can imagine, performance there might not be the best. You're running a lot of traffic over the network to do this. We've seen some clients actually position DLP servers. DLPs are all hierarchically managed. So you can have the central policies in a little server that's maybe connected right to the fiber that's right next to that particular server to speed things up. But it is going to run a little bit slower. Now the advantage of this is you can cover wide areas of your infrastructure with very little changes. The disadvantage is with the performance that I just talked about.

To handle some of that for some kinds of servers, and depending on which DLP solution, you can get an agent. Real similar to the endpoint agent. Sometimes it's the same little program. You can install that agent on the server and you can schedule that to run. And you can monitor your performance and things so you don't hurt that server. So that the analysis is being done locally. And you're not having to deal with that network communications issue. The disadvantage is you have to install and manage an agent. The advantage is that you're going to get much better performance than if you're trying to do that remote scanning. And if you have something like a SAN you can go ahead and dedicate an appliance just for scanning that. Not a big deal at all.

Now the next area of the way that we can monitor our storage data is if we have something like a document management system. You can actually integrate with that application. And a lot of the DLPs have partnerships with various companies. Or for something like SharePoint that's pervasive, they actually pre-built their agent. This is really cool because you can actually gain access to all the metadata and some of the functionality of that document management system that help you with the scanning.  And so you have a little bit more context.

Now you can also do enforcement, by the way, with the content discovery. It's not just locating where things are. But you can actually change access control permissions to lock down the access controls. You can change the access controls and then drop a little file there that says, "Hey, we, the security administration, found this file here. It shouldn't have been here. So please contact us first to release it." You can pull it onto your server and quarantine it. We've seen tools that are able to encrypt it locally. A lot of really, really advanced stuff that's starting to come out.
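The quarantine-and-leave-a-note action could look roughly like this in Python. It is a sketch only: the paths, the wording of the notice and the `quarantine` function are hypothetical, and it skips the access control changes and local encryption mentioned above.

```python
import os
import shutil
import tempfile

NOTICE = ("Security administration found a file here that should not have "
          "been here. Please contact us to release it.\n")


def quarantine(path, quarantine_dir):
    """Move a file to the quarantine folder and leave a notice in its place."""
    os.makedirs(quarantine_dir, exist_ok=True)
    target = os.path.join(quarantine_dir, os.path.basename(path))
    shutil.move(path, target)
    with open(path + ".quarantined.txt", "w", encoding="utf-8") as note:
        note.write(NOTICE)
    return target


# Demo in a temporary directory standing in for a share and a quarantine server.
with tempfile.TemporaryDirectory() as root:
    share = os.path.join(root, "share")
    os.makedirs(share)
    original = os.path.join(share, "cards.csv")
    with open(original, "w", encoding="utf-8") as f:
        f.write("4111 1111 1111 1111\n")
    moved_to = quarantine(original, os.path.join(root, "quarantine"))
    was_moved = os.path.exists(moved_to) and not os.path.exists(original)
    left_behind = sorted(os.listdir(share))
```

After the call, the share holds only the notice file, and the original sits in the quarantine folder until an administrator releases it.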

Again, not all of this is completely mature. Definitely not compared to some other technologies that are out there. But the current capabilities today give us an ability to protect ourselves that we've never had before. And even just knowing how the data is being communicated and where it's stored, and being able to put a few basic preventative actions in, is absolutely huge. And there's going to be a lot of immediate benefit. And that's one of the big reasons that I'm one of the proponents of DLP. Even though it might not be, as I said, as mature as some other things that are out there.

So we covered a lot of things today. And hopefully throughout this I have given you ideas in terms of what to look for when you are selecting a solution. The first thing you want is to make sure you have good central policy management. There's a lot of workflow and things we haven't talked about that will be in some of the other articles, the detail of what to look for there.

The second is we want to be able to protect that data on the network. We want to be able to protect it on the endpoint. And we want to be able to scan our storage infrastructure.  So I highly recommend you look at a full-suite solution that's able to do those. You've got to make sure it integrates well with your environment. There's way more than I can talk about in just the 20 minutes here.

And then the last thing, and most importantly, is make sure you're selecting a product that has some good content analysis. And that you're selecting a product that has the content analysis techniques that match your security requirements. If you don't care about credit card numbers or checking data from a database, you don't need that feature. And if you don't need to protect documents, you don't need the partial document matching feature. So you really want to evaluate these solutions with what's going to integrate with my infrastructure well. What's going to protect the areas that I need to protect at rest, in motion and in use. And what content analysis techniques do I need for the kinds of data and the kinds of information that I use within my organization?

And that's pretty much it. So thank you for your time today, and hopefully you found this useful.
