Metadata security and preventing leakage of sensitive information

Without accounting for metadata security, sensitive document data can easily be extracted. Mike Chapple explores technologies to support metadata security.

This tip is part of SearchSecurity.com's Data Protection Security School lesson on Network content monitoring must-haves. For more in-depth tutorials, visit SearchSecurity.com's Security School page.

Most security professionals know metadata often contains sensitive information about documents hidden from obvious view, but easily extracted with the right tools.  We often discover too late that users are not sensitized to these risks and don’t know the proper steps to follow to ensure information they release outside the organization doesn’t carry a hidden, sensitive payload.

Fortunately, there are a number of technologies available to assist in this effort.  In this tip, we examine the source of metadata security issues and look at two specific ways you can reduce the threat that document metadata poses to your organization.

What is metadata?

Metadata, quite literally, means “data about data”.  In the security field, we usually think of it as the data stored (sometimes in hidden form) as part of our productivity files.  This can take the form of revision history and comments made between document authors and editors, such as the familiar Microsoft Word “Track Changes” functionality shown in Figure 1.

Figure 1: Microsoft Word Track Changes feature
The unintended leakage of this type of information can pose many risks to an organization.  For example, a potential supplier who receives a marked-up copy of a contract draft may learn sensitive details about internal deliberations that put the supplier in a stronger negotiating position.  Opponents in a legal dispute may discover internal discussions about the weak points in an argument.

Figure 2 illustrates another type of metadata: the data the operating system itself retains about files.  In this example, taken from a Mac, you can easily see who created the document, when it was created and last modified, the tools used to edit it and more.  This type of metadata proved embarrassing to Microsoft a few years ago, when investigative reporters used it to reveal that Macs had been used to create materials for the company’s “I’m a PC” campaign against Apple.

Figure 2: Macintosh document metadata
These are just two examples of metadata – there are many others.  For example, a database might include hidden columns containing timestamps of modifications or a photo might include metadata with the GPS coordinates where the photo was taken. 
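To make this concrete, consider that a .docx file is a ZIP archive, and its author and timestamp properties live in a part named docProps/core.xml. The minimal Python sketch below builds a tiny stand-in archive in memory and reads those properties back using only the standard library. The sample names and dates are invented for illustration, and `make_sample_docx` and `read_core_properties` are hypothetical helper names, not part of any product.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the OOXML core-properties part (docProps/core.xml).
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Invented sample data standing in for a real document's properties.
SAMPLE_CORE_XML = (
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:creator>Alice Example</dc:creator>'
    '<cp:lastModifiedBy>Bob Example</cp:lastModifiedBy>'
    '<dcterms:created>2012-01-05T09:30:00Z</dcterms:created>'
    '</cp:coreProperties>'
)


def make_sample_docx() -> bytes:
    """Build a minimal ZIP that mimics only the metadata part of a .docx."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", SAMPLE_CORE_XML)
    return buf.getvalue()


def read_core_properties(docx_bytes: bytes) -> dict:
    """Return selected core properties stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    fields = {
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "created": "dcterms:created",
    }
    # findtext returns the default ("") when a property is absent.
    return {
        name: root.findtext(path, default="", namespaces=NS)
        for name, path in fields.items()
    }


if __name__ == "__main__":
    for name, value in read_core_properties(make_sample_docx()).items():
        print(f"{name}: {value}")
```

None of these values appear anywhere on the printed page, yet anyone who receives the file can read them with a few lines of code.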

Once we acknowledge the risk that metadata poses to our organizations, we can turn to two ways to control this threat: redacting metadata and using data loss prevention technology.

How to remove metadata before releasing documents

Many of your users may be familiar with the old paper-based redaction process of blacking out sensitive information and then photocopying the page before release, and they may try to reproduce it digitally.  It’s important to explain that this approach is not effective in the digital world and that they need to take extra steps to ensure metadata is removed from the document.  Without added precautions, the reviewer may overlook sensitive information stored in metadata and therefore fail to redact it.  In fact, the metadata might contain a revision history that includes the very content that was redacted!  Because of this, you may wish to consider a second-level review process in which a qualified technician examines redacted documents for metadata before their release.
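As a rough illustration of what such a second-level review might look for, the Python sketch below searches a .docx archive for tracked deletions (stored in `w:delText` elements inside word/document.xml) and for a comments part (word/comments.xml). The sample contract text is invented, `find_hidden_revisions` is a hypothetical helper name, and a real review would cover far more than these two items.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml and word/comments.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Invented sample: one visible sentence plus one tracked deletion.
SAMPLE_DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t>Our offer is $90,000.</w:t></w:r>'
    '<w:del w:id="1" w:author="Alice Example">'
    '<w:r><w:delText>We could go as high as $120,000.</w:delText></w:r>'
    '</w:del>'
    '</w:p></w:body></w:document>'
)


def make_sample_docx() -> bytes:
    """Build a minimal ZIP that mimics only the main part of a .docx."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", SAMPLE_DOCUMENT_XML)
    return buf.getvalue()


def find_hidden_revisions(docx_bytes: bytes) -> dict:
    """Report tracked deletions and comments still stored in the file."""
    findings = {"deleted_text": [], "comment_count": 0}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        names = set(zf.namelist())
        if "word/document.xml" in names:
            root = ET.fromstring(zf.read("word/document.xml"))
            # Text removed under Track Changes is kept in <w:delText>.
            for node in root.iter(f"{{{W}}}delText"):
                findings["deleted_text"].append(node.text or "")
        if "word/comments.xml" in names:
            comments = ET.fromstring(zf.read("word/comments.xml"))
            findings["comment_count"] = sum(
                1 for _ in comments.iter(f"{{{W}}}comment")
            )
    return findings


if __name__ == "__main__":
    print(find_hidden_revisions(make_sample_docx()))
```

In the sample, the "deleted" sentence is invisible once changes are hidden on screen, but it is still present in the file and is returned in `deleted_text`.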

One popular way to remove metadata from a document is to convert it to PDF format before releasing it.  The National Security Agency uses this approach and recommends a six-step process for secure conversion and redaction of Word documents:

  1. Create a copy of the original document.
  2. Turn off “Track Changes” on the copy and remove all visible reviewer comments.
  3. Delete any sensitive information from the document that you wish to redact.
  4. Use the Microsoft Office Document Inspector to check for any unwanted metadata.
  5. Save the new document and convert it to a PDF file.
  6. Use the Sanitize Document tool in Acrobat Professional as a second check before releasing the redacted PDF.

While somewhat cumbersome, this process gives you two separate checks, the Document Inspector and the Sanitize Document tool, that sensitive metadata has been removed before you release a document.

Use data loss prevention technology

As part of a defense-in-depth approach to information security, you should also introduce a second layer of control designed to prevent the leakage of sensitive information when proper redaction efforts fail.  The easiest way to do this is with a data loss prevention (DLP) product. 

DLP products are already widely deployed to monitor endpoints, networks and data stores for accidental leakage of sensitive information.  If you’re using such a product and have properly configured it to recognize the key words and phrases that constitute sensitive information in your environment, there’s probably not much else you need to do.  Most DLP products scan the entire contents of a file, including the metadata, when they inspect outbound documents.  If you haven’t deployed a DLP product, you may wish to consider doing so.
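The idea of scanning the whole file, not just the visible body, can be sketched in a few lines of Python. This is only a toy illustration of keyword matching across every part of a ZIP-based document and is not how any particular DLP product works. The watch list, the simplified XML in the sample and the `scan_archive` helper are all assumptions made for the example.

```python
import io
import re
import zipfile

# Hypothetical watch list; a real deployment would use the organization's own terms.
SENSITIVE_TERMS = ["Project Falcon", "confidential"]


def make_sample_docx() -> bytes:
    """Build a stand-in archive: clean body text, sensitive title in the metadata.

    The XML here is deliberately simplified and is not valid OOXML.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<doc><t>Quarterly newsletter</t></doc>")
        zf.writestr(
            "docProps/core.xml",
            "<props><title>Project Falcon pricing draft</title></props>",
        )
    return buf.getvalue()


def scan_archive(docx_bytes: bytes, terms: list) -> dict:
    """Map each archive part to the sensitive terms found in it."""
    hits = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        for name in zf.namelist():
            text = zf.read(name).decode("utf-8", errors="ignore")
            # Case-insensitive literal match for each watch-list term.
            found = [
                term for term in terms
                if re.search(re.escape(term), text, re.IGNORECASE)
            ]
            if found:
                hits[name] = found
    return hits


if __name__ == "__main__":
    print(scan_archive(make_sample_docx(), SENSITIVE_TERMS))
```

Here the body of the document is clean, but the match turns up in the metadata part. One limitation of so simple a match is that a phrase split across several formatting runs in the underlying XML would be missed, which is one reason dedicated products parse the file format rather than searching raw bytes.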


Document metadata poses a significant risk to your organization’s confidential information because users are often unaware it exists.  Driven by the WYSIWYG approach used by office productivity tools, users may not realize that what you see isn’t always everything you get.  Through the use of proper document redaction and data loss prevention technology, you can mitigate this metadata security risk in your environment.

About the author:
Mike Chapple, Ph.D., CISA, CISSP, is an IT security professional with the University of Notre Dame. He previously served as an information security researcher with the National Security Agency and the U.S. Air Force. Mike is a frequent contributor to SearchSecurity.com, a technical editor for Information Security magazine and the author of several information security titles, including the CISSP Prep Guide and Information Security Illuminated.

This was last published in January 2012
