How to find sensitive information on the endpoint
Worried that your enterprise endpoints may be harboring sensitive information like credit card numbers or Social Security numbers? Fear not. Mike Chapple offers algorithms and tools to conduct a search and advice on dealing with the results.

There's little question that any security manager who's suffered through a lost laptop incident knows what aspect of the ordeal causes an organization real damage. When a laptop goes missing, it's not the loss of a $2,000 asset that causes heartburn; it's the fear, uncertainty and doubt that result from not knowing whether sensitive information was stored on the missing device.
Fortunately, security professionals can use a number of discovery tools to identify and eradicate sensitive information stored on endpoint devices.
Sensitive information discovery algorithms
Before we delve into the tools available to assist with the search, it's important to have a basic understanding of the algorithms used to detect sensitive numbers. Only by understanding how these algorithms work is it possible to judge the effectiveness of individual scanning tools. We'll specifically look at two types of sensitive numbers commonly sought out by sensitive data discovery tools: credit card numbers and Social Security numbers.
Credit card numbers issued by the major providers follow a standard format that makes it easy to detect them using regular expressions. The rules for valid numbers include:
- Visa numbers have either 13 or 16 digits and always start with a 4.
- MasterCard numbers have 16 digits and always start with a 5, followed by a digit from 1 to 5.
- American Express numbers have 15 digits beginning with 34 or 37.
- Discover Card numbers have 16 digits beginning with 6011, 622, 644-649 or 65.
These guidelines are a great starting point for ruling out quite a few false positives because they can easily be adapted to a regular expression. For example, the following regular expression can be plugged into a search tool to find potential 16-digit Visa card numbers, even if spaces or hyphens separate the groups of four digits:
\b4\d{3}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b
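The issuer rules above translate fairly directly into patterns. The following Python sketch is one possible rendering, not a definitive rule set: the names `CARD_PATTERNS` and `find_card_candidates` are illustrative, only 16-digit Visa numbers are covered, and the 4-6-5 digit grouping for American Express is an assumption about how those numbers are usually written.

```python
import re

# Optional single space or hyphen between digit groups.
SEP = r"[ -]?"

# Three further groups of four digits, closing a 16-digit number.
TAIL_16 = SEP + SEP.join([r"\d{4}"] * 3) + r"\b"

# Prefixes follow the issuer rules listed in the article.
CARD_PATTERNS = {
    "visa": re.compile(r"\b4\d{3}" + TAIL_16),
    "mastercard": re.compile(r"\b5[1-5]\d{2}" + TAIL_16),
    # Assumption: American Express numbers are grouped 4-6-5.
    "amex": re.compile(r"\b3[47]\d{2}" + SEP + r"\d{6}" + SEP + r"\d{5}\b"),
    "discover": re.compile(r"\b6(?:011|22\d|4[4-9]\d|5\d{2})" + TAIL_16),
}


def find_card_candidates(text):
    """Return (brand, matched_text) pairs for every pattern hit in text."""
    hits = []
    for brand, pattern in CARD_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((brand, match.group()))
    return hits
```

Every hit is only a candidate: a string of digits that happens to fit an issuer's format.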
[Figure: mod 10 (Luhn) check illustrated with the sample number 4128 0057 1492 1925]
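Pattern matching alone still flags many random digit strings. Payment card numbers end in a check digit computed with the Luhn (mod 10) algorithm, so candidates can be filtered further. A minimal Python sketch of that check follows; the function name `luhn_valid` is illustrative and not taken from any of the tools discussed here.

```python
def luhn_valid(candidate):
    """Return True if the digits in candidate pass the Luhn (mod 10) check."""
    # Keep ASCII digits only, so spaces and hyphens are ignored.
    digits = [int(ch) for ch in candidate if "0" <= ch <= "9"]
    if not digits:
        return False
    total = 0
    # Walk right to left and double every second digit, starting with the
    # digit immediately to the left of the check digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

A string that fits an issuer's format but fails this check can usually be treated as a false positive.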
Social Security numbers, on the other hand, are not quite as easy to match because there is no equivalent of the Luhn (mod 10) check to verify their validity. You can look for patterns of nine-digit numbers surrounded by whitespace and take advantage of a few clues to help with the search:
- SSNs are often (but not always) written with hyphens between the digits, in the form xxx-xx-xxxx. If you're willing to accept the risk of missing unformatted numbers, you can restrict your search to numbers hyphenated in this pattern to dramatically reduce false positives.
- SSNs will never have all zeros in a digit group (i.e., 000-xx-xxxx, xxx-00-xxxx or xxx-xx-0000).
- SSNs will never begin with 666, 732-749 or any number higher than 772.
- Given the first three digits of an SSN, you can determine the highest possible values for the next two digits by consulting the Social Security Administration's High Group Number list.
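The first three clues above can be combined into a simple filter. The Python sketch below is an illustration under stated assumptions: it accepts only the hyphenated form, the names `SSN_PATTERN`, `plausible_ssn` and `find_ssn_candidates` are hypothetical, and the High Group Number lookup is left out. The area ranges are the ones given in the list; the SSA changed its assignment method in 2011, so numbers issued since then may fall outside these ranges and the limits are best treated as adjustable.

```python
import re

# Hyphenated xxx-xx-xxxx only, not embedded in a longer run of digits or hyphens.
SSN_PATTERN = re.compile(r"(?<![\d-])(\d{3})-(\d{2})-(\d{4})(?![\d-])")


def plausible_ssn(area, group, serial):
    """Apply the exclusion rules from the list above to the three digit groups."""
    # No digit group may be all zeros.
    if area == "000" or group == "00" or serial == "0000":
        return False
    area_number = int(area)
    # Area ranges as stated in the article (assumption: adjust if rules change).
    if area_number == 666 or 732 <= area_number <= 749 or area_number > 772:
        return False
    return True


def find_ssn_candidates(text):
    """Return hyphenated SSN-like strings in text that survive the rules."""
    return [
        match.group()
        for match in SSN_PATTERN.finditer(text)
        if plausible_ssn(*match.groups())
    ]
```

Restricting the pattern to hyphenated numbers trades missed unformatted SSNs for far fewer false positives, as the first clue notes.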
Software tools to assist in the search
Unless you're looking for an adventure, it's not necessary to write your own code to implement these searches. There are a variety of open source and commercial products available to assist you in detecting these sensitive numbers on enterprise systems. Some examples include:
- Cornell University's Spider
- University of Texas at San Antonio's Sensitive Number Finder
- Identity Finder
These tools use the algorithms described above and allow you to tinker with the settings, such as whether to restrict a search to formatted numbers, numbers in particular file types and other parameters.
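To see how such settings shape a scan, consider the minimal Python sketch below. It is not how any of the tools listed above is implemented; `scan_tree`, its parameters and the two patterns are hypothetical names chosen for illustration, and it reads files as plain text only.

```python
import os
import re

# Hyphenated SSN-like strings only.
FORMATTED_SSN = re.compile(r"(?<![\d-])\d{3}-\d{2}-\d{4}(?![\d-])")
# Nine digits with optional hyphens: finds more, with more false positives.
ANY_NINE_DIGITS = re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)")


def scan_tree(root, extensions=(".txt", ".csv"), formatted_only=True):
    """Yield (path, line_number, match) for SSN-like strings under root."""
    pattern = FORMATTED_SSN if formatted_only else ANY_NINE_DIGITS
    for folder, _subdirs, filenames in os.walk(root):
        for name in filenames:
            # Restrict the scan to the requested file types.
            if not name.lower().endswith(tuple(extensions)):
                continue
            path = os.path.join(folder, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for line_number, line in enumerate(handle, start=1):
                        for match in pattern.finditer(line):
                            yield path, line_number, match.group()
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
```

Calling `scan_tree(root, formatted_only=False)` widens the search to unformatted numbers, which is the same trade-off the tools expose through their settings.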
Managing sensitive information and data
After deciding upon a search strategy for finding potentially sensitive information, the next step is to decide on a strategy for managing the mountains of results data.
There are two basic approaches to this problem: centralized review or decentralized authority. In the centralized approach, the tools report all results to a central administrator who is responsible for validating and eradicating suspicious data. This is an extremely time-consuming process and taxes valuable IT resources. However, it ensures consistency of rule interpretation and the thorough review of findings.
In the decentralized approach, end users are given responsibility (and accountability) for reviewing results. This distributes the workload among the entire workforce and provides the added benefit of having staff with contextual knowledge perform the review. For example, a staffer who knows that an Excel spreadsheet contains information about parts orders may be able to immediately disregard reports of SSNs in that document, while a centralized reviewer might not know the difference between that and any other file.
Scanning systems for sensitive data is a complex problem but, fortunately, there are a variety of tools and techniques available to assist in the process. Minimization, the search for and eradication of sensitive information on endpoints, is a powerful strategy in the arsenal of security administrators seeking to reduce enterprise risk.
About the author:
Mike Chapple, CISA, CISSP, is an IT security professional with the University of Notre Dame. He previously served as an information security researcher with the National Security Agency and the U.S. Air Force. Mike is a frequent contributor to SearchSecurity.com, a technical editor for Information Security magazine and the author of several information security titles, including the CISSP Prep Guide and Information Security Illuminated. He also answers your questions on network security.