In October 2006, Google, the company driving many Internet innovations, released Google Code Search, a powerful new service that could have major ramifications for the open source community and the information security industry.
How Google Code Search functions
Before delving into its potential ramifications, we should first examine how this service operates. As Google's bots crawl the Internet, fetching pages for Google's search index, they retrieve source code in dozens of languages, ranging from Ada to Yacc and everything in between. Google collects the data from its bots and indexes it in a special way that makes the source code searchable far more flexibly than the directives and operators of a "normal" Google search allow. The Code Search interface supports regular expressions, an incredibly flexible method for describing searchable patterns. The result is a blazingly fast service for searching the code of tens of thousands of projects.
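For instance, queries of roughly the following form combine an operator such as lang: or file: with a regular expression (the operators come from Google's Code Search help; the specific patterns here are our own illustrative examples, not queries from the service's documentation):

```
lang:c \bgets\s*\(          calls to the unbounded gets() routine in C code
lang:c++ \bstrcpy\s*\(      unbounded strcpy() copies in C++ code
file:\.php$ \beval\s*\(     eval() calls inside PHP files
```

The \s* in each pattern matches optional whitespace before the parenthesis, something an ordinary keyword search cannot express.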
Google Code Search's impact on the open source community
At this time, Google Code Search officially targets only open source projects. It's beneficial in that it helps new developers see how experienced developers solve certain problems and learn from their techniques, and it helps experienced developers locate code snippets and build on them in their own work. But as with almost anything, Google Code Search can be put to malicious use; bad guys can mine it for treasure by constructing queries that find security flaws, including buffer overflow vulnerabilities, format string flaws and numerous other issues.
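To see why such queries pay off, consider the best-known target: C code that reads input with gets(). This sketch (the function name and comments are ours, not drawn from any indexed project) contrasts the vulnerable pattern a flaw-hunting query would surface with the bounded fgets() replacement:

```c
#include <stdio.h>
#include <string.h>

/* The pattern a query for gets() turns up looks like:
 *
 *     char buf[64];
 *     gets(buf);   // no length limit: input longer than 63 bytes
 *                  // overflows buf and can corrupt the stack
 *
 * The bounded replacement below can never write past the buffer. */
char *read_line_safe(char *buf, size_t size, FILE *in) {
    if (fgets(buf, size, in) == NULL)   /* reads at most size-1 chars */
        return NULL;
    buf[strcspn(buf, "\n")] = '\0';     /* strip the trailing newline */
    return buf;
}
```

The crucial difference is that fgets() takes the buffer size as an explicit argument — exactly the bound that gets() lacks.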
And a search for calls to gets(), the notoriously unsafe C input function, is merely one sample of the types of problems that can be discovered using Google Code Search. Dug Song, a noted security researcher, and Aaron Campbell posted a blog entry that examines ways Google Code Search can be used to find a dozen different kinds of flaws, including certain buffer overflow conditions, format string flaws and off-by-one issues.
In the near term, Google Code Search will help developers find and fix poorly crafted code that can lead to major security vulnerabilities. And in the long term, Google Code Search should improve the state of security, as the service offers an incredibly powerful way to review code. However, the near-term ride will be quite bumpy, as bad guys mine for flaws while the rest of the community races to implement fixes.
Google Code Search and commercial code
And while Google Code Search currently targets open source software projects, this does not mean commercial software organizations should ignore it. In fact, there are many implications for any organization that develops its own code. For instance, if proprietary, in-house code is indexed by Google, bad guys can find flaws and exploit affected systems.
So what can enterprises do to protect themselves from malicious Google Code Search use? Some have suggested clearly labeling source code "proprietary" and including such comments in the code itself to defend against Google Code Search scans. While this is a good idea, it does not currently prevent Google from incorporating any available code into its database. I learned this when I performed a search for "proprietary" and found 99,900 hits, several of which say in the code, "This is unpublished proprietary source code for vendor [XYZ]." In the future, Google may filter out such results, but it doesn't now.
Preventing Google Code Search misuse
With that said, there are some steps enterprises can take to protect themselves from Google Code Search misuse and abuse. First, make sure that source code cannot be accessed via Web sites unless there is a compelling business reason to have it there. Next, implement robots.txt files on your Web servers to tell well-behaved search engines, including Google's bots, that certain directories are off-limits to crawling. To learn more about how to configure robots.txt files, visit www.robotstxt.org. It is important to note that robots.txt is a double-edged sword: while it keeps well-behaved search services from scouring sensitive areas of your Web site, it will also draw bad guys' attention to those very areas.
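A minimal robots.txt along these lines (the directory names are hypothetical) tells compliant crawlers to stay out of a source tree:

```
User-agent: *
Disallow: /src/
Disallow: /internal/builds/
```

Keep in mind that robots.txt is itself publicly readable, so these entries advertise exactly which paths you consider sensitive — the double-edged sword mentioned above.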
Nonetheless, it should go without saying that an enterprise needs a quality assurance program to ensure the code itself has minimal flaws. Use code reviews and carefully test all home-grown software. Finally, put serial number strings in each file created by your development team, and periodically search for those strings in Google Code Search and other search engines to see if your code has managed to sneak out of your organization. Then, if the cat gets out of the bag, at least your legal team can detect it and contact the appropriate people to have any sensitive or proprietary information removed from their servers.
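The serial-number tactic can be sketched as follows; the marker format and function name are our own invention for illustration, not a standard. The idea is simply to stamp each file with a unique, searchable string you can later hunt for in Code Search:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: prepend a unique, searchable serial comment to
 * a source file, so that a periodic search for the serial can reveal
 * copies that leak outside the organization.
 * Returns 0 on success, -1 on any error. */
int stamp_serial(const char *path, const char *serial) {
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    /* Read the whole file into memory. */
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    rewind(f);
    char *body = malloc(len + 1);
    if (!body || fread(body, 1, len, f) != (size_t)len) {
        fclose(f);
        free(body);
        return -1;
    }
    body[len] = '\0';
    fclose(f);

    /* Rewrite the file with the serial comment on the first line. */
    f = fopen(path, "wb");
    if (!f) {
        free(body);
        return -1;
    }
    fprintf(f, "/* SERIAL: %s */\n%s", serial, body);
    fclose(f);
    free(body);
    return 0;
}
```

A periodic search for the SERIAL markers in Google Code Search and other engines then becomes the leak detector the paragraph above describes.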
About the Author:
Ed Skoudis is a founder and senior security consultant with Intelguardians, a Washington, DC-based information security consulting firm. His expertise includes hacker attacks and defenses, the information security industry and computer privacy issues. In addition to Counter Hack Reloaded, Ed is the author of Malware: Fighting Malicious Code. He received Microsoft MVP awards for Windows Server Security in 2004, 2005 and 2006, and is an alumnus of the Honeynet Project. As an expert on SearchSecurity.com, Ed answers your questions relating to information security threats.