The following is an excerpt from the book Network Forensics: Tracking Hackers through Cyberspace by Sherri Davidoff and Jonathan Ham. This section from chapter 10 describes Web proxies and caching and how these technologies can benefit forensic analysts.
It’s a port-80 world out there (and, to a lesser extent, port 443 as well). As of 2009, web traffic made up approximately 52% of all Internet traffic, and was growing at a rate of 24.76% per year.
As a result, firewalls, which filter traffic based on Layer 3 and 4 protocol information such as IP address and TCP ports, are no longer sufficient for protecting enterprise perimeters. There has been an explosion of the use of “web proxies” and “web application gateways,” which often include highly specialized Layer 7-aware firewall capabilities designed to inspect and filter web traffic.
Web proxying and caching have become increasingly popular for both filtering traffic and speeding up requests. Even consumer ISPs have latched onto the idea (sometimes using similar techniques to insert ads into pages as they are downloaded).
Regardless of whether security was a consideration during the original configuration, forensic investigators can take advantage of the granular logs and caches typically retained by web proxies. The rise of content distribution systems and distributed web caches further increases the need for forensic analysts to collect web content from local caches near the target of investigation, since web content is often modified for specific enterprises, geographic regions, device types, or even individual web clients.
10.1 Why Investigate Web Proxies?
Web proxy and cache servers can be gold mines for forensic analysts. A web proxy can literally contain the web browsing history of an entire organization all in one place. When placed at the perimeter of an organization, web proxies often contain the history of all HTTP or HTTPS traffic, including blogs, instant messaging, and web-based email such as Gmail and Yahoo! accounts. Web caching servers may also contain copies of pages themselves, for a limited time.
This is great for forensic analysts. Investigators can examine web browsing histories for everyone in an organization all at once. Moreover, it’s possible to reconstruct web pages from the cache. Too often, investigators simply visit web sites in order to see what they are. This has some serious drawbacks: first, there is no guarantee you’re seeing what the end-user saw earlier and, second, your surfing now appears in the destination server’s activity logs. If the owner of the server is an attacker or suspect, you may well have just tipped them off. It’s much better to first examine the web cache to see what you can find stored locally.
Web proxies evolved and matured for two general reasons: performance and security. There are many types of web proxies. Below are some simple examples (there is a wide spectrum of products that incorporate different aspects of each of these):
• Caching proxy—Stores previously used pages to speed up performance.
• Content filter—Inspects the content of web traffic and filters based on keywords,
presence of malware, or other factors.
• TLS/SSL proxy—Intercepts web traffic at the session layer to inspect the content
of TLS/SSL-encrypted web traffic.
• Anonymizing proxy—Acts as an intermediary to protect the identities of web
• Reverse proxy—Provides content inspection and filtering of inbound web requests
from the Internet to the web server.
Nowadays, web proxies are commonly set up to process all outbound web requests from within organizations to the Internet (sometimes referred to as a “forward proxy”). In this setup, the web proxy is often configured to provide caching, content inspection, and filtering of both outbound requests and inbound replies. This has a number of benefits. The web proxy can identify and filter out suspicious or inappropriate web sites and content. Web proxies also cache commonly used pages, which improves performance by removing the need for commonly accessed external content to be fetched anew every time it is requested. In much the same way as individual web browsers cache content to improve performance for a single client, web proxies cache content for use across the enterprise.
Reverse web proxies can be useful as well. Often, they include logs that allow investigators to identify suspicious requests and source IP addresses associated with web-based attacks against the protected server.
Web proxies are typically involved in investigations for one of several reasons:
• A user on the internal network is suspected of violating web browsing policy.
• An internal system has been compromised or may have downloaded malicious content
via the web.
• There is concern that proprietary data may have been leaked through web-based
• A web server protected by a reverse web proxy is under attack or has been hacked.
• The web proxy itself has been hacked (rare).
Throughout this chapter, we focus on analyzing “forward” web proxies, as they are commonly used in organizations. Many of the forensic techniques used to analyze forward web proxies can also be applied to other types of web proxies in different settings (with the special exception of anonymizing proxies, since these are typically designed to retain very little or no information about the endpoints).
10.2 Web Proxy Functionality
Over time, web proxies have evolved standard functions, including:
• Caching—Locally storing web objects for limited amounts of time and serving them
in response to client web requests to improve performance.
• URI Filtering—Filtering web requests from clients in real-time according to a
blacklist, whitelist, keywords, or other methods.
• Content Filtering—Dynamically reconstructing and filtering content of web requests
and responses based on keywords, antivirus scan results, or other methods.
• Distributed Caching—Caching web pages in a distributed hierarchy consisting of
multiple caching web proxies in order to provide locally customized web content, serve
advertisements, and improve performance.
We discuss each of these in turn.
10.2.1 Caching
Caching is a way of reusing data to reduce bandwidth use and load on web servers, and speed up web application performance from the end-user perspective. The web is designed as a network-based client-server model. In the simplest case, each web client request would be sent directly to the web server, which would then process the request and return data directly to the web client.
Of course, most web servers host a lot of static data that doesn’t change very often. Individual web clients often make requests for data that they have requested before. Organizations may have many web clients on their internal network making requests for data that has already been retrieved by another internal web client. Over time, the Internet community has evolved and standardized mechanisms for making web usage far more efficient by caching web server data locally and in distributed cache proxies.
Forensic investigators examining hard drives know that web pages are often cached locally by web browsers themselves, and can be retrieved through standard hard drive analysis techniques. Network forensic investigators should also be aware that web pages are often cached at the perimeters of organizations, as well as by ISPs and distributed cache proxies, and may be retrieved through analysis of web proxy servers. Client web activity may also be logged in these locations.
The HTTP protocol includes built-in mechanisms to facilitate caching that have matured over time. According to RFC 2616 (“Hypertext Transfer Protocol—HTTP/1.1”), “The goal of caching in HTTP/1.1 is to eliminate the need to send requests in many cases, and to eliminate the need to send full responses in many other cases. The former reduces the number of network round-trips required for many operations; we use an ‘expiration’ mechanism for this purpose... The latter reduces network bandwidth requirements; we use a ‘validation’ mechanism for this purpose.”
Expiration and validation mechanisms are important for forensic investigators to understand because they can indicate, among other things:
• How recently a cached web object was retrieved from the server
• Whether a web object is likely to exist in the web proxy cache
• Whether a cached version of a web object was actually viewed by a specific web client
The HTTP protocol is designed to reduce the need for web clients and proxies to make requests of web servers by providing an “expiration model” by which web servers may indicate the length of time that a page is “fresh.” While an object is “fresh,” caching web proxies may serve cached copies of the page to web clients instead of making a new request to the origin web server. This can dramatically reduce the amount of bandwidth used by an organization, and the end-user typically receives the locally cached response far more quickly than a response that must be retrieved from a remote network. When the expiration time of a web object has passed, the page is considered “stale.”
The expiration model is typically implemented through one of two mechanisms:
• Expires header—The “Expires” header lists the date and time after which the object
will be considered stale. Although this is a straightforward mechanism for indicating
page expiration, implementation can be tricky because the client and server dates and
times must be synchronized in order for this to work as designed.
• Cache-Control—As of HTTP/1.1, the “Cache-Control” field supports granular
specifications for caching, including a “max-age” directive that allows the web server to
specify the length of time for which a response is valid. The max-age directive is
defined as a number of seconds that the response is valid after receipt, so it does not
require that the absolute time between the web server and caching proxy/local system
be synchronized.
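The two expiration mechanisms can be compared in a short Python sketch. The function names `freshness_lifetime` and `is_fresh` and the sample header values are hypothetical, chosen for illustration. The sketch assumes that max-age takes precedence over Expires when both are present, and it ignores the Age header and heuristic freshness, so it is a simplification rather than a full HTTP cache implementation.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return how long a response stays fresh, or None if it does not say.

    `headers` is a dict of HTTP response header names to string values.
    """
    h = {name.lower(): value for name, value in headers.items()}

    # Cache-Control: max-age is a relative number of seconds, so it does not
    # depend on the server and cache clocks agreeing.
    match = re.search(r"\bmax-age=(\d+)", h.get("cache-control", ""), re.IGNORECASE)
    if match:
        return timedelta(seconds=int(match.group(1)))

    # Expires is an absolute date; here it is measured against the server's
    # own Date header.
    if "expires" in h and "date" in h:
        return parsedate_to_datetime(h["expires"]) - parsedate_to_datetime(h["date"])

    return None


def is_fresh(headers, received_at, now):
    """True if a response cached at `received_at` is still fresh at `now`."""
    lifetime = freshness_lifetime(headers)
    if lifetime is None:
        return False  # assumption: treat unspecified lifetime as stale
    return now - received_at < lifetime


# Hypothetical cached response headers.
sample = {
    "Date": "Thu, 01 Dec 2011 16:00:00 GMT",
    "Expires": "Thu, 01 Dec 2011 17:00:00 GMT",
}
received = datetime(2011, 12, 1, 16, 0, tzinfo=timezone.utc)
print(is_fresh(sample, received, received + timedelta(minutes=30)))
```

For an investigator, the same arithmetic run against headers recovered from a proxy cache suggests whether the proxy could have served the cached copy at a given time without contacting the origin server.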
The “validation model,” as defined by RFC 2616, allows caching web proxies and web clients to make requests of the origin web server to determine whether locally cached copies of web objects may still be used. While the proxy and/or local client still needs to contact the server in this case, the server may not need to send a full response, again improving web application performance and reducing bandwidth and load on central servers.
In order to support validation, web servers generate a “cache validator,” which they attach to each response. Web proxies and clients then provide the cache validator in subsequent requests, and if the web server responds that the object is still “valid” (i.e., using a 304 “Not Modified” HTTP status code), then the locally cached copy is used.
Common cache validators include:
• Last-Modified header—The “Last-Modified” HTTP header is used as a simple
cache validation mechanism based on an absolute date. The web proxy/client sends
the server the most recent “Last-Modified” header, and if the object has not been
modified since that date, it is considered valid.
• Entity Tag (ETag)—An ETag is a unique value assigned by the web server to a
web object located at a specific URI. The mechanism for assigning an ETag value is
not specified by a standard and varies depending on the web server. Often, the ETag
is based on a cryptographic hash of the web object, such as an MD5sum, which by
definition is changed whenever the web object is modified. ETags are sometimes also
generated from the last modified date and time, a random number, or a revision
number. “Strong” ETag values indicate that the cached web object is bit-for-bit identical
to the copy on the server, while “weak” ETag values indicate that the cached web
object is semantically equivalent, although it may not be an exact bit-for-bit copy.
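As a rough illustration of the validation model, the following Python sketch shows how a server might decide between a full response and a 304 “Not Modified.” All function names here are hypothetical. The MD5-based ETag is only one of the generation strategies mentioned above, and the sketch assumes the client sends its validator in the standard HTTP `If-None-Match` or `If-Modified-Since` request headers, with weak ETags marked by a `W/` prefix.

```python
import hashlib
from email.utils import parsedate_to_datetime


def make_etag(body):
    """One possible ETag: the quoted MD5 hex digest of the object's bytes."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def _opaque(tag):
    """Strip the weak indicator (W/) to get the opaque quoted value."""
    return tag[2:] if tag.startswith("W/") else tag


def strong_match(a, b):
    """Strong comparison: neither tag is weak and the values are identical."""
    return not a.startswith("W/") and not b.startswith("W/") and a == b


def weak_match(a, b):
    """Weak comparison: values are identical once W/ prefixes are ignored."""
    return _opaque(a) == _opaque(b)


def status_for(request_headers, current_etag, last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    `last_modified` is the object's Last-Modified value as an HTTP date string.
    Assumption: ETag values themselves contain no commas.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if any(weak_match(tag, current_etag) for tag in candidates):
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(if_modified_since):
            return 304

    return 200
```

In captured traffic or proxy logs, a 304 response to a conditional request suggests that the client or proxy already held a copy of the object, which is one of the indicators an investigator can use when reasoning about what was cached where.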
10.2.2 URI Filtering
In many organizations, web proxies are set up in order to restrict and log web surfing activity. Quite often, enterprises limit web requests to a list of known “good” web sites (“whitelisting”) or prevent users from visiting known “bad” web sites (“blacklisting”). This is generally done in order to comply with acceptable use policies, preserve bandwidth, or improve employee productivity.
The process of maintaining whitelists is fairly straightforward, but in many organizations employees need to access a wide range of web sites in order to do their jobs, and so restricting web surfing activity to a whitelist is not practical. On the flip side, maintaining blacklists can be quite complex, since it requires that administrators maintain long and constantly changing lists of known “bad” web sites. However, blacklists provide more flexibility and there are published and commercially available blacklists that can ease local administrators’ burdens. URI filtering can also be conducted based on keywords present in the URI.
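The three filtering approaches described above (whitelist, blacklist, and URI keywords) can be sketched in a few lines of Python. This is a minimal illustration, not how any particular proxy implements filtering. The function names, the "allow"/"deny" return values, and the example domains are all hypothetical.

```python
from urllib.parse import urlsplit


def _host_in(host, domains):
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_uri(uri, whitelist=None, blacklist=(), keywords=()):
    """Return "allow" or "deny" for a requested URI.

    If a whitelist is given, only listed domains are allowed. Otherwise the
    request is denied when the host is blacklisted or the URI contains a
    listed keyword.
    """
    host = (urlsplit(uri).hostname or "").lower()

    if whitelist is not None:
        return "allow" if _host_in(host, whitelist) else "deny"

    if _host_in(host, blacklist):
        return "deny"

    lowered = uri.lower()
    if any(keyword.lower() in lowered for keyword in keywords):
        return "deny"

    return "allow"


# Hypothetical policy: block one domain and one keyword.
print(filter_uri("http://ads.bad.example/banner", blacklist={"bad.example"}))
print(filter_uri("http://www.example.com/casino/play", keywords={"casino"}))
```

Matching on a leading dot (`"." + d`) keeps a blacklist entry such as `bad.example` from accidentally blocking an unrelated host like `notbad.example`.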
HR violations, including inappropriate web surfing, are among the most common reasons for network forensic investigations. As a result, forensic investigators may often be called upon to review web activity access logs and provide recommendations for implementing blacklists/whitelists.
Tools such as squidGuard allow administrators to incorporate blacklist/whitelist technology into web proxies.
10.2.3 Content Filtering
As the web becomes more dynamic and complex, transparent web proxies are increasingly used to filter web content. This is especially important because over the past decade, client-side attacks have risen to epidemic proportions, and a large number of system compromises occur through the web.
Content filters are often used to dynamically scan web objects for viruses and malware. In addition, they can filter web responses for inappropriate content based on content keywords or tags in HTTP metadata. Content filters can also be used to filter outbound web traffic, such as HTTP POSTs in order to detect proprietary data leaks or exposure of confidential data (such as Social Security numbers).
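As a simple illustration of outbound content filtering, the following Python sketch scans the body of an HTTP POST for strings shaped like Social Security numbers. The pattern, the function name `find_ssn_candidates`, and the sample bodies are assumptions made for this example. A production filter would need to handle other encodings and would produce both false positives and false negatives with a pattern this simple.

```python
import re
from urllib.parse import unquote_plus

# Matches the common NNN-NN-NNNN layout only; other layouts are not detected.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_candidates(post_body):
    """Return SSN-shaped strings found in a form-encoded POST body."""
    # Form bodies are usually URL-encoded, so decode before matching.
    decoded = unquote_plus(post_body)
    return SSN_PATTERN.findall(decoded)


# Hypothetical POST bodies.
print(find_ssn_candidates("name=Bob+Smith&ssn=078-05-1120"))
print(find_ssn_candidates("name=Bob+Smith&phone=555-867-5309"))
```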
10.2.4 Distributed Caching
Increasingly, web providers are relying on distributed hierarchies of caching web proxies to provide web content to clients. Distributed web caching has many benefits for performance, profitability, and functionality. Using a distributed caching system, web providers can reduce the load on central servers, improve performance by storing web content closer to the endpoints, dynamically serve advertisements, and customize web pages based on geographic location or user interests.
The two most commonly used protocols underlying distributed web caches are Internet Cache Protocol and Internet Content Adaptation Protocol.
10.2.4.1 Internet Cache Protocol (ICP)
The Internet Cache Protocol (ICP) is a mechanism for communication between web cache servers in a distributed web cache hierarchy. Developed in the mid-to-late 1990s, ICP was intended to further capitalize on the performance gains resulting from web caches by allowing networks of web cache servers to communicate and request cached web content from “parent” and “sibling” web caches. ICP is a lightweight protocol carried over a Layer 4 transport, typically UDP, since requests and responses must occur extremely quickly in order to be useful.
ICP is supported by Squid and the BlueCoat ProxySG, among other popular web proxies.
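For analysts who encounter ICP in packet captures, the following Python sketch builds and parses an ICP query. It is based on the ICP version 2 message layout in RFC 2186 (a 20-byte header followed by a payload), which is not reproduced in this excerpt. The function names and the returned dictionary are hypothetical, and only the query opcode's payload is decoded specially.

```python
import struct

# ICPv2 header: opcode, version, message length, request number,
# options, option data, sender host address (20 bytes, network byte order).
ICP_HEADER = struct.Struct("!BBHIII4s")

ICP_OP_QUERY = 1
ICP_OP_HIT = 2
ICP_OP_MISS = 3
ICP_VERSION = 2


def build_query(request_number, url):
    """Build an ICP_OP_QUERY message asking a peer cache about a URL."""
    # Query payload: 4-byte requester host address, then a null-terminated URL.
    payload = b"\x00" * 4 + url.encode("ascii") + b"\x00"
    length = ICP_HEADER.size + len(payload)
    header = ICP_HEADER.pack(
        ICP_OP_QUERY, ICP_VERSION, length, request_number, 0, 0, b"\x00" * 4
    )
    return header + payload


def parse_message(data):
    """Decode the header fields and URL from an ICP message."""
    opcode, version, length, request_number, _options, _option_data, _sender = (
        ICP_HEADER.unpack_from(data)
    )
    payload = data[ICP_HEADER.size:length]
    if opcode == ICP_OP_QUERY:
        payload = payload[4:]  # skip the requester host address
    url = payload.split(b"\x00", 1)[0].decode("ascii", "replace")
    return {
        "opcode": opcode,
        "version": version,
        "request_number": request_number,
        "url": url,
    }


message = build_query(7, "http://example.com/")
print(parse_message(message))
```

A sibling or parent cache would answer such a query with a hit or miss opcode carrying the same request number, which lets an analyst pair requests with replies in a capture.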
10.2.4.2 Internet Content Adaptation Protocol (ICAP)
The Internet Content Adaptation Protocol (ICAP) is designed to support distributed cache proxies which can transparently filter and modify requests and responses. ICAP is used to translate web pages into local languages, dynamically insert advertisements into web pages, scan web objects for viruses and malware, censor web responses, and filter web requests. As described in RFC 3507, “ICAP clients... pass HTTP messages to ICAP servers for some sort of transformation or other processing (‘adaptation’). The server executes its transformation service on messages and sends back responses to the client, usually with modified messages. The adapted messages may be either HTTP requests or HTTP responses.”
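To make the RFC 3507 description more concrete, the following Python sketch wraps an HTTP request in an ICAP request-modification (REQMOD) message, the form an ICAP client would send to an ICAP server for adaptation. The function name, the ICAP host `icap.example.net`, and the service name `filter` are hypothetical. The sketch only handles HTTP requests without a body, signaled by the `null-body` offset in the `Encapsulated` header.

```python
def build_reqmod(icap_host, service, http_request_headers):
    """Wrap body-less HTTP request headers in an ICAP REQMOD message.

    `http_request_headers` is the raw HTTP request line and headers as bytes,
    ending with the blank line that terminates the header section.
    """
    if not http_request_headers.endswith(b"\r\n\r\n"):
        raise ValueError("HTTP headers must end with a blank line")

    # The Encapsulated header gives byte offsets into the encapsulated
    # section: the HTTP request headers start at 0, and null-body marks
    # where a body would have started had there been one.
    icap_headers = (
        f"REQMOD icap://{icap_host}/{service} ICAP/1.0\r\n"
        f"Host: {icap_host}\r\n"
        f"Encapsulated: req-hdr=0, null-body={len(http_request_headers)}\r\n"
        "\r\n"
    ).encode("ascii")
    return icap_headers + http_request_headers


http_request = b"GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
print(build_reqmod("icap.example.net", "filter", http_request).decode("ascii"))
```

Because the ICAP server may return a modified HTTP message, the request that reaches the origin server, or the response that reaches the client, can differ from what was originally sent.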
ICAP reduces the load on central servers, allowing content providers to distribute resource-intensive operations across multiple servers. ICAP also enables content providers, ISPs, and local enterprises to more easily customize web content for local use, and selectively cache customized content “closer” to endpoints, realizing performance improvements.
ICAP, and similar protocols (such as the Open Pluggable Edge Services [OPES]), have enormous implications for practitioners of web forensics. It is simply no longer the case that a forensic analyst can actively visit a URL and expect to receive the same data that an end-user viewed at an earlier date, with a different device, or from a different network location. To recover the best possible evidence, it is always best to retrieve cached web data as close as possible to the target of investigation. For example, if a cached web page is not available on a local hard drive, the next best option may be the enterprise’s caching web proxy, followed by the local ISP’s caching web proxy.