
Python Forensics: A Workbench for Inventing and Sharing Digital Forensic Technology

In this excerpt of Python Forensics, author Chet Hosmer offers some ground rules for using the Python programming language in forensic applications.

The following is an excerpt from the book Python Forensics: A Workbench for Inventing and Sharing Digital Forensic Technology written by author Chet Hosmer and published by Syngress. This section from chapter three explains some need-to-know basics for using the Python programming language in forensic applications.


In 1998, I authored a paper entitled "Using SmartCards and Digital Signatures to Preserve Electronic Evidence" (Hosmer, 1998). The purpose of the paper was to advance the early work of Gene Kim, creator of the original Tripwire technology (Kim, 1993) as a graduate student at Purdue. I was interested in advancing the model of using one-way hashing technologies to protect digital evidence, and specifically I was applying the use of digital signatures bound to a SmartCard that provided two-factor authenticity of the signing (Figure 3.1).

Years later I added trusted timestamps to the equation, adding provenance: proof of the exact "when" of the signing.

Two-factor authentication combines a secure physical device such as a SmartCard with a password that unlocks the card's capabilities. This yields "something held" and "something known." In order to perform operations like signing, you must be in possession of the SmartCard and you must know the PIN or password that unlocks the card's functions.

Thus, my interest in applying one-way hashing methods, digital signature algorithms, and other cryptographic technologies to the field of forensics has been a 15-year journey . . . so far. The application of these technologies to evidence preservation, evidence identification, authentication, access control decisions, and network protocols continues today. I want to make sure that you have a firm understanding of the underlying technologies, their many applications for digital investigation, and of course their use in Python forensics.

Figure 3.1: Cryptographic SmartCard.

Before I dive right in and start writing code, as promised I want to set up some ground rules for using the Python programming language in forensic applications.

Naming conventions and other considerations

For the development of Python forensic applications, I will define here the rules and naming conventions used throughout the cookbook chapters in the book. Part of this is to compensate for Python's lack of enforcement of strongly typed variables and true constants. More importantly, it is to define a style that will make the programs more readable and easier to follow, understand, and modify or enhance.

Therefore, here are the naming conventions I will be using.

Constant name
Rule: Uppercase with underscore separation
Example: MAXIMUM_RECORDED_TEMPERATURE

Local variable name
Rule: Lowercase with bumpy caps (underscores are optional)
Example: currentTemperature

Global variable name
Rule: Prefix gl lowercase with bumpy caps (underscores are optional)
Note: Globals should be contained to a single module
Example: gl_maximumRecordedTemperature

Function name
Rule: Uppercase with bumpy caps (underscores optional) with active voice
Example: ConvertFahrenheitToCentigrade(. . .)

Object name
Rule: Prefix ob_ lowercase with bumpy caps
Example: ob_myTempRecorder

Module name
Rule: An underscore followed by lowercase with bumpy caps
Example: _tempRecorder

Class names
Rule: Prefix class_ then bumpy caps and keep brief
Example: class_TempSystem
You will see many of these naming conventions in action during this chapter.
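To see the conventions side by side, here is a small illustrative fragment of my own (the names are hypothetical, not code from the book):

```python
# Hypothetical fragment illustrating the naming conventions above.

MAXIMUM_RECORDED_TEMPERATURE = 130            # constant: uppercase with underscores

gl_maximumRecordedTemperature = 0             # global: gl_ prefix, bumpy caps


def ConvertFahrenheitToCentigrade(tempF):     # function: uppercase bumpy caps, active voice
    currentTemperature = (tempF - 32) * 5.0 / 9.0   # local: lowercase bumpy caps
    return currentTemperature


class class_TempSystem:                       # class: class_ prefix, kept brief
    def __init__(self):
        self.temps = []


ob_myTempRecorder = class_TempSystem()        # object: ob_ prefix
```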

Our first application: one-way file system hashing

The objective for our first Python Forensic Application is as follows:

  1. Build a useful application and tool for forensic investigators.
  2. Develop several modules along the way that are reusable throughout the book and for future applications.
  3. Develop a solid methodology for building Python forensic applications.
  4. Begin to introduce more advanced features of the language.


Before we can build an application that performs one-way file system hashing, I need to better define one-way hashing. Many of you reading this are probably saying, "I already know what one-way hashing is, let's move on." However, this is such an important underpinning for computer forensics that it is worthy of a good definition, possibly even a better one than you currently have.

One-way hashing algorithms' basic characteristics

  1. The one-way hashing algorithm takes a stream of binary data as input; this could be a password, a file, an image of a hard drive, an image of a solid state drive, a network packet, 1's and 0's from a digital recording, or basically any continuous digital input.
  2. The algorithm produces a message digest which is a compact representation of the binary data that was received as input.
  3. It is infeasible to determine the binary input that generated the digest with only the digest. In other words, it is not possible to reverse the process using the digest to recover the stream of binary data that created it.
  4. It is infeasible to create a new binary input that will generate a given message digest.
  5. Changing a single bit of the binary input data will generate a completely different message digest.
  6. Finally, it is infeasible to find two unique arbitrary streams of binary data that produce the same digest.
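Several of these characteristics are easy to observe with Python's built-in hashlib module. In this minimal sketch (my own, not part of the book's p-fish code), flipping a single bit of the input produces a completely different, fixed-size digest:

```python
import hashlib

data = b"digital evidence stream"
flipped = bytes([data[0] ^ 0x01]) + data[1:]   # same input with one bit flipped

digestA = hashlib.sha256(data).hexdigest()
digestB = hashlib.sha256(flipped).hexdigest()

# The digest is compact (256 bits = 64 hex characters) regardless of input
# size, and the one-bit change yields an entirely different value.
print(digestA)
print(digestB)
```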
Table 3.1: Popular one-way hashing algorithms.

What are the popular cryptographic hash algorithms?
There are a number of algorithms that produce message digests. Table 3.1 provides background on some of the most popular algorithms.

What are the tradeoffs between one-way hashing algorithms?
The MD5 algorithm is still in use today, and for many applications the speed, convenience, and interoperability have made it the algorithm of choice. Due to attacks on the MD5 algorithm and the increased likelihood of collisions, many organizations are moving to SHA-2 (256 and 512 bits are the most popular sizes). Many organizations have opted to skip SHA-1 as it suffers from some of the same weaknesses as MD5.
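The algorithms discussed here are all available directly from hashlib (the SHA-3 variants were added to hashlib in Python 3.6), which makes comparing their digest sizes straightforward:

```python
import hashlib

# Digest lengths (in bits) of the algorithms discussed above, computed
# over the same sample input.
for name in ("md5", "sha1", "sha256", "sha512"):
    digest = hashlib.new(name, b"same input").hexdigest()
    print(f"{name:8s} {len(digest) * 4:4d} bits  {digest}")
```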


Considerations for moving to SHA-3 are still in the future, and it will be a couple of years before broader adoption is in play. SHA-3 is completely different and was designed to be easier to implement in hardware to improve performance (speed and power consumption) for use in embedded or handheld devices. We will see how quickly the handheld devices' manufacturers adopt this newly established standard.

What are the best-use cases for one-way hashing algorithms in forensics?
Evidence preservation: When digital data are collected (for example, when imaging a mechanical or solid state drive), the entire contents -- in other words, every bit collected -- are combined to create a unique one-way hashing value. At any later point, the one-way hashing value can be recalculated. If the new calculation matches the original, this proves that the evidence has not been modified. This assumes, of course, that the original calculated hash value has been safeguarded against tampering, since there is no held secret and the algorithms are publicly available. Anyone could recalculate a hash; therefore, the chain of custody of digital evidence, including the generated hash, must be maintained.
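A minimal sketch of this preserve-and-verify cycle, using SHA-256 and a throwaway file as a stand-in for a real drive image:

```python
import hashlib
import os
import tempfile


def HashEvidence(filePath, blockSize=65536):
    """Return the SHA-256 hexdigest of every bit of filePath, read in chunks."""
    sha = hashlib.sha256()
    with open(filePath, "rb") as f:
        for chunk in iter(lambda: f.read(blockSize), b""):
            sha.update(chunk)
    return sha.hexdigest()


# Stand-in "evidence" file for demonstration purposes only
fd, imagePath = tempfile.mkstemp()
os.write(fd, b"raw bits of an acquired drive image")
os.close(fd)

originalHash = HashEvidence(imagePath)   # recorded at acquisition time
verifyHash = HashEvidence(imagePath)     # recalculated later to prove integrity
os.remove(imagePath)
```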

Search: One-way hashing values have been traditionally utilized to perform searches of known file objects. For example, if law enforcement has a collection of confirmed child-pornography files, the hashes could be calculated for each file. Then any suspect system could be scanned for the presence of this contraband by calculating the hash values of each file and comparing the resulting hashes to the known list of contraband hash values (those resulting from the child-pornography collection). If matches are found, then the files on the suspect system matching the hash values would be examined further.
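The search itself reduces to set membership on the digests. In this sketch, the "known" hash values are manufactured by hashing placeholder byte strings, purely for illustration:

```python
import hashlib

# Hypothetical known-contraband hash set (values fabricated from
# placeholder bytes; a real list would come from confirmed files)
knownBad = {
    hashlib.sha256(b"contraband-1").hexdigest(),
    hashlib.sha256(b"contraband-2").hexdigest(),
}

# Simulated suspect system: file name -> file contents
suspectFiles = {
    "report.doc": b"ordinary document",
    "img0042.jpg": b"contraband-1",
}

# Any file whose hash appears in the known list warrants further examination
hits = [name for name, data in suspectFiles.items()
        if hashlib.sha256(data).hexdigest() in knownBad]
print(hits)
```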

Black Listing: As in the search example, it is possible to create a list of known bad hashes. These could represent contraband, as in the child-pornography example; they could match known malicious code or cyber weapon files, or the hashes of classified or proprietary documents. The discovery of hashes matching any of these black-listed items would provide investigators with key evidence.

White Listing: By creating a list of known good or benign hashes (operating system or application executables, vendor-supplied dynamic link libraries, or known trustworthy application download files), investigators can use the lists to filter out files that they do not have to examine, because those files were previously determined to be good. Using this methodology, you can dramatically reduce the number of files that require examination and focus your attention on files that are not in the known good hash list.

Change detection: One popular defense against malicious changes to websites, routers, firewall configuration, and even operating system installations is to hash a "known good" installation or configuration. Then periodically you can re-scan the installation or configuration to ensure no files have changed. In addition, you must of course make sure no files have been added or deleted from the "known good" set.

Fundamental requirements

Now that we have a better understanding of one-way hashing and its uses, what are the fundamental requirements of our one-way file system hash application?

When defining requirements for any program or application, I want to state them as succinctly as possible and with little jargon, so anyone familiar with the domain can understand them -- even if they are not a software developer. Also, each requirement should have an identifier so that it can be traced from definition through design, development, and validation. I like to give the designers and developers room to innovate, thus I try to focus on WHAT, not HOW, during requirements definition (Table 3.2).

Table 3.2: Basic requirements

Design considerations

Now that I have defined the basic requirements for the application I need to factor in the design considerations. First, I would like to leverage or utilize as many of the built-in functions of the Python Standard Library as possible. Taking stock of the core capabilities, I like to map the requirements definition to Modules and Functions that I intend to use. This will then expose any new modules either from third party modules or new modules that need to be developed (Table 3.3).

One of the important steps for a designer -- or at least one of the fun parts -- is to name the program. I have decided to name this first program p-fish, short for Python-file system hashing.

Next, based on this review of Standard Library functions I must define what modules will be used in our first application:

  1. argparse for user input
  2. os for file system manipulation
  3. hashlib for one-way hashing
  4. csv for result output (other optional outputs could be added later)
  5. logging for event and error logging
  6. Along with useful miscellaneous modules like time, sys, and stat
Table 3.3: Standard library mapping
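In code, this mapping amounts to little more than the import section at the top of the program (a sketch only; each module appears where the corresponding requirement is implemented):

```python
import argparse   # user input: command line parsing and validation
import os         # file system manipulation: os.walk, os.path, os.stat
import hashlib    # one-way hashing: md5, sha256, sha512, ...
import csv        # result output: the p-fish report
import logging    # event and error logging
import time       # miscellaneous support modules
import sys
import stat
```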

Program structure

Next, I need to define the structure of our program, in other words how I intend to put the pieces together. This is critical, especially if our goal is to reuse components of this program in future applications. One way to compose the components is with a couple simple diagrams as shown in Figures 3.2 and 3.3.

Figure 3.2: Context diagram: Python-file system hashing (p-fish).
Figure 3.3: p-fish internal structure.

The context diagram is very straightforward and simply depicts the major inputs and outputs of the proposed program. A user specifies the program arguments; p-fish takes those inputs, processes the file system (hashing, extracting metadata, etc.), and writes a report and any notable events or errors to the "p-fish report" and the "p-fish event and error log" files, respectively.

Turning to the internal structure, I have broken the program down into five major components: the Main program, the ParseCommandLine function, the WalkPath function, the HashFile function, and the CSVWriter class, plus the logger (note the logger is actually the Python logging module), which is utilized by the major functions of p-fish. I briefly describe the operation of each below, and the code walk-through section provides a more detailed line-by-line explanation of how each function operates.

Main function
The purpose of the Main function is to control the overall flow of this program. For example, within Main I set up the Python logger, I display startup and completion messages, and keep track of the time. In addition, Main invokes the command line parser and then launches the WalkPath function. Once WalkPath completes Main will log the completion and display termination messages to the user and the log.

ParseCommandLine function
In order to provide smooth operation of p-fish, I leverage ParseCommandLine not only to parse but also to validate the user input. Once completed, information that is germane to program functions such as WalkPath, HashFile, and CSVWriter is available from the parser-generated values. For example, since the hashType is specified by the user, this value must be available to HashFile. Likewise, CSVWriter needs the path where the resulting p-fish report will be written, and WalkPath requires the starting or rootPath to start the walk.
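A condensed approximation of that parse-and-validate step (my own sketch, not the book's exact code; the option names and the validator are illustrative):

```python
import argparse
import os


def ValidateDirectory(theDir):
    """argparse 'type' helper: the supplied path must exist and be readable."""
    if not os.path.isdir(theDir):
        raise argparse.ArgumentTypeError("directory does not exist")
    if not os.access(theDir, os.R_OK):
        raise argparse.ArgumentTypeError("directory is not readable")
    return theDir


def ParseCommandLine(argv=None):
    """Parse and validate user input; invalid paths abort with a clear error."""
    parser = argparse.ArgumentParser(description="p-fish sketch")
    parser.add_argument("--hashType", choices=["md5", "sha256", "sha512"],
                        default="sha256", help="one-way hash to apply")
    parser.add_argument("--rootPath", type=ValidateDirectory, required=True,
                        help="starting point of the walk")
    parser.add_argument("--reportPath", type=ValidateDirectory, required=True,
                        help="directory where the report is written")
    return parser.parse_args(argv)


# example invocation with explicit arguments
args = ParseCommandLine(["--rootPath", ".", "--reportPath", "."])
```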

WalkPath function
The WalkPath function must start at the root of the directory tree or path and traverse every directory and file. For each valid file encountered it will call the HashFile function to perform the one-way hashing operations. Once all the files have been processed WalkPath will return control back to Main with the number of files successfully processed.
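That traversal maps naturally onto os.walk. A minimal sketch (the per-file callback stands in for HashFile, which comes next):

```python
import os
import shutil
import tempfile


def WalkPath(rootPath, processFile):
    """Traverse every directory under rootPath, calling processFile for each
    file; return the number of files successfully processed."""
    processedCount = 0
    for root, _dirs, files in os.walk(rootPath):
        for name in files:
            if processFile(os.path.join(root, name)):
                processedCount += 1
    return processedCount


# Demonstration on a throwaway directory tree with two files
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "sub"))
for relPath in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(base, relPath), "w") as f:
        f.write("x")

count = WalkPath(base, lambda path: True)   # every file "processed" trivially
shutil.rmtree(base)
```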

HashFile function
The HashFile function will open, read, hash, and obtain metadata regarding the file in question. For each file, a row of data will be sent to the CSVWriter to be included in the p-fish report. Once the file has been processed, HashFile will return control back to WalkPath in order to fetch the next file.
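The heart of that operation might look like the following sketch (chunked reading keeps memory use flat even for very large files; the metadata gathered here is limited to file size for brevity):

```python
import hashlib
import os
import tempfile


def HashFile(filePath, hashType="sha256"):
    """Open, read, and hash filePath; return (size, hexdigest),
    or None if the file cannot be processed."""
    try:
        hashObj = hashlib.new(hashType)
        with open(filePath, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                hashObj.update(chunk)
        return os.path.getsize(filePath), hashObj.hexdigest()
    except (OSError, ValueError):
        return None


# Demonstration with a throwaway three-byte file
fd, path = tempfile.mkstemp()
os.write(fd, b"abc")
os.close(fd)
size, digest = HashFile(path)
os.remove(path)
```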

CSVWriter (class)
In order to provide an introduction to class and object usage I decided to create CSVWriter as a class instead of a simple function. You will see more of this in upcoming cookbook chapters but CSVWriter sets up nicely for a class/object demonstration. The csv module within the Python Standard Library requires that the "writer" be initialized. For example, I want the resulting csv file to have a header row made up of a static set of columns. Then subsequent calls to writer will contain data that fills in each row. Finally, once the program has processed all the files the resulting csv report must be closed. Note that as I walk through the program code you may wonder why I did not leverage classes and objects more for this program. I certainly could have, but felt for the first application I would create a more function-oriented example.
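The shape of such a class might be as follows (my approximation; method names and columns are illustrative, not necessarily those used in the book):

```python
import csv
import os
import tempfile


class _CSVWriter:
    """Report-writer sketch: write a static header row on creation,
    then one data row per file processed, and close when done."""

    def __init__(self, fileName):
        self.csvFile = open(fileName, "w", newline="")
        self.writer = csv.writer(self.csvFile)
        self.writer.writerow(("File", "Path", "Size", "Hash"))  # header row

    def writeCSVRow(self, fileName, filePath, fileSize, hashValue):
        self.writer.writerow((fileName, filePath, fileSize, hashValue))

    def writerClose(self):
        self.csvFile.close()


# Demonstration: one header plus one data row, read back for inspection
reportPath = os.path.join(tempfile.mkdtemp(), "report.csv")
report = _CSVWriter(reportPath)
report.writeCSVRow("a.txt", "/evidence/a.txt", 3, "abc123")
report.writerClose()

with open(reportPath, newline="") as f:
    rows = list(csv.reader(f))
```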


Logger
The built-in Standard Library logger provides us with the ability to write messages to a log file associated with p-fish. The program can write information messages, warning messages, and error messages. Since this is intended to be a forensic application, logging the program's operations is vital. You can expand the program to log additional events in the code; they can be added to any of the _pfish functions.
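A minimal version of that logger setup (the log file name and messages are illustrative; force=True, available since Python 3.8, simply resets any prior logging configuration):

```python
import logging
import os
import tempfile

# Hypothetical log file location for demonstration
logPath = os.path.join(tempfile.mkdtemp(), "pFishLog.log")

logging.basicConfig(filename=logPath, level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(message)s",
                    force=True)

# The three message severities used by the program
logging.info("Welcome to p-fish ... new scan started")
logging.warning("File skipped: access denied")
logging.error("Unable to open file")
logging.shutdown()

with open(logPath) as f:
    logText = f.read()
```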

Writing the code
I decided to create two files, mainly to show you how to create your own Python module and also to give you some background on how to separate capabilities. For this first simple application, I created (1) pfish.py and (2) _pfish.py. As you may recall, all modules that are created begin with an underscore, and since this module contains all the support functions for pfish, I simply named it _pfish.py. If you would like to split out the modules to better separate the functions, you could create separate modules for the HashFile function, the WalkPath function, etc. This decision is typically based on how tightly or loosely coupled the functions are -- or, better stated, on whether you wish to reuse individual functions later that need to stand alone. If that is the case, then you should separate them out.

Figure 3.4: p-fish WingIDE setup.

In Figure 3.4 you can see my IDE setup for the project pfish. Notice the project section at the far upper right, which specifies the files associated with the project. I also have both files open -- you can see the two tabs at the far left, about halfway down, where I can view the source code in each of the files. As you would expect, in the upper left quadrant you can see that the program is running and the variables are available for inspection. Finally, in the upper center portion of the screen you can see the current display messages from the program, reporting that the command line was processed successfully, along with the welcome message for pfish.

About the author:
Chet Hosmer is the Chief Scientist & Sr. Vice President at Allen Corporation and a co-founder of WetStone Technologies, Inc. Chet has been researching and developing technology and training for the digital forensic market for almost two decades. He has also been a frequent contributor to technical and news stories relating to digital investigation and has been interviewed and quoted by IEEE, The New York Times, The Washington Post, Government Computer News, and Wired Magazine. Chet also serves as a visiting professor at Utica College where he teaches in the Cybersecurity Graduate program. Chet delivers keynote and plenary talks on various cyber security related topics around the world each year.


This was last published in December 2014
