The following is an excerpt from Cybersecurity and Applied Mathematics by Leigh Metcalf and William Casey, published by Syngress. This section from chapter eight outlines the uses of string analysis for cyber data.
A string is any sequence of symbols that is interpreted to represent a precise meaning. Written language, including this sentence, provides strings in which informational messages may be expressed as a sequence of words (each of which is a sequence of letters). However, natural languages such as English may give rise to ambiguous meaning. For example, consider the following statement: “Time flies like an arrow; fruit flies like a banana.” In computational settings, it’s important that strings (along with their encoding and interpretation) have discrete, precise meanings. Formal languages provide the general backdrop for our discussion of strings.
A central goal when analyzing cyber data is to seek a string representation for the problem’s objects so that similarities in their string representations will provide a meaningful result for the analysis problem at hand. To emphasize this point, we consider signature-based detection of cyber attacks and how the problem of determining whether a system is safe or compromised can be approached through the analysis of strings. We set the stage by providing a general background of cyber data and analysis techniques, followed by our historical examples. Then we focus the remainder of the chapter on common contemporary techniques used to analyze sequential or string cyber data.
Many different types of data arise from cyber security scenarios. Here we identify a few prominent data types and outline an organizational framework for thinking about cyber data. Generally, within cyber security scenarios, much or little may be known about the objects studied. One way to think about information (known and unknown) for digital objects is by analogy with physical objects, such as an antique. An antique has a provenance, or a history of events, which affects its state. In the real world even a valuable historical object may have partial or disputed information concerning its provenance. For digital objects we consider provenance similarly; there can also be incomplete or partial awareness concerning the origins or histories of data objects (e.g., files, programs, and configuration settings). With the notion of provenance in mind we may consider the major types of cyber data.
The three main forms of data we consider are static, dynamic, and behavioral.
A static data analysis will focus on objects such as files, system configuration parameters, and programs (specified in a programming language or as a machine executable). This type of analysis may seek to identify evidence that the security of a system was breached or that a particular program is capable of breaching the system, either directly (i.e., malware) or indirectly (i.e., a vulnerability) if attacked in a certain way. A computer system comprises thousands of programs and libraries, and for each of them a cyber security operator or analyst may have only a small amount of information concerning its provenance. Because provenance information is usually thin, attackers gain an advantage: they may camouflage malware within this context of limited awareness (i.e., among the many system files and what they do).
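As a minimal sketch of one static check, the code below scans the raw bytes of an object for known byte patterns. The signature names and patterns in `SIGNATURES` and the function `scan_bytes` are hypothetical, chosen only for illustration; they are not real malware indicators and are not taken from the book.

```python
from typing import Dict, List

# Hypothetical signatures (name -> byte pattern), for illustration only.
SIGNATURES: Dict[str, bytes] = {
    "demo_dropper": b"\xde\xad\xbe\xef",
    "demo_marker": b"EVIL_MARKER",
}


def scan_bytes(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


# Usage: scan the contents of a file that has already been read into memory.
sample = b"header" + b"\xde\xad\xbe\xef" + b"payload"
print(scan_bytes(sample))
```

A plain substring test like this only finds exact matches; it says nothing about files that match no signature, which is where thin provenance information leaves the analyst exposed.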
A dynamic analysis will focus on data generated within a computer system when certain stimuli or inputs are provided. One form of dynamic data is a system log file, which contains metadata concerning the operation of various components. For example, we may monitor how a Microsoft system registry database changes before and after a given program is executed, and likewise what (if any) files are created as a consequence of executing a given program by evaluating logs or designing our own system monitors. Another and more interactive example arises when monitoring network traffic and identifying problematic communications (possibly to known command and control botnet servers). Still another example is fuzz testing in which a large variety of stimuli is provided to a program or library to find fault conditions, which may prove to be a software vulnerability. A dynamic analysis may be realized either as a system monitor or as an experiment in which the cyber analyst has created meaningful ways to observe system states.
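The before-and-after comparison described above can be sketched for files as follows. This is an illustrative monitor under stated assumptions, not the authors' tooling: `snapshot` and `diff_snapshots` are hypothetical helper names, and the system state is reduced to a mapping from file path to SHA-256 digest.

```python
import hashlib
import os
from typing import Dict, Set, Tuple


def snapshot(root: str) -> Dict[str, str]:
    """Map each file path under root to the SHA-256 hex digest of its contents."""
    state: Dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                state[path] = hashlib.sha256(handle.read()).hexdigest()
    return state


def diff_snapshots(
    before: Dict[str, str], after: Dict[str, str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the sets of (created, deleted, modified) paths between two snapshots."""
    created = set(after) - set(before)
    deleted = set(before) - set(after)
    modified = {p for p in set(before) & set(after) if before[p] != after[p]}
    return created, deleted, modified
```

In an experiment, the analyst would call `snapshot` on a directory, execute the program under study in an isolated environment, call `snapshot` again, and pass both results to `diff_snapshots`.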
A behavioral analysis will also be focused on dynamic data such as log and monitor data, but the focus of behavioral analysis will be to consider the sequences of actions and events as expressions of behaviors. Therefore, this also includes some model of behavior (such as baseline and anomalous behaviors). One type of behavioral analysis will include program tracing and a statistical model for learning a common behavior of a malware group, which distinguishes it from benign software or other malware groups. In this way, the behavioral data may include both the trace outputs as well as the model which describes a particular behavior.
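To make the idea of a baseline concrete, the sketch below learns which pairs of consecutive events occur in known traces and scores a new trace by the fraction of its pairs never seen before. This is a deliberately simple stand-in for the statistical models the text refers to; the event names and the functions `learn_baseline` and `anomaly_score` are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Bigram = Tuple[str, str]


def bigrams(trace: Sequence[str]) -> List[Bigram]:
    """Return the consecutive event pairs of a trace, in order."""
    return list(zip(trace, trace[1:]))


def learn_baseline(traces: Iterable[Sequence[str]]) -> Set[Bigram]:
    """Collect every event pair observed in the baseline traces."""
    seen: Set[Bigram] = set()
    for trace in traces:
        seen.update(bigrams(trace))
    return seen


def anomaly_score(trace: Sequence[str], baseline: Set[Bigram]) -> float:
    """Fraction of the trace's event pairs that are absent from the baseline."""
    pairs = bigrams(trace)
    if not pairs:
        return 0.0
    unseen = sum(1 for pair in pairs if pair not in baseline)
    return unseen / len(pairs)


# Usage with hypothetical traces of system-call-like events.
baseline = learn_baseline([["open", "read", "close"], ["open", "write", "close"]])
print(anomaly_score(["open", "connect", "send", "close"], baseline))
```

Here the behavioral data consists of both the traces and the learned set of pairs, mirroring the point that the model is part of the data.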
Modes of analyzing cyber data
One common operational mode of analysis is forensic analysis, which usually takes place after an event, for example a data breach, to investigate what happened. Forensic analysis is similar in nature to a crime scene investigation, where an investigator focuses on the artifacts left behind in order to gain some understanding of key questions, for example attribution (i.e., who initiated the attack).
Another mode of analysis is experimental, often testing an object to identify how it compares to a reference data set. These types of analysis are commonly done for artifacts of unknown provenance to test if they are malware or contain vulnerabilities. Still another more operational mode is online analysis, and this may be thought of as a filter pipeline where actions are tested against a set of signatures in real time, with the possible outcome that a signature match may invoke a response to keep the system safe. Examples of this include network filters.
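The filter pipeline for online analysis might be sketched as a generator that tests each incoming event against a signature set and emits a verdict as soon as the event arrives. The two regular-expression signatures and the name `filter_stream` are hypothetical, and the response is reduced to an "allow" or "block" label.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Hypothetical signature set (name, compiled pattern), for illustration only.
SIGNATURES: List[Tuple[str, re.Pattern]] = [
    ("path_traversal", re.compile(r"\.\./")),
    ("sql_injection_demo", re.compile(r"(?i)union\s+select")),
]


def filter_stream(
    events: Iterable[str],
    signatures: List[Tuple[str, re.Pattern]] = SIGNATURES,
) -> Iterator[Tuple[str, str, str]]:
    """Yield (verdict, event, signature name) for each event, one at a time."""
    for event in events:
        for name, pattern in signatures:
            if pattern.search(event):
                # First matching signature decides the response.
                yield ("block", event, name)
                break
        else:
            # No signature matched this event.
            yield ("allow", event, "")


# Usage: events could come from any iterable, such as lines read from a socket.
for verdict in filter_stream(["GET /index.html", "GET /../../etc/passwd"]):
    print(verdict)
```

Because the function consumes an iterable lazily, it processes events as they are produced rather than waiting for a complete log, which is the property that separates online analysis from forensic analysis.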
Another mode of analysis is formal methods and verification. This approach is more logical in that it considers how programs or data are to be interpreted by the system and attempts to compute or verify that certain unsafe states are not reachable. This type of analysis often employs computational processes to prove various properties of the software artifact.
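A toy version of the reachability question can be sketched as a breadth-first search over an explicit, finite state graph. Real verification tools work on far richer models; the transition table, the state names, and the function `find_unsafe_path` below are assumptions made purely for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set


def find_unsafe_path(
    transitions: Dict[str, Iterable[str]], initial: str, unsafe: Set[str]
) -> Optional[List[str]]:
    """Return a shortest path from initial to an unsafe state, or None if none exists."""
    parent: Dict[str, Optional[str]] = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if state in unsafe:
            # Walk the parent links back to the initial state.
            path: List[str] = []
            current: Optional[str] = state
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in transitions.get(state, ()):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


# Usage with a hypothetical model in which "root" is the unsafe state.
model = {"start": ["auth"], "auth": ["user"], "user": ["start"], "debug": ["root"]}
print(find_unsafe_path(model, "start", {"root"}))
```

A result of `None` means the search exhausted every state reachable from the initial state without finding an unsafe one, which is the property being verified for this finite model; a returned path is a counterexample showing how the unsafe state is reached.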
Generally, in practice, various types of cyber data and analytical modes are mixed in ways to provide the best approaches for the problem at hand. In order to introduce the area of string analysis in cyber data we have selected a few of the most primitive string comparison methods which often are applied to the various data forms and as a part of a variety of modes for analyzing cyber data.
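This excerpt does not name the comparison methods the chapter goes on to cover. As one illustration of what a basic string comparison looks like, the sketch below computes the Levenshtein (edit) distance between two strings, that is, the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The choice of this measure is an assumption for the example, not a claim about the chapter's contents.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


# Usage: a small distance suggests two strings are near-duplicates.
print(edit_distance("kitten", "sitting"))  # 3
```

Applied to cyber data, the two strings could be any string representation of the objects under study, such as file names, domain names, or sequences of observed events.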
About the authors:
Leigh Metcalf researches network security, game theory, formal languages and dynamical systems. She is Editor in Chief of the Journal on Digital Threats and has a PhD in Mathematics.
Will Casey works in threat analysis, code analysis, natural language processing, genomics, bioinformatics and applied mathematics. He has an MS and MA in Mathematics and a PhD in Applied Mathematics.
Cybersecurity and Applied Mathematics. Reprinted with permission from Elsevier/Syngress, Copyright ©2016