The following is an excerpt from Research Methods for Cyber Security by authors Thomas W. Edgar and David O. Manz and published by Syngress. This section from chapter six explores the various categories of machine learning.
Machine learning is a field of study that looks at using computational algorithms to turn empirical data into usable models. The machine learning field grew out of the traditional statistics and artificial intelligence communities. Through the efforts of mega corporations such as Google, Microsoft, Facebook, and Amazon, machine learning has become one of the hottest computational science topics in the last decade. Through their business processes, immense amounts of data have been and will be collected. This has provided an opportunity to reinvigorate the statistical and computational approaches to autogenerate useful models from data.
Machine learning algorithms can be used to (a) gather understanding of the cyber phenomenon that produced the data under study, (b) abstract the understanding of underlying phenomena in the form of a model, (c) predict future values of a phenomenon using the above-generated model, and (d) detect anomalous behavior exhibited by a phenomenon under observation. There are several open-source implementations of machine learning algorithms that can be used with either application programming interface (API) calls or nonprogrammatic applications. Examples of such implementations include Weka, Orange, and RapidMiner. The results of such algorithms can be fed to visual analytic tools such as Tableau and Spotfire to produce dashboards and actionable pipelines.
Cyber space and its underlying dynamics can be conceptualized as a manifestation of human actions in an abstract and high-dimensional space. In order to begin solving some of the security challenges within cyber space, one needs to sense various aspects of cyber space and collect data. The observational data obtained is usually large and increasingly streaming in nature. Examples of cyber data include error logs, firewall logs, and network flow.
Categories of machine learning
There are two dimensions around which machine learning is generally categorized: the process by which it learns and the type of output or problem it attempts to solve. For the first, machine learning-based solution strategies can be broadly classified into three categories based on the mechanism used to perform learning, namely supervised learning, semi-supervised learning, and unsupervised learning. For the latter, machine learning algorithms can be broken into four categories: classification, clustering, regression, and anomaly detection.
The style of learning has an impact upon the question you are trying to solve. In some cases, you have data for which you do not know the ground truth; other times it is possible to label data with categories or classifications. Sometimes you know what a good result looks like, but you may not know what variables are important to get there. Categorizing machine learning techniques by learning style can help you select the best approach for your research. Table 6.1 discusses the different styles and provides a sample set of machine learning algorithms.
Supervised learning involves using a labeled dataset (i.e., the outcomes are known and labeled). Unsupervised learning is used in cases where the labels of the data are unknown (e.g., when the outcomes are unknown, but some measure of similarity is desired). Examples of unsupervised learning approaches include self-organizing maps (SOMs), K-means clustering, expectation-maximization (EM), and hierarchical clustering. Unsupervised learning approaches can also be used for preliminary data exploration such as clustering similar error log entries. Results of unsupervised algorithms are frequently visualized using visual analytic tools. An important caveat on using an unsupervised approach is to make sure one knows the numeric space that the data encompasses as well as the type of distance measure applied. Semi-supervised approaches are a hybrid of unsupervised and supervised approaches. Such approaches are used when a portion of the data is unlabeled, and they can be inductive or transductive.
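To make the caveat about numeric space and distance measure concrete, the following is a minimal sketch of K-means clustering in plain Python. It is not taken from the book: the two log-entry features, their values, and the function names are hypothetical, and the deterministic choice of starting centroids is a simplification. The features are rescaled to a common range before the Euclidean distance is applied, so that the feature with the larger raw values does not dominate the clustering.

```python
import math

def min_max_scale(rows):
    """Rescale each column to [0, 1] so no single feature dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance; a different measure may suit other kinds of data."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, iterations=20):
    """Basic K-means; centroids start at the first k points (an assumption)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical features per error log entry:
# (message length in characters, number of error codes in the entry)
log_features = [(120, 1), (900, 7), (130, 2), (880, 6), (110, 1), (950, 8)]
scaled = min_max_scale(log_features)
labels, centroids = k_means(scaled, k=2)
print(labels)
```

With these made-up values the short entries and the long entries fall into separate clusters. Skipping the scaling step would make the result depend almost entirely on message length, which is the kind of effect the caveat warns about.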
While it is sometimes helpful to pick algorithms based on what type of input data is available, it is equally helpful to break them out along the result types provided. Variables within a dataset can be numeric (i.e., discrete or continuous), ordinal (i.e., order matters), cardinal (i.e., integer valued), or nominal/categorical (i.e., used as an outcome class name). Machine learning algorithms can also be categorized based on the type of problem they solve. An example of such a breakdown of algorithms is listed in Table 6.2.
Decision tree algorithms: classification trees (e.g., C4.5) can be used in cases of a nominal class variable, while regression trees can be used for continuous numeric valued outcome variables.
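The difference between the two tree types shows up in how a single split is scored. The sketch below is illustrative only and is not the C4.5 algorithm itself: it scores candidate thresholds with plain entropy for a nominal class variable (C4.5 uses a gain ratio built on entropy) and with variance for a continuous outcome. The data values and variable names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of nominal class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def variance(values):
    """Population variance of a list of continuous outcomes."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys, impurity):
    """Return (threshold, weighted impurity) for the best split x <= threshold."""
    best = (None, impurity(ys))
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * impurity(left)
                 + len(right) * impurity(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical feature: failed logins per hour for six hosts.
failed_logins = [1, 2, 3, 40, 55, 60]
# Nominal class variable: a classification tree scores splits with entropy.
alert_class = ["benign", "benign", "benign", "attack", "attack", "attack"]
# Continuous outcome: a regression tree scores splits with variance.
response_minutes = [5.0, 6.0, 5.5, 30.0, 32.0, 31.0]

class_threshold, class_impurity = best_split(failed_logins, alert_class, entropy)
reg_threshold, reg_impurity = best_split(failed_logins, response_minutes, variance)
print(class_threshold, reg_threshold)
```

The search over thresholds is identical in both cases; only the impurity function changes with the type of the outcome variable.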
As discussed by Murphy et al., several issues affect the alternative learning schemes, including:
- Dynamic range of the features
- Number of features
- Type of the class variable
- Types of the features
- Heavily correlated features
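Two of these issues, the dynamic range of the features and heavily correlated features, can be checked before any model is trained. The following is a minimal sketch under assumed data: the network flow feature names and values are made up, and the 0.95 correlation cutoff is an arbitrary illustrative threshold rather than a recommended one.

```python
import math
from itertools import combinations

def dynamic_range(values):
    """Spread between the largest and smallest observed value."""
    return max(values) - min(values)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical network flow features (one value per observed flow).
features = {
    "bytes_sent": [1200, 560000, 43000, 980, 2500000],
    "packets_sent": [3, 1400, 107, 2, 6250],
    "duration_s": [0.4, 12.0, 3.1, 9.5, 0.2],
}

# Dynamic range per feature: large differences suggest rescaling is needed.
ranges = {name: dynamic_range(vals) for name, vals in features.items()}

# Flag feature pairs whose correlation exceeds an illustrative cutoff.
CUTOFF = 0.95  # assumption, chosen only for this example
flagged = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson(features[a], features[b])) > CUTOFF
]
print(ranges)
print(flagged)
```

Here the byte and packet counts carry nearly the same information, so one of them could be dropped or the pair combined, and the byte counts span a range several orders of magnitude wider than the durations.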
In order to be operationally beneficial, cyber security machine learning-based models need to have the ability to: (1) represent a real-world system, (2) infer system properties, and (3) learn and adapt based on expert knowledge and observations. Probabilistic graphical models have wide applications for assessing and quantifying cyber security risks. These models have desirable properties, including representation of a real-world system, inference about queries of interest related to the system, and learning from expert knowledge and past experience. The probabilistic terms in these models may be estimated or learned from historical data, generated from simulation experiments, or elicited through informed judgments of subject matter experts.
Did you know?
A common application of anomaly detection machine learning algorithms is credit card fraud detection. Machine learning is used to generate models of each customer's behavior and usage pattern. If activity appears that the model deems anomalous, a fraud alert is triggered. So when you go on vacation and get a fraud alert from using your credit card, you should know this means you deviated enough from your usual pattern that your activity appears anomalous.
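As a rough illustration of the idea, the sketch below models one customer's behavior as the mean and standard deviation of past purchase amounts and flags a new purchase that lies too many standard deviations away. This is a deliberately simple stand-in, not how any card issuer actually detects fraud; the purchase history, the single feature, and the threshold of 3 are all assumptions.

```python
import statistics

def build_profile(amounts):
    """Summarize a customer's past purchases as (mean, sample standard deviation)."""
    return statistics.mean(amounts), statistics.stdev(amounts)

def is_anomalous(amount, profile, threshold=3.0):
    """Flag a purchase whose z-score against the profile exceeds the threshold."""
    mean, stdev = profile
    return abs(amount - mean) / stdev > threshold

# Hypothetical purchase history for one customer, in dollars.
history = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]
profile = build_profile(history)

usual_purchase = is_anomalous(44.0, profile)      # close to past behavior
vacation_purchase = is_anomalous(900.0, profile)  # far outside past behavior
print(usual_purchase, vacation_purchase)
```

A realistic model would use many more features, such as location, merchant type, and time of day, which is why a trip away from home can trigger an alert even when the amounts are ordinary.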
Owing to the adaptive nature of cyber threats, probabilistic cyber risk models need to accommodate efficient updating of model structure and parameter estimates as new intelligence and information become available. Also, understanding relationships between factors influencing the occurrence and impacts of such events is a critical task. Bayesian networks (BNs), or probabilistic directed acyclic graphs, have mathematical properties for characterizing relationships between dynamic event and system factors, can be updated using probabilistic theories, and produce inference and predictions for unobserved factors given evidence. Past research indicates the potential for the application of BNs along with attack graphs for real-world cyber defenses. Hidden Markov models (HMMs) have been widely used to generate data-driven models for several cyber security solutions.
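The way a Bayesian network produces inference for an unobserved factor given evidence can be sketched with a very small example. The three-node chain below (vulnerable, compromised, alert) and every probability in it are hypothetical, chosen only for illustration, and the inference is done by brute-force enumeration, which is practical only for tiny networks.

```python
from itertools import product

# Hypothetical network: vuln -> comp -> alert. All numbers are made up.
P_VULN = 0.3                                      # P(host is vulnerable)
P_COMP_GIVEN_VULN = {True: 0.4, False: 0.05}      # P(compromised | vulnerable?)
P_ALERT_GIVEN_COMP = {True: 0.9, False: 0.1}      # P(alert fires | compromised?)

def joint(vuln, comp, alert):
    """Joint probability of one full assignment, factored along the graph."""
    p = P_VULN if vuln else 1 - P_VULN
    p_comp = P_COMP_GIVEN_VULN[vuln]
    p *= p_comp if comp else 1 - p_comp
    p_alert = P_ALERT_GIVEN_COMP[comp]
    p *= p_alert if alert else 1 - p_alert
    return p

def posterior(query, evidence):
    """P(query is True | evidence), summing over all consistent assignments."""
    names = ("vuln", "comp", "alert")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # skip assignments that contradict the evidence
        p = joint(**world)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator

prior = posterior("comp", {})
after_alert = posterior("comp", {"alert": True})
after_alert_and_vuln = posterior("comp", {"alert": True, "vuln": True})
print(round(prior, 3), round(after_alert, 3), round(after_alert_and_vuln, 3))
```

Each new piece of evidence raises the estimated probability of compromise, which is the updating behavior described above. In practice the conditional probabilities would be learned from historical data, generated from simulation, or elicited from subject matter experts.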
About the author:
David O. Manz is currently a Senior Cyber Security Scientist in the National Security Directorate at the Pacific Northwest National Laboratory in Richland, WA. He holds a B.S. in Computer and Information Science from the Robert D. Clark Honors College at the University of Oregon and a Ph.D. in Computer Science from the University of Idaho. Dr. Manz's work at PNNL includes enterprise resilience and cyber security, secure control system communication, and critical infrastructure security. Enabling his research is an application of relevant research methods for cyber security (Cyber Security Science). Prior to his work at PNNL, Dr. Manz spent five years as a researcher on Group Key Management Protocols for the Center for Secure and Dependable Systems at the University of Idaho (U of I). He also has experience teaching undergraduate and graduate computer science courses at U of I, and as an adjunct faculty at Washington State University. Dr. Manz has co-authored numerous papers and presentations on cyber security, control system security, and cryptographic key management.