When Charles Givre, lead data scientist at Deutsche Bank, teaches security teams about the benefits of applying security data science techniques, he often focuses on a common malware tactic: domain-generation algorithms.
Used by malicious programs to establish contact with a command-and-control server, domain-generation algorithms, or DGAs, create a list of domain names as potential contact points using pseudo-random algorithms. The domains change often -- usually daily -- and can look random or use random words.
For humans, finding a single computer's call to a random domain is a difficult problem. Yet data analysis can quickly call out the anomalous communications.
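One simple way to see how analysis can surface random-looking domains is to score each name by its character entropy. The sketch below is illustrative only and is not a method described by Givre. The function names, the 3.5-bit threshold and the 12-character minimum are assumptions chosen for the example. It also inspects only the leftmost label of the name and would miss DGAs that string dictionary words together.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_generated(domain: str, threshold: float = 3.5, min_length: int = 12) -> bool:
    """Flag a domain whose leftmost label is long and high in entropy.

    threshold and min_length are assumed values for illustration, not
    tuned cutoffs; a real deployment would calibrate them on its own data.
    """
    label = domain.lower().split(".")[0]
    return len(label) >= min_length and shannon_entropy(label) >= threshold


if __name__ == "__main__":
    # Hypothetical sample names, not real indicators of compromise.
    for name in ["google.com", "xkq7vz2hb9wq4tjm.com"]:
        print(name, looks_generated(name))
```

A score like this is a filter, not a verdict: it narrows millions of DNS lookups to a short list that an analyst, or a more capable model, can examine.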
"Machine learning and data science are being employed in the security realm to rapidly scan through massive data sets and find things based on previous patterns without a human having to tell the machine to do that," Givre said. "More organizations are collecting more data from their networks and systems, and it becomes a virtual impossibility to have a person collect and analyze that data and find something."
As business and IT operations produce more data, finding the signal of an attack in all the noise has become increasingly difficult. Moreover, attackers often change their tactics to avoid detection, so rule-based systems often miss the signs of a compromise.
Still, adopting an approach that incorporates security data science techniques is not straightforward.
"The security community is still wrapping its head around what data science and data analytics techniques really mean for them," said Joshua Saxe, chief data scientist at security software firm Sophos. "Everyone thinks it's a good idea, but it's the Wild West right now in terms of what people are getting out of it."
Done right, data analytics and machine learning can help companies quickly reduce the amount of data they need to parse to highlight potential threats. But data analysts warn that the improper use of security data science techniques can quickly increase the amount of noisy data and overwhelm human analysts.
"Machine learning and data analysis are tools," said Alex Vaystikh, co-founder and CTO of cybersecurity firm SecBI. "It is a capability that can easily make a lot of noise and distract. We tend to ignore the downside of this capability until it's too late and the customer has already deployed it."
Machine learning has already had its successes. Spam detection, for example, is considered to be a solved problem. Yet data analysis techniques and machine learning go beyond just spam.
Many companies use data analysis tools to reduce the amount of data they need to send to a human analyst. Security data science techniques need to "help them to analyze the tremendous amount of data that they are generating and aggregating, because they don't have the physical, human ability to process it," SecBI's Vaystikh said. It is "the sidekick, so to speak, to help the analyst deal with the data."
Apple, for example, has fully adopted the sidekick mentality. At the Spark + AI Summit in June, Dominique Brezinski, an information security engineer at Apple, showed off a platform the company uses to ingest more than 300 billion network events per day, amounting to 100 TB of data. The system, based on the Databricks Delta data management platform, processes the data into refined tables architected for specific tasks, which allows the company's security team to quickly search for potential events.
"We want to go from what signal we detected and be able to figure out what the root cause is as well as what actually happened, and how we need to contain [it] and what the scope of the incident might be," Brezinski said in a presentation at the conference.
Anomaly detection: The way forward or the wrong path?
In many ways, machine learning is about pattern recognition and detecting those events that do not fit a particular pattern: anomalies. Yet anomaly detection -- a frequently touted capability of many systems -- often does more harm than good, some data analysis experts argue. Often, systems do not tell the analyst what triggered an anomaly, why it occurred or what else happened.
"In this field, you find thousands of anomalies every single day, and that often means you are no better than when you started and perhaps even worse," SecBI's Vaystikh said. "Now, not only are you looking at less data, but you are looking at anomalies, something that was generated by an algorithm. You are not looking at something that is necessarily valuable."
Context ends up being everything in these cases. Security data science techniques and machine learning algorithms either need to have the context built in or need to communicate it to the analysts.
Cryptojacking is a good example. A browser extension that communicates with the internet, does processing on the system and sends data back may not seem anomalous, but cryptojacking extensions are easy to detect if the algorithms know what to look for, Vaystikh said. Cryptomining programs regularly upload more than they download, have a very regular cycle and stick to a very small set of tasks.
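The traits Vaystikh lists can be expressed as explicit checks that report why something was flagged, not just that it was. The sketch below is a minimal illustration, not SecBI's method. The `Flow` record, the `mining_indicators` function and the 0.1 variation cutoff are all hypothetical, and it covers only two of the three traits: upload-heavy traffic and a regular cycle.

```python
from statistics import mean, pstdev
from typing import List, NamedTuple


class Flow(NamedTuple):
    """One network exchange made by an extension (hypothetical record layout)."""
    timestamp: float  # seconds
    bytes_up: int
    bytes_down: int


def mining_indicators(flows: List[Flow], max_gap_cv: float = 0.1) -> List[str]:
    """Return human-readable reasons the traffic resembles cryptomining.

    max_gap_cv is an assumed cutoff on the coefficient of variation of the
    gaps between connections; lower values mean a more regular cycle.
    """
    reasons = []

    # Trait 1: more data sent than received.
    up = sum(f.bytes_up for f in flows)
    down = sum(f.bytes_down for f in flows)
    if up > down:
        reasons.append(f"uploads exceed downloads ({up} > {down} bytes)")

    # Trait 2: connections arrive on a very regular cycle.
    times = sorted(f.timestamp for f in flows)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) >= 3 and mean(gaps) > 0:
        cv = pstdev(gaps) / mean(gaps)
        if cv <= max_gap_cv:
            reasons.append(f"very regular cycle (gap variation {cv:.2f})")

    return reasons


if __name__ == "__main__":
    # Made-up traffic: one upload-heavy exchange every 60 seconds.
    suspect = [Flow(t * 60.0, 5000, 1000) for t in range(6)]
    print(mining_indicators(suspect))
```

Returning the reasons alongside the flag gives the analyst the context the alert depends on.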
"It violates the physics of how such extensions typically behave," he said. --R.L.
Security data science and machine learning can also reveal new information that would otherwise be hidden in the data.
This year, Google data scientists and academic researchers from New York University, Princeton University and the University of California, San Diego, created a model of what ransomware payments look like on the bitcoin blockchain. The result: The researchers were able to track almost 20,000 actual payments worth approximately $16 million.
Finding patterns in the data
However, for most companies, building platforms at that scale or funding side-project research may not be viable.
"Very large organizations can often build their own data storage and data analysis solutions, because they will often have data scientists on staff to write code and identify patterns," Sophos' Saxe said. "The vast majority of organizations do not have the resources to do that."
Instead, most companies need to focus on collecting data, minimizing it and then using other products to sift through that data.
"It is only with machine learning that we can reduce the haystacks -- I think all companies have that problem," Deutsche Bank's Givre said. "For all companies, that should be the first initiative."
Embarking on a data analysis and machine learning project is a good way for companies to dip their toes into what could become their data lake and start creating the infrastructure -- both technical and educational -- for making the most of their security data.
"I think for a CISO in a leadership role, the key thing is to understand how machine learning works," Givre said. "Before a manager starts looking at a product, it is important that they have an understanding of the strengths and weaknesses of these products, at least at a conceptual level."
Making your data useful
For many CISOs, learning the fundamentals of security data science may be enough. That expertise can then be used to help differentiate the hype from the systems and products based on data analytics that can actually help a company reduce its workload and increase visibility into its security posture.
"Not all of them are good, but some of them are," Sophos' Saxe said. "And the ones that are do much better than rule-based [systems]."
Saxe also warned that any data system -- whether created in-house or as a product or service -- needs to give the analyst context. Without context, and the ability to dig into the decision process, the conclusions may be prone to error and waste precious time and resources.
"There are machine learning models where it says that something is worth looking at but can't tell you why. That is useless," he said.
Mature systems based on data end up becoming experts for their particular data set, whether just a set of machine-learning algorithms or some form of artificial intelligence system. State-of-the-art systems would, for example, take the final step of not just sending an alert but recommending a course of action -- perhaps, even implementing it.
Companies should understand that the field of data analytics for security is still very much in flux, and most groups are going to be spending their time learning how to store and access data, explore the data for patterns, and create a workable pipeline for turning that data into information, Deutsche Bank's Givre said.
"Understand that a lot of these techniques are new, and their application to security is new, so having the tolerance to invest in an activity that may not work out is important," he said. "You need to be able to take a calculated risk and try these analytics techniques on your security data. If you are not willing to do that, you might as well just go home."