Machine learning algorithms can provide enormous benefits to enterprise security teams as long as they are properly trained. But there may not be enough teachers for the technology to be widely -- and effectively -- adopted.
In this Q&A, Justin Somaini, CSO at SAP, talks about how the enterprise software giant uses machine learning for security. Specifically, he describes the differences between supervised and unsupervised algorithms and explains the challenges presented by the former. In addition, he outlines how machine learning anomaly detection is used within SAP to find potential issues, and why another layer of analysis is needed to accurately separate anomalies and actionable threats.
Somaini also discusses the challenges of accruing relevant data and properly training machine learning algorithms and why that could hinder adoption of the technology for security applications. Here are excerpts of the Q&A with Somaini covering machine learning, anomaly detection, algorithms and more.
How do you define machine learning for security, and how does SAP use the technology?
Justin Somaini: I see these advanced algorithms falling into two buckets: supervised and unsupervised. Do we need to train the algorithm or does it figure out things on its own?
When these algorithms are applied for security problems, we hear a lot of these terms -- machine learning, deep learning, artificial intelligence -- but the majority of them are used for anomaly detection, which isn't necessarily always a security problem. Unsupervised algorithms are typically used for machine learning anomaly detection.
What's really interesting is the supervised algorithm area; it's really about training that algorithm with data and teaching it what bad anomalies look like. And you're able to use that algorithm to identify things that look similar to [bad examples]. That technology is mostly used in the log analysis space.
So when we look at this for security usage, it's not one or the other with supervised or unsupervised. Most likely, you want to identify anomalies, but then you still need to figure out if the anomaly is a security issue or not.
If you have a supervised algorithm looking for things that are similar to known security issues, you really want to make sure that there is some filtration there. The efficacy of those supervised algorithms is never 100%. They're usually anywhere from 50% to 90% if you're really, really good. There still needs to be a human [who] goes back to those gray areas and feeds the correct data back into the algorithm so it learns. That's why it's a supervised algorithm.
It's important to note that neither one of these algorithms removes the historical methods of identifying security problems, such as regular expression and correlation rules. We're just building another layer of the cake.
That's how we see it at SAP. With our log analysis that we have internally, we definitely have those layers of the cake. We have legacy models in regards to patterns of things we would look for, and then we build those patterns, the correlations, and both supervised algorithms and unsupervised algorithms to find things.
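Somaini's layered approach can be sketched in a few lines of Python. This is an illustrative toy, not SAP's implementation: the log lines, regex rules, and length-based anomaly score are all invented to show the idea of legacy pattern matching sitting alongside an unsupervised anomaly layer.

```python
import re
import statistics

# Hypothetical log lines; contents and field names are illustrative only.
LOGS = [
    "login user=alice status=ok",
    "login user=bob status=ok",
    "login user=root status=failed",
    "login user=alice status=ok",
    "login user=eve status=failed " + "A" * 200,  # unusually long line
]

# Layer 1: legacy patterns -- regular expressions and correlation rules.
RULES = [re.compile(r"user=root"), re.compile(r"status=failed")]

def rule_hits(line):
    """Return the patterns of every legacy rule that matches the line."""
    return [r.pattern for r in RULES if r.search(line)]

# Layer 2: unsupervised anomaly detection -- here, a crude stand-in that
# flags lines whose length deviates sharply from the mean.
def anomalies(lines, threshold=1.5):
    lengths = [len(l) for l in lines]
    mean, stdev = statistics.mean(lengths), statistics.pstdev(lengths)
    return [l for l in lines if stdev and abs(len(l) - mean) / stdev > threshold]

for line in LOGS:
    hits = rule_hits(line)
    if hits:
        print("rule match:", hits, "->", line[:40])

for line in anomalies(LOGS):
    print("anomaly:", line[:40])
```

In a real deployment the anomaly layer would be a trained model rather than a length z-score, and a supervised classifier (or a human analyst) would sit on top to decide which flagged items are actual security issues.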
Do you think we will see machine learning anomaly detection used widely in the near future?
Somaini: Absolutely, I do. But I think it's different than what we typically see for any technology hype cycle.
We've been living in the artificial intelligence and machine learning hype cycle over the past couple [of] years. I think there are amazing opportunities with those technologies and the analytical models.
But the challenge we face with supervised algorithms is that you need to train them, and most of our networks are different from one another. There are a lot of similarities across those networks, but it's hard to purchase an algorithm that is trained on what my network looks like.
So I need to create training data on what my environment looks like, and that can be a very painful process because you need a lot of data scientists and you need people who are skilled in managing these algorithms. It's very difficult for a lot of organizations.
At SAP, we're a little bit blessed and lucky to have not only great engineers, but great data scientists, so we were able to build those mechanisms for ourselves. But I don't know when we'll get to a point where those mechanisms and algorithms will be easily consumed by companies that wouldn't have to train those algorithms.
In your experience, are there certain issues or threats that machine learning algorithms are more likely to catch?
Somaini: There is a spectrum. There are some very easy things you can train the algorithms on, and there are other things that are incredibly complicated.
If we're able to drive the problem into something that's consumable -- for instance, cross-site scripting injections for websites -- you can fairly easily create and train an algorithm on what data is submitted in the normal course of use and what is anomalous data. You can train it on the thousands of different types of input attacks, like SQL injections.
If you're able to take the problem and narrow it down, then you're able to build that repertoire of machine learning algorithms to assist you moving forward. That's very different from creating an algorithm to identify everything on the network and tell you who is attacking what; that's a very general, wide-reaching question that is nearly impossible to build an algorithm for unless you've been building up modules around it for a very, very long time.
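The narrowed-down problem Somaini describes -- classifying submitted input as normal or attack-like -- can be illustrated with a toy supervised sketch. The labeled payloads, tokenizer, and threshold below are all invented for illustration; a production system would use a proper trained model over far more data.

```python
import re

# Tiny labeled "training set" of example inputs; all values are made up.
BENIGN = ["jane.doe@example.com", "1600 Amphitheatre Pkwy", "blue widgets qty 3"]
MALICIOUS = [
    "<script>alert(1)</script>",   # cross-site scripting
    "' OR '1'='1' --",             # SQL injection
    "1; DROP TABLE users",         # SQL injection
]

def tokens(s):
    # Keep punctuation runs as tokens so attack syntax stands out.
    return set(re.findall(r"[^\w\s]+|\w+", s.lower()))

# "Training": collect tokens that appear only in the malicious examples.
benign_tokens = set().union(*map(tokens, BENIGN))
attack_tokens = set().union(*map(tokens, MALICIOUS)) - benign_tokens

def looks_malicious(s, min_hits=2):
    """Score a new input by how many attack-only tokens it shares."""
    return len(tokens(s) & attack_tokens) >= min_hits

print(looks_malicious("<script>document.cookie</script>"))  # True
print(looks_malicious("green widgets qty 5"))               # False
```

The point mirrors the interview: once the question is as narrow as "does this form field look like an injection payload?", even simple learned features separate normal from anomalous input reasonably well, whereas "who is attacking what on the whole network?" admits no such shortcut.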
Can machine learning help with social engineering attacks, or do those attacks fall on the other end of the spectrum?
Somaini: It can help, but you need to apply those algorithms in a completely different context.
For instance, at SAP, we're deploying algorithms within business applications not only to assist with the applications' processes, but to help with security, as well. Applying security algorithms to those same business processes can help you around social engineering because the phone call to the individual isn't where you intervene; instead, you can identify what that customer service representative, for example, does in their ticketing system when they're changing the email address of a customer.
When you put those algorithms into the business processes, they can identify fraud, like anomalous purchasing orders or CFO phishing attacks, and things like that. You can't stop the phone calls or phishing emails, but you can prevent the execution of the social engineering attack.
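One of the in-process checks Somaini mentions -- catching anomalous purchase orders -- can be sketched minimally. The order data and z-score threshold here are invented; a real system would model far richer features (vendor history, approver, timing) rather than amount alone.

```python
import statistics

# Hypothetical (order_id, amount) pairs; one order is far outside the norm.
ORDERS = [("PO-1", 1200.0), ("PO-2", 980.0), ("PO-3", 1100.0),
          ("PO-4", 1050.0), ("PO-5", 9800.0)]

def flag_anomalous(orders, threshold=1.5):
    """Flag orders whose amount is an outlier versus the historical mean."""
    amounts = [a for _, a in orders]
    mean, stdev = statistics.mean(amounts), statistics.pstdev(amounts)
    return [oid for oid, a in orders if stdev and abs(a - mean) / stdev > threshold]

print(flag_anomalous(ORDERS))  # ['PO-5']
```

As the interview notes, a check like this doesn't stop the phishing email or phone call itself; it blocks the step where the social engineering attack would actually execute inside the business process.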