LAS VEGAS -- Infosec professionals are increasingly using data science to sift through vast stores of threat intelligence data and stop advanced attackers and evolving malware, but new security machine learning techniques are needed.
That was the message from Joshua Saxe, director of Invincea Labs' data science research group, during a Black Hat 2015 session. Saxe explained that machine learning, data visualization, and scalable storage technologies are converging in data science to bolster information security.
Machine learning is an important part of organizing and analyzing threat data. While machine learning is applied to analogous problems in other areas, such as speech recognition, Saxe said, security raises extra concerns about keeping machine learning viable over time.
"Problems that once seemed intractable have now been solved to the point that image recognition has had a breakthrough in accuracy," Saxe said. "The big difference with machine learning in this area is that there's nothing trying to trick the system. If you train image recognition to detect cats and dogs, that will continue to work because cats and dogs will look about the same."
This is not the case when machine learning is used to detect malware or stop attacks, Saxe said, because the adversary is always evolving. According to Saxe, a system that starts out able to detect more than 80% of threats could drop to 60% accuracy or lower within two years because of changing malware and tactics.
Using data science to harness big data can be daunting, but continually monitoring model accuracy can lead to impressive results, Saxe said, because so much data is available.
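As a rough illustration of what continually monitoring model accuracy might look like in practice, the Python sketch below groups a detector's verdicts by time period, computes accuracy for each period, and flags periods where accuracy has fallen too far below a baseline. It is not taken from Saxe's talk: the function names, the 10-point drop threshold, and the toy records (chosen only to echo the 80% and 60% figures above) are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_period(records):
    """Return {period: accuracy} for (period, predicted, actual) records.

    Hypothetical helper: 'period' can be any sortable label, such as a year.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, predicted, actual in records:
        totals[period] += 1
        if predicted == actual:
            hits[period] += 1
    return {p: hits[p] / totals[p] for p in sorted(totals)}


def periods_needing_retraining(accuracy, baseline, max_drop=0.10):
    """List periods whose accuracy fell more than max_drop below baseline.

    The 0.10 default is an arbitrary illustrative threshold.
    """
    return [p for p, acc in accuracy.items() if baseline - acc > max_drop]


# Toy, invented verdicts: (year, model verdict, ground truth).
records = [
    (2015, "malware", "malware"), (2015, "malware", "malware"),
    (2015, "benign", "benign"), (2015, "benign", "benign"),
    (2015, "benign", "malware"),
    (2017, "malware", "malware"), (2017, "benign", "benign"),
    (2017, "benign", "benign"), (2017, "benign", "malware"),
    (2017, "benign", "malware"),
]

accuracy = accuracy_by_period(records)
stale = periods_needing_retraining(accuracy, baseline=accuracy[2015])
print(accuracy)
print(stale)
```

In a real deployment the ground-truth labels would arrive late, for example after analysts confirm a sample, so a check like this would typically run on a delay and feed a decision about when to retrain.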
"We have access to the signals needed to detect attackers, but they are currently the 'dark matter' of our field," Saxe said. "We have the data, but we don't know how to analyze the data properly. We have packet capture, SIEM logs and more, but there is too much data for anyone to go through. We need algorithms to sift through data, because the 400 million-plus malware samples gathered across networks could hold the next Stuxnet and we don't know."