Years into the "big data" hype cycle, most organizations have not tapped into its promise for information security. Why not? Marcus Ranum sits down with Gartner Research Director Anton Chuvakin to find out his latest thinking on big data and security.
Dr. Chuvakin is known for his straight talk on security information and event management (SIEM) as well as big data security and analytics. He is part of the technical professionals unit at Gartner, which focuses on in-depth technical research aimed at helping enterprise architects with their technology projects.
Marcus Ranum: Anton, today I thought we could talk about big data and one of the first questions I should ask: Is it still just marketing hype? What do you think big data is?
Anton Chuvakin: As I mentioned in a recent blog post, if you fertilize the field of big data with enough marketing hype, something will grow. Well, keep waiting for it. Use of big data analytics approaches for security seems like the most ‘BS-rich' area of the entire InfoSec realm. However, there are definitely end-user organizations doing it for real.
A definition that Gartner came up with a few years ago states:
‘Big data' is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
I find it useful and flexible enough without being overly broad -- overly broad definitions lead to Microsoft Excel being labeled a big data tool. It works for security-relevant big data as well. Note that it does not just say ‘lots of data' but also considers variety of data. Lots of nicely structured data may not, strictly speaking, be big data.
You and I both come from backgrounds involving a lot of trolling through data -- specifically, system logs -- so I tend to see big data as a sort of ‘buy a great big backhoe because you can do anything with a big enough backhoe' approach to data exploration, rather than data analysis.
Once you've figured out what fields you want to analyze and what you want to do with them, precomputing the data as it comes into your input stream makes more sense. It seems to me, big data is predicated on you not knowing what you're going to do with your data, so you should just throw lots of storage and CPUs at it. Does that sound right?
Chuvakin: Big data is predicated on you not knowing what to do with it in advance, but that is actually a good thing. The magic here comes from so-called late schema binding. If you have nasty, messy data and you do know what to do with it, you can come up with a schema based on that knowledge, normalize the data to that schema and then toss it into an RDBMS. On the other hand, nasty, messy data that you want to explore somehow may not be easy to normalize, at least not all at once. Thus, big data does often mean exploration and flexibility.
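The contrast between up-front normalization and late schema binding can be sketched in a few lines. This is only an illustration of the idea, not any vendor's implementation; the log lines, field names and patterns below are made-up assumptions.

```python
# "Schema on write": parse each log line into known fields at ingest time,
# as you would before loading an RDBMS. "Late schema binding" (schema on
# read): keep the raw lines and bind whatever parse you need at query time.
# All log lines, field names and patterns here are illustrative assumptions.
import re

RAW_LOGS = [
    "2013-05-01 10:00:01 sshd[812]: Failed password for root from 10.0.0.5",
    "2013-05-01 10:00:07 sshd[812]: Accepted password for alice from 10.0.0.9",
]

# Schema on write: commit to fields up front; anything unanticipated is lost.
WRITE_SCHEMA = re.compile(r"^(?P<date>\S+) (?P<time>\S+) (?P<proc>\S+): (?P<msg>.*)$")
parsed = [WRITE_SCHEMA.match(line).groupdict() for line in RAW_LOGS]

# Schema on read: keep the raw line; invent a new "schema" per question,
# long after the data was collected.
def query_failed_logins(raw_lines):
    pat = re.compile(r"Failed password for (?P<user>\S+) from (?P<src>\S+)")
    return [m.groupdict() for line in raw_lines if (m := pat.search(line))]

print(parsed[0]["proc"])
print(query_failed_logins(RAW_LOGS))
```

The trade-off: the first form makes known queries fast and cheap; the second keeps every question askable, at the cost of re-parsing at query time.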
Slow adoption of big data security
Several factors will make mass adoption of big data technology for security unlikely in the near term, according to Gartner research:
- Dearth of COTS [commercial off-the-shelf] big data tools to collect, store and analyze massive amounts of diverse security data and come to conclusions automatically;
- A pervasive culture of buying COTS and seeking out-of-the-box features and content, which conflicts with the free-form data exploration approach characteristic of most successful big data projects in other industries;
- Rapid evolution of big data technologies and their inherent complexities related to distributed computing and storage, new data access languages and APIs, unstructured data, and so forth;
- Data exploration, hypothesis testing and modeling approaches needed for making use of big data that are alien to many security teams that prefer boxed solutions and canned content.
I guess the other issue that comes up with big data is privacy concerns. We've all seen ample ways in which you can integrate large amounts of data and unexpectedly come up with indicators that might be unintentionally revealing.
In November, I got an email from a friend of mine who had gotten a tickle from Amazon suggesting that my birthday was coming up. It included links to my Wish List, which has some surrealist items on it that I found amusing. It took us a while to figure out how Amazon even knew we knew each other, until I remembered that I had once had a World of Warcraft time card direct-shipped to his address. I suppose someone had done a query for shipping addresses matching other customers, then built a friend-network model and started sending email. There's potential for mayhem hidden in that. Is this something we should do something about, or is it too late?
Chuvakin: Well, privacy is one topic I won't touch with a 10-mile pole. Frankly, I have no idea what it is in specific, measurable terms, and thus I am unwilling to discuss it. However, I'd rather like it if somebody sent me a gift from my Amazon Wish List.
Is big data only going to appeal to large businesses that are retrofitting new analysis atop old data dumps? It seems to me that it's something an organization can avoid if IT departments actually think about what data they're collecting and what it means, and then preprocess it accordingly.
Chuvakin: At this point, building your own big data platform is not just for the large, mature Type A organizations. At Gartner, we say that big data analytics for security is for the ‘Type A of Type A.'
Our research shows that big data use for security will continue to be populated by the most advanced, mature, Type A organizations for the near future. Security may well be becoming a big data problem, but riding that big data wave will stay difficult and expensive for most organizations, at least for the next one to two years.
To add to this, several factors will make any semblance of mass adoption of big data technology for security unlikely in the near term. (See "Slow adoption of big data security.")
1) Load your data into Hadoop; 2) !?!?; and 3) Profit! Ultimately, it seems like big data isn't going to solve the age-old problem: If you don't know what you're looking for, you won't know how to look for it.
We've both been bumping up against this issue for a very long time in our system log analysis efforts. Do you see anything coming down the pike that's promising?
Chuvakin: Well, if you phrase it like that, it starts to sound pessimistic. However, if I insert ‘data exploration' as step 2, it changes now, doesn't it? Big data approaches often do go by that flow: collect->explore->profit. And big data tools make this possible, even if it's not easy.
Exploring unstructured big data piles, however, is much harder than running SIEM reports and may involve text analytics, hard-core statistical methods and other esoteric disciplines that are far removed from traditional security skill sets. It is not all about the keyword search.
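As a toy illustration of going beyond keyword search (this is not Gartner's or any product's method), even a plain z-score over per-host event counts can surface an anomaly that no keyword would find. The hostnames, counts and threshold below are invented for the example; note the low threshold, since z-scores are bounded in such a tiny sample.

```python
# Toy statistical outlier detection: flag hosts whose daily event count
# sits far from the mean, measured in standard deviations (z-score).
# Hostnames, counts and the threshold are made-up assumptions.
from statistics import mean, stdev

daily_event_counts = {
    "web01": 1020, "web02": 980, "web03": 1005,
    "db01": 990, "bastion": 4900,   # bastion is the planted outlier
}

def zscore_outliers(counts, threshold=1.5):
    """Return hosts more than `threshold` standard deviations from the mean."""
    mu = mean(counts.values())
    sigma = stdev(counts.values())
    return [host for host, c in counts.items() if abs(c - mu) / sigma > threshold]

print(zscore_outliers(daily_event_counts))
```

Real security analytics would use far more robust methods than this, but the shape of the work -- model the data, then look for deviations -- is the point.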
Apart from exploration, more goal-driven approaches have also been found to work for big data: start with clear goals and hypotheses, and then test them against the data. Some organizations report success using this model on security data as well as other big data.
Anton, as always it's great to talk to you. I'm going to go back to grepping some syslogs now.
Chuvakin: One of these days, do try to grep a 100 TB data set. After a few weeks of waiting, you'll want Hadoop on your side. By the way, somebody did create a clustered grep implementation that works on data in the Hadoop Distributed File System (HDFS).
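A clustered grep has a simple MapReduce shape: each mapper greps its own split of the data, and the results are merged. The single-machine sketch below only illustrates that shape (using threads in place of cluster nodes); the log lines and pattern are made up.

```python
# Single-machine sketch of a distributed grep: "mappers" each search one
# chunk (one split) of the data in parallel, and the matches are merged.
# Chunks and the pattern are made-up assumptions for illustration.
import re
from concurrent.futures import ThreadPoolExecutor

def grep_chunk(args):
    """Map step: return the matching lines from one chunk."""
    pattern, lines = args
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

def distributed_grep(pattern, chunks):
    """Fan the chunks out to workers, then merge the results (the reduce step)."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(grep_chunk, [(pattern, chunk) for chunk in chunks])
    return [line for matched in results for line in matched]

chunks = [
    ["Jan 1 sshd: Accepted publickey", "Jan 1 kernel: oom-killer invoked"],
    ["Jan 2 sshd: Failed password", "Jan 2 sshd: Failed password"],
]
print(distributed_grep(r"Failed password", chunks))
```

On an actual cluster, Hadoop's bundled examples jar ships a similar grep job that runs this map/reduce pattern over files in HDFS.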
Now, whenever you grep, you can say ‘BIG DATA!!!'
About the author:
Marcus J. Ranum, chief security officer of Tenable Security Inc., is a world-renowned expert on security system design and implementation. He is the inventor of the first commercial bastion host firewall.