The Boston Marathon bombings offer a stark reminder of the failings of big data and security, namely intelligence agencies' inability to connect the dots -- before and after the April 15 attacks.
The lack of information sharing across organizations and business units, or awareness that a particular data set even exists, is a common problem. Big data analytics can help solve this dilemma, according to its proponents, and provide data intelligence that detects suspicious patterns and potential threats by expanding the definition of security data to all parts of the business.
Intelligence-driven security fueled by big data analytics will disrupt several infosec product segments in the next two years, according to executives at RSA, the security division of EMC. "With the pervasiveness of big data touching everything we do," said Arthur Coviello, Jr., EMC executive vice president and chairman of the RSA security division, during his RSA keynote in February, "our attack surface is about to be altered and expand, and our risks magnified in ways that we couldn't have imagined."
As organizations and employees increasingly operate in mobile, Web and social media environments, taking advantage of information identified by analytics or patterns across a wide variety of data sets including unstructured text and binary data -- audio, images and video -- can offer valuable insights into business risks far beyond IT.
But even with the use of advanced statistical modeling and predictive analytics, unknown security threats still go undetected. Will big data and high performance analytics really make security better? Maybe, but today meaningful use of big data technologies on large volumes of data for security is rare and extremely challenging, according to Anton Chuvakin, research director of security and risk management, Gartner, who quipped: "Organizations that use traditional predictive analytics for security? You mean 'both of them?'"
Gartner defines "big data" based on the 3Vs -- volume, variety and velocity. "Organizations that have actually invested time (often years) and resources (often millions of dollars) into building their very own platform for security big data analytics have found value," said Chuvakin. "Typically, such value manifests through better fraud detection, wider and deeper security incident detection and more effective incident investigation."
One such company is Visa, the credit card processing giant. The company made a splash earlier this year when it disclosed to The Wall Street Journal that it was using a new analytics engine and 16 different models, which could be updated in less than an hour, to detect credit card fraud. Steve Rosenbush of The Journal blogged about the improvements behind the high performance analytics engine, which, according to Visa, looks at up to 500 aspects of a transaction, compared to earlier technology, which could only handle 40. The powerful analytics capabilities are enabled in part by Visa's adoption of non-relational database technology in 2010 and the open source Apache Hadoop software framework, which is designed for low-cost storage and computation of distributed data across clusters of commodity servers.
Rush for big analytics
For all the hoopla about Hadoop, use of the framework is rare in large and mid-size companies. Hadoop uses the MapReduce programming model (derived from Google's technology) to "map" and "reduce" data, along with a distributed file system (HDFS) with built-in automation for failures and redundancy. The tools to access Hadoop's storage and computational capabilities remain elusive, outside of complex interfaces and tools for data scientists or programmatic access for skilled coders with knowledge of MapReduce, Hive for SQL-like queries, or Pig for high-level dataflow.
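To make the "map" and "reduce" steps concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the log lines, the field layout and the function names are all hypothetical, chosen only to show how a job that counts failed logins per source address breaks into the two phases.

```python
from collections import defaultdict

# Hypothetical log lines, invented for illustration only.
LOG_LINES = [
    "10.0.0.5 LOGIN_FAILED alice",
    "10.0.0.5 LOGIN_FAILED bob",
    "10.0.0.9 LOGIN_OK carol",
    "10.0.0.5 LOGIN_FAILED alice",
    "10.0.0.7 LOGIN_FAILED dave",
]


def map_phase(line):
    """Emit (source_ip, 1) for each failed login; emit nothing otherwise."""
    source_ip, event, _user = line.split()
    if event == "LOGIN_FAILED":
        yield source_ip, 1


def shuffle(pairs):
    """Group emitted values by key, the step a framework runs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


failed_logins = run_job(LOG_LINES)
print(failed_logins)
```

In a real cluster the map and reduce functions would run in parallel on many servers, each working on the slice of data stored locally; the sketch only shows the shape of the computation.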
That may soon change, however, as technology vendors from all sides -- big data infrastructure companies and enterprise software providers -- attempt to provide big data analytics tools for enterprise users. Cloudera, which offers a distribution for Hadoop (CDH), and SAS announced a strategic partnership in late April to integrate SAS High Performance Analytics and SAS Visual Analytics, among other tools. InfoBright, EMC's Greenplum and MapR are moving into the enterprise space with analytics and visualization tools that enable corporate analysts to work with large data sets and develop analytic processes, in some cases using sandboxing and virtualization.
"This kind of analysis has been a need for a very long time, and technology is only just now being made available that can actually perform this kind of analysis at scale," said Mark Seward, senior director of security and compliance at Splunk. The company's security information and event management (SIEM) technology is used by roughly 2,000 companies to analyze machine data, which includes all systems data, the "Internet of things" and connected devices.
Any ASCII text can be indexed by Splunk, which can use up to 150 commands on the returned data set to perform statistical analysis and render visualizations. According to Seward, Splunk can scale to petabytes of data. It does not natively handle binary data, but Hadoop and other converters are available. To use Splunk, security-minded IT professionals essentially need to understand Unix shell scripting commands and SQL, and have access to documentation about what kind of fields they have in the data.
"With the advent of Hadoop and indexing technologies like Splunk, now the technologies are available to take a look in much more detail around machine-generated data and user-generated data to understand what is happening inside of an organization, or what is happening inside of a manufacturing line, for example," said Seward. When you think about risk in the entire organization, you are not only thinking about security in the traditional sense, but you are also thinking about what people do day-to-day in all of the data -- or as much of it as you can get -- that would be a risk to your particular business. "I may need to look at heating and ventilation data to understand if someone went into the manufacturing plant and turned up the temperature a couple of degrees, which could jeopardize the whole production of a product," he said.
Big data analytics services
Companies that don't want to do it internally can look to outside big data analytics services. Opera Solutions, which specializes in predictive analytics, uses machine learning to recognize patterns in open-source data, such as page views and Twitter feeds, to abstract predictive intelligence from big data flows. The company's 80-million-word threat ontology extracts multilingual phrases (in 15 languages). It prioritizes the phrases for levels of threat, based on roughly 450 million relationships between those words. "It's been built in a unique way," according to a company spokeswoman, "not by machines trying to understand what the relationships are -- but basically, by sourcing throughout the Web all the human-built relationships that people have created."
The company's big data analytics services are primarily used in the government and commercial sectors to warn clients about external threats in advance, including violent protests or potential terrorists. Recently, for example, Opera Solutions used its predictive intelligence to warn a client in advance about a planned demonstration outside its offices, which enabled executives to reschedule an important meeting to avoid the protesters.
"A lot of organizations have security experts or product groups, and they could do the job if there were just one or two documents that they had to review," said Herb Kelsey, vice president of analytics at Opera Solutions. "But we are looking at hundreds of millions of documents and pieces of information a day, far outstripping the capacity that a human being has, whether they are a corporate analyst or a CSO." The machine learning mimics their behavior and then provides a refined set of information that is much smaller and more manageable. "We do involve human beings at multiple steps in the process to understand how they might go about solving a problem…and then teach the machine to mimic that behavior," he said.
Opera Solutions employs about 230 data scientists who are machine learning specialists, as well as domain experts and IT staff. Like other big data analytics companies, Opera Solutions is planning to offer a big data analytics tool based on its technology for the enterprise. Kelsey is also developing Secure Community of Interest (SCoI) for the company's wholly owned subsidiary, Opera Solutions Government Services. SCoI is designed to secure documents for distribution by encrypting and storing them in the public cloud and limiting access through authentication to protect sensitive data against unauthorized internal use and external threats.
Data analytics service providers or payment processing companies such as Visa may benefit from examining millions of pieces of unstructured data, or billions of transactions; however, big data analytics security practitioners advise organizations to start with small projects and an agile, flexible process.
"I have spoken to some organizations that have genuinely outgrown their SQL-based SIEM and moved to Hadoop-based systems -- that they built themselves," said Chuvakin. "Their experience with data analysis using their SIEM has definitely helped them jumpstart their big data analytics project. It is a very good idea to start your data analysis project on small data -- and structured data at that."
The range of algorithms that companies use for security data analysis is fairly wide, according to Chuvakin. It ranges from "simple summarization to machine learning, clustering, profiling, and all kinds of outlier and anomaly detection."
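As one illustration of the outlier detection end of that range, the following sketch flags unusual values with a modified z-score based on the median absolute deviation. The technique, the data and the names are assumptions chosen for illustration, not something Chuvakin or any vendor in this article describes; the 0.6745 constant and the 3.5 cutoff are a common convention for this score.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    return [
        i for i, d in enumerate(deviations)
        if 0.6745 * d / mad > threshold
    ]


# Hypothetical daily counts of outbound connections from one host.
daily_connections = [102, 98, 110, 95, 105, 99, 101, 940, 97, 103]

flagged = mad_outliers(daily_connections)
print(flagged)
```

The median is used instead of the mean because a single extreme day would otherwise inflate the baseline it is being compared against.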
Data science problems
Many organizations bog themselves down with investments and IT instead of taking time to analyze their big data. The first step is to see if you can gather the data in which the problem lies, advised Kelsey, "and sometimes that's an awful lot of data." This data may include, for example, all of your network logs on a variety of systems and data on employee behavior: When are they coming into the building? What databases are they accessing? Are they introducing applications into the environment?
Organizations need to focus on patterns of behavior by collecting data from machines, applications and people's digital footprints as they go about their daily business, agreed Seward. "You've got to be able to look at wide-ranging data -- structured and unstructured -- from a six-month period, at least, to detect the types of behavioral changes that I'm talking about," he said. That means terabytes or even petabytes of data to be able to observe patterns or anomalies.
The second hurdle, especially if it is unstructured data, is having people, whether they are internal or external, who are actually skilled at doing the statistical analysis and the analytics that enables you to get to "real" answers, essentially, those signals that are indicative of a particular event. Hypothetically, this means somebody who could look at all that data and determine: if someone is accessing this type of information at this time of day with a corresponding particular website, we are going to track that as some sort of malicious intent. "You need those people at your disposal," said Kelsey.
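The hypothetical rule in that paragraph (a type of information, a time of day and a particular website) can be written down as a simple filter. The sketch below is only an illustration: the event fields, the categories, the off-hours window and the site name are all invented, and a real rule would come out of the statistical analysis the paragraph describes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AccessEvent:
    """One hypothetical record joining data access with Web activity."""
    user: str
    data_category: str
    timestamp: datetime
    site_visited: str


# All values below are assumptions made up for this example.
SENSITIVE_CATEGORIES = {"payroll", "source_code"}
WATCHED_SITES = {"files.example.net"}
OFF_HOURS = range(0, 5)  # midnight through 4:59 a.m.


def is_suspicious(event):
    """True only when all three conditions of the rule hold together."""
    return (
        event.data_category in SENSITIVE_CATEGORIES
        and event.timestamp.hour in OFF_HOURS
        and event.site_visited in WATCHED_SITES
    )


events = [
    AccessEvent("alice", "payroll", datetime(2013, 5, 1, 2, 30), "files.example.net"),
    AccessEvent("bob", "payroll", datetime(2013, 5, 1, 14, 0), "files.example.net"),
    AccessEvent("carol", "marketing", datetime(2013, 5, 1, 3, 0), "files.example.net"),
]

flagged_users = [e.user for e in events if is_suspicious(e)]
print(flagged_users)
```

Writing the rule is the easy part; deciding which categories, hours and sites belong in it is the work that calls for the skilled analysts Kelsey mentions.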
Third, you need some way of presenting that information to the powers that be, whether it is a physical report or in some other form, according to Kelsey: "We find that people want that information in real time, but now you are developing an application."
Many companies have issues in two areas: "What we are finding is that most organizations are lacking in their abilities to gather the data, especially unstructured data, largely because it does span multiple languages," said Kelsey. The second issue involves finding people who can actually do the analytics. There's a fair amount of competition and people are struggling to find advanced degree analytics professionals within the United States or even worldwide. "They are pushing up against the same people -- Amazon, Google, ourselves and the credit reporting agencies -- it's a pretty small group of people," said Kelsey, who indicated that Opera Solutions has had to double the number of data scientists on its staff in the last 18 months.
To put big data analytics to use, organizations have to use collective observations, experience and logical analysis to identify patterns in the data. "Predictive analytics is really you applying statistical analysis and modeling to your observation," said Seward, "and then seeing if something that you see in the present, or the past is going to trend in the future, based on those observations and based on a statistical model." He added, "The knowledge of that observation and the kind of statistical model that you want to run is yours, and no one else's -- you have to decide those things."
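A very small example of applying a statistical model to past observations is a straight-line trend fitted by ordinary least squares and extended one step ahead. This is a sketch under stated assumptions, not Splunk functionality or Seward's method: the weekly alert counts are invented, and a straight line is only one of many models an analyst might choose.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical count of security alerts per week, oldest first.
weekly_alerts = [12, 15, 14, 18, 21, 20, 24, 27]

slope, intercept = fit_line(weekly_alerts)
forecast = slope * len(weekly_alerts) + intercept  # projected next week
print(round(slope, 2), round(forecast, 1))
```

As Seward's comment suggests, the choice of what to observe and which model to fit stays with the analyst; the arithmetic is the least important part.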
Fans of the movie "Minority Report" may applaud the predictive policing model that is emerging in some major metropolitan areas such as Los Angeles. It combines advanced statistical analysis of previous crimes, visualization, machine learning and artificial intelligence to predict when and where offenses will occur in an effort to prevent them and save resources. Much of the research on "PredPol" is being done at UCLA.
However, most organizations and industries are on their own when it comes to statistical modeling and big data analytics. "There is nothing 'canned' you can buy that will just magically analyze your security big data," said Chuvakin. "All of the analytics deployments I've seen use both a home-grown platform and home-grown analytics. There are vendors that will sell you a customized Hadoop implementation, but there are no vendors that will build your analytics for you."
The propensity for false positives and misinterpretations presents its own risks. Kate Crawford, a principal researcher at Microsoft Research, cautioned big data analytics practitioners against hidden biases in a recent Harvard Business Review blog and offered several examples. She calls the problem "data fundamentalism," or "the notion that correlation always indicates causation, and that massive data sets and predictive analytics always reflect objective truth."
In addition to the technology implementation, another issue that many companies face is who owns the analytics, and that may come down to a combination of highly skilled talent. "Some successful efforts had combined ownership of the system by security and fraud teams," said Chuvakin. "The security team would have to invest in people with unusual, and often expensive, skillsets such as statistics. Of course, if they can pull a statistician from another project at the company, it would be great as well. However, this expert in statistics has to be paired with somebody who knows the subject matter -- security."
Business schools are introducing more courses on big data analytics, spearheaded in part by programs from companies such as Cloudera.
Even so, there really isn't a way to meet the demand for highly skilled talent in the coming years, according to Kelsey. "The momentum in the past two years has been around infrastructure and other capabilities that allow you to organize the data, and a lot of companies are starting to adopt that -- the problem is that infrastructure does not really have analytic capability," he said.
"If you really want it to scale, which is what everyone has to do, you have to have tools," he continued, "so you have to figure out, how can I put this tool into an environment to be used so that a data scientist doesn't have to be with [enterprise users] to solve that problem every single time. It is really that shift that will unlock a lot of capabilities," he said. "Now, will companies still try to do it themselves? Yes, but that talent pool can't grow fast enough, and that is true in the government as well."
About the author:
Kathleen Richards is the features editor at Information Security magazine.