Among all the things that would benefit security and improve defenses, insight must rank among the first. We need it to identify when, where and how attacks occur -- and succeed. With insight, we could see how access privileges are being misused or abused, or when what looks like legitimate access is actually fraud. We could also understand how and where investments can be better applied to strengthen security and mitigate risks.
However, this kind of insight is highly elusive. The reality is that most organizations struggle to make use of the security information they already gather. In a recent Enterprise Management Associates (EMA) study of 200 organizations of 1,000 personnel or more worldwide, 58 percent of those knowledgeable about security log and event management say they collect more than 50 gigabytes of this data each day. Fifteen percent say they collect a terabyte or more. At an average of 300 bytes per event, a terabyte amounts to more than three billion log events daily.
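The event-count estimate can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 terabyte = 10^12 bytes) and the 300-byte average event size used above; the function and variable names are illustrative only.

```python
# Assumptions: decimal units and a 300-byte average event size.
AVG_EVENT_BYTES = 300
GB = 10**9
TB = 10**12


def events_per_day(bytes_per_day: int, avg_event_bytes: int = AVG_EVENT_BYTES) -> int:
    """Estimate the daily event count from the daily log volume."""
    return bytes_per_day // avg_event_bytes


fifty_gb_events = events_per_day(50 * GB)  # roughly 167 million events
one_tb_events = events_per_day(1 * TB)     # roughly 3.3 billion events

print(f"50 GB/day -> {fifty_gb_events:,} events")
print(f"1 TB/day  -> {one_tb_events:,} events")
```

Under these assumptions, even the 50-gigabyte threshold corresponds to well over a hundred million events a day.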
The sheer volume of security data is not the only problem; the pace of security-relevant information has multiplied as well. In 2009, Carnegie Mellon researchers estimated that over 4,000 new malware variants were discovered on average each day. According to antivirus supplier Trend Micro, that number is now more than one every second. Security data is hammering the vendors of security technologies and services as hard as, or harder than, it hits their customers.
Despite these challenges, organizations clearly see an opportunity to improve security through data-driven insight. Even though 40 percent of all respondents to the EMA survey said they are overwhelmed with the monitoring data they already collect, 73 percent indicated they would collect even more security data, if they could make use of it.
That “if” has taunted security professionals for years. In its 2010 Data Breach Investigations Report, Verizon found that evidence of compromise in log files was either not noticed or not acted upon in 86 percent of breaches in the year’s caseload. In that same report, 61 percent of breaches were discovered by a third party. By 2012, third-party incident discoveries had increased to 92 percent. How are security teams -- and indeed, the entire security industry -- seeking to turn this performance around and harness security data that continues to grow and accelerate?
The scale and speed of security data, the desire to collect a wider variety of security-relevant information, and the vital need to make better use of it all, suggest an answer: The often-invoked “3 V’s” of “big data”: volume, velocity, and variety. Today, new classes of technologies are emerging alongside established approaches to improve performance in delivering richer insight with large data sets. Organizations are adopting these technologies and techniques for better insight into the mountains of security data. This big data security analytics trend is more than a passing fad; it promises to transform the information security landscape.
Big data security analytics: Beyond SIEM
Security data may exhibit any or all of the qualities of big data, but that doesn’t necessarily mean the techniques for managing it have to be exotic. Well-established security technologies, such as SIEM, increasingly focus on performance and scale. Recent vendor acquisitions in the space, such as NitroSecurity (now McAfee) and Q1 Labs (now IBM), have highlighted more straightforward deployment and administration in dealing with mushrooming security data, while longstanding leaders have invested in re-architecting to improve performance, as with HP ArcSight’s recently introduced Correlation Optimized Retention and Retrieval Engine (CORR-Engine). Challenging these incumbents are vendors such as Splunk, whose implementation of search signaled the early potential of Internet-scale technologies applied to security and IT operational data.
These assets are common in the enterprise, but SIEM is hardly the end of the story. Many security organizations seek to dive deeper or go broader still. The breadth of data that factors into IT risk management or meeting compliance requirements can be extensive. Many organizations seek to get beyond the analytics packaged with off-the-shelf tools to better understand their security management challenges. An interest in going deeper, meanwhile, has led to the rise of investigative platforms such as AccessData’s SilentRunner, EMC/RSA’s NetWitness and Solera Networks that couple full packet capture with analysis. It’s hardly a stretch to realize that full packet data can add up quickly -- and storage budgets aren’t unlimited. Organizations that value these capabilities must therefore consider data management strategies that can embrace the necessary scale, speed, and data complexity to get the most out of the data they can handle.
Distributed platforms for big data
Many established techniques for storing, managing and analyzing data are predicated on proven concepts such as relational databases. The advantages of established technology optimized for large data sets include maturity, particularly in meeting the performance and reliability demands of enterprise operations. The sheer growth and velocity of security data, however, have motivated enterprises and vendors of security products and services alike to look to emerging Internet-scale technologies that arose among the likes of Google, Yahoo, Amazon, and Facebook.
These are the technologies many often imply when invoking the concept of big data. They include the Hadoop ecosystem of Apache open source projects and "Not Only SQL" (NoSQL) platforms such as Cassandra, MongoDB, and others. NoSQL techniques directly address some of the limitations of traditional, relational data stores when analysis of a body of data is the priority. They are often highly distributed systems, developed in many cases to deliver better performance in data management and retrieval at Internet scale. They are well suited for the analysis of an entire body of recorded data to discover patterns, trends, and anomalies -- which makes them compelling candidates for handling large and diverse bodies of security-relevant data.
In general, these techniques tend to embrace highly distributed, fault-tolerant architectures on commodity hardware. Storage concepts such as the Hadoop Distributed File System (HDFS) exhibit resilient data distribution among multiple nodes in a cluster. MapReduce highlights parallelism in data retrieval. Jobs are parceled out or “mapped” to a number of subsidiary nodes, with results handed back up (or “reduced”) in the ultimate output of the original tasks.
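The map and reduce steps can be illustrated without a cluster. The sketch below is plain Python, not Hadoop's API: the log lines, partition layout and function names are all hypothetical, and the "nodes" are simply lists processed in turn. It counts failed logins per source address.

```python
from collections import defaultdict

# Hypothetical log lines, split into partitions as if stored on separate nodes.
partitions = [
    ["10.0.0.1 login_failed", "10.0.0.2 login_ok", "10.0.0.1 login_failed"],
    ["10.0.0.3 login_failed", "10.0.0.1 login_failed"],
]


def map_phase(lines):
    """Map: emit a (source, 1) pair for every failed login in one partition."""
    for line in lines:
        source, event = line.split()
        if event == "login_failed":
            yield source, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# On a real cluster each partition would be mapped in parallel on its own node.
mapped = [pair for part in partitions for pair in map_phase(part)]
failed_by_source = reduce_phase(shuffle(mapped))
print(failed_by_source)
```

Because each partition is mapped independently, adding nodes lets more partitions be processed at once; only the grouped intermediate results need to travel between nodes.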
Massively parallel processing (MPP) is a primary benefit of these distributed platforms. In its Black Hat Europe 2012 presentation, Packetloop, an emerging provider of big data security analytics with a hosted service currently in beta, compared the analysis of 2.5 terabytes of data on 4 compute units (8 hours) to analyzing the same volume across 128 compute units (15 minutes). Although this example is conceptual, it makes an important point: If some of the most important security data is buried in massive data volumes, and time is of the essence, whether analyzing an individual attack or adapting tactics to a dynamic threat landscape, performance at scale becomes vital.
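The figures in that comparison are consistent with perfectly linear scaling, which a short calculation shows. This is a sketch of the idealized case only; real clusters lose some efficiency to coordination and data movement, and the function name is illustrative.

```python
def ideal_runtime_minutes(base_units: int, base_minutes: float, target_units: int) -> float:
    """Runtime if work divides evenly across compute units (an idealization)."""
    return base_minutes * base_units / target_units


base_units = 4
base_minutes = 8 * 60  # 8 hours on 4 compute units
target_units = 128

runtime = ideal_runtime_minutes(base_units, base_minutes, target_units)
speedup = target_units / base_units

print(f"{target_units} units: {runtime:.0f} minutes ({speedup:.0f}x the compute)")
```

Thirty-two times the compute yields one thirty-second of the runtime, which is what the 8-hour and 15-minute figures imply.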
Another advantage of NoSQL approaches like Hadoop ecosystems is their greater tolerance for flexibility in embracing a wider variety of data and data structure compared to relational systems, which often require data to conform to a defined schema on (or before) ingestion. When structure is needed to define a dataset for analysis, many NoSQL techniques allow more precise definitions later in the process -- at “query time” rather than on initial data intake, for example. This further suggests how NoSQL techniques may enable security organizations to take in a wider variety of data, such as unstructured data from both internal and external sources, or binary content such as images or video (which may be managed via the application of structured or semi-structured metadata).
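The idea of applying structure at query time, sometimes called schema-on-read, can be sketched in a few lines. The example below assumes events arrive as JSON text with no agreed schema; the records, field names and `query` helper are hypothetical and stand in for whatever a given platform provides.

```python
import json

# Hypothetical raw events stored exactly as received, with differing fields.
raw_events = [
    '{"src": "10.0.0.1", "action": "login_failed", "user": "alice"}',
    '{"src": "10.0.0.2", "bytes": 5120, "proto": "tcp"}',
    '{"host": "web01", "msg": "disk almost full"}',
]


def query(raw, fields):
    """Parse at query time and project only the fields this analysis needs.

    Records that contain none of the requested fields are skipped.
    """
    for line in raw:
        record = json.loads(line)
        row = {field: record.get(field) for field in fields}
        if any(value is not None for value in row.values()):
            yield row


rows = list(query(raw_events, ["src", "action"]))
for row in rows:
    print(row)
```

Nothing was rejected at intake for failing to match a schema; a later analysis can ask for `host` and `msg` instead without reloading the data.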
Database management systems and analytics
Techniques such as MapReduce are unfamiliar to many data managers and analysts; they often require expertise with tools such as Pig or Hive for data processing and retrieval. When data has at least some structure -- or at least some of the data is structured, as is typical with infrastructure monitoring or network protocol data, for example -- database management systems exist for both traditional and NoSQL environments (Cassandra is an example of the latter). These may be deployed alongside platforms such as Hadoop or existing toolsets such as SIEM or GRC systems in hybrid environments that make the most of a combination of methods.
Database management systems can optimize analytic performance with large data sets through approaches such as columnar techniques. Columnar systems take their name from their orientation to the columns of data (as represented in a two-dimensional table of rows and columns) in a dataset. They are not necessarily incompatible with logical concepts such as the relational database, but rather represent a physical approach to data storage that, in effect, stores each column of data from multiple rows contiguously. Storing data in columns improves the performance of retrieval when one seeks to aggregate totals and averages for one or a few columns of data across multiple records, such as source and destination address information. Effectively, a column can serve as an index, further reducing storage and data management requirements. Vendors with a security focus that embrace a columnar approach include Sensage, which has long provided many enterprises with security data warehousing.
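The difference between row and column orientation can be shown with in-memory structures. This is a toy sketch, not how any particular product lays out data on disk; the flow records and field names are hypothetical.

```python
# Hypothetical network flow records in row orientation: one dict per record.
rows = [
    {"src": "10.0.0.1", "dst": "10.0.1.5", "bytes": 1200},
    {"src": "10.0.0.2", "dst": "10.0.1.5", "bytes": 300},
    {"src": "10.0.0.1", "dst": "10.0.1.9", "bytes": 4500},
]

# Column orientation: the values of each column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column touches only that column's values.
total_bytes = sum(columns["bytes"])
average_bytes = total_bytes / len(columns["bytes"])
distinct_sources = sorted(set(columns["src"]))

print(total_bytes, average_bytes, distinct_sources)
```

In the row layout, summing `bytes` means visiting every record and skipping past the address fields; in the column layout the same query reads a single contiguous list.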
Vendors are working to address the personnel and security issues that come with big data security analysis.
Obviously, not every organization will be able -- or willing -- to make substantial investments in data management and analysis for security. Organizations that already find it difficult to hire experienced security engineers may be doubly frustrated by the rarity of the true data scientist, and the costs of finding and retaining both. Some will have concerns about protecting such extensive data, which could multiply an organization’s requirements for assuring due care for sensitive information.
These are areas where technology vendors and service providers are looking to help close gaps. Security vendors are extending the advantages of big data to challenges such as malware recognition and defense for a wide range of customers, from consumers and SMBs to enterprises. Those targeting the market for big security data, such as Zettaset, emphasize enterprise-ready implementations of big data technologies that feature enhanced capabilities for securing the data they manage.
Service providers, meanwhile, can provide outsourcing, managed or professional services, whether in security data management and analysis or in more advanced security services that utilize big data techniques “on the backend.” Product and service vendors alike may see hosted technologies as an option, in line with strategies for delivering security capabilities “from the cloud” -- a model which, just coincidentally, aligns well with the environments in which Internet-scale data management platforms first took shape.
Analytical databases offer another example of improved performance in accessing a large body of collected data. They are called “analytical” to distinguish their function from transactional systems on which business processes depend in real time. While they may use the same data as real-time systems, analytical environments emphasize the exploration of collected or historical data to deepen insight. Platforms such as EMC’s Greenplum, IBM’s Netezza, or ParAccel combine online analytical processing (OLAP) capabilities with storage in a readily deployed "analytical appliance" physical form factor to accelerate analytical performance with larger data sets, including security-relevant data.
Not all requirements lend themselves to large-scale, centralized platforms, however. Transferring large quantities of data to a central location can be highly expensive in terms of network utilization and availability. This is where techniques that distribute analysis throughout an environment may have appeal. Narus, a vendor of large-scale, real-time network intelligence technology (and since 2010, a wholly owned subsidiary of Boeing), provides network analytics at a service provider scale to identify malicious activity in network content. Narus’ partnership with analytic data warehouse vendor Teradata to capture, analyze and correlate IP traffic in real time illustrates the union of distributed analysis with centralized techniques.
Putting big data security to work
The range of application of these technologies in security is broad, but many deployments often share the same objectives.
In terms of security tactics, common themes are identifying patterns and anomalies. This is particularly useful to security vendors such as McAfee, Sourcefire, Symantec and Trend Micro, who are using big data platforms not only to harness their own deluge of security data, but to take advantage of their visibility across thousands or millions of systems and networks where these vendors have a presence. This gives them a large data set to make determinations of “normal” (in terms of widely accepted software, for example) as a baseline against which anomalies such as new malware -- or recent variants of recognized malware -- stand out. This also helps to speed countermeasures among their customers. Similar approaches can be used to tackle challenges such as fraud recognition. These techniques hold the promise of better recognizing behavioral anomalies that today may be too difficult to distinguish from normal user activity -- a limitation that often serves to cloak the actions of the more adept adversary.
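One simple way to picture a baseline of "normal" is prevalence: software seen on most systems is treated as ordinary, and anything rare stands out for review. The sketch below is an illustration under that assumption, not a description of any vendor's method; the host names, file hashes, threshold and `rare_files` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (host, file hash) sightings reported from monitored systems.
sightings = [
    ("host-a", "hash-office"), ("host-b", "hash-office"),
    ("host-c", "hash-office"), ("host-d", "hash-office"),
    ("host-a", "hash-browser"), ("host-b", "hash-browser"),
    ("host-c", "hash-browser"),
    ("host-d", "hash-unknown"),
]


def rare_files(observations, min_prevalence=0.5):
    """Return file hashes seen on fewer than min_prevalence of all hosts."""
    hosts_by_file = defaultdict(set)
    all_hosts = set()
    for host, file_hash in observations:
        hosts_by_file[file_hash].add(host)
        all_hosts.add(host)
    return sorted(
        file_hash
        for file_hash, hosts in hosts_by_file.items()
        if len(hosts) / len(all_hosts) < min_prevalence
    )


suspects = rare_files(sightings)
print(suspects)
```

The larger the population of systems reporting, the more reliable the baseline becomes, which is why visibility across thousands or millions of endpoints matters so much to this approach. Rarity alone is not proof of malice, so a result like this is a starting point for investigation rather than a verdict.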
Another application of big data security is strategic: Providing greater flexibility in finding and measuring meaningful data may help security management in areas from resource optimization to risk management. Some early enterprise adopters of large-scale security data management interviewed for the EMA study are exploring provocative ideas borrowed from industry. The public health concept of “populations at risk,” for example, may be useful in identifying traits common among aspects of business where security incidents are more frequent. Comparing this information to breach data can help organizations better grasp where they stand in comparison to peers, which can help them gauge an appropriate level of investment. This suggests the value of data mining for automating the learning and identification of patterns in data, and the value of tools such as Skytree Server or Apache Mahout for data exploration.
The future of the big data security analytics trend may well be a product of the flexibility provided by big data tools and techniques. Better performance and more flexible data management for large and diverse data sets make it easier to reveal successful exploits and risk factors that may go unrecognized today -- if organizations have the analytic capability to do so. Analytic tools that recognize and harness this power may therefore be the next important direction for data-driven security. The seeds of this direction are already evident in synthesis platforms such as Palantir or the open source Maltego. Recently introduced hosted offerings that emphasize big data analytics in addition to Packetloop include PixlCloud, founded by one of security’s pioneers in data visualization, Raffael Marty.
Big data security analytics start small
Where should organizations begin when weighing their opportunities in data-driven security? Those far down the road with big data suggest -- perhaps paradoxically -- starting small. One CSO of a large financial services organization recommends exploring easily obtained bodies of existing data, such as a subset of log data or the databases of security tools, with readily accessible analytics like the Tableau Software suite. This helps security professionals to recognize when techniques used in other aspects of business, such as business risk analytics or BI tools, might be amenable to security data.
Progressing one step at a time does more than build literacy in “security data science.” It also helps build familiarity with the requirements of building a security data management architecture. For example, the obstacles encountered and overcome in gaining usable access to security-relevant data have led one IT security executive with a major health care organization to include a new requirement in all his RFPs for security and systems management tools: The ability to access data directly from these tools, rather than relying solely on the dashboards and self-contained analytics they provide.
Regardless of whether they take on the opportunities and challenges of big data directly, or benefit from security technologies and services that increasingly depend on large-scale data analysis, organizations of every size and capability stand to benefit from a growing interest in data-driven security. This is a trend that promises to have a transformative impact on the nature of defense and security management industrywide, in line with the broader trend toward turning growing volumes of big data from a burden into a strategic advantage.
About the author:
Scott Crawford, CISSP, CISM, is managing research director covering security and risk for Enterprise Management Associates, an industry analyst firm based in Boulder, Colo. He was formerly CISO for the Comprehensive Nuclear-Test-Ban Treaty Organization’s International Data Centre in Vienna, Austria, and his experience includes security and data management for public-sector organizations such as the University Corporation for Atmospheric Research, and systems management for Fortune 500 enterprises including Emerson. Send comments on this article to firstname.lastname@example.org.