Perhaps a phone call in the middle of the night: "We've got a bit of a mess here, Dave. That new worm blew our primary and backup e-mail servers out of the water, and I've had to take us off-net for an hour or two. I hope it's not longer than that. Oh, by the way, the customer care intranet will be disconnected until we can get the staff out there to disinfect about 30 or 40 machines tomorrow or maybe Monday. Is that going to be a problem for anyone?"
Is this how security incidents are handled at your organization? If your security incident management is solely or primarily the responsibility of your IT managers, it's a good bet you're not adequately prepared for events that may disrupt business, or, in a worst-case scenario, shut it down.
It's time to reinvent the process, from the top down. (See the Incident Response Matrix above for suggestions.)
That scenario isn't at all far-fetched. Last January's SQL Slammer worm crashed thousands of networks worldwide. Last August, Blaster and Welchia infected millions of computers. Many of the affected networks had highly evolved infosecurity defenses.
If random Internet attacks with no specific targets caused problems on such a wide scale, who can be sure that any infosecurity defenses would withstand a dedicated, targeted attack? It's not a question of whether you can eliminate all risk and prevent critical security incidents from happening. You can't. Rather, what determines whether you stay in business is what you do when incidents occur.
What follows is a basic blueprint for reinventing incident management in your organization. There are six practical steps you can take to build an effective incident management program (IMP), and while the content and form may vary from organization to organization, the message is consistent: Start thinking in terms of incident management as a critical business component and act accordingly.
Business Risk Management
Before discussing specific building blocks of an incident management program, it's important to understand that incident management is really about risk management. That's hardly a new concept in the enterprise, but it's fundamental. Everything that follows about incident management flows from it: Who has overall responsibility, who responds to particular types of events, how you rate incident severity, etc.
Traditional forms of risk management have included risk assignment through insurance of various types; crisis management through contingency planning for disasters like fires and floods; and financial risk management through portfolio diversity, long-term development strategies, etc.
As organizations have become more dependent on IT, disaster recovery planning (DRP) and business continuity planning (BCP) have evolved in response to threats to the IT infrastructure. Clearly, they aren't just about IT, but about the survival and health of the business, and they need to be integrated into a corporate-wide IMP.
All too often, DRP and BCP -- even when supported and endorsed by corporate management -- have been incomplete, because they have been seen primarily as an IT responsibility, which gives line business units an excuse to steer clear: "It's your system, what are you going to do for us if it breaks?"
In most companies, however, the operational IT organization doesn't own enterprise risk. Asset protection, risk management and fiduciary responsibilities belong to the board and senior operational and financial management. It stands to reason that responsibility and accountability for enterprise risk -- and, by extension, incident management -- can't and shouldn't be delegated to IT, and that the steps to getting incident management on track need to flow from the top.
Step 1: Management Takes the Reins
"It's a technical problem. Let the techies chase it down and fix it."
In many organizations, management has abdicated responsibility for incident management and response planning, largely because it doesn't understand the nature of infosecurity threats in today's technological environment.
While a virus infection on a single desktop or even in a single departmental office is a local problem that technical staff can and should handle, a system-wide Internet worm attack is a technical problem that may adversely affect the operations of the entire network -- the core business of the enterprise -- and must be treated as a business-threatening corporate risk. Similarly, an isolated intrusive network scan from a computer outside of the corporate network is a technical problem to be traced and stopped by the network technical staff, but a concerted DDoS attack is a very different situation.
In January 2003, the SQL Slammer worm indirectly shut down more than 10,000 Bank of America ATMs by infecting database servers on the same network. Parts of the bank's ATM network were hit again in August by Blaster and Welchia. Bank of America certainly doesn't regard Internet worms as "just an IT problem."
In the modern business model, IT is no longer a simple support service -- it's an integral part of the corporate structure. Attacks or threats to an organization's information security aren't threats to computers or databases or the network; they're threats to mission-critical business functions, and must be treated as such.
That won't happen without executive support, which is critical if incident management and response is to be embraced as a corporate function by the business unit managers and key department executives whose participation is crucial.
Who in the organization should "own" the IMP? The way to answer that is to ask, "Who owns corporate risk?" Ultimately, the answer is the board of directors, but in operational terms it probably means the CFO or CEO, or perhaps the CIO in some sectors. Neither MIS nor the line IT organization owns corporate risk. The information security office doesn't own corporate risk. Incident management is business risk management and must be treated as such. The IMP should be accountable to the highest risk management officer in the organization.
Step 2: Assign Responsibilities
An effective IMP must assign responsibilities and specify routine procedures in the event of an incident. The goal of incident management planning is to predetermine necessary actions and responses to specific classes of incidents, so that no one is expected to make decisions under pressure with minimal information.
Your first step is to create an incident management policy that clearly defines:
- What incidents are covered. For example, you may choose to omit low-level events from your plan -- a minor policy violation, such as a user loading a game on his desktop or using an instant messaging app, or a virus infection of one or two machines.
- Who is responsible for detection and reporting? Is detection of an attack a networking or infosecurity responsibility? Is IT or the business unit -- or both -- responsible for reporting?
- Who is responsible for business decisions related to the incident? Do you shut down the Web server? Is that a line manager's decision or the unit's senior VP?
- Who is responsible and accountable for ensuring timely recovery from an incident? IT personnel, yes, but someone on the business side needs to be working with them and may have overall responsibility for getting back online.
Next, getting down to brass tacks, your computer incident response team (CIRT) policy should specify first responders, responsibility for management of the response to a specific incident and follow-up and reporting responsibilities. That's the mostly technical first part of a more detailed and comprehensive IMP.
Besides the MIS and network technical staff who are first responders, who should be part of the IMP? At the very least: risk management, corporate legal, corporate security, public relations, human resources and labor relations, the office responsible for regulatory compliance, and all major business units with oversight and advisory responsibility.
This is a primary reason for locating the IMP as high in the organization as possible. None of those offices report to IT or the information security office (ISO), and the IT organization can't mandate and ensure their cooperation and participation.
Further, while initial response in many cases will be technical, IT staff can't make decisions in a vacuum. If a security incident requires taking the entire corporate network or a significant portion of it offline for a period of time, is that a decision the network security officer should be making on his own at 5 a.m.? If criminal or actionable activity is discovered or suspected, do you want a harried sysadmin deciding to call law enforcement and taking necessary steps to preserve critical evidence?
If a security breach is severe enough to require public disclosure, who determines that and who speaks for the organization? California's Database Security Breach Notification Act (SB 1386), which went into effect last July, requires companies to inform California customers of incidents involving the compromise of their names in combination with their Social Security, driver's license or credit card numbers. Someone other than the manager of database administration will need to decide to call the newspaper and/or send out notices to your customers. But who?
If an employee in the accounting office or a sales manager in the field is suspected of compromising the system or abusing company policy in a way that has an operational impact, you can't expect the MIS department to initiate disciplinary action. And, you won't want the infosecurity admin deciding to file a Suspicious Activity Report (SAR) for a regulated financial institution.
Step 3: Take Stock of Your Assets
Once senior management has defined the responsibility for and mission of the IMP, specific measures should be implemented to make it a reality.
The first of these is a current vulnerability assessment (VA) and business impact analysis (BIA) in all mission-critical business units. How and where are core business functions vulnerable to service interruptions caused by security breaches and other interruptions of technology-based business support systems?
IT and IS staff should conduct or contract for technical vulnerability assessments to analyze what systems may be vulnerable to attack, exploit or outage. However, they can't conduct the business impact analysis without the direct involvement of business unit management. The business units need to identify mission-critical functions that could be adversely affected by an IT system outage or corruption, and assign severity classifications.
This process should yield a formal report that identifies those core business processes that depend on IT support. The report should also rank interruptions to those processes by the impact severity.
From the VA/BIA report, two critical documents will be developed, which are the core of the IMP: An incident classification matrix (Step 4) and an incident response matrix (Step 5).
Step 4: Classify Potential Incidents
The incident classification matrix defines the set of potential incidents included under the IMP, with severity and business impact classifications. Incidents may include operational failures in core support systems; local and/or system-wide virus and malware problems; malicious attacks (external and internal); abuse of policy or unacceptable use of corporate resources, and so on. This document, which will mostly be developed by IT/IS staff, should be comprehensive enough to allow all first responders to clearly identify the type of incident involved.
The incident classification matrix may be organized in different ways, depending on the needs of your enterprise: by business function or business unit, by network topology or by technology. The purpose is to define the potential threats and exploits to IT systems and the business functions they support, with classifications by severity and business impact.
For example, a virus infection detected on one to five desktops by internal network monitoring would get a low classification for scope (local), severity and business impact. The operational outage of a primary file or application server for a critical business function is quite another matter. That would get a high classification, even more so if it affected the entire enterprise rather than a single business unit.
There are a number of criteria for incident classification:
- Scope: Is only a single host or network segment affected? Or is the impact departmental, regional or enterprise-wide?
- Impact on systems and operations: Is it a local service outage or interruption of noncritical function, or has a central server or a mission-critical business application failed?
- Impact on data protection: Is operational or transaction data lost or corrupted, or has sensitive or privacy-protected data been stolen or exposed?
- Duration: Is this a brief interruption, easily remedied through normal operation? Or, is it a more serious disruption, requiring the BCP or even DRP to kick in?
- Legal considerations: Does the incident require possible law enforcement action? Is there a breach of regulatory constraints? Is there a potential threat of litigation or employee-related action?
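One way to picture these criteria is as a small record with a rating for each, rolled up into an overall classification. The sketch below is illustrative only: the three-level scale, the field names and the "highest rating wins" rule are assumptions made for the example, not part of any standard, and your own matrix may weigh the criteria differently.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-level rating scale."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Incident:
    description: str
    scope: Level          # single host (LOW) up to enterprise-wide (HIGH)
    system_impact: Level  # noncritical function (LOW) up to mission-critical (HIGH)
    data_impact: Level    # no data affected (LOW) up to sensitive data exposed (HIGH)
    duration: Level       # brief interruption (LOW) up to BCP/DRP territory (HIGH)
    legal: Level          # no legal exposure (LOW) up to law enforcement or litigation (HIGH)


def classify(incident: Incident) -> Level:
    """Overall classification is the highest rating on any criterion (assumed rule)."""
    return max(
        incident.scope,
        incident.system_impact,
        incident.data_impact,
        incident.duration,
        incident.legal,
    )


# Hypothetical incidents modeled on the examples in the text.
desktop_virus = Incident(
    "Virus detected on three desktops",
    scope=Level.LOW, system_impact=Level.LOW, data_impact=Level.LOW,
    duration=Level.LOW, legal=Level.LOW,
)
server_outage = Incident(
    "Primary application server for a critical business function is down",
    scope=Level.MEDIUM, system_impact=Level.HIGH, data_impact=Level.LOW,
    duration=Level.MEDIUM, legal=Level.LOW,
)

print(classify(desktop_virus).name)
print(classify(server_outage).name)
```

Taking the highest rating is a deliberately conservative choice: a single severe criterion, such as exposed customer data, is enough to push the whole incident into the top class.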
Step 5: Prepare Responses in Advance
The incident response matrix is directly keyed to the incident classification matrix. This document defines and specifies the appropriate response to each potential incident based on the severity and business impact classifications. It clearly assigns responsibility for that response to specific individuals or organizational units.
The document may be extremely detailed, and should be as specific as necessary to ensure timely and consistent handling of all critical incidents.
Depending on the nature of the incident, the individuals who are assigned specific responsibilities may not be IT or IS staff. The key to success is predetermining as much of your response as possible, so that the support-level staff isn't asked to make business-critical decisions under pressure (See "Incident Response Matrix").
By one common model, there are seven primary components to incident response:
- Detection, verification and notification: How is a given incident detected? Who verifies that the incident is real, and who is notified?
- Classification determination: Who is responsible for determining scope and severity of the incident and triggering escalation processes as indicated?
- Containment: Who are the first responders assigned to stop, contain or limit the damage, and what actions are required to do so?
- Escalation: Who is notified beyond the first responders? How is ownership of the incident and responsibility for resolution determined and assigned?
- Resolution: What processes are required to maintain business functions until the incident is resolved and who is responsible for them? Who takes what actions to resolve and close the immediate incident?
- Recovery: What actions are required to get business back to normal after the incident is resolved, and who is responsible for them?
- Reporting and follow-up: Who is responsible for reporting on the incident and initiating follow-up or corrective measures?
The answers to these questions will vary with the scope, severity and business impact of the incident. The resolution and recovery for a desktop infection might simply be disconnecting, disinfecting and reconnecting a machine, then notifying the user and manager.
A critical server outage, however, would require a more complex series of actions, including determining the cause and expected duration of the outage, selecting the recovery procedure and implementing the BCP.
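Because the response matrix is keyed to the classification matrix, it can be thought of as a lookup table: given an incident type and its classification, it returns who owns the response, who is notified and what steps are taken. The sketch below is a minimal illustration of that idea. The owners, notification lists and the escalation fallback are hypothetical placeholders; only the desktop and server actions are taken from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    owner: str      # who is accountable for the response
    notify: tuple   # who is told, beyond the first responders
    actions: tuple  # predetermined steps, in order


# Hypothetical entries keyed by (incident type, classification).
RESPONSE_MATRIX = {
    ("desktop infection", "low"): Response(
        owner="desktop support",
        notify=("user", "user's manager"),
        actions=("disconnect machine", "disinfect machine", "reconnect machine"),
    ),
    ("critical server outage", "high"): Response(
        owner="business unit executive",
        notify=("risk management", "affected business units"),
        actions=(
            "determine cause and expected duration",
            "select recovery procedure",
            "implement BCP",
        ),
    ),
}

# Assumed fallback: anything the matrix didn't plan for goes up the chain.
DEFAULT_ESCALATION = Response(
    owner="incident management program owner",
    notify=("risk management",),
    actions=("escalate for a management decision",),
)


def lookup_response(incident_type: str, classification: str) -> Response:
    """Return the predetermined response, or the escalation fallback."""
    return RESPONSE_MATRIX.get((incident_type, classification), DEFAULT_ESCALATION)


plan = lookup_response("desktop infection", "low")
print(plan.owner, "->", ", ".join(plan.actions))
```

The fallback entry reflects the point of the whole exercise: when an incident falls outside what was planned, the decision is routed to management instead of being improvised by support-level staff.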
Step 6: Report, Follow Up, Reassess
The last leg of a successful IMP is reporting, follow-up and reassessment. Consistent handling of incidents will enable consistent reporting after the fact, and the responsibility for that reporting should be included in the response matrix. Generally, the individual or office responsible for the incident should prepare a report, in a consistent format: Describe the incident, and what actions were taken to deal with the problem and mitigate the risk.
Consistent reports will facilitate incident correlation and trend analysis, which may reveal structural or technical vulnerabilities that need to be addressed to further mitigate risk. Many of the organizations affected by the Blaster and Welchia worms developed enterprise patch management programs because their after-action reports revealed that the extent of the problem correlated to failures in the distribution and application of security patches.
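As a rough illustration of why a consistent report format matters, the sketch below tallies a handful of reports by root cause. The report fields and the sample entries are hypothetical, invented for the example; the point is only that reports sharing the same fields can be counted and compared, which free-form write-ups can't.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IncidentReport:
    incident_id: str
    description: str
    actions_taken: str
    root_cause: str  # assumed field; filled in during follow-up


def root_cause_trends(reports):
    """Count reports per root cause, most frequent first."""
    return Counter(report.root_cause for report in reports).most_common()


# Hypothetical sample data.
reports = [
    IncidentReport("2003-014", "Worm outbreak on customer care intranet",
                   "Disinfected and patched affected machines", "missing security patch"),
    IncidentReport("2003-015", "Worm infection on departmental file server",
                   "Rebuilt server and applied patches", "missing security patch"),
    IncidentReport("2003-016", "Unauthorized access to a shared account",
                   "Reset credentials", "weak password"),
]

for cause, count in root_cause_trends(reports):
    print(f"{cause}: {count}")
```

A tally like this is the simplest form of trend analysis; a recurring cause at the top of the list is the kind of signal that prompts a structural fix.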
Fred Trickey, CISSP, is information security administrator at Yeshiva University in New York.