“Houston, we've had a problem,” is one of the great quotations from the world of science and technology.
Technically speaking, however, when they radioed Mission Control in Houston, the crew of Apollo 13 should have said: "We've had an incident," as a problem is the unknown cause underlying an incident.
An incident is any event not part of the standard operation of the service that causes, or may cause, an interruption or a reduction of the quality of the service. Thankfully, like any well-organized project, Houston had an abort plan, and, combined with considerable ingenuity on the part of the astronauts, they were able to return the crew safely to Earth.
It was a classic case of incident or crisis management, where the aim was to restore normal service operations as quickly as possible in order to minimize the adverse impact on operations. It should be standard practice for any organization to have an incident management plan and emergency response team in place to ensure any incidents, whether they're natural or man-made disasters, malicious attacks or the result of employee negligence, can be quickly dealt with and normal services resumed. Incidents are part of everyday business life, with help desks or IT support handling less serious issues and an emergency response team handling major incidents. But how many organizations have a problem management team?
A problem is defined as the unknown cause of one or more incidents, and the problem management process deals with determining the underlying cause of such incidents and then finding a permanent solution to them. It differs from incident management in that the focus is on resolving the problem so that it cannot instigate further incidents, rather than on the speed of the response to the incident itself.
The incident on board Apollo 13 led to a lengthy investigation to determine the cause of the problem. This knowledge was then used to ensure the problem didn't recur in future missions. This type of post-incident analysis can play a vital role in ensuring enterprise network operations remain uninterrupted and run efficiently. Without it, downtime can increase, and time and money can be wasted on dealing with repeat incidents.
Let's take a simple example to highlight the difference between incident and problem management: a frozen network file server that is preventing employees from accessing their documents. The incident response may be simply to reboot the server in order to restore access quickly. The problem response will be to find the underlying reason that the server froze, so it can be fixed and prevented from happening again. (Note that problem management is different from a lessons-learned exercise, which reviews how an incident was handled to see if improvements can be made for handling future incidents.)
While problem solving and incident response are related, they don't necessarily require the same skill sets, and the personnel involved in the two processes will most likely be different as well; someone may know how to restore the last database backups, but not know how to work out what caused the database to crash in the first place. Problem solving is more inclined toward forensics and tracking down what happened to cause an event, while incident management requires more operational knowledge of how a system can be restored.
A problem is usually identified following multiple incidents with similar symptoms -- such as a virus spreading across a collection of networked computers and affecting their performance -- or from a single incident that has a significant impact, like the one above where nobody can access files on a particular server.
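The two triggers described above (several incidents sharing a symptom, or one incident with a significant impact) can be sketched in a few lines of Python. This is only an illustration: the `Incident` fields, the `candidate_problems` function and the default threshold of two incidents are assumptions made for the example, not part of any particular help desk tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record layout, assumed for this sketch.
    incident_id: str
    symptom: str
    major: bool = False  # True for a single incident with significant impact


def candidate_problems(incidents, repeat_threshold=2):
    """Group incidents by symptom and return the groups worth investigating.

    A symptom is flagged when it appears in at least `repeat_threshold`
    incidents, or when any incident showing it is marked as major.
    """
    by_symptom = defaultdict(list)
    for incident in incidents:
        by_symptom[incident.symptom].append(incident)

    return {
        symptom: [i.incident_id for i in group]
        for symptom, group in by_symptom.items()
        if len(group) >= repeat_threshold or any(i.major for i in group)
    }


incidents = [
    Incident("INC-1", "slow workstation"),
    Incident("INC-2", "slow workstation"),
    Incident("INC-3", "file server frozen", major=True),
    Incident("INC-4", "forgotten password"),
]
flagged = candidate_problems(incidents)
```

In this sample data, the repeated "slow workstation" symptom and the single major file server incident are flagged, while the one-off password request is left to the help desk.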
Once the underlying cause of an incident has been successfully diagnosed, the problem becomes a “known error,” and the task then is to find a suitable work-around or permanent solution. A work-around should only be used to minimize the effects of a problem until a permanent solution is found; while only a work-around is in place, the problem should still be referred to as a known error.
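One way to picture this lifecycle is as a small state model in which a work-around never changes a problem's status. The sketch below is a minimal illustration under that reading; the class, method and status names are hypothetical, and the root cause and work-around in the usage lines are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"                # cause still unknown
    KNOWN_ERROR = "known error"  # cause diagnosed, no permanent fix yet
    RESOLVED = "resolved"        # permanent solution in place


@dataclass
class Problem:
    summary: str
    status: Status = Status.OPEN
    root_cause: Optional[str] = None
    workaround: Optional[str] = None

    def diagnose(self, root_cause: str) -> None:
        # Diagnosing the cause turns the problem into a known error.
        self.root_cause = root_cause
        self.status = Status.KNOWN_ERROR

    def apply_workaround(self, workaround: str) -> None:
        if self.status is not Status.KNOWN_ERROR:
            raise ValueError("diagnose the problem before recording a work-around")
        # A work-around limits the impact but leaves the status unchanged.
        self.workaround = workaround

    def resolve(self) -> None:
        if self.status is not Status.KNOWN_ERROR:
            raise ValueError("only a known error can be permanently resolved")
        self.status = Status.RESOLVED


# Invented example values, for illustration only.
problem = Problem("File server freezes")
problem.diagnose("faulty disk controller firmware")
problem.apply_workaround("nightly scheduled reboot")
status_after_workaround = problem.status
problem.resolve()
```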
One technique for identifying the root cause of a problem is to use an Ishikawa diagram, a tool for mapping the causes of an event. Potential causes are usually grouped into categories such as people, processes and policies, hardware, software and environment, and any source of variation within them can help identify where the cause of the problem lies. Other techniques, such as Apollo Root Cause Analysis, can also be used to identify causes and solutions.
Although problem management is closely associated with incident management, conflicts between the two may arise because one demands speedy resolution of the incident and the other long-term resolution of the problem. Using the earlier example, an immediate reboot of the file server may destroy diagnostic information needed to identify the cause of the problem. One way to resolve this conflict is to agree beforehand what diagnostic information is needed, how long to allow for diagnostics before service is restored, and what resources will be available to those trying to resolve the problem.
A proactive approach to problem management is to try to identify and resolve problems before incidents occur. This involves analyzing trends from log reports and help desk requests, following relevant newsgroups for advance warning of problems occurring elsewhere, and targeting support action accordingly.
The problem management process is intended to reduce the number, severity and adverse impact of incidents and problems on the business, and to prevent recurrence of incidents related to the same underlying errors. The success of the team can easily be measured by monitoring the average time for diagnosis and resolution of problems, the number of repeat problems, and the number of major incidents that occur. Having a problem management process in place will help any organization minimize repeat incidents and lead to a more reliable network and application environment.
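As a rough sketch of how the first two of these measures might be computed, the Python below averages time to diagnosis and time to resolution and counts root causes that have come back. The `ProblemRecord` fields and the `problem_metrics` function are assumptions for illustration; a real team would pull the timestamps from its own ticketing system, and would track the number of major incidents separately.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class ProblemRecord:
    # Hypothetical record layout, assumed for this sketch.
    problem_id: str
    root_cause: str
    opened: datetime
    diagnosed: datetime
    resolved: datetime


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def problem_metrics(records):
    """Return average diagnosis/resolution times and the repeat-problem count."""
    if not records:
        raise ValueError("at least one problem record is required")

    # A root cause seen in more than one record counts as one repeat problem.
    cause_counts = Counter(r.root_cause for r in records)
    return {
        "mean_hours_to_diagnose": mean(
            hours_between(r.opened, r.diagnosed) for r in records
        ),
        "mean_hours_to_resolve": mean(
            hours_between(r.opened, r.resolved) for r in records
        ),
        "repeat_problems": sum(1 for n in cause_counts.values() if n > 1),
    }


# Invented sample data, for illustration only.
records = [
    ProblemRecord("PRB-1", "disk controller firmware",
                  datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 13),
                  datetime(2024, 1, 2, 9)),
    ProblemRecord("PRB-2", "disk controller firmware",
                  datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 11),
                  datetime(2024, 1, 2, 21)),
]
metrics = problem_metrics(records)
```

Tracked over successive reporting periods, falling averages and a shrinking repeat count would suggest the process is working.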
About the author:
Michael Cobb, CISSP-ISSAP, is a renowned security author with more than 15 years of experience in the IT industry. He is the founder and managing director of Cobweb Applications, a consultancy that provides data security services delivering ISO 27001 solutions. He co-authored the book IIS Security and has written numerous technical articles for leading IT publications. Cobb serves as SearchSecurity.com’s contributing expert for application and platform security topics, and has been a featured guest instructor for several of SearchSecurity.com’s Security School lessons.