Service Operations

1Introduction 2Serv. Mgmt. 3Principles 4Process 5Activities 6Organization 7Consideration 8Implementation 9Issues AAppendeces

4. Service Operation Processes


4.2 Incident Management

Incident Definition
"An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set. Incident Management is the process for dealing with all incidents; this can include failures, questions or queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or automatically detected and reported by event monitoring tools."

4.2.1 Purpose, Goals and Objectives
The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimize the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. 'Normal service operation' is defined here as service operation within SLA limits.

4.2.2 Scope
Incident Management includes any event which disrupts, or which could disrupt, a service. This includes events which are communicated directly by users, either through the Service Desk or through an interface from Event Management to Incident Management tools.

Incidents can also be reported and/or logged by technical staff (if, for example, they notice something untoward with a hardware or network component they may report or log an incident and refer it to the Service Desk). This does not mean, however, that all events are incidents. Many classes of events are not related to disruptions at all, but are indicators of normal operation or are simply informational (see section 4.1). Although both incidents and service requests are reported to the Service Desk, this does not mean that they are the same. Service requests do not represent a disruption to agreed service, but are a way of meeting the customer's needs and may be addressing an agreed target in an SLA. Service requests are dealt with by the Request Fulfilment process (see section 4.3).

4.2.3 Value to Business
The value of Incident Management includes:

Incident Management is highly visible to the business, and it is therefore easier to demonstrate its value than most areas in Service Operation. For this reason, Incident Management is often one of the first processes to be implemented in Service Management projects. The added benefit of doing this is that Incident Management can be used to highlight other areas that need attention - thereby providing a justification for expenditure on implementing other processes.

4.2.4 Policies, Principles and Basic Concepts
There are some basic things that need to be taken into account and decided when considering Incident Management. These are covered in this section. Timescales
Timescales must be agreed for all incident-handling stages (these will differ depending upon the priority level of the incident) - based upon the overall incident response and resolution targets within SLAs - and captured as targets within OLAs and Underpinning Contracts (UCs). All support groups should be made fully aware of these timescales. Service Management tools should be used to automate timescales and escalate the incident as required based on pre-defined rules. Incident Models
Many incidents are not new - they involve dealing with something that has happened before and may well happen again. For this reason, many organizations will find it helpful to pre-define 'standard' Incident Models - and apply them to appropriate incidents when they occur.

An Incident Model is a way of pre-defining the steps that should be taken to handle a process (in this case a process for dealing with a particular type of incident) in an agreed way. Support tools can then be used to manage the required process. This will ensure that 'standard' incidents are handled in a pre-defined path and within pre-defined timescales.

Incidents which would require specialized handling can be treated in this way (for example, security-related incidents can be routed to Information Security Management and capacity- or performance-related incidents that would be routed to Capacity Management.

The Incident Model should include:

The models should be input to the incident-handling support tools in use and the tools should then automate the handling, management and escalation of the process. Major Incidents
Figure 4.2 Incident Management process flow
Figure 4.2 Incident Management process flow
A separate procedure, with shorter timescales and greater urgency, must be used for 'major' incidentsR. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system - such that they will be dealt with through the major incident process.

Some lower-priority incidents may also have to be handled through this procedure - due to potential business impact - and some major incidents may not need to be handled in this way if the cause and resolutions are obvious and the normal incident process can easily cope within agreed target resolution times - provided the impact remains low!

Where necessary, the major incident procedure should include the dynamic establishment of a separate major incident team under the direct leadership of the Incident Manager, formulated to concentrate on this incident alone to ensure that adequate resources and focus are provided to finding a swift resolution. If the Service Desk Manager is also fulfilling the role of Incident Manager (say in a small organization), then a separate person may need to be designated to lead the major incident investigation team - so as to avoid conflict of time or priorities - but should ultimately report back to the Incident Manager. If the cause of the incident needs to be investigated at the same time, then the Problem Manager would be involved as well but the Incident Manager must ensure that service restoration and underlying cause are kept separate. Throughout, the Service Desk would ensure that all activities are recorded and users are kept fully informed of progress.

4.2.5 Process Activities, Methods And Techniques
The process to be followed during the management of an incident is shown in Figure 4.2. The process includes the following steps. Incident Identification
Work cannot begin on dealing with an incident until it is known that an incident has occurred. It is usually unacceptable, from a business perspective, to wait until a user is impacted and contacts the Service Desk. As far as possible, all key components should be monitored so that failures or potential failures are detected early so that the incident management process can be started quickly. Ideally, incidents should be resolved before they have an impact on users! Please see section 4.1 for further details. Incident LoggingR
All incidents must be fully logged and date/time stamped, regardless of whether they are raised through a Service Desk telephone call or whether automatically detected via an event alert. Note: If Service Desk and/or support staff visit the customers to deal with one incident, they may be asked to deal with further incidents 'while they are there'. It is important that if this is done, a separate Incident Record is logged for each additional incident handled - to ensure that a historical record is kept and credit is given for the work undertaken. All relevant information relating to the nature of the incident must be logged so that a full historical record is maintained - and so that if the incident has to be referred to other support group(s), they will have all relevant information to hand to assist them.

The information needed for each incident is likely to include: Incident Categorization
Figure 4.3 Multi-level incident categorization
Figure 4.3 Multi-level incident categorization

Part of the initial logging must be to allocate suitable incident categorization coding so that the exact type of the call is recorded. This will be important later when looking at incident types/frequencies to establish trends for use in Problem Management, Supplier Management and other ITSM activities.

Please note that the check for Service Requests in this process does not imply that Service Requests are incidents. This is simply recognition of the fact that Service Requests are sometimes incorrectly logged as incidents (e.g. a user incorrectly enters the request as an incident from the web interface). This check will detect any such requests and ensure that they are passed to the Request Fulfilment process.

Multi-level categorization is available in most tools - usually to three or four levels of granularity. For example, an incident may be categorized as shown in Figure 4.3.

All organizations are unique and it is therefore difficult to give generic guidance on the categories an organization should use, particularly at the lower levels. However, there is a technique that can be used to assist an organization to achieve a correct and complete set of categories - if they are starting from scratch! The steps involve:

  1. Hold a brainstorming session among the relevant support groups, involving the SD Supervisor and Incident and Problem Managers.
  2. Use this session to decide the 'best guess' top-level categories - and include an 'other' category. Set up the relevant logging tools to use these categories for a trial period.
  3. Use the categories for a short trial period (long enough for several hundred incidents to fall into each category, but not too long that an analysis will take too long to perform).
  4. Perform an analysis of the incidents logged during the trial period. The number of incidents logged in each higher-level category will confirm whether the categories are worth having - and a more detailed analysis of the 'other' category should allow identification of any additional higher-level categories that will be needed.
  5. A breakdown analysis of the incidents within each higher-level category should be used to decide the lower-level categories that will be required.
  6. Review and repeat these activities after a further period - of, say, one to three months - and again regularly to ensure that they remain relevant. Be aware that any significant changes to categorization may cause some difficulties for incident trending or management reporting - so they should be stabilized unless changes are genuinely required.

If an existing categorization scheme is in use, but it is not thought to be working satisfactorily, the basic idea of the technique suggested above can be used to review and amend the existing scheme.

NOTE: Sometimes the details available at the time an incident is logged may be incomplete, misleading or incorrect. It is therefore important that the categorization of the incident is checked, and updated if necessary, at call closure time (in a separate closure categorization field, so as not to corrupt the original categorization) - please see paragraph Incident Proritization
Another important aspect of logging every incident is to agree and allocate an appropriate prioritization code - as this will determine how the incident is handled both by support tools and support staff.

Prioritization can normally be determined by taking into account both the urgency of the incident (how quickly the business needs a resolution) and the level of impact it is causing. An indication of impact is often (but not always) the number of users being affected. In some cases, and very importantly, the loss of service to a single user can have a major business impact - it all depends upon who is trying to do what - so numbers alone is not enough to evaluate overall priority! Other factors that can also contribute to impact levels are:

An effective way of calculating these elements and deriving an overall priority level for each incident is given in Table 4.1:

Priority codeDescriptionTarget resolution time
1Critical1 hour
2High8 hours
3Medium24 hours
4Low48 hours
Table 4.1 Simple priority coding system

In all cases, clear guidance - with practical examples - should be provided for all support staff to enable them to determine the correct urgency and impact levels, so the correct priority is allocated. Such guidance should be produced during service level negotiations.

However, it must be noted that there will be occasions when, because of particular business expediency or whatever, normal priority levels have to be overridden. When a user is adamant that an incident's priority level should exceed normal guidelines, the Service Desk should comply with such a request - and if it subsequently turns out to be incorrect this can be resolved as an off-line management level issue, rather than a dispute occurring when the user is on the telephone.

Some organizations may also recognize VIPs (high-ranking executives, officers, diplomats, politicians, etc.) whose incidents would be handled on a higher priority than normal - but in such cases this is best catered for and documented within the guidance provided to the Service Desk staff on how to apply the priority levels, so they are all aware of the agreed rules for VIPs, and who falls into this category.

It should be noted that an incident's priority may be dynamic - if circumstances change, or if an incident is not resolved within SLA target times, then the priority must be altered to reflect the new situation.

Note: some tools may have constraints that make it difficult automatically to calculate performance against SLA targets if a priority is changed during the lifetime of an incident. However, if circumstances do change, the change in priority should be made - and if necessary manual adjustments made to reporting tools. Ideally, tools with such constraints should not be selected. Initial Diagnosis
If the incident has been routed via the Service Desk, the Service Desk Analyst must carry out initial diagnosis, typically while the user is still on the telephone - if the call is raised in this way - to try to discover the full symptoms of the incident and to determine exactly what has gone wrong and how to correct it. It is at this stage that diagnostic scripts and known error information can be most valuable in allowing earlier and accurate diagnosis.

If possible, the Service Desk Analyst will resolve the incident while the user is still on the telephone - and close the incident if the resolution is successful.

If the Service Desk Analyst cannot resolve the incident while the user is still on the telephone, but there is a prospect that the Service Desk may be able to do so within the agreed time limit without assistance from other support groups, the Analyst should inform the user of their intentions, give the user the incident reference number and attempt to find a resolution. Incident Escalation

The Service Desk should keep the user informed of any relevant escalation that takes place and ensure the Incident Record is updated accordingly to keep a full history of actions.

Note regarding Incident allocation
There may be many incidents in a queue with the same priority level - so it will be the job of the Service Desk and/or Incident Management staff initially, in conjunction with managers of the various support groups to which incidents are escalated, to decide the order in which incidents should be picked up and actively worked on. These managers must ensure that incidents are dealt with in true business priority order and that staff are not allowed to 'cherry-pick' the incidents they choose! Investigation and Diagnosis
In the case of incidents where the user is just seeking information, the Service Desk should be able to provide this fairly quickly and resolve the service request - but if a fault is being reported, this is an incident and likely to require some degree of investigation and diagnosis.

Each of the support groups involved with the incident handling will investigate and diagnose what has gone wrong - and all such activities (including details of any actions taken to try to resolve or re-create the incident) should be fully documented in the incident record so that a complete historical record of all activities is maintained at all times.

Note: Valuable time can often be lost if investigation and diagnostic action (or indeed resolution or recovery actions) are performed serially. Where possible, such activities should be performed in parallel to reduce overall timescales - and support tools should be designed and/or selected to allow this. However, care should be taken to coordinate activities, particularly resolution or recovery activities, otherwise the actions of different groups may conflict or further complicate a resolution!

This investigation is likely to include such actions as: Resolution and Recovery
When a potential resolution has been identified, this should be applied and tested. The specific actions to be undertaken and the people who will be involved in taking the recovery actions may vary, depending upon the nature of the fault - but could involve:

Even when a resolution has been found, sufficient testing must be performed to ensure that recovery action is complete and that the service has been fully restored to the user(s).

Regardless of the actions taken, or who does them, the Incident Record must be updated accordingly with all relevant information and details so that a full history is maintained.

The resolving group should pass the incident back to the Service Desk for closure action. Incident Closure
The Service Desk should check that the incident is fully resolved and that the users are satisfied and willing to agree the incident can be closed. The Service Desk should also check the following:

Rules for re-opening incidents
Despite all adequate care, there will be occasions when incidents recur even though they have been formally closed. Because of such cases, it is wise to have predefined rules about if and when an incident can be reopened. It might make sense, for example, to agree that if the incident recurs within one working day then it can be re-opened - but that beyond this point a new incident must be raised, but linked to the previous incident(s). The exact time threshold/rules may vary between individual organizations - but clear rules should be agreed and documented and guidance given to all Service Desk staff so that uniformity is applied.

Supporting Material
  1. Short Cut Guide to Availability, Continuity, and Disaster Recovery
    • ch 1 - business drivers behind recovery management
    • ch 2 - challenges posed by the increasing complexity of IT environments
    • ch 3 - common issues in recovery management

4.2.6 Triggers, Input and Output / Inter-process Interfaces
Incidents can be triggered in many ways. The most common route is when a user rings the Service Desk or completes a web-based incident-logging screen, but increasingly incidents are raised automatically via Event Management tools. Technical staff may notice potential failures and raise an incident, or ask the Service Desk to do so, so that the fault can be addressed. Some incidents may also arise at the initiation of suppliers - who may send some form of notification of a potential or actual difficulty that needs attention.

The interfaces with Incident Management include:

4.2.7 Information Management
Most information used in Incident Management comes from the following sources:

Incident Management also requires access to the CMS. This will help it to identify the CIs affected by the incident and also to estimate the impact of the incident.

The Known Error Database provides valuable information about possible resolutions and workarounds. This is discussed in detail in paragraph

4.2.8 Metrics
The metrics that should be monitored and reported upon to judge the efficiency and effectiveness of the Incident Management process, and its operation, will include:

Reports should be produced under the authority of the Incident Manager, who should draw up a schedule and distribution list, in collaboration with the Service Desk and support groups handling incidents. Distribution lists should at least include IT Services Management and specialist support groups. Consider also making the data available to users and customers, for example via SLA reports.

4.2.9 Challenges, Critical Success Factors And Risks Challenges
The following challenges will exist for successful Incident Management: Critical Success Factors
The following factors will be critical for successful Incident Management: Risks
The risks to successful Incident Management are actually similar to some of the challenges and the reverse of some of the Critical Success Factors mentioned above. They include:

[To top of Page]

Visit my web site