Incident Management
Incident Management is a process within the Service Operation module of the ITIL Service Lifecycle.

Most IT department specialist groups contribute to handling Incidents at one time or another. The "First Point of Contact" for this effort is always the Service Desk. Moreover, it should remain responsible for monitoring the resolution effort for all registered Incidents. In doing so, it accepts initial OWNERSHIP of the Incident. To act efficiently and effectively requires the use of formal methods, supported by software tools which ...
It can include …
Relationship to Other Processes
There is a close interface amongst Incident, Problem and Change Management processes. Incident Tickets may reference Problems and Known errors as being inter-related.

View Fault Tracking and Incident Management
Problem Management differs from Incident Management in that its main goal is the detection of the underlying causes of an Incident and their subsequent resolution and prevention. In many situations this goal can be in direct conflict with the goals of Incident Management, where the aim is to restore the service to the Customer as quickly as possible, often through a temporary Work-around rather than a more permanent resolution that might prevent future Incidents from occurring. The speed of service restoration is the paramount consideration in managing Incidents, whereas an investigation of the underlying Problem can require additional time, which delays the restoration of service (causing downtime but preventing recurrence).
Many Incidents recur and the appropriate resolution actions are well known. This is not always the case, however, and a procedure for matching Incidents against Problems and Known Errors is useful. Successful matching provides proven resolution methods, thus avoiding the need for further investigation effort. For those Incidents that do not match existing Problems or Known Errors, Problem Management can perform causal analysis, either proactively or reactively.
Resolved Incidents are analyzed and Problems are identified. Problems are analyzed and Known Errors are determined. Known Errors are analyzed and Requests for Change are created to correct the defective object causing the Incidents.
Because of its impact on business operations, Availability Management may lead (or will certainly be involved in) Situation Management. The early restoration of Mission and Business Critical applications will have major effects on the costs associated with service failures.
How the Process Scales
Incident Management is one of the first ITIL processes an organization will implement. Most organizations will already have some incident management capability though it may be little more than informal recording of reported Trouble Tickets.
Most of the discussion on Incident Management implies an organization of a certain threshold. This threshold provides the organization with tangible benefits resulting from the recording of Incident information based upon the premise that the lessons learned from restoring service in one instance can be used to expedite subsequent resolutions. Beyond the normal incident analyst's ability to accumulate experience this implies some organizational capability to record incidents so that incident solutions can be shared amongst all Incident staff.
As the number of incidents in the organization increases there is a commensurate need to scale the Incident department. Like all good workload planning, this necessitates being able to distinguish Incidents that can be disposed of quickly from those that require more attention or different expertise. Mid-size organizations may solve this by:
At some point the number of Incidents being experienced leads to a requirement for more robust tools. Securing an Incident Management system to record Incidents becomes a necessity (a detailed list of Help Desk vendors is provided by Helpdesk.com).
Prioritizing
The priority of an Incident is determined by analysis and review of two dimensions - the importance of the component (known as a Configuration Item (CI)) out of service as a result of the Incident and the breadth or scope of the incident as determined by the number and organizational importance of the resources impacted.
The priority of an Incident is determined by a number of infrastructure, component and temporal characteristics:
Typically Mission Critical applications which, in the event of failure, might jeopardize ...
"Business Critical" applications relate to core business functions, processes or procedures where failure would have an external impact. If the risks of system failure are manageable the impact can be considered low and a lower priority may be assigned.
| Sev | Business Impact | Examples |
|---|---|---|
| 1 | A major service disruption which impacts a critical customer. The service is not immediately recoverable and a large number of users are unable to perform a significant portion of their duties. | |
| 2 | A situation which results in a loss of major functionality to a customer and severely impacts service. A group of users are affected and unable to perform a portion of their jobs. | |
| 3 | A situation of minimal customer impact where a workaround is currently known and available. A small number of users are affected. | |
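The severity definitions in the table can be sketched as a small classifier. This is an illustration only: the function name `classify_severity`, its parameters, and the user-count thresholds (`large=50`, `small=5`) are hypothetical and would have to be agreed with the business.

```python
def classify_severity(critical_customer, users_affected, workaround_available,
                      large=50, small=5):
    """Return 1, 2 or 3 following the severity table (thresholds are assumed)."""
    # Sev 1: critical customer impacted, many users, not immediately recoverable.
    if critical_customer and users_affected >= large and not workaround_available:
        return 1
    # Sev 3: minimal impact, a workaround is known, few users affected.
    if workaround_available and users_affected <= small:
        return 3
    # Sev 2: everything in between (loss of major functionality for a group).
    return 2
```

For instance, under these assumed thresholds an outage for a critical customer affecting 200 users with no workaround would be classified as Severity 1.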
Categorizing Incidents
Incidents should be classified in order to...
Many Incidents are regularly experienced and the appropriate resolution actions are well known. This is not always the case, however, and a procedure for matching Incident classification data against that for Problems and Known Errors is useful. Successful matching provides proven remediation methods, avoiding the need for further investigation effort.
This attempt will involve routing the Trouble Ticket to an appropriate Subject Matter Group for subsequent treatment should any restoration efforts at first point of contact be ineffectual. A current list of Incident categories is available in the Appendix.
Incident Escalation
Escalation is a procedure whereby an individual or group is notified that an Incident exists and is asked for aid. It is typically performed via a telephone call or page-out and is performed at a technical level for on-call support. The escalation process is first invoked by the Service Desk to ensure that Incidents are resolved within SLAs and appropriate resources are notified as necessary. The process is necessary to ensure that unresolved tickets at all levels are properly addressed within the IT organization.
Transferring an Incident from first-line to second-line support groups or further is called Functional Escalation and takes place because of lack of knowledge or expertise or when agreed time intervals elapse (which triggers a re-evaluation of the resources assigned to the Incident). Automatic functional escalation based on established time intervals should be negotiated and agreed to by business operations and should be recorded in Service Level Agreements (SLA) for easy reference.
Hierarchical Escalation can take place at any moment during the resolution process when it is likely that resolution of an Incident will not be in time or satisfactory. Hierarchical resolution action is invoked when it becomes apparent that available resources may be insufficient to resolve the Incident. Management is required to approve additional resources. This should take place long enough before the (SLA) agreed resolution time is exceeded so that corrective actions by authorized line management can be carried out - for example hiring third-party specialists.
As resources are increasingly allocated to resolve the Incident, the resolution "TEAM" becomes bigger and more formalized, with increasing role differentiation. The Incident Coordinator is assigned administrative and communication roles to free the Incident Analyst to focus on finding a solution. As the Incident escalates to a Severity 1 and more expertise is needed, there is an increasing need to coordinate multiple activities (i.e., project management). A Severity 1 Situation Manager is created to perform this role. Depending on organizational scale, these roles may be assumed by the same person.
Communications
Incidents require communications in order to keep those most affected by a failure informed of the progress being made towards resolution. It is important to acknowledge and demonstrate the responsiveness of the IT service provider to the consumers of the service. If delays are to be expected it is important to provide an estimate of the resolution time so that staff can plan their time accordingly.
In a recent survey of service desk practices, Forrester identified the lack of regular updates on Incident Tickets as a source of poor customer satisfaction...
| "While technology users tend to be pleased with the demeanor of the help desk staff, they are least likely to be satisfied with the timeliness of updates regarding the status of their issues. Seventy-six percent of users who are satisfied with their help desk, compared with just 10% of users who are dissatisfied with their help desk, are pleased with the timeliness of updates received. When we asked tech users what the IT organization could do to improve, 41% overall suggested more timely information on the status of issues. But two-thirds of users who are displeased with their help desk, versus 36% of those who are satisfied with the help desk, are likely to suggest timely updates as a necessary area of improvement. The service desk is the eyes, ears, and face of the IT organization to the vast majority of business users. When you think you’re communicating enough, take two steps further. Communicate information and service expectations with each incident. Summarize and report - both up the management chain and out to the users - on a monthly and quarterly basis at a minimum. It is far easier to cut back on communications than it is to repair the damage an aloof and out-of-touch service desk can cause for all of IT." Chip Gliedman, Thirty-One Best Practices For The Service Desk, June 28, 2005, p. 6, The MUNS Report, Vol 5, issue 23, Nov 8, 2006 |
Incidents may also impact Service Level Agreements and may require immediate attention with specific protocols for keeping staff informed. With the increasing dependency of many job functions on computing resources it is increasingly difficult to plan around the loss of computing resources for extended periods. Moreover, service delivery to the public may be directly affected. Both require the invocation of contingency actions which may require some lead time in order to restore the service to its original functionality. Keeping business units and personnel informed can have important financial implications.
It is best practice for an organization to establish default communications procedures for each of its Severity determinations. A typical communications protocol might look like the following:
| Severity | Tier 1 | Tier 2 | Tier 3 | Management |
|---|---|---|---|---|
| Who Does What | Service Desk records incident | Transfer to Incident Support | Transfer to Service Partners | Incident Support informs senior management and other necessary Fulfillment Providers |
| Sev 1 | At Time of call | Immediately | at 1 hour | at 2 hours and every 2 hours thereafter |
| Sev 2 | At Time of call | at 30 minutes | at 2 hours | at 4 hours and every 4 hours thereafter |
| Sev 3 | At Time of call | as set out in SLA | as set out in SLA | |
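The protocol table can be encoded as data so that a tool can work out which notifications should already have been sent. The sketch below transcribes the Sev 1 and Sev 2 rows; the names `PROTOCOL` and `notifications_due` are hypothetical, and Sev 3 is omitted because its timings are "as set out in SLA".

```python
# (first notification in minutes after the call, repeat interval or None)
PROTOCOL = {
    1: {"Tier 1": (0, None), "Tier 2": (0, None),
        "Tier 3": (60, None), "Management": (120, 120)},
    2: {"Tier 1": (0, None), "Tier 2": (30, None),
        "Tier 3": (120, None), "Management": (240, 240)},
}

def notifications_due(severity, elapsed_minutes):
    """List (minute, audience) notifications that should have occurred so far."""
    due = []
    for audience, (first, repeat) in PROTOCOL[severity].items():
        minute = first
        while minute <= elapsed_minutes:
            due.append((minute, audience))
            if repeat is None:      # one-off notification
                break
            minute += repeat        # "and every N hours thereafter"
    return sorted(due)
```

Comparing this list with the COMMUNICATIONS entries actually logged on a Ticket is one way to spot missed updates.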
As a general rule Incident Initiators (i.e., clients reporting an Incident) should be informed at the following times...
Incident Status
Throughout the Incident process the status of the Incident indicates its progress through the system. The table below defines examples of codes which may be used to distinguish an incident as it migrates through the Incident Management process.
| These represent the most basic codes. Every Trouble Ticket should have the following three log entries to indicate STATUS changes. | |
| NEW | An incident has been received, logged and is active within the incident tracking database. This status is usually automatically created when a Trouble Ticket is created and is accompanied by basic information identifying the caller and an initial description of the incident. |
| RESOLVED | The requested action has been fulfilled. This could be:
Care should be taken at this point to ensure that the solution used is recorded on the Incident database and noted in any Knowledge Bases used by the organization. |
| CLOSED | The Ticket is now considered inactive and there is no further consideration of it beyond analytical purposes by Availability, Capacity or Problem Management. Unsolved Trouble Tickets may be recorded as Problem Tickets by Problem Management. |
| Additional codes are used to indicate routing, diagnostic and resolution activities. Describing the Trouble Ticket according to elements in the Incident Lifecycle will provide Availability and Problem Management with valuable information to improve overall availability and to resolve recurring Problems. | |
| FAILURE | The best guess of the time when the device actually failed. The Incident Analyst should make a best guess from the person calling in or from automated alerts. |
| DETECTION | The best guess when the failure was first detected by a user. This is not always the same as the time the Ticket was created. |
| ASSIGNED | An incident has been assigned to a specialized individual or group for further investigation and resolution. |
| WIP | Work In Progress. The incident is being actively considered by a person or team charged with restoring service. |
| PENDING | The Ticket is in a HOLDING position pending the availability of equipment or resources to work on it. |
| INVESTIGATION | Notation with time of efforts to collect information about the incident, including the URL of any web sites checked or literature referenced. |
| DIAGNOSIS | Notation with time suggesting resolution of the incident (note: this is about restoration of service and may not be equated with solving the root cause of the incident). |
| RESTORATION STARTED | A notation as to when an attempt at service restoration was started and what the results were. |
| RESTORATION COMPLETED | The restoration of service to the condition preceding an incident is complete - either by the First Point of Contact or following an escalation to another group for treatment. |
| VERIFICATION | A notation as to when the User verified the system was restored. |
| ESCALATION | A description of additional resources allocated to the incident. |
| COMMUNICATIONS | A message notifying a Group has been issued indicating progress towards |
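A tracking tool could check that a Ticket carries the three basic STATUS entries named in the table. The following is a minimal sketch: the `Status` enum lists only a subset of the codes, and the log format (a list of `(Status, timestamp)` tuples) is an assumption made for illustration.

```python
from enum import Enum

class Status(Enum):
    NEW = "NEW"
    ASSIGNED = "ASSIGNED"
    WIP = "WIP"
    PENDING = "PENDING"
    RESOLVED = "RESOLVED"
    CLOSED = "CLOSED"

# The three basic log entries every Trouble Ticket should have.
REQUIRED = (Status.NEW, Status.RESOLVED, Status.CLOSED)

def missing_basic_entries(log):
    """Return the basic STATUS codes absent from a ticket's log.

    `log` is a hypothetical list of (Status, timestamp) tuples.
    """
    seen = {status for status, _ in log}
    return [s for s in REQUIRED if s not in seen]
```

A Ticket whose log contains only NEW and WIP entries would be reported as missing RESOLVED and CLOSED.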
Assuring Quality of Incident Data
Completing a Trouble Ticket to the rigour suggested by the complete set of the above codes obviously involves significant effort and can add to the restoration time frame. To reduce the impact of recording on the restoration effort many of the log entries can be entered after the restoration is completed.
The organization must decide how much effort it wishes to put into the Incident Database. The majority of the benefits of the above effort do not accrue to Incident Management. Instead, the primary beneficiaries of a quality Incident database are:
Any proactive analysis of incident information assumes the overall reliability of the data itself. Unfortunately, all too often, once an Incident is CLOSED an analyst is encouraged to proceed to a subsequent Ticket rather than devoting sufficient attention to ensuring the accurate recording of the Incident.
If the organization wishes to undertake any proactive measures to remove faults from its infrastructure it must consider the quality of the primary data source early and establish methods to ensure its integrity. At a minimum this requires the accurate reporting of the solution - whether a resolution or a workaround. With a robust Availability Management process the organization may extend quality assurance to encompass the accurate logging of Incident events (perhaps using the extended STATUS codes cited above).
Organizations may wish to place this quality assurance role with the Service Desk (if charged with overall and/or continuing responsibility for the Incident), an Incident Coordinator (who assumes the Ownership from the Service Desk for the Ticket) or within an explicitly defined Quality Assurance section.
By comparison, a Key Performance Indicator is a measure of "how well" the process is performing. Key Performance Indicators will often measure a Critical Success Factor and, when monitored and acted upon, will identify opportunities for the improvement of the process. These improvements should positively influence the outcome and, as such, Key Performance Indicators have a cause-effect relationship with the Key Goal Indicators of the process.
Goal Indicators (targets)
When does the service commitment begin and end:
Process model elements: Controls, Inputs, Activities, Outputs, Mechanisms.
Alerts may be configured to automatically create a Trouble Ticket or the Support Agent may be responsible for the Ticket's creation.
Known Errors are tracked and monitored by Problem Management processes through Error Control procedures.
The more complete and accessible the Knowledge Base the more likely that an incident can be resolved quickly, at first point of contact by the Service Desk. In essence, a knowledge base places the expertise of Subject Matter Experts at the disposal of generalists. A Knowledge Base may contain:
Many vendors offer FAQs specific to their product offerings, but the most extensive and easy to use Knowledge Base probably belongs to Microsoft.
There is frequently a period of time when incidents are being reported to the Service Desk as independent events. The ability to correlate these occurrences through pattern matching practices, routines and procedures will aid in the speedy assessment of the incident's total impact and the breadth of the described symptoms and actions undertaken.
Best practices define a 'grace' period between the restoration of the service and the CLOSING of the Trouble Ticket. It is in this period that the incident is re-assessed to ensure that any symptoms do not re-occur. In addition, this period should permit the recording of Incident detail not entered during the 'heat' of the restoration effort. The explicit review of an Incident Ticket to ensure that solutions and log entries contribute to the overall enhancement of the data system would extend its usefulness as a valuable Availability and Problem Management tool.
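The grace period can be expressed as a simple closing rule. In this sketch the three-day value of `GRACE_PERIOD` is purely an assumption (the text does not specify a duration), and `may_close` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=3)  # assumed value; not specified in the text

def may_close(restored_at, now, symptoms_recurred, grace=GRACE_PERIOD):
    """A Ticket may be CLOSED once the grace period has passed with no recurrence."""
    return (not symptoms_recurred) and (now - restored_at >= grace)
```

A Ticket restored on 1 January could therefore be closed on 4 January at the earliest, provided no symptoms have re-occurred in the meantime.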
Process Activities
IM1 - Incident Detection and Preliminary Treatment
Define the fault in service continuity in a way that expedites incident containment, shortens time to restore, and minimizes service disruption.
| All Incidents should be recorded, preferably through the automatic generation of an Incident Ticket to be completed as the incident is addressed and service is restored. Symptoms, basic diagnostic data, and information about the related Configuration Item should be included in Incident records during detection and recording. An alert to the Service Manager is required in the case of serious degradation of service levels, in case it is necessary to take special action. Incidents should be handled using standard approaches and procedures and within timeframes negotiated with Customers of the Incident Management service. These timeframes represent service commitments within which IM will operate. |
A web portal contains information on commonly occurring problems and incidents which is known and readily accessible to clients. This facility permits a wide variety of frequently occurring problems to be fixed using solutions which can be invoked by the client directly, thereby reducing the number of calls to the Service Desk. If the customer is unable to find answers through online self-service, the issue should be escalated to assisted service channels like email, live help, or phone, in a context-sensitive manner. This escalation from self-service to assisted service should appear seamless to customers, i.e., they should not have to start over.
If the customer locates and implements the solution satisfactorily the Incident is considered completed. It is important to capture statistics on the use of the self help facility in order to supplement Incident statistics to accurately depict service volumes. The use of the information source might be accompanied by a question "Did you find your answer?" and the responses collected to provide data on the usefulness of the Knowledge Base.
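As a rough sketch of how the "Did you find your answer?" responses might be combined with Service Desk figures, consider the following. The function name and inputs are hypothetical, and it assumes that customers who did not find an answer go on to contact the Service Desk and are therefore already counted in `service_desk_calls`.

```python
def self_help_summary(responses, service_desk_calls):
    """Summarise answers to "Did you find your answer?".

    `responses` is a list of booleans (True = answer found).
    """
    found = sum(responses)
    # Self-resolved incidents supplement the incidents reported to the desk.
    total_volume = found + service_desk_calls
    usefulness = found / len(responses) if responses else 0.0
    return {"self_resolved": found,
            "total_volume": total_volume,
            "usefulness": usefulness}
```

With three positive answers out of four and six Service Desk calls, the total service volume is nine and the Knowledge Base usefulness rate is 0.75.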
If a solution is not found in the Knowledge Base the customer must contact the Service Desk.
The Service Desk or an Online Incident/Service Request Management System represents the First Point of Contact for the recording of incidents. Alternate contact methods might be by web form, e-mail, phone call or voice mail.
Service Desk response rates are typically negotiated with customers and documented in SLAs.
This assessment involves "initializing" the Incident Ticket with sufficient information to either resolve it quickly or route it properly to the subject matter group best able to restore service expeditiously. The Ticket is either filled in by the user accessing a web-based self-help form from the Incident Management System or is filled in by a Service Desk agent.
The following fields are typically filled in or automatically generated on the Incident Ticket during this stage:
An Incident Ticket is created with as much initial information as can be quickly collected.
If the Incident has been created using an online form then the eligibility of the person to report an incident has been established by their ability to access the form. If, however, the incident is reported through a call to the Service Desk, then the caller must be authenticated by matching them against a list of eligible users maintained in the Incident Management database. Any user calling in who cannot be located within the database will be assumed to be legitimate for the purpose of the initial inquiry. If the incident can be resolved immediately then no further action is required to authenticate the caller. If, however, the incident needs to be routed to an SME then the Service Desk Agent should undertake a manual check by contacting appropriate business sources and then add the caller to the Online Services database. Further, the Service Desk Agent should ensure that any known additional databases are notified of the person's existence.
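The caller-authentication rules above reduce to a short decision procedure. This sketch uses a plain set to stand in for the list of eligible users, and the returned action labels are hypothetical names chosen for illustration.

```python
def handle_caller(caller_id, eligible_users, resolved_immediately):
    """Return the follow-up actions for a phone caller, per the rules above."""
    if caller_id in eligible_users:
        return ["proceed"]
    # Unknown callers are assumed legitimate for the initial inquiry.
    if resolved_immediately:
        return ["proceed"]
    # Routing to an SME requires a manual check and registration first.
    return ["verify_with_business", "add_to_database",
            "notify_other_databases", "proceed"]
```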
The User is encouraged to describe the symptoms associated with the incident and can be provided diagnostic scripts and references to existing events occurring within the infrastructure (including Known Errors, Problems and recent changes). A Service Desk agent has the ability to expand the diagnostic tools available. The assessment could involve some rudimentary and standard interventions (e.g., "pinging" a server to determine its responsiveness), or reference to pre-existing conditions, incidents, changes, or problems in the infrastructure with which this incident might be "matched". The Service Desk agent should attempt to record what information sources and techniques proved particularly useful in any initial diagnosis.
Reported Incidents are checked against an electronic Bulletin Board where high severity incidents are recorded for easy reference by Users and Support agents. If a match is found then the Ticket is associated with a Master Ticket on which restoration attempts are being logged.
The origin of these Tickets might also include automated alerts from management agents attached to servers, applications, etc., or a re-direction from another Service Partner who originally received the Ticket erroneously or who, after investigation, suggests that consideration by another group may prove beneficial. If the Incident is not part of a larger event it may still be related to past incidents, Problem Tickets or existing Known Errors. If discovered, these other Tickets are associated with the new Incident Ticket. Some Incidents are widely felt and, in these situations, there may be a number of Clients experiencing and reporting symptoms. It is important to capture information on the extent of the incident and all resolution attempts, but there is seldom any benefit in logging and maintaining Tickets on all reported instances of the same root cause. Sometimes, the resolution of "related" items (i.e., incidents, changes, and problems) may indicate a workaround or solution which can result in immediate service restoration. Where a solution is readily identifiable, service restoration can be undertaken immediately.
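Associating a newly reported Incident with a Master Ticket on the Bulletin Board might be sketched as follows. Matching on the Configuration Item alone is a simplifying assumption, and the dictionary keys `ci` and `id` are hypothetical; a real tool would likely also compare symptoms and time of occurrence.

```python
def link_to_master(incident, bulletin_board):
    """Return the id of the Master Ticket for the same CI, or None if no match."""
    for master in bulletin_board:
        if master["ci"] == incident["ci"]:
            return master["id"]
    return None
```

When a match is found, the new report can be attached to the Master Ticket rather than being maintained as a separate Ticket for the same root cause.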
Some reported incidents could be beyond the scope of Service Desk to consider.
Examples might be:
The Service Desk agent hands the request off to the appropriate agent who deals with OUT of SCOPE requests. This person deals diplomatically with the client with reference to Service Catalogues and SLAs and may refer them to the appropriate business manager for further discussions.
Incidents reported to the Service Desk agent may be resolved immediately. The agent attempts to restore service to the User by either talking them through a series of remedial steps or using remote software. The ability to solve problems quickly is determined by:
The Service Desk agent (as FPOC) will assign a CTI to the Incident which establishes to which group the incident will be referred. This routing will occur when:
IM2 - Incident Investigation and Diagnosis
Record the incident's symptoms as accurately and quickly as possible in order to solve or route the incident to the source best able to resolve it expeditiously.
Once the Incident has been assigned to a support group, the group should accept assignment of the Incident and specify the date and time (preferably automatically), ensuring:
The Service Desk should advise the Customer of any identified Work-around, if it is possible to provide one immediately. The support group should review the Incident against Known Errors, Problems, Solutions, Planned Changes or knowledge bases and, if necessary, ask the Service Desk to re-evaluate the assigned Severity and Priority, adjusting them as required, based on agreed service levels. Wherever possible, affected Users should be provided with the means to continue business, though perhaps with degraded service. For example, faulty printers might necessitate printing at another, more distant location. The effect of such a Work-around is to minimize the impact of the Incident on the business and to provide more time to investigate and devise a structural resolution. Temporary Work-arounds may have to proliferate while a more permanent solution is worked upon.
The Service Desk continually monitors the queue for the status of Incidents being worked on in relation to service commitment resolution times. If, after pre-determined times, the STATUS continues to be NEW, pager and/or e-mail alerts are sent to designated staff and the Ticket is escalated to ensure it gets accepted by a group of SMEs (Subject Matter Experts).
The escalation process continues until the Ticket is formally accepted and its STATUS becomes ASSIGNED.
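The monitoring rule for unaccepted Tickets can be sketched as a threshold check. The four threshold values below are placeholders, not figures from the text, and `escalation_level` is a hypothetical helper name.

```python
# Minutes a Ticket may remain NEW before each successive escalation
# grouping is alerted. Placeholder values for illustration only.
ESCALATION_THRESHOLDS = [15, 30, 60, 120]

def escalation_level(status, minutes_since_created,
                     thresholds=ESCALATION_THRESHOLDS):
    """Return how many escalation groupings should have been alerted so far."""
    if status != "NEW":  # escalation stops once the Ticket is accepted
        return 0
    return sum(1 for t in thresholds if minutes_since_created >= t)
```

A scheduler could call this periodically for every open Ticket and send pager or e-mail alerts whenever the returned level increases.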
Escalation proceeds over four main groupings:
Typical Automated Paging:
SME accepts Incident Ticket and STATUS is set to WIP (Work in Progress) or PENDING (on Hold). The time when the STATUS changed is recorded in the Incident record.
The time between STATUS=ASSIGNED and STATUS=WIP is a measure of process latency and attempts should be directed at reducing overall process latency. The commitment to accept the Ticket within established timeframes should be negotiated in an Operating Level Agreement (OLA).
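Process latency as defined above can be computed directly from the STATUS change times recorded on the Ticket. This sketch assumes a hypothetical log format: a chronological list of `(status, datetime)` tuples.

```python
from datetime import datetime

def acceptance_latency_minutes(status_log):
    """Minutes between the first ASSIGNED entry and the first WIP entry.

    Returns None if either entry is missing from the log.
    """
    first = {}
    for status, when in status_log:
        first.setdefault(status, when)  # keep only the earliest occurrence
    if "ASSIGNED" not in first or "WIP" not in first:
        return None
    return (first["WIP"] - first["ASSIGNED"]).total_seconds() / 60
```

Averaging this figure per Work Group gives a value that can be compared with the acceptance commitment negotiated in the OLA.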
Incident Agent (IA) verifies or assigns Incident Severity, CTI and priority and assesses whether the assignment of the Ticket to their Work Group is correct. If the Ticket needs to be re-assigned it is returned to the Service Desk for re-consideration (note: if the IA knows the SME Group to whom the Ticket should be directed they may re-assign it directly). It is important to note any reasons for the mis-direction which can be used to improve future routing. These "lessons" should be directed to the Service Desk (or application group charged with this responsibility) to make any adjustments to CTI tables or to modify any assistance associated with User self-assignment using the CTI tables.
Evaluate the incident and research solutions from Knowledge Bases, Known Errors, and Problem Tickets. Scenarios are investigated and recommendations made and tested. The time frame in which this is done is governed by a service commitment stated in the Service Catalog for Incident Management and negotiated in SLA/OLAs with Customer groups.
Until a workable solution is determined, the SME evaluates the likelihood that restoration can be accomplished within the service commitment for this CTI and Severity assignment. If it appears that the incident cannot be resolved within the service commitment then a decision to add or change the resources devoted to resolution needs to be advanced. If the existing resources are deemed sufficient to solve the incident within established commitments the SME continues to investigate.
Once a solution or workaround has been agreed upon the Incident Agent determines the most appropriate method to implement it. The timing of implementing the solution might need to be considered in view of the client's and the solution's current availability. The selected strategy is agreed to by the User and logged into the Incident Ticket and any restoration plans are attached to the Ticket.
Once the restoration method has been decided upon and agreed to by the Customer, control will transfer to another process based upon the conditions associated with the solution:
All actions taken to date (e.g., Service Request fulfillment and/or a Change) are considered in the context of whether they have adequately restored service to the User. If not, then service restoration using the most promising approach is undertaken. If neither additional resources nor a Change is required then the Incident Agent undertakes to implement the selected Solution (which could be a workaround). The Incident Agent may need to negotiate a time for the restoration effort with the User, which might involve a site visit or, alternatively, might be accomplished using remote software or by walking the User through the solution over the phone. The Incident Ticket STATUS is set to RESOLVED and an e-mail is sent to the Customer informing them that service has been restored.
Control is transferred to the Closure process (IM5).
IM3 - Incident Escalation
Add resources to the incident to facilitate the restoration effort and keep business customers affected informed of the effort in order to permit them to minimize the effects of outages.
Escalation involves increasing the number and, potentially, the specialization of the resources devoted to resolution. Because of this increase there are issues of activity coordination and keeping Clients informed of the progress of resolution. The SME assumes control of the Incident and ensures that administrative tasks are completed.
The Incident Analyst reviews the restoration approach with any external service providers who need to be involved in the effort and with any internal management who might need to approve the approach and the resources needed to ensure the effectiveness of the restoration.
IM3.2 - Escalation Communications
Keep people aware of escalation actions so as to ensure the availability of those needed and to help those affected plan around the outage.
When escalation procedures are invoked the SME ensures a page is sent to the appropriate groups at set times according to Escalation Communication Protocols.
Severity 1 and 2 incidents are posted on an electronic Bulletin Board to keep management, Work Groups and Service Desk informed of high severity incidents and the progress occurring towards their restoration. The Incident Agent ensures that the Incident Ticket reflects communications undertaken.
During the escalation there might be a need to update clients on the progress towards restoration. This permits them to adjust their work schedules accordingly.
Management is approached to confirm the resolution process or to suggest alternative approaches, and to approve how the additional resources granted will be assigned to the investigation.
The Incident Agent coordinates activities towards finding a viable solution to restore service and to minimize the effects of the incident on the infrastructure and on business operations. All suggested solutions and abandoned approaches should be recorded on the Incident Ticket, either during the investigation or immediately following it, in order to capture the investigation process for later review and incident analysis.
When a restoration approach is agreed upon it is implemented. Care should be exercised in implementing solutions which might reverberate to other parts of the infrastructure (i.e., should be considered a Change). In these situations, an appropriate Change Management authority (corporate or local) should be consulted for advice on implementing the solution.
The restoration effort continues against a backdrop of service commitments for the Severity and Type of Incident. The Severity Escalation Protocols determine at what point during the restoration efforts automatic check-points for escalation (i.e., adding more or different resources to the effort) are considered.
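The check-point idea can be illustrated with a minimal sketch. The interval values and the function names below are hypothetical; real values would come from the organization's own Severity Escalation Protocols, which this document does not specify.

```python
from datetime import timedelta

# Hypothetical check-point intervals per Severity (assumption, not from
# this document); each entry is elapsed time since the Incident was opened.
ESCALATION_CHECKPOINTS = {
    1: [timedelta(minutes=30), timedelta(hours=1), timedelta(hours=2)],
    2: [timedelta(hours=1), timedelta(hours=4)],
    3: [timedelta(hours=8)],
    4: [timedelta(days=2)],
}


def checkpoints_reached(severity, elapsed):
    """Count the escalation check-points already passed for this Severity."""
    return sum(1 for point in ESCALATION_CHECKPOINTS[severity] if elapsed >= point)


def escalation_due(severity, elapsed, escalations_considered):
    """True when a check-point has passed that has not yet been considered."""
    return checkpoints_reached(severity, elapsed) > escalations_considered
```

For example, under these assumed intervals a Severity 1 Incident that has been open for 45 minutes has passed one check-point, so escalation should be considered if it has not been already.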
IM4 - Situation Management
Ensure heightened availability and awareness for Incidents deemed to be of Major significance.
| A Situation is a major service disruption which impacts a survival or business critical service to a major business customer. The service is unrecoverable and a large range of clients are affected and are unable to perform a significant portion of their jobs. The Senior Leadership Team in the organization needs to be readily available to advise or approve any recommended plans of action. |
Key line-of-business technical representatives - with responsibility for any Configuration Items affected by the incident - should be included in a pager list for all Severity 1 invocations. Each representative self-identifies with any pages which might affect them and require attention. For Severity 1 Incidents the business leads for all affected business units are notified according to a designated list for each business.
There may be a need to "triage" all knowledge sources to discuss the situation. This meeting, involving management, makes recommendations as to which Units and personnel will handle the incident and clears the way for any resource expenditures which may be required during the effort.
Situation handling requires immediate attention and all efforts in this process are devoted to getting the most appropriate resources devoted to the restoration task as quickly as possible. For Situations, the Service Desk monitors reported incidents for frequency and severity (CIs affected). It also monitors the STATUS, and if that status is not WIP (Work in Progress) within an interval established as a Service Commitment for the CTI affected, then action is triggered to ensure the Ticket is accepted. The sequence of attempting contact is:
A plan of action is developed or revised by the Situation Manager for identification of the incident's cause. Immediate attention is directed towards developing solutions for the restoration of the service. Hence, research centers on identifying known workarounds and assessment of risks in employing them. Depending on the action plan there may be a need to test a solution and consideration should be given to securing necessary equipment and resources if this is anticipated.
If the incident is resulting in widespread service outages wherein a number of callers may be reporting similar incidents, the creation of an IVR message alerting the callers that the situation is being attended to will inform customers while avoiding unnecessary attention by key resources. The recording or revision of an IVR message should be governed by an assessment of the benefits of providing the message or update.
A proposed message is composed using IVR message templates, keeping the message brief. The message is sent to the Service Desk for placement on the IVR. After restoration of the service the IVR message is removed. Key business leads are notified by e-mail and, optionally, by phone that the situation is being worked on by the identified Work Group Lead, and are provided with any estimates of the Time to Restore service.
There may be a need to "triage" all knowledge sources to discuss the situation. A meeting of all available experts is convened for this purpose. In the case of an application failure the Major Incident Manager should consult the appropriate Application Support Matrix to identify who the Subject Matter Experts are. On the basis of the discussion the Work Group Lead may re-visit IM4.4 - The Action Plan. When no further triages are necessary, further action will depend on whether a solution has been identified as a result of the meeting. If no solution has been identified the Situation Manager may re-visit IM4.4 - The Action Plan.
If the solution requires a modification to the infrastructure then an Urgent Change process may need to be invoked to consider the risks associated with implementing the Change.
The Service Desk representative should attend these meetings and ensure a record of actions/decisions is maintained, ideally as part of the overall Incident record.
Once a solution is determined the Situation Manager coordinates the implementation of the recommended solution. If the change is Urgent, the Change Management process may already have been invoked and service restored through it.
IM5 - Closure
Complete the Incident in a manner which promotes the identification and quick restoration of future service outages.
| Review the Incident Ticket for completeness and accuracy (e.g., it might be decided that this Incident was not similar to another Ticket it was associated with for resolution - i.e. incorrectly associated) and allow adequate time for problems with the restoration effort to surface. Ensure all knowledge bases are updated to facilitate the identification of future incidents which might be attributable to an identical or similar cause. If the solution was a workaround, the resolved Incident might be referred to Problem Management for root cause analysis and subsequent treatment to eliminate future occurrences of this type of incident or to improve the workaround. |
Ensure all knowledge bases are updated to facilitate the identification of future incidents which might be attributable to an identical or similar cause.
The Service Desk should ensure that the primary person working on the Incident records the solution to the Incident. This could be a workaround or the identified root cause of the incident.
Any IVR messages pertaining to a RESOLVED incident should be removed or updated to reflect the service restoration.
After a period of 10 days in which the Customer has not experienced a recurrence of the symptoms, the Incident is changed from RESOLVED to CLOSED. If the Customer requests that the Incident be re-opened, it is re-assigned to a Workgroup and processing recommences at process IM2 - Investigation and Diagnosis.
Other actions which could occur at this point include closing the Ticket because it was found to be out of Scope in IM1, or referring the Ticket to Problem Management for further treatment. This will often be the case when a Workaround is implemented and further investigation is required to determine the root cause of the Incident in order to implement a Change to the infrastructure so that the incident will not re-occur.
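The closure rule can be sketched as a small status check. This is only an illustration: the function name is hypothetical, the 10 days are counted as calendar days, and the status given to a re-opened Ticket (WIP here) is an assumption, since the text states only that processing returns to IM2.

```python
from datetime import date, timedelta

CLOSURE_WAIT = timedelta(days=10)  # waiting period stated in the text


def review_for_closure(status, resolved_on, today, reopen_requested=False):
    """Return (new_status, next_process) for a Ticket in the Closure step.

    Assumptions: calendar days are used, and a re-opened Ticket is set
    back to WIP before it re-enters IM2.
    """
    if reopen_requested:
        return ("WIP", "IM2")
    if status == "RESOLVED" and today - resolved_on >= CLOSURE_WAIT:
        return ("CLOSED", None)
    return (status, None)
```

A Ticket resolved on 1 January would therefore stay RESOLVED through 10 January and become CLOSED when reviewed on 11 January, unless the Customer asks for it to be re-opened in the meantime.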
| Terminology | Maturity Characteristics | Example Default Severities |
| Term | Definition |
| Availability | Ability of a component or service to perform its required function at a stated instant or over a stated period of time. It is usually expressed as the availability ratio, i.e. the proportion of time that the service is actually available for use by the Customers within the agreed service hours. |
| Category, Type and Item (CTI) | Method for Classification of a group of Change documents according to a three-fold hierarchical coding structure used by many organizations. |
| Client | People and/or groups who are the targets of service. To be distinguished from User - the consumer of the target (the degree to which User and Client are the same represents a measure of correct targeting) and the Customer - who pays for the service. |
| Causal Factors | Those contributors (human errors and component failures) that, if eliminated, would have either prevented the occurrence of an incident or reduced its severity. |
| Classification | Process of formally identifying Incidents, Problems and Known errors by origin, symptoms and cause. |
| Change Management | Process of controlling Changes to the infrastructure or any aspect of services, in a controlled manner, enabling approved Changes with minimum disruption. |
| Configuration Item (CI) | Component of an infrastructure - or an item, such as a Request for Change, associated with an infrastructure - that is (or is to be) under the control of Configuration Management. CIs may vary widely in complexity, size and type, from an entire system (including all hardware, software and documentation) to a single module or a minor hardware component. |
| Configuration Management Database (CMDB) | A database that contains all relevant details of each CI and details of the important relationships between CIs. |
| Core Business Process | A process that relies on the unique knowledge and skills of the owner and that contributes to the owner’s competitive advantage. |
| Critical Success Factor (CSF) | The most important issues or actions for management to achieve control over and within its IT processes. |
| Customer | Payer of a service; usually the Customer management has responsibility for the cost of the service, either directly through charging or indirectly in terms of demonstrable business need. |
| Environment | A collection of hardware, software, network and procedures that work together to provide a discrete type of computer service. There may be one or more environments on a physical platform e.g. test, production. An environment has unique features and characteristics that dictate how it is administered; environments are administered in similar, yet diverse, manners. |
| Error Control | The processes involved in progressing Known errors until they are eliminated by the successful implementation of a Change under the control of the Change Management process. |
| Impact | Measure of the business criticality of an Incident or Problem. Often equal to the extent to which Incidents lead to distortion of agreed or expected service levels. |
| Incident | Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service. |
| Lifecycle | A series of states connected by allowable transitions. The life cycle represents an approval process for Change documents. |
| Known Error | An Incident or Problem for which the root cause is known and for which a temporary Work-around or a permanent alternative has been identified. If a business case exists, an RFC will be raised, but, in any event, it remains a known error unless it is permanently fixed by a Change. |
| Mean Time to Repair (MTTR) | The elapsed time to restore service, measured from either the time of the failure or the time the failure was reported to the time the service was restored to the User's satisfaction. |
| Priority | Sequence in which an Incident or Problem needs to be resolved, based on impact and urgency. |
| Process | A connected series of actions, activities, Changes etc. performed by agents with the intent of satisfying a purpose or achieving a goal. |
| Process Control | The process of planning and regulating, with the objective of performing a process in an effective and efficient way. |
| Release | A collection of new and/or changed CIs which are tested and introduced into the live environment together. |
| Request for Change (RFC) | Form, or screen, used to record details of a request for a Change to any CI within an infrastructure or to procedures and items associated with the infrastructure. |
| Resolution | Action that will resolve an Incident. This may be a Work-around. |
| Role | A set of responsibilities, activities and authorizations. |
| Service Level Agreement | A written agreement between a service provider and Customer(s) that documents agreed services and the levels at which they are provided at various costs. |
| Service Level Management | Disciplined, proactive methodology and procedures used to ensure that adequate levels of service are delivered to supported IT users in accordance with business priorities and at acceptable costs. |
| Service Request | Every Incident not being a failure in the IT Infrastructure. |
| System | An integrated composite that consists of one or more of the processes, hardware, software, facilities and people, that provides a capability to satisfy a stated need or objective. |
| Urgency | Measure of the business criticality of an Incident or Problem based on the impact and on the business needs of the Customer. |
| Version | An identified instance of a Configuration Item within a product breakdown structure or configuration structure for the purpose of tracking and auditing change history. Also used for software Configuration Items to define a specific identification released in development for drafting, review or modification, test or production. |
| Work-around | Method of avoiding an Incident or Problem, either from a temporary fix or from a technique that means the Customer is not reliant on a particular aspect of a service that is known to be a problem. |
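Two of the terms above, Availability and Mean Time to Repair, are quantitative, so a short worked sketch may help. The function names are hypothetical, and expressing both measures in hours is an assumption made for illustration.

```python
from datetime import datetime


def availability_ratio(agreed_service_hours, downtime_hours):
    """Proportion of the agreed service hours in which the service was available."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed_service_hours must be positive")
    return (agreed_service_hours - downtime_hours) / agreed_service_hours


def mean_time_to_repair(outages):
    """Average hours from failure (or report) to restoration.

    `outages` is a non-empty list of (start, restored) datetime pairs.
    """
    hours = [(restored - start).total_seconds() / 3600 for start, restored in outages]
    return sum(hours) / len(hours)
```

With 200 agreed service hours and 5 hours of downtime, the availability ratio is 195 / 200 = 0.975. Two outages lasting 2 and 4 hours give an MTTR of 3 hours.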
COBIT Incident and Problem Management Maturity Descriptions
| 0 (Non-existent) | There is no awareness of the need for managing problems and incidents. The problem-solving process is informal and users and IT staff deal individually with problems on a case-by-case basis. |
| 1 (Initial/Ad Hoc) | The organization has recognized that there is a need to solve problems and evaluate incidents. Key knowledgeable individuals provide some assistance with problems relating to their area of expertise and responsibility. The information is not shared with others and solutions vary from one support person to another, resulting in additional problem creation and loss of productive time, while searching for answers. Management frequently changes the focus and direction of the operations and technical support staff. |
| 2 (Repeatable but Intuitive) | There is a wide awareness of the need to manage IT related problems and incidents within both the business units and information services function. The resolution process has evolved to a point where a few key individuals are responsible for managing the problems and incidents occurring. Information is shared among staff; however, the process remains unstructured, informal and mostly reactive. The service level to the user community varies and is hampered by insufficient structured knowledge available to the problem solvers. Management reporting of incidents and analysis of problem creation is limited and informal. |
| 3 (Defined Process) | The need for an effective problem management system is accepted and evidenced by budgets for the staffing, training and support of response teams. Problem solving, escalation and resolution processes have been standardized, but are not sophisticated. Nonetheless, users have received clear communications on where and how to report on problems and incidents. The recording and tracking of problems and their resolutions is fragmented within the response team, using the available tools without centralisation or analysis. Deviations from established norms or standards are likely to go undetected. |
| 4 (Managed and Measurable) | The incident/problem management process is understood at all levels within the organization. Responsibilities and ownership are clear and established. Methods and procedures are documented, communicated and measured for effectiveness. The majority of problems and incidents are identified, recorded, reported and analyzed for continuous improvement and are reported to stakeholders. Knowledge and expertise are cultivated, maintained and developed to higher levels as the function is viewed as an asset and major contributor to the achievement of IT objectives. The incident response capability is tested periodically. Problem and incident management is well integrated with interrelated processes, such as change, availability and configuration management, and assists customers in managing data, facilities and operations. |
| 5 (Optimized) | The incident/problem management process has evolved into a forward-looking and proactive one, contributing to the IT objectives. Problems are anticipated and may even be prevented. Knowledge is maintained, through regular contacts with vendors and experts, regarding patterns of past and future problems and incidents. The recording, reporting and analysis of problems and resolutions is automated and fully integrated with configuration data management. Most systems have been equipped with automatic detection and warning mechanisms, which are continuously tracked and evaluated. |
Example Default Severities
| Category | Severity | Dispatch To: | Description |
| Acquisition HW/SA | 4 | | |
| AS400 Problems and Issues | 1, 2 or 3 | AS400 | AS400 printing setup, applications, system/service restart, mainframe printer |
| HW install | 4 | Deskside/MAC | PC, printers, scanners, etc |
| SW install | 4 | Deskside/MAC | Templates, standard image products |
| MPR Account | 2 or 3 | | MMACs and deletions, resets for MPR accounts (MCBS & MOL only) |
| Network | 2 or 3 | Network | MACs |
| Network Connectivity | 1 | Network | Connects to network and MPR |
| Network Issues | 1 | Network | Restoration of deleted files, missing drive mapping, rights to files and/or directories, rights on NT desktops |
| Network Failure | 1 | Network | DHCP, WINS, DNS down, tape backup unit down, Novell server down |
| Outlook Problems and Issues | 3 | Network | MS Outlook e-mail usage |
| Desktop | 3 or 4 | Deskside/MAC | HW problems with a desktop device |
| Printer | 3 or 4 | Deskside/MAC | printer driver issue, setup or troubleshoot |
| Telecom - data | 4 | dispatch | ISDN, WAN, router, bridge, hub |
| Telecom - voice | 4 | dispatch | phone and voicemail requests |
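A table like this is typically applied as a lookup when a Ticket is classified. The sketch below copies a few rows from the table into a dictionary; the function names are hypothetical, and rows with missing or unclear cells are left out rather than guessed.

```python
# A subset of the rows above: category -> (default severities, dispatch group).
DEFAULT_SEVERITIES = {
    "Network Failure": ((1,), "Network"),
    "Network Connectivity": ((1,), "Network"),
    "Network": ((2, 3), "Network"),
    "Outlook Problems and Issues": ((3,), "Network"),
    "Desktop": ((3, 4), "Deskside/MAC"),
    "HW install": ((4,), "Deskside/MAC"),
    "Telecom - voice": ((4,), "dispatch"),
}


def dispatch_group(category):
    """Return the group a Ticket of this category is dispatched to."""
    return DEFAULT_SEVERITIES[category][1]


def severity_allowed(category, severity):
    """Check a proposed Severity against the defaults listed for the category."""
    return severity in DEFAULT_SEVERITIES[category][0]
```

For instance, a "Desktop" Ticket would be dispatched to Deskside/MAC at Severity 3 or 4, while a "Network Failure" Ticket is always Severity 1.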