HDI - Implementing Problem Management


7.1 Overview

7.1.1 Description

A problem may be the result of one or more incidents; the Problem Management process seeks to identify the root cause of those problems. The goals of the process are to minimize severity and any adverse impacts caused by problems and ultimately to eliminate them where it makes business sense to do so.

7.1.2 Relationships to other processes

Figure 7.1 - Relationship between processes

Each ITIL process relates to the others to some degree. Some of the processes are completely dependent on one another; this is the case with Problem and Incident Management. It is sometimes difficult to differentiate between these two processes. While the role of Incident Management is to find a quick resolution or a 'fix' so that the customer can get back to work as quickly as possible, Problem Management's purpose is to identify the root cause and eliminate it. When the symptoms are addressed and not the problem itself, the problem will almost certainly reappear.

The escalation of an issue from problem to a Request For Change (RFC) securely links the processes of Change and Problem Management as well. Knowing that a `problem' can be the result of one or more incidents, these relationships can be illustrated as shown in Figure 7.2 below.

Figure 7.2 - Relationship Between Incident and Problem Management

While Change and Incident Management are the two processes that most directly affect or are affected by Problem Management, other relationships can have an impact as well. The failure of any of the other ITIL processes can cause an incident and/or a problem. For example, poor planning for load volume (Capacity Management) or lack of redundancy in critical systems (Continuity Management) can affect the availability of hardware, software, data or the network (Availability Management). It is easy to see the potential for these situations either to be, or to become, a problem for an organization. Similarly, a failed change that has been deployed (Release Management) can also generate a problem for the support team and core business users.

7.1.3 Key inputs and outputs to the process

Problem Management can be broken down into two components, reactive and proactive. In reactive mode, Problem Management is 'fire-fighting', similar to the mode that Incident Management typically engages in. All service events, whatever the cause, are (or should be) logged through the Support Center. Incident Management is a key source of information in isolating and identifying the fact that a problem exists. When Incident Management recognizes the possibility that an incident or group of incidents may be a problem, it then becomes Problem Management's responsibility to find the cause and eliminate it. Through this cycle, disruptions for the customer are reduced by providing a workaround, or by solving the problem entirely.

Figure 7.3 - Problem tracking

These temporary fixes or permanent solutions are recorded in Configuration Management and become part of the knowledgebase. By reducing the number of issues that need to be escalated to second or third level support staff, the first level resolution rate will increase.

Additionally, Problem Management has responsibilities that are proactive in nature. The most significant of these are root cause and trend analysis. In these, it becomes apparent why the data collected in Incident Management is so important. The quality of information gathered in trouble tickets directly affects the ease with which root cause and trend analysis can be performed.

Description | Source | Importance

INPUTS
Incident details | Incident Management | High
Configuration details | Configuration Management | High
Downtime details | Availability Management | High
Failed change details | Release Management | High
Defined workarounds | Incident Management | Medium
Potential problem reported | Incident Management, Capacity Management, Availability Management | Medium
Trend reported | Incident Management | Medium
Event survey results | Support Center | Medium
Annual survey results | Customer Service Management | Low
Anecdotal evidence | All | Low

OUTPUTS
Known error | Problem Management | High
Fix/workaround | Problem Management | High
Management/metric reporting | Problem Management | High
Major problem review | Problem Management | Medium
Request for Change | Problem Management | Medium
Updated and closed problem and known error records | Problem Management | Medium
Suggested improvements for procedures, documentation, training needs | Problem Management | Medium
Anecdotal evidence | Problem Management | Low

7.1.4 Possible problems and issues

Not separating Incident Management from Problem Management: these processes are closely related but distinct in their goals. The two processes have clearly different definitions and a built in conflict between their focuses. Organizations are often tempted to combine these two and this is often the case in smaller IT departments where resources are scarce. The firefighting inherent to Incident Management, and the need to get the customers up and running as quickly as possible often does not leave time to perform Problem Management and examine issues for root causes. On the other hand, time spent on more in-depth investigation of problems takes away from the quick resolution of incidents that customers want and need. Even where resources are not scarce, the same individual or group should not manage both processes unless the conflicts are acknowledged and time can be set aside to work exclusively on one process versus another. This is very difficult to balance and is not recommended.

No shared understanding of what a problem is: ITIL provides the definition of a problem as the unknown, underlying cause of multiple incidents or a single significant incident. Yet in some organizations this has either not been accepted or it is not internalized and understood by staff. Handing out guidebooks is not enough! Everyone in the Support Center should be speaking the same language and the distinction between an incident and a problem must be made clear to all. The processes surrounding them must also be clearly defined, documented and understood.

Lack of recognition of conflicting goals of Incident and Problem Management: it makes sense that a reduction in problems should result in a reduction of incidents and an increase in the Support Center's first level resolution rate at first. As the Support Center becomes more proactive, it will start solving problems before incidents can occur. Eliminating and preventing problems will then decrease the number of calls received and reduce the number of `easy' tickets. This in turn makes the overall closure rates at first level decrease. This is one of the inherent conflicts between Problem and Incident Management that sometimes puts the processes at cross-purposes. The targets for first level resolution should be adjusted and measured accordingly.
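The effect on first level resolution targets is easier to see with numbers. The following is a minimal sketch with made-up figures: it assumes a Support Center that closes 700 of 1000 tickets at first level, and that Problem Management then eliminates 300 'easy' tickets that were all being closed at first level. The function name and all figures are illustrative only.

```python
def first_level_resolution_rate(resolved_at_first_level: int, total_tickets: int) -> float:
    """Share of all tickets that were closed without escalation."""
    if total_tickets == 0:
        return 0.0
    return resolved_at_first_level / total_tickets


# Before: 1000 tickets, 700 of them closed at first level.
before = first_level_resolution_rate(700, 1000)

# After: 300 easy, first-level tickets no longer occur at all.
after = first_level_resolution_rate(700 - 300, 1000 - 300)

print(f"before: {before:.1%}, after: {after:.1%}")
```

In this example the rate falls from 70.0% to 57.1% even though the customers are better off, which is why the target should be reviewed alongside total ticket volume.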

Problem Management as a discipline: Problem Management should be adopted as more than just another mechanistic process. Common techniques must exist and be understood (e.g. how to perform root cause analysis in your organization). Managers must look for and reward the desired behaviors and react accordingly to unwanted outcomes with training and coaching where necessary. For Problem Management to become a discipline, it must become part of the culture. Team members should be clear about their responsibilities, know how to accomplish them and know what to expect if they do not.

Problem Management without a distinct process owner: Problem Management often fails at this level because no one group or individual has been assigned ownership of the overall process. Without central responsibility for monitoring all existing problems, their status, business impact and resolutions, the speed and quality with which problems are resolved become extremely inconsistent. Some problems may be forgotten and never solved at all.

Problem Management seen as bureaucracy: when staff members see Problem Management as additional layers of regulations that keep them from getting the job done, they will resist it. They may also see Problem Management as a process designed to apportion blame. It is imperative that everyone understands the process is not about finding fault; it is about finding the answer to a problem and resolving it so that it does not occur again.

No Problem Management: often Incident Management is implemented without its complementary support process of Problem Management. While Incident Management alone is not enough, customers prefer it to waiting until a permanent solution is found. But if Incident Management staff start to focus on resolving problems instead of getting the customer up and running, customer satisfaction will suffer. Ultimately the customer wants both: to get back to work (Incident Management) and to never experience the problem again (Problem Management).

Problem Management versus the `hero complex': incidents that are easy to resolve with quick fixes or workarounds provide constant positive rewards for a technician. By staying reactive, they can provide quick action on these issues, allowing them to `rescue' customers and get them back to work. When staff members are rewarded for fixing rather than preventing, they have no incentive to proactively seek the elimination of recurring incidents. Establishing Problem Management can help break this cycle by rewarding people for preventing incidents.


7.2 - Implementation

7.2.1 The implementation process

Begin by assigning functional and process ownership to a designated Problem Manager with the reputation, respect and resources to get the job done. It is most effective when the same person(s) does not handle Incident and Problem Management. Instead these responsibilities should be clearly delineated to avoid conflicts with competing objectives.

Separation of these functions provides balance of effort to both processes. While the Incident Management team, or Support Center, focuses on getting the user back to work (either through a temporary fix or a workaround procedure) the Problem Management team begins to work on a permanent solution. As the Single Point of Contact (SPOC) for the IT department, however, the Support Center and its Manager cannot abdicate responsibility for participating in the implementation of Problem Management in the organization. Incident Management will often be the first to realize the need for a Problem Management process and begin the drive for its implementation within the organization.

7.2.2 Support Center Manager's role

Responsibilities and activities
Because Problem Management cannot succeed without a firmly established Incident Management process, the Support Center Manager's key focus needs to be on the development and optimization of Incident Management. During implementation of Problem Management, the Support Center Manager should be working closely with the designated Problem Manager or Management team to design how the processes will work together. Notification, escalation and feedback loops should be of primary concern.

In a smaller organization, where responsibility for the Incident and Problem Management processes resides with the Support Center Manager, concentration should be on defining how much time can be spent exclusively on each. A lot of thought needs to be given to how to keep proactive Problem Management activities from being neglected during periods where there are high volumes of incidents or whether the organization should implement proactive activities at all. When responsibility resides within the same group, a lead should be assigned to each process to ensure coverage.

In an organization attempting to implement both components of Problem Management at the same time, or where resources are limited, Problem Management may have to be limited initially to reactive problem and error control. After enough historical data has been accumulated to be useful, the focus can increase to include proactive activities: root cause analysis, trend identification, problem prevention, instigation of RFCs, and providing management information. In any case, it is important to review time and resources periodically for both reactive and proactive processes to ensure balance.

Deliverables

Competencies

Key performance indicators (KPIs)

7.2.3 Support Center Function's role

Figure 7.4 - Support Center relationship to Problem Management

Responsibilities and activities
Problem Management should be separated from Incident Management if at all possible. The Support Center staff should work with their manager to enable the Problem Management team to get their process started. As the SPOC to the customer, the Support Center is in a position to offer a great deal of input regarding the problems experienced and how they affect the business customers. They are also more familiar than anyone else in the department with the data accumulated within the trouble ticket application because they receive and record all calls and deal directly with users on a regular basis.

In a small IT department, where Incident and Problem Management cannot be separated, a Support Center Manager might identify separate individuals in the Support Center to act as leads for the reactive and proactive aspects of Problem Management as a `specialty'. Alternatively, staff could perform multi-week rotations on some periodic basis to focus on Problem Management, analyze data and initiate RFCs. It is important to periodically review the time and resources spent on each to ensure that enough attention is being given to both. It is especially important to pay attention to the proactive activities, as these quickly may be ignored when incident levels are high.

7.2.4 Planning for implementation

Steps to take
  1. Establish a formal Incident Management system.
    There should be some level of Incident Management already in place such as formal problem categorizations, descriptions, prioritizations, impact assessments, duration estimations and resolution processes. It is best practice, and wise planning, to segregate Problem Management responsibility from Incident Management. An individual or group must be made accountable and responsible for the Problem Management process in your organization. This need not be a full time role.

  2. Use Incident Management data and information to drive the Problem Management system.
    Assuming that the Incident Management process is well enough established to capture reliable and compliant data, begin to focus on using that data within Problem Management. If an effective process is not yet in place, the ticket tracking system is unlikely to be fully utilized to record it. Adjust existing methods or create new ways to capture the information needed to perform good, detailed analysis.

  3. Define a process for gathering issues to review for further analysis.
    The Problem Manager, or Problem Management team, will define a process for gathering issues from many sources (including incident records, faulty changes, discussions with Service Level Management regarding services and customer perceptions, client satisfaction surveys and industry benchmarking) to review for further analysis. They should plan two-way communication paths so that everyone in the department knows and understands the process.

  4. Provide sufficient resources to staff the Problem Management system.
    The Problem Manager does not need to have the technical experience to resolve the problem himself/herself. However, there should be someone in the team with the technical expertise to help guide Problem Management in your organization. They will be able to help pinpoint possible causes and suggest potential solutions and the correct people to implement them. If you cannot have a group dedicated to Problem Management, the Problem Manager should have access to the technical resources necessary to assist them in the analysis and resolution.

  5. Track the problems from identification to resolution.
    Determine the method to employ for tracking problems. Choose or develop tools that will allow you to record each problem through its entire life cycle, including its resolution. If your ticket tracking tool supports mapping incidents to a 'master ticket' or 'case' (or better still to a problem ticket), consider using this functionality to track problems and to associate related incident tickets with problem tickets (either manually or automatically).
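If your tool has no problem ticket of its own, the tracking described in step 5 can be sketched as a simple record that links incident tickets to one problem. This is only an illustration: the field names, the status values ('open', 'known_error', 'resolved') and the ticket numbers are assumptions, not the schema of any particular product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ProblemRecord:
    """One problem tracked from identification to resolution (hypothetical schema)."""
    problem_id: str
    summary: str
    opened: date
    status: str = "open"  # assumed life cycle: open -> known_error -> resolved
    workaround: Optional[str] = None
    resolved: Optional[date] = None
    incident_ids: List[str] = field(default_factory=list)

    def link_incident(self, incident_id: str) -> None:
        """Associate an incident ticket with this problem, ignoring duplicates."""
        if incident_id not in self.incident_ids:
            self.incident_ids.append(incident_id)

    def record_workaround(self, text: str) -> None:
        """Store a workaround and mark the problem as a known error."""
        self.workaround = text
        self.status = "known_error"

    def resolve(self, when: date) -> None:
        """Close the problem once a permanent solution is in place."""
        self.resolved = when
        self.status = "resolved"


# Made-up example data.
problem = ProblemRecord("PRB-001", "Print spooler stops after nightly backup", date(2024, 3, 1))
for ticket in ["INC-101", "INC-107", "INC-101"]:
    problem.link_incident(ticket)
problem.record_workaround("Restart the spooler service")
```

The same structure works as the columns of a tracking spreadsheet: one row per problem, with the related incident numbers listed against it.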

Groups to contact

Necessary information and data

Measurements that should be in place

7.2.5 Implementing key process activities: hints and tips

What to implement first

Incident Management: you will need access to reliable and compliant data for root cause analysis and trend analysis. Robust Incident Management will ensure a good starting point for you to collect, review and analyze this information.

Things that always work
The Pareto Principle: roughly 80% of your tickets will probably trace back to about 20% of the issues in your department, so examine the most frequent issues first. You may want to investigate and consider purchasing a software application that can automatically apply this principle to your data for you.
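If you do not have such an application, a Pareto check can be run on an export of ticket categories. The sketch below assumes each ticket carries a single category label; the function name, the 80% threshold parameter and the sample data are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto_categories(categories: Iterable[str],
                      threshold: float = 0.8) -> List[Tuple[str, int, float]]:
    """Return the most frequent categories that together reach `threshold` of all tickets.

    Each entry is (category, ticket_count, cumulative_share).
    """
    counts = Counter(categories)
    total = sum(counts.values())
    result = []
    running = 0
    for category, count in counts.most_common():
        running += count
        result.append((category, count, running / total))
        if running / total >= threshold:
            break
    return result


# Made-up sample: one category label per ticket.
tickets = ["password"] * 50 + ["printer"] * 30 + ["vpn"] * 10 + ["email"] * 6 + ["other"] * 4
for category, count, cumulative in pareto_categories(tickets):
    print(f"{category:10s} {count:3d} {cumulative:.0%}")
```

With this sample, two of the five categories account for 80% of the tickets, so those two are the first candidates for root cause analysis.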

Little things that deliver big returns
If the tickets have been entered and categorized correctly you will be able to look for tickets recorded on similar issues. A good ticket tracking system will allow you to `relate' your tickets to make this even easier.

Little things that always get forgotten
Your customer: do not forget to verify the priority and business impact to your customer. It makes no sense to spend resources (time, people or money) working on problems that are not important to your customer.

7.2.6 Key process activities

Implement Incident Management: if the Incident Management process is not in place in your organization this will need to be implemented before implementing Problem Management or, less preferably, in tandem. Problem Management depends on Incident Management to identify issues that may be problems, and also on the data collected through Incident Management to perform root cause analysis.

7.2.7 Methods and techniques

Root cause analysis: the root cause of a problem can be found by examining all the data you can possibly gather surrounding an issue. You might start by mining the data in your ticket tracking system. Examine the data from multiple perspectives:

Popular techniques cited in ITIL that can be used in Problem Management include:

The most important thing about techniques is to ensure they are shared within your organization.


7.3 - Ongoing Operations

7.3.1 The implementation process

Problem Management can be described in its most fundamental way as call avoidance. While Problem Management has a reactive component, it can and should be more focused on proactive activities. Removing root causes, preventing the occurrence (or recurrence) of Incidents and minimizing the consequences of those that cannot be prevented are the key activities involved. When your systems are stable, and your service processes are well defined and working as designed, incidents and problems become easier to manage.

The scope of this process is to count recurring tickets, analyze trends and identify problems. Incident Management can exist without Problem Management and it is far easier to manage each Incident as it arises than to invest the time necessary to perform root cause analysis. But by itself, Incident Management is ineffective. Many problems can have the same root cause but display different symptoms. This can make it difficult to recognize a problem or even to realize that a series of incidents may be related. Similarly, the organization could be experiencing a number of problems that share common symptoms but are not related at all.

7.3.2 Support Center Manager's role

Responsibilities and activities
The Support Center Manager should work very closely with the Problem Manager to ensure a continued feedback loop between Incident and Problem Management. The Support Center owns the Incident Management process, and functions as the SPOC, interfacing with customers. With the implementation of Problem Management, the Support Center will continue to be the main source of potential issues to be examined on the reactive side of Problem Management and instrumental in defining business impact. The Support Center will also act as the SPOC for Problem Management to receive requests to review incidents as potential problems. In this way, there can be consistency of definitions and interpretations.

On the proactive side, Problem Management will be investigating trends and root cause on data provided in trouble tickets by the Incident Management staff. There should be continuous dialog regarding the collection and categorization of this data and the Support Center Manager is ultimately responsible for the quality of this data. Ideally, Problem Management could automate parameter triggers to flag a specific set of conditions as needing Root Cause analysis.
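One way to picture such a trigger is a rule that flags any combination of configuration item and category that recurs a set number of times within a time window. The sketch below is an assumption about how this could look, not a feature of any specific tool; the ticket shape, the seven-day window and the threshold of three are all illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Each incident is (opened_at, configuration_item, category): an assumed, simplified ticket shape.
Incident = Tuple[datetime, str, str]


def flag_for_root_cause(incidents: List[Incident], now: datetime,
                        window: timedelta = timedelta(days=7),
                        threshold: int = 3) -> Dict[Tuple[str, str], int]:
    """Return (configuration_item, category) pairs with at least `threshold` incidents in the window."""
    cutoff = now - window
    recent = Counter((ci, cat) for opened, ci, cat in incidents if cutoff <= opened <= now)
    return {key: n for key, n in recent.items() if n >= threshold}


# Made-up example data.
now = datetime(2024, 5, 10, 12, 0)
incidents = [
    (datetime(2024, 5, 9, 9, 0), "mail-server-01", "outage"),
    (datetime(2024, 5, 8, 14, 30), "mail-server-01", "outage"),
    (datetime(2024, 5, 6, 8, 15), "mail-server-01", "outage"),
    (datetime(2024, 5, 9, 11, 0), "printer-3f", "paper jam"),
    (datetime(2024, 4, 1, 10, 0), "mail-server-01", "outage"),  # outside the window
]
flagged = flag_for_root_cause(incidents, now)
```

A rule like this depends entirely on consistent categorization in the trouble tickets, which is why the quality of that data matters so much.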

As problems are categorized as Known Errors, Problem Management should be providing the workarounds back to the Service Manager and Incident Management team. For permanent solutions, it is expected that the Support Center Manager is actively participating in the Change Management process and therefore aware of the resolution before it is implemented and fully aware of Release Management's implementation schedule. It is the Support Center Manager's responsibility to keep the Support Center aware of information about the changes coming out of the Release and Change Management processes so that they will be informed and prepared when these changes occur.

In the case of a small center where the Support Center Manager is overseeing both Incident and Problem Management, it is important that the appropriate staffing is identified and kept informed as if they were separate groups. It is a good idea to have a proactive problem specialist on this team to participate with the Support Center Manager in the Change Advisory Board so neither process gets overlooked.

Deliverables

7.3.3 Support Center Function's role

Responsibilities and activities
Participating in the Problem Management process, staff should be seeking feedback regarding the potential problems they have reported. Either it is not a problem, or there should be some kind of workaround provided. The Support Center by nature is extremely reactive and will not necessarily have time to constantly ask for this information. However if these reports are not being provided, it is their responsibility to report that to their Manager to follow up. The Manager should ensure that the Support Center obtains the information that it needs to provide effective service and provide status updates to the customers. Other responsibilities would include analyzing data, investigating problems, submitting RFCs, monitoring Known Errors and monitoring the status of problem resolutions.

In a small center, the proactive Problem lead would be responsible for providing this information to the Manager and the rest of their team. This lead person should also attend Change Advisory Board meetings and work with the Support Center Manager to understand and communicate the impacts from the Problem Management perspective.

In either case, producing a daily report of the top ten most common issues is an effective way to provide a minimum level of ongoing reactive Problem Management. Workarounds or permanent solutions should be added to the knowledgebase and Support Center staff should be allocating the time needed to maintain this database regularly. The top ten lists can also be used proactively to monitor what the problems are, track what has been done to resolve them and plan what will be done next.
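A top ten list of this kind needs nothing more than the date and issue label of each ticket. The following is a minimal sketch, assuming tickets can be exported as (date opened, issue) pairs; the function name and sample data are illustrative.

```python
from collections import Counter
from datetime import date
from typing import List, Tuple


def top_issues(tickets: List[Tuple[date, str]], day: date, limit: int = 10) -> List[Tuple[str, int]]:
    """Return up to `limit` (issue, count) pairs for tickets opened on `day`, most frequent first."""
    counts = Counter(issue for opened, issue in tickets if opened == day)
    return counts.most_common(limit)


# Made-up example data.
today = date(2024, 5, 10)
tickets = (
    [(today, "password reset")] * 12
    + [(today, "vpn disconnects")] * 5
    + [(today, "printer offline")] * 2
    + [(date(2024, 5, 9), "password reset")] * 20  # previous day, not counted
)
for issue, count in top_issues(tickets, today):
    print(f"{count:3d}  {issue}")
```

Keeping each day's list makes it possible to see whether an issue drops off the list after a workaround or permanent solution has been added to the knowledgebase.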

Deliverables

KPIs


7.4 - Optimization

7.4.1 - The optimization process

The more control over incidents an IT department has, the better availability and reliability it can provide to its customers. As more and more problems are eliminated from the environment, the Support Center can begin to focus on improvements in efficiency; isolating application errors, developing training plans and providing better service. With proper communication and management reporting, the benefits will be clearly visible and IT will be better positioned with the core business.

7.4.2 Support Center Manager's role

Responsibilities and activities

Figure 7.5 - Optimization

Incident reduction leads to Incident prevention. To optimize the process, the Support Center Manager can now begin focusing on improving efficiencies internally. Working with the Service Level Manager, efforts to redesign Operational Level Agreements (OLAs), the internal 'service level agreements' between the Support Center and the second and third level support teams, can yield better turnaround time for ticket resolution. Review of historical data may reveal areas within the customer SLAs that should be revised as well. Developing training plans that involve the Support Center staff's interaction with the other teams provides value at the Support Center and opens new career path opportunities. Successful companies train their staff to anticipate customer needs and to solve problems before the customer even knows they exist.

Deliverables

KPIs

7.4.3 Support Center Function's role

Responsibilities and activities

Good Problem Management will lower the number of tickets that are easily resolved. As problems and some of the routine incidents disappear, Support Center staff can begin to focus on the proactive goals of the Support Center. This includes more time to spend on difficult issues and in follow-up with customers. A Support Center that used to be too busy handling password resets can now find out why some tickets are still unresolved at second level. They can identify areas where training would be appropriate in giving them the skills they need to resolve more of the complex issues.

7.4.4 Future impact of this process on the Support Center

Figure 7.6 - IT Service Quality Cycle

Proactive problem solving guarantees that your problems will stay fixed. The Support Center will be able to demonstrate savings to the organization and embark on a continuing program of process and service improvement. Above all, the purpose of implementing Problem Management is to improve service and provide quality support to your customers.


7.5 Measurement, costing and management reporting

7.5.1 Implementing

Benefits and costs
Most of your customers are not concerned about why a problem occurred, nor do they want to spend time on resolving it. They just want their computer system to work!

The cost of problems is proportional to the volume of incidents and the number of customers affected by them. The duration of an incident and the resources needed to resolve it also add costs to the business through lost productivity. Implementation of Problem Management will help offset these costs, reducing the incidents that interfere with the ability of the organization to conduct its business. Ultimately, this will provide better system availability and reliability and a better overall customer experience.
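A rough estimate of this cost can be made from ticket data. The sketch below assumes lost productivity is simply customers affected multiplied by incident duration and an average hourly cost; that formula, the hourly figure and the sample incidents are illustrative assumptions, and resolution effort by IT staff is left out.

```python
from typing import Iterable, Tuple


def lost_productivity_cost(incidents: Iterable[Tuple[int, float]], hourly_cost: float) -> float:
    """Estimate lost productivity for incidents given as (customers_affected, duration_hours)."""
    return sum(customers * hours * hourly_cost for customers, hours in incidents)


# Made-up incidents caused by one recurring problem.
incidents = [(10, 0.5), (3, 2.0), (40, 0.25)]
estimate = lost_productivity_cost(incidents, hourly_cost=40.0)  # hypothetical average hourly cost
print(f"Estimated lost productivity: {estimate:.2f}")
```

Even a rough figure like this helps when ranking problems, because it combines volume, reach and duration in one number.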

Cost elements for implementation
Typically there are no costs for tools required to implement Problem Management because the tools should already be in place for other processes (e.g. ticket tracking for Incident Management). If you are not staffed to handle Problem Management, you may consider acquiring additional staff. In most cases, the roles and responsibilities for this process are distributed among existing employees.

Making the business case to implement Problem Management
If you have previously made the business case for the implementation of ITIL in your organization, Problem Management will already make sense. Discuss the drawbacks of remaining reactive as an IT organization. You should be able to articulate how recurring tickets affect productivity. A key point to make to customers is that a problem resolved through Problem Management stays resolved. You will be able to promise better availability and reliability of systems and improved service quality. But be careful to set your customers' expectations so that they expect continuous ongoing improvement rather than immediate results.

Management reporting
Top Ten reporting of incidents, broken out by individual and by department, gives a useful perspective on the issues facing the organization and allows the IT department to demonstrate its commitment to eliminating problems.

Management reports should explain how metrics and key performance indicators relate to the strategic objectives of the organization. For instance, if a strategic goal of IT is to provide systems and tools that maximize staff productivity, make sure that management reports show actual data and also compare that data to organizational goals. For example: in support of the goal to maximize productivity, the overall system availability goal is set at 98.5% during 24 hours of operation. The management reports should show the availability of individual systems and the overall average. Any major deviations from the goal should be explained along with a statement about what needs to be or is being done to address the cause(s). If the deviation was a one-time situation and is not expected to recur, your report should include a statement to that effect.
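The availability example can be worked through as follows. This is a sketch under stated assumptions: availability is taken as uptime divided by total time over a 30-day period of 24-hour operation, the overall figure is a simple average of the individual systems, and the system names and downtime figures are made up. Only the 98.5% goal comes from the text.

```python
from typing import Dict

GOAL_PERCENT = 98.5  # availability goal quoted in the text


def availability_percent(downtime_minutes: int, period_minutes: int) -> float:
    """Availability as the percentage of the period during which the system was up."""
    return 100 * (period_minutes - downtime_minutes) / period_minutes


def availability_report(downtime_by_system: Dict[str, int], period_minutes: int) -> Dict[str, object]:
    """Per-system availability, the simple overall average and the systems below the goal.

    Assumes at least one system is supplied.
    """
    per_system = {name: availability_percent(minutes, period_minutes)
                  for name, minutes in downtime_by_system.items()}
    overall = sum(per_system.values()) / len(per_system)
    below_goal = sorted(name for name, pct in per_system.items() if pct < GOAL_PERCENT)
    return {"per_system": per_system, "overall": overall, "below_goal": below_goal}


PERIOD = 30 * 24 * 60  # minutes in a 30-day month of 24-hour operation
report = availability_report({"email": 432, "erp": 1296}, PERIOD)  # made-up downtime minutes
for name, pct in report["per_system"].items():
    print(f"{name}: {pct:.1f}%")
print(f"overall: {report['overall']:.1f}%  below goal: {report['below_goal']}")
```

Systems listed as below the goal are the ones whose deviations the management report should explain.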

7.5.2 Ongoing operations

Cost elements for ongoing operations
To move beyond reactive Problem Management, it may be necessary to look at staffing levels. If feasible and justifiable in your organization, consider acquiring additional staff. If you have been managing Incident and Problem Management through one individual or group, now may be the time to consider breaking these processes out. Promoting one of your specialists to focus on Problem Management may be in order at this time.

If you do not have good reporting tools in place, it may also be the time to consider purchasing some. These products vary in price and functionality. Look for one that will interface easily with your ticket tracking system.

Metrics and Key Performance Indicators

Management reporting
Because so much time is spent investigating incidents, finding workarounds, diagnosing root causes, and in the identification and implementation of problem resolutions, reports to management should reflect the time and effort spent on each. To give that information context, this should be measured against the impact of the problem to see the value in the resource expenditure.

7.5.3 Optimization

Benefits and costs
Ineffectiveness has a high cost for the organization in downtime and in employee turnover. If employees do not have adequate tools, they cannot work effectively (loss of productivity = cost to organization) and they may become dissatisfied (potential for increase in turnover).

Metrics and Key Performance Indicators

Management reporting
Problem Management reports should be geared towards monitoring the quality of the process.

7.5.4 Tools

Implementation

Ongoing operations

Optimizing


Annex Documents

Overview


Annex A7.1 - Problem Management Implementation Checklist


Annex A7.2 - Problem Management Ongoing Process Checklist


Annex A7.3 - Problem Management Optimization Checklist


Annex A7.4 - Sample Problem Management Process Workflow

Sample Problem Management Process Flow


Annex A7.5 - Problem Management Evaluation Checklist

Questions to ask in determining justification for problem resolution.


Annex A7.6 - Sample Problem Tracking Spreadsheet

Sample Problem Tracking Sheet
