Service Operations


4. Service Operation Processes


4.1 Event Management

An event can be defined as any detectable or discernible occurrence that has significance for the management of the IT Infrastructure or the delivery of an IT service, and for evaluating the impact that a deviation might cause to the services. Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.

Effective Service Operation is dependent on knowing the status of the infrastructure and detecting any deviation from normal or expected operation. This is provided by good monitoring and control systems, which are based on two types of tools:

4.1.1 Purpose, Goals and Objectives
The ability to detect events, make sense of them and determine the appropriate control action is provided by Event Management. Event Management is therefore the basis for Operational Monitoring and Control (see Appendix B).

In addition, if these events are programmed to communicate operational information as well as warnings and exceptions, they can be used as a basis for automating many routine Operations Management activities, for example executing scripts on remote devices, or submitting jobs for processing, or even dynamically balancing the demand for a service across multiple devices to enhance performance.
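
As an illustration only, the following Python sketch shows how events could start routine operational actions. The dispatcher class, the event type names and the two handler functions are hypothetical and stand in for whatever a real management tool provides.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], str]


class EventDispatcher:
    """Maps event types to the routine actions they should start."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def dispatch(self, event: dict) -> List[str]:
        # Run every handler registered for this event type, in order.
        return [handler(event) for handler in self._handlers.get(event["type"], [])]


def submit_job(event: dict) -> str:
    # Stand-in for submitting a job for processing (hypothetical).
    return f"submitted job {event['job']}"


def rebalance(event: dict) -> str:
    # Stand-in for moving demand away from a busy device (hypothetical).
    return f"moved load off {event['source']}"


dispatcher = EventDispatcher()
dispatcher.register("backup_window_open", submit_job)
dispatcher.register("high_load", rebalance)

actions = dispatcher.dispatch({"type": "high_load", "source": "web01"})
```

An event type with no registered handler simply produces no action, which corresponds to an event that is logged but not acted upon.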

Event Management therefore provides the entry point for the execution of many Service Operation processes and activities. In addition, it provides a way of comparing actual performance and behaviour against design standards and SLAs. As such, Event Management also provides a basis for Service Assurance and Reporting; and Service Improvement. This is covered in detail in the Continual Service Improvement publication.

4.1.2 Scope
Event Management can be applied to any aspect of Service Management that needs to be controlled and which can be automated. These include:

The difference between monitoring and Event Management
These two areas are very closely related, but slightly different in nature. Event Management is focused on generating and detecting meaningful notifications about the status of the IT Infrastructure and services. While it is true that monitoring is required to detect and track these notifications, monitoring is broader than Event Management. For example, monitoring tools will check the status of a device to ensure that it is operating within acceptable limits, even if that device is not generating events.

Put more simply, Event Management works with occurrences that are specifically generated to be monitored. Monitoring tracks these occurrences, but it will also actively seek out conditions that do not generate events.
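
The distinction can be sketched in Python as follows. This is a simplified illustration, not a description of any particular tool: the function names, the CPU reading and the 90 per cent limit are all assumptions.

```python
from typing import Callable, List, Optional


def receive_event(notification: dict, log: List[dict]) -> None:
    """Event Management side: handle a notification that a CI chose to send."""
    log.append(notification)


def poll_device(name: str, read_cpu: Callable[[str], float],
                limit: float = 90.0) -> Optional[dict]:
    """Monitoring side: actively check a device, even if it is sending nothing."""
    value = read_cpu(name)
    if value > limit:
        # The device generated no event, so the monitoring tool creates one.
        return {"source": name, "type": "cpu_over_limit", "value": value}
    return None
```

In the first function the CI initiates the communication; in the second the monitoring tool seeks out the condition itself.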

4.1.3 Value To Business
Event Management's value to the business is generally indirect; however, it is possible to determine the basis for its value as follows:

4.1.4 Policies, Principles and Basic Concepts

Two things are significant about the above examples:

4.1.5 Process Activities, Methods And Techniques
Figure 4.1 The Event Management process

Figure 4.1 is a high-level and generic representation of Event Management. It should be used as a reference and definition point, rather than an actual Event Management flowchart. Each activity in this process is described below.

Event Occurs
Events occur continuously, but not all of them are detected or registered. It is therefore important that everybody involved in designing, developing, managing and supporting IT services and the IT Infrastructure that they run on understands what types of event need to be detected.

Event Notification
Most CIs are designed to communicate certain information about themselves in one of two ways:

Event notifications can be proprietary, in which case only the manufacturer's management tools can be used to detect events. Most CIs, however, generate Event notifications using an open standard such as SNMP (Simple Network Management Protocol).

Many CIs are configured to generate a standard set of events, based on the designer's experience of what is required to operate the CI, with the ability to generate additional types of event by 'turning on' the relevant event generation mechanism. For other CI types, some form of 'agent' software will have to be installed in order to initiate the monitoring. Often this monitoring feature is free, but sometimes there is a cost to the licensing of the tool.

In an ideal world, the Service Design process should define which events need to be generated and then specify how this can be done for each type of CI. During Service Transition, the event generation options would be set and tested.
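
One way to record such design decisions is a simple catalogue of events per CI type. The Python sketch below is purely illustrative; the CI types, event names and on/off settings are invented for the example.

```python
# Hypothetical event catalogue, as it might be agreed during Service Design.
# True means the event generation mechanism should be switched on.
EVENT_CATALOGUE = {
    "router": {"link_down": True, "link_up": True, "config_changed": False},
    "database_server": {"backup_complete": True, "tablespace_80_percent": True},
}


def enabled_events(ci_type: str) -> list:
    """Return the sorted names of events to switch on for a CI type."""
    settings = EVENT_CATALOGUE.get(ci_type, {})
    return sorted(name for name, switched_on in settings.items() if switched_on)
```

During Service Transition, a catalogue of this kind could be used to check that each CI actually generates the events that were specified.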

In many organizations, however, defining which events to generate is done by trial and error. System managers use the standard set of events as a starting point and then tune the CI over time, to include or exclude events as required. The problem with this approach is that it only takes into account the immediate needs of the staff managing the device and does not facilitate good planning or improvement. In addition, it makes it very difficult to monitor and manage the service over all devices and staff. One approach to combating this problem is to review the set of events as part of continual improvement activities.

A general principle of Event notification is that the more meaningful the data it contains and the more targeted the audience, the easier it is to make decisions about the event. Operators are often confronted by coded error messages and have no idea how to respond to them or what to do with them. Meaningful notification data and clearly defined roles and responsibilities need to be articulated and documented during Service Design and Service Transition (see also paragraph on 'Instrumentation'). If roles and responsibilities are not clearly defined, then when an alert is sent to a wide audience no one knows who is doing what, which can lead to things being missed or to duplicated effort.

Event Detection
Once an Event notification has been generated, it will be detected by an agent running on the same system, or transmitted directly to a management tool specifically designed to read and interpret the meaning of the event.

Event Filtering
The purpose of filtering is to decide whether to communicate the event to a management tool or to ignore it. If ignored, the event will usually be recorded in a log file on the device, but no further action will be taken.

The reason for filtering is that it is not always possible to turn Event notification off, even though a decision has been made that it is not necessary to generate that type of event. It may also be decided that only the first in a series of repeated Event notifications will be transmitted.

During the filtering step, the first level of correlation is performed, i.e. the determination of whether the event is informational, a warning, or an exception (see next step). This correlation is usually done by an agent that resides on the CI or on a server to which the CI is connected.
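
The filtering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the ignored event type, the numeric severity field and its cut-off values are hypothetical, and a real agent would apply the organization's own rules.

```python
from typing import List, Optional, Set, Tuple

# Assumed for illustration: event types that were judged unnecessary but
# cannot be switched off at the source.
IGNORED_TYPES = {"heartbeat"}


def classify(event: dict) -> str:
    """First-level correlation: informational, warning or exception."""
    severity = event.get("severity", 0)  # hypothetical 0-10 scale
    if severity >= 7:
        return "exception"
    if severity >= 4:
        return "warning"
    return "informational"


class EventFilter:
    """Runs on the CI, or on a server to which the CI is connected."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()
        self.local_log: List[dict] = []

    def process(self, event: dict) -> Optional[dict]:
        """Return the event to forward, or None if it is only logged locally."""
        key = (event["source"], event["type"])
        if event["type"] in IGNORED_TYPES or key in self._seen:
            # Ignored or repeated: record it in the local log, take no action.
            self.local_log.append(event)
            return None
        # A fuller version would forget the key once the condition clears.
        self._seen.add(key)
        return {**event, "significance": classify(event)}
```

Only the first notification in a repeated series is forwarded; the repeats and the ignored types remain available in the local log.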

The filtering step is not always necessary. For some CIs, every event is significant and moves directly into a management tool's correlation engine, even if it is duplicated. Also, it may have been possible to turn off all unwanted Event notifications.

Significance of Events
Every organization will have its own categorization of the significance of an event, but it is suggested that at least these three broad categories be represented: informational, warning and exception.

Event Correlation
If an event is significant, a decision has to be made about exactly what the significance is and what actions need to be taken to deal with it. It is here that the meaning of the event is determined.

Correlation is normally done by a 'Correlation Engine', usually part of a management tool that compares the event with a set of criteria and rules in a prescribed order. These criteria are often called Business Rules, although they are generally fairly technical. The idea is that the event may represent some impact on the business and the rules can be used to determine the level and type of business impact.

A Correlation Engine is programmed according to the performance standards created during Service Design and any additional guidance specific to the operating environment.
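
The idea of comparing an event with rules in a prescribed order can be sketched in Python as follows. The rules, field names and action names are invented for the example, and the 'first matching rule wins' behaviour is an assumption rather than a requirement of the process.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]
    action: str


# Hypothetical rules. Order matters: the most specific rule comes first.
RULES: List[Rule] = [
    Rule("core service down",
         lambda e: e["type"] == "service_down" and e.get("tier") == "core",
         "raise_incident"),
    Rule("any service down",
         lambda e: e["type"] == "service_down",
         "alert_operator"),
    Rule("disk filling",
         lambda e: e["type"] == "disk_usage" and e.get("percent", 0) >= 90,
         "auto_cleanup"),
]


def correlate(event: dict, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the action of the first rule that matches, or None."""
    for rule in rules:
        if rule.matches(event):
            return rule.action
    return None
```

Because the rules are evaluated in order, a failure of a core service raises an incident, while the same event on a less critical service only alerts an operator.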

Examples of what Correlation Engines will take into account include:

Trigger
If the correlation activity recognizes an event, a response will be required. The mechanism used to initiate that response is called a trigger.

There are many different types of triggers, each designed specifically for the task it has to initiate. Some examples include:

Response Selection
At this point in the process, there are a number of response options available. It is important to note that the response options can be chosen in any combination. For example, it may be necessary to preserve the log entry for future reference, but at the same time escalate the event to an Operations Management staff member for action.
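
Because response options can be combined, they are naturally modelled as a set of flags. The Python sketch below uses the standard library's enum.Flag; the option names and the mapping from significance to responses are assumptions made for the example.

```python
from enum import Flag, auto


class Response(Flag):
    LOG = auto()            # preserve the log entry for future reference
    AUTO_RESPONSE = auto()  # run an automated action
    ALERT_HUMAN = auto()    # escalate to an Operations Management staff member
    OPEN_INCIDENT = auto()  # hand over to Incident Management


def select_responses(significance: str) -> Response:
    """Choose a combination of responses (hypothetical mapping)."""
    if significance == "exception":
        return Response.LOG | Response.ALERT_HUMAN | Response.OPEN_INCIDENT
    if significance == "warning":
        return Response.LOG | Response.ALERT_HUMAN
    return Response.LOG
```

A caller can then test for each option independently, for example with `Response.ALERT_HUMAN in select_responses("warning")`.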

The options in the flowchart are examples. Different organizations will have different options, and they are sure to be more detailed. For example, there will be a range of auto responses for each different technology. The process of determining which one is appropriate and how to execute it are not represented in this flowchart. Some of the options available are:

Special types of incident: In some cases an event will indicate an exception that does not directly impact any IT service, for example, a redundant air conditioning unit fails, or unauthorized entry to a data centre. Guidelines for these events are as follows:

Review Actions
With thousands of events being generated every day, it is not possible formally to review every individual event. However, it is important to check that any significant events or exceptions have been handled appropriately, or to track trends or counts of event types, etc. In many cases this can be done automatically, for example polling a server that had been rebooted using an automated script to see that it is functioning correctly.
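
An automated check of this kind might look like the following Python sketch. The health check itself is passed in as a function, because how a server is polled depends on the environment; the number of attempts and the delay are arbitrary example values.

```python
import time
from typing import Callable


def verify_recovery(check: Callable[[], bool], attempts: int = 3,
                    delay_seconds: float = 0.0) -> bool:
    """Poll a rebooted server until it reports healthy or attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            # Wait before polling again; no wait after the final attempt.
            time.sleep(delay_seconds)
    return False
```

A result of False would indicate that the auto response did not have the expected effect and that the event needs further attention.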

In the cases where events have initiated an incident, problem and/or change, the Action Review should not duplicate any reviews that have been done as part of those processes. Rather, the intention is to ensure that the handover between the Event Management process and other processes took place as designed and that the expected action did indeed take place. This will ensure that incidents, problems or changes originating within Operations Management do not get lost between the teams or departments.

The Review will also be used as input into continual improvement and the evaluation and audit of the Event Management process.

Close Event
Some events will remain open until a certain action takes place, for example an event that is linked to an open incident. However, most events are not 'opened' or 'closed'.

Informational events are simply logged and then used as input to other processes, such as Backup and Storage Management. Auto-response events will typically be closed by the generation of a second event. For example, a device generates an event and is rebooted through auto response - as soon as that device is successfully back online, it generates an event that effectively closes the loop and clears the first event.

It is sometimes very difficult to relate the open event and the close notifications as they are in different formats. It is optimal that devices in the infrastructure produce 'open' and 'close' events in the same format and specify the change of status. This allows the correlation step in the process to easily match open and close notifications.
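
When 'open' and 'close' events share a format, matching them becomes a simple lookup. The following Python sketch assumes, for illustration, that every event carries 'source', 'condition' and 'status' fields; these field names are not taken from any standard.

```python
from typing import Dict, List, Tuple


def match_open_close(events: List[dict]) -> Tuple[List[Tuple[dict, dict]], List[dict]]:
    """Pair each close event with its open event; return pairs and events still open."""
    open_events: Dict[Tuple[str, str], dict] = {}
    closed_pairs: List[Tuple[dict, dict]] = []
    for event in events:
        # The same key identifies both the open and the close notification.
        key = (event["source"], event["condition"])
        if event["status"] == "open":
            open_events[key] = event
        elif event["status"] == "close" and key in open_events:
            closed_pairs.append((open_events.pop(key), event))
    return closed_pairs, list(open_events.values())
```

If the two notifications arrived in different formats, the key could not be built directly and a translation step would be needed before this matching could take place.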

In the case of events that generated an incident, problem or change, these should be formally closed with a link to the appropriate record from the other process.

4.1.6 Triggers, Input And Output/interprocess Interfaces
Event Management can be initiated by any type of occurrence. The key is to define which of these occurrences is significant and which need to be acted upon. Triggers include:

Event Management can interface to any process that requires monitoring and control, especially those that do not require real-time monitoring, but which do require some form of intervention following an event or group of events. Examples of interfaces with other processes include:

4.1.7 Information Management
Key information involved in Event Management includes the following:

4.1.8 Metrics
For each measurement period in question, the metrics to check on the effectiveness and efficiency of the Event Management process should include the following:

4.1.9 Challenges, Critical Success Factors And Risks

Challenges
There are a number of challenges that might be encountered:

Critical Success Factors
In order to obtain the necessary funding a compelling Business Case should be prepared showing how the benefits of effective Event Management can far outweigh the costs - giving a positive return on investment.

One of the most important CSFs is achieving the correct level of filtering. This is complicated by the fact that the significance of events changes. For example, a user logging into a system today is normal, but if that user has left the organization and then tries to log in, it is a security breach.
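
The login example shows that the same event can change significance when its context changes. A minimal Python sketch, assuming a hypothetical set of currently active user names:

```python
def login_significance(user: str, active_users: set) -> str:
    """Classify a login event against the current list of active users."""
    # The event is identical in both cases; only the context differs.
    return "informational" if user in active_users else "exception"


active = {"asmith", "bjones"}
before = login_significance("asmith", active)

# The user leaves the organization and the account list is updated.
active.discard("asmith")
after = login_significance("asmith", active)
```

Here `before` is "informational" and `after` is "exception", even though the login event itself has not changed. Filtering rules therefore need access to current context, not only to the event.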

There are three keys to the correct level of filtering, as follows:

Proper planning is needed for the rollout of the monitoring agent software across the entire IT Infrastructure. This should be regarded as a project with realistic timescales and adequate resources being allocated and protected throughout the duration of the project.

Risks
The key risks are those already mentioned above: failure to obtain adequate funding; failure to achieve the correct level of filtering; and failure to maintain momentum in rolling out the necessary monitoring agents across the IT Infrastructure. If any of these risks is not addressed, it could adversely affect the success of Event Management.

4.1.10 Designing for Event Management
Effective Event Management is not designed once a service has been deployed into Operations. Since Event Management is the basis for monitoring the performance and availability of a service, the exact targets and mechanisms for monitoring should be specified and agreed during the Availability and Capacity Management processes (see Service Design publication). However, this does not mean that Event Management is designed by a group of remote system developers and then released to Operations Management together with the system that has to be managed. Nor does it mean that, once designed and agreed, Event Management becomes static - day-to-day operations will define additional events, priorities, alerts and other improvements that will feed through the Continual Improvement process back into Service Strategy, Service Design etc.

Service Operation functions will be expected to participate in the design of the service and how it is measured (see section 3.4).

For Event Management, the specific design areas include the following.

Instrumentation
Instrumentation is the definition of what can be monitored about CIs and the way in which their behaviour can be affected. In other words, instrumentation is about defining and designing exactly how to monitor and control the IT Infrastructure and IT services.

Instrumentation is partly about a set of decisions that need to be made and partly about designing mechanisms to execute these decisions.

Decisions that need to be made include:

Mechanisms that need to be designed include:

Note: A strong interface exists here with the application's design. All applications should be coded in such a way that meaningful and detailed error messages/codes are generated at the exact point of failure - so that these can be included in the event and allow swift diagnosis and resolution of the underlying cause. The need for the inclusion and testing of such error messaging is covered in more detail in the Service Transition publication.

Error Messaging
Error messaging is important for all components (hardware, software, networks, etc.). It is particularly important that all software applications are designed to support Event Management. This might include the provision of meaningful error messages and/or codes that clearly indicate the specific point of failure and the most likely cause. In such cases the testing of new applications should include testing of accurate event generation.
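
As an illustration of what such error messaging could look like in application code, the Python sketch below defines an error type that carries a code, the failing component and the most likely cause, and can be converted into event data. The class, the field names and the example code value are hypothetical.

```python
class ApplicationError(Exception):
    """An error that identifies the point of failure and its likely cause."""

    def __init__(self, code: str, component: str, message: str,
                 likely_cause: str) -> None:
        super().__init__(
            f"[{code}] {component}: {message} (likely cause: {likely_cause})"
        )
        self.code = code
        self.component = component
        self.message = message
        self.likely_cause = likely_cause

    def to_event(self) -> dict:
        """Build the data that would be included in the Event notification."""
        return {
            "type": "application_error",
            "code": self.code,
            "source": self.component,
            "description": self.message,
            "likely_cause": self.likely_cause,
            "significance": "exception",
        }


err = ApplicationError("PAY-0042", "payment-gateway",
                       "card authorization timed out",
                       "upstream processor unreachable")
event = err.to_event()
```

Testing of a new application could then assert that each failure path produces an event with the expected code and source.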

Newer technologies such as Java Management Extensions (JMX) or HawkNL™ provide the tools for building distributed, web-based, modular and dynamic solutions for managing and monitoring devices, applications and service-driven networks. These can be used to reduce or eliminate the need for programmers to include error messaging within the code - allowing a valuable level of normalization and code-independence.

Event Detection and Alert Mechanisms
Good Event Management design will also include the design and population of the tools used to filter, correlate and escalate Events.

The Correlation Engine specifically will need to be populated with the rules and criteria that will determine the significance and subsequent action for each type of event.

Thorough design of the event detection and alert mechanisms requires the following:

Identification of Thresholds
Thresholds themselves are not set and managed through Event Management. However, unless these are properly designed and communicated during the instrumentation process, it will be difficult to determine which level of performance is appropriate for each CI.

Also, most thresholds are not constant. They typically consist of a number of related variables. For example, the maximum number of concurrent users before response time slows will vary depending on what other jobs are active on the server. This knowledge is often only gained by experience, which means that Correlation Engines have to be continually tuned and updated through the process of Continual Service Improvement.
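
A threshold that depends on related variables can be expressed as a function rather than a constant. The Python sketch below is illustrative only: the base limit, the capacity assumed to be consumed by each batch job and the floor value are invented numbers that would in practice be tuned from experience.

```python
def max_concurrent_users(base_limit: int, active_batch_jobs: int,
                         users_per_job: int = 50, floor: int = 100) -> int:
    """Effective user limit, reduced by the other jobs active on the server."""
    return max(floor, base_limit - active_batch_jobs * users_per_job)


def breaches_threshold(current_users: int, active_batch_jobs: int,
                       base_limit: int = 1000) -> bool:
    """True if the current user count exceeds the effective limit."""
    return current_users > max_concurrent_users(base_limit, active_batch_jobs)
```

With these example values, 850 concurrent users is acceptable on an otherwise idle server but breaches the threshold when four batch jobs are running, since the effective limit drops to 800.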
