Service Operations

3. Service Operation Principles

When considering Service Operation it is tempting to focus only on managing day-to-day activities and technology as ends in themselves. However, Service Operation exists within a far greater context. As part of the Service Management Lifecycle, it is responsible for executing and performing processes that optimize the cost and quality of services. As part of the organization, it is responsible for enabling the business to meet its objectives. As part of the world of technology, it is responsible for the effective functioning of components that support services. The principles in this chapter are aimed at helping Service Operation practitioners to achieve a balance between all of these roles and to focus on effectively managing the day-to-day aspects while maintaining a perspective of the greater context.

3.1 Functions, Groups, Teams, Departments And Divisions

The Service Operation publication uses several terms to refer to the way in which people are organized to execute processes or activities. There are several published definitions for each term and it is not the purpose of this publication to enter the debate about which definition is best. Please note that the following definitions are generic and not prescriptive. They are provided simply to define assumptions and to facilitate understanding of the material. The reader should adapt these principles to the organizational practices used in their own organization.

Function: A function is a logical concept that refers to the people and automated measures that execute a defined process, an activity or a combination of processes or activities. In larger organizations, a function may be broken out and performed by several departments, teams and groups, or it may be embodied within a single organizational unit (e.g. Service Desk). In smaller organizations, one person or group can perform multiple functions - e.g. a Technical Management department could also incorporate the Service Desk function.
Group: A group is a number of people who are similar in some way. In this publication, groups refer to people who perform similar activities - even though they may work on different technology or report into different organizational structures or even in different companies. Groups are usually not formal organization structures, but are very useful in defining common processes across the organization - e.g. ensuring that all people who resolve incidents complete the Incident Record in the same way. In this publication the term 'group' does not refer to a group of companies that are owned by the same entity.
Team: A team is a more formal type of group. These are people who work together to achieve a common objective, but not necessarily in the same organization structure. Team members can be co-located, or work in multiple different locations and operate virtually. Teams are useful for collaboration, or for dealing with a situation of a temporary or transitional nature. Examples of teams include project teams, application development teams (often consisting of people from several different business units) and incident or problem resolution teams.
Department: Departments are formal organization structures which exist to perform a specific set of defined activities on an ongoing basis. Departments have a hierarchical reporting structure with managers who are usually responsible for the execution of the activities and also for day-to-day management of the staff in the department.
Division: A division refers to a number of departments that have been grouped together, often by geography or product line. A division is normally self-contained and is able to plan and execute all activities in a supply chain.
Role: A role refers to a set of connected behaviours or actions that are performed by a person, team or group in a specific context. For example, a Technical Management department can perform the role of Problem Management when diagnosing the root cause of incidents. This same department could also be expected to play several other roles at different times, e.g. it may assess the impact of changes (Change Management role), manage the performance of devices under their control (Capacity Management role), etc. The scope of their role and what triggers them to play that role are defined by the relevant process and agreed by their line manager.

3.2 Achieving Balance In Service Operations

Service Operation is more than just the repetitive execution of a standard set of procedures or activities. All functions, processes and activities are designed to deliver a specified and agreed level of services, but they have to be delivered in an ever-changing environment.

This forms a conflict between maintaining the status quo and adapting to changes in the business and technological environments. One of Service Operation's key roles is therefore to deal with this conflict and to achieve a balance between conflicting sets of priorities.

This section of the publication highlights some of the key tensions and conflicts and identifies how IT organizations can recognize that they are suffering from an imbalance by tending more towards one extreme or the other. It also provides some high-level guidelines on how to resolve the conflict and thus move towards a best-practice approach. Every conflict therefore represents an opportunity for growth and improvement.

3.2.1 Internal It View Versus External Business View

Figure 3.1 Achieving a balance
between external and internal focus

The most fundamental conflict in all phases of the ITSM Lifecycle is between the view of IT as a set of IT services (the external business view) and the view of IT as a set of technology components (internal IT view).

The external view of IT is the way in which services are experienced by its users and customers. They do not always understand, nor do they care about, the details of what technology is used to manage those services. All they are concerned about is that the services are delivered as required and agreed.

The internal view of IT is the way in which IT components and systems are managed to deliver the services. Since IT systems are complex and diverse, this often means that the technology is managed by several different teams or departments - each of which is focused on achieving good performance and availability of 'its' systems.

Both views are necessary when delivering services. The organization that focuses only on business requirements without thinking about how they are going to deliver will end up making promises that cannot be kept. The organization that focuses only on internal systems without thinking about what services they support will end up with expensive services that deliver little value.

The potential for role conflict between the external and internal views is the result of many variables, including the maturity of the organization, its management culture, its history, etc. This makes a balance difficult to achieve, and most organizations tend more towards one role than the other. Of course, no organization will be totally internally or externally focused, but will find itself in a position along a spectrum between the two. This is illustrated in Figure 3.1:

Table 3.1 outlines some examples of the characteristics of positions at the extreme ends of the spectrum. The purpose of this table is to assist organizations in identifying to which extreme they are closer, not to identify real-life positions to which organizations should aspire.

Extreme internal focus Extreme external focus
Primary focus

Performance and management of IT Infrastructure devices, systems and staff, with little regard to the end result on the IT service.

Achieving high levels of IT service performance with little regard to how it is achieved.

Metrics

Focus on technical performance without showing what this means for services

Internal metrics (e.g. network uptime) reported to the business instead of service performance metrics.

Focus on External Metrics without showing internal staff how these are derived or how they can be improved
Internal staff are expected to devise their own metrics to measure internal performance.

Customer/user experience

High consistency of delivery, but only delivers a percentage of what the business needs.
Uses a 'push' approach to delivery, i.e. prefers to have a standard set of services for all business units.

Poor consistency of delivery
'IT consists of good people with good intentions, but cannot always execute'
Reactive mode of operation.
Uses a 'pull' approach to delivery, i.e. prefers to deliver customized services upon request.

Operations strategy

Standard operations across the board
All new services need to fit into the current architecture and procedures.

Multiple delivery teams and multiple technologies
New technologies require new operations approaches and often new IT Operations teams.

Procedures and manual

Focus purely on how to manage the technology, not on how its performance relates to IT services

Focuses primarily on what needs to be done and when and less on how this should be achieved.

Cost strategy

Cost reduction achieved purely through technology consolidation
Optimization of operational procedures and resources
Business impact of cost cutting often only understood later
Return on Investment calculations are focused purely on cost savings or 'payback periods'.

Budget allocated on the basis of which business unit is perceived to have the most need
Less articulate or vocal business units often have inferior services as there is not enough funding allocated to their services.

Training

Training is conducted as an apprenticeship, where new Operations staff have to learn the way things have to be done, not why

Training is conducted on a project-by-project basis
There are no standard training courses since operational procedures and technology are constantly changing.

Operations staff

Specialized staff, organized according to technical specialty
Staff work on the false assumption that good technical achievement is the same as good customer service.

Generalist staff, organized partly according to technical capability and partly according to their relationship with a business unit
Reliance on 'heroics', where staff go out of their way to resolve problems that could have been prevented by better internal processes.

Table 3.1 Examples of extreme internal and external focus

This does not mean that the external focus is unimportant The whole point of Service Management is to provide services that meet the objectives of the organization as a whole. It is critical to structure services around customers. At the same time, it is possible to compromise the quality of services by not thinking about how they will be delivered.

Building Service Operation with a balance between internal and external focus requires a long-term, dedicate. approach reflected in all phases of the ITSM Service Lifecycle. This will require the following:

An understanding of what services are used by the business and why.
An understanding of the relative importance and impact of those services on the business.
An understanding of how technology is used to provide IT services.
Involvement of Service Operation in Continual Service Improvement projects that aim to identify ways of delivering more, increase service quality and lower cost.
Procedures and manuals that outline the role of IT Operations in both the management of technology and the delivery of IT services.
A clearly differentiated set of metrics to report to the business on the achievement of service objectives; an. to report to IT managers on the efficiency and effectiveness of Service Operation.
All IT Operations staff understand exactly how the performance of the technology affects the delivery of IT services and in turn how these affect the business and the business goals.
A set of standard services delivered consistently to all Business Units and a set of non-standard (sometimes customized) services delivered to specific Business Units - together with a set of Standard Operating Procedures (SOPs) that can meet both sets of requirements.
A cost strategy aimed at balancing the requirements of different business units with the cost savings available through optimization of existing technology or investment in new technology - and an understanding of the cost strategy by all involved IT resources.
A value-based, rather than cost-based, Return on Investment strategy.
Involvement of IT Operations staff in the Service Design and Service Transition phases of the ITSM Lifecycle.
Input from and feedback to Continual Service Improvement to identify areas where there is an imbalance and the means to identify and enforce improvement.
A clear communication and training plan for business. While many organizations are good at developing Communication Plans for projects, this often does not extend into their operational phase.

3.2.2 Stability Versus Responsiveness

Figure 3.2 Achieving a balance between
focus on stability and responsiveness

No matter how good the functionality is of an IT service and no matter how well it has been designed, it will be worth far less if the service components are not available or if they perform inconsistently.

This means that Service Operation needs to ensure that the IT Infrastructure is stable and available as designed. At the same time, Service Operation needs to recognize that business and IT requirements change.

Some of these changes are evolutionary. For example, the functionality, performance and architecture of a platform may change over a number of years. Each change brings with it an opportunity to provide better levels of service to the business. In evolutionary changes, it is possible to plan how to respond to the change and thus maintain stability while responding to the changes.

Many changes, though, happen very quickly and sometimes under extreme pressure. For example, a Business Unit unexpectedly wins a contract that requires additional IT services, more capacity and faster response times. The ability to respond to this type of change without impacting other services is a significant challenge.

Many IT organizations are unable to achieve this balance and tend to focus on either the stability of the IT Infrastructure or the ability to respond to changes quickly.

Extreme focus on stability Extreme focus on responsiveness
Primary focus
Technology
Developing and refining standard IT management techniques and processes.

Output to the business
Agrees to required changes before determining what it will take to deliver them.

Typical problems experienced
IT can demonstrate that it is complying with SOPS and Operational Level Agreements (OLAs), even when there is clear misalignment to business requirements
IT staff are not available to define or execute routine tasks because they are busy on projects for new services
Technology growth strategy
Growth strategy based on analyzing existing demand on existing systems
New services are resisted and Business Units sometimes take ownership of ' their own' systems to get access to new services.

Technology purchased for each new business requirement
Using multiple technologies and solutions for similar solutions, to meet slightly different business needs.

Technology used to deliver IT services
Existing or standard technology to be used; services must be adjusted to work within existing parameters
Over-provisioning. No attempt is made to model the new Service on the existing infrastructure. New, dedicated technology is purchased for each new project.

Capacity Management
Forecasts based on future business activity for each service individually and do not take into account IT activity or other IT services Existing workloads not relevant.
Forecasts based on projections of current workloads
System performance is maintained at consistent levels through tuning and demand management, not by workload forecasting and management.

Table 3.2 Examples of extreme focus on stability and responsiveness

Table 3.2 outlines some examples of the characteristics of positions at extreme ends of the spectrum. The purpose of this table is to assist organizations in identifying to which extreme they are closer, not to identify real-life positions to which organizations should aspire.

Building an IT organization that achieves a balance between stability and responsiveness in Service Operation will require the following actions:

Ensure investment in technologies and processes that are adaptive rather than rigid, e.g. virtual server and application technology and the use of Change Models (see Service Transition publication).
Build a strong Service Level Management (SLM) process which is active from the Service Design phase to the Continual Service Improvement phase of the ITSM Lifecycle.
Foster integration between SLM and the other Service Design processes to ensure proper mapping of business requirements to IT operational activities and components of the IT Infrastructure. This makes it easier to model the effect of changes and improvements.
Initiate changes at the earliest appropriate stage in the ITSM Lifecycle. This will ensure that both functional (business) and manageability (IT operational) requirements can be assessed and built or changed together.
Ensure IT involvement in business changes as early as possible in the change process to ensure scalability, consistency and achievability of IT services sustaining business changes.
Service Operation teams should provide input into the ongoing design and refinement of the architectures and IT services (see Service Design and Service Strategy publications).
Implement and use SLM to avoid situations where business and IT managers and staff negotiate informal agreements.

3.2.3 Quality Of Service Versus Cost Of Service

Figure 3.3 Balancing service quality and cost

Service Operation is required consistently to deliver the agreed level of IT service to its customers and users, while at the same time keeping costs and resource utilization at an optimal level.

In Figure 3.3, an increase in the level of quality usually results in an increase in the cost of that service, and vice versa. However, the relationship is not always directly proportional:

Early in the service's lifecycle it is possible to achieve significant increases in service quality with a relatively small amount of money. For example, improving service availability from 55% to 75% is fairly straightforward and may not require a huge investment.
Later in the service's lifecycle, even small improvements in quality are very expensive. For example, improving the same service's availability from 96% to 99.9% may require large investments in highavailability technology and support staff and tools.

While this may seem straightforward, many organizations are under severe pressure to increase the quality of service while reducing their costs. In Figure 3.3, the relationship between cost and quality is sometimes inverse. It is possible (usually inside the range of optimization) to increase quality while reducing costs. This is normally initiated within Service Operation and carried forward by Continual Service Improvement. Some costs can be reduced incrementally over time, but most cost savings can be made only once. For example, once a duplicate software tool has been eliminated, it cannot be eliminated again for further cost savings.

Achieving an optimal balance between cost and quality (shown between the dotted lines in Figure 3.3) is a key role of Service Management. There is no industry standard for what this range should be, since each service will have a different range of optimization, depending on the nature of the service and the type of business objective being met. For example, the business may be prepared to spend more to achieve high availability on a mission-critical service, while it is prepared to live with the lower quality of an administrative tool.

Determining the appropriate balance of cost and quality should be done during the Service Strategy and Service Design Lifecycle phases, although in many organizations it is left to the Service Operation teams - many of whom do not generally have all the facts or authority to be able to make this type of decision.

.Unfortunately, it is also common to find organizations that are spending vast quantities of money without achieving any clear improvements in quality. Again, Continual Service Improvement will be able to identify the cause of the inefficiency, evaluate the optimal balance for that service and formulate a corrective plan.

Achieving the correct balance is important. Too much focus on quality will result in IT services that deliver more than necessary, at a higher cost, and could lead to a discussion on reducing the price of services. Too much focus on cost will result in IT delivering on or under budget, but putting the business at risk through substandard IT services.

Special note: just how far is too much?

Over the past several years, IT organizations have been under pressure to cut costs. In many cases this resulted in optimized costs and quality. But, in other cases, costs were cut to the point where quality started to suffer. At first, the signs were subtle - small increases in incident resolution times and a slight increase in the number of incidents. Over time, though, the situation became more serious as staff worked long hours to handle multiple workloads and services ran on ageing or outdated infrastructure.

There is no simple calculation to determine when costs have been cut too far, but good SLM is crucial to making customers aware of the impact of cutting too far, so recognizing these warning signs and symptoms can greatly enhance an organization's ability to correct this situation.

Figure 3.4 Achieving a balance between focus on cost and quality
Figure 3.4 Achieving a balance between focus on cost and quality

Service Level Requirements - together with a clear understanding of the business purpose of the service and the potential risks - will help to ensure that the service is delivered at the appropriate cost. They will also help to avoid 'over sizing' of the service just because budget is available, or 'under sizing' because the business does not understand the manageability requirements of the solution. Either result will cause customer dissatisfaction and even more expense when the solution is re-engineered or retro-fitted to the requirements that should have been specified during Service Design.

Table 3.3 outlines some examples of the characteristics of positions at extreme ends of the cost/quality spectrum. The purpose of this table is to assist organizations in identifying to which extreme they are closer, not to identify real-life positions to which organizations should aspire.

Achieving a balance will ensure delivery of the level of service necessary to meet business requirements at an optimal (as opposed to lowest possible) cost. This will require the following:

A Financial Management process and tools that can account for the cost of providing IT services; and which model alternative methods of delivering services at differing levels of cost. For example, comparing the cost of delivering a service at 98% availability or at 99.9% availability; or the cost of providing a service with or without additional functionality.
Ensuring that decisions around cost versus quality are made by the appropriate managers during Service Strategy and Service Design. IT operational managers are generally not equipped to evaluate business opportunities and should only be asked to make financial decisions that are related to achieving operational efficiencies.

Extreme focus on quality Extreme focus on cost
Primary focus
Delivering the level of quality demanded by the business regardless of what it takes

Meting budget and reducing costs.

Typical problems experienced
Escalating budgets
IT services generally deliver more than is necessary for business success
Escalating demands for higher quality services
IT limits h quality of services based upon their budget available
Escalaions from the business to get more serrice from IT.

Financial Management
IT usually does not have a method of communicating the cost of IT services. Accounting methods are based on an agregrated method (eg., cost of IT per user)

Financial reporting is done purely on budgeted amounts. There is no way of linking activities in IT to the delivery of IT services.

Table 3.3 Examples of extreme focus on quality and cost

3.2.4 Reactive Versus Proactive

A reactive organization is one which does not act unless it is prompted to do so by an external driver, e.g. a new business requirement, an application that has been developed or escalation in complaints made by users and customers. An unfortunate reality in many organizations is the focus on reactive management mistakenly as the sole means to ensure services that are highly consistent and stable, actively discouraging proactive behaviour from operational staff. The unfortunate irony of this approach is that discouraging effort investment in proactive Service Management can ultimately increase the effort and cost of reactive activities and further risk stability and consistency in services.

A proactive organization is always looking for ways to improve the current situation. It will continually scan the internal and external environments, looking for signs of potentially impacting changes. Proactive behaviour is usually seen as positive, especially since it enables the organization to maintain competitive advantage in a changing environment. However, being too proactive can be expensive and can result in staff being distracted. The need for proper balance in reactive and proactive behaviour often achieves the optimal result.

Generally, it is better to manage IT services proactively, but achieving this is not easily planned or achieved. This is because building a proactive IT organization is dependent on many variables, including:

The maturity of the organization. The longer the organization has been delivering a consistent set of IT services, the more likely it is to understand the relationship between IT and the business and the IT Infrastructure and IT services.
The culture of the organization. Some organizations have a culture that is focused on innovation and are more likely to be proactive. Others are more likely to focus on the status quo and as such are likely to resist change and have more reactive focus.
The role that IT plays in the business and the mandate that IT has to influence the strategy and tactics of the business. For example, a company where the CIO is a board member is likely to have an IT organization that is far more proactive and responsive than a company where IT is seen as an administrative overhead.
The level of integration of management processes and tools. Higher levels of integration will facilitate better knowledge of opportunities.
The maturity and scope of Knowledge Management in the organization; this is especially seen in organizations which have been able to store and organize historical data effectively - especially Availability and Problem Management data.

Figure 3.5 Achieving a balance between being too reactive or too proactive
Figure 3.5 Achieving a balance between
being too reactive or too proactive

From a maturity perspective, it is clear that newer organizations will have different priorities and experiences from a more established organization - what is best practice for a mature organization may not suit a younger organization. Therefore an imbalance could result from an organization being either less or more mature. Consider the following:

Less mature organizations (or organizations with newer IT services or technology) will generally be more reactive, simply because they do not know all the variables involved in running their business and providing IT services.
IT staff in newer organizations tend to be generalists because it is unclear exactly what is required to deliver stable IT services to the business.
Incidents and problems in newer organizations are fairly unpredictable because the technology is relatively new and changes quickly.
More mature organizations tend to be more proactive, simply because they have more data and reporting available and know the typical patterns of incidents and workflows. Thus, they forecast exceptions far more easily.
Staff working in mature organizations also generally tend to have more established relationships between IT staff and the business and so can be more proactive about meeting changing business requirements - this is especially true when IT is seen as a strategic component of the business.

While proactive behaviour in Service Operation is generally good, there are also times where reactive behaviour is needed. The role of Service Operation is therefore to achieve a balance between being reactive and proactive. This will require:

Formal Problem Management and Incident Management processes, integrated between Service Operation and Continual Service Improvement.
The ability to be able to prioritize technical faults as well as business demands. This needs to be done during Service Operation, but the mechanisms need to be put in place during Service Strategy and Design. These mechanisms could include incident categorization systems, escalation procedures and tools to facilitate impact assessment for changes.
Data from Configuration and Asset Management to provide data where required, saving projects time and making decisions more accurate.
Ongoing involvement of SLM in Service Operation.

Table 3.4 outlines some examples of the characteristics of positions at extreme ends of the spectrum. The purpose of this table is to assist organizations in identifying to which extreme they are closer, not to identify real-life positions to which organizations should aspire.

Extremely Reactive Extremely Proactive
Primary focus
Responds to business needs and incidents only after they are reported.
Anticipates business requirements before they are reported and problems before they occur

Typical problems experienced
Preparing to deliver new services takes a long time because each project is dealt with as if it is the first
Similar incidents occur again and again, as there is no way of trending them
Staff turnover is high and morale is generally low, as IT staff keep moving from project to project without achieving a lasting, stable set of IT services.
Money is spent before the requirements are stated. In some cases IT purchases items that will never be used because they anticipated the wrong requirements or because the project is stopped
IT staff tend to have been in the organization for a long time and tend to assume that they know the business requirements better than the business does.

Capacity Planning
Wait until there are capacity problems and then purchase surplus capacity to last until the next capacity-related incident
Anticipate capacity problems and spend money on preventing these - even when the scenario is unlikely to happen.

IT Service Continuity Planning
No plans exist until after a major event or disaster
IT Plans focus on recovering key systems, but without ensuring that the business can recover its processes
Over-planning (and over-spending) of IT Recovery options. Usually immediate recovery is provided for most IT services, regardless of their impact or priority.

Change Management
Changes are often not logged, or logged at the last minute as Emergency Changes
Not enough time for proper impact and cost assessments
Changes are poorly tested and controlled, resulting in a high number of incidents
Changes are requested and implemented even when there is no real need, i.e. a significant amount of work done to tix items that are not broken

Table 3.4 Examples of extremely reactive and proactive behaviour

3.3 Providing Services

All Service Operation staff must be fully aware that they are there to 'provide service' to the business. They must provide a timely (rapid response and speedy delivery of requirements), professional and courteous service to allow the business to conduct its own activities - so that the commercial customer's needs are met and the business thrives.

It is important that staff are trained not only in how to deliver and support IT services, but also in the manner in which that service should be provided. For example, staff that are capable and deliver service effectively may still cause significant customer dissatisfaction if they are insensitive or dismissive. Conversely, no amount of being nice to a customer will help if the service is not being delivered.

A critical element of being a proficient service provider is placing as much emphasis on recruiting and training staff to develop competency in dealing with and managing customer relationships and interactions as they do on technical competencies for managing the IT environment.

3.4 Operations Staff Involvement In Service Design And Service Transition

It is extremely important that Service Operation staff are involved in Service Design and Service Transition and potentially also in Service Strategy where appropriate.

One key to achieving balance in Service Operation is an effective set of Service Design processes. These will provide IT Operations Management with:

Clear definition of IT service objectives and performance criteria
Linkage of IT service specifications to the performance of the IT Infrastructure
Definition of operational performance requirements
A mapping of services and technology
The ability to model the effect of changes in technology and changes to business requirements
Appropriate cost models (e.g. customer or service based) to evaluate Return on Investment and cost reduction strategies.

The nature of IT Operations Management involvement should be carefully positioned. Service Design is a phase in the Service Management Lifecycle using a set of processes, not a function independent of Service Operation. As such, many of the people who are involved in Service Design will come from IT Operations Management.

This should not only be encouraged, but Service Operation staff should be measured on their involvement in Service Design activities - and such activities should be included in job descriptions and roles, etc. This will help to ensure continuity between business requirements and technology design and operation and it will also help to ensure that what is designed can also be operated. IT Operations Management staff should also be involved during Service Transition to ensure consistency and to ensure that both stated business and manageability requirements are met.

Resources must be made available for these activities and the time required should be taken into account, as appropriate.

3.5 Operational Health

Many organizations find it helpful to compare the monitoring and control of Service Operation to health monitoring and control. In this sense, the IT Infrastructure is like an organism that has vital life signs that can be monitored to check whether it is functioning normally. This means that it is not necessary to monitor continuously every component of every IT system to ensure that it is functioning.

Operational Health can be determined by isolating a few important 'vital signs' on devices or services that are defined as critical for the successful execution of a Vital Business Function. This could be the bandwidth utilization on a network segment, or memory utilization on a major server. If these signs are within normal ranges, the system is healthy and does not require additional attention. This reduction in the need for extensive monitoring will result in cost reduction and operational teams and departments that are focused on the appropriate areas for service success.

However, as with organisms, it is important to check systems more thoroughly from time to time, to check for problems that do not immediately affect vital signs. For example a disk may be functioning perfectly, but it could be nearing its Mean Time Between Failures (MTBF) threshold. In this case the system should be taken out of service and given a thorough examination or 'health check'. At the same time, it should be stressed that the end result should be the healthy functioning of the service as a whole. This means that health checks on components should be balanced against checks of the 'end-to-end' service. The definition of what needs to be monitored and what is healthy versus unhealthy is defined during Service Design, especially Availability Management and SLM.

.Operational Health is dependent on the ability to prevent incidents and problems by investing in reliable and maintainable infrastructure. This is achieved through good availability design and proactive Problem Management. At the same time, Operational Health is also dependent on the ability to identify faults and localize them effectively so that they have minimal impact on the service. This requires strong (preferably automated) Incident and Problem Management.

The idea of Operational Health has also led to a specialized area called 'Self Healing Systems'. This is an application of Availability, Capacity, Knowledge, Incident and Problem Management and refers to a system that has been designed to withstand the most severe operating conditions and to detect, diagnose and recover from most incidents and Known Errors. Self Healing Systems are known by different names, for example Autonomic Systems, Adaptive Systems and Dynamic Systems. Characteristics of Self Healing Systems include:

Resilience is designed and built into the system, for example multiple redundant disks or multiple processors. This protects the system against hardware failure since it is able to continue operating using the duplicated hardware component.
Software, data and operating system resilience is also designed into the system, for example mirrored databases (where a database is duplicated on a backup device) and disk-striping technology (where individual bits of data are distributed across a disk array - so that a disk failure results in the loss of only a part of data, which can be easily recovered using algorithms).
The ability to shift processing from one physical device to another without any disruption to the service. This could be a response to a failure or because the device is reaching high utilization levels (some systems are designed to distribute processing workloads continuously, to make optimum use of available capacity, which is also known as virtualization).
Built-in monitoring utilities which enable the system to detect events and to determine whether these represent normal operations or not.
A correlation engine (see paragraph 4.1.5.6 on Event Management). This will enable the system to determine the significance of each event and also to determine whether there is any predefined response to that event.
A set of diagnostic tools, such as diagnostic scripts, fault trees and a database of Known Errors and common workarounds. These are used as soon as an error is detected, to determine the appropriate response.
The ability to generate a call for human intervention by raising an alert or generating an incident.

While the concept of Operational Health is not a core concept of Service Operation, it is often a helpful metaphor to assist in determining what needs to be monitored and how frequently to perform preventive maintenance.

What and when to monitor for operational health should be determined in Service Design, tested and refined during Service Transition and optimized in Continual Service Improvement, as necessary.

3.6 Communication

Good communication is needed with other IT teams and departments, with users and internal customers, and between the Service Operation teams and departments themselves. Issues can often be prevented or mitigated with appropriate communication. This section is aimed at summarizing the communication that should take place in Service Operation. This is not a separate process, but a checklist of the type of communication that is required for effective Service Operation.

An important principle is that all communication must have an intended purpose or a resultant action. Information should not be communicated unless there is a clear audience. In addition, that audience should have been actively involved in determining the need for that communication and what they will do with the information.

A detailed description of the types of communication typical in Service Operation is contained in Appendix B of this publication, together with a description of the typical audience and the actions that are intended to be taken as a result of each communication. These include:

Routine operational communication
Communication between shifts
Performance reporting
Communication in projects
Communication related to changes
Communication related to exceptions
Communication related to emergencies
Training on new or customized processes and service desi.ns
Communication of strategy and design to Service Operation teams.

Please note that there is no definitive medium for communication, nor is there a fixed location or frequency. In some organizations communication has to take place in meetings. Other organizations prefer to use e-mail or the communication inherent in their Service Management tools.

There should therefore be a policy around communication within each team or department and for each process. Although this should be formal, the policy should not be cumbersome or complex. For example, a manager might require that all communications regarding changes must be sent by e-mail. As long as this is specified in the department's SOPs (in whatever form they exist), there is no need to create a separate policy for it.

Although the typical content of communication is fairly consistent once processes have been defined, the means of communication are changing with every new introduction of technology. The list of alternatives is growing and, today, includes:

E-mail, to traditional clients or mobile devices
SMS messages
Pagers
Instant messaging and web-based 'chats'
Voice over Internet Protocol (VoIP) utilities that can turn any connected device to an inexpensive communication medium
Teleconference and virtual meeting utilities, which have revolutionized meetings, which are now held across long distances
Document-sharing utilities.
The means of communication itself is outside the scope of this publication. However, the following points should be noted:
Communication is primary and the means of communication must ensure that they serve this goal. For example, the need for secure communication may eliminate the possibility of some of the above means. The need for quality may eliminate some VolP options.
It is possible to use any means of communication as long as all stakeholders understand how and when the communication will take place.

3.6.1 Meetings

Different organizations communicate in different ways. Where organizations are distributed, they will tend to rely on e-mail and teleconferencing facilities. Organizations that have more mature Service Management processes and tools will tend to rely on the tools and processes for communication (e.g. using an Incident Management tool to escalate and track incidents, instead of requesting email or telephone calls for updates).

Other organizations prefer to communicate using meetings. However, it is important not to get into the mode whereby the only time work is done, or management is involved, is during a meeting. Also, faceto-face meetings tend to increase costs (e.g. travel, time spent in informal discussions, refreshments, etc.), so meeting organizers should balance the value of the meeting with the number and identity of the attendees and the time they will spend in, and getting to, the meeting.

The purpose of meetings is to communicate effectively to a group of people about a common set of objectives or activities. Meetings should be well controlled and brief, and the focus should be on facilitating action. A good rule is not to hold a meeting if the information can be communicated effectively by automated means.

A number of factors are essential for successful meetings. Although these may seem to be common sense, they are sometimes neglected:

Establish and communicate a clear agenda to ensure that the meeting achieves its objective and to help the facilitator prevent attendees from 'hijacking' the meeting.
Ensure that the rules for participating are understood. Organizations tend to have a formal set of meeting rules, ranging from relatively informal to very formal (e.g. Roberts Rules of Order).
Make use of 'parking lots' or notes that record issues that are not directly relevant to the purpose of the meeting, but which can be called on if the need for discussion arises.
Minutes of the meeting: rules should be set about when minutes are taken. Minutes are used to remind people who are assigned actions and to track the progress of delegated actions. They are also useful in ensuring that cross-functional decisions and actions are tracked and followed through.
Use techniques to encourage the appropriate level of participation. One technique when discussing improvements, for example, is the 'keep, stop, start' technique. Participants are encouraged to list items that they would like to keep, things that need to be stopped and initiatives or actions that they would like to see started.

Examples of typical meetings are given below:

3.6.1.1 The Operations Meeting

Operations meetings are normally held between the managers of the IT operational departments, teams or groups, at the beginning of each business day or week. The purpose of this type of meeting is to make staff aware of any issue relevant to Operations (such as change schedules, business events, maintenance schedules, etc.) and to provide an opportunity for staff to raise any issues of which they are aware. This is an opportunity to ensure that all departments in a data centre are synchronized. In geographically dispersed organizations it may not be possible to have a single daily Operations meeting. In these cases it is important to coordinate the agenda of the meetings and to ensure that each meeting has two components: 1 The first part of the meeting will cover aspects that apply to the organization as a whole, e.g. new policies, changes that affect all regions and business events that span all regions.

2 The second part of the meeting will cover aspects that apply only to the local region, e.g. local operations schedules, changes to local equipment, etc.

The Operations meeting is usually chaired by the IT Operations Manager or a senior Operations Manager and attended by all managers and supervisors (except those whose shifts are not on duty). It is also helpful to have at least one representative from the Service Desk at the meeting so that they are aware of any situations that could give rise to incidents.

Opportunities to improve services or processes should be captured, if raised, and forwarded to the team responsible for Continual Service Improvement.

3.6.1.2 Department, Group Or Team Meetings

These meetings are essentially the same as the Operations meeting, but are aimed at a single IT department, group or team. Each manager or supervisor relays the information from the Operations meeting that is relevant to their team.

Additionally, these meetings will also cover the following:

A more detailed discussion of incidents, problems and changes that are still being worked on, with information about:

Progress to date
Confirmation of what still needs to be done
Estimated completion times
Request for additional resources, if required
Discussion of potential problems or concerns

Confirmation of staff availability for roster duties
Confirmation of vacation schedules.

3.6.1.3 Customer Meetings

From time to time it will be necessary to hold meetings with customers, apart from the regular Service Level Review meetings. Examples include:

Follow-up after serious incidents. The purpose of these meetings is to repair the relationship with the customers, but also to ensure that IT has all the information required to prevent recurrence. Customers also have the opportunity to provide information about unforeseen business impacts. These meetings are helpful in agreeing actions for similar types of incident that may occur in future.
A customer forum, which can be used for a range of purposes, including testing ideas for new services or solutions, or gathering requirements for new or revised services or procedures. A customer forum is generally a regular meeting with customers to discuss areas of common concern.

3.7 Documentation

IT Operations Management and all of the Technical and Application Management teams and departments are involved in creating and maintaining a range of documents. These are detailed in Chapters 4, 5 and 6 of this publication and include the following:

Participation in the definition and maintenance of process manuals for all processes they are involved in. These will include processes in other phases of the IT Service Management Lifecycle (e.g. Capacity Management, Change Management, Availability Management) as well as for all processes included in the Service Operation phase.
Establishing their own technical procedures manuals. These must be kept up to date and new material must be added as it becomes relevant, under Change Control. It should be remembered that their procedures should always be structured to meet the objectives and constraints defined within higher-level Service Management processes, such as SLM. For example, a technical procedure for managing servers should always ensure that it aims at achieving the availability and performance levels agreed to in the Operational Level Agreements and Service Level Agreements (SLAs).
Participation in the creation and maintenance of planning documents, e.g. the Capacity and Availability Plans and the IT Service Continuity Plans.
Participation in the creation and maintenance of the Service Portfolio. This will include quantifying costs and establishing the operational feasibility of each proposed service.
Participation in the definition and maintenance of Service Management tool work instructions in order to meet reporting requirements.

	Extreme internal focus	Extreme external focus
Primary focus	Performance and management of IT Infrastructure devices, systems and staff, with little regard to the end result on the IT service.	Achieving high levels of IT service performance with little regard to how it is achieved.
Metrics	Focus on technical performance without showing what this means for services Internal metrics (e.g. network uptime) reported to the business instead of service performance metrics.	Focus on External Metrics without showing internal staff how these are derived or how they can be improved Internal staff are expected to devise their own metrics to measure internal performance.
Customer/user experience	High consistency of delivery, but only delivers a percentage of what the business needs. Uses a 'push' approach to delivery, i.e. prefers to have a standard set of services for all business units.	Poor consistency of delivery 'IT consists of good people with good intentions, but cannot always execute' Reactive mode of operation. Uses a 'pull' approach to delivery, i.e. prefers to deliver customized services upon request.
Operations strategy	Standard operations across the board All new services need to fit into the current architecture and procedures.	Multiple delivery teams and multiple technologies New technologies require new operations approaches and often new IT Operations teams.
Procedures and manual	Focus purely on how to manage the technology, not on how its performance relates to IT services	Focuses primarily on what needs to be done and when and less on how this should be achieved.
Cost strategy	Cost reduction achieved purely through technology consolidation Optimization of operational procedures and resources Business impact of cost cutting often only understood later Return on Investment calculations are focused purely on cost savings or 'payback periods'.	Budget allocated on the basis of which business unit is perceived to have the most need Less articulate or vocal business units often have inferior services as there is not enough funding allocated to their services.
Training	Training is conducted as an apprenticeship, where new Operations staff have to learn the way things have to be done, not why	Training is conducted on a project-by-project basis There are no standard training courses since operational procedures and technology are constantly changing.
Operations staff	Specialized staff, organized according to technical specialty Staff work on the false assumption that good technical achievement is the same as good customer service.	Generalist staff, organized partly according to technical capability and partly according to their relationship with a business unit Reliance on 'heroics', where staff go out of their way to resolve problems that could have been prevented by better internal processes.
Table 3.1 Examples of extreme internal and external focus

	Extreme focus on stability	Extreme focus on responsiveness
Primary focus	Technology Developing and refining standard IT management techniques and processes.	Output to the business Agrees to required changes before determining what it will take to deliver them.
Typical problems experienced	IT can demonstrate that it is complying with SOPS and Operational Level Agreements (OLAs), even when there is clear misalignment to business requirements	IT staff are not available to define or execute routine tasks because they are busy on projects for new services
Technology growth strategy	Growth strategy based on analyzing existing demand on existing systems New services are resisted and Business Units sometimes take ownership of ' their own' systems to get access to new services.	Technology purchased for each new business requirement Using multiple technologies and solutions for similar solutions, to meet slightly different business needs.
Technology used to deliver IT services	Existing or standard technology to be used; services must be adjusted to work within existing parameters	Over-provisioning. No attempt is made to model the new Service on the existing infrastructure. New, dedicated technology is purchased for each new project.
Capacity Management	Forecasts based on future business activity for each service individually and do not take into account IT activity or other IT services Existing workloads not relevant.	Forecasts based on projections of current workloads System performance is maintained at consistent levels through tuning and demand management, not by workload forecasting and management.
Table 3.2 Examples of extreme focus on stability and responsiveness

	Extreme focus on quality	Extreme focus on cost
Primary focus	Delivering the level of quality demanded by the business regardless of what it takes	Meting budget and reducing costs.
Typical problems experienced	Escalating budgets IT services generally deliver more than is necessary for business success Escalating demands for higher quality services	IT limits h quality of services based upon their budget available Escalaions from the business to get more serrice from IT.
Financial Management	IT usually does not have a method of communicating the cost of IT services. Accounting methods are based on an agregrated method (eg., cost of IT per user)	Financial reporting is done purely on budgeted amounts. There is no way of linking activities in IT to the delivery of IT services.
Table 3.3 Examples of extreme focus on quality and cost

	Extremely Reactive	Extremely Proactive
Primary focus	Responds to business needs and incidents only after they are reported.	Anticipates business requirements before they are reported and problems before they occur
Typical problems experienced	Preparing to deliver new services takes a long time because each project is dealt with as if it is the first Similar incidents occur again and again, as there is no way of trending them Staff turnover is high and morale is generally low, as IT staff keep moving from project to project without achieving a lasting, stable set of IT services.	Money is spent before the requirements are stated. In some cases IT purchases items that will never be used because they anticipated the wrong requirements or because the project is stopped IT staff tend to have been in the organization for a long time and tend to assume that they know the business requirements better than the business does.
Capacity Planning	Wait until there are capacity problems and then purchase surplus capacity to last until the next capacity-related incident	Anticipate capacity problems and spend money on preventing these - even when the scenario is unlikely to happen.
IT Service Continuity Planning	No plans exist until after a major event or disaster IT Plans focus on recovering key systems, but without ensuring that the business can recover its processes	Over-planning (and over-spending) of IT Recovery options. Usually immediate recovery is provided for most IT services, regardless of their impact or priority.
Change Management	Changes are often not logged, or logged at the last minute as Emergency Changes Not enough time for proper impact and cost assessments Changes are poorly tested and controlled, resulting in a high number of incidents	Changes are requested and implemented even when there is no real need, i.e. a significant amount of work done to tix items that are not broken
Table 3.4 Examples of extremely reactive and proactive behaviour