Service Operations


5. Common Service Operation Activities


Chapter 4 dealt with the processes required for effective Service Operation and Chapter 6 will deal with the organizational aspects. This chapter focuses on a number of operational activities that ensure that technology is aligned with the overall Service and Process objectives. These activities are sometimes described as processes, but in reality they are sets of specialized technical activities all aimed at ensuring that the technology required to deliver and support services is operating effectively and efficiently.

These activities will usually be technical in nature - although the exact technology will vary depending on the type of services being delivered. This publication will focus on the activities required to manage IT.

Important note on managing technology
It is tempting to divorce the concept of Service Management from the management of the infrastructure that is used to deliver those services.

In reality, it is impossible to achieve quality services without aligning and 'gearing' every level of technology (and the people who manage it) to the services being provided. Service Management involves people, process and technology. In other words, the common Service Operation activities are not about managing the technology for the sake of having good technology performance. They are about achieving performance that will integrate the technology component with the people and process components to achieve service and business objectives. See Figure 5.1 for examples of how technology is managed in maturing organizations.

Figure 5.1 Achieving maturity in Technology Management

Figure 5.1 illustrates the steps involved in maturing from a technology-centric organization to an organization that harnesses technology as part of its business strategy. Figure 5.1 further outlines the role of Technology Managers in organizations of differing maturity. The diagram is not comprehensive, but it does provide examples of the way in which technology is managed in each type of organization. The bold headings indicate the major role played by IT in managing technology. The text in the rows describes the characteristics of an IT department at each level.

The purpose of this diagram in this chapter is as follows:

In some cases a dedicated group may handle all of a process or activity while in other cases processes or activities may be shared or split between groups. However, by way of broad guidance, the following sections list the required activities under the functional groups most likely to be involved in their operation. This does not mean that all organizations have to use these divisions. Smaller organizations will tend to assign groups of these activities (if they are needed at all) to single departments, or even individuals.

Finally, the purpose of this chapter is not to provide a detailed analysis of all the activities. They are specialized, and detailed guidance is available from the platform vendors and other, more technical, frameworks; new categories will be added continually as technology evolves. This chapter simply aims to highlight the importance and nature of technology management for Service Management in the IT context.


5.1 Monitoring And Control

The measurement and control of services is based on a continual cycle of monitoring, reporting and subsequent action. This cycle is discussed in detail in this section because it is fundamental to the delivery, support and improvement of services. It is also important to note that, although this cycle takes place during Service Operation, it provides a basis for setting strategy, designing and testing services and achieving meaningful improvement. It is also the basis for SLM measurement. Therefore, although monitoring is performed by Service Operation functions, it should not be seen as a purely operational matter. All phases of the Service Lifecycle should ensure that measures and controls are clearly defined, executed and acted upon.

5.1.1 Definitions
Monitoring refers to the activity of observing a situation to detect changes that happen over time.

In the context of Service Operation, this implies the following:

Reporting refers to the analysis, production and distribution of the output of the monitoring activity.

In the context of Service Operation, this implies the following:

Control refers to the process of managing the utilization or behaviour of a device, system or service. It is important to note, though, that simply manipulating a device is not the same as controlling it. Control requires three conditions:
  • The action must ensure that behaviour conforms to a defined standard or norm
  • The conditions prompting the action must be defined, understood and confirmed
  • The action must be defined, approved and appropriate for these conditions.

In the context of Service Operation, control implies the following:

5.1.2 Monitor Control Loops
Figure 5.2 The Monitor Control Loop

The most common model for defining control is the Monitor Control Loop. Although it is a simple model, it has many complex applications within IT Service Management. This section will define the basic concepts of the Monitor Control Loop Model and subsequent sections will show how important these concepts are for the Service Management Lifecycle.

Figure 5.2 outlines the basic principles of control. A single activity and its output are measured using a predefined norm, or standard, to determine whether it is within an acceptable range of performance or quality. If not, action is taken to rectify the situation or to restore normal performance.

Typically there are two types of Monitor Control Loops:

To help clarify the difference: addressing capacity purely by over-provisioning is an open loop, whereas a load balancer that detects congestion or failure and redirects traffic is a closed loop.
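
The closed loop described above can be sketched in code. This is a minimal illustration rather than a prescribed implementation; the function names, the norm and the tolerance are assumptions chosen for the example.

```python
# Minimal sketch of a closed Monitor Control Loop: measure an activity's
# output, compare it with a predefined norm, and act if it falls outside
# the acceptable range. All names and values here are illustrative.

def monitor_control_loop(measure, norm, tolerance, corrective_action):
    """One pass of the loop: monitor, compare against the norm, control."""
    value = measure()
    if abs(value - norm) > tolerance:
        corrective_action(value)   # closed loop: the loop itself intervenes
        return "corrected"
    return "within norm"

# Example: a load balancer keeping queue depth near a norm of 100 (+/- 20).
status = monitor_control_loop(
    measure=lambda: 140,                       # observed queue depth
    norm=100,
    tolerance=20,
    corrective_action=lambda v: print(f"redirecting traffic (depth={v})"),
)
```

An open loop, by contrast, would apply the action on a fixed schedule regardless of what the measurement showed.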

5.1.2.1 Complex Monitor Control Loop
The Monitor Control Loop in Figure 5.2 is a good basis for defining how Operations Management works, but within the context of ITSM the situation is far more complex. Figure 5.3 illustrates a process consisting of three major activities. Each one has an input and an output, and the output becomes an input for the next activity.

In this diagram, each activity is controlled by its own Monitor Control Loop, using a set of norms for that specific activity. The process as a whole also has its own Monitor Control Loop, which spans all the activities and ensures that all norms are appropriate and are being followed.

Figure 5.3 Complex Monitor Control Loop

In Figure 5.3 there is a double feedback loop. One loop focuses purely on executing a defined standard, and the second evaluates the performance of the process and also the standards whereby the process is executed. An example of this would be if the first set of feedback loops at the bottom of the diagram represented individual stations on an assembly line and the higher-level loop represented Quality Assurance.

The Complex Monitor Control Loop is a good organizational learning tool.

The first level of feedback at individual activity level is concerned with monitoring and responding to data (single facts, codes or pieces of information). The second level is concerned with monitoring and responding to information (a collection of a number of facts about which a conclusion may be drawn). Refer to the Service Transition publication for a full discussion on Data, Information, Knowledge and Wisdom.

All of this is interesting theory, but does not explain how the Monitor Control Loop concept can be used to operate IT services. And especially - who defines the norm? Based on what has been described so far, Monitor Control Loops can be used to manage:

To define how to use the concept of Monitor Control Loops in Service Management, the following questions need to be answered:

The following sections will expand on the concept of Monitor Control Loops and demonstrate how these questions are answered.

5.1.2.2 The ITSM Monitor Control Loop
In ITSM, the complex Monitor Control Loop can be represented as shown in Figure 5.4. Figure 5.4 can be used to illustrate the control of a process or of the components used to deliver a service. In this diagram the word 'activity' implies that it refers to a process. To apply it to a service, an 'activity' could also be a 'CI'. There are a number of significant features in Figure 5.4.

Figure 5.4 ITSM Monitor Control Loop

Service Transition provides a major set of checks and balances in these processes. It does so as follows:

Why is this loop covered under Service Operation?
Figure 5.4 represents Monitoring and Control for the whole of IT Service Management. Some readers of the Service Operation publication may feel that it should be more suitably covered in the Service Strategy publication. However, Monitoring and Control can only effectively be deployed when the service is operational. This means that the quality of the entire set of IT Service Management processes depends on how they are monitored and controlled in Service Operation. The implications of this are as follows:
  • Service Operation staff are not the only people with an interest in what is monitored and how it is controlled.
  • While Service Operation is responsible for monitoring and control of services and components, they are acting as stewards of a very important part of the set of ITSM Monitoring and Control loops.
  • If Service Operation staff define and execute Monitoring and Control procedures in isolation, none of the Service Management processes or functions will be fully effective. This is because the Service Operation functions will not support the priorities and information requirements of the other processes, e.g. attempting to negotiate an SLA when the only data available is page-swap rates on a server and detailed bandwidth utilization of a network.

5.1.2.3 Defining What Needs To Be Monitored
The definition of what needs to be monitored is based on understanding the desired outcome of a process, device or system. IT should focus on the service and its impact on the business, rather than just the individual components of technology. The first question that needs to be asked is 'What are we trying to achieve?'.

5.1.2.4 Internal and External Monitoring and Control
At the outset, it will become clear that there are two levels of monitoring:

The distinction between Internal and External Monitoring is an important one. If Service Operation focuses only on Internal Monitoring, it will have very well-managed infrastructure, but no way of understanding or influencing the quality of services. If it focuses only on External Monitoring, it will understand how poor the service quality is, but will have no idea what is causing it or how to change it.

In reality, most organizations have a combination of Internal and External Monitoring, but in many cases these are not linked. For example, the Server Management team knows exactly how well the servers are performing and the Service Level Manager knows exactly how the users perceive the quality of service provided by the servers. However, neither of them knows how to link these metrics to define what level of server performance represents good quality service. This becomes even more confusing when server performance that is acceptable in the middle of the month, is not acceptable at month-end.

5.1.2.5 Defining Objectives For Monitoring And Control
Many organizations start by asking the question 'What are we managing?'. This will invariably lead to a strong Internal Monitoring System, with very little linkage to the real outcome or service that is required by the business.

The more appropriate question is 'What is the end result of the activities and equipment that my team manages?'. Therefore the best place to start, when defining what to monitor, is to determine the required outcome.

The definition of Monitoring and Control objectives should ideally start with the definition of the Service Level Requirements documents (see Service Design publication). These will specify how the customers and users will measure the performance of the service, and are used as input into the Service Design processes. During Service Design, various processes will determine how the service will be delivered and managed. For example, Capacity Management will determine the most appropriate and cost-effective way to deliver the levels of performance required. Availability Management will determine how the infrastructure can be configured to provide the fewest points of failure.

If there is any doubt about the validity or completeness of objectives, the COBIT framework provides a comprehensive, high-level set of objectives as a checklist. More information on COBIT is provided in Appendix A of this publication. The Service Design process will help to identify the inputs for defining Operational Monitoring and Control norms and mechanisms. For example, Service Design teams will work with customers and users to determine how the output of the service will be measured, including measurement mechanisms, frequency and sampling; this part of Service Design will focus specifically on the Functional Requirements.

All of this means that a very important part of defining what Service Operation monitors and how it exercises control is to identify the stakeholders of each service.

Stakeholders can be defined as anyone with an interest in the successful delivery and receipt of IT services. Each stakeholder will have a different perspective of what it will take to deliver or receive an IT service. Service Operation will need to understand each of these perspectives in order to determine exactly what needs to be monitored and what to do with the output. Service Operation will therefore rely on SLM to define exactly who these stakeholders are and how they contribute to or use the service. This is discussed more fully in the Service Design and Continual Service Improvement publications.

Note on Internal and External Monitoring Objectives
The required outcome could be internal or external to the Service Operation functions, although it should always be remembered that an internal action will often have an external result. For example, consolidating servers to make them easier to manage may result in a cost saving, which will affect the SLM negotiation and review cycle as well as the Financial Management processes.

5.1.2.6 Types Of Monitoring
There are many different types of monitoring tool and different situations in which each will be used. This section focuses on some of the different types of monitoring that can be performed and when they would be appropriate.

Active versus Passive Monitoring

Reactive versus Proactive

Please note that Reactive and Proactive Monitoring could be active or passive, as per Table 5.1.

Reactive and Active: Used to diagnose which device is causing the failure and under what conditions (e.g. 'ping' a device, or run and track a sample transaction through a series of devices). Requires knowledge of the infrastructure topology and the mapping of services to CIs.

Reactive and Passive: Detects and correlates event records to determine the meaning of the events and the appropriate action (e.g. a user logs in three times with an incorrect password, which represents a security exception and is escalated through Information Security Management procedures). Requires detailed knowledge of the normal operation of the infrastructure and services.

Proactive and Active: Used to determine the real-time status of a device, system or service - usually for critical components or following the recovery of a failed device to ensure that it is fully recovered (i.e. is not going to cause further incidents).

Proactive and Passive: Event records are correlated over time to build trends for Proactive Problem Management. Patterns of events are defined and programmed into correlation tools for future recognition.

Table 5.1 Active and Passive, Reactive and Proactive Monitoring
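
As a concrete illustration of the 'active' column, the sketch below probes a component to see whether it responds, in the spirit of 'pinging' a device. It is a simplified TCP reachability check, not a full monitoring agent; in practice the host and port would come from the mapping of services to CIs.

```python
import socket

def active_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Actively probe a component: True if a TCP connection succeeds
    within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Reactive use: probe each device along a failing service path to isolate
# the fault. Proactive use: poll a critical component on a schedule and
# raise an alert as soon as a probe fails.
```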

Continuous Measurement versus Exception-Based Measurement

Performance Versus Output
There is an important distinction between the reporting used to track the performance of the components, teams or departments used to deliver a service and the reporting used to demonstrate the achievement of service quality objectives. IT managers often confuse these by reporting to the business on the performance of their teams or departments (e.g. number of calls taken per Service Desk Analyst) as if that were the same thing as quality of service (e.g. incidents solved within the agreed time).

Performance Monitoring and metrics should be used internally by Service Management to determine whether people, process and technology are functioning correctly and to standard.

Users and customers would rather see reporting related to the quality and performance of the service. Although Service Operation is concerned with both types of reporting, the primary concern of this publication is Performance Monitoring, whereas monitoring of Service Quality (or Output-Based Monitoring) will be discussed in detail in the Continual Service Improvement publication.

5.1.2.7 Monitoring in Test Environments
As with any IT Infrastructure, a Test Environment will need to define how it will use monitoring and control. These controls are more fully discussed in the Service Transition publication.

5.1.2.8 Reporting And Action
'A report alone creates awareness; a report with an action plan achieves results.'

Reporting And Dysfunction
Practical experience has shown that there is more reporting in dysfunctional organizations than in effective organizations. This is because reports are not being used to initiate pre-defined action plans, but rather:
  • to shift the blame for an incident
  • to try to find out who is responsible for making a decision
  • as input to creating action plans for future occurrences.
In dysfunctional organizations a lot of reports are produced which no one has the time to look at or query.

Monitoring without control is irrelevant and ineffective. Monitoring should always be aimed at ensuring that service and operational objectives are being met. This means that unless there is a clear purpose for monitoring a system or service, it should not be monitored.

This also means that when monitoring is defined, so too should any required actions. For example, being able to detect that a major application has failed is not sufficient.

The relevant Application Management team should also have defined the exact steps that it will take when the application fails. In addition, it should also be recognized that action may need to be taken by different people, for example a single event (such as an application failure) may trigger action by the Application Management team (to restore service), the users (to initiate manual processing) and management (to determine how this event can be prevented in future).

The implications of this principle are outlined in more detail in relation to Event Management (see section 4.1).

5.1.2.9 Service Operation Audits
Regular audits must be performed on the Service Operation processes and activities to ensure:

Service Operation Managers may choose to perform such audits themselves, but ideally some form of independent element to the audits is preferable.

The organization's internal IT audit team or department may be asked to be involved or some organizations may choose to engage third-party consultancy/audit/ assessment companies so that an entirely independent expert view is obtained. Service Operation audits are part of the ongoing measurement that takes place as part of Continual Service Improvement and are discussed in more detail in that publication.

5.1.2.10 Measurement, Metrics And KPIs
This section has focused primarily on the monitoring and control as a basis for Service Operation. Other sections of the publication have covered some basic metrics that could be used to measure the effectiveness and efficiency of a process. Although this publication is not primarily about measurement and metrics, it is important that organizations using these guidelines have robust measurement techniques and metrics that support the objectives of their organization. This section is a summary of these concepts.

Measurement

Measurement refers to any technique that is used to evaluate the extent, dimension or capacity of an item in relation to a standard or unit.
  • Extent refers to the degree of compliance or completion (e.g. are all changes formally authorized by the appropriate authority)
  • Dimension refers to the size of an item, e.g. the number of incidents resolved by the Service Desk
  • Capacity refers to the total capability of an item, for example maximum number of standard transactions that can be processed by a server per minute.

Measurement only becomes meaningful when it is possible to measure the actual output or dimensions of a system, function or process against a standard or desired level, e.g. the server must be capable of processing a minimum of 100 standard transactions per minute. This needs to be defined in Service Design, and refined over time through Continual Service Improvement, but the measurement itself takes place during Service Operation.

Metrics

Metrics refer to the quantitative, periodic assessment of a process, system or function, together with the procedures and tools that will be used to make these assessments and the procedures for interpreting them.

This definition is important because it not only specifies what needs to be measured, but also how to measure it, what the acceptable range of performance will be and what action will need to be taken as a result of normal performance or an exception. From this, it is clear that any metric given in the previous section of this publication is a very basic one and will need to be applied and expanded within the context of each organization before it can be effective.

Key Performance Indicators

A KPI refers to a specific, agreed level of performance that will be used to measure the effectiveness of an organization or process.

KPIs are unique to each organization and have to be related to specific inputs, outputs and activities. They are not generic or universal and thus have not been included in this publication.

A further reason for not including them is the fact that similar metrics can be used to achieve very different KPIs. For example, one organization used the metric 'Percentage of Incidents resolved by the Service Desk' to evaluate the performance of the Service Desk. This worked effectively for about two years, after which the IT manager began to realize that this KPI was being used to prevent effective Problem Management, i.e. if, after two years, 80% of all incidents are easy enough to be resolved in 10 minutes on the first call, why have we not come up with a solution for them? In effect, the KPI now became a measure for how ineffective the Problem Management teams were.
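
To make that example concrete, the sketch below computes the metric from a set of hypothetical incident records (the field names are invented for illustration). The same number could feed a Service Desk KPI ('keep this high') or a Problem Management review ('why have these recurring incidents not been eliminated?'); the metric is neutral, the KPI gives it meaning.

```python
# Hypothetical incident records; field names are assumptions for illustration.
incidents = [
    {"id": 1, "resolved_by": "service_desk"},
    {"id": 2, "resolved_by": "second_line"},
    {"id": 3, "resolved_by": "service_desk"},
    {"id": 4, "resolved_by": "service_desk"},
]

def pct_resolved_by_service_desk(records) -> float:
    """Metric: percentage of incidents resolved at the Service Desk."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["resolved_by"] == "service_desk")
    return 100.0 * hits / len(records)

print(pct_resolved_by_service_desk(incidents))  # 75.0 for this sample
```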

5.1.2.11 Interfaces To Other Service Lifecycle Practices
Operational Monitoring and Continual Service Improvement
This section has focused on Operational Monitoring and Reporting, but monitoring also forms the starting point for Continual Service Improvement. This is covered in the Continual Service Improvement publication, but key differences are outlined here. Quality is the key objective of monitoring for Continual Service Improvement (CSI). Monitoring will therefore focus on the effectiveness of a service, process, tool, organization or CI. The emphasis is not on assuring realtime service performance; rather it is on identifying where improvements can be made to the existing level of service, or IT performance.

Monitoring for CSI will therefore tend to focus on detecting exceptions and resolutions. For example, CSI is not as interested in whether an incident was resolved, but whether it was resolved within the agreed time and whether future incidents can be prevented.

CSI is not only interested in exceptions, though. If an SLA is consistently met over time, CSI will also be interested in determining whether that level of performance can be sustained at a lower cost or whether it needs to be upgraded to an even better level of performance. CSI may therefore also need access to regular performance reports.

However, since CSI is unlikely to need, or be able to cope with, the vast quantities of data that are produced by all monitoring activity, they will most likely focus on a specific subset of monitoring at any given time. This could be determined by input from the business or improvements to technology.

This has two main implications:


5.2 IT Operations

5.2.1 Console Management/Operations Bridge
These provide a central coordination point for managing various classes of events, detecting incidents, managing routine operational activities and reporting on the status or performance of technology components.

Observation and monitoring of the IT Infrastructure can occur from a centralized console - to which all system events are routed. Historically, this involved the monitoring of the master operations console of one or more mainframes - but these days is more likely to involve monitoring of a server farm(s), storage devices, network components, applications, databases, or any other CIs, including any remaining mainframe(s), from a single location, known as the Operations Bridge.

There are two theories about how the Operations Bridge was so named. One is that it resembles the bridge of a large, automated ship (such as spaceships commonly seen in science fiction movies). The other theory is that the Operations Bridge represents a link between the IT Operations teams and the traditional Help Desk. In some organizations this means that the functions of Operational Control and the Help Desk were merged into the Service Desk, which performed both sets of duties in a single physical location.

Regardless of how it was named, an Operations Bridge will pull together all of the critical observation points within the IT Infrastructure so that they can be monitored and managed from a centralised location with minimal effort. The devices being monitored are likely to be physically dispersed and may be located in centralized computer installations or dispersed within the user community, or both.

The Operations Bridge will combine many activities, which might include Console Management, event handling, first-line network management, Job Scheduling and out-of-hours support (covering for the Service Desk and/or second-line support groups if they do not work 24/7). In some organizations, the Service Desk is part of the Operations Bridge.

The physical location and layout of the Operations Bridge needs to be carefully designed to give authorized personnel the correct accessibility and visibility of all relevant screens and devices. This will also be a very sensitive area where controlled access and tight security are essential.

Smaller organizations may not have a physical Operations Bridge, but there will certainly still be the need for Console Management, usually combined with other technical roles. For example, a single team of technical staff will manage the network, servers and applications. Part of their role will be to monitor the consoles for those systems - often using virtual consoles so that they can perform the activity from any location. However, it should be noted that these virtual consoles are powerful tools and, if used in insecure locations or over unsecured connections, could represent a significant security threat.

5.2.2 Job Scheduling
IT Operations will perform standard routines, queries or reports delegated to it as part of delivering services; or as part of routine housekeeping delegated by Technical and Application Management teams.

Job Scheduling involves defining and initiating job schedules, using job scheduling software packages, to run batch and real-time work. This will normally involve daily, weekly, monthly, annual and ad hoc schedules to meet business needs.

In addition to the initial design, or periodic redesign, of the schedules, there are likely to be frequent amendments or adjustments to make during which job dependencies have to be identified and accommodated. There will also be a role to play in defining alerts and Exception Reports to be used for monitoring/checking job schedules. Change Management plays an important role in assessing and validating major changes to schedules, as well as creating Standard Change procedures for more routine changes.

Run-time parameters and/or files have to be received (or expedited if delayed) and input - and all run-time logs have to be checked and any failures identified.

If failures do occur, then re-runs will have to be initiated, under the guidance of the appropriate business units, often with different parameters or amended data/file versions. This will require careful communications to ensure correct parameters and files are used.

Many organizations are faced with growing overnight batch schedules which, if they overrun the overnight batch slot, can adversely impact the online daytime services. They are therefore seeking, in conjunction with Capacity Management, ways of making maximum use of overnight capacity and performance. This is where Workload Management techniques can be useful, such as:

Anecdote
One large organization, which was faced with batch overrun/utilization problems, identified that, human nature being what it is, people sought to be 'tidy' and all jobs were being started on the hour or at 15-minute intervals during the hour (i.e. n o'clock, 15 minutes past, half past, 15 minutes to, etc.). By re-scheduling work so that it started as soon as other work finished, and staggering the start times of other work, it was able to gain significant reductions in contention and achieve much quicker overall processing, which resolved its problems without the need for upgrades.

Job Scheduling has become a highly sophisticated activity, including any number of variables - such as time-sensitivity, critical and non-critical dependencies, workload balancing, failure and resubmission, etc. As a result, most operations rely on Job Scheduling tools that allow IT Operations to schedule jobs for the optimal use of technology to achieve Service Level Objectives.
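
The dependency handling such tools provide can be illustrated with Python's standard graphlib: given jobs and their prerequisites, a topological sort yields a valid run order, and jobs with no mutual dependency can start as soon as their predecessors finish rather than on a tidy clock boundary. The job names here are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each job maps to the set of jobs that must complete before it can start.
batch_jobs = {
    "extract_orders": set(),
    "extract_stock": set(),
    "load_warehouse": {"extract_orders", "extract_stock"},
    "nightly_report": {"load_warehouse"},
}

# A topological sort lists every job after all of its prerequisites.
run_order = list(TopologicalSorter(batch_jobs).static_order())
# Both extracts may run in parallel; load_warehouse waits for both,
# and nightly_report runs last.
```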

The latest generation of scheduling tools allows for a single toolset to schedule and automate technical activities and Service Management process activities (such as Change Scheduling). While this is a good opportunity for improving efficiency, it also represents a greater single point of failure. Organizations using this type of tool therefore still use point solutions as agents and also as a backup in case the main toolset fails.

5.2.3 Backup and Restore
Backup and Restore is essentially a component of good IT Service Continuity Planning. As such, Service Design should ensure that there are solid backup strategies for each service and Service Transition should ensure that these are properly tested. In addition, regulatory requirements specify that certain types of organization (such as Financial Services or listed companies) must have a formal Backup and Restore strategy in place and that this strategy is executed and audited. The exact requirements will vary from country to country and by industry sector. This should be determined during Service Design and built into the service functionality and documentation.

The only point of taking backups is that they may need to be restored at some point. For this reason it is not as important to define how to back a system up as it is to define what components are at risk and how to effectively mitigate that risk. There are any number of tools available for Backup and Restore, but it is worth noting that features of storage technologies used for business data are being used for backup/restore (e.g. snapshots). There is therefore an increasing degree of integration between Backup and Restore activities and those of Storage and Archiving (see section 5.6).

5.2.3.1 Backup
The organization's data has to be protected and this will include backup (copying) and storage of data in remote locations where it can be protected - and used should it need to be restored due to loss, corruption or implementation of IT Service Continuity Plans.

An overall backup strategy must be agreed with the business, covering:

There is also a need to procure and manage the necessary media (disks, tapes, CDs, etc.) to be used for backups, so that there is no shortage of supply.

Where automated devices are being used, pre-loading of the required media will be needed in advance. When loading and clearing media returned from off-site storage it is important that there is a procedure for verifying that these are the right ones. This will prevent the most recent backup being overwritten with faulty data, and then having no valid data to restore. After successful backups have been taken, the media must be removed for storage.

The actual initiation of the backups might be automated, or carried out from the Operations Bridge.

Some organizations may utilize Operations staff to perform the physical transportation and racking of backup copies to/from remote locations, while in other cases this may be handed over to other groups such as internal security staff or external contractors.

If backups are being automated or performed remotely, then Event Monitoring capabilities should be considered so that any failures can be detected early and rectified before they cause problems. In such cases IT Operations has a role to play in defining alerts and escalation paths.
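Defining alerts and escalation paths for automated backups might look like the following sketch: a single failed job raises a warning, while repeated failures for the same client escalate. The record format and thresholds are invented for illustration:

```python
def backup_alerts(results, escalate_after=2):
    """Turn automated backup job results into alerts.
    `results` is a sequence of (client, status) pairs where status is
    'ok' or 'failed'. A first failure raises WARN; repeated failures
    for the same client raise ESCALATE."""
    failures = {}
    alerts = []
    for client, status in results:
        if status == "failed":
            failures[client] = failures.get(client, 0) + 1
            level = "ESCALATE" if failures[client] >= escalate_after else "WARN"
            alerts.append((level, client))
    return alerts
```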

In all cases, IT Operations staff must be trained in backup (and restore) procedures - which must be well documented in the organization's IT Operations Procedures Manual. Any specific requirements or targets should be referenced in OLAs or UCs where appropriate, while any user or customer requirements or activity should be specified in the appropriate SLA.

5.2.3.2 Restore
A restore can be initiated from a number of sources, ranging from an event that indicates data corruption, through to a Service Request from a user or customer logged at the Service Desk. A restore may be needed in the case of:

The steps to be taken will include:
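A typical restore can be sketched as follows - a hedged illustration only, with the checksum step standing in for whatever integrity verification the organization's documented procedures require. The `target` dictionary is a stand-in for the restored filesystem:

```python
import hashlib

def restore(backup_bytes, expected_sha256, target):
    """Verify a backup copy's integrity against its recorded checksum
    before writing it back, and report the outcome so the Service Desk
    can update the originating request."""
    digest = hashlib.sha256(backup_bytes).hexdigest()
    if digest != expected_sha256:
        # never restore data that fails verification
        return "aborted: checksum mismatch"
    target["data"] = backup_bytes
    return "restored and verified"
```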

5.2.4 Print and Output
Many services consist of generating and delivering information in printed or electronic form. Ensuring the right information gets to the right people, with full integrity, requires formal control and management.

Print (physical) and Output (electronic) facilities and services need to be formally managed because:

Many organizations will have centralized bulk printing requirements which IT Operations must handle. In addition to the physical loading and re-loading of paper and the operation and care of the printers, other activities may be needed, such as:

5.3 Mainframe Management

Mainframes are still widely in use and have well-established and mature practices. Mainframes form the central component of many services, and their performance will therefore set a baseline for service performance and for user or customer expectations, even though users may never know that they are using a mainframe.

The ways in which mainframe management teams are organized are quite diverse. In some organizations Mainframe Management is a single, highly specialized team that manages all aspects from daily operations through to system engineering. In other organizations, the activities are performed by several teams or departments, with engineering and third-level support being provided by one team and daily operations being combined with the rest of IT Operations (and very probably managed through the Operations Bridge).

Typically, the following activities are likely to be undertaken:

5.4 Server Management And Support

Servers are used in most organizations to provide flexible and accessible services, from hosting applications or databases and running client/server services to Storage, Print and File Management. Successful management of servers is therefore essential for successful Service Operation.

The procedures and activities which must be undertaken by the Server Team(s) or department(s) - separate teams may be needed where different server types are used (UNIX, Wintel, etc.) - include:

5.5 Network Management

As most IT services are dependent on connectivity, Network Management will be essential to deliver services and also to enable Service Operation staff to access and manage key service components.

Network Management will have overall responsibility for all of the organization's own Local Area Networks (LANs), Metropolitan Area Networks (MANs) and Wide Area Networks (WANs) - and will also be responsible for liaising with third-party network suppliers.

Note on managing VoIP as a service
Many organizations have experienced performance and availability problems with their VoIP solutions, in spite of the fact that there seems to be more than adequate bandwidth available. This results in dropped calls and poor sound quality. This is usually because of variations in bandwidth utilization during the call, which is often the result of utilization of the network by other users, applications or other web activity. This has led to the differentiation between measuring the bandwidth available to initiate a call (Service Access Bandwidth - or SAB) and the amount of bandwidth that must be continuously available during the call (Service Utilization Bandwidth - or SUB). Care should be taken in differentiating between these when designing, managing or measuring VoIP services.
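The SAB/SUB distinction above can be illustrated with a simple admission check: a new call needs the Service Access Bandwidth available now to set up, and the Service Utilization Bandwidth continuously for its duration. All figures here are invented for illustration and do not come from any codec specification:

```python
def can_admit_call(link_kbps, active_calls, sub_kbps, sab_kbps):
    """Decide whether a link can admit a new VoIP call.
    Bandwidth already committed to active calls (their SUB) is
    subtracted first; the new call needs SAB to set up and SUB
    thereafter, so the larger of the two must be free."""
    reserved = active_calls * sub_kbps   # continuously committed bandwidth
    free = link_kbps - reserved
    return free >= max(sab_kbps, sub_kbps)
```

This is why a link can look comfortably provisioned on average yet still drop calls: the instantaneous free bandwidth, not the nominal capacity, decides admission.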

The role of the Network Management team or department will include the following activities:

Network Management is also often responsible, often in conjunction with Desktop Support, for remote connectivity issues such as dial-in, dial-back and VPN facilities provided to home-workers, remote workers or suppliers.

Some Network Management teams or departments will also have responsibility for voice/telephony, including the provision and support for exchanges, lines, ACD, statistical software packages etc. and for Voice over Internet Protocol (VoIP) and Remote Monitoring (RMon) systems.

At the same time, many organizations see VoIP and telephony as specialized areas and have teams dedicated to managing this technology. Their activities will be similar to those described above.

5.6 Storage And Archive

Many services require the storage of data for a specific time and also for that data to be available off-line for a certain period after it is no longer used. This is often due to regulatory or legislative requirements, but also because history and audit data are invaluable for a variety of purposes, including marketing, product development, forensic investigations, etc.

A separate team or department may be needed to manage the organization's data storage technology such as:

5.7 Database Administration

Database Administration must work closely with key Application Management teams or departments - and in some organizations the functions may be combined or linked under a single management structure. Organizational options include:

Database Administration works to ensure the optimal performance, security and functionality of databases that they manage. Database Administrators typically have the following responsibilities:

5.8 Directory Services Management

A Directory Service is a specialized software application that manages information about the resources available on a network and which users have access to. It is the basis for providing access to those resources and for ensuring that unauthorized access is detected and prevented (see section 4.5 for detailed information on Access Management).

Directory Services views each resource as an object of the Directory Server and assigns it a name. Each name is linked to the resource's network address, so that users don't have to memorize confusing and complex addresses.

Directory Services is based on the OSI's X.500 standards and commonly uses protocols such as Directory Access Protocol (DAP) or Lightweight Directory Access Protocol (LDAP). LDAP is used to support user credentials for application login and often includes internal and external user/customer data which is especially good for extranet call logging. Since LDAP is a critical operational tool, and generally kept up to date, it is also a good source of data and verification for the CMS.
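The core idea - objects are named, and names resolve to network addresses so users need not remember the addresses - can be sketched with a toy in-memory directory. This is an illustrative stand-in for an X.500/LDAP directory, not a real LDAP client, and the entry names and attributes are invented:

```python
class MiniDirectory:
    """Toy directory service: named objects map to network addresses
    plus arbitrary attributes, mimicking the name -> address resolution
    an X.500/LDAP directory provides."""
    def __init__(self):
        self._entries = {}

    def register(self, name, address, **attrs):
        # each resource becomes a named object linked to its address
        self._entries[name] = {"address": address, **attrs}

    def lookup(self, name):
        entry = self._entries.get(name)
        return entry["address"] if entry else None
```

A user asks for `cn=print01,ou=printers` rather than memorizing `10.0.4.17`; the directory does the translation.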

Directory Services Management refers to the process that is used to manage Directory Services. Its activities include:

5.9 Desktop Support

As most users access IT services using desktop or laptop computers, it is key that these are supported to ensure the agreed levels of availability and performance of services. Desktop Support will have overall responsibility for all of the organization's desktop and laptop computer hardware, software and peripherals. Specific responsibilities will include:

5.10 Middleware Management

Middleware is software that connects or integrates software components across distributed or disparate applications and systems. Middleware enables the effective transfer of data between applications, and is therefore key to services that are dependent on multiple applications or data sources.

A variety of technologies are currently used to support program-to-program communication, such as object request brokers, message-oriented middleware, remote procedure calls and point-to-point web services. Newer technologies are emerging all the time, for example Enterprise Service Bus (ESB), which enables programs, systems and services to communicate with each other regardless of the architecture and origin of the applications. This is especially being used in the context of deploying Service Oriented Architectures (SOAs).
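The decoupling that message-oriented middleware and an ESB provide can be sketched with a minimal in-process message bus: producers publish to a topic and consumers receive from it, with neither side knowing the other's location or architecture. This is a deliberately simplified illustration; real middleware adds persistence, transactions, routing and transformation:

```python
from collections import defaultdict, deque

class MessageBus:
    """Minimal message-oriented middleware sketch: topic-based
    queues that decouple producing and consuming programs."""
    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, topic, message):
        self._queues[topic].append(message)

    def consume(self, topic):
        # returns the oldest undelivered message, or None if empty
        q = self._queues[topic]
        return q.popleft() if q else None
```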

Middleware Management can be performed as part of an Application Management function (where it is dedicated to a specific application) or as part of a Technical Management function (where it is viewed as an extension to the Operating System of a specific platform).

Functionality provided by middleware includes:

Middleware Management is the set of activities that are used to manage middleware. These include:

5.11 Internet/Web Management

Many organizations conduct much of their business through the Internet and are therefore heavily dependent upon the availability and performance of their websites. In such cases a separate Internet/Web Support team or department will be desirable and justified.

The responsibilities of such a team or department incorporate both Intranet and Internet and are likely to include:

5.12 Facilities and Data Centre Management

Facilities Management refers to the management of the physical environment of IT Operations, usually located in Data Centres or computer rooms. This is a vast and complex area and this publication will provide an overview of its key role and activities. A more detailed overview is contained in Appendix E.

In many respects Facilities Management could be viewed as a function in its own right. However, because this publication is focused on where IT Operations are housed, it will cover Facilities Management specifically as it relates to the management of Data Centres and as a subset of the IT Operations Management function.

Important note regarding Data Centres
Data Centres are generally specialized facilities and, while they use and benefit from generic Facilities Management disciplines, they need to adapt these. For example layout, heating and conditioning, power planning and many other aspects are all managed uniquely in Data Centres.

This means that, although Data Centres may be facilities owned by an organization, they are better managed under the authority of IT Operations, although there may be a functional reporting line between IT and the department that manages other facilities for the organization.

The main components of Facilities Management are as follows:

5.12.1 Data Centre Strategies
Managing a Data Centre is far more than hosting an open space where technical groups install and manage equipment, using their own approaches and procedures. It requires an integrated set of processes and procedures involving all IT groups at every stage of the ITSM Lifecycle. Data Centre operations are governed by strategic and design decisions for management and control and are executed by operators. This requires a number of key factors to be put in place:

5.13 Information Security Management And Service Operation

Information Security Management as a process is covered in the ITIL Service Design publication. Information Security Management has overall responsibility for setting policies, standards and procedures to ensure the protection of the organization's assets, data, information and IT services. Service Operation teams play a role in executing these policies, standards and procedures and will work closely with the teams or departments responsible for Information Security Management.

Service Operation teams cannot take ownership of Information Security Management, as this would represent a conflict. There needs to be segregation of roles between the groups defining and managing the process and the groups executing specific activities as part of ongoing operation. This will help protect against breaches to security measures, as no single individual should have control over two or more phases of a transaction or operation. Information Security Management should assign responsibilities to ensure a cross-check of duties.

The role of Service Operation teams is outlined next.

5.13.1 Policing And Reporting
This will involve Operations staff performing specific policing activities such as checking system journals, logs and event/monitoring alerts, intrusion detection, and/or reporting of actual or potential security breaches. This is done in conjunction with Information Security Management to provide a check-and-balance system to ensure effective detection and management of security issues.

Service Operation staff are often first to detect security events and are in the best position to be able to shut down and/or remove access to compromised systems.

Particular attention will be needed in the case of third-party organizations that require physical access into the organization. Service Operation staff may be required to escort visitors into sensitive areas and/or control their access. They may also have a role to play in controlling network access to third parties, such as hardware maintainers dialling in for diagnostic purposes, etc.

5.13.2 Technical Assistance
Some technical support may need to be provided to IT Security staff to assist in investigating security incidents and assist in production of reports or in gathering forensic evidence for use in disciplinary action or criminal prosecutions.

Technical advice and assistance may also be needed regarding potential security improvements (e.g. setting up appropriate firewalls or access/password controls).

The use of event, incident, problem and configuration management information can be relied on to provide accurate chronologies of security-related investigations.

5.13.3 Operational Security Control
For operational reasons, technical staff will often need to have privileged access to key technical areas (e.g. root system passwords, physical access to Data Centres or communications rooms etc). It is therefore essential that adequate controls and audit trails are kept of all such privileged activities so as to deter and detect any security events.

Physical controls need to be in place for all secure areas with logging in-out of all staff. Where third-party staff or visitors need access, it may be Service Operation staff that are responsible for escorting and managing the movement of such personnel.

In the case of privileged systems access, this needs to be restricted to only those people whose need to access the system has been verified - and withdrawn immediately when that need no longer exists. An audit trail must be maintained of who has had access and when, and of all activities performed using those access levels.
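The requirements above - access restricted to verified staff, withdrawn when no longer needed, with an audit trail of who had access and what was done - can be sketched as follows. The structure and names are illustrative, not a prescription for any particular access-control product:

```python
from datetime import datetime, timezone

class PrivilegedAccess:
    """Sketch of privileged-access control with an audit trail.
    Every grant, use (or denial) and revocation is recorded."""
    def __init__(self):
        self.granted = set()
        self.audit = []   # (timestamp, action, who)

    def _log(self, action, who):
        self.audit.append(
            (datetime.now(timezone.utc).isoformat(), action, who))

    def grant(self, who):
        self.granted.add(who)
        self._log("grant", who)

    def use(self, who, activity):
        if who not in self.granted:
            self._log("denied", who)     # attempts are audited too
            return False
        self._log(f"used:{activity}", who)
        return True

    def revoke(self, who):
        self.granted.discard(who)
        self._log("revoke", who)
```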

5.13.4 Screening And Vetting
All Service Operation staff should be screened and vetted to a security level appropriate to the organization in question. Suppliers and third-party contractors should also be screened and vetted - both the organizations and the specific personnel involved. Many organizations have started using police or government agency background checks, especially where contractors will be working with classified systems. Where necessary, appropriate non-disclosure and confidentiality agreements must be put in place.

5.13.5 Training And Awareness
All Service Operation staff should be given regular and ongoing training and awareness of the organization's security policy and procedures. This should include details of disciplinary measures in place. In addition, any security requirements should be specified in the employee's contract of employment.

5.13.6 Documented policies and procedures
Service Operation documented procedures must include all relevant information relating to security issues - extracted from the organization's overall security policy documents. Consideration should be given to the use of handbooks to assist in getting the security messages out to all relevant staff.

5.14 Improvement Of Operational Activities

All Service Operation staff should be constantly looking for areas in which process improvements can be made to achieve higher IT service quality and/or to perform activities in a more cost-effective way. This might include some of the following activities.

5.14.1 Automation Of Manual Tasks
Any tasks that have to be carried out manually, particularly those that have to be regularly repeated, are likely to be more time-consuming, costly and error-prone than those that can be systematized and automated. All tasks should be examined for potential automation to reduce effort and costs and to minimize potential errors.

A judgement must be made on the costs of the automation and the likely benefits that will occur.
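That judgement can be made concrete with a rough payback calculation - how many months until the automation's cost is recovered by the staff time it saves. All figures below are illustrative assumptions:

```python
def payback_months(automation_cost, manual_minutes, runs_per_month,
                   hourly_rate):
    """Months until an automation pays for itself.
    `manual_minutes` is the staff time per manual run; returns None
    if there is no monthly saving."""
    monthly_saving = (manual_minutes / 60) * runs_per_month * hourly_rate
    if monthly_saving <= 0:
        return None
    return automation_cost / monthly_saving
```

For example, a task taking 30 minutes, run 100 times a month at a notional 40 per hour staff cost, saves 2,000 a month; a 6,000 automation effort pays back in three months.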

5.14.2 Reviewing Makeshift Activities Or Procedures
Because of the pragmatic nature of Service Operation, it may sometimes arise that makeshift activities or processes are introduced to address short-term operational expediencies. There is a danger that such practices can be continued and become the 'norm' - leading to ongoing inefficiencies. Where any makeshift activities or procedures do have to be introduced it is important that these are reviewed as soon as the immediate expediency is overcome - and either dispensed with or replaced with efficient agreed processes for the longer term.

5.14.3 Operational Audits
Regular audits should be conducted of all Service Operation processes to ensure that they are working satisfactorily.

5.14.4 Using Incident and Problem Management
Problem and Incident Management provide a rich source of operational improvement opportunities. These processes are discussed in detail in Chapter 4 of this publication.

5.14.5 Communication
It should go without saying that good communication about changing requirements, technology and processes will result in improvement in Service Operation. However, communication is often neglected. Service Operation improvement is dependent on formal and regular communication between teams responsible for design, support and operation of services.

5.14.6 Education And Training
Service Operation teams should understand the importance of what they do on a daily basis. Education is required to ensure that staff understand what business functions or services are supported by their activities. This will encourage greater care and attention to detail and will also help Service Operation teams to better identify business priorities.

Training programmes should ensure that all staff have the appropriate skills for the technology or applications that they are managing. Training should always be provided when new technology is introduced, or when existing technology is changed.
