Service Design

Service Design Process

4.5 IT Service Continuity Management

4.5.1 Purpose, Goals and Objectives

'The goal of ITSCM is to support the overall Business Continuity Management process by ensuring that the required IT technical and service facilities (including computer systems, networks, applications, data repositories, telecommunications, environment, technical support and Service Desk) can be resumed within required, and agreed, business timescales.'

As technology is a core component of most business processes, continued or high availability of IT is critical to the survival of the business as a whole. This is achieved by introducing risk reduction measures and recovery options. Like all elements of ITSM, successful implementation of ITSCM can only be achieved with senior management commitment and the support of all members of the organization. Ongoing maintenance of the recovery capability is essential if it is to remain effective. The purpose of ITSCM is to maintain the necessary ongoing recovery capability within the IT services and their supporting components.

The objectives of ITSCM are to:

Maintain a set of IT Service Continuity Plans and IT recovery plans that support the overall Business Continuity Plans (BCPs) of the organization
Complete regular Business Impact Analysis (BIA) exercises to ensure that all continuity plans are maintained in line with changing business impacts and requirements
Conduct regular Risk Analysis and Management exercises, particularly in conjunction with the business and the Availability Management and Security Management processes, that manage IT services within an agreed level of business risk
Provide advice and guidance to all other areas of the business and IT on all continuity- and recovery-related issues
Ensure that appropriate continuity and recovery mechanisms are put in place to meet or exceed the agreed business continuity targets
Assess the impact of all changes on the IT Service Continuity Plans and IT recovery plans
Ensure that proactive measures to improve the availability of services are implemented wherever it is cost-justifiable to do so
Negotiate and agree on the necessary contracts with suppliers for the provision of the needed recovery capability to support all continuity plans in conjunction with the Supplier Management process.

4.5.2 Scope

ITSCM focuses on those events that the business considers significant enough to be considered a disaster. Less significant events will be dealt with as part of the Incident Management process. What constitutes a disaster will vary from organization to organization. The impact of a loss of a business process, such as financial loss, damage to reputation or regulatory breach, is measured through a BIA exercise, which determines the minimum critical requirements. The specific IT technical and service requirements are supported by ITSCM. The scope of ITSCM within an organization is determined by the organizational structure, culture and strategic direction (both business and technology) in terms of the services provided and how these develop and change over time.

ITSCM primarily considers the IT assets and configurations that support the business processes. If (following a disaster) it is necessary to relocate to an alternative working location, provision will also be required for items such as office and personnel accommodation, copies of critical paper records, courier services and telephone facilities to communicate with customers and third parties.

The scope will need to take into account the number and location of the organization's offices and the services performed in each.

ITSCM does not usually directly cover longer-term risks such as those from changes in business direction, diversification, restructuring, major competitor failure, and so on. While these risks can have a significant impact on IT service elements and their continuity mechanisms, there is usually time to identify and evaluate the risk and include risk mitigation through changes or shifts in business and IT strategies, thereby becoming part of the overall business and IT Change Management programme.

Similarly, ITSCM does not usually cover minor technical faults (for example, a non-critical disk failure), unless there is a possibility that the fault could have a major impact on the business. These risks would be expected to be covered mainly through the Service Desk and the Incident Management process, or resolved through the planning associated with the processes of Availability Management, Problem Management, Change Management, Configuration Management and 'business as usual' operational management.

The ITSCM process includes:

The agreement of the scope of the ITSCM process and the policies adopted.
Business Impact Analysis (BIA) to quantify the impact loss of IT service would have on the business.
Risk Analysis (RA) - the risk identification and risk assessment to identify potential threats to continuity and the likelihood of the threats becoming reality. This also includes taking measures to manage the identified threats where this can be cost-justified.
Production of an overall ITSCM strategy that must be integrated into the BCM strategy. This can be produced following the two steps identified above, and is likely to include elements of risk reduction as well as selection of appropriate and comprehensive recovery options.
Production of an ITSCM plan, which again must be integrated with the overall BCM plans.
Testing of the plans.
Ongoing operation and maintenance of the plans.

4.5.3 Value to the Business

ITSCM plays an invaluable role in supporting the Business Continuity Planning process. In many organizations, ITSCM is used to raise awareness of continuity and recovery requirements and is often used to justify and implement a Business Continuity Planning process and Business Continuity Plans. ITSCM should be driven by business risk as identified by Business Continuity Planning, and should ensure that the recovery arrangements for IT services are aligned to identified business impacts, risks and needs.

4.5.4 Policies, Principles and Basic Concepts

Figure 4.21 Lifecycle of Service Continuity Management
(Figure not reproduced. Key activities shown: policy setting; scope; initiate a project; Business Impact Analysis; risk assessment; IT Service Continuity Strategy; develop IT Service Continuity Plans; develop IT plans, recovery plans and procedures; organization planning; testing strategy; education, awareness and training; review and audit; testing; Change Management.)

A lifecycle approach should be adopted to the setting up and operation of an ITSCM process. Figure 4.21 shows the lifecycle of ITSCM, from initiation through to continual assurance that the protection provided by the plan is current and reflects all changes to services and service levels. ITSCM is a cyclic process through the lifecycle to ensure that once service continuity and recovery plans have been developed they are kept aligned with Business Continuity Plans (BCPs) and business priorities. Figure 4.21 also shows the role that BCM plays within the ITSCM process.

Initiation and requirements stages are principally BCM activities. ITSCM should only be involved in these stages to support the BCM activities and to understand the relationship between the business processes and the impacts caused on them by loss of IT service. As a result of these initial BIA and Risk Analysis activities, BCM should produce a Business Continuity Strategy, and the first real ITSCM task is to produce an ITSCM strategy that underpins the BCM strategy and its needs.

The Business Continuity Strategy should principally focus on business processes and associated issues (e.g. business process continuity, staff continuity, buildings continuity). Once the Business Continuity Strategy has been produced, and the role that IT services have to play within the strategy has been determined, an ITSCM strategy can be produced that supports and enables the Business Continuity Strategy. This ensures that cost-effective decisions can be made, considering all the 'resources' needed to deliver a business process. Failure to do this tends to encourage ITSCM options that are faster, more elaborate and more expensive than are actually needed.

The activities to be considered during initiation depend on the extent to which continuity facilities have been applied within the organization. Some parts of the business may have established individual Business Continuity Plans based around manual work-arounds, and IT may have developed continuity plans for systems perceived to be critical. This is good input to the process. However, effective ITSCM depends on supporting critical business functions. The only way of implementing effective ITSCM is through the identification of critical business processes and the analysis and coordination of the required technology and supporting IT services.

This situation may be even more complicated in outsourcing situations where an ITSCM process within an external service provider or outsourcer organization has to meet the needs not only of the customer BCM process and strategy, but also of the outsourcer's own BCM process and strategy. These needs may be in conflict with one another, or may conflict with the BCM needs of one of the other outsourcing organization's customers.

However, in many organizations BCM is absent or has very little focus, and often ITSCM is required to fulfill many of the requirements and activities of BCM. The rest of this section assumes that ITSCM has had to perform many of the activities required by BCM. Where a BCM process is established with Business Continuity Strategies and Plans in place, these documents should provide the focus and drive for establishing ITSCM.

4.5.5 Process Activities, Methods and Techniques

The following sections contain details of each of the stages within the ITSCM lifecycle.

HCI Observation

The approach for this process differs from that in other sections. Lifecycle is defined in ITIL as "A series of states connected by allowable transitions in the life of an IT Service". In other process descriptions (e.g. capacity, availability, etc.) the IT service is an application or system. The commentary is NOT directed at implementing an availability or capacity "service". In this section "lifecycle" seems to be defined as the total ITSCM process itself rather than its invocation and operational elements culminating in the restoration of the service. These are covered under the "Implementation" section. It's really about consistency, as other sections might well benefit from Initiation and Requirements and Strategy sections (the latter included in the ITIL version 3 Service Strategy book).

This demonstrates the difficulties often encountered in coordinating a common approach amongst a set of different writers. Deviations from agreed-upon formats sometimes happen. It is the job of the "Chief Architect" to ensure such deviations do not stray too far afield.

4.5.5.1 Stage 1 - Initiation

The initiation process covers the whole of the organization and consists of the following activities:

Policy setting - this should be established and communicated as soon as possible so that all members of the organization involved in, or affected by, Business Continuity issues are aware of their responsibilities to comply with and support ITSCM. As a minimum, the policy should set out management intention and objectives
Specify terms of reference and scope - this includes defining the scope and responsibilities of all staff in the organization. It covers such tasks as undertaking a Risk Analysis and Business Impact Analysis and determination of the command and control structure required to support a business interruption. There is also a need to take into account such issues as outstanding audit points, regulatory or client requirements and insurance organization stipulations, and compliance with standards such as ISO 27001, the Standard on Information Security Management, which also addresses Service Continuity requirements
Allocate resources - the establishment of an effective Business Continuity environment requires considerable resource in terms of both money and manpower. Depending on the maturity of the organization, with respect to ITSCM, there may be a requirement to familiarize and/or train staff to accomplish the Stage 2 tasks. Alternatively, the use of experienced external consultants may assist in completing the analysis more quickly. However, it is important that the organization can then maintain the process going forward without the need to rely totally on external support
Define the project organization and control structure - ITSCM and BCM projects are potentially complex and need to be well organized and controlled. It is strongly advisable to use a recognized standard project planning methodology such as PRojects IN Controlled Environments (PRINCE2®) or Project Management Body Of Knowledge (PMBOK®)
Agree project and quality plans - plans enable the project to be controlled and variances addressed. Quality plans ensure that the deliverables are achieved and to an acceptable level of quality. They also provide a mechanism for communicating project resource requirements and deliverables, thereby obtaining 'buy-in' from all necessary parties.

4.5.5.2 Stage 2 - Requirements and Strategy

Ascertaining the business requirements for IT service continuity is a critical component in order to determine how well an organization will survive a business interruption or disaster and the costs that will be incurred. If the requirements analysis is incorrect, or key information has been missed, this could have serious consequences on the effectiveness of ITSCM mechanisms.

This stage can effectively be split into two sections:

Requirements - perform Business Impact Analysis and risk assessment
Strategy - following the requirements analysis, the strategy should document the required risk reduction measures and recovery options to support the business.

Requirements - Business Impact Analysis
The purpose of a Business Impact Analysis (BIA) is to quantify the impact to the business that loss of service would have. This impact could be a 'hard' impact that can be precisely identified - such as financial loss - or 'soft' impact - such as public relations, morale, health and safety or loss of competitive advantage. The BIA will identify the most important services to the organization and will therefore be a key input to the strategy.

The BIA identifies:

The form that the damage or loss may take - for example:

Lost income
Additional costs
Damaged reputation
Loss of goodwill
Loss of competitive advantage
Breach of law, health and safety
Risk to personal safety
Immediate and long-term loss of market share
Political, corporate or personal embarrassment
Loss of operational capability - for example, in a command and control environment

How the degree of damage or loss is likely to escalate after a service disruption, and the times of the day, week, month or year when disruption will be most severe
The staffing, skills, facilities and services (including the IT services) necessary to enable critical and essential business processes to continue operating at a minimum acceptable level
The time within which minimum levels of staffing, facilities and services should be recovered
The time within which all required business processes and supporting staff, facilities and services should be fully recovered
The relative business recovery priority for each of the IT services.

One of the key outputs from a BIA exercise is a graph of the anticipated business impact caused by the loss of a business process or the loss of an IT service over time, as illustrated in Figure 4.22. This graph can then be used to drive the business and IT continuity strategies and plans. More preventative measures need to be adopted with regard to those processes and services with earlier and higher impacts, whereas greater emphasis should be placed on continuity and recovery measures for those where the impact is lower and takes longer to develop. A balanced approach of both measures should be adopted to those in between. These items provide the drivers for the level of ITSCM mechanisms that need to be considered or deployed. Once presented with these options, the business may decide that lower levels of service or increased delays are more acceptable, based on a cost-benefit analysis, or it may be that comprehensive disaster prevention measures will need to be implemented.

These assessments enable the mapping of critical service, application and technology components to critical business processes, thus helping to identify the ITSCM elements that need to be provided. The business requirements are ranked and the associated ITSCM elements confirmed and prioritized in terms of risk reduction and recovery planning. The results of the BIA, discussed earlier, are invaluable input to several areas of process design including Service Level Management to understand the required service levels.
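The guidance on reading the impact graph can be sketched in code. The following is a minimal illustration only, not part of the ITIL text: the class `ImpactCurve`, the functions `hours_to_threshold` and `emphasis`, the 4-hour and 72-hour cut-offs, the impact threshold and all sample figures are assumptions chosen for the example. It classifies each service by how quickly its estimated impact reaches a level the business regards as critical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImpactCurve:
    """Estimated business impact of losing one service, sampled over time."""
    service: str
    # (hours since disruption, estimated cumulative impact), sorted by hours
    samples: list[tuple[float, float]]


def hours_to_threshold(curve: ImpactCurve, threshold: float) -> Optional[float]:
    """Return the first sampled time at which impact reaches the threshold."""
    for hours, impact in curve.samples:
        if impact >= threshold:
            return hours
    return None  # threshold never reached within the sampled period


def emphasis(curve: ImpactCurve, threshold: float,
             early_hours: float = 4.0, late_hours: float = 72.0) -> str:
    """Suggest where the continuity strategy should place its emphasis.

    early_hours and late_hours are hypothetical cut-offs; a real BIA
    would agree these values with the business.
    """
    t = hours_to_threshold(curve, threshold)
    if t is None or t >= late_hours:
        return "recovery"      # impact is lower and takes longer to develop
    if t <= early_hours:
        return "prevention"    # impact is early and high
    return "balanced"          # somewhere in between


# Illustrative figures only.
market_data = ImpactCurve("market data feed",
                          [(0.5, 20_000), (1, 150_000), (4, 900_000)])
invoicing = ImpactCurve("invoicing",
                        [(8, 5_000), (24, 120_000), (72, 400_000)])
archive_search = ImpactCurve("archive search",
                             [(24, 0), (72, 10_000), (168, 60_000)])

for curve in (market_data, invoicing, archive_search):
    print(curve.service, "->", emphasis(curve, threshold=100_000))
```

The result is only a starting point for discussion: the business may still accept a lower level of service or longer delays after a cost-benefit analysis.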

Figure 4.22 Graphical representation of business impacts

Impacts should be measured against particular scenarios for each business process, such as an inability to settle trades in a money market dealing process, or an inability to invoice for a period of days. An example is a money market dealing environment where loss of market data information could mean that the organization starts to lose money immediately as trading cannot continue. In addition, customers may go to another organization, which would mean potential loss of core business. Loss of the settlement system does not prevent trading from taking place, but if trades already conducted cannot be settled within a specified period of time, the organization may be in breach of regulatory rules or settlement periods and suffer fines and damaged reputation. This may actually be a more significant impact than the inability to trade, because customer expectations cannot be satisfied.

It is also important to understand how impacts may change over time. For instance, it may be possible for a business to function without a particular process for a short period of time. In a balanced scenario, impacts to the business will occur and become greater over time. However, not all organizations are affected in this way. In some organizations, impacts are not apparent immediately. At some point, however, for any organization, the impacts accumulate to such an extent that the business can no longer operate. ITSCM ensures that contingency options are identified so that the appropriate measure can be applied at the appropriate time to keep business impacts from service disruption to a minimum level.

When conducting a BIA, it is important that senior business area representatives' views are sought on the impact following loss of service. It is also equally important that the views of supervisory staff and more junior staff are sought to ensure all aspects of the impact following loss of service are ascertained. Often different levels of staff will have different views on the impact, and all will have to be taken into account when producing the overall strategy.

In many organizations it will be impossible, or it will not be cost-justifiable, to recover the total service in a very short timescale. In many cases, business processes can be re-established without a full complement of staff, systems and other facilities, and still maintain an acceptable level of service to clients and customers. The business recovery objectives should therefore be stated in terms of:

The time within which a pre-defined team of core staff and stated minimum facilities must be recovered
The timetable for recovery of remaining staff and facilities.

It may not always be possible to provide the recovery requirements to a detailed level. There is a need to balance the potential impact against the cost of recovery to ensure that the costs are acceptable. The recovery objectives do, however, provide a starting point from which different business recovery and ITSCM options can be evaluated.
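The two-part recovery objective, and the balance between impact and cost, can be expressed as a small data model. This is a sketch under stated assumptions: the names `RecoveryObjective`, `RecoveryOption`, `meets` and `cheapest_adequate`, together with all times and costs, are hypothetical and are not taken from ITIL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryObjective:
    process: str
    core_recovery_hours: float  # pre-defined core staff and minimum facilities
    full_recovery_hours: float  # remaining staff and facilities

    def __post_init__(self) -> None:
        if self.core_recovery_hours > self.full_recovery_hours:
            raise ValueError("core recovery cannot be later than full recovery")


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    core_ready_hours: float
    full_ready_hours: float
    annual_cost: float


def meets(option: RecoveryOption, objective: RecoveryObjective) -> bool:
    """True if the option satisfies both parts of the recovery objective."""
    return (option.core_ready_hours <= objective.core_recovery_hours
            and option.full_ready_hours <= objective.full_recovery_hours)


def cheapest_adequate(options: list[RecoveryOption],
                      objective: RecoveryObjective) -> Optional[RecoveryOption]:
    """Return the lowest-cost option that meets the objective, if any."""
    adequate = [o for o in options if meets(o, objective)]
    return min(adequate, key=lambda o: o.annual_cost, default=None)


# Illustrative figures only.
objective = RecoveryObjective("settlement", core_recovery_hours=8,
                              full_recovery_hours=72)
options = [
    RecoveryOption("cold site", 120, 240, 10_000),
    RecoveryOption("warm site", 6, 48, 60_000),
    RecoveryOption("hot site", 1, 4, 250_000),
]
choice = cheapest_adequate(options, objective)
print(choice.name if choice else "no option meets the objective")
```

When no option meets the objective at an acceptable cost, the objective itself may need to be renegotiated with the business, as the text above notes.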

Requirements - Risk Analysis

Figure 4.23 Management of Risk

Figure 4.24 Example summary risk profile

The second driver in determining ITSCM requirements is the likelihood that a disaster or other serious service disruption will actually occur. This is an assessment of the level of threat and the extent to which an organization is vulnerable to that threat. Risk Analysis can also be used in assessing and reducing the chance of normal operational incidents and is a technique used by Availability Management to ensure the required availability and reliability levels can be maintained. Risk Analysis is also a key aspect of Information Security Management. A diagram on Risk Analysis and Management (Figure 4.20) is contained within the Availability Management process in section 4.4.

A number of Risk Analysis and Management methods are available for both the commercial and government sectors. Risk Analysis is the assessment of the risks that may give rise to service disruption or security violation. Risk management is concerned with identifying appropriate risk responses or cost-justifiable countermeasures to combat those risks.

A standard methodology, such as the Management of Risk (M_o_R), should be used to assess and manage risks within an organization. The M_o_R framework is illustrated in Figure 4.23.

The M_o_R approach is based around the above framework, which consists of the following:

M_o_R Principles: these principles are essential for the development of good risk management practice and are derived from corporate governance principles.
M_o_R Approach: an organization's approach to these principles needs to be agreed and defined within the following living documents:
- Risk Management Policy
- Process Guide
- Plans
- Risk Registers
- Issue Logs.
M_o_R Processes: the following four main steps describe the inputs, outputs and activities that ensure that risks are controlled:
- Identify: the threats and opportunities within an activity that could impact the ability to reach its objective
- Assess: the understanding of the net effect of the identified threats and opportunities associated with an activity when aggregated together
- Plan: to prepare a specific management response that will reduce the threats and maximize the opportunities
- Implement: the planned risk management actions, monitor their effectiveness and take corrective action where responses do not match expectations.
Embedding and Reviewing M_o_R: having put the principles, approach and processes in place, they need to be continually reviewed and improved to ensure they remain effective.
Communication: having the appropriate communication activities in place to ensure that everyone is kept up-to-date with changes in threats, opportunities and any other aspects of risk management.

This M_o_R method requires the evaluation of risks and the development of a risk profile, such as the example in Figure 4.24.

Figure 4.24 provides an example risk profile, containing many risks that are outside the defined level of 'acceptable risk'. Following the Risk Analysis it is possible to determine appropriate risk responses or risk reduction measures (ITSCM mechanisms) to manage the risks, i.e. reduce the risk to an acceptable level or mitigate the risk. Wherever possible, appropriate risk responses should be implemented to reduce the impact or the likelihood of these risks, or both.
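A risk profile of this kind can be approximated with a simple likelihood and impact scoring. The sketch below is illustrative and is not the M_o_R method itself: the 1 to 5 scales, the multiplication of likelihood by impact, the acceptable score of 8 and the names `Risk`, `outside_tolerance` and `residual` are all assumptions made for the example, as are the scores assigned to each risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # 1 (very low) to 5 (very high), assumed scale
    impact: int      # 1 (very low) to 5 (very high), assumed scale

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def outside_tolerance(risks: list[Risk], acceptable_score: int = 8) -> list[Risk]:
    """Return risks scoring above the acceptable level, worst first."""
    flagged = [r for r in risks if r.score > acceptable_score]
    return sorted(flagged, key=lambda r: r.score, reverse=True)


def residual(risk: Risk, likelihood_reduction: int = 0,
             impact_reduction: int = 0) -> Risk:
    """Return the risk as it would stand after a response is applied."""
    return Risk(risk.name,
                max(1, risk.likelihood - likelihood_reduction),
                max(1, risk.impact - impact_reduction))


# Illustrative scores only.
profile = [
    Risk("Power failure", likelihood=3, impact=4),
    Risk("Flood", likelihood=1, impact=5),
    Risk("Loss of data", likelihood=4, impact=5),
]

for risk in outside_tolerance(profile):
    print(f"{risk.name}: score {risk.score} exceeds acceptable level")

# A response such as backup power might lower the likelihood of the first risk.
after_response = residual(profile[0], likelihood_reduction=2)
print(after_response.name, "residual score", after_response.score)
```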

In the context of ITSCM, there are a number of risks that need to be taken into consideration. Table 4.1 is not a comprehensive list but does give some examples of risks and threats that need to be addressed by the ITSCM process.

Risk	Threat
Loss of internal IT systems/networks, PABXs, ACDs, etc.	Fire; Power failure; Arson and vandalism; Flood; Aircraft impact; Weather damage, e.g. hurricane; Environmental disaster; Terrorist attack; Sabotage; Catastrophic failure; Electrical damage, e.g. lightning; Accidental damage; Poor-quality software
Loss of external IT systems/networks, e.g. e-commerce servers, cryptographic systems	All of the above; Excessive demand for services; Denial of service attack, e.g. against an internet firewall; Technology failure, e.g. cryptographic system
Loss of data	Technology failure; Human error; Viruses, malicious software, e.g. attack applets
Loss of network services	Damage or denial of access to network service provider's premises; Loss of service provider's IT; Loss of service provider's data; Failure of the service provider
Unavailability of key technical and support staff	Industrial action; Denial of access to premises; Resignation; Sickness/injury; Transport difficulties
Failure of service providers, e.g. outsourced IT	Commercial failure, e.g. insolvency; Denial of access to premises; Unavailability of service provider's staff; Failure to meet contractual service levels
Table 4.1 Examples of risks and threats

IT Service Continuity Strategy

The results of the Business Impact Analysis and the Risk Analysis will enable appropriate Business and IT Service Continuity strategies to be produced in line with the business needs. The strategy will be an optimum balance of risk reduction and recovery or continuity options. This includes consideration of the relative service recovery priorities and the changes in relative service priority for the time of day, day of the week, and monthly and annual variations. For services that the BIA identifies as having high impacts in the short term, the strategy should concentrate on preventative risk reduction methods - for example, through full resilience and fault tolerance - while services that have low short-term impacts are better suited to comprehensive recovery options, as described in the following sections. Similar advice and guidance can be found in the Business Continuity Institute's BCI Good Practice Guidelines.

Risk Response Measures
Most organizations will have to adopt a balanced approach where risk reduction and recovery are complementary and both are required. This entails reducing, as far as possible, the risks to the continued provision of the IT service and is usually achieved through Availability Management. However well planned, it is impossible to completely eliminate all risks - for example, a fire in a nearby building will probably result in damage, or at least denial of access, as a result of the implementation of a cordon. As a general rule, the invocation of a recovery capability should only be taken as a last resort. Ideally, an organization should assess all of the risks to reduce the potential requirement to recover the business, which is likely to include the IT services.

The risk reduction measures need to be implemented and should be instigated in conjunction with Availability Management, as many of these reduce the probability of failure affecting the availability of service. Typical risk reduction measures include:

Installation of UPS and backup power to the computer
Fault-tolerant systems for critical applications where even minimal downtime is unacceptable - for example, a banking system
RAID arrays and disk mirroring for LAN servers to protect against data loss and to ensure continued availability of data
Spare equipment/components to be used in the event of equipment or component failure - for example, a spare LAN server already configured with the standard configuration and available to replace a faulty server with minimum build and configuration time
The elimination of single points of failure (SPoFs), such as single access network points or single power supply into a building
Resilient IT systems and networks
Outsourcing services to more than one provider
Greater physical and IT-based security controls
Better controls to detect service disruptions, such as fire detection systems, coupled with suppression systems
A comprehensive backup and recovery strategy, including off-site storage.

The above measures will not necessarily solve an ITSCM issue and remove the risk totally, but all or a combination of them may significantly reduce the risks associated with the way in which services are provided to the business.
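One of the measures listed above, the elimination of SPoFs, lends itself to a simple automated check. The following sketch assumes a hypothetical inventory that maps each supporting function to the components able to provide it; the function name `single_points_of_failure`, the inventory layout and the component names are illustrative assumptions, and in practice such data might be drawn from the CMS.

```python
def single_points_of_failure(providers_by_function: dict[str, list[str]]) -> list[str]:
    """Return the functions that depend on exactly one distinct component.

    Duplicate entries for the same component do not count as redundancy.
    """
    return sorted(
        function
        for function, providers in providers_by_function.items()
        if len(set(providers)) == 1
    )


# Illustrative inventory only.
topology = {
    "power supply": ["utility feed"],
    "network access": ["carrier-1", "carrier-2"],
    "file server": ["srv-01"],
}

for function in single_points_of_failure(topology):
    print(f"{function}: single point of failure")
```

A check of this kind only identifies candidates; whether each one is removed remains a cost-justification decision taken with Availability Management.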

Off-Site Storage
One risk response method is to ensure all vital data is backed up and stored off-site. Once the recovery strategy has been defined, an appropriate backup strategy should be adopted and implemented to support it. The backup strategy must include regular (probably daily) removal of data (including the CMS to ease recovery) from the main data centres to a suitable off-site storage location. This will ensure retrieval of data following relatively minor operational failure as well as total and complete disasters. As well as the electronic data, all other important information and documents should be stored off-site, with the main example being the ITSCM plans.
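Adherence to the off-site schedule can be monitored with a simple age check. The sketch below assumes a daily schedule, as suggested above, and a hypothetical record of when each data set was last moved off-site; the function name `stale_offsite_copies`, the 24-hour limit and the timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta


def stale_offsite_copies(last_offsite: dict[str, datetime], now: datetime,
                         max_age: timedelta = timedelta(hours=24)) -> list[str]:
    """Return data sets whose latest off-site copy is older than max_age."""
    return sorted(name for name, taken in last_offsite.items()
                  if now - taken > max_age)


# Illustrative timestamps only.
now = datetime(2024, 1, 10, 6, 0)
last_offsite = {
    "CMS": datetime(2024, 1, 9, 22, 0),                # 8 hours old
    "customer database": datetime(2024, 1, 8, 22, 0),  # 32 hours old
    "ITSCM plans": datetime(2024, 1, 9, 6, 0),         # exactly 24 hours old
}

for name in stale_offsite_copies(last_offsite, now):
    print(f"{name}: off-site copy is overdue")
```

Including the ITSCM plans and the CMS in the same check reflects the point that documents needed for recovery must be retrievable off-site as well as the data itself.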

ITSCM Recovery Options
An organization's ITSCM strategy is a balance between the cost of risk reduction measures and recovery options to support the recovery of critical business processes within agreed timescales. The following is a list of the potential IT recovery options that need to be considered when developing the strategy.

Manual Workarounds
For certain types of services, manual work-arounds can be an effective interim measure for a limited timeframe until the IT service is resumed. For instance, a Service Desk call logging service could survive for a limited time using paper forms linked to a laptop computer with a spreadsheet.

Reciprocal Arrangements
In the past, reciprocal arrangements were typical contingency measures where agreements were put in place with another organization using similar technology. This is no longer effective or possible for most types of IT systems, but can still be used in specific cases - for example, setting up an agreement to share high-speed printing facilities. Reciprocal arrangements can also be used for the off-site storage of backups and other critical information.

Gradual Recovery
This option (sometimes referred to as 'cold standby') includes the provision of empty accommodation, fully equipped with power, environmental controls and local network cabling infrastructure, telecommunications connections, and available in a disaster situation for an organization to install its own computer equipment. It does not include the actual computing equipment, so is not applicable for services requiring speedy recovery, as set-up time is required before recovery of services can begin. This recovery option is only recommended for services that can bear a delay of recovery time in days or weeks, not hours. Any non-critical service that can bear this type of delay should take into account the cost of this option versus the benefit to the business before determining if a gradual recovery option should be included in the ITSCM options for the organization.

The accommodation may be provided commercially by a third party, for a fee, or may be private (established by the organization itself), and provided as either a fixed or portable service.

A portable facility is typically a prefabricated building provided by a third party and located, when needed, at a predetermined site agreed with the organization. This may be in another location some distance from the home site, perhaps another owned building. The replacement computer equipment will need to be planned, but suppliers of computing equipment do not always guarantee replacement equipment within a fixed deadline, though they would normally supply it on a best-efforts basis.

Intermediate Recovery
This option (sometimes referred to as 'warm standby') is selected by organizations that need to recover IT facilities within a predetermined time to prevent impacts to the business process. The predetermined time will have been agreed with the business during the BIA.

Most common is the use of commercial facilities, which are offered by third-party recovery organizations to a number of subscribers, spreading the cost across those subscribers. Commercial facilities often include operation, system management and technical support. The cost varies depending on the facilities requested, such as processors, peripherals, communications, and how quickly the services must be restored.

The advantage of this service is that the customer can have virtually instantaneous access to a site, housed in a secure building, in the event of a disaster. It must be understood, however, that the restoration of services at the site may take some time, as delays may be encountered while the site is re-configured for the organization that invokes the service, and the organization's applications and data will need to be restored from backups.

One potentially major disadvantage is the security implications of running IT services at a third party's data centre. This must be taken into account when planning to use this type of facility. For some organizations, the external intermediate recovery option may not be appropriate for this reason.

If the site is invoked, there is often a daily fee for use of the service in an emergency, although this may be offset against additional cost of working insurance.

Commercial recovery services can be provided in self-contained, portable or mobile form, where an agreed system is delivered to a customer's site within an agreed time.

Fast Recovery
This option (sometimes referred to as 'hot standby') provides for fast recovery and restoration of services and is sometimes provided as an extension to the intermediate recovery provided by a third-party recovery provider. Some organizations will provide their own facilities within the organization, but not on an alternative site to the one used for the normal operations. Others implement their own internal second locations on an alternative site to provide more resilient recovery.

Where there is a need for a fast restoration of a service, it is possible to 'rent' floor space at the recovery site and install servers or systems with application systems and communications already available, and data mirrored from the operational servers. In the event of a system failure, the customers can then recover and switch over to the backup facility with little loss of service. This typically involves the re-establishment of the critical systems and services within a 24-hour period.

Immediate Recovery
This option (also often referred to as 'hot standby', 'mirroring', 'load balancing' or 'split site') provides for immediate restoration of services, with no loss of service. For business critical services, organizations requiring continuous operation will provide their own facilities within the organization, but not on the same site as the normal operations. Sufficient IT equipment will be 'dual located' in either an owned or hosted location to run the complete service from either location in the event of loss of one facility, with no loss of service to the customer. The second site can then be recovered whilst the service is provided from the single operable location. This is an expensive option, but may be justified for critical business processes or VBFs where non-availability for a short period could result in a significant impact, or where it would not be appropriate to be running IT services on a third party's premises for security or other reasons. The facility needs to be located separately and far enough away from the home site that it will not be affected by a disaster affecting that location. However, these mirrored servers and sites options should be implemented in close liaison with Availability Management as they support services with high levels of availability.

The strategy is likely to include a combination of risk response measures and a combination of the above recovery options, as illustrated in Figure 4.25.

Service            Manual  Immediate  Fast  Intermediate  Gradual
Service Desk       Yes     -          Yes   Yes           Yes
Mainframe payroll  Yes     -          -     Yes           Yes
Financial system   -       -          Yes   -             Yes
Dealer system      -       Yes        -     Yes           Yes
Figure 4.25 Example set of recovery options

Figure 4.25 shows that a number of options may be used to provide continuity of service. An example from Figure 4.25 shows that, initially, continuity of the Service Desk is provided using manual processes such as a set of forms, and maybe a spreadsheet operating from a laptop computer, whilst recovery plans for the service are completed on an alternative 'fast recovery' site. Once the alternative site has become operational, the Service Desk can switch back to using the IT service. However, use of the external 'fast recovery' alternative site is probably limited in duration, so while running temporarily from this site, the 'intermediate site' can be made operational and long-term operations can be transferred there.
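The staged use of options in Figure 4.25 can be sketched as a simple lookup. The following is a minimal illustration, not part of the ITSCM guidance itself: the names FIGURE_4_25, STAGE_ORDER and recovery_stages are hypothetical, and the ordering of stages (manual workaround first, slowest option last) is an assumption drawn from the Service Desk example above.

```python
# Assumed order in which options come into play: manual workaround first,
# then progressively slower (and longer-term) recovery options.
STAGE_ORDER = ["manual", "immediate", "fast", "intermediate", "gradual"]

# Recovery options per service, transcribed from Figure 4.25.
FIGURE_4_25 = {
    "Service Desk": {"manual", "fast", "intermediate", "gradual"},
    "Mainframe payroll": {"manual", "intermediate", "gradual"},
    "Financial system": {"fast", "gradual"},
    "Dealer system": {"immediate", "intermediate", "gradual"},
}


def recovery_stages(service: str) -> list[str]:
    """Return the options planned for a service, in the assumed stage order."""
    options = FIGURE_4_25[service]
    return [stage for stage in STAGE_ORDER if stage in options]
```

For the Service Desk this yields manual, fast, intermediate and then gradual, matching the sequence described in the paragraph above.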

Different services within an organization require different in-built resilience and different recovery options. Whatever option is chosen, the solution will need to be cost justified. As a general rule, the longer the business can survive without a service, the cheaper the solution will be. For example, a critical healthcare system that requires continuous operation will be very costly, as potential loss of service will need to be eliminated by the use of immediate recovery, whereas a service the absence of which does not severely affect the business for a week or so could be supported by a much cheaper solution, such as intermediate recovery.
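The general rule above (the longer the business can survive without a service, the cheaper the solution) can be expressed as choosing the cheapest option whose time to restore still fits within the tolerable outage. This is a sketch under stated assumptions: only the 24-hour figure for fast recovery and the 'no loss of service' of immediate recovery come from the text; the durations assumed for intermediate and gradual recovery, and the names RECOVERY_OPTIONS and cheapest_option, are illustrative and would in practice be agreed with the business during the BIA.

```python
from datetime import timedelta

# (option, assumed worst-case time to restore), ordered cheapest first.
# The 3-day and 14-day figures are assumptions for illustration only.
RECOVERY_OPTIONS = [
    ("gradual", timedelta(days=14)),
    ("intermediate", timedelta(days=3)),
    ("fast", timedelta(hours=24)),
    ("immediate", timedelta(0)),
]


def cheapest_option(tolerable_outage: timedelta) -> str:
    """Return the cheapest option that restores service within the tolerable outage."""
    for name, time_to_restore in RECOVERY_OPTIONS:
        if time_to_restore <= tolerable_outage:
            return name
    raise ValueError("tolerable outage must not be negative")
```

With these assumed figures, a service that can be absent for a week maps to intermediate recovery, while a service requiring continuous operation maps to immediate recovery, consistent with the examples in the text.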

As well as the recovery of the computing equipment, planning needs to include the recovery of accommodation and infrastructure for both IT and user staff. Other areas to be taken into account include critical services such as power, telecommunications, water, couriers, post, paper records and reference material.

It is important to remember that the recovery is based around a series of stand-by arrangements including accommodation, procedures and people, as well as systems and telecommunications. Certain actions are necessary to implement the stand-by arrangements. For example:

Negotiating for third-party recovery facilities and entering into a contractual arrangement
Preparing and equipping the stand-by accommodation
Purchasing and installing stand-by computer systems.

4.5.5.3 Stage 3 - Implementation

Once the strategy has been approved, the IT Service Continuity Plans need to be produced in line with the Business Continuity Plans.

ITSCM plans need to be developed to enable the necessary information for critical systems, services and facilities to either continue to be provided or to be reinstated within an acceptable period to the business. An example ITSCM recovery plan is contained in Appendix K. Generally the Business Continuity Plans rely on the availability of IT services, facilities and resources. As a consequence of this, ITSCM plans need to address all activities to ensure that the required services, facilities and resources are delivered in an acceptable operational state and are 'fit for purpose' when accepted by the business. This entails not only the restoration of services and facilities, but also the understanding of dependencies between them, the testing required prior to delivery (performance, functional, operational and acceptance testing) and the validation of data integrity and consistency.

It should be noted that the continuity plans are more than just recovery plans, and should include documentation of the resilience measures and the measures that have been put into place to enable recovery, together with explanations of why a particular approach has been taken (this facilitates decisions should invocation determine that the particular situation requires a modification to the plan). However, the format of the plan should enable rapid access to the recovery information itself, perhaps as an appendix that can be accessed directly. All key staff should have access to copies of all the necessary recovery documentation.

Management of the distribution of the plans is important to ensure that copies are available to key staff at all times. The plans should be controlled documents (with formalized documents maintained under Change Management control) to ensure that only the latest versions are in circulation, and each recipient should ensure that a personal copy is maintained off-site.

The plan should ensure that all details regarding recovery of the IT services following a disaster are fully documented. It should have sufficient details to enable a technical person unfamiliar with the systems to be able to follow the procedures. The recovery plans include key details such as the data recovery point, a list of dependent systems, the nature of the dependency and their data recovery points, system hardware and software requirements, configuration details and references to other relevant or essential information about the service and systems.
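Because the plan records dependent systems and the nature of each dependency, a recovery sequence can be derived from it. The sketch below assumes the dependencies are held as a simple mapping; the service names and the recovery_order function are hypothetical examples, not taken from any real plan.

```python
from graphlib import TopologicalSorter

# Hypothetical example: each system maps to the systems it depends on.
DEPENDENCIES = {
    "network": set(),
    "database": {"network"},
    "payroll application": {"database", "network"},
    "service desk tool": {"network"},
}


def recovery_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every system is recovered after its dependencies.

    Raises graphlib.CycleError if the recorded dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

A circular dependency surfaces as an error, which is itself useful: it indicates that the plan cannot be followed as documented.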

It is a good idea to include a checklist that covers specific actions required during all stages of recovery for the service and system. For example, after the system has been restored to an operational state, connectivity checks, functionality checks or data consistency and integrity checks should be carried out prior to handing the service over to the business.
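Such a checklist can be kept as a small record that blocks handover until every check has passed. This is a minimal sketch: the HandoverChecklist class and its method names are hypothetical, and the three checks are simply those named in the paragraph above.

```python
from dataclasses import dataclass, field


def _default_checks() -> dict[str, bool]:
    # The checks named in the text; all start as not yet passed.
    return {"connectivity": False, "functionality": False, "data integrity": False}


@dataclass
class HandoverChecklist:
    service: str
    results: dict[str, bool] = field(default_factory=_default_checks)

    def record(self, check: str, passed: bool) -> None:
        """Record the outcome of one check; unknown check names are rejected."""
        if check not in self.results:
            raise KeyError(f"unknown check: {check}")
        self.results[check] = passed

    def outstanding(self) -> list[str]:
        """Return the checks that have not yet passed."""
        return [check for check, ok in self.results.items() if not ok]

    def ready_for_handover(self) -> bool:
        """The service is handed to the business only when nothing is outstanding."""
        return not self.outstanding()
```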

There are a number of technical plans that may already exist within an organization, documenting recovery procedures from a normal operational failure. The development and maintenance of these plans will be the responsibility of the specialist teams, but will be coordinated by the Business Continuity Management team. These will be useful additions or appendices to the main plan. Additionally, plans that will need to be integrated with the main BCP are:

Emergency Response Plan: to interface to all emergency services and activities
Damage Assessment Plan: containing details of damage assessment contacts, processes and plans
Salvage Plan: containing information on salvage contacts, activities and processes
Vital Records Plan: details of all vital records and information, together with their location, that are critical to the continued operation of the business
Crisis Management and Public Relations Plan: the plans on the command and control of different crisis situations and management of the media and public relations
Accommodation and Services Plan: detailing the management of accommodation, facilities and the services necessary for their continued operation
Security Plan: showing how all aspects of security will be managed on all home sites and recovery sites
Personnel Plan: containing details of how all personnel issues will be managed during a major incident
Communication Plan: showing how all aspects of communication will be handled and managed with all relevant areas and parties involved during a major incident
Finance and Administration Plan: containing details of alternative methods and processes for obtaining possible emergency authorization and access to essential funds during a major incident.

Finally, each critical business area is responsible for the development of a plan detailing the individuals who will be in the recovery teams and the tasks to be undertaken on invocation of recovery arrangements.

The ITSCM Plan must contain all the information needed to recover the IT systems, networks and telecommunications in a disaster situation once a decision to invoke has been made, and then to manage the business return to normal operation once the service disruption has been resolved. One of the most important inputs into the plan development is the results of the Business Impact Analysis. Additionally other areas will need to be analyzed, such as Service Level Agreements (SLA), security requirements, operating instructions and procedures and external contracts. It is likely that a separate SLA with alternative targets will have been agreed if running at a recovery site following a disaster.

Other areas that will need to be implemented following the approval of the strategy are:

Organization Planning
During the disaster recovery process, the organizational structure will inevitably be different from normal operation and is based around:

Executive - including senior management/executive board, with overall authority and control within the organization and responsible for crisis management and liaison with other departments, divisions, organizations, the media, regulators, emergency services etc.
Coordination - typically one level below the executive group and responsible for coordinating the overall recovery effort within the organization
Recovery - a series of business and service recovery teams, representing the critical business functions and the services that need to be established to support these functions. Each team is responsible for executing the plans within their own areas and for liaison with staff, customers and third parties. Within IT the recovery teams should be grouped by IT service and application. For example, the infrastructure team may have one or more people responsible for recovering external connections, voice services, local area networks, etc. and the support teams may be split by platform, operating system or application. In addition, the recovery priorities for the service, application or its components identified during the Business Impact Analysis should be documented within the recovery plans and applied during their execution.

Testing
Experience has shown that recovery plans that have not been fully tested do not work as intended, if at all. Testing is therefore a critical part of the overall ITSCM process and the only way of ensuring that the selected strategy, standby arrangements, logistics, business recovery plans and procedures will actually work in practice.

The IT service provider is responsible for ensuring that the IT services can be recovered in the required timescales with the required functionality and the required performance following a disaster.

There are four basic types of tests that can be undertaken:

Walk-Through Tests can be conducted when the plan has been produced simply by getting the relevant people together to see if the plan(s) will work at least in a simulated way.
Full Tests should be conducted as soon as possible after the plan production and at regular intervals of at least annually thereafter. They should involve the business units to assist in proving the capability to recover the services appropriately. They should, as far as possible, replicate an actual invocation of all standby arrangements and should involve external parties if they are planned to be involved in an actual invocation. The tests must not only prove recovery of the IT services but also the recovery of the business processes. It is recommended that an independent observer records all the activities of the tests and the timings of the service recovery. The observer's documentation of the tests will be vital input into the subsequent post mortem review. The full tests may be announced or unannounced. The first test of the plan is likely to be announced and carefully planned, but subsequent tests may be 'sprung' on key players without warning. It is also essential that many different people get involved, including those not very familiar with the IT service and systems, as the people with the most knowledge may not be available when a disaster actually occurs.
Partial Tests can also be undertaken where recovery of certain elements of the overall plan is tested, such as single services or servers. These types of tests should be in addition to, not instead of, the full test. The full test is the best way of testing that all services can be recovered in required timescales and can run together on the recovery systems.
Scenario Tests can be used to test reactions and plans to specific conditions, events and scenarios. They can include testing that BCPs and IT Service Continuity Plans interface with each other, as well as interfacing with all other plans involved in the handling and management of a major incident.

All tests need to be undertaken against defined test scenarios, which are described as realistically as possible. It should be noted, however, that even the most comprehensive test does not cover everything. For example, in a service disruption where there has been injury or even death to colleagues, the reaction of staff to a crisis cannot be tested and the plans need to make allowance for this. In addition, tests must have clearly defined objectives and Critical Success Factors, which will be used to determine the success or otherwise of the exercise.

4.5.5.4 Stage 4 - Ongoing Operation

This stage consists of the following:

Education, Awareness and Training - this should cover the organization and, in particular, the IT organization, for service continuity-specific items. This ensures that all staff are aware of the implications of business continuity and of service continuity and consider these as part of their normal working, and that everyone involved in the plan has been trained in how to implement their actions
Review - regular review of all of the deliverables from the ITSCM process needs to be undertaken to ensure that they remain current
Testing - following the initial testing, it is necessary to establish a programme of regular testing to ensure that the critical components of the strategy are tested, preferably at least annually, although testing of IT Service Continuity Plans should be arranged in line with business needs and the needs of the BCPs. All plans should also be tested after every major business change. It is important that any changes to the IT technology are also included in the strategy, implemented in an appropriate fashion and tested to ensure that they function correctly within the overall provision of IT following a disaster. The backup and recovery of IT service should also be monitored and tested to ensure that when they are needed during a major incident, they will operate as needed. This aspect is covered more fully in the Service Operation publication
Change Management - the Change Management process should ensure that all changes are assessed for their potential impact on the ITSCM plans. If the planned change will invalidate the plans, then the plan must be updated before the change is implemented, and it should be tested as part of the change testing. The plans themselves must be under very strict Change Management and Configuration Management control. Inaccurate plans and inadequate recovery capabilities may result in the failure of BCPs. Also, on an ongoing basis, whenever there are new services or where services have major changes, it is essential that a BIA and risk assessment is conducted on the new or changed service and the strategy and plans updated accordingly.
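The testing rule described above (at least annually, and after every major business change) can be sketched as a simple check. The function name plan_test_due and the use of 365 days for 'annually' are assumptions for illustration; the actual interval should follow business needs and the needs of the BCPs.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed reading of "at least annually"; adjust to the agreed interval.
MAX_TEST_INTERVAL = timedelta(days=365)


def plan_test_due(last_test: date, today: date,
                  last_major_change: Optional[date] = None) -> bool:
    """Return True if a continuity plan needs to be tested again."""
    # A major business change since the last test always requires a retest.
    if last_major_change is not None and last_major_change > last_test:
        return True
    # Otherwise a retest is due once the maximum interval has elapsed.
    return today - last_test >= MAX_TEST_INTERVAL
```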

Invocation
Invocation is the ultimate test of the Business Continuity and ITSCM Plans. If all the preparatory work has been successfully completed, and plans developed and tested, then an invocation of the Business Continuity Plans should be a straightforward process, but if the plans have not been tested, failures can be expected. It is important that due consideration is given to the design of all invocation processes, to ensure that they are fit for purpose and interface to all other relevant invocation processes.

Invocation is a key component of the plans, which must include the invocation process and guidance. It should be remembered that the decision to invoke, especially if a third-party recovery facility is to be used, should not be taken lightly. Costs will be involved and the process will involve disruption to the business. This decision is typically made by a 'crisis management' team, comprising senior managers from the business and support departments (including IT), using information gathered through damage assessment and other sources.

A disruption could occur at any time of the day or night, so it is essential that guidance on the invocation process is readily available. Plans must be available to key staff in the office and away from the office. The decision to invoke must be made quickly, as there may be a lead-time involved in establishing facilities at a recovery site. In the case of a serious building fire, the decision may be fairly easy to make. However, in the case of power failure or hardware fault, where a resolution is expected within a short period, a deadline should be set by which time if the incident has not been resolved, invocation will take place. If using external services providers, they should be warned immediately if there is a chance that invocation might take place.

The decision to invoke needs to take into account the:

Extent of the damage and scope of the potential invocation
Likely length of the disruption and unavailability of premises and/or services
Time of day/month/year and the potential business impact. At year-end, the need to invoke may be more pressing, to ensure that year-end processing is completed on time.

Therefore the design of the invocation process must provide guidance on how all of these areas and circumstances should be assessed to assist the person invoking the continuity plan.
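The deadline-based approach to invocation described above can be illustrated with a short calculation. This is one possible interpretation, not a prescribed method: it assumes the deadline is the latest moment at which invocation can begin and still restore service within the maximum tolerable outage, allowing for the lead-time needed at the recovery site. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def invocation_deadline(incident_start: datetime,
                        max_tolerable_outage: timedelta,
                        recovery_lead_time: timedelta) -> datetime:
    """Latest time invocation can begin and still restore service in time."""
    return incident_start + max_tolerable_outage - recovery_lead_time


def should_invoke(now: datetime, incident_start: datetime,
                  max_tolerable_outage: timedelta,
                  recovery_lead_time: timedelta, resolved: bool) -> bool:
    """Invoke only if the incident is unresolved when the deadline is reached."""
    if resolved:
        return False
    deadline = invocation_deadline(incident_start, max_tolerable_outage,
                                   recovery_lead_time)
    return now >= deadline
```

For example, an incident starting at 09:00 with a 24-hour tolerable outage and a 6-hour lead-time gives a deadline of 03:00 the next day. The extent of damage and the time of day, month or year would still need to be assessed by the crisis management team; this sketch covers only the timing element.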

The ITSCM Plan should include details of activities that need to be undertaken, including:

Retrieval of backup tapes or use of data vaulting to retrieve data
Retrieval of essential documentation, procedures, workstation images, etc. stored off-site
Mobilization of the appropriate technical personnel to go to the recovery site to commence the recovery of required systems and services
Contacting and putting on alert telecommunications suppliers, support services, application vendors, etc. who may be required to undertake actions or provide assistance in the recovery process.

The invocation and initial recovery is likely to be a time of high activity, involving long hours for many individuals. This must be recognized and managed by the recovery team leaders to ensure that breaks are provided and prevent 'burn-out'. Planning for shifts and hand-overs must be undertaken to ensure that the best use is made of the facilities available. It is also vitally important to ensure that the usual business and technology controls remain in place during invocation, recovery and return to normal to ensure that information security is maintained at the correct level and that data protection is preserved.

Once the recovery has been completed, the business should be able to operate from the recovery site at the level determined and agreed in the strategy and relevant SLA. The objective, however, will be to build up the business to normal levels, maintain operation from the recovery site in the short term and vacate the recovery site in the shortest possible time. Details of all these activities need to be contained within the plans. If using external services, there will be a finite contractual period for using the facility. Whatever the period, a return to normal must be carefully planned and undertaken in a controlled fashion. Typically this will be over a weekend and may include some necessary downtime in business hours. It is important that this is managed well and that all personnel involved are aware of their responsibilities to ensure a smooth transition.

4.5.6 Triggers, Inputs, Outputs and Interfaces

Many events may trigger ITSCM activity. These include:

New or changed business needs, or new or changed services
New or changed targets within agreements, such as SLRs, SLAs, OLAs or contracts
The occurrence of a major incident that requires assessment for potential invocation of either Business or IT Continuity Plans
Periodic activities such as the BIA or Risk Analysis activities, maintenance of Continuity Plans or other reviewing, revising or reporting activities
Assessment of changes and attendance at Change Advisory Board meetings
Review and revision of business and IT plans and strategies
Review and revision of designs and strategies
Recognition or notification of a change of risk or impact of a business process or VBF, an IT service or component
Initiation of tests of continuity and recovery plans.

Integration and interfaces exist from ITSCM to all other processes. Important examples are as follows:

Change Management - all changes need to be considered for their impact on the continuity plans, and if amendments are required to the plan, updates to the plan need to be part of the change. The plan itself must be under Change Management control
Incident and Problem Management - incidents can easily evolve into major incidents or disasters. Clear criteria need to be agreed and documented for the invocation of the ITSCM plans
Availability Management - undertaking Risk Analysis and implementing risk responses should be closely coordinated with the availability process to optimize risk mitigation
Service Level Management - recovery requirements will be agreed and documented in the SLAs. Different service levels could be agreed and documented that could be acceptable in a disaster situation.
Capacity Management - ensuring that there are sufficient resources to enable recovery onto replacement computers following a disaster
Configuration Management - the CMS documents the components that make up the infrastructure and the relationship between the components. This information is invaluable for all the stages of the ITSCM lifecycle, the maintenance of plans and recovery facilities
Information Security Management - a very close relationship exists between ITSCM and Information Security Management. A major security breach could be considered a disaster, so when conducting BIA and Risk Analysis, security will be a very important consideration.

4.5.6.1 Inputs

There are many sources of input required by the ITSCM process:

Business information: from the organization's business strategy, plans and financial plans, and information on their current and future requirements
IT information: from the IT strategy and plans and current budgets
A Business Continuity Strategy and a set of Business Continuity Plans: from all areas of the business
Service information: from the SLM process, with details of the services from the Service Portfolio and the Service Catalogue and service level targets within SLAs and SLRs
Financial information: from Financial Management, the cost of service provision, the cost of resources and components
Change information: from the Change Management process, with a Change Schedule and a need to assess all changes for their impact on all ITSCM plans
CMS: containing information on the relationships between the business, the services, the supporting services and the technology
Business Continuity Management and Availability Management testing schedules
IT Service Continuity Plans and test reports from suppliers and partners, where appropriate.

4.5.6.2 Outputs

The outputs from the ITSCM process include:

A revised ITSCM policy and strategy
A set of ITSCM plans, including all Crisis Management, Emergency Response Plans and Disaster Recovery Plans, together with a set of supporting plans and contracts with recovery service providers
Business Impact Analysis exercises and reports, in conjunction with BCM and the business
Risk Analysis and Management reviews and reports, in conjunction with the business, Availability Management and Security Management
An ITSCM testing schedule
ITSCM test scenarios
ITSCM test reports and reviews.

Forecasts and predictive reports are used by all areas to analyze, predict and forecast particular business and IT scenarios and their potential solutions.

4.5.7 Key Performance Indicators

IT services are delivered and can be recovered to meet business objectives:

Regular audits of the ITSCM Plans to ensure that, at all times, the agreed recovery requirements of the business can be achieved
All service recovery targets are agreed and documented in SLAs and are achievable within the ITSCM Plans
Regular and comprehensive testing of ITSCM Plans
Regular reviews are undertaken, at least annually, of the business and IT continuity plans with the business areas
Negotiate and manage all necessary ITSCM contracts with third parties
Overall reduction in the risk and impact of possible failure of IT services.

Awareness throughout the organization of the plans:

Ensure awareness of business impact, needs and requirements throughout IT
Ensure that all IT service areas and staff are prepared and able to respond to an invocation of the ITSCM Plans
Regular communication of the ITSCM objectives and responsibilities within the appropriate business and IT service areas.

4.5.8 Information Management

ITSCM needs to record all of the information necessary to maintain a comprehensive set of ITSCM plans. This information base should include:

Information from the latest version of the BIA
Comprehensive information on risk within a Risk Register, including risk assessment and risk responses
The latest version of the BCM strategy and BCPs
Details relating to all completed tests and a schedule of all planned tests
Details of all ITSCM Plans and their contents
Details of all other plans associated with ITSCM Plans
Details of all existing recovery facilities, recovery suppliers and partners, recovery agreements and contracts, spare and alternative equipment
Details of all backup and recovery processes, schedules, systems and media and their respective locations.

All the above information needs to be integrated and aligned with all BCM information and all the other information required by ITSCM. Interfaces to many other processes are required to ensure that this alignment is maintained.

4.5.9 Challenges, Critical Success Factors and Risks

One of the major challenges facing ITSCM is to provide appropriate plans when there is no BCM process. If there is no BCM process, then IT is likely to make incorrect assumptions about business criticality of business processes and therefore adopt the wrong continuity strategies and options. Without BCM, expensive ITSCM solutions and plans will be rendered useless by the absence of corresponding plans and arrangements within the business. Also, if BCM is absent, then the business may fail to identify inexpensive non-IT solutions and waste money on ineffective, expensive IT solutions.

In some organizations, the business perception is that continuity is an IT responsibility, and therefore the business assumes that IT will be responsible for disaster recovery and that IT services will continue to run under any circumstances. This is especially true in some outsourced situations where the business may be reluctant to share its BCM information with an external service provider.

If there is a BCM process established, then the challenge becomes one of alignment and integration. ITSCM must ensure that accurate information is obtained from the BCM process on the needs, impact and priorities of the business, and that the ITSCM information and plans are aligned and integrated with those of the business.

Having achieved that alignment, the challenge becomes one of keeping them aligned by management and control of business and IT change. It is essential, therefore, that all documents and plans are maintained under strict Change Management and Configuration Management control.

The main CSFs for the ITSCM process are:

IT services are delivered and can be recovered to meet business objectives
Awareness throughout the organization of the business and IT Service Continuity Plans.

Some of the major risks associated with ITSCM include:

Lack of commitment from the business to the ITSCM processes and procedures
Lack of commitment from the business and a lack of appropriate information on future plans and strategies
Lack of senior management commitment or a lack of resources and/or budget for the ITSCM process
The processes focus too much on the technology issues and not enough on the IT services and the needs and priorities of the business
Risk Analysis and Management are conducted in isolation and not in conjunction with Availability Management and Security Management
ITSCM plans and information become out-of-date and lose alignment with the information and plans of the business and BCM.

Supporting Material

Video - HCI - Service Continuity Management, or Powerpoint
CSU - Service Continuity Management Objectives
CSU - Service Continuity Management - Business Impact Analysis
ITSCM ICOM Chart
Arcserve - Guide to Availability, Continuity & Disaster Recovery
Business Case
Technical Barriers
Operational Challenges
Putting it All Together

Video - HCI - Business Impact Analysis, or Powerpoint
Video - HCI - Risk Analysis, or Powerpoint
ISACA - The Risk IT Framework (2009)
