Service Design

Service Design Process

4.5 IT Service Continuity Management

4.5.1 Purpose, Goals and Objectives
'The goal of ITSCM is to support the overall Business Continuity Management process by ensuring that the required IT technical and service facilities (including computer systems, networks, applications, data repositories, telecommunications, environment, technical support and Service Desk) can be resumed within required, and agreed, business timescales.'

As technology is a core component of most business processes, continued or high availability of IT is critical to the survival of the business as a whole. This is achieved by introducing risk reduction measures and recovery options. Like all elements of ITSM, successful implementation of ITSCM can only be achieved with senior management commitment and the support of all members of the organization. Ongoing maintenance of the recovery capability is essential if it is to remain effective. The purpose of ITSCM is to maintain the necessary ongoing recovery capability within the IT services and their supporting components.

The objectives of ITSCM are to:

4.5.2 Scope
ITSCM focuses on those events that the business considers significant enough to be treated as a disaster. Less significant events will be dealt with as part of the Incident Management process. What constitutes a disaster will vary from organization to organization. The impact of a loss of a business process, such as financial loss, damage to reputation or regulatory breach, is measured through a Business Impact Analysis (BIA) exercise, which determines the minimum critical requirements. The specific IT technical and service requirements are supported by ITSCM. The scope of ITSCM within an organization is determined by the organizational structure, culture and strategic direction (both business and technology) in terms of the services provided and how these develop and change over time.

ITSCM primarily considers the IT assets and configurations that support the business processes. If (following a disaster) it is necessary to relocate to an alternative working location, provision will also be required for items such as office and personnel accommodation, copies of critical paper records, courier services and telephone facilities to communicate with customers and third parties.

The scope will need to take into account the number and location of the organization's offices and the services performed in each.

ITSCM does not usually directly cover longer-term risks such as those from changes in business direction, diversification, restructuring, major competitor failure, and so on. While these risks can have a significant impact on IT service elements and their continuity mechanisms, there is usually time to identify and evaluate the risk and include risk mitigation through changes or shifts in business and IT strategies, thereby becoming part of the overall business and IT Change Management programme.

Similarly, ITSCM does not usually cover minor technical faults (for example, non-critical disk failure), unless there is a possibility that they could have a major impact on the business. These risks would be expected to be covered mainly through the Service Desk and the Incident Management process, or resolved through the planning associated with the processes of Availability Management, Problem Management, Change Management, Configuration Management and 'business as usual' operational management.

The ITSCM process includes:

4.5.3 Value to the Business
ITSCM plays an invaluable role in supporting the Business Continuity Planning process. In many organizations, ITSCM is used to raise awareness of continuity and recovery requirements and is often used to justify and implement a Business Continuity Planning process and Business Continuity Plans. ITSCM should be driven by business risk as identified by Business Continuity Planning, and should ensure that the recovery arrangements for IT services are aligned to identified business impacts, risks and needs.

4.5.4 Policies, Principles and Basic Concepts
Key activities shown in Figure 4.21:
  • Policy setting
  • Scope
  • Initiate a project

  • Business Impact Analysis
  • Risk Assessment
  • IT Service Continuity Strategy
  • Develop IT Service Continuity Plans

  • Develop IT plans, recovery plans and procedures
  • Organization Planning
  • Testing strategy

  • Education, awareness and training
  • Review and audit
  • Testing
  • Change Management
Figure 4.21 Lifecycle of Service Continuity Management

A lifecycle approach should be adopted to the setting up and operation of an ITSCM process. Figure 4.21 shows the lifecycle of ITSCM, from initiation through to continual assurance that the protection provided by the plan is current and reflects all changes to services and service levels. ITSCM is a cyclic process through the lifecycle to ensure that once service continuity and recovery plans have been developed they are kept aligned with Business Continuity Plans (BCPs) and business priorities. Figure 4.21 also shows the role played within the ITSCM process of BCM.

Initiation and requirements stages are principally BCM activities. ITSCM should only be involved in these stages to support the BCM activities and to understand the relationship between the business processes and the impacts caused on them by loss of IT service. As a result of these initial BIA and Risk Analysis activities, BCM should produce a Business Continuity Strategy, and the first real ITSCM task is to produce an ITSCM strategy that underpins the BCM strategy and its needs.

The Business Continuity Strategy should principally focus on business processes and associated issues (e.g. business process continuity, staff continuity, buildings continuity). Once the Business Continuity Strategy has been produced, and the role that IT services have to play within the strategy has been determined, an ITSCM strategy can be produced that supports and enables the Business Continuity Strategy. This ensures that cost-effective decisions can be made, considering all the 'resources' to deliver a business process. Failure to do this tends to encourage ITSCM options that are faster, more elaborate and more expensive than are actually needed.

The activities to be considered during initiation depend on the extent to which continuity facilities have been applied within the organization. Some parts of the business may have established individual Business Continuity Plans based around manual work-arounds, and IT may have developed continuity plans for systems perceived to be critical. This is good input to the process. However, effective ITSCM depends on supporting critical business functions. The only way of implementing effective ITSCM is through the identification of critical business processes and the analysis and coordination of the required technology and supporting IT services.

This situation may be even more complicated in outsourcing situations where an ITSCM process within an external service provider or outsourcer organization has to meet the needs not only of the customer BCM process and strategy, but also of the outsourcer's own BCM process and strategy. These needs may be in conflict with one another, or may conflict with the BCM needs of one of the other outsourcing organization's customers.

However, in many organizations BCM is absent or has very little focus, and often ITSCM is required to fulfill many of the requirements and activities of BCM. The rest of this section assumes that ITSCM has had to perform many of the activities required by BCM. Where a BCM process is established with Business Continuity Strategies and Plans in place, these documents should provide the focus and drive for establishing ITSCM.

4.5.5 Process Activities, Methods and Techniques
The following sections contain details of each of the stages within the ITSCM lifecycle.

HCI Observation
The approach for this process differs from that in other sections. Lifecycle is defined in ITIL as "A series of states connected by allowable transitions in the life of an IT Service". In other process descriptions (e.g. capacity, availability, etc.) the IT service is an application or system. The commentary is NOT directed at implementing an availability or capacity "service". In this section "lifecycle" seems to be defined as the total ITSCM process itself rather than its invocation and operational elements culminating in the restoration of the service. These are covered under the "Implementation" section. It's really about consistency, as other sections might well benefit from Initiation and Requirements and Strategy sections (the latter included in the ITIL version 3 Service Strategy book).

This demonstrates the difficulties often encountered in coordinating a common approach amongst a set of different writers. Deviations from agreed-upon formats sometimes happen. It is the job of the "Chief Architect" to ensure such deviations do not stray too far afield.

4.5.5.1 Stage 1 - Initiation
The initiation process covers the whole of the organization and consists of the following activities:

4.5.5.2 Stage 2 - Requirements and Strategy
Ascertaining the business requirements for IT service continuity is critical in determining how well an organization will survive a business interruption or disaster and the costs that will be incurred. If the requirements analysis is incorrect, or key information has been missed, this could have serious consequences for the effectiveness of ITSCM mechanisms.

This stage can effectively be split into two sections:

Requirements - Business Impact Analysis
The purpose of a Business Impact Analysis (BIA) is to quantify the impact to the business that loss of service would have. This impact could be a 'hard' impact that can be precisely identified - such as financial loss - or 'soft' impact - such as public relations, morale, health and safety or loss of competitive advantage. The BIA will identify the most important services to the organization and will therefore be a key input to the strategy.

The BIA identifies:

One of the key outputs from a BIA exercise is a graph of the anticipated business impact caused by the loss of a business process or the loss of an IT service over time, as illustrated in Figure 4.22. This graph can then be used to drive the business and IT continuity strategies and plans. More preventative measures need to be adopted with regard to those processes and services with earlier and higher impacts, whereas greater emphasis should be placed on continuity and recovery measures for those where the impact is lower and takes longer to develop. A balanced approach of both measures should be adopted to those in between. These items provide the drivers for the level of ITSCM mechanisms that need to be considered or deployed. Once presented with these options, the business may decide that lower levels of service or increased delays are more acceptable, based on a cost-benefit analysis, or it may be that comprehensive disaster prevention measures will need to be implemented.
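The rule of thumb in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the `ImpactCurve` structure, the 0-10 impact scale, the four-hour "early" window and the `high`/`low` thresholds are all assumptions made for the example, not values taken from ITIL or from any BIA.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ImpactCurve:
    """Hypothetical BIA output for one service.

    impact_by_hour maps hours since loss of service to an estimated
    business impact on an assumed 0-10 scale.
    """
    service: str
    impact_by_hour: Dict[int, float]


def emphasis(curve: ImpactCurve, early_hours: int = 4,
             high: float = 7.0, low: float = 3.0) -> str:
    """Suggest where to place the emphasis for a service.

    The window and thresholds are illustrative assumptions only.
    """
    # Worst impact estimated within the early window after the outage starts.
    early = max(
        (v for h, v in curve.impact_by_hour.items() if h <= early_hours),
        default=0.0,
    )
    if early >= high:
        return "preventative"  # impact is early and high
    if early <= low:
        return "recovery"      # impact is low and takes longer to develop
    return "balanced"          # somewhere in between


dealing = ImpactCurve("dealer system", {0: 9.0, 24: 10.0})
payroll = ImpactCurve("mainframe payroll", {0: 1.0, 72: 6.0})
print(dealing.service, "->", emphasis(dealing))
print(payroll.service, "->", emphasis(payroll))
```

In practice the cut-off points would be agreed with the business as part of the cost-benefit discussion rather than fixed in code.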

These assessments enable the mapping of critical service, application and technology components to critical business processes, thus helping to identify the ITSCM elements that need to be provided. The business requirements are ranked and the associated ITSCM elements confirmed and prioritized in terms of risk reduction and recovery planning. The results of the BIA, discussed earlier, are invaluable input to several areas of process design including Service Level Management to understand the required service levels.

Figure 4.22 Graphical representation of business impacts

Impacts should be measured against particular scenarios for each business process, such as an inability to settle trades in a money market dealing process, or an inability to invoice for a period of days. An example is a money market dealing environment where loss of market data information could mean that the organization starts to lose money immediately as trading cannot continue. In addition, customers may go to another organization, which would mean potential loss of core business. Loss of the settlement system does not prevent trading from taking place, but if trades already conducted cannot be settled within a specified period of time, the organization may be in breach of regulatory rules or settlement periods and suffer fines and damaged reputation. This may actually be a more significant impact than the inability to trade because of an inability to satisfy customer expectations.

It is also important to understand how impacts may change over time. For instance, it may be possible for a business to function without a particular process for a short period of time. In a balanced scenario, impacts to the business will occur and become greater over time. However, not all organizations are affected in this way. In some organizations, impacts are not apparent immediately. At some point, however, for any organization, the impacts accumulate to such an extent that the business can no longer operate. ITSCM ensures that contingency options are identified so that the appropriate measure can be applied at the appropriate time to keep business impacts from service disruption to a minimum level.

When conducting a BIA, it is important that senior business area representatives' views are sought on the impact following loss of service. It is also equally important that the views of supervisory staff and more junior staff are sought to ensure all aspects of the impact following loss of service are ascertained. Often different levels of staff will have different views on the impact, and all will have to be taken into account when producing the overall strategy.

In many organizations it will be impossible, or it will not be cost-justifiable, to recover the total service in a very short timescale. In many cases, business processes can be re-established without a full complement of staff, systems and other facilities, and still maintain an acceptable level of service to clients and customers. The business recovery objectives should therefore be stated in terms of:

It may not always be possible to provide the recovery requirements to a detailed level. There is a need to balance the potential impact against the cost of recovery to ensure that the costs are acceptable. The recovery objectives do, however, provide a starting point from which different business recovery and ITSCM options can be evaluated.

Requirements - Risk Analysis
Figure 4.23 Management of Risk
Figure 4.24 Example summary risk profile

The second driver in determining ITSCM requirements is the likelihood that a disaster or other serious service disruption will actually occur. This is an assessment of the level of threat and the extent to which an organization is vulnerable to that threat. Risk Analysis can also be used in assessing and reducing the chance of normal operational incidents and is a technique used by Availability Management to ensure the required availability and reliability levels can be maintained. Risk Analysis is also a key aspect of Information Security Management. A diagram on Risk Analysis and Management (Figure 4.20) is contained within the Availability Management process in section 4.4.

A number of Risk Analysis and Management methods are available for both the commercial and government sectors. Risk Analysis is the assessment of the risks that may give rise to service disruption or security violation. Risk management is concerned with identifying appropriate risk responses or cost-justifiable countermeasures to combat those risks.

A standard methodology, such as the Management of Risk (M_o_R), should be used to assess and manage risks within an organization. The M_o_R framework is illustrated in Figure 4.23.

The M_o_R approach is based around the above framework, which consists of the following:

This M_o_R method requires the evaluation of risks and the development of a risk profile, such as the example in Figure 4.24.

Figure 4.24 provides an example risk profile, containing many risks that are outside the defined level of 'acceptable risk'. Following the Risk Analysis it is possible to determine appropriate risk responses or risk reduction measures (ITSCM mechanisms) to manage the risks, i.e. reduce the risk to an acceptable level or mitigate the risk. Wherever possible, appropriate risk responses should be implemented to reduce the impact or the likelihood, or both, of these risks manifesting themselves.
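A risk profile of this kind can be approximated with a small script. The sketch below is an assumption-laden illustration, not the M_o_R method itself: the 1-5 scales, the likelihood x impact score, the `acceptable_score` threshold and the sample ratings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (very low) to 5 (very high)
    impact: int      # assumed scale: 1 (very low) to 5 (very high)

    @property
    def score(self) -> int:
        # Simple likelihood x impact product; other scoring schemes exist.
        return self.likelihood * self.impact


def outside_tolerance(risks, acceptable_score=8):
    """Return risks whose score exceeds the (hypothetical) acceptable level,
    highest score first, so that risk responses can be prioritised."""
    flagged = [r for r in risks if r.score > acceptable_score]
    return sorted(flagged, key=lambda r: r.score, reverse=True)


# Sample ratings are invented for illustration.
profile = [
    Risk("Power failure", likelihood=3, impact=4),
    Risk("Flood", likelihood=1, impact=5),
    Risk("Human error", likelihood=3, impact=3),
]
for risk in outside_tolerance(profile):
    print(f"{risk.name}: score {risk.score}")
```

A risk response that lowers either rating moves the risk toward the acceptable region of the profile, which is the effect the paragraph above describes.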

In the context of ITSCM, there are a number of risks that need to be taken into consideration. Table 4.1 is not a comprehensive list but does give some examples of risks and threats that need to be addressed by the ITSCM process.

Risk: Loss of internal IT systems/networks, PABXs, ACDs, etc.
Threats:
  • Fire
  • Power failure
  • Arson and vandalism
  • Flood
  • Aircraft impact
  • Weather damage, e.g. hurricane
  • Environmental disaster
  • Terrorist attack
  • Sabotage
  • Catastrophic failure
  • Electrical damage, e.g. lightning
  • Accidental damage
  • Poor-quality software

Risk: Loss of external IT systems/networks, e.g. e-commerce servers, cryptographic systems
Threats:
  • All of the above
  • Excessive demand for services
  • Denial of service attack, e.g. against an internet firewall
  • Technology failure, e.g. cryptographic system

Risk: Loss of data
Threats:
  • Technology failure
  • Human error
  • Viruses, malicious software, e.g. attack applets

Risk: Loss of network services
Threats:
  • Damage or denial of access to network service provider's premises
  • Loss of service provider's IT
  • Loss of service provider's data
  • Failure of the service provider

Risk: Unavailability of key technical and support staff
Threats:
  • Industrial action
  • Denial of access to premises
  • Resignation
  • Sickness/injury
  • Transport difficulties

Risk: Failure of service providers, e.g. outsourced IT
Threats:
  • Commercial failure, e.g. insolvency
  • Denial of access to premises
  • Unavailability of service provider's staff
  • Failure to meet contractual service levels
Table 4.1 Examples of risks and threats

IT Service Continuity Strategy
The results of the Business Impact Analysis and the Risk Analysis will enable appropriate Business and IT Service Continuity strategies to be produced in line with the business needs. The strategy will be an optimum balance of risk reduction and recovery or continuity options. This includes consideration of the relative service recovery priorities and the changes in relative service priority for the time of day, day of the week, and monthly and annual variations. For services that the BIA identifies as having high impacts in the short term, the emphasis should be on preventative risk reduction methods - for example, through full resilience and fault tolerance - while services with low short-term impacts would be better suited to comprehensive recovery options, as described in the following sections. Similar advice and guidance can be found in the Business Continuity Institute's BCI Good Practice Guidelines.

Risk Response Measures
Most organizations will have to adopt a balanced approach where risk reduction and recovery are complementary and both are required. This entails reducing, as far as possible, the risks to the continued provision of the IT service and is usually achieved through Availability Management. However well planned, it is impossible to completely eliminate all risks - for example, a fire in a nearby building will probably result in damage, or at least denial of access, as a result of the implementation of a cordon. As a general rule, the invocation of a recovery capability should only be taken as a last resort. Ideally, an organization should assess all of the risks to reduce the potential requirement to recover the business, which is likely to include the IT services.

The risk reduction measures need to be implemented and should be instigated in conjunction with Availability Management, as many of these reduce the probability of failure affecting the availability of service. Typical risk reduction measures include:

The above measures will not necessarily solve an ITSCM issue and remove the risk totally, but all or a combination of them may significantly reduce the risks associated with the way in which services are provided to the business.

Off-Site Storage
One risk response method is to ensure all vital data is backed up and stored off-site. Once the recovery strategy has been defined, an appropriate backup strategy should be adopted and implemented to support it. The backup strategy must include regular (probably daily) removal of data (including the CMS to ease recovery) from the main data centres to a suitable off-site storage location. This will ensure retrieval of data following relatively minor operational failure as well as total and complete disasters. As well as the electronic data, all other important information and documents should be stored off-site, with the main example being the ITSCM plans.
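One simple control that follows from this is a routine check that the newest off-site copy is recent enough. The function below is a minimal sketch; its name, its parameters and the one-day default (mirroring the "probably daily" removal mentioned above) are assumptions, and the appropriate maximum age depends on the agreed recovery strategy.

```python
from datetime import datetime, timedelta


def offsite_copy_is_current(last_offsite_copy: datetime,
                            now: datetime,
                            max_age: timedelta = timedelta(days=1)) -> bool:
    """Return True if the most recent off-site copy is no older than max_age.

    max_age defaults to one day as an illustrative assumption.
    """
    return now - last_offsite_copy <= max_age


# Example timestamps are invented for illustration.
last_copy = datetime(2024, 1, 1, 2, 0)
print(offsite_copy_is_current(last_copy, now=datetime(2024, 1, 1, 20, 0)))
print(offsite_copy_is_current(last_copy, now=datetime(2024, 1, 3, 2, 0)))
```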

ITSCM Recovery Options
An organization's ITSCM strategy is a balance between the cost of risk reduction measures and recovery options to support the recovery of critical business processes within agreed timescales. The following is a list of the potential IT recovery options that need to be considered when developing the strategy.

Manual Workarounds
For certain types of services, manual work-arounds can be an effective interim measure for a limited timeframe until the IT service is resumed. For instance, a Service Desk call logging service could survive for a limited time using paper forms linked to a laptop computer with a spreadsheet.

Reciprocal Arrangements
In the past, reciprocal arrangements were typical contingency measures where agreements were put in place with another organization using similar technology. This is no longer effective or possible for most types of IT systems, but can still be used in specific cases - for example, setting up an agreement to share high-speed printing facilities. Reciprocal arrangements can also be used for the off-site storage of backups and other critical information.

Gradual Recovery
This option (sometimes referred to as 'cold standby') includes the provision of empty accommodation, fully equipped with power, environmental controls, local network cabling infrastructure and telecommunications connections, and available in a disaster situation for an organization to install its own computer equipment. It does not include the actual computing equipment, so is not applicable for services requiring speedy recovery, as set-up time is required before recovery of services can begin. This recovery option is only recommended for services that can bear a delay of recovery time in days or weeks, not hours. For any non-critical service that can bear this type of delay, the cost of this option should be weighed against the benefit to the business before determining whether a gradual recovery option should be included in the ITSCM options for the organization.

The accommodation may be provided commercially by a third party, for a fee, or may be private, (established by the organization itself) and provided as either a fixed or portable service.

A portable facility is typically a prefabricated building provided by a third party and located, when needed, at a predetermined site agreed with the organization. This may be in another location some distance from the home site, perhaps another owned building. The replacement computer equipment will need to be planned for, as suppliers of computing equipment do not always guarantee replacement equipment within a fixed deadline, though they would normally undertake to do so on a best-efforts basis.

Intermediate Recovery
This option (sometimes referred to as 'warm standby') is selected by organizations that need to recover IT facilities within a predetermined time to prevent impacts to the business process. The predetermined time will have been agreed with the business during the BIA.

Most common is the use of commercial facilities, which are offered by third-party recovery organizations to a number of subscribers, spreading the cost across those subscribers. Commercial facilities often include operation, system management and technical support. The cost varies depending on the facilities requested, such as processors, peripherals, communications, and how quickly the services must be restored.

The advantage of this service is that the customer can have virtually instantaneous access to a site, housed in a secure building, in the event of a disaster. It must be understood, however, that the restoration of services at the site may take some time, as delays may be encountered while the site is re-configured for the organization that invokes the service, and the organization's applications and data will need to be restored from backups.

One potentially major disadvantage is the security implications of running IT services at a third party's data centre. This must be taken into account when planning to use this type of facility. For some organizations, the external intermediate recovery option may not be appropriate for this reason.

If the site is invoked, there is often a daily fee for use of the service in an emergency, although this may be offset against additional cost of working insurance.

Commercial recovery services can be provided in self-contained, portable or mobile form, where an agreed system is delivered to a customer's site within an agreed time.

Fast Recovery
This option (sometimes referred to as 'hot standby') provides for fast recovery and restoration of services and is sometimes provided as an extension to the intermediate recovery provided by a third-party recovery provider. Some organizations will provide their own facilities within the organization, but not on an alternative site to the one used for the normal operations. Others implement their own internal second locations on an alternative site to provide more resilient recovery.

Where there is a need for a fast restoration of a service, it is possible to 'rent' floor space at the recovery site and install servers or systems with application systems and communications already available, and data mirrored from the operational servers. In the event of a system failure, the customers can then recover and switch over to the backup facility with little loss of service. This typically involves the re-establishment of the critical systems and services within a 24-hour period.

Immediate Recovery
This option (also often referred to as 'hot standby', 'mirroring', 'load balancing' or 'split site') provides for immediate restoration of services, with no loss of service. For business critical services, organizations requiring continuous operation will provide their own facilities within the organization, but not on the same site as the normal operations. Sufficient IT equipment will be 'dual located' in either an owned or hosted location to run the complete service from either location in the event of loss of one facility, with no loss of service to the customer. The second site can then be recovered whilst the service is provided from the single operable location. This is an expensive option, but may be justified for critical business processes or VBFs where non-availability for a short period could result in a significant impact, or where it would not be appropriate to be running IT services on a third party's premises for security or other reasons. The facility needs to be located separately and far enough away from the home site that it will not be affected by a disaster affecting that location. However, these mirrored servers and sites options should be implemented in close liaison with Availability Management as they support services with high levels of availability.

The strategy is likely to include a combination of risk response measures and a combination of the above recovery options, as illustrated in Figure 4.25.

Service            Manual  Immediate  Fast  Intermediate  Gradual
Service Desk       Yes     -          Yes   Yes           Yes
Mainframe payroll  Yes     -          -     Yes           Yes
Financial system   -       -          Yes   -             Yes
Dealer system      -       Yes        -     Yes           Yes
Figure 4.25 Example set of recovery options

Figure 4.25 shows that a number of options may be used to provide continuity of service. An example from Figure 4.25 shows that, initially, continuity of the Service Desk is provided using manual processes such as a set of forms, and maybe a spreadsheet operating from a laptop computer, whilst recovery plans for the service are completed on an alternative 'fast recovery' site. Once the alternative site has become operational, the Service Desk can switch back to using the IT service. However, use of the external 'fast recovery' alternative site is probably limited in duration, so while running temporarily from this site, the 'intermediate site' can be made operational and long-term operations can be transferred there.

Different services within an organization require different in-built resilience and different recovery options. Whatever option is chosen, the solution will need to be cost justified. As a general rule, the longer the business can survive without a service, the cheaper the solution will be. For example, a critical healthcare system that requires continuous operation will be very costly, as potential loss of service will need to be eliminated by the use of immediate recovery, whereas a service the absence of which does not severely affect the business for a week or so could be supported by a much cheaper solution, such as intermediate recovery.
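The cost rule of thumb above can be expressed as "choose the cheapest option that still restores the service in time". The sketch below illustrates that idea only. The recovery times and relative costs are invented for the example: the text gives "within a 24-hour period" for fast recovery and "days or weeks" for gradual recovery, while the other figures and all cost ranks are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    recovery_hours: float  # illustrative time needed to restore service
    relative_cost: int     # illustrative rank; higher means more expensive


# All figures are assumptions for the sketch, not ITIL-defined values.
OPTIONS = [
    RecoveryOption("immediate", 0, 4),
    RecoveryOption("fast", 24, 3),
    RecoveryOption("intermediate", 72, 2),
    RecoveryOption("gradual", 14 * 24, 1),
]


def cheapest_option(max_tolerable_hours: float) -> RecoveryOption:
    """Pick the least expensive option that restores service within the
    period the business can survive without it."""
    candidates = [o for o in OPTIONS if o.recovery_hours <= max_tolerable_hours]
    if not candidates:
        raise ValueError("no recovery option meets the requirement")
    return min(candidates, key=lambda o: o.relative_cost)


print(cheapest_option(0).name)        # continuous operation required
print(cheapest_option(7 * 24).name)   # business can wait about a week
```

With these assumed figures, a service that must never be unavailable maps to immediate recovery, and one that can be absent for about a week maps to intermediate recovery, matching the example in the paragraph above.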

As well as the recovery of the computing equipment, planning needs to include the recovery of accommodation and infrastructure for both IT and user staff. Other areas to be taken into account include critical services such as power, telecommunications, water, couriers, post, paper records and reference material.

It is important to remember that the recovery is based around a series of stand-by arrangements including accommodation, procedures and people, as well as systems and telecommunications. Certain actions are necessary to implement the stand-by arrangements. For example:

4.5.5.3 Stage 3 - Implementation
Once the strategy has been approved, the IT Service Continuity Plans need to be produced in line with the Business Continuity Plans.

ITSCM plans need to be developed to enable the necessary information for critical systems, services and facilities to either continue to be provided or to be reinstated within an acceptable period to the business. An example ITSCM recovery plan is contained in Appendix K. Generally the Business Continuity Plans rely on the availability of IT services, facilities and resources. As a consequence of this, ITSCM plans need to address all activities to ensure that the required services, facilities and resources are delivered in an acceptable operational state and are 'fit for purpose' when accepted by the business. This entails not only the restoration of services and facilities, but also the understanding of dependencies between them, the testing required prior to delivery (performance, functional, operational and acceptance testing) and the validation of data integrity and consistency.

It should be noted that the continuity plans are more than just recovery plans, and should include documentation of the resilience measures and the measures that have been put into place to enable recovery, together with explanations of why a particular approach has been taken (this facilitates decisions should invocation determine that the particular situation requires a modification to the plan). However, the format of the plan should enable rapid access to the recovery information itself, perhaps as an appendix that can be accessed directly. All key staff should have access to copies of all the necessary recovery documentation.

Management of the distribution of the plans is important to ensure that copies are available to key staff at all times. The plans should be controlled documents (with formalized documents maintained under Change Management control) to ensure that only the latest versions are in circulation, and each recipient should ensure that a personal copy is maintained off-site.

The plan should ensure that all details regarding recovery of the IT services following a disaster are fully documented. It should have sufficient details to enable a technical person unfamiliar with the systems to be able to follow the procedures. The recovery plans include key details such as the data recovery point, a list of dependent systems, the nature of the dependency and their data recovery points, system hardware and software requirements, configuration details and references to other relevant or essential information about the service and systems.
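As an illustration, the key details listed above could be held in a structured record, from which a safe recovery order can be derived from the documented dependencies. This is a minimal sketch, not a prescribed format: the class `RecoveryPlanEntry`, its field names and the function `recovery_order` are all hypothetical. The ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class RecoveryPlanEntry:
    """Hypothetical record of the key recovery details for one service."""
    service: str
    data_recovery_point: str  # e.g. "last nightly backup"
    # Maps each dependent system to the nature of the dependency.
    depends_on: dict[str, str] = field(default_factory=dict)
    hardware_software: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)


def recovery_order(entries: list[RecoveryPlanEntry]) -> list[str]:
    """Order services so that every dependency is recovered first.

    Raises graphlib.CycleError if the documented dependencies are circular.
    """
    graph = {entry.service: set(entry.depends_on) for entry in entries}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    plan = [
        RecoveryPlanEntry("billing", "last nightly backup",
                          {"database": "stores invoices",
                           "network": "connectivity"}),
        RecoveryPlanEntry("database", "last transaction log shipment",
                          {"network": "connectivity"}),
        RecoveryPlanEntry("network", "configuration backup"),
    ]
    print(recovery_order(plan))
```

A dependency that is named in `depends_on` but has no entry of its own still appears in the resulting order, which is a useful prompt that its recovery details have not yet been documented.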

It is a good idea to include a checklist that covers specific actions required during all stages of recovery for the service and system. For example, after the system has been restored to an operational state, connectivity checks, functionality checks or data consistency and integrity checks should be carried out prior to handing the service over to the business.
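Such a checklist can also be expressed in executable form so that handover is blocked until every check passes. The following is a sketch under assumptions: the function names are hypothetical, and each check is represented by a caller-supplied function returning `True` or `False`. The actual checks (connectivity, functionality, data consistency and integrity) would be specific to each service.

```python
from typing import Callable

Check = Callable[[], bool]


def run_handover_checklist(checks: dict[str, Check]) -> list[str]:
    """Run every check and return the names of those that failed.

    A check that raises an exception is treated as failed, so that one
    broken check does not stop the remaining checks from running.
    """
    failed = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False
        if not passed:
            failed.append(name)
    return failed


def ready_for_handover(checks: dict[str, Check]) -> bool:
    """The service is handed over only when no check has failed."""
    return not run_handover_checklist(checks)


if __name__ == "__main__":
    # Placeholder checks for illustration only.
    example = {
        "connectivity": lambda: True,
        "functionality": lambda: True,
        "data consistency and integrity": lambda: False,
    }
    print(run_handover_checklist(example))
```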

There are a number of technical plans that may already exist within an organization, documenting recovery procedures from a normal operational failure. The development and maintenance of these plans will be the responsibility of the specialist teams, but will be coordinated by the Business Continuity Management team. These will be useful additions or appendices to the main plan. Additionally, plans that will need to be integrated with the main BCP are:

Finally, each critical business area is responsible for the development of a plan detailing the individuals who will be in the recovery teams and the tasks to be undertaken on invocation of recovery arrangements.

The ITSCM Plan must contain all the information needed to recover the IT systems, networks and telecommunications in a disaster situation once a decision to invoke has been made, and then to manage the business return to normal operation once the service disruption has been resolved. One of the most important inputs into the plan development is the set of results from the Business Impact Analysis. Additionally, other areas will need to be analyzed, such as Service Level Agreements (SLAs), security requirements, operating instructions and procedures, and external contracts. It is likely that a separate SLA with alternative targets will have been agreed for running at a recovery site following a disaster.

Other areas that will need to be implemented following the approval of the strategy are:

Organization Planning
During the disaster recovery process, the organizational structure will inevitably be different from normal operation and is based around:

Testing
Experience has shown that recovery plans that have not been fully tested do not work as intended, if at all. Testing is therefore a critical part of the overall ITSCM process and the only way of ensuring that the selected strategy, standby arrangements, logistics, business recovery plans and procedures will actually work in practice.

The IT service provider is responsible for ensuring that the IT services can be recovered in the required timescales with the required functionality and the required performance following a disaster.

There are four basic types of tests that can be undertaken:

All tests need to be undertaken against defined test scenarios, which are described as realistically as possible. It should be noted, however, that even the most comprehensive test does not cover everything. For example, in a service disruption where there has been injury or even death to colleagues, the reaction of staff to a crisis cannot be tested and the plans need to make allowance for this. In addition, tests must have clearly defined objectives and Critical Success Factors, which will be used to determine the success or otherwise of the exercise.

4.5.5.4 Stage 4 - Ongoing Operation
This stage consists of the following:

Invocation
Invocation is the ultimate test of the Business Continuity and ITSCM Plans. If all the preparatory work has been successfully completed, and plans developed and tested, then an invocation of the Business Continuity Plans should be a straightforward process, but if the plans have not been tested, failures can be expected. It is important that due consideration is given to the design of all invocation processes, to ensure that they are fit for purpose and interface to all other relevant invocation processes.

Invocation is a key component of the plans, which must include the invocation process and guidance. It should be remembered that the decision to invoke, especially if a third-party recovery facility is to be used, should not be taken lightly. Costs will be involved and the process will involve disruption to the business. This decision is typically made by a 'crisis management' team, comprising senior managers from the business and support departments (including IT), using information gathered through damage assessment and other sources.

A disruption could occur at any time of the day or night, so it is essential that guidance on the invocation process is readily available. Plans must be available to key staff in the office and away from the office. The decision to invoke must be made quickly, as there may be a lead-time involved in establishing facilities at a recovery site. In the case of a serious building fire, the decision may be fairly easy to make. However, in the case of a power failure or hardware fault, where a resolution is expected within a short period, a deadline should be set: if the incident has not been resolved by that time, invocation will take place. If external service providers are used, they should be warned immediately if there is a chance that invocation might take place.
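One possible way to set such a deadline is to work backwards from the longest outage the business can tolerate, allowing for the lead-time at the recovery site. The sketch below is an assumption about how the deadline might be calculated, not a rule from this guidance, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def invocation_deadline(incident_start: datetime,
                        max_tolerable_outage: timedelta,
                        recovery_lead_time: timedelta) -> datetime:
    """Latest time at which invocation can begin (assumed formula).

    Assumes the recovery site needs `recovery_lead_time` to become usable
    and the service must be back within `max_tolerable_outage`.
    """
    return incident_start + max_tolerable_outage - recovery_lead_time


def should_invoke(now: datetime, resolved: bool, deadline: datetime) -> bool:
    """Invoke once the deadline has passed and the incident is unresolved."""
    return (not resolved) and now >= deadline


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    deadline = invocation_deadline(start, timedelta(hours=8), timedelta(hours=3))
    print(deadline)
    print(should_invoke(datetime(2024, 1, 1, 15, 0), False, deadline))
```

In practice the decision remains with the crisis management team; a calculation of this kind only indicates when the question can no longer be deferred.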

The decision to invoke needs to take into account the:

Therefore the design of the invocation process must provide guidance on how all of these areas and circumstances should be assessed to assist the person invoking the continuity plan.

The ITSCM Plan should include details of activities that need to be undertaken, including:

The invocation and initial recovery is likely to be a time of high activity, involving long hours for many individuals. This must be recognized and managed by the recovery team leaders to ensure that breaks are provided and prevent 'burn-out'. Planning for shifts and hand-overs must be undertaken to ensure that the best use is made of the facilities available. It is also vitally important to ensure that the usual business and technology controls remain in place during invocation, recovery and return to normal to ensure that information security is maintained at the correct level and that data protection is preserved.

Once the recovery has been completed, the business should be able to operate from the recovery site at the level determined and agreed in the strategy and relevant SLA. The objective, however, will be to build up the business to normal levels, maintain operation from the recovery site in the short term and vacate the recovery site in the shortest possible time. Details of all these activities need to be contained within the plans. If using external services, there will be a finite contractual period for using the facility. Whatever the period, a return to normal must be carefully planned and undertaken in a controlled fashion. Typically this will be over a weekend and may include some necessary downtime in business hours. It is important that this is managed well and that all personnel involved are aware of their responsibilities to ensure a smooth transition.

4.5.6 Triggers, Inputs, Outputs and Interfaces
Many events may trigger ITSCM activity. These include:

Integration and interfaces exist from ITSCM to all other processes. Important examples are as follows:

4.5.6.1 Inputs
There are many sources of input required by the ITSCM process:

4.5.6.2 Outputs
The outputs from the ITSCM process include:

Forecasts and predictive reports are used by all areas to analyze, predict and forecast particular business and IT scenarios and their potential solutions.

4.5.7 Key Performance Indicators
IT services are delivered and can be recovered to meet business objectives:

Awareness throughout the organization of the plans:

4.5.8 Information Management
ITSCM needs to record all of the information necessary to maintain a comprehensive set of ITSCM plans. This information base should include:

All the above information needs to be integrated and aligned with all BCM information and all the other information required by ITSCM. Interfaces to many other processes are required to ensure that this alignment is maintained.

4.5.9 Challenges, Critical Success Factors and Risks
One of the major challenges facing ITSCM is to provide appropriate plans when there is no BCM process. If there is no BCM process, then IT is likely to make incorrect assumptions about the criticality of business processes and therefore adopt the wrong continuity strategies and options. Without BCM, expensive ITSCM solutions and plans will be rendered useless by the absence of corresponding plans and arrangements within the business. Also, if BCM is absent, then the business may fail to identify inexpensive non-IT solutions and waste money on ineffective, expensive IT solutions.

In some organizations, the business perception is that continuity is an IT responsibility, and therefore the business assumes that IT will be responsible for disaster recovery and that IT services will continue to run under any circumstances. This is especially true in some outsourced situations where the business may be reluctant to share its BCM information with an external service provider.

If there is a BCM process established, then the challenge becomes one of alignment and integration. ITSCM must ensure that accurate information is obtained from the BCM process on the needs, impact and priorities of the business, and that the ITSCM information and plans are aligned and integrated with those of the business.

Having achieved that alignment, the challenge becomes one of keeping them aligned by management and control of business and IT change. It is essential, therefore, that all documents and plans are maintained under strict Change Management and Configuration Management control.

The main CSFs for the ITSCM process are:

Some of the major risks associated with ITSCM include:

Supporting Material
  1. Video - HCI - Service Continuity Management, or Powerpoint
  2. CSU - Service Continuity Management Objectives
  3. CSU - Service Continuity Management - Business Impact Analysis
  4. ITSCM ICOM Chart

    Arcserve - Guide to Availability, Continuity & Disaster Recovery

  5. Business Case
  6. Technical Barriers
  7. Operational Challenges
  8. Putting it All Together

  9. Video - HCI - Business Impact Analysis, or Powerpoint
  10. Video - HCI - Risk Analysis, or Powerpoint
  11. ISACA - The Risk IT Framework (2009)
