Service Design

Service Design Process

4.4 Availability Management

4.4.1 Purpose, Goals and Objective

The goal of the Availability Management process is to ensure that the level of service availability delivered in all services is matched to or exceeds the current and future agreed needs of the business, in a cost-effective manner.

The purpose of Availability Management is to provide a point of focus and management for all availability-related issues, relating to both services and resources, ensuring that availability targets in all areas are measured and achieved.

The objectives of Availability Management are to:

Produce and maintain an appropriate and up-to-date Availability Plan that reflects the current and future needs of the business
Provide advice and guidance to all other areas of the business and IT on all availability-related issues
Ensure that service availability achievements meet or exceed all their agreed targets, by managing service- and resource-related availability performance
Assist with the diagnosis and resolution of availability-related incidents and problems
Assess the impact of all changes on the Availability Plan and the performance and capacity of all services and resources
Ensure that proactive measures to improve the availability of services are implemented wherever it is cost-justifiable to do so.

Availability Management should ensure the agreed level of availability is provided. The measurement and monitoring of IT availability is a key activity to ensure availability levels are being met consistently. Availability Management should look to continually optimize and proactively improve the availability of the IT infrastructure, the services and the supporting organization, in order to provide cost effective availability improvements that can deliver business and customer benefits.

4.4.2 Scope

The scope of the Availability Management process covers the design, implementation, measurement, management and improvement of IT service and component availability. Availability Management needs to understand the service and component availability requirements from the business perspective in terms of the:

Current business processes, their operation and requirements
Future business plans and requirements
Service targets and the current IT service operation and delivery
IT infrastructure, data, applications and environment and their performance
Business impacts and priorities in relation to the services and their usage.

Understanding all of this will enable Availability Management to ensure that all the services and components are designed and delivered to meet their targets in terms of agreed business needs. The Availability Management process:

Should be applied to all operational services and technology, particularly those covered by SLAs. It can also be applied to those IT services deemed to be business critical regardless of whether formal SLAs exist
Should be applied to all new IT services and for existing services where Service Level Requirements (SLRs) or Service Level Agreements (SLAs) have been established
Should be applied to all supporting services and the partners and suppliers (both internal and external) that form the IT support organization as a precursor to the creation of formal agreements
Considers all aspects of the IT services and components and supporting organizations that may impact availability, including training, skills, process effectiveness, procedures and tools.

The Availability Management process does not include Business Continuity Management and the resumption of business processing after a major disaster. The support of BCM is included within IT Service Continuity Management (ITSCM). However, Availability Management does provide key inputs to ITSCM, and the two processes have a close relationship, particularly in the assessment and management of risks and in the implementation of risk reduction and resilience measures.

The Availability Management process should include:

Monitoring of all aspects of availability, reliability and maintainability of IT services and the supporting components, with appropriate events, alarms and escalation, with automated scripts for recovery
Maintenance of a set of methods, techniques and calculations for all availability measurements, metrics and reporting
Assistance with risk assessment and management activities
Collection of measurements, analysis and production of regular and ad hoc reports on service and component availability
Understanding the agreed current and future demands of the business for IT services and their availability
Influencing the design of services and components to align with business needs
Producing an Availability Plan that enables the service provider to continue to provide and improve services in line with availability targets defined in Service Level Agreements (SLAs), and to plan and forecast future availability levels required, as defined in Service Level Requirements (SLRs)
Maintaining a schedule of tests for all resilient and fail-over components and mechanisms
Assistance with the identification and resolution of any incidents and problems associated with service or component unavailability
Proactive improvement of service or component availability wherever it is cost-justifiable and meets the needs of the business.

Figure 4.13 The Availability Management process

4.4.3 Value to the Business

The Availability Management process ensures that the availability of systems and services matches the evolving and agreed upon needs of the business. The role of IT within the business is now pivotal. The availability and reliability of IT services can directly influence customer satisfaction and the reputation of the business. This is why Availability Management is essential in ensuring IT delivers the right levels of service availability required by the business to satisfy its business objectives and deliver the quality of service demanded by its customers. In today's competitive marketplace, customer satisfaction with service(s) provided is paramount. Customer loyalty can no longer be relied on, and dissatisfaction with the availability and reliability of IT service can be a key factor in customers taking their business to a competitor.

The Availability Management process and planning, just like Capacity Management, must be involved in all stages of the Service Lifecycle, from Strategy and Design, through Transition and Operation to Improvement. The appropriate availability and resilience should be designed into services and components from the initial design stages. This will ensure not only that the availability of any new or changed service meets its expected targets, but also that all existing services and components continue to meet all of their targets. This is the basis of stable service provision.

4.4.4 Policies, Principles and Basic Concepts

The Availability Management process is continually trying to ensure that all operational services meet their agreed availability targets, and that new or changed services are designed appropriately to meet their intended targets, without compromising the performance of existing services. In order to achieve this, Availability Management should perform the reactive and proactive activities illustrated in Figure 4.13.

The reactive activities of Availability Management consist of monitoring, measuring, analyzing, reporting and reviewing all aspects of component and service availability. This is to ensure that all agreed service targets are measured and achieved. Wherever deviations or breaches are detected, these are investigated and remedial action instigated. Most of these activities are conducted within the Operations stage of the lifecycle and are linked into the monitoring and control activities, Event and Incident Management processes.
The proactive activities consist of producing recommendations, plans and documents on design guidelines and criteria for new and changed services, and the continual improvement of service and reduction of risk in existing services wherever it can be cost-justified. These are key aspects to be considered within Service Design activities.

An effective Availability Management process, consisting of both the reactive and proactive activities, can 'make a big difference' and will be recognized as such by the business, if the deployment of Availability Management within an IT organization has a strong emphasis on the needs of the business and customers. To reinforce this emphasis, there are several guiding principles that should underpin the Availability Management process and its focus:

Service availability is at the core of customer satisfaction and business success: there is a direct correlation in most organizations between the service availability and customer and user satisfaction, where poor service performance is defined as being unavailable.
Recognizing that when services fail, it is still possible to achieve business, customer and user satisfaction and recognition: the way a service provider reacts in a failure situation has a major influence on customer and user perception and expectation.
Improving availability can only begin after understanding how the IT services support the operation of the business.
Service availability is only as good as the weakest link on the chain: it can be greatly increased by the elimination of Single Points of Failure (SPoFs) or an unreliable or weak component.
Availability Management is not just a reactive process. The more proactive the process, the better service availability will be. Availability Management should not purely react to service and component failure. The more events and failures are predicted, pre-empted and prevented, the higher the level of service availability.
It is cheaper to design the right level of service availability into a service from the start rather than try and 'bolt it on' subsequently. Adding resilience into a service or component is invariably more expensive than designing it in from the start. Also, once a service gets a bad name for unreliability, it becomes very difficult to change the image. Resilience is also a key consideration of IT Service Continuity Management, and this should be considered at the same time.
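
The 'weakest link' principle above can be illustrated with a small calculation. This is only a sketch: it assumes component failures are independent, and the component names and availability figures are hypothetical. A chain in which every component must be up has an availability equal to the product of the component availabilities, and duplicating the weakest component shows the effect of removing a Single Point of Failure.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def parallel_availability(replicas):
    """Availability of redundant replicas where any one being up is enough."""
    return 1 - prod(1 - a for a in replicas)


# Hypothetical component availabilities, expressed as fractions.
network, server, database = 0.999, 0.98, 0.995

# All three components in a chain: the server is the weakest link.
single = series_availability([network, server, database])

# Duplicate the server so that it is no longer a single point of failure.
resilient = series_availability(
    [network, parallel_availability([server, server]), database]
)

print(f"chain with SPoF:   {single:.4%}")
print(f"server duplicated: {resilient:.4%}")
```

Under these assumptions the duplicated server lifts the end-to-end figure noticeably, although real components rarely fail fully independently, so the result should be read as an upper bound.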

The scope of Availability Management covers the design, implementation, measurement and management of IT service and infrastructure availability. This is reflected in the process description shown in Figure 4.13 and described in the following paragraphs.

The Availability Management process has two key elements:

Reactive activities: the reactive aspect of Availability Management involves the monitoring, measuring, analysis and management of all events, incidents and problems involving unavailability. These activities are principally involved within operational roles.
Proactive activities: the proactive activities of Availability Management involve the proactive planning, design and improvement of availability. These activities are principally involved within design and planning roles.

Availability Management is completed at two interconnected levels:

Service availability: involves all aspects of service availability and unavailability and the impact of component availability, or the potential impact of component unavailability on service availability
Component availability: involves all aspects of component availability and unavailability.

Availability Management relies on the monitoring, measurement, analysis and reporting of the following aspects:

Availability: the ability of a service, component or CI to perform its agreed function when required. It is often measured and reported as a percentage:

Availability (%) = ((Agreed Service Time (AST) - downtime) / Agreed Service Time (AST)) x 100

Note: Downtime should only be included in the above calculation when it occurs within the Agreed Service Time (AST). However, total downtime should also be recorded and reported.
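
The distinction in the note can be sketched in code. The example below is illustrative only: the service window, the outage times and the helper name `overlap_hours` are all assumptions. It counts only the part of an outage that overlaps the agreed service window towards the availability figure, while still keeping the total downtime for reporting.

```python
from datetime import datetime


def overlap_hours(start_a, end_a, start_b, end_b):
    """Hours shared by two time intervals (0 if they do not overlap)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max((earliest_end - latest_start).total_seconds(), 0) / 3600


# Hypothetical agreed service window: one day, 08:00 to 18:00 (AST = 10 hours).
window = (datetime(2024, 1, 8, 8), datetime(2024, 1, 8, 18))
# Hypothetical outage running from 17:00 to 20:00 on the same day.
outage = (datetime(2024, 1, 8, 17), datetime(2024, 1, 8, 20))

ast_hours = (window[1] - window[0]).total_seconds() / 3600
total_downtime = (outage[1] - outage[0]).total_seconds() / 3600  # recorded and reported
downtime_in_ast = overlap_hours(*outage, *window)                # used in the calculation

availability = (ast_hours - downtime_in_ast) / ast_hours * 100

print(f"total downtime: {total_downtime} h, within AST: {downtime_in_ast} h")
print(f"availability: {availability:.1f}%")
```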

Reliability: a measure of how long a service, component or CI can perform its agreed function without interruption. The reliability of the service can be improved by increasing the reliability of individual components or by increasing the resilience of the service to individual component failure (i.e. increasing the component redundancy, e.g. by using load-balancing techniques). It is often measured and reported as Mean Time Between Service Incidents (MTBSI) or Mean Time Between Failures (MTBF):

Reliability (MTBSI in hours) = Available time in hours / Number of breaks

Reliability (MTBF in hours) = (Available time in hours - Total downtime in hours) / Number of breaks

Maintainability: a measure of how quickly and effectively a service, component or CI can be restored to normal working after a failure. It is measured and reported as Mean Time to Restore Service (MTRS) and should be calculated using the following formula:

Maintainability (MTRS in hours) = Total downtime in hours / Number of service breaks

MTRS should be used to avoid the ambiguity of the more common industry term Mean Time To Repair (MTTR), which in some definitions includes only repair time, but in others includes recovery time. The downtime in MTRS covers all the contributory factors that make the service, component or CI unavailable:

Time to record
Time to respond
Time to resolve
Time to physically repair or replace
Time to recover.

Example

A situation where a 24 x 7 service has been running for a period of 5,020 hours with only two breaks, one of six hours and one of 14 hours, would give the following figures:

Availability = (5,020-(6+14)) / 5,020 x 100 = 99.60%
Reliability (MTBSI) = 5,020 / 2 = 2,510 hours
Reliability (MTBF) = (5,020-(6+14)) / 2 = 2,500 hours
Maintainability (MTRS) = (6+14) / 2 = 10 hours
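
The worked example above can be reproduced with a few lines of code. This is a minimal sketch; the function and variable names are illustrative and not part of any standard.

```python
def availability_pct(service_hours, downtime_hours):
    """Percentage of the agreed service time during which the service was up."""
    return (service_hours - downtime_hours) / service_hours * 100


def mtbsi(service_hours, breaks):
    """Mean Time Between Service Incidents, in hours."""
    return service_hours / breaks


def mtbf(service_hours, downtime_hours, breaks):
    """Mean Time Between Failures (uptime per break), in hours."""
    return (service_hours - downtime_hours) / breaks


def mtrs(downtime_hours, breaks):
    """Mean Time to Restore Service (downtime per break), in hours."""
    return downtime_hours / breaks


service_hours = 5020   # 24 x 7 service running for 5,020 hours
outages = [6, 14]      # two breaks, in hours

downtime = sum(outages)
breaks = len(outages)

print(f"Availability: {availability_pct(service_hours, downtime):.2f}%")
print(f"MTBSI: {mtbsi(service_hours, breaks):.0f} hours")
print(f"MTBF:  {mtbf(service_hours, downtime, breaks):.0f} hours")
print(f"MTRS:  {mtrs(downtime, breaks):.0f} hours")
```

Note that MTBSI equals MTBF plus MTRS (2,510 = 2,500 + 10), which is a quick consistency check on any set of reported figures.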

Serviceability: the ability of a third-party supplier to meet the terms of their contract. Often this contract will include agreed levels of availability, reliability and/or maintainability for a supporting service or component.

Figure 4.14 Availability terms and measurements

These aspects and their inter-relationships are illustrated in Figure 4.14.

Although the principal service target contained within SLAs for the customers and business is availability, as illustrated in Figure 4.14, some customers also require reliability and maintainability targets to be included as well. Where these are included they should relate to service reliability and maintainability targets, whereas the reliability and maintainability targets contained in OLAs and contracts relate to component and supporting service targets and can often include availability targets relating to the relevant components or supporting services.

The term Vital Business Function (VBF) is used to reflect the business critical elements of the business process supported by an IT service. An IT service may support a number of business functions that are less critical. For example, an automated teller machine (ATM) or cash dispenser service VBF would be the dispensing of cash. However, the ability to obtain a statement from an ATM may not be considered as vital. This distinction is important and should influence availability design and associated costs. The more vital the business function generally, the greater the level of resilience and availability that needs to be incorporated into the design required in the supporting IT services. For all services, whether VBFs or not, the availability requirements should be determined by the business and not by IT. The initial availability targets are often set at too high a level, and this leads to either over-priced services or an iterative discussion between the service provider and the business to agree an appropriate compromise between the service availability and the cost of the service and its supporting technology.

Certain VBFs may need special designs, which are now being used as a matter of course within Service Design plans, incorporating:

High availability: a characteristic of the IT service that minimizes or masks the effects of IT component failure to the users of a service.
Fault tolerance: the ability of an IT service, component or CI to continue to operate correctly after failure of a component part.
Continuous operation: an approach or design to eliminate planned downtime of an IT service. Note that individual components or CIs may be down even though the IT service remains available.
Continuous availability: an approach or design to achieve 100% availability. A continuously available IT service has no planned or unplanned downtime.

Industry View
Many suppliers commit to high availability or continuous availability solutions only if stringent environmental standards and resilient processes are used. They often agree to such contracts only after a site survey has been completed and additional, sometimes costly, improvements have been made.

Availability Management commences as soon as the availability requirements for an IT service are clear enough to be articulated. It is an ongoing process, finishing only when the IT service is decommissioned or retired. The key activities of the Availability Management process are:

Determining the availability requirements from the business for a new or enhanced IT service and formulating the availability and recovery design criteria for the supporting IT components
Determining the VBFs, in conjunction with the business and ITSCM
Determining the impact arising from IT service and component failure in conjunction with ITSCM and, where appropriate, reviewing the availability design criteria to provide additional resilience to prevent or minimize impact to the business
Defining the targets for availability, reliability and maintainability for the IT infrastructure components that underpin the IT service to enable these to be documented and agreed within SLAs, OLAs and contracts
Establishing measures and reporting of availability, reliability and maintainability that reflect the business, user and IT support organization perspectives
Monitoring and trend analysis of the availability, reliability and maintainability of IT components
Reviewing IT service and component availability and identifying unacceptable levels
Investigating the underlying reasons for unacceptable availability
Producing and maintaining an Availability Plan that prioritizes and plans IT availability improvements.

4.4.5 Process Activities, Methods and Techniques

The Availability Management process depends heavily on the measurement of service and component achievements with regard to availability.

Key message

'If you don't measure it, you can't manage it'
'If you don't measure it, you can't improve it'
'If you don't measure it, you probably don't care'
'If you can't influence or control it, then don't measure it'

'What to measure and how to report it' inevitably depend on which activity is being supported, who the recipients are and how the information is to be utilized. It is important to recognize the differing perspectives of availability to ensure measurement and reporting satisfies these varied needs:

The business perspective considers IT service availability in terms of its contribution or impact on the VBFs that drive the business operation.
The user perspective considers IT service availability as a combination of three factors, namely the frequency, the duration and the scope of impact, i.e. all users, some users, all business functions or certain business functions - the user also considers IT service availability in terms of response times. For many performance-centric applications, poor response times are considered equal in impact to failures of technology.
The IT service provider perspective considers IT service and component availability with regard to availability, reliability and maintainability.

In order to satisfy the differing perspectives of availability, Availability Management needs to consider the spectrum of measures needed to report the 'same' level of availability in different ways. Measurements need to be meaningful and add value if availability measurement and reporting are ultimately to deliver benefit to the IT and business organizations. This is influenced strongly by the combination of 'what you measure' and 'how you report it'.

4.4.5.1 The Reactive Activities of Availability Management

Monitor, Measure, Analyse And Report Service and Component Availability
A key output from the Availability Management process is the measurement and reporting of IT availability. Availability measures should be incorporated into SLAs, OLAs and any underpinning contracts. These should be reviewed regularly at Service Level review meetings. Measurement and reporting provide the basis for:

Monitoring the actual availability delivered versus agreed targets
Establishing measures of availability and agreeing on availability targets with the business
Identifying unacceptable levels of availability that impact the business and users
Reviewing availability with the IT support organization
Continual improvement activities to optimize availability.

The IT service provider organizations have, for many years, measured and reported on their perspective of availability. Traditionally these measures have concentrated on component availability and have been somewhat divorced from the business and user views. Typically these traditional measures are based on a combination of an availability percentage (%), time lost and the frequency of failure. Some examples of these traditional measures are as follows:

Per cent available - the truly 'traditional' measure that represents availability as a percentage and, as such, much more useful as a component availability measure than a service availability measure. It is typically used to track and report achievement against a service level target. It tends to emphasize the 'big number' such that if the service level target was 98.5% and the achievement was 98.3%, then it does not seem that bad. This can encourage a complacent behaviour within the IT support organization.
Per cent unavailable - the inverse of the above. This representation, however, has the benefit of focusing on non-availability. Based on the above example, if the target for non-availability is 1.5% and the achievement was 1.7%, then this is a much larger relative difference. This method of reporting is more likely to create awareness of the shortfall in delivering the level of availability required.
Duration - achieved by converting the percentage unavailable into hours and minutes. This provides a more 'human' measure that people can relate to. If the weekly downtime target is two hours but the actual downtime in one week was four hours, this would, if sustained, represent a trend leading to roughly an additional four days of non-availability to the business over a full year (2 extra hours x 52 weeks = 104 hours). This type of measure and reporting is more likely to encourage focus on service improvement.
Frequency of failure - used to record the number of interruptions to the IT service. It helps provide a good indication of reliability from a user perspective. It is best used in combination with 'duration' to take a balanced view of the level of service interruptions and the duration of time lost to the business.
Impact of failure - this is the true measure of service unavailability. It depends on mature incident recording where the inability of users to perform their business tasks is the most important piece of information captured. All other measures suffer from a potential to mask the real effects of service failure and are often converted to a financial impact.
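
The difference between these representations can be made concrete with the figures used in the list above (a 98.5% target against 98.3% achieved, and two hours of weekly downtime targeted against four actual). The sketch below is illustrative only, and the variable names are assumptions.

```python
target_pct, achieved_pct = 98.5, 98.3

# Reported as 'per cent available', the shortfall looks small relative to the target.
shortfall_available = (target_pct - achieved_pct) / target_pct * 100

# Reported as 'per cent unavailable', the same gap is a much larger relative overrun.
target_unavailable = 100 - target_pct
achieved_unavailable = 100 - achieved_pct
overrun_unavailable = (
    (achieved_unavailable - target_unavailable) / target_unavailable * 100
)

# Reported as duration: two extra hours of downtime per week, sustained for a year.
weekly_target_hours, weekly_actual_hours = 2, 4
extra_hours_per_year = (weekly_actual_hours - weekly_target_hours) * 52
extra_days_per_year = extra_hours_per_year / 24

print(f"relative shortfall (available):  {shortfall_available:.2f}%")
print(f"relative overrun (unavailable):  {overrun_unavailable:.1f}%")
print(f"extra downtime per year: {extra_hours_per_year} h "
      f"(about {extra_days_per_year:.1f} days)")
```

The same 0.2 percentage point gap is a fraction of one per cent of the availability target but more than a tenth of the unavailability target, which is why the second form tends to draw more attention.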

The business may have, for many years, accepted that the IT availability that they experience is represented in terms of component availability rather than overall service or business availability. However, this is no longer being viewed as acceptable and the business is keen to better represent availability in measure(s) that demonstrate the positive and negative consequences of IT availability on their business and users.

Key message

The most important availability measurements are those that reflect and measure availability from the business and user perspective.

Availability Management needs to consider availability from both a business/IT service provider perspective and from an IT component perspective. These are entirely different aspects, and while the underlying concept is similar, the measurement, focus and impact are entirely different.

The sole purpose of producing these availability measurements and reports, including those from the business perspective, is to improve the quality and availability of IT service provided to the business and users. All measures, reports and activities should reflect this purpose.

Availability, when measured and reported to reflect the experience of the user, provides a more representative view on overall IT service quality. The user view of availability is influenced by three factors:

Frequency of downtime
Duration of downtime
Scope of impact.

Measurements and reporting of user availability should therefore embrace these factors. The methodology employed to reflect user availability could consider two approaches:

Impact by user minutes lost: this is to base calculations on the duration of downtime multiplied by the number of users impacted. This can be the basis to report availability as lost user productivity, or to calculate the availability percentage from a user perspective, and can also include the costs of recovery for lost productivity (e.g. increased overtime payments).
Impact by business transaction: this is to base calculations on the number of business transactions that could not be processed during the period of downtime. This provides a better indication of business impact reflecting differing transaction processing profiles across the time of day, week etc. In many instances it may be the case that the user impact correlates to a VBF, e.g. if the user takes customer purchase orders and a VBF is customer sales. This single measure is the basis to reflect impact to the business operation and user.
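
Both approaches can be sketched with a few lines of code. The incident data, user population and transaction rates below are hypothetical, and the `Outage` structure is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: int           # duration of downtime
    users_affected: int    # scope of impact
    tx_per_minute: float   # expected business transactions per minute at that time


# Hypothetical incidents in one reporting period.
outages = [
    Outage(minutes=30, users_affected=200, tx_per_minute=12.0),
    Outage(minutes=90, users_affected=25, tx_per_minute=3.0),
]

total_users = 200
period_minutes = 30 * 24 * 60  # a 30-day, 24 x 7 reporting period

# Impact by user minutes lost: duration multiplied by the number of users impacted.
user_minutes_lost = sum(o.minutes * o.users_affected for o in outages)
user_availability = (1 - user_minutes_lost / (total_users * period_minutes)) * 100

# Impact by business transaction: transactions that could not be processed.
transactions_lost = sum(o.minutes * o.tx_per_minute for o in outages)

print(f"user minutes lost: {user_minutes_lost}")
print(f"availability from the user perspective: {user_availability:.3f}%")
print(f"transactions lost: {transactions_lost:.0f}")
```

In this made-up data the shorter outage dominates both measures because it hit every user at a busy time, which a simple count of downtime minutes would not show.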

The method employed should be influenced by the nature of the business operation. A business operation supporting data entry activity is well suited to reporting that reflects user productivity loss. Business operations that are more customer-facing, e.g. ATM services, benefit from reporting transaction impact. It should also be noted that not all business impact is user-related. With increasing automation and electronic processing, the ability to process automated transactions or meet market cut-off times can also have a large financial impact that may be greater than the ability of users to work.

The IT support organization needs to have a keen awareness of the user experience of availability. However, the real benefits come from aggregating the user view into the overall business view. A guiding principle of the Availability Management process is that 'Improving availability can only begin when the way technology supports the business is understood'. Therefore Availability Management isn't just about understanding the availability of each IT component, but is all about understanding the impact of component failure on service and user availability. From the business perspective, an IT service can only be considered available when the business is able to perform all vital business functions required to drive the business operation. For the IT service to be available, it therefore relies on all components on which the service depends being available, i.e. systems, key components, network, data and applications.

The traditional IT approach would be to measure individually the availability of each of these components. However, the true measure of availability has to be based on the positive and negative impacts on the VBFs on which the business operation is dependent. This approach ensures that SLAs and IT availability reporting are based on measures that are understood by both the business and IT. By measuring the VBFs that rely on IT services, measurement and reporting becomes business-driven, with the impact of failure reflecting the consequences to the business. It is also important that the availability of the services is defined and agreed with the business and reflected within SLAs. This definition of availability should include:

What is the minimum available level of functionality of the service?
At what level of service response is the service considered unavailable?
Where will this level of functionality and response be measured?
What are the relative weightings for partial service unavailability?
If one location or office is impacted, is the whole service considered unavailable, or is this considered to be 'partial unavailability'? This needs to be agreed with the customers.
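
One possible way to express agreed weightings for partial unavailability is sketched below. The locations, weightings and downtime figures are hypothetical, and weighting by location is only one of several schemes a customer might agree to.

```python
# Hypothetical weightings agreed with the customer: the share of the
# service that each location represents. They should sum to 1.
weights = {"head_office": 0.5, "branch_a": 0.3, "branch_b": 0.2}

# Hypothetical downtime per location, in hours, within the agreed service time.
downtime_hours = {"head_office": 0.0, "branch_a": 4.0, "branch_b": 1.0}

ast_hours = 200.0  # agreed service time in the reporting period

# An outage at one location counts only in proportion to its weighting.
weighted_downtime = sum(weights[loc] * downtime_hours[loc] for loc in weights)
weighted_availability = (ast_hours - weighted_downtime) / ast_hours * 100

print(f"weighted downtime: {weighted_downtime:.1f} h")
print(f"weighted availability: {weighted_availability:.2f}%")
```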

Reporting and analysis tools are required for the manipulation of data stored in the various databases utilized by Availability Management. These tools can either be platform- or PC-based and are often a combination of the two. This will be influenced by the database repository technologies selected and the complexity of data processing and reporting required. Availability Management, once implemented and deployed, will be required to produce regular reports on an agreed basis, e.g. monthly availability reports, Availability Plan, Service Failure Analysis (SFA) status reports, etc. The activities involved within these reporting activities can require much manual effort and the only solution is to automate as much of the report generation activity as possible. For reporting purposes, organizational reporting standards should be used wherever possible. If these don't exist, IT standards should be developed so that IT reports can be developed using standard tools and techniques. This means that the integration and consolidation of reports will subsequently be much easier to achieve.

Unavailability Analysis
All events and incidents causing unavailability of services and components should be investigated, with remedial actions being implemented within either the Availability Plan or the overall SIP. Trends should be produced from this analysis to direct and focus activities such as Service Failure Analysis (SFA) to those areas causing the most impact or disruption to the business and the users.

The overall costs of an IT service are influenced by the levels of availability required and the investments required in technology and services provided by the IT support organization to meet this requirement. Availability certainly does not come for free. However, it is important to recognize that the unavailability of IT also has a cost, so unavailability isn't free either. For highly critical business processes and VBFs, it is necessary to consider not only the cost of providing the service, but also the costs that are incurred from failure. The optimum balance to strike is the cost of the availability solution weighed against the costs of unavailability.
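
The balance described above can be sketched as a simple comparison. Every figure here is hypothetical: the design options, their annual costs, their expected availability and the cost per hour of downtime are assumptions chosen only to show the shape of the trade-off.

```python
HOURS_PER_YEAR = 8760  # a 24 x 7 service is assumed

# Hypothetical design options: (name, annual solution cost, expected availability %).
options = [
    ("basic", 50_000, 99.0),
    ("resilient", 120_000, 99.9),
    ("fault tolerant", 400_000, 99.99),
]

cost_per_hour_down = 2_000  # hypothetical cost of unavailability per hour


def total_annual_cost(solution_cost, availability_pct):
    """Cost of the availability solution plus the expected cost of unavailability."""
    downtime_hours = HOURS_PER_YEAR * (100 - availability_pct) / 100
    return solution_cost + downtime_hours * cost_per_hour_down


for name, cost, pct in options:
    print(f"{name:15s} total annual cost: {total_annual_cost(cost, pct):,.0f}")

best = min(options, key=lambda o: total_annual_cost(o[1], o[2]))
print(f"lowest total cost: {best[0]}")
```

With these made-up numbers the middle option gives the lowest combined cost: the cheapest design loses more to downtime than it saves, and the most resilient design costs more than the downtime it prevents.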

Before any SLR is accepted, and ultimately the SLR or SLA is negotiated and agreed between the business and the IT organization, it is essential that the availability requirements of the business are analyzed to assess if/how the IT service can deliver the required levels of availability. This applies not only to new IT services that are being introduced, but also to any requested changes to the availability requirements of existing IT services.

The cost of an IT failure could simply be expressed as the number of business or IT transactions impacted, either as an actual figure (derived from instrumentation) or based on an estimation. When measured against the VBFs that support the business operation, this can provide an obvious indication of the consequence of failure. The advantage of this approach is the relative ease of obtaining the impact data and the lack of any complex calculations. It also becomes a 'value' that is understood by both the business and IT organization. This can be the stimulus for identifying improvement opportunities and can become a key metric in monitoring the availability of IT services.

The major disadvantage of this approach is that it offers no obvious monetary value that would be needed to justify any significant financial investment decisions for improving availability. Where significant financial investment decisions are required, it is better to express the cost of failure arising from service, system, application or function loss to the business as a monetary 'value'.

The monetary value can be calculated as a combination of the tangible costs associated with failure, but can also include a number of intangible costs. The monetary value should also reflect the cost impact to the whole organization, i.e. the business and IT organization.

Tangible costs can include:

Lost user productivity
Lost IT staff productivity
Lost revenue
Overtime payments
Wasted goods and material
Imposed fines or penalty payments.

Figure 4.15 The expanded incident lifecycle

These costs are often well understood by the finance area of the business and IT organization, and in relative terms are easier to obtain and aggregate than the intangible costs associated with an IT failure. Intangible costs can include:

Loss of customers
Loss of customer goodwill (customer dissatisfaction)
Loss of business opportunity (to sell, gain new customers or revenue, etc.)
Damage to business reputation
Loss of confidence in IT service provider
Damage to staff morale.

It is important not simply to dismiss the intangible costs (and the potential consequences) on the grounds that they may be difficult to measure. The overall unavailability of service, the total tangible cost and the total intangible costs arising from service unavailability are all key metrics in the measurement of the effectiveness of the Availability Management process.
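
As a rough illustration of how the tangible cost categories listed above might be combined into a single monetary value for one outage, consider the following minimal sketch. The class name, field names and every figure are hypothetical and chosen only for illustration; intangible costs are deliberately left out, because they need a separate, organization-specific estimate.

```python
from dataclasses import dataclass


@dataclass
class OutageCostInputs:
    """Hypothetical inputs for one outage; every figure is illustrative."""
    outage_hours: float
    affected_users: int
    user_hourly_cost: float        # assumed loaded cost of one user hour
    it_staff_hours: float          # IT staff time spent on the incident
    it_staff_hourly_cost: float
    lost_revenue_per_hour: float
    overtime_payments: float = 0.0
    wasted_goods: float = 0.0
    fines_or_penalties: float = 0.0


def tangible_cost_of_failure(x: OutageCostInputs) -> float:
    """Sum the tangible cost categories for a single outage."""
    lost_user_productivity = x.outage_hours * x.affected_users * x.user_hourly_cost
    lost_it_productivity = x.it_staff_hours * x.it_staff_hourly_cost
    lost_revenue = x.outage_hours * x.lost_revenue_per_hour
    return (lost_user_productivity + lost_it_productivity + lost_revenue
            + x.overtime_payments + x.wasted_goods + x.fines_or_penalties)


# Assumed example: a two-hour outage affecting 100 users.
example = OutageCostInputs(
    outage_hours=2.0, affected_users=100, user_hourly_cost=30.0,
    it_staff_hours=10.0, it_staff_hourly_cost=50.0,
    lost_revenue_per_hour=1000.0, overtime_payments=200.0,
)
print(tangible_cost_of_failure(example))
```

Keeping each category as a separate input makes it easier to agree the individual figures with the finance area before they are aggregated.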

Expanded Incident Lifecycle
A guiding principle of Availability Management is to recognize that it is still possible to gain customer satisfaction even when things go wrong. One approach to help achieve this requires Availability Management to ensure that the duration and impact of any incident affecting IT services are minimized, to enable normal business operations to resume as quickly as possible. The analysis of the 'expanded incident lifecycle' enables the total IT service downtime for any given incident to be broken down and mapped against the major stages through which all incidents progress (the lifecycle). Availability Management should work closely with Incident Management and Problem Management in the analysis of all incidents causing unavailability.

A good technique to help with the technical analysis of incidents affecting the availability of components and IT services is to take an incident 'lifecycle' view. Every incident passes through several major stages. The time elapsed in these stages may vary considerably. For Availability Management purposes, the standard incident 'lifecycle', as described within Incident Management, has been expanded to provide additional help and guidance, particularly in the area of 'designing for recovery'. Figure 4.15 illustrates the expanded incident lifecycle.

From the above it can be seen that an incident can be broken down into individual stages within a lifecycle that can be timed and measured. This lifecycle view provides an important framework in determining, amongst others, systems management requirements for event and incident detection, diagnostic data capture requirements and tools for diagnosis, recovery plans to aid speedy recovery and how to verify that IT service has been restored. The individual stages of the lifecycle are considered in more detail as follows.

Incident Detection: the time at which the IT service provider organization is made aware of an incident. Systems management tools positively influence the ability to detect events and incidents and therefore to improve levels of availability that can be delivered. Implementation and exploitation should have a strong focus on achieving high availability and enhanced recovery objectives. In the context of recovery, such tools should be exploited to provide automated failure detection, assist failure diagnosis and support automated error recovery, with scripted responses. Tools are very important in reducing all stages of the incident lifecycle, but principally the detection of events and incidents. Ideally the event is automatically detected and resolved, before the users have noticed it or have been impacted in any way.
Incident Diagnosis: the time at which diagnosis to determine the underlying cause has been completed. When IT components fail, it is important that the required level of diagnostics is captured, to enable problem determination to identify the root cause and resolve the issue. The use and capability of diagnostic tools and skills is critical to the speedy resolution of service issues. For certain failures, the capture of diagnostics may extend service downtime. However, the non-capture of the appropriate diagnostics creates and exposes the service to repeat failures. Where the time required to take diagnostics is considered excessive, or varies from the target, a review should be instigated to identify if techniques and/or procedures can be streamlined to reduce the time required. Equally the scope of the diagnostic data available for capture can be assessed to ensure only the diagnostic data considered essential is taken. The additional downtime required to capture diagnostics should be included in the recovery metrics documented for each IT component.
Incident Repair: the time at which the failure has been repaired/fixed. Repair times for incidents should be continuously monitored and compared against the targets agreed within OLAs, underpinning contracts and other agreements. This is particularly important with respect to externally provided services and supplier performance. Wherever breaches are observed, techniques should be used to reduce or remove the breaches from similar incidents in the future.
Incident Recovery: the time at which component recovery has been completed. The backup and recovery requirements for the components underpinning a new IT service should be identified as early as possible within the design cycle. These requirements should cover hardware, software and data and recovery targets.
The outcome from this activity should be a documented set of recovery requirements that enables the development of appropriate recovery plans. To anticipate and prepare for performing recovery such that reinstatement of service is effective and efficient requires the development and testing of appropriate recovery plans based on the documented recovery requirements. Wherever possible, the operational activities within the recovery plan should be automated. The testing of the recovery plans also delivers approximate timings for recovery. These recovery metrics can be used to support the communication of estimated recovery of service and validate or enhance the Component Failure Impact Analysis documentation. Availability Management must continuously seek and promote faster methods of recovery for all potential Incidents. This can be achieved via a range of methods, including automated failure detection, automated recovery, more stringent escalation procedures, and exploitation of new and faster recovery tools and techniques. Availability requirements should also contribute to determining what spare parts are kept within the Definitive Spares to facilitate quick and effective repairs, as described within the Service Transition publication.
Incident Restoration: the time at which normal business service is resumed. An incident can only be considered 'closed' once service has been restored and normal business operation has resumed. It is important that the restored IT service is verified as working correctly as soon as service restoration is completed and before any technical staff involved in the incident are stood down. In the majority of cases, this is simply a case of getting confirmation from the affected users. However, the users for some services may be customers of the business, e.g. ATM services or internet-based services. For these types of services, it is recommended that IT service verification procedures are developed to enable the IT service provider organization to verify that a restored IT service is now working as expected. These could simply be visual checks of transaction throughput or user simulation scripts that validate the end-to-end service.

Each stage, and the associated time taken, influences the total downtime perceived by the user. By taking this approach it is possible to see where time is being 'lost' for the duration of an incident. For example, the service was unavailable to the business for 60 minutes, yet it only took five minutes to apply a fix - where did the other 55 minutes go?
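
The '55 missing minutes' question can be answered by timing each stage of the expanded incident lifecycle. The sketch below is a minimal illustration, assuming one timestamp is recorded when the incident occurs and one when each stage is reached; the function name and all timestamps are hypothetical.

```python
from datetime import datetime

# Lifecycle stages in the order described above.
STAGES = ["detection", "diagnosis", "repair", "recovery", "restoration"]


def stage_durations(occurred: datetime, stage_times: dict) -> dict:
    """Return the minutes taken to reach each stage, measured from the previous one."""
    durations = {}
    previous = occurred
    for stage in STAGES:
        reached = stage_times[stage]
        durations[stage] = (reached - previous).total_seconds() / 60
        previous = reached
    return durations


# Hypothetical incident: 60 minutes of downtime, 5 of them spent applying the fix.
occurred = datetime(2024, 1, 1, 9, 0)
times = {
    "detection": datetime(2024, 1, 1, 9, 20),
    "diagnosis": datetime(2024, 1, 1, 9, 40),
    "repair": datetime(2024, 1, 1, 9, 45),
    "recovery": datetime(2024, 1, 1, 9, 55),
    "restoration": datetime(2024, 1, 1, 10, 0),
}
breakdown = stage_durations(occurred, times)
for stage, minutes in breakdown.items():
    print(f"{stage:12s} {minutes:5.1f} min")
print("total downtime:", sum(breakdown.values()), "min")
```

In this made-up case most of the hour is spent on detection and diagnosis rather than on the repair itself, which is the kind of insight the lifecycle view is meant to expose.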

Using this approach identifies possible areas of inefficiency that combine to make the loss of service experienced by the business greater than it need be. These could cover areas such as poor automation (alerts, automated recovery etc.), poor diagnostic tools and scripts, unclear escalation procedures (which delay the escalation to the appropriate technical support group or supplier), or lack of comprehensive operational documentation. Availability Management needs to work in close association with Incident and Problem Management to ensure repeat occurrences are eliminated. It is recommended that these measures are established and captured for all availability incidents. This provides Availability Management with metrics for both specific incidents and trending information. This information can be used as input to SFA assignments, SIP activities and regular Availability Management reporting and to provide an impetus for continual improvement activity to pursue cost-effective improvements. It can also enable targets to be set for specific stages of the incident lifecycle. While accepting that each incident may have a wide range of technical complexity, the targets can be used to reflect the consistency of how the IT service provider organization responds to incidents.
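
Once stage timings are captured for every availability incident, they can be aggregated for trending and compared against stage targets. The following is a minimal sketch under assumed data: the target values, the incident records and the function name are all hypothetical.

```python
from statistics import mean

# Hypothetical per-stage targets in minutes (assumed, not taken from any agreement).
TARGETS = {"detection": 5, "diagnosis": 15, "repair": 10,
           "recovery": 10, "restoration": 5}

# Hypothetical stage durations (minutes) recorded for three incidents.
incidents = [
    {"detection": 20, "diagnosis": 20, "repair": 5, "recovery": 10, "restoration": 5},
    {"detection": 3, "diagnosis": 30, "repair": 8, "recovery": 12, "restoration": 4},
    {"detection": 10, "diagnosis": 10, "repair": 5, "recovery": 8, "restoration": 3},
]


def stage_report(records, targets):
    """Average each stage across incidents and count how often its target was exceeded."""
    report = {}
    for stage, target in targets.items():
        values = [r[stage] for r in records]
        report[stage] = {
            "mean": mean(values),
            "breaches": sum(1 for v in values if v > target),
        }
    return report


report = stage_report(incidents, TARGETS)
for stage, row in report.items():
    print(stage, row)
```

A stage with a high breach count across incidents of differing technical complexity points to an inconsistency in how the organization responds, rather than to the difficulty of any single incident.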

An output from the Availability Management process is the real-time monitoring requirements for IT services and components. To achieve the levels of availability required and/or ensure the rapid restoration of service following an IT failure requires investment and exploitation of a systems management toolset. Systems management tools are an essential building block for IT services that require a high level of availability and can provide an invaluable role in reducing the amount of downtime incurred. Availability Management requirements cover the detection and alerting of IT service and component exceptions, automated escalation and notification of IT failures and the automated recovery and restoration of components from known IT failure situations. This makes it possible to identify where 'time is being lost' and provides the basis for the identification of factors that can improve recovery and restoration times. These activities are performed on a regular basis within Service Operation.

Service Failure Analysis

Figure 4.16 The structured approach to Service Failure Analysis (SFA)

Service Failure Analysis (SFA) is a technique designed to provide a structured approach to identifying the underlying causes of service interruptions to the user. SFA utilizes a range of data sources to assess where and why shortfalls in availability are occurring. SFA enables a holistic view to be taken to drive not just technology improvements, but also improvements to the IT support organization, processes, procedures and tools. SFA is run as an assignment or project, and may utilize other Availability Management methods and techniques to formulate the recommendations for improvement. The detailed analysis of service interruptions can identify opportunities to enhance levels of end-to-end service availability that deliver benefits to the user. Many of the activities involved in SFA are closely aligned with those of Problem Management, and in a number of organizations these activities are performed jointly by Problem and Availability Management.

The high-level objectives of SFA are:

To improve the overall availability of IT services by producing a set of improvements for implementation or input to the Availability Plan
To identify the underlying causes of service interruption to users
To assess the effectiveness of the IT support organization and key processes
To produce reports detailing the major findings and recommendations
To ensure that availability improvements derived from SFA-driven activities are measured.

SFA initiatives should use input from all areas and all processes including, most importantly, the business and users. Each SFA assignment should have a recognized sponsor(s) (ideally, joint sponsorship from the IT and business) and involve resources from many technical and process areas. The use of the SFA approach:

Provides the ability to deliver enhanced levels of availability without major cost
Provides the business with visible commitment from the IT support organization
Develops in-house skills and competencies to avoid expensive consultancy assignments related to availability improvement
Encourages cross-functional team working and breaks barriers between teams, and is an enabler to lateral thinking, challenging traditional thoughts and providing innovative, and often inexpensive, solutions
Provides a programme of improvement opportunities that can make a real difference to service quality and user perception
Provides opportunities that are focused on delivering benefit to the user
Provides an independent 'health check' of IT Service Management processes and is the stimulus for process improvements.

To maximize both the time of individuals allocated to the SFA assignment and the quality of the delivered report, a structured approach is required. This structure is illustrated in Figure 4.16. This approach is similar to many consultancy models utilized within the industry, and in many ways Availability Management can be considered as providing via SFA a form of internal consultancy.

The above high-level structure is described briefly as follows.

Select opportunity: prior to scheduling an SFA assignment, there needs to be agreement as to which IT service or technology is to be selected. It is recommended that an agreed number of assignments are scheduled per year within the Availability Plan and, if possible, the IT services are selected in advance as part of the proactive approach to Availability Management. Before commencing with the SFA, it is important that the assignment has a recognized sponsor from within the IT organization and/or the business and that they are involved and regularly updated with progress of the SFA activity. This ensures organizational visibility to the SFA and ensures recommendations are endorsed at a senior level within the organization.
Scope assignment: this is to state explicitly what areas are and are not covered within the assignment. This is normally documented in Terms of Reference issued prior to the assignment.
Plan assignment: the SFA assignment needs to be planned a number of weeks in advance of the assignment commencing, with an agreed project plan and a committed set of resources. The project should look at identifying improvement opportunities that benefit the user. It is therefore important that an end-to-end view of the data and Management Information System (MIS) requirements is taken. The data and documentation should be collected from all areas and analyzed from the user and business perspective. A 'virtual' SFA team should be formed from all relevant areas to ensure that all aspects and perspectives are considered. The size of the team should reflect the scope and complexity of the SFA assignment.
Build hypothesis: this is a useful method of building likely scenarios, which can help the study team draw early conclusions within the analysis period. These scenarios can be built from discussing the forthcoming assignment with key roles, e.g. senior management and users, or by using the planning session to brainstorm the list from the assembled team. The completed hypotheses list should be documented and input to the analysis period to provide some early focus on the data and Management Information System (MIS) that match the individual scenarios. It should be noted that this approach also eliminates perceived issues, i.e. no data or MIS substantiates what is perceived to be a service issue.
Analyze data: the number of individuals that form the SFA team dictates how to allocate specific analysis responsibilities. During this analysis period the hypotheses list should be used to help draw some early conclusions.
Interview key personnel: it is essential that key business representatives and users are interviewed to ensure the business and user perspectives are captured. It is surprising how this dialogue can identify quick win opportunities, as often what the business views as a big issue can be addressed by a simple IT solution. Therefore these interviews should be initiated as soon as possible within the SFA assignment. The study team should also seek input from key individuals within the IT service provider organization to identify additional problem areas and possible solutions that can be fed back to the study team. The dialogue also helps capture those issues that are not easily visible from the assembled data and MIS reports.
Findings and conclusions: after analysis of the data and MIS provided, interviews and continual revision of the hypothesis list, the study team should be in a position to start documenting initial findings and conclusions. It is recommended that the team meet immediately after the analysis period to share their individual findings and then take an aggregate view to form the draft findings and conclusions. It is important that all findings can be evidenced by facts gathered during the analysis. During this phase of the assignment, it may be necessary to validate finding(s) by additional analysis to ensure the SFA team can back up all findings with clear documented evidence.
Recommendations: after all findings and conclusions have been validated, the SFA team should be in a position to formulate recommendations. In many cases, the recommendations to support a particular finding are straightforward and obvious. However, the benefit of bringing a cross-functional team together for the SFA assignment is to create an environment for innovative lateral-thinking approaches. The SFA assignment leader should facilitate this session with the aim of identifying recommendations that are practical and sustainable once implemented.
Report: the final report should be issued to the sponsor with a management summary. Reporting styles are normally determined by the individual organizations. It is important that the report clearly shows where loss of availability is being incurred and how the recommendations address this. If the report contains many recommendations, an attempt should be made to quantify the availability benefit of each recommendation, together with the estimated effort to implement. This enables informed choices to be made on how to take the recommendations forward and how these should be prioritized and resourced.
Validation: it is recommended that for each SFA, key measures that reflect the business and user perspectives prior to the assignment are captured and recorded as the 'before' view. As SFA recommendations are progressed, the positive impacts on availability should be captured to provide the 'after' view for comparative purposes. Where anticipated benefits have not been delivered, this should be investigated and remedial action taken. Having invested time and effort in completing the SFA assignment, it is important that the recommendations, once agreed by the sponsor, are then taken forward for implementation. The best mechanism for achieving this is by incorporating the recommendations as activities to be completed within the Availability Plan or the overall SIP. The success of the SFA assignment as a whole should be monitored and measured to ensure its continued effectiveness.
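
Where a report contains many recommendations, quantifying the benefit and effort of each one allows a simple ranking. The sketch below is one possible approach, assuming that benefit is estimated as hours of downtime avoided per year and effort as days of work; the class, field names, recommendation titles and figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    title: str
    downtime_saved_hours: float   # estimated availability benefit per year
    effort_days: float            # estimated effort to implement


def prioritize(recs):
    """Order recommendations by estimated benefit per day of effort, best first."""
    return sorted(recs, key=lambda r: r.downtime_saved_hours / r.effort_days,
                  reverse=True)


recs = [
    Recommendation("Automate restart of failed service", 12.0, 5.0),
    Recommendation("Add alerting on queue depth", 6.0, 2.0),
    Recommendation("Replace ageing storage array", 20.0, 40.0),
]
ranked = prioritize(recs)
for r in ranked:
    ratio = r.downtime_saved_hours / r.effort_days
    print(f"{r.title}: {ratio:.1f} h saved per effort day")
```

A ratio like this is only a starting point for discussion with the sponsor; the same estimates can later serve as the expected 'after' view when validating the delivered benefit.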

Hints and Tips

Consider categorizing the recommendations under the following headings:

Detection: Recommendations that, if implemented, will provide enhanced reporting of key indicators to ensure underlying IT service issues are detected early to enable a proactive response.
Reduction: Recommendations that, if implemented, will reduce or minimize the user impact from IT service interruption, e.g. recovery and/or restoration can be enhanced to reduce impact duration.
Avoidance: Recommendations that, if implemented, will eliminate this particular cause of IT service interruption.

4.4.5.2 The Proactive Activities of Availability Management

The capability of the Availability Management process is positively influenced by the range and quality of proactive methods and techniques utilized by the process. The following activities are the proactive techniques and activities of the Availability Management process.

Identifying Vital Business Functions (VBFs)
The term Vital Business Function (VBF) is used to reflect the business critical elements of the business process supported by an IT service. The service may also support less critical business functions and processes, and it is important that the VBFs are recognized and documented to provide the appropriate business alignment and focus.

Designing for Availability
The level of availability required by the business influences the overall cost of the IT service provided. In general, the higher the level of availability required by the business, the higher the cost. These costs are not just the procurement of the base IT technology and services required to underpin the IT infrastructure. Additional costs are incurred in providing the appropriate Service Management processes, systems management tools and high-availability solutions required to meet the more stringent availability requirements. The greatest level of availability should be included in the design of those services supporting the most critical of the VBFs.

When considering how the availability requirements of the business are to be met, it is important to ensure that the level of availability to be provided for an IT service is at the level actually required, and is affordable and cost justifiable to the business. Figure 4.17 indicates the products and processes required to provide varying levels of availability and the cost implications.

Figure 4.17 Relationship between levels of availability and overall costs

Relationship Between Levels of Availability and Overall Costs

Base product and components - The procurement or development of the base products, technology and components should be based on their capability to meet stringent availability and reliability requirements. These should be considered as the cornerstone of the availability design. The additional investment required to achieve even higher levels of availability will be wasted and availability levels not met if these base products and components are unreliable and prone to failure.
Systems Management - Systems management should provide the monitoring, diagnostic and automated error recovery to enable fast detection and speedy resolution of potential and actual IT failure.
Service Management processes - Effective Service Management processes contribute to higher levels of availability. Processes such as Availability Management, Incident Management, Problem Management, Change Management, Configuration Management, etc. play a crucial role in the overall management of the IT service.
High-availability design - The design for high availability needs to consider the elimination of SPoFs and/or the provision of alternative components to provide minimal disruption to the business operation should an IT component failure occur. The design also needs to eliminate or minimize the effects of planned downtime to the business operation normally required to accommodate maintenance activity, the implementation of changes to the IT infrastructure or business application. Recovery criteria should define rapid recovery and IT service reinstatement as a key objective within the designing for recovery phase of design.
Special solutions with full redundancy - To approach continuous availability, i.e. availability close to 100%, requires expensive solutions that incorporate full mirroring or redundancy. Redundancy is the technique of improving availability by using duplicate components. For stringent availability requirements to be met, these need to be working autonomously in parallel. These solutions are not just restricted to the IT components, but also to the IT environments, i.e. data centres, power supplies, air conditioning and telecommunications.
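
The effect of redundancy on availability can be estimated with the standard series and parallel reliability formulas. The sketch below assumes that components fail independently of each other, which is a simplification that real infrastructures often violate (shared power, shared software faults); the function names and the 99% component figure are illustrative assumptions.

```python
from math import prod


def serial_availability(components):
    """All components are needed, so their availabilities multiply."""
    return prod(components)


def parallel_availability(components):
    """Any one component suffices, so the service fails only if all of them fail."""
    return 1 - prod(1 - a for a in components)


single = 0.99  # assumed availability of one component
print(f"one component:    {single:.4%}")
print(f"two in parallel:  {parallel_availability([single, single]):.4%}")
print(f"three in series:  {serial_availability([single] * 3):.4%}")
```

Under these assumptions, duplicating a component raises availability well above that of the single component, while chaining components in series lowers it, which is why the elimination of SPoFs features so prominently in high-availability design.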

Where new IT services are being developed, it is essential that Availability Management takes an early and participating design role in determining the availability requirements. This enables Availability Management to influence positively the IT infrastructure design to ensure that it can deliver the level of availability required. The importance of this participation early in the design of the IT infrastructure cannot be overstated. There needs to be a dialogue between IT and the business to determine the balance between the business perception of the cost of unavailability and the exponential cost of delivering higher levels of availability.

As illustrated in Figure 4.17, there is a significant increase in costs when the business requirement is higher than the optimum level of availability that the IT infrastructure can deliver. These increased costs are driven by major redesign of the technology and the changing of requirements for the IT support organization.

It is important that the level of availability designed into the service is appropriate to the business needs, the criticality of the business processes being supported and the available budget. The business should be consulted early in the Service Design lifecycle so that the business availability needs of a new or enhanced IT service can be costed and agreed. This is particularly important where stringent availability requirements may require additional investment in Service Management processes, IT service and System Management tools, high-availability design and special solutions with full redundancy.

It is likely that the business need for IT availability cannot be expressed in technical terms. Availability Management therefore provides an important role in being able to translate the business and user requirements into quantifiable availability targets and conditions. This is an important input into the IT Service Design and provides the basis for assessing the capability of the IT design and IT support organization in meeting the availability requirements of the business.

The business requirements for IT availability should contain at least:

A definition of the VBFs supported by the IT service
A definition of IT service downtime, i.e. the conditions under which the business considers the IT service to be unavailable
The business impact caused by loss of service, together with the associated risk
Quantitative availability requirements, i.e. the extent to which the business tolerates IT service downtime or degraded service
The required service hours, i.e. when the service is to be provided
An assessment of the relative importance of different working periods
Specific security requirements
The service backup and recovery capability.
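
To show how required service hours and a quantitative availability requirement relate to tolerated downtime, the sketch below uses the commonly applied ratio of delivered service time to agreed service time. The function names, the service hours and the 99.5% target are hypothetical values chosen for illustration.

```python
def availability_pct(agreed_service_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage of agreed service time."""
    return (agreed_service_hours - downtime_hours) / agreed_service_hours * 100


def permitted_downtime_hours(agreed_service_hours: float, target_pct: float) -> float:
    """Downtime budget implied by an availability target over the same period."""
    return agreed_service_hours * (1 - target_pct / 100)


# Hypothetical service: 12 hours a day, 5 days a week, over a 4-week period.
service_hours = 12 * 5 * 4
print(availability_pct(service_hours, downtime_hours=3))
print(permitted_downtime_hours(service_hours, target_pct=99.5))
```

Only downtime that falls within the agreed service hours, and that meets the agreed definition of IT service downtime, would count against the target in a calculation like this.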

Once the IT technology design and IT support organization are determined, the service provider organization is then in a position to confirm if the availability requirements can be met. Where shortfalls are identified, dialogue with the business is required to present the cost options that exist to enhance the proposed design to meet the availability requirements. This enables the business to reassess if lower or higher levels of availability are required, and to understand the appropriate impact and costs associated with their decision.

Determining the availability requirements is likely to be an iterative process, particularly where there is a need to balance the business availability requirement against the associated costs. The necessary steps are:

Determine the business impact caused by loss of service
From the business requirements, specify the availability, reliability and maintainability requirements for the IT service and components supported by the IT support organization
For IT services and components provided externally, identify the serviceability requirements
Estimate the costs involved in meeting the availability, reliability, maintainability and serviceability requirements
Determine, with the business, if the costs identified in meeting the availability requirements are justified
Determine, from the business, the costs likely to be incurred from loss or degradation of service
Where these are seen as cost-justified, define the availability, reliability, maintainability and serviceability requirements in agreements and negotiate into contracts.
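
The cost-justification steps above amount to weighing the cost of each availability solution against the cost of the unavailability it leaves behind. The following minimal sketch compares a few design options on that basis; the option names, costs, downtime estimates and the cost per hour of outage are all hypothetical.

```python
# Hypothetical design options: annual solution cost and expected annual downtime.
options = [
    {"name": "base design", "solution_cost": 50_000, "downtime_hours": 40},
    {"name": "duplexed servers", "solution_cost": 120_000, "downtime_hours": 8},
    {"name": "full redundancy", "solution_cost": 400_000, "downtime_hours": 1},
]

COST_PER_DOWNTIME_HOUR = 5_000  # assumed business cost of one hour of outage


def total_cost(option, cost_per_hour=COST_PER_DOWNTIME_HOUR):
    """Cost of the availability solution plus the expected cost of unavailability."""
    return option["solution_cost"] + option["downtime_hours"] * cost_per_hour


best = min(options, key=total_cost)
for o in options:
    print(f'{o["name"]:18s} total = {total_cost(o):>9,}')
print("lowest total cost:", best["name"])
```

With these invented figures the middle option gives the lowest combined cost; a different cost per hour of outage, agreed with the business, could shift the balance towards either extreme, which is why the process is iterative.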

Hints and Tips

If costs are seen as prohibitive, either:

Reassess the IT infrastructure design and provide options for reducing costs and assess the consequences on availability; or
Reassess the business use and reliance on the IT service and renegotiate the availability targets within the SLA.

The SLM process is normally responsible for communicating with the business on how its availability requirements for IT services are to be met and negotiating the SLR/SLA for the IT Service Design process. Availability Management therefore provides important support and input to both the SLM and design processes during this period. While higher levels of availability can often be provided by investment in tools and technology, there is no justification for providing a higher level of availability than that needed and afforded by the business. The reality is that satisfying availability requirements is always a balance between cost and quality. This is where Availability Management can play a key role in optimizing availability of the IT Service Design to meet increasing availability demands while deferring an increase in costs.

Designing service for availability is a key activity driven by Availability Management. This ensures that the required level of availability for an IT service can be met. Availability Management needs to ensure that the design activity for availability looks at the task from two related, but distinct, perspectives:

Designing for availability: this activity relates to the technical design of the IT service and the alignment of the internal and external suppliers required to meet the availability requirements of the business. It needs to cover all aspects of technology, including infrastructure, environment, data and applications.
Designing for recovery: this activity relates to the design points required to ensure that in the event of an IT service failure, the service and its supporting components can be reinstated to enable normal business operations to resume as quickly as is possible. This again needs to cover all aspects of technology.

Additionally, the ability to recover quickly may be a crucial factor. In simple terms, it may not be possible or cost justified to build a design that is highly resilient to failure(s). The ability to meet the availability requirements within the cost parameters may rely on the ability consistently to recover in a timely and effective manner. All aspects of availability should be considered in the Service Design process and should consider all stages within the Service Lifecycle.

The contribution of Availability Management within the design activities is to provide:

The specification of the availability requirements for all components of the service
The requirements for availability measurement points (instrumentation)
The requirements for new/enhanced systems and Service Management
Assistance with the IT infrastructure design
The specification of the reliability, maintainability and serviceability requirements for components supplied by internal and external suppliers
Validation of the final design to meet the minimum levels of availability required by the business for the IT service.

If the availability requirements cannot be met, the next task is to re-evaluate the Service Design and identify cost justified design changes. Improvements in design to meet the availability requirements can be achieved by reviewing the capability of the technology to be deployed in the proposed IT design. For example:

The exploitation of fault-tolerant technology to mask the impact of planned or unplanned component downtime
Duplexing, or the provision of alternative IT infrastructure components to allow one component to take over the work of another component
Improving component reliability by enhancing testing regimes
Improved software design and development
Improved processes and procedures
Systems management enhancements/exploitation
Improved externally supplied services, contracts or agreements
Developing the capability of the people with more training.

Hints and Tips

Consider documenting the availability design requirements and considerations for new IT services and making them available to the design and implementation functions. In the longer term seek to mandate these requirements and integrate within the appropriate governance mechanisms that cover the introduction of new IT services.

Part of the activity of designing for availability must ensure that all business, data and information security requirements are incorporated within the Service Design. The overall aim of IT security is 'balanced security in depth', with justifiable controls implemented to ensure that the Information Security Policy is enforced and that IT services continue to operate within secure parameters (i.e. confidentiality, integrity and availability). During the gathering of availability requirements for new IT services, it is important that requirements that cover IT security are defined. These requirements need to be applied within the design phase for the supporting technology. For many organizations, the approach taken to IT security is covered by an Information Security Policy owned and maintained by Information Security Management. In the execution of the security policy, Availability Management plays an important role in its operation for new IT services.

Where the business operation has a high dependency on IT service availability, and the cost of failure or loss of business reputation is considered not acceptable, the business may define stringent availability requirements. These factors may be sufficient for the business to justify the additional costs required to meet these more demanding levels of availability. Achieving agreed levels of availability begins with the design, procurement and/or development of good-quality products and components. However, these in isolation are unlikely to deliver the sustained levels of availability required. Achieving a consistent and sustained level of availability requires investment in and deployment of effective Service Management processes, systems management tools, high-availability design and ultimately special solutions with full mirroring or redundancy.

Designing for availability is a key activity, driven by Availability Management, which ensures that the stated availability requirements for an IT service can be met. However, Availability Management should also ensure that within this design activity there is focus on the design elements required to ensure that when IT services fail, the service can be reinstated to enable normal business operations to resume as quickly as is possible. 'Designing for recovery' may at first sound negative. Clearly good availability design is about avoiding failures and delivering, where possible, a fault-tolerant IT infrastructure. However, with this focus is too much reliance placed on technology, and has as much emphasis been placed on the fault tolerance aspects of the IT infrastructure? The reality is that failures will occur. The way the IT organization manages failure situations can have a positive effect on the perception of the business, customers and users of the IT services.

Key Message

Every failure is an important 'moment of truth' - an opportunity to make or break your reputation with the business.

Providing focus on the 'designing for recovery' aspects of the overall availability design can ensure that every failure is an opportunity to maintain and even enhance business and user satisfaction. To provide an effective 'design for recovery', it is important to recognize that both the business and the IT organization have needs that must be satisfied to enable an effective recovery from IT failure.

The business's needs are informational: the information it requires to help manage the impact of failure on the business and to set expectations within the business, the user community and its business customers. The IT organization's needs are the skills, knowledge, processes, procedures and tools required to enable the technical recovery to be completed in an optimal time.

Hints and Tips

Consider documenting the recovery design requirements and considerations for new IT services and make them available to the areas responsible for design and implementation. In the longer term, seek to mandate these requirements and integrate them within the appropriate governance mechanisms that cover the introduction of new IT services.

A key aim is to prevent minor incidents from becoming major incidents by ensuring the right people are involved early enough to avoid mistakes being made and to ensure the appropriate business and technical recovery procedures are invoked at the earliest opportunity. The instigation of these activities is the responsibility of the Incident Management process and a role of the Service Desk. To ensure business needs are met during major IT service failures, and to ensure an optimal recovery, the Incident Management process and Service Desk need to have defined, and to execute, effective procedures for assessing and managing all incidents.

Key Message

The above are not the responsibilities of Availability Management. However, the effectiveness of the Incident Management process and Service Desk can strongly influence the overall recovery period. The use of Availability Management methods and techniques to further optimize IT recovery may be the stimulus for subsequent continual improvement activities to the Incident Management process and the Service Desk.

In order to remain effective, the maintainability of IT services and components should be monitored, and their impact on the 'expanded incident lifecycle' understood, managed and improved.

Component Failure Impact Analysis

Figure 4.18 Component Failure Impact Analysis

Component Failure Impact Analysis (CFIA) can be used to predict and evaluate the impact on IT service arising from component failures within the technology. The output from a CFIA can be used to identify where additional resilience should be considered to prevent or minimize the impact of component failure to the business operation and users. This is particularly important during the Service Design stage, where it is necessary to predict and evaluate the impact on IT service availability arising from component failures within the proposed IT Service Design. However, the technique can also be applied to existing services and infrastructure.

CFIA is a relatively simple technique that can be used to provide this information. IBM devised CFIA in the early 1970s, with its origins based on hardware design and configuration. However, it is recommended that CFIA be used in a much wider context to reflect the full scope of the IT infrastructure, i.e. hardware, network, software, applications, data centres and support staff. Additionally the technique can also be applied to identify impact and dependencies on IT support organization skills and competencies amongst staff supporting the new IT service. This activity is often completed in conjunction with ITSCM and possibly Capacity Management.

The output from a CFIA provides vital information to ensure that the availability and recovery design criteria for the new IT service are influenced to prevent or minimize the impact of failure to the business operation and users. CFIA achieves this by providing and indicating:

SPoFs that can impact availability
The impact of component failure on the business operation and users
Component and people dependencies
Component recovery timings
The need to identify and document recovery options
The need to identify and implement risk reduction measures.

The above can also provide the stimulus for input to ITSCM to consider the balance between recovery options and risk reduction measures, i.e. where the potential business impact is high there is a need to concentrate on high-availability risk reduction measures, such as increased resilience or standby systems.

Having determined the IT infrastructure configuration to be assessed, the first step is to create a grid with CIs on one axis and the IT services that have a dependency on the CI on the other, as illustrated in Figure 4.18. This information should be available from the CMS, or alternatively it can be built using documented configuration charts and SLAs.

The next step is to perform the CFIA and populate the grid as follows:

Leave a blank when a failure of the CI does not impact the service in any way
Insert an 'X' when the failure of the CI causes the IT service to be inoperative
Insert an 'A' when there is an alternative CI to provide the service
Insert an 'M' when there is an alternative CI, but the service requires manual intervention to be recovered.

Having built the grid, CIs that have a large number of Xs are critical to many services and can result in high impact should the CI fail. Equally, IT services having high counts of Xs are complex and are vulnerable to failure. This basic approach to CFIA can provide valuable information in quickly identifying SPoFs, IT services at risk from CI failure and what alternatives are available should CIs fail. It should also be used to assess the existence and validity of recovery procedures for the selected CIs. The above example assumes common infrastructure supporting multiple IT services. The same approach can be used for a single IT service by mapping the component CIs against the VBFs and users supported by each component, thus understanding the impact of a component failure on the business and user. The approach can also be further refined and developed to include and develop 'component availability weighting' factors that can be used to assess and calculate the overall effect of the component failure on the total service availability.
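The grid-building and counting steps above can be sketched in a few lines of Python. This is only an illustration: the CI names, service names and grid contents are hypothetical and are not taken from Figure 4.18.

```python
# Hypothetical CFIA grid: rows are CIs, columns are IT services.
# ""  = failure of the CI does not impact the service
# "X" = failure of the CI makes the service inoperative
# "A" = an alternative CI is available to provide the service
# "M" = an alternative CI exists, but manual intervention is needed
grid = {
    "Switch-1": {"Email": "X", "Payroll": "X", "Web shop": "A"},
    "Server-A": {"Email": "X", "Payroll": "", "Web shop": "M"},
    "Disk-2": {"Email": "", "Payroll": "X", "Web shop": "A"},
}


def x_count_per_ci(grid):
    """Number of services each CI would take down if it failed."""
    return {ci: sum(1 for v in row.values() if v == "X")
            for ci, row in grid.items()}


def x_count_per_service(grid):
    """Number of CIs whose failure would take each service down."""
    counts = {}
    for row in grid.values():
        for service, v in row.items():
            counts[service] = counts.get(service, 0) + (1 if v == "X" else 0)
    return counts


def spofs_for(grid, service):
    """CIs marked 'X' for a service, i.e. candidate SPoFs for that service."""
    return sorted(ci for ci, row in grid.items() if row.get(service) == "X")


if __name__ == "__main__":
    print(x_count_per_ci(grid))
    print(x_count_per_service(grid))
    print(spofs_for(grid, "Email"))
```

In this invented grid, Switch-1 carries the most Xs and would be the first candidate for additional resilience, while Web shop has an alternative for every CI it depends on.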

Undertaking an advanced CFIA requires the CFIA matrix to be expanded to provide additional fields required for the more detailed analysis. This could include fields such as:

Component availability weighting: a weighting factor appropriate to the impact of failure of the component on the total service availability. For example, if the failure of a switch can cause 2,000 users to lose service out of a total service user base of 10,000, then the weighting factor should be 0.2, or 20%
Probability of failure: this can be based on the reliability of the component as measured by the Mean Time Between Failures (MTBF) information if available or on the current trends. This can be expressed as a low/medium/high indicator or as a numeric representation
Recovery time: this is the estimated recovery time to recover the CI. This can be based on recent recovery timings, recovery information from disaster recovery testing or a scheduled test recovery
Recovery procedures: this is to verify that up-to-date recovery procedures are available for the CI
Device independence: where software CIs have duplex files to provide resilience, this is to ensure that file placements have been verified as being on separate hardware disk configurations. This also applies to power supplies - it should be verified that alternate power supplies are connected correctly
Dependency: this is to show any dependencies between CIs. If one CI failed, there could be an impact on other CIs - for example, if the security CI failed, the operating system might prevent tape processing.
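As a rough illustration of the weighting field, the sketch below reproduces the switch example (2,000 of 10,000 users gives 0.2). The ranking step, which multiplies the weighting by a numeric probability of failure, is an assumption made for this example and is not a formula prescribed by the text. All component names and figures are hypothetical.

```python
def availability_weighting(users_affected, total_users):
    """Fraction of the service user base that loses service if the CI fails."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return users_affected / total_users


def rank_by_exposure(components, total_users):
    """Order CIs by weighting x probability of failure (assumed heuristic)."""
    scored = []
    for c in components:
        weight = availability_weighting(c["users_affected"], total_users)
        scored.append((c["ci"], weight * c["probability"]))
    # Highest exposure first
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical expanded CFIA rows
components = [
    {"ci": "Switch-1", "users_affected": 2000, "probability": 0.10},
    {"ci": "Server-A", "users_affected": 10000, "probability": 0.01},
    {"ci": "Disk-2", "users_affected": 500, "probability": 0.10},
]

if __name__ == "__main__":
    print(availability_weighting(2000, 10000))
    for ci, exposure in rank_by_exposure(components, 10000):
        print(ci, round(exposure, 4))
```

Recovery time and dependency fields could be added to the same rows; they are left out here to keep the sketch short.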

Single Point of Failure Analysis
A Single Point of Failure (SPoF) is any component within the IT infrastructure that has no backup or fail-over capability, and has the potential to cause disruption to the business, customers or users when it fails. It is important that no unrecognized SPoFs exist within the IT infrastructure design or the actual technology, and that they are avoided wherever possible.

The use of SPoF analysis or CFIA as techniques to identify SPoFs is recommended. SPoF and CFIA analysis exercises should be conducted on a regular basis, and wherever SPoFs are identified, CFIA can be used to identify the potential business, customer or user impact and help determine what alternatives can or should be considered to cater for this weakness in the design or the actual infrastructure. Countermeasures should then be implemented wherever they are cost-justifiable. The impact and disruption caused by the potential failure of the SPoF should be used to cost-justify its implementation.

Supporting Material

Wiki - Single Point of Failure Analysis

Fault Tree Analysis

Figure 4.19 Example Fault Tree Analysis

Fault Tree Analysis (FTA) is a technique that can be used to determine the chain of events that causes a disruption to IT services. FTA, in conjunction with calculation methods, can offer detailed models of availability. This can be used to assess the availability improvement that can be achieved by individual technology component design options. Using FTA:

Information can be provided that can be used for availability calculations
Operations can be performed on the resulting fault tree; these operations correspond with design options
The desired level of detail in the analysis can be chosen.

FTA makes a representation of a chain of events using Boolean notation. Figure 4.19 gives an example of a fault tree.

Essentially FTA distinguishes the following events:

Basic events - terminal points for the fault tree, e.g. power failure, operator error. Basic events are not investigated in great depth. If basic events are investigated in further depth, they automatically become resulting events.
Resulting events - intermediate nodes in the fault tree, resulting from a combination of events. The highest point in the fault tree is usually a failure of the IT service.
Conditional events - events that only occur under certain conditions, e.g. failure of the air-conditioning equipment only affects the IT service if equipment temperature exceeds the serviceable values.
Trigger events - events that trigger other events, e.g. power failure detection equipment can trigger automatic shutdown of IT services.

These events can be combined using logic operators, i.e.:

AND-gate - the resulting event only occurs when all input events occur simultaneously
OR-gate - the resulting event occurs when one or more of the input events occurs
Exclusive OR-gate - the resulting event occurs when one and only one of the input events occurs
Inhibit gate - the resulting event only occurs when the input condition is not met.

This is the basic FTA technique. This technique can also be refined, but complex FTA and the mathematical evaluation of fault trees are beyond the scope of this publication.
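The Boolean combination of events can be illustrated with a small Python sketch. The fault tree below is hypothetical and is not the tree shown in Figure 4.19. Only the AND, OR and exclusive OR gates are modelled, and each basic event is simply true (it has occurred) or false.

```python
def and_gate(*inputs):
    """Resulting event occurs only when all input events occur."""
    return all(inputs)


def or_gate(*inputs):
    """Resulting event occurs when one or more input events occur."""
    return any(inputs)


def xor_gate(*inputs):
    """Resulting event occurs when one and only one input event occurs."""
    return sum(1 for i in inputs if i) == 1


def service_down(events):
    """Top event of a hypothetical fault tree for one IT service."""
    # Power is lost only if mains and UPS both fail (AND-gate)
    power_lost = and_gate(events["mains_failure"], events["ups_failure"])
    # Network is lost only if both links are down (AND-gate)
    network_down = and_gate(events["primary_link_down"],
                            events["backup_link_down"])
    # Any one of the intermediate events takes the service down (OR-gate)
    return or_gate(power_lost, network_down, events["operator_error"])


if __name__ == "__main__":
    basic_events = {
        "mains_failure": True,
        "ups_failure": False,
        "primary_link_down": True,
        "backup_link_down": True,
        "operator_error": False,
    }
    print(service_down(basic_events))
```

Replacing an AND-gate input with a constant False models a design option such as adding a redundant component, which is one way the operations on the tree correspond to design choices.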

Supporting Material

Wiki - Fault Tree Analysis
Fault Tree Analysis Tutorial - Ericson, 2000

Modeling
To assess if new components within a design can match the stated requirements, it is important that the testing regime instigated ensures that the availability expected can be delivered. Simulation, modeling or load testing tools to generate the expected user demand for the new IT service should be seriously considered to ensure components continue to operate under anticipated volume and stress conditions.

Modeling tools are also required to forecast availability and to assess the impact of changes to the IT infrastructure. Inputs to the modeling process include descriptive data of the component reliability, maintainability and serviceability. A spreadsheet package to perform calculations is usually sufficient. If more detailed and accurate data is required, a more complex modeling tool may need to be developed or acquired. The lack of readily available availability modeling tools in the marketplace may require such a tool to be developed and maintained 'in-house', but this is a very expensive and time-consuming activity that should only be considered where the investment can be justified. Unless there is a clearly perceived benefit from such a development and the ongoing maintenance costs, the use of existing tools and spreadsheets should be sufficient. However, some System Management tools do provide modeling capability and can provide useful information on trending and forecasting availability needs.
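The kind of calculation a spreadsheet model might perform can be sketched as follows. The sketch assumes the standard steady-state relationships (component availability as MTBF divided by MTBF plus mean time to repair, components in series multiplied together, redundant components combined through their unavailability) and assumes that component failures are independent. The figures are hypothetical.

```python
def component_availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(availabilities):
    """All components are needed: multiply the individual availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(availabilities):
    """Any one component is enough: 1 minus the product of unavailabilities."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable


# Hypothetical design options: one server versus a duplexed pair,
# each in series with a network component of availability 0.995.
server = component_availability(990.0, 10.0)
design_single = series_availability([server, 0.995])
design_duplexed = series_availability(
    [parallel_availability([server, server]), 0.995])

if __name__ == "__main__":
    print(round(design_single, 5), round(design_duplexed, 5))
```

Comparing the two results gives a first estimate of the availability improvement a duplexed design might buy, which can then be weighed against its cost.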

Risk Analysis and Management

Figure 4.20 Risk Analysis and Management

To assess the vulnerability to failure within the configuration and capability of the IT service and support organization it is recommended that existing or proposed IT infrastructure, service configurations, Service Design and supporting organization (internal and external suppliers) are subject to formal Risk Analysis and Management exercises. Risk Analysis and Management is a technique that can be used to identify and quantify risks and justifiable countermeasures that can be implemented to protect the availability of IT systems. The identification of risks and the provision of justified countermeasures to reduce or eliminate the threats posed by such risks can play an important role in achieving the required levels of availability for a new or enhanced IT service. Risk Analysis should be undertaken during the design phase for the IT technology and service to identify:

Risks that may incur unavailability for IT components within the technology and Service Design
Risks that may incur confidentiality and/or integrity exposures within the IT technology and Service Design.

Most risk assessment and management methodologies involve the use of a formal approach to the assessment of risk and the subsequent mitigation of risk through the implementation of cost-justifiable countermeasures, as illustrated in Figure 4.20.

Risk Analysis involves the identification and assessment of the level (measure) of the risks calculated from the assessed values of assets and the assessed levels of threats to, and vulnerabilities of, those assets. Risk is also determined to a certain extent by its acceptance. Some organizations and businesses may be more willing to accept risk whereas others cannot.

Risk management involves the identification, selection and adoption of countermeasures justified by the identified risks to assets in terms of their potential impact on services if failure occurs, and the reduction of those risks to an acceptable level. Risk management is an activity that is associated with many other activities, especially ITSCM, Security Management and Service Transition. All of these risk assessment exercises should be coordinated rather than being separate activities.

This approach, when applied via a formal method, ensures coverage is complete, together with sufficient confidence that:

All possible risks and countermeasures have been identified
All vulnerabilities have been identified and their levels accurately assessed
All threats have been identified and their levels accurately assessed
All results are consistent across the broad spectrum of the technology reviewed
All expenditure on selected countermeasures can be justified.

Formal Risk Analysis and Management methods are now an important element in the overall design and provision of IT services. The assessment of risk is often based on the probability and potential impact of an event occurring. Countermeasures are implemented wherever they are cost-justifiable, to reduce the impact of an event, or the probability of an event occurring, or both.
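One simple way to express the probability and impact assessment numerically is sketched below. The scoring rule (expected loss as probability multiplied by impact) and the justification test (the reduction in expected loss must exceed the cost of the countermeasure) are assumptions chosen for illustration. Formal methods may use different scales, and all figures are hypothetical.

```python
def expected_loss(probability, impact_cost):
    """Expected loss over the period: probability of the event x its impact."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * impact_cost


def is_cost_justified(prob_before, impact_before,
                      prob_after, impact_after, countermeasure_cost):
    """True when the reduction in expected loss exceeds the countermeasure cost.

    A countermeasure may lower the probability, the impact, or both.
    """
    reduction = (expected_loss(prob_before, impact_before)
                 - expected_loss(prob_after, impact_after))
    return reduction > countermeasure_cost


if __name__ == "__main__":
    # Hypothetical risk: 20% chance per year of an outage costing 100,000.
    # A standby system is assumed to cut the probability to 5%.
    print(is_cost_justified(0.20, 100_000, 0.05, 100_000, 10_000))
    print(is_cost_justified(0.20, 100_000, 0.05, 100_000, 20_000))
```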

Management of Risk (MoR) provides an alternative generic framework for the management of risk across all parts of an organization - strategic, programme, project and operational. It incorporates all the activities required to identify and control the exposure to any type of risk, positive or negative, that may have an impact on the achievement of your organization's business objectives.

MoR provides a framework that is tried, tested and effective to help you eliminate - or manage - the risks involved in reaching your goals. MoR adopts a systematic application of principles, approach and processes to the task of identifying, assessing and then planning and implementing risk responses. Guidance stresses a collaborative approach and focuses on the following key elements:

Developing a framework that is transparent, repeatable and adaptable
Clearly communicating the policy and its benefits to all staff
Nominating key individuals in senior management to 'own' risk management initiatives and ensure they move forwards
Ensuring the culture engages with and supports properly considered risk, including innovation
Embedding risk management systems in management and applying them consistently
Ensuring that risk management supports objectives - rather than vice versa
Explicitly assessing the risks involved in working with other organizations
Adopting a no-blame approach to monitoring and reviewing risk assessment activity.

Supporting Material

A Risk Management Standard, AIRMIC, ALARM, IRM: 2002
Risk Management Guide - NIST
Accenture - Risk Management Models (MP3)

Availability Testing Schedule
A key deliverable from the Availability Management process is the 'availability testing schedule'. This is a schedule for the regular testing of all availability mechanisms. Some availability mechanisms, such as 'load balancing', 'mirroring' and 'grid computing', are used in the provision of normal service on a day-by-day basis; others are used on a fail-over or manual reconfiguration basis. It is essential, therefore, that all availability mechanisms are tested in a regular and scheduled manner to ensure that when they are actually needed for real they work. This schedule needs to be maintained and widely circulated so that all areas are aware of its content and so that all other proposed activities can be synchronized with its content, such as:

The change schedule
Release plans and the release schedule
All transition plans, projects and programmes
Planned and preventative maintenance schedules
The schedule for testing IT service continuity and recovery plans
Business plans and schedules.

Planned and Preventative Maintenance
All IT components should be subject to a planned maintenance strategy. The frequency and levels of maintenance required varies from component to component, taking into account the technologies involved, criticality and the potential business benefits that may be introduced. Planned maintenance activities enable the IT support organization to provide:

Preventative maintenance to avoid failures
Planned software or hardware upgrades to provide new functionality or additional capacity
Business requested changes to the business applications
Implementation of new technology and functionality for exploitation by the business.

The requirement for planned downtime clearly influences the level of availability that can be delivered for an IT service, particularly those that have stringent availability requirements. In determining the availability requirements for a new or enhanced IT service, the amount of downtime required for planned maintenance, and the resultant loss of income, may not be acceptable to the business. This is becoming a growing issue in the area of 24 x 7 service operation. In these instances, it is essential that continuous operation is a core design feature to enable maintenance activity to be performed without impacting the availability of IT services.

Where the required service hours for IT services are less than 24 hours per day and/or seven days per week, it is likely that the majority of planned maintenance can be accommodated without impacting IT service availability. However, where the business needs IT services available on a 24-hour and seven-day basis, Availability Management needs to determine the most effective approach in balancing the requirements for planned maintenance against the loss of service to the business. Unless mechanisms exist to allow continuous operation, scheduled downtime for planned maintenance is essential if high levels of availability are to be achieved and sustained. For all IT services, there should logically be a 'low-impact' period for the implementation of maintenance. Once the requirements for managing scheduled maintenance have been defined and agreed, these should be documented as a minimum in:

SLAs
OLAs
Underpinning contracts
Change Management schedules
Release and Deployment Management schedules.

Hints and Tips

Availability Management should ensure that building in preventative maintenance is one of the prime design considerations for a '24 x 7' IT service.

The most appropriate time to schedule planned downtime is clearly when the impact on the business and its customers is least. This information should be provided initially by the business when determining the availability requirements. For an existing IT service, or once the new service has been established, monitoring of business and customer transactions helps establish the hours when IT service usage is at its lowest. This should determine the most appropriate time for the component(s) to be removed for planned maintenance activity.
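Finding the quietest period from monitored transaction volumes can be sketched as a sliding window over hourly counts. The hourly figures below are hypothetical, and the window is allowed to wrap around midnight. This is a minimal illustration rather than a prescribed method.

```python
def quietest_window(hourly_counts, window_hours):
    """Return (start_hour, total_transactions) for the least busy window.

    hourly_counts holds one transaction count per hour of the day; the
    window may wrap past the end of the list (e.g. 23:00 to 01:00).
    """
    n = len(hourly_counts)
    if not 0 < window_hours <= n:
        raise ValueError("window_hours must be between 1 and len(hourly_counts)")
    best_start, best_total = 0, None
    for start in range(n):
        total = sum(hourly_counts[(start + i) % n] for i in range(window_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Hypothetical average transactions per hour, index 0 = 00:00 to 01:00
hourly_transactions = [40, 25, 10, 5, 8, 30, 90, 200, 450, 600, 650, 700,
                       680, 640, 620, 580, 500, 400, 300, 220, 160, 120, 80, 60]

if __name__ == "__main__":
    start, total = quietest_window(hourly_transactions, 2)
    print(f"Quietest 2-hour window starts at {start:02d}:00 ({total} transactions)")
```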

Accommodating the individual component requirements for planned downtime while balancing the IT service availability requirements of the business provides an opportunity to consider scheduling planned maintenance to multiple components concurrently. The benefit of this approach is that the number of service disruptions required to meet the maintenance requirements is reduced. While this approach has benefits, there are potential risks that need to be assessed. For example:

The capability of the IT support organization to coordinate the concurrent implementation of a high number of changes
The ability to perform effective problem determination where the IT service is impacted after the completion of multiple changes
The impact of change dependency across multiple components where back-out of a failed change requires multiple changes to be removed.

The effective management of planned downtime is an important contribution in meeting the required levels of availability for an IT service. Where planned downtime is required on a cyclic basis to an IT component(s), the time that the component is unavailable to enable the planned maintenance activity to be undertaken should be defined and agreed with the internal or external supplier. This becomes a stated objective that can be formalized, measured and reported. All planned maintenance should be scheduled, managed and controlled to ensure that the individual objectives and time slots are not exceeded and to ensure that activities are coordinated with all other schedules of activity to minimize clashes and conflict (e.g. change and release schedules, testing schedules). In addition, these objectives provide an early warning during the maintenance activity that the time allocated to the planned outage is at risk of being breached. This can enable an early decision to be made on whether the activity is allowed to complete with the potential to further impact service or to abort the activity and instigate the back-out plan. Planned downtime and performance against the stated objectives for each component should be recorded and used in service reporting.

Production of the Projected Service Outage (PSO) Document
Availability Management should produce and maintain the PSO document. This document consists of any variations from the service availability agreed within SLAs. This should be produced based on input from:

The change schedule
The release schedules
Planned and preventative maintenance schedules
Availability testing schedules
ITSCM and Business Continuity Management testing schedules.

The PSO contains details of all scheduled and planned service downtime within the agreed service hours for all services. These documents should be agreed with all the appropriate areas and representatives of both the business and IT. Once the PSO has been agreed, the Service Desk should ensure that it is communicated to all relevant parties so that everyone is made aware of any additional, planned service downtime.
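How the input schedules might be combined into a PSO can be sketched as below. The record layout, the service names and the simplification of agreed service hours to a single window per service are all assumptions made for this example. The sketch keeps only the part of each planned outage that falls inside the agreed service hours.

```python
from datetime import datetime


def overlap_hours(outage_start, outage_end, service_start, service_end):
    """Hours of an outage that fall within the agreed service window."""
    start = max(outage_start, service_start)
    end = min(outage_end, service_end)
    return max((end - start).total_seconds(), 0.0) / 3600.0


def build_pso(planned_outages, service_windows):
    """List planned downtime that occurs inside agreed service hours.

    planned_outages: dicts with 'service', 'source', 'start' and 'end'.
    service_windows: service name -> (window_start, window_end).
    """
    pso = []
    for outage in planned_outages:
        window_start, window_end = service_windows[outage["service"]]
        hours = overlap_hours(outage["start"], outage["end"],
                              window_start, window_end)
        if hours > 0:
            pso.append({"service": outage["service"],
                        "source": outage["source"],
                        "downtime_hours": hours})
    return pso


# Hypothetical agreed service hours and schedule entries for one day
service_windows = {
    "Email": (datetime(2024, 3, 4, 8, 0), datetime(2024, 3, 4, 18, 0)),
}
planned_outages = [
    {"service": "Email", "source": "change schedule",
     "start": datetime(2024, 3, 4, 17, 0), "end": datetime(2024, 3, 4, 19, 0)},
    {"service": "Email", "source": "preventative maintenance",
     "start": datetime(2024, 3, 4, 22, 0), "end": datetime(2024, 3, 4, 23, 0)},
]

if __name__ == "__main__":
    for entry in build_pso(planned_outages, service_windows):
        print(entry)
```

In this invented data only the change that starts at 17:00 appears in the PSO, because one hour of it falls inside the agreed service hours; the late-evening maintenance does not.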

Continual Review and Improvement
Changing business needs and customer demand may require the levels of availability provided for an IT service to be reviewed. Such reviews should form part of the regular service reviews with the business undertaken by SLM. Other input should also be considered on a regular basis from ITSCM, particularly from the updated Business Impact Analysis and Risk Analysis exercises. The criticality of services will often change and it is important that the design and the technology supporting such services are regularly reviewed and improved by Availability Management to ensure that the change of importance in the service is reflected within a revised design and supporting technology. Where the required levels of availability are already being delivered, it may take considerable effort and incur significant cost to achieve a small incremental improvement within the level of availability.

A key activity for Availability Management is continually to look at opportunities to optimize the availability of the IT infrastructure in conjunction with Continual Service Improvement activities. The benefits of this regular review approach are that, sometimes, enhanced levels of availability may be achievable, but with much lower costs. The optimization approach is a sensible first step to delivering better value for money. A number of Availability Management techniques can be applied to identify optimization opportunities. It is recommended that the scope should not be restricted to the technology, but also include a review of both the business process and other end-to-end business-owned responsibilities. To help achieve these aims, Availability Management needs to be recognized as a leading influence over the IT service provider organization to ensure continued focus on availability and stability of the technology.

Availability Management can provide the IT support organization with a real business and user perspective on how deficiencies within the technology and the underpinning process and procedure impact on the business operation and ultimately their customers. The use of business-driven metrics can demonstrate this impact in real terms and, importantly, also help quantify the benefits of improvement opportunities. Availability Management can play an important role in helping the IT service provider organization recognize where it can add value by exploiting its technical skills and competencies in an availability context. The continual improvement technique can be used by Availability Management to harness this technical capability. This can be used with either small groups of technical staff or a wider group within a workshop or SFA environment.

The impetus to improve availability comes from one or more of the following:

The inability of existing or new IT services to meet SLA targets on a consistent basis
Period(s) of IT service instability resulting in unacceptable levels of availability
Availability measurement trends indicating a gradual deterioration in availability
Unacceptable IT service recovery and restoration times
Requests from the business to increase the level of availability provided
Increasing impact on the business and its customers of IT service failures as a result of growth and/or increased business priorities or functionality
A request from SLM to improve availability as part of an overall SIP
Availability Management monitoring and trend analysis.
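
The trend-based triggers in the list above (a gradual deterioration in availability measurements) can be illustrated with a small sketch. It fits a least-squares line through a series of periodic availability percentages and flags a negative slope. The function names, the monthly sampling and the tolerance value are assumptions made for illustration, not part of the process definition.

```python
def availability_trend(periodic_pct):
    """Return the least-squares slope of an availability series.

    periodic_pct: availability percentages, one per reporting period,
    oldest first. The result is in percentage points per period.
    """
    n = len(periodic_pct)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = (n - 1) / 2
    mean_y = sum(periodic_pct) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(periodic_pct))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_deteriorating(periodic_pct, tolerance=-0.01):
    """Flag a series whose slope is below an (assumed) tolerance."""
    return availability_trend(periodic_pct) < tolerance


# Hypothetical monthly figures for one service
monthly = [99.9, 99.8, 99.7, 99.6]
print(availability_trend(monthly), is_deteriorating(monthly))
```

A simple slope is only one possible indicator. In practice the choice of window and tolerance would need to be agreed with SLM and reflect the service's targets.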

Availability Management should take a proactive role in identifying and progressing cost-justified availability improvement opportunities within the Availability Plan. The ability to do this relies on having appropriate and meaningful availability measurement and reporting. To ensure availability improvements deliver benefits to the business and users, it is important that measurement and reporting reflect not just IT component availability but also availability from a business operation and user perspective.

Where the business has a requirement to improve availability, the process and techniques to reassess the technology and IT service provider organization capability to meet these enhanced requirements should be followed. An output of this activity is enhanced availability and recovery design criteria. Satisfying the business requirement for increased levels of availability may require additional financial investment to enhance the underpinning technology and/or extend the services provided by the IT service provider organization. It is important that any additional investment to improve the levels of availability delivered can be cost-justified. Determining the cost of unavailability as a result of IT failure(s) can help support any financial investment decision in improving availability.
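
As a rough illustration of how the cost of unavailability might support an investment decision, the sketch below uses a deliberately simplified cost model: lost user productivity plus lost revenue. The model, the function names and all figures are assumptions for illustration. A real calculation would use cost elements agreed with the business.

```python
def cost_of_unavailability(outage_hours, users_affected,
                           cost_per_user_hour, lost_revenue_per_hour=0.0):
    """Estimate the cost of an outage under a simplified, assumed model."""
    lost_productivity = outage_hours * users_affected * cost_per_user_hour
    lost_revenue = outage_hours * lost_revenue_per_hour
    return lost_productivity + lost_revenue


def is_cost_justified(annual_cost_of_unavailability,
                      expected_reduction, annual_investment):
    """Compare the expected annual saving with the annual investment.

    expected_reduction is the assumed fraction (0 to 1) of unavailability
    cost that the improvement is expected to remove.
    """
    saving = annual_cost_of_unavailability * expected_reduction
    return saving > annual_investment


# Hypothetical figures: 4 hours of outage, 100 users, 25 per user-hour,
# 1000 per hour in lost revenue.
annual_cost = cost_of_unavailability(4, 100, 25.0, 1000.0)
print(annual_cost, is_cost_justified(annual_cost, 0.5, 5000))
```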

4.4.6 Triggers, Inputs, Outputs and Interfaces

4.4.6.1 Triggers

New or changed business needs or new or changed services
New or changed targets within agreements, such as SLRs, SLAs, OLAs or contracts
Service or component breaches, availability events and alerts, including threshold events, exception reports
Periodic activities such as reviewing, revising or reporting
Review of Availability Management forecasts, reports and plans
Review and revision of business and IT plans and strategies
Review and revision of designs and strategies
Recognition or notification of a change of risk or impact of a business process or VBF, an IT service or component
Request from SLM for assistance with availability targets and explanation of achievements.

4.4.6.2 Interfaces

Incident and Problem Management: in providing assistance with the resolution and subsequent justification and correction of availability incidents and problems
Capacity Management: with the provision of resilience and spare capacity
IT Service Continuity Management: with the assessment of business impact and risk and the provision of resilience, fail-over and recovery mechanisms
Service Level Management: assistance with the determining of availability targets and the investigation and resolution of service and component breaches.

4.4.6.3 Inputs

A number of sources of information are relevant to the Availability Management process. Some of these are as follows:

Business information: from the organization's business strategy, plans and financial plans, and information on their current and future requirements, including the availability requirements for new or enhanced IT services
Business impact information: from BIAs and assessment of VBFs underpinned by IT services
Previous Risk Analysis and Assessment reports and a risk register
Service information: from the Service Portfolio and the Service Catalogue
Service information: from the SLM process, with details of the services from the Service Portfolio and the Service Catalogue, service level targets within SLAs and SLRs, and possibly from the monitoring of SLAs, service reviews and breaches of the SLAs
Financial information: from Financial Management, the cost of service provision, the cost of resources and components
Change and release information: from the Change Management process with a Change Schedule, the Release Schedule from Release Management and a need to assess all changes for their impact on service availability
Configuration Management: containing information on the relationships between the business, the services, the supporting services and the technology
Service targets: from SLAs, SLRs, OLAs and contracts
Component information: on the availability, reliability and maintainability requirements for the technology components that underpin IT service(s)
Technology information: from the CMS on the topology and the relationships between the components and the assessment of the capabilities of new technology
Past performance: from previous measurements, achievements and reports and the Availability Management Information System (AMIS)
Unavailability and failure information: from incidents and problems.

4.4.6.4 Outputs

The outputs produced by Availability Management should include:

The Availability Management Information System (AMIS)
The Availability Plan for the proactive improvement of IT services and technology
Availability and recovery design criteria and proposed service targets for new or changed services
Service availability, reliability and maintainability reports of achievements against targets, including input for all service reports
Component availability, reliability and maintainability reports of achievements against targets
Revised risk analysis reviews and reports and an updated risk register
Monitoring, management and reporting requirements for IT services and components to ensure that deviations in availability, reliability and maintainability are detected, actioned, recorded and reported
An Availability Management test schedule for testing all availability, resilience and recovery mechanisms
The planned and preventative maintenance schedules
The Projected Service Outage (PSO) in conjunction with Change and Release Management
Details of the proactive availability techniques and measures that will be deployed to provide additional resilience to prevent or minimize the impact of component failures on the IT service availability
Improvement actions for inclusion within the SIP.

4.4.7 Key Performance Indicators

Many KPIs can be used to measure the effectiveness and efficiency of Availability Management, including the following examples:

Manage availability and reliability of IT service:

Percentage reduction in the unavailability of services and components
Percentage increase in the reliability of services and components
Effective review and follow-up of all SLA, OLA and underpinning contract breaches
Percentage improvement in overall end-to-end availability of service
Percentage reduction in the number and impact of service breaks
Improvement in the MTBF (Mean Time Between Failures)
Improvement in the MTBSI (Mean Time Between Service Incidents)
Reduction in the MTRS (Mean Time to Restore Service).

Satisfy business needs for access to IT services:

Percentage reduction in the unavailability of services
Percentage reduction of the cost of business overtime due to unavailable IT
Percentage reduction in critical time failures, e.g. specific business peak and priority availability needs are planned for
Percentage improvement in business and users satisfied with service (by CSS results).

Availability of IT infrastructure achieved at optimum costs:

Percentage reduction in the cost of unavailability
Percentage improvement in the Service Delivery costs
Timely completion of regular Risk Analysis and system review
Timely completion of regular cost-benefit analysis established for infrastructure Component Failure Impact Analysis (CFIA)
Percentage reduction in failures of third-party performance on MTRS/MTBF against contract targets
Reduced time taken to complete (or update) a Risk Analysis
Reduced time taken to review system resilience
Reduced time taken to complete an Availability Plan
Timely production of management reports
Percentage reduction in the incidence of operational reviews uncovering security and reliability exposures in application designs.
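
The MTBF, MTBSI and MTRS indicators listed in the first KPI group can be derived from the available time in a period and the duration of each service break. The sketch below assumes the commonly used definitions (MTRS as total downtime per break, MTBF as uptime per break, MTBSI as available time per break, so that MTBSI equals MTBF plus MTRS). The function name and sample figures are hypothetical.

```python
def reliability_metrics(available_hours, downtimes):
    """Compute MTRS, MTBF and MTBSI in hours for one reporting period.

    available_hours: agreed service hours in the period
    downtimes: duration in hours of each service break in the period
    """
    breaks = len(downtimes)
    if breaks == 0:
        raise ValueError("no service breaks recorded in the period")
    total_downtime = sum(downtimes)
    return {
        "MTRS": total_downtime / breaks,
        "MTBF": (available_hours - total_downtime) / breaks,
        "MTBSI": available_hours / breaks,
    }


# Hypothetical month: 720 service hours, three breaks of 2, 4 and 6 hours
metrics = reliability_metrics(720, [2, 4, 6])
print(metrics)
```

Comparing these values between reporting periods gives the "improvement" and "reduction" figures that the KPIs call for.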

4.4.8 Information Management

The Availability Management process should maintain an AMIS that contains all of the measurements and information required to complete the Availability Management process and provide the appropriate information to the business on the level of IT service provided. This information, covering services, components and supporting services, provides the basis for regular, ad hoc and exception availability reporting and the identification of trends within the data for the instigation of improvement activities. These activities and the information contained within the AMIS provide the basis for developing the content of the Availability Plan.

In order to provide structure and focus to a wide range of initiatives that may need to be undertaken to improve availability, an Availability Plan should be formulated and maintained. The Availability Plan should have aims, objectives and deliverables and should consider the wider issues of people, processes, tools and techniques as well as having a technology focus. In the initial stages it may be aligned with an implementation plan for Availability Management, but the two are different and should not be confused. As the Availability Management process matures, the plan should evolve to cover the following:

Actual levels of availability versus agreed levels of availability for key IT services. Availability measurements should always be business- and customer-focused and report availability as experienced by the business and users.
Activities being progressed to address shortfalls in availability for existing IT services. Where investment decisions are required, options with associated costs and benefits should be included.
Details of changing availability requirements for existing IT services. The plan should document the options available to meet these changed requirements. Where investment decisions are required, the associated costs of each option should be included.
Details of the availability requirements for forthcoming new IT services. The plan should document the options available to meet these new requirements. Where investment decisions are required, the associated costs of each option should be included.
A forward-looking schedule for the planned SFA assignments.
Regular reviews of SFA assignments should be completed to ensure that the availability of technology is being proactively improved in conjunction with the SIP.
A technology futures section to provide an indication of the potential benefits and exploitation opportunities that exist for planned technology upgrades. Anticipated availability benefits should be detailed, where possible based on business-focused measures, in conjunction with Capacity Management. The effort required to realize these benefits where possible should also be quantified.
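
The first plan item above, actual versus agreed levels of availability, can be illustrated with a short sketch. It assumes the usual percentage calculation (agreed service time minus downtime, divided by agreed service time) and a hypothetical input layout of service name mapped to agreed hours, downtime hours and target percentage.

```python
def availability_pct(agreed_service_hours, downtime_hours):
    """Availability as a percentage of agreed service time."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed service hours must be positive")
    return 100.0 * (agreed_service_hours - downtime_hours) / agreed_service_hours


def shortfalls(services):
    """Return services that missed their target and by how much.

    services: dict of name -> (agreed_hours, downtime_hours, target_pct)
    The result maps each failing service to its shortfall in
    percentage points, rounded to three decimals.
    """
    result = {}
    for name, (agreed, downtime, target) in services.items():
        achieved = availability_pct(agreed, downtime)
        if achieved < target:
            result[name] = round(target - achieved, 3)
    return result


# Hypothetical services and targets
report = shortfalls({
    "payments": (200, 1, 99.9),
    "intranet": (200, 1, 99.0),
})
print(report)
```

Downtime here should be measured as experienced by the business and users, in line with the plan item, rather than per component.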

During the production of the Availability Plan, it is recommended that liaison with all functional, technical and process areas is undertaken. The Availability Plan should cover a period of one to two years, with a more detailed view and information for the first six months. The plan should be reviewed regularly, with minor revisions every quarter and major revisions every half year. Where the technology is only subject to a low level of change, this may be extended as appropriate.

It is recommended that the Availability Plan is considered complementary to the Capacity Plan and Financial Plan, and that publication is aligned with the capacity and business budgeting cycle. If a demand is foreseen for high levels of availability that cannot be met due to the constraints of the existing IT infrastructure or budget, then exception reports may be required for the attention of both senior IT and business management.

In order to facilitate the production of the Availability Plan, Availability Management may wish to consider having its own database repository. The AMIS can be utilized to record and store selected data and information required to support key activities such as report generation, statistical analysis and availability forecasting and planning. The AMIS should be the main repository for the recording of IT availability metrics, measurements, targets and documents, including the Availability Plan, availability measurements, achievement reports, SFA assignment reports, design criteria, action plans and testing schedules.

Hints and Tips

Be pragmatic: define the initial tool requirements and identify what is already deployed that can be used and shared to get started as quickly as possible. Where basic tools are not already available, work with the other IT service and systems management processes to identify common requirements with the aim of selecting shared tools and minimizing costs. The AMIS should address the specific reporting needs of Availability Management not currently provided by existing repositories and integrate with them and their contents.

4.4.9 Challenges, Critical Success Factors and Risks

4.4.9.1 Challenges

Meeting the expectations of the customers, the business and senior management. The expectation is often that services will be available not just during their agreed service hours but on a 24-hour, 365-day basis, and that when they fail they will be recovered within minutes. This is only the case when the appropriate level of investment and design has been applied to the service, and that investment should only be made where the business impact justifies it. This message needs to be publicized to all customers and areas of the business, so that when services do fail they have the right level of expectation on their recovery. It also means that Availability Management must have access to the right level of quality information on the current business need for IT services and its plans for the future, which is itself a challenge faced by many Availability Management processes.
Integrating all of the availability data into a single set of information (the AMIS) that can be analyzed in a consistent manner to provide details on the availability of all services and components. This is particularly challenging because the information from the different technologies is often provided by different tools in differing formats.
Convincing the business and senior management of the investment needed in proactive availability measures. The need for investment is always recognized once failures have occurred, but by then it is really too late. Persuading businesses and customers to invest in resilience to avoid the possibility of failures that may happen is a difficult challenge. Availability Management should work closely with Service Continuity Management, Security Management and Capacity Management in producing the justifications necessary to secure the appropriate investment.
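
The data integration challenge above can be sketched as a normalization step: each source tool gets a small adapter that converts its records into one common shape before they are stored in the AMIS. Both tools, their field names and the common record layout below are hypothetical, chosen only to show differing timestamp formats and units.

```python
from datetime import datetime, timezone


def from_tool_a(rec):
    """Adapter for hypothetical tool A: epoch seconds, downtime in minutes."""
    return {
        "component": rec["host"],
        "start": datetime.fromtimestamp(rec["ts"], tz=timezone.utc),
        "downtime_minutes": float(rec["down_min"]),
    }


def from_tool_b(rec):
    """Adapter for hypothetical tool B: ISO 8601 text, downtime in seconds."""
    return {
        "component": rec["ci_name"],
        "start": datetime.fromisoformat(rec["started_at"]),
        "downtime_minutes": rec["outage_seconds"] / 60,
    }


# The same outage as reported by the two hypothetical tools
a = from_tool_a({"host": "db01", "ts": 0, "down_min": "30"})
b = from_tool_b({"ci_name": "db01",
                 "started_at": "1970-01-01T00:00:00+00:00",
                 "outage_seconds": 1800})
print(a == b)
```

Once records share one shape, the same analysis and reporting code can be applied to every service and component regardless of the originating tool.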

4.4.9.2 Critical Success Factors

Manage availability and reliability of IT service
Satisfy business needs for access to IT services
Availability of IT infrastructure, as documented in SLAs, provided at optimum costs.

4.4.9.3 Risks

A lack of commitment from the business to the Availability Management process
A lack of commitment from the business and a lack of appropriate information on future plans and strategies
A lack of senior management commitment or a lack of resources and/or budget to the Availability Management process
The reporting processes become very labour-intensive
The processes focus too much on the technology and not enough on the services and the needs of the business
The Availability Management information (AMIS) is maintained in isolation and is not shared or consistent with other process areas, especially ITSCM, Security Management and Capacity Management. Sharing this information is particularly important when considering the necessary service and component backup and recovery tools, technology and processes to meet the agreed needs.

Supporting Material

Video - HCI - Availability Management, or Powerpoint
CSU - Availability Mgmt
CSU - Availability Mgmt Basic Concepts
CSU - Availability Mgmt Key Terminology
CSU - Availability Mgmt Objectives
Availability Mgmt ICOM Chart