The purpose of Availability Management is to provide a point of focus and management for all availability-related issues, relating to both services and resources, ensuring that availability targets in all areas are measured and achieved.
The objectives of Availability Management are to:
Availability Management should ensure the agreed level of availability is provided. The measurement and monitoring of IT availability is a key activity to ensure availability levels are being met consistently. Availability Management should look to continually optimize and proactively improve the availability of the IT infrastructure, the services and the supporting organization, in order to provide cost effective availability improvements that can deliver business and customer benefits.
Understanding all of this will enable Availability Management to ensure that all the services and components are designed and delivered to meet their targets in terms of agreed business needs. The Availability Management process:
The Availability Management process does not include Business Continuity Management and the resumption of business processing after a major disaster. The support of BCM is included within IT Service Continuity Management (ITSCM). However, Availability Management does provide key inputs to ITSCM, and the two processes have a close relationship, particularly in the assessment and management of risks and in the implementation of risk reduction and resilience measures.
The Availability Management process should include:
|Figure 4.13 The Availability Management process|
The Availability Management process and planning, just like Capacity Management, must be involved in all stages of the Service Lifecycle, from Strategy and Design, through Transition and Operation to Improvement. The appropriate availability and resilience should be designed into services and components from the initial design stages. This will ensure not only that the availability of any new or changed service meets its expected targets, but also that all existing services and components continue to meet all of their targets. This is the basis of stable service provision.
An effective Availability Management process, consisting of both the reactive and proactive activities, can 'make a big difference' and will be recognized as such by the business, if the deployment of Availability Management within an IT organization has a strong emphasis on the needs of the business and customers. To reinforce this emphasis, there are several guiding principles that should underpin the Availability Management process and its focus:
The scope of Availability Management covers the design, implementation, measurement and management of IT service and infrastructure availability. This is reflected in the process description shown in Figure 4.13 and described in the following paragraphs.
The Availability Management process has two key elements:
Availability Management is completed at two interconnected levels:
Availability Management relies on the monitoring, measurement, analysis and reporting of the following aspects:
Availability: the ability of a service, component or CI to perform its agreed function when required. It is often measured and reported as a percentage:
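A common way to express this percentage, assuming downtime is counted only within the Agreed Service Time (AST), is:

```latex
\text{Availability (\%)} = \frac{\text{AST} - \text{downtime}}{\text{AST}} \times 100
```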
|Note: Downtime should only be included in the above calculation when it occurs within the Agreed Service Time (AST). However, total downtime should also be recorded and reported.|
Reliability: a measure of how long a service, component or CI can perform its agreed function without interruption. The reliability of the service can be improved by increasing the reliability of individual components or by increasing the resilience of the service to individual component failure (i.e. increasing the component redundancy, e.g. by using load-balancing techniques). It is often measured and reported as Mean Time Between Service Incidents (MTBSI) or Mean Time Between Failures (MTBF):
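These two measures are conventionally calculated as follows, where "available time" is the length of the measurement period and "breaks" is the number of service interruptions within it (the wording of the terms is illustrative):

```latex
\text{MTBSI (hours)} = \frac{\text{available time in hours}}{\text{number of breaks}}
\qquad
\text{MTBF (hours)} = \frac{\text{available time in hours} - \text{total downtime in hours}}{\text{number of breaks}}
```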
Maintainability: a measure of how quickly and effectively a service, component or CI can be restored to normal working after a failure. It is measured and reported as Mean Time to Restore Service (MTRS) and should be calculated using the following formula:
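In its usual form, the calculation divides the total downtime by the number of service breaks:

```latex
\text{MTRS (hours)} = \frac{\text{total downtime in hours}}{\text{number of service breaks}}
```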
MTRS should be used to avoid the ambiguity of the more common industry term Mean Time To Repair (MTTR), which in some definitions includes only repair time, but in others includes recovery time. The downtime in MTRS covers all the contributory factors that make the service, component or CI unavailable:
Example: A situation where a 24 x 7 service has been running for a period of 5,020 hours with only two breaks, one of six hours and one of 14 hours, would give the following figures:
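The figures for this example can be reproduced with a short calculation. The sketch below assumes the usual definitions of the measures (availability as a percentage of the period, MTBSI as the period divided by the number of breaks, MTBF as uptime divided by the number of breaks, MTRS as downtime divided by the number of breaks). The function and variable names are illustrative.

```python
# Worked example: 5,020 hours of 24 x 7 service with two breaks (6 h and 14 h).
# All names below are illustrative, not taken from the source text.

def availability_pct(agreed_service_time, downtime):
    """Percentage of the agreed service time in which the service was available."""
    return (agreed_service_time - downtime) / agreed_service_time * 100

def mtbsi(available_time, breaks):
    """Mean Time Between Service Incidents."""
    return available_time / breaks

def mtbf(available_time, downtime, breaks):
    """Mean Time Between Failures (uptime per break)."""
    return (available_time - downtime) / breaks

def mtrs(downtime, breaks):
    """Mean Time to Restore Service."""
    return downtime / breaks

period_hours = 5020
outages_hours = [6, 14]
total_downtime = sum(outages_hours)
breaks = len(outages_hours)

print(f"Availability: {availability_pct(period_hours, total_downtime):.2f}%")
print(f"MTBSI: {mtbsi(period_hours, breaks):.0f} hours")
print(f"MTBF: {mtbf(period_hours, total_downtime, breaks):.0f} hours")
print(f"MTRS: {mtrs(total_downtime, breaks):.0f} hours")
```

Under these definitions the example yields an availability of 99.60%, an MTBSI of 2,510 hours, an MTBF of 2,500 hours and an MTRS of 10 hours.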
Serviceability: the ability of a third-party supplier to meet the terms of their contract. Often this contract will include agreed levels of availability, reliability and/or maintainability for a supporting service or component.
|Figure 4.14 Availability terms and measurements|
These aspects and their inter-relationships are illustrated in Figure 4.14.
Although the principal service target contained within SLAs for the customers and business is availability, as illustrated in Figure 4.14, some customers also require reliability and maintainability targets to be included as well. Where these are included they should relate to service reliability and maintainability targets, whereas the reliability and maintainability targets contained in OLAs and contracts relate to component and supporting service targets and can often include availability targets relating to the relevant components or supporting services.
The term Vital Business Function (VBF) is used to reflect the business critical elements of the business process supported by an IT service. An IT service may also support a number of business functions that are less critical. For example, for an automated teller machine (ATM) or cash dispenser service, the VBF would be the dispensing of cash, whereas the ability to obtain a statement from an ATM may not be considered vital. This distinction is important and should influence availability design and associated costs. Generally, the more vital the business function, the greater the level of resilience and availability that needs to be designed into the supporting IT services. For all services, whether VBFs or not, the availability requirements should be determined by the business and not by IT. The initial availability targets are often set at too high a level, and this leads to either over-priced services or an iterative discussion between the service provider and the business to agree an appropriate compromise between the service availability and the cost of the service and its supporting technology.
Certain VBFs may need special designs, which are now being used as a matter of course within Service Design plans, incorporating:
Many suppliers commit to high availability or continuous availability solutions only if stringent environmental standards and resilient processes are used. They often agree to such contracts only after a site survey has been completed and additional, sometimes costly, improvements have been made.
Availability Management commences as soon as the availability requirements for an IT service are clear enough to be articulated. It is an ongoing process, finishing only when the IT service is decommissioned or retired. The key activities of the Availability Management process are:
Key message: 'If you don't measure it, you can't manage it'
'If you don't measure it, you can't improve it'
'If you don't measure it, you probably don't care'
'If you can't influence or control it, then don't measure it'
'What to measure and how to report it' inevitably depend on which activity is being supported, who the recipients are and how the information is to be utilized. It is important to recognize the differing perspectives of availability to ensure measurement and reporting satisfies these varied needs:
The IT service provider organizations have, for many years, measured and reported on their perspective of availability. Traditionally these measures have concentrated on component availability and have been somewhat divorced from the business and user views. Typically these traditional measures are based on a combination of an availability percentage (%), time lost and the frequency of failure. Some examples of these traditional measures are as follows:
The business may have, for many years, accepted that the IT availability that they experience is represented in terms of component availability rather than overall service or business availability. However, this is no longer being viewed as acceptable and the business is keen to better represent availability in measure(s) that demonstrate the positive and negative consequences of IT availability on their business and users.
Key message: The most important availability measurements are those that reflect and measure availability from the business and user perspective.
Availability Management needs to consider availability from both a business/IT service provider perspective and from an IT component perspective. These are entirely different aspects, and while the underlying concept is similar, the measurement, focus and impact are entirely different.
The sole purpose of producing these availability measurements and reports, including those from the business perspective, is to improve the quality and availability of IT service provided to the business and users. All measures, reports and activities should reflect this purpose.
Availability, when measured and reported to reflect the experience of the user, provides a more representative view on overall IT service quality. The user view of availability is influenced by three factors:
Measurements and reporting of user availability should therefore embrace these factors. The methodology employed to reflect user availability could consider two approaches:
The method employed should be influenced by the nature of the business operation. A business operation supporting data entry activity is well suited to reporting that reflects user productivity loss. Business operations that are more customer-facing, e.g. ATM services, benefit from reporting transaction impact. It should also be noted that not all business impact is user-related. With increasing automation and electronic processing, the ability to process automated transactions or meet market cut-off times can also have a large financial impact that may be greater than the ability of users to work.
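The two styles of reporting mentioned here, user productivity loss and transaction impact, can each be reduced to a simple product. The sketch below is one possible formulation; the function names and the sample figures are hypothetical.

```python
# Two simple ways to express the user-facing impact of an outage.
# Names and sample numbers are hypothetical illustrations.

def user_minutes_lost(outage_minutes, users_affected):
    """Productivity loss: outage duration multiplied by the number of users affected."""
    return outage_minutes * users_affected

def transactions_impacted(outage_minutes, transactions_per_minute):
    """Transaction impact: transactions that could not be processed during the outage."""
    return outage_minutes * transactions_per_minute

# A 30-minute outage affecting 200 data entry users.
print(user_minutes_lost(30, 200))

# The same outage on a customer-facing service averaging 12 transactions per minute.
print(transactions_impacted(30, 12))
```

The first measure suits operations such as data entry, and the second suits customer-facing or automated processing, in line with the distinction drawn above.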
The IT support organization needs to have a keen awareness of the user experience of availability. However, the real benefits come from aggregating the user view into the overall business view. A guiding principle of the Availability Management process is that 'Improving availability can only begin when the way technology supports the business is understood'. Therefore Availability Management isn't just about understanding the availability of each IT component, but is all about understanding the impact of component failure on service and user availability. From the business perspective, an IT service can only be considered available when the business is able to perform all vital business functions required to drive the business operation. For the IT service to be available, it therefore relies on all components on which the service depends being available, i.e. systems, key components, network, data and applications.
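One simple way to see why component failure matters so much to service availability is a serial model. The sketch below assumes, beyond what the text states, that component failures are independent and that the service needs every component at the same time. The component names and availability figures are hypothetical.

```python
import math

def service_availability(component_availabilities):
    """End-to-end availability when every component must be up (serial model).

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(component_availabilities)

# Hypothetical component availabilities, expressed as fractions of agreed service time.
components = {
    "systems": 0.999,
    "network": 0.998,
    "data": 0.999,
    "applications": 0.997,
}

end_to_end = service_availability(components.values())
print(f"End-to-end availability: {end_to_end:.4%}")
print(f"Weakest component:       {min(components.values()):.4%}")
```

Under these assumptions the end-to-end figure is always at or below that of the weakest component, which is why reporting component availability alone tends to overstate what the business experiences.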
The traditional IT approach would be to measure individually the availability of each of these components. However, the true measure of availability has to be based on the positive and negative impacts on the VBFs on which the business operation is dependent. This approach ensures that SLAs and IT availability reporting are based on measures that are understood by both the business and IT. By measuring the VBFs that rely on IT services, measurement and reporting becomes business-driven, with the impact of failure reflecting the consequences to the business. It is also important that the availability of the services is defined and agreed with the business and reflected within SLAs. This definition of availability should include:
Reporting and analysis tools are required for the manipulation of data stored in the various databases utilized by Availability Management. These tools can either be platform- or PC-based and are often a combination of the two. This will be influenced by the database repository technologies selected and the complexity of data processing and reporting required. Availability Management, once implemented and deployed, will be required to produce regular reports on an agreed basis, e.g. monthly availability reports, Availability Plan, Service Failure Analysis (SFA) status reports, etc. These reporting activities can require considerable manual effort, so as much of the report generation as possible should be automated. For reporting purposes, organizational reporting standards should be used wherever possible. If these don't exist, IT standards should be developed so that IT reports can be produced using standard tools and techniques. This makes the subsequent integration and consolidation of reports much easier to achieve.
All events and incidents causing unavailability of services and components should be investigated, with remedial actions being implemented within either the Availability Plan or the overall SIP. Trends should be produced from this analysis to direct and focus activities such as Service Failure Analysis (SFA) to those areas causing the most impact or disruption to the business and the users.
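As an illustration of producing such trends, the sketch below totals downtime per service and ranks the services so that the largest contributors surface first. The incident records and service names are invented for the example. A real analysis would also weigh business impact, not only minutes lost.

```python
from collections import defaultdict

# Hypothetical incident records: (service name, minutes of unavailability).
incidents = [
    ("payments", 45),
    ("email", 10),
    ("payments", 30),
    ("intranet", 5),
    ("email", 20),
]

def downtime_by_service(records):
    """Total downtime per service, sorted with the largest total first."""
    totals = defaultdict(int)
    for service, minutes in records:
        totals[service] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

for service, minutes in downtime_by_service(incidents):
    print(f"{service:<10}{minutes:>4} min")
```

The services at the top of the ranking are natural candidates for an SFA assignment.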
The overall costs of an IT service are influenced by the levels of availability required and the investments required in technology and services provided by the IT support organization to meet this requirement. Availability certainly does not come for free. However, it is important to reflect that the unavailability of IT also has a cost, therefore unavailability isn't free either. For highly critical business processes and VBFs, it is necessary to consider not only the cost of providing the service, but also the costs that are incurred from failure. The optimum balance to strike is the cost of the availability solution weighed against the costs of unavailability.
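The balance described here can be sketched as a comparison of total cost across design options, where total cost is the cost of the availability solution plus the expected cost of the downtime that remains. The option names, costs and downtime estimates below are hypothetical and serve only to illustrate the trade-off.

```python
def total_cost(solution_cost, expected_downtime_hours, cost_per_downtime_hour):
    """Cost of the availability solution plus the expected cost of residual downtime."""
    return solution_cost + expected_downtime_hours * cost_per_downtime_hour

# Hypothetical design options: (annual solution cost, expected annual downtime in hours).
options = {
    "basic": (10_000, 40),
    "resilient": (40_000, 8),
    "fully redundant": (120_000, 1),
}

def cheapest_option(options, cost_per_downtime_hour):
    """Name of the option with the lowest combined cost."""
    return min(
        options,
        key=lambda name: total_cost(*options[name], cost_per_downtime_hour),
    )

cost_per_hour = 2_000  # hypothetical business cost of one hour of unavailability
for name, (solution, downtime) in options.items():
    print(f"{name:<16}{total_cost(solution, downtime, cost_per_hour):>10,}")
print("Lowest total cost:", cheapest_option(options, cost_per_hour))
```

With these sample figures the middle option has the lowest total cost. A higher cost per hour of unavailability, as for a VBF, would shift the balance towards the more resilient designs.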
Before any SLR is accepted, and ultimately the SLR or SLA is negotiated and agreed between the business and the IT organization, it is essential that the availability requirements of the business are analyzed to assess if/how the IT service can deliver the required levels of availability. This applies not only to new IT services that are being introduced, but also to any requested changes to the availability requirements of existing IT services.
The cost of an IT failure could simply be expressed as the number of business or IT transactions impacted, either as an actual figure (derived from instrumentation) or based on an estimation. When measured against the VBFs that support the business operation, this can provide an obvious indication of the consequence of failure. The advantage of this approach is the relative ease of obtaining the impact data and the lack of any complex calculations. It also becomes a 'value' that is understood by both the business and IT organization. This can be the stimulus for identifying improvement opportunities and can become a key metric in monitoring the availability of IT services.
The major disadvantage of this approach is that it offers no obvious monetary value that would be needed to justify any significant financial investment decisions for improving availability. Where significant financial investment decisions are required, it is better to express the cost of failure arising from service, system, application or function loss to the business as a monetary 'value'.
The monetary value can be calculated as a combination of the tangible costs associated with failure, but can also include a number of intangible costs. The monetary value should also reflect the cost impact to the whole organization, i.e. the business and IT organization.
Tangible costs can include:
|Figure 4.15 The expanded incident lifecycle|
These costs are often well understood by the finance area of the business and IT organization, and in relative terms are easier to obtain and aggregate than the intangible costs associated with an IT failure. Intangible costs can include:
It is important not simply to dismiss the intangible costs (and the potential consequences) on the grounds that they may be difficult to measure. The overall unavailability of service, the total tangible cost and the total intangible costs arising from service unavailability are all key metrics in the measurement of the effectiveness of the Availability Management process.
Expanded Incident Lifecycle
A guiding principle of Availability Management is to recognize that it is still possible to gain customer satisfaction even when things go wrong. One approach to help achieve this is for Availability Management to ensure that the duration and impact of any incident affecting IT services are minimized, so that normal business operations can resume as quickly as possible. The analysis of the 'expanded incident lifecycle' enables the total IT service downtime for any given incident to be broken down and mapped against the major stages through which all incidents progress (the lifecycle). Availability Management should work closely with Incident Management and Problem Management in the analysis of all incidents causing unavailability.
A good technique to help with the technical analysis of incidents affecting the availability of components and IT services is to take an incident 'lifecycle' view. Every incident passes through several major stages. The time elapsed in these stages may vary considerably. For Availability Management purposes, the standard incident 'lifecycle', as described within Incident Management, has been expanded to provide additional help and guidance, particularly in the area of 'designing for recovery'. Figure 4.15 illustrates the expanded incident lifecycle.
From the above it can be seen that an incident can be broken down into individual stages within a lifecycle that can be timed and measured. This lifecycle view provides an important framework in determining, amongst others, systems management requirements for event and incident detection, diagnostic data capture requirements and tools for diagnosis, recovery plans to aid speedy recovery and how to verify that IT service has been restored. The individual stages of the lifecycle are considered in more detail as follows.
The outcome from this activity should be a documented set of recovery requirements that enables the development of appropriate recovery plans. To anticipate and prepare for performing recovery such that reinstatement of service is effective and efficient requires the development and testing of appropriate recovery plans based on the documented recovery requirements. Wherever possible, the operational activities within the recovery plan should be automated. The testing of the recovery plans also delivers approximate timings for recovery. These recovery metrics can be used to support the communication of estimated recovery of service and validate or enhance the Component Failure Impact Analysis documentation. Availability Management must continuously seek and promote faster methods of recovery for all potential incidents. This can be achieved via a range of methods, including automated failure detection, automated recovery, more stringent escalation procedures and exploitation of new and faster recovery tools and techniques. Availability requirements should also contribute to determining what spare parts are kept within the Definitive Spares to facilitate quick and effective repairs, as described within the Service Transition publication.
Each stage, and the associated time taken, influences the total downtime perceived by the user. By taking this approach it is possible to see where time is being 'lost' for the duration of an incident. For example, the service was unavailable to the business for 60 minutes, yet it only took five minutes to apply a fix - where did the other 55 minutes go?
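The question of where the time went can be answered by timing each stage. The sketch below assumes five stages (detection, diagnosis, repair, recovery, restoration), a common way of describing the expanded incident lifecycle, and uses invented timestamps chosen to match the 60-minute outage with a five-minute fix.

```python
# Minutes after the incident occurred at which each stage completed.
# Stage names and timings are assumptions made for illustration.
stage_completed_at = [
    ("detection", 20),
    ("diagnosis", 45),
    ("repair", 50),
    ("recovery", 57),
    ("restoration", 60),
]

def stage_durations(completions, start=0):
    """Time spent in each stage, derived from consecutive completion times."""
    durations = {}
    previous = start
    for stage, completed in completions:
        durations[stage] = completed - previous
        previous = completed
    return durations

durations = stage_durations(stage_completed_at)
for stage, minutes in durations.items():
    print(f"{stage:<12}{minutes:>3} min")
print(f"{'total':<12}{sum(durations.values()):>3} min")
print("Longest stage:", max(durations, key=durations.get))
```

In this illustration the repair itself takes five minutes, while detection and diagnosis together account for 45 of the 60 minutes, which points to alerting and diagnostic tooling as the first areas to examine.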
Using this approach identifies possible areas of inefficiency that combine to make the loss of service experienced by the business greater than it need be. These could cover areas such as poor automation (alerts, automated recovery etc.), poor diagnostic tools and scripts, unclear escalation procedures (which delay the escalation to the appropriate technical support group or supplier), or lack of comprehensive operational documentation. Availability Management needs to work in close association with Incident and Problem Management to ensure repeat occurrences are eliminated. It is recommended that these measures are established and captured for all availability incidents. This provides Availability Management with metrics for both specific incidents and trending information. This information can be used as input to SFA assignments, SIP activities and regular Availability Management reporting and to provide an impetus for continual improvement activity to pursue cost-effective improvements. It can also enable targets to be set for specific stages of the incident lifecycle. While accepting that each incident may have a wide range of technical complexity, the targets can be used to reflect the consistency of how the IT service provider organization responds to incidents.
An output from the Availability Management process is the real-time monitoring requirements for IT services and components. To achieve the levels of availability required and/or ensure the rapid restoration of service following an IT failure requires investment and exploitation of a systems management toolset. Systems management tools are an essential building block for IT services that require a high level of availability and can provide an invaluable role in reducing the amount of downtime incurred. Availability Management requirements cover the detection and alerting of IT service and component exceptions, automated escalation and notification of IT failures and the automated recovery and restoration of components from known IT failure situations. This makes it possible to identify where 'time is being lost' and provides the basis for the identification of factors that can improve recovery and restoration times. These activities are performed on a regular basis within Service Operation.
Service Failure Analysis
|Figure 4.16 The structured approach to Service Failure Analysis (SFA)|
Service Failure Analysis (SFA) is a technique designed to provide a structured approach to identifying the underlying causes of service interruptions to the user. SFA utilizes a range of data sources to assess where and why shortfalls in availability are occurring. SFA enables a holistic view to be taken to drive not just technology improvements, but also improvements to the IT support organization, processes, procedures and tools. SFA is run as an assignment or project, and may utilize other Availability Management methods and techniques to formulate the recommendations for improvement. The detailed analysis of service interruptions can identify opportunities to enhance levels of end-to-end service availability that deliver benefits to the user. Many of the activities involved in SFA are closely aligned with those of Problem Management, and in a number of organizations these activities are performed jointly by Problem and Availability Management.
The high-level objectives of SFA:
SFA initiatives should use input from all areas and all processes including, most importantly, the business and users. Each SFA assignment should have a recognized sponsor(s) (ideally, joint sponsorship from the IT and business) and involve resources from many technical and process areas. The use of the SFA approach:
To maximize both the time of individuals allocated to the SFA assignment and the quality of the delivered report, a structured approach is required. This structure is illustrated in Figure 4.16. This approach is similar to many consultancy models utilized within the industry, and in many ways Availability Management can be considered as providing via SFA a form of internal consultancy.
The above high-level structure is described briefly as follows.
Hints and tips: Consider categorizing the recommendations under the following headings:
Identifying Vital Business Functions (VBFs)
The term Vital Business Function (VBF) is used to reflect the business critical elements of the business process supported by an IT service. The service may also support less critical business functions and processes, and it is important that the VBFs are recognized and documented to provide the appropriate business alignment and focus.
Designing for Availability
The level of availability required by the business influences the overall cost of the IT service provided. In general, the higher the level of availability required by the business, the higher the cost. These costs are not just the procurement of the base IT technology and services required to underpin the IT infrastructure. Additional costs are incurred in providing the appropriate Service Management processes, systems management tools and high-availability solutions required to meet the more stringent availability requirements. The greatest level of availability should be included in the design of those services supporting the most critical of the VBFs.
When considering how the availability requirements of the business are to be met, it is important to ensure that the level of availability to be provided for an IT service is at the level actually required, and is affordable and cost justifiable to the business. Figure 4.17 indicates the products and processes required to provide varying levels of availability and the cost implications.
|Figure 4.17 Relationship between levels of availability and overall costs|
Where new IT services are being developed, it is essential that Availability Management takes an early and participating design role in determining the availability requirements. This enables Availability Management to influence positively the IT infrastructure design to ensure that it can deliver the level of availability required. The importance of this participation early in the design of the IT infrastructure cannot be overstated. There needs to be a dialogue between IT and the business to determine the balance between the business perception of the cost of unavailability and the exponential cost of delivering higher levels of availability.
As illustrated in Figure 4.17, there is a significant increase in costs when the business requirement is higher than the optimum level of availability that the IT infrastructure can deliver. These increased costs are driven by major redesign of the technology and the changing of requirements for the IT support organization.
It is important that the level of availability designed into the service is appropriate to the business needs, the criticality of the business processes being supported and the available budget. The business should be consulted early in the Service Design lifecycle so that the business availability needs of a new or enhanced IT service can be costed and agreed. This is particularly important where stringent availability requirements may require additional investment in Service Management processes, IT service and System Management tools, high-availability design and special solutions with full redundancy.
It is likely that the business need for IT availability cannot be expressed in technical terms. Availability Management therefore provides an important role in being able to translate the business and user requirements into quantifiable availability targets and conditions. This is an important input into the IT Service Design and provides the basis for assessing the capability of the IT design and IT support organization in meeting the availability requirements of the business.
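One way to make an availability target tangible for the business is to convert the percentage into the downtime it permits over a period of agreed service time. The sketch below is a minimal illustration; the function name, the targets and the 720-hour month are assumptions.

```python
def allowed_downtime_hours(target_pct, agreed_service_hours):
    """Downtime permitted by an availability target over a period of agreed service time."""
    return agreed_service_hours * (100 - target_pct) / 100

# Hypothetical targets over a 30-day month of 24 x 7 service (720 hours).
for target in (99.0, 99.5, 99.9):
    hours = allowed_downtime_hours(target, 720)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.0f} minutes) of downtime")
```

Expressing a target as "no more than a certain number of hours lost per month" often gives the business a clearer basis for judging whether a target is too strict or too lenient than the percentage alone.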
The business requirements for IT availability should contain at least:
Once the IT technology design and IT support organization are determined, the service provider organization is then in a position to confirm if the availability requirements can be met. Where shortfalls are identified, dialogue with the business is required to present the cost options that exist to enhance the proposed design to meet the availability requirements. This enables the business to reassess if lower or higher levels of availability are required, and to understand the appropriate impact and costs associated with their decision.
Determining the availability requirements is likely to be an iterative process, particularly where there is a need to balance the business availability requirement against the associated costs. The necessary steps are:
Hints and tips: If costs are seen as prohibitive, either:
The SLM process is normally responsible for communicating with the business on how its availability requirements for IT services are to be met and negotiating the SLR/SLA for the IT Service Design process. Availability Management therefore provides important support and input to both the SLM and design processes during this period. While higher levels of availability can often be provided by investment in tools and technology, there is no justification for providing a higher level of availability than that needed and afforded by the business. The reality is that satisfying availability requirements is always a balance between cost and quality. This is where Availability Management can play a key role in optimizing availability of the IT Service Design to meet increasing availability demands while deferring an increase in costs.
Designing service for availability is a key activity driven by Availability Management. This ensures that the required level of availability for an IT service can be met. Availability Management needs to ensure that the design activity for availability looks at the task from two related, but distinct, perspectives:
Additionally, the ability to recover quickly may be a crucial factor. In simple terms, it may not be possible or cost justified to build a design that is highly resilient to failure(s). The ability to meet the availability requirements within the cost parameters may rely on the ability to recover consistently in a timely and effective manner. All aspects of availability should be considered in the Service Design process, across all stages of the Service Lifecycle.
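The effect of faster recovery can be illustrated with the widely used steady-state approximation, availability = MTBF / (MTBF + MTRS). This formula is an assumption introduced here for illustration, and the figures are hypothetical.

```python
def steady_state_availability(mtbf_hours, mtrs_hours):
    """Approximate availability from mean time between failures and mean time to restore."""
    return mtbf_hours / (mtbf_hours + mtrs_hours)

# Same failure rate, progressively faster recovery (hypothetical figures).
mtbf_hours = 2500
for mtrs_hours in (10, 5, 1):
    availability = steady_state_availability(mtbf_hours, mtrs_hours)
    print(f"MTRS {mtrs_hours:>2} h -> availability {availability:.3%}")
```

Without making the design any more resilient to failure, shortening the time to restore service raises availability, which is why recovery capability can be a cost-effective alternative to additional resilience.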
The contribution of Availability Management within the design activities is to provide:
If the availability requirements cannot be met, the next task is to re-evaluate the Service Design and identify cost justified design changes. Improvements in design to meet the availability requirements can be achieved by reviewing the capability of the technology to be deployed in the proposed IT design. For example:
Hints and tips: Consider documenting the availability design requirements and considerations for new IT services and making them available to the design and implementation functions. In the longer term seek to mandate these requirements and integrate within the appropriate governance mechanisms that cover the introduction of new IT services.
Part of the activity of designing for availability must ensure that all business, data and information security requirements are incorporated within the Service Design. The overall aim of IT security is 'balanced security in depth', with justifiable controls implemented to ensure that the Information Security Policy is enforced and that IT services continue to operate within secure parameters (i.e. confidentiality, integrity and availability). During the gathering of availability requirements for new IT services, it is important that requirements that cover IT security are defined. These requirements need to be applied within the design phase for the supporting technology. For many organizations, the approach taken to IT security is covered by an Information Security Policy owned and maintained by Information Security Management. In the execution of the security policy, Availability Management plays an important role in its operation for new IT services.
Where the business operation has a high dependency on IT service availability, and the cost of failure or loss of business reputation is considered not acceptable, the business may define stringent availability requirements. These factors may be sufficient for the business to justify the additional costs required to meet these more demanding levels of availability. Achieving agreed levels of availability begins with the design, procurement and/or development of good-quality products and components. However, these in isolation are unlikely to deliver the sustained levels of availability required. Achieving a consistent and sustained level of availability requires investment in and deployment of effective Service Management processes, systems management tools, high-availability design and ultimately special solutions with full mirroring or redundancy.
Designing for availability is a key activity, driven by Availability Management, which ensures that the stated availability requirements for an IT service can be met. However, Availability Management should also ensure that within this design activity there is focus on the design elements required to ensure that when IT services fail, the service can be reinstated to enable normal business operations to resume as quickly as is possible. 'Designing for recovery' may at first sound negative. Clearly good availability design is about avoiding failures and delivering, where possible, a fault-tolerant IT infrastructure. However, with this focus, is too much reliance placed on technology, and has as much emphasis been placed on the fault tolerance aspects of the IT infrastructure? The reality is that failures will occur. The way the IT organization manages failure situations can have a positive effect on the perception of the business, customers and users of the IT services.
Key Message: Every failure is an important 'moment of truth' - an opportunity to make or break your reputation with the business.
Providing focus on the 'designing for recovery' aspects of the overall availability design can ensure that every failure is an opportunity to maintain and even enhance business and user satisfaction. To provide an effective 'design for recovery', it is important to recognize that both the business and the IT organization have needs that must be satisfied to enable an effective recovery from IT failure.
The needs of the business are informational: the business requires information to help it manage the impact of failure and to set expectations within the business, the user community and its business customers. The needs of the IT organization are the skills, knowledge, processes, procedures and tools required to enable the technical recovery to be completed in an optimal time.
Hints and Tips: Consider documenting the recovery design requirements and considerations for new IT services and making them available to the areas responsible for design and implementation. In the longer term, seek to mandate these requirements and integrate them within the appropriate governance mechanisms that cover the introduction of new IT services.
A key aim is to prevent minor incidents from becoming major incidents by ensuring the right people are involved early enough to avoid mistakes being made and to ensure the appropriate business and technical recovery procedures are invoked at the earliest opportunity. The instigation of these activities is the responsibility of the Incident Management process and a role of the Service Desk. To ensure business needs are met during major IT service failures, and to ensure the most optimal recovery, the Incident Management process and Service Desk need to have defined and to execute effective procedures for assessing and managing all incidents.
Key Message: The above are not the responsibilities of Availability Management. However, the effectiveness of the Incident Management process and Service Desk can strongly influence the overall recovery period. The use of Availability Management methods and techniques to further optimize IT recovery may be the stimulus for subsequent continual improvement activities to the Incident Management process and the Service Desk.
In order to remain effective, the maintainability of IT services and components should be monitored, and their impact on the 'expanded incident lifecycle' understood, managed and improved.
Component Failure Impact Analysis
|Figure 4.18 Component Failure Impact Analysis|
Component Failure Impact Analysis (CFIA) can be used to predict and evaluate the impact on IT service arising from component failures within the technology. The output from a CFIA can be used to identify where additional resilience should be considered to prevent or minimize the impact of component failure to the business operation and users. This is particularly important during the Service Design stage, where it is necessary to predict and evaluate the impact on IT service availability arising from component failures within the proposed IT Service Design. However, the technique can also be applied to existing services and infrastructure.
CFIA is a relatively simple technique that can be used to provide this information. IBM devised CFIA in the early 1970s, with its origins based on hardware design and configuration. However, it is recommended that CFIA be used in a much wider context to reflect the full scope of the IT infrastructure, i.e. hardware, network, software, applications, data centres and support staff. Additionally the technique can also be applied to identify impact and dependencies on IT support organization skills and competencies amongst staff supporting the new IT service. This activity is often completed in conjunction with ITSCM and possibly Capacity Management.
The output from a CFIA provides vital information to ensure that the availability and recovery design criteria for the new IT service are shaped to prevent or minimize the impact of failure on the business operation and users. CFIA achieves this by providing and indicating:
The above can also provide the stimulus for input to ITSCM to consider the balance between recovery options and risk reduction measures: where the potential business impact is high, there is a need to concentrate on high-availability risk reduction measures such as increased resilience or standby systems.
Having determined the IT infrastructure configuration to be assessed, the first step is to create a grid with CIs on one axis and the IT services that have a dependency on the CI on the other, as illustrated in Figure 4.18. This information should be available from the CMS, or alternatively it can be built using documented configuration charts and SLAs.
The next step is to perform the CFIA and populate the grid as follows:
Having built the grid, CIs that have a large number of Xs are critical to many services and can result in high impact should the CI fail. Equally, IT services having high counts of Xs are complex and are vulnerable to failure. This basic approach to CFIA can provide valuable information in quickly identifying SPoFs, IT services at risk from CI failure and what alternatives are available should CIs fail. It should also be used to assess the existence and validity of recovery procedures for the selected CIs. The above example assumes common infrastructure supporting multiple IT services. The same approach can be used for a single IT service by mapping the component CIs against the VBFs and users supported by each component, thus understanding the impact of a component failure on the business and user. The approach can also be further refined and developed to include and develop 'component availability weighting' factors that can be used to assess and calculate the overall effect of the component failure on the total service availability.
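The basic grid analysis described above can be sketched in a few lines of code. The CI names, service names and markings below are hypothetical, and the marking convention ("X" for a failure that makes the service inoperative, "A" for an alternative CI that takes over, "M" for an alternative that needs manual intervention, blank for no dependency) is an assumption, since the rules for populating the grid are not reproduced in this text.

```python
# Hypothetical CFIA grid: rows are CIs, columns are IT services.
# Assumed markings: "X" = service inoperative if the CI fails,
# "A" = alternative CI available, "M" = alternative needs manual
# intervention, "" = the service does not depend on the CI.
GRID = {
    "router-1": {"email": "X", "payroll": "X", "web-shop": "X"},
    "db-server-1": {"email": "", "payroll": "X", "web-shop": "A"},
    "app-server-1": {"email": "", "payroll": "M", "web-shop": "A"},
    "mail-server-1": {"email": "X", "payroll": "", "web-shop": ""},
}


def x_count_per_ci(grid):
    """Number of services that become inoperative if each CI fails."""
    return {
        ci: sum(1 for mark in row.values() if mark == "X")
        for ci, row in grid.items()
    }


def x_count_per_service(grid):
    """Number of CIs whose failure would take each service down."""
    counts = {}
    for row in grid.values():
        for service, mark in row.items():
            counts.setdefault(service, 0)
            if mark == "X":
                counts[service] += 1
    return counts


def rank_cis(grid):
    """CIs ordered from most to least critical (ties broken by name)."""
    counts = x_count_per_ci(grid)
    return sorted(counts, key=lambda ci: (-counts[ci], ci))


print(x_count_per_ci(GRID))
print(x_count_per_service(GRID))
print(rank_cis(GRID))
```

In this invented grid, router-1 carries an X for every service and is therefore the first candidate for additional resilience.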
To undertake an advanced CFIA requires the CFIA matrix to be expanded to provide additional fields required for the more detailed analysis. This could include fields such as:
Single Point of Failure Analysis
A Single Point of Failure (SPoF) is any component within the IT infrastructure that has no backup or fail-over capability, and has the potential to cause disruption to the business, customers or users when it fails. It is important that no unrecognized SPoFs exist within the IT infrastructure design or the actual technology, and that they are avoided wherever possible.
The use of SPoF analysis or CFIA as techniques to identify SPoFs is recommended. SPoF and CFIA analysis exercises should be conducted on a regular basis, and wherever SPoFs are identified, CFIA can be used to identify the potential business, customer or user impact and help determine what alternatives can or should be considered to cater for this weakness in the design or the actual infrastructure. Countermeasures should then be implemented wherever they are cost-justifiable. The impact and disruption caused by the potential failure of the SPoF should be used to cost-justify its implementation.
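As a minimal sketch of SPoF identification, assume each service is described as a chain of redundancy groups, where a group is the set of interchangeable CIs that can perform one function. Under that assumed model, any group containing a single CI is a SPoF for the service. The service and CI names are hypothetical.

```python
# Hypothetical model: each service is a list of redundancy groups.
# A group holds the CIs that can stand in for one another.
SERVICES = {
    "web-shop": [{"lb-1", "lb-2"}, {"app-1", "app-2"}, {"db-1"}],
    "payroll": [{"app-3"}, {"db-1"}],
}


def single_points_of_failure(services):
    """Map each SPoF to the sorted list of services it can disrupt."""
    spofs = {}
    for service, groups in services.items():
        for group in groups:
            if len(group) == 1:  # no backup or fail-over capability
                (ci,) = group
                spofs.setdefault(ci, []).append(service)
    return {ci: sorted(affected) for ci, affected in spofs.items()}


for ci, affected in single_points_of_failure(SERVICES).items():
    print(f"{ci}: no fail-over, affects {', '.join(affected)}")
```

The list of affected services per SPoF gives a starting point for the impact assessment used to cost-justify a countermeasure.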
|Figure 4.19 Example Fault Tree Analysis|
Fault Tree Analysis (FTA) is a technique that can be used to determine the chain of events that causes a disruption to IT services. FTA, in conjunction with calculation methods, can offer detailed models of availability. This can be used to assess the availability improvement that can be achieved by individual technology component design options. Using FTA:
FTA makes a representation of a chain of events using Boolean notation. Figure 4.19 gives an example of a fault tree.
Essentially FTA distinguishes the following events:
These events can be combined using logic operators, i.e.:
This is the basic FTA technique. This technique can also be refined, but complex FTA and the mathematical evaluation of fault trees are beyond the scope of this publication.
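A fault tree expressed in Boolean notation can be evaluated mechanically. The sketch below is an assumption-laden illustration rather than the method shown in Figure 4.19: it supports only AND and OR gates over basic events, and the probability calculation assumes that basic events are independent and that each appears only once in the tree. The event names and probabilities are invented.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Basic:
    """A basic event, e.g. the failure of a single component."""
    name: str


@dataclass(frozen=True)
class Gate:
    """A logic operator combining input events."""
    kind: str      # "AND" or "OR"
    inputs: tuple  # child Basic or Gate nodes


def occurs(node, failed):
    """True if the event at this node occurs given the failed basic events."""
    if isinstance(node, Basic):
        return node.name in failed
    results = [occurs(child, failed) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unsupported gate: {node.kind}")


def probability(node, p):
    """Probability of the event, assuming independent basic events."""
    if isinstance(node, Basic):
        return p[node.name]
    child_p = [probability(child, p) for child in node.inputs]
    if node.kind == "AND":
        return prod(child_p)
    if node.kind == "OR":
        return 1 - prod(1 - q for q in child_p)
    raise ValueError(f"unsupported gate: {node.kind}")


# Hypothetical tree: the service is down if power is lost,
# or if both mirrored servers fail together.
TREE = Gate("OR", (
    Basic("power"),
    Gate("AND", (Basic("server-a"), Basic("server-b"))),
))

print(occurs(TREE, {"server-a"}))
print(occurs(TREE, {"server-a", "server-b"}))
print(probability(TREE, {"power": 0.01, "server-a": 0.1, "server-b": 0.1}))
```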
Modeling tools are also required to forecast availability and to assess the impact of changes to the IT infrastructure. Inputs to the modeling process include descriptive data of the component reliability, maintainability and serviceability. A spreadsheet package to perform calculations is usually sufficient. If more detailed and accurate data is required, a more complex modeling tool may need to be developed or acquired. The lack of readily available availability modeling tools in the marketplace may require such a tool to be developed and maintained 'in-house', but this is a very expensive and time-consuming activity that should only be considered where the investment can be justified. Unless there is a clearly perceived benefit from such a development and the ongoing maintenance costs, the use of existing tools and spreadsheets should be sufficient. However, some System Management tools do provide modeling capability and can provide useful information on trending and forecasting availability needs.
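The kind of calculation that a spreadsheet would perform can be sketched as follows. The example assumes the common series and parallel availability formulas with independent component failures; the component availabilities are invented figures, not measurements.

```python
from math import prod


def series(*availabilities):
    """All components are needed: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one component suffices: unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical service chain: network -> server -> storage.
single_server = series(0.99, 0.98, 0.995)

# Same chain with the server mirrored (two independent servers).
mirrored_server = series(0.99, parallel(0.98, 0.98), 0.995)

print(f"single server:   {single_server:.4%}")
print(f"mirrored server: {mirrored_server:.4%}")
```

Comparing the two results gives a first estimate of the availability improvement offered by a design option before any investment is made.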
Risk Analysis and Management
|Figure 4.20 Risk Analysis and Management|
To assess the vulnerability to failure within the configuration and capability of the IT service and support organization, it is recommended that existing or proposed IT infrastructure, service configurations, Service Design and supporting organization (internal and external suppliers) are subject to formal Risk Analysis and Management exercises. Risk Analysis and Management is a technique that can be used to identify and quantify risks and justifiable countermeasures that can be implemented to protect the availability of IT systems. The identification of risks and the provision of justified countermeasures to reduce or eliminate the threats posed by such risks can play an important role in achieving the required levels of availability for a new or enhanced IT service. Risk Analysis should be undertaken during the design phase for the IT technology and service to identify:
Most risk assessment and management methodologies involve the use of a formal approach to the assessment of risk and the subsequent mitigation of risk through the implementation of cost-justifiable countermeasures, as illustrated in Figure 4.20.
Risk Analysis involves the identification and assessment of the level (measure) of the risks calculated from the assessed values of assets and the assessed levels of threats to, and vulnerabilities of, those assets. Risk is also determined to a certain extent by its acceptance. Some organizations and businesses may be more willing to accept risk whereas others cannot.
Risk management involves the identification, selection and adoption of countermeasures justified by the identified risks to assets in terms of their potential impact on services if failure occurs, and the reduction of those risks to an acceptable level. Risk management is an activity that is associated with many other activities, especially ITSCM, Security Management and Service Transition. All of these risk assessment exercises should be coordinated rather than being separate activities.
This approach, when applied via a formal method, ensures coverage is complete, together with sufficient confidence that:
Formal Risk Analysis and Management methods are now an important element in the overall design and provision of IT services. The assessment of risk is often based on the probability and potential impact of an event occurring. Countermeasures are implemented wherever they are cost-justifiable, to reduce the impact of an event, or the probability of an event occurring, or both.
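A simple way to reason about cost-justification is to compare the reduction in expected annual loss with the annual cost of the countermeasure. The expected-loss model (probability multiplied by impact) and all figures below are assumptions for illustration; formal methods use richer assessments of assets, threats and vulnerabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    annual_probability: float  # chance of the event in a given year
    impact_cost: float         # business cost if the event occurs

    @property
    def exposure(self):
        """Expected annual loss under the simple model assumed here."""
        return self.annual_probability * self.impact_cost


def is_cost_justified(before, after, annual_countermeasure_cost):
    """True if the reduction in exposure exceeds the countermeasure cost."""
    return before.exposure - after.exposure > annual_countermeasure_cost


# Hypothetical risk: loss of power to the data centre, with and
# without a standby power supply reducing the probability.
unprotected = Risk("power loss", annual_probability=0.2, impact_cost=500_000)
protected = Risk("power loss", annual_probability=0.02, impact_cost=500_000)

print(is_cost_justified(unprotected, protected, annual_countermeasure_cost=30_000))
print(is_cost_justified(unprotected, protected, annual_countermeasure_cost=120_000))
```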
Management of Risk (MoR) provides an alternative generic framework for the management of risk across all parts of an organization - strategic, programme, project and operational. It incorporates all the activities required to identify and control the exposure to any type of risk, positive or negative, that may have an impact on the achievement of your organization's business objectives.
MoR provides a framework that is tried, tested and effective to help you eliminate - or manage - the risks involved in reaching your goals. MoR adopts a systematic application of principles, approach and processes to the task of identifying, assessing and then planning and implementing risk responses. Guidance stresses a collaborative approach and focuses on the following key elements:
Planned and Preventative Maintenance
All IT components should be subject to a planned maintenance strategy. The frequency and levels of maintenance required varies from component to component, taking into account the technologies involved, criticality and the potential business benefits that may be introduced. Planned maintenance activities enable the IT support organization to provide:
The requirement for planned downtime clearly influences the level of availability that can be delivered for an IT service, particularly those that have stringent availability requirements. In determining the availability requirements for a new or enhanced IT service, the amount of downtime and the resultant loss of income required for planned maintenance may not be acceptable to the business. This is becoming a growing issue in the area of 24 x 7 service operation. In these instances, it is essential that continuous operation is a core design feature to enable maintenance activity to be performed without impacting the availability of IT services.
Where the required service hours for IT services are less than 24 hours per day and/or seven days per week, it is likely that the majority of planned maintenance can be accommodated without impacting IT service availability. However, where the business needs IT services available on a 24-hour and seven-day basis, Availability Management needs to determine the most effective approach in balancing the requirements for planned maintenance against the loss of service to the business. Unless mechanisms exist to allow continuous operation, scheduled downtime for planned maintenance is essential if high levels of availability are to be achieved and sustained. For all IT services, there should logically be a 'low-impact' period for the implementation of maintenance. Once the requirements for managing scheduled maintenance have been defined and agreed, these should be documented as a minimum in:
Hints and Tips: Availability Management should ensure that building in preventative maintenance is one of the prime design considerations for a '24 x 7' IT service.
The most appropriate time to schedule planned downtime is clearly when the impact on the business and its customers is least. This information should be provided initially by the business when determining the availability requirements. For an existing IT service, or once the new service has been established, monitoring of business and customer transactions helps establish the hours when IT service usage is at its lowest. This should determine the most appropriate time for the component(s) to be removed for planned maintenance activity.
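Once transaction monitoring data is available, the lowest-usage period can be found with a simple scan. The sketch below assumes one transaction count per hour of the day and allows the window to wrap past midnight; the counts are invented.

```python
from typing import Sequence, Tuple


def quietest_window(hourly_transactions: Sequence[int],
                    window_hours: int) -> Tuple[int, int]:
    """Return (start_hour, total_transactions) for the quietest window.

    The window is contiguous and may wrap around the end of the day.
    The earliest start hour wins when totals are tied.
    """
    n = len(hourly_transactions)
    if not 0 < window_hours <= n:
        raise ValueError("window must be between 1 hour and the data length")
    best_start, best_total = 0, None
    for start in range(n):
        total = sum(hourly_transactions[(start + i) % n]
                    for i in range(window_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Hypothetical transactions per hour, index 0 = 00:00 to 01:00.
USAGE = [40, 25, 10, 5, 8, 30, 120, 400, 900, 1200, 1300, 1250,
         1100, 1150, 1200, 1000, 800, 600, 450, 300, 200, 150, 90, 60]

start, total = quietest_window(USAGE, window_hours=3)
print(f"quietest 3-hour window starts at {start:02d}:00 ({total} transactions)")
```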
Accommodating the individual component requirements for planned downtime while balancing the IT service availability requirements of the business provides an opportunity to consider scheduling planned maintenance to multiple components concurrently. The benefit of this approach is that the number of service disruptions required to meet the maintenance requirements is reduced. While this approach has benefits, there are potential risks that need to be assessed. For example:
The effective management of planned downtime is an important contribution in meeting the required levels of availability for an IT service. Where planned downtime is required on a cyclic basis for one or more IT components, the time that the component is unavailable to enable the planned maintenance activity to be undertaken should be defined and agreed with the internal or external supplier. This becomes a stated objective that can be formalized, measured and reported. All planned maintenance should be scheduled, managed and controlled to ensure that the individual objectives and time slots are not exceeded and to ensure that activities are coordinated with all other schedules of activity (e.g. change and release schedules, testing schedules) to minimize clashes and conflict. In addition, these controls provide an early warning during the maintenance activity that the time allocated to the planned outage is about to be breached. This can enable an early decision to be made on whether the activity is allowed to complete with the potential to further impact service or to abort the activity and instigate the back-out plan. Planned downtime and performance against the stated objectives for each component should be recorded and used in service reporting.
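The early-warning check described above can be sketched as a comparison of the projected finish time against the agreed window. The status labels, and the assumption that reliable estimates exist for both the remaining work and the back-out, are illustrative only.

```python
from datetime import datetime, timedelta


def maintenance_status(window_end: datetime,
                       now: datetime,
                       estimated_remaining: timedelta,
                       backout_duration: timedelta) -> str:
    """Classify a running maintenance activity against its agreed window."""
    if now + estimated_remaining <= window_end:
        return "on track"
    if now + backout_duration <= window_end:
        # The work will overrun, but a back-out would still fit the window.
        return "decide: overrun or back out"
    return "breach unavoidable"


# Hypothetical window ending at 04:00.
END = datetime(2024, 1, 7, 4, 0)

print(maintenance_status(END, datetime(2024, 1, 7, 3, 0),
                         timedelta(minutes=30), timedelta(minutes=45)))
print(maintenance_status(END, datetime(2024, 1, 7, 3, 0),
                         timedelta(minutes=90), timedelta(minutes=45)))
print(maintenance_status(END, datetime(2024, 1, 7, 3, 30),
                         timedelta(minutes=90), timedelta(minutes=45)))
```

Running the check at regular intervals during the activity gives the decision-maker the earliest possible point at which to choose between completing the work and invoking the back-out plan.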
Production of the Projected Service Outage (PSO) Document
Availability Management should produce and maintain the PSO document. This document consists of any variations from the service availability agreed within SLAs. This should be produced based on input from:
The PSO contains details of all scheduled and planned service downtime within the agreed service hours for all services. These documents should be agreed with all the appropriate areas and representatives of both the business and IT. Once the PSO has been agreed, the Service Desk should ensure that it is communicated to all relevant parties so that everyone is made aware of any additional, planned service downtime.
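When compiling PSO entries, it can help to calculate how much of a planned outage falls inside the agreed service hours. The sketch below assumes a single daily service period and an outage that starts and ends on the same calendar day; both are simplifying assumptions, and the times shown are invented.

```python
from datetime import datetime, time, timedelta


def downtime_in_service_hours(outage_start: datetime,
                              outage_end: datetime,
                              service_open: time,
                              service_close: time) -> timedelta:
    """Portion of a same-day outage that overlaps the agreed service hours."""
    if outage_end < outage_start or outage_start.date() != outage_end.date():
        raise ValueError("outage must start and end on the same day")
    day = outage_start.date()
    open_at = datetime.combine(day, service_open)
    close_at = datetime.combine(day, service_close)
    overlap = min(outage_end, close_at) - max(outage_start, open_at)
    return max(overlap, timedelta(0))


# Hypothetical agreed service hours: 08:00 to 18:00.
OPEN, CLOSE = time(8, 0), time(18, 0)

early = downtime_in_service_hours(datetime(2024, 3, 2, 6, 0),
                                  datetime(2024, 3, 2, 9, 30), OPEN, CLOSE)
evening = downtime_in_service_hours(datetime(2024, 3, 2, 19, 0),
                                    datetime(2024, 3, 2, 21, 0), OPEN, CLOSE)
print(early, evening)
```

Only outages with a non-zero overlap represent a variation from the agreed service availability and would need to appear in the PSO.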
Continual Review and Improvement
Changing business needs and customer demand may require the levels of availability provided for an IT service to be reviewed. Such reviews should form part of the regular service reviews with the business undertaken by SLM. Other input should also be considered on a regular basis from ITSCM, particularly from the updated Business Impact Analysis and Risk Analysis exercises. The criticality of services will often change, and it is important that the design and the technology supporting such services are regularly reviewed and improved by Availability Management to ensure that the change of importance in the service is reflected within a revised design and supporting technology. Where the required levels of availability are already being delivered, it may take considerable effort and incur significant cost to achieve a small incremental improvement within the level of availability.
A key activity for Availability Management is continually to look at opportunities to optimize the availability of the IT infrastructure in conjunction with Continual Service Improvement activities. The benefits of this regular review approach are that, sometimes, enhanced levels of availability may be achievable, but with much lower costs. The optimization approach is a sensible first step to delivering better value for money. A number of Availability Management techniques can be applied to identify optimization opportunities. It is recommended that the scope should not be restricted to the technology, but also include a review of both the business process and other end-to-end business-owned responsibilities. To help achieve these aims, Availability Management needs to be recognized as a leading influence over the IT service provider organization to ensure continued focus on availability and stability of the technology.
Availability Management can provide the IT support organization with a real business and user perspective on how deficiencies within the technology and the underpinning process and procedure impact on the business operation and ultimately their customers. The use of business-driven metrics can demonstrate this impact in real terms and, importantly, also help quantify the benefits of improvement opportunities. Availability Management can play an important role in helping the IT service provider organization recognize where it can add value by exploiting its technical skills and competencies in an availability context. The continual improvement technique can be used by Availability Management to harness this technical capability. This can be used with either small groups of technical staff or a wider group within a workshop or SFA environment.
The impetus to improve availability comes from one or more of the following:
Availability Management should take a proactive role in identifying and progressing cost-justified availability improvement opportunities within the Availability Plan. The ability to do this places reliance on having appropriate and meaningful availability measurement and reporting. To ensure availability improvements deliver benefits to the business and users, it is important that measurement and reporting reflects not just IT component availability but also availability from a business operation and user perspective.
Where the business has a requirement to improve availability, the process and techniques to reassess the technology and IT service provider organization capability to meet these enhanced requirements should be followed. An output of this activity is enhanced availability and recovery design criteria. To satisfy the business requirement for increased levels of availability may require additional financial investment to enhance the underpinning technology and/or extend the services provided by the IT service provider organization. It is important that any additional investment to improve the levels of availability delivered can be cost-justified. Determining the cost of unavailability as a result of IT failure(s) can help support any financial investment decision in improving availability.
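The cost of unavailability can be estimated by converting an availability percentage into downtime hours and applying a cost per hour. The flat hourly cost and the 24 x 365 service hours are assumptions made for this sketch, and the figures are invented.

```python
HOURS_PER_YEAR = 24 * 365  # assumes a 24 x 7 service and a non-leap year


def annual_downtime_hours(availability_pct: float,
                          service_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return service_hours * (1 - availability_pct / 100)


def annual_cost_of_unavailability(availability_pct: float,
                                  cost_per_hour: float,
                                  service_hours: float = HOURS_PER_YEAR) -> float:
    """Estimated yearly business cost of downtime at a flat hourly rate."""
    return annual_downtime_hours(availability_pct, service_hours) * cost_per_hour


# Hypothetical case: moving from 99.5% to 99.9% at 10,000 per hour of outage.
current = annual_cost_of_unavailability(99.5, cost_per_hour=10_000)
target = annual_cost_of_unavailability(99.9, cost_per_hour=10_000)

print(f"current cost of unavailability: {current:,.0f}")
print(f"target cost of unavailability:  {target:,.0f}")
print(f"maximum justifiable annual investment: {current - target:,.0f}")
```

Under this simple model, an improvement is cost-justified only if its annual cost is lower than the saving in the cost of unavailability.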
In order to provide structure and focus to a wide range of initiatives that may need to be undertaken to improve availability, an Availability Plan should be formulated and maintained. The Availability Plan should have aims, objectives and deliverables and should consider the wider issues of people, processes, tools and techniques as well as having a technology focus. In the initial stages it may be aligned with an implementation plan for Availability Management, but the two are different and should not be confused. As the Availability Management process matures, the plan should evolve to cover the following:
During the production of the Availability Plan, it is recommended that liaison with all functional, technical and process areas is undertaken. The Availability Plan should cover a period of one to two years, with a more detailed view and information for the first six months. The plan should be reviewed regularly, with minor revisions every quarter and major revisions every half year. Where the technology is only subject to a low level of change, this may be extended as appropriate.
It is recommended that the Availability Plan is considered complementary to the Capacity Plan and Financial Plan, and that publication is aligned with the capacity and business budgeting cycle. If a demand is foreseen for high levels of availability that cannot be met due to the constraints of the existing IT infrastructure or budget, then exception reports may be required for the attention of both senior IT and business management.
In order to facilitate the production of the Availability Plan, Availability Management may wish to consider having its own database repository. The AMIS can be utilized to record and store selected data and information required to support key activities such as report generation, statistical analysis and availability forecasting and planning. The AMIS should be the main repository for the recording of IT availability metrics, measurements, targets and documents, including the Availability Plan, availability measurements, achievement reports, SFA assignment reports, design criteria, action plans and testing schedules.
Hints and Tips: Be pragmatic: define the initial tool requirements and identify what is already deployed that can be used and shared to get started as quickly as possible. Where basic tools are not already available, work with the other IT service and systems management processes to identify common requirements with the aim of selecting shared tools and minimizing costs. The AMIS should address the specific reporting needs of Availability Management not currently provided by existing repositories and integrate with them and their contents.