Availability Management

As with IT Service Continuity Management, the importance of Availability Management has never been more apparent. Whereas ITSCM deals with major disasters, Availability Management deals with outages at the IT Service and component level. Availability is a key measure in most Service Level Agreements, and effective Availability Management is considered a primary factor influencing customer satisfaction and company reputation.

Note that, for customers, even performance degradation can count as “Unavailability” of a Service. The goal of the ITSM Availability Management process is to cost-effectively meet an agreed-upon level of required Availability and to continuously reduce the frequency and duration of Availability Incidents over time (without incurring extra costs).
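As a rough illustration of how an agreed level of Availability is commonly expressed, the sketch below computes availability as the percentage of agreed service time during which the Service was up. The function name, the 30-day window, and the outage durations are assumptions made for this example and are not taken from any particular Service Level Agreement.

```python
from datetime import timedelta


def availability_percent(agreed_service_time, outages):
    """Return availability as a percentage of the agreed service time.

    agreed_service_time: timedelta covering the measurement period.
    outages: iterable of timedelta values, one per Availability Incident.
    """
    if agreed_service_time <= timedelta(0):
        raise ValueError("agreed service time must be positive")
    downtime = sum(outages, timedelta(0))
    if downtime > agreed_service_time:
        raise ValueError("downtime exceeds the agreed service time")
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * ((agreed_service_time - downtime) / agreed_service_time)


# Hypothetical month: 30 days of agreed service time and two outages.
month = timedelta(days=30)
incidents = [timedelta(hours=3), timedelta(hours=1, minutes=12)]
print(f"{availability_percent(month, incidents):.3f}%")
```

Both the frequency of incidents (the number of entries) and their duration (the size of each entry) feed into the same figure, which is why reducing either one improves the measured Availability.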

It is important to recognize that when things go wrong, it is still possible to achieve business and user satisfaction. The way in which the organization handles Unavailability is just as important as the frequency and duration of outages. In addition to preventing Availability Incidents, companies should always look for ways to accelerate Service recovery and to maintain good communication with customers throughout the process.

The lifecycle of an Availability Incident can be divided into the following stages:

  • Start
  • Detect
  • Diagnose
  • Repair
  • Recover
  • Restore
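The stages listed above can be used to break an incident's total duration into its parts. The following is a minimal sketch, assuming each stage is recorded as the timestamp at which it was reached; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Stage names taken from the lifecycle list, in order.
STAGES = ["Start", "Detect", "Diagnose", "Repair", "Recover", "Restore"]


def stage_durations(timestamps):
    """Return the elapsed time between each pair of consecutive stages.

    timestamps: dict mapping every stage name to a datetime.
    """
    missing = [stage for stage in STAGES if stage not in timestamps]
    if missing:
        raise ValueError(f"missing stages: {missing}")
    durations = {}
    for previous, current in zip(STAGES, STAGES[1:]):
        elapsed = timestamps[current] - timestamps[previous]
        if elapsed < timedelta(0):
            raise ValueError(f"{current} is recorded before {previous}")
        durations[f"{previous} -> {current}"] = elapsed
    return durations


# Hypothetical incident, used only to show the calculation.
incident = {
    "Start": datetime(2024, 3, 1, 10, 0),
    "Detect": datetime(2024, 3, 1, 10, 12),
    "Diagnose": datetime(2024, 3, 1, 11, 2),
    "Repair": datetime(2024, 3, 1, 11, 20),
    "Recover": datetime(2024, 3, 1, 11, 35),
    "Restore": datetime(2024, 3, 1, 11, 40),
}
for step, elapsed in stage_durations(incident).items():
    print(f"{step}: {elapsed}")
```

Tracking the stages separately shows where recovery time is actually spent, which helps decide which stage to accelerate first.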

Log Intelligence does not aim to replace mature Availability and Systems Management solutions, but rather to enhance them in critical areas. Behavioral anomaly detection based on log information can complement the incident detection mechanisms available in other Availability and Systems Management tools. Failure alerts can also be generated for legacy and homegrown systems that could otherwise not be monitored by these commercial tools.
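One simple way to picture behavioral anomaly detection on log data is to compare the number of log events per interval against a baseline of normal behavior. The sketch below uses a basic standard-deviation threshold; this is an illustrative technique chosen for the example, not a description of how any specific product works, and the function name, parameters, and sample counts are assumptions.

```python
from statistics import mean, pstdev


def anomalous_intervals(counts, baseline_size, threshold=3.0):
    """Return indexes of intervals whose event count deviates from baseline.

    counts: number of log events (for example, errors) per interval.
    baseline_size: how many leading intervals describe normal behavior.
    threshold: allowed deviation, in baseline standard deviations.
    """
    if not 0 < baseline_size <= len(counts):
        raise ValueError("baseline_size must be between 1 and len(counts)")
    baseline = counts[:baseline_size]
    center = mean(baseline)
    spread = pstdev(baseline)
    flagged = []
    for index in range(baseline_size, len(counts)):
        deviation = abs(counts[index] - center)
        if spread == 0:
            # A perfectly flat baseline: any change at all is unusual.
            is_anomaly = deviation > 0
        else:
            is_anomaly = deviation / spread > threshold
        if is_anomaly:
            flagged.append(index)
    return flagged


# Hypothetical per-minute error counts; the first eight form the baseline.
errors_per_minute = [4, 5, 6, 5, 4, 6, 5, 5, 5, 40, 6]
print(anomalous_intervals(errors_per_minute, baseline_size=8))
```

Because the input is nothing more than counts derived from log lines, the same approach applies to legacy or homegrown systems that write logs but expose no monitoring interface.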

Log data can also play an important role in accelerating the diagnosis and repair of Availability Incidents. Diagnosis is the process of determining the root cause of a Service interruption. This stage tends to take up a considerable portion of the time to recovery and is therefore a prime candidate for acceleration. Diagnosis is also critical for analyzing “what happened” after the fact and for ensuring that steps are taken to prevent similar Availability Incidents in the future.

Log data can accelerate root-cause diagnosis and repair by putting investigative data at the analysts' fingertips. In fact, most analysts turn to log data as a first step in figuring out “what happened”. It is critical to collect and retain 100% of log data because, for diagnostic purposes, it is impossible to predict ahead of time what information will be necessary to diagnose and recover from an outage.
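A typical first diagnostic step is to pull every retained log line from a window around the time of the outage, regardless of which system wrote it. The sketch below assumes each log line begins with an ISO 8601 timestamp followed by a space; that line format, the function name, and the sample lines are assumptions made for this example.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, incident_time, window):
    """Return log lines stamped within `window` of `incident_time`.

    Lines that do not start with a parseable timestamp are skipped.
    """
    selected = []
    for line in log_lines:
        stamp, _, _ = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue
        if abs(when - incident_time) <= window:
            selected.append(line)
    return selected


# Hypothetical log lines from several sources.
sample_log = [
    "2024-03-01T09:30:00 app: nightly report finished",
    "2024-03-01T09:58:41 db: connection pool exhausted",
    "2024-03-01T10:00:03 web: upstream timeout",
    "malformed line without a timestamp",
    "2024-03-01T10:45:00 app: cache warmed",
]
outage = datetime(2024, 3, 1, 10, 0)
for entry in lines_near(sample_log, outage, timedelta(minutes=5)):
    print(entry)
```

The filter can only return what was collected in the first place, which is the practical argument for retaining all log data rather than a preselected subset.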