Availability Management: Monitor
Availability and System Management tools focus on the detection, alerting, escalation and notification of IT failures. Log Intelligence can improve the speed and accuracy of detection by supplying System Management products with additional alert types. A baseline of the normal message rate for a particular IT Component can be established automatically, and any deviation from that baseline is reported to the System Management console:
- An unusually low message rate is a reliable indicator of performance degradation that could end in a Service or Component failure
- An unusual ratio between accepted and denied connections, or between successful and unsuccessful logins, is another indicator of a Service anomaly that requires immediate attention
- Alerts based on search filters can be configured for legacy and homegrown applications that could not otherwise be integrated into the System Management tools
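As an illustration only, the baseline and ratio checks described above could be sketched as follows. The function names, the three-standard-deviation threshold and the 20% denied-connection limit are assumptions made for this example and are not taken from any particular product.

```python
from statistics import mean, stdev


def rate_deviation(history, current, threshold=3.0):
    """Compare the current message rate with a baseline built from history.

    history   -- past message counts per interval (hypothetical input)
    current   -- message count for the interval being checked
    threshold -- assumed number of standard deviations that counts as unusual

    Returns "low", "high", or None when the rate is within the baseline.
    """
    if len(history) < 2:
        return None  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any difference is a deviation.
        if current == baseline:
            return None
        return "low" if current < baseline else "high"
    z = (current - baseline) / spread
    if z <= -threshold:
        return "low"
    if z >= threshold:
        return "high"
    return None


def ratio_alert(accepted, denied, max_denied_ratio=0.2):
    """Return True when the share of denied connections exceeds the assumed limit."""
    total = accepted + denied
    if total == 0:
        return False  # no traffic, nothing to compare
    return denied / total > max_denied_ratio
```

A result of "low" or "high" from rate_deviation, or True from ratio_alert, would be the point at which an alert is forwarded to the console. The same ratio check applies to successful and unsuccessful logins.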
System Management tools can perform auto-recovery actions only if the cause of the failure is known. In most instances, however, considerable “manual” diagnostic analysis is required to identify the root-cause of an Incident. Instant access to system activity log records can significantly speed up the diagnosis:
- Review the chronological sequence of actions and events performed by and on a Component during the ten minutes before the failure, looking for unusual events such as a spike in utilization or a recent configuration change that may have caused the failure
- Search for past events with the same error code in the raw log data archives
- Search for similar events on other devices in the raw log data archives
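The three diagnostic searches above can be sketched in a few lines. This is a minimal example under stated assumptions: log records are represented as dictionaries with hypothetical "time", "component" and "code" fields, and the function names are invented for illustration rather than drawn from a real log archive API.

```python
from datetime import datetime, timedelta


def events_before(records, component, failure_time, minutes=10):
    """Return one Component's events in the window before a failure, oldest first."""
    start = failure_time - timedelta(minutes=minutes)
    hits = [
        r for r in records
        if r["component"] == component and start <= r["time"] <= failure_time
    ]
    return sorted(hits, key=lambda r: r["time"])


def events_with_code(records, code, exclude_component=None):
    """Return all archived events carrying a given error code.

    Pass exclude_component to look only at *other* devices.
    """
    return [
        r for r in records
        if r.get("code") == code and r["component"] != exclude_component
    ]
```

Calling events_before with the failure timestamp gives the chronological view for the failed Component. Calling events_with_code with and without exclude_component covers the search for past events and the search on other devices.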
After every Incident, a meeting should be convened to determine the root-cause and to agree on steps to prevent similar Incidents in the future. The log data audit trail also plays an important role in this review.