Incident Management: Monitor
Many organizations have implemented technologies such as network or system management platforms to help monitor the IT infrastructure. These platforms periodically poll the network devices and servers to determine whether they are still operational and available. The polling interval is usually configurable. Depending on the number of IT components being monitored, the interval can range from seconds to minutes.
However, because the polling interval can be several minutes, a management platform may take minutes to notice that an IT component has failed. In addition, if a component fails temporarily and restores service before the next poll, the platform may never notice the failure at all.
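The polling gap can be illustrated with a little arithmetic. The following is a minimal sketch, assuming polls fire at fixed multiples of the interval and that a poll detects a failure only if it lands while the component is down; the function names and the numbers in the usage note are illustrative, not taken from any particular management platform.

```python
import math


def next_poll_at_or_after(t, poll_interval, first_poll=0):
    """Return the time of the first poll at or after time t (all in seconds)."""
    n = max(math.ceil((t - first_poll) / poll_interval), 0)
    return first_poll + n * poll_interval


def outage_is_missed(outage_start, outage_end, poll_interval, first_poll=0):
    """True if no poll falls inside the outage window [outage_start, outage_end)."""
    return next_poll_at_or_after(outage_start, poll_interval, first_poll) >= outage_end


def detection_delay(outage_start, outage_end, poll_interval, first_poll=0):
    """Seconds between the failure and the poll that notices it, or None if missed."""
    poll = next_poll_at_or_after(outage_start, poll_interval, first_poll)
    return None if poll >= outage_end else poll - outage_start
```

With a 60-second interval, an outage lasting from second 70 to second 100 falls entirely between the polls at 60 and 120, so `outage_is_missed(70, 100, 60)` is true. If the same failure lasted until second 400, the poll at 120 would notice it, 50 seconds after it began.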
Most network devices, servers and applications generate log messages when they encounter errors. A log message can indicate that a network device is about to reboot due to errors, that a server has been restarted, or that a routing failure has occurred. These log messages can act as warnings to the IT organization before disaster hits.
An effective Incident Management process requires real-time and automated incident detection mechanisms. Log Intelligence can be used as the foundation to continuously monitor the IT infrastructure and detect any faults or errors. Log Intelligence platforms can complement the existing management solutions by:
- Continuously monitoring all logs generated by the IT infrastructure, including those from homegrown and legacy systems that are not otherwise tied into the System Management architecture
- Identifying log messages that could act as a warning and alerting administrators as needed
- Notifying administrators when a device, server or application has failed
- Sending alerts to Incident Management or Problem Management technologies to create tickets for tracking
- Eliminating lost or incorrect incidents by providing detailed information related to each incident
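The monitoring and alerting steps above can be sketched as a simple rule-based log scanner. This is an illustration only, not the interface of any Log Intelligence product: the rule names, severities, regular expressions and ticket fields are all assumptions made for the example, and a real deployment would read from a live log stream rather than a list.

```python
import re
from dataclasses import dataclass


@dataclass
class Alert:
    severity: str
    rule: str
    line: str


# Hypothetical rules: (severity, rule name, pattern to look for in a log line).
RULES = [
    ("critical", "routing_failure", re.compile(r"routing (failure|failed)", re.I)),
    ("warning", "pending_reboot", re.compile(r"reboot(ing)? due to error", re.I)),
    ("warning", "server_restarted", re.compile(r"server (has been )?restarted", re.I)),
]


def scan(lines):
    """Yield one Alert for each log line that matches a rule (first match wins)."""
    for line in lines:
        for severity, rule, pattern in RULES:
            if pattern.search(line):
                yield Alert(severity, rule, line.rstrip("\n"))
                break


def to_ticket(alert):
    """Build a ticket payload; the field names are assumed, not a real ticketing API."""
    return {
        "summary": f"[{alert.severity}] {alert.rule}",
        "details": alert.line,  # keep the raw log line so the ticket is not ambiguous
    }
```

Attaching the original log line to the ticket is what supports the last point in the list: the incident record carries the evidence that triggered it.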
In addition to alerting when an incident occurs, IT organizations can use Log Intelligence to obtain the details needed to investigate and diagnose it, since administrators can collect and analyze all of the log information related to the incident. Real-time monitoring helps minimize the impact on the business, for example by reducing the number of people or systems affected.
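Collecting the log information related to an incident can be as simple as filtering by source and time. The sketch below assumes log entries have already been parsed into `(timestamp, host, message)` tuples and that a five-minute window on either side of the incident is relevant; both the tuple layout and the window size are assumptions for illustration.

```python
from datetime import datetime, timedelta


def related_entries(entries, host, incident_time, window=timedelta(minutes=5)):
    """Return entries from `host` logged within `window` of the incident time."""
    start = incident_time - window
    end = incident_time + window
    return [e for e in entries if e[1] == host and start <= e[0] <= end]
```

Widening the window or dropping the host filter trades precision for context, which can help when a failure on one component was caused by another.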