What is fault management?
Fault management is a discipline of IT operations management focused on detecting, isolating, and resolving problems. A fault occurs whenever a configuration item (CI) malfunctions or an event interferes with or prevents proper operation or service delivery. The goals of fault management are to resolve errors rapidly, minimize or avoid network and service downtime, and maintain optimal network performance and efficiency.
Why is fault management important?
Beyond merely responding to and resolving problems, fault management provides a number of valuable benefits to IT operations management, including:
- Establishing baseline conditions for proper network and CI operations;
- Monitoring overall network health and detecting threats;
- Alerting administrators to potential system failures;
- Identifying and isolating the source of malfunctions; and,
- Logging data continuously for analysis and correlation in support of automatic fault resolution.
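As a rough illustration of how a baseline can drive alerting, the sketch below summarizes readings taken while a CI is healthy and flags later readings that stray too far from them. The function names, the latency figures, and the three-standard-deviation threshold are all assumptions made for this example, not part of any particular fault management product.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize normal behaviour as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, k=3.0):
    """Return True if value lies more than k standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma


# Hypothetical latency readings (ms) collected while the CI was healthy.
healthy_latency_ms = [20, 22, 19, 21, 20, 23, 18, 21]
baseline = build_baseline(healthy_latency_ms)

# New readings are checked against the baseline; only outliers raise an alert.
for reading in (21, 24, 95):
    if is_anomalous(reading, baseline):
        print(f"ALERT: latency {reading} ms deviates from baseline")
```

A fixed multiple of the standard deviation is only one possible rule. Real deployments often tune thresholds per CI or account for time-of-day patterns.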
What is fault monitoring?
Fault monitoring is an ongoing cycle that inspects network traffic for problems and supports rapid time to repair through five steps:
- Detection: Know when something goes wrong.
- Isolation/Diagnosis: Identify the source and location of the problem.
- Correlation: Analyze all potential causes and effects of the problem.
- Restoration: Mitigate the problem and reestablish proper operations.
- Resolution: Confirm and document that the problem has been fixed.
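The five steps above can be sketched as an ordered lifecycle that a fault record moves through. This is a minimal model under stated assumptions: the `Stage` and `Fault` names, the CI name, and the notes are hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    DETECTION = 1
    ISOLATION = 2
    CORRELATION = 3
    RESTORATION = 4
    RESOLUTION = 5


@dataclass
class Fault:
    ci: str                      # affected configuration item (hypothetical name)
    description: str
    stage: Stage = Stage.DETECTION
    history: list = field(default_factory=list)

    @property
    def closed(self):
        """A fault is closed once every stage has a recorded note."""
        return len(self.history) == len(Stage)

    def advance(self, note):
        """Record a note for the current stage, then move to the next stage."""
        if self.closed:
            raise ValueError("fault is already resolved")
        self.history.append((self.stage.name, note))
        if self.stage is not Stage.RESOLUTION:
            self.stage = Stage(self.stage.value + 1)
        return self.stage


fault = Fault("core-switch-01", "uplink unreachable")
fault.advance("interface-down alarm received")
fault.advance("traced to a single port on core-switch-01")
fault.advance("downstream alarms share the same cause")
fault.advance("traffic rerouted and port reset")
fault.advance("fix confirmed and documented")
print(fault.closed, [name for name, _ in fault.history])
```

Keeping a note per stage gives the logged history that later analysis and correlation can draw on.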