In the world of IT operations, incident management is the critical process of identifying and fixing problems that affect the health, availability, and reliability of the services and systems you and your customers rely on. Incident management cannot be a haphazard affair: without a standard model and specific steps to guide your team from start to finish, the process will not reliably restore normal service. Effective incident management is about understanding why issues happen in the first place through investigation and diagnosis, discerning roles and responsibilities, building a knowledge base, and gaining vital insights that can lead to better operations, fewer incidents, and faster resolutions.
What is IT Incident Management?
According to the Information Systems Audit and Control Association (ISACA), incident management and response is “a key component of an enterprise business continuity and resilience program.” Under ISACA, incident management follows the Control Objectives for Information and Related Technologies (COBIT) framework for IT management and governance. Another popular approach to incident management is outlined by the Information Technology Infrastructure Library (ITIL) and is called ITIL incident management, which defines incident management as a way of managing the lifecycle of incidents (unplanned interruptions or reductions in quality of IT services) in order to restore affected services as quickly as possible.
ITIL Incident Management Process
An efficient incident management process must be based on a consistent, repeatable series of steps that allows the IT operations team to recognize, respond to, and address detrimental events and service requests in order to achieve incident resolution. Whichever framework your organization prefers, the ITIL incident management process involves a number of best-practice steps intended not only to resolve the incident but also to ensure the entire process results in the best possible outcome for your service operation. These steps for incident managers include:
- Step 1: Incident identification
- Step 2: Incident logging
- Step 3: Incident categorization
- Step 4: Incident prioritization
- Step 5: Incident assignment
- Step 6: Incident response
Step 1: Incident Identification
Incident identification is the first step in the life of an incident. It covers detecting the incident and reporting it.
Step 2: Incident Logging
Once the incident is identified, the service desk logs the incident and creates a ticket that captures information associated with the event, including time, location, performance data, and any relevant relational information. Incidents can be logged in a variety of ways, including phone calls, emails, SMS, web forms in self-service portals, and live chat messages.
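As a minimal sketch of what such a ticket might hold, the Python below models the fields named above (time, location, performance data, related items) and the intake channel. The `IncidentTicket` class, the `log_incident` helper, and all field and channel names are hypothetical illustrations, not the schema of any particular service desk product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical intake channels, mirroring the list in the paragraph above.
CHANNELS = {"phone", "email", "sms", "web_form", "chat"}

_next_id = count(1)  # simple in-memory ticket numbering, for illustration only


@dataclass
class IncidentTicket:
    summary: str
    channel: str
    location: str
    performance_data: dict = field(default_factory=dict)
    related_items: list = field(default_factory=list)
    ticket_id: int = field(default_factory=lambda: next(_next_id))
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def log_incident(summary, channel, location, performance_data=None, related_items=None):
    """Create a ticket, rejecting channels the service desk does not accept."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown intake channel: {channel!r}")
    return IncidentTicket(
        summary=summary,
        channel=channel,
        location=location,
        performance_data=dict(performance_data or {}),
        related_items=list(related_items or []),
    )


ticket = log_incident(
    "Checkout API latency above threshold",
    channel="email",
    location="dc-east",
    performance_data={"p95_latency_ms": 2300},
    related_items=["checkout-api", "payments-db"],
)
print(ticket.ticket_id, ticket.channel, ticket.logged_at.isoformat())
```

Stamping the time and ID at creation, rather than asking the reporter for them, keeps every ticket consistent regardless of the channel it arrived through.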
Step 3: Incident Categorization
Incidents are then categorized and subcategorized based on the area of IT or the business that the incident affects. Categories describe the event by type, such as the device, service, or location involved. A single incident may have multiple categories.
Step 4: Incident Prioritization
Depending on the category and impact on service availability, the event is evaluated and triaged to ensure it gets the proper level of attention. Based on priority, incidents can be classified as:
- Critical
- High
- Medium
- Low
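One common way to arrive at these four levels is an impact and urgency matrix. The sketch below illustrates that idea under stated assumptions: the `prioritize` function, the three-level input scales, and the scoring rule are hypothetical choices, and a real service desk would tune the mapping to its own SLAs.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


# Assumed three-level scales, ordered from most to least severe.
LEVELS = ("high", "medium", "low")


def prioritize(impact: str, urgency: str) -> Priority:
    """Combine impact and urgency into one of the four priority levels."""
    # Each scale contributes 0 (high) to 2 (low); the sum ranges from 0 to 4.
    score = LEVELS.index(impact) + LEVELS.index(urgency)
    if score == 0:
        return Priority.CRITICAL
    if score == 1:
        return Priority.HIGH
    if score == 2:
        return Priority.MEDIUM
    return Priority.LOW


print(prioritize("high", "high").name)    # CRITICAL
print(prioritize("high", "medium").name)  # HIGH
print(prioritize("low", "low").name)      # LOW
```

An unknown impact or urgency value raises `ValueError` from `LEVELS.index`, so a mistyped level cannot be silently triaged as low priority.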
Step 5: Incident Assignment
After category and priority have been established, the incident is routed and assigned to the appropriate personnel.
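Routing of this kind can be sketched as a simple lookup. In the Python below, the team names, the category keys, and the rule that critical incidents page the on-call engineer are all hypothetical assumptions made for illustration.

```python
# Hypothetical mapping from incident category to the owning team.
TEAM_BY_CATEGORY = {
    "network": "network-ops",
    "database": "dba-team",
    "application": "app-support",
}
DEFAULT_TEAM = "service-desk"  # fallback when no category matches


def assign(category: str, priority: str) -> dict:
    """Pick the owning team by category; page on-call only for critical incidents."""
    team = TEAM_BY_CATEGORY.get(category, DEFAULT_TEAM)
    return {"team": team, "page_on_call": priority == "critical"}


print(assign("database", "critical"))
print(assign("printer", "low"))
```

Keeping the default team explicit means an incident with an unrecognized category still lands in someone's queue.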
Step 6: Incident Response
Once the incident has been identified, logged, categorized, prioritized, and assigned, the service desk is able to resolve it. The incident resolution process can be broken down into three steps:
- Incident diagnosis: Diagnosis begins with the user describing their problem and answering troubleshooting questions.
- Incident escalation: Following diagnosis, incidents can be reassigned to a different team to most efficiently address the issue.
- Incident resolution: Fixing the incident is only part of the process. Incident resolution requires testing the associated system or service to ensure it can be returned to full operation.
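The three steps above can be sketched as a small state machine in which an incident cannot be closed until its fix has been tested. The state names and allowed transitions below are hypothetical; they are one plausible reading of the steps, not a model defined by ITIL.

```python
# Hypothetical states and the transitions allowed between them.
TRANSITIONS = {
    "assigned": {"diagnosing"},
    "diagnosing": {"escalated", "fix_applied"},
    "escalated": {"diagnosing"},                 # the new team resumes diagnosis
    "fix_applied": {"verified", "diagnosing"},   # a failed test returns to diagnosis
    "verified": {"closed"},
    "closed": set(),
}


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


state = "assigned"
for step in ("diagnosing", "escalated", "diagnosing", "fix_applied", "verified", "closed"):
    state = advance(state, step)
print(state)  # closed
```

Because `fix_applied` has no direct path to `closed`, the sketch enforces the point made in the last bullet: a fix only counts once the service has been tested.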
It’s important to know, however, that while these incident management frameworks are critical and follow a logical progression, they were developed years ago, when enterprise networks were simpler, and IT operations teams were focused on maintaining a manageable number of devices and software products that didn’t change very often. Today, it’s a different story, and best practices for incident management have changed. Enterprise infrastructure configurations for even modest organizations are highly complex, software-driven, and distributed across on-premises facilities and the cloud, and they change moment by moment as services, virtual machines, computing instances, and mobile devices appear and disappear from the network.
Each add, move, and change affects the performance of adjacent devices and services. Monitoring and managing major incidents under these circumstances is impossible using traditional tools and approaches. There are too many changes to track manually, and they generate too much data. An automated configuration management database (CMDB) tracks these changes in real time at machine speed.
Data and Incident Management at Scale
In the context of monitoring modern enterprises, data is the primary challenge for configuration management. Every configuration item (CI) has data associated with it, including things like model numbers, serial numbers, software licenses, and more. And every configuration item generates data that lets IT operations know about the performance and state of the CI. This might include things like throughput speed, computing capacity, sources and destinations of communications, access, physical location, temperature, and patterns of activity that are measured against expectations.
When your IT estate comprises thousands of devices and services, each generating a constant stream of data, even a small percentage of unexpected signals means that legacy IT operations monitoring tools are quickly overwhelmed by events triggering the incident management process. Those tools can’t keep up with the relentless pace of activity, nor do they have the intelligence to distinguish actual trouble signals from momentary anomalies.
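To make the difference between a momentary anomaly and a sustained problem concrete, here is a deliberately simple sketch that raises an alert only when a metric stays above a threshold for several consecutive samples. The function name, the threshold, and the sample values are hypothetical, and this fixed rule is far cruder than the analysis an AIOps platform would apply.

```python
def sustained_breaches(samples, threshold, min_consecutive=3):
    """Return the indices at which a breach has lasted min_consecutive samples."""
    alerts = []
    run = 0  # length of the current streak of samples above the threshold
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            alerts.append(index)  # alert once per streak, not on every sample
    return alerts


# Hypothetical CPU utilisation readings (percent), one per polling interval.
cpu = [50, 95, 60, 96, 97, 98, 99, 40]
print(sustained_breaches(cpu, threshold=90))  # [5]
```

The single spike at index 1 is ignored, while the streak that starts at index 3 produces exactly one alert once it has lasted three samples.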
From Legacy to AIOps
Artificial Intelligence for IT Operations (AIOps) allows you to transcend traditional incident management frameworks because it is engineered to ingest all of that data and transform it from the source of the problem into powerful insight. That transformation starts with a foundation of best practices that includes:
- Real-time discovery and knowing the state of every CI in the estate
- Collecting and normalizing the disparate sources of data into a single operational data lake that automatically populates your CMDB with complete, accurate, and up-to-date operational data
- Using data to create a contextual map of the estate that identifies device and service dependencies, even as they change moment-to-moment
With this level of intelligence and configuration, when an incident occurs, a series of automations is triggered that makes the incident management process far faster and more efficient. In fact, AIOps can eliminate the need for generating most tickets by recognizing parent-child relationships, removing redundancies, and understanding when events, while unexpected, are not indicative of a problem. Then, when an event is acknowledged and needs to be investigated, an incident is created in the service desk as a new service request. Service requests are formal requests from users asking service providers to offer information, approval, or advice.
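As a rough illustration of how parent-child relationships and redundancy removal can cut ticket volume, the sketch below drops duplicate events and suppresses any event whose upstream dependency is also alerting. The `root_events` function, the `parent_of` map, and the CI names are hypothetical; this is not how SL1 or any specific AIOps product implements correlation.

```python
def root_events(events, parent_of):
    """Keep one event per CI, and only for CIs with no alerting ancestor."""
    alerting = {event["ci"] for event in events}
    seen = set()
    roots = []
    for event in events:
        ci = event["ci"]
        if ci in seen:
            continue  # redundant repeat for a CI that was already handled
        seen.add(ci)

        # Walk up the dependency chain looking for an ancestor that is alerting.
        suppressed = False
        visited = {ci}
        ancestor = parent_of.get(ci)
        while ancestor is not None and ancestor not in visited:
            if ancestor in alerting:
                suppressed = True
                break
            visited.add(ancestor)  # guards against cycles in the map
            ancestor = parent_of.get(ancestor)

        if not suppressed:
            roots.append(event)
    return roots


# Hypothetical dependency map: child CI -> the CI it depends on.
parent_of = {"vm-1": "host-a", "vm-2": "host-a", "host-a": "switch-1", "vm-3": "host-b"}
events = [
    {"ci": "vm-1", "msg": "unreachable"},
    {"ci": "host-a", "msg": "down"},
    {"ci": "vm-2", "msg": "unreachable"},
    {"ci": "host-a", "msg": "down"},
    {"ci": "vm-3", "msg": "disk full"},
]
print([event["ci"] for event in root_events(events, parent_of)])  # ['host-a', 'vm-3']
```

Five raw events collapse into two candidates for a ticket: the failed host, which explains both unreachable virtual machines, and the unrelated disk problem.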
For higher-level events that do require a person or team of people to respond, the incident is generated and enriched automatically with all the relevant data the incident management team needs to quickly fix the problem, restore the service, and close out the incident. The precision of that data, based on root cause analysis, means that your team is able to do more than merely suppress symptoms. Instead, they can diagnose and address the source of the problem.
In one large enterprise, with an IT estate of more than half a million configuration items, AIOps reduced the average time to incident closure from 2.5 hours per ticket to 15 minutes. That one measurable improvement resulted in average annual savings of $14 million. This is the result of following incident management best practices on a grand scale.
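The closure-time figures quoted above can be put in proportion with a short calculation. The helper name below is hypothetical, and the sketch uses only the two durations from the text; it makes no attempt to derive the savings figure, which depends on ticket volumes and costs the article does not give.

```python
def closure_time_improvement(before_minutes: int, after_minutes: int) -> dict:
    """Summarise the change in average time to close a ticket."""
    saved = before_minutes - after_minutes
    return {
        "minutes_saved_per_ticket": saved,
        "reduction_percent": 100 * saved / before_minutes,
        "speedup_factor": before_minutes / after_minutes,
    }


# 2.5 hours is 150 minutes; the improved figure is 15 minutes.
result = closure_time_improvement(before_minutes=150, after_minutes=15)
print(result)
```

On those figures each ticket closes 135 minutes sooner, which is a 90 percent reduction, or ten times faster.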
The Importance of Incident Management
That level of performance and efficiency is vital to understanding and maintaining the health and availability as well as managing the risk of the systems and services you and your customers rely on. But an effective and efficient incident management program produces value and benefits beyond simply knowing things are working well.
When your systems are working at their peak, you can minimize the risk of revenue lost to downtime and minimize the risk of service level agreement (SLA) violations. Those improvements have a direct impact on the bottom line by translating to higher levels of customer satisfaction and better customer retention rates. What’s more, because your skilled IT staff are not wasting their time attending to routine tasks, their talents can be reallocated to higher-level responsibilities.
Our experience has shown that when IT staff are liberated from incident management drudgery, they are more likely to play a role in the innovation of new services, including those that generate new sources of revenue. In addition, IT staff who feel challenged and valued are easier to retain.
AIOps also supports advanced analytics that can turn operational data into business insights that can lead to better outcomes for the organization. Those outcomes should include using the AIOps platform for greater automation as IT operations’ understanding of the infrastructure grows over time.
ScienceLogic, SL1, and Your Incident Management Program
As for AIOps, technology alone can’t solve your incident management problems. Yes, it takes great technology, but it also requires the expertise of a partner that can guide you on the path to AIOps value. ScienceLogic has both.
The ScienceLogic SL1 AIOps platform consistently outranks the competition when evaluated by top industry analysts. But ScienceLogic also understands that the journey to AIOps value takes time, starting with gaining an understanding of the organization’s current state of technology, challenges, and goals for the business. That consultative process is a commitment to work through each phase of the AIOps journey with both technical and business consultative support. These phases include:
- Foundation – Establish complete, real-time discovery to monitor both hybrid and multi-cloud environments.
- Crawl – Consolidate ITOM tools, integrate events, route tickets and notifications, and establish the organization’s incident management and automated ticketing processes.
- Walk – Migrate business services from a device-centric to a service-centric posture.
- Run – Apply analytics to IT problem management and develop troubleshooting and repair automations.
If you’ve been frustrated with the shortcomings of your IT operations monitoring capabilities and the inefficiencies of your incident management program as your IT environment grows larger, more complex, and more dynamic, why not take a look at ScienceLogic’s SL1 AIOps platform and talk to one of our experts? We’ve got the platform and the know-how to help you solve your monitoring and incident management challenges, and to help you realize the full potential of your investments in technology.