Over the past decade, we have seen unprecedented advances in how software-based services are delivered in the modern enterprise: the mobile revolution, the rise of public cloud environments, agile and CI/CD methodologies, microservice-based architectures, container-based deployments, and more. All of these innovations have increased the velocity, volume, and variety of new software services.
While these advances have fueled extraordinary growth and business agility, they have also created tremendous pressure on enterprise IT departments, who must still ensure that infrastructure and applications are secure, available, and effective. But the ways IT has done this in the past can no longer keep up. Traditional IT Operations Management (ITOM) tools and processes have been rendered obsolete, seemingly overnight.
The Pitfalls of Traditional ITOM
For example, in most enterprises today, it is still the case that when a service outage happens, an event is fired from some monitoring tool, which in turn generates an outage incident in some ticketing system. Given that it is a high-severity incident, a notification probably goes out via email, SMS, or some other messaging system. If the root cause and remediation are not readily obvious, some sort of collaboration session might be initiated over a crisis bridge (phone, Webex, Zoom, etc.) with relevant stakeholders. Everyone then brings their own tool to the party and tries to figure out the root cause. Once the cause is identified, a fix is created and applied, the service is restored, and the incident is closed. According to Gartner, determining root cause accounts for over 70% of the time it takes to restore a service to normal operation, that is, 70% of MTTR.
Now, in a world with a reasonably finite set of IT services and line-of-business (LOB) applications that were not updated frequently, the workflow above could be, and indeed has been, perfectly adequate for ensuring IT met its SLA commitments to the business. In today's world, however, where the average enterprise manages thousands of IT services, many of them updated on a weekly if not daily basis, that model simply does not scale.
AIOps: The Next Wave of IT Transformation
In recent years, AIOps has emerged to describe a set of technologies intended to address the velocity and complexity of today's enterprise IT. This new wave is anchored, as the name suggests, by AI/ML algorithmic techniques that promise to automate the human-centric analysis component of IT operations, thereby radically reducing MTTR and improving overall service performance. But it does not stop there. Another core aspect of AIOps, in ScienceLogic's opinion, is changing the perspective by which enterprise infrastructure and services are managed. These trends are summarized in Figure 1 below. We'll address each of them in the sections that follow.
Make the Transition to Services
Traditional IT infrastructure or operations management (ITIM/ITOM) tools have focused on understanding the current performance of compute, storage, and networking devices. Metrics are collected from these devices by various monitoring tools, and when those metrics exceed pre-defined thresholds, an event is fired for an IT operator to review and determine whether action is needed to avoid service degradation. In the early days, this approach often resulted in the fabled "sea of red" problem in unoptimized network operations centers (NOCs).
Operators would be overwhelmed by too many events, most of which were noise, which in turn impaired their ability to focus on what really matters. Over the years, several event correlation and management tools were brought to market to address this problem, resulting in fewer, more highly qualified events for operators to act on. But even these tools are approaching a point of diminishing returns in today's increasingly dynamic IT landscape.
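The threshold-and-deduplicate pattern described above can be sketched in a few lines. This is a minimal illustration, not a real monitoring tool's API; the metric names, thresholds, and event shape are all hypothetical.

```python
from collections import defaultdict

# Hypothetical threshold rules: metric name -> (threshold, severity).
THRESHOLDS = {
    "cpu_util_pct": (90.0, "major"),
    "disk_free_pct": (10.0, "critical"),
}

def evaluate(metric, value):
    """Fire an event when a metric crosses its pre-defined threshold."""
    rule = THRESHOLDS.get(metric)
    if rule is None:
        return None
    threshold, severity = rule
    # Free-disk is a "lower is worse" metric; CPU is "higher is worse".
    breached = value < threshold if metric == "disk_free_pct" else value > threshold
    if breached:
        return {"metric": metric, "value": value, "severity": severity}
    return None

def deduplicate(events):
    """Collapse repeated events for the same metric into one event with a
    count -- the kind of noise reduction event-correlation tools perform."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["metric"]].append(event)
    return [{**group[-1], "count": len(group)} for group in grouped.values()]
```

Three consecutive CPU breaches, for example, collapse into a single qualified event with `count == 3`, which is the difference between a "sea of red" and an actionable queue.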
In the end, focusing on measuring the health of devices and applications is analogous to focusing on the health of various human physiological systems: circulatory, nervous, digestive, and so on. Yes, it is good to know those systems are OK, but ultimately what you want to know first is: is the person feeling OK? The IT equivalent of a "person" in this metaphor is the business service. The business service is a topological data construct composed of compute, storage, and networking infrastructure as well as one or more applications. It not only serves as a useful lens for representing the health of what really matters to the business; it also abstracts away the complexity of the underlying infrastructure and gives the operator a more manageable and actionable view of the IT estate.
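One way to picture the business-service construct is as a small dependency graph whose health rolls up from the devices beneath it. The sketch below is illustrative only; the service names, topology, and worst-of roll-up rule are assumptions, not ScienceLogic's actual model.

```python
# A business service modeled as a dependency graph: the service's health
# rolls up from the applications and infrastructure beneath it.
SERVICE_TOPOLOGY = {
    "checkout-service": ["web-app", "payments-api"],
    "web-app": ["vm-01", "vm-02"],
    "payments-api": ["vm-03", "db-01"],
}

# Leaf-device health on a 0.0 (down) to 1.0 (healthy) scale.
DEVICE_HEALTH = {"vm-01": 1.0, "vm-02": 0.4, "vm-03": 1.0, "db-01": 1.0}

def service_health(node):
    """Health of a node is its own measurement if it is a leaf device,
    otherwise the worst health among its dependencies."""
    children = SERVICE_TOPOLOGY.get(node)
    if not children:
        return DEVICE_HEALTH.get(node, 1.0)
    return min(service_health(child) for child in children)
```

With this roll-up, a degraded `vm-02` surfaces as degraded health on `checkout-service` itself, which is the "is the person feeling OK?" view, while `payments-api` still reports healthy.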
Start with the End (User Experience) in Mind
Another important KPI to measure is end-user experience: How are consumers of a service experiencing it? Is it slow? Are they taking too many clicks to accomplish typical workflows? Digital Experience Management (DEM) is an increasingly important aspect of operations management. DEM often serves as the "canary in the coal mine," providing early warning that a service is beginning to fail so preventative steps can be taken. Equally important, it can provide a measure of service effectiveness and inform product managers on ways to optimize the user experience, making the service more compelling and easier to use. End-user experience is a particularly valuable dimension of service health in a world where more and more of the hosting infrastructure is abstracted or virtualized.
Let the Machine Figure It Out
As mentioned earlier, a core tenet of this new AIOps approach to IT operations is, well, AI. Specifically, it means using machine learning (ML) algorithms to reason over data that accurately represents the artifacts and relationships in the IT estate. When ML techniques are applied to a comprehensive domain-specific model, the time to value in terms of usable insights (training time) can be drastically reduced. Not only can ML algorithms be used to rapidly isolate the root cause of issues when they occur; ML can also detect anomalies in time-series data and serve as a "first alert" to the operator that a service-degrading event is likely to occur. ML promises radical increases in IT efficiency by separating the signal from the noise, focusing IT on the issues that really matter, and recommending actions needed to remediate.
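The "anomaly as first alert" idea can be illustrated with one of the simplest techniques in this family: flag a point when it deviates too far from the recent rolling baseline. Production AIOps platforms use far more sophisticated models; this rolling z-score is only a stand-in to show the shape of the approach, with the window and threshold chosen arbitrarily.

```python
from statistics import mean, stdev

def detect_anomalies(series, window=20, z_threshold=3.0):
    """Flag indices whose value deviates more than z_threshold standard
    deviations from the mean of the preceding window of samples."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Guard against a flat window, where sigma would be zero.
        if sigma > 0 and abs(series[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies
```

On a metric that normally oscillates between 10 and 11, a sudden jump to 50 is flagged immediately, before any fixed threshold on the raw value would have fired.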
Put Automation at Your Fingertips
Of course, once you have a solid understanding of what is going on, you naturally want to take action to address it quickly. For years, IT has relied on automated “recipes” to address common datacenter issues in a consistent fashion. These Runbook Automations (RBAs) are typically scripted actions to clear caches, dump logs, restart servers, etc. Moving forward, to gain greater efficiency, IT must be able to automate more complex tasks, some perhaps involving multiple systems and data transformations.
This kind of automation has typically required IT developers to write and test code or scripts. Developers are usually a scarce resource in IT, so important efficiency initiatives are often serialized behind their availability, and reuse has been low because these automations were often built to address a specific need. In the new world, automation is everything: not just automation of analysis (via ML) but automation of arbitrarily complex actions. Automation needs to be made accessible to IT operators, so they can compose their own workflows and integrations from a library of pre-built activities. This "democratization" of process automation is essential for IT to truly gain efficiency and responsiveness, and to keep up with an increasingly dynamic environment.
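The compose-from-a-library idea can be sketched as a registry of pre-built activities that an operator chains by name, without writing the activities themselves. Everything here is hypothetical: the activity names, the incident context shape, and the registry mechanism are illustrative, not a real RBA product's interface.

```python
# A shared library of pre-built activities, populated by a decorator so
# developers write an activity once and operators reuse it by name.
ACTIVITY_LIBRARY = {}

def activity(fn):
    """Register a pre-built activity in the shared library."""
    ACTIVITY_LIBRARY[fn.__name__] = fn
    return fn

@activity
def clear_cache(ctx):
    ctx["actions"].append(f"cleared cache on {ctx['host']}")
    return ctx

@activity
def restart_service(ctx):
    ctx["actions"].append(f"restarted service on {ctx['host']}")
    return ctx

def run_workflow(step_names, ctx):
    """Execute a workflow an operator composed by listing activity names,
    threading a shared context through each step in order."""
    for name in step_names:
        ctx = ACTIVITY_LIBRARY[name](ctx)
    return ctx
```

An operator's "workflow" is then just `["clear_cache", "restart_service"]`: no scripting required, and each activity remains reusable across workflows, which is the point of the democratization argument above.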
Journey’s End: Autonomic IT
In the end, ScienceLogic believes that a truly comprehensive AIOps platform must contain the following ingredients:
- A comprehensive, real-time data lake that accurately captures the current state of the IT environment – infrastructure, applications, and most importantly business services, and their dependencies
- A rich set of analytical techniques for correlating data from multiple sources, understanding likely root cause, identifying anomalies that could be predictive of future service issues, and making recommendations on remediation actions based on past experience; and
- An integrated library of pre-built automations (which can be tied to specific contexts as recommended actions), along with the ability to easily compose and execute new automated workflows and integrations using pre-built activities and low-code techniques.
To help our customers understand where they are today across these three dimensions of data, algorithmic analytics, and automation, we put together a journey map below (Figure 2).
In our experience, most customers today find themselves in stage 2 or 3, or somewhere in between. In principle, the combination of these three AIOps capabilities, when fully matured and integrated, would allow for truly autonomic, self-healing IT services (Stage 5): services that require very little human oversight to manage and maintain. While completely hands-off, machine-driven services may be technically possible, the more realistic goal is something akin to adaptive cruise control and lane assist in modern cars: the machine takes care of staying in the lane and maintaining a constant speed, and can even slow down and speed up as needed to deal with issues, while human operators remain available for major changes in direction, parking, restarting, and so on.
Whatever the metaphor, it is clear that AIOps, when fully realized, could deliver unprecedented levels of efficiency and agility to modern IT. IT would no longer be struggling to keep up; it would be accelerating, serving as a catalyst for business transformation and growth.
That is the goal. The destination. And one that we at ScienceLogic believe is fully worth the journey.
Want to learn how ScienceLogic can help you on your AIOps journey? Read the EMA AIOps Radar report »