The world of software is growing more complex and changing faster than ever before. The simple monolithic applications of recent memory are being replaced by horizontally scaled, cloud-native applications. It is no surprise that such applications are more complex and can break in far more ways, including ways nobody has seen before. They also generate a lot more data to keep track of. The pressure to move fast means software release cycles have shrunk drastically, from months to hours, with constant change being the new normal.

At the same time, the growing reliance on software means the cost of a problem is much higher than in the old days. If your grocery store cash register was down for an hour, you lost dozens of sales. But if an Amazon region is down for an hour, the impact quickly gets into the tens of millions of dollars.

Many software technologies have emerged to help keep tabs on these high-value software estates and keep them healthy: monitoring, APM, log management, and AIOps, to name a few. They specialize in different types of data collection, and together they can help a human observer analyze the data to understand any problems.

The Human Bottleneck

But given these trends, familiar techniques that rely on skilled humans to analyze vast quantities of observability data are no longer scaling. For example, queries, alerts, or scripts that look for familiar patterns or problems cannot catch novel problems. We increasingly rely on the intuition and skills of the user to have a hunch about new problems and know where to start drilling down to analyze them. The bottleneck is no longer data collection or the tools that can search the data. It is the human mind not knowing exactly what to look for.

We can’t just keep adding people to tackle this problem. Aside from cost, there simply aren’t enough skilled people to hire. This means we need help from more intelligent software to automate the analysis: to help us see problems quickly, understand their business impact, and figure out the root cause.

The Solution – Machine Learning-Driven Automation

Machine learning-driven automation is one such software innovation. As applied to software health, it takes advantage of three cutting-edge technologies.

Anomaly Detection

The first is anomaly detection. This refers to machine learning (ML) techniques that can learn the “normal” patterns of any data stream generated by software and automatically detect deviations from that normal. Such techniques have been applied to time series data as well as to events (such as log events). Well-designed anomaly detection is good at detecting changes in your software environment. But by itself it can be noisy and overwhelming, given the complexity and churn in large environments.
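To make the idea concrete, here is a minimal sketch of anomaly detection on a time series. It is a simple rolling z-score, used as a stand-in for the learned models described above rather than what any particular product does. The function name, the window size, the threshold, and the sample latency values are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=10, threshold=3.0):
    """Return the indices of points that deviate from the mean of the
    trailing `window` points by more than `threshold` standard deviations.

    This is a rolling z-score, a deliberately simple notion of "normal".
    """
    history = deque(maxlen=window)  # the most recent `window` observations
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip perfectly flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101, 100, 450, 99, 101]
print(detect_anomalies(latencies_ms))  # prints [11]
```

Even this toy version shows the noise problem: a low threshold flags ordinary jitter, while a high one misses subtle changes, and a large environment has thousands of such streams.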

Event Correlation

This brings us to the second category of machine learning: event correlation. This is a class of techniques that learn by watching event streams in any environment and automatically uncover groups of events that seem to be connected (for example, events with a cause-and-effect relationship).

 Layering event correlation on top of anomaly detection has two benefits:

  • It significantly reduces the amount of noisy information humans must pay attention to (by a factor of hundreds or thousands), so you are mostly seeing useful clusters of events.
  • Sophisticated event correlation techniques can construct full timelines of significant events that capture cause and effect any time a problem occurs. These serve as machine-generated root cause reports, replacing the human-generated “post-mortem” reports that are typically compiled through slow and painstaking manual analysis. This area of machine learning has advanced to the point where the results can be as accurate as those produced by a skilled human. And of course, they arrive much faster than a human could produce them.
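The two benefits above can be sketched with a very small example. The code below groups anomalous events that occur close together in time and prints each group as a timeline. Grouping by time gap is a crude heuristic standing in for learned correlation, and the `Event` type, the `max_gap` value, and the sample events are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since an arbitrary start time
    source: str       # component that emitted the event
    message: str


def correlate(events, max_gap=30.0):
    """Group events into clusters in which consecutive events are at most
    `max_gap` seconds apart. Each cluster is sorted by time."""
    clusters = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if clusters and event.timestamp - clusters[-1][-1].timestamp <= max_gap:
            clusters[-1].append(event)   # close enough: same incident
        else:
            clusters.append([event])     # too far apart: start a new incident
    return clusters


def timeline(cluster):
    """Render one cluster as lines with offsets relative to its first event."""
    start = cluster[0].timestamp
    return [f"+{e.timestamp - start:.0f}s [{e.source}] {e.message}" for e in cluster]


# Hypothetical anomalous events, deliberately out of order.
events = [
    Event(20.0, "web", "HTTP 503 returned"),
    Event(0.0, "db", "connection pool exhausted"),
    Event(600.0, "cron", "backup job slow"),
    Event(12.0, "api", "timeout calling db"),
]

clusters = correlate(events)
for line in timeline(clusters[0]):
    print(line)
```

Four separate alerts collapse into two incidents, and the first incident reads as a short cause-and-effect story starting at the database. Real correlation techniques would also need to consider which events tend to co-occur, not only how close together they are.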

Natural Language Processing

The final area of machine learning-driven automation is natural language processing (NLP). You’ve already experienced rudimentary forms of NLP when you interact with a chatbot on a website. Essentially these are ML models that are trained on existing knowledge bases. When you type a question or prompt, they try to match your input to a categorized list of answers in the knowledge base. But if you have heard about ChatGPT, then you know this space is poised for a big leap in intelligence and accuracy. Variants of such NLP are invaluable in connecting the dots between a bug you might be troubleshooting and the entire corpus of public knowledge in the same domain (think of websites like Stack Overflow). Using APIs for models like GPT-3, the machine learning software can take your root cause report (produced by event correlation) and tell you if others have already encountered a similar bug (and if so, what they had to say about it).
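As a rough illustration of matching a root cause report against existing knowledge, the sketch below scores a report against a few knowledge base entries using bag-of-words cosine similarity. This is a much simpler technique than a large language model such as GPT-3 and is used here only because it runs without any external service. The knowledge base entries, the report text, and the function names are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and count its alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(report, knowledge_base):
    """Return (score, entry) for the knowledge base entry most similar to the report."""
    report_vec = tokenize(report)
    scored = [(cosine(report_vec, tokenize(entry)), entry) for entry in knowledge_base]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical knowledge base, standing in for public Q&A content.
knowledge_base = [
    "Database connection pool exhausted under load causes API timeouts",
    "Disk full on log volume stops log rotation",
    "TLS certificate expired causing handshake failures",
]

# Hypothetical root cause report, as text.
report = "db connection pool exhausted, api timeout calling db, web returned HTTP 503"

score, entry = best_match(report, knowledge_base)
print(f"{score:.2f}  {entry}")
```

Word overlap misses paraphrases ("db" versus "database", "timeout" versus "timeouts"), which is exactly the gap that modern language models are expected to close.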

Putting It All Together

It is worth taking a moment to reflect on what these techniques can do if put together.

  • They can autonomously analyze your environment and automatically pick up weird or unexpected events.
  • They can automatically analyze the relationships between these weird events, and spit out accurate, comprehensive root cause reports describing what went wrong and why.
  • And finally, they can pull in the collective wisdom of the world and tell you what others concluded when they had a similar problem.
  • And they can do all of this as well as a skilled human. But they don’t get tired, can scale indefinitely, and work much faster than humans.

Want to see it in action? Sign up for a free trial of ScienceLogic Zebrium.
