The Observability Challenge: Limitations of the Human Brain
This is part one of a three-part blog series on Observability—the challenges and the solutions.
As software systems get more complex, the term Observability is increasingly being used alongside more familiar ones like "Monitoring." But the terms are not synonymous – they mean slightly different things. Monitoring is the act of watching a system to keep tabs on its overall health – in other words, keeping an eye on the symptoms so you know when something needs attention.
Observability is the act of inferring the internal state of the system based on externally visible symptoms. In other words, it is supposed to inform you why you are seeing the symptoms you are seeing, particularly when things go wrong. The reality is that hardly any self-described observability tools fulfill that part of the promise (show you the why). But trends in software complexity, scale, and speed mean we need this missing piece now more than ever. And automated root cause analysis is the key to getting there.
Background
Contemporary monitoring strategies follow the black-box approach (first popularized by Google’s SRE team). The idea is that since software systems are so complicated, it would be very hard to monitor everything that occurs within the system. Instead, we can simplify our task by focusing on the externally visible symptoms that matter – the golden signals of monitoring: latency, traffic, errors, and saturation. Just as checking temperature and blood pressure tells us when to worry about an individual’s health, these golden signals tell us when something is wrong and needs attention.
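To make the idea concrete, here is a minimal sketch of what black-box checks on the golden signals might look like. It is not from the original post: the signal names follow Google’s SRE book, but the GoldenSignals/check names, thresholds, and sample values are hypothetical – in practice they would come from a service’s SLOs.

```python
# A minimal, hypothetical sketch of black-box monitoring on the four golden
# signals (latency, traffic, errors, saturation). Thresholds and sample
# values are illustrative only.
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    latency_ms: float    # request latency
    traffic_rps: float   # requests per second
    error_rate: float    # fraction of requests that fail
    saturation: float    # fraction of capacity in use (0.0-1.0)


# Hypothetical alert thresholds -- real values depend on the service's SLOs.
THRESHOLDS = {
    "latency_ms": 500.0,
    "error_rate": 0.01,
    "saturation": 0.80,
}


def check(signals: GoldenSignals) -> list[str]:
    """Return the externally visible symptoms that warrant attention."""
    symptoms = []
    if signals.latency_ms > THRESHOLDS["latency_ms"]:
        symptoms.append(f"latency high: {signals.latency_ms:.0f} ms")
    if signals.error_rate > THRESHOLDS["error_rate"]:
        symptoms.append(f"error rate high: {signals.error_rate:.1%}")
    if signals.saturation > THRESHOLDS["saturation"]:
        symptoms.append(f"saturation high: {signals.saturation:.0%}")
    return symptoms


if __name__ == "__main__":
    sample = GoldenSignals(latency_ms=740, traffic_rps=1200,
                           error_rate=0.03, saturation=0.65)
    for symptom in check(sample):
        print("ALERT:", symptom)
```

Note that a check like this only surfaces the symptom – it says nothing about why latency or errors spiked, which is exactly the gap described next.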
The Challenge: Something broke, but what the heck happened?
The challenge is what needs to happen next. As part two of this series will explain, the troubleshooting process involves a lot of digging into metrics, traces, and logs.
For a simple problem (like a crashed server, or memory/CPU starvation), this process can be fairly quick. But for a more complex problem, this “troubleshooting” process is far from quick – it can consume hours of a skilled engineer’s time while users are impacted.
The challenge is not the tools, or their ability to collect the data – modern tools allow you to centralize data effectively. Nor is it the speed of searching or analyzing large quantities of data – tools are getting faster and more scalable, so they can certainly keep up with data growth.
The challenge is that the scale, complexity, and speed of change in software are all growing so fast that it takes a team of humans time to know even what to look for or where to start. The bottleneck is now the human brain, which isn’t getting faster.
A decade ago, you had a monolithic software architecture, with tens of log files or metric streams to keep an eye on. When it broke, there were typically only a few hundred ways failures could occur. And you had weeks, if not months, to roll out changes so you had time to learn about failure modes and what to look for.
Today, software is increasingly horizontal – a distributed spider web of inter-related micro-services. This complexity means that there might be thousands of failure modes. It also means data volume is exploding, and you might have to look through thousands of log streams when troubleshooting. And the accelerating speed of new deployments means change is constant, so there simply isn’t time to learn about all these possible problems.
So, the challenge now is that humans need help to even know where to look, or what to look for. And if AI/ML can actually show us root cause, that truly delivers on the promise of observability. Part two of this series will describe what automated RCA actually does, and part three will describe how it fits into ScienceLogic’s over-arching AIOps strategy.
Want to learn more about Zebrium? Request a free trial.