Automating Data Center Incident Response: Collecting Clues at the Scene of the Crime

The heart of IT is the data center. When something goes wrong, it can take a team of investigators to figure out what happened. But maybe you don’t need investigators. Maybe all you need is automation.

Have you ever seen a crime drama where the main character tries to solve a complex caper by correlating clues and leads pinned haphazardly to a cluttered bulletin board? Often the scene involves a sleep-deprived detective, fueled by coffee and cigarettes, pacing in a dimly lit room, drawing lines between photos of suspects and circumstances. In the end, after a lengthy process of elimination, the movie concludes with a dramatic takedown as the hero cuffs the villain.

Sounds a lot like managing a data center, doesn’t it? Something goes wrong, and you and your team of skilled investigators follow the clues until, by process of elimination, you’re able to identify and fix the problem. It’s a long, arduous trek, made harder by the multitude of interconnected devices and applications, and by the ever-changing landscape of virtual machines and instances that blink into and out of existence, that make up today’s IT environments.

The heart of IT is in the data center. When things go wrong there, it usually means there are holes in the organization, and closing those holes is the purpose of IT.

The Uptime Institute found that 31 percent of data center managers surveyed reported an outage last year. In the same survey, 80 percent of respondents said they believed their incidents were preventable. Among the most common causes of downtime:

  • On-premises power failures (33%);
  • Network failures (30%); and,
  • Hardware and software errors (28%).

What’s more, nearly a third of the survey’s respondents said the problem resulting in downtime was traced to a service provider, so they had no way of knowing exactly how to close those holes. That’s the kind of frustrating hindsight that comes only after the problem is diagnosed. Unless you are at the scene of the crime when it takes place, you’ll always be chasing clues and racing to lower your mean time to repair (MTTR). Unless, that is, you can automate the incident response process.

Before you think, “My data center is already automated,” we should be clear that we aren’t talking about automating things like provisioning, cloud scaling, and orchestration. Those capabilities are table stakes for data center management. The kind of automation we’re talking about—the kind that is needed to improve data center performance and minimize downtime—applies to the process of triaging and diagnosing incidents, and translating that information into response and repair.

Data center automation handles the processes you expect; data center incident automation handles the events you don’t expect.

Right now, incident response triage is a difficult, mostly manual process that demands an inordinate amount of time and attention from you and your staff. Because it is manual, it is prone to errors, and the complexity of today’s virtualized systems compounds them: a workload that existed at the time of the incident may no longer exist by the time you arrive at the scene. With our data center automation pack, you can go back to the scene of the crime. It automatically gathers a snapshot of the event from the exact moment the incident occurred, including a record of all the necessary contextual triage data. With that record, the system knows what happened and the conditions that existed at the time, and it can either fix the problem automatically or generate a ticket that tells you precisely what to do to fix it quickly.
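
As a rough illustration of the flow described above, the sketch below captures a snapshot the moment an alert fires, then either applies an automatic fix or opens a ticket that carries the captured context. Every name in it (Incident, capture_snapshot, KNOWN_FIXES, handle_incident) is hypothetical and does not reflect any product’s actual API, and the collectors are stand-ins for whatever telemetry a real environment exposes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict


@dataclass
class Incident:
    """An alert plus the context captured when it fired (hypothetical model)."""
    source: str
    kind: str
    detected_at: datetime
    snapshot: Dict[str, str] = field(default_factory=dict)


def capture_snapshot(collectors: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run every collector right away so short-lived workloads are still recorded."""
    snapshot = {}
    for name, collect in collectors.items():
        try:
            snapshot[name] = collect()
        except Exception as exc:  # a failed collector should not block triage
            snapshot[name] = f"collection failed: {exc}"
    return snapshot


# Hypothetical mapping from incident kind to an automatic remediation.
KNOWN_FIXES: Dict[str, Callable[[Incident], str]] = {
    "service_down": lambda inc: f"restarted service on {inc.source}",
}


def handle_incident(source: str, kind: str,
                    collectors: Dict[str, Callable[[], str]]) -> str:
    """Capture context first, then auto-remediate or fall back to a ticket."""
    incident = Incident(
        source=source,
        kind=kind,
        detected_at=datetime.now(timezone.utc),
        snapshot=capture_snapshot(collectors),
    )
    fix = KNOWN_FIXES.get(incident.kind)
    if fix is not None:
        return "auto-remediated: " + fix(incident)
    # No known fix: attach the captured context to the ticket text.
    details = "; ".join(f"{k}={v}" for k, v in sorted(incident.snapshot.items()))
    return f"ticket opened for {kind} on {source} [{details}]"
```

The design choice worth noting is the ordering: the snapshot is taken before any decision is made, so the context survives even if the workload disappears a moment later.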

Automation, by definition, lets machines take over many of the tasks your team is doing today, but that doesn’t mean robots will displace people. Instead, your high-skill individuals become an even greater asset to the business. Rather than spending their time mired in the drudgery of break-fix, they become integral to delivering a higher quality of service to users and customers, and they can engage in business innovation that translates to improvements like:

  • Improved cost savings;
  • Increased customer and staff satisfaction;
  • Greater business retention; and,
  • Stronger business development.

If you see yourself facing the kinds of challenges described in this scenario, and if you aspire to achieve the kind of improvements that are possible through data center incident automation, put down that tepid mug of joe, empty your ashtray, and read this data sheet.

We’re here to help whenever you’re ready.

Request a demo
