Minimize MTTR to Mitigate Impact of Change Management
In the first blog of this demo series, we showed you how to use Restorepoint to remediate after a network breach. In this second blog of the three-part series, we walk you through a change management instance, showing how to speed problem resolution and mitigate the impact of poor change management to minimize MTTR.
What to Do When an Incorrect Change Is Made
Today we’re going to look at ways to mitigate the impact of poor change management, or of somebody accidentally making an incorrect change. We are working in the same environment as before. I have another policy rule that checks that all devices are using only syslog servers that are on this list of approved IPs.
Before starting this demo, I went ahead and triggered a change to one of the devices: I went to the router called “Andy” and, instead of adding the syslog server ending in .181, I made a typo and added the one ending in .118. If we switch to that device, we can confirm that it is violating the policy rule “No unapproved syslog servers.” How would we use Restorepoint with SL1 to help Incident Response or IT Operations (ITOps) teams to effectively:
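To make the idea of an approved-syslog-server rule concrete, here is a minimal sketch of that kind of check. It is not how Restorepoint implements its policy rules. The IP addresses are placeholders from the documentation range, the `logging host <ip>` syntax assumes a Cisco IOS-style config, and the function name is hypothetical.

```python
import re

# Hypothetical approved list; these addresses are placeholders, not from the demo.
APPROVED_SYSLOG_SERVERS = {"192.0.2.181", "192.0.2.182"}

# Assumes IOS-style lines such as "logging host 192.0.2.181" or "logging 192.0.2.181".
SYSLOG_LINE = re.compile(
    r"^\s*logging\s+(?:host\s+)?(\d{1,3}(?:\.\d{1,3}){3})\b",
    re.MULTILINE,
)


def unapproved_syslog_servers(config_text, approved=APPROVED_SYSLOG_SERVERS):
    """Return syslog server IPs found in the config that are not on the approved list."""
    found = SYSLOG_LINE.findall(config_text)
    return [ip for ip in found if ip not in approved]


# Example config containing the kind of typo described in this demo (.118 instead of .181).
config = """\
hostname Andy
logging buffered 4096
logging host 192.0.2.118
logging host 192.0.2.182
"""

violations = unapproved_syslog_servers(config)
print(violations)
```

Any non-empty result would correspond to a violation of a rule like “No unapproved syslog servers.”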
- Address when a device is violating a policy rule;
- Identify exactly what the root cause was; and
- Identify the changes that need to be undone.
If we switch into SL1 again, we can see the message that was triggered from the device. There’s a compliance violation: “Andy” is no longer configured in accordance with our policy rules.
Leveraging Runbook & Scheduled Automations in SL1
You’ll already be familiar with the concept of runbook automations and scheduled automations as they happen within SL1. We can leverage those in combination with Restorepoint to enrich alerts with information coming from Restorepoint and help answer questions like:
- What happened within the environment?
- When was a change made?
- Who made that change?
As we look at this event, you can see that one of the options in the list of available automation policies and scripts for this alert is Restorepoint event enrichment. This tells SL1 to reach out to Restorepoint, collect data about the configuration and the difference between the two most recent configs, and add that information to this event. It’s already happened in this case, but we can look at the results by selecting view automation actions here.
We can see that this event now contains information from Restorepoint about the change that was made. We see here that I made a change and used .118 as the syslog host, which is the address that triggered the policy violation. The person responding to this ticket would have this information available to them immediately and be aware that a change was made.
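The enrichment step boils down to comparing the two most recent configuration backups. As a rough illustration only, the sketch below produces that kind of diff with Python's standard `difflib` module. The config contents and the `config_diff` helper are hypothetical, and this is not the actual SL1 or Restorepoint integration code.

```python
import difflib


def config_diff(previous, current):
    """Return a unified diff between two configuration backups as a list of lines."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
            n=0,  # no context lines: show only what changed
        )
    )


# Hypothetical backups; the addresses are placeholders from the documentation range.
previous_config = "hostname Andy\nlogging host 192.0.2.182\n"
current_config = (
    "hostname Andy\nlogging host 192.0.2.182\nlogging host 192.0.2.118\n"
)

diff_lines = config_diff(previous_config, current_config)

# Lines added in the newer backup (skipping the "+++" file header).
added = [
    line[1:]
    for line in diff_lines
    if line.startswith("+") and not line.startswith("+++")
]
# Lines removed from the older backup (skipping the "---" file header).
removed = [
    line[1:]
    for line in diff_lines
    if line.startswith("-") and not line.startswith("---")
]

print("\n".join(diff_lines))
```

Here the only added line is the mistyped syslog host, which is what a responder would want to see first.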
Somebody might not be responding to this ticket directly from within the SL1 platform, but might instead be working from an external ticketing system. In my demo environment, we have SL1 synchronized with ServiceNow for ticket handling, and the same policy that automatically applied the Restorepoint automation to this rule also automatically created an incident in ServiceNow. We can switch directly into that from this screen, and we can see here in ServiceNow that we have an incident that was created automatically by SL1, with information about the device.
Problem Resolution & Troubleshooting Catalyst
If we scroll down and look at the details and work notes, we can see that this information about the difference between the most recent configurations is at the responder’s fingertips. We can also see the exact timestamps of the two configurations, so we know when that change was made, which can be a tremendous time saver when it comes to troubleshooting.
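The two backup timestamps bound the window in which the change must have been made. The sketch below shows one way a plain-text work note could combine those timestamps with the diff. The timestamps, the note layout, and the `change_window_note` helper are all hypothetical, and no ServiceNow or SL1 API is called.

```python
from datetime import datetime, timezone


def change_window_note(previous_ts, current_ts, diff_lines):
    """Build a plain-text note stating the window in which the change occurred."""
    window = current_ts - previous_ts
    header = (
        f"Change made between {previous_ts.isoformat()} "
        f"and {current_ts.isoformat()} (window: {window})"
    )
    return "\n".join([header, *diff_lines])


# Hypothetical backup timestamps, for illustration only.
previous_ts = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
current_ts = datetime(2024, 1, 10, 9, 30, tzinfo=timezone.utc)

note = change_window_note(
    previous_ts, current_ts, ["+logging host 192.0.2.118"]
)
print(note)
```

The narrower the gap between backups, the smaller the window a responder has to search in change logs or audit trails.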
This is another example demonstrating how quickly Restorepoint detected the change and flagged it as a violation. We made this incorrect change, saw it in SL1, and used that information to automatically enrich both the event in SL1 and external tickets, speeding up troubleshooting and problem resolution by our customer’s operations team.