When a business-impacting event occurs, it’s critical that the right people are kept in the loop and that work to resolve the issue begins promptly. Unfortunately, many organizations lack a proactive, consistent way to communicate both internally and externally. As a result, issues that should be fixed quickly can go unresolved for hours, business stakeholders lose trust in IT, and customers lose trust in the business.
These challenges stem from the all-too-common situation where different teams use different tools for notifications, chat, and incident response. For instance, ITOps may use Opsgenie for incident notification and Slack for collaboration, while everyone outside of IT uses Microsoft Teams for day-to-day collaboration.
With this mix of different stakeholders using different tools, how do you ensure that the CIO viewing an SL1 Business Service dashboard knows that an engineer is working on the corresponding PagerDuty incident? How do you make sure that the Slack “war room” channel for an incident is kept up to date throughout the lifecycle of an event?
In this blog and video, we will demonstrate how ScienceLogic’s pre-built notification workflows help address communication and collaboration gaps across your organization so you can keep stakeholders informed and engaged, and speed incident resolution.
Proactive Notification to Internal and External Stakeholders
When service-impacting incidents occur, it is critical that information on these events be shared across the IT ecosystem so that everyone is working off the same data. It is also crucial that the right people are informed as soon as possible to resolve issues quickly and keep stakeholders happy. This includes executives who need to know when key services are down, customers who need to know there is some level of service disruption that is actively being addressed, and of course, IT operators who need to start fixing the issue. The worst scenario is when a customer complains that a service is not working as expected, and no one in IT knows.
Achieving this seamless information sharing across technologies and people starts with SL1’s real-time data lake, which acts as a single source of data: it harvests data from the rest of your IT ecosystem and shares intelligent data back with it. When SL1 detects a service-impacting incident, its notification workflows can automatically send the pertinent event details to the tool of your choice. This could be a Microsoft Teams message notifying business stakeholders of an outage, or the creation of an Opsgenie incident notifying IT operators that there is a problem. Already, this workflow provides value to the organization by keeping everyone on the same page when something happens.
Faster Incident Notification, Response, and Resolution
Keeping external stakeholders informed of issues when they happen is just part of the challenge. Those issues must also be resolved quickly, and resolving them often requires engagement across multiple teams, even just within IT. For instance, SecOps may need to provide a credential or be engaged when changes are made to network configurations. Or ITOps may need to ask DevOps a clarifying question about a system architecture.
The SL1 notification and collaboration workflows help speed incident notification, response, and resolution by fostering collaboration across your organization so that teams can work together more effectively. Because SL1 automatically captures and sends incident and event details to the tools your team uses, everyone, whether in DevOps, SecOps, or ITOps, can be on the same page about what is happening. With a common operating picture, whoever is responsible for resolving the issue doesn’t have to waste time forwarding incidents, manually handing off information, or explaining issues to other members of the team. This drastically reduces the lag in sharing information, ultimately speeding both response and resolution times.
Let’s look at a typical example of how SL1 notification and collaboration workflows keep stakeholders informed and speed resolution of an issue:
1. Our scenario begins when SL1 detects an event. This could be a service-level event (like the degradation of your Active Directory service), or an event on a specific device (like an interface utilization event on a core network router).
2. Based on the type of event or the attributes of the service/device, SL1 processes and transforms the event data and routes the event notification to different channels, teams, or even different tools. This step includes formatting event data into a customizable message pre-configured according to your organization’s needs. For instance, you can send an incident to an ITOps admin in PagerDuty, a notification to the DevOps team in Slack, and an update to business owners in Microsoft Teams. You can further configure PagerDuty to proactively notify impacted external customers of a known service degradation.
3. An engineer responds by acknowledging the notification or incident record, or by selecting the acknowledge button that accompanies the chat message. Once acknowledged, the acknowledged state is reflected in the SL1 event and in any other tools integrated with SL1. As a result, all tools within the IT ecosystem stay in sync for the life of the issue.
4. The engineer realizes he needs a few details from the DevOps team on how the application is architected. He uses the Slack channel where the event details were posted to directly collaborate with the DevOps team to get the information he needs. He also tags SecOps to notify them of any changes he is making.
5. When the engineer working on the event resolves the issue and SL1 detects the resolution, SL1 clears the event and notifies the same groups of people, via the same channels, that the issue has been resolved. Similarly, if an engineer marks the issue as resolved in an external tool, SL1 will automatically clear the associated events.
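To make the flow in the steps above more concrete, here is a minimal Python sketch of the general pattern: route an event by type, format a per-channel message, and mirror every state change back to the same channels. It is an illustration only and not SL1’s actual API. The names `Event`, `Channel`, and `NotificationRouter`, the state values, and the message templates are all hypothetical, and the channels simply record messages rather than calling PagerDuty, Slack, or Microsoft Teams.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A detected event. All field names are illustrative assumptions."""
    event_id: str
    kind: str              # e.g. "service" or "device"
    summary: str
    state: str = "open"    # open -> acknowledged -> cleared


@dataclass
class Channel:
    """Stand-in for a tool such as PagerDuty, Slack, or Microsoft Teams."""
    name: str
    template: str = "[{state}] {event_id}: {summary}"
    sent: list = field(default_factory=list)

    def notify(self, event: Event) -> None:
        # A real integration would call the tool's API here;
        # this sketch only records the formatted message.
        self.sent.append(self.template.format(
            state=event.state.upper(),
            event_id=event.event_id,
            summary=event.summary,
        ))


class NotificationRouter:
    """Routes events by kind and mirrors state changes to the same channels."""

    def __init__(self, routes: dict):
        self.routes = routes   # maps event kind -> list of Channel
        self.events = {}       # maps event_id -> Event

    def _broadcast(self, event: Event) -> None:
        for channel in self.routes.get(event.kind, []):
            channel.notify(event)

    def detect(self, event: Event) -> None:
        # Steps 1-2: record the event and notify the channels for its kind.
        self.events[event.event_id] = event
        self._broadcast(event)

    def update_state(self, event_id: str, new_state: str) -> None:
        # Steps 3 and 5: an acknowledgement or resolution from any tool
        # updates the shared event and is echoed to every routed channel.
        event = self.events[event_id]
        event.state = new_state
        self._broadcast(event)


# Example: a service event fans out to three tools; a device event would
# reach only the first.
pagerduty = Channel("pagerduty")
slack = Channel("slack")
teams = Channel("teams", template="Service update ({state}): {summary}")
router = NotificationRouter({
    "service": [pagerduty, slack, teams],
    "device": [pagerduty],
})

router.detect(Event("E-1", "service", "Active Directory degraded"))
router.update_state("E-1", "acknowledged")
router.update_state("E-1", "cleared")

for message in teams.sent:
    print(message)
```

Keeping one shared event record and re-broadcasting on every state change is what keeps the tools in sync in this sketch. A production integration would also need authentication, retries, and de-duplication, none of which are shown here.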
By connecting notification and collaboration tools with SL1 events and synchronizing state changes wherever they occur, SL1 significantly improves your organization’s ability to collaborate and keep all stakeholders informed.