Facebook Outage: The Case for Configuration & Change Management
In the age of cloud, digital transformation, application modernization, and the mobile economy, the network is the lifeblood of excellent customer experiences. Network Operations (NetOps) and IT Operations (ITOps) teams know that a disruption in the performance of core network systems can have a massive impact on their business.
These teams design, plan, test, configure, deploy, monitor, and adjust critical network devices frequently so that traffic loads and service levels are continuously met. In the world of hybrid cloud, this includes managing a complex fabric of edge and core routers and switches, as well as key connection points between data centers, WAN providers, and hyperscalers (e.g., AWS, Azure, and Google) that are often highly dependent on each other.
To ensure this array of network technologies operates as expected, NetOps teams must be constantly aware of the potential impact of making a configuration change. Understanding which systems were updated last and exactly what changed is essential. Having a backup of the most recent configurations is also vital, not only for recovery but also for analyzing the cause. Configuration backups should be a core part of every change operation in case restoration is required. ITOps teams must also be aware of any changes so they can quickly identify the root cause of a service outage before it spirals out of control.
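As a minimal sketch of making the backup part of the change operation itself, the Python below snapshots a configuration before a change and restores it if the change fails. Everything here is hypothetical and for illustration only: the in-memory `router` dictionary, the `change_window` helper, and the simulated health-check failure stand in for real devices and tooling.

```python
from contextlib import contextmanager


@contextmanager
def change_window(device, backups):
    """Snapshot a device's config before a change; restore it if the change fails.

    `device` is a hypothetical in-memory model: {"name": str, "config": [str, ...]}.
    """
    snapshot = list(device["config"])  # copy taken BEFORE the change
    backups.append({"device": device["name"], "config": snapshot})
    try:
        yield device
    except Exception:
        # Roll back to the pre-change configuration, then surface the error.
        device["config"] = list(snapshot)
        raise


# Hypothetical device and backup store, for illustration only.
router = {"name": "core-rtr-1", "config": ["hostname core-rtr-1", "router bgp 65000"]}
backups = []

try:
    with change_window(router, backups):
        router["config"].append("no router bgp 65000")  # the risky change
        raise RuntimeError("post-change health check failed")  # simulated failure
except RuntimeError as err:
    print(f"change rolled back: {err}")

print(router["config"])
```

Because the snapshot is taken inside the change operation, the backup used for recovery is always current, and the same copy remains available afterward for analyzing the cause.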
Five Hours Down, Millions of Dollars Lost, & Billions of Unhappy Customers
On Monday, we saw an example of the potential impact of a network configuration change with the multi-hour service outage at Facebook, WhatsApp, and Instagram. Billions of customers and partners experienced a service disruption and, as a result, Facebook's stock took an immediate hit. From Facebook's blog post on the issue, we understand the following:
“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
– Santosh Janardhan, Network Traffic Operations at Facebook
From an industry perspective, Gartner estimates that people and process issues cause 80% of outages impacting mission-critical services, and that roughly 50% are caused by change, configuration, release-integration, or hand-off issues. It is therefore critical that companies managing complex networks invest in highly automated workflows for network configuration and change management.
Solving NetOps & Security Issues with ScienceLogic & Restorepoint
ScienceLogic recently announced its acquisition of Restorepoint, an industry-leading network configuration and change management solution focused on solving common network and security operations issues. Restorepoint can detect changes and take an instant backup by listening to change events from network devices. Backups can also be triggered via its API as part of a change management workflow that runs in other platforms. As backups are taken, Restorepoint automatically compares the current configuration with the previous backup and sends a change notification or failed-backup alert via email or straight into the ScienceLogic SL1 platform (SL1).
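To illustrate the general idea of comparing a new backup with the previous one, here is a minimal sketch using Python's standard `difflib` module. It is not Restorepoint's implementation or API: the function names, the event dictionary, and the sample router configuration are all assumptions made for illustration.

```python
import difflib


def config_changes(previous, current, device="device"):
    """Return unified-diff lines between two config texts (empty list if identical)."""
    return list(difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile=f"{device}/previous",
        tofile=f"{device}/current",
        lineterm="",
    ))


def backup_event(device, previous, current):
    """Build a hypothetical change-notification payload, or None if nothing changed."""
    diff = config_changes(previous, current, device)
    if not diff:
        return None
    body = diff[2:]  # skip the two file-header lines ("---" and "+++")
    return {
        "device": device,
        "added": sum(1 for line in body if line.startswith("+")),
        "removed": sum(1 for line in body if line.startswith("-")),
        "diff": "\n".join(diff),
    }


# Hypothetical backups of the same device, taken before and after a change.
previous = "hostname core-rtr-1\nrouter bgp 65000\n neighbor 10.0.0.2 remote-as 65001\n"
current = "hostname core-rtr-1\nrouter bgp 65000\n neighbor 10.0.0.2 remote-as 65002\n"

event = backup_event("core-rtr-1", previous, current)
if event is not None:
    print(f"{event['device']}: +{event['added']} / -{event['removed']} lines changed")
    print(event["diff"])
```

A payload like this could be sent by email or forwarded to a monitoring platform, so the engineer sees the changed lines alongside the alert instead of hunting for them.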
With the SL1 integration, customers see line-by-line configuration changes within the event. Recovery time is therefore slashed because NetOps engineers identify the cause faster and can restore the previous configuration in a few clicks. Without Restorepoint, NetOps engineers depend on the availability of the last-known good configuration. Teams often find that their backups are not current or, worse, not complete because the correct backup process was not well understood. As a result, they cannot restore automatically.
As your business considers its own strategy to avoid potential service outages and negative impacts on customer experiences, we encourage you to engage with our ScienceLogic or Restorepoint team and see how our NetOps and ITOps solutions can improve your business outcomes.