Operational resilience has always been the goal of any financial institution’s IT team, but “always available” and “always secure” are harder to achieve than many would expect.

With new regulatory rules coming into force that compel financial institutions to demonstrate their ability to recover from service outages and security breaches, a spotlight is being shone on how effective current business continuity plans and security controls are. In collaboration with the Prudential Regulation Authority, the Financial Conduct Authority has produced new regulatory rules and policies that are driving this change. These require financial service providers to manage operational disruption by introducing measures to respond to, recover from, and adapt to disruptive events.

A deadline has been set for March 31, 2022. By then, businesses must have identified their important business services, set impact tolerances, and undertaken mapping and testing to identify vulnerabilities in their operational resilience. In addition, businesses must have recognized, prioritized, and invested in operational resilience improvements and developed communications plans as part of their incident response processes. The question is, what does this mean in practice?

Let’s take a closer look at the four key processes that we believe will need addressing as a matter of priority.

1. Important Business Services

The first requirement for financial service providers is to identify and document important business services. A crucial word here is “important”; the importance is measured by the impact of disruption on customers and markets through harm to individuals or market instability.

Services need to be identified at a granular level, so that each specific service is listed rather than groups of related services. From a network security viewpoint, services may have a high degree of commonality or share many of the same infrastructure components, but separate assessments are still necessary to provide sufficient confidence in the thoroughness of any impact assessment.

The critical criterion for an important business service is that it must have identifiable stakeholders, customers, providers, and participants. Its importance will change with the vulnerability and number of affected customers: the more significant the effect, the greater the importance.

As part of this process, each business service must be mapped to the resources it relies upon, including people, processes, infrastructure, and information. For the network security team, this provides visibility of which services depend on their networks.
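The mapping described above can be sketched as a simple data structure. The following is a minimal illustration in Python, not a prescribed format: the service names, resource names, and the `services_depending_on` helper are all hypothetical, and a real mapping would come from the firm's own analysis.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class BusinessService:
    """An important business service and the resources it relies upon."""
    name: str
    people: Set[str] = field(default_factory=set)
    processes: Set[str] = field(default_factory=set)
    infrastructure: Set[str] = field(default_factory=set)
    information: Set[str] = field(default_factory=set)


def services_depending_on(services: Iterable[BusinessService],
                          component: str) -> List[str]:
    """Return the sorted names of services whose infrastructure includes `component`."""
    return sorted(s.name for s in services if component in s.infrastructure)


# Hypothetical example data, for illustration only.
SERVICES = [
    BusinessService(
        name="online-payments",
        people={"payments-ops"},
        processes={"payment-authorisation"},
        infrastructure={"core-switch-01", "edge-firewall-01"},
        information={"customer-accounts"},
    ),
    BusinessService(
        name="mortgage-applications",
        people={"lending-ops"},
        processes={"credit-decisioning"},
        infrastructure={"core-switch-01", "branch-router-07"},
        information={"application-records"},
    ),
]

# Which important business services are affected if this device fails?
print(services_depending_on(SERVICES, "core-switch-01"))
```

Keeping one record per service, even where services share most of their infrastructure, is what allows the reverse lookup to show the network team exactly which services a given device underpins.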

2. Impact Tolerance

Businesses will be expected to quantify the impact tolerance for each of the identified important business services. In addition, a measure of the maximum acceptable level of disruption (including time and cost) must be defined and documented consistently across business services.

Assessing impact requires considering both the effect of disruption on the business services and its potential duration. These two factors combine to produce an overall measure of the total impact on customers and markets. Impact tolerance is a measure of acceptable disruption occurring over a bearable period.

For example, the impact tolerance for a total network failure that disrupts a business’s entire operation will be set lower than that for a single business service interrupted by a network configuration error, because the total failure is more expensive in terms of disruption and reputational damage. The disruption to a single service costs less and may be tolerable for a longer period, particularly if it occurs out of hours.
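To make the comparison concrete, here is a minimal sketch in Python of an impact tolerance expressed as a maximum duration and a maximum cost, with a check of whether a given disruption stays inside it. The class names, the `within_tolerance` function, and every figure are assumptions chosen for illustration; the regulations do not prescribe this representation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ImpactTolerance:
    """Maximum acceptable disruption for one important business service."""
    max_duration: timedelta
    max_cost: float  # in the firm's reporting currency (hypothetical figures)


@dataclass(frozen=True)
class Disruption:
    """An observed or simulated disruption to a service."""
    duration: timedelta
    cost: float


def within_tolerance(tolerance: ImpactTolerance, disruption: Disruption) -> bool:
    """A disruption is tolerable only if both duration and cost stay within limits."""
    return (disruption.duration <= tolerance.max_duration
            and disruption.cost <= tolerance.max_cost)


# Hypothetical tolerances: the total network failure gets the tighter time limit.
TOTAL_NETWORK_FAILURE = ImpactTolerance(timedelta(hours=2), max_cost=500_000.0)
SINGLE_SERVICE_OUTAGE = ImpactTolerance(timedelta(hours=8), max_cost=50_000.0)

# The same four-hour outage is a breach in one case and tolerable in the other.
four_hour_outage = Disruption(timedelta(hours=4), cost=30_000.0)
print(within_tolerance(TOTAL_NETWORK_FAILURE, four_hour_outage))
print(within_tolerance(SINGLE_SERVICE_OUTAGE, four_hour_outage))
```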

Rather than focusing on identifying and implementing controls to prevent incidents based on risk appetite, this process requires acceptance that incidents will occur and looks to assess the potential impact when the inevitable happens. But, of course, you’ll still be expected to carry on with prevention too.

3. Testing Resilience

Once important business services are identified and each has an agreed impact tolerance, the regulations require businesses to test (at least annually) their ability to operate within these impact tolerances under all credible disruptive events. These tests are focused solely on incident response and recovery rather than the traditional approach of testing incident prevention.

The purpose of testing is two-fold. The first is to validate the impact tolerances and revise them where necessary. The second is to periodically review current resilience and identify deficiencies and vulnerabilities arising from changes over time in business processes, services, or the threat landscape that may influence the calculation of impact tolerance. For example, recovery plans for network incidents should be regularly tested to provide evidence that not only are they effective in restoring services, but that restoration is achieved within the required timeframes. In addition, infrastructure changes and service scaling can, over time, result in increased recovery times that may adversely affect impact tolerances.
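The point about recovery times drifting can be illustrated with a short Python sketch that compares recorded recovery-test results against a maximum recovery time. The `evaluate_recovery_tests` function, the test dates, and the durations are hypothetical; they simply show the kind of evidence a regular test cycle might produce.

```python
from datetime import date, timedelta
from typing import List, Tuple


def evaluate_recovery_tests(
    test_results: List[Tuple[date, timedelta]],
    max_recovery_time: timedelta,
) -> List[Tuple[date, float]]:
    """Return (test date, headroom in minutes) in date order.

    Negative headroom means the recovery took longer than the tolerance allows.
    """
    return [
        (test_date, (max_recovery_time - recovery_time).total_seconds() / 60)
        for test_date, recovery_time in sorted(test_results)
    ]


# Hypothetical results: recovery slows as the network grows between tests.
TESTS = [
    (date(2021, 3, 1), timedelta(minutes=40)),
    (date(2021, 9, 1), timedelta(minutes=55)),
    (date(2022, 3, 1), timedelta(minutes=70)),
]
MAX_RECOVERY_TIME = timedelta(hours=1)  # assumed tolerance for this service

for test_date, headroom in evaluate_recovery_tests(TESTS, MAX_RECOVERY_TIME):
    status = "within tolerance" if headroom >= 0 else "BREACH"
    print(f"{test_date}: {status} (headroom {headroom:+.0f} min)")
```

Tracking headroom rather than a simple pass or fail makes shrinking margins visible before a test actually breaches the tolerance.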

4. Communicating Disruptions

An essential requirement of the new regulations is transparent communication, both internal and external.

An internal communications plan should define processes for incident management, identifying key decision-makers and the information they need for informed decision making, along with information-gathering processes and escalation paths. These processes will support both incident recovery and post-incident analysis, including impact assessment and lessons-learned exercises.

An external communications plan should define processes for delivering warnings and advice to customers, affected third parties, regulators, media, and other interested parties. Its focus is on open and honest communications, so everyone knows what is going on and when things are expected to return to normal. This can be challenging when the network has failed and it’s not immediately obvious why.

The network security team will often find themselves at the heart of the information-gathering activities, assessing the cause, extent, and impact of disruption to network services and their effect on the important business services. Centralized network monitoring and forensic readiness will be vital for gathering the information that clear communications depend on.

In Conclusion

For businesses delivering financial services, the regulatory authorities are shining a light on operational resilience, imposing requirements to assess and monitor the impact of disruption on customers and financial markets and to demonstrate financially viable recovery capabilities.

This affects all aspects of the business, but particularly the team managing the network infrastructure that underpins all business services. Incident recovery plans were often just paper-based, but now need to be validated and regularly tested to prove that the business can operate within the tolerances being set.

The problem for most organizations is that the network is often managed in silos, by business area, skillset, and then in network vendor silos too. Centralized monitoring of network availability is relatively easy to achieve, but centralizing the security, compliance, and configuration management of the whole network is almost impossible.

Restorepoint, a ScienceLogic company, provides a centralized, automated multi-vendor network configuration management solution that helps leading financial institutions lower the risk of network disruption and improve compliance. With support for more than 100 network and security vendors, Restorepoint quickly enables customers to centralize backup of network device configurations, recover from misconfiguration or hardware failures automatically, and detect changes and compliance weaknesses.

Book a live demo and see how you could use Restorepoint to drive multi-vendor network efficiency, eliminate time-consuming manual processes and achieve operational resilience.

Tune in next week for part two of this series, focusing on how network vulnerabilities can impact operational resilience.
