For many (including me), much of what is described as resilience still looks suspiciously like “warmed up” business continuity. Nevertheless, I believe that resilience is different to business continuity. I don’t see it as an alternative to business continuity – rather, there is a strong inter-relationship between the two, and they complement each other as part of an overarching framework for operational stability. In the financial sector, regulatory developments in this area are forming a useful backdrop against which to differentiate what resilience is, what business continuity is and how they are related. To assist with this comparison I’m going to use the recent PWC paper “Operational resilience: How to set and test impact tolerances”. The report clarifies a number of concepts contained in the regulatory discussion paper DP18, “Building the UK financial sector’s operational resilience”, issued by the Bank of England, the Prudential Regulation Authority and the Financial Conduct Authority. The regulators define operational resilience as: “an approach to operational risk management that includes preventative measures and the capabilities – in terms of people, processes and organisational culture – to adapt and recover when things go wrong”. This would seem to point us towards a wider perspective of managing operational impairment, above and beyond the scope of most business continuity programmes.
There are significant differences between resilience and business continuity, and they come down to the way in which potential threats to operational stability are treated:
Resilience mitigates operational impairment by:
- Strengthening infrastructure to make it less likely to fail in different circumstances
- Making infrastructure “elastic” so that it can scale up (as well as down) during periods when operational volumes may be subject to sudden, large fluctuations
- Establishing “self-repairing” capabilities so that single points of failure are removed
- Embedding the firm’s risk appetite throughout the organisation by influencing day-to-day behaviour
- Embedding resilience measures in the day-to-day operations of the firm
Business Continuity mitigates operational impairment by:
- Creating contingency plans that provide the means for recovering and continuing activities by using alternative methods and processes to reinstate critical operations (which may operate at a reduced level for a period)
- Invoking contingency plans at the point when failure has occurred
Resilience and Business Continuity are, of course, related and complementary. No organisation is 100% resilient 100% of the time. For the times when “something slips through the net” or is of sufficient magnitude to overwhelm existing levels of resilience, a contingency plan will be required.
Resilience and Business Continuity also complement each other in situations where, given the cost of mitigation and the likelihood and impact of the threat, it is more practical to establish contingency plans than to invest in infrastructure modifications.
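One way to picture that trade-off is a rough cost comparison. The sketch below is only an illustration under my own assumptions, not a method taken from the PWC paper or DP18: it assumes the threat can be summarised as a yearly likelihood and a single impact cost, and that strengthening infrastructure is worthwhile only when its yearly cost is lower than the expected yearly loss. The function names and figures are hypothetical.

```python
def expected_annual_loss(likelihood_per_year, impact_cost):
    """Expected yearly loss from one threat (assumption: likelihood x impact)."""
    return likelihood_per_year * impact_cost


def choose_treatment(likelihood_per_year, impact_cost, mitigation_cost_per_year):
    """Pick a treatment using a deliberately simple, hypothetical rule.

    Strengthen the infrastructure only when doing so costs less per year
    than the loss we would otherwise expect; otherwise fall back on a
    contingency plan.
    """
    loss = expected_annual_loss(likelihood_per_year, impact_cost)
    if mitigation_cost_per_year < loss:
        return "strengthen infrastructure"
    return "contingency plan"


# Made-up figures: a frequent threat justifies mitigation, a rare one does not.
frequent = choose_treatment(0.5, 1_000_000, 100_000)
rare = choose_treatment(0.01, 1_000_000, 100_000)
```

In practice the decision would weigh far more than a single number, but the comparison shows why some threats are better served by a plan than by modification.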
Differences also exist in how “proof of capability” is assured. Business Continuity plans and arrangements are generally assured by a scenario test: a “make believe” situation where contingency arrangements are re-enacted to ensure that they will operate as expected. Assurance of operational resilience is obtained by stress testing, which means simulating a real attack or dysfunction on the target. In some cases – particularly IT defences – the attack will be performed on an accurate replica of the target infrastructure. The stress test is designed to pressurise one or more aspects of operational infrastructure to breaking point in order to establish whether the required level of resilience is present.
To illustrate the points made above, I’ll use “the bridge analogy” in the PWC paper to show why a lot of thinking around resilience still leans towards business continuity principles rather than resilience.
I liked the section on impact tolerances because it provides target thresholds for resilience. Another good thing in here (and with the regulatory focus on products and services) is that it looks at the impact on the organisation rather than arbitrary recovery “objectives” based on departmental silos. Nevertheless, a similar approach is already being applied within many business continuity programmes: using a service or product based focus to establish scope and priorities for recovery ensures that planning activities are focussed on organisational objectives and not departmental silos. From a practical perspective, it’s difficult to see an alternative way of defining scope and priorities when the focus is determining which aspects of the firm’s infrastructure you should either strengthen to avoid failure or recover when failure occurs.
What I’m not so sure about in the example given in the PWC report is the jump from impact tolerance to stress test, because an approach based on establishing and improving resilience should include practical steps to establish current levels of resilience and improve them where necessary. A step seems to be missing: the bit that relates to the current stability, strength and elasticity of operational activities and supporting infrastructure, and what practical improvements can be embedded. The impact tolerances defined for the bridge have a single root cause, which is that the bridge is closed for some reason. Bearing in mind that there are several things that could cause bridge closure, including weather conditions and traffic accidents, the issue to be addressed is “how do we make our bridge more resilient in order to maintain revenue stability?”
For the purposes of this discussion, I’d like to consider some options that would be available to make the bridge operationally resilient to traffic accidents and, consequently, embed measures that would protect revenue:
1. Paint chevrons in each lane to provide guidance to drivers on maintaining a safe distance
2. Introduce conservative speed limits - say 30 mph
3. Prohibit lane changing
4. If number plate recognition is in place, introduce a penalty system of fines for drivers who change lanes or who exceed the speed limit
5. For drivers who regularly use the bridge, award vouchers for consistently observing speed and other restrictions
6. Provide escorts for vehicles that are transporting hazardous materials.
The above options make the bridge more resilient by:
a. Influencing driver behaviour – most accidents are caused by driver error or temporary recklessness. Chevrons, speed limits and prohibited lane changing all contribute to a safer driving environment
b. Using fines to penalise risky behaviour and vouchers to reward risk-averse behaviour (and therefore influencing drivers to drive in accordance with the bridge owner’s risk appetite)
c. Restricting speed to reduce the chances of speed-related accidents
d. Implementing measures to contain specific threats (e.g. HazMat spills)
How does the owner of the bridge measure its resilience? By establishing metrics that track the performance of the resilience measures: logging hours lost due to traffic accidents and the duration of accident remediation, and comparing them over the course of time.
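As a minimal sketch of what such tracking might look like, the snippet below groups a closure log by reporting period and reports total hours lost and average remediation time for each. The record layout, the period labels and the sample figures are all hypothetical; they are there only to make the idea concrete.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Closure:
    period: str          # reporting period label (hypothetical, e.g. "Q1")
    hours_closed: float  # hours the bridge was closed by one traffic accident


def resilience_metrics(closures):
    """Return {period: (total_hours_lost, mean_remediation_hours)}."""
    by_period = defaultdict(list)
    for closure in closures:
        by_period[closure.period].append(closure.hours_closed)
    return {
        period: (sum(hours), sum(hours) / len(hours))
        for period, hours in sorted(by_period.items())
    }


# Made-up log: two accidents in Q1, one in Q2.
log = [Closure("Q1", 4.0), Closure("Q1", 2.0), Closure("Q2", 3.0)]
metrics = resilience_metrics(log)
```

Comparing the figures from one period to the next gives the bridge owner a trend: if the measures are working, hours lost should fall over time.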
Of course, there is always the chance that something outside of the bridge owner’s control causes a traffic accident of sufficient magnitude to close the bridge. It’s conceivable that, say, a large commercial vehicle has a “blow-out”, causing the driver to lose control and resulting in a major accident. In this case the contingency plan outlined in the stress testing section of PWC’s report comes into play. As stated earlier, the business continuity plan provides a “backstop” for the overall resilience arrangements – both approaches complement one another. The resilience measures provide a driving environment which raises the bridge’s resilience to accidents by reducing risks related to speed, driver error and recklessness. The business continuity plan provides the capability to recover from bridge closure (within a target timeframe) for root causes that are outside of the bridge owner’s capability to influence.
The stress testing example in “Operational resilience: How to set and test impact tolerances” looks suspiciously like a business continuity scenario test to me. Why? Because it is a test of recovery approaches; it’s not a test that confirms that the resilience of the bridge to traffic accidents is at the required threshold. Admittedly, the bridge analogy poses quite a challenge to performing a meaningful stress test, because it requires modelling a physical environment – implying an approach based on modelling a range of traffic flows and accidents.
The challenges of stress testing with “the bridge analogy” are not as significant within the operational environments of financial services firms. Stress tests can be performed on a range of different activities and services to establish how operations would be likely to behave under different operational stress scenarios (such as IT infrastructure failures, significant transactional volume spikes, counterparty operational failures and cyber-attacks). The overall objective of stress testing is not to re-enact a specific scenario but to simulate threat manifestations and enable conclusions to be drawn on whether the organisation will be likely to buckle or remain stable under this pressure.
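To make the distinction from a scenario test concrete, here is a toy sketch of a volume-spike stress test. It is an illustration under my own assumptions rather than a technique from the PWC paper: the service is modelled as a queue with a fixed per-minute processing capacity, the tolerance is expressed as a maximum backlog, and volume is multiplied step by step until that tolerance is breached. All names and numbers are hypothetical.

```python
def peak_backlog(arrivals, capacity):
    """Largest queue of unprocessed transactions over a per-minute series."""
    backlog = peak = 0
    for arriving in arrivals:
        # Whatever cannot be processed this minute carries over to the next.
        backlog = max(0, backlog + arriving - capacity)
        peak = max(peak, backlog)
    return peak


def breaking_multiplier(baseline, capacity, max_backlog, multipliers):
    """First volume multiplier whose peak backlog exceeds the tolerance.

    Returns None if the service stays within tolerance for every multiplier.
    """
    for multiplier in multipliers:
        stressed = [round(volume * multiplier) for volume in baseline]
        if peak_backlog(stressed, capacity) > max_backlog:
            return multiplier
    return None


# Made-up figures: 100 transactions a minute against capacity for 150.
baseline = [100, 100, 100, 100, 100]
result = breaking_multiplier(baseline, capacity=150, max_backlog=300,
                             multipliers=[1, 2, 3, 4])
```

The point of the exercise is the breaking point itself: the test does not rehearse a recovery, it finds how much pressure the operation absorbs before it exceeds tolerance.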
Stress testing for resilience has been going on in the financial sector for some time – although not in areas of operational risk. Typically, the tests have focussed on simulating significant adverse changes to economic and financial market variables, which would be expected to put a firm’s business models under pressure. The insights drawn from these activities therefore provide indications of potential losses that might be incurred and the ability of the firm to survive them. For operational resilience, approaches are needed that replicate this type of testing in operational environments. Stress testing operational models establishes intrinsic levels of resilience: it is designed to put operational activities under pressure and observe the outcome. It requires techniques and test designs that differ from those currently used within business continuity exercises.
To conclude, I’d like to reiterate that I’m not critiquing PWC’s efforts here – their report will certainly assist financial firms in complying with many of the regulatory requirements of DP18. What I’m slightly uncomfortable with is the continued strong taste of business continuity and a “vacuum” of subject matter specifically related to resilience. This may, of course, be intentional on the part of the author of the PWC report, which makes no claim to cover everything.
My own opinion is that operational resilience is different from “vanilla” Business Continuity, can be measured and, using the right techniques, can be stress tested with meaningful outcomes. I’ll be sharing my ideas on this, looking at the specific areas of risk (which are not exclusive to financial services), over the coming weeks, and I look forward to any observations and feedback on my posts.