Resilience Is Not Redundancy. What Catastrophic Failure Actually Teaches You.

by Seb Matthews | Mar 5, 2026 | Government, Military

The Standard Playbook Is the Wrong Playbook

There is a version of this post that starts with a framework. A neat model, some boxes and arrows, a memorable acronym. That is not this post.

This post starts with the moment a platform serving a quarter of a million people stopped working, and the realisation that almost everything we thought we knew about resilience was wrong.

When organisations talk about resilience, they usually mean uptime. They invest in redundant hardware, failover architecture, and service-level agreements (SLAs) that measure availability in terms of nines. 99.9%. 99.99%. The higher the number, the more resilient the system.
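
For a sense of scale, the arithmetic behind the nines is worth doing once: an availability target is just an annual downtime budget in disguise. A minimal sketch, using standard arithmetic rather than figures from any particular SLA:

```python
# Annual downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for target in (0.999, 0.9999, 0.99999):
    budget = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.3%} availability allows ~{budget:,.0f} minutes of downtime per year")

# 99.900% -> ~526 minutes (roughly 8.8 hours)
# 99.990% -> ~53 minutes
# 99.999% -> ~5 minutes
```

The budget says nothing about how the downtime arrives: a platform can meet 99.9% and still be down for most of a working day in a single incident.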

This is not wrong exactly. But it is dangerously incomplete.

Redundancy is a strategy for delaying failure. It is not a strategy for surviving it. And at the scale of a genuinely complex, deeply integrated platform, the question is not whether the system will fail. It is when, and how fast you can recover when it does.

The programmes that figure this out usually do so the hard way.

What Catastrophic Failure Actually Looks Like

Op CADILLAC was the operation stood up in response to a catastrophic failure in a Ministry of Defence (MOD) business environment. Not battlespace systems, not mission-critical weapons or command infrastructure, but the digital fabric that 250,000 people use to do their jobs every day: communications, file management, shared services. The kind of systems that get classified as ‘business space’ rather than frontline capability, and therefore tend to sit lower on the priority list.

When it failed, it failed badly. And the first thing you discover in that situation is how many assumptions were wrong.

The recovery documentation existed. The disaster recovery (DR) plan was in place. The supplier contracts specified response obligations. What was missing was something those documents cannot capture: the practised, internalised knowledge of how to function under sustained pressure when the normal tools are gone, and the clock is running.

That gap showed up quickly. The troubleshooting procedures existed, but they were shallow. They assumed failure would manifest as something familiar, something you could test for and trace through the documentation. The reality was that the issues ran deep and the procedures were not built for that. The knowledge bases were well maintained. The real-world knowledge needed to use them under pressure was not.

Op CADILLAC bought time. It was a fast, focused effort to restore basic function and stop the bleeding. What it exposed, in the process of doing that, was a longer list of structural problems that quick fixes cannot solve. That became Op INFINITI, the operation tasked with implementing the proper remediation.

The relationship between the two is instructive. CADILLAC taught us what was actually broken. INFINITI gave us the chance to fix it properly. But the more important lesson was upstream of both: the failure did not have to be as damaging as it was.

Three Things That Separate Survivable From Not

Having led both operations, I would point to three factors that determine whether a programme can survive catastrophic failure or be defeated by it.

The first is clarity of command. In a real failure event, the governance structures that work fine under normal operations often buckle. Decision-making slows down precisely when it needs to speed up. The programmes that recover quickly are those in which someone has unambiguous authority to make calls, including uncomfortable ones, and everyone around them knows it. This is not about rank. It is about pre-agreed, rehearsed clarity: who decides, on what, and on whose authority. Without it, the recovery becomes a committee exercise at exactly the wrong moment.

The second is a rehearsed response. There is a significant difference between a team that has practised failure and a team that has only ever practised normal operations. Reading a recovery procedure in a crisis is completely different from executing one you have run before, even in a simulated environment. The people who perform well in real failure are almost always those who have already been through some version of it. That does not happen by accident. It has to be designed in, resourced, and taken seriously before anything goes wrong.

The third is graceful degradation. Not all functions are equally critical, even within a business environment. A platform that fails uniformly is significantly harder to recover from than one designed to shed load in an ordered way, preserving the most important functions while everything else degrades. This requires a difficult conversation upfront: what is the minimum viable service? What can we live without for twelve hours? For forty-eight? Most programmes avoid that conversation because it feels defeatist. The ones that have it are far better prepared when they need to be.
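
To make that concrete, here is a minimal sketch of what an ordered load-shedding policy can look like once those questions have been answered. The service names, tiers, and capacity figures are hypothetical, invented purely for illustration; the point is that the priority order is agreed and written down before the failure, not negotiated during it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Service:
    name: str            # hypothetical service names, for illustration only
    tier: int            # 1 = must survive; higher tiers are shed first
    capacity_units: int  # rough share of platform capacity the service consumes

# The catalogue is the output of the "what can we live without for twelve
# hours?" conversation, captured as data rather than left as opinion.
CATALOGUE = [
    Service("messaging", tier=1, capacity_units=3),
    Service("identity-and-access", tier=1, capacity_units=2),
    Service("file-access-read-only", tier=2, capacity_units=2),
    Service("file-sync-and-share", tier=3, capacity_units=4),
    Service("reporting-and-dashboards", tier=4, capacity_units=3),
]

def shed_load(available_units: int) -> list[Service]:
    """Keep the highest-priority services that fit in the remaining capacity.

    Services are considered in strict tier order and the walk stops at the
    first one that no longer fits, so the surviving set is always the most
    important prefix of the catalogue, never an arbitrary subset.
    """
    kept, used = [], 0
    for svc in sorted(CATALOGUE, key=lambda s: s.tier):
        if used + svc.capacity_units > available_units:
            break
        kept.append(svc)
        used += svc.capacity_units
    return kept

if __name__ == "__main__":
    # Suppose half of the platform's normal 14 units of capacity survives.
    for svc in shed_load(available_units=7):
        print(f"keep: {svc.name} (tier {svc.tier})")
```

The code itself is trivial; the valuable artefact is the catalogue. Forcing the priority order into something reviewable is what makes the twelve-hour and forty-eight-hour conversations happen before the outage rather than in the middle of it.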

The Gap Most Programmes Have and Will Not Admit To

If you are running or overseeing a complex digital programme, there is a simple test worth applying honestly.

When did you last run a full failure exercise? Not a tabletop discussion, not a review of the DR documentation, but an actual rehearsed test of recovery under realistic conditions? If the honest answer is never, or more than a year ago, you have a gap.

The second test is harder. Do the people responsible for managing a failure event actually know what to do, and have they practised it together recently? Procedures on a shared drive do not answer this question. The people who recovered the MOD platform did not succeed because they had good documentation. They succeeded because they were experienced, calm under pressure, and clear about what decisions they were empowered to make.

The third test is about design. Has your architecture ever been reviewed explicitly through the lens of graceful degradation? Not for availability, not for security, not for performance, but for the question ‘when this fails, how does it fail, and what do we preserve?’

The Line Between Business Space and Mission Critical Is Blurrier Than It Looks

It is tempting to use the ‘business space’ classification as a justification for investing less in resilience. These are not the systems that directly control platforms or deliver fires. They are the supporting infrastructure.

That argument does not survive contact with a real failure event. When 250,000 people cannot access the tools they need to communicate, coordinate, and do their jobs, the operational consequences accumulate quickly. The boundary between what counts as critical and what does not looks very different from inside a sustained outage than it does on a risk register.

The lesson is not that every programme needs to invest in the resilience engineering of a flight control system. The lesson is simpler and harder to argue with: understand what you actually cannot afford to lose, design for that, and practise recovering it before you have to.

Resilience Is a Capability, Not an Attribute

The standard way of talking about resilience treats it as something a system has. You design it in, you certify it, you point to the SLA and the redundancy architecture, and call it done.

That is not what resilience is.

Resilience is a capability that an organisation builds and maintains through deliberate practice. It degrades if it is not exercised. It is invisible until you need it, and by then, it is too late to build it from scratch.

Treat failure as certain. Design for recovery. Practise it before you need it.

Everything else is just hoping the nines hold.

Note: The Op names have been changed to protect the innocent.

Written by Seb Matthews

Author, speaker, and advisor on leadership under pressure and organisational performance.
