Earlier this week, a failure at Amazon Web Services (AWS) – the cloud computing infrastructure that underpins huge swathes of the internet – triggered a cascading outage. From communication and collaboration tools like Slack and Zoom to payment services, critical websites and banking services from Lloyds, Bank of Scotland and Halifax, the knock-on effects impacted users worldwide. 

These events are no longer shocks; they are the new operational reality. The systems and suppliers that the world now runs on are deeply interdependent. When one breaks, the impact can cascade across countless other services. In this environment, the question is no longer ‘if’ the next major shock will hit, but ‘when.’

The greatest barrier to surviving this reality is a leadership mindset stuck on prevention. Leaders are understandably focused on stopping incidents, but today’s interconnectedness means we can no longer fully control our own operational reality. When our services depend on a complex supply chain of other providers, we cannot guarantee ‘perfect safety’. Something will always break.

The real work, therefore, is shifting from prevention alone to a deliberate balance of preparation, response, and recovery. The most important question for a leader is not “Have we stopped all bad things from happening?” – but “When a bad thing happens, how quickly can we recover?”

This recovery speed is the true measure of organisational resilience. Resilience is built, not bought. It’s a muscle strengthened through consistent practice.

Building resilience

Building this muscle cannot be delegated solely to the IT department; it is a “whole-organisation challenge.” It requires empowered teams and, most importantly, a culture of learning, not blame. When learning is the goal, every incident becomes an investment in future strength. When blame is the default, problems are hidden, and the organisation becomes progressively more brittle.

Building this muscle requires intentional practice, and it starts with leadership driving three core activities. First, leaders must map their territory, ensuring the organisation has a clear, shared picture of its critical services. Next, they must test that reality, creating a safe-to-fail environment to expose blind spots before a crisis does. Finally, they must drill the response, normalising testing so that practising for failure becomes a routine, productive exercise.

This is more than a defensive strategy; it’s a source of deep competitive advantage. When a systemic shock hits – whether from an accidental outage, a malicious cyber-attack, or a critical supply chain failure – preparedness is the difference between collapse and continuity.

For example, during last year’s CrowdStrike incident, United Airlines was able to recover remarkably quickly, an advantage it has credited to its investment in recovery preparation, agility, teams empowered to solve issues, and clear communication. Other airlines took much longer, with Delta estimating that the incident cost it $550m.

While the unprepared are paralysed, the resilient organisation is already executing its recovery playbook, maintaining customer trust, and getting back online.

This week’s AWS incident is another warning. Building resilience is not a technical problem. It is a fundamental duty of leadership, as vital to businesses as financial stewardship and legal integrity.

Dai Vaughan is the chief technology officer of Public Digital 
