The recent CrowdStrike outage has sent shockwaves through the global IT landscape, leading many to consider it one of the biggest failures in tech history. With over 8 million machines affected worldwide, including critical systems in airports, hospitals, and banks, the scale and impact of this outage are staggering. But what really happened? Who’s to blame? And what can we learn from this unprecedented failure?
We will break down the incident, understand the technical failures, and explore the lessons the industry can draw from this catastrophe.
What happened during the outage?
At the core of this massive disruption was a cascade of updates from both Microsoft and CrowdStrike, a leading cybersecurity provider. Initially, CrowdStrike released an update aimed at enhancing its Falcon sensor software through a change to a specific configuration file.
According to multiple accounts, systems were running fine with the initial updates from both companies. However, everything broke after Microsoft pushed another update. This subsequent configuration update created an incompatibility, leading to machines blue-screening, boot-looping, and, in some cases, becoming unrecoverable due to encryption issues, most notably those involving BitLocker.
On July 19, the CrowdStrike update was sent to certain Windows computers. The configuration file in question is one of what CrowdStrike calls ‘channel files’, which feed the behavioral protection mechanisms used by the Falcon sensor. Updates to channel files are standard procedure and occur several times a day in response to new threat intelligence. On Windows systems, channel files reside in the
C:\Windows\System32\drivers\CrowdStrike\
directory and have file names that start with “C-”, followed by a number that uniquely identifies each file. The channel file involved in this incident was number 291, so its name starts with “C-00000291-” and ends with a .sys extension; despite that extension, channel files are not kernel drivers. This particular channel file helps the Falcon software monitor named pipe execution on Windows systems. Named pipes are channels that different programs use to talk to each other.
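To make that concrete, here is a minimal sketch of one program handing a message to another over a Windows named pipe. It assumes the third-party pywin32 package, and the pipe name “demo_pipe” is made up purely for illustration; it has nothing to do with CrowdStrike’s own tooling.

```python
# Minimal named-pipe server on Windows (requires the third-party pywin32 package).
# The pipe name "demo_pipe" is purely illustrative.
import win32pipe
import win32file

PIPE_NAME = r"\\.\pipe\demo_pipe"

# Create the pipe and wait for a single client to connect.
pipe = win32pipe.CreateNamedPipe(
    PIPE_NAME,
    win32pipe.PIPE_ACCESS_DUPLEX,                  # allow reads and writes
    win32pipe.PIPE_TYPE_MESSAGE
    | win32pipe.PIPE_READMODE_MESSAGE
    | win32pipe.PIPE_WAIT,
    1,              # one instance is enough for the demo
    65536, 65536,   # output / input buffer sizes
    0,              # default timeout
    None,
)
win32pipe.ConnectNamedPipe(pipe, None)             # blocks until a client opens the pipe

win32file.WriteFile(pipe, b"hello over the pipe")  # hand a message to the other program
win32file.CloseHandle(pipe)

# Another process can act as the client using plain file APIs, for example:
#   with open(r"\\.\pipe\demo_pipe", "rb", buffering=0) as f:
#       print(f.read())
```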
The update was meant to detect and block new, harmful ways hackers were trying to use named pipes to control systems, particularly through command-and-control (C2) frameworks. This channel file update, coupled with the newer Windows configuration update, triggered a logic error resulting in an operating system crash.
For a more technical explanation of the events and mistakes behind the outage, be sure to check out this article by Sean Michael Kerner on TechTarget.
Who’s responsible?
The CrowdStrike incident brings up a crucial legal question: Who is responsible when IT systems fail on such a massive scale? There are two key angles to explore here:
- CrowdStrike’s fault: CrowdStrike is being scrutinized for releasing an update that wasn’t robust under certain configurations. Should their software have been able to handle a Windows update without such catastrophic consequences? Many in the tech industry believe that individual programs should shoulder the burden of ensuring compatibility with the operating system, but that doesn’t necessarily absolve them of all responsibility.
- Microsoft’s fault: Microsoft also bears some responsibility, as the second update it pushed was the catalyst that triggered this disaster. Some might argue that Microsoft’s patch should have been more thoroughly tested, especially given CrowdStrike’s enterprise-level integration with Windows. However, it’s nearly impossible for any company to test every possible software configuration.
Legal and practical complexities
The legal ramifications of this outage are complex. With such widespread damage, many organizations have already pursued lawsuits against both CrowdStrike and Microsoft. The courts will have to decide which party is more liable, but that’s easier said than done. Assigning blame here is like walking into a crime scene and pointing the finger at a random bystander, accusing them without fully understanding all the details. While it seems CrowdStrike’s software was unstable under the new Windows configuration, it’s equally valid to ask why Microsoft pushed the configuration in the first place.
The updates clashed in a way that was unforeseeable by either party due to gaps in the quality assurance (QA) process. Was it preventable? Possibly, but hindsight is always 20/20. Before this event, there likely wasn’t even a process in place to check for this kind of issue.
Consequences
This outage was not just a tech issue: it had real, and in some cases life-threatening, consequences. Hospitals had to cancel surgeries, patients missed critical treatments, and emergency services were knocked out in some countries. The enormity of these consequences shows how reliant the world has become on seamless IT infrastructure.
One of the most concerning aspects of this outage is that many machines may never recover. Systems encrypted with BitLocker, for example, are especially at risk if the necessary recovery keys were lost or not stored correctly. The scale of the damage, from encrypted machines to critical infrastructure like MRI machines, means that some organizations could face months of recovery, with certain systems being lost forever.
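For teams taking stock after the fact, one small sanity check is confirming that each machine’s BitLocker recovery key is actually escrowed somewhere retrievable. As a rough sketch, assuming a Windows machine, an elevated prompt, and the built-in manage-bde tool (the C: drive is just an example), an administrator could dump the key protectors like this:

```python
# Sketch: list the BitLocker key protectors for C: so the numerical recovery
# password can be checked against the organization's key-escrow records.
# Assumes Windows, BitLocker, and an elevated (administrator) prompt.
import subprocess

result = subprocess.run(
    ["manage-bde", "-protectors", "-get", "C:"],  # built-in Windows BitLocker CLI
    capture_output=True,
    text=True,
)
print(result.stdout)  # includes the recovery password protector, if one exists
```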
The IT environment
The process of recovering these machines is manual and painstaking. Many stories have already been shared on Twitter of IT workers having to physically visit each machine, track down where the encryption keys are stored, and manually remove a .sys file in the drivers folder (see the sketch below) to stop the machine from boot-looping. Considering some organizations have only a handful of IT personnel responsible for thousands of machines, the logistics of recovery are overwhelming. IT workers, many of whom are already overworked and undervalued, were pushed past their limit, with some reportedly working 72-hour shifts to bring systems back online.
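The widely circulated workaround boiled down to booting each affected machine into Safe Mode or a recovery environment and deleting the faulty channel file by hand. The sketch below shows roughly what that cleanup step looks like when scripted; treat it as an illustration of the manual fix described above rather than an official remediation procedure.

```python
# Illustration of the manual workaround: remove the faulty channel file
# (C-00000291*.sys) from the CrowdStrike drivers directory so the machine
# stops boot-looping. On an affected machine this had to be done from
# Safe Mode or a recovery environment, since Windows itself was crashing.
import glob
import os

CHANNEL_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

for path in glob.glob(os.path.join(CHANNEL_DIR, "C-00000291*.sys")):
    print(f"Removing {path}")
    os.remove(path)
```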
Was this preventable?
This is one of the key questions raised by the disaster. While it’s easy to point fingers and say that both Microsoft and CrowdStrike should have tested their updates more thoroughly, the reality is more nuanced. Blind spots in the QA process exist, especially when dealing with the vast array of software configurations present in modern IT environments. This incident has prompted many organizations to rethink how they test software updates, particularly in mission-critical environments.
There are already investigations underway to determine exactly what went wrong and how to prevent similar events in the future. There will likely be new industry standards established for testing updates across various configurations, but it’s unlikely that any process will ever be foolproof.
Lessons learned
- The fact that medical systems, airports, and emergency services could be taken down so thoroughly shows a lack of redundancy. In environments where lives are at stake, more robust failover systems need to be in place.
- While Microsoft and CrowdStrike undoubtedly work closely together, this incident suggests that their collaboration could be strengthened, particularly when pushing new updates. More frequent and thorough testing between the two vendors could prevent such catastrophes.
- The unsung heroes in this situation are absolutely the IT workers who are manually fixing these systems. This crisis has made very clear that they play a critical role in keeping the world spinning. They deserve far more respect than they get, and more resources should be devoted to supporting them.
The CrowdStrike outage is a wake-up call for the entire IT industry, and for the world. The sheer scale of this failure and its far-reaching consequences show just how fragile our infrastructure can be when things go wrong. Whether or not this was preventable, one thing is clear: the world needs better processes and a renewed focus on safeguarding our systems.