Did AI cause the CrowdStrike crash?

I recently saw a Substack post from a technology analyst titled “Crowdstrike publicly confirms AI Caused worldwide computer outage.” Since none of the other coverage I’ve seen suggested that AI was the cause of the outage, I looked into it. My conclusion: AI did not cause the outage, and it’s irresponsible to suggest that it did.

What actually happened?

On July 19, 2024, millions of Windows computers crashed, showing the nasty “Blue Screen of Death.” And this was no ordinary crash: to get each computer working again, IT staffers had to restart it in safe mode and delete a file, which took many hours of effort. Companies were crippled, especially travel companies, whose IT woes cancelled or delayed many thousands of flights.

The application at the center of this problem is the Falcon application from CrowdStrike, which makes security software. Unlike other applications, Falcon runs in the Windows kernel, which allows it to monitor operations in real time across the OS and report on what it’s observing. Normal applications are unable to modify the kernel, so when they crash, they don’t normally take down the whole operating system. Falcon obeyed no such safeguards.

Microsoft estimates that 8.5 million PCs were affected. All of them were in organizations that had signed up for CrowdStrike’s security software. When CrowdStrike sent out the faulty update, the PCs in those organizations attempted to run the updated software and crashed, requiring a reboot in safe mode and a manual fix before they could function again.

Here’s the relevant part of CrowdStrike’s incident report:

The CrowdStrike Falcon sensor delivers AI and machine learning to protect customer systems by identifying and remediating the latest advanced threats. In February 2024, CrowdStrike introduced a new sensor capability to enable visibility into possible novel attack techniques that may abuse certain Windows mechanisms. This capability pre-defined a set of fields for Rapid Response Content to gather data. As outlined in the RCA, this new sensor capability was developed and tested according to our standard software development processes.

On March 5, 2024, following a successful stress test, the first Rapid Response Content for Channel File 291 was released to production as part of a content configuration update, with three additional Rapid Response updates deployed between April 8, 2024 and April 24, 2024. These performed as expected in production.

On July 19, 2024, a Rapid Response Content update was delivered to certain Windows hosts, evolving the new capability first released in February 2024. The sensor expected 20 input fields, while the update provided 21 input fields. In this instance, the mismatch resulted in an out-of-bounds memory read, causing a system crash. Our analysis, together with a third-party review, confirmed this bug is not exploitable by a threat actor.

While this scenario with Channel File 291 is now incapable of recurring, it informs the process improvements and mitigation steps that CrowdStrike is deploying to ensure further enhanced resilience.

Anyone with experience in coding is familiar with these sorts of errors. When one part of a piece of software is connected to another and the two parts disagree about file formats, that can crash the program. Languages like C++, in which the Falcon application was coded, do not automatically protect against out-of-bounds memory problems like this. As a result, it’s easy to make an error that accesses a sensitive part of the computer’s memory and crashes the machine.
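To make that concrete, here is a minimal C++ sketch of the general pattern. It is not CrowdStrike's code; the names FieldValues and read_field are hypothetical, and the only detail taken from the incident report is that 20 values were available when a 21st was requested. The unchecked read appears only as a comment, because actually performing it would be undefined behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical stand-in: the consuming code was built to hold 20 values.
constexpr std::size_t kSuppliedFields = 20;
using FieldValues = std::array<std::string, kSuppliedFields>;

// Unchecked access: values[index] with index >= 20 is undefined behavior.
// C++ does not stop it, and in kernel code such a read can touch invalid
// memory and bring down the whole machine. Left as a comment on purpose:
//   const std::string& v = values[index];

// Checked access: refuse any index outside the data that actually exists.
std::optional<std::string> read_field(const FieldValues& values,
                                      std::size_t index) {
    if (index >= values.size()) {
        return std::nullopt;  // caller can reject the content instead of crashing
    }
    return values[index];
}

// Usage sketch: index 19 is the last valid field; index 20 (a 21st field)
// is reported as missing rather than read from out-of-bounds memory.
void usage_example() {
    FieldValues values{};
    values[19] = "last supplied value";
    assert(read_field(values, 19).has_value());
    assert(!read_field(values, 20).has_value());
}
```

The point of the sketch is that the safeguard is an ordinary bounds check. Nothing about it depends on whether the surrounding program uses machine learning.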

CrowdStrike’s Root Cause Analysis (RCA) goes into far more detail. After reviewing what happened, the remediation recommendations include validating the number of fields used when the code is compiled, checking for out-of-bounds memory reads, testing for a wider variety of criteria, and developing new modes of testing. There is no recommendation to avoid AI or machine learning as inherently unsafe.
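The first two recommendations can be sketched in a few lines of C++. This is an illustration under stated assumptions, not the fix CrowdStrike shipped: kDefinedFields, InputBinding, and content_is_safe are hypothetical names, and the figure 21 comes from the incident report quoted above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: the number of fields the content definition declares.
constexpr std::size_t kDefinedFields = 21;

// Compile-time validation: code that supplies the wrong number of inputs
// fails to build, so the mismatch never reaches a customer machine.
template <std::size_t Supplied>
struct InputBinding {
    static_assert(Supplied == kDefinedFields,
                  "supplied inputs must match the fields the definition declares");
    std::array<std::string, Supplied> values;
};

// Runtime validation: before using new content, confirm that every field
// index it refers to exists in the data that was actually supplied.
bool content_is_safe(const std::vector<std::size_t>& referenced_indices,
                     std::size_t supplied_count) {
    for (std::size_t index : referenced_indices) {
        if (index >= supplied_count) {
            return false;  // would be an out-of-bounds read; reject the content
        }
    }
    return true;
}

void usage_example() {
    InputBinding<kDefinedFields> binding{};  // builds: 21 supplied, 21 defined
    // InputBinding<20> broken{};            // would not compile: 20 != 21
    assert(binding.values.size() == kDefinedFields);

    assert(content_is_safe({0, 19}, 20));   // every index is within 20 values
    assert(!content_is_safe({20}, 20));     // a 21st field with only 20 supplied
}
```

Both checks are routine defensive programming of the kind that applies to any software reading externally supplied configuration.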

What was the cause of the problem?

Causation is always tricky, since a number of things generally need to happen to cause an incident such as this. So let’s list some of the things that contributed and ask, if each of these elements were different, would the crash still have occurred?

  • The Falcon application was inadequately tested. If the update had been tested more extensively, that could have detected and prevented the problem.
  • The Falcon application was automatically updated. Because the modified version of the application was distributed automatically to all the subscribed machines, it was able to crash them as soon as it was distributed.
  • The Falcon application was written in C++. If it had been written in a language that managed memory differently, it could not have caused a crash of this type.
  • The Falcon application was run in the Windows OS kernel. A non-kernel application that crashed would only have affected the one application, not the whole machine.
  • The Falcon application used AI and machine learning. Of all the possible causes, this is the most questionable. AI and machine learning applications are not inherently less stable than other applications. A non-AI application with characteristics similar to Falcon could easily have crashed all those machines.

Put simply, an inadequately tested, automatically updated C++ application that ran in the kernel could and did crash all those machines. The fact that the application used AI was not essential to the crash, nor is there any reason to believe that this characteristic of the application contributed to the problem.

I asked security experts about this. One responded, “Hi. Josh. You are right, this has nothing to do with AI. It’s a programming error, a test error, could have happened to any program. The fact that the program does some AI tasks has nothing to do with this error.”

Is it time to shut down AI?

David Daniels, the analyst who posted that AI caused the CrowdStrike outage, has also suggested that all consumer access to generative AI should be shut down, and cited that opinion in his post about CrowdStrike. I have to point out that 1) Falcon was not a generative AI application and 2) it was not a consumer application, so such a shutdown would have made no difference in this instance. Daniels frequently comments on a wide variety of posts about AI on LinkedIn, almost always with something like this: “AI is over. AI caused the worldwide CrowdStrike outage. AI sparked the deadly UK riots. AI is HASTENING the pace of climate change. AI is dangerous and inefficient.” He then links to his Substack post on how AI caused the CrowdStrike outage.

I know people who believe that AI will solve all the problems of humanity. Others are certain that the value of AI is overblown and it’s nowhere near that powerful. And many are concerned about the power of AI to create deepfakes and other pernicious effects. This is a debate worth having.

But the terms AI and machine learning encompass a vast collection of techniques that are already in place. Machine-learning-driven applications trade stocks, predict when airplane engines need maintenance, read and summarize huge collections of text to enable lawyers and customer support staff to be more efficient, and power thousands and thousands of other applications. At this point, banning “AI” would be virtually impossible without disrupting the entire economy, and there is no way to test when an application is or is not “AI.” As for the emergence of more powerful “generative AI” applications, it’s hard to distinguish them from other AI applications, and genAI had nothing to do with the CrowdStrike crash.

In a heated and important debate like this, it’s essential to bring facts, not emotion, to bear. Evidence of the dangers of AI is important. Evidence of the benefits of AI is, too. Any given technology is never all good or all evil. Intelligent regulation requires a more nuanced understanding of where the problems are and what it will take to identify and fix them.

I welcome that debate and would be pleased to participate in it. But if you start with “AI caused the CrowdStrike crash,” and repeat that opinion to anyone who will listen, you’ve undermined your credibility to participate in a logical debate.
