CrowdStrike releases root cause analysis after Blue Screen of Death issue that caused global outage



The large-scale CrowdStrike outage that occurred on July 19, 2024 (Japan time) affected 8.5 million Windows devices, leaving a huge number of systems inoperable in the aviation industry, hospitals, government agencies, and elsewhere. CrowdStrike has now released a root cause analysis report on the outage.

Falcon Content Update Remediation and Guidance Hub | CrowdStrike
https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/



External Technical Root Cause Analysis — Channel File 291
(PDF file)

https://www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf




CrowdStrike releases root cause analysis of the global Microsoft breakdown - ABC News
https://www.abc.net.au/news/2024-08-07/drt-crowdstrike-root-cause-analysis/104193866

CrowdStrike, which provides internet security products to companies and government agencies, regularly updates its products. The problem occurred with the Falcon platform, a cybersecurity program that provides automatic protection from malware, antivirus support, and incident response. The Falcon platform is cloud-based and works in conjunction with CrowdStrike's servers, eliminating the need for customers to install and manage additional equipment or software.

According to CrowdStrike, updates to the Falcon platform have been implemented multiple times per day since the program's release.

On July 19, 2024, CrowdStrike released a sensor configuration update for the Falcon platform to certain Windows hosts. 'The Falcon platform's sensors are a system that detects suspicious activity such as malware,' said Sigi Goode, professor of information systems at the Australian National University.

When a sensor configuration update is applied, the location and number of sensors in the program change. In the July 19, 2024 update, the sensor expected 20 input fields but was actually given 21. According to CrowdStrike, this 'counting discrepancy' caused the global outage.



'The content interpreter only expected 20 values. Therefore, when the content interpreter tried to access the 21st value, it read out-of-bounds memory beyond the end of the input data array, resulting in a system crash,' CrowdStrike reported.
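The mismatch CrowdStrike describes can be illustrated with a small sketch. This is not CrowdStrike's code: the function names `read_field` and `validate_template` are hypothetical, and Python is used only for illustration (the real sensor runs as native code in the Windows kernel, where an unchecked read past the end of an array can crash the machine rather than raise an error). The sketch shows two simple guards: checking an index before reading it, and rejecting an update that refers to more fields than were actually supplied.

```python
from typing import Optional, Sequence

# Number of input values the interpreter was built to receive, per the report.
EXPECTED_INPUTS = 20


def read_field(values: Sequence[str], index: int) -> Optional[str]:
    """Hypothetical bounds-checked read: None when index is outside the array."""
    if 0 <= index < len(values):
        return values[index]
    return None


def validate_template(field_count: int, supplied: Sequence[str]) -> bool:
    """Hypothetical pre-deployment check: the update must not reference
    more fields than the number of values actually supplied."""
    return field_count <= len(supplied)


# 20 supplied values occupy indexes 0 through 19.
inputs = [f"value{i}" for i in range(EXPECTED_INPUTS)]

print(read_field(inputs, 19))         # the 20th value, still in range
print(read_field(inputs, 20))         # the 21st value does not exist
print(validate_template(21, inputs))  # an update referencing 21 fields is rejected
```

With either guard in place, an attempt to use a 21st value is turned into a handled condition instead of a read beyond the end of the input array.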

CrowdStrike's software has privileged access to 'kernel mode,' which is responsible for the basic functions of Windows. Goode said, 'Kernel mode is constantly monitoring what you are doing, receiving requests from the applications you are using, and seamlessly servicing the applications.' In other words, because the Falcon platform runs at the heart of the PC's system, a failure in it crashes the entire system, which is how an outage on this scale came about.

In response to CrowdStrike's report, Associate Professor Toby Murray of the University of Melbourne's School of Computing and Information Systems said, 'Basic testing by human developers could have prevented this massive outage. The fundamental problem that caused this outage was a lack of proper quality review and assurance, which would have led to catastrophic problems sooner or later.'



CrowdStrike also said it has engaged two independent software security vendors to conduct a further review of the Falcon platform's sensor code, covering both security and quality assurance.

in Software, Posted by log1r_ut