Microsoft reports on CrowdStrike outage analysis and outlines future improvements



Microsoft has published a detailed analysis of the large-scale outage caused by a CrowdStrike update on July 19, 2024, which left Windows machines stuck in a loop of blue screens and forced reboots.

Windows Security best practices for integrating and managing security tools | Microsoft Security Blog

https://www.microsoft.com/en-us/security/blog/2024/07/27/windows-security-best-practices-for-integrating-and-managing-security-tools/



Microsoft is Working with the Security Industry to Prevent Another CrowdStrike Outage - Thurrott.com

https://www.thurrott.com/cloud/306255/microsoft-is-working-with-the-security-industry-to-prevent-another-crowdstrike-outage

Microsoft finally explains the root cause behind CrowdStrike outage - Neowin
https://www.neowin.net/news/microsoft-finally-explains-the-root-cause-behind-crowdstrike-outage/

Microsoft confirmed CrowdStrike's finding that the problem was caused by an out-of-bounds read, a memory safety error, in CrowdStrike's CSAgent.sys driver. According to Neowin, CSAgent.sys is a driver registered with Windows to receive notifications about file operations such as file creation and modification, which allows security products such as CrowdStrike's to scan new files as they are saved to disk.

The detailed causes of the problem are summarized in the following article.

What was wrong with CrowdStrike's code that caused many Windows to have blue screens?



The CrowdStrike outage was caused by a mistake in a configuration file that CrowdStrike updates and distributes several times a day. The file specified an invalid address, so the kernel driver tried to read invalid memory and the system crashed.
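The failure class described above, a driver trusting a value from an update file and reading outside a valid range, can be illustrated with a minimal Python sketch. Everything in it is hypothetical: `RULE_TABLE`, `ConfigError`, `lookup_rule`, and `load_update` are illustration names, not CrowdStrike's or Microsoft's code, and Python raises an exception where kernel-mode native code would crash the whole system. The only point is that a value taken from an update file is checked against the table bounds before it is used.

```python
class ConfigError(ValueError):
    """Raised when an update file refers to data outside the valid range."""


# Hypothetical in-memory table that the driver indexes into.
RULE_TABLE = ["allow", "block", "quarantine"]


def lookup_rule(index_from_config: int) -> str:
    """Return the rule at the given index, rejecting out-of-range values."""
    # Reject non-integers (bool is a subclass of int, so exclude it as well).
    if not isinstance(index_from_config, int) or isinstance(index_from_config, bool):
        raise ConfigError(f"index must be an integer, got {index_from_config!r}")
    # Explicit bounds check: native code without it would read invalid memory.
    if not 0 <= index_from_config < len(RULE_TABLE):
        raise ConfigError(
            f"index {index_from_config} is outside 0..{len(RULE_TABLE) - 1}"
        )
    return RULE_TABLE[index_from_config]


def load_update(indices: list[int]) -> list[str]:
    """Validate every entry of an update; raise if any entry is invalid."""
    return [lookup_rule(i) for i in indices]
```

With this check in place, a malformed update is rejected with an error instead of being applied, so the bad file never reaches the code that reads the table.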

Microsoft notes that some users have criticized it for giving third-party developers such as CrowdStrike kernel-level access. Explaining why security products are given that access, Microsoft says: 'Kernel drivers provide greater visibility into the entire system and can detect threats such as malware and rootkits that may load early in the boot process and before user-mode applications.' 'Kernel drivers can provide better performance in situations such as high-throughput network activity.' It also cites tamper resistance: Windows provides Early Launch Antimalware (ELAM) early in the boot process so that malware and other attackers cannot disable security software, even when they hold administrator-level privileges.

On the other hand, it has been pointed out that kernel drivers reduce the potential resiliency of the machines they are installed on, and that the containment and recovery options available when something goes wrong are very limited. To address this, Microsoft has been moving complex core services from the kernel to user mode, and in 2019 it added protective features such as TPM 2.0 and Secure Boot to its security baseline, significantly raising the security defaults of Windows.



As future improvements, Microsoft has stated that it will 'provide safe rollout guidance, best practices, and technologies to make security product updates safer,' 'reduce the need for kernel drivers to access critical security data,' 'provide enhanced isolation and tamper resistance through technologies such as VBS enclaves,' and 'enable zero trust approaches such as high integrity attestation, which provides a way to determine the security state of a machine.'
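One common form of "safe rollout" is a staged deployment, in which an update reaches a small share of machines first and widens only while those machines stay healthy. The sketch below is an assumption about what such a practice could look like, not a description of Microsoft's guidance or CrowdStrike's mechanism. The stage percentages, the `HALTED` marker, the function names, and the machine IDs are all hypothetical.

```python
import hashlib

# Hypothetical rollout stages: cumulative percentage of machines per stage.
STAGES = [1, 10, 50, 100]
HALTED = -1  # marker meaning "rollout stopped, deliver to nobody"


def bucket(machine_id: str) -> int:
    """Map a machine ID to a stable bucket in 0..99 using SHA-256."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(machine_id: str, stage: int) -> bool:
    """True if the machine should receive the update at the given stage."""
    if not 0 <= stage < len(STAGES):
        return False  # halted or unknown stage: do not deliver
    return bucket(machine_id) < STAGES[stage]


def next_stage(stage: int, crash_reports: int, max_crashes: int = 0) -> int:
    """Advance one stage if the current one stayed healthy, else halt."""
    if stage == HALTED or crash_reports > max_crashes:
        return HALTED
    return min(stage + 1, len(STAGES) - 1)
```

Because the bucket is derived from a hash of the machine ID, a machine that received the update at an early stage stays included at every later stage, and a halted rollout delivers the update to no further machines.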

CrowdStrike CEO George Kurtz reported that as of July 26, 2024, one week after the outage occurred, approximately 97% of affected systems had been restored.

97% of Windows users worldwide were restored within a week of the massive CrowdStrike outage that caused the system to go blue screen - GIGAZINE

in Software, Posted by log1r_ut