What Really Happened? CrowdStrike Investigation Reveals Cause of Global IT Crash
As the fallout continues, CrowdStrike has released a post-incident review (PIR) detailing the buggy update that caused 8.5 million Windows machines to crash last week. The review attributes the issue to a flaw in the test software, which failed to properly validate the content update pushed out on Friday. In response, CrowdStrike has pledged to enhance its content update testing, improve error handling, and implement a staggered deployment strategy to prevent future incidents.
CrowdStrike’s Falcon software, used globally by businesses to protect against malware and security breaches on millions of Windows machines, received a problematic content configuration update intended to “gather telemetry on possible novel threat techniques.” These updates are typically routine, but this specific update led to widespread crashes.
CrowdStrike issues configuration updates in two ways: Sensor Content, which updates the Falcon sensor operating at the kernel level in Windows, and Rapid Response Content, which adjusts the sensor’s behavior for malware detection. The issue on Friday stemmed from a tiny 40KB Rapid Response Content file.
Sensor Content is not updated on the fly from the cloud. It ships with sensor releases and typically includes AI and machine learning models that improve long-term detection capabilities. It also includes Template Types: code that enables new detection methods and is configured by Rapid Response Content of the kind delivered on Friday.
CrowdStrike manages a cloud-based system to validate content before release, aiming to prevent incidents like Friday’s crash. However, a bug in the Content Validator allowed one of two Rapid Response Content updates (Template Instances) to pass despite containing problematic data. CrowdStrike typically performs automated and manual testing on Sensor Content and Template Types but appears to have overlooked thorough testing for the Rapid Response Content that caused the crash.
As a result, the sensor loaded the faulty Rapid Response Content into its Content Interpreter, which triggered an out-of-bounds memory read and an exception the sensor could not handle gracefully, crashing Windows with a Blue Screen of Death (BSOD).
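The failure mode described above can be illustrated with a minimal sketch. It is not CrowdStrike's code: the `TemplateInstance` class, the `field_indexes` attribute, and both functions are hypothetical names invented for illustration, and the review does not describe the content format at this level of detail. The sketch assumes a content entry that asks the interpreter to read input fields by position, and shows how a range check applied both before release and again at load time keeps one bad entry from becoming a crash. In Python an out-of-range read raises `IndexError`; in a kernel-mode driver the equivalent read can take down the whole machine.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical stand-in for one Rapid Response Content entry."""
    name: str
    field_indexes: tuple[int, ...]  # positions the entry wants to read


def validate(instance: TemplateInstance, field_count: int) -> list[str]:
    """Return a list of problems; an empty list means the entry may ship."""
    return [
        f"{instance.name}: index {i} outside 0..{field_count - 1}"
        for i in instance.field_indexes
        if not 0 <= i < field_count
    ]


def interpret(instance: TemplateInstance, fields: list[str]) -> list[str] | None:
    """Read the requested fields, skipping a bad entry instead of faulting."""
    if validate(instance, len(fields)):
        return None  # ignore the bad content and keep running
    return [fields[i] for i in instance.field_indexes]
```

Running the same check in two places is deliberate in this sketch: the validator stops bad content from shipping, and the interpreter still protects itself if the validator has a bug.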
New Measures
To prevent future occurrences, CrowdStrike has committed to improving Rapid Response Content testing through:
- Local developer testing,
- Content update and rollback testing,
- Stress testing, fuzzing, and fault injection,
- Stability testing and content interface testing.
Additionally, CrowdStrike is updating its cloud-based Content Validator to better check Rapid Response Content releases. “A new check is in process to guard against this type of problematic content from being deployed in the future,” CrowdStrike stated.
On the driver side, CrowdStrike will enhance error handling in the Content Interpreter, part of the Falcon sensor. The company will also implement a staggered deployment of Rapid Response Content, gradually rolling out updates to larger portions of its user base instead of an immediate push to all systems.
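A staggered deployment of this kind can be sketched in a few lines. This is an illustration under stated assumptions rather than CrowdStrike's actual mechanism: the function name `staged_rollout`, the stage fractions, the `deploy` callback, and the 1% failure threshold are all hypothetical. The idea is to push an update to a small slice of hosts first, check how they fare, and widen the rollout only if the failure rate stays under the threshold.

```python
from __future__ import annotations

from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> tuple[list[str], bool]:
    """Deploy in widening stages; stop as soon as a stage looks unhealthy.

    Returns the hosts that received the update and whether the rollout
    reached every host.
    """
    touched: list[str] = []
    start = 0
    for fraction in stage_fractions:
        # Each fraction is cumulative: 0.1 means "up to 10% of all hosts".
        end = max(start, min(len(hosts), round(len(hosts) * fraction)))
        stage = list(hosts[start:end])
        if not stage:
            continue
        failures = sum(1 for host in stage if not deploy(host))
        touched.extend(stage)
        if failures / len(stage) > max_failure_rate:
            return touched, False  # halt: later stages never get the update
        start = end
    return touched, start == len(hosts)
```

With 100 hosts and stages of 1%, 10%, and 100%, an update that fails on every host would reach only one machine before the rollout halts, instead of all 100.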
Fallout
The full extent of the economic damage is still being assessed and may never be fully known. On Saturday, Microsoft said about 8.5 million Windows devices had been affected. A report by insurer Parametrix, cited by Reuters, estimated on Wednesday that the total direct financial loss to US Fortune 500 companies, excluding Microsoft, was $5.4 billion. Delta is among the global airlines still struggling to fully restore systems, leading to more cancellations and delays.
Malaysia has publicly demanded that both CrowdStrike and Microsoft cover the losses incurred in the country. In the UK, most systems are back online, though manually removing the rogue code is reportedly taking time for some Windows operators without dedicated IT teams. The UK's National Health Service (NHS) has warned of a knock-on effect due to thousands of lost appointments.
Many cybersecurity experts believe the review shows that the firm made major mistakes, in particular by deploying a so-called "rapid response update" to all customers at once.
The incident has also raised concerns among experts that many organisations are not well-equipped to implement contingency plans when a single point of failure such as an IT system, or a piece of software within it, goes down.
CrowdStrike CEO George Kurtz has apologised for the impact of the outage. He has also been asked to testify before the US House of Representatives’ Homeland Security Committee.
Additional links from CrowdStrike and other technology vendors:
- Workaround steps for individual hosts
- Workaround steps for public cloud or similar environment including virtual
- AWS-specific documentation
- Azure environments – CrowdStrike Falcon agent guidance from Microsoft
- User Access Recovery Key in the Workspace ONE Portal
- Windows encryption management via Tanium
- Bitlocker recovery via Citrix
- Intel vPro® technology remediation guide
- CrowdStrike and Rubrik Customer Content Update Recovery For Windows Hosts
- Recovery Tool to help with CrowdStrike issue impacting Windows endpoints
Source: https://www.linkedin.com/pulse/what-really-happened-crowdstrike-investigation-mc42e/