CrowdStrike Blames Internal Testing Faults for Mass Outage

In an initial post-mortem, Austin-based security firm CrowdStrike said that an issue in its testing software led to the company pushing through a faulty update that took down more than 8.5 million Windows systems last week.

In a Wednesday blog post, the company gave more detail on the massive outage, which led to flight cancellations and disrupted public services such as 911 systems. On July 19, CrowdStrike issued a "Rapid Response" content configuration update for its Falcon software, intended to gather information on current and ongoing security incidents.

CrowdStrike's Rapid Response updates aim to react to the overall threat landscape "at operational speed." According to the company, the Rapid Response update pushed through on July 19 contained an error in one of its two "Template Instances." Here's CrowdStrike's summary of Template Instances:

"Rapid Response Content is delivered as 'Template Instances,' which are instantiations of a given Template Type. Each Template Instance maps to specific behaviors for the sensor to observe, detect or prevent. Template Instances have a set of fields that can be configured to match the desired behavior."

Before being pushed through, Template Instances are first checked by the company's Content Validator. The two Template Instances pushed through last week both passed validation, despite one of the two containing corrupted content data. This is what caused Windows machines running Falcon sensor version 7.11 and above to crash.
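To make the Template Type and Template Instance relationship concrete, here is a minimal Python sketch of a validation step. It is not CrowdStrike's Content Validator: the class names, the field names and the `validate` function are all hypothetical, and the only thing checked is whether an instance configures exactly the fields its type declares.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TemplateType:
    """Hypothetical schema: the field names an instance may configure."""
    name: str
    field_names: Tuple[str, ...]


@dataclass
class TemplateInstance:
    """Hypothetical instance: concrete values for a Template Type's fields."""
    template_type: TemplateType
    values: Dict[str, str]


def validate(instance: TemplateInstance) -> List[str]:
    """Return a list of problems; an empty list means the instance passed."""
    expected = set(instance.template_type.field_names)
    provided = set(instance.values)
    problems = []
    # Fields the type declares but the instance never sets.
    for missing in sorted(expected - provided):
        problems.append(f"missing field: {missing}")
    # Fields the instance sets but the type does not declare.
    for extra in sorted(provided - expected):
        problems.append(f"unknown field: {extra}")
    return problems
```

A check like this only catches the mismatches it was written to look for, which is one way a validator can pass content that later fails at runtime.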

"When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception," said CrowdStrike. "This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)."
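As a rough illustration of what handling such a read gracefully can look like, the Python sketch below checks an index before using it and skips bad references instead of failing. The functions `read_field` and `interpret` are hypothetical and are not taken from the Falcon sensor, which runs as a Windows kernel driver where an unchecked read has far more severe consequences than in Python.

```python
from typing import List, Optional, Sequence


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None when the index is out of range."""
    # Hypothetical guard: check the index before touching the data.
    # Negative indexes are rejected too, since Python would wrap them around.
    if 0 <= index < len(fields):
        return fields[index]
    return None


def interpret(fields: Sequence[str], wanted: Sequence[int]) -> List[str]:
    """Collect the requested fields, skipping any reference that is out of range."""
    values = []
    for index in wanted:
        value = read_field(fields, index)
        if value is None:
            # Ignore the bad reference and keep running instead of crashing.
            continue
        values.append(value)
    return values
```

For example, `interpret(["a", "b"], [0, 5, 1])` returns `["a", "b"]`, silently dropping the out-of-range index 5.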

In its preliminary assessment of the situation, CrowdStrike puts part of the blame on its internal testing procedures. While the company runs extensive human and AI testing for its main Sensor Content, it relies only on the Content Validator for its smaller Rapid Response releases.

To avoid similar incidents in the future, the company says it will change how it tests and deploys its Rapid Response updates. It plans to employ local developer testing, increased rollback and content testing, content interface testing and additional stress testing before any updates are pushed through.

As for deployment of content, the company said it will be making the following changes:

  • Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
  • Improve monitoring for both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.
  • Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.
  • Provide content update details via release notes, which customers can subscribe to.
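The staggered approach in the first two items can be sketched in a few lines of Python. This is an illustrative sketch only, not CrowdStrike's deployment system: the `staged_rollout` function, its parameters and the stage fractions are assumptions made for the example. The first stage acts as the canary, and the rollout halts as soon as any host in a stage reports a problem.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to growing slices of hosts; stop after the first unhealthy stage.

    Returns the hosts that received the update.
    """
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative target: how many hosts should be updated once this stage ends.
        target = min(len(hosts), max(1, int(len(hosts) * fraction)))
        batch = hosts[len(updated):target]
        for host in batch:
            deploy(host)
            updated.append(host)
        # Monitoring feedback gates the next stage.
        if not all(is_healthy(host) for host in batch):
            break
    return updated
```

With ten hosts and fractions of 0.1, 0.5 and 1.0, the update reaches one canary host first, then five in total, then all ten. A failure detected in the second stage leaves the remaining five hosts untouched.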

CrowdStrike said it will release a full "Root Cause Analysis" on the incident once its investigation is complete. In the meantime, some customers are reporting that the company has sent them a $10 Uber Eats gift card as an apology for the outage.

About the Author

Chris Paoli (@ChrisPaoli5) is the associate editor for Converge360.
