CrowdStrike Official RCA is now out [pdf] (crowdstrike.com)
120 points by Sarkie 10 months ago | 36 comments



> In summary, it was the confluence of these issues that resulted in a system crash: [...] the lack of a specific test for non-wildcard matching criteria in the 21st field.

I feel they focus a lot on their content validator lacking a check to catch this specific error (probably since that sounds like a more understandable oversight) when the more glaring issue is that they didn't try actually running this template instance on even a single machine, which would've instantly revealed the issue.

Even for amateur software with no unit/integration tests, the developer will typically still have run it on their own machine to see it working. Here CrowdStrike seem to have been flying blind, just praying new template instances work if they pass the validation checks.

They do at least promise to "ensure that every new Template Instance is tested" further down.


Absolutely. This is the number one issue I see causing problems with devs on my teams. It is extremely simple to test your damn work. Smoke test it. Make sure the damn machine boots. Make sure the app runs.

This is covered in part by a staged deployment... but that's just having your users test for you. Where's the automated integration test, or just the boot test?


It doesn't even cover the barest of organisational root causes. How are they planning to do defense in depth and prevent an internal threat actor from wedging every machine in the world?


CrowdStrike takes itself seriously, for a security company. That means: don't ask questions of the experts.

Everyone else sees these services as the patsy when the problem happens.

From a technical perspective it's a hot mess (you are spot on). But business says "everything is fine, this is fine, carry on", because it meets their goal of CYA.


That's a lot of words to say "We did not test a file that gets ingested by a kernel-level program, not even once"

At no point did they deploy this file to a computer they owned and attempt to boot it. They purposely decided to deploy new behavior to every computer they could without even once making sure it wouldn't break from something stupid.

Are these people fucking nuts?

I do more testing than this and I might be incompetent. Also nothing I touch will kill millions of PCs. I get having pressure put on you from above, I get being encouraged to cut corners so some shithead can check off a box on his yearly review and make more money while stiffing you on your raise, I get making mistakes.

But like, fuck man, come on.


I think it is worse than that. When I make a change to some code or config, I'll run it locally to make sure that the change has the effect that I want. I know that we are human and that bugs occasionally appear in code. But what I can't understand is that the human who initiated this change decided not to see if it actually did what they wanted it to do.

I've made changes on personal projects that I thought were simple, and yet broke stuff. But CrowdStrike is a multi-billion dollar company -- how can it be possible to have such a broken process? Their RCA document was interesting, but didn't cover any of the interesting issues. It seems that they either don't know about the 5 Whys process (https://en.wikipedia.org/wiki/Five_whys) or decided that those answers were so embarrassing that they had to omit them.


> When I make a change to some code or config, I'll run it locally to make sure that the change has the effect that I want.

It's not uncommon for devs to be working against outdated databases / config dumps. It's certainly bad practice, but when devs have the option of being lazy vs doing chores, they will pick the path of least resistance.

> But what I can't understand is that the human who initiated this change decided not to see if it actually did what they wanted it to do.

We're assuming that the person who changed the code also made the choice to initiate the rollout. They are two separate actions which can be made by separate individuals, and could also involve many steps in between, each undertaken by a separate individual as well.

Distance from Prod does introduce a sense of malaise and complacency, I've found.


The whole thing smells of siloed-teams syndrome.

Team 1 tells Team 2 that the schema is updating.

Team 2 updates their schema.

Team 2 tests against updated schema

All green in test.

Team 1 doesn't actually follow the schema.

Deployment fails.

---

It's really hard to assign blame, but I'd put more blame on Team 2 for not being defensive enough with their inputs.

As we all know there are greater issues with their deployment pipelines (lack of canaries, phased rollouts etc.) but no point going over those in this context.


Leeeroy Jenkiiiins!


They should've read "parse, not validate": https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...


Thanks that was a good read.


A lot of mitigation actions, but nothing to really stop it happening again: a fail-safe system in their boot-start driver. Bad programming and QA caused the issue, but bad design allowed it to happen.


I think the QA issues are by far the most important part. A security component of this type, by its nature, has to be able to prevent your computer from doing anything at all, since any part of userspace (at least) could be compromised.

The "fail safe" for a security component is in fact to prevent any user space code from running at all - better that than having it actively harm other systems, exfiltrate data, destroy connected hardware etc. So, no amount of clever design can prevent the CrowdStrike sensor from nuking your system if bad security rules get deployed.

For example, if a bad definition file makes it think that the legit libc or win32 libraries are compromised, it should prevent any userspace program from running, which is just as destructive as failing during boot.

That is why appropriate QA is critical for this type of program. I would expect any definition update of any kind to be tested on dozens of systems with a wide variety of Windows configurations and known-good software far before ever being deployed to any customer system. It seems that CrowdStrike thought the exact opposite of this, and in fact their customers were the first to ever run their new code end-to-end, not the last...


> So, no amount of clever design can prevent the CrowdStrike sensor from nuking your system if bad security rules get deployed.

This is too binary a way to think about a complex system. Availability is also a security goal so we shouldn’t cavalierly trade it for minor risks which are mostly edge cases.

For example, say that the fail-safe was an old, old idea: keep the second most recent version, and if the system fails to start or crashes repeatedly, automatically roll back to the last known good version. That turns this kind of problem into at most a reboot (a huge win every customer would have taken), and the only case where it would introduce a vulnerability is an active attack which only the latest rules block and which is so virulent that the number of compromised systems approaches the number affected by a bad update. That's an unlikely set of events, especially because there's a really tight window in which such a fast-spreading attack wouldn't already have compromised the host before CrowdStrike could ship the update.
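
A minimal sketch of that rollback idea, assuming the driver persists a boot-failure counter and keeps the previously working channel file around (all names here are illustrative, not CrowdStrike's actual code):

    /* Hypothetical "last known good" fallback. Assumes a failure counter
     * that survives reboots and is cleared by userspace once the machine
     * is confirmed healthy. Illustrative names only. */
    #define MAX_FAILED_BOOTS 2

    int load_content(void)
    {
        int failures = read_failure_counter();           /* persisted across reboots */

        if (failures >= MAX_FAILED_BOOTS)
            return load_channel_file("channel_previous.bin");   /* roll back */

        write_failure_counter(failures + 1);              /* assume the worst */
        int rc = load_channel_file("channel_latest.bin");
        if (rc == 0)
            request_health_confirmation();                /* userspace clears the counter */
        return rc;
    }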

Another variation of that idea: any time the system fails to start repeatedly, the service blocks all processes other than its updater, so normal apps aren't exposed as potential vectors but the system can self-heal in most cases.


The famous Windows "guru" Alex Ionescu was their main kernel architect for a long time; funny that he hasn't commented at all on this failure.


Add a new threat actor to the list, those pesky parameter counts actively trying to evade detection:

"This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process, the Template Type (using a test Template Instance) stress testing or the first several successful deployments of IPC Template Instances in the field."

Curious that csagent.sys isn't mentioned until the last page, p. 12:

"csagent.sys is CrowdStrike’s file system filter driver, a type of kernel driver that registers with components of the Windows operating system…"


Well I guess I should post the obligatory

> Some people, when confronted with a problem, think

> “I know, I’ll use regular expressions.”

> Now they have two problems.



Note: this was distributed to their customers today


Is it just me or does it seem like this change simply wasn't tested beyond a simple unit test?


The big thing I was wondering is what their coverage analysis is like. I can see a developer missing this in a hurry, but where’s the review or second-order analysis? Tons of projects with far less importance monitor branch coverage, use fuzz testing and path analysis tools, etc. and while it’s not trivial to test a kernel driver it’s not _that_ hard, especially when you have the resources of a company valued in the tens of billions which allegedly specializes in exactly this kind of work.
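
For what it's worth, a userspace fuzz harness for the parsing path is almost trivial to set up. A minimal libFuzzer sketch, assuming the Content Interpreter's parsing logic can be built as a userspace library (parse_template_instance is an invented name, not their real API):

    /* Minimal libFuzzer harness sketch; parse_template_instance is an
     * assumed entry point, not CrowdStrike's actual API. */
    #include <stddef.h>
    #include <stdint.h>

    int parse_template_instance(const uint8_t *data, size_t len);   /* assumed */

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        /* Any crash or sanitizer report (e.g. an out-of-bounds read on a
         * short field count) fails the fuzz run long before anything ships. */
        parse_template_instance(data, size);
        return 0;
    }

Built with clang -fsanitize=fuzzer,address, an input that only populates 20 fields would likely have tripped the sanitizer immediately.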

The thing I've been thinking about is all of the assurances they made about SDLC, testing, secure development practices, etc. They have so many huge customers in regulated industries, government, etc. that they've completed almost every certification in existence, and seeing this really raises questions about how those assertions were reviewed.


I posit that there are multiple disparate teams involved.

--

Team 1 tells Team 2 that the schema is updating.

Team 2 updates their schema.

Team 2 tests against updated schema (which would be a test file)

All green in test.

Team 1 doesn't actually follow the schema.

Deployment fails.


It’s worse than that. They updated the schema, and tested it with previous data that does not exercise the new parameter. Tests are passing. When they go and actually use the new parameter, it crashes.

The new schema was improperly tested (among a list of other failures).


My thoughts as well.


kinda sounds like this was a regex bug?

> The selection of data in the channel file was done manually and included a regex wildcard matching criterion in the 21st field for all Template Instances, meaning that execution of these tests during development and release builds did not expose the latent out-of-bounds read in the Content Interpreter when provided with 20 rather than 21 inputs.


Sounds more like an off-by-one bug that was hidden by regexes, if I'm reading correctly.


Very easily hidden. Something obtuse like

    (.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)
or even this

    (.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{1})
would simply fail to match.

And I wouldn't necessarily blame the developer in either scenario - they received a card that says "hey, the channel file will now have an extra field in its schema"... no one said "btw it's optional".

Calling it a "first-year programming mistake", like I'm reading in some media, is somewhat incendiary. I see unmarshalling errors happen all the time.

The forest we must not miss for the trees is that the kernel-level driver simply dies with no error recovery and bricks the system.


I think that's just the nature of kernel programming. Once you're running in kernel space, there are essentially no safety guards, which is why kernel programming is so difficult. The safety nets that turn a fault in user space into a seg fault + core dump do not exist in kernel space. Especially since kernel code generally has to be written in C, it can be quite difficult even for the best engineers to get everything right.


Yeah, my read was that they changed an interface to include an optional parameter but never actually tested the underlying code by providing said optional parameter.

The bug in the clients (sensors) wasn't due to regex; the regex was in their integration/unit testing, which also had a bug and was never supplying the 21st parameter to the client code.


Regex probably isn't a good thing to have in kernel boot code, considering matching can be NP-hard (with backreferences).


That's a true statement, but what does it have to do with the RCA? From what I read, it appears the regex was in the integration tests for the template.


I don't think so. As far as I understood this, the wildcard match was basically considered a no-op (since anything matches, they probably optimized by not even attempting the match), so the 21st field was never provided to their Content Interpreter, and it never crashed before. The first time they actually added a non-wildcard match, the Content Interpreter was asked to check the 21st field as well, and it crashed because it only had an array of 20 items.
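
Roughly the shape of the latent bug as described in the RCA, in invented code (the field count is from the report; everything else is illustrative):

    /* Sketch of the latent out-of-bounds read: the interpreter is handed an
     * array sized for 20 inputs, but a rule referencing the 21st field
     * indexes one past the end. Invented code, not CrowdStrike's. */
    #define PROVIDED_FIELDS 20

    const char *get_field(const char *fields[], int rule_field_index)
    {
        /* Index 20 (the 21st field) was never dereferenced while the rule
         * was a wildcard; the first non-wildcard rule reads past the end
         * of the array, and in kernel mode that's a bugcheck. */
        return fields[rule_field_index];                 /* OOB when index == 20 */
    }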


[flagged]


This problem comes from CrowdStrike's agent kernel subsystem itself, and not Windows.

While I agree that Windows, as a client-focused operating system, is a hot mess, I would also consider the Windows NT kernel much better than the Linux kernel in terms of code quality and organization. The FreeBSD kernel is still my favourite, though.

At least the Windows NT kernel has a stable ABI, whereas Rust for Linux has to bindgen the glue layer for each Linux kernel build.


Windows supports eBPF. CrowdStrike was written before that support existed, though.


A badly written eBPF program at this level could still prevent you from using your computer though. Not in this specific way, but if you use eBPF to prevent other things from running, and you accidentally deploy an eBPF program that, say, triggers on every process start - then you'll prevent every process from running, and the machine will be just as useless as one that doesn't boot at all.
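
As an illustration, a Linux-flavoured sketch (since that's where eBPF LSM hooks are most mature; an equally over-broad rule on any platform has the same practical effect):

    /* Over-broad eBPF LSM program: denies every exec, so the machine boots
     * but no userspace program can start. Sketch only. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("lsm/bprm_check_security")
    int BPF_PROG(deny_all_exec, struct linux_binprm *bprm)
    {
        return -1;   /* -EPERM: every process start is blocked */
    }

    char LICENSE[] SEC("license") = "GPL";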


Also, eBPF is still in beta for Windows and is nowhere near parity with Linux.



