What everyone seems to be missing in the CrowdStrike incident (axantum.com)
24 points by svantex 56 days ago | 49 comments



With a title like "What everyone seems to be missing about ..." I was sort of expecting something other than a repeat of some of the most common comments made about it. "Why didn't they do a staggered rollout?" is literally the first thing that anyone with any sort of IT experience asks.


Myself included. I thought perhaps I had missed something important, but it was what most other people were already asking.


I didn’t even bother to read the article; I went straight to the comment section to see why the author is wrong!


It is kind of obvious, isn't it? But I've yet to see any hard questions asked in the mainstream media about the process of a simultaneous global rollout of "content updates".

But in a recent update of https://www.crowdstrike.com/falcon-content-update-remediatio... they have a long explanation of how things are supposed to work, with a lot of nice words (it sounds almost AI-written...), quite a few implications that it's really the fault of customers who have not configured their systems to, for example, stay one version behind the latest, and still a very short explanation of what went wrong.

But... Lo and behold, what are CrowdStrike going to do to avoid this happening in the future?

"Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment." About time...

My main point, and the reason for the title, is that this has not been the major takeaway in mainstream media analyses. Of course not "everyone" has missed this, but pretty much all media articles about the incident do appear to miss it.


Mainstream media are unlikely to report in such detail; they would more often ask questions such as "How could such an update cause a global outage?" or "How could this be allowed to happen?"

Most people didn't even know what Crowdstrike was, let alone understand the concept of testing updates and staggering them.

Lastly, the media are at risk of reporting the wrong thing and becoming a target of litigation. Therefore they often report in hyperbole and without much factual information until the facts are determined.


My biggest concern is how they failed to test that the kernel code could run without a failover mechanism to the last known good configuration. It's common sense. Did their QA never raise this concern about availability?


>How is it possible that someone sends out an update affecting the behavior of kernel mode code, all at once, simultaneously, to millions and millions of systems around the whole globe at once!?

>I've participated in many roll outs, and never would I allow a big-bang roll out like this. CrowdStrike should be charged with negligence for having this type of process. It's just plain irresponsible.

Agree with all of this. Related to the deployment process, or lack of one, the hour of the deployment has struck me as odd from the beginning. The largest impact was in the United States, yet the update was pushed in the very early hours, US time.

Presumably the off-hours deployment wasn't about lowering the potential impact, since they sent it to everyone at once.


>Agree with all of this. Related to the deployment process, or lack of one, the hour of the deployment has struck me as odd from the beginning.

This isn't some UI makeover that they can push until next Tuesday. They're pushing updates to the detection logic for what could be an evolving threat, so odd timing of the update is at least somewhat justified. Do you really want a botnet to rip through corporate networks over the weekend while you wait for a Tuesday deploy?


Well, you don't want a fancy anti-virus update to rip through the global population of customers either, effectively killing 8.5 million systems (according to an estimate by Microsoft) in the space of approximately 78 minutes, right? What possible malware threat warrants that risk? And in this case, according to CrowdStrike, it was "to detect novel attack techniques that abuse Named Pipes". That doesn't really sound like such an urgent situation.


What’s most confusing to me is why anyone is giving CS any slack here.

This isn’t an indie game dev or some group of volunteers.

CS actively sells and uses fear to gain customers for a bad product. Their whole $3B business is based on unearned trust.

IMO, at this magnitude, that constitutes fraud. Someone needs to go to jail, and investors need to lose all their money and liquidate the company.


How is it possible that someone sends out an update affecting the behavior of kernel mode code, all at once, simultaneously, to millions and millions of systems around the whole globe at once!?

Looking at all the responses to the incident advocating for more centralised control, it almost seems like it was a deliberate provoking of the acceleration towards digital totalitarianism. "The only thing we have to fear, is fear itself."

If it's not, all an attacker would have to do is deposit a file in %WINDIR%\System32\drivers\CrowdStrike with a name such as C-00000291.sys, containing zeros, and the system becomes unbootable without manual intervention!

An attacker who has already gained enough permissions to do that can just "delete system32" instead, or worse.


> digital totalitarianism

That seems a bit dramatic. I don’t do big corporate IT but I thought a lot of corporate IT shops have the ability with Microsoft to choose what updates are pushed out to computers on their domain. If so, then something like that could have prevented it, presuming they have the ability to allow a single computer or small group to receive the update to confirm it works successfully.


It's not overly dramatic at all. Think about what these IT systems we build are fundamentally designed to do. An ID for everything, an event for every transaction, all of which are becoming more and more integrated under central authorities, whether intentionally or not.

I'm getting to the point where I'm not willing to implement these types of systems for anyone anymore. Not after seeing the breadth of data hoovering and consolidation being pursued.

At some point I just realized the only thing preventing these systems being used in the ways I dread is y'all being decent.

...I'm not willing to cut that check anymore. Seen too much.


> The "content update" that CrowdStrike sent out was full of zeroes. Nothing else. Obviously not the intended content. And this simple data caused the driver to crash

CrowdStrike has stated that, no, the crash was not related to the file of zeros.


Interesting, do you have a link to this statement? Also, do they state what did cause the crash? At least removing the file of zeroes does solve the problem, as the instructions from both Microsoft and CrowdStrike state "Boot into safe mode. Delete C-00000291*.sys." That's the file(s) with the zeroes... See https://www.crowdstrike.com/falcon-content-update-remediatio... and https://www.youtube.com/watch?v=Bn5eRUaMZXk (3 minutes 20 seconds in).


AFAIK in one of the older CrowdStrike threads, there was a tweet saying the driver checked for a sentinel value of AAAAA... before loading the file, so an entirely blank file wouldn't have caused the issue. I can't find the source now, but some comments do seem to corroborate it:

https://news.ycombinator.com/item?id=41005546
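
If that tweet is accurate, the check could be nothing more than a few magic bytes at the start of the file — a minimal sketch, with the sentinel value and names invented here for illustration (not CrowdStrike's actual code); a file of all zeroes would fail it and simply be skipped rather than interpreted:

    /* Hypothetical sketch: skip a channel file that doesn't start with the
     * expected magic bytes. The sentinel value is invented for illustration. */
    #include <stdint.h>
    #include <string.h>

    #define CHANNEL_MAGIC "\xAA\xAA\xAA\xAA"   /* hypothetical sentinel */

    static int channel_file_looks_valid(const uint8_t *buf, size_t len)
    {
        if (len < 4)
            return 0;
        /* An all-zero file fails this test and would be ignored, not parsed. */
        return memcmp(buf, CHANNEL_MAGIC, 4) == 0;
    }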


Yes I do. https://www.crowdstrike.com/blog/falcon-update-for-windows-h...

> CrowdStrike states "Boot into safe mode. Delete C-00000291*.sys." That's the file(s) with the zeroes

That's potentially multiple files, but do we know only one comprises just zeros?


Right, they write rather cryptically "This is not related to null bytes contained within Channel File 291 or any other Channel File."

That's not quite the same as saying "This is not related to Channel File 291 containing all nul bytes."...

I don't have first-hand knowledge here, but rely on Dave Plummer's statement.

Regardless of zeroes or single files or not, the fact is that bad data in C-00000291.sys, in combination with bad validation in the driver, causes it to crash. Deleting C-00000291.sys causes the driver to stop crashing.

Anyway, my main point isn't really about this. It's that the big-bang global rollout, simultaneously to at least 8.5 million systems in one go, is irresponsible.

The driver architecture is the lesser evil here, although it's bad enough!


> the fact is that bad data in C-00000291.sys in combination with bad validation in the driver causes it to crash

This is, in fact, not a fact. We really don't know yet.

CrowdStrike blue screened one of my laptops twice right as the incident was getting started, before a fix was available. There was no boot loop in my case. I was back up and in the middle of an episode of Breaking Bad the second time it got me, 30 minutes after the first. Did the agent wait that long to load a content update it had already loaded before? Maybe, but it's at least as likely that the content was loaded the whole time, and that some activity pattern set it off. Thus, I'm skeptical of the problem being simple content validation.


> the fact is that bad data in C-00000291.sys in combination with bad validation in the driver causes it to crash.

I think we've seen no evidence that data is to blame.

> Deleting C-00000291.sys causes the driver to stop crashing.

So perhaps just its existence is to blame.

> The driver architecture is the lesser evil here

Except if the crash had been limited to the driver, it would have left the machine running unprotected which is far greater an evil.


CrowdStrike does confirm that the data is to blame. "problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception" https://www.crowdstrike.com/falcon-content-update-remediatio... .


Yes - subsequent to my comment. Thanks. But how can this latest statement be true, if the previous statement that the crash was not related to the zero bytes content is true?


Good question. There's some evidence that not all affected systems saw this 'all zeroes' file; the first-hand accounts vary. But something was definitely broken in the deployed data. And, once again, CrowdStrike does not paint a clear picture; it raises new questions and only partially answers old ones.

Why is it so hard for manufacturers to just go ahead and explain what really went wrong, without a lot of corporate b..t? Probably, if they really said what happened in so many words, they might open themselves up to negligence lawsuits. Hopefully somebody files one anyway. The industry needs to learn to do better, and the only thing that talks loudly enough is probably money: lost revenue, liability damages, and shareholder value loss.


Speculation: this "all zero" file is part of a signed batch; they have to have signatures, they are not that dumb (I hope...). By removing a file, the batch becomes incomplete, fails the check, and some corruption recovery mechanism takes over, most likely disabling the broken update and triggering a fresh download. In the meantime, they fixed the content update, fixing the crash.
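
Continuing that speculation (and only that — the manifest format below is invented), a completeness check could be as simple as refusing to apply a batch unless every file named in its signed manifest is present with the expected size, and otherwise keeping the previously known-good content:

    /* Speculative sketch only: verify that every file listed in a manifest
     * exists with the expected size before switching to the new content.
     * The manifest format ("name size" per line) is invented; a real
     * implementation would also verify a signature over the manifest. */
    #include <stdio.h>
    #include <sys/stat.h>

    static int batch_is_complete(const char *manifest_path)
    {
        FILE *m = fopen(manifest_path, "r");
        if (!m)
            return 0;

        char name[256];
        long long expected_size;
        int ok = 1;

        while (fscanf(m, "%255s %lld", name, &expected_size) == 2) {
            struct stat st;
            if (stat(name, &st) != 0 || (long long)st.st_size != expected_size) {
                ok = 0;   /* missing or wrong-sized file: reject the whole batch */
                break;
            }
        }
        fclose(m);
        return ok;        /* if 0, keep the previously known-good content */
    }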


With all of this, no matter how bad it was, how negligent CS was, or just how plain silly it was:

Rules in many industries are often written in blood. Whilst that's not the case here (I would hope), these incidents are what spark change in many cases. We now have large companies all around the world bearing the brunt of not testing an update before putting it into production. Whether it was because of them or reliance on a third party, executives en masse are now aware of what happens if you don't do these things.

When we talk about promoting changes into production before testing, we will all come back to this moment.


Why the heck does every single computer in an organization (even the computers in the airport terminals??) need such invasive monitoring software installed? Sounds like compensating for not designing the org's software securely, and not using the operating system's existing security facilities.

Sure, there's the defense-in-depth argument, but this is too in depth and, as proven today, not without its risks.


> In, short, by CrowdStrike hacking the protocol and Microsoft allowing it to happen

This seems to be saying Microsoft is letting CrowdStrike bypass Microsoft's own security measures for Windows.

That rings true. Too many times in this industry, easy wins over security. I think the only place this does not happen is on OpenBSD.


It's not about automatic updates; it's about ownership: those who decide when to update, manually or automatically, vs. those who can't decide.

Aside from that, the biggest issue is still having, in 2024, non-declarative, non-rollbackable systems in production, no LOM, and no easy mass automation.


What's missing (to answer the question posed in the article subject) is a disclosed root cause, and a learning-oriented commitment from CrowdStrike on how they plan to prevent this from happening in the future. They feel extremely silent right now (as a customer).



Would be cool if we could ever know, but I’m guessing the harm from evildoers attacking old device software would be high. (Maybe not as high as the CS incident, though.)


People have been saying DevOps is dead… But this is something popular or even basic for DevOps practitioners, albeit with a different name: Canary release.


The article seems to be entirely speculative. Why decide the channel files are executable code, without any first-hand investigation?


Yes, my article is pretty much speculation - in the absence of a proper explanation by CrowdStrike. (Now there actually is sort of an explanation, but it raises almost as many questions as before). I don't have data for a first hand investigation, but do cite the investigation by Dave Plummer - which of course also contains quite a bit of speculation.

Whether or not "Rapid Response Content" and "Template Instances" are Turing complete is unclear, but the fact of the matter is that, according to CrowdStrike, "problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception", so the interpretation of the content is at least fairly complex. CrowdStrike also states that "Each Template Instance maps to specific behaviors for the sensor to observe, detect or prevent", and mentions a "Content Interpreter". Whether it's code or configuration data is not really relevant though; the point is that it's interpreted by a kernel mode driver which did not have sufficient validation of its "Content" to prevent the crash.
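
Whatever the real format is, an out-of-bounds read suggests that an offset or count taken from the file was used without first checking it against the file's actual size. A minimal, entirely hypothetical illustration of that class of bug and its fix — the record layout below is invented, not CrowdStrike's:

    /* Hypothetical record layout, only to illustrate the class of bug:
     * values read from untrusted content must be validated against the
     * real buffer size before they are used to index into it. */
    #include <stdint.h>
    #include <string.h>

    struct record_header {
        uint32_t entry_offset;   /* where the entries start, per the file */
        uint32_t entry_count;    /* how many 8-byte entries follow */
    };

    static int interpret_content(const uint8_t *buf, size_t len)
    {
        struct record_header h;
        if (len < sizeof h)
            return -1;                        /* too short to hold a header */
        memcpy(&h, buf, sizeof h);

        /* Without these checks, a corrupt file can point the reader past
         * the end of the buffer: exactly an out-of-bounds read. */
        if (h.entry_offset > len ||
            h.entry_count > (len - h.entry_offset) / 8u)
            return -1;                        /* reject instead of reading OOB */

        const uint8_t *entries = buf + h.entry_offset;
        for (uint32_t i = 0; i < h.entry_count; i++) {
            const uint8_t *entry = entries + (size_t)i * 8u;
            (void)entry;                      /* ...act on the entry here... */
        }
        return 0;
    }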


They don't have to actually be executable. They are config that affects the way the code behaves. In fact, it's pointed out in TFA that the file was just all 0s, so right now, in order to prevent a machine with CrowdStrike from booting, you just have to place a file of all 0s in the right folder.


A “Turing Complete” windows “driver” executing unsigned arbitrary code provisioned by “randos on the internet” (CrowdStrike)?

What could go wrong…


On one hand you are correctly concerned. On the other hand, how much code are you (and everyone else, for that matter) running that is not provisioned by "randos on the internet"? Even if you compiled it yourself, do you check all your kernel code and drivers? Firmware? That can just as easily crash your computer and even make it impossible to recover. If your main worry is that the "windows driver" was "Turing Complete", you should be horrified that some kernels even have their own scripting built in or available: BSDs can use Lua, and even Linux has a Lua kernel module (lunatik).


You need that type of capability to detect rapidly evolving malware. Otherwise it can just load its own driver and bypass your user mode scanners, or infect all the computers while you're going through the (presumably third party?) certification process.


Malware isn't evolving so quickly as to make the basic IT best practices of testing and staged rollouts obsolete.

What I don't understand is why these companies didn't also just roll back to the previous known working image when servers failed to reboot. Please say they are not all allowing auto updates without testing and without backups.


Best not to overlook the fact the article qualifies its analysis with "probably".


Yes, since CrowdStrike won't tell us, we'll have to rely on our own or third-party analysis. As I write: "Since as usual the company won't release any detailed information on what really happened, we'll have to rely on other sources. I found that Dave Plummer's account on YouTube was very good, and trustworthy." But, absolutely, "probably" is a required qualifier for some statements about the details.

What is definitely known is that a WHQL kernel mode driver from CrowdStrike crashes, and removing a single file external to the driver causes it to stop crashing. Some pretty safe conclusions can be drawn from that. No "probably" required.


Indeed!


What everyone seems to be missing is that MS has no way to return (automatically) to a last-known-good (anymore).


This is covered in the linked article as well as Dave Plummer's video referenced therein. The Crowdstrike driver is specifically marked as critical to starting the system and so disables the last-known-good mechanisms.

Here's the section where Dave talks about it: https://youtu.be/wAzEJxOo1ts?si=aCX8pOTP0D_IRNAx&t=670


Still, if it’s that critical it should be deployed through the OS vendor, not some 3rd party. And regression/canary testing at each level (3rd party, MS, customer) seems to also be completely bypassed here, which also baffles me.


I think it was intentional. A distraction.


What? You think CrowdStrike is going to burn their multi-billion dollar business down, for what? The only other things of note were the GOP convention and Biden publicly announcing that he was pulling out, both of which were widely publicized by both sides. So unless you think they are burning dollars by the pallet load to distract from the feast of Saint Symmachus, or the death of Welsh snooker champion Ray Reardon, I'm not sure what they're supposed to be hiding. I think you need to loosen the tinfoil, go outside, and touch grass, my dude.


Dial it down my dude.


What everyone seems to be missing about water is that it's wet.



