CrowdStrike Update: Windows Bluescreen and Boot Loops (reddit.com)
4480 points by BLKNSLVR 5 days ago | hide | past | favorite | 3847 comments



Isn't CrowdStrike the same company that heavily lobbied to make all their features a requirement for government computers? https://www.opensecrets.org/federal-lobbying/clients/summary... They have plenty of money for Congress, but seemingly little for any kind of reasonable software development practices. This isn't the first time CrowdStrike has pushed system-breaking changes.

Since we are in political season here in the US, they are also well known as the company that investigated the Russian hack of the DNC.

https://www.crowdstrike.com/blog/bears-midst-intrusion-democ...


The DNC has since implemented many layers of protection, including CrowdStrike, hardware keys, and special auth software from Google. They learned many lessons from 2016.

If I were to hazard a guess, I think the OP is attempting to say they are incompetent and wrong in fingering the GRU as the cause of the DNC hacks (even though they were one of many groups that reached that very obvious conclusion).

What? No.

Not you, the person you were responding to.

AFAIK didn't they hack Republicans too? They only released Democratic emails, though.

Correct. Also, the DNC breach was investigated by FireEye and Fidelis as well (who also attributed it to Russia).


The second link has nothing to do with the DNC breach. It's the Ukrainian military disagreeing with Crowdstrike attributing a hack of Ukrainian software to Russia. And ThreatConnect also attributed it to Russia: https://threatconnect.com/blog/shiny-object-guccifer-2-0-and...

>we assess Guccifer 2.0 most likely is a Russian denial and deception (D&D) effort that has been cast to sow doubt about the prevailing narrative of Russian perfidy


So Ukraine's military and the app creator denied their artillery app was hacked by Russians, which might have caused them to lose some artillery pieces? Sounds like they aren't entirely unbiased. Ironically, DNC initially didn't believe they were hacked either.

And CrowdStrike accurately reported all the facts.

Seems like they're pretty good at what they do. Maybe that's why so much critical infrastructure depends on them.


I mean... the DNC thought Bernie hacked them so...

Yeah this is the fringe view. The fact that the GRU is responsible is the closest thing you can get to settled in infosec.

Especially since the alternative scenarios described usually devolve into conspiracy theories about inside jobs


There's something of a difference between 'alternative scenarios' and demonstrating that the 'settled' story doesn't fit with the limited evidence. One popular example is that the exploit CrowdStrike claims was used wasn't in production until after the time they claim it was used.

>There's something of a difference between 'alternative scenarios' and demonstrating that the 'settled' story doesn't fit with the limited evidence.

You've failed to demonstrate that, since your second link doesn't show the Ukrainian military disputing the DNC hack, just a separate hack of Ukrainian software, and the first link doesn't show ThreatConnect disagreeing with the assessment. ThreatConnect (and CrowdStrike, Fidelis, and FireEye) attributes the DNC hack to Russia.

>One popular example is that the exploit Crowdstrike claim was used wasn't in production until after they claimed it was used.

Can you provide more info there?


> You've failed to demonstrate that

I see that now. I should have been more careful while searching for and sharing links. I have shot myself in the foot. And I'm not going to waste my time or others digging for and sharing what I think I remembered reading. I've done enough damage today. Thank you for your thorough reply.


Ok, who did it then?

According to that link, the most they spent on lobbying in any of the past 5 years was $600,000, and most years it was around $200,000. That's barely the cost of a senior engineer.

You'd be surprised how cheap politicians are.

IIRC Menendez was accused and found guilty of accepting around $30,000 per year from foreign governments?

That's probably only the part they had the hard proof for.

Also, the press release[1] says:

> between 2018 and 2022, Senator Menendez and his wife engaged in a corrupt relationship with Wael Hana, Jose Uribe, and Fred Daibes – three New Jersey businessmen who collectively paid hundreds of thousands of dollars of bribes, including cash, gold, a Mercedes Benz, and other things of value

and later:

> Over $480,000 in cash — much of it stuffed into envelopes and hidden in clothing, closets, and a safe — was discovered in the home, as well as over $70,000 in cash in NADINE MENENDEZ’s safe deposit box, which was also searched pursuant to a separate search warrant

This seems to be more than $120K over 4 years. Of course, not all of the cash found may be the result of those bribes, but likely at least some of it is.

[1] https://www.justice.gov/usao-sdny/pr/us-senator-robert-menen...


I always half-jokingly think "should I buy a politician?"

I feel like a few friends could go in on it.


It could be like an "insurance" where people pay for politician lobbying. Pool our resources and put it in the right spots.

OK, but that point still undercuts the premise that CrowdStrike is spending a large enough amount on lobbying that it would hamper their engineering dept.

I believe the OP was using figurative language. The point seems to be that _something_ is hampering their engineering department and they shouldn't be lobbying the government to have their software so deeply embedded into so many systems until they fix that.

In the UK, a housing minister was bribed with £12,000 in return for a £45m tax break.

3750:1 return on investment, you don't get many investments that lucrative!


Given its origin and involvement in these high-profile cases, I always thought CrowdStrike was a government-subsidized company that barely has any real function or real product. I stand corrected, I guess.

This still doesn't demonstrate that it has any real function tbf.

Business Continuity Plan chaos gorilla as a service.

There's something missing here... You know nothing about CrowdStrike (as per your own statement), yet critical infrastructure depends on them.

Those two things tell us something about your knowledge ;)


On the bright side, they are living up to their aptronym.

I wonder if it might start becoming a common turn of phrase. "Crowdstrike that directory", etc.

There's a brokenness spectrum. Here are some points on it:

- operational and configured

- operational and at factory defaults

- broken, remote fixable

- crowdstruck (broken remotely by vendor, but not fixable remotely)

- bricked

Usage:

> don't let them install updates or they'll crowdstrike it.


> Isn't CrowdStrike the same company that heavily lobbied to make all their features a requirement for government computers?

Do you have any more sources on this specifically? The link you gave doesn't seem to reference anything specific.


Seems to be a perfectly rational decision to maximise short term returns for the owners of the company.

Now make of that what you will.


This demonstrated that Crowdstrike lacks the most basic of tests and staging environments.

Corporate brainrot strikes again.

If it's true that a bad patch was the reason for this, I assume someone, or multiple people, will have a really bad day today. Makes me wonder what kind of testing they have in place for patches like this; normally I wouldn't expect something to go out immediately to all clients, but rather as a gradual rollout. But who knows, Microsoft keeps their master keys on a USB stick while selling cloud HSMs, so maybe CrowdStrike just yolos their critical software updates as well while selling security software to the world.

Sounds like it was a 'channel file', which I think is akin to an AV definition file, that caused the problem rather than an actual software change. So they must have had a bug lurking in their kernel driver which was uncovered by a particular channel file. Still, seems like someone skipped some testing.

https://x.com/George_Kurtz/status/1814235001745027317

https://x.com/brody_n77/status/1814185935476863321


The parser crashing the system on a malformed input file strongly suggests their software stack in general is trash

Sounds like something a fuzzer likely would have found pretty quickly.

How about a try-catch block? The software reading the definition file should be minimally resilient against malformed input. That's like programming 101.

A page fault on a bad address in a kernel driver isn't something you can recover from with exception handling like that
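Both points hold together: a kernel component can't lean on exception handling, so resilience against a malformed definition file has to come from validating the input up front and failing closed. A minimal sketch of that idea in Rust, using a hypothetical record layout (not the actual channel-file format):

```rust
// A sketch of fail-closed parsing of an untrusted definition file.
// The record layout here (a u32 little-endian length followed by that many
// payload bytes) is hypothetical, not the real channel-file format.
#[derive(Debug)]
enum ParseError {
    Truncated,
    LengthOutOfBounds,
}

fn parse_record(buf: &[u8]) -> Result<&[u8], ParseError> {
    if buf.len() < 4 {
        return Err(ParseError::Truncated);
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    // Bounds-check the declared length against what was actually received
    // instead of trusting it and reading past the end of the buffer.
    buf[4..].get(..len).ok_or(ParseError::LengthOutOfBounds)
}

fn main() {
    // A malformed record that claims 1024 payload bytes but ships only 2
    // is rejected with an error rather than causing an out-of-bounds read.
    let bad = [0x00, 0x04, 0x00, 0x00, 0xAA, 0xBB];
    assert!(parse_record(&bad).is_err());

    let good = [0x02, 0x00, 0x00, 0x00, 0xAA, 0xBB];
    assert_eq!(parse_record(&good).unwrap(), &[0xAA, 0xBB]);
}
```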

Who needs testing when apologizing to your customers is cheaper?

Reputational damage from this is going to be catastrophic. Even if that’s the limit of their liability it’s hard not to see customers leaving en masse.

Ironically some /r/wallstreetbets poster put out an ill-informed “due diligence” post 11 hours ago concerning CrowdStrike being not worth $83 billion and placing puts on the stock.

Everybody took the piss out of them for the post. Now they are quite likely to become very rich.

https://www.reddit.com/r/wallstreetbets/s/jJ6xHewXXp



That user is the equivalent of using a screwdriver to look for gold and succeeding.

Not sure what material in their post is ill-informed. Looks like what happened today is exactly what that poster warned of in one of their bullet points.

Yeah, everyone is dunking on the OP here. But they essentially said that CrowdStrike's customers were all vulnerable to something like this. And we saw a similar thing play out only a few years ago with SolarWinds. It's not surprising that this happened. Of course, for making money the timing is the crucial part, and that's hard to predict.

A convenient alibi?

The company will perish, there is no doubt about that.

Nah they'll be fine. It happened 7 months ago on a smaller scale, people forgot about that pretty quickly.

You don't ditch the product over something like this as the alternative is mass hacking.


Is the alternative "mass hacking"? I thought all this software did was check a box on some compliance list. And slow down everyone's work laptop by unnecessarily scanning the same files over and over again.

I assume you're not in the security industry?

This sounds like someone who said "dropbox ain't hard to implement"


As someone said earlier in these comments the software is required if you want to operate with government entities. So until that requirement changes it is not going anywhere and continues to print money for the company.

But then, if what you say is true and their software is indeed mandatory in some context, they also have no incentive or motivation to care about the quality of their product, about it bringing actual value or even about it being reliable.

They may just misuse this unique position in the market and squeeze as much profit from it as possible.

The mere fact that there exists such a position in the market is, in my opinion, a problem because it creates an entity which has a guaranteed revenue stream while having no incentive to actually deliver material results.


If the government agencies insist on using this particular product then you're right. If it's a choice between many such products then there should be some competition between them.

Surely there is more than one antivirus that can check the audit box?

From experiencing different AV products at various jobs, they all use kernel level code to do their thing, so any one of them can have this situation happen.

Presumably those other companies try running things at least once before pushing it to the entire world though.

I'd kind of expect IT administrators to try out these updates on a staging machine before fully deploying to all critical systems. But here we are.

You, the admin, don't get to see what Falcon is doing before it does it.

Your security people have a dashboard that might show them alerts from selected systems if they've configured it, but CrowdStrike central can send commands to agents without any approval whatsoever.

We had a general login/build host at my site that users began having terrible problems using. Configure/compile stuff was breaking all the time. We thought...corrupted source downloads, bad compiler version, faulty RAM...finally, we started running repeated test builds.

Guy from our security org then calls us. He says: "Crowdstrike thinks someone has gotten onto linux host <host>, and has been trying to setup exploits for it and other machines on the network; it's been killing off the suspicious processes but they keep coming back..."

We had to explain to our security that it was a machine where people were expected to be building software, and that perhaps they could explain this to CS.

"No problem; they'll put in an exception for that particular use. Just let us know if you might running anything else unusual that might trigger CS."

TL;DR-please submit a formal whitelist request for every single executable on your linux box so that our corporate-mandate spyware doesn't break everyone's workflow with no warning.


EDR stands for Endpoint Detection and Response.

People don't realize there's that last bit: Response, what do you do when something is Detected.

That's your Admin setup.


Some of them might have saner rollout strategy and/or better quality control.

AV definitions need to be rolled out quickly for 0-days.

Developers aren't used to the security lifecycle, so quite a few commenters in this thread equate the SDLC with security.


Extremely unlikely. This isn't the first blowup Crowdstrike has had; though it's the worst (IIRC), Crowdstrike is "too big to fail" with tons of enterprise customers who have insane switching costs, even after this nonsense.

Unfortunately for all of us, CrowdStrike will be around for a while.


Businesses would be crazy to continue with Crowdstrike after this. It's going to cause billions in losses to a huge number of companies. If I was a risk assessment officer at a large company I'd be speed dialling every alternative right now.

The cybersecurity industry has regular, annual security testing/competitions run by various organizations that simulate tons of attacks.

Vendors are tested against these cases and graded with their effectiveness.

I heard CrowdStrike is "best-in-market" for good reasons, as others with deeper knowledge of the industry have shared in this thread.


> I heard Crowdstrike is "best-in-market"

A friend of mine who used to work for Crowdstrike tells me they're a hot mess internally and it's amazing they haven't had worse problems than this already.


That sounds like every other company I have ever worked for: looks great from the outside but a hot mess on the inside.

I have never worked for a company where everything is smooth sailing.

What I've noticed is that the smaller the company, the less of a hot mess they are, but at the same time they're also struggling to pay the bills because they don't innovate fast.


it would be crazy not to at least investigate migration paths away from Crowdstrike, or better redundancies for yourself

While it probably should, I regret to inform you that SolarWinds is still alive and well.

I mean, Boeing is still around...

I would assume that its enterprise customers have an uptime SLA as part of their contract, and that breaching it isn't very cheap for Crowdstrike.

I highly doubt their SLA says something about compensating for damages. At most you won't have to pay for the time they were down.

And even more ironically: a botched update doesn't mean they are down. It means you are down. So I don't even think their SLA applies to this.


Yeah, they'll pay with "credits" for the downtime, if what is currently happening even technically qualifies as downtime.

Software doesn't have uptime guarantees. They might have time-to-fix on critical issues, though.

I assume this is gross negligence, which would leave them open to claims made through courts, though.


As at 4am NY time, CRWD has lost $10Bn (~13%) in market cap. Of course they've tested, just not enough for this issue (as is often the case).

This is probably several seemingly inconsequential issues coming together.

I'm not sure why, though, when the system is this important, even successfully tested updates aren't rolled out piecemeal (or perhaps they were, and we're only seeing the result of partial failures around the world).


Testing is never enough. In fact, it won't catch 99% of issues, because tests usually cover only happy paths, or only what humans can think of, and they are by no means exhaustive.

A robust canarying mechanism is the only way you can limit the blast radius.

Set up A/B testing infra at the binary level so you can ship updates selectively and compare their metrics.

Been doing this for more than 10 years now, it's the ONLY way.

Testing is not.
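A minimal sketch of what such a canarying mechanism could look like, with made-up device IDs and stage percentages (not any vendor's actual rollout system): devices are hashed into stable buckets, and an update is only offered to the buckets covered by the current rollout stage, with health metrics gating each expansion.

```rust
// A sketch of hash-based canary bucketing with made-up device IDs; not any
// vendor's real rollout system. Each device lands deterministically in a
// bucket 0..99, and an update is only offered while the current rollout
// percentage covers that bucket.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn bucket(device_id: &str) -> u64 {
    let mut h = DefaultHasher::new();
    device_id.hash(&mut h);
    h.finish() % 100
}

/// True if this device should receive the update at the current stage.
fn should_update(device_id: &str, rollout_percent: u64) -> bool {
    bucket(device_id) < rollout_percent
}

fn main() {
    let fleet = ["host-0001", "host-0002", "host-0003", "host-0004"];
    for stage in [1u64, 5, 25, 100] {
        let targeted = fleet
            .iter()
            .copied()
            .filter(|id| should_update(id, stage))
            .count();
        println!("stage {stage}%: {targeted}/{} hosts targeted", fleet.len());
        // In a real pipeline you would pause here, compare crash and health
        // metrics between updated and not-yet-updated cohorts, and halt the
        // rollout automatically on any regression.
    }
}
```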


Depends on what you mean by enough. It should be more than enough to catch issues like this one specifically.

If they can't even manage that they'll fail at your approach as well.


Canary offers more bang for the buck, and is much easier to set up. So I kind of disagree.

> Canary offers more bang for the buck

I'm not sure that justifies potentially bricking the devices of hundreds(?) of your clients by shipping untested updates to them. Of course it depends... and would require deeper financial analysis.


They won't be able to test exhaustively every failure mode that could lead to such issues.

That's why canaries are easier and more "economical" to implement and gives better value per unit effort.


> They won't be able to test exhaustively every failure mode that could lead to such issues.

That might be acceptable. My point is that if you are incapable of having even absolutely basic automated tests (which would take a few minutes at most) for extremely impactful software like this, starting with something more complex seems like a waste of time (clearly the company is run by incompetent people, so they'd just mess it up).


But they can test obvious failure modes like this one. You need both.

Exactly. They knocked half the world offline, probably killed thousands in ERs, and the stock is only down to about its June lows.

And when it’s more costly for customers to walk back the mistake of adopting your service.

Yeah, I get the impression a lot of SaaS companies operate on this model these days. We just signed with a relatively unknown CI platform, because they were available for support during our evaluation. I wonder how available they’ll be when we have a contract in place…


hah that tweet was one heck of an apology. "we deployed a fix to the issue, speak with your customer rep"

Unfortunately cybersecurity still revolves around obscurity.

Doesn't matter what testing exists. More scale. More complexity. More bugs.

It's like building a gigantic factory farm and then realizing that the environment itself is the birthing chamber and breeding ground of superbugs with the capacity to wipe out everything.

I used to work at a global response center for big tech once upon a time. We would get hundreds of issues we couldn't replicate, because we would literally have had to set up our own government or airline or bank or telco to test certain things.

So I used to joke with the corporate robots to just hurry up and take over governments, airlines, banks, and telcos already, because that's the only path to better control.


> It's like building a gigantic factory farm and then realizing that the environment itself is the birthing chamber and breeding ground of superbugs with the capacity to wipe out everything.

Factorio player detected


Testing + a careful incremental rollout in stages is the solution. Don't patch all systems world-wide at once, start with a few, add a few more, etc. Choose them randomly.

Here's hoping they start from the top.

They won't, but hope springs eternal.


I've seen photos of the BSOD from an affected machine; the error code is `PAGE_FAULT_IN_NONPAGED_AREA`. Here are some helpful takeaways from this incident:

1) mistakes in kernel-level drivers can and will crash the entire os

2) do not write kernel-level drivers

3) do not write kernel-level drivers

4) do not write kernel-level drivers

5) if you really need a kernel-level driver, do not write it in a memory unsafe language


I've said this elsewhere but the enabling of instant auto-updates on software relied on by a mission critical system is a much bigger problem than kernel drivers.

Just imagine that there's a proprietary firewall that everyone uses on their production servers. No kernel-level drivers necessary. A broken update causes the firewall to blindly reject any kind of incoming or outgoing request.

Easier to roll back because the system didn't break? Not really; you can't even get into the system anymore without physical access. The chaos would be just as bad.

A firewall is an easy example, but it can be any kind of application. A broken update can effectively bring the system down.


There sure are a lot of mission-critical systems and companies hit by this. I am surprised that auto-updates are enabled. I read about some large companies/services in my country being affected, but also a few which are unaffected. Maybe they have hired a good IT provider.

I'm not surprised, seeing how this madness has even infected OSS/Linux.

https://github.com/canonical/microk8s/issues/1022

A k8s variety. By Canonical. Screams production, no one is using this for their gaming PC. Comes with.. auto-updates enabled through snap.

Yup, that once broke prod at a company I worked at.

Should our DevOps guy have prevented this? I guess so, though I don't blame him. It was a tiny company and he did a good job given his salary, much better than similar companies here. The blame goes to Canonical - if you make this the default it better come with a giant, unskippable warning sign during setup and on boot.


Snap auto-update pissed me off so much I started Nix-ifying my entire workflow.

Declarative, immutable configurations for the win...


One thing to consider with security software, though, is that time is of the essence when it comes to getting protection against 0-day vulnerabilities.

Gotta think that the pendulum might swing in the other direction now and enterprises will value gradual, canary deployments over instant 100% coverage.


I'm not a Windows programmer so the exact meaning of PAGE_FAULT_IN_NONPAGED_AREA is not clear to me. I am familiar with UNIX style terminology here.

Is this just a regular "dereferencing a bad pointer", what would be a "segmentation violation" (SEGV) on UNIX, a pointer that falls outside the mapped virtual address space?

As this is in ring 0 and potentially has direct access to raw, non-virtual physical addressing, is there a distinction between "paged memory" (virtual address space) and "nonpaged memory" (physical address) with this error?

Is it possible to have a page fault failure in a paged area (PAGE_FAULT_IN_PAGED_AREA?), or would that be non-fatal and would be like "minor page fault" (writing to a shared page, COW) or "major page fault" (having to hit disk/swap to bring the page into physical memory)?

Are there other PAGE_FAULT_ errors on Windows?

Searching for this is difficult, as all the results are for random spammy user-centric tech sites with "how do I solve PAGE_FAULT_IN_PAGED_AREA blue screen?" content, not for a programmer audience.




Basically all AV either runs as root or uses a kernel driver. I guess the former is preferable

Rust's memory safety does not prevent category errors like using nonpaged memory for things that are supposed to be paged, and vice versa

This all-or-nothing mindset is reductive and defeatist; harm reduction is valuable. Sure, Rust won't magically make your kernel driver bug-free, but it will reduce the surface area for bugs, which will likely make it more stable.

Yes, I fully agree.

Unfortunately, we have decades of first Haskell pseudo-fans, a side quest of generic "static typing (don't look at how weak the type system is)" pseudo-fans, and now Rust aficionados who do act like it's all-or-nothing and types will magically fix everything, including category and logic errors.

At some point tiredness and reactivity set in.


Other takeaways:

- do not put critical infrastructure online

- do not push updates that work around the update schedule

- do not push such updates to all machines at once

- do not skip testing and QA commensurate with the number and kind of machines affected

Even one of these would have massively improved the situation, even with a kernel-level driver written in an unsafe language.


A memory-safe language does not prevent crashes.

In the case of potential UB (and then memory corruption), you get a guaranteed crash instead.

Wait, crash? :wink:


did you have a crowdstroke while writing this reply?

The problem is that some viruses may run in the kernel mode, so an AV has to do the same, or it will be powerless against such viruses.

If a virus got that far, you're already in trouble. What stops them from attacking the anti-virus?

If you think AV cannot stop viruses in the same privilege level, then that is more reason for AV to run in the kernel mode. Because by your logic, an AV in user mode cannot stop a virus in user mode.

>5) if you really need a kernel-level driver, do not write it in a memory unsafe language

I C what you're doing... >_>


Pointing out the obvious? Why are you upset I'm stating that mixing hot oil and water will make a mess?

an audio driver once blue screen of death'd my windows whenever i started Discord.

i'm surprised i'm not hearing a stronger call for microkernels yet


0) don't load a new driver into your working kernel.

5) Well, how many of the kernel-level drivers we rely upon ARE written in a memory-unsafe language? Like 99%?

And we are not crashing and dying every day?

Sure, Rust is the way to go. It just took Rust 18 years to mature to that level.

Also, quite frankly, if your unwrap() makes your program terminate because of an array out of bounds, isn't that exactly the same thing? (The program terminates.)

But IMHO if we are hopping along a minefield every second of every day, well... if this is the worst-case scenario, it's not that bad after all.


> Well, how many of the kernel-level drivers we rely upon ARE written in a memory-unsafe language? Like 99%? And we are not crashing and dying every day?

we shouldn't discount the consequences of memory safety vulnerabilities just because flights haven't physically been grounded.

> Also, quite frankly, if your unwrap() makes your program terminate because of an array out of bounds, isn't that exactly the same thing? (The program terminates.)

This is a strawman: if you were writing a kernel-level driver in Rust, you'd configure the linter to deny code which can cause panics.

here's a subset:

- https://rust-lang.github.io/rust-clippy/master/index.html#/u...

- https://rust-lang.github.io/rust-clippy/master/index.html#in...
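For illustration, those lints can be denied at the crate root so panicking constructs fail the build. This is a sketch of the configuration the parent describes, not any real driver's build setup:

```rust
// A sketch of denying panic-prone constructs at the crate root, per the
// clippy lints linked above. Illustrative only, not any real driver's
// build configuration; the lints fire when run under `cargo clippy`.
#![deny(clippy::unwrap_used, clippy::expect_used, clippy::indexing_slicing, clippy::panic)]

// With those lints active, fallible paths have to be handled explicitly:
fn first_byte(buf: &[u8]) -> Option<u8> {
    buf.first().copied() // `buf[0]` here would be rejected by clippy::indexing_slicing
}

fn main() {
    assert_eq!(first_byte(&[]), None);
    assert_eq!(first_byte(&[42]), Some(42));
}
```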


Not a helpful takeaway, I've yet to see a Java kernel driver.

Nobody is telling you to use Java. Although, if you want to revive Singularity that would be pretty neat.

And I never said that anyone is telling me to use Java. It was an example.

Because of the nature of AV software, its code would be drowning in "unsafe" memory accesses no matter the language we chose. This is AV; it's always trying to read memory that isn't its own, by its very design.

This is a story about bad software management processes, not programming languages.


Reading memory from another process can be done through memory-safe APIs.

To give an example from the linux userspace world: https://docs.rust-embedded.org/rust-sysfs-gpio/nix/sys/uio/f...
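A sketch of what that looks like on Linux via the nix crate (the exact signature has shifted across nix versions, so treat this as illustrative rather than copy-paste): the call fails with an error code instead of faulting the caller.

```rust
// A sketch of reading another process's memory through a safe wrapper on
// Linux (nix's process_vm_readv) instead of a kernel driver. Assumes a
// recent `nix` with the relevant features enabled; the exact signature has
// shifted between nix versions, so treat this as illustrative.
use nix::sys::uio::{process_vm_readv, RemoteIoVec};
use nix::unistd::Pid;
use std::io::IoSliceMut;

fn read_remote(pid: i32, addr: usize, len: usize) -> nix::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    let mut local = [IoSliceMut::new(&mut buf)];
    let remote = [RemoteIoVec { base: addr, len }];
    // On failure this returns EPERM/ESRCH/EFAULT instead of faulting the caller.
    let n = process_vm_readv(Pid::from_raw(pid), &mut local, &remote)?;
    buf.truncate(n);
    Ok(buf)
}

fn main() {
    // Reading a bogus address in our own process yields an error, not a crash.
    let me = std::process::id() as i32;
    println!("{:?}", read_remote(me, 0x1000, 16));
}
```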


be the change you wish to see

This was apparently caused by a faulty "channel file"[0], which is presumably some kind of configuration database that the software uses to identify malware.

So there wasn't any new kernel driver deployed, the existing kernel driver just doesn't fail gracefully.

[0]: https://x.com/brody_n77/status/1814185935476863321


Why on earth don't they have staged rollouts for updates?

Every time I look into such catastrophic issues, it always boils down to a lack of robust canarying mechanisms.

They have enough client base that they can even run an A/B test on the whole binary level, but no.


Also, why not have some sort of graceful degradation (well, kind of): the OS boots, loads the CS driver, the driver loads some new feature/config, and before/after the new thing a "runtime flag" marks whether it worked successfully; if not, on the next reboot that thing gets either disabled or replaced by the previous known-good config (obviously some combination of things might cause another issue), instead of blindly rebooting into the same state...

I think pfsense does this (from memory, been a while using it). Basically dual-partitions, and if it failed to come up on the active partition after an update it'd revert. Granted you need to have the space to have two partitions, but for a small partition/image not so bad.

What surprises me is that, if it's a content update and the code fell over when dealing with it, isn't it just bad release engineering not to cater for that in the first place? I.e., some tests in the pipeline before releasing the content update would've picked it up, given it sounds like a 100% failure rate.
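A minimal sketch of the "runtime flag" idea described above, using hypothetical file paths rather than any real product's layout: mark a config as pending before using it, promote it to known-good only after it has demonstrably worked, and fall back if a previous boot never cleared the pending marker.

```rust
// A sketch of a last-known-good fallback, with hypothetical paths
// (not any real product's layout).
use std::fs;
use std::path::Path;

const PENDING: &str = "config/pending.marker";
const CURRENT: &str = "config/current.cfg";
const LAST_GOOD: &str = "config/last_good.cfg";

fn select_config() -> std::io::Result<String> {
    if Path::new(PENDING).exists() {
        // A previous run set the marker and never reported success:
        // assume the new config is bad and revert to the known-good one.
        fs::copy(LAST_GOOD, CURRENT)?;
        fs::remove_file(PENDING)?;
    }
    fs::write(PENDING, b"applying")?; // set before the config is exercised
    fs::read_to_string(CURRENT)
}

fn report_healthy() -> std::io::Result<()> {
    // Call only after the loaded config has demonstrably worked.
    fs::copy(CURRENT, LAST_GOOD)?;
    fs::remove_file(PENDING)
}

fn main() -> std::io::Result<()> {
    let cfg = select_config()?;
    println!("loaded config: {} bytes", cfg.len());
    report_healthy()
}
```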


The problem space kind of dictates that this couldn't be a solution, because malware could load an arbitrary feature/config and mark it as 0, and then the AV would be disabled on the next boot, right?

fair point indeed!

Why put effort in engineering when you can just fear monger in marketing and buy politicians in sales?

More importantly, why are CS customers not validating? Upstream patches should be treated as faulty/malicious if not tested to show otherwise, especially if they're kernel level.

Perhaps a dumb question for someone who actually knows how Microsoft stuff works...

Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

Added: OK, from another post I now know Crowdstrike has some sort of kernel mode that allows this sort of catastrophe on Linux. So I guess there is a bigger question here...


> Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

Because malware that gets into a system will do just that -- install its own backdoor drivers -- and will then erect defenses to protect itself from future updates or security actions, e.g. change the path that Windows Update uses to download new updates, etc.

Having a kernel module that answers to CrowdStrike makes it harder for that to happen, since CS has their own (non-malicious) backdoor to confirm that the rest of the stack is behaving as expected. And it's at the kernel level, so it has visibility into deeper processes that a user-space program might not have (or that would be easy to spoof).


Or, much more likely, the malware will use a memory access bug in an existing, poorly written kernel module (say, CrowdStrike?) to load itself at the kernel level without anyone knowing, perhaps then flashing an older version of the BIOS/EFI and nestling there, or finding its way into a management interface. Hell, it might even go ahead and install an existing buggy driver by itself if it's not already there.

All of these invasive techniques end up making security even worse in the long term. Forget malware - there's freely available cheating software that does this. You can play around with it, it still works.


Maybe I am in the minority, but it always puzzled me that anybody in IT would think a mega-privileged piece of software that looks into all files was a good idea.

If there is any place that historically was exploited more than anything else, it was broken parsers. Congratulations: if such an exploit file is now read by your AV software, it sits at a position where it is allowed (expected) to read all files, and it would not surprise me if it could write them as well.

And you just doubled the number of places in which things can go wrong. Your system/software that reads a PNG image might do everything right, but do you know how well your AV software parses PNGs?

This is just an example, but the question we really should ask ourselves is: why do we have systems where we expect malicious files to just show up in random places? The problem with IT security is not that people don't use AV software, it is that they run systems so broken by design that AV has to be sprinkled on top.

This is like installing a sprinkler system in a house full of gasoline. Imagine gasoline everywhere including in some of the water piping — in the best case your sprinkler system reacts in time and kills the fire, in the worst case it sprays a combustive mix into it.

The solution is of course not to build houses filled with gasoline. Meanwhile AV-world wants to sell you ever more elaborate, AI-driven sprinkler systems. They are not the ones profiting from secure systems, just saying..


> but it always puzzled me that anybody in IT would think a mega-privileged piece of software that looks into all files was a good idea.

Because otherwise, a piece of malware that installs itself at a "mega-privileged" level can easily make itself completely invisible to a scanner running as a low-priv user.

Heck, just placing itself in /root and hooking a few system calls would likely be enough to prevent a low-priv process from seeing it.


You're ignoring the parent's question of "why do we have systems where we expect malicous files to just show up in random places?", which I think is a good question. If a system is truly critical, you don't secure it by adding antivirus. You secure it by restricting access to it, and restricting what all software on the machine can do, such that it's difficult to attack in the first place. If your critical machines are immune to commodity malware, now you only have to worry about high-effort targeted attacks.

My point exactly. Antivirus is a cheap on-top measure that makes people feel they have done something; the actual safety of a system comes from preventing people and software from doing things they shouldn't do.

Why would you design a system where a piece of malware can "install itself" at a mega-privileged position?

My argument was that this is the flaw, and everything else is just trying to put lipstick on a pig.

If you have a nightclub and you have problems controlling which people get in, the first idea would not be to keep a thousand unguarded doors and then recruit people to search the inside of your nightclub for people they think didn't pay.

You would probably think about reducing the number of doors and adding effective mechanisms to them that help you with your goals.

I am not saying we don't need software that checks files at the door; I am saying we need to reduce the number of doors leading directly to the nightclub's cash reserve.


I wonder why and how does security software read a PNG file. Sure it's not tough to parse a PNG file, but what does it look for exactly?

Some file formats allow data to be appended or even prepended to the expected file data and will just ignore the extra data. This has been used to create executables that happen to also be a valid image file.

I don't know about PNG, but I'm fairly sure JPEG can work this way. You can concatenate a JPEG file to the end of an executable, and many JPEG parsers will handle it fine, since they scan for the magic bytes before beginning to parse the JPEG.

A JPEG that has something prepended might raise an eyebrow. A JPEG that has something executable prepended should raise alarms.
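As an illustration of the heuristic described here (an assumed check, not any AV vendor's actual logic): a file that begins with an executable header but also carries a JPEG marker later on is exactly the kind of polyglot that deserves a closer look.

```rust
// An illustration of the heuristic described above (an assumed check, not any
// AV vendor's actual logic): flag files that start with an executable header
// (MZ for PE, \x7fELF for ELF) yet also contain a JPEG start-of-image marker
// further in.
fn looks_like_executable_polyglot(data: &[u8]) -> bool {
    let starts_as_exe = data.starts_with(b"MZ") || data.starts_with(b"\x7fELF");
    // JPEG SOI (0xFF 0xD8) appearing anywhere past the first byte.
    let has_embedded_jpeg = data.windows(2).skip(1).any(|w| w[0] == 0xFF && w[1] == 0xD8);
    starts_as_exe && has_embedded_jpeg
}

fn main() {
    // An executable with a JPEG appended should be flagged...
    let mut polyglot = b"MZ\x90\x00...rest of the PE headers...".to_vec();
    polyglot.extend_from_slice(&[0xFF, 0xD8, 0xFF, 0xE0]);
    assert!(looks_like_executable_polyglot(&polyglot));

    // ...while a plain JPEG is not.
    assert!(!looks_like_executable_polyglot(&[0xFF, 0xD8, 0xFF, 0xE0]));
}
```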


Why make something like that executable in the first place? I like the Unix model where things that should be executable are marked as such. I know bad parsers and format decoders can lead to executable exploits, but I've always felt uncomfortable with the Windows .exe model. Also VBA in Excel, Word... I believe a better solution would be a minimal executable surface rather than invasive software.

Vendors are allowed to install drivers, even via Windows Update. Many vendors, like HP, install functionality like telemetry as drivers to make it more difficult for users to remove the software.

So next time you think you are doing a "clean install", you are likely just re-installing the same software that came with the machine.


It doesn't install the driver, it is the driver. As for the Linux version, it uses eBPF which has a sandbox designed to never crash the kernel. Windows does have something similar nowadays, but Crowdstrike's code probably predates it and was likely just rawdogging the kernel.

> Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

While the files are named XXX.SYS, they are apparently not drivers. The issue is that a corrupted XXX.SYS was loaded by the already-installed driver, which promptly crashed.


As I understand it, it was a definition update that caused a crash inside the already-installed driver.

For a while I've joked with family and colleagues that software is so shitty on a widespread basis these days that it won't be long before something breaks so badly that the planet stops working. Looks like it happened.

"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. It demands the same skill, devotion, insight, and even inspiration as the discovery of the simple physical laws which underlie the complex phenomena of nature."

"The most important property of a program is whether it accomplishes the intention of its user."

C.A.R. Hoare


Agreed, but have you been in the industry lately? Nobody hires assembly programmers anymore. If you want money, you must work at the wobbly top of abstraction mountain.

I am well aware, but the quotes are timeless for a reason. Not to be cheeky, but "Want money" is exactly how you get to the many routinely broken endpoint solutions that wind up reducing reliability and at times increasing the attack surface. Wherever you are in the stack, please make it more robust and easier to reason about. No matter how far from the assembly.

It’s not just about the tech abstraction mountain, it’s about the app logic and dev process too.

A react native JS app with a clear spec and a solid release process can be more reliable than bloated software that receives an untested hotfix, even if the latter was handwritten in assembly.


I guess this article might need some updating soon:

https://www.crowdstrike.com/resources/reports/total-economic...


"Falcon Complete managed detection and response (MDR) delivers 403% ROI, zero breaches and zero hidden costs"

I'm always curious on how security software can provide a ROI.

I had McAfee tell me one time that the hackersafe logo on our website would increase sales by 10%, this was at a Fortune 50 doing billions in sales online every year.

I was pretty hyped because it would have done wonders for my career, but then they walked it back and wouldn't explain it to me. I wasn't mad, I was disappointed.


I ran an A/B test in 2012, not sure it's relevant now: we tested the McAfee logo and conversion was boosted by 2%. A bigger boost came from a lock icon, 3%. It kept increasing the more locks we added and topped out at 5% after 5 lock icons.

The intersection of ROI and human psychology!

1 lock: “looks safe, I buy”

2 locks: “wow really safe, I buy more”

50 locks: “I’m being lied to”


> ...delivers -407% ROI...

FTFY.


My entire emergency department got knocked offline by this. Really scary when you have ambulances coming in and are trying to stabilize a heart attack.

Update: 911 is down in Oregon too, no more ambulances at least.


Do you have offline backup processes at least? Nasty situation.

We're really prepared for Epic to go down and have an isolated cluster that we access in emergencies. I transitioned from software engineering so I've only been in the ED for a year, but from what I could see there didn't seem to be a plan for what to do if every computer in the department bluescreened at once.

"Always look on the bright side of life!" - M Python

But at least the best instant messaging app in the world Microsoft Teams and the best web browser in the world Microsoft Edge are working fine, right?

It's a bsod loop, so not really.

What do we do next week?

So assuming everyone uses sneaker-net to restart what's looking like millions of Windows boxes, there come the recriminations, but then … what?

I think we need to look at a minimum viable PC - certain things are protected more than others. Phones are a surprisingly good example - there is a core set of APIs and no fucker is ever allowed to do anything except through those. No matter how painful. At some point MSFT is going to enforce this the way Apple does. The EU court cases be damned.

For most tasks and most things it's hard to argue that an OS and a web browser aren't the maximum needed.

We have been saying it for years - what I think we need is a manifesto for much smaller usable surface areas


In this case even dockerized environments would allow you to redeploy with ease.

But that's too much work; many of these systems are running Docker-resistant software. Management doesn't want to invest in modernization - it works this quarter, and it's someone else's problem next quarter.

You're basically proposing Windows 12 to radically limit what software and drivers can do. Even then eventually someone will probably still break it with weird code.

I'm actually amazed these updates are being tested in prod. Do they have no QA environments ?

Do I personally need to create a startup company called Paranoia... We actually run a clone of your prod environment minus any sensitive data, then we install all the weird and strange updates before they hit your production servers...

As an upsell we'll test out privileges, to make sure your junior engineers can't break prod.

Someone raise a seed round, I'm down to get started this week.


> In this case even dockerized environments would allow you to redeploy with ease.

Not if the CIO mandated that your bare-metal OS hosting Docker has to run a rootkit developed by bozos.


I think this is existential for Windows, and by extension MSFT. Something like 95% of corporate IT activity is either over HTTP (i.e. every SaaS and web app) or over the serial port (controlling that HVAC, that window blind, that garage lifter).

So what we need on 95% of boxes is not a fully capable PC - we need a really locked-down OS. Or rather, we can get by with a locked-down OS.

I would put good money on there already being a tiny from-the-ground-up OS inside MSFT that could be relabelled Windows-Locked-Down(13) and sold exclusively to large corporates (and maybe small ones who sign a special piece of marketing paper).

The thing is, once you do that you are breaking the idea that Windows can run everywhere (or rather, we claim Linux runs everywhere, but the thing that's on my default Ubuntu install and the thing on my router are different).


Isn't that basically the point of WinRT and Windows 10 S Mode? The problem is getting developers to adopt the new more secure APIs.

So apparently "The issue has been identified, isolated and a fix has been deployed" https://x.com/George_Kurtz/status/1814235001745027317

Yet the chaos seems to continue. Could it be that this fix can't be rolled out automatically to affected machines because they crash during boot - before the Crowdstrike Updater runs?


Correct. Many just end up in an endless loop and never actually boot.

It's about as bad as it gets.


That update is so tone-deaf and half-assed. There's no apology.

If you go to the website, there's nothing on their front-page. The post on their blog (https://www.crowdstrike.com/blog/statement-on-windows-sensor...) doesn't even link to the solution. There's no link to "Support Portal" anywhere to be seen on their front-page. So, you have to go digging to find the update.

And the "Fix" that they've "Deployed" requires someone to go to Every. Single. Machine. Companies with fleets of 50k machines are on this HN thread - how are they supposed to visit every machine?!?!


They won't apologize for legal reasons. Also, it will only make their stock fall further.

The CEO actually did apologize: "We're deeply sorry for the impact that we've caused to customers, to travelers, to anyone affected by this..."

https://www.reuters.com/technology/crowdstrike-ceo-apologize...


Any response they make in the middle of a global outage will be half-assed. They have all available resources figuring out what the hell just happened and how to fix it.

An apology this early is a lose-lose. If they do apologize, they'll piss off people who are dealing with it and want a fix, not an apology. If they don't apologize, they're tone-deaf and don't seem to care.


Imagine being anywhere near the team that sent this...

lol sounds good, but how the hell do they deploy a fix to a machine that has crashed and is looping BSOD with no internet or network connectivity...

You do what I've been doing for the last 10 hours or so: you walk to each and every desktop and manually type in the BitLocker key so you can remove the offending update.

at least the virtual devices can be fixed sitting at a desk while suckling at a comfort coffee..


Yeah, you need to manually fix each affected system by booting in safe mode. Not possible to do remotely.

And you will need your bitlocker recovery key to access your encrypted drive in safe mode. I luckily had mine available offline

There's going to be a lot of handholding to get end users through this.


You can enable safemode for next boot without the recovery key and then you can delete the offending file on that next boot.

That requires being able to boot in the first place

You can do a minimal boot. I'm told.

Ouch!


There's potentially a huge issue here for people using BitLocker with on-prem AD, because they'll need the BitLocker recovery keys for each endpoint to go in an fix it.

And if all those recovery keys are stored in AD (as they usually are), and the Domain Controllers all had Crowdstrike on them...


Bitlocker keys are apparently not necessary: https://x.com/AttilaBubby/status/1814216589559861673

It might work on some machines, but I doubt it will work on the rest. Worth a try.

This is the best definition of "single point of failure" I have ever seen.

Assuming that they also have a regular BitLocker password, there's hope with a bit of manual effort. https://news.ycombinator.com/item?id=41003893

Most of the large deployments I've seen don't use pre-boot PINs, because of the difficulty of managing them with users - they just use TPM and occasionally network unlock.

So might save a few people, but I suspect not many.


Yeah but TPM-only Bitlocker shouldn't be affected anyway by this issue, these machines should start up just fine.

Whoever only has AD-based Bitlocker encryption is straight up fucked. Man, and that on a Friday.


That's the easy part? just do the domain controller first?

I got around BitLocker and booted into safe mode by setting automatic boot to safe mode via bcdedit https://blog.vladovince.com/mitigating-the-crowdstrike-outag...

> CrowdStrike Engineering has identified a content deployment related to this issue and reverted those changes.

> Workaround Steps:

> Boot Windows into Safe Mode or the Windows Recovery Environment

> Navigate to the C:\Windows\System32\drivers\CrowdStrike directory

> Locate the file matching “C-00000291*.sys”, and delete it.

> Boot the host normally.
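For what it's worth, the quoted workaround is mechanical enough to script. Below is a sketch in Rust using the same path and filename pattern as above; in practice people did this by hand in Safe Mode or WinRE with cmd/PowerShell, and a standalone binary only helps if you can get it onto the machine (e.g. from bootable media).

```rust
// A sketch that automates the quoted workaround, using the same directory and
// filename pattern as above. Run with sufficient privileges; illustrative only.
use std::fs;

fn main() -> std::io::Result<()> {
    let dir = r"C:\Windows\System32\drivers\CrowdStrike";
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let name = path
            .file_name()
            .and_then(|n| n.to_str())
            .unwrap_or("")
            .to_string();
        // Matches the "C-00000291*.sys" pattern from the advisory.
        if name.starts_with("C-00000291") && name.ends_with(".sys") {
            println!("deleting {}", path.display());
            fs::remove_file(&path)?;
        }
    }
    Ok(())
}
```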


Was thinking about a bootable usb-stick that would do that automagically. But I guess it is harder to boot from a usb-stick in these environments than the actual fix.

I guess it would be more feasible, and even neater, if you have network boot or similar.


So booting into safe mode should do the trick right, even if Bitlocker is enabled?

What if you have 50k workstations? Can you even do this remotely?

The problem may be fixed but I can see some companies having a really shit weekend.


2000s vibes.

This gem from the ABC news coverage has my mind 100% boggled:

"711 has been affected by the outage … went in to buy a sandwich and a coffee and they couldn’t even open the till. People who had filled up their cars were getting stuck in the shop because they couldn’t pay."

Can't even take CASH payment without the computer, what a world!


Technically a payment terminal can go into island mode and take offline credit card transactions and post them later. PIN can be verified against the card.

Depends if the retailer wants to take the chance of all that.


That is if the terminal is not dead itself

The terminal is probably not running Crowdstrike...

Crowdstrike != Windows ;)

Terminal running Windows? Someone is going to make it run Crowdstrike too.

You might be surprised..

"Probably" is a load bearing word

Dude. SO MUCH STUFF runs on Windows.

The terminal is probably fine; the machine that tells it the number to charge is dead... And it's probably not even set up to accept manual payment inputs.

Yeah, all depends on how much config you want to allow employees to do, but I’m sure the functionality is there if you wish to enable it.

Having worked with some of these retail systems, yes, it depends on how they are configured.

There are stores in many places in the country with sporadic internet or where outages are not uncommon, and where you would want to configure the terminals to still work while offline. In these cases, the payment terminals can be configured to take offline transactions, and they are stored locally on the lane or a server located in the store until a connection to the internet is re-established.


Not this time. Use paper, pen, and a non-electronic cash box.

Good luck putting a payment terminal into island mode when it's in a bluescreen loop.

I'm seeing several reports of things like being unable to buy tickets for the train on-line in Belgium.

They use Windows as a part of their server infrastructure?


At least they'd take cash if the computer wasn't broken. That's getting quite rare in the UK.

Not really? I've only really seen people not taking cash at trendy street food stalls and bougie coffee shops, pretty much everywhere else does.

Not just indie coffee shops - chains too. Pubs, clothes shops... Even the Raspberry Pi store

Netherlands is the worst at this. More and more “PIN ONLY”. Also more and more tight rules about how much you're allowed to have.

Luckily I can just give someone a paper wallet containing crypto. No transactions, no traceability, no rules.


In London it’s really common

At where though? The example given was in 711 which is a nationwide chain a bit like a Tesco Express or Sainsbury's Local, both of which still accept cash nationwide in the UK too.

Aldi, apparently. That's where Piers Corbyn couldn't buy strawberries with cash. https://www.mirror.co.uk/news/uk-news/piers-corbyn-splits-op...

This whole thing likely would have been averted had microkernel architectures caught on during the early days (with all drivers in user mode). Performance would have likely been a non-issue, not only due to the state of the art L4 designs that came later, but mostly because had it been adopted everything in the industry would have evolved with it (async I/O more prevalent, batched syscalls, etc.).

I will admit we've done pretty well with kernel drivers (and better than I would have ever expected tbh), but given our new security focused environment it seems like now is the time to start pivoting again. The trade offs are worth it IMO.


Not disagreeing with you but we need operating systems with snapshots before updates and a trivial way to rollback the update.

Linux has some immutable OS versions and also btrfs snapshots and booting a specific snapshot from the GRUB bootloader


I wonder if for critical applications we'll ever go back to just PXE booting images from a central server: just load a barebones kernel and the app you want to run into a dedicated memory segment, mark everything else as NX, and you don't even have to worry about things like viruses and hacks anymore. Run into an issue? Just reboot!

Speaking as somebody who manages a large piece of a 911 style system for first responders and has done so for 10 years (and is not affected by this outage) - this is why we do not allow third parties to push live updates to our systems.

It's unfortunate, the ambulances are still running in our area of responsibility, but it's highly likely that the hospitals they are delivering patients to are in absolute chaos.


I just skimmed through the news. A lot of airports, hospitals, and even governments are down! It's ironic how people put their eggs in one basket, trying to avoid downtime caused by malware by relying on a company that took their systems down. A lot of lessons will be learned from this for sure.

Unless you run half your devices on one security vendor and half on another, surely there is no way around it? Companies install this stuff over "Windows Defender" so they can point fingers at the security vendor when they get hacked; this is the other side of the coin.

It has happened before that security software has had unwanted effects; I can't say I remember anyone else managing to blue-screen Windows and require a safe-mode boot to fix the endpoints, though.


Relying on easy-install "security vendors" is the problem. It's one thing to run an antivirus on a general purpose PC that doesn't have a qualified human admin. But many of the computers affected here are single-purpose devices, which should operate with a different approach to security.

Thank god all the critical infrastructure in my country is still on MS-DOS!

Hahaha, you mean all the CRITIC~1.INF ?

> Hahaha, you mean all the CRITIC~1.INF ?

Kids those days. It shall be CRITIC~1.COM


That school running their HVAC infra on an Amiga must be pretty happy.

The biggest mistake here is running a global update on a Friday. Disrespect to every sysadmin worldwide.

Disrespect to every CIO who makes their business depend on a single operating system, running automatic updates of system software without any canaries or phased deployments.

You're saying I should diversify my 100% Linux operation to also use Windows?

While I believe Linux is a more reasonable operating system than Windows, shit can happen everywhere.

So if you have truly mission-critical systems you should probably have at least 2 significantly different systems, each of them able to maintain some emergency operations independently. Doing this with 2 Linux distros is easier than doing it with Linux and Windows. For workstations, Macs could be considered; for servers, BSD.

Probably many companies will accept the risk that everything goes down. (Well, they probably don't say that. They say maintaining a healthy mix is too expensive.)

In that case you need a clearly phased approach to all updates. First update some canaries used by IT. If that goes well update 10% of the production. If that goes well (well, you have to wait until affected employees have actually worked a reasonable time) you can roll out increasingly more.

No testing in a lab (whether at the vendor or you own IT) will ever find all problems. If something slips through and affects 10% of your company it's significantly different from affecting (nearly) everyone.


Maybe some OpenBSD would be a good hedge. It can also help spot over-reliance on some Linux quirks.

What makes you think Windows is the only alternative? Have you never heard of GNU Hurd?

More seriously, I am not saying you should run critical services on MenuetOS or RISC OS, but the BSDs are still alive and kicking, as are illumos and its derivatives. And yes, I think a bit of diversity allows some additional resilience. It may require more staffing, but IMHO it is worth the downsides.


The biggest mistake is not ringfencing this update in a test environment before sign-off for general deployment.

Presumably they do test their updates, they're just maybe not good enough tests.

The ideal would be to do canary rollouts (1%, then 5%, 10% etc.) to minimise blast radius, but I guess that's incompatible with antiviruses protecting you from 0-day exploits.


While I'm usually a proponent of update waves like that, I know some teams can get loose with the idea if they determine the update isn't worth that kind of carefulness.

Not saying CS doesn't care enough, but what may look to the team that shipped it like a minor update not needing a slow rollout is actually something that really should be supervised in that way.


Our worst outage occurred when we were deploying some kernel security patches and we grew complacent and updated the main database and its replica at the same time. We had a maintenance window with downtime at the same time anyway, so whatever. The update worked on the other couple hundred systems.

Except, unknown to us, our virtualization provider had a massive infrastructural issue at exactly that moment preventing VMs from booting back up... That wasn't a fun night to failover services into the secondary DC.


Was this update meant to save from a 0 day?

Update: change color of text in console

Agreed. What happened to Patch Tuesdays?!

I don't think the day matters anymore, really.

The issue is the update rollout process, the lack of diversity in these kinds of tools in the industry, and the absolute failure of the software industry to make decent software without bugs and security holes.


Yeah, airlines prefer mid-week chaos & grounding.

Crowdstrike is a perfect name for a company that could cause a worldwide outage.

Makes me think of flystrike which is also a perfect analogy https://en.wikipedia.org/wiki/Myiasis

Yeah, I've always thought it was a bad name. I see them during Formula 1 advertisements because they sponsor the Mercedes team.

They might as well have named themselves "cluster bomb" as they have done a huge amount of damage today and for the next few days.

