"First, in some cases, a reboot of the instance may allow for the CrowdStrike Falcon agent to be updated to a previously healthy version, resolving the issue.
Second, the following steps can be followed to delete the CrowdStrike Falcon agent file on the affected instance:
1. Create a snapshot of the EBS root volume of the affected instance
2. Create a new EBS volume from the snapshot in the same Availability Zone
3. Launch a new instance in that Availability Zone using a different version of Windows
4. Attach the EBS volume from step (2) to the new instance as a data volume
5. Navigate to the \windows\system32\drivers\CrowdStrike\ folder on the attached volume and delete "C-00000291*.sys"
6. Detach the EBS volume from the new instance
7. Create a snapshot of the detached EBS volume
8. Create an AMI from the snapshot by selecting the same volume type as the affected instance
9. Call replace root volume on the original EC2 Instance specifying the AMI just created"
Yes it can, that's what I ended up writing at 4am this morning, lol. We manage way more instances than is feasible to do anything by hand. This is probably too late to help anyone, but you can also just stop the instance, detach the root volume, attach it to another instance, delete the file(s), offline the drive, detach it, reattach it to the original instance, and then start the instance. You need a "fixer" machine in the same AZ.
FWIW, I find the high-level overview more useful, because then I can write a script tailored to my situation. Between `bash`, the `aws` CLI, and PowerShell, it would be straightforward to programmatically apply this remedy.
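For anyone in the same boat, here's a rough sketch of what scripting that "fixer machine" shuffle might look like with Python/boto3. This is not official AWS tooling; the instance IDs, region, and device name below are placeholders, and the actual channel-file deletion on the fixer is left as a manual (or SSM) step.

```python
# Hedged sketch: automate the EBS shuffle for one affected Windows instance.
# Assumes boto3 credentials are configured and the affected instance and a
# healthy "fixer" instance are in the same AZ. IDs and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

AFFECTED_INSTANCE = "i-0123456789abcdef0"            # placeholder IDs
FIXER_INSTANCE    = "i-0fedcba9876543210"
FIXER_DEVICE      = "xvdf"                            # secondary device on the fixer

def wait(name, **kwargs):
    ec2.get_waiter(name).wait(**kwargs)

# 1. Stop the broken instance and find its root volume.
ec2.stop_instances(InstanceIds=[AFFECTED_INSTANCE])
wait("instance_stopped", InstanceIds=[AFFECTED_INSTANCE])
inst = ec2.describe_instances(InstanceIds=[AFFECTED_INSTANCE])[
    "Reservations"][0]["Instances"][0]
root_device = inst["RootDeviceName"]
root_vol = next(m["Ebs"]["VolumeId"] for m in inst["BlockDeviceMappings"]
                if m["DeviceName"] == root_device)

# 2. Detach the root volume and attach it to the fixer as a data volume.
ec2.detach_volume(VolumeId=root_vol, InstanceId=AFFECTED_INSTANCE)
wait("volume_available", VolumeIds=[root_vol])
ec2.attach_volume(VolumeId=root_vol, InstanceId=FIXER_INSTANCE, Device=FIXER_DEVICE)
wait("volume_in_use", VolumeIds=[root_vol])

# 3. On the fixer: bring the disk online, delete
#    \Windows\System32\drivers\CrowdStrike\C-00000291*.sys, then offline the disk.
input("Delete the channel file(s) on the fixer, offline the disk, then press Enter...")

# 4. Move the volume back and boot the original instance.
ec2.detach_volume(VolumeId=root_vol, InstanceId=FIXER_INSTANCE)
wait("volume_available", VolumeIds=[root_vol])
ec2.attach_volume(VolumeId=root_vol, InstanceId=AFFECTED_INSTANCE, Device=root_device)
wait("volume_in_use", VolumeIds=[root_vol])
ec2.start_instances(InstanceIds=[AFFECTED_INSTANCE])
```

The snapshot/AMI/replace-root route AWS describes above can be automated the same way; this version just moves the root volume back and forth because it has fewer moving parts.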
Someone on X has shared the kernel stack trace of the crash
The faulting driver in the stack trace was csagent.sys.
Now, Crowdstrike has two minifilter drivers registered with Microsoft (for signing and allocation of altitude).
1) csagent.sys - Altitude (321410)
This altitude falls within the range for Anti-Virus filters.
2) im.sys - Altitude (80680)
This altitude falls within the range for access control drivers.
So, it is clear that the driver causing the crash is their AV driver, csagent.sys.
The workaround that CrowdStrike has given is to delete C-00000291*.sys files from the directory:
C:\Windows\System32\Drivers\CrowdStrike\
The files being suggested for deletion are not driver files (despite the .sys extension) but probably some kind of virus-definition database files.
The reason they name these files with the .sys extension is possibly to leverage the Windows System File Checker tool's ability to restore deleted system files.
This seems to be a workaround and the actual fix might be done in their driver, csagent.sys and the fix will be rolled out later.
Anyone with access to a Falcon endpoint might see a change in the timestamp of the driver csagent.sys when the actual fix rolls out.
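If you want to watch for that, a throwaway script along these lines would do. The directory is the one quoted above; exactly where csagent.sys lives may vary by install, so treat this as purely illustrative.

```python
# Illustrative check on a Falcon endpoint: list any C-00000291* channel files
# still present and print the csagent.sys timestamp, so you can tell when the
# driver itself gets replaced. Paths follow the workaround quoted above and
# may differ on your install.
import datetime
import glob
import os

crowdstrike_dir = r"C:\Windows\System32\drivers\CrowdStrike"

for f in glob.glob(os.path.join(crowdstrike_dir, "C-00000291*.sys")):
    print("channel file still present:", f)

driver = os.path.join(crowdstrike_dir, "csagent.sys")  # assumed location
if os.path.exists(driver):
    mtime = datetime.datetime.fromtimestamp(os.path.getmtime(driver))
    print("csagent.sys last modified:", mtime)
else:
    print("csagent.sys not found here; adjust the path for your install")
```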
When you see the size of the impact across the world, the number of people who will die because hospital, emergency and logistics systems are down…
You don’t need conventional war any more. State actors can just focus on targeting widely deployed “security systems” that will bring down whole economies and bring as much death and financial damage as a missile, while denying any involvement…
I always think it's easy for state actors to pull out this trick.
Considering PR review is usually done within the team, a state actor can simply insert a manager, a couple of senior developers, and maybe a couple of junior developers into a large team to do the job. Push something in on a Friday so few people bother to check, get it approved by another implant, and there you go.
Seeing all the cancelled and delayed flights, it makes me think a hacking kind of climate activism/radicalism would be more useful than gluing hands to roads, or throwing paint on art.
Activism is mostly about awareness, because generally you believe your position to be the one a logical person will accept if they learn about it, so doing things that get in the news but only get you a small fine or a month in jail is preferred.
Taking destructive action is usually called "ecoterrorism" and isn't really done much anymore.
Given how obvious a target this vector is once it's so widespread, it stands to reason that the same state actors would push phishing schemes and other such efforts in order to justify having a tool like CrowdStrike used everywhere. We are focusing on the bear trap snapping shut here, but someone took the time to set up that trap right where we'd be stepping in the first place.
I was in my 20s during the peak hysteria of post-9/11 and GWOT. I had to cope with the hysteria of a constant terror threat, hyped 24/7 by the media and DHS, and work out whether it was real.
The fact that global infra is so flimsy and vulnerable brought me tremendous relief. If the terror threats were real, we would have been experiencing infrastructure attacks daily.
I remember driving through rural California thinking if the terrorist cells were everywhere, they could trivially <attack critical infra that I don't want to be flagged by the FBI for>
I've read a lot of cyber security books like Countdown to Zero Day, Sandworm, and Ghost in the Wires, and each one brings me relief. Many of our industrial systems have the most flimsy, pathetic, unencrypted & uncredentialed wireless control protocols that are vulnerable to remote attack.
The fact that we rarely see incidents like this, and when they do happen, they are due to gross negligence rather than malice, is a tremendous relief.
This is the silver lining of global capitalism. When every power on earth is invested in the same assets there is little interest in rocking the boat unless the financial justification to do so is sufficiently massive.
Until deglobalization sufficiently spreads to the software ecosystem. Just a few hours ago I attended a lecture by a very high-profile German cybersecurity researcher (though he keeps a low profile). The guy is a real greybeard, can fluently read any machine code; he was building and selling Commodore 64 cards at 14. (I don't even know what that is.) He's hell-bent on not letting in any US code nor a single US chip. Intel is building a 2nm fab in Magdeburg, Germany, which will be the most advanced in the world when completed. German companies are developing their own fabs, not based on or purchased from ASML, and their own chip designs. A new German operating system is being built in Berlin.
Huawei, after their CFO got detained in Canada, took the Linux source code and rewrote it file by file in C++. Now they're using it in all their products; it's called HarmonyOS. The Chinese are recruiting ex-TSMC engineers to mainland China and giving them everything, free house, car, money, a free pass between Taiwan and China, just to build their own fab in a city whose name I don't know how to spell.
I'm not German, but I'll go to hell and back with the move to deglobalize, or in other words, de-Americanize. This textarea cannot possibly express my anger and hatred toward the past fifty years of the domination of Imperium Americana. Not for a single moment have they let us live without bloodshed and brutal oppression.
We are far past that point. So many critical systems are running on autopilot, with people who built and understood them retiring, and a new batch of unaware, aloof, apathetic people at the helm.
There's no real need for some Bad Actor -- at some point, entropy will take care of it. Some trivial thing somewhere will fail, and create a cascade of failures that will be cataclysmic in its consequences.
It's not fear-mongering, it's kind of a logical conclusion to decades of outsourcing, chasing profit above and over anything else, and sheer ignorance borne of privilege. We forgot what it took to build the foundations that keep us alive.
That's just what old people like to think: that they are super important and could never be replaced. A few months ago I replaced a "critical" employee that was retiring and everyone was worried what would happen when he was gone. I learned his job in a month.
Most people aren't very important or special and most jobs aren't that difficult.
why the fuck is our critical infrastructure running on WINDOWS. Fuck the sad state of IT. CIOs and CTOs across the board need to be fired and held accountable for their shitty decisions in these industries.
Yes, CRWD is a shitty company, but it seems they are a "necessity" according to some stupid audit/regulatory board that oversees these industries. But at the end of the day, these CIOs/CTOs are completely fucking clueless as to the exact functions this software performs on a regular basis. A few minions might raise an issue but they stupidly ignore them because "rEgUlAtOrY aUdIt rEqUiReS iT!1!"
While Linux isn't a panacea, the OS does matter as Linux provides tools for security scanners like Crowdstrike to operate entirely in userspace, with just a sandboxed eBPF program performing the filtering and blocking within the kernel. And yes, CrowdStrike supports this mode of operation, which I'll be advocating we switch over to on Monday. So yeah, for this specific issue, Linux provides a specific feature that would have prevented this issue.
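To make that split concrete (this is not how Falcon's eBPF backend is actually built, just a minimal sketch using the `bcc` Python bindings): a tiny, verifier-checked program sits in the kernel and only reports events, while all the detection logic stays in an ordinary userspace process that can crash without taking the box down.

```python
# NOT CrowdStrike's implementation -- a toy sketch of "sensor in eBPF, policy in
# userspace" using the bcc Python bindings (needs root and the bcc package).
# The in-kernel part is sandboxed and verifier-checked and can only report
# events; all decisions happen in this ordinary user process.
from bcc import BPF

prog = r"""
TRACEPOINT_PROBE(syscalls, sys_enter_execve) {
    // Verified kernel side: emit an event, nothing more.
    bpf_trace_printk("execve by pid %d\n", bpf_get_current_pid_tgid() >> 32);
    return 0;
}
"""

b = BPF(text=prog)
print("Watching execve; detection logic runs here in userspace. Ctrl-C to stop.")
while True:
    try:
        task, pid, cpu, flags, ts, msg = b.trace_fields()
        # A real agent would match the event against its content/rules here and
        # only then act; a bad rule update crashes this process, not the kernel.
        print(f"{ts:.3f}: {task.decode(errors='replace')} (pid {pid}) called execve")
    except KeyboardInterrupt:
        break
```

If that userspace agent dies on a bad content update, you lose telemetry, not the machine, which is the whole point of the argument above.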
> The OS doesn't matter, the question should be why is critical infrastructure online and allowed to receive OTA updates from third parties.
Not exactly. I think the question is why is critical infrastructure getting OTA updates from third parties automatically deployed directly to PROD without any testing.
These updates need to go to a staging environment first, get vetted, and only then go to PROD. Another upside of that it won't go to PROD everywhere all at once, resulting in such a worldwide shitshow.
I think you have the priority backwards. We shouldn’t be relying on trusting the QA process of a private company for national security systems. Our systems should have been resilient in the face of Crowdstrike incompetence.
> I think you have the priority backwards. We shouldn’t be relying on trusting the QA process of a private company for national security systems. Our systems should have been resilient in the face of Crowdstrike incompetence.
I think you misunderstood me. I wasn't talking about Crowdstrike having a staging environment, I was talking about their customers. So 911 doesn't go down immediately once Crowdstrike pushes a bad update, because the 911 center administrator stages the update, sees that it's bad, and refuses to push it to PROD.
I think that would even provide some resiliency in the face of incompetent system administrators, because even if they just hit "install" on every update, they'll tend to do it at different times of day, which will slow the rollout of bad updates and limit their impact. And the incompetent admin might not hit "install" because he read the news that day.
Lol, if they can't do staging to mitigate balls-ups on the high-availability infrastructure side (Optus in Australia earlier this year pushed a router config that took down 000 emergency calls for a good chunk of the nation), we've got bugger-all hope of big companies getting it right further up the stack in software.
In this case it wasn’t an update to the OS but an update to something running on the OS supplied by an unrelated vendor.
But if we entertain the idea that another OS would not need CrowdStrike or anything else that required updates to begin with, I have doubts. Even your CPU needs microcode updates nowadays.
Of course the OS matters! Windows is a nasty ball of patches in order to maintain backward compatibility with the 80s. Linux and OSX don't have to maintain all the nasty hacks to keep this backward compatibility.
Also, Crowdstrike is a security (patch) company because Windows security sucks to the point they have, by default, real-time virus protection running constantly (runs my CPU white hot for half the day, can you imagine the global impact on the environment?!).
It's so bad on security that it's given birth to a whole industry to fix it, i.e. Crowdstrike. Every time I pass a bluescreen in a train station or on an advertisement I'm like "hA! you deserve that for choosing Windows".
IBM’s z/OS maintains compatibility with the 60’s, and machines running it continued to process billions of transactions every second without taking a break.
The OS matters, as well as the ecosystem and, and this is most important, the developer and operations culture around it.
> Of course the OS matters! Windows is a nasty ball of patches in order to maintain backward compatibility with the 80s. Linux and OSX don't have to maintain all the nasty hacks to keep this backward compatibility.
Just don't tell that to Linus Torvalds :) Because Linux absolutely does maintain compatibility with old ABIs from the '90s.
> Just don't tell that to Linus Torvalds :) Because Linux absolutely does maintain compatibility with old ABIs from the '90s.
That’s nothing. IBM’s z/OS maintains compatibility with systems dating all the way back to the 60’s. If they want to think they are reading a stack of punch cards, the OS is happy to fool them.
You should look into what a kernel driver is. You can panic a Linux kernel with 2 lines of code just as you can panic a Windows kernel, they just got lucky that this fault didn't occur in their Linux version.
And to be honest, I don't think recovering from this would be that much easier for non-technical folk on a fully encrypted Linux machine, not that it's particularly hard on Windows, it's just a lot of machines to do it on.
In Linux it could be implemented as an eBPF thing while most of the app runs in userspace.
And, for specialised uses, such as airline or ER systems, a cut-down specialised kernel with a minimal userland would not require the kind of protection Crowdstrike provides.
But this is third-party software with ring-0 access to all of your computers deciding to break them. The technical features of the OS absolutely do not matter.
The question is whether other OSs would require it to have kernel mode privileges. People run complicated stuff in kernel mode for performance, because the switch to/from userspace is expensive.
Guess what’s also expensive? A global outage is expensive. Much more than taking the performance hit a better, more isolated, design would avoid.
This is true. Linux large fleet management is still missing some features large enterprises demand. Do they need all those features, idk, but they demand them if they're switching from Windows.
Windows also has better ways, such as filter drivers and hooks. If everybody used Linux, CrowdStrike would still opt for the kernel driver, since the software they create is effectively spyware that wants access to stuff as deep as possible.
If they opted for an eBPF service but put that into early boot chain, the bootloop or getting stuck could still happen.
The only long-term solution is to stop buying software from a company that has a track record of being pushy and having terrible software practices, like rolling out updates to the entire field at once.
I think the only real solution is for MSFT to stop allowing kernel-level drivers, as Apple has already (sort of, but nearly) done. Sure, lots and lots of crap runs on Windows in kernelspace, but what happened today cost a sizable fraction of the world's GDP. There won't be a better wake-up call.
But would the Linux sysadmins of the world play along in the way that the Windows sysadmins of the world did? I think they might've given CrowdStrike the finger and confined them to a smaller blast radius anyhow. And if they wouldn't have... well, they will now.
Once it gets popular, I think it would happen. The business people and C-suite would request quick, dirty solutions like CrowdStrike's offerings to check boxes when entering new markets and get around the red tape. So they'll force the Unix people to do as they say, or else.
Agreed. It's a safer culture because it grew up in the wild. Windows, by contrast, is for when everybody you're using it with has the same boss... places where sanity can be imposed by fiat.
If Microsoft is to be blamed here, it's not for the quality of their software, it's for fostering a culture where dangerous practices are deemed acceptable.
> If they opted for an eBPF service but put that into early boot chain, the bootloop or getting stuck could still happen.
If the in-kernel part is simple and passes data to a trusted userland application, the likelihood of a major outage like the one we saw is much reduced.
More specifically, why is critical stuff not equipped properly to revert itself and keep working and/or fail over? This should be built-in stuff at this point: have the last working OS snapshot on its own storage chip and automatically flash it back, even if it takes a physical switch… things like this just shouldn't happen.
> why the fuck is our critical infrastructure running on WINDOWS
Because it’s cheaper.
I feel like many in this thread are obsessing over the choice of OS when the actual core question is why, given the insane money we spend on healthcare, are all healthcare systems shitty and underinvested?
A sensible, well constructed system would have fallbacks, no matter if the OS of choice is Windows or Linux.
The difference is that lots of different companies can share the burden of implementing all that in Linux (or BSD, or anything else) while only Microsoft can implement that functionality in Windows and even their resources are limited.
Very little healthcare functionality would ever need to be created at the OS level. The burden could be shared no matter if machines were running Windows or Linux, they’re mostly just regular applications.
Not talking about the applications - those could be ported and, ideally, financed by something like the UNDP so that the same tools are available everywhere to any interested party.
I'm talking about Crowdstrike's Falcon-like monitoring. It exists to intercept "suspicious" activity by userland applications and/or other kernel modules.
Cheaper? Well, perhaps when you require your OS to have some sort of support contract. And your support vendor charges you unhealthy sums.
And then you get to see the value of the millions of dollars you've paid for support contracts that don't protect your systems at all. But those contracts do protect specific employees. When the sky falls down, the big money execs don't have a solution. But it's not their fault because the support experts they pay huge sums don't have solutions either. Somehow paying millions of dollars to support contractors that can't save you is not seen as a fireable offense. Instead it is a career-saving scapegoat.
Within companies that have been bitten this time, the team that wasn't affected because they made better process decisions will not be promoted as smarter. Their voice will continue to be marginalized by the people whose decisions led to this disaster. Because, hey, look, everyone got bit right? Nobody looks around to notice the people who were not bitten and recognize their better choices. And "I told you so" is a pretty bad look right now.
> I feel like many in this thread are obsessing over the choice of OS when the actual core question is why, given the insane money we spend on healthcare, are all healthcare systems shitty and underinvested?
Because it's basically impossible to compete in the space.
Epic is a pile of horseshit, but you try convincing a hospital to sign up to your better version.
Tons of critical infrastructure in the US runs on IBM z/OS. It doesn't matter what operating system you use; what matters is that updates aren't automatic and everything is as air-gapped as possible.
> why the fuck is our critical infrastructure running on WINDOWS.
That hits the nail on the head.
But it is a rhetorical question. We know why, generally, software sucks, and specifically why Windows is the worst and yet the most popular.
Good software is developed by pointy-headed nerds (like us), and successful software is marketed to business executives who have serious pathologies.
There are exceptions (I am struggling to think of one) where a serious piece of good software has survived being mass-marketed, but the constraints (basically business and science) conflict.
1/ Linux is just as vulnerable to kernel panics induced by such software. In fact, CS had a similar snafu in mid-April, affecting Linux kernels. Luckily, there are far fewer moronic companies running CS on Linux boxes at scale.
2/ it does offer protection - if you are running total shit architecture and you need to trust your endpoints not to be compromised, something like this is sadly a must.
Incidentally, Google, which prides itself on running a zero-trust architecture, sent a lot of people home on Friday. Not so zero-trust after all, it seems.
No, it's just soooooo bad at security/stability that it gave birth to Crowdstrike. The very fact that Crowdstrike is so big and prevalent is proof of the gaping hole in Windows security. It's given birth to a multibillion-dollar industry!
Crowdstrike/falcon use is not by any means limited to Windows. Plenty of Linux heavy companies mandate it on all infrastructure (although I hope that changes after this incident).
It’s mandated because someone believes Linux is as bad as Windows in that regard.
And, quite frankly, a well configured and properly locked down Windows would be as secure as a locked down Linux install. It’d also be a pain to use, but that’s a different question.
Critical systems should run a limited set of applications precisely to reduce attack surface.
The reality is the wetware that interfaces with any OS is always going to be the weakest link. Doesn't matter what OS they run, I guarantee they will click links and download files from anywhere.
I can pretty easily make it so a user on Linux can't download executables, and even then can't do any damage without a severe vulnerability. That is actually pretty difficult to do in a typical Windows AD deployment. There is a big difference between the two OSes.
In fact, there's a couple billion Linux devices running around locked down hard enough that the most clueless users you can imagine don't get their bank details stolen.
> yes CRWD is a shitty company but seems they are a "necessity" by some stupid audit/regulatory board that oversees these industries.
Yep, this is the problem. The part about Windows is a distraction here.
That bullshit regulation is a much larger security issue than Windows. Incomparably so. If you run it over Linux, you'll get basically the same lack of security.
I've picked the perfect day to return from vacation. Being greeted by thousands of users being mad at you and people asking for your head on a plate makes me reconsider my career choice. Here's to 12 hours of task force meetings...
Huge sympathies to you. If it's any consolation, because the scale of the outage is SO massive and widely reported, it will quickly become apparent that this was beyond your control, and those demanding your 'head on a plate' are likely to appear rather foolish. Hang in there my friend.
To their credit, the stakeholder that asked for my head personally came to me and apologised once they realised that entire airports have been shut down worldwide. But yeah, not a Friday/funday hahaha
Yeah, and these types make any problem worse. Any technical problem also becomes a social problem: dealing with these lunatics and keeping the house of cards from crumbling.
It's not a management thing, it's very much a personality trait ... that for whatever reason seems to survive in pockets of management in most organisations over a certain size.
It's not a trait that survives well at yard-crew level; trade assistants that freak out at spiders either get over it or never make it through apprenticeships to become tradespeople.
In IT, those who deal with failing processes, stopped jobs, smoking hardware, insufficient RAM, and tight deadlines learn to cope or get sidelined or fired (mostly).
To be clear, I've seen people get frazzled at most levels and many job types in various companies.
My thesis is there's a layer of management in which nervous types who utterly lose their cool at the first sign of trouble can survive better than elsewhere in large organisations.
But that's just been my experience over many years in several different types of work domains.
Ohhh absolutely. And it's not just users, it's also management. "How does this affect us? Are we compromised? What are our options? Why didn't we prevent this? How do you prevent this going forward? How soon can you have it back up? What was affected? Why isn't it everyone? Why are things still down? Why didn't X or Y unrelated vendor schlock prevent this?..."
And on and on and on. Just the amount of time spent unproductively discussing this nightmare is going to cost billions.
Nothing is more annoying than having a user ask a litany of questions whose answers are obvious to the person who is actually working on the problem and looking for those very answers.
They’re valid for a postmortem analysis. They’re not helpful while you’re actively triaging the incident, because they don’t get you any steps closer to fixing it.
Exactly my thinking. Asking these questions doesn't help us now. But after all the action is done, they should be asked. And really should be questions that always get asked from time to time, incident or no incident.
The problem is that you are only focusing on making the computers work and not the system.
"we don't know yet" is a valid response and gives the rest something to work, and it shouldn't annoy you that it's being asked, first of all because if they are asking is because you are already late.
you have to to tell the rest of the team what you know and you don't know, and update them accordingly.
until your team says something the rest don't know if it's a 30 minute thing or the end of the world or if we need to start dusting off the faxes.
Your head belongs on the plate for not being able to point back to your recommendations for improving failover posture, such as identifying core business systems and core function roles, having fully offline emergency systems, and warning of the dangers of making cloud services your only services, and then pointing out that the proposed cost of implementing these systems is lower than the damage caused by an outage of core business services.
Move to a new career if you feel you don't have the ability to push right back against this.
The only surprising thing is that this doesn't happen every month.
Nobody understands their runtime environment. Most IT orgs long ago "surrendered" control and understanding of it, and now even the "management" of it (I use the term loosely) is outsourced.
This is mostly physical machines in person, kiosks and pos terminals, office desktops and things like that. Windows is a tiny portion of GCP and AWS and the web in general.
I'm 100% "cloud" with tens of thousands of linux containers running and haven't been affected at all.
"I'm going to install an agent from Company X, on this machine, which it is essential that they update regularly, and which has the potential to both increase your attack surface and prevent not just normal booting but also successful operation of the OS kernel too". I am not going to provide you with a site specific test suite, you're going to just have to trust me that it wont interrupt your particular machine".
Why are so many mission critical hardware connected systems connected to the internet at all or getting automatic updates?
This is just basic IT common sense. You only do updates during a planned outage, after doing an easily reversible backup, or you have two redundant systems in rotation and update and test the spare first. Critical systems connected to things like medical equipment should have no internet connectivity, and need no security updates.
I follow all of this in my own home so a bad update doesn’t ruin my work day… how do big companies with professional IT not know this stuff?
Well that context makes it make a little more sense... I still wouldn't be trusting a service like that for mission critical hardware that shouldn't be connected to the internet in the first place.
The question with these types of services is: is your goal to keep the system as reliable as possible, or to be able to place the blame on a 3rd party when it goes down? If it's a critical safety system that human lives depend on, the answer better be the former.
But that's beside the point in any enterprise environment, or even in an SMB where third parties are doing IT stuff for you.
Your opinion doesn't matter there. Compliance matters. Paper risk-aversion matters. And they don't always align with common IT sense and, as has been proven now, reality.
If you must trust the software not to do rogue updates then I have to swing back into the camp of blaming the operating system. Is Linux better at this?
I've noticed phones have better permissions controls than Windows, seemingly. You can control things like hardware access and file access at the operating system level, it's very visible to the user, and the default is to deny permissions.
But I've also noticed that phone apps can update outside of the official channel, if they choose. Is there any good way to police this without compromising the capabilities of all apps?
Microsoft has tried pushing app deployment and management platforms that would make this kind of thing really possible, but it constantly receives massive pushback. This was the concept of stuff like Windows S, where pretty much all apps have to be the new modern store app package and older "just run the install.exe as admin and double click the shortcut to run" was massively deprecated or impossible.
I’m not an IT professional, but I don’t use antivirus software on my personal macs and linux machines- I do regular rotated physical backups, and only install software digitally signed by trusted sources and well reviewed Pirate Bay accounts (that's a joke :-).
My only windows machine is what I would classify as a mission critical hardware connected/control device, an old Windows 8 tablet I use for car diagnostics- I do not connect it to the internet, and never perform updates on it.
I am an academic and use a lot of old multi-million dollar scientific instruments which have old versions of windows controlling them. They work forever if you don't network them, but the first time you do, someone opens up a browser to check their social media, and the entire system will fail quickly.
Yes. In an environment where you have so many clients that they can DDoS the antivirus management server, you have to stagger the update schedule anyway. The way we set it up, sysadmins/help desk/dev deployments updated on day 1, all IT workstations/test deployments updated on day 2, and all workstations/staging/production deployments on day 3.
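A toy sketch of that kind of deterministic staggering (the ring names and three-day split are just this comment's example; the hostnames are made up):

```python
# Sketch of a day-1/day-2/day-3 staggered rollout: assign every host to a
# deterministic ring so a bad update only ever reaches one ring per day.
# Ring names, the canary set, and hostnames below are purely illustrative.
import hashlib

RINGS = {
    1: "sysadmins / help desk / dev deployments",
    2: "IT workstations / test deployments",
    3: "all workstations / staging / production",
}

def ring_for(hostname: str, canaries: set[str]) -> int:
    """Canary hosts update first; everything else is split deterministically."""
    if hostname in canaries:
        return 1
    digest = int(hashlib.sha256(hostname.encode()).hexdigest(), 16)
    return 2 if digest % 2 == 0 else 3

canaries = {"helpdesk-01", "dev-build-03"}                      # hypothetical
for host in ["helpdesk-01", "ward-7-nurses-pc", "checkin-kiosk-12"]:
    r = ring_for(host, canaries)
    print(f"{host}: update on day {r} ({RINGS[r]})")
```

The point is only that ring membership is stable and decided by you, not by the vendor, so a broken update hits the expendable ring first.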
Probably, implicitly. Have automated regular backups, and don’t let your AV automatically update, or even if it does, don’t log into all your computers simultaneously. If you update/login serially, then the first BSOD would maybe prevent you from doing the same thing on the other (or possibly, send you running to the other to accomplish your task, and BSODing that one too!)
But yeah this is one reason why I don’t have automatic updates enabled for anything, the other major one being that companies just can’t resist screwing with their UIs.
What people aren't understanding is that MOST of the outage isn't caused by the CrowdStrike install on a given machine itself; it's caused by something upstream of it (a critical application server) getting borked, and that's having a domino effect on everything else.
Remember, there's someone out there right now, without irony, suggesting that AI can fix this. There's someone else scratching their head, wondering why AI hasn't fixed this yet. And there's someone doing a three-week bootcamp in AI, convinced that AI will fix this. I’m not sure which is worse
A heuristic that has served me well for years is that anyone who uses the word “cybersecurity” is likely incompetent and should be treated with suspicion.
My first encounter with CrowdStrike was overwhelmingly negative. I was wondering why for the last couple weeks my laptop slowed to a crawl for 1-4 hours on most days. In the process list I eventually found CrowdStrike using massive amounts of disk i/o, enough to double my compile times even with a nice SSD. Then they started installing it on servers in prod, I guess because our cloud bill wasn’t high enough.
It rather looks like Crowdstrike marketed heavily to corporate executives using a horror story about the bad IT tech guy who would exfiltrate all their data if they didn't give Crowdstrike universal access at the kernel level to all their machines...? It seems more aimed at monitoring the employees of a corporation for insider threats than for defense against APT actors.
How long before companies start consciously de-risking by replacing general-purpose systems like Windows with newer systems with smaller attack surfaces? Why does an airline need to use Windows at all for operations? From what I’ve seen, their backend systems are still running on mainframes. The terminals are accessed on PCs running Windows, but those could trivially be replaced with iPadOS devices that are more locked down than Windows and generally more secure by design.
One of the problems possibly preventing this is that budgets for buying software aren't controlled by people administering the software. Definitely not by people using it.
Often, the cost of switching is too high or too complex to justify. On top of that, many applications commonly run in manufacturing etc. simply do not run on any other OS.
The billions that have been lost, and the lives that have been lost, have, in the blink of an eye, rendered the "too costly to implement" argument moot.
For bean-counting purposes, it's just really convenient that the burden of that cost was transferred onto somebody else, so that the claim can continue to be made that another solution would still be too costly to implement.
Accepting the status-quo that got us here in the first place, under the pseudo-rational argument that there are not realistic alternatives, is simply putting ones head in the sand and careening, full steam ahead, to the next wall waiting for us.
That there might not be an alternative available currently does not mean that a new alternative cannot be actively pursued, and that it is not time for extreme introspection.
Certain backend systems run on mainframes, yes. But the airline's website? No (only the booking portion interacts with a mainframe via API calls). Identity management system? No. Etc.
Banks are down so petrol stations and supermarkets are basically closed.
People can't check in to airline flights, various government services including emergency telephone and police are down. Shows how vulnerable these systems are if there's just one failure point taking all those down.
000 was never down, and most supermarkets and servos were still up. It was bad, but ABC appear to not have the internal capacity to validate all reports.
It's pretty bad when the main ABC 7pm news bulletin pretty much had them reading from their iPads, unable to use their normal studio background screens, and they didn't even give us the weather forecast!
CIO here. They are known to be incredibly pushy. In my company we RFP'd for our endpoint & cyber security. Found that the CS salesperson went over me to approach our CEO, who is completely non-technical, to try and seal a contract because I was on leave and out of service for one week (and this was known to them). When I found out, via our CEO informing me of the approach, we were happy to sign with SentinelOne.
One thing I'm really happy about at my current company is that when a sales person from a vendor (not Crowdstrike) tried that our CEO absolutely ripped them a new one and basically banned that company from being a vendor for a decade.
I had a very similar experience, I was leading the selection process for our new endpoint security vendor, Crowdstrike people:
- verbally attacked/abused a selection team member
- were ranting constantly about golf with our execs
- were dismissive and just annoying throughout
- raised hell with our execs when they learned they were not going to the POC, basically went through every one of them simultaneously
- I had to get a rep kicked out of the rfp as he was constantly disrespectful
We did not pick them, and cancelled any other relationship we had with them, in the IR space for example.
I think the update will be applied overnight, which is a different window (no pun intended) dependent on timezone and the impact will be reported when users come back online (or not) and identify the issue.
Currently seeing this happening in real time in the UK.
I was at the supermarket here last night about the time it kicked off. It seemed payWave was down, there were a few people walking out empty handed as they only had Apple Pay, etc on them. But the vast majority of people seemed fine, my chipped credit card worked without issue.
> 7/18/24 10:20PT - Hello everyone - We have widespread reports of BSODs on windows hosts, occurring on multiple sensor versions. Investigating cause. TA will be published shortly. Pinned thread.
This was particularly interesting (from the reddit thread posted above):
> A colleague is dealing with a particularly nasty case. The server storing the BitLocker recovery keys (for thousands of users) is itself BitLocker protected and running CrowdStrike (he says mandates state that all servers must have "encryption at rest").
> His team believes that the recovery key for that server is stored somewhere else, and they may be able to get it back up and running, but they can't access any of the documentation to do so, because everything is down.
> but they can't access any of the documentation to do so, because everything is down.
One of my biggest frustrations with learning networking was not being able to access the internet. Nowadays you probably have a phone with a browser, but back in the day if you were sitting in a data room and you'd configured stuff wrong, you had a problem.
Isn’t that what office safes are for? I don’t know the location, but all the old guard at my company knew that room xyz at Company Office A held a safe with printed out recovery keys and the root account credentials. No idea where the key to the safe is or if it’s a keypad lock instead. Almost had to use it one time.
I'm guessing someone somewhere said that "it must be stored in hard copy in a safe" and the answer was in the range of "we don't have a safe, we'll be fine".
Or worse, if it's like where I worked in the past, they're still in the buying process for a safe (started 13 months ago) and the analysts are building up a general plan for the management of the safe combination.
They still have to start the discussions with the union to see how they'll adapt the salary for the people that will have to remember the code for the safe and who's gonna be legally responsible for anything that happens to the safe.
Last follow-up meeting summary is "everything's going well but we'll have to modify the schedule and postpone the delivery date of a few months, let's say 6 to be safe"
Not just financial / process barriers. I worked for a company in the early 90's that needed a large secure safe to store classified documents and removable hard drives. A significant part of the delay in getting it was figuring out how to get it into the upstairs office where it would be located. The solution involved removing a window and hiring a crane.
When we later moved to new offices, somebody found a solution that involved a 'stair-walking' device that could supposedly get the safe down to the ground floor. This of course jammed when it was halfway down the stairs. Hilarity ensued.
Didn't bookmark it or anything and going back to the original reddit thread I now see that there are close to 9,000 comments, so unfortunately the answer is no...
Absolutely correct. Unfortunately, there is no other solution to this issue. If the laptops were powered down overnight, there might be a stroke of luck. However, this will be one of the most challenging recoveries in IT history, making it a highly unpleasant experience.
Yeah in context we have about 1000 remote workers down. We have to call them and talk through each machine because we can't fix them remotely because they are stuck boot looping. A large proportion of these users are non-technical.
MS Windows Recovery screen (or the OS installer disk) might ask you for the recovery key only, but you can unlock the drive manually with the password as well! I had to do that a week ago after a disk clone gone wrong, so in case someone steps on the same issue (this here is tested with Win 10, but it should be just the same for W11 and Server):
1. Boot the affected machine from the Windows installer disk
2. Use "Repair options"
3. Click through to the option to spawn a shell
4. It will now ask you to unlock the disk with a recovery key. SKIP THAT.
5. In the shell, type: "manage-bde -unlock C: -Password", enter the password
6. The drive is unlocked, now go and execute whatever recovery you have to do.
> Can you even get the secret from the TPM in recovery mode?
Given that you can (relatively trivially) sniff the TPM communication to obtain the key [1], yes it should be possible. Can't verify it though as I've long ago switched to Mac for my primary driver and the old cheesegrater Mac I use as a gaming rig doesn't have a hardware TPM chip.
Yea, I don't need an attack on a weak system; I mean the authorized, legal, normal way of unlocking BitLocker from Windows when you have the right credentials. Windows might not be able to unlock BitLocker with just your password.
I don't know how common it is to disable TPM-stored keys in companies, but on personal licenses, you need group policy to even allow that.
Although this is moot if Windows recovery mode is accepted as the right system by the TPM. But aren't permissions/privileges a bit neutered in that mode?
Most people installed CrowdStrike because an audit said they needed it. I find it exceedingly unlikely that the same audit did not say they have to enable Bitlocker and backup its keys.
I can confirm this. EDR checkbox for CrowdStrike, BitLocker enabled for local disk encryption checkbox. BitLocker backups to Entra because we know reality happens, no checkbox for that.
I know it does for personal accounts once linked to your machine. Years ago, I used the enterprise version and it didn’t, probably because it was “assumed” that it should be done with group policies, but that was in 2017.
Yes you should be able to pull it from your domain controllers. Unless they're also down, which they're likely to be seeing as Tier 0 assets are most likely to have crowdstrike on them. So you're now in a catch 22.
Rolling back an Active Directory server is a spectacularly bad idea. Better make doubly sure it's not connected to any network before you even attempt to do so.
In theory. I've seen it not happen twice. (The worst part is that you can hit the Bitlocker recovery somewhat randomly because of an irrelevant piece of hardware failing, and now you have to rebuild the OS because the recovery key is MIA.)
It includes PDFs of some relevant support pages that someone printed with their browser 5 hours ago. That's probably the right thing to do in such a situation to get this kind of info publicly available ASAP, but still, oof. Looks like lots of people in the Reddit thread had trouble accessing the support info behind the login screen.
What many people are not talking about is why we are here.
One simple reason:
all eggs in one Microsoft PC basket.
Why in one Microsoft PC basket?
- most corporate desktop apps are developed for Windows ONLY
Why are most corporate desktop apps developed for Windows ONLY?
- it is cheaper to develop and distribute, since 90% of corporations use Windows PCs (chicken-and-egg problem)
- the alternative, Mac laptops, are 3x more expensive, so corporations can't afford them
- there are no robust industrial-grade Linux laptops from PC vendors (lack of support, fear that Microsoft may penalize them for promoting Linux laptops, etc.)
What could change this?
1/ Most large corporations (airlines, hospitals, etc.) can AFFORD to DEMAND that their software vendors provide their business desktop applications in both Windows and Linux versions, and install a mix of both operating systems.
2/ The majority of corporate desktop applications could be web applications (browser-based), removing the dependence on single-vendor Microsoft Windows PCs/laptops.
Windows is not the issue here. If all of the businesses used Linux, a similar software product, deployed as widely as Crowdstrike, with auto-update, could result in the same issue.
Same goes for the OS; if, let's say, the majority of businesses used RHEL with auto-updates, Red Hat could in theory push an update that would bring down all machines.
Agree. The monoculture simply accelerates the infection because there are no sizable natural barriers to stop it.
Windows and even Intel must take some blame, because in this day and age of vPro on the board and rollbacks built into the OS, it's incredible that there is no "last known good" procedure to boot into the most recent successfully booted environment (didn't NT have this 30 years ago?), or to remotely recover the system. I pity the IT staff that are going to have to talk Bob in Accounting through BitLocker and some sys file, times 1000s.
IT gets some blame, because this notion that an update from a third party can reach past the logical gatekeeping function that IT provides, directly into their estate, and change things is unconscionable. Why don't the PCs update from a local mirror that IT controls and that has been through canary testing? Do we trust vendors that much now?
I would posit that RedHat have a slightly longer and more proven track record than Crowdstrike, and more transparent process with how they release updates.
No entity is infallible but letting one closed source opaque corporation have the keys to break everything isn’t resilient.
Yes it is. Windows was created for the "Personal Computer" with zero thought initially put into security. It has been fighting that heritage for 30 years. The reason Crowdstrike exists at all is due to shortcomings (real or perceived) in Windows security.
Unix (and hence Linux and MacOS) was designed as a multi-user system from the start, so access controls and permissions were there from the start. It may have been a flawed security model and has been updated over time, but at least it started some notion of security. These ideas had already expanded to networks before Microsoft ever heard the word Netscape.
> was designed as a multi-user system from the start, so access controls and permissions were there from the start.
Right and Windows NT wasn't? Obviously it supported all of those things from the very beginning (possibly even in a superior way to Unix in some cases considering it's a significantly more modern OS)...
The fact that MS developed another OS called Windows (3.1 -> 95 -> 98) prior to that which was to some extent binary compatible with NT seems somewhat tangential. Otherwise the same arguments would surely apply to MacOS as well?
> These ideas had already expanded to networks before Microsoft ever heard the word Netscape.
Does not seem like a good thing on its own to me. It just solidifies the fact that it's an inherently less modern OS than Windows (NT) (which still might have various design flaws worth discussing, obviously; it just has nothing whatsoever to do with what you're claiming here...)
We have Crowdstrike on our Linux fleet. It is not merely a malware scanner but is capable of identifying and stopping zero-day attacks that attempt local privilege escalation. It can, for example, detect and block attempts to exploit CVE-2024-3094 - the xz backdoor.
Perhaps we need to move to an even more restrictive design like Fuchsia, or standardize on an open-source eBPF-based utility that's built, tested, and shipped with a distribution's specific kernel, but Windows is not the issue here.
Security is a complex and deeply evolved field. Many modern required security practices are quite recent from a historical perspective because we simply didn't know we would need them.
A safe security first OS from 20 years ago would most likely be horribly insecure now.
Yes, staggered software updates are the way to go. There was a reply in this thread about why Crowdstrike did not do it: they don't want the extra engineering cost.
Having a third of an airline's computers on Windows, a third on RHEL, and a third on Ubuntu... they're all unlikely to hit the same problem at the same time.
But you're more likely to encounter problems. That's likely a good thing as it improves your DR documentation and processes but could be a harder sell to the suits.
But then it'd be putting all eggs in the Linux PC basket, wouldn't it? I think the point was that more heterogeneity would make this not be a problem. If all your potatoes are the same potato, it only takes one bad blight epidemic to kill off all farmed potatoes in a country. If there's more heterogeneity, things like that don't happen.
The difference being that RHEL has a QA process, which Crowdstrike apparently does not. The quality practices of companies involved in open source are apparently much higher than those of large closed-source "security" firms.
I guess getting whined at because obscure things break in beta or rc releases has a good effect for the people using LTS.
Maybe this is pie-in-the-sky thinking, but if all the businesses used some sort of desktop variant of Android, the Crowdstrike app (to the extent that such a thing would even be necessary in the first place) would be sandboxed and wouldn't have the necessary permissions to bring down the whole operating system.
When notepad hits an unhandled exception and the OS decides it's in an unpredictable state, the OS shuts down notepad's process. When there's an unhandled exception in kernel mode, the OS shuts down the entire computer. That's a BSOD in Windows or a kernel panic in Linux. The problem isn't that CrowdStrike is a normal user mode application that is taking down Windows because Windows just lets that happen, it's that CrowdStrike has faulty code that runs in kernel mode. This isn't unique to Windows or Linux.
The main reason they need to run in kernel mode is you can't do behavior monitoring hooks in user mode without making your security tool open to detection and evasion. For example, if your security tool wants to detect whenever a process calls ShellExecute, you can inject a DLL into the process that hooks the ShellExecute API, but malware can just check for that in its own process and either work around it or refuse to run. That means the hook needs to be in kernel mode, or the OS needs to provide instrumentation that allows third party code to monitor calls like that without running in kernel mode.
IMO, Windows (and probably any OS you're likely to encounter in the wild) could do better providing that kind of instrumentation. Windows and Office have made progress in the last several years with things like enabling monitoring of PowerShell and VBA script block execution, but it's not enough that solutions like CrowdStrike can do their thing without going low level.
Beyond that, there's also going to be a huge latency between when a security researcher finds a new technique for creating processes, doing persistence, or whatever and when the engineering team for an OS can update their instrumentation to support detecting it, so there's always going to be some need for a presence in kernel mode if you want up to date protection.
I mean, to me that's just a convincing argument that kernel-mode spywa-, err, endpoint protection with OTA updates that give you no way to stage or test them yourself cannot be secure.
How are those arguments against kernel level detection from a security perspective?
His arguments show that without kernel-level access, either you can't catch all bad actors, since they can evade detection, or the latency is so big that an attacker basically has free rein for some time after detection.
The SolarWinds story was quickly forgotten, and this one will be too, and we'll continue to build such special single points of global catastrophic failure into our craftily architected, decentralized, highly robust, horizontally scaled, multi-datacenter-region systems.
The SolarWinds story wasn't forgotten. Late last year the SEC launched a complaint against SolarWinds and its CISO. It was only yesterday that many of the SEC's claims against the CISO were dismissed.
SolarWinds is still dealing with the reputation damage and fallout from that breach today. People don't forget about this stuff. The lawsuits will likely be hitting CrowdStrike for years to come.
No less than three baskets, or you cannot apply for bailouts. If you want to argue your industry is a load-bearing element in the economy: no less than three baskets.
Making everything browser based doesn't help (unless you can walk across the room and touch the server). The web is all about creating fast-acting local dependency on the actions of far-away people who are not known or necessarily trusted by the user. Like crowdstrike, it's about remote control, and it's exactly that kind of dependency that caused this problem.
I love piling on Microsoft as much as the next guy, but this is bigger than that. It's a structural problem with how we (fail to) manage trust.
[AWS Health Dashboard](https://health.aws.amazon.com/health/status)
"First, in some cases, a reboot of the instance may allow for the CrowdStrike Falcon agent to be updated to a previously healthy version, resolving the issue.
Second, the following steps can be followed to delete the CrowdStrike Falcon agent file on the affected instance:
1. Create a snapshot of the EBS root volume of the affected instance
2. Create a new EBS volume from the snapshot in the same Availability Zone
3. Launch a new instance in that Availability Zone using a different version of Windows
4. Attach the EBS volume from step (2) to the new instance as a data volume
5. Navigate to the \windows\system32\drivers\CrowdStrike\ folder on the attached volume and delete "C-00000291*.sys"
6. Detach the EBS volume from the new instance
7. Create a snapshot of the detached EBS volume
8. Create an AMI from the snapshot by selecting the same volume type as the affected instance
9. Call replace root volume on the original EC2 Instance specifying the AMI just created"
reply