CrowdStrike Update: Windows Bluescreen and Boot Loops (reddit.com)
4489 points by BLKNSLVR 4 months ago | 3859 comments



All: there are over 3000 comments in this thread. If you want to read them all, click More at the bottom of each page, or use links like these:

https://news.ycombinator.com/item?id=41002195&p=2

https://news.ycombinator.com/item?id=41002195&p=3

https://news.ycombinator.com/item?id=41002195&p=4 (...etc.)


Throwaway account...

CrowdStrike in this context is an NT kernel loadable module (a .sys file) which does syscall-level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes or access files they shouldn't be (using some drunk ass heuristics).

What happened here was they pushed a new kernel driver out to every client without authorization, to fix an issue with slowness and latency in the previous Falcon sensor release. They have a staging system which is supposed to give clients control over this, but they pissed over everyone's staging and rules and just pushed this to production.

This has taken us out, and we have 30 people currently doing recovery and DR. Most of our nodes are boot looping with blue screens, which in the cloud is not something you can fix by just hitting F8 and removing the driver. We have to literally take each node down, attach the disk to a working node, delete the .sys file, and bring it back up. Either that or bring up a new node entirely from a snapshot.
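For anyone scripting it, the per-node AWS part of that loop looks roughly like this (a boto3 sketch, assuming EBS-backed instances; the IDs, device name and rescue instance are placeholders, and the actual file deletion still happens by hand on the rescue node):

    import boto3

    ec2 = boto3.client("ec2")

    def rescue_node(broken_id, rescue_id):
        # Hard-stop the boot-looping instance (Force skips the graceful shutdown wait).
        ec2.stop_instances(InstanceIds=[broken_id], Force=True)
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[broken_id])

        # Find the root EBS volume of the broken instance.
        inst = ec2.describe_instances(InstanceIds=[broken_id])["Reservations"][0]["Instances"][0]
        root_vol = next(m["Ebs"]["VolumeId"] for m in inst["BlockDeviceMappings"]
                        if m["DeviceName"] == inst["RootDeviceName"])

        # Move the volume over to a healthy instance as a secondary disk.
        ec2.detach_volume(VolumeId=root_vol)
        ec2.get_waiter("volume_available").wait(VolumeIds=[root_vol])
        ec2.attach_volume(VolumeId=root_vol, InstanceId=rescue_id, Device="/dev/sdf")
        # ...mount it there, delete the offending CrowdStrike .sys file by hand,
        # then detach, re-attach as the root device and start the instance again.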

This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.

I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.

Edit: to all the people moaning about Windows, we've had no problems with Windows. This is not a Windows issue. This is a third party security vendor shitting in the kernel.


I did approximately this recently, but on a Linux machine on GCP. It sucked far worse than it should have: apparently GCP cannot reliably “stop” a VM in a timely manner. And you can’t detach a boot disk from a VM that isn’t “stopped”, nor can you multi-attach it, nor can you (AFAICT) convince a VM to boot off an alternate disk.

I used to have this crazy idea that fancy cloud vendors had competent management tools. Like maybe I could issue an API call to boot an existing instance from an alternate disk or HTTPS netboot URL. Or to insta-stop a VM and get block-level access to its disk via API, even if I had to pay for the instance while doing this.

And I’m not sure that it’s possible to do this sort of recovery at all without blowing away local SSD. There’s a “preview” feature for this on GCP, which seems to be barely supported, and I bet it adds massive latency to the process. Throwing away one’s local SSD on every single machine in a deployment sounds like a great way to cause potentially catastrophic resource usage when everything starts back up.

Hmm, I wonder if you’re even guaranteed to be able to get your instance back after stopping it.

WTF. Why can’t I have any means to access the boot disk of an instance, in a timely manner? Or any better means to recover an instance?
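For what it's worth, the individual calls do exist in the google-cloud-compute client; it's the stop operation itself that can take forever to actually complete. A rough sketch (project, zone, instance and device names are placeholders):

    from google.cloud import compute_v1

    instances = compute_v1.InstancesClient()

    # Request a stop; the returned operation can sit there for a long time.
    op = instances.stop(project="my-project", zone="us-central1-a", instance="broken-vm")
    op.result()  # blocks until GCP reports the VM as stopped

    # Only once the VM is stopped can the boot disk be detached and re-attached elsewhere.
    instances.detach_disk(project="my-project", zone="us-central1-a",
                          instance="broken-vm", device_name="persistent-disk-0")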

Is AWS any better?


AWS isn't really any better on this. In fact, two years ago (to the day!) we had a complete AZ outage in our local AWS region. This resulted in their control plane going nuts and being unable to shut down or start new instances. Then came the capacity problems.


That's happened several times, actually. That's probably just the latest one. The really fun one was when S3 went down in 2017 in Virginia. It caused global outages of multiple services, because most services were housed out of Virginia; when EC2 and other services went offline due to their dependency on S3, everything cascade-failed across multiple regions (in terms of start/stop/delete, i.e. API actions; stuff that was already running was, for the most part, still working in some places).

...I remember that day pretty well. It was a busy day.


> apparently GCP cannot reliably “stop” a VM in a timely manner.

In OCI we made a decision years ago that after 15 minutes from sending an ACPI shutdown signal, the instance should be hard powered off. We do the same for VM or BM. If you really want to, we take an optional parameter on the shutdown and reboot commands to bypass this and do an immediate hard power off.

So worst case scenario here, 15 minutes to get it shut down and be able to detach the boot volume to attach to another instance.
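In SDK terms that's roughly the difference between the two instance actions (a sketch with the OCI Python SDK; the instance OCID is a placeholder):

    import oci

    compute = oci.core.ComputeClient(oci.config.from_file())

    # Graceful: ACPI shutdown, with the hard power-off fallback after 15 minutes.
    compute.instance_action("ocid1.instance.oc1..example", "SOFTSTOP")

    # Immediate hard power-off, no waiting.
    compute.instance_action("ocid1.instance.oc1..example", "STOP")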


I had this happen to one of my VMs: I was trying to compile something and ran out of memory, then tried to stop the VM, and it only came back after 15 minutes. I think it is a good compromise, long enough to give a chance for a clean reboot but short enough to prevent longer downtimes.

I’m just a free tier user but OCI is quite powerful. It feels a bit like KDE to me where sometimes it takes a while to find out where some option is, but I can always find it somewhere, and in the end it beats feeling limited by lack of options.


We've tried shorter time periods, back in the earlier days of our platform. Unfortunately, what we've found is that the few times we've tried to lower it from 15 minutes, we've ended up with Windows users experiencing corrupt drives. Our best blind interpretation is that some things common enough on Windows can take up to 14 minutes to shut down under the worst circumstances. So 15 minutes it is!


This sounds appealing. Is OCI the only cloud to offer this level of control?


Based on your description, AWS has another level of stop, the "force stop", which one can use in such cases. I don't have statistics on the time, so I don't know if that meets your criteria of "timely", but I believe it's quick enough (sub-minute, I think).


There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.

As for "throwing away local SSD", that only happens on AWS with instance store volumes which used to be called ephemeral volumes as the storage was directly attached to the host you were running on and if you did a stop/start of an ebs-backed instance, you were likely to get sent to a different host (vs. a restart API call, which would make an ACPI soft command and after a duration...I think it was 5 minutes, iirc, the hypervisor would kill the instance and restart it on the same host).

When the instance would get sent to a different host, it would get different instance storage and the old instance storage would be wiped from the previous host and you'd be provisioned new instance storage on the new host.

However, EBS volumes travel from host to host across stop/start cycles. They're attached over a very low latency network connection to EBS servers and presented as a local block device to the instance. It's not quite as fast as local instance store, but it's fast enough for almost every use case if you get enough IOPS provisioned, either through direct provisioning plus the correct instance size, or through a large enough drive plus a large enough instance to maximize the connection to EBS (there's a table and stuff detailing IOPS, throughput, and instance size in the docs).

Also, support can detach the volume as well if the instance is stuck shutting down and doesn't get manually shut down by the API after a timeout.

None of this is by any means "ideal", but the complexity of these systems is immense and what they're capable of at the scale they operate is actually pretty impressive.

The key is: lots of the things you talk about are doable at small scale, but when you add more and more operations and complexity to the tooling for interacting with these systems, you add a lot of back-end network overhead, which leads to extreme congestion even in very high speed networks (it's an exponential scaling problem).

The "ideal" way to deal with these systems is to do regular interval backups off-host (ie. object/blob storage or NFS/NAS/similar) and then just blow away anything that breaks and do a quick restore to the new, fixed instance.

It's obviously easier said than done: most shops still, on some level, think about VMs/instances as pets rather than cattle, or have hurdles that make treating them as cattle much more challenging. But manual recovery in the cloud should, in general, just be avoided in favor of spinning up something new and re-deploying to it.


> There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.

This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?

The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.

Also, there should be a way to force stop an instance that is already stopping.


>This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?

The issue is far more nuanced than that. These systems are very complex: the hypervisor has layers of applications and interfaces on top of it to allow scaling. In fact, the hosts all have BMCs (last I knew... though I know there were some who wanted to get rid of the BMC due to BMCs being unreliable, which is, yes, an issue when you deal with scale, because BMCs are in fact unreliable; I've had to reset countless stuck BMCs and had some BMCs that were dead).

The hypervisor is certainly capable of killing an instance instantly, but the preferred method is an orderly shutdown. In the case of a reboot and a stop (and a terminate where the EBS volume is not also deleted on termination), it's preferred to avoid data corruption, so the hypervisor attempts an orderly shutdown, then after a timeout period, it will just kill it if the instance has not already shutdown in an orderly manner.

Furthermore, there's a lot more complexity to the problem than just "kill the guest". There are processes that manage the connection to the EBS backend that provides the interface for the EBS volume, as well as APIs and processes to manage network interfaces, firewall rules, monitoring, and a whole host of other things.

If the monitoring process gets stuck, it may not properly detect an unhealthy host, and external automated remediation may not take action. Additionally, that same monitoring is often responsible for individual instance health and recovery (i.e. auto-recover), and if it's not functioning properly, it won't take remediation actions to kill the instance and start it up elsewhere. Furthermore, the hypervisor itself may not be properly responsive, and a call from the API won't trigger a shutdown action.

If the control plane and the data plane (in this case, that'd be the hypervisor/host) are not syncing/communicating (particularly on a stop or terminate), the API needs to ensure that the state machine is properly preserved and the instance is not running in two places at once. You can then "force" stop or "force" terminate, and/or the control plane will update state in its database and the host will sync later. There is a possibility of data corruption or doubled sends/receives of data in a force case, which is why it's not preferred. Also, after the timeout (without the "force" flag), it will go ahead and mark the instance terminated/stopped and sync later; the "force" just tells the control plane to do it immediately, likely because you're not concerned with data corruption on the EBS volume, which may be double-mounted if you start up again and the old one is not fully terminated.

>The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.

It does have a concept where all resources are still held and billed, except CPU and memory. That's effectively what a reboot does. Same with a stop (except you're not billed for compute usage, and network usage will obviously be zero, but if you have an EIP, that would still incur charges). The transition between stopped and running is also fast; the only delays are incurred via the control plane, either via capacity constraints causing issues placing an instance/VM or via the chosen host not communicating properly, but in most cases it is a fast transition. I'm usually up and running in under 20 seconds when I start an existing instance from a stopped state. There's also now a hibernate/sleep state the instance can be put into via the API (if it's Windows), where the instance acts just like a regular Windows machine going to sleep or hibernating.

>Also, there should be a way to force stop an instance that is already stopping.

There is. I believe I referred to it in my initial response. It's a flag you can throw in the API/SDK/CLI/web console when you select "terminate" or "stop". If the stop/terminate command doesn't execute in a timely manner, you can call the same thing again with a "force" flag, which tells the control plane to forcefully terminate: it marks the instance as terminated and asynchronously tries to rectify state when the hypervisor can execute commands. The control plane updates the state (though sometimes it can get stuck and require remediation by someone with operator-level access), is told that you don't care about data integrity/orderly shutdown, and will (once it has updated the state in the control plane, and regardless of the state of the data plane) mark the instance as "stopped" or "terminated". Then you can either start again, which should kick you over to a different host (there are some exceptions), or, if you terminated, launch a new instance and attach the EBS volume (if you chose not to delete it on termination) to retrieve the data (or use the data, or whatever you were doing with that particular volume).

Almost all of that information is actually in the public docs. There was only a little bit of color I added about how the backend operates. There are hundreds of programs that run to make sure the hypervisor and control plane are both in sync and able to manage resources, and if just a few of them hang, are unable to communicate, or the system runs out of resources (more of a problem on older, non-Nitro hosts, as that's a completely different architecture with completely different resource allocations), then the system can become partially functional... enough so that remediation automation won't step in, or can't step in because other guests appear to be functioning normally.

There are many different failure modes of varying degrees of "unhealthy", and many of them are undetectable or need manual remediation, but they are statistically rare, and by and large most hosts operate normally. On a normally operating host, forcing a shutdown/terminate works just fine and is fast. Even when some of the programs managing the host are not functioning properly, launch/terminate/stop/start/attach/detach all tend to continue to function (along with the "force" on detach, terminate, stop), even if one or two functions of the host are not. It's also possible (and has happened several times) that a particular resource vector is not functioning properly but the rest of the host is fine. In that case, the particular vector can be isolated and the rest of the host works just fine.

It's literally these tiny little edge cases that happen maybe 0.5% of the time that cause things to move slower, and at scale, a normal host with a normal BMC would have the same issues. I.e. I've had to clear stuck BMCs before on those hosts. Also, I've dealt with completely dead BMCs. When those states occur, if there's also a host problem, remediation can't go in and remedy host-level problems, which can lead to those control-plane delays as well as the need to call a "force".

Conclusion: it may SEEM like it should be super easy, but there are about a million different moving parts at cloud vendors, and it's not just as simple as kill it with fire and vengeance (i.e. QEMU guest kill). BMCs and hypervisors do have an instant kill switch (and guest kill is used on the hypervisor, as is a BMC power off, in the right remediation circumstances), but you're assuming those things always work. BMCs fail. BMCs get stuck. You likely haven't had the issue because you're not dealing with enough scale. I've had to reset BMCs manually more times than I can count, and I've also dealt with more than my fair share of dead ones. So "power off immediately" does not always work, which means a disconnect occurs between the control plane and the data plane. There are also delays in the remediation actions that automation takes, to give things enough time to respond to the given commands, which leads to additional wait time.


I understand that this complexity exists. But in my experience with Google Compute, this isn’t a 1%-of-the-time problem with something getting stuck. It’s a “GCP lacks the capability” issue. Here’s the API:

https://cloud.google.com/compute/docs/reference/rest/v1/inst...

AWS does indeed seem more enlightened:

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_S...


yeah, AWS rarely has significant capacity issues. While the capacity utilization typically sits around 90% across the board, they're constantly landing new capacity, recovering broken capacity, and working to fix issues that cause things to get stuck (and lots of alarms and monitoring).

I worked there for just shy of 7 years and dealt with capacity tangentially (knew a good chunk of their team for a while and had to interact with them frequently) across both teams I worked on (support and then inside the EC2 org).

Capacity management, while their methodologies for expanding it were, in my opinion, antiquated and unenlightened for a long time, was still rather effective. I'm pretty sure that's why they never updated their algorithm for increasing capacity to be more JIT. They have a LOT more flexibility in capacity now that they have resource vectoring, because you no longer have hosts with fixed instance sizes for the entire host (homogeneous). You now have the ability to fit everything like Legos as long as it is the same family (i.e. c4 with c4, m4 with m4, etc.), and additional work was being done on cross-family resource vectoring as well, which was in use.

Resource vectors took a LONG time for them to get in place and when they did, capacity problems basically went away.

The old way of doing it was if you wanted to have more capacity for, say, c4.xlarge, you'd either have to drop new capacity and build it out to where the entire host had ONLY c4.xlarge OR you would have to rebuild excess capacity within the c4 family in that zone (or even down to the datacenter-level) to be specifically built-out as c4.xlarge.

Resource vectors changed all that. DRAMATICALLY. Also, reconfiguring a host's recipe now takes minutes, rather than rebuilding the host and needing hours. So capacity is infinitely more fungible than it was when I started there.

Also, I think resource vectoring came on the scene around 2019 or so? I don't think it was there in 2018 when I went to work for EC2...but it was there for a few years before I quit...and I think it was in-use before the pandemic...so, 2019 sounds about right.

Prior to that, though, capacity was a much more serious issue and much more constrained on certain instance types.


I always said if you want to create real chaos, don't write malware. Get on the inside of a security product like this, and push out a bad update, and you can take most of the world down.


So… Write malware?


*Malicious code in legit software


> Most of our nodes are boot looping with blue screens which in the cloud is not something you can just hit F8 and remove the driver.

It took a bit to figure out with some customers, but we provide optional VNC access to instances at OCI, and with VNC the trick seems to be to hit Esc and then F8 at the right stage in the boot process. Timing seems to be the devil in the details there; getting that timing right is frustrating. People seem to be developing a knack for it, though.


> give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.

Interesting..

> We have to literally take each node down, attach the disk to a working node..

Probably the easiest solution for you is to go back in time to a previous scheduled snapshot, if you have that set up already.
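Per volume that's roughly the following with boto3 (a sketch; the snapshot, AZ and instance IDs are placeholders, and the old root volume has to be detached first):

    import boto3

    ec2 = boto3.client("ec2")

    # Create a fresh volume from the last known-good snapshot...
    vol = ec2.create_volume(SnapshotId="snap-0123456789abcdef0",
                            AvailabilityZone="us-east-1a")
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    # ...then swap it in as the root device of the stopped instance.
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0", Device="/dev/sda1")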


That would make sense but it appears everyone is doing EBS snapshots in our regions like mad so they aren't restoring. Spoke to our AWS account manager (we are a big big big org) and they have contention issues everywhere.

I really want our cages, C7000's and VMware back at this point.


Netflix big? Bigger or Smaller?

I'm betting I have a good idea of one of the possible orgs you work for, since I used to work specifically with the largest 100 customers during my ~3yr stint in premium support


Netflix isn't really that big. Two organizations ago, our reverse proxy used 40k cores. Netflix's is less than 5k. Of course, that could just mean our nginx extensions are 8 times crappier than Netflix's.


Smaller. No one has heard of us :)


> Spoke to our AWS account manager (we are a big big big org)

Is this how you got the inside scoop on the rollout fiasco?


Beautiful


> This is not a windows issue.

Honest question, I've seen comments in these various threads about people having similar issues (from a few months/weeks back) with kernel extension based deployments of CrowdStrike on Debian/Ubuntu systems.

I haven't seen anything similar regarding Mac OS, which no longer allows kernel extensions.

Is Mac OS not impacted by these kinds of issues with CrowdStrike's product, or have we just not heard about it due to the small scale?

Personally, I think it's a shared responsibility issue. MS should build a product that is "open to extension but closed for modification".

> they pissed over everyone's staging and rules and just pushed this to production.

I am guessing that act alone is going to create a massive liability for CrowdStrike over this issue. You've made other comments that your organization is actively removing CrowdStrike. I'm curious how this plays out. Did CrowdStrike just SolarWind themselves? Will we see their CISO/CTO/CEO do time? This is just the first part of this saga.


The issue is where it is integrated. You could arguably implement CrowdStrike in BPF on Linux. On NT they literally hook NT syscalls in the kernel from a driver they inject into kernel space which is much bad juju. As for macOS, you have no access to the kernel.
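To illustrate the difference in blast radius: an eBPF probe like the toy BCC sketch below gets checked by the in-kernel verifier before it can run, and a bad probe fails to load or drops events rather than bluescreening the box. (Purely illustrative, nothing like the real product.)

    from bcc import BPF

    prog = r"""
    int on_execve(struct pt_regs *ctx) {
        bpf_trace_printk("execve observed\n");
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="on_execve")
    b.trace_print()  # stream events from the kernel; a buggy probe can't panic it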

There is no shared responsibility. CrowdStrike pushed a broken driver out, then triggered the breakage, overriding customer requirements and configuration for staging. It is a faulty product with no viable security controls or testing.


Yep, it's extremely lame that CS has been pushing the "Windows" narrative to frame it as a Windows issue in the press, so everyone will just default blame Microsoft (which everyone knows) and not Crowdstrike (which only IT/cybersec people are familiar with).

And then you get midwits who blame Microsoft for allowing kernel access in the first place. Yes Apple deprecated kexts on macOS; that's a hell of a lot easier to do when you control the entire hardware ecosystem. Go ahead and switch to Apple then. If you want to build your own machines or pick your hardware vendor, guess what, people are going to need to write drivers, and they are probably going to want kernel mode, and the endpoint security people like CrowdStrike will want to get in there too because the threat is there.

There's no way for Microsoft or Linux for that matter to turn on a dime and deny kernel access to all the thousands upon thousands of drivers and system software running on billions of machines in billions of potential configurations. That requires completely reworking the system architecture.


> midwits

This midwit spent the day creating value for my customers instead of spinning in my chair creating value for my cardiologist.

Microsoft could provide adequate system facilities so that customers can purchase products that do the job without having the ability to crash the system this way. They choose not to make those investments. Their customers pay the price by choosing Microsoft. It's a shared responsibility between the parties involved, including the customers that selected this solution.

We all make bad decisions like this, but until customers start standing up for themselves with respect to Microsoft, they are going to continue to have these problems, and society is going to continue to pay the price all around.

We can and should do better as an industry. Making excuses for Microsoft and their customers doesn't get us there.


This midwit believes a half-decent operating system kernel would have a change-tracking system that can auto-roll back a change/update that impacts the boot process and causes a BSOD. We see this in Linux: multiple kernel boot options, failsafe modes, etc. It is trivial to add driver / .sys tracking at the kernel level that can detect a failed boot and revert to the previous good config. A well-designed kernel would have rollback, just like SQL.


Windows does have that and does do that. CrowdStrike does stuff at the UEFI level to install itself again.


Could Microsoft put pressure on UEFI vendors to coordinate a way for such reinstallation to be suppressed during this failsafe boot?


Not sure why you are being downvoted. Take a look at ChromeOS and MacOS to see how those mechanisms are implemented there.

They aren’t perfect, but they are an improvement over what is available on Windows. Microsoft needs to get moving in this same direction.


um.. don't have access to the kernel? what's with all the kexts then? [edit: just read 3rd parties don't get kexts on apple silicon. that's a step in the right direction, IMHO. I love to bitch about Mach/NeXTStep flaws, but happy to give them props when they do the right thing.]


Although it's a .sys file, it's not a device driver.

"Although Channel Files end with the SYS extension, they are not kernel drivers."

https://www.crowdstrike.com/blog/technical-details-on-todays...


Yeah it's a way of delivering a payload to the driver, which promptly crashed.

Which is horrible!


Horrible for sure, not least because hackers now know that the channel file parser is fragile and perhaps exploitable. I haven't seen any significant discussion about follow-on attacks; it's all been about rolling back the config file rather than addressing the root cause, which is the shonky device driver.


I suspect the wiley hackors have known how fragile that code is for years.


But it is a Windows issue, because the kernel should be able to roll back a bad update; there should NEVER be BSODs.


Windows does do that. CrowdStrike sticks it back in at the UEFI level, by the looks of it, because, you know, "security".


pish! this isn't VM/SP! commodity OSes and hardware took over because customers didn't want to pay firms to staff people who grokked risk management. linux supplanted mature OSes because some dork implied even security bugs were shallow with all those billions of eyes. It's a weird world when MSFT does a security stand down in 2003 and in 2008 starts widening security holes because the new "secure" OS they wrote was a no-go for third parties who didn't want to pay $100 to hire someone who knew how to rub two primes together.

I miss my AS/400.

This might be a decent place to recount the experience I had when interviewing for office security architect in 2003. my background is mainframe VM system design and large system risk management modeling which I had been doing since the late 80s at IBM, DEC, then Digital Switch and Bell Canada. My resume was pretty decent at the time. I don't like Python and tell VP/Eng's they have a problem when they can't identify benefits from JIRA/SCRUM, so I don't get a lot of job offers these days. Just a crusty greybeard bitching...

But anyway... so I'm up in Redmond and I have a decent couple of interviews with people, and then the 3rd most senior dev in all of MSFT comes in and asks "how's your QA skills?" and I start to answer about how QA and Safety/Security/Risk Management are different things. QA is about ensuring the code does what it's supposed to; software security, et al., is about making sure the code doesn't do what it's not supposed to, and the philosophic sticky wicket you enter when trying to prove a negative (worth a google deep dive if you're unfamiliar.) Dude cuts me off and says "meh. security is stupid. in a month, Bill will end this stupid security stand down and we'll get back to writing code and I need to put you somewhere and I figured QA is the right place."

When I hear that MSFT has systems that expose inadequate risk management abstractions, I think of the culture that promoted that guy to his senior position... I'm sure he was a capable engineer, but the culture in Redmond discounts the business benefits of risk management (to the point they outsource critical system infrastructure to third parties) because senior engineers don't want to be bothered to learn new tricks.

Culture eats strategy for breakfast, and MSFT has been fed on a cultural diet of junk food for almost half a century. At least from the perspective of doing business in the modern world.


> ”This is not a windows issue. This is a third party security vendor shitting in the kernel.“

Sure, but Windows shares some portion of the blame for allowing third-party security vendors to “shit in the kernel”.

Compare to macOS which has banned third-party kernel extensions on Apple Silicon. Things that once ran as kernel extensions, including CrowdStrike, now run in userspace as “system extensions”.


Back in 2006, Microsoft agreed to allow kernel-level access for security companies due to an EU antitrust investigation. They were being sued by antivirus companies because they were blocking kernel access in the soon-to-be-released Vista.

https://arstechnica.com/information-technology/2006/10/7998/


Wow, that looks like a root cause


Wow! First cookie pop-ups, now Blue Friday...?


Sick and tired of EU meddling in tech. If third parties can muck around in the kernel, then there's nothing Microsoft can really do at that point. SMH


Can they simultaneously allow this, but recommend against it and deny support / sympathy if you do it to your OS?


Yes... in the same sense that if a user bricks their own system by deleting system32 then Windows shares some small sliver of the blame. In other words, not much.


Why should Windows let users delete system32? If they don't make it impossible to do so accidentally (or even maliciously), then I would indeed blame Windows.

On macOS you can't delete or modify critical system files without both a root password and enough knowledge to disable multiple layers of hardware-enforced system integrity protection.


And what do you think installing a deep level antivirus across your entire fleet is equivalent to?


lol. Never said they should, did I?


the difference is you can get most of the functionality you want without deleting system32, but if you want the super secure version of NT, you have to let idiots push untested code to your box.

Linux, Solaris, BSD and macOS aren't without their flaws, but MSFT could have done a much better job with system design.


...but still, if the user space process is broken, macOS will fail as well. Maybe it's a bit easier to recover, but any broken process with non-trivial privileges can interrupt the whole system.


It's certainly not supposed to work like that. In the kernel, a crash brings down the entire system by design. But in userspace, failed services can be restarted and continued without affecting other services.

If a failure in a userspace service can crash the entire system, that's a bug.


It's kind of inevitable that a security system can crash the system. It just needs to claim that one essential binary is infected with malware, and the system won't run.


Hello:

I'm a reporter with Bloomberg News covering cybersecurity. I'm trying to learn more about this CrowdStrike update potentially bypassing staging rules and would love to hear about your experience. Would you be open to a conversation?

I'm reachable by email at jbleiberg2@bloomberg.net or on Signal at JakeBleiberg.24. Here's my Bloomberg author page: https://www.bloomberg.com/authors/AWuCZUVX-Pc/jake-bleiberg.

Thank you.

Jake


Before reaching the "pushed out to every client without authorization" stage, a kernel driver/module should have been tested. Tested by Microsoft, not by "a third party security vendor shitting in the kernel" that some criminally negligent manager decided to trust.


> Tested by Microsoft

MS don't have testers any more. Where do you think CS learned their radically effective test-in-prod approach?


I think they learned it from Freedesktop developers.


Yeah, we have a staging and test process where we run their updated Falcon sensor releases.

They shit all over our controls and went to production.

This says we don't control it and should not trust it. It is being removed.


> It is being removed.

Congratulations on actually fixing the root cause, as opposed to hand wringing and hoping they don't break you again. I'm expecting "oh noes, better keep it on anyway to be safe" to be the popular choice.


Yeah, I agree. I think most places will at least keep it until the existing contract comes up for renegotiation, and most will probably keep using CS.

It's far easier for IT departments to just keep using it than it is to switch and managers will complain about "the cost of migrating" and "the time to evaluate and test a new solution" or "other products don't have feature X that we need" (even when they don't need that feature, but THINK they do).


Why would Microsoft be required to test some 3rd party software? Maybe I misunderstood.


It's a shitty C++ hack job within CrowdStrike with a null pointer. Because the software has root access, Windows shuts it down as a security precaution. A simple unit test would have caught this, or any number of tools that look for null pointers in C++, not even full QA. It's unbelievable incompetence.


Took down our entire emergency department as we were treating a heart attack. 911 down for our state too. Nowhere for people to be diverted to because the other nearby hospitals are down. Hard to imagine how many millions if not billions of dollars this one bad update caused.


Yup - my mom went into the ER for stroke symptoms last night and was put under an MRI. The MRI imaging could NOT be sent to the off-site radiologist and they had to come in -- turned out the MRI outputs weren't working at all.

We were discharged at midnight by the doctor, but the nurse didn't come into our exam room to tell us until 4am. I can't imagine the mess this has caused.


A relative of mine had back surgery late yesterday. Today the hospital nursing staff couldn’t proceed with the pain medication process for patients recovering from surgery because they didn’t have access to the hospital systems.


My wife is a nurse. She has a non-critical job making care plans for patients and the system is STILL down.


Hope she's okay. For better or worse, our entire emergency department flow is orchestrated around Epic. If we can't even see the board, nurses don't know what orders to perform, etc.


If it's so critical that nurses are left standing around clueless when it goes down, then entire teams of people should be going to prison for manslaughter.

Or, we could build robust systems that can tolerate indefinite down time. Might cost more, might need more staff.

Pick one. I’ll always pick the one that saves human lives when systems go down.


Okay but that will affect hospital profits and our PE firms bought these hospitals specifically to wrench all redundancy out of these systems in the name of efficiency (higher margins and thus profit) so that just won't do.


Private equity people need to start getting multiple life sentences for fucking around with shit like this. It's unironically a national security issue.


1. Hospitals should not make profits.

2. Hospitals should not have executives.

3. Hospitals should be community funded with backstop by the federal government.

4. PE is a cancer - let the doctors treat it.


Doctors can't even own hospitals now. Doctor-owned hospitals were banned with the passage of Obamacare in order to placate big hospital systems concerned about the growing competition.


Another way to look at it is that you can have more hospitals using systems with a lower cost, thus saving more lives compared to only a few hospitals using an expensive system.


This isn't another way to look at it, this is the only way to look at it.


I wish your mother recovers promptly. And I’m glad she doesn’t run on Windows. ;-)


Ha ha! Good one! This is a save!

Wishes for a speedy recovery to your mom!

I hope no one uses such single-point-of-failure systems anymore. Especially CS. The same is applicable to Cloudflare as well! But at least in their case the systems will keep functioning standalone and be accessible, and it could only cause a net-wide outage (i.e., if the CF infra goes down)!

Anyways, who knows what is going to happen with such widespread vendor dependency?

The world gets reminded about supply chain attacks every year, which is a good (but scary) reminder that definitely needs some deep thinking...

Up for it?


I am "saving" this comment :)

... and seconding all the best wishes for the mother involved. Do get well.-


I hope she's ok.


Wishing you and your mom the best


I wish your mother the best <3


Thank you <3


Idk… critical hospital systems should be air gapped.


All of the critical equipment is. But we need internet access on computers, or at the very least Epic does to pull records from other hospitals.


> We were discharged at midnight by the doctor, the nurse didn't come into our exam room to tell us until 4am. I can't imagine the mess this has caused.

That's an extra 4 hours of emergency room fees you ideally wouldn't have to pay for.


Having a medical system that has the concept of "hours of emergency room fees" is also a pretty fundamental problem


It's actually per 15 minutes :)


Honestly, that sounds like a typical ER visit.


The system crashed while my coworker was running a code (aka doing CPR) in the ER last night. Healthcare IT is so bad at baseline that we are somewhat prepared for an outage while resuscitating a critical patient.


The second largest hospital group in Nashville experienced a ransomware attack about two months ago. Nurses told me they were using manual processes for three weeks.


It takes a certain type of criminal a55hole to attack hospitals and blackmail them. I would easily support life in prison or the death penalty for anyone attempting this cr@p.


In this case it was tracked to Russia.


That is absolutely one of the A-tier "certain type of a criminal a55hole".


More than just Nashville, they have hospitals all over the country.


Ascension?


Yes. And I was told by multiple nurses at St. Thomas Midtown that the hospital did not have manual procedures already in place. In their press release they refer to their hospitals as "ministries" [0], so apparently they practice faith-based cyber security (as in "we believe that we don't need backups") since it took over 3 weeks to recover.

[0] https://about.ascension.org/cybersecurity-event


As a paramedic, there is very little about running a code that requires IT. You have the crash cart, so you're not even stuck trying to get meds out of the Pyxis. The biggest challenge is charting / scribing the encounter.


lol, yep, that was my take on this... If you need a computer to run an ACLS algorithm, something has gone seriously wrong.


Especially out in the field where we have a lot more autonomy. If our iPads break we'll just use paper.


Excuse my ignorance, but what systems are needed for CPR?


I used to work in healthcare IT. Running a code is not always only CPR.

Different medications may be pushed (injected into the patient) to help stabilize them. These medications are recorded via a barcode and added to the patient's chart in Epic. Epic is the source of truth for the current state of the patient. So if that is suddenly unavailable, that is a big problem.


Makes sense, thank you for the explanation.


Okay, not having historical data available to make decisions on what to put into a patient is understandable (but maybe also print critical stuff per patient once a day?), but not being able to log an action in real time should not be a critical problem.


It is a critical problem if your entire record of life-saving drugs you've given them in the past 24 hours suddenly goes down. You have to start relying on people's memories, and it's made worse by shift turn-overs so the relevant information may not even be reachable once the previous shift has gone home.

There are plenty of drugs that can only be given in certain quantities over a certain period of time, and if you go beyond that, it makes the patient worse not better. Similarly there are plenty of bad drug interactions where whether you take a given course of action now is directly dependent on which drugs that patient has already been given. And of course you need to monitor the patient's progress over time to know if the treatments have been working and how to adjust them, so if you suddenly lose the record of all dosages given and all records of their vital signs, you've lost all the information you need to treat them well. Imagine being dropped off in the middle of nowhere, randomly, without a GPS.


That's why there's a sharpie in the first aid kit. If you're out of stuff to write on you can just write on the patient.

More seriously, we need better purpose-built medical computing equipment that runs on its own OS and only has outbound network connectivity for updating other systems.

I also think of things like the old school "checklist boards" that used to be literally built into the yoke of the airplane they were made for.


I’m afraid the profitability calculation shifted it in favor of off-the-shelf OS a long time ago. I agree with you, though, that a general purpose OS has way too much crap that isn’t needed in a situation like this.


> That's why there's a sharpie in the first aid kit.

That doesn't help when the system goes down and you lose the record of all medications administered prior to having to switch over to the Sharpie.


> It is a critical problem if your entire record of life-saving drugs you've given them in the past 24 hours suddenly goes down.

Will outages like this motivate a backup paper process? The automated process should save enough information on paper so that a switchover to a paper process at any time is feasible. Similar to elections.


Maybe if all the profit seeking entities were removed from healthcare that money could instead go to the development of useful offline systems.

Maybe a handheld device for scanning in drugs or entering procedure information that stores the data locally, which can then be synced with a larger device with more storage somewhere that is also 100% local and immutable, which can then sync to online systems if that is needed.


And with their luck, those handheld devices will also be sent the OTA update that temporarily bricks them along with everything else.


no money for that

there are backup paper processes, but they start fresh when the systems go down

If it was printing paper in case of downtime 24/7, it would be massive wastage for the 99% of the time the system is up.


A good system is resilient. A paper process could take over when the system is down. From my understanding, healthcare systems undergo recurrent outages for various reasons.


Many places did revert back to paper processes. But it's a disaster model that has to be tested to make sure everyone can still function when your EMR goes down. Situations like this just reinforce that it's not a question of if IT systems go down, but when.


My experience with internet outages affecting retail is that the ability to rapidly and accurately calculate bill totals and change is not practiced much anymore. Not helped by things like 9.075% tax rates, to be sure.


How about an e-ink display for each patient that gets drug and administration info displayed on it?


Real paper is probably as much about breaking from the "IT culture" as it is about the physical properties. An e-ink display would probably help with a power outage, but would happily display a BSOD in an incident like this.


Honestly if you were designing a system to be resilient to events like this one, the focus would be on distributed data and local communication. The exact sort of things that have become basically dirty words in this SaaS future we are in. Every PC in the building, including the ones tethered to equipment, is presently basically a dumb terminal, dependent on cloud servers like Epic, meaning WAN connection is a single point of failure (I assume that a hospital hopefully has a credible backup ISP though?) and same for the Epic servers.

If medical data were synced to the cloud but also stored on the endpoint devices and local servers, you’d have more redundancy. Obviously much more complexity to it but that’s what it would take. Epic as single source of truth means everyone is screwed when it is down. This is the trade off that’s been made.


> synced to the cloud but also stored on the endpoint devices and local servers

That's a recipe for a different kind of disaster. I actually used Google Keep some years ago for medical data at home — counted pills nightly, so mom could either ask me or check on her phone if she forgot to take one. Most of the time it worked fine, but the failure modes were fascinating. When it suddenly showed data from half a year ago, I gave up and switched to paper.


I don't think the historical data is required to make a decision; it is required to store the action for historical purposes in the future. This is ultimately to bill you, to track that a doctor isn't stealing medication or improperly treating the patient, and to track it for legal purposes.

Some hospitals require you to input this in order to even get physical access to the medications.

Although a crash cart would normally have common things necessary to save someone in an emergency, so I would think that if someone was truly dying they could get them what they needed. But of course there are going to be exceptions and a system being down will only make the process harder.


> maybe also print critical stuff per patient once a day?

Yep, the business continuity boxes are basically minimally connected PDF archives of patient records "printed" multiple times a day.


maybe non-volatile e-paper, which can be updated easily if things are up, and if the system is down it still works as well as the printouts


updatable e-paper is going to be very expensive


Compared to managing thousands of printers? And then the resulting printouts? Buying ink, changing the cartridges?

Technologically it seems doable. Big enough order brings down the costs.

https://soldered.com/product/soldered-inkplate-5-5-2%e2%80%b...

Of course the real backup plan should be designed based on the actual needs, perhaps the whole system needs an "offline mode" switch. I assume they already run things locally, in case the big cable seeker machine arrives in the neighborhood.


A small printer connected to the scanner should do.


in this case, it's the entire operating system going down on all computers, so I don't think the printers are working either


Most printers in these facilities run standalone on an embedded Linux variant. They actually can host whole folders of data for reproduction "offline". Actually, all scan/print/fax multifunction machines can generally do that these days. If the IT onsite is good, though, the USB ports and storage on devices should be locked down.


Looks like a small scanner + printer running a small minimalistic RTOS would be a good solution.


Ok, now you have a fleet of 200 of those devices to manage. And now you move a patient across a service or to another hospital and then...

Reality is complex.


Oh yes. This would be a contingency measure, just to keep the record in a human readable form while requiring little manual labor. Printed codes could be scanned later into Epic and, if you need to transfer the patient, tear the paper and send it with them.


This.

Anyone involved in designing and/or deploying a system where an application outage threatens life safety, should be charged with criminal negligence.

A receipt printer in every patient room seems like a reasonable investment.


This would be challenging. Establishing CrowdStrike's duty to a hospital patient would be difficult, if not impossible, in some jurisdictions.


It is not necessarily crowdstrike's responsibility, but it should be someone's.

If I go to Home Depot to buy rope for belaying at my rock climbing center and someone falls, breaks the rope and dies, then I am on the hook for manslaughter.

Not the rope manufacturer, who clearly labeled the packaging with "do not use in situations where safety can be endangered". Not the retailer, who left it in the packaging with the warning, and made no claim that it was suitable for a climbing safety line. But me, who used a product in a situation where it was unsuitable.

If I instead go to Sterling Rope and the same thing happens, fault is much more complicated, but if someone there was sufficiently negligent they could be liable for manslaughter.

In practice, to convict of manslaughter, you would need to show an individual was negligent. However, our entire industry is bad at our job, so no individual involved failed to perform their duties to a "reasonable" standard.

Software engineering is going to follow the path that all other disciplines of meatspace engineering did. We are going to kill a lot of people; and every so often, enough people will die that we add some basic rules for safety critical software, until eventually, this type of failure occurring without gross negligence becomes nearly unthinkable.


It's on whoever runs the hospital's computer systems: allowing a ring 0 kernel driver to update ad hoc from the internet is just sheer negligence.

Then again, the management that put this in are probably also the same idiots that insist on a 7 day lead time CAB process to update a typo on a brochure ware website "because risk".


This patient is dead. They would not have been if the computer system was up. It was down because of CrowdStrike. CrowdStrike had a duty of care to ensure they didn't fuck over their client's systems.

I'm not even beyond two degrees of separation here. I don't think a court'll have trouble navigating it.


I suppose it will come as a surprise to you that you have misleading intuitions about the duty of care.

CrowdStrike did not even have a duty of care to their customer, let alone their customer's customer (speaking for my jurisdiction, of course).


If that really were how it worked, I don’t think that software would really exist at all. Open Source would probably be the first to disappear too — who would contribute to, say, Linux, if you could go to jail for a pull request you made because it turns out they were using it in a life or death situation and your code had a bug in it. That checks all the same boxes that your scenario does: someone is dead, they wouldn’t be if you didn’t have a bug in your code.

Now, a tort is less of a stretch than a crime, but thank goodness I’m not a lawyer so I don’t have to figure out what circumstances apply and how much liability the TOS and EULAs are able to wash away.


When I read something like this that has such a confident tone while being incredibly incorrect all I can do is shake my head and try to remember I was young once and thought I knew it all as well.


I don't think you understand the scale of this problem. Computers were not up to print from. Our Epic cluster was down for placing and receiving orders. Our lab was down and unable to process bloodwork - should we bring out the mortar and pestle and start doing medicine the old fashioned way? Should we be charged with "criminal negligence" for not having a jar of leeches on hand for when all else fails?


I was advocating for a paper fallback. That means that WHILE the computers are running, you must create a paper record, e.g. "medication x administered at time y", etc., hence the receipt printers, which are cheap and low-dependency.

The grandparent indicated that the problem was that when all the computers went down, they couldn't look up what had already been done for the patient. I suggested a simple solution for that - receipt printers.

After the computers fail you tape the receipt to the wall and fall back to pen and paper until the computers come back up.

I completely understand the scale of the outage today. I am saying that it was a stupid decision and possibly criminally negligent to make a life critical process dependent on the availability of a distributed IT application not specifically designed for life critical availability. I strongly stand by that POV.


> I suggested a simple solution for that - receipt printers.

Just so I understand what you are saying: you are proposing that we drown our hospital rooms in paper receipts constantly, on the off chance the computers go down very rarely?

Do you see any possible drawbacks with your proposed solution?

> possibly criminally negligent to make a life critical process dependent on the availability of a distributed IT application

What process is not “life critical” in a hospital? Do you suggest that we don’t use IT at all?


Modern medicine requires computers. You literally cannot provide medical care in a critical care setting with the sophistication and speed required for modern critical care without electronic medical records. Fall back to paper? Ok, but you fall back to 1960s medicine, too.


We need computers. But, how about we fall back to an air-gapped computer with no internet connection and a battery backup?

Why does everything need the internet?


> Why does everything need the internet?

Why would you ever need to move a patient from one hospital room containing one set of airgapped computers into another, containing another set of airgapped computers?

Why would you ever need to get information about a patient (a chart, a prescription, a scan, a bill, an X-Ray) to a person who is not physically present in the same room (or in the same building) as the patient?


You wouldn't airgap individual rooms.

And sending data out can be done quite securely. Then replies could be highly sanitized or kept on specific machines outside the air gap.


You also need to receive similar data from outside the hospital.

And now you've added an army of people running around moving USB sticks, or worse, printouts and feeding them into other computers.

It's madness, and nobody wants to do it.


Local area networks air gapped from the internet don't need to be air gapped from each other. You could have nodes in each network responsible for transmitting specific data to the other networks.. like, all the healthcare data you need. All other traffic, including windows updates? Blocked. Using IP still a risk? Use something else. As long as you can get bytes across a wire, you can still share data over long distances.

In my eyes, there is a technical solution there that keeps friction low for hospital staff: network stuff, on an internet, but not The Internet...

Edit: I've since been reading the other many many comment threads on this HN post which show the reasons why so much stuff in healthcare is connected to each other via good old internet, and I can see there's way more nuance and technicality I am not privy to which makes "just connect LANs together!" less useful. I wasn't appreciating just how much of medicine is telemedicine.


I think wiring computers within the hospital over LAN, and adding a human to the loop for inter-hospital communication seems like a reasonable compromise.

Yes there will be some pain, but the alternative is what we have right now.

> nobody wants to do it.

Tough luck. There's lots of things I don't want to do.


Less time urgent, and would not take an army.


This approach is also what popped into my head. I've seen people use white boards for this already so it must be ok from a HIPAA standpoint.


A hospital my wife worked at over a decade ago didn't use EMR's, it was all on paper. Each patient had a binder. Per stay. And for many of them it rolled into another binder. (This was neuro-ICU so generally lengthy patient stays with lots of activity, but not super-unusual or Dr House stuff, every major city in America will have 2-3 different hospitals with that level of care.)

But they switched over to EMR because the advantages of Pyxis[1] in getting the right medications to the right patients at the right time- and documenting all of that- are so large that for patient safety reasons alone it wins out over paper. You can fall back to paper, it's just a giant pain in the ass to do it, and then you have to do the data entry to get it all back into EMR's. Like my wife, who was working last night when everyone else in her department got Crowdstrike'd, she created a document to track what she did so it could be transferred into EMR's once everything comes back up. And the document was over 70 pages long! Just for one employee for one shift.

1: Workflow: Doctor writes prescription in EMR. Pharmacist reviews charts in EMR, approves prescription. Nurse comes to Pyxis cabinet and scans patient barcode. Correct drawer opens in cabinet so the proper medication- and only the proper medication- is immediately available to nurse (technicians restock cabinet when necessary). Nurse takes medication to patient's room, scans patient barcode and medication barcode, administers drug. This system has dramatically lowered the rates of wrong-drug administration, because the computers are watching over things and catch humans getting confused on whether this medication is supposed to go to room 12 or room 21 in hour 11 of their shift. It is a great thing that has made hospitals safer. But it requires a huge amount of computers and networks to support.
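
If it helps to picture the "computers are watching over things" part: the bedside check is basically a tiny verification function. A toy sketch below, not how Pyxis is actually implemented, with all names, fields and barcode formats invented:

    # Toy sketch of the closed-loop barcode check at the bedside.
    # Purely illustrative; names and data structures are invented.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Order:
        patient_id: str      # from the patient's wristband barcode
        med_ndc: str         # medication barcode
        due: datetime
        window: timedelta = timedelta(hours=1)

    def may_administer(order: Order, scanned_patient: str, scanned_med: str,
                       now: datetime) -> tuple[bool, str]:
        """Return (ok, reason). Refuse on any mismatch instead of trusting memory."""
        if scanned_patient != order.patient_id:
            return False, "wrong patient for this order"
        if scanned_med != order.med_ndc:
            return False, "wrong medication"
        if abs(now - order.due) > order.window:
            return False, "outside administration window"
        return True, "ok"

    # Example: nurse in hour 11 of a shift scans room 21's patient for room 12's order.
    order = Order("PAT-0012", "NDC-0777-3105-02", due=datetime(2024, 7, 19, 3, 0))
    print(may_administer(order, "PAT-0021", "NDC-0777-3105-02",
                         datetime(2024, 7, 19, 3, 10)))
    # -> (False, 'wrong patient for this order')

The value isn't the code, it's that the refusal happens automatically at the moment of administration, which is exactly what paper can't do.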


> Pyxis cabinet

Why would a Pyxis cabinet run Windows? I realize Windows isn't even necessarily at fault here, but why on earth would such a device run Windows? Is the 90s form of mass incompetence in the industry still a thing where lots of stuff is written for Windows for no reason?


I don't know what Pyxis runs on, my wife is the pharmacist and she doesn't recognize UI package differences with the same practiced eye that I do. And she didn't mention problems with the Pyxis. Just problems with some of their servers and lots of end user machines. So I don't know that they do.


You only need one link in the chain of doctor -> pharmacist -> Pyxis -> nurse to be reliant on Windows for this to fail.


This would be a disaster from a HIPAA perspective, and an unimaginable amount of paperwork.


For relying on windows to run this kind of stuff and not doing any kind of staged rollout, just blindly applying an untested third-party kernel driver patch fleet-wide? Yeah, honestly. We had safer rollouts for cat videos than y'all seem to have for life critical systems. Maybe some criminal liability would make y'all care about reliability a bit more.


Staged rollout in the traditional sense wouldn't have helped here because the skanky kernel driver worked under all test conditions. It just didn't work when it got fed bad data. This could have been mitigated by staging the data propagation, or by fully testing the driver with bad data (unlikely to ever have been done by any commercial organization). Perhaps some static analysis tool could have found the potential to crash (or the isomorphic "safe language" that doesn't yet exist for NT kernel drivers).
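
To make the "testing with bad data" point concrete: even a dumb randomized harness over the content-file parser tends to shake out this class of bug before release. A minimal sketch, where parse_content_file() is a hypothetical stand-in (the real channel-file format and parser are not public):

    # Minimal fuzz-style harness for a hypothetical content-file parser.
    # parse_content_file() is a stand-in; nothing here reflects the real format.
    import os, random

    def parse_content_file(blob: bytes) -> dict:
        # Hypothetical parser: expects a 4-byte magic, a length, then a payload.
        if len(blob) < 8 or blob[:4] != b"CHNL":
            raise ValueError("bad header")
        length = int.from_bytes(blob[4:8], "little")
        if length > len(blob) - 8:
            raise ValueError("declared length exceeds file size")
        return {"payload": blob[8:8 + length]}

    def mutate(blob: bytes, rng: random.Random) -> bytes:
        b = bytearray(blob)
        for _ in range(rng.randint(1, 8)):
            b[rng.randrange(len(b))] = rng.randrange(256)
        return bytes(b[:rng.randint(0, len(b))])  # also truncate randomly

    valid = b"CHNL" + (16).to_bytes(4, "little") + os.urandom(16)
    rng = random.Random(0)
    for _ in range(100_000):
        sample = mutate(valid, rng)
        try:
            parse_content_file(sample)
        except ValueError:
            pass  # rejecting bad input cleanly is the desired behaviour
        # Any other exception (or a crash) is the bug you wanted to find pre-release.

That's minutes of compute, and it only needs to run against whatever code path consumes the pushed data.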


If you don't see that the thing that happened today that blew up the world was the rollout, I don't know what to tell you.


A QR code can store about 3 KB of data. Every patient has a small QR sticker printer on their bed. Whenever Epic updates, print a new small QR sticker. When a patient is being moved, tear off the sticker and stick it to their wrist tag.

This much of the patient's state will be carried on their wrist. Maybe for complex cases you need two stickers. Have to be judicious in encoding data, maybe just the last 48 hours.

Handheld QR readers, offline, that read and display the QR data strings.
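
Rough numbers: a byte-mode QR code tops out around 2953 bytes at low error correction, so a compressed snapshot of the last 48 hours fits if you're judicious. A sketch of the encoding side using the Python qrcode library; the field names and the idea of an Epic export in this shape are my own invention:

    # Rough sketch: compress a small patient-state snapshot into a printable QR sticker.
    # Field names are invented; a real payload would follow whatever the EMR exports.
    import json, zlib, base64
    import qrcode  # pip install qrcode[pil]

    snapshot = {
        "mrn": "00123456",
        "name": "DOE, JANE",
        "allergies": ["penicillin"],
        "last_48h": [
            {"t": "2024-07-19T02:10Z", "ev": "metoprolol 25mg PO"},
            {"t": "2024-07-19T03:05Z", "ev": "troponin 0.04 ng/mL"},
        ],
    }

    payload = base64.b85encode(zlib.compress(json.dumps(snapshot).encode())).decode("ascii")
    # Version-40 QR at low error correction holds ~2953 bytes, so keep headroom
    # and print a second sticker for complex cases.
    assert len(payload) < 2900, "trim the snapshot or split across stickers"

    img = qrcode.make(payload)        # offline handheld readers just reverse the steps:
    img.save("bed12_sticker.png")     # decode QR -> b85decode -> zlib.decompress -> json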


You need to document everything during a code arrest. All interventions, vitals and other pertinent information must be logged for various reasons. Paper and pen work but they are very difficult to audit and/or keep track of. Electronic reporting is the standard and deviating from the standard is generally a recipe for a myriad of problems.


We chart all codes on paper first and then transfer to computer when it's done. There's a nurse whose entire job is to stay in one place and document times while the rest of us work. You don't make the documenter do anything else because it's a lot of work.

And that's in the OR, where vitals are automatically captured. There just aren't enough computers to do real-time electronic documentation, and even if there were there wouldn't be enough space.


I chart codes on my EPCR, in the PT's house, almost everyday with one hand. Not joking about the one hand either.

It's easier, faster, and more accurate than writing in my experience. We have a page solely dedicated to codes and the most common interventions. Got IO? I press a button and it's documented with a timestamp. Pushing EPI, button press with timestamp. Dropping an I-Gel or intubating, button press... you get the idea.

The details of the interventions can be documented later along with the narrative, but the bulk of the work was captured real-time. We can also sync with our monitors and show depth of compressions, rate of compressions and rhythms associated with the continuous chest compression style CPR we do for my agency.

Going back to paper for codes would be ludicrous for my department. The data would be shit for a start. Handwriting is often shit and made worse under the stress of screaming bystanders. And depending on whether we achieved ROSC or not, the likelihood of losing paper in the shuffle only goes up.


The idea is to have the current system create a backup paper trail, and to practice resuming from it for when computers go down. Nothing about your current process needs to change, only that you be familiar with falling back to the paper backups when computers are down.


Which means that you have to be operating papered before the system goes down. If you aren't, the system never gets to transition because it just got CrowdStruck.


Correct. We use paper receipts for shopping and paper ballots for voting. Automation is fast and efficient, but there must be a manual fallback when power fails or automation is unreliable.

This wisdom is echoed in some religious practices that avoid complete reliance on modern technology.


> depth of compressions

Okay, how does that monitor work? Genuinely curious.


Replace "require" and "must" with "expected to", and you get the difference between policy and reality.


You can do CPR without a computer system, but changing systems in the middle of resuscitation where a delay of seconds can mean the difference between survival and death is absolutely not ideal. CPR in the hospital is a coordinated team response and if one person can’t do their job without a computer then the whole thing breaks down.


If you're so close to death that you're depending on a few seconds give or take, you're in God's hands. I would not blame or credit anyone or any system for the outcome, either way.


I’m sure you meant “the physicians’ hands.”


No. The physician will be running a standard ER code protocol, following a memorized flow chart.


Judgement is always part of the process, but yeah running a routine code is pretty easy to train for. It's one of the easiest procedures in medicine. There are a small number of things that can go wrong that cause quick death, and for each a small number of ways to fix them. You can learn all that in a 150 hour EMT class.


My guess is the system that notifies the next caretaker in the chain that someone is currently receiving CPR.

if it works, there's a lot more to be done to get the patient to stable.


need to play bee gees on windows media player


probably the system used to pull and record medication uses in a hospital. It's been awhile, but "Pyxis" used to be the standard where I shadowed.

Nurses hated it.


Hello, I'm a journalist looking to reach people impacted by the outage and wondering if you could kindly connect with your ER colleague. My email is sarah.needleman@wsj.com. Thanks!


Surprised and impressed at your using HN as a resource.


The comments is the content. I have always said this.


I mean if they're finding sources through the comments and then corroborating their stories via actual interviews, it's completely fine practice. As long as what's printed is corroborated and cross-referenced I don't see a problem.

If they go and publish "According to hackernews user davycro ..." _then_ there's a problem.


She is living in the future. Way to go.


I sent them your contact info, pretty sure they will be asleep for the next few hours


Now this is an unusual meeting of two meanings of "running a code".


there's a great meme out there that says something like: Everyone on my floor is coding! \n Software PMs: :-D \n Doctors: :-O


When you're a software engineer turned doctor you get sent that by all of your friends xD


> Took down our entire emergency department as we were treating a heart attack.

It makes my blood boil to be honest that there is no liability for what software has become. It's just not acceptable.

Companies that produce software with the level of access that Crowdstrike has (for all effective purposes a remote root exploit vector) must be liable for the damages that this access can cause.

This would radically change how much attention they pay to quality control. Today they can just YOLO-push barely tested code that bricks large parts of the economy and face no consequences. (Oh, I'm sure there will be some congress testimony and associated circus, but they will not ever pay for the damages they caused today.)

If a person caused the level and quantity of damage Crowdstrike caused today they would be in jail for life. But a company like Crowdstrike will merrily go on doing more damage without paying any consequence.


> Companies that produce software

What about companies that deploy software with the level of quality that Crowdstrike has? Or Microsoft 365 for that matter.

That seems to be the bigger issue here; after all Crowdstrike probably says it is not suitable for any critical systems in their terms of use. You shouldn't be able to just decide to deploy anything not running away fast enough on critical infrastructure.

On the other hand, Crowdstrike Falcon Sensor might be totally suitable for a non-critical systems, say entertainment systems like the Xbox One.


From CrowdStrike's own marketing (https://www.crowdstrike.com › resources › infographics): "Learn how CrowdStrike keeps your critical areas of risk such as endpoints, cloud workloads, data, and identity, safe and your business running"


Wife is a nurse. They eventually got 2 computers working for her unit. I don't think it impacted patients already being treated, but they couldn't get surgeries scheduled and no charting was being done. Some of the other floors were in complete shambles.


Hi, as I noted to another commenter, I'm a journalist looking to speak with people who've been impacted by the outage. I'm wondering if I could speak with your wife. My email is sarah.needleman@wsj.com. Thanks.


Sure I’ll pass your email along to her and see if she wants to do that.


I don't understand how this isn't bigger news?

Local emergency services were basically nonfunctioning for the better part of the day, along with the heat wave and various events. It seems like a number of deaths (locally at least, specific to what I know for my mid-sized US city) will be indirectly attributable to this.


It's entirely possible (likely, even) that someone died from this, but it's hard to know with critically ill patients whether they would have survived without the added delays.


On aggregate it is. How many deaths over the average for these conditions did we see?


We are in the process of calculating this but need this 24H period to roll over so we can benchmark the numbers against a similar 24H period. It's hard to tell if the numbers we get back will even be reliable, given that a lot of today's statistics, from what I can tell, have come in via emails or similar.


So what?


Give it like, a week before bothering to ask such questions...


If true, it is insane that critical facilities like hospitals do not have decentralized security systems.


Crowdstrike is on every machine in the hospital because hospitals and medical centers became a big target for ransomware a few years ago. This forced medical centers to get insured against loss of business and getting their data back. The insurance companies that insure companies against ransomware insist on putting host based security systems onto every machine or they won't cover losses. So Crowdstrike (or one of their competitors) has to run on every machine.


I wonder why they put software on every machine instead of relying on a good firewall and network separation.

Granted, you are still vulnerable to physical attacks (i.e. the person coming in with a USB stick), but I would say those are much more difficult, and if you also put firewalls between compartments of the internal network, even more so.

Also, I think the use of Windows in critical settings is not a good choice, and today we had a demonstration. For those who say the same could have happened to Linux: yes, but you could have mitigated it. For example, to me a Linux system used in critical settings should have a read-only root filesystem, which you can't do on Windows. Thus the worst you would have had to do is reboot the machine to restore it.


The physical security of computers in, say, a hospital is poor. You can't rely on random people not getting access to a logged in computer.


A common attack vector is phishing, where someone clicks on an email link and gets compromised or supplies credentials on a spoofed login page. External firewalls cannot help you much there.

Segmenting your internal network is a good defence against lots of attacks, to limit the blast radius, but it's hard and expensive to do a lot of it in corporate environments.


There are no good firewalls on the market. It's always the pretend-firewall that becomes the vector.


Yup as you say, if you go for a state of the art firewall, then that firewall also becomes a point of failure. Unfortunately complex problems don't go away by saying the word "decentralize".


You highly overestimate the capabilities of the average IT person working for a hospital. I'm sure some could do it. But most who can work elsewhere.


I wonder if those same insurance policies are going to pay out due to the losses from this event?


> I wonder if those same insurance policies are going to pay out due to the losses from this event?

They absolutely should be liable for the losses, in each case where they caused it.

(Which is most of them. Most companies install crowdstrike because their auditor want it and their insurance company says they must do whatever the auditor wants. Companies don't generally install crowdstrike out of their own desire.)

But of course they will not pay a single penny. Laws need to change for insurance companies, auditors and crowdstrike to be liable for all these damages. That will never happen.


Why would they? Cybersecurity insurance doesn’t cover “we had an outage” - it covers a security breach.


Depends on what the policy (contract) says. But there's a good argument that your security vendor is inside the wall of trust at a business, and so not an external risk.


In a sense, it looks like these insurance companies' policies work a little bit like regulation. Except that it's not monopolistic (different companies are free to have different rules), and when shit hits the fan, they actually have to put their money where their mouth is.

Despite this horrific outage, in the end it sounds like a much better and anti-fragile system than a government telling people how to do things.


A little bit, probably slightly better. But insurance companies don't want to eliminate risk (if they did that, no one would buy their product). They instead want to quantify, control and spread the risk by creating a risk pool. Good, competent regulation would be aimed at eliminating, as much as reasonably possible, the risk. Instead, insurance company audits are designed to eliminate the worst risk and put everyone into a similar risk bucket. After spending money on an insurance policy and passing an audit, why would a company spend even more money and effort? They have done "enough".


> The insurance companies that insure companies against ransomware insist on putting host based security systems onto every machine or they won't cover losses.

This is part of the problem too. These insurance/audit companies need to be made liable for the damage they themselves cause when they require insecure attack vectors (like Crowdstrike) to be installed on machines.


Crowdstrike and its ilk are basically malware. There have to be better anti-ransomware approaches, such as replicated, immutable logs for critical data.


That only solves half the problem, it doesn't solve data theft


1. Is data theft the main risk of ransomware?

2. Why would anyone trust a ransomware perpetrator to honor a deal to not reveal or exploit data upon receipt of a single ransom payment? Are organizations really going to let themselves be blackmailed for an indefinite period of time?

3. I'm unconvinced that crowdstrike will reliably prevent sensitive data exfiltration.


1. Double extortion is the norm, some groups don't even bother with the encryption part anymore, they just ask a ransom for not leaking the data

2. Apparently yes. Why do you think calls to ban payments exist?

3. At minimum it raises the bar for the hackers - sure, it's not like you can't bypass edr but it's much easier if you don't have to bypass it at all because it's not there


> That only solves half the problem, it doesn't solve data theft

Crowdstrike is not a DLP solution. You can solve that problem (where necessary) by less intrusive means.


I agree edr is not a DLP solution, but edr is there to prevent* an attack getting to the point where staging the data exfil happens... In which case yes I would expect web/volumetric DLP kicks in as the next layer.

*Ok ok I know it's bypassable but one of the happy paths for an attack is to pivot to the machine that doesn't have edr and continue from there.


Is there any security company that provides decentralized service?


By "decentralized" I think you mean "doesn't auto-update with new definitions"?

I have worked at places which controlled the roll-out of new security updates (and windows updates) for this very reason. If you invest enough in IT, it is possible. But you have to have a lot of money to invest in IT to have people good enough to manage it. If you can get SwiftOnSecurity to manage your network, you can have that. But can every hospital, doctor's office, pharmacy, scan center, etc. get top tier talent like SwiftOnSecurity?


I used to work for a major retailer managing updates to over 6000 stores. We had no auto updates (all linux systems in stores) and every update went through our system.

When it came to audit time, the auditors were always impressed that our team had better timely updates than the corporate office side of things.

I never really thought we were doing anything all that special (in fact, there were always many things I wanted to improve about the process) but reading about this issue makes me think that maybe we really were just that much better than the average IT shop?


> I have worked at places which controlled the roll-out of new security updates (and windows updates)

But did they also control the roll-out of virus/threat definition files? Because if not their goose would have been still cooked this time.


Maybe, maybe not, devil's in the details.

If, for example, they were doing slow rollouts for configs in addition to binaries, they could have caught the problem in their canary/test envs and not let it proceed to a full blackout.
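
Staging the data the same way you stage binaries doesn't have to be fancy, either. A sketch of the ring idea; push_update(), healthy() and the ring sizes are stand-ins for whatever a real management plane exposes:

    # Sketch of ring-based rollout for content/config updates, not just binaries.
    # fleet, push_update and healthy are stand-ins, not any vendor's real API.
    import time

    RINGS = [
        ("canary-lab", 0.001),   # your own test boxes first
        ("early",      0.01),
        ("broad",      0.20),
        ("everyone",   1.00),
    ]

    def rollout(update_id: str, fleet: list[str], push_update, healthy,
                bake_minutes: int = 60, max_unhealthy: float = 0.001) -> bool:
        done: set[str] = set()
        for ring_name, fraction in RINGS:
            cutoff = max(1, int(len(fleet) * fraction))
            target = [h for h in fleet[:cutoff] if h not in done]
            for host in target:
                push_update(host, update_id)
            done.update(target)
            time.sleep(bake_minutes * 60)        # let the ring soak before widening
            failing = [h for h in done if not healthy(h)]
            if len(failing) > max_unhealthy * max(len(done), 1):
                print(f"halting {update_id} at ring {ring_name}: {len(failing)} unhealthy hosts")
                return False                      # bad content never reaches the whole fleet
        return True

The point is just that a bad config, like a bad binary, should have to get past a few thousand machines and a health check before it reaches millions.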


When I say decentralized, I mean security measures and updates taken locally at the facility. For example, MRI machines are local, and they get maintained and updated by specialists dispatched by the vendor (Siemens or GE)


Siemens or GE or whomever built the MRI machine aren't really experts in operating systems, so they just use one that everyone knows how to work, MS Windows. It's unfortunate that to do things necessary for modern medicine they need to be networked together with other computers (to feed the EMRs most importantly) but it is important in making things safer. And these machines are supposed to have 10-20 year lifespans (depending on the machine)! So now we have a computer sitting on the corporate network, attached to a 10 year old machine, and that is a major vulnerability if it isn't protected, patched, and updated. So is GE or Siemens going to send out a technician to every machine every month when the new Windows patch rolls out? If not, the computer sitting on the network is vulnerable for how long?

Healthcare IT is very important, because computers are good at record-keeping, retrieval and storage, and that's a huge part of healthcare.


A large hospital takes in power from multiple feeds in case any one provider fails. It's amazing that we're even thinking in terms of "a security company" rather than "multiple security layers."

The fact that ransomware is still a concern is an indication that we've failed to update our IT management and design appropriately to account for them. We took the cheap way out and hoped a single vendor could just paper over the issue. Never in history has this ever worked.

Also, speaking of generators, a large enough hospital should be running power failure test events periodically. Why isn't a "massive IT failure test event" ever part of the schedule? Probably because they know they have no reasonable options and any scale of catastrophe would be too disastrous to even think about testing.

It's a lesson on the failures of monoculture. We've taken the 1970s design as far as it can go. We need a more organically inspired and rigorous approach to systems building now.


This. The 1970s design of the operating system and the few companies that deliver us the monoculture are simply not adequate or robust given the world of today.


It's insane that critical facilities use Windows rather than Linux/*BSD, which is rock-solid.



They'll still install crowdstrike or some other rootkit that will bring it all down anyway


> Hard to imagine how many millions of not billions of dollars this one bad update caused.

And even worse, possibly quite a few deaths as well.

I hope (although I will not be holding my breath) that this is the wake-up call we need to realise that we cannot have so much of our critical infrastructure rely on the bloated OS of a company known for its buggy, privacy-intruding, crapware-riddled software.

I'm old enough to remember the infamous blue-screen-of-death Windows 98 presentation. Bugs exist but that was hardly a glowing endorsement of high-quality software. This was long ago, yet it is nigh on impossible to believe that the internal company culture has drastically improved since then, with regular high-profile screw-ups reminding us of what is hiding under the thin veneer of corporate respectability.

Our emergency systems don't need windows, our telephone systems don't need windows, our flight management systems don't need windows, our shop equipment systems don't need windows, our HVAC systems don't need windows, and the list goes on, and on, and on.

Specialized, high-quality OSes with low attack surfaces are what we need to run our systems. Not a generic OS stuffed with legacy code from a time when those applications were not even envisaged.

Keep it simple, stupid (KISS) is what we need to go back to; our lives literally depend on it.

With the multi-billion dollar screw-up that happened yesterday, and an as-of-yet unknown number of deaths, it's impossible to argue that the funds are unavailable to develop such systems. Plurality is what we need, built on top of strong standards for compatibility and interoperability.


OK, but this was a bug in an update of a kernel module that just happened to be deployed on Windows machines. How many OSs are there that can gracefully recover from an error in kernel space? If every machine that crashed had been running, say, Linux and the update had been coded equivalently, nothing would've changed.

Perhaps rather than an indictment on Windows, this is a call to re-evaluate microkernels, at least for critical systems and infrastructure.


It was not a call to replace windows systems with linux, but to replace it with specialised OSes that do less, with better stability guarantees.

And building something around microkernels would definitely not be a bad starting point.


> Took down our entire emergency department

What does this mean? Did the power go down? Is all the equipment connected? Or is it the insurance software that can't run do nothing gets done? Maybe you can't access patient files anymore but is that taking down the whole thing?


Every computer entered a bluescreen loop. We are dependent on Epic for placing orders, for nursing staff to know what needs to be done, for viewing records, for transmitting and interpreting images from radiology machines. It's how we know the current state of the department and where each patient (out of 50+ people we are simultaneously treating) is at. Our equipment still works but we're flying blind, having to shout orders at each other, and have no way to send radiology images to other doctors for consultation.


Yeah in Radiology we depend on Epic and a remote reading service called VRAD. VRAD runs on AWS and went down just after 0130 hrs EST. Without Epic & VRAD we were pretty helpless.


Can't imagine how stressful this must have been for Radiology. I had two patients waiting on CT read with expectation to discharge if no acute findings. Had to let them know we had no clear estimate for when that would be, and might not even know when the read comes back if we can't access epic.

Have a family member in crit care who was getting a sepsis workup on a patient when this all happened. They somehow got plain film working offline after a bit of effort.


Did the person survive?


We have limited visibility into this in the emergency department. You stabilize the patient and admit them to the hospital, then they become internal medicine or ICU's patient. Thankfully most of the work was done and consults were called prior to the outage, but they were in critical condition.


I will say - the way we typically find out really sends a shiver down your spine.

You come in for you next shift and are finishing charting from your prior shift. You open one of your partially finished charts and a little popup tells you "you are editing the chart for a deceased patient".


Sounds like this is hugely emotionally taxing, do you just get used to it after a while, or is it a constant weight?

This is why I'm impressed by anyone who works in a hospital, especially the more urgent/intensive care


i'll admit i have no idea what i'm talking about but aren't there some Plan B options? something that's more manual? or are surgeons too reliant on computers?


There are plan B options like paper charting, downtime procedures, alternative communication methods and so on. So while you can write down a prescription and cut a person open, you can't manually do things like pull up the patient's medical history for the last 10 years in a few seconds, have an image read remotely when there isn't a radiologist available on site, or electronically file for the meds to just show up instantly (all depending on what the outage issue is affecting of course). For short outages some of these problems are more "it caused a short rush on limited staff" than "things were falling apart". For longer outages it gets to be quite dangerous and that's where you hope it's just your system that's having issues and not everyone in the region so you can divert.

If the alternatives/plan b's were as good or better than the plan a's then they wouldn't be the alternatives. Nobody is going to have half a hospital's care capacity sit as backup when they could use that year round to better treat patients all the time, they just have plans of last resort to use when what they'd like to use isn't working.

(worked healthcare IT infrastructure for a decade)


> So while you can write down a prescription and cut a person open, you can't manually do things like pull up the patient's medical history for the last 10 years in a few seconds, have an image read remotely when there isn't a radiologist available on site, or electronically file for the meds to just show up instantly (all depending on what the outage issue is affecting of course).

I worked for a company that sold and managed medical radiology imaging systems. One of our customers' admins called and said "Hey, new scans aren't being properly processed so radiologists can't bring them up in the viewer". I told him I'd take a look at it right away.

A few minutes later, he called back; one of their ERs had a patient dying of a gunshot wound and the surgeon needed to get the xray up so he could see where the bullet was lodged before the guy bled out on the table.

Long outages are terrifying, but it only takes a few minutes for someone to die because people didn't have the information they needed to make the right calls.


Yep, when patients often still die while everything is working fine even a minor inconvenience like "all of the desktop icons reset by mistake" can be enough to tilt the needle the wrong way for someone.


I used to work for a company that provided network performance monitoring to hospitals. I am telling a Story second hand that I heard the CEO share.

One day, during a rapid pediatric patient intervention, a caregiver tried to log in to a PC to check a drug interaction. The computer took a long time to log in because of a VDI problem where someone had stored many images in a file that had to be copied on login. While the care team was waiting for the computer, an urgent decision was made to give the drug. But a drug interaction happened — one that would have been caught, had the VDI session initialized more quickly.

The patient died and the person whose VDI profile contained the images in the bad directory committed suicide. Two lives lost because files were in the wrong directory.


What's insane medical malpractice is that radiology scans aren't displayed locally first.

You don't need 4 years of specialized training to see a bullet on a scan.


We can definitely get local imaging with X-Ray and ultrasound - we use bedside machines that can be used and interpreted quickly.

X-Ray has limitations though - most of our emergencies aren't as easy to diagnose as bullets or pneumonia. CT, CTA, and to a lesser extent MRI are really critical in the emergency department, and you definitely need four years of training to interpret them, and a computer to let you view the scan layer-by-layer. For many smaller hospitals they may not have radiology on-site and instead use a remote radiology service that handles multiple hospitals. It's hard to get doctors who want to live near or commute to more rural hospitals, so easier for a radiologist to remotely support several.


GP referred to "processed," which could mean a few things. I interpreted it to mean that the images were not recording correctly locally prior to any upload, and they needed assistance with that machine or the software on it.


I am talking out my ass, but...

Seems like a possible plan would be duplicate computer systems that are using last week's backup and not set to auto-update. Doesn't cover you if the databases and servers go down (unless you can have spares of those too), but if there is a bad update, a crypto-locker, or just a normal IT failure each department can switch to some backups and switch to a slightly stale computer instead of very stale paper.


We have "downtime" systems in place, basically an isolated Epic cluster, to prevent situations like this. The problem is that this wasn't a software update that was downloaded by our computers, it was a configuration change by Crowdstrike that was immediately picked up by all computers running its agent. And, because hospitals are being heavily targeted by encryption attacks right now, it's installed on EVERY machine in the hospital, which brought down our Epic cluster and the disaster recovery cluster. A true single point of failure.


Can only speak for the UK here, but having one computer system that is sufficiently functional for day-to-day operations is often a challenge, let alone two.


My hospital's network crashed this week (unrelated to this). Was out for 2-3 hours in early afternoon.

The "downtime" computers were affected just like everything else because there was no network.

Phones are all IP-based now; they didn't work.

Couldn't check patient histories, couldn't review labs, etc. We could still get drugs, thankfully, since each dispensing machine can operate offline.


There are often such plans, from DR systems to isolated backups to secondary systems, as much as the risk management budget allows at least. Of course it takes time to switch to these and back, the missing records cause chaos (both inside synced systems and with patient data) both ways, and it takes a while to do. On top of that not every system will be covered so it's still a limited state.


Yes, but the more highly available you make it, the more it costs, and it's not like this happens every week.


As I was finishing my previous comment it occurred to me that costs are fungible.

Money spent on spares is not spent on cares.


Thank you, I'm quickly becoming tired of HN posters assuming they know how hospitals operate and asking why we didn't just use Linux.


There are problems with getting lab results, X-rays, CT and MRI scans. They do not have paper-based Plan B. IT outage in a modern hospital is a major risk to life and health of their patients.


I don't know about surgeons, but nursing and labs have paper fallback policies... they can backload the data later.


It's often the case that the paper fallbacks can't handle anywhere near the throughput required. Yes, there's a mechanism there, but it's not usable beyond a certain load.


I think it's eventually manageable for some subset of medical procedures, but the transition to that from business as usual is a frantic nightmare. Like there's probably a whole manual for dealing with different levels of system failure, but they're unlikely to be well practiced.

Or maybe I'm giving these institutions too much credit?


Why is the emergency department using windows?


Why did they update everything all at once?


I assume Crowdstrike is software you usually want to update quickly, given it is (ironically) designed to counter threats to your system.

Very easy for us to second guess today of course. But in another scenario a manager is being torn a new one because they fell victim to a ransomware attack via a zero day systems were left vulnerable to because Crowdstrike wasn’t updated in a timely manner.


Maybe, if there's a new zero-day major exploit that is spreading like wildfire. That's not the normal case. Most successful exploits and ransom attacks are using old vulnerabilites against unpatched and unprotected systems.

Mostly, if you are reasonably timely about keeping updates applied, you're fine.


> Maybe, if there's a new zero-day major exploit that is spreading like wildfire. That's not the normal case.

Sure. And Crowdstrike releasing an update that bricks machines is also not the normal case. We're debating between two edge cases here, and the answers aren’t simple. A zero day spreading like wildfire is not normal but if it were to happen it could be just as, if not more, destructive than what we’re seeing with Crowdstrike.


In the context of the GP where they were actively treating a heart attack, the act of restarting the computer (let alone it never come back) in of itself seems like an issue.


I believe this update didn't restart the computer, just loaded some new data into the kernel. Which didn't crash anything the previous 1000 times. A successful background update could hurt performance, but probably machines where that's considered a problem just don't run a general-purpose multitasking OS?


tfw you need to start staggering your virus updates in case your anti-virus software screws you over instead


Maybe those old boomer IT people were on to something by using different Citrix clusters and firewalling off the ones that run essential software...


Crowdstrike pushed a configuration change that was a malformed file, which was picked up by every computer running the agent (millions of computers across the globe). It's not like hospitals and IT systems are manually running this update and can roll it back.

As to why they didn't catch this during tests, or why they don't perform gradual change rollouts to hosts, your guess is as good as mine. I hope we get a public postmortem for this.
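
Independent of rollout speed, the publish pipeline can also refuse to ship a file the agent can't parse. A hedged sketch of such a gate; since the real channel-file format isn't public, the checks here are generic stand-ins:

    # Pre-publish gate: refuse to push a content update the fleet could choke on.
    # validate() checks are generic stand-ins; the real file format isn't public.
    import hashlib, sys

    def validate(blob: bytes) -> list[str]:
        problems = []
        if len(blob) == 0:
            problems.append("empty file")
        elif blob.count(0) == len(blob):
            problems.append("file is all zero bytes")
        if len(blob) > 5_000_000:
            problems.append("suspiciously large")
        # In a real pipeline you'd also load it with the *same* parser the agent
        # ships, ideally inside a VM running the production driver, and require
        # a clean boot before the file is allowed out the door.
        return problems

    def publish(path: str) -> int:
        blob = open(path, "rb").read()
        problems = validate(blob)
        if problems:
            print(f"refusing to publish {path}: {', '.join(problems)}")
            return 1
        print(f"publishing {path} sha256={hashlib.sha256(blob).hexdigest()}")
        return 0

    if __name__ == "__main__":
        sys.exit(publish(sys.argv[1]))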


Considering Crowdstrike mentioned in their blog that systems that had their 'falcon sensor' installed weren't affected [1], and the update is falcon content, I'm not sure it was a malformed file, but just software that required this sensor to be installed. Perhaps their QA only checked if the update broke systems with this sensor installed, and didn't do a regression check on windows systems without it.

[1]https://www.crowdstrike.com/blog/statement-on-falcon-content...


That’s not exactly what they’re saying.

It says that if a system isn’t “affected”, meaning it doesn’t reboot in a loop, then the “protection” works and nothing needs to be done. That’s because the Crowdstrike central systems, on which rely the agents running on the clients’ systems, are working well.

The “sensor” is what the clients actually install and run on their machines in order to “use Crowdstrike”.

The crash happened in a file named csagent.sys which on my machine was something like a week old.


I'm not familiar with their software, but I interpreted their wording to mean their bug can leave your system in one of two possible states:

(1) Entire system is crashed.

(2) System is running AND protected from security threats by Falcon Sensor.

And to mean that this is not a possible state:

(3) System is running but isn't protected by Falcon Sensor.

In other words, I interpreted it to mean that they're trying to reassure people they don't need to worry about crashes and hacks, just crashes.


> Why did they update everything all at once?

This is beyond hospital IT control. Clownstrike (sorry, Crowdstrike) unconditionally force-updates the hosts.


Likely because staggered updates would harm their overall security services. I'm guessing these software offer telemetry that gets shared across their clientele, so that gets hampered if you have a thousand different software versions.


My guess is this was an auto-update pushed out by whatever central management server they use. Given CS is supposed to protect your from malware, IT may have staged and pushed the update in one go.


Auto-updates are the only reason something like this gets so widespread so fast.


High-end hospital-management software is not simple stuff to roll your own. And the (very few) specialty companies which produce such software may see no reason to support a variety of OS's.


A follow up question is why is the one OS chosen the one historically worst at security.


It appears insecure because it is under constant attack because it is so prevalent. Let’s not pretend the *nix world is any better.

I’m no fan of Windows or Microsoft but the commitment to backwards compatibility should not be underestimated.


Are you sure that argument still holds when everyone has Android/iOS phone with apps that talk to Linux servers, and some use Windows desktops and servers as well?


There isn't, and never was, a benevolent dictator choosing the OS for computers in medical settings.

Instead, it's a bunch of independent-ish, for-profit software & hardware companies. Each one trying to make it cheap & easy to develop their own product, and to maximize sales. Given the dominance of MS-DOS and Windows on cheap-ish & ubiquitous PC's, starting in the early-ish 1980's, the current situation was pretty much inevitable.


To add detail for those that don't understand, the big healthcare players barely have unix teams, and the small mom and pop groups literally have desktops sitting under the receptionist desk running the shittiest software imaginable.

The big health products are built on windows because they are built by outsourced software shops and target the majority of builds which are basically the equivalent of bob's hardware store still running windows 95 on their point of sale box.

The major players that took over this space for the big players had to migrate from this, so they still targeted "wintel" platforms because the vast majority of healthcare servers are windows.

It's basically the tech equivalent of everything having evolved from the width of oxen for railways.


Because of critical mass. A significant amount of non-technically inclined people use Windows. Some use Mac. And they're intimidated by anything different.


Generally speaking, employees don't really use windows per se so much as click the browser icon and proceed to use their employer's web-based tools.


There's a bunch of non-web proprietary software medical offices use to access patient files, result histories, prescription dispensation etc. At least here in Ontario my doctor uses an actual windows application to accomplish all that.


Then they use those apps. The point is that their usage of the OS as such is so minimal as to be irrelevant, as long as it has a launcher and an X in the top corner.

They could as well launch that app in OpenBSD.


Momentum as well. Many of these systems started in DOS. The DOS->Windows transition is pretty natural.


Exactly!

The question is: why did half or more of the Fortune 500 allow Crowdstrike - Windows hackers - access to and total control of their not-a-MS-Windows business? Obviously Crowdstrike does not differentiate between medicine and lifting cranes. "In the middle of surgery" is not in their use-case docs!

There was a Mercedes pitstop image somewhere with a wall of BSoD monitors :) But that is not Crowdstrike's business either...

And all of that over the public internet and miscellaneous clouds. Banks have their own fibre lines, why can't hospitals?

Airports should disconnect from the Internet too; selling tickets can be separate infra, and synchronization between POSes and checkout doesn't need to be in real time.

There is only one sane way to prevent such events: EDR controlled by the organization, and this is sharply incompatible with third-party online EDR providers. But they can sell it in a box and do real-time support when called.


I mean this question in the most honest way; I am not trying to be snarky or superior.

What are the hard problems? I can think of a few, but I'm probably wrong.


Auditing: using Windows plus AV plus malware protection means you demonstrate compliance faster than trying to prove your particular version of Linux is secure. Hospitals have to demonstrate compliance in very short timeframes and every second counts. If you fail to achieve this, some or all of your units can be closed.

Dependency chains: many pieces of kit either only have drivers on Windows or work much better on Windows. You are at the mercy of the least OS-diverse piece of kit. Label printers are notorious for this, as an e.g.

Staffing: Many of your staff know how to do their jobs excellently, but will struggle with tech. You need them to be able to assume a look and feel, because you don't want them fighting UX differences when every second counts. Their stress level is roughly equivalent to the worst 10 seconds of their day. And staff will quit or strike over UX. Even UI colour changes due to virtualization downscaling have triggered strife.

Change Mgmt: Hospitals are conservative and rarely push the envelope. We are seeing a major shift at the moment in key areas (EMR) but this is still happening slowly. No one is interested in increasing their risk just because Linux exists and has Win64 compatibility. There is literally no driver for change away from windows.


> There is literally no driver for change away from windows.

(Not including this colossal fuck up.)


No hospital will shift to Linux because of this incident. They may shift away from Crowdstrike, but not to another OS.


It's actually not that hard from a conceptual implementation standpoint, it's a matter of scale, network effects, and regulatory capture


> What are the hard problems? I can think of a few, but I'm probably wrong.

Billing and insurance reimbursement processes change all the time and are a headache to keep up to date. E.g. the actual dentist software is like Paint, but with mainly the bucket tool and some way to quickly insert teeth objects to match your mouth, i.e. there is almost no medical skill in the software itself helping the user.


Because essentially every large hospital in the USA does?


This is the result of vendor lock-in and the lesson for all businesses not to use Microsoft servers. Linux/*BSD are rock-solid and open source.


It's not just that. A large portion of IT people who work in these industries find Windows much easier to administer. They're very resistant to switching out even if it was possible and everything the company needed was available elsewhere.

Even if they did switch, they'd then want to install all the equivalent monitoring crap. If such existed, it would likely be some custom kernel driver and it could bring a unix system to its knees when shit goes wrong too.


I mean crowdstrike has a linux equivalent which broke rhel recently by triggering kernel panic


Contact a lawyer if this affected her health, please. Any delay in receiving stroke care could have injured her more, I imagine. Any docs here?


ER worker here. It really depends on the details. If she was C-STAT positive with last known normal within three hours, you assume stroke, activate the stroke team, and everything moves very quickly. This is where every minute counts, because you can do clot busting to recover brain function.

The fact that she was discharged without an overnight admit suggests to me that the MRI did not show a stroke, or perhaps she was outside the treatment window when she went to the hospital.


What if it was a cerebral bleed?


I can't even begin to imagine the cost of proving the health effects and attempting to win the case.


Yes. Reading and learning.


I remember a fed speaker in the 90s at Alexis hotel Defcon trying to rationalize their weirdly over-aggressive approach to enforcement by mentioning how hackers would potentially kill people in hospitals, fast forward to today and it's literally the "security" software vendor that's causing it.


Well, cryptolockers have actually compromised various hospitals, and I remember the first one was in the United Kingdom.


Don't forget that nearly all crypto lockers are run by North Korea or other state espionage groups pretending to be North Korea.

If we adjusted our foreign policy slightly, I think we would dissuade that whole class of attacker.


It's not like hackers haven't killed people in hospitals with e.g. ransomware. Our local dinky hospital system was hit by ransomware twice, which at the very least delayed some important surgeries.


I can't imagine why any critical system is connected to the internet at all. It never made sense to me. Wifi should not be present on any critical system board and ethernet plugged in only when needed for maintenance.

This should be the standard for any life sustaining or surgical systems, and any critical weapons systems.


I work for a large medical device company and my team works on securing medical devices. At least at my company as a general rule, the more expensive the equipment (and thus the more critical the equipment, think surgical robots) the less likely it will ever be connected to a network, and that is exactly because of what you said, you remove so many security issues when you keep devices in a disconnected state.

Most of what I do is creating the tools to let the field reps go into hospitals and update capital equipment in a disconnected state (IE, the reps must be physically tethered to the device to interact with it). The fact that any critical equipment would get an auto-update, especially mid-surgery is incredibly bad practice.


I work for the government supporting critical equipment - not in medical, in the transportation sector - and the systems my team supports not only are not connected to the internet, they aren't even capable of being so connected. Unfortunately the department responsible for flogging us to do cybersecurity reporting (a different org branch than my team) has all our systems miscategorized as IT data systems (when they don't even contain an operating system). So we waste untold numbers of engineer hours reporting "0 devices affected" against lists of CVEs and answering data calls about SSH, Oracle or Cisco vulnerabilities, etc. etc., which we keep answering with "this system is air gapped and uses a microcontroller from 1980 that cannot run Windows or Linux", but the cybersecurity-flogging department refuses to properly categorize us. My colleague is convinced they're doing that because it inflates their numbers of IT systems.

Anyway: it is getting to the point that I cynically predict we may be required to add things to the system (such as embedding PCs), just so we can turn around and "secure" them to comply with the requirements that shouldn't be applied to these systems. Maybe this current outage event will be a wake up call to how misplaced the priorities are, but I doubt it.


All this stuff could easily be airgapped or revert to USB stick fail safe.


Have you ever tried to airgap a gigantic wifi network across several buildings?

Has to be wifi because the carts the nurses use roll around. Has to be networked so you can have EMR's that keep track of what your patients have gotten and the Pharmacists, doctors, and nurses can interface with the Pyxis machines correctly. The nurse scans a patients barcode at the Pyxis, the drawer opens to give them the drugs, and then they go into the patient's room and scan the drug barcode and the patients barcode before administering the drug. This system is to prevent the wrong drug from being administered, and has dramatically dropped the rates of mis-administering drugs. The network has to be everywhere on campus (often times across many buildings). Then the doctor needs to see the results of the tests and imaging- who is running around delivering all of these scans to the right doctors?

You don't know what you are talking about if you think this is easy.


Air-gapping the system from the external world is different from air-gapping internally. The systems are only updated via physical means. And possibly all data in and out is handled offline-style, via a certain double-firewall arrangement (you do not allow direct contact but dump files in and out). Not common, but for industrial critical systems I've seen a few big shops do this.


So how does a doctor issue a discharge order via e-prescription to the patients pharmacy for them to pick up when they leave? How do you update the badge readers on the drug vaults when an employee leaves and you need to deactivate their badge? How do you update the EMR's from the hospital stay so the GP practice they use can see them after discharge? How do you order more supplies and pharmacy goods when you run out? How do you contact the DEA to get approval for using certain scheduled meds? I'm afraid that external networks are absolutely a requirement for modern hospitals.

If the system has to be networked with the outside world, who is responsible for physically updating all of these machines, so they don't get ransomware'd? Who has to go out and visit each individual machine and update it each month so the MRI machine doesn't get bricked by some teen ransomware gang? Remember that was the main threat hospitals faced 3-4 years ago, which is why Crowdstrike ended up on everyone's computer: because the ransomware insurance people forced them to.

There is a reason that I am a software engineer and not an IT person. I prefer solving more tractable problems, and I think proving p!=np would be easier than effectively protecting a large IT network for people who are not computing professionals.

One of my favorite examples: in October 2013 casino/media magnate and right wing billionaire Sheldon Adelson gave a speech about how the US and Israel should use nuclear weapons to stop Iran nuclear program. In February 2014 a 150 line VB macro was installed on the Sands casino network that replicated and deleted all HDDs, causing 150 million dollars of damage. That was to a casino, which spends a lot of money on computer security, and even employs some guys named Vito with tire irons. And it wasn't nearly enough.


> Who has to go out and visit each individual machine and update it each month so the MRI machine doesn't get bricked by some teen ransomware gang?

The manufacturer does. As I mentioned in my OP I help build the software for our field reps to go into hospitals and clinics to update our devices in a disconnected state. Most of the critical equipment we manufacture has this as a requirement since it can't be connected to a network for security reasons.

As for discharge orders, etc, I can't speak to that, but that's also not what I would consider critical. I'm talking about things like surgical robots, which can not be connected to a network for obvious reasons, especially during a surgery.


External networks are required but it should be possible to air gap the critical stuff to read only. It’s just that it’s costly and hospitals are poor/cheap


Did this actually happen to medical equipment mid-surgery today?


The OP for this very thread said as much.


My wife is a hospital pharmacist. (1) When she gets a new prescription in, she needs to see the patients charts on the electronic medical records, and then if she approves the medication a drawer in the Pyxis cabinet (2) will open up when a nurse scans the patients barcode, allowing them to remove the medication, and then the nurse will scan the patient's barcode and the medication barcode in the patients room to record that it was delivered at a certain time. Computers are everywhere in healthcare, because they need records and computers are great at record-keeping. All of those need networks to connect them, mostly on wifi (so the nurses scanners can read things).

In theory you could build an air-gapped network within a hospital, but then how do you transmit updates to the EMR's across different campuses of your hospital? How do you issue electronic prescriptions for patients to pick up at their home pharmacy? How do you handle off-site data backup?

Quite honestly, outside of defense applications I'm not aware of people building large air-gapped networks (and from experience, most defense networks aren't truly air-gapped any more, though I won't go into detail). Hospitals, power plants, dams, etc. all of them rely heavily on computers these days, and connect those over the regular internet.

1: My wife was the only pharmacist in her department last night whose computer was unaffected by Crowdstrike (for unknown reasons). She couldn't record her work in the normal ways, because the servers were Crowdstrike'd as well. So she spun up a document of her decisions and approvals, for later entry into the systems. It was over 70 pages long when she went off shift this morning. She's asleep right now.

2: https://www.bd.com/en-uk/products-and-solutions/products/pro...


First - drop the term "air-gapped" and replace it with "internet-gapped". It already has a name: "the LAN"... Now teach managers about the importance of the local network vs. the open/public/world net. Tell them the cloud costs more because someone is making a fortune or three on it!

TIP: many buildings can be part of one LAN! It's called a VPN, and Russia and China don't like it because it's good for people!

TIP: data can still be exchanged easily when needed, including over the LAN.

--

My wife is a hospital pharmacist. (1) When she gets a new prescription in, she needs to see the patient's charts in the electronic medical records, and then, if she approves the medication, a drawer in the Pyxis cabinet (2) will open up when a nurse scans the patient's barcode, allowing them to remove the medication. The nurse then scans the patient's barcode and the medication barcode in the patient's room to record that it was delivered at a certain time. Computers are everywhere in healthcare, because they need records and computers are great at record-keeping. All of those need networks to connect them, mostly on wifi (so the nurses' scanners can read things).

--

That was a description of a very local workflow...

It was a description of a data flow - there is no reason it should be monopolized by an insecure-by-design OS vendor whose product then has to be "secured" by what is essentially a kernel rootkit, i.e. OS hacking. Which contradicts using that OS in the first place!

And it looks like Crowdstrike is just the "if you have to ask the price, you can't have it" version of SELinux :>>> Credit to RH for two decades of presentations on why SELinux is necessary.

But overall, allowing automatic updates from a third party that has no clue about medicine onto hospital systems, etc., is criminal negligence by managers. Simple as that. The current state of the art? More negligence! Add (business) academia & co. to the chronic offenders. Call them what they truly are - sociopath training facilities.

> In theory you could build an air-gapped network within a hospital, but then how do you transmit updates to the EMRs across different campuses of your hospital?

How do you transmit to other campuses or other hospitals? Easy! Transfer the mandatory data. Please notice I used the words "mandatory" and "data". I did NOT say "use the mandatory HTTP stack to transfer data"! No, no, I'm far, faaar from even suggesting THAT! :>

> How do you issue electronic prescriptions for patients to pick up at their home pharmacy?

Hard sold on that "air-gapped and in a cage" meme, eh? Send them the required data via a secure and private method! Are the communication channels already "hacked" - monopolized - by FB? Obviously that should not have happened in the first place. So resolve it as part of un-Windows-ing critical civilian infra.

> How do you handle off-site data backup?

That one I don't get. Are you saying cloud access is the only possible way to have backups??? And the Internet is a must to do it?? Is the medical staff brain dead? Ah, no... It's just the managers... Again.

> Quite honestly, outside of defense applications I'm not aware of people building large air-gapped networks

DHCP, "super glue", and tons of other things were invented by the military for a reason, but those things proliferated to civilians anyway. For good reasons. Air-gapping should be much more common when a wifi signal allows tracking how you move around your own home. Not to mention GSM-based "technologies"...

There is an old saying: computers maximize doing. And where there is chaos, the computers simply do their work on it.


I think the critical systems here are often the ones that need to be connected to some network. Somebody up there mentioned how the MRI worked fine, but they still needed to get the results to the people who needed them. So the problem there was more doctor <-> doctor.


Yeah, our imaging devices were working fine, but with Epic down, you lose most of your communication between departments and your sole way of sharing radiology images and interpretations.


> Roslin: ...it tells people things like where the restroom is, and--

> Adama: It's an integrated computer network, and I will not have it aboard this ship.

> Roslin: I heard you're one of those people. You're actually afraid of computers.

> Adama: No, there are many computers on this ship. But they're not networked.

> Roslin: A computerized network would simply make it faster and easier for the teachers to be able to teach--

> Adama: Let me explain something to you. Many good men and women lost their lives aboard this ship because someone wanted a faster computer to make life easier. I'm sorry that I'm inconveniencing you or the teachers, but I will not allow a networked computerized system to be placed on this ship while I'm in command. Is that clear?

> Roslin: Yes, sir.

> Adama: Thank you. 'Scuse me.


and any critical weapons systems.

... at which point you will lose battles to enemies who have successfully networked their command and control operations. (For extra laughs, just wait until this is also true of AI.)

Ultimately there are just too darned many advantages to connecting, automating, and eventually 'autonomizing' everything in sight. It sucks when things don't go right, or when a single point of failure causes a black-swan event like this one, but in an environment where you're competing against either time or external adversaries, the alternatives are all worse.


Or the opposite: the enemy (or a third-party enemy who wasn't previously a combatant in the battle) hijacks your entire naval USV/UUV fleet & air force drone fleet using an advanced cyberattack, and suddenly your enemy's military force has almost doubled while yours is down to almost zero, and these hijacked machines are within your own lines.


Yes, the efficiency gains of remote automated administration and deployment make up for most outages that are caused by it.

A better approach is phased deployment, so you can see whether an update causes issues in your environment before pushing it to all systems. As this incident shows, you can't trust a software vendor to have done that themselves.
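A minimal sketch of what that phased rollout might look like - the ring names, percentages, and update id are made up for illustration, and the health check is a placeholder for real crash/boot-loop telemetry:

  # Ring-based rollout sketch: push to a small slice, soak, check health, widen.
  import time

  RINGS = [
      ("canary", 0.01),   # ~1% of the fleet
      ("early", 0.10),    # ~10%
      ("broad", 1.00),    # everyone else
  ]

  def ring_is_healthy(ring_name: str) -> bool:
      # Placeholder: in reality, query crash/boot-loop/agent-heartbeat telemetry here.
      return True

  def rollout(update_id: str) -> None:
      for ring_name, fraction in RINGS:
          print(f"pushing {update_id} to ring '{ring_name}' ({fraction:.0%} of fleet)")
          time.sleep(1)  # stand-in for a real soak period of hours or days
          if not ring_is_healthy(ring_name):
              print(f"anomalies in '{ring_name}', halting rollout and rolling back")
              return
      print("rollout complete")

  rollout("sensor-content-update-0042")  # hypothetical update id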


This wasn't a binary patch though, it was a configuration change that was fed to every device. Which raises a LOT of questions about how this could have happened and why it wasn't caught sooner.


Writing from the SRE side of the discipline, it's commonly a configuration change (or a "flag flip") that ultimately winds up causing an outage. All too seldom are configuration data considered part of the same deployable surface area (and, as a corollary, part of the same blast radius) as program text.

I've mostly resigned myself, these days, to deploying the configuration change and watching for anomalies in my monitoring for a number of hours or days afterward, but I acknowledge that I also have both a process supervisor that will happily let me crash-loop my programs and deployment infrastructure that will nonetheless allow me to roll things back. Without either of those, I'm honestly at a loss as to how I'd safely operate this product.


  # Update A

  ## config.ext

  {"foo": false}

  ## src.py

  import json

  def config(key):
      # load the deployable config file and look up one flag
      return json.load(open("config.ext"))[key]

  def work(x):
      return x

  if config('foo'):
      work(2 / 0)    # this branch never runs while foo is false
  else:
      work(10 / 5)
"Yep, we rigorously tested it."

  # Update B

  ## config.ext

  {"foo": true}
"It's just a config change, let's go live."


Yeah, that's about right.

The most insidious part of this is when there are entire swaths of infrastructure in place that circumvent the usual code review process in order to execute those configuration changes. Boolean flags like your `config('foo')` here are most common, but I've also seen nested dictionaries shoved through this way.
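A minimal sketch (the test name and CI wiring are my own, not from the comment) of one way to pull those flag flips back into the tested surface: have CI exercise every value the flag can take, so the cartoon above fails the build before "just a config change" ships:

  # Hypothetical CI check: run the gated code path for every flag value, so a
  # future flip of 'foo' has to pass the same tests a code change would.
  def work(x):
      return x

  def run(foo: bool):
      if foo:
          return work(2 / 0)   # the path Update B would enable
      return work(10 / 5)

  def test_all_flag_values():
      for foo in (False, True):
          run(foo)             # raises ZeroDivisionError for foo=True -> build fails

  test_all_flag_values()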


When I was at FB there were a load of SEVs caused by config changes, such that the repo itself would print out a huge warning about updating configs and show you how to do a canary to avoid this problem.


As in, there was no way to have configured the sensors to prevent this? They were just going to get this if they were connected to the internet? If I was an admin that would make me very angry.


This is the way it's done in the nuclear industry across the US for power and enrichment facilities. The operational/secure section of the plant is air-gapped, with hardware data diodes to let info out to engineers. Updates and data are sneakernetted in.


Not like hackers haven’t done the same.


At least hackers let people boot their machines, and some even have an automated way to restore the files after a payment. CS doesn't even do that. Hackers are looking better and more professional - if we're going to put them in the same bucket, that is.


The criminal crews have a reputation to uphold. If you don't deliver after payment, word gets around and soon enough nobody is going to pay them.

These security software vendors have found a wonderful tacit moat: they have managed to infect various questionnaire templates by being present in a short list of "pre-vetted and known" choices in a dropdown/radiobutton menu. If you select the sane option ("other"), you get to explain to technically inept bean counters why you did so.

Repeat that for every single regulator, client auditing team, insurance company, etc. ... and soon enough someone will decide it's easier and cheaper to pick an option that gets you through the blind-leading-the-blind question karaoke with less headaches.

Remember: vast majority of so-called security products are sold to people high up in the management chain, but they are inflicted upon their victims. The incentives are perverse, and the outcomes accordingly predictable.


> If you select the sane option ("other"), you get to explain to technically inept bean counters why you did so.

Tell them it’s for preserving diversity in the field.


Funnily enough, a bit of snark can help from time to time.

For anyone browsing the thread archive in the future: you can have that quip in your back pocket and use it verbally when having to discuss the bingo sheet results with someone competent. It's a good bit of extra material, but it can not[ß] be your sole reason. The term you do want to remember is "additional benefit".

The reasons you actually write down boil down to four things. High-level technical overview of your chosen solution. Threat model. Outcomes. And compensating controls. (As cringy as that sounds.)

If you can demonstrate that you UNDERSTAND the underlying problem, and consider each bingo sheet entry an attempt at tackling a symptom, you will be on firmer ground. Focusing on threat model and the desired outcomes helps to answer the question, "what exactly are you trying to protect yourself from, and why?"

ß: I face off with auditors and non-technical security people all the time. I used to face off with regulators in the past. In my experience, both groups respond to outcome-based risk modeling. But you have to be deeply technical to be able to dissect and explain their own questions back to them in terms that map to reality and the underlying technical details.


Nothing like this scale. These machines are fully blue-screened and completely inoperable.


The problem is concentration risk and incentives. Everyone is incentivized to follow the herd and buy Crowdstrike for EDR because of sentiment and network effects. You have to check the box, you have to be able to say you're defending against this risk (Evolve Bank had no EDR, for example), and you have to be able to defend your choice. You've now concentrated operational risk in one vendor, versus multiple competing vendors and products minimizing blast radius. No one ever got fired for buying Crowdstrike previously, and you will have an uphill climb internally attempting to argue that your org shouldn't pick what the bubble considers the best control.

With that said, Microsoft could've done this with Defender just as easily, so be mindful of system diversity in your business continuity and disaster recovery plans and enterprise architecture. Heterogeneous systems can have inherent benefits.


If you have a networked hybrid heterogeneous system, though, you now have a weakest-link issue, since lateral movement can happen once your weaker perimeter tool is breached.


A threat actor able to evade EDR and moving laterally or pivoting through your env should be an assumption you’ve planned for (we do). Defense in depth, layered controls. Systems, network, identity, etc. One control should never be the difference between success and failure.

https://apnews.com/article/tech-outage-crowdstrike-microsoft...

> “This is a function of the very homogenous technology that goes into the backbone of all of our IT infrastructure,” said Gregory Falco, an assistant professor of engineering at Cornell University. “What really causes this mess is that we rely on very few companies, and everybody uses the same folks, so everyone goes down at the same time.”


WannaCry did about the same damage to be honest. To pretty much the same systems.

The irony is the NHS likely installed CrowdStrike as a direct reaction to WannaCry.


The difference is that a malware infection is usually random and gradual. The CrowdStrike screwup hit everything at once with 100% lethality.


Computers hit by ransomware are also inoperable, and ransomware is wildly prevalent.


Yes, but computers get infected by ransomware randomly; Crowdstrike infected a large number of life-critical systems worldwide over time, and then struck them all down at the same time.


I'm not sure I agree, ransomware attacks against organizations are often targeted. They might not all happen on the same day, but it is even worse: an ongoing threat every day.


It's why it's not worse - an ongoing threat means only a small number of systems are affected at a time, and there is time to develop countermeasures. An attack on everything all at once is much more damaging, especially when it eliminates fallback options - like the hospital that can't divert its patients because every other hospital in the country is down too, and so is 911.


Ransomware that affects only individual computers does not get payouts outside of hitting extremely incompetent orgs.

If you want actually good payout, your crypto locker has to either encrypt network filesystems, or infect crucial core systems (domain controllers, database servers, the filers directly, etc).

Ransomware getting smarter about lateral movement, plus proper data-exfiltration attacks and the like, is part of what led to the proliferation of requirements for EDRs like Crowdstrike, btw.


Ransomware vendors at least try to avoid causing damage to critical infrastructure, or hitting way too many systems simultaneously - it's good neither for business nor for their prospects of staying alive and free.

But that's beside the point. The point is, attacks distributed over time and space ultimately make the overall system more resilient; an attack happening everywhere at once is what kills complex systems.

> Ransomware getting smarter about sideways movement, and proper data exfiltration etc attacks, are part of what led to proliferation of requirements for EDRs like Crowdstrike, btw

To use medical analogy, this is saying that the pathogens got smarter at moving around, the immune system got put on a hair trigger, leading to a cytokine storm caused by random chance, almost killing the patient. Well, hopefully our global infrastructure won't die. The ultimate problem here isn't pathogens (ransomware), but the oversensitive immune system (EDRs).


I want to agree with the point you're making, but WannaCry, to take one example, had an impact at roughly this scale.


I think recovering from this incident will be more straightforward than WannaCry.

At large-scale, you don’t solve problems, you only replace them with smaller ones.


Not like the security software has ever stopped it.


A lot of security software - ranging from properly used EDRs like Crowdstrike to something as simple as setting a few rules in Windows File Server Resource Manager - has foiled many ransomware attacks, at the very least.


I'm guessing hundreds of billions if you could somehow add it all up.

I can't believe they pushed updates to 100% of Windows machines and somehow didn't notice a reboot loop. Epic gross negligence. Are their employees really this incompetent? It's unbelievable.

I wonder where MSFT and Crowdstrike are most vulnerable to lawsuits?


This outage seems to be the natural result of dropping QA - performed by a team separate from the (always optimistic) dev team - as a mandatory step for extremely important changes, and of neglecting canary-type validations. The big question is whether businesses will migrate away from such a visibly incompetent organization. (Note I blame the overall org; I am sure talented individuals tried their best inside a set of procedures that asked for trouble.)


So there was apparently an Azure outage prior to this big one. One thing that is a pretty common pattern in my company when there are big outages is something like this:

1. Problem A happens, it’s pretty bad

2. A fix is rushed out very quickly for problem A. It is not given the usual amount of scrutiny, because Problem A needs to be fixed urgently.

3. The fix for Problem A ends up causing Problem B, which is a much bigger problem.

tl;dr don’t rush your hotfixes through and cut corners in the process, this often leads to more pain


If you've ever been forced to use a PC with Crowdstrike, it's not amazing at all. I'm amazed an incident of this scale didn't happen earlier.

Everything about it reeks of incompetence and gross negligence.

It's the old story of the user and the purchaser being different parties - the software only needs to be good enough to be sold to third parties who never need to use it.

It’s a half-baked rootkit part of performative cyberdefence theatrics.


> It’s a half-baked rootkit part of performative cyberdefence theatrics.

That describes most of the space, IMO. In a similar vein, SOC2 compliance is bullshit. The auditors lack the technical acumen – or financial incentive – to actually validate your findings. Unless you’re blatantly missing something on their checklist, you’ll pass.


From an enterprise software vendor perspective, cyber checklists feel like a form of regulatory capture. Someone looking to sell something gets a standard or best practice created and added to the checklists, and everyone is forced to comply, regardless of the context.

Any exception made to this checklist is reviewed by third parties that couldn't care less, bean counters, or those technically incapable of understanding the nuance, leaving only the large providers able to compete on the playing field they manufactured.


This will go on for multiple days, but hundreds of billions would be >$36 trillion annualized if it was that much damage for one day. World annual GDP is $100 trillion.


Their terms of use undoubtedly disclaim any warranty, fitness for purpose, or liability for any direct or incidental consequences of using their product.

I am LMFAO at the entire situation. Somewhere, George Carlin is smiling.


MSFT doesn’t recommend or specify CrowdStrike


I wonder if companies are incentivized to buy Crowdstrike because of Crowdstrike's warranty that will allegedly reimburse you if you suffer monetary damage from a security incident while paying for Crowdstrike.


If such a warranty exists, the real question will be how Crowdstrike tries to spin this as a non-security incident.


The CEO says it isn't and we believe them apparently


There must be an incentive. Because from a security perspective, bringing in a third party on a platform (Microsoft) to do a job the platform already does is literally the definition of opening up holes in your security. Completely batshit crazy; the salesmen for these products should hang their heads in shame. It's just straight-up bad practice. I'm astounded it's so widespread.


Insurance companies recommended them


Same people who destroyed a US bridge recently.

This is the result of giving away US jobs overseas at 1/10th the salary


Do you have some more details?


I saw one of the surgery videos recently. The doctor was saying, "Alexa, turn on suction." It boggled my mind. There could be so many points of failure.


Fwiw this is not typical; we don't have Alexa/Siri-type smart devices in any OR I work in, and suction is turned on and off with a button and a dial.


It's in a Maryland clinic doing plastic surgery.

Edit: Found it. https://www.youtube.com/watch?v=nS9nLvGMLH0&t=947s


ALEXA, TURN OFF THE SUCTION! ALEXA!!

“Loive from NPR news in Washington“


I don't suppose there was a doctor or nurse named Alexa involved?


Not to be that guy, but I often say software engineering as a field should have harsher standards of quality and certainly liability for things like this. You know, like civil engineers, electrical engineers, and most people whose work could kill people if done wrong.

Usually when I write this, devs get all defensive and ask me what the worst thing is that could happen... I don't know... Could you guarantee it doesn't involve people dying?

Dear colleagues, software is great because one persons work multiplies. But it is also a damn fucking huge responsibility to ensure you are not inserting bullshit into the multiplication.


Some countries, such as Canada, have taken minor steps towards this, for example making it illegal to call oneself a software engineer unless you are certified by the province's professional engineering body. However, this is still missing a lot. I also don't wish to be "that guy", but I'll go further and say that the US is really holding this back by not similarly making it illegal to use Software Engineer as a title without holding a P.Eng.

If we can at least get that basis, then we can start to define more things, such as jobs that non-engineers cannot legally do, and legal ramifications for things such as software bugs. If someone will lose their professional license and potentially their career over shipping a large enough bug, suddenly the problem of having 25,000 npm dependencies and continuous deployment breaking things at any moment will magically cease to exist quite quickly.


I'd go a step further and say software engineering as a field is not respected at the same level as those certified/credentialed engineering disciplines, because of this lack of standards and liability. That leads to common occurrences of systemic, destructive failures such as this one, because organization-level direction is very lax in dealing with the potential for software failure.


I don't know, I get paid more than most of my licensed engineer friends. That's the only respect that really matters to me. Not saying there might not be other advantages to a professional organization for software.


I feel the same way but do agree there’s a general lack of respect for the field relative to other professions. Here’s another thread on the subject https://news.ycombinator.com/item?id=23676651


Respect has to be earned.


I believe instances like this will push people to reconsider the lax stance. Humans in general have a hard time regulating something abstract. The fact that people can be killed has been well known since the '80s; see https://en.wikipedia.org/wiki/Therac-25


I once worked on some software that generated PDFs of lab reports for drug companies monitoring clinical trials. These reports had been tested, but not exhaustively.

We got a new requirement to give doctors access to print them on demand. Before this, doctors only read dot-matrix-printed reports that had been vetted for decades. With our XSL-FO PDF generator, it was possible for a column to be pushed outside the print boundary, leading a doctor to see 0.9 as 0. I assume that in a worst-case scenario this could lead to a misdiagnosis, an unnecessary intervention, and even a patient's death.

I was the only one in the company who cared about doing a ton more testing before we opened the reports to doctors. I had to fight hard for it, then I had to do all the work to come up with every possible lab report scenario and test it. I just couldn't stand the idea that someone might die or be seriously hurt by my software.

Imagine how many times one developer doesn't stand up in that scenario.


This is why I made that point, similar to you I would not stand for having my code in something that I can't stand behind, especially if it potentially harms people.

But it should not hinge on us convincing people.


I'd endorse this. That way when my hypothetical PHB wants to know why something is taking so long I can say "See this part? Someone could die if we don't refactor it."


Related talk by Alan Kay: https://youtu.be/D43PlUr1x_E


It’s important not to disregard that software engineers are often overruled by management or product when strict deadlines and targets exist.


"If only we asked harder problems for our leetcode interview!"


And how many lives lost?


It's honestly terrifying that someone would opt for Windows in systems critical to medical emergencies.

I hope organisations start revisiting some of these insane decisions.


Not my story to tell, so I'm relaying it. A childhood friend works for a big company - you've heard their name - that makes control systems for nuclear reactors; they have products out in the field that they support, and there are new reactors in parts of the world from time to time. We were scheduled to have lunch a couple of years back and he bailed; we rescheduled, and he bailed again because that was the day you couldn't defer XP updates anymore - they came in and some XP systems became Windows 10. XP was "nuclear reactor approved" by someone, and they had a toolchain that didn't work right on other versions of Windows. It all gave me chills.

They ended up giving MS a substantial amount of money to extend support for their use case for some number of years. I can't remember the number he told me but it was extremely large.


If it's not connected to the internet, who cares?


It sounds like he said XP machines auto-updated to Windows 10, and they would have had to have been connected to the internet in order to download that update. (I'm assuming, optimistically, that these were more remote-control computers than actual nuclear devices.)


Eh. There are a great many problems that could befall a medical emergency system that are unrelated to the OS. Like power loss. I think the core problem here really is a lack of redundancy.


I've had updates break Linux machines.

Just a few weeks ago I had an OpenBSD box render itself completely unbootable after nothing more than a routine clean shutdown. Turns out their paranoid-idiotic "we re-link the kernel on every boot" scheme, coupled with their house-of-cards file system, corrupted the kernel, then overwrote the backup copy when I booted from emergency media - which doesn't create device nodes by default, so it can't even mount the internal disks without more cryptic commands.

Give me the Windows box, please.


Counter anecdote: I’ve been using Linux for 20 years, nearly half of that professionally. The only time I’ve broken a Linux box where it wasn’t functional was mixing Debian unstable with stable, and I was still able to fix it.

I’ve had hardware stop working because I updated the kernel without checking if it removed support, but a. that’s easily reversible b. Linux kept working fine, as expected.

I’ll also point out, as I’m sure you know, that the BSDs are not Linux.


Funny, I broke my Debian twice (on two separate laptops) by doing exactly that, mixing stable with testing. I was kind of obliged to use "testing" because the Dell XPS was missing critical drivers.

I switched to openSUSE afterwards.


In fairness, this is the number one way listed [0] on how to break Debian. That said, if you need testing (which isn’t that uncommon for personal use; Debian is slow to roll out changes, favoring stability), then running pure Sid is actually a viable option. It’s quite stable, despite its name.

[0]: https://wiki.debian.org/DontBreakDebian


you are comparing a broken bicycle to a trainwreck


some critical software has DRM that only works in Windows


"Took down our entire emergency department as we were treating a heart attack. 911 down for our state too."

Why would Windows systems be anywhere near critical infra ?

Heart attacks and 911 are not things you build with Windows based systems.

We understood this 25 years ago.


I do not think Windows is the problem here. The problem is critical-infrastructure equipment being connected to the internet, imo. There is little reason for a lot of computers in some settings to be connected to the internet, except for convenience or negligence. If data transfer needs to be done, it can happen through another computer. Some systems should exist on a (more or less) isolated network at best. Too often we do not really understand the risk of a device being connected to the internet until something like this happens.


You have no idea how a hospital or modern medicine works. It needs to be online.


Why would a machine that is required for an MRI machine to work (one of the examples given in the thread here) need to be online? I understand the argument about logging, though even then I think it is too risky. Do all these machines _really_ need to be online, or has nobody bothered to change this after all the times something happened - or, even worse, do software companies profit in certain ways and not want to change their models? Can we imagine no other way to do things apart from connecting everything to some server somewhere?


MRI readouts are 3D, so they can't be printed for analysis. They are gigabytes in size, and the units are usually in a different part of the building. So you could sneakernet CDs every time an MRI is done, then sneakernet the results back. Or you could batch it, and then analysis is done slowly and all at once. OR you could connect it to a central server so results/analysis can be available instantly.

Smarter people than us have already thought through this and the cost-benefit analysis said "connect it to a server"


So in that case you set up a NAS server that it can push the reports to, and everything else is firewalled off.
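A minimal sketch of that one-way export, assuming a made-up local export folder and NAS share (the host firewall that blocks everything else isn't shown):

  # Imaging workstation pushes finished studies to a single NAS share; it never pulls.
  import shutil
  from pathlib import Path

  LOCAL_OUT = Path(r"C:\mri\finished_studies")  # hypothetical local export folder
  NAS_DROP = Path(r"\\nas01\radiology\inbox")   # hypothetical NAS share

  def push_new_studies() -> None:
      for study in LOCAL_OUT.glob("*.zip"):
          target = NAS_DROP / study.name
          if not target.exists():
              shutil.copy2(study, target)  # copy only; no inbound traffic needed
              print(f"pushed {study.name}")

  push_new_studies()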

It's just laziness, and to be honest, an outage like this has no impact on their management's reputation, since a lot of other poorly run companies and institutions were also impacted, so the focus is on Crowdstrike and Azure, not on them.


I admit I'm not a medical professional, but these sound like problems with better solutions than lots of internet-connected terminals that can be taken down by EDR software.

Why not an internal-only network for all the terminals to talk to a central server, and then disable any other networking for the terminals? Why do those terminals need a browser, which is where pretty much any malware is going to enter from? If hospitals are paying out the ass for their management software from Epic et al., they should be getting something with a secure design. If the central server is the only thing that can be compromised, then when EDR takes it down you at least still have all your other systems, presumably with cached data to work from.


Ever heard of a LAN? You don't need internet access for every single machine.


Many X-Rays (MRIs, CT scans, etc.) are read and interpreted by doctors who are remote. There are firms who that's all they do - provide a way to connect radiologists and hospitals, and handle the usual business back-end work of billing, HR, and so on. Search for "teleradiology"

Same goes for electronic medical records. There are people who assign ICD-10 codes (insurance billing codes) to patient encounters. Often this is a second job for them and they work remote and typically at odd hours.

A modern hospital cannot operate without internet access. Even a medical practice with a single doctor needs it these days so they can file insurance claims, access medical records from referred patients and all the other myriad reasons we use the internet today.


Okay, so (as mentioned elsewhere in this thread), connect the offline box to an online NAS with the tightest security between the two humanly possible. You can get the relevant data out to those who need it.

This stuff isn't impossible to solve. Rather, the incentives just aren't there. People would rather build an apparatus for blame-shifting than actually build a better solution.


Do you think everyone involved is physically present? The GP was absolutely accurate that you guys have no idea how modern healthcare works, and this had nothing to do with externally introduced malware.


This sounds a bit like someone just got run over by a truck because the driver couldn't see them, so people ask why trucks are so big that they're dangerous, and the response is "you just don't know how trucks work" rather than "yeah, maybe drivers should be able to see pedestrians".

If modern medicine is dangerous and fragile because of network connected equipment then that should be fixed even if the way it currently works doesn’t allow it.


This is a completely different discussion. They absolutely should be reliable. The part that is a complete non-starter is not being networked, because it ignores that telemedicine, PACS integration, and telerobotics exist.

If you don't understand why it has to be networked - with paper as an extremely bad fallback - then I suggest working in healthcare for a bit before pontificating on how everything should just go back to the stone age.


Networking puts their reliability at risk - as shown here, and as shown in ransomware cases. It is not the first time something like this has happened.

The question is not whether hospitals need internet at all, or whether to go back to printing things on paper, or whatever - nobody ever said that. The question is whether everything in the hospital should be connected to the internet. Again, the example used was simple: having the computer that processes and exports the data from an MRI machine connected online in order to transfer the data, vs. using a separate computer to transfer the data while the first computer stays offline. This is how we are supposed to transfer similar data at my work, for security reasons. I am not sure why it can't happen there. If you cannot transfer data through that computer, there could be an emergency backup plan. But then you only need to solve the data-transfer part, not everything.


even the most secure outbound protection would likely whitelist the CrowdStrike update servers because they'd be considered part of the infrastructure


You don't print the images an MRI produces; you transmit them to the people who can interpret them, and they are almost never in the same room as the big machine - sometimes they need to be called up in a different office altogether.


The comment [0] mentioned that they could not get at the MRI outputs at all, even with the radiologist coming on site. Obviously, the software that was processing/exporting the data was running on a computer that was connected to the internet, if it did not require an internet connection itself. Data transfer can happen from a different computer than the one where the data is processed/obtained. Less convenient, but this is common practice in many other places for security and other reasons.

[0] https://news.ycombinator.com/item?id=41009018


I mean, this is incentivized by current monetization models. Remove the need to go through a payment based aaS infra, and all the libraries to do the data visualization could be running on the MRI dude's PC.

-aaS by definition requires you to open yourself to someone else to let them do the work for you. It doesn't empower you, it empowers them.


Yeah, I suspect -aaS monetisation models are one of the reasons for the current everything-on-the-internet mess. However, such software running on the machine and using a hardware USB key for authentication is not unheard of either. I wish decisions on these subjects were made based on the specific needs of the users rather than on the finance people of -aaS companies.


Our critical devices were fine. But Epic and all of our machines were down. How do you transmit radiology images without Epic?


Is that an ironic question? Or a serious one? I fail to detect the presence or absence of irony sometimes online. I just hope that my own healthcare system has some backup plan for how to do day-to-day operations, like transferring my scan results to a specialist, in case the system they normally use fails.


"It needs to be online."

No, it doesn't.

Some have chosen - for reasons of efficiency and scale and cost - to place it online.

However, this is a trade-off for fragility.

It's not insane to make this trade-off ...

... but it is insane to not realize one is making it.


It seems like you’ve never worked with critical infra. Most of it runs on 6 to 10 year old unpatched versions of Windows…


"It seems like you’ve never worked with critical infra."

My entire career has been spent building, and maintaining, critical infra.[1]

Further, in my volunteer time, I come into contact with medical, dispatch and life-safety systems and equipment built on Windows and my question remains the same:

Why is Windows anywhere near critical infra ?

Just because it is common doesn't mean it's any less shameful and inadequate.

I repeat: We've fully understood these risks and frailties for 25 years.

[1] As a craft, and a passion - not because of "exciting career opportunities in IT".


Is this the rsync.net HN account? If so, lmao @ the comment you replied to.

> As a craft, and a passion

I believe you've nailed the core problem. Many people in tech are not in it because they genuinely love it, do it in their off time, and so on. Companies, doubly so. I get it, you have to make money, but IME there is a WORLD of difference in ability and self-directed problem-solving between those who love this shit and those who just do it for the money.

What’s worse is that actual fundamental knowledge is being lost. I’ve tried at multiple companies to shift DBs off of RDS / Aurora and onto at the very least, EC2s.

“We don’t have the personnel to support that.”

“Me. I do this at home, for fun. I have a rack. I run ZFS. Literally everything in this RFC, I know how to do.”

“Well, we don’t have anyone else.”

And that’s the damn tragedy. I can count on one hand the number of people I know with a homelab who are doing anything other than storing media. But you try telling people that they should know how to administer Linux before they know how to administer a K8s cluster, and they look at you like you’re an idiot.


The old-school sysadmins who know technology well are still around, but there are increasingly fewer of them, while demand skyrockets as our species gives computers an increasing number of responsibilities.

There is tremendous demand for technology that works well and works reliably. Sure, setting up a database running on an EC2 instance is easy. But do you know all of the settings to make the db safe to access? Do you maintain it well, patch it, replicate it, etc? This can all be done by one of the old school sysadmins. But they are rare to find, and not easy to replace. It's hard to judge from the outside, even if you are an expert in the field.

So when the job market doesn't have enough sysadmins/devops engineers available, the cloud offers a good replacement. Even if you as an individual company can solve it by offering more money and having a tougher selection process, that doesn't scale across the entire field, because at that point you run up against the total number of available experts.

Aurora is definitely expensive, but there are cheaper alternatives to it. Full disclosure: I'm employed by one of these alternative vendors (Neon). You don't have to use it, but many people do and it makes their life easier. The market is expected to grow a lot. Clouds seem to be one of the ways our industry is standardizing.


I’m not even a sysadmin, I just learned how to do stuff in Gentoo in the early ‘00s. Undoubtedly there are graybeards who will laugh at the ease of tooling that was available to me.

> But do you know all of the settings to make the db safe to access? Do you maintain it well, patch it, replicate it, etc?

Yes, but to be fair, I'm a DBRE (and SRE before that). I'm not advocating that someone without fairly deep knowledge attempt to do this in prod at a company of decent size. But your tiny startup? Absolutely; chuck a default install of Postgres or MySQL onto Debian, and optionally tune 2 – 3 settings (shared_buffers, effective_cache_size, and random_page_cost for Postgres; innodb_buffer_pool_* and sync_array_size for MySQL – the latter isn't necessary until you have high concurrency, but it also can't be changed without a restart, so you may as well). Pick any major backup solution for your DB (Barman for Postgres, XtraBackup for MySQL, etc.), and TEST YOUR BACKUPS. That's about it. Apply any security patches (or use unattended-upgrades, just be careful) as they're released, and don't do anything outside of your distro's package management. You'll be fine.
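A minimal sketch of the Postgres side of that tuning, using common rules of thumb rather than anything from the comment above (the RAM size and the 25%/75%/SSD assumptions are mine; review the output before applying it via psql):

  # Compute starter values for the three settings mentioned above and print
  # ALTER SYSTEM statements for review. Rules of thumb, not gospel.
  ram_gb = 16  # assumed machine size

  settings = {
      "shared_buffers": f"{ram_gb // 4}GB",            # ~25% of RAM is a common starting point
      "effective_cache_size": f"{ram_gb * 3 // 4}GB",  # ~75% of RAM; planner hint only
      "random_page_cost": "1.1",                       # assumes SSD storage
  }

  for name, value in settings.items():
      print(f"ALTER SYSTEM SET {name} = '{value}';")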

Re: Neon, I’ve not used it, but I’ve read your docs extensively. It’s the most interesting Postgres-aaS product I’ve seen, alongside postgres.ai, but you’re (I think) targeting slightly different audiences. I wish you luck!


> It’s the most interesting Postgres-aaS product I’ve seen, alongside postgres.ai, but you’re (I think) targeting slightly different audiences. I wish you luck!

This is always great feedback to hear, thank you!


Also, a lot of the passionate security people, such as myself, moved on to other fields, as it has just become bullshit artists sucking on the vendors' teat and filling out risk-matrix sheets, with no accountability when their risk assessments invariably turn out to be wrong.


That reminds me, I should check Twitter to see the most recent batch of “cybersecurity experts” take on Crowdstrike. Always a good time.


*raises hand* You guys hiring? I'll be proof that there is indeed "anyone else."


Not saying they're sufficient reasons but ..

1. more Windows programmers than Linux so they're cheaper.

2. more third-party software for e.g. reporting, graphing to integrate with

3. no one got fired for buying Microsoft

4. any PC can run Windows; IT departments like that.


My comment was tongue-in-cheek; of course it should not be this way, but as you know, it oftentimes is.


In the past, old versions of Windows were often considered superior because they stopped changing and just kept working. Today, that strategy is breaking down because attackers have a lot more technology available to them: a huge database of exploits, faster computers, IoT botnets, and so on. I suspect we're going to see a shift in the type of operating system hospitals run. It might be Linux or a more hardened version of Windows. Either way, the OS vendor should provide all security infrastructure, not a third party like Crowdstrike, IMHO.


> I suspect we're going to see a shift in the type of operating system hospitals run. It might be Linux or a more hardened version of Windows.

Why? "Hardening" the OS is exactly what Crowdstrike sells - and exactly what bricked the machines.

Centralization is the root cause here. There should be no by-design way for this to happen. That also rules out Microsoft's auto-updates. Only the IT department should be able to brick the hospital's machines.


Hardening is absolutely not what Crowdstrike sells. They essentially sell OS monitoring and anomaly detection. OS hardening involves minimizing the attack surface, usually by minimizing the number of services running and limiting the ability to modify the OS.


Nothing wrong with that. Windows XP x64 supports up to 128 GB of physical RAM; it could be 5 years until that is available on laptops. Windows 7 Pro supports up to 192 GB of RAM. Now, if you were to ask me what you would run on those systems with maxed-out RAM, I wouldn't know. I also don't think the Excel version that runs on those versions of Windows allows partially filled cells for Gantt charts.


> Most of it runs on 6 to 10 year old unpatched versions of Windows…

Well, that's a pretty big problem. I don't know how we ended up in a situation where everybody is okay with the most important software being the most insecure, but the money needed to keep critical infra totally secure is clearly less than the money (and lives!) lost when the infra crashes.


Well, you can use stupid broken software with any OS, not just Windows. Isn't CrowdStrike Falcon available on Linux? Is there any reason why they couldn't have introduced a similar bug, with similar consequences, there?


None. There are a bunch of folks here who clearly haven't spent a day in enterprise IT proclaiming Linux would've saved the day. 30 seconds of research would've led them to discover Crowdstrike also runs on Linux and has created similar problems on Linux in the past.


Oh, could you link me the source for the claim that all Linux clients of Crowdstrike went down all at once? I'm very interested to hear it.


No it couldn't. Crowdstrike on Linux uses eBPF and therefore can't cause a kernel panic (which is the fundamental issue here).



It's even better when you get told about the magical superiority of Apple for that...

... except Apple pretty much pushes you to run such tools just to get reasonable management, let alone things like real-time integrity monitoring of important files (Crowdstrike at $DAYJOB[-1] is how security knew to ask whether it was me or something else that edited the PAM config for sudo on a corporate Mac).


Enterprise Mac always follows the same pattern: users proclaim its superiority while it's off the radar, then it gets McAfee, Carbon Black, Airlock, and a bunch of other garbage tooling installed and runs as poorly as enterprise Windows.

The best corporate dev platform at the moment is WSL2 - most of the activity inside the WSL2 VM isn't monitored by the Windows tooling, so performance is fast. Eventually security will start to mandate agents inside the WSL2 instance, but at the moment most orgs don't.


> Why would Windows systems be anywhere near critical infra ?

This is just a guess, but maybe the client machines are windows. So maybe there are servers connected to phone lines or medical equipment, but the doctors and EMS are looking at the data on windows machines.


> Why would Windows systems be anywhere near critical infra ?

maybe Heartbleed or the xz-utils debacle convinced them to switch.


Because Windows is accessible, while Linux requires uncommon expertise and a short-term cost that is just not practical for lots of places.

Good luck teaching administrators an entirely new ecosystem; good luck finding off-the-shelf software for Linux.

Bespoke is expensive, expertise is rare, and Linux is sadly niche.


No. The problem isn't expertise — it's CIOs that started their careers in the 1990s and haven't kept up with the times. I had to explain why we wanted PostgreSQL instead of MS SQL Server. I shouldn't have to have that conversation with an executive who should theoretically be a highly experienced expert. We also have CIOs that have MBAs but no actual background in software. (I happen to have an MBA, but I also have 15+ years of development experience.) My point is CIOs generally know "business" and they know how to listen to pitches from "Enterprise" software companies — but they don't actually have real-world experience using the stuff they're forcing upon the org.

I recently did a project with a company that wanted to move their app to Azure from AWS — not for any good technical reason but just because “we already use Microsoft everywhere else.”

Completely stupid. S3 and Azure Blob don’t work the same way. MCS and AWS SES also don’t work the same way — but we made the switch not even for reasons of money, but because some Microsoft salesman convinced the CIO that their solution was better. Similar to why many Jira orgs force Bitbucket on developers — they listen to vendors rather than the people that have to use this stuff.


> I had to explain why we wanted PostgreSQL instead of MS SQL server.

Tbf, you are giving up a clustering index in that trade. May or may not matter for your workload, but it’s a remarkably different storage strategy that can result in massive performance differences. But also, you could have the same by shifting to MySQL, sooooo…


That’s so infuriating. But, while the people in your story sound dumb, they still sound way more technically literate than 95% of society. Azure is blue, AWS is followed by OME.

Teach a 60 year old industrial powertrain salesman to use Linux and to redevelop their 20 year old business software for a different platform.

Also explain why it’s worth spending food, house, and truck money on it.

Finally, local IT companies are often incompetent. You get entire towns worth of government and business managed by a handful of complacent, incompetent local IT companies. This is a ridiculously common scenario. It totally sucks, and it’s just how it is.


Are. You. Kidding.

Windows servers are “niche” compared to Linux servers. Command line knowledge is not “uncommon expertise,” it’s imo the bare minimum for working in tech.


Most businesses aren’t working in tech.

I’m not wildly opinionated here, I should clarify. I’d love a more Linux-y world. I’m just saying that a lot of small-medium towns, and small-medium businesses are really just getting by with what they know. And really, Windows can be fine. Usually, however, you get people who don’t understand tech, who can barely use a Windows PC, nevermind Linux, and don’t really have the budget to rebuild their entire tech ecosystem or the knowledge to inform that decision. It sucks, but it’s how it is.

Also, Open Office blows chunks. Business users use Windows. M365 is easy to get going, email is relatively hands-off, deliverability is abstracted. Also, a LOT of business software is Windows exclusive. And that also blows chunks.

I would LOVE a more open source, security minded, bespoke world! It’s just not the way it is right now.


> Why would Windows systems be anywhere near critical infra ?

Why would computers be anywhere near critical infra? This sounds like something that should fail safe: the control system goes down, but the thing keeps running. If power goes down, hospitals have generator backups; it seems weird that computers would not be in the same situation.


I mean, not just dollars but lives also, right? Do we have a way to track that?


Yup through electronic medical records... o wait


What's the NASDAQ ticker for lives?


> Hard to imagine how many millions if not billions of dollars this one bad update caused.

I mean, if the problem is that hospitals can't function anymore, money is hardly the biggest problem


[flagged]


Without access to Epic we can't place med orders, look up patient records, discharge patients from the hospital, enter them into our system, really much of anything. Every provider in the emergency department is on their computer placing orders and doing work when not interacting with a patient. Like most hospitals in this country, our entire workflow depends on Epic. We couldn't even run blood tests because the lab was down too.

The STEMI was stabilized, it's more that it was scary to lose every machine in the department at once while intubating a crashing patient. You're flying blind in a lot of ways.


If the computer system was down and medicine was needed to save a life, would some protocol dictate grabbing the medicine and dealing with the paperwork or consequences later? And if protocol didn't allow for that, would staff start breaking protocol to save a life?


You can skip paperwork but what if the patient is allergic to a medicine and you need to check medical records? Or you need to call for a surgeon but VoIP is down? Etc…


My father's coworker died while in the hospital for observation after a few scratches from a car accident, because they were accidentally given a medication they were allergic to.

So, yeah. The paperwork can save lives too; it's not all just red tape.

Otherwise you may go to the hospital to pick up your friend and be told to wait for the coroner.


> Surely none of the medical devices needed to treat a heart attack are Windows PCs connected to the internet?

Wouldn't that be nice


I'm guessing they were being treated over the phone as the systems went down. I've been through a similar situation; the person on the phone gives step-by-step instructions while waiting for an ambulance to arrive.

It sounds like, with the systems being down, the call would have been cut off, which sounds horrible.


No, treating in person. But we can't function as a department without computers. You call cardiology (on another floor) and none of their computers are working, so they can't review the patient's records. You could take the EKG printout and run it to them, but we're just telling them lab results from what we can remember before our machines all bluescreened. The lab's computers were down, so we can't do blood tests. Nursing staff know what to do next by looking at the board or their computer. Without that, you're just a room full of people shouting things at each other, and you definitely can't see the 3-4x patients an hour you're expected to. Doctors and midlevels rely on Epic to place med orders too.


[flagged]


It's against the site guidelines to post like this, and we have to ban accounts that to it repeatedly, so if you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


May I say that starting from "treating a heart attack" and ending up worrying about millions lost in productivity sounds a bit "wrong"?


I just had a ten hour hospital shift from hell, apologies if my writing is lacking. I can't think of a better way to try to measure the scope of the damage caused by this.


Just completed a standing 24 due to this outage. My B-Shift brothers and sisters had to monitor the radios all night for their units to be called for emergencies. I heard every dispatch that went out.

We were back in the 1960s with paper and pen for everything: no updates on the nature of the call, no address information, nothing... find out when you show up and hope the scene is secure. It was wild, as it was coupled with a relatively intense monsoon storm.


Starting with an ER story kind of set up the expectation that you'll be "measuring the scope of the damage" in lives lost, not dollars. Though I guess at large enough scale, they're convertible.

Regardless, thanks for your report; seeing it was very sobering. I hope you can get some rest, and that things will soon return to normalcy.


A tiny bit of thought about your situation should, IMO, lead anyone to conclude that you experienced the fallout of today's nightmare first-hand, then took a step back, realized you were likely one of millions if not billions of other people experiencing the same, and relayed that thought in terms of an immediately understandable loss. Someone else might see "wrong", but I saw empathy.


Sorry to hear this! I'm a journalist covering this mess and wondering if we could talk. Am at sarah.needleman@wsj.com


Take care of yourself. You're making the world a better place. You deserve better supportive technology, not this shit show.


Billions in losses means a somewhat worse life for a huge number of people and potentially much worse healthcare problems down the line; the NHS was affected.


When it comes to measuring the impact to society at scale, dollars is really the only useful common proxy. One can't enumerate every impact this is going to have on the world today -- there's too many.


Bullshit. Absolute bullshit.

I've told my testers for years their efficacy at their jobs would be measured in unnecessary deaths prevented. Nothing less. Exactly this outcome was something I've made unequivocally clear was possible, and came bundled with a cost in lives. Yet the "Management and bean counter types" insist "Oh, nope. Only the greenbacks matter. It's the only measure."

Bull. Shit. If we weren't so obsessed with the imaginary value attached to little green strips of paper, maybe we'd have the systems we need so things like this wouldn't happen. You may not be able to enumerate every impact, but you damn well can enumerate enough. Y'all just don't want to, because then work starts looking like work.


Why measure only death, as if it is the only terrible thing that can happen to someone?

That doesn’t count serious bodily injury, suffering, people who were victimized, people who had their lives set back for decades due to a missed opportunity, a person who missed the last chance to visit a loved one, etc.

There are uncountable different impacts when you're talking about events on the scale of an economy. Which is why economists use dollars. The proxy isn't useful because it is more important than life; it is useful because the diversity of human experience is innumerable.


I understand your emotion but perhaps people simply don't value human lives.

At least putting a number to life is a genuine attempt, even though it may be distasteful.

The fact is that there already is a number on it, which one can derive entirely descriptively without making moral judgements. Insurance companies and government social security offices already attempt to determine the number.

The number is not infinite or we'd have no cars.


[flagged]


"Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."

https://news.ycombinator.com/newsguidelines.html

https://news.ycombinator.com/item?id=41005274


Millions lost is sizeable parts of people's lives they won't get back.


> "Took down our entire emergency department as we were treating a heart attack."

Not questioning that it happened, but this was a boot loop after a content update. So if the computers were off and didn't get the update, and you booted them, they would be fine. And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.

How did it happen that you were rebooting in the middle of treating a heart attack? [Edit: BSOD -> auto reboot]


Beyond the BSOD that happened in this case, in general this is not true with Windows:

> And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.

Windows has been notorious for forcing updates down your throat, and rebooting at the least appropriate moments (like during time-sensitive presentations, because that's when you stepped away from the keyboard for 5 minutes to set up the projector). And that's in a private setting. In a corporate setting, the IT department is likely setting up an even more aggressive and less workaround-able reboot schedule.

Things like this are exactly why people hate auto-updates.


Windows Update has nothing to do with it.


But it has created a culture of everything needing to be kept up to date all the time no matter what, and pulling control of those updates out of your own hands into the provider's.


True, especially when a reboot of Windows takes several minutes because it started auto-applying updates!


How do you propose ensuring critical security updates get deployed then?

Especially if an infected machine can attack others?

Users/IT would regularly never update or deploy patches, which has its own consequences. There's no perfect solution, only a choice of where to accept the pain.

It’s a lot like herd immunity in vaccines.


> It’s a lot like herd immunity in vaccines.

Yes. But you don't deploy experimental vaccines simultaneously across the entire population all at once. Inoculating an entire country takes months; the logistics incidentally provide protection against unforeseen immediate-term dangerous side effects. Without that delay, well, every now and then you'd kill half the population with a bad vaccine. The equivalent of what's happening now with CrowdStrike.
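A minimal sketch of what staged-rollout bucketing can look like (ring names and percentages here are invented for illustration, not anyone's actual release process):

    import hashlib

    # Hypothetical rollout rings: each later stage includes the earlier ones.
    RINGS = [
        ("canary",   0.01),   # 1% of the fleet first
        ("early",    0.10),   # then 10%
        ("broad",    0.50),   # then half
        ("everyone", 1.00),   # full fleet last
    ]

    def ring_for(host_id: str) -> str:
        # Deterministically bucket a host into a ring by hashing its ID.
        h = int(hashlib.sha256(host_id.encode()).hexdigest(), 16)
        bucket = (h % 10_000) / 10_000          # stable value in [0, 1)
        for name, cutoff in RINGS:
            if bucket < cutoff:
                return name
        return "everyone"

    def should_receive(host_id: str, current_stage: str) -> bool:
        # A host gets the new build only once its ring is at or before the stage.
        order = [name for name, _ in RINGS]
        return order.index(ring_for(host_id)) <= order.index(current_stage)

    print(should_receive("er-workstation-01", "canary"))   # usually False this early

The point isn't the few lines of code; it's that every stage boundary is a chance for a bad build to stop at 1% of the fleet instead of 100%.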


Windows Update has actually provided sensible control over when and how to apply updates since, I think, Windows 2000 (it was definitely there by Vista). You just need to use it.
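For example, a sketch of widening "active hours" programmatically (this assumes the documented ActiveHoursStart/ActiveHoursEnd DWORD values on Windows 10+; it needs an elevated prompt, and managed machines may have policy that overrides it):

    # Sketch only: widen the window in which Windows won't auto-reboot.
    # Assumes the standard ActiveHoursStart/ActiveHoursEnd registry values.
    import winreg

    KEY_PATH = r"SOFTWARE\Microsoft\WindowsUpdate\UX\Settings"

    def set_active_hours(start_hour: int, end_hour: int) -> None:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                            winreg.KEY_SET_VALUE) as key:
            winreg.SetValueEx(key, "ActiveHoursStart", 0, winreg.REG_DWORD, start_hour)
            winreg.SetValueEx(key, "ActiveHoursEnd", 0, winreg.REG_DWORD, end_hour)

    set_active_hours(6, 23)   # e.g. 06:00-23:00 for a machine used outside 9-to-5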


It has been degrading since Windows 2000, with Microsoft steadily removing and patching up any clever workarounds people came up with to prevent the system from automatically rebooting. The pinnacle of that, insult added to injury, was the introduction of "active hours" - a period of, initially, at most 8 or 10 hours designated as the only time of day your system would not reboot due to updates. Sucks if your computer isn't an office machine only ever used 9-to-5.


No, it was not degrading - Windows 10 introduced forced updating in home editions because it was judged to be better for general cases (that it got abused later is a separate issue).

The assumption is that "pros" and "enterprise" either know how to use the provided controls or have a WSUS server set up, which takes over all update scheduling.


We do not know if the update was a new version of the driver (which can also be updated without a reboot on Windows, since... ~17 years ago at least) or if it was data that was hot-reloaded and triggered a latent bug in the driver.


> "Windows has been notorious for forcing updates down your throat"

in the same way cars are notorious for forcing you to run out of gas while you're driving them and leaving you stranded... because you didn't make time to refill them before it became a problem.

> "Things like this is exactly why people hate auto-updates."

And people also hate making time for routine maintenance, and hate getting malware from exploits they didn't patch, and companies hate getting DDoS'd by compromised Windows PCs the owners didn't patch, and companies hate downtime from attackers taking them offline. There isn't an answer which will please everyone.


This isn't really a good faith response. This prevention of functionality during a critical period while forcing an update would be like if a modern car refused to drive during an emergency due to a forced over the air update that paused the ability to drive till the update was finished.


The parent response wasn't good faith; it was leaning on an emergency in a hospital department caused by CrowdStrike to whine about Microsoft in trollbait style.

> "This prevention of functionality during a critical period while forcing an update would be like if a modern car refused to drive during an emergency"

Machines don't know if there's an emergency going on; if you don't do maintenance, knowing that the thing will fail if you don't, then you're rolling the dice on whether it fails right when you need it. It's akin to not renewing an SSL certificate - you knew it was coming, you didn't deal with it, now it's broken - despite all reasonable arguments that the connection is approximately as safe 1 minute after midnight as it was 1 minute before, if the smartphone app (or whatever) doesn't give you any expired cert override then complaining does nothing. Windows updates are released the same day every month, and have been mandatory for eight years: https://www.forbes.com/sites/amitchowdhry/2015/07/20/windows...

And we all know why - because Windows had a reputation of being horribly insecure, and when Microsoft patched things, nobody installed the patches. So now people have to install the patches. Complaining "I want to do it myself" leads to the very simple reply: you can - why didn't you do it yourself before it caused you a problem?

If you're still stubbornly refusing to install them, refusing to disable them, refusing to move to macOS or Linux, and then complaining that they forced you to update at an inconvenient time, you should expect people to point out how ridiculous (and off-topic) you're being.


(Your user name is wonderful.)

> It's akin to not renewing an SSL certificate.

Your choice of analogies is a good one. I have done SSL type stuff since 1997.

Doesn't matter: I would have to work a few hours very carefully before modifying my web server config. And test it.

I am terrified by the scale of deployment involved in this CrowdStrike update.


But that's the thing: forced updates are not akin to maintenance or certs that expire on an annual basis. I'm not sure where you seem to be getting your "you should expect people to point out how ridiculous you're being" line from. You're the only one I'm seeing arguing this idea.


Disabling forced updates by using proper managed-update features, which have existed longer than "forced updates" have, is table stakes for IT. In fact, it was considered important and critical before Windows became a major OS in business.


Not putting computers that are in any critical path on a proper maintenance schedule (which, btw, overrides automatic updates on Windows and doesn't require extra licenses!) is the same as willfully ignoring maintenance just because the car didn't punch you in the face every time you needed to top up some fluids.


I agree that it is willfully ignoring maintenance, but I completely disagree with the analogy that it is the same as ignoring a fluid change in a car. A car will break down and may stop working without fluid changes. The same is almost assuredly not usually true if a Windows, or other, update is ignored. If you disagree, then I'd be happy to review any evidence you have that these updates really are always as critical as you think.


A lot of things that come as "mandatory patches" in IT, not just for Windows, are things that tend to generate recalls - or "sucks to be you, buy a new car" - in the automotive world.

In more professional settings than private small car ownership, you often will both have regular maintenance updates provided and mandates to follow them. Sometimes they are optional because your environment doesn't depend on them, sometimes they are mandatory fixes, sometimes they change from optional to mandatory overnight when previous assumptions no longer apply.

Several years ago, a bit over 100 people, and uncounted possibly more, had their lives endangered because an extra airflow-directing piece of metal was optional; after the incident it was quickly made mandatory, with hundreds of aircraft being stopped to have the fix applied (which previously was only required for hot locations - climate change really bit it).

Similarly, when you drive your car and it fails to operate, that's just you. When it's a more critical service, you're either facing corporate, or in worst case, governmental questions.


Not OP, but some (most? many?) machines receiving the update crashed with a BSOD. So that's how they could enter the boot loop.


I just realised I had read that, but 4 minutes later and it's too late to delete my comment now; Thanks, yes it makes sense.


Half of the hotel's (Choice) computers were down. We never reboot the computers unless they're not working, working slowly, or there's a Windows update.


A lot of security software updates online, without rebooting.

If said update pushes you into a BSOD, where the automatic watchdog (enabled by default in Windows) reboots... well, there you have a boot loop.


idk, a lot of systems are never meant to be rebooted outside of the update schedule, so they wouldn't have been off in the first place. And if those systems control others, then there is a domino effect.

I can see very well how one computer could have screwed all others. It's really not hard to imagine.


And such software is supposed to hot-patch itself, because you might not have time to take systems offline to deal with an ongoing attack, for example.


What happens when a computer gets rebooted as part of daily practice or because of the update, and then it becomes unusable, and then the treatment team needs to use it hours later?


I dunno, but they'd know about it hours earlier in time to switch to paper, or pull out older computers, or something - in that scenario it wouldn't have happened "as we were treating a heart attack" and they would have had time to prepare.


I work for a diesel truck repair facility and just locked up the doors after a 40 minute day :( .

- lifts won't operate.

- can't disarm the building alarms. (have been blaring nonstop...)

- cranes are all locked in standby/return/err.

- laser aligners are all offline.

- lathe hardware runs but controllers are all down.

- can't email suppliers.

- phones are all down.

- HVAC is also down for some reason (it's getting hot in here.)

the police drove by and told us to close up for the day since we don't have 911 either.

alarms for the building are all offline/error so we chained things as best we could (might drive by a few times today.)

we don't know how many orders we have; we don't even know who's on schedule or if we will get paid.


How come lifts and cranes are affected by this?

Are they somehow controlled remotely? or do they need to ping a central server to be able to operate?

I can see how alarms, email and phones are affected but the heavy machinery?

(Clearly not familiar with any of these things so I am genuinely curious)


Lots and lots of heavy machinery uses Windows computers even for local control panels.


But why does it need to be remotely updated? Have there been major innovations in lift technology recently? They still just go up and down, right?

Once such a system is deployed why would it ever need to be updated?


They're probably deployed to a virtualized system to ease maintenance and upkeep.

Updates are partially necessary to ensure you don't end up completely unsupported in the future.

It's been a long time, but I worked IT for an auto supplier. Literally nothing was worse than some old computer crapping out with an old version of Windows and a proprietary driver. Mind you, these weren't mission critical systems, but they did disrupt people's workflows while we were fixing the systems. Think, things like digital measurements or barcode scanners. Everything can be easily done by hand but it's a massive pain.

Most of these systems end up migrated to a local data center and then deployed via a thin client. Far easier to maintain and fix than some box that's been sitting in the corner of a shop collecting dust for 15 years.


Ok but it’s a LIFT. How is Windows even involved? Is it part of the controls?


The real problem is not that it's just a damn lift and shouldn't need full Windows. It's that something as theoretically solved and done as an operating system is not practically so.

An Internet of Lift can be done with <32MB of RAM and a <500MHz single-core CPU. Instead they (for whoever "they" are) put a GLaDOS-class supercomputer on it. That's the absurdity.


An Internet of Lift can be done with <32KB of RAM and <500KHz single core CPU.


You'd be surprised at how entrenched Windows is in the machine automation industry. There are entire control-system algorithms implemented and run on realtime Windows; vendors like Beckhoff and ACS only have Windows builds for their control software, which developers extend and build on top of with Visual Studio.


Absolutely correct. I've seen multi-axis machine tools that couldn't even be started, let alone run properly, if Windows wouldn't start.

Incidentally, on more than one occasion I've not been able to use one of the nearby automatic tellers because of a Windows crash.


Siemens is also very much in on this. Up to about the 90s most of these vendors were running stuff on proprietary software stacks running on proprietary hardware networked using proprietary networks and protocols (an example for a fully proprietary stack like this would be Teleperm). Then in the 90s everyone left their proprietary systems behind and moved to Windows NT. All of these applications are truly "Windows-native" in the sense that their architecture is directly built on all the Windows components. Pretty much impossible to port, I'd wager.


Example of patent: https://patents.google.com/patent/US6983196B2/en

So, for maintenance and fault indications. It probably saves some time over someone digging up manuals to check error codes from wherever those may or may not be kept. It could also display things like height and weight.


Perhaps "Windows Embedded" is involved somewhere in the control loop, it is a huge industry but not that well-known to the public;

https://en.wikipedia.org/wiki/Windows_Embedded_Industry

https://en.wikipedia.org/wiki/Windows_IoT


We do ATMs - they run on Windows IoT - before that it was OS/2.


Any info on whether this Crowdstrike Falcon crap is used here?


Fortunately for us not at all although we use it on our desktops - my work laptop had a BSOD on Friday morning, but it recovered.


According to reports the ATMs of some banks also showed the BSOD which surprised me; i wouldn't have thought such "embedded" devices needed any type of "third-party online updates".


Security for a device that can issue cash is kind of important.


It's easier and cheaper (and a lil safer) to run wires to the up/down control lever and have those actuate a valve somewhere than it is to run hydraulic hoses to a lever like in lifts of old, for example.

That said it could also be run by whatever the equivalent of "PLC on an 8bit Microcontroller" is, and not some full embedded Windows system with live online virus protection so yeah, what the hell.


Probably for things like this - https://www.kone.co.uk/new-buildings/advanced-people-flow-so...

There's a lot of value in Internet-of-Things everything, but it comes with its own risks.


I'm having a hard time picturing a multi-story diesel repair shop. Maybe a few floors in a dense area but not so high that a lack of elevators would be show stopping. So I interpret "lift" as the machinery used to raise equipment off the ground for maintenance.


Several elevator controllers automatically switch to the safe mode if they detect a fire or security alarm (which apparently is also happening).


The most basic example is duty cycle monitoring and trouble shooting. You can also do things like digital lock-outs on lifts that need maintenance.

While the lift might not need a dedicated computer, they might be used in an integrated environment. You kick off the alignment or a calibration procedure from the same place that you operate the lift.


how many lifts, and how many floors, with how many people are you imagining? Yes, there's a dumb simple case where there's no need for a computer with an OS, but after the umpteenth car with umpteen floors, when would you put in a computer?

and then there's authentication. how do you want key cards which say who's allowed to use the lift to work without some sort of database which implies some sort of computer with an operating system?


It's a diesel repair shop, not an office building. I'm interpreting "lift" as a device for lifting a vehicle off the ground, not an elevator for getting people to the 12th floor.


> But why does it need to be remotely updated?

Because it can be remotely updated by attackers.


Security patches, assuming it has some network access.


Why would a lift have network access?


Do you see a lot of people driving around applying software updates with diskettes like in the old days?

Have we learned nothing from how the uranium enrichment machines were hacked in Iran? Or how attackers routinely move laterally across the network?

Everything is connected these days. For really good reasons.


Your understanding of Stuxnet is flawed. Iran was attacked by the US government in a very, very specific spear-phishing attack with years of preparation to get Stux into the enrichment facilities - nothing to do with lifts connected to the network.

Also, the facility was air-gapped, so it wasn't connected to ANY outside network. They had to use other means to get Stux onto those computers, and then used something like 7 zero-days to move from Windows into the Siemens controllers to inflict damage.

Stux got out potentially because someone brought their laptop to work, the malware got into said laptop and moved outside the airgap from a different network.


"Stux got out potentially because someone brought their laptop to work, the malware got into said laptop and moved outside the airgap from a different network."

The lesson here is that even in an air-gapped system the infrastructure should be as proprietary as is possible. If, by design, domestic Windows PCs or USB thumb drives could not interface with any part of the air-gapped system because (a) both hardwares were incompatible at say OSI levels 1, 2 & 3; and (b) software was in every aspect incompatible with respect to their APIs then it wouldn't really matter if by some surreptitious means these commonly-used products entered the plant. Essentially, it would be almost impossible† to get the Trojan onto the plant's hardware.

That said, that requires a lot of extra work. Excluding subsystems and components that are readily available in the external/commercial world means a considerable amount of extra design overhead, which would both slow down a project's completion and substantially increase its cost.

What I'm saying is obvious, and no doubt noted by those who have similar intentions to the Iranians. I'd also suggest that individual controllers etc., such as the Siemens ones used by Iran, either wouldn't be used or would need to be modified from standard both in hardware and in firmware (hardware mods would further bootstrap protection if an infiltrator knew the firmware had been altered and found a means of restoring the default factory version).

Unfortunately, what Stuxnet has done is to provide an excellent blueprint of how to make enrichment (or any other such) plants (chemical, biological, etc.) essentially impenetrable.

† Of course, that doesn't stop or preclude an insider/spy bypassing such protections. Building in tamper resistance and detection to counter this threat would also add another layer of cost and increase the time needed to get the plant up and running. That of itself could act as a deterrent, but I'd add that in war that doesn't account for much, take Bletchley and Manhattan where money was no object.


I once engineered a highly secure system that used (shielded) audio cables and a modem as the sole pathway to bridge the air gap. Obscure enough for ya?

Transmitted data was hashed on either side, and manually compared. Except for very rare binary updates, the data in/out mostly consisted of text chunks that were small enough to sanity-check by hand inside the gapped environment.
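The hash-and-compare step needs nothing exotic; a minimal sketch (plain SHA-256, with the digest printed in small groups so two humans can compare it out loud):

    # Run the same script on both sides of the gap and compare digests by hand.
    import hashlib, sys

    def digest(path: str, chunk: int = 1 << 20) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    for path in sys.argv[1:]:
        d = digest(path)
        # Short groups are easier to read aloud or copy onto paper.
        print(path, " ".join(d[i:i + 8] for i in range(0, len(d), 8)))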


Stux also taught other government actors what's possible with a few zero-days strung together, effectively starting the cyberwar we've been in for years.

Nothing is impenetrable.


You picked a really odd day and thread to say that everything is connected for really good reasons.


Or being online in the first place. Sounds like an unnecessary risk.


Remember those good old fashioned windows that you could roll down manually after driving into a lake?

Yeah, can’t do it now: it’s all electronic.


I’m sure that lifts have been electronically controlled for decades. But why is Windows (the operating system) involved?


but why do they have CS on them? they simply shouldn't be connected to any kind of network.

and if there's some sensor network in the building, that should be completely separate from the actual machine controls.


Compliance.

To work with various private data, you need to be accredited and that means an audit to prove you are in compliance with whatever standard you are aspiring to. CS is part of that compliance process.


Which private data would a computer need to operate a lift?


Another department in the corporation is probably accessing PII, so corporate IT installed the security software on every Windows PC. Special cases cost money to manage, so centrally managed PCs are all treated the same.


Anything that touches other systems is a risk and needs to be properly monitored and secured.

I had a lot of reservations about companies installing Crowdstrike but I'm baffled by the lack of security awareness in many comments here. So they do really seem necessary.


It must be security tags on the lift which restrict entry to authorised staff.


who's allowed to use the lift? where do those keycards authenticate to?


Because there's some level of convenience involved with network connectivity for OT.


That sounds...suboptimal.

I would imagine they used specialized controller cards or something like that.


They optimize for small-batch development costs. Slapping a Windows PC on when you sell a few hundred to a thousand units is actually pretty cheap. The software itself is probably the same order of magnitude, and cheaper for the UI itself...


And it's cheap both short and long term. Microsoft has 10-year lifecycles you don't need to pay extra for. With Linux you need IT staff to upgrade it every 3 years, not to mention hiring engineers to recompile software every 3 years with the distro upgrade.


Ubuntu LTS has support for 5 years, can be extended to 10 years of maintenance/security support with ESM (which is a paid service).

Same with Rocky Linux, but the extra 5 years of maintenance/security support is provided for free.


that's just asking for trouble.


Probably a Windows-based HMI (“human-machine interface”).

I used to build sorting machines that use variants of the typical “industrial” tech stack, and the actual controllers are rarely (but not never!) Windows. But it’s common for the HMI to be a Windows box connected into the rest of the network, as well as any server.


I'm still running multiple pieces of CNC/industrial equipment with Win 3.1/98/XP. Only just retired one running DOS 6.2.


I'm just impressed that the lifts, alarms, cranes, phones, etc all run on Windows somehow.


In a lot of cases you find tangential dependencies on Windows in ways you don't expect. For example a deployment pipeline entirely linux-based deploying to linux-based systems that relies on Active Directory for authentication.
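As a sketch of how that sneaks in (hostnames, account and group names below are invented; uses the ldap3 library): a "pure Linux" deploy gate that still goes dark the moment the AD domain controllers are unreachable.

    # Illustrative only: a Linux-side deploy check that depends on Active Directory.
    from ldap3 import Server, Connection, NTLM, ALL

    def operator_may_deploy(username: str, password: str) -> bool:
        server = Server("ldaps://dc01.corp.example.com", get_info=ALL)
        conn = Connection(server, user=f"CORP\\{username}", password=password,
                          authentication=NTLM)
        if not conn.bind():          # no reachable domain controller -> no deploys
            return False
        conn.search(
            "dc=corp,dc=example,dc=com",
            f"(&(sAMAccountName={username})"
            "(memberOf=cn=deployers,ou=groups,dc=corp,dc=example,dc=com))",
            attributes=["cn"],
        )
        allowed = len(conn.entries) == 1
        conn.unbind()
        return allowed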


> Active Directory for authentication.

In my experience that'd be 90% of the equipment.

"Oh! It has LDAP integration! We can 'Single Sign On'."


I don't know if "impressed" is the right word..

"Appalled", "bewildered" and "horrified" and also comes to mind..


I'm more confused because I have never, ever encountered a lift that wasn't just some buttons or joysticks on a controller attached to the lift. There is zero need for more computing power than an 8-bit microcontroller from the 1980s. I don't know where I would even buy such a lift with a Windows PC.


No one sells 8 bit microcontrollers from the 1980s anymore. Just because you don't need the full power of modern computing hardware and software doesn't mean you are going to pay extra for custom, less capable options.


wow, why do lifts require an OS?


I think the same question can be asked for why lots of equipment seemingly requires an OS. My take is that these products went through a phase of trying to differentiate themselves from competitors and so added convenience features that were easier to implement with a general purpose computer and some VB script rather than focusing on the simplest most reliable way to implement their required state machines. It's essentially convenience to the implementors at the expense of reliability of the end result.


My life went sideways when organizations I worked for all started to make products solely for selling and not for using those. If the product was useful for something, that was the side effect of being sellable. Not the goal.


Worse is Better has eaten the world. The philosophy of building things properly with careful, bespoke, minimalist designs has been totally destroyed by a race to the bottom. Grab it off the shelf, duct tape together a barely-working MVP, and ship it.

Now we are reaping what we sowed.


That's what you get for outsourcing to some generic shop with no domain expertise who implements to a spec for the lowest dollar.


the question is - why do lifts require Windows?


The question is, why do lifts require Crowdstrike?


Some idiot with a college degree in an office nowhere near the place sees that we have these PCs here. Then they go over a compliance list and mandate that this is needed. Now go install it, and the network for it, there...


Or they want to protect their Windows-operated lifts from very real and life-threatening events, like an attacker jumping from host to host until they are able to lock the lifts and put people's lives at risk or cause major inconveniences.

Not all security is done by stupid people. Crowdstrike messed up in many ways. It doesn't make the company that trusted them stupid for what they were trying to achieve.


Crowdstrike is malware and spyware. Trusting one malware to control another is your problem right there. It will always blow up in your face.


Why are the lifts networked or on a network which can route to the internet?

This is a car lift. It really doesn't need a computer to begin with. I've never seen one with a computer. WTF?


For the same reason people want to automate their homes, or the industries run with lots of robots, etc: because it increases productivity. The repair shop could be monitoring for usage, for adequate performance of hydraulics, long-term performance statistics, some 3rd-party gets notified to fix it before it's totally unusable, etc.

I have a friend that is a car mechanic. The amount of automation he works with is fascinating.

Sure, lifts and whatnot should be in a separate network, etc, but even banks and federal agencies screw up network security routinely. Expecting top-tier security posture from repair shops is unrealistic. So yes, they will install a security agent on their Windows machines because it looks like a good idea (it really is) without having the faintest clue about all the implications. C'est la vie.


But what are you automating? It's a car lift, you need to be standing next to it to safely operate it. You can't remotely move it, it's too dangerous. Most of the things which can go wrong with a car lift require a physical inspection and for things like hydraulic pressure you can just put a dial indicator which can be inspected by the user. Heck, you can even put electronic safety interlocks without needing an internet connection.

There are lots of difficult problems when it comes to car repair, but cloud lift monitoring is not something I've ever heard anyone ask for.

The things you're describing are all salesman sales-pitch tactics, they're random shit which sound good if you're trying to sell a product, but they're all stuff nobody actually uses once they have the product.

It's like a six in one shoe horn. It has a screw driver, flash light, ruler, bottle opener, and letter opener. If you're just looking at two numbers and you see regular shoe horn £5, six in one shoe horn £10 then you might blindly think you're getting more for your money. But at the end of the day, I find it highly unlikely you'll ever use it for anything other than to put tight shoes on.


I imagine something monitors how many times the lift has gone up and down, for maintenance reasons. Maybe a nice model monitors fluid pressure in the hydraulics to watch for leaks. Perhaps a model watches strain, or balance, to prevent a catastrophic failure. Maybe those are just sensors, but if they can't report their values they shut down for safety's sake. There are all kinds of reasonable scenarios that don't rely on bad people trying to screw or cheat someone.
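A sketch of that kind of logic (thresholds invented); note that the same few lines would fit on a small microcontroller, nothing about it needs a Windows box:

    # Illustrative duty-cycle / pressure watchdog for a lift; thresholds are made up.
    SERVICE_EVERY_CYCLES = 5_000     # lock out for inspection after this many lifts
    MIN_PRESSURE_BAR = 150           # below this, suspect a hydraulic leak

    class LiftMonitor:
        def __init__(self) -> None:
            self.cycles = 0
            self.locked_out = False

        def record_cycle(self, pressure_bar: float) -> None:
            self.cycles += 1
            if self.cycles >= SERVICE_EVERY_CYCLES:
                self.locked_out = True   # digital lock-out until serviced
            if pressure_bar < MIN_PRESSURE_BAR:
                self.locked_out = True   # possible leak: fail safe, flag a human

        def reset_after_service(self) -> None:
            self.cycles = 0
            self.locked_out = False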


None of these features require internet access or a Windows machine; most of them do not require a computer or even a microcontroller. Strain gauges can be useful for checking for an imbalanced load, but they cannot inspect the metal for you.


The question is, why do lifts require an internet connection on top of the rest?


In my office, when we swipe our entry cards at the security gates, a screen at the gate tells us which lift to take based on the floor we work on, and sets the lift to go to that floor. It's all connected.


In the context of a diesel repair shop, he likely was referring to fork lifts or vehicle lifts rather than elevators.


This doesn't require the internet, just a LAN.


Remote monitoring and maintenance. Predictive maintenance: monitor certain parameters of operation and get maintenance done before the lift stops operating.


It's a car lift. Not only would it be irresponsible to rely on a computer to tell you when you should maintain it, as some inspections can only be done visually, it seems totally pointless as most inspections need to be done manually.

Get a reminder on your calendar to do a thorough inspection once a day/week (whatever is appropriate) and train your employees what to look for every time it's used. At the end of the day, a car lift on locks is not going to fail unless there's a weakness in the metal structure, no computer is going to tell you about this unless there's a really expensive sensor network and I highly doubt any of the car lifts in question have such a sensor network.

Moreover, even if they did have such a sensor network, why are these machines able to call out to the internet?


These requirements can be met by making the lift's systems and data observable, which is a uni-directional flow of information from the lift to the outside world. Making the lift's operation modifiable from the outside world is not required to have it be observable.
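A sketch of that one-way flow (collector address invented): the lift side only ever sends and never opens a listening socket, so there is nothing to remotely log in to or push updates at.

    # Illustrative one-way telemetry: fire-and-forget UDP, no inbound path.
    import json, socket, time

    COLLECTOR = ("10.20.0.5", 9000)    # monitoring box on the shop LAN (made up)

    def publish(reading: dict) -> None:
        payload = json.dumps(reading).encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.sendto(payload, COLLECTOR)

    while True:
        publish({"lift": "bay-3", "cycles": 1234, "pressure_bar": 182.5,
                 "ts": time.time()})
        time.sleep(60)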


I mean... the beginning of mission impossible 1 should tell you.


The same reason everyone just uses a microcontroller on everything. It's like a universal glue and you can develop in the same environment you ship. Makes it easy.


Well, how else is the operator supposed to see outside?


Heh ...


Why do lathes, cranes and laser alignment systems need a new copy of Windows?


Very likely they use a manufacturing execution system like Dassault's DELMIA or Siemens MES.

These systems are intended to allow local control of a factory, or cloud based global control of manufacturing.

They can connect to individual PLCs (Programmable Logic Controllers), which handle the actual equipment.

They connect to a LAN network, or to the internet. So they naturally need some form of security.

They could use Windows Server, Red Hat Linux, etc., but they need some form of security, which is how a controller would be affected.

Usually you can just set them to manual though...


Lathes probably have PCs connected to them to control them, and do CNC stuff (he did say the controllers). Laser alignment machines all have PCs connected to them these days.

The cranes and lifts though... I've never heard of them being networked or controlled by a computer. Usually it's a couple buttons connected to the motors and that's it. But maybe they have some monitoring systems in them?


Off the top of my head, based on limited experience in industrial automation:

- maintenance monitoring data shipping to centralised locations

- computer based HMI system - there might be good old manual control but it might require unreasonable amounts of extra work per work order

- Centralised control system - instead of using a panel specific to the lift, you might be controlling a bunch of tools from a common panel

- integration with other tools, starting from things as simple as pulling up manufacturers' service manual to check for details to doing things like automatically raising the lift to position appropriate for work order involving other (possibly also automated) tools with adjustments based on the vehicle you're lifting

There could be more.


CNC machine tools can track use, maintenance, etc via the network. You can also push programs to them for your parts.

They need a new copy of Windows because running an old copy on a network is a worse idea.
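For instance, a hedged sketch of the "push programs to them" part (assumes the controller exposes an FTP drop folder, which many networked CNCs do; host and file names are invented):

    # Sketch: push a G-code program to a networked CNC controller over FTP.
    from ftplib import FTP

    def push_program(host: str, local_path: str, remote_name: str) -> None:
        with FTP(host) as ftp:
            ftp.login()          # many shop-floor controllers allow anonymous on the LAN
            with open(local_path, "rb") as f:
                ftp.storbinary(f"STOR {remote_name}", f)

    push_program("192.168.10.41", "parts/bracket_v3.nc", "O1234.nc")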


This blows my mind, because none of this requires Windows, or a desktop OS at all.


No, they don't. Absolutely. But there are very few companies that succeed without using Windows or an existing OS. The Apple HomePod runs iOS.


Remember that CNC is a programming environment. Now how do you actually see what program is loaded? Or where the execution is at the moment? For anything beyond a few lines of text on a dot-matrix screen, an actual OS starts to become desirable.

And all things considered, Windows is not that bad an option. Anything else would also have issues. And really, what is your other option, some outdated, unmaintained Android? Does your hardware vendor offer long term support for Linux?

Windows actually offers extremely good long term support quite often.


> And all things considered, Windows is not that bad an option

I'm gonna go out on a limb and say that it actually is. It's a closed source OS which includes way more functionality than you need. A purpose-built RTOS running on a microcontroller is going to provide more reliability, and if you don't hook it up to the internet it will be more secure, too. Of course, if you want you can still hook it up to the internet, but at least you're making the conscious decision to do so at that point.

Displaying something on a screen isn't very hard in an embedded environment either.

I have an open source printer which has a display, and runs on an STM32. It runs reliably, does its job well, and doesn't whine about updates or install things behind my back because it physically can't, it has no access to the internet (though I could connect it if I desired). A CNC machine is more complex and has more safety considerations, but is still in a similar class of product.

https://youtu.be/FxIUs-pQBjk?si=N-W-Af6jBgGBiIgl&t=46


> Does your hardware vendor offer long term support for Linux?

This seems muddled. If the CNC manufacturer puts Linux on an embedded device to operate the CNC, they're the hardware manufacturer and it's up to them to pick a chip that's likely to work with future Linuxes if they want to be able to update it in the future. Are you asking if the chip manufacturer offers long-term-support for Linux? It's usually the other way around, whether Linux will support the chip. And the answer, generally, is "yes, Linux works on your chip. Oh you're going to use another chip? yes, Linux works on that too". This is not really something to worry about. Unless you're making very strange, esoteric choices, Linux runs on everything.

But that still seems muddled. Long-term support? How long are we talking? Putting an old Linux kernel on an embedded device and just never updating it once it's in the field is totally viable. The Linux kernel itself is extremely backwards compatible, and it's often irrelevant which version you're using in an embedded device. The "firmware upgrades" they're likely to want to do would be in the userspace code anyhow - whatever code is showing data on a display or running a web server you can upload files to or however it works. Any kernel made in the last decade is going to be just fine.

We're not talking about installing Ubuntu and worrying about unsolicited Snap updates. Embedded stuff like this needs a kernel with drivers that can talk to required peripherals (often over protocols that haven't changed in decades), and that can kick off userspace code to provide a UI either on a screen or a web interface. It's just not that demanding.

As such, people get away with putting FreeRTOS on a microcontroller, and that can show a GUI on a screen or a web interface too, you often don't need a "full" OS at all. A full OS can be a liability, since it's difficult to get real-time behaviour which presumably matters for something like a CNC. You either run a real-time OS, or a regular OS (from which the GUI stuff is easier) which offloads work to additional microcontrollers that do the real-time stuff.

I did not expect Windows to be running on CNCs. I didn't expect it to be running on supermarket checkouts. The existence of this entire class of things pointlessly running self-updating, internet-connected Windows confuses me. I can only assume that there are industries where people think "computer equals Windows" and there just isn't the experience present, for whatever reason, to know that whacking a random Linux kernel on an embedded computer and calling it a day is way easier than whatever hoops you have to jump through to make a desktop OS, let alone Windows, work sensibly in that environment.


5-10 years is not unreasonable expected support I think.

And if you are someone manufacturing physical equipment, be it a CNC machine or a vehicle lift, hiring an entire team to keep Linux patched and make your own releases seems pretty unreasonable and a waste of resources. In the end, nothing you choose is error-free. And the box running the software is not the main product.

This is actually a huge challenge: finding a vendor that can deliver you a box to run software on with promised long-term support, when the support is actually more than just a few years.

Also, I don't understand how it is any more acceptable to run unpatched Linux in a networked environment than it is Windows. These are very often not just stand-alone things, but connected to at least a local network if not larger networks, with possible internet connections too. So not patching vulnerabilities is as unacceptable as it would be with Windows.

With CNC there is a place for something like a Windows OS. You have a separate embedded system running the tools, but you still want a different piece managing the "programs", as you could have dozens or hundreds of these. At that point reading them from the network once again starts to make sense. The time of dealing with floppies is over...

And with checkouts, you want more UI than just buttons, and Windows CE has been a reasonably effective tool for that.

Linux is nice on servers, but on the embedded side keeping it secure and up to date is often a massive amount of pain. Windows does offer excellent stability and long-term support, and you can simply buy a computer with sufficient support from MS. One could ask: why don't massive companies run their own Linux distributions?


> 5-10 years is not unreasonable expected support I think.

A couple of years ago, I helped a small business with an embroidery machine that runs Windows 98. Its physical computer died, and the owner could not find the spare parts. Fortunately, it used a parallel port to control the embroidery hardware, so it was easy to move to a VM with a USB parallel port adapter.


That was very lucky, then. USB parallel port adapters are only intended to work with printers. They fail with any hardware that does custom signalling over the parallel port.


Ok, just make the lift controller analogue. No digital processors at all. Nothing to update, so no updates needed.


Maybe you want your lift to be able to diagnose itself and report possible faults, instead of spending man-hours troubleshooting every part every time there's downtime. With big lifts there are many parts that could go wrong. Being able to identify which one saves a lot of time, and time is money.

These sorts of outages are actually extremely rare nowadays. Considering how long these control systems have been kept around, they must not actually be causing enough issues to make replacing them worth it.


you log into the machine, download files, load files onto the program. that doesn't need a desktop environment? you want to reimplement half of one, poorly, because that would have avoided this stupid mistake, in exchange for half a dozen potential others, and a worse customer experience?


> you log into the machine, download files, load files onto the program. that doesn't need a desktop environment?

Believe it or not, it doesn't! An embedded device with a form of flash storage and an internet connection to a (hopefully) LAN-only server can do the same thing.

> you want to reimplement half of one, poorly

Who says I would do it poorly? ;)

> and a worse customer experience?

Why would a purpose-built system be a worse customer experience than _windows_? Are you really going to set the bar that low?


and why do they run spyware?


Probably because some fraction of lift manufacturer's customer base has a compliance checklist requiring it.


Because we live deep into the internet of shit era.


How else are you going to update your grocery list while operating the lift?


> we dont have 911 either

Holy cow...

Who on earth requires a Windows-based backend (or whatever else had CrowdStrike, in the shop or outside) for regular (VoIP) phone calls.

This should really lead to some learnings for anyone providing any kind of phone infrastructure.


Or lathes, or cranes, or alarms, or HVAC... what the actual fuck.

The next move should be some artisanal, as-mechanical-as-possible quality products, or at least Linux(TM)-certified products or similar (or Windows-free(TM)). The opportunity is here, everybody noticed this clusterfuck, and smart folks don't like ignoring threats that are in their face.

But I suppose in 2 weeks some other bombastic news will roll over this and most will forget. But there is always some hope.


That’s not it. 911 itself was down.


Oh, great. I guess that counts as phone infrastructure.


what are the brands of these systems?


Oh man, you work with some cool (and dangerous) stuff.

Outage aside, do you feel safe using it while knowing that it accepts updates based on the whims of far away people that you don't know?


I hate to be that person, but things have moved to automatic updates because security was even shittier when the user was expected to do it.

I can't even imagine how much worse ransomware would be if, for example, Windows and browsers weren't updating themselves.


I feel like this is the fake reason given to try to hide the obvious reason: automatic updates are a power move that allows companies to retain control of products they've sold.


It's not a fake reason; it's a very real solution to a very real problem.

Of course companies are going to abuse it for grotesque profit motive, but that doesn't make their necessity a lie.


Yep. And even aside from security, it's a nightmare needing to maintain multiple versions of a product. "Oh, our software is crashing? What version do you have? Oh, 4.5. Well, update 4.7 from 2 years ago may fix your problem, but we've also released major versions 5 and 6 since then - no, I'm not trying to upsell you ma'am. We'll pull up the code from that version and see if we can figure out the problem."

Having evergreen software that just keeps itself up to date is marvellous. The Google Docs team only needs to care about the current version of their software. There are no documents saved with an old version. There's no need to backport fixes to old versions, and no QA teams that need to test backported security updates on 10 year old hardware.

It's just a shame about, y'know, the aptly named CrowdStrike.


> The Google Docs team only needs to care about the current version of their software. There are no documents saved with an old version.

There sure are. I have dozens saved years ago.


Fine. But Google can mass-migrate all of them to a new format any time they want. They don’t have the situation you used to have with Word, where you needed to remember to Save As Word 2001 format or whatever so you could open the file on another computer. (And if you forgot, the file was unreadable). It was a huge pain.