CrowdStrike Update: Windows Bluescreen and Boot Loops (reddit.com)
4480 points by BLKNSLVR 5 days ago | 3847 comments



I work for a diesel truck repair facility and just locked up the doors after a 40-minute day :(

- lifts won't operate.

- can't disarm the building alarms. (they have been blaring nonstop...)

- cranes are all locked in standby/return/err.

- laser aligners are all offline.

- lathe hardware runs but the controllers are all down.

- can't email suppliers.

- phones are all down.

- HVAC is also down for some reason (it's getting hot in here.)

the police drove by and told us to close up for the day since we don't have 911 either.

alarms for the building are all offline/error so we chained things up as best we could (might drive by a few times today.)

we don't know how many orders we have, we don't even know who's on the schedule or if we will get paid.


How come lifts and cranes are affected by this?

Are they somehow controlled remotely? Or do they need to ping a central server to be able to operate?

I can see how alarms, email and phones are affected but the heavy machinery?

(Clearly not familiar with any of these things so I am genuinely curious)


Lots and lots of heavy machinery uses Windows computers even for local control panels.

But why does it need to be remotely updated? Have there been major innovations in lift technology recently? They still just go up and down, right?

Once such a system is deployed why would it ever need to be updated?


They're probably deployed to a virtualized system to ease maintenance and upkeep.

Updates are partially necessary to ensure you don't end up completely unsupported in the future.

It's been a long time, but I worked IT for an auto supplier. Literally nothing was worse than some old computer crapping out with an old version of Windows and a proprietary driver. Mind you, these weren't mission-critical systems, but they did disrupt people's workflows while we were fixing them. Think things like digital measurement stations or barcode scanners. Everything can be done by hand easily enough, but it's a massive pain.

Most of these systems end up migrated to a local data center and then deployed via a thin client. Far easier to maintain and fix than some box that's been sitting in the corner of a shop collecting dust for 15 years.


Ok but it’s a LIFT. How is Windows even involved? Is it part of the controls?

The real problem is not that it's just a damn lift and shouldn't need full Windows. It's that something as theoretically solved and done as an operating system is not practically so.

An Internet of Lifts can be done with <32MB of RAM and a <500MHz single-core CPU. Instead they (for whatever value of "they") put in a GLaDOS-class supercomputer. That's the absurdity.
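
To make that concrete, here's a minimal sketch in C of roughly all the control logic a shop lift needs; the hal_* functions and PIN_* names are invented stand-ins for whatever GPIO layer the microcontroller vendor provides:

    #include <stdbool.h>

    enum pin { PIN_BTN_UP, PIN_BTN_DOWN, PIN_LIMIT_TOP, PIN_LIMIT_BOTTOM,
               PIN_SAFETY_LOCK, PIN_MOTOR_UP, PIN_MOTOR_DOWN };

    extern bool hal_read(enum pin p);            /* hypothetical GPIO read  */
    extern void hal_write(enum pin p, bool on);  /* hypothetical GPIO write */

    void lift_loop(void)
    {
        for (;;) {
            bool up     = hal_read(PIN_BTN_UP);
            bool down   = hal_read(PIN_BTN_DOWN);
            bool locked = hal_read(PIN_SAFETY_LOCK);

            /* dead-man control: a motor runs only while exactly one
               button is held, the travel limit is not tripped, and the
               mechanical safety lock is disengaged */
            hal_write(PIN_MOTOR_UP,
                      up && !down && !locked && !hal_read(PIN_LIMIT_TOP));
            hal_write(PIN_MOTOR_DOWN,
                      down && !up && !locked && !hal_read(PIN_LIMIT_BOTTOM));
        }
    }

That fits on the cheapest 8-bit part in the catalog, which is rather the point.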


You’d be surprised at how entrenched Windows is in the machine automation industry. There are entire control algorithms implemented and run on realtime Windows; vendors like Beckhoff and ACS only offer Windows builds of their control software, which developers extend and build on top of with Visual Studio.

Absolutely correct. I've seen multi-axis machine tools that couldn't even be started, let alone run properly, if Windows wouldn't start.

Incidentally, on more than one occasion I've not been able to use one of the nearby automatic tellers because of a Windows crash.


Siemens is also very much in on this. Up to about the 90s most of these vendors were running stuff on proprietary software stacks running on proprietary hardware networked using proprietary networks and protocols (an example for a fully proprietary stack like this would be Teleperm). Then in the 90s everyone left their proprietary systems behind and moved to Windows NT. All of these applications are truly "Windows-native" in the sense that their architecture is directly built on all the Windows components. Pretty much impossible to port, I'd wager.

Perhaps "Windows Embedded" is involved somewhere in the control loop, it is a huge industry but not that well-known to the public;

https://en.wikipedia.org/wiki/Windows_Embedded_Industry

https://en.wikipedia.org/wiki/Windows_IoT


We do ATMs - they run on Windows IoT - before that it was OS/2.

Any info on whether this Crowdstrike Falcon crap is used here?

Fortunately for us not at all although we use it on our desktops - my work laptop had a BSOD on Friday morning, but it recovered.

According to reports the ATMs of some banks also showed the BSOD, which surprised me; I wouldn't have thought such "embedded" devices needed any type of "third-party online updates".

Example of patent: https://patents.google.com/patent/US6983196B2/en

So, for maintenance and fault indications. It probably saves someone time digging up manuals to check error codes, wherever those may or may not be. It could also display things like height and weight.


It's easier and cheaper (and a little safer) to run wires to the up/down control lever and have those actuate a valve somewhere than it is to run hydraulic hoses to a lever like in lifts of old, for example.

That said, it could also be run by whatever the equivalent of "PLC on an 8-bit microcontroller" is, and not some full embedded Windows system with live online virus protection, so yeah, what the hell.


Probably for things like this - https://www.kone.co.uk/new-buildings/advanced-people-flow-so...

There’s a lot of value in Internet-of-Things everything, but it comes with its own risks.


I'm having a hard time picturing a multi-story diesel repair shop. Maybe a few floors in a dense area but not so high that a lack of elevators would be show stopping. So I interpret "lift" as the machinery used to raise equipment off the ground for maintenance.

Several elevator controllers automatically switch to a safe mode if they detect a fire or security alarm (which apparently is also happening).

The most basic example is duty-cycle monitoring and troubleshooting. You can also do things like digital lock-outs on lifts that need maintenance.

While the lift might not need a dedicated computer, they might be used in an integrated environment. You kick off the alignment or a calibration procedure from the same place that you operate the lift.


how many lifts, and how many floors, with how many people are you imagining? Yes, there's a dumb simple case where there's no need for a computer with an OS, but after the umpteenth car with umpteen floors, when would you put in a computer?

and then there's authentication. how do you want key cards which say who's allowed to use the lift to work without some sort of database which implies some sort of computer with an operating system?


It's a diesel repair shop, not an office building. I'm interpreting "lift" as a device for lifting a vehicle off the ground, not an elevator for getting people to the 12th floor.

> But why does it need to be remotely updated?

Because it can be remotely updated by attackers.


Security patches, assuming it has some network access.

Why would a lift have network access?

Do you see a lot of people driving around applying software updates with diskettes like in the old days?

Have we learned nothing from how the uranium enrichment machines were hacked in Iran? Or how attackers routinely move laterally across the network?

Everything is connected these days. For really good reasons.


Your understanding of Stuxnet is flawed. Iran was attacked by the US government in a very, very specific targeted attack, with years of preparation to get Stuxnet into the enrichment facilities - nothing to do with lifts connected to the network.

Also the facility was air-gapped, so it wasn't connected to ANY outside network. They had to use other means to get Stuxnet onto those computers, and it then used something like 7 zero-days to move from Windows into the Siemens controllers to inflict damage.

Stux got out potentially because someone brought their laptop to work, the malware got into said laptop and moved outside the airgap from a different network.


"Stux got out potentially because someone brought their laptop to work, the malware got into said laptop and moved outside the airgap from a different network."

The lesson here is that even in an air-gapped system the infrastructure should be as proprietary as possible. If, by design, domestic Windows PCs or USB thumb drives could not interface with any part of the air-gapped system because (a) the hardware was incompatible at, say, OSI layers 1, 2 & 3, and (b) the software was in every aspect incompatible with respect to its APIs, then it wouldn't really matter if by some surreptitious means these commonly-used products entered the plant. Essentially, it would be almost impossible† to get the Trojan onto the plant's hardware.

That said, this requires a lot of extra work. Excluding subsystems and components that are readily available in the external/commercial world means a considerable amount of extra design overhead, which would both slow down a project's completion and substantially increase its cost.

What I'm saying is obvious, and no doubt noted by those who have similar intentions to the Iranians'. I'd also suggest that standard controllers such as the Siemens ones used by Iran either wouldn't be used or would need to be modified from standard, both in hardware and firmware (hardware mods would further bootstrap protection if an infiltrator knew the firmware had been altered and found a means of restoring the default factory version).

Unfortunately, what Stuxnet has done is to provide an excellent blueprint of how to make enrichment (or any other such) plants (chemical, biological, etc.) essentially impenetrable.

† Of course, that doesn't stop or preclude an insider/spy bypassing such protections. Building in tamper resistance and detection to counter this threat would also add another layer of cost and increase the time needed to get the plant up and running. That of itself could act as a deterrent, but I'd add that in war that doesn't account for much, take Bletchley and Manhattan where money was no object.


I once engineered a highly secure system that used (shielded) audio cables and a modem as the sole pathway to bridge the airgap. Obscure enough for ya?

Transmitted data was hashed on either side, and manually compared. Except for very rare binary updates, the data in/out mostly consisted of text chunks that were small enough to sanity-check by hand inside the gapped environment.


Stux also taught other government actors what's possible with a few zero-days strung together, effectively starting the cyberwar we've been in for years.

Nothing is impenetrable.


You picked a really odd day and thread to say that everything is connected for really good reasons.

Or being online in the first place. Sounds like an unnecessary risk.

Remember those good old fashioned windows that you could roll down manually after driving into a lake?

Yeah, can’t do it now: it’s all electronic.


I’m sure that lifts have been electronically controlled for decades. But why is Windows (the operating system) involved?

but why do they have CS on them? they should simply not be connected to any kind of network.

and if there's some sensor network in the building that should be completely separate from the actual machine controls.


Compliance.

To work with various private data, you need to be accredited and that means an audit to prove you are in compliance with whatever standard you are aspiring to. CS is part of that compliance process.


Which private data would a computer need to operate a lift?

Another department in the corporation is probably accessing PII, so corporate IT installed the security software on every Windows PC. Special cases cost money to manage, so centrally managed PCs are all treated the same.

Anything that touches other systems is a risk and needs to be properly monitored and secured.

I had a lot of reservations about companies installing Crowdstrike but I'm baffled by the lack of security awareness in many comments here. So they do really seem necessary.


It must be security tags on the lift which restrict entry to authorised staff.

who's allowed to use the lift? where do those keycards authenticate to?

Because there's some level of convenience involved with network connectivity for OT.

That sounds...suboptimal.

I would imagine they used specialized controller cards or something like that.


They optimize for small-batch development costs. Slapping in a Windows PC when you sell a few hundred to a thousand units is actually pretty cheap. The software itself is probably the same order of magnitude, cheaper for the UI itself...

And cheap both short- and long-term. Microsoft has 10-year lifecycles you don't need to pay extra for. With Linux you need IT staff to upgrade it every 3 years, not to mention engineers to recompile the software every 3 years with each distro upgrade.

Ubuntu LTS has support for 5 years, can be extended to 10 years of maintenance/security support with ESM (which is a paid service).

Same with Rocky Linux, but the extra 5 years of maintenance/security support is provided for free.


that's just asking for trouble.

Probably a Windows-based HMI (“human-machine interface”).

I used to build sorting machines that use variants of the typical “industrial” tech stack, and the actual controllers are rarely (but not never!) Windows. But it’s common for the HMI to be a Windows box connected into the rest of the network, as well as any server.


I'm still running multiple CNC/industrial machines with Win 3.1/98/XP. Only just retired one running DOS 6.2.

I'm just impressed that the lifts, alarms, cranes, phones, etc all run on Windows somehow.

In a lot of cases you find tangential dependencies on Windows in ways you don't expect. For example a deployment pipeline entirely linux-based deploying to linux-based systems that relies on Active Directory for authentication.

> Active Directory for authentication.

In my experience that'd be 90% of the equipment.

"Oh! It has LDAP integration! We can 'Single Sign On'."


I don't know if "impressed" is the right word..

"Appalled", "bewildered" and "horrified" and also comes to mind..


I'm more confused because I have never, ever encountered a lift that wasn't just some buttons or joysticks on a controller attached to the lift. There is zero need for more computing power than an 8-bit microcontroller from the 1980s. I don't know where I would even buy such a lift with a Windows PC.

No one sells 8 bit microcontrollers from the 1980s anymore. Just because you don't need the full power of modern computing hardware and software doesn't mean you are going to pay extra for custom, less capable options.

wow, why do lifts require an OS?

I think the same question can be asked for why lots of equipment seemingly requires an OS. My take is that these products went through a phase of trying to differentiate themselves from competitors and so added convenience features that were easier to implement with a general purpose computer and some VB script rather than focusing on the simplest most reliable way to implement their required state machines. It's essentially convenience to the implementors at the expense of reliability of the end result.

My life went sideways when the organizations I worked for all started to make products solely for selling, not for using. If the product was useful for something, that was a side effect of being sellable. Not the goal.

Worse is Better has eaten the world. The philosophy of building things properly with careful, bespoke, minimalist designs has been totally destroyed by a race to the bottom. Grab it off the shelf, duct tape together a barely-working MVP, and ship it.

Now we are reaping what we sowed.


That's what you get for outsourcing to some generic shop with no domain expertise who implements to a spec for the lowest dollar.

the question is - why do lifts require Windows?

The question is, why do lifts require Crowdstrike?

Some idiot with a college degree, in an office nowhere near the place, sees that we have these PCs here. Then they go over a compliance list and mandate that this is needed. Now go install it, and the network to go with it...

Or they want to protect their Windows-operated lifts from very real and life-threatening events, like an attacker jumping from host to host until they are able to lock the lifts and put people's lives at risk or cause major inconveniences.

Not all security is done by stupid people. Crowdstrike messed up in many ways. It doesn't make the company that trusted them stupid for what they were trying to achieve.


Crowdstrike is malware and spyware. Trusting one malware to control another is your problem right there. It will always blow up in your face.

Why are the lifts networked or on a network which can route to the internet?

This is a car lift. It really doesn't need a computer to begin with. I've never seen one with a computer. WTF?


For the same reason people want to automate their homes, or the industries run with lots of robots, etc: because it increases productivity. The repair shop could be monitoring for usage, for adequate performance of hydraulics, long-term performance statistics, some 3rd-party gets notified to fix it before it's totally unusable, etc.

I have a friend who is a car mechanic. The amount of automation he works with is fascinating.

Sure, lifts and whatnot should be in a separate network, etc, but even banks and federal agencies screw up network security routinely. Expecting top-tier security posture from repair shops is unrealistic. So yes, they will install a security agent on their Windows machines because it looks like a good idea (it really is) without having the faintest clue about all the implications. C'est la vie.


But what are you automating? It's a car lift, you need to be standing next to it to safely operate it. You can't remotely move it, it's too dangerous. Most of the things which can go wrong with a car lift require a physical inspection and for things like hydraulic pressure you can just put a dial indicator which can be inspected by the user. Heck, you can even put electronic safety interlocks without needing an internet connection.

There are lots of difficult problems when it comes to car repair, but cloud lift monitoring is not something I've ever heard anyone ask for.

The things you're describing are all salesman sales-pitch tactics, they're random shit which sound good if you're trying to sell a product, but they're all stuff nobody actually uses once they have the product.

It's like a six-in-one shoe horn. It has a screwdriver, flashlight, ruler, bottle opener, and letter opener. If you're just looking at two numbers and you see regular shoe horn £5, six-in-one shoe horn £10, then you might blindly think you're getting more for your money. But at the end of the day, I find it highly unlikely you'll ever use it for anything other than to put tight shoes on.


I imagine something monitors how many times the lift has gone up and down, for maintenance reasons. Maybe a nice model monitors fluid pressure in the hydraulics to watch for leaks. Perhaps a model watches strain, or balance, to prevent a catastrophic failure. Maybe those are just sensors, but if they can’t report their values they shut down for safety’s sake. There are all kinds of reasonable scenarios that don’t rely on bad people trying to screw or cheat someone.

None of these features require internet or a windows machine, most of them do not require a computer or even a microcontroller. Strain gauges can be useful for checking for an imbalanced load, but they cannot inspect the metal for you.

The question is, why do lifts require internet connection on top of the rest.

In my office, when we swipe our entry cards at the security gates, a screen at the gate tells us which lift to take based on the floor we work on, and sets the lift to go to that floor. It's all connected.

In the context of a diesel repair shop, he likely was referring to fork lifts or vehicle lifts rather than elevators.

This doesn't require the internet, just a LAN.

Remote monitoring and maintenance. Predictive maintenance: monitor certain parameters of operation and get maintenance done before the lift stops operating.

These requirements can be met by making the lift's systems and data observable, which is a uni-directional flow of information from the lift to the outside world. Making the lift's operation modifiable from the outside world is not required to have it be observable.
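
A minimal sketch of that uni-directional setup (POSIX sockets; the collector address and message format are invented): the controller only ever sends, and nothing listens, so nothing on the lift is remotely controllable:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>

    /* Fire-and-forget telemetry: one UDP datagram per report, no inbound
       socket anywhere in the controller. */
    void report_cycles(unsigned cycle_count)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) return;

        struct sockaddr_in sink = { 0 };
        sink.sin_family = AF_INET;
        sink.sin_port   = htons(9000);                  /* example collector */
        inet_pton(AF_INET, "10.0.0.5", &sink.sin_addr); /* monitoring host   */

        char msg[64];
        int n = snprintf(msg, sizeof msg, "lift1 cycles=%u", cycle_count);
        sendto(fd, msg, (size_t)n, 0, (struct sockaddr *)&sink, sizeof sink);
        close(fd);
    }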

It's a car lift. Not only would it be irresponsible to rely on a computer to tell you when you should maintain it, as some inspections can only be done visually, it seems totally pointless as most inspections need to be done manually.

Get a reminder on your calendar to do a thorough inspection once a day/week (whatever is appropriate) and train your employees what to look for every time it's used. At the end of the day, a car lift on locks is not going to fail unless there's a weakness in the metal structure, no computer is going to tell you about this unless there's a really expensive sensor network and I highly doubt any of the car lifts in question have such a sensor network.

Moreover, even if they did have such a sensor network, why are these machines able to call out to the internet?


I mean... the beginning of mission impossible 1 should tell you.

The same reason everyone just uses a microcontroller on everything. It's like a universal glue and you can develop in the same environment you ship. Makes it easy.

Well, how else is the operator supposed to see outside?

Heh ...

Why do lathes, cranes and laser alignment systems need a new copy of Windows?

Very likely they use a manufacturing execution system like Dassault's DELMIA or Siemens MES.

These systems are intended to allow local control of a factory, or cloud-based global control of manufacturing.

They can connect to individual PLCs (Programmable Logic Controllers), which handle the actual equipment.

They connect to a LAN, or to the internet, so they naturally need some form of security.

They could use Windows Server, Red Hat Linux, etc., but they need some form of security. Which is how a controller would be affected.

Usually you can just set them to manual though...
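
For a feel of what that PLC link looks like in practice, here's a hedged sketch of polling one holding register over Modbus/TCP with raw sockets; the PLC address, unit id and register are made up, and real supervisory software would use a vendor library rather than hand-rolled framing:

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in plc = { 0 };
        plc.sin_family = AF_INET;
        plc.sin_port   = htons(502);                        /* standard Modbus/TCP port */
        inet_pton(AF_INET, "192.168.0.10", &plc.sin_addr);  /* example PLC address      */
        if (fd < 0 || connect(fd, (struct sockaddr *)&plc, sizeof plc) != 0)
            return 1;

        /* MBAP header + PDU: read 1 holding register at address 0 from unit 1 */
        uint8_t req[12] = {
            0x00, 0x01,   /* transaction id                   */
            0x00, 0x00,   /* protocol id (0 = Modbus)         */
            0x00, 0x06,   /* byte count of what follows       */
            0x01,         /* unit id                          */
            0x03,         /* function: read holding registers */
            0x00, 0x00,   /* starting address                 */
            0x00, 0x01    /* register count                   */
        };
        write(fd, req, sizeof req);

        uint8_t resp[260];
        ssize_t n = read(fd, resp, sizeof resp);
        if (n >= 11 && resp[7] == 0x03)        /* function echoed = no exception */
            printf("register 0 = %u\n", (unsigned)(resp[9] << 8 | resp[10]));
        close(fd);
        return 0;
    }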


Lathes probably have PCs connected to them to control them, and do CNC stuff (he did say the controllers). Laser alignment machines all have PCs connected to them these days.

The cranes and lifts though... I've never heard of them being networked or controlled by a computer. Usually it's a couple buttons connected to the motors and that's it. But maybe they have some monitoring systems in them?


Off the top of my head, based on limited experience in industrial automation:

- maintenance monitoring data shipped to centralised locations

- computer-based HMI system - there might be good old manual control, but it might require unreasonable amounts of extra work per work order

- centralised control system - instead of using a panel specific to the lift, you might be controlling a bunch of tools from a common panel

- integration with other tools, starting from things as simple as pulling up the manufacturer's service manual to check details, to things like automatically raising the lift to the position appropriate for a work order involving other (possibly also automated) tools, with adjustments based on the vehicle you're lifting

There could be more.


CNC machine tools can track use, maintenance, etc via the network. You can also push programs to them for your parts.

They need a new copy of Windows because running an old copy on a network is a worse idea.


This blows my mind because none of this requires windows, or a desktop OS at all.

No, they don't. Absolutely. But there are very few companies successful without using Windows or another existing OS. The Apple HomePod runs iOS.

Remember that a CNC is a programming environment. Now, how do you actually see what program is loaded? Or where execution is at the moment? For anything beyond a few lines of text on a dot-matrix screen, an actual OS starts to become desirable.

And all things considered, Windows is not that bad an option. Anything else would also have issues. And really, what is your other option - some outdated, unmaintained Android? Does your hardware vendor offer long-term support for Linux?

Windows actually offers extremely good long term support quite often.


> And all things considered, Windows is not that bad option

I'm gonna go out on a limb and say that it actually is. It's a closed source OS which includes way more functionality than you need. A purpose-built RTOS running on a microcontroller is going to provide more reliability, and if you don't hook it up to the internet it will be more secure, too. Of course, if you want you can still hook it up to the internet, but at least you're making the conscious decision to do so at that point.

Displaying something on a screen isn't very hard in an embedded environment either.

I have an open source printer which has a display, and runs on an STM32. It runs reliably, does its job well, and doesn't whine about updates or install things behind my back because it physically can't, it has no access to the internet (though I could connect it if I desired). A CNC machine is more complex and has more safety considerations, but is still in a similar class of product.

https://youtu.be/FxIUs-pQBjk?si=N-W-Af6jBgGBiIgl&t=46


> Does your hardware vendor offer long term support for Linux?

This seems muddled. If the CNC manufacturer puts Linux on an embedded device to operate the CNC, they're the hardware manufacturer and it's up to them to pick a chip that's likely to work with future Linuxes if they want to be able to update it in the future. Are you asking if the chip manufacturer offers long-term-support for Linux? It's usually the other way around, whether Linux will support the chip. And the answer, generally, is "yes, Linux works on your chip. Oh you're going to use another chip? yes, Linux works on that too". This is not really something to worry about. Unless you're making very strange, esoteric choices, Linux runs on everything.

But that still seems muddled. Long-term support? How long are we talking? Putting an old Linux kernel on an embedded device and just never updating it once it's in the field is totally viable. The Linux kernel itself is extremely backwards compatible, and it's often irrelevant which version you're using in an embedded device. The "firmware upgrades" they're likely to want to do would be in the userspace code anyhow - whatever code is showing data on a display or running a web server you can upload files to or however it works. Any kernel made in the last decade is going to be just fine.

We're not talking about installing Ubuntu and worrying about unsolicited Snap updates. Embedded stuff like this needs a kernel with drivers that can talk to required peripherals (often over protocols that haven't changed in decades), and that can kick off userspace code to provide a UI either on a screen or a web interface. It's just not that demanding.

As such, people get away with putting FreeRTOS on a microcontroller, and that can show a GUI on a screen or a web interface too, you often don't need a "full" OS at all. A full OS can be a liability, since it's difficult to get real-time behaviour which presumably matters for something like a CNC. You either run a real-time OS, or a regular OS (from which the GUI stuff is easier) which offloads work to additional microcontrollers that do the real-time stuff.

I did not expect Windows to be running on CNCs. I didn't expect it to be running on supermarket checkouts. The existence of this entire class of things pointlessly running self-updating, internet-connected Windows confuses me. I can only assume that there are industries where people think "computer equals Windows" and there just isn't the experience present, for whatever reason, to know that whacking a random Linux kernel on an embedded computer and calling it a day is way easier than whatever hoops you have to jump through to make a desktop OS, let alone Windows, work sensibly in that environment.


5-10 years is not an unreasonable expected support period, I think.

And if you are someone manufacturing physical equipment, be it a CNC machine or a vehicle lift, hiring an entire team to keep Linux patched and make your own releases seems pretty unreasonable and a waste of resources. In the end, anything you choose is not error-free. And the box running the software is not the main product.

This is actually a huge challenge: finding a vendor that can deliver you a box to run your software on, with promised long-term support, when the support actually needs to be more than just a few years.

Also, I don't understand how it is any more acceptable to run unpatched Linux in a networked environment than unpatched Windows. These are very often not just stand-alone things, but connected to at least a local network if not larger ones, with possible internet connections too. So not patching vulnerabilities is as unacceptable as it would be with Windows.

With CNC there is a place for something like a Windows OS. You have a separate embedded system running the tools, but you still want a different piece managing the "programs", as you could have dozens or hundreds of these. At that point, reading them from the network once again starts to make sense. The time of dealing with floppies is over...

And with checkouts, you want more UI than just buttons, and Windows CE has been a reasonably effective tool for that.

Linux is nice on servers, but on the embedded side keeping it secure and up to date is often a massive amount of pain. Windows does offer excellent stability and long-term support, and you can simply buy a computer with sufficient support from MS. One could ask: why don't massive companies run their own Linux distributions?


> 5-10 years is not unreasonable expected support I think.

A couple of years ago, I helped a small business with an embroidery machine that runs Windows 98. Its physical computer died, and the owner could not find the spare parts. Fortunately, it used a parallel port to control the embroidery hardware, so it was easy to move to a VM with a USB parallel port adapter.


That was very lucky then. USB parallel port adapters are only intended to work with printers. They fail with any hardware that does custom signalling over the parallel port.

Ok, just make the lift controller analogue. No digital processors at all. Nothing to update, so no updates needed.

Maybe you want your lift to be able to diagnose itself and report possible faults, instead of spending man-hours troubleshooting every part each time, downtime included. With big lifts there are many parts that could go wrong. Being able to identify which one saves a lot of time, and time is money.

These sorts of outages are actually extremely rare nowadays. Considering how long these control systems have been kept around, they must not actually be causing enough issues to make replacing them worth it.


you log into the machine, download files, load files onto the program. that doesn't need a desktop environment? you want to reimplement half of one, poorly, because that would have avoided this stupid mistake, in exchange for half a dozen potential others, and a worse customer experience?

> you log into the machine, download files, load files onto the program. that doesn't need a desktop environment?

Believe it or not, it doesn't! An embedded device with a form of flash storage and an internet connection to a (hopefully) LAN-only server can do the same thing.

> you want to reimplement half of one, poorly

Who says I would do it poorly? ;)

> and a worse customer experience?

Why would a purpose-built system be a worse customer experience than _windows_? Are you really going to set the bar that low?


and why do they run spyware?

Probably because some fraction of lift manufacturer's customer base has a compliance checklist requiring it.

Because we live deep into the internet of shit era.

How else are you going to update your grocery list while operating the lift?

> we dont have 911 either

Holy cow...

Who on earth requires a Windows-based backend (or whatever else had CrowdStrike, in the shop or outside) for regular (VoIP) phone calls?

This should really lead to some lessons for anyone providing any kind of phone infrastructure.


Or lathe, or cranes, or alarms, or hvac... what the actual fuck.

The next move should be some artisanal, as-mechanical-as-possible quality products, or at least a Linux(TM)-certified product or similar (or Windows-free(TM)). The opportunity is here, everybody noticed this clusterfuck, and smart folks don't like ignoring threats that are in their face.

But I suppose in 2 weeks some other bombastic news will roll over this and most will forget. But there is always some hope.


That’s not it. 911 itself was down.

Oh, great. I guess that counts as phone infrastructure.

what are the brands of these systems?

Oh man, you work with some cool (and dangerous) stuff.

Outage aside, do you feel safe using it while knowing that it accepts updates based on the whims of far away people that you don't know?


I hate to be that person, but things have moved to automatic updates because security was even shittier when the user was expected to do it.

I can't even imagine how much worse ransomware would be if, for example, Windows and browsers weren't updating themselves.


I feel like this is the fake reason given to try to hide the obvious reason: automatic updates are a power move that allows companies to retain control of products they've sold.

It's not a fake reason; it's a very real solution to a very real problem.

Of course companies are going to abuse it for grotesque profit motive, but that doesn't make their necessity a lie.


Yep. And even aside from security, it's a nightmare needing to maintain multiple versions of a product. "Oh, our software is crashing? What version do you have? Oh, 4.5. Well, update 4.7 from 2 years ago may fix your problem, but we've also released major versions 5 and 6 since then - no, I'm not trying to upsell you, ma'am. We'll pull up the code from that version and see if we can figure out the problem."

Having evergreen software that just keeps itself up to date is marvellous. The Google Docs team only needs to care about the current version of their software. There are no documents saved with an old version. There's no need to backport fixes to old versions, and no QA teams that need to test backported security updates on 10 year old hardware.

It's just a shame about, y'know, the aptly named CrowdStrike.


> The Google Docs team only needs to care about the current version of their software. There are no documents saved with an old version.

There sure are. I have dozens saved years ago.


Fine. But Google can mass-migrate all of them to a new format any time they want. They don’t have the situation you used to have with Word, where you needed to remember to Save As Word 2001 format or whatever so you could open the file on another computer. (And if you forgot, the file was unreadable). It was a huge pain.

Yes, it is better than the Word situation, but no, it isn't "not caring". Old-format docs do exist and Google does have to care - to make that migration.

Yes, they have to migrate once. But they don’t need to maintain 8 different versions of Word going back a decade, make sure all security patches get backported (without breaking anything along the way), and make sure all of them are in some way cross-compatible despite having differing feature sets.

If Google makes a new storage format they have to migrate old Google Docs. But that’s a one-off thing. When migrations happen, documents are only ever moved from old file formats to new file formats. With Word, I need to be able to open an old document with the new version of Word, make changes, then re-save it so it’s compatible with the old version of Word again. Then edit it on an old version of Word and go back and forth.

I’m sure the Google engineers are very busy. But by making Docs be evergreen software, they have a much easier problem to solve when it comes to this stuff. Nobody uses the version of Google docs from 6 months ago. You can’t. And that simplifies a lot of things.


> Yes, they have to migrate once.

They have to migrate each time they change the format, surely. Either that or maintain converters going back decades, to apply the right one when a document is opened.

> but they don’t need to maintain 8 different versions of Word going back a decade, make sure all security patches get back ported

Nor does Microsoft for Word.

> With word, I need to be able to open an old document with the new version of word, make changes then re-save it so it’s compatible with the old version of word again.

You don't have to, unless you want the benefit of that.

And Google Docs offers the same.

> Nobody uses the version of Google docs from 6 months ago. You can’t. And that simplifies a lot of things.

Well, I'd love to use the version of Gmail web from 6 months ago. Because three months ago Google broke email address input such that it no longer accesses the contacts list and I have to type/paste each address in full.

That's the price we pay for things being "simpler" for a software provider that can and does change the software I am using without telling me, let alone giving me the choice.

Not to mention the change that took away a large chunk of my working screen space for an advert telling me to switch to the app version, despite my having the latest version of Google's own Chrome. An advert I cannot remove despite having got the message 1000 times. Pure extortion. Simplification is no excuse.


It used to be the original reason why automatic updates were accepted and it was valid.

But since then it has been abused for all sorts of things that really are nothing more than consolidation of power, including an entire shift in mentality of what "ownership" even means: tech companies today seem to think it's the standard that they keep effective ownership of a product for its entire life cycle, no matter how much money a customer has paid for it, and no matter how deeply the customer relies on that product.

(Politicians mostly seem fine with that development or even encourage it)

I agree that an average nontechnical person can't be expected to keep track of all the security patches manually to keep their devices secure.

What I would expect would be an easy way to opt-out of automatic updates if you know what you're doing. The fact that many companies go to absurd lengths to stop you from e.g. replacing the firmware or unlocking the bootloader, even if you're the owner of the device is a pretty clear sign to me they are not doing this out of a desire to protect the end-user.

Also, I'm a bit baffled that there is no vetting at all of the contents of updates. A vendor can write absolutely whatever they want into a patch for some product of theirs and arbitrarily change the behaviour of software and devices that belong to other people. As a society, we're just trusting the tech companies to do the right thing.

I think a better system would be if updates at the very least had to be vetted by an independent third party before being applied, and a device would only accept an update if it's signed by both the vendor and the third party.

The third party could then do the following things (a sketch of the resulting dual-signature acceptance check follows the list):

- run tests and check for bugs

- check for malicious and rights-infringing changes deliberately introduced by the vendor (e.g. taking away functionality that was there at time of purchase)

- publicly document the contents of an update, beyond "bug fixes and performance improvements".
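
The acceptance rule itself is tiny; a minimal sketch, with the actual signature verification left as callbacks since the scheme would be whatever the platform already uses:

    #include <stdint.h>
    #include <stddef.h>

    /* An update blob is applied only when BOTH independent parties have
       signed off; either verifier alone saying no blocks the install. */
    typedef int (*verify_fn)(const uint8_t *blob, size_t len);

    int update_accepted(const uint8_t *blob, size_t len,
                        verify_fn vendor_ok, verify_fn auditor_ok)
    {
        return vendor_ok(blob, len) && auditor_ok(blob, len);
    }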


What you're describing is what Linux distro maintainers do: Debian maintainers check the changes of different software repos, look at new options and decide if anything should be disabled in the official Debian release, and compile and upload the packages.

The problem you are complaining about here is the weakening of labor and consumer organizations vis-a-vis capital or ownership organizations. The software must be updated frequently due to our lack of skill in writing secure software. Whether all the corporations will take advantage of everything under the sun to reduce the power that the purchasers and producers of these products have is a political and legal question. If only the corporations are politically involved, then only they will have their voice heard by the legislatures.

no reason why both can't be true — the security is overall better, and companies are happy to invest in advancing this paradigm because it gives them more control

incentive can and does undermine the stated goal. what if the government decided to take control of everyone's investment portfolio to prevent the market doing bad things? or an airplane manufacturer took control of its own safety certification process because obviously it's in their best interest that their planes are safe? or an imposed curfew, everyone has to be inside their homes while it's dark outside, because most violent crimes occur at night?

This is for critical infrastructure though. You AT LEAST test it out first on some machines.

That may apply to things that need to be online, but... a lathe?

how much lathe-ing have you done recently? did you load files onto your CNC lathe with an SD card, and thus there is a computer, which needs updates, or are you thinking of a lathe that is a motor and a rubber band, and nothing else, from, like, high school woodshop?

I bought a 3D printer years ago, then let it sit collecting dust for two or more years because I was intimidated by it. When I finally started using it, I was blown away by how useful it has been to me. A long time later I realized, holy shit, there are updates and upgrades one can easily do. I can add a camera and control and monitor everything from any online connected device. I always hated pulling out the SD card, bringing it to my computer, copying files over and back to the printer, and so on. Being online makes things so much easier and faster.

I have been rocking my basic printer for a few years now and have not paid much attention to the scene, and then I started seeing these multi-color prints - holy shit, am I slow and behind the times. The newer printers are pretty rad, but I will give props to my Anycubic Mega: it has been a workhorse and I have had very few problems. I don't want it to die on me, but a newer printer would be cool also.

All fine... until it gets hacked.

And does what? Print something?

There are immense benefits to using modern computing power, including both onboard and remote functionality. The cost of increased software security vulnerability is easily justified.


More like infect something. Your computer.

> The cost of increased software security vulnerability is easily justified.

Sometimes yes, sometimes no.


wouldn't the lathe need to be online to get the OTA update from Crowdstrike?

What a load of horseshit.

1. Nobody auto-updates my Linux machines. They have no malware.

2. It's my job to change the oil in my car. The day Ford starts sending a tech to my house to tamper with my machines "because they need maintenance" will be the day I am no longer a Ford customer.


The irony of this comment is almost perfected by the fact that Ford was one of the leading companies in bringing ECUs (one of the myriad computer systems essential to modern vehicles that can and do receive regular updates) to market in *checks notes* 1975.

https://en.wikipedia.org/wiki/Ford_EEC


Carelessly handled Linux machines* can and do get infected by malware or compromised for data exfil; don't be obtuse.

*Let's not pretend this never happens


Not to mention CVE mitigation.

Those Linux systems that aren't getting updates must be the ones sending Mirai to my Linux systems, which are getting updates (and also Mirai, although it won't run because it's the wrong architecture).

No malware? Only if you have your head in the sand.


I assume that comment was saying that they handle the update process and that their machines don't have any malware on them.

I ignored it because it was somewhat abusive and is missing the problem that automatic updates are trying to solve: that most people, but not all, don't do updates.


yeah, you don't want day-to-day security (a) changing daily (b) at the kernel level

Wow, this hits close to home. Doing a page fault where you can't in the kernel is exactly what I did in the very first patch I submitted after I joined the Microsoft BitLocker team in 2009. I added a check on the driver initialization path and didn't annotate the code as non-paged because, frankly, I didn't know at the time that the Windows kernel was paged. All my kernel development experience up to that point was with Linux, which isn't paged.

BitLocker is a storage driver, so that code turned into a circular dependency: the attempt to page in the code resulted in a call to that not-yet-paged-in code.

The reason I didn't catch it with local testing was that I never tried rebooting with BitLocker enabled on my dev box while I was working on that code. Everyone on the team who did have BitLocker enabled got the BSOD when they rebooted. Even then, the "blast radius" was only the BitLocker team of about 8 devs, since local changes were qualified at the team level before they were merged up the chain.

The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
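
For readers who haven't done NT kernel work, the idiom being described looks roughly like this (a hedged sketch; the function name is invented):

    #include <ntddk.h>

    NTSTATUS InitHelper(void);

    /* Place InitHelper in the pageable PAGE section; this is only safe
       because it is called exclusively at PASSIVE_LEVEL (e.g. from
       DriverEntry), where a page fault can be serviced. */
    #ifdef ALLOC_PRAGMA
    #pragma alloc_text(PAGE, InitHelper)
    #endif

    NTSTATUS InitHelper(void)
    {
        /* On checked builds, asserts that IRQL is low enough for paging;
           reaching this code paged-out at DISPATCH_LEVEL or above means
           the fault cannot be serviced and the machine bugchecks. */
        PAGED_CODE();

        /* ... one-time initialization work ... */
        return STATUS_SUCCESS;
    }

Forget the pragma (or call the routine at too high an IRQL) and you get exactly the boot-time BSOD described above.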


> without even the most basic level of qualification

That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.


Exactly this is what I was missing in the story. Why not have a limited set of users get it before going live for the whole user base of a mission-critical product like this? It's beyond the comprehension of anyone who has ever come across software bugs (so billions of people). And that's before we even get to the part about not testing internally well, or at all. Some clusterfuck must have happened there, which is still better than imagining that this is the normal way the organization operates. That would be a very scary vision. Serious rethinking of trusting this organization is due everywhere!

But that would require hiring staff to manage the process, and that is money taken away from sponsoring an F1 racing team.

The funniest part was seeing the Mercedes F1 team pit crew staring at BSODs at their workstations[1] while wearing CrowdStrike t-shirts. Some jokes just write themselves. Imagine if they lose the race because of their sponsor.

But hey, at least they actually dogfood the products of their sponsors instead of just taking money to shill random stuff.

[1] https://www.thedrive.com/news/crowdstrike-sponsored-mercedes...


Or Windows could be made to stop loading drivers that are crashing.

Third-party driver/module crashed more than 3 times in a row -> Third-party driver/module is punished and has to be manually re-enabled.
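
A sketch of that circuit breaker, with the counter kept in a plain file for illustration; a real implementation would store it somewhere tamper-resistant and clear it only after a clean boot:

    #include <stdio.h>

    #define MAX_CRASHES  3
    #define COUNTER_FILE "driver_crash_count.txt"   /* illustrative path */

    /* Increment-before-load, clear-after-clean-boot: if the machine
       crashes, the counter stays raised; after three bad loads in a row
       the driver is refused until an admin resets the counter. */
    int driver_may_load(void)
    {
        int crashes = 0;
        FILE *f = fopen(COUNTER_FILE, "r");
        if (f) { fscanf(f, "%d", &crashes); fclose(f); }
        if (crashes >= MAX_CRASHES)
            return 0;                      /* quarantined */

        f = fopen(COUNTER_FILE, "w");      /* assume the worst... */
        if (f) { fprintf(f, "%d", crashes + 1); fclose(f); }
        return 1;                          /* ...later boot code clears it */
    }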


Because CrowdStrike is an EDR solution, it likely has tamper-proofing features (scheduled tasks, watchdog services, etc.) that re-enable it. These features are designed to prevent malware or manual attackers from disabling it.

These features drive me nuts because they prevent me, the computer owner/admin, from disabling it. One person thought up techniques like "let's make a scheduled task that sledgehammers back the knobs these 'dumb' users keep turning", and then everyone else decided to copycat that awful practice.

If you're the admin, I would assume you have the ability to disable Crowdstrike. There must be some way to uninstall it, right?

Not if you want to keep the magic green compliance checkbox!

Are you saying that the compliance rule requires that the software can't be uninstalled? Once it's installed it's impossible to uninstall? No one can uninstall it? I have a hard time believing it's impossible to remove the software. In the extreme case, you could reimage the machine and reinstall Windows without CrowdStrike.

Or are you saying that it is possible to uninstall, but once you do that, you're not in compliance, so while it's technically possible to uninstall, you'll be breaking the rules if you do so?


It's obviously the second option.

The person I originally replied to, rkagerer, said there was some technical measure preventing rkagerer from uninstalling it even though rkagerer has admin on the computer.

I was referring to the difficulty of overriding the various techniques certain modern software like this uses to trigger automatic updates at times outside admin control.

Disabling a scheduled task is easy, but unfortunately vendors are piling on additional, less obvious hooks. E.g. Dropbox recreates its scheduled task every time you (run? update?) it, and I've seen others that utilize the various autostart registry locations (there are lots of them) and non-obvious executables to perform similar "repair" operations. You wind up in "Deny Access" whack-a-mole, and even that isn't always effective. Uninstalling isn't an option if there's a business need for the software.

The fundamental issue is that their developers / product managers have decided they know better than you. For the many users out there who are clueless about IT this may be accurate, but it's frustrating to me and probably to others who upvoted the original comment.


Is what you're saying relevant in the Crowdstrike case? If you don't want Crowdstrike and you're an admin, I assume there are instructions that allow you to uninstall it. I assume the tamper-resistant features of Crowdstrike won't prevent you from uninstalling it.

I cannot find that comment. Care to link it?


An admin can obviously disable a scheduled task... It's not "impossible" to remove the software, just annoying.

It's not obvious - the owner of the computer sets the rules.

If you're the owner, just turn it off and uninstall.

Doesn't malware do that as well?

But what other malware has been as successful? Crowdstrike can rest easy knowing it's taken down many of the most critical systems in the world.

Oh, no, actually, if Crowdstrike WAS malware, the authors would be in prison.. not running a $90B company.


It does. Several CrowdStrike alerts popped when I was remediating systems with the broken driver.

Wouldn't this be an attack vector? Use some low-hanging bug to bring down an entire security module, allowing you to escalate?

It's currently a DOS by the crashing component, so it's already broken the Availability part of Confidentiality/Integrity/Availability that defines the goals of security.

But a loss of availability is so much more palatable than the others, plus the others often result in manually restricting availability anyway when discovered.

I think the wider societal impact from the loss of availability today - particularly for those in healthcare settings - might suggest this isn't always the case

Availability of a system that can’t ensure data integrity seems equally bad though.

Tell that to the millions of people whose flights were canceled, the surgeries not performed, etc etc.

What is the importance of data integrity? If important pre-op data/instructions are missing or get saved on the wrong patient record, causing botched surgeries; if there are misprescribed post-op medications; if there is huge confusion and delays in critical follow-up surgeries because a 100% available system messed up patient data across hospitals nationwide; if there are malpractice lawsuits putting entire hospitals out of business, etc. - then is that fallout clearly worth having an available system in the first place?

How does crowdstrike protect against instructions being saved on the wrong patient’s record?

Huh? We're talking about hypotheticals here. You're saying availability is clearly more important than data integrity. I'm saying that if a buggy kernel loadable module allowed systems to keep on running as if nothing was wrong, but actually caused data integrity problems while the system is running, that's just as bad or worse.

Or anyone who owns CrowdStrike shares.

They’d surely have used some kind of Unix if uptime mattered.

Before you get all smug, recognize that Linux has the exact same architecture; it just wasn't impacted - this time.

Too late, I was born smug.

If Linux and Windows have similar architectural flaws, Microsoft must have some massive execution problems. They are getting embarrassed in QA by a bunch of hobbyists, lol.


I'm sure the people who missed their flights because of this disagree.

Or families of those who die.

> Wouldn't this be an attack vector?

Isn't DoSing your own OS an attack vector? and a worse one when it's used in critical infrastructure where lives are at stake.

There is a reasonable balance to strike, sometimes it's not a good idea to go to extreme measures to prevent unlikely intrusion vectors due to the non-monetary costs.

See: The optimal amount of fraud is non-zero.


In the absence of a Crowdstrike bug, if an attacker is able to cause Crowdstrike to trigger a bluescreen, I assume the attacker would be able to trigger a bluescreen in some other way. So I don't think this is a good argument for removing the check.

That assumes it's more likely than crowdstrike mass bricking all of these computers... this is the balance, it's not about possibility, it's about probability.

I think we're in agreement. I now realize my previous comment replied to the wrong comment. I meant to reply to Lx1oG-AWb6h_ZG0. Sorry.

If you're planning around bugs in security modules, you're better off disabling them - malware routinely use bugs in drivers to escalate, so the bug you're allowing can make the escalation vector even more powerful as now it gets to Ring 0 early loading.

Requires state level social engineering.

Might be why North Koreans are trying to get work-from-home jobs.

https://www.businessinsider.com/woman-helped-north-korea-fin...


It does. CrowdStrike forced itself into the boot process. Normal Windows drivers will be disabled automatically if they cause a crash.

I use Explorer Patcher on a Windows 11 machine. It had such a history of crash loops with Explorer that they implemented this circuit-breaker functionality.

It's baffling how fast and wide the blast radius was for this CrowdStrike update. Quite impressive actually, if you think about it - updating millions of systems that quickly.

Certainly living up to the name

Indeed, far more damage caused than any actual malware!

This was my first thought too. I'm not that familiar with the space, but I would think for something this sensitive the rollout would be staggered at least instead of what looks like globally all at the same time.

This is the bit I am still trying to understand. In CrowdStrike you can define how many updates behind a host stays, i.e. n (latest), n-1 (one behind), n-2, etc. This update was applied to hosts on the 'latest' policy and to the n-2 hosts alike. To me it appears that there was more to this than just a corrupt update; otherwise, how was this policy ignored? Unless the policy doesn't apply to updates at this depth and only covers a small policy aspect, which would also be very concerning.

I guess we won't really know until they release the post mortem...


Yeah, my guess is that they roll out the updates to every client at the same time, and then have the client implement the n-1/2/whatever part locally. That worked great-ish until they pushed a corrupt (empty) update file which crashed the client when it tried to interpret the contents... Not ideal, and obviously there isn't enough internal testing before sending stuff out to actual clients.
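
For comparison, deterministic staged rollout takes very little machinery. A hedged sketch (FNV-1a chosen arbitrarily; names invented): each machine hashes into a fixed bucket, and the vendor raises the percentage only as telemetry from earlier buckets comes back clean.

    #include <stdint.h>

    /* Stable 32-bit hash of a machine identifier (FNV-1a). */
    static uint32_t fnv1a(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
        return h;
    }

    /* Each machine falls into a fixed bucket 0-99; rollout_percent starts
       small (say 1) and grows toward 100 as health signals come back.
       A bad update then bricks 1% of the fleet instead of all of it. */
    int should_update(const char *machine_id, unsigned rollout_percent)
    {
        return fnv1a(machine_id) % 100 < rollout_percent;
    }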

But you do get free worldwide advertising that everyone uses your product. CrowdStrike sure did, and I'm sure they'll use that to sell it to more people.

That is the right way to do it.

> It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.

Discussed elsewhere, it is claimed that the file causing the crash was a data file that was corrupted in the delivery process. So the development team and their CI probably tested a good version, but the customers received a bad one.

If that is true, the problem is first that the driver uses an unsigned file at all, so all customer machines are continuously at risk from local attacks. And second, it does not do any integrity check on the data the file contains, which is a big no-no for all untrusted data, whether in user space or the kernel.


If the file was signed, wouldn't that have prevented the corrupted file from being loaded?

I assume if the signed file was hacked (or parts missing), then it wouldn't pass verification.


> And then it does not do any integrity check on the data it contains, which is a big no no for all untrusted data, whether user space or kernel.

To me, this is the inexcusable sin. These updates should be signed and the signatures validated before the file is read. Ideally validation would also happen before distribution, so that when this file was corrupted in the pipeline, it would have been caught before it ever shipped.

But even with a good signature, when a file is read and the values don’t make sense, it should be treated as a bad input. From what I’ve seen, even a magic bytes header here would have helped.
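
For illustration, a minimal sketch in C of the kind of header check being described (the magic value and layout are invented for the example); an all-null file, as reportedly shipped, would be rejected at the very first comparison:

    #include <stdint.h>
    #include <string.h>

    #define UPDATE_MAGIC 0xC5C0FFEEu   /* hypothetical tag */

    struct update_header {
        uint32_t magic;
        uint32_t payload_len;
    };

    /* Returns 0 only if the buffer looks like a plausible update file. */
    int validate_update(const uint8_t *buf, size_t len)
    {
        struct update_header hdr;
        if (len < sizeof hdr)
            return -1;                      /* truncated file */
        memcpy(&hdr, buf, sizeof hdr);
        if (hdr.magic != UPDATE_MAGIC)
            return -1;                      /* all-zero file dies here */
        if (hdr.payload_len != len - sizeof hdr)
            return -1;                      /* length mismatch */
        return 0;                           /* still parse defensively! */
    }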


Still a staggered roll-out would have reduced the impact.

https://news.ycombinator.com/item?id=41006104#41006555

the flawed data was added in a post-processing step of the configuration update, which is after it's been tested internally but before it's copied to their update servers

per a new/green account


“And so that’s why we recommend using phased rollouts” -Every DevOps engineer from now on

“But that costs us money and time” - some suit.

"And they promise fast threat mitigation... Let allow them to take over EVERYTHING! With remote access, of course. Some form of overwatch of what they in/out by our staff ? Meh... And it even allow us to do cuts in headcount and infra by $<digits_here> a year."

So have we decided to stop using checksums or something?

Perhaps it was the checksum/signature process!

Ya gotta keep checksumming until you find a fixed point.

When something is changed, we usually re-test. That's the whole point of testing anyway. :)

> I didn't know at the time that the Windows kernel was paged.

At uni I had a professor in database systems who did not like written exams and mostly did oral ones. Obviously for DBMSes the page buffer is very relevant, so we chatted about virtual memory and paging. In my explanation I distinguished between kernel space and user space; I am pretty sure I had read that in a book describing VAX/VMS internals. However, the professor claimed that a kernel never pages its own memory. I did not argue the point and passed the exam with the best grade, and I never checked the book again to verify my claim. I have never done any kernel-space development even vaguely close to memory management, so still today I don't know the exact details.

However, what strikes me here: when that exam happened, around 1985, the NT kernel did not exist yet. But IIRC a significant part of the DEC VMS kernel team later went to Microsoft to work on the NT kernel, so perhaps the concept of paging (part of) kernel memory went with them. Whether VMS --> WNT, every letter increased by one, is just a coincidence or intentionally the next baby of those developers, I have never understood. As Linux has shown, today much bigger systems can be handled successfully without the extra complication of paging kernel memory. Whether it's a good idea I don't know; at least it's not a necessary one.


If you want to hear the history of [DEC/VMS] NT from the horse's mouth:

https://www.youtube.com/watch?v=xi1Lq79mLeE


Oh oh, 3 hours 10. I watched around half of it.

The VMS --> WNT acronym relationship was not mentioned; maybe it was just made up later.

One thing I did not know (or maybe did not remember) is that NT was originally developed exclusively for the Intel i860, one of Intel's attempts at RISC. Of course in the late 1980s CISC seemed doomed and everyone was moving to RISC. The code name of the i860 was N10, so that might well be the inside origin of "NT", with the marketing name New Technology retrofitted only later.


Here's a direct link:

https://youtu.be/xi1Lq79mLeE?t=4314

"New Technology", if you want to search the transcript. Per Dave, marketing did not want to use "NT" for "New Technology" because they thought no one would buy new technology.


Actually it was not only the x86 hardware that was unplanned for the NT kernel; the Windows user space was not the first candidate either. POSIX and maybe even OS/2 were earlier goals.

So the current x86 Windows monoculture came about by accident, because the strategically planned options did not materialize. The user-space history should finally debunk the theory that VMS advancing letter-by-letter into WNT was a secret plot by the engineers involved. It was probably a coincidence discovered after the fact.


https://www.usenix.org/system/files/1311_05-08_mickens.pdf

"Perhaps the worst thing about being a systems person is that other, non-systems people think that they understand the daily tragedies that compose your life. For example, a few weeks ago, I was debugging a new network file system that my research group created. The bug was inside a kernel-mode component, so my machines were crashing in spectacular and vindic- tive ways. After a few days of manually rebooting servers, I had transformed into a shambling, broken man, kind of like a computer scientist version of Saddam Hussein when he was pulled from his bunker, all scraggly beard and dead eyes and florid, nonsensical ramblings about semi-imagined enemies. As I paced the hallways, muttering Nixonian rants about my code, one of my colleagues from the HCI group asked me what my problem was. I described the bug, which involved concur- rent threads and corrupted state and asynchronous message delivery across multiple machines, and my coworker said, “Yeah, that sounds bad. Have you checked the log files for errors?” I said, “Indeed, I would do that if I hadn’t broken every component that a logging system needs to log data. I have a network file system, and I have broken the network, and I have broken the file system, and my machines crash when I make eye contact with them. I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood.”


Ah, the joys of trying to come up with creative ways to get feedback from your code when literally nothing is available. Can I make the beeper beep in morse code? Can I just put a variable delay in the code and time it with a stopwatch to know which value was returned from that function? Ughh.

Some of us have worked on embedded systems or board bringup. Scope and logic analyzer ... Serial port a luxury.

IIRC Windows has good support for debugging device drivers via the serial port. Overall the tooling for dealing with device drivers on Windows is not bad, including some special-purpose static analysis tools and some pretty good testing support.
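
For reference, enabling serial kernel debugging on a target box is two bcdedit commands (the COM port and baud rate are whatever your cable dictates), after which WinDbg on the host attaches over the COM port:

    bcdedit /debug on
    bcdedit /dbgsettings serial debugport:1 baudrate:115200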


This is why power users want that standard old two-digit 7-segment display showing the ONE hex code the BIOS writes out at various steps...

When stuff breaks, not if, WHEN it breaks, this at least gives a fighting chance at isolating the issue.
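
Roughly how that works on classic x86 (port 0x80 is the conventional POST-code port; the specific codes below are made up):

    #include <stdint.h>

    /* Firmware writes a byte to I/O port 0x80; a POST card or onboard
       7-segment display latches the last value written. Whatever code is
       frozen on the display when the machine hangs tells you which init
       step died. */
    static inline void post_code(uint8_t code)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(code), "Nd"((uint16_t)0x80));
    }

    void early_init(void)
    {
        post_code(0x12);   /* hypothetical: memory controller configured */
        /* ... */
        post_code(0x34);   /* hypothetical: about to hand off to loader */
    }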


Yeah. Been there, done that. Write to an unused address decode to trigger the logic analyzer when the code reached a specific point, so I could scroll back through the address bus and figure out what the program counter had done to get to that piece of code.

Old school guys at my first job could send the contents of the program counter to the speaker, and diagnose problems by the sound of it.
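
The address-decode trigger trick looks something like this sketch (the trigger address is whatever unused decode the board exposes, so it's hypothetical here):

    #include <stdint.h>

    /* An otherwise-unused address decode, wired as the analyzer's trigger. */
    #define TRIGGER_ADDR ((volatile uint32_t *)0x0F000000u)

    static inline void trace_mark(uint32_t tag)
    {
        *TRIGGER_ADDR = tag;   /* analyzer triggers on this bus cycle;
                                  scroll back to see how you got here */
    }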

Definitely Old School Cool


I call this "throwing dye in the water".

I certainly used beeping for debugging more than once! : - )

Quoting James Mickens is always the winning move. I recommend the entire collection of his wisdom, https://mickens.seas.harvard.edu/wisdom-james-mickens

James Mickens’s Monitorama 2014 presentation had me laughing to the point of tears. “Look, a word cloud!”

Title: "Computers are a Sadness, I am the Cure" https://vimeo.com/95066828


Say "word count" one more time!

Somebody get this man a serial port, or maybe a PC Speaker to Morse out diagnostics signals.

That's beautiful.

This is an interesting piece of creative writing, but virtual machines already existed in 2013. There are very few reasons to experiment on your dev machine.

OS / driver development needs to be done on bare metal sometimes.

At the time, Mickens worked at Microsoft Research, and with the Windows kernel development team. There may only be a few reasons to experiment on your dev machine, but that's one environment where they have those reasons.

Sometimes you have to debug on a real machine. When you do, you'd usually use a serial port for your debug output. Everything has one.

>Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009.

Hello from a fellow BitLocker dev from this time! I think I know who this is, but I'm not sure and don't want to say your name if you want it private. Was one of your Win10 features implementing passphrase support for the OS drive? In any case, feel free to reach out and catch up. My contact info is in my profile.


Win8. I've been seeing your blog posts show up here and there on HN over the years, so I was half expecting you to pick up on my self-doxx. I'll ping you offline.

"It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification."

It was my understanding that MS now signs 3rd-party kernel-mode code, with quality requirements. In which case, why did they fail to prevent this?


Drivers have had to be signed forever and pass pretty rigorous test suites and static analysis.

The problem here is obviously this other file the driver sucks in. Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...


There’s a design problem here if the driver can’t be self-contained in such a way that it’s possible to roll back the kernel to a known good state.

How so? Preventing roll-backs of software updates is a "security feature" in most cases, for better and for worse. Yeah, it would be convenient for tinkerers or in rare events such as these, but it would be a security issue the other 99.99% of the time for enterprise users, where security is the main concern.

I don't really understand this, many Linux distributions like Universal Blue advertise rollbacks as a feature. How is preventing a roll-back a "security feature"?

Imagine a driver has an exploitable vulnerability that is fixed in an update. If an attacker can force a rollback to the vulnerable older version, then the system is still vulnerable. Disallowing the rollback fixes this.
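
A minimal sketch of that anti-rollback ratchet, assuming some tamper-resistant store for the version floor (a TPM NV index, fuses, etc.); the two storage helpers are hypothetical:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers backed by tamper-resistant storage. */
    extern uint32_t read_min_allowed_version(void);
    extern void persist_min_allowed_version(uint32_t v);

    bool allow_install(uint32_t candidate)
    {
        /* refuse anything older than the ratchet, even if correctly signed */
        return candidate >= read_min_allowed_version();
    }

    void on_install_committed(uint32_t installed)
    {
        /* ratchet forward so a known-vulnerable build can't come back */
        persist_min_allowed_version(installed);
    }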

ohh

> Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...

    "What are you complaining about? It works on my machine."™

> In which case why did they fail to prevent this?

"Oh, crowdstrike? Yeah, yeah, here's that Winodws kernel code signing key you paid for."


You can pay for it and sign a file full of null characters. Signing has nothing to do with quality from what I understand.

"Yours sincerely,

Crowdstrike

---

PS - If you get hit by some massive crash, we refer you to our company's name. What were you expecting?"


[flagged]


Please explain this comment. How is the Crowdstrike incident related to the Key Bridge collision?

I think he's implying there was some sort of conspiracy by foreign actors.

This is what I don’t get: it’s extremely hard for me to believe this didn’t get caught in CI when things started bluescreening. Everywhere I’ve worked, test rebooting/power-cycling was part of CI, with various hardware configs, and ran before even our lighthouse customers saw anything.

What makes you think they have CI after what happened?

Apparently the flaw was added to the config file in post-processing after it had completed testing. So they thought they had testing, but actually didn't.

Disgruntled employee trying to use Crowd Strike to start a General Strike?

I was thinking: this doesn't seem like a case where only machines on some old or specific version of Windows are having issues, such that QA just missed one particular variant in their smoke testing. It seems like it's every Windows instance with that software, so either they don't have basic automated testing, or someone pushed this outside of the normal process.

> Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.

Up the chain to automated test machines, right?


You would think automated tests would come before your teammates' workstations / a commit to head.

Did I mention this was 15 years ago? Software development back then looked very different than it does now, especially in Wincore. There was none of this "Cloud-native development" stuff that we all know and love today. GitHub was just about 1 year old. Jenkins wouldn't be a thing for another 2 years.

In this case the "automated test" flipped all kinds of configuration options with repeated reboots of a physical workstation. It took hours to run the tests, and your workstation would be constantly rebooting, so you wouldn't be accomplishing anything else for the rest of the day. It was faster and cheaper to require 8 devs to rollback to yesterday's build maybe once every couple of quarters than to snarl the whole development process with that.

The tests still ran, but they were owned and run by a dedicated test engineer prior to merging the branch up.


Sorry, the comment wasn't meant to be a personal judgement on you.

Jenkins was called Hudson from 2005 until 2011, and version control is much, much older.

I'm surprised you didn't have two or more workstations.


I'm completely ignorant on the topic but isn't rebooting a default test for kernel code, given how sensitive it is?

Oh I rebooted, I just didn't happen to have the right configuration options to invoke the failure when I rebooted. Not every dev workstation was bluescreening, just the ones with the particular feature enabled.

But as someone already pointed out, the issue was seen on all kinds of Windows hosts, not just ones running a specific version, specific update, etc.

That sounds like it was caught by luck, unless there was some test explicitly with that configuration in the QA process?

A lot of QA, especially at the system level, is just luck. That’s why it’s so important to dogfood internally imho.

And by internally I don’t just mean the development team, but anyone and everyone at the company who is allowed to have access to early builds.


There's "something that requires highly specific conditions managed to slip past QA" and then there's "our update brought down literally everyone using the software". This isn't a matter of bad luck.

Maybe through luck they're gonna uncover another xz-utils-style backdoor, MS edition, but it's probably gonna get covered up because, Microsoft.

What does this mean?

Windows kernel paged, Linux non-paged?


The memory used by the Windows kernel is either Paged or Non-Paged. Non-Paged means the memory is pinned in physical RAM. Paged means it might be swapped out to disk and paged back in when needed. OP was working on BitLocker, a file system driver, which handles disk IO. It must be pinned in physical RAM to be available at all times; otherwise, if it's paged out, an incoming IO request would find the driver code missing from memory and try to page it in, which triggers another IO request, creating an infinite loop. The Windows kernel crashes at that point to prevent a runaway system and stops at the point of failure to let you find the problem.
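
In driver code the distinction looks like this (ExAllocatePool2 and the pool flags are the actual WDK API; the tags are arbitrary):

    #include <wdm.h>

    VOID PoolExample(VOID)
    {
        /* Paged pool: may be swapped out. Touching it at IRQL >=
           DISPATCH_LEVEL risks exactly the bugcheck described above. */
        PVOID pageable = ExAllocatePool2(POOL_FLAG_PAGED, 4096, 'xEgP');

        /* Non-paged pool: pinned in physical RAM, safe to touch from
           the disk IO path. */
        PVOID pinned = ExAllocatePool2(POOL_FLAG_NON_PAGED, 4096, 'xEpN');

        if (pageable) ExFreePoolWithTag(pageable, 'xEgP');
        if (pinned)   ExFreePoolWithTag(pinned, 'xEpN');
    }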

Thank you!

Linux is a bit unusual in that kernel memory is generally physically mapped, and unless you use vmalloc, any memory you allocate has to correspond to pages backed by RAM. This also ties into how file IO happens, how swapping works, and how Linux's approach to IO is actually closer to Multics and OS/400 than to OG Unix.

Many other systems instead default to using the full power of virtual memory, including swapping kernel space to disk, with only the things that explicitly need to stay in RAM being allocated from "non-paged" or "wired" memory.

EDIT: fixed spelling thanks to writing on phone.
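
The two real Linux kernel allocators being contrasted; note that neither one's memory is ever swapped to disk:

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    void alloc_example(void)
    {
        /* kmalloc: physically contiguous memory from the direct map */
        void *k = kmalloc(4096, GFP_KERNEL);

        /* vmalloc: virtually contiguous mapping over scattered pages,
           for larger allocations that don't need physical contiguity */
        void *v = vmalloc(1 << 20);

        kfree(k);
        vfree(v);
    }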


Linux kernel memory isn’t paged out to disk, while Windows kernel memory can be: https://knowledge.broadcom.com/external/article/32146/third-...

Has that changed? I remember always creating a swap partition that was meant to be at least the size of RAM

I do not mean this to be blamey in any way shape or form and am asking only about the process:

Shouldn’t that have been caught in code review?


My manager actually blamed the more senior developer who reviewed my code for that one.

Must have been DNS... when they did the deployment run and the necessary code was pulled and the DNS failed and then the wrong code got compiled...</sarcasm>

That they don't even do staged/A-B pushes was also <mind-blown-away>.

But the most.... ironical was: https://www.theregister.com/2024/07/18/security_review_failu...


So the key test, the test that was not run, was to turn the machine off and on again? Classic Windows.
