CrowdStrike Update: Windows Bluescreen and Boot Loops (reddit.com)
4480 points by BLKNSLVR 5 days ago | 3847 comments



Some Canonical guy, I think, mentioned this as their sales strategy a few years ago after a particularly nasty Windows outage:

We don't ask customers to switch all systems from Windows to Ubuntu, but to consider moving maybe a third to Ubuntu, so they won't sit completely helpless the next time Windows fails spectacularly.

While I see more and more Ubuntu systems, and have recently even spotted Landscape in the wild, I don't think they were as successful with that strategy as they hoped.

That said, maybe there is a silver lining to today's clouds, both WRT Ubuntu and Linux in general, and WRT IT departments pausing to reconsider some security best practices.


Except further up this thread another poster mentions that CrowdStrike took down their Debian servers back in April as well. As soon as you're injecting third-party software with self-triggered updates into your critical path, you're vulnerable to the quality (or lack thereof) of that software, regardless of platform.

Honestly your comment highlights one of the few defenses... don't sit all on one platform.


Sure, but note the sales pitch was to encourage resiliency through diversity. While that may not help in cases where one vendor pushes the same breaking change to multiple platforms, it can still help in others. I remember doing some work with a mathematics package under Solaris while in university, while my peers were using the same package under Windows. Both had the same issue, but the behaviour was different. Under Solaris, it was possible to diagnose since the application crashed with useful diagnostic information. Under Windows, it was impossible to diagnose since it took out the operating system and (because of that) it was unable to provide diagnostic information. (It's worth noting that I've seen the opposite happen as well, so this isn't meant to belittle Windows.)

Yes, I already heard one manager at my company today say they're getting a Mac for their next computer. That's great; the whole management team shouldn't be on Windows. The engineering team is already pretty diversified between Mac, Windows, and Linux. The next one might take down all 3, but at least we tried to diversify the risk.

Yep, these episodes are the banana monoculture [0] applied to IT. The solution isn't to use this vendor or avoid that vendor, it's to diversify your systems such that you can have partial operability even if one major component is down.

[0] https://en.m.wikipedia.org/wiki/Gros_Michel_banana


> don't sit all on one platform.

Debian has automatic updates but they can be manual as well. That's not the case in Windows.

The best practice for security-critical infrastructure in which people's lives are at stake is to install some version of BSD stripped down to its bare minimum. But then the company has to pay for much more expensive admins. Windows admins are much cheaper and more plentiful.

Also, as a user of Ubuntu and Debian for more than a decade, I have a hunch that this will not happen in India [1].

[1] https://news.itsfoss.com/indian-govt-linux-windows/


Windows updates can definitely be manual. And anyway, this was not a Windows update. It was a CrowdStrike update.

Oh, I thought it was tied to OS updates. So Windows is not to blame, if that's the case.

Well, in another sense, Windows is certainly partially to blame. Several technical solutions have been put forward here and elsewhere that would have at least limited the blast radius of a faulty update or driver in the critical path. Windows didn't implement any of them. Presumably by choice and for good reasons: the tradeoff would be that software like CrowdStrike is more limited in how it can protect you. So the Windows devs deliberately opted for this risk.

Or they never considered it, which is far worse.


Hopefully they won't botch the update for two operating systems at the same time. But yeah. Hope.

Yeah, I see a lot of noise on social media blaming this on Microsoft/Windows... but AFAIK if you install a bad kernel driver into any major OS the result would be the same.

The specifics of this CrowdStrike kernel driver (which AFAIK is intended to intercept and log/deny syscalls depending on threat assessment?) mean that this is badnewsbears no matter which platform you're on.

Like sure, if an OS is vulnerable to kernel panics from code in userland, that's on the OS vendor, but this level of danger is intrinsic to kernel drivers!


> AFAIK if you install a bad kernel driver into any major OS the result would be the same

Updates should not be destructive. Linux doesn't typically overwrite previous kernels, and bootloaders let users choose a kernel during startup.

Furthermore, an immutable OS makes rollback trivial for the entire system, not just the kernel (reboot, select previous configuration).

I hope organizations learn from this, and we move to that model for all major OSes.

Immutability is great, as we know from functional programming. Nix and Guix are pushing these ideas forward, and other OSes should borrow them.
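To make that rollback model concrete, here's a rough sketch (Python, all names invented; not any real tool's API) of generation-based selection, the idea Nix-style systems build on: every update appends a new immutable generation, and rolling back is just booting an older entry.

  # Sketch only: generation-based rollback as in NixOS-style immutable systems.
  # "generations" is a list of immutable system configurations, oldest first.
  def select_generation(generations, last_boot_ok):
      if last_boot_ok or len(generations) == 1:
          return generations[-1]          # normal case: newest generation
      # Last boot failed: fall back to the previous, known-good generation.
      # Nothing was overwritten, so this is a pointer change plus a reboot.
      return generations[-2]

The point isn't the code; it's that updates append rather than overwrite, so "undo" is always available.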


It's interesting to me that lay people are asking the right questions, but many in the industry, such as the parent here, seem to just accept the status quo. If you want to be part of the solution, you have to admit there is a problem.

True; except here's what's baffling:

CloudStrike only uses a kernel level driver on Windows. It's not necessary for Mac, it's not necessary for Linux.

Why did they feel that they needed kernel level interventions on Windows devices specifically? Windows may have some blame there.


Apple deprecated kernel extensions with 10.15 in order to improve reliability and eventually added a requirement that end users must disable SIP in order to install kexts. Security vendors moved to leverage the endpoint security framework and related APIs.

On Linux, eBPF provides an alternative and, I assume, plenty of advantages over trying to maintain kernel-level extensions.

I haven’t researched, but my guess is that Microsoft hasn’t produced a suitable alternative for Windows security vendors.
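For a sense of what the eBPF route looks like, here's a minimal sketch using the BCC Python bindings (assuming bcc is installed and it's run as root). The probe is checked by the in-kernel verifier before it loads, so a bug in the tool stays in the tool instead of blue-screening the box:

  # Minimal BCC sketch: observe execve calls from user space via an eBPF kprobe.
  from bcc import BPF

  prog = r"""
  int trace_exec(void *ctx) {
      bpf_trace_printk("execve observed\n");
      return 0;
  }
  """

  b = BPF(text=prog)   # compiles the probe; the kernel verifier rejects unsafe code
  b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")
  print("Tracing execve... Ctrl-C to stop")
  b.trace_print()      # stream the kernel trace pipe to stdout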


> Why did they feel that they needed kernel level interventions on Windows devices specifically?

Maybe because everyone else in "security" and DRM does it, so they figured this is how it's done and they should do it too?

My prior on competence of "cybersecurity" companies is very, very low.


> My prior on competence of "cybersecurity" companies is very, very low.

Dmitri Alperovitch agrees with you.[0] He went on record a few months back in a podcast, and said that some of the most atrocious code he has ever seen was in security products.

I am certain he was implicitly referring, at least in part, to some of the code seen inside his past company's own code base.

0: https://nationalsecurity.gmu.edu/dmitri-alperovitch/ ["Co-founder and former CTO of Crowdstrike"]


> Maybe because everyone else in "security" and DRM does it, so they figured this is how it's done and they should do it too?

What DRM uses kernel drivers? And how do you plan to prevent malware from usermode?


> CloudStrike ONLY uses a kernel level driver on Windows

Crowdstrike uses a kernel level driver ONLY on Windows.


CrowdStrike uses a kernel level driver on Windows ONLY.

Even better..

ONLY on Windows does CrowdStrike use a kernel level driver.


Yeah, I think your point is totally valid. Why does CrowdStrike need syscall access on Windows when it doesn't need it elsewhere?

I do think there's an argument to be made that CrowdStrike is more invasive on Windows because Windows is intrinsically less secure. If this is true then yeah, MSFT has blame to share here.


I don't know about MacOS, but at least as recently as a couple years ago crowdstrike did ship a Linux kernel module. People were always complaining about the fact that it advertised the licensing as GPL and refused to distribute source.

I imagine they've simply moved to eBPF if they're not shipping the kernel module anymore.


I haven't looked too deeply into how EDRs are implemented on Linux and macOS, but I'd wager that CrowdStrike goes the way of its own bit of code in kernel space to overcome shortcomings in how ETW telemetry works. It was never meant for security applications; ETW's purpose was to aid in software diagnostics.

In particular, while it looks like macOS's Endpoint Security API[0] and Linux 4.x's inclusion of eBPF are both reasonably robust (if the literature I'm skimming is to be believed), ETW is still pretty susceptible to blinding attacks.

(But what about PatchGuard? Well, as it turns out, that doesn't seem to keep someone from loading their own driver and monkey patching whatever WMI_LOGGER_CONTEXT structures they can find in order to call ControlTraceW() with ControlCode = EVENT_TRACE_CONTROL_STOP against them.)

0: https://developer.apple.com/documentation/endpointsecurity


Non-hardware "drivers" which cause a BSOD should be disabled automatically on the next boot.

Windows offers its users nothing here.
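Conceptually a crash counter is all it takes. A rough sketch of the policy (pure illustration with invented names; this is not an existing Windows mechanism):

  # Policy sketch: disable a non-hardware boot-start driver after it takes down
  # two boots in a row. "state" is persisted across boots (e.g. on disk).
  CRASH_LIMIT = 2

  def begin_boot(state):
      for driver, info in state.items():
          if info.get("pending"):                 # previous boot never finished
              info["crashes"] = info.get("crashes", 0) + 1
          info["pending"] = True                  # cleared by finish_boot()
      # Only load drivers that haven't hit the limit.
      return [d for d, info in state.items() if info.get("crashes", 0) < CRASH_LIMIT]

  def finish_boot(state):
      for info in state.values():                 # boot reached a healthy state
          info["pending"] = False
          info["crashes"] = 0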


You can also make rollback easy. Just load the config before the one where you took the bad update.

Of course that means putting the user in control of when they apply updates, but maybe that would be a good thing anyway.


Linux and open source also have the potential to be far more modular than Windows is. At the moment we have airport display boards running a full Windows stack, including anti-virus/spyware/audit etc., just to display a table ... madness

I'm a Kubuntu user who, seemingly due to Canonical's decision to ship untested software regularly, has been repeatedly hit by problems with snaps: initially basic, obvious, and widespread issues with major software.

Yes, distribute your eggs, but check the handles on the baskets being sold to you by the guy pointing out bad handles.


FWIW, while some people like Kubuntu, I have had much better results with KDE Neon.

Stable Ubuntu core under the surface, and everything desktop related delivered by the KDE team.


Thanks for the tip, I'm looking to jump ship to MX-Linux, just procrastinating the move right now.

Still haven't forgiven Ubuntu for pushing a bad kernel of their own that caused a boot loop if you used containers...

I’ll never forgive them for the spyware they turned on by default in their desktop stuff. It wasn’t the worst thing in the world, but they’re also the only major distro to ever do it, so Ubuntu (and Canonical as a whole) can get fucked, imo.



That's a long grudge to hold over a feature that was reconsidered and removed.

To say it rather politely, the mindset exposed by introducing this feature is unlikely to go away.

As shown by Mozilla.

Maybe, but Canonical didn't learn and are back to pushing advertising and forcing unwanted changes.

I started with RH (Non-EL) back in the mid-to-late 90s, and switched to Gentoo as soon as one of my best (programmer) friends gushed about how much better of an admin it had made them[0], so I started down that path - by the time AWS appeared, we were both automating everything, using build (pump) servers, etc. I like Debian, a lot - really! I think apt is about the best package manager for non-technical users, and the packages that were available without having to futz with keyrings were great.

Ubuntu spent a lot of time, talent, and treasure on trying to migrate people off Windows instead of being a consistent, great OS. It is still with great dread that I open the docs for some new package/program linked from HN or elsewhere; dread that the first instruction is "start with ubuntu 18.04|20.04".

[0] They actually maintained the unofficial Gentoo AWS images for over a decade. Unsure if they still do; it could be automated to run a new build every quarter. https://github.com/genewitch/gentoo/blob/master/gentoo_auto.... (a really old version of the script I keep to remind me that automation is possible with nearly everything...)


Canonical has some of the most ridiculous IT job postings I’ve come across. It just sounds like a bananas software shop. Didn't give me much confidence in whatever they're cooking up in there.

Not really.

Sure but if that Canonical sales person was successful in that, I'd almost guarantee that after they switched the first third they'd be in there arguing to switch out the rest.

Absolutely.

I'm just saying what they said their strategy was, not judging their sales people.


Many years ago an Ubuntu tech sales guy demoed their (OpenStack?) self-hosted cloud offering; his laptop was running Windows...

Canonical in particular are no better; they do the exact same thing with that aberration called snap. They have brought entire clusters down before with automatic updates.

Seems like a reasonable strategy. Not just Ubuntu but some redundancy in some systems.

Ubuntu has unattended-upgrades enabled by default

Yes, but by default the only origin enabled for it is the ${distro_codename}-security pocket.

But CrowdStrike is security as well?

Yes, but it's not included in the upstream Ubuntu security repository. In fact, it's not available via any repository AFAIK. It updates itself by fetching new versions from the CrowdStrike backend according to your update policy for the host in question. However, as we've learned these past days, that policy does not apply to the "update channel" files...

Things are so interdependent that in this scenario you might just end up with a crashed system if either Windows or Ubuntu goes down, instead of only the one you chose.

Read on Mastodon: https://infosec.exchange/@littlealex/112813425122476301

The CEO of Crowdstrike, George Kurtz, was the CTO of McAfee back in 2010 when it sent out a bad update and caused similar issues worldwide.

If at first you don't succeed, .... ;-) j/k


If anything, this just shows how short-term our memory is. I imagine crowdstrike stock will be back to where it was by the end of next week.

I bet they don't even lose a meaningful amount of customers. Switching costs are too high.

A real shame, and a good reminder that we don't own the things we think we own.


> this just shows how short-term our memory is.

I've been out of IT proper for a while, so to me, I had to ask "the Russiagate guys are selling AV software now?"


I don't partake in the stock market these days, but this is the kind of event that you can make good money betting the price will come back up.

When a company makes major headlines for bad news like this, investors almost always overreact and drive the price too far down.


I dunno. The stock price will probably dead cat bounce, but this is the sort of thing that causes companies to spiral eventually.

They just made thousands of IT people physically visit machines to fix them. Then all the other IT people watched that happen globally. CTOs got angry emails from other C-levels and VPs. Real money was lost. Nobody is recommending this company for a while.

It may put a dent in Microsoft as splash damage.


>It may put a dent in Microsoft as splash damage.

I have a feeling that Microsoft's PR team will be able to navigate this successfully and Microsoft might even benefit from this incident as it tries to pull customers away from CrowdStrike Falcon and into its own EDR product -- Microsoft Defender for Endpoint.


My (very unprofessional) guess here is that investors in the near term will discount the company too heavily and the previously overvalued stock will blow past a realistic valuation and be priced too low for a little while. The software and company aren't going anywhere as far as I can tell, they have far too much marketshare and use of CrowdStrike is often a contractual obligation.

That said, I don't gamble against trading algorithms these days and am only guessing at what I think will happen. Anyone passing by, please don't take random online posts as financial advice.


After yesterday, CRWD is still up more than the S&P since the start of the year, and both are up insane amounts.

The stock market is unrelated to reality.


Honestly, this makes me angry. If we had a sense of justice in this world, this would devastate them financially.

With a P/E of over 573? Doubt it will recover that fast.

Worth $3.7B, paid $148M in 2022.

Edited to add: I wonder what the economic fallout from this will be? 10x his monetary worth? 100x? (not trying to put a price on the people who will die because of the outage; for that he and everyone involved need to go to jail)


Nothing at all.

He will be the guy that convinced the investors and stakeholders to pour more money into the company despite some world-wide incident.

He deserves at least 3x the pay.

PS: look at the stocks! They sank, and now they are gaining value again. People can't work, people die, flights get delayed/canceled because of their software.


Regarding the stock. I'm sure people are "buying the dip".

From an investing perspective, that's fairly foolish until the financial liability of the company has been assessed.

Time will tell whether it's foolish or not.

So much seems based on sentiment now; it might not matter as much as it would have 15 years ago.

If you invest based on fundamentals and company finances, you probably haven't had many chances to buy any positions in the last decade. Stock prices are completely unhinged from company financial reports.

"This is just a demonstration about how critical our software is and how undervalued we are. If the whole world economy requires us to run, we should be worth more" /s

I am still waiting for someone to say how generative AI (= chatbots/Copilot) would have solved this problem, or even never let it happen...

Generative AI would have mandated QA as best practice and to limit corporate liability.

Yeah well, obviously they don't need AI for that.

"We have unit tests coverage of 99%".

Let's be honest. :)


>I run a software company that deploys antivirus software automatically to customers' computers and servers. What should I be aware of as part of this and what actions should I ensure occur?

I gave it this question, and part of the steps included deploying outside peak hours to avoid potential disruptions (such as the weekend for businesses); including a way to detect deployment issues and easily roll back the software in case there is an issue; and testing the deployment in a controlled environment before rolling it out to all customers (this would include doing a small rollout before a larger one). All of these actions are very standard and all would have greatly helped in this situation.


Crowdstrike (Falcon) has all possible compliance certifications (they are PCI, FedRAMP, ISO-27001, etc.), and these often include following standard practices of SDLC, such as testing, deploying, incident management, etc. And they are much more detailed than the answer from AI.

My suspicion, as someone pointed out, is that they deployed this update as a small patch, maybe bypassing some controls and cutting some corners. I really wouldn't blame it on lack of automated tests, etc. Only crowdstrike people know how this thing possibly made it to production.


>What certifications should I have?

ChatGPT: Obtaining relevant certifications can enhance the credibility of your antivirus software and reassure customers about its security, reliability, and compliance with industry standards. Here are some key certifications you should consider:

1. Security Certifications:
   - ISO/IEC 27001: This certification specifies the requirements for establishing, implementing, maintaining, and continually improving an information security management system (ISMS). It demonstrates your commitment to managing and protecting sensitive information.
   - ISO/IEC 15408 (Common Criteria): This certification provides assurance that the security features of your software have been rigorously tested and meet internationally recognized standards.
2. Quality Management Certifications:
   - ISO 9001: This certification focuses on quality management systems and ensures that your processes consistently provide products that meet customer and regulatory requirements.
   - ISO/IEC 25010: This standard specifies a model for evaluating software quality, ensuring that your software meets high standards of functionality, reliability, usability, efficiency, maintainability, and portability.
3. Data Privacy Certifications:
   - GDPR Compliance: If you operate in or serve customers in the European Union, demonstrating compliance with the General Data Protection Regulation (GDPR) is essential.
   - CCPA Compliance: For customers in California, compliance with the California Consumer Privacy Act (CCPA) is important.
4. Industry-Specific Certifications:
   - HIPAA Compliance: If your software handles protected health information (PHI) in the United States, ensuring compliance with the Health Insurance Portability and Accountability Act (HIPAA) is crucial.
   - PCI DSS: If your software processes payment card information, compliance with the Payment Card Industry Data Security Standard (PCI DSS) is necessary.
5. Product-Specific Certifications:
   - ICSA Labs Certification: ICSA Labs, an independent division of Verizon, provides testing and certification for security products, including antivirus software. Certification from ICSA Labs can validate the effectiveness of your software.
   - AV-TEST Certification: The AV-TEST Institute is an independent research institute that evaluates and certifies antivirus software for its protection, performance, and usability. An AV-TEST certification can enhance your product’s credibility.
   - VB100 Certification: Virus Bulletin (VB) offers the VB100 certification, which is awarded to antivirus products that meet specific criteria for detecting known malware without generating false positives.
   - OPSWAT Certification: OPSWAT offers certifications that validate the interoperability and effectiveness of cybersecurity products, including antivirus software.
6. Environmental and Occupational Health Certifications:
   - ISO 14001: This certification demonstrates your commitment to environmental management, which can be important for corporate social responsibility (CSR) and sustainability.
   - ISO 45001: This certification focuses on occupational health and safety management systems, ensuring a safe and healthy workplace for your employees.


"I asked ChatGPT how to prevent this and here's what it said. <generic bullet points about software testing> AI would have prevented this."

Every major outage when I worked at Google caused stock price to increase for this exact reason.

IT is always considered a cost until execs realize it's critical to the company's existence. Keeping the lights on always seems to be undervalued. :(


You’re joking but I actually think this is part of how the CEO will frame things to investors.

Kurtz's response blaming the customer on X is ridiculous. He will probably find another company to hire him as CEO though. Just an upside-down world in the C-suite.

Don't forget the golden parachute. These guys always seem to fail upward.

That guy is gonna fail all the way right up to the top. Sheesh.

who is hiring these fucking idiots? they need to be blacklisted

Crowdstrike is run by humans just like you and me. One mistake doesn’t mean they are completely incompetent.

> One mistake doesn’t mean they are completely incompetent.

They are completely incompetent, because for something as critical as CrowdStrike's code you must build so many layers of validation that one, two or three mistakes don't matter, because they will be caught before the code ends up in a customer system.

Looks like they have so little validation that one mistake (which is by itself totally normal) can end up bricking large parts of the economy without ever being caught. Which is neither normal nor competent.


Except this isn’t one mistake. Writing buggy code is a mistake. Not catching it in testing, QA, dogfooding or incremental rollouts is a complete institutional failure

Mistakes are perfectly fine, that's why multiple layers of testing exist

> Mistakes are perfectly fine, that's why multiple layers of testing exist

Indeed. Or in the case of crowdstrike, should exist. Which clearly doesn't for them.


The CTO with a shitty track record, not the line employees. He deserves zero reprieve

Reminds me of Phil Harrison, who always seems to find himself in an executive position botching launches of new video game platforms: PlayStation 3, Xbox One, Google Stadia.

CXOs usually have deep connections and great contracts (golden parachutes, etc.) that make them extremely difficult to fire and easy to hire :)

He founded the company

I didn’t understand why, in 2010, it didn’t seem to make much news…

Took out the entire company where I worked.

People thought it was a worm/virus: a few minutes after plugging in a laptop, McAfee got the DAT update and quarantined the file, which caused Windows to start a countdown and reboot (leading to endless BSODs).


Yet another successful loser who somehow continues to ascend corporate ranks despite poor company performance. Just shows how disconnected job performance is from C-suite peer reviews, a glorified popularity contest. Should add the unity and better.com folk here

Eh. To be fair, the higher profile your job is, the more likely you'll be the face of one of these in your career.

Ok but he faced two

“There's an old saying in Tennessee — I know it's in Texas, probably in Tennessee — that says, fool me once, shame on — shame on you. Fool me — you can't get fooled again.”

- GWB


fool me once...

This event is predicted in Sidney Dekker's book "Drift into Failure", which basically postulates that in order to prevent local failures we set up failure-prevention systems that increase complexity beyond our ability to handle it, and that introduce systemic failures that are global. It's a sobering book to read if you ever thought we could make systems fault tolerant.

We need more local expertise is really the only answer. Any organization that just outsources everything is prone to this. Not that organizations that don't outsource aren't prone to other things, but at least their failures will be asynchronous.

Funny thing is that for decades there were predictions about a need for millions more IT workers. It was assumed companies needed local knowledge. Instead what we got was more and more outsourced systems and centralized services. Today's outage is one of the many downsides.

Two weeks ago it was just about all car dealers

The problem here would be that there aren't enough people who can provide the level of protection a third-party vendor claims to provide, and a person (or persons) with a comparable level of expertise would likely be much more expensive. So companies who do their own IT would be routinely outcompeted by ones that outsource, only for the latter to get into trouble when the black swan swoops in. The problem is that all other kinds of companies are mostly extinct by then, unless their investors had some super-human foresight and discipline to invest for years into something that year after year looks like losing money.

> The problem here would be that there aren't enough people who can provide the level of protection a third-party vendor claims to provide, and a person (or persons) with a comparable level of expertise would likely be much more expensive.

Is that because of economies of scale or because the vendor is just cutting costs while hiding their negligence?

I don't understand how a single vendor was able to deploy an update to all of these systems virtually simultaneously, and _that_ wasn't identified as a risk. This smells of mindless box checking rather than sincere risk assessment and security auditing.


Kinda both, I think, with the addition of the principal-agent problem. If you've found a formula that provides the client with an acceptable CYA picture, it is very scalable. And the model of "IT person knowledgeable in security, modern threats and the company's business" is not very scalable. The former, as we now know, is prone to catastrophic failures, but those are rare enough for a particular decision-maker to not be bothered by them.

> the vendor is just cutting costs while hiding their negligence?

That's how it works.


Depressing thought that this phenomenon is some kind of Nash equilibrium. That in the space of competition between firms, the equilibrium is for companies to outsource IT labor, saving on IT costs and passing those savings on to whatever service they are providing. -> Firms that outsource out-compete their competition + expose their services to black-swan catastrophic risk. Is regulation the only way out of this, from a game theory perspective?

Depressing, but a good way to think about it.

The whole market in which crowdstrike can exist is a result of regulation, albeit bad regulation.

And since the returns of selling endpoint protection are increasing with volume, the market can, over time, only be an oligopoly or monopoly.

It is a screwed market with artificially increased demand.

Also, the outsourcing is not only about cost and compliance. There is at least a third force. In a situation like this, no CTO who bought CrowdStrike products will be blamed. He did what was considered best industry practice (the box-ticking approach to security). From their perspective it is risk mitigation.

In theory, since most security incidents (not this one) involve the loss of personal customer data, if end customers were willing to pay a premium for proper handling of their data, AND if firms that don't outsource and instead pay for competent administrators within their hierarchy had a means of signaling that, the equilibrium could be pushed to where you would like it to be.

Those are two very questionable ifs.

Also how do you recognise a competent administrator (even IT companies have problems with that), and how many are available in your area (you want them to live in the vicinity) even if you are willing to pay them like the most senior devs?

If you want to regulate the problem away, a lot of influencing factors have to be considered.


It has been exactly the same with outsourcing production to China...

Also a major point in The Black Swan. In it, Taleb argues that it is better for banks to fail more often than for them to be protected from any adversity; protected, they eventually become "too big to fail". If something is too big to fail, you are fragile to a catastrophic failure.

I was wondering when someone would bring up Taleb RE: this incident.

I know you aren't saying it is, but I think Taleb would argue that this incident, as he did with the coronavirus pandemic for example, isn't even a Black Swan event. It was extremely easy to predict, and you had a large number of experts warning people about it for years but being ignored. A Black Swan is unpredictable and unexpected, not something totally predictable that you decided not to prepare for anyways.


I think Grey Rhino is the term to use. Risks that we can see and acknowledge yet do nothing about.

That is interesting; where does he talk about this? I'm curious to hear his reasoning. What I remember from The Black Swan is that Black Swan events are (1) rare, (2) have a non-linear/massive impact, and (3) are easy to predict retrospectively. That is, a lot of people will say "of course that happened" after the fact but were never too concerned about it beforehand.

Apart from a few doomsayers, I am not aware of anybody who was warning us about a CrowdStrike type of event. I do not know much about public health, but it was my understanding that there were playbooks for an epidemic.

Even if we had a proper playbook (and we likely do), the failure is so distributed that one would need a lot of books and a lot of incident commanders to fix the problem. We are dead in the water.


"Antifragile" is even more focused around this.

I think it was "predicted" by Sunburst, the Solarwinds hack.

I don't think centrally distributed anti-virus software is the only way to maintain reliability. Instead, I'd say companies tend to centralize anything like administration because it's cost-effective and because they actually aren't concerned about global outages like this.

JM Keynes said "A ‘sound’ banker, alas! is not one who foresees danger and avoids it, but one who, when he is ruined, is ruined in a conventional and orthodox way along with his fellows, so that no one can really blame him." and the same goes for corporate IT.


Many systems are fault tolerant, and many systems can be made fault tolerant. But once you drift into a level of complexity spawned by many levels of dependencies, it definitely becomes more difficult for system A to understand the threats from system B and so on.

Do you know of any fault-tolerant system? Asking because in all the cases I know, when we make a system "fault tolerant" we increase the complexity and introduce new systemic failure modes related to the machinery that makes it fault tolerant, making it effectively non-fault-tolerant.

In all the cases I know, we traded frequent and localized failure for infrequent but globalized catastrophic failures. Like in this case.


You can make a system tolerant to certain faults. Other faults are left "untolerated".

A system that can tolerate anything, and so have perfect availability, seems clearly impossible. So yeah, totally right, it's always a tradeoff. That's reasonable, as long as you trade smart.

I wonder if the people deciding to install Crowdstrike are aware of this. If they traded intentionally, and this is something they accepted, I guess it's fine. If not... I further wonder if they will change anything in the aftermath.


There will be lawsuits, there will be negotiations for better contracts, and likely there will be processes put in place to make it look like something was done at a deeper level. And yet this will happen again next year or the year after, at another company. I would be surprised if there was a risk assessment for the software that is supposed to be the answer to the risk assessment in the first place. Will be interesting to see what happens once the dust settles.

  - This system has a single point of failure; it is not fault tolerant. Let's introduce these three things to make it fault-tolerant
  - Now you have three single points of failure...

That makes it three times as durable...

...right?


It really depends on the size of the system and the definition of fault tolerance. If I have a website calling out to 10 APIs and one API failure takes down the site, that is not fault tolerance. If that 1 API failure gets caught and the rest operate as normal, that is fault tolerance, but 10% of the system is down. If you go to almost any site and open the dev console, you'll see errors coming from parts of the system; that is fault tolerance. Any twin-engine airplane is fault tolerant...until both engines fail. I would say the solar system is fault tolerant, the universe even more so if you consider it a system.

tl;dr: there are levels to fault tolerance, and I understand what you are saying. I am not sure if you are advocating for getting rid of fault handling, but generally you can mitigate the big scary monsters, and what is left is the real edge-case issues; there really is no stopping one of those from time to time, given we live in a world where anything can happen at any time.

This instance really seems like a human related error around deployment standards...and humans will always make mistakes.
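For the website example, the difference is mostly whether each dependency call is isolated. A sketch (the fetch functions are hypothetical stand-ins):

  # Sketch: one failing dependency degrades its own section instead of the page.
  def render_dashboard(dependencies):
      sections, errors = {}, {}
      for name, fetch in dependencies.items():    # e.g. {"weather": fetch_weather, ...}
          try:
              sections[name] = fetch()
          except Exception as exc:                # tolerate the fault locally
              sections[name] = None               # render a placeholder for this section
              errors[name] = str(exc)
      return sections, errors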


Well, you usually put a load balancer and multiple instances of your service in place to handle individual server failures. In the basic no-LB case, your single server fails, you restart it and move on (local failure). In the load balancer case, your LB introduces its own global risks: the load balancer can itself fail, which you can restart, but it can also have a bug and stop handling sticky sessions when your servers are relying on it. Now you have a much harder-to-track brown-out event that affects every one of your users for a longer time, is hard to diagnose, might end up with hard-to-fix data and transaction issues, and restarting the whole thing might not be enough.

So yeah, there is no fault tolerance if the timeframe is large enough; there are just fewer events, with much higher costs. It's a tradeoff.

The cynic in me thinks that the one advantage of these complex CYA systems is that when systems fail catastrophically like CrowdStrike did, we can all "outsource" the blame to them.


It's also in line with arguments made by Ted Kaczynski (the Unabomber)

> Why must everything collapse? Because, [Kaczynski] says, natural-selection-like competition only works when competing entities have scales of transport and talk that are much less than the scale of the entire system within which they compete. That is, things can work fine when bacteria who each move and talk across only meters compete across an entire planet. The failure of one bacteria doesn’t then threaten the planet. But when competing systems become complex and coupled on global scales, then there are always only a few such systems that matter, and breakdowns often have global scopes.

https://www.overcomingbias.com/p/kaczynskis-collapse-theoryh...

https://en.wikipedia.org/wiki/Anti-Tech_Revolution


Crazy how much he was right. If he hadn't gone down the path of violence out of self-loathing and anger, he might have lived to see a huge audience and following.

I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.

There was a quote last year during the "Twitter files" hearing, something like, "it is axiomatic that the government cannot do indirectly what it is prohibited from doing directly".

Perhaps ironically, I had a difficult time using Google to find the exact wording of the quote or its source. The only verbatim result was from a NYPost article about the hearing.


>I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.

Be realistic, none of his ideas would be blacklisted. They sound good on paper, but the instant it's time for everyone to return to mud huts and farming, 99% of people will return to PlayStations and ACs.

He wasn't "silenced" because the government was out to get him; no one talks about his ideas because they are just bad. Most people will give up on ecofascism once you tell them that you won't be able to eat strawberries out of season.


"would be blacklisted, deplatformed, or deamplified by consolidated authorities"

Sorry. Not true. You have Black Swan (Taleb) and Drift into Failure (Dekker) among many other books. These ideas are very well known to anyone who makes the effort.


> it is axiomatic that the government cannot do indirectly what it is prohibited from doing directly

Turns out SCOTUS decided it isn't, and the government is free to do exactly that as long as they are using the services of an intermediary.


The only thing that got the Unabomber blacklisted is that he started to send bombs to people. His manifesto was a dime a dozen; half the time you can expect politicians boosting such stuff for temporary polling wins.

Hell, if we take his alleged cousins (I haven't vetted the genealogy tree), his body count isn't even that impressive.


Being the subject of psychological experiments at Harvard probably did a number on him

I think a surprising number of people already share this view, even if they don't go into an extensive treatment with references like Dekker presumably does (I haven't read it).

I suspect most people in power just don't subscribe to that, which is precisely why it's systemic to see the engineer shouting "no!" while John CEO says "we're doing it anyway." I'm not sure this is something you can just teach, because the audience definitely has reservations about adopting it.


> we setup failure prevention systems

You can't prevent failure. You can only mitigate the impact. Biology has pretty good answers as to how to achieve this without having to increase complexity as a result; in fact, it often shows that simpler systems increase resiliency.

Something we used to understand until OS vendors became publicly traded companies and "important to national security" somehow.


> if you ever thought we could make systems fault tolerant

The only possible way to fault tolerance is simplicity, and then more simplicity.

Things like CrowdStrike take the opposite approach: add a lot of fragile complexity attempting to catch problems, but introduce more attack surface than they can remove. This will never succeed.


Just yesterday listened to a lecture by Moshe Vardi which covers adjacent topics:

https://simons.berkeley.edu/events/lessons-texas-covid-19-73...


As an architect of secure, real-time systems, the hardest lesson I had to learn is there's no such thing as a secure, real-time system in the absolute sense. Don't tell my boss.

I haven't read it, but I'd take a leap and presume it's somewhere between the people that say "C is unsafe" and "some other language takes care of all of that".

Basically delegation.


The thing that amazes me is how they've rolled out such a buggy change at such a scale. I would assume that for such critical systems, there would be a gradual rollout policy, so that not everything goes down at once.

Lack of gradual, health-mediated rollout is absolutely the core issue here. False-positive signatures, crash-inducing blocks, etc. will always slip through testing at some rate no matter how good the testing is. The necessary defense in depth here is to roll out ALL changes (binaries, policies, etc.) in a staggered fashion with some kind of health checks in between (did > 10% of endpoints the change went to go down and stay down right after the change was pushed?).

Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from followup conversations it was quite clear that was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.

You can stagger changes out within a reasonable timeframe - the blocks already take hours/days/weeks to come up with; taking an extra hour or two to trickle the change out gradually with some basic sanity checks between staggers is a tradeoff everyone would embrace in order to avoid the disaster we're living through today.

They need a reset on their balance point of security vs. uptime.
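For anyone wondering what "staggered with health checks" means in practice, the core loop is tiny. A sketch (illustrative only; fleet, push_update and healthy_fraction are stand-ins for whatever inventory and telemetry the vendor already has):

  # Sketch of a staged rollout with a health gate between waves.
  import random, time

  WAVES = [0.001, 0.01, 0.05, 0.25, 1.0]   # cumulative fraction of the fleet per stage
  MIN_HEALTHY = 0.95                        # halt if a wave's check-in rate drops below this
  SOAK_SECONDS = 30 * 60                    # let each wave run before judging it

  def staged_rollout(fleet, push_update, healthy_fraction):
      random.shuffle(fleet)                 # avoid hitting one region/customer first
      done = 0
      for frac in WAVES:
          target = int(len(fleet) * frac)
          wave = fleet[done:target]
          if wave:
              push_update(wave)
              time.sleep(SOAK_SECONDS)
              if healthy_fraction(wave) < MIN_HEALTHY:
                  raise RuntimeError("halting rollout: wave unhealthy, roll back")
          done = target

Even a tiny first wave would have turned this into a footnote instead of a global outage.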


Wow!! Good to know the real reason for the non-staggered release of the software ...

> Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from followup conversations it was quite clear that was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.


There's some irony there in that the whole point of CrowdStrike itself is that it does behaviour-based interventions, i.e. it notices "unusual" activity over time and then can react to that autonomously. So them telling you they can't engineer it is kind of like them telling you they don't know how to do a core feature they actually sell and market the product as doing.

The core issue? I'd say it's QA.

Deploy to a QA server fleet first. You see that the stuff is broken. 100% prevention.


It's quite handy that all the things that pass QA never fail in production. :)

On a serious note, we have no way of knowing whether their update passed some QA or not; likely it didn't, but we don't know. Regardless, the post you're replying to, IMHO, correctly makes the point that no matter how good your QA is, it will not catch everything. When something slips, you are going to need good observability and staggered, gradual, rollback-able rollouts.

Ultimately, unless it's a nuclear power plant or something mission-critical with no redundancy, I don't care whether it passes QA; I care that it doesn't cause damage in production.

Had this been halted after bricking 10, 100, 1.000, 10.000, heck, even 100.000 machines or a whopping 1.000.000 machines, it would have barely made it outside of the tech circle news.


> On a serious note, we have no way of knowing whether their update passed some QA or not

I think we can infer that it clearly did not go through any meaningful QA.

It is very possible for there to be edge-case configurations that get bricked regardless of how much QA was done. Yes, that happens.

That's not what happened here. They bricked a huge portion of internet connected windows machines. If not a single one of those machines was represented in their QA test bank, then either their QA is completely useless, or they ignored the results of QA which is even worse.

There is no possible interpretation here that doesn't make Crowdstrike look completely incompetent.


If there had been a QA process, the kill rate could not have been as high as it is, because there'd have to be at least one system configuration that's not subject to the issue.

I agree that testing can reduce the probability of having huge problems, but there are still many ways in which a QA process can fail silently, or even pass properly, without giving a good indication of what will happen in production due to data inconsistencies or environmental differences.

Ultimately we don't know if they QA'd the changes at all, if this was data corruption in production, or anything really. What we know for sure is that they didn't have a good story for rollbacks and enforced staggered rollouts.


There's also the possibility that they did do QA, had issues in QA and were pressured to rush the release anyways.

Unsubstantiated (not even going to bother linking to the green-account-heard-it-from-a-friend comment), but the fault was added by a post-QA process.

My understanding of their argument is that they can't afford the time to see if it breaks the QA fleet. Which I agree with GP is not a sufficient argument.

Yes, one of the first steps of this gradual rollout should be rolling out to your own company in the classic, "eat your own dogfood" style.

If and when there is a US Cyber Safety Review Board investigation of this incident, documents like that are going to be considered with great interest by the parties involved.

Often it is the engineers working for a heavily invested customer at the sharp end of the coal face who get a glimpse underneath the layers of BS and stare into the abyss.

This doesn’t look good, they say. It looks fine from up top! Keep shoveling! Comes the reply.


Sure, gradual rollout seems obviously desirable, but think of it from a liability perspective.

You roll out a patch to 1% of systems, and then a few of the remaining 99% get attacked and they sue you for having a solution but not making it available to them. It won't matter that your sales contract explains that this is how it works and the rollout is gradual and random.

Just a thought.


These suing hypotheticals work both ways (they can sue for crashing 100% of your computers), so they don't really explain any decision.

Then push it down to the customer; better yet, provide integration points with other patch-management software (no idea if you can integrate with WSUS without doing insane crap, but it's not the only system that handles that, etc.)

Another version of the "fail big" or "big lie" type phenomenon. Impact 1% of your customers and they sue you saying the gradual rollout demonstrates you had prior knowledge of the risk. Impact 100% of your customers and somehow you get off the hook by declaring it a black swan event that couldn't have been foretold.

Don't you think they will be sued now too?

CS recently went through cost-cutting measures, which is likely why there's no QA fleet to deploy to and no investment in improving their engineering processes.

Were they struggling with paying the employees?

In modern terms, you mean they simply weren't willing to babysit longer install frames.

This. I can see such an update shipping out for a few users. I mean I've shipped app updates that failed spectacularly in production due to a silly oversight (specifically: broken on a specific Android version), but those were all caught before shipping the app out to literally everybody around the world at the same time.

The only thing I can think of is they were trying to defend from a very severe threat very quickly. But... it seems like if they tested this on one machine they'd have found it.

Unless that threat was a 0day bug that allows anyone to SSH to any machine with any public key, it was not worth pushing it out in haste. Full stop. No excuses.

Can't boot, can't get cracked! Big brain thinking.

That’s the most charitable hypothesis, and I agree could be possible!

I myself have ninja-shipped a fix for a minor problem, but then caused a worse problem because I rushed it.


I pushed a one-character fix and broke it a second time

I'd love to know what the original threat was. I hope it was something dumb like applying new branding colors to the systray indicator.

"works on my machine" at Internet scale. What a scary thought

I also blame the customers here to be completely honest.

The fact the software does not allow for progressive rollout of a version in your own fleet should be an instantaneous "pass". It's unacceptable for a vendor to decide when updates are applied to my systems.


Absolutely. I may be speaking from ignorance here, as I don't know much about Windows, but isn't it also a big security red flag that this thing is reaching out to the Internet during boot?

I understand the need for updating these files; they're essentially what encodes the stuff the kernel agent (they call it a "sensor"?) is looking for. I also get why a known valid file needs to be loaded by the kernel module in the boot process--otherwise something could sneak by. What I don't understand is why downloading and validating these files needs to be a privileged process, let alone something in the actual kernel. And to top it all off, they're doing it at boot time. Why?

I hope there's an industry wide safety and reliability lesson learned here. And I hope computer operators (IT departments, etc) realize that they are responsible for making sure the things running on their machines are safe and reliable.
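On the validation point: checking that a channel file is authentic doesn't require kernel privileges at all. A hedged sketch of the split using Python's cryptography package (invented names; nothing here reflects CrowdStrike's actual design):

  # Sketch: verify a content update entirely in user space; only hand the blob
  # to a small, dumb privileged loader after it verifies.
  from cryptography.exceptions import InvalidSignature
  from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

  def verify_channel_file(blob, sig, vendor_pubkey_bytes):
      pub = Ed25519PublicKey.from_public_bytes(vendor_pubkey_bytes)
      try:
          pub.verify(sig, blob)             # raises InvalidSignature on mismatch
          return True
      except InvalidSignature:
          return False

Download, parsing and signature checking can all run unprivileged; the kernel-side component only ever sees files that already passed the check.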


Well said. I can't fathom companies being fine with some 3rd party pushing arbitrary changes to their critical production systems.

At the risk of sounding like a douche-bag, I honestly believe there's A LOT of incompetence in the tech world, and it permeates all layers: security companies, AV companies, OS companies, etc.

I really blame the whole power structure. It looked like the engineers had the power, but over the last 10 years tech has been turned upside down and exploited like any other industry, controlled by opportunistic and greedy people. Everything is about making money and shipping features; the engineering is lost.

Would you rather tick compliance boxes easily or think deeply about your critical path? Would you rather pay 100k for a skilled engineer or hire 5 cheaper (new) ones? Would you rather sell your HW now despite pushing a feature-incomplete, buggy app that ruins the experience for many, many customers? Will you listen to your engineers?

I also blame us, the SWE engineers; we are way too easily pushed around by these types of people who have no clue. Have professional integrity: tests are not optional or something that can be cut, they're part of SWE. Gradual rollouts, feature toggles, fallbacks/watchdogs, etc. are basic tools everyone should know.


I know people really dislike how Apple restricts your freedom to use their software in any way they don't intend. But this is one of the times where they shine.

Apple recognised that kernel extensions brought all sorts of trouble for users, such as instability and crashing, and presented a juicy attack surface. They deprecated and eventually disallowed kernel extensions, supplanting them with a system extensions framework that provides interfaces for VPN functionality, EDR agents, etc.

A CrowdStrike agent using this interface couldn't panic or boot-loop macOS due to a bug in its code.


> I know people really dislike how Apple restricts your freedom to use their software in any way they don't intend. But this is one of the times where they shine.

Yes, the problem here is that the system owners had too much control over their systems.

No, no, that's the EXACT OPPOSITE of what happened. The problem is Crowdstrike had too much control of systems -- arguing that we should instead give that control to Apple is just swapping out who's holding the gun.


> arguing that we should instead give that control to Apple is just swapping out who's holding the gun.

Apple wrote the OS; in this scenario they're already holding a nuke, and getting the gun out of CrowdStrike's hands is in fact a win.

It is self-evident that 300 countries having nukes is less safe than 5 countries having them. Getting nukes (kernel modules) out of the hands of randos is a good thing even if the OS vendor still has kernel access (which they couldn't possibly not have) and might have problems of their own. IDK why that's even worthy of having to be stated.

Don't let the perfect be the enemy of the good; incremental improvement in the state of things is still improvement. There is a silly amount of black-and-white thinking around "popular" targets like Apple and Nvidia (see: anything to do with the open-firmware driver) etc.

"sure google is taking all your personal data and using it to target ads to your web searches, but apple also has sponsored/promoted apps in the app store!" is a similarly trite level of discourse that is nonetheless tolerated when it's targeted at the right brand.


Perfectly stated!

This is good nuance to add to the conversation, thanks.

I think in most cases you have to trust some group of parties. As an individual you likely don't have enough time and expertise to fully validate everything that runs on your hardware.

Do you trust the OSS community, hardware vendors, OS vendors like IBM, Apple, M$? Do you trust third-party vendors like CrowdStrike?

For me, I prefer to minimize the number of parties I have to trust, and my trust is based on historical track record. I don't mind paying and giving up functionality.


Even if you've trusted too many people, and been burned, we should design our systems such that you can revoke that trust after the fact and become un-burned.

Having to boot into safe mode and remove the file is a pretty clumsy remediation. Better would be to boot into some kind of trust-management interface and distrust CrowdStrike updates dated after July 17, then rebuild your system accordingly (this wouldn't be difficult to implement with Nix).

Of course you can only benefit from that approach if you trust the end user a bit more than we typically do. Physical access should always be enough to access the trust-management interface; anything else is just another vector for spooky action at a distance.


It is some mix of priorities along the frontier, with Apple being on the significantly controlling end, such that I wouldn't want to bother. Your trust should also be based on prediction, and giving a major company even more control over what your systems are allowed to do has been historically bad and only gets worse. Even if Apple is properly ethical now (I'm skeptical; I think they've found a decently sized niche and that most of their users wouldn't drop them even if they moved to significantly higher levels of telemetry, due in part to being a status good), there's little reason to give them that power in perpetuity. Removing that control when it is abused hasn't gone well in the past.

Microsoft is also trying to make drivers and similar safer with HVCI, WDAC, ELAM and similar efforts.

But given how a large part of their moat is backwards compatibility, very few of those things are the default and even then probably wouldn't have prevented this scenario.


Microsoft has routinely changed the display driver model, breaking backward compatibility. They've also moved to block third-party print drivers.

> large part of their moat is backwards compatibility

This is more of a religious belief than truth, IMO. They could strong-arm recalcitrant customers, but they don't.


> They could strong-arm recalcitrant customers, but they don't.

They really can't. When the customers have to redo their stack, they might do that in a way that doesn't need Microsoft at all.


These customers wouldn't be able to do that in time frames measured in anything but decades and/or they would risk going bankrupt attempting to switch.

Microsoft has far more leverage than they choose to exert, for various reasons.


I can't run a 10-year-old game on my Mac, but I can run a 30-year-old game on my Windows 11 box. Microsoft prioritizes backwards compatibility for older software.

You can't run a 30-year-old driver in Windows, nor, in all likelihood, a 10-year-old one.

Microsoft prioritizes userspace compatibility, but their driver models have changed (relatively) frequently.


If you are a Crowdstrike customer you can’t run anything today.

With Apple you just need to be an Apple customer; they do a good job of crashing computers with their macOS updates, like Sonoma. I remember my first MacBook Pro Retina couldn't go to sleep because it wouldn't wake up until Apple decided to release a fix for it. Good thing they don't make server OSes.

I remember fearing every OS X update, because until they switched to just shipping read-only partition images you had a considerable chance of hitting a bug in Installer.app that resulted in an infinite loop... (the bug existed from ~10.6 until they switched to image-based updates...)

30 years ago would be 1994. Were there any 32-bit Windows games in 1994 other than the version of FreeCell included with Win32s?

16-bit games (for DOS or Windows) won't run natively under Windows 11 because there's no 32-bit version of Windows 11 and switching a 64-bit CPU back to legacy mode to get access to the 16-bit execution modes is painful.


Maybe. Have you tried? 30-year-old games often did not implement delta timing, so they advance ridiculously fast on modern processors. Or the games required a memory mode not supported by modern Windows (see real mode, expanded memory, protected mode), requiring DOSBox or another emulator to run today.

DOSBox runs on Mac too, incidentally.
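For anyone who hasn't run into the term, delta timing just means scaling movement by elapsed wall-clock time rather than by frame count. A tiny generic sketch (not any particular game's code):

    import time

    player_x = 0.0
    SPEED = 100.0                  # units per second of real time
    last = time.monotonic()

    while player_x < 100.0:        # "walk" 100 units, then stop
        time.sleep(1 / 60)         # pretend to render a frame
        now = time.monotonic()
        dt = now - last            # wall-clock seconds since the previous frame
        last = now
        # A 1994-style game would do player_x += 5 per frame, so a machine that
        # pushes 10x the frame rate moves everything 10x too fast. Delta timing
        # ties movement to elapsed time instead:
        player_x += SPEED * dt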


A 10-year-old driver would crash your system, and you can't install some VM like you can with some game. Not that great a prioritization.

If the user wants remote code execution (which auto-updates are) in kernel space, let them.

Apple sells the whole hardware stack. I don't think limiting drivers would fly on Windows or Linux.


Pretty sure there's an exception for drivers, but it requires at minimum notarisation from Apple, and more likely a review as well.

They recently developed a new framework that allows drivers to run entirely in user space: https://developer.apple.com/documentation/driverkit

Well - recognition where it's due - that actually looks pretty great. (Assuming that, contrary to prior behavior, they actually support it, and fix bugs without breaking backwards compatibility every release, and don't keep swapping it out for newer frameworks, etc etc)

OK, what if they shipped it turned off by default, but there was a physical switch, requiring hardware access, that could turn it on?

Good compromise?


That’s exactly how macOS works (except it’s not a physical switch). You can disable SIP if you have hardware access to a machine.

I would be fine with jumpers, yes.

No.

Go buy a different product if you want that functionality. I'm sticking with my Apple phone so outages like this are much less likely to affect me.


> I also blame us, the SWE engineers, we are waay to easily busied around by these types of people who have no clue. Have professional integrity, tests is not optional or something that can be cut, it's part of SWE.

Then maybe most of what's done in the "tech-industry" isn't, in any real sense, "engineering"?

I'd argue the areas where there's actual "engineering" in software are the least discussed---example being hard real-time systems for Engine Control Units/ABS systems etc.

That _has_ to work, unlike the latest CRUD/React thingy, whose "engineering" process is cargo-culting whatever framework is cool now, plus subjective nonsense like "code smells" and whatever design pattern is "needed" for "scale" or some such crap.

Perhaps actual engineering approaches could be applied to software development at large, but it wouldn't look like what most programmers do, day to day, now.

How is mission-critical software designed, tested, and QA'd? Why not try those approaches?


Amen to that. Software Engineering as a discipline badly suffers from not incorporating well-known methods for preventing these kinds of disasters from Systems Engineering.

And when I say Systems Engineering I don't mean Systems Programming, I mean real Systems Engineering: https://en.wikipedia.org/wiki/Systems_engineering

> How is mission-critical software designed, tested, and QA'd? Why not try those approaches?

Ultimately, because it is more expensive and slower to do things correctly. I would argue, though, that while you lose speed initially with activities like actually thinking through your requirements and your verification and validation strategies, you gain that speed back later, when you're iterating on a correct system implementation, because you have established extremely valuable guardrails that keep you focused and on the right track.

At the end of the day, the real failure is in the risk estimation of the damage done when these kinds of systems fail. We foolishly think that this kind of widespread disastrous failure is less likely than it really is, or the damage won't be as bad. If we accurately quantified that risk, many more systems we build would fall under the rigor of proper engineering practices.


Accountability would drive this. Engineering liability codes are a thing, trade liability codes are a thing. If you do work that isn't up to code, and harm results, you're liable. Nobody is holding us software developers accountable, so it's no wonder these things continue to happen.

"Listen to the engineers?" The problem is that there are no engineers, in the proper sense of the term. What there are is tons and tons of software developers who are all too happy to be lax about security and safe designs for their own convenience and fight back hard against security analysts and QA when called out on it.

I would add that a lot of people in this industry also just blindly follow the herd, without any independent thinking.

Oh, everyone is using Crowdstrike? I guess i have to do so too!

Oh, everyone is using Kubernetes? I guess i better start migrating our services to it too!

Oh, everyone is using this fancy Vercel stuff? We better use it too!

Oh, everyone is migrating their workloads to the cloud even though we don't need to and it costs 5x more?!! We better do so too!!


> it looked like the engineers

Engineers can be lazy and greedy, too. But at least they should better understand the risks of cutting corners.

> Have professional integrity, tests is not optional or something that can be cut, it's part of SWE. Gradual rollout, feature-toggles, fall-backs/watchdogs etc. basic tools everyone should know.

In my career, my solution for this has been to just include doing things "the right way" as part of the estimate, and not give management the option to select a "cutting corners" option. The "cutting corners" option not only adds more risk, but rarely saves time anyway when you inevitably have to manually roll things back or do it over.
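To be clear about what "gradual rollout" buys you: it doesn't have to be elaborate. A minimal sketch of a deterministic percentage gate, with hypothetical stand-in function names, just to show the shape of it:

    import hashlib

    def in_rollout(host_id: str, feature: str, percent: int) -> bool:
        # Deterministically bucket a host into [0, 100) so a rollout can be
        # widened from 1% to 10% to 100% without hosts flapping in and out.
        digest = hashlib.sha256(f"{feature}:{host_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percent

    def apply_new_content():        # hypothetical stand-in for the risky change
        print("loading new content version")

    def keep_previous_content():    # hypothetical stand-in for the safe fallback
        print("staying on last known-good content")

    # Ship to 1% of hosts first, watch crash telemetry, then raise the percentage.
    if in_rollout(host_id="host-1234", feature="sensor-content-291", percent=1):
        apply_new_content()
    else:
        keep_previous_content()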


Sigh, I've tried this. So management reassigned it to a dev who was happy to ship a simulacrum of the thing that, at best, doesn't work or, at worst, is full of security holes and gives incorrect results. And this makes management happy because something shipped! Metrics go up!

And then they ask why, exactly, did the senior engineer say this would take so long? Why always so difficult?


I don't know that incompetence is the best way to describe the forces at play but I agree with your sentiment.

There is always tension between business people and engineering. The engineers want things to be perfect and safe, because we are the ones fixing the resulting issues during nights and weekends. The business people are interested in getting features released and don't always understand the risks of pushing arbitrary dates.

It's a tradeoff which, in healthy organizations where the two sides and leadership communicate effectively, is well managed.


> The engineers want things to be perfect and safe, because we are the ones fixing the resulting issues during nights and weekends. The business people are interested in getting features released and don't always understand the risks of pushing arbitrary dates.

Isn't this issue a vindication of the engineering approach to management, where you try to _not_ brick thousands of computers because you wanted to meet some internal deadline faster?


You don't consider bricking a considerable fraction of the world's computers, in a way that's difficult to recover from, incompetence?

Maybe it is. I don't like the connotation of it though.

It implies some sort of individual failure, when what I was trying to say is that I think it's an organizational failure.


> There is always tension between business people and engineering.

Really? I think this situation (and the situation with Boeing!) shows that the tension is ultimately between responsibility and irresponsibility.

Can it really be said that this is a win for the short-sighted and incompetent business people?

If people don't understand the risks they shouldn't be making the decisions.


I think this is especially true in businesses where the thing you are selling is literally your ability to do good engineering. In the case of Boeing the fundamental thing customers care about is the "goodness" of the actual plane (for example the quality, the value for money, etc). In the case of Crowdstrike people wanted high quality software to protect their computers.

Yeah, good point. If you buy a carton of milk and it's gone off you shrug and go back to the store. If you're sitting in a jet plane at 30,000ft and the door goes for a walk... Twilight Zone. (And if the airline's security contractor sends a message to all the planes to turn off their engines... words fail. It's not... I can't joke about it. Too soon.)

Yes. I have been working in the tech industry since the early aughts and I have never seen the industry so weak on engineer-led firms. Something really happened and the industry flipped.

In most companies, businesspeople without any real software dev experience control the purse strings. Such people should never run companies that sell life-or-death software.

The reality is there is plenty of space in the software industry to trade off velocity against "competent" software engineering. Take Instagram as an example. No one is going to die if e.g. a bug causes someone's IG photo upload to only appear in a proper subset of the feeds where it should appear.

There's a lot of incompetence by choice.


In the civil engineering world, at least in Europe, the lead engineer signs papers that make them liable if a bridge or a building structure collapses on its own. Civil engineers face literal prison time if they do sloppy work.

In the software engineering world, we have TOSs that deny any liability if the software fails. Why?

It boils my blood to think that the heads of CrowdStrike would maybe get a slap on the wrist and everything will slowly continue as usual as the machines will get fixed.

People died for this bug.


Let's think about this for a second. I agree to some extent with what you are trying to say; I just think there's a critical thing missing from your consideration, and that is usage of the product outside its intended purpose/marketing.

Civil engineers build bridges knowing that civilians use them, and that structural failure can cause deaths. The line of responsibility is clear.

For SW companies (like CrowdStrike (CS)) it MAY BE less straightforward.

A relevant real-world example is the use of consumer drones in military conflicts. Companies like DJI design and market their drones for civilian use, such as photography. However, these drones have been repurposed in conflict zones, like Ukraine, to carry explosives. If such a drone malfunctioned during military use, it would be unreasonable to hold DJI accountable, as this usage clearly falls outside the product's intended purpose and marketing.

The liability depends on the guarantees they make. If they market it as AV for critical infrastructure, such as healthcare (it seems like they do: https://www.crowdstrike.com/platform/), then by all means it's reasonable to hold them accountable.

However, SW companies should be able to sell products as long as they're clear about what the limitations are, and those limitations are clearly communicated to the customers.


We have those TOSs in the software world because it would be prohibitively expensive to make all software as reliable as a publicly used bridge. For those who died as a direct result of CrowdStrike, that's where the litigious nature of the US becomes a rare plus. And CrowdStrike will lose a lot of customers over this. It isn't perfect, but the market will arbitrate CrowdStrike's future in the coming months and years.

We’re definitely in a moment. I’ve seen a large shift away from discipline in the field. People don’t seem to care about professionalism or “good work”.

I mean back in the mid teens we had the whole “move fast and break things” motif. I think that quickly morphed into “be agile” because no one actually felt good about breaking things.

We don’t really have any software engineering leaders these days. It would be nice if one stood up and said “stop being awful. Let’s be professionals and earn our money.” Like, let’s create our own oath.


> We don’t really have any software engineering leaders these days. It would be nice if one stood up and said “stop being awful. Let’s be professionals and earn our money.”

I assume you realize that you don't get very far in many companies when you do that. I'm not humble-bragging, but I used to say just this over past 10-15 years even when in senior/leadership positions, and it ended up giving me a reputation of "oh, gedy is difficult", and you get sidelined by more "helpful" junior devs and managers who are willing to sling shit over the wall to please product. It's really not worth it.


It’s a matter of getting a critical mass of people who do that. In other words, changing the general culture. I’m lucky to work at a company that more or less has that culture.

Yeah I’ve found this is largely cultural, and it needs to come from the top.

The best orgs have a gnarly, time-wisened engineer in a VP role who somehow is also a good people person, and pushes both up and down engineering quality above all else. It’s a very very rare combination.


If it's a mature system and management is highly risk averse, not fucking up means more than slinging shit quickly

> We’re definitely in a moment. I’ve seen a large shift away from discipline in the field. People don’t seem to care about professionalism or “good work”.

Agreed. Thinking back to my experience at a company like Sun, every build was tested on every combination of hardware and OS releases (and probably patch levels, don't remember). This took a long time and a very large number of machines running the entire test suites. After that all passed ok, the release would be rolled out internally for dogfooding.

To me that's the base level of responsibility an engineering organization must have.

Here, apparently, Crowdstrike lets a code change through with little to no testing and immediately pushes it out to the entire world! And this is from a product that is effectively a backdoor to every host. What could go wrong? YOLO right?

This mindset is why I grow to hate what the tech industry has become.
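And the matrix-gating part of that isn't rocket science. An illustrative sketch, with made-up configuration names and a stubbed test runner (nothing Sun-specific):

    from itertools import product

    # Illustrative support matrix only -- a real one would be driven by lab
    # inventory, not a hard-coded list.
    HARDWARE = ["sparc-64", "x86", "x86-64"]
    OS_RELEASES = ["os-8", "os-9", "os-10"]
    PATCH_LEVELS = ["GA", "latest-patches"]

    def run_suite(hw, os_rel, patch):
        # Stand-in for dispatching the full regression suite to a lab machine.
        return True

    results = {cfg: run_suite(*cfg) for cfg in product(HARDWARE, OS_RELEASES, PATCH_LEVELS)}

    if all(results.values()):
        print(f"all {len(results)} configurations passed; OK to dogfood internally")
    else:
        print("do NOT release; failures:", [cfg for cfg, ok in results.items() if not ok])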


As an infra guy, it seems like all my biggest fights at work lately have been about quality. Long abandoned dependencies that never get updated, little to no testing, constant push to take things to prod before they're ready. Not to mention all the security issues that get shrugged off in the name of convenience.

I find both management and devs are to blame. For some reason the amazingly knowledgeable developers I read on here daily are never to be found at work.


Yes. I’ve had the same experience. Literally have had engineers get upset with me when I asked them to consider optimizing code or refactor out complexity. “Yeah we’ll do it in a follow up, this needs to ship now,” is what I always end up hearing. We’re not their technical leads but we get pulled into a lot of PRs because we have oversight on a lot of areas of the codebase. From our purview, it’s just constantly deteriorating.

Need to make software developers legally liable like other engineers, that will cause a huge behavioral shift

IMO, if you want to write code for anything mission critical you should need some kind of state certification, especially when you are writing code for stuff that is used by govt., hospitals, finance etc.

Certifications by themselves don’t help if the culture around them doesn’t change. Otherwise it’s just rubber-stamping.

Not certification, licensure. That can and will be taken away if you violate the code of ethics. Which in this case means the code of conduct dictated to you by your industry instead of whatever you find ethical.

Like a license to be a doctor, lawyer, or civil engineer.

There’s - perhaps rightfully, but certainly predictably - a lot of software engineers in this thread moaning about how evil management makes poor engineers cut corners. Great, licensure addresses that. You don’t cut corners if doing so and getting caught means you never get to work in your field again. Any threat management can bring to the table is not as bad as that. And management is far less likely to even try if they can’t just replace you with a less scrupulous engineer (and there are many, many unscrupulous engineers) because there aren’t any because they’re all subject to the same code of ethics. Licensure gives engineers leverage.

Super unpopular concept, though.


Certifications and compliance regimes are what got us into this mess in the first place.

I think that could cause a huge shift away from contributing to or being the maintainer of open source software. It would be too risky if those standards were applied and they couldn't use the standard "as is, no warranties" disclaimers.

Actually, no it wouldn't, as the licensure would likely be tied to providing the service on a paid basis to others. You could write or maintain any codebase you want. Once you start consuming it for an employer, though, the licensure kicks in.

Paid/subsidized maintainers may be a different story though. But there absolutely should be some level of teeth and stake wieldable by a professional SWE to resist pushes to "just do the unethical/dangerous thing" by management.


I might have misunderstood. I took it to mean that engineers would be responsible for all code they write - the same as another engineer may be liable for any bridge they build - which would mean the common "as is", "no warranty", "not fit for any purpose" cute clauses common to OSS would no longer apply as this is clearly skirting around the fact that you made a tool to do a specific thing, and harming your computer isn't the intended outcome.

You can already enforce responsibility via contract but sure, some kind of licensing board that can revoke a license so you can no longer practice as a SWE would help with pushback against client/employer pressure. In a global market though it may be difficult to present this as a positive compared to overseas resources once they get fed up with it. It would probably need either regulation, or the private equivalent - insurance companies finding a real, quantifiable risk to apply to premiums.


Trouble is, the bridge built by any licensed engineer stands in its location and can't be moved or duplicated. Software, however, is routinely duplicated and copied to places that might not be suitable for its original purpose.

I’d be ok with this so long as 1) there are rules about what constitutes properly built software and 2) there are protections for engineers who adhere to these rules

Greed and MBAs have colonized the far ends of the techno sphere.

Far from being douchey, I think you've hit the nail on the head. No one is perfect, we're all incompetent to some extent. You've written shitty code, I've definitely written shitty code. There's little time or consideration given to going back and improving things. Unless you're lucky enough to have financial support while working on a FOSS project where writing quality software is actually prioritized.

I get the appeal software developers have to start from scratch and write their own kernel, or OS, etc. And then you realize that working with modern hardware is just as messy.

We all stack our own house of cards upon another. Unless we tear it all down and start again with a sane stable structure, events like this will keep happening.


I know 100k+ engineers (some in security) that definitely should not be described as skilled.

Wow, you know a lot of people!

I think you are correct on that many SWEs are incompetent. I definitely am. I wish I had the time and passion to go through a complete self-training of CS fundamentals using Open Course resources.

You do realize that knowledge of CS fundamentals is extremely unlikely to have prevented this?

> I honestly believe there's A LOT of incompetence in the tech-world

I can understand why. An engineer with expertise in one area can be a dunce in another; the line between concerns can be blurry; and expectations continue to change. Finding the right people with the right expertise is hard.


100%. What we've seen in the last couple of decades is the march of normies into the technosphere, to the detriment of the prior natives.

We've essentially watched digital colonialism, and it certainly peaks with Elon Musk's wealth and ego, attempting to buy up the digital marketplace of ideas.


Pardon my snarkiness, but this is what you get when imbecile MBAs and marketers run every company, not engineers.

Applying rigorous engineering principles is not something I see developers doing often. Whether or not it's incompetence on their part, or pressure from 'imbecile MBAs and marketers', it doesn't matter. They are software developers, not engineers. Engineers in most countries have to belong to a professional body and meet specific standards before they can practice as professionals. Any asshat can call themselves a 'software engineer', the current situation being a prime example, or was this a marketing decision?

You're making the title be more than it is. This won't get solved by more certification. The checkbox of having certified security is what allowed it to happen in the first place.

No. Engineering means something. This is a software 'engineering' problem. If the field wants the nomenclature, then it behooves it to apply rigour to who can call themselves an engineer or architect. Blaming middle management is missing the wood for the trees. The root cause was a bad patch. That is development's fault, and no one else's. As to why this fault could happen, well, the design of Windows should be scrutinised. Again, middle management isn't really to blame here; software architects and engineers design the infrastructure, and they choose to use Windows for a variety of reasons.

The point I'm trying to make here is that blaming "MBAs and marketing" shifts blame and misses the wood for the trees. The OP is also on the holier-than-thou "engineer" trip. They are not engineers.


I think engineering only means something because of culture. It all starts from the culture of collective people who define and decide what principles are to be followed and why. All the certifications and licensing that are prerequsite to becoming an engineer are outcomes of the culture that defined them.

Today we have pockets of code produced by one culture linked (literally) with pockets of code produced by completely different ones, and somehow we expect the final result to adhere to the most principled and disciplined culture.


Nobody did gradual rollout in 1992

Not entirely true. The company I worked for, major network equipment provider, had a customer user group that had self-organised to take it in turns to be the first customer to deploy major new software builds. It mostly worked well.

Just wait until the PE licensing requirements come to legally charge money for code.

This is the thing that gets me most about this. Any Windows systems developer knows that a bug in a kernel driver can cause BSODs - why on earth would you push out such changes en-masse like this?!

In 2012 a local bank rolled out an update that basically took all of their customer services offline. Couldn't access your money. Took them a month to get things working again.

No concept of "canarying", eh?
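For anyone unfamiliar with the term, canarying is conceptually simple: push to a small cohort, watch the telemetry, and only then touch the rest of the fleet. A rough sketch with stand-in functions (not how any real fleet-management tool is wired up):

    import random

    # Hypothetical canary gate: push a new build to a small cohort first, compare
    # its crash rate against a baseline, and halt automatically if it spikes.
    def deploy_to(hosts):
        print(f"deployed to {len(hosts)} hosts")    # stand-in for the real push mechanism

    def crash_rate(hosts):
        return random.uniform(0.0, 0.02)            # stand-in for reading fleet telemetry

    def canary_rollout(all_hosts, baseline_rate=0.001, canary_fraction=0.01):
        canary = all_hosts[: max(1, int(len(all_hosts) * canary_fraction))]
        deploy_to(canary)
        observed = crash_rate(canary)
        if observed > 10 * baseline_rate:
            print(f"halting rollout, canary crash rate is {observed:.3%}")
            return
        deploy_to(all_hosts[len(canary):])          # only now touch the rest of the fleet

    canary_rollout([f"host-{i}" for i in range(10_000)])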
