Can anyone from CrowdStrike, or someone who knows what led to this incident and how it got past QA, care to share?
That other thread with more than 3000 messages may or may not have this info; it's hard to read...
Nope. Tech is especially protected by both sides of the aisle. Why do you think there's no real or legally defined "engineering" happening? No red tape. No filter of quality, no accountability.
We have to legislate out big tech and start holding them accountable. Prison time for execs and significant fines is the correct and effective method.
Any org using a tech product should have a team reviewing contracts. Many contracts have financial penalties for not meeting SLAs. That's not going to be in the standard terms of service, but rather in an enterprise contract. Any large company just accepting standard ToS for business critical software is not doing their due diligence.
It doesn't have to be that technical. What I'm reading in there is that they had a logic flaw and their test cases and testing processes aren't robust enough to catch major logic flaws. Not that surprising, given the way most orgs treat their testing.
There's a long list of incredibly damaging fuckups by software companies, but I can't think of a single example of an existence-ending judgement from litigation.
The biggest reason why billion-dollar software companies continue to release shitty and dangerous products is because they face zero legal liability for their negligence.
>There's a long list of incredibly damaging fuckups by software companies,
Just one example - Fujitsu, associated with the biggest ever miscarriage of justice in England because of bugs in their Horizon software, is still going strong...
Great example, but sadly a collective of wronged Postmasters didn’t (at least until recently) have much clout - partly hence the problem.
In the Crowdstrike case, we have multiple major airlines, multiple airports, hospitals, and some large media outlets. Probably billions of dollars of losses.
Exactly this: they will probably pay a few pennies here and there to settle simple disputes out of court, and business will continue as usual. It's not like any of the governing bodies will intervene to punish these corporations. It's a truly sad state of affairs when those elected to protect the public serve the needs of mega corporations instead.
The types of law that you would normally expect to protect you as a US consumer against completely defective security software that does nothing right and harms customers do not work on software for a variety of complicated reasons. In a way that is completely distinct from other kinds of things that are sold, you will not be able to recover under a UCC warranty claim, a negligence claim, or a claim related to defectively designed or manufactured piece of software.
If Crowdstrike is litigated into non-existence, it will not be in America, because the law doesn't work like that. Worth noting that there are a lot of software companies in America. This is not a coincidence.
Yeah, I keep hearing people say “CrowdStrike is done for” or similar. But I honestly think this will blow over and be forgotten in a month or two.
Almost every large tech company in existence has had some sort of fuck up and survived. (Intel CPU vulnerabilities, iCloud security scandal, Sony multiple data leaks, ..).
Yeah but they had one job… nobody in corporate IT is going to want to bet their careers on buying into crowdstrike from this point onward. Maybe not terminal for the company but it’s going to hurt it for a long time to come.
> nobody in corporate IT is going to want to bet their careers on buying into crowdstrike
CISOs buy EDR. In order to kill CrowdStrike, you'd need competitors with similar capabilities who haven't caused similar but smaller and less publicized outages or performance hits (off the top of my head, Tanium and Carbon Black have. I was there.) And that haven't been publicly hacked due to equally boneheaded issues in other products recently (like Palo Alto). So Microsoft... maybe.
In an industry based on ticking checkboxes for auditors, relying on CrowdStrike will automatically untick that checkbox. The only hope for them is to rebrand and start from scratch.
Auditors might consider this a feature. ;) What's more secure than a computer that's totally inaccessible (because its OS has been rendered unbootable)?
I happen to be on vacation with a former CrowdStrike employee right now; they said none of their old coworkers/friends are saying anything. I’m guessing internal comms went out to lock everything down.
Their big claim to (public, non-technical crowd) fame before this was their role in "auditing" the DNC servers. It was their widely disseminated claim that the DNC servers were hacked by Russians, who subsequently gave the DNC emails to Wikileaks. Later, under oath in 2017, CrowdStrike's president of services and chief security officer Shawn Henry admitted to the US House Permanent Select Committee on Intelligence that they not only had no evidence that the data was hacked by Russians, but also no evidence that it was exfiltrated at all (rather than leaked); it was an assumption. Unfortunately this admission wasn't declassified until the Mueller report was publicly released, by which point many people had already been fooled, and (in my eyes) the episode did little to establish CrowdStrike as a reliable or trustworthy organization.
Some notable quotes from Shawn Henry regarding this from that hearing:
>"There are times when we can see data exfiltrated, and we can say conclusively. But in this case it appears it was set up to be exfiltrated, but we just don’t have the evidence that says it actually left."
>"There’s not evidence that they were actually exfiltrated. There's circumstantial evidence but no evidence that they were actually exfiltrated."
>"There is circumstantial evidence that that data was exfiltrated off the network. … We didn't have a sensor in place that saw data leave. We said that the data left based on the circumstantial evidence. That was the conclusion that we made."
>"Sir, I was just trying to be factually accurate, that we didn't see the data leave, but we believe it left, based on what we saw."
>Asked directly if he could "unequivocally say" whether "it was or was not exfiltrated out of DNC," Henry told the committee: "I can't say based on that."
Not all perfect examples (e.g., some aren't metal, but they all have their own edge to them)... Kittie, Lamb of God, Dream Theater, Green Jelly, Heart, Ministry.
(Well, as you probably know, "nirvana" is not exactly a perspective that many ordinary people welcome, and it is in fact very far from some concepts of "heaven"; for the eponymous band, "Nirvana" meant "induced bliss". The band was far from fixated on the concept; e.g., Cobain found documented inspiration in the Damned ("Come as You Are" is apparently a citation of "Life Goes On"), and the latter just lightly enjoyed the eerie spirit uncommittedly - a spinoff was Captain Sensible's solo projects with "Wot"...)
Makes perfect sense if you know your history: _soft_ware for _micro_computers, because in the era where Microsoft was founded, the word "computer" was associated with room or building sized mainframes. Minicomputers were the size of a rack or two and microcomputers / home computers were ones that fit on top of a desk.
Microsoft DOES make sense though - tiny software...
I don't know if Gibson took it or coined it in Neuromancer, but it at least makes sense as a name for a thing. This isn't the same as "random noun" for a company name.
Nah, it’s Microcomputer Software. Software for tiny computers. At the time, desktops were considered microcomputers compared to the workstations, mainframes, supercomputers, etc
> They have a staging system which is supposed to give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.
Sussing things out, they uploaded a "config" file that had a .sys extension that caused all the trouble.
.sys files are supposed to be protected system files and require special privileges to touch. I imagine Crowdstrike requires some special type of Windows "root access" to operate effectively (like many antivirus packages) in order to detect and block low level attacks.
So where things likely went pear-shaped: Crowdstrike's QA process for config updates is possibly less stringent than for core code updates. But because they were using .sys files, the config was given elevated privileges and loaded during boot.
As for the actual bug, I expect it was either something like the sys file referencing itself or some sort of stack overflow somewhere, both of which I would pin on Microsoft for not being able to detect and recover from during boot up.
All of this is straight guesswork based solely on experience as a longtime Windows user.
>.sys files are supposed to be protected system files and require special privileges to touch. I imagine Crowdstrike requires some special type of Windows "root access" to operate effectively (like many antivirus packages) in order to detect and block low level attacks.
Any file in C:\windows\ is protected by default and requires administrative privileges to modify. There aren't any extra protections for .sys files specifically.
>As for the actual bug, I expect it was either something like the sys file referencing itself or some sort of stack overflow somewhere, both of which I would pin on Microsoft for not being able to detect and recover from during boot up.
Since when is it the operating system's responsibility to recover from badly written kernel mode code?
How do you know which driver is responsible for the crash? It's all in kernel mode, so everything is sharing the same address space. Even if you could determine that the crashed code belonged to a given driver, it's impossible to know whether that driver was responsible for the crash or whether some other driver called it with bad parameters. Even if you could get past all those problems, not loading a driver also comes with other problems. If the driver is required for some critical functionality, not loading the driver might cause other crashes. It also means you can disable an antivirus by getting it to crash 3 times in a row.
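To illustrate the trade-off in that last point, here's a hypothetical sketch (not how Windows actually behaves) of a naive "skip the driver after N consecutive boot crashes" policy; every name in it is invented:

    # Hypothetical policy sketch: skip a driver that was blamed for the
    # last N failed boots. Not actual Windows behaviour; names invented.
    MAX_CONSECUTIVE_CRASHES = 3

    def should_load_driver(driver_name: str, crash_log: list[str]) -> bool:
        """Return False if this driver was blamed for the last N failed boots."""
        recent = crash_log[-MAX_CONSECUTIVE_CRASHES:]
        blamed_every_time = (
            len(recent) == MAX_CONSECUTIVE_CRASHES
            and all(entry == driver_name for entry in recent)
        )
        # Skipping restores bootability, but it also means an attacker who
        # can provoke three crashes silently disables the security product,
        # which is exactly the objection above.
        return not blamed_every_time

    # After three boots that each crashed in "some_av_driver.sys", the
    # policy would skip loading it on the fourth boot:
    print(should_load_driver("some_av_driver.sys", ["some_av_driver.sys"] * 3))  # False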
This is a simple problem to understand and solve. Corporations need to buy a license from CrowdStrikeCrowdStrike.com. This allows large corporations that do not know crowdstrike is a security nightmare built on top of the microsoft security nightmare to assure themselves they are now secure.
My guess is it’s (phased rollout) like all safety precautions. You can flout it and you’ll be fine most of the time, possibly for a while. It works until it doesn’t.
In this case it’s a bit ironic since their whole business is mitigating risk.
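To make "phased rollout" concrete, here's a minimal sketch of the gating logic involved; the cohort sizes, thresholds, and helper names are assumptions for illustration, not CrowdStrike's actual pipeline:

    # Minimal sketch of a phased (canary) rollout gate. Cohort sizes,
    # thresholds, and helper names are invented for illustration.
    import time

    COHORTS = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet per phase
    MAX_ERROR_RATE = 0.001               # halt if >0.1% of updated hosts report problems
    SOAK_SECONDS = 3600                  # let telemetry accumulate between phases

    def phased_rollout(update, fleet, push, error_rate):
        """Push `update` to progressively larger slices of `fleet`,
        halting if error telemetry exceeds the budget."""
        done = 0
        for fraction in COHORTS:
            target = int(len(fleet) * fraction)
            push(update, fleet[done:target])   # only the new slice gets the update
            done = target
            time.sleep(SOAK_SECONDS)
            if error_rate(update) > MAX_ERROR_RATE:
                raise RuntimeError("rollout halted: error budget exceeded")

Skipping straight to the 100% phase is exactly the "flouting the safety precaution" failure mode described above.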
I am not from Crowdstrike, but they were under pressure in 2022-2023 to show a profit. They did that by cutting expense growth across the board, and I am now guessing QA got hit particularly hard.
I expect we will see detailed forensics published by various third parties in the coming days and weeks. As to what CS itself will publish, that remains to be seen.
Alternately, they never tested their code's ability to accept arbitrary configurations. That totally could have been done, allowing fast config pushes. But it wasn't. So as a reliability engineer, I'd point at that change: configs were originally expected to be tested before release, so the parser only had to be robust against inputs that had already survived pre-release testing. Later, they started shipping configs faster than their release cycle, but didn't requalify the parser for this new requirement.
Behind that, I’d look at the engineering and product culture that had that happen. Is there a list of what’s been tested for what purpose? Was the expectation that configs get tested like software written down someplace, and do the roles who scheduled that config release read that place? Who is organizationally set up to notice this, and why is it a director-level IC straight out of xkcd 2347?
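As a concrete (and entirely hypothetical) sketch of what "requalifying the parser" mentioned above could look like: feed it arbitrary or truncated inputs and require that it rejects them cleanly instead of crashing. The parser name, file format, and exception type here are stand-ins, not CrowdStrike's actual code:

    # Sketch of a "parser must survive arbitrary configs" test. The parser
    # name, file format, and error type are hypothetical stand-ins.
    import os
    import random

    class MalformedChannelFile(Exception):
        """Raised by the (hypothetical) parser for inputs it cannot interpret."""

    def parse_channel_file(data: bytes) -> dict:
        # Placeholder for the real content-file parser under test.
        if len(data) < 8 or data[:4] != b"CHNL":
            raise MalformedChannelFile("bad header")
        return {"version": int.from_bytes(data[4:8], "little")}

    def test_parser_never_crashes_on_garbage():
        random.seed(0)
        for _ in range(10_000):
            blob = os.urandom(random.randint(0, 4096))
            try:
                parse_channel_file(blob)       # accepting is fine...
            except MalformedChannelFile:
                pass                           # ...and so is a clean rejection,
            # but any other exception (or a hang) fails the test.

    test_parser_never_crashes_on_garbage()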
I don't think you will get the introductory "As a (so-called) senior software engineer at Crowdstrike" brags here at this time since they are internally screaming and swallowing their pride over that massive outage.
Yeah even that guy who always told them they should always do canary deployments should stifle his "I toldja!" instinct and just keep quiet until everybody forgets.
I think they don't share specific details because it would reveal how badly they handle QA and update-rollout processes for a product that expensive. Maybe at some point law enforcement will reveal the embarrassing truth.
An internet-connected kernel-level driver, with a bricking update that passed QA and could be rolled out via a non-staged method, at global scale, in under 24 hours? That is unthinkable.
I'm sure Crowdstrike will not take kindly to their employees spilling the beans here, just saying... Especially a very high-profile issue like this that is likely to end up in the courts.
Morality and ethics fly out the window the minute cash enters the equation. It's very easy to profess altruism and write heroic posts on X; it's another thing to apply those principles in practice.
They didn't even bother to test this update. They don't have canary deployments. It's clear that their QA and development culture is complete garbage. That is not "just a mistake".
I'd be surprised if they were toast. Big companies can fuck up with impunity - just look at Boeing. They can take a bit of an insurance hit, chalk it up to the cost of doing business, and pass the cost back to their customers in the long run.
I'm amazed it only went down 10%. Some of my stocks go down that much because an analyst at some no name Canadian bank lowers their price target by 5%.
It’s only market manipulation if you have the intention of making the stock go down, right? If answering the OP is illegal, then no employee of any listed company can ever say something bad about it, which sounds too restrictive?
Wonder how many are polishing their resumes right now.
I guess it would depend on how honest they are in their communication. Presumably if they outright lied, it would give someone enough incentive to create a throwaway account via a VPN, from a browser and machine instance they never used, to say something like “hell, no, that’s not what happened”.
The world deserves a detailed post mortem and an apology! Considering the scale of this shit show, there's no way this didn't seriously affect a significant number of people. I would bet that people died because of it, or suffered some kind of calamity, whether personal or medical.
I think we have reached an inflection point. I mean, we have to make an inflection point out of this.-
This outage represents more than just a temporary disruption in service; it's a black swan célèbre of the perilous state of our current technological landscape. This incident must be seen as an inflection point, a moment where we collectively decide to no longer tolerate the erosion of craftsmanship, excellence, and accountability that I feel we've been seeing all over the place. All over critical places.-
Who are we to make this demand? Most likely technologists, managers, specialists, and concerned citizens with the expertise and insight to recognize the dangers inherent in our increasingly careless approach to ... many things, but, particularly technology. Who is to uphold the standards that ensure the safety, reliability, and integrity of the systems that underpin modern life? Government?
Historically, the call for accountability and excellence is not new. From Socrates to the industrial revolutions, humanity has periodically grappled with the balance between progress and prudence. People have seen - and complained about - life going to hell, downhill, fast, in a hand basket without brakes since at least Socrates.-
Yet, today’s technological failures have unprecedented potential for harm. The CrowdStrike outage cost lives, halted businesses, and posed serious risks to safety, consequences that were almost unthinkable in previous eras. This isn't merely a technical failure; it’s a societal one, revealing a disregard for foundational principles of quality and responsibility. Craftsmanship. Care and pride in one's work.-
Part of the problem lies in the systemic undervaluation of excellence. In pursuit of speed and profit über alles, many companies have forsaken rigorous testing, comprehensive risk assessments, and robust security measures. The very basics of engineering discipline (redundancy, fault tolerance, continuous improvement) are being sacrificed. This negligence is not just unprofessional; it’s dangerous. As this outage has shown, the repercussions are not confined to the digital realm but spill over into the physical world, affecting real lives. As they always have. But never before have the actions of so few "perennial interns" affected so many.-
This is a clarion call for all of us with the knowledge and passion to stand up and insist on change: holding companies accountable, beginning with those directly responsible for the most recent failures.-
Yet, it must go beyond punitive measures. We need a cultural shift that re-emphasizes the value of craftsmanship in technology. Educational institutions, professional organizations, and regulatory bodies must collaborate to instill and enforce higher standards. Otherwise, lacking that, we must enforce them ourselves. Even if we only reach ourselves in that commitment.-
Perhaps we need more interdisciplinary dialogue. Technological excellence does not exist in a vacuum. It requires input from ethical philosophers, sociologists, legal experts. Anybody willing and able to think these things through.-
The ramifications of neglecting these responsibilities are clear and severe. The fallout from technological failures can be catastrophic, extending well beyond financial losses to endanger lives and societal stability. We must therefore approach our work with the gravity it deserves, understanding that excellence is not an optional extra but a sine qua non in certain fields.-
We really need to make this an actual turning point, and not just another Wikipedia page.-
I appreciate your call to action, but it should be noted that it was only one of many platform choices that had the failure. The bandwagon of corporate microsoft installs, cloud based storage, massive central management of endpoints, sub contractors sub contracting to sub sub contractors, all the executives greenlighting security agents based on what’s most popular. You’re absolutely right if you’re talking about the Microsoft business ecosystem. But note, the internet stayed up, nearly every SaaS platform and product of note stayed up. Because that craftsmanship is still alive in the linux devops culture movement. Microsoft has been directly or indirectly responsible for almost every major IT disaster. It’s not a coincidence or because there’s “so many more MS installs”… not anymore. It’s because they suck, and by extension everything related to them sucks.
"This is an article about a Reddit thread discussing why a diesel repair shop had a system-wide failure. The conversation centers around why industrial equipment, like lifts and cranes, would be running Windows and be connected to a network. People debate the pros and cons of having internet-connected machinery, with some arguing for the benefits of remote monitoring and updates and others expressing concerns about security vulnerabilities. Many users point out that some critical systems, like elevators, could be controlled with simpler and more reliable solutions."
I'm extremely for integration of smart features into home, business and industrial applications. However, and this is a great, great big HOWEVER, the smart features must always be an added feature, and absent the customer desiring to use them, or the ability to use them via technical issues/obsolescence, the equipment in question should degrade to a usable state.
Perfect example: I own a bunch of iHome smart outlets that I use with Apple HomeKit. Got a whole box of the things off ebay years ago, paid about $6 each. Along with (obviously) being a smart home accessory that runs via HomeKit, they have a little button on the side you can push to toggle the power on and off. So, in the odd event that my network is being uncooperative and I need to turn a light off, I can do that very easily. Not as easily as I could with my phone, but still quite easily.
I don't understand why industry is so hostile to this. It just makes sense to me: build in all the smart features you like, but if they're not working, just have the bloody thing operate like a normal... whatever. Microwave, lamp, etc.
Oh, I get how it happened just fine, man (and thanks for the context for future readers, it's helpful). I think there are way too many keywords to change context, but the LLM of course did not do that. Sigh.
> Many users point out that some critical systems, like elevators, could be controlled with simpler and more reliable solutions.
All too true. But between the "we can sell far more stuff if we convince people to replace things that still work fine" of Modern Capitalism, and the "I can brag about all our Shiny New Stuff, and it ain't my money" of Modern Executives...
The thread on Hacker News revolves around a significant incident where a CrowdStrike update led to widespread issues, including Windows blue screens and boot loops. This event affected various industries, including facilities where heavy machinery like lifts and cranes depend on Windows-operated systems.
1. *Incident Overview*: The CrowdStrike update caused critical system failures, leading to operational disruptions across multiple sectors. Users reported being unable to operate essential equipment, disarm alarms, or use communication tools, which brought businesses to a standstill.
2. *Root Cause Analysis*: The primary issue stemmed from an update pushed by CrowdStrike that conflicted with existing systems, particularly those running on Windows. This led to blue screens and boot loops, effectively rendering systems inoperable. The reliance on Windows for industrial control systems (HMI - Human-Machine Interface) exacerbated the impact.
3. *Quality Assurance and Deployment*: There was significant discussion on how such a critical update passed QA. It's suggested that the widespread deployment of uniform security policies without tailored considerations for different system requirements contributed to the problem. In some cases, security software was installed on systems where it might not have been necessary, driven by compliance needs and centralized IT policies.
4. *Implications and Lessons*: The incident highlights the vulnerabilities of interconnected systems and the challenges in managing updates and security across diverse operational environments. It underscores the need for robust QA processes and the potential risks of over-centralized security policies.
No sir, not at all! It was inevitable! Let's just leave this behind us! Those who think it's preventable are part of a certain "evangelism strike force", I tell you! They make zero good arguments, never listen to them!
Back to business as usual, boys! C/C++ are without a single flaw!
To people who say this is not a Microsoft issue... it absolutely is a Microsoft issue. Microsoft allowed third parties to muck with the Windows kernel in a way that makes the computer unbootable. How is that not a Microsoft issue?
Apple has a vetting process before they will allow an app to be added to their app store. Why doesn't Microsoft have a vetting process before allowing a third party to mess with the Windows kernel? Does Crowdstrike have SOC2 or some other certification to make sure they are following secure practices, with third-party verification that they are following their documented practices? If not, why not? Why doesn't Microsoft require that?
It is clear that the status quo can't continue. Think about the 911 calls that didn't get answered and the surgeries that had to be postponed. How many people lost their lives because of this? How does the industry make sure this doesn't happen again? Just rely on Crowdstrike to get their act together? Is it enough to trust them to do so?
Microsoft can't really "certify" their way out of this. Crowdstrike updates as they find threats; that is, all the time. Microsoft can't perfectly vet every update - they come too fast.
That sounds like an argument towards Microsoft not allowing third party drivers like this, or at least strongly discouraging them and making it clear that it breaks the warranty. Didn't Apple do this with deprecating kexts? (maybe that's not applicable, I don't do a lot of macOS dev)
Auditing every data file update seems just as error/system failure prone as Crowdstrike's process was. I don't see a clear reason why Microsoft would have any better incentive than Crowdstrike here.
I do think that maybe the commercial OS vendor has _some_ support responsibilities to at least warn and discourage customers from using the product in dangerous ways? I mean, it's not like we're talking about a couple people installing bad kernel drivers here, we're talking about a worldwide incident. WHQL seems like an admission that Microsoft knows they need to keep dangerous drivers out of the ecosystem.
Let's say MS does not allow third party drivers at all. Then they would have a monopoly over software drivers and system software like security systems. I doubt regulators would want that.
You can do checkbox exercises all day, won't make a difference.
Nearly all banks have long, long lists of certifications, yet they still have extremely bad customer-side security processes, because you can "interpret" various guidance documents and pay the right auditors enough to have issues ignored.
Right. So the model is broken. You cannot both respond to threats in a timely manner and have Microsoft certify that the update is safe.
That leaves you either not responding quickly or responding with uncertified updates. In the past, we have examples of not responding quickly that took down large chunks of the internet (I don't remember the examples, but they were quite famous at the time). Now we have an example of a fast, uncertified update taking down a large chunk of the internet.
So, given that it can take down much of the internet no matter which we choose, now what do we do?
Crowdstrike didn't have the right processes in place.
What can we do?
Require them to have documented processes, and require periodic (like every 6 months) third-party auditing that they have the right processes, and they are complying with their own processes.
Again, Microsoft doesn't control modules people choose to use and can't assume anything about how they work, much less disable them without operator approval.
Imagine if malware could somehow crash this module - would you be happy about the OS automatically rolling back the introduction of said module, opening your system to vulnerabilities?
The driver in question was tested and passed WHQL. CrowdStrike included functionality in the driver to interpret a downloadable file (similar to an antivirus signature file). The file in the problematic update was malformed, and the CrowdStrike driver did not handle this case properly; Windows was unable to continue given the exception in question[1].
No operating system can guarantee that a driver will never cause the machine to crash. This wasn’t Microsoft’s fault.
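As a toy illustration of that failure mode (the file format and code here are made up, not CrowdStrike's), compare a parser that trusts the downloaded content file with one that validates it; in kernel mode, the unchecked read is the kind of fault that ends in a bugcheck rather than a catchable exception:

    # Toy contrast: trusting vs. validating parse of a downloaded content
    # file. The file layout is entirely invented for illustration.
    import struct

    def parse_trusting(blob: bytes) -> list[int]:
        # Reads a record count, then that many 4-byte records, without
        # checking that the file actually contains them. In user-space
        # Python this raises struct.error; in a kernel driver the
        # equivalent out-of-bounds read can fault and take the machine down.
        (count,) = struct.unpack_from("<I", blob, 0)
        return [struct.unpack_from("<I", blob, 4 + 4 * i)[0] for i in range(count)]

    def parse_validating(blob: bytes):
        # Same format, but every read is bounds-checked; malformed input
        # is rejected instead of dereferenced.
        if len(blob) < 4:
            return None
        (count,) = struct.unpack_from("<I", blob, 0)
        if len(blob) < 4 + 4 * count:
            return None
        return [struct.unpack_from("<I", blob, 4 + 4 * i)[0] for i in range(count)]

    bad = struct.pack("<I", 1000)     # header claims 1000 records, file has none
    print(parse_validating(bad))      # None: rejected cleanly
    # parse_trusting(bad) would blow up on the first missing record.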
I suspect the US government might have pressured them to push this update because they’ve found out that Crowdstrike’s system has either already been breached or has a significant zero-day security vulnerability.
> I think the good news, if it's incompetence, is that they can greatly reduce the risk of recurrence by improved processes and tooling.
Now let me give you the realistic news: every executive everywhere will just hand-wave this away with "it will not happen again anytime soon, or ever, right?" and be back to business as usual.
The extreme cost-cutting and allowing non-technical people to rule us is what is getting us into these messes every time. And humanity in its totality seems almost completely unable to learn.
I think your cynicism may be misplaced in this particular situation.
CrowdStrike's ability to make sales is (I assume) very dependent on their appearance of trustworthiness.
Given the microscope they're under, and how serious some of their customers are, I think they'll need to make a convincing case that this kind of problem won't happen again.
Reminds me a little of Boeing, now that I think of it.
I can accept your POV of me being too cynical btw. But I think it's also important to give you one extra nuance: everybody feels awkward after "accidents" like these and everybody is very willing to accept half-arsed explanations (that usually have zero details or guarantees of future better practices) and they just want to move on and forget about it.
Historically it seems that this approach wins in 99% of the cases.
I think it would take Crowdstrike doing five, six of these in a row before they tanked. Look at Boeing; two planes full of passengers died for quite scandalous reasons, followed up by a whole bunch of other engineering scandals, and while the value of Boeing shares took a 50% hit, the company's still in business, and people are still buying planes from them.
> I think the good news, if it's incompetence, is that they can greatly reduce the risk of recurrence by improved processes and tooling.
And, if the various stories are to believed, removing or neutering the execs and managers that required an update be pushed out immediately without passing through the normal staging system.
It's mind boggling how an organization the size of CrowdStrike, built on endpoint agents running in kernel-mode had such an #epicfail. Was all of QA out this week or something? It's truly mind boggling so I'm really looking forward to the post-mortem, if we ever get one.
The size of the organization may only serve to increase the amount of potentially bad code: plain old tech debt, poor engineering, or outright hack-job implementations. The ability to control code quality depends heavily on how staunch the organizational culture is about keeping the repos clean of dead, kludgey, or just plain bad code, and about strictly enforcing the agreed-upon architecture.
Even if that were remotely likely it would make no sense.
Why would they ask to put half the country to a standstill because a foreign threat actor might have control of the system and might put half the country to a standstill?
It would if an attack was already underway or was planned to execute in very short order. It’s also possible that Crowdstrike wasn’t the issue, but rather the solution. If you find out the bomb is set to go off at noon on July 19 and you can’t easily patch it, you just shut the whole thing down for a day.
It’s an action plot from a movie. Improbable, but not impossible. I don’t think our government is nearly that competent, but who knows.
Because it would be completely implausible for the foreign actor to want to put half the country to a standstill. They'd likely do their very best to use this zero-day to gain even further footing, backdooring as many machines as they can.
This is the most needlessly complex, mustache-twirling, moderately evil-villain type plot.
The simpler explanation is often the answer here. Either (or both) their internal QA and release process is flawed or specific employees did not follow the process correctly.
These fantastical explanations of some big government conspiracy only serve to make the believers feel better about the fact that nobody is actually in control and shit just happens randomly, which is honestly a far more terrifying prospect.
Not sure why this is so unpopular. Pressure comes from a boss or a client. The US government is likely one of those. But not the only one who can put on that kind of pressure.
Unfortunately it's rather lacking in both technicality and detail.