> “Speed was the most important thing,” said Jeff Gardner, a senior user experience designer at CrowdStrike who said he was laid off in January 2023 after two years at the company. “Quality control was not really part of our process or our conversation.”
This type of article - built upon disgruntled former employees - is worth about as much as the apology GrubHub gift card.
Look, I think just as poorly about CrowdStrike as anyone else out there... but you can find someone to say anything, especially when they have an axe to grind and a chance at some spotlight. Not to mention this guy was a designer and wouldn't be involved in QC anyway.
> Of the 24 former employees who spoke to Semafor, 10 said they were laid off or fired and 14 said they left on their own. One was at the company as recently as this summer. Three former employees disagreed with the accounts of the others. Joey Victorino, who spent a year at the company before leaving in 2023, said CrowdStrike was “meticulous about everything it was doing.”
Except the biggest IT outage ever. And a postmortem showing their validation checks were insufficient. And a rollout process that did not stage at all, just rawdogged straight to global prod. And no lab where the new code was actually installed and run prior to global rawdogging.
I'd say there's smoke, plus numerous accounts of fire, and this article can be read in that context.
"Everyone" piles on Tesla all the time; a worthwhile comparison would be how Tesla roll out vehicle updates.
Sometimes people are up in arms asking "where's my next version" (eg when adaptive headlights were introduced), yet Tesla prioritise a safe, slow roll out. Sometimes the updates fail (and get resolved individually), but never on a global scale. (None experienced myself, as a TM3 owner on the "advanced" update preference.)
I understand the premise of Crowdstrike's model is to have up to date protection everywhere but clearly they didn't think this through enough times, if at all.
You can also say the same thing about Google. Just go look at the release notes on the App Store for the Google Home app. There was a period of more than six months where every single release said "over the next few weeks we're rolling out the totally redesigned Google Home app: new easier to navigate 5-tab layout."
When I read the same release notes so often, I begin to question whether this redesign is really taking more than six months to roll out. And then I read about the Sonos app disaster and thought that was the other extreme.
> Just go look at the release notes on the App Store for the Google Home app. [...] When I read the same release notes so often I begin to question whether this redesign is really taking more than six months to roll out.
Google is terrible at release notes. For several years now, the release notes for the "Google" app on the Android app store have shown the exact same four unchanging entries, loosely translated from Portuguese: "enhanced search page appearance", "new doodles designed for the app experience", "offline voice actions (play music, enable Wi-Fi, enable flashlight) - available only in the USA", "web pages opened directly within the app". I heavily doubt it's taking this many years to roll out these changes; they probably just don't care anymore and never update the app store release notes.
The sentence you quoted clearly meant, from the context, "clearly we have nothing [to learn from the opinions of these former employees]". Nothing in your comment is really anything to do with that.
There definitely was a huge outage, but based on the given information we still can't know for sure how much they invested in testing and quality control.
There's always a chance of failure even for the most meticulous companies.
Now I'm not defending or excusing the company, but a singular event like this can happen to anyone and nothing is 100%.
If thorough investigation revealed poor quality control investment compared to what would be appropriate for a company like this, then we can say for sure.
With that alone we know they have failed the simplest of quality control methods for a piece of software as widespread as theirs. This is even excluding that there should have been some kind of error handling to allow the computer to boot if they did push bad code.
While I agree with this, from a software engineering perspective I think it's more useful to look at the lessons learned. I think it's too easy to just throw "Crowdstrike is a bunch of idiots" against the wall, and I don't think that's true.
It's clear to me that CrowdStrike saw this as a data update vs. a code update, and that they had much more stringent QA procedures for code updates than they did for data updates. It's very easy for organizations to lull themselves into this false sense of security when they make these kinds of delineations (sometimes even subconsciously at first), and then over time they lose sight of the fact that a bad data update can be just as catastrophic as a bad code update. I've seen shades of this issue elsewhere many times.
So all that said, I think your point is valid. I know Crowdstrike had the posture that they wanted to get vulnerability files deployed globally as fast as possible upon a new threat detection in order to protect their clients, but it wouldn't have been that hard to build in some simple checks in their build process (first deploy to a test bed, then deploy globally) even if they felt a slower staged rollout would have left too many of their clients unprotected for too long.
Hindsight is always 20/20, but I think the most important lesson is that this code vs data dichotomy can be dangerous if the implications are not fully understood.
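To make the "treat data updates like code updates" point concrete, here's a minimal sketch in Python (all names hypothetical, nothing to do with CrowdStrike's actual tooling) of a single gate that refuses to treat content updates more leniently than code updates:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Release:
    kind: str        # "code" or "content" -- the delineation that lulls teams
    artifact: bytes  # whatever is being shipped

def passes_gate(release: Release, checks: list[Callable[[Release], bool]]) -> bool:
    """Run every check regardless of release.kind.

    The failure mode described above is letting 'content' skip the heavier
    checks; here both kinds go through the exact same list.
    """
    return all(check(release) for check in checks)

def installs_cleanly_on_test_host(release: Release) -> bool:
    # Placeholder for actually loading the artifact on a disposable test machine.
    return True

if __name__ == "__main__":
    checks = [installs_cleanly_on_test_host]
    for kind in ("code", "content"):
        release = Release(kind=kind, artifact=b"...")
        print(kind, "passes gate:", passes_gate(release, checks))
```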
It could have been OK to expedite data updates if the code treated configuration data as untrusted input, as if it could have been written by an attacker. That means fuzz testing and all that.
Obviously the system wasn't very robust, as a simple, within specs change could break it. A company like CrowdStrike, which routinely deals with memory exploits and claims to do "zero trust" should know better.
As often, there is a good chance it is an organization problem. The team in charge of the parsing expected that the team in charge of the data did their tests and made sure the files weren't broken, while on the other side, they expected the parser to be robust and at worst, a quick rollback could fix the problem. This may indeed be the sign of a broken company culture, which would give some credit to the ex-employees.
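As a concrete illustration of "configuration as untrusted input", here's a tiny fuzz harness in Python against a made-up parse_channel_file() (the format and name are invented; a real sensor parser would be C/C++ and far more complex). The point is only that the parser must reject garbage rather than crash:

```python
import random

def parse_channel_file(data: bytes) -> list[bytes]:
    """Hypothetical format: one byte field count, then length-prefixed fields.
    Must raise ValueError on malformed input, never crash or read out of bounds."""
    if not data:
        raise ValueError("empty file")
    count = data[0]
    fields, pos = [], 1
    for _ in range(count):
        if pos >= len(data):
            raise ValueError("truncated field header")
        length = data[pos]
        pos += 1
        if pos + length > len(data):
            raise ValueError("truncated field body")
        fields.append(data[pos:pos + length])
        pos += length
    return fields

def fuzz(iterations: int = 100_000) -> None:
    rng = random.Random(0)
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parse_channel_file(blob)
        except ValueError:
            pass  # rejecting malformed input is fine; any other exception is a bug found

if __name__ == "__main__":
    fuzz()
    print("parser survived the fuzz run")
```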
That rumor floated around Twitter but the company quickly disavowed it. The problem was that they added an extra parameter to a common function but never tested it with a non-wildcard value, revealing a gap in their code coverage review:
From the report, it seems the problem is that they added a feature that could use 21 arguments, but there was only enough space for 20. Until now, no configuration used all 21 (the last one was a wildcard regex, which apparently didn't count), but when they finally did, it caused a buffer overflow and crashed.
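To illustrate the class of bug the report describes (a toy in Python, not CrowdStrike's code): a matcher built for 20 inputs handed a template that references a 21st. Without the bounds check shown here Python would merely raise an IndexError, but in unsafe kernel C the same mistake becomes an out-of-bounds read and a crash:

```python
EXPECTED_INPUT_COUNT = 20  # what the deployed interpreter was built to handle

def run_matcher(inputs: list[str], indices_used: list[int]) -> None:
    if len(inputs) != EXPECTED_INPUT_COUNT:
        raise ValueError(f"expected {EXPECTED_INPUT_COUNT} inputs, got {len(inputs)}")
    for i in indices_used:
        # The missing runtime check: without it, index 20 (the 21st slot)
        # walks off the end of what was allocated.
        if not 0 <= i < len(inputs):
            raise ValueError(f"input index {i} out of range")
        _ = inputs[i]  # pretend to match against this field

if __name__ == "__main__":
    inputs = [f"field{i}" for i in range(20)]
    run_matcher(inputs, indices_used=[0, 5, 19])       # fine
    try:
        run_matcher(inputs, indices_used=[0, 5, 20])   # the '21st' parameter
    except ValueError as e:
        print("caught:", e)
```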
> It's clear to me that CrowdStrike saw this as a data update vs. a code update, and that they had much more stringent QA procedures for code updates that they did data updates.
It cannot have been a surprise to Crowdstrike that pushing bad data had the potential to bork the target computer. So if they had such an attitude that would indicate striking incompetence. So perhaps you are right.
> It's clear to me that CrowdStrike saw this as a data update vs. a code update
> Hindsight is always 20/20, but I think the most important lesson is that this code vs data dichotomy can be dangerous if the implications are not fully understood.
But it's not some new condition that the industry hasn't already been dealing with for many many decades (i.e. code vs config vs data vs any other type of change to system, etc.).
I'm sorry but there comes a point where you have to call a spade a spade.
When you have the trifecta of regex, *argv packing and uninitialized memory you're reaching levels of incompetence which require being actively malicious and not just stupid.
The blame for the Linux situation isn’t as clear cut as you make it out to be. Red Hat rolled out a breaking change to BPF which was likely a regression. That wasn’t caused directly by a CrowdStrike update.
It's not about the blame, it's about how you respond to incidents and what mitigation steps you take. Even if they aren't directly responsible, they clearly didn't take proper mitigation steps when they encountered the problem.
How do you mitigate the OS breaking an API below you in an update? Test the updates before they come out? Even if you could, you'd still need to deploy a fix before the OS update hits the customers, and anyone that didn't update would still be affected.
The linux case is just _very_ different from the windows case. The mitigation steps that could have been taken to avoid the linux problem would not have helped for the windows outage anyways, the problems are just too different. The linux update was about an OS update breaking their program, while the windows issue was about a configuration change they made triggering crashes in their driver.
It's: a) an update, b) pushed out globally without proper testing, c) that bricked the OS.
It's an obvious failure mode that, if you have a proper incident response process, would have been revealed by that specific incident and flagged as needing mitigation.
I do this specific thing for a living. You don't just address the exact failure that happened but try to identify classes of risk in your platform.
> Even if you could, you'd still need to deploy a fix before the OS update hits the customers, and anyone that didn't update would still be affected.
And yet the problem would still only affect Crowdstrike's paying customers. No matter how much you blame upstream your paying customers are only ever going to blame their vendor because the vendor had discretion to test and not release the update. As their customers should.
Sure, customers are free to blame their vendor. But please, we’re on HN, we aren’t customers, we don’t have beef in this game. So we can do better here, and properly allocate blame, instead of piling on the cs hate for internet clout.
And again, you cannot prevent your vendor breaking you. Sure, you can magic some convoluted process to catch it asap. But that won’t help the poor sods who got caught in-between.
> If thorough investigation revealed poor quality control investment compared to what would be appropriate for a company like this, then we can say for sure.
We don't really need that thorough of an investigation. They had no staged deploys when servicing millions of machines. That alone is enough to say they're not running the company correctly.
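For what it's worth, a staged deploy doesn't need to be elaborate. A minimal sketch (ring sizes, the threshold, and the telemetry call are all hypothetical):

```python
ROLLOUT_RINGS = [
    ("internal", 100),       # dogfood machines
    ("canary", 10_000),      # a small slice of the customer fleet
    ("broad", 8_000_000),    # everyone else
]

def ring_crash_rate(ring: str) -> float:
    # Placeholder: in reality this would query crash/telemetry data
    # reported back by machines that already received the update.
    return 0.0

def rollout(update_id: str) -> None:
    for ring, size in ROLLOUT_RINGS:
        print(f"pushing {update_id} to {ring} ({size} machines), then soaking")
        # ...wait out a bake period here...
        if ring_crash_rate(ring) > 0.001:  # more than 0.1% of the ring unhealthy
            raise RuntimeError(f"halting rollout: {ring} ring is unhealthy")

if __name__ == "__main__":
    rollout("channel-file-291")
```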
I also fall on the side of "stagger the rollout" (or "give customers tools to stagger the rollout"), but at the same time I recognize that a lot of customers would not accept delays on the latest malware data.
Before the incident, if you asked a customer if they would like to get updates faster even if it means that there is a remote chance of a problem with them... I bet they'd still want to get updates faster.
I would say that canary release is an absolute must 100%. Except I can think of cases where it might still not be enough. So, I just don't feel comfortable judging them out of the box. Does all the evidence seem to point against them? For sure. But I just don't feel comfortable giving that final verdict without knowing for sure.
Specifically because this is about fighting against malicious actors, where time can be of essence to deploy some sort of protection against a novel threat.
If there are deadlines you can go over without anything bad happening, then sure: always have canary releases and perfect QA, with everything monitored thoroughly. But I'm just saying there can be cases where the damage that could be done if you don't act fast enough is just so much worse.
And I don't know that it wasn't the case for them. I just don't know.
> Specifically because this is about fighting against malicious actors, where time can be of essence to deploy some sort of protection against a novel threat.
This is severely overstating the problem: an extra few minutes is not going to be the difference between their customers being compromised or not. Most of the devices they run on are never compromised, because anyone remotely serious has defense in depth.
If it was true, or even close to true, that would make the criticism more rather than less strong. If time is of the essence, you invest in things like reviewing test coverage (their most glaring lapse), fuzz testing, and common reliability engineering techniques like having the system roll back to the last known good configuration after it’s failed to load. We think of progressive rollouts as common now but they got to get that mainstream in large part because the Google Chrome team realized rapid updates are important but then asked what they needed to do to make them safe. CrowdStrike’s report suggests that they wanted rapid but weren’t willing to invest in the implementation because that isn’t a customer-visible feature – until it very painfully became one.
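The "last known good" part is plain old reliability engineering. A sketch of the boot-time logic in Python (paths, thresholds, and file format are all made up; a real sensor would do this in its loader/driver):

```python
import json, os, shutil

ACTIVE = "config/active.json"        # newest pushed configuration
LAST_GOOD = "config/last_good.json"  # last configuration that loaded successfully
FAIL_COUNTER = "config/failed_loads"
MAX_FAILED_LOADS = 2

def load_config() -> dict:
    """Try the newest config; after repeated failed loads, fall back to the
    last configuration known to have worked instead of crash-looping."""
    failures = int(open(FAIL_COUNTER).read()) if os.path.exists(FAIL_COUNTER) else 0
    if failures >= MAX_FAILED_LOADS and os.path.exists(LAST_GOOD):
        shutil.copy(LAST_GOOD, ACTIVE)   # roll back
        failures = 0
    open(FAIL_COUNTER, "w").write(str(failures + 1))  # assume failure until proven otherwise
    cfg = json.load(open(ACTIVE))        # a bad file raises (or, in a driver, crashes) here
    # Only reached if loading succeeded: record success and mark this config good.
    shutil.copy(ACTIVE, LAST_GOOD)
    open(FAIL_COUNTER, "w").write("0")
    return cfg
```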
"CrowdStrike delivers security content configuration updates to our sensors in two ways: Sensor Content that is shipped with our sensor directly, and Rapid Response Content that is designed to respond to the changing threat landscape at operational speed.
...
The sensor release process begins with automated testing, both prior to and after merging into our code base. This includes unit testing, integration testing, performance testing and stress testing. This culminates in a staged sensor rollout process that starts with dogfooding internally at CrowdStrike, followed by early adopters. It is then made generally available to customers. Customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’) through Sensor Update Policies.
The event of Friday, July 19, 2024 was not triggered by Sensor Content, which is only delivered with the release of an updated Falcon sensor. Customers have complete control over the deployment of the sensor — which includes Sensor Content and Template Types.
...
Rapid Response Content is used to perform a variety of behavioral pattern-matching operations on the sensor using a highly optimized engine.
Newly released Template Types are stress tested across many aspects, such as resource utilization, system performance impact and event volume. For each Template Type, a specific Template Instance is used to stress test the Template Type by matching against any possible value of the associated data fields to identify adverse system interactions.
Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published.
On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production."
Do you seriously believe that all CrowdStrike on Windows customers were at such imminent risk of ransomware that taking an hour or two to run this on one internal setup, and catch the critical error they released, would have been dangerous?
This is a ludicrous position, and it has been proven obviously false by the aftermath: the systems that were crashed by this critical failure were not, in fact, attacked with ransomware once the CS agent was uninstalled (at great pain).
You don't want to be in a situation where you're taken hostage and asked for a hundred-million-dollar ransom just because you were too slow to mitigate the situation.
Mitigation: Validate the number of input fields in the Template Type at sensor compile time
Mitigation: Add runtime input array bounds checks to the Content Interpreter for Rapid Response Content in Channel File 291
- An additional check that the size of the input array matches the number of inputs expected by the Rapid Response Content was added at the same time.
- We have completed fuzz testing of the Channel 291 Template Type and are expanding it to additional Rapid Response Content handlers in the sensor.
Mitigation: Correct the number of inputs provided by the IPC Template Type
Mitigation: Increase test coverage during Template Type development
Mitigation: Create additional checks in the Content Validator
Mitigation: Prevent the creation of problematic Channel 291 files
Mitigation: Update Content Configuration System test procedures
Mitigation: The Content Configuration System has been updated with additional deployment layers and acceptance checks
Mitigation: Provide customer control over the deployment of Rapid Response Content updates
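Several of those mitigations come down to enforcing one invariant in more than one place: the number of fields a Rapid Response instance supplies has to match what the deployed sensor expects. A hypothetical validator-side check (names and structure invented for illustration, not CrowdStrike's actual Content Validator):

```python
def validate_instance(instance_fields: list[str], sensor_expected_count: int) -> list[str]:
    """Reject an instance before publishing if it can't be consumed safely
    by the template type the fleet's sensors were built against."""
    errors = []
    if len(instance_fields) != sensor_expected_count:
        errors.append(
            f"field count mismatch: instance supplies {len(instance_fields)}, "
            f"deployed sensors expect {sensor_expected_count}"
        )
    errors.extend(f"field {i} is empty" for i, f in enumerate(instance_fields) if not f)
    return errors

if __name__ == "__main__":
    # 21 fields against sensors built for 20 -- the case that slipped through validation.
    print(validate_instance([f"f{i}" for i in range(21)], sensor_expected_count=20))
```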
Could you please stop posting unsubstantive comments and/or flamebait? Posts like this one and https://news.ycombinator.com/item?id=41542151 are definitely not what we're trying for on HN.
I just don't think a company like Crowdstrike has a leg to stand on when leveling the "disgruntled" label in the face of their, let's face it, astoundingly epic fuck up. It's the disgruntled employees that I think would have the most clear picture of what was going on, regardless of them being in QA/QC or not because they, at that point, don't really care any more and will be more forthright with their thoughts. I'd certainly trust their info more than a company yes-man which is probably where some of that opposing messaging came from.
Why would you trust a company no-man any more than a company yes-man? They both have agendas and biases. Is it just that you personally prefer one set of biases (anti-company) more than the other (pro-company)?
Yes, I am very much biased toward being anti-company and I make no apologies for that. I've been in the corporate world long enough to know first-hand the sins that PR and corporate management commits on the company's behalf and the harm it does. I find information coming from the individual more reliable than having it filtered through corpo PR, legal, ass-covering nonsense, the latter group often wanting to preserve the status quo than getting out actual info.
Because there is still an off chance that an employee who has been let go isn't speaking out of spite and is merely stating the facts - it depends on a combination of their honesty and the feelings they harbor about being let go. Not everyone who is let go is bitter and/or a liar.
However, every company yes-man is paid to be a yes-man and will speak in favor of the company without exception - that literally is the job. Otherwise they will be fired and will join the ranks of the aforementioned people.
So logically it makes more sense for me to believe the former more than the latter. The two sides are not equivalent (as you may have implied) in terms of trustworthiness.
No, what we have is a publication who is claiming that the people they talked to were credible and had points that were interesting and tended to match one another and/or other evidence.
You can make the claim that Semafor is bad at their jobs, or even that they're malicious. But that's a hard claim to make given that in the paragraph you've quoted they are giving you the contrary evidence that they found.
And this is a process many of us have done informally. When we talk to one ex-employee of a company, well maybe it was just that guy, or just where he was in the company. But when a bunch of people have the same complaint, it's worth taking it much more seriously.
This is like online reviews. If you selectively take positive or negative reviews and somehow censor the rest, the reviews are worthless. Yet, if you report on all the ones you find, it's still useful.
Yes, I'm more likely to leave reviews if I'm unsatisfied. Yes, people are more likely to leave CS if they were unhappy. Biased data, but still useful data.
If design isn’t involved in QC you’re not doing QC very well. If design isn’t plugged into development process enough to understand QC then you’re not doing design very well.
Why would a UX designer be involved in any way, shape, or form in kernel level code patches? They would literally never ship an update if they had that many hands in the pot for something completely unrelated. Should they also have their sales reps and marketing folks pre-brief before they make any code changes?
A UX designer might have told them it was a bad idea to deploy the patch widely without testing a smaller cohort, for instance. That’s an obvious measure that they skipped this time.
I can't believe people on HN are posting this stuff over and over again. Either you are completely disconnected from what proper software development should look like, or you are outright creating the same environments that resulted in the CrowdStrike issue.
Software security and quality is the responsibility of everyone on the team. A good UX designer should be thinking of ways a user can escape the typical flow or operate in unintended ways and express that to testers. And in decisions where management is forcing untested patches everyone should chime in.
Not true; UX designers typically are responsible for advocating for a robust, intuitive experience for users. The fact that kernel updates don’t have a user interface doesn’t make them exempt from asking the simple question: how will this affect users? And the subsequent question: is there a chance that deploying this eviscerates the user experience?
Granted, a company that isn’t focused on the user experience as much as it is on other things might not prioritise this as much in the first place.
How would it not be related? Jamming untested code down the pipe with no way for users to configure when it's deployed and then rendering their machines inoperable is an extremely bad user experience and I would absolutely expect a UX expert to step in to try to avoid that.
Pfft, I never said that at all. I’m not talking about technical decisions. OP was talking about QC, which is verifying software for human use. If you don’t have user-centered people involved (UX or product or proserve) then you end up with user-hostile decisions like these people made.
I would agree if it was a UI designer, but a good UX designer designs for the users, which in this case includes the system admins who will be applying kernel-level code patches. Ensuring they have a good experience, e.g. no crashes, is their job. A likely recommendation would be, for example, small roll-outs to minimise the number of people having a bad user experience if a roll-out goes wrong.
There are some very specific accusations backed up by non-denials from crowdstrike.
Ex-employees said bugs caused the log monitor to drop entries. Crowdstrike responded the project was never designed to alert in real time. But Crowdstrike's website currently advertises it as working in real time.
Ex-employees said people trained to monitor laptops were assigned to monitor AWS accounts with no extra training. Crowdstrike replied that "there were no experienced ‘cloud threat hunters’ to be had" in 2022 and that optional training was available to the employees.
> Quality control was not really part of our process or our conversation.
Is anyone really surprised, or did anyone learn any new information? For those of us who have worked for tech companies, this is one of those repeating complaints that you hear across orgs, and it indicates a less than stellar engineering culture.
I've worked with numerous F500 orgs, and I would say in 3/5 of the orgs I worked in, their code was so bad that it made me wonder how they hadn't had a major incident yet.
In principle yes, I agree that former employees' sentiments have an obvious bias, but if they all trend in the same direction - people who worked in different times and functions and didn't know each other while on the job - that points to a likely underlying truth.
I do agree with having to expect bias there, but who else do you really expect to speak out? Any current employee would very quickly become an ex-employee if they spoke out with any specifics.
I would expect any contractor that may have worked for CrowdStrike, or done something like a third-party audit, would be under an NDA covering their work.
Who's left to speak out with any meaningful details?
Here's some anecdotal evidence - a friend worked at CrowdStrike and was horrified at how incredibly disorganised the whole place was. They said it was completely unsurprising to them that the outage occurred. More surprising to them was that it hadn't happened more often given what a clusterfrock the place was.
Except the fact that CrowdStrike fucked up the one thing they weren't supposed to fuck up.
So yeah, at this point I'm taking the ex-employees' word, because it confirms the results that we already know -- there is no way that update could have gone out had there been proper "safety first" protocols in place and CrowdStrike was "meticulous".
Disgruntled are the CrowdStrike customers that had to deal with the outage. These employees have a lot of reputation to lose by coming forward. CrowdStrike is a disgrace of a company, and many others like it are doing the same things but just haven't gotten caught yet. Software development became a disgrace when the bottom line of squeezing margins to please investors took over.
Honestly, this article describes nearly all companies (from the perspective of the engineers) so I’m not sure I find it hard to believe this one is the same.
I was surprised by how dismissive these comments are. Former staff members, engineers included, are claiming that their former company's unsafe development culture contributed to a colossal world-wide outage and other previous outages. These employees' allegations ought to be seen as credible, or at least as informative. Instead, many seem to be attacking the UX designer commenting on 'Quality control was not part of our process'.
My guess is that people are identifying with the sentence said just before: "Speed [of shipping] is everything." Aka "Move fast and break things."
The culture described by this article must mirror many of our lived experiences. The pure pleasure of shipping code, putting out fires, making an impact (positive or negative)... and then leaving it to the next engineers & managers to sort out, ignoring the mess until it explodes. Even when it does, no one gets blamed for the outage and soon everyone goes back to building features that get them promoted, regardless of quality.
Through that ZIRP light, these process failures must look like a feature, not a bug. The emphasis on "quality" must also look like annoying roadblocks in the way of having fun on the customer's dime.
This is not a game. I would normally agree but not when it comes to low-level kernel drivers. They're a cyber security company making it even worse.
Not very long ago we had this client who ordered a custom high security solution (using a kernel driver). I can't reveal too much but basically they had this offline computer running this critical database and they needed a way to account for every single system call to guarantee that any data could have not been changed without the security system alerting and logging the exact change. No backups etc were allowed to leave the computer ever. We were even required to check ntdll (this was on Windows) for hooks before installing the driver on-site & other safety precautions. Exceptions, freezes or a deadlock? No way. Any system call missed = disaster.
We took this seriously. Whenever we made a change to the driver code we had to re-test the driver on 7 different computers (in-office) running completely different hardware doing a set test procedure. Last test before release entailed an even more extensive test procedure.
This may sound harsh, but CrowdStrike are total amateurs, and always have been. Besides, what have they contributed to the cyber security community? Nothing! Their research is at the level of a junior cyber security researcher. They are willing to outright lie and jump to wild conclusions, which is very frowned upon in the community. I've also heard others comment on how CS doesn't really fit the mold of a standard cyber security company.
Nah, CS should take a close look at true professional companies like Kaspersky and Checkpoint; industry leaders who've created proven, top-notch security solutions (software/services) and, not least, actually contributed their valuable research to the community for free, catching zero-days and reporting them before anyone even had a chance of exploiting them.
Absolutely. Some people are born firefighters. Nothing wrong with that.
I once worked with a senior engineer who loved running incidents. He felt it was real engineering. He loved debugging thorny problems on a strict timeline, getting every engineer in a room and ordering them about, while also communicating widely to the company. Then, there's the rush of the all-clear and the kudos from stakeholders.
Specific to his situation, I think he enjoyed the inflated ownership that the sudden urgency demanded. The system we owned was largely taken for granted by the org; a dead-end for a career. Calling incidents was a good way to get visibility at low-cost, i.e., no one would follow-up on our postmortem action items.
It eventually became a problem, though, when the system we owned was essentially put into maintenance mode, aka zero development velocity. Then, I estimate (balancing for other variables), the rate at which the senior engineer called incidents for non-incidents went up by 3x...
I agree that enjoying firefighting is not inherently harmful. However, the situation you describe afterward irks me in some way I can't quite put my finger on. A lot of words (toxic, dishonest, marketing, counterproductive, bus factor) come to mind, but none of them quite fit.
Some people rise to the occasion during crises and find it rewarding. There's a lot of pop science around COMT (the "warrior gene" associated with stress resilience), which I take with a grain of salt. There does seem to be something there, though, and it overlaps with my personal experience that many great security operations people tend to have ADHD traits.
I've volunteered to fight a share of fires from people who check things in untested, change infrastructure randomly, etc.
What I've learned is that fixing things for these people (and even having entire teams fixing things for weeks) just leads to a continued lax attitude to testing, and leaving the fallout for others to deal with. To them, it all worked out in the end, and they get kudos for rapidly getting a solution in place.
I'm done fixing their work. I'd rather work on my own tasks than fix all the problems with theirs. I'm strongly considering moving on, as this has become an entrenched pattern.
Former QA engineer here, and I can confirm quality is seen as an annoying roadblock in the way of self-interested workers, disguised as being in the way of having fun on the customer's dime.
My favorite repeated reorg strategy over the years is "we will train everyone in engineering to be hot-swappable in their domains". Talk about spinning wheels.
Critical software infrastructure should be regulated the way critical physical infrastructure is. We don't trust the people who make buildings and bridges to "do the right thing" - we mandate it with regulations and inspections. (When your software not working strands millions of people around the globe, it's critical) And this was just a regular old "accident"; imagine the future, when a war has threat actors trying to knock things out.
Did you notice that the piece of software in question was apparently installed mostly in companies where regulations and inspections already override sysadmins' common sense? Are you sure the answer is simply more of the same?
I've worked in these enterprise organizations for a long time. They don't run on common sense, or even what one might consider "business sense". Their existing incentives create bizarre behavior.
For example, you might think "if a big security exploit happens, the stock price might tank". So if they value the stock price, they'll focus on security, right?. In reality what they do is focus on burying the evidence of security exploits. Because if nobody finds out, the stock price won't tank. Much easier than doing the work of actually securing things. And apparently it's often legal.
And when it's not a bizarre incentive, often people just ignore risks, or even low-level failures, until it's too late. Four-way intersections can pile up accidents for years until a school bus full of kids gets T-boned by a dump truck. We can't expect people to do the right thing even if they notice a problem. Something has to force the right thing.
The only thing I have ever seen force an executive to do the right thing is a law that says they will be held liable if they don't. That's still not a guarantee it will actually happen correctly, of course. But they will put pressure on their underlings to at least try to make it happen.
On top of that, I would have standards that they are required to follow, the way building codes specify the standard tolerances, sizes, engineering diagrams, etc that need to be followed and inspected before someone is allowed into the building. This would enforce the quality control (and someone impartial to check it) that was lacking recently.
This will have similar results as building codes - increased bureaucracy, cost, complexity, time... but also, more safety. I think for critical things, we really do need it. Industrial controls, like those used for water, power (nuclear...), gas, etc, need it. Tanker and container ships, trains/subways, airlines, elevators, fire suppressants, military/defense, etc. The few, but very, very important, systems.
If somebody else has better ideas, believe me, I am happy to hear them....
Would you pay 10x (or more, even) for these systems? That means 10x the price of water, utilities, transport etc, which then accumulate up the chain to make other things which don't have criticality but do depend on the ones that do.
The thing is, what exists today exists because it's the path of least resistance.
Consumer costs would not go up 10x to put more care into ensuring the continuous operation of critical IT infrastructure. Things like "an update to the software or configuration of critical systems must first be performed on a test system".
You're right (not sure about the exact factor though) - and there's also additional costs when those systems fail. Someone, somewhere lost money when all those planes were grounded and services suspended.
At some point - maybe it already happened, I don't know - spending more on preventive measures and maintenance will be the path of least resistance.
No, it exists because all must bow to the deity of increasing shareholder value. Remember that a good product is not necessarily the same as, or even a subset of, an easy-to-sell product. Only once the incentives are aligned towards building quality software that lasts will we see change.
> Would you pay 10x (or more, even) for these systems?
if it's critical to your business, then yes; but you quickly find out whether or not it's actually critical to your business or whether it's something you can do without
Probably there should be an independent body that oversees postmortems on tech issues, with the ability to suggest changes. This is what airlines face during crash investigations, and often new rules are put in place (e.g., don't let the shift manager self-certify his own work, from the incident where the pilot's window popped off). What this would look like for software companies, and what the bar is for being subject to this rigor, I don't know (I suspect not a Candy Crush outage, though).
In general, the biggest problem I see with late stage capitalism, and a lack of accountability in general, is that given the right incentives people will “fuck things up” faster than you can stop them. For example, say CrowdStrike was skirting QA - what’s my incentive as an individual employee versus the incentive of an executive at the company? If the exec can’t tell the difference between good QA and bad QA, but can visually see the accounting numbers go up when QA is underfunded, he’s going to optimize for stock price. And as an IC there’s not much you can do unless you’re willing to fight the good fight day in and day out. But when management repeatedly communicates they do not reward that behavior, and indeed may not care at all about software quality over a 5 year time horizon, what do you do? The key lies in finding ways to convince executives or short of that holding them to account like you say.
I've commented on this before, but in this case I think it starts to fall onto the laps of the individual employees themselves by way of licensing, or at least some sort of certification system. Sure, you could skirt a test here or there, but then you'd only be shorting yourself when shit hits the fan. It'd be your license and essentially your livelihood on the line.
"Proper" engineering disciplines have similar systems like the Professional Engineer cert via the NSPE that requires designs be signed off. If you had the requirement that all software engineers (now with the certification actually bestowing them the proper title of 'engineer') sign off on their design, you could prevent the company from just finding someone else more unscrupulous to push that update or whatever through. If the entirety of the department or company is employing properly certificated people, they'd be stuck actually doing it the right way.
That's their incentive to do it correctly: sign your name to it, or lose your license, and just for drama's sake, don't collect $200, directly to jail. For the companies, employ properly licensed engineers, or risk unlimited downside liability when shit goes sideways, similar to what might happen if an engineering firm built a shoddy bridge.
Would a firm that peddles some sort of CRUD app need to go through all of this? If it handles toxic data like payments or health data or other PII, sure. Otherwise, probably not, just like you have small contracting outfits that build garden sheds or whatever being a bit different than those that maintain, say, cooling systems for nuclear plants. Perhaps a law might be written to include companies that work in certain industries or business lines to compel them to do this.
It’s not true that “common sense” is being overridden: most companies and sysadmins do need that baseline to avoid “forgetting” about things which aren’t trivial to implement (if you didn’t work in the field 10+ years ago, it was common to see systems getting patched annually or worse, people opening up SSH/Remote Desktop to the internet for convenience, shared/short passwords even for privileged accounts, vendors would require horribly insecure configuration because they didn’t want to hire anyone who knew how to do things better, etc.). There are drawbacks to compliance security but it has been useful for flushing all of that mess out.
Even if it wasn’t wrong, that’s still the wrong reaction. We’re in this situation because so many companies were negligent in the past and the status quo was obviously untenable. If there is a problem with a given standard the solution is to make a better system (e.g. like Apple did) rather than to say one of the most important industries in the world can’t be improved because that’d require a small fraction of its budget.
I'm saying that a (different) regulation, standard, and inspection, should apply to the whole software bill of materials, as it relates to the critical-ness of the product. Like, if security is important, the security-critical components should be inspected/tested. That's how you build a building safely: the nails are built to a certain specification and the nail vendor signs off on that.
"We can't regulate the industry because then the US loses to China" or "regulation will kill the US competitive advantage!" responses I've had to suggesting the same and I just can't. But I agree with you 100%. If it's safety critical, it should be under even more scrutiny than other things, it shouldn't be left to self-regulating QA-like processes in profit seeking companies and has to have a bit more scrutiny before the big button gets pressed.
Edit: Disclaimer: The quotes aren't mine, just retorts I've received from others when I suggest the R-word.
Not to mention humans going extinct because regulators are to blame for there being no city on Mars. Because that's definitely the reason there's no city on Mars.
Yesterday morning I learned that someone I was acquainted with had just passed away and the funeral is scheduled for next week.
They recently had a stroke at home just days after spending over a month in the hospital.
Then I remembered that they were originally supposed to be getting an important surgery, but it was delayed because of the CrowdStrike outage. It took weeks for the stars to align again and the surgery to happen.
It makes me wonder what the outcome would have been if they had gotten the surgery done that day, and not spent those extra weeks in the hospital with their condition and stressing about their future?
I appreciate your post here and I'm glad you shared, because it's an example of a distributed harm. One of millions to shake out of this incident, that doesn't have a dollar figure, so it doesn't really "count".
To illustrate:
If I were to do something horrible like kick a 3-year-old's knee out and cripple them for life, I would be rightly labeled a monster.
But If I were to say... advocate for education reform to push American Sign Language out of schools, so that deaf children grow up without a developmental language? We don't have words for that, and if we did, none of them would get near the cumulative scope and harm of that act.
We simply do not address distributed harms correctly. And a big part of it is that we don't, we can't, see all the tangible harms it causes.
Not to defend Crowdstrike in any way, but it’s a bit unfair to only look at the downside. What if his hospital hadn’t bought an antivirus, and got hit by ransomware?
Sure, and even if the surgery happened on time, they still might have had a stroke once they got home and had the same outcome.
But as other posts on HN have discussed, anecdotes, especially your own, hit differently.
It makes me thankful the software I work on isn't involved in life and death situations... But then again, it causes me to better consider the things my work could be responsible for (banking). Rushed work that causes a loan application to fail or transaction to be held unnecessarily shouldn't kill someone outright, but there can be real consequences that affect real people just like Rita.
"“Speed was the most important thing,” said Jeff Gardner, a senior user experience designer at CrowdStrike who said he was laid off in January 2023 after two years at the company. “Quality control was not really part of our process or our conversation.”
Their 'expert' on engineering process is a senior UX designer? Somehow, I doubt they were very close to the kernel patch deployment process.
They probably weren’t, but that still speaks to their general culture and is compatible with what we know about their kernel engineering culture (limited testing, no review, no use of common fail safe mechanisms).
It sounds like you might want to read their technical report. That’s neither anecdotal nor a single point, and it showed a pretty large gap in engineering leadership with numerous areas well behind the state of the art.
That’s why I said it was compatible: both these former employees and their own report showed an emphasis on shipping rapidly but not the willingness to invest serious money in the safeguards needed to do so safely. If you want to construct another theory, feel free to do so.
> I bet my ass anyone working in low-level code don't ship the way you do in Cloud.
Their technical report says otherwise – and we know they didn’t adopt the common cloud practices of doing real testing before shipping or having a progressive deployment.
Not justifying what they did with QC, but QC is missing from quite a few places in software development that I've been a part of. People might get the impression from the article that every software project is well tested, whereas in my experience most are rushed out.
I’ve worked for several multi-billion-dollar software companies. None of them had a dedicated QA function, by design. Everything is about moving fast. That culture is OK if you’re making entertainment software or low-criticality business software. It’s a very bad idea for critical software. Unfortunately the “move fast” attitude has metastasised to places where it has no place.
Much of the discourse around this topic has described ideal testing and deployment practice. Maybe it's different in Silicon Valley or investment banks, but for the sorts of companies I work for (telco mostly), things are very far from that ideal.
My view of the industry is one of shocking technical ineptitude from all but a minority of very competent people who actually keep things running... of management who prioritize short-term cost reduction over quality at every opportunity, leading to appalling technical debt and demoralized, over-worked staff who rapidly stop giving a damn about quality, because speaking out about quality problems is penalized.
Fun linguistics fact, but gruntled as the antonym of disgruntled is a back-formation. The word disgruntled is a bit strange, in that it uses "dis-" not as a reversal prefix (such as in dissatisfied or dissimilar), but as an intensifier. The original "gruntle" was related to grunt, grunting, it was similar to "grumble", denoting the sounds an annoyed crowd might make. But this old sense of gruntle, gruntling, gruntled has not been used since the 16th century. And in the past century, people have started back-forming a new "gruntle" by analyzing "dis-gruntled" as using the more common meaning of "dis-".
A similar use of dis- as an intensifier apparently happened in "dismayed" (here from an Old French verb, esmaier, which meant to trouble, to disturb), and in "disturbed" (from the Latin word turba, meaning turmoil). I haven't heard anyone say they are "mayed" or "turbed", but people would probably treat them the same as "gruntled" if you used them.
Everything that we know about CrowdStrike stinks of Knight Capital to me. A minor culture problem snowballed into complete dysfunction, eventually resulting in a company-ending bug.
> That’s about how much the trading problem that set off turmoil on the stock market on Wednesday morning is already costing the trading firm.
> The Knight Capital Group announced on Thursday that it lost $440 million when it sold all the stocks it accidentally bought Wednesday morning because of a computer glitch.
I do not work in finance, but surely every trading company has had an algorithm go wild at some point. Just becomes a matter of how fast someone can pull the circuit breaker before the expensive failure becomes public.
Theirs didn't fail, and they did have one. The circuit breaker they had that would have worked was a big red button that killed all of their trading processes, which would have meant spending the rest of the day figuring out and unwinding their positions.
They were unwilling to push that button in the short time they had. If you read the reports to the SEC or the articles about it, you will note that. The follow-ups recommended that all firms adopt a big red button that is less catastrophic.
The TL;DR of Knight is that Knight had several things go wrong at the same time, and had no circuit breaker for the problem that did not stop trading for the whole firm for the day. Most trading firms have had things go badly, but the holes in the Swiss cheese aligned for Knight (and they were larger than many other firms). This all comes from a sort of culture of carelessness.
I always thought the Swiss cheese model was used to suggest that no one party could possibly be responsible for a bad thing that happened. Interesting to see the company’s culture blamed for the cheese itself.
Personally, I think there are too many things in modern American society that involve diffusion of responsibility, presumably so that people avoid negative consequences. If you're going to suggest that a system gives 1/10th of the responsibility to 10 different people, the one who made the system is the enabler of that and IMO should suffer the consequences.
The Swiss cheese model fits better as a rebuttal when the cheese comprises both the finger-pointer and the finger-pointee. Think: sure, our software had a bug that said up was down, but what about all of your own employees who used the software, had certifications, and should have known better than to accept its conclusions?
Your usage, in assigning blame rather than diffusing it, was novel to me.
Crowdstrike was heavily pushed on us at a previous company, both for compliance reasons by some of our clients (BCG were the ones pushing us to use Crowdstrike) and by our liability insurance company.
It was really an uphill battle to convince everyone not to use Crowdstrike. Eventually I managed to, but only after many meetings where I had to spend a significant amount of time convincing different stakeholders. I'm sure a lot of people just fold and go with them.
Worked on a team that deployed Crowdstrike agents to organizations and... yeah. One of the biggest problems we had was that the daemon would log a massive amount of stuff... but there was no config option to stop or reduce it.
This "security" thing is getting ridiculous. It's become the Gestapo of information technology, they can do anything they want when they want to your computer, cannot resist it and there's absolutely no transparency on what they do to you and why.
I've recently changed jobs, and the new employer, a large company, obviously has to have an IT compliance / security update policy, because everyone else has one, so if they stand out from the crowd, don't do it, and somehow get hacked, it's 100x worse than constantly annoying employees and having top-of-the-line computers work like 1970s terminals.
Rarely does a week pass without the obligatory update + restart. And at least once a month they update THE FUCKING BIOS! What the fuck can be so broken in those laptops that the BIOS is a constant security hazard?! And why would you buy software from someone who, week after week after week, tells you that all you had so far was a hazardous piece of shit that cannot possibly function without constant pampering?
Ahh, and of course they botch it. I had to have the OS completely wiped and reinstalled after the laptop started to behave more and more erratically, 100% caused by faulty updates on top of faulty patches trying to patch the faulty updates. It worked OK for a while afterwards, then updates started piling up, and so far I've only lost use of the web camera (before that it was Wi-Fi, then the display adapter).
There are literally no words for how much I hate "the system" and the constant security-update take-it-up-the-ass we're forced to put up with.
Our law firm, Brown, LLC, would like to speak with you to discuss your experience at Crowdstrike. Is there a good time and way to do this? You can call our office at (877) 561-0000 or view our site www.IFightForYourRights.com. Thank you and best of luck.
> “It was hard to get people to do sufficient testing sometimes,” said Preston Sego, who worked at CrowdStrike from 2019 to 2023. His job was to review the tests completed by user experience developers that alerted engineers to bugs before proposed coding changes were released to customers. Sego said he was fired in February 2023 as an “insider threat” after he criticized the company’s return-to-work policy on an internal Slack channel.
Okay clearly that company has a culture issue. Imagine criticizing a policy and then getting labeled "insider threat".
I'd like to clarify: that my job was also to educate, modernize, and improve developer velocity through tooling and framework updates / changes (impacting every team in my department (UX / frontend engineering)).
Reviewing tests is part of PR review.
--- and before anyone asks, this is my statement on CrowdStrike calling everyone disgruntled:
"I'm not disgruntled.
But as a shareholder (and probably more primarily, someone who cares about coworkers), I am disappointed.
For the most part, I'm still mourning the loss of working with the UX/Platform team."
I know you're just quoting the phrase, but what a gross and dishonest way of phrasing "return to office". Implies working remotely doesn't count as work. Smacks of PR. Yuck.
It’s down 30% since the incident, and flat since 3 years ago.
If it runs up a huge amount in the first half of the year and then the incident knocks 30% off their market cap, that still means the incident was really bad.
Typical of tech companies these days. Quality is considered immaterial - or, worse, pushed onto low-level managers and engineers who don't have the time to properly examine quality and good rollout practices.
C-Suite and investors don't seem to want to spend on quality. They should just price in that their stock investment could collapse any day.
I believe one of the biggest bad trends of the software industry as a whole is cutting down on QA/testing effort. A buggy product is almost always an unsuccessful one.
Blame Facebook and Google for that. They became successful without QA engineers, so the rest of the industry decided to follow suit in an effort to stay modern.
Clearly there weren't any code review workflow processes in place, which is astonishing.
That's why our primary focus is transparency, accountability, and system integrity: to bring a decentralized, transparent, and reliable platform to journalists, researchers, scientists, and content creators.
Yes. It was a manufacturing facility and since the products were photosensitive the entire line operated in total darkness. It was two months before they turned the lights on and I could see what I was programming for.
This was the first place I saw standups. [Edit: this was the 1990s] They were run by and for the "meat", the people running the line. "Level 2" only got to speak if we were blocked, or to briefly describe any new investigations we would be undertaking.
Weirdly (maybe?) they didn't drug test. I thought of all the places I've worked, they would. But they didn't. They were firmly committed to the "no SPOFs" doctrine and had a "tap out" policy: if anyone felt you were distracted, they could "tap you out" for the day. It was no fault. I was there for six months and three or four times I was tapped out and (after the first time, because they asked what I did with my time off the first time) told to "go climb a rock". I tapped somebody out once, for what later gossip suggested was a family issue.
It was a machine. At first it was kind of creepy to have the feeling that when you entered the building you were part of a machine. But after a couple of weeks it was addictive and I have never looked forward to going to work somewhere as much as I did while working there. Even climbing the rocks on my enforced days off gained a mental narrative that "I'm climbing this rock to be the best part of the machine I can be".
Sure most of the times I was tapped out I was distracted by personal thoughts. But one time I was just thinking about the problem. I protested "but I was thinking about the problem!" and they said "go think somewhere else!".
Yes, at a trading company, where important central systems had a multiweek testing process (unless the change was marked as urgent, in which case it was faster) with a dedicated team and a full replica environment which would replay historical functions 1:1 (or in some cases live), and every change needed to have an automated rollback process. Unsurprising since it directly affects the bottom line.
We had a state management and deployment system through which all changes were effected, and it would automatically roll back changes if the smoke test failed or if one of the ops staff found an issue.
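For anyone who hasn't worked with such a system, the core loop is small. A rough sketch (the callables stand in for the replica replay, state capture, and rollback hooks described above; nothing here is the actual system):

```python
def deploy_with_rollback(change_id: str, apply_change, smoke_test, roll_back) -> bool:
    """Apply a change, verify it against the replica, and revert automatically
    on any failure. Every change ships with its own rollback procedure."""
    previous_state = apply_change(change_id)
    try:
        if not smoke_test(change_id):
            raise RuntimeError("smoke test failed")
    except Exception as exc:
        print(f"{change_id}: {exc}; rolling back")
        roll_back(previous_state)
        return False
    return True
```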
Nope. Did everyone forget the tech motto "move fast and break things"? Where is the room for quality control in that philosophy?
Corps won't even put resource into anti-fraud efforts if they believe the millions being stolen from their bottom line isn't worth the effort. I have seen this attitude working in FAANGS.
None of this will change until tech workers stop being masochists and actually unionize.
If their (or your) shop is anything like mine, it's been a constant whittling of ancillary support roles (SDET, QA, SRE) and a shoving of all of the above onto the sole responsibility of devs over the last few years. None of this is surprising at all.
Yes, because in point of fact this company is the best at what it does — preventing security breaches. The outage — disruptive as it was — was not a breach. This elemental fact is lost amidst all the knee jerk HN hate, but goes a long way toward explaining why the stock only took a modest hit.
That's a somewhat narrow definition of "security."
The third component of the CIA triad, availability, is often overlooked, yet availability is what makes the protected asset (and, transitively, the protection itself) useful in the first place.
The disruption is effectively a Denial of Service.
Would be interesting to hear from their employees whether there have been any tangible changes in the aftermath of this fiasco: less blind pursuit of velocity, better QA, etc.
Again, if you're an organisation big enough to care about single-pane-of-glass-monitoring you probably already have access to this via the Microsoft 365 license tier you're on.
If you had used 'some' before 'people' I could agree, but some industries are required to use a SIEM or they can be fined. So if there's a list of SIEMs that are definitely never going to crash from messing around in the kernel, let's get that list going.
Luckily the concern isn't simply whether they could make a mistake and cause a crash by messing around in the kernel; it's whether they're likely to, and I'd argue that CrowdStrike is particularly likely to, given their testing and rollout processes and the culture that produced those failures.
Insurers often require you to have Endpoint Detection and Response on all devices, from a third party. In-house often won't cut it, even if it makes more practical sense.
But then you can't blame anyone else when shit hits the fan! Isn't that what you're really paying for with EDR? No one is safe from a targeted attack, regardless of software.
Just another example of technical leadership being completely irresponsible, and another example of tech companies prioritizing the wrong things. As a security company, this completely blows their credibility. I'm not convinced they learned anything from this and I don't expect this event to change anything. This is a culture issue, not a technical one. One RCA isn't going to change this.
Reliability is a critical facet of security from a business continuity standpoint. Any business still using crowdstrike is out of their mind.
Having worked for a SIEM vendor, I can say that all security software is extremely invasive, and most security people can probably track every action you make on company-issued devices, and that includes HTTPS decryption.
Reminds me of a guy I know openly bragging that he can watch all of his customers who installed his company's security cameras. I won't reveal his details but just imagine any cloud security camera company doing the same and you would probably be right.
Yeah the question is always if the cure is better than the disease. I'm quite ambivalent on this. On the one hand I tend to agree with the "Anti AV camp" that a sufficiently maintained machine can do well when following best practices. Of course that includes SIEM which can also be run on-premise and doesn't necessarily have to decrypt traffic if it just consumes properly formatted logs.
On the other hand there was e.g. WannaCry in 2017, where 200,000 systems across 150 countries running Windows XP and other unsupported Windows versions were hit by ransomware. It shows that companies worldwide had trouble properly maintaining the life cycle of their systems. I think it's too easy to only accuse security vendors of quality problems.
AKIDs (AWS access key IDs)... ugh. They'll be there if you use AWS + Mac.
Again, the plaintext is the problem.
These environment variables get loaded from the command line, scripts, etc. - CrowdStrike and all of the best EDRs also collect and send home all of that, but probably in an encrypted stream?
I usually remote dev on an instance in a VPC because of crap like this. If you like terrible ideas (I don't use this except for occasionally debugging IAM stuff), you can use the IMDS (instance metadata service) as if you were an AWS instance: give a local loopback device the link-local IPv4 address 169.254.169.254/32 and bind traffic from the instance's 169.254.169.254 port 80 to your lo's port 80, and a local AWS SDK will use the IAM instance profile of the instance you're connected to. I'll repeat, this is not a good idea.
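For the curious, this is roughly what the SDK's credential chain does once 169.254.169.254 resolves locally; the paths and headers below are the standard, documented IMDSv2 endpoints, nothing vendor-specific:

    # Rough sketch of the IMDSv2 flow the AWS SDK performs against
    # 169.254.169.254 (which, with the trick above, is really the remote
    # instance's metadata service).
    import json
    import urllib.request

    IMDS = "http://169.254.169.254/latest"

    # Step 1: get a short-lived IMDSv2 session token.
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    token = urllib.request.urlopen(req).read().decode()

    def imds_get(path: str) -> str:
        r = urllib.request.Request(
            f"{IMDS}{path}", headers={"X-aws-ec2-metadata-token": token}
        )
        return urllib.request.urlopen(r).read().decode()

    # Step 2: look up the instance profile role, then its temporary credentials.
    role = imds_get("/meta-data/iam/security-credentials/").strip()
    creds = json.loads(imds_get(f"/meta-data/iam/security-credentials/{role}"))
    print(creds["AccessKeyId"], creds["Expiration"])  # short-lived, auto-rotated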
Thank you, that's a sound perspective, but it is the responsibility of the security staff who deploy EDRs like Crowdstrike to scrub any data at ingestion time into their SIEM. But within CS's platform, it makes little sense to talk about scrubbing, since CS doesn't know what you want scrubbed unless it is standardized data (like SSNs, credit cards, etc.).
Another way to look at it is that the CS cloud environment is effectively part of your environment. The secrets can get scrubbed, but CS still has access to your devices; they can remotely access them and get those secrets at any time without your knowledge. That is the product. The security boundary of OP's Mac is inclusive of the CS cloud.
For their own cloud, yeah, you basically accept their cloud as an extension of your devices. But the back end they use(d?), Splunk, does have scrubbing capability they can expose to customers, if actual customers requested it.
In reality, you can take steps to prevent PII from being logged by Crowdstrike, but credentials are too non-standard to meaningfully scrub. It would be an exercise in futility. If you trust them to have unrestricted access to the credential, the fact that they're inadvertently logging it because of the way your applications work should not be considered an increase in risk.
Anyone with the right level of access to your Falcon instance can run commands on your endpoints (using RTR) and collect any data not already being collected.
That's what EDRs do. Anyone with access to your SIEM or CS data should also be trusted with response access (i.e., remotely accessing those machines).
If you want this redacted, it is SIEM functionality, not Crowdstrike's. It depends on the SIEM, but even older-generation SIEMs have a data scrubbing feature.
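For illustration only (not any particular SIEM's API; the patterns are just common, publicly documented token formats), ingestion-time scrubbing usually amounts to something like:

    # Sketch of ingestion-time redaction, assuming you control the pipeline
    # that forwards events into the SIEM. Extend the patterns for whatever
    # secret formats your org actually uses.
    import re

    REDACTIONS = [
        re.compile(r"AKIA[0-9A-Z]{16}"),                 # AWS access key IDs
        re.compile(r"ghp_[A-Za-z0-9]{36}"),              # GitHub personal access tokens
        re.compile(r"(?i)(password|secret|token)=\S+"),  # key=value style secrets
    ]

    def scrub(event: str) -> str:
        for pattern in REDACTIONS:
            event = pattern.sub("[REDACTED]", event)
        return event

    print(scrub("proc launched with AWS_ACCESS_KEY_ID=AKIAABCDEFGHIJKLMNOP"))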
This isn't a Crowdstrike design decision as you've put it. Any endpoint monitoring tool, including the free and open source ones, behaves just as you described. You won't just see env vars from Macs, but things like domain admin creds and PKI root signing private keys. If you give someone access to an EDR, or they are incident responders with SIEM access, you've trusted them with full -- yet auditable and monitored -- access to that deployment.
Sure, storage. Networking though? SIEMs receive and send data unencrypted? They should not. By sending the data in plain text you open up an attack surface to anyone sniffing the network.
So there's this thing called a "threat model", and it includes assumptions about some moving parts of the infra. It very often includes the assertion that a particular environment (like the IDS log, the signing infra surrounding an HSM, etc.) is "secure" (meaning: outside the scope of that particular threat model). So it often gets papered over, and it takes some reflex to say "hey, how will we secure that other part". There needs to be some consciousness about it, because it's not part of this model under discussion, so not part of the agenda of this meeting...
And it gets lost.
That's how shit happens in compliance-oriented security.
There are secrets like passwords, but there are also secrets like "these are the parameters for running a server for our assembly line for X big corp".
They have IT policies to make sure it largely does not apply. Even in our policy, any personal use is officially forbidden. Funnily, there is also an agreement with our employee board that any personal use will not be sanctioned. So guess what happens. This is done to circumvent not only the GDPR but also the TTDSG in Germany (which is harsher on 'spying' as it applies to telecoms). For any 'officially' gathered personal information, very specific agreements with our employee board exist (reporting of illness, etc.). I wonder how such information, which is also sensitive in a workplace, is handled. I also see these systems used in hospitals etc.; if other people's data is pumped through these systems, the GDPR definitely applies and auditors may find it (I only know of such auditing in finance though). In the future NIS2 will also apply, so exactly the people that use such systems will be put under additional scrutiny. I hope this also triggers some auditing of the systems used, and not just the use of more of such systems.
Is this really a criticism? Because this has been the case forever with all security and SIEM tools. It's one of the reasons why the SIEM is one of the most locked-down pieces of software in the business.
Realistically, secrets alone shouldn't allow an attacker access; they should need access to infrastructure or certificates on machines as well. But unfortunately that's not the case for many SaaS vendors.
I can trust you enough to let you borrow my car and not crash it, but still want to know where my car is with an Airtag.
Similarly employees can be trusted enough with access to prod, while the company wants to protect itself from someone getting phished or from running the wrong "curl | bash" command, so the company doesn't get pwned.
That's far from factual and you are making things up. You don't need to send the actual keys to a SIEM service to monitor the usage of those secrets. You can use a cryptographic hash and send the hash instead (see the sketch below). And they definitely don't need to dump env values and send them all.
Sending env vars of all your employees to one place doesn't improve anything. In fact, one can argue the company is now more vulnerable.
It feels like a decision made by a clueless school principal, instead of a security expert.
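A minimal sketch of the hash-instead-of-plaintext idea from above; the field names and host are made up, and a keyed HMAC (with the key held by the org, not the vendor) is used so low-entropy secrets can't simply be brute-forced from the telemetry:

    # Illustrative only: report *which* secret was used, never its value.
    import hashlib
    import hmac
    import os

    # Key stays with the org; the telemetry vendor never sees it.
    TELEMETRY_KEY = os.environ.get("TELEMETRY_HMAC_KEY", "org-held-key").encode()

    def secret_fingerprint(value: str) -> str:
        return hmac.new(TELEMETRY_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

    # What would leave the machine: a stable fingerprint, not the credential.
    event = {
        "host": "dev-mac-042",                 # hypothetical hostname
        "env_var": "GITHUB_TOKEN",
        "value_fingerprint": secret_fingerprint("ghp_example_not_a_real_token"),
    }
    print(event)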
A secure environment doesn't involve software exfiltrating secrets to a 3rd party. It shouldn't even centralize secrets in plaintext. The thing to collect and monitor is behavior: so-and-so logged into a dashboard using credentials user+passhash and spun up a server which connected to X Y and Z over ports whatever... And those monitored barriers should be integral to an architecture, such that every behavior in need of auditing is provably recorded.
If you lean in the direction of keylogging all your employees, that's not only lazy but ineffective on account of the unnecessary noise collected, and it's counterproductive in that it creates a juicy central target that you can hardly trust anyone with. Good auditing is minimally useful to an adversary, IMO.
> In a highly auditable/“secure” environment, you can’t give secrets to employees with no tracking of when the secrets are used.
This does not seem to require regularly exporting secrets from the employees' machines though, which is the main complaint I am reading. You would log when the secret is used to access something, presumably remote to the user's machine.
I'm well aware of what a SIEM does. You do not need to log a plaintext secret to know what the principal is doing with it. In a highly auditable environment (your words), this is a disaster.
In a highly secure environment, don't use long-lived secrets in the first place. Use 2FA and only give out short-lived tokens. The IdP (identity provider) refreshing the token for you provides the audit trail.
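For example (AWS-flavored, since the thread started with AKIDs; boto3's STS AssumeRole call is real, but the role ARN and session name are placeholders), the short-lived-token pattern looks roughly like:

    # Sketch: trade a role (granted after SSO/2FA) for credentials that expire
    # in 15 minutes. The AssumeRole call itself is recorded by CloudTrail,
    # which is the audit trail; nothing long-lived sits in an env var.
    import boto3

    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/dev-readonly",  # placeholder ARN
        RoleSessionName="alice-laptop",                         # placeholder name
        DurationSeconds=900,
    )
    creds = resp["Credentials"]
    print(creds["AccessKeyId"], creds["Expiration"])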
Keeping secrets and other sensitive data out of your SIEM is a very important part of SIEM design. Depending on what you're dealing with you might want to tokenize it, or redact it, but you absolutely don't want to just ingest them in plaintext.
If you're a PCI company, then ending up with a credit card number in your SIEM can be a massive disaster, because you're never allowed to store that in plaintext and your SIEM data is supposed to be immutable. In theory that puts you out of compliance for a minimum of one year with no way to fix it; in reality your QSAs will spend some time debating what to do about it and then require you to figure out some way to delete it, which might be incredibly onerous. But I have no idea what they'd do if your SIEM somehow became full of credit card numbers; that probably is unfixable…
If that’s straightforward then congratulations, you’ve failed your assessment for not having immutable log retention.
They certainly wouldn't let you keep it there, but if your SIEM was absolutely full of cardholder data, I imagine they'd require you to extract ALL of it, redact the cardholder data, and then import it into a new instance, nuking the old one. But for a QSA to sign off on that they'd be expecting to see a lot of evidence that removing the cardholder data was the only thing you changed.
> Realistically, secrets alone shouldn’t allow an attacker access - they should need access to infrastructure or a certificates in machines as well.
This isn't realistic, it's idealistic. In the real world secrets are enough to grant access, and even if they weren't, exposing one half of the equation in clear text by design is still really bad for security.
Two factor auth with one factor known to be compromised is actually only one factor. The same applies here.
My mental model was that Apple provides backdoor decryption keys to China in advance for devices sold in China/Chinese iCloud accounts, but that they cannot/will not bypass device encryption for China for devices sold outside of the country/foreign iCloud accounts.
Seriously? Crowdstrike is obviously NSA, just like Kaspersky is obviously KGB and Wiz is obviously Mossad. Why else are countries so anxious about local businesses not using agents made by foreign actors?
KGB is not even a thing. Modern equivalent is FSB, no? I'm skeptical. I don't think it's obvious that these are all basically fronts, as much as I'm willing to believe that IC tentacles reach wide and deep.
Agents don't just read env vars and send them to the SIEM.
There's a triggering action that caused the env vars to be used by another... ahem... process... that any EDR software on this beautiful planet would have tracked.
No, it logs every command macOS runs or that you type in a terminal, either directly or indirectly: from macOS's internal periodic tasks to you running "ls".
I don't think this is limited to just Macs based on my experience with the tool. It also sends command line arguments for processes which sometimes contain secrets. The client can see everything and run commands on the endpoints. What isn't sent automatically can be collected for review as needed.
It does redact secrets passed as command line arguments. This is what makes it so inconsistent. It does recognize a GitHub token as an argument and blanks it out before sending it. But then it doesn’t do that if the GitHub token appears in an env var.
It may depend a bit on your organization but I bet most folks using an EDR solution can tell you that Macs are probably very low on the list when it comes to malware. You can guess which OS you will spend time on every day ...
Arbitrary bad practices as status quo without criticism, far from absolving more of the same, demand scrutiny.
Arbitrarily high levels of market penetration by sloppy vendors in high-stakes activities, far from being an argument for functioning markets, demand regulation.
Arbitrarily high profile failures of the previous two, far from indicating a tolerable norm, demand criminal prosecution.
It was only recently that this seemingly ubiquitous vendor, with the kind of access to critical kernel space that any red team adversary would kill for, said "lgtm, ship it" instead of running a test suite, with consequences and costs (depending on who you listen to) ranging from billions in lost treasure to loss of innocent life.
We know who fucked up, and we have an idea of how much corrupt-ass, market-failure crony capitalism it takes to permit such a thing.
The only thing we don’t know is how much worse it would have to be before anyone involved suffers any consequences.