CrowdStrike fail and next global IT meltdown (cnbc.com)
11 points by ijidak 89 days ago | 7 comments



How is this even a suggestion for a company at this scale?

From the article, "Software updates should be rolled out incrementally

One lesson from the global IT outage, O'Neill said, is that CrowdStrike's update should have been rolled out incrementally."

This is why I find the coding interviews of many companies to be misguided.

I bet CrowdStrike's hiring process focuses on LeetCode problems and ignores the practical, real-world engineering considerations that matter when building an agent.

LeetCode bakes premature optimization into the hiring process and ignores the far more common business reasons that software fails.

A robust system for incremental roll-out and verification should be 101 for software with this level of success and market penetration.
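
To make "incremental roll-out and verification" concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `staged_rollout` name, the `deploy` and `is_healthy` callbacks, the stage fractions); it says nothing about how CrowdStrike actually ships updates. The idea is only that each stage must pass a health check before the next, larger stage starts.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
    soak_seconds: float = 0.0,
) -> tuple[list[str], bool]:
    """Deploy to growing slices of the fleet, halting at the first bad stage.

    Returns (hosts that received the update, whether the rollout completed).
    All parameters are illustrative assumptions, not a real vendor API.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = list(hosts[len(updated):target])
        if not batch:
            continue
        for host in batch:
            deploy(host)
        updated.extend(batch)

        # Give the batch time to reboot or crash before judging it.
        time.sleep(soak_seconds)

        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            return updated, False  # halt: later stages never get the update
    return updated, True
```

With a fleet of 100 hosts and an update that breaks every machine, this sketch stops after the first stage, so one host is affected instead of all of them.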


That's one take. However, in this case, the problem is even more basic - and ridiculous. By all accounts, every system that received this update died with a BSOD. How is that possible?

It means that the update was not tested, not even once. It certainly was not tested in multiple environments, with multiple configurations, as must be standard for kernel-level software.

This isn't due to "leet code". This is a fundamental process failure. It should not be possible to push out an untested update.

How was it possible? Will they ever have the guts to explain? Probably not...


There are many, many possibilities for what could have happened that need to be investigated, some of which would be surprisingly hard to test for.

1. Did the CDN have a failing disk? A full null read sometimes happens with failing drives.

2. Did the disk holding the update fail just before the CDN upload? (I.e. did the deployment script successfully upload a failed read?)

3. Did the CDN upload fail, but a different process thought it succeeded, activating distribution?

4. Do their updates conform to an internal format that is later serialized or minified for public distribution, and did something break in the serializer or compression tool?

It is completely possible that the update was tested, good to go, and there was a distribution failure.
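
A distribution failure like the ones listed above is at least detectable in principle. As a minimal sketch, assuming the pipeline records a SHA-256 digest of the exact bytes that passed testing, a final check could compare what the CDN actually serves against that digest before activating distribution. The function names here are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_published_artifact(tested_digest: str, published: bytes) -> None:
    """Raise ValueError unless the published bytes are the ones that were tested.

    `tested_digest` is assumed to have been recorded when the tests ran;
    `published` is what was read back from the distribution point.
    """
    if not published:
        raise ValueError("published artifact is empty")
    if not any(published):
        # Every byte is zero, e.g. a failed read that was uploaded anyway.
        raise ValueError("published artifact contains only null bytes")
    actual = sha256_hex(published)
    if actual != tested_digest:
        raise ValueError(
            f"digest mismatch: tested {tested_digest}, published {actual}"
        )
```

This only catches corruption between testing and publishing; it does nothing for an update that was wrong when it was tested.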

However, I do know what CrowdStrike could have done, and should do in the future:

- Staged rollouts; even a 15-minute gap between stages would help

- More testing (duh)

- Perhaps most importantly, hardening the kernel driver with a fuzzer so that it never crashes on any possible input, but instead continues booting and warns userspace of the failure
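
The last point can be sketched in Python as an analogy (a real kernel driver would be C or similar, and an uncaught exception here stands in for a crash). The file format below is invented for the example: a `CFG1` magic, a 2-byte big-endian entry count, then length-prefixed entries. The parser checks every length before reading, returns `None` on anything malformed, and the loader falls back to the previous configuration with a warning. The small random fuzz loop is a stand-in for a real fuzzer.

```python
import random
import struct
from typing import Optional

MAGIC = b"CFG1"  # hypothetical file signature, invented for this sketch


def parse_update(data: bytes) -> Optional[list[bytes]]:
    """Return the parsed entries, or None if the input is malformed."""
    if len(data) < 6 or data[:4] != MAGIC:
        return None
    (count,) = struct.unpack_from(">H", data, 4)
    entries: list[bytes] = []
    offset = 6
    for _ in range(count):
        if offset >= len(data):
            return None  # header claims more entries than the file holds
        length = data[offset]
        offset += 1
        if offset + length > len(data):
            return None  # entry would run past the end of the file
        entries.append(data[offset:offset + length])
        offset += length
    if offset != len(data):
        return None  # unexpected trailing bytes
    return entries


def load_update(data: bytes, previous: list[bytes]) -> tuple[list[bytes], Optional[str]]:
    """Use the new entries if valid; otherwise keep the old ones and warn."""
    parsed = parse_update(data)
    if parsed is None:
        return previous, "update rejected; keeping previous configuration"
    return parsed, None


def fuzz(iterations: int = 10_000, seed: int = 0) -> None:
    """Feed random blobs to the parser; any exception would be a bug."""
    rng = random.Random(seed)
    for _ in range(iterations):
        body = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 64)))
        prefix = MAGIC if rng.random() < 0.5 else b""
        parse_update(prefix + body)
```

A file full of null bytes fails the magic check and is rejected, so the loader keeps running on the old configuration instead of taking the machine down.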


It wasn’t corrupted. It was bad code. Bad code which clearly wasn’t tested.


Really? As of yesterday, we knew for a fact that the “update” was a file full of null characters, unless there has been some news on that since.

https://news.ycombinator.com/item?id=41009740

That leaves the possibility of a disk failure or CDN failure quite open. Bad code would at least contain something, even if it can’t be executed.


The official company statement says that null characters were not the cause of the problem.


> I bet CrowdStrike's hiring process focuses on LeetCode problems and ignores the practical, real-world engineering considerations that matter when building an agent.

You are 100% right. What has happened is that tech companies have become filled with people managers instead of engineering managers. People managers don't understand what makes for good tech. They rely on proxies for everything.

LeetCode is a proxy; so are number of commits, hours in the office, story points, etc.

Real engineering managers are closely involved with engineering and its processes. They already have a large toolbox of good practices to lean on. If the CEO pushes something, they use this toolbox to demonstrate how it could affect the business.

People managers need to be called out and removed from the industry. It is literally a matter of shareholder value at this point.



