Crowdstrike did this to our production linux fleet back on April 19th, and I've been dying to rant about it.
The short version was: we're a civic tech lab, so we have a bunch of different production websites made at different times on different infrastructure. We run Crowdstrike provided by our enterprise. Crowdstrike pushed an update on a Friday evening that was incompatible with up-to-date Debian stable. So we patched Debian as usual, everything was fine for a week, and then all of our servers across multiple websites and cloud hosts simultaneously hard crashed and refused to boot.
When we connected one of the disks to a new machine and checked the logs, Crowdstrike looked like the culprit, so we manually deleted it and the machine booted; we tried reinstalling it and the machine immediately crashed again. OK, let's file a support ticket and get an engineer on the line.
Crowdstrike took a day to respond, and then asked for a bunch more proof (beyond the above) that it was their fault. They acknowledged the bug a day later, and weeks later produced a root cause analysis: they hadn't covered our scenario (Debian stable running version n-1, I think, which is a supported configuration) in their test matrix. In our own post mortem there was no real way to prevent the same thing from happening again -- "we push software to your machines any time we want, whether or not it's urgent, without testing it" seems to be core to the model, particularly if you're a small IT part of a large enterprise. What they're selling to the enterprise is exactly that they'll do that.
Oh, if you are also running Crowdstrike on linux, here are some things we identified that you _can_ do:
- Make sure you're running in user mode (eBPF) instead of kernel mode (kernel module), since it has less ability to crash the kernel (commands for checking and switching are after this list). This became the default in the latest versions, and they say it now offers equivalent protection.
- If your enterprise allows, you can have a test fleet running version n and the main fleet run n-1.
- Make sure you know in advance who to cc on a support ticket so Crowdstrike pays attention.
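For what it's worth, the backend can be checked and flipped with falconctl. The `-s --backend=kernel` form comes from their own KB quoted further down; the `-g` query, the `bpf` value, and the `falcon-sensor` service name are from memory, so treat them as assumptions and verify against your sensor version's docs:

    # Check which backend the Falcon sensor is currently using (flag assumed; verify)
    sudo /opt/CrowdStrike/falconctl -g --backend

    # Switch to the user-mode (eBPF) backend, then restart the sensor
    sudo /opt/CrowdStrike/falconctl -s --backend=bpf
    sudo systemctl restart falcon-sensor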
I know some of this sounds obvious, but it's easy to screw up organizationally when EDR software is used by centralized CISOs to try to manage distributed enterprise risk -- like, how do you detect intrusions early in a big organization with lots of people running servers for lots of reasons? There are real reasons Crowdstrike is appealing in that situation. But if you're the sysadmin getting "make sure to run this thing on your 10 boxes out of our 10,000" or whatever, then you're the one who cares about uptime and you need to advocate a bit.
I would wager that even most software developers who understand the difference between kernel and user mode aren't going to be aware there is a "third" address space, which is essentially a highly restricted, verified bytecode virtual machine that runs with limited, read-only access to kernel memory.
Not that it changes your point, and I could be wrong, but I'm pretty sure eBPF bytecode is typically compiled to native code by the kernel and runs in kernel mode with full privileges. Its safety properties entirely depend on the verifier not having bugs.
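For anyone who wants to see both halves of that on a live machine, bpftool (part of the kernel's own tooling, needs root) will show it; the program id below is just a placeholder:

    # Loaded eBPF programs; a "jited" size in the output means native code was generated
    sudo bpftool prog show

    # Dump the verifier-checked (translated) instructions for one program
    sudo bpftool prog dump xlated id 42

    # JIT knob: 0 = interpreter only, 1 = JIT on, 2 = JIT on with debug output
    sysctl net.core.bpf_jit_enable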
fwiw there's like a billion devices out there with cpus that can run java byte code directly - it's hardly experimental. for example, Jazelle for ARM was very widely deployed
Depending on what kernel I'm running, CrowdStrike Falcon's eBPF will fail to compile and execute, then fail to fall back to their janky kernel driver, then inform IT that I'm out of compliance. Even LTS kernels in their support matrix sometimes do this to me. I'm thoroughly unimpressed with their code quality.
JackC mentioned in the parent comment that they work for a civic tech lab, and their profile suggests they’re affiliated with a high-profile academic institution. It’s not my place to link directly, but a quick Google suggests they do some very cool, very pro-social work, the kind of largely thankless work that people don’t get into for the money.
Perhaps such organizations attract civic-minded people who, after struggling to figure out how to make the product work in their own ecosystem, generously offer high-level advice to their peers who might be similarly struggling.
It feels a little mean-spirited to characterize that well-meaning act of offering advice as “insane.”
This is gold. My friend and I were joking around that they probably did this to macos and linux before, but nobody gave a shit since it's... macos and linux.
(re: people blaming it on windows and macos/linux people being happy they have macos/linux)
I don’t think people are saying that causing a boot loop is impossible on Linux, anyone who knows anything about the Linux kernel knows that it’s very possible.
Rather it’s that on Linux using such an invasive antiviral technique in Ring 0 is not necessary.
On Mac I’m fairly sure it is impossible for a third party to cause such a boot loop due to SIP and the deprecation of kexts.
I believe Apple prevented this for exactly this reason. Third parties cannot compromise the stability of the core system, since extensions can only run in user space.
I might be wrong about it, but I feel that malware with root access can wreak quite a lot of havoc. Imagine that this malware decides to block the launch of every executable and every network connection, because their junior developer messed up `==` and `===`. It won't cause a kernel crash, but it will probably render the system equally unusable.
Root access is a separate issue, but user-space access to system-level functions is something Apple has been slowly (or quickly on the iOS platform, where they are trying to stop apps snooping on each other) clamping down on for years.
On both macOS and Linux, there's an increasingly limited set of things you can do from root. (but yeah, malware with root is definitely bad, and the root->kernel attack surface is large)
Malware can do tons of damage even with only regular user access, e.g. ransomware. That’s a different problem from preventing legitimate software from causing damage accidentally.
To completely neuter malware you need sandboxing, but this tends to annoy users because it blocks too much legitimate software. You can set up macOS to only run sandboxed software, but nobody does because it's a terrible experience. Better to buy an iPad.
> but nobody does because it’s a terrible experience
To be fair, all apps from the App Store are sandboxed, including on macOS. Some apps that want/need extra stuff are not sandboxed, but still use Gatekeeper and play nice with SIP and such.
FWIW, according to Activity Monitor, somewhere around 2/3 to 3/4 of the processes currently running on my Mac are sandboxed.
Terrible dev experience or not, it's pretty widely used.
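If you'd rather spot-check a specific app than eyeball the Activity Monitor column, the sandbox entitlement shows up via codesign (TextEdit is just an example path, and the output format varies a bit by macOS version):

    # Sandboxed apps carry the com.apple.security.app-sandbox entitlement
    codesign -d --entitlements - /Applications/TextEdit.app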
It depends on your setup. If you actually put in the effort to get apparmor or selinux set up, then root is meaningless. There have been so many privilege escalation exploits that simply got blocked by selinux that you should worry more about setting selinux up than some hypothetical exploit.
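For anyone checking whether that's actually in effect on their boxes, a few standard commands:

    # SELinux: confirm it's enforcing, not just installed
    getenforce
    sestatus

    # SELinux: recent denials, i.e. what it actually blocked (needs auditd running)
    sudo ausearch -m avc -ts recent

    # AppArmor (Debian/Ubuntu): loaded profiles and which are in enforce mode
    sudo aa-status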
It's not unnecessary, it's harder (no stable kernel ABI, and servers won't touch DKMS with a ten foot pole).
On the other hand, you might say that the lack of a stable kernel ABI is what begot eBPF, and that Microsoft is paying for the legacy of allowing whatever (from random drivers to font rendering) to run in kernel mode.
I’ve had an issue with it before on my work MacBook. It would just keep causing the system to hang, making the computer unusable. Had to get IT to remove it.
> we push software to your machines any time we want, whether or not it's urgent, without testing it
Do they allow you to control updates? It sounds like what you want is a small subset of your machines running the latest version, while the rest wait for stability to be proven.
This is what happened to us. We had a small fraction of the fleet upgraded at the same time and they all crashed. We found the cause and set a flag to not install CS on servers with the latest kernel version until they fixed it.
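One low-tech way to implement that kind of flag is a guard in whatever script provisions the sensor; here's a rough sketch, with a made-up blocklist (populate it from CrowdStrike's support matrix or KB for your sensor version):

    #!/bin/sh
    # Hypothetical guard: skip the Falcon sensor install on kernel series that the
    # current sensor build doesn't support yet. The list below is invented; fill it
    # in from CrowdStrike's published support matrix.
    BLOCKED_SERIES="6.5 6.6"
    running="$(uname -r | cut -d. -f1,2)"

    for bad in $BLOCKED_SERIES; do
        if [ "$running" = "$bad" ]; then
            echo "kernel $running is blocklisted; skipping falcon-sensor install" >&2
            exit 0
        fi
    done

    # Install however your org distributes the package (internal repo, artifact store, etc.)
    sudo apt-get install -y falcon-sensor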
I wonder if the changes they put in behind the scenes after your incident on Linux saved Linux systems in this situation, and no one thought to check whether Windows was also at risk.
So in a nutshell, it is about corporations pushing for legislation that compels usage of their questionable products, because such products enable management to claim compliance when things go wrong, even when the thing that goes wrong is the compliance-ensuring product itself.
CrowdStrike Falcon may ship as a native package, but after that it completely self-updates to whatever they think you should be running. Often, I have to ask IT to ask CS to revert my version because the "current" one doesn't work on my up-to-date kernel/glibc/etc. The quality of code that they ship is pretty appalling.
Thanks for confirming. Is there any valid reason these updates couldn't be distributed through proper package repositories, ideally open repositories (especially data files which can't be copyrightable anyway)?
Yes, but that puts a lot of complexity on the end user, and you end up with:
1. A software vendor that is unhappy with how quickly it can ship new features
2. Users that are unhappy the software vendor isn't doing more to reduce their maintenance burden, especially when they have a mixture of OSes, distros, and complex internal IT structures
IMO default package managers have failed on both Linux and Windows to provide a good solution for remote updates, so everyone reinvents the wheel with custom mini package managers + dedicated update systems.
This seems to be misinformation? The CrowdStrike KB says this was due to a Linux kernel bug.
---
Linux Sensor operating in user mode will be blocked from loading on specific 6.x kernel versions
Published Date: Apr 11, 2024
Symptoms
In order to not trigger a kernel bug, the Linux Sensor operating in user mode will be prevented from loading on specific 6.x kernel versions with 7.11 and later sensor versions.
Applies To
Linux sensor 7.11 in user mode will be prevented from loading:
- For Ubuntu/Debian kernel versions: 6.5 or 6.6
- For all distributions except Ubuntu/Debian, kernel versions: 6.5 to 6.5.12, 6.6 to 6.6.2
Linux sensor 7.13 in user mode will be prevented from loading:
- For all distributions except Ubuntu/Debian, kernel versions: 6.5 to 6.5.12, 6.6 to 6.6.2
Linux Sensors running in kernel mode are not affected.
Resolution
CrowdStrike Engineering identified a bug in the Linux kernel BPF verifier, resulting in unexpected operation or instability of the Linux environment.
In detail, as part of its tasks, the verifier backtracks BPF instructions from subprograms to each program loaded by a user-space application, like the sensor. In the bugged kernel versions, this mechanism could lead to an out-of-bounds array access in the verifier code, causing a kernel oops.
This issue affects a specific range of Linux kernel versions that CrowdStrike Engineering identified through detailed analysis of the kernel commit log. It is possible for this issue to affect other kernels if the distribution vendor chooses to utilize the problematic commit.
To avoid triggering a bug within the Linux kernel, the sensor is intentionally prevented from running in user mode for the specific distributions and kernel versions shown in the above section
These kernel versions are intentionally blocked to avoid triggering a bug within the Linux kernel. It is not a bug with the Falcon sensor.
Sensors running in kernel mode are not affected.
No action required, the sensor will not load into user mode for affected kernel versions and will stay on kernel mode.
For Ubuntu 22.04 the following 6.5 kernels will load in user mode with Falcon Linux Sensor 7.13 and higher:
- 6.5.0-1015-aws and later
- 6.5.0-1016-azure and later
- 6.5.0-1015-gcp and later
- 6.5.0-25-generic and later
- 6.5.0-1016-oem and later
If for some reason the sensor needs to be switched back to kernel mode:
Switch the Linux sensor backend to kernel mode:

    sudo /opt/CrowdStrike/falconctl -s --backend=kernel