Also, it's really cool how much FB has put into butterFS and cgroups. They're doing very foundational work for the container space.
XDP = eXpress Data Path: an eBPF program that runs before the kernel network stack and lets you process raw packets as fast as your network card will allow.
eBPF = extended Berkeley Packet Filter: a BPF program that is compiled and run on a virtual machine in the kernel (it lets you load code from userspace and run it inside the kernel). It can talk to userspace through maps and hook into various parts of the kernel. An important point is that once compiled, an eBPF program is guaranteed to halt and has other verification performed on it, to make it safe to run in the kernel.
BCC = BPF Compiler Collection: a set of tools for working with eBPF. It uses LLVM and clang to compile eBPF programs written in restricted C, with frontends that make them easy to load and drive from Python, Lua, etc. (see the sketch below).
butterFS = btrfs, the filesystem. People often call it butterFS in conversation, even though the btr stands for b-tree.
cgroups = Technically the work is on cgroup2, which changes the way processes are laid out in the resource hierarchy compared to the original cgroups. This is how resource constraints are placed on containers, although many people use it just to monitor processes (without constraining resources).
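To tie a few of these definitions together, here's a minimal sketch of what a BCC program looks like (written from memory of the bcc Python API, not tested; the execve probe and map name are just illustrative). The restricted-C string is the eBPF program, the BPF_HASH map is how it talks back to userspace, and the surrounding Python is the BCC frontend:

```python
from bcc import BPF
import time

# The eBPF program itself, written in restricted C and compiled to eBPF
# bytecode by BCC (via LLVM/clang) when BPF(text=...) is constructed.
prog = """
BPF_HASH(counts, u32, u64);          // map: kernel <-> userspace channel

int count_execve(void *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);           // bump the per-PID counter in the map
    return 0;
}
"""

b = BPF(text=prog)
# Hook into the kernel: attach the program as a kprobe on the execve syscall.
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="count_execve")

print("Counting execve() calls per PID for 10 seconds...")
time.sleep(10)

# Read the shared map from userspace.
for pid, count in b["counts"].items():
    print("pid %d: %d" % (pid.value, count.value))
```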
This isn't a common spelling for Btrfs, and just added confusion for me since I didn't immediately follow the reference in your original comment (I assumed it was something new). It may be a common spoken pronunciation but writing 'butterFS' in a forum comment is more characters/keystrokes than 'btrfs', so I'd consider it a typo/error to be corrected rather than a technical abbreviation to be defined.
That's made me curious. Got any links or search terms I can use to learn more about that?
I'm guessing nobody has solved the halting problem, so I wonder what the constraints are. Is the eBPF programming language not Turing complete? Are the inputs bounded in a way that means they don't need a general solution to the halting problem? Does the eBPF program get compiled with a kill switch to guarantee halting?
> There are inherent security and stability risks with allowing user-space code to run inside the kernel. So, a number of checks are performed on every eBPF program before it is loaded. The first test ensures that the eBPF program terminates and does not contain any loops that could cause the kernel to lock up. This is checked by doing a depth-first search of the program's control flow graph (CFG). Unreachable instructions are strictly prohibited; any program that contains unreachable instructions will fail to load.
XDP = eXpress Data Path. It is a new packet processing mechanism in the Linux kernel, which is in some ways an answer to DPDK and other userspace networking frameworks that skip the kernel in pursuit of high performance. It was originally proposed by Cloudflare, after they observed poor scalability (in terms of packets per second) for something as simple as a packet drop rule in the kernel. The principle behind XDP is to apply packet processing rules as early as possible in the packet processing pipeline (no wasted work). However, only certain types of rules are simple enough to be done in a high performance way -- complex rules would still be left to netfilter / ebtables.
The rules which XDP leverages, called extended Berkeley Packet Filters (eBPF), are a new take on an old technology. eBPF is a mechanism that allows userspace BPF rules to be inserted on-the-fly into the kernel. Essentially, matching rules which meet certain simplicity requirements (e.g. loop-free) can be compiled into a bytecode that is executed by the kernel in a very efficient way. This is an extremely flexible technology, and one domain it is well suited for is packet processing. BCC is just the set of compiler tools for creating your own eBPF bytecode.
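To make the "drop at the earliest possible point" idea concrete, here's a rough sketch of an XDP program loaded through BCC (the interface name and blanket-drop policy are purely illustrative; a real rule would parse the packet via the xdp_md context and decide per packet):

```python
from bcc import BPF
import time

device = "eth0"  # illustrative interface name

# Restricted C, compiled to eBPF bytecode; runs for every packet before the
# kernel network stack sees it. Returning XDP_DROP discards the packet
# immediately, with no further kernel processing.
prog = """
#include <uapi/linux/bpf.h>

int xdp_drop_all(struct xdp_md *ctx) {
    return XDP_DROP;
}
"""

b = BPF(text=prog)
fn = b.load_func("xdp_drop_all", BPF.XDP)
b.attach_xdp(device, fn, 0)

try:
    print("Dropping all packets on %s, Ctrl-C to stop" % device)
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    # Detach the program so the interface returns to normal processing.
    b.remove_xdp(device, 0)
```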
Afaik, the original idea of XDP was discussed among a few kernel networking hackers at a netdev conference, and a very early prototype was done by Plumgrid back then. Cloudflare is also deploying it in production and has blogged about it as well, though that happened a bit later: https://blog.cloudflare.com/how-to-drop-10-million-packets/
This sentence is not quite correct: "However, only certain types of rules are simple enough to be done in a high performance way -- complex rules would still be left to netfilter / ebtables." Under high packet load, netfilter will simply not be able to keep up. The rules that can be written in eBPF with the help of LLVM's eBPF backend are quite complex; for example, Facebook has written its Katran load balancer in eBPF: https://code.fb.com/open-source/open-sourcing-katran-a-scala... . Google folks harden the network stack's receive path with XDP as a "big red button" to stop malicious packets: http://vger.kernel.org/netconf2017_files/rx_hardening_and_ud...
Recently, Intel developers have added AF_XDP with a zero-copy mode, which gets pretty close to DPDK: https://www.dpdk.org/wp-content/uploads/sites/35/2018/10/pm-... The goal is that DPDK would only need to rely on AF_XDP and would no longer have the burden of maintaining its own user space drivers, so that they can be consolidated in the kernel while retaining DPDK's performance.
Definitely exciting times ahead! :-)
eBPF is a new kernel "tool": https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into...
BCC is built on eBPF: https://github.com/iovisor/bcc
It's really cool new debugging and analysis stuff for the Linux Kernel. I'm asking my team to learn it ASAP.
Tools like bcc make the process easier, but they require additional tooling on the system (e.g. LLVM).
For "production", I guess you would probably want the eBPF program precompiled as a binary that you can load in some other way instead.
Until you’ve written a couple, you might believe this. Afterwards, you’ll understand the issues with debugging, etc.
And yet they've set up their own site for BTRFS and all (https://facebookmicrosites.github.io/btrfs/)
Edit: it looks like a lot of the projects have already been open-sourced but were developed at Facebook (e.g. PSI and cgroups2), and Facebook is now announcing them all together along with some new tools.
But even looking at how it never gained mainstream adoption for over 10 years, when people always wanted something like zfs, it just feels like there's something substantially wrong with its development model.
Personally, I tried btrfs as a root filesystem for a couple of years. It definitely had weird gotchas, like running out of inode space pretty frequently when making use of snapshots. A few years back I also ran into a problem with a bunch of zero-length files (a race between writing metadata and content?).
I switched this year and found zfs to be much more user-friendly and reliable, although setting up zfs on root on ubuntu is still a manual process.
But I guess if (big if?) FB uses it in production (without a lot of internal patches) it may be kinda stable now. Still, I don't see it as being useful for a lot of people nowadays, particularly if you sit on an "expendable" VM in a cloud.
I met Josef at a FB recruiting event at my school.
For example, how (and whether) they estimated the impact of this new load balancing technology in terms of money saved/user experience/security.
BTW, if I were a corporation betting my sales on F5 load balancers, I would be worried (and would probably notify my shareholders of the risk).
- If you're already using gnu/linux, building functionality you need on top of what you have is most likely the easiest and least risky way to go (see also: worse is better). Google and now apparently FB definitely built stuff on top of Linux to meet their specific use cases (Google was a big contributor to the original cgroups IIRC), and it worked out quite well for them.
- I would imagine that the way this might go is that a bright engineer/team identifies an opportunity to make a significant improvement, proposes it, does some back-of-the-envelope calculations of the potential benefits it brings, and convinces management that the risk/reward ratio is satisfactory. This is, generally speaking, a critical skill to possess as an engineer in any organization. But especially at the scale of Google/Facebook, if you can propose something that will save even single-digit percent in CPU/RAM/DISK usage or other operational considerations, that translates to many millions of dollars.
(Side story--and hopefully I'm not violating any NDAs by saying this--when I was at Google, Jeff Dean was leading a grassroots effort to identify and remove wasteful code, such as extraneous log statements and unnecessary string formatting, and maintained a document that cataloged how much $/year (hint: often more than most people's salaries) each improvement saved)
- Generally, I feel like most of these projects likely come about incrementally, perhaps with some patches here or there, some ideas based on experience or academic/PhD research from employees, and eventually that accumulates into something that ends up being cleaned up and upstreamed.
- I also think that generally speaking, a lot of engineers do want to express pride in their work, and open sourcing it is one way to achieve that. Furthermore, there is likely a tradeoff of competitive advantage vs community goodwill, and also getting other people outside of your company to maintain, debug, and improve what you made, and I think Google and Facebook have an engineering culture of preferring the latter to the former. At the end of the day, Google is profitable because of its search/ad algorithms and Facebook because of its social graph and ad platform. Their tech stacks are necessary but not sufficient, and furthermore often don't provide that much benefit to others who aren't running at the same scale (realistically speaking, you can get a lot of mileage out of a single, relatively modest server running OpenBSD or something).
Frequent hangs of my system (unrelated to the defrag process that runs daily, which also uses 100% CPU for minutes -- the best I've been able to narrow it down to is either the raid+nvme combination or the large amount of RAM (>100GB)), Docker is very unstable using btrfs, etc. etc.
I moved back to ext4 + mdraid just last week and couldn't be happier.
Why are you using defrag daily? Why aren't you using the autodefrag mount option instead if you really need such frequent defragging?
Really anytime there's a problem in the kernel, you need to try the workload with a mainline kernel and report the problem to the upstream kernel list if you can reproduce it. If you can't reproduce the problem with mainline, then you have to take up the bug with your distro. That's the way it is with everything, not just Btrfs.
Ergo, I think asking for technical help about Btrfs on serverfault or stack exchange or even HN is weird. People having Btrfs problems need to go directly to the upstream list:
Even weirder is that the serverfault user has SLES! He has a support contract with SUSE, so why post on serverfault? It just makes zero sense to me to do that...
And in the github link, the OP was asked for more information, as it sounded like it wasn't a Docker problem at all, and there was no followup response.
Sure, if I have a bug with program X that's patched by distro Y, that's the normal support path.
I think there are a couple of issues with applying that same logic to a filesystem that's supposedly stable enough to be used as a root FS, though.
First, a filesystem should generally be stable enough that patches applied by a distro don't completely ruin it. How many bugs have there been in ext4 or xfs that were specific to one distro and their kernel patches? They're certainly possible, but I would think that pre-release testing would catch the vast majority of them. Red Hat dropping support for btrfs was a big vote of no confidence here, because it's not just lack of support for RHEL users, it implies lack of testing efforts even for Centos & Fedora users.
Second, if I'm having issues with my root filesystem, it's a bit of a crapshoot as to whether the system is stable enough to compile a mainline kernel and try to reproduce the bug while running that.
And finally, I simply don't want bugs in my root filesystem. I don't even want to get to the point of pondering if I should send the bug report to my distro or to the mainline kernel. I want my root filesystem to be a thing that Just Works, without having to think about it.
File systems are sufficiently complicated that only developers with expertise in a particular file system will be applying patches. Red Hat has device-mapper, LVM, ext4 and XFS developers with such expertise, but not Btrfs developers. That's the reason why they dropped it.
> it implies lack of testing efforts even for Centos & Fedora users
I can't parse that.
> And finally, I simply don't want bugs in my root filesystem
This is both naive and a reasonable request. It's naive in that they all have bugs, users find them, they report them, they get fixed. Happens all the time. It's also reasonable to pick a file system you think will have the least problems for your use case, if you're not interested in being a bug reporter.
I understand that it's just a single data point, and other people have more positive experiences. My intention was not to find a solution here on HN, but OP asked for experience reports so I gave mine.
To the best of my knowledge, it started when I upgraded from 32GB RAM to 128GB RAM, and/or when I started using virtual machines more intensively after that. I tried tuning btrfs in various ways: disabling CoW, enabling/disabling autodefrag. The problems start to re-appear and intensify after about one week of using a fresh install.
All I know is that it's definitely not something simple like "use the latest kernel".
I would (and do) use ZFS on a server because it's rock solid and powerful, but I prefer BTRFS on desktops / laptops because it's more flexible and lower maintenance. The only major downside for me is that it pretty much only works on Linux.
But if you have raw video, I expect it to kick in more.
I.e. for my archive volume compression is quite low (compressed size is about 99% of the original); for my games volume it's better (about 81%).
Is Facebook just throwing out stuff from their private branch of the Linux kernel in hopes that somebody has free time to upstream the changes?
Not being snarky, I genuinely don't understand the point of this...
It is generally a cause for celebration when someone releases their work for everyone else to consider for use.
True for the moment, but I can see that changing. Having a safe VM at the system call layer is a game changer for so many subsystems. It turns the Linux kernel into a hybrid exokernel.
I thought eBPF programs run inside the kernel. Or are they loaded _from_ userspace and run inside the kernel, and so provide efficient into-kernel calls as well as direct access to certain _allowed_ hardware? I hope I’m not just arguing semantics; sorry if I am.
So in the I2C case, I could see device drivers that get called to handle the weird things that I2C devices do, where you're not really sure until master interrupt time exactly what needs to happen next and you need to make a real-time decision. So a BPF program that implements a state machine and is run directly from the master interrupt could be more performant and more power efficient than doing it in user space, and safer than doing it in kernel space.
Did that answer your question?
Can someone explain the word "safe" in that paragraph? Genuinely curious, not trying to start a flame war. I'm interested in understanding use cases for BPF.
I can only see it making sense if much of your disk space is used up by uncompressed text files (or other uncompressed file types).
Another use case: soonish I hope to see support for selectable compression levels for zstd (currently only available with zlib). zstd also supports the use of a training file for even higher ratios. Both would work well for archiving and seed/sprout use cases, where taking the heavy front-end hit of slow writes will be worth it (faster download times, and of course the sprout can have a different compression level or no compression so subsequent writes can be fast).
Latest squashfs support includes zstd and also selectable levels, if you need a read-only image.
* Btrfs compression is multithreaded, and can use up to as many threads as there are cores available on the system.
* Compression might not help speed on SSDs, but it should help reduce the burn rate, if you care about that.
* My intern wrote a patch to add compression level support to zstd in btrfs. It should be merged upstream soon.
* Slightly off topic, but grub will soon understand btrfs compressed with zstd.
The zstd CLI supports multithreaded compression with the flag `-T <num-threads>`.
I'm using zstd compression for my btrfs volumes.