
Facebook open-sources new suite of Linux kernel components and tools - gyre007
https://code.fb.com/open-source/linux/
======
ilovecaching
XDP is the future of high speed networking in the Linux Kernel. What's amazing
about XDP is how accessible it is compared to user space kernel bypass.
Literally anyone can write an eBPF program, and the kernel even sanity-checks
it for you! Very excited to see all the amazing work FB puts into eBPF. I've
used BCC
extensively and it's amazing the granularity you have over resource
consumption with eBPF.

Also it's really cool how much FB has put into butterFS and cgroups. They're
doing very foundational work for the container space, which is very cool.

~~~
zzzzzzzza
As a (semi-casual) Linux user but "kernel outsider", can anyone break down
all the acronyms used here?

~~~
ilovecaching
FB = Facebook :)

XDP = eXpress Data Path, a hook that runs an eBPF program before the kernel
network stack and allows you to process raw packets as fast as your network
card will allow.

eBPF = extended Berkeley Packet Filter. An eBPF program is compiled and run
on a virtual machine in the kernel (allowing you to run code in the kernel
from user space). It can talk to user space through maps and hook into
various parts of the kernel. An important point is that before it's loaded,
an eBPF program is verified: it's guaranteed to halt, and other checks are
performed on it, to make it safe to run in the kernel.

BCC = BPF Compiler Collection, a set of tools for working with eBPF. It uses
LLVM and Clang to make it easy to write eBPF programs from front ends in
Python, Rust, etc.
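To make that workflow concrete, here's a sketch in the style of BCC's classic hello-world examples (it assumes the `bcc` Python package and root privileges to actually load anything; the probe name is my own, and this is not one of Facebook's released tools):

```python
# The eBPF program itself is written in restricted C and handed to BCC as a
# string; LLVM/Clang compile it to eBPF bytecode, and the in-kernel verifier
# checks it before it can be attached.
bpf_program = r"""
int trace_clone(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

try:
    from bcc import BPF  # needs bcc installed, plus root to load the program

    b = BPF(text=bpf_program)
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")
    # b.trace_print()  # would stream the bpf_trace_printk output until interrupted
except Exception:
    # Without bcc/root the C snippet above is still illustrative: it's the
    # kind of program BCC compiles and loads.
    pass
```

The Python side is just the front end; the hot path runs as verified bytecode inside the kernel.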

butterFS = btrfs, the filesystem. People often call it butterFS in
conversation, even though the btr stands for b-tree.

cgroups = Technically the work is on cgroup2, which changes the way processes
are laid out in a resource hierarchy compared to the original cgroups. This
is how resource constraints are placed on containers, although many people
use it just to monitor processes (without constraining resources).
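To make the cgroup2 interface concrete, here's a small sketch (the mount point and helper name are my own assumptions; `memory.max` and `cgroup.procs` are real cgroup2 control files, but actually writing them requires root):

```python
import os

# Typical cgroup2 (unified hierarchy) mount point -- an assumption, not
# guaranteed on every distro.
CGROUP_ROOT = "/sys/fs/cgroup"

def cgroup2_writes(group, mem_max, pid):
    """Return the file -> value writes that would cap `group`'s memory at
    `mem_max` and move `pid` into the group. In cgroup2 every controller
    lives in one unified tree under CGROUP_ROOT."""
    base = os.path.join(CGROUP_ROOT, group)
    return {
        os.path.join(base, "memory.max"): mem_max,     # memory limit
        os.path.join(base, "cgroup.procs"): str(pid),  # group membership
    }
```

Monitoring-only users would instead just read files like `memory.current` from the same directory, without ever writing a limit.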

~~~
wgjordan
> butterFS = btrfs, the filesystem. People often call it butterFS in
> conversation, even though the btr stands for b-tree.

This isn't a common spelling for Btrfs, and just added confusion for me since
I didn't immediately follow the reference in your original comment (I assumed
it was something new). It may be a common spoken pronunciation but writing
'butterFS' in a forum comment is _more_ characters/keystrokes than 'btrfs', so
I'd consider it a typo/error to be corrected rather than a technical
abbreviation to be defined.

~~~
ilovecaching
I did it because I think it's funny.

~~~
Zhyl
I can't believe it's not butterFS

------
amaccuish
I didn't quite get this post; half these things are stuff that Facebook uses,
not things that they themselves created and open-sourced (e.g. Btrfs was
initially from Oracle)?

And yet they've set up their own site for Btrfs and all
([https://facebookmicrosites.github.io/btrfs/](https://facebookmicrosites.github.io/btrfs/))

~~~
traek
Facebook hired the principal developer of Btrfs, Chris Mason, to work on it
full-time.

Edit: it looks like a lot of the projects have already been open-sourced but
were developed at Facebook (e.g. PSI and cgroups2), and Facebook is now
announcing them all together along with some new tools.

~~~
h1d
Interesting that they continue to develop btrfs when Red Hat severed its
ties, which sounded like the end of the road for btrfs, given that no one but
SUSE uses it as the default file system.

But even looking at how it never gained mainstream adoption in over 10 years,
when people always wanted something like ZFS, it just feels like there's
something substantially wrong with its development model.

[https://news.ycombinator.com/item?id=14907771](https://news.ycombinator.com/item?id=14907771)

~~~
nouseforaname
I like the fact that there's a little competition.

Personally, I tried btrfs as a root filesystem for a couple of years. It
definitely had weird gotchas, like running out of inode space pretty
frequently when making use of snapshots. A few years back I also ran into a
problem with a bunch of zero-length files (a race between writing metadata
and content?).

I switched this year and found ZFS to be much more user-friendly and
reliable, although setting up ZFS on root on Ubuntu is still a manual
process.

~~~
rachelbythebay
ENOSPC latching. Not just you. It never ends.

------
platform
Would be interesting to learn, if at all possible, how these kinds of
contributions are justified to upper management/financial folks.

For example, how (and if) they estimated the cost of this new load balancing
technology in terms of money saved/user experience/security.

BTW, if I were a corporation betting my sales on F5 load balancers, I would
be worried (and would probably notify my shareholders of the risk).

~~~
Jnr
I haven't checked what they have released, so I could be wrong, but my guess
is that by open-sourcing their tools they hope to merge them into the Linux
kernel so that they would not have to maintain their own kernel patches.
Additionally, they could get improvements from others.

~~~
mikekchar
I'm not quite sure if the OP meant justifying open-sourcing their tools or
writing them in the first place. However, for open-sourcing, I'm pretty sure
I could justify it easily just on reducing the cost of recruitment. There was
a time when Facebook had (at least in my circles) a terrible reputation as an
employer; working at Facebook was a laughable job. Now it's one of the most
sought-after jobs in the industry. Not only that, but people regularly spend
time getting familiar with their in-house tools before they even send in a
resume. For me, it's basically a no-brainer.

------
dward
Oomd reminds me of the userspace OOM handling mechanism proposed by David
Rientjes of Google:

[https://lwn.net/Articles/590960/](https://lwn.net/Articles/590960/)
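The core idea is simple enough to sketch in a few lines. This is a toy policy of my own, not oomd's actual algorithm: watch a PSI-style memory-pressure figure and pick a victim before the kernel's OOM killer has to act.

```python
def under_pressure(psi_avg10, threshold=40.0):
    """True if the PSI-style 'avg10' pressure figure exceeds our threshold.
    (Threshold and metric choice are illustrative assumptions.)"""
    return psi_avg10 > threshold

def pick_victim(rss_by_pid):
    """Naive policy: choose the process with the largest resident set."""
    return max(rss_by_pid, key=rss_by_pid.get)

def maybe_kill(psi_avg10, rss_by_pid):
    """Return the pid to kill, or None if pressure is acceptable."""
    if under_pressure(psi_avg10) and rss_by_pid:
        return pick_victim(rss_by_pid)
    return None
```

A real daemon layers cgroup awareness, hysteresis, and kill-priority configuration on top of this skeleton, but the userspace-versus-kernel-OOM trade-off is the same.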

~~~
macdice
See also the wonderfully named SIGDANGER signal on AIX. It seems many people
have invented this. I've been wondering out loud recently whether PostgreSQL
should attempt to support these various impending-doom notification
mechanisms, or whether a system that close to the edge really needs human
intervention anyway.

~~~
puzzle
The Amiga's Exec kernel had a feature since version 3.0 that let applications
register so-called memory handlers, which the system could call to purge e.g.
thumbnail caches owned by a userland process. Of course, the Amiga had a
single address space and no memory protection or paging, so things were
easier.

~~~
egorfine
iOS has a mechanism like that: the application receives a notification on
memory pressure and can release some memory in order to avoid being killed.

------
leshow
Is btrfs a good choice for a format for a regular Linux install on a dev or
user machine now? I've been using ext4 for almost a decade, curious to hear
responses.

~~~
stingraycharles
For the past few months, I've been trying to make it work (mostly because of
the raid + snapshotting functionality), but I have run into nothing but
trouble.

Frequent hangs of my system [1] (unrelated to the defrag process that runs
daily, which also uses 100% CPU for minutes -- the best I've been able to
deduce is that it's either because of using raid+nvme, or because of high
(>100GB) RAM), Docker is very unstable using btrfs [2], etc. etc.

I moved back to ext4 + mdraid just last week and couldn't be happier.

1\.
[https://serverfault.com/questions/747366/btrfs-write-operations-hang-when-appending-to-files](https://serverfault.com/questions/747366/btrfs-write-operations-hang-when-appending-to-files)

2\.
[https://github.com/moby/moby/issues/34501](https://github.com/moby/moby/issues/34501)

~~~
cmurf
I've been using it for years, including with Docker, and haven't had any
problems. There are people successfully using Btrfs with tens of thousands of
containers, e.g.

[https://lore.kernel.org/linux-btrfs/CAMp4zn8YUdVShFibUKCXtwZTZpicCbmm7zSYMn7+K5CNt-cxGA@mail.gmail.com/](https://lore.kernel.org/linux-btrfs/CAMp4zn8YUdVShFibUKCXtwZTZpicCbmm7zSYMn7+K5CNt-cxGA@mail.gmail.com/)

Why are you using defrag daily? Why aren't you using the autodefrag mount
option instead if you really need such frequent defragging?

Really anytime there's a problem in the kernel, you need to try the workload
with a mainline kernel and report the problem to the upstream kernel list if
you can reproduce it. If you can't reproduce the problem with mainline, then
you have to take up the bug with your distro. That's the way it is with
everything, not just Btrfs.

Ergo, I think asking for technical help about Btrfs on Server Fault or Stack
Exchange or even HN is weird. People having Btrfs problems need to go
directly to the upstream list:

[http://vger.kernel.org/vger-lists.html#linux-btrfs](http://vger.kernel.org/vger-lists.html#linux-btrfs)

Even weirder, the serverfault user is on SLES! He has a support contract with
SUSE, so why post on serverfault? It just makes zero sense to me to do
that...

And on the github link, the OP was asked for more information, as it sounded
like it wasn't a Docker problem at all, and there was no followup response.

~~~
evil-olive
> Really anytime there's a problem in the kernel, you need to try the workload
> with a mainline kernel and report the problem to the upstream kernel list if
> you can reproduce it. If you can't reproduce the problem with mainline, then
> you have to take up the bug with your distro. That's the way it is with
> everything, not just Btrfs.

Sure, if I have a bug with program X that's patched by distro Y, that's the
normal support path.

I think there are a couple of issues with applying that same logic to a
filesystem that's supposedly stable enough to be used as a root FS, though.

First, a filesystem should generally be stable enough that patches applied by
a distro don't completely ruin it. How many bugs have there been in ext4 or
xfs that were specific to one distro and its kernel patches? They're
certainly possible, but I would think that pre-release testing would catch
the vast majority of them. Red Hat dropping support for btrfs was a big vote
of no confidence here, because it's not just a lack of support for RHEL
users; it implies a lack of testing effort even for CentOS & Fedora users.

Second, if I'm having issues with my root filesystem, it's a bit of a
crapshoot as to whether the system is stable enough to compile a mainline
kernel and try to reproduce the bug while running that.

And finally, I simply _don't want bugs_ in my root filesystem. I don't even
want to get to the point of pondering whether I should send the bug report to
my distro or to the mainline kernel. I want my root filesystem to be a thing
that Just Works, without having to think about it.

~~~
cmurf
> First, a filesystem should generally be stable enough that patches applied
> by a distro don't completely ruin it.

File systems are sufficiently complicated that only developers with expertise
in a particular file system will be applying patches. Red Hat has
device-mapper, LVM, ext4, and XFS developers with such expertise, but no
Btrfs developers. That's the reason why they dropped it.

> it implies lack of testing efforts even for Centos & Fedora users

I can't parse that.

> And finally, I simply don't want bugs in my root filesystem

This is both naive and a reasonable request. It's naive in that all file
systems have bugs; users find them, report them, and they get fixed. Happens
all the time. It's also reasonable to pick a file system you think will have
the least problems for your use case, if you're not interested in being a bug
reporter.

------
reacharavindh
I don't understand the point of this... if they are open-source Linux kernel
contributions, shouldn't they be upstreamed into the Linux kernel?

Is Facebook just throwing out stuff from their private branch of the Linux
kernel in hopes that somebody has free time to upstream the changes?

Not being snarky, I genuinely don't understand the point of this...

~~~
ISL
It is a whole lot easier to peruse and use modifications that someone else has
written and debugged than to write them oneself.

It is generally a cause for celebration when someone releases their work for
everyone else to consider for use.

~~~
pweissbrod
The philosophy of the kernel is to work for everyone's needs. It's a good
question; if these enhancements are as broadly beneficial and unobtrusive as
they seem, it's likely better to upstream them than to carry out-of-band
patches.

------
ecesena
> BPF is a highly flexible, efficient code execution engine in the Linux
> kernel that allows bytecode to run at various hook points, enabling _safe_
> and easy modifications of kernel behaviors with custom code.

Can someone explain the word "safe" in that paragraph? Genuinely curious, not
trying to start a flame war. I'm interested in understanding use cases for
BPF.

~~~
forgottenpass
Certain limitations must be checked (e.g. the program can be proven to halt).
It's a bit much for an HN post; see
[https://lwn.net/Articles/740157/](https://lwn.net/Articles/740157/),
subsection "The eBPF in-kernel verifier", for more background.
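A simplified illustration of one piece of that: classic eBPF guarantees termination partly by rejecting control-flow graphs that contain back-edges, i.e. loops (newer kernels relax this with bounded loops). This sketch is my own toy version of that single check, not the kernel's actual verifier:

```python
def has_loop(cfg):
    """DFS cycle detection over a control-flow graph {node: [successors]}.
    A program whose CFG has no back-edge cannot run forever."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in cfg}

    def visit(n):
        color[n] = GRAY
        for m in cfg.get(n, ()):
            if color.get(m) == GRAY:  # back-edge found: a loop
                return True
            if color.get(m, BLACK) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(visit(n) for n in cfg if color[n] == WHITE)
```

The real verifier does far more (register state tracking, bounds checks, helper-call validation), but loop rejection is the part that makes the halting guarantee intuitive.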

~~~
ecesena
Thanks for the pointer.

------
shmerl
BTRFS is good. Surprisingly, it's the only common upstream Linux filesystem
with transparent compression.

~~~
jolmg
Do you think the transparent compression is worth it? I imagine most of the
data people have on their computers is already compressed (images, audio,
video, etc.), so it would only slow down reading and writing for most files.

I can only see it making sense if much of your disk space is used up by
uncompressed text files (or other uncompressed filetypes).

~~~
cmurf
Using compress=zstd on my system root saves me about 35-40%, and I save about
half the space if I use the compress-force=zstd mount option. I do this on
computers with SDHC/SDXC cards, which have dog-slow writes and lots of spare
CPU time, and some of the time on HDDs. I don't use compression on computers
with NVMe; it doesn't appear to help much at all, and I'm also pretty sure
the compression is currently a single-threaded pipe, so for multi-queue
devices it's plausible some workloads could become slower.

Another use case: soonish I hope to see support for selectable compression
levels for zstd (currently available only with zlib). zstd also supports the
use of a training file for even higher ratios. Both would work well for
archiving and seed/sprout use cases, where taking the heavy front-end hit of
slow writes will be worth it (faster download times, and of course the sprout
can have a different compression level or no compression so subsequent writes
can be fast).

The latest squashfs support also includes zstd, with selectable levels, if
you need a read-only image.

~~~
shmerl
Do they plan to implement multithreaded compression / decompression?

~~~
terrelln
Btrfs already supports multithreaded compression and decompression. Each 128
KB block is (de)compressed with a single thread, but multiple blocks can be
(de)compressed in parallel.

The zstd CLI supports multithreaded compression with the flag
`-T<num-threads>`.
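The per-block scheme described above is easy to sketch: split the data into 128 KB blocks, compress each block independently, and run blocks in parallel (zlib stands in for zstd here, since zstd isn't in the Python standard library):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 128 * 1024  # btrfs compresses in 128 KB chunks

def compress_blocks(data, workers=4):
    """Compress each 128 KB block independently, blocks running in parallel."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))

def decompress_blocks(blocks):
    """Blocks are independent, so each can be decompressed on its own."""
    return b"".join(zlib.decompress(b) for b in blocks)
```

Per-block compression gives up a little ratio versus one big stream, but it buys parallelism and random access into the middle of a file, which is the trade a filesystem wants.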

~~~
shmerl
Good to know, thanks! Can you add that info to btrfs wiki compression page,
please[1]?

1\.
[https://btrfs.wiki.kernel.org/index.php/Compression](https://btrfs.wiki.kernel.org/index.php/Compression)

------
iamgopal
The amount of engineering FB is doing, I wish they would make something more
important than social networks.

