Facebook open-sources new suite of Linux kernel components and tools (fb.com)
309 points by gyre007 4 months ago | 111 comments



XDP is the future of high speed networking in the Linux Kernel. What's amazing about XDP is how accessible it is compared to user space kernel bypass. Literally anyone can write an eBPF program, and it even sanity checks it for you! Very excited to see all the amazing work FB puts into eBPF. I've used BCC extensively and it's amazing the granularity you have over resource consumption with eBPF.

Also it's really cool how much FB has put into butterFS and cgroups. They're doing very foundational work for the container space, which is very cool.


As a (semi casual) linux user but "kernel outsider" anyone want to break down all the acronyms used here?


FB = Facebook :)

XDP = eXpress Data Path, a hook that runs an eBPF program before the kernel network stack and allows you to process raw packets as fast as your network card will allow.
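
To make "process raw packets before the network stack" a bit more concrete, here is a rough userspace model in Python. Real XDP programs are written in restricted C and compiled to eBPF bytecode, so treat this only as a sketch of the decision logic. The names `xdp_drop_udp_port` and `build_udp_frame`, the blocked port and the hand-built frame are made up for illustration; the XDP_DROP / XDP_PASS values follow the kernel's `enum xdp_action`.

```python
import struct

# Return codes mirror the kernel's enum xdp_action (XDP_DROP = 1, XDP_PASS = 2).
XDP_DROP = 1
XDP_PASS = 2

ETH_HDR_LEN = 14          # dst MAC (6) + src MAC (6) + EtherType (2)
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17


def xdp_drop_udp_port(frame: bytes, blocked_port: int) -> int:
    """Toy stand-in for an XDP program: drop IPv4/UDP frames to blocked_port."""
    # Bounds checks come first, the same discipline the eBPF verifier insists on.
    if len(frame) < ETH_HDR_LEN + 20:
        return XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return XDP_PASS
    ihl = (frame[ETH_HDR_LEN] & 0x0F) * 4      # IPv4 header length in bytes
    if ihl < 20 or frame[ETH_HDR_LEN + 9] != IPPROTO_UDP:
        return XDP_PASS
    udp_off = ETH_HDR_LEN + ihl
    if len(frame) < udp_off + 8:
        return XDP_PASS
    (dst_port,) = struct.unpack_from("!H", frame, udp_off + 2)
    return XDP_DROP if dst_port == blocked_port else XDP_PASS


def build_udp_frame(dst_port: int) -> bytes:
    """Hypothetical helper: minimal Ethernet + IPv4 + UDP frame (checksums zeroed)."""
    eth = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, IPPROTO_UDP, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    udp = struct.pack("!HHHH", 12345, dst_port, 8, 0)
    return eth + ip + udp
```

Calling `xdp_drop_udp_port(build_udp_frame(53), 53)` yields the drop verdict, and any frame that is too short or not IPv4/UDP is passed on untouched.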

eBPF = extended Berkeley Packet Filter, a program that is compiled to bytecode and run on a virtual machine in the kernel (it lets you run user-supplied code in the kernel, loaded from userspace). It can talk to user space through maps and hook into various parts of the kernel. An important point is that once compiled, an eBPF program is guaranteed to halt and has other verification performed on it, to make it safe to run in the kernel.

BCC = BPF Compiler Collection, a set of tools for working with eBPF. It uses LLVM and Clang to compile eBPF programs written in restricted C, and provides front ends in Python, Lua, etc. for loading and driving them.

butterFS = btrfs, the filesystem. People often call it butterFS in conversation, even though the btr stands for b-tree.

cgroups = Technically the work is on cgroup2, which changes the way processes are laid out in a resource hierarchy from the original cgroups. This is how resource constraints are placed on containers, although many people use it just to monitor processes (without constraining resources).


> butterFS = btrfs, the filesystem. People often call it butterFS in conversation, even though the btr stands for b-tree.

This isn't a common spelling for Btrfs, and just added confusion for me since I didn't immediately follow the reference in your original comment (I assumed it was something new). It may be a common spoken pronunciation but writing 'butterFS' in a forum comment is more characters/keystrokes than 'btrfs', so I'd consider it a typo/error to be corrected rather than a technical abbreviation to be defined.


I did it because I think it's funny.


I can't believe it's not butterFS


I usually pronounce it as betterFS, which perfectly describes it :-)


> An important point is that once compiled, an eBPF program is guaranteed to halt …

That's made me curious. Got any links or search terms I can use to learn more about that?

I'm guessing nobody has solved the halting problem, so I wonder what the constraints are. Is the eBPF programming language not Turing complete? Are the inputs bounded in a way that means they don't need a general solution to the halting problem? Does the eBPF program get compiled with a killswitch to guarantee halting?


From elsewhere in the thread I found this link https://lwn.net/Articles/740157/

> There are inherent security and stability risks with allowing user-space code to run inside the kernel. So, a number of checks are performed on every eBPF program before it is loaded. The first test ensures that the eBPF program terminates and does not contain any loops that could cause the kernel to lock up. This is checked by doing a depth-first search of the program's control flow graph (CFG). Unreachable instructions are strictly prohibited; any program that contains unreachable instructions will fail to load.
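
A toy version of that first check, assuming the program has already been turned into a CFG given as a dict of successor lists. The representation and the name `check_cfg` are my own; the real verifier works on eBPF instructions and does far more than this.

```python
def check_cfg(cfg, entry=0):
    """Reject a control flow graph that has a loop or unreachable nodes.

    cfg maps each node to a list of its successor nodes.
    Returns (ok, reason).
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited / on the DFS stack / finished
    color = {node: WHITE for node in cfg}

    # Iterative depth-first search so deep graphs don't hit the recursion limit.
    stack = [(entry, iter(cfg[entry]))]
    color[entry] = GREY
    while stack:
        node, successors = stack[-1]
        nxt = next(successors, None)
        if nxt is None:
            # All successors explored: this node is finished.
            color[node] = BLACK
            stack.pop()
        elif color[nxt] == GREY:
            # Edge to a node still on the stack: a back edge, i.e. a loop.
            return False, f"back edge {node} -> {nxt}"
        elif color[nxt] == WHITE:
            color[nxt] = GREY
            stack.append((nxt, iter(cfg[nxt])))

    unreachable = sorted(n for n, c in color.items() if c == WHITE)
    if unreachable:
        return False, f"unreachable: {unreachable}"
    return True, "ok"
```

A straight line or a diamond-shaped branch passes, while `{0: [1], 1: [0]}` is rejected for its back edge and `{0: [], 1: []}` for the unreachable node 1.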


Thanks!


Ok, so what does eBGP stand for? (By the way, thanks for the other definitions!)


external Border Gateway Protocol. BGP is a routing protocol used to share reachability information across independent routing domains; iBGP is when you use BGP to manage your own networks. eBGP is what is spoken between different ASes (Autonomous Systems) across the core of the internet (routers without a default route). You'll often read "Country X lost 50% of traffic for N hours due to eBGP issues".


In this context it was a typo (now corrected).


What’s the difference between using kernel parameters within sysctl.conf and cgroup2?


cgroups aren't for manipulating kernel parameters. They're for setting resource limits on pids, and retrieving information about the resource usage of a pid or group of pids.
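
For a concrete picture: with cgroup2 mounted (usually at /sys/fs/cgroup), limits and usage are plain files inside each group's directory, e.g. memory.max and memory.current. A small Python sketch of reading them; the helper names are mine and the directory in the note below is only an example.

```python
from pathlib import Path


def parse_memory_max(text: str):
    """memory.max holds either the literal 'max' (no limit) or a byte count."""
    value = text.strip()
    return None if value == "max" else int(value)


def memory_usage(cgroup_dir: str):
    """Return (current_bytes, limit_bytes_or_None) for one cgroup2 directory."""
    base = Path(cgroup_dir)
    current = int((base / "memory.current").read_text().strip())
    limit = parse_memory_max((base / "memory.max").read_text())
    return current, limit
```

On a systemd machine something like `memory_usage("/sys/fs/cgroup/system.slice")` may work, provided the memory controller is enabled at that level. Setting a limit is the reverse operation: writing a byte count into memory.max, which needs the appropriate privileges.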


Disclaimer: I am not an expert in this, so any corrections are welcome. But here's my intuition.

XDP = eXpress Data Path. It is a new packet processing mechanism in the Linux kernel, which is in some ways an answer to DPDK and other userspace networking frameworks that skip the kernel in pursuit of high performance. It was originally proposed by Cloudflare, when they observed poor scalability (in terms of packets per second) for something as simple as a packet drop rule in the kernel. The principle behind XDP is to apply packet processing rules as early as possible in the packet processing pipeline (no wasted work). However, only certain types of rules are simple enough to be done in a high performance way -- complex rules would still be left to netfilter / ebtables.

The rules which XDP uses, called extended Berkeley Packet Filters (eBPF), are a new take on an old technology. eBPF is a mechanism that allows userspace BPF rules to be inserted on-the-fly into the kernel. Essentially, matching rules which meet certain simplicity requirements (e.g. loop free) can be compiled into a bytecode that is executed by the kernel in a very efficient way. This is an extremely flexible technology, and one domain for which it is well suited is packet processing. BCC is just the set of compiler tools for creating your own eBPF bytecode.


Here's some more info in the BPF and XDP reference guide on concepts, use cases and getting started examples to catch up: https://cilium.readthedocs.io/en/latest/bpf/

Afaik, the original idea of XDP was discussed among a few kernel networking hackers at a netdev conference, and a very early prototype was done by Plumgrid back then. Cloudflare is also deploying it in production and has blogged about it as well, though that happened a bit later: https://blog.cloudflare.com/how-to-drop-10-million-packets/

This sentence is not quite correct: "However, only certain types of rules are simple enough to be done in a high performance way -- complex rules would still be left to netfilter / ebtables." Under high packet load, netfilter will simply not be able to keep up. The rules that can be written in eBPF with the help of LLVM's eBPF backend can be quite complex. For example, Facebook has written their Katran load balancer in eBPF: https://code.fb.com/open-source/open-sourcing-katran-a-scala... . Google folks harden the network stack's receive path with XDP as a "big red button" to stop malicious packets: http://vger.kernel.org/netconf2017_files/rx_hardening_and_ud...

Recently Intel developers have added AF_XDP with zero-copy mode, which gets pretty close to DPDK: https://www.dpdk.org/wp-content/uploads/sites/35/2018/10/pm-... The goal is that DPDK would only need to rely on AF_XDP and would no longer have the burden of maintaining its own user space drivers, so that they can be consolidated in the kernel while retaining the performance of DPDK.

Definitely exciting times ahead! :-)


Thank you for the insight! Your post adds helpful context / corrections. Very exciting times, indeed! :)


I can get you a couple:

eBPF is a new kernel "tool": https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into...

BCC is built on eBPF: https://github.com/iovisor/bcc

It's really cool new debugging and analysis stuff for the Linux Kernel. I'm asking my team to learn it ASAP.


By butterFS do you mean btrfs? I know the former only as a pronunciation of the latter.


b-tree-FS has the same number of syllables, so butterFS is an unnecessary and silly pronunciation.


Nope, that is exactly the name under which it is commonly known, and that's how its developers pronounce it officially.


Writing eBPF programs is still quite involved, especially if you want to distribute them to other systems (currently the available features depend on the kernel version, so the programs are not very portable yet).

Tools like bcc make the process easier, but require additional tooling on the system (e.g. LLVM).


Maybe I'm wrong (or things changed since I looked at this), but if you use bcc you essentially depend on LLVM at runtime.

I guess you would probably want the (precompiled) eBPF program as a binary that you can load in some way instead, for "production".


So from your perspective XDP is better than DPDK and will be the leading userspace network stack in years to come, is that correct? Any chance you can tell me why?


From a kernel developer's perspective :) Check out the talk: Fast Programmable Networks & Encapsulated Protocols by David S. Miller.


Thank you very much! DPDK is not linux! :)


> Literally anyone can write an eBPF program,

Until you’ve written a couple, you might believe this. After, you’ll understand the issues with debugging, etc.


I took that sentence idiomatically, to mean something like "the obstacles have been reduced by at least an order of magnitude", since it's never literally true that literally anyone can write some software.


I didn't quite get this post; half these things are stuff that facebook uses, not that they themselves have created and open-sourced (e.g. BTRFS initially from Oracle)?

And yet they've set up their own site for BTRFS and all (https://facebookmicrosites.github.io/btrfs/)


Facebook hired the principal developer of Btrfs, Chris Mason, to work on it full-time.

Edit: it looks like a lot of the projects have already been open-sourced but were developed at Facebook (e.g. PSI and cgroups2), and Facebook is now announcing them all together along with some new tools.


Interesting that they continue to develop btrfs after Red Hat severed its ties, which sounded like the end of the road for btrfs given that no one but SUSE uses it as the default file system.

But given that it hasn't gained mainstream adoption in over 10 years, even though people have always wanted something like zfs, it just feels like there's something substantially wrong with its development model.

https://news.ycombinator.com/item?id=14907771


I like the fact that there's a little competition.

Personally, I tried btrfs as a root filesystem for a couple years. It definitely had weird gotchas like running out of inode space pretty frequently when making use of snapshots. A few years back I also ran into a problem with a bunch of zero length files (race between writing metadata and content?)

I switched this year and I found zfs to be much more user friendly and reliable, although setting up zfs on root on Ubuntu is still a manual process.


ENOSPC latching. Not just you. It never ends.


I feel most of the bad experiences date from its early days, and on enterprise systems it never really got adoption because of the bad RAID support. I still remember some bugs with inodes that could not be deleted making me mad and forcing me to format my laptop with the more conventional XFS :)

But I guess if (big if?) FB uses it in production (without a lot of internal patches) it may be kinda stable now. Still, I don't see it as being useful for a lot of people nowadays, particularly if you sit on an "expendable" VM in a cloud.


Synology rolled out btrfs on par with ext4 a few years ago for models that support it. Changing to a new filesystem is scary so I'm not surprised it has taken this long to reach mainstream. I'm glad there's still a big player like FB supporting it.


They hired a bunch of people from Fusion-IO to work on it.

https://en.wikipedia.org/wiki/Fusion-io

I met Josef at a FB recruiting event at my school.


Oomd was open sourced before this announcement: https://code.fb.com/production-engineering/oomd/ But seems like Facebook is grouping them to show how they can work together.


Facebook can "show off" and develop a reputation as having interesting, technically innovative, meaty, challenging projects to work on which is useful for recruitment purposes.


And still it's Facebook :D


Yeah it's really weird. Same for cgroup2 and all this other stuff. This is all established open software.


Would be interesting to learn, if at all possible, how these kinds of contributions are justified to upper management/financial folks.

For example, how (and if) they estimated the cost of this new load balancing technology in terms of money saved/user experience/security.

BTW, if I am a corporation betting my sales on F5 load balancers, I would be worried (and would probably notify my shareholders of the risk).


I'd love to see the full breakdown, but I think hints of it are there in the text. E.g., for the load balancing they say, it "helped improve the performance and scalability of network load balancing while drastically reducing inefficiencies." At Facebook's scale, those are pretty easy to translate into dollars by, e.g., looking at how many machines you now didn't have to buy to keep up with growth.


I haven't checked what they have released so I could be wrong, but my guess is that by open sourcing their tools they hope to get them merged into the Linux kernel so that they would not have to maintain their own kernel patches. Additionally they could get improvements from others.


I'm not quite sure if the OP meant justifying open sourcing their tools or writing them in the first place. However, for open sourcing, I'm pretty sure I could justify that pretty easily just on reducing the cost of recruitment. There was a time that Facebook had (at least in my circles) a terrible reputation as an employer. Working at Facebook would be a laughable job. Now it's one of the most sought after jobs in the industry. Not only that, but people regularly spend time getting familiar with their in house tools before they even send in a resume. For me, it's basically a no-brainer.



I don't have firsthand knowledge of obtaining such justifications, but a couple of things come to mind that seem reasonable to me as an engineer who has to propose any sort of project (and I worked at Google, which also makes upstream contributions, and was able to glimpse into some of the "contribute to open source" justifications):

- If you're already using gnu/linux, building functionality you need on top of what you have is most likely the easiest and least risky way to go (see also: worse is better[1]). Google and now apparently FB definitely built stuff on top of Linux to meet their specific use cases (Google was a big contributor to the original cgroups IIRC), and it worked out quite well for them.

- I would imagine that the way this might go is that a bright engineer/team identifies an opportunity to make a significant improvement, proposes it, does some back-of-the-envelope calculations of the potential benefits it brings, and convinces management that the risk/reward ratio is satisfactory. This is, generally speaking, a critical skill to possess as an engineer in any organization. But especially at the scale of Google/Facebook, if you can propose something that will save even single-digit percent in CPU/RAM/DISK usage or other operational considerations, that translates to many millions of dollars.

(Side story--and hopefully I'm not violating any NDAs by saying this--when I was at Google Jeff Dean was leading a grassroots effort to identify and remove wasteful code, such as extraneous log statements and unnecessary string formatting, and maintained a document that cataloged how much $/year (hint: often more than most people's salaries) each improvement saved)

- Generally, I feel like most of these projects likely come about incrementally, perhaps with some patches here or there, some ideas that come about based on experience or academic/PhD research from employees, and then eventually that accumulates into something that ends up being cleaned up and upstreamed.

- I also think that generally speaking, a lot of engineers do want to express pride in their work, and open sourcing it is one way to achieve that. Furthermore, there is likely a tradeoff of competitive advantage vs community goodwill and also getting other people outside of your company to maintain, debug, and improve what you made, and I think Google and Facebook have an engineering culture of preferring the latter to the former. At the end of the day, Google is profitable because of its search/ad algorithms and Facebook because of its social graph and ad platform. Their tech stacks are necessary but not sufficient, and furthermore often don't provide that much benefit to others who aren't running at the same scale (realistically speaking, you can get a lot of mileage out of a single, relatively modest server running OpenBSD or something)

[1] https://www.jwz.org/doc/worse-is-better.html


Oomd reminds me of the userspace OOM handling mechanism proposed by David Rientjes of Google:

https://lwn.net/Articles/590960/


See also the wonderfully named SIGDANGER signal on AIX. It seems many people have invented this. I've been wondering out loud recently if PostgreSQL should attempt to support these various impending-doom-notification mechanisms, or whether a system that close to the edge really needs human intervention anyway.


The Amiga had a feature in the Exec kernel since version 3.0 that let applications register so-called memory handlers, which the system could call to purge e.g. thumbnail caches owned by the userland process, etc. Of course the Amiga had a single address space and no memory protection or paging, so things were easier.
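
The general shape of such a mechanism, as a toy Python model rather than the actual Exec API: components register a callback, and under memory pressure the callbacks are asked in priority order to give memory back. All class and method names here are invented for illustration.

```python
class MemoryPressureRegistry:
    """Toy model: components register callbacks that free memory on request."""

    def __init__(self):
        self._handlers = []   # list of (priority, callback)

    def register(self, callback, priority=0):
        """callback(bytes_needed) must return the number of bytes it freed."""
        self._handlers.append((priority, callback))
        # Highest priority first; the sort is stable, so ties keep registration order.
        self._handlers.sort(key=lambda item: -item[0])

    def reclaim(self, bytes_needed):
        """Call handlers in priority order until enough memory has been freed."""
        freed = 0
        for _priority, callback in self._handlers:
            if freed >= bytes_needed:
                break
            freed += callback(bytes_needed - freed)
        return freed


class ThumbnailCache:
    """Example owner of purgeable memory."""

    def __init__(self):
        self.entries = {}     # name -> bytes

    def purge(self, bytes_needed):
        """Evict entries (oldest first) until bytes_needed has been freed."""
        freed = 0
        for name in list(self.entries):
            if freed >= bytes_needed:
                break
            freed += len(self.entries.pop(name))
        return freed
```

A cache registers itself with `registry.register(cache.purge)`, and whoever detects pressure calls `registry.reclaim(n)`; the hard part in a real system is deciding when to call it and what to do if the handlers don't free enough.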


iOS has a mechanism like that. The application receives a notification on memory pressure and can release some memory to avoid being killed.


Does this fix the cgroup memory reporting issue? It's caused me a lot of pain with containers containing crap code that assumes the memory is all available.


Is btrfs a good choice for a format for a regular Linux install on a dev or user machine now? I've been using ext4 for almost a decade, curious to hear responses.


For the past few months, I've been trying to make it work (mostly because of the raid + snapshotting functionality), but I have run into nothing but trouble.

Frequent hangs of my system (unrelated to the daily defrag process, which also uses 100% CPU for minutes; my best guess at the cause is either raid+nvme or the large amount of RAM (>100GB)), Docker being very unstable on btrfs [2], etc.

I moved back to ext4 + mdraid just last week and couldn't be happier.

[1] https://serverfault.com/questions/747366/btrfs-write-operati...

[2] https://github.com/moby/moby/issues/34501


I've been using it for years, including with Docker, and haven't had any problems. There are people successfully using Btrfs with tens of thousands of containers, e.g.

https://lore.kernel.org/linux-btrfs/CAMp4zn8YUdVShFibUKCXtwZ...

Why are you using defrag daily? Why aren't you using the autodefrag mount option instead if you really need such frequent defragging?

Really anytime there's a problem in the kernel, you need to try the workload with a mainline kernel and report the problem to the upstream kernel list if you can reproduce it. If you can't reproduce the problem with mainline, then you have to take up the bug with your distro. That's the way it is with everything, not just Btrfs.

Ergo, I think asking for technical help about Btrfs on serverfault or stack exchange or even HN is weird. People having Btrfs problems need to go directly to the upstream list:

http://vger.kernel.org/vger-lists.html#linux-btrfs

Even weirder is the serverfault user has SLES! He has a support contract with SUSE so why post in serverfault? It just makes zero sense to me to do that...

As for the GitHub link, the OP was asked for more information since it didn't sound like a Docker problem at all, and there was no follow-up response.


> Really anytime there's a problem in the kernel, you need to try the workload with a mainline kernel and report the problem to the upstream kernel list if you can reproduce it. If you can't reproduce the problem with mainline, then you have to take up the bug with your distro. That's the way it is with everything, not just Btrfs.

Sure, if I have a bug with program X that's patched by distro Y, that's the normal support path.

I think there's a couple issues with applying that same logic to a filesystem that's supposedly stable enough to be used as a root FS though.

First, a filesystem should generally be stable enough that patches applied by a distro don't completely ruin it. How many bugs have there been in ext4 or xfs that were specific to one distro and their kernel patches? They're certainly possible, but I would think that pre-release testing would catch the vast majority of them. Red Hat dropping support of btrfs was a big vote of no confidence here, because it's not just lack of support for RHEL users, it implies lack of testing efforts even for Centos & Fedora users.

Second, if I'm having issues with my root filesystem, it's a bit of a crapshoot as to whether the system is stable enough to compile a mainline kernel and try to reproduce the bug while running that.

And finally, I simply don't want bugs in my root filesystem. I don't even want to get to the point of pondering if I should send the bug report to my distro or to the mainline kernel. I want my root filesystem to be a thing that Just Works, without having to think about it.


>First, a filesystem should generally be stable enough that patches applied by a distro don't completely ruin it.

File systems are sufficiently complicated that only developers with expertise in a particular file system will be applying patches. Red Hat has device-mapper, LVM, ext4 and XFS developers with such expertise, but not Btrfs developers. That's the reason why they dropped it.

> it implies lack of testing efforts even for Centos & Fedora users

I can't parse that.

> And finally, I simply don't want bugs in my root filesystem

This is both naive and a reasonable request. It's naive in that they all have bugs, users find them, they report them, they get fixed. Happens all the time. It's also reasonable to pick a file system you think will have the least problems for your use case, if you're not interested in being a bug reporter.


I've been using Arch Linux mostly, with the 4.18 and more recently 4.19 kernels. As explained in another post, I've done extensive trial & error to isolate the issue.

I understand that it's just a single data point, and other people have more positive experiences. My intention was not to find a solution here on HN, but OP asked for experience reports so I gave mine.


Using Arch, and such recent kernels, and having isolated the issue, you should report it upstream. Otherwise it sounds like you have the time and preference to report negative experiences on HN rather than see the problem get fixed.


Every year I consider spending some time to get familiar with btrfs and every year I read horror stories like this one and I stop right there.


Just look at the btrfs commits in each new Linux version - the critical fixes never stop. And most of them are direct corner case fixes without systematic cleanup. The code seems to consist of corner cases without a robust framework to perform difficult operations.


Seems like these are issues mainly related to using older versions of the kernel and/or Docker.


This is not the case. I've done a lot of trial & error, and tried Ubuntu 18.04 / 18.10, Mint and Arch. The problems persist everywhere.

To the best of my knowledge, it started when I upgraded from 32GB RAM to 128GB RAM, and/or started using virtual machines more intensively after that. I tried tuning btrfs in various ways, disabling CoW, enabling/disabling autodefrag. The problems start to re-appear and intensify after about one week of using a fresh install.

All I know is that it's definitely not something simple like "use the latest kernel".


Do you have a recent kernel? Filesystems get fixes only with kernel updates.


It depends on how you use it. I tried it a couple of years ago and ran into a couple of issues with the tooling around it. I used it on a fresh install of openSUSE and ran into issues a little while later because the distro was saving btrfs snapshots every day, and they eventually filled up my small SSD and prevented the system from being able to boot because it didn't have enough free space to create some small temporary files it needed. You'll also need to get used to using btrfs tools for certain tasks, because tools like df and du don't understand how much space btrfs is actually using, so you get inaccurate results. Things may have changed in the past couple of years, but having to tune different parameters and learn new commands wasn't really worth the effort because I wasn't using most of the features on a standard Linux desktop install.


I wouldn't be using any of the advanced features. I just want a filesystem. The automatic compression stuff seemed attractive, but if there's anything else I have to do besides just format a drive differently it's not worth it for me. Thanks for sharing your experience.


I've been using BTRFS for at least 5 years on my primary workstation and various other machines and have had no issues whatsoever. Being able to format the root of a drive with BTRFS and then work with subvolumes instead of partitions and manipulate them on the fly is wonderful, as are filesystem snapshots. No more planning partition layouts and/or dealing with LVM.

I would (and do) use ZFS on a server because it's rock solid and powerful, but I prefer BTRFS on desktops / laptops because it's more flexible and lower maintenance. The only major downside for me is that it pretty much only works on Linux.


There's also WinBtrfs https://github.com/maharmstone/btrfs


Oh, I didn't know that existed. I'm not sure if I'd trust it, but glad to know it exists. It's not really a big issue for me, though, as I have a home NAS. Really more important to me than Windows support is support on BSD (esp. FreeBSD) as I use that on my NAS.


Judging by the issues on WinBtrfs it doesn't work well (or at all).


Do you use raid with BTRFS?


I have. I didn't have any trouble out of it.


I'm using btrfs on my home computer (with compression enabled) on a big storage HDD (games, videos, photos etc.), primarily because of compression and the ease of creating and mounting subvolumes without making new partitions. For the regular root / home partition on NVMe I'm using XFS, which is faster.


How much compression do you get out of video files? My understanding is they are already pretty compressed so you don't see a great improvement. I have many terabytes of files on some drives, even 5% of savings could go a long way.


I didn't play with forced compression, so most of such files aren't compressed by btrfs. With forced compression, overall ratio will be higher.

But if you have raw video, I expect it to kick in more.

E.g. on my archive volume the compression is quite weak (compressed size is 99% of the original). On my games volume it's better (81%).
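
To put those ratios in absolute terms, a quick sketch. It assumes the percentages are compressed size over original size, and the 4 TB volume is a made-up figure, not a measurement.

```python
def space_saved(original_bytes: int, ratio_percent: float) -> int:
    """Bytes saved when the compressed size is ratio_percent of the original."""
    return round(original_bytes * (100 - ratio_percent) / 100)


TB = 10**12
archive_saved = space_saved(4 * TB, 99)   # 1% of 4 TB
games_saved = space_saved(4 * TB, 81)     # 19% of 4 TB
```

Under those assumptions a 99% ratio frees about 40 GB on 4 TB, while 81% frees about 760 GB, so whether it's worth it depends heavily on what kind of data is on the volume.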



I use it in raid 1 on a 2x1TB setup and it works great. If you don't use its 'exotic' features like software raid and the fancy volume management stuff, just stick to ext4.


fair enough, I don't need any of the fancy features.


I've been using it for root for years. Stable, but slow. Still much faster than NTFS or Apple hfs+ though. zfs needs too much disk space for my taste.


I would never use ntfs or apple's filesystem for linux though.


I don't understand the point of this... if they are open source Linux Kernel contributions, shouldn't they be upstreamed into the Linux kernel?

Is Facebook just throwing out stuff from their private branch of the Linux kernel in hopes that somebody has free time to upstream the changes?

Not being snarky, I genuinely don't understand the point of this...


It is a whole lot easier to peruse and use modifications that someone else has written and debugged than to write them oneself.

It is generally a cause for celebration when someone releases their work for everyone else to consider for use.


The philosophy of the kernel is to work for everyone's needs. It's a good question: if these enhancements are of broad benefit and as unobtrusive as they seem, it's likely better to upstream them than to carry out-of-band patches.


btrfs has been part of the mainline kernel since 2.6.29, eBPF since 4.1 and cgroups2 since 4.5. That leaves PSI as the only kernel component mentioned that has not yet been upstreamed and the intention is for it to be.


Some of these are not actually things that should be in the main kernel, which is also why they are being referred to as "components". E.g: Not every laptop user needs to be able to run BPF.


> Not every laptop user needs to be able to run BPF.

True for the moment, but I can see that changing. Having a safe VM at the system call layer is a game changer for so many subsystems. It turns the Linux kernel into a hybrid exokernel.


How would other subsystems use the safety of the BPF program (which I assume is the safe VM you mention)? Do they have to assume that the network stack may not halt and have special code for it, or something?


Systemd uses bpf, it can be used for accounting of I/O.


eBPF doesn't _have_ to use the network stack at all. So imagine i2c devices that get pulled out of the kernel. KVM drivers for top half of MMIO so KVM only needs half the number of context switches. Audio DSP kernels that run at interrupt time for super low latency. Really almost any interesting real time work. Etc.


> i2c devices that get pulled out of the kernel.

I thought eBPF programs run inside the kernel. Or are they _from_ userspace, run inside the kernel and so provide efficient into-kernel calls as well as direct access to certain _allowed_ hardware? I hope I’m not riding around semantics, sorry if I do.


They are programs, provided by user space, that are sandboxed to run safely in kernel space. Anything that isn't a RISC-esque instruction in BPF is a jump to a kernel function, and each flavor of BPF (BPF_PROG_TYPE_*) has its own table of allowed functions. Most of the program types that are mainlined call functions in the net stack, but that's not intrinsic to the idea of BPF.
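
As a rough mental model of the per-type function tables, here is a Python sketch. The two tables are illustrative only and the function name `disallowed_calls` is made up; the kernel's real tables are much larger and live in C.

```python
# Illustrative allowlists only; the kernel's real per-type tables are much larger.
ALLOWED_HELPERS = {
    "BPF_PROG_TYPE_XDP": {"bpf_map_lookup_elem", "bpf_redirect"},
    "BPF_PROG_TYPE_SOCKET_FILTER": {"bpf_map_lookup_elem", "bpf_skb_load_bytes"},
}


def disallowed_calls(prog_type, helper_calls):
    """Return the helper calls that prog_type may not make, in call order."""
    allowed = ALLOWED_HELPERS.get(prog_type, set())
    return [name for name in helper_calls if name not in allowed]
```

A loader following this model would refuse any program for which `disallowed_calls` returns a non-empty list, which is the sense in which a new program type mostly amounts to a new hook plus a new table.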

So in the I2C case, I could see device drivers that get called to handle the weird things that I2C devices do where you're not really sure until master interrupt time exactly what needs to happen next and you need to make a real time decision. So a BPF program that implements a state machine and is run directly from the master interrupt could be more performant and more power efficient than doing it in user space, and safer than doing it in kernel space.

Did that answer your question?


I finally understand what program type stands for and the gist of the rest, thank you.


It's fast but certainly not safe. It has native arrays, with which you can easily exploit Spectre or Meltdown issues. To be safe you need to turn it off.


The MAP_ARRAY access goes through an external function call, which gives you a nice place to hook for mitigations. And this stuff is going to have to be fixed properly, as running sandboxed untrusted code in a higher context is way too useful in a post Moore's law world. Kernels like XOK relied on similar techniques up and down their system call layer; it had three different virtual machines.


Some of the things look like they've already been upstreamed (cgroups2). Tejun Heo, the author of this announcement, is a Linux kernel developer.


Tejun is the maintainer of cgroups (and has been for several years, and has moved companies several times). All of my angst and annoyance with cgroupv2 aside, quite a few of these projects were discussed at the last OSS and he's quite right that some of the OOM handling needs to be out-of-kernel to avoid killing things that are doing actual work.


Maybe it's for showing off that they contribute...


> BPF is a highly flexible, efficient code execution engine in the Linux kernel that allows bytecode to run at various hook points, enabling safe and easy modifications of kernel behaviors with custom code.

Can someone explain the word "safe" in that paragraph? Genuinely curious, not trying to start a flame war. I'm interested in understanding use cases for BPF.


A program has to pass certain checks before it is loaded (e.g. it must be provably halting). It's a bit much for an HN post; see https://lwn.net/Articles/740157/ subsection "The eBPF in-kernel verifier" for more background.
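
To give a feel for the "provably halts" part, here is a toy check in Python, assuming a made-up instruction format (tuples like `("op",)`, `("jmp", offset)`, `("exit",)`). It only rejects backward jumps, so every accepted program is a finite forward walk to `exit`. The real verifier does far more (tracking register state, memory accesses, and so on); this is a sketch of one idea, not how the kernel does it.

```python
def is_loop_free(prog):
    """Accept only programs whose jumps all go forward and that end in exit.

    A jump's target is pc + 1 + offset, so a negative offset would land
    on the jump itself or on an earlier instruction.
    """
    if not prog or prog[-1][0] != "exit":
        return False
    for pc, insn in enumerate(prog):
        if insn[0] == "jmp":
            offset = insn[1]
            target = pc + 1 + offset
            if offset < 0 or target >= len(prog):
                return False
    return True
```

With no way to jump backward, the program counter only ever increases, which is why termination follows.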


Thanks for the pointer.


It's guaranteed memory safe and halting.


BTRFS is good. Surprisingly it's the only common upstream Linux filesystem with transparent compression.


Do you think the transparent compression is worth it? I imagine most of the data people have in their computers is already compressed (images, audios, videos, etc.), so it would only slow down reading and writing for most files.

I can only see it making sense if much of your disk space is used up by uncompressed text files (or other types of uncompressed filetypes).


Using compress=zstd on system root saves me about 35-40%. And I save about 1/2 the space if I use the compress-force=zstd mount option. I do this on computers with SDHC/SDXC cards, which have dog slow writes and lots of spare CPU time, and some of the time on HDD. I don't use compression on computers with NVMe; it doesn't appear to help much at all, and I'm also pretty sure the compression is currently a single threaded pipe, so for multi-queue devices it's plausible some workloads could become slower.

Another use case: soonish I hope to see support for selectable compression levels for zstd (currently only with zlib). Also, zstd supports use of a training file for even higher ratios. Both would work well for archiving and seed/sprout use cases, where taking the heavy front end hit with slow writes will be worth it (faster download times, and of course the sprout can have a different compression level or no compression so subsequent writes can be fast).

Latest squashfs support includes zstd and also selectable levels, if you need a read-only image.


I implemented zstd compression in btrfs/squashfs, and work on upstream zstd.

* Btrfs compression is multithreaded, and can use up to the number of cores available on the system.

* Compression might not help speed on SSDs, but it should help reduce the burn rate, if you care about that.

* My intern wrote a patch to add compression level support to zstd in btrfs [1]. It should be merged upstream soon.

* Slightly off topic, but grub will soon understand btrfs compressed with zstd.

[1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-...


Do they plan to implement multithreaded compression / decompression?


Btrfs already supports multithreaded compression and decompression. Each 128 KB block is (de)compressed with a single thread, but multiple blocks can be (de)compressed in parallel.

The zstd CLI supports multithreaded compression with the flag `-T <num-threads>`.
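
The per-block parallelism described above can be sketched in user space with Python. This is only an analogy: it uses zlib from the standard library rather than zstd, the 128 KB constant is taken from the comment above, and `compress_blocks` / `decompress_blocks` are hypothetical names, not anything from btrfs.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 128 * 1024  # the 128 KB block size mentioned above


def compress_blocks(data, workers=4):
    """Split data into fixed-size blocks and compress each one independently."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block is handled by a single thread; blocks run in parallel.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(blocks, workers=4):
    """Decompress the blocks in parallel and join them in their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

Because every block is compressed on its own, any single block can be read back without touching its neighbours, at the cost of a somewhat lower ratio than compressing the whole stream at once.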


Good to know, thanks! Can you add that info to btrfs wiki compression page, please[1]?

1. https://btrfs.wiki.kernel.org/index.php/Compression


It saves I/O on HDD, which is useful if you have a good multi core CPU. Some things are compressed but not all. BTRFS has some heuristic which switches off compression if it's not saving space. It's not very elaborate now, but they plan to improve it.

See https://btrfs.wiki.kernel.org/index.php/Compression#What_hap...

I'm using zstd compression for my btrfs volumes.
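
For the "switch off compression if it's not saving space" idea, a minimal Python sketch follows. It is not the actual btrfs heuristic; the sample size, the 10% threshold, and the name `worth_compressing` are all assumptions made up for illustration, and zlib stands in for whatever algorithm the filesystem uses.

```python
import os
import zlib


def worth_compressing(data, sample_size=4096, min_saving=0.1):
    """Compress a leading sample and report whether it shrinks by min_saving."""
    sample = data[:sample_size]
    if not sample:
        return False
    compressed = zlib.compress(sample)
    return len(compressed) <= len(sample) * (1 - min_saving)


# Repetitive text shrinks a lot; random bytes behave like already-compressed data.
text_like = b"hello world\n" * 1000
random_like = os.urandom(8192)
```

Sampling only the start of the data keeps the check cheap, but it can misjudge files whose beginning is not representative of the rest.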


While that is literally true, you can use Virtual Data Optimizer (VDO) in Red Hat and CentOS. It does transparent deduplication and compression. Red Hat bought it from Permabit and now it's also available in CentOS.


With the amount of engineering FB is doing, I wish they would make something more important than social networks.



