
Bcachefs: “the COW filesystem for Linux that won't eat your data” - koverstreet
http://bcachefs.org/
======
slavapestov
Here's a funny story. At one point bcache development was funded by a startup
(which I won't name here). They were using it as the local storage layer of a
distributed storage product. I worked there for a year in 2014.

Apparently they were not aware of the fact that bcache was a) GPL'd code, or
b) developed before the company existed, first as a hobby project and then at
Google. After a couple of years, they noticed that Kent was in fact posting
the bcache source code on his personal web site. At this point they fired him
and threatened to sue. I quit the company then (along with a number of other
people, for mostly unrelated reasons, such as the fact that the CTO was a
notorious brogrammer). Kent got a litigator, and when it was made very clear
to them that they had no case, they backed down, but not before wasting a ton
of money.

As far as I know, they're still actively violating the GPL by shipping a
product containing modified kernel code without releasing the source, nor do
they acknowledge that they did not develop the key component of their
product.

The "commercial" version had a rather broken and messy snapshots
implementation and had diverged a bit from the open source bcachefs at that
point, mostly because snapshots were poorly implemented. It's also kind of
funny because after we left the company we still knew of some tricky data
corruption bugs, and it's likely they're still there in the "commercial"
version, because backporting the latest fixes would be non-trivial and I don't
think their testing or development methodology would have caught them.

Anyway, I gave up on startups and enterprise storage after this, but Kent is
still developing bcachefs on his own time and money, so if you use it please
consider donating some money to support its development.

~~~
indolering
His Patreon page[1] shows he only receives $762 in donations a month, less
than a third of what he needs to keep from eating into his personal savings.

Sad, given how much a modern filesystem would help Linux :(

[1]: [https://www.patreon.com/bcachefs](https://www.patreon.com/bcachefs)

~~~
Chris2048
> As far as I know, they're still actively violating the GPL

Maybe he can get extra money from a lawsuit?

~~~
indolering
Few people would willingly sign up for that kind of torture.

------
zanny
I've been using btrfs since about 2011, and I stopped using ext4 / xfs / zfs
everywhere around 2014.

From 2012-2014 it was mostly breakage every other month. From 2014-2016, it
was semi-annual issues.

For the last ~18 months I have had ~30 machines running btrfs with no issues
- some servers, some personal computers. The release notes are boring, the
bugs are boring, and to me it's definitely in a state where I would strongly
consider trusting it with any workload.

I worry that btrfs is just going to remain doomed by its reputation: it
wasn't stable half a decade ago, so - the thinking goes - it cannot be more
stable now. But it has seen so much work put into it, and in my experience it
is pretty damn mature now. All I want to see is another year and a half of
perfect stability before I start arguing to drop zfs entirely.

~~~
_ecqc
Are you running BTRFS with its built-in RAID? That's been the biggest blocker
for me. There have been numerous RAID bugs that have caused data loss, and I
believe at least one of them is still unpatched.

~~~
kzrdude
I think Chris Mason is open about the fact that RAID support is not really
stable yet, right?

Status page says so:
[https://btrfs.wiki.kernel.org/index.php/Status](https://btrfs.wiki.kernel.org/index.php/Status)

~~~
aarmenaa
As far as I know they consider RAID 0, 1, and 10 to be stable. Last time I
used it, rebuilds were substantially slower than ZFS or mdraid. Rebuild
performance seems to be one of a few issues that BTRFS has had trouble
solving. RAID 5 and 6 were declared stable last year, only to have that
retracted when a fatal flaw was discovered that would apparently cause data
loss if you needed to rebuild.

~~~
mshook
It's mostly true, but the issues as far as I know are:

- RAID 1 with more than 2 disks is not what you think it is: the data will be
mirrored, but only once, no matter how many disks you have (meaning if you
have a mirror with 3 disks, you still only have 2 copies of your data).
That's because in BTRFS lingo, RAID 1 means '2 copies of the data'
[https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_dif...](https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_differences_among_MD-RAID_.2F_device_mapper_.2F_btrfs_raid.3F),
which is not what people expect from RAID 1 with more than 2 disks (see the
space-math sketch below)

- RAID 1 always needs 2 working disks; if not, you can't mount the thing...
Well, you can, but only once...
[https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_volume...](https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_volumes_only_mountable_once_RW_if_degraded)

- RAID 10 inherits these special RAID 1 cases as a result

Most of that stuff is described here:
[https://btrfs.wiki.kernel.org/index.php/Status](https://btrfs.wiki.kernel.org/index.php/Status)

So as they say on the status page: _"mostly ok"_
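
Here's the space-math sketch referenced above - a minimal, hypothetical
illustration in Python, not anything from btrfs itself:

```python
def btrfs_raid1_usable(disk_sizes_gb):
    """btrfs 'raid1' stores every extent exactly twice, on two
    different disks, no matter how many disks are in the pool."""
    total = sum(disk_sizes_gb)
    largest = max(disk_sizes_gb)
    # Each extent occupies two disks, so at most half the pool is
    # usable, and one oversized disk can't be fully paired up.
    return min(total / 2, total - largest)

def n_way_mirror_usable(disk_sizes_gb):
    """Traditional RAID 1: every disk holds a full copy, so you get
    one copy per disk but only the smallest disk's worth of space."""
    return min(disk_sizes_gb)

disks = [1000, 1000, 1000]         # three 1 TB disks
print(btrfs_raid1_usable(disks))   # 1500.0 GB usable, but only 2 copies
print(n_way_mirror_usable(disks))  # 1000 GB usable, 3 copies
```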

------
boris
>Snapshot implementation has been started, but snapshots are by far the most
complex of the remaining features to implement

Snapshots are the #1 feature of COW filesystems. I've been using them for a
bit in btrfs and this feature is game-changing (and no, it hasn't eaten my
data yet).

~~~
std_throwaway
At first glance, the status page of btrfs looks horrible:

[https://btrfs.wiki.kernel.org/index.php/Status](https://btrfs.wiki.kernel.org/index.php/Status)

The problem areas are mostly RAID and exotic features. RAID can be handled by
a different layer, and most users don't really need the exotic features.

Judging from the media silence of the last few months, I'd say either people
have stopped using btrfs or it just about works well enough for everybody.

~~~
feld
when the RAID is handled by a different layer you lose some very important
integrity features: the filesystem's checksums can tell which mirror copy is
the good one, but a separate RAID layer below it can't
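
A toy sketch of the self-healing read this enables (hypothetical code, not
btrfs or ZFS internals):

```python
import zlib

def read_mirrored_block(copies, expected_crc):
    """Return a copy whose checksum matches, repairing bad mirrors
    from it. A block-layer RAID can't do this: with no checksum it
    has no way to know which mirror is the corrupt one."""
    good = next((c for c in copies if zlib.crc32(c) == expected_crc), None)
    if good is None:
        raise IOError("all copies corrupt")
    for i, c in enumerate(copies):
        if c != good:
            copies[i] = good  # rewrite the bad mirror (self-heal)
    return good
```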

~~~
kazinator
That's why all big data in big corporations uses exotic-open-source-
filesystem-based RAID. That's where the action is when it comes to integrity.

None of that virtual block device or driver-level software junk, let alone
hardware RAID controller solutions.

~~~
barrkel
Well, there are other reasons; you want to write code that operates on the
data, and neither the code nor the data fits on a single machine - you have to
target an abstraction which spans machines. Block storage is too low-level an
abstraction.

That isn't to say that using high-performance block storage isn't still a win
even when the redundancy is multiplied at a higher level. The higher-level
redundancy is also about colocating more data with the code - i.e. it's not
just redundancy for integrity, but a way to increase the probability that the
data is close to the code.

~~~
kazinator
Block storage can be network-abstracted.

Even virtual memory can, for that matter - a now-ancient concept:

[https://en.wikipedia.org/wiki/Distributed_shared_memory](https://en.wikipedia.org/wiki/Distributed_shared_memory)

~~~
barrkel
Of course. Most production monoliths are deployed on networked block storage
- aka SAN - and NUMA is already structurally distributed memory, even on a
single box. But it's not the right paradigm to scale well, any more than
chatty RPC that pretends the network doesn't exist is the right way to design
a distributed system.

------
kev009
I think ZFS is the only commercially viable open-source CoW total storage
management option. These new Linux filesystems are way too late to the party,
and it will take them a decade to reach maturity even once they hit basic 1.0
feature parity.

In parallel, I see XFS as the long-term evolution of Linux filesystems. It
will continue to scale up slightly from where it sits today and address
fail-in-place, flash, metadata checksums, snapshots, etc., with total storage
management done by overlays like HDFS, object stores, and so on.

~~~
_ecqc
I think ZFS is fantastic for businesses, but there are a couple of places
where it falls short compared to bcachefs for me:

- For non-business users who want a RAID, ZFS is too inflexible. You can't
add or remove disks in a RAIDZ vdev. If you want the space efficiency of
RAIDZ, you have to expand your array in units of entire vdevs (see the sketch
after this list). If you want replicas, you have to expand in at least pairs
of disks. BTRFS and bcachefs both allow you to replicate more flexibly and
reshape your array.

- ZFS doesn't work particularly well with SSDs as caches. ZIL and L2ARC are
nice, but they're not as nice as a full bcache-style tiering setup. bcachefs
tiers let you do crazy things like a 4-tier storage setup with Nearline HDD
-> 15k SAS HDD -> SATA SSD -> NVMe SSD.

- ZFS is pretty complex to manage in general, and major features like ZIL and
L2ARC are arcanely documented. So far, bcachefs is pretty straightforward to
use.
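
A quick sketch of the expansion math behind the first point (hypothetical
numbers; ignores RAIDZ padding and metadata overhead):

```python
def raidz_usable_tb(n_disks, disk_tb, parity=1):
    """Approximate usable space of one RAIDZ vdev: parity consumes
    `parity` disks' worth of capacity (raidz1/2/3 -> parity=1/2/3)."""
    assert n_disks > parity
    return (n_disks - parity) * disk_tb

# You can't grow a 4-disk raidz1 vdev to 5 disks; you add a whole
# second vdev instead, so capacity grows 4 disks at a time:
print(raidz_usable_tb(4, 8))      # 24 TB with one vdev
print(2 * raidz_usable_tb(4, 8))  # 48 TB after adding another vdev
```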

------
ysleepy
While I really like this sort of filesystem, I'm not holding my breath.

This isn't a simple filesystem project; it plays in the next-gen space ZFS
opened up. There will be a lot to do, especially IO scheduling, RAID safety
with shitty drive firmwares, and consistency guarantees with fsync/partial
flushes.
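
The fsync point deserves an example. A minimal sketch of the dance an
application does for a crash-durable write (plain POSIX calls via Python's os
module, nothing bcachefs-specific):

```python
import os

def durable_write(path, data):
    """Write `data` (bytes) so it survives a crash: fsync the file
    for its data and inode, then fsync the parent directory so the
    directory entry itself reaches disk - a step that's easy to
    forget, and that buggy firmware's partial flushes can break."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)     # flush file data and metadata to the device
    finally:
        os.close(fd)
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)  # flush the new directory entry
    finally:
        os.close(dirfd)
```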

I'm pessimistic about it being mainlined in the near future; the core team
will be wary of a second btrfs.

What I would like to see is an APFS/exFAT crossover with COW and data
checksums, without all the volume management, with ports for all possible
operating systems, so everyone can use it for their SD cards, USB sticks and
external drives without making tradeoffs or using FUSE.

~~~
sedachv
> What I would like to see is an APFS/exFAT crossover with COW and data
> checksums, without all the volume management, with ports for all possible
> operating systems, so everyone can use it for their SD cards, USB sticks
> and external drives without making tradeoffs or using FUSE.

+1. Filesystems without bit-rot protection on flash drives are going to become
at least as big a problem as optical disc rot.

~~~
mschuster91
> without making tradeoffs or using FUSE.

What's the problem with FUSE? It allows sharing code between Linux, OS X,
(Free)BSD and even Windows (via Dokan).

Yes, it will not offer you the same performance as an in-kernel driver (due
to context switches), but given that CPU power keeps increasing, that's no
big problem.

~~~
sedachv
> What's the problem with fuse?

1. Only available on Android when rooted.

2. Support varies between OSes. For example, OpenBSD's FUSE does not have the
_default_permissions_ / _allow_other_ flags, which makes encfs (and any other
virtual filesystem backed by multiple files) a pain to use since OpenBSD 6.0
removed user mounting.

~~~
khc
1. Most filesystems won't be available on Android anyway, so this point is
moot.

2. Most non-FUSE filesystems won't be ported to your BSD of choice anyway.

------
jeremyw
To press this point:

If you think we need an alternate effort and/or competition to build an
advanced, native filesystem for Linux (I do), please consider a subscription
on Patreon
([https://www.patreon.com/bcachefs](https://www.patreon.com/bcachefs)). Kent
has a long history of shipping sophisticated, high-quality code.

~~~
king_phil
Plus he is willing to help out when you need to nail down a bug, as I
recently discovered with bcache. My first Linux kernel patch might be a fix
for a deadlock in bcache :-)

------
gbrown_
The architecture page is pleasantly illuminating; it's nice to see an effort
made in technical documentation.

[http://bcachefs.org/Architecture/](http://bcachefs.org/Architecture/)

------
throw2016
Chris Mason and the btrfs team are clearly talented. But the initial
excitement around btrfs has sadly dissipated, and its promise as the
next-generation Linux fs remains unrealised. It now feels a bit jaded, and
the momentum is spent.

I suspect many have lost patience with the promise of COW, and unfortunately
for bcachefs this history will cast a shadow on its development and
potential.

Database performance remains problematic on COW, and while things like
snapshots and ad-hoc disk and volume management are interesting, even
exciting, one soon realises that unless one has a pressing need they are just
nice to have. Eventually boring ext4 ticks all the boxes, and one may as well
forget about the fs and focus elsewhere.

~~~
pgaddict
I don't think COW in general is a big issue for databases. You can get pretty
good performance out of ZFS (very stable and consistent behavior), for
example. COW is not free, of course, but you get interesting features in
return, and if you need them (e.g. snapshots), it's usually much better than
LVM + a non-COW filesystem.

The fact that some COW filesystems perform poorly does not mean all COW
filesystems do.

------
phs318u
This takes me back. 9 years ago I was playing around with ZFS COW and OS X
sparse bundle containers to host disk images for multiple "versions"
(exploiting CoW) of the same VM image. I wrote up an article on what I was
doing [1]. I never persevered, though, as it was a bit too fragile (at that
time ZFS on OS X was not at all ready for prime time).

Funny, but every so often I wonder what it might be like in a parallel world
where Apple had bought Sun instead of Oracle.

[1] [http://macoverdrive.blogspot.com.au/2008/10/using-zfs-to-
man...](http://macoverdrive.blogspot.com.au/2008/10/using-zfs-to-manage-your-
vm-zoo.html)

------
sargun
I'm looking forward to bcachefs. ZFS on Linux is great when it works well,
but it's an absolute pain when it breaks. Not only does it taint the kernel,
but it doesn't mesh very well with the kernel due to its use of the SPL - a
shim layer that implements Solaris kernel APIs on top of Linux. In addition,
ZFS doesn't use as much native Linux memory management as I'd like; instead
it manages its own pool of memory. This makes troubleshooting more difficult,
and it's further aggravated by the kmem cgroup.

For example, if you have a dirty page in a cgroup, and the cgroup OOMs, the
kernel will trigger writes. If any of these writes require memory allocations,
they'll probably fail since the current cgroup is OOM. ZFS subsequently gets
stuck in an infinite loop, and locks up. See:
[https://github.com/zfsonlinux/zfs/issues/5535](https://github.com/zfsonlinux/zfs/issues/5535)

I understand that a lot of ZFS work comes from LLNL & government funding. I'm
not blaming them, as it works for their use case: machines running dedicated,
controlled workloads.

We're experimenting with Btrfs, and we'll see how it goes.

------
std_throwaway
It looks like it will take quite some time until it's fully implemented. Why
should we start using it right now instead of btrfs/zfs?

~~~
aseipp
You probably shouldn't. It's ready for adventurous testers, and is pretty
stable, but unless you're willing to report bugs or hack on it, you should
probably stay away.

There are reasons to still want it, despite its newness; for example, the
latest updates bring huge improvements in metadata efficiency (low metadata
overhead -> more metadata in the cache -> larger working set). Someone on the
IRC channel reported it's somewhere around 20x faster than most filesystems
when it comes to "iterate millions of files recursively", blowing everything
else out of the water. (This seems somewhat synthetic, and I'd say it
_mostly_ is -- but OTOH, "tons of files in a directory" being really slow is
a fact of life, and it has bitten me multiple times in a prior job.) In
general, though, improved metadata efficiency helps everywhere. For example,
if you're doing backups of a really big filesystem recursively, you have to
traverse a lot of inode metadata to get e.g. last-modified times; bcachefs
will likely do awesome here in terms of performance.
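
For a feel of that workload, here's a minimal sketch of the recursive
metadata scan a backup tool performs (hypothetical code, nothing
bcachefs-specific):

```python
import os

def newest_mtime(root):
    """Recursively find the latest modification time under root.
    The work is almost pure metadata traversal (dirents + inodes),
    which is exactly what better metadata caching speeds up."""
    newest = 0.0
    for entry in os.scandir(root):
        try:
            if entry.is_dir(follow_symlinks=False):
                newest = max(newest, newest_mtime(entry.path))
            else:
                newest = max(newest,
                             entry.stat(follow_symlinks=False).st_mtime)
        except OSError:
            continue  # entry vanished or is unreadable; skip it
    return newest
```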

Another unique feature I recall is that it has very, very good tail latency
-- bcachefs almost never blocks on I/O unnecessarily, so you don't get random
'lag spikes' when things like the page cache get flushed out (which may stall
some other I/O ops). This makes the system feel much more consistent in
general.

There's lots of good info in the architecture document and Patreon posts from
Kent:

[http://bcachefs.org/Architecture/](http://bcachefs.org/Architecture/)

[https://www.patreon.com/bcachefs/posts](https://www.patreon.com/bcachefs/posts)

~~~
h2hn
I spent the last week testing ext4/btrfs/zfs on Linux, and I found that zfs
is rather slow and that btrfs has improved its performance a lot in recent
years (I should refine the script a bit, upload some graphs and make a post).

[https://gist.github.com/liloman/d525131fab9b9a440140905921e9...](https://gist.github.com/liloman/d525131fab9b9a440140905921e9346a)

I'll give bcachefs a try. :)

The script needs a 512MB spare disk partition and some basic changes, but the
fundamental work is there.
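
For reference, a stripped-down sketch of the sequential-write part of such a
benchmark (hypothetical code, not the gist's script; a real run should also
drop caches between tests):

```python
import os
import time

def seq_write_mbps(path, size_mb=256, block_kb=64):
    """Write size_mb of incompressible data in block_kb chunks,
    fsync, and return throughput in MB/s."""
    block = os.urandom(block_kb * 1024)
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for _ in range(size_mb * 1024 // block_kb):
            os.write(fd, block)
        os.fsync(fd)  # include flush time in the measurement
    finally:
        os.close(fd)
    return size_mb / (time.monotonic() - start)

print(seq_write_mbps("/mnt/test/bench.dat"))  # a path on the fs under test
```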

------
jethro_tell
I like that we are seeing competition in this space. I think it's good for
business.

I do however see some big red flags in the linked page:

> Starting from there, bcachefs development has prioritized incremental
> development, and keeping things stable, and aggressively fixing design
> issues as they are found

Which is it? Big design changes or stable FS?

~~~
OD_
From what the developer has stated on reddit, it's more that he wants to
aggressively make changes to the filesystem _right now_, before any attempt
at mainlining it into the kernel, so as not to end up like btrfs, which in
his view was mainlined prematurely.

~~~
JoshTriplett
It does make sense to have it rock-solid stable _before_ mainlining, so that
people don't get burned by it early on.

------
Y_Y
So by analogy with btrfs "butterface" I suppose we're supposed to pronounce
this "book-a-chefs"?

------
X86BSD
Curious - I didn't see mention of this, but perhaps someone here knows: is
there TRIM support, or planned support, in bcachefs?

~~~
loeg
TRIM isn't super important given bcache's write pattern (sequential writes to
large aligned blocks). It doesn't do random overwrite in place of small
blocks.

------
corppneq
> Bcachefs: “the COW filesystem for Linux that won't eat your data”

from site:

> Bcachefs is not yet upstream - you'll have to build a kernel to use it.

> Snapshot implementation has been started, but snapshots are by far the most
> complex of the remaining features to implement -

Yes. Very mature.

