
BetrFS: An in-kernel file system that uses Bε trees to organize on-disk storage - espeed
http://www.betrfs.org/
======
heavenlyhash
This is really interesting...

... so I'm really a bit bewildered and saddened by some of the engineering
choices they made along the way, like requiring _a modified kernel_:
[https://github.com/oscarlab/betrfs/blob/master/README.md#com...](https://github.com/oscarlab/betrfs/blob/master/README.md#compiling-the-code)

The reasons are detailed just above that link target, but are somewhat absurd
IMHO: they modified the kernel's `struct task_struct` to pass error values,
rather than fixing one of their libraries to handle its own error values. In
return, nobody ever gets to use this without patching a kernel (which is, by
the way, so ancient of a fork at this point that Forget It).

I know it's unfair of a random internetfolk to complain about the engineering
choices of a project like this, I'm sure I don't know the influences and
tradeoffs first hand, etc, etc, but... Ow. I would've tried this.

But not on a 3.11 kernel fork.

I hope someone can take all the awesome research here and get it to veer
towards something slightly closer to product.

~~~
nneonneo
All the reasons given boil down to “we didn’t want to mess with TokuDB”. The
build process (requiring specific GCC versions, CMake, and a bunch of packages
including Valgrind), the errno patch to task_struct, and the half dozen libc
stub functions are all there because TokuDB expects userspace libc.

I don’t know whether the authors ever attempted to patch TokuDB itself (or
streamline it down to the essentials for the kernel). Instead they appear to
have just taken the entire userspace-designed library and hacked the kernel
until it fit. It’s a halfway-decent strategy if your goal is to get the thing
off the ground as quickly as possible, but obviously a real implementation
would have to ship a modified TokuDB instead (it’s OSS, so it should be
hackable!)

This is pretty common in academia, sadly. As an academic who has released some
academic OSS code myself, I can say that often there’s just not enough time or
motivation to fix a blob of code into a generally usable format. Often it’s
released just so that we can say “hey we open-sourced it so other researchers
can build on it/replicate our results”. This may be one of the reasons why
academic ideas don’t make it out to the real world that quickly.

~~~
tobias3
TokuDB seems to be written in C++. It's neat that they managed to make that
work in-kernel.

Also there are patent notices in the TokuDB files.

So probability of it being in Linux at some point approaches zero. And with
the patents they even made the whole technique unviable for any Linux fs
(haven't looked at them in detail obviously).

~~~
espeed
See Bradley's comment wrt licensing the fractal tree code as GPLv2 w/ a
patent provision
[https://news.ycombinator.com/item?id=18208209](https://news.ycombinator.com/item?id=18208209)

------
lorenzhs
Michael Bender, one of the people behind this, gave an excellent invited talk
on B^epsilon trees and the possibilities that write-optimised data structures
introduce (especially in database systems) at IPDPS this year. Unfortunately
it wasn’t recorded as far as I’m aware, but the slides are available at
[http://ipdps.org/ipdps2018/bender-ipdps2018-wods.pdf](http://ipdps.org/ipdps2018/bender-ipdps2018-wods.pdf).
A more formal introduction to B^epsilon trees is
[http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf](http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf)
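
For anyone who wants the gist before opening the slides, below is a rough Python sketch of the buffering idea behind write-optimised trees. It is a toy two-level model written for this comment, not BetrFS or TokuDB code: the class name `ToyBEpsilonTree`, the `buffer_capacity` parameter, and the `leaf_writes` counter (a stand-in for leaf I/Os) are all made up for illustration, and a real B^epsilon tree has buffers at every internal node plus node splits and merges.

```python
import bisect

class ToyBEpsilonTree:
    """Two-level toy: a root holding a message buffer, with leaves below it."""

    def __init__(self, pivots, buffer_capacity=4):
        self.pivots = sorted(pivots)      # pivots split the key space among leaves
        self.leaves = [dict() for _ in range(len(self.pivots) + 1)]
        self.buffer = {}                  # pending upserts held at the root
        self.buffer_capacity = buffer_capacity
        self.leaf_writes = 0              # counts simulated leaf writes

    def _leaf_index(self, key):
        return bisect.bisect_right(self.pivots, key)

    def insert(self, key, value):
        # An insert only touches the root buffer; a newer message for the
        # same key overwrites the older one.
        self.buffer[key] = value
        if len(self.buffer) > self.buffer_capacity:
            self._flush()

    def _flush(self):
        # Group pending messages by destination leaf, then push the largest
        # batch down in a single leaf write.
        batches = {}
        for key, value in self.buffer.items():
            batches.setdefault(self._leaf_index(key), {})[key] = value
        target = max(batches, key=lambda i: len(batches[i]))
        self.leaves[target].update(batches[target])
        self.leaf_writes += 1
        for key in batches[target]:
            del self.buffer[key]

    def get(self, key):
        # Buffered messages are newer than leaf contents, so check them first.
        if key in self.buffer:
            return self.buffer[key]
        return self.leaves[self._leaf_index(key)].get(key)


tree = ToyBEpsilonTree(pivots=[100, 200], buffer_capacity=4)
for k in range(10):
    tree.insert(k, f"v{k}")
print(tree.leaf_writes)  # leaf writes needed for the 10 inserts
```

The point of the sketch is that inserts are batched: many inserts share one leaf write, whereas a plain B-tree would touch a leaf per insert. Queries pay for this by having to consult the buffer on the way down.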

~~~
espeed
Here's the talk Bradley Kuszmaul [1] gave to MIT 6.172 in 2010...

How TokuDB Fractal Tree Indexes Work
[https://www.youtube.com/watch?v=9Rb85cOXTKU&t=202s](https://www.youtube.com/watch?v=9Rb85cOXTKU&t=202s)

[1]
[https://people.csail.mit.edu/bradley/](https://people.csail.mit.edu/bradley/)

------
espeed
Talk given by Rob Johnson [1] at MSR a few years back...

BetrFS: A Right-Optimized Write-Optimized File System
[https://www.youtube.com/watch?v=fBt5NuNsoII](https://www.youtube.com/watch?v=fBt5NuNsoII)

[1] [http://www3.cs.stonybrook.edu/~rob/](http://www3.cs.stonybrook.edu/~rob/)

~~~
jules
That's a great talk, thanks!

------
williamkuszmaul
The website doesn't seem to mention that several of the papers on the
filesystem won best-paper awards at major conferences. The paper "Optimizing
Every Operation in a Write-Optimized File System", in particular, won the
best-paper award at FAST '16.

------
y4mi
Btrfs vs betrfs... This is going to cause so much confusion

Ah, but the project isn't new, so I guess it's not a new problem

~~~
codetrotter
Pronounce btrfs as “butter fs” and betrfs as “bee-turr fs” or something, then
there is no confusion ;)

But yeah I agree, betrfs and btrfs are way too similar names.

~~~
stephenr
I can’t believe it’s not butrfs.

~~~
muterad_murilax
You sound a bit bitrfs.

------
perlgeek
What's the state of this file system? Is it in the Linux kernel? In some BSDs?
Both the main page and the FAQ talk about "the kernel" without saying _which_
kernel it is.

How reliable is it? Are there file system checkers for it? Does it support
snapshots?

~~~
yjftsjthsd-h
There's a comment upthread that it works on a very much patched Linux kernel,
so it certainly isn't upstreamed.

------
mirekrusin
What's the story with patent/license on fractal tree by tokutek (now percona)?
Can you use it for personal use only? I.e., you can't use it at work without a
license? What about Bε-trees - are there patent free implementations?

~~~
bradleykuszmaul
Tokutek licensed the fractal tree under GPLv2 with an explicit patent license
to make clear that anyone could use the fractal tree code. I don't know what
Percona did after the acquisition.

~~~
loeg
Any idea what year the patent clock times out?

~~~
colanderman
2027

------
WhitneyLand
Sorry, I have to say this is a poor and unpersuasive case for the
architecture.

Performance increases are everything, yet unless I’m missing it there is no
way to know what the improvements are.

For example, take the data on the chart here:
[http://www.betrfs.org/faq.html](http://www.betrfs.org/faq.html). Is enough
information provided to reproduce the results, including hardware and
configuration? If not, the results are utterly meaningless.

Forward looking, it would strain credibility if you hadn’t tried to get a
sense of the real world gains on Intel XPoint storage. I would speculate it
will not be many years before there are no new green field deployments of
storage that spins around in a circle.

~~~
Something1234
First off, why would you use a research file system on enterprise hardware?
Secondly, if you read just a little bit further you see that they are testing
using spinning rust (HDD). Since they mention threads, I would hope that they
are testing using something that supports hardware parallelism. So probably
some relatively decent modern hardware. Mind you, this is all speculation, and
conclusions you could draw by reading the FAQ _carefully_.

~~~
WhitneyLand
If you describe results of an experiment, you need to provide the detail or
pointer to how to reproduce it.

Benchmarking on enterprise hardware is completely relevant because that is the
technology that is going to become dominant over the next few years, so if
performance gains do not show up on that type of technology they may not be
significant.

In some cases enterprise hardware is much different from what would be used in
other scenarios, or on a massive scale, similar to how Facebook doesn’t go
down and buy enterprise servers to run their data center. However, in this
case, the new generation of memory/storage hybrid will not be that different
whether it’s in your laptop or in a server, size, features, and scalability
notwithstanding, of course.

If I read an FAQ question that doesn’t have an asterisk or a pointer to the
full information, it’s not my responsibility as the reader to go hunting
around for details. This is the job of the author of the paper or website. You
don’t get to brag without putting a _

------
SEJeff
I wonder how this compares to bcachefs:
[https://bcachefs.org](https://bcachefs.org)

------
userbinator
A filesystem using complex data structures, with no mention of reliability? It
seems like a huge omission. Personally I think simplicity (and reliability,
which usually accompanies it) is the most important for a filesystem --- it
doesn't matter how fast it is, if it is prone to data loss from bugs or
whatever else.

------
jwatte
They do all this work, and then choose possibly the most confusing name they
could?

------
StreamBright
Is it mandatory to implement this in the kernel? Is this because Linux is a
monolith?

~~~
SEJeff
No, FUSE literally stands for "Filesystem in Userspace"

~~~
StreamBright
Sorry, the title of the article says in-kernel file system. Did I miss something?

~~~
SEJeff
You asked if it was mandatory to implement this in the kernel because Linux is
a monolith. I pointed out that FUSE exists, and allows filesystems to be in
userspace. I answered your question. As to some of the silly decisions made by
this project? I can't speak to that.

