
Masstree: A cache-friendly mashup of tries and B-trees - HenryR
http://the-paper-trail.org/post/masstree
======
koverstreet
I've been starting to write a paper on bcache/bcachefs's b-tree design, which
is also a sort of combination b-tree/trie (it uses binary trees in Eytzinger
layout). Pretty cool to see the same idea coming up in multiple places.
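For readers unfamiliar with the term: an Eytzinger layout stores a sorted array in breadth-first order, so the children of slot k sit at 2k and 2k + 1 and each search step lands at a predictable, prefetch-friendly location. Below is a minimal Python sketch of that layout and a lower-bound search over it. It is a generic illustration with hypothetical helper names (`eytzinger`, `lower_bound`), not bcachefs's actual code.

```python
from typing import List, Optional


def eytzinger(sorted_keys: List[int]) -> List[Optional[int]]:
    """Lay out a sorted array in BFS (Eytzinger) order, 1-indexed.

    Slot 0 is unused; the children of slot k live at 2k and 2k + 1.
    """
    n = len(sorted_keys)
    out: List[Optional[int]] = [None] * (n + 1)
    it = iter(sorted_keys)

    def fill(k: int) -> None:
        # An in-order walk of the implicit tree hands out keys in sorted order.
        if k <= n:
            fill(2 * k)
            out[k] = next(it)
            fill(2 * k + 1)

    fill(1)
    return out


def lower_bound(tree: List[Optional[int]], x: int) -> Optional[int]:
    """Return the smallest key >= x, or None if every key is < x."""
    n = len(tree) - 1
    k = 1
    while k <= n:
        # Go right (2k + 1) when the node is smaller than x, else left (2k).
        k = 2 * k + (tree[k] < x)
    # Strip the trailing right-moves (trailing 1 bits) plus the left-move
    # (0 bit) before them; (~k) & (k + 1) isolates that lowest 0 bit.
    k >>= ((~k) & (k + 1)).bit_length()
    return tree[k] if k else None
```

The final shift recovers the last node where the search went left, which is exactly the smallest key that is >= x; if the search never went left, k becomes 0 and there is no such key.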

Running index-microbenchmarks now to see which is faster :)

~~~
koverstreet
With what appears to be the current version of masstree:
[https://github.com/kohler/masstree-beta](https://github.com/kohler/masstree-beta)

Single-threaded random lookups: 1.4M/sec for bcachefs, 2M/sec for masstree.

------
espeed
Masstree is one of the DBs Berkeley used when comparing the performance of
Anna [1] (now Fluent [2]):

[1] [https://rise.cs.berkeley.edu/blog/anna-kvs/](https://rise.cs.berkeley.edu/blog/anna-kvs/)

[2] [https://github.com/fluent-project/fluent](https://github.com/fluent-project/fluent)

~~~
HenryR
Yeah, Masstree has settled into the standard set of comparator systems for
most research since it was published - and not as a strawman "this system was
crap, so let's pretend we've done good work by beating it!" but as a real
challenge to do better than.

Anna is on my list of systems to include in this review (see
[https://www.the-paper-trail.org/reading-list/](https://www.the-paper-trail.org/reading-list/)).
Looking forward to it!

------
tonyg
Related: qp-tries and critbit tries:
[https://dotat.at/prog/qp/README.html](https://dotat.at/prog/qp/README.html),
[https://cr.yp.to/critbit.html](https://cr.yp.to/critbit.html)

------
leiroigh
So what are the advantages over adaptive radix trees or good old Judy
dict/array?

Apart from Judy being too damn complicated and too old to be optimized for
vector-compare instructions, that is. (I think the fancy hand-coded x86 vector
comparisons are the main reason ART is competitive with Judy, considering
that ART lacks at least Judy's key compression and clever allocator, and is
not as optimized for using every byte of every fetched cache line.)

~~~
HenryR
Here's a recent comparison of Masstree to ART:
[https://twitter.com/andy_pavlo/status/986647389820747776?s=2...](https://twitter.com/andy_pavlo/status/986647389820747776?s=21)

ART looks to be better in most cases. It's on my list of K-V stores to review:
[https://www.the-paper-trail.org/page/reading-list/](https://www.the-paper-trail.org/page/reading-list/)

~~~
leiroigh
Cool, looking forward. Please put Judy in as well :)

Seriously, the 17-year-old Judy is still pretty good, despite not using
cool vector load/compare/etc. instructions for quickly traversing tree
nodes that are too small for a full radix search (ART does "simultaneous
search", i.e. it compares against all stored keys in a single instruction,
while Judy afaik runs a linear search).
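To make the "simultaneous search" point concrete: ART's 16-key nodes compare the search byte against every stored key byte with one SSE compare, then take the lowest set bit of the resulting match mask. Python cannot issue SSE instructions, so this sketch swaps in the related SWAR (SIMD-within-a-register) zero-byte trick on a 128-bit integer to show the same idea next to a plain linear scan. The helper names (`pack`, `find_linear`, `find_swar`) are hypothetical, and this is not code from ART or Judy.

```python
from typing import Optional

LANES = 16  # a 16-key node holds at most 16 key bytes
ONES = int.from_bytes(b"\x01" * LANES, "little")   # 0x0101...01
HIGHS = int.from_bytes(b"\x80" * LANES, "little")  # 0x8080...80


def pack(keys: bytes) -> int:
    """Pack up to 16 key bytes into one integer, byte i at bits 8i..8i+7."""
    if len(keys) > LANES:
        raise ValueError("too many keys for one node")
    return int.from_bytes(keys, "little")


def find_linear(keys: bytes, b: int) -> Optional[int]:
    """One comparison per stored key."""
    for i, k in enumerate(keys):
        if k == b:
            return i
    return None


def find_swar(word: int, count: int, b: int) -> Optional[int]:
    """Test every lane with a fixed number of word-wide operations."""
    # Broadcast b into every lane and XOR: matching lanes become 0x00.
    v = word ^ (b * ONES)
    # Zero-byte trick: flags the high bit of each zero lane. Lanes above the
    # first zero lane may be false positives; the lowest flagged lane is exact.
    t = (v - ONES) & ~v & HIGHS
    # Ignore lanes beyond the keys actually stored (cf. ART's count mask).
    t &= (1 << (8 * count)) - 1
    if t == 0:
        return None
    # Position of the lowest set bit, converted from bits to lanes.
    return ((t & -t).bit_length() - 1) // 8
```

The linear version does up to `count` comparisons; the SWAR version does the same handful of operations no matter how many keys the node holds, which is the property the SIMD compare gives ART.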

It would be pretty cool if someone vectorized that in Judy, and replaced
null-terminated strings with a binary-safe representation. Unfortunately, all
implementations are old and very hard to read.

~~~
HenryR
I could either figure out how Judy works, or review another three papers :)

~~~
leiroigh
:)

My super high-level partial understanding of Judy is the following:

Start with ART, which is pretty simple. Then do the obvious improvements:
First, we want fast access to "element number N in sorted order", so each node
also stores its number of descendants. Next, you do key compression: storing the
portion of the key that can be reconstructed from the tree traversal is silly.
Next: an 8-bit tag for signaling one out of 4 node types? Are you, like,
filthy rich? More node types it is. Next: spending 64 bits on a node pointer?
Are you crazy? Put the type tag into the parent's pointer (earlier resolution
of the branch!), and use a custom allocator to get smaller pointers (you
allocate big segments and your node pointers are offsets).

You end up with an unintelligible monstrosity. Give it a cute name, forget
about SIMD because it is 2001, and you end up with something like Judy.
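A toy version of the first improvement, fast access to "element number N in sorted order", assuming a plain byte-wise trie rather than Judy's real node formats: every node stores how many keys live below it, so `select` can skip a whole subtree with one subtraction instead of walking it. The class and method names are hypothetical.

```python
from typing import Dict


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.terminal = False  # True if a stored key ends exactly here
        self.count = 0         # number of stored keys in this subtree


class CountingTrie:
    """Byte-wise trie in which every node stores its descendant count."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def contains(self, key: bytes) -> bool:
        node = self.root
        for b in key:
            child = node.children.get(b)
            if child is None:
                return False
            node = child
        return node.terminal

    def insert(self, key: bytes) -> None:
        if self.contains(key):
            return  # counts may only grow for genuinely new keys
        node = self.root
        node.count += 1
        for b in key:
            node = node.children.setdefault(b, TrieNode())
            node.count += 1
        node.terminal = True

    def select(self, n: int) -> bytes:
        """Return the n-th smallest key (0-based), skipping subtrees by count."""
        if not 0 <= n < self.root.count:
            raise IndexError(n)
        node, prefix = self.root, bytearray()
        while True:
            if node.terminal:
                if n == 0:
                    return bytes(prefix)
                n -= 1  # a key ending here sorts before all of its extensions
            for b in sorted(node.children):
                child = node.children[b]
                if n < child.count:
                    prefix.append(b)
                    node = child
                    break
                n -= child.count  # skip this child's entire subtree at once
            else:
                raise RuntimeError("descendant counts are inconsistent")
```

The cost is one counter update per level on every insert; in exchange, `select` touches only the nodes on one root-to-leaf path.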

------
HenryR
Author here - if you like this you might also like another paper summary of
mine in the same vein:

[https://news.ycombinator.com/item?id=18132730](https://news.ycombinator.com/item?id=18132730)

~~~
leiroigh
So I was wondering... are there sensible structures that combine a hash map
and a tree by keeping a double index?

Reason: Ordered access or range queries need a (radix-)tree.

So insertion and removal need to pay for the comparatively slow tree search
and rebalancing.

Lookups or mutations could use a hash table that references the same data,
especially if no key compression is used for storage.
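A minimal sketch of the double index described above, with hypothetical names: a Python dict serves point lookups and updates, while a sorted key list serves ordered range scans. The sorted list is only a stand-in for the tree (`bisect.insort` is O(n), where a balanced or radix tree would be roughly O(log n) or O(key length)), and only inserts of new keys and deletes touch both structures.

```python
import bisect


class DualIndex:
    """Hash map for point operations plus a sorted key list for ordered ones."""

    def __init__(self) -> None:
        self._map = {}   # key -> value, used for get/update
        self._keys = []  # sorted keys, stand-in for the ordered tree

    def put(self, key, value) -> None:
        if key not in self._map:
            # Only genuinely new keys pay for the ordered index.
            bisect.insort(self._keys, key)
        # Updates of existing keys touch only the hash map.
        self._map[key] = value

    def get(self, key, default=None):
        return self._map.get(key, default)

    def delete(self, key) -> None:
        del self._map[key]  # raises KeyError if the key is absent
        i = bisect.bisect_left(self._keys, key)
        del self._keys[i]

    def range(self, lo, hi):
        """Yield (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        for k in self._keys[i:j]:
            yield k, self._map[k]
```

Every key is stored twice, once per structure, so the speedup on lookups is paid for in memory and cache footprint.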

~~~
HenryR
I don't know of one, but it's such a natural idea that I'd guess it's been
studied. There are standard implementations of LRU caches that use e.g. a hash
map and a linked list to get both fast lookup and ordering, but for real
performance I think you'd want to try and minimise the number of data
structures to avoid having competing cache behaviours.
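The LRU pattern mentioned here can be sketched in a few lines of Python with `collections.OrderedDict`, which keeps its entries on a doubly linked list alongside the hash table; `move_to_end` and `popitem(last=False)` provide constant-time recency updates and eviction. The `LRUCache` class is a hypothetical illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache: hash lookup plus linked-list recency order."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[object, object]" = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)  # new or updated entries are most recent
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
```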

------
danielhlockard
Why do I get a gross yellow block covering the image when I mouse over it?

edit: oh, it's because it's a link and they have some background coloring or
something on links.

~~~
HenryR
Oh yeah that's no good. I'll fix that when I'm back at my computer. Thanks for
pointing that out!

------
vectorEQ
Wouldn't it be better to hash the keys and match hashes? The hashes aren't
variable-length, and will be unique per unique string regardless of the length.

~~~
jsnell
That won't give you range queries / ordered iteration, which were the point of
using a tree rather than a hash table in the first place.

