
A Comparison of Log-Structured Merge (LSM) and Fractal Tree Indexing - ddorian43
http://highscalability.com/blog/2014/8/6/tokutek-white-paper-a-comparison-of-log-structured-merge-lsm.html
======
btrask
Fractal trees are not "LSM-trees, but better." Fractal trees are more read-
optimized, whereas LSM-trees are more write-optimized.

A fully compacted LSM-tree should be equivalent to a b-tree in terms of read
performance. However I don't think any existing implementations are anywhere
near that.

I appreciate TokuTek's work in this area but their marketing should be taken
with a grain of salt. They've published some stupid stuff before (look at the
graph here[1] and tell me how to make something just like a b-tree, but
faster).

[1] [http://www.tokutek.com/2011/10/write-optimization-myths-comp...](http://www.tokutek.com/2011/10/write-optimization-myths-comparison-clarifications-part-2/)

~~~
bradleykuszmaul
I'm the author of this paper, and I'll try to respond to some of what has been
stated on this thread.

> A fully compacted LSM-tree should be equivalent to a b-tree in terms of read
> performance.

A fully compacted LSM tree arises only when writes stop happening for a
sufficiently long time. It's a long-solved problem to build a read-only
B-tree. The challenge is to not cause your read performance to fall when there
are writes in the workload. This paper attempts to analyze that penalty.

> They've published some stupid stuff before (look at the graph here[1] and
> tell me how to make something just like a b-tree, but faster).

Possibly we've published some stupid stuff, but I don't think that Leif's blog
posting qualifies. Is there something specific there that is incorrect?

~~~
btrask
Thanks for taking the time to reply.

In Leif's post, the graph shows "b-trees" on the inner curve. The equivalent
point on the outer curve would be a data structure that makes the same
read/write tradeoffs as a b-tree, but is simply better in all cases.

The other reply suggests cache oblivious b-trees. I haven't had time to look
at them yet. It's worth pointing out that if cache oblivious b-trees really
are better than normal b-trees, you could build an LSM-tree out of them.

My interpretation is that TokuTek is conflating data structures' theoretical
performance with their implementations' practical performance. So the graph is
trying to represent "typical" b-trees as not using compression and other
techniques that TokuDB uses. Under that interpretation, each data structure
would have a spectrum of possible performance from poor to highly tuned, which
the graph doesn't make clear.

To be fair, I think TokuDB probably has very good real world performance, and
there's a huge need for better LSM-trees (and better database engines in
general). My objection is the comparison of the data structures, not the
implementations.

~~~
bradleykuszmaul
Looking at that post on Leif's drawing, I think I agree that the B-tree is
placed incorrectly. It turns out that B-trees _are_ on the optimal read/write
tradeoff curve, but they are in a very bad corner.

B-trees cost O(log_B N) to read or write.

Fractal trees cost O(log_B N) to read, and O((log_B N)/sqrt(B)) to write (for
the case where epsilon=1/2).

The hidden constants inside the big-Oh are small; they seem to be less than 2
in practice. In fact, in the benchmarks that I've seen, the constants are in
favor of the fractal tree. I suspect that this is because the block size in
practical B-trees is too small, so the seek time is large compared to the
transfer time on magnetic disk. Furthermore, small block sizes hurt
compression and make the tree deeper. In addition, dynamic
B-trees (ones that can be written) typically fill their blocks only about 3/4
full on average, making the effective block size even smaller. In most dynamic
B-trees, the block size is 4KB or 16KB. In a few it's 32KB or even 64KB. For
read-only databases I've seen people use 1MB blocks for B-trees. In the
fractal tree, the effective block size is 4MB, and I suspect that's where
we're winning on the constants.

So for a constant slowdown in read performance (a quite small constant
slowdown, or possibly a small speedup) you can get dramatic speedups on write
performance.
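A rough sketch of those bounds, with hypothetical parameters `n` (items) and `b` (items per block), using the standard B^epsilon-tree write bound with epsilon = 1/2; these are illustrative numbers, not benchmark results:

```python
import math

def btree_cost(n, b):
    """B-tree point read or write: O(log_B N) block I/Os (constants omitted)."""
    return math.log(n, b)

def fractal_write_cost(n, b):
    """Amortized fractal-tree write for epsilon = 1/2: O((log_B N) / sqrt(B))."""
    return math.log(n, b) / math.sqrt(b)

n, b = 10**9, 1024  # hypothetical: a billion items, 1024 items per block
print(round(btree_cost(n, b), 2))          # ~3 I/Os per read or write
print(round(fractal_write_cost(n, b), 3))  # ~0.09 I/Os amortized per write
```

Even before constants, the gap between ~3 I/Os and ~0.09 I/Os per write is the "dramatic speedup" being claimed.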

LSM trees are not on the optimal tradeoff curve. They suffer worse read
performance than B-trees or fractal trees, and their write performance is
theoretically comparable to that of fractal trees.

So there's an error in that drawing, but the drawing is conceptual anyway,
since there's no scale. The point of the drawing is that fractal trees
dominate LSM trees, and for about the same cost as a B-tree on reads, you can
get much better write performance. So the drawing is correct in spirit.

It makes no sense to build an LSM tree out of fractal trees. The resulting
data structure would still be slower than a fractal tree. It's also unlikely
to be faster than a B-tree-based LSM, since the B-trees used in LSM trees are
static, not dynamic. You never have to insert into one of those B-trees, which
means you can apply a bunch of tricks, such as using huge blocks. That
probably shifts the constant inside the big-Oh back in favor of the B-tree for
read operations, which are the only B-tree operations inside an LSM.

I don't think Tokutek is conflating the theoretical performance with the
practical performance. Both the theoretical and practical performance appear
to be better than the alternatives. In practice fractal trees appear to be
slightly faster than B-trees at reads (assuming that the B-tree is actually a
writeable data structure) in real workloads, and much faster at writes. In
theory, the asymptotic read performance is the same, and the asymptotic write
performance is much better for Fractal Trees.

It's true that the LSM implementations could all be bad, leading to the wrong
conclusion when doing benchmarking. I'd find that surprising because the
fractal tree seems to be better than any of the many LSM implementations in
practice, and I'd hope that at least one of them would get it right. And the
practical performance matches what the theory says it should, so it's not
surprising that the fractal trees are faster.

I'm not sure why you object to comparing the data structures. We can compare
them both theoretically using big-Oh notation. We can also do an analysis for
specific database sizes and memory sizes of how the data structures compare,
assuming a good implementation. That's what my paper tried to do. We could
also compare the implementations, which my paper didn't do, but which others
have done. For example, Tim Callaghan has done a lot of benchmarking which
seems to be done right.

~~~
btrask
Thanks for acknowledging the problem with the graph. We'll have to disagree
over "correct in spirit" versus "stupid."

I suggested building an LSM out of cache-oblivious b-trees because of the claim
that they were better than b-trees in all cases, including (ostensibly) read
performance. I didn't realize that it was another name for fractal trees.
(It's been a while since I read the fractal tree papers.)

I'm not convinced that LSM-trees are suboptimal. The lower levels have shorter
trees, so the price of redundant lookups is small, and you should get better
write performance because you can do compaction without touching other levels.

Conversely, SQLite4's design is an LSM-tree with forward pointers between the
levels. However, I think that actually sacrifices some write performance for
read performance, which probably makes it closer to a fractal tree. (SQLite4
is also vaporware, unfortunately.)

I don't think there is a free lunch here. If you want to claim theoretically
better read performance, you have to admit theoretically worse write
performance. Tricks like compression or bloom filters (for LSM-trees) help in
practice but don't change the theory.

------
tokenrove
The same paper is available here without filling in a form:
[http://insideanalysis.com/wp-content/uploads/2014/08/Tokutek...](http://insideanalysis.com/wp-content/uploads/2014/08/Tokutek_lsm-vs-fractal.pdf)

~~~
jlarocco
Thanks

------
koverstreet
Downside of fractal trees is the root of the tree inevitably becomes contended
under multithreaded workloads - everything has to go through the root and
updates modify the root first.

I vaguely recall there being other downsides to the pure fractal tree approach,
but I can't remember them right now.

Compacting data structures are quite nifty though and do have useful
properties - you'll actually get the best performance in practice with a
hybrid compacting data structure/B+ tree.

I don't much care for LSMs, though.

~~~
skybrian
Seems like the root could be sharded if it's an issue?

~~~
thorwahnawa
Not really. But you could imagine hacks like forcing in-place updates instead
of logging for the first N levels (N = 1 or 2) of the B-tree. I don't know how
well that would work in practice.

I do know that B-tree block contention is a big problem for my $DAYJOB's use
case for smaller B-trees.

~~~
koverstreet
Block contention? You mean contention on the locks for individual nodes?

~~~
thorwahnawa
Yes. You can grab the ancestors shared and the leaf or subtree you want to
modify exclusive. If all of your writers land in the root of the tree because
it's a small tree, they all contend on the same lock.

~~~
koverstreet
So it's only an issue with small trees? I'm not really seeing the issue then,
use smaller btree nodes if it's really an issue...

~~~
dignati
The issue seems to be that writes in fractal trees always go through the root
node. This leads to a situation similar to small B-trees, where writes will
most likely access the root node and contend.

------
eternalban
The related patents:

[http://www.google.com/patents/US8185551](http://www.google.com/patents/US8185551)
[http://www.google.com/patents/US8489638](http://www.google.com/patents/US8489638)

------
jasonwatkinspdx
Just scanning the beginning, parts of this are pretty misleading.

For example:

 _"... if a database row contains 100 bytes, and a B tree such as InnoDB
employs 16KiB pages [11], then the B tree may perform 16KiB of I/O to write a
single 100-byte row, for a write amplification of 160, compared to a write
amplification of 30 to 70 for the other data structures."_

Reading this gives the impression that the only way pages are updated is by
writing the page in whole. Storage engines commonly use a slotted page layout
to avoid this. That is, variable length records are stored contiguously from
the start of the page, and are indexed by an array of fixed sized offset
entries that grows up from the end of the page. Inserting a 100 byte row will
append it to the end of data area, then prepend a new offset to the slot
array. On spinning disks this is typically two 512 byte writes (assuming page
size is decently larger than block size, as is common). On SSD's it's more
complicated due to the FTL, and state of the art systems tend to use a log
structured approach. See the Bw-tree paper from Microsoft Research for
example.
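The slotted-page idea above can be sketched as follows. This is a minimal, hypothetical layout (`SlottedPage`, the 2-byte little-endian offsets, and the in-memory `bytearray` are all assumptions for illustration, not any particular engine's format):

```python
PAGE_SIZE = 16 * 1024  # InnoDB-style 16 KiB page

class SlottedPage:
    """Record data grows forward from the start of the page; a slot array
    of 2-byte record offsets grows backward from the end."""

    def __init__(self):
        self.page = bytearray(PAGE_SIZE)
        self.free = 0     # next free byte in the data area
        self.nslots = 0   # number of slot entries at the page tail

    def insert(self, record: bytes) -> int:
        slot_off = PAGE_SIZE - 2 * (self.nslots + 1)
        if self.free + len(record) > slot_off:
            raise ValueError("page full")
        # Append the record to the data area.
        self.page[self.free:self.free + len(record)] = record
        # Prepend its offset to the slot array at the tail.
        self.page[slot_off:slot_off + 2] = self.free.to_bytes(2, "little")
        slot = self.nslots
        self.free += len(record)
        self.nslots += 1
        return slot

p = SlottedPage()
p.insert(b"x" * 100)  # dirties ~102 bytes in two regions, not 16 KiB
```

Only the record bytes and the 2-byte slot entry are dirtied, which is why on a spinning disk this can be roughly two 512-byte sector writes rather than a whole-page rewrite.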

~~~
hyc_symas
IMO they're being quite disingenuous in that note. InnoDB is the default
storage engine in MySQL, and there are very few production SQL databases with
a tiny 100 bytes per row. A typical DB row in common apps is a couple KB at
least. As a further example, MySQL's NDB Cluster engine has a maximum row size
of 8KB, and this is frequently a problem for apps that want to migrate to it.

If you start a discussion of write amplification with such an unrealistic
record size, that pretty much moots the rest of the discussion.

For a real-world examination of write amplification, at multiple record sizes:
[http://symas.com/mdb/ondisk/](http://symas.com/mdb/ondisk/)

"Academic math" means nothing when you get to real-world software. Tokutek FT
code is a pig, by every measure.

~~~
bradleykuszmaul
Many important databases do contain 100-byte rows. Sometimes the biggest
tables contain 100-byte rows, and then smaller tables contain larger rows.

I'd have to disagree with your assessment of the value of math.

It's nice to see someone trying to do performance benchmarks, but there are
some serious problems with the benchmark presented at symas.com.

1) The benchmark hardware used fusion I/O. If you can afford fusion I/O, and
you don't care about compression, you should use InnoDB. But you didn't
measure InnoDB. If you use fusion I/O you probably actually care about
compression, which you didn't measure, since you disabled compression. Why
even report space used if you've disabled compression?

2) You used libc malloc to try to make a "uniform test". TokuDB recommends the
use of JEmalloc, and you mentioned that some other storage engine requires
TCmalloc. There's a reason to use a better memory allocator, and some storage
engines need it. You've misconfigured at least two of the storage engines.

3) You wrote TokuDB code using the embedded API without passing the flags that
make the high-ingestion rate possible. I'll be the first to admit that the
documentation is poor for Tokutek's embedded API (the API that looks like
Berkeley DB), but that doesn't change the fact that you are misusing the
software. In particular, it appears that you used `0` for the flag argument to
the DB->put() operations, meaning that every write is preceded by a query.
Since you apparently didn't mess up the LSM-based APIs, it's not a fair
comparison.

4) You ran your benchmarks for 20 minutes on each storage engine. That's not
nearly long enough. For example, the system won't have aged at all.
Interesting things start happening when the database has been running for
a long time. Interesting things that are predicted by the math.

~~~
hyc_symas
I'm sure there are many important databases with 100 byte rows. But very few
of these will be SQL databases, and very few of these will be using InnoDB.

1) re: InnoDB - sorry, but your constant comparison to InnoDB only reminds me
of the old joke about the snail that got run over by a turtle - while waking
up in the recovery room, the snail says "I don't remember what happened, it
all happened so fast!" Beating InnoDB at anything says nothing useful.
[http://symas.com/mdb/memcache/](http://symas.com/mdb/memcache/)

We tested compression.
[http://symas.com/mdb/inmem/compress/](http://symas.com/mdb/inmem/compress/)

2) We tested with multiple malloc libraries as well.
[http://symas.com/mdb/inmem/malloc/](http://symas.com/mdb/inmem/malloc/) If
your storage engine _needs_ a particular memory allocator, You're Doing It
Wrong.

3) I didn't write the TokuDB code, Leif Walsh did.
[https://gist.github.com/leifwalsh/8378154#file-db_bench_ydb-...](https://gist.github.com/leifwalsh/8378154#file-db_bench_ydb-cc-L459-L485)

[https://github.com/hyc/leveldb/commits/benches/doc/bench/db_...](https://github.com/hyc/leveldb/commits/benches/doc/bench/db_bench_tokudb.cc?author=leifwalsh)

I also didn't write the LSM-based programs, their respective authors did. I
trusted each of them to use their own APIs correctly. That may have been a
mistake in the case of the TokuDB code, but given the lack of documentation I
didn't have much better alternatives.

4) We have much longer benchmark runs published as well. But as noted here
[http://symas.com/mdb/hyperdex/](http://symas.com/mdb/hyperdex/) what really
matters is the total number of ops executed. "Running for a long time" is only
a requirement when your DB engine is ... slow.

~~~
mdcallag
Welcome, Bradley and Leif, to the club of people whom Howard has declared are
doing it wrong. You are in fine company as I am also a member of that club.
Although I do prefer snark from Turing Award winners. I don't think there is
reason to pursue a discussion. It won't end well. But I do understand the
desire to address the claims once.

[https://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate...](https://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate-worse-than-death/)

------
thesz
In a DB with LSM storage you are free to do whatever you need to do.

For example, you can compact size-tiered levels when their size ratio exceeds
some threshold. This approach makes the situation presented in the paper
impossible.

In short, levels 1..N should be compacted into one when
sum_{i=1..N-1} level_size(i) >= C * level_size(N).

The multiplier C is important. Compactions will be rare and writes will be
faster if C is bigger than 1. The read performance will be worse - we have to
read more levels. To make compactions more often, we make C smaller. This way
we prioritize read speed.
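That trigger can be sketched as a one-line check (a hypothetical `should_compact`; level sizes are listed smallest to largest, and `c` is the multiplier C above):

```python
def should_compact(level_sizes, c=1.5):
    """Size-tiered trigger: compact levels 1..N into one when the combined
    size of levels 1..N-1 reaches C times the size of level N (the largest)."""
    *smaller, largest = level_sizes
    return sum(smaller) >= c * largest

# Smaller C: compactions happen more often, favoring reads over writes.
print(should_compact([10, 30, 200], c=0.1))  # True: 40 >= 20
print(should_compact([10, 30, 200], c=1.0))  # False: 40 < 200
```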

And the most important thing here is that we can compact levels even for read
transactions. If we have many read transactions, it is necessary to make them
faster.

This very important point of LSM design is not mentioned in any literature I
have read.

Fractal trees can be seen as basically LSM trees with a fixed coefficient
C=1, and with an embedded index, as the smaller levels provide indexing for
the bigger levels.

So they force the database designer into a corner. It is a different corner
from that of B+trees, but a corner nevertheless. LSM trees are more flexible
in that regard, in my opinion.

~~~
bradleykuszmaul
No, I don't think you can view a fractal tree as an LSM tree with C=1.

One conclusion of this paper is that for worst-case workloads, LSM trees are
not as good asymptotically as fractal trees. It's all well and good to be
flexible, but if there is no good setting of the parameters, the flexibility
doesn't actually help solve the problem.

~~~
thesz
You can change settings on the fly. Actually, you have to change only one
setting, the level merge multiplier.

You encounter many reads - make the merge multiplier smaller. You encounter
many writes - make it bigger.

------
rdtsc
Wonder how this compares with:

[https://github.com/couchbase/forestdb](https://github.com/couchbase/forestdb)

The idea seems to be a trie that uses a B+tree as each node. The key is split
into 8-byte chunks, and each chunk is used as the key at one level of B+tree.

(from their presentation at:
[http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_...](http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_poster_jungsang_ahn.pdf)
)
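The key-chunking step can be sketched like this (an illustrative `key_chunks`; the zero-padding of the final chunk is my assumption, not necessarily what ForestDB does):

```python
def key_chunks(key: bytes, chunk_size: int = 8):
    """Split a key into fixed-size chunks, one per trie level, roughly as in
    ForestDB's HB+-trie. Each chunk becomes the lookup key at one B+tree."""
    padded = key + b"\x00" * (-len(key) % chunk_size)
    return [padded[i:i + chunk_size] for i in range(0, len(padded), chunk_size)]

print(key_chunks(b"user:0000012345"))
# Two 8-byte chunks: the first indexes the root B+tree, the second the
# B+tree at the next trie level.
```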

