A Comparison of Log-Structured Merge (LSM) and Fractal Tree Indexing (highscalability.com)
81 points by ddorian43 on Mar 22, 2015 | 29 comments

Fractal trees are not "LSM-trees, but better." Fractal trees are more read-optimized, whereas LSM-trees are more write-optimized.

A fully compacted LSM-tree should be equivalent to a b-tree in terms of read performance. However, I don't think any existing implementations are anywhere near that.

I appreciate TokuTek's work in this area but their marketing should be taken with a grain of salt. They've published some stupid stuff before (look at the graph here[1] and tell me how to make something just like a b-tree, but faster).

[1] http://www.tokutek.com/2011/10/write-optimization-myths-comp...

> ...tell me how to make something just like a b-tree, but faster

If you wanted to do that on a modern multilayer-memory architecture, you would use a cache-oblivious structure like the cache-oblivious b-tree, the cache-oblivious lookahead array (COLA), and similar structures. These kinds of structures give you locality of data at both large and small scales, meaning that a disk read pulls in several related pieces of data at once, a jump between nodes produces fewer cache misses (at each level), and so forth.

Michael A. Bender wrote some of the original papers on cache-oblivious structures and, not coincidentally, is a Tokutek founder.

http://www.tokutek.com/company/team/
http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
http://supertech.csail.mit.edu/cacheObliviousBTree.html

I'm the author of this paper, and I'll try to respond to some of what has been stated on this thread.

> A fully compacted LSM-tree should be equivalent to a b-tree in terms of read performance.

A fully compacted LSM tree arises only when writes stop happening for a sufficiently long time. It's a long-solved problem to build a read-only B-tree. The challenge is to not cause your read performance to fall when there are writes in the workload. This paper attempts to analyze that penalty.

> They've published some stupid stuff before (look at the graph here[1] and tell me how to make something just like a b-tree, but faster).

Possibly we've published some stupid stuff, but I don't think that Leif's blog posting qualifies. Is there something specific there that is incorrect?

Thanks for taking the time to reply.

In Leif's post, the graph shows "b-trees" on the inner curve. The equivalent point on the outer curve would be a data structure that makes the same read/write tradeoffs as a b-tree, but is simply better in all cases.

The other reply suggests cache oblivious b-trees. I haven't had time to look at them yet. It's worth pointing out that if cache oblivious b-trees really are better than normal b-trees, you could build an LSM-tree out of them.

My interpretation is that TokuTek is conflating data structures' theoretical performance with their implementations' practical performance. So the graph is trying to represent "typical" b-trees as not using compression and other techniques that TokuDB uses. Under that interpretation, each data structure would have a spectrum of possible performance from poor to highly tuned, which the graph doesn't make clear.

To be fair, I think TokuDB probably has very good real world performance, and there's a huge need for better LSM-trees (and better database engines in general). My objection is the comparison of the data structures, not the implementations.

Looking at Leif's drawing in that post, I think I agree that the B-tree is placed incorrectly. It turns out that B-trees are on the optimal read/write tradeoff curve, but they are in a very bad corner of it.

B-trees cost O(log_B N) to read or write.

Fractal trees cost O(log_B N) to read, and O((log_B N)/sqrt(B)) to write (for the case where epsilon = 1/2).
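For concreteness, a back-of-envelope sketch of these asymptotic costs (the values of N and B below are hypothetical, and the constants inside the big-Oh are ignored):

```python
import math

# Hypothetical sizes; constants inside the big-Oh are ignored.
N = 10**9   # keys in the index
B = 1024    # keys per block

log_B_N = math.log(N, B)                # B-tree read or write: O(log_B N)
fractal_write = log_B_N / math.sqrt(B)  # fractal tree write: O(log_B N / sqrt(B))

print(f"B-tree write: ~{log_B_N:.2f} block accesses")
print(f"fractal tree write: ~{fractal_write:.4f} block accesses (amortized)")
```

With these numbers the amortized write cost drops by a factor of sqrt(B) = 32, while the read cost stays O(log_B N).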

The hidden constants inside the big-Oh are small; they seem to be less than 2 in practice. In fact, in the benchmarks that I've seen, the constants are in favor of the fractal tree. I suspect that this is because the block size in practical B-trees is too small, so the seek time is large compared to the transfer time on magnetic disk. Furthermore, small block sizes hurt compression, and they make the tree deeper. Dynamic B-trees (ones that can be written) also typically fill their blocks only about 3/4 full on average, making the effective block size even smaller. In most dynamic B-trees the block size is 4KB or 16KB; in a few it's 32KB or even 64KB. For read-only databases I've seen people use 1MB blocks for B-trees. In the fractal tree the effective block size is 4MB, and I suspect that's where we're winning on the constants.

So for a constant slowdown in read performance (a quite small constant slowdown, or possibly a small speedup) you can get dramatic speedups on write performance.

LSM trees are not on the optimal tradeoff curve. They suffer worse read performance than B-trees or fractal trees, and their write performance is theoretically comparable to that of fractal trees.

So there's an error in that drawing, but the drawing is conceptual anyway, since there's no scale. The point of the drawing is that fractal trees dominate LSM trees, and for about the same cost as a B-tree on reads, you can get much better write performance. So the drawing is correct in spirit.

It makes no sense to build an LSM tree out of fractal trees. The resulting data structure would still be slower than a fractal tree. It's also unlikely to be faster than a B-tree-based LSM, since the B-trees used in LSM trees are static, not dynamic. You never have to insert into one of those B-trees, which means you can apply a bunch of tricks, such as using huge blocks. That probably shifts the constant inside the big-Oh back in favor of the B-tree for read operations, which are the only B-tree operations inside an LSM.

I don't think Tokutek is conflating the theoretical performance with the practical performance. Both the theoretical and practical performance appear to be better than the alternatives. In practice fractal trees appear to be slightly faster than B-trees at reads (assuming that the B-tree is actually a writeable data structure) in real workloads, and much faster at writes. In theory, the asymptotic read performance is the same, and the asymptotic write performance is much better for Fractal Trees.

It's true that the LSM implementations could all be bad, leading to the wrong conclusion when doing benchmarking. I'd find that surprising because the fractal tree seems to be better than any of the many LSM implementations in practice, and I'd hope that at least one of them would get it right. And the practical performance matches what the theory says it should, so it's not surprising that the fractal trees are faster.

I'm not sure why you object to comparing the data structures. We can compare them both theoretically using big-Oh notation. We can also do an analysis for specific database sizes and memory sizes of how the data structures compare, assuming a good implementation. That's what my paper tried to do. We could also compare the implementations, which my paper didn't do, but which others have done. For example, Tim Callaghan has done a lot of benchmarking which seems to be done right.

Thanks for acknowledging the problem with the graph. We'll have to disagree over "correct in spirit" versus "stupid."

I suggested building an LSM out of cache-oblivious b-trees because of the claim that they were better than b-trees in all cases, including (ostensibly) read performance. I didn't realize that it was another name for fractal trees. (It's been a while since I read the fractal tree papers.)

I'm not convinced that LSM-trees are suboptimal. The lower levels have shorter trees, so the price of redundant lookups is small, and you should get better write performance because you can do compaction without touching other levels.

Conversely, SQLite4 has designed an LSM-tree with forward pointers between the levels. However I think that actually sacrifices some write performance for read performance, which probably makes it closer to a fractal tree. (SQLite4 is also vaporware, unfortunately.)

I don't think there is a free lunch here. If you want to claim theoretically better read performance, you have to admit theoretically worse write performance. Tricks like compression or bloom filters (for LSM-trees) help in practice but don't change the theory.

The same paper is available here without filling in a form: http://insideanalysis.com/wp-content/uploads/2014/08/Tokutek...


Downside of fractal trees is the root of the tree inevitably becomes contended under multithreaded workloads - everything has to go through the root and updates modify the root first.

I vaguely recall there being other downsides to the pure fractal tree approach I can't recall right now.

Compacting data structures are quite nifty though and do have useful properties - you'll actually get the best performance in practice with a hybrid compacting data structure/B+ tree.

I don't much care for LSMs, though.

To complain about contention under multithreaded workloads seems to be missing the point. At the point that multithreaded workloads are becoming a problem, the ingestion rate is huge (hundreds of thousands of insertions per second). For an in-memory database, that may not seem like much, but for a disk-resident database, that's a lot.

I don't know what a "hybrid compacting data structure/B+ tree" is. Is there somewhere I can read about how you get the best performance in practice with it?

Seems like the root could be sharded if it's an issue?

Not really. But you could imagine hacks like forcing updates, instead of logging, for the first N levels (N = 1 or 2) of the B-tree. I don't know how well that would work in practice.

I do know that B-tree block contention is a big problem for my $DAYJOB's use case for smaller B-trees.

Block contention? You mean contention on the locks for individual nodes?

Yes. You can grab the ancestors shared and the leaf or subtree you want to modify exclusive. If all of your writers land in the root of the tree because it's a small tree, they all contend on the same lock.
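The protocol described above can be sketched as hand-over-hand (lock-coupling) descent. This is an illustrative toy, not any engine's actual code; the Node class is hypothetical, and since Python's stdlib has no readers-writer lock, a minimal one is included:

```python
import threading

class RWLock:
    """Minimal readers-writer lock (illustrative; writers can starve)."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._write = threading.Lock()
        self._readers = 0

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._write.acquire()   # first reader blocks writers

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._write.release()   # last reader admits writers

    def acquire_write(self):
        self._write.acquire()

    def release_write(self):
        self._write.release()

class Node:
    """Toy B-tree node: an empty children list means a leaf."""
    def __init__(self, keys=None, children=None):
        self.lock = RWLock()
        self.keys = keys or []          # separator keys
        self.children = children or []
        self.values = {}                # leaf payload

    def child_for(self, key):
        for sep, child in zip(self.keys, self.children):
            if key < sep:
                return child
        return self.children[-1]

def insert(root, key, value):
    """Shared locks on ancestors, exclusive lock on the leaf being modified.
    In a small tree, every writer passes through root.lock -- the contention
    discussed above."""
    node = root
    if not node.children:               # tree is a single leaf
        node.lock.acquire_write()
        node.values[key] = value
        node.lock.release_write()
        return
    node.lock.acquire_read()
    while True:
        child = node.child_for(key)
        if child.children:
            child.lock.acquire_read()   # interior node: shared
        else:
            child.lock.acquire_write()  # leaf: exclusive
        node.lock.release_read()        # hand-over-hand
        node = child
        if not node.children:
            break
    node.values[key] = value
    node.lock.release_write()
```

Note that the real protocol also has to handle a leaf update that splits nodes (the exclusive lock must then cover the whole affected subtree), which this sketch omits.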

I'm no expert, but it seems like you need to guess at some key ranges and start the BTree at some minimum size even if most of the leaves start out empty. Either that, or don't write anything until the tree grows a bit and rely on a writeahead log for durability.

Hotspots could still show up further down, though, if writes aren't fairly random.

So it's only an issue with small trees? I'm not really seeing the issue then, use smaller btree nodes if it's really an issue...

The issue seems to be that writes in fractal trees always go through the root node. This leads to a situation similar to small B-trees, where writes will most likely access the root node and contend.

How would you do that? You have to be able to do lookups on the root. You have to have something to shard /on/.

B+ trees shard the updates by splitting the keyspace up into different leaf nodes.

Just scanning the beginning, parts of this are pretty misleading.

For example:

"... if a database row contains 100 bytes, and a B tree such as InnoDB employs 16KiB pages [11], then the B tree may perform 16KiB of I/O to write a single 100-byte row, for a write amplification of 160, compared to a write amplification of 30 to 70 for the other data structures."

Reading this gives the impression that the only way pages are updated is by writing the page in whole. Storage engines commonly use a slotted page layout to avoid this. That is, variable-length records are stored contiguously from the start of the page, and are indexed by an array of fixed-size offset entries that grows up from the end of the page. Inserting a 100-byte row will append it to the end of the data area, then prepend a new offset to the slot array. On spinning disks this is typically two 512-byte writes (assuming the page size is decently larger than the block size, as is common). On SSDs it's more complicated due to the FTL, and state-of-the-art systems tend to use a log-structured approach. See the Bw-tree paper from Microsoft Research for example.
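The slotted-page layout described above can be sketched as follows (an illustrative format with a 4-byte slot entry, not any real engine's on-disk layout):

```python
import struct

PAGE_SIZE = 16 * 1024
SLOT_SIZE = 4  # 2-byte offset + 2-byte length per slot entry

class SlottedPage:
    """Illustrative slotted page; not any specific engine's format."""
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.free_start = 0   # record data grows up from here
        self.nslots = 0       # slot array grows down from the page end

    def insert(self, record: bytes) -> int:
        slot_off = PAGE_SIZE - SLOT_SIZE * (self.nslots + 1)
        if self.free_start + len(record) > slot_off:
            raise ValueError("page full")
        # First dirty region: the record itself, appended to the data area.
        self.data[self.free_start:self.free_start + len(record)] = record
        # Second dirty region: a new (offset, length) entry in the slot array.
        struct.pack_into("<HH", self.data, slot_off,
                         self.free_start, len(record))
        self.free_start += len(record)
        self.nslots += 1
        return self.nslots - 1

    def get(self, slot: int) -> bytes:
        off, length = struct.unpack_from(
            "<HH", self.data, PAGE_SIZE - SLOT_SIZE * (slot + 1))
        return bytes(self.data[off:off + length])
```

Inserting a 100-byte row dirties only two small regions: the record bytes at the front and the 4-byte slot entry at the back, which is what lets an engine write far less than a full page.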

IMO they're being quite disingenuous in that note. InnoDB is the default storage engine in MySQL, and there are very few production SQL databases with a tiny 100 bytes per row. A typical DB row in common apps is a couple KB at least. As a further example, MySQL's NDB Cluster engine has a maximum row size of 8KB, and this is frequently a problem for apps that want to migrate to it.

If you start a discussion of write amplification with such an unrealistic record size, that pretty much moots the rest of the discussion.

For a real-world examination of write amplification, at multiple record sizes: http://symas.com/mdb/ondisk/

"Academic math" means nothing when you get to real-world software. Tokutek FT code is a pig, by every measure.

Yeah, and I should have mentioned LMDB as another example of how state of the art storage engines tend towards log structured allocation management. The RamCloud design is another example.

The way I'd summarize the case on SSD's is: the FTL is going to log structure your writes anyhow, you might as well align with that and recoup what you can.

Many important databases do contain 100-byte rows. Sometimes the biggest tables contain 100-byte rows, and then smaller tables contain larger rows.

I'd have to disagree with your assessment of the value of math.

It's nice to see someone trying to do performance benchmarks, but there are some serious problems with the benchmark presented at symas.com.

1) The benchmark hardware used fusion I/O. If you can afford fusion I/O, and you don't care about compression, you should use InnoDB. But you didn't measure InnoDB. If you use fusion I/O you probably actually care about compression, which you didn't measure, since you disabled compression. Why even report space used if you've disabled compression?

2) You used libc malloc to try to make a "uniform test". TokuDB recommends the use of JEmalloc, and you mentioned that some other storage engine requires TCmalloc. There's a reason to use a better memory allocator, and some storage engines need it. You've misconfigured at least two of the storage engines.

3) You wrote TokuDB code using the embedded API without passing the flags that make the high ingestion rate possible. I'll be the first to admit that the documentation is poor for Tokutek's embedded API (the API that looks like Berkeley DB), but that doesn't change the fact that you are misusing the software. In particular, it appears that you used `0' for the flag argument to the DB->put() operations, meaning that every write is preceded by a query. Since you apparently didn't mess up the LSM-based APIs, it's not a fair comparison.

4) You ran your benchmarks for 20 minutes on each storage engine. That's not nearly long enough. For example, the system won't have aged at all. Interesting things start happening when the database has been running for a long time. Interesting things that are predicted by the math.

I'm sure there are many important databases with 100 byte rows. But very few of these will be SQL databases, and very few of these will be using InnoDB.

1) re: InnoDB - sorry, but your constant comparison to InnoDB only reminds me of the old joke about the snail that got run over by a turtle - while waking up in the recovery room, the snail says "I don't remember what happened, it all happened so fast!" Beating InnoDB at anything says nothing useful. http://symas.com/mdb/memcache/

We tested compression. http://symas.com/mdb/inmem/compress/

2) We tested with multiple malloc libraries as well. http://symas.com/mdb/inmem/malloc/ If your storage engine needs a particular memory allocator, You're Doing It Wrong.

3) I didn't write the TokuDB code, Leif Walsh did. https://gist.github.com/leifwalsh/8378154#file-db_bench_ydb-...


I also didn't write the LSM-based programs, their respective authors did. I trusted each of them to use their own APIs correctly. That may have been a mistake in the case of the TokuDB code, but given the lack of documentation I didn't have much better alternatives.

4) We have much longer benchmark runs published as well. But as noted here http://symas.com/mdb/hyperdex/ what really matters is the total number of ops executed. "Running for a long time" is only a requirement when your DB engine is ... slow.

Welcome, Bradley and Leif, to the club of people whom Howard has declared are doing it wrong. You are in fine company as I am also a member of that club. Although I do prefer snark from Turing Award winners. I don't think there is reason to pursue a discussion. It won't end well. But I do understand the desire to address the claims once.


In a DB with LSM storage you are free to do whatever you need to do.

For example, you can compact size-tiered levels when their size ratio exceeds some threshold. This approach makes the situation presented in the paper impossible.

In short, levels 1..N should be compacted into one when sum_{i=1..N-1} level_size(i) >= C * level_size(N).

The multiplier C is important. Compactions will be rare and writes will be faster if C is bigger than 1, but read performance will be worse because we have to read more levels. To make compactions happen more often, we make C smaller. This way we prioritize read speed.
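The rule above can be sketched as a simple trigger (the level ordering and function name are my assumptions, for illustration):

```python
def should_compact(level_sizes, C=1.0):
    """Return True when levels 1..N should be merged into one.

    level_sizes is ordered from the newest/smallest level (index 0) to the
    oldest/largest level (index -1). Per the rule above, compact when
    sum_{i=1..N-1} level_size(i) >= C * level_size(N).
    """
    if len(level_sizes) < 2:
        return False
    return sum(level_sizes[:-1]) >= C * level_sizes[-1]
```

A bigger C (say 2) postpones compaction, favoring writes; a smaller C (say 0.5) compacts eagerly, favoring reads.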

And the most important thing here is that we can also compact levels on behalf of read transactions. If we have many read transactions, it makes sense to compact more so that they run faster.

This very important point of LSM design is not mentioned in any literature I read.

Fractal trees can be seen as basically LSM trees with a fixed coefficient C = 1, and with an embedded index, since the smaller levels provide indexing for the bigger levels.

So they force the database designer into a corner. It's a different corner from that of B+ trees, but a corner nevertheless. LSM trees are more flexible in that regard, in my opinion.

No, I don't think you can view a fractal tree as an LSM tree with C=1.

One conclusion of this paper is that for worst-case workloads, LSM trees are not as good asymptotically as fractal trees. It's all fine and good to be flexible, but if there is no good setting of the parameters, the flexibility doesn't actually help solve the problem.

You can change settings on the fly. Actually, you have to change only one setting, the level merge multiplier.

If you encounter many reads, make the merge multiplier bigger. If you encounter many writes, make it smaller.

Wonder how this compares with:


The idea seems to be a trie that uses a B+ tree as each node. The key is split into 8-byte chunks, and each chunk is used as the key at the corresponding level of B+ trees.

(from their presentation at: http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_... )
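The chunking described above might look like this (the padding behavior is my assumption for illustration, not taken from the presentation):

```python
def split_key(key: bytes, chunk: int = 8):
    """Split a key into fixed-size chunks; each chunk becomes the lookup key
    at one trie level (each level being a B+ tree). The final chunk is
    zero-padded here -- an assumption, not the contest entry's actual rule."""
    padded = key + b"\x00" * (-len(key) % chunk)
    return [padded[i:i + chunk] for i in range(0, len(padded), chunk)]
```

For example, a 12-byte key produces two chunks, so a lookup descends through two levels of B+ trees.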
