
Btree vs. LSM - pbhowmic
https://github.com/wiredtiger/wiredtiger/wiki/Btree-vs-LSM
======
hendzen
I have a lot of respect for the author of that page, Alex Gorrod. However,
this benchmark is quite dated (2014). The WiredTiger codebase was still
immature when it was written. Since WiredTiger's acquisition by MongoDB (the
company) and its integration into MongoDB (the database), the btree and LSM
implementations have undergone extensive changes and are now much more
hardened and adapted to production workloads. Furthermore, it would be worth
considering other engines such as InnoDB (btree) and RocksDB (LSM), which
along with WiredTiger are considered the leading open source storage engines.

For more recent and realistic benchmarks you should look at the work of Mark
Callaghan.

~~~
erpellan
Mark's blog:
[http://smalldatum.blogspot.co.uk/](http://smalldatum.blogspot.co.uk/)

------
3pt14159
Great analysis but can someone fork this and update the graphs to improve some
things? I'd do it but I'm on a really bad internet connection right now.

1. Use SI magnitudes, so that we don't have to count the zeros (so 100k
instead of 100000).

2. Either use the same chart for LSM vs. Btree for each of read and write, or
at the very least use the same y axis between the two.

3. It's hard to see the differences in the limited write benchmarks. Can we
make the axis logarithmic, or add another graph with these differences
highlighted?
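
For point 1, a tiny SI formatter along these lines (a hypothetical helper,
not anything from the benchmark repo) would do:

```python
def si_format(n):
    """Format a number with an SI magnitude suffix, e.g. 100000 -> '100k'."""
    for divisor, suffix in ((1_000_000_000, "G"), (1_000_000, "M"), (1_000, "k")):
        if abs(n) >= divisor:
            # %g drops a trailing .0, so 100000 renders as '100k', not '100.0k'
            return f"{n / divisor:g}{suffix}"
    return f"{n:g}"

print(si_format(100000))   # -> 100k
print(si_format(2500000))  # -> 2.5M
```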

~~~
ddorian43
Many benchmarks on different dbs using either lsm/btree on different
environments/data-sets (including WiredTiger):
[https://symas.com/products/lightning-memory-mapped-database/project-benchmarks/](https://symas.com/products/lightning-memory-mapped-database/project-benchmarks/)

~~~
petergeoghegan
I agree that the quality of each implementation is very pertinent here. I
wonder if the B-Tree implementation in question used the Lehman & Yao B-link
technique, in order to obviate the need for "latch coupling"/"crabbing". At
the same time, it's interesting to consider if the LSM implementation is
similarly optimized to minimize low-level locks/semaphore overhead, if in fact
that's actually possible.

In general, B-Tree index performance has plenty to do with low-level
optimizations, beyond things that are published in academic papers. I'm
talking about things like micro-optimizing the number of CPU cache misses on
internal pages, the use of interpolation search, and so on.
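
In toy form, latch coupling on the read path looks something like this (a
Python sketch with made-up names, not anything from WiredTiger's code):

```python
import threading

class Node:
    """Minimal node for illustration: a leaf if children is None."""
    def __init__(self, keys, children=None):
        self.latch = threading.Lock()
        self.keys = keys
        self.children = children

def find_leaf(root, key):
    """Descend with latch coupling ('crabbing'): take the child's latch
    before releasing the parent's, so no concurrent insert can split the
    child out from under us. The Lehman & Yao B-link scheme avoids holding
    two latches at once by giving every node a right-sibling pointer, so a
    reader can chase right if a concurrent split moved its key."""
    node = root
    node.latch.acquire()
    while node.children is not None:
        # The number of separator keys <= the search key picks the child slot.
        idx = sum(1 for k in node.keys if key >= k)
        child = node.children[idx]
        child.latch.acquire()   # couple: latch the child first ...
        node.latch.release()    # ... then drop the parent
        node = child
    node.latch.release()        # a real reader would hold this while reading
    return node

# Two-level tree: keys < 10 in the left leaf, keys >= 10 in the right.
leaves = [Node([1, 5]), Node([10, 20])]
root = Node([10], leaves)
print(find_leaf(root, 20) is leaves[1])  # -> True
```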

------
erichocean
My favorite database _by far_ today is LMDB (B+Tree).[0] Performance is
insane, and very low-variance. Reads scale linearly with core counts, and it
has a lot of useful index types and knobs to get maximum performance.

What am I most looking forward to using later this year? ScyllaDB[1] and
CockroachDB[2], both in conjunction with LMDB.

[0] [http://104.237.133.194/doc/](http://104.237.133.194/doc/)

[1] [http://www.scylladb.com/](http://www.scylladb.com/)

[2] [https://www.cockroachlabs.com/](https://www.cockroachlabs.com/)

~~~
filereaper
Really liking CockroachDB so far.

Also, reading through their site, I didn't know Google's Spanner required
atomic clocks! Holy deep pockets, Batman!

~~~
MichaelGG
I don't think such clocks are that expensive. A quick search shows several
companies offering them ready to go. I'd guess that even relative to the
engineering time involved, the cost of the clocks was irrelevant.

GPS NTP servers start at $1000, but I dunno how much that delivers. The actual
GPS signal is probably accurate enough, but the rest of the hardware chain on
a $1000 box might not be <1ms. Google might have also wanted a backup in case
antennas are disturbed or GPS signal somehow interrupted?

But even then - just because Google chose 7ms doesn't mean everyone else has
to. Drop a GPS NTP master at each site and live with that? I wouldn't imagine
it'd get much worse than 10-20ms?

I don't really know a whole lot about this though.

------
jules
B-epsilon trees are a modification of B-trees that support much higher write
throughput. It is done by adding a buffer to each node in the tree so that you
don't need to travel all the way down the tree every time. When a buffer is
full all its contents are pushed down to the next level in one batch. It would
be interesting to see how well they stack up against LSM's write throughput.
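
The buffering idea can be sketched in a few lines (a toy under invented
names, with a fixed fanout and no node splitting):

```python
class BeNode:
    """Toy sketch of a B-epsilon tree node: buffering only; a real
    implementation would also split nodes and spill buffers to disk."""

    def __init__(self, children=None, pivots=None, capacity=4):
        self.buffer = {}                # pending key -> value messages
        self.children = children        # None for a leaf node
        self.pivots = pivots or []      # separator keys between children
        self.capacity = capacity

    def insert(self, key, value):
        # Writes land in this node's buffer instead of descending the tree.
        self.buffer[key] = value
        if len(self.buffer) > self.capacity and self.children:
            self.flush()

    def flush(self):
        # Push the whole buffer down one level in a single batch, amortizing
        # the cost of the descent (and its I/O) across many keys.
        for key, value in self.buffer.items():
            idx = sum(1 for p in self.pivots if key >= p)
            self.children[idx].insert(key, value)
        self.buffer.clear()

root = BeNode(children=[BeNode(), BeNode()], pivots=[100], capacity=2)
for k in (1, 2, 150, 3):
    root.insert(k, k)
# The third insert overflowed the root buffer and flushed 1, 2 and 150
# down to the children in one batch; key 3 is still buffered at the root.
```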

~~~
ddorian43
Looks like fractal trees:
[https://en.wikipedia.org/wiki/Fractal_tree_index](https://en.wikipedia.org/wiki/Fractal_tree_index)

~~~
jules
Thanks for the link, that cleared up a confusion that I've had for a while.
I've read a lot of contradictory things about fractal trees, but apparently
they changed the data structure from a COLA to a B-epsilon tree at some point
but kept the name fractal tree.

------
armon
The synthetic benchmark used here is not a great indicator of performance in
most (any?) production environments. It's still generally useful and
interesting to understand the tradeoffs between an LSM and a B-Tree.
Particularly, if you have an update- or delete-heavy workload, the compaction
cost of an LSM can become an issue. B-Trees don't suffer from compaction
issues, so in some sense they trade consistently slower writes for avoiding
an I/O-intensive amortized compaction.

Some of the worst production experiences I've had came from exhausting I/O on
the database, and then having LevelDB / RocksDB / LSM stores kick off their
compaction. B-Tree will give you a very consistent redline in terms of
performance generally.

TL;DR: There are trade offs between the two, but this benchmark is not
particularly insightful given that it doesn't really test any of the
interesting boundary conditions or real world query patterns.

------
andyjpb
B-trees (especially Copy on Write B-trees) tend to be limited by the disk seek
performance. The write performance of a virgin B-tree will also vary
substantially depending on the keys that are being inserted. Random keys will
lead to a lot of splitting of the tree which can be slow as it requires a lot
of locking and allocation. Sequential keys will amortise the expensive part of
building the structure of the tree.
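
The locality difference between key orders shows up even in a toy model
(invented here, not a real B-tree: sorted pages that split in half on
overflow, counting how often consecutive inserts touch different pages):

```python
import random

def page_switches(keys, page_size=64):
    """Count how often consecutive inserts land on different leaf pages in
    a toy page model; each switch stands in for a seek (or a latch on a
    cold page) that a real B-tree would pay."""
    pages = [[]]
    switches, last_idx = 0, 0
    for key in keys:
        # The key belongs to the last page whose low key is <= key.
        idx = 0
        for i, page in enumerate(pages):
            if page and page[0] <= key:
                idx = i
        pages[idx].append(key)
        pages[idx].sort()
        if len(pages[idx]) > page_size:
            full = pages[idx]
            mid = page_size // 2
            pages[idx:idx + 1] = [full[:mid], full[mid:]]
            if key >= pages[idx + 1][0]:
                idx += 1    # the new key ended up in the right half
        if idx != last_idx:
            switches += 1
        last_idx = idx
    return switches

random.seed(0)
n = 2048
shuffled = list(range(n))
random.shuffle(shuffled)
# Sequential keys keep hitting the same (rightmost) page, switching only at
# splits; random keys touch a different page almost every time.
```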

This test uses a 10GB test size which is very small compared to the size of
RAM available for the page cache. This will hide the seek penalties that the
B-Tree takes on reads. Over a much larger test one would expect the write
performance of a well written LSM to flatten out to a constant, indefinitely
sustainable, rate. A (COW) B-tree would be expected to gradually degrade over
time as it gets more fragmented and takes more seek penalties for writes and
range reads.

The seek penalty is still relevant for SSDs, as they offer only a finite
number of random I/Os per second: on the order of 100K/s rather than the
~100/s you might expect from a conventional disk. You still want to be mostly
operating those devices in streaming read mode where possible.

------
filereaper
I'd love to see a breakdown of databases by their primary underlying storage
mechanism, e.g.:

- RDBMS: B-Tree layout (good for lookups)

- NoSQL(-like) DBs: LSM (good for heavy write throughput)

And then there's ones like Dremel which opt for high-octane full-table scans.

~~~
jeffdavis
RDBMSs often support a variety of primary and secondary index types. There's
nothing about the relational model or SQL that won't work with an LSM tree.

------
rawnlq
Might be a stupid question, but is this performance of just the data structure
in memory or with reading/writing to disk also?

~~~
avita1
LSMs and Btrees are both data structures designed for use with disks, so the
tests were probably run with disks.

~~~
alexnewman
Anything with blocks. SSDs have blocks. NVMe, on the other hand....

------
hans
Here's a comparison of LSM and Fractal trees:

[http://highscalability.com/blog/2014/8/6/tokutek-white-paper-a-comparison-of-log-structured-merge-lsm.html](http://highscalability.com/blog/2014/8/6/tokutek-white-paper-a-comparison-of-log-structured-merge-lsm.html)

------
professorm
I noticed they used LD_PRELOAD=/usr/lib64/libjemalloc.so

glibc malloc not up to the task?

~~~
wolf550e
People often run MySQL with jemalloc preloaded. Here's a benchmark comparing
glibc ptmalloc, jemalloc and tcmalloc for some storage engines:
[http://lmdb.tech/bench/inmem/malloc/](http://lmdb.tech/bench/inmem/malloc/)

