
The Log-Structured Merge-Tree (LSM-Tree) (1996) [pdf] - espeed
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf
======
kev009
Howard Chu of LMDB fame really bags on LSM:

[https://twitter.com/hyc_symas/status/650799297143836672](https://twitter.com/hyc_symas/status/650799297143836672)

[https://twitter.com/hyc_symas/status/657399873562656768](https://twitter.com/hyc_symas/status/657399873562656768)

~~~
thesz
I implemented a variant of LSM in C# (BerkeleyDB feature/bug compatible, more
or less) for our internal DB server (OO/relational DB).

Yes, it is true that LSM can incur O(log N) read amplification; that showed up
clearly in our bare-storage tests. But! The full-table-scan read in the full
server showed only a 3% (three percent) difference between the two storages
(LSM being 3% slower). And since I didn't use B-trees for the LSM levels, but
other superficially similar structures with a different construction (and no
in-place modification - MVCC), reading complex structures from our DB (a PCB
project - you can guess the complexity) netted a clear benefit: that read
operation was 16 (sixteen) to 19 (nineteen) percent faster compared to
BerkeleyDB. Given that the BerkeleyDB storage backend never occupied more than
20% of server runtime in our profiles, that amounts to a 5-20x speedup in read
operations at the storage layer.
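The step from a 16-19% whole-server saving to a 5-20x storage-layer speedup is Amdahl-style arithmetic; a minimal sketch (my own figures from the numbers above, assuming all of the saving happened inside the storage backend):

```python
# Sketch of the speedup arithmetic: the storage backend took at most
# 20% of total server runtime, and the whole-server read time dropped
# by 16-19%, all of it assumed saved inside the storage layer.
def storage_layer_speedup(whole_server_saving, storage_fraction=0.20):
    old_storage = storage_fraction
    new_storage = storage_fraction - whole_server_saving
    return old_storage / new_storage

print(round(storage_layer_speedup(0.16)))  # 5x
print(round(storage_layer_speedup(0.19)))  # 20x
```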

This can be explained with the following argument: a B-tree gets fragmented
when you insert things randomly - pages end up scattered all over the place.
The rebuild phase I employed left pages much less scattered (in fact, no more
than some fixed number of contiguous page runs - 8 in our case - is needed to
hold all the data). That amounted to a radical difference in read performance.
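A toy cost model (my own illustration, not the actual implementation) makes the fragmentation point concrete: if a full scan pays roughly one seek per contiguous run of pages, a randomly-inserted B-tree pays thousands of seeks, while a rebuilt layout with a bounded number of runs pays almost none:

```python
import random

# Toy model: approximate the cost of a full scan as one seek per
# contiguous run in the on-disk ordering of the logical pages.
def scan_seeks(page_locations):
    runs = 1
    for prev, cur in zip(page_locations, page_locations[1:]):
        if cur != prev + 1:  # a gap means the disk head must move
            runs += 1
    return runs

random.seed(0)
# Random inserts scatter 10k logical pages across a 100k-page file.
fragmented = random.sample(range(100_000), 10_000)
# A rebuild lays the same pages out contiguously (one run here; the
# scheme described above bounds it at 8 runs).
rebuilt = list(range(10_000))

print(scan_seeks(fragmented))  # thousands of seeks
print(scan_seeks(rebuilt))     # 1
```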

Take a look:
[https://cs.brown.edu/research/pubs/theses/masters/2010/newto...](https://cs.brown.edu/research/pubs/theses/masters/2010/newton.pdf)

The paper above discusses B-tree fragmentation and proposes something like a
small LSM structure for indices.

~~~
hyc_symas
Reads in LMDB are orders of magnitude faster than BerkeleyDB. Not just a tiny
16-19%. Both are B+trees. Using the right algorithm is important, but using it
_well_ matters too.

[http://symas.com/mdb/#bench](http://symas.com/mdb/#bench)

My original comment was about write amplification, not read amplification.
It's already a given that LSMs are slow for reads. LSMs are claimed to be
write-optimized, and yet it's easy to demonstrate that their write
amplification is prohibitively bad. They are also inherently unreliable,
making them a choice suitable only for idiots.

[https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai)
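For reference, the standard back-of-envelope for leveled-compaction write amplification (a textbook model, not taken from the linked paper): each byte is rewritten roughly `fanout` times per level it passes through, so amplification grows with the number of levels:

```python
import math

# Rough leveled-LSM write amplification: a byte is rewritten about
# `fanout` times at each level during compaction, and the level count
# grows logarithmically with the data-to-memtable size ratio.
def leveled_write_amp(data_bytes, memtable_bytes, fanout=10):
    levels = math.ceil(math.log(data_bytes / memtable_bytes, fanout))
    return levels * fanout

print(leveled_write_amp(1e12, 64e6))  # 50 (1 TB store, 64 MB memtable)
```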

~~~
thesz
You are answering not my analysis, but your idea of my analysis. The decrease
in read time for the complex structure was 16-19% for A WHOLE SERVER
OPERATION, where BDB never amounted to more than 20% in the profile.

So I achieved a speedup of 5-20 times over BDB (albeit not the most recent
version, I admit).

Writing a scale-free graph of 1M nodes does not even finish with BDB in
reasonable time (and produces a HUGE log). For smaller graph sizes, the
difference in writes between BDB and my LSM implementation was on the order of
200-300 times, with LSM faster.

Finally, I admit your code is out there to test and use, while mine is closed
source and there are only my words. So I have to shut myself up and try to
recreate that wonderful thing for everyone to use.

