
Introduction to LSM Trees: May the Logs Be with You - priyankvex
https://priyankvex.wordpress.com/2019/04/28/introduction-to-lsm-trees-may-the-logs-be-with-you/
======
DocSavage
I found this older introduction to be pretty good and part of a series:
[https://medium.com/databasss/on-disk-io-part-3-lsm-trees-8b2...](https://medium.com/databasss/on-disk-io-part-3-lsm-trees-8b2da218496f)

Besides the use of LSM trees in RocksDB and leveldb-like databases, there is
also the WiscKey approach
([https://www.usenix.org/node/194425](https://www.usenix.org/node/194425))
that helps read/write amplification by keeping the LSM tree small and mostly
removing the values to a separate value log. There's a pure Go implementation
of the WiscKey approach used by dgraph: [https://github.com/dgraph-io/badger](https://github.com/dgraph-io/badger)

~~~
alexott
He (Alex Petrov) is also writing a book for O'Reilly on the database
internals: [https://www.goodreads.com/book/show/44647144-database-intern...](https://www.goodreads.com/book/show/44647144-database-internals)

~~~
dominotw
Weird that O'Reilly doesn't list it on their own website.

~~~
clumsysmurf
Seems like O'Reilly really goes out of their way to hide ALL books these days.
When you go to the main site
([https://www.oreilly.com/](https://www.oreilly.com/)), where do you see
anything related to books? All I see is "online learning", "blended courses",
"conferences" and "ideas".

I'm a bit upset by this, because I've found the Safari experience terrible.

------
lichtenberger
If you just need to fetch values by a key for the main storage (the key might even be as simple as one produced by a sequence generator), you can avoid the asynchronous background compaction overhead, and thus unpredictable read or write peaks, by hashing the keys if they aren't already integer/long based identifiers: basically storing a persistent hash array based trie (persistent both in the on-disk sense and in the functional sense of immutability). This can easily be extended to store a new revision through copy-on-write semantics. Instead of storing whole page snapshots, however, storage devices now permit fine-granular access to your data, so you can basically apply lessons learned from backup systems to version the data pages themselves, and even improve on that.

Disclaimer: I'm the author of a recent article about a free open source
storage system I'm maintaining, which versions data at its very core: "Why
Copy-on-Write Semantics and Node-Level-Versioning are key to Efficient
Snapshots": [https://hackernoon.com/sirix-io-why-copy-on-write-semantics-...](https://hackernoon.com/sirix-io-why-copy-on-write-semantics-and-node-level-versioning-are-key-to-efficient-snapshots-754ba834d3bb)

~~~
zzzcpan
You don't actually need to do asynchronous background compaction at all. You
can do compaction in small incremental steps without causing any spikes in
read or write latencies. Just spreading it across all writes gets you
slightly slower, but latency-capped, writes. It's unfortunate that LevelDB
popularized the compaction-in-a-thread idea. It's a pretty bad one.

~~~
lichtenberger
Good catch :-) Right, but merging/compaction work still has to be done, maybe
too much of it if you just need to fetch a value by its key, so that only an
equality scan is needed (no range scans or other comparisons). For the latter
case I've implemented an AVL tree, which is also versioned, stored in our
record pages, and best read fully in-memory (though it doesn't have to be).
There are certainly plenty of possible optimizations, for instance
spatio-temporal or full-text indexes, but I guess I'll first look into
cost-based rewrite rules for the query compiler and replication/partitioning
for horizontal scaling. Too many ideas, I guess ;-) but the best thing would
be a great open source community :-)

------
webshit155
I don't understand the hash index part. I guess that for every segment on
disk you also have a hash table for it, correct? Also, this part doesn't seem
to be very memory-efficient.

