
Log Structured Merge Trees - kushti
http://www.benstopford.com/2015/02/14/log-structured-merge-trees/
======
MaulingMonkey
>> Yahoo, for example, reports a steady progression from read-heavy to read-
write workloads, driven largely by the increased ingestion of event logs and
mobile data.

> Sorry if this is an ignorant question. Are these applications really that
> sensitive to write throughput? Event logs are only read out-of-band, so what's
> the rush? Seems like any benefits are offset by the GC stalls anyway.

There are two variables here: latency and volume.

I'm not sure how much of an issue latency is for event logs / analytics metrics
(RE: "what's the rush"), but if you've written e.g. Pokemon Go with several
million installs, there's a sheer volume of inbound data that - no matter how
little you care about latency - leaves you with one of two options: throttle
it and throw some of it away, or write something that can scale to handle that
volume.

And the first option isn't really an option: throwing away players' gameplay
progression will piss them off.

~~~
russell_sears
Hi. First author of the bLSM (Yahoo) paper here. You've pretty much nailed it:
The trick is getting latency and volume at the same time. If you're willing to
wait an hour to look at event logs, then Hadoop is good enough, but where's
the fun in that?

At the time, we were looking for the best of both worlds: Hadoop and log
processing give high throughput writes, but latencies are much too high to act
on user intent within a single session. We had a lot of applications in mind
that needed low latency access to event log data, but didn't have a suitable
storage engine to back them.

Existing LSM trees suffered from write stalls and excessive read
amplification. We were targeting hard disks back then, and didn't want to pay
for the extra seeks. We got random read access down to about one seek per
read, and figured out how to eliminate write stalls. Surprisingly, we still
ended up with good write throughput.

More importantly, from an application perspective, we provide efficient range
scans, which lets you reason about groups of data with matching or contiguous
keys. If you squint hard enough, this is all MapReduce really does, except it
precomputes all the answers up front, and we do it in real time. On the one
hand we have lower throughput, since we perform more random reads / sequential
writes, and also do more thread synchronization. On the other, we only perform
the computations for data that's actually being used for something, which
means we do a lot less work for the right applications.
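
To make the range scan idea concrete, here's a minimal Python sketch (not the
bLSM code; the runs and values are made up) of merging sorted runs so that
contiguous keys come back in order, with the newest version of a key winning:

    # Toy LSM range scan: k-way merge of sorted runs with a heap.
    # Newest run comes first; older versions of a duplicate key are skipped.
    import heapq

    def range_scan(runs, lo, hi):
        heap = []
        for age, run in enumerate(runs):          # age 0 = newest run
            it = iter([(k, v) for k, v in run if lo <= k < hi])
            first = next(it, None)
            if first:
                heap.append((first[0], age, first[1], it))
        heapq.heapify(heap)
        last_key = None
        while heap:
            key, age, val, it = heapq.heappop(heap)
            if key != last_key:                   # skip older versions of this key
                yield key, val
                last_key = key
            nxt = next(it, None)
            if nxt:
                heapq.heappush(heap, (nxt[0], age, nxt[1], it))

    memtable = [("b", 2), ("d", 40)]              # newest
    level_1  = [("a", 1), ("d", 4)]
    level_2  = [("c", 3), ("e", 5)]               # oldest
    print(list(range_scan([memtable, level_1, level_2], "a", "e")))
    # -> [('a', 1), ('b', 2), ('c', 3), ('d', 40)]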

Thanks!

------
cmrdporcupine
Why is sequential access still faster than random access on an SSD (the author
claims this)? I understand why for mechanical magnetic media, but I don't
understand why this is still a thing for SSDs. I remember reading about LSM a
long time ago and thinking that, with the advent of cheap SSDs, this kind of
optimization technique wouldn't be as relevant.

Is the advantage of sequential reading a product of the interface and protocol
used to speak to the SSD device? Or is it actually a product of the physical
media?

EDITED: (I had written "sequential slower than random"; I meant to write the
opposite.)

~~~
cmurphycode
Just to clarify, I think you meant to ask why random access is still slower
than sequential on an SSD.

Writes smaller than the ssd erase size are more expensive per byte because of
write amplification. This is not nearly as big of a factor as for disks, but
it does seem to matter even on enterprise flash, if you benchmark properly
(truly random and enough data so that it cannot be buffered by clever
controllers). And as my sibling comment mentioned, the write amplification
wastes flash lifetime, too!
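
As a rough back-of-the-envelope, with made-up sizes (check your device's actual
page and erase-block sizes):

    # Worst-case write amplification for small random writes (illustrative sizes).
    flash_page  = 4 * 1024      # smallest programmable unit
    erase_block = 256 * 1024    # smallest erasable unit (64 pages here)
    app_write   = 512           # one small random application write

    page_wa = flash_page / app_write              # 8.0: a 512 B write still programs a 4 KB page
    pages_per_block = erase_block // flash_page   # 64
    # Worst case for GC: a block with one stale page can only be reclaimed
    # after its other 63 live pages are copied elsewhere.
    gc_copy_worst = pages_per_block - 1
    print(page_wa, pages_per_block, gc_copy_worst)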

On enterprise flash, I've found that random reads top out at the same
bandwidth as sequential reads, but it takes a LOT of parallelism to get there,
so sequential reading, if possible, is still nice (let the controller do the
parallelism for you!)

~~~
russell_sears
There are two effects at play. If you're reading a multiple of a flash page at
a time, then you can get random bandwidth to match sequential. Under the hood,
decent SSDs actually round robin sequential reads across all of the underlying
hardware, so they can run as fast as whatever their bottleneck resource is.
The bottleneck could be error correction, the SATA link, or the channels
(copper traces) between the NAND and the controller, for example. Random reads
with enough parallelism give the SSD enough scheduling freedom to have the
same throughput as sequential.

The second effect comes from the fact that application data is rarely exactly
the same size as a flash page. If you group all the related data on one page,
then you don't waste time transferring bits you won't look at. B-Trees sort of
do this, but they leave holes in pages so they can support efficient inserts
(a good SSD or database will compress the holes, but that doesn't help random
reads much, since it's still going to transfer entire flash pages to the SSD
controller).
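
Putting rough, made-up numbers on the "bits you won't look at" point:

    # Read overhead when a point read must transfer a whole flash page (illustrative sizes).
    flash_page = 4 * 1024    # bytes moved per random read, at minimum
    record     = 100         # bytes the application actually wanted

    scattered = flash_page / record                            # ~41x if records land on random pages
    grouped   = flash_page / ((flash_page // record) * record) # ~1.02x if related records fill the page
    print(round(scattered, 1), round(grouped, 2))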

LSM-Trees push things a bit further, since new data ends up being grouped on
higher levels of the tree. When I was benchmarking bLSM's predecessor, I
worked out the expected throughput on the back of an envelope while I was
waiting for results. It did much better than expected, since the benchmark
heavily favored recently created data! YMMV, of course.

------
mutex007
I am curious how an inverted index would be implemented efficiently atop an LSM
tree. Maybe batch parts of the posting list under a numeric key, and when
performing an intersection between two lists, iterate using a range scan. It
could work, but I wonder how fast read operations would be. Has anybody
attempted this yet?

~~~
btrask
LSM trees are ideal for posting lists, because every time you add a new
document, you need to touch potentially hundreds of different lists. LSM trees
are able to batch all of these writes together. That's why Google originally
developed BigTable, AFAIK.
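
A minimal sketch of one possible keying scheme (a plain sorted list stands in
for the LSM store; this is illustrative only, not how BigTable or Lucene
actually do it):

    # Inverted index on an LSM-style sorted key space: keys are (term, doc_id),
    # so each term's posting list is one contiguous key range.
    import bisect

    keys = []   # sorted (term, doc_id) pairs; a real LSM tree keeps its runs sorted like this

    def add_document(doc_id, terms):
        # one small write per term; an LSM tree batches these in its memtable
        for term in set(terms):
            bisect.insort(keys, (term, doc_id))

    def postings(term):
        # "range scan": every key between (term, -inf) and (term, +inf)
        lo = bisect.bisect_left(keys, (term, float("-inf")))
        hi = bisect.bisect_right(keys, (term, float("inf")))
        return [doc_id for _, doc_id in keys[lo:hi]]

    def intersect(a, b):
        # merge-intersect two posting lists, both already in doc_id order
        pa, pb = postings(a), postings(b)
        i = j = 0
        out = []
        while i < len(pa) and j < len(pb):
            if pa[i] == pb[j]:
                out.append(pa[i]); i += 1; j += 1
            elif pa[i] < pb[j]:
                i += 1
            else:
                j += 1
        return out

    add_document(1, ["log", "structured", "merge"])
    add_document(2, ["log", "merge"])
    add_document(3, ["tree"])
    print(intersect("log", "merge"))   # -> [1, 2]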

~~~
teraflop
In fact, Lucene (the library behind Solr and Elasticsearch) is basically a
special-purpose inverted-index-as-LSM-tree database engine.

------
onetwotree
Thanks for this. I really feel like I understand the algorithm, and more
importantly, the tradeoffs involved, after reading that.

My company makes a product with a primary feature that's insanely read heavy
(something like a 10,000:1 read-to-write ratio) and a secondary feature that's
exactly the opposite. We just use Postgres for both (and it's obviously the
winner for the read-heavy workload), but I think I'll have to look more into
this stuff.

One interesting thing to note about our write heavy workload is that the reads
are usually in chronological order, with a filter. Anyone know what that
implies for LSM based stores?

~~~
thesz
Tune the LSM merge ratio parameter.

If the LSM merges levels at a ratio of 1:10 or more, the result is fewer
levels and faster reads; merges will happen quite often relative to reads. If
it merges levels at a ratio of 1:1.0001, there will be more levels and fewer
merges, so fewer delays in a write-heavy application.
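
Back-of-the-envelope, with made-up sizes (64 MB memtable, 1 TB on disk), to
show how the size ratio drives the level count:

    # Level count for a leveled LSM tree at different size ratios (illustrative).
    import math

    memtable = 64 * 2**20    # 64 MB in-memory buffer (assumption)
    total    = 1 * 2**40     # 1 TB of data on disk (assumption)

    for ratio in (10, 4, 2, 1.5):
        levels = math.ceil(math.log(total / memtable, ratio))
        # each extra level is roughly one more place a point read may have to look
        print(f"size ratio {ratio:>4}: ~{levels} levels")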

------
coleifer
I wrote some Python bindings to the SQLite4 LSM store:

[http://lsm-db.readthedocs.io](http://lsm-db.readthedocs.io)

------
serialx
It would be more interesting to see more modern approaches with KV stores
like ForestDB [1]. Couchbase has already replaced its storage engine with
ForestDB [2].

[1]:
[https://www.computer.org/csdl/trans/tc/preprint/07110563.pdf](https://www.computer.org/csdl/trans/tc/preprint/07110563.pdf)
[2]:
[https://github.com/couchbase/forestdb](https://github.com/couchbase/forestdb)

~~~
rdtsc
What's modern about it?

------
bogomipz
In the section on read workloads the author states:

"External file: leave the data as a log/heap and create a separate hash or
tree index into it."

I am confused by this, because a heap with a separate hash or tree index would
be a "clustered index" and not a heap at all, correct? Or am I
misunderstanding?

------
kristianp
Wow, that thin font weight is hard to read. In Chrome: hit F12, and in the
Styles tab change the font-weight to 400.

~~~
uuoc
I use this bookmarklet to fix pages that use a thin font weight that results
in hard-to-read text:

javascript:(function(){var%20all=document.getElementsByTagName("*");for%20(var%20i=0,max=all.length;i<max;i++){all[i].style.fontWeight='400';};void(0);})();

Fixes those broken pages to be quite readable.

