
Damn Cool Algorithms: Log structured storage (2009) - Tomte
http://blog.notdot.net/2009/12/Damn-Cool-Algorithms-Log-structured-storage
======
nostrademons
In the 8 years since this was written, Log-Structured Merge Trees (the
concrete realization of this idea) have basically "won". BigTable, AppEngine,
LevelDB, Cassandra, HBase, MongoDB, and several others are all built around
them.
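
For anyone who hasn't seen the idea in code, here is a minimal in-memory sketch of it. It is not how any of the systems above are implemented; `TinyLSM`, `memtable_limit` and the rest are made-up names, there are no deletes or tombstones, and the "runs" are plain Python lists standing in for immutable sorted files on disk.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical tuning knob
        self.memtable = {}
        self.runs = []  # newest run first; each run is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, never-modified run.
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest first, so the latest value wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for run in reversed(self.runs):  # oldest first, newer overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

Writes only ever touch the memtable or create a new run, which is the property that makes the on-disk version sequential-write friendly; reads pay for it by possibly checking several runs until a compaction merges them.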

There's a powerful hardware trend driving this, namely that disk capacities
and write bandwidth are still increasing rapidly, but seek times have
basically plateaued. That means that data structures that rely on append-only
operations can continue to scale to take advantage of bigger disks, but data
structures that rely on disk seeks (e.g. B-trees) have hit a bottleneck. Also,
as the number of cores continues to increase, playback & processing from a
sequential log can often be parallelized, but updating your on-disk indexes
blocks on I/O.
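
A rough sketch of the append-only side of that trade-off, assuming a made-up record format (two little-endian 32-bit lengths, then key and value bytes) and a hypothetical `AppendOnlyStore` class. Every write goes to the end of the file, and the in-memory index is rebuilt by playing the log back sequentially on startup.

```python
import os
import struct


class AppendOnlyStore:
    """Toy key-value store: writes are appended, index maps key -> record offset."""

    _HEADER = struct.Struct("<II")  # assumed layout: key length, value length

    def __init__(self, path):
        self._file = open(path, "a+b")  # append mode: writes always go to the end
        self._index = {}
        self._rebuild_index()

    def _rebuild_index(self):
        # Sequential playback of the log; later records for a key win.
        self._file.seek(0)
        offset = 0
        while True:
            header = self._file.read(self._HEADER.size)
            if len(header) < self._HEADER.size:
                break
            klen, vlen = self._HEADER.unpack(header)
            key = self._file.read(klen)
            self._file.seek(vlen, os.SEEK_CUR)  # skip the value
            self._index[key] = offset
            offset += self._HEADER.size + klen + vlen

    def put(self, key: bytes, value: bytes):
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(self._HEADER.pack(len(key), len(value)) + key + value)
        self._file.flush()
        self._index[key] = offset

    def get(self, key: bytes):
        offset = self._index.get(key)
        if offset is None:
            return None
        self._file.seek(offset)
        klen, vlen = self._HEADER.unpack(self._file.read(self._HEADER.size))
        self._file.seek(klen, os.SEEK_CUR)  # skip the key
        return self._file.read(vlen)

    def close(self):
        self._file.close()
```

Overwriting a key just appends a newer record, so the file grows until something compacts it. The sketch leaves compaction out, along with checksums and crash handling.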

~~~
gregw134
I suppose the follow-up question is, what will replace this once enterprise
SSDs start replacing spinning disks?

~~~
nostrademons
SSDs have even more unique access patterns - they give you (relatively) fast
seek times, there's no particular penalty for random-access vs. sequential
usage, but you really want to avoid writes, both to preserve the lifetime of
the drive and because mixing reads & writes with a SSD lowers performance.

In practice (and I haven't worked in a big corp for a couple years now), I
don't see SSDs replacing mechanical disks. Rather, they're being used for
different functions. Disks are becoming a data warehouse; they're the new
tape, but one that systems can write directly to, without needing scheduled
backups. Operational serving is moving toward SSDs and RAM, with a periodic
processing step that pulls data off disks and builds a pre-made shard that
gets written all at once to the SSD. LSM trees are well-adapted to the data-
warehousing part of this, which is why you continue to see big uptake of
BigTable/Cassandra/Mongo. On the SSD end...most people are using custom
algorithms & data structures, AFAIK, but you're seeing a resurgence of data
structures like perfect hashes, bloom filters, and plain old sorted arrays,
all of which have great read performance but can't be incrementally updated
without rebuilding the whole data structure.
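
To make the "pre-made shard" idea concrete, here is a small sketch of a build-once, read-only lookup structure: a sorted array searched with `bisect`, fronted by a Bloom filter. The class names, the 10-bits-per-key sizing and the SHA-256 double hashing are my own illustrative choices, not anything a particular system is known to use.

```python
import bisect
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; can report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class ReadOnlyShard:
    """Built once from a dict of bytes -> bytes; there is no update method."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]
        self._filter = BloomFilter(max(8, 10 * len(pairs)), num_hashes=4)
        for key in self._keys:
            self._filter.add(key)

    def get(self, key: bytes):
        if not self._filter.might_contain(key):
            return None  # definitely absent, no search needed
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None  # Bloom filter false positive
```

Changing the contents means constructing a new `ReadOnlyShard` from the full data set, which matches the periodic rebuild-and-swap step described above.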

~~~
mattb314
Do you think you could point me towards a source for "mixing reads & writes
with a SSD lowers performance"? This seems reasonable, but I've never actually
heard it before, and a cursory googling didn't turn much up. I have heard
before that small, random writes on SSDs are much more expensive than
sequential writes because random writes can produce write amplification and
increase fragmentation. It was my impression that SSDs stand to benefit
significantly from log structured filesystems despite the mixed read/write
load, but I don't follow it that closely and could be wrong.
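
A back-of-the-envelope way to see the write amplification part, under a deliberately simplified model that is not a description of any real drive's firmware: assume garbage collection reclaims erase blocks in which a fraction `u` of the pages is still valid and has to be copied elsewhere first. Freeing room for `1 - u` of a block of new data then costs a full block of flash writes, so the amplification factor is `1 / (1 - u)`. The function name is made up.

```python
def write_amplification(valid_fraction: float) -> float:
    """Flash bytes written per host byte written, under the simplified model above."""
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid_fraction)
```

Large sequential writes tend to invalidate whole blocks at once (`u` near 0, factor near 1), while small random overwrites leave blocks partly valid; at `u = 0.75` the model gives a factor of 4.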

~~~
noahdesu
[https://www.usenix.org/conference/atc14/technical-sessions/presentation/skourtis](https://www.usenix.org/conference/atc14/technical-sessions/presentation/skourtis)

~~~
mattb314
Thanks! This talk is pretty cool.

In case anyone else is interested, in addition to the above video, this [1]
paper goes into some of the details, including this explanation:

"Reads and writes on SSDs can interfere with each other for many reasons. (1)
Both operations share many critical resources, such as the ECC engines and the
lock-protected mapping table, etc. Parallel jobs accessing such resources need
to be serialized. (2) Both writes and reads can generate background operations
internally, such as readahead and asynchronous write-back [6]. (3) Mingled
reads and writes can foil certain internal optimizations. For example, flash
memory chips often provide a cache mode [2] to pipeline a sequence of reads or
writes. A flash memory plane has two registers, data register and cache
register. When handling a sequence of reads or writes, data can be transferred
between the cache register and the controller, while concurrently moving
another page between the flash medium and the data register. However, such
pipelined operations must be performed in one direction, so mingling reads and
writes would interrupt the pipelining."

Also relevant to the "seek" conversation below: it turns out many SSDs have
read-ahead caches built in, so sequential reads are much faster than random
reads after an initial warm-up, just like hard disks (the difference, however,
is closer to ~5x than to the 100x difference you see in HDDs).

[1]
[http://bit.csc.lsu.edu/~fchen/publications/papers/hpca11.pdf](http://bit.csc.lsu.edu/~fchen/publications/papers/hpca11.pdf)

------
no_protocol
I am impressed by the writing style. Very clear and delivers a solid
explanation.

What relation, if any, would this type of system have to "persistent data
structures", a term I have seen while browsing functional programming
topics? Is this somewhat like a persistent data structure until old parts are
overwritten ("garbage collected"?)? Is there a flavor of persistent data
structure similar to this?
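
To make the question concrete, this is the kind of thing I mean by a persistent data structure: a small sketch of an immutable binary search tree where an insert copies only the path from the root to the changed node and shares everything else. `Node`, `insert` and `lookup` are just illustrative names, and the tree is unbalanced for brevity.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int, value: str) -> Node:
    """Return a new root; the old root and its whole tree stay valid."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def lookup(root: Optional[Node], key: int) -> Optional[str]:
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None
```

Each "version" is just a root pointer, and old versions remain readable until nothing references them. That looks a lot like appending new nodes plus a new root to a log and reclaiming the superseded ones later.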

~~~
noahdesu
There is also a really cool paper on concurrency control for databases
implemented as log-structured storage:
[http://www.vldb.org/pvldb/vol4/p944-bernstein.pdf](http://www.vldb.org/pvldb/vol4/p944-bernstein.pdf)

------
timClicks
Slightly related perhaps, and something I have been curious about for a
while... event sourcing seems like a very powerful pattern, but I haven't seen
it widely adopted. The best documentation seems to be some MS dev library notes
and a discussion from M Fowler.

Are there any open source implementations of a database that uses event
sourcing?
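
For context, the pattern I have in mind boils down to something like this toy sketch: the append-only list of events is the source of truth, and current (or past) state is derived by replaying it. The bank-account events and the names `Event`, `apply_event` and `replay` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrew" in this toy example
    amount: int


def apply_event(balance: int, event: Event) -> int:
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events: List[Event], upto: Optional[int] = None) -> int:
    """Fold the first `upto` events (all of them if None) into a balance."""
    balance = 0
    for event in events[:upto]:
        balance = apply_event(balance, event)
    return balance
```

Nothing is ever updated in place: a correction is another event, and `replay(log, upto=n)` gives the state as of any earlier point.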

~~~
zshift
Check out the videos and articles by Greg Young on the topic. This one's a
good start:
[https://www.youtube.com/watch?v=8JKjvY4etTY](https://www.youtube.com/watch?v=8JKjvY4etTY)

------
d_t_w
Further/similar content by Ben Stopford, which I personally found to be very
high quality:

[http://www.benstopford.com/2015/02/14/log-structured-merge-trees/](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/)

------
mwcampbell
Another notable product based on log-structured storage is ObjectiveFS
([https://objectivefs.com/](https://objectivefs.com/)), which implements a
POSIX filesystem on top of Amazon S3 and other object stores. It's
proprietary, so I don't know much about how it works. But it claims to be a
log-structured filesystem.

