

SSTable and Log Structured Storage: LevelDB - igrigorik
http://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/

======
herf
Has anyone measured the "write amplification" rate of LevelDB? I have noticed
that small writes cause lots of disk writes (which is an issue on SSDs), but I
haven't measured it with actual numbers yet.

~~~
jemfinch
I'm not sure what you mean by write amplification. A small write (say,
updating a small key) will cause a single disk write for the journal. What
additional writes are you seeing?

~~~
snewman
Each new value is written to the journal. Later, when the journal is
compacted, the value will be written to an SSTable. Each subsequent compaction
involving that value causes the value to be written (copied) yet again. So
over time, a 100-byte value will consume k * 100 bytes of disk write bandwidth
(and (k-1) * 100 bytes of read bandwidth).

In principle, k can be arbitrarily large (over a sufficiently long time span).
The actual value depends on many factors. For instance, if your write rate is
high in comparison with the amount of memory available for the memtable, then
you will have more compactions. If data has a short half-life, then it will
not survive to participate in as many compactions.
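
A back-of-envelope illustration of that accounting (toy numbers; the value of
k below is a hypothetical, not something measured from LevelDB):

    VALUE_SIZE = 100   # bytes
    k = 6              # hypothetical total copies: journal append + compactions

    write_bytes = k * VALUE_SIZE        # every copy is a write
    read_bytes = (k - 1) * VALUE_SIZE   # the initial journal append needs no read

    print("amplification: %dx writes (%d bytes written, %d bytes read)"
          % (k, write_bytes, read_bytes))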

~~~
jemfinch
In principle, k is limited by O(log n), where n is the number of changes to
the database; it's not quite accurate to call it "arbitrarily large".
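
One way to see where the log comes from, assuming a leveled compaction scheme
with a fixed fanout (roughly how LevelDB lays out its levels; the constants
below are made up):

    import math

    def rewrite_bound(n_entries, fanout=10, memtable_entries=10_000):
        # Levels grow as log_fanout(n); a value may be rewritten up to ~fanout
        # times per level before being pushed down, so the total number of
        # copies is O(fanout * log_fanout(n)) = O(log n) for a fixed fanout.
        ratio = n_entries / memtable_entries
        levels = max(1, math.ceil(math.log(ratio, fanout)))
        return fanout * levels

    for n in (10**6, 10**8, 10**10):
        print(n, "entries -> k <=", rewrite_bound(n))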

~~~
leif
Counting bytes is the wrong thing to do on spinning disks. If you count disk
seeks there (as you should), LSM trees are a lot better than B-trees, but it
turns out you sacrifice more read performance than you need to.

On SSDs, you do want to count bytes, and in this case, LSM trees do exhibit
write amplification and it is a big deal. When your disks have a limited
lifetime, even doing 3x the writes you need to do hurts a lot.
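
To put rough numbers on that (the endurance rating and write volume below are
hypothetical, purely to show the arithmetic):

    DRIVE_ENDURANCE_TB = 600   # hypothetical SSD write-endurance rating
    APP_WRITES_TB_YEAR = 50    # hypothetical logical writes issued per year
    WRITE_AMPLIFICATION = 3    # extra copies made by the storage engine

    ideal = DRIVE_ENDURANCE_TB / APP_WRITES_TB_YEAR
    actual = ideal / WRITE_AMPLIFICATION
    print("drive lifetime: %.0f years instead of %.0f" % (actual, ideal))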

(Not trying to spam, I just don't want to type the same thing twice, but) I
explain this better here:
http://www.tokutek.com/2011/10/write-optimization-myths-comparison-clarifications-part-2/

------
dhruvbird
Is it not possible to re-write a B-Tree once it doubles in size?

For example, suppose you start off with a B-Tree of size 1000. Once it reaches
2000, re-write it; the next time it reaches 4000, re-write it again, then at
8000, and so on. This way, you get good query speeds even in the presence of
random inserts.

You need not do the whole re-write on the 2000'th or 4000'th insert. Instead,
you can start the re-writing process once the B-Tree is within O(n/log n)
inserts of its doubling point. This way, for each of those last O(n/log n)
inserts, we copy O(log n) values from the old tree to the new one, so the copy
finishes just as the tree doubles. When we are done, the new B-Tree has been
loaded with the values in almost sorted order!
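
A toy sketch of that amortized copy schedule (a sorted Python list stands in
for the on-disk B-Tree; the class name, copy budget, and lookup logic are
illustrative assumptions, not taken from any real engine):

    import bisect

    class RebuildingTree:
        def __init__(self, keys, inserts_until_double):
            self.old = sorted(keys)  # frozen snapshot of the existing tree
            self.new = []            # rewritten tree; receives all new keys
            self.copy_pos = 0        # next entry of `old` still to be copied
            # Copy this many old entries per insert so the copy finishes by
            # the time the structure has doubled in size (ceiling division).
            self.batch = -(-len(self.old) // inserts_until_double)

        def insert(self, key):
            bisect.insort(self.new, key)
            for _ in range(self.batch):  # the amortized copying step
                if self.copy_pos < len(self.old):
                    bisect.insort(self.new, self.old[self.copy_pos])
                    self.copy_pos += 1

        def __contains__(self, key):
            # Until the copy finishes, a lookup must consult both trees.
            i = bisect.bisect_left(self.new, key)
            if i < len(self.new) and self.new[i] == key:
                return True
            j = bisect.bisect_left(self.old, key, self.copy_pos)
            return j < len(self.old) and self.old[j] == key

    t = RebuildingTree(range(0, 1000, 2), inserts_until_double=500)
    for k in range(1, 1000, 2):
        t.insert(k)
    assert 2 in t and 3 in t  # old and new keys stay visible throughout

The point is only the schedule: the whole-tree copy is spread across the
inserts leading up to the doubling point instead of being done in one
stop-the-world pass.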

