
RocksDB – A persistent key-value store for fast storage environments - MadeInSyria
http://rocksdb.org/
======
snewman
Very nice work, and the wiki is also quite nice -- I wish more projects had a
page like [https://github.com/facebook/rocksdb/wiki/Rocksdb-Architectur...](https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide).
It's really nice to see a clear, terse summary of what makes this project
interesting relative to its predecessors.

At my company (scalyr.com), we've built a more-or-less clone of LevelDB in
Java, with a similar goal of extracting more performance on high-powered
servers (and better integration with our Java codebase). I'll be digging
through rocksdb to see what ideas we might borrow. A few things we've
implemented that might be interesting for rocksdb:

* The application can force segments to be split at specified keys. This is very helpful if you write a block of data all at once and then don't touch it for a long time. The initial memtable compaction places this data in its own segment and then we can push that segment down to the deepest level without ever compacting it again. It can also eliminate the need for bloom filters for many use cases, as you often wind up with only one segment overlapping a particular key range.

* The application can specify different compression schemes for different parts of the keyspace. This is useful if you are storing different kinds of data in the same database.

* We don't use timestamps anywhere other than the memtable. This puts some constraints on snapshot management, but streamlines get/scan operations and reduces file size for small values.
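As a rough sketch of the second point, the codec choice can be a range lookup keyed by the start key of each range; the class and enum names below are hypothetical illustration, not our actual code:

```cpp
#include <map>
#include <string>

// Hypothetical per-keyspace compression picker: each entry maps the first
// key of a range to the codec used from that key up to the next entry.
enum class CompressionType { kNone, kSnappy, kZlib };

class CompressionPicker {
 public:
  void SetRange(const std::string& start_key, CompressionType type) {
    ranges_[start_key] = type;
  }
  CompressionType ForKey(const std::string& key) const {
    // Find the last range whose start key is <= key.
    auto it = ranges_.upper_bound(key);
    if (it == ranges_.begin()) return CompressionType::kNone;
    return std::prev(it)->second;
  }
 private:
  std::map<std::string, CompressionType> ranges_;
};
```

So already-compressed blobs under one prefix can be stored raw while compressible text under another prefix gets zlib, all in the same database.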

Do you have benchmarks for scan performance? This is an important area for us.
I don't have exact figures handy, but we get something like 2GB/second (using
8 threads) on an EC2 h1.4xlarge, uncached (reading from SSD) and decompressing
on the fly. This is an area we've focused on.

I'd enjoy getting together to compare notes -- send me an e-mail if you're
interested. steve @ (the domain mentioned above).

~~~
hyc_symas
SkyDB using LMDB gets 3GB/sec on a standalone PC.
[https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1...](https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1X35alxcJ)

~~~
bjconlan
Wow, awesome link. LMDB always seems to fly under the radar; SkyDB+LMDB:
genius. (And written in Go! I'm sold... well, I will at least give it a bash.)

------
dhruba_b
Hi guys, I am Dhruba and I work in the Database Engineering team at Facebook.
We just released RocksDB as an open source project. If anybody has any
technical questions about RocksDB, please feel free to ask. Thanks.

~~~
jbapple
Hi Dhruba, thanks for volunteering to answer questions.

What are the big algorithmic ideas behind RocksDB?

My understanding is that LevelDB is based on log structured merge trees. These
can be deamortized using methods from Overmars's "The Design of Dynamic Data
Structures" or Bender et al.'s "Cache-Oblivious Streaming B-trees". How did
you reduce latency?

What else was slowing down databases larger than RAM? How did you fix that?

~~~
dhruba_b
RocksDB has an LSM architecture, similar in nature to HBase, leveldb, etc. But
the implementation is based on a theorem that we will be publishing shortly. I
am working on the theorem with a colleague of mine.

Cache-Oblivious B-trees is an interesting paper, and fractal trees are
similar. Most of them optimize the case when index nodes are not in memory.
However, in our use-cases, we typically configure the system in such a way
that most index nodes are in memory.

For an LSM database, the key component is "compaction". You can ingest data
only as fast as you can compact it; otherwise you get an unstable system.

1\. RocksDB replaced the level-style compaction of leveldb with
UniversalStyleCompaction, which has reduced write amplification. This boosts
performance.

2\. RocksDB implemented multi-threaded compaction, which means that parallel
compactions on different parts of the database can occur simultaneously. This
boosts performance.

3\. Bloom filters for range scans: these boost read performance.

4\. MergeType records, which allow higher-level objects (counters, lists) to
use a blind write instead of a read-modify-write. This improves performance.

5\. And many more...
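To illustrate point 4 with a toy sketch (the idea only, not RocksDB's actual merge API): a merge records an operand without reading the existing value, and pending operands are folded into the base value at read (or compaction) time.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy LSM-style counter store. Merge() records a delta without reading the
// existing value (a blind write); Get() folds pending deltas into the base.
class CounterStore {
 public:
  void Put(const std::string& key, uint64_t value) {
    base_[key] = value;
    pending_[key].clear();
  }
  // Blind write: no read of the current value is needed.
  void Merge(const std::string& key, uint64_t delta) {
    pending_[key].push_back(delta);
  }
  uint64_t Get(const std::string& key) {
    uint64_t v = base_.count(key) ? base_[key] : 0;
    for (uint64_t d : pending_[key]) v += d;  // fold operands, oldest first
    return v;
  }
 private:
  std::map<std::string, uint64_t> base_;
  std::map<std::string, std::vector<uint64_t>> pending_;
};
```

The win is that incrementing a counter never pays for a read; the fold cost is deferred to reads and compactions.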

~~~
jbapple
Can you share with us the statement of that theorem?

What is "UniversalStyleCompaction", and why is it capitalized and missing
spaces?

How does a Bloom filter for range scans work? Standard Bloom filters (as you
know) are for existence only.
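My best guess at how a range-scan filter could work: filter on a fixed-length key prefix rather than the whole key, so a scan bounded to a single prefix can skip any file whose filter rules that prefix out. A toy sketch under that assumption (hypothetical code, not RocksDB's):

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Sketch of a prefix Bloom filter: keys are added by their first kPrefixLen
// bytes, and a range scan over one prefix first checks MayContainPrefix()
// to skip files that cannot contain that prefix. False positives are
// possible; false negatives are not.
class PrefixBloom {
 public:
  static constexpr size_t kPrefixLen = 8;

  void AddKey(const std::string& key) { Set(key.substr(0, kPrefixLen)); }

  bool MayContainPrefix(const std::string& prefix) const {
    size_t h = std::hash<std::string>{}(prefix);
    return bits_[h % kBits] && bits_[(h * 0x9E3779B97F4A7C15ULL) % kBits];
  }

 private:
  static constexpr size_t kBits = 1 << 16;
  void Set(const std::string& prefix) {
    size_t h = std::hash<std::string>{}(prefix);
    bits_[h % kBits] = true;
    bits_[(h * 0x9E3779B97F4A7C15ULL) % kBits] = true;
  }
  std::bitset<kBits> bits_;
};
```

This only helps scans whose bounds stay inside one prefix (e.g. all keys for one user), which would explain why it's a special-cased feature rather than a general range filter.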

~~~
waffleclub
I'm guessing this may have been an early draft of some of the statements of
the theorem:

[http://webcache.googleusercontent.com/search?q=cache:fTxlRmb...](http://webcache.googleusercontent.com/search?q=cache:fTxlRmb9uMUJ:rocksdb.blogspot.com/+&cd=11&hl=en&ct=clnk&gl=us&client=firefox-a)

------
Patient0
I'm surprised that the C++ code is not using the RAII idiom in some obvious
places.

For example:
[https://github.com/facebook/rocksdb/blob/master/db/db_impl.c](https://github.com/facebook/rocksdb/blob/master/db/db_impl.c)

There are many places with bracketed calls to mutex_.Lock and mutex_.Unlock().

An example:

    
    
          mutex_.Unlock();
          LogFlush(options_.info_log);
          env_->SleepForMicroseconds(1000000);
          mutex_.Lock();
    

Why didn't the authors use the RAII idiom here? Even if there are no
exceptions expected, the code would still be simpler and less error prone by
using a guard object.

~~~
tsewlliw
fixed your link:
[https://github.com/facebook/rocksdb/blob/master/db/db_impl.c...](https://github.com/facebook/rocksdb/blob/master/db/db_impl.cc#L1665)

Take another look! There's a guard object used at the function scope to ensure
the lock is released, and this block is bracketed to _release_ and _reacquire_
the lock, not acquire and release. There may be a case for a guard object that
does the release/reacquire, but it's definitely not a slam dunk like
acquire/release.

~~~
phunge
Still, that's not exception-safe, correct? If LogFlush or SleepForMicroseconds
throws an exception the mutex will be unlocked twice, which pthreads disallows
for normal mutexes...

~~~
cbsmith
You know, for a second I thought you were wrong, but I changed my mind. This
_does_ look like a bug, and a simple one to avoid at that.

It's tough, because Rocks is still highly based on LevelDB, which conforms to
Google's coding style guideline, which makes RAII more than a bit tricky to do
right.

------
rdtsc
Well LevelDB is already good. And if this improves on it, that's great.

I was looking at embedded key value stores and also found -- HyperLevelDB
(from creators of Hyperdex database). They also improved on LevelDB in respect
to compaction and locking:

[http://hyperdex.org/performance/leveldb/](http://hyperdex.org/performance/leveldb/)

So now I am curious how it would compare.

Another interesting case, optimized for reads, is LMDB. That is a small but
very fast embedded database that sits at the core of OpenLDAP. It has
impressive benchmarks.

[http://symas.com/mdb/microbench/](http://symas.com/mdb/microbench/)

(Note: LMDB used to be called MDB, you might know it by that name).

~~~
AaronFriel
The LMDB statistics are very strange - why is synchronous SSD performance
_worse_ on most figures than HDD performance? Something seems very wrong with
these benchmarks:

    
    
        Section 5 (SSD) F (Synchronous Writes)
        
        Random Writes
        
        LevelDB              342 ops/sec	
        Kyoto TreeDB          67 ops/sec	
        SQLite3              114 ops/sec	
        MDB                  148 ops/sec	
        MDB, no MetaSync     322 ops/sec	
        BerkeleyDB           291 ops/sec	
        
        Section 8 (HDD) F (Synchronous Writes)
        
        Random Writes
        
        LevelDB             1291 ops/sec	
        Kyoto TreeDB          28 ops/sec	
        SQLite3              112 ops/sec	
        MDB                  297 ops/sec	
        BerkeleyDB           704 ops/sec	
        
    

Really? LevelDB is four times faster on an HDD than an SSD with synchronous
writes? BerkeleyDB is over twice as fast?

This smells.

~~~
hyc_symas
Keep in mind, the HDD was using ext2 and the SSD was using reiserfs.
Synchronous writes on ext2 are faster than all journaling filesystems.

~~~
jbellis
Not three orders of magnitude faster, which is the difference between hdd and
ssd random writes.

~~~
ithkuil
Three orders of magnitude faster would mean 1000x faster. You probably meant 3
times faster.

~~~
jbellis
SSDs really are 1000x faster at random writes (~200,000 iops vs ~200 iops)

------
gfodor
this is cool, though I'd wonder how it compares to Kyoto Cabinet. another big
issue I've run into personally is the fact that both LevelDB and KC don't
explicitly support multiple processes reading the db at once. (KC's API allows
this but advises against it, LevelDB afaik doesn't even allow it.) I wonder if
RocksDB gets past this.

~~~
hyc_symas
Kyoto Cabinet will self-corrupt if you use it that way. LMDB supports multi-
process explicitly.

~~~
gfodor
can you explain how this happens? if it's just a read-only process, how can it
corrupt anything?

------
_kst_
A very minor point:

The illustrative code snippet on the home page has a spurious semicolon on the
first line:

    
    
        #include <assert>;

~~~
jamesgpearce
fixed! - thanks

------
wbolster
The benchmark at [https://github.com/facebook/rocksdb/wiki/Performance-
Benchma...](https://github.com/facebook/rocksdb/wiki/Performance-
Benchmarks#2-bulk-load-of-keys-in-random-order) states that for LevelDB, "in
24 hours it inserted only 2 million key-values", and that "each key is of size
10 bytes, each value is of size 800 bytes".

I might be missing something, but that took just a few minutes on my ~2 year
old desktop machine. Sample code:
[https://gist.github.com/wbolster/7487225](https://gist.github.com/wbolster/7487225)

~~~
dhruba_b
There was a typo: the 2 million should have been 200 million keys. I fixed the
wiki page. Thanks again for pointing it out.

------
parshap
Node.js bindings (compatible with levelup) have already been released by
rvagg: [https://npmjs.org/package/rocksdb](https://npmjs.org/package/rocksdb)

------
canadi
Tnx for all the comments! Feel free to continue the discussion at
[https://www.facebook.com/groups/rocksdb.dev/](https://www.facebook.com/groups/rocksdb.dev/)

------
arthursilva
Looking forward to seeing this in Riak.

