At my company (scalyr.com), we've built a more-or-less clone of LevelDB in Java, with a similar goal of extracting more performance on high-powered servers (and better integration with our Java codebase). I'll be digging through rocksdb to see what ideas we might borrow. A few things we've implemented that might be interesting for rocksdb:
* The application can force segments to be split at specified keys. This is very helpful if you write a block of data all at once and then don't touch it for a long time. The initial memtable compaction places this data in its own segment and then we can push that segment down to the deepest level without ever compacting it again. It can also eliminate the need for bloom filters for many use cases, as you often wind up with only one segment overlapping a particular key range.
* The application can specify different compression schemes for different parts of the keyspace. This is useful if you are storing different kinds of data in the same database.
* We don't use timestamps anywhere other than the memtable. This puts some constraints on snapshot management, but streamlines get/scan operations and reduces file size for small values.
Do you have benchmarks for scan performance? This is an area we've focused on, and it's important for us. I don't have exact figures handy, but we get something like 2 GB/second (using 8 threads) on an EC2 h1.4xlarge, uncached (reading from SSD) and decompressing on the fly.
I'd enjoy getting together to compare notes -- send me an e-mail if you're interested. steve @ (the domain mentioned above).
1. RocksDB has a feature that allows an application to determine when to close a file (i.e. a segment). You can write your compaction code via the compaction_filter_factory defined in https://github.com/facebook/rocksdb/blob/master/include/rock...
2. RocksDB also has a feature that allows an application to close a block inside a segment. https://github.com/facebook/rocksdb/commit/fd075d6edd68ddbc1...
3. RocksDB has a feature to use different compression algorithms for different parts of the database. In level-style compaction, you can configure a different compression algorithm for each level. In universal-style compaction, you can specify that you want compression only for the earliest x% of the data in your database.
4. We have internal benchmarks for scan performance, but because of a lack of developer resources, we might not be able to open-source those numbers.
It will be great to catch up in person.
This sounds quite interesting; have you considered open-sourcing it?
We will probably at least publish a report describing the work in more detail, some time in the next few months, on our blog (http://blog.scalyr.com).
What are the big algorithmic ideas behind RocksDB?
My understanding is that LevelDB is based on log structured merge trees. These can be deamortized using methods from Overmars's "The Design of Dynamic Data Structures" or Bender et al.'s "Cache-Oblivious Streaming B-trees". How did you reduce latency?
What else was slowing down databases larger than RAM? How did you fix that?
Cache-Oblivious B-trees is an interesting paper, as are fractal trees. Most of these approaches optimize the case where index nodes are not in memory. However, in our use cases, we typically configure the system so that most index nodes are in memory.
For an LSM database, the key component is "compaction". You can ingest data only as fast as you can compact; otherwise you get an unstable system.
1. RocksDB replaced leveldb's level-style compaction with UniversalStyleCompaction, which has reduced write amplification. This boosts performance.
2. RocksDB implemented multi-threaded compaction, which means that parallel compactions on different parts of the database can occur simultaneously. This boosts performance.
3. Bloom filters for range scans: this boosts read performance.
4. Merge-type records, which allow higher-level objects (counters, lists) to use a blind write instead of a read-modify-write. This improves performance.
5. And many more...
What is "UniversalStyleCompaction", and why is it capitalized and missing spaces?
How does a Bloom filter for range scans work? Standard Bloom filters (as you know) are for existence only.
 - http://hyperdex.org/performance/leveldb/
One definite use-case for RocksDB is a message queue that has a high number of puts/gets/deletes.
Also, please look at options.h to determine which options to tune. An untuned system will not show much of a performance difference from leveldb.
It's appreciated when a project is on GitHub and open for issues/PRs, etc.
Also, there aren't JNI bindings... are there?
Thanks for the contribution. Just started using LevelDB on a project, but deployment will involve fast flash storage and rocksdb looks like a worthy successor.
Please check out Universal Compaction Style, multi-threaded compaction, and pipelined memtables.
I used to have JNI bindings that I pulled in from https://github.com/fusesource/leveldbjni but it was difficult for me to update the JNI every time we added new APIs to RocksDB. It would be great if somebody who needs Java support could implement JNI bindings for RocksDB.
For example: https://github.com/facebook/rocksdb/blob/master/db/db_impl.c
There are many places with bracketed calls to mutex_.Lock() and mutex_.Unlock().
Take another look! There's a guard object used at the function scope to ensure the lock is released, and this block is bracketed to release and reacquire the lock, not acquire and release it. There may be a case for a guard object that does the release/reacquire, but it's definitely not a slam dunk like acquire/release.
It's tough, because Rocks is still highly based on LevelDB, which conforms to Google's coding style guideline, which makes RAII more than a bit tricky to do right.
(Not db_impl.c but .cc.) I was wondering for a while whether it was possible to do RAII in idiomatic C99 -- and it appears it isn't (or doesn't make as much sense, anyway).
I was looking at embedded key-value stores and also found HyperLevelDB (from the creators of the HyperDex database). They also improved on LevelDB with respect to compaction and locking:
So now I am curious how it would compare.
Another interesting case, optimized for reads, is LMDB. It is a small but very fast embedded database that sits at the core of OpenLDAP. It has impressive benchmarks.
(Note: LMDB used to be called MDB, you might know it by that name).
Section 5 (SSD), Test F (Synchronous Writes):
  LevelDB            342 ops/sec
  Kyoto TreeDB        67 ops/sec
  SQLite3            114 ops/sec
  MDB                148 ops/sec
  MDB, no MetaSync   322 ops/sec
  BerkeleyDB         291 ops/sec
Section 8 (HDD), Test F (Synchronous Writes):
  LevelDB           1291 ops/sec
  Kyoto TreeDB        28 ops/sec
  SQLite3            112 ops/sec
  MDB                297 ops/sec
  BerkeleyDB         704 ops/sec
Could it be that most database engines are based on algorithms that were developed before SSDs were significant, and were extremely optimized for HDD performance?
If the cache was being used, the HDD results are not actually synchronous; a power-loss event would result in data loss.
LSMs are getting popular because it's harder to scale durable writes than reads, which can be handled (in many cases independently) by caching.
Going forward, power efficiency will still be crucial - for as long as civilization persists. But optimizing durable writes will be about as useful as optimizing delay loops.
Also, what about concurrent writes? Does it have a database-wide writer lock, or is it per key (per page?)?
It is a single-writer DB, one DB-wide writer lock. Fine-grained locking is a tar pit.
We (Tokutek) tried for a long time to get by with a big monolithic lock, and a) competing with InnoDB was really hard since they do concurrent writers really really well, and b) when we did decide to break up the lock, it wasn't as hard as we thought it would be and it worked really really well.
Don't get discouraged, break up that lock!
As always, you have to profile your workload and see where the delays and bottlenecks really are. Taking a single mutex instead of continuously locking/unlocking all over the place was a win for us.
I can see how the extra locking/concurrency code would expand the library size out of the CPU cache, though.
Most impressive about LMDB, to me, is the zero-copy model for readers: no extra memcpy is needed. Maybe that is obvious to database gurus, but I think it is a pretty clever trick.
The illustrative code snippet on the home page has a spurious semicolon on the first line:
I might be missing something, but that took just a few minutes on my ~2 year old desktop machine. Sample code: https://gist.github.com/wbolster/7487225