There seem to be three things they are doing to reach their higher performance:
1. Multi-threaded synchronized write ordering. I'm interested in the internal synchronization mechanism...
"LevelDB uses very coarse-grained synchronization which forces all writes to proceed in an ordered, first-come-first-served fashion, effectively reducing throughput to that of a single thread. HyperLevelDB increases concurrency by allowing multiple threads to agree on the order of their respective writes, and then independently apply the writes in a manner consistent with the agreed-upon order."
2. Tuned write delay on compaction. What external instrumentation/markers are they passing into hyperleveldb to tune write delay?
"HyperLevelDB removes this artificial delay, allowing the application (in our case, HyperDex) to independently decide to delay writes, using information available outside the scope of LevelDB."
3. Tuned intra-level re-writes.
"LevelDB's compaction algorithm is not efficient, and in the "fillrand" benchmark will, on average, rewrite 3MB of data in the upper level for every 1MB of data in the lower level. HyperLevelDB avoids this waste by selecting the compaction with the smallest overhead."
I'm the HyperDex developer who did the work on HyperLevelDB. I'll answer your questions in order as best I can.
1. Our internal synchronization mechanism is a simple change. The stock LevelDB does the following:
place our current write on the back of the wait_queue
wait for (our thread to be head of wait_queue || a thread ahead of us to do our work)
if work done: exit
possibly build a batch of our writes and those of the writers queued behind us
append the data to the log
insert the data into the memtable
signal the next writer, and any writer whose work we finished
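The steps above can be sketched roughly as follows. This is a simplified stand-in, not LevelDB's actual code: the log append and memtable insert are collapsed into a single vector append, and error handling is omitted.

```cpp
// Simplified sketch of LevelDB's cooperative writer queue: exactly one
// leader at a time does the log/memtable work, possibly batching the
// writes of threads queued behind it. Not the real implementation.
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Writer {
    std::string data;
    bool done = false;               // set if an earlier leader did our work
    std::condition_variable cv;
};

class WriteQueue {
public:
    void Write(const std::string& data) {
        Writer w;
        w.data = data;
        std::unique_lock<std::mutex> lock(mu_);
        queue_.push_back(&w);
        // Wait until we are head of the queue, or a thread ahead of us
        // finished our work as part of its batch.
        w.cv.wait(lock, [&] { return w.done || queue_.front() == &w; });
        if (w.done) return;          // batched by an earlier leader: exit
        // We are the leader: batch our write with those queued behind us.
        std::vector<Writer*> batch(queue_.begin(), queue_.end());
        lock.unlock();
        // "Append to the log / insert into the memtable" stands in here as
        // a single append; the leader stays at the queue head meanwhile,
        // so no second leader can run concurrently.
        for (Writer* bw : batch) log_.push_back(bw->data);
        lock.lock();
        // Pop everyone whose work we finished, wake them, then wake the
        // next leader if more writers arrived in the meantime.
        for (Writer* bw : batch) {
            queue_.pop_front();
            if (bw != &w) { bw->done = true; bw->cv.notify_one(); }
        }
        if (!queue_.empty()) queue_.front()->cv.notify_one();
    }
    std::vector<std::string> log_;   // stand-in for log + memtable
private:
    std::mutex mu_;
    std::deque<Writer*> queue_;
};
```

Note how the batching means throughput is that of one thread at a time: however many writers queue up, a single leader performs all of their log/memtable work serially.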
HyperLevelDB does this a little differently. We made the log and the memtable concurrent data structures, so that multiple threads can write to each at the same time. We then do a little synchronization to ensure that we don't reveal the writes to readers in the wrong order.
get a ticket, indicating the order of our writes
insert the data into the log
insert the data into the memtable
wait for writes with a lower ticket to complete
Effectively, this change moves from a model where there is exactly one writer at a time, to one where the bulk of the work (inserting into log/memtable) is done in parallel by writer threads.
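A rough sketch of the ticket scheme, under two loud assumptions: a plain mutex stands in for HyperLevelDB's concurrent log/memtable structures, and a spin-wait stands in for however the real code blocks on lower tickets.

```cpp
// Sketch (not HyperLevelDB's actual code) of ticket-ordered writes:
// the expensive insertion happens in parallel, and only the final
// "reveal to readers" step is ordered by ticket.
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class TicketedWriter {
public:
    void Write(const std::string& data) {
        // 1. Get a ticket fixing the order of our write.
        uint64_t ticket = next_ticket_.fetch_add(1);
        // 2-3. Insert into the log and memtable concurrently with other
        // writers. A mutex-guarded vector stands in for the concurrent
        // structures here; the point is that this work is not serialized
        // behind a single leader thread.
        {
            std::lock_guard<std::mutex> g(sink_mu_);
            sink_.push_back(data);
        }
        // 4. Wait for writes with a lower ticket to complete, then make
        // ours visible by advancing the visible sequence number.
        // (A real implementation would block instead of spinning.)
        uint64_t expected = ticket;
        while (!visible_.compare_exchange_weak(expected, ticket + 1)) {
            expected = ticket;  // retry until visible_ == our ticket
        }
    }
    uint64_t VisibleSeq() const { return visible_.load(); }
    size_t SinkSize() {
        std::lock_guard<std::mutex> g(sink_mu_);
        return sink_.size();
    }
private:
    std::atomic<uint64_t> next_ticket_{0};
    std::atomic<uint64_t> visible_{0};
    std::mutex sink_mu_;
    std::vector<std::string> sink_;
};
```

Readers would only consult `visible_` (the snapshot sequence), so a write that finished out of order stays invisible until every lower-ticketed write has completed.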
2. LevelDB provides a GetProperty call. We can inspect the number of files in Level-0 and back off where appropriate. There is no write delay in HyperLevelDB itself. By the end-to-end principle, the storage server is in a better position to decide whether to delay writes, or just keep pushing them into the database.
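An application-level back-off along these lines might look like the sketch below. Only the property name `leveldb.num-files-at-level0` is real LevelDB API; the thresholds and the ramp are illustrative, not HyperDex's actual policy.

```cpp
// Sketch of application-side write throttling driven by LevelDB's
// GetProperty interface. The policy itself is hypothetical.
#include <cassert>
#include <cstdlib>
#include <string>

// In a real application, num_l0_files would come from the database:
//   std::string v;
//   db->GetProperty("leveldb.num-files-at-level0", &v);
//   int num_l0_files = std::atoi(v.c_str());
int BackoffMillis(int num_l0_files) {
    const int kSlowdown = 8;   // start delaying writes (illustrative value)
    const int kStop = 12;      // stall writes entirely (illustrative value)
    if (num_l0_files < kSlowdown) return 0;
    if (num_l0_files >= kStop) return 100;        // long stall
    return (num_l0_files - kSlowdown + 1) * 10;   // ramp up gradually
}
```

The server polls the property periodically and sleeps for `BackoffMillis(...)` before issuing writes, which is the kind of decision the comment above argues belongs outside LevelDB.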
Nice to see. I played with LevelDB a while ago, and performance seemed great as long as you didn't want any kind of durability . . . but of course you would in real life, and as soon as you start forcing things to disk LevelDB's performance dropped to levels that were beatable by any number of alternatives with fewer dependencies. To say I was unimpressed would be an understatement. It's good to see someone bringing it to where it should have been already.
"as soon as you start forcing things to disk LevelDB's performance dropped to levels that were beatable by any number of alternatives with fewer dependencies"
What alternatives did you test that performed better than LevelDB?
The performance graphs look promising, but the second graph in the Parallelism section seems suspect to me. Why didn't they plot the graph out to 8 threads, like they did with the first graph?
Edit:
I don't mean to sound snarky, I'm definitely impressed.
I've not fully compared the code, but at first glance, they're choosing different constants for compaction (such as level size and the ratio of compacted-to-uncompacted data) and have done away with the non-overlapping invariant for the first few levels of the tree. I've not benchmarked their code, so I don't know how effective this strategy is.
What I can see from their code is that they rely on a compaction strategy that is very similar to stock LevelDB (in terms of selecting the next uncompacted SSTs within a level), that they haven't done anything to improve multi-threaded performance, and that they've invested a lot of work into computing if/when writes should be delayed within the LevelDB code. We've drastically changed the compaction strategy, begun to improve concurrency (I know for a fact that we have opportunities to improve it further), and we believe that the storage layer is not the correct level at which to throttle writes, so we've removed the code which does so.
I guess the most fair thing to say is that we've taken complementary approaches, and everything in either approach should be portable to the other.
As the person doing the github:basho/leveldb work, I agree with the above (very professional reply, not something you expect to see on the Internet, thank you). We are optimizing to our individual environments. I do need to review the compaction algorithms to see if there is benefit to Basho. However, it will be a few weeks before that happens.
The write multi-threading does not help Basho's Riak 1.x series. We parallelize by using multiple leveldb databases (vnodes, in Riak terms) and do NOT parallelize individual databases until 2.0. It would therefore be unfair to measure that feature against Riak.
Our compaction code is adjusted with an emphasis on running multiple compactions of varied priority. Our use of multiple databases got hung up on leveldb's single compaction thread. Again, this complete difference of environments would make a direct comparison with hyperleveldb unfair.
And the most difficult issue for dropping hyper into Riak is that our leveldb performs the write throttling, while hyperleveldb leaves that to HyperDex. This is yet again an environment design decision ... but it means coding is required to make hyperleveldb "just work" with Riak. That will be a while.
I therefore do not claim that Basho's leveldb would be better with HyperDex and suspect that today's hyperleveldb would not be better with Basho's Riak. We optimized to our different pain points.
I would like to see a benchmark showing if HyperLevelDB's changes made compaction slower or faster. (This is important because LevelDB's bottleneck is commonly compaction.)
Under the compaction section, a chart does show a dramatic improvement on the fillrand benchmark. But since HyperLevelDB removes LevelDB's write delay when compaction falls behind, writes in HyperLevelDB will continue at full speed even as compaction falls extremely far behind. So, yes, the data for the benchmark was technically written into the db, but at the cost of making reads insanely expensive.
Most reads need to touch every level-0 file. If compaction moving data out of level-0 fell 10GB behind, then there would be hundreds of level-0 files, and most reads would need to check all of them.
If the benchmark had mixed in one random read for every random write, I assume HyperLevelDB would have been dramatically slower than LevelDB.
If the changes in what to pick for compaction made HyperLevelDB non-trivially faster, that would be quite interesting, but I can't tell whether they did.
I'm glad someone is addressing these issues. I think I'm not the only person who has had bulk-write and compaction-related performance problems with LevelDB. I wonder if there is any chance the changes from this fork will be merged back into the LevelDB codebase.
I can answer the opposite question: we intend to keep in sync with upstream. Since HyperLevelDB is a drop-in replacement for LevelDB, it should be fairly easy for LevelDB users to pick it up.
We will try to push upstream those pieces that upstream would need. Some pieces, such as the autoconfiscated build system are likely not of interest to them, but of interest to us.
We're doing our best to keep API compatibility with LevelDB and will do our best to keep true to LevelDB. That being said, it'd be great to see our changes flow upstream.
I am surprised they are using LevelDB for a key/value store when they could have used a DBM variant like Kyoto Cabinet HashDB, which is way faster for such a task. Why do you need sorted keys for key/value storage?
Sorted keys are extremely useful for time series data. For example, if a key has a timestamp in it and you'd like to do an aggregation over a few days of "keys"...it's very simple to do an iteration from a STARTKEY to a STOPKEY (the keys in-between don't have to be defined...which is very important). Not only is it simple to do the iteration, it's very fast. This kind of use case (very common) is difficult without sorted keys and iteration capabilities.
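The STARTKEY-to-STOPKEY pattern is easy to sketch. Here `std::map` stands in for the database (like LevelDB, it keeps keys sorted); with a real LevelDB handle the same shape uses `db->NewIterator(...)`, `it->Seek(start)`, `it->Valid()`, and `it->Next()`.

```cpp
// Range iteration over sorted keys: collect every key in [start, stop).
// std::map::lower_bound plays the role of LevelDB's Iterator::Seek.
#include <cassert>
#include <map>
#include <string>
#include <vector>

std::vector<std::string> RangeScan(const std::map<std::string, std::string>& db,
                                   const std::string& start,
                                   const std::string& stop) {
    std::vector<std::string> out;
    // The keys between start and stop need not be known in advance;
    // we simply walk the sorted order until we pass the stop key.
    for (auto it = db.lower_bound(start); it != db.end() && it->first < stop; ++it) {
        out.push_back(it->first);
    }
    return out;
}
```

With timestamp-prefixed keys, a few days of data is one cheap sequential scan; a hash-based store would have to examine every key to answer the same query.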
Searching is not the same thing as a range iteration; normally, if you want to search a key/value store, you'll need to scan the entire dataset.
A range iteration allows you to scan a subset of the data, and as long as they didn't fundamentally change how LevelDB works, then they will still support ranges.
I guess it depends on your use case, most of the use cases I've seen have been time series using range iterations which is incredibly fast, but I understand your concern if you're only using it for random gets.
hash indexes tend to have really terrible write performance because the locations of the writes on disk are random. lsm trees have way better write performance.
I'd be interested to see the I/O with the same disk system. Right now the comparison is between a 500GB HD and an SSD. I would assume this would skew results quite a bit.
Please look at the graphs again. Each comparison is done twice: once on hdd and once on ssd. This gives an apples-to-apples comparison for each platform.