There seem to be three things they are doing to reach their higher performance:
1. Multi-threaded synchronized write ordering. I'm interested in the internal synchronization mechanism...
"LevelDB uses very coarse-grained synchronization which forces all writes to proceed in an ordered, first-come-first-served fashion, effectively reducing throughput to that of a single thread. HyperLevelDB increases concurrency by allowing multiple threads to agree on the order of their respective writes, and then independently apply the writes in a manner consistent with the agreed-upon order."
2. Tuned write delay on compaction. What external instrumentation/markers are they passing into hyperleveldb to tune write delay?
"HyperLevelDB removes this artificial delay, allowing the application (in our case, HyperDex) to independently decide to delay writes, using information available outside the scope of LevelDB."
3. Tuned intra-level re-writes.
"LevelDB's compaction algorithm is not efficient, and in the "fillrand" benchmark will, on average, rewrite 3MB of data in the upper level for every 1MB of data in the lower level. HyperLevelDB avoids this waste by selecting the compaction with the smallest overhead."
I'm the HyperDex developer who did the work on HyperLevelDB. I'll answer your questions in order as best I can.
1. Our internal synchronization mechanism is a simple change. The stock LevelDB does the following:
place our current write on the back of the wait_queue
wait for (our thread to be head of wait_queue || a thread ahead of us to do our work)
if work done: exit
possibly build a batch of our writes and those of the writers queued behind us
append the data to the log
insert the data into the memtable
signal the next writer, and any writer whose work we finished
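The steps above can be sketched roughly as follows. This is a simplified stand-in, not LevelDB's actual code: the log append and memtable insert are collapsed into a single vector append, and error handling is omitted.

```cpp
// Simplified sketch of LevelDB's cooperative writer queue: exactly one
// leader at a time does the log/memtable work, possibly batching the
// writes of threads queued behind it. Not the real implementation.
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Writer {
    std::string data;
    bool done = false;               // set if an earlier leader did our work
    std::condition_variable cv;
};

class WriteQueue {
public:
    void Write(const std::string& data) {
        Writer w;
        w.data = data;
        std::unique_lock<std::mutex> lock(mu_);
        queue_.push_back(&w);
        // Wait until we are head of the queue, or a thread ahead of us
        // finished our work as part of its batch.
        w.cv.wait(lock, [&] { return w.done || queue_.front() == &w; });
        if (w.done) return;          // batched by an earlier leader: exit
        // We are the leader: batch our write with those queued behind us.
        std::vector<Writer*> batch(queue_.begin(), queue_.end());
        lock.unlock();
        // "Append to the log / insert into the memtable" stands in here as
        // a single append; the leader stays at the queue head meanwhile,
        // so no second leader can run concurrently.
        for (Writer* bw : batch) log_.push_back(bw->data);
        lock.lock();
        // Pop everyone whose work we finished, wake them, then wake the
        // next leader if more writers arrived in the meantime.
        for (Writer* bw : batch) {
            queue_.pop_front();
            if (bw != &w) { bw->done = true; bw->cv.notify_one(); }
        }
        if (!queue_.empty()) queue_.front()->cv.notify_one();
    }
    std::vector<std::string> log_;   // stand-in for log + memtable
private:
    std::mutex mu_;
    std::deque<Writer*> queue_;
};
```

Note how the batching means throughput is that of one thread at a time: however many writers queue up, a single leader performs all of their log/memtable work serially.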
HyperLevelDB does this a little differently. We made the log and the memtable concurrent data structures, so that multiple threads can write to each at the same time. We then do a little synchronization to ensure that we don't reveal the writes to readers in the wrong order.
get a ticket, indicating the order of our writes
insert the data into the log
insert the data into the memtable
wait for writes with a lower ticket to complete
Effectively, this change moves from a model where there is exactly one writer at a time, to one where the bulk of the work (inserting into log/memtable) is done in parallel by writer threads.
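A rough sketch of the ticket scheme, under two loud assumptions: a plain mutex stands in for HyperLevelDB's concurrent log/memtable structures, and a spin-wait stands in for however the real code blocks on lower tickets.

```cpp
// Sketch (not HyperLevelDB's actual code) of ticket-ordered writes:
// the expensive insertion happens in parallel, and only the final
// "reveal to readers" step is ordered by ticket.
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class TicketedWriter {
public:
    void Write(const std::string& data) {
        // 1. Get a ticket fixing the order of our write.
        uint64_t ticket = next_ticket_.fetch_add(1);
        // 2-3. Insert into the log and memtable concurrently with other
        // writers. A mutex-guarded vector stands in for the concurrent
        // structures here; the point is that this work is not serialized
        // behind a single leader thread.
        {
            std::lock_guard<std::mutex> g(sink_mu_);
            sink_.push_back(data);
        }
        // 4. Wait for writes with a lower ticket to complete, then make
        // ours visible by advancing the visible sequence number.
        // (A real implementation would block instead of spinning.)
        uint64_t expected = ticket;
        while (!visible_.compare_exchange_weak(expected, ticket + 1)) {
            expected = ticket;  // retry until visible_ == our ticket
        }
    }
    uint64_t VisibleSeq() const { return visible_.load(); }
    size_t SinkSize() {
        std::lock_guard<std::mutex> g(sink_mu_);
        return sink_.size();
    }
private:
    std::atomic<uint64_t> next_ticket_{0};
    std::atomic<uint64_t> visible_{0};
    std::mutex sink_mu_;
    std::vector<std::string> sink_;
};
```

Readers would only consult `visible_` (the snapshot sequence), so a write that finished out of order stays invisible until every lower-ticketed write has completed.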
2. LevelDB provides a GetProperty call. We can inspect the number of files in Level-0 and back off where appropriate. There is no write delay in HyperLevelDB itself. By the end-to-end principle, the storage server is in a better position to decide whether to delay writes, or just keep pushing them into the database.
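An application-level back-off along these lines might look like the sketch below. Only the property name `leveldb.num-files-at-level0` is real LevelDB API; the thresholds and the ramp are illustrative, not HyperDex's actual policy.

```cpp
// Sketch of application-side write throttling driven by LevelDB's
// GetProperty interface. The policy itself is hypothetical.
#include <cassert>
#include <cstdlib>
#include <string>

// In a real application, num_l0_files would come from the database:
//   std::string v;
//   db->GetProperty("leveldb.num-files-at-level0", &v);
//   int num_l0_files = std::atoi(v.c_str());
int BackoffMillis(int num_l0_files) {
    const int kSlowdown = 8;   // start delaying writes (illustrative value)
    const int kStop = 12;      // stall writes entirely (illustrative value)
    if (num_l0_files < kSlowdown) return 0;
    if (num_l0_files >= kStop) return 100;        // long stall
    return (num_l0_files - kSlowdown + 1) * 10;   // ramp up gradually
}
```

The server polls the property periodically and sleeps for `BackoffMillis(...)` before issuing writes, which is the kind of decision the comment above argues belongs outside LevelDB.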
Nice to see. I played with LevelDB a while ago, and performance seemed great as long as you didn't want any kind of durability . . . but of course you would in real life, and as soon as you start forcing things to disk LevelDB's performance dropped to levels that were beatable by any number of alternatives with fewer dependencies. To say I was unimpressed would be an understatement. It's good to see someone bringing it to where it should have been already.
"as soon as you start forcing things to disk LevelDB's performance dropped to levels that were beatable by any number of alternatives with fewer dependencies"
What alternatives did you test that performed better than LevelDB?
The performance graphs look promising, but the second graph in the Parallelism section seems suspect to me. Why didn't they plot the graph out to 8 threads, like they did with the first graph?
Edit:
I don't mean to sound snarky, I'm definitely impressed.
I've not fully compared the code, but at first glance, they're choosing different constants for compaction (such as level size and the ratio of compacted-to-uncompacted data) and have done away with the non-overlapping invariant for the first few levels of the tree. I've not benchmarked their code, so I don't know how effective this strategy is.
What I can see from their code is that they rely on a compaction strategy that is very similar to stock LevelDB (in terms of selecting the next uncompacted SSTs within a level), that they haven't done anything to improve multi-threaded performance, and that they've invested a lot of work into computing if/when writes should be delayed within the LevelDB code. We've drastically changed the compaction strategy, begun to improve concurrency (I know for a fact that we have opportunities to improve it further), and we believe that the storage layer is not the correct level at which to throttle writes, so we've removed the code which does so.
I guess the most fair thing to say is that we've taken complementary approaches, and everything in either approach should be portable to the other.
As the person doing the github:basho/leveldb work, I agree with the above (very professional reply, not something you expect to see on the Internet, thank you). We are optimizing to our individual environments. I do need to review the compaction algorithms to see if there is benefit to Basho. However, it will be a few weeks before that happens.
The write multi-threading does not help Basho's Riak 1.x series. We parallelize by using multiple leveldb databases (vnodes, in Riak terms) and do NOT parallelize individual databases until 2.0. It would therefore be unfair to measure that feature against Riak.
Our compaction code is adjusted with an emphasis on running multiple compactions of varied priority. Our use of multiple databases got hung up on leveldb's single compaction thread. Again, this complete difference of environments would make a direct comparison with hyperleveldb unfair.
And the most difficult issue for dropping hyper into Riak is that our leveldb performs the write throttling, while hyperleveldb leaves that to HyperDex. This is yet again an environment design decision ... but it means coding is required to make hyperleveldb "just work" with Riak. That will be a while.
I therefore do not claim that Basho's leveldb would be better with HyperDex and suspect that today's hyperleveldb would not be better with Basho's Riak. We optimized to our different pain points.
I would like to see a benchmark showing if HyperLevelDB's changes made compaction slower or faster. (This is important because LevelDB's bottleneck is commonly compaction.)
Under the compaction section, a chart does show a dramatic improvement on the fillrand benchmark. But since HyperLevelDB removes LevelDB's write delay when compaction falls behind, writes in HyperLevelDB will continue at full speed even as compaction falls extremely far behind. So, yes, the data for the benchmark was technically written into the db, but at the cost of making reads insanely expensive.
Most reads need to touch every level-0 file. If compaction moving data out of level-0 fell 10GB behind, then there would be hundreds of level-0 files, and most reads would need to check all of them.
If the benchmark had mixed in one random read for every random write, I assume HyperLevelDB would have been dramatically slower than LevelDB.
If the changes in what to pick for compaction made HyperLevelDB non-trivially faster, that would be quite interesting, but I can't tell whether they did.
I'm glad someone is addressing these issues. I think I'm not the only person who has had bulk-write and compaction-related performance problems with LevelDB. I wonder if there is any chance the changes from this fork will be merged back into the LevelDB codebase.
I can answer the opposite question: we intend to keep in sync with upstream. Since HyperLevelDB is a drop-in replacement for LevelDB, it should be fairly easy for LevelDB users to pick it up.
We will try to push upstream those pieces that upstream would need. Some pieces, such as the autoconfiscated build system are likely not of interest to them, but of interest to us.
We're doing our best to keep API compatibility with LevelDB and will do our best to keep true to LevelDB. That being said, it'd be great to see our changes flow upstream.
I am surprised they are using LevelDB for a key/value store when they could have used a DBM variant like Kyoto Cabinet HashDB, which is way faster for such a task. Why do you need sorted keys for key/value storage?
Sorted keys are extremely useful for time series data. For example, if a key has a timestamp in it and you'd like to do an aggregation over a few days of "keys"...it's very simple to do an iteration from a STARTKEY to a STOPKEY (the keys in-between don't have to be defined...which is very important). Not only is it simple to do the iteration, it's very fast. This kind of use case (very common) is difficult without sorted keys and iteration capabilities.
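The STARTKEY-to-STOPKEY pattern is easy to sketch. Here `std::map` stands in for the database (like LevelDB, it keeps keys sorted); with a real LevelDB handle the same shape uses `db->NewIterator(...)`, `it->Seek(start)`, `it->Valid()`, and `it->Next()`.

```cpp
// Range iteration over sorted keys: collect every key in [start, stop).
// std::map::lower_bound plays the role of LevelDB's Iterator::Seek.
#include <cassert>
#include <map>
#include <string>
#include <vector>

std::vector<std::string> RangeScan(const std::map<std::string, std::string>& db,
                                   const std::string& start,
                                   const std::string& stop) {
    std::vector<std::string> out;
    // The keys between start and stop need not be known in advance;
    // we simply walk the sorted order until we pass the stop key.
    for (auto it = db.lower_bound(start); it != db.end() && it->first < stop; ++it) {
        out.push_back(it->first);
    }
    return out;
}
```

With timestamp-prefixed keys, a few days of data is one cheap sequential scan; a hash-based store would have to examine every key to answer the same query.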
Searching is not the same thing as a range iteration; normally, if you want to search a key/value store, you'll need to scan the entire dataset.
A range iteration allows you to scan a subset of the data, and as long as they didn't fundamentally change how LevelDB works, then they will still support ranges.
I guess it depends on your use case, most of the use cases I've seen have been time series using range iterations which is incredibly fast, but I understand your concern if you're only using it for random gets.
hash indexes tend to have really terrible write performance because the locations of the writes on disk are random. lsm trees have way better write performance.
I'd be interested to see the I/O with the same disk system. Right now the comparison is between a 500GB HD and an SSD. I would assume this would skew results quite a bit.
Please look at the graphs again. Each comparison is done twice: once on hdd and once on ssd. This gives an apples-to-apples comparison for each platform.