
HyperLevelDB: A High-Performance LevelDB Fork - organicdeadbeef
http://hyperdex.org/performance/leveldb/
======
siculars
There seem to be three things they are doing to reach their higher
performance:

1\. Multi-threaded synchronized write ordering. I'm interested in the internal
synchronization mechanism...

"LevelDB uses very coarse-grained synchronization which forces all writes to
proceed in an ordered, first-come-first-served fashion, effectively reducing
throughput to that of a single thread. HyperLevelDB increases concurrency by
allowing multiple threads to agree on the order of their respective writes,
and then independently apply the writes in a manner consistent with the
agreed-upon order."

2\. Tuned write delay on compaction. What external instrumentation/markers are
they passing into hyperleveldb to tune write delay?

"HyperLevelDB removes this artificial delay, allowing the application (in our
case, HyperDex) to independently decide to delay writes, using information
available outside the scope of LevelDB."

3\. Tuned intra-level re-writes.

"LevelDB's compaction algorithm is not efficient, and in the "fillrand"
benchmark will, on average, rewrite 3MB of data in the upper level for every
1MB of data in the lower level. HyperLevelDB avoids this waste by selecting
the compaction with the smallest overhead."
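
As a rough illustration of what "smallest overhead" could mean (this is a
hypothetical sketch, not HyperLevelDB's actual code), one can rank candidate
compactions by how many bytes of the next level they would rewrite per byte
compacted, and pick the cheapest:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of "select the compaction with the smallest
// overhead": each candidate is an SST in level L together with the
// bytes it overlaps in level L+1. Struct and field names are
// illustrative, not HyperLevelDB's.
struct Candidate {
  long file_bytes;     // size of the level-L file to compact
  long overlap_bytes;  // level-L+1 data that would be rewritten
};

// Return the index of the candidate with the lowest rewrite ratio
// (bytes rewritten in the next level per byte of compacted data).
std::size_t PickCheapestCompaction(const std::vector<Candidate>& cands) {
  std::size_t best = 0;
  double best_ratio = 1e300;
  for (std::size_t i = 0; i < cands.size(); ++i) {
    double ratio =
        static_cast<double>(cands[i].overlap_bytes) / cands[i].file_bytes;
    if (ratio < best_ratio) {
      best_ratio = ratio;
      best = i;
    }
  }
  return best;
}
```

Under the 3:1 fillrand average quoted above, any candidate closer to a 1:1
ratio would be preferred.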

~~~
rescrv
I'm the HyperDex developer who did the work on HyperLevelDB. I'll answer your
questions in order as best I can.

1\. Our internal synchronization mechanism is a simple change. The stock
LevelDB does the following:

    
    
        place our write on the back of wait_queue
        wait for (our thread to be head of wait_queue || thread ahead of us to do our work)
        if work done: exit
        possibly build a batch of our writes and append data to the log
        insert data into the memtable
        signal the next writer, and any writer whose work we finished
    

HyperLevelDB does this a little differently. We made the log and the memtable
concurrent datastructures, so that multiple threads can write to each one at a
time. We then do a little synchronization to ensure that we don't reveal the
writes to readers in the wrong order.

    
    
        get a ticket, indicating the order of our writes
        insert the data into the log
        insert the data into the memtable
        wait for writes with a lower ticket to complete
    

For the actual implementations, check out the code for LevelDB (Lines
1135-1196 of
[https://github.com/rescrv/HyperLevelDB/blob/28dad918f2ffb80f...](https://github.com/rescrv/HyperLevelDB/blob/28dad918f2ffb80fd70110ed5cd47744339649f2/db/db_impl.cc))
and HyperLevelDB (Lines 1307-1428 of
[https://github.com/rescrv/HyperLevelDB/blob/master/db/db_imp...](https://github.com/rescrv/HyperLevelDB/blob/master/db/db_impl.cc)).

Effectively, this change moves from a model where there is exactly one writer
at a time, to one where the bulk of the work (inserting into log/memtable) is
done in parallel by writer threads.
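
The ticket scheme above can be sketched in self-contained C++. This is
illustrative only, not HyperLevelDB's implementation: a mutex-protected
vector stands in for the concurrent log/memtable, and a condition variable
enforces that writes become visible strictly in ticket order.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative sketch of ticket-ordered writes (not HyperLevelDB's
// actual code): threads insert into a shared structure in parallel,
// then publish a "visible" sequence number strictly in ticket order,
// so readers never observe writes out of order.
class TicketedLog {
 public:
  void Write(int value) {
    // Agree on an order: one atomic increment, no global write lock.
    std::uint64_t ticket = next_ticket_.fetch_add(1);

    {  // Insert into the log/memtable stand-in. In HyperLevelDB these
       // are concurrent structures; here, a mutex-protected vector.
      std::lock_guard<std::mutex> g(data_mu_);
      entries_.push_back(value);
    }

    // Wait for all writes with a lower ticket to complete, then
    // reveal our own write to readers.
    std::unique_lock<std::mutex> lk(seq_mu_);
    cv_.wait(lk, [&] { return visible_ == ticket; });
    visible_ = ticket + 1;
    cv_.notify_all();
  }

  std::uint64_t VisibleCount() {
    std::lock_guard<std::mutex> g(seq_mu_);
    return visible_;
  }

  std::size_t Size() {
    std::lock_guard<std::mutex> g(data_mu_);
    return entries_.size();
  }

 private:
  std::atomic<std::uint64_t> next_ticket_{0};
  std::mutex data_mu_, seq_mu_;
  std::condition_variable cv_;
  std::vector<int> entries_;
  std::uint64_t visible_ = 0;  // all tickets < visible_ are readable
};
```

The insert itself runs in parallel across writer threads; only the cheap
"reveal" step at the end is serialized, which is the point of the change.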

2\. LevelDB provides a GetProperty call. We can inspect the number of files in
Level-0 and back off where appropriate. There is no write delay in LevelDB
itself. By the end-to-end principle, the storage server is in a better
position to decide whether to delay writes, or just keep pushing them into the
database.
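
A minimal sketch of such an application-level back-off policy: the
"leveldb.num-files-at-level0" property is a real LevelDB GetProperty key,
but the thresholds and delays below are made up for illustration, and are
not LevelDB's or HyperDex's.

```cpp
// Sketch of an application-level back-off policy driven by LevelDB's
// GetProperty. In real code the file count would come from:
//   std::string v;
//   db->GetProperty("leveldb.num-files-at-level0", &v);
//   int level0_files = std::stoi(v);
// The thresholds below are illustrative, not LevelDB's or HyperDex's.
int WriteDelayMicros(int level0_files) {
  const int kSoftLimit = 8;   // start slowing writers down
  const int kHardLimit = 12;  // compaction badly behind: long pause
  if (level0_files < kSoftLimit) return 0;
  if (level0_files < kHardLimit) return 1000;
  return 10000;
}
```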

------
notacoward
Nice to see. I played with LevelDB a while ago, and performance seemed great
as long as you didn't want any kind of durability . . . but of course you
would in real life, and as soon as you started forcing things to disk,
LevelDB's performance dropped to levels that were beatable by any number of
alternatives with fewer dependencies. To say I was unimpressed would be an
understatement. It's good to see someone bringing it to where it should have
been already.

~~~
vosper
"as soon as you started forcing things to disk, LevelDB's performance dropped
to levels that were beatable by any number of alternatives with fewer
dependencies"

What alternatives did you test that performed better than LevelDB?

------
davidkellis
The performance graphs look promising, but the second graph in the Parallelism
section seems suspect to me. Why didn't they plot the graph out to 8 threads,
like they did with the first graph?

Edit: I don't mean to sound snarky, I'm definitely impressed.

~~~
rescrv
So we scale up to (num_cores - 1). We have an 8-way and a 4-way machine. I'll
update the page to make this clear.

~~~
marshray
Even if the perf nosedives after (n - 1) threads, it would be nice to see on
the graph.

------
strmpnk
How does this compare with some of Basho's LevelDB performance work?
<https://github.com/basho/leveldb>

~~~
rescrv
I've not fully compared the code, but at first glance, they're choosing
different constants for compaction (such as level size, and the ratio of
compacted-to-uncompacted data) and have done away with the non-overlapping
invariant for the first few levels of the tree. I've not benchmarked their
code, so I don't know how effective this strategy is.

From what I can see of their code, they rely on a compaction strategy that is
very similar to stock LevelDB's (in terms of selecting the next uncompacted
SSTs within a level), they haven't done anything to improve multi-threaded
performance, and they've invested a lot of work into computing if/when writes
should be delayed within the LevelDB code. We've drastically changed the
compaction strategy, begun to improve concurrency (I know for a fact that we
have opportunities to improve it further), and removed the write-throttling
code, because we believe the storage layer is not the correct place to make
such decisions.

I guess the fairest thing to say is that we've taken complementary
approaches, and nothing in either approach prevents porting it to the other.

~~~
MatthewVon
As the person doing the github:basho/leveldb work, I agree with the above
(a very professional reply, not something you expect to see on the Internet;
thank you). We are each optimizing for our individual environments. I do need
to review the compaction algorithms to see if there is benefit to Basho, but
it will be a few weeks before that happens.

The write multi-threading does not help Basho's Riak 1.x series. We
parallelize by using multiple leveldb databases (vnodes, in Riak terms) and do
NOT parallelize individual databases until 2.0. It would therefore be unfair
to measure that feature against Riak.

Our compaction code is adjusted with an emphasis on running multiple
compactions of varied priority, because our use of multiple databases got hung
up on leveldb's single compaction thread. Again, this complete difference of
environments would make a direct comparison with Hyperleveldb unfair.

And the most difficult issue in dropping hyper into Riak is that our leveldb
performs the write throttling, while hyperleveldb leaves that to HyperDex.
This is yet again an environment design decision ... but it means coding is
required to make hyperleveldb "just work" with Riak. That will be a while.

I therefore do not claim that Basho's leveldb would be better with HyperDex,
and I suspect that today's hyperleveldb would not be better with Basho's Riak.
We optimized for our different pain points.

Matthew

------
_b
I would like to see a benchmark showing if HyperLevelDB's changes made
compaction slower or faster. (This is important because LevelDB's bottleneck
is commonly compaction.)

Under the compaction section, a chart does show a dramatic improvement on the
fillrand benchmark. But since HyperLevelDB removes LevelDB's write delay when
compaction falls behind, writes in HyperLevelDB will continue at full speed
even as compaction falls extremely far behind. So, yes, the data for the
benchmark was technically written into the db, but at the cost of making
reads insanely expensive.

Most reads need to touch every level-0 file. If compaction moving data out of
level-0 fell 10GB behind, then there would be hundreds of level-0 files, and
most reads would need to check all of them.

If the benchmark had mixed in one random read for every random write, I assume
HyperLevelDB would have been dramatically slower than LevelDB.

But if the changes in what to pick for compaction made HyperLevelDB non-
trivially faster, that would be quite interesting. I can't tell whether they
did.

------
keyaell1
I'm glad someone is addressing these issues. I think I'm not the only person
who has had bulk-write and compaction-related performance problems with
LevelDB. I wonder if there is any chance the changes from this fork will be
merged back into the LevelDB codebase.

~~~
emin_gun_sirer
I can answer the opposite question: we intend to keep in sync with upstream.
Since HyperLevelDB is a drop-in replacement for LevelDB, it should be fairly
easy for LevelDB users to pick it up.

~~~
alexchamberlain
Why can't you just push it upstream? Is there resistance?

~~~
rescrv
We will try to push upstream those pieces that upstream would want. Some
pieces, such as the autoconfiscated build system, are likely not of interest
to them, but are of interest to us.

------
bonzoesc
I got this to build on Ubuntu 12.10 Quantal Quetzal:

    
    
        libtoolize
        aclocal
        autoheader
        automake --add-missing
        autoconf
        ./configure 
        make
    

No performance numbers 'cause I did it on a slow machine though.

~~~
rescrv
Here's a more robust way to bootstrap:

    
    
        autoreconf -i
        ./configure
        make

------
halayli
I am surprised they are using LevelDB for a key/value store when they could
have used a DBM variant like Kyoto Cabinet's HashDB, which is much faster for
such a task. Why do you need sorted keys for key/value storage?

~~~
monstrado
Sorted keys are extremely useful for time series data. For example, if a key
has a timestamp in it and you'd like to do an aggregation over a few days of
keys, it's very simple to iterate from a STARTKEY to a STOPKEY (the keys in
between don't have to be known in advance, which is very important). Not only
is the iteration simple, it's very fast. This kind of use case (very common)
is difficult without sorted keys and iteration capabilities.
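
As a sketch of that pattern, using std::map as a stand-in for a sorted store
(with LevelDB itself, the equivalent is an Iterator positioned with Seek and
advanced while the key is below the stop key):

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative range scan over lexicographically sorted keys such as
// "cpu/2013-03-01". std::map stands in for the sorted store; the key
// format is made up for this example.
std::vector<int> RangeScan(const std::map<std::string, int>& db,
                           const std::string& start,
                           const std::string& stop) {
  std::vector<int> out;
  // Jump straight to the first key >= start, then walk forward until
  // we reach stop -- no full scan, and no need to know which keys
  // exist in between.
  for (auto it = db.lower_bound(start); it != db.end() && it->first < stop;
       ++it) {
    out.push_back(it->second);
  }
  return out;
}
```

The scan touches only the keys inside [start, stop), which is why a sorted
store makes time-range queries cheap.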

~~~
halayli
Agreed - but they are doing a full scan when doing searches, so I don't see
the benefit.

~~~
monstrado
Searching is not the same thing as a range iteration: normally, if you want
to search a key/value store, you'll need to scan the entire dataset.

A range iteration lets you scan a subset of the data, and as long as they
didn't fundamentally change how LevelDB works, it will still support ranges.

~~~
halayli
> Searching is not the same thing as a range iteration: normally, if you want
> to search a key/value store, you'll need to scan the entire dataset.

In that case, you lose the sorted-keys advantage that a BTree gives you.

My point is that this project leans more towards exact key -> value lookups,
and a BTree is overkill here.

~~~
monstrado
I guess it depends on your use case. Most of the use cases I've seen have been
time series using range iterations, which are incredibly fast, but I
understand your concern if you're only using it for random gets.

~~~
halayli
Perhaps providing the option to create either a btree or a hash map would be
nice.

------
Severian
I'd be interested to see the I/O compared on the same disk system. Right now
the comparison is between a 500GB HDD and an SSD, which I would assume skews
the results quite a bit.

~~~
rescrv
Please look at the graphs again. Each comparison is done twice: once on hdd
and once on ssd. This gives an apples-to-apples comparison for each platform.

