
LevelDB: a fast and lightweight key/value database library - adulau
http://code.google.com/p/leveldb/
======
necubi
Note that this is not a database _server_, like Redis or Memcached. This is a
database _library_, more along the lines of sqlite. In particular: "There is
no client-server support builtin to the library. An application that needs
such support will have to wrap their own server around the library", and only
one process can access a database at a time.
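
For concreteness, a minimal sketch of embedded use with the leveldb C++ API
(the path and keys here are made up):

    #include <cassert>
    #include <string>
    #include "leveldb/db.h"

    int main() {
      leveldb::DB* db;
      leveldb::Options options;
      options.create_if_missing = true;
      // The library opens the database files directly; no server is involved.
      leveldb::Status s = leveldb::DB::Open(options, "/tmp/testdb", &db);
      assert(s.ok());

      s = db->Put(leveldb::WriteOptions(), "key", "value");
      std::string value;
      if (s.ok()) s = db->Get(leveldb::ReadOptions(), "key", &value);

      delete db;  // closes the DB and releases the single-process lock
      return 0;
    }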

~~~
rb2k_
It's the same model that Tokyo/Kyoto Cabinet uses. The difference is that
Tokyo Tyrant (and Kyoto Tycoon) is available as a network interface on top :)

~~~
yingfeng
LevelDB shouldn't be compared with TC's storage model: leveldb can be looked
at as a bigtable that runs on a single node, a model that is optimized for
updating rather than reading. The application scenarios for leveldb are
different from those of TC.

------
stephth
This could be great news for iOS. AFAIK, this is the first K/V DB that can be
built for iOS (C++) _and_ has a free license that is compatible with the App
Store (Tokyo Cabinet uses the LGPL and the author showed no interest, when I
asked, in adding a static-linking exception; Berkeley DB requires an expensive
commercial license; etc.).

~~~
electrum
What advantages does this have over the built-in SQLite?

~~~
beder
It stores key/value pairs instead of relational data. I've been looking for
something like this for a while as a replacement for NSUserDefaults.

~~~
ichverstehe

        CREATE TABLE kv (key text PRIMARY KEY, value text);

------
metabrew
This is pretty neat, and different from many other "simple" KV stores, because
it arranges things in key order on disk and permits forward/reverse iteration
over keys (i.e., range queries are cheap).

edit: it's actually a full log-structured merge implementation: sstables,
memtables, etc.
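
To illustrate the cheap range queries, a sketch using the leveldb iterator API
(assumes an open leveldb::DB* db; the "start"/"limit" bounds are made up):

    #include <cassert>
    #include <iostream>
    #include "leveldb/db.h"

    void ScanRange(leveldb::DB* db) {
      leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
      // Keys are kept in sorted order, so a range scan is one seek plus
      // sequential reads; no full scan of the database is needed.
      for (it->Seek("start"); it->Valid() && it->key().ToString() < "limit";
           it->Next()) {
        std::cout << it->key().ToString() << " => "
                  << it->value().ToString() << std::endl;
      }
      assert(it->status().ok());  // surface any error hit during the scan
      delete it;
    }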

------
psaccounts
The implementation seems to use log-structured merge (LSM) trees.

The only paper on this data structure seems to be: <http://goo.gl/CVF1l>

That paper is poorly written and, quite honestly, not useful for implementing
an LSM tree. Does anyone know of a better one?

~~~
timr
<http://labs.google.com/papers/bigtable.html>

~~~
psaccounts
But it doesn't really describe the LSM data structure!

~~~
timr
It describes the approach used here (cf. all of the discussion of the tablet
system). The few differences are documented:

<http://code.google.com/p/leveldb/source/browse/trunk/doc/impl.html>

------
joeshaw
The leveling here reminds me a bit of how Lucene organizes and compacts its
segment files. When you write documents to Lucene, they are written to a
segment file, usually one per batch of documents. When searching, each of the
segment files is searched for terms, so the more segment files you have, the
slower the search tends to be. When segment files reach a certain size they
are automatically compacted into other segment files. Users can (and should)
then "optimize" their index, which compacts all the segment files into a
single file. The log files and sorted tables of leveldb appear to work very
similarly.

------
cool-RR
All the code files have this as a first line:

    // Copyright (c) 2011 The LevelDB Authors. All rights reserved.

I wonder why the copyright isn't assigned to Google.

~~~
ivank
That's what Google does when they set up a project that accepts contributions.
Google Inc. is the author; see
<http://code.google.com/p/leveldb/source/browse/trunk/AUTHORS>

~~~
azsromej
The initial version was written by the authors of MapReduce and BigTable.

------
jhawk28
This looks like it has similar goals to bitcask. The tradeoffs and differences
would be nice to know.

~~~
sghemawat
leveldb is a persistent ordered map; bitcask is a persistent hash table (no
ordered iteration).

bitcask stores a fixed-size record in memory for every key, so for databases
with a large number of keys it may use too much memory for some applications.

I think bitcask can guarantee at most one disk seek per lookup. leveldb may
have to do a small handful of disk seeks. To clarify: leveldb stores data in a
sequence of levels. Each level stores approximately ten times as much data as
the level before it. A read needs one disk seek per level. So if 10% of the db
fits in memory, leveldb will need to do one seek (for the last level, since
all of the earlier levels should end up cached in the OS buffer cache). If 1%
fits in memory, leveldb will need two seeks.
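
As a back-of-the-envelope check on that arithmetic, a small sketch (all sizes
hypothetical; assumes levels grow ~10x, the smaller levels occupy the buffer
cache first, and each uncached level costs one seek):

    #include <cstdio>

    int main() {
      // Hypothetical level sizes in GB, smallest to largest, ~10x apart;
      // the last level holds ~90% of this 100 GB database.
      double sizes[] = {0.0009, 0.009, 0.09, 0.9, 9.0, 90.0};
      double cache = 1.0;  // 1% of the database fits in memory
      int seeks = 0;
      for (int i = 0; i < 6; i++) {
        if (sizes[i] <= cache) cache -= sizes[i];  // level stays cached
        else seeks++;                              // uncached level: one seek
      }
      printf("seeks per read: %d\n", seeks);  // prints 2, matching the 1% case
    }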

~~~
stephth
Also worth noting: bitcask is written in erlang and leveldb in c++.

~~~
devpotato
Technically bitcask is a combination of Erlang and C.

------
yingfeng
I used boost::atomic to make it buildable without C++0x:
<http://www.chaoticmind.net/~hcb/projects/boost.atomic/>

    #include <boost/atomic.hpp>

    class AtomicPointer {
     private:
      boost::atomic<void*> rep_;
     public:
      AtomicPointer() { }
      explicit AtomicPointer(void* v) : rep_(v) { }
      inline void* Acquire_Load() const { return rep_.load(boost::memory_order_acquire); }
      inline void Release_Store(void* v) { rep_.store(v, boost::memory_order_release); }
      inline void* NoBarrier_Load() const { return rep_.load(boost::memory_order_relaxed); }
      inline void NoBarrier_Store(void* v) { rep_.store(v, boost::memory_order_relaxed); }
    };

------
leif
Anybody know what hardware those benchmarks were run on? Is it actually
durable? They mention problems with writeback caching, but is it at least
durable from the OS's perspective? How well does it scale?

~~~
sghemawat
It is durable from an OS perspective. I started typing in a long explanation,
but I think the comment for the "sync" field in the options.h file captures
things well:

    
    
      struct WriteOptions {
        // If true, the write will be flushed from the operating system
        // buffer cache (by calling WritableFile::Sync()) before the write
        // is considered complete.  If this flag is true, writes will be
        // slower.
        //
        // If this flag is false, and the machine crashes, some recent
        // writes may be lost.  Note that if it is just the process that
        // crashes (i.e., the machine does not reboot), no writes will be
        // lost even if sync==false.
        //
        // In other words, a DB write with sync==false has similar
        // crash semantics as the "write()" system call.  A DB write
        // with sync==true has similar crash semantics to a "write()"
        // system call followed by "fsync()".
        //
        // Default: false
        bool sync;
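
In application code, turning this on for a given write looks like the sketch
below (assuming an open leveldb::DB* db):

    leveldb::WriteOptions write_options;
    write_options.sync = true;  // flush via WritableFile::Sync() before returning
    leveldb::Status s = db->Put(write_options, "key", "value");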
    

We don't have much experience with scaling to larger databases yet. There are
known problems which will cause a (smallish) constant factor slowdown in write
performance after the database becomes a few (10?) GB in size, but I don't
recall the details all that well, and the implementation has changed somewhat
since that experiment. I would like to characterize this better and fix things
so we can support somewhere between 100GB-1TB databases well. It just hasn't
become a priority yet.

The benchmark numbers on the linked page were from a small, million-entry
database that easily fits in the OS buffer cache.

~~~
JulianMorrison
Writes being lost means what, a trashed file? Or merely an incomplete one?

~~~
sghemawat
Merely an incomplete one. Leveldb never writes in place: it always appends to
a log file, or merges existing files together to produce new ones. So an OS
crash will cause a partially written log record (or a few partially written
log records). Leveldb recovery code uses checksums to detect this and will
skip the incomplete records.

------
joe_the_user
What I would like to know is: is there support for accessing values by
nth-largest key?

I had to roll my own B-tree library specifically to get this feature, since
nothing out there had it.

~~~
fleitz
Why not index the keys?

Then look up the n-th largest key and pull the value?

e.g. SELECT value FROM key_value_pairs ORDER BY key DESC LIMIT 1 OFFSET n-1

~~~
joe_the_user
I don't get it.

How would you use this approach to find, say, the 195687th largest element in
O(log(n)) time?

Edit: However, I do know I _can_ write a SQL query to find this; an ordinary
index plus LIMIT/OFFSET should do it. It's just that such an approach gets
really awkward and unpredictable when you are dealing with a lot of distinct
columns and tables, which is why I implemented this with a key-value DB (a
custom index on top of Kyoto Cabinet).

~~~
fleitz
Why do lots of columns and tables make this approach awkward? Sure, you might
need a couple of computed columns and a little table denormalization, but IMHO
most of where NoSQL gets its performance is from denormalization.

Regardless of how many elements compose the key, the calculation is the same.
I'd be surprised if you couldn't approach O(log(n)) time with a good SQL
implementation, some computed columns, and/or indexed views.

Just wondering: was the B-tree implementation you did a pure B-tree, or was
the data structure itself modified in some way, e.g. storing the number of
elements to the left or right?

~~~
joe_the_user
You certainly _could_ do a SQL implementation. For my purposes, each of the
many queries was going to involve something like at least six lines of logic
(it was "find the nth value satisfying conditions X, Y, and Z"). Debugging and
trying to make sure the indices worked correctly for each query looked uglier
than creating a whole different structure; my structure was native C++, so
there weren't the translation issues of SQL.

I used a B-tree modified by adding a member variable holding the total number
of children under each node. It's so simple a modification that I don't know
why more implementations don't use it, but it seems they don't.
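
For concreteness, a sketch of that modification on a plain binary search tree
(the same idea carries over to B-trees by keeping a count per child pointer):

    // Each node caches its subtree size, so the k-th smallest key is found
    // in a single root-to-leaf walk: O(log n) in a balanced tree.
    struct Node {
      int key;
      int size;  // nodes in this subtree; maintained on insert/delete
      Node *left, *right;
    };

    int SubtreeSize(const Node* n) { return n ? n->size : 0; }

    // Returns the k-th smallest key, 1-based; assumes 1 <= k <= n->size.
    int Select(const Node* n, int k) {
      int left_size = SubtreeSize(n->left);
      if (k <= left_size) return Select(n->left, k);  // answer is on the left
      if (k == left_size + 1) return n->key;          // this node is the k-th
      return Select(n->right, k - left_size - 1);     // skip left subtree + root
    }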

------
argvzero
Was really psyched until I saw --std=c++0x in the Makefile. What 0x features
does leveldb depend on, and can those dependencies be realistically removed?

~~~
sghemawat
The use of C++0x should be limited to the small AtomicPointer class in
port_posix.h. The class uses <cstdatomic> for its implementation. If you can
provide a different implementation of "acquire load" and "release store" for
your architecture, you should be all set without the need for C++0x.

I suspect that you may find suitable code in the chromium source code. See
leveldb's port_chromium.h.
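
For instance, a hypothetical barrier-based fallback (a sketch built on GCC's
__sync_synchronize() builtin, not the official port; check that a full
barrier is sufficient on your architecture):

    class AtomicPointer {
     private:
      void* rep_;
     public:
      AtomicPointer() { }
      explicit AtomicPointer(void* v) : rep_(v) { }
      inline void* Acquire_Load() const {
        void* result = rep_;
        __sync_synchronize();  // nothing after this is reordered above the load
        return result;
      }
      inline void Release_Store(void* v) {
        __sync_synchronize();  // prior writes complete before this store
        rep_ = v;
      }
      inline void* NoBarrier_Load() const { return rep_; }
      inline void NoBarrier_Store(void* v) { rep_ = v; }
    };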

~~~
davisp
I managed to get an initial patch up here:

<http://code.google.com/p/leveldb/issues/detail?id=2>

------
Emore
@sghemawat: Are you planning on implementing atomic CAS operations?

~~~
sghemawat
We have no such plans. The library is designed so that concurrency control is
outside its scope. If I needed it for an application, I might consider using
some application-level code that looks like:

    
    
        // Sketch: serialize CAS operations with one mutex (C++0x <mutex> assumed).
        static std::mutex cas_mutex;

        bool CAS(leveldb::DB* db, const std::string& key,
                 const std::string& oldvalue, const std::string& newvalue) {
          std::lock_guard<std::mutex> lock(cas_mutex);
          std::string value;
          leveldb::Status s = db->Get(leveldb::ReadOptions(), key, &value);
          bool result = s.ok() && value == oldvalue;
          if (result) db->Put(leveldb::WriteOptions(), key, newvalue);
          return result;
        }
    

This should be not much slower than any hard-wired CAS we could provide from
inside leveldb.
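
A hypothetical caller would use it like this, retrying or giving up when a
concurrent update wins (the key and values are made up):

    // Move a job from "pending" to "done" only if nobody changed it meanwhile.
    if (!CAS(db, "job:42", "pending", "done")) {
      // another writer got there first; re-read the value and decide again
    }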

------
18pfsmt
It looks like this is a local storage tool, similar to sqlite. Is that
correct? Lots of Chromium devs on that list.

~~~
abarth
That's correct. We're using leveldb as the back-end for IndexedDB in Chrome.

~~~
est
Please enlighten me; the LevelDB page says:

> Only a single process (possibly multi-threaded) can access a particular
> database at a time.

But I presume Chrome is multiprocess by nature?

~~~
jorlow
All leveldb access is done in the main process (which also handles UI,
network, most file access, etc).

------
gmarkov
Something similar, plus benchmarks: <http://stsdb.com>

~~~
evangineer
"STSdb is coded in 100% pure managed C#." & tri-licensed
(GPL2/GPL3/Commercial) vs LevelDB's C++ & New BSD license.

Hackers should be able to figure out which best fits their particular use
case.

------
ma2rten
Does someone have any idea, how the benchmark results compare to other
databases (e.g. redis) ?

~~~
davidhollander
Based on their numbers it looks slower than TokyoCabinet but still much
faster than BerkeleyDB. One would need to design a head-to-head benchmark to
be sure.

I don't think Redis is a good comparison, as it's an in-memory database and so
better suited for the 10% of hot data you'd need to cache, whereas disk stores
like TokyoCabinet and LevelDB would be great for storing the other 90-100%; if
your use case involves a large dataset and you don't have terabytes of RAM
lying around, that is.

~~~
sghemawat
One of the leveldb authors here. TokyoCabinet is something we seriously
considered using instead of writing leveldb. TokyoCabinet usually has great
performance. I haven't done a careful head-to-head comparison, but it wouldn't
surprise me if it was somewhat faster than leveldb for many workloads. Plus,
TokyoCabinet is more mature, has matching server code, etc., and may therefore
be a better fit for many projects.

However, because of a fundamental difference in data structures (TokyoCabinet
uses btrees for ordered storage; leveldb uses log-structured merge trees),
random write performance (which is important for our needs) is significantly
better in leveldb. This part we did measure. IIRC, we could fill TokyoCabinet
with a million 100-byte writes in less than two seconds if writing
sequentially, but the time ballooned to ~2000 seconds if we wrote randomly.
The corresponding slowdown for leveldb is from ~1.5 seconds (sequential) to
~2.5 seconds (random).

~~~
lisper
Did you consider fractal trees?

<http://tokutek.com/presentations/bender-Scalperf-9-09.pdf>

~~~
yingfeng
Fractal trees should have faster read performance than leveldb, with
comparable update performance. There is no open source implementation of that
structure, though we can hope that www.acunu.com might provide an open source
implementation of the similar Stratified B-tree:
<http://www.acunu.com/2011/04/log-file-systems-and-ssds-made-for-each-other/>

~~~
br1
Being in-kernel, acunu is very far from the lightness of LevelDB.

------
foobarbazetc
Anyone manage to compile this on OS X?

~~~
forgotusername
You'll need a compiler from MacPorts: it wants C++0x, which Apple's gcc
doesn't support yet.

~~~
foobarbazetc
Argh... that makes it even harder to use on iOS. :)

------
mcorrientes
Sounds like a possible redis or memcached replacement.

A performance comparison with memcached would be interesting.

~~~
latch
Replacement is a strong word when you consider that you are comparing:

get/put/delete vs. <http://redis.io/commands>

~~~
emehrkay
That is beautiful documentation.

