LevelDB: a fast and lightweight key/value database library (code.google.com)
232 points by adulau on May 8, 2011 | 79 comments

Note that this is not a database server, like Redis or Memcached. This is a database library, more along the lines of sqlite. In particular: "There is no client-server support builtin to the library. An application that needs such support will have to wrap their own server around the library" and only one process can access a database at a time.

It's the same model that Tokyo/Kyoto Cabinet uses. The difference is that Tokyo/Kyoto Tyrant is available as a network interface for it :)

LevelDB shouldn't be compared with the storage model of TC, because leveldb can be looked at as a Bigtable tablet running on a single node, a model that is optimized for updating rather than reading. The application scenarios for leveldb are different from those of TC.

Hi, thank you for this explanation. My first thought was how it differs from Redis, and when I read there's no client-server support, I thought this was worse than Redis. But your comparison with sqlite makes it look a little better. I can see a few use cases for it now. Thanks!

I think saying it's not like 'Memcached' could be wrong. I can think of Memcached as a specialized case of LevelDB.

This could be great news for iOS. AFAIK, this is the first K/V DB that could be built for iOS (c++) and has a free license that is compatible with the App Store (Tokyo Cabinet uses the LGPL and the author showed no interest, when I asked, in adding a static-linking exception; Berkeley DB requires an expensive commercial license; etc).

Not key/value, but Mobile CouchBase (CouchDB) supports iOS and apps using it have been approved for the App Store.

An interesting point to note is that Mobile CouchBase is primarily written in Erlang, so porting an existing key/value store is not dependent on its native language being C or C++.

True. I haven't tried it and it looks promising, but currently it depends on packages like Erlang, Spidermonkey, and OpenSSL, so adding Mobile CouchBase to an app means increasing its size by about 42MB.

What advantages does this have over the built-in SQLite?

It stores key/value pairs instead of relational data. I've been looking for something like this for a while as a replacement to NSUserDefaults.

    CREATE TABLE kv (key text PRIMARY KEY, value text);

Inserts are much faster due to the log-structured merge-tree.

This is pretty neat, and different from many other "simple" KV stores, because it arranges things in key-order on disk, and permits forward/reverse iteration over keys. (ie, range queries are cheap)

edit: it's actually full log-structured merge, sstables, memtables etc.

The implementation seems to use Log-Structured Merge Trees.

The only paper on this data structure seems to be: http://goo.gl/CVF1l

This paper is poorly written and quite honestly not useful to implement an LSM tree. Does anyone know of a better paper than this one?

Read "Cache-Oblivious Streaming B-trees" http://supertech.csail.mit.edu/cacheObliviousBTree.html It's LSM with faster searches.

Thanks for the pointer. Are there any known open-source implementations of the same?

Not that I know of. Also, they have a patent, but that didn't stop Acunu from reimplementing and improving the algorithm.

But it doesn't really describe the LSM data structure!

It describes the approach used here (cf. all of the discussion of the tablet system). The few differences are documented:


The leveling here reminds me a bit of how Lucene organizes and compacts its segments files. When you write documents to Lucene they are written to a segments file, usually one for each batch of documents you write. When searching, each of the segments files are searched for terms, so the more segments files you have, the slower the search tends to be. When segments files reach a certain size they are automatically compacted into other segments files. Users can (and should) then "optimize" their index, which compacts all the segments files into a single file. The log files and sorted tables of leveldb appear to work very similarly.

All the code files have this as a first line:

    // Copyright (c) 2011 The LevelDB Authors. All rights reserved.
I wonder why it's not copyrighted to Google.

That's what Google does when they set up a project that accepts contributions. Google Inc. is the author; see http://code.google.com/p/leveldb/source/browse/trunk/AUTHORS

The initial version was written by the authors of MapReduce and BigTable.

The idea behind this approach is to simplify the maintenance of the copyright line and to encourage collective ownership of the code. I'm told that some open-source projects have conflicts about whose name appears in which files, but I've not experienced that first-hand.

This looks like it has similar goals to bitcask. The tradeoffs and differences would be nice to know.

leveldb is a persistent ordered map; bitcask is a persistent hash table (no ordered iteration).

bitcask stores a fixed-size record in memory for every key, so for databases with a large number of keys it may use too much memory for some applications.

bitcask can guarantee at most one disk seek per lookup I think. leveldb may have to do a small handful of disk seeks. To clarify, leveldb stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, leveldb will need to do one seek (for the last level since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, leveldb will need two seeks.

I just need to say - it's pretty cool to see Sanjay Ghemawat hanging out on HN.

Also worth noting: bitcask is written in erlang and leveldb in c++.

Technically bitcask is a combination of Erlang and C.

I used boost::atomic to make it usable without C++0x: http://www.chaoticmind.net/~hcb/projects/boost.atomic/

    #include <boost/atomic.hpp>

    class AtomicPointer {
     private:
      boost::atomic<void*> rep_;
     public:
      AtomicPointer() { }
      explicit AtomicPointer(void* v) : rep_(v) { }
      inline void* Acquire_Load() const {
        return rep_.load(boost::memory_order_acquire);
      }
      inline void Release_Store(void* v) {
        rep_.store(v, boost::memory_order_release);
      }
      inline void* NoBarrier_Load() const {
        return rep_.load(boost::memory_order_relaxed);
      }
      inline void NoBarrier_Store(void* v) {
        rep_.store(v, boost::memory_order_relaxed);
      }
    };

Anybody know what hardware those benchmarks were on? Is it actually durable? They mention problems with writeback caching, but are they at least durable from the OS perspective? How well does it scale?

It is durable from an OS perspective. I started typing in a long explanation, but I think the comment for the "sync" field in the options.h file captures things well:

  struct WriteOptions {
    // If true, the write will be flushed from the operating system
    // buffer cache (by calling WritableFile::Sync()) before the write
    // is considered complete.  If this flag is true, writes will be
    // slower.
    // If this flag is false, and the machine crashes, some recent
    // writes may be lost.  Note that if it is just the process that
    // crashes (i.e., the machine does not reboot), no writes will be
    // lost even if sync==false.
    // In other words, a DB write with sync==false has similar
    // crash semantics as the "write()" system call.  A DB write
    // with sync==true has similar crash semantics to a "write()"
    // system call followed by "fsync()".
    // Default: false
    bool sync;
  };
We don't have much experience with scaling to larger databases yet. There are known problems which will cause a (smallish) constant factor slowdown in write performance after the database becomes a few (10?) GB in size, but I don't recall the details all that well, and the implementation has changed somewhat since that experiment. I would like to characterize this better and fix things so we can support somewhere between 100GB-1TB databases well. It just hasn't become a priority yet.

The benchmark numbers on the linked page were from a small million entry database that easily fits in the OS buffer cache.

Welcome to HN commenting, Sanjay!

(For those that aren't familiar, his bio is here: http://research.google.com/people/sanjay/index.html)

Writes being lost means what, a trashed file? Or merely an incomplete one?

Merely an incomplete one. Leveldb never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). Leveldb's recovery code uses checksums to detect this and will skip the incomplete records.


it says so on the page: CPU: 4 x Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz

Safe to assume the data fit into memory... whatever amount there was.

more interested in disk (and workloads larger than memory, but it doesn't sound like that's their use case)

What I would like to know is: is there support for accessing values by nth-largest key?

I had to roll my own B-tree library specifically to get this feature, since nothing out there had it.

Why not index size(key)?

Then lookup the n-th largest key and then pull the values?

eg. SELECT value from key_value_pairs order by size(key) LIMIT N

I don't get it,

How would you use this approach to find say, the 195687th largest element in O(log(n)) time?

Edit: However, I do know I can write a SQL query to find this - an ordinary index plus limit should do this. It's just such an approach gets really awkward and so unpredictable when you are dealing with a lot of distinct columns and tables - why I implemented this with key-value DB (a custom index on top of Kyoto Cabinet).

Why does lots of columns and tables make this approach awkward? Sure, you might need a couple of computed columns and a little table denormalization, but IMHO most of where NoSQL gets its performance is from denormalization.

Regardless of how many elements compose the key the calculation is the same. I'd be surprised if you couldn't approach O(log(n)) time with a good SQL implementation, some computed columns and/or indexed views.

Just wondering: was the B-tree implementation you did a pure B-tree, or was the data structure itself modified in some way, e.g. storing the number of elements to the left or right?

You certainly could do a SQL implementation. For my purposes, each of the many queries was going to involve something like at least six lines of logic (it was "find the nth value satisfying X, Y, and Z condition"). Debugging and trying to make sure the indices worked correctly for each query looked uglier than creating a whole different structure; my structure was native C++, so there weren't the translation issues of SQL.

I used a B-tree modified by adding a member variable storing the total number of children under each parent. It's so simple a modification that I don't know why more implementations don't use it, but it seems they don't.

I don't think that's what the OP wants. The real result expected is supposed to be ordered by the value of the key itself rather than the size of the key.

Was really psyched until I saw --std=c++0x in the Makefile. What 0x features does leveldb depend on, and can those dependencies be realistically removed?

The use of c++0x should be limited to the small AtomicPointer class in port_posix.h. The class uses <cstdatomic> for its implementation. If you can provide a different implementation of "acquire load" and "release store" for your architecture, you should be all set without the need for c++0x.

I suspect that you may find suitable code in the chromium source code. See leveldb's port_chromium.h.

I suspect you need a newer version of gcc/g++ (one that supports the c++0x standard). You can probably download a later gcc package and then use that to install.


On Wed, May 11, 2011 at 3:37 AM, conglin.deng <conglin.deng@aliyun-inc.com> wrote:

> Hi Jeff,
>
> I have checked out the leveldb code from code.google.com, but when I run make in the root folder I get the error below:
>
>     [root@host]$ make
>     g++ -c -DLEVELDB_PLATFORM_POSIX -I. -I./include -std=c++0x -g2 db/db_bench.cc -o db/db_bench.o
>     cc1plus: error: unrecognized command line option "-std=c++0x"
>     make: *** [db/db_bench.o] Error 1
>     [root@host]$ pwd
>     /home/admin/leveldb
>
> Would you please tell me how to build and set up leveldb on a Red Hat 5.4 environment? Thanks very much!
>
> Thanks,
> linc

I managed to get an initial patch up here:


@sghemawat: Are you planning on implementing atomic CAS operations?

We have no such plans. The library is designed so that concurrency control is outside its scope. If I needed it for an application, I might consider using some application level code that looks like:

    bool CAS(leveldb::DB* db, const std::string& key,
             const std::string& oldvalue, const std::string& newvalue) {
      // lock some application-level mutex here
      std::string value;
      db->Get(leveldb::ReadOptions(), key, &value);
      bool result = (value == oldvalue);
      if (result) db->Put(leveldb::WriteOptions(), key, newvalue);
      return result;
    }
This should be not much slower than any hard-wired CAS we could provide from inside leveldb.

It looks like this is a local storage tool, similar to sqlite. Is that correct? Lots of chromium devs on that list.

That's correct. We're using leveldb as the back-end for IndexedDB in Chrome.

Please enlighten me, in the LevelDB page it said:

> Only a single process (possibly multi-threaded) can access a particular database at a time.

But I presume Chrome is multiprocess by nature?

All leveldb access is done in the main process (which also handles UI, network, most file access, etc).

I am guessing each chrome "tab" (read: process) has its own db open; that is just a guess though.

something similar + benchmarks http://stsdb.com

"STSdb is coded in 100% pure managed C#." & tri-licensed (GPL2/GPL3/Commercial) vs LevelDB's C++ & New BSD license.

Hackers should be able to figure out which best fits their particular use case.

Does anyone have an idea how the benchmark results compare to other databases (e.g. Redis)?

Based on their numbers it looks slower than TokyoCabinet but still much faster than BerkeleyDB. One would need to design a head-to-head benchmark to be sure.

I don't think Redis is a good comparison as that's an in-memory database so better suited for the 10% of hot data you'd need to cache. Whereas disk stores like TokyoCabinet and LevelDB would be great for storing the other 90-100%. If your use case involves a large dataset and you don't have terabytes of RAM lying around, that is.

One of the leveldb authors here. TokyoCabinet is something we seriously considered using instead of writing leveldb. TokyoCabinet has great performance usually. I haven't done a careful head-to-head comparison, but it wouldn't surprise me if it was somewhat faster than leveldb for many workloads. Plus TokyoCabinet is more mature, has matching server code etc. and may therefore be a better fit for many projects.

However because of a fundamental difference in data structures (TokyoCabinet uses btrees for ordered storage; leveldb uses log structured merge trees), random write performance (which is important for our needs) is significantly better in leveldb. This part we did measure. IIRC, we could fill TokyoCabinet with a million 100-byte writes in less than two seconds if writing sequentially, but the time ballooned to ~2000 seconds if we wrote randomly. The corresponding slowdown for leveldb is from ~1.5 seconds (sequential) to ~2.5 seconds (random).

Fractal trees should have faster read performance than leveldb with comparable update performance. There doesn't exist any open-source implementation of such a structure, though we can hope that www.acunu.com might provide an open-source solution for the similar Stratified B-tree: http://www.acunu.com/2011/04/log-file-systems-and-ssds-made-...

Being in-kernel, acunu is very far from the lightness of LevelDB.

Is there a more useful technical paper on fractal trees? Better yet, is there an open-source implementation of the same?

Cache-oblivious streaming B-trees:


I don't know of any open-source implementation for certain, but I have heard COLAs are used in HBase.

> difference in data structures

Always good to have more tools in one's arsenal in that case. I'll drop one of you a message if I write a Lua binding for it (using LuaJIT+embedded k/v for data services atm).

Have you looked at implementing CAS? It is the one thing missing for me to adopt it as an underlying store for my AvroBase API: http://www.javarants.com/2010/06/30/havrobase-a-searchable-e...

Could you please describe the caching setup for the random read test? Was there any cache (apart from the OS's) for either position/index lookup or payload, or did you hit the disk without any caching of past operations?


It doesn't seem to have a network interface. It is somewhat comparable to InnoDB or Tokyo/Kyoto Cabinet.

Anyone manage to compile this on OS X?

Chromium compiles leveldb using the standard Xcode toolchain, but has a dependency on Chromium's "base" library, so it's for sure possible. I can't tell you the exact steps, but I suspect it wouldn't be too hard to get the posix port compiling with Xcode as well though.

Managed to get it working last night:


You'll need a compiler from MacPorts: it wants C++0x which Apple's gcc doesn't support yet.

Argh... that makes it even harder to use on iOS. :)

Sounds like a possible redis or memcached replacement.

A performance comparison with memcached would be interesting.

replacement is a strong word when you consider that you are talking about:

get/put/delete vs http://redis.io/commands

That is beautiful documentation.

Sure, depends of course on whether you're using redis only as a simple key-value store.

