MDBM – High-speed database (yahooeng.tumblr.com)
159 points by threepointone on Dec 11, 2014 | 111 comments



I'm so excited that they finally open-sourced this. It's relatively old tech at Yahoo, stuff folks outside never got to see. It was difficult to explain to later colleagues the stuff I knew about shared-memory databases because I couldn't give them a frame of reference.

mdbm performance is even better on FreeBSD than Linux because FreeBSD supports MAP_NOSYNC, which causes the kernel not to flush dirty pages to disk until the region is unmapped. Perhaps mdbm's release will finally get the Linux kernel team to provide support for that flag.


Same here. I remember wishing we could Open Source it back in the early 2000s. Good to see this coming out so people can take a little credit for work they did back in the day.


I've been providing people with source all along. SGI was pretty pleasant about letting me retain copyright on stuff like that.


Didn't Yahoo have copyright on a bunch of the modifications to mdbm?


Yeah, they whacked it pretty hard and I don't have anything to do with that. I asked them to call it YDBM but that idea came too late.


Yeah, YDBM was another thing (looks a lot like Project Voldemort, except in C, not Java) that maybe someday will be released.


How is your MAP_NOSYNC any different from just using mlock?


mdbm uses a file-backed mmap region, so you're implicitly saying you do want some sort of persistence. Also, mlock pins heap pages in RAM to keep them from being swapped out - mdbm's regions aren't heap space.

There's an mmap flag on Linux called MAP_LOCKED, but I'm not sure how it behaves with MAP_SHARED, which mdbm uses (the man page isn't clear).


This looks interesting. At this stage of the game a more meaningful benchmark might involve LMDB, Wiredtiger, and, yes, LevelDB.


I don't think it makes sense to benchmark MDBM against LMDB or WiredTiger: keys are not kept in sorted order (no range queries), there is no support for transactions, and MDBM does not offer durability in the event of power loss.

MDBM is pretty much an optimized persistent hash table. LMDB and WiredTiger aim to be full-fledged ACID compliant database storage engines with functionality similar to that of BerkeleyDB or InnoDB.


You make some good points. We benchmark LMDB against LevelDB and its derivatives even though none of the LevelDB family offer ACID transactions. (http://symas.com/mdb/ondisk/ ) Despite this fact, people will ask the question and try to make the comparison, so we run those tests. It's silly, but most people seem to pay attention to performance more than to safety/reliability.

From my totally biased perspective, MDBM is utter garbage. They use mmap but make absolutely zero effort to use it safely. This was the biggest obstacle to overcome in developing LMDB; I had a few lengthy conversations with the SleepyCat guys about it as well. It's the reason it took 2 years (from 2009 when we first started talking about it, to 2011 first code release) to get LMDB implemented. If you want to call something a "database" you have to do more than just mmap a file and start shoving data into it - you have to exert some kind of control over how and when the mapped data gets persisted to disk. Otherwise, if you just let the OS randomly flush things, you'll wind up with garbage. As Keith Bostic said to me (private email):

"The most significant problem with building an mmap'd back-end is implementing write-ahead-logging (WAL). (You probably know this, but just in case: the way databases usually guarantee consistency is by ensuring that log records describing each change are written to disk before their transaction commits, and before the database page that was changed. In other words, log record X must hit disk before the database page containing the change described by log record X.)

In Berkeley DB WAL is done by maintaining a relationship between the database pages and the log records. If a database page is being written to disk, there's a look-aside into the logging system to make sure the right log records have already been written. In a memory-mapped system, you would do this by locking modified pages into memory (mlock), and flushing them at specific times (msync), otherwise the VM might just push a database page with modifications to disk before its log record is written, and if you crash at that point it's all over but the screaming."

The harsh realities of working with mmap are what dictated LMDB's copy-on-write design - it's the only way to ensure consistency with an mmap without losing performance (due to multiple mlock/msync syscalls). None of these design considerations are evident in MDBM.

LMDB's mmap is read-only by default, because otherwise it's trivial to permanently corrupt a database by overwriting a record, writing past the end, etc. MDBM's mmap is read-write, and the only "protection" you get is a doc that tells you "be Vewwy vewwy careful!" Ridiculously sloppy.

LMDB's design and implementation are proven incorruptible. MDBM (and LevelDB and all its derivatives) are proven to be quite fragile. https://www.usenix.org/conference/osdi14/technical-sessions/...

Leaving reliability aside for a moment, there's also the issue of performance and efficiency. We used to use DBM-style hashes for the indexes in OpenLDAP, up to release 2.1. We abandoned them in favor of B-trees in OpenLDAP 2.2 because extensive benchmarking showed that BDB's B-trees were faster than its hash implementation at very large data sizes.

The fundamental problem is that hash data structures are only fast when they are sparsely populated. When the number of data records you need to work with increases to fill the table, you start getting more and more hash collisions that result in lots of linear probes (or whatever other hash recovery strategy you're using).

The other problem is that the very sparse/unordered nature of hashes makes them extremely cache unfriendly - you get zero locality-of-reference for groups of related queries. So as your data volumes increase, you get less and less benefit from the amount of RAM you have available. When the data exceeds the size of RAM, the number of disk seeks required for an arbitrary lookup is enormous, and every read is a random access.

Using a hash for a large-scale data store is just horrible. (We tested this extensively a decade ago http://www.openldap.org/lists/openldap-devel/200401/msg00077... )


Hey, I've been working on a graph db-like thing as a hobby project for the last six months and I'm using LMDB as backend. I tried many alternatives (LevelDB, Sqlite4's LSM, BDB etc.) before settling on LMDB. The alternatives all had some quirk that stopped me from using them.

Among other things, I like that LMDB has zero-copy reads and that's something I've taken care to preserve all the way through my layers.

Just wanted to say thanks for the great work. LMDB is a joy to work with.


Great to hear that ;)


"We abandoned them in favor of B-trees in OpenLDAP 2.2 because extensive benchmarking showed that BDB's B-trees were faster than its hash implementation at very large data sizes."

Didn't bdb's linear hashing scheme extend the size of the hash table enough to keep it at the required load factor?


http://www.openldap.org/lists/openldap-devel/200401/msg00074...

Our experience with it shows that resizing was itself a very expensive operation.


Thanks for the reply. Interesting that linear hashing had such a big effect, seeing as it is meant to be a slowly-occurring process that only happens when the average load factor of all buckets exceeds a threshold. I guess that was back in 2004 though. I wonder if the same performance difference still applies?


Howard, since you're here and taking questions... :-)

Do you have any idea if a sqlite 4 release is imminent? Will lmdb work with it right out of the gate?

Thanks.


Sorry, I haven't kept up with sqlite4 development. I'm still on the sqlite-developers mailing list but I never see any traffic about it in particular.

I'm going to guess that they will not ship an LMDB driver right out of the gate. The one we were working on was not completed (our contractor flaked), and while I know they did some work on their own, I have no idea how complete that was either.


Thanks for your insight and the in-depth benchmarks you provide.


You're welcome ;)

The benchmarks are obviously for our own benefit too - until someone does these comparisons, none of us knows where things truly stand.


You pretty clearly haven't used MDBM because the MDBM I worked on at SGI (and still use to this day) gets to any key in two page faults (aka 2 disk seeks) at the most. That was the whole point of it.

If you want I'll go shove a few GB into an mdbm, drop caches, and time a lookup.


If you've already ported the levelDB benchmark driver, feel free to send it to me: https://github.com/hyc/leveldb/tree/benches/doc/bench

2 seeks at the most: are you talking about a 32-bit address space? The only way that's possible in 64 bits is to direct-map a hash into e.g. two 32-bit chunks and use the hash as an actual disk block address for the first chunk, and an index into a block list for the second chunk.


2 seeks. Address space doesn't matter, you have one seek to read the directory (I'm assuming 100% cold cache), and one seek to get to the page in question.

Not only that, we watched the bus on an SGI Challenge and counted cache misses and TLB misses. 2 TLB misses to get a key.

Saying that it isn't possible on a 64 bit VM system makes no sense to me. If I have a 2TB file and I seek to location A and read it, then seek to location B and read it, you are saying that's not possible? Same thing with mmap, I set a pointer to the mapping, read p, p += <number>, read p. Two seeks, two page faults, whatever you want to call it, it does 2 and only 2 I/O's to get a key/value (unless the pages are bigger than disk blocks but then those are going to be sequential I/O's, no extra seeks).


I was actually thinking of a >2GB DB file on a 32 bit server. But leaving that aside, it sounds like you're assuming a perfect hash function with no collisions. If you have collisions, you have to deal with the possibility of a hash bucket overflowing and requiring an additional seek.

Anyway, I don't doubt that you can operate in 2 seeks in the normal case.


It is 2 seeks, at most, for 100% of lookups.


Care to share any details on the hashing scheme? Is it based on linear hashing, a la Litwin and Larson?


The hash is up to you, several are provided.


Isn't LevelDB a building block of Google's distributed file systems?


LevelDB is based on concepts from Google's BigTable database system. The tablet implementation for the BigTable system was developed starting in about 2004, and is based on a different Google internal code base than the LevelDB code.[1]

[1] http://en.wikipedia.org/wiki/LevelDB#History


They probably release old tech after they have upgraded their own. So when they did their internal LevelDB 3.0 they released 1.0 as opensource. Probably do the same thing with all their releases (map-reduce,bigtable etc).


If you look at the MDBM page you'll notice that they currently benchmark against LevelDB, BerkeleyDB, and Kyoto Cabinet. What I was suggesting involves the same theme but better, newer competition.

I agree that it's an apples to oranges comparison in any case.


The timings they give there can only make sense on a fast SSD or when the database they benchmark on is completely cached.

It's an apples-to-oranges comparison only if MDBM wins significantly against LMDB. If they are comparable in timing, or e.g. MDBM is 20% faster, then it would be an apples-to-apples comparison: MDBM having a 20% speed advantage, and LMDB having every other possible advantage (memory safety, ACIDity, ordered retrieval, multiple databases, etc.)

LMDB is truly, incredibly, really marvelous. On 64-bit it comes close to being the end-all-be-all local KV-store. If your databases are not more than a few tens of megs each, the same is true for 32-bit processors as well.


> If your databases are not more than a few tens of megs each, the same is true for 32-bit processors as well.

Why is that? Shouldn't 32-bit processors give you enough address space, in the range of hundreds of MiB?


Sure. But you only have 2-3GB of usable address space, and if your app has a lot of other variables to work with, there's not much space left over.

I should note that in LMDB 1.0 we'll have dynamic unmapping and remapping, to allow 32-bit machines to work with larger DBs. (There's still a significant performance cost for this. It's only being done to allow folks to use the same code on 32 and 64.)


Off-topic: will you have something like a networked KV store, aka Redis? What do you think of Redis or Cassandra?


If you're interested in building a distributed DB on top of LMDB, I took care of part of the problem (replication/consistency) in my flotilla library at

https://github.com/jbooth/flotilla

Basically I'm just layering the raft consistency algorithm on top of LMDB. Both systems single-thread write transactions for consistency, so there's some mechanical sympathy. Doesn't mandate any specific data model or even a client-server network transport, it's basically a replicated embedded DB. Anyone could build replicated redis on top of it or a Cassandra clone if they want to get into managing shards/rings.

Sample app (still WIP) at https://github.com/jbooth/merchdb


For redis, there's already ardb, ledisdb, redis-NDS, and a few other redis clones that support LMDB. We also already have memcacheDB on LMDB, as well as HyperDex with LMDB.

I haven't looked closely at Cassandra since it's in java, and after I didn't find a simple backend plugin API I didn't look any further.


Yes, I am quite impressed by it as well.


I'm not very comfortable with storage engines that build directly on memory-mapped files. MongoDB's current storage engine is mmap-based and it's suboptimal at best, which is undoubtedly part of the reason they're building a completely new storage engine now (WiredTiger).


Using mmap well takes great care. MongoDB was careless. There's good reason to believe the MDBM designers were careless too.


MDBM guy here. Care to elaborate on what we got wrong?



Umm... the lack of transactional integrity is part of the mdbm design. So it's only "careless" in the strictest sense of the term (the designers explicitly wanted to exploit not having to care about it).

mdbm is certainly not without limitations, but is careful about its use of mmap to an extent that comparisons with MongoDB are laughable.


I have already admitted my obvious biases, but seriously - when you design such a trivially corruptible system, you can't call it a persistent database; it's at most a cache. It won't survive a system crash at all, it probably won't survive an application crash intact either. To call it persistent is laughable.


> I have already admitted my obvious biases, but seriously - when you design such a trivially corruptible system, you can't call it a persistent database; it's at most a cache. It won't survive a system crash at all, it probably won't survive an application crash intact either. To call it persistent is laughable.

There are a number of use cases where in the event of a node failure it is better to rebuild from a replica or a log. Statistically, the RAM on another host is actually more reliable than local storage. Additionally, the database does have sync'ing primitives that allow for a variety of persistence strategies... just not the traditional ACID strategy.

In practice, there are lots of cases where the freedom to ignore transactional integrity is very handy, and yes, a cache would definitely be one of them.


If DB updates are not atomically visible, then no sync'ing strategy will protect you from corruption during a crash.

So in other words, you're saying "MDBM is not a persistent database." Glad that's clear. Totally agree, there are probably lots of use cases for it. But persistent data store isn't one of them.


> If DB updates are not atomically visible, then no sync'ing strategy will protect you from corruption during a crash.

Who said anything about them not being atomically visible? Not being ACID doesn't mean "none of the above", and atomicity in an embedded database is practically a footnote in a distributed system anyway, because you're looking for atomicity on a much grander scale. You can impose whatever locking strategy you want around the memory to ensure updates are atomically visible. It's an embedded DB so consistency is in the hands of the application logic. Traditional ACID terms for isolation and durability aren't there, but for say a distributed system you'd do that in the network layer.

If you have multiple replicas, if one node dies you don't try to recover its data (oddly, storage corruption and node failure have a high co-occurrence rate ;-), you just wipe and pull from the other nodes. If you can't get a clean, consistent state from a node that hasn't failed, you've got bigger problems than your embedded database. As I pointed out, pulling from other nodes is often faster and more reliable than pulling from disk anyway.

In general, with a high-performance network application, it is often preferable to address these issues somewhere other than in your embedded database.

So what is the corruption you are envisioning?


MDBM is a hash, a pretty scalable one and fairly lightweight. It is so fast that a common usage is to log updates to a log file and just rebuild the hash from that after a reboot. We've been doing that for years; it works well for us.

I'm guessing you are one of the people behind some other technology. Goody for you but do you really think you make your case by dissing anything else? If you have a solution that works for you, great. SGI, Yahoo, and other companies have found a use for MDBM. SGI was using it ~20 years ago and at the time there was nothing that came close to the same performance.


It doesn't mean he's wrong, but hyc_symas is the CTO of Symas, who produce the Lightning Memory-Mapped Database, so part of his business is in direct competition. He does dance around this a bit below, but I think how aggressively he's trashing MDBM with broad & unsupported statements in this thread doesn't really speak well of his company or his personal integrity. You can have a technical disagreement without looking like you're spreading FUD in place of valid technical arguments. This isn't how you do that.


I stated up front that I am totally biased. I provided multiple links backing up my statements. Nothing I've said here is unfounded.

The only FUD here is advertising a piece of software as a high performance embedded database when in fact it's not suitable for such use on its own. The most viable use case the authors have presented is using MDBM as part of a larger distributed system such that the loss of a single DB instance isn't fatal. The above comment talks about restoring from a log, but the actual log mechanism isn't part of MDBM. I.e., MDBM is incomplete on its own and you must provide additional pieces in order to use it effectively.

I'm not solely interested in promoting my own DB. If you read the On-Disk microbenchmarks I linked you'll see that there's a broad range of use cases where LMDB gets trounced by LevelDB and other LSMs. I'm interested in facts.

From the original link:

"On clean shutdown of the machine, all of the MDBM data will be flushed to disk. However, in cases like power-failure and hardware problems, it’s possible for data to be lost, and the resulting DB to be corrupted. MDBM includes a tool to check DB consistency. However, you should always have contingencies. One way or another this is some form of redundancy…"

The fact is, this is a system that can lose data on a crash and it doesn't include its own recovery mechanism. Without such a mechanism you can't call it a persistent data store because MDBM by itself is not persistent.


> The fact is, this is a system that can lose data on a crash and it doesn't include its own recovery mechanism. Without such a mechanism you can't call it a persistent data store because MDBM by itself is not persistent.

Persistence != ACID

By your definition, ext2 is not persistent. Give it a break.


ext2 comes with its own recovery system: fsck.


The yahoo people claim to have the same thing for mdbm. As someone who worked on it, I can easily see how you would write such a thing, it's way more trivial than fsck (and I'm a file system guy, I'd much rather write a mdbm_fsck than a file system fsck).


They don't just claim to have the same thing for mdbm, it's right there in the project.

Sounds like you are shooting off without really understanding what you are criticizing.


That explains a lot, I had guessed as much. hyc_symas should give it a rest, mdbm isn't competing with an actual DB, it's a memory mapped hash, a fast one and useful for a lot of things, but it's not a DB. I dunno where he got that idea from in the first place.


> I dunno where he got that idea from in the first place.

You really started posting on this comment thread without reading the title of the article?


Lots of things are called a database without having database functionality. man gdbm: GDBM - The GNU database manager.

So you are spouting all your disdain based on a title? Just curious, have you even run any version of mdbm? Or are you just spitting out baseless opinions?

I'm sure your code is awesome and all but man, you are defensive. If your code is that great let it speak for itself. Bashing stuff that isn't even trying to compete with you makes you look pretty insecure. You remind me of me when I was younger; that's not a compliment, I was a raging asshole.


Don't be silly. My opinions are never baseless.

We used MDBM in OpenLDAP for a few years on SGI Irix. I haven't touched it myself in something like 9 or 10 years though. http://www.openldap.org/lists/openldap-devel/199903/msg00094...

http://www.openldap.org/lists/openldap-devel/200501/msg00053...

Using DBM-style DBs was an endless nightmare of corruption bugs. That technology belongs firmly in the distant past, we have better solutions today.

http://www.openldap.org/lists/openldap-software/200607/msg00...

I'm not bashing MDBM because it's a competitor; it obviously isn't a competitor. Heck the only reason I'm commenting in this thread is because you asked me to elaborate on its design flaws. Now you're hurt that I answered your question. Don't ask if you don't want to know the answer.

... seems like us folks in OpenLDAP weren't the only ones to have bad experiences with it. This guy in this same discussion seems to share the basic sentiment. https://news.ycombinator.com/item?id=8733819


Where did I say your opinions are baseless? Edit. Huh, yeah I did. In the same place where I asked if you had ever used mdbm or are you spouting baseless opinions. So have you used it? All the links you provided talk about various DBM backends, not one of them point to mdbm as a problem. So again, have you used it or are you in fact stating baseless opinions?

I'm sorry the OpenLDAP folks couldn't figure out how to use MDBM in a way that worked for them. Yahoo clearly has. We have. Others have. It works extremely well for what it was designed for, and the perf numbers show that it spanks the living crap out of your stuff. But your stuff does more, which is cool. Why don't you focus on that rather than trying to crap all over some useful code? It's clearly not a threat to you. I don't see how the world is well served by your comments. For certain problems MDBM is way the hell better than what you have. It's a narrow niche, doesn't compete with you, so why all the fuss?

As for me asking you to comment: yup, I did, but you were already well on your way to bashing technology you clearly didn't understand. I think I get why it didn't work for you; your comments have made it clear you have no idea how it works. I went and read all the links you provided above, and not one mentioned mdbm having corruption bugs. It's entirely possible that the other DBM-style dbs had bugs, or it is possible that people were inserting/deleting in a first/next loop, whatever. But pointing a finger at MDBM and claiming corruption bugs: how about you substantiate that claim? It seems somewhat flawed when multiple companies have used it for 10 years or more and it seems to work fine for them.


You were already losing credibility due to your inattention to detail. I never mentioned performance anywhere in this thread. It's stupid to even talk about it since a hash doesn't do ordered lookups. But now that you're raising the topic: https://github.com/hyc/leveldb/tree/benches/doc/bench

"Spanks the living crap" - must be some heretofore unknown definition of "spanks".

  ./db_bench_mdbm
  MDBM:       version 4.11.1
  Date:       Mon Dec 15 04:39:20 2014
  CPU:        4 * Intel(R) Core(TM)2 Extreme CPU Q9300  @ 2.53GHz
  CPUCache:   6144 KB
  Keys:       16 bytes each
  Values:     100 bytes each (50 bytes after compression)
  Entries:    1000000
  RawSize:    110.6 MB (estimated)
  FileSize:   62.9 MB (estimated)
  ------------------------------------------------
  fillrandsync :      40.627 micros/op 24614 ops/sec;    2.7 MB/s (1000 ops)
  65604	/tmp/leveldbtest-1000
  fillrandom   :      16.356 micros/op 61137 ops/sec;    6.8 MB/s
  122056	/tmp/leveldbtest-1000
  fillrandbatch :       5.499 micros/op 181850 ops/sec;   20.1 MB/s
  121936	/tmp/leveldbtest-1000
  fillseqsync  :      40.163 micros/op 24898 ops/sec;    2.8 MB/s (1000 ops)
  65604	/tmp/leveldbtest-1000
  fillseq      :      16.424 micros/op 60886 ops/sec;    6.7 MB/s
  175724	/tmp/leveldbtest-1000
  fillseqbatch :       5.648 micros/op 177041 ops/sec;   19.6 MB/s
  175724	/tmp/leveldbtest-1000
  overwrite    :      16.290 micros/op 61385 ops/sec;    6.8 MB/s
  175724	/tmp/leveldbtest-1000
  readrandom   :       0.598 micros/op 1672444 ops/sec; (1000000 of 1000000 found)
  readseq      :       0.096 micros/op 10370216 ops/sec; 1147.2 MB/s

  ./db_bench_mdb
  LMDB:       version LMDB 0.9.14: (September 20, 2014)
  Date:       Mon Dec 15 04:41:37 2014
  CPU:        4 * Intel(R) Core(TM)2 Extreme CPU Q9300  @ 2.53GHz
  CPUCache:   6144 KB
  Keys:       16 bytes each
  Values:     100 bytes each (50 bytes after compression)
  Entries:    1000000
  RawSize:    110.6 MB (estimated)
  FileSize:   62.9 MB (estimated)
  ------------------------------------------------
  fillrandsync :      12.818 micros/op 78015 ops/sec;    8.6 MB/s (1000 ops)
  224	/tmp/leveldbtest-1000/dbbench_mdb-1
  224	/tmp/leveldbtest-1000
  fillrandom   :       4.275 micros/op 233923 ops/sec;   25.9 MB/s
  116548	/tmp/leveldbtest-1000/dbbench_mdb-2
  116548	/tmp/leveldbtest-1000
  fillrandbatch :       3.490 micros/op 286502 ops/sec;   31.7 MB/s
  126384	/tmp/leveldbtest-1000/dbbench_mdb-3
  126384	/tmp/leveldbtest-1000
  fillseqsync  :      14.972 micros/op 66791 ops/sec;    7.4 MB/s (1000 ops)
  172	/tmp/leveldbtest-1000/dbbench_mdb-4
  172	/tmp/leveldbtest-1000
  fillseq      :       2.231 micros/op 448145 ops/sec;   49.6 MB/s
  125872	/tmp/leveldbtest-1000/dbbench_mdb-5
  125872	/tmp/leveldbtest-1000
  fillseqbatch :       0.425 micros/op 2355457 ops/sec;  260.6 MB/s
  125872	/tmp/leveldbtest-1000/dbbench_mdb-6
  125872	/tmp/leveldbtest-1000
  overwrite    :       4.881 micros/op 204857 ops/sec;   22.7 MB/s
  125872	/tmp/leveldbtest-1000/dbbench_mdb-6
  125872	/tmp/leveldbtest-1000
  readrandom   :       1.166 micros/op 857624 ops/sec; (1000000 of 1000000 found)
  readseq      :       0.059 micros/op 17092556 ops/sec; 1890.9 MB/s
  readreverse  :       0.042 micros/op 23814626 ops/sec; 2634.5 MB/s
Feel free to submit a patch if I got anything wrong in that driver; it was a pretty hasty patch. This is running on tmpfs, so no I/O involved. MDBM is faster on random read, which is what you'd expect since it's a hash and doesn't have to navigate down a tree to locate a record. Aside from that, it's pretty pedestrian.


From the OP's posting:

  Test             MDBM     LevelDB   KyotoCabinet   BerkeleyDB
  Write Time       1.1 μs   4.5 μs    5.1 μs         14.0 μs
  Read Time        0.45 μs  5.3 μs    4.9 μs         8.4 μs
  Sequential Read  0.05 μs  0.53 μs   1.71 μs        39.1 μs
  Sync Write       2625 μs  34944 μs  177169 μs      13001 μs
As to losing credibility: I'm semi-retired, I stopped trying to impress people a decade ago. If I were you I'd be more worried about your own image; bad-mouthing other people's tech while demonstrating you don't understand it hasn't made you look good to at least a few people here.

It wouldn't be that hard to say "MDBM is great when used as an index into a DB, it works just fine for that. But you are going to have to rebuild the index after a power failure unless you take care to flush the data. It's somewhat unfair that the OP compared against my database because mine is slower because it handles crashes."

Instead you come out with "MDBM is complete crap". Well, no, it's not. In the domain where it is useful it is actually quite useful; it's 10x faster for lookups than your DB. So the trade-off is speed vs. surviving reboots. For lots of people, speed is much more important. Machines don't crash every ten minutes; in fact, it's pretty common to see uptimes in the hundreds of days. Let's say that 100 days is average, and let's say that it takes a full day to rebuild the MDBM. So MDBM is delivering 99/100 days of useful work. You deliver 100/100 days. Oh, wait, except that your useful work is running 10x slower if the DB is being used as an index. So you delivered 10/100 days. See why some people may prefer to use MDBM when it is put like that? Performance is a feature.


Nice try. Performance is a side-effect, efficiency is a feature. Nothing you can find will show MDBM to be 10x faster than LMDB. You must be talking about LevelDB.

Look at the results I posted again, and look at the benchmark code. Tell me that I've made a mistake, that's fine. The thing about open source is there's no reason to BS, anyone can build and run it and see for themselves. LMDB is faster and its data is more compact than MDBM, so you can get more work done using less resources, and you don't have to worry about losing your work for unexpected downtime.

Hashes suck for large volumes of data. That's just the reality of it, plain and simple: Low storage efficiency, memory-intensive, and cache-unfriendly. Whether people like or dislike me personally for saying so doesn't change the facts.

As for uptime - sure, and my PCs have uptimes for hundreds of days too. But I'd be a fool to just take that for granted and not take regular backups. The problem with your 99/100 days math is that you can't actually account for the cost of a crash that way. It might only set you back to 0/100; if you're unlucky it will set you back to -100 or more.


You're right, I was talking about LevelDB. But in your benchmark MDBM is 2x faster on the code path it cares about, random reads. Lots of people pay attention to a factor of 2.

Any idea why it is that much better than LMDB?

I actually agree with you in that hashes sort of suck, just look at Git when the repo gets big, a hash is a miserable way to traverse all that data, very cache unfriendly.

But the use cases for MDBM are the same, you have lots of keys and you want to get to any key very quickly. It appears to me that it (still, 20 years later) wins that race.


> Any idea why it is that much better than LMDB?

Yes, for exactly the reasons you'd expect:

  mdb_stat /tmp/leveldbtest-1000/dbbench_mdb-6/
  Status of Main DB
    Tree depth: 4
    Branch pages: 204
    Leaf pages: 31250
    Overflow pages: 0
    Entries: 1000000

As you said, MDBM can find any record in 2 seeks; for this database the LMDB tree height is 4, so any random access takes 4 seeks. 2x perf difference.

On a larger DB we would expect MDBM's random read perf advantage to get larger as well, until the DB exceeds the size of RAM. I've tried to duplicate my http://symas.com/mdb/ondisk/ tests with MDBM but it makes XFS lose its mind by the time the DB gets to 2x the size of RAM. First I had to increase the MDBM page size from my default of 4KB to 128KB, otherwise I'd see a lot of this in the output:

  2014/12/16-13:13:29 ... thread 0: (200000,12200000) ops and (3436.0,8072.2) ops/second in (58.206750,1511.358831) seconds
  3:54903002:dc057:00864 mdbm.c:1809 MDBM cannot grow to 33554432 pages, max=16777216

but on this VM with 32GB RAM, every time the DB hit 60GB in size the kernel log would start getting spammed with

  Dec 16 22:19:19 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:21 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:23 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

So thus far I've been unable to load the 160GB DB to reproduce the test. But this underscores the basically uncacheable nature of hashes - once the directory gets too big to fit in RAM, your "2 seeks per access" goes out the window because the kernel is thrashing itself trying to keep the whole directory in-memory while finding the requested data pages.

B+tree performance degrades gracefully as data volumes increase. Hash performance falls off a cliff once you cross the in-memory threshold.


So for creating the DB you are correct, though you should be able to size the DB such that you aren't splitting and copying all the time, if you have an idea of how much data you are going to put in.

The "2 seeks per access" absolutely does NOT go out the window. The whole point was that this works for any size DB. It's a function of the size of the pages, the keys, and the values, for a run with key+val @ 20 bytes and 8K pages, 25 million entries had a directory of 16KB. You hash the key, walk however much of the 16KB you need to find the page, and then you go to that page and only that page.

2 seeks for any lookup, for any sized DB. The directory layout could be better, but you walk it sequentially (hell, map it and lock it in, then it's 1 seek for any lookup).
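
For illustration, the two-seek lookup described above can be sketched as a toy (assumed layout, not mdbm's actual code): hash the key, use a few top bits to index a small in-memory directory, and go straight to the one page that can hold the key.

```python
# Toy sketch of a directory-of-pages hash lookup (not mdbm's real layout).
# Seek 1: the directory entry; seek 2: the page it names. Nothing else
# is touched, no matter how many entries the store holds.
import hashlib

DIR_BITS = 4
directory = list(range(2 ** DIR_BITS))    # entry i -> page number (toy: identity)
pages = [dict() for _ in range(2 ** DIR_BITS)]

def page_of(key: bytes) -> int:
    h = int.from_bytes(hashlib.sha1(key).digest()[:4], "big")
    return directory[h >> (32 - DIR_BITS)]    # top hash bits index the directory

def put(key: bytes, val: bytes) -> None:
    pages[page_of(key)][key] = val

def get(key: bytes):
    return pages[page_of(key)].get(key)

put(b"hello", b"world")
print(get(b"hello"))   # b'world'
```

Growing the store means splitting pages and doubling the directory, which is where the write-path cost discussed below comes from; the read path stays two seeks.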

The reason you are thrashing XFS so much is that we're growing the directory and the number of pages. Each time you split a page you have to copy into the new pages.


For completeness' sake... I've tried a dozen runs at this already, looking for tuning parameters to tweak. It always starts well and then degrades rapidly before finally hanging.

  MDBM:       version 4.11.1
  Date:       Tue Dec 16 20:34:36 2014
  CPU:        16 * Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
  CPUCache:   20480 KB
  Keys:       16 bytes each
  Values:     2000 bytes each (1000 bytes after compression)
  Entries:    76800000
  RawSize:    147656.2 MB (estimated)
  FileSize:   74414.1 MB (estimated)
  ------------------------------------------------
  2014/12/16-20:34:38 ... thread 0: (200000,200000) ops and (128923.6,128923.6) ops/second in (1.551306,1.551306) seconds
  2014/12/16-20:34:39 ... thread 0: (200000,400000) ops and (110571.0,119044.1) ops/second in (1.808792,3.360098) seconds
  2014/12/16-20:34:42 ... thread 0: (200000,600000) ops and (87921.9,106480.3) ops/second in (2.274746,5.634844) seconds
  2014/12/16-20:34:44 ... thread 0: (200000,800000) ops and (107459.0,106723.3) ops/second in (1.861175,7.496019) seconds
  2014/12/16-20:34:46 ... thread 0: (200000,1000000) ops and (90923.1,103138.7) ops/second in (2.199660,9.695679) seconds
  2014/12/16-20:34:50 ... thread 0: (200000,1200000) ops and (51852.2,88542.6) ops/second in (3.857118,13.552797) seconds
  2014/12/16-20:34:53 ... thread 0: (200000,1400000) ops and (57516.7,82207.6) ops/second in (3.477253,17.030050) seconds
... by the end

  2014/12/16-21:18:19 ... thread 0: (200000,22400000) ops and (858.0,8541.6) ops/second in (233.094661,2622.447456) seconds
  2014/12/16-21:19:19 ... thread 0: (200000,22600000) ops and (3321.7,8424.5) ops/second in (60.210362,2682.657818) seconds
  2014/12/16-21:20:28 ... thread 0: (200000,22800000) ops and (2880.3,8284.6) ops/second in (69.437524,2752.095342) seconds
  2014/12/16-21:21:31 ... thread 0: (200000,23000000) ops and (3203.6,8171.9) ops/second in (62.429181,2814.524523) seconds
then a stream of these starts showing up in dmesg

  Dec 16 21:14:09 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 21:14:30 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 21:14:32 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 21:14:34 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 21:14:36 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
and on and on until I kill the job.

  Dec 16 22:19:01 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:03 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:05 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:07 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:09 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:11 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:13 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:15 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:17 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:19 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:21 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:23 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 17 00:14:16 localhost systemd: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 1372 (vmtoolsd)
LMDB does this load in 8 minutes at an effective data rate of ~283MB/sec. Pretty much as fast as the hardware will stream (peak throughput of 300MB/sec in this VM). I don't see how you can possibly use MDBM as an unreliable data store if it could potentially take hours to reload the data.


Just to be clear - the loading job keeps running when the kernel error messages first start happening at 21:14:09. But after 21:21:31 the loading job makes no more progress. The kernel error messages continued until 22:19:23 when I killed the job, but nothing further got written to the DB in that intervening hour (based on looking at the filesize, which admittedly may not tell the whole story). The job stats were set to output once per 200,000 ops. It's of course possible that it was still making some progress, but dropping from 3203 ops/second to something less than 55/second is pretty drastic, and at that rate it would have taken weeks to complete the load.

As another note, I also tried using mdbm_pre_split() at the beginning of the job. That churned for 10 minutes before adding any records, then processed at its normal speed, and then quickly degraded into the same state.


I apologize. I reread my comments and they come off as downright incendiary which was not what I intended.

At least back when I was using it, there was a bit of a science to tuning mdbm correctly for your data. In general, it is a tool designed to let you tweak with whatever application-level knowledge you have of your data's structure, and as a consequence it can be terribly suboptimal out of the box. Depending on circumstances, even if you don't need disk-level transactional integrity, LMDB may indeed be a better choice.

That said, the test case you are describing is definitely one for which mdbm is suboptimal. The results you are getting are if anything surprisingly good under the circumstances, and frankly I'd have never even considered using mdbm for that kind of workload (LMDB would definitely be one of the first choices I'd consider for it).

As you've mentioned, a hashtable-based key-value store is fairly suboptimal for disk-based storage (though depending on circumstance and how you tweak it, a hash table with mdbm can work surprisingly well with an SSD-based store), and the numbers you are presenting seem if anything better than I'd expect.

The reports you are seeing with XFS seem... odd, and almost feel like either a simple bug in how mdbm is talking to the mmap/VFS layer, or more likely a bug within XFS's implementation, but it doesn't seem like they are slowing you down.

In general, mdbm is most useful for storing and accessing compact rows in memory across potentially many processes with a random access pattern, which IMHO is not the problem space that you are testing and tackling with LMDB, LevelDB, or many of the others. One can (rightly) argue that that is a fairly narrow and simple problem space, but as with almost anything in computer science, doing even fairly narrow and simple things efficiently (and mdbm has a number of clever design choices that help it be efficient) and in an error free fashion is enough trouble that having a standard tool for solving that problem is terribly useful.

MDBM isn't really the only tool for solving that problem out there. Pretty much any shared memory hash table based solution may be a good fit for it, and there are alternative data structures that have desirable advantages over hash tables (critbit based structures are one of my favourite pets for such problems). Heck, in C++ I've used the Boost.IPC library's unordered maps for the job with reasonable results.

I bet there are probably some implementations that perform better than mdbm in certain cases (I'd actually be interested in benchmarks comparing that kind of workload with other tools designed for that problem space). Still, the mdbm codebase is battle-hardened and really does perform well as long as your data set size and access patterns don't cause thrashing of the page store.


I think the comparison was totally appropriate because the article specifically highlighted mdbm's design limitations. The whole point was that given the compromises mdbm makes, it is possible to yield significant performance gains, so if those design limitations fit your context (and it cited 4 specific examples where Yahoo uses them), you ought to consider it.


My point, and the reason I will still say MDBM is garbage, is that you can get mmap/zero-copy performance without sacrificing reliability. Only a fool would argue for the performance without reliability when they can get both. And with a lower total system cost.


> Only a fool would argue for the performance without reliability when they can get both. And with a lower total system cost.

I think that statement is absolutely true. I just think you have a myopic view about how to achieve performance and reliability. Look at Yahoo's example use cases.


> For lots of people, speed is much more important. Machines don't crash every ten minutes.

Even if they do, it is often faster and more reliable to recover from the RAM of a surviving node than to try to recover from the disk on the crashed system.

> lets say that it takes a full day to rebuild the MDBM

I can't imagine a circumstance where it'd take even an hour to rebuild an MDBM from raw data.


> This is running on tmpfs, so no I/O involved.

Hmm... I think that might alter the results actually.

> which is what you'd expect since it's a hash and doesn't have to navigate down a tree to locate a record.

mdbm uses a hash to select a page, but it actually does store the keys within a page in order. It's kind of a funky mix.


I'm curious to see what the total timings would be like to get the data into a usable form, as opposed to just fetching a record from a data store. As noted, these data stores just store and retrieve data and don't do things like joins or ordering.

Could there be a comparison between these datastores and the traditional ACID-compliant databases when it comes to retrieving actual data in a useful format? E.g. perhaps doing a join or an ordering of some sort? I don't expect databases (e.g. Oracle, MS SQL Server, DB2) to be faster in raw performance, but I do expect them to be faster in terms of total development time and bug fixing, since the application developer wouldn't have to do the locking, page pinning/unpinning, etc. manually.


Let the horrors of MDBM not get to you. I've used it when I worked at Yahoo, and the client support for Java etc. sucks.


Could not install this on Ubuntu 12.04 - basic commands are failing. I think they tested only on BSD?

  ln -s -f -r /tmp/install/lib64/libmdbm.so.4 /tmp/install/lib64/libmdbm.so
  ln: invalid option -- 'r'
  Try `ln --help' for more information
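
If I remember right, `-r`/`--relative` only appeared in GNU coreutils 8.16, which is newer than what 12.04 ships. The flag just computes the target path relative to the link's directory, which can be done by hand (a sketch in Python for illustration, using the paths from the error above):

```python
import os

# What ln -r would compute: the target path relative to the link's directory.
target = "/tmp/install/lib64/libmdbm.so.4"
link = "/tmp/install/lib64/libmdbm.so"
rel = os.path.relpath(target, os.path.dirname(link))
print(rel)  # libmdbm.so.4
# os.symlink(rel, link) would then create the same relative symlink,
# with no dependency on newer coreutils.
```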


Thoughts on using this as a cache instead of memcache or redis? Yes, it does not have nearly as many features or functions, but when raw performance is needed I could see this working (given an API for using this via Node.js, PHP, etc.).


You'd be better off using MemcacheDB/LMDB http://symas.com/mdb/memcache/


You seem to have written a great piece of software that you're very proud of. I don't understand the ownership very well, but if you're allowed, why don't you promote your database with its own site, like every little JavaScript library out there?

Good examples: http://duktape.org/ (it might seem silly but that right column makes people want to try it!), http://redis.io (I bet this page wins over many folks: http://redis.io/topics/twitter-clone)


Why even pay the cost of memory mapping if it's a transient embedded cache not shared between servers?

Just use a std::unordered_map, or better yet a tbb::concurrent_unordered_map, or whatever the equivalent is for your language.


Because it's shared between processes on the same server.


In theory STL implementations, if used with a custom allocator, should be able to pull this off... that's why the STL containers all have internal 'pointer' typedefs.

Practically speaking, Boost.Interprocess includes a shared memory hash table implementation. Boost Multi Index, which is a further generalisation of containers to allow the construction of database-like indexes, is also Interprocess compatible.

http://www.boost.org/doc/libs/1_57_0/doc/html/interprocess/a...


Because it will persist between restarts?


then it seems strange to call it a cache.


It could be a cache to a much slower backend. I pull a lot of stock price and other data to my server, which is slow - 200-1000ms per series. I cache it in postgres, which allows me to load it nearly instantaneously. It also allows me access to data while I'm disconnected.

Another reason to cache to disk is that you want to store more data than you have ram.


I don't want to brag, but there is also a DBM-inspired Java port. Its in-memory mode outperforms Java heap collections such as j.u.HashMap.


Can anyone say whether it would be hard to port it to Windows? Maybe there already is something for Windows that is as good as this?


I can't speak to the Yahoo version (they've wacked it), but the base mdbm that we still use today works fine on Windows, and has for years.


How is MDBM for concurrent access? How does it handle locking (i.e., one big lock that blocks everyone else, or key-level locking)?


So the SGI-owned code, which I don't have, did page-level locking. There are two kinds of locks: rd/wr on the directory, and rd/wr on a page. If you are inserting a key you get a read lock on the directory and a write lock on the page. If it fits in the page then you are done. So you can have lots of concurrent writers until a page is full and you have to split it. Bob Mende did that work, I think; you might track him down for details.
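
That scheme can be sketched roughly as follows (a toy model, not the SGI code; the RWLock here is a deliberately minimal readers-writer lock for illustration):

```python
import threading

class RWLock:
    """Minimal (toy) readers-writer lock: many readers or one writer."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._write = threading.Lock()
        self._readers = 0
    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._write.acquire()    # first reader blocks writers
    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._write.release()
    def acquire_write(self):
        self._write.acquire()
    def release_write(self):
        self._write.release()

NPAGES = 16
dir_lock = RWLock()                      # rd/wr lock on the directory
page_locks = [RWLock() for _ in range(NPAGES)]   # rd/wr lock per page
pages = [dict() for _ in range(NPAGES)]

def insert(key, val):
    dir_lock.acquire_read()              # shared: only reading the directory
    p = hash(key) % NPAGES
    page_locks[p].acquire_write()        # exclusive: mutating a single page
    pages[p][key] = val
    page_locks[p].release_write()
    dir_lock.release_read()
    # a page split would instead need dir_lock.acquire_write()

insert("key", "value")
```

The point of the split is that writers to different pages never contend; only a page split, which changes the directory, serializes everyone.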


How similar in performance is MDBM to GDBM (the GNU DBM)? They appear to be similar (if not identical) in functionality.


Not sure, but in my experience GDBM is a bit on the slow side. MDBM uses mmap(), so for that reason alone it should be faster.


MDBM was designed to be fast with special care taken on the lookup path. The goal was to do lookups with as few cache misses as possible. You can get to any key with at most two page faults.


They might appear similar, but that's just because they share the same DBM interface heritage.


Where is the original open source release from Silicon Graphics which Yahoo based this work on? Did they ever make one?


Nah, they didn't care and I didn't want to piss them off so I just handed the code to anyone who asked for it.


This is a very big deal, especially because of the BSD licensing.


Where does it say that this database is persistent?


It is memory-mapped, which means it is persisted to disk (perhaps confusing if you aren't familiar with mmap).
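
A minimal illustration of that point, nothing mdbm-specific, just a shared file-backed mapping:

```python
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
with open(path, "wb") as f:
    f.truncate(4096)                 # size the backing file first

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 4096)  # shared, file-backed mapping
    m[:5] = b"hello"                 # a plain memory write...
    m.flush()                        # msync: push dirty pages to the file
    m.close()

with open(path, "rb") as f:
    print(f.read(5))                 # b'hello' -- the write ended up on disk
```

Whether the data survives a crash depends on when the kernel (or an explicit msync) flushed the dirty pages, which is the durability caveat discussed elsewhere in the thread.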


How is this different than memcache?


Memcache is an in-memory cache. This is an on-disk key-value store.


Do people get annoyed by all the JavaScript frameworks and databases coming out, in regards to adoption from a company point of view? I mean, every other day a new database comes out and claims to be better in one way or another than something else, and then it's like "fuck, I picked X when now there's Y".

It seems over the last year technology has been growing more rapidly than any other period.

Fun times but so hard to keep track of everything!


For those old timers who did distributed systems work there's not that much new under the sun. I look at something like this and say "ah, a quirky and somewhat dangerous cache layer." What's different is bloggers promoting it as a "database."

While I'm sure someone out there will see this and say "wow, that's exactly what I need!" chances are that if you have these sorts of scale issues you're going to have to figure it out on your own.

I'd rather see a write-up of how they arrived at this particular conclusion than another non-database.


I have found the middle way is to stick to the traditional things for the most critical work, things where the abilities and problems are well understood, until some of the new things mature. Then, for less critical things that can run as separate services, try out the new things. That way people in a team also get exposure to the new technologies, and you experience first-hand the pros and cons of these technologies.

Some people are generally more optimistic about new things which solve some of the frustrations they have with the old things (or perhaps compensate for their lack of experience and understanding of the old things), without realising there will be new problems that come with the new solutions, problems that haven't even been identified yet, never mind workarounds or solutions for them.

However, I think the biggest, and most avoidable, risk is adopting the new thing for a problem you don't even have. A lot of distributed databases exist because you will have a lot of concurrent users accessing similar data, mostly related to themselves, at the same time, or because you have a lot of user data that you want to run big, ad-hoc searches and analysis across. If you don't have the scale of 'a lot' or 'concurrent' or 'across all the data' that these things are designed to address, you might not need the solution, and especially not all the trade-offs that come with it.

I agree it's an exciting time, and a time of discovery for everyone, but many things will be learned the hard way, and it's tricky to position yourself here, especially when there's pressure around you.


While I understand that feeling, I've come to realize it's an inevitability, that shouldn't affect your work.

At any given time, you either have a need/problem, or you don't. If you DO, you evaluate the current tech available, and hopefully select something that fits your needs. You build out around said tech, and if your choice was correct, that means it's either solving your problem, or on its way to.

If something comes along while you're implementing with your chosen solution, that looks similar, but better, it's only noise - because hey, you found a solution.

Just as we don't all re-write all of our code whenever a new language comes along (unless the thing in question was desperately in need of a re-write anyway) even if newer languages are nicer, we needn't switch DBs or frameworks for the same reasons.


Yes, I find it annoying because:

1) it's non-stop

2) there seldom seems to be anything truly novel in a broadly meaningful way (i.e. esoteric, if anything)

3) there is rarely an objective improvement on existing options

I no longer feel compelled to replace or adopt though, precisely for those reasons.


On the contrary, I find it awesome. Having choices is much better than not. It just shows the maturity of the entire tech ecosystem. Just take everything with a grain of salt, evaluate if any new tool/library matches your particular need/use case, do a quick proof of concept if so, and adapt if successful. Rinse and repeat. Change is good. Embrace it.


> It seems over the last year technology has been growing more rapidly than any other period.

I tend to agree with this statement. The entire stack appears to be going through a revolution.

The data layer in particular is seeing very rapid change after being largely (not entirely) static for decades.


Yahoo back to being an IT company?


interesting. follow



