
MDBM – High-speed database - threepointone
http://yahooeng.tumblr.com/post/104861108931/mdbm-high-speed-database
======
otterley
I'm so excited that they finally open-sourced this. It's relatively old tech
at Yahoo, stuff folks outside never got to see. It was difficult to explain to
later colleagues the stuff I knew about shared-memory databases because I
couldn't give them a frame of reference.

mdbm performance is even better on FreeBSD than Linux because FreeBSD supports
MAP_NOSYNC, which causes the kernel not to flush dirty pages to disk until the
region is unmapped. Perhaps mdbm's release will finally get the Linux kernel
team to provide support for that flag.

~~~
jzawodn
Same here. I remember wishing we could Open Source it back in the early 2000s.
Good to see this coming out so people can take a little credit for work they
did back in the day.

~~~
luckydude
I've been providing people with source all along. SGI was pretty pleasant
about letting me retain copyright on stuff like that.

~~~
cbsmith
Didn't Yahoo have copyright on a bunch of the modifications to mdbm?

~~~
luckydude
Yeah, they wacked it pretty hard and I don't have anything to do with that. I
asked them to call it YDBM but that idea came too late.

~~~
cbsmith
Yeah, YDBM was _another_ thing (looks a lot like Project Voldemort, except in
C, not Java) that maybe someday will be released.

------
justin66
This looks interesting. At this stage of the game a more meaningful benchmark
might involve LMDB, WiredTiger, and, yes, LevelDB.

~~~
hendzen
I don't think it's fair to benchmark MDBM against LMDB or WiredTiger: keys
are not kept in sorted order (so no range queries), there is no support for
transactions, and MDBM does not offer durability in the event of power loss.

MDBM is pretty much an optimized persistent hash table. LMDB and WiredTiger
aim to be full-fledged ACID compliant database storage engines with
functionality similar to that of BerkeleyDB or InnoDB.

~~~
hyc_symas
You make some good points. We benchmark LMDB against LevelDB and its
derivatives even though none of the LevelDB family offer ACID transactions.
([http://symas.com/mdb/ondisk/](http://symas.com/mdb/ondisk/) ) Despite this
fact, people will ask the question and try to make the comparison, so we run
those tests. It's silly, but most people seem to pay attention to performance
more than to safety/reliability.

From my totally biased perspective, MDBM is utter garbage. They use mmap but
make absolutely zero effort to use it _safely_. This was the biggest obstacle
to overcome in developing LMDB; I had a few lengthy conversations with the
SleepyCat guys about it as well. It's the reason it took 2 years (from 2009
when we first started talking about it, to 2011 first code release) to get
LMDB implemented. If you want to call something a "database" you have to do
more than just mmap a file and start shoving data into it - you have to exert
some kind of control over how and when the mapped data gets persisted to disk.
Otherwise, if you just let the OS randomly flush things, you'll wind up with
garbage. As Keith Bostic said to me (private email):

"The most significant problem with building an mmap'd back-end is implementing
write-ahead-logging (WAL). (You probably know this, but just in case: the way
databases usually guarantee consistency is by ensuring that log records
describing each change are written to disk before their transaction commits,
and before the database page that was changed. In other words, log record X
must hit disk before the database page containing the change described by log
record X.)

In Berkeley DB WAL is done by maintaining a relationship between the database
pages and the log records. If a database page is being written to disk,
there's a look-aside into the logging system to make sure the right log
records have already been written. In a memory-mapped system, you would do
this by locking modified pages into memory (mlock), and flushing them at
specific times (msync), otherwise the VM might just push a database page with
modifications to disk before its log record is written, and if you crash at
that point it's all over but the screaming."
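
The ordering Bostic describes is easy to sketch. Here is a minimal, hypothetical Python illustration (file paths and record format are made up): the log record is forced to disk with fsync _before_ the mapped page is deliberately flushed with msync, so a crash can never leave a changed page on disk without its log record.

```python
import mmap
import os
import struct

DB, LOG = "/tmp/wal_demo.db", "/tmp/wal_demo.log"  # hypothetical paths
PAGE = 4096

# Create a one-page "database" file and map it read-write.
with open(DB, "wb") as f:
    f.write(b"\0" * PAGE)
db_fd = os.open(DB, os.O_RDWR)
db = mmap.mmap(db_fd, PAGE)

log_fd = os.open(LOG, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

def commit(offset, value):
    # 1. Log record X hits disk first...
    rec = struct.pack("<II", offset, len(value)) + value
    os.write(log_fd, rec)
    os.fsync(log_fd)
    # 2. ...and only then may the page containing the change follow.
    db[offset:offset + len(value)] = value
    db.flush()  # msync: the page is flushed deliberately, not "randomly" by the VM

commit(0, b"hello")
```

A real engine would additionally pin the page (mlock) between the modification and the msync; this sketch only shows the write ordering.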

The harsh realities of working with mmap are what dictated LMDB's copy-on-
write design - it's the only way to ensure consistency with an mmap without
losing performance (due to multiple mlock/msync syscalls). None of these
design considerations are evident in MDBM.

LMDB's mmap is read-only by default, because otherwise it's trivial to
permanently corrupt a database by overwriting a record, writing past the end,
etc. MDBM's mmap is read-write, and the only "protection" you get is a doc
that tells you "be Vewwy vewwy careful!" Ridiculously sloppy.
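
For illustration, here is what a read-only mapping buys (a Python mmap stand-in with a hypothetical file, not LMDB's actual API): a stray write through the mapping is rejected immediately instead of silently corrupting the file.

```python
import mmap
import os

path = "/tmp/ro_map_demo.db"  # hypothetical path
with open(path, "wb") as f:
    f.write(b"record")

fd = os.open(path, os.O_RDONLY)
ro = mmap.mmap(fd, 0, prot=mmap.PROT_READ)  # POSIX-only flag; read-only mapping

err = None
try:
    ro[0:1] = b"X"      # the kind of stray write that corrupts a read-write map
except TypeError as e:  # CPython rejects writes to a read-only mmap
    err = e
```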

LMDB's design and implementation are proven incorruptible. MDBM (and LevelDB
and all its derivatives) are proven to be quite fragile.
[https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai)

Leaving reliability aside for a moment, there's also the issue of performance
and efficiency. We used to use DBM-style hashes for the indexes in OpenLDAP,
up to release 2.1. We abandoned them in favor of B-trees in OpenLDAP 2.2
because extensive benchmarking showed that BDB's B-trees were faster than its
hash implementation at very large data sizes. The fundamental problem is that
hash data structures are only fast when they are sparsely populated. When the
number of data records you need to work with increases to fill the table, you
start getting more and more hash collisions that result in lots of linear
probes (or whatever other hash recovery strategy you're using). The other
problem is that the very sparse/unordered nature of hashes makes them
extremely cache unfriendly - you get zero locality-of-reference for groups of
related queries. So as your data volumes increase, you get less and less
benefit from the amount of RAM you have available. When the data exceeds the
size of RAM, the number of disk seeks required for an arbitrary lookup is
enormous, and every read is a random access. Using a hash for a large-scale
data store is just horrible. (We tested this extensively a decade ago
[http://www.openldap.org/lists/openldap-devel/200401/msg00077.html](http://www.openldap.org/lists/openldap-devel/200401/msg00077.html) )
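
The locality argument is easy to demonstrate with a toy model (this is an illustration, not a measurement of BDB or MDBM): under a sorted, B-tree-like layout, adjacent keys are stored next to each other, while any decent hash scatters them across the table. Here, Fibonacci-style multiplicative hashing sends consecutive keys roughly half the table apart.

```python
NBUCKETS = 1 << 20
GOLDEN = 2654435761  # Knuth's multiplicative hashing constant

def bucket(k):
    return (k * GOLDEN) % NBUCKETS

keys = list(range(8))               # eight adjacent keys
hashed = [bucket(k) for k in keys]

# Sorted layout: adjacent keys are neighbors -> sequential access.
sorted_gaps = [1] * (len(keys) - 1)
# Hashed layout: adjacent keys land far apart -> every read is a random access.
hash_gaps = [abs(b - a) for a, b in zip(hashed, hashed[1:])]
```

Every entry in `hash_gaps` is hundreds of thousands of buckets, so a group of related queries touches that many distinct cache lines and pages.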

~~~
luckydude
You pretty clearly haven't used MDBM because the MDBM I worked on at SGI (and
still use to this day) gets to any key in two page faults (aka 2 disk seeks)
at the most. That was the whole point of it.

If you want I'll go shove a few GB into an mdbm, drop caches, and time a
lookup.

~~~
hyc_symas
If you've already ported the levelDB benchmark driver, feel free to send it to
me:
[https://github.com/hyc/leveldb/tree/benches/doc/bench](https://github.com/hyc/leveldb/tree/benches/doc/bench)

2 seeks at the most, are you talking about a 32 bit address space? The only
way that's possible in 64 bits is to direct map a hash into e.g. 2 32 bit
chunks and use the hash as an actual disk block address for the first chunk,
and an index into a block list for the 2nd chunk.

~~~
luckydude
2 seeks. Address space doesn't matter, you have one seek to read the directory
(I'm assuming 100% cold cache), and one seek to get to the page in question.

Not only that, we watched the bus on an SGI Challenge and counted cache misses
and TLB misses. 2 TLB misses to get a key.

Saying that it isn't possible on a 64 bit VM system makes no sense to me. If I
have a 2TB file and I seek to location A and read it, then seek to location B
and read it, you are saying that's not possible? Same thing with mmap, I set a
pointer to the mapping, read *p, p += <number>, read *p. Two seeks, two page
faults, whatever you want to call it, it does 2 and only 2 I/O's to get a
key/value (unless the pages are bigger than disk blocks but then those are
going to be sequential I/O's, no extra seeks).
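
The access pattern being described can be sketched like this (the layout, names, and one-record-per-page simplification are made up for illustration, not MDBM's actual format): one dereference into a mapped directory page to find the data page, then one dereference into that page for the record.

```python
import mmap
import os
import struct

PAGE = 4096
NBUCKETS = 16
path = "/tmp/two_seek_demo.db"  # hypothetical path

# Page 0 is the directory (bucket -> page number); pages 1..N hold records.
with open(path, "wb") as f:
    f.write(b"\0" * PAGE * (NBUCKETS + 1))
fd = os.open(path, os.O_RDWR)
m = mmap.mmap(fd, PAGE * (NBUCKETS + 1))

def h(key):
    return sum(key) % NBUCKETS  # stand-in hash

def put(key, val):
    page = 1 + h(key)
    struct.pack_into("<I", m, h(key) * 4, page)            # directory entry
    struct.pack_into("<16s16s", m, page * PAGE, key, val)  # one record per page, for brevity

def get(key):
    page = struct.unpack_from("<I", m, h(key) * 4)[0]      # fault/seek #1: the directory
    k, v = struct.unpack_from("<16s16s", m, page * PAGE)   # fault/seek #2: the data page
    return v.rstrip(b"\0") if k.rstrip(b"\0") == key else None

put(b"answer", b"42")
```

On a cold cache each of the two dereferences is at most one page fault, hence "two seeks".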

~~~
hyc_symas
I was actually thinking of a >2GB DB file on a 32 bit server. But leaving that
aside, it sounds like you're assuming a perfect hash function with no
collisions. If you have collisions, you have to deal with the possibility of a
hash bucket overflowing and requiring an additional seek.

Anyway, I don't doubt that you can operate in 2 seeks in the normal case.

~~~
luckydude
It is 2 seeks, at most, for 100% of lookups.

------
remon
I'm not very comfortable with storage engines that directly build on memory
mapped files. MongoDB's current storage engine is mmap based and it's sub
optimal at best which is undoubtedly part of the reason they're building a
completely new storage engine now (WiredTiger).

~~~
hyc_symas
Using mmap well takes great care. MongoDB was careless. There's good reason to
believe the MDBM designers were careless too.

~~~
luckydude
MDBM guy here. Care to elaborate on what we got wrong?

~~~
hyc_symas
See below.
[https://news.ycombinator.com/item?id=8734356](https://news.ycombinator.com/item?id=8734356)

~~~
cbsmith
Umm... the lack of transactional integrity is part of the mdbm design. So it's
only "careless" in the strictest sense of the term (the designers explicitly
wanted to exploit not having to care about it).

mdbm is certainly not without limitations, but is careful about its use of
mmap to an extent that comparisons with MongoDB are laughable.

~~~
hyc_symas
I have already admitted my obvious biases, but seriously - when you design
such a trivially corruptible system, you can't call it a persistent database;
it's at most a cache. It won't survive a system crash at all, it probably
won't survive an application crash intact either. To call it persistent is
laughable.

~~~
cbsmith
> I have already admitted my obvious biases, but seriously - when you design
> such a trivially corruptible system, you can't call it a persistent
> database; it's at most a cache. It won't survive a system crash at all, it
> probably won't survive an application crash intact either. To call it
> persistent is laughable.

There are a number of use cases where in the event of a node failure it is
better to rebuild from a replica or a log. Statistically, the RAM on another
host is actually more reliable than local storage. Additionally, the database
does have sync'ing primitives that allow for a variety of persistence
strategies... just not the traditional ACID strategy.

In practice, there are _lots_ of cases where the freedom to ignore
transactional integrity is very handy, and yes, a cache would definitely be
one of them.

~~~
hyc_symas
If DB updates are not atomically visible, then no sync'ing strategy will
protect you from corruption during a crash.

So in other words, you're saying "MDBM is not a persistent database." Glad
that's clear. Totally agree, there are probably lots of use cases for it. But
persistent data store isn't one of them.

~~~
luckydude
MDBM is a hash, a pretty scalable one and fairly light weight. It is so fast
that a common usage is to log updates to a log file and just rebuild the hash
from that after a reboot. We've been doing that for years, works well for us.
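
That pattern, sketched minimally (the file name is hypothetical): durability lives in an append-only log, and the hash itself is treated as disposable and simply replayed after a restart.

```python
import os

LOG = "/tmp/rebuild_demo.log"  # hypothetical path
if os.path.exists(LOG):
    os.remove(LOG)

def log_put(table, key, val):
    with open(LOG, "a") as f:       # the log, not the hash, is what survives
        f.write(f"{key}\t{val}\n")
        f.flush()
        os.fsync(f.fileno())
    table[key] = val                # the fast (in MDBM's case, mmap'd) hash

def rebuild():
    table = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition("\t")
                table[k] = v        # later entries win
    return table

live = {}
log_put(live, "a", "1")
log_put(live, "a", "2")
log_put(live, "b", "3")
recovered = rebuild()               # what you'd do after a reboot
```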

I'm guessing you are one of the people behind some other technology. Goody for
you but do you really think you make your case by dissing anything else? If
you have a solution that works for you, great. SGI, Yahoo, and other companies
have found a use for MDBM. SGI was using it ~20 years ago and at the time
there was _nothing_ that came close to the same performance.

~~~
nieve
It doesn't mean he's wrong, but hyc_symas is the CTO of Symas who produce the
Lightning memory-mapped database, so part of his business is in direct
competition. He does dance around this a bit below, but I think how
aggressively he's trashing MDBM with broad & unsupported statements in this
thread doesn't really speak well of his company or his personal integrity.
You can have a technical disagreement without looking like you're spreading
FUD in place of valid technical arguments. This isn't how you do that.

~~~
hyc_symas
I stated up front that I am totally biased. I provided multiple links backing
up my statements. Nothing I've said here is unfounded.

The only FUD here is advertising a piece of software as a high performance
embedded database when in fact it's not suitable for such use _on its own_.
The most viable use case the authors have presented is using MDBM as
part of a larger distributed system such that the loss of a single DB instance
isn't fatal. The above comment talks about restoring from a log, but the
actual log mechanism isn't part of MDBM. I.e., MDBM is incomplete on its own
and you must provide additional pieces in order to use it effectively.

I'm not solely interested in promoting my own DB. If you read the On-Disk
microbenchmarks I linked you'll see that there's a broad range of use cases
where LMDB gets trounced by LevelDB and other LSMs. I'm interested in facts.

From the original link:

"On clean shutdown of the machine, all of the MDBM data will be flushed to
disk. However, in cases like power-failure and hardware problems, it’s
possible for data to be lost, and the resulting DB to be corrupted. MDBM
includes a tool to check DB consistency. However, you should always have
contingencies. One way or another this is some form of redundancy…"

The fact is, this is a system that can lose data on a crash and it doesn't
include its own recovery mechanism. Without such a mechanism you can't call it
a persistent data store because MDBM _by itself_ is not persistent.

~~~
cbsmith
> The fact is, this is a system that can lose data on a crash and it doesn't
> include its own recovery mechanism. Without such a mechanism you can't call
> it a persistent data store because MDBM by itself is not persistent.

Persistence != ACID

By your definition, ext2 is not persistent. Give it a break.

~~~
hyc_symas
ext2 comes with its own recovery system: fsck.

~~~
luckydude
The yahoo people claim to have the same thing for mdbm. As someone who worked
on it, I can easily see how you would write such a thing, it's way more
trivial than fsck (and I'm a file system guy, I'd much rather write a
mdbm_fsck than a file system fsck).

~~~
cbsmith
They don't just claim to have the same thing for mdbm, it's right there in the
project.

Sounds like you are shooting off without really understanding what you are
criticizing.

------
PhuFighter
I'm curious to see what the total timings would be like to get the data in a
useable form - as opposed to just fetching a record from a data store. As
noted - these data stores just store and retrieve data and don't do things
like joins or ordering, etc.

Could there be a comparison between these datastores and the traditional ACID
compliant databases when it comes to retrieving actual data in a useful
format? E.g. perhaps doing a join or an ordering of some sort? I don't expect
databases (e.g. Oracle, MS SQL Server, DB2) to be faster in raw performance,
but I do expect them to be faster in terms of total development time and bug
fixing since the application developer wouldn't have to do the locking, page
pinning/unpinning, etc. manually.

------
chatman
Let the horrors of MDBM not get to you. I've used it when I worked at Yahoo,
and the client support for Java etc. sucks.

------
swah
Could not install this in Ubuntu 12.04 - basic commands are failing. I think
they tested only in BSD?

    ln -s -f -r /tmp/install/lib64/libmdbm.so.4 /tmp/install/lib64/libmdbm.so
    ln: invalid option -- 'r'
    Try `ln --help' for more information

------
coreymgilmore
Thoughts on using this as a cache instead of memcache or redis? Yes, it does
not have nearly as many features or functions but when raw performance is
needed I could see this working (given an api for using this via Node.JS, PHP,
etc.).

~~~
hendzen
Why even pay the cost of memory mapping if it's a transient embedded cache not
shared between servers?

just use a std::unordered_map, or better yet a tbb::concurrent_unordered_map
or whatever the equivalent is for your language

~~~
otterley
Because it's shared between processes on the same server.

~~~
nly
In theory STL implementations, if used with a custom allocator, should be able
to pull this off... that's why the STL containers all have internal 'pointer'
typedefs.

Practically speaking, Boost.Interprocess includes a shared memory hash table
implementation. Boost Multi Index, which is a further generalisation of
containers to allow the construction of database-like indexes, is also
Interprocess compatible.

[http://www.boost.org/doc/libs/1_57_0/doc/html/interprocess/a...](http://www.boost.org/doc/libs/1_57_0/doc/html/interprocess/allocators_containers.html#interprocess.allocators_containers.additional_containers.multi_index)

------
qwerta
I don't want to brag, but there is also a DBM-inspired Java port, and its
in-memory mode outperforms Java heap collections such as j.u.HashMap.

------
polskibus
Can anyone say whether it would be hard to port it to Windows? Maybe there
already is something for Windows that is as good as this ?

~~~
luckydude
I can't speak to the yahoo version, they've wacked it, but the base mdbm that
we still use today works fine on windows, has for years.

------
discardorama
How is MDBM for concurrent access? How does it handle locking (i.e., one big
lock that blocks everyone else, or key-level locking)?

~~~
luckydude
So the SGI-owned code, which I don't have, did page-level locking. There are two
kinds of locks, rd/wr on the directory, and rd/wr on a page. If you are
inserting a key you get a read lock on the directory and a write lock on the
page. If it fits in the page then you are done. So you can have lots of
concurrent writers until a page is full and you have to split it. Bob Mende
did that work I think, you might track him down for details.
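
The scheme described above can be sketched as follows (simplified, single-process, with a minimal readers-writer lock; names and page count are hypothetical): an insert takes the directory lock shared and the target page lock exclusive, so many writers can proceed concurrently as long as they hit different pages.

```python
import threading

class RWLock:
    """Minimal readers-writer lock (writer starvation is possible)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()
    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

directory_lock = RWLock()
pages = [dict() for _ in range(4)]
page_locks = [RWLock() for _ in pages]

def insert(key, val):
    page = hash(key) % len(pages)
    directory_lock.acquire_read()      # many inserters may share this...
    page_locks[page].acquire_write()   # ...but each page has one writer
    try:
        pages[page][key] = val         # fits in the page: done
    finally:
        page_locks[page].release_write()
        directory_lock.release_read()

insert("k1", "v1")
```

A page split would instead take the directory lock exclusive, blocking everyone until the directory is rewritten.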

------
mbrzusto
How similar in performance is MDBM to GDBM (the GNU DBM)? They appear to be
similar (if not identical) in functionality.

~~~
api
Not sure, but in my experience GDBM is a bit on the slow side. MDBM uses
mmap(), so for that reason alone it should be faster.

~~~
luckydude
MDBM was designed to be fast with special care taken on the lookup path. The
goal was to do lookups with as few cache misses as possible. You can get to
any key with at most two page faults.

------
i_am_ralpht
Where is the original open source release from Silicon Graphics which Yahoo
based this work on? Did they ever make one?

~~~
luckydude
Nah, they didn't care and I didn't want to piss them off so I just handed the
code to anyone who asked for it.

------
jwr
This is a very big deal, especially because of the BSD licensing.

------
swah
Where does it say that this database is persistent?

~~~
t1m
It is memory mapped, which means it is persisted to disk; that may be
confusing if you aren't familiar with mmap.

------
EGreg
How is this different than memcache?

~~~
pjscott
Memcache is an in-memory cache. This is an on-disk key-value store.

------
philliphaydon
Do people get annoyed by all the JavaScript frameworks and Databases coming
out in regards to adoption from a company point of view? I mean every other
day a new database comes out and claims to be better in one way or another
than something else, and then it's like "fuck, I picked X when now there's Y".

It seems over the last year technology has been growing more rapidly than any
other period.

Fun times but so hard to keep track of everything!

~~~
tacos
For those old timers who did distributed systems work there's not that much
new under the sun. I look at something like this and say "ah, a quirky and
somewhat dangerous cache layer." What's different is bloggers promoting it as
a "database."

While I'm sure someone out there will see this and say "wow, that's exactly
what I need!" chances are that if you have these sorts of scale issues you're
going to have to figure it out on your own.

I'd rather see a write-up of how they arrived at this particular conclusion
than another non-database.

------
extralam
Yahoo back to being an IT company?

------
extralam
interesting. follow

