

Berkeley DB: Architecture - codezero
http://ctrl-c.club/~ksikka/articles/01-berkeley-db/

======
bantunes
A more in-depth look at BerkeleyDB is available at AOSA
([http://www.aosabook.org/en/bdb.html](http://www.aosabook.org/en/bdb.html))

~~~
colanderman
I second this AOSA chapter. It's written by the original developers of
Berkeley DB, and is at times the only hope of understanding what goes on
inside the beast that is Berkeley DB. Unfortunately it was written before (or,
the original developers have no insight into) several new, and at times,
inscrutable features of BDB, such as MVCC (snapshots).

------
colanderman
For anyone thinking Berkeley DB, because it has transactions and logging, is a
good replacement for an out-of-process DB, the answer is _no_. Its embedded
nature means that any "client" of a shared DB is free to deadlock the whole
thing (if the client dies) or corrupt the database's structure (if the client
has a memory bug).

By the time you have a working robust multi-client system, you have completely
reinvented half of what an RDBMS provides, since you will have pushed most
functionality into a central server to avoid the lack of robustness its
embedded nature brings, to simplify coordination of recovery when it
(inevitably) crashes, and to centralize the rest of the infrastructure you
have to build around it.

Its tools are crap: they can't tell you how big a database is without traversing
every single entry, they hold locks while writing to the terminal (meaning you
can't pipe them into a pager without freezing the whole DB), and if one of
them crashes (due to e.g. a dropped SSH connection) they'll bring down the
whole DB due to stale locks.

Everything is underdocumented, especially the knobs you have to tweak to make
it not run out of memory. Its more "advanced" features are unusably buggy.
Partition traversal (by cursor) is completely broken, and if you try to use
partitions and key compression at once, things crash. MVCC inexplicably keeps
thousands of transactions open, causing out-of-memory issues.

Its C++ interface is a joke. Somehow they managed to take an "object-oriented"
C library, and jam it into C++ in a way that completely tosses all the
benefits of C++'s object-orientedness. RAII is _not_ a concept the developers
were familiar with: objects are created by factory functions and are freed
inconsistently by their "parent" object, by their own methods, or by their
destructors. Do the wrong thing, and you trample memory, likely
corrupting the database. This is surprisingly easy due to the lack of
documentation.

And my sympathy to anyone who tries to run SQLite on this bad boy… combined
with SQLite's shoddy track record interpreting any query more complicated than
a key-value lookup, it would be a wonder if you're able to get any data in and
out intact.

Sorry for the rant. Berkeley DB is many things, but a usable building block
for a reliable RDBMS is not one of them.

~~~
hyc_symas
Eh, you're right on some and wrong on many other points. An embedded key-value
store _can_ be a good foundation for an RDBMS - or many other data models. The
OpenLDAP Project used BerkeleyDB for more than 15 years. BerkeleyDB architects
still point to OpenLDAP as a reference standard for transactional BDB code.
[http://www.bozemanpass.com/services/bdb.html](http://www.bozemanpass.com/services/bdb.html)

As the primary author of OpenLDAP's BDB support, and of the LMDB library which
now supersedes BDB, I've got quite a deep perspective into this topic.

BDB is deadlock-prone, no argument there. But that doesn't mean the embedded
engine approach can't work. LMDB is deadlock-proof, and serves perfectly well
as the engine for an RDBMS as well as the engine for a hierarchical DB
(X.500/LDAP directory), or arbitrary graphs
([http://sph.io/content/2faf](http://sph.io/content/2faf) ,
[https://github.com/pietermartin/thundergraph](https://github.com/pietermartin/thundergraph)
, etc.). When you start with a model-less key/value store you can
implement _any_ other data model on top.

LMDB's robustness is, in a word, flawless. It's crash-proof by design and
multiple independent researchers have borne out the integrity of that design.
([https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai)
[https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai)
) Your assertion that you can't build a
robust multi-client system in a lightweight embedded system is flat wrong. In
fact, this approach is the _only_ way to get _maximum_ performance from a
software system. LMDB is so much faster than every other data management
mechanism in existence, nothing else even comes close.
[http://symas.com/mdb/#bench](http://symas.com/mdb/#bench)

Crap tools, crap locking - these are certainly weaknesses in BDB's locking
design. LMDB's locking design doesn't have these weaknesses. Querying the size
of the DB is an O(1) operation, simply reading a few words out of the DB
header.
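To make the O(1) point concrete, here is a toy sketch in Python of a hypothetical fixed-offset header layout (illustrative only, not LMDB's actual meta-page format): the sizes live at known offsets, so answering "how big is the DB?" is a couple of word reads, never a full traversal.

```python
import os, struct, tempfile

# Hypothetical header: magic, page size, page count, entry count.
HEADER = struct.Struct("<4sIQQ")

def write_header(path, page_size, n_pages, n_entries):
    with open(path, "wb") as f:
        f.write(HEADER.pack(b"DB01", page_size, n_pages, n_entries))

def db_size(path):
    # O(1): read a few words from the header, ignore every record.
    with open(path, "rb") as f:
        _magic, page_size, n_pages, n_entries = HEADER.unpack(f.read(HEADER.size))
    return page_size * n_pages, n_entries

path = os.path.join(tempfile.mkdtemp(), "toy.db")
write_header(path, 4096, 128, 1000)
print(db_size(path))  # (524288, 1000)
```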

Underdocumentation - I have no idea what you're talking about here. To me, BDB
was copiously documented, and I largely owe my understanding of transactional
systems to the BDB documentation. Yes, BDB is complicated and has too many
moving parts. LMDB doesn't have this flaw either. But I can't take anyone
seriously who says BDB was underdocumented.

C++ interface - I don't use C++, so no comment here.

C >> C++.

SQLite on BDB - yes, it's a joke. I've run it, and the performance is pretty
awful. But we do far better.
[https://github.com/LMDB/sqlightning](https://github.com/LMDB/sqlightning)

There are many positive and negative lessons in software design to be learned
from Berkeley DB. It really started life as a test bed, for learning. In that,
I think it succeeded admirably. Its authors were able to experiment with
extensible hashing, B+trees, ARIES logging, and a multitude of other important
software architecture and data management concepts. LMDB owes much of its
design to lessons learned from BDB.

LMDB's use of a read-only memory map by default means you can safely allow
multi-client access without fearing corruption due to memory bugs. LMDB's
exclusive use of mmap instead of complex application-level caching means you
avoid all of the horrors of BDB's cache tuning, bloat driving into swap space,
and other admin/tuning nightmares. LMDB's crash-proof persistent on-disk
format means you don't need to worry about error-prone and inherently
unreliable transaction logging mechanisms. All of these features are part of
the LMDB design because of our experience working with BDB over the past 15+
years.
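The read-only-map idea can be sketched with Python's stdlib mmap (this is the concept, not the LMDB API): readers map the file with ACCESS_READ, so a stray write through the mapping fails loudly instead of silently corrupting data shared with other processes.

```python
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"hello world")

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(m)      # zero-copy view into the OS page cache
    first = bytes(view[:5])   # read without any application-level buffer
    write_rejected = False
    try:
        m[0:1] = b"X"         # a buggy "client" scribbling on shared state...
    except TypeError:
        write_rejected = True  # ...is rejected by the read-only mapping
    view.release()
    m.close()
print(first, write_rejected)
```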

~~~
colanderman
…I don't quite understand why you rebutted my points about Berkeley DB with
points about LMDB.

The only point you actually rebutted without referring to LMDB was regarding
documentation. Yes, _some_ parts of BDB are copiously documented. Others are
not. How cursors interact (or, don't) with partitioning is not documented at
all. Some settings are underdocumented (e.g. the cache size setting doesn't
specify whether "gigabytes" means 10^9 or 2^30). What the cache priority
levels mean is barely even discernible from the source code. The documentation
for how to set parameters like the total number of mutexes and active
transactions to allocate boils down to "guess and check". Why enabling MVCC
causes transactions to linger for ages isn't explained anywhere, besides maybe
one mailing list post from 2008.

I'm glad to see that my experience with BDB is corroborated by that of the
developer of BDB's poster child (despite your unsubstantiated claim that I'm
"wrong on so many points").

That said, since you felt the need to claim it, I feel the need to ask how you
can reconcile the claim that "LMDB is so much faster than every other data
management mechanism in existence, nothing else even comes close" with the
fact that LMDB is Btree-based and thus must face significant performance
challenges with both sequential synchronous random writes, and with random
writes and deletes to/from databases whose working set does not fit in memory.

~~~
hyc_symas
As for performance challenges, "sequential synchronous random writes" doesn't
make sense to me. What exactly do you mean?

LMDB is read-optimized; for reads nothing else can touch it. For writes, there
are other engines that may be faster some of the time.

For sequential writes LMDB cheats. The cheat is described in my LMDB Internals
presentations. Nothing else can touch its sequential write speed.

For random writes - LMDB generally loses to other designs when records are
smaller than 1/2 a DB page. For larger, LMDB wins, and the margin grows as
record sizes grow.

~~~
colanderman
"Sequential synchronous random writes" means that you issue writes with random
keys in sequence, waiting for each to be synchronously written to durable
storage before writing the next. This is an important metric in certain
domains, and it is generally bound by the underlying hardware regardless of
database architecture (though certain designs can make it much worse).
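A minimal Python sketch of that workload (using stdlib sqlite3 purely as a convenient durable store; the key range, value size, and count are arbitrary): each write uses a random key and must be durably committed before the next one starts.

```python
import os, random, sqlite3, tempfile, time

path = os.path.join(tempfile.mkdtemp(), "bench.db")
db = sqlite3.connect(path)
db.execute("PRAGMA synchronous = FULL")    # force fsync on every commit
db.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v BLOB)")

keys = random.sample(range(1_000_000), 100)  # random keys, issued in sequence
start = time.perf_counter()
for k in keys:
    db.execute("INSERT INTO kv VALUES (?, ?)", (k, os.urandom(100)))
    db.commit()                              # blocks until durable on disk
elapsed = time.perf_counter() - start
print(f"{len(keys) / elapsed:.0f} synchronous writes/sec")
db.close()
```

On rotating disks this loop is bound by fsync latency, not by anything the database does in RAM, which is why it is a hardware-limited metric.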

How does LMDB outperform log-structured databases when writing page-sized
objects with random keys when the working set is larger than RAM? I'm
genuinely curious. Any other Btree-based database will perform at disk read
seek rates, at best.

~~~
hyc_symas
OK, for your sequential synchronous random write case, LMDB is fair-to-
middling. In fully-synchronous mode (which is the default) it takes 2 fsyncs
per transaction, so at least 2 IOPS per write. Many other DBs (including
SQLite, Berkeley, and Kyoto) take even more than 2 fsyncs per transaction, and
are slower than LMDB. Some others take only 1, and are much faster in this
workload.
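The two-fsync cost can be sketched as follows (an assumed commit protocol for illustration, not LMDB's actual code): the data pages and the commit record that makes them visible must each reach durable storage, in order, before the transaction is acknowledged.

```python
import os, tempfile

dirpath = tempfile.mkdtemp()
data = os.open(os.path.join(dirpath, "data"), os.O_CREAT | os.O_WRONLY)
meta = os.open(os.path.join(dirpath, "meta"), os.O_CREAT | os.O_WRONLY)

def commit(record: bytes, txn_id: int):
    os.write(data, record)
    os.fsync(data)  # fsync #1: the new data pages are durable
    os.write(meta, txn_id.to_bytes(8, "little"))
    os.fsync(meta)  # fsync #2: the commit point referencing them is durable

commit(b"key=value", 1)
os.close(data)
os.close(meta)
```

Hence each fully-synchronous transaction costs at least 2 IOPS; an engine that folds the commit record into the same log write can get away with 1.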

LMDB outperforms all other engines on larger-than-RAM DBs with page-sized
objects because its zero-copy architecture doesn't waste any RAM. For other
DBs you must dedicate some fraction of RAM to write buffers and page caches in
order to obtain their maximum performance. This means the largest working set
size they can keep in RAM is much smaller than the actual physical RAM. With
LMDB there are no separate write buffers or caches, all of available RAM can
be utilized for your working set. This is precisely the scenario tested here
[http://symas.com/mdb/hyperdex/](http://symas.com/mdb/hyperdex/)

I'll note that in that test, LMDB performed at exactly the disk read seek
rate; HyperLevelDB's LSM performed at much worse than that rate.

~~~
colanderman
Thanks for the explanation.

Last I tested, Berkeley DB issues only one fsync per transaction commit (after
writing to the log). (Although Berkeley DB screws up even this, by writing
unaligned log entries of arbitrary size, which means e.g. SSDs have to perform
expensive read-modify-writes.)

The interesting case for LSM isn't random read or sequential insert, it's
random insert (which the 20/80 update/read load doesn't really cover). While
it's good to see that LMDB outperforms LevelDB in the former two cases, one
would be foolish to use an LSM database if random insert (and/or delete)
weren't important to them. (Of course the world has no shortage of fools.)
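For readers unfamiliar with the trade-off, here is a toy sketch of the LSM idea (illustrative only, nothing like LevelDB's real implementation): random-key inserts are absorbed in a sorted in-memory buffer and flushed as one sequential run, so the disk sees large sequential writes regardless of key order, where a B-tree would seek once per touched page.

```python
import bisect

class TinyLSM:
    def __init__(self, flush_at=4):
        self.memtable = {}       # absorbs random-order inserts in RAM
        self.segments = []       # each flush is one sorted, sequential run
        self.flush_at = flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}   # one big sequential write, not N seeks

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for seg in reversed(self.segments):      # newest segment wins
            i = bisect.bisect_left(seg, (key,))
            if i < len(seg) and seg[i][0] == key:
                return seg[i][1]
        return None

db = TinyLSM()
for k in [9, 2, 7, 1, 8, 3]:
    db.put(k, str(k))
print(db.get(2), len(db.segments))  # '2' 1
```

The price, of course, is read amplification: a lookup may probe every segment, which is the flip side of the trade being argued here.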

Throw a larger-than-RAM rotary-disk-backed random insert workload at LevelDB
vs. LMDB. I would expect LevelDB to perform at around 1/4 the disk's write
bandwidth (probably ~2k IOPS with 4 KiB items; same as for sequential insert),
and LMDB to perform at the disk's seek rate (~100 IOPS).

Please note I'm _not_ trying to bash LMDB, or claim that LevelDB or LSM in
general is superior for anything but random insert workloads. But you made an
extraordinary claim, that LMDB is leaps and bounds better than _any_ other
data storage system! – as someone whose job it has been to wrangle extreme
performance out of certain databases, it behooves me to ask for extraordinary
evidence :)

~~~
rakoo
You should really take a look at LMDB's benchmarks then
([http://symas.com/mdb/#bench](http://symas.com/mdb/#bench)). The one of
interest to you should be the microbenchmarks
([http://symas.com/mdb/microbench/](http://symas.com/mdb/microbench/))
inspired I believe by those from LevelDB.

There are many modes being compared (synchronous/asynchronous,
sequential/random reads/writes, small/big values, in-memory/SSD/HDD). The one
you want (synchronous, random writes, HDD) shows this:

- Small values (100B): LevelDB is at 1291 ops/sec, LMDB is at 297 ops/sec.

- Large values (100kiB): Everyone is at ~115 ops/sec

(apply the usual grain of salt, of course)

------
Erwin
The key Sleepycat people went on to write WiredTiger which was acquired by
Mongo -- it has some interesting features:
[http://source.wiredtiger.com/2.6.1/index.html](http://source.wiredtiger.com/2.6.1/index.html)

> WiredTiger supports row-oriented storage (where all columns of a row are
> stored together), column-oriented storage (where columns are stored in
> groups, allowing for more efficient access and storage of column subsets)
> and log-structured merge trees (LSM), for sustained throughput under random
> insert workloads.

> WiredTiger includes ACID transactions with standard isolation levels and
> durability at both checkpoint and fine-grained granularity.

> WiredTiger can be used as a simple key/value store, but also has a complete
> schema layer, including indices and projections.

------
bkeroack
From the article: "It’s just a key-value store that you #include in your code.
Intuitively it seems like it could be a building block in constructing a full
blown relational database server."

IIRC, that's precisely how MySQL started.

~~~
hyc_symas
For reference, there actually was a BerkeleyDB backend for MySQL. The MySQL
guys stopped supporting it and deleted it from their source distro, but it was
available all the way up until MySQL 5.0.95.

------
0xdeadbeefbabe
You can use it where you'd use leveldb nowadays.

~~~
hyc_symas
LevelDB's design is inherently unreliable. Nobody should be using it nowadays.
[https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai)

