Berkeley DB: Architecture (ctrl-c.club)
69 points by codezero on Aug 31, 2015 | 30 comments

A more in-depth look at BerkeleyDB is available at AOSA (http://www.aosabook.org/en/bdb.html)

I second this AOSA chapter. It's written by the original developers of Berkeley DB, and is at times the only hope of understanding what goes on inside the beast that is Berkeley DB. Unfortunately, it predates (or its authors had no insight into) several newer and at times inscrutable features of BDB, such as MVCC (snapshots).

For anyone thinking Berkeley DB, because it has transactions and logging, is a good replacement for an out-of-process DB, the answer is no. Its embedded nature means that any "client" of a shared DB is free to deadlock the whole thing (if the client dies) or corrupt the database's structure (if the client has a memory bug).

By the time you have a working robust multi-client system, you have completely reinvented half of what an RDBMS provides, since you will have pushed most functionality into a central server to avoid the lack of robustness its embedded nature brings, to simplify coordination of recovery when it (inevitably) crashes, and to centralize the rest of the infrastructure you have to build around it.

Its tools are crap: it can't tell you how big a database is without traversing every single entry, they hold locks while writing to the terminal (meaning you can't pipe them into a pager without freezing the whole DB), and if one of them crashes (due to e.g. a dropped SSH connection) they'll bring down the whole DB due to stale locks.

Everything is underdocumented, especially the knobs you have to tweak to keep it from running out of memory. Its more "advanced" features are unusably buggy. Partition traversal (by cursor) is completely broken, and if you try to use partitions and key compression at once, things crash. MVCC inexplicably keeps thousands of transactions open, causing out-of-memory issues.

Its C++ interface is a joke. Somehow they managed to take an "object-oriented" C library and jam it into C++ in a way that completely tosses all the benefits of C++'s object-orientedness. RAII is not a concept the developers were familiar with: objects are created by factory functions and are variably and inconsistently freed by their "parent" object, their own methods, or their destructors. Do the wrong thing, and you trample memory, likely corrupting the database. This is surprisingly easy due to the lack of documentation.

And my sympathy to anyone who tries to run SQLite on this bad boy… combined with SQLite's shoddy track record interpreting any query more complicated than a key-value lookup, it would be a wonder if you're able to get any data in and out intact.

Sorry for the rant. Berkeley DB is many things, but a usable building block for a reliable RDBMS is not one of them.

To temper my rant (whose edit window has expired), I just want to note that Berkeley DB clearly was, at one point, a very well-designed, minimalistic and modular embedded key-value store. Almost all of my issues are with the features it has grown, like barnacles, over the past decade (or even two) to accommodate use cases which are better served by other databases designed with those features in mind. If you use it for what it was originally designed for, I am sure it is fantastic. Just don't pretend that it's anything more, despite what its most recent incarnations claim it to be.

Eh, you're right on some and wrong on many other points. An embedded key-value store can be a good foundation for an RDBMS - or many other data models. The OpenLDAP Project used BerkeleyDB for more than 15 years. BerkeleyDB architects still point to OpenLDAP as a reference standard for transactional BDB code. http://www.bozemanpass.com/services/bdb.html

As the primary author of OpenLDAP's BDB support, and of the LMDB library which now supersedes BDB, I've got quite a deep perspective into this topic.

BDB is deadlock-prone, no argument there. But that doesn't mean the embedded engine approach can't work. LMDB is deadlock-proof, and serves perfectly well as the engine for an RDBMS as well as the engine for a hierarchical DB (X.500/LDAP directory), or arbitrary graphs (http://sph.io/content/2faf , https://github.com/pietermartin/thundergraph , etc...) etc. When you start with a model-less key/value store you can implement any other data model on top.

LMDB's robustness is, in a word, flawless. It's crash-proof by design and multiple independent researchers have borne out the integrity of that design. (https://www.usenix.org/conference/osdi14/technical-sessions/... https://www.usenix.org/conference/osdi14/technical-sessions/... ) Your assertion that you can't build a robust multi-client system in a lightweight embedded system is flat wrong. In fact, this approach is the only way to get maximum performance from a software system. LMDB is so much faster than every other data management mechanism in existence, nothing else even comes close. http://symas.com/mdb/#bench

Crap tools, crap locking - these are certainly weaknesses in BDB's locking design. LMDB's locking design doesn't have these weaknesses. Querying the size of the DB is an O(1) operation, simply reading a few words out of the DB header.
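To illustrate the O(1) size claim, here's a minimal sketch of a DB whose size and entry count live in a fixed-layout header, so stat queries never traverse entries. The header layout, file path, and function names are all hypothetical, purely for illustration; this is not LMDB's actual meta page format.

```python
import struct

# Hypothetical 16-byte DB header: magic, page size, page count, entry count.
# (Illustrative layout only -- not LMDB's real on-disk format.)
HEADER = struct.Struct("<IIII")

def db_stats(path):
    """Read size/entry info from the fixed-size header: O(1), no traversal."""
    with open(path, "rb") as f:
        magic, page_size, n_pages, n_entries = HEADER.unpack(f.read(HEADER.size))
    return {"bytes": page_size * n_pages, "entries": n_entries}

# Write a toy header and read it back.
with open("/tmp/toy.db", "wb") as f:
    f.write(HEADER.pack(0xBEEF, 4096, 128, 1000))

print(db_stats("/tmp/toy.db"))  # {'bytes': 524288, 'entries': 1000}
```

A tool built on a scheme like this can report database size instantly, without touching any data pages and without taking data locks.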

Underdocumentation - I have no idea what you're talking about here. To me, BDB was copiously documented, and I largely owe my understanding of transactional systems to the BDB documentation. Yes, BDB is complicated and has too many moving parts. LMDB doesn't have this flaw either. But I can't take anyone seriously who says BDB was underdocumented.

C++ interface - I don't use C++, so no comment here.

C >> C++.

SQLite on BDB - yes, it's a joke. I've run it, and the performance is pretty awful. But we do far better. https://github.com/LMDB/sqlightning

There are many positive and negative lessons in software design to be learned from Berkeley DB. It really started life as a test bed, for learning. In that, I think it succeeded admirably. Its authors were able to experiment with extensible hashing, B+trees, ARIES logging, and a multitude of other important software architecture and data management concepts. LMDB owes much of its design to lessons learned from BDB.

LMDB's use of a read-only memory map by default means you can safely allow multi-client access without fearing corruption due to memory bugs. LMDB's exclusive use of mmap instead of complex application-level caching means you avoid all of the horrors of BDB's cache tuning, bloat driving into swap space, and other admin/tuning nightmares. LMDB's crash-proof persistent on-disk format means you don't need to worry about error-prone and inherently unreliable transaction logging mechanisms. All of these features are part of the LMDB design because of our experience working with BDB over the past 15+ years.
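The protection a read-only map gives against stray writes can be demonstrated with Python's stdlib `mmap` (a sketch of the principle, not LMDB itself; in C the analogous stray write through a `PROT_READ` mapping would fault rather than raise an exception):

```python
import mmap

# Create a small file to map.
path = "/tmp/ro_map_demo"
with open(path, "wb") as f:
    f.write(b"x" * mmap.PAGESIZE)

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    assert m[:4] == b"xxxx"       # reads work normally
    try:
        m[0] = 0x41               # a stray write through the map...
    except TypeError as e:
        print("write blocked:", e)  # ...is rejected: the pages are read-only
    m.close()
```

A buggy client can still crash itself, but it cannot scribble over the mapped database pages.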

…I don't quite understand why you rebutted my points about Berkeley DB with points about LMDB.

The only point you actually rebutted without referring to LMDB was regarding documentation. Yes, some parts of BDB are copiously documented. Others are not. How cursors interact (or don't) with partitioning is not documented at all. Some settings are underdocumented (e.g. the cache size setting doesn't specify whether "gigabytes" means 10^9 or 2^30). What the cache priority levels mean is barely even discernible from the source code. The documentation for how to set parameters like the total number of mutexes and active transactions to allocate boils down to "guess and check". Why enabling MVCC causes transactions to linger for ages isn't explained anywhere, besides maybe one mailing list post from 2008.

I'm glad to see that my experience with BDB is corroborated by that of the developer of BDB's poster child (despite your unsubstantiated claim that I'm "wrong on so many points").

That said, since you felt the need to claim it, I feel the need to ask how you reconcile the claim that "LMDB is so much faster than every other data management mechanism in existence, nothing else even comes close" with the fact that LMDB is Btree-based and thus must face significant performance challenges with both sequential synchronous random writes, and with random writes and deletes to/from databases whose working set does not fit in memory.

As for performance challenges: "sequential synchronous random writes" doesn't make sense to me. What exactly do you mean?

LMDB is read-optimized; for reads nothing else can touch it. For writes, there are other engines that may be faster some of the time.

For sequential writes LMDB cheats. The cheat is described in my LMDB Internals presentations. Nothing else can touch its sequential write speed.

For random writes - LMDB generally loses to other designs when records are smaller than 1/2 a DB page. For larger, LMDB wins, and the margin grows as record sizes grow.

"Sequential synchronous random writes" means that you issue writes with random keys in sequence, waiting for each to be synchronously written to durable storage before writing the next. This is an important metric in certain domains, and it is generally bound by the underlying hardware regardless of database architecture (though certain designs can make it much worse).

How does LMDB outperform log-structured databases when writing page-sized objects with random keys when the working set is larger than RAM? I'm genuinely curious. Any other Btree-based database will perform at disk read seek rates, at best.

OK, for your sequential synchronous random write case, LMDB is fair-to-middling. In fully-synchronous mode (which is the default) it takes 2 fsyncs per transaction, so at least 2 IOPS per write. Many other DBs (including SQLite, Berkeley, and Kyoto) take even more than 2 fsyncs per transaction, and are slower than LMDB. Some others take only 1, and are much faster in this workload.

LMDB outperforms all other engines on larger-than-RAM DBs with page-sized objects because its zero-copy architecture doesn't waste any RAM. For other DBs you must dedicate some fraction of RAM to write buffers and page caches in order to obtain their maximum performance. This means the largest working set size they can keep in RAM is much smaller than the actual physical RAM. With LMDB there are no separate write buffers or caches, all of available RAM can be utilized for your working set. This is precisely the scenario tested here http://symas.com/mdb/hyperdex/

I'll note that in that test, LMDB performed at exactly the disk read seek rate; HyperLevelDB's LSM performed at much worse than that rate.

Thanks for the explanation.

Last I tested, Berkeley DB issues only one fsync per transaction commit (after writing to the log). (Although Berkeley DB screws up even this, by writing unaligned log entries of arbitrary size, which means e.g. SSDs have to perform expensive read-modify-writes.)
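The alignment complaint boils down to padding: if log records are rounded up to the device block size before being written, the device can overwrite whole blocks instead of doing a read-modify-write. A sketch of the arithmetic (the 4096-byte block size is an assumption, and `aligned_len` is a hypothetical helper, not a BDB API):

```python
BLOCK = 4096  # assumed flash page / device block size

def aligned_len(record_len, block=BLOCK):
    """Round a log record up to a block boundary so the device never has to
    read-modify-write a partially covered block."""
    return -(-record_len // block) * block  # ceiling division

print(aligned_len(100))    # 4096
print(aligned_len(4096))   # 4096
print(aligned_len(5000))   # 8192
```

The cost is some wasted log space; the benefit is that every log write lands on a clean block boundary.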

The interesting case for LSM isn't random read or sequential insert, it's random insert (which the 20/80 update/read load doesn't really cover). While it's good to see that LMDB outperforms LevelDB in the former two cases, one would be foolish to use an LSM database if random insert (and/or delete) weren't important to them. (Of course the world has no shortage of fools.)

Throw a larger-than-RAM rotary-disk-backed random insert workload at LevelDB vs. LMDB. I would expect LevelDB to perform at around 1/4 the disk's write bandwidth (probably ~2k IOPS with 4 KiB items; same as for sequential insert), and LMDB to perform at the disk's seek rate (~100 IOPS).

Please note I'm not trying to bash LMDB, or claim that LevelDB or LSM in general is superior for anything but random insert workloads. But you made an extraordinary claim, that LMDB is leaps and bounds better than any other data storage system! – as someone whose job it has been to wrangle extreme performance out of certain databases, it behooves me to ask for extraordinary evidence :)

You should really take a look at LMDB's benchmarks then (http://symas.com/mdb/#bench). The one of interest to you should be the microbenchmarks (http://symas.com/mdb/microbench/) inspired I believe by those from LevelDB.

There are many modes being compared (synchronous/asynchronous, sequential/random reads/writes, small/big values, in-memory/SSD/HDD). The one you want (synchronous, random writes, HDD) shows this:

- Small values (100 B): LevelDB is at 1291 ops/sec, LMDB is at 297 ops/sec.

- Large values (100 KiB): everyone is at ~115 ops/sec.

(apply portion of salt as usual, of course)

Many of your points were about embedded stores, categorically. LMDB is existence proof that as a category, embedded data stores are perfectly suitable for just about any purpose.

Cache size: "All sizes are in powers-of-two" http://docs.oracle.com/cd/E17076_04/html/api_reference/C/set...

Cache priority levels: "The lowest priority: pages are the most likely to be discarded." http://docs.oracle.com/cd/E17076_04/html/api_reference/C/mem...

Come on.

Not a single sentence of mine has a subject other than "Berkeley DB" or "it". I'm completely baffled how you read anything categorical from my post.

"More likely" is not documentation, when you need guarantees that certain pages are not evicted in favor of others. (If my memory of the source code serves me, these levels map to internal values that have some nonlinear relationship with some LRU cache parameter, with the highest priority effectively locking pages into cache, except for certain cases which I forget.)

Also, the powers-of-two phrase is a recent addition: http://web.stanford.edu/class/cs276a/projects/docs/berkeleyd...

"Its embedded nature means that any "client" of a shared DB is free to deadlock the whole thing (if the client dies) or corrupt the database's structure (if the client has a memory bug).

By the time you have a working robust multi-client system, you have completely reinvented half of what an RDBMS provides, since you will have pushed most functionality into a central server to avoid the lack of robustness its embedded nature brings,"

Sounds very much to me like you are claiming that embedded designs by their nature lack robustness. BDB's lack of robustness is due to its very specific design choices, not solely because it is an embedded data store.

OK, except you concede even this point in LMDB's documentation, when running it in read-write mode:

"Using read-write mode offers much higher write performance, but adds the possibility for stray application writes thru pointers to silently corrupt the database"

I don't doubt LMDB is more robust than Berkeley DB even in that mode (it's hard not to be, since BDB will gladly corrupt itself if you mess up following its Byzantine C++ allocation patterns), but I think this one generalization of mine is, at worst, not unarguably falsified.

Hmmm... You have a point. I have to say though that the read-write mode was not part of LMDB's original design; it was added in response to a request from a RedHat engineer. So in its purest original form, LMDB was completely incorruptible.

In practice we've also found that writable mmap is only a win when your entire DB resides in RAM. When the DB is larger than RAM it's actually slower than just using write(). (When you try to write to an mmap page, if it's not already resident the OS must page it in. This is a wasted page-in since we're simply going to overwrite the entire page.)
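The point about overwriting whole pages can be sketched with stdlib calls (paths and sizes are illustrative; the real page-in cost only shows up when the target page is not resident, which a tiny demo file can't reproduce):

```python
import mmap, os

PAGE = mmap.PAGESIZE
path = "/tmp/page_write_demo"
with open(path, "wb") as f:
    f.truncate(4 * PAGE)

# Path 1: store through a writable map. If page 2 isn't resident, the kernel
# must first fault it in from disk -- wasted I/O, since we replace every byte.
with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)
    m[2 * PAGE:3 * PAGE] = b"A" * PAGE
    m.close()

# Path 2: pwrite() a whole aligned page. The kernel can overwrite the cached
# page (or the disk block) directly, with no prior read of the old contents.
fd = os.open(path, os.O_WRONLY)
os.pwrite(fd, b"B" * PAGE, 3 * PAGE)
os.close(fd)

with open(path, "rb") as f:
    data = f.read()
assert data[2*PAGE:3*PAGE] == b"A" * PAGE and data[3*PAGE:4*PAGE] == b"B" * PAGE
```

Both paths produce identical file contents; the difference is only in the hidden page-in the mmap path can trigger on a cold, larger-than-RAM database.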

Hi hyc_symas, I was curious why someone wouldn't want to use the `mmap` approach? What does having a user-space page cache buy you?

I talk a little about the history of this approach in my early LMDB design papers. The mmap approach was actually quite popular up until the mid-to-late 1980s. Up until then, 32 bit address spaces were still orders of magnitude larger than the physical memory of common machines. But when RAM sizes started approaching 1GB, and database sizes reached beyond 4GB, the notion of mmap'ing an entire DB into a process became impractical.

It took the introduction of SPARC64 in 1995 to really make it worth considering again, and Athlon64 in 2003 to make it feasible for mainstream computing.

Die-hard DB authors will tell you that having a user-space page cache buys you control - the DB has more knowledge about the workload and thus can make better decisions about which pages to evict and which to keep, than the kernel.

Of course, with greater power comes greater responsibility. Managing a cache, and eviction/replacement strategies in particular, is one of the most difficult problems in computing. Maintaining all of that knowledge about the workload can be much more compute-intensive than actually executing the original workload. Indeed, LMDB demonstrates that it's always more costly. Kernel code can program the MMU and handle page faults much more efficiently than any user-level code. It also has global visibility into all of the hardware resources in a system. It's already the kernel's job to balance memory demands among various processes in a system. A user-level app really only knows about its own consumption; it might be able to obtain information about other workloads on the machine but only at great cost.

In today's age of virtual-machine (over)use, app-level memory management is an even worse proposition. (In the OpenLDAP Project we had an effort led by an IBM team, looking into a self-tuning cache for OpenLDAP. They in particular wanted it to "play nice" and give up cache memory when there was memory pressure from other processes in the system, and even from pressure from the host system when running inside a VM. That effort was IMO futile and that work never reached fruition. But the mmap approach achieves all of their goals, without any of the massive amount of load monitoring and timing code they needed.)

On modern-day hardware with MMUs, there's no good reason to have user-space cache management.

With Berkeley DB it's even worse, since, by default, BDB uses a memory-mapped file as its cache. So the cache gets effectively double-managed, first by BDB, then by the OS itself. This wreaks havoc on performance.

The way around this design flaw is to either tell BDB to use SysV shared memory (which really should be the default), or to locate the cache on a ramdisk (e.g. tmpfs on Linux).

I'm surprised it's not the default. I agree it should be. I wonder why it's not... There must be some good reason? This page verifies that what you're saying is actually correct: http://pybsddb.sourceforge.net/ref/env/region.html

Update: Architecture dependence and having to release shared memory after a crash are two possible reasons BDB defaults to mmap rather than shared memory.

The BDB mmap cache actually worked fine for many years, on older OSs. They didn't keep accurate track of which memory pages were dirtied or when, so they were pretty lazy about flushing changes back to disk. This made an mmap'd file functionally equivalent to a SysV shared memory region. I remember when Linux changed in the 2.6 series, and suddenly using an mmap'd file got incredibly slow, because the kernel was proactively flushing dirtied pages back to disk. We started recommending SysV shm for all our deployments after that. You had to be careful in configuring this, since SysV shm wasn't swappable or pageable.

Incredibly insightful. Thank you!

I think the author was looking at this as a building block for understanding RDBMS, not building one. But I could be wrong. I'll ask him tomorrow :)

Regardless, I hope my post dissuades any readers who think otherwise.

The key Sleepycat people went on to write WiredTiger which was acquired by Mongo -- it has some interesting features: http://source.wiredtiger.com/2.6.1/index.html

> WiredTiger supports row-oriented storage (where all columns of a row are stored together), column-oriented storage (where columns are stored in groups, allowing for more efficient access and storage of column subsets) and log-structured merge trees (LSM), for sustained throughput under random insert workloads.

> WiredTiger includes ACID transactions with standard isolation levels and durability at both checkpoint and fine-grained granularity.

> WiredTiger can be used as a simple key/value store, but also has a complete schema layer, including indices and projections.

From the article: "It’s just a key-value store that you #include in your code. Intuitively it seems like it could be a building block in constructing a full blown relational database server."

IIRC, that's precisely how MySQL started.

For reference, there actually was a BerkeleyDB backend for MySQL. The MySQL guys stopped supporting it and deleted it from their source distro, but it was available all the way up until MySQL 5.0.95.

You can use it where you'd use leveldb nowadays.

LevelDB's design is inherently unreliable. Nobody should be using it nowadays. https://www.usenix.org/conference/osdi14/technical-sessions/...

I know Postfix supports LMDB instead; the format comes from OpenLDAP. http://symas.com/mdb/
