By the time you have a working robust multi-client system, you have completely reinvented half of what an RDBMS provides, since you will have pushed most functionality into a central server to avoid the lack of robustness its embedded nature brings, to simplify coordination of recovery when it (inevitably) crashes, and to centralize the rest of the infrastructure you have to build around it.
Its tools are crap: they can't tell you how big a database is without traversing every single entry; they hold locks while writing to the terminal (meaning you can't pipe them into a pager without freezing the whole DB); and if one of them crashes (due to e.g. a dropped SSH connection), the stale locks it leaves behind bring down the whole DB.
Everything is underdocumented, especially the knobs you have to tweak to keep it from running out of memory. Its more "advanced" features are unusably buggy: partition traversal by cursor is completely broken, and if you try to use partitions and key compression at once, things crash. MVCC inexplicably keeps thousands of transactions open, causing out-of-memory issues.
Its C++ interface is a joke. Somehow they managed to take an "object-oriented" C library and jam it into C++ in a way that tosses away all the benefits of C++'s object orientation. RAII is not a concept the developers were familiar with: objects are created by factory functions and are inconsistently freed, by their "parent" object, by their own methods, or by their own destructors, depending on the case. Do the wrong thing and you trample memory, likely corrupting the database; this is surprisingly easy given the lack of documentation.
And my sympathy to anyone who tries to run SQLite on this bad boy… combined with SQLite's shoddy track record of interpreting any query more complicated than a key-value lookup, it would be a wonder if you could get any data in and out intact.
Sorry for the rant. Berkeley DB is many things, but a usable building block for a reliable RDBMS is not one of them.
As the primary author of OpenLDAP's BDB support, and of the LMDB library which now supersedes BDB, I've got quite a deep perspective into this topic.
BDB is deadlock-prone, no argument there. But that doesn't mean the embedded engine approach can't work. LMDB is deadlock-proof, and serves perfectly well as the engine for an RDBMS, for a hierarchical DB (X.500/LDAP directory), or for arbitrary graphs (http://sph.io/content/2faf , https://github.com/pietermartin/thundergraph , etc.). When you start with a model-less key/value store, you can implement any other data model on top.
LMDB's robustness is, in a word, flawless. It's crash-proof by design, and multiple independent researchers have borne out the integrity of that design. (https://www.usenix.org/conference/osdi14/technical-sessions/... https://www.usenix.org/conference/osdi14/technical-sessions/... ) Your assertion that you can't build a robust multi-client system on a lightweight embedded engine is flat wrong. In fact, this approach is the only way to get maximum performance from a software system. LMDB is so much faster than every other data management mechanism in existence that nothing else even comes close. http://symas.com/mdb/#bench
Crap tools, crap locking - these are certainly weaknesses of BDB's locking design. LMDB's locking design doesn't have them. Querying the size of the DB is an O(1) operation: simply read a few words out of the DB header.
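Concretely: a size query amounts to a fixed-offset read of a couple of header fields, no traversal anywhere. A toy sketch in Python (invented header layout, not LMDB's actual file format):

```python
import struct, tempfile, os

# Hypothetical DB header: magic (u32), page size (u32), pages in use (u64).
HDR = struct.Struct("<IIQ")

def write_toy_db(path, psize, npages):
    with open(path, "wb") as f:
        f.write(HDR.pack(0xBEEF, psize, npages))
        f.write(b"\0" * psize * npages)  # the "data" pages

def db_size(path):
    # O(1): read a few words from the header; never touch the data pages.
    with open(path, "rb") as f:
        magic, psize, npages = HDR.unpack(f.read(HDR.size))
    return psize * npages

path = os.path.join(tempfile.mkdtemp(), "toy.db")
write_toy_db(path, 4096, 1000)
print(db_size(path))  # 4096000, regardless of how many entries exist
```

The cost is the same whether the DB holds ten entries or ten billion, which is the whole point.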
Underdocumentation - I have no idea what you're talking about here. To me, BDB was copiously documented, and I largely owe my understanding of transactional systems to the BDB documentation. Yes, BDB is complicated and has too many moving parts. LMDB doesn't have this flaw either. But I can't take anyone seriously who says BDB was underdocumented.
C++ interface - I don't use C++, so no comment here.
C >> C++.
SQLite on BDB - yes, it's a joke. I've run it, and the performance is pretty awful. But we do far better. https://github.com/LMDB/sqlightning
There are many positive and negative lessons in software design to be learned from Berkeley DB. It really started life as a test bed, for learning. In that, I think it succeeded admirably. Its authors were able to experiment with extensible hashing, B+trees, ARIES logging, and a multitude of other important software architecture and data management concepts. LMDB owes much of its design to lessons learned from BDB.
LMDB's use of a read-only memory map by default means you can safely allow multi-client access without fearing corruption from stray memory writes. LMDB's exclusive use of mmap instead of complex application-level caching means you avoid all the horrors of BDB's cache tuning, cache bloat driving the system into swap, and other admin/tuning nightmares. LMDB's crash-proof persistent on-disk format means you don't need to worry about error-prone and inherently unreliable transaction logging mechanisms. All of these features are part of the LMDB design because of our experience working with BDB over the past 15+ years.
The only point you actually rebutted without referring to LMDB was documentation. Yes, some parts of BDB are copiously documented; others are not. How cursors interact (or don't) with partitioning is not documented at all. Some settings are underdocumented (e.g. the cache size setting doesn't specify whether "gigabytes" means 10^9 or 2^30 bytes). What the cache priority levels mean is barely discernible even from the source code. The documentation for setting parameters like the total number of mutexes and active transactions to allocate boils down to "guess and check". Why enabling MVCC causes transactions to linger for ages isn't explained anywhere, besides maybe one mailing list post from 2008.
I'm glad to see that my experience with BDB is corroborated by that of the developer of BDB's poster child (despite your unsubstantiated claim that I'm "wrong on so many points").
That said, since you felt the need to claim it, I feel the need to ask how you reconcile the claim that "LMDB is so much faster than every other data management mechanism in existence, nothing else even comes close" with the fact that LMDB is B-tree-based, and thus must face significant performance challenges both with synchronous sequential writes and with random writes and deletes to/from databases whose working set does not fit in memory.
LMDB is read-optimized; for reads nothing else can touch it. For writes, there are other engines that may be faster some of the time.
For sequential writes LMDB cheats. The cheat is described in my LMDB Internals presentations. Nothing else can touch its sequential write speed.
For random writes - LMDB generally loses to other designs when records are smaller than 1/2 a DB page. For larger, LMDB wins, and the margin grows as record sizes grow.
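Back-of-envelope arithmetic makes the crossover plausible. All the numbers below are my assumptions (a 4 KiB page, a copy-on-write B+tree of roughly LMDB's shape, and an LSM with ~10x cumulative write amplification from compaction), not measurements:

```python
PAGE = 4096
TREE_DEPTH = 4        # assumed B+tree height for a large DB
LSM_WRITE_AMP = 10    # assumed cumulative rewrite factor across LSM levels

def cow_btree_bytes_written(record):
    # Records larger than half a page can't share a leaf, so they live on
    # whole overflow pages; a COW update rewrites the record's pages plus
    # the path of interior pages up to the root.
    data_pages = -(-record // PAGE)  # ceiling division
    return (data_pages + TREE_DEPTH) * PAGE

def lsm_bytes_written(record):
    # The record gets rewritten once per compaction level, eventually.
    return record * LSM_WRITE_AMP

for record in (100, 2048, 4096, 65536):
    print(record, cow_btree_bytes_written(record), lsm_bytes_written(record))
```

For a 100-byte record the COW tree writes ~20 KiB to update 100 bytes, while the LSM writes ~1 KiB; for a 64 KiB record the ratio flips, and it keeps widening as records grow, since the tree-path overhead is a constant while LSM amplification scales with record size.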
How does LMDB outperform log-structured databases when writing page-sized objects with random keys when the working set is larger than RAM? I'm genuinely curious. Any other Btree-based database will perform at disk read seek rates, at best.
LMDB outperforms all other engines on larger-than-RAM DBs with page-sized objects because its zero-copy architecture doesn't waste any RAM. For other DBs you must dedicate some fraction of RAM to write buffers and page caches in order to obtain their maximum performance. This means the largest working set size they can keep in RAM is much smaller than the actual physical RAM. With LMDB there are no separate write buffers or caches, all of available RAM can be utilized for your working set. This is precisely the scenario tested here http://symas.com/mdb/hyperdex/
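To put rough numbers on the double-buffering argument (the 50/50 split is an assumption for illustration; real tunings vary):

```python
ram = 16 << 30  # 16 GiB of physical RAM (assumed)

# Conventional engine: dedicate half of RAM to its own buffer pool. Pages it
# touches also pass through the OS page cache, so hot pages exist twice and
# the effective in-RAM working set is roughly just the buffer pool.
buffer_pool = ram // 2
cached_engine_working_set = buffer_pool

# mmap-only engine: the OS page cache is the only cache, so nearly all of
# RAM holds exactly one copy of each hot page.
mmap_working_set = ram  # minus a small fixed overhead

print(cached_engine_working_set / ram)  # 0.5
print(mmap_working_set / ram)           # 1.0
```

Once the working set exceeds what fits in RAM, that factor-of-two difference in cacheable data translates directly into extra disk reads for the double-buffered engine.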
I'll note that in that test, LMDB performed at exactly the disk read seek rate; HyperLevelDB's LSM performed at much worse than that rate.
Last I tested, Berkeley DB issues only one fsync per transaction commit (after writing to the log). (Although Berkeley DB screws up even this, by writing unaligned log entries of arbitrary size, which means e.g. SSDs have to perform expensive read-modify-writes.)
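The alignment problem is easy to quantify. With 4 KiB flash pages, any partially-covered page forces the SSD into a read-modify-write; a quick sketch (sizes are my example, not measured from BDB's log):

```python
FLASH_PAGE = 4096

def pages_touched(offset, length):
    # Flash pages overlapped by a write of `length` bytes at byte `offset`.
    first = offset // FLASH_PAGE
    last = (offset + length - 1) // FLASH_PAGE
    return last - first + 1

def partial_pages(offset, length):
    # Pages only partially covered -> read-modify-write on the device.
    full_start = -(-offset // FLASH_PAGE) * FLASH_PAGE  # round offset up
    full_end = (offset + length) // FLASH_PAGE * FLASH_PAGE  # round end down
    full = max(0, (full_end - full_start) // FLASH_PAGE)
    return pages_touched(offset, length) - full

# A 700-byte log record appended at an arbitrary unaligned offset:
print(partial_pages(3900, 700))   # straddles a boundary: 2 partial pages
# The same record padded out to a page and written page-aligned:
print(partial_pages(4096, 4096))  # 0 partial pages
```

So an unaligned append of a sub-page record can cost two read-modify-writes where an aligned, padded log format would cost none.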
The interesting case for LSM isn't random read or sequential insert, it's random insert (which the 20/80 update/read load doesn't really cover). While it's good to see that LMDB outperforms LevelDB in the former two cases, one would be foolish to use an LSM database if random insert (and/or delete) weren't important to them. (Of course the world has no shortage of fools.)
Throw a larger-than-RAM rotary-disk-backed random insert workload at LevelDB vs. LMDB. I would expect LevelDB to perform at around 1/4 the disk's write bandwidth (probably ~2k IOPS with 4 KiB items; same as for sequential insert), and LMDB to perform at the disk's seek rate (~100 IOPS).
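For concreteness, here's the back-of-envelope math behind those estimates. All hardware figures are assumptions (a rotary disk with ~100 MB/s sequential write bandwidth and ~10 ms seeks, and an LSM write amplification around 12):

```python
DISK_BW = 100e6       # bytes/sec sequential write bandwidth (assumed)
SEEK_TIME = 0.010     # seconds per seek (assumed)
ITEM = 4096           # bytes per inserted item
LSM_WRITE_AMP = 12    # assumed cumulative compaction rewrite factor

# LSM: all I/O is sequential, but every byte is eventually rewritten
# roughly LSM_WRITE_AMP times by compaction.
lsm_ops = DISK_BW / LSM_WRITE_AMP / ITEM

# B+tree with a larger-than-RAM working set: every random insert must
# first seek to read the relevant leaf page, so throughput is seek-bound.
btree_ops = 1 / SEEK_TIME

print(round(lsm_ops))    # on the order of 2000 ops/sec
print(round(btree_ops))  # 100 ops/sec
```

The exact LSM number depends heavily on the assumed write amplification, but the order-of-magnitude gap between bandwidth-bound and seek-bound random insert is the point.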
Please note I'm not trying to bash LMDB, or to claim that LevelDB, or LSM in general, is superior for anything but random insert workloads. But you made an extraordinary claim: that LMDB is leaps and bounds better than any other data storage system! As someone whose job it has been to wrangle extreme performance out of certain databases, it behooves me to ask for extraordinary evidence :)
There are many modes being compared (synchronous/asynchronous, sequential/random reads/writes, small/big values, in-memory/SSD/HDD). The one you want (synchronous, random writes, HDD) shows this:
- Small values (100 B): LevelDB is at 1291 ops/sec, LMDB is at 297 ops/sec.
- Large values (100 KiB): everyone is at ~115 ops/sec.
(apply portion of salt as usual, of course)
Cache size: "All sizes are in powers-of-two"
Cache priority levels: "The lowest priority: pages are the most likely to be discarded."
"More likely" is not documentation, when you need guarantees that certain pages are not evicted in favor of others. (If my memory of the source code serves me, these levels map to internal values that have some nonlinear relationship with some LRU cache parameter, with the highest priority effectively locking pages into cache, except for certain cases which I forget.)
Also, the powers-of-two phrase is a recent addition: http://web.stanford.edu/class/cs276a/projects/docs/berkeleyd...
"By the time you have a working robust multi-client system, you have completely reinvented half of what an RDBMS provides, since you will have pushed most functionality into a central server to avoid the lack of robustness its embedded nature brings,"
Sounds very much to me like you are claiming that embedded designs by their nature lack robustness. BDB's lack of robustness is due to its very specific design choices, not solely because it is an embedded data store.
"Using read-write mode offers much higher write performance, but adds the possibility for stray application writes thru pointers to silently corrupt the database"
I don't doubt LMDB is more robust than Berkeley DB even in that mode (it's hard not to be, since BDB will gladly corrupt itself if you mess up following its Byzantine C++ allocation patterns), but I think this particular generalization of mine is, at worst, not conclusively falsified.
In practice we've also found that writable mmap is only a win when your entire DB resides in RAM. When the DB is larger than RAM it's actually slower than just using write(). (When you try to write to an mmap page, if it's not already resident the OS must page it in. This is a wasted page-in since we're simply going to overwrite the entire page.)
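The two write paths can be sketched with Python's stdlib. The wasted page-in isn't directly observable from user code, but the pattern is (a minimal sketch; file layout and sizes are mine):

```python
import mmap, os, tempfile

PAGE = mmap.PAGESIZE
path = os.path.join(tempfile.mkdtemp(), "data")
with open(path, "wb") as f:
    f.truncate(PAGE * 4)  # a 4-page file

fd = os.open(path, os.O_RDWR)

# Path 1: writable mmap. Storing into a non-resident page faults it in
# from disk first, even though we immediately overwrite every byte of it.
m = mmap.mmap(fd, PAGE * 4)
m[PAGE:2 * PAGE] = b"A" * PAGE
m.flush()
m.close()

# Path 2: plain write(). A full, page-aligned overwrite lets the kernel
# skip reading the old contents entirely.
os.pwrite(fd, b"B" * PAGE, 2 * PAGE)
os.close(fd)

with open(path, "rb") as f:
    data = f.read()
print(data[PAGE], data[2 * PAGE])  # 65 66 ('A' and 'B' landed on disk)
```

Both paths end with the same bytes on disk; the difference is the extra read I/O the mmap store can trigger when the target page isn't resident, which is pure waste for a full-page overwrite.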
It took the introduction of SPARC64 in 1995 to really make it worth considering again, and Athlon64 in 2003 to make it feasible for mainstream computing.
Die-hard DB authors will tell you that having a user-space page cache buys you control - the DB has more knowledge about the workload and thus can make better decisions about which pages to evict and which to keep, than the kernel.
Of course, with greater power comes greater responsibility. Managing a cache, and eviction/replacement strategies in particular, is one of the most difficult problems in computing. Maintaining all of that knowledge about the workload can be much more compute-intensive than actually executing the original workload. Indeed, LMDB demonstrates that it's always more costly. Kernel code can program the MMU and handle page faults much more efficiently than any user-level code. It also has global visibility into all of the hardware resources in a system. It's already the kernel's job to balance memory demands among various processes in a system. A user-level app really only knows about its own consumption; it might be able to obtain information about other workloads on the machine but only at great cost.
In today's age of virtual-machine (over)use, app-level memory management is an even worse proposition. (In the OpenLDAP Project we had an effort led by an IBM team, looking into a self-tuning cache for OpenLDAP. They in particular wanted it to "play nice" and give up cache memory when there was memory pressure from other processes in the system, and even from pressure from the host system when running inside a VM. That effort was IMO futile and that work never reached fruition. But the mmap approach achieves all of their goals, without any of the massive amount of load monitoring and timing code they needed.)
On modern-day hardware with MMUs, there's no good reason to have user-space cache management.
The way around this design flaw is to either tell BDB to use SysV shared memory (which really should be the default), or to locate the cache on a ramdisk (e.g. tmpfs on Linux).
Update: Architecture dependence and having to release shared memory after a crash, are two possible reasons BDB defaults to mmap rather than shared memory.
> WiredTiger supports row-oriented storage (where all columns of a row are stored together), column-oriented storage (where columns are stored in groups, allowing for more efficient access and storage of column subsets) and log-structured merge trees (LSM), for sustained throughput under random insert workloads.
> WiredTiger includes ACID transactions with standard isolation levels and durability at both checkpoint and fine-grained granularity.
> WiredTiger can be used as a simple key/value store, but also has a complete schema layer, including indices and projections.
IIRC, that's precisely how MySQL started.