LMDB – Lightning Memory-Mapped Database Manager (lmdb.tech)
81 points by hoov 37 days ago | 22 comments

I’ve used LMDB as a simpler alternative to SQLite as “an alternative to fopen”. The goal was simply robust file writes in the face of unpredictable server reboots for a tiny Python program writing data to be processed later by a tiny C++ program.

That’s harder than it sounds to roll by hand with fopen. SQLite with write-ahead logging is about as good as it gets for reliability, but SQL at all was overkill for the task. LMDB is a close second, and its memory-mapped key-value interface is much simpler. Would write again.


LMDB is a storage engine whereas SQLite is a small database. There is even a version of SQLite that used LMDB as the underlying storage engine: https://github.com/LMDB/sqlightning.

LMDB's read performance crushes modern LSM databases. It's not an alternative to SQLite. It is well suited for read-heavy workloads.

I like LMDB, but why do most SQL/NoSQL databases use LSM/RocksDB instead, at least the ones going for read speed? Is it because of the missing WAL?

There is also a fork that claims to be better and more feature-rich than LMDB: https://github.com/leo-yuriev/libmdbx

To understand the tradeoffs between LSMs, B-Trees, and Fractal Trees, see the references in this previous post on TokuDB and Bε trees...

* BetrFS: An in-kernel file system that uses Bε trees to organize on-disk storage https://news.ycombinator.com/item?id=18202935

Memory model considerations and storage architecture design gets even more interesting now that NVMe has become a thing. For example, in addition to LMDB, how much more interesting have things become for Redis on NVMe?

* Caching Beyond RAM: The Case for NVMe https://news.ycombinator.com/item?id=17315494

* Intel Optane DC Persistent Memory is officially in Google Cloud https://news.ycombinator.com/item?id=1834816

And there are a few new forward-thinking DB architectures emerging on the scene, some that have been in the works for more than 10 years. Look at the work being done by the Berkeley RISELab team and the architecture behind Fluent DB.

* Ground: A Data Context Service (2017) [pdf] (berkeley.edu), https://news.ycombinator.com/item?id=18415456

What might have been conventional wisdom in the realm of DBs years ago will not be the best practices of today. Architectures have changed too much.

And this is not just true for storage, it's true for compute too. The availability of CPU/GPU/TPU accelerators in the data centers is driving a rethink in compute toward parallel algorithms in the form of Vector/Matrix/Tensor multiplication. The best way to store and index these arrays is something to consider too.

Can these be used for the async filesystem access (AIO) that the Seastar framework does? (I don't think so, for now.)

At least Seastar doesn't support mmap, which rules out LMDB.

Google started the trend of LSM with its release of leveldb. But leveldb hasn't been updated in a long time. Facebook forked leveldb and renamed it to rocksdb. Those are the only two LSM databases I know of, and IMO they are really the same thing. Meanwhile, lmdb vs. rocksdb/leveldb is a frequently asked question that seems to have no clear answer. Test on your hardware to find the best solution for your use case.

It primarily depends on your requirements, as a rule of thumb:

* If your workload is random-write heavy, choose an LSM

* If your workload is serial-write heavy, both are similar

* If your workload is read-heavy (random or not), go for LMDB

If your writes are larger than ~1/2 a page, LSMs are slower, regardless of random or sequential access pattern.


Also, if your writes are mostly smaller than ~1/2 a page, you can reduce your B+tree pagesize and regain performance.

All improvements over LMDB are listed here: https://github.com/leo-yuriev/libmdbx#improvements-over-lmdb

LMDB is a very good choice for many well-known reasons. I don't need to expand here, the advantages are well documented, and more and more projects are choosing LMDB.

However LMDB does not solve all problems, and can be a bad choice for some, and I couldn't find this documented anywhere. Specifically write-intensive workload. Why?

- LMDB by default provides full ACID semantics, which means that after every committed key-value write, it must sync to disk. If this happens tens of times per second, your system performance will suffer.

- LMDB provides a super-fast asynchronous mode (`MDB_NOSYNC`), and this is the one most often benchmarked. Writes are super-fast with this. But a little known fact is that you lose all of ACID, meaning that a system crash can cause total loss of the database. Only use `MDB_NOSYNC` if your data is expendable.

In short, I would advise against LMDB if you are expecting more than a couple of independent writes per second. In that case, consider a database that syncs to disk only occasionally, offering just ACI semantics (without Durability, meaning a system crash can lose only the last few seconds of data).

> But a little known fact is that you lose all of ACID, meaning that a system crash can cause total loss of the database. Only use `MDB_NOSYNC` if your data is expendable.

Last I looked into LMDB, this was only the case if the filesystem doesn't respect write ordering, which depends on the filesystem. Otherwise you get everything but durability (i.e. ACI). If I recall correctly, writes are ordered by default on ext3.

This is exactly our experience. Using the default settings, RocksDB massively outperforms LMDB on single-key write workloads, because it writes asynchronously.

Your advice made sense in the age of rotating platter HDDs, limited to a max of ~120 IOPS. Today's world of NVMe SSDs makes your considerations obsolete.

That's not true. With SSDs we can sync with the disk more often, but it's still very slow.

There's an even older technology, battery-backed RAM-cached HDDs, that gives you everything an SSD can, except the one thing you aren't actually using here: fast random-access read performance.

Is there a design doc or talk about the internals?

In particular are there any good resources about the details of using memory mapping?

I know how to implement persistent data structures (and it seems like lmdb is just a persistent b+-tree). But I don't know how to make it persist to disk. Is it as simple as using a memory mapped file for all memory allocations? Can all data structures be turned into a "database" in this way? If your workload fits in memory is there any performance difference between in-memory data structures? When do writes actually flush? What happens if multiple processes use the same file? etc
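The core trick is that a memory-mapped file is simultaneously addressable memory and durable storage: ordinary pointer writes dirty pages that the OS (or an explicit msync) flushes to disk. A toy stdlib sketch of just that mechanism, not of LMDB's actual internals (LMDB adds copy-on-write B+-tree pages on top of this):

```python
import mmap

PATH, SIZE = "/tmp/toy.db", 4096

# Create a fixed-size backing file, then map it into the address space.
with open(PATH, "wb") as f:
    f.truncate(SIZE)

with open(PATH, "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE)
    mm[0:5] = b"hello"   # an ordinary memory write mutates the file's pages
    mm.flush()           # msync: force dirty pages to disk now
    mm.close()

# A fresh open sees the persisted bytes.
with open(PATH, "rb") as f:
    assert f.read(5) == b"hello"
```

Answering the narrow question: yes, a data structure whose nodes live inside the mapped region (using offsets instead of raw pointers) persists "for free"; the hard parts LMDB solves are crash consistency (via copy-on-write) and multi-process coordination (via a lock file).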

See these two talks by 'hyc:

LMDB talk at DEVOXX (2013) [video] https://youtu.be/Rx1-in-a1Xc

LMDB CMU Databaseology Lecture (2015) [video] https://youtu.be/tEa5sAh-kVk

There is much more information on the symas website (https://symas.com/lmdb/technical/) (see all the talks links)

> Data pages use a copy-on-write strategy so no active data pages are ever overwritten, which also provides resistance to corruption and eliminates the need of any special recovery procedures after a system crash.

But I imagine this is somewhat slower than keeping a log (and rewinding it if necessary)?

The web page seems to suggest that robust POSIX semaphores are Linux-specific, while they've been available in FreeBSD for quite some time. I wonder if they detect it properly, or is there some actual problem in FreeBSD's implementation?
