
Design Review: Key-Value Storage - espeed
https://mozilla.github.io/firefox-browser-architecture/text/0015-rkv.html
======
jimis
The page writes that LMDB "is exceptionally fast for our kinds of load", and
then links to an in-memory microbenchmark:
[http://www.lmdb.tech/bench/inmem/](http://www.lmdb.tech/bench/inmem/)

Aren't they interested in persistence of the key-value data? In my experience,
once data is persisted to disk or SSD, LMDB is way slower than alternatives
because it needs to operate in synchronous mode to avoid corruption
(effectively flushing to disk after every transaction commits). If operated in
the non-default MDB_NOSYNC mode (which is the mode chosen in the above
benchmarks), there is a high probability of being left with an unreadable
database file after a crash, thus losing all your data.

It is not fair to compare against other databases in sync mode, since they
might operate safely, yet faster, in async mode. For example, SQLite with
PRAGMA journal_mode=WAL and PRAGMA synchronous=NORMAL can operate in a semi-
asynchronous mode (fsync()ing only sporadically) without fear of corruption in
case of a crash, because it keeps a WAL journal and is able to properly roll
back after a crash. This should be much faster than LMDB's default-and-safe
synchronous mode, which msync()s on every value written.
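
The SQLite configuration described above can be sketched like this (a minimal
example using Python's built-in sqlite3 module; the file name, table, and data
are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect("example.db")
# WAL mode: writes go to a write-ahead log instead of a rollback journal.
conn.execute("PRAGMA journal_mode=WAL")
# NORMAL: fsync the WAL only at checkpoints, not on every commit. A crash
# may lose the last few commits, but never corrupts the database file.
conn.execute("PRAGMA synchronous=NORMAL")

conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
with conn:  # each with-block is one transaction
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("greeting", "hello"))
conn.close()
```

Note that journal_mode=WAL is persistent (stored in the database file), while
synchronous=NORMAL must be set on every connection.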

~~~
hyc_symas
We have only ever compared LMDB in synchronous mode to other DBs in
synchronous mode, and LMDB in asynch mode to other DBs in asynch mode. Come
on, that's too obvious.

And LMDB beats the crap out of SQLite, in any mode.
[http://www.lmdb.tech/bench/microbench/](http://www.lmdb.tech/bench/microbench/)

Replacing SQLite's Btree engine with LMDB makes the SQLite footprint smaller,
faster, and more reliable too.
[https://github.com/LMDB/sqlightning](https://github.com/LMDB/sqlightning)

~~~
jimis
It's not that simple. What "(a)synchronous mode" means differs greatly from
database to database. See my comment above. Is SQLite synchronous or
asynchronous if you configure it with journal_mode=WAL and PRAGMA
synchronous=NORMAL?

For my purposes as an application developer, I care about comparing databases
operating in _safe_ mode, i.e. a system crash should never cause total data
loss. In my experience, SQLite's safe mode is many times faster than LMDB's
safe mode under a write workload, while LMDB thrashes the disk and achieves
only a handful of write transactions per second.

~~~
hyc_symas
SQLite's safe mode is not comparable to LMDB's; SQLite is vulnerable to silent
data loss in a crash.

[https://wisdom.cs.wisc.edu/workshops/spring-14/talks/Thanu.pdf](https://wisdom.cs.wisc.edu/workshops/spring-14/talks/Thanu.pdf)

LMDB is not.

~~~
jimis
Isn't this from the paper "all filesystems are not created equal"? [1] If you
search the tables for "sqlite-wal" you will see that it shows zero
vulnerabilities.

[1]
[https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf)

~~~
hyc_symas
Ah, the earlier result was for SQLite-rollback, their default mode. IMO any DB
should default to its safest mode.

------
nilsocket
Didn't they review Badger from Dgraph?

[https://github.com/dgraph-io/badger](https://github.com/dgraph-io/badger)

Here is the comparison back in 2017: [https://blog.dgraph.io/post/badger-lmdb-boltdb/](https://blog.dgraph.io/post/badger-lmdb-boltdb/)

It supports concurrent ACID transactions with serializable snapshot isolation
(SSI) guarantees.

------
drzaiusx11
Hopefully this goes better than WebSQL. WebSQL ran into standardization
problems because they started with an existing implementation (SQLite) instead
of a spec as their base, meaning anyone making a greenfield implementation
would need to implement all the quirks of SQLite to be compatible with other
implementations. Sadly it died as a standard and we were stuck with either
local storage or IndexedDB.

~~~
Mossop
There is no intent to standardize this or expose it to the web. This is purely
for use internal to Firefox. Of course that may end up meaning that it is used
as the internal storage for some web feature.

~~~
drzaiusx11
The first sentence on that page says "We propose the standardization of a
simple key-value storage...usable from JS, Java, Rust, Swift, and C++"

I assumed this meant an API callable from webasm/js. Did I miss something?

~~~
drzaiusx11
Ah, found my error:

"Not-yet or never goals for this proposal are:

Standardization via a standards body as a web API."

So this is "internal" stuff I guess.

~~~
Mossop
Poor wording perhaps. There are a bunch of places in Firefox that use key-
value stores, this proposes standardising on one type of store.

------
lclarkmichalek
I'm really surprised LSM trees didn't get more commentary. 'All are targeted
at server workloads' - sure, but they're also incredibly popular and appear
to be as close to the 'one-size-fits-all solution for storage' as we've found.

~~~
coleifer
Also typically more complicated and require a separate compaction process.
They're good for writing lots of data, but not so great for random reads.

~~~
mrjn
Author of Badger here. Our design of separating keys and values has gotten us
incredibly fast writes, while still keeping read latencies neck and neck with
B+ trees. Worth checking out: [https://github.com/dgraph-io/badger](https://github.com/dgraph-io/badger)

~~~
digikata
Badger looks nicely done! Did you end up needing to change much in
implementing Badger from what was described in the WiscKey paper?

------
catwell
In case someone involved in this reads this thread: the document does not
specify which LMDB version you tested. I suggest you run your tests with the
`mdb.master` branch, i.e. the work towards a future 1.0, and not the stable
0.9 branch. The answers to several of your questions will depend on that: with
`mdb.master` you can use the VL32 mode, which greatly improves usage on 32-bit
platforms, and Windows support is much better.

Regarding NFS: I have recently started testing LMDB on NFS v4 and had no
issues so far, but with a single process using the database. AFAIK the warning
at [http://www.lmdb.tech/doc/](http://www.lmdb.tech/doc/) is only for multiple
processes using the DB concurrently. I am still not entirely sure there won't
be any mmap-related issues, but so far so good.

Regarding "being careful": this is a very important point. The LMDB API does
not hold your hand; it lets you do dangerous things that will corrupt your
database in ways you only discover too late. I suggest writing a wrapper
around the API to ensure you are using it correctly. (I wish there were a
compile flag like LUA_USE_APICHECK [1] for LMDB, which could help detect
problems like this, but there isn't.)

[1]
[https://www.lua.org/manual/5.3/manual.html#4](https://www.lua.org/manual/5.3/manual.html#4)
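
For illustration only, here is the shape such a guard-rail wrapper could take.
Everything below is hypothetical: the dict-backed store stands in for a real
LMDB binding, and only one hazard is modeled (in LMDB, get() returns a pointer
into the memory map that is invalid once the transaction ends):

```python
class Txn:
    """Hypothetical transaction handle that refuses use-after-finish."""

    def __init__(self, store):
        self._store = store
        self._live = True

    def _check(self):
        if not self._live:
            raise RuntimeError("transaction handle used after commit/abort")

    def get(self, key):
        self._check()
        return self._store.get(key)

    def put(self, key, value):
        self._check()
        self._store[key] = value

    def commit(self):
        self._check()
        self._live = False


class Env:
    """Stand-in for an LMDB environment; a dict replaces the memory map."""

    def __init__(self):
        self._store = {}

    def begin(self):
        return Txn(self._store)
```

A real wrapper over an actual binding would additionally copy values out of
the map before the transaction ends, or tie buffer lifetimes to the
transaction object, so stale reads fail loudly instead of returning garbage.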

~~~
hyc_symas
The main reason for that warning about NFS is that people will try it, see
that it seems to work, and then get careless and try to use the same DB from
two different hosts at once. It's inevitable when you're working on files
living on networked filesystems, and it cannot work. NFS doesn't offer any
cache coherency guarantees, and the mutexes used for synchronizing writers
only work on the host that created them.

Not sure what you're talking about re: the API letting you corrupt your
database.

~~~
justin66
If we're using lmdb for something that's largely or entirely read only, how
viable is using it on NFS with (shudder) lock files or something like that?

~~~
hyc_symas
If it's 100% read-only, you could probably use it safely. Make sure its
filesystem on the NFS server is mounted read-only, and obviously all the
clients' NFS mounts must also be read-only. As for "largely read only": if
there are any writers at all, all bets are off.

LMDB automatically detects read-only filesystems, and turns off its locking in
that case, so it should perform as well as anyone could expect NFS to perform.
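
As a sketch of that setup (the server name, export path, and mount point below
are all made up, and exact options vary by NFS implementation):

```shell
# /etc/exports on the NFS server: export the directory holding the
# LMDB files read-only.
#   /srv/lmdb  *(ro,sync,no_subtree_check)

# On every client, mount the export read-only as well, so LMDB sees a
# read-only filesystem and disables its locking.
mount -t nfs -o ro,vers=4 nfs-server:/srv/lmdb /mnt/lmdb
```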

------
adobeeee
I have a layman question, if somebody could please answer. I have never in my
entire life seen a database fail. But db failures and issues seem to be
brought up all the time. Now I understand that part of this may be the cost
associated with them. But I'm sure there's also something that I have no clue
about. So my questions are:

1) what kind of problems do databases actually face.

2) what kind of scenarios create those problems.

3) how does a programmer go about testing them?

~~~
mbreese
The easiest scenario to imagine is a hardware failure or power outage. The
database was in the middle of doing something, and then was interrupted by a
hard drive dying or the lights going out. One way to test such a thing is to
literally unplug the computer to see how it handles the failure.

So, let's say you have a client/server application... the client is telling
the server (database) to write some records to the database. In the middle of
the write, you pull the plug. Some questions you'd want to know: what does the
database look like when it restarts? Can we read it? What is the current
state? Did any of the new data get written? What does the client think was
written? If there was an uncommitted database transaction, was the database
left unaltered?

It's just as important to test the client in these scenarios. While the server
may have crashed, what does the client think happened? Was it waiting for an
ACK or "OK" message? Did it get the message? If the update failed, what does
the client do in that situation?

Things can get even more complicated if you're thinking of replication across
different servers. If one of the servers fails, how does the replication work?
Do sessions fail over to other servers? How many servers are required? If
there was a corrupted record, did it propagate or was it scrubbed?
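
A miniature version of the pull-the-plug experiment can be run against an
embedded database. In this sketch (Python's built-in sqlite3 module; the table
and values are invented), closing the connection with a transaction still open
stands in, very roughly, for the crash:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "crash-test.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

# Start an update but "crash" before committing it.
conn.execute("UPDATE accounts SET balance = 0 WHERE name = 'alice'")
conn.close()  # uncommitted work is rolled back, as after a real crash

# On "restart", the database is readable and the old state is intact.
conn2 = sqlite3.connect(path)
balance = conn2.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100: the half-done transaction never happened
```

A real crash test would kill the process (or the machine) mid-write rather
than closing cleanly, but the questions asked afterwards are the same ones
listed above.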

~~~
adobeeee
Thank you for explaining so well and clearly!

To you and others, are there any other scenarios too that happen in
production?

~~~
philix001
Enough material here to scare anyone about databases

[https://jepsen.io/talks](https://jepsen.io/talks)

------
shekispeaks
This doc says LevelDB has no transactions; that's not true, it has batch
writes. And LevelDB is not implemented in Go; it's implemented in C++.

~~~
floatingatoll
Transactions let you perform any series of SQL commands with various
expectations around data safety and locking guarantees depending on isolation
level.

Batch writes provide a tiny subset of the full possibilities of transactions.
While sufficient in many cases, that cannot be generalized to "LevelDB
supports transactions".
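
One concrete gap: a batch write can apply several writes atomically, but it
cannot make a write depend on a read of current state. A transfer between two
balances needs a read-modify-write inside one transaction, sketched here with
Python's built-in sqlite3 (the table and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

# Read-modify-write: the overdraft check is only sound if the read and
# both writes happen in the same transaction. A batch write can apply the
# two UPDATEs atomically, but it cannot express the conditional check.
with conn:
    (balance,) = conn.execute(
        "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
    if balance < 30:
        raise ValueError("insufficient funds")
    conn.execute("UPDATE accounts SET balance = balance - 30 "
                 "WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 "
                 "WHERE name = 'bob'")
```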

~~~
romed
That's just your pet definition of transaction. That is not universally
accepted.

~~~
Twirrim
I would love to see an example of someone actually knowledgeable about
databases who has a different definition for transaction.

~~~
erpellan
Atomicity, consistency, isolation, durability.

[https://en.wikipedia.org/wiki/ACID_(computer_science)](https://en.wikipedia.org/wiki/ACID_\(computer_science\))

But that's just like, the industry's opinion, man.

~~~
Twirrim
Maybe I completely misunderstood the parent's description of a transaction,
but that was exactly what they were saying.

------
sidcool
This is a great example of how to conduct a technical design review.

------
mamcx
I see they have some questions (like how good the Windows and Android support
is) that are not answered, or only answered internally? I think it would be
good to see what they found.

~~~
hyc_symas
Windows and Android are fully supported, have been for years. As are iOS and
MacOSX and all the BSDs.

~~~
mamcx
That's great to know, and I suspected as much. But what I'm saying is that
they ask some valid questions, so where are the results of their findings?

Because if I read this, I could conclude that LMDB could have trouble in those
areas...

------
jinqueeny
Curious about how this compares to TiKV...

------
AboutTheWhisles
It really sounds like they should take a look at this:

[https://github.com/LiveAsynchronousVisualizedArchitecture/simdb](https://github.com/LiveAsynchronousVisualizedArchitecture/simdb)

"One appealing aspect of LMDB is its relative ease of use from multiple
processes, above and beyond its basic capabilities as yet-another-fast-key-
value-store."

simdb is lock-free and thread-safe. While LMDB is benchmarked at around 10k
writes per second, this should be able to do millions of mixed reads and
writes with 4 modern cores. LMDB seems to use a separate lock file to sync
multiple threads/processes. The only catch here is that the keys aren't
sorted, which doesn't seem to be a requirement of theirs.

------
erichocean
LMDB is awesome; I can't count how many times I've wished to have access to it
from the browser...

------
stevewilhelm
How does this compare to Redis?

~~~
detaro
Redis isn't embeddable; it's a standalone server application, so it's not
really the same space.

------
smacktoward
Bring back Mork!

 _ducks_

~~~
jcranmer
Mork actually sucked even as a key-value store. It's only decent if your
requirements are a) only lookup on a fixed, autoincrement integer ID, b) the
only operation you're likely to do is load an entire record or store an entire
record at once, and c) parallelism is not in your vocabulary.

Disclaimer: I'm one of the last people to make functional changes to mork.

~~~
majewsky
That's not a "Disclaimer". It's actually the polar opposite... a "Claimer"?
... I think "Source" is the best word to use there.

~~~
pvg
Welcome to the war, majewsky.

[https://news.ycombinator.com/item?id=17607457#17618288](https://news.ycombinator.com/item?id=17607457#17618288)

~~~
majewsky
Hah! :) I've been meaning to call out other misuses of "Disclaimer" vs.
"Disclosure" before, but I usually don't because it's just tedious.

------
amelius
Seems like an instance of reinventing the wheel ...

~~~
wvenable
From the article:

"We propose ‘buying’, not building, the core of such a solution, and wrapping
it in idiomatic libraries that we can use on all platforms.

We propose that LMDB is a suitable core (see Appendix A for options
considered): it is compact (32KB of object code), well-tested, professionally
maintained, reliable, portable, scales well, and is exceptionally fast for our
kinds of load. We have engineers at Mozilla with prior experience with LMDB,
and their feedback is entirely positive."

------
alexnewman
Why do they make this proposal? What does that mean?

~~~
Mossop
It's a proposal for making use of a new storage engine to store Firefox
internal data.

------
acqq
I thought that issues like this would be blocking:

[https://stackoverflow.com/questions/44407659/how-to-force-64-bit-lmdb-to-generate-a-32-bit-database](https://stackoverflow.com/questions/44407659/how-to-force-64-bit-lmdb-to-generate-a-32-bit-database)

Who knows how it should work on 32-bit systems?

And isn't endianness also a problem? And doesn't SQLite solve both problems by
default? And isn't it also possible to configure SQLite to be very fast, if
one knows what one is doing?

Btw, I'd expected that the "notes here"
[https://docs.google.com/document/d/1bwbpqPb58a0GcEyB4W424pyftPFiZBSvxxp_0uuN-z0/edit](https://docs.google.com/document/d/1bwbpqPb58a0GcEyB4W424pyftPFiZBSvxxp_0uuN-z0/edit)
contain conclusions, but the document does not seem to be publicly accessible.

~~~
mykmelez
According to @hyc in [https://monero.stackexchange.com/questions/2606/are-the-lmdb-files-cross-platform-compatible/2607#2607](https://monero.stackexchange.com/questions/2606/are-the-lmdb-files-cross-platform-compatible/2607#2607),
"As of v0.10.0, yes the LMDB files are cross-compatible between 32 and 64bit
architectures. They have always been cross-compatible between OSs. They are
still byte-order dependent but almost everyone uses little-endian CPUs these
days so it's not much of an issue."

~~~
acqq
So the endianness answer is, from your link: "They are still byte-order
dependent but almost everyone uses little-endian CPUs these days so it's not
much of an issue." Which just means "we solve the problem by ignoring it
completely."

And what about the limitations on 32-bit systems? Isn't it also necessary to
have address space for the complete size of the database, since that's how the
database works, if I understood correctly? Which makes LMDB effectively
unsuitable for 32-bit systems:

[https://stackoverflow.com/questions/52862176/lmdb-open-large-databases-in-a-limited-memory-system](https://stackoverflow.com/questions/52862176/lmdb-open-large-databases-in-a-limited-memory-system)

"your user will need to either enable PAE on their system, or upgrade to
64-bit CPU. If neither of these is an option in your application, then you
cannot use a memory mapped file larger than your available address space"

In short, it still looks like LMDB is designed effectively for only one
endianness and only for 64-bit systems, which is still limiting for many use
cases.

Again, it seems that LMDB is "solving" the problem by ignoring it. Which is OK
if it fits your use case... But shouldn't Firefox work properly on 32-bit
systems? Or did they decide that they don't want to target any 32-bit platform
anymore?

Moreover, there are use-case scenarios where the memory-mapped-file approach
suffers from unnecessarily reading pages that will be completely overwritten
anyway. My conclusion is: if LMDB "works for you", fine, but do your research
properly first. I still believe SQLite covers many more use cases and is a
safer starting point for most of them, including Firefox's use cases, at least
until I read that they have really decided to reduce those.

~~~
jnwatson
The last issue is merely one of performance, not capability.

I’ve successfully used LMDB in 32-bit land, though it takes some effort. I had
to scan available virtual memory for the largest contiguous chunk and use
that.

Bigger issues are growing the database size, and performance in low memory. No
transactions can be running while you increase the map size, which is usually
hard to coordinate.

Also, in low memory situations, LMDB’s performance suffers tremendously. It
can be 100x slower, and commits can take seconds. You won’t run into it unless
you are really hammering it, and the developer usually won’t notice because
they usually have lots of RAM.

~~~
hyc_symas
Fwiw, we've benchmarked on Raspberry Pis with slow SD cards. You want to talk
about low-RAM situations and slow I/O. The reality is still the same though -
every other DB engine is many times slower under the same conditions.

Even when the DB is 5x or 50x larger than RAM...
[http://www.lmdb.tech/bench/hyperdex/](http://www.lmdb.tech/bench/hyperdex/)

