What's the big deal about embedded key-value databases? (eatonphil.com)
156 points by eatonphil on Aug 23, 2022 | 70 comments



I feel like this is missing any mention of the history of KV stores. Unix came with an embedded database (dbm) from the early days (1979) [0] which was rewritten at Berkeley into the more popular bdb in the 80s. [1] Sendmail was one of the more common programs that used it. And then when djb built his replacement for sendmail, qmail, he invented cdb. [2]

[0] https://en.wikipedia.org/wiki/DBM_(computing)

[1] https://en.wikipedia.org/wiki/Berkeley_DB

[2] https://cr.yp.to/cdb.html


I highly recommend that people comfortable with Go check out the building blocks at https://github.com/thomasjungblut/go-sstables

This codebase shows how SSTables, WAL, memtables, recordio, skiplists, segment files, and other storage engine components work in a digestible way. Includes a demo database showing how it all comes together to make a RocksDB / LevelDB competitor (not really).
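
To give a flavor of how those pieces fit together, here's a toy sketch of the LSM write path (not go-sstables' actual API; all types here are made up): log to the WAL first, then insert into the memtable, and flush to an immutable SSTable once the memtable fills up.

  package main

  import (
      "fmt"
      "os"
  )

  // toy write path of an LSM-style engine: WAL first, then memtable;
  // a real engine would flush the memtable to an immutable SSTable when full
  type engine struct {
      wal      *os.File          // write-ahead log, for crash recovery
      memtable map[string]string // real engines use a sorted structure (e.g. a skip list)
  }

  func (e *engine) put(key, value string) error {
      // 1. durably log the write before acknowledging it
      if _, err := fmt.Fprintf(e.wal, "%s=%s\n", key, value); err != nil {
          return err
      }
      // 2. make it visible to reads via the in-memory table
      e.memtable[key] = value
      return nil
  }

  func main() {
      wal, err := os.OpenFile("wal.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
      if err != nil {
          panic(err)
      }
      defer wal.Close()

      e := &engine{wal: wal, memtable: map[string]string{}}
      if err := e.put("hello", "world"); err != nil {
          panic(err)
      }
      fmt.Println(e.memtable["hello"])
  }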


Very cool! In a similar vein, Distributed Services with Go [0] works through building an SSTable-based KV store. I found it helpful for working with BadgerDB [1].

[0] https://pragprog.com/titles/tjgo/distributed-services-with-g...

[1] https://github.com/dgraph-io/badger


BadgerDB is quite a nice piece of software and (from my tests) the best key-value store in Go.

Dgraph wrote a great article explaining why they wrote Badger, and the tradeoffs and design decisions behind it: http://web.archive.org/web/20181116033431/https://blog.dgrap...
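
For anyone curious what using Badger looks like, a minimal sketch along the lines of its README (assuming the v3 module path):

  package main

  import (
      "fmt"
      "log"

      badger "github.com/dgraph-io/badger/v3"
  )

  func main() {
      // open (or create) the database directory
      db, err := badger.Open(badger.DefaultOptions("/tmp/badger"))
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      // writes go through serializable transactions
      err = db.Update(func(txn *badger.Txn) error {
          return txn.Set([]byte("answer"), []byte("42"))
      })
      if err != nil {
          log.Fatal(err)
      }

      // reads use read-only transactions
      err = db.View(func(txn *badger.Txn) error {
          item, err := txn.Get([]byte("answer"))
          if err != nil {
              return err
          }
          return item.Value(func(val []byte) error {
              fmt.Printf("answer=%s\n", val)
              return nil
          })
      })
      if err != nil {
          log.Fatal(err)
      }
  }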


Judging from the precipitous decline in Badger commits since 2021 [0], and the fact that the original/primary author is no longer with Dgraph [1] or working on Badger, it may be worth looking at Cockroach's Pebble [2] instead.

[0] https://github.com/dgraph-io/dgraph/graphs/contributors

[1] https://manishrjain.com/about

[2] https://github.com/cockroachdb/pebble
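
Pebble's basic API is similarly small; a minimal sketch along the lines of its README:

  package main

  import (
      "fmt"
      "log"

      "github.com/cockroachdb/pebble"
  )

  func main() {
      db, err := pebble.Open("demo-db", &pebble.Options{})
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      // pebble.Sync makes the write durable before returning
      if err := db.Set([]byte("hello"), []byte("world"), pebble.Sync); err != nil {
          log.Fatal(err)
      }

      // Get returns the value plus a closer guarding the underlying buffer
      value, closer, err := db.Get([]byte("hello"))
      if err != nil {
          log.Fatal(err)
      }
      fmt.Printf("hello=%s\n", value)
      closer.Close()
  }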


+100 and an upvote. BadgerDB seems so under-rated to me, and it's a great drop-in choice for an embedded KV store. Amazing for several simple sharded side projects!


Thank you! And thanks for all the stargazers :) Let me know if you have any issues, happy to help and fix things if necessary.


Thank you Thomas, we appreciate you taking the time to open source all this for our benefit.


Honestly, I'm still not sure why I would use something like RocksDB instead of, or in addition to, plain PostgreSQL/MongoDB/Redis instances.

I don't work with a lot of data, but typically I base my decisions on basic factors and purpose:

PostgreSQL - SQL, structured data, cannot scale horizontally

MongoDB - NoSQL, unstructured data

Redis - key-value, distributed cache

I get that you can replace the storage engine and theoretically get more performance, but in practice compatibility and standardization are more important, because a lot of products (including third-party ones) already use PostgreSQL/MongoDB/Redis, so it's a no-brainer to use them for your solution as well.

However, for me to pick RocksDB or some other new, shiny database/storage engine, there would have to be more compelling reasons.


Unless you are building a database, these embedded KV store libraries are unlikely to be the best solution for the job. If you are considering them for an app that isn't a database, you should also take a long, hard look at SQLite first.
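
For reference, using SQLite as a KV store is a handful of lines in Go (a sketch; the driver choice and the upsert syntax, which needs SQLite 3.24+, are assumptions):

  package main

  import (
      "database/sql"
      "fmt"
      "log"

      _ "github.com/mattn/go-sqlite3" // one of several SQLite driver options
  )

  func main() {
      db, err := sql.Open("sqlite3", "app.db")
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)`); err != nil {
          log.Fatal(err)
      }

      // upsert a key/value pair
      _, err = db.Exec(
          `INSERT INTO kv (k, v) VALUES (?, ?) ON CONFLICT(k) DO UPDATE SET v = excluded.v`,
          "hello", []byte("world"))
      if err != nil {
          log.Fatal(err)
      }

      var v []byte
      if err := db.QueryRow(`SELECT v FROM kv WHERE k = ?`, "hello").Scan(&v); err != nil {
          log.Fatal(err)
      }
      fmt.Printf("hello=%s\n", v)
  }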

What's also interesting is the trend of newer distributed "database systems" like Vitess[0] or SpiceDB[1] that forego embedded KV stores and instead reuse existing SQL databases as their "embedded database". Vitess leverages MySQL; SpiceDB leverages MySQL, PostgreSQL, CockroachDB, or Spanner. Systems built this way get to leverage many high-level features of existing database systems, so they can focus on innovating in even higher-level functionality. In the case of Vitess, that's the scaling, distribution, and schema management of MySQL. In the case of SpiceDB, it's building a database specifically optimized for querying access control data in a way that coordinates causality across multiple services.

[0]: https://github.com/vitessio/vitess

[1]: https://github.com/authzed/spicedb


Me: that's Zanzibar, innit? Their GitHub repo: based on Zanzibar


Like S3 or Redis, RocksDB is much more performant when you don't need the query engine and want to have highly compact storage with fast lookups and high write throughput.

Storage engines come in different levels of complexity based on the query requirements. Simple K/V stores can run circles around Postgres/MySQL as long as you don't need the extra features.


In your list RocksDB is most like Redis, but even faster because the data doesn't have to leave the process.

Think of it as a high performance sports car like a Ferrari. It's not good at taking the kids to school or buying groceries. But if you need to prioritise performance at the expense of all other considerations then it's exactly what you need.


A few more entries that might be of interest:

  * DynamoDB and the Dynamo KV store
  * LMDB (embedded kv)
  * Dgraph (distributed graph db) and its embedded kv store BadgerDB


IMO it's just confusing to call both, say, RocksDB and MySQL "databases". They sit at different levels of the stack and it is easier to just think of them as entirely different things, your "SQL database" and your "storage engine". So your stack looks like:

  Application
       |
     MySQL
       |
    RocksDB
       |
  Filesystem

In general the MySQL layer is doing all the convenient stuff for application developers, like supporting different queries and datatypes. The RocksDB layer is optimizing for metrics like throughput and reliability, and just treats data as sequences of bytes.
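
To make the "sequences of bytes" point concrete, here's a sketch of the kind of key encoding a SQL-on-KV layer might use (MyRocks-style in spirit; the exact layout is made up). Big-endian integers keep byte-wise key order equal to numeric order, which is what makes range scans over the KV store work:

  package main

  import (
      "encoding/binary"
      "fmt"
  )

  // hypothetical key layout: (table/index id, primary key) flattened
  // into one byte string; big-endian so lexicographic order == numeric order
  func rowKey(indexID uint32, pk uint64) []byte {
      key := make([]byte, 12)
      binary.BigEndian.PutUint32(key[0:4], indexID)
      binary.BigEndian.PutUint64(key[4:12], pk)
      return key
  }

  func main() {
      fmt.Printf("%x\n", rowKey(7, 42)) // 00000007000000000000002a
  }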


Actually, this helps a lot. I'd never heard of RocksDB and I'm barely familiar with InnoDB and hopefully I am not wrong to compare the two.


Yes, that's right: InnoDB is the default MySQL storage engine, and you can replace InnoDB with RocksDB. To summarize in one sentence: InnoDB is better at reads and RocksDB is better at writes. But if you were making an actual decision, you should look at more detailed information than my one-sentence summary, such as:

https://minervadb.com/index.php/2018/06/06/a-friendly-compar...


100% agreed. TIL that mysql uses RocksDB under the hood.

Here's another example of a realtime database which uses RocksDB under the hood: https://rockset.com/blog/how-we-use-rocksdb-at-rockset/


As far as I'm aware, MySQL does not use RocksDB under the hood by default. MyRocks is a distribution of MySQL that uses RocksDB.


Yeah, weird comment from GP. By the time RocksDB was born, MySQL was already going to prom.


Close, but in database years it was actually already in its mid life crisis.


Only if you configure it that way. Same as MyISAM/InnoDB/etc.


I think the use of bare RocksDB is more common than the use of MyRocks.


Two more examples to check out: Yugabyte also persists data with RocksDB: https://www.yugabyte.com/blog/how-we-built-a-high-performanc...

And this is very cool, distributed SQLite with FDB: https://univalence.me/posts/mvsqlite


Thank you, edited to include Yugabyte!


With Rockset's converged indexes and an SQL query optimiser you can build an SQL database.

https://rockset.com/blog/converged-indexing-the-secret-sauce...

Rockset's converged indexes + denormalisation means you can have fast querying.


Great article! One cool thing about RocksDB is that it's actually used in other KV databases, such as Redis on Flash: https://redis.com/blog/hood-redis-enterprise-flash-database-...


The article misses the point. All data storage and query systems end up architected in layers. Upper layers deal with higher abstractions (objects, rows, whatever). Lower layers deal with simpler functions, closer to the hardware. The upper layers are consumers of the lower layers. This is where "embedded KV stores" like LevelDB, RocksDB, etc. come from. They began as the embedded storage layer for some bigger thing. Every product you think of as a database or document store is built like this, including MySQL and PostgreSQL and Oracle. Such a storage layer, shipped as an independent library, is how you (or anyone) can build your own database-ish thing. That's what the article should say.

The list of examples is odd. For instance, MongoRocks is cited for using RocksDB, but actual stock MongoDB uses WiredTiger, which isn't mentioned.

Disclosure: I played a part in the late beginning of this space, when Netscape funded Sleepycat to develop BerkeleyDB. dbm and ndbm existed beforehand, but BerkeleyDB's use in LDAP servers is, I think, the genesis point for this pattern as it exists today.


> Upper layers deal with higher abstractions (objects, rows, whatever)

Right. I'm waiting for a standard for the level above relational databases, which is object databases. I know there are several already, and there are object-relational mapping layers.

I think the key point there is that object databases are a level ABOVE relational databases. They are not "better", but they deal with the higher level of objects rather than "tables", just like relational databases can be seen as a level above key-value stores.

I would like Object databases to become better and easier to use and more standardized.

I think there is value in being able to see both levels: the objects, and the relational data that makes up the objects.


Neither objects nor relations are "above" the other. You can map them in a vacuous mathematical sense, but it's a massively leaky abstraction in either direction.


Some concrete examples:

1. Yugabyte's relational query layer sits on top of a document store (DocDB): https://www.yugabyte.com/blog/how-we-built-a-high-performanc....

2. You can put documents in a PostgreSQL JSON(B) column.
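
A tiny illustration of the second point (a sketch; the docs table and connection string are hypothetical, and I'm assuming the lib/pq driver):

  package main

  import (
      "database/sql"
      "log"

      _ "github.com/lib/pq"
  )

  func main() {
      // hypothetical schema: CREATE TABLE docs (id serial PRIMARY KEY, body jsonb)
      db, err := sql.Open("postgres", "postgres://localhost/mydb?sslmode=disable")
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      // store a whole document in a jsonb column
      _, err = db.Exec(`INSERT INTO docs (body) VALUES ($1::jsonb)`,
          `{"name": "ada", "tags": ["math"]}`)
      if err != nil {
          log.Fatal(err)
      }

      // and query inside the document with JSON operators
      var name string
      if err := db.QueryRow(`SELECT body->>'name' FROM docs LIMIT 1`).Scan(&name); err != nil {
          log.Fatal(err)
      }
      log.Println(name)
  }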


When I use the word "above" I mean "layers" of code. So if an object database was implemented by using a relational database, it would be "above" the layer of the RDBMS.

I think that is what object-relational mappers like Hibernate do.

I think it would seem quite natural to implement objects on top of, and with the help of, an RDBMS. But I'm not sure the opposite is true.


If there's a difference between what you wrote and what I wrote I'm missing it.

But you're also welcome to write your own post. :)


I do feel like there's a historical perspective missing from the article which the GP touches on. Embedded KV stores aren't new (although some of the algorithms behind the current crop certainly are). They used to dominate "backend" software development until their popularity waned as the world got obsessed with "model the domain, damn the computation cost" (because all resources were doubling or more yearly) followed by "we'll just distribute it".

The need for parallelism killed the first approach and the cost of increasingly complex reduce steps killed the second. Now we're back to "how much can we fit in RAM on a local machine" and it turns out, if you can still bang bits for smart key formats, a hell of a lot.


> They began as the embedded storage layer for some bigger thing.

I immediately thought of Kafka's streaming query stuff when I read the headline (ksqlDB). I'm not sure if that's the origin story of RocksDB, but it's the storage engine underlying that streaming query tooling in Kafka's ecosystem.


Yup, FB's ZippyDB [0] is another example mentioned in the article.

[0] https://engineering.fb.com/2021/08/06/core-data/zippydb/

Edit: I've added Redis Enterprise Flash to the list now. Thanks!


We should see a rise in embedded KV popularity in correlation with ML applications. Storing embeddings in something like LevelDB, in formats such as FlatBuffers, offers a high-performance solution for online prediction (i.e. for mapping business values to their embedding format on the fly to send off to some model for inference).
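
A minimal sketch of the idea using goleveldb (packing raw little-endian float32 bytes rather than FlatBuffers, to keep it short; the path and key are made up):

  package main

  import (
      "encoding/binary"
      "log"
      "math"

      "github.com/syndtr/goleveldb/leveldb"
  )

  // pack a float32 vector into raw little-endian bytes
  func encode(vec []float32) []byte {
      buf := make([]byte, 4*len(vec))
      for i, f := range vec {
          binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
      }
      return buf
  }

  func main() {
      db, err := leveldb.OpenFile("embeddings.db", nil)
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      // key = business id, value = its embedding, looked up at inference time
      if err := db.Put([]byte("item:42"), encode([]float32{0.12, -0.7, 0.33}), nil); err != nil {
          log.Fatal(err)
      }
  }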


Would that be on mobile devices for offline usage? I'm thinking that for typical backend use cases one would use a dedicated key value store service, right?


This would depend on your requirements and type of inference. Say you need to compute inference across thousands of pieces of content/documents/images every second or so, out of some corpus of millions to billions; then having a KV store on disk/SSD (NVMe) might be far more efficient and cheaper (in terms of grabbing those embeddings for a downstream ML task). How you update the corpus matters too -- a lot of embedding spaces need to be updated in aggregate.


I've heard this a lot recently about storing embeddings. As someone who has dabbled in ML I don't understand what it means. Can you point me to a good overview of the topic please?


I work on a storage engine at $dayJob. We have created a connector for MongoDB, although for a very ancient version. We are currently working with $cloudProvider to use our storage engine in their cloud DBaaS offerings.

This field is pretty interesting when you're talking about performance vs space amp vs write amp vs read amp.


Plug for my Python dict wrapper https://github.com/adammarples/rocksdbdict


Apache Ignite 3 also uses RocksDB as a pluggable storage engine: https://www.gridgain.com/resources/blog/apache-ignite-3-alph...


Thanks! Adding this.


This is a good read. By the way, Kafka Streams is also built on top of RocksDB. Not strictly a database, but relevant to a certain extent.


My team has a use-case that involves a precomputed RocksDB database saved on an AWS EFS volume that is mounted on a Lambda with hundreds to thousands of invocations per second. It allows for some extremely fast querying of relatively static data. Another process is responsible for periodically updating the database and writing it back to the EFS volume.


I am building a general-purpose data management system called Didgets (https://didgets.com/) that extensively uses KV stores that I invented. Since it was primarily designed to be a file system replacement, I used them for attaching contextual meta-data tags to file objects.

My whole container started to look like a sparsely populated relational table where every row/column intersection could have multiple values (e.g. a photo could have a tag attached for every person in the picture). I started experimenting with using the KV stores as columns to form regular relational tables.

It turns out that it was relatively easy and was extremely fast. I started building tables with 50+ million rows and many columns and performing queries against them. Benchmarking the system against other databases revealed that it was very fast (and didn't need separate indexes to accomplish this).

Here is a video showing how it does a bunch of queries 10x faster than the same data stored in a highly indexed table in Postgres: https://www.youtube.com/watch?v=OVICKCkWMZE


When I read about event sourcing, my mind immediately went to how that would map to a K/V database. Has anyone done this in production?
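
The mapping I'd expect (a sketch; the key layout is hypothetical): key each event by stream ID plus a big-endian sequence number, so replaying a stream is just an ordered prefix scan over the KV store.

  package main

  import (
      "encoding/binary"
      "fmt"
  )

  // hypothetical key layout for an event store on a KV engine:
  // "evt/" + streamID + "/" + big-endian sequence number, so that a
  // prefix scan over "evt/" + streamID + "/" replays events in order
  func eventKey(streamID string, seq uint64) []byte {
      key := make([]byte, 0, len("evt/")+len(streamID)+1+8)
      key = append(key, "evt/"...)
      key = append(key, streamID...)
      key = append(key, '/')
      var seqBuf [8]byte
      binary.BigEndian.PutUint64(seqBuf[:], seq)
      return append(key, seqBuf[:]...)
  }

  func main() {
      fmt.Printf("%q\n", eventKey("order-17", 3))
  }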

Also - no mention of LMDB? RocksDB and LMDB feel like the ones that stand out in that field - LevelDB definitely had a reputation for corrupting data.


The article explains how you do primary key indexes with key-value stores. But how do you do secondary indexes?
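
The usual trick, as I understand it: maintain a second keyspace that maps the indexed value back to the primary key, and write both entries in one atomic batch. A sketch with Pebble (the key layout is made up):

  package main

  import (
      "log"

      "github.com/cockroachdb/pebble"
  )

  func main() {
      db, err := pebble.Open("users-db", &pebble.Options{})
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      // one atomic batch writes the record and its index entry together
      batch := db.NewBatch()
      // primary record: "u/" + id -> payload
      _ = batch.Set([]byte("u/42"), []byte(`{"email":"ada@example.com"}`), nil)
      // secondary index: "u_email/" + email + "/" + id -> empty value;
      // lookup by email is then a prefix scan over "u_email/ada@example.com/"
      _ = batch.Set([]byte("u_email/ada@example.com/42"), nil, nil)
      if err := batch.Commit(pebble.Sync); err != nil {
          log.Fatal(err)
      }
  }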


"Time is a flat circle." - someone at Sleepycat, probably.


You should add RethinkDB! I moved to it from MongoDB years ago.


Are you still using it? How is the pace going on the community-supported version? I stopped using it after the company folded, but I do kind of miss it. Definitely one of the more interesting designs, and light years beyond what MongoDB was at the time.


I’m definitely still using it, via rethinkdb-ts (npm package). I even forked it to make it work with Deno.

The built-in Data Explorer is a must-have for me and idk of any other database that has something similar.


There are plenty of data explorers for other databases, especially SQL DBs. I don't think it being built into the DB should be a make-or-break feature.

I used RethinkDB back in the day because it was the first DB that had pretty good replication and sharding - it was zero effort. I found the functional programming model strange: some stuff got executed locally, other parts remotely, and it was not very straightforward when things didn't go as planned.

By the time the RethinkDB company folded, CockroachDB emerged and has been my go-to distributed DB since.


No, I don't think that's relevant. They implement their own btree, it seems [0].

They don't use a key-value store library.

I know it's a bit of a fine line. But I'm talking about standalone libraries people embed across different applications/databases. That's what RocksDB/LevelDB/Pebble are.

[0] https://github.com/rethinkdb/rethinkdb/tree/v2.4.x/src/btree


HSE[0] is another storage engine to throw on the pile.

[0]: https://github.com/hse-project/hse


RethinkDB is utterly defunct as a project, has not had a substantive release in years, and in my experience just flat-out doesn't work. And let's not even discuss Mongo. Asking yourself to choose between these is like selecting your favorite brand of thumbtack to step on.


Lol. When did you last use MongoDB and why is it a thumbtack?


The last time I used MongoDB was when it was necessary for me to demonstrate to decision makers that it silently loses data in trivial, common failure scenarios. Then I put it away and never used it again.


I was about to defend it as having come a long way, but actually it seems like it's still having some big issues, as discussed in 2020 [0].

> Yeah, there's no workaround that I can find for 3.4 (duplicate effects), 3.5 (read skew), 3.6 (cyclic information flow), or 3.7 (read own future writes). I've arranged those in "increasingly worrying order"--duplicating writes doesn't feel as bad as allowing transactions to mutually observe each other's effects, for example. The fact that you can't even rely on a single transactions' operations taking place (or, more precisely, appearing to take place) in the order they're written is especially worrying. All of these behaviors occurred with read and write concerns set to snapshot/majority.

[0] https://news.ycombinator.com/item?id=23290844


RethinkDB still works well for me /shrug


TiKV is not an embedded key-value store, it is distributed.


Thanks! Fixed and attributed you at the end.


No mention of SQLite as an embedded SQL database?


This post is about key-value stores.

While FoundationDB uses SQLite, I didn't otherwise think of SQLite as being relevant here. :)


The term is “key/value.”


Yes, keep modding INFORMATION down, Redditards.


The article says that Consul or etcd are designed to always be up, but it’s actually quite the opposite. They both leverage Raft for maintaining consensus and thus optimize for consistency at the cost of availability in case of network partitions. See CAP theorem.


All distributed databases are designed to "always be up"; that's the point of making them distributed. Otherwise a single instance is fine.


There are reasons to distribute DBs that do not need to be up constantly, e.g. distributing work (transactions or queries) across more resources than are available on one machine, or bringing a replica closer to some other service to reduce latency.

Kafka Streams is the first kind: the source-of-truth storage is HA (at least as HA as the Kafka topics it's backed with), but it can only be queried with high consistency when the consumer is active, and it goes down for rebalances when you scale out or fail over (and in many operational setups, also when you upgrade).

For an example of the second kind, see Fly.io's Litestream explanation - https://fly.io/blog/all-in-on-sqlite-litestream/.

That being said, I think the etcd etc. examples are just meant to be in contrast to stock Redis or Memcache, which offer very little HA support, generally just failover with minimal consistency guarantees.



