Show HN: Distributed SQLite on FoundationDB (github.com/losfair)
150 points by losfair 7 months ago | 42 comments
Hello HN! I'm building mvsqlite, a distributed variant of SQLite with MVCC transactions, that runs on FoundationDB. It is a drop-in replacement that just needs an `LD_PRELOAD` for existing applications using SQLite.

I made this because Blueboat (https://github.com/losfair/blueboat) needs a native SQL interface to persistent data. Apparently, just providing a transactional key-value store isn't enough - it is easier and more efficient to build complex business logic on an SQL database, and it seems necessary to bring a self-hostable distributed SQL DB onto the platform. Since FoundationDB is Blueboat's only stateful external dependency, I decided to build the SQL capabilities on top of it.

At its core, mvsqlite's storage engine, mvstore, is a multi-version page store built on FoundationDB. It works around the duration and size limits of FDB transactions (5 seconds, 10 MB) by handling multi-versioning itself. Pages are fully versioned, so they remain snapshot-readable in the future. An SQLite transaction fetches its read version during `BEGIN TRANSACTION`, and this version is used as the upper bound of the per-page range scan in subsequent page read requests.
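To make the read path concrete, here is a toy sketch in Python. The key layout is my assumption, not mvstore's actual on-disk format: an in-memory dict stands in for FDB, each page keeps sorted (version, page_hash) entries, and the reader's lookup mimics a reverse range scan bounded by the snapshot's read version.

```python
import bisect

# Toy model of versioned page reads (a sketch, not the real mvstore format):
# each page number maps to a sorted list of (version, page_hash) entries, and
# a reader takes the newest entry at or below its snapshot's read version --
# in FDB this would be a reverse range scan with the read version as the
# upper bound.
page_index = {}  # page_number -> sorted [(version, page_hash), ...]

def write_page(page_no, version, page_hash):
    entries = page_index.setdefault(page_no, [])
    entries.append((version, page_hash))
    entries.sort()

def read_page(page_no, read_version):
    entries = page_index.get(page_no, [])
    # Last entry with version <= read_version, if any.
    i = bisect.bisect_right(entries, (read_version, b"\xff" * 32))
    return entries[i - 1][1] if i > 0 else None

write_page(1, 100, b"hash-at-v100")
write_page(1, 200, b"hash-at-v200")
print(read_page(1, 150))  # a snapshot taken at version 150 still sees v100
```

Because old versions stay in the index, a long-running reader keeps getting a consistent snapshot even while writers commit newer page versions.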

For writes, pages are first written to a content-addressed store keyed by each page's hash. At commit, the hashes of all pages written in the SQLite transaction are written to the page index in a single FDB transaction to preserve atomicity. With 8K pages and ~60 bytes per key-value entry in the page index, each SQLite transaction can be as large as ~1.3 GB (compared to FDB's native transaction size limit of 10 MB).
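The ~1.3 GB figure follows from the numbers given; a back-of-envelope check (assuming 8 KiB pages and ~60-byte index entries, as stated above):

```python
# Only the small page-index entries go through the single FDB commit
# transaction; page contents were already written to the content-addressed
# store, so the 10 MB limit bounds index entries rather than page data.
FDB_TXN_LIMIT = 10 * 1000 * 1000  # FDB's native transaction size limit (~10 MB)
INDEX_ENTRY_BYTES = 60            # approx. size of one page-index key-value entry
PAGE_SIZE = 8 * 1024              # 8K SQLite pages

max_pages = FDB_TXN_LIMIT // INDEX_ENTRY_BYTES  # index entries per commit
max_txn_bytes = max_pages * PAGE_SIZE           # page data those entries cover
print(max_pages, max_txn_bytes)  # ~166k pages, ~1.37 GB of page data
```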

mvsqlite is not yet "production-ready", since it hasn’t received enough testing, and I may still have a few changes to make to the on-disk format. But please ask here if you have any questions!

I love the idea of distributed SQLite but I’m having a hard time understanding which parts of FoundationDB and which parts of SQLite are available in this implementation.

I'm guessing virtual table extensions work with this since you're just replacing the storage engine? So we could in theory use FTS5 and even osquery and other extensions, right?

However, since this is using FoundationDB, I'm also guessing we can't use this as a serverless embedded DB, since you'll probably need a FoundationDB cluster to use it. Is that right?

So if I understand correctly this is a SQLite query engine on top of FoundationDB with distributed transactions and we can theoretically use SQLite ecosystem stuff like FTS5 and datasette on top of it.

Yes, that's correct! mvsqlite integrates as a custom VFS underneath SQLite's query engine, and SQLite ecosystem stuff can be used on top of it.

"The SSD storage engine stores the data in a B-tree based on SQLite." XD https://apple.github.io/foundationdb/architecture.html#stora...

Per your link they are leaving sqlite.

> In the upcoming FoundationDB 7.0 release, the B-tree storage engine will be replaced with a brand new Redwood engine.

(I'm on the osquery steering committee)

In theory osquery is "just" virtual tables, but in practice there's quite a bit more that would probably make attaching it to mvsqlite tricky. If you have a use case in mind I would love to know!

One thing I've been curious about with FDB (need to find time to try this myself) is using FDB as a way to easily implement replication with consistency.

For example: you have 5 Postgres instances. You send the query "SELECT * FROM TABLE" to FDB, and you want the result from any of the 5 Postgres instances (first to return wins). When you insert, you want to insert into all 5 and make sure that all 5 have actually finished the transaction before telling the client.

Seems simple enough to implement via FDB?

I think Kafka is enough for this use case? FDB should work too but sounds like overkill.

Right, my question is more around replicating any stateful system. Isn’t Kafka not partitionable?

I think foundationdb uses sqlite as its tablet kv engine? It’s sqlite all the way down

They are leaving sqlite.

> In the upcoming FoundationDB 7.0 release, the B-tree storage engine will be replaced with a brand new Redwood engine.


Interesting… doesn’t seem to have a lot of docs on this. Surprised they didn’t just go along with RocksDB

There's one set of folks working on a B-tree, and another set of folks now working on a RocksDB storage engine.

The original preference away from RocksDB was that it doesn't play well with deterministic simulation. Any code included in FDB needs to be able to run with coroutines (strongly preferably stackless ones, though sqlite's btree has a stackful coroutine shimmed under it). RocksDB is definitely not written to support coroutines, so trying to use it anyway sacrifices developers' ability to dig into failures.

Redwood has a couple design decisions that would make it a poor general purpose btree, but a great one for FoundationDB. But RocksDB will still have write and space amplification advantages.

It supports RocksDB also, although I think that feature is a bit more experimental.

> Surprised they didn’t just go along with RocksDB

Given how often they're butting heads publicly, maybe Apple didn't want to use something developed by Facebook? It probably wouldn't normally be a concern due to the licenses of each, but it's possible that corporate politics might still be relevant.

FDB is a nice piece of technology if you know how to go around its constraints. Congrats on the project.

This, IMHO (as someone who implements things on top of SQLite too: https://dflat.io), really pushes SQLite too far, as the implementation of cross-db transactions has some big issues: https://www.sqlite.org/limits.html (the number of attached databases cannot exceed 10, or 125 if you compile your own) and https://www.sqlite.org/wal.html (in WAL mode, there is no transactional guarantee for cross-database transactions - atomic per database, but not across databases).

Anyone interested in this topic would also be interested in the discussion at https://news.ycombinator.com/item?id=32285435. mvsqlite might be able to handle cross-database transactions in the storage layer (i.e. mvsqlite's mvstore, backed by FoundationDB's fully serializable transactions) rather than through sqlite's native mechanism, so I think there should be no theoretical limit due to SQLITE_MAX_ATTACHED.


But yeah, mvsqlite as of today seems to be tied to one sqlite per mvstore, so attaching databases to the sqlite instance is the only way to deal with multiple databases. So it will naturally be affected by SQLITE_MAX_ATTACHED.

Perhaps a potential big idea would be to have a sqlite instance per database and a "proxy" layer above the sqlite instances to (1) obtain/set a global transaction ID per multi-db transaction and (2) redirect queries to each involved sqlite database, while serializability is guaranteed by FoundationDB.

mvsqlite doesn't rely on the SQLite WAL or rollback journal for safety. Actually, it enforces journal_mode=memory! Atomic commit semantics are guaranteed by FDB instead.

The limit on the max number of open DBs looks low, but maybe short-lived attaches can be done? Like attach just before the transaction and detach after commit. This should be enough for common transactions involving just a few DBs.
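A rough sketch of that attach-then-detach pattern against the plain stdlib sqlite3 binding (the file names and schema here are made up; with mvsqlite the same SQL would run unchanged, just against mvstore-backed files):

```python
import sqlite3

# Short-lived-attach pattern: attach a second database only for the duration
# of a cross-database transaction, then detach, so the number of
# simultaneously attached DBs stays under SQLITE_MAX_ATTACHED.
main = sqlite3.connect(":memory:")
main.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user TEXT)")

main.execute("ATTACH DATABASE ':memory:' AS userdb")
main.execute("CREATE TABLE userdb.profile (user TEXT, plan TEXT)")
with main:  # one transaction spanning both databases
    main.execute("INSERT INTO orders (user) VALUES ('alice')")
    main.execute("INSERT INTO userdb.profile VALUES ('alice', 'pro')")
main.execute("DETACH DATABASE userdb")  # must run outside any transaction
```

Note that ATTACH and DETACH themselves cannot run inside an open transaction, so the attach has to happen just before BEGIN and the detach just after COMMIT.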

I am a bit out of my depth here for obvious reasons (I work above SQLite, not under it). But from my understanding, `journal_mode=memory` simply means it is in rollback mode, but rollback pages are not maintained on disk. Therefore, the modifications from a transaction will be applied in place in these pages. If the process running SQLite crashes in the middle of a transaction, these changes will be leaked unless you know how to roll back these pages (through the rollback journal; you could alternatively find them in FDB, but you would have no idea which one corresponds to which transaction?).

Moreover, without WAL, SQLite to the best of my knowledge would require a read lock on every read as well, practically becoming not only single-writer but also single-reader (you could probably break the global lock through the VFS layer, but that doesn't make this safe, i.e. you could have things from a transaction leaked?).

> Therefore, the modifications from a transaction will be applied in place in these pages.

With mvsqlite it is applied to the transaction's own snapshot of the database. Changes are not visible globally until transaction commit. That's what we get from page-level MVCC.

> Moreover, without WAL, SQLite to the best of my knowledge would require a read lock on every read as well

Reads are also MVCC. Data is fetched from a consistent and clean snapshot of the database, without uncommitted data.

Read a bit of your code. Seems most of the magic happens around the xLock / xUnlock implementation (which is used to identify a txn).

I wonder if SQLite should expose one layer up (at the actual pager level). A lot of attempts (such as yours, or LiteFS) try to reconstruct the pager / txn concept from the VFS layer, which seems to be one layer lower than where it actually lives.

This approach provides SERIALIZABLE transaction isolation, right?

What happens if there are page conflicts during the write? The transaction fails and you retry? Does SQLite have a good way of signalling that this is the case (vs, for example, a network failure or other write failures)?

Yes, it's the serializable isolation level.

For applications targeting upstream SQLite, mvsqlite enables pessimistic locking by default - when a transaction is promoted to EXCLUSIVE, it acquires a one-minute lock lease from mvstore. At this point we have the chance to fail gracefully and return a "database is locked" error if multiple clients want to acquire the lock on the same DB. This is a best-effort mechanism to prevent conflicts at commit time (which would cause the process to abort).

A future feature is "MVCC-aware clients". Compatible clients can opt in to full, optimistic MVCC and avoid pessimistic locking. After `COMMIT`, the client should call a SQLite custom function provided by mvsqlite to check whether the commit actually succeeded, and retry if not.
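A sketch of what that client-side retry loop might look like. The commit check is simulated here, since the post doesn't name the custom SQLite function mvsqlite will provide:

```python
# Opt-in optimistic-MVCC flow: run the transaction, ask the store whether
# the commit really took effect, and retry on conflict. `commit_check` is a
# stand-in for calling mvsqlite's (unnamed here) commit-check custom function.
def run_transaction(do_work, commit_check, max_retries=5):
    for attempt in range(max_retries):
        do_work()               # BEGIN ... writes ... COMMIT
        if commit_check():      # did the optimistic commit actually succeed?
            return attempt + 1  # how many tries it took
        # Conflict: another writer committed overlapping pages; retry.
    raise RuntimeError("transaction kept conflicting; giving up")

# Toy commit check that conflicts twice, then succeeds:
state = {"fails_left": 2}
def flaky_check():
    if state["fails_left"] > 0:
        state["fails_left"] -= 1
        return False
    return True

print(run_transaction(lambda: None, flaky_check))  # → 3
```

The retry-until-clean-commit loop is the standard shape for optimistic concurrency control; the pessimistic lock lease described above exists precisely so clients that don't implement this loop still fail gracefully.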

So the idea is that a small business could start on SQLite, and then switch over to this when it's time to scale, without re-writing it in the Postgres dialect?

Regardless, it's very very cool. Would love to see it get turned into a product.

If you've written your own multi-versioning, what does FDB bring to the table that you couldn't have gotten out of other distributed but non-transactional KV stores, e.g. Cassandra? Isn't there overhead to the MVCC aspect of FDB? And it sounds like you've had to jump through hoops to get past its duration and size limits, as well...

Cassandra is a column store; surely that would be a very unnatural mapping for a SQL engine backend to use? An orthogonal mapping, to be precise!

Cassandra is not column oriented. It coined the phrase “wide column” (which has caused untold confusion), but this just meant row-oriented without a schema and unlimited numbers of dynamic “columns” in each row. This is no longer true anyway, it is simply row-oriented now in the normal sense.

> But a group of N sqlite databases is an N-writer database. And mvsqlite provides the necessary mechanisms to do serializable cross-database transactions without additional overhead.

I'm confused, are these databases planned to be replicated? Or is it expected for the databases to have separate schemas?

Replication is handled by FDB so you don't need to care about it on the application level. These databases can contain partitioned data of your application, like one DB per user, so that a transaction on only user A and another one on only user B won't conflict.

Did you intend "shard", or s/shard/partition/?

I guess "partition" is the better wording. Updated my comment.

This sounds really cool - do you have any source code in a workable state yet, or is this project still in the early ideation phase?

If there's anything concrete so far, I'd love to take a look and/or try it out!

It already works! There are steps to try it in the README.

Nice, I dug a bit and found the git repository:


Would be great to add this link to the body of your story to make it easy for HNers to get to the thing :)

If no longer editable, consider emailing moderator Dang (hn@ycombinator.com).

Any chance this could get Jepsen tested? I’d donate to make that happen.

Good work! I don't understand the innards of this at all, but I love the design.

IIRC, when I first read through the book Designing Data-Intensive Applications, the author talked about a lot of trade-offs for data storage, replication, and network connectivity issues.

The impression I left with was that, for my particular application, FoundationDB was the best option I had for my wild dreams of web-scale popularity.

The current data persistence layer I use is sqlite, which means that if I use mvsqlite, it only gets easier for me to try FoundationDB for my someday no doubt irresistible web application.

I think this is a great idea, and perhaps SQLite's bigger sibling PostgreSQL will adopt something like it one day.

Comdb2: distributed sqlite: http://comdb2.org/

Does this work similar to rqlite or dqlite from a usage standpoint, or does it solve a different use case?

Haven't you heard of Bedrock by Expensify?

blockchain? different solution.

