Raft is the secret sauce behind many distributed databases, like CockroachDB, TiDB, Yugabyte, etc.
But to make it scalable and production grade, you have to run many Raft groups in parallel and manage them well: sharding the data across them so the load stays balanced, for example, and adjusting the shards dynamically.
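Roughly, the idea is to run many independent Raft groups and route each write to one of them by key. Here is a minimal Go sketch of that routing, assuming a hypothetical raftGroup type whose Propose method stands in for a real Raft instance; systems like TiKV and CockroachDB also split and rebalance shards dynamically rather than hashing into a fixed set.

    package main

    import "hash/fnv"

    // raftGroup is a hypothetical stand-in for one independent Raft instance
    // (e.g. one hashicorp/raft group per shard).
    type raftGroup struct {
        id int
    }

    // Propose stands in for replicating a command through this group's Raft log.
    func (g *raftGroup) Propose(cmd []byte) error { return nil }

    // multiRaft routes each key to the Raft group (shard) that owns it.
    type multiRaft struct {
        groups []*raftGroup
    }

    // groupFor uses hash-based routing to keep the sketch simple; most real
    // databases use range-based routing so shards can be split under load.
    func (m *multiRaft) groupFor(key []byte) *raftGroup {
        h := fnv.New32a()
        h.Write(key)
        return m.groups[int(h.Sum32())%len(m.groups)]
    }

    func (m *multiRaft) Write(key, cmd []byte) error {
        return m.groupFor(key).Propose(cmd)
    }

    func main() {
        m := &multiRaft{groups: []*raftGroup{{id: 0}, {id: 1}, {id: 2}}}
        _ = m.Write([]byte("user:42"), []byte("INSERT INTO users VALUES (42)"))
    }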
> The key enabler of these properties is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google’s cluster-management software provides an implementation of the TrueTime API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks).
This is ultimately an unimportant distinction. Multi-Paxos and Raft have only minor differences, and those matter only internally. From the perspective of designing a larger database system like Spanner or CockroachDB, the differences are not material.
Really nice blog post! Especially because it actually builds something that works in the end, while keeping stuff fairly simple.
Another approach could be using PostgreSQL as a stateless scalable query execution engine, while stubbing out the storage layer with a remote replicated consistent storage solution.
> Another approach could be using PostgreSQL as a stateless scalable query execution engine, while stubbing out the storage layer with a remote replicated consistent storage solution.
This is also what Yugabyte (OSS) does, at least according to my general understanding. It also supports the Cassandra API (though not to access the same data).
There’s also YugabyteDB which is actual Postgres with distributed flex. It even supports regular Postgres extensions. Anything in golang imitating Postgres will be subpar for years to come.
YugabyteDB splits tables into tablets (partitions); every tablet has its own Raft group. It’s a modified RocksDB under the hood.
OTOH, it’s cool to see how someone builds something like this from scratch and is willing to talk about it.
Yugabyte’s lower level supports cross-shard transactions as a fundamental primitive. This allows consistent reads and ACID transactions similar to what you would expect from a single-node Postgres.
I really enjoyed this post. I had a question: is the raft-boltdb instance the actual Raft log that the SQL commands are being written to, then?
It looks like the Apply method is responsible for writing the SQL statements to the Raft log as well as executing the SQL queries. Is waiting for a quorum of writes to the Raft log by the other members not needed? Or is this all just handled under the hood by the Raft libraries being used?
It's all handled by the Raft libraries (I know this because rqlite uses the same Raft library as this blog post). When integrating with Hashicorp's Raft code, your code must implement Apply()[1]. The Hashicorp code promises not to call Apply() until a quorum of nodes has persisted the data to the Raft log. That way, when your code is called to modify your state machine (FSM), quorum has already been achieved by the Raft system.
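For anyone curious what that pattern looks like in code, here is a rough Go sketch against hashicorp/raft's FSM interface. The sqlFSM type and the execSQL helper are made up for illustration (the blog post and rqlite each have their own versions); the point is that Apply() is only invoked after the library has committed the entry to a quorum, so the FSM just mutates local state.

    package store

    import (
        "io"

        "github.com/hashicorp/raft"
    )

    // sqlFSM embeds a local database engine behind Raft. The engine itself is
    // omitted here; execSQL below is a hypothetical stand-in for it.
    type sqlFSM struct{}

    // Apply runs once the log entry is committed (a quorum has already
    // persisted it), so it only needs to execute the statement locally.
    func (f *sqlFSM) Apply(l *raft.Log) interface{} {
        return f.execSQL(string(l.Data))
    }

    // Snapshot and Restore let Raft compact its log; stubbed out here.
    func (f *sqlFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
    func (f *sqlFSM) Restore(rc io.ReadCloser) error      { return rc.Close() }

    // execSQL is a hypothetical helper standing in for the real storage engine.
    func (f *sqlFSM) execSQL(stmt string) error { return nil }

On the write path, the leader submits a command with r.Apply(data, timeout) and waits on the returned future, which only succeeds once the entry has been committed and applied to the leader's FSM.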
Thanks. So this embedding of a database engine into the Raft FSM is the general pattern then?
It looks like Phil's post uses boltdb for the Postgres storage engine as well as for the Raft log (via Hashicorp's Raft library implementation).
Thanks for the link to the slides as well. I've seen rqlite mentioned a few times in the last few weeks and so it was on my short list of things to read up on.
Novices: keep in mind that this is actually just the minimum viable code to support a distributed system. The next thing you need is enough tooling and features to support integrity and reliability guarantees. For example, when network instability forces each of your operations to fail in an unexpected way, you need to be able to resolve the different failure modes and reestablish a consistent application and network state. The Raft algorithm doesn't do that for you.
It's like how TCP doesn't guarantee that your data will be written or processed or transmitted without corruption; only that it will be received without corruption. You will still get corruption after recv() and before send(), and you have to handle those cases, or your application will begin to introduce unknown data corruption. And when that happens you need the tooling and telemetry to address it after the fact, "distributedly".
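To make that end-to-end point concrete, here is a generic Go sketch (not from the post or any particular library) of carrying an application-level checksum alongside the payload, so corruption introduced outside the network path can at least be detected:

    package main

    import (
        "bytes"
        "crypto/sha256"
        "errors"
        "fmt"
    )

    // seal prepends a SHA-256 digest, computed where the data is produced.
    func seal(payload []byte) []byte {
        sum := sha256.Sum256(payload)
        return append(sum[:], payload...)
    }

    // open re-verifies the digest where the data is finally processed, so
    // corruption between send() and recv() handling doesn't pass silently.
    func open(msg []byte) ([]byte, error) {
        if len(msg) < sha256.Size {
            return nil, errors.New("short message")
        }
        payload := msg[sha256.Size:]
        sum := sha256.Sum256(payload)
        if !bytes.Equal(msg[:sha256.Size], sum[:]) {
            return nil, errors.New("payload corrupted between producer and consumer")
        }
        return payload, nil
    }

    func main() {
        msg := seal([]byte("INSERT INTO t VALUES (1)"))
        payload, err := open(msg)
        fmt.Println(string(payload), err)
    }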
Something I'm not clear on from reading the post - the documentation for the Apply() method on the Raft type (https://pkg.go.dev/github.com/hashicorp/raft#Raft.Apply) mentions that "This must be run on the leader or it will fail". With the code as written, it seems like a user running psql has to know beforehand which process is the Raft leader for running CREATE TABLE/INSERT statements, and there's no way to redirect those messages to the Raft leader. Is this something that could be handled (and was just left out for brevity's sake), or is there something fundamental I'm missing?
I think the way rqlite (which uses the same Raft library) does it is by forwarding requests from the follower to the leader. I think they do that in application code.
I don't know for sure how this situation is supposed to be handled in a production system though.
Every node in the cluster "knows" the network address of the Leader. It knows because a) it's part of the Raft cluster configuration, and b) the Leader periodically "heartbeats" to every follower, so each follower always learns who the current Leader is.
So it's actually pretty simple for a given node to contact the Leader. If a node receives a request which must be performed on the Leader, and that node is not itself the Leader, it can do one of the following things:
1) reject the request with an error, but this isn't really a production-viable option.
2) reject the request with an error, but tell the client where the Leader can be found, so the client can retry the request, this time sending the request to the Leader.
3) transparently forward the request to the leader, wait for the Leader to execute the request, get the response, and return the response to the client. In this case the client doesn't even know the forwarding to the Leader happened.
rqlite supports modes 2 and 3; the client can choose which behavior it wants on a request-by-request basis. Option 3 is the default.
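A rough Go sketch of what options 2 and 3 can look like with hashicorp/raft, which exposes State() and Leader(). The HTTP handler, the forwarding, and the execLocally helper are illustrative assumptions rather than rqlite's or the post's actual code; note also that Leader() returns the Raft transport address, which a real system must map to the Leader's client-facing API address.

    package server

    import (
        "fmt"
        "io"
        "net/http"

        "github.com/hashicorp/raft"
    )

    type node struct {
        r *raft.Raft
    }

    func (n *node) handleWrite(w http.ResponseWriter, req *http.Request) {
        if n.r.State() == raft.Leader {
            n.execLocally(w, req) // we are the Leader: apply via Raft as usual
            return
        }

        leader := string(n.r.Leader()) // every node tracks the current Leader
        if leader == "" {
            http.Error(w, "no leader elected", http.StatusServiceUnavailable)
            return
        }

        // Option 2: tell the client where the Leader is and let it retry.
        // http.Redirect(w, req, "http://"+leader+req.URL.Path, http.StatusTemporaryRedirect)

        // Option 3: forward transparently and relay the Leader's response.
        resp, err := http.Post("http://"+leader+req.URL.Path, "application/json", req.Body)
        if err != nil {
            http.Error(w, fmt.Sprintf("forward to leader failed: %v", err), http.StatusBadGateway)
            return
        }
        defer resp.Body.Close()
        w.WriteHeader(resp.StatusCode)
        io.Copy(w, resp.Body)
    }

    // execLocally is a placeholder for the normal leader-side Apply() path.
    func (n *node) execLocally(w http.ResponseWriter, req *http.Request) {}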
I was expecting a guerrilla db instead of a cockroach db. Like serialising the db for Dropbox. This is like a drifter camping on private land... you can say it isn't really decentralised, but I'll just move somewhere else.
I'd say distributed is about redundancy and scale within a single entity's infrastructure, whereas decentralized implies multiple entities working together.
This started me thinking about counterexamples, and I thought of DNS. And it occurred to me that it's actually a good way to phrase the argument against split-horizon DNS: DNS was designed to be a distributed system (e.g. like DB replicas), and split-horizon DNS abruptly changes one branch of it to a decentralized system (e.g. like Git) without updating its users.
Distributed and decentralized have specific meanings in the database space. Decentralized is like Bitcoin: no centralized servers or control. Distributed just means more than one server. The servers could be located in the same rack, or even on the same physical machine in separate virtual machines, and that would be distributed but not geographically distributed.
> What is CockroachDB under the hood? ... wire protocol, a storage layer, a Raft implementation, PostgreSQL's grammar definition. ... To be absurdly reductionist, CockroachDB is just the glue around these libraries
^ semi off topic, but love the idea of distinguishing certain projects as 'just glue'. (Not a dis, glue matters). Especially interesting in the context of OSS tools whiteboxed by cloud vendors with 'proprietary glue'
I said absurdly reductionist and definitely meant that. :) There is a ton going on under the hood to make it work well, to fully implement postgres, and to work efficiently.
I don’t think we should be saying “fully implement PostgreSQL.” People are feeling burned by this notion because it sounds like a drop-in replacement, but it isn’t really that simple. Being wire compatible is mostly just convenient, since you can reuse existing drivers. But existing software built on top of PostgreSQL won’t necessarily work out of the box…
I've written about the "world of postgres wire compatible" [0], and yes, there are some databases that implement the protocol but don't implement Postgres (see: immudb and RavenDB). I give databases like CockroachDB more credit for actually making their intention clear and generally following through with implementing Postgres. Also, unlike some others on my list, CockroachDB really is meant as an OLTP replacement. Performance characteristics and operations will definitely differ, but as solely a user/application developer you shouldn't notice as much.
I think the original poster might mean in terms of Postgres-specific functionality.
For instance, a lot of Postgres tools use "pg_catalog" to do introspection on the datasource. Supporting pg_catalog is a bit of a pain -- this is the reason why Materialize doesn't work with Hasura.
CockroachDB also just barely doesn't work with Hasura OOTB, because of a handful of Postgres-specific functions and some metadata.
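For a concrete sense of what that introspection looks like, here is the kind of pg_catalog query such tools issue, wrapped in a generic Go database/sql example; the query, the lib/pq driver choice, and the connection string are illustrative, not Hasura's or Materialize's actual code.

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq"
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost:5432/app?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // The kind of catalog query a Postgres-compatible database must be able
        // to answer for introspection-heavy tools (BI clients, Hasura) to work.
        rows, err := db.Query(`
            SELECT c.relname, a.attname, t.typname
            FROM pg_catalog.pg_class c
            JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid AND a.attnum > 0
            JOIN pg_catalog.pg_type t ON t.oid = a.atttypid
            WHERE c.relkind = 'r'`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()
        for rows.Next() {
            var table, column, typ string
            if err := rows.Scan(&table, &column, &typ); err != nil {
                log.Fatal(err)
            }
            fmt.Println(table, column, typ)
        }
    }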
Yeah that's fair and you'd know better than I. But again for most app developers they won't notice as much. It's just systems that plug into metadata (BI tools, Hasura, etc.) that might not work as well (or I guess in Hasura's case, not work at all?).
Time might be up (saying this as a long-term PG user). With Litestream adding distributed features to SQLite, local in-process databases might be the next thing.
Local in-process databases might be a bigger thing, but they do nothing to help cases where 10 different applications on different machines need to read and write the same database, do they? I would think the target audiences are very different, i.e. people who are finding non-distributed PG insufficient probably can't alleviate their problems by switching to SQLite.
It depends on your access pattern. If every application is write-heavy against the same data set, this won't work. If every application is light on writes and those writes are mostly local (as for many SaaS companies), this will work fine with the future Litestream that redirects writes to one db and then distributes the data (if you can live with slightly stale data). It also depends on the size of your company: Slack does 300k messages/s, and this won't work there :-)
Oh, maybe I am behind on this “future litestream” that “redirects writes”. I only knew of the ability to sync/backup/restore. Do you have a link to something discussing those other features?
"The upcoming release of Litestream will let you live-replicate SQLite directly between databases, which means you can set up a write-leader database with distributed read replicas. Read replicas can catch writes and redirect them to the leader"
Litestream allows scale-up application architectures, whereas Postgres allows scale-out architectures (by scaling up only the database portion instead of also scaling up the application servers).
It's basically the same thing as monolith vs microservices but extending monolith to the data persistence layer. With horizontally scaling apps being the predominant architecture right now, I don't really see Litestream changing much.
If you're going to horizontally scale Litestream with multiple writers, you're going to end up introducing all the network and synchronization pieces Postgres architectures already have.
I only have experience with Azure's App Service, but their App Service can scale out and still use the same mounted local storage layer shared across all app instances. So I've been considering this for multi-tenant applications where each tenant's data can be siloed in its own SQLite database.
I'm curious if you can get that setup to work reliably (and if it's performant when done). It looks like Azure uses SMB to mount a share, so SQLite + SMB caveats apply (generally, getting synchronous writes and locks working correctly over a network filesystem). Additional details: https://www.sqlite.org/draft/useovernet.html
Litestream still only does single-master, right? To me it feels like the next step up would be something like foundationdb (true multi-master) with a good psql layer.
I know this is controversial, but I think for 99.99% of companies multi-master is a fetish (been there, bought the t-shirt). But if you need it, yes, Litestream is going to be single-master without automatic failover.
I was half-thinking about taking the Postgres wire protocol code[1] that CockroachDB uses, and adding it to rqlite[2]. Then you could talk to rqlite using Postgres tooling (perhaps). The only thing is that rqlite doesn't support distributed transactions in the usual sense, so there might be some difficulty mapping Postgres transaction control to rqlite. But still, could be interesting.
>"The only thing is that rqlite doesn't support distributed transactions in the usual sense, so there might be some difficulty mapping Postgres transaction control to rqlite."
Can you elaborate on this? In what sense does rqlite support distributed transactions?
You can wrap a single request (containing multiple SQL statements) in a transaction, and either all the statements will be successful, or none will. That's what rqlite currently supports.
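Concretely, that looks something like the request below: several statements in one request, applied atomically or not at all. The /db/execute endpoint and the transaction query parameter reflect my reading of rqlite's HTTP API; treat the exact details as an assumption and check the rqlite docs.

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // Both inserts below succeed or fail together within the one request;
        // that is the extent of the "transaction" support described above.
        stmts := []string{
            "INSERT INTO accounts(id, balance) VALUES (1, 100)",
            "INSERT INTO accounts(id, balance) VALUES (2, 200)",
        }
        body, err := json.Marshal(stmts)
        if err != nil {
            log.Fatal(err)
        }

        resp, err := http.Post(
            "http://localhost:4001/db/execute?transaction", // 4001 is rqlite's default HTTP port
            "application/json",
            bytes.NewReader(body),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }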
I would rather use logical replication to achieve a 1-master/1-slave, 1-master/multi-slave, or multi-master/multi-slave configuration, after working out the schema design details and the read-write query patterns across tables.
My approach when learning new protocols like Raft or Paxos is to implement them in PlusCal (TLA+'s higher-level language) or P (https://github.com/p-org/P). I've found that helps separate the protocol-level concerns from the implementation-level concerns (sockets? wire format?) in a way that reduces the difficulty of learning the protocol.
> But to make it scalable and production grade, you have to run many Raft groups in parallel and manage them well: sharding the data across them so the load stays balanced, for example, and adjusting the shards dynamically.
Here's a pleasant article to read: https://en.pingcap.com/blog/how-tikv-reads-and-writes/