
Rqlite: Replicated SQLite using the Raft consensus protocol - mcp_
https://github.com/otoolep/rqlite
======
biokoda
ActorDB takes this about 100 steps further. It replicates the actual WAL pages
instead of SQL commands. So the leader executes the SQL, and followers just
write the pages from the leader.
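
In a nutshell, here is the difference between statement replication (what
rqlite does) and page replication, as a hypothetical Go sketch with made-up
types standing in for the real machinery:

    package repl

    // Page is a hypothetical stand-in for a SQLite WAL page.
    type Page struct {
        No   int
        Data []byte
    }

    // Statement replication (rqlite-style): every replica re-executes the
    // SQL itself, so non-deterministic expressions such as time('now') or
    // random() can diverge across replicas.
    func ApplyStatement(db interface{ Exec(sql string) error }, sql string) error {
        return db.Exec(sql)
    }

    // Page replication (ActorDB-style): only the leader runs the SQL;
    // followers copy the resulting WAL pages byte-for-byte, so all
    // replicas are identical by construction.
    func ApplyPages(store interface{ WritePage(p Page) error }, pages []Page) error {
        for _, p := range pages {
            if err := store.WritePage(p); err != nil {
                return err
            }
        }
        return nil
    }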

[https://github.com/biokoda/actordb](https://github.com/biokoda/actordb)

It uses LMDB to store everything, instead of storing the Raft log in one place
and SQLite in another. It can run thousands of SQLite instances at the same
time, and it actually compresses the pages when storing them to LMDB.

You can query it with a MySQL client library (though Thrift is recommended).
You can have multiple Raft clusters, there is a complementary actordb_console
app to manage it easily, it supports distributed transactions, and more.
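
For illustration, here is a minimal Go sketch of talking to ActorDB over its
MySQL protocol with the go-sql-driver/mysql package. The port, credentials,
schema, and the exact shape of the `actor` addressing statement are my
assumptions, not verified against the ActorDB docs:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/go-sql-driver/mysql" // ActorDB speaks the MySQL wire protocol
    )

    func main() {
        // Hypothetical credentials and port; check your ActorDB config.
        db, err := sql.Open("mysql", "myuser:mypass@tcp(127.0.0.1:33306)/")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // An ActorDB statement is addressed to an actor; the prefix plus
        // the SQL travel as one query string. This reads from the
        // per-user SQLite instance for user 1000.
        rows, err := db.Query("actor user(1000); SELECT id, subject FROM mail;")
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        for rows.Next() {
            var id int
            var subject string
            if err := rows.Scan(&id, &subject); err != nil {
                log.Fatal(err)
            }
            fmt.Println(id, subject)
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
    }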

~~~
atombender
ActorDB is very cool, but from what I gather it's — at least nominally —
designed to work with many small databases (one per "actor") as opposed to one
large one, and the constraints it puts on the query model mean you have to
carefully design your data model to be heavily compartmentalized from the
beginning. It also poses some problems for database-wide queries, which have
to fan out to every separate actor database. rqlite, by contrast, acts like
one single SQLite instance, and can work with existing schemas.

Please do correct me if I'm wrong. Would ActorDB be happy and perform well
with, say, just one actor?

~~~
derefr
To me, that just sounds like it forces you to do the Right Thing from the
beginning. Eventually, you _will_ need to shard your data in some way†.

If you're already locked into a "global queries against tables containing
Everyone's Stuff" data model, you end up doing this through Stupid Database
Tricks like "see-other" redirection or silent persistent-hash load-balancing
(with constant ops-heavy rebalancing as you grow).

If you think about sharding from the beginning, though, you end up just
dividing your data into little naturally-atomic "worlds." Like sharding email
[headers, not bodies] by user, or sharding StackOverflow posts by community.

If you don't think you'll ever scale to the point where you'll be forced to
make those calls, just pick a different database. (But then, if you don't
think you'll ever reach that scale, then literally _any_ database will do.)

\---

† Unless you're EVE Online. They could probably write a really good whitepaper
about how they've scaled their single-node MSSQL database so far. I presume
they're mostly following the same patterns you would when running e.g. Oracle
on big iron; but—according to most sources—they're just using some heavily-
loaded commodity hardware with no fancy IO offloading or the like. No idea how
they do it.

~~~
catnaroek
Please excuse my ignorance - how do I enforce nontrivial system-wide
invariants with lots of little databases rather than a single consolidated
one?

The invariants I care about are:

(0) Referential integrity. If the table `Foo` has a `FOREIGN KEY (BarID)
REFERENCES Bar (BarID)`, then no row in `Foo` may ever be seen with a `BarID`
value that can't be found in the table `Bar`.

(1) Linearizability. There must exist a total order on the entire transaction
history of a database, such that, starting from the empty database, and
executing the transactions nonconcurrently in the given total order, the
result is the current state of the database. (NOTE: The transaction history
need not be physically stored anywhere. So this invariant can't be “tested” -
it has to be proven to hold.)

These guarantees are so basic, so fundamental in my everyday use of RDBMSs,
that I need to be convinced that they hold.

~~~
biokoda
By finding the natural sharding factor. If you have one, which you very often
do. Four completely different examples:

\- Are you running a mail server? Every user is an actor (i.e. an sqlite
instance).

\- Are you running a backend for a dropbox/evernote/messaging type app? Every
user is an actor.

\- Do you have a distributed filesystem? Use the KV mode and shard on file
hash.

\- Do you need scalable counters? Use KV mode again: split every counter into
10 (or 100) shards and increment/decrement those (see the sketch below).
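
That last counter trick is the classic sharded-counter pattern. Here it is as
an in-memory Go sketch; in ActorDB each shard would be a KV actor, and the
naming and structure here are mine:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // ShardedCounter splits one hot counter into n sub-counters so that
    // concurrent increments land on different shards instead of all
    // contending on a single row.
    type ShardedCounter struct {
        shards []int64
    }

    func NewShardedCounter(n int) *ShardedCounter {
        return &ShardedCounter{shards: make([]int64, n)}
    }

    // Incr picks a random shard, spreading write contention.
    func (c *ShardedCounter) Incr(delta int64) {
        c.shards[rand.Intn(len(c.shards))] += delta
    }

    // Value sums all shards; reads fan out, but reads are rarer.
    func (c *ShardedCounter) Value() int64 {
        var total int64
        for _, v := range c.shards {
            total += v
        }
        return total
    }

    func main() {
        c := NewShardedCounter(10)
        for i := 0; i < 1000; i++ {
            c.Incr(1)
        }
        fmt.Println(c.Value()) // 1000
    }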

~~~
catnaroek
Again, please excuse my ignorance - what is the definition of “natural
sharding factor”, and how would I compute it for, say, an ERP system?

~~~
biokoda
I have no experience developing ERP systems. I do have experience being on
the receiving end of a very poorly working one.

These systems seem (from my outside view) to have a tendency to become giant
monoliths. So when developing you must fight against increasing monolithic
complexity. Using something like ActorDB can be somewhat of a beneficial
constraint. It forces you to maintain a clean design.

I would force sales, marketing, shipping, product planning (taking from
Wikipedia here...) to be their own separate actors with their own schemas.
Then, if possible, shard within those types. So if something is customer
service, have an actor per customer and keep all their data there. If you're
developing multiple products, every product gets an actor.

~~~
catnaroek
> These systems seem (from my outside view) to have a tendency to become giant
> monoliths. So when developing you must fight against increasing monolithic
> complexity.

You aren't wrong; that's my experience as well. It's just as annoying for
programmers (or, at least, for me) as it is for users. The following question
has popped up countless times in my head: “Why do I have to rely on an
implicit convention that this application module never touches this database
table?” There was never a good answer.

The only reason why I put up with such things is that I have no idea how to
prevent more modular designs from turning into a data integrity nightmare.
(I'll freely admit my lack of education is to blame here.) For instance, let's
say we have three modules: inventory, sales and shipping. Furthermore, let's
assume each module is its own actor and uses its own backend database. We must
implement the use case “enter a sale in the system”:

(0) The sales module asks the inventory module whether there is enough of a
product in stock to satisfy a customer order. The expected sequence of actions
is:

(1) The inventory module “locks” the requested quantity/amount of the product
[so that it can't be used, say, for another sale], and gives the sales module
a “token” that can be used to confirm or cancel the withdrawal.

(2) The sales module asks the shipping module whether there are enough
available trucks/ships/whatever to ship the product to the customer's location
by a given delivery date.

(3) The shipping module “locks” however many trucks/ships/whatever it deems
necessary to ship the product, and gives the sales module a “token” that can
be used to confirm or cancel the shipping.

(4) The sales module queries the user for the customer's credit card number
and verification code, interfaces with the bank's system, blablabla...

(5) The sales module confirms to the inventory and shipping modules [in this
specific order] that the product will be withdrawn and shipped.

Now some exception handling:

(6) If step 3 fails, the sales module cancels the product withdrawal.

(7) If step 4 fails, the sales module cancels the product withdrawal and
shipping.

(8) If any system [inventory, sales, shipping] goes down, neither the product
nor the trucks/ships/whatever can be kept locked forever. So each lock must
have a timeout, and, if it's neither explicitly confirmed nor explicitly
cancelled by the sales module, it will be implicitly cancelled by the
inventory and/or shipping module when the timeout elapses.

(9) It may happen [unlikely, but not impossible] that the inventory and
shipping modules' clocks get unsynchronized in such a way that, by the time
the product withdrawal has been confirmed, the shipping lock has already
expired. _Oh, the nightmare._
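
For concreteness, here is a rough Go sketch of steps (1) through (8), with
entirely hypothetical module interfaces. This is the generic lock/token
(saga-style) pattern, not any particular product's API:

    package erp

    import (
        "errors"
        "time"
    )

    // Each "lock" returns a token that must be confirmed or cancelled,
    // and expires on its own after a timeout (step 8).
    type Inventory interface {
        Reserve(product string, qty int, ttl time.Duration) (token string, err error)
        Confirm(token string) error
        Cancel(token string) error
    }

    type Shipping interface {
        Reserve(product string, qty int, dest string, ttl time.Duration) (token string, err error)
        Confirm(token string) error
        Cancel(token string) error
    }

    type Payments interface {
        Charge(card string, amountCents int64) error
    }

    // EnterSale walks steps (1)-(7): reserve stock, reserve shipping,
    // charge the card, then confirm both reservations. On any failure it
    // cancels whatever was already reserved.
    func EnterSale(inv Inventory, ship Shipping, pay Payments,
        product string, qty int, dest, card string, amountCents int64) error {

        const ttl = 5 * time.Minute // step (8): locks self-expire

        invTok, err := inv.Reserve(product, qty, ttl) // step (1)
        if err != nil {
            return err
        }

        shipTok, err := ship.Reserve(product, qty, dest, ttl) // steps (2)-(3)
        if err != nil {
            inv.Cancel(invTok) // step (6)
            return err
        }

        if err := pay.Charge(card, amountCents); err != nil { // step (4)
            ship.Cancel(shipTok) // step (7)
            inv.Cancel(invTok)
            return err
        }

        // Step (5): confirm inventory first, then shipping. If a Confirm
        // fails because its lock already timed out (step 9), we are in
        // exactly the awkward partial state described above.
        if err := inv.Confirm(invTok); err != nil {
            ship.Cancel(shipTok)
            return errors.New("charged the card but lost the stock lock: manual repair needed")
        }
        return ship.Confirm(shipTok)
    }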

Implementing all of this _correctly_ in all cases is actually tricky! And if
anything is implemented _slightly_ wrong, the whole system goes kaboom! With a
monolithic database, there is no need to “lock” any resources, nor issue
“confirmation tokens” - just use the DBMS's built-in transaction system!

~~~
biokoda
> Furthermore, let's assume each module is its own actor and uses its own
> backend database.

I would have every module be an actor type. There can be multiple types, and
each type has its own schema. Within an actor type there are many actors. An
actor for every product, for instance. So all X widgets are in one actor.

> With a monolithic database, there is no need to “lock” any resources, nor
> issue “confirmation tokens” - just use the DBMS's built-in transaction
> system!

ActorDB has distributed ACID transactions so I would use that. You can create
a transaction over multiple actors. The reason I would split it into many
actors is that you're always locking small parts of the system for the
duration of the transaction, not the entire DB.
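
If I read that right, a transaction spanning several actors might look
something like the sketch below, reusing a MySQL-protocol `db` handle. The
`actor` addressing syntax and every name here are unverified assumptions:

    package sale

    import "database/sql"

    // RecordSale decrements stock in the inventory actor for widget X and
    // records the order in customer 42's sales actor. The write commits
    // on both actors or on neither; actor and table names are
    // hypothetical.
    func RecordSale(db *sql.DB) error {
        _, err := db.Exec(`
            actor inventory(widget_x);
            UPDATE stock SET qty = qty - 1 WHERE qty >= 1;
            actor sales(customer_42);
            INSERT INTO orders (product, qty) VALUES ('widget_x', 1);`)
        return err
    }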

~~~
catnaroek
> I would have every module be an actor type. There can be multiple types,
> and each type has its own schema. Within an actor type there are many
> actors.

Noted.

> ActorDB has distributed ACID transactions so I would use that. You can
> create a transaction over multiple actors.

Sweet! In particular, the second sentence is exactly what I wanted to hear.

I will give ActorDB a try.

------
mslot
I've written something similar for PostgreSQL using Paxos:
[https://github.com/citusdata/pg_paxos](https://github.com/citusdata/pg_paxos).
It can replicate tables across a set of PostgreSQL servers.

pg_paxos is an extension that uses PostgreSQL's query execution hooks, so it
doesn't require any external components. An advantage of pg_paxos is that you
can do reads and writes on any node. Would love to know what people think and
whether it's something you would use (if rigorously tested). Small warning: it
replicates SQL queries and doesn't yet stop you from using volatile functions
like now(), which would give inconsistent results.

A possible application is a reliable cron service. Each cron node tries to
add an entry with the same primary key to a replicated locks table, via any of
the PostgreSQL nodes. Only one will succeed, and that node then runs the cron
job.
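
A sketch of that node-side logic in Go, using database/sql and lib/pq. The
table and column names are mine, not part of pg_paxos:

    package cron

    import (
        "database/sql"
        "time"

        "github.com/lib/pq" // PostgreSQL driver
    )

    // TryAcquire attempts the INSERT that every node races on: the target
    // is a pg_paxos-replicated table whose primary key is (job_name,
    // scheduled_at). Paxos orders the writes, so exactly one INSERT
    // commits; the winner runs the job.
    func TryAcquire(db *sql.DB, job string, at time.Time) (bool, error) {
        _, err := db.Exec(
            `INSERT INTO cron_locks (job_name, scheduled_at) VALUES ($1, $2)`,
            job, at)
        if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "23505" {
            return false, nil // unique_violation: another node won the race
        }
        if err != nil {
            return false, err
        }
        return true, nil // we hold the "lock"; run the cron job
    }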

Some more details in these slides:
[http://www.p2d2.cz/files/pg_paxos.pdf](http://www.p2d2.cz/files/pg_paxos.pdf)

~~~
CodeWriter23
What was your motivation to DIY instead of using Bucardo?

~~~
mslot
Bucardo provides asynchronous (eventually consistent) replication. If you need
strong consistency with fault-tolerance, you need something like pg_paxos, but
it comes at a significant performance penalty, so it's not really an
alternative to existing replication solutions.

~~~
CodeWriter23
Thanks. I've been looking at PG replication solutions to get rid of the hand-
coded synchronization built into a couple of my apps that date back to the
dial-up days.

------
ziedaniel1
"Due to the nature of Raft, there is a very small window (milliseconds) where
a node has been disposed as leader, but has not yet changed its internal
state. Therefore, even with the leader check in place, there is a very small
window of time where out-of-date results could be returned."

No, that's absolutely not inherent to the nature of Raft. It sounds like
they're making the same mistake that etcd did, simply relying on whether a
node believes it's still the leader rather than having it confirm that fact by
committing a log entry: [https://aphyr.com/posts/316-jepsen-etcd-and-
consul](https://aphyr.com/posts/316-jepsen-etcd-and-consul)
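
The fix is to have the node prove it is still leader through the log before
answering. A minimal sketch of the idea, with a hypothetical interface (this
is neither rqlite's nor etcd's actual API):

    package read

    import "fmt"

    // RaftNode is a hypothetical interface, for illustration only.
    type RaftNode interface {
        // CommitNoop proposes an empty log entry and blocks until it
        // commits; it fails if this node is not (or no longer) leader.
        CommitNoop() error
        QueryLocal(sqlText string) ([][]string, error)
    }

    // LinearizableQuery answers a read only after confirming leadership:
    // if the no-op commits, this node was still leader at that point, and
    // its local state includes everything committed before the no-op.
    // Merely believing it is leader is not enough.
    func LinearizableQuery(r RaftNode, sqlText string) ([][]string, error) {
        if err := r.CommitNoop(); err != nil {
            return nil, fmt.Errorf("leadership not confirmed: %w", err)
        }
        return r.QueryLocal(sqlText)
    }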

~~~
otoolep
We're way ahead of you.

[https://github.com/otoolep/rqlite/issues/5](https://github.com/otoolep/rqlite/issues/5)

But yeah, that is awkwardly phrased -- let me fix it.

~~~
otoolep
Done. You're right that it's not Raft, it's rqlite. Though I plan to add an
option that will send all queries through Raft, if so requested.

~~~
ziedaniel1
Great - sorry for the harsh words, I guess I've just come to expect people to
screw up distributed systems :)

------
TheHippo
> rqlite reports a small amount anonymous data to Loggly, each time it is
> launched. This data is just the host operating system and system
> architecture and is only used to track the number of rqlite deployments.
> Reporting can be disabled by passing -noreport=true to rqlite at launch
> time.

This is buried very close to the end of the Readme and is, in my opinion, the
wrong default setting.

~~~
otoolep
This behaviour has been in the program from the very start, and was a poor
decision on my part. I've pulled it completely from the code.

------
danielvf
Hey! This was the Stripe CTF 3 final challenge problem. Good work making a
polished version.

~~~
mbell
It was a fun level :)

Anyone know if Stripe is planning another CTF? The last one was great fun.

------
ktRolster
I'm having trouble understanding this (but everyone here thinks it's rather
cool, so I think I'm missing something).

I've used SQLite mainly for small, app-local data storage (because it's easier
than creating a custom file-format). What is the purpose of doing replication
on such a database?

~~~
otoolep
If you don't care about losing your data when you lose your host machine, then
there is no point.

The point is to have copies of your SQLite data on multiple different machines
so if you lose one of the machines, you don't lose your data. The same reason
anyone replicates a database.

~~~
otoolep
Though there are other reasons to replicate databases -- read replicas. rqlite
can do this too.

------
bernardom
Branding feedback: I can't be the only one who immediately thought this had
something to do with R!

sqldf is a pretty popular package in R; RSQLite also exists. Maybe the R
language shouldn't have a monopoly on packages starting with R, though...

~~~
otoolep
That's because the OP called it "Rqlite". I actually named it "rqlite". "r" as
in replicated.

[https://github.com/otoolep/rqlite#pronunciation](https://github.com/otoolep/rqlite#pronunciation)

~~~
bernardom
Makes sense, thanks!

------
philips
Here is a similar project built on the etcd raft package:
[https://github.com/chzchzchz/raftsql](https://github.com/chzchzchz/raftsql)

The etcd raft package is used by CockroachDB and etcd. The godoc is over here:
[https://godoc.org/github.com/coreos/etcd/raft](https://godoc.org/github.com/coreos/etcd/raft)

------
pyvpx
can anyone recommend a good overview of Raft vs. Paxos?

~~~
krat0sprakhar
While this doesn't answer your question, you will find this interesting -
[https://ramcloud.stanford.edu/~ongaro/userstudy/](https://ramcloud.stanford.edu/~ongaro/userstudy/)

~~~
dang
That's pretty interesting and doesn't look like it has been discussed on HN
before. You should submit it.

~~~
wyldfire
I'm almost positive that I heard about it here. It's Raft's primary
distinguishing feature from Paxos, right?

But hey, if you hadn't heard about it then there's a good chance that other
HNers haven't either.

------
ignoramous
Since SQLite can run in-memory, does this now make a good replacement for
distributed in-memory databases like MemSQL and VoltDB? Also, could this act
as a replacement for memcached/redis?

To me, it looks like there's an explosion in database solutions, and most
solutions are converging on an overlapping set of features.

~~~
pkolaczk
To me it doesn't make a good replacement. It is replicated, but not scalable.
All the hard work is thrown at the leader.

~~~
otoolep
That's right. I wouldn't claim that rqlite is scalable. That's not the point.

------
api
Was thinking of doing this myself! Awesome.

------
pkolaczk
The description suggests it is fault-tolerant, but then I read about all
writes and queries going through a single leader, which is an obvious SPOF.
Even if temporary, a SPOF is still a SPOF.

~~~
allengeorge
You can have a leader and still be fault tolerant...

At any rate, if the leader dies another member of the cluster is elected and
operations proceed. Yes, it's not "always available", but that's the trade-off
you make when you impose strong consistency.

~~~
pkolaczk
"Not always available" means "not fault tolerant" to me.

You can have strong consistency with no leader and no temporary SPOF with
Egalitarian Paxos. And even with standard Paxos/Raft it is possible to
architect the system to not have a single leader, but multiple leaders
responsible for different partitions of data.

~~~
allengeorge
Even if you had multiple leaders responsible for different partitions, you'd
still be (temporarily) unavailable if a leader died. So while you have
improved the system's availability characteristics, you haven't changed the
underlying fault model.

Also, keep in mind that not _everything_ has to go through the leader. If your
application is OK with stale reads, you can read committed data from any
follower, allowing you to be available (for some definition of available)
while the leader is down.

Also, IIRC most _practical_ strongly-consistent systems aren't egalitarian
(I'd love to hear otherwise) and instead rely on a leader.

