
Bedrock – Rock-solid distributed data - qertoip
http://bedrockdb.com/
======
hbrundage
I don't quite get why SQLite is the right primitive to compose a large scale
database system out of...

If the design goal is write throughput, SQLite can be beat really easily by
the client/server systems since each SQLite instance only supports one
concurrent writer. I guess users of Bedrock could partition their datasets
into different pieces that aren't often written in tandem, but why go through
all the trouble instead of just using a database with real MVCC?

If the design goal is geo replication, Bedrock seems to ignore the past many
years of innovation in the field. Over a long time, MySQL and company have
methodically built up features, like GTIDs, that allow safe synchronous and
asynchronous replication while minimizing the risk of downtime or data loss.

If the design goal is read throughput, I don't think Bedrock can be faster
than just pure SQLite if your data can live beside your application, and if
you have more than one node, you need a complicated SQL optimizer that
understands how to push down things like predicates and aggregates to the
leaves, and then recombine results at the top level to actually efficiently
read in parallel, which I don't think Bedrock has built yet.

This thing kind of seems reminiscent of Jetpants or one of the other MySQL
orchestration tools that allow for scaling the DB beyond many nodes
automatically... but... why SQLite?

~~~
quinthar
Thanks for asking!

Re: "SQLite can be beat really easily by the client/server systems since each
SQLite instance only supports one concurrent writer" \-- That's not actually
true, sqlite supports concurrent writers via their page-locking branch (and
changesets allow for effectively row-level locks). But none of that matters,
because single-threaded replication means any multi-threaded write capability
is irrelevant. Regardless, when we add multi-threaded replication (on the
way), we'll take advantage of SQLite's multi-thread write capabilities.

Re: MySQL's replication. What specifically are you referring to? Its
distributed transaction features are actually quite new -- and I _think_ were
developed _after_ Bedrock was already in production for years. Furthermore,
MySQL doesn't support automatic master failover (or recovery), and struggles
over high-latency, low-reliability WAN connections. Anyway, if I'm wrong on
any of that, I'm eager to be corrected. Thanks!

Re: Read throughput. Agreed it doesn't have "parallel reads" (eg, it doesn't
split a large read up into multiple smaller reads that are joined
automatically). However, I don't think MySQL does either. Regardless, Bedrock
does quite well for normal read queries, and uses all of the node's CPUs
efficiently. (To be clear, I _love_ this idea of adding "super threaded
reads", and have some ideas. But at the moment we don't have any application
that requires it, so haven't built it.)

Re: Jetpants, I haven't heard of that before. Thanks for the tip! But as for
why SQLite over MySQL, I'd ask quite the opposite: why add a networked layer
on top of MySQL, thereby ignoring all of MySQL's networking? That just adds
all the overhead of MySQL, without any of the benefit. If you're going to
build a networking layer atop a database, SQLite seems the _clear_ choice
because it is explicitly designed to be a database _library_ -- and not a
database _server_.

Thanks for all these comments!

~~~
chucky_z
Automatic failover -- [https://dev.mysql.com/doc/mysql-utilities/1.5/en/mysqlfailover.html](https://dev.mysql.com/doc/mysql-utilities/1.5/en/mysqlfailover.html)

Guides to doing auto failover and recovery:

[https://dev.mysql.com/doc/mysql-utilities/1.5/en/utils-task-autofailover.html](https://dev.mysql.com/doc/mysql-utilities/1.5/en/utils-task-autofailover.html)

[https://dev.mysql.com/doc/mysql-utilities/1.5/en/utils-task-restore-master.html](https://dev.mysql.com/doc/mysql-utilities/1.5/en/utils-task-restore-master.html)

~~~
quinthar
Cool, thanks! It looks like that came out in 2012? Neat, I'll take a look!

~~~
quinthar
Incidentally, if anybody has experience with this I'd love to know:

1) What happens if "mysqlfailover" itself dies? For example, if I have 6
servers split equally between 2 datacenters, and the datacenter running
mysqlfailover loses power -- how does the other datacenter get reconfigured?

2) If you run two copies of mysqlfailover (one in each datacenter), how does
it solve the "split brain" problem? If you have 6 servers split equally
between two datacenters, and a clean network severance down the middle, what
prevents both sides from configuring and operating a master? (This is the
nightmare scenario.)

3) If mysqlfailover dies, it sounds like it will prevent any other copy from
running without manual intervention: "At startup, the console will attempt to
register itself with the master. If another console is already registered, and
the failover mode is auto or elect, the console will be blocked from running
failover. When a console quits, it unregisters itself from the master. If this
process is broken, the user may override the registration check by using the
--force option."

Overall, this sounds like a great improvement over what existed before (eg,
nothing), but still a pretty brittle, manual, and (in the case of split-brain)
very dangerous approach. Have you had good experience with it in practice?

~~~
brusch64
We are using Galera clustering with MariaDB on 3 nodes for that.

We have been using it since 2014. First in line is an HAProxy, which directs
the traffic to one of the 3 nodes. If one of the nodes goes down, HAProxy
directs the traffic to the other two nodes.

This works quite well, as long as not all 3 nodes go down or lose network
connection at once. After that you have to initialize the cluster again: one
node becomes the new master, any diverging changes in the other databases are
overwritten, and then we connect each node to the new cluster.

The biggest problems are write-intensive operations. When the data on two
nodes got updated simultaneously, performance suffered heavily, so we had
more luck using two nodes as read nodes and the third as the write node.
After some smarter scheduling it works pretty well (as long as not all the
nodes go down or the network connection between the nodes is lost). The
slowest node can block the write operations for all the other nodes.

------
sonofgod
Where do you make your CAP Theorem tradeoffs?

When _shouldn't_ people use Bedrock?
[https://sqlite.org/whentouse.html](https://sqlite.org/whentouse.html) is a
wonderful page -- full of guidance helping would-be SQLite users recognize
when they're in a use case better served by some other tool.

~~~
quinthar
Heh, yep, we definitely had that sent to us. The point of Bedrock is to
resolve those points and make SQLite usable in the enterprise.

As for the CAP theorem, we compromise on "C". Each slave responds based on its
current state, which might differ from the state of other slaves or the
master. However, every response returns the "commitCount" of the database at
the time the response was generated. Then, when you submit a new request, you
can specify that commitCount, and the slave will wait until it is _at least_
as recent as that.

This way you can safely act upon the response of one request, and then make a
second request to another node.
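The commitCount handshake described above can be sketched as follows; `Node`, `apply_write`, and `query` are hypothetical names chosen for illustration, not Bedrock's actual protocol:

```python
# Sketch of read-your-writes via a commit counter. All names here are
# illustrative; a real Bedrock slave would block until caught up rather
# than raise.

class Node:
    """A replica that applies writes in order and tracks a commit counter."""
    def __init__(self):
        self.commit_count = 0
        self.data = {}

    def apply_write(self, key, value):
        self.data[key] = value
        self.commit_count += 1

    def query(self, key, min_commit_count=0):
        # A real slave would wait until it catches up; here we just check.
        if self.commit_count < min_commit_count:
            raise RuntimeError("replica not fresh enough; would wait")
        return self.data.get(key), self.commit_count

# First request goes to node_a; its response carries the commit count.
node_a = Node()
node_a.apply_write("report:1", "submitted")
value, seen = node_a.query("report:1")

# Failing over to a stale node_b, the client passes the commitCount it saw,
# so the stale replica refuses (or waits) rather than serving older state.
node_b = Node()
try:
    node_b.query("report:1", min_commit_count=seen)
except RuntimeError:
    stale_detected = True
```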

~~~
thesmallestcat
So you're ensuring consistency by blocking on all queries? And it's up to the
client to enforce this to any degree? And if there's one master and a global
clock, sounds like you threw out the "P" too. Frankly this thing sounds
simplistic and unusable, and it's sorta offensive to hear SQLite called
unusable in the enterprise, whatever that means.

~~~
quinthar
Consistency is only an issue if you switch database servers mid session. Just
"stick" to a node by issuing all subsequent requests to the same server and
there are no consistency issues.

~~~
thesmallestcat
And then laugh maniacally when the partition heals. What's a "session"? Can
you explain commitCount more, and how a client is supposed to use it
intelligently without deadlocking? Also, there's a lot of master/slave talk,
and I'm wondering how leaders are elected during a partition, and what
recovery looks like.

~~~
quinthar
Heh, there's a lot of detail to be captured I agree. But in short:

- All nodes connect to all other nodes

- Each node has a priority; a Paxos algorithm is used to identify the highest
priority node, which "stands up" to be the master

- All nodes respond to read queries using the local database.

- So long as you always talk to the same node, there are no consistency
issues: you are guaranteed that each request will be answered with a database
at least as fresh as the last.

- However, if you switch databases (eg, a node goes down and you are forced
to go to a different one that might not be as fresh), it will wait until it is
as fresh as the node you were using

- Write requests are escalated to the master, which processes them with a
two-phase commit distributed transaction

- By default, the master waits for a "quorum" of the cluster to approve the
transaction before committing

- However, this limits write capacity to 1/median(rtt) of the slaves

- Accordingly, we have a "selective synchronization" feature where you can
optionally designate some transactions as requiring "less consistency". For
example, you can specify that the master should commit when at least one other
node approves.

- For the highest performance, you can designate a transaction as
"asynchronous" and the master will commit immediately

- For example, when someone is reimbursed, we absolutely want to get quorum
approval of the transaction: we can't risk losing that if the master crashes.
But if someone just adds a report comment, we can risk losing it upon a master
crash.

- When the master does go down (either gracefully or not), the next highest
priority node steps up seamlessly

- Any command escalated to the master but not processed will be re-escalated
to the new master

- When a higher priority node comes up, it synchronizes to get up to date
with everything it missed, and then becomes master

Anyway, lots of details here and I agree they haven't been well captured in
the site yet. Coming soon!
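A minimal sketch of the three write-consistency levels described above; the level names ("quorum", "one", "async") are taken from the prose, and the function itself is illustrative, not Bedrock's API:

```python
# Illustrative: how many peer approvals the master might wait for before
# committing, per consistency level. The master's own vote counts toward
# the majority.

def approvals_needed(level: str, cluster_size: int) -> int:
    """Peer approvals the master waits for before committing."""
    if level == "quorum":
        # Majority of the cluster; subtract the master's own vote.
        return (cluster_size // 2 + 1) - 1
    if level == "one":
        return 1          # at least one other node must approve
    if level == "async":
        return 0          # commit immediately, replicate in background
    raise ValueError(f"unknown level: {level}")

# For a 6-node cluster like the one described elsewhere in the thread:
quorum_peers = approvals_needed("quorum", 6)   # reimbursements
one_peer = approvals_needed("one", 6)          # "less consistency" writes
async_peers = approvals_needed("async", 6)     # report comments
```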

~~~
HammadB
Are you doing Paxos across a WAN? Isn't this slow?

~~~
quinthar
The Paxos is only used to elect the master (or re-elect if the master fails).
Once a master is established, it coordinates all distributed transactions
centrally, which is very fast.

------
partycoder
I don't want to be a jerk but... just for the sake of common sense:

- Faster? Provide a benchmark. Faster is a relative thing: faster than what?
Faster under which conditions?

- More reliable... than what? Under which conditions? How do you know it is
more reliable?

- More powerful? I thought SQLite was more constrained than other SQL
databases, e.g., no stored procedures. Powerful how? Compared to exactly what?

Nullius in verba. I don't take your statements for granted; give me proof,
otherwise it's just flatus vocis.

~~~
quinthar
Thanks for all that Latin! Yes, it's been open sourced for under 24 hours.
Those are all good suggestions.

~~~
partycoder
Una lingua numquam satis est. Gratias tibi. Good luck with your project.

------
kbaker
OK, how is this different from Rqlite [1], except that it's in C++ instead of
Go, and uses Paxos instead of Raft?

> rqlite is a distributed relational database, which uses SQLite as its
> storage engine.

> rqlite gives you the functionality of a rock solid, fault-tolerant,
> replicated relational database, but with very easy installation, deployment,
> and operation.

Hmmm....

Also, is Bedrock DB only 30 days old? (since the 'first commit' message on
GitHub?) That's barely topsoil!

[1] [https://github.com/rqlite/rqlite](https://github.com/rqlite/rqlite)

~~~
quinthar
Haha, Bedrock has been in continuous development and production use (remember:
this powers all of Expensify, and has since day one) for about 8 years.
However, the public repo is very new.

I'm not actually familiar with Rqlite, but wow, it does look very similar.
Thanks for the tip! Does anybody use it at scale?

~~~
kbaker
Ah I see. So you are saying after enough weathering you have exposed the
Bedrock...

I have no idea if anyone uses Rqlite, I just remember it from previous
discussions here. Very similar indeed...

------
notacoward
The idea's not fundamentally bad, but some of the rhetoric raises alarm bells.

> built atop SQLite, the fastest, most reliable, and most widely distributed
> database in the world.

Most widely distributed maybe, but _far_ from the fastest and AFAIK others
beat it on reliability as well. No disrespect to SQLite, it's far more
feature-rich than other embedded-style DBs, but without a definition of
"database" that excludes them the statement as written is false.

> written for modern hardware with large SSD-backed RAID drives and generous
> RAM file caches

In other words, _relies_ on those things so it won't work worth crap on
anything less.

> The easiest way to talk with Bedrock is using netcat as follows

Nice security model you've got there.

~~~
quinthar
Re: SQLite's performance -- Can you point to anything that shows SQLite isn't
the fastest for some class of query? In my experience (though admittedly I
haven't formally benchmarked) it's at least as fast, and generally faster (due
to the dramatically reduced overhead to query it over a traditional database
server).

Re: SQLite reliability. I'm confused how SQLite isn't more reliable than every
other contender, merely given that there is so much less to fail. In 8 years,
2158341079 writes, and probably 100x that many reads, I don't think I've ever
had a single failure or bug with SQLite itself.

Re: SSDs -- That's a fair point. Bedrock is not the ideal choice if you are
using small volumes and spinning disks.

Re: Security -- I think all databases have crap security that nobody really
relies upon. Bedrock doesn't try to pretend. Rather, it gives you a very
complete plugin framework, such that you can write C++ "stored procedures"
that enforce your application's security model. In practice, we run with the
Bedrock::DB plugin _disabled_, such that the only way to access the database
is via our stored procedures -- and those do complete authentication and
authorization in an application-specific way that no database could in a
generalized fashion.

Thanks for asking!

~~~
ngrilly
> 2158341079 writes

I'm curious: how do you keep track of this number?

~~~
quinthar
Every write transaction ever committed to the database is assigned a unique
ID, and the most recent few million transactions are kept in a "journal"
table. This is what allows nodes to figure out what happened while they were
offline and re-execute the missing transactions. A side-effect of that is we
know exactly how many write transactions have ever been committed.
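A toy version of that journal scheme, assuming an illustrative schema (not Bedrock's actual one), where the auto-incrementing rowid doubles as the transaction ID:

```python
import sqlite3

# Hypothetical miniature of the journal described above: every committed
# write is recorded, and the rowid is the globally incrementing txn ID.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE journal ("
    "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  query TEXT)"
)

def commit_write(query: str) -> int:
    """Record a write in the journal; returns its transaction ID."""
    cur = db.execute("INSERT INTO journal (query) VALUES (?)", (query,))
    db.commit()
    return cur.lastrowid

commit_write("INSERT INTO reports ...")
commit_write("UPDATE reports ...")
last_id = commit_write("DELETE FROM drafts ...")   # total-writes counter

# A node that went offline after txn 1 replays everything it missed:
missing = db.execute(
    "SELECT id, query FROM journal WHERE id > 1 ORDER BY id"
).fetchall()
```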

~~~
ngrilly
That's a funny side effect! Am I right to understand that the transaction's
unique ID is an incrementing number?

~~~
quinthar
Correct!

~~~
justinclift
Out of curiosity, what's the value it'll wrap around at?

------
lucio
Great idea!!! SQLite is an extraordinary product. I'll be following Bedrock.

This could be, for databases, what node was for server-side programming. There
are parallels, you're using a "client side single-writer (excellent) DB" and
making it apt for the cloud with Paxos. Again, great idea!!!!

And it seems you're taking some "strong" comments here with good humor. +1,
sir. +1 to you.

~~~
quinthar
Thanks, strong comments welcome!

------
lucio
Can you show stats of Bedrock's use in Expensify? How many users? How many
nodes? How many transactions per second? Queries per second? DB size? Largest
table rowcount? Node failures per month? How many times have you had a master
failure? Network partition incidents?

~~~
quinthar
I should get better stats. 4.5M users, 6 nodes split between 3 datacenters.
Each node has 16 CPUs. 2158341079 total write transactions (over 8 years); not
sure how many reads (10-100x more?). I'll try to get better stats on peak
read/write transactions per second. Not sure of the total number of rows in
the largest table (that actually takes a long time to count).

Counting "failures" is difficult -- if it breaks, it's because of some bug in
our application logic (eg, our stored procedure). Most of our restarts are due
to normal maintenance and upgrades. In the history of the company there were a
handful of core problems to the logic (generally as we encountered some weird
edge case for the first time), but I can't remember the most recent.

Regardless, I'll try to get better data on this. Thanks for asking!

~~~
spotman
2 billion writes over 8 years? Cool if so, but many database systems both
MySQL, and others are capable of getting this in a day. I know this because I
maintain some high availability MySQL systems that see about 4 billion writes
per day across a similar amount of hardware.

Have you run this through Jepsen or done any actual load testing or deep
failure testing for machines dying, network partitions, or other?

I encourage you to do so, because (not trying to sound rude) this kind of
reads like "hey my homemade car has 25,000 miles on it and I still use it".
But you can build a performance and failure testing framework to really put it
through its paces.

Anyhow, have fun and good luck!

~~~
Jweb_Guru
While I agree that we're not talking about huge write volume in the grand
scheme of things, and I definitely agree with you that Bedrock should pay
aphyr to run this through Jepsen (particularly with their new replication
strategy -- see some of my concerns below), Expensify's overall transaction
volume is probably close to the limit of what 95% of businesses are ever going
to see. They have contractual latency and availability requirements to fulfill
that are also probably more stringent than 95% of businesses will ever see,
and they've been doing it with a single master database (no partitioning) and
a single writer on that master. I think it's useful information that SQLite
works for them, because it means SQLite would work fine for most people, and
(as is evidenced by this thread) a lot of people don't think SQLite can work
for a site like Expensify.

And seeing actual _numbers_ is important, too. People often act like, in terms
of write volume, you're either shared-hosting fodder or you're Facebook, but
there's a lot of room in between, and a lot of HA workloads are very read-
dominated. It's very hard to find information about the businesses that fall
in the middle of that spectrum, because usually no technology company talks
about its
traffic unless it's bragging about it (and then will release only vague
information that could be spun in lots of different ways--for instance, not to
call you out, I have no clue what "HA" means in your context, whether you need
distributed transactions, whether the writes are cleanly partitionable, what
isolation level you need, what your latency requirements are, or what the
read:write ratio looks like). So I'd rather encourage people to release their
numbers than poo-poo them because they aren't as big as whatever the biggest
system you've worked on is.

~~~
spotman
My point is that if they are advertising rock solid distributed data, I don't
think they have achieved that without putting it through some more paces.

I agree, actual numbers are great.

We also agree, Jepsen is great.

But I am saying that another path to a similar level of confidence that Jepsen
provides is some real critical stress testing.

Sure, it will be out of scope for what most people need, but it will give them
the confidence to say "rock solid distributed data".

------
harryf
Hmmm under plugins...

* Jobs - Provides a simple job queue.

* Cache - Provides a simple replicated cache.

...while these look like nice-to-have features which make life easy for
developers, IMO having a database which is also a queue and is also a cache is
a way to end up with a mess you can't scale out and is hard to reason about.

Have seen this happen with Redis - unless you're disciplined about how you use
it and deploy it, having db, cache and queue all in one place is a honeypot
for developers that turns into a future nightmare for operations.

~~~
quinthar
Ya, I'm not sure I agree with that. I understand that argument in theory, but
in practice the Jobs and Cache plugins have like 1% as much code as the
Bedrock core -- and use all of it. So the idea that it makes sense to have a
separate database, job queue, and cache -- despite all doing 99% of the same
work -- strikes me as rather redundant and needlessly complex.

In practice, a job queue, a cache, and a database all do largely the same
thing: maintain some internal state on disk, respond to some networked API
calls, and synchronize across multiple servers. The actual logic of a job
queue or cache is pretty trivial compared to what's underneath.

Indeed, I think the lesson of Redis is exactly why you _should_ build these
systems on top of a real database: they are struggling to hack on replication
and reliable storage, even though these are "free" when built atop a database.
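As a sketch of that argument, a bare-bones job queue really is only a few lines once a database supplies the storage; the schema and function names here are illustrative, not the actual Jobs plugin:

```python
import sqlite3

# Minimal "job queue atop a database" sketch. Durability and (in Bedrock's
# case) replication come from the database underneath, which is the point:
# the queue logic itself is trivial.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs ("
    "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  name TEXT,"
    "  state TEXT DEFAULT 'QUEUED')"
)

def create_job(name):
    db.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
    db.commit()

def get_job():
    """Claim the oldest queued job, or return None if the queue is empty."""
    row = db.execute(
        "SELECT id, name FROM jobs WHERE state = 'QUEUED' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row:
        db.execute("UPDATE jobs SET state = 'RUNNING' WHERE id = ?",
                   (row[0],))
        db.commit()
    return row

create_job("send_email")
create_job("export_report")
first = get_job()
second = get_job()
empty = get_job()
```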

------
thesmallestcat
You'd have to be nuts to depend on something backed by Expensify. They'll
probably fire the maintainers.

Edit: Turns out it's authored by the CEO! So you'd have to be nuts to use it,
period. But, um, poke the source if you're curious. This is a good intro:
[https://github.com/Expensify/Bedrock/tree/master/libstuff#libstuff----all-the-stuff-you-need-to-build-a-c-application](https://github.com/Expensify/Bedrock/tree/master/libstuff#libstuff----all-the-stuff-you-need-to-build-a-c-application).
I can't imagine a more
ironic and hilarious backdrop for the "We Fire People" campaign. Must be tough
finding people who can keep up with a brain that big.

~~~
quinthar
Out of curiosity, what specifically do you find idiotic about the document you
linked to?

~~~
thesmallestcat
I didn't call anything idiotic. But let's just say that grappling with the STL
may be slightly preferable to this... thing. And there is some special irony
referring to STL as an "esoteric abstraction" while offering your own vastly
more esoteric abstractions in the same swoop. And the following paragraph
speaks for itself, particularly in the context of _developing a fucking
database_:

> Computers are super fast these days, and real world applications spend 99%
> of their CPU time doing some very specific item, with the rest of the time
> spent doing just boring stuff. Accordingly, libstuff is not intended to be
> hyper-efficient: it's intended to be hyper-usable. This means libstuff uses
> a lot of std::string objects for pretty much everything. Where a more
> complex object is required, some higher-level string-based object is used.

~~~
quinthar
Yes, I think it does speak for itself. Thanks!

------
carterehsmith
Rock-solid?

Look. Just about anyone can create some "database" and declare it "rock-
solid". Frankly, a simple key-value store is built into most languages.

Do you have more than that? Do you have benchmarks? You know, competing
against the other providers?

~~~
quinthar
Agreed, benchmarks would be helpful to back up these claims. Good suggestion!

I think the argument in favor of it being more reliable is an easier one,
however, as Bedrock is specifically designed (and used, at scale, for 8 years)
to provide highly available redundancy in the form of WAN-replication and
automatic failover.

Regardless, I think you'll agree that "rock-solid" is not a precise technical
term, and I hope you'll allow a bit of play on words. Whether or not it's
literally as solid as a rock is unclear, but it is definitely very reliable --
and more reliable than most.

------
misframer
Here's Expensify's CEO (David Barrett) posting on the p2p-hackers mailing list
about this. It's from April 2016.

[http://p2p-hackers.709552.n3.nabble.com/p2p-hackers-Advice-on-concurrent-relational-database-writes-td4025323.html](http://p2p-hackers.709552.n3.nabble.com/p2p-hackers-Advice-on-concurrent-relational-database-writes-td4025323.html)

~~~
quinthar
Yes, we've made a lot of good progress on this front, though nothing ready to
demo. Multi-threaded replication is _really_ exciting stuff. As it stands, our
"selective sync" capability already gives a lot of headroom for write
capacity, but multi-threaded replication will raise that ceiling even higher.

~~~
Jweb_Guru
The fundamental issue you're going to run into is that unless you have
"genuine partial replication" where only certain shards have a key, or
restrict your queries to key/value ones where the key can be determined
automatically and used to route to a per-shard master, you can't detect
conflicts committed on different nodes without executing half a round trip,
which is going to be bottlenecked by the _slowest_ node in your "fast" quorum.
That's why nobody uses Generalized Consensus in practice.

Batching before that _will_ help if you already know that most of the
transactions touching a particular key are going to the same server. So not a
clean partition, but "most of the time". This is the idea behind, e.g.,
[http://www.ssrg.ece.vt.edu/papers/peluso-M2PAXOS-TR.pdf](http://www.ssrg.ece.vt.edu/papers/peluso-M2PAXOS-TR.pdf). You try to
make sure people usually contact the owner of a particular partition, and use
a variety of techniques from the literature to make that well-optimized
(ideally, clients actually know which partitions they "should" be accessing,
too).

Another approach is to restrict your transactions to deterministic ones (with
possible "scouting" transactions to do things like secondary index lookups
where you can't statically analyze which partitions they will hit), in which
case you can batch _all_ transactions, send them to _all_ participant nodes,
and run them afterwards without two-phase commit. This is the approach taken
by Calvin (which can exploit per-node parallelism during execution because it
uses deterministic, deadlock-free locking). Additionally, read-only
transactions scoped to a partition can execute locally without any distributed
access, because thanks to determinism serializability is guaranteed. See
[http://cs-www.cs.yale.edu/homes/dna/papers/calvin-tods14.pdf](http://cs-www.cs.yale.edu/homes/dna/papers/calvin-tods14.pdf) (and yes, it really can do
500,000 distributed TPC-C transactions per second--by my estimation, more than
three orders of magnitude higher than the average total traffic you get, as
suggested by your "100x more read transactions than write" estimate. Though of
course your current average is higher than your 8-year average, I doubt it is
1000x higher).

Yet another idea is to try to combine coordination required for replication
with that required for the distributed transaction, as Tapir does:
[https://github.com/UWSysLab/tapir](https://github.com/UWSysLab/tapir). It
enjoys the highest linearizable read/write "general" (as in, doesn't have to
be submitted in one batch or need static analysis) transaction throughput I've
seen of any leaderless georeplicated system (aka it processes transactions
"fairly").

Still another approach is to forego serializability for a _very_ slightly
weaker guarantee, extended update serializability, which eliminates nearly all
interesting anomalies but can allow for dramatically better performance:
[http://www.ssrg.ece.vt.edu/papers/opodis14-alvin.pdf](http://www.ssrg.ece.vt.edu/papers/opodis14-alvin.pdf)
demonstrates that on read-mostly workloads you can do _very_ well with that
even in absurd deployments (e.g. the georeplicated 7 datacenter one in the
paper). Elsewhere in the thread you were talking about how "observational"
consistency is what's important, which would suggest you are probably already
relying on the "strong session" assumption (in which case EUS is
indistinguishable from serializability), so I encourage you to give that a
look.

All four represent pretty interesting points on the design space. It's not
clear to me whether Bedrock would be better than, say, SQLite over Calvin,
for your use case (in particular, I suspect Expensify would not find Calvin's
requirement that transactions be statically analyzable terribly onerous,
especially since you already require all transactions to be executed as stored
procedures).

I will say that I'm pleased that a site receiving a reasonable amount of
traffic is using SQLite. People consistently underestimate its performance.

~~~
quinthar
Thanks for all these links, I have some reading to do! Also, to clarify one
point, we're not doing multi-_master_ writes/replication -- just
multi-_threaded_ writes/replication.

Incidentally, the latest plan is to use
[http://sqlite.org/sessionintro.html](http://sqlite.org/sessionintro.html) to
do the following:

1) Spin up multiple write threads

2) Every write thread opens its own database handle

3) Every write thread creates a new "session" object before each write
command, and then creates a "changeset" afterwards

4) The first write thread to process a write command calls
sqlite3changebatch_new() to create a new batch (initially empty)

5) It then calls sqlite3changebatch_add() to add the changeset to that batch,
which returns SQLITE_OK to indicate that it does _not_ conflict with anything
in the (currently empty) batch

6) The next write thread calls sqlite3changebatch_add() on the existing batch,
providing the write command's changeset

7) If sqlite3changebatch_add() returns SQLITE_OK then it creates a patchset
from the changeset, and sends it to the slave.

8) Slaves apply and commit patchsets within the same batch in any order

9) On the other hand, if sqlite3changebatch_add() returns SQLITE_CONSTRAINT,
then that means the new changeset conflicts with one or more existing
changesets in the batch. In this scenario, it increments the batchID, calls
sqlite3changebatch_zero() and then sqlite3changebatch_add() again (to
initialize the new batch). It then creates and sends a patchset to the slave
for committing in any order with changes in the new batch.
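A toy model of steps 1-9 above: reduce each changeset to the set of rows it touches, and start a new batch whenever a changeset conflicts with the current one. This only mirrors the control flow around the sqlite3changebatch_* calls; it is not the real session API:

```python
# Illustrative model of the batching logic: changesets within a batch touch
# disjoint rows, so slaves may commit them in any order; a conflict closes
# the batch and starts the next one (step 9).

class ChangeBatcher:
    def __init__(self):
        self.batch_id = 0
        self.rows_in_batch = set()

    def add(self, changeset_rows):
        """Assign a changeset (a set of (table, rowid) keys) to a batch,
        returning the batch ID it landed in."""
        if self.rows_in_batch & set(changeset_rows):
            # Conflict with the current batch: increment the batch ID and
            # re-initialize, like sqlite3changebatch_zero() + _add().
            self.batch_id += 1
            self.rows_in_batch = set()
        self.rows_in_batch |= set(changeset_rows)
        return self.batch_id

b = ChangeBatcher()
b1 = b.add({("reports", 1)})   # first write thread: opens batch 0
b2 = b.add({("reports", 2)})   # disjoint rows: same batch, any commit order
b3 = b.add({("reports", 1)})   # conflicts with batch 0: starts batch 1
```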

Re: SQLite's performance, I find the database is consistently underestimated
in every regard -- performance, stability, functionality, etc. And the
database isn't half as amazing as the team behind it. Those are some of the
most solid engineers -- in a true sense of engineering (of which programming
very rarely is) -- that I've met.

~~~
Jweb_Guru
Oh okay. That sounds a lot like OCC in terms of the tradeoffs involved. If
you're looking for implementation hints, you might want to check out Silo
([http://db.csail.mit.edu/pubs/silo.pdf](http://db.csail.mit.edu/pubs/silo.pdf)).
It's _really_ hard to beat for transactions that are mostly nonconflicting (if
transactions _are_ mostly conflicting it is hard to exploit multiple writers
unless you are very clever about how you go about it; I can dredge up a few
links if you're interested). My overall understanding is that the usual rule
is that unless you have highly partitionable workloads, single writer almost
always wins, but I think your current strategy is also going to suffer from a
lot of contention even in cases with no conflicts: the write batch is going to
be constantly bouncing between cores and require lots of locking, which could
easily substantially degrade performance over what it currently is (though
obviously, benchmark :P).

From a correctness perspective: SQLITE_CONSTRAINT will only detect write
conflicts, not read conflicts, so you are losing serializability (instead you
have snapshot isolation, which is what SQLite gives you by default with
multiple writers). SI allows a variety of subtle anomalies that can screw you
over; see
[https://wiki.postgresql.org/wiki/SSI](https://wiki.postgresql.org/wiki/SSI)
and
[https://wiki.postgresql.org/wiki/Serializable](https://wiki.postgresql.org/wiki/Serializable).
If you are currently using only a single writer thread, you will never
experience these issues, AFAIK, because you cannot form a "dangerous
structure" (you can never have two R-W dependencies that conflict with each
other when only one R-W transaction runs at a time). I could be wrong about
that, though, and if I am, it could lead to some really bad bugs. Note that
these kinds of anomalies have a tendency to pop up when you're using stored
procedures to perform validation!
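The write-skew anomaly mentioned above is easiest to see concretely. Below is a minimal Python sketch (plain dictionaries standing in for database snapshots -- not actual SQLite or Bedrock code) of the classic on-call scenario that snapshot isolation permits but serializability forbids:

```python
# Illustrative write-skew demo: two transactions each read a consistent
# snapshot, see the invariant "at least one doctor on call" satisfied, and
# then write a row the OTHER transaction did not write. Because the write
# sets are disjoint, write-write conflict detection (a la SQLITE_CONSTRAINT)
# never fires -- yet the invariant ends up violated.

def write_skew_demo():
    db = {"alice_on_call": True, "bob_on_call": True}

    # Both transactions take their snapshot before either commits.
    snap_t1 = dict(db)
    snap_t2 = dict(db)

    # T1: Alice goes off call if, per her snapshot, someone else is on call.
    if snap_t1["bob_on_call"]:
        db["alice_on_call"] = False   # writes only Alice's row

    # T2: Bob goes off call if, per his snapshot, someone else is on call.
    if snap_t2["alice_on_call"]:
        db["bob_on_call"] = False     # writes only Bob's row

    # Disjoint write sets => no conflict detected, invariant broken.
    return db
```

With a single writer the two transactions could never interleave like this, which is exactly why the single-writer case stays serializable.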

~~~
drkp
You're correct that SI with a single writer will always be serializable.

(I was one of the authors of SSI in Postgres)

~~~
Jweb_Guru
And Tapir too, I see... and Arrakis? And speculative Paxos? Keep up the
awesome work, I always learn stuff from your papers :) And thanks for
confirming, I hadn't actually thought about SI in the single-writer case until
that comment.

~~~
drkp
Thanks! Glad you have been enjoying them.

------
dsl
For all the emphasis put on multi-datacenter replication, it would be awesome
to see it documented!

Really looking forward to playing with this. Cool stuff.

~~~
quinthar
There's a bit of that here:
[http://bedrockdb.com/multizone.html](http://bedrockdb.com/multizone.html)

But yes, lots more details to add over time. Thanks for asking!

~~~
quinthar
And a bit more here:
[http://bedrockdb.com/synchronization.html](http://bedrockdb.com/synchronization.html)

------
marknadal
I think it is important to encourage people to build their own databases, so I
want to commend you guys for doing it. But I do have some questions
(disclosure: I am the author of
[https://github.com/amark/gun](https://github.com/amark/gun) ):

\- Is it correct to say that Bedrock is primarily a replication layer for
SQLite? Most of the things your homepage highlights seem to be SQLite
features, not Bedrock features.

\- Elsewhere in this thread you mention you run Paxos (what implementation
are you using? Your own? Do you have a test we can run to verify
correctness?). But then you mention you sacrifice the "C" in the CAP theorem,
which would defeat the point of having Paxos. What happens in a split-brain
scenario where you have, let's say, 6 peers and there is a network partition
straight down the middle?

\- You also mention elsewhere in this thread that if you are talking to the
same node (a sticky session) there are no consistency issues. Do you mean
linearizable? Because consistency has to do with multi-machine data
consistency. For instance, if a write happens on another peer, then the node
you are talking to should return that write, not a stale one.

\- Is there anything other than just assuming it is run on an SSD that makes
it faster than SQLite on an SSD?

Thanks!

~~~
quinthar
Re: SQLite -- Yes, Bedrock is just the replication layer, SQLite does all the
SQL and storage.

Re: Paxos -- It's our own implementation. Split brain is prevented by the
master refusing to stand up unless a majority of _configured_ (not just
active) slaves approve its standup request. So in a 6 node deployment, 1
master and 5 slaves, 3 of the slaves would need to approve. This means one
"half" would have 4 nodes, and the other would have 2 -- and the half with 2
would recognize it doesn't have quorum and thus stay idle. In a scenario where
there is a split down the middle, nobody would do anything because nobody has
quorum. (This is precisely why you shouldn't deploy in just 2 datacenters --
three is the magic number.)
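The majority rule above reduces to simple integer math. A minimal sketch (the function name is illustrative, not Bedrock's actual API):

```python
def has_quorum(approvals: int, configured_peers: int) -> bool:
    """True only when approvals form a strict majority of the *configured*
    cluster size -- unreachable peers still count toward the denominator."""
    return approvals > configured_peers // 2

# 6-node cluster partitioned 4 / 2: only the larger side can elect a master.
# An even 3 / 3 split elects nobody, which is why three datacenters (e.g.
# 2 + 2 + 2 nodes) beat two: losing any one zone still leaves a 4-node
# majority on the surviving side.
```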

Re: Sticky sessions -- True multi-machine consistency is impossible to
guarantee and has no real practical application. The only consistency that
matters is from the perspective of an observer, ensuring that it always sees a
world whose time arrow progresses linearly forward. If that is what
"linearizable" means, then perhaps. Sorry, my terminology could be off. Thanks
for the correction!

Re: SQLite on SSD -- Sorry, I didn't mean to claim Bedrock was somehow faster
than SQLite. Indeed, Bedrock adds overhead to SQLite (to do all the networking
and such). My main claim is that Bedrock is faster than other databases, in
particular when using C++ stored procedures for complex operations.

Thanks for the questions!

~~~
toolslive
Is the set of servers that participate a Paxos value too (i.e., determined
using the consensus algorithm), or is it configured? Can you grow a cluster
without bringing it down?

~~~
quinthar
Unfortunately you can't add new nodes without reconfiguring the cluster, and
currently that requires restarting each server (though so long as you don't do
it all at once, server restarts are normal maintenance that causes no downtime
to the end user). This could likely be added without too much effort, but the
real-world use case for it is also uncertain.

The consensus algorithm is only used to elect a master -- once the master is
identified, it coordinates all the distributed transactions. If the master
dies, everyone who remains elects a new master and re-escalates all
unprocessed write transactions to the new master.

Read transactions are processed locally by each node, so only writes need to
be escalated.
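The division of labor described above -- local reads, escalated writes -- can be sketched roughly like this (illustrative Python with made-up class and method names, not Bedrock's actual interfaces):

```python
class Node:
    """Toy model of one cluster member: reads are served from the local
    copy of the data, writes are escalated to the elected master."""

    def __init__(self, name, master=None):
        self.name = name
        self.master = master   # None means this node currently *is* the master
        self.log = []          # stand-in for the local SQLite database

    def handle(self, command, is_write):
        if is_write and self.master is not None:
            # Escalate: only the master coordinates distributed writes.
            return self.master.handle(command, is_write)
        self.log.append(command)  # commit locally (replication omitted here)
        return f"{command} handled by {self.name}"

master = Node("master")
slave = Node("slave1", master=master)
# slave.handle("SELECT ...", is_write=False) is answered locally on slave1;
# slave.handle("INSERT ...", is_write=True) is forwarded to the master.
```

On master failure, the remaining nodes would elect a new master and re-point their `master` reference before re-escalating any unprocessed writes.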

------
nvartolomei
> Bedrock is designed from the ground up to operate in a multi-datacenter
> environment with rigorous use of stored procedures – the new standard for
> modern application design.

stored procedures—new standard for modern application design?

~~~
quinthar
Well, perhaps "new" is relative. But yes, I think strong application design
moves the queries into stored procedures that execute inside the database,
exposing a higher-level interface to the application layer. This not only
maintains better layering from a software engineering perspective, but also
provides better isolation from a security perspective:

Your webserver is the first thing to be hacked by an attacker because it's the
server that sits on the internet. If all your security logic is built on your
webserver, an attacker can easily bypass it all. However, if your security
logic is built into your database, then it dramatically limits the damage a
hacker can do from the webserver.

In particular, I would recommend you create an "Authenticate" stored procedure
that accepts a username and password, returning an "authToken". Then make all
your other functions into similar stored procedures, each of which accepts and
verifies an "authToken" before returning the results.

(And in the case of Bedrock, you might also disable the Bedrock::DB plugin
entirely to prevent direct access to your database from the webserver, forcing
everybody to go through your stored procedures.)

This design means attackers who root your webserver can't randomly access your
database without knowing the username/password (or valid authToken) -- all of
which is neatly and securely contained inside the database itself.
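A rough sketch of that layering in Python (every name here -- `authenticate`, `get_report`, the in-memory stores -- is hypothetical, not Bedrock's plugin API):

```python
import secrets

# Toy credential store; plaintext only for brevity -- hash in practice.
USERS = {"alice": "correct horse battery staple"}
SESSIONS = {}  # authToken -> username, lives inside the "database" layer


def authenticate(username, password):
    """Stored-procedure-style entry point: credentials in, opaque token out."""
    if USERS.get(username) != password:
        return None
    token = secrets.token_hex(16)
    SESSIONS[token] = username
    return token


def get_report(auth_token):
    """Every other procedure verifies the token before touching any data."""
    user = SESSIONS.get(auth_token)
    if user is None:
        raise PermissionError("invalid authToken")
    return f"report for {user}"
```

The point of the design: a compromised webserver without valid credentials or a live token can only call these high-level entry points; it never gets to issue arbitrary SQL.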

To be clear, if you _do_ keep your SQL on your webserver (as most applications
honestly do, though I believe it's a mistake), then the Bedrock::DB plugin is
perfect for you. But I think Bedrock::DB (or really, any direct access to a
database outside of stored procedures) should be largely viewed as a
programming and maintenance convenience, at the expense of security.

------
didip
To take this further, perhaps pay Aphyr for his time and run Jepsen tests?

------
lucio
How does the client get "redirected" to another node if the one you're talking
to fails? It is transparent to the client?

~~~
quinthar
Our PHP library has the concept of a primary and failover host. We recommend
configuring a local Bedrock node as the "primary host", and then setting up a
load-balanced pool of all nodes to be used as the "secondary host". This means
in normal operation, every webserver talks directly to its nearest node
(bypassing the load balancer), but "fails over" to the load balancer if the
primary is down.

Granted, a simpler configuration is to just have two load balancers -- both of
which have all nodes -- and then use each as the primary/secondary. We did
this for years and it was fine, but for just a tiny bit more configuration
overhead you can bypass the load balancer for the primary and thus shave off
those precious milliseconds from each request.
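The primary/failover pattern described above might be sketched like this (the real client library is PHP; this is an illustrative Python stand-in, with callables in place of network connections):

```python
def send_request(command, primary, failover):
    """primary/failover are callables standing in for the two connections."""
    try:
        return primary(command)    # nearest local node, no load balancer hop
    except ConnectionError:
        return failover(command)   # load-balanced pool of all nodes


def dead_node(command):
    raise ConnectionError("local node is down")


def pool_node(command):
    return f"{command} via load balancer"

# Normal operation hits the local node; if it is down, the same request
# transparently retries against the load-balanced pool.
```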

------
quinthar
Discuss in real-time here:
[https://gitter.im/Expensify-Bedrock/Lobby](https://gitter.im/Expensify-Bedrock/Lobby)

------
petre
How is data distributed over multiple hosts? Aside from being replicated, is
it sharded? Does it have any sort of redundancy? What happens if you lose a
host?

~~~
quinthar
Every host has all data. (This sounds crazy, until you do the math on how
cheap storage is and how comparatively "little" data people actually need in
the real world.) So we have 6x redundancy in normal operation, across three
datacenters (and three different power grids, three different providers).
There are also nightly backups.

------
alucab
Hi, this is very interesting. Can you elaborate on your concept of plugin
development?

You clearly state that your ideal model is to develop plugins that are
embedded in the DB to execute core business functionality. However, on the
website there is only tutorial documentation about the SQL interface.

Is the only guide the code, for now?

Thanks!

------
ngrilly
Two questions:

If I'm not mistaken, creating an index is a blocking operation in SQLite
(other writers are blocked). How do you manage this in production?

How do you create a stored procedure? Do you have to recompile the whole
program? How do you deploy with zero downtime?

~~~
quinthar
Great questions! Yes, indexing is actually a significant challenge at our
scale, as adding an index can take a long time and, currently, it is a
blocking operation. However, the SQLite team is amazing and has all sorts of
write-concurrency tricks up their sleeve to allow for creating indexes in a
parallel thread. We haven't used them yet, but we plan to.

As for the stored procedures: yes, they are written in C++, and thus deploying
them requires compiling and upgrading the server itself. However, we have 6
servers (and honestly, any one of them has enough read capacity to satisfy
nearly all our traffic), so a minute of downtime to upgrade each independently
isn't a problem.

The result is zero downtime as perceived by the user, even though each server
has occasional downtime for maintenance and upgrades. In general I'd say we
upgrade the database about weekly, or more frequently depending on how active
we are in the stored procedures.

~~~
ngrilly
> Yes, indexing is actually a significant challenge at our scale, as it can
> take a long time to add and currently, it is a blocking operation.

What do you do during index creation: Do you drop write requests on the floor?
Do you store them in a persistent queue somewhere? Or do you remove a node
from the cluster, create the index, re-add the node to the cluster, and repeat
this on every other node?

I'm asking because at some point I thought of using SQLite, embedded in a
process containing the business logic and the network layer, but I abandoned
this idea specifically because of blocking operations like index creation, and
decided to use PostgreSQL instead.

> the sqlite team is amazing and has all sorts of write-concurrency tricks up
> their sleeve to allow for creating indexes in a parallel thread

I'd be very interested in any link related to this new feature of SQLite.

> However, we have 6 of them (and honestly, any one of which has enough read
> capacity to satisfy about all our traffic) so a minute of downtime to
> upgrade for each independently isn't a problem.

So, basically, you're doing a rolling upgrade? Have you evaluated compiling
your stored procedures to dynamically loaded libraries, or using an
interpreted language like Lua/Python/JavaScript?

~~~
quinthar
> What do you do during index creation: Do you drop write requests on the
> floor? Do you store them in a persistent queue somewhere? Or do you remove a
> node from the cluster, create the index, re-add the node to the cluster, and
> repeat this on every other nodes?

The last of those -- for small indexes, we just replicate them out like normal
queries. For large indexes, we take the node down, build the index offline,
and re-add it.
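The offline step for large indexes could look roughly like this with plain `sqlite3` (the table, index, and path names are made up for the example):

```python
import os
import sqlite3
import tempfile


def build_index_offline(db_path):
    # With the node removed from the cluster, no replicated writes arrive,
    # so the long-running CREATE INDEX can hold the write lock harmlessly.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_accounts_email ON accounts(email)"
    )
    conn.commit()
    conn.close()


# Demo: a throwaway database standing in for the node's local copy.
demo_path = os.path.join(tempfile.mkdtemp(), "node.db")
conn = sqlite3.connect(demo_path)
conn.execute("CREATE TABLE accounts(email TEXT)")
conn.commit()
conn.close()
build_index_offline(demo_path)
```

Once the index exists, the node rejoins the cluster and catches up on the writes it missed before serving traffic again.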

> So, basically, you're doing a rolling upgrade? Have you evaluated compiling
> your stored procedures to dynamically loaded libraries, or using an
> interpreted language like Lua/Python/JavaScript?

Correct, rolling upgrades. It's worked well to date, but the idea of putting
the plugins into a dynamically loaded library is really interesting. Our
plugin system is relatively new (only a few months old), so it hasn't really
been considered. Great idea!

~~~
ngrilly
Great! Thanks for following up.

If you have any link about "SQLite write-concurrency tricks to allow for
creating indexes in a parallel thread", please share ;-)

------
Animats
So where are the other copies? The documentation tells you how to set up
Bedrock locally, but to get redundancy, you need multiple copies talking. How
do you set that up?

And how does security work?

~~~
quinthar
Here's a quick overview of how to configure in multiple zones:
[http://bedrockdb.com/multizone.html](http://bedrockdb.com/multizone.html)

Security isn't provided by Bedrock (or really, any database, at least not
well). Rather, with Bedrock security is the responsibility of the application
layer, but built into the database via stored procedures.

------
naasking
Linux-only? Or will it run on any SQLite-supported platform?

~~~
quinthar
We do most of the development on macOS, but deploy to Linux. I've never
compiled it for Windows, but I imagine it would be straightforward.

------
misframer
How do you monitor a database / system like this?

~~~
quinthar
The Bedrock::Status plugin returns full status on its internal state, as well
as the state of the other nodes it sees. We have Icinga monitor this to make
sure the server is not merely up, but in the correct state.

