Hacker News new | comments | show | ask | jobs | submit login
Bedrock – Rock-solid distributed data (bedrockdb.com)
134 points by qertoip on Oct 18, 2016 | hide | past | web | favorite | 109 comments

I don't quite get why SQLite is the right primitive to compose a large scale database system out of...

If the design goal is write throughput, SQLite can be beat really easily by the client/server systems since each SQLite instance only supports one concurrent writer. I guess users of Bedrock could partition their datasets into different pieces that aren't often written in tandem, but why go through all the trouble instead of just using a database with real MVCC?

If the design goal is geo replication, Bedrock seems to ignore the past many years of innovation in the field. MySQL and company have methodically built up features to allow safe synchronous and asynchronous replication for minimizing the risk of downtime or data loss like GTIDs over a long time

If the design goal is read throughput, I don't think Bedrock can be faster than just pure SQLite if your data can live beside your application, and if you have more than one node, you need a complicated SQL optimizer that understands how to push down things like predicates and aggregates to the leaves, and then recombine results at the top level to actually efficiently read in parallel, which I don't think Bedrock has built yet.

This thing kind of seems reminiscent of Jetpants or one of the other MySQL orchestration tools that allow for scaling the DB beyond many nodes automatically... but... why SQLite?

Thanks for asking!

Re: "SQLite can be beat really easily by the client/server systems since each SQLite instance only supports one concurrent writer" -- That's not actually true, sqlite supports concurrent writers via their page-locking branch (and changesets allow for effectively row-level locks). But none of that matters, because single-threaded replication means any multi-threaded write capability is irrelevant. Regardless, when we add multi-threaded replication (on the way), we'll take advantage of SQLite's multi-thread write capabilities.

Re: MySQL's replication. What specifically are you referring to? It's distributed transaction features are actually quite new -- and I think were developed after Bedrock was already in production for years. Furthermore, MySQL doesn't support automatic master failover (or recovery), and struggles over high-latency, low-reliability WAN connections. Anyway, if I'm wrong on any of that, I'm eager to be corrected. Thanks!

Re: Read throughput. Agreed it doesn't have "parallel reads" (eg, it doesn't split a large read up into multiple smaller reads that are joined automatically). However, I don't think MySQL does either. Regardless, Bedrock does quite well for normal read queries, and uses all of the node's CPUs efficiently. (To be clear, I love this idea of adding "super threaded reads", and have some ideas. But at the moment we don't have any application that requires it, so haven't built it.)

Re: Jetpants, I haven't heard of that before. Thanks for the tip! But as for why SQLite over MySQL, I'd ask quite the opposite: why add a networked layer on top of MySQL, thereby ignoring all of MySQL's networking? That just adds all the overhead of MySQL, without any of the benefit. If you're going to build a networking layer atop a database, SQLite seems the clear choice because it is explicitly designed to be a database library -- and not a database server.

Thanks for all these comments!

Don't let me barge in and tell you how to do your job or anything, but for a piece of a system designed and marketed as the bottom, super stable piece of the stack, shipping out a brand new technology not yet in trunk seems kind of dangerous, no? The MVCC implementations in your competitors have been battle tested for stability and maximum performance for decades. Gonna be hard to compete with that.

Re: "struggles over high-latency, low-reliability WAN connections" If by struggle you mean a vanilla MySQL replication setup doesn't satisfy C, A and P of the CAP theorem then yeah I agree, however if you just mean that it struggles to be consistent with an unreliable network, that just means MySQL has made the same CAP tradeoff Bedrock has. AFAIK there aren't many critical bugs that mean you lose data that could have otherwise been stored in the presence of unreliable networks.

You're right that vanilla MySQL doesn't support automatic failover, however, some consider that a feature, because the engineering required to correctly implement this distributed system is genuinely challenging. Bedrock may have done it, but if there are some complicated cases that it has yet to address users may prefer to manage failover manually before they trust it to the system. MySQL also isn't the only system that we should draw comparison too -- Cassandra or Riak come to mind as well. Both have excellent replication for geo-redundancy and something akin to automatic failover by storing data in a configurable number of places. They don't support raw SQL statements like Bedrock does, but to be honest if Bedrock doesn't support actual SQL fanout where the results can be combined to make the thing seem like one big database, the client is left to do a bunch of the work like they would be doing in Cassandra to model their data and fetch it efficiently.

I'm not trying to suggest that MySQL or a MySQL orchestration layer is better than Bedrock for all use cases; I just don't understand what niche Bedrock is trying to fill. For basic consumers of the system it's more work to author your queries for Bedrock because (I think) you now have to manage multiple logical SQL instances instead of one virtual big one, and you only get as much read throughput for as much data as you are willing to duplicate. You only get write throughput for the same but at 1 writer per shard/slave/whatever it's called. It's using a homegrown replication algorithm that has the same consistency tradeoffs of other more widely deployed and thus better tested solutions, and it's not clear that database builders or the target audience for Bedrock-as-framework would actually want to use SQLite due to the performance considerations mentioned above.

Lots of comments! I'll try to address the key points:

Re: "brand new technology not yet in trunk" -- I'm not sure what you mean by that. Bedrock has been used continuously for 8 years.

Re: "struggles over high-latency, low-reliability WAN connections" -- Ah, I mean MySQL's replication is designed for active/passive deployments with manual failover connected via fast/reliable networks. Not saying it can't support slow/unreliable WAN connections, only that it requires a lot of glue code and manual recovery when things go more wrong thn it's designed to handle.

Re: "the engineering required to correctly implement this distributed system is genuinely challenging" -- Agreed! This is precisely what Bedrock provides.

Re: "SQL fanout" -- To my knowledge, no RDBMS does this automatically. Furthermore, very few real world applications actually require this. Don't get me wrong: this is cool stuff. But this isn't a common requirement. Most businesses will never exceed the capacity of a single server -- the number of businesses that truly require this level of scalability is very small.

Re: "I just don't understand what niche Bedrock is trying to fill." -- The niche of businesses that want a simple SQL database that has built in automatic failover.

Re: "multiple logical SQL instances" -- I'm not sure what you mean by this. Bedrock maintains a single contiguous SQL database across all nodes.

Re: "you only get as much read throughput for as much data as you are willing to duplicate" -- Yes, I'm saying there are very few reasons not to duplicate it all for the vast majority of real world use cases.


Cool, thanks! It looks like that came out in 2012? Neat, I'll take a look!

Incidentally, if anybody has experience with this I'd love to know:

1) What happens if "mysqlfailover" itself dies? For example, if I have 6 servers split equally between 2 datacenters, and the datacenter running mysqlfailover loses power -- how does the other datacenter get reconfigured?

2) If you run two copies of mysqlfailover (one in each datacenter), how does it solve the "split brain" problem? If you have 6 servers split equally between two datacenters, and a clean network severance down the middle, what prevents both sides from configuring and operating a master? (This is the nightmare scenario.)

3) If mysqlfailover dies, it sounds like it will prevent any other copy from running without manual intervention: "At startup, the console will attempt to register itself with the master. If another console is already registered, and the failover mode is auto or elect, the console will be blocked from running failover. When a console quits, it unregisters itself from the master. If this process is broken, the user may override the registration check by using the --force option."

Overall, this sounds like a great improvement over what existed before (eg, nothing), but still a pretty brittle, manual, and (in the case of split-brain) very dangerous approach. Have you had good experience with it in practice?

We are using Galera clustering with MariaDB on 3 nodes for that.

We are using it since 2014. First in line is a HAProxy which directs the traffic to one of the 3 nodes. If one of the nodes goes down HAProxy directs the traffic to the other two nodes.

This works quite well, as long as not all 3 nodes are down / lose network connection. After that you have to initialise the cluster again. One node is the new master and all later changes in the other databases are overwritten. Then we connect each node to the new cluster.

The biggest problems are write intense operations. We had more luck using two nodes as read nodes and using the third node as write node. When the data in two nodes got updated simultaneously the performance suffered heavily. After some smarter scheduling it works pretty well (as long as not all the nodes go down or the network connection between the nodes is lost). The slowest node can block the write operations for all the other nodes.

> sqlite supports concurrent writers via their page-locking branch (and changesets allow for effectively row-level locks)

I've been unable to find the branch on SQLite website. Do you have a link to the branch and changesets?

Thanks! Is there some documentation already? Do you plan to integrate "BEGIN CONCURRENT" in the standard SQLite distribution?

Where do you make your CAP Theorem tradeoffs?

When shouldn't people use Bedrock? https://sqlite.org/whentouse.html is a wonderful page - full of discouragement to would-be SQlite users to know when they're in a use-case better served by some other tool.

Heh, yep, we definitely had that sent to us. The point of Bedrock is to resolve those points and make SQLite usable in the enterprise.

As for the CAP theorem, we compromise on "C". Each slave responds based on its current state, which might be a different state than other slaves or the master. However, every response returns the "commitCount" of the database at the time the response was generated. Then when you submit a new request, you can specify that commitCount and the slave will wait until it is at least as recent as that.

This way you can safely act upon the response of one request, and then make a second request to another.

So you're ensuring consistency by blocking on all queries? And it's up to the client to enforce this to any degree? And if there's one master and a global clock, sounds like you threw out the "P" too. Frankly this thing sounds simplistic and unusable, and it's sorta offensive to hear SQLite called unusable in the enterprise, whatever that means.

Consistency is only an issue if you switch database servers mid session. Just "stick" to a node by issuing all subsequent requests to the same server and there are no consistency issues.

And then laugh maniacally when the partition heals. What's a "session"? Can you explain commitCount more, and how a client is supposed to use it intelligently without deadlocking. Also there's a lot of master/slave talk and I'm wondering how leaders are elected during a partition, and what recovery looks like.

Heh, there's a lot of detail to be captured I agree. But in short:

- All nodes connect to all other nodes

- Each node has a priority; a Paxos algorithm is used to identify the highest priority node, which "stands up" to be the master

- All nodes respond to read queries using the local database.

- So long as you always talk to the same node, there are no consistency issues: you are guaranteed that each request will answered with a database at least as fresh as the last.

- However, if you switch databases (eg, a node goes down and you are forced to go to a different one that might not be as fresh), it will wait until it is as fresh as the node you were using

- Write requests are escalated to the master, which processes them with a two-phase commit distributed transaction

- By default, the master waits for a "quorum" of the cluster to approve the transaction before committing

- However, this limits write capacity to 1/median(rtt) of the slaves

- Accordingly, we have a "selective synchronization" feature where you can optionally designate some transactions as requiring "less consistency". For example, you can specify that the master should commit when at least one other node approves.

- For the highest performance, you can designate a transaction as "asynchronous" and the master will commit immediately

- For example, when someone is reimbursed, we absolutely want to get quorum approval of the transaction: we can't risk losing that if the master crashes. But if someone just adds a report comment, we can risk losing it upon master crash.

- When the master does go down (either gracefully or not), the next highest priority node steps up seamlessly

- Any command escalated to the master but not processed will be re-escalated to the new master

- When a higher priority node comes up, it synchronizes to get up to date with everything it missed, and then becomes master

Anyway, lots of details here and I agree they haven't been well captured in the site yet. Coming soon!

Are you doing Paxos across a WAN? Isn't this slow?

The Paxos is only used to elect the master (or re-elect if the master fails). Once a master is established, it coordinates all distributed transactions centrally, which is very fast.

I don't want to be a jerk but... just for the sake of common sense:

- Faster? Provide a benchmark. Faster is a relative thing, faster than what? faster under which conditions?

- More reliable... than what? under which conditions? how do you know it is more reliable?

- More powerful? I thought SQLite is more constrained than other SQL databases. e.g: No stored procedures. Powerful? how? compared to exactly what?

Nullius in verba. I don't take your statements as granted, give me a proof, otherwise it's just flatus vocis.

Thanks for all that Latin! Yes, it's been open sourced for under 24 hours. Those are all good suggestions.

Una lingua numquam satis est. Gratias tibi. Good luck with your project.

Incidentally, sqlite doesn't provide stored procedures, but Bedrock does (using a C++ plugin capability). Otherwise, SQLite is actually very feature rich; a subset of features are here: http://sqlite.org/fullsql.html

Yes... but you said "More powerful", rather than just "Powerful". More implies a comparison with respect to something else... with what specifically?

Well, with respect to stored procedures, I think C++ is a far more powerful language than MySQL's SQL-based approach -- especially since Bedrock plugins can also encapsulate schema changes.

I think Bedrock's replication is definitely more more powerful than MySQL's as well.

So in general I think it's more powerful than MySQL. I'd love to hear your thoughts about where MySQL has the edge, however. Thanks!

Does it have transactions?

Yes, SQLite has transactions, and Bedrock runs all stored procedures in transactions. However, the Bedrock::DB plugin (which is just a single stored procedure: "query") doesn't expose transactions to the caller. So you couldn't, for example, call "BEGIN TRANSACTION" from the webserver, issue a few queries, and then call "COMMIT". The proper way in Bedrock (and I'd argue, for any modern database) is to package your transactions into stored procedures that execute inside the database itself.

OK, how is this different from Rqlite [1], except in C++ instead of Go, Paxos instead of Raft?

> rqlite is a distributed relational database, which uses SQLite as its storage engine.

> rqlite gives you the functionality of a rock solid, fault-tolerant, replicated relational database, but with very easy installation, deployment, and operation.


Also, is Bedrock DB only 30 days old? (since the 'first commit' message on GitHub?) That's barely topsoil!

[1] https://github.com/rqlite/rqlite

Haha, Bedrock has been in continuous development and production use (remember: this powers all of Expensify's and has since day one) for about 8 years. However, the public repo is very new.

I'm not actually familiar with Rqlite, but wow, it does look very similar. Thanks for the tip! Does anybody use it at scale?

Ah I see. So you are saying after enough weathering you have exposed the Bedrock...

I have no idea if anyone uses Rqlite, I just remember it from previous discussions here. Very similar indeed...

Also with identical approach there is http://www.actordb.com

Oh, neat! That one seems to replace SQLite's storage engine with a custom one. But yes, also similar. Cool! Again, what is the largest real-world user? We've found a lot of the theories tend to break down when they are subjected to reality -- Bedrock's advantage is it's been hardened in the crucible of live production traffic for years.

ActorDB was developed for this service : http://www.emitcloud.com/home

Cool. It looks like they have 500-1000 Android installs. As context, Expensify has over 1M Android installs, and over 4M total users. So I think Bedrock is probably operating 4-5 orders of magnitude more activity, at least in this case.

The idea's not fundamentally bad, but some of the rhetoric raises alarm bells.

> built atop SQLite, the fastest, most reliable, and most widely distributed database in the world.

Most widely distributed maybe, but far from the fastest and AFAIK others beat it on reliability as well. No disrespect to SQLite, it's far more feature-rich than other embedded-style DBs, but without a definition of "database" that excludes them the statement as written is false.

> written for modern hardware with large SSD-backed RAID drives and generous RAM file caches

In other words, relies on those things so it won't work worth crap on anything less.

> The easiest way to talk with Bedrock is using netcat as follows

Nice security model you've got there.

Re: SQLite's performance -- Can you point to anything that shows SQLite isn't the fastest for some class of query? In my experience (though admittedly I haven't formally benchmarked) it's at least as fast, and generally faster (due to the dramatically reduced overhead to query it over a traditional database server).

Re: SQLite reliability. I'm confused how SQLite isn't more reliable than every other contender, merely given that there is so much less to fail. In 8 years, 2158341079 writes, and probably 100x that many reads, I don't think I've ever had a single failure or bug with SQLite itself.

Re: SSDs -- That's a fair point. Bedrock is not the ideal choice if you are using small volumes and spinning disks.

Re: Security -- I think all databases have crap security that nobody really relies upon. Bedrock doesn't try to pretend. Rather, it gives you a very complete plugin framework, such that you can write C++ "stored procedures" that enforce your application's security model. In practice, we run with the Bedrock::DB plugin disabled, such that the only way to access the database is via our stored procedures -- and those do complete authentication and authorization in an application-specific way that no database could in a generalized fashion.

Thanks for asking!

> 2158341079 writes

I'm curious: how do you keep track of this number?

Every write transaction ever committed to the database is assigned a unique ID, and the most recent few million transactions are kept in a "journal" table. This is what allows nodes to figure out what happened while they were offline and re-execute the missing transactions. A side-effect of that is we know exactly how many write transactions have ever been committed.

That's a funny side-effect! Should I understand the transaction unique ID is an incrementing number?


Out of curiosity, what's the value that'll wrap around at?

Thanks :-)

Great idea!!! SQLite is an extraordinary product. I'll be following Bedrock.

This could be, for databases, what node was for server-side programming. There are parallels, you're using a "client side single-writer (excellent) DB" and making it apt for the cloud with Paxos. Again, great idea!!!!

And it seems you're taking with a good humor some "strong" comments here. +1 sr. +1 to u.

Thanks, strong comments welcome!

Can u show stats of bedrock use in expensify? How many users? How many nodes? How many transactions per second? queries per second? DB size? largest table rowcount? node failures per month? how many times have u had a master failure? network partition incidents?

I should get better stats. 4.5M users, 6 nodes split between 3 datacenters. Each node has 16 CPUs. 2158341079 total write transactions (over 8 years); not sure how many read (10-100x more?). I'll try to get better stats on peak read/write transactions per second. Not sure the total number of rows of the largest table (that actually takes a long time to count).

Counting "failures" is difficult -- if it breaks, it's because of some bug in our application logic (eg, our stored procedure). Most of our restarts are due to normal maintenance and upgrades. In the history of the company there were a handful of core problems to the logic (generally as we encountered some weird edge case for the first time), but I can't remember the most recent.

Regardless, I'll try to get better data on this. Thanks for asking!

2 billion writes over 8 years? Cool if so, but many database systems both MySQL, and others are capable of getting this in a day. I know this because I maintain some high availability MySQL systems that see about 4 billion writes per day across a similar amount of hardware.

Have you run this through Jespen or done any actual load testing or deep testing for failures related to machines dying, network partitions or other?

I encourage you to do so, because (not trying to sound rude) this kind of reads like "hey my homemade car has 25,000 miles on it and I still use it". But you can build a performance and failure testing framework to really put it through its paces.

Anyhow , have fun and good luck~~

While I agree that we're not talking about huge write volume in the grand scheme of things, and I definitely agree with you that Bedrock should pay aphyr to run this through Jepsen (particularly with their new replication strategy--see some of my concerns below), Expensify's overall transaction volume is probably close to the limit of what 95% of businesses are ever going to see, they have contractual latency and availability requirements they have to fulfill that are also probably more stringent than 95% of businesses are ever going to see, and they've been doing it with a single master database (no partitioning) and a single writer on that master. I think it's useful information that SQLite works for them, because it means SQLite would work fine for most people and (as is evidenced by this thread) a lot of people don't think SQLite can work for a site like Expensify.

And seeing actual numbers are important, too. People often act like in terms of write volume you're either shared hosting fodder, or you're Facebook, but there's a lot of room in between, and a lot of HA workloads are very read-dominated. It's very hard to find information about the businesses fall in the middle of that spectrum because usually no technology company talks about its traffic unless it's bragging about it (and then will release only vague information that could be spun in lots of different ways--for instance, not to call you out, I have no clue what "HA" means in your context, whether you need distributed transactions, whether the writes are cleanly partitionable, what isolation level you need, what your latency requirements are, or what the read:write ratio looks like). So I'd rather encourage people to release their numbers than poo-poo them because they aren't as big as whatever the biggest system you've worked on is.

My point is that if they are advertising rock solid distributed data, I don't think they have achieved that without putting it through some more paces.

I agree, actual numbers are great.

We also agree, jespen is great.

But I am saying, another path, to a similar level of confidence jespen provides, is some real critical stress testing.

Sure, it will be out of the scope of what most people need, but it will give them the confidence to say "rock solid distributed data".

Some of these FIXME's in Bedrock's er... Paxos implementation look like they could be important:


> 6 nodes split between 3 datacenters

This makes 2 nodes per datacenter. Do the 2 nodes cooperate in some way, or are they fully independent from each other (like shards)?

Just found the answer to my question:

> In practice, we deploy two servers in each datacenter (across three geographically-distributed datacenters, so six nodes total in the cluster -- with one configured as a "permaslave" that doesn't participate in quorum). Given this, the node that's in the same datacenter generally gets most if not all of the master's transactions in a real world crash scenario.

Source: http://p2p-hackers.709552.n3.nabble.com/p2p-hackers-Advice-o...

Hmmm under plugins...

* Jobs - Provides a simple job queue.

* Cache - Provides a simple replicated cache.

...while these look like nice-to-have features which make life easy for developers, IMO having a database which is also a queue and is also a cache is a way to end up with a mess you can't scale out and is hard to reason about.

Have seen this happen with Redis - unless you're disciplined about how you use it and deploy it, having db, cache and queue all in one place is a honeypot for developers that turns into a future nightmare for operations.

Ya, I'm not sure I agree with that. I understand that argument in theory, but in practice the Jobs and Cache plugins have like 1% as much code as the Bedrock core -- and use all of it. So the idea that it makes sense to have a separate database, job queue, and cache -- despite all doing 99% of the same work -- strikes me as rather redundant and needlessly complex.

In practice, a job queue, a cache, and a database all do largely the same thing: maintain some internal state on disk, respond to some networked API calls, and synchronize across multiple servers. The actual logic of a job queue or cache is pretty trivial compared to what's underneath.

Indeed, I think the lesson of Redis is exactly why you should build these systems on top of a real database: they are struggling to hack on replication and reliable storage, even though these are "free" when built atop a database.

You'd have to be nuts to depend on something backed by Expensify. They'll probably fire the maintainers.

Edit: Turns out it's authored by the CEO! So you'd have to be nuts to use it, period. But, um, poke the source if you're curious. This is a good intro https://github.com/Expensify/Bedrock/tree/master/libstuff#li... I can't imagine a more ironic and hilarious backdrop for the "We Fire People" campaign. Must be tough finding people who can keep up with a brain that big.

Out of curiosity, what specifically do you find idiotic about the document you linked to?

I didn't call anything idiotic. But let's just say that grappling with the STL may be slightly preferable to this... thing. And there is some special irony referring to STL as an "esoteric abstraction" while offering your own vastly more esoteric abstractions in the same swoop. And the following paragraph speaks for itself, particularly in the context of developing a fucking database:

> Computers are super fast these days, and real world applications spend 99% of their CPU time doing some very specific item, with the rest of the time spent doing just boring stuff. Accordingly, libstuff is not intended to be hyper-efficient: it's intended to be hyper-usable. This means libstuff uses a lot of std::string objects for pretty much everything. Where a more complex object is required, some higher-level string-based object is used.

Yes, I think it does speak for itself. Thanks!


Look. Just about anyone can create some "database" and declare it "rock-solid". Frankly, a simple key-value store is built in most languages.

Do you have more than that? Do you have benchmarks? You know, competing against the other providers?

Agreed, benchmarks would be helpful to back up these claims. Good suggestion!

I think the argument in favor of it being more reliable is an easier one, however, as Bedrock is specifically designed (and used, at scale, for 8 years) to provide highly available redundancy in the form of WAN-replication and automatic failover.

Regardless, I think you'll agree that "rock-solid" is not a precise technical term, and I hope you'll allow a bit of play on words. Whether or not it's literally as solid as a rock is unclear, but it is definitely very reliable -- and more reliable than most.

It must be Bedrock Linux' standard DB.

Here's Expensify's CEO (David Barrett) posting on the p2p-hackers mailing list about this. It's from April 2016.


Yes, we've made a lot of good progress on this front, though nothing ready to demo. Multi-threaded replication is really exciting stuff. As it stands, our "selective sync" capability already gives a lot of headroom for write capacity, but multi-threaded replication will raise that ceiling even higher.

The fundamental issue you're going to run into is that unless you have "genuine partial replication" where only certain shards have a key, or restrict your queries to key/value ones where the key can be determined automatically and used to route to a per-shard master, you can't detect conflicts committed on different nodes without executing half a round trip, which is going to be bottlenecked by the slowest node in your "fast" quorum. That's why nobody uses Generalized Consensus in practice.

Batching before that will help if you already know that most of the transactions touching a particular key are going to the same server. So not a clean partition, but "most of the time". This is the idea behind, e.g., http://www.ssrg.ece.vt.edu/papers/peluso-M2PAXOS-TR.pdf. You try to make sure people usually contact the owner of a particular partition, and use a variety of techniques from the literature to make that well-optimized (ideally, clients actually know which partitions they "should" be accessing, too).

Another approach is to restrict your transactions to deterministic ones (with possible "scouting" transactions to do things like secondary index lookups where you can't statically analyze which partitions they will hit), in which case you can batch all transactions, send them to all participant nodes, and run them afterwards without two-phase commit. This is the approach taken by Calvin (which can exploit per-node parallelism during execution because it uses deterministic, deadlock-free locking). Additionally, read-only transactions scoped to a partition can execute locally without any distributed access, because thanks to determinism serializability is guaranteed. See http://cs-www.cs.yale.edu/homes/dna/papers/calvin-tods14.pdf (and yes, it really can do 500,000 distributed TPC-C transactions per second--by my estimation, more than three orders of magnitude higher than the average total traffic you get, as suggested by your "100x more read transactions than write" estimate. Though of course your current average is higher than your 8-year average, I doubt it is 1000x higher).

Yet another idea is to try to combine coordination required for replication with that required for the distributed transaction, as Tapir does: https://github.com/UWSysLab/tapir. It enjoys the highest linearizable read/write "general" (as in, doesn't have to be submitted in one batch or need static analysis) transaction throughput I've seen of any leaderless georeplicated system (aka it processes transactions "fairly").

Still another approach is to forego serializability for a very slightly weaker guarantee, extended update serializability, which eliminates nearly all interesting anomalies but can allow for dramatically better performance: http://www.ssrg.ece.vt.edu/papers/opodis14-alvin.pdf demonstrates that on read-mostly workloads you can do very well with that even in absurd deployments (e.g. the georeplicated 7 datacenter one in the paper). Elsewhere in the thread you were talking about how "observational" consistency is what's important, which would suggest you are probably already relying on the "strong session" assumption (in which case EUS is indistinguishable from serializability), so I encourage you to give that a look.

All four represent pretty interesting points on the design space. It's not clear to me whether RockSolid would be better than, say, SQLite over Calvin, for your use case (in particular, I suspect Expensify would not find Calvin's requirement that transactions be statically analyzable terribly onerous, especially since you already require all transactions to be executed as stored procedures).

I will say that I'm pleased that a site receiving a reasonable amount of traffic is using SQLite. People consistently underestimate its performance.

Thanks for all these links, I have some reading to do! Also, to clarify one point, we're not doing multi-master writes/replication -- just multi-threaded writes/replication.

Incidentally, the latest plan is to use http://sqlite.org/sessionintro.html to do the following:

1) Spin up multiple write threads

2) Every write thread opens its own database handle

3) Every write thread creates a new "session" object before each write command, and then creates a "changeset" afterwards

4) The first write thread to process a write command calls sqlite3changebatch_new() to create a new batch (initially empty)

5) It then calls sqlite3changebatch_add() to add the changeset to that batch, which returns SQLITE_OK to indicate that it does not conflict with anything in the (currently empty) batch

6) The next write thread calls sqlite3changebatch_add() on the existing batch, providing the write command's changeset

7) If sqlite3changebatch_add() returns SQLITE_OK then it creates a patchset from the changeset, and sends it to the slave.

8) Slaves apply and commit patchsets within the same batch in any order

9) On the other hand, if sqlite3changebatch_add() returns SQLITE_CONSTRAINT, then that means the new changeset conflicts with one or more existing changesets in the batch. In this scenario, it increments the batchID, calls sqlite3changebatch_zero() and then sqlite3changebatch_add() again (to initialize the new batch). It then creates and sends a patchset to the slave for committing in any order with changes in the new batch.

Re: SQLite's performance, I find the database is consistently underestimated in every regard -- performance, stability, functionality, etc. And the database isn't half as amazing as the team behind it. Those are some of the most solid engineers -- in a true sense of engineering (of which programming very rarely is) -- that I've met.

Oh okay. That sounds a lot like OCC in terms of the tradeoffs involved. If you're looking for implementation hints, you might want to check out Silo (http://db.csail.mit.edu/pubs/silo.pdf). It's really hard to beat for transactions that are mostly nonconflicting (if transactions are mostly conflicting it is hard to exploit multiple writers unless you are very clever about how you go about it; I can dredge up a few links if you're interested). My overall understanding is that the usual rule is that unless you have highly partitionable workloads, single writer almost always wins, but I think your current strategy is also going to suffer from a lot of contention even in cases with no conflicts: the write batch is going to be constantly bouncing between cores and require lots of locking, which could easily substantially degrade performance over what it currently is (though obviously, benchmark :P).

From a correctness perspective: SQLITE_CONSTRAINT will only detect write conflicts, not read conflicts, so you are losing serializability (instead you have snapshot isolation, which is what SQLite gives you by default with multiple writers). SI allows a variety of subtle anomalies that can screw you over; see https://wiki.postgresql.org/wiki/SSI and https://wiki.postgresql.org/wiki/Serializable. If you are currently using only a single-writer thread, you will never experience these issues, AFAIK, because you cannot form a "dangerous structure" (you can never have two R-W dependencies that conflict with each other because you can only have one R-W transaction at a time), but I could be wrong about that; however, if I'm not, this could lead to you dealing with some really bad bugs. Note that these kinds of anomalies have a tendency to pop up when you're using stored procedures to perform validation!

You're correct that SI with a single writer will always be serializable.

(I was one of the authors of SSI in Postgres)

And Tapir too, I see... and Arrakis? And speculative Paxos? Keep up the awesome work, I always learn stuff from your papers :) And thanks for confirming, I hadn't actually thought about SI in the single-writer case until that comment.

Thanks! Glad you have been enjoying them.

Hm... thanks for the head's up. I thought that SQLITE_CONSTRAINT would detect both read and write conflicts, but I'll dig deeper there.

As for how much contention there will be, I'm not sure: that's one reason we haven't taken the plunge yet as we sorta want to determine this first. However, given that we have a huge number of users modifying unshared data (eg, our largest group still makes up a tiny fraction of the userbase), I suspect conflicts will be relatively few.

And to be clear: even a little write concurrency goes a long way. If we can even do two writes simultaneously, that doubles our write performance. But I personally suspect the average batch size of non-conflicting transactions will be in the dozens or even hundreds -- meaning I suspect we will achieve so much replication throughput that something else becomes the bottleneck before replication is saturated.

Only one way to find out however!

Thinking about it, your algorithm also does not sound crash-resilient. Specifically, unless you're doing round-trip confirmations, you can easily have different replicas with different (possibly overlapping, possibly disjoint, in weird patterns) sets of changesets applied within a batch. If you have asynchronous replication at all you probably already have to deal with any issues arising from "gaps" in replication on some of the replicas for a single version, but if you don't, be aware that it can be very tricky to deal with properly on recovery. Specifically, if the master applies different transactions (within a batch) to different replicas, then goes down, and you elect a replica that only saw some of the transactions, and there can't be gaps, it's relatively easy to identify which replica is most up-to-date and make that one the leader. But if there can be gaps, there's the possibility that different nodes have different transactions applied, so you'll have to make sure any gaps are filled in before you elect a new leader.

Still, with a total order, it's not that bad, because it's easy to identify gaps; with your algorithm, however, there's no total order between changesets within or across batches (at least, none that I can see in your algorithm), so you won't have any way of knowing whether there are gaps (at least, not without doing something silly like broadcasting every changeset)! Adding a total order seems like the easiest way to resolve that, but like I said even that case can be tricky. Also, your algorithm doesn't specify that replicas which receive changesets with batch ids with gaps in them wait to apply them until the gap is filled in (which means they have to know when the previous batch is done), or alternately that before a batch with a new id can be sent from the master all replicas must have synchronously applied all changes within the old batch. I think something like that is a requirement here in order for there not to be potential conflicts on your replicas during replication even if the master doesn't crash.

IMO, doing exactly your algorithm but delaying replication or transaction confirmation until a specified amount of time has passed (say 10 ms), then synchronously sending the entire changeset and incrementing the batch id (aka group commit) is a much better idea. You exchange marginally worse latency (and more bursty network usage) for a far less complex update scenario (yes, you're back to having a total ordering on writes at replicas you need to follow, but you can include multiple changesets in the batch). You can also have your master monitor number of writes and only turn on group commit if there's unusually high write volume.

(Also, belatedly reading one of your other comments more carefully, I'm fairly confident commitCount will also not work correctly if changesets can be applied out of order on different replicas, even if there was no data loss, since the same commitCount could include different sets of changesets at any one time).

To be clear, my sense is that every replica would receive every transaction, and would be free to commit the transactions inside each batch in any order.

Now I do agree that it's tricky to avoid "gaps" in the failure cases. However, every replica keeps a record of the past several million transactions (we aim for 3 days), and every transaction is assigned a unique ID. When a replica starts up, it "synchronizes" down every missing transaction it has, and at this point would "repair" any gaps it somehow obtained when it went down.

Admittedly, the exact details of that part are TBD, but it doesn't strike me as an unresolvable problem on the surface.

The question is:

(1) In the event of a crash, how does it know which transactions it's missing without a total order on the write transactions?

(2) In the non-failure case, how do you know the transactions in one batch are ordered before the transactions in a subsequent batch (which requires the replica to identify any "gaps" within the previous batch so it can wait for them to come in before continuing on)?

I don't think any of this is unresolvable (just adding a total order would go a long way). I do think it's very tricky to get right, with lots of edge cases, and that if you can't efficiently identify missing transactions it potentially makes recovery unusably slow.

In any case, since I believe you are losing serializability with multiple writers, you should probably weigh that against any performance gains you get from multiple writers (I think even the group commit variant suffers from this problem).

Also, as pointed out in https://twitter.com/aphyr/status/788757992829222912, even your current solution isn't safe if your commit order isn't total across leader changes (you can resolve this by adding an epoch number that increments every time leader election occurs). Strongly recommend you take him up on his offer.

To be clear, we're talking about functionality that is not implemented or fully designed. Today all transactions are committed on all nodes in the same order, which is a much simpler world. I agree, the multi-threaded replication case is a much more complex and interesting world, with much greater performance opportunities. Lots of exciting problems to solve when we get there!

> Today all transactions are committed on all nodes in the same order, which is a much simpler world.

This is difficult to reconcile with:

> - For the highest performance, you can designate a transaction as "asynchronous" and the master will commit immediately

because if the leader crashes, a replica becomes leader and starts accepting writes, then the old leader recovers as a replica, without something like an epoch number it won't be able to tell that it has commits that the current leader doesn't (using a unique incrementing transaction number based on just a counter at the leader won't work, because it won't necessarily be unique across leader elections thanks to the asynchronous commits).

Ah, sorry for the confusion. Every transaction is given an incrementing ID by the leader, and every follower commits the transactions in ID order.

Furthermore, every commit has a running SHA hash of all prior commits (and every node keeps a history of the last few million commits). This way any two nodes can compare their journals to make sure they agree -- and if there is any split, then the cluster kicks that node out.

Basically, there is no scenario in which a node that commits a different transaction (or a transaction in a different order) is allowed to remain in the cluster.

I think this or something like this can probably work if you're okay with losing all data that wasn't acked by a majority (though I suspect actually recovering a divergent replica would be very difficult), but this doesn't work with the batch commit idea at all, does it? Seems like it enforces strict serial ordering of writes (even nonconflicting ones).

For all the emphasis put on multi-datacenter replication, it would be awesome to see it documented!

Really looking forward to playing with this. Cool stuff.

There's a bit of that here: http://bedrockdb.com/multizone.html

But yes, lots more details to add over time. Thanks for asking!

I think it is important to encourage people to build their own databases, so I want to commend you guys for doing it. But I do have some questions (disclosure: I am the author of https://github.com/amark/gun ):

- Is it correct to say that Bedrock is primarily a replication layer for SQLite? Most things your homepage highlight seem to be SQLite features, not Bedrock features.

- Elsewhere in this thread you mention you run PAXOS (what implementation are you using? Your own? Do you have a test we can run to verify correctness?) But then you mention you sacrifice the "C" in the CAP Theorem, which would go against any point of having PAXOS. What happens in a split-brain scenario, where you have lets say 6 peers, and there is a network partition straight down the middle?

- You also mention elsewhere in this thread that if you are talking to the same node (a sticky session) there is no consistency issues. Do you mean linearizable? Because consistency has to do with multi-machine data consistency. For instance, if a write happens on another peer, then the node you are talking to should return that write not a stale write.

- Is there anything other than just assuming it is run on SSD that make it faster than SQLite on SSD?


Re: SQLite -- Yes, Bedrock is just the replication layer, SQLite does all the SQL and storage.

Re: Paxos -- It's our own implementation. Split brain is prevented by the master refusing to stand up unless a majority of configured (not just active) slaves approve its standup request. So in a 6 node deployment, 1 master and 5 slaves, 3 of the slaves would need to approve. This means one "half" would have 4 nodes, and the other would have 2 -- and the half with 2 would recognize it doesn't have quorum and thus stay idle. In a scenario where there is a split down the middle, nobody would do anything because nobody has quorum. (This is precisely why you shouldn't deploy in just 2 datacenters -- three is the magic number.)

Re: Sticky sessions -- True multi-machine consistency is impossible to guarantee and has no real practical application. The only consistency that matters is from the perspective of an observer, ensuring that it always sees a world whose time arrow progresses linearly forward. If that is what "linearizable" means, then perhaps. Sorry, my terminology could be off. Thanks for the correction!

Re: SQLite on SSD -- Sorry, I didn't mean to claim Bedrock was somehow faster than SQLite. Indeed, Bedrock adds overhead to SQLite (to do all the networking and such). My main claim is that Bedrock is faster than other databases, in particular when using C++ stored procedures for complex operations.

Thanks for the questions!

Is the set of servers that participate a paxos value too (ie determined using the consensus algorithm), or is it configured ? Can you grow a cluster without bringing it down ?

Unfortunately you can't add new nodes without reconfiguring the cluster, and currently that requires restarting each server (though so long as you don't do it all at once, server restarts are normal maintenance that causes no downtime to the end user). This could likely be added without too much effort, but also the real-world use case of this is uncertain.

The consensus algorithm is only used to elect a master -- once the master is identified, it coordinates all the distributed transactions. If the master dies, everyone who remains elects a new master and re-escalates all unprocessed write transactions to the new master.

Read transactions are processed locally by each node, so only writes need to be escalated.

Paxos: I've been trying to look for that. Having cloned the code and grepped for paxos I'm getting no hits. Where is the paxos implementation?

It's not called out specifically (actually, when writing it I didn't even know what Paxos was, and only realized I had implemented it years later). However, the logic is here: https://github.com/Expensify/Bedrock/blob/master/sqliteclust...

With these: https://github.com/Expensify/Bedrock/blob/ecda922dc279e06fda...

Do you use the Issue Tracker to keep on top of things like that, and/or prioritise?

We're using GitHub Issues: https://github.com/Expensify/Bedrock/issues However, honestly those specific issues aren't on the list. In general we focus less on what could happen, and more on what actually does happen. Those specific issues haven't ever occurred, and thus never got "fixed" because they never became real problems. But PRs welcome!!

> Bedrock is designed from the ground up to operate in a multi-datacenter environment with rigorous use of stored procedures – the new standard for modern application design.

stored procedures—new standard for modern application design?

Well, perhaps "new" is relative. But yes, I think strong application design moves the queries into stored procedures that execute inside the database, exposing a higher-level interface to the application layer. This not only maintains better layering from a software engineering perspective, but also provides better isolation from a security perspective:

Your webserver is the first thing to be hacked by an attacker because it's the server that sits on the internet. If all your security logic is built on your webserver, an attacker can easily bypass it all. However, if your security logic is built into your database, then it dramatically limits the damage a hacker can do from the webserver.

In particular, I would recommend you create an "Authenticate" stored procedure that accepts a username and password, returning an "authToken". Then make all your other functions into similar stored procedures, each of which accepts and verifies an "authToken" before returning the results.

(And in the case of Bedrock, you might also disable the Bedrock::DB plugin entirely to prevent direct access to your database from the webserver, forcing everybody to go through your stored procedures.)

This design means attackers who root your webserver can't randomly access your database without knowing the username/password (or valid authToken) -- all of which is neatly and securely contained inside the database itself.

To be clear, if you do keep your SQL on your webserver (as most applications honestly do, though I believe it's a mistake), then the Bedrock::DB plugin is perfect for you. But I think Bedrock::DB (or really, any direct access to a database outside of stored procedures) should be largely viewed as a programming and maintenance convenience, at the expense of security.

To take this further, perhaps pay Aphyr for his time and run Jepsen tests?

How does the client get "redirected" to another node if the one you're talking to fails? It is transparent to the client?

Our PHP library has the concept of a primary and failover host. We recommend configuring a local Bedrock node as the "primary host", and then setting up a load-balanced pool of all nodes to be used as the "secondary host". This means in normal operation, every webserver talks directly its nearest node (bypassing the load balancer), but "fails over" to the load balancer if the primary is down.

Granted, a simpler configuration is to just have two load balancers -- both of which have all nodes -- and then use each as the primary/secondary. We did this for years and it was fine, but for just a tiny bit more configuration overhead you can bypass the load balancer for the primary and thus shave off those precious milliseconds from each request.

Hi this is very interesting. Can you elaborate on your concept of plugin development ?

You clearly state that your ideal model is to develop plugins that are embedded in the DB to execute core business functionalities. However, in the website there is only tutorial documentation about the SQL interface.

Is the only guide the code for now ?

Thanks !

Discuss in real-time here: https://gitter.im/Expensify-Bedrock/Lobby

How is data distributed over multiple hosts? Aside from being replicated, is is sharded? Does it have any sort of redundancy? What happens if you lose a host?

Every host has all data. (This sounds crazy, until you do the math on how cheap storage is and how comparatively "little" data people actually need in the real world.) So we have 6x redundancy in normal operation, across three datacenters (and three different power grids, three different providers). There are also nightly backups.

Two questions:

If I'm not mistaken, creating an index is a blocking operation in SQLite (other writers are blocked). How do you manage this in production?

How do you create a stored procedure? Do you have to recompile the whole program? How do you deploy with zero downtime?

Great questions! Yes, indexing is actually a significant challenge at our scale, as it can take a long time to add and currently, it is a blocking operation. However, the sqlite team is amazing and has all sorts of write-concurrency tricks up their sleeve to allow for creating indexes in a parallel thread. We haven't used it yet, but we plan to.

As for the stored procedures, yes they are written in C++, and thus deploying them requires compiling and upgrading the server itself. However, we have 6 of them (and honestly, any one of which has enough read capacity to satisfy about all our traffic) so a minute of downtime to upgrade for each independently isn't a problem.

The result is zero downtime as perceived by the user, even though each server has occasional downtime for maintenance and upgrades. In general I'd say we upgrade the database about weekly, or more frequent depending on how active we are in the stored procedures.

> Yes, indexing is actually a significant challenge at our scale, as it can take a long time to add and currently, it is a blocking operation.

What do you do during index creation: Do you drop write requests on the floor? Do you store them in a persistent queue somewhere? Or do you remove a node from the cluster, create the index, re-add the node to the cluster, and repeat this on every other nodes?

I'm asking because at some point I thought of using SQLite, embedded in a process containing the business logic and the network layer, but I abandoned this idea specifically because of blocking operations like index creation, and decided to use PostgreSQL instead.

> the sqlite team is amazing and has all sorts of write-concurrency tricks up their sleeve to allow for creating indexes in a parallel thread

I'd be very interested in any link related to this new feature of SQLite.

> However, we have 6 of them (and honestly, any one of which has enough read capacity to satisfy about all our traffic) so a minute of downtime to upgrade for each independently isn't a problem.

So, basically, you're doing a rolling upgrade? Have you evaluated compiling your stored procedures to dynamically loaded libraries, or using an interpreted language like Lua/Python/JavaScript?

> What do you do during index creation: Do you drop write requests on the floor? Do you store them in a persistent queue somewhere? Or do you remove a node from the cluster, create the index, re-add the node to the cluster, and repeat this on every other nodes?

The last of those -- for small indexes, we just replicate them out like normal queries. For large indexes, we take that node down and add offline.

> So, basically, you're doing a rolling upgrade? Have you evaluated compiling your stored procedures to dynamically loaded libraries, or using an interpreted language like Lua/Python/JavaScript?

Correct, rolling upgrades. It's worked well to date, but the idea of putting the plugin into a dynamically loaded library is really interesting. Our plugin system is relatively new (only the past few months) so it hasn't really been considered. Great idea!

Great! Thanks for following-up.

If you have any link about "SQLite write-concurrency tricks to allow for creating indexes in a parallel thread", please share ;-)

So where are the other copies? The documentation tells you how to set up Bedrock locally, but to get redundancy, you need multiple copies talking. How do you set that up?

And how does security work?

Here's a quick overview of how to configure in multiple zones: http://bedrockdb.com/multizone.html

Security isn't provided by Bedrock (or really, any database, at least not well). Rather, with Bedrock security is the responsibility of the application layer, but built into the database via stored procedures.

Linux-only? Or will it run on sqlite-supported platform?

We do most of the development in MacOS, but deploy it to Linux. I've never compiled it for Windows, but I imagine it would be straightforward.

How do you monitor a database / system like this?

The Bedrock::Status plugin returns full status on its internal state, as well as the state of the other nodes it sees. We have Icinga monitor this to make sure the serve is not merely up, but in the correct state.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact