If the design goal is write throughput, SQLite can be beaten easily by the client/server systems, since each SQLite instance supports only one concurrent writer. I suppose users of Bedrock could partition their datasets into pieces that aren't often written in tandem, but why go through all that trouble instead of just using a database with real MVCC?
If the design goal is geo-replication, Bedrock seems to ignore many years of innovation in the field. MySQL and company have spent a long time methodically building features -- such as GTIDs -- that allow safe synchronous and asynchronous replication to minimize the risk of downtime or data loss.
If the design goal is read throughput, I don't think Bedrock can be faster than plain SQLite when your data can live beside your application. And if you have more than one node, you need a complicated SQL optimizer that understands how to push things like predicates and aggregates down to the leaves, then recombine results at the top level, to actually read efficiently in parallel -- which I don't think Bedrock has built yet.
This thing seems reminiscent of Jetpants or one of the other MySQL orchestration tools that automatically scale the DB across many nodes... but... why SQLite?
Re: "SQLite can be beat really easily by the client/server systems since each SQLite instance only supports one concurrent writer" -- That's not actually true: SQLite supports concurrent writers via its page-locking branch (and changesets allow for effectively row-level locks). But none of that matters, because single-threaded replication makes any multi-threaded write capability irrelevant. Regardless, when we add multi-threaded replication (on the way), we'll take advantage of SQLite's multi-threaded write capabilities.
Re: MySQL's replication. What specifically are you referring to? Its distributed transaction features are actually quite new -- and I think were developed after Bedrock was already in production for years. Furthermore, MySQL doesn't support automatic master failover (or recovery), and struggles over high-latency, low-reliability WAN connections. Anyway, if I'm wrong on any of that, I'm eager to be corrected. Thanks!
Re: Read throughput. Agreed it doesn't have "parallel reads" (eg, it doesn't split a large read up into multiple smaller reads that are joined automatically). However, I don't think MySQL does either. Regardless, Bedrock does quite well for normal read queries, and uses all of the node's CPUs efficiently. (To be clear, I love this idea of adding "super threaded reads", and have some ideas. But at the moment we don't have any application that requires it, so haven't built it.)
Re: Jetpants, I haven't heard of that before. Thanks for the tip! But as for why SQLite over MySQL, I'd ask quite the opposite: why add a networked layer on top of MySQL, thereby ignoring all of MySQL's networking? That just adds all the overhead of MySQL, without any of the benefit. If you're going to build a networking layer atop a database, SQLite seems the clear choice because it is explicitly designed to be a database library -- and not a database server.
Thanks for all these comments!
Guides to doing auto failover and recovery:
1) What happens if "mysqlfailover" itself dies? For example, if I have 6 servers split equally between 2 datacenters, and the datacenter running mysqlfailover loses power -- how does the other datacenter get reconfigured?
2) If you run two copies of mysqlfailover (one in each datacenter), how does it solve the "split brain" problem? If you have 6 servers split equally between two datacenters, and a clean network severance down the middle, what prevents both sides from configuring and operating a master? (This is the nightmare scenario.)
3) If mysqlfailover dies, it sounds like it will prevent any other copy from running without manual intervention: "At startup, the console will attempt to register itself with the master. If another console is already registered, and the failover mode is auto or elect, the console will be blocked from running failover. When a console quits, it unregisters itself from the master. If this process is broken, the user may override the registration check by using the --force option."
Overall, this sounds like a great improvement over what existed before (eg, nothing), but still a pretty brittle, manual, and (in the case of split-brain) very dangerous approach. Have you had good experience with it in practice?
We have been using it since 2014. First in line is an HAProxy which directs the traffic to one of the 3 nodes. If one of the nodes goes down, HAProxy directs the traffic to the other two nodes.
This works quite well, as long as not all 3 nodes are down / lose network connection. After that you have to initialise the cluster again. One node is the new master and all later changes in the other databases are overwritten. Then we connect each node to the new cluster.
The biggest problems are write-intensive operations. We had more luck using two nodes as read nodes and the third node as the write node. When the data on two nodes got updated simultaneously, performance suffered heavily. After some smarter scheduling it works pretty well (as long as not all the nodes go down or the network connection between the nodes is lost). The slowest node can block write operations for all the other nodes.
Re: "struggles over high-latency, low-reliability WAN connections" If by struggle you mean a vanilla MySQL replication setup doesn't satisfy C, A and P of the CAP theorem then yeah I agree, however if you just mean that it struggles to be consistent with an unreliable network, that just means MySQL has made the same CAP tradeoff Bedrock has. AFAIK there aren't many critical bugs that mean you lose data that could have otherwise been stored in the presence of unreliable networks.
You're right that vanilla MySQL doesn't support automatic failover; however, some consider that a feature, because the engineering required to correctly implement this distributed system is genuinely challenging. Bedrock may have done it, but if there are complicated cases it has yet to address, users may prefer to manage failover manually before they trust it to the system. MySQL also isn't the only system we should draw comparisons to -- Cassandra or Riak come to mind as well. Both have excellent replication for geo-redundancy and something akin to automatic failover, storing data in a configurable number of places. They don't support raw SQL statements like Bedrock does, but to be honest, if Bedrock doesn't support actual SQL fanout where the results can be combined to make the thing seem like one big database, the client is left to do a bunch of the work -- much as they would in Cassandra -- to model their data and fetch it efficiently.
I'm not trying to suggest that MySQL or a MySQL orchestration layer is better than Bedrock for all use cases; I just don't understand what niche Bedrock is trying to fill. For basic consumers of the system it's more work to author your queries for Bedrock because (I think) you now have to manage multiple logical SQL instances instead of one virtual big one, and you only get as much read throughput for as much data as you are willing to duplicate. You only get write throughput for the same but at 1 writer per shard/slave/whatever it's called. It's using a homegrown replication algorithm that has the same consistency tradeoffs of other more widely deployed and thus better tested solutions, and it's not clear that database builders or the target audience for Bedrock-as-framework would actually want to use SQLite due to the performance considerations mentioned above.
Re: "brand new technology not yet in trunk" -- I'm not sure what you mean by that. Bedrock has been used continuously for 8 years.
Re: "struggles over high-latency, low-reliability WAN connections" -- Ah, I mean MySQL's replication is designed for active/passive deployments with manual failover, connected via fast/reliable networks. Not saying it can't support slow/unreliable WAN connections, only that it requires a lot of glue code and manual recovery when things go more wrong than it's designed to handle.
Re: "the engineering required to correctly implement this distributed system is genuinely challenging" -- Agreed! This is precisely what Bedrock provides.
Re: "SQL fanout" -- To my knowledge, no RDBMS does this automatically. Furthermore, very few real world applications actually require this. Don't get me wrong: this is cool stuff. But this isn't a common requirement. Most businesses will never exceed the capacity of a single server -- the number of businesses that truly require this level of scalability is very small.
Re: "I just don't understand what niche Bedrock is trying to fill." -- The niche of businesses that want a simple SQL database that has built in automatic failover.
Re: "multiple logical SQL instances" -- I'm not sure what you mean by this. Bedrock maintains a single contiguous SQL database across all nodes.
Re: "you only get as much read throughput for as much data as you are willing to duplicate" -- Yes, I'm saying there are very few reasons not to duplicate it all for the vast majority of real world use cases.
I've been unable to find the branch on SQLite website. Do you have a link to the branch and changesets?
When shouldn't people use Bedrock? https://sqlite.org/whentouse.html is a wonderful page -- full of guidance helping would-be SQLite users recognize when they're in a use case better served by some other tool.
As for the CAP theorem, we compromise on "C". Each slave responds based on its current state, which might be a different state than other slaves or the master. However, every response returns the "commitCount" of the database at the time the response was generated. Then when you submit a new request, you can specify that commitCount and the slave will wait until it is at least as recent as that.
This way you can safely act upon the response of one request, and then make a second request to another.
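The commitCount pattern described above can be sketched in a few lines of Python. Everything here -- the `Node` and `Client` classes and their methods -- is a hypothetical stand-in for Bedrock's actual wire protocol; only the watermark bookkeeping is the point:

```python
# Illustrative sketch of causal "read your writes" across failover using a
# commitCount watermark. These classes are toy stand-ins, not Bedrock's API.

class Node:
    """A toy replica: applies writes in order and reports its commit count."""
    def __init__(self):
        self.log = []

    def commit_count(self):
        return len(self.log)

    def write(self, value):
        self.log.append(value)
        return self.commit_count()

    def read(self, min_commit_count):
        # A real slave would *wait* until it catches up; here we just check.
        if self.commit_count() < min_commit_count:
            raise RuntimeError("replica too stale; would block until caught up")
        return list(self.log)

class Client:
    """Tracks the highest commitCount it has observed, so a later request to
    a *different* node can demand at-least-that-fresh data."""
    def __init__(self):
        self.seen = 0

    def write(self, node, value):
        self.seen = max(self.seen, node.write(value))

    def read(self, node):
        return node.read(self.seen)

a, b = Node(), Node()
c = Client()
c.write(a, "expense #1")   # the client has now "seen" commit 1
b.log = list(a.log)        # replication catches b up
print(c.read(b))           # safe: b is at least as fresh as what we observed
```

A read against a node that hasn't caught up to the client's watermark would block (here, raise) rather than return stale data.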
- All nodes connect to all other nodes
- Each node has a priority; a Paxos algorithm is used to identify the highest priority node, which "stands up" to be the master
- All nodes respond to read queries using the local database.
- So long as you always talk to the same node, there are no consistency issues: you are guaranteed that each request will be answered with a database at least as fresh as the last.
- However, if you switch databases (eg, a node goes down and you are forced to go to a different one that might not be as fresh), it will wait until it is as fresh as the node you were using
- Write requests are escalated to the master, which processes them with a two-phase commit distributed transaction
- By default, the master waits for a "quorum" of the cluster to approve the transaction before committing
- However, this limits write capacity to 1/median(rtt) of the slaves
- Accordingly, we have a "selective synchronization" feature where you can optionally designate some transactions as requiring "less consistency". For example, you can specify that the master should commit when at least one other node approves.
- For the highest performance, you can designate a transaction as "asynchronous" and the master will commit immediately
- For example, when someone is reimbursed, we absolutely want to get quorum approval of the transaction: we can't risk losing that if the master crashes. But if someone just adds a report comment, we can risk losing it upon master crash.
- When the master does go down (either gracefully or not), the next highest priority node steps up seamlessly
- Any command escalated to the master but not processed will be re-escalated to the new master
- When a higher priority node comes up, it synchronizes to get up to date with everything it missed, and then becomes master
Anyway, lots of details here and I agree they haven't been well captured in the site yet. Coming soon!
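For illustration, the three commit policies described above (quorum, one-peer approval, asynchronous) might be modeled as a simple predicate. The names and thresholds here are illustrative, not Bedrock's actual API:

```python
# Minimal sketch of the commit policies described above. "approvals" is how
# many peers have acknowledged the transaction; "peer_count" is the number of
# quorum-participating peers configured in the cluster.

def can_commit(policy, approvals, peer_count):
    if policy == "quorum":   # wait for a majority of the full cluster
        return approvals >= peer_count // 2 + 1
    if policy == "one":      # "less consistency": any single peer suffices
        return approvals >= 1
    if policy == "async":    # commit immediately, replicate in the background
        return True
    raise ValueError(policy)

# With 5 peers: quorum needs 3 acks, "one" needs 1, "async" needs none.
assert can_commit("quorum", 3, 5)
assert not can_commit("quorum", 2, 5)
assert can_commit("one", 1, 5)
assert can_commit("async", 0, 5)
```

This makes the durability tradeoff explicit: a reimbursement would use `"quorum"`, while a report comment could use `"async"`.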
- Faster? Provide a benchmark. Faster is a relative thing: faster than what? Faster under which conditions?
- More reliable... than what? Under which conditions? How do you know it is more reliable?
- More powerful? I thought SQLite is more constrained than other SQL databases, e.g. no stored procedures. Powerful how? Compared to exactly what?
Nullius in verba. I don't take your statements for granted; give me proof, otherwise it's just flatus vocis.
I think Bedrock's replication is definitely more powerful than MySQL's as well.
So in general I think it's more powerful than MySQL. I'd love to hear your thoughts about where MySQL has the edge, however. Thanks!
> rqlite is a distributed relational database, which uses SQLite as its storage engine.
> rqlite gives you the functionality of a rock solid, fault-tolerant, replicated relational database, but with very easy installation, deployment, and operation.
Also, is Bedrock DB only 30 days old? (since the 'first commit' message on GitHub?) That's barely topsoil!
I'm not actually familiar with Rqlite, but wow, it does look very similar. Thanks for the tip! Does anybody use it at scale?
I have no idea if anyone uses Rqlite, I just remember it from previous discussions here. Very similar indeed...
> built atop SQLite, the fastest, most reliable, and most widely distributed database in the world.
Most widely distributed maybe, but far from the fastest and AFAIK others beat it on reliability as well. No disrespect to SQLite, it's far more feature-rich than other embedded-style DBs, but without a definition of "database" that excludes them the statement as written is false.
> written for modern hardware with large SSD-backed RAID drives and generous RAM file caches
In other words, relies on those things so it won't work worth crap on anything less.
> The easiest way to talk with Bedrock is using netcat as follows
Nice security model you've got there.
Re: SQLite reliability. I'm confused how SQLite isn't more reliable than every other contender, merely given that there is so much less to fail. In 8 years, 2158341079 writes, and probably 100x that many reads, I don't think I've ever had a single failure or bug with SQLite itself.
Re: SSDs -- That's a fair point. Bedrock is not the ideal choice if you are using small volumes and spinning disks.
Re: Security -- I think all databases have crap security that nobody really relies upon. Bedrock doesn't try to pretend. Rather, it gives you a very complete plugin framework, such that you can write C++ "stored procedures" that enforce your application's security model. In practice, we run with the Bedrock::DB plugin disabled, such that the only way to access the database is via our stored procedures -- and those do complete authentication and authorization in an application-specific way that no database could in a generalized fashion.
Thanks for asking!
I'm curious: how do you keep track of this number?
This could be, for databases, what Node was for server-side programming. There are parallels: you're taking a "client-side, single-writer (excellent) DB" and making it apt for the cloud with Paxos. Again, great idea!
And it seems you're taking some "strong" comments here with good humor. +1, sir. +1 to you.
Counting "failures" is difficult -- if it breaks, it's because of some bug in our application logic (eg, our stored procedure). Most of our restarts are due to normal maintenance and upgrades. In the history of the company there were a handful of core problems to the logic (generally as we encountered some weird edge case for the first time), but I can't remember the most recent.
Regardless, I'll try to get better data on this. Thanks for asking!
Have you run this through Jepsen, or done any actual load testing or deep testing for failures related to machines dying, network partitions, or the like?
I encourage you to do so, because (not trying to sound rude) this kind of reads like "hey my homemade car has 25,000 miles on it and I still use it". But you can build a performance and failure testing framework to really put it through its paces.
Anyhow, have fun and good luck~~
And seeing actual numbers is important, too. People often act like, in terms of write volume, you're either shared hosting fodder or you're Facebook, but there's a lot of room in between, and a lot of HA workloads are very read-dominated. It's very hard to find information about the businesses that fall in the middle of that spectrum, because usually no technology company talks about its traffic unless it's bragging about it (and then will release only vague information that could be spun in lots of different ways -- for instance, not to call you out, I have no clue what "HA" means in your context, whether you need distributed transactions, whether the writes are cleanly partitionable, what isolation level you need, what your latency requirements are, or what the read:write ratio looks like). So I'd rather encourage people to release their numbers than poo-poo them because they aren't as big as whatever the biggest system you've worked on is.
I agree, actual numbers are great.
We also agree: Jepsen is great.
But I am saying that another path to a similar level of confidence to what Jepsen provides is some real critical stress testing.
Sure, it will be out of the scope of what most people need, but it will give them the confidence to say "rock solid distributed data".
> In practice, we deploy two servers in each datacenter (across three geographically-distributed datacenters, so six nodes total in the cluster -- with one configured as a "permaslave" that doesn't participate in quorum). Given this, the node that's in the same datacenter generally gets most if not all of the master's transactions in a real world crash scenario.

This makes 2 nodes per datacenter. Do the 2 nodes cooperate in some way, or are they fully independent from each other (like shards)?
* Jobs - Provides a simple job queue.
* Cache - Provides a simple replicated cache.
...while these look like nice-to-have features that make life easy for developers, IMO a database which is also a queue and also a cache is a way to end up with a mess that you can't scale out and that is hard to reason about.
Have seen this happen with Redis -- unless you're disciplined about how you use and deploy it, having db, cache, and queue all in one place is a honeypot for developers that turns into a future nightmare for operations.
In practice, a job queue, a cache, and a database all do largely the same thing: maintain some internal state on disk, respond to some networked API calls, and synchronize across multiple servers. The actual logic of a job queue or cache is pretty trivial compared to what's underneath.
Indeed, I think the lesson of Redis is exactly why you should build these systems on top of a real database: they are struggling to hack on replication and reliable storage, even though these are "free" when built atop a database.
Edit: Turns out it's authored by the CEO! So you'd have to be nuts to use it, period. But, um, poke the source if you're curious. This is a good intro https://github.com/Expensify/Bedrock/tree/master/libstuff#li... I can't imagine a more ironic and hilarious backdrop for the "We Fire People" campaign. Must be tough finding people who can keep up with a brain that big.
> Computers are super fast these days, and real world applications spend 99% of their CPU time doing some very specific item, with the rest of the time spent doing just boring stuff. Accordingly, libstuff is not intended to be hyper-efficient: it's intended to be hyper-usable. This means libstuff uses a lot of std::string objects for pretty much everything. Where a more complex object is required, some higher-level string-based object is used.
Look. Just about anyone can create some "database" and declare it "rock-solid".
Frankly, a simple key-value store is built in most languages.
Do you have more than that?
Do you have benchmarks? You know, competing against the other providers?
I think the argument in favor of it being more reliable is an easier one, however, as Bedrock is specifically designed (and used, at scale, for 8 years) to provide highly available redundancy in the form of WAN-replication and automatic failover.
Regardless, I think you'll agree that "rock-solid" is not a precise technical term, and I hope you'll allow a bit of play on words. Whether or not it's literally as solid as a rock is unclear, but it is definitely very reliable -- and more reliable than most.
Batching before that will help if you already know that most of the transactions touching a particular key are going to the same server. So not a clean partition, but "most of the time". This is the idea behind, e.g., http://www.ssrg.ece.vt.edu/papers/peluso-M2PAXOS-TR.pdf. You try to make sure people usually contact the owner of a particular partition, and use a variety of techniques from the literature to make that well-optimized (ideally, clients actually know which partitions they "should" be accessing, too).
Another approach is to restrict your transactions to deterministic ones (with possible "scouting" transactions to do things like secondary index lookups where you can't statically analyze which partitions they will hit), in which case you can batch all transactions, send them to all participant nodes, and run them afterwards without two-phase commit. This is the approach taken by Calvin (which can exploit per-node parallelism during execution because it uses deterministic, deadlock-free locking). Additionally, read-only transactions scoped to a partition can execute locally without any distributed access, because thanks to determinism serializability is guaranteed. See http://cs-www.cs.yale.edu/homes/dna/papers/calvin-tods14.pdf (and yes, it really can do 500,000 distributed TPC-C transactions per second--by my estimation, more than three orders of magnitude higher than the average total traffic you get, as suggested by your "100x more read transactions than write" estimate. Though of course your current average is higher than your 8-year average, I doubt it is 1000x higher).
Yet another idea is to try to combine coordination required for replication with that required for the distributed transaction, as Tapir does: https://github.com/UWSysLab/tapir. It enjoys the highest linearizable read/write "general" (as in, doesn't have to be submitted in one batch or need static analysis) transaction throughput I've seen of any leaderless georeplicated system (aka it processes transactions "fairly").
Still another approach is to forego serializability for a very slightly weaker guarantee, extended update serializability, which eliminates nearly all interesting anomalies but can allow for dramatically better performance: http://www.ssrg.ece.vt.edu/papers/opodis14-alvin.pdf demonstrates that on read-mostly workloads you can do very well with that even in absurd deployments (e.g. the georeplicated 7 datacenter one in the paper). Elsewhere in the thread you were talking about how "observational" consistency is what's important, which would suggest you are probably already relying on the "strong session" assumption (in which case EUS is indistinguishable from serializability), so I encourage you to give that a look.
All four represent pretty interesting points on the design space. It's not clear to me whether RockSolid would be better than, say, SQLite over Calvin, for your use case (in particular, I suspect Expensify would not find Calvin's requirement that transactions be statically analyzable terribly onerous, especially since you already require all transactions to be executed as stored procedures).
I will say that I'm pleased that a site receiving a reasonable amount of traffic is using SQLite. People consistently underestimate its performance.
Incidentally, the latest plan is to use http://sqlite.org/sessionintro.html to do the following:
1) Spin up multiple write threads
2) Every write thread opens its own database handle
3) Every write thread creates a new "session" object before each write command, and then creates a "changeset" afterwards
4) The first write thread to process a write command calls sqlite3changebatch_new() to create a new batch (initially empty)
5) It then calls sqlite3changebatch_add() to add the changeset to that batch, which returns SQLITE_OK to indicate that it does not conflict with anything in the (currently empty) batch
6) The next write thread calls sqlite3changebatch_add() on the existing batch, providing the write command's changeset
7) If sqlite3changebatch_add() returns SQLITE_OK then it creates a patchset from the changeset, and sends it to the slave.
8) Slaves apply and commit patchsets within the same batch in any order
9) On the other hand, if sqlite3changebatch_add() returns SQLITE_CONSTRAINT, then that means the new changeset conflicts with one or more existing changesets in the batch. In this scenario, it increments the batchID, calls sqlite3changebatch_zero() and then sqlite3changebatch_add() again (to initialize the new batch). It then creates and sends a patchset to the slave for committing in any order with changes in the new batch.
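If I understand the scheme, the batching logic can be modeled in miniature by treating a changeset as the set of rows it writes, with set intersection standing in for `sqlite3changebatch_add()` returning SQLITE_CONSTRAINT. This is a rough sketch of the control flow, not the real session API:

```python
# Toy model of the batching scheme above: a "changeset" is modeled as the set
# of rows it writes, and two changesets conflict when those sets intersect.

def assign_batches(changesets):
    """Return (batch_id, changeset) pairs in arrival order. Changesets sharing
    a batch_id are mutually non-conflicting, so a replica may apply and commit
    them in any order; conflicting ones are pushed into the next batch."""
    batch_id = 1
    touched = set()            # rows written by the current batch
    out = []
    for cs in changesets:
        if touched & cs:       # conflict: zero the batch, bump the id
            batch_id += 1
            touched = set()
        touched |= cs
        out.append((batch_id, cs))
    return out

writes = [{"rowA"}, {"rowB"}, {"rowA", "rowC"}, {"rowD"}]
batches = assign_batches(writes)
# rowA reappears in the third changeset, so it opens batch 2; rowD doesn't
# conflict with anything in batch 2, so it joins it.
print([b for b, _ in batches])   # [1, 1, 2, 2]
```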
Re: SQLite's performance, I find the database is consistently underestimated in every regard -- performance, stability, functionality, etc. And the database isn't half as amazing as the team behind it. Those are some of the most solid engineers -- in a true sense of engineering (of which programming very rarely is) -- that I've met.
From a correctness perspective: SQLITE_CONSTRAINT will only detect write conflicts, not read conflicts, so you are losing serializability (instead you have snapshot isolation, which is what SQLite gives you by default with multiple writers). SI allows a variety of subtle anomalies that can screw you over; see https://wiki.postgresql.org/wiki/SSI and https://wiki.postgresql.org/wiki/Serializable. If you are currently using only a single-writer thread, you will never experience these issues, AFAIK, because you cannot form a "dangerous structure" (you can never have two R-W dependencies that conflict with each other because you can only have one R-W transaction at a time), but I could be wrong about that; however, if I'm not, this could lead to you dealing with some really bad bugs. Note that these kinds of anomalies have a tendency to pop up when you're using stored procedures to perform validation!
(I was one of the authors of SSI in Postgres)
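For readers unfamiliar with write skew, here is the classic anomaly in miniature: two snapshot-isolation transactions each validate an invariant against their own snapshot and then write disjoint rows, so write-conflict detection never fires, yet the invariant ends up violated. This is a self-contained toy, not real SQLite code:

```python
# Classic "write skew" under snapshot isolation. Invariant: at least one
# doctor must remain on call. Each transaction checks the invariant against
# its snapshot and writes a *different* row, so the write sets never overlap.

def run_write_skew():
    db = {"alice": "on_call", "bob": "on_call"}

    # Both transactions take their snapshot before either commits.
    snap1 = dict(db)
    snap2 = dict(db)

    # T1: "someone else is on call, so Alice may go off call."
    writes1 = {}
    if sum(v == "on_call" for v in snap1.values()) >= 2:
        writes1["alice"] = "off_call"

    # T2: the same validation, for Bob.
    writes2 = {}
    if sum(v == "on_call" for v in snap2.values()) >= 2:
        writes2["bob"] = "off_call"

    # Disjoint write sets -> no SQLITE_CONSTRAINT-style conflict; both commit.
    assert not (writes1.keys() & writes2.keys())
    db.update(writes1)
    db.update(writes2)
    return db

print(run_write_skew())   # nobody is on call: the invariant is violated
```

Note that this is exactly the shape of bug that stored-procedure validation invites: each procedure's check passes in isolation, and only the interleaving breaks the rule.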
As for how much contention there will be, I'm not sure: that's one reason we haven't taken the plunge yet as we sorta want to determine this first. However, given that we have a huge number of users modifying unshared data (eg, our largest group still makes up a tiny fraction of the userbase), I suspect conflicts will be relatively few.
And to be clear: even a little write concurrency goes a long way. If we can even do two writes simultaneously, that doubles our write performance. But I personally suspect the average batch size of non-conflicting transactions will be in the dozens or even hundreds -- meaning I suspect we will achieve so much replication throughput that something else becomes the bottleneck before replication is saturated.
Only one way to find out however!
Still, with a total order, it's not that bad, because it's easy to identify gaps; with your algorithm, however, there's no total order between changesets within or across batches (at least, none that I can see in your algorithm), so you won't have any way of knowing whether there are gaps (at least, not without doing something silly like broadcasting every changeset)! Adding a total order seems like the easiest way to resolve that, but like I said even that case can be tricky. Also, your algorithm doesn't specify that replicas which receive changesets with batch ids with gaps in them wait to apply them until the gap is filled in (which means they have to know when the previous batch is done), or alternately that before a batch with a new id can be sent from the master all replicas must have synchronously applied all changes within the old batch. I think something like that is a requirement here in order for there not to be potential conflicts on your replicas during replication even if the master doesn't crash.
IMO, doing exactly your algorithm but delaying replication or transaction confirmation until a specified amount of time has passed (say 10 ms), then synchronously sending the entire changeset and incrementing the batch id (aka group commit) is a much better idea. You exchange marginally worse latency (and more bursty network usage) for a far less complex update scenario (yes, you're back to having a total ordering on writes at replicas you need to follow, but you can include multiple changesets in the batch). You can also have your master monitor number of writes and only turn on group commit if there's unusually high write volume.
(Also, belatedly reading one of your other comments more carefully, I'm fairly confident commitCount will also not work correctly if changesets can be applied out of order on different replicas, even if there was no data loss, since the same commitCount could include different sets of changesets at any one time).
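The group-commit variant proposed above might look something like this in miniature. The ~10 ms timer is replaced by a simple count-based window, and the network is stubbed out; all names are illustrative:

```python
# Sketch of group commit: buffer writes for a window, then ship them as one
# batch under a single incremented batch id, so replicas apply whole batches
# in id order and never see gaps within a batch.

class GroupCommitter:
    def __init__(self, window_size=3):
        self.window_size = window_size   # stand-in for a ~10 ms flush timer
        self.pending = []
        self.batch_id = 0
        self.shipped = []                # (batch_id, [changesets]) sent out

    def submit(self, changeset):
        self.pending.append(changeset)
        if len(self.pending) >= self.window_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.batch_id += 1
        # One synchronous send per batch; replicas apply batches in id order.
        self.shipped.append((self.batch_id, self.pending))
        self.pending = []

g = GroupCommitter(window_size=2)
for cs in ["w1", "w2", "w3"]:
    g.submit(cs)
g.flush()
print(g.shipped)   # [(1, ['w1', 'w2']), (2, ['w3'])]
```

The tradeoff is exactly as stated: slightly worse latency and burstier network use, in exchange for a total order on batches that makes gap detection trivial.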
Now I do agree that it's tricky to avoid "gaps" in the failure cases. However, every replica keeps a record of the past several million transactions (we aim for 3 days), and every transaction is assigned a unique ID. When a replica starts up, it "synchronizes" down every transaction it is missing, and at this point would "repair" any gaps it somehow acquired before it went down.
Admittedly, the exact details of that part are TBD, but it doesn't strike me as an unresolvable problem on the surface.
(1) In the event of a crash, how does it know which transactions it's missing without a total order on the write transactions?
(2) In the non-failure case, how do you know the transactions in one batch are ordered before the transactions in a subsequent batch (which requires the replica to identify any "gaps" within the previous batch so it can wait for them to come in before continuing on)?
I don't think any of this is unresolvable (just adding a total order would go a long way). I do think it's very tricky to get right, with lots of edge cases, and that if you can't efficiently identify missing transactions it potentially makes recovery unusably slow.
In any case, since I believe you are losing serializability with multiple writers, you should probably weigh that against any performance gains you get from multiple writers (I think even the group commit variant suffers from this problem).
This is difficult to reconcile with:
> - For the highest performance, you can designate a transaction as "asynchronous" and the master will commit immediately
because if the leader crashes, a replica becomes leader and starts accepting writes, then the old leader recovers as a replica, without something like an epoch number it won't be able to tell that it has commits that the current leader doesn't (using a unique incrementing transaction number based on just a counter at the leader won't work, because it won't necessarily be unique across leader elections thanks to the asynchronous commits).
Furthermore, every commit has a running SHA hash of all prior commits (and every node keeps a history of the last few million commits). This way any two nodes can compare their journals to make sure they agree -- and if there is any split, then the cluster kicks that node out.
Basically, there is no scenario in which a node that commits a different transaction (or a transaction in a different order) is allowed to remain in the cluster.
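A hash-chained journal of this sort is easy to sketch. The actual bytes Bedrock hashes will differ; this just shows why matching head hashes imply matching histories:

```python
import hashlib

# Sketch of a running commit hash: each journal entry's hash covers the
# previous hash plus the transaction, so two nodes agree on their entire
# history if and only if their latest hashes match.

def chain(transactions):
    h = b""
    journal = []
    for txn in transactions:
        h = hashlib.sha256(h + txn.encode()).digest()
        journal.append((txn, h.hex()))
    return journal

a = chain(["INSERT 1", "INSERT 2", "INSERT 3"])
b = chain(["INSERT 1", "INSERT 2", "INSERT 3"])
c = chain(["INSERT 1", "INSERT 9", "INSERT 3"])

assert a[-1][1] == b[-1][1]   # identical histories -> identical running hash
assert a[-1][1] != c[-1][1]   # one divergent commit poisons every later hash
```

Comparing only the newest entry is enough to detect a node that committed a different transaction, or the same transactions in a different order, anywhere in its retained history.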
Really looking forward to playing with this. Cool stuff.
But yes, lots more details to add over time. Thanks for asking!
- Is it correct to say that Bedrock is primarily a replication layer for SQLite? Most things your homepage highlights seem to be SQLite features, not Bedrock features.
- Elsewhere in this thread you mention you run Paxos (what implementation are you using? Your own? Do you have a test we can run to verify correctness?) But then you mention you sacrifice the "C" in the CAP theorem, which would defeat the point of having Paxos. What happens in a split-brain scenario, where you have, let's say, 6 peers and there is a network partition straight down the middle?
- You also mention elsewhere in this thread that if you are talking to the same node (a sticky session) there are no consistency issues. Do you mean linearizable? Because consistency has to do with multi-machine data consistency. For instance, if a write happens on another peer, then the node you are talking to should return that write, not a stale one.
- Is there anything other than just assuming it is run on SSD that make it faster than SQLite on SSD?
Re: Paxos -- It's our own implementation. Split brain is prevented by the master refusing to stand up unless a majority of configured (not just active) slaves approve its standup request. So in a 6 node deployment, 1 master and 5 slaves, 3 of the slaves would need to approve. This means one "half" would have 4 nodes, and the other would have 2 -- and the half with 2 would recognize it doesn't have quorum and thus stay idle. In a scenario where there is a split down the middle, nobody would do anything because nobody has quorum. (This is precisely why you shouldn't deploy in just 2 datacenters -- three is the magic number.)
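The standup rule described above reduces to a one-line majority check. This is a sketch of the stated policy, not Bedrock's implementation: approvals are counted against the *configured* peer count, not the currently reachable one, so a partitioned minority can never stand up:

```cpp
// Hypothetical sketch: a node may stand up as master only if approvals
// exceed half of the configured (not merely active) peers.
bool canStandUp(int configuredPeers, int approvals) {
    return approvals * 2 > configuredPeers;
}
```

In the 6-node example (a candidate master plus 5 configured peers), 3 approvals is quorum and 2 is not, so the smaller side of the partition stays idle.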
Re: Sticky sessions -- True multi-machine consistency is impossible to guarantee and has no real practical application. The only consistency that matters is from the perspective of an observer, ensuring that it always sees a world whose time arrow progresses linearly forward. If that is what "linearizable" means, then perhaps. Sorry, my terminology could be off. Thanks for the correction!
Re: SQLite on SSD -- Sorry, I didn't mean to claim Bedrock was somehow faster than SQLite. Indeed, Bedrock adds overhead to SQLite (to do all the networking and such). My main claim is that Bedrock is faster than other databases, in particular when using C++ stored procedures for complex operations.
Thanks for the questions!
The consensus algorithm is only used to elect a master -- once the master is identified, it coordinates all the distributed transactions. If the master dies, everyone who remains elects a new master and re-escalates all unprocessed write transactions to the new master.
Read transactions are processed locally by each node, so only writes need to be escalated.
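The routing rule in the two paragraphs above is simple enough to state as code. A minimal sketch (my formulation, not Bedrock's API): reads are always served locally, and only writes on non-master nodes are escalated:

```cpp
// Hypothetical sketch of where a request is executed.
enum class Target { Local, Master };

Target route(bool isWrite, bool iAmMaster) {
    if (!isWrite) return Target::Local;               // reads never leave the node
    return iAmMaster ? Target::Local : Target::Master; // writes escalate to master
}
```

On master failure, any writes still in flight are simply re-escalated to whichever node wins the next election.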
Do you use the Issue Tracker to keep on top of things like that, and/or prioritise?
Stored procedures: the new standard for modern application design?
Your webserver is the first thing to be hacked by an attacker because it's the server that sits on the internet. If all your security logic is built on your webserver, an attacker can easily bypass it all. However, if your security logic is built into your database, then it dramatically limits the damage a hacker can do from the webserver.
In particular, I would recommend you create an "Authenticate" stored procedure that accepts a username and password, returning an "authToken". Then make all your other functions into similar stored procedures, each of which accepts and verifies an "authToken" before returning the results.
(And in the case of Bedrock, you might also disable the Bedrock::DB plugin entirely to prevent direct access to your database from the webserver, forcing everybody to go through your stored procedures.)
This design means attackers who root your webserver can't randomly access your database without knowing the username/password (or valid authToken) -- all of which is neatly and securely contained inside the database itself.
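The "Authenticate plus authToken" pattern described above can be sketched as follows. All names here are invented for illustration (this is not the Bedrock plugin API), and a real implementation would hash passwords and generate cryptographically random tokens:

```cpp
#include <map>
#include <string>

// Hypothetical sketch: authentication lives inside the database layer, and
// every other "stored procedure" verifies the token it is handed before
// touching any data.
class SecureDB {
    std::map<std::string, std::string> users_{{"alice", "s3cret"}};
    std::map<std::string, std::string> tokens_;  // authToken -> username
    int nextToken_ = 0;

public:
    // Returns an authToken on success, or "" on failure.
    std::string Authenticate(const std::string& user, const std::string& pass) {
        auto it = users_.find(user);
        if (it == users_.end() || it->second != pass) return "";
        std::string token = "tok" + std::to_string(nextToken_++);
        tokens_[token] = user;
        return token;
    }

    // Every other procedure verifies the token first.
    std::string GetAccountBalance(const std::string& authToken) {
        if (!tokens_.count(authToken)) return "DENIED";
        return "balance for " + tokens_[authToken];
    }
};
```

A compromised webserver holding no valid credentials can call these procedures all day and get nothing but "DENIED".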
To be clear, if you do keep your SQL on your webserver (as most applications honestly do, though I believe it's a mistake), then the Bedrock::DB plugin is perfect for you. But I think Bedrock::DB (or really, any direct access to a database outside of stored procedures) should be largely viewed as a programming and maintenance convenience, at the expense of security.
Granted, a simpler configuration is to just have two load balancers -- both of which have all nodes -- and then use each as the primary/secondary. We did this for years and it was fine, but for just a tiny bit more configuration overhead you can bypass the load balancer for the primary and thus shave off those precious milliseconds from each request.
You clearly state that your ideal model is to develop plugins that are embedded in the DB to execute core business functionality. However, the website only has tutorial documentation for the SQL interface.
Is the code the only guide for now?
If I'm not mistaken, creating an index is a blocking operation in SQLite (other writers are blocked). How do you manage this in production?
How do you create a stored procedure? Do you have to recompile the whole program? How do you deploy with zero downtime?
As for the stored procedures, yes they are written in C++, and thus deploying them requires compiling and upgrading the server itself. However, we have 6 of them (and honestly, any one of which has enough read capacity to satisfy about all our traffic) so a minute of downtime to upgrade for each independently isn't a problem.
The result is zero downtime as perceived by the user, even though each server has occasional downtime for maintenance and upgrades. In general I'd say we upgrade the database about weekly, or more frequently depending on how active we are in the stored procedures.
What do you do during index creation: Do you drop write requests on the floor? Do you store them in a persistent queue somewhere? Or do you remove a node from the cluster, create the index, re-add the node to the cluster, and repeat this on every other node?
I'm asking because at some point I thought of using SQLite, embedded in a process containing the business logic and the network layer, but I abandoned this idea specifically because of blocking operations like index creation, and decided to use PostgreSQL instead.
> the sqlite team is amazing and has all sorts of write-concurrency tricks up their sleeve to allow for creating indexes in a parallel thread
I'd be very interested in any link related to this new feature of SQLite.
> However, we have 6 of them (and honestly, any one of which has enough read capacity to satisfy about all our traffic) so a minute of downtime to upgrade for each independently isn't a problem.
The last of those -- for small indexes, we just replicate them out like normal queries. For large indexes, we take that node down and add the index offline.
Correct, rolling upgrades. It's worked well to date, but the idea of putting the plugin into a dynamically loaded library is really interesting. Our plugin system is relatively new (only the past few months) so it hasn't really been considered. Great idea!
If you have any link about "SQLite write-concurrency tricks to allow for creating indexes in a parallel thread", please share ;-)
And how does security work?
Security isn't provided by Bedrock itself (or really, by any database, at least not well). Rather, with Bedrock, security is the responsibility of the application layer, but it can be built into the database via stored procedures.