There are also a number of us here to answer any questions about this Jepsen work or about VoltDB generally.
It's interesting to read about some of the relatively "unusual" choices VoltDB has made as well. Like this one:
> network partitions cause minority nodes to shut down permanently; operator intervention is required to restore full service
...combined with requiring majorities of the current cluster, rather than the original cluster, to continue. For human operational interactions, this is really interesting. There's a long history of self-healing systems getting a little carried away and masking, or sometimes even triggering, bigger issues. Making operator sign-off a required step before nodes rejoin seems to make a whole category of "flapping" problems a non-issue. Kudos for making these "unusual" choices; this is a great horizon to explore :)
Personally, when a node fails, I still want to try to understand why before I take the next operational step.
Thanks for your words, BTW.
For 100% read workloads, the impact on maximum throughput can be significant. That's a pretty uncommon workload for us, though. It's likely a single-digit-percentage problem at 50% reads.
Two nice things about VoltDB users: 1) They are often very write-heavy; 99% read-write transactions isn't uncommon. 2) Few are using anywhere near the maximum throughput of their machines. Most size their clusters for appropriate storage and redundancy, not throughput.
The lost write issues weren't optimizations, just implementation bugs. Sigh.
First, Raft wasn't a thing when we started.
Second, the consensus algorithm we use today handles cluster membership, schema changes, and other agreement purposes, but not the main transactional path. (In versions before 3.0, it did.)
At the time, we wanted to make really different tradeoffs from something like Paxos. Namely, we wanted to be able to run with two copies, rather than the three that Paxos/Raft effectively require (a majority of two is two, so a two-node quorum system can't tolerate the loss of either node; three is the practical minimum for fault tolerance). We also wanted crazy throughput and low performance variation, which are hard things to get with Paxos/Raft. The main tradeoff is that our system doesn't work well across high-latency links, which is ok with us.
Ultimately, in 3.0, we added a second-layer protocol for transactions, which made the transaction path independent of clock skew. This layer actually looks a lot like Raft, but differs in some key ways.
We may even someday switch to Raft for the core agreement stuff, but there's a lot of work to be done for that to happen.
This link has more info about how transactions work in VoltDB:
Kyle Kingsbury started something like this (in Clojure) as part of this project (https://github.com/jepsen-io/voltdb/tree/master/replication-...), but it's a large task and would probably take more than a few days.
We have done some initial work on TLA+, but it may be a while to finish anything. If someone wants to help, there's probably a publishable paper in it.
I mean, I can use a Postgres or MSSQL cluster for SQL needs, and/or Riak (or similar) for KV, and/or Cassandra (or similar) for column-indexed storage, and so on. Why should I look at VoltDB?
But my more typical answer is the combination of throughput and transactional logic; no other system does both as well as VoltDB. This comes up a lot in policy enforcement, fraud detection, online gaming, ad-tech, billing support, and more.
Here are some blog posts that might help:
This video's not bad:
Why can't a sequence of statements between a begin and end be transparently treated as though they were the body of a stored procedure?
In practice, this isn't always all that useful without a real stored-procedure language: much of the value comes from shipping the whole transaction's logic, including conditionals and loops, to the server in one shot, and a plain begin/end block of SQL can't express that.
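For contrast, here's roughly what a real VoltDB stored procedure looks like in Java. The table and column names are invented for illustration; the point is that the whole transaction ships to the server as one unit and runs as a single serializable transaction:

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Moves funds between two accounts. The entire run() method executes as
// one serializable transaction inside the cluster.
public class Transfer extends VoltProcedure {

    public final SQLStmt debit = new SQLStmt(
        "UPDATE accounts SET balance = balance - ? WHERE id = ?;");
    public final SQLStmt credit = new SQLStmt(
        "UPDATE accounts SET balance = balance + ? WHERE id = ?;");

    public VoltTable[] run(long fromId, long toId, long amount)
            throws VoltAbortException {
        voltQueueSQL(debit, amount, fromId);
        voltQueueSQL(credit, amount, toId);
        // Executes the queued statements; throwing VoltAbortException
        // anywhere in run() rolls the whole transaction back.
        return voltExecuteSQL(true);
    }
}
```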
This is the most effective explanation (for me, anyway) of the difference between "serializable" and "linearizable" of any of aphyr's blogs so far. They've been keywords in that little cladistic tree of consistency models he draws for a while now, but with this explanation and the examples, I finally grok what they mean.
I'm not sure what you mean by "within the start/end time of the partition executing the transaction". Nothing in the informal definition I provided mentioned partitions, or even process-local orders (that'd be sequential consistency) so ... yeah, I dunno where you got this from. I agree that linearizability is a global real-time constraint, and I use that sense in the post and in the Knossos verifier.
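For the curious, here's a toy sketch of the kind of search a linearizability checker performs for a single register: find some total order of the operations that respects both real-time precedence and register semantics. This is nothing like Knossos itself (which is Clojure and far more sophisticated); it's an exponential brute-force search that assumes every operation in the history completed:

```java
import java.util.List;

// Toy linearizability checker for a single register. Assumes all
// operations completed (real checkers must also handle crashed or
// pending operations). Exponential search: fine only for tiny histories.
public class ToyLinCheck {

    public record Op(boolean isWrite, long value, long invoke, long complete) {}

    // True iff some ordering of ops respects real-time precedence
    // (a before b whenever a.complete < b.invoke) and register semantics
    // (every read returns the most recently written value).
    public static boolean linearizable(List<Op> ops, long initial) {
        return search(ops, new boolean[ops.size()], initial, ops.size());
    }

    private static boolean search(List<Op> ops, boolean[] used, long reg, int left) {
        if (left == 0) return true;
        for (int i = 0; i < ops.size(); i++) {
            if (used[i]) continue;
            Op op = ops.get(i);
            // Can't linearize op yet if some other unused op must precede it.
            boolean blocked = false;
            for (int j = 0; j < ops.size(); j++) {
                if (j != i && !used[j] && ops.get(j).complete() < op.invoke()) {
                    blocked = true;
                    break;
                }
            }
            if (blocked) continue;
            // A read must observe the register's current value.
            if (!op.isWrite() && op.value() != reg) continue;
            used[i] = true;
            long next = op.isWrite() ? op.value() : reg;
            if (search(ops, used, next, left - 1)) return true;
            used[i] = false;
        }
        return false;
    }
}
```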
Most distributed databases (from what I can tell, including VoltDB) only guarantee that for operations which read/write the same keys.
You raise an interesting question: if I verify only that operations on a single key are linearizable, have I also verified that operations on systems of independent keys are linearizable? The answer, as far as I know, is yes: linearizability is a local (or "composable") property. From Herlihy & Wing (https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf):
> Unlike alternative correctness conditions such as sequential consistency or serializability, linearizability is a local property: a system is linearizable if each individual object is linearizable. Locality enhances modularity and concurrency, since objects can be implemented and verified independently, and run-time scheduling can be completely decentralized.
This is a commonly cited property in the literature, and has been proven several ways--for instance, see Lin's recent constructive proof (http://arxiv.org/abs/1412.8324). There is research showing linearizable systems vary in the probability distribution of outcomes (https://arxiv.org/pdf/1103.4690.pdf), but this does not affect safety.
However, your comment led me to Charron-Bost & Cori 2003 (http://www.worldscientific.com/doi/abs/10.1142/S012962640300...), whose abstract claims a counterexample: a system of two linearizable objects whose composition is not linearizable. I haven't found the full text yet, and I'm not familiar with their sense of "The Global Time Axiom", so it's possible their finding is still consistent with "linearizability is composable". Not sure.
In any case, the multi-key tests in this analysis do perform single-key transactions (as well as multi-key transactions), and verify that their composed system is fully linearizable. Because the state space for composed systems is larger, these tests aren't as good at finding bugs--but if composability turns out not to hold, I can use this strategy more often.
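Concretely, locality means a checker can run something like the toy search sketched earlier on each key's history independently and conclude the composed system is linearizable. A hypothetical driver (building the per-key map from a raw history is left out):

```java
import java.util.List;
import java.util.Map;

// Assumes ToyLinCheck from the earlier sketch is in the same package.
public class LocalityDriver {
    // byKey maps each key to the completed operations against it.
    // By Herlihy & Wing's locality result, the composed system is
    // linearizable iff every per-key history is.
    public static boolean systemLinearizable(
            Map<String, List<ToyLinCheck.Op>> byKey, long initial) {
        return byKey.values().stream()
                    .allMatch(ops -> ToyLinCheck.linearizable(ops, initial));
    }
}
```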
But in the presence of side channels, you need a truly global clock (like Spanner's) to achieve strict serializability.
As I understand it, Spanner's global clocks are a performance optimization, not a correctness condition. If linearizability required a global clock, Zookeeper (http://static.cs.brown.edu/courses/cs227/archives/2012/paper...) and Raft (https://raft.github.io/raft.pdf) wouldn't be able to provide linearizable semantics. It is, of course, possible that these papers are wrong, in which case I encourage you to publish!
Let me know if I don't have that right.
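In the meantime, here's a deliberately stripped-down sketch of where clock-free linearizability comes from: funnel every operation through a single serialization point. Raft and ZooKeeper achieve this across machines with elections, log replication, and commit rules; none of that machinery appears in this toy:

```java
// Toy single-leader register: linearizability comes from ordering all
// operations through one serialization point (the leader's lock / log),
// not from synchronized clocks. No wall-clock time appears anywhere.
public class LeaderRegister {
    private long value;
    private long logIndex; // position of the last applied operation

    // Writes are totally ordered by the leader; each gets a log position.
    public synchronized long write(long v) {
        value = v;
        return ++logIndex;
    }

    // Reads pass through the same point, so each read observes the state
    // as of a definite log position.
    public synchronized long read() {
        return value;
    }
}
```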
It's a good fit any time you're considering storing a bunch of data in-memory for performance reasons. (Like telecom routing info.) Instead of writing a custom daemon, just spit it into VoltDB. Get replication, performance, etc. for free! Very neat.
Sadly, the open source version isn't very ACID: they dropped the D from the community edition. So you can scale out, but if any node dies, you're toast. There are still some uses: running a transient or easily rebuildable dataset, or manually running multiple full nodes (though I guess you'd then need to implement cluster failover yourself).
I guess it shows that it's hard to make a living off of open-source products if they're really great. I've heard this from other open-source companies: the product's fantastic, but no one pays. So instead, offer a taste as open source (basically a demo/trial) and get users to upgrade to commercial.
Version 6.4 includes fixes for all the issues discussed here: stale reads, dirty reads, lost updates (due to both partition-detection races and invalid recovery plans), and read-only transaction reordering are all fixed, plus several incidental bugs the VoltDB team identified. After 6.4, VoltDB plans to introduce per-session and per-request isolation levels for users who prefer weaker isolation guarantees in exchange for improved latency.

VoltDB's pre-6.4 development builds have now passed all the original Jepsen tests, as well as more aggressive elaborations on their themes. Version 6.4 appears to provide strict serializability: the strongest safety invariant of any system we've tested thus far. This is not a guarantee of correctness: Jepsen can only demonstrate faults, not their absence. However, I am confident that the scenarios we identified in these tests have been resolved. VoltDB has also expanded their internal test suite to replicate Jepsen's findings, which should help prevent regressions.
I read the top part and thought 'oh, another system that fails to meet their claims' but it's pretty impressive that they did the work to fix it. Nice job.