"For each batch of parallel transactions, they are inserted into a distributed, write-ahead transaction log"
"Replicas must achieve consensus for how to insert new transactions into the log. FaunaDB uses an optimized Raft implementation to achieve consensus."
There are constraints on running consensus (Raft) across the world: it adds at least 200 ms to serialize transactions. Higher latency also means a longer interval between heartbeats, and hence longer downtime if the leader is isolated, a known issue of leader-based consensus protocols (see the "There Is More Consensus in Egalitarian Parliaments" paper or the "In search of a simple consensus algorithm" post).
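As a rough illustration of the latency floor: a Raft leader can only commit once a majority of the cluster acknowledges, so geo-replicated commit latency is roughly one round trip to the median follower. A toy sketch with made-up RTT numbers, not measurements of any real deployment:

```python
# Illustrative sketch: a Raft leader commits once a majority of replicas
# acknowledge, so geo-replicated commit latency is roughly the round-trip
# time to the (majority-1)-th nearest follower. RTTs below are invented.

def raft_commit_latency_ms(rtts_to_followers):
    """Leader plus followers; commit needs acks from a majority of the cluster."""
    n = len(rtts_to_followers) + 1          # cluster size including leader
    majority = n // 2 + 1
    acks_needed = majority - 1              # the leader counts as one vote
    return sorted(rtts_to_followers)[acks_needed - 1]

# 5-node cluster spread across continents (hypothetical topology)
rtts = [2, 80, 150, 220]                    # RTT in ms from leader to each follower
print(raft_commit_latency_ms(rtts))         # one wide-area round trip: 80 ms
```

The same math is why heartbeat intervals (and therefore election timeouts) must grow with geographic spread: a timeout shorter than the slowest necessary round trip causes spurious elections.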
Google's Percolator doesn't depend on global consensus, just on a global TSO (timestamp oracle), which can be implemented in a way that:
- doesn't suffer from leader isolation (there is no leader)
- doesn't have a bottleneck (each node handles requests)
- doesn't touch disk on each request
Details are in the "Quorum clock: leaderless distributed clock" post.
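A minimal sketch of the idea (my own toy model of an ABD-style leaderless clock, not the actual implementation from the post): each node holds a plain in-memory counter, and a timestamp request reads a majority, bumps past the maximum seen, then pushes the result back to a majority. Because any two majorities intersect, successive timestamps are ordered with no leader and no disk write:

```python
import random

class ClockNode:
    """One member of a leaderless 'quorum clock': just a monotonic counter."""
    def __init__(self):
        self.ts = 0
    def read(self):
        return self.ts
    def advance(self, ts):
        self.ts = max(self.ts, ts)
        return self.ts

def next_timestamp(nodes, quorum):
    """ABD-style timestamp fetch: read a majority, bump past the max seen,
    then push the new value back to a majority so later calls observe it."""
    sample = random.sample(nodes, quorum)
    ts = max(n.read() for n in sample) + 1
    for n in random.sample(nodes, quorum):
        n.advance(ts)
    return ts

nodes = [ClockNode() for _ in range(5)]
q = len(nodes) // 2 + 1                      # any two quorums intersect
a = next_timestamp(nodes, q)
b = next_timestamp(nodes, q)
assert b > a                                 # sequential requests are ordered
```

Any node can serve a request, so there is no leader to isolate and no single bottleneck; the cost is one quorum round trip per timestamp.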
> FaunaDB is an elegant, software-only solution for achieving global ACID transactions, with complete guarantees of serializability and consistency.
The paper makes it sound like FaunaDB claims strict serializability for all transactions (like Spanner). This means that if Txn(A) ends before Txn(B) begins, then Txn(B) must be guaranteed to see all data written by Txn(A).
However, what if Txn(B) is a read-only transaction? According to the paper, read-only transactions do not contact the global sequencer. Here is an example where an application would fail to read its own writes:
1. Txn(A) performs a write W(X) of data on replica R1.
2. Txn(A) is assigned a logical timestamp of T1 by the global log.
3. Conflict verification is triggered. Replica R1 plays back the global log and verifies there is no conflict.
4. Txn(A) completes, since one replica has verified there are no conflicts, and passes control back to the application.
5. The application now triggers a read-only Txn(B).
6. The coordinator for Txn(B) decides to read replica R2 rather than R1. However, the coordinator is not yet aware that the global log is now at timestamp T1, and it picks T0 as its snapshot timestamp.
7. It reads the value of X on replica R2, which does not yet reflect W(X).
-- From the application's point of view, it did not read its own writes. --
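The seven steps above can be made concrete with a toy model (purely hypothetical, just to show the anomaly mechanically):

```python
# Toy model of the scenario above: two replicas apply the global log
# independently, and a read-only transaction picks its snapshot from
# whatever its local replica has applied so far.

class Replica:
    def __init__(self):
        self.applied_ts = 0          # highest log position applied locally
        self.data = {}
    def apply(self, ts, key, value):
        self.data[key] = value
        self.applied_ts = ts
    def read(self, key):
        # snapshot = whatever this replica knows; no global coordination
        return self.applied_ts, self.data.get(key)

r1, r2 = Replica(), Replica()

# Txn(A): write W(X) sequenced at T1; only R1 has replayed the log so far.
r1.apply(1, "X", "new")
# Txn(A) acks to the app after one replica verifies -- R2 is still at T0.

# Txn(B): read-only, routed to R2; snapshot chosen locally as T0.
snapshot_ts, value = r2.read("X")
assert snapshot_ts == 0 and value is None    # the app misses its own write
```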
I'm not familiar enough with FaunaDB's subtleties to know if this scenario is possible in practice. Perhaps you could comment?
I did notice that at the end of the article, the language is carefully phrased to only make the claim of "serializable" (not "strictly serializable") for read-only transactions. But that would fall short of the guarantees that Spanner and Calvin make, and undermine the "complete guarantees of serializability and consistency".
Causal tokens are practical from a performance point of view, but do require application awareness and changes, which presents a different set of issues. Most application developers would not appreciate the subtleties and just use the default behavior, which does not guarantee "read your own writes". And causal tokens don't prevent the more general case of stale reads.
This would mean that in the default FaunaDB configuration, applications could see anomalies:
1. Inability to read own writes - described above, with the additional clarification that it can't be fully solved at the driver level. For example, say I call a web API that creates a blog post and returns a PostID. I then call another API that edits that post, passing it the PostID I got back from the first. If FaunaDB is in use, the second call could fail to find the PostID, because I happened to hit a different replica that did not yet have it.
2. Stale reads - could be arbitrary levels of staleness if replica/coordinator haven't been able to sync to the global log. If a coordinator node and a replica were stuck in their own little partition, cut off from the rest of the cluster, they'd still happily serve stale reads (until perhaps they detect a partition and shut down?).
These read anomalies aren't usually considered compatible with a consistent system, but instead are hallmarks of an eventually consistent system. It seems misleading to make an unqualified 100% consistency claim when it doesn't apply to read-only txns by default, and when the fully consistent mode is not practical to use in global deployments.
Let me try to explain more clearly and you tell me what you would call it. It is subtle because there are three coordination points, the log, the replica, and the client.
The log is marching along, moving time itself forward. Every node within a replica is following the log monotonically for a disjoint set of partitions. Every client process is querying an arbitrary and changing set of nodes within a replica, which is normally a datacenter.
Every node in every replica maintains a highwater mark (the causal token) of the last transaction time it has applied. Every client does the same, and keeps a highwater mark of the last transaction time (read or write) it has seen. If it was a write, the highwater mark came from the first replica to apply the transaction from the log. If it was a read, it came from the local replica, based on the last transaction time that node has personally applied.
These transaction times are passed back and forth with every request, from client to replica and back, automatically by the Fauna drivers. Thus, a replica node will block if a client performs a read-only transaction presenting a transaction time which it has not yet replicated from the log. And if a client receives a later time from a replica or from the log, it will update its local highwater mark so that its view of time (and thus its key set) can never backtrack.
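A sketch of what I understand the driver behavior to be (assumed semantics, not Fauna's actual code): the client carries a causal token with every request, and a replica that hasn't yet replayed the log past that token must block, modeled here as a raised timeout:

```python
# Hypothetical sketch of the driver-level highwater mark: every response
# carries the replica's last-applied transaction time, the client keeps
# the max seen, and a replica refuses (here: raises) reads below the
# client's token rather than serve a snapshot older than the client's view.

class ReplicaNode:
    def __init__(self):
        self.applied_ts = 0
        self.data = {}
    def serve_read(self, key, client_ts):
        if self.applied_ts < client_ts:
            raise TimeoutError("blocking until log replays past client token")
        return self.applied_ts, self.data.get(key)

class Client:
    def __init__(self):
        self.highwater = 0                       # the causal token
    def read(self, node, key):
        ts, value = node.serve_read(key, self.highwater)
        self.highwater = max(self.highwater, ts)  # time never backtracks
        return value

fresh, stale = ReplicaNode(), ReplicaNode()
fresh.data["X"] = "new"
fresh.applied_ts = 1

c = Client()
assert c.read(fresh, "X") == "new"               # advances highwater to 1
try:
    c.read(stale, "X")                            # a lagging replica must block
except TimeoutError:
    pass
```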
1a. If the client process is long-lived and maintains the highwater mark, then it is guaranteed to read its own writes and maintains serializability (but not strict serializability) for all keys on all partitions that own any keys it has queried via log write or local read. Even if the write acknowledgement comes from another replica, the highwater mark is still advanced and any local read partitions will block until they have replicated beyond that point. A stateful webserver would maintain the highwater mark internally across requests.
2. Stale reads from partitioned nodes are no different than increased client-server latency, because if the client can write, it will receive a higher transaction time, and its subsequent reads will block and timeout, preserving the serial order of operations. If it doesn't write, as far as it is aware, time is just running a little slow, and the data will be stale from an external consistency perspective, but order will be preserved. This is a violation of strict serializability but not of serializability. In this case if the client switches datacenters, it will read or write and get an updated transaction highwater mark, and will no longer be able to successfully read from the partitioned node because it knows about the future now.
I should note that Spanner-derived systems typically offer some variant of serializability or worse for read-only transactions since nobody wants to wait around for the clock ambiguity window to expire. Spanner itself can wait, but it still comes at a cost, and Spanner goes out of its way to offer bounded staleness reads that are low latency but do not meet any serializability guarantee at all, as far as I can tell.
Serializable tends to sound really good to database people, because it's the "highest" isolation level available in most databases. However, those databases are built for a single machine, and they really offer "strict serializable" without calling it that, because they maintain a single log which serializes all writes, and all reads operate as of the latest time. With the advent of distributed databases, the "strict" part is becoming more relevant, because it's no longer something you just get without realizing it.
I gave an example of the "blog post" anomaly that occurs in a system that only guarantees serializability. As another example, imagine I am using FaunaDB to store bank account balances. As the developer, I design a system where any time a customer receives money to their account, I send them a text giving the amount and their new balance, along with a link to the website where they can get more details. Except, when they click the link and go to the website, they don't see any change to their balance. Furthermore, when they refresh the page, sometimes they do see their new balance, and then they refresh again, and it disappears!
More investigation reveals that FaunaDB is in a partitioned state, and was serving stale bank account balances from some replicas. The replicas were behind a load balancer, and sometimes the web client hit one replica, and sometimes another, which explains the alternating balances. Is it possible to solve this problem using clever application logic, causal tokens, or saying "don't do that"? Sure. But a consistent database is supposed to prevent this scenario, not the application. Also, note that the "this is no different than increased client-server latency" argument does not hold here. People expect that if they finish action A (like writing to the DB), then do action B (like clicking a web link that reads the DB), that B will always see the results of A. The stale bank account balance isn't a delayed value; it's a wrong value.
You're correct that most systems allow some variant of "secondary" or "follower" reads that can return stale data, in exchange for better performance. I'm all for this kind of feature, as long as the developer understands the tradeoff they're making. Moreover, I think it's appropriate for FaunaDB to offer the same. But I'd advise against making unqualified claims of "complete guarantees of serializability and consistency" and the "latency profile of an eventually-consistent system like Apache Cassandra". FaunaDB is able to offer one or the other, but not both at once.
I think it's a stretch to say that including a single timestamp in the link is too hard for developers. It happens by default across queries within a process with the Fauna drivers, so the database is preventing it; the only place you have to think about it is multi-process, multi-location reads. Also, having a load balancer jump back and forth between two replicas would not normally occur because of geo routing, a soft form of datacenter affinity, but it is possible. Note that in your example, even if the link didn't include the timestamp, once they visit the webpage, the webserver should begin managing the timestamp in process or in a session cookie, preventing the balance from disappearing once seen for the first time.
In my experience as a DBA the legacy database experience is much worse because in practice there are always physically distinct read replicas in the mix for scalability and hot failover that have no consistency guarantees at all, due to various faults like statement-based replication desynchronization, the lack of any ability to share log cursor information, and the like. I am not making a worse is better argument. I'm just pointing out that you don't get any C at all for the feeble amount of A gained under CAP in practice with most “single node” systems.
You are correct that the latency argument does not hold if you do not pass the timestamp information across queries. But if the observer can observe multiple replicas in the database topology, it is always possible to use the observer itself as the communication channel for the timestamp to avoid the anomaly. If you want to maintain strict serializability in all cases, but minimum-latency reads, the writer could block on transaction application acknowledgement from all replicas. Obviously that gives up availability for writes under partition, but it may be reasonable in your asynchronous example in order to avoid sending the text message too early. Perhaps, for example, the message is "go to your local ATM" instead of a link with a timestamp. Although I would expect the ATM has the latency budget to do a linearized read even in a global context.
We are always happy to make the descriptions of the behavior and consistency levels more precise.
Why does someone want this functionality? Do you expect, say, Oracle to work this way? I honestly don't see this as a useful feature in a database that's operating at scale, but I'm open to hearing reasons for needing this.
FaunaDB (like other highly available databases) is set up to have multiple replicas of the same data. You can direct reads to any of the replicas. For example if replica #1 is unavailable due to a network outage, you can read replica #2 instead.
This thread is about the behavior of FaunaDB, which allows reads to replicas that have older stale values (i.e. they're not up-to-date with latest data). This causes the various consistency anomalies that I've been describing. The reason FaunaDB works like this is to make reads faster and to better distribute read load. Allowing secondary/follower/slave reads like this is a common tactic for increasing read performance.
Does that answer your question?
Sharding is complex and makes parallel queries a requirement but it really does seem like the only way to distribute a database. In the end, once you have the op log on more than one machine, they're going to have to synchronize with each other, otherwise you risk giving out not just stale answers, but completely wrong answers. This is fine for a site like FB but is not OK if it's attached to military equipment.
CRDTs don't help in the case that you don't want to deliver wrong answers; they only help make sure merges (at sync time) are resolvable without conflicts. Lamport clocks only give you a partial order, and you need something else (physical clocks) for total order, which is what you get from Raft (because you only let one node manage the log in the first place).
What I haven't seen yet/want to try is to see if sharded + replicated Raft could introduce some interesting performance benefits. It's basically max complexity (sharding plus multiple instances of Raft at the same time), but it could be the only way to distribute write load (smart routing/forwarding of requests) and increase availability (eventually consistent replication for every node of every shard) at the same time. This is basically what Fauna does, but I'm not 100% sure about splitting a single table across replicas...
Their write performance is bounded by the more limited number of transactors, but they still achieve fairly large write scale.
> CRDTs don't help in the case that you don't want to deliver wrong answers
Usually when people talk about CRDTs they assume that a client talks to a single node and a background process replicates data to the other nodes. Before this replication is done, if a client contacts another node it gets "wrong answers". But nothing prevents a client from always talking to a quorum of nodes and doing the replication on its own; in that case we always get only right answers :)
However, we still need to merge if we do quorum-ing. Paxos/Raft-based systems basically take an optimistic lock and then quorum (see the prepare phase), so they suffer from contention and need a leader to serialize requests (the leader in Paxos/Raft is for liveness, not safety). But with CRDTs we can always merge, so concurrent requests are not a problem and we can achieve a better level of parallelism (request scalability).
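To make the "we can always merge" point concrete, here is a minimal state-based CRDT (a grow-only counter). The merge is commutative, associative, and idempotent, so concurrent updates on different replicas never conflict and need no leader to serialize them:

```python
# A minimal state-based CRDT (G-Counter): each node id maps to its own
# increment count, and merging two states is a pointwise max, which is
# commutative, associative, and idempotent.

def merge(a, b):
    """Join of two G-Counter states: pointwise max per node id."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

# Two replicas increment concurrently...
r1 = {"n1": 3}
r2 = {"n2": 5}
# ...and merge in either order to the same result.
assert merge(r1, r2) == merge(r2, r1) == {"n1": 3, "n2": 5}
assert value(merge(r1, r2)) == 8
```

The catch, as discussed above, is that "mergeable" only means the replicas converge; it says nothing about whether an individual read returned a current answer.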
Imagine a write that commits to one replica but is delayed committing to two others. Some quorum reads will see it (but also see a tie with the old value...what breaks the tie? Certainly not a timestamp) and others will not see it at all, indefinitely.
It’s easy to end up in a state where every replica has a different value because of torn writes.
If you and a majority of nodes agree on a value like a CRDT OpSet (more simplistically, just agreeing on the state of the log), how does that not guarantee agreement and serializability? It is impossible from that point on to have another majority of nodes have some other view of what happened. Consensus algorithms are
One copy serializability is exactly what would be achieved by having a read and write quorum. It intuitively makes sense to me (and maybe my intuition is wrong), but if you talk to a majority of nodes and ask "this is what we all have, right?" before every write and every read, you've got a guaranteed consistent (of course, progress isn't guaranteed if partitions/nodes die, etc) state.
AFAIK Quorums are the state of the art (and arguably the only relatively efficient option) as far as achieving serializability in a distributed system...
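For reference, the quorum-intersection property itself is easy to demonstrate (a toy sketch of the Gifford-style voting scheme quoted later in this thread; whether intersection alone yields serializability is exactly what's disputed in the replies):

```python
# Toy sketch of voting-based quorums: with r + w > n, every read quorum
# intersects every write quorum, and version numbers identify the current
# copy. This shows the intersection property, not full serializability.

import itertools

n, r, w = 5, 3, 3                            # r + w = 6 > 5
replicas = list(range(n))
for read_q in itertools.combinations(replicas, r):
    for write_q in itertools.combinations(replicas, w):
        assert set(read_q) & set(write_q)    # non-null intersection, always

def read(copies, read_quorum):
    """copies: replica -> (version, value); return the highest-version value."""
    return max(copies[i] for i in read_quorum)[1]

# A completed write at version 2 reached a write quorum {0, 1, 2}...
copies = {0: (2, "new"), 1: (2, "new"), 2: (2, "new"), 3: (1, "old"), 4: (1, "old")}
# ...so any read quorum overlaps it and recovers the new value.
assert read(copies, (2, 3, 4)) == "new"
```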
You cannot know this, because the transaction replication is racy and not atomic--it may have applied to only one node while you are doing your read. Whether you see it or don't is luck. So you can have the following scenario (and in practice you will):
- TX commit begins
- TX replicated to node A
- Read from coordinator A' begins
- Read sees replica A (has tx) and replica B (does not)
- Read assumes A wins because of some kind of vector clock in the data value (choosing the older value doesn't make things better, just in case you are wondering)
- Read from coordinator B' begins
- Read sees B (no TX) and C (no TX)
- Read completes with stale value--serializability violation has occurred
- TX finishes replicating to B and C
This leaves aside write skew, torn transactions due to partitions, and all kinds of other problems.
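The race runs end to end in a few lines (toy model; "version" stands in for whatever vector clock or timestamp breaks the tie):

```python
# Toy run of the race described above: the write has been applied to A but
# not yet to B or C, and two read quorums disagree -- the second (later)
# read returns the older value, a serializability violation.

replicas = {"A": (1, "tx"), "B": (0, "old"), "C": (0, "old")}  # (version, value)

def quorum_read(names):
    return max(replicas[n] for n in names)[1]   # highest version wins the tie

first = quorum_read(["A", "B"])    # sees the in-flight, unfinished write
second = quorum_read(["B", "C"])   # starts later, misses it entirely
assert first == "tx" and second == "old"        # stale read after a fresh read
```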
We must be talking about different things, because I can't find any literature that has reached the conclusion that serializability is impossible in distributed transactions. Can you point me to that?
Also, do you have any thoughts on the literature that contradicts what you're saying? I'm not an expert, but unless I'm misreading the English, serializability is possible with quorum reads and writes.
> In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of rvotes to read a file, and a write quorum of wvotes to write a file, such that r+w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file's voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.
Has this paper been refuted? In addition to this there's literally the whole section on distributed serializability.
There is no consensus; that requires a leader system. The paper you link appears to require a multi-phase lock; the quorum itself does not guarantee serializability. Explicit preparation via quorum can guarantee serializability (but not strict serializability, I don't think), but cleanup of locks is a big performance problem in practice.
> Replica A shouldn't be reporting an unacknowledged transaction to coordinators -- the write hasn't been quorum acknowledged.
Acknowledged by who? The replicas can't block on the other replicas; they just tell the coordinator when they applied the write.
This is worth a blog post.
> To summarize the overall FaunaDB protocol, each transaction proceeds in three phases:
> 1. The first phase is a speculative phase in which reads are performed as of a recent snapshot, and writes are buffered.
> 2. Next, a consensus protocol is used (Raft) to insert the transaction into a global log, which results in the transaction receiving a global transaction identifier that specifies its equivalent serial order relative to all other transactions that are being processed by the system. This is the only point at which global consensus is required.
> 3. Finally, a check begins in each replica which verifies the speculative work. If that speculative work did not result in potential violations of serializability guarantees, then the work becomes permanent and the buffered writes written back to the database. Otherwise, the transaction is aborted and restarted.
Or if you want a very concise description, see slide 13 from Daniel Abadi's presentation: https://www.slideshare.net/abadid/the-power-of-determinism-i...
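The three phases can be caricatured as optimistic concurrency control over a shared log (my own toy model, not Fauna's code; the consensus step is stubbed out as a plain append):

```python
# Toy OCC model of the three phases quoted above: speculative reads record
# versions, the log append assigns a serial order, and each replica
# re-checks the read versions before making buffered writes permanent.

store = {"x": (0, 10)}        # key -> (version, value)
log = []                      # the global, consensus-ordered log (stubbed)

def speculate():
    """Phase 1: read a recent snapshot, buffer the writes."""
    ver, val = store["x"]
    return {"reads": {"x": ver}, "writes": {"x": val + 1}}

def append_to_log(txn):
    """Phase 2: consensus (stubbed) assigns the serial position."""
    log.append(txn)
    return len(log) - 1

def verify_and_apply(txn, pos):
    """Phase 3: deterministic check; abort if any read version moved."""
    for key, ver in txn["reads"].items():
        if store[key][0] != ver:
            return False                        # conflict -> abort and retry
    for key, val in txn["writes"].items():
        store[key] = (pos + 1, val)
    return True

t = speculate()
assert verify_and_apply(t, append_to_log(t))    # no conflict: commits
assert store["x"] == (1, 11)

stale = {"reads": {"x": 0}, "writes": {"x": 99}}  # speculated before the commit
assert not verify_and_apply(stale, append_to_log(stale))  # aborted
```

Phase 3 is deterministic given the log order, which is why each replica can run it independently without further coordination.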
> ... the consensus protocol is only being used for inserting transactions into a global log. For every other part of the protocol, replicas can proceed completely independently from each other.
Clearly this central resource can't be distributed or replicated. Right? So it, and the hardware it runs on, and the communications links to it, are a single point of failure. True?
If the global log goes down, or can't be reached at all, or perhaps only from some region of the regionized application, what happens?
In any case, there is an open-source in-memory database called NDB (MySQL Cluster) that can get to 100 million write transactions per second. The echo chamber of SV has ignored it, but it's real.
NDB has what I call lock-aware programming. When you read a row, you take either a read_lock or a write_lock, as it only supports the READ_COMMITTED isolation level. But it does not serialize cross-partition transactions. Transaction coordinators run independently, and it is up to the developer to ensure consistency by taking either read or write locks on rows.
There is no single node that serves as a global coordinator. The "coordinator" as referred here is a stateless logical role for each transaction and every node in the cluster can do it on an ad-hoc basis.
Our current performance numbers are much better but we don't have anything published quite yet. The bottleneck, so to speak, is the distributed, partitioned write-ahead log which is implemented in a pipelined version of Raft. It is not materially different than the Paxos or Raft-based rings that support replicas in a Spanner-style leader-per-tablet two-phase system. However it has better long-tail latency because a fixed subset of nodes resolves transactions instead of every leader regardless of location, and it is not subject to write skew across disjoint transactions.
We are well aware of NDB. NDB is not designed for WAN latencies, and like you say, it pushes consensus complexity onto the developer. Certainly it is easier to reach higher theoretical throughput numbers when you do less work on faster hardware. I admire NDB but it doesn't make sense to compare a WAN system with a higher consistency guarantee to a LAN system with a lower guarantee.
Fauna will support multiple consensus regions in the future to uncork aggregate throughput, at the cost of some explicit optimistic consensus across region boundaries. This is still a better story than explicitly locking on a read-committed isolation level which is subject to read and write skew in a way that cannot be architecturally overcome, even with locks.
Don't get me wrong. I think Calvin/FaunaDB is great as an alternative to Spanner for multi-region DBs. But strict serializability is not a goal for all systems, and certainly not for high-performance distributed systems that can provide their own concurrency models. Not just HopsFS, but any system layered on top.
Think of it as a way to scale a single Spanner tablet to support the entire dataset and eliminate the 2-phase commit, as well as the associated write skew or clock ambiguity window.
More important for most applications than theoretical benchmarks is taking a sound approach and using a database you can depend on. Asking your development team to manage isolation is going to introduce more costs than benefits, except in the most specialized of circumstances.
FaunaDB's model is always ACID, so you never have to guess what's in the database, even in the presence of region failures or other anomalies.
Absolutely untrue. Stop lying about other systems.
It is both distributed and replicated, even across datacenters, depending on topological configuration.
We will update the post to make this more clear.
You can learn more about Raft from this excellent visualization: http://thesecretlivesofdata.com/raft/
Intuitively, it seems to be exactly the way I would do a distributed database - distributed transactional storage with a global ordering of transactions that are based on ids rather than timestamps. The only thing that I wouldn't intuitively come up with is the actual consensus algorithm, but that already existed...
Both the tendency to use declarative rather than transactional queries and the tendency to use timestamps with clock skew rather than an id-based ordering always seemed surprising to me.
FoundationDB also forged its own path and is worth studying.
 https://trends.google.com/trends/explore?date=2016-09-20%202... compare that to some other distributed database https://trends.google.com/trends/explore?date=2016-09-20%202...
For the average application, the benefit is that you don't need to write code to address database drift / eventual consistency. I expect Fauna-like systems to gain popularity as more developers come to expect cloud-native solutions to offer global ACID transactions.
Furthermore, eventual consistency was never problematic; that was just a myth. It was never hard, certainly not harder than ACID transactions in all those MVCC systems. So there's not much benefit for an average application. And today, with all the research in strong eventual consistency, applications can have all that correctness without coordination and without sacrificing performance, latency, and availability. This is where the opportunity for distributed databases still exists; performance and latency are sellable.
As far as why it's not more prevalent, all of this research is work from the last decade or less, and it takes a long time to develop a database ready for production use.
My understanding is you're essentially shipping a "stored procedure" to the database where every input and condition is known up-front.
It seems like a reasonable trade-off for lowering the latency of WAN transactions, but it might make application development more complicated.
Most read/modify/write cycles end up being essentially that, but with the added confusion of local application threading and the standard library. We eliminate that, but preserve the general ability to express a complex query in application code (which also increases application concurrency, since the database executes transactions with maximum parallelism).
Transactions do not have to be pre-registered as stored procedures.
If you want to see a query with precondition checks etc, the code at the end of this post is another example: https://fauna.com/blog/distributed-ledger-without-the-blockc...
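For readers who can't follow the link, the general shape is something like this toy model (names and API invented for illustration; Fauna's actual query language differs): the entire read/modify/write, including precondition checks, is one expression evaluated atomically on the database side, rather than a client-side read followed by a racy write:

```python
# Hypothetical sketch: the whole read/modify/write cycle is expressed as
# one query the database evaluates atomically. The names (accounts,
# transfer, transact) are made up for illustration, not Fauna's API.

import threading

accounts = {"alice": 100, "bob": 0}
_commit_lock = threading.Lock()              # stand-in for the global log

def transact(query):
    """Evaluate a query function atomically, 'inside the database'."""
    with _commit_lock:
        return query(accounts)

def transfer(src, dst, amount):
    def query(db):
        if db[src] < amount:                 # precondition checked in-transaction
            raise ValueError("insufficient funds")
        db[src] -= amount
        db[dst] += amount
        return db[dst]
    return transact(query)

assert transfer("alice", "bob", 30) == 30
assert accounts == {"alice": 70, "bob": 30}
```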
If so, it seems this would violate serializability?
"My cat and other marvels"
That you have to specify the index in a query is a regression, IMO. At least, I've been spoiled by not having to do so when using Mongo or SQL databases, since the query engine will more often than not find the right index.
You don't want your service to collapse when the query optimizer suddenly doesn't choose the right index anymore. That was a hard lesson learned for us at Twitter.
This way you can't do queries that don't scale in production, but still don't have to increase the cognitive load on the developer. If a query is rejected by the planner that (for the most part - it's not perfect) means the query won't scale.
In the rare case when no index or the wrong index is used in a SQL query, an index hint is all it takes to fix it, or updating the statistics of a table to help the optimizer. Not having to remember the exact index for a query is better, IMO, but I could learn to adapt.
I think it's better that every time I see a query, it does exactly the same steps. Only when testing/monitoring says so do I hint it.
In fact, I like the idea of an RDBMS being less of a black box.
A similar issue was reported
Any plans to have Jepsen / Aphyr conduct a rigorous test and write a report on the results?
Might be useful to your team. One of the rare times I was surprised and impressed by a new company's verification process. I look forward to reading your team's write-up of techniques and lessons learned, too.
Is this actually how it works or am I missing something important?
In the discussed proposal datacenter failure is considered an extraordinary event with major performance implications, which is different than a global "write anywhere, read anywhere" multi-datacenter partitioned log model like Fauna.
The new "satellite" replication model in FoundationDB 6.0 allows the best of both worlds: if you have two or more datacenters in each region (each storing transaction logs, but only one storing actual data) then you can do synchronous (0xRTT) commits to the regional datacenters and asynchronous replication to the other region(s). If there are failures in one region, the database can quickly and automatically fail over to the other region, while maintaining durability (by getting the last bits of transaction log from at least one datacenter in the failing region). Even if a whole region fails, as long as the failures of the datacenters and network links in it are spread out over time the database will complete a full ACID recovery. And in the worst case you can fall back to ACI recovery as in asynchronous replication.
The question you are linking to is asking about something different, the ability to have different parts of a single database have subgeographic latencies in different regions.
 You will also see 1xRTT latencies when you start transactions, but you can solve this by setting the causal_read_risky transaction option on all read/write transactions. This is not actually risky because if the reads in the transaction turn out not to be externally consistent, the transaction will not commit.
What do you mean, "active datacenter"? Can all datacenters accept transactions concurrently?
Moreover, most current users of FoundationDB aren't willing to accept even 1xRTT latencies anywhere in their transaction lifecycle, so they can't abstract away which region is the fast one. A common way (though not the only way) to set things up is that your whole application (not just the database) is active in region A, but prepared to fail over at a moment's notice to run in region B. Ideally this comes at basically no performance penalty relative to a single-region setup in region A. Alternatively, it's possible to read or write the database from passive region(s) (and individual reads will normally happen from the local region with low latency), but each transaction has to pay a geographic latency at some point (or, in the case of read-only transactions, accept reduced consistency).
I think it would be possible to implement something analogous to satellite replication for Raft. It's a really nice set of tradeoffs for many applications, and if your cloud provider or equivalent has their act at all together, instantaneous failures of all the datacenters in a region, or of their networking, should really be pretty rare.
What's even more annoying is that I tweeted this article to them and they didn't say anything.
The speculative work ("reconnaissance query") is what enables support for dependent transactions (transactions where the write set depends on the values of the read set rather than just the keys themselves). The transaction will optimistically retry if the speculation turns out to be false.
I also signed up for the cloud account -- your pricing terms are some of the best I've ever seen, kudos -- so that I can check out the API some more.
But there is an even harder super boss level for distributed systems: Byzantine Fault Tolerance
Is FaunaDB Byzantine resistant? I have asked many projects, such as Dat, what happens if there are conflicts, and they haven’t built systems that are BFT yet.