Consistency Without Clocks: FaunaDB's Distributed Transaction Protocol (fauna.com)
171 points by evanweaver on Oct 19, 2018 | 101 comments

"unlike Google Percolator, FoundationDB, or similar systems, FaunaDB places no constraints on replica distance and is practical to deploy at global internet latencies"

"For each batch of parallel transactions, they are inserted into a distributed, write-ahead transaction log"

"Replicas must achieve consensus for how to insert new transactions into the log. FaunaDB uses an optimized Raft implementation to achieve consensus."

There are constraints on running consensus across the world (Raft): it adds at least 200 ms to serialize transactions. Also, higher latency means a longer interval between heartbeats, and hence longer downtime if the leader is isolated - a known issue of leader-based consensus protocols (see the "There Is More Consensus in Egalitarian Parliaments" paper[1] or the "In search of a simple consensus algorithm" post[2]).

Google's Percolator doesn't depend on global consensus, just on a global TSO (timestamp oracle), which can be implemented so that:

- it doesn't suffer from leader isolation (there is no leader)

- it doesn't have a bottleneck (every node handles requests)

- it doesn't touch disk on each request

Details are in the "Quorum clock: leaderless distributed clock" post[3].

[1] https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf

[2] http://rystsov.info/2017/02/15/simple-consensus.html

[3] http://rystsov.info/2018/10/01/tso.html
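The quorum-clock idea can be sketched roughly like this (a toy model inferred from the three properties listed above, not the actual algorithm from the linked post; all names are made up):

```python
# Toy sketch of a leaderless, quorum-based timestamp oracle.
# Each node keeps only an in-memory counter; a client derives a new
# timestamp by reading a majority, taking the max, and writing max+1
# back to a majority. Any two majorities intersect, so timestamps
# never go backwards even though there is no leader.

class ClockNode:
    def __init__(self):
        self.counter = 0

    def read(self):
        return self.counter

    def advance(self, value):
        # Monotonic update; no disk touch in this toy model.
        self.counter = max(self.counter, value)
        return self.counter

def next_timestamp(nodes, quorum_size):
    quorum = nodes[:quorum_size]            # any majority works
    ts = max(n.read() for n in quorum) + 1  # strictly after anything seen
    for n in quorum:
        n.advance(ts)                       # push the new high-water mark
    return ts

nodes = [ClockNode() for _ in range(5)]
t1 = next_timestamp(nodes, 3)
t2 = next_timestamp(nodes[2:], 3)  # a different, overlapping majority
assert t2 > t1
```

Because every majority overlaps every other, each new timestamp is strictly greater than any previously issued one; the real design in post [3] additionally has to handle concurrent clients and node failures.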

The paper contains this claim:

> FaunaDB is an elegant, software-only solution for achieving global ACID transactions, with complete guarantees of serializability and consistency.

The paper makes it sound like FaunaDB claims strict serializability for all transactions (like Spanner). This means that if Txn(A) ends before Txn(B) begins, then Txn(B) must be guaranteed to see all data written by Txn(A).

However, what if Txn(B) is a read-only transaction? According to the paper, read-only transactions do not contact the global sequencer. Here is an example where an application would fail to read its own writes:

1. Txn(A) performs a write W(X) of data on replica R1.

2. Txn(A) is assigned a logical timestamp of T1 by the global log.

3. Conflict verification is triggered. Replica R1 plays back the global log and verifies there is no conflict.

4. Txn(A) completes, since one replica has verified there are no conflicts, and passes control back to the application.

5. The application now triggers a read-only Txn(B).

6. The coordinator for Txn(B) decides to read replica R2 rather than R1. However, the coordinator is not yet aware that the global log is now at timestamp T1, and it picks T0 as its snapshot timestamp.

7. It reads the value of X on replica R2, which does not yet reflect W(X).

-- From the application's point of view, it did not read its own writes. --
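The seven steps above can be replayed as a toy model (illustrative only; these names are not FaunaDB internals):

```python
# Toy replay of the scenario above: two replicas, a global log, and a
# read-only transaction whose coordinator picks a stale snapshot.

log = []                         # global transaction log
replicas = {"R1": {}, "R2": {}}  # key -> (value, write timestamp)

# Steps 1-4: Txn(A) writes X, gets log position T1, R1 applies it.
log.append(("W", "X", "new"))
T1 = len(log)                    # logical timestamp from the log
replicas["R1"]["X"] = ("new", T1)

# Steps 5-7: Txn(B) is read-only; its coordinator hasn't heard about
# T1 yet, chooses snapshot T0, and reads replica R2, which is behind.
T0 = T1 - 1

def read_snapshot(replica, key, snapshot_ts):
    value, write_ts = replica.get(key, ("old", 0))
    return value if write_ts <= snapshot_ts else "old"

seen = read_snapshot(replicas["R2"], "X", T0)
assert seen == "old"             # Txn(B) misses Txn(A)'s write
```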

I'm not familiar enough with FaunaDB's subtleties to know whether this scenario is possible in practice. Perhaps you could comment?

I did notice that at the end of the article, the language is carefully phrased to only make the claim of "serializable" (not "strictly serializable") for read-only transactions. But that would fall short of the guarantees that Spanner and Calvin make, and undermine the "complete guarantees of serializability and consistency".

Read-only transactions are serializable by default, but can be upgraded to strict serializability: either from the perspective of a single client, by passing a transaction timestamp that serves as a causal token (which is free and happens by default with our drivers), or globally, by opting in to the log at the price of additional latency.

But opting into the log for read-only txns would destroy performance for many apps, both in terms of latency and throughput. This is rarely practical since so many apps are read-heavy. I assume you don't recommend that to your customers, especially in global deployments.

Causal tokens are practical from a performance point of view, but do require application awareness and changes, which presents a different set of issues. Most application developers would not appreciate the subtleties and just use the default behavior, which does not guarantee "read your own writes". And causal tokens don't prevent the more general case of stale reads.

This would mean that in the default FaunaDB configuration, applications could see anomalies:

1. Inability to read own writes - described above, with the additional clarification that it can't be fully solved at the driver level. For example, say I call a web API that creates a blog post and returns a PostID. I then call another API that edits that post, passing it the PostID I got back from the first. If FaunaDB is in use, the second call could fail to find the PostID, because I happened to hit a different replica that did not yet have it.

2. Stale reads - there could be arbitrary levels of staleness if the replica/coordinator hasn't been able to sync to the global log. If a coordinator node and a replica were stuck in their own little partition, cut off from the rest of the cluster, they'd still happily serve stale reads (until perhaps they detect the partition and shut down?).

These read anomalies aren't usually considered compatible with a consistent system, but instead are hallmarks of an eventually consistent system. It seems misleading to make an unqualified 100% consistency claim when it doesn't apply to read-only txns by default, and when the fully consistent mode is not practical to use in global deployments.

Really putting me on the spot when our CTO is out of town. My initial reply was ambiguous and appears to suggest that the transaction highwater mark/causal token preserves strict serializability. It preserves serializability. Strict serializability requires the log.

Let me try to explain more clearly and you tell me what you would call it. It is subtle because there are three coordination points, the log, the replica, and the client.

The log is marching along, moving time itself forward. Every node within a replica is following the log monotonically for a disjoint set of partitions. Every client process is querying an arbitrary and changing set of nodes within a replica, which is normally a datacenter.

Every node in every replica maintains a highwater mark (the causal token) of the last transaction time it has applied. Every client does the same, keeping a highwater mark of the last transaction time (read or write) it has seen. If it was a write, the highwater mark came from the first replica to apply the transaction from the log. If it was a read, it came from the local replica, based on the last transaction time that node has applied.

These transaction times are passed back and forth with every request, from client to replica and back, automatically by the Fauna drivers. Thus, a replica node will block if a client performs a read-only transaction presenting a transaction time which it has not yet replicated from the log. And if a client receives a later time from a replica or from the log, it will update its local highwater mark so that its view of time (and thus its key set) can never backtrack.

1a. If the client process is long-lived and maintains the highwater mark, then it is guaranteed to read its own writes and maintains serializability (but not strict serializability) for all keys on all partitions that own any keys it has queried via log write or local read. Even if the write acknowledgement comes from another replica, the highwater mark is still advanced and any local read partitions will block until they have replicated beyond that point. A stateful webserver would maintain the highwater mark internally across requests.

1b. If the client process is not long-lived (a Javascript call directly from a browser, for example, which is possible with FaunaDB's security model), then you will still read your own writes as long as you do not accidentally switch replicas (e.g., datacenters), because there is only one node per replica that owns any specific key. In practice, switching datacenters during a client session is a pathological edge case. The most likely scenario would be that the client performs a write, receives a commit acknowledgement from a remote datacenter, discards the transaction time, queries the locally partitioned node for the key it just wrote, and does not see the update. This could be eliminated by threading replica affinity through the entire write/read cycle.

2. Stale reads from partitioned nodes are no different than increased client-server latency, because if the client can write, it will receive a higher transaction time, and its subsequent reads will block and timeout, preserving the serial order of operations. If it doesn't write, as far as it is aware, time is just running a little slow, and the data will be stale from an external consistency perspective, but order will be preserved. This is a violation of strict serializability but not of serializability. In this case if the client switches datacenters, it will read or write and get an updated transaction highwater mark, and will no longer be able to successfully read from the partitioned node because it knows about the future now.
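A minimal sketch of the highwater-mark scheme as described above (the names and the blocking behavior are simplified assumptions, not Fauna driver code):

```python
# Minimal sketch of the causal-token scheme: every request carries the
# client's highwater mark, and a replica node will not serve a read
# until it has applied the log at least that far.

class ReplicaNode:
    def __init__(self):
        self.applied = 0       # last log position applied locally
        self.data = {}

    def apply(self, log_pos, key, value):
        self.data[key] = value
        self.applied = log_pos

    def read(self, key, client_mark):
        if self.applied < client_mark:
            # A real node would block until caught up; the toy times out.
            raise TimeoutError("replica is behind the client's token")
        return self.data.get(key)

class Client:
    def __init__(self):
        self.highwater = 0     # causal token, threaded through every call

    def write_ack(self, log_pos):
        self.highwater = max(self.highwater, log_pos)

    def read(self, node, key):
        value = node.read(key, self.highwater)
        self.highwater = max(self.highwater, node.applied)  # never backtrack
        return value

client = Client()
fresh, lagging = ReplicaNode(), ReplicaNode()
fresh.apply(1, "x", "v1")
client.write_ack(1)            # commit ack carries log position 1

assert client.read(fresh, "x") == "v1"  # caught-up node serves the read
try:
    client.read(lagging, "x")           # behind node must block, not lie
    assert False
except TimeoutError:
    pass
```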

I should note that Spanner-derived systems typically offer some variant of serializability or worse for read-only transactions since nobody wants to wait around for the clock ambiguity window to expire. Spanner itself can wait, but it still comes at a cost, and Spanner goes out of its way to offer bounded staleness reads that are low latency but do not meet any serializability guarantee at all, as far as I can tell.

Sorry to put you on the spot. Perhaps your CTO can chime in when he gets back.

Serializable tends to sound really good to database people, because it's the "highest" isolation level available in most databases. However, those databases are built for a single machine, and they really offer "strict serializable" without calling it that, because they maintain a single log which serializes all writes, and all reads operate as of the latest time. With the advent of distributed databases, the "strict" part is becoming more relevant, because it's no longer something you just get without realizing it.

I gave an example of the "blog post" anomaly that occurs in a system that only guarantees serializability. As another example, imagine I am using FaunaDB to store bank account balances. As the developer, I design a system where any time a customer receives money to their account, I send them a text giving the amount and their new balance, along with a link to the website where they can get more details. Except, when they click the link and go to the website, they don't see any change to their balance. Furthermore, when they refresh the page, sometimes they do see their new balance, and then they refresh again, and it disappears!

More investigation reveals that FaunaDB is in a partitioned state, and was serving stale bank account balances from some replicas. The replicas were behind a load balancer, and sometimes the web client hit one replica, and sometimes another, which explains the alternating balances. Is it possible to solve this problem using clever application logic, causal tokens, or saying "don't do that"? Sure. But a consistent database is supposed to prevent this scenario, not the application. Also, note that the "this is no different than increased client-server latency" argument does not hold here. People expect that if they finish action A (like writing to the DB), then do action B (like clicking a web link that reads the DB), that B will always see the results of A. The stale bank account balance isn't a delayed value; it's a wrong value.

You're correct that most systems allow some variant of "secondary" or "follower" reads that can return stale data, in exchange for better performance. I'm all for this kind of feature, as long as the developer understands the tradeoff they're making. Moreover, I think it's appropriate for FaunaDB to offer the same. But I'd advise against making unqualified claims of "complete guarantees of serializability and consistency" and the "latency profile of an eventually-consistent system like Apache Cassandra". FaunaDB is able to offer one or the other, but not both at once.

Having the read information sent from one location (presumably where the write occurs) and then read from another is a good example of one of the few times this anomaly can occur.

I think it's a stretch to say that including a single timestamp in the link is too hard for developers. It happens by default across queries within a process with the Fauna drivers, so the database is preventing it; the only place you have to think about it is multi-process, multi-location reads. Also, having a load balancer jump back and forth between two replicas would not normally occur, because of geo routing (a soft form of datacenter affinity), but it is possible. Note that in your example, even if the link didn't include the timestamp, once they visit the webpage, the webserver should begin managing the timestamp in-process or in a session cookie, preventing the balance from disappearing once seen for the first time.

In my experience as a DBA the legacy database experience is much worse because in practice there are always physically distinct read replicas in the mix for scalability and hot failover that have no consistency guarantees at all, due to various faults like statement-based replication desynchronization, the lack of any ability to share log cursor information, and the like. I am not making a worse is better argument. I'm just pointing out that you don't get any C at all for the feeble amount of A gained under CAP in practice with most “single node” systems.

You are correct that the latency argument does not hold if you do not pass the timestamp information across queries. But if the observer can observe multiple replicas in the database topology, it is always possible to use the observer itself as the communication channel for the timestamp to avoid the anomaly. If you want to maintain strict serializability in all cases, but minimum-latency reads, the writer could block on transaction application acknowledgement from all replicas. Obviously that gives up availability for writes under partition, but it may be reasonable in your asynchronous example in order to avoid sending the txt message too early. Perhaps, for example, the message is "go to your local ATM" instead of a link with a timestamp. Although I would expect the ATM had the latency budget to do a linearized read even in a global context.

We are always happy to make the descriptions of the behavior and consistency levels more precise.

Also: please email us! It would be great to talk further.

>It reads the value of X on replica R2, which does not yet reflect W(X).

Why does someone want this functionality? Do you expect, say, Oracle to work this way? I honestly don't see this as a useful feature in a database that's operating at scale, but I'm open to hearing reasons for needing this.

Not sure I fully understand your question, but I'll try to answer:

FaunaDB (like other highly available databases) is set up to have multiple replicas of the same data. You can direct reads to any of the replicas. For example if replica #1 is unavailable due to a network outage, you can read replica #2 instead.

This thread is about the behavior of FaunaDB, which allows reads from replicas that hold older, stale values (i.e., they're not up to date with the latest data). This causes the various consistency anomalies I've been describing. The reason FaunaDB works like this is to make reads faster and to better distribute read load. Allowing secondary/follower/slave reads like this is a common tactic for increasing read performance.

Does that answer your question?

I've come to the realization that sharding might be the only way to actually scale multi-master systems without allowing stale (or potentially outright wrong) reads.

Sharding is complex and makes parallel queries a requirement but it really does seem like the only way to distribute a database. In the end, once you have the op log on more than one machine, they're going to have to synchronize with each other, otherwise you risk giving out not just stale answers, but completely wrong answers. This is fine for a site like FB but is not OK if it's attached to military equipment.

CRDTs don't help in the case where you don't want to deliver wrong answers; they only help ensure that merges (at sync time) are resolvable without conflicts. Lamport clocks only give you a partial order, and you need something else (physical clocks) for a total order, which is what you get from Raft (because you only let one node manage the log in the first place).
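A quick illustration of the partial-order point, using the standard Lamport clock rules:

```python
# Lamport clocks order causally related events, but concurrent events
# can end up with timestamps that say nothing about their real order.

class Process:
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock        # timestamp piggybacked on the message

    def receive(self, msg_ts):
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

p, q = Process(), Process()
a = p.local_event()     # event on p
b = q.local_event()     # event on q, concurrent with a
ts = p.send()           # p -> q message
c = q.receive(ts)       # causally after the send

assert c > ts           # causal order is respected
assert a == b           # concurrent events get equal stamps: no order
```

Because concurrent events can carry equal (or arbitrarily interleaved) timestamps, some external tie-breaker is needed to pin down a total order, which is exactly what a single-leader log provides.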

What I haven't seen yet/want to try is sharded + replicated Raft, to see if it could introduce some interesting performance benefits. It's basically maximum complexity (the complexity of Raft, plus sharding, plus multiple instances of Raft at the same time), but it could be the only way to distribute write load (smart routing/forwarding of requests) and increase availability (eventually consistent replication for every node of every shard) at the same time. This is basically what Fauna does, but I'm not 100% sure about splitting a single table across replicas...

Sounds like FoundationDB to some extent. All data is sharded (and replicated), i.e. chunks of the keyspace, up to the small block-size limit, are spread across the cluster. With this you can scale read throughput with the size of the cluster through parallel reads.

Though their write performance is bounded by the more limited number of transactors, they still achieve fairly large write scale.

CRDTs may help: if you read from a quorum, merge (and modify), and write back to a quorum on every read (or write), then you'll always get correct answers.

I'm not sure I'm understanding your suggestion correctly, but it sounds like you're just quorum-ing everything. You shouldn't need improved merging if you quorum both reads and writes, since that implies a majority agreeing on the contents of the log to begin with.

I agree that CRDTs don't help with data scalability (eventually all data converges on a single node); I just wanted to comment on the following phrase:

> CRDTs don't help in the case that you don't want to deliver wrong answers

Usually when people talk about CRDTs they assume that a client talks to a single node, and then a background process replicates the data to other nodes. Before that replication is done, a client contacting another node gets "wrong answers". But nothing prevents a client from always talking to a quorum of nodes and doing the replication on its own; in that case we always get only right answers :)

However, we still need to merge if we do quorum-ing. Paxos/Raft-based systems basically take an optimistic lock and then quorum (see the prepare phase), so they suffer from contention and need a leader to serialize requests (the leader in Paxos/Raft is for liveness, not safety). But with CRDTs we can always merge, so concurrent requests are not a problem and we can achieve a better level of parallelism (request scalability).
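The "we can always merge" property can be illustrated with the simplest CRDT, a G-Counter (a toy sketch, not tied to any particular system):

```python
# G-Counter CRDT: each replica tracks per-replica counts; merge is an
# elementwise max, so concurrent increments never conflict and quorum
# copies converge regardless of merge order.

def increment(state, replica_id):
    state = dict(state)
    state[replica_id] = state.get(replica_id, 0) + 1
    return state

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

# Two clients increment concurrently against different quorum copies.
copy1 = increment({}, "n1")
copy2 = increment({}, "n2")

merged = merge(copy1, copy2)
assert value(merged) == 2                          # both increments survive
assert merge(copy1, copy2) == merge(copy2, copy1)  # merge order is irrelevant
```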

Quorum/quorum does not guarantee agreement or serializability even for a single key.

Imagine a write that commits to one replica but is delayed committing to two others. Some quorum reads will see it (but also see a tie with the old value...what breaks the tie? Certainly not a timestamp) and others will not see it at all, indefinitely.

It’s easy to end up in a state where every replica has a different value because of torn writes.

Sorry, I think either my fundamental understanding is off or I wasn't clear enough in how I was imagining this happening...

If you and a majority of nodes agree on a value like a CRDT OpSet (more simplistically, just agreeing on the state of the log), how does that not guarantee agreement and serializability? It is impossible from that point on for another majority of nodes to have some other view of what happened. Consensus algorithms are built on exactly this quorum-intersection property.

One copy serializability is exactly what would be achieved by having a read and write quorum[0][1]. It intuitively makes sense to me (and maybe my intuition is wrong), but if you talk to a majority of nodes and ask "this is what we all have, right?" before every write and every read, you've got a guaranteed consistent (of course, progress isn't guaranteed if partitions/nodes die, etc) state.

AFAIK Quorums are the state of the art (and arguably the only relatively efficient option) as far as achieving serializability in a distributed system...

[0]: https://en.wikipedia.org/wiki/Quorum_(distributed_computing)...

[1]: https://arxiv.org/pdf/1406.7423.pdf

> If you talk to a majority of nodes and ask "this is what we all have, right?"

You cannot know this, because the transaction replication is racy and not atomic--it may have applied to only one node while you are doing your read. Whether you see it or don't is luck. So you can have the following scenario (and in practice you will):

- TX commit begins

- TX replicated to node A

- Read from coordinator A' begins

- Read sees replica A (has tx) and replica B (does not)

- Read assumes A wins because of some kind of vector clock in the data value (choosing the older value doesn't make things better, just in case you are wondering)

- Read from coordinator B' begins

- Read sees B (no TX) and C (no TX)

- Read completes with stale value--serializability violation has occurred

- TX finishes replicating to B and C

This leaves aside write skew, torn transactions due to partitions, and all kinds of other problems.
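That interleaving can be replayed as a toy model (illustrative only):

```python
# Toy replay of the trace above: a write lands on replica A only, and
# two overlapping quorum reads still disagree, because bare read
# quorums have no prepare phase to anchor a serial order.

replicas = {"A": "new", "B": "old", "C": "old"}  # TX applied to A only

def quorum_read(names):
    # Pick the "newest" value seen; with no agreed versioning scheme,
    # any tie-break rule leaves some coordinator with a stale answer.
    values = [replicas[n] for n in names]
    return "new" if "new" in values else "old"

r1 = quorum_read(["A", "B"])   # coordinator A': happens to see the write
r2 = quorum_read(["B", "C"])   # coordinator B': later read, misses it
assert r1 == "new" and r2 == "old"   # the later read observes older state
```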

I'm super confused -- what you're describing isn't a quorum read/write scenario -- what do you mean by "A wins"? Also, where is the prepare phase for two-phase commit/any consensus algo? Replica A shouldn't be reporting an unacknowledged transaction to coordinators -- the write hasn't been quorum-acknowledged. The TX is not considered committed until it reaches a majority of nodes. You are right that if you have a network partition you're in trouble, but that ends in lack of progress, not loss of serializability.

We must be talking about different things because I can't find any literature that has reached the conclusion that serializability is impossible in distributed transactions? Can you point me to that?

Also, do you have any thoughts on the literature[0] that contradicts what you're saying? I'm not an expert, but unless I'm misreading the English, serializability is possible with quorum reads and writes.

> In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of rvotes to read a file, and a write quorum of wvotes to write a file, such that r+w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file's voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.

Has this paper been refuted? In addition to this there's literally the whole section on distributed serializability[1].

[0]: https://dl.acm.org/citation.cfm?doid=800215.806583

[1]: https://en.wikipedia.org/wiki/Serializability#Distributed_se...
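The weighted-voting scheme quoted above can be sketched like this (a toy single-object model; it deliberately ignores multi-object transactions and failure handling):

```python
# Gifford-style voting: r + w > n forces every read quorum to overlap
# every write quorum, and version numbers identify the current copy.
# One vote per copy in this sketch.

N, R, W = 5, 3, 3              # r + w > n
copies = [{"version": 0, "value": None} for _ in range(N)]

def write(value):
    quorum = copies[:W]
    new_version = max(c["version"] for c in quorum) + 1
    for c in quorum:
        c.update(version=new_version, value=value)

def read():
    quorum = copies[N - R:]    # deliberately a *different* quorum
    return max(quorum, key=lambda c: c["version"])["value"]

write("balance=100")
assert read() == "balance=100"   # quorum overlap guarantees visibility
```

This gives one-copy behavior for a single object; as the reply below notes, extending it to multi-object transactions requires additional machinery (locks or a prepare phase) beyond the quorums themselves.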

> What you're describing isn't a quorum read/write scenario -- what do you mean by "A wins"? Also, where is the prepare phase for two-phase commit/any consensus algo?

There is no consensus; that requires a leader system. The paper you link appears to require a multi-phase lock; the quorum itself does not guarantee serializability. Explicit preparation via quorum can guarantee serializability (but not strict serializability, I don't think), but cleanup of locks is a big performance problem in practice.

> Replica A shouldn't be reporting an unacknowledged transaction to coordinators -- the write hasn't been quorum acknowledged.

Acknowledged by who? The replicas can't block on the other replicas; they just tell the coordinator when they applied the write.

This is worth a blog post.

Summary, taken from the article:

> To summarize the overall FaunaDB protocol, each transaction proceeds in three phases:

> 1. The first phase is a speculative phase in which reads are performed as of a recent snapshot, and writes are buffered.

> 2. Next, a consensus protocol is used (Raft) to insert the transaction into a global log, which results in the transaction receiving a global transaction identifier that specifies its equivalent serial order relative to all other transactions that are being processed by the system. This is the only point at which global consensus is required.

> 3. Finally, a check begins in each replica which verifies the speculative work. If that speculative work did not result in potential violations of serializability guarantees, then the work becomes permanent and the buffered writes written back to the database. Otherwise, the transaction is aborted and restarted.
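A rough sketch of those three phases, modeled as optimistic concurrency control against a shared log (my simplification, not Fauna code):

```python
# Toy model of the three-phase protocol summarized above.

db = {"x": (0, "a")}           # key -> (version, value)
log = []                       # stands in for the Raft-replicated log

def run_transaction(read_keys, writes):
    # Phase 1: speculative reads at a recent snapshot; writes buffered.
    read_set = {k: db[k][0] for k in read_keys}

    # Phase 2: consensus inserts the transaction into the global log,
    # fixing its equivalent serial order relative to all others.
    log.append((read_set, writes))
    txn_id = len(log)

    # Phase 3: each replica re-checks the read versions; apply if the
    # speculation still holds, otherwise abort (and the client retries).
    if any(db[k][0] != v for k, v in read_set.items()):
        return None            # aborted: a conflicting txn got in first
    for k, v in writes.items():
        db[k] = (txn_id, v)
    return txn_id

assert run_transaction(["x"], {"x": "b"}) == 1
assert db["x"] == (1, "b")
```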

I wish projects like this would publish their proofs with their protocols. It's interesting to be sure but I find all of the prose and diagrams to be too verbose. I'd much rather read the mathematical model of the transaction protocol.

You can read the paper that inspired the FaunaDB transaction protocol here: http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...

Or if you want a very concise description, see slide 13 from Daniel Abadi's presentation: https://www.slideshare.net/abadid/the-power-of-determinism-i...

Thanks for the links! I meant the actual proof or model they use to verify the protocol. Is the FaunaDB team using TLA+, Coq, Lean?

The concept of the "global log" is mentioned many times in the article, but not much detail on this critical piece:

> ... the consensus protocol is only being used for inserting transactions into a global log. For every other part of the protocol, replicas can proceed completely independently from each other.

Clearly this central resource can't be distributed or replicated. Right? So it, and the hardware it runs on, and the communications links to it, are a single point of failure. True?

If the global log goes down, or can't be reached at all, or perhaps only from some region of the regionized application, what happens?

You are correct that they serialize cross-partition transactions. So if your workload has a reasonable percentage of write/update/delete operations it is possible that you will bottleneck on the global coordinator. From this blog https://fauna.com/blog/distributed-acid-transaction-performa..., you can get 3300 transactions/second. Daniel Abadi claims you can get to 500,000 trans/sec - http://dbmsmusings.blogspot.com/2018/09/newsql-database-syst....

In any case, there is an open-source in-memory database called NDB (MySQL Cluster) that can get to 100 million write transactions per second. The echo chamber of SV has ignored it, but it's real: http://mikaelronstrom.blogspot.com/2015/03/200m-reads-per-se...

NDB has what I call lock-aware programming. When you read a row, you take either a read lock or a write lock, as it only supports the READ_COMMITTED isolation level. But it does not serialize cross-partition transactions. Transaction coordinators run independently, and it is up to the developer to ensure consistency by taking either read or write locks on rows.

Serializing cross-partition transactions is known as "external consistency" or "strict serializability"; that's what this game is about. The illusion of total order. :-)

There is no single node that serves as a global coordinator. The "coordinator" as referred here is a stateless logical role for each transaction and every node in the cluster can do it on an ad-hoc basis.

Our current performance numbers are much better, but we don't have anything published quite yet. The bottleneck, so to speak, is the distributed, partitioned write-ahead log, which is implemented in a pipelined version of Raft. It is not materially different from the Paxos- or Raft-based rings that support replicas in a Spanner-style leader-per-tablet two-phase system. However, it has better long-tail latency, because a fixed subset of nodes resolves transactions instead of every leader regardless of location, and it is not subject to write skew across disjoint transactions.

We are well aware of NDB. NDB is not designed for WAN latencies, and, like you say, it pushes consensus complexity onto the developer. Certainly it is easier to reach higher theoretical throughput numbers when you do less work on faster hardware. I admire NDB, but it doesn't make sense to compare a WAN system with a higher consistency guarantee to a LAN system with a lower guarantee.

Fauna will support multiple consensus regions in the future to uncork aggregate throughput, at the cost of some explicit optimistic consensus across region boundaries. This is still a better story than explicitly locking on a read-committed isolation level which is subject to read and write skew in a way that cannot be architecturally overcome, even with locks.

But Spanner uses 2PC for cross-partition transactions and Paxos within a partition. There's no global coordinator service there.

Don't get me wrong. I think Calvin/FaunaDB is great as an alternative to Spanner for multi-region DBs. But strict serializability is not a goal for all systems, and certainly not for high-performance distributed systems that can provide their own concurrency models. Not just HopsFS, but any system layered on top.

There is no global coordinator "service" in FaunaDB either. The "coordinator" is a stateless function available on every node. The log is a stateful, logical function implemented in Raft, so no different from a Paxos ring in terms of failure modes and high availability.

Think of it as a way to scale a single Spanner tablet to support the entire dataset and eliminate the 2-phase commit, as well as the associated write skew or clock ambiguity window.

Paxos/Raft is an agreement service where nodes agree on entries in a single log. I have no problem calling it a distributed/global coordination service for agreeing on entries in a log.

I'd like more detail on the "coordinator". It seems all cross-partition transactions need to be serialized into a global ordering (via a consensus algorithm?). So why isn't this a bottleneck?

The log segments committed by FaunaDB contain batches of transactions, which means our throughput is constrained not by our consensus protocol, but rather by conflicts as transactions are resolved. The benchmarked 3300 transactions/second mentioned is for complex transactions with dozens of reads and writes. Additionally, read-only transactions are not run through the transaction pipeline, since they can be served consistently with low latency from replicas using FaunaDB's snapshot storage.

More important for most applications than theoretical benchmarks is taking a sound approach and using a database you can depend on. Asking your development team to manage isolation is going to introduce more costs than benefits, except in the most specialized of circumstances.

FaunaDB's model is always ACID, so you never have to guess what's in the database, even in the presence of region failures or other anomalies.

We built HopsFS on NDB, not CalvinDB, because we needed performance for cross-partition transactions. Some workloads need it. In the filesystem workload, when FS path components are normalized and stored on different partitions, practically all transactions cross partitions. So if you serialize them, then writing a file in /home/jchan will block writing a file in /home/jim. This is what most distributed filesystems already do - have a global write lock on the filesystem metadata. I like having the freedom to build my concurrency model on top of READ_COMMITTED, as I can squeeze out 16X performance improvements by doing so (ref: https://www.usenix.org/conference/fast17/technical-sessions/... )

Sure, if you don't care about serializability, you don't need to pay consensus costs to enforce it, but you will be subject to read and write skew.

> This is what most distributed filesystems already do - have a global write lock on the filesystem metadata

Absolutely untrue. Stop lying about other systems.

I should add that when you don't serialize cross-partition transactions, you have the hard problem of system recovery - transaction coordinators need to reach consensus on a consistent set of transactions to recover. Here's NDB's protocol for doing so: NDB’s solution to this problem was only published this year in Mikael Ronstrom's book on MySQL Cluster : https://drive.google.com/file/d/1gAYQPrWCTEhgxP8dQ8XLwMrwZPc...

I should add that it is not correct to say 'our throughput is constrained not by our consensus protocol'. A trivial example would be a workload of transactions, where each transaction has at least two non-conflicting writes on different partitions. FaunaDB will serialize those transactions, and you will bottleneck on the consensus protocol - compared to NDB.

What he's saying is that the throughput number they mention is constrained by transaction conflicts. The limit for non-conflicting transactions allowed by the consensus protocol is no doubt much higher.

Can you partition your batch transactions, so that all up to the conflict succeed?

My understanding is yes, we commit to the log in batches, but abort conflicts at the transaction level. So only the conflicts have to suffer the retry loop, everything else is durably committed.
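A sketch of that batching-with-per-transaction-aborts idea (illustrative names and data shapes, not Fauna's actual code): a whole batch is committed to the log, but each transaction is validated individually against the versions it read, so only the conflicting ones retry.

```javascript
// Hypothetical batch apply: commit the batch, abort conflicts per transaction.
const store = new Map(); // key -> { value, version }

function applyBatch(batch) {
  const aborted = [];
  for (const txn of batch) {
    // A txn conflicts if any key it read has changed since it was speculated.
    const conflict = txn.reads.some(
      (r) => (store.get(r.key)?.version ?? 0) !== r.version
    );
    if (conflict) {
      aborted.push(txn); // only this txn suffers the retry loop
      continue;
    }
    for (const w of txn.writes) {
      const prev = store.get(w.key)?.version ?? 0;
      store.set(w.key, { value: w.value, version: prev + 1 });
    }
  }
  return aborted; // everything else is durably committed
}
```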

The "global log" is a logical component. It is a distributed, partitioned write-ahead log which is implemented in a pipelined version of Raft.

It is both distributed and replicated, even across datacenters, depending on topological configuration.

We will update the post to make this more clear.

The global log is replicated using a Raft-like consensus mechanism, so that all replicas have a copy of the log, and agree on its order. So even if one replica site is lost, the others have a full copy of the log, and can continue.

You can learn more about Raft from this excellent visualization: http://thesecretlivesofdata.com/raft/

We're here, ready for your questions to be consistently replicated across the WAN at the lowest latency information science allows ;-)

Not a database person, so I might be missing context, but why is FaunaDB the only implementation of this?

Intuitively, it seems to be exactly the way I would do a distributed database - distributed transactional storage with a global ordering of transactions that are based on ids rather than timestamps. The only thing that I wouldn't intuitively come up with is the actual consensus algorithm, but that already existed...

Both the tendency to use declarative rather than transactional queries and the tendency to use timestamps with clock skew rather than an id-based ordering always seemed surprising to me.

Everybody wants to copy Google apparently; there are a variety of Spanner or Percolator clones although some of them do not meet the original guarantees of their reference systems.

FoundationDB also forged its own path and is worth studying.

This is such a niche subcategory of distributed databases [1], that it's a miracle even a few of them exist and you want more implementations. Intuitively distributed systems try to avoid consensus/coordination, as doing those things over public internet is not that useful.

[1] https://trends.google.com/trends/explore?date=2016-09-20%202... compare that to some other distributed database https://trends.google.com/trends/explore?date=2016-09-20%202...

The property that attracts the most mission critical use cases is the ability to keep transactions rolling without loss of guarantees, even while hitting the typical cloud failure modes. Some applications can't make tradeoffs between correctness and global availability, for them FaunaDB's mainframe-like capabilities are key.

For the average application, the benefit is that you don't need to write code to address database drift / eventual consistency. I expect Fauna-like systems to gain popularity as more developers come to expect cloud-native solutions to offer global ACID transactions.

Well, I don't expect distributed ACID transactions to ever become a thing developers expect. I see people try them, but leave unhappy because of the unexpectedly bad latency and performance. Hence the lack of growth in trends.

Furthermore, eventual consistency was never problematic, it was just a myth. It was never hard, certainly not harder than ACID transactions in all those MVCC systems. So not much to benefit from for an average application. And today with all the research in strong eventual consistency applications can have all that correctness without coordination and without sacrificing performance, latency and availability. This is where the opportunity for distributed databases still exists, performance and latency are sellable.

There's a few other research projects based on the same concept of a distributed log and an explicit sequencer such as Corfu: https://github.com/CorfuDB/CorfuDB

As far as why it's not more prevalent, all of this research is work from the last decade or less, and it takes a long time to develop a database ready for production use.

Has the lack of interactive transactions and only supporting entirely deterministic transactions been a big issue for your customers?

My understanding is you're essentially shipping a "stored procedure" to the database where every input and condition is known up-front.

It seems like a reasonable trade-off for lowering the latency of WAN transactions, but it might make application development more complicated.

No, it hasn't. It is possible to layer an optimistic concurrency model for session transactions on top of our protocol, but the database standard library is sufficiently full-featured that it's easy to express most application logic as a pure function over the current state of the database.

Most read/modify/write cycles end up being essentially that, but with the added confusion of local application threading and standard library. We eliminate that, but preserve the general ability to express a complex query in application code (which also increases application concurrency since the database executes transactions with maximum parallelism.)
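For a feel of what "a pure function over the current state of the database" means, here is a minimal sketch with hypothetical names (this is not the Fauna API): the transaction body takes a snapshot and its arguments and deterministically produces writes, so every replica can re-execute it at the same log position and reach the same result.

```javascript
// Illustrative deterministic transaction: snapshot in, writes out.
// No wall clocks, randomness, or outside I/O -- replicas agree by construction.
function transferTxn(snapshot, { from, to, amount }) {
  const src = snapshot[from];
  if (src < amount) {
    return { abort: "insufficient funds", writes: {} };
  }
  return {
    writes: { [from]: src - amount, [to]: snapshot[to] + amount },
  };
}
```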

Transactions do not have to be pre-registered as stored procedures.

This FaunaDB ledger tutorial gives a feel for how you'd build out your application logic: https://fauna.com/blog/tutorial-how-to-create-and-query-a-le...

If you want to see a query with precondition checks etc, the code at the end of this post is another example: https://fauna.com/blog/distributed-ledger-without-the-blockc...

I have a question about the phantom read issue. E.g., two doctors are on call for the night, but there is a constraint that at least one must remain on call. If we have two transactions, each checking that there are at least 2 doctors on call and then removing one, it seems we could end up with no one on call with FaunaDB?

If so, doesn't this violate serializability?

“Consistency without clocks” seems like a misnomer considering Raft indexes are tantamount to a logical clock. Perhaps it should be “consistency without wall clocks.” But that’s not really novel by itself, except in the context of geo-distributed databases.

This is a weakness imo:

`client.query( q.get( q.match( q.index("posts_by_title"), "My cat and other marvels" ) ))`

Having to specify the index in a query is a regression imo; at least, I've been spoiled by not having to do so when using Mongo or SQL databases, since the query engine will more often than not find the right index.

This is a design choice to guarantee predictable performance at scale, not an architectural limitation, and subject to revision.

You don't want your service to collapse when the query optimizer suddenly doesn't choose the right index anymore. That was a hard lesson learned for us at Twitter.

Another interesting approach is the way Google Cloud Datastore/Firestore works. The query engine will only allow queries that scale and can happen in a single contiguous scan of an index range. The database can then stream results back and the work done is proportional to the size of the result set.

This way you can't do queries that don't scale in production, but still don't have to increase the cognitive load on the developer. If a query is rejected by the planner that (for the most part - it's not perfect) means the query won't scale.
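A rough sketch of the "single contiguous index scan" rule described above (hypothetical data shapes, not the real Datastore planner): a query is allowed only if its filters are equality conditions on a prefix of the index's columns, followed by at most one range condition on the next column.

```javascript
// indexCols: ordered index columns, e.g. ["city", "age"]
// filters: array of { field, op }, where op is "==" or a range op like ">"
function isSingleRangeScan(indexCols, filters) {
  const byField = new Map(filters.map((f) => [f.field, f.op]));
  // Consume equality filters matching the index prefix, in order.
  let i = 0;
  while (i < indexCols.length && byField.get(indexCols[i]) === "==") {
    byField.delete(indexCols[i]);
    i++;
  }
  if (byField.size === 0) return true;  // pure prefix scan
  if (byField.size > 1) return false;   // multiple leftover filters: rejected
  // The single leftover filter must be a range on the next index column.
  const op = byField.get(indexCols[i]);
  return op !== undefined && op !== "==";
}
```

Anything the check rejects would require merging multiple index ranges, so work would no longer be proportional to the result set.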

I agree -- it burns us in the SQL world too. I was just seeing it from a developer cognitive load point of view and maybe as a premature optimization, again, though if you guys and gals put this in place because of war stories from twitter then it's not premature by any means.

In the rare case when no index or the wrong index is used in a SQL query, an index hint, or updating the table's statistics to help the optimizer, is all it takes to fix it. Not having to remember the exact index for a query is better imo, but I could learn to adapt.

I should also add that Fauna indexes are more akin to views. They can cover multiple source collections and transform the data as well. So it’s not always possible to detect which index is correct via heuristic.

Ahh makes sense. Interesting.

Is it not possible to make the query optimizer deterministic?

I think it's better that every time I see a query, it does exactly the same steps. Only when testing/monitoring says so would I hint it.

Perhaps. What if you add a new index and it changes the optimization plan of existing queries that you didn’t anticipate? There are a lot of edge cases, but we are not religiously opposed to making the optimizer smarter.

Oh, I see. That makes sense.

In fact I like the idea of a RDBMS to be less of a black box.

Mongo has issues with its query optimizer. I’ve seen this happen before, killing performance. The fix was to give the query a hint on the correct index to use.

A similar issue was reported:


So I can see the justification: adding a bit of work for the developer or query writer to pick the right index means consistent query performance every time (and since it's a document anyway, finding the right index should be pretty easy, I'd imagine). I can see why one would choose this.

This looks like a fascinating approach. Unfortunately I’m not well versed enough to intelligently compare it to alternatives.

Any plans to have Jepsen / Aphyr conduct a rigorous test and write a report on the results?

We have a whole suite of internal correctness verification tools, in addition to which we are also using Jepsen. We don't have an official report out yet, but you can get a preview of our progress here: https://fauna.com/blog/verifying-transactional-consistency-w...

Might want to look into methods FoundationDB was using:


Might be useful to your team. One of the rare times I was surprised and impressed by a new company's verification process. I look forward to reading your team's write-up of techniques and lessons learned, too.

Just my opinion, having Kyle write a formal report would give a huge bump to validating the stated properties :)

Stay tuned...

thanks drift.com (and driftt.com) for driving me to learn how to block specific sites in uBlock. Audio 'pops' for a webpage chatroom is a new level of annoyance.

Thanks for the bump I needed to go fix that. Sound is off now.

I'm really confused how this scales to high transaction rates. If the replica has to redo all the reads (which means talking to multiple nodes) before it can make a commit/abort decision for the transaction this could take tens of microseconds if all nodes are in the same datacenter (if serving from RAM). Since it also has to process transactions from the log in order, that seems like it would limit the transaction rate to tens of thousands of TPS? Forget about distributing a replica across data centers or having enough data that it may not be RAM resident.

Is this actually how it works or am I missing something important?
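The arithmetic behind that concern, with assumed (not measured) numbers: if each logged transaction's reads must be re-checked strictly in order before it can be applied, per-transaction latency caps throughput directly, and only parallel validation of independent transactions raises the ceiling.

```javascript
// Back-of-the-envelope throughput bound (all figures are assumptions).
const perTxnMicros = 50;                  // assumed in-datacenter re-read cost
const sequentialTps = 1e6 / perTxnMicros; // strictly-in-order upper bound
// Pipelining helps only if transactions touching disjoint keys
// can be validated and applied in parallel:
const parallelism = 32;                   // assumed independent streams
const pipelinedTps = sequentialTps * parallelism;
```

With these numbers the sequential bound is 20,000 TPS, which is why log-ordered systems rely on applying non-conflicting transactions concurrently.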

As an aside, can somebody explain the constraints FoundationDB puts on replica distance?

FoundationDB cannot currently replicate its transaction log across datacenters, either synchronously or asynchronously: https://forums.foundationdb.org/t/multi-dc-replication/499

In the discussed proposal datacenter failure is considered an extraordinary event with major performance implications, which is different than a global "write anywhere, read anywhere" multi-datacenter partitioned log model like Fauna.

FoundationDB has long supported both synchronous and asynchronous replication across regions, and its major users use multi-regional configurations. In the former mode, you will see 1xRTT latency for commits [1] from the active datacenter and 2xRTT latency for commits from other datacenters. In the latter mode, commits from the active datacenter are fast (0xRTT) but the durability property must be sacrificed if a region fails. I believe almost all users currently use asynchronous replication because the performance is so much better (than any geographically replicated synchronous database).

The new "satellite" replication model in FoundationDB 6.0 allows the best of both worlds: if you have two or more datacenters in each region (each storing transaction logs, but only one storing actual data) then you can do synchronous (0xRTT) commits to the regional datacenters and asynchronous replication to the other region(s). If there are failures in one region, the database can quickly and automatically fail over to the other region, while maintaining durability (by getting the last bits of transaction log from at least one datacenter in the failing region). Even if a whole region fails, as long as the failures of the datacenters and network links in it are spread out over time the database will complete a full ACID recovery. And in the worst case you can fall back to ACI recovery as in asynchronous replication.

The question you are linking to is asking about something different, the ability to have different parts of a single database have subgeographic latencies in different regions.

[1] You will also see 1xRTT latencies when you start transactions, but you can solve this by setting the causal_read_risky transaction option on all read/write transactions. This is not actually risky because if the reads in the transaction turn out not to be externally consistent, the transaction will not commit.

Very interesting. People really care about latency, so we are also looking at more datacenter-local journaling schemes that maintain consistency at the expense of a theoretically unlikely hit to durability.

What do you mean, "active datacenter"? Can all datacenters accept transactions concurrently?

If you are using Raft for replication, at any given time your replica set has a leader and it is located somewhere, and I would assume that writes from near there are faster than from anywhere else. In FoundationDB this is handled at a somewhat higher layer of the system, and since our transaction pipeline is (very roughly) resolve conflicts -> transaction log -> storage rather than transaction log -> resolve conflicts -> storage, we are also doing conflict resolution in that region.

Moreover, most current users of FoundationDB aren't willing to accept even 1xRTT latencies anywhere in their transaction lifecycle, so they can't abstract away which region is the fast one. A common way (though not the only way) to set things up is that your whole application (not just the database) is active in region A, but prepared to fail over at a moment's notice to run in region B. Ideally this comes at basically no performance penalty relative to a single-region setup in region A. Alternatively, it's possible to read or write the database from passive region(s) (and individual reads will normally happen from the local region with low latency), but each transaction has to pay a geographic latency at some point (or, in the case of read-only transactions, accept reduced consistency).

I think it would be possible to implement something analogous to satellite replication for Raft. It's a really nice set of tradeoffs for many applications, and if your cloud provider or equivalent has their act at all together instantaneous failures of all the datacenters in a region or their networking should really be pretty rare.

How does FaunaDB achieve atomicity without some sort of two-phase commit? What I understood from the article is that each node in a replica commits independently from the distributed transaction log. So if a transaction updates data on multiple nodes in a replica, it can happen that a commit on one of the nodes fails because of some failure. In that case, will there be a partial commit?

I've published this article 5 months ago. Mine doesn't require global consensus for all commits, just local one and it works from there.

What's even more annoying is that I tweeted this article to them and they didn't say anything.


Sorry, let me clarify. Tweeted to them at the time. So, back in May.

Did you implement it?

I'm planning on releasing a prototype as open source in a few months. I'm also planning on releasing my patent into the public domain.

Apologies, I quickly skimmed the blog but in the summary did not see clear answers for: When is the transaction acked? What is the reader-writer consistency? If the transaction is acknowledged only after speculative work is validated after ordering, what is the value add for doing the speculative work? Would you achieve similar benefits from just batching commits?

The transaction is acknowledged after it has been durably replicated to the distributed write-ahead log, and applied by at least one replica. The first replica to apply will ack back to the coordinator and on to the client. This increases partition tolerance and reduces latency.

The speculative work ("reconnaissance query") is what enables support for dependent transactions (transactions where the write set depends on the values of the read set rather than just the keys themselves). The transaction will optimistically retry if the speculation turns out to be false.
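A sketch of that optimistic retry loop (illustrative names, not Fauna's internals): the reconnaissance phase speculates the read set, the transaction is sequenced, and at its log position the reads are re-checked; if the speculation was wrong, the whole cycle retries with fresh reads.

```javascript
// Hypothetical dependent-transaction runner with optimistic retry.
function runDependentTxn(db, body, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const reads = body.recon(db);     // speculative "reconnaissance" read set
    // ... the txn is sequenced into the log here; db may change meanwhile ...
    if (body.stillValid(db, reads)) { // re-validate reads at the log position
      body.apply(db, reads);          // deterministic: safe on every replica
      return true;
    }
    // Speculation was wrong: loop and retry with fresh reads.
  }
  return false;
}
```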

A system like CORFU where the log is serialized upfront (not post speculative commit) will give most of the benefits mentioned?

My understanding is that in CORFU the write set is always known, so this kind of transaction cannot be expressed. You could layer it on top with the same strategy.

Hmm. I’m intrigued though skeptical. Who’s using this in production that can give an unbiased take?

Not quite what you are asking, but some customers with case studies on the site: https://fauna.com/customers

Unbiased in that the article we are all referencing is coming from you guys, the creators, you can't help but show it in a positive light and I don't fault you for it. Just I've become so jaded of late that I am looking for people who have used it in the wild.

I also signed up for the cloud account -- your pricing terms are some of the best I've ever seen, kudos -- so that I can check out the API some more.

Thanks; what do you like about the pricing?

The free-forever on-premise pricing for personal use, through to the trial in production even for commercial use. And I thought the free threshold for the cloud reasonable, as I can forgo all the setup work and just use the cloud instance for free if my usage falls under the right level. Anything that saves me time and then doesn't cost me anything while learning it / trying it out is a win in my book.

This is very cool

But there is an even harder super boss level for distributed systems: Byzantine Fault Tolerance

Is FaunaDB Byzantine resistant? I have asked many projects, such as Dat, what happens if there are conflicts, and they haven’t built systems that are BFT yet.

Our deployment model assumes a single administrator. With reasonable network operations, Byzantine fault tolerance is largely irrelevant. (Although it's an important problem in other contexts.)

I think BFT can be added as an option to your database and it wouldn’t need refactoring any other part of the database. Just have crypto signatures and the log would need to use BFT consensus with a quorum before committing. May be worth it at least for marketing purposes, but also for multiple mistrusting parties jointly running the database. You’ve done so much already, may as well benefit from this addition!

We have a roadmap item to add blockchain style tamper evidence to the storage engine. The crypto we'll use is certainly a step along the path you describe. If you have a use case, we love talking to developers to help make sure we are building features you can use.

A BFT database is called a blockchain.

The global log sounds like a lynchpin but I didn't see a good explanation in the article. It sounds like all synchronization is done at the level of this log? If so, isn't that a bit of a bottleneck?

The log is distributed across the cluster, and can be partitioned. The protocol is optimized for long physical distances between replicas. See links elsewhere in the thread to Raft.

Is the issue not so much that other DBs can't provide that global consistency, but that other DBs just choose not to for the sake of speed? Like, there are trade-offs.
