
Consistency Without Clocks: FaunaDB's Distributed Transaction Protocol - evanweaver
https://fauna.com/blog/consistency-without-clocks-faunadb-transaction-protocol
======
rystsov
"unlike Google Percolator, FoundationDB, or similar systems, FaunaDB places no
constraints on replica distance and is practical to deploy at global internet
latencies"

"For each batch of parallel transactions, they are inserted into a
distributed, write-ahead transaction log"

"Replicas must achieve consensus for how to insert new transactions into the
log. FaunaDB uses an optimized Raft implementation to achieve consensus."

There are constraints on running consensus across the world (Raft): it adds at
least 200 ms to serialize txs. Also, higher latency means a longer interval
between heartbeats and hence longer downtime if the leader is isolated - a
known issue with leader-based consensus protocols (see the "There Is More
Consensus in Egalitarian Parliaments" paper[1] or the "In search of a simple
consensus algorithm" post[2]).

Google's Percolator doesn't depend on global consensus but just on a global TSO
(timestamp oracle), which is possible to implement in a way that:

- doesn't suffer from leader isolation (there is no leader)

- doesn't have a bottleneck (each node handles requests)

- doesn't touch disk on each request

Details are in the "Quorum clock: leaderless distributed clock" post[3].

[1] https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf

[2] http://rystsov.info/2017/02/15/simple-consensus.html

[3] http://rystsov.info/2018/10/01/tso.html
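
A minimal sketch of the quorum-clock idea (simplified from [3]; it ignores concurrent proposers and failures, which a real protocol has to handle, and all names are illustrative):

```typescript
// Leaderless quorum clock sketch. Each node stores only the highest timestamp
// it has acknowledged; a fresh timestamp takes two majority round-trips and
// never goes backwards, because every read quorum intersects every prior
// write quorum.

class ClockNode {
  private high = 0;
  read(): number {
    return this.high;
  }
  write(ts: number): number {
    this.high = Math.max(this.high, ts); // never move backwards
    return this.high;
  }
}

class QuorumClock {
  constructor(private nodes: ClockNode[]) {}

  next(): number {
    const quorum = Math.floor(this.nodes.length / 2) + 1;
    // Round 1: read from a majority (modeled here as the first `quorum`
    // nodes; a real client sends parallel RPCs and waits for any majority).
    const seen = this.nodes.slice(0, quorum).map(n => n.read());
    const ts = Math.max(...seen) + 1; // strictly above anything a prior quorum wrote
    // Round 2: durably record the new timestamp on a majority.
    this.nodes.slice(0, quorum).forEach(n => n.write(ts));
    return ts;
  }
}

const clock = new QuorumClock([new ClockNode(), new ClockNode(), new ClockNode()]);
console.log(clock.next(), clock.next()); // 1 2
```

No node is special, so there is no leader to isolate, and no request has to touch disk beyond recording an integer.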

------
andydb
The paper contains this claim:

> FaunaDB is an elegant, software-only solution for achieving global ACID
> transactions, with complete guarantees of serializability and consistency.

The paper makes it sound like FaunaDB claims strict serializability for all
transactions (like Spanner). This means that if Txn(A) ends before Txn(B)
begins, then Txn(B) must be guaranteed to see all data written by Txn(A).

However, what if Txn(B) is a read-only transaction? According to the paper,
read-only transactions do not contact the global sequencer. Here is an example
where an application would fail to read its own writes:

1. Txn(A) performs a write W(X) of data on replica R1.

2. Txn(A) is assigned a logical timestamp of T1 by the global log.

3. Conflict verification is triggered. Replica R1 plays back the global log and verifies there is no conflict.

4. Txn(A) completes, since one replica has verified there are no conflicts, and passes control back to the application.

5. The application now triggers a read-only Txn(B).

6. The coordinator for Txn(B) decides to read replica R2 rather than R1. However, the coordinator is not yet aware that the global log is now at timestamp T1, and it picks T0 as its snapshot timestamp.

7. It reads the value of X on replica R2, which does not yet reflect W(X).

From the application's point of view, it did not read its own writes.

I'm not familiar enough with FaunaDB's subtleties to know if this scenario is
possible in practice. Perhaps you could comment?

I did notice that at the end of the article, the language is carefully phrased
to only make the claim of "serializable" (not "strictly serializable") for
read-only transactions. But that would fall short of the guarantees that
Spanner and Calvin make, and undermine the "complete guarantees of
serializability and consistency".

~~~
evanweaver
Read-only transactions are serializable by default but can be upgraded to
strict serializability in two ways: from the perspective of a single client,
by passing a transaction timestamp that serves as a causal token (this is free
and happens by default with our drivers); or globally, by opting in to the log
at the price of additional latency.

~~~
andydb
But opting into the log for read-only txns would destroy performance for many
apps, both in terms of latency and throughput. This is rarely practical since
so many apps are read-heavy. I assume you don't recommend that to your
customers, especially in global deployments.

Causal tokens are practical from a performance point of view, but do require
application awareness and changes, which presents a different set of issues.
Most application developers would not appreciate the subtleties and just use
the default behavior, which does not guarantee "read your own writes". And
causal tokens don't prevent the more general case of stale reads.

This would mean that in the default FaunaDB configuration, applications could
see anomalies:

1. Inability to read own writes - described above, with the additional
clarification that it can't be fully solved at the driver level. For example,
say I call a web API that creates a blog post and returns a PostID. I then
call another API that edits that post, passing it the PostID I got back from
the first. If FaunaDB is in use, the second call could fail to find the
PostID, because I happened to hit a different replica that did not yet have
it.

2. Stale reads - could be arbitrary levels of staleness if
replica/coordinator haven't been able to sync to the global log. If a
coordinator node and a replica were stuck in their own little partition, cut
off from the rest of the cluster, they'd still happily serve stale reads
(until perhaps they detect a partition and shut down?).

These read anomalies aren't usually considered compatible with a consistent
system, but instead are hallmarks of an eventually consistent system. It seems
misleading to make an unqualified 100% consistency claim when it doesn't apply
to read-only txns by default, and when the fully consistent mode is not
practical to use in global deployments.

~~~
evanweaver
Really putting me on the spot when our CTO is out of town. My initial reply
was ambiguous and appears to suggest that the transaction highwater
mark/causal token preserves strict serializability. It preserves
serializability. Strict serializability requires the log.

Let me try to explain more clearly, and you tell me what you would call it. It
is subtle because there are three coordination points: the log, the replica,
and the client.

The log is marching along, moving time itself forward. Every node within a
replica is following the log monotonically for a disjoint set of partitions.
Every client process is querying an arbitrary and changing set of nodes within
a replica, which is normally a datacenter.

Every node in every replica maintains a highwater mark (the causal token) of
the last transaction time it has applied. Every client does the same, and
keeps a highwater mark of the last transaction time (read or write) it has
seen. If it was a write, the highwater mark came from the first replica to
apply the transaction from the log. If it was a read, it came from the local
node, based on the last transaction time that node has applied.

These transaction times are passed back and forth with every request, from
client to replica and back, automatically by the Fauna drivers. Thus, a
replica node will block if a client performs a read-only transaction
presenting a transaction time which it has not yet replicated from the log.
And if a client receives a later time from a replica or from the log, it will
update its local highwater mark so that its view of time (and thus its key
set) can never backtrack.
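
Roughly, in code (an illustrative model of the exchange just described, not the actual driver internals; the names and timings are mine):

```typescript
// Causal-token exchange: every response carries the replica's last-applied
// transaction time, the client keeps the max it has seen, and a replica
// delays a read until it has applied at least up to the client's mark.

class ReplicaNode {
  applied = 0; // last transaction time applied from the global log

  // Block a read until this node has replicated past the client's token.
  async read(key: string, clientMark: number): Promise<{ value: string; ts: number }> {
    while (this.applied < clientMark) {
      await new Promise(r => setTimeout(r, 10)); // wait for log replay (or time out)
    }
    return { value: `value-of-${key}@${this.applied}`, ts: this.applied };
  }
}

class Client {
  private highwater = 0; // causal token: last transaction time seen

  async read(node: ReplicaNode, key: string): Promise<string> {
    const { value, ts } = await node.read(key, this.highwater);
    this.highwater = Math.max(this.highwater, ts); // time never backtracks
    return value;
  }

  // A write ack from whichever replica applied the txn first also advances
  // the mark, so later reads against a lagging local node block rather than
  // returning a stale value.
  onWriteAck(txnTime: number) {
    this.highwater = Math.max(this.highwater, txnTime);
  }
}
```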

1a. If the client process is long-lived and maintains the highwater mark, then
it is guaranteed to read its own writes and maintains serializability (but not
strict serializability) for all keys on all partitions that own any keys it
has queried via log write or local read. Even if the write acknowledgement
comes from another replica, the highwater mark is still advanced and any local
read partitions will block until they have replicated beyond that point. A
stateful webserver would maintain the highwater mark internally across
requests.

1b. If the client process is not long-lived (a Javascript call directly from a
browser, for example, which is possible with FaunaDB's security model), then
you will still read your own writes _as long as you do not accidentally
switch replicas (e.g., datacenters)_, because there is only one node per replica
that owns any specific key. In practice, switching datacenters during a client
session is a pathological edge case. The most likely scenario would be that
the client performs a write, receives a commit acknowledgement from a remote
datacenter, discards the transaction time, queries the locally partitioned
node for the key it just wrote, and does not see the update. This could be
eliminated by threading replica affinity through the entire write/read cycle.

2. Stale reads from partitioned nodes are no different than increased client-
server latency, because if the client can write, it will receive a higher
transaction time, and its subsequent reads will block and timeout, preserving
the serial order of operations. If it doesn't write, as far as it is aware,
time is just running a little slow, and the data will be stale from an
external consistency perspective, but order will be preserved. This is a
violation of strict serializability but not of serializability. In this case
if the client switches datacenters, it will read or write and get an updated
transaction highwater mark, and will no longer be able to successfully read
from the partitioned node because it knows about the future now.

I should note that Spanner-derived systems typically offer some variant of
serializability or worse for read-only transactions since nobody wants to wait
around for the clock ambiguity window to expire. Spanner itself can wait, but
it still comes at a cost, and Spanner goes out of its way to offer bounded
staleness reads that are low latency but do not meet any serializability
guarantee at all, as far as I can tell.

~~~
andydb
Sorry to put you on the spot. Perhaps your CTO can chime in when he gets back.

Serializable tends to sound really good to database people, because it's the
"highest" isolation level available in most databases. However, those
databases are built for a single machine, and they really offer "strict
serializable" without calling it that, because they maintain a single log
which serializes all writes, and all reads operate as of the latest time. With
the advent of distributed databases, the "strict" part is becoming more
relevant, because it's no longer something you just get without realizing it.

I gave an example of the "blog post" anomaly that occurs in a system that only
guarantees serializability. As another example, imagine I am using FaunaDB to
store bank account balances. As the developer, I design a system where any
time a customer receives money to their account, I send them a text giving the
amount and their new balance, along with a link to the website where they can
get more details. Except, when they click the link and go to the website, they
don't see any change to their balance. Furthermore, when they refresh the
page, sometimes they _do_ see their new balance, and then they refresh again,
and it disappears!

More investigation reveals that FaunaDB is in a partitioned state, and was
serving stale bank account balances from some replicas. The replicas were
behind a load balancer, and sometimes the web client hit one replica, and
sometimes another, which explains the alternating balances. Is it possible to
solve this problem using clever application logic, causal tokens, or saying
"don't do that"? Sure. But a consistent database is supposed to prevent this
scenario, not the application. Also, note that the "this is no different than
increased client-server latency" argument does not hold here. People expect
that if they finish action A (like writing to the DB), then do action B (like
clicking a web link that reads the DB), that B will _always_ see the results
of A. The stale bank account balance isn't a delayed value; it's a wrong
value.

You're correct that most systems allow some variant of "secondary" or
"follower" reads that can return stale data, in exchange for better
performance. I'm all for this kind of feature, as long as the developer
understands the tradeoff they're making. Moreover, I think it's appropriate
for FaunaDB to offer the same. But I'd advise against making unqualified
claims of "complete guarantees of serializability and consistency" and the
"latency profile of an eventually-consistent system like Apache Cassandra".
FaunaDB is able to offer one or the other, but not both at once.

~~~
evanweaver
Having the read information sent from one location (presumably where the write
occurs) and then read from another is a good example of one of the few times
this anomaly can occur.

I think it's a stretch to say that including a single timestamp in the link is
too hard for developers. It happens by default across queries within a process
with the Fauna drivers, so the database _is_ preventing it; the only place you
have to think about it is multi-process, multi-location reads. Also, having a
load balancer jump back and forth between two replicas would not normally occur
because of geo routing, a soft form of datacenter affinity, but it is
possible. Note that in your example, even if the link didn't include the
timestamp, once they visit the webpage, the webserver should begin managing
the timestamp in process or in a session cookie, preventing the balance
from disappearing once seen for the first time.

In my experience as a DBA, the legacy database experience is much worse,
because in practice there are always physically distinct read replicas in the
mix for scalability and hot failover that have no consistency guarantees at
all, due to various faults like statement-based replication desynchronization,
the lack of any ability to share log cursor information, and the like. I am not making
a worse is better argument. I'm just pointing out that you don't get any C at
all for the feeble amount of A gained under CAP in practice with most “single
node” systems.

You are correct that the latency argument does not hold if you do not pass the
timestamp information across queries. But if the observer can observe multiple
replicas in the database topology, it is always possible to use the observer
itself as the communication channel for the timestamp to avoid the anomaly. If
you want to maintain strict serializability in all cases, but minimum-latency
reads, the writer could block on transaction application acknowledgement from
all replicas. Obviously that gives up availability for writes under partition,
but it may be reasonable in your asynchronous example in order to avoid
sending the text message too early. Perhaps, for example, the message is "go to
your local ATM" instead of a link with a timestamp. Although I would expect
the ATM to have the latency budget to do a linearized read even in a global
context.

We are always happy to make the descriptions of the behavior and consistency
levels more precise.

~~~
evanweaver
Also: please email us! It would be great to talk further.

------
hardwaresofton
I've come to the realization that sharding might be the only way to actually
scale multi-master systems without allowing stale (or potentially outright
wrong) reads.

Sharding is complex and makes parallel queries a requirement, but it really
does seem like the only way to distribute a database. In the end, once you
have the op log on more than one machine, they're going to _have_ to
synchronize with each other; otherwise you risk giving out not just stale
answers, but completely wrong answers. This is fine for a site like FB, but is
not OK if it's attached to military equipment.

CRDTs don't help in the case where you don't want to deliver wrong answers;
they only help in making sure merges (at sync time) are resolvable without
conflicts. Lamport clocks only give you a partial order, and you need
something else (physical clocks) for a total order, which is what you get from
Raft (because you only let one node manage the log in the first place).
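
For the Lamport clock point, a minimal textbook implementation shows why the order is only partial:

```typescript
// Standard Lamport clock: "a happened-before b" implies L(a) < L(b), but not
// the converse, so causally concurrent events are not totally ordered.

class LamportClock {
  private t = 0;
  localEvent(): number { return ++this.t; }  // tick on every local event
  send(): number { return ++this.t; }        // tick, attach counter to message
  receive(msgTs: number): number {           // merge on receipt
    this.t = Math.max(this.t, msgTs) + 1;
    return this.t;
  }
}

const a = new LamportClock();
const b = new LamportClock();
const e1 = a.localEvent(); // 1
const e2 = b.localEvent(); // 1 -- same counter, causally concurrent with e1
// To get a total order you must break ties by some external rule (node id),
// or impose one via a single log, which is what Raft does.
console.log(e1, e2, a.receive(b.send())); // 1 1 3
```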

What I haven't seen yet/want to try is to see if sharded + replicated Raft
could introduce some interesting performance benefits. It's basically max
complexity (consensus complexity, plus sharding, plus multiple instances of
Raft at the same time), but it could be the only way to distribute write load
(smart routing/forwarding of requests) and increase availability (eventually
consistent replication for every node of every shard) at the same time. This
is basically what FaunaDB does, but I'm not 100% sure about splitting a single
table across replicas...

~~~
rystsov
CRDTs may help: if, for a read (or write), you read from a quorum, merge (and
modify), and write back to a quorum, then you'll always get right answers.

~~~
hardwaresofton
I'm not sure I'm understanding what you're suggesting correctly, but it sounds
like you're just quorum-ing everything. You shouldn't need improved merging if
you quorum both reads and writes, since that implies the majority agreeing on
the contents of the log to begin with.

~~~
evanweaver
Quorum/quorum does not guarantee agreement or serializability even for a
single key.

Imagine a write that commits to one replica but is delayed committing to two
others. Some quorum reads will see it (but also see a tie with the old
value...what breaks the tie? Certainly not a timestamp) and others will not
see it at all, indefinitely.

It’s easy to end up in a state where every replica has a different value
because of torn writes.

~~~
hardwaresofton
Sorry, I think either my fundamental understanding is off or I wasn't clear
enough in how I was imagining this happening...

If you and a majority of nodes agree on a value like a CRDT OpSet (more
simplistically, just agreeing on the state of the log), how does that not
guarantee agreement and serializability? It is impossible from that point on
to have another majority of nodes have some other view of what happened.
Consensus algorithms are built on exactly this kind of quorum intersection.

One-copy serializability is exactly what would be achieved by having a read
and write quorum[0][1]. It intuitively makes sense to me (and maybe my
intuition is wrong), but if you talk to a majority of nodes and ask "this is
what we all have, right?" before every write and every read, you've got a
guaranteed consistent state (of course, progress isn't guaranteed if
partitions happen, nodes die, etc.).

AFAIK Quorums _are_ the state of the art (and arguably the only relatively
efficient option) as far as achieving serializability in a distributed
system...

[0]: https://en.wikipedia.org/wiki/Quorum_(distributed_computing)#Quorum-based_voting_for_replica_control

[1]: https://arxiv.org/pdf/1406.7423.pdf

~~~
evanweaver
> If you talk to a majority of nodes and ask "this is what we all have,
> right?"

You cannot know this, because the transaction replication is racy and not
atomic--it may have applied to only one node while you are doing your read.
Whether you see it or don't is luck. So you can have the following scenario
(and in practice you will):

- TX commit begins

- TX replicated to node A

- Read from coordinator A' begins

- Read sees replica A (has tx) and replica B (does not)

- Read assumes A wins because of some kind of vector clock in the data value
(choosing the older value doesn't make things better, just in case you are
wondering)

- Read from coordinator B' begins

- Read sees B (no TX) and C (no TX)

- Read completes with stale value--serializability violation has occurred

- TX finishes replicating to B and C

This leaves aside write skew, torn transactions due to partitions, and all
kinds of other problems.
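
You can see the race in a toy model (three replicas, quorum of two, no prepare phase; purely illustrative):

```typescript
// Toy reproduction of the scenario above: replicas apply writes whenever they
// arrive, with no consensus round, and reads pick the highest version seen.

type Versioned = { value: string; version: number };

const replicas: Versioned[] = [
  { value: "old", version: 1 }, // A
  { value: "old", version: 1 }, // B
  { value: "old", version: 1 }, // C
];

// Read from a majority and break the tie by version number.
function quorumRead(indices: number[]): Versioned {
  return indices
    .map(i => replicas[i])
    .reduce((a, b) => (a.version >= b.version ? a : b));
}

// TX commit begins: the write reaches replica A only.
replicas[0] = { value: "new", version: 2 };

// Coordinator A' reads {A, B}: sees version 2 and returns "new".
console.log(quorumRead([0, 1]).value); // "new"

// Coordinator B' reads {B, C}: both still at version 1, so it returns "old".
// A later read observes an earlier state, even though both reads contacted a
// legitimate majority -- the serializability violation described above.
console.log(quorumRead([1, 2]).value); // "old"

// Replication to B and C finishes only now.
replicas[1] = { value: "new", version: 2 };
replicas[2] = { value: "new", version: 2 };
```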

~~~
hardwaresofton
I'm super confused -- what you're describing isn't a quorum read/write
scenario -- what do you mean by "A wins"? Also, where is the prepare phase for
2-phase commit/any consensus algo? Replica A shouldn't be reporting an
unacknowledged transaction to coordinators -- the write hasn't been quorum-
acknowledged. A TX is not considered committed until it reaches a majority of
nodes. You are right that if you have a network partition you're in trouble,
but that ends in lack of progress, _not_ loss of serializability.

We must be talking about different things, because I can't find any literature
that has reached the conclusion that serializability is impossible in
distributed transactions. Can you point me to that?

Also, do you have any thoughts on the literature[0] that contradicts what
you're saying? I'm not an expert, but unless I'm misreading the English,
serializability _is_ possible with quorum reads and writes.

> In a new algorithm for maintaining replicated data, every copy of a
> replicated file is assigned some number of votes. Every transaction collects
> a read quorum of rvotes to read a file, and a write quorum of wvotes to
> write a file, such that r+w is greater than the total number of votes
> assigned to the file. This ensures that there is a non-null intersection
> between every read quorum and every write quorum. Version numbers make it
> possible to determine which copies are current. The reliability and
> performance characteristics of a replicated file can be controlled by
> appropriately choosing r, w, and the file's voting configuration. The
> algorithm guarantees serial consistency, admits temporary copies in a
> natural way by the introduction of copies with no votes, and has been
> implemented in the context of an application system called Violet.

Has this paper been refuted? In addition to this there's literally the whole
section on distributed serializability[1].

[0]: https://dl.acm.org/citation.cfm?doid=800215.806583

[1]: https://en.wikipedia.org/wiki/Serializability#Distributed_serializability

~~~
evanweaver
> what you're describing isn't a quorum read/write scenario -- what do you
> mean by "A wins"? Also, where is the prepare phase for 2-phase commit/any
> consensus algo?

There is no consensus; that requires a leader system. The paper you link
appears to require a multi-phase lock; the quorum itself does not guarantee
serializability. Explicit preparation via quorum can guarantee serializability
(but not strict serializability, I don't think), but cleanup of locks is a big
performance problem in practice.

> Replica A shouldn't be reporting an unacknowledged transaction to
> coordinators -- the write hasn't been quorum-acknowledged.

Acknowledged by who? The replicas can't block on the other replicas; they just
tell the coordinator when they applied the write.

This is worth a blog post.

------
netgusto
Summary, taken from the article:

> To summarize the overall FaunaDB protocol, each transaction proceeds in
> three phases:

> 1. The first phase is a speculative phase in which reads are performed as of a recent snapshot, and writes are buffered.

> 2. Next, a consensus protocol is used (Raft) to insert the transaction into a global log, which results in the transaction receiving a global transaction identifier that specifies its equivalent serial order relative to all other transactions that are being processed by the system. This is the only point at which global consensus is required.

> 3. Finally, a check begins in each replica which verifies the speculative work. If that speculative work did not result in potential violations of serializability guarantees, then the work becomes permanent and the buffered writes are written back to the database. Otherwise, the transaction is aborted and restarted.
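
As a toy model of those three phases (an in-memory sketch of the shape of the protocol, not Fauna's code; the Raft round is stood in for by a local log append):

```typescript
// One replica's storage plus a stand-in for the global log.
type Versioned = { value: number; version: number };
const store = new Map<string, Versioned>();
const log: { reads: Map<string, number>; writes: Map<string, number> }[] = [];

store.set("x", { value: 0, version: 0 });

function runTransaction(body: (read: (k: string) => number) => Map<string, number>) {
  while (true) {
    // Phase 1: speculative reads against the current snapshot; writes buffered.
    const readVersions = new Map<string, number>();
    const read = (k: string) => {
      const v = store.get(k)!;
      readVersions.set(k, v.version); // remember what was read, at what version
      return v.value;
    };
    const writes = body(read);

    // Phase 2: the only global coordination point. A local append stands in
    // for the Raft round that fixes the transaction's serial position.
    const txnId = log.push({ reads: readVersions, writes });

    // Phase 3: verify the speculation at position txnId. Every replica can do
    // this independently because the log order is already fixed.
    const valid = [...readVersions].every(([k, ver]) => store.get(k)!.version === ver);
    if (valid) {
      writes.forEach((value, k) => store.set(k, { value, version: txnId }));
      return;
    }
    // Conflict: abort and restart optimistically with fresh reads.
  }
}

// Usage: an increment expressed as a dependent transaction.
runTransaction(read => new Map([["x", read("x") + 1]]));
console.log(store.get("x")); // { value: 1, version: 1 }
```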

~~~
convolvatron
optimistic concurrency control:
https://www.eecs.harvard.edu/~htk/publication/1981-tods-kung-robinson.pdf

------
agentultra
I wish projects like this would publish their proofs with their protocols.
It's interesting, to be sure, but I find all of the prose and diagrams too
verbose. I'd much rather read the mathematical model of the transaction
protocol.

~~~
jchrisa
You can read the paper that inspired the FaunaDB transaction protocol here:
http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf

Or if you want a very concise description, see slide 13 from Daniel Abadi's
presentation: https://www.slideshare.net/abadid/the-power-of-determinism-in-database-systems

~~~
agentultra
Thanks for the links! I meant the actual proof or model they use to verify the
protocol. Is the FaunaDB team using TLA+, Coq, Lean?

------
fernly
The concept of the "global log" is mentioned many times in the article, but
there is not much detail on this critical piece:

> ... the consensus protocol is only being used for inserting transactions
> into a global log. For every other part of the protocol, replicas can
> proceed completely independently from each other.

Clearly this central resource can't be distributed or replicated. Right? So
it, and the hardware it runs on, and the communications links to it, are a
single point of failure. True?

If the global log goes down, or can't be reached at all, or perhaps only from
some region of the regionized application, what happens?

~~~
jamesblonde
You are correct that they serialize cross-partition transactions. So if your
workload has a reasonable percentage of write/update/delete operations, it is
possible that you will bottleneck on the global coordinator. From this blog,
https://fauna.com/blog/distributed-acid-transaction-performance, you can get
3300 transactions/second. Daniel Abadi claims you can get to 500,000
trans/sec: http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html

In any case, there is an open-source in-memory database called NDB (MySQL
Cluster) that can get to 100 million write transactions per second. The echo
chamber of SV has ignored it, but it's real:
http://mikaelronstrom.blogspot.com/2015/03/200m-reads-per-second-in-mysql-cluster.html

NDB has what I call lock-aware programming. When you read a row, you take
either a read_lock or a write_lock, as it only supports the READ_COMMITTED
isolation level. But it does not serialize cross-partition transactions.
Transaction coordinators run independently, and it is up to the developer to
ensure consistency by taking either read or write locks on rows.
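
For example, with the mysql2 Node driver against MySQL Cluster (connection details and table schema here are placeholders), the developer takes the locks explicitly:

```typescript
// Lock-aware access under READ_COMMITTED: the engine won't serialize the
// transactions for you, so you take row locks yourself.
import mysql from "mysql2/promise";

async function transfer(from: number, to: number, amount: number) {
  const conn = await mysql.createConnection({
    host: "ndb-mysqld", // placeholder connection details
    user: "app",
    database: "bank",
  });
  try {
    await conn.beginTransaction();
    // Exclusive row locks up front (a shared read lock would instead use
    // SELECT ... LOCK IN SHARE MODE).
    await conn.query(
      "SELECT balance FROM accounts WHERE id IN (?, ?) FOR UPDATE",
      [from, to]
    );
    // ... check the returned balances, then write while the locks are held ...
    await conn.query("UPDATE accounts SET balance = balance - ? WHERE id = ?", [amount, from]);
    await conn.query("UPDATE accounts SET balance = balance + ? WHERE id = ?", [amount, to]);
    await conn.commit();
  } catch (e) {
    await conn.rollback();
    throw e;
  } finally {
    await conn.end();
  }
}
```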

~~~
jchanimal
The log segments committed by FaunaDB contain batches of transactions, which
means our throughput is constrained not by our consensus protocol, but rather
by conflicts as transactions are resolved. The benchmarked 3300
transactions/second mentioned is for complex transactions with dozens of reads
and writes. Additionally, read-only transactions are not run through the
transaction pipeline, since they can be served consistently with low latency
from replicas using FaunaDB's snapshot storage.

More important for most applications than theoretical benchmarks is taking a
sound approach and using a database you can depend on. Asking your development
team to manage isolation is going to introduce more costs than benefits,
except in the most specialized of circumstances.

FaunaDB's model is always ACID, so you never have to guess what's in the
database, even in the presence of region failures or other anomalies.

~~~
jamesblonde
We built HopsFS on NDB, not CalvinDB, because we needed performance for cross-
partition transactions. Some workloads need it. In the filesystem workload,
when FS path components are normalized and stored on different partitions,
practically all transactions cross partitions. So if you serialize them, then
writing a file in /home/jchan will block writing a file in /home/jim. This is
what most distributed filesystems already do - have a global write lock on the
filesystem metadata. I like having the freedom to build my concurrency
model on top of READ_COMMITTED, as I can squeeze out 16X performance
improvements by doing so (ref:
https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi )

~~~
evanweaver
Sure, if you don't care about serializability, you don't need to pay consensus
costs to enforce it, but you will be subject to read and write skew.

------
evanweaver
We're here, ready for your questions to be consistently replicated across the
WAN at the lowest latency information science allows ;-)

~~~
ummonk
Not a database person, so I might be missing context, but why is FaunaDB the
only implementation of this?

Intuitively, it seems to be exactly the way I would do a distributed database
- distributed transactional storage with a global ordering of transactions
that are based on ids rather than timestamps. The only thing that I wouldn't
intuitively come up with is the actual consensus algorithm, but that already
existed...

Both the tendency to use declarative rather than transactional queries and the
tendency to use timestamps with clock skew rather than an id-based ordering
always seemed surprising to me.

~~~
zzzcpan
This is such a niche subcategory of distributed databases [1] that it's a
miracle even a few of them exist, and you want more implementations.
Intuitively, distributed systems try to avoid consensus/coordination, as doing
those things over the public internet is not that useful.

[1] https://trends.google.com/trends/explore?date=2016-09-20%202018-10-20&q=FaunaDB,FoundationDB,CockroachDB,TiDB

Compare that to some other distributed database:
https://trends.google.com/trends/explore?date=2016-09-20%202018-10-20&q=FaunaDB,FoundationDB,CockroachDB,TiDB,DynamoDB

~~~
jchrisa
The property that attracts the most mission-critical use cases is the ability
to keep transactions rolling without loss of guarantees, even while hitting
the typical cloud failure modes. Some applications can't make tradeoffs
between correctness and global availability; for them, FaunaDB's mainframe-like
capabilities are key.

For the average application, the benefit is that you don't need to write code
to address database drift / eventual consistency. I expect Fauna-like systems
to gain popularity as more developers come to expect cloud-native solutions to
offer global ACID transactions.

~~~
zzzcpan
Well, I don't expect distributed ACID transactions to ever become a thing
developers expect. I see people try them, but leave unhappy because of the
unexpectedly bad latency and performance. Hence the lack of growth in trends.

Furthermore, eventual consistency was never problematic; that was just a myth.
It was never hard, certainly not harder than ACID transactions in all those
MVCC systems. So there's not much for an average application to benefit from.
And today, with all the research in strong eventual consistency, applications
can have all that correctness without coordination and without sacrificing
performance, latency, and availability. This is where the opportunity for
distributed databases still exists; performance and latency are sellable.

------
kuujo
“Consistency without clocks” seems like a misnomer, considering Raft indexes
are tantamount to a logical clock. Perhaps it should be “consistency without
wall clocks.” But that’s not really novel by itself, except in the context of
geo-distributed databases.

------
gigatexal
This is a weakness imo:

`client.query( q.get( q.match( q.index("posts_by_title"), "My cat and other
marvels" ) ))`

Having to specify the index in a query is a regression, imo. At least, I've
been spoiled by not having to do so when using Mongo or SQL databases, since
the query engine will more often than not find the right index.

~~~
evanweaver
This a design choice to guarantee predictable performance at scale, not an
architectural limitation, and subject to revision.

You don't want your service to collapse when the query optimizer suddenly
doesn't choose the right index anymore. That was a hard lesson learned for us
at Twitter.

~~~
gigatexal
I agree -- it burns us in the SQL world too. I was just seeing it from a
developer cognitive-load point of view, and maybe as a premature optimization;
though again, if you guys and gals put this in place because of war stories
from Twitter, then it's not premature by any means.

In the rare case when no index or the wrong index is used in a SQL query, an
index hint is all it takes to fix it, or updating the table's statistics to
help the optimizer. Not having to remember the exact index for a query is
better, imo, but I could learn to adapt.

~~~
evanweaver
I should also add that Fauna indexes are more akin to views. They can cover
multiple source collections and transform the data as well. So it’s not always
possible to detect which index is correct via heuristic.
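
For illustration, a multi-collection index with bindings might look roughly like this in the JavaScript driver (a sketch from memory; treat the exact shapes and field names as approximate rather than authoritative FQL):

```typescript
// Hedged sketch: an index covering two collections and transforming each
// document via bindings -- closer to a materialized view than a plain
// secondary index. Names and structure are illustrative.
import faunadb, { query as q } from "faunadb";

const client = new faunadb.Client({ secret: "YOUR_FAUNADB_SECRET" });

client.query(
  q.CreateIndex({
    name: "all_titles",
    source: [
      {
        class: q.Class("posts"),
        fields: {
          title: q.Query(q.Lambda("doc", q.Select(["data", "title"], q.Var("doc")))),
        },
      },
      {
        class: q.Class("pages"),
        fields: {
          // A different transformation per source collection.
          title: q.Query(q.Lambda("doc", q.Select(["data", "heading"], q.Var("doc")))),
        },
      },
    ],
    values: [{ binding: "title" }],
  })
);
```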

~~~
gigatexal
Ahh makes sense. Interesting.

------
wikibob
This looks like a fascinating approach. Unfortunately I’m not well versed
enough to intelligently compare it to alternatives.

Any plans to have Jepsen / Aphyr conduct a rigorous test and write a report on
the results?

~~~
jchrisa
We have a whole suite of internal correctness verification tools, in addition
to which we are also using Jepsen. We don't have an official report out yet,
but you can get a preview of our progress here:
https://fauna.com/blog/verifying-transactional-consistency-with-jepsen-and-faunadb

~~~
wikibob
Just my opinion, having Kyle write a formal report would give a huge bump to
validating the stated properties :)

~~~
evanweaver
Stay tuned...

------
donpdonp
thanks drift.com (and driftt.com) for driving me to learn how to block
specific sites in uBlock. Audio 'pops' for a webpage chatroom is a new level
of annoyance.

~~~
jchrisa
Thanks for the bump I needed to go fix that. Sound is off now.

------
grogers
I'm really confused how this scales to high transaction rates. If the replica
has to redo all the reads (which means talking to multiple nodes) before it
can make a commit/abort decision for the transaction this could take tens of
microseconds if all nodes are in the same datacenter (if serving from RAM).
Since it also has to process transactions from the log in order, that seems
like it would limit the transaction rate to tens of thousands of TPS? Forget
about distributing a replica across data centers or having enough data that it
may not be RAM resident.

Is this actually how it works or am I missing something important?

------
whargarbl
As an aside, can somebody explain the constraints FoundationDB puts on replica
distance?

~~~
evanweaver
FoundationDB cannot currently replicate its transaction log across
datacenters, either synchronously or asynchronously:
https://forums.foundationdb.org/t/multi-dc-replication/499

In the discussed proposal, datacenter failure is considered an extraordinary
event with major performance implications, which is different from a global
"write anywhere, read anywhere" multi-datacenter partitioned log model like
Fauna's.

~~~
voidmain
FoundationDB has long supported _both_ synchronous and asynchronous
replication across regions, and its major users use multi-regional
configurations. In the former mode, you will see 1xRTT latency for commits [1]
from the active datacenter and 2xRTT latency for commits from other
datacenters. In the latter mode, commits from the active datacenter are fast
(0xRTT) but the durability property must be sacrificed if a region fails. I
believe almost all users currently use asynchronous replication because the
performance is so much better (than any geographically replicated synchronous
database).

The new "satellite" replication model in FoundationDB 6.0 allows the best of
both worlds: if you have two or more datacenters in each region (each storing
transaction logs, but only one storing actual data) then you can do
synchronous (0xRTT) commits to the regional datacenters and asynchronous
replication to the other region(s). If there are failures in one region, the
database can quickly and automatically fail over to the other region, while
maintaining durability (by getting the last bits of transaction log from at
least one datacenter in the failing region). Even if a whole region fails, as
long as the failures of the datacenters and network links in it are spread out
over time the database will complete a full ACID recovery. And in the worst
case you can fall back to ACI recovery as in asynchronous replication.

The question you are linking to is asking about something different, the
ability to have different parts of a single database have subgeographic
latencies in different regions.

[1] You will _also_ see 1xRTT latencies when you start transactions, but you
can solve this by setting the causal_read_risky transaction option on all
read/write transactions. This is not actually risky because if the reads in
the transaction turn out not to be externally consistent, the transaction will
not commit.

~~~
evanweaver
Very interesting. People really care about latency, so we are also looking at
more datacenter-local journaling schemes that maintain consistency at the
expense of a theoretically unlikely hit to durability.

What do you mean, "active datacenter"? Can all datacenters accept transactions
concurrently?

~~~
voidmain
If you are using Raft for replication, at any given time your replica set has
a leader and it is located somewhere, and I would assume that writes from near
there are faster than from anywhere else. In FoundationDB this is handled at a
somewhat higher layer of the system, and since our transaction pipeline is
(very roughly) resolve conflicts -> transaction log -> storage rather than
transaction log -> resolve conflicts -> storage, we are also doing conflict
resolution in that region.

Moreover, most current users of FoundationDB aren't willing to accept even
1xRTT latencies anywhere in their transaction lifecycle, so they can't
abstract away which region is the fast one. A common way (though not the only
way) to set things up is that your whole application (not just the database)
is active in region A, but prepared to fail over at a moment's notice to run
in region B. Ideally this comes at basically no performance penalty relative
to a single-region setup in region A. Alternatively, it's possible to read or
write the database from passive region(s) (and individual reads will normally
happen from the local region with low latency), but each transaction has to
pay a geographic latency at some point (or, in the case of read-only
transactions, accept reduced consistency).

I think it would be possible to implement something analogous to satellite
replication for Raft. It's a really nice set of tradeoffs for many
applications, and if your cloud provider or equivalent has their act at all
together, _instantaneous_ failures of all the datacenters in a region or their
networking should really be pretty rare.

------
chappar
How does FaunaDB achieve atomicity without some sort of 2-phase commit? What
I understood from the article is that each node in a replica commits
independently from the distributed transaction log. So, if a transaction
updates data in multiple nodes in a replica, then it can happen that the
commit on one of the nodes fails because of some failure. In that case, will
there be a partial commit?

------
andras_gerlits
I published this article 5 months ago. Mine doesn't require global
consensus for all commits, just a local one, and it works from there.

What's even more annoying is that I tweeted this article to them and they
didn't say anything.

https://medium.com/p/optimistic-acid-transactions-4f193844bb5f

~~~
mirlyn86
Did you implement it?

~~~
andras_gerlits
I'm planning on releasing a prototype as open source in a few months. I'm also
planning on releasing my patent into the public domain.

------
kerneltime
Apologies, I quickly skimmed the blog, but in the summary I did not see clear
answers for: When is the transaction acked? What is the reader-writer
consistency? If the transaction is acknowledged only after speculative work is
validated after ordering, what is the value-add of doing the speculative
work? Would you achieve similar benefits from just batching commits?

~~~
evanweaver
The transaction is acknowledged after it has been durably replicated to the
distributed write-ahead log, and applied by at least one replica. The first
replica to apply will ack back to the coordinator and on to the client. This
increases partition tolerance and reduces latency.

The speculative work ("reconnaissance query") is what enables support for
dependent transactions (transactions where the write set depends on the values
of the read set rather than just the keys themselves). The transaction will
optimistically retry if the speculation turns out to be false.
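
The commit path looks roughly like this (an illustrative sketch of the description above, not our actual internals; the function names are mine):

```typescript
// Commit is acknowledged once the txn is durable in the replicated log AND
// the first replica -- whichever one that happens to be -- reports that it
// has applied it.

type TxnId = number;

async function commit(
  appendToLog: (txn: object) => Promise<TxnId>,   // durable, consensus-replicated append
  replicas: ((id: TxnId) => Promise<number>)[]    // each resolves with its apply timestamp
): Promise<{ txnId: TxnId; appliedAt: number }> {
  const txnId = await appendToLog({});            // durability point
  // Every replica applies independently; the fastest one acks the client.
  const appliedAt = await Promise.race(replicas.map(r => r(txnId)));
  return { txnId, appliedAt };                    // becomes the client's causal token
}
```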

~~~
kerneltime
Would a system like CORFU, where the log is serialized up front (not post-
speculative-commit), give most of the benefits mentioned?

~~~
evanweaver
My understanding is that in CORFU the write set is always known, so this kind
of transaction cannot be expressed. You could layer it on top with the same
strategy.

------
gigatexal
Hmm. I’m intrigued though skeptical. Who’s using this in production that can
give an unbiased take?

~~~
evanweaver
Not quite what you are asking, but some customers with case studies on the
site: [https://fauna.com/customers](https://fauna.com/customers)

~~~
gigatexal
Unbiased in that the article we are all referencing is coming from you guys,
the creators; you can't help but show it in a positive light, and I don't fault
you for it. It's just that I've become so jaded of late that I am looking for
people who have used it in the wild.

I also signed up for the cloud account -- your pricing terms are some of the
best I've ever seen, kudos -- so that I can check out the API some more.

~~~
evanweaver
Thanks; what do you like about the pricing?

~~~
gigatexal
The free-forever on-premise pricing for personal use, through the 90-day trial
in production, even for commercial use. And I thought the free threshold for
the cloud reasonable, as I can forgo all the setup work and just use the cloud
instance for free if my usage falls under the right level. Anything that saves
me time but then doesn't cost me anything while learning it / trying it out is
a win in my book.

------
EGreg
This is very cool.

But there is an even harder super boss level for distributed systems:
Byzantine Fault Tolerance

Is FaunaDB Byzantine resistant? I have asked many projects, such as Dat, what
happens if there are conflicts, and they haven’t built systems that are BFT
yet.

~~~
jchrisa
Our deployment model assumes a single administrator. With reasonable network
operations, Byzantine fault tolerance is largely irrelevant. (Although it's an
important problem in other contexts.)

~~~
EGreg
I think BFT can be added as an option to your database, and it wouldn't need
refactoring of any other part of the database. Just have crypto signatures,
and the log would need to use BFT consensus with a quorum before committing.
It may be worth it at least for marketing purposes, but also for multiple
mistrusting parties jointly running the database. You've done so much already,
you may as well benefit from this addition!

~~~
jchrisa
We have a roadmap item to add blockchain style tamper evidence to the storage
engine. The crypto we'll use is certainly a step along the path you describe.
If you have a use case, we love talking to developers to help make sure we are
building features you can use.

------
EngineerBetter
The global log sounds like a lynchpin but I didn't see a good explanation in
the article. It sounds like all synchronization is done at the level of this
log? If so, isn't that a bit of a bottleneck?

~~~
jchrisa
The log is distributed across the cluster, and can be partitioned. The
protocol is optimized for long physical distances between replicas. See links
elsewhere in the thread to Raft.

------
PHGamer
Is the issue not so much that other DBs can't provide that global consistency,
but that other DBs just choose not to, for the sake of speed? Like, there are
trade-offs.

