
FaunaDB 2.5.4 - aphyr
https://jepsen.io/analyses/faunadb-2.5.4
======
evanweaver
So excited to see this; this is the culmination of three months of very hard
work by both teams.

FaunaDB 2.5 passed the core linearizability tests for multi-partition
transactions immediately. To my knowledge no other distributed system has done
this. ZooKeeper was the strongest candidate on initial testing in the early
days, but it does not offer multiple partitions at all, as discussed in the
FaunaDB report. And Jepsen itself was much less comprehensive at the time.

All other issues affecting correctness were fixed in the course of the
analysis, and FaunaDB 2.6 is now available with the improvements.

We're happy to answer questions along with @aphyr. Our blog post is here:
https://fauna.com/blog/faunadbs-official-jepsen-results

~~~
AtlasBarfed
I greatly appreciate any distributed system that subjects itself to Jepsen. It
shows real commitment to honesty and a genuine desire to improve a database
technology.

One thing that seems to fall by the wayside frequently is a followup. Often
the problems found are declared solved in a blog post a couple of point
releases later. Elasticsearch did this, to my disappointment, and so did
Cassandra.

Reading your stuff, it appears you scale with complete replicas of a database
on individual nodes. Is that still true?

~~~
evanweaver
> One thing that seems to fall by the wayside frequently is a followup.

Yeah, I agree. We worked very hard to fix all major issues during the
evaluation period so that Kyle could test the fixes himself, and we are
planning a formal followup on the remaining items and some planned
improvements as well, once we are ready.

------
jatsign
I hadn't heard of Fauna before. What's the use case?

Looks like it's not open source, and the pricing isn't very clear if I want to
host it locally. The "Download" page requires you to provide your contact info
first.

Why should I go through all those hoops?

~~~
jchrisa
TLDR: FaunaDB is a distributed database that offers multi-partition ACID
transactions and is optimized for cloud / multi-cloud deployment. This is a
big deal because there aren't many other options with this level of data
integrity at worldwide scale.

Use cases include financial services, retail, user identity, game world state,
etc. Basically anything you'd put in an operational database.

In addition to the download (free for one node), you can get started with
FaunaDB Serverless Cloud for free in moments. It's a fully managed version
that is used in production by some major sites and many smaller apps.

~~~
jatsign
Sorry, I'm not trying to be snarky, but that sounds like "it's good for
everything" and doesn't help me determine when I should look at Fauna and when
I shouldn't.

Perhaps alternately - when is Fauna NOT a good choice?

~~~
aphyr
A few other points to consider!

FaunaDB is a unique database: its architecture offers linear scalability for
transaction throughput (limited, of course, by contention on common records
and predicates). That sets it apart from databases which use a single
coordinator for all, or just cross-shard, transactions, like Datomic and
VoltDB, respectively.

It's also intended for geographic replication, which is a scenario many
databases don't try to handle with the same degree of transactional safety--
lots of folks support, say, snapshot isolation or linearizability inside one
DC, but between DCs all bets are off.

FaunaDB also does _not_ depend on clocks for safety, which sets it apart from
other georeplicated databases that assume clocks are well-synchronized.
Theoretically, relying on clocks can make you faster, so those kinds of
systems might outperform Fauna--at the potential cost of safety issues when
clocks misbehave. In practice, there are lots of factors that affect
performance, and it's easy to introduce extra round trips into a protocol
which could theoretically be faster. There's a lot of room for optimization! I
can't speak very well to performance numbers, because Jepsen isn't designed as
a performance benchmark, and its workloads are intentionally pathological,
with lots of contention.

One of the things you lose with FaunaDB's architecture is interactive
transactions--as in VoltDB, you submit transactions all at once. That means
FaunaDB can make some optimizations that interactive session transactions
can't do! But it also means you have to fit your transactional logic into
FaunaDB's query language. The query language is (IMO) really expressive, but
if you needed to do, e.g. a complex string-munging operation inside a
transaction to decide what to do, it might not be expressible in a single
FaunaDB transaction; you might have to, say, read data, make the decision in
your application, then execute a second compare-and-set (CaS) transaction to
update state.
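
To make that concrete, here's a rough sketch of that two-step pattern with the
JavaScript driver (the class, field names, and `decide` function are all made
up for illustration):

    const faunadb = require('faunadb');
    const q = faunadb.query;
    const client = new faunadb.Client({ secret: 'YOUR_SECRET' });
    const ref = q.Ref(q.Class('accounts'), '101'); // hypothetical document

    async function settle() {
      // Transaction 1: read the current state.
      const doc = await client.query(q.Get(ref));
      // Arbitrary app-side logic the query language can't express:
      const next = decide(doc.data); // hypothetical application function
      // Transaction 2: compare-and-set. Apply the update only if the state
      // we read is still current; otherwise return false so the caller
      // knows to re-read and retry.
      return client.query(
        q.If(
          q.Equals(q.Select(['data', 'state'], q.Get(ref)), doc.data.state),
          q.Update(ref, { data: { state: next } }),
          false
        )
      );
    }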

FaunaDB's not totally available--it uses unanimous agreement over majority
quorums, which means some kinds of crashes or partitions can pause progress.
If you're looking for a super low latency CRDT-style database where every node
can commit even if totally isolated, it's not the right fit. It _is_ a good
fit if you need anything from snapshot isolation up to strict serializability,
which are strong consistency models!

~~~
dsergeyev
Hey, I'm curious: what does it use in place of clocks for safety and quorum?
Is there a paper?

~~~
aphyr
Yes, there IS a paper! Check out Calvin: Fast Distributed Transactions for
Partitioned Database Systems. FaunaDB uses Calvin with some changes to allow
more types of transactions, and to improve fault tolerance. :)

http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf

------
asien
Tried Fauna once with their "Cloud" version.

I was absolutely shocked by the poor performance of the service.

In my case I prototyped some simple CRUD queries with NodeJS, within the same
datacenter region.

Inserts took well over a second to complete, and reading a simple document
with one field also took half a second.

I was also unable to do "joins" between documents because of how complex their
query language is, and their support basically encouraged me not to use "join"
but to use "aggregate" like Mongo... Why offer this feature if I can't use it?

Has it changed since then? It seems very clear to me that Fauna is entirely
focused on enterprise customers (after all, this is where the money is); the
cloud version seems to be just a gimmick.

~~~
evanweaver
Hmm, when did you try it? FaunaDB and the cloud service have changed a lot in
the last few years, and performance is always improving.

Typical write latencies in Cloud are in the 100ms range because the data is
globally replicated. Typical read latencies are in the 1ms-10ms range, because
global coordination is not required, discounting the additional latency from
the client to the closest datacenter.

If you experienced something worse than that recently, maybe there is some
other issue going on.

~~~
asien
I will give it a second shot then.

But just to confirm, is doing "joins" something that is still not recommended?
Aggregation is tedious and leads to queries that are difficult to read and
maintain.

~~~
freels
It's definitely recommended, depending on what you're trying to get out of the
result. Happy to help out with query patterns on our community Slack:
https://publicslack.com/slacks/fauna-community/invites/new

------
wmsiler
From the post, FaunaDB initially had several issues, which they've generally
resolved. Jepsen is open source, so I'm curious why a database company
wouldn't run Jepsen internally, work out as many problems as they can, and
then engage aphyr in order to get the official thumbs up. Given how important
data integrity is, I would assume that any database company would be running
Jepsen (or something equivalent) regularly in-house. If they are doing that,
then how is it that aphyr finds so many previously unknown issues? And if they
aren't running Jepsen in-house, why not?

~~~
evanweaver
This is a very good question, and to a substantial degree, this is what we
did. We have internal QA systems that overlap with Jepsen and catch most issues.
We also ran our own Jepsen tests on the core product properties last year,
fixed some issues and identified others, and reported the results on our blog.

However, correctness testing is fundamentally adversarial, like security
penetration testing. Building a database is not easy, and testing a database
is not easy either. It is a separate skill set, as anomalies that lingered for
decades in other databases reveal. The engagement with the Jepsen team is
explicitly designed to explore the entire product surface area for faults, not
to apply Jepsen as it currently stands. Thus, a lot of custom work ensued on
both sides to make sure that the database was both properly testable, and
properly tested. The result of that work is what you see in the report.

The typical Jepsen report implicates not just implementation bugs, but the
entire architecture of the system itself. Jepsen usually identifies anomalies
that cannot be prevented even with a perfect implementation; that didn't
happen here.

Some vendors restrict their engagement with the Jepsen team to only what they
have tested themselves already, although those tests are not always valid.
This was not our mindset—we wanted to improve our database by taking advantage
of Kyle’s expertise, not present a superficially perfect report that failed to
actually exercise the potential faults of the system.

~~~
aphyr
To follow up on this a little bit--many of my clients do their own Jepsen
testing, or have analogous tests using their own testing toolkit. When they
engage me, the early part of my work is reviewing their existing tests,
looking for problems, and then expanding and elaborating on those tests to
find new issues in the DB.

Companies _are_ finding bugs using Jepsen internally, which is great! But when
they hire me, I'm usually able to find new behaviors. Some of that is
exploring the concurrency and scheduling state space, some of it is reviewing
code and looking for places where tests would fail to identify anomalies, some
of it is designing new workloads or failure modes, and some is reading the
histories and graphs, and using intuition to guide my search. I've been at
this for five years now (gosh) and built up a good deal of expertise, and
coming at a system with an outsider's perspective, and a different "box of
tools", helps me explore the system in a different way.

I do work with my clients to determine what they'd like to focus on, and how
much time I can give, but by and large, my clients let me guide the process,
and I think the Jepsen analyses I've published are reasonably independent. If
there's something I think would be useful to test, and we don't have time or
the client isn't interested in exploring it, I note it in the future work
section of the writeup.

It's not like clients are saying "please stick ONLY to these tests, we want a
positive result." One of the things I love about my job is how much the
vendors I work with care about fixing bugs and doing right by their users, and
I love that I get to help them with that process. :)

------
georgewfraser
Fauna’s writeup heavily emphasizes the fact that it doesn’t rely on atomic
clocks. My understanding is that both AWS and GCP have used atomic-clock-based
timekeepers since 2017, so it’s not like this is some exotic technology.

The primary advantage described in the Calvin papers is that it’s the only
distributed transaction protocol that can handle high contention workloads.
But Fauna never seems to bring this up. Does that mean that Fauna’s current
implementation isn’t fast under contention?

~~~
freels
It does handle contention well; we just haven't emphasized that point enough
yet. Writes never contend on conflicting reads, and serialized reads never
contend at all.

Accurate clocks are not enough... to really get the benefits that Spanner
alone enjoys, you have to have a TrueTime-equivalent service available, and it
has to be rock solid. As well, once your system is sensitive to clock skews in
the millisecond range, you start having to care about things like the leap-second
policies of your clock sources. All in all, the resiliency tradeoffs are a
significant downside to relying on clock synchronization, which is why we did
not pursue a transaction protocol dependent on it.

~~~
georgewfraser
My understanding is that AWS TimeSync and the current GCP time daemon
implementation are the equivalent of TrueTime; is that incorrect?

I guess what I’m saying is it seems like Fauna is using atomic clocks as FUD
against Spanner and CockroachDB, when they aren’t really a problem. Based on
my reading of the Calvin paper, the main advantage of Calvin style systems is
higher throughput under contention. But for some reason the Fauna marketing
team has chosen not to emphasize that, which makes me suspicious that maybe
Fauna hasn’t yet realized that advantage in its implementation.

~~~
evanweaver
If only this were the case! Unfortunately it is not practical to reliably
maintain single-digit millisecond clock tolerances in the public cloud as an
end user via NTP. The entire software/hardware/network stack has to be tightly
controlled, not just the clock source itself. And atomic clocks across
multiple cloud providers are not in sync with each other, either.

Thus, databases that rely on clock synchronization recommend configuring
tolerance windows of 500ms and above, and cannot reliably detect if those
windows have been violated. Additionally, this window affects latency for
serializable local reads all the time, even if the clocks are actually fine,
because there is no way for the system to know.

~~~
georgewfraser
AWS has GPS/atomic clocks in each datacenter that provide an accurate
reference time. Recent Linux distros use chrony instead of ntpd to synchronize
with the reference time, which should introduce only microseconds of error
between the reference time and the system clock.

Am I missing something? I am not an expert, I'm just not seeing where the 100s
of ms of error is going to enter this system.

(edit: thanks for the great explanation, aseipp!)

~~~
aseipp
I spent some time looking at TimeSync, and my primary takeaway was simply that
while it is nice, there are no actual hard numbers on how accurate it really
is. I suspect it is very accurate, but if you want to rely on global clocks to
avoid consensus, proving that to yourself is going to be challenging without
more details or insight. You are essentially making a _huge_ bet on
performance considerations by trading consensus for clocks, at the expense of
a far, far higher bar for correctness.

It seems very likely based on when it was rolled out that it underpins AWS
tech like DynamoDB Global Tables -- so it almost certainly powers critical
infrastructure. But there's no SLA, and no report on what tolerances you can
expect, without doing a lot of work on your own. It's more of a nice bonus
than a "product" they offer you, in that sense, so being wary maybe isn't
unwarranted.

IIRC from the original Spanner/TrueTime paper, they had a general error window
of ~10ms from the TrueTime daemons, and I would be extremely surprised if
Google hasn't pushed that even lower by now, so the bar you're up against is
much tighter than 100s of ms of error. And yes, the clocks are in the same DC
within a very precise window, but bugs happen throughout the stack: your
hypervisor bugs out, systems get misconfigured, whatever, and your process
will fuzz out, especially as you begin to tighten things. You don't have the
QA/testing of Spanner or DynamoDB, basically.

None of this is insurmountable, I think, though. It's just not easy any way
you cut it. Even a few people doing the work to test and experiment with this
would be very valuable. (It would be even better if AWS would make it a real
product with real SLAs/numbers to back it up.) It's just a lot of work no
matter what.

The fact that it is limited to AWS (for now) is a bit of a shame. I do hope
other cloud providers start thinking about providing precise clocks in their
datacenters, as well as accompanying software to go with it.

> Recent linux distros use chrony instead of ntpd to synchronize with the
> reference time, which should introduce only microseconds of error between
> the reference time and the system clock.

To be fair, not everyone uses chrony; a lot of systems still use just ntpd or
timesyncd. (I spent a lot of time lately fixing time-sync related issues in
our Linux distro across all our supported daemons, so I can at least say
chrony is a wonderful choice: accurate, and very easy to use! I actually found
out about it when looking up TimeSync.)
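
(If you want to play with it: pointing chrony at Time Sync is only a few lines
of chrony.conf. The link-local address below is AWS's documented endpoint; the
rest is a minimal sketch.)

    # /etc/chrony/chrony.conf -- minimal sketch for syncing against AWS Time Sync
    server 169.254.169.123 prefer iburst  # AWS Time Sync's link-local endpoint
    driftfile /var/lib/chrony/drift       # persist the measured clock drift
    makestep 1.0 3                        # step the clock if the initial offset is large
    rtcsync                               # keep the hardware clock in sync too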

~~~
erik_seaberg
If you need strongly consistent data, you must write to a global table in only
one region, and then clocks don't matter because replication does not create
conflicts.

If you can survive lost writes, clock skew just makes a zone win more or less
often. Even if the clocks were in perfect sync, you still wouldn't observe
causality across regions (changes to different items can replicate out of
order).

------
etaioinshrdlu
This actually reminds me a lot of how Ethereum transactions are represented as
code as well.

Anyone else see a parallel there?

Seems like a good idea, overall. One annoying thing that affects pretty much
every database with transactions is that the effort of retrying failed
transactions is pushed onto the user, by necessity.

But if your transactions are airtight chunks of code... then the DB can retry
them for you and provide a simpler interface to your app code.

~~~
cakoose
> One annoying thing that affects pretty much every database with transactions
> is that the effort of retrying failed transactions is pushed onto the user,
> by necessity.
>
> But if your transactions are airtight chunks of code... then the DB can
> retry them for you and provide a simpler interface to your app code.

Building a FaunaDB query is more difficult than just writing session-based DB
code.

But if you are willing to put in that effort, it should be strictly easier to
write your session-based DB code as "airtight" chunks that are easy to retry.
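
A minimal sketch of what that buys you in application code (`runTxn` is a
hypothetical stand-in for whatever submits the airtight transaction):

    // If the transaction is a pure function of its inputs, the whole
    // thing can be re-submitted blindly on failure -- there's no
    // session state to reconstruct.
    async function withRetries(runTxn, attempts = 5) {
      let lastErr;
      for (let i = 0; i < attempts; i++) {
        try {
          return await runTxn();
        } catch (err) {
          lastErr = err; // e.g. a transient conflict; safe to re-run
        }
      }
      throw lastErr;
    }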

------
buremba
Looks great, but why did you decide to develop your own query language instead
of just using SQL? Even NoSQL transactional database solutions have started to
adopt SQL lately, and learning a new language is not really easy for
application developers.

~~~
aphyr
Ooh, I'll bite! For one, they're different data models. FaunaDB is a document
store, so records are fundamentally trees, whereas SQL is oriented around
processing tuples. FaunaDB records have queryable metadata; SQL rows (I
think?) don't. You can extend SQL (look at JSON support in Postgres) to deal
with arrays and maps as core datatypes, but a special-purpose language can be
better suited to the purpose.

Second, FaunaDB's transactional model precludes interactive transactions,
whereas SQL transactions are designed for interactive use. Imagine if every
transaction was a stored procedure--that's the query structure you'd be
looking at. It's certainly possible to do, but stored procedures are sort of
an imperative language grafted on to the relational algebra of SQL, and
support isn't as standardized as SQL's core.

Third, FaunaDB is a temporal store--you can ask for the state of any query at
any point in time, and even mix temporal scopes in the same query expression.
SQL doesn't have a first-class temporal model.
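
For instance, a minimal sketch with the JavaScript driver (the class and
document are made up), using the At function to scope an expression to a past
snapshot:

    const faunadb = require('faunadb');
    const q = faunadb.query;
    const client = new faunadb.Client({ secret: 'YOUR_SECRET' });
    const ref = q.Ref(q.Class('posts'), '42'); // hypothetical document

    // Read the same document now and as of a fixed past timestamp,
    // in a single query expression.
    client.query({
      now: q.Get(ref),
      then: q.At(q.Time('2019-01-01T00:00:00Z'), q.Get(ref)),
    });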

In general, using SQL offers advantages, including user familiarity, code
reuse, and easier migration from other SQL stores. None of the things FaunaDB
does are impossible to express in SQL, and they _have_ been tackled by various
DBs' extensions to SQL, but the familiarity+reuse advantages aren't as
applicable once you start thinking about the distinct properties of FaunaDB's
data model.

------
twic
FaunaDB uses Calvin, a transaction protocol developed by Daniel Abadi's group
at Yale. Fauna's blog post explains it nicely, after a bit of a slow start:

https://fauna.com/blog/consistency-without-clocks-faunadb-transaction-protocol

But in summary:

1. A 'transaction' is a self-contained blob of code which reads input, does
deterministic logic, and writes output (so not like a traditional RDBMS
transaction, where the application opens a transaction and then interleaves
its own logic between reads and writes)

2. When a transaction arrives, the receiving node runs it, and captures the
inputs it read and the outputs it wrote

3. The transaction, with its captured inputs and outputs, is written to a
global stream of transactions - this is the only point of synchronisation
between the nodes

4. Each node reads the global stream and writes each transaction into its
persistent state; to do that, it repeats all the reads that the transaction
did, and checks that they match the captured inputs - if so, the outputs are
committed, and if not, the transaction is aborted and retried

The key idea is that because the process is deterministic, the nodes can write
transactions to disk independently without drifting out of sync.
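
A loose sketch of that apply loop (all names invented; this is just the shape
of steps 2-4 above, not Fauna's actual code):

    const store = new Map(); // stand-in for the node's persistent state
    const deepEqual = (a, b) => JSON.stringify(a) === JSON.stringify(b);

    function applyLog(globalLog, resubmit) {
      // Every replica runs this same loop over the shared transaction log.
      for (const txn of globalLog) {
        // Repeat the transaction's reads against local state.
        const inputs = txn.reads.map((key) => store.get(key));
        if (deepEqual(inputs, txn.capturedInputs)) {
          // Reads still match what the executing node captured: commit
          // the captured writes. Deterministic, so every replica agrees.
          for (const w of txn.writes) store.set(w.key, w.value);
        } else {
          // A conflicting transaction beat this one into the log:
          // abort locally and resubmit it to run again.
          resubmit(txn);
        }
      }
    }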

It's pretty neat. And it's exactly what Abadi wrote about a couple of months
ago:

http://dbmsmusings.blogspot.com/2019/01/its-time-to-move-on-from-two-phase.html

This is also what VoltDB does (whose academic predecessor, H-Store, Abadi
worked on along with Michael Stonebraker):

 _As an operational store, the VoltDB “operations” in question are actually
full ACID transactions, with multiple rounds of reads, writes and conditional
logic. If the system is going to run transactions to completion, one after
another, disk latency isn’t the only stall that must be eliminated; it is also
necessary to eliminate waiting on the user mid-transaction._

 _That means external transaction control is out – no stopping a transaction
in the middle to make a network round-trip to the client for the next action.
The team made a decision to move logic server-side and use stored procedures._

https://www.voltdb.com/product/data-architecture/no-wait-design/

It's also similar to, although categorically more sophisticated than, the idea
of object prevalence, which is now so old and forgotten that I can't find any
really good references, but:

 _Clients communicate with the prevalent system by executing transactions,
which are implemented by a set of transaction classes. These are examples of
the Command design pattern [Gamma 1995]. Transactions are written to a journal
when they are executed. If the prevalent system crashes, its state can be
recovered by reading the journal and executing the transactions again. [...]
Replaying the journal must always give the same result, so transactions must
be deterministic. Although clients can have a high degree of concurrency, the
prevalent system is single-threaded, and transactions execute to completion._

https://web.archive.org/web/20170610140344/http://hillside.net/sugarloafplop/papers/5.pdf

------
drej
I recommend first reading bits of the Jepsen report, because the company blog
paints quite a different picture.

> We’re excited to report that FaunaDB has passed:
>
> Additionally, it offers the highest possible level of correctness:
>
> In consultation with Kyle, we’ve fixed many _known_ issues and bugs

vs.

> However, queries involving indices, temporal queries, or event streams
> failed to live up to claimed guarantees. We found 19 issues in FaunaDB[.]

~~~
greggh
Your comment doesn't seem fair at all. The blog post is from after they worked
hard to fix everything. The comment from Jepsen that you are quoting is from
before any of that fix work was done.

You are comparing two very different situations.

~~~
drej
I know they worked hard, and I appreciate that. But I got a wholly different
feeling about the state of things from those two sources.

Also, does everyone run the very latest version of all their software? What
use is it to me that my vendor has fixed everything in a newer release that I
am not running?

Oh and yes, I'm only quoting bits of each (fully knowing you all have links to
both and can read it in full), but that's to illustrate the omission from the
PR piece. I know that aphyr concludes that work has been done, "By 2.6.0-rc10,
Fauna had addressed almost all issues we identified; some minor work around
availability and schema changes is still in progress.", but that doesn't
change the fact that the blog post doesn't address their past shortcomings.

~~~
jwr
An approach like yours causes companies not to submit to Jepsen testing, and
to try to hide shortcomings instead. They did submit, they found problems (it
is rare for the Jepsen suite not to find any), and they fixed them. This is a
fantastic result and definitely one to be proud of.

Additionally, I am not sure if you fully appreciate the complexity of Jepsen
and distributed databases in general.

As for me, I've actually been waiting for this result to recommend the use of
FaunaDB in a commercial setting.

~~~
drej
Oh I do appreciate it - I've read Martin Kleppmann's book cover to cover and
then watched all of the speeches Kyle Kingsbury has given in the past three
years or so. I love this area, my absolute favourite is deterministic testing
of FoundationDB [0].

It's _because_ I appreciate this work that I felt the blog post didn't do it
justice. And I know Jepsen hardly ever passes (ZooKeeper, I believe, did). And
I don't take FaunaDB's hard work for granted.

[0] https://www.youtube.com/watch?v=4fFDFbi3toc

------
jwr
I know this is slightly off-topic, but I'd be very interested in Jepsen
testing FoundationDB. They claim to have developed the database test-first
(starting with simulations), and it would be great to be able to compare the
claims to reality using an external (by now becoming an industry standard!)
testing tool.

~~~
misframer
This was done a few years ago. See
https://web.archive.org/web/20150325003526/http://blog.foundationdb.com/call-me-maybe-foundationdb-vs-jepsen

------
anentropic
That was pretty great. Does anyone have links to tests of FaunaDB write
performance?

~~~
evanweaver
We are working on new benchmarks.

------
gigatexal
If they added a standard SQL layer I’d be on board. Interesting project though.

~~~
devDan
They are adding a GraphQL layer. It's in prototype mode right now, I believe,
but we should see it soon.

