FaunaDB 2.5 passed the core linearizability tests for multi-partition transactions immediately. To my knowledge no other distributed system has done this. ZooKeeper was the strongest candidate in the early days of testing, but it does not offer multiple partitions at all, as discussed in the FaunaDB report. And Jepsen itself was much less comprehensive at the time.
All other issues affecting correctness were fixed in the course of the analysis, and FaunaDB 2.6 is now available with the improvements.
We're happy to answer questions along with @aphyr. Our blog post is here: https://fauna.com/blog/faunadbs-official-jepsen-results
One thing that seems to fall by the wayside frequently is a followup. Often the problems found are declared solved in a blog post a couple of point releases later. Elasticsearch did this, to my disappointment, and so did Cassandra.
Reading your stuff, it appears you scale with complete replicas of a database on individual nodes. Is that still true?
Yeah, I agree. We worked very hard to fix all major issues during the evaluation period so that Kyle could test the fixes himself, and we are planning a formal followup on the remaining items and some planned improvements as well, once we are ready.
The tested cluster topology was 3 replicas with 3 nodes each. Each node contained 1/3 of the dataset.
Anyway, FaunaDB looks like a well-engineered product in a highly competitive space and I hope you will succeed :)
One could build a Calvin-style system on top of FDB to provide durability, cluster membership, and fault tolerance for data at a single site.
Looks like it's not open source, and the pricing isn't very clear if I want to host it locally. The "Download" page requires you to provide your contact info first.
Why should I go through all those hoops?
Use cases include financial services, retail, user identity, game world state, etc. Basically anything you'd put in an operational database.
In addition to the download (free for one node), you can get started with FaunaDB Serverless Cloud for free in moments. It's a fully managed version that is used in production by some major sites and many smaller apps.
Perhaps alternately - when is Fauna NOT a good choice?
FaunaDB is a unique database: its architecture offers linear scalability for transaction throughput (limited, of course, by contention on common records and predicates). That sets it apart from databases which use a single coordinator for all transactions, or just for cross-shard transactions, like Datomic and VoltDB, respectively.
It's also intended for geographic replication, which is a scenario many databases don't try to handle with the same degree of transactional safety--lots of folks support, say, snapshot isolation or linearizability inside one DC, but between DCs all bets are off.
FaunaDB also does not depend on clocks for safety, which sets it apart from other georeplicated databases that assume clocks are well-synchronized. Theoretically, relying on clocks can make you faster, so those kinds of systems might outperform Fauna--at the potential cost of safety issues when clocks misbehave. In practice, there are lots of factors that affect performance, and it's easy to introduce extra round trips into a protocol which could theoretically be faster. There's a lot of room for optimization! I can't speak very well to performance numbers, because Jepsen isn't designed as a performance benchmark, and its workloads are intentionally pathological, with lots of contention.
One of the things you lose with FaunaDB's architecture is interactive transactions--as in VoltDB, you submit transactions all at once. That means FaunaDB can make some optimizations that interactive session transactions can't do! But it also means you have to fit your transactional logic into FaunaDB's query language. The query language is (IMO) really expressive, but if you needed to do, e.g., a complex string-munging operation inside a transaction to decide what to do, it might not be expressible in a single FaunaDB transaction; you might have to, say, read data, make the decision in your application, then execute a second CaS transaction to update state.
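To make that concrete, here's a rough sketch of that read-then-CaS pattern using the JavaScript driver; the document shape, the version field, and the withdraw logic are all invented for illustration:

```typescript
import faunadb, { query as q } from 'faunadb';

const client = new faunadb.Client({ secret: process.env.FAUNADB_SECRET! });

// Hypothetical document shape: { data: { balance: number, version: number } }
async function withdraw(ref: any, amount: number) {
  // Transaction 1: read the document, then decide in application code
  // (e.g. after some logic the query language can't express).
  const doc: any = await client.query(q.Get(ref));
  if (doc.data.balance < amount) throw new Error('insufficient funds');

  // Transaction 2: a self-contained CaS -- apply the update only if the
  // version we read is still current, otherwise abort so we can retry.
  return client.query(
    q.If(
      q.Equals(q.Select(['data', 'version'], q.Get(ref)), doc.data.version),
      q.Update(ref, {
        data: {
          balance: doc.data.balance - amount,
          version: doc.data.version + 1,
        },
      }),
      q.Abort('document changed since read; retry')
    )
  );
}
```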
FaunaDB's not totally available--it uses unanimous agreement over majority quorums, which means some kinds of crashes or partitions can pause progress. If you're looking for a super low latency CRDT-style database where every node can commit even if totally isolated, it's not the right fit. It is a good fit if you need snapshot isolation to strict serializability, which are strong consistency models!
What is actually a good fit for this lately?
It's true. The community is keeping Riak alive, and pushing it forward. The CRDT stuff has not been worked on for a long time, despite there being numerous tabled improvements in that area since 2014. It was dismissed as "complicated computer science nonsense" and neglected by the "new" management that crashed Basho into the ground. Since then I've been unable to do the CRDT work for Riak I wanted, due to lack of time and no one willing to pay for it (despite numerous bait-and-switch offers for completing the bigsets work).
The CRDT work in Riak needs:
1. the smaller/more efficient map merging
2. merging in some new types from Sam (range reg, etc.)
3. completing and integrating the bigsets work (this enables delta-CRDTs)
4. big maps (built on the above work)
When it has all that, it will be at the place I'd envisioned for it before everything went super bad at Basho. I work at Joyent now, so I can't dedicate any time to taking the CRDT work forward, though I really wish I could. I still have a deep interest in the EC datatypes world.
IMO, at the time it was released (2014), Riak had groundbreaking support for convergent datatypes; since then it has only lost ground. Am I bitter? Yes, a little.
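For anyone who hasn't run into convergent datatypes before, here's a minimal state-based G-Counter sketch (my own toy illustration, not Riak's implementation): each node increments its own slot, merge takes the per-node maximum, and replicas converge no matter what order updates arrive in.

```typescript
// Grow-only counter keyed by node id.
type GCounter = Map<string, number>;

function increment(c: GCounter, nodeId: string): GCounter {
  const next = new Map(c);
  next.set(nodeId, (next.get(nodeId) ?? 0) + 1);
  return next;
}

// Merge is commutative, associative, and idempotent, which is what
// makes the type convergent.
function merge(a: GCounter, b: GCounter): GCounter {
  const out = new Map(a);
  for (const [node, n] of b) out.set(node, Math.max(out.get(node) ?? 0, n));
  return out;
}

function value(c: GCounter): number {
  let sum = 0;
  for (const n of c.values()) sum += n;
  return sum;
}
```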
WRT other systems with CRDT support, https://www.antidotedb.eu/ came out of the syncfree research project.
Christopher Meiklejohn's work is industrial grade, state of the art, actively developed, and highly recommended. https://github.com/lasp-lang
As others have mentioned, Redis also uses CRDTs for MDC replication, and has CRDT co-inventor Carlos Baquero on the tech board.
Currently it supports its own native DSL similar to LINQ or an ORM, and if you want compatibility with existing query languages from other databases, we will be rolling those out over time.
It's not a good choice for analytics or time-series data that is redundant or aggregated, and thus doesn't need transactional isolation or its availability and performance overhead.
Additionally, FaunaDB gives you the tools to work within a global environment with _safe tradeoffs_. For example, reads at SI/Serializable can be fast and uncoordinated. You choose per-query.
That said, I don't think it's reasonable to infer that because failures occur more often than we'd like, systems based on consensus are always inappropriate for latency and performance sensitive systems. While there are minimum bounds on latency in consensus systems, there are also plenty of human-scale problems for which that bound isn't a dealbreaker. Moreover, some types of computation (e.g. updates in sequentially consistent or snapshot isolated systems) simply can't be performed without paying the same costs as consensus. Consensus can be an appropriate, and in some cases the only possible solution, to some problems.
For example, a 7-replica FaunaDB cluster can lose 3 replicas and maintain availability for reads and writes, since the remaining 4 still form a majority; better than Aurora in your example.
FaunaDB supports function execution in the database, if the functions are written in FQL. It has a much more powerful index system and consistency model than GemStone/GemFire/Geode and is designed for synchronous global replication.
However, unlike Geode it is not an in-memory system, so it is not appropriate to use as a cache.
Geode skews towards eventual consistency when connecting geographically dispersed clusters. This means you still get super fast local transactions with batched updates to remote sites.
I was absolutely shocked by the poor performance of the service.
In my case, I prototyped some simple CRUD queries with NodeJS, within the same datacenter region.
An insert took well over a second to complete, and reading a simple document with one field also took half a second.
I was also unable to do a "join" between documents because of how complex their query language is, and their support basically encouraged me not to use "join" but to use "aggregate" like Mongo... Why offer this feature if I can't use it?
Has it changed since then? It seems very clear to me that Fauna is entirely focused on enterprise customers (after all, this is where the money is); the cloud version seems to be just a gimmick.
Typical write latencies in Cloud are in the 100ms range because the data is globally replicated. Typical read latencies are in the 1ms-10ms range, because global coordination is not required, discounting the additional latency from the client to the closest datacenter.
If you experienced something worse than that recently, maybe there is some other issue going on.
But just to confirm, is doing a "join" still not recommended? Aggregation is tedious and leads to queries that are difficult to read and maintain.
However, correctness testing is fundamentally adversarial, like security penetration testing. Building a database is not easy, and testing a database is not easy either. It is a separate skill set, as anomalies that lingered for decades in other databases reveal. The engagement with the Jepsen team is explicitly designed to explore the entire product surface area for faults, not to apply Jepsen as it currently stands. Thus, a lot of custom work ensued on both sides to make sure that the database was both properly testable, and properly tested. The result of that work is what you see in the report.
The typical Jepsen report implicates not just implementation bugs, but the entire architecture of the system itself. Jepsen usually identifies anomalies that cannot be prevented even with a perfect implementation; that didn't happen here.
Some vendors restrict their engagement with the Jepsen team to only what they have tested themselves already, although those tests are not always valid. This was not our mindset—we wanted to improve our database by taking advantage of Kyle’s expertise, not present a superficially perfect report that failed to actually exercise the potential faults of the system.
Companies are finding bugs using Jepsen internally, which is great! But when they hire me, I'm usually able to find new behaviors. Some of that is exploring the concurrency and scheduling state space, some of it is reviewing code and looking for places where tests would fail to identify anomalies, some of it is designing new workloads or failure modes, and some is reading the histories and graphs, and using intuition to guide my search. I've been at this for five years now (gosh) and built up a good deal of expertise, and coming at a system with an outsider's perspective, and a different "box of tools", helps me explore the system in a different way.
I do work with my clients to determine what they'd like to focus on, and how much time I can give, but by and large, my clients let me guide the process, and I think the Jepsen analyses I've published are reasonably independent. If there's something I think would be useful to test, and we don't have time or the client isn't interested in exploring it, I note it in the future work section of the writeup.
It's not like clients are saying "please stick ONLY to these tests, we want a positive result." One of the things I love about my job is how much the vendors I work with care about fixing bugs and doing right by their users, and I love that I get to help them with that process. :)
The primary advantage described in the Calvin papers is that it’s the only distributed transaction protocol that can handle high contention workloads. But Fauna never seems to bring this up. Does that mean that Fauna’s current implementation isn’t fast under contention?
Accurate clocks are not enough... to really get the benefits that Spanner alone enjoys, you have to have a TrueTime-equivalent service available, and it has to be rock solid. As well, once your system is sensitive to clock skews in the milliseconds, you start having to care about things like the leap-second policies of your clock sources. All in all, the resiliency tradeoffs are a significant downside to relying on clock synchronization, which is why we did not pursue a transaction protocol dependent on it.
I guess what I'm saying is it seems like Fauna is using atomic clocks as FUD against Spanner and CockroachDB, when they aren't really a problem. Based on my reading of the Calvin paper, the main advantage of Calvin-style systems is higher throughput under contention. But for some reason the Fauna marketing team has chosen not to emphasize that, which makes me suspicious that maybe Fauna hasn't yet realized that advantage in its implementation.
Thus, databases that rely on clock synchronization recommend configuring tolerance windows of 500ms and above, and cannot reliably detect if those windows have been violated. Additionally, this window affects latency for serializable local reads all the time, even if the clocks are actually fine, because there is no way for the system to know.
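To illustrate where that latency cost comes from (this is the generic Spanner-style idea, not FaunaDB's protocol): to serve a read at timestamp t without coordinating, a node has to wait out the whole tolerance window before it can be sure t has passed on every clock.

```typescript
// Assumed tolerance window; per the comment above, clock-dependent
// databases recommend configuring 500ms and above.
const EPSILON_MS = 500;

async function serveReadAt(tMs: number): Promise<void> {
  // Wait until the local clock is past t + epsilon, so that no
  // correctly-skewed clock anywhere can still be before t.
  const delay = tMs + EPSILON_MS - Date.now();
  if (delay > 0) await new Promise((resolve) => setTimeout(resolve, delay));
  // Only now is it safe to serve the read at t without coordination.
}
```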
Am I missing something? I am not an expert, I'm just not seeing where the 100s of ms of error is going to enter this system.
(edit: thanks for the great explanation aesipp!)
It seems very likely based on when it was rolled out that it underpins AWS tech like DynamoDB Global Tables -- so it almost certainly powers critical infrastructure. But there's no SLA or reports on what the tolerances you can expect without doing a lot of work on your own. It's more of a nice bonus rather than a "product" they offer you, in that sense, so being wary maybe isn't unwarranted.
IIRC from the original Spanner/TT paper, they had a general error window of ~10ms from the TT daemons, and I would be extremely surprised if Google hasn't pushed that even lower by now, so the bar you're up against is much tighter than 100s of ms of error. And yes, the clocks are in the same DC within a very precise window, but bugs happen throughout the stack: your hypervisor bugs out, systems get misconfigured, whatever; your process will fuzz out, especially as you begin to tighten things. You don't have the QA/testing of Spanner or DynamoDB, basically.
None of this is insurmountable, I think, though. It's just not easy any way you cut it. Even a few people doing the work to test and experiment with this would be very valuable. (It would be even better if AWS would make it a real product with real SLAs/numbers to back it up.) It's just a lot of work no matter what.
The fact that it is limited to AWS (for now) is a bit of a shame. I do hope other cloud providers start thinking about providing precise clocks in their datacenters, as well as accompanying software to go with it.
> Recent linux distros use chrony instead of ntpd to synchronize with the reference time, which should introduce only microseconds of error between the reference time and the system clock.
To be fair, not everyone uses chrony; a lot of systems still use plain ntpd or timesyncd. (I've spent a lot of time lately fixing time-sync related issues in our Linux distro across all our supported daemons, so I can at least say chrony is a wonderful choice: accurate and very easy to use! I actually found out about it when looking up TimeSync.)
If you can survive lost writes, clock skew just makes a zone win more or less often. Even if the clocks were in perfect sync, you still wouldn't observe causality across regions (changes to different items can replicate out of order).
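A toy last-writer-wins merge makes this concrete (my own illustration, not any particular system's code): whichever write carries the higher timestamp survives, so a write can be discarded even though both clients got an ack, skewed clocks or not.

```typescript
interface Write {
  value: string;
  timestampMs: number; // wall-clock time at the accepting zone
  zone: string;
}

// Last-writer-wins: highest timestamp survives; ties broken by zone id.
function lwwMerge(a: Write, b: Write): Write {
  if (a.timestampMs !== b.timestampMs) {
    return a.timestampMs > b.timestampMs ? a : b;
  }
  return a.zone > b.zone ? a : b;
}

// Two concurrent, acknowledged writes to the same item from different zones:
const winner = lwwMerge(
  { value: 'from-us-east', timestampMs: 1_554_000_000_500, zone: 'us-east' },
  { value: 'from-eu-west', timestampMs: 1_554_000_000_100, zone: 'eu-west' }
);
// winner.value === 'from-us-east'; the eu-west write is silently lost.
```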
Anyone else see a parallel there?
Seems like a good idea, overall. One annoying thing that affects pretty much every database with transactions is that the effort of retrying failed transactions is pushed onto the user, by necessity.
But if your transactions are airtight chunks of code... then the DB can retry them for you and provide a simpler interface to your app code.
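As a sketch of what that could look like (generic client-side code, not any particular database's API): if the transaction body is a self-contained function that's safe to re-run, a harness can retry it with backoff and the application never sees the conflict.

```typescript
// Re-run an "airtight" transaction body on failure, with exponential
// backoff. The body must be safe to repeat (no side effects outside
// the transaction itself).
async function retryTxn<T>(txn: () => Promise<T>, attempts = 5): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await txn();
    } catch (err) {
      lastErr = err;
      await new Promise((resolve) => setTimeout(resolve, 2 ** i * 50));
    }
  }
  throw lastErr;
}
```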
Building a FaunaDB query is more difficult than just writing session-based DB code.
But if you are willing to build FaunaDB queries, it should be strictly easier to write session-based DB code as "airtight" chunks that are easy to retry.
Second, FaunaDB's transactional model precludes interactive transactions, whereas SQL transactions are designed for interactive use. Imagine if every transaction was a stored procedure--that's the query structure you'd be looking at. It's certainly possible to do, but stored procedures are sort of an imperative language grafted on to the relational algebra of SQL, and support isn't as standardized as SQL's core.
Third, FaunaDB is a temporal store--you can ask for the state of any query at any point in time, and even mix temporal scopes in the same query expression. SQL doesn't have a first-class temporal model.
In general, using SQL offers advantages, including user familiarity, code reuse, and easier migration from other SQL stores. None of the things FaunaDB does are impossible to express in SQL, and they have been tackled by various DBs' extensions to SQL, but the familiarity+reuse advantages aren't as applicable once you start thinking about the distinct properties of FaunaDB's data model.
Our core use-case is OLTP, and we wanted to address the shortcomings of SQL in this context, such as query performance being highly unpredictable depending on the presence or absence of indexes, the whims of the optimizer, etc.
SQL is just not great as an application-level interface: tables are not a natural fit for many data models (the classic impedance-mismatch problem), and programmatic composition is difficult. We want to obviate the necessity of an ORM library on top of Fauna.
SQL the language is not great for writing complex transactions in, and session transactions require a lot of back and forth between the client and the database. We wanted to make it easy to write queries which encode as much business logic as possible, hence FQL's semantics are a lot closer to those of a regular programming language.
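For a taste of the composition point: in the JavaScript driver, FQL expressions are ordinary values, so query fragments compose as plain functions (the index and field names below are invented for the example).

```typescript
import { query as q } from 'faunadb';

// Reusable fragments; each returns an FQL expression value.
const byEmail = (email: string) => q.Match(q.Index('users_by_email'), email);
const nameOf = (userExpr: any) => q.Select(['data', 'name'], userExpr);

// Compose fragments into a single query expression, no string splicing:
const expr = nameOf(q.Get(byEmail('a@example.com')));
```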
But in summary:
1. A 'transaction' is a self-contained blob of code which reads input, does deterministic logic, and writes output (so not like a traditional RDBMS transaction, where the application opens a transaction and then interleaves its own logic between reads and writes)
2. When a transaction arrives, the receiving node runs it, and captures the inputs it read, and the outputs it wrote
3. The transaction, with its captured inputs and outputs, is written to a global stream of transactions - this is the only point of synchronisation between the nodes
4. Each node reads the global stream, and writes each transaction into its persistent state; to do that, it repeats all the reads that the transaction did, and checks that they match the captured input - if so, the outputs are committed, and if not, the transaction is aborted and retried
The key idea is that because the process is deterministic, the nodes can write transactions to disk independently without drifting out of sync.
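A toy version of steps 2-4, greatly simplified (real systems batch, shard, and pipeline all of this):

```typescript
interface LoggedTxn {
  reads: Record<string, unknown>;  // captured inputs
  writes: Record<string, unknown>; // captured outputs
}

type Store = Map<string, unknown>;

// Step 4: every node applies the totally-ordered global stream.
// Because this function is deterministic, all nodes that apply the
// same stream converge to the same store.
function apply(store: Store, txn: LoggedTxn): boolean {
  // Repeat the transaction's reads and check them against the
  // captured inputs.
  for (const [key, expected] of Object.entries(txn.reads)) {
    if (store.get(key) !== expected) return false; // abort; caller re-queues
  }
  // Inputs still match, so the captured outputs are valid: commit them.
  for (const [key, value] of Object.entries(txn.writes)) {
    store.set(key, value);
  }
  return true;
}
```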
It's pretty neat. And it's exactly what Abadi wrote about a couple of months ago:
This is also what VoltDB does (which Abadi worked on, along with Michael Stonebraker):
As an operational store, the VoltDB “operations” in question are actually full ACID transactions, with multiple rounds of reads, writes and conditional logic. If the system is going to run transactions to completion, one after another, disk latency isn’t the only stall that must be eliminated; it is also necessary to eliminate waiting on the user mid-transaction.
That means external transaction control is out – no stopping a transaction in the middle to make a network round-trip to the client for the next action. The team made a decision to move logic server-side and use stored procedures.
It's also similar to, although categorically more sophisticated than, the idea of object prevalence, which is now so old and forgotten that I can't find any really good references, but:
Clients communicate with the prevalent system by executing transactions, which are implemented by a set of transaction classes. These are examples of the Command design pattern [Gamma 1995]. Transactions are written to a journal when they are executed. If the prevalent system crashes, its state can be recovered by reading the journal and executing the transactions again. [...] Replaying the journal must always give the same result, so transactions must be deterministic. Although clients can have a high degree of concurrency, the prevalent system is single-threaded, and transactions execute to completion.
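A minimal sketch of the pattern that quote describes (an invented example, not from the original paper):

```typescript
// Commands are deterministic and journaled; recovery replays them.
interface Command {
  apply(state: Map<string, number>): void;
}

class Deposit implements Command {
  constructor(private account: string, private amount: number) {}
  apply(state: Map<string, number>): void {
    state.set(this.account, (state.get(this.account) ?? 0) + this.amount);
  }
}

class PrevalentSystem {
  private state = new Map<string, number>();
  private journal: Command[] = [];

  // Single-threaded: each command runs to completion before the next.
  execute(cmd: Command): void {
    this.journal.push(cmd); // in a real system, persisted to disk first
    cmd.apply(this.state);
  }

  // After a crash: rebuild state by replaying the journal in order.
  recover(journal: Command[]): void {
    this.state = new Map();
    for (const cmd of journal) cmd.apply(this.state);
    this.journal = [...journal];
  }
}
```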
> We’re excited to report that FaunaDB has passed:
> Additionally, it offers the highest possible level of correctness:
> In consultation with Kyle, we’ve fixed many known issues and bugs
> However, queries involving indices, temporal queries, or event streams failed to live up to claimed guarantees. We found 19 issues in FaunaDB[.]
You are comparing two very different situations.
Also, does everyone run the very latest version of all their software? What use is it to me that my vendor has fixed everything in the newest release when I'm not running it?
Oh and yes, I'm only quoting bits of each (fully knowing you all have links to both and can read it in full), but that's to illustrate the omission from the PR piece. I know that aphyr concludes that work has been done, "By 2.6.0-rc10, Fauna had addressed almost all issues we identified; some minor work around availability and schema changes is still in progress.", but that doesn't change the fact that the blog post doesn't address their past shortcomings.
If you’re running on cloud, we do the upgrade work for you.
Additionally, I am not sure if you fully appreciate the complexity of Jepsen and distributed databases in general.
As for me, I've actually been waiting for this result to recommend the use of FaunaDB in a commercial setting.
It's because I appreciate this work that I felt the blog post didn't do it justice. And I know Jepsen hardly ever passes (ZooKeeper, I believe, did). And I don't take FaunaDB's hard work for granted.
Last I checked, when I visit the Windows website it doesn't say "Really stable now but we've had 1,123,432 critical security bugs in the past!". Same for any other product or open source project.
Acknowledging current limitations is an absolute must, and posting thoughtful articles that delve into a past issue and how it was addressed are a bonus. Otherwise, there's no need for self-flagellation ;)