
CockroachDB's Consistency Model - dilloc
https://www.cockroachlabs.com/blog/consistency-model/
======
evanweaver
The suggestion that this anomaly can occur in FaunaDB is wrong:

      > We do not allow the bad outcome in the Hacker News commenting scenario. Other 
      > distributed databases that claim serializability do allow this to happen.

FaunaDB's consistency levels are as follows:

                 | No indices   | Serializable indices | Other indices
      -----------+--------------+----------------------+--------------
      Read-Write | Strict-1SR   | Strict-1SR           | Snapshot
      Read-Only  | Serializable | Serializable         | Snapshot

In FaunaDB, all read-write transactions execute at external consistency,
including all their row read intents, and all index intents when requested.
Any read-only transaction, even if uncoordinated, will see a consistent-prefix
view of all its priors: not just causal priors (so there is no "causal
reversal"), but all physical, linearizable, multi-partition priors.

In FaunaDB, it is not possible to see the later comment without seeing the
previous comment, even when executing at snapshot instead of externally
consistent isolation.
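The consistent-prefix guarantee can be sketched with a toy single-log model (a deliberately simplified illustration, not FaunaDB's actual implementation):

```python
# Toy model of consistent-prefix reads: every commit is assigned a
# position in one totally ordered log, and a read-only transaction
# observes some prefix of that log. A later commit can therefore
# never be visible without every earlier commit.

log = []  # global commit order, as assigned by the sequencer

def commit(txn):
    log.append(txn)

def snapshot_read(prefix_len):
    # An uncoordinated replica may lag (serve a shorter prefix),
    # but it always serves a *prefix*, never a view with gaps.
    return log[:prefix_len]

commit("comment A")
commit("reply B to A")  # committed after A in the global order

for n in range(len(log) + 1):
    view = snapshot_read(n)
    # If the reply is visible, the original comment must be too.
    assert "reply B to A" not in view or "comment A" in view
```

A stale replica in this model can only show you an older prefix, never the reply without its parent.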

FaunaDB can serve serializable/snapshot reads out of any datacenter without
any global coordination, and can serve externally consistent reads with
coordination whenever requested. CockroachDB doesn't offer global reads at
all: to work around clock skew issues, it can only serve reads from partition
leaders.

Comparing these models fairly requires executing all transactions in FaunaDB
at external consistency, which is both more consistent, and has lower tail
latency in a global context, than CockroachDB.

~~~
andydb
FaunaDB has to make painful (to applications) tradeoffs between latency and
consistency in global scenarios:

1. All read-write transactions pay global latency to a central sequencer. So,
yes, FaunaDB is strictly serializable, but at the cost of high latency for
read-write transactions.

2. Read transactions have to choose between:

      a. Strict serializability but high latency
      b. Stale reads but low latency

Anomalies absolutely can happen in FaunaDB if applications use option 2b.
However, most users will not appreciate the subtlety here, and some will
unwittingly go into production with consistency bugs in their application that
only manifest under stress conditions (like data centers going down and
clogged network connections). Their only other option is 2a, and that is just
a no-go for global scenarios. You can't route your regional reads through a
central sequencer that might be located on the other side of the world.

Your argument reduces to: "FaunaDB has no anomalies in global scenarios! That
is, as long as you're OK with a global round-trip for every read-write and
read-only transaction...". FaunaDB has not solved the consistency vs. latency
tradeoff problem, but has simply given the application tools to manage it. A
heavy burden still rests on the application.

By contrast, Spanner users get both low latency _and_ strict serializability
for partitioned reads and writes (that's _and_ , not _or_ , like FaunaDB).
CockroachDB users get low latency _and_ "no stale reads" for partitioned reads
and writes. Partitioning your tables/indexes to get both high consistency and
low latency is a requirement, but it's not difficult to do this in a way that
gives these benefits to the majority of your latency-sensitive queries, if not
all of them. After all, this is what virtually every global company does today
- they partition data by region, so that each region gets low latency _and_
high consistency. You only pay the global round-trip cost when you want to
query data located across multiple regions, which is rare by DBA design.

The main point of the article is that the "no stale reads" isolation level is
almost as strong as "strict serializability", and is identical in virtually
every real-world application scenario. This means CockroachDB is equivalent to
Spanner for all intents and purposes.

~~~
evanweaver
Re. 1, this is a common misunderstanding. The FaunaDB sequencer is not
physically centralized: it is logically centralized, but physically
partitioned. Transactions find the nearest sequencer and commit to the closest
majority set of logs for that sequencer, which usually is equivalent to the
nearest majority set of datacenters (specialized topologies can do a little
bit better). This gives predictable, moderate latency for all read-write
transactions regardless of data layout or transaction complexity.

Re. 2, I understand your perspective, but I disagree that the worse-is-better
argument is valid. Google Spanner offers exact-staleness and bounded-staleness
snapshot reads, almost identical to FaunaDB. The reason is that the 10ms clock
ambiguity window is still too long for many users to wait for serializability.
Like FaunaDB, Spanner users must use the consistency levels correctly or
anomalies will result, but these anomalies typically only occur in read-
modify-write scenarios that are not wrapped in a transaction. Doing that in
any database (including in CockroachDB) creates anomalies at any isolation
level, because it defeats conflict detection.
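That failure mode is easy to reproduce in miniature; this generic sketch (not specific to any of these databases) shows two clients doing an unwrapped read-modify-write:

```python
# Two clients each do read -> compute -> write without wrapping the
# sequence in a transaction. The database sees two independent reads
# and two independent writes, never a conflict, so one increment is
# silently lost regardless of isolation level.

db = {"votes": 0}

def read(key):
    return db[key]

def write(key, value):
    db[key] = value

# Interleaved execution: both clients read before either writes.
a = read("votes")        # client A reads 0
b = read("votes")        # client B reads 0
write("votes", a + 1)    # A writes 1
write("votes", b + 1)    # B writes 1, clobbering A's increment

assert db["votes"] == 1  # lost update: two increments, final value 1
```

Wrapping the read and the write in one transaction is what lets the database detect the conflict and retry one of the clients.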

But it turns out that waiting out the clock ambiguity window in the public
cloud is actually worse than routing to every partition leader all the time.
So CockroachDB offers neither snapshot isolation reads, nor serializable
follower reads, nor strict serializability for writes. It is _not_ equivalent
to Spanner. As you explain, CockroachDB applications have no choice but to
avoid creating transactions that read or write data partitioned in other
datacenters. Transactions that do are much higher latency than Spanner and
FaunaDB both. This mandatory partitioning is a non-relational experience—the
data must be laid out in the way it will be queried—and it is harder than
understanding an additional consistency level and taking advantage of it when
appropriate.

Like you say, FaunaDB has "given the application the tools" to manage global
latency.

~~~
andydb
Re. 1, doesn't the majority set of logs for a given logical sequencer need to
overlap with the majority set of logs for each other logical sequencer? If
not, I don't see how you guarantee strict serializability without clocks. But
if so, then all you're saying is that you're "averaging" the latencies to get
better predictability. So if the average latency between any 2 DC's is 100ms,
but the max latency between the two furthest DC's is 300ms, then you'll get
closer to 100ms everywhere in the world by using this scheme. That's a good
thing, but it's still 100ms!

However, since I may be misunderstanding something, a concrete example is
helpful. Say you have 3 DC's in each of 3 regions (9 total DC's): Europe,
Asia, and the US, with 10ms of intra-region latency and 100ms of inter-region
latency. Say I want to commit a read-write transaction. What commit latency
should I expect?
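Under one plausible reading of the scheme (a commit requires acknowledgment from a majority of all nine DCs' logs), the back-of-envelope arithmetic for this topology would be:

```python
# 9 DCs, 3 per region; 10ms intra-region RTT, 100ms inter-region RTT.
# If a commit needs acks from a majority (5 of 9) of log replicas,
# commit latency is roughly the RTT to the 5th-nearest replica.

rtts = sorted([0, 10, 10] +  # coordinator's region: itself + 2 local DCs
              [100] * 6)     # the 6 DCs in the other two regions

majority = len(rtts) // 2 + 1        # 5 of 9
commit_latency = rtts[majority - 1]  # RTT to the 5th-nearest replica

assert majority == 5
assert commit_latency == 100  # at least 2 acks must cross regions
```

That is, under this assumption every read-write commit pays at least one inter-region round trip, no matter where it originates.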

Re. 2, the reason Spanner (and soon CockroachDB) offer bounded-staleness
snapshot reads is not due to the clock ambiguity window. You're working with
old information from a paper published years ago. At this point, my
understanding is that Google has reduced the window from ~7ms down to 1-2ms
(maybe even less). Furthermore, these are reads, which don't need to wait in
Spanner to begin with. There are at least 2 scenarios where bounded-staleness
reads are useful:

      1. When you have heavy read load and want to distribute that load
         across all of your replicas.
      2. When you want to read from the nearest replica in order to reduce
         latency, in cases where replicas are geographically distributed.

You also have a misconception about CockroachDB. In practice, CockroachDB
almost never waits when reading. It only does so when it tries to read a
record at time T1, and notices that another transaction has written to that
same record at time T2 (i.e. later in time, within the clock uncertainty
window). So unless you're trying to read heavily contended records, you'll
rarely see any wait at all.
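That read-side behavior can be modeled in a few lines (a simplified sketch of the documented mechanism; `MAX_OFFSET` mirrors CockroachDB's default 500ms clock-offset bound, and the function names are made up for illustration):

```python
# A reader at timestamp t_read cannot tell whether a value written at
# t_write in (t_read, t_read + MAX_OFFSET] really committed before or
# after its own start, so only inside that window must it retry at a
# higher timestamp. Everywhere else, the read proceeds without waiting.

MAX_OFFSET = 0.5  # seconds; CockroachDB's default --max-offset is 500ms

def read_outcome(t_read, t_write):
    if t_write <= t_read:
        return "visible"    # definitely committed in the past
    if t_write <= t_read + MAX_OFFSET:
        return "retry"      # ambiguous: inside the uncertainty window
    return "invisible"      # definitely in the future; safe to ignore

assert read_outcome(10.0, 9.9) == "visible"
assert read_outcome(10.0, 10.3) == "retry"      # rare, contended case
assert read_outcome(10.0, 11.0) == "invisible"
```

Only the middle case, a write landing inside the uncertainty window on the very record being read, forces a retry.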

You're also incorrect about the reasons for CockroachDB's lack of support for
follower reads. Here is an RFC describing how bounded-staleness follower reads
will work in an upcoming version:

      https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20180603_follower_reads.md

The real reason that CockroachDB does not yet support follower reads is that
other things took priority in the development schedule. It's always been
possible to support them. I expect they'll work with similar utility and
latency to Spanner and FaunaDB follower reads.

As for your assertion that "Transactions that do are much higher latency than
Spanner and FaunaDB both", it's just ... wrong. Perhaps you're thinking of older
versions, before CockroachDB had transaction pipelining:

      https://www.cockroachlabs.com/blog/transaction-pipelining/

Furthermore, the end of the article alludes to an upcoming feature called
"parallel commits", in which most distributed transactions will have commit
latency = ~1 round of distributed consensus.

------
bithavoc
I just learned a new thing: Time Travel Queries[0]. I haven't used it but it
seems queries are able to go back 25 hours by default, freaking cool.

[0] [https://www.cockroachlabs.com/docs/stable/select-clause.html...](https://www.cockroachlabs.com/docs/stable/select-clause.html#select-historical-data-time-travel)

~~~
tyingq
It seems like there should be a specialty database that lets you branch any
point in time for not just queries, but also inserts, updates and deletes. I
recall a fair amount of buzz around something called "thingamy" that was going
to deliver that. Then it went quiet, closed source, and "call for pricing".

~~~
SamReidHughes
That is certainly possible. What would you use it for?

~~~
tyingq
Instant what-if for end users that doesn't require IT help or lots of
space/time, and doesn't interfere with the current prod data, etc.

Think data savvy, but not IT savvy end users. Like financial analysts.

~~~
SamReidHughes
Thanks for answering. Myself, I'd only thought of performance for creating and
using test environments in software development. Which, I guess, is another
kind of what-if scenario.

------
andrewflnr
You might want to come up with a better example for how your causal reverse
anomaly isn't that big a deal. My first reaction was approximately "dafuq?
Nathan saw an inconsistent database! That's terrible!" It took a long time
before you got around to explaining that with a foreign key check (which I
pretty much assumed as part of the scenario), it wouldn't happen.

~~~
andreimatei1
Perhaps I could emphasize things differently. FWIW, besides the fact that a
foreign key constraint in that schema would prevent the badness from
happening, the even bigger reason scenarios like that are unlikely is that,
realistically, for Tobi to reply to a comment, he must have seen the comment
he's about to reply to. It's very hard to imagine a scenario where he'd see it
but still have the response transaction not read it (because if the txn read
it, the two transactions wouldn't be independent any more, and so they'd be
well ordered). The foreign key is just one way of ensuring that the read
happens.
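The foreign-key mechanism can be shown in miniature with SQLite standing in for any database that enforces referential integrity (the table and column names here are made up for illustration):

```python
import sqlite3

# A reply must reference an existing parent comment; checking the FK
# forces the reply's transaction to read the parent row, which orders
# the two writes and rules out the causal-reverse outcome.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE replies (
    id INTEGER PRIMARY KEY,
    parent INTEGER NOT NULL REFERENCES comments(id))""")

# Replying to a comment that was never committed fails outright.
try:
    conn.execute("INSERT INTO replies VALUES (1, 42)")
    outcome = "accepted"
except sqlite3.IntegrityError:
    outcome = "rejected"
assert outcome == "rejected"

# Once the parent comment exists, the reply is accepted.
conn.execute("INSERT INTO comments VALUES (42)")
conn.execute("INSERT INTO replies VALUES (1, 42)")
```

The constraint check is exactly the "read of the prior comment" that makes the two transactions dependent and hence well ordered.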

~~~
andrewflnr
To me, that still makes it sound like a bad example. Adding all those
qualifiers is a distraction from the actual point you're trying to make.

------
shaklee3
This is a great article. I read the jepsen crdb analysis long ago, but never
understood exactly what was wrong. This describes it (and defends) really
well.

~~~
ccmonnett
I don't use it (yet) but every interaction I've had with Cockroach as a
company has been great. Even to the point where I had a marketer for a not-
quite-direct competitor to Cockroach whisper to me "You know, for your use
case, you should probably just use Cockroach..."

