
ACID transactions in a globally distributed database - jchanimal
https://fauna.com/blog/consistent-transactions-in-a-globally-distributed-database
======
d__k
In a globally distributed database, the main problem is time: different
locations have different clocks. For example, assume that one table of the
database is located on Earth while another table is stored on Mars (a 4-13
minute delay). Now we want to commit a transaction touching these two tables
from the Moon, while the result will be queried from Venus. I could not tell
from this post whether Fauna is able to work correctly under such conditions
and whether it was designed for such use cases.

~~~
sleepydog
"Globally distributed" should imply that it's confined to one planet.

~~~
vidarh
More generally, the point is that if the fastest you can send a unit of data
between two locations is t, then the shortest possible time in which you can
pass a transaction from one location to the other is also t.

That puts an upper (EDIT: corrected from lower) bound on the throughput of many
types of operations, such as when you need to guarantee strict ordering of
transactions (EDIT: in the worst case, where transactions are being issued
from multiple nodes; in the best case you can optimize in a variety of ways).

So while we won't have to deal with interplanetary latencies, it still matters
greatly for throughput and consistency guarantees.

~~~
mattb314
Even assuming you meant "upper bound" instead of "lower bound", this isn't
strictly true in theory. To achieve high throughput with large latency between
nodes, all you have to do is group transactions into large batches and reach
consensus on how they should be ordered (for example, you can run paxos to
decide what a batch consists of, and then use any method you like to order
within the batch). Running paxos at interplanetary latencies would obviously
take some time, but since you can have multiple rounds in progress at once (or
increase the size of each batch to be much larger than the latency),
throughput isn't limited by latency. In practice, most people care a lot about
latency, and achieving the desired latency puts a limit on throughput. See the
Calvin DB paper for a more thorough explanation.
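
A toy back-of-the-envelope sketch of that batching argument (not Paxos itself;
the latency, batch size and round count are invented numbers) in Go:

    // Toy simulation (not real Paxos, just the arithmetic): transactions are
    // grouped into batches, each batch needs one high-latency consensus round
    // to commit, and several rounds are kept in flight at once. Throughput is
    // then set by how fast batches are formed, not by the round-trip latency.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        const (
            latency   = 200 * time.Millisecond // simulated inter-node round trip
            batchSize = 1000                    // transactions agreed on per round
            batches   = 10                      // consensus rounds in flight at once
        )

        start := time.Now()
        var wg sync.WaitGroup

        for i := 0; i < batches; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // "Reach consensus" on what this batch contains; ordering
                // within the batch can be decided locally afterwards.
                time.Sleep(latency)
            }()
        }
        wg.Wait()

        total := batches * batchSize
        elapsed := time.Since(start)
        fmt.Printf("committed %d txns in %v (~%.0f txns/s) despite %v latency\n",
            total, elapsed, float64(total)/elapsed.Seconds(), latency)
    }

With 10 rounds of 1000 transactions in flight, the whole run finishes in
roughly one round trip, so throughput scales with batch size even though each
individual round still pays the full latency.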

~~~
vidarh
Yes, upper bound. Lower bound on time it takes for the transaction to be
globally available.

You're right that there are operations whose throughput you can increase, and
I was imprecise: you can indeed order them "after the fact", but that's not
very interesting. The scenario I had in mind was one where clients talking to
different nodes are issuing transactions that depend on the same items of
data.

In that case you either need to obtain a lock, in which case you need (in the
best case) to wait for the latency interval (plus a margin for the largest
possible clock skew) to see if the other side wants a lock on the same object,
or you need to be optimistically firing off transactions. But the best case
then is that you _just_ get your transaction in right before the remote side
starts operating on it, and they happen to _just_ fire off another
transaction, and so on. In reality there'd be slowdowns.

If you're dealing with uncontended objects, you can do much better on average,
but you can't guarantee better than the latency between nodes.

There are certainly tons of special cases where you can optimize.
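
A hypothetical sketch of that optimistic path, with an in-process
compare-and-swap standing in for a real replicated store (the names and counts
are invented): two writers hammering the same object lose commits and have to
retry, and in a real deployment every retry pays the inter-node latency again.

    // Hypothetical sketch of the "optimistically firing off transactions" path:
    // each writer reads a version, does its work, and the commit only succeeds
    // if the version is unchanged. On a contended object the loser must retry.
    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    // versionedStore stands in for a replicated register with compare-and-swap.
    type versionedStore struct {
        mu      sync.Mutex
        value   int
        version int
    }

    func (s *versionedStore) read() (value, version int) {
        s.mu.Lock()
        defer s.mu.Unlock()
        return s.value, s.version
    }

    // commitIf applies the write only if the caller saw the latest version.
    func (s *versionedStore) commitIf(readVersion, newValue int) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        if readVersion != s.version {
            return false // someone else committed first; caller must retry
        }
        s.value = newValue
        s.version++
        return true
    }

    func main() {
        store := &versionedStore{}
        var retries int64
        var wg sync.WaitGroup

        // Two "sites" both incrementing the same contended counter.
        for site := 0; site < 2; site++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := 0; i < 1000; i++ {
                    for {
                        v, ver := store.read()
                        if store.commitIf(ver, v+1) {
                            break
                        }
                        atomic.AddInt64(&retries, 1) // conflict: pay the latency again
                    }
                }
            }()
        }
        wg.Wait()

        value, _ := store.read()
        fmt.Printf("final value %d, conflict retries %d\n", value, retries)
    }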

------
alexvoda
Interesting but ultimately irrelevant since it appears to not be FLOSS. We
have already seen what happened to FoundationDB.

~~~
setra
They are primarily targeting the cloud database market. There are many
examples of non-free software surviving just fine in this market. Both Amazon
and Google run their own proprietary cloud databases like this.

~~~
jankotek
And it is very easy to host your own instance of Amazon or Google :-)

------
mabbo
> industry experience has taught us that availability—in the strict sense that
> it is defined in the CAP theorem—is overrated: In practice, the uptime of
> systems that favor availability have not proven greater than what can be
> achieved with consistent systems.

The answer to any distributed consistent database is always the same: we gave
up on one letter in CAP. In this case, it was 'A', availability.

And hey, cool, that's certainly an option. But I'd prefer if they had simply
made that the subtitle.

~~~
jacobparker
What they're trying to explain is that so-called AP systems do not have 100%
availability either. The CAP theorem applies to very specific scenarios that
are not necessarily common - it is a much narrower theorem than "CP vs AP"
implies. For example, if your network partitions are usually small (i.e. there
exists a majority outside the partition) and you can avoid routing requests to
the minority partition (e.g. because you peer with ISPs and can route
somewhere else, or internally avoid routing to a rack that is down, etc.) then
you won't observe a loss of A.

For more information on this perspective you can read _Spanner, TrueTime and
the CAP theorem_ by the fella that coined the term CAP:
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf)

Yammer concurred with this perspective, stating:

 _" At Yammer we have experience with AP systems, and we’ve seen loss of
availability for both Cassandra and Riak for various reasons. Our AP systems
have not been more reliable than our CP systems, yet they have been more
difficult to work with and reason about in the presence of inconsistencies.
Other companies have also seen outages with AP systems in production. So in
practice, AP systems are just as susceptible as CP systems to outages due to
issues such as human error and buggy code, both on the client side and the
server side."_

[https://yokota.blog/2017/02/17/dont-settle-for-eventual-cons...](https://yokota.blog/2017/02/17/dont-settle-for-eventual-consistency/)

~~~
pkolaczk
"In practice, the uptime of systems that favor availability have not proven
greater than what can be achieved with consistent systems."

This claim is much _stronger_ than just saying AP systems sometimes fail, so
observing some AP system failures is not enough to prove it. For the statement
to hold true, AP systems would have to fail just as often as CP systems.
However, such a claim is totally unfounded. The article doesn't link to any
research showing AP systems have the same (or worse) availability as CP
systems.

On the other hand, most CP systems, even _centralized ones_, are not truly
consistent as required by ACID either. Most RDBMS systems run at the Read
Committed isolation level for performance reasons, which is ACD, not ACID.
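
For illustration, Go's database/sql lets a client explicitly request
SERIALIZABLE instead of relying on the server's Read Committed default (the
DSN and the accounts table below are placeholders):

    // Hedged sketch: the driver normally uses the server's default isolation
    // level (often Read Committed); database/sql lets you ask for something
    // stronger per transaction.
    package main

    import (
        "context"
        "database/sql"
        "log"

        _ "github.com/lib/pq" // assumed Postgres driver
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://user:pass@localhost/app?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Explicitly request SERIALIZABLE instead of the Read Committed default.
        tx, err := db.BeginTx(context.Background(), &sql.TxOptions{
            Isolation: sql.LevelSerializable,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer tx.Rollback()

        if _, err := tx.Exec(`UPDATE accounts SET balance = balance - 10 WHERE id = 1`); err != nil {
            log.Fatal(err)
        }
        if err := tx.Commit(); err != nil {
            // Under SERIALIZABLE, Postgres may abort with a serialization
            // failure; the caller is expected to retry the whole transaction.
            log.Fatal(err)
        }
    }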

It also ignores the fact that most systems are neither AP nor CP, and that
this distinction is not very useful for describing database systems.

~~~
jacobparker
Yep, the Yammer article is just opinion. It may or may not match your
experiences/beliefs :) Check the Spanner paper for slightly more data (but it's
still pretty informal) from Google. They don't talk about other systems, but
they explain why Spanner has such high availability. It would be interesting
to see an empirical study comparing availability but I'm not holding my breath
because doing that fairly and realistically would be challenging.

I'm not sure about the claim about CP systems, but I probably have a different
opinion on "common" CP systems (etcd, consul, Spanner-likes, which do offer
consistency in the sense of ACID.) But it's definitely worth considering
things weaker than that - e.g. Azure Blob Storage and Google Cloud Storage
(S3-likes) offer strong consistency but no cross-key transactions. These
databases are very applicable but weaker than Spanner-likes like the one from
TFA.

Agreed that AP vs. CP isn't a good way to describe databases :)

~~~
nyrikki
Note that this is mostly just an opinion piece, and it is also very guilty of
attempting to poison the well.

"AP systems are not 100% available in practice"

Nothing is 100% available, it is like absolute zero, you can get close but
never reach it.

I am not sure if the intent of this was to justify the uptime problems that
Yammer had this year, or just to self-justify engineering choices.

The real flaw is ignoring the horses-for-courses reality of system needs.
Automated teller machine transactions are not strongly consistent, yet we
manage to get by.

Spanner is a "shared nothing" architecture which has advantages availability
and recoverablity. But you do pay a price for the high level of consistency.

Each write is mediated by a leader and has to be committed on at least two of
the three regional nodes, which does have a cost.
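
A minimal sketch of that majority write, assuming invented regional round-trip
times (this is not Spanner's actual commit protocol):

    // Minimal sketch: a leader fans a write out to three regional replicas
    // and treats it as committed once a majority (2 of 3) has accepted it.
    // The slowest region is hidden from the commit latency, but the client
    // still waits for the second-fastest ack.
    package main

    import (
        "fmt"
        "time"
    )

    // replicate simulates sending the write to one region and receiving an ack.
    func replicate(region string, rtt time.Duration, acks chan<- string) {
        go func() {
            time.Sleep(rtt)
            acks <- region
        }()
    }

    func main() {
        acks := make(chan string, 3)
        start := time.Now()

        // Illustrative round-trip times from the leader to each replica.
        replicate("us-east", 20*time.Millisecond, acks)
        replicate("eu-west", 90*time.Millisecond, acks)
        replicate("ap-south", 180*time.Millisecond, acks)

        // Commit as soon as a majority has acknowledged.
        for i := 0; i < 2; i++ {
            fmt.Printf("ack from %s after %v\n", <-acks, time.Since(start))
        }
        fmt.Printf("write committed on a 2/3 quorum after %v\n", time.Since(start))
    }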

Yammer (and I guess rayokota) chose to use PostgreSQL, which is not a horrible
choice for a traditional RDBMS these days, but we all know the pain of a
master node failover, re-sync, etc. And with most streaming replication
methods there will be a serious issue reconciling the nodes that may have been
on a partition with the original master against the fail-over node that is
promoted.

Distributed systems are hard, but I am having a hard time really finding much
value in this post outside of a fairly opaque opinion that there is no perfect
product for all needs, which will always be true.

But as for why I chose to respond to your comment, jacobparker: the thing to
remember about strong consistency is that locks, two-phase commits, single
mediators, etc. will all limit the theoretical speedup you can get by adding
resources.

While not a perfect fit, this means that you should engineer with Amdahl's law
in mind and hope for Gustafson's law.
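
For reference, with p the fraction of the work that can be parallelized and N
the number of nodes, the two laws are:

    S_{\mathrm{Amdahl}}(N)    = \frac{1}{(1 - p) + p/N}   % fixed problem size
    S_{\mathrm{Gustafson}}(N) = (1 - p) + pN               % problem size grows with N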

As a practical example of how this becomes an issue with scaling, consider the
following paper about the limitations of single disk queues in the Linux
kernel on systems with multiple CPUs:

kernel.dk/blk-mq.pdf

Note the extreme falloff on Figure 4.

This doesn't mean that you can't start out with a simple strong-consistency
model, but if you don't at least try to avoid the need for ACID-style
transactions, it will be very hard to scale later. Worse, most teams I have
been on try to implement complex and fragile sharding schemes which in effect
replicate a BASE-style datastore. If we think managing distributed systems is
hard, writing them is much more difficult.

But really, there are no universal truths in this area, and all decisions
should be made based on the use case and not by selecting the product before
defining the need (which, it seems, is our most common method).

------
smnscu
I started using CockroachDB [1] recently and I'm really impressed with it
overall. The Postgres compatibility is genius in my opinion, and despite some
hurdles it's been a joy to work with (I'm using xo [2] to generate Go code
from custom templates). I've yet to deploy it in production, but the best part
is that if the performance sucks I can simply deploy postgres instead. As
someone who used to be a big RethinkDB supporter, this is my new favorite pet
oss project.

edit: oh, and it's written in Go, my main language, which is definitely a plus
if you're a gopher

1:
[https://www.cockroachlabs.com/tags/acid/](https://www.cockroachlabs.com/tags/acid/)

2: [https://github.com/xo/xo](https://github.com/xo/xo)

~~~
therealmarv
Postgres compatibility? Really? Reference? I want to believe, but aren't you
comparing apples to oranges? Last time I checked, the JSON datatype was not
supported in CockroachDB.

~~~
detaro
CockroachDB explicitly aims to offer a Postgres-compatible interface. Is their
homepage reference enough for that?
[https://www.cockroachlabs.com/product/cockroachdb/](https://www.cockroachlabs.com/product/cockroachdb/)
or [https://www.cockroachlabs.com/docs/stable/build-an-app-with-...](https://www.cockroachlabs.com/docs/stable/build-an-app-with-cockroachdb.html)

~~~
therealmarv
No, not even a JSON datatype is supported yet, and I cannot use a database
without JSON for my use case:
[https://github.com/cockroachdb/cockroach/issues/2969](https://github.com/cockroachdb/cockroach/issues/2969)
A compatible interface is not full compatibility, and JSON, as an example, is
hardly the most exotic feature inside PostgreSQL.

~~~
ddorian43
Give the dudes time. Postgres has a lot of stuff.

~~~
orangechairs
JSONB support is on our 2.0 roadmap and is actively under development.

Roadmap:
[https://github.com/cockroachdb/cockroach/wiki/Roadmap](https://github.com/cockroachdb/cockroach/wiki/Roadmap)

Confirmed Development (I see you already reference this issue above):
[https://github.com/cockroachdb/cockroach/issues/2969](https://github.com/cockroachdb/cockroach/issues/2969)

