
Distributed Databases Should Work More Like CDNs - loiselleatwork
https://www.cockroachlabs.com/blog/distributed-database-performance/
======
astine
Sooo, being able to distribute data globally is good for performance? Who
knew? The thing about distributed systems, including distributed databases,
is that they need to navigate the CAP theorem (Consistency, Availability,
Partition tolerance: pick two, essentially), and every solution is ultimately
a trade-off. This article would be a lot more interesting if it showed _how_
CockroachDB made a better trade-off than the other solutions listed.

~~~
Chickenosaurus
I think the PACELC theorem [1] should be preferred to the CAP theorem, as it
is more precise. It states that in case of a partition (P), there is a
trade-off between availability (A) and consistency (C); else (E), there is a
trade-off between latency (L) and consistency (C).

[1]
[https://en.m.wikipedia.org/wiki/PACELC_theorem](https://en.m.wikipedia.org/wiki/PACELC_theorem)

~~~
tormeh
Ah, much better, I have to say. Much more informative as well. The CAP theorem
is a bit ungainly since the only way to have a CA database is to only have a
single node.

~~~
ithkuil
> the only way to have a CA database is to only have a single node.

I know that this is meant to mean: in the real world you cannot just write off
partition resilience and still call your system highly available, since
partitions will happen sooner or later, and when they do your CA system won't
be available.

But on the other hand, having a system that is always available _except_
during a network partition is a useful thing: you can design a network where
partitions happen much more rarely than the rate at which individual machines
die.

I.e. in practice a single node, while (if you nitpick) the only true CA
system, will be available for less time on average than a multi-node system
that (if you nitpick) is not CA - provided that the underlying network is
partition resilient; it's not a boolean, it's a probability.

(See Google Spanner.)

~~~
zjaffee
It's my understanding that Google Spanner leverages dedicated hardware (direct
lines between machines), so the degree to which network partitions occur is
different compared to other systems.

~~~
Chickenosaurus
According to a paper by Eric Brewer, the author of the original CAP theorem,
Google Spanner technically is not a CA system [1]. It is advertised as such
because partitions are supposed to be exceedingly rare, as the infrastructure
is outstanding and a lot of operating experience is present [1].

I believe at the end of the day the CAP theorem is too fuzzy for such
discussions.

[1]
[https://research.google.com/pubs/pub45855.html](https://research.google.com/pubs/pub45855.html)

~~~
ithkuil
The paper says:

"Does this mean that Spanner is a CA system as defined by CAP? The short
answer is “no” technically, but “yes” in effect and its users can and do
assume CA.

The purist answer is “no” because partitions can happen and in fact have
happened at Google, and during (some) partitions, Spanner chooses C and
forfeits A. It is technically a CP system."

which I believe is another way of wording what I said in my post above.

~~~
Chickenosaurus
I agree. I did not mean to refute your point, but to provide more context.

------
dantiberian
I love CockroachDB, but feel like this article was a bit misleading in its
claims. I started out writing a comment response, but it got so long that it
turned into a blog post: [https://danielcompton.net/2018/02/08/add-context-
cockroachdb...](https://danielcompton.net/2018/02/08/add-context-cockroachdb-
databases-acting-like-cdns).

------
cagenut
this is true, and why Fastly is kind of like a globally distributed, nearly-
cache-coherent key-value store that people _use_ as a CDN.

there's a great talk on how it's done with a custom gossip protocol
implementation:
[https://www.youtube.com/watch?v=HfO_6bKsy_g](https://www.youtube.com/watch?v=HfO_6bKsy_g)

~~~
loiselleatwork
This is super interesting; thanks for the pointer.

------
dstroot
CDN: no trade-offs. Faster everywhere. More reliable overall.

CockroachDB: trade some performance for geographic redundancy. The trade-off
may work in your favor, e.g. for read-heavy workloads (or not).

I plugged in CDB in place of Postgres for some testing this week and was
surprised it worked so well.

~~~
icebraining
_no trade-offs_

This is almost never the case, and CDNs are no exception. A CDN like
Cloudflare that reuses your domain(s) means that it becomes just an extra hop
that your dynamic requests have to travel through on the way to and from the
main server. A CDN that uses its own domains requires extra DNS queries, extra
TCP & SSL connection setup, etc., plus it only starts loading once the browser
has started processing the HTML.
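
Back-of-the-envelope numbers for that second case (all figures illustrative):

    # Extra first-connection cost of a separate CDN domain (made-up figures):
    dns_ms = 30                 # extra DNS lookup for the CDN hostname
    rtt_ms = 50                 # round trip to the CDN edge
    tcp_handshake = rtt_ms      # TCP SYN/SYN-ACK: one round trip
    tls_handshake = 2 * rtt_ms  # TLS 1.2 full handshake: ~two round trips
    print(dns_ms + tcp_handshake + tls_handshake)  # ~180ms before the first byte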

~~~
dstroot
Excellent points. I definitely oversimplified.

------
zzzcpan
> When partitions heal, you might have to make ugly decisions: which version
> of your customer’s data do you choose to discard? If two partitions received
> updates, it’s a lose-lose situation.

When partitions heal, you simply merge all versions through conflict-free
replicated data types. No ugly decisions, no sacrificing either latency or
consistency. We call it strong eventual consistency [1] nowadays. And it's
exactly like CDNs, except more reliable.
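
To make that concrete, here's a minimal sketch of a last-writer-wins register,
one of the simplest CRDTs (illustrative only, not any particular library's API):

    import time

    class LWWRegister:
        """Last-writer-wins register: a state-based CRDT. Merge keeps the
        value with the highest (timestamp, node_id), so replicas converge."""

        def __init__(self, node_id):
            self.node_id, self.value, self.timestamp = node_id, None, 0.0

        def set(self, value):
            # Local write: tag the value with the current wall-clock time.
            self.value, self.timestamp = value, time.time()

        def merge(self, other):
            # Merge is commutative, associative, and idempotent, which is
            # what makes convergence after a partition automatic.
            if (other.timestamp, other.node_id) > (self.timestamp, self.node_id):
                self.value = other.value
                self.timestamp, self.node_id = other.timestamp, other.node_id

    # Two replicas diverge during a partition, then heal:
    a = LWWRegister("node-a"); a.set("alice@old.example")
    b = LWWRegister("node-b"); b.set("alice@new.example")
    a.merge(b); b.merge(a)
    assert a.value == b.value  # both converge, no hand-made ugly decisions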

I'm wondering whether, since CockroachDB keeps lying about and attacking
eventual consistency in these PR posts, the whole "consistency is worth
sacrificing latency for" mantra might not work in practice after all. People
just don't buy it; they want low latency, they want something like CDNs,
something fast and reliable, something that just works. Something that
CockroachDB can never deliver.

[1]
[https://en.wikipedia.org/wiki/Eventual_consistency#Strong_ev...](https://en.wikipedia.org/wiki/Eventual_consistency#Strong_eventual_consistency)

~~~
imtringued
My application is just plain old CRUD.

What if two users want to change, e.g., the telephone number of an existing
record during a network partition?

There just is no obvious way to merge a telephone number. One of them is
correct, the other is incorrect.

Can CRDTs solve my simple problem?

~~~
merb
I think the issue is solved (while still complex) with an event sourcing
system. The chances are really high that the users did not push / updated the
number at the exact same time. (so they are at least 1ns apart, even when not
it's not that bad). So you can actually restore to a sane state by reapplying
the log from scratch.

i.e. this is solved with CQRS and Event Sourcing and it probably works in like
all databases. It's quite complex but pretty reliable, I'm pretty sure that
everybody already built at least a extremly simple append only event log.
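
A toy sketch of the replay idea (names and timestamps here are illustrative,
not any real framework's API):

    from dataclasses import dataclass

    @dataclass
    class PhoneUpdated:
        user_id: str
        phone: str
        timestamp: float  # when the update was issued

    # Append-only event log; in production this would be a durable store.
    log = [
        PhoneUpdated("u1", "555-0100", timestamp=1000.000001),
        PhoneUpdated("u1", "555-0199", timestamp=1000.000002),  # 1us later
    ]

    def rebuild_state(events):
        """Reconstruct current state by replaying the log in timestamp order."""
        state = {}
        for e in sorted(events, key=lambda e: e.timestamp):
            state[e.user_id] = e.phone  # later events overwrite earlier ones
        return state

    print(rebuild_state(log))  # {'u1': '555-0199'}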

~~~
elvinyung
Does this imply that there's a single place that log entries get ingested at?
Doesn't that make it a single point of failure?

~~~
cube2222
No, you can just have partitioned, replicated topics like Kafka has; just make
sure that all the events for one user get put into the same partition, so that
they are totally ordered.
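
The "same partition" routing is just deterministic hashing of the record key;
a rough sketch of the idea (not Kafka's actual murmur2 partitioner):

    import hashlib

    NUM_PARTITIONS = 8

    def partition_for(user_id: str) -> int:
        """Same key always maps to the same partition, so all of one
        user's events land in a single, totally ordered log."""
        digest = hashlib.md5(user_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

    # All events for "user-42" are routed identically, preserving their order:
    assert partition_for("user-42") == partition_for("user-42")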

~~~
elvinyung
So I guess the specific issue that the grandparent raised was multiple users
wanting to edit the same record. So even if the queue is sharded on some
primary key, then for each partition, either a) you have only one machine
ingesting, and thus a SPOF, or b) you ingest data using multiple machines,
which means you need consensus/conflict resolution somewhere, right?

~~~
Joeri
You ingest the log partition on all machines to reconstruct the state from the
log. You do this even on the machine that writes to the log, treating the log
as the source of truth. Since every user would belong to one partition, there
would be a globally agreed ordering of their events in the log.

Basically, though, that makes a CP trade-off, since a network partition that
hides the primary Kafka replica from some of the writers causes writes to
fail.

There's simply no way to have a globally distributed system that's always
available and always consistent: either you get inconsistency or you drop
writes.

~~~
elvinyung
Yep, this was the answer I was expecting. Basically, I really doubt that event
sourcing magically solves this, which is what the above comment claimed [1].

[1]
[https://news.ycombinator.com/item?id=16328127](https://news.ycombinator.com/item?id=16328127)

------
pdeva1
Am I wrong, or does the article not really tell how CDB deals with the
latency issue, especially with regard to writes?

If the write has to be consistent and available across multiple regions, it
will need to synchronously replicate that write to all the regions, thus
incurring the same performance penalty as RDS or any other consistent
database.

~~~
irfansharif
I doubt this particular article was intended to address how CRDB handles
writes cross-region, synchronous replication, etc. You'll find articles
touching on various aspects of what you're looking for if you dig through some
of the earlier posts on their blog [1] or their FAQ [2].

[1]: [https://www.cockroachlabs.com/blog](https://www.cockroachlabs.com/blog)

[2]: [https://www.cockroachlabs.com/docs/stable/frequently-
asked-q...](https://www.cockroachlabs.com/docs/stable/frequently-asked-
questions.html)

~~~
pdeva1
well, the comparison section on RDS specifically claims that RDS is inferior
because "this forces all writes to travel to the primary copy of your data".
So it doesn't explain how CDB is superior to RDS, since writes will incur the
same penalty in CDB too.

~~~
irfansharif
At the key-value level, CockroachDB starts off with a single, empty range (a
set of sorted, contiguous data from your cluster). As you put data in, this
single range eventually reaches a threshold size (64MB by default). When that
happens, the data splits into two ranges, each again covering a contiguous
segment of the entire key-value space. This process continues indefinitely; as
new data flows in, existing ranges continue to split into new ranges, aiming
to keep a relatively small and consistent range size. Each range is replicated
3-way (by default) as well, and is backed by a single Raft instance.

When your cluster spans multiple nodes (physical machines, virtual machines,
or containers), newly split ranges (or more specifically, replicas of these
ranges) are automatically rebalanced to nodes with more capacity. Writes
addressed to a range are handled by the Raft leader for that range (which can
hop around its various replicas as needed). Writes to different ranges (non-
overlapping key spaces by definition) are processed independently, and very
well may be processed across multiple machines.

Source: [https://www.cockroachlabs.com/docs/stable/frequently-
asked-q...](https://www.cockroachlabs.com/docs/stable/frequently-asked-
questions.html)
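
A rough sketch of that split behavior (illustrative only, not CockroachDB's
actual implementation):

    RANGE_MAX_BYTES = 64 * 1024 * 1024  # 64MB default split threshold

    class Range:
        """A sorted, contiguous slice of the key-value space."""
        def __init__(self, start_key, end_key):
            self.start_key, self.end_key = start_key, end_key
            self.data = {}  # key -> value; kept sorted in the real system

        def size(self):
            return sum(len(k) + len(v) for k, v in self.data.items())

        def maybe_split(self):
            """Once a range exceeds the threshold, split at the middle key so
            each half again covers a contiguous segment of the key space."""
            if self.size() < RANGE_MAX_BYTES:
                return [self]
            keys = sorted(self.data)
            mid = keys[len(keys) // 2]
            left, right = Range(self.start_key, mid), Range(mid, self.end_key)
            for k, v in self.data.items():
                (left if k < mid else right).data[k] = v
            return [left, right]

Each resulting range would then be replicated 3-way by its own Raft group, as
described above.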

~~~
pdeva1
what you described is the mechanism for dealing with scalability, i.e.
throughput. What is being disputed is the claim about latency: every bit of
data needs to be replicated across multiple regions consistently, and that
will incur the same latency as RDS or any other consistent database.

~~~
irfansharif
Given the flexibility of where the range's Raft leader can be, CRDB makes an
active effort to colocate it near where the requests originate from (which is
part of what the CDN parallel was alluding to with low RTT for multi-region
deployments).

WRT the writes here, if a majority of the replicas for that range are in the
proximate regions, the requests only travel that far before responding. I
believe the argument is that this is a more flexible design point than a
single point of entry for all incoming writes, regardless of origin. The cost
of writing out to the furthest region within any majority of replicas is of
course inevitable if you want cross-region durability; alternatively, you
could trade this off by locating the majority of the replicas serving a
specific region's requests in that region.

~~~
pdeva1
> The cost of writing out to the furthest region within any majority of
> replicas is of course inevitable if you want cross-region durability

So we are in agreement that CDB has the same write latency as RDS for multi-
region deployments. The article seems to imply that is not the case, but as
you yourself agree, it actually is.

~~~
irfansharif
No? Apologies if I'm not communicating clearly. The RTT would be to the
furthest replica within `n` replicas ordered by proximity to the query origin,
where `n` is the number it would take to have a majority within the specified
replication factor (so 2 for a replication factor of 3, 3 for 5, etc.). It
would be different from the RTT observed with Amazon RDS as long as the
furthest replica, as described above, is in a different zone/region/DC than
RDS's primary instance (assuming a similar topology). So with a replication
factor of 3, if your DCs are in the UK, Australia, and the US, and the RDS
primary was in the US, all write requests from Australia would have to travel
over to the US. But with a replica in each zone, write requests from Australia
would hit the DC in Australia and the DC in the UK, avoiding the larger RTT to
the US.
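
To put rough numbers on the quorum math (DCs and latencies here are made up
for illustration):

    # Illustrative RTTs (ms) from a gateway node in Sydney to each replica:
    rtt_ms = {"sydney": 1, "singapore": 95, "virginia": 200}

    def quorum_commit_latency(rtts, replication_factor=3):
        """A Raft write commits once a majority of replicas ack, so latency
        is the RTT to the majority'th-nearest replica."""
        majority = replication_factor // 2 + 1  # 2 out of 3 here
        return sorted(rtts.values())[majority - 1]

    print(quorum_commit_latency(rtt_ms))  # 95ms: Sydney + Singapore ack first
    # A single US primary would instead cost every Sydney write the ~200ms RTT.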

To reiterate: I believe the argument is that this is a more flexible design
point than a single point of entry for all incoming writes, regardless of
origin.

------
daxfohl
Does GDPR really require all Europeans' data to stay on EU soil without
explicit consent?

~~~
itcmcgrath
Seems to be a common misconception. Look at the EU Model Contract Clauses,
e.g. [https://cloud.google.com/terms/eu-model-contract-
clause](https://cloud.google.com/terms/eu-model-contract-clause)

------
jjevanoorschot
I used CockroachDB for a university project, and while I think it looks very
promising, I found the tooling and documentation to be a bit lacking. I
wouldn't use it in production yet. However, when CockroachDB matures a bit I
can really see it take off.

------
timkpaine
Imagine a globally replicated and version controlled object store. You could
use this for data, assets, source code, anything you like. Would be incredibly
useful for both the development side of things, as well as production.

~~~
prepend
We could call it the World Wide Web.

~~~
TeMPOraL
It would fall short of all of these goals, as it would get turned into a
platform for advertisements, colorful magazine pseudocontent, and malware
distribution.

~~~
blowski
Don’t we now have both things, with the latter subsidising the former?

~~~
TeMPOraL
Yes. Come to think of it, it's not that the original web died - it just
_didn't grow much_; almost the entire growth went to the commercial part.

------
supergirl
what a dumb article about nothing. hey, a database replicates stuff but so
does a CDN! what a brilliant insight. let's make an article with this title
but fill it with marketing text.

------
hexspeaker
If this interests you, you'll probably enjoy reading Google's paper on
Spanner. CockroachDB was heavily influenced by it.

[https://research.google.com/archive/spanner.html](https://research.google.com/archive/spanner.html)

------
philbogle
As long as we're advertising, consider:
[https://cloud.google.com/spanner/](https://cloud.google.com/spanner/)

------
russell_h
I'm curious, what sort of read latency is achievable with CockroachDB? Does it
support some notion of tunable read consistency in order to achieve lower read
latency at the expense of consistency?

~~~
susegado
Reads from CockroachDB go through a lease holder for the piece of data being
read, without needing confirmation from any replicas, so there is no overhead
from replication. But read latency can be affected by writes on the same data
(conflicts), because it is fundamentally a consistent (serializable) system.
This is not tunable.

~~~
russell_h
Thanks, that makes sense (and provides good context for the article). So for a
cluster spanning multiple regions, one region can support low-latency reads
for a given range, but reads in any other region will have to go cross-region
to the leaseholder. Being able to move the leaseholder around to optimize read
latency makes a lot of sense.

It would also be useful to me, in some cases, to be able to perform a read-
only query in an "inconsistent" mode to avoid that cross-region latency, at
the expense of potentially receiving stale data.

------
appdrag
Very interesting! Have you used it at large scale in production yet?

------
HumanDrivenDev
The only distributed database I'm familiar with is CouchDB. Can anyone give me
a birds eye view on how Cockroach is different?

~~~
tyingq
Unfortunately, it's not on the list for the table of feature comparisons, but
here's a start: [https://www.cockroachlabs.com/docs/stable/cockroachdb-in-
com...](https://www.cockroachlabs.com/docs/stable/cockroachdb-in-
comparison.html)

Might be helpful if you already know the CouchDB answer for each feature.

------
anon1253
so immutable, like datomic? [https://www.infoq.com/articles/Datomic-
Information-Model](https://www.infoq.com/articles/Datomic-Information-Model)

Or more like federated SPARQL on RDF?
[https://www.w3.org/TR/sparql11-federated-
query/](https://www.w3.org/TR/sparql11-federated-query/)

------
yuz
advertisement

------
ddorian43
A CDN post with no performance talk (besides the keyword).

They never mention lower performance (even on a single node). Add to that AWS
VPSes with pseudo-cores and the Spectre patches, and good luck with your TPS
reports.

~~~
vog
Would you mind elaborating? Your criticism is so condensed that I'm unable to
make much sense of it.

~~~
ddorian43
Their performance is 0.1-0.03x that of Postgres. And Spectre makes your AWS
VPS ~0.7x as fast as before. Meaning you can't use it for performance-
sensitive stuff (the whole point of fancy sharding and (No/New)SQL).

~~~
vog
Thanks! That's something I can make sense of.

And I agree that this is perhaps one of the many situations where people throw
away the "C" of "ACID" for no reason beyond it being fashionable to do so. (At
least most strive for "Eventual Consistency", but that's another can of worms.)

This is even less understandable once you notice that PostgreSQL offers a lot
more features than most other databases (SQL or NoSQL) and is extremely
flexible and extensible - even if you use it just as a fancy JSON or XML
store.

