
Ex-Googlers CockroachDB: A Scalable, Geo-Replicated, Transactional Datastore - ethanjones
https://github.com/cockroachdb/cockroach/blob/master/README.md
======
JackC
Are there distributed data stores like this that are also resilient to
intentional sabotage?

I've been looking recently at long-term digital preservation systems -- tools
designed to archive large amounts of data for decades. This is the Library of
Alexandria problem -- how do we preserve all this data we're generating
against once-in-a-century disasters?

So this 2005 paper lists thirteen different threats to long-term archives:
Media Failure; Hardware Failure; Software Failure; Communication Errors;
Failure of Network Services; Media & Hardware Obsolescence; Software
Obsolescence; Operator Error; Natural Disaster; External Attack; Internal
Attack; Economic Failure; Organizational Failure.[1]

Fault-tolerant distributed data stores are exciting, because they solve a
bunch of those problems off the bat -- media failure, hardware failure,
communication errors, failure of network services, hardware obsolescence, and
natural disaster.

They also help to address software failure, software obsolescence, and
economic failure, because archival projects are always strapped for resources
and it's great to rely on tools that exist for totally distinct, commercially-
valuable reasons.

But that still leaves operator error, external attack, and internal attack --
burning down the Library.

Hence my original question: are there distributed data stores that can be
configured to resist intentional destruction of data?

[1]
[http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.ht...](http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.html)

~~~
tomp
The problem with internal/external attacks is that we (as a society) don't
really want to prevent them. The reason is simple: child porn. To date, the
Bitcoin block chain (and related ideas) is the only data store that is 100%
resistant to attacks (i.e. changing history), but luckily it cannot handle
amounts of data large enough to be viable for child porn (or most other forms
of media). Tor, on the other hand, gets a bad rap precisely because it doesn't
prevent it (despite its numerous other beneficial uses).

The core of the issue is that humans view different information differently
(child porn vs. Mona Lisa), whereas for computers, bits are bits and numbers
are numbers. As long as child porn remains illegal and socially unacceptable,
we'll _want to_ enable _attacks_ on data, i.e. for someone (usually internal
operators) to be able to delete some kind of information, corrupt it or at
least track it. Of course, this necessarily means that _all_ information
stored in the same data-store will be vulnerable.

~~~
mhb
You're conflating the archival properties of the medium with the decision
about what to save. Oil paint on canvas is durable. It doesn't mean that a
museum needs to retain every piece of crap that anyone paints.

~~~
rincebrain
The problem is that removal of content because it's crap/immoral versus
operator destruction is not a meaningful distinction, from a software
perspective.

So it would probably need to be append-only to prevent people from burning it
down, which necessarily means that, once content is included, it cannot be
modified or removed.

------
rabino
Why the 'Ex-Googlers' in the title? Is it like a seal of approval or
something?

~~~
melling
Yes, it's called social proof. If you're observant, you'll notice that people
use it everywhere. Of course, social proof doesn't come with any guarantees.

~~~
peterwwillis
Nah, it's the honor-by-association fallacy. Social proof is a behavioral
phenomenon, not a form of false reasoning.

------
IgorPartola
So I cannot tell if this is aiming to be CA or AP. Having beaten my head
against the CAP wall for a while, how does it deal with partitions?

~~~
pjc50
The design doc
[https://docs.google.com/document/d/11k2EmhLGSbViBvi6_zFEiKzu...](https://docs.google.com/document/d/11k2EmhLGSbViBvi6_zFEiKzuXxYF49ZuuDJLe6O8gBU/edit)
says, "TBD: how to avoid partitions? Need to work out a simulation of the
protocol to tune the behavior and see empirically how well it works"

.. so they've gone for CA and forgotten about P.

~~~
teraflop
You're taking a sentence out of context and giving it the most uncharitable
possible interpretation. They certainly haven't "forgotten" about network
partitions because if you actually _read_ the design document, instead of
ctrl-F'ing for the word "partition", they talk about the mechanisms they use
to ensure sequential consistency. The software is not yet at the point of
being testable AFAIK, but clearly the _intent_ is to build a CP system.

The section you're quoting is discussing a separate gossip protocol that is
used to lazily propagate node state information. It does _not_ affect the
consistency of actual data replicas.

~~~
IgorPartola
Here's how I understand it: if you claim that your database cannot be taken
down by component failure, you necessarily have to consider network partitions
as well as individual node failures. If you are node A talking to node B, and
node B stops responding, you _cannot_ distinguish between a node failure and a
network failure. In order to claim high availability you must build an AP
system.

Let's reduce this to a multi-master setup where a client can connect to any
node and write to it. If a node fails outright and a client tries to connect,
no big deal: the client chooses a different node, and the failed node
eventually comes back online, catches up, and then opens a listening socket to
say "OK, write to me!"

However, if a partition happens, and client X writes to node A, client Y
writes to node B, and then the two nodes cannot agree on the correct data,
then you lose consistency. You can of course say that no node can be written
to if other nodes are offline, which means the system is not highly available.

So their stated goal: "The primary design goal is global consistency and
survivability..." either implies that high availability is not something they
are going for, or that they are shooting for something that is not
theoretically possible.

Of course all of the above is just my understanding, not necessarily fact, so
please correct me if I'm wrong.
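
To make the split-brain scenario concrete, here's a toy Go sketch (purely my
own illustration, not CockroachDB code): two replicas that blindly accept
writes during a partition end up with values nothing in the data itself can
reconcile.

```go
package main

import "fmt"

// node is a toy single-key replica that blindly accepts writes,
// as in the multi-master scenario above.
type node struct {
	name  string
	value string
}

func (n *node) write(v string) { n.value = v }

func main() {
	a := &node{name: "A"}
	b := &node{name: "B"}

	// Partition: client X can only reach A, client Y can only reach B.
	a.write("from client X")
	b.write("from client Y")

	// Partition heals: the replicas now disagree, and nothing in the
	// data itself says which write should win.
	if a.value != b.value {
		fmt.Printf("conflict: %s=%q vs %s=%q\n", a.name, a.value, b.name, b.value)
	}
}
```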

~~~
teraflop
> In order to claim high availability you must build an AP system.

I strongly disagree. The "A" in CAP describes a system such that any single
non-failing node can _always_ make progress. That would be a nice property to
have, but it's much stricter than is required for a real system.

If a distributed database is resilient to failure of a minority of nodes, it
still makes sense to describe it as high-availability. And that is exactly
what a consensus algorithm like Paxos or Raft gives you.
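
A quick sketch of the quorum arithmetic (my own illustration, not from the
design doc): a group of n replicas can keep making progress as long as a
strict majority is reachable, so a 5-node group tolerates 2 failures but not
3.

```go
package main

import "fmt"

// canMakeProgress reports whether a Paxos/Raft-style group of size n
// still holds a majority quorum after `failed` nodes become unreachable.
func canMakeProgress(n, failed int) bool {
	return n-failed > n/2
}

func main() {
	// A 5-node group survives up to 2 failures; the 3rd loses quorum.
	for failed := 0; failed <= 3; failed++ {
		fmt.Printf("n=5 failed=%d -> available=%v\n", failed, canMakeProgress(5, failed))
	}
}
```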

------
tedchs
Here in the Southeast, we call them "Palmetto Bugs". Maybe a rename to
PalmettoDB? :)

~~~
wuliwong
Palmetto bugs are generally the larger, flying "American Cockroach." "German
Cockroaches" are smaller and don't fly, but are generally the ones that cause
infestations. I grew up in the Northeast, and German Cockroaches are all I
ever saw up there. I've been down in Atlanta for years now and I only see
Palmetto bugs here.

Unless this database can fly, I'd say they've chosen the right name. ;)

------
srcmap
Compared to Facebook's TAO approach from yesterday's post, I like the FB
small-set-of-APIs approach better. But that mainly focuses on handling a
social graph with objects and associations.

On the other hand, almost all of my backend-related features can be easily
abstracted to those APIs.

~~~
Shish2k
> Compared to Facebook's TAO approach from yesterday's post

Link? I see no relevant-looking "Facebook" or "TAO" in my recent RSS entries
:(

~~~
anonetal
I think he is talking about this post:
[https://news.ycombinator.com/item?id=8557408](https://news.ycombinator.com/item?id=8557408)

------
Thaxll
There are many things here that I don't agree with:
[https://github.com/cockroachdb/cockroach/](https://github.com/cockroachdb/cockroach/)

MySQL: Weak consistency

Cassandra: No availability or weak consistency with datacenter failure

~~~
teraflop
MySQL provides strong consistency when run on a single machine, but that
breaks down when you handle failover using asynchronous replication between
DCs.
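
A toy illustration of that failure mode (hypothetical code, not MySQL
internals): with asynchronous replication the primary acknowledges a write
before the replica has applied it, so promoting the replica in that window
silently drops the acknowledged write.

```go
package main

import "fmt"

func main() {
	primary := map[string]string{}
	replica := map[string]string{}

	// Client write: the primary acks as soon as it commits locally.
	primary["balance"] = "100"

	// The async apply to the replica has not happened yet when the
	// primary crashes, so failover promotes a replica missing the write.
	promoted := replica
	v, ok := promoted["balance"]
	fmt.Printf("after failover: balance=%q present=%v (acked write lost)\n", v, ok)
}
```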

------
ultimape
I hope Ceph can take cues from this and become able to do geo-replication at
scale.

------
xedarius
Are there any details on how the distributed joins are achieved? (Sorry if
that detail is in the design doc; my access to Google Drive is blocked from
work.)

------
peterwwillis
It's really strange to see networked databases go through this iterative
design fad. It was file transfers back in the 90s/00s; everyone had their own
distributed, decentralized file-transfer solution. Little-known fact: Gentoo's
Portage almost became an internet-wide distributed, decentralized public file
system. Thank god they abandoned that idea. Can you imagine trying to debug a
file-transfer error just to get an MP3 player to install on your machine? I
can't wait until databases go back to being mainframes.

~~~
anonbanker
> Little known fact, Gentoo's Portage almost became an internet-wide
> distributed decentralized public file system.

Am I the only person who's really sad that this didn't happen? After using apt
over Tahoe-LAFS (over I2P -- KillYourTV's PPA is on clearnet and I2P), I
wanted this to be the default behavior for apt.

------
kul_
awful name!

~~~
eksith
I think the sentiment was "hard to kill" and/or "won't go extinct," owing to
its resiliency. But I agree, they should have picked something else.

~~~
vram22
Good point about the likely reason.

Hydra could have been another good choice :)

[http://en.wikipedia.org/wiki/Hydra](http://en.wikipedia.org/wiki/Hydra)

See the first entry on the above Wikipedia page, about the many-headed
serpent.

------
datashovel
Happy to see it's written in Golang.

~~~
bojo
As a Go developer I was happy as well.

On the other hand this question immediately popped into my mind: What kind of
overhead does the GC incur, and how does it affect processes like a database
where low latency is desired?

~~~
datashovel
If I had to guess, I would say network and disk I/O will be the bottlenecks --
disk I/O less so, since it's a distributed system running on SSDs. If anyone
can solve the GC issues, though, I would bet on Google :)

~~~
natebrennand
Their chosen solutions to the GC issues are in this roadmap. [0]

[0]:
[https://docs.google.com/document/d/16Y4IsnNRCN43Mx0NZc5YXZLo...](https://docs.google.com/document/d/16Y4IsnNRCN43Mx0NZc5YXZLovrHvvLhK_h0KN8woTO4/preview?sle=true)

------
vishly
Awful name!!! Why "Cockroach"?

------
vishly
Awful, awful name... could folks really not come up with anything other than
"cockroach"?

------
JSno
the name feels sick

