
How CockroachDB Does Distributed, Atomic Transactions - orangechairs
http://www.cockroachlabs.com/blog/how-cockroachdb-distributes-atomic-transactions/
======
DivineTraube
This is the transaction protocol from Google Percolator - with slightly
different wording. I don't know if that's a well known fact but I was
surprised they did not even mention it. OSDI Paper on Percolator:
[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)

Percolator is the processing system Google uses to manage its search index
incrementally. It's built on BigTable which provides cell-level
linearizability, similar to what CockroachDB seems to achieve by using Raft.

What CockroachDB calls "intents" are per-value columns called "write" in
Percolator, ordered by BigTable's timestamp system. Percolator's "master
lock" is rebranded as a "switch" and, tada, there you have CockroachDB's fancy
lockless algorithm.

Edit: Amazon DynamoDB, too, uses a very similar approach to this in their
client-driven transaction implementation: [https://github.com/awslabs/dynamodb-transactions/blob/master...](https://github.com/awslabs/dynamodb-transactions/blob/master/DESIGN.md)

It would be interesting to have one of the CockroachDB guys comment on the
novelty of their approach.

~~~
mrtracy
Post author here:

In terms of Google products, CockroachDB much more closely resembles a
combination of Spanner
([http://research.google.com/archive/spanner.html](http://research.google.com/archive/spanner.html))
and F1
([http://research.google.com/pubs/pub41344.html](http://research.google.com/pubs/pub41344.html)).
However, while we do share many traits with these systems (a consistently
replicated, transactional key/value store with a SQL interface), our internal
transactional model is actually quite different.

If you're interested in some of the papers that inspired CockroachDB's
transactional model, we have mentioned a number of them in our design doc
([https://github.com/cockroachdb/cockroach/blob/master/docs/de...](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions)).

As for Percolator, that system is not a general purpose database but an
"incremental processor" for large datasets (such as a web crawler). While it
does use a snapshot transaction system internally, it appears to use real
exclusive access locks, which CockroachDB does not use. Our write intents are
not locks, because they do not guarantee exclusive access to a resource (I've
expanded on this in another comment on the blog:
[http://www.cockroachlabs.com/blog/how-cockroachdb-distribute...](http://www.cockroachlabs.com/blog/how-cockroachdb-distributes-atomic-transactions/#comment-2232720938))

We are not claiming our transaction model to be exceptionally novel; this post
is just trying to shed light on a specific property of that model.

------
ahachete
Thank you for sharing your strategy for concurrent transactions. It's a very
interesting topic :)

I see however some possible pain points in this strategy. My main concerns
are:

* This looks to me a lot like some poor (no pun intended) form of MVCC ([https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)). Basically, it supports only two "V"s, the "on" and "off", while "normal" MVCC implementations support as many as required.

* Related to the above, I don't see how you can support more than one overlapping tx with this strategy. Sure, I'll stay tuned for the next post on this topic, but I fail to see how --without exclusive locking-- it could be done based on these principles.

* Every read seems like it needs to check the switch. I understand this might be expensive, especially if there is a concurrent modification in progress. Keys are co-located in the node, but they still require random access.

* It looks like even aborted transactions require writes to the database. And every transaction requires at least two writes (the write intent and its consolidation to the key). I don't expect good performance here.

Please help me clarify these questions, especially if I'm missing something.
Hope it helps anyway. Thanks!

~~~
mrtracy
Hi there, blog author here!

This blog post is not a full description of CockroachDB's transaction
strategy. If you're interested, you can find a complete rundown of our
strategy here:
([https://github.com/cockroachdb/cockroach/blob/master/docs/de...](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions)). However, that is a very terse document: this
blog post was an attempt to expand on one of the basic ideas behind our
design. My goal is to clearly explain how we approach atomicity, without
broaching the complicated topic of Isolation.

That said, I will attempt to address some of the pain points you raised:

* Cockroach does use a full MVCC system; writes are "append-only" and we keep the full history of values for each key. Keys can be queried at a timestamp, and every write explicitly specifies a timestamp as well (each transaction is assigned a timestamp, all writes will use the same timestamp). With this system, each transaction operates on a snapshot of the database.

* Our strategy for overlapping transactions is straightforward: if two concurrent transactions have a conflict, one of them will be aborted and retried. Write intents (in combination with another component, the "read timestamp cache") are used by conflicting transactions to discover each other. Once discovered, the decision of which in-progress transaction gets aborted is deterministic - each transaction is assigned a random integer "priority", and the transaction with the highest priority always wins the conflict (unless one of the transactions has already committed).

* Every read does need to check the switch _if the key being read has a write intent_; reads of plain values (with no in-progress transaction) do not need to check a switch.

* Yes, aborted transactions can write intents to the database in our current system. However, the second "write" (consolidation) happens asynchronously, and is not required for the transaction to function correctly. It should only be noticed if the key is accessed again by another transaction before it is consolidated. For committed transactions, I don't believe this is a particularly high price to pay.
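To make the bullets above concrete, here is a toy Python sketch (all names are hypothetical; this is not CockroachDB's actual code) of append-only versioned values, write intents, and the deterministic priority-based conflict rule:

```python
import random

class Txn:
    """Toy transaction: a random integer priority decides conflicts."""
    def __init__(self, txn_id, ts, priority=None):
        self.id = txn_id
        self.ts = ts
        self.priority = priority if priority is not None else random.randrange(2**31)
        self.status = "PENDING"  # PENDING -> COMMITTED or ABORTED

def resolve_conflict(holder, pusher):
    """Deterministic rule from the post: unless the intent holder has
    already committed, the higher-priority transaction survives and the
    other is marked ABORTED (its client would then retry)."""
    if holder.status == "COMMITTED":
        pusher.status = "ABORTED"
        return holder
    loser = holder if pusher.priority > holder.priority else pusher
    loser.status = "ABORTED"
    return pusher if loser is holder else holder

class MVCCStore:
    """Append-only versioned map with at most one write intent per key."""
    def __init__(self):
        self.versions = {}  # key -> sorted list of (timestamp, value)
        self.intents = {}   # key -> Txn that laid down an intent

    def read(self, key, ts):
        # Plain values need no switch check; only a pending intent forces
        # a look at the intent holder's transaction record.
        holder = self.intents.get(key)
        if holder is not None and holder.status == "PENDING":
            raise RuntimeError("conflict: must resolve against %s" % holder.id)
        vals = [v for t, v in self.versions.get(key, []) if t <= ts]
        return vals[-1] if vals else None
```

For example, a reader at timestamp 3 over versions `[(1, "a"), (5, "b")]` sees `"a"`: the snapshot hides the later write.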

A system that doesn't write aborted transactions to disk would be considerably
more complex; I think it would involve two commit phases (I. The transaction
has been "committed" in memory, and can no longer be aborted by other
transactions II. The transaction must successfully commit all changes to disk
in an atomic fashion). You likely could get some additional performance (if
disk latency is a sufficient bottleneck), but it would be hard won in terms of
complexity.

~~~
ahachete
Hi, thanks for the reply!

I will read the more detailed document. It would probably make more sense to
me.

From your description of MVCC, it looks like you're storing the whole
replicated state machine that you have via the Raft protocol. If txs are
assigned such a timestamp, I agree it could operate as a full MVCC system.
Good.

Regarding the overlapping txs, it makes sense once you say you're aborting
txs. Without that, it sounded really difficult to achieve without locking.

When you say reads only check the switch if there is a write intent, where
exactly is that "switch"? I understood it's not with the key, but rather at
some other random location. If that's the case, I insist that you require two
reads: one for the key, the other for the switch. Or is the switch co-located
with every key?

Regarding the aborted txs, I can't argue without numbers, but I'd like to see
those. I mean: you may abort txs voluntarily (I want to ROLLBACK) or
involuntarily (there's a concurrent conflict). That may or may not be a high
number. But all of those would be written to disk, so it might become a high
price.

------
grogers
I'll be interested to see the further details discussed in the isolation post,
which might answer some questions.

This is basically the same as how the DynamoDB client transaction library works.
The two major problems with that library are:

* High write amplification (7N for DDB, probably similar for CockroachDB)

* Poor throughput under contention. This is mostly due to the policy of immediately aborting conflicting pending transactions, which could be changed if contention is frequent (client-side exponential backoff is probably good enough for most cases). The special sauce they didn't mention might also help here.
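The client-side exponential backoff mentioned above could be sketched like this (a generic illustration, not part of either library):

```python
import random
import time

class ConflictError(Exception):
    """Raised when a transaction attempt loses a write-write conflict."""
    pass

def with_backoff(attempt_fn, max_retries=5, base=0.05, cap=2.0):
    """Retry a conflicting transaction with jittered exponential backoff
    instead of immediately re-contending for the same keys."""
    for attempt in range(max_retries):
        try:
            return attempt_fn()
        except ConflictError:
            # Sleep a random duration in [0, base * 2^attempt] (capped),
            # so retries from competing clients spread out over time
            # rather than colliding again in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise ConflictError("gave up after %d retries" % max_retries)
```

The jitter matters as much as the exponent: without it, clients that conflicted once tend to retry at the same instant and conflict again.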

------
Exuma
Could you have picked a worse, more disgusting name?

You want people to feel good when they think of your product, not repulsed.

~~~
Strom
You may also want to hint that user data will survive even a nuclear winter.

~~~
wingerlang
I think a nicer name would be RoachDB. But CockroachDB/Cockroach labs is quite
a memorable name, more so than roachdb for sure.

------
joshribakoff
So they do it without a lock by using a "switch" instead. What is the
difference? Or is it the same concept as a "lock", and they've just created a
whole new confusion of terms by trying to be different?

Edit - there's an answer in the blog's comments. Apparently it differs from a
lock because "it won't prevent any operations from executing, it will just
properly order them".

------
sargun
Why didn't you go down the route of Paxos Commit for your ACP versus the
Yabandeh-style ACP? I realize that they have very different semantics, but
Paxos Commit seems to have (nearly) all the same properties and benefits
without a single key that basically acts as the "coordinator" for the
transaction (a process as opposed to a key).

------
hartror
I look forward to a Kyle Kingsbury (Aphyr) "Call me maybe" post. eg.
[https://aphyr.com/posts/324-call-me-maybe-aerospike](https://aphyr.com/posts/324-call-me-maybe-aerospike)

------
saganus
This looks very interesting, but I still don't get how they avoid
concurrency on the switch:

"The switch cannot be concurrently accessed – reads and writes of the switch
are strictly ordered". If not with a lock file, maybe a log-based
architecture?

~~~
mrtracy
Post author here!

The on disk part of the "switch", our transaction record, can only be accessed
internally. There are only a small number of operations to perform on it -
create it, read its value, and a small number of deterministic "read-then-
write" operations. For example, in the case of an "abandoned" transaction, a
later transaction may want to read the abandoned transaction record, see if it
is still "pending", and then mark it as "aborted" if it is.

Our runtime orders all such operations on the same record so that they are
completely sequential - that is, even for a "read-then-write" operation, it
must completely finish before any other operation can occur on that key. This
is done using basic runtime synchronization primitives (mutexes, channels,
etc).
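As a rough illustration of that serialization (using a mutex, as the comment mentions; the names here are made up, not Cockroach internals):

```python
import threading

class TxnRecord:
    """Toy transaction record ("switch"): every operation on it, including
    read-then-write steps, runs under one lock, so operations on the same
    record are completely sequential."""
    def __init__(self):
        self._lock = threading.Lock()
        self.status = "PENDING"

    def abort_if_pending(self):
        # The "abandoned transaction" case: read the status and, only if it
        # is still PENDING, flip it to ABORTED -- atomically with respect
        # to any other operation on this record.
        with self._lock:
            if self.status == "PENDING":
                self.status = "ABORTED"
                return True
            return False

    def commit(self):
        # Same pattern: the read ("is it still pending?") and the write
        # ("mark committed") cannot interleave with another operation.
        with self._lock:
            if self.status == "PENDING":
                self.status = "COMMITTED"
                return True
            return False
```

If a committer and an aborter race, exactly one of them observes PENDING and wins; the other's read-then-write sees the already-flipped state and fails cleanly.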

~~~
saganus
Thanks for the detailed explanation. Looking forward to the next post!

------
ruggeri
The writeup mentions that concurrent attempts to update a row will be
discussed in a future post. The simplified discussion here says that a writer
who encounters a write intent of an uncommitted transaction aborts the
concurrent transaction.

I believe also that non-repeatable reads would be a potential problem. A
transaction could read a row, then read it a second time; if a concurrent,
committed transaction mutated the row, the repeated read will return a
different result.

I believe that both these problems would be solved by MVCC which I suspect is
the next blog post.

But I'm not super great at distributed DB stuff, so please correct me if I've
misunderstood!

~~~
mrtracy
Post Author here!

Your understanding of this particular issue is correct: the system described
in this post would not be able to provide repeatable reads.

However, as you've alluded to, Cockroach transactions use a form of snapshot
isolation (implemented with MVCC) which does provide repeatable reads. And
yes, that will be the subject of my next post. :)
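A toy illustration of why a snapshot read is repeatable (a hypothetical helper, not Cockroach's API): both reads in a transaction use the same timestamp, so a write committed in between at a later timestamp stays invisible.

```python
def read_at(versions, ts):
    """Return the latest value with timestamp <= ts from an append-only
    list of (timestamp, value) versions, i.e. a read against a fixed
    snapshot of the key's history."""
    vals = [v for t, v in sorted(versions) if t <= ts]
    return vals[-1] if vals else None

history = [(1, "old")]
snapshot_ts = 5

first = read_at(history, snapshot_ts)   # reads "old"
history.append((7, "new"))              # a concurrent txn commits at ts=7
second = read_at(history, snapshot_ts)  # still "old": the read repeats
```

A new transaction started after the concurrent commit would get a higher snapshot timestamp and see `"new"`.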

------
knodi
I love CockroachDB, but the name gets it dismissed by many who don't know all
its capabilities.

------
jchrisa
I'm not sure if CockroachDB has multi-DC async replication, but the challenge
I've always found in this stuff is ensuring the transaction also commits on
remote replicas in an all-or-nothing way. Maybe I didn't read close enough...

~~~
mrtracy
Post author here

Sargun is correct, all of our replication is performed by Raft, which exists
at a lower level than the content covered in this post. Every command executed
against a cockroach key (i.e. "create write intent", "begin this transaction")
is persisted to the Raft log before executing; Commands will not proceed if
they are not confirmed by Raft. The outcome of all commands is deterministic,
so this provides complete replication of all mutations to each key.

If you're interested, there's more information available in our primary design
document:
([https://github.com/cockroachdb/cockroach/blob/master/docs/de...](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#raft---consistency-of-range-replicas))
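The persist-before-apply flow can be sketched as follows (grossly simplified: no leader election, quorums, or failure handling). Because commands are deterministic, replaying the same log on any replica reproduces the same state:

```python
class Replica:
    """Toy deterministic state machine fed from a command log. The log
    stands in for the durable, quorum-confirmed Raft log."""
    def __init__(self):
        self.log = []
        self.state = {}

    def propose(self, cmd):
        # A command is appended to the log (i.e. persisted and confirmed)
        # before it executes against the state machine.
        self.log.append(cmd)
        self._apply(cmd)

    def _apply(self, cmd):
        op, key, val = cmd
        if op == "put":
            self.state[key] = val

def replay(log):
    """Any replica that applies the same log arrives at the same state --
    the property replication relies on."""
    r = Replica()
    for cmd in log:
        r._apply(cmd)
    return r.state
```

This is why the commands mentioned above ("create write intent", "begin this transaction") can be replicated safely: shipping the log is enough, since applying it is deterministic.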

------
aidenn0
I'm much more interested in how they get "C" in a distributed DB than how they
get "A"

------
tbrock
Interesting read, the CockroachDB guys are top notch engineers. I'm excited to
see how this product develops.

