Hacker News new | past | comments | ask | show | jobs | submit login
How CockroachDB Does Distributed, Atomic Transactions (cockroachlabs.com)
147 points by orangechairs on Sept 2, 2015 | hide | past | web | favorite | 32 comments

This is the transaction protocol from Google Percolator - with slightly different wording. I don't know if that's a well known fact but I was surprised they did not even mention it. OSDI Paper on Percolator: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...

Percolator is the processing system Google uses to manage its search index incrementally. It's built on BigTable which provides cell-level linearizability, similar to what CockroachDB seems to achieve by using Raft.

What CockroachDB calls "intents" are per-value columns called "write" in Percolator and ordered by BigTables timestamp system. Percolators "master lock" is rebranded to "switch" and, tada, there you have CockroachDBs fancy lockless algorithm.

Edit: Amazon DynamoDB, too, uses a very similar approach to this in their client-driven transaction implementation https://github.com/awslabs/dynamodb-transactions/blob/master...

It would be interesting to have one of the CochroachDB guys comment on the novelty of their approach.

Post author here:

In terms of Google products, CockroachDB much more closely resembles a combination Spanner (http://research.google.com/archive/spanner.html) and F1 (http://research.google.com/pubs/pub41344.html). However, while we do share many traits with these systems (a consistently replicated, transactional key/value store with a SQL interface), our internal transactional model is actually quite different.

If you're interested in some of the papers that inspired CockroachDB's transactional model, we have mentioned a number of them in our design doc (https://github.com/cockroachdb/cockroach/blob/master/docs/de...).

As for Percolator, that system is not a general purpose database but an "incremental processor" for large datasets (such as a web crawler). While it does use a snapshot transaction system internally, it appears to use real exclusive access locks, which CockroachDB does not use. Our write intents are not locks, because they do not guarantee exclusive access to a resource (I've expanded on this in another comment on the blog: http://www.cockroachlabs.com/blog/how-cockroachdb-distribute...)

We are not claiming our transaction model to be exceptionally novel, this post is just trying to shed light on a specific property of that model.

I don't think either CockroachDB or Percolator is claiming any novelty for this basic approach. The Percolator article cites another paper from 1981 that discusses essentially the same idea. (I think it's the same, at least -- the writing is pretty terse, and I've only skimmed it so far.)


Many of the general principles for this kind of distributed database have been known for decades; the key to making it practical is in the details.

The idea is pretty simple. For example, long time ago when Zookeeper didn't support multi key transaction I had to implement them and came to the same algorithm - http://rystsov.info/2012/09/01/cas.html.

It would be great to know its name to simplify communication with the colleagues, do you know it?

Amazon doesn't mention Google or Percolator either.

But are they written in Go?

Thank you for sharing your strategy for concurrent transactions. It's a very interesting topic :)

I see however some possible pain points in this strategy. My main concerns are:

* This looks to me a lot like some poor (no pun intended) form of MVCC (https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...). Basically, it supports only two "V"s, the "on" and "off", while "normal" MVCC implementations support as many as required.

* Related to the above, I don't see how you can support more than one overlapping tx with this strategy. Sure, I'll keep tuned for the next post on this topic, but I fail to see how --without exclusive locking-- it could be done based on these principles.

* Every read seems like it needs to check the switch. I understand this might be expensive, specially if there is a concurrent modification in progress. Keys are co-located in the node, but they still require random access.

* It looks like even aborted transactions require writes to the database. And every transaction requires at least two writes (the write intent and its consolidation to the key). I don't expect good performance here.

Please help me clarify these questions, specially if I'm missing something. Hope it helps anyway. Thanks!

I'm not at all associated with the Cockroach folks, but I'm pretty sure they actually do use MVCC, and the description they gave in this post is a simplification of the actual implementation. At least, there are a bunch of references to MVCC in their codebase: https://github.com/cockroachdb/cockroach/tree/master/storage...

Hi there, blog author here!

This blog post is not a full description of CockroachDB's transaction strategy. If you're interested, you can find a complete rundown of our strategy here: (https://github.com/cockroachdb/cockroach/blob/master/docs/de...). However, that is a very terse document: this blog post was an attempt to expand on one of the basic ideas behind our design. My goal to clearly explain how we approach atomicity, without broaching the complicated topic of Isolation.

That said, I will attempt to address some of the pain points you raised:

* Cockroach does use a full MVCC system; writes are "append-only" and we keep the full history of values for each key. Keys can be queried at a timestamp, and every write explicitly specifies a timestamp as well (each transaction is assigned a timestamp, all writes will use the same timestamp). With this system, each transaction operates on a snapshot of the database.

* Our strategy for overlapping transactions is straightforward: if two concurrent transactions have a conflict, one of them will be aborted and retried. Write intents (in combination with another component, the "read timestamp cache") are used by conflicting transactions to discover each other. Once discovered, the decision of which in-progress transaction gets aborted is deterministic - each transaction is assigned a random integer "priority", and the transaction with the highest priority always wins the conflict (unless one of the transactions has already committed).

* Every read does need to check the switch if the key being read has a write intent; reads of plain values (with no in-progress transaction) do not need to check a switch.

* Yes, aborted transactions can write intents to the database in our current system. However, the second "write" (consolidation) happens asynchronously, and is not required for the transaction to function correctly. It should only be noticed if the key is accessed again by another transaction before it is consolidated. For committed transactions, I don't believe this is a particularly high price to pay.

A system that doesn't write aborted transactions to disk would be considerably more complex; I think it would involve two commit phases (I. The transaction has been "committed" in memory, and can no longer be aborted by other transactions II. The transaction must successfully commit all changes to disk in an atomic fashion). You likely could get some additional performance (if disk latency is a sufficient bottleneck), but it would be hard won in terms of complexity.

Hi, thanks for the reply!

I will read the more detailed document. It would probably make more sense to me.

From the description of MVCC that you say, it looks like you're storing the whole replicated state machine that you have via the Raft protocol. If txs are assigned such a timestamp, I agree that could operate as a full MVCC system. Good.

Regarding the overlapping txs, it makes sense once you say you're aborting txs. Without that, it sounded really difficult to achieve without locking.

When you say reads only check the switch if there is a write intent, where is exactly that "switch"? I understood it's not with the key, but rather at some other random location. If that's the case, I insist that you require two reads: one for the key, the other one for the switch. Os is the switch co-located with every key?

Regarding the aborted txs, I can't argue without numbers, but I'd like to see those. I mean: you may abort txs voluntarily (I want to ROLLBACK) or involuntary (there's a concurrent conflict). That may or may not be a high number. But all of those would be written to disk, so it might become a high price.

> It looks like even aborted transactions require writes to the database. And every transaction requires at least two writes (the write intent and its consolidation to the key). I don't expect good performance here.

Isn't this kind of needed in case of unexpected network partitions? I'm thinking of tests like Jespen (https://aphyr.com/tags/jepsen), if you want to retain full ACID and consistency in worst case scenarios, you'll need these kind of slower transactions checks, I may be wrong though.

Yeah, I am not sure how multiple queries will co-exist in this strategy. Especially, if you have two queries which are not commutative (say, one which increments a column and the other one which replaces the column with its square root).

Also, how does this work in the presence of replicas? Does every replica go through the strategy? What if one or more replicas go down during a transaction?

Excellent questions! thanks for this, I Was too lazy to lay them down like this.

I hope they get an answer!

I'll be interested to see the further details discussed in the isolation post, which might answer some questions.

This is basically the same how the DynamoDB client transaction library works. The two major problems with that library are:

* High write amplification (7N for DDB, probably similar for CockroachDB)

* Poor throughput under contention. This is mostly due to the immediately abort conflicting pending transactions policy, which could be changed if contention is frequent (client side exponential back off is probably good enough for most cases). The special sauce they didn't mention might also help here.

Could you have picked a worse more disgusting name?

You want people to feel good when they think of your product, not repulsed.

You may also want to hint that user data will survive even a nuclear winter.

I think a nicer name would be RoachDB. But CockroachDB/Cockroach labs is quite a memorable name, more so than roachdb for sure.

So they do it without a lock by using a "switch" instead. What is the difference? Or is it the same concept as a "lock" but they've just created a whole new confusion around terms by trying to be different?

Edit - there's an answer on the blog's comments. Apparently it differs from a lock because " it won't prevent any operations from executing, it will just properly order them"

Why didn't you go down the route of Paxos Commit for your ACP versus the Yabandeh-style ACP? I realize that they have very different semantics, but Paxos Commit seems to have (nearly) all the same properties, and benefits without a single key, that basically acts as the "coordinator" for the transaction (a process as opposed to a key).

I look forward to a Kyle Kingsbury (Aphyr) "Call me maybe" post. eg. https://aphyr.com/posts/324-call-me-maybe-aerospike

This looks very interesting, but I still don't get how do they avoid concurrency on the switch:

"The switch cannot be concurrently accessed – reads and writes of the switch are strictly ordered". If not with a lock file, maybe a log-based architecture?

Post author here!

The on disk part of the "switch", our transaction record, can only be accessed internally. There are only a small number of operations to perform on it - create it, read its value, and a small number of deterministic "read-then-write" operations. For example, in the case of an "abandoned" transaction, a later transaction may want to read the abandoned transaction record, see if it is still "pending", and then mark it as "aborted" if it is.

Our runtime orders all such operations on the same record so that they are completely sequential - that is, even for a "read-then-write" operation, it must completely finish before any other operation can occur on that key. This is done using basic runtime synchronization primitives (mutexes, channels, etc).

Thanks for the detailed explanation. Looking forward for the next post!

The detail section says "by a combination of raft and our underlying storage engine"

The writeup mentions that concurrent attempts to update a row will be discussed in a future post. The simplified discussion here says that a writer who encounters a write intent of an uncommitted transaction aborts the concurrent transaction.

I believe also that non-repeatable reads would be a potential problem. A transaction could read a row, then read it a second time; if a concurrent, committed transaction mutated the row, the repeated read will return a different result.

I believe that both these problems would be solved by MVCC which I suspect is the next blog post.

But I'm not super great at distributed DB stuff, so please correct me if I've misunderstood!

Post Author here!

Your understanding of this particular issue is correct: the system described in this post would not be able to provide repeatable reads.

However, as you've alluded to, Cockroach transactions use a form of snapshot isolation (implemented with MVCC) which does provide repeatable reads. And yes, that will be the subject of my next post. :)

I love CockroachDB but the name is dismissed by many that don't know all its capabilities.

I'm not sure if CockroachDB has multi-DC async replication, but the challenge I've always found in this stuff is ensuring the transaction also commits on remote replicas in an all-or-nothing way. Maybe I didn't read close enough...

Post author here

Sargun is correct, all of our replication is performed by Raft, which exists at a lower level than the content covered in this post. Every command executed against a cockroach key (i.e. "create write intent", "begin this transaction") is persisted to the Raft log before executing; Commands will not proceed if they are not confirmed by Raft. The outcome of all commands is deterministic, so this provides complete replication of all mutations to each key.

If you're interested, there's more information available in our primary design document: (https://github.com/cockroachdb/cockroach/blob/master/docs/de...)

They have Multi-Raft running between all of the replicas of a specific key range to ensure that there is an atomic commit across all DCs.

I'm much more interested in how they get "C" in a distributed DB than how they get "A"

Interesting read, the CochroachDB guys are top notch engineers. I'm excited to see how this product develops.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact