Percolator is the processing system Google uses to manage its search index incrementally. It's built on BigTable, which provides cell-level linearizability, similar to what CockroachDB seems to achieve by using Raft.
What CockroachDB calls "intents" are per-value columns called "write" in Percolator, ordered by BigTable's timestamp system. Percolator's "master lock" is rebranded as "switch" and, tada, there you have CockroachDB's fancy lockless algorithm.
Edit: Amazon DynamoDB, too, uses a very similar approach to this in their client-driven transaction implementation https://github.com/awslabs/dynamodb-transactions/blob/master...
It would be interesting to have one of the CockroachDB guys comment on the novelty of their approach.
In terms of Google products, CockroachDB much more closely resembles a combination of Spanner (http://research.google.com/archive/spanner.html) and F1 (http://research.google.com/pubs/pub41344.html). However, while we do share many traits with these systems (a consistently replicated, transactional key/value store with a SQL interface), our internal transactional model is actually quite different.
If you're interested in some of the papers that inspired CockroachDB's transactional model, we have mentioned a number of them in our design doc (https://github.com/cockroachdb/cockroach/blob/master/docs/de...).
As for Percolator, that system is not a general purpose database but an "incremental processor" for large datasets (such as a web crawler). While it does use a snapshot transaction system internally, it appears to use real exclusive access locks, which CockroachDB does not use. Our write intents are not locks, because they do not guarantee exclusive access to a resource (I've expanded on this in another comment on the blog: http://www.cockroachlabs.com/blog/how-cockroachdb-distribute...)
We are not claiming our transaction model to be exceptionally novel, this post is just trying to shed light on a specific property of that model.
Many of the general principles for this kind of distributed database have been known for decades; the key to making it practical is in the details.
It would be great to know the name of this approach, to simplify communication with colleagues; do you know it?
I see however some possible pain points in this strategy. My main concerns are:
* This looks to me a lot like some poor (no pun intended) form of MVCC (https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...). Basically, it supports only two "V"s, the "on" and "off", while "normal" MVCC implementations support as many as required.
* Related to the above, I don't see how you can support more than one overlapping tx with this strategy. Sure, I'll stay tuned for the next post on this topic, but I fail to see how --without exclusive locking-- it could be done based on these principles.
* Every read seems like it needs to check the switch. I understand this might be expensive, especially if there is a concurrent modification in progress. Keys are co-located in the node, but they still require random access.
* It looks like even aborted transactions require writes to the database. And every transaction requires at least two writes (the write intent and its consolidation to the key). I don't expect good performance here.
Please help me clarify these questions, especially if I'm missing something. Hope it helps anyway. Thanks!
This blog post is not a full description of CockroachDB's transaction strategy. If you're interested, you can find a complete rundown of our strategy here: (https://github.com/cockroachdb/cockroach/blob/master/docs/de...). However, that is a very terse document: this blog post was an attempt to expand on one of the basic ideas behind our design. My goal was to clearly explain how we approach atomicity, without broaching the complicated topic of isolation.
That said, I will attempt to address some of the pain points you raised:
* Cockroach does use a full MVCC system; writes are "append-only" and we keep the full history of values for each key. Keys can be queried at a timestamp, and every write explicitly specifies a timestamp as well (each transaction is assigned a timestamp, and all of its writes use that same timestamp). With this system, each transaction operates on a snapshot of the database.
* Our strategy for overlapping transactions is straightforward: if two concurrent transactions have a conflict, one of them will be aborted and retried. Write intents (in combination with another component, the "read timestamp cache") are used by conflicting transactions to discover each other. Once discovered, the decision of which in-progress transaction gets aborted is deterministic - each transaction is assigned a random integer "priority", and the transaction with the highest priority always wins the conflict (unless one of the transactions has already committed).
* Every read does need to check the switch if the key being read has a write intent; reads of plain values (with no in-progress transaction) do not need to check a switch.
* Yes, aborted transactions can write intents to the database in our current system. However, the second "write" (consolidation) happens asynchronously, and is not required for the transaction to function correctly. It should only be noticed if the key is accessed again by another transaction before it is consolidated. For committed transactions, I don't believe this is a particularly high price to pay.
A system that doesn't write aborted transactions to disk would be considerably more complex; I think it would involve two commit phases (I. The transaction has been "committed" in memory, and can no longer be aborted by other transactions II. The transaction must successfully commit all changes to disk in an atomic fashion). You likely could get some additional performance (if disk latency is a sufficient bottleneck), but it would be hard won in terms of complexity.
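To make the priority rule above concrete, here is a minimal illustrative sketch in Go. The type and function names are hypothetical, not our actual code; it only models the rule that, between two in-progress conflicting transactions, the one with the higher random priority wins, and a committed transaction can no longer be aborted.

```go
package main

import "fmt"

// Status models a transaction's lifecycle state (illustrative only).
type Status int

const (
	Pending Status = iota
	Committed
	Aborted
)

// Txn carries the random integer priority described in the comment above.
type Txn struct {
	ID       string
	Priority int
	Status   Status
}

// resolveConflict decides what happens when `pusher` encounters a write
// intent owned by `owner`. It returns the transaction that gets aborted,
// or nil if the owner already committed (in which case the pusher simply
// proceeds against the committed value).
func resolveConflict(pusher, owner *Txn) *Txn {
	if owner.Status == Committed {
		return nil // committed transactions can no longer be aborted
	}
	if pusher.Priority > owner.Priority {
		owner.Status = Aborted
		return owner
	}
	pusher.Status = Aborted
	return pusher
}

func main() {
	a := &Txn{ID: "a", Priority: 10}
	b := &Txn{ID: "b", Priority: 3}
	loser := resolveConflict(a, b)
	fmt.Println(loser.ID) // b has the lower priority, so b is aborted
}
```

The key property is that the decision is deterministic given the two priorities, so both transactions independently reach the same conclusion about who must abort.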
I will read the more detailed document. It would probably make more sense to me.
From your description of MVCC, it looks like you're storing the whole replicated state machine that you maintain via the Raft protocol. If txs are assigned such a timestamp, I agree it could operate as a full MVCC system. Good.
Regarding the overlapping txs, it makes sense once you say you're aborting txs. Without that, it sounded really difficult to achieve without locking.
When you say reads only check the switch if there is a write intent, where exactly is that "switch"? I understood it's not with the key, but rather at some other random location. If that's the case, I insist that you require two reads: one for the key, the other for the switch. Or is the switch co-located with every key?
Regarding the aborted txs, I can't argue without numbers, but I'd like to see those. I mean: you may abort txs voluntarily (I want to ROLLBACK) or involuntarily (there's a concurrent conflict). That may or may not be a high number. But all of those would be written to disk, so it might become a high price.
Isn't this kind of thing needed in case of unexpected network partitions? I'm thinking of tests like Jepsen (https://aphyr.com/tags/jepsen); if you want to retain full ACID consistency in worst-case scenarios, you'll need this kind of slower transaction check. I may be wrong though.
Also, how does this work in the presence of replicas? Does every replica go through the strategy? What if one or more replicas go down during a transaction?
I hope they get an answer!
This is basically the same as how the DynamoDB client transaction library works. The two major problems with that library are:
* High write amplification (7N for DDB, probably similar for CockroachDB)
* Poor throughput under contention. This is mostly due to the policy of immediately aborting conflicting pending transactions, which could be changed if contention is frequent (client-side exponential back-off is probably good enough for most cases). The special sauce they didn't mention might also help here.
You want people to feel good when they think of your product, not repulsed.
Edit - there's an answer on the blog's comments. Apparently it differs from a lock because "it won't prevent any operations from executing, it will just properly order them".
"The switch cannot be concurrently accessed – reads and writes of the switch are strictly ordered". If not with a lock file, maybe a log-based architecture?
The on-disk part of the "switch", our transaction record, can only be accessed internally. There are only a few operations to perform on it: create it, read its value, and a small number of deterministic "read-then-write" operations. For example, in the case of an "abandoned" transaction, a later transaction may want to read the abandoned transaction record, see if it is still "pending", and then mark it as "aborted" if it is.
Our runtime orders all such operations on the same record so that they are completely sequential - that is, even for a "read-then-write" operation, it must completely finish before any other operation can occur on that key. This is done using basic runtime synchronization primitives (mutexes, channels, etc).
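As an illustration of that serialization (a sketch, not our actual implementation), a per-record mutex is enough to make a compound "read-then-write" step atomic: even if many concurrent cleaners race to abort an abandoned transaction, exactly one of them observes "pending" and performs the abort.

```go
package main

import (
	"fmt"
	"sync"
)

// TxnRecord is a toy model of the on-disk transaction record; the mutex
// serializes all operations on it, including read-then-write steps.
type TxnRecord struct {
	mu     sync.Mutex
	status string // "pending", "committed", or "aborted"
}

// abortIfPending atomically reads the status and marks the record
// aborted only if it is still pending. Returns true if the abort took effect.
func (r *TxnRecord) abortIfPending() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.status == "pending" {
		r.status = "aborted"
		return true
	}
	return false
}

func main() {
	rec := &TxnRecord{status: "pending"}
	wins := make(chan bool, 10)
	var wg sync.WaitGroup
	// Ten goroutines race, but the mutex guarantees the read-then-write
	// completes for one before any other can begin.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			wins <- rec.abortIfPending()
		}()
	}
	wg.Wait()
	close(wins)
	count := 0
	for w := range wins {
		if w {
			count++
		}
	}
	fmt.Println(count) // exactly one goroutine wins the abort
}
```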
I believe also that non-repeatable reads would be a potential problem. A transaction could read a row, then read it a second time; if a concurrent, committed transaction mutated the row, the repeated read will return a different result.
I believe that both these problems would be solved by MVCC which I suspect is the next blog post.
But I'm not super great at distributed DB stuff, so please correct me if I've misunderstood!
Your understanding of this particular issue is correct: the system described in this post would not be able to provide repeatable reads.
However, as you've alluded to, Cockroach transactions use a form of snapshot isolation (implemented with MVCC) which does provide repeatable reads. And yes, that will be the subject of my next post. :)
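A toy sketch of why reading at a fixed timestamp gives repeatable reads (illustrative Go, with made-up names; not CockroachDB's MVCC code): each key keeps a history of timestamped versions, and a read at timestamp T returns the newest version at or below T, so a transaction re-reading at its own timestamp is unaffected by later commits.

```go
package main

import (
	"fmt"
	"sort"
)

// version is one timestamped value in a key's history.
type version struct {
	ts    int
	value string
}

// mvccStore keeps the full append-only history of values per key.
type mvccStore struct {
	data map[string][]version // versions sorted by ascending ts
}

func (s *mvccStore) write(key, value string, ts int) {
	s.data[key] = append(s.data[key], version{ts, value})
	sort.Slice(s.data[key], func(i, j int) bool {
		return s.data[key][i].ts < s.data[key][j].ts
	})
}

// readAt returns the newest value for key with timestamp <= ts
// (empty string if no such version exists).
func (s *mvccStore) readAt(key string, ts int) string {
	result := ""
	for _, v := range s.data[key] {
		if v.ts <= ts {
			result = v.value
		}
	}
	return result
}

func main() {
	s := &mvccStore{data: map[string][]version{}}
	s.write("k", "old", 5)

	// Transaction A reads at its assigned timestamp 10.
	fmt.Println(s.readAt("k", 10)) // "old"

	// A concurrent transaction B commits a newer version at timestamp 12.
	s.write("k", "new", 12)

	// A's repeated read at timestamp 10 still sees "old": repeatable.
	fmt.Println(s.readAt("k", 10)) // "old"
	// A reader at timestamp 15 sees B's value.
	fmt.Println(s.readAt("k", 15)) // "new"
}
```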
Sargun is correct: all of our replication is performed by Raft, which exists at a lower level than the content covered in this post. Every command executed against a cockroach key (e.g. "create write intent", "begin this transaction") is persisted to the Raft log before executing; commands will not proceed if they are not confirmed by Raft. The outcome of all commands is deterministic, so this provides complete replication of all mutations to each key.
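The replication argument above is the classic replicated-state-machine property, which this minimal sketch illustrates (toy types, not Raft itself): replicas that apply the same confirmed, ordered command log with deterministic commands end up in identical states.

```go
package main

import "fmt"

// command is one deterministic mutation, as confirmed by the ordered log.
type command struct {
	key, value string
}

// replica holds one node's copy of the state.
type replica struct {
	state map[string]string
}

// apply executes a command deterministically against local state;
// identical inputs always produce identical outputs.
func (r *replica) apply(c command) {
	r.state[c.key] = c.value
}

func main() {
	// The (Raft-confirmed) log: every replica sees the same commands
	// in the same order.
	log := []command{{"a", "1"}, {"a", "2"}, {"b", "3"}}

	r1 := &replica{state: map[string]string{}}
	r2 := &replica{state: map[string]string{}}
	for _, c := range log {
		r1.apply(c)
		r2.apply(c)
	}
	fmt.Println(r1.state["a"] == r2.state["a"], r1.state["b"] == r2.state["b"]) // true true
}
```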
If you're interested, there's more information available in our primary design document: