
Serializable, Lockless, Distributed: Isolation in CockroachDB - orangechairs
https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/
======
LoSboccacc
By coincidence I was looking around for a datastore. I almost settled on
Couchbase when I saw this; easy cluster deployment is my main draw after
seeing how bad setting up a master-master system is in other solutions
(HBase, Mongo, and CouchDB all have heterogeneous nodes and weird failure
modes).

One quick question, since I saw the devs around and I can't find a final
answer on Google: are there any strong roadblocks or performance drawbacks
to storing mostly medium-sized binary data like images, say < 100MB blobs?

Yeah, I could use something more appropriate, but then I'd be back to figuring
out Hadoop installation or fighting S3's eventual consistency.

~~~
hnkimb3558
(employee/founder here)

Anything under 64K is perfectly reasonable to store in a CockroachDB column.
Between 64K and maybe 1M is trending towards trouble. Values greater than this
are going to go through CockroachDB like a goat through a python.

Why is this the case? For starters, at the level of RocksDB, values greater
than 64K are not jammed into SSTables (to avoid constantly rewriting them
during compactions of the LSM tree). Instead, individual files are created.
Also, CockroachDB has quite a lot of write amplification, which is generally
OK for structured relational data, but becomes progressively more terrible for
large blobs. Write amplification comes from the Raft log, as well as RocksDB's
write-ahead log.

What we really need is an integrated storage system for immutable blobs,
something we're taking very seriously. Roughly half of the original team which
built Colossus at Google are working on CockroachDB, so there's some knowledge
of how to go about building such a system.

While we're not sure where it would fall on our roadmap, the idea is that
large blob values would be efficiently replicated and maintained through a
separate subsystem. The blob column itself would just contain a pointer to the
blob. The value in tight integration (a single CockroachDB cluster providing
both OLTP SQL database as well as a distributed blob store) would be one
deployment & admin console, and transactionally consistent blob column values
(e.g. no fighting s3 eventual consistency).

~~~
LoSboccacc
Thanks! That's fantastic to hear. Even if it materializes
later/eventually/never, at least it's great to know the need is recognized.

Disregarding write amplification issues for a sec: would it make things better
to split binaries into 64K chunks and store them in a chunk table keyed by
name and offset?
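
For illustration, the chunking scheme described above could look something
like this (a sketch using sqlite3 so it runs anywhere; the table and helper
names are made up, not anything CockroachDB provides):

```python
import sqlite3

CHUNK_SIZE = 64 * 1024  # 64K chunks, per the suggestion above

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE chunks (
        name   TEXT    NOT NULL,
        offset INTEGER NOT NULL,
        data   BLOB    NOT NULL,
        PRIMARY KEY (name, offset)
    )
""")

def put_blob(name, blob):
    # Split the blob into fixed-size chunks keyed by (name, offset).
    with conn:
        for off in range(0, len(blob), CHUNK_SIZE):
            conn.execute(
                "INSERT INTO chunks (name, offset, data) VALUES (?, ?, ?)",
                (name, off, blob[off:off + CHUNK_SIZE]),
            )

def get_blob(name):
    # Reassemble the chunks in offset order.
    rows = conn.execute(
        "SELECT data FROM chunks WHERE name = ? ORDER BY offset", (name,)
    )
    return b"".join(row[0] for row in rows)

put_blob("cat.jpg", b"x" * 200_000)  # ~200K blob -> 4 chunks
assert get_blob("cat.jpg") == b"x" * 200_000
```

Keeping every row under the 64K threshold sidesteps the large-value path in
RocksDB mentioned above, though each chunk still pays the per-write overhead.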

~~~
hnkimb3558
I'm not really sure which strategy would benchmark best between 64K chunks, 1M
chunks, or even 8M chunks. I think this requires some experimentation. Pushing
them all through as 64K chunks has a lot of overhead, and you'd reap the full
write amplification. Could you tell me a bit more about your use case? You can
email me spencer at cockroachlabs.com.

I've been meaning to work on the "CockroachDB Egg Store" (my disgusting name
for a blob storage subsystem) as a Free Fridays side project for a while, but
have been distracted with all manner of other enticing options. There aren't
enough hours in a week...

------
tyingq
I love the transparency this project operates with. Looking forward to when
joins are implemented so I can try using it with something non-trivial.

~~~
chillaxtian
> Looking forward to when joins are implemented so I can try using it with
> something non-trivial.

these types of databases typically don't have joins. you'll be waiting a
while.

~~~
petermattis
(employee/founder here)

Re joins: we have top men working on it right now. Joins are definitely an
interesting problem for a distributed database.

~~~
chris_wot
Hi Peter, sounds pretty awesome! Quick question though - on your front page
you say that CockroachDB does SQL - but if it can't do a join, then how can
you say it uses SQL? Or is distributed SQL a different thing entirely? It
does sound like a very limited SQL subset, though... I'm sure I must be
missing something, as I'm not familiar with your product.

Also, what levels of isolation do you actually offer? Serializable snapshot
isolation appears to be MVCC, but I see you also have plain snapshot isolation
- what is the difference?

 _Edit:_ oh brother, the proof you link to - I just realized I bought that
book some time ago and never got around to reading it... Transactional
Information Systems by Weikum & Vossen, right? Time for me to hit the books,
I guess. Still trying to get my head around that second graph you have drawn;
I can't work out how you got it :(

~~~
petermattis
Hi Chris, we describe CockroachDB as a SQL database because that is what we're
aspiring to. The missing functionality (i.e. joins) is on our near-term
roadmap.

~~~
chris_wot
Sweet :-)

Joins over a distributed database aren't easy. I'd love to see what you have
going! The main issue I see with distributed joins is that they need to be
done in a single transaction - if even one table gets an insert, delete, or
update, it invalidates the join. But this distributed serializable snapshot
isolation sounds like it might be the best way around that.

------
barking
Sorry if this is a little bit off topic. This is a startup with employees
that is making a non-commercial open source database system. How will they
make a financial return on such a product?

~~~
sgarman
It's common in the industry to build an open source product and then do paid
support, hosting, etc with it.

Joyent/Node, Sonatype/Nexus

~~~
barking
OK thanks, I sort of thought that it might be something like that but wondered
if I was missing some other angle.

------
danieltoshea
How does this work if the clocks drift between the nodes? Does this allow
incorrect behavior because one transaction looks like it happened before
another?

~~~
mrtracy
(blog author here)

Interestingly, clock drift does _not_ affect the serializability of the
transaction history; this system guarantees that the history is serializable,
regardless of clock drift.

However, "serializable" only means that the history is equivalent to _some_
serial ordering of transactions - it makes no guarantee that the equivalent
serial ordering is consistent with the real-time ordering of the involved
transactions. A history with that property (agrees with real-time) is termed
"linearizable", and requires additional rules to guarantee in an environment
with clock drift.

As mentioned by knz42, there was another Cockroach Labs blog post (written by
Spencer Kimball) that addressed this in some detail; that blog post contrasted
our strategy for dealing with drift with that of Google's Spanner.

A quick overview of CockroachDB's properties re linearizability: it guarantees
that access to any individual key is linearizable, and by composition any two
transactions which share a key (that one of the transactions modifies) will be
linearizable with respect to each other. However, if two transactions do not
have any overlap in modified keys, Cockroach does not (by default) guarantee
the resulting commit history is linearizable. CockroachDB's underlying KV
layer does have a "linearizable" flag on transactions that can guarantee this,
but it requires that transactions be slowed down considerably; Spencer's blog
post addresses some other strategies that CockroachDB is considering to
address the issue.

~~~
lobster_johnson
Can Cockroach do the equivalent of a "select ... for update" (e.g.,
PostgreSQL), where you lock one thing while applying changes elsewhere?

Concrete example: We have an app that has a "documents" table and a "translog"
table. The translog is like a series of diff-patches, representing changes to
the documents. When we write to the translog, we first lock the document with
a "select ... for update", so that no intervening translog entries can be
written concurrently against the same document, then we patch the document,
and then we write the translog entry and commit.

We do this with Postgres, and we can do the same thing with Redis' MULTI since
Redis is completely single-threaded. I can't think of any other NoSQL data
store that allows a similar "lock A, update A, insert B, unlock A"; for
example, Cassandra's "lightweight transactions" are only transactional in the
context of a single row.

(By "lock" I'd also accept optimistic locking, where you can retry on
failure.)
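
The optimistic variant mentioned at the end could be sketched with a version
column and a compare-and-set update (using sqlite3 only so the sketch runs
anywhere; the schema and the `apply_patch` helper are hypothetical, not
anything CockroachDB or the app above actually provides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)"
)
conn.execute("CREATE TABLE translog (doc_id INTEGER, patch TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 'hello', 0)")

def apply_patch(doc_id, patch, retries=10):
    # Optimistic version of "lock A, update A, insert B, unlock A":
    # read the document's version, then make the update conditional on it.
    for _ in range(retries):
        body, version = conn.execute(
            "SELECT body, version FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        with conn:  # one transaction: conditional update + translog insert
            cur = conn.execute(
                "UPDATE documents SET body = ?, version = version + 1 "
                "WHERE id = ? AND version = ?",
                (body + patch, doc_id, version),
            )
            if cur.rowcount == 1:  # nobody raced us; record the patch too
                conn.execute(
                    "INSERT INTO translog VALUES (?, ?)", (doc_id, patch)
                )
                return True
        # version moved under us: loop and retry with fresh state
    return False

apply_patch(1, "+world")
```

No intervening translog entry can be written against the same document,
because a concurrent writer would have bumped the version and the conditional
update would match zero rows, forcing a retry.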

~~~
tdrd
(employee here)

It seems to me that your use-case does not require locking specifically - you
just want to make sure no concurrent transactions can clobber your "update A".

As mrtracy explained, such overlapping transactions are linearizable in
CockroachDB, so this invariant is preserved without the need for explicit
locking.

~~~
ams6110
I think it does require locking, because in Postgres (or Oracle) readers do
not block writers and writers do not block readers. So to be sure you update
the same version you read, you have to select...for update.

~~~
teraflop
Having serializable transactions is equivalent to adding "FOR UPDATE" to every
SELECT statement, so it sounds like CockroachDB already does what you want.

A typical RDBMS will prevent conflicts by forcing queries to block until they
can be executed in a conflict-free ordering. CockroachDB instead detects
conflicts after the fact and prevents inconsistent transactions from
committing, forcing them to retry. The end result -- that is, the set of
possible outcomes of a series of transactions -- is the same, but the
performance characteristics will be different.
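
The "detect conflicts and retry" model described above usually surfaces to
the application as a retry loop around the whole transaction. A generic
sketch (this is not CockroachDB's actual client API; the exception and
function names are made up):

```python
import random
import time

class TransactionRetryError(Exception):
    """Raised when the database aborts a transaction due to a conflict."""

def run_transaction(txn_fn, max_attempts=5):
    # Under optimistic concurrency control the transaction body must be
    # safe to re-execute, since a conflicting commit forces a full retry.
    for attempt in range(max_attempts):
        try:
            return txn_fn()
        except TransactionRetryError:
            # Exponential backoff with jitter to avoid retry storms.
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))
    raise RuntimeError("transaction failed after %d attempts" % max_attempts)

# Example: a transaction that conflicts twice before committing.
attempts = []
def flaky_txn():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransactionRetryError("write conflict, please retry")
    return "committed"

result = run_transaction(flaky_txn)
assert result == "committed" and len(attempts) == 3
```

With pessimistic locking the conflicting query would simply have blocked;
here the same serializable outcome is reached by aborting and re-running.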

~~~
ams6110
GP said that the use case does not require locking, but in PgSQL (which was
mentioned) or Oracle, it does. The default transaction isolation level is not
serializable. You don't read uncommitted updates, but reads are not repeatable
unless you explicitly ask for that. If you do something like this (in a
transaction):

    
    
       select ... from T where <condition>;
       ...
       update T ... where <condition>;
       commit;
    

there is no guarantee that the row you are updating is the same as the one you
selected, unless you add "for update" to the select.

 _Also note that two successive SELECT commands can see different data, even
though they are within a single transaction, if other transactions commit
changes after the first SELECT starts and before the second SELECT starts._

\-- [http://www.postgresql.org/docs/current/static/transaction-
is...](http://www.postgresql.org/docs/current/static/transaction-iso.html)

 _A query acquires no data locks. Therefore, other transactions can query and
update a table being queried, including the specific rows being queried.
Because queries lacking FOR UPDATE clauses do not acquire any data locks to
block other operations, such queries are often referred to in Oracle as
nonblocking queries._

\--
[http://docs.oracle.com/cd/B19306_01/server.102/b14220/consis...](http://docs.oracle.com/cd/B19306_01/server.102/b14220/consist.htm)

~~~
teraflop
Sorry, I don't understand this comment because I can't tell if you're
disagreeing with me about anything.

> GP said that the use case does not require locking, but in PgSQL (which was
> mentioned) or Oracle, it does.

Right, it does in a typical RDBMS, but not in CockroachDB. An isolation
level is defined in terms of what interactions are possible between
concurrent successful transactions. Locks, or the lack of locks, are an
implementation detail.

> The default transaction isolation level is not serializable. ... Also note
> that two successive SELECT commands can see different data, even though they
> are within a single transaction, if other transactions commit changes after
> the first SELECT starts and before the second SELECT starts.

I agree. If you set the isolation level to "serializable", such anomalies
aren't possible, even if you don't use FOR UPDATE.

------
grogers
Naively it looks like your conflict resolution is prone to livelock, where
concurrent transactions for a key keep aborting each other without getting any
work done. The same can happen if you run a single node RDBMS in serializable
isolation mode and read a row before writing to it. In that case you can add
locking reads on rows you intend to later write to, which avoids the livelock.

Have you done any studies to quantify how bad that effect is in CockroachDB?
Assuming the effect exists and I didn't just miss something silly, are there
any workarounds, since CockroachDB only does OCC? (Exponential backoff between
retries and keeping transactions short don't count; I assume those are a
given.)

~~~
knz42
When transactions conflict, the priority of the losing transaction is
internally ratcheted up. At some point it's just higher than everything after
it, and it succeeds.

------
ruggeri
I just wanted to thank the team for this writeup. I've been excited to read it
since the last post several months back. This was definitely very useful, and
very clearly written.

So many thanks!

