By coincidence I was looking around for a datastore. I had almost settled on Couchbase when I saw this; easy cluster deployment is my main draw, after seeing how bad setting up a master-master system is in other solutions (HBase, Mongo, and CouchDB all have heterogeneous nodes and weird failure modes).
One quick question, since I saw the devs around and I can't find a definitive answer on Google: are there any strong roadblocks or performance drawbacks to storing mostly medium-size binary data like images, say, blobs under 100 MB?
Yeah, I could use something more appropriate, but then I'd be back to figuring out a Hadoop installation or fighting S3's eventual consistency.
Anything under 64K is perfectly reasonable to store in a CockroachDB column. Between 64K and maybe 1M is trending towards trouble. Values greater than this are going to go through CockroachDB like a goat through a python.
Why is this the case? For starters, at the level of RocksDB, values greater than 64K are not jammed into SSTables (to avoid constantly rewriting them during compactions of the LSM tree). Instead, individual files are created. Also, CockroachDB has quite a lot of write amplification, which is generally OK for structured relational data, but becomes progressively more terrible for large blobs. Write amplification comes from the Raft log, as well as RocksDB's write-ahead log.
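Back-of-the-envelope arithmetic (with assumed, illustrative constants, not measured CockroachDB numbers) shows how those layers multiply:

```python
# Illustrative write-amplification arithmetic: each replica writes a
# value to the Raft log, to the storage engine's write-ahead log, and
# into the LSM tree itself, and compactions may rewrite it a few more
# times. All the constants below are assumptions for illustration.

def bytes_written(value_bytes, replicas=3, raft=1, wal=1, compactions=2):
    # one copy per layer (Raft log + WAL + LSM + compaction rewrites),
    # multiplied by the replication factor
    per_replica = value_bytes * (raft + wal + 1 + compactions)
    return per_replica * replicas

blob = 100 * 1024 * 1024                      # a 100 MB image
amplification = bytes_written(blob) / blob    # 15x with these constants
```

A 2x change in blob size is a 2x change in every one of those layers, which is why large values hurt disproportionately.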
What we really need is an integrated storage system for immutable blobs, something we're taking very seriously. Roughly half of the original team which built Colossus at Google are working on CockroachDB, so there's some knowledge of how to go about building such a system.
While we're not sure where it would fall on our roadmap, the idea is that large blob values would be efficiently replicated and maintained through a separate subsystem. The blob column itself would just contain a pointer to the blob. The value in tight integration (a single CockroachDB cluster providing both OLTP SQL database as well as a distributed blob store) would be one deployment & admin console, and transactionally consistent blob column values (e.g. no fighting s3 eventual consistency).
Thanks! That's fantastic to hear even if it's going to materialize later/eventually/never at least it's great to know the need is recognized.
Disregarding write amplification issues for a sec, would it make things better to split binaries in 64k chunks and have them in a chunk table keyed by name and offset?
I'm not really sure which strategy would benchmark best between 64K chunks, 1M chunks, or even 8M chunks. I think this requires some experimentation. Pushing them all through as 64K chunks has a lot of overhead, and you'd reap the full write amplification. Could you tell me a bit more about your use case? You can email me spencer at cockroachlabs.com.
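A sketch of that chunking idea, with the key layout and chunk size purely illustrative:

```python
# Sketch of the chunking scheme discussed above: split a blob into
# fixed-size pieces keyed by (name, offset). The chunk size is the
# knob to benchmark; 64K, 1M, and 8M are the candidates mentioned.

CHUNK = 64 * 1024

def split_blob(name, data, chunk=CHUNK):
    """Yield ((name, offset), chunk_bytes) rows for a hypothetical chunk table."""
    for off in range(0, len(data), chunk):
        yield (name, off), data[off:off + chunk]

def join_blob(rows):
    """Reassemble a blob from its rows; sorting by key orders by offset."""
    return b"".join(part for key, part in sorted(rows))
```

In a real schema the `(name, offset)` pair would be the primary key, so a range scan over one name returns the chunks in offset order for free.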
I've been meaning to work on the "CockroachDB Egg Store" (my disgusting name for a blob storage subsystem) as a Free Fridays side project for a while, but have been distracted with all manner of other enticing options. There aren't enough hours in a week...
As a long-time CouchDB user, I'm going to assume that by "weird failure modes" you mean "conflicts". There are some awesome aspects to using CouchDB, but getting used to conflicts certainly took some time. In short, your application needs to include the code to resolve problems that arise from network partitions or concurrent writes to different nodes. On the other hand, it's a really stable platform, and the append-only database files worked really well (we lost zero data in over 5 years of operation ... even with tests that included powering down a node while in use).
But ... I also believe that CockroachDB is going to be the project to finally solve these problems and it's got an SQL interface! I'm waiting for SQL joins, and then I suspect I'll be moving at least some of my clustered PostgreSQL instances to CockroachDB.
Sharding needs to be taken care of manually, and there's nothing in the docs covering what happens in the event of a failover, nor how to replace a failed, populated node with an empty one so that the cluster can recover.
Those are quite big concerns if sharding with a replica set for fault tolerance is the main objective.
but then you'd need to store the file path in the datastore, with all the failure modes that follow.
I know what the current best practices are, but for bootstrapped one-man startups you always end up with a dozen moving parts to be tamed, and instead of building stuff you need to solve the same problems over and over again.
I'm implementing a datastore on top of GlusterFS, which, while amateurish, is a wonderful learning experience; but since it's nowhere close to being workable, I'm also looking around at what's happening in this space.
I have this longer-term goal of a serverless framework where the datastore also provides OAuth for users, inspired by https://github.com/serverless/serverless - it would be a boon for client-side-webapp-style startups.
I have a strong sense that you're vastly over-engineering something.
The solved problem is called an object store. Just use S3 or Azure Storage or Google Cloud Storage - they all have options for geo-replication and it'll be just as fast or faster than any database replication you set up, while being far more 9's of availability and reliability.
Storing a tiny bit of text for the filename is trivial and you can use any durable database for that.
CockroachDB doesn't currently support joins, but it's been designed so that it can potentially support them, unlike most NoSQL databases. In particular, it supports cross-machine transactions, which are a critical building block for both correctness and performance.
For instance, suppose you want to join on a column that's not a primary key. No big deal in a typical RDBMS; just do an index lookup on the join column. But if your data is sharded across machines, that doesn't work, because you can't update the index without running the risk that readers will observe it in an inconsistent state. Unless you can do transactions, that is.
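A toy illustration of that inconsistency window, with two dicts standing in for two shards and hypothetical key/value names:

```python
# Toy model of the problem described above: the row lives on one
# shard, its secondary index on another. Without a cross-shard
# transaction, a reader can catch the pair mid-update.

rows  = {"u1": "alice"}   # shard A: primary key -> indexed column value
index = {"alice": "u1"}   # shard B: indexed column value -> primary key

# A non-transactional update writes the row first...
rows["u1"] = "bob"
# ...so a reader consulting the index at this moment follows
# index["alice"] to a row that no longer says "alice":
stale = rows[index["alice"]]   # inconsistent read

# ...until the index writes land:
del index["alice"]
index["bob"] = "u1"
```

A transaction makes both writes visible atomically, closing that window; that's the primitive the comment above says joins need.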
So I'm optimistic that Cockroach will get join support before too long. The necessary primitives are available, making it "merely" a matter of figuring out the engineering details.
Yes, distributed joins are well understood. But I'm not at all convinced that the relational model is a good fit for a scale-out, widely distributed, fault-tolerant database. The engineering decisions for each of these aspects add enough latency to the process as it is.
Joins in ActorDB work great. Of course that is because we use an entirely different way of making an SQL database distributed and our joins aren't actually distributed even though the database is.
I looked at ActorDB, with the thought of using it as a sort of "Sql Enabled" version of etcd. Meaning, storing config data, mostly reads, with no big performance requirements.
But, the use case is to ensure that the config data is available on all nodes....high availability. So, for example, sharding isn't wanted or needed.
It was difficult, however, to get my arms around the whole "actor model", and understand how to use ActorDB in this relatively simple deployment.
Long story short, is there some reference, or dead simple example of a "single actor" deployment with a small number of tables?
It's no different from any other deployment, really. You have a single actor type in your init.sql; then, when you send in queries, you always specify the same actor. It's easier to explain if you tell me what you don't understand.
It's basically trying to figure out what the purpose of an "actor" is, and what it means in terms of schema design. I can't tell if I'm supposed to use multiple actors because it adds some kind of resilience, or performance, or something else?
On the surface, it seems analogous to "CREATE DATABASE" / "USE DATABASE", but then the examples show applications using multiple actors...which wouldn't be typical (one app / multiple databases). So, it seems clear the idea is multiple actors within a single app, but what drives the choice of what the actors are?
If the documentation started with some deeper explanation of actors, it might be easier to follow...as is, it jumps into actor syntax, creation, etc, without the reader really knowing what one is first.
I do get that this might be unique to me... I'm just not grokking the concept.
Hi Peter, sounds pretty awesome! Quick question though - on your front page you say that CockroachDB does SQL - but if it can't do a join, how can you say it uses SQL? Or is distributed SQL a different thing entirely? It does sound like a very limited SQL subset, though... I'm sure I must be missing something, as I'm not familiar with your product.
Also, what levels of isolation do you actually offer? Serializable snapshot isolation appears to be MVCC, but I see you also have plain snapshot isolation - what is the difference?
Edit: oh brother, the proof you link to - I just realized I bought that book some time ago and never got around to reading it... Transactional Information Systems by Weikum & Vossen, right? Time for me to hit the books, I guess. Still trying to get my head around that second graph you have drawn; can't work out how you arrived at it :(
SQL is not a single language but rather a family of languages. Each RDBMS that advertises support for SQL ends up implementing a different flavor of SQL, and more often than not they aren't even compatible with each other :)
For now, CockroachDB supports a subset of the SQL implemented by other databases, with some extensions of its own. This subset will grow over time.
The two levels of isolation offered are snapshot (SI) and serializable (SSI). Snapshot means that concurrent transactions are atomic with regard to each other and "see" the same initial state of the DB. Serializable adds to this that they can't introduce write skew, i.e. if there are concurrent transactions, they will "see" each other's effects in some order.
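The write-skew anomaly that serializable mode rules out can be sketched in a few lines (the on-call scenario is the classic textbook example, not something from this thread):

```python
# Write skew under snapshot isolation: the invariant is "at least one
# of two doctors stays on call". Both transactions read the same
# snapshot and each writes a *different* key, so SI sees no
# write-write conflict -- yet together they break the invariant.

db = {"alice_on_call": True, "bob_on_call": True}

snap1 = dict(db)  # txn 1's snapshot
snap2 = dict(db)  # txn 2's snapshot

# txn 1: "Bob is still on call, so I can leave."
if snap1["bob_on_call"]:
    db["alice_on_call"] = False

# txn 2: "Alice is still on call, so I can leave."
if snap2["alice_on_call"]:
    db["bob_on_call"] = False

# Under SI both transactions commit, and nobody is on call:
assert not (db["alice_on_call"] or db["bob_on_call"])
```

Under serializable isolation one of the two would be forced to abort and retry, and the retry would see the other's write.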
Hi Chris, we describe CockroachDB as a SQL database because that is what we're aspiring to. The missing functionality (i.e. joins) is on our near-term roadmap.
Joins over a distributed database aren't easy. Love to see what you have going! The main issue I see with distributed joins is that they need to be done in a single transaction - if even one table gets an insert, delete, or update, it invalidates the join. But this distributed serializable snapshot isolation sounds like it might be the best way around that.
You may find that you'll have to choose at best two of these three: joins, speed, and convenience. In other words, you won't be able to join quickly in a way that's convenient for users, or if you want users to join like they can in a SQL database, it will be much slower.
It is possible to massively distribute joins but it requires a much more sophisticated database design than you see in most distributed database implementations because the whole system has to be designed for that use case e.g. support for ad hoc inter-node orchestration and data flows. At that point you basically have a parallel database instead of just a distributed one.
And if you add an additional requirement to correctly execute ad hoc consistent joins under very high write workloads and continuous failures, then it is definitely non-trivial engineering. But it is possible; I've designed implementations like this before. It just requires a very high degree of effort and skill that is rarely applied to the design of typical NoSQL databases.
MemSQL does distributed joins really well, using a combination of in-memory rowstore + disk columnstore and the ability to have certain tables replicated to every node for faster joining.
Sorry if this is a little bit off topic
This is a startup with employees that is making a non-commercial open source database system. How will they make a financial return on such a product?
How does this work if the clocks drift between the nodes?
Does this allow incorrect behavior because one transaction looks like it happened before another?
Interestingly, clock drift does not affect the serializability of the transaction history; this system guarantees that the history is serializable, regardless of clock drift.
However, "serializable" only means that the history is equivalent to some serial ordering of transactions - it makes no guarantee that the equivalent serial ordering is consistent with the real-time ordering of the involved transactions. A history with that property (agrees with real-time) is termed "linearizable", and requires additional rules to guarantee in an environment with clock drift.
As mentioned by knz42, there was another Cockroach Labs blog post (written by Spencer Kimball) that addressed this in some detail; that blog post contrasted our strategy for dealing with drift with that of Google's Spanner.
A quick overview of CockroachDB's properties re linearizability: it guarantees that access to any individual key is linearizable, and by composition any two transactions which share a key (that one of the transactions modifies) will be linearizable with respect to each other. However, if two transactions do not have any overlap in modified keys, Cockroach does not (by default) guarantee the resulting commit history is linearizable. CockroachDB's underlying KV layer does have a "linearizable" flag on transactions that can guarantee this, but it requires that transactions be slowed down considerably; Spencer's blog post addresses some other strategies that CockroachDB is considering to address the issue.
Can Cockroach do the equivalent of a "select ... for update" (e.g., PostgreSQL), where you lock one thing while applying changes elsewhere?
Concrete example: We have app that has a "documents" table and a "translog" table. The translog is like a series of diff-patches, representing changes to the documents. When we write to the translog, we first lock the document with a "select ... for update", so that no intervening translog entries can be written concurrently against the same document, then we patch the document, and then we write the translog entry and commit.
We do this with Postgres, and we can do the same thing with Redis' MULTI since Redis is completely single-threaded. I can't think of any other NoSQL data store that allows a similar "lock A, update A, insert B, unlock A"; for example, Cassandra's "lightweight transactions" are only transactional in the context of a single row.
(By "lock" I'd also accept optimistic locking, where you can retry on failure.)
CockroachDB is optimistically concurrent, so there is no locking. However, your use case is definitely possible.
The transaction would:
1. Read the current document (I'm assuming this needs to be done to compute the translog entry).
2. Read the latest ID in the translog table
3. Write a new entry to translog with ID+1
4. Write the document.
If any other transaction interleaves with this process (by either reading or writing one of the same keys in a way that would violate isolation), one of the two transactions will be aborted.
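On the application side, an optimistic scheme like the one above usually means wrapping the transaction in a retry loop. A minimal sketch, where `TxnAborted` and `flaky_txn` are hypothetical stand-ins for a real client library and workload:

```python
# Sketch of driving an optimistic-concurrency transaction from the
# application: there are no locks, so a transaction that loses a
# conflict is aborted and simply retried from the top.

class TxnAborted(Exception):
    """Hypothetical error a client library raises on a lost conflict."""

def with_retries(txn_fn, attempts=5):
    for _ in range(attempts):
        try:
            return txn_fn()          # re-reads and re-writes each attempt
        except TxnAborted:
            continue                 # lost a conflict; try again
    raise RuntimeError("transaction kept losing conflicts")

# demo: a transaction that loses two conflicts, then commits
state = {"tries": 0}

def flaky_txn():
    state["tries"] += 1
    if state["tries"] < 3:
        raise TxnAborted()
    return "committed"

result = with_retries(flaky_txn)
```

The key point is that the whole read-modify-write sequence reruns on each attempt, so the retry always operates on fresh data.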
It seems to me that your use-case does not require locking specifically - you just want to make sure no concurrent transactions can clobber your "update A".
As mrtracy explained, such overlapping transactions are linearizable in CockroachDB, so this invariant is preserved without the need for explicit locking.
What I need is for our translog to reflect the order of updates. So if diff A was applied before B, then the translog order also needs to be A, B. (The order only needs to be consistent per document.)
This is because we have listeners — through APIs — that play the translog as it happens and maintain various state based on it.
Currently, the translog is ordered by a sequential number (because it's cheap in Postgres), but every entry also records the ID of the previous entry (so B will point at A). One could sort by time and then reorder by causality before emitting the linear log to consumers, but that would of course be more complicated than one that is already linear.
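The "reorder by causality" step could be sketched like this, assuming hypothetical `(id, prev)` entries where `prev` names the entry each one follows:

```python
# Sketch of reordering translog entries by causality: each entry
# records the id of the entry it follows (prev, None for the first),
# so a consumer can rebuild the per-document linear order even if
# entries arrive sorted only by possibly-skewed timestamps.

def linearize(entries):
    """entries: iterable of (id, prev) pairs; returns ids in causal order."""
    by_prev = {prev: eid for eid, prev in entries}
    order, cur = [], None
    while cur in by_prev:        # follow the chain from the root
        cur = by_prev[cur]
        order.append(cur)
    return order
```

This assumes the chain is complete and unbranched per document, which is exactly the invariant the prev pointer is there to enforce.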
I think it does require locking, because in Postgres (or Oracle) readers do not block writers and writers do not block readers. So to be sure you update the same version you read, you have to SELECT ... FOR UPDATE.
Having serializable transactions is equivalent to adding "FOR UPDATE" to every SELECT statement, so it sounds like CockroachDB already does what you want.
A typical RDBMS will prevent conflicts by forcing queries to block until they can be executed in a conflict-free ordering. CockroachDB instead detects conflicts after the fact and prevents inconsistent transactions from committing, forcing them to retry. The end result -- that is, the set of possible outcomes of a series of transactions -- is the same, but the performance characteristics will be different.
GP said that the use case does not require locking, but in PgSQL (which was mentioned) or Oracle, it does. The default transaction isolation level is not serializable. You don't read uncommitted updates, but reads are not repeatable unless you explicitly ask for that. If you do something like this (in a transaction):
select ... from T where <condition>;
...
update T ... where <condition>;
commit;
there is no guarantee that the row you are updating is the same as the one you selected, unless you add "for update" to the select.
Also note that two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes after the first SELECT starts and before the second SELECT starts.
A query acquires no data locks. Therefore, other transactions can query and update a table being queried, including the specific rows being queried. Because queries lacking FOR UPDATE clauses do not acquire any data locks to block other operations, such queries are often referred to in Oracle as nonblocking queries.
Sorry, I don't understand this comment because I can't tell if you're disagreeing with me about anything.
> GP said that the use case does not require locking, but in PgSQL (which was mentioned) or Oracle, it does.
Right, it does in a typical RDBMS, but not in CockroachDB. The definition of an isolation level is defined in terms of what interactions are possible between concurrent successful transactions. Locks, or the lack of locks, are an implementation detail.
> The default transaction isolation level is not serializable. ... Also note that two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes after the first SELECT starts and before the second SELECT starts.
I agree. If you set the isolation level to "serializable", such anomalies aren't possible, even if you don't use FOR UPDATE.
For serializability, all you care about is some sequential order. You get that with hybrid logical clocks (http://www.cse.buffalo.edu/tech-reports/2014-04.pdf). It gives you a monotonically increasing timestamp that you can use instead of dumb version numbers.
On the other hand, if you want to ensure linearizability, you do care about the worst-case clock drift, which in CockroachDB is a configurable parameter. One can adopt Google Spanner's approach ("commit wait"), which is to delay the response to the client long enough to ride out the clock uncertainty (typically a few milliseconds inside a data center, but hundreds of milliseconds in the wide area).
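A minimal hybrid logical clock in the spirit of the paper linked above (a sketch, not CockroachDB's implementation):

```python
# Minimal hybrid logical clock: a (wall_time, logical) pair that stays
# monotonic even when the physical clock stalls or runs backwards, and
# captures causality when timestamps are exchanged between nodes.

class HLC:
    def __init__(self, physical):
        self.physical = physical   # callable returning wall-clock time
        self.l, self.c = 0, 0      # highest wall time seen, logical counter

    def now(self):
        """Timestamp a local or send event."""
        pt = self.physical()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1            # physical clock didn't advance
        return (self.l, self.c)

    def update(self, ml, mc):
        """Merge a timestamp (ml, mc) received from another node."""
        pt = self.physical()
        m = max(self.l, ml, pt)
        if m == self.l and m == ml:
            c = max(self.c, mc) + 1
        elif m == self.l:
            c = self.c + 1
        elif m == ml:
            c = mc + 1
        else:
            c = 0
        self.l, self.c = m, c
        return (self.l, self.c)
```

Comparing timestamps lexicographically gives the sequential order; the drift bound only matters when you additionally want that order to match real time.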
Naively it looks like your conflict resolution is prone to livelock, where concurrent transactions for a key keep aborting each other without getting any work done. The same can happen if you run a single node RDBMS in serializable isolation mode and read a row before writing to it. In that case you can add locking reads on rows you intend to later write to, which avoids the livelock.
Have you done any studies to quantify how bad that effect is in cockroachdb? Assuming the effect exists and I didn't just miss something silly, are there any workarounds since cockroachdb is only OCC? (Exponential backoff between retries and keeping transactions short don't count, I assume those are a given)
When transactions conflict, the priority of the losing transaction is internally ratcheted up. At some point it's just higher than everything after it and it succeeds.
This is a good insight; livelock was one of the things that kept me up at night, although my fellow contributors have allayed my concerns considerably.
In a write-heavy workload where the transactions follow a read-modify-write pattern, it does seem possible that your application could be facing live lock concerns. So, let's look at where those concerns come from in CockroachDB:
1. For write-write conflicts, the livelock concern comes from the fact that a later transaction can abort your in-progress transaction if it has a higher priority. On top of this, if your transaction ends up aborting an earlier transaction, the other transaction may retry with a potentially higher priority, and this can result in a "priority war" of sorts.
2. "Read-write" conflicts have a worse problem: if a transaction with a later timestamp reads a key before your transaction writes to it, then your transaction always aborts. In read-modify-write workloads where the read is first, this seems to be the biggest theoretical source of live-lock.
Now, as for dealing with those issues.
+ The "Write-write" issue is the smaller concern here. The ratcheting-up of priorities is probabilistic; currently, after enough retries your transaction will have a high enough priority that it will almost certainly make progress. However, we are not totally content with this: we have an issue for our 1.0 release (https://github.com/cockroachdb/cockroach/issues/5727) to do some serious investigation of this, with some alternatives already suggested. For example, one such alternative is using "lowest original timestamp wins" to settle conflicts instead of a random priority.
+ The read-write concern seems more problematic; our read timestamp cache retains minimal data due to memory concerns, and thus aborts a bit conservatively (i.e. aborts transactions that might have been able to continue). This can be dealt with by modifying your application pattern - in particular, I would suggest using our "SNAPSHOT" isolation level for many workloads. I didn't cover this in the blog post, but in short the SNAPSHOT mode allows RW conflicts to occur without aborting transactions. SNAPSHOT transactions are subject to the "write skew" anomaly, but this anomaly does not occur if the involved transactions write to a common key. This is fortunately the case for many OLTP-type workloads, meaning SNAPSHOT can be used without anomalies; this also moves conflict-detection responsibility entirely to WW conflicts, which abort less conservatively than RW and should ameliorate any live-lock problems.
I just wanted to thank the team for this writeup. I've been excited to read it since the last post several months back. This was definitely very useful, and very clearly written.