Under the terms of the Unified HN Convention, agreed 2015, every thread about CockroachDB must by law contain a series of complaints about the name of the database. Please post yours below.
To help you get started, here are some prompts you might use:
"My Enterprise CTO will never go for something named..."
"I just think the name sounds really disgusting and off-putting..."
"Marketing a product is at least as important as making a product, and this is bad marketing..."
As it stands, it seems to me that CockroachDB is mostly just reinventing Spark from scratch, except maybe from a more OLTP-centric perspective.
I know the hype is to "use NoSQL," but the reality is that there's a limited set of cases where the existing NoSQL solutions truly make sense. For most companies, SQL is still the best choice. That CockroachDB is bringing some new ideas to the SQL space should be lauded. And I think they're doing it in a way that can't easily or efficiently be replicated by current systems (which is perhaps why they're receiving so much attention).
Spark has a sophisticated query engine, which seems perfectly capable of pruning partitions and pushing down index scans into data sources. Yes, Spark can't do writes, so you'll still have to build a transactional KV store that sits underneath the query layer, but you won't have to implement SQL from scratch. (This seems to be similar to the approach taken by e.g. SnappyData and Splice Machine.)
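To make the pushdown idea concrete, here's a toy sketch of what it looks like for a query layer to hand filters down to a KV-backed data source, so only matching rows cross the boundary. All names here (`KVStore`, `Filter`, `scan`) are invented for illustration; they're not Spark's actual Data Sources API, which works along similar lines but with its own interfaces.

```python
# Toy sketch of filter pushdown: the storage layer applies filters itself,
# so only matching rows and requested columns reach the query engine.
# All names are illustrative, not a real Spark or CockroachDB API.
from dataclasses import dataclass

@dataclass
class Filter:
    column: str
    op: str       # only "=" and ">" in this sketch
    value: object

class KVStore:
    """Stands in for the transactional KV layer under the query engine."""
    def __init__(self, rows):
        self.rows = rows  # list of dicts

    def scan(self, required_columns, filters):
        def matches(row, f):
            if f.op == "=":
                return row[f.column] == f.value
            if f.op == ">":
                return row[f.column] > f.value
            return True  # unsupported filters would be re-checked upstream
        # Apply filters and project columns at the source ("pushdown").
        return [
            {c: row[c] for c in required_columns}
            for row in self.rows
            if all(matches(row, f) for f in filters)
        ]

store = KVStore([
    {"id": 1, "region": "eu", "amount": 10},
    {"id": 2, "region": "us", "amount": 50},
    {"id": 3, "region": "us", "amount": 5},
])

# Roughly: SELECT id FROM t WHERE region = 'us' AND amount > 20
result = store.scan(["id"], [Filter("region", "=", "us"), Filter("amount", ">", 20)])
print(result)  # [{'id': 2}]
```

The point being that the heavy lifting (filtering, projection) happens next to the data, and the query engine only orchestrates.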
Like some of the other comments in this thread, the idea was to provide all the guarantees of an OLTP store (HA, ACID, scalability, mutations, etc.) with the powerful analytic capabilities of Spark.
Now of course Spark makes up for this with its great flexibility and scalability, but I don't really see the two technologies as competing.
And that's without even getting into the other parts of the data model (insert, update, delete) that, by design, don't exist in Spark (or only "kind of" exist).
Also, last I checked Spark had very limited support for mutating data in SQL; it was designed for queries. Cockroach has to do both.
Spark has some support for SQL DML, and to me providing ACID transactions mostly seems like a concern for the underlying KV layer, which could then expose them to Spark through the Data Sources API.
This is a big difference: separating the compute and storage layers buys flexibility, but it comes with significant overhead.
Furthermore, it seems to me that query evaluation techniques aren't really that different between OLTP and OLAP workloads, i.e. the difference is mainly in the storage format.
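To make the storage-format point concrete, here's a toy comparison of the same table in row-oriented and column-oriented layouts; the data is invented, and the example only illustrates why each layout suits its workload.

```python
# Same logical table, two physical layouts (toy OLTP-vs-OLAP illustration).

rows = [  # row-oriented: one record per entry; cheap point reads/writes (OLTP)
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 50},
    {"id": 3, "amount": 5},
]

columns = {  # column-oriented: one array per column; cheap column scans (OLAP)
    "id": [1, 2, 3],
    "amount": [10, 50, 5],
}

# OLTP-style access: fetch one whole record by key.
record = next(r for r in rows if r["id"] == 2)

# OLAP-style access: aggregate one column without touching the others.
total = sum(columns["amount"])
print(record, total)  # {'id': 2, 'amount': 50} 65
```

The query plans over either layout can look much the same; it's the per-row vs. per-column access cost underneath that differs.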
In that sense, I think it follows that Spark is a good fit as the query layer. Notice phrasings like "arbitrary computations close to your data". That's basically exactly what Spark already is. I think there's a lot of synergy in combining it with a strongly consistent distributed storage engine.
Anyways, my $0.02, evidently not a very popular opinion ¯\_(ツ)_/¯
Potentially added constraints:
- ACID for OLTP
- 30-way analytical joins on complex criteria and multiple data sets with billions of entries
- fast iterations on data prep for analytics, so analysts can make, find, and correct errors
- proper workload management (almost no "stupidly designed" queries)
I'm asking because I can't see this working without hardware and software being integrated to allow for it (an appliance). Are there any cloud offerings that live up to this?
EDIT: Formatting got mangled on submit.
In my view, one reason that we don't see huge demand for this combination is that the schema that makes sense for analytics is often different from that which makes sense for the online system.
Question(s): Do you offer any appliances? The reason I'm asking is computationally intense workloads where the same data may be shuffled between processors multiple times. Can one, e.g., set up MemSQL with RDMA over InfiniBand?
No, we do not offer appliances; we are a software-only solution. I don't know of any deployments where RDMA is being utilized today. I'm interested in your use case. If you're so inclined, join chat.memsql.com (my UN is eklhad) and we can converse a bit more rapidly.
I am charting the landscape of distributed database systems (federated and homogeneous). Node interconnectivity is just one of many potential bottlenecks.
With a sufficiently complex query, redistribution of data by hash must occur a number of times for linear scalability (based on my understanding). Ethernet-based interconnects typically suffer from high CPU utilization and various QoS issues for this particular use case. This also seems to apply to Ethernet-based fabric offerings, though I haven't kept up with that field for a couple of years.
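For readers unfamiliar with the redistribution step: each node hashes the join key of every row and routes it to the node owning that hash bucket, so matching rows meet on the same node and can be joined locally. Here's a toy single-process sketch with invented tables; in a real cluster each bucket transfer is network traffic, which is why the interconnect matters.

```python
# Toy sketch of hash redistribution ("shuffle") for a distributed join:
# every row goes to the node owning hash(join_key) % NUM_NODES, so rows
# sharing a key land together and each node joins its partition locally.
NUM_NODES = 3

def redistribute(table, key):
    buckets = [[] for _ in range(NUM_NODES)]
    for row in table:
        # In a cluster, appending to a remote bucket is a network send.
        buckets[hash(row[key]) % NUM_NODES].append(row)
    return buckets

orders = [{"cust": c, "order": o} for c, o in [("a", 1), ("b", 2), ("a", 3)]]
custs  = [{"cust": c, "name": n} for c, n in [("a", "Ann"), ("b", "Bo")]]

# Both sides are partitioned on the join key...
order_parts = redistribute(orders, "cust")
cust_parts  = redistribute(custs, "cust")

# ...so each "node" can join its own partitions independently.
joined = []
for node in range(NUM_NODES):
    lookup = {r["cust"]: r["name"] for r in cust_parts[node]}
    for o in order_parts[node]:
        joined.append((o["order"], lookup[o["cust"]]))

print(sorted(joined))  # [(1, 'Ann'), (2, 'Bo'), (3, 'Ann')]
```

A 30-way join can require many such shuffles in sequence, each moving a large fraction of the data across the wire, which is where interconnect CPU overhead and QoS start to dominate.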
If you guys are encountering performance issues connected to either RAM=>CPU loading or data redistribution between nodes, you may want to keep this in mind.
I may get in touch via chat at a later time as I'm slightly more than average interested in HPC database systems :) The more offerings, the better!