
CockroachDB Stability Post-Mortem: From 1 Node to 100 Nodes - orangechairs
https://www.cockroachlabs.com/blog/cockroachdb-stability-from-1-node-to-100-nodes/
======
jakewins
Makes me really happy to see how earnest y'all are being about these issues
and how you're approaching them.

FWIW, we had very similar issues at Neo4j a few years back; I kind of wish
we'd blogged about it, but as you insinuate, there's strong social pressure
to keep your dirty laundry in the hamper.

One of the interesting things we learned was that testing the "worst case" was
often not very useful - the first time we shipped a GA release that had gone
through months of soak testing under the worst load we could imagine, it took
only a few weeks for a customer with a seemingly benign load to come back with
an unstable cluster.

Today, stability testing is done both at the "max out write throughput, then
pull the power cord, then pull the power cord again while it recovers, and do
that a million times" level and at the "run these four customer use cases,
including their devops setup, for two months and fail if uptime is not 100%"
level.

I'm not at Neo anymore, but I do wish someone who is would blog about the
stability testing regimen, because at this point, it is brutal.

~~~
PetahNZ
Unleash the chaos monkey!

~~~
jakewins
Which reminds me - another thing we learned is that too much chaos can be
easier on a distributed system than too little.

One of the robustness suites tries as hard as it can to permanently destroy a
Neo4j cluster, looking for things like distributed deadlocks, faults in the
leader election, data inconsistencies between replicas and so on.

It does that by applying a randomized load of all operations the system
supports; reads, writes and schema changes. That's then combined with induced
hardware faults and "regular" operational restarts and cluster configuration
changes.

The problem was that, early on, the test would actually create unresponsive
clusters, but then the "chaos" would continue, stirring up enough dust to get
the cluster going again before the "unresponsive" timeouts triggered, causing
false green tests.

Hence: Today this suite plays out a "chaos" scenario, but then it heals
network partitions, turns unpowered instances back on and so forth, and sits
back and waits for the cluster to recover "on its own".
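
A minimal Go sketch of that "chaos, then heal, then wait" shape - Cluster and
every name below are hypothetical stand-ins for a real test harness, not
Neo4j's actual suite:

    // Sketch of a chaos test that only checks health after healing.
    package chaos

    import (
        "context"
        "testing"
        "time"
    )

    // Cluster abstracts a harness that controls real nodes and links.
    type Cluster interface {
        PartitionRandomLink()
        PowerOffRandomNode()
        HealAllPartitions()
        PowerOnAllNodes()
        RunRandomizedLoad(ctx context.Context) // reads, writes, schema changes
        Healthy() bool
    }

    func runChaosThenHeal(t *testing.T, c Cluster) {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
        defer cancel()

        // Randomized load runs through the whole scenario.
        go c.RunRandomizedLoad(ctx)

        // Phase 1: chaos. Don't check health here - ongoing chaos can
        // "stir up dust" and mask an already-wedged cluster.
        for i := 0; i < 100; i++ {
            c.PartitionRandomLink()
            c.PowerOffRandomNode()
            time.Sleep(time.Second)
        }

        // Phase 2: heal everything, then stop interfering entirely.
        c.HealAllPartitions()
        c.PowerOnAllNodes()

        // Phase 3: the cluster must recover with no further help.
        deadline := time.Now().Add(5 * time.Minute)
        for !c.Healthy() {
            if time.Now().After(deadline) {
                t.Fatal("cluster did not recover on its own after healing")
            }
            time.Sleep(5 * time.Second)
        }
    }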

~~~
brianwawok
This is considered multiple unrelated failures. You usually don't want to (at
least to start) design a system to handle this.

Handle a machine powering down? Yes. Handle a network cable being unplugged?
Yes. Handle one machine powering down and one network cable being unplugged at
the same time? That may be impossible to handle, or it may be an order of
magnitude harder.

Get all the single-failure issues nailed down. Then perhaps you can do a
little work on multiple unrelated failures, but it gets insanely complex fast.

------
grizzles
I found the conclusions for mitigating instability in future efforts very
surprising. There was no mention of test suites, dependency injection, etc.

If I were building this type of system, I would want to build an exhaustive
test suite that allowed me to make stability-invariant forward progress on the
code base without relying on engineers to catch regressions by eye.

That would mean constructing a system to enumerate the edge cases at
different cluster sizes, automate the testing, simulate the network with DI
and without Docker, and be able to configure different network delays and
other salient parameters.

The omission of a system like this is surprising because it's really not that
hard to build and make it an integral part of core development. I may be
wrong, but the last paragraph of the post certainly seems to imply that this
isn't done or hasn't been considered.

~~~
qaq
There is nothing simple about building or testing a system like this. How
exactly would DI help you test a multi-node system?

~~~
grizzles
Everything's relative. It's easier to build a test framework for CockroachDB
than to build CockroachDB itself. DI could help by simulating the network; it
gives you more power to exercise and record states around weird edge cases. If
you were using Docker, there are things that you probably can't configure, or
that would be very difficult to, e.g. changing DNS lookup latency for a subset
of hosts.
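
A rough Go sketch of that seam (all names hypothetical, not CockroachDB's
actual code): production code depends on a resolver interface, and tests
inject a fake that slows DNS for chosen hosts:

    // DI for the network: swap the real resolver for a simulated one.
    package netsim

    import (
        "context"
        "net"
        "time"
    )

    // Resolver is the injection seam; production code depends on this
    // instead of calling the standard resolver directly.
    type Resolver interface {
        LookupHost(ctx context.Context, host string) ([]string, error)
    }

    // SimResolver wraps a real resolver and injects extra DNS latency
    // for a subset of hosts - hard to arrange with Docker alone.
    type SimResolver struct {
        Inner      Resolver
        ExtraDelay map[string]time.Duration // per-host injected latency
    }

    func (r *SimResolver) LookupHost(ctx context.Context, host string) ([]string, error) {
        if d, ok := r.ExtraDelay[host]; ok {
            select {
            case <-time.After(d):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
        return r.Inner.LookupHost(ctx, host)
    }

    // Example wiring: lookups for one node take an extra 500ms.
    func newSlowResolver() Resolver {
        return &SimResolver{
            Inner:      net.DefaultResolver,
            ExtraDelay: map[string]time.Duration{"node3.example.internal": 500 * time.Millisecond},
        }
    }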

~~~
Jweb_Guru
Oh, come on. Nothing short of formal verification is going to be sufficient
for testing something like this (and formal verification is far, far harder
for distributed systems than it is for single-node ones), and even then it
still can't validate your environment assumptions. DI isn't even a drop in the
bucket. If anything, DI makes it more likely your tests will be off,
_especially_ if you're relying in any way on (even loose) global clock
synchronization.

~~~
grizzles
As the author mentioned in the original article, TLA+, ATPs, etc. verify the
correctness of systems; they don't apply to the type of stability problems
that have plagued CockroachDB. That emergent behavior arose from its
implementation, optimizations, etc. Just because a system is unstable doesn't
mean it's incorrect (or correct).

That is why a simulator is useful. The entire point of building a sim is so
you can run the exact code that comprises CockroachDB and control for any
situation. For example, an optimization that works great at 10 nodes may be
terrible for larger clusters. You need to find that out through testing
because most humans can't look at a line of code and infer that.

A side benefit of this approach is that if you change some parameter and
suddenly see a big increase or decrease in db throughput, you gain a sense of
which parameters the overall system is most sensitive to, and will then be
able to diagnose problems in production more easily.
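
A toy Go sketch of the kind of sweep that surfaces this - SimCluster and
RunWorkload are hypothetical simulator hooks, and the numbers are placeholders;
only the shape of the test matters:

    // A scale-sweep test: run the same workload at several cluster sizes
    // and fail when per-node throughput collapses at larger scales.
    package sim

    import "testing"

    // SimCluster stands in for an in-process simulated cluster.
    type SimCluster struct{ Nodes int }

    // RunWorkload returns measured throughput (ops/sec) for a fixed
    // workload; a real simulator would execute the actual db code here.
    func (c *SimCluster) RunWorkload() float64 {
        return 1000 * float64(c.Nodes) // placeholder: ideal linear scaling
    }

    func TestThroughputScales(t *testing.T) {
        var prevPerNode float64
        for _, n := range []int{3, 10, 50, 100} {
            c := &SimCluster{Nodes: n}
            perNode := c.RunWorkload() / float64(n)
            // Flag optimizations that help at 10 nodes but hurt at 100 -
            // the kind of regression no one can see in a code review.
            if prevPerNode > 0 && perNode < prevPerNode/2 {
                t.Errorf("per-node throughput fell from %.0f to %.0f ops/sec at %d nodes",
                    prevPerNode, perNode, n)
            }
            prevPerNode = perNode
        }
    }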

~~~
Jweb_Guru
What sort of cycle-accurate simulation are you proposing, exactly (I assume
you're aiming for something pretty accurate if your simulation is going to
figure out performance-related problems!), and how will it be improved by
dependency injection? If you just want to simulate a network, why not use
something like [http://mininet.org/](http://mininet.org/)?

(Also: formal proofs of correctness for stuff like leader election are very
much concerned with stability / liveness, not just safety; that's a common
misconception).

------
tyingq
They also recently got joins working[1], so it's now pretty close to being
like etcd, but with SQL instead of KV.

Once they get to the rest of the stability and performance roadmap milestones,
it seems like they are positioned to be a great backing store for a lot of
interesting use cases.

Nice work!

[1]
[https://github.com/cockroachdb/cockroach/issues/2970](https://github.com/cockroachdb/cockroach/issues/2970)

~~~
irfansharif
Our current JOIN implementation is a first, non-optimized pass. We're working
on speeding it up, but here's the blog post from when that first
implementation landed, if you're interested [1].

[1]: [https://www.cockroachlabs.com/blog/cockroachdbs-first-join/](https://www.cockroachlabs.com/blog/cockroachdbs-first-join/)

------
andrewchambers
Cockroach seems like a great product to me. I have a project I'm developing
against it. If they can launch a managed version which is set-and-forget, I
would be very tempted to pay for it.

I have concerns they will not be able to reach stability in a reasonable time
frame, and I am prepared to pull the plug and go to Postgres if needed. The
fact that they are trying to maintain compatibility with Postgres makes this
gamble far less risky, and was a wise strategic move.

I wish them the best.

~~~
hamandcheese
So if you can get away with Postgres, then why do you want to use Cockroach?

~~~
andrewchambers
I like the idea of a self healing cluster where each node is equal and the
amount of work required to maintain it is kept to a minimum.

------
ninjakeyboard
This sounds like a cool project - similar claims to Spanner?
[http://research.google.com/archive/spanner.html](http://research.google.com/archive/spanner.html)

I'll keep watching for jepsen tests!! :)
[https://www.cockroachlabs.com/blog/diy-jepsen-testing-cockroachdb/](https://www.cockroachlabs.com/blog/diy-jepsen-testing-cockroachdb/)

~~~
irfansharif
Spanner and F1[1] were actually the inspirations for the project[2].

[1][http://research.google.com/pubs/pub41344.html](http://research.google.com/pubs/pub41344.html)

[2][https://www.cockroachlabs.com/docs/frequently-asked-questions.html#what-is-cockroachdb](https://www.cockroachlabs.com/docs/frequently-asked-questions.html#what-is-cockroachdb)

~~~
ninjakeyboard
Thanks! I missed that.

------
jasonwatkinspdx
Given some nasty comments about this here before, I just want to say I really
appreciate these kinds of transparent postmortems. Credit to the cockroach
folks for putting themselves out there in a way many would not.

------
nathancahill
I'm so ashamed of HN after reading the previous comments here:
[https://news.ycombinator.com/item?id=12361921](https://news.ycombinator.com/item?id=12361921).
Glad these are more positive :)

Love @redwood's response though: "This is like a parent saying "We're really
looking forward to our child learning to walk by her next birthday" and you
responding: "I've got children in their 30s... what kind of parents focus on
getting their kids to walk?""

~~~
ploxiln
That response to "ok we're going to make the database stable now" is quite
reasonable for something which has had the publicity that CockroachDB has had.
(Even the name suggests stability/availability above all else!)

The alternative is for CockroachDB devs to clearly state "this does not work
yet, no one should be trying to use it" before all notable communications.

This is actually a necessary immune response to an "open source ecosystem"
that is becoming overwhelming to navigate - people need to know what not to
bother paying attention to yet.

~~~
nathancahill
It's only overwhelming if you jump from technology to technology for real
products. I keep a list of up-and-coming technologies that I play around with
and build toy projects on over a year. If one lives up to its claims, I might
consider using it for real products.

Software is fast-paced. You just can't ignore all unstable/beta technologies,
or you'll be far behind the game when they hit 1.0.

~~~
ploxiln
Hah - these days 1.0 is still alpha. Docker was still unstable at v1.6 (but I
was able to deploy it at v1.8.2 if I carefully avoided many of its features).
MySQL is actually pretty good at 5.5 / 5.6 but earlier versions were a bit of
trouble ...

------
datahack
I love the project. Technologically, this thing is _awesome_ and I want to try
it out.

But I haven't because of the name.

I want to raise a totally sideways concern that has no relevance to the
project, just so it gets out there. The only problem I'm having is that I,
like about 10% of the US population, have entomophobia.

The name of the project gives me the heebie-jeebies and keeps me from wanting
to work with it, because I loathe cockroaches. I know it's not relevant
technologically, but I really can't imagine doing a training on Cockroach, for
example.

Oh. My. God. No.

Anyways, wanted to provide the feedback. Sorry if it's totally weird.

~~~
grizzles
I think the branding is brilliant. It's unforgettable. It's a statement of
purpose. It hits their target market (engineers) perfectly.

I sympathize though. Entomophobia must be a bummer.

------
Animats
_" We accomplished this by splitting our master branch into two branches: the
master branch would be dedicated to stability, freezing with the exception of
pull requests targeting stability. All other development would continue in a
develop branch."_

There was a time when most software was developed that way.

------
jondubois
Raft is useful for handling redundancy (involving a limited number of nodes),
but it has hard scalability limits. To achieve theoretically unlimited
scalability, EVERY subsystem has to be embarrassingly parallel (which, as far
as I know, Raft isn't). If even one subsystem is not embarrassingly parallel,
then the system as a whole will be limited in its scalability.

To build an embarrassingly parallel system, you can't avoid some sort of
sharding. I haven't looked too deeply into 'Quiescing Raft', but it still
looks like scalability would be limited - it still looks like we have a
scenario where each node might communicate with every other node (though in a
more economical way).

To remove the limit completely, you'd have to settle on a solution with
fixed-size Raft groups that do not grow as you add more nodes to the cluster.
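
A minimal Go sketch of that last point (illustrative only - not how
CockroachDB actually assigns ranges): keys hash to a fixed set of shards, and
each shard gets its own fixed-size Raft group, so no consensus group ever
spans the whole cluster:

    // Sharding into fixed-size Raft groups: grow the shard count, never
    // the group size, so consensus traffic stays bounded per group.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const (
        numShards       = 64 // add shards as the cluster grows
        replicasPerRaft = 3  // each Raft group stays this size
    )

    // shardFor maps a key to a shard, and thus to one small Raft group.
    func shardFor(key string) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32() % numShards)
    }

    // replicasFor picks the nodes hosting a shard's Raft group. A real
    // system would use a rebalancing placement policy, not this layout.
    func replicasFor(shard, numNodes int) []int {
        nodes := make([]int, replicasPerRaft)
        for i := range nodes {
            nodes[i] = (shard + i) % numNodes
        }
        return nodes
    }

    func main() {
        const numNodes = 100
        for _, key := range []string{"user/42", "order/7"} {
            s := shardFor(key)
            fmt.Printf("key %q -> shard %d -> raft group on nodes %v\n",
                key, s, replicasFor(s, numNodes))
        }
    }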

------
johnnydoebk
Sorry for a question that is not directly related to the blog post, but
anyway: CockroachDB is the only Go project I'm aware of that uses 2-space
indentation instead of tabs. I'm curious, is there any reason for that?

~~~
bdarnell
(Cockroach Labs co-founder here) The source files contain tab characters
(gofmt enforces this for all Go projects). We configure our editors to render
tabs as two spaces. There's no deep reason for this; we've just gotten used to
two-space indents from our time at Google and other projects that used this
convention.

Since Go code is formatted with tabs, you can mostly get away with setting the
indentation to whatever you want. The one practical problem with letting
people choose their own values for the width of a tab is that it becomes
tricky to enforce a uniform line length, so we've standardized on two-space
indents (and 100-char line lengths) across the project.

~~~
johnnydoebk
Thanks, now I see those are tabs. I didn't know GitHub's indentation rendering
is customizable. Interesting.

------
joncrane
The final bullet point is super weird to me:

>Systems like CockroachDB must be tested in a real world setting. However,
there is significant overhead to debugging a cluster on AWS. The cycle to
develop, deploy, and debug using the cloud is very slow. An incredibly helpful
intermediate step is to deploy clusters locally as part of every engineer’s
normal development cycle (use multiple processes on different ports). See the
allocsim and zerosum tools.

I'd like to know more details about why AWS deployment is slower for this use
case; bandwidth?

~~~
knz42
It's a combination of bandwidth and latency. Latency in particular feels slow:
you need to copy the new binary to every node, then issue commands over ssh to
scrub any leftover files from previous runs, initialize a cluster, run the
tests, stop the cluster, and retrieve the log files. Each of these steps takes
a few extra seconds over the network, which makes running hundreds of tests
much slower than it would be locally.
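
For what it's worth, the local workflow the post recommends is scriptable in a
few lines. A rough Go sketch spawning a three-node cluster as separate
processes on different ports - the flags approximate the CockroachDB CLI of
that era, shown as an illustration rather than a reference:

    // Launch a small local cluster for fast develop/debug cycles.
    package main

    import (
        "fmt"
        "log"
        "os/exec"
    )

    func main() {
        const (
            nodes    = 3
            basePort = 26257
            httpBase = 8080
        )
        for i := 0; i < nodes; i++ {
            args := []string{
                "start",
                "--insecure",
                fmt.Sprintf("--store=/tmp/crdb-node%d", i),
                fmt.Sprintf("--port=%d", basePort+i),
                fmt.Sprintf("--http-port=%d", httpBase+i),
            }
            if i > 0 {
                // later nodes join the first to form the cluster
                args = append(args, fmt.Sprintf("--join=localhost:%d", basePort))
            }
            cmd := exec.Command("cockroach", args...)
            if err := cmd.Start(); err != nil {
                log.Fatalf("node %d failed to start: %v", i, err)
            }
            log.Printf("node %d: port %d (pid %d)", i, basePort+i, cmd.Process.Pid)
        }
        // Nodes outlive this process; track the cmd handles (or pkill)
        // to tear the cluster down after the debug session.
    }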

------
Firefishy
Seeing random data corruption over the network? Try:

    /sbin/ethtool -K ethX gso off tso off sg off gro off

udev rule, something like:

    SUBSYSTEM=="net", ACTION=="add", ATTRS{vendor}=="0x8086", ATTRS{device}=="0x10c9", ATTRS{subsystem_vendor}=="0x103c", ATTRS{subsystem_device}=="0x323f", RUN+="/sbin/ethtool -K $name gso off tso off sg off gro off"

~~~
brazzledazzle
What does this do and what are the downsides?

~~~
Firefishy
It disables some of the network card's hardware offloading. Some cards have
occasional faults, sometimes as seldom as one corrupt packet every 4TB of
data. The corrupted packets have valid TCP checksums, so they go undetected.
This is commonly first spotted as an ssh error: "Corrupted MAC on input".

Example of an OpenStack vendor guide recommending turning the offloading off:
[http://clouddoc.stratus.com/1.5.1.0/en-us/Content/Help/P03_Support/C02_InstallGuide/C05_OpenStackInstall/T_DisablingTCPOffload.htm](http://clouddoc.stratus.com/1.5.1.0/en-us/Content/Help/P03_Support/C02_InstallGuide/C05_OpenStackInstall/T_DisablingTCPOffload.htm)

The downside is you may see slightly higher CPU usage and/or slightly lower
network throughput; I personally have not noticed a significant drop in
performance.

------
prospero
What did you use to create your animated simulations?

~~~
orangechairs
(Cockroach Labs here) The simulations are available in Spencer's GitHub repo
here:
[https://github.com/spencerkimball/spencerkimball.github.io/tree/master/simulation](https://github.com/spencerkimball/spencerkimball.github.io/tree/master/simulation)

~~~
prospero
Thanks!

------
bpicolo
The links in the article seem broken (they all point back to the blog post).

Would be cool to see some of the bigger relevant PRs.

~~~
irfansharif
[Cockroach Labs employee] Most of the 'stability'-related PRs/issues were
labelled explicitly [1]. Going over the ones with the heaviest discussion
might point you to what you're looking for [2].

[1][https://github.com/cockroachdb/cockroach/labels/stability](https://github.com/cockroachdb/cockroach/labels/stability)

[2][https://github.com/cockroachdb/cockroach/issues?utf8=%E2%9C%93&q=label%3Astability%20sort%3Acomments-desc%20](https://github.com/cockroachdb/cockroach/issues?utf8=%E2%9C%93&q=label%3Astability%20sort%3Acomments-desc%20)

------
velodrome
Anybody using it in production yet?

~~~
nas
Not me. However, I'm starting to plan a new product that needs to scale to
reliably hold a lot of measurement data. I initially thought Cassandra might
be a good fit, but after doing some research, I like the design of CockroachDB
better.

~~~
velodrome
I'm also looking at this project for a similar use case. Have you looked at
InfluxDB or Prometheus?

