CockroachDB Stability Post-Mortem: From 1 Node to 100 Nodes (cockroachlabs.com)
202 points by orangechairs on Nov 15, 2016 | 62 comments



Makes me really happy to see how earnest y'all are being about these issues and how you're approaching them.

FWIW, we had very similar issues at Neo4j a few years back; I kind of wish we'd blogged about it, but as you insinuate, I guess there's a strong social pressure to keep your dirty laundry in the hamper.

One of the interesting things we learned was that testing the "worst case" was often not very useful - the first time we shipped a GA that'd gone through months-long soak testing under the worst possible load we could imagine, it took only a few weeks for a customer with a seemingly benign load to come back with an unstable cluster.

Today, stability testing is done both on the "max out write throughput, then pull the power cord, then pull the power cord again while it recovers, and do that a million times"-level and the "run these four customer use cases, including their devops setup, for two months and fail if uptime is not 100%"-level.

I'm not at Neo anymore, but I do wish someone who is would blog about the stability testing regimen, because at this point, it is brutal.


Unleash the chaos monkey!


Which reminds me - another thing we learned is that too much chaos can be easier on a distributed system than too little.

One of the robustness suites tries as hard as it can to permanently destroy a Neo4j cluster, looking for things like distributed deadlocks, faults in the leader election, data inconsistencies between replicas and so on.

It does that by applying a randomized load of all operations the system supports: reads, writes, and schema changes. That's then combined with induced hardware faults and "regular" operational restarts and cluster configuration changes.

The problem was that, early on, the test would actually create unresponsive clusters, but then the "chaos" would continue, stirring up enough dust to get the cluster going again before the "unresponsive" timeouts triggered, causing false green tests.

Hence: Today this suite plays out a "chaos" scenario, but then it heals network partitions, turns unpowered instances back on and so forth, and sits back and waits for the cluster to recover "on its own".
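To make that concrete, the overall shape of such a test is roughly this (a Go sketch with entirely made-up names, not the actual Neo4j harness):

    package chaos

    import (
        "errors"
        "time"
    )

    // Cluster stands in for whatever harness drives the real nodes.
    type Cluster interface {
        StartRandomLoad() (stop func()) // reads, writes, schema changes
        InjectFaults(d time.Duration)   // partitions, power-offs, restarts
        HealPartitions()
        PowerOnAllNodes()
        WaitHealthy(timeout time.Duration) bool
    }

    func runChaosScenario(c Cluster) error {
        // Phase 1: randomized load combined with induced faults.
        stop := c.StartRandomLoad()
        c.InjectFaults(10 * time.Minute)
        stop()

        // Phase 2: undo every fault we introduced, then stop interfering.
        c.HealPartitions()
        c.PowerOnAllNodes()

        // Phase 3: the cluster has to become healthy again entirely on its
        // own; otherwise the old "false green" problem comes right back.
        if !c.WaitHealthy(5 * time.Minute) {
            return errors.New("cluster did not recover unaided")
        }
        return nil
    }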


This is considered multiple unrelated failures. You usually don't want to design a system to handle that (at least not to start).

Handle a machine power down? Yes. Handle a network cable unplug? Yes. Handle one machine power down and one network cable unplug at the same time? That may be impossible to handle, or it may be an order of magnitude harder.

Get all the single failure issues nailed down. Then perhaps you can do a little work on multiple unrelated failures, but it gets insanely complex fast.


That suggests that your original chaos was more "random" than "chaotic", as actually chaotic processes go through indefinite periods of slack where they are apparently predictable; i.e., a better "chaos monkey" might employ some traversal of a fractal system rather than simply aim for high-entropy noise.


I found the conclusions for mitigating instability in future efforts very surprising. There was no mention of test suites, dependency injection, etc.

If I were building this type of system, I would want an exhaustive test suite that let me make forward progress on the code base without regressing stability, rather than relying on engineers' eyes to catch regressions.

That would mean constructing a system that can enumerate the edge cases at different cluster sizes, automate the testing, simulate the network with DI (and without Docker), and configure different network delays and other salient parameters.
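Roughly what I mean by simulating the network with DI, as a sketch in Go (all of these names are hypothetical, not CockroachDB's actual code): the production code talks to an interface, and the test harness swaps in a transport that injects configurable delay.

    package sim

    import (
        "context"
        "time"
    )

    // Transport is the seam the rest of the system would depend on.
    type Transport interface {
        Send(ctx context.Context, to string, msg []byte) error
    }

    // DelayedTransport wraps another transport (real or in-memory) and adds
    // per-destination latency, so tests can explore timing edge cases
    // without touching real hardware or Docker.
    type DelayedTransport struct {
        Inner Transport
        Delay map[string]time.Duration // destination -> added latency
    }

    func (d *DelayedTransport) Send(ctx context.Context, to string, msg []byte) error {
        if wait, ok := d.Delay[to]; ok {
            select {
            case <-time.After(wait):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return d.Inner.Send(ctx, to, msg)
    }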

The omission of a system like this is surprising because it's really not that hard to build and to make an integral part of core development. I may be wrong, but the last paragraph of the post certainly seems to imply that this isn't done or hasn't been considered.


There is nothing simple in building or testing a system like this. How exactly would DI help you test a multi-node system?


Everything's relative. It's easier to build a test framework for CockroachDB than to build CockroachDB itself. DI could help by simulating the network. It gives you more power to exercise and record states around weird edge cases. If you were using Docker, there are things that you either can't configure at all or could only configure with great difficulty, e.g. changing DNS lookup latency for a subset of hosts.
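For the DNS case, the same idea as a small sketch (again with hypothetical names): inject a resolver and slow down lookups for chosen hosts.

    package sim

    import (
        "context"
        "net"
        "time"
    )

    // SlowResolver delays lookups for a chosen subset of hosts, which is
    // exactly the kind of knob that's awkward to turn inside Docker.
    type SlowResolver struct {
        Inner *net.Resolver
        Slow  map[string]time.Duration // host -> extra lookup latency
    }

    func (r *SlowResolver) LookupHost(ctx context.Context, host string) ([]string, error) {
        if d, ok := r.Slow[host]; ok {
            time.Sleep(d)
        }
        return r.Inner.LookupHost(ctx, host)
    }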


Oh, come on. Nothing short of formal verification is going to be sufficient for testing something like this (and formal verification is far, far harder for distributed systems than it is for single-node ones), and even then it still can't validate your environment assumptions. DI isn't even a drop in the bucket. If anything, DI makes it more likely your tests will be off, especially if you're relying in any way on (even loose) global clock synchronization.


As the author mentioned in the original article, TLA+, ATPs, etc. verify the correctness of systems; they don't apply to the type of stability problems that have plagued CockroachDB. That emergent behavior arose from its implementation, optimizations, etc. Just because a system is unstable doesn't mean it's incorrect (or correct).

That is why a simulator is useful. The entire point of building a sim is so you can run the exact code that comprises CockroachDB and control for any situation. For example, an optimization that works great at 10 nodes may be terrible for larger clusters. You need to find that out through testing because most humans can't look at a line of code and infer that.

A side benefit of this approach is that if you change some parameter and suddenly see a big increase or decrease in db throughput, you gain a sense for which parameters the overall system is most sensitive to, and you will then be able to more easily diagnose problems in production.
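As a sketch of what I mean (names are made up; runSimulation is just a placeholder for driving the real code inside the sim):

    package sim

    import (
        "fmt"
        "testing"
    )

    // runSimulation stands in for "run the actual CockroachDB code in the
    // simulator with a given cluster size and tuning parameter" and would
    // return some aggregate throughput number.
    func runSimulation(nodes, rebalanceIntervalSec int) (opsPerSec float64) {
        // ... drive the simulated cluster here ...
        return 0
    }

    // Sweep the same scenario across cluster sizes and parameter values, so
    // an optimization that only hurts large clusters shows up in CI instead
    // of production.
    func TestThroughputAcrossClusterSizes(t *testing.T) {
        for _, nodes := range []int{3, 10, 50, 100} {
            for _, interval := range []int{1, 5, 30} {
                name := fmt.Sprintf("nodes=%d/rebalance=%ds", nodes, interval)
                t.Run(name, func(t *testing.T) {
                    _ = runSimulation(nodes, interval)
                    // compare against a recorded baseline for this configuration
                })
            }
        }
    }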


What sort of cycle-accurate simulation are you proposing, exactly (I assume you're aiming for something pretty accurate if your simulation is going to figure out performance-related problems!), and how will it be improved by dependency injection? If you just want to simulate a network, why not use something like http://mininet.org/?

(Also: formal proofs of correctness for stuff like leader election are very much concerned with stability / liveness, not just safety; that's a common misconception).


It's not relative; it's literally the hardest type of DB system you can try to build. I have my doubts that an entity with relatively modest resources can pull it off, but pretending there are trivial solutions for even testing a distributed system, let alone a distributed system of this complexity, is simply not grounded in reality.


Religion.


They also recently got joins working[1], so it's pretty close now to being like etcd, but with SQL instead of a plain KV interface.

Once they get to the rest of the stability and performance roadmap milestones, it seems like they are positioned to be a great backing store for a lot of interesting use cases.

Nice work!

[1] https://github.com/cockroachdb/cockroach/issues/2970


Our current implementation of JOINs is a first, non-optimized pass. We're currently working on speeding it up, but here's the blog post from when that first implementation landed, if you're interested [1].

[1]: https://www.cockroachlabs.com/blog/cockroachdbs-first-join/


Cockroach seems like a great product to me. I have a project I'm developing against it. If they can launch a managed version that is set-and-forget, I would be very tempted to pay for it.

I have concerns they will not be able to reach stability in a reasonable time frame and am prepared to pull the plug and go to postgres if needed. The fact that they are trying to maintain compatibility with postgres really makes this gamble far less risky, and was a wise strategic move.

I wish them the best.


So if you can get away with Postgres, then why do you want to use Cockroach?


I like the idea of a self-healing cluster where each node is equal and the amount of work required to maintain it is kept to a minimum.


This sounds like a cool project - similar claims to Spanner? http://research.google.com/archive/spanner.html

I'll keep watching for Jepsen tests!! :) https://www.cockroachlabs.com/blog/diy-jepsen-testing-cockro...


Spanner and F1[1] were actually the inspirations for the project[2].

[1]http://research.google.com/pubs/pub41344.html

[2]https://www.cockroachlabs.com/docs/frequently-asked-question...


Thanks! I missed that.


Given some nasty comments about this here before, I just want to say I really appreciate these kinds of transparent postmortems. Credit to the cockroach folks for putting themselves out there in a way many would not.


I'm so ashamed of HN after reading the previous comments here: https://news.ycombinator.com/item?id=12361921. Glad these are more positive :)

Love @redwood's response though: "This is like a parent saying "We're really looking forward to our child learning to walk by her next birthday" and you responding: "I've got children in their 30s... what kind of parents focus on getting their kids to walk?""


That response to "ok we're going to make the database stable now" is quite reasonable for something which has had the publicity that CockroachDB has had. (Even the name suggests stability/availability above all else!)

The alternative is for CockroachDB devs to clearly state "this does not work yet, no one should be trying to use it" before all notable communications.

This is actually a necessary immune response to an "open source ecosystem" that is becoming overwhelming to navigate - people need to know what not to bother paying attention to yet.


It's only overwhelming if you jump from technology to technology for real products. I keep a list of up-and-coming technologies that I play around with and build toy projects on over a year. If one lives up to its claims, I might consider using it for real products.

Software is fast-paced. You just can't ignore all unstable/beta technologies, or you'll be far behind the game when they hit 1.0.


Hah - these days 1.0 is still alpha. Docker was still unstable at v1.6 (but I was able to deploy it at v1.8.2 if I carefully avoided many of its features). MySQL is actually pretty good at 5.5 / 5.6 but earlier versions were a bit of trouble ...


Wow, yeah...that thread has some pretty harsh comments.

Perhaps it was somewhat driven by the Cockroach Labs home page, which mostly describes the product as they intend it to be, not as it currently is. The GitHub page, blog posts, etc. are very transparent about the current beta status, but the home page is a bit marketing-heavy.

Not trying to be overly critical, but rather, thinking it might be the reason for the harsh commentary. (commenters not seeing that the product isn't yet at a 1.0 release)


For me, I kept hearing "try CockroachDB" any time I brought up Google F1 or FoundationDB. Months of that went by, and then I saw a post saying that this high-availability, scale-out database had paid no attention to stability and couldn't scale to a dozen nodes. The criticism was warranted.

Their reaction to the problems and criticism was good, though. They got on the problems. The product is in better condition. A dedicated team is there to keep it that way. Good ending.


I love the project. Technologically, this thing is awesome and I want to try it out.

But I haven't because of the name.

I want to raise a totally sideways concern that has no relevance to the project, just so it gets out there. The only problem I'm having is that I, like about 10% of the US population, have entomophobia.

The name of the project gives me the heebie-jeebies and keeps me from wanting to work with it because I loathe cockroaches. I know it's not relevant technologically, but I really can't imagine doing training on Cockroach, for example.

Oh. My. God. No.

Anyways, wanted to provide the feedback. Sorry if it's totally weird.


I think the branding is brilliant. It's unforgettable. It's a statement of purpose. It hits their target market (engineers) perfectly.

I sympathize though. Entomophobia must be a bummer.


Well, at least it is not grabherbythepussyDB.


It could be great exposure therapy :)


"We accomplished this by splitting our master branch into two branches: the master branch would be dedicated to stability, freezing with the exception of pull requests targeting stability. All other development would continue in a develop branch."

There was a time when most software was developed that way.


Raft is useful for handling redundancy (involving a limited number of nodes), but it has hard scalability limits. To achieve theoretically unlimited scalability, EVERY subsystem has to be embarrassingly parallel (which, as far as I know, Raft isn't); if even one subsystem is not embarrassingly parallel, then the system as a whole will be limited in terms of scalability.

To build an embarrassingly parallel system, you can't avoid some sort of sharding. I haven't looked too deeply into 'Quiescing Raft', but scalability still looks limited - it still looks like we have a scenario where each node might communicate with every other node (though in a more economical way).

To remove the limit completely, you'd have to settle on a solution with fixed-size Raft groups that do not grow as you add more nodes to the cluster.
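What I mean by fixed-size groups, as a rough Go sketch (types and placement policy are purely illustrative, not CockroachDB's actual code): each key range gets its own small Raft group whose member count never depends on the cluster size.

    package shard

    import "hash/fnv"

    type Node struct{ ID string }

    // A Range is a contiguous span of keys replicated by its own small Raft
    // group; consensus traffic inside a group stays constant no matter how
    // many nodes you add to the cluster.
    type Range struct {
        StartKey, EndKey string
    }

    // assignReplicas picks exactly rf replicas for a range out of an
    // arbitrarily large node list. The placement policy here (hash of the
    // start key) is purely illustrative; a real allocator would also weigh
    // load, disk fullness, locality, and so on.
    func assignReplicas(r Range, nodes []Node, rf int) []Node {
        if len(nodes) <= rf {
            return nodes
        }
        h := fnv.New32a()
        h.Write([]byte(r.StartKey))
        start := int(h.Sum32() % uint32(len(nodes)))
        out := make([]Node, 0, rf)
        for i := 0; i < rf; i++ {
            out = append(out, nodes[(start+i)%len(nodes)])
        }
        return out
    }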


Sorry for a question that is not directly related to the blog post, but anyway: CockroachDB is the only Go project I'm aware of that uses 2-space indentation instead of tabs. I'm curious, is there any reason for that?


(Cockroach Labs co-founder here) The source files contain tab characters (gofmt enforces this for all Go projects). We configure our editors to render tabs as two spaces. There's no deep reason for this; we've just gotten used to two-space indents from our time at Google and other projects that used this convention.

Since Go code is formatted with tabs, you can mostly get away with setting the indentation to whatever you want. The one practical problem with letting people choose their own values for the width of a tab is that it becomes tricky to enforce a uniform line length, so we've standardized on two-space indents (and 100-char line lengths) across the project.


Thanks, now I see those are tabs. I didn't know GitHub's indentation rendering was customizable. Interesting.


The final bullet point is super weird to me:

>Systems like CockroachDB must be tested in a real world setting. However, there is significant overhead to debugging a cluster on AWS. The cycle to develop, deploy, and debug using the cloud is very slow. An incredibly helpful intermediate step is to deploy clusters locally as part of every engineer’s normal development cycle (use multiple processes on different ports). See the allocsim and zerosum tools.

I'd like to know more details about why, for this use case, AWS deployment is slower; bandwidth?


It's a combination of bandwidth and latency. Latency in particular feels slow; you need to copy the new binary to every node, then issue commands over ssh to scrub any leftover files, initialize a cluster, run the tests, stop the cluster, and retrieve the log files. Each of those steps takes a few extra seconds over the network, which makes running hundreds of tests much slower than it would be locally.


If you're seeing random data corruption over the network, try:

    /sbin/ethtool -K ethX gso off tso off sg off gro off

Or make it persistent with a udev rule, something like:

    SUBSYSTEM=="net", ACTION=="add", ATTRS{vendor}=="0x8086", ATTRS{device}=="0x10c9", ATTRS{subsystem_vendor}=="0x103c", ATTRS{subsystem_device}=="0x323f", RUN+="/sbin/ethtool -K $name gso off tso off sg off gro off"


What does this do and what are the downsides?


It disables some of the network card's hardware offloading. Some cards have occasional faults, corrupting data as often as once every 4 TB. The corrupt packets have valid TCP checksums. It's commonly first spotted as an ssh error, "Corrupted MAC on input".

Example of an OpenStack vendor guide recommending turning the offloading off: http://clouddoc.stratus.com/1.5.1.0/en-us/Content/Help/P03_S...

The downside is that you may have slightly higher CPU usage and/or slightly lower network throughput; I personally have not noticed a significant drop in performance.


What did you use to create your animated simulations?


(Cockroach Labs here) The simulations are available on Spencer's github repo here: https://github.com/spencerkimball/spencerkimball.github.io/t...


Thanks!


The links in the article seem broken (they all point to the blog post).

Would be cool to see some of the bigger relevant PRs


[cockroach labs employee] Most of the 'stability'-related PRs/issues were labelled explicitly [1]. Going over the ones with the heaviest discussions might point you to what you're looking for [2].

[1]https://github.com/cockroachdb/cockroach/labels/stability

[2]https://github.com/cockroachdb/cockroach/issues?utf8=%E2%9C%...


Sorry about that! They should all be working now.


Anybody using it in production yet?


Not me. However, I'm starting to plan a new product that needs to scale to reliably hold a lot of measurement data. I was initially thinking Cassandra might be a good fit. After doing some research, I like the design of CockroachDB better.


I'm also looking at this project for a similar use case. Have you looked at InfluxDB or Prometheus?


Take a look at Druid.


(Cockroach Labs here) No one is yet using it in production, but we're actively seeking beta clients. If you'd like to talk more, feel free to ping me: jessica @ cockroachlabs


If you spend some time using it now, you will find it's not ready yet.


[flagged]


With an attitude like that, I'm not surprised they don't take you seriously.

Even with lower performance I can see use cases where the ease of configuration works. Medium-sized businesses with relatively low query volume seem like a good fit to me.


Why would you use a Spanner-inspired system for a medium-sized business with low query volume :)?


I like the idea of a self-healing cluster where each node is equal and the amount of work required to maintain and migrate it is kept to a minimum.


I misspoke, I meant with moderate latency requirements. I have a product in mind where the queries don't require high performance. Instead I value availability without babysitting.


Do I need to take you seriously?! You don't have to play a "victim" role here; you can just keep scratching someone's back. "Medium sized businesses" don't stop doing their business because of CockroachDB. They were doing just fine with whatever systems/databases they were using prior to CockroachDB's existence.


Where's the arrogance? All I see is an honest, introspective look at the development process and stability of CockroachDB.

What statements are you referring to?


My guess is that the discussion being referred to is this one: https://groups.google.com/d/topic/cockroach-db/7w463Bv-Mbk/d...

Not sure where the vitriol is coming from though.


Here on HN, allegations don't usually go well without evidence.



