A Focus on Stability (cockroachlabs.com)
73 points by orangechairs on Aug 25, 2016 | 32 comments

"Our goal of a 10-node cluster running under continuous load for two solid weeks has been maddeningly elusive."

That's a low goal for a database. I've had MySQL and MariaDB instances running for years between reboots.

"We’ve decided to stabilize the master branch in isolation; new feature development will continue in the develop branch."

Well, yes, that's how it's done. You don't develop a usable database by constantly churning the production code.

This is like a parent saying "We're really looking forward to our child learning to walk by her next birthday" and you responding: "I've got children in their 30s... what kind of parents focus on getting their kids to walk?"

[cockroach labs here] We should clarify that the 10-node goal is our Q3 OKR. (I'll update the original post.) We're still actively developing the beta product and will be aiming for much more solid stability numbers by 1.0.

What's unbelievable to me is that the post even exists. The main goals of a database are integrity and availability. For performance, many companies will throw cash at the problem or just accept it being slower. Integrity and availability are the non-negotiable part. Even more so in a database advertising a stronger consistency model.

And they weren't focusing on that much at all until now? That really shakes my confidence that the end result of this project will be trustworthy. It's good they're shifting their focus, but who knows if it will be too late given how much code is in there already.

Database development is hard - especially from scratch. Oracle reached mainstream stability only in version 5. In the early days you cycle between feature development and stabilization - nothing new there. I like the transparency and commitment from the team. Source: I've been an early database core engineer.

This is true. It's quite hard. But Oracle started in a day and age before we had many good or popular tools for making robust software, and they cared more about profit than robustness, so it's hard to see how their case proves anything. By contrast, OpenVMS's record management and clusters became very stable by adding features for a week, testing on weekends, fixing problems for a week, and repeating. Uptimes of 17+ years existed, with a few years being more common. This is aiming at a milestone of... 2 weeks?

It's certainly hard. There are worked examples of getting the job done, though. All it took was focus and effort on QA. I believe the distributed nature will cause more difficulties than the older systems faced. Yet, the same remedy applies. Even more so given it's so much easier to fail.

"I like the transparency and commitment from the team"

I liked that a lot in the post and the HN comments. It gave me the impression of a database being designed for correctness, speed, scale, consistency, and reliability. Then this post says a few of those got dropped a while back. That's a lack of transparency and commitment on a key issue. Or perhaps this post is a resurgence of it. Hope that problem goes away.

> And they weren't focusing on that much at all until now?

They're working on a state-of-the-art design that is far more ambitious than the bulk of the NoSQL world. What they're doing is extremely difficult and involves integrating multiple complex distributed protocols. It's not so much that integrity and availability weren't focused on; it's that they're working with things that take enormous time and effort to fully debug.

"It's not so much that integrity and availability weren't focused on; it's that they're working with things that take enormous time and effort to fully debug."

You literally just contradicted yourself. You didn't mean to, but you did. What you said is that their software requirements and challenge are enormous. It's going to be hard to pull off both the theory and the implementation. There's room for huge problems in the protocol, the custom code, the libraries used, and OS interactions. Preventing tons of debugging in these situations requires turning QA way up. Maybe even adding protocol analysis, as Amazon does with TLA+, on top of integration/fuzz/unit tests and language-level analysis.

Then, you said they were too focused on making it work to do that part. The part that was a prerequisite of making it work. As they're now seeing.

All I mean is that the evidence says that no matter how much you care about getting it right, it will take several years to get this kind of system right. That was Google's and everyone else's experience with just Paxos, let alone a larger system that also involves time synchronization, transaction protocols, etc.

I don't think it was a matter of "ooops, we just didn't care enough". There's no way to make this kind of thing where it comes out of the oven perfect the first time. There just isn't.

I agree that using TLA+ or the like from the very beginning would probably help. I also found the "rule based development" paper from the RAMCloud folks pretty convincing, but I haven't tried to put it into practice.

"it will take several years to get this kind of system right."

That's definitely true.

"I don't think it was a matter of 'ooops, we just didn't care enough'."

In the article, they said that's what happened. Little to no attention paid to the problem. No QA person. Problems mounted. I don't know why people keep speculating on causes when the article itself said it was negligence they're correcting. That's also why I'm countering all comments to the contrary.

Re the RAMCloud paper: I might have missed it. Will look it up. Thanks.

It's pretty common for products like this to first make the implementation work correctly on a single node, then scale it out to correctness on several nodes. Sounds from the blog post like they're midway through that process. It's still in beta, so it all seems pretty reasonable to me.

> For performance, many companies will throw cash at the problem or just accept it being slower.

Yes, fine, but in that case you can ignore CockroachDB entirely and keep using Postgres. The point of the project is to handle cases where Postgres has performance problems.

To be fair, once you have self-healing clusters, having a node disappear or crash is not a big deal for the availability of the entire cluster.

Also the blog post says nothing about integrity. It does not seem that integrity is affected.

You're hitting the nail on the head here. This is a cattle-not-pets type of database, with different dynamics than the relational model being described here.

We're not talking cattle, pets, and so on. We're talking software that does a combo of performance, integrity, and availability at multiple locations for a database. A lack of QA can wreak all kinds of havoc. I'm not even sure such a system is safe with bullet-proof, 3rd-party clustering if the core algorithms lacked the QA.

I'll reconsider my opinion if you link to a reliable source showing that failed logic in software continues to work so long as an extra node runs the same failed logic when the first one crashes or corrupts data due to faulty logic or hardware. Maybe redundant, incorrect computations add up to success. Like multiplying two negatives gives a positive.

We're talking about an in-development distributed database. Obviously it's nowhere near ready for production, but that's not to say it should be compared directly with the single-node reliability requirements of a relational monolith. Sure, the software should be capable, and obviously isn't yet. But the infrastructure it runs on will have a lower bar, and hence the software will need to tolerate more failures. In that sense, instability is a challenge they need to overcome in order to succeed, but the comparisons with scale-up databases just don't make sense here.

A combo of a memory-safe language, a restricted subset of it to ease analysis, and design-by-contract will knock out many problems at little cost. Likewise, the Cleanroom methodology used to do the same with regular languages. There are finance companies developing crash-free, fast software in Haskell and OCaml. One developer put together IRONSIDES, a DNS server immune to single-packet crashes, using just SPARK Ada. Finally, SQLite shows how rugged a database can be just by integrating rigorous testing into its design, run every time they change something.
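For the design-by-contract idea, here's a tiny sketch in Python (function and invariants invented for illustration; SPARK Ada proves such contracts statically rather than checking them at runtime):

```python
import bisect

def insert_sorted(xs, v):
    # Precondition: the input list must already be sorted.
    assert all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1)), "precondition"
    out = xs[:]
    bisect.insort(out, v)
    # Postconditions: output is sorted, one element longer, and contains v.
    assert all(out[i] <= out[i + 1] for i in range(len(out) - 1)), "postcondition"
    assert len(out) == len(xs) + 1 and v in out, "postcondition"
    return out
```

A violated contract fails loudly at the call site instead of silently propagating bad state, which is the same payoff the heavier methods buy with more rigor.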

All of this indicates many problems they're having could've been avoided with some different methodology. I don't need it, though, because they said they were ignoring stability instead. You keep forgetting that part in your comments. I forgive lots of inadvertent failures but intentionally ignoring the QA in mission-critical software deserves harsh comments. ;)

[cockroach labs engineer here] Where are you getting the "ignoring stability" part? Or that we "skipped correctness and stability" (in another comment of yours)?

Correctness and robustness are the main factors behind the design and implementation choices we make. A significant part of the overall engineering effort has gone to stability for a long time. We're now pushing that closer to 90% of the effort for a while.

None of this is unexpected in the development of a complex system, especially when there are many factors involved in deciding how much to focus on various aspects. I have worked on a few unrelated systems which turned out to be stable and successful when released and - at a comparable stage in their development - they were much less stable. So personally, I am very optimistic about CockroachDB.

It's implied in the article. It mentions many mounting problems with stability, including that nobody was dedicated to working on it. That implies a lack of QA. I based my claims on the article's. If those claims were misstated, then any of mine drawing on them won't apply, of course.

Just reads like a lack of QA in general with most correctness effort focused on protocol design itself.

It really boggles my mind that you're referring to this as mission-critical software when it's an in-development product in its infancy.

Surely you're trolling?

It's intended to eventually be mission-critical software. So, you design it for verification from the beginning. You don't have to do all the QA at once. You just do a little plus leave room in how it's structured and coded to do more easily later on. This isn't trolling: it's standard practice in safety- and security-critical development. Also used by teams outside those fields that just want their stuff to work reliably & be easy to maintain.

Another principle from high-assurance security is that it's usually impossible to retrofit high robustness into a product after the fact. It has to be baked in with each decision you make. Interestingly, the author cites the fact that correctness is usually impossible to retrofit and should be there from the beginning. So, they already know this. ;)

"once you have self-healing clusters having a node disappear / crash is not a big deal for availability of the entire cluster."

I was assuming the self-healing clusters in their setup would operate using code not produced with stability in mind.

"Also the blog post says nothing about integrity. It does not seem that integrity is affected."

I was assuming that the code that maintained integrity had to run correctly and stably in order to maintain integrity.

As long as it crashes rather than breaking integrity, the integrity-maintaining code doesn't necessarily have to be stable.
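A minimal sketch of that fail-fast idea in Python (a generic, invented example, not CockroachDB code): stage a write, check the invariant, and crash rather than commit a state that breaks it.

```python
class Ledger:
    """Toy store whose invariant is that updates conserve the total."""

    def __init__(self):
        self.balances = {"a": 100, "b": 100}

    def apply(self, deltas):
        # Stage the update on a copy so a failed check leaves state intact.
        staged = dict(self.balances)
        for acct, change in deltas.items():
            staged[acct] = staged.get(acct, 0) + change
        # Fail fast: refuse to commit anything that breaks the invariant.
        if sum(staged.values()) != sum(self.balances.values()):
            raise AssertionError("would break integrity; crashing instead")
        self.balances = staged
```

The crash is only safe because committed state is never touched before the check passes; that property has to be designed in deliberately.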

You can't know it will do that unless you designed it to. The high-assurance kernels of the Orange Book era often used that strategy, where they'd run correctly or fail safe. They had strong assurance they would, in terms of design, specs, proofs, tests, etc. What you said, applied to CockroachDB minus a good subset of such practices, is "We hope it will crash instead of affecting integrity." Different ballpark entirely.

EDIT to add: Recall this played out in filesystems and regular databases where software errors could corrupt things. Now, just imagine same thing for distributed programming. Same or worse results from defects as always.

Reading their FAQs I don't think they understand how consistency works, or the CAP theorem. Maybe I'm misunderstanding, but it seems they are replicating what Couchbase does already (and has been doing in production for a long time) but thinking that they can somehow avoid the "issues" of "stale" reads in the face of partitions. But nothing indicates they can do so any better than Couchbase, so I'm not sure why CockroachDB exists.

One of us is missing something for sure, maybe me.

I know they took some inspiration from Google's F1 RDBMS:


Currently my gold standard for these things. Apple acquired and shut down FoundationDB, which appeared awesome too. There have been at least two that combined high performance with stronger consistency. CockroachDB's people indicated in prior conversations that they're trying to do something similar to Google's but without the GPS reliance. They've been blogging about their techniques for doing so, with those posts hitting HN often. I suggest reading them to evaluate whether you think their methods make sense.

Underneath all that, the code has to be written with correctness and stability in mind from the beginning to end. That's the part they skipped. It's on their blog, too.

> Reading their FAQs I don't think they understand how consistency works, or the CAP theorem.

I'm going to try to avoid being rude, but it's very clear you just haven't done more than a cursory glance at their material and don't even know the broad shape of what you're disparaging.

Cockroach follows the same general pattern as Spanner. The database is sharded into independent subranges, or spans. Each span is hosted by a group of replicas using Raft consensus. Transactions use a lock-free two-phase protocol atop the sea of Raft replica sets.

This general pattern is a nice design that I expect to become increasingly common over the next decade. The consensus groups are fault-tolerant, so the transaction protocol avoids the classic problem of distributed two-phase commit not tolerating failure of the coordinator. It allows horizontal scalability for transactions that don't interfere with each other, and preserves correctness and consistency for those that do via the transaction protocol. Read the Spanner paper for more about this general idea.
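As a rough illustration of that shape (toy Python with invented names; no real Raft, locking, or timestamps): route each key to the span that owns it, back each span with its own replica group, and commit multi-span transactions in two phases.

```python
import bisect

class Span:
    def __init__(self, start_key, replicas):
        self.start_key = start_key  # inclusive lower bound of the key range
        self.replicas = replicas    # node IDs forming this span's Raft group
        self.data = {}              # committed key/value state

class Cluster:
    def __init__(self, split_points, replica_sets):
        # split_points ["a", "m"] yields spans ["a", "m") and ["m", +inf).
        self.spans = [Span(k, r) for k, r in zip(split_points, replica_sets)]

    def span_for(self, key):
        starts = [s.start_key for s in self.spans]
        return self.spans[bisect.bisect_right(starts, key) - 1]

    def commit(self, writes):
        # Phase 1: stage every write on the span that owns its key.
        staged = {}
        for k, v in writes.items():
            span = self.span_for(k)
            staged.setdefault(id(span), (span, {}))[1][k] = v
        # Phase 2: apply everywhere. A real system replicates both phases
        # through each span's Raft group and records a coordinator entry.
        for span, kv in staged.values():
            span.data.update(kv)

cluster = Cluster(["a", "m"], [["n1", "n2", "n3"], ["n4", "n5", "n6"]])
cluster.commit({"apple": 1, "zebra": 2})  # touches two replica groups
```

All the interesting machinery (leases, write intents, timestamps) is omitted; this only shows the sharding-plus-consensus-groups layout.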

This is a state-of-the-art design that you would not be capable of producing without reading the majority of the distributed-systems literature published in the last two decades.

They know what CAP is.

The design is nothing like Couchbase. Systems like Spanner and Cockroach are capable of full serializable snapshot isolation. They provide point-in-time snapshot reads across the entire distributed system, including points in time in the past (it's explicitly a multi-version database). These systems can provide full external consistency. Again, this is nothing like Couchbase. Couchbase has its advantages, and some applications can tolerate the lack of consistent reads and only single-document atomicity.

Read https://github.com/cockroachdb/cockroach/blob/master/docs/de... to pick up what you missed about cockroach specifically.

I've read most of their design documents and commentaries since the very start, and it's clear they've read the bulk of the distributed-systems literature, including some of the most novel papers published in the last few years. Here are two papers that they've referenced as inspiring/informing their design that I particularly like:



It looks like you were already somewhat aware that you hadn't understood what the design is or what its properties are. I'd encourage you to check into things a bit more and be sure you really do know what you think you know before making disparaging remarks.

Nice write-up. It's good they use Raft, given the formal methods folks are doing a ton of work on verified implementations. A recent example uses Verdi, where the implementation is verified and performance-competitive:


Likewise with two-phase commit. Lots of analysis out there. Building on such proven protocol designs will help them in the long run get where they want to be.

"running under continuous load for two solid weeks" is merely the beginning. You should really abuse it meanwhile. For example, every minute, shoot the current leader (master or whatever you call it) while under load. Watch another node take over. Then restart the killed node (maybe from scratch). For this type of product, a monkey


is a good idea, as it will generate new types of torture you never considered.

What I will give your team is credit for owning up to this mess. Many would try to hide it. I hope you all resolve it in the near future so you succeed later on.

Does anyone actually use this?

But not for 2 solid weeks!
