And they weren't focusing on that much at all until now? Really shakes the confidence that the end result of this project will be trustworthy at all. Good they're shifting their focus. Who knows if it will be too late given how much code is in there already.
It's certainly hard. There are worked examples of getting the job done, though. All it took was focus and effort on QA. I believe the distributed nature will cause more difficulties than the older systems had. Yet, same remedy. Even more so given it's so much easier to fail.
"I like the transparency and commitment from the team"
I liked that a lot, as I saw it in the post and HN comments. It gave me the impression of a database being designed for correctness, speed, scale, consistency, and reliability. Then, this post says a few got dropped a while back. That's a lack of transparency and commitment on a key issue. Or a resurgence of it. Hope that problem goes away.
They're working on a state of the art design that is far more ambitious than the bulk of the NoSQL world. What they're doing is extremely difficult and involves integrating multiple complex distributed protocols. It's not so much that integrity and availability weren't focused on, it's that they're working with things that take enormous time and effort to fully debug.
You literally just contradicted yourself. You didn't mean to but you did. What you said is that their software requirements and challenges are enormous. It's going to be hard to pull off the theory and implementation. There is room for huge problems in protocol, custom code, libraries used, and OS interactions. Preventing tons of debugging requires QA to be turned up in these situations. Maybe even add protocol analysis like Amazon does with TLA+ on top of integration/fuzz/unit tests and language-level analysis.
Then, you said they were too focused on making it work to do that part. The part that was a prerequisite of making it work. As they're now seeing.
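To make the "fuzz/unit tests" part concrete, here's a rough sketch of randomized model-based testing: throw random operations at the thing under test and at a trivially correct model, and stop at the first disagreement. Everything here (`DictStore`, `LeakyStore`, `fuzz`) is made up for illustration and has nothing to do with their actual test suite.

```python
import random


class DictStore:
    """Toy key-value store standing in for the system under test."""

    def __init__(self):
        self.d = {}

    def put(self, k, v):
        self.d[k] = v

    def delete(self, k):
        self.d.pop(k, None)

    def get(self, k):
        return self.d.get(k)


class LeakyStore(DictStore):
    """Deliberately buggy variant: deletes are silently dropped."""

    def delete(self, k):
        pass


def fuzz(store, seed, steps=500):
    """Apply random ops to `store` and to a plain dict used as the model.

    Returns the step index of the first read that disagrees with the
    model, or None if no disagreement was seen.
    """
    rng = random.Random(seed)  # seeded so a failure can be replayed
    model = {}
    for step in range(steps):
        op = rng.choice(["put", "delete", "get"])
        key = rng.choice("abc")  # tiny key space so ops collide often
        if op == "put":
            value = rng.randrange(100)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        elif store.get(key) != model.get(key):
            return step
    return None
```

Running `fuzz(DictStore(), seed=1)` should come back clean, while `fuzz(LeakyStore(), seed=1)` should report a failing step almost immediately. A real distributed system needs fault injection on top of this, but the idea is the same.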
I don't think it was a matter of "ooops, we just didn't care enough". There's no way to make this kind of thing where it comes out of the oven perfect the first time. There just isn't.
I agree that using TLA+ or the like from the very beginning would probably help. I also found the "rule based development" paper from the RAMCloud folks pretty convincing, but I haven't tried to put it into practice.
That's definitely true.
"I don't think it was a matter of "ooops, we just didn't care enough"."
In the article, they said that's what happened. Little to no attention paid to the problem. No QA person. Problems mounted. I don't know why people keep speculating on causes when the article itself said it was negligence they're correcting. That's also why I'm countering all comments to the contrary.
Re RAMCloud paper. I might have missed it. Will look it up. Thanks.
Yes, fine, but in that case you can ignore CockroachDB entirely and keep using Postgres. The point of the project is to handle cases where Postgres has performance problems.
Also the blog post says nothing about integrity. It does not seem that integrity is affected.
I'll reconsider my opinion if you link to a reliable source showing that failed logic in software continues to work so long as an extra node runs the same, failed logic when the first one crashes or corrupts data due to failed logic or hardware. Maybe redundant, incorrect computations add up to success. Like multiplying two negatives gives a positive.
All of this indicates many problems they're having could've been avoided with some different methodology. I don't need it, though, because they said they were ignoring stability instead. You keep forgetting that part in your comments. I forgive lots of inadvertent failures but intentionally ignoring the QA in mission-critical software deserves harsh comments. ;)
Correctness and robustness are the main factors behind the design and implementation choices we make. A significant part of the overall engineering effort has been on stability for a long time. We're now making it closer to 90% for a while.
None of this is unexpected in the development of a complex system, especially when there are many factors involved in deciding how much to focus on various aspects. I have worked on a few unrelated systems which turned out to be stable and successful when released and - at a comparable stage in their development - they were much less stable. So personally, I am very optimistic about CockroachDB.
Just reads like a lack of QA in general with most correctness effort focused on protocol design itself.
Surely you're trolling?
Another principle from high-assurance security is that it's usually impossible to retrofit high robustness into a product after the fact. It has to be baked in with each decision you make. Interestingly, the author cites the fact that correctness is usually impossible to retrofit and should be in from the beginning. So, they already know this. ;)
I was assuming the self-healing clusters in their setup would operate using code not produced with stability in mind.
"Also the blog post says nothing about integrity. It does not seem that integrity is affected."
I was assuming that the code that maintained integrity had to run correctly and stably in order to maintain integrity.
EDIT to add: Recall this played out in filesystems and regular databases where software errors could corrupt things. Now, just imagine the same thing for distributed programming. Same or worse results from defects as always.
One of us is missing something for sure, maybe me.
Currently my gold standard for these things. Apple acquired and shut down FoundationDB, which appeared awesome too. There have been at least two that combined high performance with higher consistency. CockroachDB's people indicated in prior conversations they're trying to do something similar to Google's but without the GPS reliance. They've been blogging about their techniques for doing so, with the posts hitting HN often. I suggest reading those to evaluate whether you think their methods make sense.
Underneath all that, the code has to be written with correctness and stability in mind from the beginning to end. That's the part they skipped. It's on their blog, too.
I'm going to try to avoid being rude, but it's very clear you just haven't done more than a cursory glance at their material and don't even know the broad shape of what you're disparaging.
Cockroach follows the same general pattern as Spanner. The database is sharded into independent subranges or spans. Each span is hosted by a group of replicas using Raft consensus. Transactions use a lock-free two-phase protocol atop the sea of Raft replica sets.
This general pattern is a nice design that I expect to become increasingly common over the next decade. The consensus groups are fault tolerant, so the transaction protocol avoids the classic problem with distributed two-phase commit not tolerating failure of the coordinator. It allows horizontal scalability for transactions that don't interfere with each other, and preserves correctness and consistency for those that do via the transaction protocol. Read the Spanner paper for more about this general idea.
This is a state of the art design that you would not be capable of producing without reading the majority of the distributed systems literature published in the last 2 decades.
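If it helps to see the shape of it, here's a toy sketch of that layering: a key space split into ranges, each range owned by a replica group, and a two-phase write across whichever groups a transaction touches. Big caveats: `ReplicaGroup`, `Cluster`, `transact` and `split_keys` are names I made up, the "consensus" is just a majority-alive check standing in for Raft, and the commit logic is plain textbook two-phase commit rather than Cockroach's actual lock-free protocol.

```python
import bisect


class ReplicaGroup:
    """Stand-in for a Raft group: it accepts work only while a
    majority of its replicas are up. No real consensus happens here."""

    def __init__(self, n=3):
        self.up = [True] * n
        self.data = {}    # committed key -> value
        self.staged = {}  # txn_id -> {key: value}, not yet visible

    def has_quorum(self):
        return sum(self.up) > len(self.up) // 2

    def prepare(self, txn_id, key, value):
        if not self.has_quorum():
            return False
        self.staged.setdefault(txn_id, {})[key] = value
        return True

    def commit(self, txn_id):
        self.data.update(self.staged.pop(txn_id, {}))

    def abort(self, txn_id):
        self.staged.pop(txn_id, None)


class Cluster:
    def __init__(self, split_keys):
        # With split keys [s0, s1, ...], range 0 holds keys < s0,
        # range 1 holds s0 <= key < s1, and so on.
        self.split_keys = sorted(split_keys)
        self.groups = [ReplicaGroup() for _ in range(len(self.split_keys) + 1)]

    def group_for(self, key):
        return self.groups[bisect.bisect_right(self.split_keys, key)]

    def transact(self, txn_id, writes):
        """Phase 1: stage every write on its range's group.
        Phase 2: commit everywhere, or abort everywhere if any group refused."""
        touched = []
        for key, value in writes.items():
            group = self.group_for(key)
            if not group.prepare(txn_id, key, value):
                for g in touched:
                    g.abort(txn_id)
                return False
            if group not in touched:
                touched.append(group)
        for g in touched:
            g.commit(txn_id)
        return True
```

With `Cluster(["m"])`, a transaction writing `"apple"` and `"zebra"` spans two groups. Knock out one replica in a group and it still commits; knock out two and the whole transaction aborts, leaving the other range untouched. Transactions that touch disjoint ranges never talk to the same group, which is where the horizontal scaling comes from.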
They know what CAP is.
The design is nothing like Couchbase. Systems like Spanner and Cockroach are capable of full Serializable Snapshot Isolation. They provide point-in-time snapshot reads across the entire distributed system, including points in time in the past (it's explicitly a multi-version database). These systems can provide full external consistency. Again, this is nothing like Couchbase. Couchbase has its advantages, and some applications can tolerate the lack of consistent reads and only single-document atomicity.
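For anyone unsure what "multi-version" buys you, here's a minimal single-node sketch of reading as of a timestamp. `MVCCStore` and its methods are hypothetical names of mine, timestamps are plain integers, and this ignores everything hard (distribution, clock uncertainty, garbage collection of old versions).

```python
import bisect


class MVCCStore:
    """Keeps every version of every key, ordered by write timestamp."""

    def __init__(self):
        # key -> (sorted list of timestamps, parallel list of values)
        self.versions = {}

    def put(self, key, ts, value):
        stamps, values = self.versions.setdefault(key, ([], []))
        i = bisect.bisect_right(stamps, ts)
        stamps.insert(i, ts)
        values.insert(i, value)

    def get(self, key, ts):
        """Return the newest value written at or before `ts`, else None."""
        stamps, values = self.versions.get(key, ([], []))
        i = bisect.bisect_right(stamps, ts)
        return values[i - 1] if i else None
```

After `put("x", 10, "a")` and `put("x", 20, "b")`, a read at timestamp 15 sees `"a"` and a read at 25 sees `"b"`. A reader pinned to one timestamp gets a stable view no matter what later writers do, which is the basis for snapshot reads in the past.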
Read https://github.com/cockroachdb/cockroach/blob/master/docs/de... to pick up what you missed about cockroach specifically.
I've read most of their design documents and commentaries since the very start and it's clear they've read the bulk of the distributed systems literature, including some of the most novel papers to be published in the last few years. Here are two papers that they've referenced as inspiring/informing their design that I particularly like:
It looks like you were already somewhat aware that you'd not understood what the design is or what its properties are. I'd encourage you to check into things a bit more and be sure you really do know what you think you know before making disparaging remarks.
Likewise with two-phase commit. Lots of analysis out there. Building on such proven protocol designs will help them out in the long run to get where they want to be.