They're working on a state of the art design that is far more ambitious than the bulk of the NoSQL world. What they're doing is extremely difficult and involves integrating multiple complex distributed protocols. It's not so much that integrity and availability weren't focus'd on, it's that they're working with things that take enormous time and effort to fully debug.
You literally just contradicted yourself. You didn't mean to but you did. What you said is their software requirements and challenge are enormous. It's going to be hard to pull of the theory and implementation. There is room for huge problems in protocol, custom code, libraries used, and OS interactions. Preventing tons of debugging requires QA to be turned up in these situations. Maybe even add protocol analysis like Amazon does with TLA+ on top of integration/fuzz/unit tests and language-level analysis.
Then, you said they were too focused on making it work to do that part. The part that was a prerequisite of making it work. As they're now seeing.
I don't think it was a matter of "ooops, we just didn't care enough". There's no way to make this kind of thing where it comes out of the oven perfect the first time. There just isn't.
I agree that using TLA+ or the like from the very beginning would probably help. I also found the "rule based development" paper from the RAMCloud folks pretty convincing, but I haven't tried to put it into practice.
That's definitely true.
I don't think it was a matter of "ooops, we just didn't care enough"."
In the article, they said that's whst happen. Little to no attention paid to problem. No QA person. Problems mounted. I don't why people keep speculating on causes when article itself said it was negligence they're correcting. That's also why Im countering all comments to the contrary.
Re RAMcloud paper. I might have missed it. Will look it up. Thanks.