
A Focus on Stability - orangechairs
https://www.cockroachlabs.com/blog/cant-run-100-node-cockroachdb-cluster/
======
Animats
_" Our goal of a 10-node cluster running under continuous load for two solid
weeks has been maddeningly elusive."_

That's a low goal for a database. I've had MySQL and MariaDB instances running
for years between reboots.

 _" We’ve decided to stabilize the master branch in isolation; new feature
development will continue in the develop branch."_

Well, yes, that's how it's done. You don't develop a usable database by
constantly churning the production code.

~~~
nickpsecurity
What's unbelievable to me is that the post even exists. The main goals of a
database are integrity and availability. For performance, many companies will
throw cash at the problem or just accept it being slower. The
integrity/availability part is non-negotiable. Even more so in a database
advertising strong consistency as an advantage.

And they weren't focusing much on that at all until now? That really shakes my
confidence that the end result of this project will be trustworthy at all.
It's good they're shifting their focus. Who knows if it will be too late given
how much code is in there already.

~~~
knz42
To be fair, once you have self-healing clusters, having a node disappear or
crash is not a big deal for the availability of the entire cluster.

Also the blog post says nothing about integrity. It does not seem that
integrity is affected.
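To make the availability arithmetic concrete: CockroachDB replicates each range via Raft-style majority quorum, so losing a node only threatens availability once a majority of replicas is gone. A rough sketch (function names are mine, for illustration only):

```python
def tolerated_failures(replicas: int) -> int:
    """Failures a majority-quorum replica group survives."""
    # A group of n replicas needs a majority (n // 2 + 1) alive,
    # so it tolerates floor((n - 1) / 2) simultaneous failures.
    return (replicas - 1) // 2

def still_available(replicas: int, failed: int) -> bool:
    """True if the group can still reach quorum after `failed` crashes."""
    return replicas - failed > replicas // 2

# A 3-way replicated range survives one crashed node, but not two.
print(tolerated_failures(3))  # 1
print(still_available(3, 1))  # True
print(still_available(3, 2))  # False
```

This is why a single node crashing is routine for such a system, while correctness bugs in the consensus logic itself are not.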

~~~
redwood
You're hitting the nail on the head here. This is a "cattle, not pets" type of
database, with different dynamics than the relational model being described
here.

~~~
nickpsecurity
We're not talking cattle, pets, and so on. We're talking about software that
does a combo of performance, integrity, and availability at multiple locations
for a database. A lack of QA can wreak all kinds of havoc. I'm not even sure
such a system is safe with bullet-proof, 3rd-party clustering if the core
algorithms lack QA.

I'll reconsider my opinion if you link to a reliable source showing that
failed logic in software continues to work so long as an extra node runs the
same failed logic when the first one crashes or corrupts data due to failed
logic or hardware. Maybe redundant, incorrect computations add up to success,
like multiplying two negatives gives a positive.

~~~
redwood
We're talking about an in-development distributed database. Obviously it's
nowhere near ready for production, but that is not to say it should be
directly compared to the single-node reliability requirements of a relational
monolith. Sure, the software should be capable, and obviously isn't yet. But
the infrastructure it runs on will have a lower bar, and hence the software
will need to tolerate more failures. In that sense, instability is a challenge
they need to overcome in order to succeed, but the comparison with scale-up
database stores just doesn't make sense here.

~~~
nickpsecurity
A combo of a memory-safe language, restricted expression of it to ease
analysis, and design-by-contract will knock out many problems with little
cost. Likewise, the Cleanroom methodology does the same with regular
languages. There are finance companies developing crash-free, fast stuff in
Haskell and OCaml. The IRONSIDES project put together a DNS server immune to
single-packet crashes using just SPARK Ada. Finally, SQLite shows how rugged a
database can be just by integrating rigorous testing into its design, run
every time they change something.
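For a concrete sense of what design-by-contract buys, here is a minimal sketch in Python; the bank-transfer example and all names are hypothetical, not from any of the projects mentioned. Preconditions reject bad inputs before they can corrupt state, and a postcondition checks an invariant after every mutation:

```python
def transfer(balances: dict, src: str, dst: str, amount: int) -> dict:
    # Preconditions: fail loudly on bad input instead of corrupting state.
    assert amount > 0, "amount must be positive"
    assert balances[src] >= amount, "insufficient funds"
    total_before = sum(balances.values())

    balances[src] -= amount
    balances[dst] += amount

    # Postcondition (invariant): a transfer conserves the total balance,
    # so a logic bug in the arithmetic above trips the check immediately.
    assert sum(balances.values()) == total_before
    return balances

accounts = transfer({"a": 100, "b": 0}, "a", "b", 30)
print(accounts)  # {'a': 70, 'b': 30}
```

Languages like SPARK Ada and Eiffel check such contracts statically or at runtime; the point is that violations surface at the buggy call site rather than as silent corruption later.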

All of this indicates that many of the problems they're having could've been
avoided with a different methodology. I don't need that argument, though,
because they said they were _ignoring stability_ instead. You keep forgetting
that part in your comments. I forgive lots of inadvertent failures, but
intentionally ignoring QA in mission-critical software deserves harsh
comments. ;)

~~~
radub
[cockroach labs engineer here] Where are you getting the "ignoring stability"
part? Or that we "skipped correctness and stability" (in another comment of
yours)?

Correctness and robustness are the main factors behind the design and
implementation choices we make. A significant part of the overall engineering
effort has been on stability for a long time. We're now pushing that share
closer to 90% for a while.

None of this is unexpected in the development of a complex system, especially
when there are many factors involved in deciding how much to focus on various
aspects. I have worked on a few unrelated systems that turned out to be stable
and successful when released, and at a comparable stage in their development
they were much less stable. So personally, I am very optimistic about
CockroachDB.

~~~
nickpsecurity
It's implied in the article. It mentions many mounting stability problems,
including that nobody was dedicated to working on them, which implies a lack
of QA. I based my claims on the article's. If those claims were misstated,
then any of mine drawing on them won't apply, of course.

It just reads like a lack of QA in general, with most correctness effort
focused on the protocol design itself.

------
nickpsecurity
What I will give your team credit for is owning up to this mess. Many would
try to hide it. I hope you all resolve it in the near future so you succeed
later on.

------
karma_vaccum123
Does anyone actually use this?

~~~
frozenport
But not for 2 solid weeks!

