

Why you need STONITH - morphics
http://advogato.org/person/lmb/diary/105.html

======
lucian1900
Sadly, even simple concepts like STONITH are hard to get right. I believe it
was GitHub that had an outage in which both db nodes shot each other. Their
network was extremely slow because of some fault (which also caused the
initial problem), so both nodes received the STONITH message from the other at
about the same time, long after each had timed out waiting for a response.

Distributed systems are hard.
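
A minimal sketch of that race, as a toy model rather than a reconstruction of any real incident. The function name, the single shared `latency` value, and the rule that a node fires its kill message when its own timeout expires (unless it has already been shot) are all assumptions made for illustration:

```python
def stonith_race(timeout_a, timeout_b, latency):
    """Return the set of nodes that end up shot in a two-node cluster.

    Assumptions (hypothetical model): each node sends a kill message when
    its own timeout expires, the message takes `latency` time units to
    arrive, and a message already in flight still lands even if its
    sender is shot afterwards.
    """
    # The node whose timer fires first always gets its message out.
    if timeout_a <= timeout_b:
        first, second = "A", "B"
    else:
        first, second = "B", "A"
    t_first, t_second = sorted((timeout_a, timeout_b))

    shot = {second}  # the first sender's kill always reaches its peer
    # The second node also fires if its timer expires before that kill arrives.
    if t_second < t_first + latency:
        shot.add(first)
    return shot


print(stonith_race(10, 10, 5))   # identical timeouts: both nodes fire
print(stonith_race(10, 20, 5))   # fast network, staggered timeouts: only B is shot
print(stonith_race(10, 20, 30))  # slow network: the stagger no longer helps
```

In this model a stagger between the two timeouts only protects you while it is larger than the network latency, which is exactly what a very slow network takes away.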

~~~
derekp7
One thing to keep in mind is that clustering will never make your app 100%
available. The best it can do is add one "9" to your uptime (so if you are
already at 99.9 percent, it can get you to 99.99 percent uptime, etc). But one
thing that it should NEVER do is corrupt your data. If you have a service
outage, yet your data was kept safe, then the cluster still did its job.
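
To put numbers on what one extra "9" buys, here is a small arithmetic sketch. The helper name and the 365-day year are assumptions; the two percentages are the ones from the comment above:

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability_percent):
    """Expected downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime -> {hours:.2f} hours ({hours * 60:.0f} minutes) of downtime per year")
```

That works out to roughly 8.8 hours a year at 99.9 percent versus a bit under an hour at 99.99 percent.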

~~~
MichaelGG
Is this for scenarios where there's no shared storage medium, so IO fencing
isn't a solution? I can see for a DRBD solution how you need some real way of
ensuring only one node is up. It just feels like STONITH is a really ugly hack
and would be better solved via other quorum solutions, even if it means adding
a witness system.
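
As a rough illustration of why a witness helps, here is a sketch of a strict-majority quorum check. The function name and the one-vote-per-member model are assumptions, not the API of any particular cluster manager:

```python
def has_quorum(votes_seen, total_votes):
    """True if this partition holds a strict majority of all votes."""
    return votes_seen > total_votes // 2


# Two nodes, no witness: after a split each side sees only its own vote,
# so neither side has a majority and neither may safely continue.
print(has_quorum(1, 2))  # False

# Two nodes plus a witness (three votes): the side that can still reach
# the witness has 2 of 3 votes, the isolated side has 1 of 3.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
```

Note that quorum only tells the minority side it should stop; it does not by itself guarantee that a hung or misbehaving node actually has stopped.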

~~~
derekp7
Normally when you have shared storage, you can use that shared storage as a
quorum device (i.e., via an exclusive SCSI lock). If you have something like
DRBD, then you still have the issue where the nodes can't see each other, but
an outside application writing to the database served up by the cluster can
see both nodes -- and if each node wants to bring up its shared IP address,
some writes will go to one node, some to the other. Then you have the database
on each node not having all the current data (even if it isn't technically
"corrupted").
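
A toy sketch of that divergence, assuming (purely for illustration) that both nodes have claimed the shared IP and that client writes alternate between them. The function name and the in-memory dicts standing in for each node's database are hypothetical:

```python
def split_brain_writes(writes):
    """Route writes alternately to two nodes that both think they are primary.

    Returns the resulting contents of each node's private copy of the data.
    """
    node_a, node_b = {}, {}
    for i, (key, value) in enumerate(writes):
        # Assumption: which node answers on the shared IP is effectively
        # arbitrary; alternating is just the simplest way to model that.
        target = node_a if i % 2 == 0 else node_b
        target[key] = value
    return node_a, node_b


writes = [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
node_a, node_b = split_brain_writes(writes)

all_keys = {key for key, _ in writes}
print("missing on A:", sorted(all_keys - node_a.keys()))
print("missing on B:", sorted(all_keys - node_b.keys()))
```

Every individual record is intact, yet neither node holds the full data set, and there is no automatic way to tell which copy should win.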

------
notacoward
I was working on HA software in 1992. Specifically, I was working on the
software from which Linux-HA copied all of its terminology and basic
architecture. We ourselves were not the first, and often found ourselves
copying things done even earlier at DEC, so I'm not complaining, but I want to
make the point that this article from 2010 is actually a rehash of a much
older conversation. As cute as the metaphor is, it gets two things seriously
wrong.

(1) Fencing and STONITH are not the same thing. Fencing is shutting off access
to a shared resource (e.g. a LUN on a disk array) from another possibly
contending node. STONITH is shutting down the possibly contending node itself.
They're quite different in both implementation and operational significance.
Using the two terms as though they're interchangeable only sows confusion.

(2) You only need STONITH if you have the aforementioned possibly contending
nodes - in other words, only if the same resource can be provided by/through
either node. If the resources provided by each node are _known to be
different_, as e.g. in any of the systems derived from Dynamo, then STONITH
is not necessary.

To elaborate on that second point, the problem STONITH addresses is one of
mutual exclusion. It might not be safe for the resource to be available
through two nodes, because it could lead to inconsistency or because they
can't both do a proper job of it simultaneously. As in other contexts, mutual
exclusion is a useful primitive but often not the optimal one to use. _In
general_ it's better to avoid it by avoiding the kinds of resource sharing
that make it necessary. That's why "shared nothing" is the most common model
for such systems designed in the last decade or more, and they don't need
STONITH unless they've screwed up by not fully distributing some component
(such as a metadata server for a distributed filesystem).
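
A very small sketch of the "known to be different" idea: every key maps deterministically to exactly one owner, so two nodes never contend for the same resource. This is a simplification and an assumption on my part; it uses plain modulo hashing, whereas Dynamo-style systems use consistent hashing with replication. The `owner` function and the node names are hypothetical:

```python
import hashlib


def owner(key, nodes):
    """Deterministically pick the single node responsible for `key`."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer and map it to a node.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


nodes = ["node-1", "node-2", "node-3"]
for key in ("user:42", "user:43", "cart:7"):
    print(key, "->", owner(key, nodes))
```

Because any client computes the same owner for a given key, there is no resource that two nodes could both believe they are serving, and so nothing for mutual exclusion to protect.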

------
derekp7
For some reason, I always get a bit disappointed when I read an article about
STONITH, and it doesn't begin with a pointer to the world's funniest joke
(http://en.wikipedia.org/wiki/World's_funniest_joke).
Now I know that misplaced humor in technical documentation can go wrong
sometimes, but this is one case where I think it can help make the concept
really stick with the reader.

