

The Grasshopper Outage: Co-Founders Response - dh
http://grasshopper.com/blog/company/2011/06/09/the-grasshopper-outage-co-founders-response/

======
dexen
_> a simultaneous 2 disk failure in a storage array (...) is a very unusual
event_

I'd disagree. If you have the same model and same production batch of hard
drives in a RAID, the chance of simultaneous failure is elevated. Depending on
RAID level and workload, it's quite possible two or more drives are loaded
(accessed) in exactly the same way. If, again, they are from the same
production batch, chances are they have similar production defects in
mechanics or semiconductors, so it makes sense they'd fail almost
simultaneously.

~~~
gtuhl
Agreed, hard drives fail constantly. I've had as many as 3 dead at a time
while waiting for hot spares to rebuild in arrays as small as 24 drives. Dual
drive failures have happened at least twice that I can remember off the top of
my head in the last 3-4 years.

I usually run raid10 for performance reasons, but when I don't need fast
writes I use raid6 instead of raid5 for anything beyond 8 drives.
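Roughly why (a hypothetical sketch, with an assumed per-drive failure rate and
rebuild window rather than real numbers): with raid5, one more failure during
the rebuild loses the array, and the exposure grows with drive count; raid6
survives that one extra failure.

    # Probability that enough *additional* drives die during the rebuild
    # window (after the first failure) to lose the array. Illustrative only.
    from math import comb

    P_DAY = 0.0005          # assumed chance one drive fails on a given day
    REBUILD_DAYS = 1.0      # assumed rebuild window after the first failure

    def p_extra_failures(n_remaining, k):
        """P(at least k of n_remaining drives fail during the rebuild window)."""
        p = P_DAY * REBUILD_DAYS
        return sum(comb(n_remaining, i) * p**i * (1 - p)**(n_remaining - i)
                   for i in range(k, n_remaining + 1))

    for drives in (6, 8, 12, 16, 24):
        raid5_loss = p_extra_failures(drives - 1, 1)   # one more failure kills raid5
        raid6_loss = p_extra_failures(drives - 1, 2)   # raid6 survives one more
        print(f"{drives:2d} drives  raid5 {raid5_loss:.2e}  raid6 {raid6_loss:.2e}")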

------
gemenon
Having a DR site and not testing that failover works is the big issue I see
here. You can check your theory all day long, but if you never actually
exercise it to make sure it works, then you might as well not have a DR site.
It's a similar case to taking backups without ever verifying you could restore
what you need in the event of a loss. While I'm surprised that actually
testing DR capability wasn't listed as something to work on, this sort of open
write-up is very valuable to both customers and the rest of us as engineers.
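The backup check can even be scripted. A minimal sketch of the "actually
restore it" test (the restore tool, backup id, paths, and manifest format here
are all hypothetical placeholders, not anything Grasshopper described):

    # Restore last night's backup into a scratch directory and verify every
    # file against the checksum manifest recorded at backup time.
    import hashlib
    import subprocess
    from pathlib import Path

    BACKUP_ID = "nightly-latest"                       # hypothetical backup label
    SCRATCH = Path("/tmp/restore-test")                # throwaway restore target
    MANIFEST = Path("/backups/nightly-latest.sha256")  # "<hash>  <relpath>" lines

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # 1. Restore into the scratch dir (replace with your real restore tool).
    subprocess.run(["restore-tool", "--id", BACKUP_ID, "--to", str(SCRATCH)],
                   check=True)

    # 2. Re-hash every restored file and compare to the manifest.
    bad = []
    for line in MANIFEST.read_text().splitlines():
        want, relpath = line.split(maxsplit=1)
        restored = SCRATCH / relpath
        if not restored.is_file() or sha256(restored) != want:
            bad.append(relpath)

    print("restore verification:", "OK" if not bad else f"FAILED: {len(bad)} files")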

~~~
gavingmiller
I really liked Netflix's approach to testing backup scenarios by creating the
Chaos Monkey[1] (Jeff Atwood had good insight as well[2]). The idea of
creating a mechanism to test failover scenarios is something that I wouldn't
have thought of prior to the transparency that companies like Grasshopper have
shown. So hat tip to them for opening up about a large failure; it makes the
rest of the dev community smarter/better because of their honesty.

[1]: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html

[2]: http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
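The core mechanism is small enough to sketch. Something in this spirit (toy
version, against a non-production pool; the host names, service name, and kill
command are placeholders, not Netflix's actual implementation):

    # Toy chaos-monkey: on a schedule, pick one instance from a test pool
    # and kill its service, so the failover path gets exercised routinely
    # instead of only during a real outage.
    import random
    import subprocess
    import time

    CANDIDATES = ["app-01.test", "app-02.test", "app-03.test"]  # hypothetical pool
    KILL_CMD = ["ssh", "{host}", "sudo", "systemctl", "kill", "myservice"]
    INTERVAL_S = 3600  # strike once an hour

    def strike_once():
        victim = random.choice(CANDIDATES)
        cmd = [part.format(host=victim) for part in KILL_CMD]
        print("chaos: killing service on", victim)
        subprocess.run(cmd, check=False)   # failover monitoring should notice

    while True:
        strike_once()
        time.sleep(INTERVAL_S)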

------
mike_mccracken
This is a typical case of poor administration at a company blaming a vendor
for their problems:

1. Yes, any storage array should work to first protect the integrity of the data.

2. Yes, a good DR solution is only as good as the testing and tuning that the
administrators have done to ensure it is working properly.

Here are some questions:

- How many spare drives were in the system when the first and second drive
failed? NetApp does not shut down storage volumes if spares are in the system
to take over for the failed drives.

- How long was it really before those drives were replaced with a spare that
could take over and rebuild?

- Why don't you publish your DR plan and explain exactly where it didn't go as
tested and planned? Point out where the issue occurred in a step that had
previously been tested and shown to work properly.

I am not an employee of NetApp and not even a customer of NetApp. I have used
it in the past and like all technologies it requires the care and feeding that
is well documented in manuals they provide. And it requires good
administrators to do their jobs to test and monitor things and ensure the
resources are available for the system to work as designed.

Once you have heard both sides of the story, the only thing you learn is that
there are more than two sides to the story.

When these types of things happen the folks closest to them always leave out
details or cannibalize the story so that they can be found blameless. I have
been in the IT field for 15 years and seen it time and time again (with many
technologies).

I read an email from Grasshopper this AM detailing to their customers what the
issue was. It was so vague and left so much open to interpretation that it
really came across as whiny and misinformed. It was extremely unprofessional
to apologize and then assign blame (without full explanation or root cause).

There's more to this than meets the eye. Believe me.

Since you all are replacing NetApp, I would suggest paying for a full time
engineer from the next storage company you buy from. They can manage the array
for you (properly) and ensure these things don't happen. Otherwise you'll need
your sysadmins to start reading product documentation, following best
practices, and testing procedures.

~~~
dh
If you want to talk about any of the details, I am happy to discuss. I never
said that the array went offline because of the failure, but the head was
under very heavy load trying to recover from it.

The email is a careful balance: enough information that 90% of people will
find it useful, but not so much that no one understands it. Never once did we
say we are not to blame; actually the opposite: it is our responsibility no
matter the vendor or what we replace the hardware with.

------
WestCoastJustin
Updated again: disclaimer -- shooting from the hip here. I know how much it
sucks to have people pissed at you over IT-related issues!! I have zero
knowledge of this company and have only read about them today.

Basically, this just boils down to a time issue. No data was lost; it just
took time for things to be rosy again. Restoring from backups or switching to
a DR site takes time too. If you have never fully tested the DR site, it might
take a long time.

It is easy to be Capt. Obvious ("Well, there's your problem right there") in
cases like these, but it just sounds like they need better documentation about
what to do in the event something like this happens.

Forgetting about the DR site for a minute: why is the NetApp a single point of
failure?? Granted, they are extremely stable, and running multiple heads
reduces the risk further, but if a single issue with one array causes massive
downtime then you might want to think about a SnapMirror to a second filer.
Switching to a different storage vendor doesn't sound like it will fix the
underlying issue here!!

~~~
dh
We have more redundancy than that with NetApp; there are multiple filer heads
and SnapMirror relationships. SnapMirror was part of the problem, along with
the process it runs to make sure all data is correct at all endpoints.

~~~
WestCoastJustin
Ah, ok, much more complex than anything I've touched. Hope it all works out!

------
rpug
We are actually planning to buy some NetApp equipment and have heard nothing
but good things.

I am very curious as to why a two disk failure caused an outage. What exactly
happened when both disks failed?

~~~
dh
We have used NetApp for years and decided to move to Pillar Data Systems, as
they are much more forward-thinking, easier to work with, and understand
storage systems at a very deep level. NetApp wants you to buy new equipment
every few years and forces this by increasing support costs very quickly.

The 2-disk failure did not cause the outage; rather, it was the process the
filer head had to go through to get the data back onto new drives, plus
further actions taken with SnapMirror and other items to try to recover
faster.

~~~
spoold
The way you've worded this, and the official response, makes it read like you
didn't have hot spares available?

If that's the case, the two disk failures didn't need to be concurrent for you
to end up where you did...

~~~
dh
We run RAID-DP and the array always has a hot spare, and we had one cold spare
on-site and another there within hours to replace the 2nd failure.

