
 Engine Yard experiencing serious downtimes - nickb
http://engineyard.wordpress.com/
======
lwalley
Hi,

We are very sorry for Friday's downtime on one of our clusters in Sacramento.

We take downtime seriously and we learn from each incident. Yesterday's
incident appears to be the result of a software malfunction in 2 or more of 4
switches that handle traffic between a group of servers and their SAN shelves.
The switches are from Extreme Networks, which has a good reputation, and they
are helping us debug error logs.

Organizationally, we are making changes to reduce the incidence of such
problems and to keep improving how we handle them when they do occur. We're
expanding the group that focuses on infrastructure design and management. This
group looks at ways to improve our technology (design and vendor choices) and
human processes. For instance, one of the top people in that group visited a
SAN vendor earlier this week to review their latest offerings.

If you were affected by this downtime, please contact your account manager to
discuss service credits. If you don't know your account manager, please send
an email to 'info@engineyard.com' and your message will reach them.

Thanks for being our customer, and again, sorry for the downtime. Believe me,
hearing about downtime while I'm on a week off, with my dad visiting this
weekend, is not pleasant - it actually hurts!

--- Lance Walley, CEO --- Engine Yard

------
swombat
Disclaimer: My company's an EngineYard customer.

1) In our experience, EngineYard has provided amazing customer service 100% of
the time. They helped us debug some very nasty problems with connections,
server set-ups, proxies (not EY proxies, proxies that our customers had
installed), IE download header craptasticness, etc. It's like having a team of
24/7, very knowledgeable Rails sys admins.

2) The infrastructure they provide is top notch. The database is blazing fast,
so's the SAN. Bandwidth is rock-solid. Putting together a similarly performant
setup would take weeks of our time - which we'd much rather spend on
developing our product.

3) Downtime is inevitable on any host. Anyone who tells you that their data
centre will never go down is lying. Whole data centres do go down. UPSes fail.
Software issues develop. It's part of the job. If you can't deal with it,
don't get into web start-ups. Get this into your head: shit happens. The real
question is not whether there is downtime, but how quickly they get things
back up.

In this case, they figured out the cause of the failure within an hour and
started bringing customer slices back up 20 minutes later. All customers were
back up within about 4 hours. I think that's pretty good.

4) If you really can't afford to have any downtime at all, then you set up
your site in two separate data centres, with a failover between the two (a
rough sketch follows below). If you had done that, then this issue wouldn't
have affected you, except for increasing load a little.
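
To make that concrete, here is a minimal sketch of a health-check-driven
failover monitor. Everything in it - the hostnames, the health endpoint, the
failure threshold, and the switch_dns_to() hook - is hypothetical; in
practice you'd call your DNS provider's API (with a low TTL on the record) or
use a managed failover service.

    # Minimal failover monitor sketch (Python, stdlib only).
    # Hostnames, endpoint, threshold, and switch_dns_to() are all
    # assumptions for illustration, not anything EY provides.
    import time
    import urllib.request

    PRIMARY_HEALTH_URL = "https://dc1.example.com/health"  # assumed endpoint
    FAILURES_BEFORE_FAILOVER = 3                           # assumed threshold

    def healthy(url, timeout=5):
        # True if the primary answers its health check with HTTP 200.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.getcode() == 200
        except Exception:
            return False

    def switch_dns_to(datacenter):
        # Placeholder: repoint DNS at the standby data centre here.
        print("failing over to", datacenter)

    failures = 0
    while True:
        if healthy(PRIMARY_HEALTH_URL):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                switch_dns_to("dc2")
                break
        time.sleep(10)  # probe interval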

I'm not saying EngineYard are perfect, but for that price, I believe they're
the best damn Rails host you can find anywhere.

------
morten
We're hosted with EY for one reason: When the shit hits the fan (which it
will) - I want an army of rocket scientists at my back figuring out what's up.
I have just that and I'm as happy as can be.

------
callmeed
Signed up with EY last month. Frankly, our experience has been disappointing
so far. We haven't gone live on our EY slices yet - this makes me nervous as
well.

~~~
tptacek
What haven't you liked?

------
goodkarma
We just pulled the trigger for EY this week, should be migrating to our
"slices" within 7-10 days.

Things like this make me nervous...

------
crazyirish
The blog seems a little light on details - have they discovered the root
cause?

~~~
tmornini
Hello there. Thought I'd follow up as we have some additional news from our
vendor.

The fact that this "should not have happened" doesn't help our customers out,
and doesn't make us feel any better about the incident we had on Friday.

Just one of our six clusters was affected, but it was a bad day for customers
on that cluster, and for that I personally apologize to them.

We're working hard to make sure that this issue is entirely resolved on this
cluster and our other clusters as well, and we're providing Extreme with the
core dumps they've requested.

Details below:

----------------------------------------------------------------------------------------------------------------------------

08/16/2008 03:53:58.16 <Noti:EPM.wd_warm_reset> Slot-1: Changing to watchdog
warm reset mode

08/16/2008 03:09:42.05 <Warn:DM.Warning> Slot-2: Slot-3 FAILED (2) Error on
Slot-3

08/16/2008 03:09:38.62 <Warn:DM.Warning> Slot-2: Slot-3 FAILED (1) Error on
Slot-3

08/16/2008 03:09:29.23 <Warn:DM.Warning> Slot-1: Slot-3 FAILED (2) Conduit
receive error encountered

08/16/2008 03:09:25.17 <Warn:DM.Warning> Slot-1: Slot-3 FAILED (1) Conduit
receive error encountered

08/16/2008 03:09:25.17 <Warn:DM.Warning> Slot-1: System Error 0: Conduit
receive error encountered

08/16/2008 03:08:22.29 <Warn:DM.Warning> Slot-1: Slot-4 FAILED (1)

08/16/2008 03:08:02.94 <Warn:DM.Warning> Slot-1: Slot-4 FAILED (1)

----------------------------------------------------------------------------------------------------------------------------

The above warning messages were caused by "system memory depletion", which
triggered the watchdog timer and restarted slots 3 and 4. After the crash,
slots 3 and 4 each created a core dump file; we need those files for a
detailed investigation.
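
For anyone squinting at the dump: each line is a timestamp, a
<Severity:Subsystem> tag, the slot reporting the event, and the free-form
message. A quick sketch of pulling those fields apart - the regex is my
reading of the format above, not a documented Extreme/EXOS grammar:

    # Rough parser for the switch log lines quoted above. The field
    # layout is inferred from the dump itself, not from vendor docs.
    import re

    LINE = re.compile(
        r"(?P<ts>\d{2}/\d{2}/\d{4} [\d:.]+)\s+"         # timestamp
        r"<(?P<severity>\w+):(?P<subsystem>[\w.]+)>\s+"  # <Severity:Subsystem>
        r"(?P<slot>Slot-\d+):\s+"                        # reporting slot
        r"(?P<message>.*)"                               # free-form message
    )

    sample = ("08/16/2008 03:09:25.17 <Warn:DM.Warning> Slot-1: "
              "Slot-3 FAILED (1) Conduit receive error encountered")
    m = LINE.match(sample)
    if m:
        print(m.group("ts"), m.group("severity"), m.group("slot"))
        print(m.group("message"))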

