
Amazon’s problem isn’t the outage, it’s the communication - webwright
http://www.geekwire.com/2011/amazoncoms-real-problem-outage-communication
======
dasil003
My company is in the same boat, so I'm definitely sympathetic, but I think the
criticism is a bit harsh. First of all, transparency is easy for a startup
because no one is listening (the majority of your prospective customers
haven't heard of you yet) and you don't have that much to lose anyway. It also
pays dividends because early adopters are the people who appreciate that
transparency the most. When you're a big corporation the economics change, and
lawyers and PR call the shots; it's not because they're disingenuous, it's
just because they have a lot more to lose.

The updates from Amazon have been okay in my opinion. They could have been
better, sure, but do we want the engineers working on this to stop every 20
minutes to write up obscure details that don't yet paint a cohesive picture?
The bottom line is better updates won't help them fix the problem faster, and
actually could distract from the resolution.

If they don't post any more details even after everything's back up and
running, then there will be something to complain about.

~~~
derefr
> do we want the engineers working on this to stop every 20 minutes to write
> up obscure details that don't yet paint a cohesive picture?

In a company the size of Amazon, I'd think they could afford to hire at least
one person who had all the knowledge of a programmer, but who did no coding of
his/her own—whose sole purpose would instead be to watch over the shoulders of
the people doing the programming at times like this, and push-notify
management and/or PR with the progress and difficulties in an accessible
fashion. A technical stenographer, or code-bard, if you will.

~~~
lawnchair_larry
You'd be wrong. Even though it is a huge company, budgets are managed in
really small groups, and no manager would "waste" a headcount when they only
have a handful to begin with.

~~~
pero
I've worked at a large FI where there is an _entire team_ devoted solely to
this task--"communicate with/between stakeholders and techies when something
doesn't work." I have no idea what they do--or are qualified to do--otherwise.
They are 10 deep with 3 middle managers and a senior, out of a
100ish department devoted to the FI's .com and online services _front-ends_.
There is another team for customer support.

Although it is definitely overkill, and understandable in an organization that
doesn't particularly value efficiency, it goes some way to illustrate
the importance of these communication channels in mission-critical
environments.

------
chime
The worst thing for any service provider is to delay the initial confirmation
that yes, something seems to be going wrong.

> <http://blog.dotcloud.com/working-around-the-ec2-outage>

"After one hour, Amazon's Health Dashboard was still pretending that
everything was right."

There really is no negative side for someone as large as Amazon to immediately
put up a quick notice that "we are receiving complaints about x-y-z and
looking into it." I get pinged like crazy within minutes of one of my clients'
servers slowing/going down - I can't imagine that Amazon doesn't know
something is wrong within seconds. If it turns out that it wasn't AWS but
something else (Level 3 pipe issues or Comcast DNS errors) then just clarify
that later. Why make every single AWS customer panic for an hour fearing that
the fault lies within their code/services?

~~~
moe
_There really is no negative side for someone as large as Amazon to
immediately put up a quick notice that "we are receiving complaints about
x-y-z"_

Do you have any idea how many such complaints Amazon is receiving on a normal
day? Per hour?

 _Why make every single AWS customer panic for an hour_

Diagnosing problems in a big system is not that easy.

A turnaround time of an hour is not too bad for a behemoth the size of Amazon,
especially when you consider that this was a worst-case scenario.

~~~
cft
Even with a large service, you can immediately identify a fluctuation in the
number of complaints (in addition to signals from monitoring tools). Speaking
from experience. In fact, the larger the service is, the easier it is to
statistically identify an uptick in the number of complaints per smaller time
interval.
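
A rough sketch of why scale helps here, assuming complaint counts per interval behave roughly like a Poisson process (my assumption, not something measured on a real service). The function names and the threshold of 3 are made up for illustration. The same 50% relative jump is barely distinguishable from noise at 4 complaints per interval, but stands out clearly at 400:

```python
import math


def uptick_z_score(baseline_rate, observed_count):
    """Z-score of an observed count against a Poisson baseline.

    For a Poisson process the variance equals the mean, so the
    standard deviation is sqrt(baseline_rate).
    """
    return (observed_count - baseline_rate) / math.sqrt(baseline_rate)


def is_uptick(baseline_rate, observed_count, threshold=3.0):
    """Flag the interval if the count sits more than `threshold` sigmas above baseline."""
    return uptick_z_score(baseline_rate, observed_count) > threshold


# Same 50% relative increase at two different scales (hypothetical numbers).
small_service = uptick_z_score(4, 6)      # (6 - 4) / 2
large_service = uptick_z_score(400, 600)  # (600 - 400) / 20

print(small_service, is_uptick(4, 6))
print(large_service, is_uptick(400, 600))
```

Because the noise grows only with the square root of the baseline, a larger service can also use shorter time intervals and still see a clear signal.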

~~~
moe
_Even with a large service, you can immediately identify a fluctuation in the
number of complaints_

Immediately is a relative term. I would say "1 hour" is pretty much as
immediate as it gets on these scales.

I wouldn't be surprised if a significant complaint fluctuation only manifested
long _after_ Amazon discovered the problem in their own monitoring.

 _statistically identify an uptick in the number of complaints per smaller
time interval_

Yes, but volume is not everything. You also have to qualify (triage) the
input, get engineers on the case, confirm the issue, perhaps get clearance for
a public announcement. All the while many of the key people are busy either
trying to figure out what is going on, or trying to dispatch information to the
right people, or just running around waving their arms furiously.

------
lsc
The title is probably the most surprising thing that I've learned running a
hosting business. I think most people would prefer that I take the time to
email everyone _even if this means I take time away from fixing the problem
and the outage lasts another 20 minutes._

To me, this is shocking. My PFY, who is even more nerdy than I am, still
doesn't believe that an incomplete (which is to say, "I see X problem but I
haven't figured out how to fix it. I'm working on it") message has any value
at all.

This is something that I think is difficult for nerds to learn, because it
doesn't make sense to us. But it's true. Most people, even fairly technical
people, are willing to accept slightly increased downtime for better status
updates, even when those updates largely consist of "I'm working on it".

~~~
switch007
I don't know about your customers but ours want both excellent communication
and for the problem to be fixed immediately. This means having more staff
during an outage.

~~~
lsc
Yes, of course. My customers, though, don't want to pay very much, and that's
another tradeoff. I think I'm pretty up front about what I'm selling and what
tradeoffs I make, and for some people, this is a good tradeoff. There are
other services that have made different tradeoffs, that work out better for
different people with different needs.

I'm just a little surprised about my customers' preferences in this particular
tradeoff.

------
jacques_chester
As a _reductio ad absurdum_, suppose that Amazon was secretive beyond all
dreams of a CIA director, but never _ever_ failed. Would people care about
their communications strategy?

Likewise, if they could telepathically beam status updates during a pattern of
approximately half hourly failures, would they be feted or avoided?

The updates probably have a lawyerly feel because when the engineers are asked
for a status update by management, they get a "piss off, we're working on it,
don't bug me". Indeed a smart manager would _refrain_ from indulging the
instinct to constantly interrupt work to receive status updates. Perhaps
Amazon has such managers.

~~~
random42
> _As a reductio ad absurdum, suppose that Amazon was secretive beyond all
> dreams of a CIA director, but never ever failed. Would people care about
> their communications strategy?_

This is not the same. No one cares about _active_ communication, when things
go _as expected_. However, _failure/unexpected outcomes_ are supposed to be
communicated properly.

> _The updates probably have a lawyerly feel because when the engineers are
> asked for a status update by management, they get a "piss off, we're working
> on it, don't bug me"._

The relationship between a boss and an employee is different from that between
a customer and a company. Moreover, a boss _asking_ for an update is not
analogous to a company _providing_ an update.

When I (an employee, like most of my coworkers) work on a bug fix that makes
things inconvenient for others (including my bosses), I try to communicate the
status of the issue and the progress of my efforts _actively_, even if _they
don't ask for it_, because this is what I believe is the _professional_ thing
to do.

~~~
jacques_chester
OP's complaint is that he's not getting a ringside seat. There's a difference
between asking for minute technical updates (where I made the analogy with
interfering management) and getting general, high-level status reports.

------
chacha102

        We absolutely love AWS because of the pace of innovation
        and scale that it has allowed us to accomplish. But after 
        today’s episode is over, we will have a big decision
        to make.
    
        We can spend cycles designing and building technical belts
        and suspenders that will help us avoid a massive failure 
        like this in the future, or we can continue to rely on a 
        single huge partner and also continue our break-neck 
        pace of iteration and product development.
    

Some could make a case that "Improving Stability" _is_ "Product
Development".

~~~
defrost
Indeed they could; however, one could equally make the case that while working
on (say) the Photoshop code base is product development, having to rewrite or
work around a faulty NFS or ext3 file system just to get Photoshop to work is
an aberration.

"Improving Stability" of AWS is rightly "Product Development" that Amazon
should be undertaking, not their clients on an individual ad hoc basis.

~~~
billswift
Improving the stability _of their offering_, which they _currently host_
on AWS, is the responsibility of AWS's clients.

------
rmoriz
Amazon's problem is transparency.

Their technology stack, their failover and redundancy setup — nobody knows it
(exactly). Nobody can review the measurements and rate it. Nobody can easily
switch to other options if AWS isn't a valid option anymore.

Whatever the AWS marketing says, you just buy into another SPOF.

------
trotsky
While they make the obvious point that communication has been bad, the
analysis loses sight of reality at some point. With an outage of this size and
duration, clearly the outage _is_ the problem. Better communication might
smooth some feathers, but at the end of the day if a lot of your big customers
are down it won't matter a ton.

I've heard one of the arguments several times: that good early communication
would have allowed many customers to move off of the East region early and
avoid the bulk of the problem. Given the size of the East region, it's hard for me
to imagine they would have had the ready capacity to accept some sort of mass
migration. Worst case you could imagine load/peak use problems significantly
degrading service in other regions as customers moved.

------
danenania
It's worth noting that Heroku's communication (status.heroku.com) has been
terrific.

