
Amazon search was down - gourou
http://techcrunch.com/2016/06/02/its-not-just-you-amazon-search-is-down/
======
granos
Does something like this actually cost them a huge amount in sales or do
people know enough about the internet today to just come back in a few hours
and try again? I know that personally, if I'm buying from Amazon, I would just
come back later and try again.

~~~
splike
I have a friend who worked at Amazon for a while. He told me that after
outages are repaired, there is a spike in sales but that it is never enough to
cover the losses. The revenue does just disappear.

~~~
logicallee
Could be even worse. On the Internet it's not like a corner store, which you'd
check again later because it's still just as convenient next time: everything
is a Google search away. So everyone who goes through the trouble of figuring
out where else they can buy from might just end up as those other people's
customers for good!

~~~
fizzbatter
I'd be curious to see the numbers on that sort of thing. I know for some
people that is the case, but i almost feel.. locked in to Amazon. With free 2
day shipping, and decent customer service if i have problems/etc, i am not
likely to shop elsewhere.

Is it a strange indicator that this feels more like lockin to me, than
loyalty? I have no idea what that means.. or how to change the conceptual
impression.

~~~
semi-extrinsic
> Is it a strange indicator that this feels more like lockin to me, than
> loyalty?

My hunch would be that your subconscious is telling you loyalty towards a huge
faceless corporation feels weird. If it was a local store with people you
could connect your positive experience to, my guess is you'd have no issue
calling it loyalty, even if that store was part of an equally big corporation.

~~~
fizzbatter
That may be an accurate hunch. I feel loyalty towards Costco, because i hear
good things about their employees and know multiple people who want to work
there.

With that said, it's hard to say if that's because i hear good things, or
because it's a physical store with people i see.

Compare that to Amazon, where i rarely hear good things about their staff
treatment, despite having a good experience with Amazon myself. To be clear, i
just don't hear good things; i'm not implying that i hear lots of bad.

------
jdimov10
They should consider moving to AWS ;)

~~~
0xmohit
In that event, AWS would respond by saying:

    
    
        The EC2 has responded, and informed me that they do not provide analysis of individual issues in AWS infrastructure.
    
        They referred to the AWS SLA that states that "AWS will use commercially reasonable efforts to make Amazon EC2 and Amazon EBS each available with a Monthly Uptime Percentage (defined below) of at least 99.95%".
    
        The AWS SLA is available here:
        https://aws.amazon.com/ec2/sla/
    

(The above is a part of the response I received when I attempted to ask AWS
about sporadic reboots and outages on one of my instances.)
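
As a rough back-of-the-envelope sketch of what that 99.95% figure permits
(assuming a 30-day month, which is my own simplification and not AWS's exact
definition of the billing cycle; the function name is made up for
illustration):

```python
def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes per month a service may be down while still meeting the uptime target.

    Assumes a fixed-length month (30 days by default), which is a simplification.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


print(f"{allowed_downtime_minutes(99.95):.1f} minutes per month")
```

Under those assumptions, the target leaves room for a bit over 20 minutes of
downtime a month.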

~~~
0xmohit
Another gem from AWS in response to a query that mentioned that every major
DNS server around the world had picked up a change made to the A record of a
domain except for Amazon DNS (even after 2 hours of the change):

    
    
      Thanks for that information. It really help us understand the issue faced. As you know changing a A record for a domain can take time to replicate through all the Root DNS servers and then onto the none authoritative servers from there on. When you are editing your DNS Zone file it is highly related to the TTL settings for quicker updates on changes to them.
    
      However it really isn't uncommon to see full DNS replication when making changes to an A record.
    
      ..
    
      Also for even more control and perhaps faster DNS resolving times within AWS look into our Route53 service.
      https://aws.amazon.com/route53/
      I hope this has addressed your questions and please feel free to ask if there is anything else.
    

Tools like
[http://mxtoolbox.com/dnspropagation.aspx](http://mxtoolbox.com/dnspropagation.aspx)
and
[http://www.viewdns.info/propagation/](http://www.viewdns.info/propagation/)
indicated that the change had been picked up by every listed server but for
AWS.

Solution proposed by AWS: Use Route53.

------
_nickwhite
It just came back up it seems (9AM EST).

I imagine a room full of Amazon engineers all standing around freaking out and
possibly yelling at interns.

~~~
colmmacc
One of my responsibilities at Amazon is "call leader". It's a rotation of
long-time Amazon folks who facilitate the recovery process whenever there are
outage events; call leaders run the conference call (there's always a
conference call) and help people focus on the right things in the moment as
well as making sure we gather everything we'll need for a rigorous post
mortem.

The first and most important principle that every call leader promotes is
"stay calm". It would be natural for folks to freak out and get stressed when
the clock is ticking on an outage. There's an innate human sense that our
emotional state has to match the urgency of a problem. But this would only
lead to a kind of frenetic energy that isn't helpful.

Instead, maintaining a collected composure quickly cools everything down so
that actions can be thought through and communicated more clearly. Many people
are surprised, when they join a call, especially with an experienced team,
by how calm it is. If you listen to the audio of Michael Schumacher as an F1
driver, or to fighter pilots performing complex maneuvers, it can also be
striking just how calm they are. It's very effective and a great lesson for
teams under pressure.

~~~
lazyant
Conference calls are terrible for troubleshooting. Let the techs do their jobs
(they can use chat), and then you can conference for the postmortem.

~~~
colmmacc
In general, our conference calls are structured and laser focused on recovery;
"what can we rollback, undo, failover or change to restore availability?"
There's much less emphasis on figuring out the root cause, or the "why" of the
issue, in that moment. That's another natural tendency - to want to understand
the issue rather than fix it - that can be counterproductive.

In that situation, and especially at scale, it's very useful to keep everyone
on the same page about what actions are happening when, and to help folks out
by showing them how to find things which may have changed (keep in mind the
change may have been in a totally different system than the one that
ultimately breaks). For example: it would slow things down if someone tried to
flip data centers at the same time that someone else was rolling back
software.

If on a call it becomes clear that a team needs to make some kind of non-
rollback change, like a code change, or push a new config, and thankfully
that's very rare, we'll generally recommend they break off and work on it in
isolation, while nominating a point person to liaise with everyone else. We do
use chat too and that can be more effective as more people join.

For postmortems we try hard to do them in writing first, and then an in-person
review. It's very helpful to be able to reassure and guide each other. I can
back up the "Blame the system, not the engineer" guidance others have written
about here too.

------
yomly
I can only imagine the hell some team must be experiencing right now.

Godspeed, Amazon site reliability team...

------
ionwake
Do people get fired over things like this in the bigger companies?

~~~
hmate9
Someone probably will get fired.

Last quarter Amazon earned $20.58bn from product sales.

So 4 hours of that is $38,000,000.
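
Spelling that estimate out as a sketch, assuming a 91-day quarter and sales
spread evenly over every hour (both are simplifications of mine, and the
function name is made up for illustration):

```python
def revenue_in_window(quarterly_revenue: float, hours: float, days_in_quarter: int = 91) -> float:
    """Revenue attributable to a window of `hours`, assuming a flat hourly sales rate."""
    hours_in_quarter = days_in_quarter * 24
    return quarterly_revenue / hours_in_quarter * hours


# $20.58bn is the quarterly product-sales figure quoted above.
estimate = revenue_in_window(20.58e9, 4)
print(f"${estimate:,.0f}")
```

That lands a little under the $38,000,000 figure, and it only holds if every
sale in that window were actually lost.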

~~~
BinaryIdiot
I highly doubt search equates to all of their sales during a period. That
would seem odd and would indicate no one uses their navigation at all.

~~~
jessriedel
Also, Amazon is established enough, and has few enough comparable competitors,
that anyone whose search breaks is liable to just try again tomorrow.

Still, some people do go somewhere else, or are less likely to try Amazon
in the future. I'd be surprised if this wasn't at least a multi-million dollar
mistake.

------
vamur
Could be due to the HTTP/2 switch - e.g. in the morning it was working in HTTPS and
now it's HTTP only again. BTW, it's interesting how much bloat there is in
their HTML code.

------
evansekeful
It's kind of interesting how being the purchasing agent for an Amazon seller
account gives you visibility on Amazon's health. After getting my 9am sales
report, I panicked and had our tech team check if our seller software was
working properly. Conversely, I knew Bezos was laughing all the way to the
bank when people ragged on Prime Day last year before the official numbers
came out. We didn't even have a featured deal and could see the insane
difference.

------
ikeboy
Not the first time this year
[https://archive.is/mvVUo](https://archive.is/mvVUo)
[http://www.usatoday.com/story/tech/news/2016/03/10/amazons-w...](http://www.usatoday.com/story/tech/news/2016/03/10/amazons-website-down-quickly-up/81596092/)

Although this was longer and more severe.

------
dtdbdesign
I found an article about their last downtime and either way they are losing a
lot of money
[https://www.upguard.com/blog/the-cost-of-downtime-at-the-wor...](https://www.upguard.com/blog/the-cost-of-downtime-at-the-worlds-biggest-online-retailer)

~~~
rdoto454
Nice!! ^^

------
BinaryIdiot
I noticed this while I was up very late working but it only was an issue for
some items. For instance when I searched for Item1 and Item2 it failed but
Item3 just kept working. Not sure how they have search divided up but found
that interesting.

Would love to see something come out about how it affected their users.

~~~
umanwizard
Was one of the searches in a category, while the others were in all product
search? That could account for the issue.

~~~
BinaryIdiot
I tried them all from the homepage; I just don't remember the exact products I
searched for at the time.

~~~
breakingcups
Search caches maybe?

~~~
BinaryIdiot
Yeah maybe. They would have to be server-side as I was picking random things I
have never searched before to see if search was broken for everything or not
(I was initially getting different error messages for different products).

~~~
umanwizard
Yeah loads of stuff is cached server-side; this is a pretty realistic guess.

------
artursapek
Does Amazon do public post-mortems for things like this, the way they do for
AWS outages?

------
sabujp
I expect to see a post mortem.

~~~
el_duderino
Does Amazon disclose things like this when they affect a primary site
function?

------
misiti3780
Is it back up now? I just searched from Croatia and it worked fine.

~~~
thomnottom
I read that as "searched FOR Croatia" and thought, "Man, they really do sell
everything these days..."

~~~
iagooar
Would have been funnier if he searched for "Greece".

------
lwhalen
Would love to see the outage writeup and RCA for this, if/when it's published.

------
ohitsdom
Working for me right now. Hard to imagine search being down for 4 hours...

------
TazeTSchnitzel
Was broken the other day for Amazon UK, too.

------
snambi
Amazon has search?

~~~
khedoros
You're surprised that they do?

------
known
That hurts AMZN's credibility

------
aiabgold
Working fine for me (~6:30am Pacific Time).

------
Drakim
Did it take so long to get it up again because nobody noticed for the first 3
hours?

(Actually what is Amazon search used for?)

~~~
mille562
It's used to search their site for products. If you want to buy a "571B Banana
Slicer", the easiest way to find it on Amazon's site is to use their search.

The search feature being broken translates to sales being close to zero for
that time period.

~~~
arbitrage
> "571B Banana Slicer"

Wow, turns out that's actually a thing.

~~~
SparkyMcUnicorn
Look at it for the novelty, and buy it for the reviews.

~~~
mwambua
'For decades I have been trying to come up with an ideal way to slice a
banana. "Use a knife!" they say. Well...my parole officer won't allow me to be
around knives....'

You made my day! :-)

