
AWS S3 Outage - gschier
Seeing huge numbers of 503s from the S3 API in us-east-1. Anyone else having problems? I only found one other report on Twitter: https://twitter.com/cperciva/status/630641484677558273
======
nmjohn
I'm seeing it as well - the majority of connections are being dropped for us atm

    
    
      The Amazon S3 team recently completed some maintenance
      changes to Amazon S3's DNS configuration for the US STANDARD
      region on July 30th, 2015.

      You are receiving this email because we noticed that your bucket
      is still receiving requests on the IP addresses which were removed
      from DNS rotation. These IP addresses will be disabled on August
      10th at 11:00 am PDT, at which time any requests still using
      those addresses will receive an HTTP 503 response status code.

      Applications should use the published Amazon S3 DNS names for
      US STANDARD: either s3.amazonaws.com or s3-external-2.amazonaws.com
      with their associated time to live (TTL) values. Please refer to
      our documentation at
      http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
      for more information on Amazon S3 DNS names.

Something to do with that, perhaps? AWS sent us that email last Thursday.

~~~
bbrazil
Java by default can cache DNS forever, which may be why many people are seeing
problems. Set networkaddress.cache.ttl to adjust this.

[http://javaeesupportpatterns.blogspot.ie/2011/03/java-dns-cache-reference-guide.html](http://javaeesupportpatterns.blogspot.ie/2011/03/java-dns-cache-reference-guide.html) has more detail.
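
For reference, a minimal sketch of how you could cap the cache from application code, assuming it runs before the JVM performs its first lookup (the 60s/10s values are arbitrary examples):

    import java.security.Security;

    public class DnsTtlConfig {
        public static void main(String[] args) {
            // JVM-wide security property read by the InetAddress cache.
            // "0" disables caching; "-1" caches successful lookups forever.
            Security.setProperty("networkaddress.cache.ttl", "60");
            // Failed (negative) lookups are cached under a separate property.
            Security.setProperty("networkaddress.cache.negative.ttl", "10");
        }
    }

The same setting can also go in the JRE's java.security file, or on Sun/Oracle JVMs via -Dsun.net.inetaddr.ttl=60 at launch. As I understand it, the 30s default applies only when no SecurityManager is installed; with one installed, successful lookups are cached forever, which may explain the conflicting reports below.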

~~~
ork
Reading this article, it seems that starting with Java 1.6 the default TTL became
30s.

~~~
Xorlev
That doesn't match what I see. You need this property set or the JVM keeps DNS
entries forever, at least on the one we run (Oracle, 1.8u51).

------
fizx
I wonder what it would take for amazon to show one of the yellow icons on
their status page? Has it ever happened? Would a datacenter have to fall in
the ocean?

~~~
gschier
Ya, it bothers me that their status messages for major outages are simply
"elevated error rates".

~~~
nmjohn
What's frustrating is when you have customers who are also down because of the
outage - but when you tell them Amazon is experiencing a severe outage causing
50% of our requests to be dropped and there's not much we can do, it makes us
look pretty bad when they go to the Amazon dashboard and only see "Elevated
Error Rates."

~~~
RoryH
"Elevated Error rates" probably does not qualify as breach of their SLA

------
jschorr
Latest Update from
[http://status.aws.amazon.com/](http://status.aws.amazon.com/):

1:52 AM PDT We are actively working on the recovery process, focusing on
multiple steps in parallel. While we are in recovery, customers will continue
to see elevated error rate and latencies.

~~~
kore_sar
The icon turned yellow. I repeat: YELLOW.

~~~
hughstephens
It only goes red when a nuclear event occurs, obliterating most of humanity
and only the machines remain.

~~~
gotroot
That would actually turn it green again, I think.

~~~
hughstephens
maybe true – but if a server gets no requests, does it really exist?

or would the inter-machine chatter continue ad infinitum? would they run out
of IPs or successfully transition to IPv6?

So many questions.

~~~
mryan
Cory Doctorow's When Sysadmins Ruled the Earth touches on this - what happens
to internet activity during a global crisis?

[http://craphound.com/overclocked/Cory_Doctorow_-_Overclocked_-_When_Sysadmins_Ruled_the_Earth.html](http://craphound.com/overclocked/Cory_Doctorow_-_Overclocked_-_When_Sysadmins_Ruled_the_Earth.html)

~~~
hughstephens
this is excellent, hadn't seen it before! thx

------
hughstephens
Next update is live

    
    
      2:38 AM PDT We continue to execute on our recovery plan
      and have taken multiple steps to reduce latencies and error
      rates for Amazon S3 in US-STANDARD. Customers may continue 
      to experience elevated latencies and error rates as we 
      proceed through our recovery plan.

------
adamtulinius
Can't pull Docker images from the hub either, and their status page currently
shows S3 problems: [https://status.docker.com/](https://status.docker.com/)

~~~
endymi0n
[http://status.docker.com/](http://status.docker.com/) - they are
"Investigating issue with high load"; now yellow, but it was red before.

Can't pull any images either.

------
simonpantzare
Seeing the same thing. Got back from vacation an hour ago, probably related.
:)

~~~
sschueller
At least it didn't happen while you were on vacation. :)

------
crodjer
Could this be a reason why Heroku is misbehaving?
[https://status.heroku.com/incidents/792K](https://status.heroku.com/incidents/792K)

~~~
robotfelix
I can't help but be a little surprised that Heroku's entire build system is
disabled by an S3 failure in _one_ region. Now I'm unable to add a notice
about the issues to my site's HTML...

~~~
crodjer
True. Given the business Heroku is in, they should have redundancy across
multiple regions, if not multiple service providers.

------
ranrub
"1:08 AM PDT We believe we have identified the root cause of the elevated
error rates and latencies for requests to the US-STANDARD Region and are
working to resolve the issue."

Looks like the cavalry is coming.

------
chncdcksn
GitHub is having release download issues, possibly due to this.
[https://status.github.com/](https://status.github.com/)

~~~
Maxious
Mapbox and Hipchat too
[https://www.mapbox.com/status/](https://www.mapbox.com/status/)
[https://status.hipchat.com/](https://status.hipchat.com/)

~~~
DanKlinton
Feels like the internet is no longer a distributed thing where, if one
website/node goes down, the others keep working...

Feels like in the future, if a cloud provider goes down, the whole internet
will stop working :)

~~~
rodgerd
This is pretty much the case. Years of evangelising the idea that (a)
everybody should be on Amazon and (b) everybody should be on the cheapest
Amazon regions mean that, while the underlying datacentres are probably much
better managed individually than the tapestry of colos that made up the world
a decade ago, an outage now has much more wide-ranging effects than it would
have back then.

------
xenoclast

      3:25 AM PDT We are still working through our recovery plan.
    

Man, I'd love to see that plan.

~~~
cddotdotslash
I'd also love to see a post mortem on this, but I highly doubt they'll release
anything about it.

------
clebio
Seems HashiCorp may be affected by this as well.

    
    
        $ vagrant up
        Bringing machine '...' up with 'virtualbox' provider...
        ==> ...: Box 'debian/jessie64' could not be found.
        ...
        ...: Downloading: https://atlas.hashicorp.com/debian/boxes/jessie64/versions/8.1.0/providers/virtualbox.box
        An error occurred while downloading the remote file. The error
        message, if any, is reproduced below. Please fix this error and
        try again.

        The requested URL returned error: 500 Internal Server Error
    

EDIT: _not_ Markdown.

~~~
uxp
VagrantCloud, now known as Atlas, is merely a redirector service for Vagrant
box management. I host all my own boxes (on S3), but still use it as an easy
means of sharing without having to remember a long-ass URL.

------
cperciva
As of 10:29:33 UTC, everything is back to normal as far as I can measure.

------
mryan
Should be back to normal now. The latest update is:

3:46 AM PDT Between 12:08 AM and 3:40 AM PDT, Amazon S3 experienced elevated
error rates and latencies. We identified the root cause and pursued multiple
paths to recovery. The error has been corrected and the service is operating
normally.

------
jakozaur
Open-source library request: a library that lets you use S3 and Google Cloud
Storage simultaneously and fail over to the other if one has problems.

There are many use cases where paying 2x for storage is a reasonable tradeoff
for higher availability and provider independence.
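
A minimal sketch of the shape such a library could take: mirror writes to both providers and fail over on reads. Everything here (the BlobStore interface, both class names) is hypothetical, not an existing API; real implementations would wrap the AWS and Google SDK clients:

    import java.io.IOException;

    // Hypothetical minimal storage abstraction; concrete implementations
    // would delegate to the S3 and Google Cloud Storage SDK clients.
    interface BlobStore {
        void put(String key, byte[] data) throws IOException;
        byte[] get(String key) throws IOException;
    }

    // Pays 2x on writes so that reads survive a single-provider outage.
    class FailoverBlobStore implements BlobStore {
        private final BlobStore primary;
        private final BlobStore secondary;

        FailoverBlobStore(BlobStore primary, BlobStore secondary) {
            this.primary = primary;
            this.secondary = secondary;
        }

        @Override
        public void put(String key, byte[] data) throws IOException {
            primary.put(key, data);    // write to both providers so either
            secondary.put(key, data);  // copy can serve reads later
        }

        @Override
        public byte[] get(String key) throws IOException {
            try {
                return primary.get(key);
            } catch (IOException primaryDown) {
                // e.g. the 503s S3 is returning tonight
                return secondary.get(key);
            }
        }
    }

The hard part a real library would have to solve is what happens when the second write fails (queue and retry, or roll back), which may be why nobody has built it yet.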

~~~
ketralnis
What's stopping you from writing it?

~~~
jakozaur
Time and priorities. There is a difference between "I wish this thing existed,
so I can use it/contribute to it" and "I need it so badly that I'm willing to
spend a lot of time to make it production ready".

So far S3 seems to be reliable enough...

~~~
toomuchtodo
Might be something to integrate into Libcloud [1] instead of rolling your own.

[1] [https://libcloud.apache.org/](https://libcloud.apache.org/)

------
dangravell
Looks like I picked a bad week to stop sniffing glue.

------
jontro
Looks like this brought down typekit too. "Font Network is experiencing issues
caused by an outage at our storage provider."
[http://status.typekit.com/](http://status.typekit.com/)

------
mrsuprawsm
From [http://status.aws.amazon.com/](http://status.aws.amazon.com/):

12:36 AM PDT We are investigating elevated errors for requests made to Amazon
S3 in the US-STANDARD Region.

------
kevindeasis
That would explain why my console was not performing well even though
[http://status.aws.amazon.com/](http://status.aws.amazon.com/) said "Service
is operating normally". Good thing their API seemed to keep functioning during
the outage, for me at least.

------
cubicfur
Good thing I built myself a local game streaming server instead of putting
that in a remote GPU instance.

------
colinbartlett
I started receiving lots of alerts from my side project
[https://StatusGator.io](https://StatusGator.io) which monitors status pages.
It's astonishing to me how many services depend on AWS directly or indirectly.

~~~
padelt
The most ironic thing I've seen in a while: their homepage features icons of
the services they monitor. Many of them 503-fail - they are hosted on CloudFront.

------
pemp
As has happened before, the AWS status page is lying to us.

S3 is in yellow, which means "performance issues". But not being able to
download files from many buckets is clearly a "service disruption" (red).

~~~
mahouse
It is actually possible if you keep trying.

~~~
pemp
It must take many tries, because I have not succeeded yet and I'm still trying.

------
ramon
This happens from time to time, especially when replacing files frequently! I
submit things to S3 every day; if you're uploading a chunk of files you'll get
errors every now and then when replacing them.

------
theyeti
It seems to have come back now for me. Could someone else confirm the same?

~~~
theyeti
Update: We are beginning to see recovery in error rates and latencies for
Amazon S3.

------
JoshGlazebrook
I'm assuming this is also why I can't start any instances.

> 12:51 AM PDT We are investigating increased error rates for the EC2 APIs and
> launch failures for new EC2 instances in the US-EAST-1 Region.

~~~
rpmartz
Same here

------
blowski
I'm also having issues connecting to buckets based in Ireland (eu-west-1).
Just hangs at authentication stage. Tried from 3 different internet
connections, all having the same problem.

~~~
onre
Ditto. The API does not answer at all, latency is high, and customers are not
amused. The outage has lasted for hours already.

Edit: it does not seem to affect all buckets, though. Only one of ours is
experiencing this; the others are fine.

------
ranrub
Looks like it's getting better on our side now

------
mentat
Now there are updates about ELB scaling and Lambda failures.

------
whyleyc

      3:36 AM PDT Customers should start to see declines in elevated errors and
      latencies in the Amazon S3 service.
    

Fixed?

------
greenleafjacob
We are seeing lots of 503s, empty response bodies, and peer reset / dropped
connections.

------
thinkindie
I'm also getting problems with CloudFront attached to an S3 bucket.

~~~
nebulon
Same here; first it only affected a dev deployment (S3+CDN), now it has
spilled over to other buckets :-/

------
gedrap
I see this thread as a list of services depending on S3 being healthy.

~~~
jschorr
Also a good list of developers and devops people not getting any sleep tonight
:-/

~~~
bardworx
Yup!

3am "Dry Run" Staging on Heroku...fail 7am Deployment to Production....fail

Now: "We have confirmed elevated latencies affecting our SendEmail,
SendRawEmail and SendSMTPEmail APIs in the US-EAST-1 Region and are working to
address the problem."

Which is perfect since most of our PO orders are being placed between
6am-10am.

Tonight, adult beverages will be needed after everything is resolved.

------
rwitoff
Same here. Our S3 services are reporting similar 503s and network timeouts. A
few of our partners are already down as well with their own 500s. Another
stormy night in the cloud.

------
zubairov
+1. For us it's the CDN (CloudFront) - only HTTP 503 responses.

------
kernel_sanders
Can't launch instances in EC2 in US-East-1 at the moment.

~~~
jschorr
It appears EC2 is affected as well now:

12:51 AM PDT We are investigating increased error rates for the EC2 APIs and
launch failures for new EC2 instances in the US-EAST-1 Region.

~~~
cperciva
Probably just fallout from the fact that EBS snapshots are stored in S3. If
you can't create an EBS volume, you won't be able to launch an EC2 instance
from it.

~~~
mentat
It appears that S3-backed AMIs fail to launch as well, showing as unavailable.

------
geomark
Yep. Can't even get a response to an s3cmd command.

------
pydevops
The AWS API still works while the AWS web console does not.

------
vaibhavrajput
Why is it behaving like this every other day?

~~~
eva1984
When was the last time AWS had two major outages within, say, 30 days of each
other?

~~~
lubos
Last one was 10 days ago.

[https://news.ycombinator.com/item?id=9980222](https://news.ycombinator.com/item?id=9980222)

~~~
vaibhavrajput
+1. It was also down on the leap second. [http://mashable.com/2015/06/30/aws-disruption](http://mashable.com/2015/06/30/aws-disruption)

~~~
cheeseprocedure
To be fair, that was not really AWS's fault, nor (apparently) a leap second
issue:

[http://www.bgpexpert.com/article.php?article=167](http://www.bgpexpert.com/article.php?article=167)

[https://twitter.com/Axcelx/status/616058414746202113](https://twitter.com/Axcelx/status/616058414746202113)

------
mentat
Current list of additional services affected: CloudSearch, Elastic Compute
Cloud, Elastic Load Balancing, Elastic MapReduce, Relational Database Service,
CloudTrail, Config, Lambda, OpsWorks.

------
rgbrgb
yep, we're seeing timeouts and 404s for images stored on s3 :(

good luck to the on-call engineers at amazon!

------
jackyjjc
Even EC2 is down:

Amazon Elastic Compute Cloud (N. Virginia) Increased API Error Rates 12:51 AM
PDT We are investigating increased error rates for the EC2 APIs and launch
failures for new EC2 instances in the US-EAST-1 Region.

~~~
mryan
EC2 is not down.

~~~
devastor
It's up, but launching new instances doesn't work due to the S3 issue.

~~~
ranman
I can still launch new instances.

------
jtwaleson
Loading fonts on AWS pages is slow as hell because of this:
[https://aws.amazon.com/ses/](https://aws.amazon.com/ses/)

------
lostdd
What a wonderful beginning of a week! Thanks AWS.

------
gschier
Lots more complaints on Twitter...

------
jsonperl
Yep, seeing the same.

------
shaper60
oh... seeing the same

------
zyzyis
wtf, aws again

------
shaper60
Still not fixed, huh.

------
kureikain
It's been two hours. Luckily it happened around 12:00 AM; otherwise we'd be
having a bad day.

Whoever is on DevOps or SysAdmin duty tonight probably can't sleep :(.

In our case, we put Fastly in front of our assets/images so that only a
portion of requests get errors. The objects cached on Fastly are still fine.
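
For the curious, a sketch of the origin side of that setup: serve assets with a stale-if-error directive (RFC 5861), which CDNs like Fastly can honour to keep serving a cached copy while the origin 503s. The JDK's built-in HttpServer and all values here are illustrative only:

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class StaleFriendlyOrigin {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/assets/", exchange -> {
                byte[] body = "asset bytes".getBytes(StandardCharsets.UTF_8);
                // Fresh for 5 minutes; if revalidation hits an erroring
                // origin, the CDN may serve the stale copy for up to a day.
                exchange.getResponseHeaders().set("Cache-Control",
                        "public, max-age=300, stale-if-error=86400");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();
        }
    }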

~~~
jbbarth
Actually, in some parts of the world it's not 12:00 AM, you know, and there
are sysadmins/devops there too :-)

~~~
addandsubtract
11:40am: Currently can't push to Heroku :(

