AWS S3 Outage
268 points by gschier 683 days ago | 131 comments
Seeing huge numbers of 503s from the S3 API in us-east-1. Anyone else having problems? I only found one other on Twitter: https://twitter.com/cperciva/status/630641484677558273



I'm seeing it as well - majority of connections are being dropped for us atm

  The Amazon S3 team recently completed some maintenance   
  changes to Amazon S3’s DNS configuration for the US STANDARD region on 
  July 30th, 2015.
    
  You are receiving this email because we noticed that your bucket
  is still receiving requests on the IP addresses which were removed 
  from DNS rotation. These IP addresses will be disabled on August 
  10th at  11:00 am PDT, at which time any requests still using
  those addresses will receive an HTTP 503 response status code.
  
  Applications should use the published Amazon S3 DNS names for 
  US STANDARD: either s3.amazonaws.com or s3-external-2.amazonaws.com
  with their associated time to live (TTL) values. Please refer to 
  our documentation at: 
  http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region 
  for more information on Amazon S3 DNS names.
Something to do with that perhaps? AWS sent us that last Thursday.


Java by default can cache DNS forever, which may be why many people are seeing problems. Set networkaddress.cache.ttl to adjust this.

http://javaeesupportpatterns.blogspot.ie/2011/03/java-dns-ca... has more detail.
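
For reference, a minimal sketch of capping the JVM DNS cache. `networkaddress.cache.ttl` is the security property named above; the class name `DnsTtlExample`, the helper `capDnsCache`, and the 60-second and 10-second values are illustrative assumptions, not AWS recommendations. As far as I know the property needs to be set before the JVM performs its first lookup, so this belongs early in startup.

```java
import java.security.Security;

final class DnsTtlExample {
    // Caps how long successful DNS lookups are cached, in seconds.
    // Assumption: called at startup, before the first lookup is made.
    static void capDnsCache(int seconds) {
        if (seconds < 0) {
            // -1 is the documented "cache forever" value, which is what we want to avoid.
            throw new IllegalArgumentException("TTL must be >= 0");
        }
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
        // Failed lookups are cached under a separate property.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }

    public static void main(String[] args) {
        capDnsCache(60);
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The JDK networking documentation describes the default as "cache forever" when a security manager is installed and an implementation-specific period otherwise, so it is worth checking what your own JVM actually does rather than relying on either default.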


Reading this article, it seems that starting at 1.6 the default TTL became 30s.


It's not true as far as I can tell. You need this flag set or it keeps DNS entries forever, at least on the JVM we run (Oracle, 1.8u51).


If it is related, someone screwed up badly. I didn't receive such an email; and I'm seeing the same error rate with IPs I resolved 5 minutes ago.


I wonder what it would take for Amazon to show one of the yellow icons on their status page? Has it ever happened? Would a datacenter have to fall in the ocean?


Ya, it bothers me that their status messages for major outages are simply "elevated error rates".


What's frustrating is when you have customers who are also down because of the outage - but when you say Amazon is experiencing severe outages causing 50% of our requests to be dropped and there's not much we can do, it makes us look pretty bad when they go to the Amazon dashboard and only see "Elevated Error Rates."


"Elevated Error rates" probably does not qualify as breach of their SLA


In these cases I suggest saying "we are being severely affected by an Amazon outage"


Often times it's just that though. Just because many customers are experiencing something doesn't mean ALL customers are experiencing something. When I worked there what the media would describe as a major outage was really less than 1% of one region... this particular instance seems pretty odd though.


I'm sure "elevated error rates" is the first alarm which goes off. And once they've put that description onto the status page, they're probably more worried about getting it fixed than going back and changing the wording.


They should worry about that a lot. Amazon are notoriously bad at communicating during outages. They've gotten better, but they're big enough that it should have priority.


I agree. I was speculating, not trying to defend. :-)


Amazon isn't a couple of guys in a garage. They have hordes of administrative personnel who could be tasked to update a status page.

Companies 1/1000th the size of Amazon can manage it.


Yes and no. Sure, Amazon has administrative personnel. Sure, some of those administrative personnel would probably be happy to get paid extra to carry a pager and be summoned to work at 3AM to update a status page.

But the last thing you want to do is put inaccurate information onto a status page; so mere administrative personnel isn't enough -- you'd need people who understand enough about the system to be able to write about it without introducing errors.

I'm guessing that the intersection of "administrative personnel", "willing to carry pagers" and "understand the internals of AWS services" is a very small set.


Except "willing to carry pagers" is currently the basis of employment at Amazon, and not just for AWS but for whole chunks of their technical business. It's one of the many reasons why they have a pretty dire reputation (see plenty of discussions on here from former Amazon employees).

They also claim to have "customer obsession" as a leadership principle; this whole thread is an excellent example of that being failed in a big way.


This is a situation where "we pay you, now do as you're told" comes in handy.

Not every job can be full of self-directed aspirational spiritual awakenings. If that were the case, nobody would deliver my dinner on a bike when it's -20ºF outside.


>I'm guessing that the intersection of "administrative personnel", "willing to carry pagers" and "understand the internals of AWS services" is a very small set

Being a non-engineer doesn't mean they don't know anything about the technology. And they don't need to know the internals, just enough to convey information from the engineers' managers to the public.

Plenty of other organizations manage resolving issues while transmitting information about the issue to other stakeholders.

Also, most administrative personnel have far fewer job opportunities than engineers. If they can get the engineers to carry pagers they can get a PR minion to carry one.


You would be surprised to learn just how few people run your favorite web service.


It's yellow now. I'm pretty sure the datacenter is still there. I think they go yellow/red after a certain time has passed or someone manually changes it (probably rarely)


Seems that "S3 offline" is the AWS equivalent of a datacenter falling into the ocean. Gotta wonder how many services are using S3 as a faux message queue?


I think most services use the message queue as a message queue (SQS), but everyone has to store their files somewhere.


They generally only use those icons in hindsight. As long as they're still "investigating" they stick to green with a note attached.


There's one there now for "Amazon Simple Storage Service (US Standard)"


Latest Update from http://status.aws.amazon.com/:

1:52 AM PDT We are actively working on the recovery process, focusing on multiple steps in parallel. While we are in recovery, customers will continue to see elevated error rate and latencies.


The icon got yellow. I repeat: YELLOW


It only goes red when a nuclear event occurs, obliterating most of humanity and only the machines remain.


That would actually turn it green again, I think.


maybe true – but if a server gets no requests, does it really exist?

or would the inter-machine chatter continue ad infinitum? would they run out of IPs or successfully transition to IPv6?

So many questions.


Cory Doctorow's When Sysadmins Ruled the Earth touches on this - what happens to the internet activity during a global crisis?

http://craphound.com/overclocked/Cory_Doctorow_-_Overclocked...


this is excellent, hadn't seen it before! thx


Eventually the undefined behavior of this chatter will result in unallocated memory slowly churning in the garbage collection of time. Some day one of the sectors of unallocated memory will be executed resulting in a self replicating program. This program will evolve and multiply, pondering on the vastness of the S3verse, forever in search of Root.


my service looks pretty red now


The EC2 launch thing is still green even though all launches fail for us.


Could you expand on what you mean? Is that because your launch is trying to fetch something from S3?


There was a period in which I couldn't launch instances, meaning the instances never reached the "running" state according to the console and were not responsive to initial SSH attempts. (knife timed out after 5 minutes, and the machines were still unavailable several minutes after that.)


Maybe related to AMI retrieval from S3?


Went to green again and it seems to be resolved.


Next update is live

  2:38 AM PDT We continue to execute on our recovery plan
  and have taken multiple steps to reduce latencies and error
  rates for Amazon S3 in US-STANDARD. Customers may continue 
  to experience elevated latencies and error rates as we 
  proceed through our recovery plan.


Can't pull docker images from the hub either, and their statuspage currently shows S3-problems: https://status.docker.com/


http://status.docker.com/ - they are "Investigating issue with high load", now yellow, but was red before.

Can't pull any images either.


Can't docker login or pull images from the hub. Both operations hang.


Seeing the same thing. Got back from vacation an hour ago, probably related. :)


At least it didn't happen while you were on vacation. :)


Welcome back!


Could this be a reason why Heroku is misbehaving? https://status.heroku.com/incidents/792K


I can't help but be a little surprised that Heroku's entire build system is disabled by an S3 failure in one region. Now I'm unable to add a notice about the issues to my site's HTML...


True. Given the business Heroku is in, they should have redundancies in multiple regions if not multiple service providers.


Since S3 is the de facto artifact delivery system for most people that run on AWS, it's not much of a surprise. For the most part, very isolated incidents aside, S3 is rock solid. Even EC2 relies on S3 for launching non-EBS instances.


Heroku build logs spat this out:

    Unable to fetch source from: https://s3-external-1.amazonaws.com/heroku-sources-production/heroku.com/<some-uuid>?AWSAccessKeyId=<some-access-key>&Signature=<some-signature>&Expires=1439198046


"1:08 AM PDT We believe we have identified the root cause of the elevated error rates and latencies for requests to the US-STANDARD Region and are working to resolve the issue."

looks like the cavalry are coming


GitHub is having release download issues, possibly due to this. https://status.github.com/



Feels like the internet is no longer a distributed thing where, if one website/node goes down, the others keep working...

Feels like in the future... if a cloud provider goes down... the whole internet will stop working :)


This is pretty much the case. Years of evangelising the idea that (a) everybody should be on Amazon and (b) everybody should be on the cheapest regions of Amazon mean that while the underlying datacentres are probably much better managed, individually speaking, than the tapestry of colos that made up the world a decade ago, an outage has much more wide-ranging effects than you'd get at that point.


  [3:25] AM PDT We are still working through our recovery plan.
Man, I'd love to see that plan.


I'd also love to see a post mortem on this, but I highly doubt they'll release anything about it.


Seems HashiCorp may be affected by this as well.

    $ vagrant up
    Bringing machine '...' up with 'virtualbox' provider... ==> ...: Box 'debian/jessie64' could not be found.
    ...
    ...: Downloading: https://atlas.hashicorp.com/debian/boxes/jessie64/versions/8.1.0/providers/virtualbox.box
    An error occurred while downloading the remote file. The error message, if any, is reproduced below. Please fix this error and try again.
    The requested URL returned error: 500 Internal Server Error
EDIT: not Markdown.


VagrantCloud, now known as Atlas, is merely a redirector service for Vagrant box management. I host all my own boxes (on S3), but still use them for an easy means of sharing without having to remember a long-ass URL.


As of 10:29:33 UTC, everything is back to normal as far as I can measure.


Should be back to normal now. The latest update is:

3:46 AM PDT Between 12:08 AM and 3:40 AM PDT, Amazon S3 experienced elevated error rates and latencies. We identified the root cause and pursued multiple paths to recovery. The error has been corrected and the service is operating normally.


Open-source library request: A library that lets you use S3 and Google Cloud Storage simultaneously and fall back to the other if one has problems.

There are many use cases where paying 2x for storage is a reasonable tradeoff for higher availability and provider independence.
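
A rough sketch of what the core of such a library might look like, to make the idea concrete. Everything here is hypothetical: `BlobStore`, `InMemoryStore`, and `FailoverStore` are names invented for illustration, and the in-memory backend stands in for real S3 and GCS clients, whose SDK calls are deliberately not shown.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal interface; real implementations would wrap the S3 and GCS clients.
interface BlobStore {
    void put(String key, byte[] data) throws IOException;
    byte[] get(String key) throws IOException;
}

// In-memory stand-in for a provider, only here to make the sketch runnable.
final class InMemoryStore implements BlobStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();
    volatile boolean down = false; // flip to simulate an outage

    @Override
    public void put(String key, byte[] data) throws IOException {
        if (down) throw new IOException("503 Service Unavailable");
        blobs.put(key, data.clone());
    }

    @Override
    public byte[] get(String key) throws IOException {
        if (down) throw new IOException("503 Service Unavailable");
        byte[] data = blobs.get(key);
        if (data == null) throw new IOException("no such key: " + key);
        return data.clone();
    }
}

// Writes to every backend; reads from the first backend that answers.
final class FailoverStore implements BlobStore {
    private final List<BlobStore> backends;

    FailoverStore(List<? extends BlobStore> backends) {
        this.backends = new ArrayList<>(backends);
    }

    @Override
    public void put(String key, byte[] data) throws IOException {
        int succeeded = 0;
        IOException last = null;
        for (BlobStore backend : backends) {
            try {
                backend.put(key, data);
                succeeded++;
            } catch (IOException e) {
                last = e; // remember the failure, keep trying the others
            }
        }
        if (succeeded == 0) throw new IOException("all backends failed", last);
    }

    @Override
    public byte[] get(String key) throws IOException {
        IOException last = null;
        for (BlobStore backend : backends) {
            try {
                return backend.get(key);
            } catch (IOException e) {
                last = e; // fall through to the next backend
            }
        }
        throw new IOException("all backends failed", last);
    }
}

final class FailoverDemo {
    public static void main(String[] args) throws IOException {
        InMemoryStore primary = new InMemoryStore();
        InMemoryStore secondary = new InMemoryStore();
        BlobStore store = new FailoverStore(Arrays.asList(primary, secondary));

        store.put("logo.png", new byte[] {1, 2, 3});
        primary.down = true; // simulate the primary provider failing
        System.out.println(store.get("logo.png").length + " bytes read via fallback");
    }
}
```

Note that this sketch treats a write as successful if any backend accepts it, so a provider that was down during a write will be missing that object afterwards. A production-ready version would need some way to reconcile the two stores, which is probably where most of the real work is.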


Depending on your use case, it may be slightly easier to accomplish this with s3 event notifications + AWS Lambda to write to a different region or service.

Importantly, make sure you CNAME your bucket under your own domain so that you can switch services.

edit: Much easier than AWS Lambda, actually: http://aws.amazon.com/about-aws/whats-new/2015/03/amazon-s3-...


https://jclouds.apache.org/

Apache jclouds® is an open source multi-cloud toolkit for the Java platform that gives you the freedom to create applications that are portable across clouds while giving you full control to use cloud-specific features.


The GCS command line tool, gsutil, can talk to both S3 and GCS. That might be a nice place to start.


There used to be this https://deltacloud.apache.org/ (not exactly what you want + it seems to be dead).


What's stopping you from writing it?


Time and priorities. There is a difference between "I wish this thing existed, so I can use it/contribute to it" and "I need it so badly that I'm willing to spend a lot of time to make it production ready".

So far S3 seems to be reliable enough...


Might be something to integrate into Libcloud [1] instead of rolling your own.

[1] https://libcloud.apache.org/


Looks like I picked a bad week to stop sniffing glue.


Looks like this brought down typekit too. "Font Network is experiencing issues caused by an outage at our storage provider." http://status.typekit.com/


From http://status.aws.amazon.com/:

12:36 AM PDT We are investigating elevated errors for requests made to Amazon S3 in the US-STANDARD Region.


That would explain why my console was not performing well even though http://status.aws.amazon.com/ says "Service is operating normally". Good thing their API seems to be functioning during the outage, for me at least.


Good thing I built myself a local game streaming server instead of putting that in a remote GPU instance.


I started receiving lots of alerts from my side project https://StatusGator.io which monitors status pages. It's astonishing to me how many services depend on AWS directly or indirectly.


The most ironic thing I've seen in a while: Their homepage features icons of services they monitor. Many of them 503-fail - they are hosted on cloudfront.


As it has happened before, Amazon AWS status page is lying to us.

S3 is in yellow, which means "performance issues". But not being able to download files from many buckets is clearly a "service disruption" (red).


It is actually possible if you keep trying.


It has to be many times, because I have not succeeded and I still keep trying.


This happens occasionally, especially when replacing files frequently! I upload things to S3 every day; if you're uploading a chunk of files, you'll get errors every now and then when replacing files.
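
Intermittent errors like these are usually handled by retrying, so here is a minimal sketch of retry with exponential backoff and jitter. `Retry`, `withBackoff`, and the 30-second cap are hypothetical names and values chosen for illustration; a real client would wrap an actual S3 call in the `Callable` and would probably retry only on specific failures such as 503s.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

final class Retry {
    private static final long MAX_DELAY_MILLIS = 30_000L; // illustrative cap

    // Runs op, retrying on IOException up to maxAttempts times in total.
    // The wait before each retry is random, bounded by a delay that doubles each time.
    static <T> T withBackoff(Callable<T> op, int maxAttempts, long baseDelayMillis)
            throws Exception {
        if (maxAttempts < 1 || baseDelayMillis < 0) {
            throw new IllegalArgumentException("need maxAttempts >= 1 and baseDelayMillis >= 0");
        }
        long delay = baseDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, surface the last error
                }
                // Jitter: sleep a random duration in [0, delay] so clients do not retry in lockstep.
                Thread.sleep(ThreadLocalRandom.current().nextLong(delay + 1));
                delay = Math.min(delay * 2, MAX_DELAY_MILLIS);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withBackoff(() -> {
            // Simulated flaky upload: fails twice, then succeeds.
            if (++calls[0] < 3) throw new IOException("503 Service Unavailable");
            return "uploaded";
        }, 5, 100L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying helps with scattered errors, but during a full outage it mostly adds load, which is why the attempt count and delay are capped here.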


It seems to have come back now for me. Could someone else confirm the same?


Update: We are beginning to see recovery in error rates and latencies for Amazon S3.


Yes, I noticed things getting better around 3:30 AM Pacific.


I'm assuming this is also why I can't start any instances.

> 12:51 AM PDT We are investigating increased error rates for the EC2 APIs and launch failures for new EC2 instances in the US-EAST-1 Region.


Same here


I'm also having issues connecting to buckets based in Ireland (eu-west-1). Just hangs at authentication stage. Tried from 3 different internet connections, all having the same problem.


Ditto. API does not answer at all, latency is high and customers are not amused. Outage has lasted for hours already.

edit: does not seem to affect all the buckets though. Only one of ours is experiencing this, others are fine.


For what it's worth, eu-west-1 buckets are working fine for me here, via s3cmd and aws cli.


Looks like it's getting better on our side now


Now updates about ELB scaling and Lambda failures.


  3:36 AM PDT Customers should start to see declines in elevated errors and
  latencies in the Amazon S3 service.
Fixed?


We are seeing lots of 503s, empty response bodies, and peer reset / dropped connections.


I'm also getting problems with Cloudfront attached to an S3 bucket


Same here, first it only affected some dev deployment (s3+cdn), now it spilled over to other buckets :-/


I see this thread as a list of services depending on S3 being healthy.


Also a good list of developers and devops people not getting any sleep tonight :-/


Yup!

3am "Dry Run" Staging on Heroku...fail

7am Deployment to Production....fail

Now: "We have confirmed elevated latencies affecting our SendEmail, SendRawEmail and SendSMTPEmail APIs in the US-EAST-1 Region and are working to address the problem."

Which is perfect since most of our PO orders are being placed between 6am-10am.

Tonight, adult beverages will be needed after everything is resolved.


same here. our s3 services are reporting similar 503's and network timeouts. a few of our partners are already down as well with their own 500s. another stormy night in the cloud.


+1 for us it's CDN (CloudFront) - only HTTP 503 responses


Can't launch instances in EC2 in US-East-1 at the moment.


It appears EC2 is affected as well now:

12:51 AM PDT We are investigating increased error rates for the EC2 APIs and launch failures for new EC2 instances in the US-EAST-1 Region.


Probably just fallout from the fact that EBS snapshots are stored in S3. If you can't create an EBS volume, you won't be able to launch an EC2 instance from it.


It appears that S3-backed AMIs also fail to launch entirely, showing as unavailable.


Yep. Can't even get a response to a s3cmd command.


The AWS API still works while the AWS web console does not.


Why is it behaving like this every other day?


When was the last time AWS had two consecutive major outages within, maybe, 30 days?



+1. Also, it was down on the leap second. http://mashable.com/2015/06/30/aws-disruption


To be fair, that was not really AWS's fault, nor (apparently) a leap second issue:

http://www.bgpexpert.com/article.php?article=167

https://twitter.com/Axcelx/status/616058414746202113


Current list of additional services affected:

CloudSearch

Elastic Compute Cloud

Elastic Load Balancing

Elastic MapReduce

Relational Database Service

CloudTrail

Config

Lambda

OpsWorks


yep, we're seeing timeouts and 404s for images stored on s3 :(

good luck to the on-call engineers at amazon!


even ec2 is down:

Amazon Elastic Compute Cloud (N. Virginia) Increased API Error Rates 12:51 AM PDT We are investigating increased error rates for the EC2 APIs and launch failures for new EC2 instances in the US-EAST-1 Region.


EC2 is not down.


It's up, but launching new instances doesn't work due to the S3 issue.


I can still launch new instances.


loading fonts on aws pages is slow as hell because of this https://aws.amazon.com/ses/


What a wonderful beginning of a week! Thanks AWS.


Lots more complaints on Twitter...


Yep, seeing the same.


oh... seeing the same


wtf, aws again


Still not fixed, huh.


It's been two hours. Luckily it happened around 12:00 AM; otherwise we would be having a bad day.

Whoever is a DevOps engineer or sysadmin probably cannot sleep tonight :(.

In our case, we put Fastly on top of our assets/images so that only a portion of requests get errors. The cached objects on Fastly are still fine.


Actually, in some parts of the world it's not 12:00 AM, you know, and there are sysadmins/devops there too :-)


11:40am: Currently can't push to Heroku :(


yeah, forgot that :(. Hope it comes back soon.


Our European customers are not amused.


haha if two hours is nothing to you, that's pretty lucky. We work 24x7 (global customers, yes US is a bit busier but not by a lot) and we're an Australian company, so not a whole lot of fun.


Yeah, sorry to hear that :(. I mean it's already two hours, a long time.

However, if it doesn't come back before 5AM I think we are screwed :(



