AWS having major issues (amazon.com)
136 points by pegler on July 31, 2015 | 62 comments

Can confirm we are also experiencing massive issues; it took AWS about an hour to update the status page. We attempted to file a support ticket and that timed out. When the request finished it had created 10 duplicate tickets, and we are still waiting on the phone call from AWS regarding the ticket.

All of our OpsWorks instances are currently being shut down; we have already had 20 servers automatically terminated...

There goes the weekend!!

We have 4 projects in Amazon OpsWorks, all of them in eu-west. Only one of them has been affected. ELB removed instances from balancing, and then instances were automatically shut down. Now when we try to create new instances they freeze with "requested" status.

What an awful way to start a day (in Europe).

Update: we just found another of our projects has the worker instance being shut down (by Amazon, we didn't touch it).

Update 2: our new instances are finally being created and running setup.

Update 3: we found that the Health Check for our Elastic Load Balancer changed its path to the default one (index.html). We had to edit it again. This is weird; I don't like the idea of Amazon changing our Health Check path for no reason.
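For anyone wanting to catch this kind of silent reset, here is a minimal sketch that compares each Classic ELB's health check target against the value you expect. It assumes boto3's `elb` client and its `describe_load_balancers` response shape; the load balancer names and the `/status` path in the usage note are hypothetical, and pagination is ignored for brevity.

```python
def health_check_drift(lb_descriptions, expected_targets):
    """Return {lb_name: (expected, actual)} for every load balancer whose
    health check target differs from the expected one.

    lb_descriptions: the "LoadBalancerDescriptions" list returned by the
    Classic ELB describe_load_balancers call.
    expected_targets: dict of lb name -> target string, e.g. "HTTP:80/status".
    """
    drift = {}
    for lb in lb_descriptions:
        name = lb["LoadBalancerName"]
        actual = lb["HealthCheck"]["Target"]
        expected = expected_targets.get(name)
        # Load balancers we have no expectation for are skipped.
        if expected is not None and actual != expected:
            drift[name] = (expected, actual)
    return drift


def fetch_lb_descriptions(elb_client):
    # elb_client is assumed to be boto3.client("elb").
    # Only the first page of results is read in this sketch.
    return elb_client.describe_load_balancers()["LoadBalancerDescriptions"]
```

Calling `health_check_drift(fetch_lb_descriptions(elb), {"web": "HTTP:80/status"})` on a schedule would flag a target that had reverted to `HTTP:80/index.html`.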

I'm in the same situation on US West on OpsWorks. Servers all began to terminate and will not start back up. I put too many eggs in the OpsWorks basket, I suppose.

All our production servers in OpsWorks had "Connection lost" status. Currently most of them are back "online", but from time to time downtime (a few minutes) comes back on some instances. AWS Support is not working (50x and 40x errors); they have moved support for production to amazon.com.

Hey javiecr,

Can you please post your OpsWorks instance IDs on the AWS forums and just copy and paste what you wrote here?



Do you work for Amazon? We already sent a support ticket using http://www.amazon.com/gp/html-forms-controller/support-cente...

Most of our OpsWorks projects in EU-West went down, and all the critical ones at that. I envy your luck.

Chiming in as I was completely unaffected... maybe this can help someone here.

I switched our applications off of OpsWorks a couple months back after losing faith in the OW team, in favor of a lower-level CI flow using auto-scaling groups and CodeDeploy.

Aside from some headaches in the very beginning (where CodeDeploy maintenance window would crash the daemon and health checks would cause an "initialize/destruct" loop) it precluded this issue entirely.

Two of our boxes were affected by the outage and disabled, however they were re-initialized within minutes by the ASG, meaning we experienced essentially no downtime.

Just be aware that if you use CodeDeploy, it's essentially just a low-level deployment hook which takes the packaged revision passed to S3 from your continuous integration setup, unpacks it, and runs any initialization scripts you require. You'll need to configure the security groups and scaling policies on your own, which is something I know OpsWorks tries to make easier with their higher-level app/layer constructs...

Thanks for sharing your ideas. Actually I want to do that as well. I'm so tired of OpsWorks, failing deploys and of course today's issue.

Could you share more about your CodeDeploy setup and auto-scaling?

AWS issues on Sysadmin Appreciation Day... what an irony.

For some users of AWS this would be irony; it's also schadenfreude for the Ops/Sysadmins who tell management types that just trusting everything to AWS is a bad idea.

EU-West is also experiencing problems. A lot of our instances are currently in connection_lost status.

EDIT: The console is working fine at the moment.

EDIT1: Apparently OpsWorks is hosted and managed by the North Virginia data center, which is why all our OpsWorks instances in EU-West are experiencing issues too.

Frankfurt is up and running. Only the Web Console is slower.

thanks. updated the title to reflect that

Ok. If you're receiving errors, and you're NOT using opsworks, please respond. We're using opsworks too and have ~30 servers down. Maybe we all should be looking at the opsworks agent.

We started getting SQS errors from us-east-1 around 05:08 UTC.

We also have a bunch of files on S3 which Cross-Region Replication hasn't yet replicated... I think that depends on SQS as well.

Can confirm. We were also seeing SQS errors around that time. We had some Opsworks instances reboot as well, but only minor outages overall (us-east and us-west).

We are getting errors across multiple AWS APIs. It's nothing to do with OpsWorks itself; rather, it looks like there is an internal networking issue.

Both SQS and SNS were erroring and now SQS has gone down completely with all requests timing out.

Looks like there's critical infrastructure in us-east-1 that's broken and causing a ripple effect across all of AWS

Our platform is entirely hosted in ap-southeast-2, but we've had our EC2 instances deregistered, and OpsWorks is reporting them terminated while EC2 is showing them active and they're still reachable via SSH.

Yeah, we don't use OpsWorks and had SQS/SNS/SES trouble as well -- thankfully those are not used to serve production traffic. From the set of services affected, it looks like Amazon's internal Kafka-like pub/sub system went down.

I lost my buildserver - uncontactable, but not terminated. Can't even 'force off'. It had nothing to do with any form of AWS provisioning (manual ansible; a job for Monday), and it's a relatively recent machine (couple of months, t2.medium). Got an email from AWS that the host had degraded, and I noticed the instance was having weird disk issues earlier.

"Description: The instance is running on degraded hardware"

SES errors, no opsworks.

Having issues. Not using OpsWorks. It is not limited to that.

It was about an hour before the status page indicated even the potential for a fault, and we were seeing solid errors the entire time.

Currently can't raise support tickets either, so left in the dark until they fix that...

Heads up: if you're using OpsWorks, your instances may have been removed from their ELB.

exactly what is happening here.

We are in Singapore and our ELB is automatically deregistering our instances. It has happened twice in the last hour, causing system downtime.

we use Opsworks.

Ok, I can't tell if this is functioning normally. I am trying to launch a new instance. Numerous services have "increased API error rates"; one even said "Elevated error rates".

All service panels were green, indicating "Service is operating normally".

I can't put confidence in this. Does anyone have any info on whether this is resolved? I can do some of it on DigitalOcean, but I really need a few things on AWS. Confirmed up? Confirmed working? Info?

Edit: Successfully launched an AMI instance and it deployed successfully and is accessible from shell and ip.

We seem to have had issues and we're not using OpsWorks. I'm trying to determine if our issues are related.

Our instances are Docker hosts; the network seemed to lag/stop when proxying traffic to the internal container IP addresses.

Our ASG spun up other instances but health checks reported "Insufficient Data". The web console also seems buggy (API requests are failing).

So three different environments have had the ELB just release all of their clients and not come back onboard. Manual intervention was the only way.

What's the recommended way to monitor whether instances have become detached or unresponsive? I want to immediately alert Slack or send an email, etc.

There are a lot of ways to integrate, but we use datadog and I generally find it excellent. Alerts to hipchat, pagerduty, etc, with AWS Cloudwatch integration. Of course, you have to assume the Cloudwatch API is working... :)

A vote for datadog here too. Just started using it - it's clear that someone there really loves data.
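If you would rather not depend on a third-party service, a rough do-it-yourself version of the check asked about above might look like the following. This is only a sketch under assumptions: it uses the Classic ELB `describe_instance_health` call from boto3 and a Slack incoming webhook, and the list of expected instance IDs, the load balancer name, and the webhook URL are all things you would supply yourself. As noted above, it also assumes the AWS API is reachable at all.

```python
import json
import urllib.request


def out_of_service(instance_states):
    """Return IDs of instances the ELB does not report as InService."""
    return [s["InstanceId"] for s in instance_states
            if s.get("State") != "InService"]


def build_alert(lb_name, expected_ids, instance_states):
    """Return an alert message, or None if nothing looks wrong.

    expected_ids: instance IDs we believe should be registered.
    instance_states: the "InstanceStates" list from describe_instance_health.
    """
    registered = {s["InstanceId"] for s in instance_states}
    missing = sorted(set(expected_ids) - registered)  # silently deregistered
    unhealthy = sorted(out_of_service(instance_states))
    if not missing and not unhealthy:
        return None
    return f"ELB {lb_name}: deregistered={missing} out_of_service={unhealthy}"


def post_to_slack(webhook_url, text):
    # Slack incoming webhooks accept a JSON body with a "text" field.
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


def check_once(elb_client, lb_name, expected_ids, webhook_url):
    # elb_client is assumed to be boto3.client("elb").
    response = elb_client.describe_instance_health(LoadBalancerName=lb_name)
    alert = build_alert(lb_name, expected_ids, response["InstanceStates"])
    if alert:
        post_to_slack(webhook_url, alert)
```

Running `check_once` from cron every minute, ideally from outside the affected region, would cover both the "deregistered" and the "out of service" cases described in this thread.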

Their status history is 100% green for the last week. You must be wrong about major issues. /s

All my (5) instances (OpsWorks) in South America had status "Stopping" and were detached from the ELB. I manually stopped and started them again. Now it is working but, again, I have 2 instances with "Stopping" status. Welcome weekend!

Half of our servers in Singapore (100+) are all down, and we can't access the console either. Damn it.

Hey fredonrails, what you can do is just log in to the console. When you see the 500 error page, just hit the back button and you will be logged into the account. This seemed to work for me, please try it.

Our Amazon _POST_ORDER_FULFILLMENT_DATA_ feeds have started going through now so it must be nearly fixed!!! I thought I was going to have to tell all our customers they need to manually dispatch all their orders. Woo hoo!!!

Killing me. I was wondering why my sites weren't working, and console access is killed. Can't log in, can't find any information. Thanks for this. Huge.

Hey vonklaus, what you can do is just log in to the console. When you see the 500 error page, just hit the back button and you will be logged into the account.

Couldn't get that to work for me. I tried an incognito window, hitting the back button, etc. I thought it was something specific to me, as I had just killed a shell session and deleted my only droplet when I noticed. It's up now though. Glad I saw this post, thanks.

I have been in that same situation before so I built StatusGator.io. It basically alerts you when a service posts a status update so you aren't spinning your wheels trying to figure out what's wrong with your app.

SQS seems to be resolved now. (As per website)

Fortunately the OpsWorks service wasn't able to contact the EC2 API, so although it attempted to terminate the servers, the termination call failed. The instances are still alive and functioning, but there is now a disconnect between OpsWorks and EC2.
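One way to see how large that disconnect is would be to compare what OpsWorks believes with what EC2 reports. The sketch below assumes the boto3 `opsworks` and `ec2` clients and their `describe_instances` response shapes; the helper names are made up for illustration and pagination is ignored.

```python
def find_mismatches(opsworks_instances, ec2_states):
    """Return (ec2_id, opsworks_status, ec2_state) tuples for instances that
    OpsWorks does not consider online but EC2 still reports as running.

    opsworks_instances: the "Instances" list from the OpsWorks
    describe_instances call (dicts with "Ec2InstanceId" and "Status").
    ec2_states: dict of EC2 instance ID -> state name, e.g. "running".
    """
    mismatches = []
    for instance in opsworks_instances:
        ec2_id = instance.get("Ec2InstanceId")
        if ec2_id is None:
            continue  # never launched, nothing to compare against
        if instance["Status"] != "online" and ec2_states.get(ec2_id) == "running":
            mismatches.append((ec2_id, instance["Status"], "running"))
    return mismatches


def fetch_states(opsworks_client, ec2_client, stack_id):
    # Clients are assumed to be boto3.client("opsworks") and boto3.client("ec2").
    ow_instances = opsworks_client.describe_instances(StackId=stack_id)["Instances"]
    ids = [i["Ec2InstanceId"] for i in ow_instances if "Ec2InstanceId" in i]
    ec2_states = {}
    if ids:
        # A filter (rather than InstanceIds=) avoids an error for IDs that
        # no longer exist in EC2.
        response = ec2_client.describe_instances(
            Filters=[{"Name": "instance-id", "Values": ids}]
        )
        for reservation in response["Reservations"]:
            for inst in reservation["Instances"]:
                ec2_states[inst["InstanceId"]] = inst["State"]["Name"]
    return ow_instances, ec2_states
```

Anything returned by `find_mismatches(*fetch_states(opsworks, ec2, stack_id))` is a candidate for manual review before trusting either side's view.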

I go on vacation and everything collapses; guess they will chain me to the desk next time :-)

I can relate

Servers in Japan seem ok, but e-mail (N. Virginia) is more or less dead for me.

We've been having issues with SES N.Virginia, but have managed to get mail through. It's intermittent, but we are getting our batches out. It's not pretty.

Our e-mails are more or less mission-critical, so we try to send, then try to send again -- I suspect the additional failures are causing this to back up a bit. For concrete numbers, of 27 e-mails we tried to send, only 2 got out successfully.

Also of interest is that I only got a 454 ("Temporary service failure") three times -- the rest of the time, it just freezes/times out in ssl.
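Since immediate retries can pile up during an outage like this, one option is to retry transient failures with exponential backoff and jitter. The following is a minimal sketch, not what the commenter is running: `send` is a hypothetical zero-argument callable that performs one SMTP send attempt, and the attempt count and delay values are arbitrary assumptions.

```python
import random
import smtplib
import time


def send_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it succeeds, backing off between transient failures.

    Returns the number of attempts used. Re-raises the last error if every
    attempt fails, or immediately on a permanent (non-4xx) SMTP response.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except smtplib.SMTPResponseException as exc:
            # 4xx replies such as 454 are temporary; anything else is
            # treated as permanent and not retried.
            if not 400 <= exc.smtp_code < 500 or attempt == max_attempts:
                raise
        except OSError:
            # Covers dropped connections, socket timeouts and SSL errors.
            if attempt == max_attempts:
                raise
        # Exponential backoff with full jitter, capped at max_delay.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, delay))
```

The jitter spreads retries out so that many senders recovering at once do not all hit SES at the same moment.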

Although it's looking like end-user deliverability is not great we are seeing huge delays in things actually getting out of SES.

We had been fine for the last few hours (after the outage this morning), but just 10 minutes ago our ELB again automatically deregistered all the EC2 instances. Is the issue still happening?

We were also affected in Frankfurt, really scary. Does anyone know of a managed Chef service like OpsWorks that is cloud-agnostic, with a little less vendor lock-in?

Using OpsWorks. Got several deploys timing out as "unreachable" between 2h ago and now.

All our OpsWorks servers have been unreachable from the internet since around 05:13Z, when we see a mysterious 'configure' command with no further details in the OpsWorks logs.

EDIT: they are in EU-West zone btw

AWS SES 454 Temporary service failure

govcloud seems ok

So many pages...

There is your reliable cloud :) You can always count on cloud, right?

What a pointless comment. Nobody expects 100% uptime on any cloud service.

Are private servers 100% reliable?

Is your ISP 100% reliable? Even with the perfect local infrastructure there is still a single point of failure.
