Can confirm we are also experiencing massive issues; it took AWS about an hour to update the status page. Attempted to file a support ticket and that timed out; when the request finally finished it had created 10 duplicate tickets, and we're still waiting on the phone call from AWS regarding the ticket.
We have 4 projects in Amazon OpsWorks, all of them in eu-west. Only one of them has been affected. The ELB removed the instances from balancing, and then the instances were automatically shut down. Now, when we try to create new instances, they freeze in "requested" status.
What an awful way to start a day (in Europe).
Update: we just found that another of our projects had its worker instance shut down (by Amazon; we didn't touch it).
Update 2: our new instances are finally being created and running setup.
Update 3: we found that the Health Check for our Elastic Load Balancer had its path changed to the default one (index.html). We had to edit it again. This is weird; I don't like the idea of Amazon changing our Health Check path for no reason.
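For anyone wanting to double-check their own load balancer the same way, here's a rough boto3 sketch of the check we now run (the load balancer name and target path are placeholders, and it assumes a classic ELB):

    import boto3

    elb = boto3.client("elb", region_name="eu-west-1")

    lb_name = "my-production-elb"             # placeholder name
    expected_target = "HTTP:80/healthcheck"   # the path we actually configured

    desc = elb.describe_load_balancers(LoadBalancerNames=[lb_name])
    current = desc["LoadBalancerDescriptions"][0]["HealthCheck"]

    if current["Target"] != expected_target:
        # Put the health check back the way we had it
        elb.configure_health_check(
            LoadBalancerName=lb_name,
            HealthCheck={
                "Target": expected_target,
                "Interval": 30,
                "Timeout": 5,
                "UnhealthyThreshold": 2,
                "HealthyThreshold": 2,
            },
        )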
I'm in the same situation on US West on OpsWorks. Servers all began to terminate and will not start back up. I put too many eggs in the OpsWorks basket, I suppose.
All our production servers in OpsWorks had "Connection lost" status. Currently most of them are back "online", but from time to time the downtime (a few minutes) comes back on some instances. AWS Support is not working (50x and 40x errors) - they have moved the support for production to amazon.com
Chiming in as I was completely unaffected... maybe this can help someone here.
I switched our applications off of OpsWorks a couple months back after losing faith in the OW team, in favor of a lower-level CI flow using auto-scaling groups and CodeDeploy.
Aside from some headaches in the very beginning (where the CodeDeploy maintenance window would crash the daemon and health checks would cause an "initialize/destruct" loop), it precluded this issue entirely.
Two of our boxes were affected by the outage and disabled; however, they were re-initialized within minutes by the ASG, meaning we experienced essentially no downtime.
Just be aware that if you use CodeDeploy, it's essentially just a low-level deployment hook: it takes the packaged revision your continuous integration setup pushes to S3, unpacks it, and runs any initialization scripts you require. You'll need to configure the security groups and scaling policies on your own, which is something I know OpsWorks tries to make easier with its higher-level app/layer constructs...
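To make that concrete, here's a rough sketch of the one call our CI makes once the bundle lands in S3; CodeDeploy then unpacks it on each instance and runs the lifecycle hooks. All names here are placeholders:

    import boto3

    cd = boto3.client("codedeploy", region_name="us-east-1")

    cd.create_deployment(
        applicationName="my-app",                 # placeholder
        deploymentGroupName="my-app-production",  # placeholder
        revision={
            "revisionType": "S3",
            "s3Location": {
                "bucket": "my-ci-artifacts",      # placeholder
                "key": "my-app/build-1234.zip",
                "bundleType": "zip",
            },
        },
    )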
For some users of AWS this would be irony; it's also schadenfreude for the Ops/Sysadmins who tell management types that just trusting everything to AWS is a bad idea.
EU-West is also experiencing problems. A lot of our instances are currently in connection_lost status.
EDIT: The console is working fine at the moment
EDIT1: Apparently OpsWorks is hosted and managed out of the North Virginia region, which is why all our OpsWorks instances in EU-West are experiencing issues too.
Ok. If you're receiving errors, and you're NOT using opsworks, please respond. We're using opsworks too and have ~30 servers down. Maybe we all should be looking at the opsworks agent.
Can confirm. We were also seeing SQS errors around that time. We had some Opsworks instances reboot as well, but only minor outages overall (us-east and us-west).
We are getting errors across multiple AWS APIs. It's nothing to do with OpsWorks itself; rather, it appears there is an internal networking issue.
Both SQS and SNS were erroring, and now SQS has gone down completely with all requests timing out.
Looks like there's critical infrastructure in us-east-1 that's broken and causing a ripple effect across all of AWS
Our platform is entirely hosted in ap-southeast-2, but we've had our EC2 instances deregistered, and OpsWorks is reporting them terminated while EC2 shows them active and they're still reachable via SSH.
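For anyone else trying to work out which side to believe, this is the quick boto3 cross-check we ran (the stack ID is a placeholder; as noted elsewhere in the thread, the OpsWorks API itself appears to be served out of us-east-1):

    import boto3

    opsworks = boto3.client("opsworks", region_name="us-east-1")
    ec2 = boto3.client("ec2", region_name="ap-southeast-2")

    stack_id = "00000000-0000-0000-0000-000000000000"  # placeholder stack ID

    for inst in opsworks.describe_instances(StackId=stack_id)["Instances"]:
        ec2_id = inst.get("Ec2InstanceId")
        if not ec2_id:
            continue
        res = ec2.describe_instances(InstanceIds=[ec2_id])
        state = res["Reservations"][0]["Instances"][0]["State"]["Name"]
        print(inst["Hostname"], "opsworks:", inst["Status"], "ec2:", state)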
Yeah, we don't use OpsWorks and had SQS/SNS/SES trouble as well -- thankfully those are not used to serve production traffic. From the set of services affected, it looks like Amazon's internal Kafka-like pub/sub system went down.
I lost my buildserver - uncontactable, but not terminated. Can't even 'force off'. It had nothing to do with any form of AWS provisioning (manual ansible; a job for Monday), and it's a relatively recent machine (couple of months, t2.medium). Got an email from AWS that the host had degraded, and I noticed the instance was having weird disk issues earlier.
"Description:
The instance is running on degraded hardware"
Ok, I can't tell if this is functioning normally. I am trying to launch a new instance. Numerous services have "increased API error rates"; one even said "Elevated error rates".
All the service panels were green, indicating "Service is operating normally".
I can't put confidence in this. Does anyone have any info on whether this is resolved? I can do some of this on DigitalOcean, but I really need a few things on AWS. Confirmed up? Confirmed working? Info?
Edit: Successfully launched an AMI instance and it deployed successfully and is accessible from shell and ip.
There are a lot of ways to integrate, but we use datadog and I generally find it excellent. Alerts to hipchat, pagerduty, etc, with AWS Cloudwatch integration. Of course, you have to assume the Cloudwatch API is working... :)
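If you want a fallback that doesn't route through a third party at all, a minimal sketch (not our actual config; the names and SNS topic ARN are placeholders) is a plain CloudWatch alarm on ELB healthy host count that notifies an SNS topic, which Datadog/PagerDuty can then pick up:

    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")

    cw.put_metric_alarm(
        AlarmName="elb-healthy-hosts-low",
        Namespace="AWS/ELB",
        MetricName="HealthyHostCount",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-production-elb"}],
        Statistic="Minimum",
        Period=60,
        EvaluationPeriods=3,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )

Of course, this still assumes CloudWatch itself is healthy, which is exactly the caveat above.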
All 5 of my OpsWorks instances in South America were in "Stopping" status and detached from the ELB. I manually stopped and started them again. Now it is working, but ~ again ~ I have 2 instances in "Stopping" status. Welcome, weekend!
Hey fredonrails, what you can do is just log in to the console; when you see the 500 error page, just hit the back button and you will be logged into the account. This seemed to be working for me, please try.
Our Amazon _POST_ORDER_FULFILLMENT_DATA_ feeds have started going through now so it must be nearly fixed!!! I thought I was going to have to tell all our customers they need to manually dispatch all their orders. Woo hoo!!!
Couldn't get that to work for me. I tried an incognito window, hitting the back button, etc. I thought it was something specific to me, as I had just killed a shell session and deleted my only droplet when I noticed. It's up now, though; glad I saw this post, thanks.
I have been in that same situation before so I built StatusGator.io. It basically alerts you when a service posts a status update so you aren't spinning your wheels trying to figure out what's wrong with your app.
Fortunately the OpsWorks service wasn't able to contact the EC2 API, so although it attempted to terminate the servers, the termination call failed. The instances are still alive and functioning, but there is now a disconnect between OpsWorks and EC2.
We've been having issues with SES N.Virginia, but have managed to get mail through. It's intermittent, but we are getting our batches out. It's not pretty.
Our e-mails are more or less mission-critical, so we try to send, then try to send again -- I suspect the additional failures are causing this to back up a bit. For concrete numbers, of 27 e-mails we tried to send, only 2 got out successfully.
Also of interest is that I only got a 454 ("Temporary service failure") three times -- the rest of the time, it just freezes/times out in ssl.
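For what it's worth, our retry is nothing fancy; it's roughly the sketch below, though in production we go through the SMTP interface (where the 454s and SSL timeouts show up), so take this boto3 version as just illustrating the retry idea. Addresses are placeholders:

    import time
    import boto3
    from botocore.exceptions import ClientError

    ses = boto3.client("ses", region_name="us-east-1")

    def send_with_retry(to_addr, subject, body, attempts=3):
        for attempt in range(attempts):
            try:
                ses.send_email(
                    Source="noreply@example.com",          # placeholder sender
                    Destination={"ToAddresses": [to_addr]},
                    Message={
                        "Subject": {"Data": subject},
                        "Body": {"Text": {"Data": body}},
                    },
                )
                return True
            except ClientError:
                time.sleep(2 ** attempt)  # back off and try again
        return False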
We had been all good for the last few hours (after the outage session this morning), but just 10 minutes ago our ELB again automatically deregistered all the EC2 instances. Is the issue still happening?
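In case it helps anyone in the same spot, here's the rough boto3 check we run when the ELB drops instances: see what the ELB currently has registered, then re-register anything missing (the load balancer name and instance IDs are placeholders, classic ELB assumed):

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    lb_name = "my-production-elb"                              # placeholder
    expected = ["i-0123456789abcdef0", "i-0123456789abcdef1"]  # placeholder IDs

    health = elb.describe_instance_health(LoadBalancerName=lb_name)
    registered = {s["InstanceId"] for s in health["InstanceStates"]}

    missing = [i for i in expected if i not in registered]
    if missing:
        elb.register_instances_with_load_balancer(
            LoadBalancerName=lb_name,
            Instances=[{"InstanceId": i} for i in missing],
        )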
We were also affected in Frankfurt, really scary. Does anyone know of a managed Chef service like OpsWorks that would be cloud-agnostic, with a little less vendor lock-in?
All our OpsWorks servers have been unreachable from the internet since around 05:13Z, when we see a mysterious 'configure' command with no further details in the OpsWorks logs.