There goes the weekend!!
What an awful way to start a day (in Europe).
Update: we just found that another of our projects had its worker instance shut down (by Amazon, we didn't touch it).
Update 2: our new instances are finally being created and running setup.
Update 3: we found that the Health Check for our Elastic Load Balancer had its path changed to the default one (index.html). We had to edit it back again. This is weird; I don't like the idea of Amazon changing our Health Check path for no reason. (Quick sketch of the fix below.)
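If it helps anyone, the same fix can be scripted instead of done in the console (a minimal boto3 sketch; the load balancer name and health-check path are placeholders, not our actual values):

    import boto3

    elb = boto3.client("elb", region_name="eu-west-1")

    # Point the classic ELB health check back at our own path
    # instead of the default index.html.
    elb.configure_health_check(
        LoadBalancerName="my-load-balancer",   # placeholder name
        HealthCheck={
            "Target": "HTTP:80/health",        # placeholder path
            "Interval": 30,
            "Timeout": 5,
            "UnhealthyThreshold": 2,
            "HealthyThreshold": 2,
        },
    )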
Can you please post your OpsWorks instance IDs on the AWS forums and just copy/paste what you wrote here?
I switched our applications off OpsWorks a couple of months back after losing faith in the OW team, in favor of a lower-level CI flow using Auto Scaling groups and CodeDeploy.
Aside from some headaches in the very beginning (where the CodeDeploy maintenance window would crash the daemon and health checks would cause an "initialize/destruct" loop), that setup avoided this issue entirely.
Two of our boxes were affected by the outage and disabled; however, they were re-initialized within minutes by the ASG, so we experienced essentially no downtime.
Just be aware that if you use CodeDeploy, it's essentially just a low-level deployment hook: it takes the packaged revision your continuous integration setup pushes to S3, unpacks it, and runs any initialization scripts you require. You'll need to configure the security groups and scaling policies on your own, which is something I know OpsWorks tries to make easier with its higher-level app/layer constructs... (rough sketch of the deploy call below).
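For the curious, the deploy step itself boils down to one API call (a minimal boto3 sketch; the application, deployment group, bucket, and key names are placeholders, not our actual setup):

    import boto3

    cd = boto3.client("codedeploy", region_name="us-east-1")

    # Deploy the revision that the CI pipeline already pushed to S3;
    # CodeDeploy unpacks it on each instance in the deployment group
    # and runs the lifecycle hooks defined in the bundle's appspec.yml.
    response = cd.create_deployment(
        applicationName="my-app",                  # placeholder
        deploymentGroupName="my-app-production",   # placeholder; targets the ASG
        revision={
            "revisionType": "S3",
            "s3Location": {
                "bucket": "my-ci-artifacts",       # placeholder bucket
                "key": "my-app/build-1234.zip",    # placeholder key
                "bundleType": "zip",
            },
        },
    )
    print(response["deploymentId"])

Associating the deployment group with the Auto Scaling group means instances the ASG launches automatically receive the latest revision, which is what kept our downtime near zero during the outage.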
Could you share more about your CodeDeploy setup and auto-scaling?
EDIT: The console is working fine at the moment
EDIT1: Apparently OpsWorks is hosted and managed out of the North Virginia data center, which is why all our OpsWorks instances in EU-West are experiencing issues too.
We also have a bunch of files on S3 which Cross-Region Replication hasn't yet replicated... I think that depends on SQS as well.
Both SQS and SNS were erroring, and now SQS has gone down completely, with all requests timing out.
Our platform is entirely hosted in ap-southeast-2, but we've had our EC2 instances deregistered, and OpsWorks is reporting them as terminated while EC2 shows them as active; they're still reachable via SSH.
"The instance is running on degraded hardware."
Currently we can't raise support tickets either, so we're left in the dark until they fix that...
All service panels were green, indicating: "Service is operating normally."
I can't put confidence in this. Does anyone have any info on whether this is resolved? I can run some things on DigitalOcean, but I really need a few things on AWS. Confirmed up? Confirmed working? Any info?
Edit: Successfully launched an instance from an AMI; it deployed successfully and is accessible via shell and IP.
Our instances are Docker hosts; the network seemed to lag/stop when proxying traffic to the internal container IP addresses.
Our ASG spun up other instances but health checks reported "Insufficient Data". The web console also seems buggy (API requests are failing).
What's the recommended way to monitor whether instances have become detached or non-responsive? We want to immediately alert Slack or send an email, etc.
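One pattern that might fit (a rough boto3 sketch, not an official recommendation; the instance ID and SNS topic ARN are placeholders): a CloudWatch alarm on the EC2 StatusCheckFailed metric that publishes to an SNS topic, which can then fan out to email or to a Lambda that posts to a Slack webhook.

    import boto3

    cw = boto3.client("cloudwatch", region_name="eu-west-1")

    # Alarm when the instance fails its EC2 status checks for two
    # consecutive minutes; the SNS topic handles the fan-out
    # (email subscription, or a Lambda posting to Slack).
    cw.put_metric_alarm(
        AlarmName="instance-unresponsive-i-0123456789abcdef0",      # placeholder
        Namespace="AWS/EC2",
        MetricName="StatusCheckFailed",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=2,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # placeholder ARN
    )

Caveat: in a regional outage like today's, CloudWatch and SNS in the affected region may be degraded too, so an external check is a useful backstop.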
Also of interest is that I only got a 454 ("Temporary service failure") three times; the rest of the time, it just freezes/times out in SSL.
EDIT: they're in the EU-West region, btw.