

AWS having major issues - pegler
http://status.aws.amazon.com/#31-july

======
codingninja
Can confirm we are also experiencing massive issues; it took AWS about an hour
to update the status page. We attempted to file a support ticket and that timed
out; when the request finally finished it had created 10 duplicate tickets, and
we're still waiting on the phone call from AWS regarding the ticket.

~~~
codingninja
All of our OpsWorks instances are currently being shut down; we have already
had 20 servers automatically terminated...

There goes the weekend!!

------
javiercr
We have 4 projects in Amazon OpsWorks, all of them in eu-west. Only one of
them has been affected. The ELB removed its instances from balancing, and then
the instances were automatically shut down. Now when we try to create new
instances, they freeze in "requested" status.

What an awful way to start a day (in Europe).

Update: we just found that another of our projects has had its worker instance
shut down (by Amazon, we didn't touch it).

Update 2: our new instances are finally being created and running setup.

Update 3: we found that the health check for our Elastic Load Balancer had
changed its path back to the default one (index.html). We had to edit it
again. This is weird; I don't like the idea of Amazon changing our health
check path for no reason.
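
For reference, a rough sketch of setting the health check path back with
boto3, in case it helps anyone else (the load balancer name and target path
below are placeholders, not our actual setup):

    # Rough sketch: point a classic ELB health check back at the path we want
    # instead of the default index.html. Names below are placeholders.
    import boto3

    elb = boto3.client("elb", region_name="eu-west-1")

    elb.configure_health_check(
        LoadBalancerName="my-app-elb",         # hypothetical load balancer name
        HealthCheck={
            "Target": "HTTP:80/health_check",  # the path we expect, not index.html
            "Interval": 30,
            "Timeout": 5,
            "UnhealthyThreshold": 2,
            "HealthyThreshold": 3,
        },
    )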

~~~
toriaezu
Hey javiercr,

Can you please post your OpsWorks instance IDs on the AWS forums and just
copy/paste what you wrote here?

[https://forums.aws.amazon.com/forum.jspa?forumID=153](https://forums.aws.amazon.com/forum.jspa?forumID=153)

thanks.

~~~
javiercr
Do you work for Amazon? We already submitted a support ticket using
[http://www.amazon.com/gp/html-forms-controller/support-cente...](http://www.amazon.com/gp/html-forms-controller/support-center-issues-u)

------
j42
Chiming in as I was completely unaffected... maybe this can help someone here.

I switched our applications off of OpsWorks a couple months back after losing
faith in the OW team, in favor of a lower-level CI flow using auto-scaling
groups and CodeDeploy.

Aside from some headaches in the very beginning (where the CodeDeploy
maintenance window would crash the daemon and health checks would cause an
"initialize/destruct" loop), it precluded this issue entirely.

Two of our boxes were affected by the outage and disabled, however they were
re-initialized within minutes by the ASG, meaning we experienced essentially
no downtime.

Just be aware that if you use CodeDeploy, it's essentially just a low-level
deployment hook: it takes the packaged revision your continuous integration
setup pushes to S3, unpackages it, and runs any initialization scripts you
require. You'll need to configure the security groups and scaling policies on
your own, which is something I know OpsWorks tries to make easier with its
higher-level app/layer constructs...
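
If it helps, a rough sketch of what kicking off a deployment from an S3
revision looks like with boto3 (the application, deployment group, and bucket
names below are placeholders, not our real setup):

    # Rough sketch: trigger a CodeDeploy deployment from a bundle your CI
    # pipeline has already uploaded to S3. All names below are placeholders.
    import boto3

    codedeploy = boto3.client("codedeploy", region_name="us-east-1")

    response = codedeploy.create_deployment(
        applicationName="my-app",                 # hypothetical application
        deploymentGroupName="my-app-production",  # tied to the auto-scaling group
        revision={
            "revisionType": "S3",
            "s3Location": {
                "bucket": "my-ci-artifacts",      # where CI drops the bundle
                "key": "releases/my-app-1.2.3.zip",
                "bundleType": "zip",
            },
        },
        description="Deploy build 1.2.3",
    )
    print(response["deploymentId"])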

~~~
fredonrails
Thanks for sharing your ideas. Actually I want to do that as well. I'm so
tired of OpsWorks: failing deploys and, of course, today's issue.

Could you share more about your CodeDeploy setup and auto-scaling?

------
technofriends
AWS issues on Sysadmin Appreciation Day... what irony.

~~~
stephenr
For some users of AWS this would be irony; it's also schadenfreude for the
ops/sysadmins who tell management types that just trusting everything to AWS
is a bad idea.

------
K0nserv
EU-West is also experiencing problems. A lot of our instances are currently in
connection_lost status.

EDIT: The console is working fine at the moment.

EDIT 1: Apparently OpsWorks is hosted and managed by the North Virginia data
center, which is why all our OpsWorks instances in EU-West are experiencing
issues too.

~~~
merb
Frankfurt is up and running. Only the Web Console is slower.

------
kacy
OK. If you're receiving errors and you're NOT using OpsWorks, please respond.
We're using OpsWorks too and have ~30 servers down. Maybe we should all be
looking at the OpsWorks agent.

~~~
codingninja
We are getting errors across multiple AWS APIs. It's nothing to do with
OpsWorks itself; rather, it appears there is an internal networking issue.

Both SQS and SNS were erroring, and now SQS has gone down completely with all
requests timing out.

~~~
oliyoung
Looks like there's critical infrastructure in us-east-1 that's broken and
causing a ripple effect across all of AWS.

Our platform is entirely hosted in ap-southeast-2, but we've had our EC2
instances deregistered and OpsWorks reporting them terminated, while EC2 is
showing them active and they're still reachable via SSH.

~~~
frankchn
Yeah, we don't use OpsWorks and had SQS/SNS/SES trouble as well -- thankfully
those are not used to serve production traffic. From the set of services
affected, it looks like Amazon's internal Kafka-like pub/sub system went down.

------
dc_gregory
It was ~an hour before the status page indicated even the potential for a
fault, and we were seeing solid errors the entire time.

Currently we can't raise support tickets either, so we're left in the dark
until they fix that...

------
phrotoma
Heads up: if you're using OpsWorks, your instances may have been removed from
their ELB.
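
A rough sketch of checking a classic ELB and re-registering the instances that
should be attached, with boto3 (the load balancer name and instance IDs below
are placeholders):

    # Rough sketch: list what the ELB currently has registered and re-register
    # any expected instances that are missing. Names/IDs are placeholders.
    import boto3

    elb = boto3.client("elb", region_name="us-east-1")
    LB_NAME = "my-app-elb"                  # hypothetical load balancer
    EXPECTED = ["i-0abc123", "i-0def456"]   # instances that should be attached

    health = elb.describe_instance_health(LoadBalancerName=LB_NAME)
    attached = {state["InstanceId"] for state in health["InstanceStates"]}

    missing = [i for i in EXPECTED if i not in attached]
    if missing:
        elb.register_instances_with_load_balancer(
            LoadBalancerName=LB_NAME,
            Instances=[{"InstanceId": i} for i in missing],
        )
        print("re-registered:", missing)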

~~~
fredonrails
Exactly what is happening here.

------
technofriends
We are in Singapore and our ELB is automatically deregistering our instances.
It has happened twice in the last hour, causing system downtime.

~~~
technofriends
We use OpsWorks.

------
vonklaus
OK, I can't tell if this is functioning normally. I am trying to launch a new
instance. Numerous services have "increased API error rates"; one even said
"elevated error rates".

Yet the services panel was all green, indicating "Service is operating
normally".

I can't put confidence in this. Does anyone have any info on whether this is
resolved? I can do some of it on DigitalOcean, but I really need a few things
on AWS. Confirmed up? Confirmed working? Info?

Edit: Successfully launched an AMI instance; it deployed successfully and is
accessible via shell and IP.

------
mmattax
We seem to have had issues and we're not using OpsWorks. I'm trying to
determine if our issues are related.

Our instances are Docker hosts; the network seemed to lag/stop when proxying
traffic to the internal container IP addresses.

Our ASG spun up other instances but health checks reported "Insufficient
Data". The web console also seems buggy (API requests are failing).

------
beilabs
So three different environments have had the ELB just release all of their
instances and not bring them back. Manual intervention was the only way.

What's the recommended way to monitor whether instances have become detached
or non-responsive? I want to immediately alert Slack or send an email, etc.

~~~
Rapzid
There are a lot of ways to integrate, but we use Datadog and I generally find
it excellent. Alerts go to HipChat, PagerDuty, etc., with AWS CloudWatch
integration. Of course, you have to assume the CloudWatch API is working... :)
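
If you'd rather stay on plain CloudWatch, a rough sketch of an alarm on the
ELB's HealthyHostCount that fires into an SNS topic (which can fan out to
email, Slack, etc.); the load balancer name and topic ARN below are
placeholders:

    # Rough sketch: alarm when a classic ELB drops below 1 healthy host for two
    # consecutive minutes, notifying an SNS topic. Names/ARNs are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="my-app-elb-no-healthy-hosts",
        Namespace="AWS/ELB",
        MetricName="HealthyHostCount",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-app-elb"}],
        Statistic="Minimum",
        Period=60,
        EvaluationPeriods=2,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )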

~~~
vacri
A vote for Datadog here too. Just started using it - it's clear that someone
there really loves data.

------
fizx
Their status history is 100% green for the last week. You must be wrong about
major issues. /s

------
danieltamiosso
All 5 of my OpsWorks instances in South America were in "Stopping" status and
detached from the ELB. I manually stopped and started them again. Now it is
working, but ~again~ I have 2 instances in "Stopping" status. Welcome, weekend!

------
fredonrails
Half of our servers in Singapore (100+) are down, and we can't access the
console either. Damn it.

~~~
uditsrn28
Hey fredonrails, what you can do is just log in to the console; when you see
the 500 error page, just hit the back button and you will be logged into the
account. This seemed to be working for me, please try it.

------
BIllyPeanutsDev
Our Amazon _POST_ORDER_FULFILLMENT_DATA_ feeds have started going through now
so it must be nearly fixed!!! I thought I was going to have to tell all our
customers they need to manually dispatch all their orders. Woo hoo!!!

------
vonklaus
This was killing me. I was wondering why my sites weren't working, and console
access is dead. Can't log in, can't find any information. Thanks for this.
Huge.

~~~
uditsrn28
Hey vonklaus, what you can do is just log in to the console; when you see the
500 error page, just hit the back button and you will be logged into the
account.

~~~
vonklaus
Couldn't get that to work for me. I tried an incognito window, hitting the
back button, etc. I thought it was something specific to me, as I had just
killed a shell session and deleted my only droplet when I noticed. It's up now
though; glad I saw this post, thanks.

------
shashwat986
SQS seems to be resolved now (as per the website).

------
codingninja
Fortunately the OpsWorks service wasn't able to contact the EC2 API, so
although it attempted to terminate the servers, the termination call failed.
The instances are still alive and functioning, but there is now a disconnect
between OpsWorks and EC2.

------
atopuzov
I go on vacation and everything collapses; I guess they will chain me to the
desk next time :-)

~~~
fasteo
I can relate

------
sdrothrock
Servers in Japan seem ok, but e-mail (N. Virginia) is more or less dead for
me.

~~~
benjaminRRR
We've been having issues with SES in N. Virginia, but have managed to get mail
through. It's intermittent, but we are getting our batches out. It's not
pretty.

~~~
sdrothrock
Our e-mails are more or less mission-critical, so we try to send, then try to
send again -- I suspect the additional failures are causing this to back up a
bit. For concrete numbers, of 27 e-mails we tried to send, only 2 got out
successfully.

Also of interest is that I only got a 454 ("Temporary service failure") three
times -- the rest of the time, it just freezes/times out during SSL.
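
For anyone else stuck in the same loop, a rough sketch of the
retry-with-timeout idea using the SES SMTP endpoint (host, credentials,
addresses, and backoff values below are placeholders, not our production
setup):

    # Rough sketch: send through SES SMTP with a hard socket timeout and a few
    # retries, backing off on temporary failures. All values are placeholders.
    import smtplib
    import socket
    import time
    from email.mime.text import MIMEText

    SMTP_HOST = "email-smtp.us-east-1.amazonaws.com"
    SMTP_USER = "SES_SMTP_USERNAME"   # placeholder credentials
    SMTP_PASS = "SES_SMTP_PASSWORD"

    def send_with_retries(subject, body, sender, recipient, attempts=3):
        msg = MIMEText(body)
        msg["Subject"] = subject
        msg["From"] = sender
        msg["To"] = recipient

        for attempt in range(attempts):
            try:
                # timeout= keeps a hung SSL handshake from blocking forever
                with smtplib.SMTP_SSL(SMTP_HOST, 465, timeout=15) as server:
                    server.login(SMTP_USER, SMTP_PASS)
                    server.sendmail(sender, [recipient], msg.as_string())
                return True
            except (smtplib.SMTPException, socket.timeout, OSError):
                # covers 454 "Temporary service failure" and frozen connections
                time.sleep(2 ** attempt)
        return False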

~~~
benjaminRRR
Although it's looking like end-user deliverability is not great, we are seeing
huge delays in things actually getting out of SES.

------
technofriends
We had been all good for the last few hours (after the outage this morning),
but just 10 minutes ago our ELB again automatically deregistered all the EC2
instances. Is the issue still happening?

------
halotrope
We were also affected in Frankfurt; really scary. Does anyone know of a
managed Chef offering like OpsWorks that is cloud-agnostic, with a little less
vendor lock-in?

------
jayunit
Using OpsWorks. Got several deploys timing out as "unreachable" between 2h ago
and now.

------
twykke
All our OpsWorks servers have been unreachable from the internet since around
05:13Z, when we see a mysterious 'configure' command with no further details
in the OpsWorks logs.

EDIT: they are in the EU-West region, btw.

------
vaibhavrajput
AWS SES 454 Temporary service failure

------
ranman
GovCloud seems OK.

------
econner
So many pages...

------
systemz
There is your reliable cloud :) You can always count on the cloud, right?

~~~
mryan
What a pointless comment. Nobody expects 100% uptime on any cloud service.

