

AWS: Network Connectivity issues affecting EC2 in US-EAST-1 - alanbyrne
http://status.aws.amazon.com/?HNURLFix

======
plasma
Again with the green tick with an 'i' icon as a status, rather than a
yellow/red icon, jeez.

------
1SaltwaterC
Had a bunch of timeout alerts. Some machines in an application server array
are hitting the packet drop issue. ELB says that everything is peachy. Folks,
we're experiencing yet another "EC2 flavored SNAFU™".

Initial thought was: f#*k, I ran out of network I/O, since EC2 simply states
"low, mid, high" as performance specs, so any proper planning is out. Turned
out that with load avg under 0.02 on all machines, the I/O wait wasn't to
blame. Average response time, per Pingdom, went up from 170ms to 450ms. New
Relic isn't happy either. I guess we should all thank Amazon. Again.
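The sanity check described above boils down to: if the hosts look idle (tiny
load average) but externally measured latency has jumped, suspect the network
rather than the instances. A rough sketch, with thresholds that are purely
illustrative guesses, not anything EC2 documents:

```python
# Hypothetical triage helper: hosts idle + external latency spike
# usually means the problem is between you and the instances, not
# on them. The idle_load and latency_factor cutoffs are made up.
def likely_network_issue(load_avg, baseline_ms, current_ms,
                         idle_load=0.5, latency_factor=2.0):
    """True when hosts look idle but response time has ballooned."""
    host_idle = load_avg < idle_load
    latency_spiked = current_ms >= baseline_ms * latency_factor
    return host_idle and latency_spiked

# The numbers from the comment: load avg 0.02, 170ms -> 450ms.
likely_network_issue(0.02, 170, 450)  # points at the network
```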

~~~
mrcalzone
I see the same thing. Pingdom reports higher response times, but no downtime
(meaning no alert). Also no alerts from AWS CloudWatch. I first became aware
of the issue when internal API tests started failing at 9:56am CET. I see
users accessing the site, but I don't know how many it's failing for.

~~~
1SaltwaterC
No issues in the internal EC2 network. At least, none that I could find. I
guess that's the reason why ELB doesn't shift any traffic. The whole issue
seems to be on the Internet facing network. Failing routers, maybe.

Pingdom still claims 100% uptime, but New Relic (which includes an equivalent
pinging service) reports downtime from time to time. Around 25 timeout alerts
over the last couple of hours.

------
Xymak1y
Health Dashboard updated the status to "Resolved":

 _5:17 AM PDT Between 12:51 AM and 4:52 AM PDT we experienced elevated packet
loss affecting instances in the US-EAST-1 region. Some of our APIs also
experienced increased error rates and latencies. The issue has been resolved
and the service is currently operating normally._

~~~
jayzalowitz
Bullshit, I am still down.

------
bgentry
This started a little before 01:00 PDT (08:00 UTC), so we're approaching the 3
hour mark now. FWIW, that's about 1h15m before there was any sort of
indication of a problem on status.aws.amazon.com.

------
garindra
This has had me wondering for quite a while: I'm pretty sure almost all EC2
issues happen in the US East region -- why is that? Is it because it's the
most used region?

~~~
sudhirj
Think so... it's the cheapest, and it's the default choice. It probably has an
order of magnitude more usage (and therefore problems) than the other
regions.

~~~
rplnt
US-East-1 is also spread across more than 10 datacenters, which doesn't make
things easy. That might not be true for other, more expensive regions.

------
ksdsh
My site is affected. I don't know how to handle this situation because the
issues affect the whole region.

~~~
dsl
You should have duplicate infrastructure in another region which you fail over
to automatically or manually.
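The failover being suggested can be sketched roughly as "probe each region's
health endpoint, serve from the first healthy one". A minimal illustration;
the region names, URLs, and health-check path are all made up for the
example, not a real setup:

```python
# Toy health-check-driven region selection. In practice you'd wire
# the chosen region into DNS or a load balancer; this only shows
# the decision logic. All endpoints below are hypothetical.
from urllib.request import urlopen
from urllib.error import URLError

REGIONS = [
    ("us-east-1", "https://us-east-1.example.com/health"),
    ("us-west-2", "https://us-west-2.example.com/health"),
]

def is_healthy(url, timeout=2.0, probe=urlopen):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        return probe(url, timeout=timeout).getcode() == 200
    except (URLError, OSError):
        return False

def choose_region(regions=REGIONS, checker=is_healthy):
    """Return the first region whose health check passes, else None."""
    for name, url in regions:
        if checker(url):
            return name
    return None
```

The same logic works manually (run it, read the answer, flip DNS yourself)
or automatically from a cron job in a third location.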

~~~
laszlocph
Duplicating across regions is unfortunately not well supported by AWS. Any
ideas for doing that without completely rebuilding the infrastructure in
another region?

~~~
dsl
I assume you are using the AWS dashboard to manually deploy instances and
setting them up by hand? The first step is to get your infrastructure to the
point where it deploys and scales up and down by itself. Once you manage
that, moving to or keeping hot spares in another region is pretty easy.

Look into automated deployment tools like Foreman and configuration management
tools like Puppet.
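The core idea behind those tools is describing the stack as data rather than
clicking it together, so standing it up in a second region becomes a
parameter change. A toy illustration of that idea (roles, counts, and
instance types here are invented; a real setup would feed specs like these
into a provisioning tool):

```python
# Hypothetical declarative stack description: once the topology is
# data, rendering it for another region is trivial.
STACK = {
    "app": {"count": 4, "type": "m1.large"},
    "db":  {"count": 2, "type": "m1.xlarge"},
}

def render_stack(region, stack=STACK):
    """Expand the declarative stack into per-instance launch specs."""
    specs = []
    for role, cfg in stack.items():
        for i in range(cfg["count"]):
            specs.append({
                "name": f"{role}-{region}-{i}",
                "region": region,
                "instance_type": cfg["type"],
            })
    return specs

# The same description serves the primary region and the hot spare:
primary = render_stack("us-east-1")
spare = render_stack("us-west-2")
```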

~~~
laszlocph
You assumed right.

We are learning the hard way that EC2 Availability Zone separation is not
enough, and that EC2 lacks some key multi-region tooling.

Thanks for the hints, I'm going to check them out.

~~~
druiid
I echo this... although I don't personally use Foreman. You can do
scaling-type stuff without that tool; I just use POP (plain old Puppet). It
took me longer than it should have to get on the config-management
bandwagon, but once I did I never looked back.

Learning to use Puppet, Chef, Salt or similar will only bring benefits!

------
Jare
Some of my us-east-1 servers are affected, others are happy so far, so it's
not the entire datacenter. However, connectivity on affected servers has gone
from just flaky to completely gone.

------
sudhirj
Odd... my site is on Heroku and there seems to be no trouble.

~~~
damniatx
git push seems unusable. :(

~~~
manaslutech
i can't git push to heroku...

------
api
US-EAST-1 seems to have more issues than their other data centers... anyone
know if this is really true?

~~~
rkalla
It does, but not for nefarious reasons -- it (until the more recent history)
the cheapest region in the world for AWS -- it was only right before Oregon
rolled out that Ireland, US-EAST and US-WEST-2 all became the same price
point, but for the 4 years prior to that, it was always the cheapest so that
is where most customers rolled out most of their infra.

Now that prices have normalized I think the load is distributing more
evenly, but for historical reasons I think that region sees a lot more churn
(starting/stopping/deploying/etc.) -- just more grinding on the hardware in
that region than others.

------
manaslutech
Looks like this is the reason I can't push to heroku either.

