
Ask HN: Has anyone else seen TCP connection issues in AWS US East this week? - bognition
Over the last week we've seen random TCP connection issues in the us-east-1 region of AWS. Has anyone else been seeing this?
======
osipovas
Maybe AWS US East is going to be the region with random failures?

"I want to have an AWS region where everything breaks with high frequency"
[https://news.ycombinator.com/item?id=24103746](https://news.ycombinator.com/item?id=24103746)

~~~
BillinghamJ
It always has been - us-east-1 is the biggest region by far, so scale problems
tend to arise there first.

~~~
kjaftaedi
All updates and patching start with us-east-1.

The main reason it has the most issues is that it's the guinea pig for
production update deployments.

~~~
whoevercares
Hmm that’s not true, nowadays most new things will start with small regions,
typically us-east-2

~~~
arsh
+1

------
BookPage
It would help if you defined exactly what type of issues you experienced.
Packet loss? Early RSTs? Latency? Single AZ or cross AZ? Same for VPCs or
NAT'd internet traffic?

You should take some tcpdumps and open a support case.
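
Something run alongside tcpdump to timestamp the failures also helps when you
hand the case to support. A rough sketch (Python; peer IP and port below are
placeholders):

    import datetime
    import socket
    import time
    
    PEER = "10.0.1.23"  # placeholder: private IP of a peer instance in the VPC
    PORT = 443          # placeholder: any port the peer is listening on
    
    while True:
        start = time.monotonic()
        try:
            # Attempt a TCP connect; close immediately on success.
            with socket.create_connection((PEER, PORT), timeout=3):
                pass
        except OSError as exc:
            elapsed = time.monotonic() - start
            print(f"{datetime.datetime.utcnow().isoformat()} "
                  f"connect to {PEER}:{PORT} failed after {elapsed:.2f}s: {exc}")
        time.sleep(1)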

~~~
bognition
Connection issues between hosts; it appears to be cross-AZ. Internal traffic
inside a VPC.

------
fossuser
Yes - I've been seeing failures this morning.

Searched around and didn't see much on Twitter beyond this:
[https://twitter.com/Flock/status/1294304262126804993?s=20](https://twitter.com/Flock/status/1294304262126804993?s=20)

We think it's one AZ in us-east-1.

~~~
bognition
Yeah, this is very similar to what we're seeing. We reached out to AWS for
help and they reported issues on their side but didn't go into greater
detail.

We've been seeing issues like this on and off for a few weeks now.

------
omreaderhn
Yes, I have seen sporadic connection issues with the various site-scraping
functions my app employs. I figured it was a widespread issue, so I'm glad
you made a post that basically confirms it.

------
RulerOf
One instance was randomly powered down about 22 hours ago.

We saw a synthetic monitor failure at midnight. Investigation of the
transaction trace showed that a specific code path that should take maybe
~100ms took almost 40,000ms.

It could have been unresponsive EBS. Or failure to look up the Redis server's
IP address. Or some other infrastructure-level failure. The synthetic browser
saw it as a 502.

------
mweberxyz
Perhaps related: our load tests this week showed an increase in 502s from the
ALB. The app server request logs indicate those requests never made it from
the ALB.

------
renewiltord
We had a full `us-west-2` 30 minute network drop-out this week. CloudTrail
shows nothing.

------
syllableai
We are seeing sporadic connection issues where TCP SYN packets are dropped
before reaching our ELB. We've noticed it off and on for a few weeks now.
Still investigating, and we have a support ticket open with AWS.

~~~
toast0
Do ELBs operate behind the 'security groups' firewall? I don't think they
do, but if they do, you might be hitting connection tracking limits? That
would be consistent with dropped SYNs.

More details at the link below, but if you use normalish (naive?) rules in
the AWS firewall, you get connection tracking behavior, and there's an
unspecified connection limit for each instance type. Above that limit,
incoming SYNs are dropped.

[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-secu...](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-security-
groups.html#security-group-connection-tracking)
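
A quick way to check is to dump the group's inbound/outbound rules and
compare them against the tracking criteria in that doc. Rough sketch (boto3
assumed configured; the group ID is a placeholder):

    import boto3
    
    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_security_groups(GroupIds=["sg-0123456789abcdef0"])
    for group in resp["SecurityGroups"]:
        print(group["GroupId"], group["GroupName"])
        for direction, key in (("in", "IpPermissions"),
                               ("out", "IpPermissionsEgress")):
            for rule in group[key]:
                proto = rule.get("IpProtocol")  # "-1" means all protocols
                ports = (rule.get("FromPort"), rule.get("ToPort"))
                cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
                print(f"  {direction}: proto={proto} ports={ports} cidrs={cidrs}")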

------
bassman9000
Not so much TCP issues, but yes, increased API call failures across multiple
services (CloudFormation, EC2, RDS) on us-east-1. Mind you, still a pretty
low rate, but enough to notice a pattern.

------
wgyn
We had a few minutes earlier this week where a machine saw packets in/out go
to zero for no discernible reason.
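
If it helps anyone else dig, the window shows up in CloudWatch. Rough sketch
(boto3 assumed; the instance ID is a placeholder, and basic monitoring only
gives 5-minute datapoints):

    import boto3
    from datetime import datetime, timedelta
    
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    for metric in ("NetworkPacketsIn", "NetworkPacketsOut"):
        stats = cw.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName=metric,
            Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
            StartTime=datetime.utcnow() - timedelta(hours=6),
            EndTime=datetime.utcnow(),
            Period=300,
            Statistics=["Sum"],
        )
        zeros = [p["Timestamp"] for p in stats["Datapoints"] if p["Sum"] == 0]
        print(metric, "zero-traffic periods:", sorted(zeros))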

~~~
leesalminen
Same here in us-east-2.

------
jeppesen-io
I've not seen anything across 4 AZs

------
sg47
Someone asked for a region where everything breaks all the time.

------
jscheel
Yep, had issues yesterday. Botched a big deploy for me too.

------
WaxProlix
In any specific AZ or across the board?

~~~
sudhirj
Each account's AZ names map to different physical AZs; it's a way of
avoiding conversations like this, where people conclude a particular AZ is
better than others.

~~~
WaxProlix
That's no longer the case, except in legacy regions (which IAD is); I think
there's a reconciliation mechanism in those cases, though.

~~~
gregmac
Can you cite something about this? I can't find anything, except confirming
that's not true [1].

Your profile says you work at AWS so I assume you have inside info on this.
Perhaps you could also explain why this change would get made? I always
considered it pretty smart to do - With consistent machines, wouldn't the
lowered-lettered zones get significantly more traffic? Most of my deploys go
to a+b or a+b+c (and I have various services running in 5 different regions, I
think). I'm not sure I even have anything running in more than 3 AZs, and thus
never use AZ d, for example. I'm positive I'm not alone in that style of
setup.

----

EDIT: Just comparing two accounts I have (which are linked, if that makes a
difference), it does in fact look like most regions have the same mappings.
us-east-1 and us-west-2 are definitely different, but all the other ones I
checked seem to be the same. They're not all consistent (a=1, b=2, etc.),
but, for example, these are the same on both accounts:

    
    
        AZ Name         AZ ID
        eu-central-1a   euc1-az2
        eu-central-1b   euc1-az3
        eu-central-1c   euc1-az1
        eu-west-3a      euw3-az1
        eu-west-3b      euw3-az2
        eu-west-3c      euw3-az3
        ap-south-1a     aps1-az1
        ap-south-1b     aps1-az3
        ap-south-1c     aps1-az2
    

I still find this silly. Anyone following basic examples or deploying single-
AZ is going to provision stuff in the "a" zone. That zone must be 10x bigger
than "c" in any given region. It blows my mind.
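
If anyone wants to compare their own accounts, the name-to-ID mapping above
comes straight from describe_availability_zones. Rough sketch (boto3
assumed; run it under each account/profile you want to compare):

    import boto3
    
    # Print ZoneName -> ZoneId for every region visible to this account.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
    for region in sorted(regions):
        client = boto3.client("ec2", region_name=region)
        for az in client.describe_availability_zones()["AvailabilityZones"]:
            print(f'{az["ZoneName"]:<16} {az["ZoneId"]}')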

[1] [https://aws.amazon.com/premiumsupport/knowledge-
center/vpc-m...](https://aws.amazon.com/premiumsupport/knowledge-center/vpc-
map-cross-account-availability-zones/)

~~~
NBJack
Zone IDs are meant to reconcile this. And due to the nature of my job, I get
plenty of headaches explaining why two accounts can have different AZ names
for the 'same' place.

[https://docs.aws.amazon.com/ram/latest/userguide/working-
wit...](https://docs.aws.amazon.com/ram/latest/userguide/working-with-az-
ids.html)

My understanding of why AWS did this for many regions was to avoid folks
hammering the "first" zone they came across (a) when they either didn't care
about multi-zone availability or were ignorant of the difference it made. If
everyone hops into the 'first' zone, you could end up with disproportionate
amounts of traffic. Either way, given how many new regions don't do this
(either because they stopped, or because new regions tend to come up slowly,
one zone at a time), it seems they've abandoned the practice. Unfortunately,
even new accounts in these legacy regions still end up with randomized
mappings.

I also suspect they didn't care much about this until cross-account features
were offered.

