AWS NLBs and the mixed up TCP connections (niels-ole.com)
175 points by nielsole on Oct 22, 2020 | 66 comments

This is good advice. Ideally, never blend your AZs; each should be an independent stack. Use 3+, not 2; it keeps you honest about an availability strategy instead of a standby-failover strategy. In front of them, use DNS geo-IP or even basic round robin (with a service availability check) to get to the NLB. Behind the NLB, stay in that AZ!

If you need to call out of the AZ for other data or API sources, either figure out which AZs that service is using and configure your calls to stay in them, or make sure to go all the way back out to a well-balanced (for resilience) endpoint.

Agree. From 6+ years of experience, it seems that we got fooled by the multi-AZ promise of being able to survive a datacenter outage.

You can survive a datacenter (AZ) outage IF you have separate stacks per AZ and don't mix traffic. If you have a Kafka cluster spread across 3 AZs, don't be surprised if you just LOWERED your availability, because any issue in one AZ makes your whole stack unstable. And issues in a single AZ are quite common.

A properly configured kafka cluster across 3 AZs _should_ be able to survive the loss of a single AZ. Obviously you should do testing and DR exercises to make sure _your_ cluster and application work in that scenario.

That's a really interesting point. The startup I currently work for only uses a single AZ due to financial concerns (and some performance ones as well), but I assume we'll have to move to more AZs for reliability. Would you advise the same for RDS and ElastiCache clusters? I'm wondering how you would even go about having two separate data sources; how would that be manageable?

Before assuming that adding more AZs would increase your reliability, verify where the reliability problems come from in the first place. More often than not, I find downtime comes from people applying changes, not from leaving things running as they are. Only if the AZ or the underlying machines have trouble should you start thinking about expanding to other AZs.

I've found that for RDS, a writer instance and a hot standby reader instance with automatic failover work pretty well. When a failover happens, you're usually looking at about 30 seconds of downtime, which is "good enough" for most purposes.
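On the application side, the usual way to ride out that ~30 second failover window is a reconnect loop with backoff. A minimal sketch; the `connect` callable stands in for whatever your DB driver provides, and all names here are illustrative:

```python
import time

def connect_with_retry(connect, max_wait=60.0, base_delay=1.0):
    """Retry a DB connection with capped exponential backoff.

    Intended to ride out a ~30s RDS failover: keep retrying until
    max_wait elapses, doubling the delay between attempts.
    """
    deadline = time.monotonic() + max_wait
    delay = base_delay
    last_exc = None
    while time.monotonic() < deadline:
        try:
            return connect()
        except OSError as exc:  # connection refused/reset during failover
            last_exc = exc
            time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
            delay = min(delay * 2, 8.0)  # cap the backoff
    raise last_exc or TimeoutError("gave up waiting for failover")
```

Whether 60 seconds of patience is acceptable depends on the caller; for request-scoped connections you'd want a much smaller budget.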

30 seconds is pretty good. I worked on an "enterprise" system running AIX and HACMP (IBM's HA software.) A failover event would take minutes... and this was on the same local network.

Active-passive with database replication and manual failover is the usual solution.

> From 6+ years of experience it seems that we got fooled by the multi-az promise of being able to survive datacenter outage.

You have quite a misunderstanding ...

AWS' "multi-az promise" has always been that they will try to take only one AZ down at a time within a region.

It was never "blend your AZ usage so we can't take one down."

If you don't have a wiki page with some HA architecture diagrams for each of your systems, then you probably don't have HA. Hint: at every company that I've worked at, I drew the first diagrams. Something to think about.

This is good advice but not always easy to implement. We have some customers that insist on using IPs instead of DNS (usually because of bad/old software on their side). In our case we have some commercial LBs that can pass the EIP to each other as needed. However we do see quite a few resets so I wonder if something like this is still going on.

You can use AWS Global Accelerator. AWS will assign 2 static IPs (not EIPs, but they will never change until the Global Accelerator is deleted), or you can bring your own IP block (BYOIP). Then resolve the DNS to the Global Accelerator static IPs, and forward your traffic from AWS Global Accelerator to your ALBs, EC2 instances or NLBs.

We reported this issue to AWS back in 2018! In our case it was exacerbated by a NAT through which traffic to the cross-AZ load balancer would flow. As mentioned in the article, the client side is free to reuse ports as long as the destination in the tuple is different, as would be the case for a cross-AZ NLB.

We reduced the severity of this by randomizing port mappings on the NAT, but that just reduces the probability. They claimed it was a high-priority issue for them back then, but it seems this is still in the wild. Insane!

I was expecting this to be about the NLB's strange "lag" in updating its flows, wreaking havoc when it comes to changes in the target group, and possibly also relating to weirdly long delays before starting health checks of newly registered targets. I'm bewildered why this hasn't been more of a problem for others, and also why AWS seem to have kind of silently acknowledged the issue (by not closing them), while not coming up with a fix, even after a year. Am I the only one seeing this problem? Ref: https://github.com/aws/containers-roadmap/issues/470 and https://github.com/aws/containers-roadmap/issues/469

I debugged and identified the exact same problem a few weeks ago. I don't have any solutions, but can confirm what you're seeing. I suspect most clients aren't creating enough tcp connections in the window to cause a collision. (In our case, we discovered the issue from our load box during performance testing)

I also saw this recently, although on another provider. The solution in the article's case (as stated) is to not terminate multiple public IPs, across one or more NLBs, on the same target instance. If you must target the same instance more than once, give that instance an IP address for each public IP.

Collisions can already occur with just two connections.

AFAIK, all the major TCP/IP stacks do round-robin port assignment. Given relatively short, and relatively few, connections you shouldn't see any collisions.

If your customers are behind CGNAT, and you have a good number of them, it would be pretty easy to run into collisions. CGNAT doesn't have any problem using the same source ip:port for connections to different destination ip:port, because there's no reason not to.

I agree though, that most OS stacks are very conservative about using the same source ip:port for connections to anything else. It works ok until you need to make lots of connections, then you have to manage ports in userspace.

fwiw CGNAT is fairly uncommon in the United States except for cellphones. This creates one of those bad situations where you can have something that falls flat in the face of CGNAT but also "works fine" for all of your developer/employee traffic.

Imagine the client is actually two machines behind a NAT (corporate network or school) which each open 1 connection.

You can even reproduce this easily with two docker containers on the same host and curl using the "--local-port" option.
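The collision is easy to see if you write out the 4-tuples. A small Python sketch of what an instance-mode NLB does (rewriting only the destination); all addresses are made-up RFC 5737/RFC 1918 examples:

```python
def backend_tuple(client, nlb_ip, backend_ip, port=443):
    """The 4-tuple the backend instance sees for one connection through
    an instance-mode NLB: source preserved, destination rewritten."""
    src_ip, src_port = client
    return (src_ip, src_port, backend_ip, port)

# Two distinct client-side connections: same local ip:port, but to two
# *different* NLB AZ addresses. That's legal on the client side, since
# the client's own 4-tuples differ in destination IP.
client = ("10.0.0.5", 40000)
conn_a = backend_tuple(client, nlb_ip="192.0.2.10", backend_ip="10.1.0.7")
conn_b = backend_tuple(client, nlb_ip="192.0.2.20", backend_ip="10.1.0.7")

# After the NLB rewrites both destinations to the same instance IP, the
# backend sees two identical 4-tuples: the collision from the article.
assert conn_a == conn_b
```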

I hit this issue a year ago, in a slightly different setup. We were trying to expose an internal ECS Service to other services via an NLB. Things worked great, as long as the sending and receiving services didn't end up on the same EC2 instance. Unfortunately this occurred fairly often, since the ECS scheduler wasn't aware of that bizarre constraint (maybe we could have made it aware, but, that seems like a pretty brittle way to fix things).

Our resolution was to just fall back to an ELB. A bummer in that it didn't support dynamic port registration (the service could only be on a single EC2 instance at a time, since it has a static port), but joyful in that we didn't have spurious failures as an artifact of task packing.

Wow, what a PITA. Thanks for sharing your hard-won insights in this excellent write-up.

Does anyone know a good way to observe & measure the impact of this? I have a small fleet of Linux servers behind NLBs with cross-AZ load balancing turned on, and they have some significant collection of OS-level metrics via the Prometheus node_exporter.

Would incidents of this be captured by the "node_netstat_TcpExt_TCPAbortOnData" metric ("connections reset due to unexpected data")?
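Not sure that's the right counter, but if you want to poke at the raw value yourself, `TCPAbortOnData` lives in the `TcpExt` rows of `/proc/net/netstat`, which is what node_exporter's netstat collector reads. A small self-contained parser sketch (the sample input in the test is abbreviated, not real output):

```python
def tcp_abort_on_data(netstat_text):
    """Extract the TCPAbortOnData counter from /proc/net/netstat-style
    text. /proc/net/netstat uses pairs of lines per protocol: a header
    line of field names, then a line of values, both prefixed "TcpExt:".
    Returns None if no TcpExt section is present."""
    lines = netstat_text.splitlines()
    for i, line in enumerate(lines):
        fields = line.split()
        # Header line: starts with TcpExt: and its fields are names.
        if fields and fields[0] == "TcpExt:" and not fields[1].isdigit():
            values = lines[i + 1].split()[1:]
            return int(dict(zip(fields[1:], values))["TCPAbortOnData"])
    return None
```

In practice you'd read `open("/proc/net/netstat").read()` on the box and watch the counter while reproducing the issue.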

If you figure it out please post/reply - I’m wondering the same.

Anyone have any context on why the OP's post was removed from Reddit?

No idea what I did wrong, I really thought it was the right subreddit to post it in. :) I have since updated the link to another subreddit.

What makes you think it was removed? It's currently visible but nobody seems to have commented on it.

It's not visible unless you go directly to the post, which is how deleted posts work in Reddit. See:

Ah, the one that's been on the page all day is a cross-post from r/devops:

I'm fairly certain r/aws is run by AWS/Amazon employees so it's only natural that articles that could be seen as negative are removed from it.

If you have cross-zone load balancing disabled and only one instance per AZ, how do you ensure you don't have downtime when an instance is down or during a deploy, when not all AZs might have instances? Cross-zone load balancing in NLB seems a must to me if you are constantly deploying new targets.

There are a few strategies you can employ. One is to roll each AZ one at a time, taking each out of service first.

Another strategy is to have redundancy in each AZ such that your rolling strategy never takes the AZ itself offline.

Just depends on how much infra you want to manage to be honest. Taking an AZ offline can be as easy as having an ingress hop that signals back to the LBs that it is now failing health checks and should be temporarily removed from service.
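The AZ-at-a-time roll described above boils down to a loop over deployment hooks. A sketch where everything is hypothetical glue; the four callables would wrap your actual LB and deploy tooling:

```python
def roll_by_az(azs, drain, deploy, healthy, restore):
    """Roll each AZ one at a time: take it out of service at the LB,
    deploy, wait for health, then put it back before moving on.
    drain/deploy/healthy/restore are deployment-specific hooks."""
    for az in azs:
        drain(az)    # e.g. fail the ingress health check for this AZ
        deploy(az)   # roll the instances in this AZ
        if not healthy(az):
            # Leave the AZ drained and stop: don't spread a bad build.
            raise RuntimeError(f"{az} failed post-deploy health check")
        restore(az)  # re-enable the AZ before touching the next one
```

The key property is that at most one AZ is ever out of service, so the "one instance per AZ" setup above still has N-1 AZs taking traffic throughout the roll.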

I witnessed this issue as well at my previous workplace. I can't understand why cross-AZ is even a feature of NLB if this is a problem.

Under what circumstances would anyone enable this? UDP only traffic?

On the issue of NATs and to extend this to GCP, we had an issue where the NAT was just dropping SYN packets. Clients would then eventually timeout, but connection pools eventually became fully drained. We had to look at tcpdump on the clients to see what was going on. It was 'solved' by giving machines that made external calls their own External IP. I don't know if it's still an issue.

GCP Cloud NAT wasn't really used in the first iterations; it was only introduced at the end of 2018, and "private" networks were only a thing around the middle of 2019. It's still not the default, especially not on GKE (besides that, Cloud NAT is cheaper).

Nice insight! I'm thinking this should be relatively easy for AWS to fix (or at least make the chance vanishingly small) on the NLB side by hacking up their TCP stack to select the source port randomly.

Edit: actually I just realized that for a really high number of long-lived connections that wouldn't matter. Hmm, maybe instead don't rewrite the dst IP and make the VMs handle that. Seems like a much more intrusive change though.

They could fix the problem by assigning each NLB AZ IP an aliased IP on your target machine. However, this would be an absolute mess to configure.

For example if the 3 NLB AZ IPs were:

then if you set up 3 IP aliases on your target instance:

then if the NLB always mapped: -> -> ->

then there would be no clashes.
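To make the no-clash claim concrete, here's a Python sketch with hypothetical example addresses: each NLB AZ IP maps to its own alias on the target, so the rewritten destination always differs and the backend 4-tuples stay unique.

```python
# Hypothetical fix: each NLB AZ IP gets a dedicated alias IP on the
# target instance (all addresses are made-up examples).
NLB_TO_ALIAS = {
    "192.0.2.10": "10.1.0.71",
    "192.0.2.11": "10.1.0.72",
    "192.0.2.12": "10.1.0.73",
}

def backend_tuple(client, nlb_ip, port=443):
    """Backend-side 4-tuple when the NLB rewrites the destination to
    the alias dedicated to the AZ IP the client connected to."""
    src_ip, src_port = client
    return (src_ip, src_port, NLB_TO_ALIAS[nlb_ip], port)

# Same client ip:port to all three NLB AZ IPs, yet no collision:
client = ("10.0.0.5", 40000)
tuples = {backend_tuple(client, ip) for ip in NLB_TO_ALIAS}
assert len(tuples) == 3  # one unique backend 4-tuple per NLB AZ IP
```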

Since you seem knowledgeable and I'm not an AWS customer... can you confirm the general workings of the Network Load Balancer?

It seems like it's basically doing NAT to take client:port <-> server:port to client:port <-> instance:port, including adjusting instance NAT mappings so the outgoing traffic gets NATed back?

Sort of like direct server return load balancing, but weird because the instance doesn't see the service IP and NAT takes care of it, somehow?

In that case, yeah, you either need to have a rfc1918 service address mapped for each public service address or just use the public address.

Elsewhere in thread, it sounds like it's possible to set this up using the public address but kubernetes doesn't like that; in which case, it seems appropriate to fix kubernetes.

There are two modes to NLB:

'Instance' mode, which is what they are describing in the article. I think what happens is the packets hit the AWS network, and then somehow AWS decides whether it is an existing flow or chooses a target instance to create a new flow. Then it routes the packet by only modifying the destination address. It can do this because the whole of the AWS network is basically a lie, and IP packets can just magically pop up on your interface without going through a normal routing process. Then on the way back it is able to track the flow of the connection and knows it has to rewrite the source address of the packet. I think this is basically what you describe above.

Then there is 'ip' mode, which I think works similarly to how haproxy would proxy TCP traffic. All of the packets you receive look like they come from the NLB and not the client. Because there can be many clients, the NLB networking layer needs to rewrite the source address and the source port, which I guess is a more traditional kind of NAT. This also means there is a limit to the number of concurrent flows you can have; in the documentation I think AWS says there is a 55,000-connection limit.

Ok, ip mode seems usable if you don't have a lot of connections per NLB/instance pair, but kind of meh.

Does the instance port have to match the service port? In instance mode, you could set it up so us-east1 is like 441, us-east2 is 442, us-west1 is 451, etc. Then you wouldn't have TCP 4-tuple collisions on the instance.

Similarly for ip mode, you could put the same host in for ports 440-449 and get 10x 55,000 connections, or however many ports you need to hit your numbers.

(Port numbers chosen for example only, check /etc/services if you don't want to stomp on allocations for services you're probably not using)

That AZ-port solution would work as well and would be easier to configure than different IPs. But they don't let you map a port or IP based on AZ for a single NLB. You could set up independent NLBs in each zone and then set up target groups for each NLB's AZ, and I think this might have been what you were proposing. I'm pretty sure you can set a different port for each target group, and it doesn't have to match the traffic port on the NLB.

For IP mode you are correct that you can run on multiple ports to get around the connection limit. The documentation even says it is based on IP address and port.

> When the target type is ip, the load balancer can support 55,000 simultaneous connections or about 55,000 connections per minute to each unique target (IP address and port). If you exceed these connections, there is an increased chance of port allocation errors. If you get port allocation errors, add more targets to the target group.
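So, under that documented per-target limit, ip-mode capacity scales with how many unique (IP address, port) targets you register. Trivial arithmetic, but worth writing down (the limit constant is the figure quoted above, not something I've measured):

```python
# Per AWS docs quoted above: ~55,000 simultaneous connections per
# unique (IP address, port) target in ip mode.
PER_TARGET_LIMIT = 55_000

def ip_mode_capacity(num_ports):
    """Approximate concurrent-connection headroom when the same
    instance IP is registered on num_ports distinct target ports."""
    return PER_TARGET_LIMIT * num_ports
```

Registering one instance on the ten ports 440-449, as suggested upthread, would therefore give roughly 550,000 connections of headroom.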

What would be required for the NLB host to do port-rewriting to make things stable? Is the NLB host not present on the return-path?

If that is not the issue, is this a performance issue? Could random allocation of ports work instead?

My guess is the network routing is happening in the software defined network AWS creates below the traditional networking layers.

Sometimes it's tempting to think AWS' services are just like traditional appliances you'd stick in the middle of a network path but in reality they're all virtual and run on top of their software defined network in something they call hyperplane.

That means traditional intuition about how packets flow, like needing to be in the path to preserve the IP, may not apply.

It seems like in this case they need to preserve the static public IP that was used on the NLB for the incoming request across the whole transaction but maybe aren't doing that.

> they need to preserve the static public IP that was used on the NLB for the incoming request

Even if they did that, to determine which outgoing public IP the packet should go to, they would need to guess based on sequence numbers and ack offsets. This would still be a guessing game.

Oh for sure. Distributed routing like this is complex!

My main point is we have no idea what AWS' SDN looks like under the covers. Traditional rules about routing may not apply and each node in the path may have a lot more information about traffic than a traditional router would have.

You can run NLB in ip mode instead of instance mode and I don’t think it will have this issue. In ip mode your server sees the IP address of the NLB instead of the client.

There is no simple way to do this when using Kubernetes. NLB provisioned via Kubernetes will use instance mode, and you cannot change that, and aws-alb-ingress-controller doesn't support NLBs.

Weirdly, provisioning NLB via Kubernetes supports `aws-load-balancer-cross-zone-load-balancing-enabled` annotation, even though this is quite broken behaviour, as per the article.

If you're using the proxy protocol within your stack, it seems like this could be a good workaround here, particularly if knowledge of the original client source IP is important to your application.

Has anyone tried this?

Is anyone experiencing this type of issue when publishing PrivateLink Endpoint Services? Presumably it means you need to deploy services to every AZ in the region you are operating? Is that adequate?

Source IP is not preserved with PrivateLink

OK that makes sense, as long as the service can't see the original source IP/port there's no collision risk.

If the description is correct, then there would be a much worse problem if you had two clients talking to the same NLB. If they happen to use the same local port number at the same time, then it would all fail. This seems likely to happen a lot more than a single client reusing its local-port because it is a different remote IP.

How does an NLB handle the following two connections:

1) from to NLB:443
2) from to NLB:443

What does the (single) inside server see?

The client ip is preserved, so is different from

Wow very interesting insights. Thanks for the write up.

why do and both translate to

Are they both registered with the NLB and assigned to the same EC2 instance?

This is surprising behavior, to say the least.

Does this happen only with IPv4 and not IPv6?

NLB doesn’t support IPv6.

An ALB is the only one of the load balancers in AWS to support IPv6, but only to terminate the connection, not to send traffic to an IPv6 target.

This would be a nice interview question, reframed as a CYOA.

It annoys me to no end when people don't explain their abbreviations... "AZ" means Availability Zone, which is to say: Data Center. So cross-AZ means going to multiple data centers.

An availability zone isn't equivalent to a data center, as it might consist of multiple data centers. A better explanation for availability zone would be "a bunch of data centers in close physical proximity, exposed to users as a single logical entity".

Or as AWS explains it [1]:

> An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZ’s give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center.

[1]: https://aws.amazon.com/about-aws/global-infrastructure/regio...

It's still a data center.

And when you choose low-latency instance placement, it means that all your virtual machines are placed in the same rack and/or host.

Basically throwing availability out the window.

>And when you choose the low latency instance placement means that all your virtual machines are placed in the same rack and/or host

That's also not true.

AZ is a well-known abbreviation in AWS. For all intents and purposes it can mean a data center, but I've heard that AZs can span multiple physical data centers (which sort of makes sense; there's a physical limit to how many servers you can fit into one physical data center).

Some AZs have five data centers, though I’m not sure that up-to-date numbers are published. A DC and a single AZ are definitely not equatable in this regard.
