If you need to call out of the AZ for other data or API sources, either figure out which AZs that service is using and configure your calls to stay in them, or make sure to go all the way back out to a well-balanced (for resilience) endpoint.
You can survive a datacenter (AZ) outage IF you have separate stacks per AZ and don't mix traffic. If you have a Kafka cluster spread across 3 AZs, don't be surprised if you just LOWERED your availability, because any issue in one AZ makes your whole stack unstable. And issues in a single AZ are quite common.
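A quick back-of-the-envelope calculation shows why (assuming, purely for illustration, 99.9% availability per AZ and independent failures):

    az = 0.999  # assumed availability of a single AZ (illustrative number)

    # One stack striped across 3 AZs, where an issue in ANY AZ destabilizes it:
    coupled = az ** 3  # ~0.997, i.e. WORSE than a single AZ

    # One independent stack per AZ, failing only if ALL three AZs fail:
    independent = 1 - (1 - az) ** 3  # ~0.999999999

    print(f"coupled: {coupled:.6f}, independent: {independent:.9f}")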
You have quite a misunderstanding ...
AWS' "multi-az promise" has always been that they will try to take only one AZ down at a time within a region.
It was never "blend your AZ usage so we can't take one down."
If you don't have a wiki page with some HA architecture diagrams for each of your systems, then you probably don't have HA. Hint: at every company that I've worked at, I drew the first diagrams. Something to think about.
We reduced the severity of this by randomizing port mappings on the NAT, but that just reduces the probability. They claimed it was a high-priority issue for them back then, but it seems this is still in the wild - insane!
I agree, though, that most OS stacks are very conservative about reusing the same source ip:port for connections to anything else. It works OK until you need to make lots of connections, and then you have to manage ports in userspace.
You can even reproduce this easily with two Docker containers on the same host and curl using the "--local-port" option.
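A minimal Python equivalent of the curl trick (target IP is a placeholder): pin the same source port in both containers before connecting, and if a NAT in the path collapses both onto one source IP, the 4-tuples collide:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("0.0.0.0", 44444))         # pin the source port, like --local-port
    s.connect(("198.51.100.10", 443))  # placeholder target behind the NAT/NLB
    # Run this in two containers at once: the second connection reuses the
    # same NAT'd 4-tuple and one side sees an unexpected RST.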
Our resolution was to just fall back to an ELB. A bummer in that it didn't support dynamic port registration (the service could only be on a single EC2 instance at a time, since it has a static port), but joyful in that we didn't have spurious failures as an artifact of task packing.
Would incidents of this be captured by the "node_netstat_TcpExt_TCPAbortOnData" metric ("connections reset due to unexpected data")?
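For reference, that metric is node_exporter reading the TcpExt counters out of /proc/net/netstat, so you can check the raw counter directly (Linux only; a sketch, not the exporter's actual code):

    # Read the raw counter behind node_netstat_TcpExt_TCPAbortOnData.
    with open("/proc/net/netstat") as f:
        lines = f.readlines()

    # The file alternates header/value line pairs per protocol extension.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            stats = dict(zip(header.split()[1:], values.split()[1:]))
            print("TCPAbortOnData =", stats["TCPAbortOnData"])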
Another strategy is to have redundancy in each AZ such that your rolling strategy never takes the AZ itself offline.
Just depends on how much infra you want to manage, to be honest. Taking an AZ offline can be as easy as having an ingress hop that signals back to the LBs that it is now failing health checks and should be temporarily removed from service.
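A sketch of that kind of drain switch, assuming an LB that polls an HTTP health check (the flag path and port are made up): flip a flag file and the hop starts failing its checks, so the LB drains the AZ without you touching the LB itself:

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DRAIN_FLAG = "/tmp/drain-this-az"  # hypothetical flag file

    class Health(BaseHTTPRequestHandler):
        def do_GET(self):
            # Report unhealthy while the drain flag exists.
            code = 503 if os.path.exists(DRAIN_FLAG) else 200
            self.send_response(code)
            self.end_headers()

    HTTPServer(("", 8080), Health).serve_forever()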
Under what circumstances would anyone enable this? UDP-only traffic?
Edit: actually I just realized that for a really high number of long-lived connections that wouldn't matter. Hmm, maybe don't rewrite the dst IP and make the VMs handle that instead. Seems like a much more intrusive change, though.
For example, if the 3 NLB AZ IPs were:

126.96.36.199
188.8.131.52
184.108.40.206

then if you set up 3 IP aliases on your target instance:

192.168.2.6
192.168.2.7
192.168.2.8

then if the NLB always mapped:

126.96.36.199 -> 192.168.2.6
188.8.131.52 -> 192.168.2.7
184.108.40.206 -> 192.168.2.8

then there would be no clashes.
It seems like it's basically doing NAT to take client:port <-> server:port to client:port <-> instance:port, including adjusting instance NAT mappings so the outgoing traffic gets NATed back?
Sort of like direct server return load balancing, but weird because the instance doesn't see the service IP and NAT takes care of it, somehow?
In that case, yeah, you either need to have an RFC 1918 service address mapped for each public service address or just use the public address.
Elsewhere in the thread, it sounds like it's possible to set this up using the public address, but Kubernetes doesn't like that; in which case, it seems appropriate to fix Kubernetes.
'Instance' mode is what they are describing in the article. I think what happens is that packets hit the AWS network, and AWS decides whether each one belongs to an existing flow or picks a target instance to create a new flow. It then routes the packet by modifying only the destination address. It can do this because the whole of the AWS network is basically a lie: IP packets can just magically pop up on your interface without going through a normal routing process. On the way back, it tracks the flow of the connection and knows it has to rewrite the source address of the packet. I think this is basically what you describe above.
Then there is 'ip' mode, which I think works similarly to how HAProxy would proxy TCP traffic: all of the packets you receive look like they come from the NLB and not the client. Because there can be many clients, the NLB networking layer needs to rewrite both the source address and the source port, which I guess is a more traditional kind of NAT. This also means there is a limit to the number of concurrent flows you can have; in the documentation, I think AWS says there is a 55,000-connection limit.
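A toy model of why ip mode would top out around 55,000 flows per target: with one NLB source IP per target, each concurrent flow consumes one of roughly 55k ephemeral source ports. This is just my reading of the docs, not AWS' actual implementation:

    # Toy model: NLB-side source-port allocation for an ip-mode target.
    class NatTable:
        def __init__(self):
            self.flows = {}                              # (client ip, port) -> NLB port
            self.free_ports = list(range(10000, 65000))  # ~55,000 usable ports

        def allocate(self, client_ip, client_port):
            key = (client_ip, client_port)
            if key in self.flows:        # existing flow, reuse its mapping
                return self.flows[key]
            if not self.free_ports:      # every port to this target is in use
                raise RuntimeError("port allocation error")
            self.flows[key] = self.free_ports.pop()
            return self.flows[key]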
Does the instance port have to match the service port? In instance mode, you could set it up so us-east1 is like 441, us-east2 is 442, us-west1 is 451, etc. Then you wouldn't have TCP 4-tuple collisions on the instance.
Similarly, for ip mode, you could put the same host in for ports 440-449 and get 10x the 55,000 connections, or however many ports you need to hit your numbers.
(Port numbers chosen for example only; check /etc/services if you don't want to stomp on allocations for services you're probably not using.)
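If you wanted to try that multi-port trick, registering the same target IP on a range of ports is allowed by the API; a boto3 sketch (the ARN and IP are placeholders):

    import boto3

    elbv2 = boto3.client("elbv2")
    # Placeholder ARN and target IP - substitute your own.
    elbv2.register_targets(
        TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/example",
        Targets=[{"Id": "10.0.1.20", "Port": port} for port in range(440, 450)],
    )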
For IP mode, you are correct that you can run on multiple ports to get around the connection limit. The documentation even says the limit is based on IP address and port:
> When the target type is ip, the load balancer can support 55,000 simultaneous connections or about 55,000 connections per minute to each unique target (IP address and port). If you exceed these connections, there is an increased chance of port allocation errors. If you get port allocation errors, add more targets to the target group.
If that is not the issue, is this a performance issue? Could random allocation of ports work instead?
Sometimes it's tempting to think AWS' services are just like traditional appliances you'd stick in the middle of a network path, but in reality they're all virtual and run on top of their software-defined network, in something they call Hyperplane.
That means traditional intuition about how packets flow, like needing to be in the path to preserve the IP, may not apply.
It seems like in this case they need to preserve the static public IP that was used on the NLB for the incoming request across the whole transaction, but maybe they aren't doing that.
Even if they did, to determine which outgoing public IP the packet should use, they would need to guess based on segment numbers and ACK offsets. It would still be a guessing game.
My main point is we have no idea what AWS' SDN looks like under the covers. Traditional rules about routing may not apply and each node in the path may have a lot more information about traffic than a traditional router would have.
Weirdly, provisioning an NLB via Kubernetes supports the `aws-load-balancer-cross-zone-load-balancing-enabled` annotation, even though this is quite broken behaviour, per the article.
Has anyone tried this?
How does an NLB handle the following two connections:
1) from 220.127.116.11:44444 to NLB:443
2) from 18.104.22.168:44444 to NLB:443
What does the (single) inside server see?
Are they both registered with the NLB and assigned to the same EC2 instance?
An ALB is the only one of the load balancers in AWS to support IPv6, but only to terminate the connection, not to send traffic to an IPv6 target.
Or as AWS explains it:
> An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZ’s give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center.
And choosing low-latency instance placement means all your virtual machines are placed on the same rack and/or host.
Basically throwing availability out the window.
That's also not true.