You can see this happening by watching the "Active Connection Count" graphs for your ALB while adding or removing an instance from an ASG.
At 30+ Gbps and over 20k RPS, removing one instance can cause absolute chaos.
From your description, it may be that you have long-lived connections that build up over time at a rate the targets can easily handle, but that the reconnect spikes associated with a target failure or withdrawal are too intense. This is a challenge I've seen with WebSockets: imagine building up 100,000 mostly idle WebSockets slowly over time. Even a modest pair of backends can handle this. But then a backend fails, and 50,000 connections come storming in at once!
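To put rough numbers on the WebSocket example, here is a back-of-the-envelope sketch. The function name and its parameters are made up for illustration, and it assumes the worst case: connections are spread evenly and every dropped client reconnects immediately.

```python
def reconnect_burst(total_connections, backends, failed=1):
    """Estimate the reconnect surge when `failed` of `backends` targets go away.

    Assumptions (illustrative only): connections are spread evenly across
    targets, and every dropped client reconnects right away.
    """
    if not 0 < failed < backends:
        raise ValueError("need at least one failed and one surviving backend")

    per_backend = total_connections // backends   # steady-state load per target
    dropped = per_backend * failed                # connections that must reconnect
    survivors = backends - failed
    new_per_survivor = dropped // survivors       # sudden new connections per survivor

    return {
        "dropped": dropped,
        "new_per_survivor": new_per_survivor,
        "total_per_survivor": per_backend + new_per_survivor,
    }


if __name__ == "__main__":
    # The scenario from the comment: 100,000 idle WebSockets on two backends.
    print(reconnect_burst(100_000, 2))
```

With two backends, the survivor has to accept 50,000 new connections in one burst, even though it took a long, gentle build-up to reach that many in the first place. More backends shrink the per-survivor burst, but it still arrives all at once.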
Another scenario is adding an "idle" target to a busy workload and finding that it can't handle the increased rate of new connections it gets. Software that relies on caching (including things like internal object caches) can often handle a slow ramp-up, but not a sudden rush.
We're currently experimenting with algorithms that let customers ramp up the incoming rate of connections more slowly in these kinds of scenarios.
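For readers wondering what a slower ramp-up could look like, here is one possible shape. It is only a sketch, not the algorithm being experimented with: the linear ramp, the `floor` value, and the names `ramp_weight` and `pick_target` are all assumptions for illustration.

```python
import random


def ramp_weight(age_seconds, warmup_seconds, floor=0.1):
    """Routing weight for a target that joined `age_seconds` ago.

    Hypothetical linear ramp: starts at `floor` and reaches 1.0 once the
    target has been in service for `warmup_seconds`.
    """
    if warmup_seconds <= 0 or age_seconds >= warmup_seconds:
        return 1.0
    progress = max(age_seconds, 0.0) / warmup_seconds
    return floor + (1.0 - floor) * progress


def pick_target(ages, warmup_seconds, rng):
    """Pick a target for a new connection, weighted by how warmed-up it is.

    `ages` maps target name -> seconds since it was added.
    """
    names = list(ages)
    weights = [ramp_weight(ages[name], warmup_seconds) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    ages = {"old-target": 600.0, "new-target": 0.0}
    counts = {name: 0 for name in ages}
    for _ in range(10_000):
        counts[pick_target(ages, warmup_seconds=60.0, rng=rng)] += 1
    # The brand-new target should receive only a small share of new connections.
    print(counts)
```

The idea is that a freshly added target gets a trickle of new connections while its caches warm, rather than an equal share from the first second.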
Anyway, those are guesses, so I may be wrong about your case, but hopefully the information is still useful to others reading.