
How we found a bug in Amazon ELB - davideschiera
https://sysdig.com/blog/amazon-elb-bug/
======
nzoschke
Great article. The Sysdig team really knows how to root cause tough problems.
The Sysdig tools can be invaluable for getting and making sense of low level
data.

If you want to play with ELBs, rolling deploys, connection draining to ECS
containers, I humbly submit the open source Convox project I am working on.

[https://github.com/convox/rack](https://github.com/convox/rack)

It sets up a peer-reviewed, production-tested, batteries-included VPC, ECS,
ASG, ELB, etc. cluster in minutes.

If the conclusion of this Sysdig post was that you always need to run 2
instances per AZ for the best reliability, I would strongly consider adding
that knowledge into the tools either as a default or a production check.

Since it sounds like an ELB bug, I'll keep the default of 3 instances in 3 AZs.

~~~
kirbypineapple
Do you have any thoughts on how to scale load balancers horizontally and on
demand? I've played briefly with attempting some dynamic DNS routing based on
health checks to re-route traffic from balancers that have been shut down due
to low traffic, but DNS really isn't designed to work this way.

~~~
tedmiston
I'm not clear what you're asking... Do you mean auto scaling the EC2 instances
in the load balancer? Or auto scaling the # of load balancers? Or something
else?

Of course the former is very common with Auto Scaling Groups [1] [2]. Then you
can use round robin or session sticky routing algorithms in the load
balancers.

(Apologies if I'm totally off-base for what you were asking.)

1:
[http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide...](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)

2:
[http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide...](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-scale-based-on-demand.html)
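
For a concrete picture, here's a minimal boto3 sketch of an ASG attached to a
classic ELB. All names and AZs below are placeholders, not anything from the
article:

```python
def build_asg_params():
    """Parameters for an Auto Scaling Group that keeps 2-6 instances
    across two AZs and registers them with an existing classic ELB.
    Health checks come from the ELB itself rather than plain EC2 status."""
    return {
        "AutoScalingGroupName": "web-asg",            # placeholder
        "LaunchConfigurationName": "web-launch-cfg",  # placeholder
        "MinSize": 2,
        "MaxSize": 6,
        "DesiredCapacity": 2,
        "AvailabilityZones": ["us-east-1a", "us-east-1b"],
        "LoadBalancerNames": ["my-elb"],              # placeholder
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }

# To actually create the group (assumes AWS credentials are configured):
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(**build_asg_params())
```

Scaling policies (the second link) then adjust DesiredCapacity up and down,
and the load balancer routes across whatever instances are in service.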

~~~
kirbypineapple
I meant the latter; if your load balancers are overwhelmed how do you scale
them? Further to that point, is it possible to create an architecture where
load balancers are responsive and can spin up in response to traffic? If you
have to deal with loads that are prone to bursts, you need to allocate those
load balancers in such a way that they can handle the worst case scenario.

~~~
tedmiston
I haven't worked at the scale where dynamically scaling the load balancers
themselves is the bottleneck. I think you pose an interesting question, and
I'm hoping someone with more knowledge can comment on that.

~~~
sphinx501
I don't believe there is any way to scale out an ELB from the user side; you
can contact AWS support to 'pre-warm' ELBs for high-traffic sites before
cutting over DNS to them.

[http://aws.amazon.com/articles/1636185810492479#pre-warming](http://aws.amazon.com/articles/1636185810492479#pre-warming)

------
jlgaddis
As a network engineer, I'm constantly having to prove that "it's not the
network" so I love reading others' technical analyses of similar things. Great
troubleshooting and technical detail in this write-up.

~~~
manigandham
What are some examples where you proved this? Curious about scenarios...

------
azundo
When we ran into this exact same issue, we were told that ELBs are explicitly
not designed for long-running connections, so know that you will always be
working around that design constraint if you push long-running connections
through ELBs.

There's another case the article doesn't really discuss (though evidence of
it appears at the beginning, when all connections drop simultaneously), where
the ELB nodes themselves scale vertically at a particular threshold. I
believe the setup described is still vulnerable to those scaling events.

~~~
gighi
We definitely observed such drops, which we attributed to presumably internal
ELB scaling activity, but they happen so occasionally that for the moment they
haven't been a real issue, as opposed to the one described in the article,
which happened consistently at every deployment in our test environment.

~~~
azundo
Yeah, we've decided to live with the internal ELB scaling risks for the moment
as well. We had the exact same situation where a deployment without gradual
connection draining (even if we kept an instance in service in every AZ) would
cause the ELBs to scale and drop all of our connections every time once we
were at a certain scale. It definitely caused us a fair amount of confusion,
as it would happen minutes after the deploy, when everything seemed to have
calmed down again.

~~~
gleenn
The author said he needed at least 2 instances in an AZ to avoid the bug, and
used that as his workaround in the meantime while Amazon works on the bug.

------
djb_hackernews
In general, if you are using ELBs you should have at least 2 instances per AZ
or cross-zone load balancing enabled. I've seen this bite teams several times.

The other thing to consider when deploying to the cloud with load balancers is
to use an immutable architecture. Taking hosts out of service, updating them,
and putting them back in service is a bit cumbersome at best and leaves you
vulnerable to service outages.
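
For what it's worth, cross-zone load balancing on a classic ELB is a single
attribute flip. A rough boto3 sketch, where the load balancer name is a
placeholder:

```python
def cross_zone_attributes(enabled=True):
    """Attribute payload for ModifyLoadBalancerAttributes that toggles
    cross-zone load balancing on a classic ELB."""
    return {"CrossZoneLoadBalancing": {"Enabled": enabled}}

# To apply it (assumes AWS credentials are configured):
#   import boto3
#   boto3.client("elb").modify_load_balancer_attributes(
#       LoadBalancerName="my-elb",  # placeholder
#       LoadBalancerAttributes=cross_zone_attributes(),
#   )
```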

~~~
faceplanter
While I agree that an immutable arch is preferable, in some cases it's not
viable. In one of our projects we reuse instances, as in the article, since we
deploy multiple times an hour. In AWS you are billed for each started hour,
which in this case means we would pay a lot extra if we created new instances
for each deploy.

~~~
MBCook
Is Elastic Beanstalk not an option? It doesn't replace hosts on redeploy, so
you wouldn't end up cycling through unnecessary instances.

~~~
martin_
Genuine question: Why is it okay to reuse instances because it's controlled
via an abstraction layer, as opposed to doing it yourself?

~~~
MBCook
I agree, that doesn't make a difference.

I only mentioned EB because it does that kind of thing for you and if you
don't have a highly complicated setup it makes rolling updates without
changing instances very easy.

------
narsil
We recently discovered that the NAT Gateway also terminates connections by
issuing a RST packet when it receives the next packet for a connection that it
believes to have timed out, effectively causing the new request to fail. The
previously recommended approach to NAT in a VPC was to use NAT instances,
which sent FIN packets when the timeout was hit, cleanly closing the
connection. That behavior was far better, since it indicated that a new
request should reconnect first.

AWS Support indicated that this was a feature of the new NAT Gateways, even
though it breaks outbound connections made by popular implementations such as
the Requests python library's urllib3 connection pools. This is pretty
unfortunate, and has been a roadblock in migrating to the NAT Gateways.
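
Until the gateway behavior changes, one client-side mitigation is to retry
connection-level failures, so that a RST on a stale pooled connection doesn't
surface as a failed request. A rough sketch with requests/urllib3 (the retry
counts are arbitrary, and this only helps requests that are safe to resend):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=3):
    """Session that retries connection-level failures (like a RST on a
    pooled connection the NAT gateway silently timed out) with backoff."""
    retry = Retry(total=retries, connect=retries, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```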

~~~
colmmacc
Full disclosure: I'm an engineer at AWS and I work on NAT Gateway :)

Thanks for the pointer to urllib3 - we'll take a look at it and see if there's
anything we can do about the behavior. One of the challenges with sending
"FIN" on timeout is, as you write ... it closes the connections cleanly.

Some TCP-based protocols (including even HTTP in some modes) use a successful
connection close to indicate that an object has been transferred fully; so
what we've seen is that a network connection may stall (internet packet loss
for example) ... then the connection eventually times out ... and the "FIN"
falsely conveys that the entire object has been transferred. The end result is
a truncated object, which is no good either.
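
To make the failure mode concrete (this is just an illustration, not AWS
code): a client can only detect the truncation when the server declared a
Content-Length up front; in read-until-close modes, a FIN at timeout is
indistinguishable from a complete transfer.

```python
def is_truncated(headers, body):
    """Return True if a declared Content-Length says the body is short,
    False if the body is complete, and None when no length was declared
    (then connection close itself is the end-of-object signal, so a
    FIN injected at timeout cannot be detected)."""
    declared = headers.get("Content-Length")
    if declared is None:
        return None
    return len(body) < int(declared)
```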

~~~
narsil
Thanks for the explanation colmmacc! I agree with the challenge you described,
and am not sure what the best approach would be. Perhaps a configurable
timeout such as ELBs have?

------
seliopou
Somewhat unrelated to the ELB problem identified, but here's an alternative
solution to the original deployment problem: assuming that the collectors are
stateless (they seem to be), start off the deployment by spinning up a new
collector with the new code installed. Then proceed with the deployment in the
original fashion. Once that's over, kill the extra collector. This ensures
that load is distributed roughly in the same manner, over the same number of
nodes, during the deployment as before it. Depending on the load caused by
initiating a connection, more than one extra node may be needed. In any case,
this is a much simpler approach than baking in application-level connection
termination. All for a few extra bucks per deploy and a small amount of
engineering time up front.

~~~
gighi
Definitely a feasible approach. Let's just say that the reality has a bit more
color and we have some other practical advantages in controlling the exact
moment when we disconnect a particular client :)

------
earless1
I don't really see a benefit in updating existing instances in this manner.
Launching replacement instances with the new code is much easier for us, and
it also provides a super fast means of rollback.

~~~
gighi
Both approaches are reasonable (and there's also a third one, ship your
application in containers and replace containers instead of instances).

We update existing instances because in our test environment we deploy at
every single new _commit_ (we absolutely love that), and we have hundreds (or
more) a day. At that pace, replacing instances would be more time consuming
(again, for our specific use case) and less cost efficient.

Plus, updating existing instances is handled automatically by AWS CodeDeploy,
which provides a very good deployment pipeline that you can control using the
aws cli tool.

There are other minor advantages but those are the two main ones.
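
As a rough illustration of how such a per-commit flow can be driven (the
application, deployment group, and repo names below are placeholders, not our
actual setup), a CodeDeploy deployment can be created for a single commit via
boto3 or the equivalent `aws deploy create-deployment` CLI call:

```python
def deployment_request(commit_sha):
    """CreateDeployment parameters that deploy one GitHub commit
    through an existing CodeDeploy application/deployment group."""
    return {
        "applicationName": "collector",          # placeholder
        "deploymentGroupName": "test-env",       # placeholder
        "revision": {
            "revisionType": "GitHub",
            "gitHubLocation": {
                "repository": "example-org/collector",  # placeholder
                "commitId": commit_sha,
            },
        },
    }

# To actually kick off a deployment (assumes AWS credentials are configured):
#   import boto3
#   boto3.client("codedeploy").create_deployment(**deployment_request("<sha>"))
```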

~~~
nzoschke
That's an awesomely aggressive deployment rate and a great reason to do
instance mutation.

Does something verify every commit in the testing environment too?

~~~
gighi
Yes, every commit gets pulled by Jenkins, which builds the whole thing, runs
unit tests, and then starts the deployment once the tests pass.

------
bhz
We experienced something like this a long while ago, something like 4-5 years.
We still employ our workaround, which is to have a tiny "keepalive" instance
in each AZ in the ELB.

------
DanielDent
When the cloud works as desired, life is grand.

But when it doesn't, debugging might actually be simpler with fewer black
boxes between you and the metal.

------
jdreaver
Hmm, this seems like a pretty big bug in connection draining. I feel like one
instance per AZ is a pretty common scenario. Great article!

~~~
gighi
To be fair, the scenario is less common than it might seem, because the bug
triggers only when the drained connections terminate in a certain pattern (as
shown in the charts). Still, it's definitely common enough to be easily
replicated and cause real trouble :)

------
simonebrunozzi
This is a great article to read.

The author mentions Wireshark - fun fact: the founder of Sysdig, Loris, is
also the creator of Wireshark.

~~~
geraldcombs
He created WinPcap, not Wireshark.

------
atomicbeanie
Nice article.

------
stevesun21
Great work done!

------
sivalingam
Wonderful debugging of the issue.

