
Spooky action at a distance, how an AWS outage ate our load balancer - kiyanwang
https://blog.hostedgraphite.com/2018/03/01/spooky-action-at-a-distance-how-an-aws-outage-ate-our-load-balancer/
======
l1n
>Our SRE team relies heavily on Glitter – our trusty SRE chatbot – to assist
us with many of our operational tasks, so under normal circumstances, it’d be
a simple operation to disable these health checks in our Slack channel.
However, things aren’t quite as straightforward as we’d expect, and the AWS
outage has impacted Slack’s IRC gateway, which our chatbot relies on. This
means we have to make all the necessary changes manually…

Interesting example of how infrastructure can become tangled in a way that
makes crisis management much harder.

~~~
subway
This is an unfortunate example of "Ultron" style automation:
https://queue.acm.org/detail.cfm?id=2841313

Also kind of terrifying from a security perspective, since it's implied that
Slack has been effectively delegated permission to modify AWS resources in
their account.

------
cthalupa
Interesting!

I guess I've dealt with enough load balancers that as soon as I saw 'AWS
network issues' and 'Many of our clients are hosted in AWS', I immediately
jumped to the LB connection pool being full of mostly-dead connections. Still,
this is an important lesson for anyone running LBs: it's a very easy avenue of
attack for bad actors, so setting strict timeouts and high connection limits
(within reason, given what your load balancers can actually handle) is
important.
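For illustration, a minimal haproxy-style sketch of what that can look like.
The values are placeholders, not recommendations; tune them against your own
traffic and LB capacity:

    # Sketch only: tight timeouts plus explicit connection caps.
    global
        maxconn 50000            # global cap on concurrent connections

    defaults
        mode http
        timeout connect 5s       # time allowed to establish a backend connection
        timeout client  30s      # max inactivity on the client side
        timeout server  30s      # max inactivity on the server side

    frontend web
        bind :443 ssl crt /etc/haproxy/site.pem
        maxconn 20000            # per-frontend cap, kept below the global one
        default_backend app

    backend app
        server app1 10.0.0.10:8080 maxconn 500   # per-server cap protects the app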

------
wetha
This, if done maliciously, is a valid DoS attack vector known as a
Slowloris/SlowPost attack.

I know it’s easier said than done, but there should be active mitigation in
place, rather than only monitoring.

~~~
NetStrikeForce
Slowloris is HTTP-based, right? In this case I'm not sure it even had to go up
to layer 7; it seems they had some generous timeouts for idle (or incomplete)
TCP and SSL sessions.
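To make the layer distinction concrete, a rough haproxy-style sketch (values
are illustrative): the plain client idle timeout already covers stalled TCP/SSL
sessions, while the request timeout is the one that specifically blunts
Slowloris-style slow headers.

    defaults
        mode http
        timeout client       30s   # layer 4: drop idle/stalled client connections
        timeout http-request 10s   # layer 7: drop clients that never finish
                                   # sending their request headers (Slowloris)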

------
bscanlan
Two-sentence summary: Slow clients strangled our service. The slow clients
were caused by an external AWS networking issue.

~~~
rossdavidh
Seeing how the debugging process actually went is also useful at times,
though: it gives you more examples of how other people work through problems
where the cause isn't clear.

~~~
tetha
And it allows for reflection: How would our troubleshooting and alerting
handle this?

We had a similar haproxy session saturation problem a couple of months ago. By
now, our alerting would pick this up within a minute and fire alerts that
include a runbook to resolve it. Our standard resolution would fail in a case
like this, but I'm pretty sure we'd solve it in a second iteration.
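For context, the check itself is nothing fancy; roughly something like the
sketch below, which reads haproxy's stats socket and warns when a frontend's
current sessions get close to its limit. The socket path and threshold are
placeholders for whatever fits your setup.

    # Rough sketch: warn when any haproxy frontend nears its session limit.
    # Assumes "stats socket /var/run/haproxy.sock" is configured in haproxy.
    import csv
    import socket

    SOCK = "/var/run/haproxy.sock"
    THRESHOLD = 0.8  # alert at 80% of a frontend's session limit

    def read_stats(path=SOCK):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(path)
        s.sendall(b"show stat\n")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        s.close()
        # The first line is "# pxname,svname,qcur,...": strip the leading
        # "# " so it can serve as a CSV header.
        lines = b"".join(chunks).decode().lstrip("# ").splitlines()
        return list(csv.DictReader(lines))

    for row in read_stats():
        if row["svname"] != "FRONTEND" or not row["slim"]:
            continue
        scur, slim = int(row["scur"]), int(row["slim"])
        if scur >= THRESHOLD * slim:
            print(f"ALERT {row['pxname']}: {scur}/{slim} sessions in use")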

------
orf
There is some kind of redirect loop going on here, can't read it at all.

