
Using load shedding to survive a success disaster - CRE life lessons - fhoffa
https://cloudplatform.googleblog.com/2016/12/using-load-shedding-to-survive-a-success-disaster-CRE-life-lessons.html?m=1
======
stuckagain
I have a set of rules for load shedding that I urge you to consider. First,
and foremost, whenever you read (or are about to say) "load shedding" just
mentally substitute the correct terminology, which is "intentionally serving
errors." This will put you in the right frame of mind to properly ponder the
outcomes.

Secondly, the error path on your backend must be strictly cheaper than the
success path, or the whole scheme doesn't work. Particularly bad actions on
the error path include, for example, logging an error at such a high severity
that the log files need to be flushed and synced, which is likely to be
tremendously expensive. Another example is taking a mutex to increment an
error counter that normally wouldn't be touched on the serving path. If that
tends to synchronize all your serving threads, your server will collapse.

Third, load shedding can only be implemented correctly if you control the
client and the server, end-to-end. Perhaps you want to avoid hot spots by
serving a soft error from an overloaded shard. If your client is guaranteed to
try another shard (or just give up) this is a good approach. If the client
might retry on the same shard, then it's not helpful. You just "shed load" in
such a way that you had to serve the same request twice.
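The third rule can be sketched from the client side (function and exception names are my assumptions): on a soft error, the client must try a shard it hasn't hit yet rather than retrying the same overloaded one.

```python
import random

# Hypothetical client-side sketch: treat a shed response as a signal to
# move on to a *different* shard, never to retry the same one.

class SoftError(Exception):
    """Raised when a shard sheds the request."""

def fetch(shards, request, max_attempts=3):
    tried = set()
    for _ in range(max_attempts):
        candidates = [s for s in shards if s not in tried]
        if not candidates:
            break
        shard = random.choice(candidates)  # spread retries across untried shards
        tried.add(shard)
        try:
            return shard(request)
        except SoftError:
            continue  # this shard shed the load; try another
    raise SoftError("all attempted shards shed the request")
```

If the client instead retried the same shard, the server would pay for the request twice, which is exactly the failure mode the comment describes.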

~~~
Kalium
Your second point is critically important and often not well considered. I
once worked on a system where the base exception class logged to the
filesystem upon instantiation, which meant a catastrophic response to a full
disk.

------
drdrey
Some additional things that can be done:

* soft-shedding: instead of dropping a request (which might just trigger a retry storm), it is sometimes appropriate to send back a cheap response, so that the client sees a success instead of an error

* route critical and non-critical requests to separate clusters that can be scaled and configured independently. The blog post mentions doing this with DNS, but it also works for mid-tier services.

* build back-pressure into the client. Instead of relying on a timeout or error, a well-behaved client can enter a "polite" mode when it receives a signal that the backend is overwhelmed.
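The back-pressure point above could look something like this (class name, 503 signal, and delays are all assumptions on my part): the client doubles its delay while the backend reports overload and drops back to normal on the first success.

```python
import time

# Hypothetical sketch of a client-side "polite" mode: back off
# exponentially while the backend signals overload, reset on success.

class PoliteClient:
    def __init__(self, send, base_delay=0.1, max_delay=5.0):
        self.send = send              # callable: request -> (status, body)
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = 0.0              # 0.0 means normal (not polite) mode

    def request(self, req):
        status, body = self.send(req)
        if status == 503:             # overload signal from the backend
            self.delay = min(self.max_delay,
                             self.base_delay if self.delay == 0.0 else self.delay * 2)
            time.sleep(self.delay)    # polite mode: slow down before the next call
        else:
            self.delay = 0.0          # backend healthy again: leave polite mode
        return status, body
```

The key design choice is that the backend's explicit signal, not a client-side timeout, drives the backoff, so the client slows down before it starts contributing to the overload.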

~~~
euyyn
Is soft-shedding a good idea? You're still paying CPU time to come up with a
response that is a 200 but probably won't fool the client or the user anyway.
And without a signal that the server is struggling under load, the retries
may become more aggressive.

~~~
dantiberian
Imagine a Netflix-type service that needs to compute a bunch of different
categories to show the user. If some of them are more expensive to compute
than others, those categories could be dropped while still returning an
acceptable response to the user.
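That idea can be sketched as a time-budgeted page builder (all names and the cost model here are my own illustration, not from the comment): compute cheap categories first and, when overloaded, drop the ones whose estimated cost won't fit the budget.

```python
import time

# Hypothetical sketch: under load, shed the expensive categories but still
# return an acceptable partial page to the user.

def build_homepage(categories, budget_s, overloaded):
    """categories: list of (name, estimated_cost_s, compute_fn)."""
    result = {}
    deadline = time.monotonic() + budget_s
    for name, cost, compute in sorted(categories, key=lambda c: c[1]):
        if overloaded and time.monotonic() + cost > deadline:
            continue  # drop this category; a partial page is still a 200
        result[name] = compute()
    return result
```

This is soft-shedding from the earlier subthread: the user sees a successful, if thinner, response rather than an error.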

------
ChuckMcM
We did this very successfully at Blekko (a search engine) to keep the system
from getting overloaded. The frontend engineer Bryn designed a really useful
way of monitoring nginx connections to the backend and shedding load when
they exceeded a threshold, and Greg designed a 'geoknob' that let us turn off
traffic from regions of the world that were unlikely to be our primary
customer base.

Anomalous load shedding is also a great indicator of a traffic anomaly. Big
scrapers sometimes showed up that way first, even when their traffic was
coming from a large number of IPs.

------
okreallywtf
Can anyone comment on the level of scale at which this kind of issue might
arise? It seems like it would be fairly costly to implement and test. Is it
safe to assume that by the time you attempt this, it is either infeasible or
too costly to keep scaling up to service the peak load? If you were running
on bare metal and simply could not add more instances/databases/caches/etc.
fast enough, I can understand that you might be able to deploy a software
solution like this faster than you could increase capacity. I could also
understand being capped by the cost of continuing to scale, but I can't
imagine putting development effort into this kind of solution unless there
were no other options.

Would these kinds of techniques ever be a worthwhile exercise for an early
startup or a small company hosting in the cloud, or are they a last resort
after you have already gotten quite large?

~~~
stuckagain
I think your question is really excellent. If you don't feel pressure to eke
out the very last drop of marginal utilization, load shedding might not be for
you. It may be quite a bit easier to just add 5% more CPU or sandbag your
open-loop capacity plan.

~~~
okreallywtf
Thanks. It's an interesting topic, but as I read it I started to think about
someone with a startup or a side project actually putting effort into this
prematurely, rather than into more productive (in cost and time)
optimizations or into the product itself.

I'm definitely going to store the techniques away in my mental toolbox for
later though and hope I get an opportunity to say "Procrustean load shedding"
in a standup sometime.

------
ozgune
For anyone who's interested, the following paper on load shedding is also a
good read:
[https://pdos.csail.mit.edu/6.828/2010/readings/mogul96usenix...](https://pdos.csail.mit.edu/6.828/2010/readings/mogul96usenix.pdf)

The paper basically identifies the problem as a "livelock": the system
receives so many requests that instead of making any real progress, it spends
its time moving those requests through different queues (Section 6).

If you're building a distributed system (say, an SOA), I find that load
shedding also has the nice property of giving the system's clients immediate
feedback, rather than having them wait a long time and guess.

------
intr1nsic
This reads like an ideal use case for an object store service. My guess is
with the traffic patterns of mobile clients, this was a necessity. Good read.

