
Preventing server overload: limit requests being processed - jsnell
http://www.evanjones.ca/prevent-server-overload.html
======
Animats
I've used fair queuing to solve this problem.

I have a site and API which rates other sites, checking them for ad links and
trying to match them to real world business records. Inspecting a site takes
10 seconds to 2 minutes. The API, which is usually used from browser add-ons
which tag web search results, lets anyone request the rating info for a site.

Most requests are already cached and return immediately. Requests for unknown
sites are queued up for the rating engine, which runs in processes separate
from the web-facing side, and the API returns "wait" to the requestor. So
there's a work queue to manage.

It's managed using fair queuing. If there is no pending request from the
requesting IP address, the new request goes into the queue. If there's a
request from the same IP address already in the queue, the new request is held
until the first one completes, and then added to the end of the main queue. No
request is rejected until there are 100 requests from a single IP address. So
each IP address competes against itself, and no one IP can hog the system.
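
Roughly, in Python, the admission logic looks like this (a simplified
in-memory sketch; the real implementation is described in a reply below):

    from collections import deque

    MAX_PER_IP = 100           # reject only beyond 100 pending from one IP
    main_queue = deque()       # requests eligible for the rating engine
    held = {}                  # ip -> requests waiting for their turn
    active = set()             # IPs that already have a request in main_queue

    def submit(ip, request):
        backlog = held.setdefault(ip, deque())
        if (ip in active) + len(backlog) >= MAX_PER_IP:
            return "rejected"
        if ip in active:
            backlog.append(request)    # held until the earlier one completes
            return "held"
        active.add(ip)
        main_queue.append((ip, request))
        return "queued"

    def completed(ip):
        """Called when the rating engine finishes this IP's request."""
        active.discard(ip)
        if held.get(ip):
            submit(ip, held[ip].popleft())   # promote the next held request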

A typical overload comes when someone searches for an unusual topic and flips
through many pages of search results quickly. This can result in the rating
engine having fifty or so new sites to examine within a few seconds. Those
requests will be processed one at a time until the rating engine catches up.

This queuing system held up well when someone tried a test where they fed a
huge list of sites into the API without waiting for completion of any of their
requests. The rating engine ran busily for a week, but they were only tying up
one rating process, and it didn't affect other users at all.

There are no adjustment parameters. It just runs. It's not perfect, but it
deals well with legitimate transients and with abuse from a small number of IP
addresses.

~~~
thunderbong
That's really interesting. Can you tell us more about how it's implemented,
technology-wise?

~~~
Animats
Python and MySQL/MariaDB. The queue is an in-memory table. Fair queuing is
implemented as a big SQL SELECT statement.

The time scale is slow enough that polling it once a second for new work is
sufficient. If it had to go fast, it would require something like a standard
fair queuing implementation, where you hash the IP address and put the request
onto 1 of N queues, serviced round-robin.
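
A minimal sketch of that faster variant (N and the hash choice are arbitrary
here):

    import hashlib
    from collections import deque

    N = 64                                  # assumption: number of queues
    queues = [deque() for _ in range(N)]
    cursor = 0

    def enqueue(ip, request):
        # Hash the client IP onto one of N queues.
        h = int(hashlib.sha1(ip.encode()).hexdigest(), 16)
        queues[h % N].append(request)

    def next_request():
        # Service the queues round-robin, skipping empty ones.
        global cursor
        for i in range(N):
            q = queues[(cursor + i) % N]
            if q:
                cursor = (cursor + i + 1) % N
                return q.popleft()
        return None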

------
wutbrodo
> Over Quota
>
> This application is temporarily over its serving quota. Please try again
> later.

I'm actually not super sure if this is a single-serving joke site like
isthemissiononfire.com, or if this is just a hilarious coincidence.

~~~
HappyDeveloper
I thought that was pretty funny lol! They probably weren't expecting all that
traffic from hacker news

~~~
lhuser123
Indeed, very funny :)

------
kyrra
[I'm a Google employee, opinions are my own]

The GCP team did a blog post similar to this about using Load Shedding to
survive a spike[0]. It's definitely a great way to survive a sudden spike in
traffic, but not optimal, as serving errors is also bad. If your clients do
have a retry mechanism in place, though (especially one that backs off between
retries), it can be a great way to handle high loads.
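
A client-side sketch of that kind of retry, with exponential backoff and full
jitter (the function and numbers are illustrative, not from the post):

    import random
    import time

    import requests  # any HTTP client works; this one is for illustration

    def get_with_backoff(url, attempts=5, base=0.5, cap=30.0):
        """Retry on 503, sleeping a random time up to an exponential bound."""
        for attempt in range(attempts):
            resp = requests.get(url, timeout=10)
            if resp.status_code != 503:
                return resp
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        return resp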

[0] [https://cloudplatform.googleblog.com/2016/12/using-load-shed...](https://cloudplatform.googleblog.com/2016/12/using-load-shedding-to-survive-a-success-disaster-CRE-life-lessons.html)

~~~
jonhohle
Retries aren't the only option if you control (or can at least influence) the
client. Fallback sources and sane (or customer preferential) defaults are some
additional options when the primary source of truth is unavailable. Both of
these bypass thundering-herd situations where recovery can be difficult even
if the application returns to full health.

As a concrete example, when I worked at a large online retailer, we had to
check certain customer properties during checkout. In the optimal case, we
went to a service which fronted a consistent, transactional database. If that
service failed or timed out, we had the option to hit a fast cache which
generally was only a few milliseconds out of date. Finally, we could use less
authoritative data that had come from the application earlier, which, if
present, would always be in the customer's favor. That made it unlikely that our
service ever appeared "down" regardless of the state of the dependencies.

~~~
robbles
Why wouldn't you start with the cache?

Isn't the consistent database only useful if you require it to be used, and
don't fall back to a less consistent alternative?

~~~
jonhohle
Based on previous decisions made by customers that were processed within the
lifetime of the same request, there was some probability that the cache could
be out of date.

Typically, there was enough context within the request to make assumptions
about the state of the authoritative store if data hadn't propagated to the
cache in time. We eventually moved to using that initially.

edit: I should add, though, the order of the datasources wasn't critical to
the strategy: have a primary data source, backup data source, use any data
available in the request context, and finally err in the customer's favor.
That's 4 chances to get things right, none of which include retrying.
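
A sketch of that kind of chain (every name here is a placeholder, not the
actual system):

    class Timeout(Exception): pass
    class CacheMiss(Exception): pass

    DEFAULTS = {"eligible": True}     # err in the customer's favor

    def customer_properties(customer_id, ctx, service, cache):
        try:
            return service.get(customer_id, timeout=0.2)  # consistent store
        except Timeout:
            pass
        try:
            return cache.get(customer_id)     # fast, maybe milliseconds stale
        except CacheMiss:
            pass
        if ctx.get("properties") is not None:
            return ctx["properties"]          # data carried in the request
        return DEFAULTS                       # fourth chance: safe default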

------
makmanalp
The thing that trips me up a lot, alongside request caps, is timeouts (briefly
mentioned in the article). There are so many network / API / retry timeouts
set to arbitrarily large values (e.g. 30 seconds) that it's hard to predict
how they'll interact, and often you don't even know that some of them exist or
at which layers they live (application layer, network layer, DNS, etc).

How you often see this is that your request limiter keeps booting new requests
(FIFO/drop tail), and then when you check your currently processing requests,
they seem to be all in some busy loop, and hopefully you have some means of
checking why. You check, and you find that they're all querying a service that
accepts the request but doesn't respond ever (e.g. server down behind a
proxy), but for some reason the timeout on the client side is set to 30
seconds so stuff keeps hanging until it's killed.

Anyway, this is to say that in any sufficiently complex networked system
scenario, it helps to review all these timeouts holistically and in the
context of your application.
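
As a tiny example of making one of these explicit: Python's requests library
waits forever by default, so passing separate connect/read timeouts at every
call site at least makes the numbers visible to review:

    import requests

    # Explicit (connect, read) timeouts. Keep them shorter than whatever
    # upstream is waiting on *this* call, so failures surface bottom-up.
    resp = requests.get("https://api.example.com/thing", timeout=(3.05, 5.0))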

~~~
puzzle
You also have the issue that you might be dropping new requests, but there's
so much backlog that old requests eventually time out in droves, too. You
might end up spending a lot of CPU cycles doing partial work on some requests,
only to eventually see them hit their deadline.

That's one of the reasons why Facebook has a mechanism that switches their
FIFO queues to LIFO, once the server has passed a certain load/latency
threshold.
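
The switch itself is simple to sketch; the interesting part is the overload
signal, which is whatever load/latency threshold you already track:

    from collections import deque

    queue = deque()

    def next_request(overloaded: bool):
        """FIFO normally; LIFO under load, so fresh requests (whose clients
        are probably still waiting) are served before stale ones."""
        if not queue:
            return None
        return queue.pop() if overloaded else queue.popleft()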

~~~
rrdharan
Thereby inadvertently rewarding people who mash their reload button when the
app/site is slow :)

~~~
adrianratnapala
But mashing reload is what users are supposed to do by design of the web.

That is, errors in internetworked systems can happen at a level high enough
that machines can't decide whether it is sane to retry or not; in that case
they should simply bubble the error all the way back to the user and let her
press F5 if she wants to.

Modern UIs might hide this, but users well know the web is flaky and
understand it in their bones. So when they press reload they are sending
important information.

------
kevan
The SRE book has some useful discussion[1] on this topic too.

[1] [https://landing.google.com/sre/book/chapters/addressing-casc...](https://landing.google.com/sre/book/chapters/addressing-cascading-failures.html#xref_cascading-failure_load-shed-graceful-degredation)

~~~
ngrilly
The SRE book is already mentioned (with the very same link you provided) in
the discussed article:

> The Google SRE book has a chapter about server overload and cascading
> failures, which I highly recommend if you want to learn more.

------
ot
A classical solution is to use a control algorithm (for example PI or PID) to
avoid having to estimate the overload threshold: the server will just start
shedding a fraction of the requests to maintain some metric (CPU utilization,
latency, queue length, ...) below some target.
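
A toy PI controller version of that (the gains are illustrative and would need
per-service tuning):

    class ShedController:
        def __init__(self, target, kp=0.1, ki=0.02):
            self.target = target        # e.g. a p99 latency target, seconds
            self.kp, self.ki = kp, ki
            self.integral = 0.0
            self.drop_prob = 0.0

        def update(self, measured):
            error = measured - self.target                   # > 0 if overloaded
            self.integral = max(0.0, self.integral + error)  # crude anti-windup
            raw = self.kp * error + self.ki * self.integral
            self.drop_prob = min(1.0, max(0.0, raw))
            return self.drop_prob

Each control interval you feed in the measured metric, then shed each incoming
request with probability drop_prob.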

See for example [1]. This is about background jobs, but the same principles
apply to load shedding.

[1] [http://folk.ntnu.no/skoge/prost/proceedings/acc04/Papers/035...](http://folk.ntnu.no/skoge/prost/proceedings/acc04/Papers/0354_ThA01.3.pdf)

~~~
evanj
Thanks for the link, the abstract seems relevant. I agree: a control algorithm
seems like it should work. The challenge is figuring out the metrics and
parameters to make something that works "well enough" for most applications,
like TCP does for networks. I would really love for someone to figure that
out, so we don't run into this very often.

~~~
ot
Yeah, it's something that needs to be tuned per-service. It is quite easy if
the service is X-bound for some local resource X (CPU, disk, flash, network
card), but if it is bottlenecked on external service calls it can be quite
challenging to define a representative stress metric.

------
kevin_nisbet
I've seen similar issues occur, but based around GC behavior.

As in the author's example, when requests start stacking up in a process, we
also increase the memory pressure on the GC. It's possible to hit a sort of
pathological case, where allocations live just long enough to get promoted to
the older generations and then die, requiring almost constant major GCs to
free the promoted objects.

I worked on one system where all performance testing was done in a lab
environment under perfect conditions, with external resources all taking less
than 1 ms to respond. When deployed to a production network, however, it
consistently performed abysmally, losing an order of magnitude of performance
to the larger number of outstanding requests being processed and the
associated GC load.

Newer GCs do appear to handle object promotion much better, but some GC tuning
and a reduction in garbage can still yield an order-of-magnitude increase in
performance.

Anyway, one more thing to consider.

------
kailuowang
I developed a library to guard against server overload. It automatically
gauges the capacity and rejects requests exceeding it.

[http://iheartradio.github.io/kanaloa/docs/theories.html](http://iheartradio.github.io/kanaloa/docs/theories.html)

~~~
evanj
Awesome, this is exactly the sort of system I had in mind when I wrote the
article. I'll take a look, thanks!

~~~
sitkack
Logged-in users should take priority; otherwise, depending on the number of
page requests, all users will see an error at some point, breaking the site
for everyone.

------
STRML
There was a native NodeJS module called "toobusy" that I rewrote a few years
back to remove the native dependencies:

[https://www.npmjs.com/package/toobusy-js](https://www.npmjs.com/package/toobusy-js)

It's worth checking out - it helps you achieve this goal _without_
necessarily having to come up with a req/s magic number; instead you measure
actual event loop latency to detect a slowdown.
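
The same trick ports to anything with an event loop; here's a rough asyncio
analog (the threshold is illustrative, and toobusy-js also smooths the lag,
which this skips):

    import asyncio

    MAX_LAG = 0.07      # seconds; illustrative threshold
    current_lag = 0.0

    async def monitor(interval: float = 0.5):
        """Event-loop lag = how late a periodic sleep actually wakes up."""
        global current_lag
        loop = asyncio.get_running_loop()
        while True:
            start = loop.time()
            await asyncio.sleep(interval)
            current_lag = max(0.0, loop.time() - start - interval)

    def too_busy() -> bool:
        return current_lag > MAX_LAG   # if True, shed the request with a 503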

~~~
spraak
That's really neat to actually measure event loop latency

------
anonacct37
A non-general solution that might work for most of us often goes by the name
"shopper prioritization".

Basically if you have a site that does lots of processing (search, orders,
etc) then a pretty effective strategy is often to show a generic static "down
for maintenance, try again" page. This requires vastly fewer resources to
serve since the page can be sitting in web server ram and doesn't result in
rpc calls throughout the whole stack.

You can also do something like prioritize logged in users over casual
browsers.

Since most HTTP clients are smart clients, you can even do some stuff with
RNGs and timeouts to selectively let people who are willing to wait around
back into the site (sketched below).
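
A handler-level sketch of that idea (every name here is a placeholder):

    import random

    ADMIT_PROB = 0.1   # assumption: traffic fraction let back in while degraded
    MAINTENANCE = (503, {"Retry-After": "30"},
                   b"Down for maintenance, try again shortly.")

    def handle(request, overloaded, full_site):
        # overloaded: callable returning bool; full_site: the real handler.
        if not overloaded() or random.random() < ADMIT_PROB:
            return full_site(request)
        # The static page is nearly free: no RPC calls, can sit in server RAM.
        return MAINTENANCE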

Of course if you're an ad-tech company it's likely that serving an error is
about as fast as serving something successfully.

~~~
jacquesm
Non-logged in users should be served almost exclusively from cache anyway.

------
ninjakeyboard
Ironically, I'm getting a message saying the server is overloaded. Here is the
cached copy:
[http://webcache.googleusercontent.com/search?q=cache:1oBS2v-...](http://webcache.googleusercontent.com/search?q=cache:1oBS2v-V_r0J:www.evanjones.ca/prevent-server-overload.html+&cd=1&hl=en&ct=clnk&gl=ca)

------
tyingq
Haproxy is very flexible for this sort of thing. You can, for example, limit
concurrent connections to backends, and place requests in a queue with a
specific timeout. Not at all automatic, but at least all the right knobs are
there.
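
For example (real HAProxy directives, illustrative numbers):

    backend app
        timeout queue 5s                        # give up on queued requests after 5s
        server app1 10.0.0.1:8080 maxconn 100   # cap concurrency; excess waits in queue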

Edit: Related, FedEx's tracking is down on their website right now. It's
returning quickly, though, with this in the JSON: _"Service
com.fedex.nxgen.trck.v8.ientities.TrckInterface is busy, max invoke limit
reached"_

------
peterwwillis
In AWS, achieving peak theoretical performance requires queueing and pooling
jobs of specific sizes. You are literally going to suffer costly, bad
performance if you don't implement backlog and concurrency limits.

This is basically just "stacks and queues" in a general sense, most often seen
as wait-and-retry in an application, and rate controls in a service.

To do this and survive cascading failures you have to know the limits of each
part of your stack and implement rate controls for each. An easy way to do
this is to design network services that serve requests for applications via an
API, rather than giving apps unfettered access to network resources. You can
allow applications access to your network service and change rate controls for
each app as needed. Alternately, you can add limits to your apps' layers at
design time, rather than having to discover the limits by performance testing
or trial-and-error.
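
Those rate controls are usually some variant of a token bucket; a minimal
sketch:

    import time

    class TokenBucket:
        """Allow short bursts while enforcing an average rate."""
        def __init__(self, rate: float, burst: float):
            self.rate, self.burst = rate, burst    # tokens/sec, bucket size
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False    # caller sheds, queues, or retries with backoff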

The author wants a universal solution, which is a bit like saying, please
solve "traffic" for me for every mode of transportation. Airplanes have
different traffic than cars, so their designs will differ, although they all
involve an agent that coordinates traffic flows according to a set of rules.

~~~
evanj
I agree that a universal solution sounds difficult. However, it seems like it
should be possible to have something that works "well enough" for some (broad)
class of servers. My example is TCP, which does a "good enough" job of
controlling the flow of packets over a wide range of networks.

------
abritinthebay
This is basically how Hapi works in the NodeJS world.

If the event loop delay is too long (you configure the threshold), it means
you're doing too much! So it issues 503s to all new traffic until the server
can cope with the requests it's already receiving and the loop time comes
back down.

Works very well and you can configure it to do the same with memory limits as
well.

Doesn't fix all issues, but it's a good way to keep your site usable by _some_
users at least.

~~~
necrodome
Not usable for some _users_, though; usable for some _requests_. Users will
still be annoyed when they can't complete actions, because some of their
requests are failing randomly.

~~~
abritinthebay
That's true, and it is a limitation.

------
EGreg
My recommendations for rate limiting:

1) Sign session ids that you issue and reject requests that don't have a valid
signature. This can be done entirely in software at the router level without
any I/O

2) For each session, do authentication. Unauthenticated sessions get lower
caps.

3) For authenticated sessions, have various schemes for X units of Y in Z
seconds. Then consult this database - sharded by Y - before each expensive
operation on Y (a sketch follows after this list).

4) Y should be prefixed with a user role or payment plan or whatever, followed
by the actual resource id. So people can buy another payment plan.

5) Possibly let people pay to access a resource beyond the quota.

6) Make clients recognize quota errors and retry with exponential backoff.

7) Host static resources in the app, fallback to a CDN, and only host dynamic
resources on your own source servers.

8) Wherever possible, cache things in the client, keeping in mind that you
have to evict the least recently used items eventually.
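
A sketch of the check in 3) and 4), with an in-memory dict standing in for the
sharded database:

    import time
    from collections import defaultdict, deque

    windows = defaultdict(deque)    # key -> timestamps of recent uses

    def allow(key: str, x: int, z: float) -> bool:
        """At most X units of Y in Z seconds (sliding window). The key is
        e.g. "plan:pro:resource:42": role/plan prefix plus resource id."""
        now = time.monotonic()
        q = windows[key]
        while q and q[0] <= now - z:
            q.popleft()             # forget uses that left the window
        if len(q) >= x:
            return False            # quota error; client backs off (point 6)
        q.append(now)
        return True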

------
FRex
Nginx has some config features related to just that:
[https://www.nginx.com/blog/rate-limiting-nginx/](https://www.nginx.com/blog/rate-limiting-nginx/)
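
The basic per-client shape from that post (rates illustrative):

    # inside the http {} block:
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        location /api/ {
            limit_req zone=perip burst=20 nodelay;
        }
    }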

~~~
tyingq
The docs are centered on per client rate limits, which is a bit different.
Though I believe you can make them global.

------
mnw21cam
The main problem with this article is that it assumes that the only resource
consumption is CPU/memory on the server. With that assumption, it makes
complete sense to limit the number of concurrent connections.

However, if you do so, then your server can be brought to its knees by a
single client opening that many connections, and then just being really slow
in accepting the data down the network. Then the limited connections on the
server are all sitting there doing nothing, and no new connections are
allowed.

~~~
evanj
This is a good point, although I don't think I said "connections", I said
"requests". The definition of requests is going to vary significantly. Yes,
very slow clients are yet another problem that extremely robust systems do
have to handle.

In my personal case, nearly all of my work has always been on "internal"
services that are being used inside a single organization inside a data
center, where this is rarely a problem. This is much more of an issue for
services that are accessed over the public Internet, where you need to handle
malicious attackers as well as users that have horrible connections.

In any case, I believe what I wrote still applies, but you do need to set
appropriate timeouts (yet another parameter to tune; yuck!).

------
cristaloleg
Sounds like a
[https://en.wikipedia.org/wiki/Leaky_bucket](https://en.wikipedia.org/wiki/Leaky_bucket)

------
boulos
For those asking, this seems to be hosted on App Engine and he's gone over his
quota. App Engine's throttling is both a blessing and a curse (presumably he's
using the free tier and never noticed this). You'll note that it's explicitly
saying he's over _quota_ not that it can't handle it ;).

Disclosure: I work on Google Cloud.

~~~
evanj
Exactly right. I just enabled billing on this project. Oops!

------
TekMol
This extremely simple solution works nicely for us:

    
    
        // Parse the 1-minute load average (the first field of /proc/loadavg).
        $load = (float) explode(' ', file_get_contents('/proc/loadavg'))[0];
        if ($load > 10) die('Server load currently too high.');

So when the load gets too high, the server will not process the request, which
brings the server load back down. An easy self-regulating system.

~~~
nkozyra
But by this point the request has already been accepted and processed. It's
also using a rolling average, which means you'll be "killing" connections that
otherwise wouldn't need to be killed.

~~~
TekMol
Well, there is no exact threshold for what "needs to be killed". The higher
the load, the longer it takes to process requests. At some point you'd rather
drop additional requests than make things even slower for everybody. That is
what happens here.

Think of it like this: When the server is already darn slow at a load of 10 we
don't want to process more requests because we know at values over 10 it gets
unbearable.

~~~
nkozyra
My point is it's still processing the request. You're just shielding the web
server from whatever the application holds.

In other words, let's find the theoretical limit to what your web server can
handle in terms of concurrent requests. Let's throw them all at your
application. In this scenario, the check-load-then-die code is worthless
because the request will never get there. It doesn't solve the problem as
shown in the OP. What it does is mitigate the effect of the application on the
total load of the web server. Which is another problem altogether.

Also:

> drop additional requests instead of making things even slower for everybody.
> This is what happens here.

Not exactly. The request completes like any HTTP request would. That you're
die()ing out doesn't change that, at least from the perspective of the web
server.

~~~
TekMol

        find the theoretical limit to what your web server
        can handle in terms of concurrent requests. Let's
        throw them all at your application. In this scenario,
        the check-load-then-die code is worthless because
        the request will never get there.

If there is such a limit, then why not mitigate that by setting a max number
of connections for the webserver?

    
    
        You're just shielding the web server from
        whatever the application holds.

True. But as I said: it works. We've never seen our server get slow since we
implemented this a few years ago. It kicks in when we get about 25x our normal
traffic, which has happened less than once a year so far.

------
thosakwe
Ironically enough, clicking the link tells me that your site itself has
exceeded its serving quota.

What a world we live in.

------
clon
[https://github.com/brandur/redis-cell](https://github.com/brandur/redis-cell)

This is fast becoming my go-to tool for any sort of rate limiting at the
application level.

------
cfieber
This is another good post on the topic: [https://stripe.com/blog/rate-limiters](https://stripe.com/blog/rate-limiters)

------
EGreg
Irony of ironies:

_Error: Over Quota. This application is temporarily over its serving quota.
Please try again later._

------
PatrickAuld
The GCE Over Quota page is perfect.

------
mnx
Is reddit using this? At times of high load, getting an error there seems
kinda random.

------
palakchokshi
Is this a /s post? I get a server-overloaded message when I click the link.

