
An Alternative Approach to Rate Limiting - dfield
https://medium.com/figma-design/an-alternative-approach-to-rate-limiting-f8a06cf7c94c
======
sulam
There's a fixed memory solution that doesn't suffer from the boundary
condition that allows you to double the rate. It's pretty straightforward, so
I'll describe it in prose, since it's 6am and I'd rather not get the code
wrong. :)

The approach uses a ring buffer. If you're not familiar with them, a ring
buffer is a fixed-size array or linked list that you iterate through
monotonically, wrapping around to the beginning when you reach the end. Our ring buffer
will hold timestamps and should be initialized to hold 0's -- ensuring that
only someone with a time machine could be rate limited before they send any
requests. The size of the buffer is the rate limit's value, expressed in
whatever time unit you find convenient.

As each request comes in, you fetch the value from the buffer at the current
position and compare it to the current time. If the value from the buffer is
greater than the current time minus the rate limit interval you're using, then
you return a 420 to the client and are done. If not, their request is OK and
you should serve it normally, but first you store the current timestamp in the
buffer and then advance the index.
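
A minimal single-process sketch of the above in Python (class and variable
names are mine, not from the comment; a multi-server version would need to
live behind a shared service, as discussed below):

```python
import time

class RingBufferLimiter:
    """Allow at most `limit` requests per `interval` seconds for one client."""

    def __init__(self, limit, interval):
        self.interval = interval
        self.slots = [0.0] * limit   # timestamps, initialized to "long ago"
        self.index = 0

    def allow(self):
        now = time.time()
        # The slot at `index` holds the timestamp of the request that came
        # `limit` requests ago. If it falls inside the interval, we've already
        # served `limit` requests in that window.
        if self.slots[self.index] > now - self.interval:
            return False
        self.slots[self.index] = now
        self.index = (self.index + 1) % len(self.slots)
        return True

limiter = RingBufferLimiter(limit=10, interval=60)  # 10 requests per minute
```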

~~~
cakoose
The article describes solutions that use Redis so that multiple app servers
can share rate limits. The article's second solution is basically what you're
describing, except adapted to work with Redis.

Also, what do you mean by "fixed" memory? Sure, the memory doesn't grow over
time, but neither does the memory of the other solutions. Of the solutions
listed in the article, this is the most memory-hungry.

~~~
sulam
The other fixed memory solution that is specifically mentioned suffers from a
defect that allows you to go up to 2X past the rate limit by sending 1X on
both sides of an aggregation boundary. I thought readers might appreciate an
alternative that is also fixed memory but that doesn't suffer from this
defect. And sure, the fixed amount of memory is larger than for the other
solutions (though probably not larger than the Redis solution), but it's likely
optimal if you care about being sure the rate limit is never exceeded within
your given time interval.
YMMV, I am making no claims about universal applicability.

Finally, yes, it's not Redis -- but it could be exposed as a service if you
wanted that pretty easily. Operational complexity will exist regardless and
depending on your organization different solutions will be appealing for
different reasons.

------
irgeek
_In response, we implemented a shadow ban: On the surface, the attackers
continued to receive a 200 HTTP response code, but behind the scenes we simply
stopped sending document invitations after they exceeded the rate limit._

And right there they broke the service for legitimate users. Totally
unacceptable collateral damage IMHO.

~~~
karmakaze
Without shadow ban, you're just telling the spammers how to be effective and
stay just under the limit.

~~~
scaryclam
So? Users playing by the rules of the service get to use the service
unhindered. A much better approach to shadow banning would be to make better
rules and enforce them. Spammers don't want to jump through hoops, so if you
implement rules based on bounces and spam reports you'll get the same result
(fewer spammers) but without screwing over legitimate users.

This is how Mailgun and their ilk operate, and while it's annoying to get
bitten by their rules (we forgot to warm up a mailing list once and got a
temporary suspension as our bounce rate was too high) they treated us like
adults, told us _why_ our service had been suspended and proceeded to help us
clean up the mailing list. If they had pulled some shadow banning BS we'd have
just left the service as we wouldn't be able to trust that they're not messing
us (and our clients) around.

Shadow banning works just fine for online forums and the like. It's a pretty
terrible method of rate limiting though.

~~~
forgottenpass
_This is how Mailgun and their ilk operate_

That's because their business model is to facilitate the level of spamming
that sits right below the threshold of anti-spam measures.

Of course they're going to help you send out as many messages as possible.
That's what you pay them for.

Without saying that OP's approach was the most appropriate solution to their
problem, I'll point out that Figma's bottom line isn't directly connected to
how many document invite emails they shoot out. That's just a collaboration
feature of a larger product.

------
timothycrosley
I would think that if you have a consumer application that can't handle double
the set rate limit during a very small corner case (the start and end of the
minute boundary), you have bigger problems, as you're still effectively
enforcing your rate limit over time with that approach. This just sounds like
micro-optimization at its worst.

~~~
zokier
Yeah, I was also thinking how meaningful ~20MB of memory use really would be
in this context. Or how badly would racy token bucket perform in the real
world. Still, enjoyed the read.

~~~
jdwyah
I think this is an important point. Trying to store all of these in RAM means
you can only have so many, which is why I really like something that can use a
more cost-efficient DB as a backing store. Once you start thinking about what
you could do if you could have 1000s of rate limits per user you end up
thinking of lots of interesting ways to use them. Like limiting how often you
log/track-usage to 1/hr per event per user. That's saved me a ton of money.

Second thought: token buckets have a nice property of being really cacheable
once they expire. You can push down a "won't refill until timestamp" and then
clients can skip checking altogether.
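
A small sketch of what that client-side caching could look like (the callback
shape here is my assumption, not ratelim.it's actual API):

```python
import time

class CachedLimitClient:
    """Client-side wrapper: skip remote checks while a bucket is known empty."""

    def __init__(self, check_remote):
        # check_remote() -> (allowed: bool, refill_at: epoch seconds or None)
        self.check_remote = check_remote
        self.blocked_until = 0.0

    def allow(self):
        now = time.time()
        if now < self.blocked_until:
            return False              # no network call; bucket is known empty
        allowed, refill_at = self.check_remote()
        if not allowed and refill_at is not None:
            self.blocked_until = refill_at   # cache the "won't refill until" timestamp
        return allowed
```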

------
joneholland
You know you can implement a token bucket that doesn't share state between
your API servers, in about 10 lines of code, using just an in-memory map.

Your incoming requests should be balanced across all of the servers, so you
just derive the allowed throughput and divide by the number of front ends....
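
A rough sketch of that, assuming the load balancer spreads traffic evenly
(names and numbers are illustrative):

```python
import time
from collections import defaultdict

GLOBAL_RATE = 100.0                      # allowed requests/sec per user, fleet-wide
NUM_SERVERS = 4
LOCAL_RATE = GLOBAL_RATE / NUM_SERVERS   # this server's share
BURST = LOCAL_RATE                       # bucket capacity: one second of burst

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.time()})

def allow(user_id):
    b = buckets[user_id]
    now = time.time()
    # Refill proportionally to elapsed time, capped at the bucket size.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * LOCAL_RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False
```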

~~~
mason55
This only works if your bucket size is much larger than your number of
servers.

In the degenerate case, imagine a rate limit of 2 total requests per minute
load balanced across 2 servers with enough traffic that my request is
basically hitting a random server. In this case, 50% of the time my second
request will be rate limited incorrectly because each server has a bucket of 1
and my second request went to the same server as my first.

I'm sure someone smarter than me (and better at probability) could come up
with an equation where you input your rate limit & number of servers and it
tells you the probability of a false positive for a user.
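
Not a closed-form equation, but a crude Monte Carlo sketch of exactly that: a
client sends the full advertised limit within one window, requests land on
servers uniformly at random, and each server holds a local bucket of
limit/servers.

```python
import random
from collections import Counter

def false_positive_rate(limit, servers, trials=100_000):
    per_server = limit // servers          # each server's local bucket
    rejected = 0
    for _ in range(trials):
        hits = Counter(random.randrange(servers) for _ in range(limit))
        # False positive: some server saw more requests than its bucket allows.
        if any(count > per_server for count in hits.values()):
            rejected += 1
    return rejected / trials

print(false_positive_rate(2, 2))      # ~0.5, matching the degenerate case above
print(false_positive_rate(100, 4))    # a 100-per-window limit split across 4 servers
```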

~~~
joneholland
That's valid; if the distribution of work is not fair, this won't work.

In practice, when you are receiving enough traffic to make throttling
practical you aren't usually throttling at 2 RPM across 2 servers.

~~~
eridius
Even if you have more servers, you'll still very easily hit the case of rate
limiting someone too early. And that's really bad, because it means your
clients, who are aware of the rate limit and structure their code to stay
under it, will start getting failures they shouldn't get, and they have no way
to handle it besides intentionally staying significantly under the advertised
rate limit.

So if you're really set on doing something like this, you need to set the
actual rate limit to be significantly higher than the advertised rate limit,
such that it's extremely unlikely for a client to be rate limited too fast.

------
daliwali
I think rate limiting is the wrong idea. Say, for example, a client wants to
re-fetch everything that it has cached: it may send a burst of requests in a
short amount of time, and some of those requests may be wrongly rejected due
to rate limits. This is what happens when a browser refreshes a page, for
example.

A better approach I think is delaying request handling based on resource
allocation. If one client is disproportionately using up more time than
others, then that client's requests will be queued up and processed later,
while well-behaved clients will get their requests handled quickly. I
think this is a more realistic approach and imposes less arbitrary
restrictions.
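
A loose sketch of that idea, assuming you can measure per-client processing
time (the decay factor, fair share, and delay cap are arbitrary placeholders):

```python
import time
from collections import defaultdict

DECAY = 0.9                        # how quickly a client's remembered usage fades
usage = defaultdict(float)         # client_id -> decayed processing seconds

def handle(client_id, process, fair_share=0.5):
    # Clients over their fair share wait in proportion to the excess;
    # well-behaved clients proceed immediately.
    over = usage[client_id] - fair_share
    if over > 0:
        time.sleep(min(over, 5.0))
    start = time.time()
    result = process()             # the actual request handler
    usage[client_id] = usage[client_id] * DECAY + (time.time() - start)
    return result
```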

~~~
jdmichal
Once you start getting into quality-of-service, basic rate limiting like the
kind discussed here is not enough. I think Stripe did a better job of covering such
concerns in their rate limiter post. They specifically talk about being
lenient to "bursty" traffic, as that was a legitimate use-case for clients.

[https://stripe.com/blog/rate-limiters](https://stripe.com/blog/rate-limiters)

[https://news.ycombinator.com/item?id=13997029](https://news.ycombinator.com/item?id=13997029)

~~~
peterhunt
Yes, this is why you want to use a token bucket rate limiter (which for some
reason was considered and rejected by the original post). We wrote it up here
and have an open-source impl on github that's in production serving hundreds
of millions of rate limits: [https://medium.com/smyte/rate-limiter-
df3408325846](https://medium.com/smyte/rate-limiter-df3408325846)

------
sarreph
This is why I like HackerNews comments. As a primarily front-end dev building
a SaaS, I'd already bookmarked this post and was planning on implementing it.
But it seems like the comments here are pointing me in a better direction.

------
jnwatson
Sounds like a lot of work to avoid writing 20 lines of Lua.

~~~
jdwyah
agreed. I was a bit leery of diving into Lua as I was building
[http://ratelim.it](http://ratelim.it) but it really expands Redis's
capabilities dramatically and was easy enough to add.

My apps all write the Lua into Redis and store the hash when they boot up.
Duplicative, but it means everybody is on the same page and it's easy to keep
the Lua in the main codebase.
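
Roughly what that boot-time step looks like with redis-py (the rate_limit.lua
file and key format are placeholders, not the actual ratelim.it code):

```python
import redis

RATE_LIMIT_LUA = open("rate_limit.lua").read()   # the Lua kept in the main codebase

r = redis.Redis()
script_sha = r.script_load(RATE_LIMIT_LUA)        # same source always yields the same SHA

def allow(user_id, limit, window_seconds):
    # EVALSHA runs the already-loaded script without resending its body.
    return bool(r.evalsha(script_sha, 1, f"rate:{user_id}", limit, window_seconds))
```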

~~~
wpeterson
Hey buddy!

Reading all the design and discussion, I was really curious how you structured
things at a brass-tacks storage level.

~~~
jdwyah
yoo! [https://www.slideshare.net/jdwyah/diy-heroku-using-amazon-
ec...](https://www.slideshare.net/jdwyah/diy-heroku-using-amazon-ecs-and-
terraform) does have a bit of a pretty picture, but the basic idea is:

For each rate limit you can choose one of two modes:

1) Redis with a backing store of DynamoDB, aka BestEffort, since there are
failure modes where you could lose an update. In this mode everything is
expected to happen in Redis, but if we don't find your limit there we check
Dynamo. Writes are asynchronously persisted to Dynamo.

2) Token Buckets straight in DynamoDB. This is our Bombproof mode.

(details in
[https://www.ratelim.it/documentation/safety](https://www.ratelim.it/documentation/safety))

It's worth noting that with either of these you can cache aggressively in the
clients whenever the limits have been exceeded. Both the clients
[https://github.com/jdwyah/ratelimit-
ruby](https://github.com/jdwyah/ratelimit-ruby)
[https://github.com/jdwyah/ratelimit-
java](https://github.com/jdwyah/ratelimit-java) do that for you.

------
joaodlf
It's a nice post with a lot of detail and nice imagery... With that said, how
would a simple, slightly modified, exponential backoff work any worse?

~~~
cakoose
What? These are different things.

This article is about the server deciding which requests to reject.
Exponential backoff is a strategy clients use to decide when to retry after
their request is rejected. (Plus, the article is about malicious clients;
they're not going to follow your preferred backoff strategy.)

More concretely, how would exponential backoff ensure that you don't allow
more than 10 requests/second per user?

------
pacaro
It's good practice to rate limit endpoints for a variety of reasons, but in
particular any endpoint that exposes user authentication should be rate
limited.

So this should be a tool that every service at scale has access to.

IIRC the lack of rate limiting burned Apple relatively recently.

Is this yet another area where we all reinvent the wheel? I've yet to see a
recommendation for an off-the-shelf solution.

------
seniorghost
Thanks for referencing my earlier post in the article! We use the "sliding
window log" you described at ClassDojo, but your more memory-efficient
approach looks great.

[https://engineering.classdojo.com/blog/2015/02/06/rolling-
ra...](https://engineering.classdojo.com/blog/2015/02/06/rolling-rate-
limiter/)

------
stonelazy
Was wondering what would be the best way to rate limit API requests that are
on the order of thousands per minute. Am guessing none of the methods
suggested in this write-up helps?! Help pls.

~~~
jdwyah
why not? 1000s/min should be NBD.

~~~
stonelazy
Really? In our product, an average user can send up to 4,000 requests/minute
and we have about 1,000 users now. Do you think it would be possible to scale?
Suppose I maintained a list of sorted timestamps in Redis: for every incoming
request I would have to query Redis for the count of requests in the last
minute, hour, and day (3 calls) and then insert a timestamp for the current
request, so 4 calls in total. Apart from this, if concurrency handling (the
number of concurrent connections allowed for a particular user) is also built
in, that will add further Redis calls. Do I make sense to you?
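
For what it's worth, a rough redis-py sketch of that sorted-set log, batching
the write and the three window counts into a single round trip with a pipeline
(key names and limits are illustrative):

```python
import time
import uuid
import redis

r = redis.Redis()

# (window in seconds, max requests per window); the numbers are made up
LIMITS = [(60, 4000), (3600, 100_000), (86400, 1_000_000)]

def allow(user_id):
    now = time.time()
    key = f"req_log:{user_id}"
    pipe = r.pipeline()
    pipe.zadd(key, {uuid.uuid4().hex: now})     # log this request's timestamp
    pipe.zremrangebyscore(key, 0, now - 86400)  # trim entries older than the largest window
    for window, _ in LIMITS:
        pipe.zcount(key, now - window, now)     # requests seen in each window
    results = pipe.execute()                    # one network round trip for everything
    counts = results[2:]
    return all(count <= cap for count, (_, cap) in zip(counts, LIMITS))
```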

------
contingencies
Zooming out a little, the fundamental problem here is broadly recognized as a
modern protocol design challenge. To phrase the consideration roughly: the
response to a request should not require more resources than the client has
already spent to request it, either in terms of bandwidth or processing
(including memory, storage IO bandwidth, etc.).

Obviously, in some cases such design is not possible. The classic case is
HTTP, where the entire purpose is to supply some (arbitrarily large) volume of
data in response to a small request, and therefore there is a bandwidth
challenge.

Conventional defense strategies tend to utilize the fact that TCP requires a
three-way handshake to instantiate, thus validating the peer's IP address
(unlike UDP), and include:

(1) An authenticated session, eg. using a session key derived from a separate
API call.

(2) Rate limiting per authenticated user, either based upon data over time or
request frequency. ( _This alone is the subject of the article_ )

(3) Segregating read-only, cacheable data (even if it expires within seconds)
on separate infrastructure such as CDNs or memory-based caches.

(4) Aggressive use of HTTP caching.

(5) Careful tuning of HTTP session length related configuration to suit the
application profile.

A newer strategy is the use of captchas; however, this is not viable in
automated (i.e. API) use cases. Another relatively 'new' (for HTTP) strategy is
the use of websockets or other mechanisms for real time 'push', to avoid the
latency and processing overheads of the conventionally enforced HTTP request-
response model.

Additional options would include segregating clients to speak to different
servers (ideally in different data centers, on different links) such that
overhead may be scaled across different infrastructure. Thus even if a single
server is targeted by an attacker damage is limited in scope.

Another architectural defense would be the enforcement of a gateway/proxy
layer (internal or third party) obscuring real datacenter netblocks from
attackers, however this comes at the cost of latency where data cannot be
cached.

Cloudflare basically provides all of the above (plus additional features) as a
service.

Finally, in native mobile application development where an API is the primary
interface with some central system, another simple step that can be taken
(with appropriate care regarding peer authentication and cache invalidation)
is the use of a cached set of IP addresses within the client as a means to
identify servers. In this way, attacks against DNS infrastructure will also be
nullified, and you can focus the demands of segments of your user base on
different infrastructure. (Here in China, DNS is often hijacked or broken,
though this is much less of a concern in typical western environments. It will
also reduce startup latency on slow mobile links the world over.)

