
Thundering Herds and Promises - YoavShapira
https://instagram-engineering.com/thundering-herds-promises-82191c8af57d
======
sciurus
This is how many HTTP caches and CDNs work. The terminology used to describe
it is often request collapsing or request coalescing.

Some examples:

* varnish: [https://info.varnish-software.com/blog/hit-for-pass-varnish-...](https://info.varnish-software.com/blog/hit-for-pass-varnish-cache)

* nginx: [http://nginx.org/en/docs/http/ngx_http_proxy_module.html#pro...](http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_lock)

* fastly: [https://docs.fastly.com/guides/performance-tuning/request-co...](https://docs.fastly.com/guides/performance-tuning/request-collapsing)

* cloudfront: [https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#request-custom-traffic-spikes)

~~~
camelspade
Also in Apache Traffic Server through the collapsed forwarding plugin:
[https://docs.trafficserver.apache.org/en/latest/admin-guide/...](https://docs.trafficserver.apache.org/en/latest/admin-guide/plugins/collapsed_forwarding.en.html)

------
thamer
In practice there's a bit more to it than what the article describes,
especially for a distributed cache: you need the Promise to auto-expire so
that, if the machine performing the backend read disappears, the other
readers don't stay stuck waiting for it forever. It's also useful for the
cache itself to be able to block the caller until the Promise has been
fulfilled, to avoid repeated requests polling for whether the data is
finally there.
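
To illustrate the auto-expire part, here is a minimal in-process TypeScript sketch (the function name and error message are mine, not from the article; a distributed cache would instead enforce a TTL on the stored Promise entry itself):

    // Wrap a loader's Promise so it auto-expires: if the process doing
    // the backend read disappears, waiters get an error after `ms`
    // milliseconds instead of staying stuck forever.
    function withExpiry<T>(work: Promise<T>, ms: number): Promise<T> {
      return new Promise<T>((resolve, reject) => {
        const timer = setTimeout(
          () => reject(new Error("promise expired")), ms);
        work.then(
          (v) => { clearTimeout(timer); resolve(v); },
          (e) => { clearTimeout(timer); reject(e); });
      });
    }

On expiry, a waiter would typically evict the cached entry and retry the load itself.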

As an aside, Guava loading caches[1] implement per-key locking, so multiple
threads accessing a missing key simultaneously[2] lead to only a single
load, with all other readers waiting for the loading thread to complete the
fetch and populate the cache.

[1] [https://github.com/google/guava/wiki/CachesExplained](https://github.com/google/guava/wiki/CachesExplained)

[2] in the sense of "in the time it takes for the first accessor to complete
the backend read"

~~~
davinic
Why block the caller? Subsequent calls get the same promise back, and all of
them are notified once the promise resolves.

~~~
vhost-
Because then your caller needs to understand what a promise is, instead of
just getting a data object it already has to understand. On top of that, the
caller also needs to implement a polling mechanism to keep trying. Put into
code:

    
    
      func get_value(key):
        value = backend.get(key)
        return value
    

That is way better than something along these lines:

    
    
      func get_value(key):
        while true:
          response = backend.get(key)
          if response.type == "promise":
            sleep(duration)
            continue
          return response.value
     

If your service looks like the former, then suddenly your unit tests can use
a Postgres database, a SQLite database, a REST client... anything that
implements the same backend.get(key) interface.

------
spankalee
This is how all async caches are supposed to work. You never want concurrent
requests for the same uncached resource to all hit the backend.

For TypeScripters out there, this is what my team wrote for our static
analysis framework:
[https://github.com/Polymer/tools/blob/master/packages/analyz...](https://github.com/Polymer/tools/blob/master/packages/analyzer/src/core/async-work-cache.ts)
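
For illustration, a minimal TypeScript sketch of the pattern (a simplification, not the linked implementation; the loader function is supplied by the caller):

    // Concurrent get() calls for the same key share one in-flight
    // Promise, so the backend load runs at most once per key.
    class AsyncCache<K, V> {
      private pending = new Map<K, Promise<V>>();

      constructor(private load: (key: K) => Promise<V>) {}

      get(key: K): Promise<V> {
        let p = this.pending.get(key);
        if (p === undefined) {
          p = this.load(key);
          this.pending.set(key, p);
        }
        return p;
      }
    }

One caveat: rejected Promises stay cached here forever, so a production version would evict failed entries to allow retries.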

------
nicwolff
Instead – and more usefully given how slow some of our backend APIs are – we
cache each value twice, under e.g. `key` with a short TTL and `key_backup`
with a long TTL.

The first process to miss on `key` renames `key_backup` to `key` (which is
atomic and fast on Redis) and goes to the backend for a new value to cache
twice and return, while the rest of the herd reads the renamed backup.

Yes, this doubles the total cache size, or equivalently halves the number of
keys we have room for. That's a price we're OK with paying to avoid blocking
reads while a value is recalculated.
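
A hedged TypeScript sketch of that dance, assuming the ioredis client; fetchFromBackend and the TTL values are stand-ins, not the actual system:

    import Redis from "ioredis";

    const redis = new Redis();

    // Hypothetical backend read.
    declare function fetchFromBackend(key: string): Promise<string>;

    async function getValue(key: string): Promise<string> {
      const hit = await redis.get(key);
      if (hit !== null) return hit;
      let won = true;
      try {
        // RENAME is atomic; exactly one process in the herd wins it.
        await redis.rename(`${key}_backup`, key);
      } catch {
        won = false; // someone renamed first, or no backup exists yet
      }
      if (!won) {
        // The rest of the herd reads the renamed backup. On a true
        // cold start this is still null and everyone falls through.
        const backup = await redis.get(key);
        if (backup !== null) return backup;
      }
      // The winner (or a cold-start caller) refreshes from the backend
      // and caches the new value twice.
      const fresh = await fetchFromBackend(key);
      await redis.set(key, fresh, "EX", 60); // short TTL
      await redis.set(`${key}_backup`, fresh, "EX", 3600); // long TTL
      return fresh;
    }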

~~~
horsawlarway
How does this solve your cold start?

What happens if I have a new request come in for which I have no key OR
key_backup?

------
aaron_m04
This looks like it would be a very useful design; however, the article
doesn't discuss the implementation of the Promise.

This is not something memcached or redis support out of the box, as far as I
know. It seems to imply a cache manager service that keeps its own in-memory
table of Promises.
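
A rough TypeScript sketch of such a manager, with a per-process table of in-flight Promises layered in front of a shared store; the `store` and `load` parameters are hypothetical stand-ins for redis/memcached and the backend read. Note this coalesces requests within a single process only, which is why sharing the table across machines implies a separate service:

    // Per-process table of in-flight loads, keyed by cache key.
    const inflight = new Map<string, Promise<string>>();

    interface Store {
      get(key: string): Promise<string | null>;
      set(key: string, value: string): Promise<void>;
    }

    function getValue(
      key: string,
      store: Store, // hypothetical redis/memcached wrapper
      load: (key: string) => Promise<string>, // hypothetical backend read
    ): Promise<string> {
      let p = inflight.get(key);
      if (p === undefined) {
        p = (async () => {
          const cached = await store.get(key);
          if (cached !== null) return cached;
          const value = await load(key);
          await store.set(key, value);
          return value;
        })().finally(() => inflight.delete(key));
        inflight.set(key, p);
      }
      return p;
    }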

~~~
dragontamer
> however the article doesn't discuss the implementation of the Promise.

It's not a "thundering herd" problem either. The thundering herd is
classically a scheduling problem.

You have 100 threads waiting on a resource (classically: a mutex). The mutex
unlocks, which causes all 100 threads to wake up. You KNOW that only one
thread will win the mutex, so 99 of the threads wasted CPU time by waking
up. When that thread is done, the remaining 99 wake up again (and again, all
but one waste CPU time because only one can win).

Solving the thundering herd requires your scheduler to know all the
resources that could be blocking a thread. The scheduler then wakes only ONE
thread at a time in these cases.

-----------

I'm not entirely sure what the problem in the blog should be named, but it
definitely isn't a "thundering herd". I will admit that it's a
similar-looking problem, though.

~~~
mentat
"Thundering herd" has referred to demand spikes in service architectures for
at least 8 years[0], probably much longer.

[0] [https://qconsf.com/sf2011/dl/qcon-sanfran-2011/slides/Siddha...](https://qconsf.com/sf2011/dl/qcon-sanfran-2011/slides/SiddharthAnand_KeepingMoviesRunningAmidThunderstorms.pdf)

~~~
dragontamer
Hmm, the usage in the Netflix presentation does seem to fit, though, at
least at first glance.

The key attribute of the "thundering herd" problem is the LOOP: the
thundering herd causes another thundering herd, which later causes another
one. In the Netflix presentation, the thundering herd causes all of the
requests to time out, which causes two new servers to be added ("automatic
scale up"), and then everyone tries again.

When everyone tries again, there are more people waiting, so everyone times
out AGAIN, which causes everything to shut down, two more servers to be
added, and the cycle to start over. Etc. etc. It's a cascading problem that
gets worse with each loop. You solve the thundering herd not by adding more
resources (that actually makes the problem worse!!) but by cutting off the
feedback loop somehow.

The problem discussed in the blog post has no feedback loop. It's simply a
problem that happens once on startup.

~~~
mentat
Very good point, thanks for clarifying.

~~~
lmeyerov
System start is indeed a synchronization point, and for limited resources
like the ones here it's painful, and closer to the vernacular sense of the
term.

Thundering herds can cause escalating, successive failures. That is very
much an issue with service start/restart: a bad restart causes a timeout,
then another restart, and eventually restarts in further layers. Imagine all
this running on top of k8s. So yes, this pattern is indeed about one of the
failure modes that happen with thundering herds.

Though if your cache needs another cache, that feels like a bad cache. The
promise pattern can be done transparently by the cache itself, coalescing
GETs, instead of requiring a user-level protocol. We do app-level caching to
stay process-local, because latency is fun in GPU land and we are a visual
analytics tool... but that is not for the problem shown here.

------
_bxg1
"instead of caching the actual value, we cached a Promise that will eventually
provide the value"

I did this exact thing recently in a client-side HTTP caching system for
frequently-duplicated API requests from within a single page. Cool to see it
pop up elsewhere.

------
sbov
Back when I worked on a similar problem 10 years ago, we solved it with a
quickly expiring memcached key that gated hits to the database. If the value
wasn't cached and that key wasn't there, a client would attempt to add the
key. If the add succeeded, it would hit the database and cache the result.
Otherwise (the key was already there, or the add failed), it would wait a
short period of time and then retry the whole process.
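
A hedged TypeScript sketch of that scheme; the `Memcached` interface stands in for a client with an atomic ADD (which fails if the key already exists), and the key names, TTLs, and sleep interval are invented:

    interface Memcached {
      get(key: string): Promise<string | null>;
      add(key: string, value: string, ttlSec: number): Promise<boolean>;
      set(key: string, value: string, ttlSec: number): Promise<void>;
    }

    async function getValue(
      mc: Memcached,
      key: string,
      loadFromDb: (key: string) => Promise<string>, // hypothetical
    ): Promise<string> {
      while (true) {
        const hit = await mc.get(key);
        if (hit !== null) return hit;
        // Try to take the quickly expiring lock key. ADD is atomic,
        // so only one client wins and gets to hit the database.
        if (await mc.add(`${key}:lock`, "1", 5)) {
          const value = await loadFromDb(key);
          await mc.set(key, value, 300);
          return value;
        }
        // Someone else is loading: wait briefly, then retry it all.
        await new Promise((r) => setTimeout(r, 50));
      }
    }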

There are other similar problems elsewhere, too. A cold MySQL start is a
bitch when you rely on huge amounts of memory for MySQL to service your
requests; this is especially noticeable if you have so much traffic that you
need to cache some results. Back then it would take us about an hour before
a freshly spun-up MySQL instance could keep up with our regular traffic,
even accounting for stuff being cached.

------
ris
Came up with something along these lines at my last place - not actually the
hardest thing to do as long as you've got a robust and convenient locking
system available to you. In my case I abused the common DB instance that all
the clients were connected to anyway, synchronizing the callers on Postgres
advisory locks. Sure, this isn't the infinitely scalable, Netflix-scale
solution that everyone is convinced they need for everything, but it _will_
probably work _absolutely fine_ for >90% of development scenarios.

[https://gist.github.com/risicle/f4807bd706c9862f69aa](https://gist.github.com/risicle/f4807bd706c9862f69aa)
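
A hedged TypeScript sketch of the same idea (not a transcription of the gist), assuming the pg client library; lockIdFor, the shared-cache wrapper, and loadFromDb are invented:

    import { Client } from "pg";

    interface SharedCache {
      get(key: string): Promise<string | null>;
      set(key: string, value: string): Promise<void>;
    }

    // Hypothetical: hash the cache key to a bigint-sized lock id.
    declare function lockIdFor(key: string): number;

    async function getValue(
      db: Client,
      cache: SharedCache, // hypothetical shared cache wrapper
      key: string,
      loadFromDb: (key: string) => Promise<string>, // hypothetical
    ): Promise<string> {
      const hit = await cache.get(key);
      if (hit !== null) return hit;
      const lockId = lockIdFor(key);
      // Blocks until the lock is free, serializing loads per key.
      await db.query("SELECT pg_advisory_lock($1)", [lockId]);
      try {
        // Re-check: another caller may have filled the cache while
        // we were waiting on the lock.
        const again = await cache.get(key);
        if (again !== null) return again;
        const value = await loadFromDb(key);
        await cache.set(key, value);
        return value;
      } finally {
        await db.query("SELECT pg_advisory_unlock($1)", [lockId]);
      }
    }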

------
layoutIfNeeded
We do the same thing on the client side when lazy-loading assets (typically
textures) from disk, because you don't want to hold multiple copies of the
same asset in memory.

------
Johnny555
I'm not a developer, but to be honest I thought that's how all non-trivial
caching implementations worked: instead of going directly to the backend, or
having each cache miss trigger its own read from the backend, all of the
threads that want that resource just wait for it to appear in the cache.

------
m0meni
Interesting. I also ran into this on a much smaller scale 3 years ago and made
[https://github.com/AriaFallah/weak-memoize](https://github.com/AriaFallah/weak-memoize)

------
jelder
It's been a while since I used Rails, but I think the fragment cache does
this "out of the box."

------
niklabh
I have implemented this in Node.js:
[https://npmjs.com/package/memoise](https://npmjs.com/package/memoise)

------
z3t4
In an old apartment I had a lot of stuff on the same extension plug, and I
had to turn on the devices one by one to prevent power loss... The same
problem/solution shows up when boarding airplanes. Even if it feels
backwards, it's actually faster to let one third of the requests go through
first than to let all requests go through at once.

