
On Sharding - Isofarro
https://www.tbray.org/ongoing/When/201x/2019/09/25/On-Sharding
======
jedberg
> Load-sensitivity is one “smart” approach. The idea is that you keep track of
> the load on each shard, and selectively route traffic to the lightly-loaded
> ones and away from the busy ones. Simplest thing is, if you have some sort
> of load metric, always pick the shard with the lowest value.

Gotta be super careful with this one. We did this at reddit and it bit us bad.
The problem was as soon as the load on a machine went down it got pounded with
new requests and the load shot up, but it takes a few seconds for the load
number to react to all the new requests. So we saw a really bad see-saw effect.

We had to add extra logic to track how long a machine had been at a certain
load and also randomly send requests to slightly more loaded machines to keep
things even.
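
Roughly, that second part of the fix can be sketched like this (the `pick_shard` function and the `slack` knob are made up for illustration, not reddit's actual code):

    import random

    def pick_shard(loads, slack=0.2):
        """Pick a shard index given one load metric per shard (lower = less busy).

        Always taking the minimum stampedes whichever shard most recently
        reported a low number. Instead, pick uniformly among every shard whose
        load is within `slack` of the current minimum, so slightly-busier
        machines still absorb some traffic and the see-saw is damped."""
        lowest = min(loads)
        candidates = [i for i, load in enumerate(loads) if load <= lowest + slack]
        return random.choice(candidates)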

The moral of the story here is make sure you pick a metric that reacts to the
change in request rate as quickly as your request rate changes!

~~~
ucarion
This sounds a lot like the control-theory problem of balancing the
proportional, integral, and derivative coefficients of a PID controller [1]?

I'm curious how you reached this condition as a requirement:

> The moral of the story here is make sure you pick a metric that reacts to
> the change in request rate as quickly as your request rate changes!

It makes sense _intuitively_, but I'm having trouble proving to myself that
this is necessary+sufficient.

[1]:
[https://en.wikipedia.org/wiki/PID_controller](https://en.wikipedia.org/wiki/PID_controller)

~~~
jedberg
So I studied control theory after I left Reddit and you’re indeed right, it’s
a PID issue.

Picking a metric that reacts to changes quickly is neither necessary nor
sufficient, but it certainly helps reduce the error on your calculation. You
need to know how far off your set point you are and so you need as accurate a
measurement as possible.

Imagine a cruise control for a car where the speedometer had a five second
delay. You’d still stay at your desired speed on average but it would vary a
lot more and require more work to get back to the desired speed. It would have
to accelerate harder and brake harder.
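
The delayed-speedometer effect is easy to see in a toy simulation (purely illustrative, using a bare proportional controller rather than a full PID):

    def simulate(delay_steps, gain=0.5, steps=40, setpoint=100.0):
        """Drive a value toward `setpoint` with proportional control, but the
        controller only sees a measurement that is `delay_steps` ticks old."""
        history = [0.0]
        for _ in range(steps):
            measured = history[max(0, len(history) - 1 - delay_steps)]
            history.append(history[-1] + gain * (setpoint - measured))
        return history

    # Fresh measurement: smooth approach to the set point. Stale measurement:
    # overshoot and see-sawing around it before it settles.
    print([round(x) for x in simulate(delay_steps=0)[:10]])
    print([round(x) for x in simulate(delay_steps=2)[:10]])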

~~~
ucarion
Oh I see! I was misunderstanding what you meant by "quickly"; it seems you're
referring to the sensor delay / how tight the feedback loop is.

Thanks for explaining!

~~~
brandmeyer
See also Bode stability analysis.

Delay in the feedback path counts against phase margin, requiring you to
reduce the loop bandwidth to maintain stability.
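
For concreteness, the textbook statement of that (a standard result, not from the thread): a pure delay of \tau seconds in the measurement path leaves the loop gain unchanged but adds phase lag, so at the crossover frequency \omega_c it consumes

    \Delta\phi = \omega_c \, \tau \quad \text{radians}

of phase margin; hence a laggy load metric forces a slower (lower-bandwidth) control loop to stay stable.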

------
tpmx
_Sorta_ related:

I managed a team that built a 5x 1000 node distributed setup 10+ years ago.

We ended up going with

a) short DNS TTL + a custom DNS server that sent people to the closest cluster
(with some intra-communication to avoid sending people to broken clusters)

b) in each cluster, three layers: 1) Linux keepalived load balancing, 2) our
custom HTTP/TLS-level load balancers (~20 nodes per DC), 3) our application
(~1000 nodes per DC)

A typical node had 24 (4x6) CPU cores when we started and 48 (4x12) towards
the end.

These were not GC/AWS nodes, we were buying hardware directly from
IBM/HP/Dell/AMD/Intel/SuperMicro and flying our own people out to mount them
in DCs that we hired. Intel gave us some insane rebates when they were
recovering from the AMD dominance.

Load-balancing policy: we just randomized targets, but kept sticky sessions.
Nodes were stateless, except for shared app properties - for those we built a
separate globally/DC-aware distributed key-value store, based loosely on the
concept of AWS Dynamo, which was a whole new thing 12 years ago. App nodes
reported for duty to the load balancers when they were healthy.

We had a static country-to-preferred-DC mapping. That worked fine at this
scale.

This setup worked fine for a decade and 250M+ MAUs. We had excellent
availability.

At some point like 10 years ago a kinda well known US-based board member
really, really wanted us to move to AWS. So we did the cost calculations
and realized it would cost like 8X more to host the service on AWS. That shut
him up.

Different times. It's so much easier now with AWS/GC to build large-scale
services. But also so much more expensive - still! I wonder how long that can
last until the concept of dealing with computation, network and storage
_really_ becomes a commodity.

~~~
jiggawatts
What in god's good name were you guys hosting that required 5,000 quad-socket
physical hosts!?

~~~
tpmx
A popular server-assisted mobile browser for crappy phones.

Basically one CPU second per web page. 150k pages/second @ peak. 5 million
HTTP requests/s. 150 Gbit/s. The web for 250 million people.

Kinda insane numbers when I think about it now, still. (I left five years ago,
after it peaked.)

~~~
ignoramous
Opera?

------
jedberg
My favorite sharding/load balancing algorithm is Highest Random Weight, or
Rendezvous hashing [0]. It has all the benefits of consistent key hashing
without the hotspots, and it doesn't require any coordination between nodes.

[0]
[https://en.wikipedia.org/wiki/Rendezvous_hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing)
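
The whole scheme fits in a few lines; a minimal sketch (the hash choice and node names here are just illustrative):

    import hashlib

    def owner(key, nodes):
        """Rendezvous / highest-random-weight hashing: every node gets a
        pseudo-random score for this key and the highest scorer owns it.
        No coordination needed, and removing a node only remaps the keys
        that node owned."""
        def score(node):
            digest = hashlib.sha256(f"{node}:{key}".encode()).hexdigest()
            return int(digest, 16)
        return max(nodes, key=score)

    print(owner("user:12345", ["shard-a", "shard-b", "shard-c"]))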

~~~
ignoramous
Squid (HTTP cache) uses Rendezvous Hashing, iirc. Google's Maglev Hash and
Jump Hash are other alternatives that spring to mind:
[https://medium.com/@dgryski/consistent-hashing-algorithmic-t...](https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8)

Two years or so back, I stumbled on power-of-2 load balancing via Twitter
Finagle documentation. Found it pretty interesting. Here is a relevant news.yc
discussion:
[https://news.ycombinator.com/item?id=14640811](https://news.ycombinator.com/item?id=14640811)

And of course, the exponentially weighted moving average is a good algorithm
too. It is, I believe, used by Elasticsearch. Cloudflare blogged about using it
as well: [https://blog.cloudflare.com/i-wanna-go-fast-load-balancing-d...](https://blog.cloudflare.com/i-wanna-go-fast-load-balancing-dynamic-steering/)
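
A rough sketch of the EWMA idea (a made-up class, not Elasticsearch's or Cloudflare's actual code): keep a decayed latency score per backend and prefer whichever currently looks fastest, usually combined with pick-two rather than a global scan.

    class EwmaScore:
        """Exponentially weighted moving average of observed latency per
        backend; lower score means recently fast. `alpha` sets how quickly
        old samples fade, i.e. how fast the metric reacts."""
        def __init__(self, backends, alpha=0.3):
            self.alpha = alpha
            self.scores = {b: 0.0 for b in backends}

        def observe(self, backend, latency_ms):
            old = self.scores[backend]
            self.scores[backend] = (1 - self.alpha) * old + self.alpha * latency_ms

        def best(self):
            return min(self.scores, key=self.scores.get)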

------
twotwotwo
> But the cache is a distraction. The performance you’re going to get will
> depend on your record sizes and update patterns and anyhow you probably don't
> care about the mean or median as much as the P99.

True, your 99th percentile slowest requests won't hit the cache, and certainly
caching won't solve all your scaling difficulties.

However, keeping requests for commonly-needed data away from (say) a DB
cluster decreases the load on it at a given level of throughput, and _that_
can be good for P99, and (as the post notes) caching can specifically help
with super-hot data which can cause problematic hotspots in some sharding
strategies.

Obviously situations vary and there're limits, but a cache seems like a legit
tool, not just a band-aid, for a decent number of situations.

------
plandis
Another good strategy for load balancing/sharding that always strikes me as
simple but also devilishly clever is random pick two:
[https://brooker.co.za/blog/2012/01/17/two-random.html](https://brooker.co.za/blog/2012/01/17/two-random.html)
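
The whole trick is tiny; a sketch, assuming you can cheaply read something like a per-backend queue length:

    import random

    def pick_two(backends, queue_len):
        """Power-of-two-choices: sample two backends uniformly at random and
        send the request to the less busy of the pair. Much better tail
        behavior than pure random, without herding onto a single "best" node."""
        a, b = random.sample(backends, 2)
        return a if queue_len[a] <= queue_len[b] else b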

~~~
amelius
It looks at mean queue time, not worst case time.

~~~
plandis
The article linked does, yes. The paper the article is based on (linked in the
article) has a proof for worst case load if you’re interested in the details.

Edit: Link to paper
[http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...](http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf)

------
oweiler
Sounds more like load balancing than sharding.

~~~
jedberg
Load balancing is just a subset of sharding though. It's how you shard your
incoming traffic. The same strategies generally apply to both, but you get
more leeway with traffic if it's stateless.

------
gfodor
Def worth clicking through to the shuffle sharding thread. Simple concept (and
somewhat common in my experience) but I’ve never seen the analysis before.

~~~
OJFord
Took me a long time to find (the link is easy to miss), here it is:
[https://twitter.com/colmmacc/status/1034492056968736768](https://twitter.com/colmmacc/status/1034492056968736768)

~~~
ignoramous
Some more resources re Shuffle Sharding:
[https://news.ycombinator.com/item?id=19291163](https://news.ycombinator.com/item?id=19291163)

Also see this nice little blog post: [https://maisonbisson.com/post/hash-rings-sharding-request-re...](https://maisonbisson.com/post/hash-rings-sharding-request-replication/)

------
speedplane
Is it just me, or is this article talking about load balancing, not sharding?
My understanding of "sharding" is to split up a database into groups, either
by time or by some index key (e.g., A-C on one shard, D-G on another, etc.).
This article seems to be about splitting up web traffic, not sharding.
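
(That kind of key-range sharding is essentially a sorted-boundary lookup; a toy sketch with made-up boundaries:)

    import bisect

    # Exclusive upper bounds of each shard's key range, kept sorted.
    BOUNDARIES = ["d", "h", "p"]                      # a-c | d-g | h-o | p-z
    SHARDS = ["shard-1", "shard-2", "shard-3", "shard-4"]

    def shard_for(key):
        """Range-based sharding: bisect the boundary list to find the shard."""
        return SHARDS[bisect.bisect_right(BOUNDARIES, key[0].lower())]

    print(shard_for("Alice"), shard_for("Gary"), shard_for("Zoe"))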

~~~
icebraining
Sharding is about splitting up the _data_ into groups; in this case, the idea is
that the web nodes have some local state, reducing the need to hit the
databases so much:

"If all the clickstream clicks from the same user or state-change events from
the same workflow or whatever go to the same host, you can hold the relevant
state in that host’s memory, which means you can respond faster and hit your
database less hard."

------
prostodata
Is there any (significant) difference between sharding and load balancing?

It seems that in both cases the idea is to distribute (supposedly independent)
requests between workers, and one of the main difficulties is that requests
might not be independent, either within one stream (say, in the case of
sessions) or between different streams (say, if they need to use one common
state).

