

Show HN: Balancing an Elasticsearch Cluster by Shard Size - KennyCason
http://engineering.datarank.com/2015/07/08/balancing-elasticsearch-cluster-by-shard-size.html

======
gt565k
You mention

"... however, that none of these balancing options are resource-aware: there
is no “balance shards across nodes by the size of the shards” flag to set or
knob to turn."

I was under the impression that shards are kept uniform in size, since ES
tries to spread data equally across all the shards in an index so that there
aren't imbalances.
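
Roughly, the default routing that keeps shards uniform just hashes the
document id; a sketch in Python (ES uses its own hash function internally,
not Python's):

    # Default routing: hash the routing key (the document id unless
    # overridden) modulo the number of primary shards. A decent hash
    # spreads documents evenly, so shards stay roughly uniform.
    def pick_shard(routing_key, num_primary_shards):
        return hash(str(routing_key)) % num_primary_shards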

You can manually route data to a specific shard when reading/writing, which
can cause one shard to grow much larger than the others and use more
resources, but there are very few cases (0.01%) where this is a good idea.
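
With the Python client, for example, manual routing is just an extra
parameter on the write (index and routing values here are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    # Every document indexed with the same routing value lands on the
    # same shard, so a hot routing key produces an oversized shard.
    es.index(index="comments", doc_type="comment", id=1,
             body={"text": "..."}, routing="customer-42")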

Again, I'm not sure how the shards are so imbalanced.

"Node A will degrade faster than the rest of the cluster due to extra CPU use,
memory writes, and disk read/writes."

Also, remember that you can scale reads using replicas. Writes initially only
happen on primary shards, and then propagate to replicas.
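
For example, the replica count can be changed on a live index to scale reads
(index name made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    # More replicas means more copies that can serve reads; writes
    # still go to the primary shard first and then propagate.
    es.indices.put_settings(index="comments",
                            body={"index": {"number_of_replicas": 2}})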

Not trying to be nit-picky. I'm just not sure why the shards are so imbalanced
in your example. This seems more like a hack for a poor model/architecture.

At some point you'll have to redesign your ES architecture.

~~~
aewhite
I'm a Sr Dev here at Datarank in charge of our ES architecture. Maybe I can
shed some light on your points.

"I was under the impression that shards are kept uniform in size as ES will
try and equally spread data to all shards in an index, so that there aren't
imbalances."

This is true by default, and it's fine for simple document retrieval, but it
doesn't scale well if you want to do complex aggregations on arbitrary
filters of large datasets. For that you _need_ document clustering. In our
case (and probably others), the clustering can't be done uniformly; see
http://engineering.datarank.com/2015/06/30/analysis-of-hotspots-in-clusters-of-log-normally-distributed-data.html
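
To make that concrete: because a customer's documents are clustered onto one
shard via routing, an aggregation can be sent to just that shard instead of
fanning out across all of them. A sketch with the Python client (names are
made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    # With all of "customer-42"'s documents routed to one shard, this
    # aggregation hits a single shard rather than the whole index.
    result = es.search(index="comments", routing="customer-42", body={
        "query": {"term": {"customer": "customer-42"}},
        "aggs": {"mentions_by_source": {"terms": {"field": "source"}}},
    })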

"Writes initially only happen on primary shards, and then propagate to
replicas." \- also true but you still want this to be as distributed as
possible. In some cases we are bulk loading millions of documents. We want the
load to be equally distributed as possible while still allowing for clustering
as mentioned above. Also, we want to minimize heap usage per node for GC and
buffering performance.
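
For the bulk loads, that looks something like this with the Python client
(a sketch; names are made up):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()
    docs = [{"customer": "customer-42", "text": "..."}]  # millions, really
    # Each action carries its own routing, so one bulk request still
    # spreads documents across the clustered shards.
    actions = ({"_index": "comments", "_type": "comment",
                "_routing": doc["customer"], "_source": doc}
               for doc in docs)
    bulk(es, actions)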

"At some point you'll have to redesign your ES architecture." \- Perhaps but
this system scales to 100s nodes easily and maybe 1000s. We allow our
customers to perform very complex aggregates on some pretty large datasets
with 50ms response times. All other architectures we tried failed to scale
well.

Clustering of comments is vital to performance in our use case. Our data is
distributed log-normally, we must be able to scale quickly and easily, and
the default setup didn't scale well with our volume. All of these factors
led us to this solution.

~~~
gt565k
Thank you for the detailed explanation. It seems like a unique use case.

I still have a lot to learn about ES :).

Glad you are releasing this as a plugin!

------
divideby0
This seems quite cool. For very large clusters, I'd also consider looking
into OptaPlanner, which exposes a variety of probabilistic metaheuristics
for balancing. There's a "cloud balancing" example in the documentation that
is fairly close in terms of use case:
https://docs.jboss.org/drools/release/6.0.0.Beta1/optaplanner-docs/html_single/#cloudBalancingTutorial

~~~
adenverd
Oh nice! That looks close to what we were trying to do with this plugin. I'm
not sure it would've worked within the constraints of the Elasticsearch
environment, but the extra confidence OptaPlanner gives you by applying
multiple algorithms to the bin-packing problem (which is NP-hard) looks
quite promising.
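
For a flavor of the problem: even a greedy heuristic (largest shard first,
onto the least-loaded node) gets you reasonable balance. A toy Python
sketch, ignoring replicas and allocation rules:

    import heapq

    def balance(shard_sizes, num_nodes):
        # One (load, node_id, shards) entry per node; the heap always
        # pops the least-loaded node first.
        nodes = [(0, i, []) for i in range(num_nodes)]
        heapq.heapify(nodes)
        # Place the largest shards first, each on the emptiest node.
        for size in sorted(shard_sizes, reverse=True):
            load, node_id, shards = heapq.heappop(nodes)
            shards.append(size)
            heapq.heappush(nodes, (load + size, node_id, shards))
        return sorted(nodes)

For example, balance([9, 7, 6, 5, 4], 2) splits the sizes 14/17, while the
optimal split is 15/16 - which is exactly why fancier solvers like
OptaPlanner are appealing for the same problem.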

------
tobz
Do you have any data on bringing new nodes online / offline? Obviously,
rebalancing in general is going to be I/O-intensive if you have a ton of data,
but has Tempest shown any upside/downside when it comes to scaling a cluster?

Like, Tempest helped you with your static cluster... but could it make adding
a new node slower or faster?

~~~
adenverd
We've tested for stability when adding and removing nodes, but we haven't
compared the time-to-balance of Tempest versus the default balancer. Because
an ES cluster remains fully functional while a rebalance is in progress (you
still have access to all data), we chose to optimize for resource usage
rather than time-to-balance. There isn't really a good way to compare the
two balancers' time-to-balance anyway: both are highly configurable
(range_ratio and iterations in Tempest's case, the 4 balance weights in ES's
case), and their default values probably aren't "equivalent" on that axis,
since time-to-balance is a minor concern next to resource usage and
stability.

