
Optimal Shard Placement in a Petabyte Scale Elasticsearch Cluster - andrelaszlo
https://underthehood.meltwater.com/blog/2018/11/05/optimal-shard-placement-in-a-petabyte-scale-elasticsearch-cluster/
======
jillesvangurp
Interesting overview.

I understand from elsewhere on their site that they are not using a recent
version of ES (2.x?). This has a big impact on how things behave, as later
versions improved performance and garbage collection behavior by e.g. doing
more things off heap with memory-mapped files, fixing tons of issues with
clustering, refactoring the on-disk storage in Lucene 5 and 6, and giving
users more control over sharding and index creation. So it's likely that this
setup is working around issues that have been addressed in later versions.

For example, 6.x has a shard allocation API that lets you control which nodes
host which shards. That sounds like it would be useful in this setup.
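
For illustration, a minimal sketch of explicitly moving one shard with the
cluster reroute API (the node and index names here are made up):

    import requests

    # Ask the cluster to relocate one specific shard between named nodes.
    body = {
        "commands": [
            {
                "move": {
                    "index": "posts-2018-10",   # hypothetical index name
                    "shard": 3,
                    "from_node": "data-node-17",
                    "to_node": "data-node-42",
                }
            }
        ]
    }
    resp = requests.post("http://localhost:9200/_cluster/reroute", json=body)
    resp.raise_for_status()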

They also seem to be mixing query and indexing traffic on the same nodes. In a
cluster this big I would suggest separating these and using dedicated query
and index nodes (as well as dedicated master nodes). That way write-node
performance would be a lot more predictable and query traffic would no longer
be able to disrupt writes. You could also use instances with more CPU/memory
for querying, or use elastic scaling groups to add query capacity when needed.
You might even get away with using cheaper T instances with CPU credits for
this.

~~~
karlney
Hi. We are actually using an even older version, 1.7.6, for our cluster. This
is of course not an ideal set-up and we will upgrade as soon as we can, but
that is not an easy task in a cluster of this size and given our uptime
requirements. We know that general search and indexing performance, shard
allocation and general cluster state management have all improved a lot in
later versions of Elasticsearch, but we still anticipate that we will need
Shardonnay running even after we upgrade.

The shard allocation API would also be a very helpful addition to this
solution. Currently we can sometimes get into situations where Elasticsearch
actively works against Shardonnay and makes very 'stupid' moves, outside our
control, that then need to be 'fixed' by Shardonnay in later iterations. We
have tried to disable Elasticsearch's own allocation algorithm as much as we
can, but we still want it active as a safety net in case something unexpected
happens with the hardware, for example.

We do have dedicated master nodes. We also have query and indexing client
nodes that coordinate and merge search responses, handle routing etc. But
behind that we utilize all data nodes for both query and indexing load. That
has worked well so far because indexing load is fairly constant, and we scale
query load on recent indexes by having a large number of replicas for those.
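
As a rough sketch of that last point (the index name and replica count are
illustrative, not our actual values), bumping replicas on a recent index is a
single settings call:

    import requests

    # More replicas on a hot, recent index means more copies to spread query load over.
    requests.put(
        "http://localhost:9200/posts-2018-11/_settings",
        json={"index": {"number_of_replicas": 5}},
    ).raise_for_status()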

Our largest problem has always been the vast differences in query complexity
that we get, but we are slowly getting on top of that problem now as well.
Some details on how that is done can be found in the following blog post:

[https://underthehood.meltwater.com/blog/2018/09/28/using-machine-learning-to-load-balance-elasticsearch-queries/](https://underthehood.meltwater.com/blog/2018/09/28/using-machine-learning-to-load-balance-elasticsearch-queries/)

~~~
jillesvangurp
I fully sympathize; I also still have a 1.7.x cluster that I have wanted to
get rid of for years now. Several rounds of compatibility-breaking changes
have made this hard for us. Impressive that you can make this work at this
scale.

The thing is, at the scale you are running, that probably means you are
running at a much higher cost than technically needed with a recent 6.x setup.
You would likely see order-of-magnitude improvements in memory usage.

Recent versions of ES are much better at preventing complex queries from
taking down nodes. They've done a lot of good work on circuit breakers that
keep queries from getting out of hand, and they are also a lot smarter about
e.g. caching filtered queries.

My guess is that you would not need Shardonnay on a recent setup; most of what
it does should be supported natively these days. ES has been working with many
companies on setups comparable to yours, and they've learned a lot in the last
few years about how to do this properly.

Another feature that could be of interest to you is cross cluster search
introduced in 6.x (replacing the tribe functionality they had in v2 and v5).
This would allow you to isolate old data to a separate cluster optimized for
reads. Probably whenever you hit those old indices, your query complexity goes
through the roof because it needs to open more shards.
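
A minimal sketch of what that could look like, assuming the 6.x
search.remote.*.seeds setting (the cluster alias, hosts and index names are
made up):

    import requests

    ES = "http://localhost:9200"

    # Register the read-optimized cluster holding old data as a remote.
    requests.put(
        f"{ES}/_cluster/settings",
        json={"persistent": {"search.remote.archive.seeds": ["archive-node-1:9300"]}},
    ).raise_for_status()

    # One query spanning local recent indices and remote old ones.
    requests.post(
        f"{ES}/posts-2018-*,archive:posts-2016-*/_search",
        json={"query": {"match": {"body": "example"}}},
    )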

~~~
rpedela
Honestly, I bet 6.x doesn't work well at PB scale. There may be different
problems with different solutions, but I bet it would still need custom shard
balancing, etc. I work at TB scale and 6.x largely works well with the
exception of reindexing. Reindexing without downtime is still tricky.

The reason I think there are likely still problems at PB scale is the attitude
of the ES core developers. They collectively act as if reindexing is no big
deal, and their proposed solution to many things I consider bugs is "just
reindex". Reindexing is the last thing I want to do, given how hard it is to
do at TB scale with zero downtime. I don't think the core developers have
experience with large clusters themselves, so I find it unlikely that just
upgrading to 6.x would solve all the problems at PB scale.

~~~
jillesvangurp
I know people running ES at PB scale. You definitely need to know what you are
doing, of course, but it is entirely possible; there are companies doing this.
Any operation at this scale needs planning. So I don't think you are being
entirely fair here.

I'm not saying upgrading will solve all your problems, but I sure know of a
lot of problems I've had with 1.7 that are much less of an issue, or not an
issue at all, with 6.x.

Reindexing is indeed time consuming. However, if you engineer your cluster
properly, you should be able to do it without downtime. For example, I've used
index aliases in the past to manage this: reindex in the background, then swap
over to the new index atomically when ready. Maybe don't reindex all your
indices at once. They also have a reindex API in ES 6 to support this. At PB
scale this is of course going to take time. I've also considered doing
reindexing on a separate cluster and using e.g. the snapshot API to restore
the end result.
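
A bare-bones sketch of that alias-swap pattern (index and alias names are
illustrative):

    import requests

    ES = "http://localhost:9200"

    # 1. Copy documents into the new index in the background. At large scale
    #    you'd pass wait_for_completion=false and poll the returned task.
    requests.post(
        f"{ES}/_reindex",
        json={"source": {"index": "posts_v1"}, "dest": {"index": "posts_v2"}},
    ).raise_for_status()

    # 2. Flip the alias to the new index in one atomic call, so readers
    #    never see a gap.
    requests.post(
        f"{ES}/_aliases",
        json={"actions": [
            {"remove": {"index": "posts_v1", "alias": "posts"}},
            {"add": {"index": "posts_v2", "alias": "posts"}},
        ]},
    ).raise_for_status()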

~~~
mikljohansson
In our case at Meltwater we have new documents as well as updates to old
documents coming in, meaning older indexes are actively being modified with
low-latency requirements on the visibility and consistency of those
modifications. This makes it trickier both to reindex and to do a seamless,
no-downtime-whatsoever upgrade to a new major version. It's not infeasible at
all though, and we're working on it. Reindexing a PB of data should be
possible in a matter of weeks based on our estimates, but we shall see!

------
KirinDave
This article reflects an ongoing trend we've seen over the last 3 years: a
resurgence of optimization techniques for directly modeling tasks which
historically were considered "not amenable to optimization techniques." We saw
this last year as well with the landmark "The Case for Learned Index
Structures" [0], and in prior years with Coda Hale's "usl4j And You" [1],
which can easily be woven in realtime into modern container control planes.

We've also seen research in using search and strategy based optimization
techniques [2] (such as Microsoft's Z3 SMT prover[3]) to validate security
models. Some folks are even using Clojure's core.logic at request time to
validate requests against JWTs for query construction!

[0]: [https://arxiv.org/abs/1712.01208](https://arxiv.org/abs/1712.01208)

[1]: [https://codahale.com/usl4j-and-you/](https://codahale.com/usl4j-and-you/)

[2]: [http://www0.cs.ucl.ac.uk/staff/b.cook/FMCAD18.pdf](http://www0.cs.ucl.ac.uk/staff/b.cook/FMCAD18.pdf) (and by the way, this made it to prod last year, so an SMT solver is validating all your S3 security policies for public access!)

[3]: [https://github.com/Z3Prover/z3](https://github.com/Z3Prover/z3)

------
tnolet
Props to the person coming up with the Shardonnay name. Tiny bits of fun in a
pretty serious operation are awesome.

~~~
mikljohansson
Thanks! We have a fair bit of punny humor in our Gothenburg, Sweden
engineering office where the information retrieval and big data magic happens.
One of our plugins for Elasticsearch that significantly reduces GC pressure
and improves performance is called the "Greased Pig" plugin, because it makes
Elasticsearch all slippery so it doesn't get stuck in GC hell ;)

------
andrewvc
This is pretty cool tech.

I'd like to see this contrasted with the hot/warm/cold architecture, which is
the common approach to this problem.

Usually people deal with this by allocating faster but less storage-dense
hardware for recent data and slower but more storage-dense hardware for older
data. You can read more about this here:
[https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x](https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x)
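
For readers unfamiliar with the pattern, a minimal sketch (index names and the
box_type values are illustrative; nodes would be started with a matching
node.attr.box_type in elasticsearch.yml on 5.x+):

    import requests

    ES = "http://localhost:9200"

    # Pin a fresh, write-heavy index to the fast "hot" nodes.
    requests.put(
        f"{ES}/posts-2018-11/_settings",
        json={"index.routing.allocation.require.box_type": "hot"},
    ).raise_for_status()

    # When it ages out, relocate it to the denser, cheaper "warm" nodes.
    requests.put(
        f"{ES}/posts-2018-05/_settings",
        json={"index.routing.allocation.require.box_type": "warm"},
    ).raise_for_status()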

Maybe something about it didn't work for them, but it's not clear to me from
the article.

Disclaimer: I work for Elastic.

~~~
karlney
We tried the hot/cold architecture as well, and used it before in our data
centre architecture.

Currently, though, we have concluded that it is not worth the added
complexity. That might change again if/when we learn more or get new
requirements.

We do change the number of replicas for recent data vs old data. We actually
have a 4 tiered approach to how many replicas we use.

~~~
Serow225
Thanks for the article! We have a tiny hosted ELK cluster for our app logging,
but we don't have much expertise in good cluster design. We recently
implemented monthly rolling indices, and currently have a single 'archive'
index for the old data (from the past two years) sharded into 25GB chunks.
Would it make sense from a query performance perspective to add extra replicas
to the 'archive' shards, to help spread the load between the nodes? Thanks for
any thoughts!

------
boulos
Since I see Meltwater folks in this discussion, can you say a bit about the
distribution of your queries? In the other blog post about routing [1], you
mention that most take a few ms but others take minutes or more.

What’s the rough breakdown? Are most users performing queries “scoped” to a
fairly small subset?

[1] [https://underthehood.meltwater.com/blog/2018/09/28/using-machine-learning-to-load-balance-elasticsearch-queries/](https://underthehood.meltwater.com/blog/2018/09/28/using-machine-learning-to-load-balance-elasticsearch-queries/)

~~~
mikljohansson
Our query complexity and query response times follow a somewhat exponential
pattern, with a lot of quick queries and a long tail of more monstrous
queries.

We do have a large number of very complex queries coming in. What we deem to
be towards the "simple" end could easily be hundreds of terms, wildcard, near
and phrase operators. "Difficult" queries are things with hundreds of
thousands of terms or many wildcards within nears/phrases that expand to
millions of term combinations.

For Meltwater's customers it's usually important to get both very high recall
and precision in the dataset described by a query. It's very little about
getting a ranked result list and finding a hit (like what Google does); it's
much more about running analytics/dashboards/reports/trends over the dataset
delineated by the query, and exactness in the analytics matters a lot to
Meltwater customers.

This all makes for complicated queries, in order to get both high precision
and recall for whatever a customer is interested in analyzing. Our sales and
support organizations help customers write good queries, and we also use AI
systems to generate queries.

------
outworlder
Interesting take on the shard optimization.

I run a setup which is nowhere near as big as this one, and is mostly logging.
I have seen the hotspots as described. Our solution is also to run beefy
instances (i3 helps).

One thing that is sorely needed is a better solution than Curator. Curator is
cool if you are only doing, say, daily shard rotation.

But it is dumb and can't make any decisions on its own. If you want to do
something more complicated than that, even a simple hot/warm architecture,
then things get messy quickly.

Want to add a 'shrink' process while using hot/warm? OK, now Curator will
easily do dumb things like assign warm shards to a shrink node that also
happens to be tagged as 'hot'. And nothing works.

Want to change the number of shards new indexes will have based on historical
data (so they fall under 50GB each)? Curator can't do that. Or maybe change
the number of replicas based on the number of nodes (because you have some
auto-scaling going on)? I know you can do auto_expand_replicas 0-N, but N
changes. Curator doesn't do that either.
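
A small sketch of the kind of custom logic meant here: derive the replica
count from the current number of data nodes (the policy itself is made up):

    import requests

    ES = "http://localhost:9200"

    # Count the data nodes currently in the cluster.
    nodes = requests.get(f"{ES}/_cat/nodes?h=node.role&format=json").json()
    data_nodes = sum(1 for n in nodes if "d" in n["node.role"])

    # Illustrative policy: roughly one replica per ten data nodes, capped at 3.
    replicas = min(3, max(1, data_nodes // 10))

    requests.put(
        f"{ES}/archive-*/_settings",
        json={"index": {"number_of_replicas": replicas}},
    ).raise_for_status()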

Have a big cluster with hundreds of indices that need deletion every day? Have
fun with the bulk deletion – if it fails, it won't retry. And now your cluster
falls over because of high water marks popping everywhere.
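
Even something as simple as retrying the deletes already helps; a sketch (the
index pattern and backoff policy are illustrative):

    import time
    import requests

    ES = "http://localhost:9200"

    def delete_with_retry(pattern, attempts=5, backoff=30):
        # Keep retrying the delete instead of giving up on the first failure.
        for i in range(attempts):
            if requests.delete(f"{ES}/{pattern}").ok:
                return True
            time.sleep(backoff * (i + 1))
        return False

    delete_with_retry("logs-2018.10.*")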

You also need a single place to run it from. Since in many clusters it is
critical, you may want to replicate it, but then the Curator instances will
step on one another, and even if they don't, you need to make sure they are
all in sync.

So custom code it is. Frankly, the basic functions performed by Curator should
be available in ES itself – and actions pulled from an index stored in ES.

So now we have some ad-hoc scripts. If I get some time, I'll package them into
some utility.

~~~
Xylakant
At least for your ‘run it once in the entire cluster’ problem I can offer a
simple solution: Curator has a ‘master_only’ flag that makes it execute only
on the elected master. You can run it on all nodes, and on every other node
the run will end up as a no-op:
[https://www.elastic.co/guide/en/elasticsearch/client/curator...](https://www.elastic.co/guide/en/elasticsearch/client/curator/5.5/configfile.html#master_only)

------
carrja99
Really good overview. We have a similar problem with our 2.1 cluster as well:
we originally set up routing by customer, but some customers can run some
REALLY hot tasks that leave us with shards that are much larger than the
others.
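
For context, custom routing on a 2.x-era cluster looks roughly like this
(index, type and customer ids are made up); every document for a customer
lands on the same shard, which is exactly why one hot customer can blow up a
single shard:

    import requests

    ES = "http://localhost:9200"

    # Index a document, routed by customer id.
    requests.put(
        f"{ES}/posts/post/12345?routing=customer-42",
        json={"customer": "customer-42", "body": "example post"},
    ).raise_for_status()

    # Queries with the same routing value only hit that customer's shard.
    requests.post(
        f"{ES}/posts/_search?routing=customer-42",
        json={"query": {"term": {"customer": "customer-42"}}},
    )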

One thing I'm curious about from the article: what did you use to visualize
the heatmaps?

~~~
karlney
We use datadog for monitoring our cluster. Shardonnay also produces lots of
custom metrics that you don't get from the vanilla elasticsearch datadog
integration.

~~~
carrja99
Nice, we do too, interested in getting better monitoring around shard
allocation.

------
ahi
OT: Their pricing page appears to be broken in Firefox:
[https://www.meltwater.com/request-pricing/](https://www.meltwater.com/request-pricing/)

------
kokey
As much as I like Elasticsearch, it's been less fun to work with after Kibana
v3 (which, I think, only works with ES v1?). I can only imagine that people
needing really large Elasticsearch clusters nowadays don't really use Kibana
for the majority of queries against them.

------
foolfoolz
The craziest thing to me about this whole article is that this one company has
40 billion social media posts!

~~~
spullara
I personally have 25B just for fun / side projects / analysis. The surprising
thing for me is that for them it is petabyte scale (25KB+ each!) while for me
they take up 2TB (80 bytes each, compressed). They must have a tremendous
amount of metadata, or Elasticsearch is super inefficient.

~~~
mikljohansson
Our whole dataset exported as compressed JSON is likely not much more than
100TB. The petabytes come from all the index data structures
Elasticsearch/Lucene builds, as well as the high replication factor needed to
keep up with the query throughput.

We index a lot of NLP and other enrichments on our documents, which also adds
a lot of storage on top of the base text. And like Karl mentioned, we have
lots of small social media documents available for analytics (currently 34B,
sometime next year upward of 230B), but also 7+ billion documents of news,
blogs and other long-text articles indexed: basically 10 years of all news
media from the entire world, online and available for analytics.

We're building the [https://fairhair.ai/](https://fairhair.ai/) data science
platform to allow other companies to access this massive dataset and these
compute clusters, and to run online search and analytics on top of them, for
example to embed analytics over this dataset into their own SaaS products.

~~~
user5994461
In my experience, Elasticsearch is triple the size of the data:

First is the actual JSON data in quasi plain text.
Second is the _source field that duplicates the original input object (necessary for reindexing/rebuilding).
Third is the _all field that duplicates the JSON data as text (only used for some text search; better to disable it).

Finally, the index is duplicated to replicas, at least one if you want any
redundancy.

Index compression with lz4 takes 20 or 30% off, new feature of elasticsearch
v5.0, on by default.
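
A hedged sketch of trimming some of that overhead at index-creation time, on
versions that still have the _all field (the index name and mapping are
illustrative; best_compression is the optional DEFLATE codec rather than the
default LZ4):

    import requests

    body = {
        "settings": {"index.codec": "best_compression"},
        "mappings": {
            "post": {                          # pre-6.x mapping type
                "_all": {"enabled": False},    # drop the duplicated catch-all text field
                "properties": {"body": {"type": "text"}},
            }
        },
    }
    requests.put("http://localhost:9200/posts_v2", json=body).raise_for_status()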

