I understand from elsewhere on their site that they are not using a recent version of ES (2.x?). This has a big impact on how things behave, since later versions improved performance and garbage collection behavior by, e.g., doing more things off-heap with memory-mapped files, fixing tons of issues with clustering, refactoring the on-disk storage in Lucene 5 and 6, and giving users more control over sharding and index creation. So it's likely that this setup is working around issues that have been addressed in later versions.
For example, 6.x has a shard allocation API that allows you to control which nodes host which shards. It sounds like that would be useful in this setup.
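For reference, a minimal sketch of what that looks like: the cluster reroute API accepts explicit move commands (the index and node names here are made up for illustration):

```
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "logs-2018-11-01",
        "shard": 0,
        "from_node": "node-a",
        "to_node": "node-b"
      }
    }
  ]
}
```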
They also seem to be mixing query and indexing traffic on the same nodes. In a cluster this big I would suggest separating these and using specialized query and indexing nodes (as well as dedicated master nodes). That way write-node performance would be a lot more predictable, and query traffic could no longer disrupt writes. You could also use instances with more CPU/memory for querying, or use auto-scaling groups to add query capacity when needed. You might even get away with using cheaper T instances with CPU credits for this.
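The node roles are just a few lines of elasticsearch.yml in 6.x; something along these lines (per-node, picking one role each):

```
# Dedicated master node:
node.master: true
node.data: false
node.ingest: false

# Data node:
node.master: false
node.data: true

# Coordinating-only "client" node (routes queries, merges responses):
node.master: false
node.data: false
node.ingest: false
```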
The shard allocation API would also be a very helpful addition to this solution. Currently we can sometimes get into situations where Elasticsearch actively works against Shardonnay and makes very 'stupid' moves, out of our control, that Shardonnay then needs to 'fix' in later iterations. We have tried to disable Elasticsearch's own allocation algorithm as much as we can, but we still want it active as a safety net in case something unexpected happens with the hardware, for example.
We do have dedicated master nodes. We will also have query and indexing client nodes that coordinate and merge search responses, handle routing, etc. But behind those we utilize all data nodes for both query and indexing load. That has worked well so far because indexing load is fairly constant, and we scale query load on recent indexes by giving those a large number of replicas.
Our largest problems have always been the vast differences in query complexity that we see. But we are slowly getting on top of that problem now as well. Some details on how that is done can be found in the following blog post.
The thing is, at the scale you are running, that probably means you are running at a much higher cost than technically needed with a recent 6.x setup. You would probably see order-of-magnitude improvements in memory usage.
Recent versions of ES are much better at preventing complex queries from taking down nodes. They've done a lot of good work with circuit breakers that stop queries from getting out of hand. They are also a lot smarter about, e.g., caching filtered queries.
My guess is you should not need Shardonnay on a recent setup. Most of what it does should be supported natively these days. ES has been working with many companies on setups comparable to yours, and they've learned a lot in the last few years about how to do this properly.
Another feature that could be of interest to you is cross-cluster search, introduced in 6.x (replacing the tribe functionality they had in v2 and v5). This would allow you to isolate old data in a separate cluster optimized for reads. Probably whenever you hit those old indices, your query complexity goes through the roof because more shards need to be opened.
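Roughly, you register the remote cluster and then prefix indices with its name in searches. A sketch (the cluster and host names are hypothetical, and the exact setting key varies a bit between 6.x minor versions):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.cold_cluster.seeds": ["cold-node-1:9300"]
  }
}

GET cold_cluster:logs-2016-*/_search
{
  "query": { "match": { "body": "example" } }
}
```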
Our flavor of Elasticsearch 1.7 is faster than vanilla 2.* for our workload, though still slower than ES 2.* with our customizations applied.
Recent Elasticsearch versions still use the same basic shard allocation algorithms as far as we know. Our workload is heavily skewed towards recent data, but it's not a binary hot/cold matter; rather, there's an exponential decay in workload for older indexes. We fully expect to need Shardonnay to balance the workload even with ES 7+.
We're also in early conversations with Elastic about shard placement optimization. They seem to be interested in applying linear optimization in a similar way, with a goal of solving the fundamental problems with shard allocation based on observed workload.
The reason I think there are likely still problems at PB scale is the attitude of the ES core developers. They collectively act as if reindexing is no big deal, and their proposed solution to many things I consider bugs is "just reindex". Reindexing is the last thing I want to do, given how hard it is to do at TB scale with zero downtime. I don't think the core developers have experience with large clusters themselves, so I find it unlikely that just upgrading to 6.x would solve all the problems at PB scale.
I'm not saying upgrading will solve all your problems, but I sure know of a lot of problems I've had with 1.7 that are much less of a problem, or not a problem at all, with 6.x.
Reindexing is indeed time consuming. However, if you engineer your cluster properly, you should be able to do it without downtime. For example, I've used index aliases in the past to manage this: reindex in the background, then swap over to the new index atomically when ready. Maybe don't reindex all your indices at once. They also have a reindex API in ES 6 to support this. At PB scale this is of course going to take time. I've also considered doing reindexing on a separate cluster and using, e.g., the snapshot API to restore the end result.
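The alias trick above looks roughly like this: clients always query the alias, the reindex runs in the background, and the final swap is a single atomic call (index names here are just examples):

```
POST _reindex
{
  "source": { "index": "articles_v1" },
  "dest":   { "index": "articles_v2" }
}

POST _aliases
{
  "actions": [
    { "remove": { "index": "articles_v1", "alias": "articles" } },
    { "add":    { "index": "articles_v2", "alias": "articles" } }
  ]
}
```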
We've also seen research into using search- and solver-based optimization techniques (such as Microsoft's Z3 SMT solver) to validate security models. Some folks are even using Clojure's core.logic at request time to validate requests against JWTs for query construction!
http://www0.cs.ucl.ac.uk/staff/b.cook/FMCAD18.pdf (and by the way, this made it to prod last year, so an SMT solver is validating all your S3 security policies for public access!)
I'd like to see this contrasted with the hot-warm-cold architecture, which is the common approach to this problem.
Usually people deal with this problem by allocating faster, less storage-dense hardware for recent data and slower, more storage-dense hardware for older data. You can read more about this here: https://www.elastic.co/blog/hot-warm-architecture-in-elastic...
Maybe something about it didn't work for them, but it's not clear to me from the article.
Disclaimer: I work for Elastic.
Our workload is rather different from this. Indexing and document updates follow a somewhat exponential decay pattern back in time, and the same goes for queries. So there are no sharp cut-offs.
If we ran a hot/cold architecture we'd run into a few issues:
Within each tier we'd get an imbalanced workload across the nodes, since within a tier the workload varies greatly with the age of the indexes.
We use AWS i3 NVMe SSD instances. d2 instances with HDDs, or EBS, have too high I/O latency and too little throughput/IOPS even for our "cold" data workload. So a cold tier would scale based on storage needs, wasting lots of compute capacity, while a hot tier would scale based on compute needs, wasting tons of storage capacity.
By running both hot and cold workloads on the same set of nodes we get much more cost-effective utilization, since the hot workload uses most of the aggregate compute capacity and the cold data uses most of the storage.
But this then necessitates using Shardonnay to ensure we spread workload optimally across the clusters. And the more evenly we can spread it, the higher total utilization we can drive on the clusters without individual nodes overloading.
A hot/cold architecture would be much more costly for our workload, since we'd have unused storage in the hot tier and unused compute capacity in the cold tier. A single tier just makes much more sense for our particular use case.
Currently, though, we have concluded that it is not worth the added complexity. That might change if/when we learn more or get new requirements.
We do change the number of replicas for recent data vs. old data. We actually have a four-tiered approach to how many replicas we use.
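A tiered replica policy like that can be sketched in a few lines. This is a hypothetical illustration, not the actual tiers or counts used: newer indexes get more replicas to absorb query load, older ones fewer.

```python
def replicas_for_age(age_days: int) -> int:
    """Return a replica count for an index of the given age.

    Tier boundaries and counts below are invented for illustration;
    the idea is just that replica count decays with index age.
    """
    tiers = [
        (7, 3),    # hottest week: most query traffic, most replicas
        (30, 2),   # recent month
        (365, 1),  # past year
    ]
    for max_age_days, replicas in tiers:
        if age_days <= max_age_days:
            return replicas
    return 0       # archival tier: primaries only (risky; illustrative)
```

Applying the result is then a routine index-settings update (number_of_replicas is dynamically updatable).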
What’s the rough break down? Are most users performing queries “scoped” to a fairly small subset?
We do have a large number of very complex queries coming in. What we deem to be towards the "simple" end could easily be hundreds of terms, wildcard, near and phrase operators. "Difficult" queries are things with hundreds of thousands of terms or many wildcards within nears/phrases that expand to millions of term combinations.
For Meltwater's customers it's usually important to get both very high recall and precision in the dataset described by a query. It's much less about getting a ranked result list and finding a hit (e.g., what Google does), and much more about running analytics/dashboards/reports/trends over the dataset delineated by the query, where exactness in the analytics matters a lot to Meltwater customers.
This all makes for complicated queries, to get both high precision and recall for whatever a customer is interested in analyzing. Our sales and support organizations help customers write good queries, and we also use AI systems to generate queries.
I run a setup which is nowhere near as big as this one, and is mostly logging. I have seen the hotspots as described. Our solution is also to run beefy instances (i3 helps).
One thing that is sorely needed is a better solution than Curator. Curator is fine if you are only doing, say, daily index rotation.
But it is dumb and can't make decisions. If you want to do something more complicated than that, even a simple hot/warm architecture, things get messy quickly.
Want to add a 'shrink' process and use hot/warm? Ok, now Curator will easily do dumb things like assign warm shards to a shrink node that also happens to be tagged 'hot'. And nothing works.
Want to change the number of shards new indexes will have based on historical data (so each shard falls under 50GB)? Curator can't do that. Or change the number of replicas based on the number of nodes (because you have some auto-scaling going on)? I know you can do auto_expand_replicas 0-N, but N changes. Curator doesn't do that either.
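The shard-sizing part of that is the kind of trivial arithmetic Curator won't do for you. A sketch, assuming you already track observed daily index sizes somewhere (the function name and cap are made up):

```python
import math

TARGET_SHARD_GB = 50  # keep each shard under ~50GB, per the discussion


def shards_for(expected_index_gb: float, max_shards: int = 64) -> int:
    """Pick a primary shard count so each shard stays below the target size.

    `expected_index_gb` would come from historical data for the index
    pattern; `max_shards` is an arbitrary safety cap for this sketch.
    """
    needed = math.ceil(expected_index_gb / TARGET_SHARD_GB)
    return max(1, min(max_shards, needed))
```

The result would feed into an index template's number_of_shards before the next day's index is created.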
Have a big cluster with hundreds of indices that need deletion every day? Have fun with the bulk deletion: if it fails, it won't retry. And now your cluster falls over because disk high-watermarks are being hit everywhere.
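The retry behaviour Curator lacks is not much code. A sketch: delete indices one at a time, retry transient failures with backoff, and keep going instead of aborting the whole batch (`delete_fn` stands in for the actual delete call against the cluster):

```python
import time


def delete_with_retry(indices, delete_fn, attempts=3, backoff=1.0):
    """Delete each index, retrying transient failures with exponential backoff.

    Returns the list of indices that still failed after all attempts,
    rather than aborting the whole run on the first error.
    """
    failed = []
    for name in indices:
        for i in range(attempts):
            try:
                delete_fn(name)
                break
            except Exception:
                if i == attempts - 1:
                    failed.append(name)          # give up on this one, keep going
                else:
                    time.sleep(backoff * (2 ** i))
    return failed
```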
You also need a single place to run it from. Since in many clusters it is critical, you may want to replicate it. But then the Curator instances will step on one another, and even if they don't, you need to make sure they are all in sync.
So custom code it is. Frankly, the basic functions performed by Curator should be available in ES itself, with actions pulled from an index stored in ES.
So now we have some ad-hoc scripts. If I get some time, I'll package them into some utility.
One thing I'm curious about from the article, what did you use to visualize the heatmaps?
We index a lot of NLP and other enrichments on our documents, which adds a lot of storage on top of the base text. And like Karl mentioned, we have lots of small social-media documents available for analytics (currently 34B, upward of 230B sometime next year). But we also have 7+ billion documents of news, blogs and other long-text articles indexed: basically 10 years of all news media from the entire world, online and available for analytics.
We're building the https://fairhair.ai/ data science platform to allow other companies to access this massive dataset and compute clusters, and to run online search and analytics on top of them, for example to embed analytics over this dataset into their own SaaS products.
First is the actual JSON data, in quasi-plain text.
Second is the _source field, which duplicates the original input object (necessary for reindexing/rebuilding).
Third is the _all field, which duplicates the JSON data as text (it's only used for some text searches; better to disable it).
Index compression with LZ4 takes 20-30% off; it's on by default as of Elasticsearch 5.0.
We also have quite a lot of metadata.
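For what it's worth, disabling _all as suggested above is a one-liner in the mapping on the 2.x/5.x versions being discussed (index and type names here are placeholders; in 6.x _all is already deprecated and off for new indices):

```
PUT articles
{
  "mappings": {
    "doc": {
      "_all": { "enabled": false }
    }
  }
}
```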