
Guide to running Elasticsearch in production - thunderbong
https://facinating.tech/2020/02/22/in-depth-guide-to-running-elasticsearch-in-production/
======
DmitryOlshansky
Mostly good stuff but a few comments:

\- article doesn’t clarify if it’s on hardware or VMs

\- 140 shards per node is certainly on the low side; one can easily scale to
500+ per node (if most shards are small, typically a power-law distribution)

\- more RAM is better, and there is a disk:RAM ratio you need to keep in mind
(30-40 for hot data, 200-300 for warm data)

\- heaps beyond 32g can be beneficial but you’d have to go for 64g+, 32-48g is
a dead zone

\- not a single line about GC tuning (I find default CMS to be quite horrible
even in recommended ~31g sizes)

\- CPUs are often a bottleneck when using SSD drives
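To put rough numbers on the disk:RAM ratios above, a quick back-of-the-envelope sketch (these are the rules of thumb from this comment, not official Elastic guidance):

```python
# Rough capacity planning from the disk:RAM rules of thumb above.
# A "hot" node serving actively indexed data tolerates less disk per
# gigabyte of RAM than a "warm" node serving rarely queried data.

def max_disk_gb(ram_gb: float, ratio: float) -> float:
    """Disk capacity a node can reasonably serve at a given disk:RAM ratio."""
    return ram_gb * ratio

ram = 64  # GB of RAM on the node

print(f"hot node  (30:1):  ~{max_disk_gb(ram, 30):.0f} GB of data")
print(f"hot node  (40:1):  ~{max_disk_gb(ram, 40):.0f} GB of data")
print(f"warm node (200:1): ~{max_disk_gb(ram, 200):.0f} GB of data")
print(f"warm node (300:1): ~{max_disk_gb(ram, 300):.0f} GB of data")
```

So a 64 GB hot node tops out at roughly 2 TB of data, while the same box used as a warm node can serve an order of magnitude more.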

~~~
MuffinFlavored
Serious question: does indexing Logstash/JSON logs really need to take
gigabytes of memory + disk and sharding?

~~~
jrockway
No. ELK is a slow and expensive way to store and retrieve logs. The reason
people use it is that nothing else exists. (I was blown away when I started
using it at my last job. I used the fluentd Kubernetes daemonset to extract
logs from k8s and ship them into ES... and the "cat a file and send it over the network"
thing uses 300+MB of RAM per node. There is an alternate daemon that can be
used now, but wow. 300MB to tail some files and parse JSON.)

I think a better strategy is to store logs in flat files with several
replicas. Do your metric generation in realtime, regexing a bunch of logs on a
bunch of workers as they come in. (I handled > 250MB/s on less than one Google
production machine, though did eventually shard it up for better
schedulability and disaster resilience. Also those 10Gb NICs start to feel
slow when a bunch of log sources come back after a power outage!)

For simple lookups like "show me all the logs in the last 5 minutes", you can
maintain an index of timestamp -> log file in a standard database, and do
additional filtering on whatever is retrieving the files. You can also
probably afford to index other things, like x-request-id and maybe a trigram
index of messages, and actually be able to debug full request cycles in a
handful of milliseconds when necessary. For complicated queries, you can just
mapreduce it. After using ES, you will also be impressed at how fast grepping
a flat file for your search term is.
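The timestamp -> log file index described above can be surprisingly small; here is a toy sketch with SQLite standing in for the "standard database" (schema and paths are hypothetical):

```python
import sqlite3

# Hypothetical index: each log file is registered with the time range it
# covers. "Show me the last 5 minutes" becomes a range query, and the
# actual filtering happens on whatever retrieves the returned files.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE log_files (
    path TEXT, start_ts INTEGER, end_ts INTEGER)""")
db.execute("CREATE INDEX idx_ts ON log_files (start_ts, end_ts)")

db.executemany("INSERT INTO log_files VALUES (?, ?, ?)", [
    ("logs/web-001.log", 1000, 1999),
    ("logs/web-002.log", 2000, 2999),
    ("logs/web-003.log", 3000, 3999),
])

def files_overlapping(t0: int, t1: int) -> list[str]:
    """Files whose covered time range intersects [t0, t1]."""
    rows = db.execute(
        "SELECT path FROM log_files WHERE start_ts <= ? AND end_ts >= ? "
        "ORDER BY start_ts",
        (t1, t0))
    return [r[0] for r in rows]

print(files_overlapping(1500, 2500))  # the two files covering that window
```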

The problem is, the machinery to do this easily doesn't exist. Everything is
designed for average performance at large scale, instead of great performance
at medium scale. Someday I plan to fix this, but I just don't see a business
case, so it's low priority. Would you fund someone who can make your log
queries faster? Nope. "Next time we won't have a production issue that the
logs will help debug." And so, there's nothing good.

~~~
pas
> The reason people use it is that nothing else exists.

Maybe [https://github.com/grafana/loki](https://github.com/grafana/loki), but I
haven't tried it yet.

(Or [https://github.com/phaistos-networks/TANK](https://github.com/phaistos-networks/TANK) ..?)

> I think a better strategy is to store logs in flat files with several
> replicas

Agreed. We just used beats + logstash and put the files into Ceph.

> x-request-id and maybe a trigram index of messages, and actually be able to
> debug full request cycles in a handful of milliseconds when necessary.

Yes, yes, yes. That would be great.

~~~
social_quotient
Curious why ceph?

~~~
cpitman
Loki and similar solutions are leveraging object storage (not just Ceph) as a
way to store chunks of logs relatively cheaply, and scale performance. Loki
can work on a single system, storing logs on the local filesystem, but that
will eventually become an availability or performance bottleneck. Putting logs
in object storage allows multiple systems to store/query/etc.

------
tyingq
Does it still have the issue of failing _"open to the world"_ if the x-pack
trial expires? See:
[https://discuss.elastic.co/t/ransom-attack-on-elasticsearch-cluster/71310/18](https://discuss.elastic.co/t/ransom-attack-on-elasticsearch-cluster/71310/18)

~~~
six2seven
As an alternative, there's the open-source Open Distro for Elasticsearch [1],
which offers X-Pack-like security along with some other X-Pack-like features.
It is not officially supported by elastic.co, but it's a pretty good
alternative and is backed by Netflix, Amazon, et al. Worth giving it a try.

[1] [https://opendistro.github.io/for-elasticsearch/](https://opendistro.github.io/for-elasticsearch/)

~~~
puszczyk
> ... supported by Netflix, Amazon, et al.

So in case I'm stuck I can pay Netflix to help me?

------
there_the_and
Unsecured Elasticsearch servers have been implicated in multiple breaches in
recent months [1][2]. Since this post is an "In-depth guide to running
Elasticsearch in production," it should prominently include information
related to security and configuration. With tools like these where there is a
learning curve for new users, security can end up treated as an afterthought,
leading to these kinds of breaches.

1\. [https://www.pandasecurity.com/mediacenter/news/billion-consumers-data-breach-elasticsearch/](https://www.pandasecurity.com/mediacenter/news/billion-consumers-data-breach-elasticsearch/)

2\. [https://thedefenceworks.com/blog/250-million-microsoft-records-exposed-in-another-elasticsearch-server-related-breach/](https://thedefenceworks.com/blog/250-million-microsoft-records-exposed-in-another-elasticsearch-server-related-breach/)

_Edited for clarity_

~~~
cloakandswagger
This guide is clearly intended to focus on the ops-side of ElasticSearch. No
one is being irresponsible, you're basically just complaining that the article
was written about one topic instead of another.

Notice how it also doesn't talk about system architecture, load balancers,
disaster recovery, etc? It's because the author chose to focus the post on
cluster configuration. The topic of security could be its own standalone
writeup and I highly doubt that its omission is an endorsement for running an
ES cluster totally exposed and unsecured.

~~~
tmpz22
The argument is that you can't have an in-depth production guide to
Elasticsearch without a section on security. "Production" should be "secure".
A better title would be "optimizing Elasticsearch performance in production"
or something of the sort.

~~~
isbvhodnvemrwvn
To be honest, I think if you're responsible for running production systems, it
would be a no-brainer to run everything as closed up as it gets, with access
only from the servers that actually need it.

~~~
judge2020
Yet we see security breaches caused by trivial misconfigurations and bad (or
no) firewall setups. Chances are, people building these systems aren't
accustomed to security-first deployment and will use and bookmark a guide like
this to properly set up instances, rarely if ever going back to the docs or
looking at other guides.

~~~
rhizome
_Chances are, people building these systems aren't accustomed to security-
first deployment and will use and bookmark a guide like this to properly set
up instances_

Or they aren't given the time, running on ASAP-brand project management and/or
pushing the POC to prod.

------
ivan_ah
Does anyone have experience running Elasticsearch as a kubernetes deployment?
Can you just spin up some big-RAM containers attached to persisted volumes?

Elastic Co. seems to have an offering specialized for k8s:
[https://www.elastic.co/elastic-cloud-kubernetes](https://www.elastic.co/elastic-cloud-kubernetes)
but I can't understand what it does exactly.

Our data is not crazy-big and it doesn't need to be super performant, but for
operational simplicity I'd like to deploy as part of the production cluster
like all our other app containers rather than some "special" type of
container.

~~~
zegl
Yep, I'm running ES in a StatefulSet. It works nicely out of the box using
headless Services for node-to-node discovery, and with a custom preStop hook
to make sure that the cluster doesn't go RED after a node shuts down.

~~~
pm90
Does the preStop hook just leave the cluster instead of dying and waiting for
reconciliation?

~~~
zegl
It blocks forever if the cluster isn't GREEN.
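A sketch of that blocking logic (assuming Elasticsearch's `_cluster/health` API; the real hook may well be a shell loop):

```python
import json
import time
import urllib.request

def is_safe_to_stop(health: dict) -> bool:
    """A node may shut down only when the cluster is green and no
    shards are still relocating onto other nodes."""
    return (health.get("status") == "green"
            and health.get("relocating_shards", 0) == 0)

def pre_stop(url="http://localhost:9200/_cluster/health", interval=5):
    # Block indefinitely, as the parent describes, until it is safe
    # for Kubernetes to proceed with terminating the pod.
    while True:
        try:
            with urllib.request.urlopen(url) as resp:
                if is_safe_to_stop(json.load(resp)):
                    return
        except OSError:
            pass  # ES not reachable yet; keep waiting
        time.sleep(interval)
```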

------
troelsSteegin
I appreciate the systems perspective and find the writeup useful. However,
from a production perspective, I think security should be topic one.

~~~
speedplane
> from a production perspective, I think security should be topic one

Two general approaches to security:

\- Upgrade to a paid Elastic cluster, and use their own full-featured security
suite.

\- Put a reverse proxy (like nginx) in front of Elastic, and configure it to
handle security.
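A minimal sketch of the reverse-proxy approach, with a basic-auth gate; the upstream address and htpasswd path are placeholders, and the TLS certificate directives are elided:

```nginx
# Hypothetical fragment: nginx as an authenticating gatekeeper in front
# of an Elasticsearch node that only listens on loopback.
server {
    listen 443 ssl;
    # ... ssl_certificate / ssl_certificate_key ...

    location / {
        auth_basic           "Elasticsearch";
        auth_basic_user_file /etc/nginx/es.htpasswd;
        proxy_pass           http://127.0.0.1:9200;
    }
}
```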

------
ram_rar
We used to use Elasticsearch a lot for log aggregation. It's a beast of its
own; you still need a dedicated team to handle it. We eventually moved to
Splunk- and Wavefront-like solutions. It'll cost a lot less and frees up
engineering time to build a better product.

~~~
thickice
Genuine question, as someone with no experience dealing with large-scale log
aggregation: can you share some details on what kind of issues you ran into in
production with Elasticsearch that needed a dedicated team to manage?

------
rcarmo
1) Firewall it before anything else.

The rest is pretty much common sense for anyone used to running Solr and other
indexers in production, but I've always been somewhat amazed at finding
default setups insecure (a situation that was still largely true when I looked
in depth at the Python clients last year), which is madness considering that
the security (authentication, authorization, and auditing) docs are pretty
good.

So making sure it's only accessible to either localhost or a restricted subnet
_when you install it_ seems like the minimum sane thing every sysadmin ought
to do by default.
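In `elasticsearch.yml`, that minimum looks something like this (values are illustrative; substitute an interface on your restricted subnet where appropriate):

```yaml
# Bind HTTP and transport to loopback only; remote clients then have to
# come through whatever proxy or firewall you deliberately put in front.
network.host: 127.0.0.1
http.port: 9200
```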

Maybe nobody actually reads the docs...

(edit: typos)

------
animalnewbie
Is there a non-Java alternative to this ES/Logstash stuff? Preferably rust or
a native lang, but okay with CLR too. I'm not comfortable running Java in
production after previous memory issues...

~~~
CameronNemo
You can check out [https://vector.dev/](https://vector.dev/) to replace
Logstash. Not sure about replacing Elasticsearch with something non-Java,
especially for the search use case -- Lucene is fairly dominant. For metrics
you have Prometheus (Go; not sure if that is better for memory issues, with
the non-tunable GC). You will probably want/need a clustered storage backend
for Prometheus. For that you have lots of options:
[https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage](https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage) .
Of those, TiKV (Rust), InfluxDB (Go), and TimescaleDB (C; it's a PostgreSQL
extension) seem like decent options.

~~~
animalnewbie
This is obviously a dumb question, being in this thread and all, but what's
the difference between Logstash, Elasticsearch, and Lucene? I only ever saw
one amazing demo once, with real-time search, and I remember hearing all 3
names. My guess is Logstash stores and the other two retrieve?

~~~
itronitron
Lucene is the core search engine; it can be used on its own and has a lot of
depth in its many capabilities.

Elasticsearch (and SolrCloud >> LucidWorks Fusion Server) adds a distributed
architecture that leverages Lucene's capabilities.

Logstash helps ES deal easily with log files, and it has been one of the main
marketing drivers used by Elastic, which is ironic, since most of Lucene's
capabilities are more useful on human-generated text than on machine-generated
logs.

------
StreamBright
Great writeup. I wish there were a search engine built on top of Riak with a
bit simpler workload distribution.

~~~
peterwwillis
There's really nothing simple about building apps on top of Riak. It's one of
those things that seems simple until you use it in production, and then you
realize it's a total nightmare and you can't wait to sunset it.

~~~
StreamBright
Without details, I am not really sure what you are talking about. Riak is just
a key-value store, probably the simplest abstraction for storing data. I have
put it in production several times and operated it for years. Reliability and
performance are just outstanding. Scalability capped out around 150 nodes in
the same cluster.

------
pmarreck
I recently built a webapp and tried to avoid using ES by using Postgres
full-text search, and it's working great so far.

------
chx
If you secure your Elasticsearch cluster without paying and want to test your
queries, it seems to me there are three ways: the command line, Deja Vu, and
the Elasticsearch Head extension. If you try to use the Elasticsearch Head
extension you will run into
[https://github.com/mobz/elasticsearch-head/issues/431](https://github.com/mobz/elasticsearch-head/issues/431)

Overall, security became free only last year
([https://www.elastic.co/blog/security-for-elasticsearch-is-now-free](https://www.elastic.co/blog/security-for-elasticsearch-is-now-free)),
and knowledge and tooling are thin on the ground.

------
testplzignore
> every document you put into ES will create a segment with only that single
> document

That's not correct. A new segment is created (or rather, made immutable) after
a commit. Creating a new segment for every document is a surefire way to kill
performance.

------
andrewflnr
This was a really helpful article for understanding the architecture of
Elasticsearch. However, what I really want to know is why Elastic has the
reputation of crapping itself for no reason and what can be done about it.

~~~
arwhatever
It seems like most articles discussing Elasticsearch administration read like
an instruction manual for something like a steam locomotive: describing just
how to constantly shovel the coal in, relieve the steam pressure in just the
right way, etc.

------
winrid
I have found you need a queue in front of ES for better write reliability,
more so than with other DBs. The read/aggregation performance is fantastic,
though.

You should also know a little about GC tuning.

~~~
petemc_
Can I ask how you implemented such a queue?

~~~
igama
One way, for example, would be to store all events you want to write to
Elasticsearch in a queue (like Rabbit, Kafka, etc.), then have a set of
workers/importers (this can even be Logstash with input from the queue and
output to ES) that read the events from that queue and store them in ES.

This way your writes become asynchronous, and the system above is not affected
if ES becomes slow at writing or goes down for some reason. Your events will
remain stored in the queue, waiting to be processed.
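A toy sketch of that pipeline shape, with an in-memory queue standing in for Rabbit/Kafka and a stub in place of the real ES bulk call:

```python
import json
import queue

events = queue.Queue()   # stand-in for Rabbit/Kafka
indexed = []             # stand-in for the ES index

def bulk_index(batch):
    """Stub for an Elasticsearch _bulk request; a real importer would
    retry here and leave events on the queue if ES is slow or down."""
    indexed.extend(batch)

def producer(app_events):
    # The application only ever touches the queue, never ES directly.
    for e in app_events:
        events.put(json.dumps(e))

def worker(batch_size=2):
    # Drain the queue in batches, the way a Logstash importer would.
    batch = []
    while not events.empty():
        batch.append(json.loads(events.get()))
        if len(batch) >= batch_size:
            bulk_index(batch)
            batch = []
    if batch:
        bulk_index(batch)

producer([{"msg": "login"}, {"msg": "logout"}, {"msg": "error"}])
worker()
print(len(indexed))  # 3: all events made it through the queue
```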

------
speedgoose
I was considering using Elasticsearch to replace my CouchDB indexes, which are
way too slow, memory hungry, and not optimized. But I read somewhere, I can't
remember the source, that Elasticsearch doesn't offer any guarantee that all
your data will (eventually) be saved or returned when queried. Is that the
case?

~~~
frant-hartm
Maybe you are thinking of Aphyr's analysis of Elasticsearch, where he showed
that Elastic can lose indexed documents during network partitions:

[https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0](https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0)

That analysis was done on a relatively old version. Elastic documents known
(and fixed) issues here:
[https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html](https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html)
But I wouldn't trust this to the letter, mainly because of their previous
handling of such issues.

Elastic is great as a search index, not as a primary database.

~~~
speedgoose
Thanks for your clarification.

Not using it as a primary database is fine, and I understand that it's not
designed for that. But I need to trust the results when I query it. Is it
designed for fuzzy searches on many documents, where missing one or two
documents has no consequences, or can I trust that I will always retrieve all
documents that should match my query?

~~~
frant-hartm
In theory, with the correct settings, you should get all of them (or be told
that the results are partial, e.g. when a shard is unavailable or timed out).

In practice there are bugs, which are notoriously difficult to find and
reproduce when you go distributed. Jepsen is the closest thing to that I know
about.
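The "told that the results are partial" part shows up in the response itself: each search response carries a `_shards` section worth checking. The sample response below is fabricated, but the field names are the real ones:

```python
def results_are_complete(search_response: dict) -> bool:
    """True only if every shard contributed to the response."""
    shards = search_response["_shards"]
    return (shards["failed"] == 0
            and shards["successful"] == shards["total"])

# Fabricated sample response using the real field names.
resp = {
    "timed_out": False,
    "_shards": {"total": 5, "successful": 4, "skipped": 0, "failed": 1},
    "hits": {"total": {"value": 41}},
}

print(results_are_complete(resp))  # False: one shard failed to respond
```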

~~~
speedgoose
Thanks again. Distributed systems are difficult. I think I should have gone
with old-school PostgreSQL.

------
eindiran
Can someone with experience with multiple indices comment on the relative
advantages and disadvantages of running each of the major ones in production:

\- Elastic

\- Lucene

\- Solr

\- CouchDB

\- Any other ones I'm unfamiliar with

------
shreyshrey
For the ES and Solr gurus here: what is the recommended max size for documents
if you want to index a lot of office documents?

~~~
itronitron
Whatever makes sense to your users for their search results. Do they want to
get back the whole document or just the relevant parts?

If there are separate sections in the office documents that you can pull out
and index as separate fields then you should do that. For example, if you were
indexing patents, you would want to index abstracts and claims into separate
fields.

~~~
shreyshrey
Just the relevant parts. ES says their max size is 100 MB. We have a real-life
scenario where we want to index millions of office documents to find PII/PHI.

What is the realistic expectation here? Should we say 50 MB? How does
everybody else do it?

~~~
itronitron
Not sure about ES, but Solr removed its max field limit in release 4.0. Text
documents tend to be a lot smaller than people expect, both in terms of word
count and file size. I think you will be fine with 50 MB if you are using ES.

------
gui77aume
Is it OK to have two nodes (mdi)? Or does 2n+1 mean you always want an odd
number of master-eligible nodes?

------
mrwnmonm
Can anyone provide rough numbers on how much a million documents with normal
queries on them would cost?

~~~
igama
No one can answer that for you; you need to test it out yourself. It depends
on the size of the documents, number of keys, volume of reads, etc. Spin up 1
instance and test it out ;)

------
LordHumungous
Good summary. I recently did this and had to figure this all out by trial and
error.

------
harryf
Struggling to read this ... the domain typo triggering OCD

~~~
StreamBright
For a long time, I have read articles only in Pocket. The web has become
unreadable at this stage. It is a bit funny that it was created to make the
distribution of knowledge easier, with readability as its primary feature.

------
jugg1es
Great to read from someone who knows it so deeply.

~~~
speedplane
> Great to read from someone who knows it so deeply.

I'd say it's a decent read on Elasticsearch tuning at the intermediate level,
but not enough to really get high performance from Elastic.

One of the problems with Elastic tuning is that the tuning parameters depend
deeply on the type of data you're indexing. Indexing text is far different
than indexing numbers or dates. A mapping with a large number of fields will
behave differently than one with just a few. Some datasets can easily be split
up into different indexes, and others cannot.

To really get the most from Elasticsearch, you have to know what it's doing
under the hood and how that maps onto your data. Elasticsearch hides a lot of
complexity (generally a good thing), but unfortunately, it can be difficult to
know where the bottlenecks are.

------
mychael
Is there a guide for deploying ES on K8S?

~~~
amtux
For kubes you could use the helm chart they provide and tune it similarly.
[https://github.com/elastic/helm-charts/tree/master/elasticsearch](https://github.com/elastic/helm-charts/tree/master/elasticsearch)

------
liveoneggs
I prefer Solr.

