

First Do No Harm: Four Things You Should Never Do with Elasticsearch - vanderzyden
http://blog.qbox.io/important-things-never-do-with-elasticsearch

======
enkiv2
The author probably should have expanded upon the point about sizing. Having
even an incidentally undersized cluster for ES is dangerous.

I worked with ES for a while with a pretty reasonably sized cluster and what I
considered to be a pretty reasonably sized data set. I frequently ran into
cases where ES nodes, and the cluster in general, handled resource overflows
in an utterly brain-damaged way.

Doing bulk loading? Don't let your request be larger than 200MB -- Jetty will
crash and hand off corrupt and incomplete data to the loader, which will
happily load it. You'll never notice unless you look at the logs.
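
One way to stay under a request-size ceiling like the one above is to split the load into batches by serialized size before anything is sent. The following is a minimal sketch, not anything ES ships: `chunk_bulk_body` and its 100 MB default are assumptions made for illustration, and it assumes each item is an action line followed by a source line in the newline-delimited bulk format.

```python
import json


def chunk_bulk_body(pairs, max_bytes=100 * 1024 * 1024):
    """Yield newline-delimited JSON bodies, each at most max_bytes long.

    pairs: iterable of (action_dict, source_dict) tuples.
    max_bytes: hypothetical cap, chosen to sit well under the limit
               that caused trouble; tune it for your own setup.
    """
    batch, size = [], 0
    for action, source in pairs:
        # One bulk entry is two JSON lines, each terminated by a newline.
        entry = (json.dumps(action) + "\n" + json.dumps(source) + "\n").encode("utf-8")
        if len(entry) > max_bytes:
            raise ValueError("a single document exceeds max_bytes")
        # Flush the current batch before it would grow past the cap.
        if batch and size + len(entry) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(entry)
        size += len(entry)
    if batch:
        yield b"".join(batch)
```

Each yielded body can then be sent as its own request, and the loader can compare the number of documents it sent against what the index reports afterwards instead of relying on the logs.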

Trying to fit a little less than half as much data as you have space on disk?
Better hope none of your nodes ever have a connectivity problem -- automatic
shard replication will attempt to replicate a shard to a machine that doesn't
have enough space to hold it, will fail to recognize that shard replication
was incomplete, and half your requests will go to an incomplete and corrupted
shard. You won't notice it unless you're staring at the logs when it happens.
If you don't catch it in time, the corrupted shard itself could be replicated.
The only fix is to wipe the entire cluster and reload.
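
A cheap guard against this failure mode is to check, before trouble starts, whether every node could actually hold another copy of your largest shard. This is a rough sketch under stated assumptions: `nodes_without_headroom`, the `safety_factor` of 2, and the shape of the input are all hypothetical, and the free-space and shard-size figures would have to come from your own monitoring.

```python
def nodes_without_headroom(free_bytes_by_node, largest_shard_bytes, safety_factor=2.0):
    """Return the names of nodes that could not safely receive a shard copy.

    free_bytes_by_node: dict mapping node name -> free disk space in bytes.
    largest_shard_bytes: size of the biggest shard that might be relocated.
    safety_factor: hypothetical margin; 2.0 means "room for two copies".
    """
    needed = largest_shard_bytes * safety_factor
    # A node is flagged when its free space is below the required margin.
    return sorted(
        node for node, free in free_bytes_by_node.items() if free < needed
    )


# Example with made-up numbers: node "b" cannot take a 20-byte shard twice over.
print(nodes_without_headroom({"a": 100, "b": 10}, 20))
```

If the returned list is ever non-empty, a single connectivity blip could trigger exactly the partial replication described above, so that is the moment to add disk or shed data.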

Trying to save space by using Java's default memory pre-allocation sizes?
Better hope you don't have very much data. When ES gets bigger than your -Xmx
setting, it won't die -- it'll just slowly begin to corrupt all your shards,
and happily serve corrupt information from corrupt shards.

Basically, if using ES, make sure you have _many, many_ times more resources
than you have data, and be _very conservative_ about using those resources,
because running out of any single resource means ditching the cluster and
reloading all data from scratch.

(These experiences occurred in ~2011 and in ~2013. It's possible that some of
these problems have been fixed. However, at the time, Shay Banon's position on
most bugs of this type was that running out of resources was the user's problem
and ES had no responsibility to mitigate the damage -- a general policy of
wontfix on anything to do with failure states involving running out of
resources.)

