
Elasticsearch node crashes can cause data loss - felipehummel
https://github.com/elastic/elasticsearch/issues/10933
======
rdtsc
Mandatory reading -- Last year's Call Me Maybe : Elasticsearch

[https://aphyr.com/posts/317-call-me-maybe-elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch)

I've been hearing a lot of people talk about Elasticsearch lately. I get the
same gut feeling I was getting about MongoDB back during the "Webscale" days.

~~~
bkeroack
In my experience, Elasticsearch is the single most common source of
infrastructure downtime and service failure. It's basically my arch nemesis.

~~~
willejs
I am interested to hear a bit more about this, as I find it hard to believe. I
have only run it at pretty small scale - 8 servers, around 300 million
documents indexed a day, peak index rate 30k docs/sec. I found that you have
to monitor it correctly, tune the JVM slightly (mostly GC), give it fast
disks, lots of RAM, and the correct architecture (search, index & data nodes)
to get the most out of it. Once I did that it was one of the most reliable
components of my infrastructure, and still is. I would recommend chatting to
people on the Elasticsearch IRC or mailing list; everyone was a great help to
me there.

~~~
bkeroack
The full explanation deserves a blog post, but in a nutshell it revolves
around the fact that ES contains a huge amount of complexity around a feature
that is actually fairly useless (the "elastic" part), or at least difficult to
use correctly. I've found that you need to be a deep expert in ES to architect
and run it properly (or have access to such expertise), and even then it
requires regular care and feeding to maintain uptime. In a short-deadline
startup world you probably won't have time for any of that -- once it's
working it will lull you into a false sense of security and then completely
blow up a few weeks or months later.

------
tedchs
The advice I've heard from serious people using Elasticsearch for serious
things is that you should definitely not use Elasticsearch as a primary data
store (i.e. it should be treated as a cache).

~~~
po
It is often advocated as a datastore for logging data... which means (in that
case) it's usually the primary datastore but perhaps not mission-critical.

~~~
alrs
It's a great index for log data.

Spew your log data into a standard syslog server, while also pumping it into
Logstash.

Using Elasticsearch as your canonical log storage would be ridiculous.

------
eclark
A program crash will not lose data that has already been written into the FS
cache (as opposed to data still sitting in user-space buffers, e.g. via
std::ostream::write). Dirty pages will eventually be written to disk even if
the process dies uncleanly. Only something that keeps the kernel from
flushing to disk can prevent a page from eventually being written out (driver
bug, kernel bug, hardware failure, power failure).

From reading the code in Jepsen it looks like kill -9 is all that's being used
to start failures. So there's a real bug here:
[https://github.com/aphyr/jepsen/blob/master/elasticsearch/sr...](https://github.com/aphyr/jepsen/blob/master/elasticsearch/src/elasticsearch/core.clj#L361)
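The distinction drawn above can be sketched in plain Python (a hypothetical
temp file; the size checks stand in for "what has reached the kernel"): data
in a user-space buffer dies with the process, data in the page cache survives
a `kill -9` but not a power failure, and only `fsync` reaches stable storage.

```python
import os
import tempfile

# Create an empty temp file to write into.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "w")
f.write("hello")
# The bytes are still in the process's own (user-space) buffer:
# a crash here (e.g. kill -9) loses them -- the kernel never saw them.
assert os.path.getsize(path) == 0

f.flush()
# Now the bytes sit in the kernel's page cache.  A process crash can no
# longer lose them; only a kernel bug, hardware failure, or power loss
# can, until they reach stable storage.
assert os.path.getsize(path) == 5

os.fsync(f.fileno())  # force the page cache out to the device
f.close()
os.unlink(path)
```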

~~~
rdtsc
I think Kyle was just going by the documentation. That is often what he
tests -- how reality compares to the claims in the documentation and
marketing.

So given these claims:

> Per-Operation Persistence. Elasticsearch puts your data safety first.
> Document changes are recorded in transaction logs on multiple nodes in the
> cluster to minimize the chance of any data loss.

One would hope they at least flushed the user space buffers.

------
klapinat0r
Suggested reading -- the link.

> _by not fsync 'ing each operation (though one can configure ES to do so)._

It may not be default, but we've seen, again and again, how people are
influenced by what they read about a database (e.g. MongoDB).

The lesson by now should be: _always know your DB_.
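As an illustration of the configurable fsync behaviour the article mentions:
in later Elasticsearch releases this is the per-index translog durability
setting (setting names here are from recent documentation; the 1.x releases
aphyr tested exposed different knobs):

```yaml
# fsync the translog for every request before acknowledging it,
# instead of in the background:
index.translog.durability: request

# The unsafe-but-fast alternative: fsync on an interval instead.
# index.translog.durability: async
# index.translog.sync_interval: 5s
```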

~~~
sitkack
Fsync should _always_ be on by default. Require the user to turn it off. I'd
even argue that `fsync` itself is broken and that semantics should be
inverted.

~~~
eternalban
What about pervasive virtualization? The issue here is not really fsync. A
fault-tolerant in-memory cluster should not lose data.

~~~
sitkack
Writes should be durable by whatever means the platform deems durable.
Durable should be the default. It should take work to get non-durable I/O.

------
gtrubetskoy
Do I understand correctly from skimming aphyr's article
[https://aphyr.com/posts/317-call-me-maybe-elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch)
that, TL;DR, Elasticsearch does not use a WAL with a real consensus algorithm
such as Paxos or Raft, and therefore isn't reliable?

------
manigandham
I'll just leave this here: one of my first answers on Quora about why
ElasticSearch should not be a primary data store:

[https://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-...](https://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-my-primary-datastore)

------
capkutay
I hope they can use that $70m they raised last year to throw some engineers at
their architectural issues and fix the data loss issue.

Or maybe they're spending that money on marketing and re-branding.

------
smegel
Funnily enough I have seen a slew of technical bulletins from Cloudera warning
of similar issues with HDFS.

Maybe not so funny if your multiply redundant cluster loses data because a
single node dies...

~~~
teraflop
Wow, that sounds bad and I don't remember hearing about it. Do you have any
pointers to bug reports or descriptions of the problem?

HDFS uses chain replication, so I would have expected that by the time the
client got acknowledgement of a write, it would already be acknowledged by all
replicas (3 by default). So even if there's a bug causing one of the nodes to
go down without fsyncing, there shouldn't be any actual data loss.
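The chain-replication property described above can be shown with a toy model
(hypothetical names, not HDFS code): the client's ack is produced only after
every replica in the chain has stored the write, so losing any single node
afterwards cannot lose the data.

```python
# Toy model of pipelined ("chain") replication: the write is forwarded to
# every replica first, and the client's ack comes only after all of them
# have stored it.

class Replica:
    def __init__(self, name):
        self.name = name
        self.blocks = []

def replicate(block, chain):
    """Store the block on every replica in the chain, then ack."""
    for replica in chain:
        replica.blocks.append(block)  # in HDFS this would also be persisted
    return "ack"                      # only now does the client see success

chain = [Replica("dn1"), Replica("dn2"), Replica("dn3")]
assert replicate(b"block-data", chain) == "ack"
# After the ack, no single replica failure can lose the block:
assert all(b"block-data" in r.blocks for r in chain)
```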

~~~
smegel
>>> OK, it's not simply that a node dies, but that disks on a node are
replaced (which might sort of be related to a node dying).

TSB 2015-51: Replacing DataNode Disks or Manually changing the Storage IDs of
Volumes in a Cluster may result in Data Loss

Updated: 4/22/2015

In CDH 4, DataNodes are identified in HDFS with a single unique identifier.
Beginning with CDH 5, every individual disk in a DataNode is assigned a unique
identifier as well.

A bug discovered in HDFS, HDFS-7960, can result in the NameNode improperly
accounting for DataNode storages for which Storage IDs have changed. A Storage
ID changes whenever a disk on a DataNode is replaced, or if the Storage ID is
manually manipulated. Either of these scenarios causes the NameNode to double-
count block replicas, incorrectly determine that a block is over-replicated,
and remove those replicas permanently from those DataNodes.

A related bug, HDFS-7575, results in a failure to create unique IDs for each
disk within the DataNodes during upgrade from CDH 4 to CDH 5. Instead, all
disks within a single DataNode are assigned the same ID. This bug by itself
negatively impacts proper function of the HDFS balancer. Cloudera Release
Notes originally stated that manually changing the Storage IDs of the
DataNodes was a valid workaround for HDFS-7575. However, doing so can result
in irrecoverable data loss due to HDFS-7960, and the release notes have been
corrected.

Users affected:

Any cluster where Storage IDs change can be affected by HDFS-7960. Storage IDs
change whenever a disk is replaced, or when Storage IDs are manually
manipulated. Only clusters upgraded from CDH 4 or earlier releases are
affected by HDFS-7575.

Symptoms: If data loss has occurred, the NameNode reports “missing blocks” on
the NameNode Web UI. You can determine which files the missing blocks belong
to by using FSCK. You can also search for NameNode log lines like the
following, which indicate that a Storage ID has changed and data loss may
have occurred:

2015-03-21 06:48:02,556 WARN BlockStateChange: BLOCK* addStoredBlock:
Redundant addStoredBlock request received for
blk_8271694345820118657_530878393 on 10.11.12.13:1004 size 6098

Impact:

The replacement of DataNode disks, or manual manipulation of DataNode Storage
IDs, can result in irrecoverable data loss. Additionally, due to HDFS-7575,
the HDFS Balancer will not function properly.

Applies To: HDFS. All CDH 5 releases prior to 3/31/15, including: 5.0, 5.0.1,
5.0.2, 5.0.3, 5.0.4, 5.0.5; 5.1, 5.1.2, 5.1.3, 5.1.4; 5.2, 5.2.1, 5.2.3,
5.2.4; 5.3, 5.3.1, 5.3.2

Immediate action required:

Do not manually manipulate Storage IDs on DataNode disks. Additionally, do not
replace failed DataNode disks when running any of the affected CDH versions.

Upgrade to CDH 5.4.0, 5.3.3, 5.2.5, 5.1.5, or 5.0.6.

See Also/Related Articles: HDFS-7575, HDFS-7960

~~~
justinsb
That is a bug in CDH/HDFS, but it was an error and is now fixed. That's not
to diminish the severity of the bug, but you can patch and get the correct
behaviour, without a performance hit.

That is not comparable to what seems to be the case here with ES & MongoDB,
where they deliberately (by design) accept the risk of data loss to boost
performance. Most systems allow you to make that trade-off, but an honest
system chooses the safe-but-slow configuration by default and has you
knowingly opt in to the risks of the faster configuration.

I hope you consider editing your initial post - if you conflate bugs with
deliberately unsafe design, we just end up with a race to the bottom of
increasingly unsafe but fast behaviour.

------
digitalzombie
Title should have ElasticSearch in there...

I was thinking of NodeJS.

But the comment is correct, ES is not a DB but an indexer and search engine.

edit:

Oh god, don't use it for storage. It indexes stuff.

You give it a document, and it stores it in root-word form so you can fuzzy
search. It'll also do other NLP stuff to your document, such as removing stop
words. Once a document is indexed, you can store index values that point to
your primary storage (Cassandra, PostgreSQL).

At least that's how I used it. If there is any better alternative I'd like to
know about it.
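The index-plus-primary-store pattern described above can be sketched with a
toy in-memory inverted index (a dict stands in for the search engine, another
for the primary store; all names here are hypothetical):

```python
# The primary store (think Cassandra/PostgreSQL) holds the canonical
# documents; the "search engine" holds only tokens and primary keys.

primary_store = {
    1: "The quick brown fox",
    2: "A lazy brown dog",
}

# Build the inverted index: token -> set of primary keys.
inverted_index = {}
for doc_id, text in primary_store.items():
    for token in text.lower().split():
        inverted_index.setdefault(token, set()).add(doc_id)

def search(token):
    """Resolve ids via the index, then fetch docs from the primary store."""
    ids = inverted_index.get(token.lower(), set())
    return [primary_store[i] for i in sorted(ids)]
```

A real deployment would replace the dicts with an ES index and a durable
database, but the flow is the same: search returns keys, the primary store
returns documents.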

edit:

I highly recommend:
[http://www.manning.com/ingersoll/](http://www.manning.com/ingersoll/)

Taming Text by Ingersoll - it also won a Dr. Dobb's award.

~~~
dang
That was our mistake. We put it back. Thanks.

------
elchief
Who cares?

As long as we can cash in our options before the lawsuits come in, we win!
Just like my last job in finance, actually.

Plus SQL's gross. You can't even webscale with it, and old people like it, so
it must suck.

