
Call me maybe: Elasticsearch 1.5.0 - tylertreat
https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
======
teraflop
> How often the translog is fsynced to disk. Defaults to 5s. [...] In this
> test we kill random nodes and restart them. [...] In Elasticsearch, write
> acknowledgement takes place before the transaction is flushed to disk, which
> means you can lose up to five seconds of writes by default. In this
> particular run, ES lost about 10% of acknowledged writes.

Something bothers me about this: if the bug was merely a failure to call
fsync() before acknowledging an operation, then killing processes shouldn't be
enough to cause data loss. Once you write to a file and the syscall returns,
the written data goes into the OS's buffers, and even if the process is killed
it won't be lost. The only time fsync matters is if the entire machine dies
(because of power loss or a kernel panic, for instance) before those buffers
can be flushed.
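
To make the distinction concrete, here is a minimal sketch (file name and contents made up): after Write returns, the data sits in the kernel's page cache and survives a kill -9 of the process; only the Sync call (fsync) pushes it to the device, which is what matters for power loss or a kernel panic.

    package main

    import (
        "log"
        "os"
    )

    func main() {
        f, err := os.OpenFile("translog.bin", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Write returns once the bytes are in the OS buffer cache; a killed
        // process does not lose them.
        if _, err := f.Write([]byte("acknowledged operation\n")); err != nil {
            log.Fatal(err)
        }

        // Only after Sync (fsync) are the bytes durable against a whole-machine crash.
        if err := f.Sync(); err != nil {
            log.Fatal(err)
        }
    }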

So is the data actually not even making it to the OS before being acked to the
client? Or is Jepsen doing something more sophisticated, like running each
node in a VM with its own block device instead of sharing the host's
filesystem?

~~~
tedunangst
I believe Jepsen is killing (VM) systems, not just processes. It is meant to
model the failure modes in a data center, which would include machines
bursting into flame, etc.

~~~
teraflop
Hmm. Upon looking at the documentation [1], it looks like the recommended
configuration is to run nodes in LXC containers. Maybe I'm mistaken, but I
thought "LXC" basically just meant some automation around kernel features like
chroots, user namespaces, etc. In which case, shouldn't the containers share
the host's VFS and pagecache?

[1] https://github.com/aphyr/jepsen/blob/master/jepsen/README.md

~~~
tedunangst
Ah, I didn't know it used LXC, as opposed to Xen or so.

------
w8rbt
The 'Recap' section has good advice to address the issue of data loss:

_My recommendations for Elasticsearch users are unchanged: store your data
in a database with better safety guarantees, and continuously upsert every
document from that database into Elasticsearch. If your search engine is
missing a few documents for a day, it’s not a big deal; they’ll be reinserted
on the next run and appear in subsequent searches. Not using Elasticsearch as
a system of record also insulates you from having to worry about ES downtime
during elections._
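
A minimal sketch of that pattern, assuming Postgres as the system of record and the lib/pq driver (table, index, and field names are invented): because PUT to /index/type/id is an idempotent upsert, any writes ES dropped simply reappear on the next run of the loop.

    package main

    import (
        "bytes"
        "database/sql"
        "fmt"
        "log"
        "net/http"

        _ "github.com/lib/pq"
    )

    func main() {
        db, err := sql.Open("postgres", "dbname=app sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        rows, err := db.Query(`SELECT id, body FROM documents`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        for rows.Next() {
            var id int64
            var body string
            if err := rows.Scan(&id, &body); err != nil {
                log.Fatal(err)
            }
            // Re-running this loop just overwrites each document, so missing
            // docs heal on the next pass.
            doc := fmt.Sprintf(`{"body": %q}`, body)
            url := fmt.Sprintf("http://localhost:9200/docs/doc/%d", id)
            req, err := http.NewRequest("PUT", url, bytes.NewBufferString(doc))
            if err != nil {
                log.Fatal(err)
            }
            resp, err := http.DefaultClient.Do(req)
            if err != nil {
                log.Print(err)
                continue
            }
            resp.Body.Close()
        }
    }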

~~~
machbio
I do something similar, though my data is not as important as what most of you
work on. I have a very high rate of data being input from users for curation
purposes in a bioinformatics lab, so I load the data onto a MongoDB cluster
and run elasticsearch-river-mongodb [1] to keep Elasticsearch in sync with
MongoDB. I do not know whether it solves this problem, but I am happy that I
have the data safe in MongoDB. Any input regarding my method would be
appreciated.

[1] https://github.com/richardwilly98/elasticsearch-river-mongodb.git

~~~
cwyers
"Data safe in Mongo" seems... less reassuring than I think you meant it to be:

https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads

------
Meekro
Elasticsearch has its flaws, but are there really any alternatives that do
what it does? In my case, I need to do fulltext search across millions of
documents -- nearly a TB of text in all, and still growing. Many database
engines can do basic fulltext, but what else can scale like Elasticsearch does
while offering powerful fulltext features like fuzzy matching and "more like
this"?

~~~
rch
Riak Search works well. The indexing is just Solr, but the cluster and the
data itself are managed differently.

~~~
sargun
Riak Search 2.0, or Yokozuna, is what you're thinking of.

DISCLAIMER: I work for Basho.

------
phamilton
TL;DR: All previous data-loss scenarios still exist; however, the window
during which data can be lost is much smaller.

~~~
joseraul
And all known defects are now documented:
http://www.elastic.co/guide/en/elasticsearch/resiliency/current/

------
larrywright
I bet the people who work for these database companies have nightmares about
getting bug reports from Kyle.

But seriously, his work is super impressive. Kudos to Stripe for funding this.

~~~
VeejayRampay
Isn't a bug report always a good thing? I mean, the real nightmare is your
users sending outraged emails about corrupted data while you have no idea
what's going on. With a bug report (especially one from Kyle) you've at least
got something to chew on. As people say, a reported bug is a bug half-fixed.

------
beachstartup
waaaay back in the mid 2000s when search engines were large, complex,
expensive pieces of enterprise software not unlike a commercial database, it
was always assumed (and the customer was told) that the search engine cuts corners
for the sake of speed, in both indexing and searching, and that it should
never be counted on as a database, or any kind of authoritative data store. it
does one thing: search staggering amounts of information quickly at acceptable
levels of performance and accuracy.

looks like expectations have moved beyond that, which is good, but it's not an
easy problem to solve... especially with billions or trillions of documents in
the index. leaving out all the important stuff that made a database slow
the index. leaving out all the the important stuff that made a database slow
is what made a search engine fast.

------
jodah
Key takeaway for me:

"crash-restart durable storage is a prerequisite for most consensus
algorithms, I have a hunch we’ll see stronger guarantees as Elasticsearch
moves towards consensus on writes."

It seems like Elasticsearch is inching towards a proper consensus algorithm,
at least optionally, which makes me wonder yet again: why not just implement
Raft? While I won't speculate, I will point out that the answer for other
systems appears to have been related to ego (i.e., distributed systems are
easy).

~~~
jpgvm
The funny thing is that there is a community plugin that fixes almost all of
this by using ZooKeeper for discovery and master election.

The not-so-funny thing is that Elastic employed the primary developer of said
plugin and basically shut it down.

~~~
kimchy
Actually, it doesn't fix the mentioned bug. I am only posting this here to
make sure people won't go and try to install it thinking it does...

There is a difference between cluster-level master election and replication
semantics in Elasticsearch (and other similar systems). Even if you use
ZooKeeper for cluster-level master election, you still need to handle
point-to-point replication (which doesn't go through ZK).

~~~
jpgvm
I should have been more specific. The ZK plugin fixes the issues related to
network split-brains and other nasty partition conditions.

The newly discovered translog issue is also a problem and is not solved by ZK
but is also a lot less scary.

------
temuze
I'm pretty pleased with Elasticsearch's progress on durability. The
snapshot/restore feature has been pretty nice to work with!

That said, having a "single source of truth" and regularly refreshing
Elasticsearch is a _huge_ pain in the butt. I currently maintain a ~500 line
syncing script that takes ~15 minutes to run.

Adding a new field in a doc_type means:

- Adding a column in Postgres

- Adding a field in the doc_type mapping (I think being explicit with the
field is better practice)

- Adding code to the syncer to update the field.

Ouch.

Also, the syncing script had a steep learning curve. At first, I was
upserting everything and the sync took forever. To solve this, I made the
syncer fetch all the documents in Postgres and all the documents in
Elasticsearch, then apply only the specific changes (thankfully, the dataset
is small enough that it easily fits in memory).
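
Roughly, that makes the syncer a "diff both sides, push only the changes" loop. A sketch of the shape of it, with all names invented and the Elasticsearch side elided for brevity:

    package main

    import (
        "crypto/sha1"
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq"
    )

    // digest of a row's indexable fields, used to decide whether ES needs an update
    func digest(body string) string {
        return fmt.Sprintf("%x", sha1.Sum([]byte(body)))
    }

    func main() {
        db, err := sql.Open("postgres", "dbname=app sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }

        // 1. Everything in Postgres (the source of truth), as id -> digest.
        source := map[int64]string{}
        rows, err := db.Query(`SELECT id, body FROM documents`)
        if err != nil {
            log.Fatal(err)
        }
        for rows.Next() {
            var id int64
            var body string
            if err := rows.Scan(&id, &body); err != nil {
                log.Fatal(err)
            }
            source[id] = digest(body)
        }
        rows.Close()

        // 2. Everything already in Elasticsearch (e.g. fetched via the scroll
        //    API), also reduced to id -> digest. Elided here.
        indexed := map[int64]string{}

        // 3. Push only the differences.
        for id, h := range source {
            if indexed[id] != h {
                fmt.Println("reindex", id) // PUT /docs/doc/{id} in the real syncer
            }
        }
        for id := range indexed {
            if _, ok := source[id]; !ok {
                fmt.Println("delete", id) // DELETE /docs/doc/{id}
            }
        }
    }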

I'd really love to scrap this portion of my infrastructure...

~~~
ComNik
Just today I successfully finished an experiment: An existing web-app with
Postgres as the source of truth pushing update events to a Kafka queue. From
there, a screenful of Go forwards those events into Elasticsearch.

It doesn't solve every problem, and it may be a lot of new moving parts if
you're not going to use Kafka for anything else; I will be using it for other
things like caching and push events. This kind of syncing problem seems to
crop up in a lot of places.

A nice thing is that I don't have to care about Elasticsearch durability much,
because I can simply rerun the ingester from the beginning of the log, as long
as Kafka doesn't lose data.
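
A rough sketch of what such a forwarder can look like, using the sarama Kafka client and assuming each message carries the document id in its key and the JSON body in its value (topic and index names, and those conventions, are mine, not a standard):

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"

        "github.com/Shopify/sarama"
    )

    func main() {
        consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, nil)
        if err != nil {
            log.Fatal(err)
        }
        // Starting from OffsetOldest is what makes reruns possible: if ES
        // loses writes, just replay the log from the beginning.
        pc, err := consumer.ConsumePartition("doc-updates", 0, sarama.OffsetOldest)
        if err != nil {
            log.Fatal(err)
        }

        for msg := range pc.Messages() {
            url := fmt.Sprintf("http://localhost:9200/docs/doc/%s", msg.Key)
            req, err := http.NewRequest("PUT", url, bytes.NewReader(msg.Value))
            if err != nil {
                log.Fatal(err)
            }
            resp, err := http.DefaultClient.Do(req)
            if err != nil {
                log.Print(err)
                continue
            }
            resp.Body.Close()
        }
    }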

~~~
lobster_johnson
I think putting a stateful, persistent, transactional store in the middle of
two stateful, persistent, transactional stores is a bad idea.

The idea is to sync secondary store B to continuously be a perfect replica of
A, so: A —> B. What a lot of people do, you included, is add a third store as
an intermediary: A —> Q —> B. Now you have three complex pieces of software
rather than two.

The thing is, you already have the state that Q covers: It's A.

For the record, we made the same mistake. We put RabbitMQ in the middle, and
then we had many problems:

* What do we do if the queue loses messages? (Manually reindex from N days ago.)

* How do you _know_ if the queue has lost messages, due to bugs, or network downtime or similar? (Well, you don't really know. If unsure, manual complete reindex.)

* What do you do when the PostgreSQL transaction has completed, but the application is for some reason unable to reach RabbitMQ to post a queue message? (Manually scan logs, figure out how far back to backfill, reindex from there.) Fixing this "correctly" would require some kind of two-phase commit, which queues don't support.

Lots of manual intervention required to run a system flawlessly. It's not just
about running a consistent system; it's about knowing when you're consistent
or not, and how to repair.

Also, there are logistical issues:

- What do you do when people run batch jobs producing millions of updates,
and users want to update documents (and see their changes) at the same time?
You have no recourse but to create traffic lanes — queues and more workers.

- How do you run multiple queue consumers? You have to use version constraints
(update only if newVersion > oldVersion), because you will end up processing
updates out of order, something which only works if the original source has a
version field (ES does support "external" version numbers; see the sketch
after this list). It turns out a queue makes these checks happen more often,
because you often get multiple adjacent updates for the same objects. Kafka
can de-dupe, fortunately, but RabbitMQ can't.

- How do you update _multiple_ target stores (let's say, both ES and
InfluxDB)? You have to create separate queues for each of them, so that you
get true fanout. Now you have added more "middlemen" on top of the first one,
more stuff that can go out of sync.
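
Here is a hedged sketch of that version-constrained write, using Elasticsearch's external versioning (version_type=external): the index request is rejected with 409 Conflict unless the supplied version is greater than the stored one, so out-of-order deliveries can't clobber newer data. All names are invented.

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
    )

    func upsert(id string, version int64, doc []byte) error {
        url := fmt.Sprintf(
            "http://localhost:9200/docs/doc/%s?version=%d&version_type=external",
            id, version)
        req, err := http.NewRequest("PUT", url, bytes.NewReader(doc))
        if err != nil {
            return err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        // 409 Conflict means a newer version is already indexed; the stale
        // update is simply dropped.
        if resp.StatusCode == http.StatusConflict {
            return nil
        }
        if resp.StatusCode >= 300 {
            return fmt.Errorf("unexpected status %d", resp.StatusCode)
        }
        return nil
    }

    func main() {
        if err := upsert("42", 7, []byte(`{"body": "hello"}`)); err != nil {
            log.Fatal(err)
        }
    }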

My conclusion after struggling with this for a while is that the database is
the truth, but it's also the only one that's coherently transactional. So one
should keep the change log close to the truth, meaning _in the database_. You
can use a PostgreSQL "unlogged" table for performance.

Secondly, you will want to use time-based polling that invokes a simple state
machine that can travel back in time. Basically, you run a worker that keeps a
"cursor" pointing into the database transaction log. Let it grab big batches
every N seconds. The cursor doesn't need to be transactional as long as it's
reasonably persistent; if you lose the cursor, your worst case is a full
reindex, but if you are unable to update the latest cursor, the worst case is
just a small amount of unnecessary reindexing. This worker can live side by
side with the queue-based processor, and if you shard it, you can run multiple
such workers concurrently.
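
A minimal sketch of that worker, assuming the change log is a plain append-only table in the database (seq, doc_id) and the cursor is just a file on disk; every name here is invented:

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "os"
        "strconv"
        "time"

        _ "github.com/lib/pq"
    )

    func loadCursor() int64 {
        b, err := os.ReadFile("cursor")
        if err != nil {
            return 0 // lost cursor: start from the beginning (full reindex)
        }
        n, _ := strconv.ParseInt(string(b), 10, 64)
        return n
    }

    func saveCursor(n int64) {
        os.WriteFile("cursor", []byte(strconv.FormatInt(n, 10)), 0644)
    }

    func main() {
        db, err := sql.Open("postgres", "dbname=app sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        cursor := loadCursor()

        for range time.Tick(5 * time.Second) {
            rows, err := db.Query(
                `SELECT seq, doc_id FROM changes WHERE seq > $1 ORDER BY seq LIMIT 1000`,
                cursor)
            if err != nil {
                log.Print(err)
                continue
            }
            for rows.Next() {
                var seq, docID int64
                rows.Scan(&seq, &docID)
                fmt.Println("reindex document", docID) // fetch row, PUT into ES
                cursor = seq
            }
            rows.Close()
            // Persisting the cursor lazily is safe: losing progress only means
            // re-indexing some documents, never losing any.
            saveCursor(cursor)
        }
    }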

Everything else comes out of this logic. For example: If ElasticSearch is
empty, it can detect this, and set the cursor to the beginning of time. It
knows how far back the cursor is, so it knows whether it's in "full reindex"
mode or "incremental mode", something it can export as a metric to a
dashboard. It can also backfill, by moving the cursor back a little bit. And
by using a state machine you can also put it in "incremental repair" mode,
where it can use a smart algorithm (Merkle trees were mentioned by someone
else recently) to detect holes in the ES index that need to be filled.

Things like building a new index now become trivial, because you just start a
worker instance that points to a new index, but reads from the same truth data
store; being smart about state, it will start pulling the entire source
dataset into the new index. The old worker can continue indexing the old
index. Once the new instance is done, you can swap the new index for the old
one, then delete the old worker.

Finally, to solve the problem of real-time vs. batch updates: in addition to
the above worker, you run a separate worker that listens to a queue. Whenever
a non-batch update happens (a "batch" flag needs to be indicated in all APIs
and internal processes), push the ID of the affected object onto the queue,
but not the object itself; rather, let the worker pull the original from the
store. This way, your queue (which requires RAM/disk) stays super lean and
fast. Give the worker a small time-based buffer (like 1s) so that it can
coalesce multiple updates if they're happening rapidly, and use an efficient
query to get multiple objects at the same time. And use versioning to avoid
clobbering newer data.
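
A sketch of that real-time path, with a plain channel standing in for the queue (everything here is a stand-in): the worker coalesces whatever ids arrive within a one-second window, so multiple rapid updates to the same object become a single reindex.

    package main

    import (
        "fmt"
        "time"
    )

    func worker(ids <-chan int64) {
        for {
            // Block for the first id, then drain everything that arrives in 1s.
            first := <-ids
            pending := map[int64]bool{first: true}
            timeout := time.After(1 * time.Second)
        drain:
            for {
                select {
                case id := <-ids:
                    pending[id] = true // duplicate ids coalesce into one reindex
                case <-timeout:
                    break drain
                }
            }
            // One batched fetch from the source of truth (e.g.
            // SELECT ... WHERE id = ANY($1)), then versioned upserts into ES;
            // both elided here.
            fmt.Println("reindexing", len(pending), "documents")
        }
    }

    func main() {
        ids := make(chan int64, 100)
        go worker(ids)
        ids <- 1
        ids <- 2
        ids <- 1 // coalesced with the first update of document 1
        time.Sleep(2 * time.Second)
    }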

Of course, the system I've outlined is probably not workable for Google or
Facebook, but it will scale well and will keep things in sync better than
something queue-based.

~~~
hobofan
You make some good points, but there is a big architectural difference between
RabbitMQ and Kafka. The solutions you point out work just as well with Kafka
as they do with an unlogged PostgreSQL table, but of course there are
tradeoffs for each of them. The unlogged table has transaction guarantees, but
I am not sure if I want to hit my production database with huge read loads on
every reindex of a secondary data store.

I haven't really looked into it, but Bottled Water [0] looks like it can
combine the good log properties of Kafka with guaranteed delivery from
Postgres.

[0] http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/

~~~
lobster_johnson
Kafka is certainly better than RabbitMQ in some respects. (In others, it's
disappointing: It's practically useless if you're not running on the JVM, as
clients for languages such as Go, Ruby and Node aren't up to date with the
"smart" Java client. It's also clearly more low-level and designed for large
installations, and less friendly to small ones.)

The problem with storing indexing state outside the database — using a queue,
for example — is transactionally protecting the gap between the database and
the queue. Bottled Water is cool in that it can actually bridge that gap
safely, as I understand it, since PostgreSQL will keep the decoded stream
until you've been able to propagate it to Kafka. On the other hand, if you
have the stream, do you need Kafka? Can't you just push it directly to
ElasticSearch?

For us, this issue — this and standardizing on an elegant cross-language RPC —
is probably the main architectural challenge we're facing right now in our
microservice development. We have tons of microservices with private data
stores that need good search and also internal synchronization between
microservices, which is coincidentally the exact same problem space: You have
service A with its complex data model, and then a service B that wants to
subscribe to updates so that it can correlate its data with that of A. It's a
complicated problem that requires a simple solution.

 _I am not sure if I want to hit my production database with huge read loads
on every reindex of a secondary data store._

Hopefully a full reindex shouldn't happen that often, though. And a full
reindex would require a full scan of your production database (not the
transaction log) anyway, since you don't want to keep the entire change log
around forever (and can't, since the log only starts at the point when you
started running this system).

~~~
hobofan
> On the other hand, if you have the stream, do you need Kafka? Can't you just
> push it directly to ElasticSearch?

I think the separation is something very nice here. We have something like an
Apache Storm topology (though we use our own Mesos-based framework here) for
every datastore we want to populate. If we want to add a new datastore, we
just have to find a library for it and can whip up a new topology. That is
much more convenient than having to build support for each datastore into
something central like Bottled Water, and it can be tweaked nicely to the
particularities of the datastore.

> since you don't want to keep the entire change log around forever (and
> can't, since the log only starts at the point when you started running this
> system).

If we initialize a new Kafka topic, we push the relevant data into it once
from the production database, and after that Kafka's log compaction will keep
it from growing too large.

~~~
lobster_johnson
The separation is nice, although I would counter that if your only primary
data store is Postgres, and you want to go the logical decoding route,
Postgres _already_ has the queue: The decoded transaction log. There's no need
for a queue on top of a queue. All you need now is a client that can process
the log sequentially and emit each change to the appropriate data store.

Things like Kafka would be more appropriate if you have multiple producers
that aren't all Postgres.

------
Jarlakxen
One more thing to add about Elasticsearch reliability:

> We recommend installing the Java 8 update 20 or later, or Java 7 update 55
> or later. Previous versions of Java 7 are known to have bugs that can cause
> index corruption and data loss. Elasticsearch will refuse to start if a
> known-bad version of Java is used.

http://www.elastic.co/guide/en/elasticsearch/reference/1.5/setup.html

------
nyir
Kind of OT, but losing data in these cases is not even what I'm most concerned
about with ES: I've had to recreate the ES cluster so many times now that I'm
really glad PostgreSQL keeps on running, or at least restarts without bricking
its data files, even after OOM or out-of-disk errors.

------
dang
Also https://news.ycombinator.com/item?id=9475620.

