
Postmortem: Outage due to Elasticsearch’s flexibility and our carelessness - vaidik
http://lambda.grofers.com/2015/12/14/postmortem-of-our-app-downtime-elasticsearchs-flexibility-bites/
======
castell
It's not the first time I've read about service (near-)outages and postmortems
that involve Elasticsearch. It's marketed as a NoSQL solution, but some devs
don't read the details in the documentation, like "Elasticsearch is _not_
schema-less".

Having known Lucene and its index structure for years, I wouldn't advise using
a Lucene database as primary storage. Always keep the original data, at least
in a log queue or in a separate database, so that you can rebuild the Lucene
database.
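
A minimal sketch of that rebuild path, assuming the original rows can be read
back from the primary store as plain dicts with an `id` key. The names
`bulk_actions`, `rebuild` and `send` are hypothetical; the
`_index`/`_type`/`_id`/`_source` action shape is an assumption modelled on
common bulk-indexing helpers, and `send` stands in for whatever client call
actually ships the action.

```python
from typing import Callable, Iterable, Iterator


def bulk_actions(rows: Iterable[dict], index: str, doc_type: str) -> Iterator[dict]:
    """Turn rows from the source of truth into index actions (assumed shape)."""
    for row in rows:
        yield {"_index": index, "_type": doc_type, "_id": row["id"], "_source": row}


def rebuild(rows: Iterable[dict], index: str, doc_type: str,
            send: Callable[[dict], None]) -> int:
    """Replay every row into a fresh index and return how many were sent."""
    count = 0
    for action in bulk_actions(rows, index, doc_type):
        send(action)  # hypothetical hook: a real client call would go here
        count += 1
    return count
```

Because the search index is only ever written from the replayable source, a
broken mapping can be fixed by creating a new index and running the replay
again.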

~~~
vaidik
Agreed. We don't use it as a primary database either. In fact, all of our
data persists in PostgreSQL. However, a smart thing for us to do would have
been to use PostgreSQL as a fallback (PostGIS for spatial queries, something
that we do heavily with Elasticsearch).

Obviously, incidents like this have made us cautious, and we will be working
on fixing these architectural issues quickly.

------
orkj
Very interesting article, and a few lessons learned. And very well written. I
can almost feel the tension in the team while you were debugging and
searching for a solution under stress :)

Just wanted to tip you off to what I assume is a typo:

    
    
      But lack of automated regression testing would have caught this issue.
    

I understand what you mean, but if you mean that literally then that is indeed
a new and interesting paradigm in software testing :)

~~~
vaidik
I might be missing something, but to my understanding it is correct. I
didn't quite get what exactly you mean by that. Educate me, please.

To explain myself, I meant that tests that were succeeding previously would
have failed due to this issue if we had tests written for it. Does my
statement not convey this?

~~~
anarazel
Original statement:

> But lack of automated regression testing would have caught this issue.

Your question:

> To explain myself, I meant that tests that were succeeding previously would
> have failed due to this issue if we had tests written for it. Does my
> statement not convey this?

No. The sentence says that the _lack_ (i.e. not having) tests would have
caught the issue. If you remove 'But lack of', the sentence makes sense ;)

~~~
vaidik
Impatience at times is the root of so many things going wrong. I read that
sentence again and again and again and still failed to see it.

Thanks :)

------
hackeram
Excellent write-up! Would configuring ES to alert instead of silently
defaulting the type on new fields have helped?

"dynamic": "strict"
[https://www.elastic.co/guide/en/elasticsearch/guide/current/...](https://www.elastic.co/guide/en/elasticsearch/guide/current/dynamic-mapping.html)
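
For illustration, a mapping body with that setting might look like the sketch
below. Only `"dynamic": "strict"` comes from the linked guide; the `products`
type and its fields are hypothetical, and `unknown_fields` is a local stand-in
that mimics the check rather than anything Elasticsearch exposes.

```python
import json

# Hypothetical type and field names; the "dynamic": "strict" setting is the point.
products_mapping = {
    "products": {
        "dynamic": "strict",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "float"},
        },
    }
}


def unknown_fields(doc: dict, mapping: dict, doc_type: str) -> list:
    """Local stand-in for the strict check: fields the mapping does not declare."""
    declared = mapping[doc_type]["properties"]
    return sorted(set(doc) - set(declared))


if __name__ == "__main__":
    # The JSON body that would be sent when creating the mapping.
    print(json.dumps(products_mapping, indent=2))
```

With a strict mapping, a document carrying an undeclared field is rejected
instead of having a type guessed for it, so a check like `unknown_fields` can
be run in tests to catch the same problem before it reaches the cluster.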

~~~
vaidik
Yes, it would have. But as I said in one of the other comments, they are doing
something about this in the 2.x releases. And this reference is from the 2.x
release's documentation.

Wish we had it back then. We are currently on 1.4.

~~~
hackeram
Got it. Thanks again for the write up and the time taken to respond to all
comments!!

------
alexatkeplar
This is something we encountered at Snowplow with our real-time loading of
events into Elasticsearch [1]. It's not an issue for us because we schema all
our events, but it was interesting to observe. Here's a summary in one slide:

[http://www.slideshare.net/alexanderdean/snowplow-analytics-f...](http://www.slideshare.net/alexanderdean/snowplow-analytics-from-nosql-to-sql-and-back-again/27?src=clipshare)

[1]
[https://github.com/snowplow/snowplow/tree/master/4-storage/k...](https://github.com/snowplow/snowplow/tree/master/4-storage/kinesis-elasticsearch-sink)

------
joepvd
Really awesome to read such a nice writeup of an error. It got me tangentially
thinking about why we do not read stuff like:

When the patient was brought in, he was barely breathing. Due to the spots on
his face, we assumed it was X. Then we made an incision in his pelvis to fix
Y. Suddenly, he died. Gosh, we _really_ should have kept an eye on the meter
in the corner!

I understand things like litigation, but it makes me wonder to what extent we
are being limited in medical achievements. From my outsider point of view, it
seems that post mortems (ha!) are not used as much in the medical field other
than for insurance / legal reasons.

~~~
dap
Doctors absolutely do this:

[https://en.m.wikipedia.org/wiki/Morbidity_and_mortality_conf...](https://en.m.wikipedia.org/wiki/Morbidity_and_mortality_conference)

------
notdonspaulding
You've got what appears to be a vestigial comment in your article. It reads:

    
    
        (add more details here)
    

But it looks like you've got plenty of details. Just thought I'd let you know
;-)

~~~
vaidik
Precisely why you need review processes ;)

------
shubhamjain
> The values for field price in products mapping and the values for field
> price in promotions mapping (in the same index) will essentially be added to
> the same list at Lucene Segment level. And it will not fail or throw an
> exception.

Probably a stupid question but why can't ElasticSearch translate both mappings
into something like, "products_price", and "promotions_price" before adding to
the 'list'?
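
A toy model may make the quoted behaviour and the prefixing idea concrete.
This is not how Elasticsearch or Lucene is implemented; `ToyIndex` and its
`namespaced` flag are hypothetical names used only to contrast one shared
value list per field name with one list per type-prefixed name.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model: values are stored in one list per field key."""

    def __init__(self, namespaced: bool = False):
        self.namespaced = namespaced
        self.fields = defaultdict(list)

    def add(self, doc_type: str, doc: dict) -> None:
        for name, value in doc.items():
            # Without namespacing, every type writes to the same key.
            key = f"{doc_type}_{name}" if self.namespaced else name
            self.fields[key].append(value)


shared = ToyIndex()
shared.add("products", {"price": 10.5})
shared.add("promotions", {"price": "10% off"})
# shared.fields["price"] now mixes a float and a string, with no error raised

prefixed = ToyIndex(namespaced=True)
prefixed.add("products", {"price": 10.5})
prefixed.add("promotions", {"price": "10% off"})
# prefixed.fields has separate "products_price" and "promotions_price" lists
```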

~~~
vaidik
Something I wondered too. But I think in the 2.x versions, they are already
planning something around it. Honestly, I have not looked at it much yet, but
even creating another mapping with the same field name but a different type
should perhaps throw a warning, or not be allowed unless overridden, as such
mappings lead to things like this.

~~~
craigching
In 2.x, all fields must adhere to the same data type or you will get a failure
[1]. To quote:

    
    
      Fields with the same name, in the same index, in different types,
      must have the same mapping
    

So hopefully this isn't an issue anymore! I do strongly suggest (and we've
done this from the beginning) that no one rely on elasticsearch's auto mapping
and that you should explicitly map your fields.

[1] --
[https://www.elastic.co/guide/en/elasticsearch/reference/curr...](https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_20_mapping_changes.html)
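
On versions without that check, a mapping review can be approximated locally.
The sketch below is an assumption-laden simplification: `find_conflicts` is a
hypothetical helper, it compares only the `type` of each top-level field, and
the example mappings are made up.

```python
def find_conflicts(mappings: dict) -> dict:
    """Return {field: {doc_type: field_type}} for fields whose type differs across doc types."""
    seen = {}
    for doc_type, mapping in mappings.items():
        for field, spec in mapping.get("properties", {}).items():
            seen.setdefault(field, {})[doc_type] = spec.get("type")
    return {
        field: by_type
        for field, by_type in seen.items()
        if len(set(by_type.values())) > 1
    }


# Hypothetical mappings for two types in the same index.
index_mappings = {
    "products": {"properties": {"name": {"type": "string"}, "price": {"type": "float"}}},
    "promotions": {"properties": {"price": {"type": "string"}}},
}

conflicts = find_conflicts(index_mappings)
# conflicts == {"price": {"products": "float", "promotions": "string"}}
```

Run against explicit mapping files in CI, a check like this fails the build
when two types disagree about a field, before anything is indexed.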

~~~
vaidik
Yes, you are right. In fact, a simple workaround, before Elasticsearch came
out with this feature in the 2.x releases, was to review mappings and what
goes into Elasticsearch. We got careless about that and shot ourselves in
the foot. We learned it the hard way.

------
gnurag
It is for this reason that Elasticsearch 1.x advocates using 'copy_to' on
mapping attributes that may collide. Elasticsearch 2.x explicitly throws an
exception whenever it detects a mapping collision.

------
devilsenigma
I ran into the same issue on one of our apps too. Nice to see it written down
so thoroughly!

------
mh-
nice writeup.. it strikes a good balance between tech details and narrative.

~~~
vaidik
Thanks :)

------
cekstam
I'm not sure I understand the title correctly: are you referring to
carelessness in Elasticsearch or on your own part?

~~~
windowsworkstoo
Their own

~~~
vaidik
Correctly pointed out. Perhaps calls for an edit.

