Postmortem: Outage due to Elasticsearch’s flexibility and our carelessness (grofers.com)
78 points by vaidik 729 days ago | 27 comments



This isn't the first time I've read about service (near-)outages and postmortems that involve Elasticsearch. It's marketed as a NoSQL solution, but some devs don't read the details in the documentation, like "Elasticsearch is not schema-less".

Having known Lucene and its index structure for years, I wouldn't advise using a Lucene database as primary storage. Always keep the original data, at least in a log queue or in a separate database, so that you can rebuild the Lucene database.
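
To make the "rebuild" part concrete, here is a minimal Python sketch, assuming the primary store can hand back rows as plain dicts with an `id` key. The function name, the index name and the type name are made up for illustration; only the action-line-then-source-line layout follows the Elasticsearch bulk format.

```python
import json


def bulk_body(index, doc_type, rows):
    """Build a newline-delimited bulk request body from primary-store rows.

    `rows` is assumed to be an iterable of dicts that each carry an "id" key.
    """
    lines = []
    for row in rows:
        # Action line: tells Elasticsearch where the next line should be indexed.
        action = {"index": {"_index": index, "_type": doc_type, "_id": row["id"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself, taken straight from the primary store.
        lines.append(json.dumps(row))
    # Every line in a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)
```

Replaying every row from the primary store through something like this into a fresh index is one way to recover after a mapping goes wrong.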


Agreed. We don't use it as a primary database either. In fact, all of our data persists in PostgreSQL. However, a smart thing for us to do would have been to use PostgreSQL as a fallback (PostGIS for spatial queries, something that we do heavily with Elasticsearch).

Obviously, incidents like this have made us cautious, and we will be working on fixing these architectural issues quickly.


Very interesting article, with a few lessons learned, and very well written. I can almost feel the tension in the team while you were debugging and searching for a solution under stress :)

Just wanted to tip you off to what I assume is a typo:

  But lack of automated regression testing would have caught this issue.

I understand what you mean, but if you mean that literally, then that is indeed a new and interesting paradigm in software testing :)


I might be missing something, but to my understanding it is correct. I didn't quite get what exactly you mean by that. Educate me, please.

To explain myself, I meant that tests that were succeeding previously would have failed due to this issue if we had tests written for it. Does my statement not convey this?


Original statement:

> But lack of automated regression testing would have caught this issue.

Your question:

> To explain myself, I meant that tests that were succeeding previously would have failed due to this issue if we had tests written for it. Does my statement not convey this?

No. The sentence says that the lack of (i.e. not having) tests would have caught the issue. If you remove 'But lack of', the sentence makes sense ;)


Impatience is at times the root of so many things going wrong. This is the thing I read again and again and again and still failed to see.

Thanks :)


Excellent write-up! Would configuring ES to alert on new fields, instead of silently assigning them a default type, have helped?

"dynamic": "strict" https://www.elastic.co/guide/en/elasticsearch/guide/current/...
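
For anyone who hasn't seen the setting, here is a rough Python sketch of where it sits in a mapping body and what it buys you. The `products` type and its fields are hypothetical, and `check_document` is only a local imitation of the behaviour, not how Elasticsearch implements it.

```python
# Hypothetical mapping body; the type name and field names are illustrative only.
STRICT_MAPPING = {
    "mappings": {
        "products": {
            "dynamic": "strict",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "float"},
            },
        }
    }
}


def unknown_fields(mapping_type, document):
    """Return the document's top-level fields that the mapping does not declare."""
    declared = mapping_type["properties"]
    return sorted(field for field in document if field not in declared)


def check_document(mapping_type, document):
    """Imitate "dynamic": "strict" locally by refusing undeclared fields."""
    extra = unknown_fields(mapping_type, document)
    if mapping_type.get("dynamic") == "strict" and extra:
        raise ValueError("undeclared fields: " + ", ".join(extra))
    return True
```

With a mapping like this, a document carrying an unexpected field is refused up front instead of quietly getting a guessed type.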


Yes, it would have. But as I said in one of the other comments, they are doing something about this in the 2.x releases, and this reference is from the 2.x documentation.

Wish we had it back then. We are currently on 1.4.


Got it. Thanks again for the write up and the time taken to respond to all comments!!


This is something we encountered at Snowplow with our real-time loading of events into Elasticsearch [1]. It's not an issue for us because we schema all our events, but it was interesting to observe. Here's a summary in one slide:

http://www.slideshare.net/alexanderdean/snowplow-analytics-f...

[1] https://github.com/snowplow/snowplow/tree/master/4-storage/k...


Really awesome to read such a nice writeup of an error. It got me tangentially thinking about why we do not read stuff like:

When the patient was brought in, he was barely breathing. Due to the spots on his face, we assumed it was X. Then we made an incision in his pelvis to fix Y. Suddenly, he died. Gosh, we really should have kept an eye on the meter in the corner!

I understand things like litigation, but it makes me wonder to what extent we are being limited in medical achievements. From my outsider point of view, it seems that post mortems (ha!) are not used as much in the medical field other than for insurance / legal reasons.



Not completely sure about this, but maybe we are just not in the circles of other fields where things like this are shared. I can only speak for myself, obviously, but it would be hard to believe that people in the medical sciences don't practice this.


You've got what appears to be a vestigial comment in your article. It reads:

    (add more details here)

But it looks like you've got plenty of details. Just thought I'd let you know ;-)


Precisely why you need review processes ;)


> The values for field price in products mapping and the values for field price in promotions mapping (in the same index) will essentially be added to the same list at Lucene Segment level. And it will not fail or throw an exception.

Probably a stupid question, but why can't Elasticsearch translate both mappings into something like "products_price" and "promotions_price" before adding to the 'list'?


And BTW, that is something Elasticsearch recommends you do manually in the current versions as well. Read this: https://www.elastic.co/guide/en/elasticsearch/guide/current/...
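
Doing it manually can be as small as renaming fields on the client before indexing. A minimal Python sketch, assuming flat documents; `namespace_fields` is a hypothetical helper, not part of any Elasticsearch client:

```python
def namespace_fields(type_name, document):
    """Prefix every top-level field with its mapping type name.

    Nested objects are left as-is; only the top-level keys are renamed.
    """
    return {f"{type_name}_{field}": value for field, value in document.items()}
```

So `namespace_fields("products", {"price": 10.5})` gives `{"products_price": 10.5}`, and a `price` from a promotions document can no longer land in the same field.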


Something I wondered too. But I think they are already planning something around it in the 2.x versions. Honestly, I have not looked at it much yet, but creating another mapping with the same field name and a different type should perhaps throw a warning, or not be allowed unless overridden, since it leads to things like this.


In 2.x, all fields must adhere to the same data type or you will get a failure [1]. To quote:

  Fields with the same name, in the same index, in different types,
  must have the same mapping
So hopefully this isn't an issue anymore! I do strongly suggest (and we've done this from the beginning) that no one rely on Elasticsearch's auto mapping and that everyone explicitly map their fields.

[1] -- https://www.elastic.co/guide/en/elasticsearch/reference/curr...
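
If you're stuck on an older version, a check along these lines can run in review or CI against your mapping definitions. This is a sketch in Python, assuming the mappings are available as a dict keyed by type name; `find_type_conflicts` is a made-up helper, and it only compares each field's "type" value, which is looser than the full "same mapping" rule quoted above.

```python
from collections import defaultdict


def find_type_conflicts(mappings):
    """Return {field: {type_name: datatype}} for fields whose datatypes disagree.

    `mappings` is assumed to look like {"type_name": {"properties": {...}}}.
    """
    seen = defaultdict(dict)
    for type_name, body in mappings.items():
        for field, spec in body.get("properties", {}).items():
            seen[field][type_name] = spec.get("type")
    # Keep only the fields that are declared with more than one distinct datatype.
    return {
        field: by_type
        for field, by_type in seen.items()
        if len(set(by_type.values())) > 1
    }
```

An empty result means no field name is shared across types with differing datatypes; anything else is worth fixing before it reaches the cluster.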


Yes, you are right. In fact, a simple workaround, before Elasticsearch came out with this feature in the 2.x releases, was to review mappings and what goes into Elasticsearch. We got careless about that and shot ourselves in the foot. We learned it the hard way.


It is for this reason that Elasticsearch 1.x advocates using 'copy_to' on mapping attributes that may collide. Elasticsearch 2.x explicitly throws an exception whenever it detects a mapping collision.


I ran into the same issue on one of our apps too. Nice to see it written down so thoroughly!


Nice writeup. It strikes a good balance between tech details and narrative.


Thanks :)


I'm not sure I understand the title correctly: are you referring to carelessness in Elasticsearch or on your own part?


Their own


Correctly pointed out. Perhaps calls for an edit.



