
For those looking for the tl;dr -- GitHub uses Elasticsearch and upgraded the platform to a new version just before launching. Differences with the new version caused corruption in subsystems and cascaded to an outage.

On the whole, GitHub is doing an awesome job. While this may be embarrassing for the team responsible for the feature, I found the writeup to be honest and thorough. I've made plenty of mistakes in my career, and for me, being honest about mistakes goes a lot further than the week or so of downtime.




For the record, the outages were less likely caused by differences between the versions than by the changing load patterns on the cluster and the old Java version.


Yeah, that, and they were testing in production like a boss.

> We did not sufficiently test the 0.20.2 release of elasticsearch on our infrastructure prior to rolling this upgrade out to our code search cluster, nor had we tested it on any other clusters beforehand.



