

Recent Code Search Outages - nixgeek
https://github.com/blog/1397-recent-code-search-outages

======
yajoe
For those looking for the tl;dr -- GitHub uses Elasticsearch and upgraded the
platform to a new version just before launching. Differences in the new
version caused corruption in subsystems, which cascaded into an outage.

On the whole, GitHub is doing an awesome job. While this may be embarrassing
for the team responsible for the feature, I found the writeup to be honest and
thorough. I've made plenty of mistakes in my career, and for me, being honest
about mistakes counts for a lot more than the week or so of downtime.

~~~
imbriaco
For the record, the outages were probably caused less by differences between
the versions than by the changing load patterns on the cluster and the old
Java version.

~~~
taproot
Yeah, that, and they were testing in production like a boss.

> We did not sufficiently test the 0.20.2 release of elasticsearch on our
> infrastructure prior to rolling this upgrade out to our code search cluster,
> nor had we tested it on any other clusters beforehand.

------
purephase
Excellent post-mortem. As an Elasticsearch user (without the scale), I found
some helpful advice in there. I've already encountered the heap size issue they
mentioned, but the others are new.

Thanks GH.

------
mdellabitta
"During the initial recovery, some of our nodes ran out of disk space. It's
unclear why this happened since our cluster was only operating at 67%
utilization before the initial event, but it's believed this is related to the
high load and old Java version. The elasticsearch team continues to
investigate to understand the exact circumstances."

This doesn't pass the smell test for me. Not that I know tons about
Elasticsearch, but couldn't the disk space have been consumed by failed
replication attempts?

------
c3
Elasticsearch is great and magical, but there are a bunch of settings you
MUST change from their defaults for it to be useful. I'm surprised GitHub
wasn't doing this, actually (like setting the min and max heap memory to the
same size).

Generally it takes a catastrophic failure under load for you to discover that
'everyone' (everyone else) uses these!
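
To make the min/max memory point concrete, here is a rough sketch of a check
that a JVM command line pins the heap to a single size. It assumes the standard
JVM flags -Xms (initial heap) and -Xmx (maximum heap); the helper names
heap_bytes and heap_is_pinned are made up for illustration and are not part of
Elasticsearch or of anything GitHub described.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xms / -Xmx.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_bytes(jvm_args, flag):
    """Return the size in bytes set by the last occurrence of `flag`
    (for example "-Xms" or "-Xmx") in jvm_args, or None if it is absent."""
    value = None
    pattern = re.escape(flag) + r"(\d+)([kKmMgG]?)"
    for arg in jvm_args:
        match = re.fullmatch(pattern, arg)
        if match:
            value = int(match.group(1)) * _UNITS[match.group(2).lower()]
    return value


def heap_is_pinned(jvm_args):
    """True only when both -Xms and -Xmx are present and name the same size."""
    minimum = heap_bytes(jvm_args, "-Xms")
    maximum = heap_bytes(jvm_args, "-Xmx")
    return minimum is not None and minimum == maximum
```

For example, heap_is_pinned(["-Xms4g", "-Xmx4g"]) is True, while
heap_is_pinned(["-Xms256m", "-Xmx1g"]) is False. The usual motivation for
pinning is that the JVM then never has to grow or shrink the heap under load.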

------
namidark
Does anyone know which JVM they're using? I've heard many people recommend
Oracle's JVM over OpenJDK for performance reasons.

~~~
nixgeek
Oracle.

