
Discord Postmortem from Friday - b1naryth1ef
https://status.discordapp.com/incidents/qk9cdgnqnhcn
======
jhgg
It's worth noting that the instance migration basically null-routed the redis
VM for a good 30 minutes, until we manually intervened and restarted it. The
instance was completely disconnected from the internal network immediately
following the migration. From what we could gather from instance logs, the
routing table on the VM was completely dropped and it could not even connect
to the magic metadata service (metadata.internal - we saw "no route to host"
errors for that). This is a pretty serious bug within GCP and we've already
opened a case with them in the hope they can get a fix out. I think this is
the 4th or 5th major bug we've encountered with their live migration system
that could have led, or has led, to an outage or internal service
degradation. The GCP team has
seriously investigated and fixed every bug we've reported to them so far, so
props to them for that! Live migration is incredibly difficult to get right.
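The symptom described above (total silence to the internal network, rather than connections being reset) is easy to probe for. A minimal sketch of a reachability check, with the `metadata.internal` hostname taken from the comment and everything else hypothetical:

```python
import socket

def can_reach(host, port, timeout=1.0):
    """Best-effort TCP reachability check. Returns False both for an
    explicit refusal and for the silent packet loss described above
    (the connect simply times out)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical post-migration health check: a healthy VM should reach
# the internal metadata service immediately.
# if not can_reach("metadata.internal", 80):
#     page_oncall("instance lost its network stack after live migration")
```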

We believe this triggered a bug in the redis-py python driver we use
(specifically this one:
[https://github.com/andymccurdy/redis-py/pull/886](https://github.com/andymccurdy/redis-py/pull/886))
that forced us to do a rolling restart of our API cluster in the first place,
to get the connection pools back into a working state. redis-sentinel had appropriately
detected the instance going away, and initiated a fail-over almost immediately
following the instance going offline, but due to the odd network situation
that was caused by the migration (absolute packet loss instead of connections
being reset) - the client driver was unable to properly fail-over to the new
master. We already have work planned for our own connection pooling logic for
redis-py - right now the state of the driver in HA redis setups is pretty
awful, and the maintainer doesn't appear to have the time to look at or close
PRs that address these issues (we opened one in March that fixes a pretty
serious bug during fail-over
[https://github.com/andymccurdy/redis-py/pull/847](https://github.com/andymccurdy/redis-py/pull/847),
and it has yet to be addressed).
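The failure mode described (absolute packet loss, so the client never sees a reset) is exactly what aggressive socket timeouts plus pool invalidation guard against. This is not redis-py's actual internals - just a toy sketch of the pattern, with all names hypothetical:

```python
import socket

class FailoverPool:
    """Toy connection pool: on a timeout or reset, discard every cached
    connection and re-resolve the master before retrying."""

    def __init__(self, resolve_master, connect, max_retries=2):
        self.resolve_master = resolve_master  # e.g. ask sentinel who the master is
        self.connect = connect                # open a connection to the master
        self.max_retries = max_retries
        self._conns = []

    def _get_conn(self):
        # Reuse a pooled connection if one exists, else dial the current master.
        if self._conns:
            return self._conns.pop()
        return self.connect(self.resolve_master())

    def execute(self, command):
        last_err = None
        for _ in range(self.max_retries + 1):
            conn = self._get_conn()
            try:
                result = command(conn)
                self._conns.append(conn)  # healthy connection goes back to the pool
                return result
            except (socket.timeout, ConnectionError) as exc:
                # A hung or dead master: drop *all* pooled connections so the
                # next attempt re-resolves and dials the newly promoted master.
                last_err = exc
                self._conns.clear()
        raise last_err
```

With a socket timeout enforced on every call, a silently black-holed master turns into a bounded-wait timeout, the stale pool is flushed, and the retry lands on whichever instance sentinel has since promoted.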

~~~
fulafel
For those of us unfamiliar with GCP, do you mean that the default-route of
your VM was unable to route its traffic? Or is there a routing config running
on customer VMs that GCP live-manages?

~~~
b1naryth1ef
GCP has a virtual networking stack to support a bunch of crazy (and awesome)
features Google has built. Unfortunately the complexity here seems to hurt
power-users like us. In this case it appears that for some unknown reason the
node failed to program its network stack when coming up, meaning it was
completely unavailable (even the metadata service used internally by Google
failed).

------
cordite
The level of detail and linearity is impressive.

At this scale, it seems like it may be warranted to start doing reliability
testing in production, in line with Netflix.

At the end I see mention of a library with flaws. I am curious as to which
library that is, given I develop some projects in Elixir.

~~~
b1naryth1ef
Thanks, we try our best with these. Past experience has shown they can be very
valuable, and help everyone at the company get context on the system and how
we handle failures.

Reliability testing is definitely something we're interested in as we spin up
more SRE/reliability-focused individuals, but it probably also has the lowest
cost-benefit for us right now (compared to engineering effort on improving the
things we know need work). Some of the failures we experienced are related to
issues we know about but haven't prioritized (read: had time for) yet.

For the library, we believe the bug is related to hackney and the fact it uses
the high-priority setting for its pool process. For some reason (this is the
part we're not entirely sure of, and are still spending some time
investigating) this high-priority process got stuck and consumed all of the scheduler time
(presumably related to the earlier API degradation), breaking the distribution
port and the application in a weird way. Oddly enough the systems we run on
are SMP, so in theory one rogue process should not be able to have this
effect.

~~~
cordite
That is indeed very odd! Thank you for sharing. Hackney, through another
library, is used in a Telegram API wrapper that I wrote - though my stuff
usually runs on a $5 VPS, nothing with multiple cores.

------
phreack
Ever since they launched screen sharing, I've uninstalled both Skype and
Hangouts and relied entirely on it for pair programming sessions. The
smoothness of the reproduction is just incredible, and I don't see myself
going back soon.

------
ZeroCool2u
I'm really impressed. I was using Discord for most of this weekend,
specifically Friday and Saturday. Never noticed any issues.

~~~
gizmo385
They don't have a posted outage for Saturday, but I noticed issues with it on
Saturday evening/night I believe. I'm wondering if it was related to the
issues that they included in the post-mortem.

~~~
b1naryth1ef
Very possible you saw a slight interruption around 11:30 PST lasting about 10
minutes, until we found and decommissioned the host that experienced this
problem. We generally don't update the status page until we can verify
impact/source - we see tons of limited outages from ISPs misbehaving.

------
humanfromearth
We had the exact same issue with RMQ (HA setup) on GCP (running on GKE) a few
weeks ago. We tried contacting support about it, but support is paid - no
customer support even for their own bugs.

The solution we've come up with so far is to disable automatic migrations. Not
sure if that option actually does anything.

~~~
sleepydog
You can't disable automatic migrations in GCP. You can choose between allowing
live migrations (moving the instance while it's still running) and (hard)
instance reboots.

~~~
humanfromearth
You're right. I meant hard reboots.

------
atomical
Does anyone use discord for work?

~~~
katastic
I have informally with co-workers. But not in any official capacity.

It's like 20x better than every other product out there, though. And their new
video chat + screen sharing is pretty great. The bandwidth is far higher than
any competitor's I've used.

My brother and I were playing 1080p videos on each of our screens and watching
the other's, just to test it out. Obviously it wasn't full quality, but it
kept the frame rate up and looked presentable, at least at 720p.

~~~
avree
Weird, I find Discord's audio quality especially to be terrible.

And their lack of scalable monetization leaves me worried about its long-term
success as a platform - they are adding more cost-intensive features while
continuing to try to support them with what is essentially a $5 monthly
donation model.

~~~
b1naryth1ef
Can you give more explicit examples of the bad audio quality you experience?
I'd be happy to forward this on to our native team to look into if there are
concrete things they can look at. Generally, 99% of the audio issues we see
people experience are due to ISP/peering/DDoS/etc. issues, most of which are
handled automatically by our servers within a few minutes.

~~~
eropple
FWIW, a serious feature that I would pay you money for right-here-right-now is
the ability to multitrack audio. Let me give you many dollars to route each
voice to a separate Soundflower channel, so that I can mix them outside of
Discord, and I will give you said monies. I'm pretty sure you're even sending
unmixed streams down? But I can't get at them!

This probably involves not being Electron (from my own adventures in the
area), so I don't hold out much hope, but it keeps Discord from replacing
Extremely Expensive And Bad solutions like SkypeTX for me.

~~~
exikyut
The first thing I thought of was using PulseAudio somehow. It has some image
problems, but its swiss-army-knife audio-routing chops are undeniable. It's
Linux-only though, so probably wouldn't be useful here.

I'm trying to figure out what the actual context in question is, particularly
in terms of technical connectivity. Is this being used for remote DJing? Or
conferencing? Or an audio recording situation?

If you're prepared to throw money at the situation, it's _possible_ this may
be fixable with a simple bespoke solution. I say "possible" because,
unfortunately, I just did some digging and found
[https://bugs.chromium.org/p/chromium/issues/detail?id=453876](https://bugs.chromium.org/p/chromium/issues/detail?id=453876):

> _Unfortunately we don't support multi-channel > 2 nor multiple devices at
> the moment._

> ...

> _Are there any future plans to support these two features? Is this a w3c
> issue or a Chrome issue?_

> ...

> _I am quite skeptical about this; I was told this requires a huge change in
> our WebRTC-side infrastructure, but I am not sure what the current status
> is._

> _The spec indicates getUserMedia can be configured with 'channel count', so
> I assume this is Chrome issue._

That immediately nukes WebRTC :(

Could make for a fun project. I'm very fascinated with audio handling myself
and this sounds interesting, but I'm unsure I'd personally have the skills (or
mental stamina/attention span :< ) to be sure I could follow through. I'm also
only on a Linux box, which brings up the platform-native problem.

~~~
eropple
Sorry, just saw this. I need to split audio to mix and level it for stuff like
live-streamed podcasts. So, on top of that, I need to pull video.

Honestly, the best answer is probably to continue using multiple Skype
instances. Which is gross. But, y'know.

~~~
exikyut
It's fine - you actually saw it, which is cool :) Some of my past replies have
gone completely unnoticed.

I see. I get the impression this is collaborative podcasting with multiple
people that have multiple microphones. (I can't figure out why else you'd need
multichannel A+V transport.) FWIW, it does sound like Skype is probably your
best bet for the time being (unfortunately). It's simple, it works for
everyone, etc.

------
lwansbrough
Funny definition of HA. :)

