
Discord outage postmortem - fanf2
https://discord.statuspage.io/incidents/62gt9cgjwdgf
======
coder543
> Our service discovery system is resilient to various failure modes within
> etcd, however, we did not anticipate a client being able to write a corrupt
> record due to improper HTTP handling. golang's net/http module — which is in
> violation of the HTTP 1.1 specification in this specific case — introduced
> an additional failure case which we did not anticipate. It is expected that
> a Content-Length header and an incomplete body should be rejected by a
> server as being incomplete. Unfortunately net/http does not check that the
> bytes read from the body portion of the request match the Content-Length
> header.

Whichever engineer wrote this puts too much weight on the theoretical aspects
of HTTP.

The reality is that any of Discord's software could have had erroneous code
that announced an empty service list. This would have crashed all of Discord
regardless -- transient network errors and mismatched HTTP Content-Length
headers _need not apply._

A _resilient_ HTTP server will try to handle whatever the HTTP client sends
it, and then the specific endpoint can enforce whatever levels of strictness
are needed by the application. It's nice to be able to point a finger at "not
my code", but how many popular web server libraries (not standalone web
servers like nginx or apache) out there will actually flat out reject the
entire request because of a simple Content-Length mismatch, rather than
handing that decision off to the endpoint?

If an empty string is not a valid value for your service discovery system, you
should reject it at the database with a CHECK constraint if possible, or with
the code that sits in front of the database if constraints don't exist (as I
believe is the case for etcd). It's also not a bad idea for clients to
validate what they receive and filter out values they can't handle, for extra
robustness.

You definitely shouldn't rely on nothing ever _attempting_ to put a bad value
into the database.
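
For what it's worth, the write-side check doesn't need much code. A minimal
sketch in Go, assuming entries are host:port strings, with putFn standing in
for whatever etcd client call you actually use:

    package main

    import (
        "context"
        "fmt"
        "net"
        "strings"
    )

    // putServiceEntry sketches write-side validation in front of etcd
    // (etcd has no CHECK constraints, so the writer has to enforce them).
    // The host:port format is an assumption about what a valid entry
    // looks like; putFn is whatever client call actually stores it.
    func putServiceEntry(ctx context.Context,
        putFn func(ctx context.Context, key, val string) error,
        key, val string) error {

        if strings.TrimSpace(val) == "" {
            return fmt.Errorf("refusing to write empty service entry for %q", key)
        }
        if _, _, err := net.SplitHostPort(val); err != nil {
            return fmt.Errorf("service entry for %q is not host:port: %w", key, err)
        }
        return putFn(ctx, key, val)
    }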

The web server's willingness to accept strictly-malformed requests, combined
with the engineers being unaware of that fact, was certainly a contributing
factor, but I don't think it was the root cause in this case, however heavily
the write-up tries to suggest it.

(I think it would be neat to have a "strict" mode that you could turn on for
every web server, especially when being used as an internally-facing service,
but most companies want to make it as easy as possible for their
customers/partners to integrate with their service... and that means accepting
requests that aren't always academically perfect.)

~~~
jchw
I think it would be nice if the HTTP server could at least optionally validate
that the length is as expected. For example, I would really not expect a
malformed chunked encoding lacking a terminating zero-length chunk to silently
succeed.

However, I think you are dead on the money. etcd is not validating the values,
but Discord’s software seems to assume the etcd values must be valid. That
seems like a recipe for future outages if that mentality persists.
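
For the opt-in length check, here's a sketch of what it could look like in Go.
To be clear, this is not what net/http does today, just an illustration:

    package main

    import (
        "bytes"
        "io"
        "net/http"
    )

    // requireFullBody rejects any request whose body is shorter than its
    // declared Content-Length, before the real handler ever sees it. It
    // buffers the whole body, so it only suits small internal APIs.
    func requireFullBody(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if r.ContentLength > 0 {
                body, err := io.ReadAll(r.Body)
                if err != nil || int64(len(body)) != r.ContentLength {
                    http.Error(w, "incomplete request body", http.StatusBadRequest)
                    return
                }
                // Hand the handler a replacement body with the verified bytes.
                r.Body = io.NopCloser(bytes.NewReader(body))
            }
            next.ServeHTTP(w, r)
        })
    }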

~~~
q3k
Yeah, this is reliability engineering 101 - don't ever expect that outside
content won't be malformed, even from 'trusted' systems. There was a codepath
somewhere that went from an unmarshal error into a crash - and that's
something that should not pass any code review.
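
In Go terms the defensive version of that codepath is short. A sketch, with an
invented Service shape:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Service is a made-up shape for illustration.
    type Service struct {
        Name string `json:"name"`
        Addr string `json:"addr"`
    }

    // decodeServiceList turns a bad or empty payload into an error for the
    // caller to handle (retry, keep the previous list, page someone)
    // instead of a crash.
    func decodeServiceList(raw []byte) ([]Service, error) {
        var svcs []Service
        if err := json.Unmarshal(raw, &svcs); err != nil {
            return nil, fmt.Errorf("malformed service list: %w", err)
        }
        if len(svcs) == 0 {
            return nil, fmt.Errorf("empty service list; refusing to apply it")
        }
        return svcs, nil
    }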

~~~
toast0
In Erlang (and Elixir) systems, turning unexpected errors into crashes is what
you're supposed to do. Unfortunately, it's pretty easy for that to spiral into
node shutdowns and restarts; everybody ends up learning that one the hard way.

Supervision trees really do need to be considered, but what are you expected
to do when the service discovery system is unreachable, broken, or filled with
bad data? There are certainly options; giving up isn't necessarily the wrong
one.

~~~
jchw
This may be true, but unchecked malformed input is not, imo, an “exceptional”
case; it’s nominally expected. And it’s kind of dangerous, since it’s one of
those things that isn’t likely to happen often: it takes another bug or fault
to reveal it.

It’s probably an unpopular opinion at this point, but I definitely appreciate
Go’s approach to error handling here. Error cases are in-your-face by default,
ideally leaving only programming errors and critical faults to panic.

~~~
toast0
The important idea of the Erlang approach is not to leave things unchecked;
the idea is that the proper response to most errors is to restart from a known
good configuration.

Of course, restarting the service discovery client may not help, especially
when the service discovery system's state is stable but broken. I don't know
what you're supposed to do if you're dependent on services, and the service
discovery system you use to find the services is not reachable or is returning
garbage information. Better logging is always nice, but you can't serve the
traffic.
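
The closest thing to an answer I know of is hanging on to the last result that
validated, and serving with that for as long as it holds up. A sketch:

    package main

    import (
        "log"
        "sync"
    )

    // knownGoodCache keeps serving the last service list that passed
    // validation; stale-but-plausible endpoints usually beat an empty list.
    type knownGoodCache struct {
        mu   sync.Mutex
        last []string // last validated service list
    }

    func (c *knownGoodCache) Update(fresh []string, err error) []string {
        c.mu.Lock()
        defer c.mu.Unlock()
        if err != nil || len(fresh) == 0 {
            log.Printf("discovery unusable (err=%v, n=%d); keeping %d cached endpoints",
                err, len(fresh), len(c.last))
            return c.last
        }
        c.last = fresh
        return c.last
    }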

------
jchw
In my opinion, the HTTP server’s Request.Body io.Reader should return an error
on read when the request body ends unexpectedly, rather than having every
consumer of the request body sanity-check the length, which seems cumbersome
and error-prone.
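
Something like this, say (illustrative only; this type is not part of
net/http):

    package main

    import "io"

    // lengthCheckedBody turns a short read (fewer bytes delivered than
    // Content-Length promised) into an explicit io.ErrUnexpectedEOF
    // instead of a clean EOF.
    type lengthCheckedBody struct {
        r         io.Reader
        remaining int64 // bytes still owed per the Content-Length header
    }

    func (b *lengthCheckedBody) Read(p []byte) (int, error) {
        n, err := b.r.Read(p)
        b.remaining -= int64(n)
        if err == io.EOF && b.remaining > 0 {
            // The peer declared more bytes than it delivered.
            return n, io.ErrUnexpectedEOF
        }
        return n, err
    }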

~~~
Jasper_
I believe it should, but it's likely a userspace proxy like HAProxy is in the
way causing errors to be swallowed, as there's no way with the current sockets
API to intentionally send an error over the wire.

~~~
jchw
Hi Jasper :)

Do you think so? I checked the Go http server code, and my not-so-careful
reading seems to confirm the postmortem's claim: it just reads from the body
io.Reader, which doesn't seem to validate the length. (They DO use a limit
reader to prevent DoSing with huge bodies, at least.)

Now, if they were phoning out to a remote server, I’d assume some kind of
fabric or load balancer was involved. But I assume they’ve got a local etcd on
each node, or something along those lines, so it’s very possible that it’s all
local networking.

Though in most cases outside of this one, I’d agree this wouldn’t be
particularly useful, because it would never be terribly reliable. I know for a
fact that Amazon ELB rewrites the entire request, as an example...

~~~
Jasper_
I was talking about getting an error when reading from the io.Reader, like a
TCP timeout. I was quite sure the error would bubble up, but of course it
wouldn't if there was no error to begin with, with something like HAProxy in
the way.

------
csears
As part of the fix, they mention changing the Python etcd client code to send
the request in a single packet. How would one force that behavior? And is that
even a good idea?

It seems like TCP fragmentation would be handled at a much lower level in the
Python networking stack.

~~~
q3k
No idea. This 'fix' doesn't make any sense.

~~~
mst
It makes perfect sense as "makes it 95% less likely to hit this bug."

Notice from the write-up that they already _also_ modified their code to be
resilient against the bug being hit at all, so "reducing the number of times
the resilience code needs to be invoked by 95%" on _top_ of that is a net win.
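
Concretely, "send the request in a single packet" presumably just means
serializing the whole request and handing it to the kernel in one write (the
Python client would do the equivalent with a single sendall). A sketch of the
idea; the path and host here are made up, and the kernel or path MTU can still
split the data:

    package main

    import (
        "bytes"
        "fmt"
        "net"
    )

    // writeRequestInOneSegment issues the entire request in one write, so
    // a small request normally leaves the host as a single TCP segment.
    // This only shifts the odds of a mid-request split; it cannot
    // guarantee one packet.
    func writeRequestInOneSegment(conn net.Conn, body []byte) error {
        var buf bytes.Buffer
        buf.WriteString("PUT /v2/keys/services/example HTTP/1.1\r\n")
        buf.WriteString("Host: etcd.internal\r\n")
        fmt.Fprintf(&buf, "Content-Length: %d\r\n\r\n", len(body))
        buf.Write(body)
        _, err := conn.Write(buf.Bytes()) // one write, not headers-then-body
        return err
    }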

------
k__
We used Discord for the "We vs Virus" hackathon last weekend, and I was
baffled by how much better than Slack it is.

Everything runs much more smoothly, and voice chat is like TeamSpeak back in
the day. Solid product.

------
kylek
The level of detail in this is pretty nice. But I also feel like it's too
much: some of the comments here are poking holes in the write-up. Is there a
mis-translation from geek-speak to lay-speak (maybe due to the short
turnaround)? Would it be better not to mention some of these details at all?

------
chasers
It sounds like they were maybe pattern matching too strictly. If the value is
always a string, just accept the empty string and try to discover that
service. If it doesn't exist, oh well; it doesn't crash the process.

------
dirtydroog
> unexpected deserialization error

I've had this with Boost.Serialization. Not fun. No exception, just a crash.

------
ayorosmage2
https://jepsen.io/analyses/etcd-3.4.3

------
intsunny
"All times are PDT."

A daily reminder that one can do anything and everything in JavaScript, except
display information in UTC.

~~~
sbr464
The reason they stick to one timezone, it seems, is that comments/updates can
be added to status page incidents. It gets confusing when a comment talks
about event times and the post itself uses different ones. Here's an example
from Slack where it's mixed, with comments in PDT and posts in the local
timezone: https://status.slack.com/2019-06-28. I think it's a setting in
Statuspage, but it's tough to decide which is best. I definitely wouldn't
prefer looking at it in UTC.

------
ravedave5
FYI for anyone using Discord: don't use the web client. Often, users on the
web client can only be heard by certain people, and vice versa.

------
ddevault
It's a good thing that everyone makes their own discord "server", which
naturally is resilient to any failures upstream by virtue of being its own
fully contained, self-hosted software stack.

Wait, no, Discord just aped that word to gaslight users into behaving in a way
which makes them more money.

~~~
ReidZB
That does not seem like a very generous interpretation.

I'd wager that most of Discord's clientele don't have any specific
expectations for the word "server". Regardless, I think "server" was carried
over from the Mumble / Ventrilo / Teamspeak gaming community.

~~~
Slartie
That's where it came from. Before Discord, gamers set up actual servers with
Teamspeak or Mumble and handed the IPs and ports to their guildmates. The
"Teamspeak server" was where you met for voice chatting, and since Discord
just wanted to supplant Teamspeak with their own product, they kept the
"server" part in the name of the thing gamers were supposed to meet on for
voice chatting.

It is a pure marketing ploy. Internally they call the "server" thing a
"guild", AFAIK, which is technically just as much BS as the "server" moniker,
but for different reasons (not every community using Discord is a "guild" in
the commonly accepted gaming sense of the word: a long-term organizational
group doing stuff together in an MMORPG).

