
How Discord Scaled Elixir to 5M Concurrent Users (2017) - lelf
https://blog.discordapp.com/scaling-elixir-f9b8e1e7c29b
======
xb95
Discord infra engineer here -- this blog post needs an update! Since then
we've scaled this system much more. :)

The Fortnite Official server has exceeded 100,000 concurrent users, Discord
itself is way past that 5M concurrent number, we're now using Rust in certain
places to make Elixir go faster, we've built a general-purpose replacement for
Process.monitor that scales a whole truckload better, which we're open sourcing
next week at Code BEAM SF... the list goes on.

There's a lot of fun stuff going on to try to make this system even more
efficient and reliable, and there's still a lot to do. We run everything on a very
small engineering team (there are 4 fulltime engineers on the core
infrastructure, only about 40 engineers in the whole company) and we're always
looking for a few more. Feel free to reach out to me (zorkian#0001 on Discord)
if this blog post sounds up your alley!

~~~
truncate
Still reading the article, and I'm enjoying it so far.

My only complaint is the design of your blog. The header and the footer (to
subscribe on Medium) take up so much space that there is relatively little room
left for the actual content.

[https://imgur.com/a/pBUmIvp](https://imgur.com/a/pBUmIvp)

~~~
rehemiau
I can recommend the browser extension "Make Medium Readable Again":

[https://addons.mozilla.org/en-US/firefox/addon/make-medium-readable-again/](https://addons.mozilla.org/en-US/firefox/addon/make-medium-readable-again/)

~~~
mercer
This add-on can "Access your data for all websites".

That seems both risky _and_ somewhat overkill considering its features. Does
Firefox not support targeting a specific domain yet? Or is part of the problem
that medium allows custom domains (it does, right?).

~~~
roryokane
Firefox does support targeting specific domains, so the add-on specifically
chose to apply to all domains by writing "https://*/*" in "permissions" in
manifest.json. It probably asks for this permission because of custom domains,
as you theorize.

I can see in the extension source (thanks to
[https://addons.mozilla.org/en-US/firefox/addon/crxviewer/](https://addons.mozilla.org/en-US/firefox/addon/crxviewer/))
that on every page, the extension uses JavaScript to check for a top nav bar or
a login nag popup and hide them if present, then applies CSS that hides five
other UI elements if they are present.

~~~
mercer
Am I correct in assuming that if the add-on was not manually installed, it
could be updated at any time to include malicious code? Or is that perhaps just
Chrome's behavior?

------
jhgg
We've since ditched fastglobal actually, as we found that it took too long to
recompile code as the number of processes on the BEAM VM grew. As it turns out,
the cost to recompile scales linearly with the number of processes on the node.
Once we reached the million-processes-per-node threshold, fastglobal's dynamic
module recompilation became too slow and resource-intensive.

We've since been able to squeeze the performance we needed out of ETS to run
our hashring. More on that in this PR:
[https://github.com/discordapp/ex_hash_ring/pull/1](https://github.com/discordapp/ex_hash_ring/pull/1)
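
Out of curiosity, the shape of such an ETS-backed ring lookup might look
roughly like this; the table layout, names, and hashing here are my
assumptions for illustration, not necessarily what the linked PR does:

```elixir
# Hedged sketch of a consistent-hash lookup backed by an ETS ordered_set
# with read_concurrency, so many processes can read without a shared lock.
ring = :ets.new(:hash_ring, [:ordered_set, :protected, read_concurrency: true])

# Place nodes on the ring keyed by their hash.
for node <- [:guild_a, :guild_b, :guild_c] do
  :ets.insert(ring, {:erlang.phash2(node), node})
end

# Owner of a key: the first ring entry at or after the key's hash,
# wrapping around to the smallest hash if we run off the end.
find_owner = fn key ->
  h = :erlang.phash2(key)

  slot =
    case :ets.next(ring, h - 1) do
      :"$end_of_table" -> :ets.first(ring)
      s -> s
    end

  [{^slot, owner}] = :ets.lookup(ring, slot)
  owner
end

find_owner.("guild:1234")
```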

Also one amazing feat we've accomplished thanks to ZenMonitor (which we will
open source soon as my colleague mentioned), is that we can now tolerate a
guilds node failure and recover from it in ~40 seconds. We had a node fail on
Friday actually due to the underlying host rebooting (gcp calls this a 'host
error'), and didn't even notice until after the system had already recovered
and the on-call got alerted after the fact. Back in the day this led to a
cascading failure throughout the system. A guild node runs between 600k-700k
concurrent Discord guilds, or servers as they're known in userland. And although
we haven't done it recently, we clock a full restart of our distributed system
at roughly 17 minutes (from shutdown to service fully restored).

~~~
pm90
Interesting. Do you run on top of bare gce vms? What machine types do you use?
Didn't know they were being automatically rebooted under the hood.

~~~
jhgg
Yes. We run bare VMs for this workload (n1-standard-X). They do get rebooted in
the event of hardware failure or other unexpected faults. Generally it does not
get to that though, as Google is fairly good at migrating your VMs off of a
host if it detects it will fail soon.

~~~
boulos
Disclosure: I work on Google Cloud (and have emailed jh about Discord).

Yeah, the most straightforward failure that we can’t migrate away from is the
NIC (or rack switch) failing. Obviously, if your path to get off the box is
dead, that’s not going to happen :).

Others though, like the hypervisor crashing or even host kernel are also
possible, but much less frequent than “Hmm, I think the network is dead, we
should start a replacement VM”.

One of the many motivations for (now) only having Persistent Disk for boot
disks, is that it lets us avoid a whole class of truly unrecoverable errors.
There are still bad DIMMs, but monitoring for ECC failures often lets us mark
the host for replacement in time for the VMs to migrate off before they
actually fail.

------
pdimitar
A corollary:

As a mostly-Elixir programmer for 2 years now, I find it quite amusing how the
Kubernetes community tries its damnedest to emulate Erlang's OTP system --
going as far as bolting a stricter type system onto Golang -- and tries to
reinvent Erlang's "let it crash and get rebooted" idea.

I am, however, not at all convinced that "let it crash" is a good mantra when
applied to entire containers. One such container can take 20+ seconds to
restart, and you can lose a lot more compared to the lightweight processes
Erlang/Elixir have, which are happily left to crash and [semi-]auto-recover.
Imagine if that container was ingesting events and crashed when it had 5_000
in its in-memory queue and had only managed to process 50-100 of them. Or
imagine 100 in-progress transactions being cancelled.
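
For contrast, a minimal sketch of what per-process "let it crash" looks like
on the BEAM; the worker and its loop are made up for illustration:

```elixir
# A supervised worker: if the loop crashes, the supervisor restarts it
# in microseconds, without tearing down anything else on the node.
defmodule Ingest do
  def start_link, do: Task.start_link(&loop/0)

  defp loop do
    # Imagine pulling one event at a time off a durable queue here, so a
    # crash loses at most the single event currently in flight.
    Process.sleep(1_000)
    loop()
  end
end

children = [
  %{id: Ingest, start: {Ingest, :start_link, []}, restart: :permanent}
]

Supervisor.start_link(children, strategy: :one_for_one)
```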

I applaud the hard work of the Kubernetes team; I am just not very sure they
are investing their energy in the right tech stack. Sure, Golang is faster than
Erlang/Elixir -- by a lot, too. But isn't the idea of K8s to be fault-tolerant
rather than immensely quick? What does Golang's speed matter when a container
needs seconds to reboot? In my eyes, K8s could have been written in Bash shell
scripts... but I am probably missing something important.

~~~
cdoxsey
Kubernetes is a complex project with a large codebase. Writing it in bash
would be a nightmare to maintain.

Go makes sense as an implementation language for a system like this for many
reasons:

\- it doesn't require a VM (goodbye Java)

\- it is a memory safe language (goodbye C++)

\- it has strong types to improve reliability

\- it does a good job of handling complex, multi-module code efficiently (these
last two say goodbye to Python)

So of the Google friendly languages you end up with Go.

And I don't know if it was planned, but Go has turned out to be a boon for
contribution. A surprising number of devops engineers are willing to dip their
toes in Go waters and I highly doubt they would've done so with Erlang.

I think you'd end up going the chef route, writing the control plane in Erlang
and expecting users to interact in Ruby or something like it.

Also this whole conversation seems a bit off because Kubernetes is a multi-
service architecture with many components. You can absolutely write operators
in other languages since Kubernetes exposes an API.

It's kind of the whole point to be able to take an existing application and
run it in k8s rather than a regular VM with minimal changes. Expecting
developers to rewrite everything in Erlang is nuts.

~~~
pdimitar
> _Expecting developers to rewrite everything in Erlang is nuts._

...But I never said that?

You are correct on your points and I don't disagree. Golang is certainly a
much better choice than Bash indeed. You are also correct on developer
willingness to work with Golang.

I am aware that K8s vs. Erlang/OTP is an apples-to-oranges comparison; they
serve different needs. Whereas Erlang's runtime (the BEAM VM) can give you
fault tolerance, K8s tries to do roughly the same at a higher level: it tries
to give you throwaway containers that can be switched off and on at any time.

As I mentioned in my parent comment, I admire their work.

My point was that, if you squint hard, it kind of looks like K8s wants to
reinvent Erlang/OTP for infrastructure (as opposed to Erlang/OTP, which gives
its guarantees per node).

~~~
BraveNewCurency
> which gives its guarantees per node

There is no such thing, because Nodes (and even entire datacenters) can fail.

In other words, node failure is an infrastructure problem that is best NOT
handled by your bespoke application code. Replacing failed nodes should NOT be
custom code in your app, that way lies madness.

> kind of looks like K8s wants to invent Erlang/OTP for infrastructure

You can say "K8s and Erlang implement similar ideas at different levels." But
you can't pretend one is a substitute for the other, nor that "Erlang has done
everything that K8s can do."

E.g. Writing in Erlang doesn't magically get you:

\- deploy/upgrade of the Erlang runtime, including rollback + multiple versions
co-existing, including any compiled foreign code (which is required in the
article).

\- ability to log, monitor, probe and reroute the connections between services
-- in a standard way such that the application doesn't have to be modified
("service mesh").

\- ability for an entire ecosystem of tools to inspect the versions of your
application services that are deployed, because it's exposed as an API.

\- a standard ecosystem of plug-ins for operators (autoscale, autoscale to EC2
spot instances, capacity planning, "best practices" of running a MySQL cluster,
etc.). None of these should ever be mixed with the application. (Unless you are
Kelsey Hightower:
[https://github.com/kelseyhightower/hello-universe](https://github.com/kelseyhightower/hello-universe))

~~~
pdimitar
> _There is no such thing, because Nodes (and even entire datacenters) can
> fail._

That's a gross over-simplification that I'd also call a strawman. In the face
of a lack of electricity, of course no computer language matters at all. What's
your point?

------
juhatl
Earlier HN discussion from 2017 on the same post:
[https://news.ycombinator.com/item?id=14748028](https://news.ycombinator.com/item?id=14748028)

------
pault
It was really interesting to read how each problem they solved uncovered
another problem further down the pipeline. It also reveals how much is going
on behind the scenes when a service goes down and everybody starts accusing
the developers of incompetence. :)

~~~
jchw
It also helps you learn to appreciate software that very rarely goes down at
all. It's no simple accomplishment.

------
sametmax
While 5M concurrent chat users is definitely a massive feat, it seems to me
that it's kind of this tech's (Erlang's) very sweet spot, and not being able to
do so would have been disappointing. Or am I missing something?

~~~
pdimitar
My takeaway is that Erlang/Elixir aren't designed to squeeze out _that much_
performance with that setup, and that the Discord team went out of their way to
optimize hot spots to make it possible.

Erlang/Elixir are still quite impressive out of the box, even without these
Discord-specific optimizations. But they probably wouldn't scale to 5M right
away.

~~~
sametmax
Nothing does. 5 million __concurrent__ users creating and sending new data in
real time is huge. Not to mention video and sound.

~~~
pdimitar
Yes. In my eyes the win here isn't concurrent performance that cannot be beaten
by anything else; I am pretty sure a carefully crafted Golang or Rust stack can
beat Elixir any day. But they won't have the fault-tolerance guarantees of the
BEAM VM.

The real win IMO is the good reliability:performance ratio that Discord
achieved. I feel Erlang/Elixir are excellent in optimizing this exact metric.

~~~
bpicolo
Golang and Rust also don’t give you a direct path to distributed actors.

------
nicodjimenez
Question for the Discord team.. if you started this project today in 2019,
would you still build a homegrown event sourcing system? Or would you use
Kafka?

~~~
jhgg
We would still use our own homegrown system. And we do use Kafka internally
already for a plethora of other things. Just not on the real-time chat side of
things.

~~~
nicodjimenez
Why not on the real time chat stuff?

~~~
jhgg
Because Kafka is definitely not a tool suited for real-time chat and event
distribution to millions of clients. Once a message hits our distribution it's
fanned out to clients on average 5-10ms later.

------
dev_dull
> _Sending messages between Erlang processes was not as cheap as we expected,
> and the reduction cost — Erlang unit of work used for process scheduling —
> was also quite high._

This is really surprising to me, and definitely something the Elixir team
should look at optimizing. _Sending_ messages should be extremely fast.

~~~
jadbox
I have no insight into how Erlang implements this, but I'd assume Erlang sends
messages about as fast as possible, as it's a key part of the platform and has
had two decades to perfect it. Most likely this is a limitation of the paradigm
(the Actor model) and not of the implementation.

~~~
larryweya
I believe the bottleneck the author was trying to overcome here is how the
Erlang VM moves processes to the back of the run queue once they hit a
predetermined number of operations (preemptive scheduling), not a limitation of
the Actor Model.
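
You can watch this accounting yourself: the scheduler preempts a process once
it burns through its per-slice reduction budget (a few thousand reductions;
the exact number is an implementation detail of the VM), which is why a long
computation can't starve other processes:

```elixir
# Observe the reductions consumed by a chunk of work via Process.info/2.
{:reductions, before} = Process.info(self(), :reductions)

Enum.sum(1..100_000)

{:reductions, after_} = Process.info(self(), :reductions)
IO.puts("summing 1..100_000 cost ~#{after_ - before} reductions")
```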

------
DevX101
Have any Node.js/websocket implementations scaled on this order of magnitude?
Would like to do a read-up of any challenges faced.

~~~
tigershark
Am I the only one thinking that node.js is maybe not the right choice for
global-wide services?

~~~
tomc1985
node ain't the right choice for _anything_

------
JulianMorrison
Needs some attention to the scaling _down_, IMO. Discord is annoying on
intermittent connections. Why do sent messages not have an identity, such that
they can go through twice?

~~~
jhgg
We actually added this in the second half of last year. Message send operations
are now deduped and idempotent given a client-generated nonce. Are you still
experiencing this issue?
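
For anyone curious, nonce-based idempotency can be sketched roughly like this;
the module, table, and return values are illustrative assumptions, not
Discord's actual implementation:

```elixir
# :ets.insert_new/2 is atomic: it only inserts if the key is absent, so a
# re-sent message carrying the same nonce becomes a no-op.
defmodule NonceDedup do
  def new, do: :ets.new(:nonces, [:set, :public])

  def send_message(table, nonce, msg) do
    if :ets.insert_new(table, {nonce, msg}) do
      {:ok, :delivered}   # first time we've seen this nonce
    else
      {:ok, :duplicate}   # resend after a dropped ack; deliver nothing
    end
  end
end

t = NonceDedup.new()
NonceDedup.send_message(t, "abc123", "hello")  #=> {:ok, :delivered}
NonceDedup.send_message(t, "abc123", "hello")  #=> {:ok, :duplicate}
```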

~~~
JulianMorrison
Your nonce might not survive the "message -> disconnect -> red message ->
re-send" flow.

Could you just get rid of red messages and make it "pending" until acked as
"sent"?

------
rcarmo
I'm a bit curious as to the message processing overhead. BEAM and HIPE are
relatively fast, but string processing of any kind was kind of slow back when
I messed around with Elixir...

~~~
siscia
String manipulation is notoriously slow in the BEAM, mostly for implementation
reasons.

Message passing is somewhat slow as well -- slower than Golang, if you want a
comparison.

But in those systems, the ones you design on the BEAM, raw speed is usually not
the problem. What you usually try to do is reach a design with no single
bottleneck or failure point, so that pretty much whatever happens you can just
add machines.

This turns out to be a great way to design multi-{process, core} and
multi-node systems.

What you usually get on a BEAM under load is 100% CPU usage, even on multiple
cores, though to be honest those cores are not used as efficiently as they
could be.

It is a matter of tradeoffs: I can quickly write multi-process software that
can easily scale, but I will leave some raw performance on the table.

Again, it turns out that raw numbers are not as important as they are simple to
measure.

~~~
brightball
FWIW, that’s also why the use of IO lists is strongly encouraged over string
manipulation. It’s also why HTML with Phoenix is so fast.

[https://www.bignerdranch.com/blog/elixir-and-io-lists-part-2-io-lists-in-phoenix/](https://www.bignerdranch.com/blog/elixir-and-io-lists-part-2-io-lists-in-phoenix/)
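
The gist, for anyone who hasn't seen IO lists: instead of concatenating
binaries, you nest them in lists, and the runtime writes the pieces out
directly (e.g. via writev) without ever building the combined string:

```elixir
# Building output as an IO list: no intermediate binaries are created
# when the pieces are written to a socket or file.
name = "world"
greeting = ["<p>Hello, ", name, "!</p>"]

# Only flatten when you actually need a single binary:
IO.iodata_to_binary(greeting)
#=> "<p>Hello, world!</p>"
```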

------
eric_khun
Can the ETS cache system be used by something other than Elixir?

~~~
toast0
ETS is basically a key-value store with optional sorting, multiple values,
limited atomic updates, and a weird query language. I'm not sure how well it
compares to other key-value stores, but its main differentiator is that it
stores Erlang native types without a significant marshalling burden (at least
for the developer -- if you store complex values, it's still plenty of work
for the runtime). It wouldn't be a lot of work to build an Erlang service to
expose ETS to something else, but I don't know that the data marshalling
required would be worth it.
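
A quick taste of those features, in Elixir syntax (table name and contents are
made up for illustration):

```elixir
# ETS as a key-value store: optional sorting (:ordered_set), atomic
# updates (:ets.update_counter/3), and match-spec queries.
t = :ets.new(:example, [:ordered_set, :public])

:ets.insert(t, {"hits", 0})
:ets.update_counter(t, "hits", {2, 1})  # atomically bump field 2 by 1
:ets.lookup(t, "hits")                  #=> [{"hits", 1}]

# The "weird query language": match specs. Select keys whose value > 5.
:ets.insert(t, [{"a", 10}, {"b", 20}])
:ets.select(t, [{{:"$1", :"$2"}, [{:>, :"$2", 5}], [:"$1"]}])
#=> ["a", "b"]
```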

More interesting might be to expose mnesia, maybe. But even then, it might be
most useful to build out your data handling in Erlang and expose a higher
level API for clients in other languages.

------
aboutruby
I'm continuously impressed at the quality of the software coming from Discord.
By far the best messaging user experience on desktop there is.

Discord already uses Cloudflare, I'm curious what they think of Workers.

edit: I'm on macOS, not a heavy user; I used Slack / Skype / Hangouts /
WhatsApp / Messenger / etc. before

~~~
rumblefrog
Unfortunately I have the opposite experience

\- Abusive trust and safety team

\- Constant outages for both users and bots

\- Frontend is heavily bloated

\- UI designed for money grabs instead of for the users

~~~
metildaa
Don't forget the weird dependency hell when it comes to writing Discord bots
in Python. They require you to use an older version of Python that is "fun" to
work with.

~~~
jhgg
All the Discord API libraries are community-built and maintained.

------
est31
From time to time my ISP changes my IP address and Discord totally freaks out
about it. I have to fill out a captcha and receive a verification e-mail. They
are treating me as if I used asdf1234 as my password.

~~~
ubercow13
Yes, and if they don't like you enough they lock your account until you
provide a phone number (no thanks)

------
denart2203
Really neat post. Did y’all consider GenStage or another demand-based approach
for your overflow problem? I’d be interested in hearing about the tradeoffs
between demand vs. semaphore; it seems like the two have some similarities.

~~~
cuddlecake
I think the main difference is that "demand" in GenStage tells the producer
how much work it should do to satisfy the consumer. The consumer sets the
demand; the producer does not limit it.

With how Discord uses semaphores, the consumer will not even make a demand for
more events from the producer.

Other than that, I think in the case of Discord, it's more of an RPC thing,
and GenStage is rather made for concurrent streaming and processing of data.
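
A minimal GenStage pair (assumes the :gen_stage dependency; the module names
are illustrative) shows the demand flowing from consumer to producer:

```elixir
# Demand-driven flow: the consumer's max_demand bounds how many events it
# asks the producer for, and the producer only does work when asked -- the
# inverse of a semaphore capping what the callee will accept.
defmodule CounterProducer do
  use GenStage

  def init(start), do: {:producer, start}

  # Called with the consumer's outstanding demand.
  def handle_demand(demand, next) when demand > 0 do
    events = Enum.to_list(next..(next + demand - 1))
    {:noreply, events, next + demand}
  end
end

defmodule PrintConsumer do
  use GenStage

  def init(:ok), do: {:consumer, :ok}

  def handle_events(events, _from, state) do
    IO.inspect(events, label: "got")
    {:noreply, [], state}
  end
end

{:ok, producer} = GenStage.start_link(CounterProducer, 0)
{:ok, consumer} = GenStage.start_link(PrintConsumer, :ok)

GenStage.sync_subscribe(consumer, to: producer, max_demand: 10)
```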

------
nottorp
Nice. How about fixing the client side by, you know, doing a native client?

------
jplayer01
So why can't they add basic features like organizing servers?

