
How Discord Handles Two and Half Million Concurrent Voice Users Using WebRTC - jhgg
https://blog.discordapp.com/how-discord-handles-two-and-half-million-concurrent-voice-users-using-webrtc-ce01c3187429
======
ReidZB
I really wish Discord would implement an audio compressor (or if they use one,
make it much more aggressive) for voice chat. For example, Mumble doesn't
require manually adjusting anyone's volume - instead, it uses a compressor to
automatically normalize volumes for everyone.

In contrast, Discord has a volume slider that I end up having to adjust for
most people. That requires manual tweaking, often while I'm playing a game,
and isn't future-proof; just the other day, one of my friends changed his
microphone setup and was extremely loud (I had him at 200% when in the new
setup he should've been at about 75%).

This is really my only complaint with Discord's VOIP functionality. It is a
major step backward from Mumble. (Also, for some reason the Discord community
mostly uses voice activated mic instead of push-to-talk, but I don't think
that's Discord's fault.)

~~~
iMerNibor
> voice activated mic

Which is a very sane default for most people

The only reason you'd want to use push to talk is either you have a lot of
people in the channel, you have a loud background or you don't want people
hearing stuff not designated for the discord voice (if you're streaming, have
irl people in the same room - that kinda stuff)

~~~
kadendogthing
>Which is a very sane default for most people

I couldn't disagree more. Having an open mic setup for gaming is both insane
and completely inconsiderate to everyone else in the channel/group.

It's hard not to personally judge people that join a group, whether that's a
friend of a friend, someone looking to try out, public LFG lobbies, etc and
you can immediately tell they have an open mic configuration. It's maddening.

Breathing, keyboard smashing, mouse clicking, background noise, eating, or
exasperated whining every time you die/lose are all annoyances you push on the
rest of the channel/group with an open mic configuration. At best it's an
annoyance. At worst it's actively disrupting people's game play and team
communication.

For the love of all that is decent with humanity, use push to talk.

~~~
souterrain
> For the love of all that is decent with humanity, use push to talk.

I’d love Zoom, WebEx, GoToMeeting, et al., to implement push-to-talk, and give
an option for mandatory use to conference call organizers.

~~~
bbrks
I think Zoom now has a preference for holding down space to temporarily unmute
your mic. But I fully agree, push-to-talk should be absolutely mandatory for
any calls with more than 2 people in them.

~~~
piyh
Zoom has to have focus which rules out using PTT while presenting or
multitasking while meeting.

------
anderspitman
For those interested in open source/decentralized alternatives to
Slack/Discord/etc, I've had a pretty great experience with both the voice and
video integrated into Riot.im[0] (uses Matrix protocol[1] for chat, and
Jitsi[2] for voice/video).

I've said before that I can't really justify nonfree software for something as
simple as text chat.

That said, rooms with very large numbers of people trying to communicate with
voice/video is one use case where maybe it makes more sense for a commercial
product to solve it.

[0] [http://riot.im/](http://riot.im/)

[1] [https://matrix.org/blog/home/](https://matrix.org/blog/home/)

[2] [https://jitsi.org/](https://jitsi.org/)

~~~
singularity2001
Why is the riot "TRY now" link so damn hidden? Two clicks, one long search
here [https://riot.im/app/#/home](https://riot.im/app/#/home) No private room?
Not working without google? Very bad first impression :(

~~~
Arathorn
isn’t the “try now” link the big flashing button labelled “launch now!” on
[http://riot.im](http://riot.im)?

~~~
anderspitman
That's how I always start it. Unfortunately that link is on a carousel for
some reason, which automatically advances. Definitely not ideal UX in my
opinion.

------
jorams
Slightly off-topic, but:

> For clarity, we will use the term “guild” to represent a collection of users
> and channels — they are called “servers” in the client. The term “server”
> will instead be used here to describe our backend infrastructure.

I still don't understand why you ever chose the name "server" to refer to
something that is many things, but not a server. Guild is a way better name,
and doesn't need disclaimers about your self-chosen terminology being
confusing.

~~~
jczhang
I believe this is a tradition that has been grandfathered into voice chat
apps. Previous ones like Ventrilo have you actually connect to a private
server (that someone pays for / hosts).

~~~
elefanten
I second this explanation. Additionally, from a UX perspective, if a user
wanted to join your server you had to give them IP/port/pw and they'd type it
in manually. So even the users were accustomed to addressing servers. "Server"
entered the community lexicon as "the place we chat".

------
orliesaurus
Can anyone who works on the Discord app explain to me: how am I impacted by
being part of, let's say, 100 servers (guilds) in my app versus being part of
just 1?

Should I quit the discords that I don't use? I feel like they stay in sync in
the background and if I am in 100 servers and they all post say tons of text I
will get massive lags.

Does that make sense? Is there a way to prioritize only the server I have
active/I am talking in and freeze the rest while I'm
full-screen/in-game/focused?

~~~
b1naryth1ef
We've put a lot of work into our clients and our backend to make sure the
impact of being on 1 vs 100 guilds is negligible. It helps that most of the
folks building Discord are power users in a bunch of servers (so we feel the
pain of poorly optimized paths early). Generally if you don't look at a server
often it shouldn't affect the performance of the app / bandwidth usage. I
think we have some more blog posts in the works regarding some of these topics
so look forward to those!

~~~
jplayer01
Well, are you ever going to provide a better way to deal with 10+ servers? The
left-side server UI is terrible. The icons aren't enough to figure out which
server is what and it desperately needs some form of sorting or categorizing
(being able to create custom groups of servers, for example per game or any
arbitrary list of servers). I end up continuously pruning my list so I don't
have more than 10 because it's so bad.

WoW alone forces me to have 1 server per class, then 1 server for every guild
I'm associated with. Forget about any other game.

~~~
Reedx
No need to be so rude and entitled about it. Especially considering you're
probably not paying for the resources you're consuming. Even if you are, it's
rude.

I bet it's on their todo list to handle that use case and no doubt there are
other priorities.

~~~
jplayer01
It's been years without any improvement at all. So yeah, I'm annoyed. I'd even
be happy to give them money monthly, but the apparent priority of this issue
is at the bottom of their list, if it's on there at all.

------
merb
OT: what I always wonder is how Discord will make money. I mean, they
offer this huge free service for everybody (2.6 million concurrent voice
users) on over 1000 servers. That must be a huge GCloud invoice.

~~~
Vishnevskiy
This is how we will make money.

[https://blog.discordapp.com/the-discord-store-beta-9a35596fd...](https://blog.discordapp.com/the-discord-store-beta-9a35596fdd4)

Aside from the money we already make from
[https://discordapp.com/nitro](https://discordapp.com/nitro) :)

~~~
merb
Do you make that much money from Nitro? I've actually never seen that many ads
in Discord about Nitro, and I didn't even know it existed (I've been using
Discord for over a year). We actually migrated from a 10€ TeamSpeak to
Discord. While it is unlikely that everybody in my group would buy Nitro, I
find it amazing that you do not have many ads in your app, not even for your
own service. (I mean, sometimes there is the update pop-up and some messages
between the chat messages, but we basically never use the chat that much
anyway...)

------
jrockway
I'd be interested to see how the system works during high-load events. It
sounds like if a server gets overloaded, it shuts down and moves the clients
to the next non-overloaded server. If the health-check used by the load
assignment server is not perfectly tuned, then that failover overloads that
server, shutting it down. Eventually you have no servers left.

I suppose that a perfect health check will prevent this, since the failover
will assign failover traffic at exactly the level that the new server can
successfully handle. But if it's wrong on the other side (rejects connections
when capacity actually exists), then compute resources are wasted "just to be
safe".

I imagine that estimating capacity is even more difficult since people can
join and leave at any time, and the client doesn't send any packets when there
is silence. So the load changes based on how talkative people are being (which
means your server always crashes during the best parts of whatever you're
discussing).

Anyway, I'm wondering how this all compares to the naive strategy of "pick a
random server for this channel, if it crashes, bad luck".

~~~
jhgg
Perhaps I can answer some of these questions. We've dealt with scenarios where
we've had to fail-over entire data-centers to other regions, or other data-
centers within that region. We have a few mitigations in-place for stuff like
this.

a) A voice server can sit in multiple different load categories. So it's not
"best server by score", but rather "best server out of a pool of servers with
a given load factor". The load factor is one of ":verylow | :low | :medium |
:high | :veryhigh | :extremelyhigh | :full" When looking for a server, we have
an index of "best servers by region" that's stored in memory on each node and
kept synchronized by service discovery. Additionally, if we don't have enough
candidate nodes in a given load category, we will grab a few from the next-
best load category. The thought being, that for a given region, we'll have a
large set of servers to allocate to. This prevents a server failing from
thundering-herding another server.

b) A voice server can fast-fail (reject) allocation requests, and does so under
some circumstances, e.g.: the rate of allocation requests for the server
exceeds a threshold, or the server is at or approaching capacity. We do
a lot of this fast-failing logic using semaphores around a shared resource
(server allocator):
[https://github.com/discordapp/semaphore](https://github.com/discordapp/semaphore)

c) We also run things a bit over-provisioned. We try to have enough excess
capacity during peak such that we can handle the failure of an entire
datacenter within a region, or an entire region to nearby geographical
regions.
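
As a rough illustration of the pool-based selection described in (a), and not Discord's actual code, here is a minimal Python sketch. The category names come from the comment above; the `index` dictionary, the `min_candidates` parameter, and the function name are hypothetical stand-ins for the in-memory "best servers by region" index.

```python
import random
from typing import Dict, List, Optional

# Load categories named in the comment, ordered from least to most loaded.
LOAD_CATEGORIES = [
    "verylow", "low", "medium", "high", "veryhigh", "extremelyhigh", "full",
]


def pick_voice_server(
    index: Dict[str, List[str]],
    min_candidates: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a random server from a pool of similarly loaded servers.

    `index` maps a load category to the servers currently in it
    (a hypothetical stand-in for the per-region index). The best
    non-empty category is taken whole; if it holds fewer than
    `min_candidates` servers, a few are borrowed from the next-best
    categories. Servers in the "full" category are never used.
    """
    rng = rng or random.Random()
    pool: List[str] = []
    for category in LOAD_CATEGORIES[:-1]:  # skip "full"
        servers = index.get(category, [])
        if not pool:
            pool.extend(servers)
        else:
            # Borrow only as many as are still needed.
            pool.extend(servers[: min_candidates - len(pool)])
        if len(pool) >= min_candidates:
            break
    if not pool:
        raise RuntimeError("no voice server with spare capacity")
    return rng.choice(pool)
```

Choosing at random from a pool, instead of always taking the single best-scoring server, is what spreads a burst of failover allocations across many machines.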

>I imagine that estimating capacity is even more difficult since people can
join and leave at any time, and the client doesn't send any packets when there
is silence. So the load changes based on how talkative people are being (which
means your server always crashes during the best parts of whatever you're
discussing).

We use a lot of factors to measure load on a server to group it into a load
category - in addition to just traffic: we look at concurrent clients
connected, concurrent voice servers allocated, packets/sec, bytes/sec.

------
GranPC
> Instead of DTLS/SRTP, we decided to use the faster Salsa20 encryption. In
> addition, we avoid sending audio data during periods of silence — a frequent
> occurrence especially with larger groups.

I wonder whether this would help an attacker infer voice data, with a method
similar to the one from the paper "Uncovering Spoken Phrases in Encrypted
Voice over IP Conversations" [1]

[1]:
[http://www.cs.unc.edu/~fabian/papers/tissec2010.pdf](http://www.cs.unc.edu/~fabian/papers/tissec2010.pdf)

~~~
EvangelicalPig
I've never tried doing pcaps with Discord voice chat, but it would be
interesting to try "estimating" what was said based on the amount of voice
traffic transferred. Might be harder if it still looks like standard Discord
TLS "control traffic" blending in though, no?

~~~
wahern
Keystroke timing analysis has some solid research. See, e.g., Timing Analysis
of Keystrokes and Timing Attacks on SSH
([https://people.eecs.berkeley.edu/~daw/papers/ssh-use01.pdf](https://people.eecs.berkeley.edu/~daw/papers/ssh-use01.pdf)),
Don’t Skype & Type! Acoustic Eavesdropping in Voice-Over-IP
([https://arxiv.org/pdf/1609.09359.pdf](https://arxiv.org/pdf/1609.09359.pdf)).

Deciphering content from latencies in packetized speech is likely much more
difficult, but I wouldn't put much stock in it being too difficult.

Which is to say, if you're transferring high-value information assets over
VoIP you should probably assume it's decipherable. That doesn't mean you
should change what you're doing. You could simply say, "M'eh, I'll worry about
it when it becomes a thing." But I wouldn't assume it's confidential to
someone willing to invest the time to target and capture the conversations.
And I might leave a few choice comments in the source code and documentation
so nobody could excuse imprudent reliance on confidentiality with, "But nobody
warned me".

------
hfourm
This comment isn't really related. I obviously love Discord and have been
using it pretty much since it became an option, but I wish there was a better UX
for people who are in TONS of channels/servers. I find it hard to navigate all
my current servers with just the icon -- especially when server owners are
changing their avatars frequently.

~~~
anafh83
I think their UX is awful, and their client is very slow in channels with
thousands of users. They should open up their protocol so alternative clients
can be created, or at least some IRC gateway.

I haven't made many calls but the sound was always crisp.

~~~
marksomnian
> open up their protocol

Both the REST API (that official clients use internally too) and the
real-time WebSocket protocol are described here:

[https://discordapp.com/developers/docs/reference](https://discordapp.com/developers/docs/reference)

[https://discordapp.com/developers/docs/topics/gateway](https://discordapp.com/developers/docs/topics/gateway)

~~~
Ycros
Unfortunately you can't use these APIs to create third party clients because
that violates their ToS, they are meant only for writing bots. If they catch
you using these APIs with your full account credentials they'll ban your
account.

~~~
tenryuu
They can't catch you using these apis since these are the same apis the first
party client runs off of

~~~
Fnoord
Of course they can catch you. If your client isn't 100% the same as the
original, there are going to be differences which "forensic" tools are going
to expose.

~~~
tenryuu
Forensic net requests? You can MITM the entire traffic. Mimicking its nature
isn't the hardest task in the world if all the data is handed on a silver
platter, and it's even easier when their client code is uncompiled JavaScript.
There are always people data mining the client.

If you're thinking they're going to check the actual software. Like ass I'm
going to let someone into my house to check if I'm breaking their terms of
service

------
quackerhacker
So I have an affection for Discord...so I appreciate posts like this.

>"Using the WebRTC native library allows us to use a lower level API from
WebRTC (webrtc::Call) to create both send stream and receive stream."

So I'm gathering that discord's voice servers receive multiple persistent
connections, then compress the audio streams for delivery to each end user.
THIS part is where I can't imagine the on-the-fly cpu usage. Each client's
receiving compression needs to also negate their own audio to prevent an echo
effect (no point to hear your own voice), but it also means separate
compression streams per user.

>" All the voice channels within a guild are assigned to the same Discord
Voice server."

I imagine this helps significantly with I/O in converting live streams into 1
stream per end user. I've dealt with video compression (only in ffmpeg) and
live syncing time stampings, and I can say from experience that, this is no
easy feature. I understand this is audio streams (so lower overhead), but
still the persistent voice server needs to handle the incoming connections,
web socket heartbeats (negligible), compression (high I/O), and deliver the
streams (high memory usage too).

I'm impressed, but _would love to hear the specs on the media servers and
their DL /UL speeds_. My old setup to deliver live video (in sync and
compressed) was 6 mini-itx's, 4GB of ram per board, and i3's...my bottleneck
was my isp, which I solved with multiple docsis modems and an internal switch
(each board had 2 ethernet ports).

~~~
jhgg
We don't transcode audio/video on the server. Each stream is processed and
muxed by the client. The server is merely relaying rtp packets and tagging
them with a given ssrc per peer. The client does the rest of the work.
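
To make "relaying RTP packets and tagging them with a given SSRC" concrete, here is a minimal Python sketch. It only assumes the standard fixed RTP header layout from RFC 3550 (SSRC in bytes 8 to 11); the function names and the `ssrc_by_peer` mapping are hypothetical, and this is not Discord's implementation.

```python
import struct
from typing import Dict, List, Tuple

RTP_HEADER_LEN = 12  # fixed RTP header size (RFC 3550), before any CSRC entries


def retag_ssrc(packet: bytes, ssrc: int) -> bytes:
    """Return a copy of an RTP packet with its SSRC field replaced.

    The payload is left untouched: no decoding, mixing, or re-encoding.
    """
    if len(packet) < RTP_HEADER_LEN:
        raise ValueError("packet shorter than the fixed RTP header")
    if packet[0] >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    # The SSRC occupies bytes 8..11 of the header, big-endian.
    return packet[:8] + struct.pack("!I", ssrc) + packet[12:]


def relay(
    packet: bytes, sender: str, ssrc_by_peer: Dict[str, int]
) -> List[Tuple[str, bytes]]:
    """Forward one packet to every peer except the sender.

    `ssrc_by_peer` is a hypothetical table mapping each peer to the SSRC
    the server assigned to it.
    """
    tagged = retag_ssrc(packet, ssrc_by_peer[sender])
    return [(peer, tagged) for peer in ssrc_by_peer if peer != sender]
```

Because the sender is simply skipped in the fan-out, nobody hears their own voice, and no per-listener mix has to be produced on the server.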

The bulk of the user-space time on the SFU is spent doing encryption
(xsalsa/dtls). We also avoid memory allocations in the hot paths, using
fixed-size ring buffers as much as possible.
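
For readers unfamiliar with the pattern, a fixed-size ring buffer can be sketched as below. This is an illustrative Python version with hypothetical names; Python still allocates small objects internally, so it shows the structure (one up-front allocation, slots reused in a circle) and not the performance of a real SFU.

```python
class PacketRing:
    """Fixed-capacity ring of preallocated packet slots.

    All storage is allocated once in __init__. When the ring is full,
    the oldest packet is overwritten.
    """

    def __init__(self, slots: int, slot_size: int) -> None:
        self._buf = bytearray(slots * slot_size)  # allocated once, up front
        self._view = memoryview(self._buf)
        self._lengths = [0] * slots
        self._slots = slots
        self._slot_size = slot_size
        self._head = 0   # index of the next slot to write
        self._count = 0  # number of valid packets stored

    def push(self, packet: bytes) -> None:
        if len(packet) > self._slot_size:
            raise ValueError("packet larger than slot size")
        start = self._head * self._slot_size
        self._view[start:start + len(packet)] = packet
        self._lengths[self._head] = len(packet)
        self._head = (self._head + 1) % self._slots
        self._count = min(self._count + 1, self._slots)

    def pop(self) -> memoryview:
        """Return a view of the oldest stored packet."""
        if self._count == 0:
            raise IndexError("ring is empty")
        tail = (self._head - self._count) % self._slots
        self._count -= 1
        start = tail * self._slot_size
        return self._view[start:start + self._lengths[tail]]
```

The returned view points into the shared buffer, so a caller has to consume it before that slot is overwritten by a later `push`.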

Additionally, we coalesce sends using sendmmsg, to reduce syscalls in the
write path:
([http://man7.org/linux/man-pages/man2/sendmmsg.2.html](http://man7.org/linux/man-pages/man2/sendmmsg.2.html))

I posted some about the specs here:
[https://news.ycombinator.com/item?id=17954163](https://news.ycombinator.com/item?id=17954163)

~~~
quackerhacker
If Discord is basically proxying the raw packets from one client to the
others, isn't that wasted bandwidth (for Discord, not the clients)? I
understand from the post that the goal is to mask the IPs of the users,
to protect user privacy and remove the DDoS vector. Kudos on silence detection
to save overhead.

So video w/audio broadcasting has to be compressed client side, then proxied
through Discord's media servers to the end users. That's pretty smart...I
just wish that I could send my raw stream to a LAN host so I could offload
the compression, and allow my LAN host to handle delivery (I'm a Nitro user).

~~~
jhgg
Would rather waste bandwidth than CPU cycles in this case. Would take way too
much CPU time to mux audio streams together server-side, and then recompress.
(Means we have to buffer data for each sender, deal with silence, deal with
retransmits and packet drops, have a jitter buffer, etc...). No way we'd be
able to hit the # of clients we want per core with that overhead. Our SFUs
are intentionally very dumb for this reason.

Also, muxing server side means we can't do things like per-peer volume and
muting, without having to individually mux and re-encode for each user in the
channel depending on who they have muted and the volumes they have set per
peer (which would explode CPU complexity even further).

So, in this case, bandwidth is cheap, let's use (and waste) some, in an effort
to simplify the SFU, and also, make it more CPU efficient. Default audio
stream is 64kbps (or 8 KB/sec), per speaking user.
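
The bandwidth side of that trade-off is easy to work out. The sketch below uses the 64 kbps figure from the comment and assumes the simplest forwarding model, where each speaking user's stream is relayed to every other member of the channel; it ignores RTP/UDP/IP header overhead, and the function names are illustrative.

```python
def stream_bytes_per_sec(kbps: int = 64) -> float:
    """Convert a stream bitrate in kilobits/sec to bytes/sec."""
    return kbps * 1000 / 8


def sfu_egress_bytes_per_sec(members: int, speakers: int, kbps: int = 64) -> float:
    """Server egress when each speaker is forwarded to every other member.

    Assumes speakers <= members and that silent users send nothing.
    """
    return speakers * (members - 1) * stream_bytes_per_sec(kbps)


def sfu_ingress_bytes_per_sec(speakers: int, kbps: int = 64) -> float:
    """Server ingress: one stream per speaking user."""
    return speakers * stream_bytes_per_sec(kbps)
```

One 64 kbps stream is 8,000 bytes/sec. Under these assumptions, a 10-person channel with 2 people talking costs the server 2 × 8,000 = 16,000 bytes/sec in and 2 × 9 × 8,000 = 144,000 bytes/sec out.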

------
sdegutis
I made several websocket-based apps that I want to turn into Electron desktop
apps where people can use them only with their friends, but WebRTC has been
really confusing and hard to know how to get started with. Especially
considering ideally I don't want to have to host a server, but all the P2P
JavaScript libraries that I found seem to assume you'll at least have a server
for hosting the lobby to look for peers.

~~~
Klathmon
The fact is that you can't easily do the last part without a central server.

There might be some public or semi-public servers for this kind of thing
available somewhere, or the alternative that I played with a few years ago was
to compress and base64 encode the connection information into a string, and
allow users to share the link with friends via whatever method they want, and
that can then be expanded client side and used to establish the connection.
(Or something like qr codes I've also seen used)
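
The "compress and base64 encode the connection information" idea could look roughly like this in Python. It is a sketch under assumptions: the offer is treated as a plain JSON-serializable dictionary (for example a WebRTC session description with `type` and `sdp` fields), and the function names are made up for illustration.

```python
import base64
import json
import zlib


def encode_offer(offer: dict) -> str:
    """Pack connection info into a compact, URL-safe string."""
    raw = json.dumps(offer, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw, 9)).decode("ascii")


def decode_offer(token: str) -> dict:
    """Reverse encode_offer on the receiving client."""
    raw = zlib.decompress(base64.urlsafe_b64decode(token.encode("ascii")))
    return json.loads(raw.decode("utf-8"))
```

The resulting string can be appended to a link (for instance after a `#` fragment, so it never reaches a web server) or rendered as a QR code, then expanded client side to establish the connection.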

Sadly I never finished that project, so I don't really have any code to show
you, but in theory it should work okay.

~~~
sdegutis
I mean if everyone already has a way to contact their friends (discord) and
say "hey let's play this game together", sharing that link does away with the
need for a _coordination_ server. But there's still the need for at least a
server, even if it's just one of the clients that knows how to take charge
(like Minecraft).

~~~
Klathmon
Yeah, I skipped over a lot of the details, but you would need someone to
choose to be a "host" (at least in terms of a "host" for the connection
signaling stuff), but you can also implicitly make it the person who created
the link safely.

~~~
derefr
If everybody is behind NATs (without UPnP privileges), nobody can be the
"host" for connection-signalling purposes.

~~~
Klathmon
There are public ICE/STUN servers that can facilitate NAT-punching; Google
even hosts one. But I didn't think that was the aspect the root comment was
talking about. I read it as more the aspect of "A is hosting a room called
'stuff', and B and C want to browse the list of rooms and connect to it".

------
Karrot_Kream
I really wish there were non-browser implementations of WebRTC. So far, it
seems like the standard as-it-is is defined by browser code and browsers
contain the canonical implementations.

~~~
brian-armstrong
Does
[https://webrtc.googlesource.com/src](https://webrtc.googlesource.com/src) not
count?

~~~
Karrot_Kream
Right, so it's a chromium C++ repo that you need to create bindings for to
interact with. I have no doubt that an organization like Discord has the
engineering resources to either write their own bindings to WebRTC or to roll
their own implementation of the components, but if I'm an individual
developer that wants to interact with WebRTC without a browser, then it's
pretty difficult.

One common desired use case is using WebRTC for p2p torrent communications.
Right now the best way to do this is in browser, or to use an Electron app
that can bridge the WebRTC clients with the standard desktop torrent clients.

~~~
jgrowl
There's nothing tied specifically to the browser, but it is written in C++. It
seems like the discord folks just built a layer on top of it. It's been a
while since I played around with it, but you just have to build the
peerconnection library. This guide seems pretty close to what I remember
having to do:
[https://webrtc.org/native-code/development/](https://webrtc.org/native-code/development/)

It still leaves you writing in c++, but it lets you build a self contained
server without any browser stuff. There also were java bindings which is what
I used when I was experimenting with it. You just have to build your .so and
.jar files.

EDIT: Here's some scala code that I wrote that uses the native bindings.
(Please forgive the messy test project that I have abandoned).

[https://github.com/jgrowl/livehq/blob/master/media/src/main/...](https://github.com/jgrowl/livehq/blob/master/media/src/main/scala/tv/camfire/media/webrtc/WebRtcHelper.scala)

------
Siecje
Does anyone have a solution to prevent notifications when a server creates a
new channel? Right now I am notified for every message until I mute the channel.

~~~
ihuman
You can mute the server, or change the default notification setting to
@mentions only for the server

------
Operyl
If you skip the use of ICE, why is it that 25% of the time I get a message
relating to ICE and it gets stuck there?

"CONNECTION_STATUS_ICE_CHECKING" ICE Checking.

------
Reedx
> we have seen 1000 people taking turns speaking

Wow. Has anyone here been in a channel like that? I'd imagine it to be
complete chaos.

But I suppose it's possible with good moderation and/or bots to ensure people
take turns? Is that what they do?

~~~
crystalPalace
I've been in a couple channels with several hundred people. The admins handled
it by shouting until everyone was quiet, enforcing one person speaking at a
time, and swiftly muting, moving, or banning anyone who disobeyed. I'm not
sure if Discord's API is flexible enough for a bot like that.

~~~
Operyl
The API is flexible enough for a bot like this, I'm contemplating opensourcing
mine at some point! :)

~~~
zrobotics
Please do, it would be an actual act of charity! I've left a lot of
interesting servers due to the voice chat being clogged.

------
bg0
My buddies and I use Discord as an easy conference call group chat while
playing games (no push to talk, mic is always on).

One thing I've found on the app is that when a person/voice is far away from
the mic, it often glitches and doesn't send all of the audio data over like,
for example, a normal FaceTime Audio call would.

Not sure why this happens, but just thought I'd put that out in the ether.

~~~
gizmo385
Sounds like you just need to adjust your microphone sensitivity settings? If
you have the mic on voice activation and the person is far from their mic then
it might be that they're just not loud enough (from the perspective of their
microphone) to actually trigger Discord. That's my guess at least.
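
A voice-activation gate of the kind being guessed at here can be sketched as a simple level threshold. This is an illustration only: the thread does not say how Discord's detector actually works, and the function names, the 16-bit sample assumption, and the default threshold are all hypothetical.

```python
import math
from typing import Sequence


def rms_dbfs(frame: Sequence[int], full_scale: int = 32768) -> float:
    """RMS level of a frame of 16-bit samples, in dB relative to full scale."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0:
        return float("-inf")
    return 20 * math.log10(math.sqrt(mean_square) / full_scale)


def should_transmit(frame: Sequence[int], threshold_dbfs: float = -40.0) -> bool:
    """Open the gate only when the frame is at least as loud as the threshold."""
    return rms_dbfs(frame) >= threshold_dbfs
```

With a gate like this, a speaker who is far from the microphone can fall below the threshold on quieter syllables, so parts of their speech are never sent. Lowering the threshold (raising the sensitivity) lets those frames through, at the cost of passing more background noise.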

~~~
bg0
Ah, never thought of it like that. I always perceived it as an "always on"
scenario. Just found it in the settings and I'll give it a try, thank you!

------
tylerplz
> Since it’s the only service directly accessible from the public Internet, we
> will focus on Discord Voice server failovers.

Why is the Gateway not directly accessible from the public Internet?

~~~
jhgg
It's a clustered service that is behind an SSL Load Balancer[1] that is behind
Cloudflare. So not really a viable DDoS target.

[1]: [https://cloud.google.com/load-balancing/docs/ssl/](https://cloud.google.com/load-balancing/docs/ssl/)

------
dep_b
I've been reading the article hoping to find out a bit more about the actual
voice server but the SFU is actually not very deeply explained at all.

~~~
jsjohnst
That’s because there isn’t a lot to it from the sounds of it in this post:

[https://news.ycombinator.com/item?id=17955309](https://news.ycombinator.com/item?id=17955309)

~~~
dep_b
I would implement a similar dumb pipe as well for a very different
application. I mean the application still needs to set up the WebSockets with
the clients, which isn't trivial to code yourself.

------
winterismute
I love Discord, but am I the only one who thinks that the name is hindering
wider adoption? It's hard to suggest to some people that an app with that name
(and to a lesser extent visual theme) is the Skype+Slack+others replacement
they are actually looking for...

~~~
Vishnevskiy
We found the name is not an issue as growth continues at a rapid pace.

We are already multiple times larger than Slack :)

------
fulafel
"Instead of DTLS/SRTP, we decided to use the faster Salsa20 encryption."

This is just confused and sounds quite worrying :I

~~~
fulafel
Judging by the votes, this may need spelling out:

Salsa20 is a stream cipher. DTLS and SRTP are higher level security protocols
that use ciphers (among other things) as building blocks, to ensure things
like replay protection, integrity protection, mutual authentication, secure
session key agreement and forward secrecy in addition to confidentiality.

If you replace an engineered security protocol with a raw cipher and key, you
create many vulnerabilities and make a much less secure system. VOIP
applications are especially vulnerable to replay attacks, for instance.
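
To show the kind of machinery a protocol adds on top of a raw cipher, here is a sketch of sliding-window replay protection, in the general style used by IPsec and SRTP. It is an illustration and says nothing about what Discord does or does not implement. It assumes sequence numbers are plain increasing integers (real RTP sequence numbers are 16-bit and wrap, which SRTP handles with a rollover counter), and that `accept` is only called after the packet's authentication tag has been verified.

```python
class ReplayWindow:
    """Reject packets whose sequence number was already seen or is too old."""

    def __init__(self, size: int = 64) -> None:
        self._size = size
        self._highest = -1  # highest sequence number accepted so far
        self._bitmap = 0    # bit i set => (highest - i) was already seen

    def accept(self, seq: int) -> bool:
        """Return True if `seq` is fresh, False if replayed or too old."""
        if seq > self._highest:
            # Newer than anything seen: slide the window forward.
            shift = seq - self._highest
            mask = (1 << self._size) - 1
            self._bitmap = ((self._bitmap << shift) | 1) & mask
            self._highest = seq
            return True
        offset = self._highest - seq
        if offset >= self._size:
            return False  # fell out of the window: cannot tell, so reject
        if self._bitmap & (1 << offset):
            return False  # duplicate: this sequence number was already seen
        self._bitmap |= 1 << offset  # late but fresh: mark as seen
        return True
```

Without a check like this, an attacker who records an encrypted voice packet can resend it later and the receiver will decrypt and play it again, even though the attacker never learned the key.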

A better approach would be to continue using DTLS and SRTP, and use Salsa20 as
the SRTP cipher.

Non-crypto-experts saying things like "we replaced <established security
protocol> with <my own idea> and it's much faster" is a well known bad sign -
It often indicates the person has fumbled things without realizing it.

~~~
Orphis
They say they have done it for performance. I am wondering if they wanted to
optimize for the client performance or the SFU's.

