How Discord Handles Two and Half Million Concurrent Voice Users Using WebRTC (discordapp.com)
649 points by jhgg on Sept 10, 2018 | 244 comments

I really wish Discord would implement an audio compressor (or if they use one, make it much more aggressive) for voice chat. For example, Mumble doesn't require manually adjusting anyone's volume - instead, it uses a compressor to automatically normalize volumes for everyone.

In contrast, Discord has a volume slider that I end up having to adjust for most people. That requires manual tweaking, often while I'm playing a game, and isn't future-proof; just the other day, one of my friends changed his microphone setup and was extremely loud (I had him at 200% when in the new setup he should've been at about 75%).

This is really my only complaint with Discord's VOIP functionality. It is a major step backward from Mumble. (Also, for some reason the Discord community mostly uses voice activated mic instead of push-to-talk, but I don't think that's Discord's fault.)
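
To make the compressor idea above concrete, here is a rough sketch of automatic level normalization: a hard-knee compressor followed by a capped makeup gain. This is only an illustration, not how Mumble or Discord actually process audio. The function names (`compress`, `normalize`, `tone`) and every constant (threshold, ratio, target level, gain cap) are assumptions picked for the example, and a real compressor would smooth its gain with attack/release times instead of working per sample.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of float samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0

def compress(samples, threshold=0.1, ratio=4.0):
    """Hard-knee compressor applied per sample (no attack/release smoothing).

    Magnitudes above `threshold` are reduced so that `ratio` units of input
    above the threshold yield one unit of output above it.
    """
    out = []
    for s in samples:
        mag = abs(s)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio
        out.append(math.copysign(mag, s))
    return out

def normalize(samples, target_rms=0.1, max_gain=10.0):
    """Scale a block so its RMS approaches `target_rms`, capping the gain."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = min(target_rms / level, max_gain)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

def tone(amplitude, n=480):
    """Hypothetical test signal: 10 ms of a 440 Hz sine at 48 kHz."""
    return [amplitude * math.sin(2 * math.pi * 440 * i / 48000) for i in range(n)]

# A very loud and a very quiet speaker end up at the same level.
loud = normalize(compress(tone(0.9)))
quiet = normalize(compress(tone(0.05)))
print(round(rms(loud), 3), round(rms(quiet), 3))
```

With something like this on the receiving side, a friend changing microphones would not require touching a per-user slider, which is the behavior the comment is asking for.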

> voice activated mic

Which is a very sane default for most people

The only reason you'd want to use push to talk is either you have a lot of people in the channel, you have a loud background or you don't want people hearing stuff not designated for the discord voice (if you're streaming, have irl people in the same room - that kinda stuff)

>Which is a very sane default for most people

I couldn't disagree more. Having an open mic setup for gaming is both insane and completely inconsiderate to everyone else in the channel/group.

It's hard not to personally judge people that join a group, whether that's a friend of a friend, someone looking to try out, public LFG lobbies, etc and you can immediately tell they have an open mic configuration. It's maddening.

Breathing, keyboard smashing, mouse clicking, background noise, eating, or exasperated whining every time you die/lose are all annoyances you push on the rest of the channel/group with an open mic configuration. At best it's an annoyance. At worst it's actively disrupting people's game play and team communication.

For the love of all that is decent with humanity, use push to talk.

> For the love of all that is decent with humanity, use push to talk.

I’d love Zoom, WebEx, GoToMeeting, et al., to implement push-to-talk, and give an option for mandatory use to conference call organizers.

I think Zoom now has a preference for holding down space to temporarily unmute your mic. But I fully agree, push-to-talk should be absolutely mandatory for any calls with more than 2 people in them.

Zoom has to have focus which rules out using PTT while presenting or multitasking while meeting.

http://mizage.com/shush/ for anyone on a Mac is a godsend. Emulates a hardware PTT (can be toggled to push to silence) before audio ever gets to applications. Best $5 I've ever spent.

What's so special about this? Discord has PTT and PTS native.

The reply is suggesting it as a solution to the problem for apps other than Discord, which the parent comment mentions.

AFAIK Ventrilo, TeamSpeak, and Mumble each support this

It's definitely common in gaming-focused apps, but not in the general or business focused apps like Skype, Slack, Teams, etc.

We use Teams on our team, and my current PTT technique is to Ctrl+Shift+M to unmute and mute every time. Très annoying.

Not to mention that often the mute button in many of these apps needs to execute hideous amounts of WebRTC-related Javascript in Electron, delaying your mute significantly. Signalling the CoreAudio OS system directly as Shush does is the right layer for the job.

IIRC even Skype has that option. Can you not bind this directly in Windows/macOS?

Skype only has "push to mute."

Zoom does, long press the space bar to talk

Voice activation on teamspeak was much more reasonable and accurate. Discord's is always either trigger happy or drops quiet speech.

Also voice activation is a boon for competitive games where you can't spare a hotkey and finger during play.

I read a story about someone who used a step-to-talk setup for their game comms: the motion was so ingrained that, one day while driving in traffic, they inadvertently floored the gas pedal while having a conversation with a passenger.

I don't think they hit anyone, but they did unplug the foot pedals from their PC the moment they got home.

When using pedal based PTT in the field, I would put the PTT pedal backwards and use my heel to develop different 'muscle memory' in relation to driving, etc.

Worked great!

Can confirm. I bought a cheap USB pedal and use it for push to talk, it works great.

>Also voice activation is a boon for competitive games where you can't spare a hotkey and finger during play.

I've been playing competitive games for 15 years. I think you're overstating how difficult it is to press one other button. It's also contradictory to the common practice of clans/scrim groups having a PTT policy.

I think people saying "Well X does it better." are missing the point. It's always been a problem. Ventrilo, Mumble, Teamspeak, etc. If you've never considered it a problem there is a good possibility a lot of people around you were just tolerating it.

From my ~10y experience with multiplayer games voice activation IS clearly superior. Not sure what games you are talking about (and what's "competitive" level for you), but there were never enough keys for in-game actions and scripts. PTT is shooting yourself into the foot just because you can't find not-infantile teamplayers

>voice activation IS clearly superior.

Why? That's just a bald assertion with a butt load of evidence to the contrary.

>and what's "competitive" level for you

I'm pretty serious about competitive play. Started with CS 1.6, moved to SC2, then RB6:Siege, then GO, and now OW. I've consistently ranked in the top tier in each of those games. I moved my gym time to my lunch break and run when I wake up so I can play games in the evening and scrim when I'm on teams. It's an addiction.

>PTT is shooting yourself into the foot just because you can't find not-infantile teamplayers

I like how instead of addressing the technological merits of voice activation you just call people "infantile" to avoid actually discussing the subject.

Mumble has a really great voice detection algorithm, so I never hear your parade of horribles.

Not sure if discord has anything as good, that might be the difference.

> For the love of all that is decent with humanity, use push to talk.

This is super game/group dependent. For example if I'm playing with a 5 stack I'll have open mic on volume cutoff. We're casually chatting pre-game and ptt is not useful there, then dota starts and I cannot manage the additional load of regularly using ptt key whilst playing so I'll end up talking way less and it's bad

On the other hand in a 250 man fleet in Eve Online - you'd best believe everyone with open mic is getting banned from the server.

It’s just ignorance to do this. It’s definitely possible to run an open mic on Discord without any of these sounds. Open your audio settings, where you can set the threshold very clearly for what makes it in and what does not.

This falls apart for people that speak very quietly though. If there is no volume difference between your chewing sounds and your speaking sounds you __need__ push to talk.

Open mic and voice activation are two different things. The first is permanent sending of your microphone input, the last one is only sending when there is activity, hence the name.
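
The distinction drawn above can be sketched in a few lines. This is a toy model, not any client's real implementation: audio is a list of frames of float samples, and the names `open_mic` and `voice_activated`, the threshold, and the hang time are all assumptions made for illustration.

```python
import math

def frame_rms(frame):
    """RMS level of one non-empty frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def open_mic(frames):
    """Open mic: every captured frame is transmitted."""
    return list(frames)

def voice_activated(frames, threshold=0.05, hang_frames=2):
    """Amplitude gate: transmit frames at or above `threshold`, plus
    `hang_frames` frames after the last loud one so word endings survive."""
    sent = []
    hang = 0
    for frame in frames:
        if frame_rms(frame) >= threshold:
            hang = hang_frames
            sent.append(frame)
        elif hang > 0:
            hang -= 1
            sent.append(frame)
    return sent

silence = [0.0] * 160
speech = [0.2] * 160        # stand-in for a clearly audible voice
chewing = [0.04] * 160      # stand-in for low-level mouth noise
quiet_speech = [0.04] * 160 # a soft talker at the same level as the chewing

frames = [silence, speech, silence, silence, silence, chewing]
print(len(open_mic(frames)), len(voice_activated(frames)))
```

The last two test signals show the failure mode raised earlier in the thread: if quiet speech and chewing sit at the same level, no single threshold can pass one and block the other.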

>Open mic and voice activation are two different things.

In theory but not in practice.

Also in practice. You just need to tell people, and by tell I mean command them to fix their stuff.

"working as intended, just patronize other users into fiddling with their settings until it's acceptable"

sounds to me like it's not a sane default

Sometimes it can be hard to know that your setup isn't working as you expected.

Whilst commanding someone to fix it should be the last resort, most people with voice activation would rather the setup work for everyone in the server so they can keep using it.

On my setup for example I had a desk mic which sat on my wooden desk. People had to point out to me that there were sudden sharp noises coming from my mic. Eventually it was isolated to my mouse hitting the desk and vibrating up the base into the mic. I placed the mic on a bit of foam and everyone was happy. These types of activations are not picked up during self testing easily.

People don't hear their microphone, so they won't notice wrong settings. Telling them is the only choice you have. If you think "command" is too strong of a word, use "urge".

Have you played an online game recently? Not exactly the most eager to take advice from strangers, those folk.

You can't command youngsters when you don't have authority over them. Asking it in a friendly tone is much more reasonable though it does require an extra keybind (I recommend rebinding caps lock to it).

Yes. You can if they come to your server. And then they are not exactly strangers anymore. Nobody cares about ingame voice as those are usually PTT per default.

There are situations where an open mic is useful, such as conference calls and high-intensity multiplayer FPS and MOBA games. Initiating such a group call should be as explicit as possible, however.

Can Discord server admins enforce PTT? I believe on Teamspeak there was a setting that disabled voice activation for that server.

Personally, I hate the sound of people chewing and smacking their lips directly in my ear as they work their way through their 'fresh that afternoon' gas station nachos.

But that's just me :)

> "They actually do talk, then. They use words, ideas, concepts?"

> "Oh, yes. Except they do it with meat."

> "I thought you just told me they used radio."

> "They do, but what do you think is on the radio? Meat sounds. You know how when you slap or flap meat it makes a noise? They talk by flapping their meat at each other. They can even sing by squirting air through their meat."


What a haunting short film. (I understand that it's adapted from a short story.)

Tangent: It's ironic that the dialogue between the two aliens is spoken by humans. After all, the actors are rendering the dialog with meat sounds. I think it would have been better to use a speech synthesizer that's not based on actual recordings of meat sounds -- like eSpeak, DECtalk, or ETI-Eloquence. DECtalk, at least, is flexible enough to speak with whatever intonation you want, so the dialogue wouldn't necessarily come out sounding flat.

It is not just you. I know of several gaming sites that will block or ban people for leaving their mic open or making breathing sounds into it.

if you are in control of the Discord server you can force push to talk in voice channels

Also vaping incessantly...

Absolutely not, the default should be PTT.

I can't tell you how many times I've run into users who just cannot get their sensitivity right. So in turn, you end up with a ton of people coughing, sneezing, talking to others, fans, and insert your other favorite background noise. Most users just don't care to set it up right, and it's impossible to get them to adjust it (unless you have some type of Discord Mod / Admin privs on the server at the time to force them to fix it). Otherwise, you have a bunch of users who have to put up with Joe-Schmoe chewing away on his mic.

It's why the Discord servers I'm in 100% completely enforce PTT, zero voice activation allowed.

It's not a sane default for people gaming with mouse/mechanical keyboard. The voice activation can (in general) be set not to have squad breathing and chewing in your ear - but you'll hear every crack of desperate mouse movement...

It might be better with people using Xbox controllers.

My modmic setup doesn't transmit me typing on my Cherry MX Brown keyboard.

And we complain a lot when someone on the server has an improper voice activation threshold, so everyone quickly fixes theirs.

Also rocking a ModMic5 and a keyboard with Brown switches. Never had an issue transmitting when I wasn't actually speaking. Both in Discord and Slack.

>>It might be better with people using Xbox controllers.

Not necessarily, I have had to disable my mic when playing certain games, apparently I'm violent enough with the analog sticks that hitting the end of travel will activate discord.

The joy of hearing sick people coughing or sneezing. Or raging people. Or their family. Their barking dogs.

The best being when they use speakers so you get some echo. That's usually when they get muted because fuck their lazy-ass.

    you have a lot of people in the channel
Which is common

    you have a loud background
Which is also extremely common. Fans, open windows, keyboard, neighbors yelling, etc.

     you don't want people hearing stuff not designated for the
Like the fact you're a goddamn mouth breather. Fuck I refuse to join most discords because most of the time I have to hear people mouth breathe into their mics.

I think you outlined several major reasons to use push to talk. I exclusively use push to talk, specifically so I am not that guy in raid with an open mic, breathing heavy, with police sirens wailing in the background.

Hi, I'd like to introduce you to the very large gaming crossover niche of mechanical keyboards.

What were you saying about open mics?

> Also, for some reason the Discord community mostly uses voice activated mic instead of push-to-talk, but I don't think that's Discord's fault.

Definitely the community, as Discord provides an option to require PTT on a channel-by-channel basis.

Not sure if this is the case here, but the chosen defaults and the UX around it could easily make it Discord's fault.

IMO it would be significantly more user-hostile to disallow users choosing whether to use PTT unless the benevolent server owner allows it.

I know people who consider voice-activation to be an accessibility feature, as otherwise it would be too difficult/distracting to be able to use VOIP.

A community should absolutely be able to disallow voice-activation, but disallowing it by default across the platform would be an issue.

No, I'm speaking more about hiding voice-activation mode somewhere into accessibility settings page, as opposed to getting big shiny button "activate voice-activation" after first login or even enabling it by default with an option to switch to PTT afterwards.

I wonder if the difference in PTT vs voice activated has to do with the common games being played at the time.

PTT is mandatory running an MMORPG raid with 40 people. Voice activated is perfect for playing a game while talking with 3-8 people at a time.

Eh we raided Mythics with 25 where most people had voice activation. Was perfectly fine.

I think it's more about who you're playing with, if you are good friends with the people that you are on voice with, chances are you're quite alike and considerate to each other.

Wouldn't the sheer overlap of sounds of people mashing on their keyboards drive you insane? I'm usually fine with voice activation in a smaller group (like maybe 5 people) but once you get up to 25 people, I feel like the background noise could seriously overpower some really important callouts that one might need to hear (in the context of something like Mythic raiding).

>Wouldn't the sheer overlap of sounds of people mashing on their keyboards drive you insane?

That's a hint that the voice activation level is not configured correctly.

This is not high enough. Spending a few minutes setting voice activation is critical, and can go a long way toward helping with issues people have with voice activation. Background sounds should not trigger voice. Typing should not trigger voice. The issue is many people don't do this, and so make the assumption that voice activation is bad.

Couple that with people not being considerate of others (such as not muting when you do need to make noise that will be picked up), it gives voice activation a bad rap.

But in terms of pure quality, voice activation is always, in my mind, better than PTT. Granted, the last few years I've only really ever played with friends and in small groups, so when people forget, we aren't righteous assholes about it.

> Background sounds should not trigger voice. Typing should not trigger voice.

That's right. Voice activity detection (VAD) is not the same as sound detection. WebRTC even has a really good VAD built into it that is extremely easy to use and dynamically adapts to the current audio environment. See e.g. https://github.com/wiseman/py-webrtcvad and https://github.com/dpirch/libfvad for examples where the relatively small VAD code has been pulled out of the giant webrtc corpus.

People also need to know to enable AEC in their audio driver, which completely solves the problem of whatever sounds they're playing leaking into their mic.

I'd rather slit my wrists than have to herd a raid team any more than I already have to by micromanaging their voice settings. PTT or mute.

Voice activation levels are set by the person themselves. You don't manage someone's for them; you can only change someone else's volume.

And who makes them change it? I would have to micromanage every single person into setting their voice activation right.

Yeah, once. If you can't even be bothered to do that, raid leading (hell, even just raiding) isn't for you. There's a lot more frustration to come.

Been doing it for a couple of years. It's not as simple as you think to deal with something like this. If you think it's a once and done deal, you've obviously never had to deal with stuff like this in a group of more than 5 people.

It is. Don't extrapolate your experience to a broader set.

It isn’t. Setups change. Stuff gets uninstalled and reinstalled. Old members leave, new ones join. Environments change. Fiddling with that shit isn’t worth the effort and it’s a complete waste of time when a perfect solution exists.

No level of voice activation can work well with Cherry Blues

Yes, yes it can.

It never did for me. I have a Blue Yeti set to cardioid, and the levels needed to fully ignore my mx blues also sometimes ignored my voice.

Mumble offers signal to noise detection that can be more effective than amplitude detection in many but not all situations. I'm surprised it hasn't caught on more widely.

Edited for tone.
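
For anyone wondering how signal-to-noise detection differs from a plain amplitude threshold, here is a rough sketch. It is not Mumble's actual algorithm: the class name `SnrGate`, the 6 dB margin, the smoothing factor, and the explicit calibration step are all assumptions for illustration. The idea is to compare each frame's energy against a running estimate of the background noise rather than against a fixed level.

```python
import math

def mean_square(frame):
    """Average energy of one non-empty frame of float samples."""
    return sum(s * s for s in frame) / len(frame)

def amplitude_gate(frame, threshold=0.05):
    """Fixed-threshold detection, for comparison."""
    return math.sqrt(mean_square(frame)) >= threshold

class SnrGate:
    """Detect speech by its margin (in dB) over an estimated noise floor."""

    def __init__(self, snr_db=6.0, floor_alpha=0.1):
        self.snr_db = snr_db            # required margin over the noise floor
        self.floor_alpha = floor_alpha  # how fast the floor follows non-speech
        self.noise_floor = None         # mean-square energy of the background

    def calibrate(self, quiet_frames):
        """Seed the noise floor from frames known to contain no speech."""
        energies = [mean_square(f) for f in quiet_frames]
        self.noise_floor = max(sum(energies) / len(energies), 1e-12)

    def is_speech(self, frame):
        if self.noise_floor is None:
            raise RuntimeError("call calibrate() with background-only frames first")
        energy = max(mean_square(frame), 1e-12)
        snr = 10.0 * math.log10(energy / self.noise_floor)
        if snr >= self.snr_db:
            return True
        # Non-speech frame: fold it into the noise-floor estimate.
        self.noise_floor += self.floor_alpha * (energy - self.noise_floor)
        return False

# Noisy room: a steady fan at 0.1, speech at 0.3.
noisy = SnrGate()
noisy.calibrate([[0.1] * 160] * 5)
fan, loud_speech = [0.1] * 160, [0.3] * 160

# Quiet room: background at 0.001, a soft talker at 0.01.
calm = SnrGate()
calm.calibrate([[0.001] * 160] * 5)
soft_speech = [0.01] * 160

print(noisy.is_speech(fan), noisy.is_speech(loud_speech), calm.is_speech(soft_speech))
print(amplitude_gate(fan), amplitude_gate(soft_speech))
```

In this toy setup the fixed threshold passes the fan and drops the soft talker, while the SNR gate handles both. It still will not help with sharp transients like key clicks that stand well above the floor, which fits the "many but not all situations" caveat.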

How close is the Blue Yeti to your face (and to the keyboard)? You might be able to do better with a headset.

Do you have a pop filter or a foam head? That will help.

Even for Overwatch in a group with 2-3 friends I’d much rather have them use PTT because a lot of people are keyboard-smashing typists who can’t set the mic transmit threshold right.

How can so many people be complaining about keyboard sounds?! I assume they ain't using a mechanical keyboard, and/or they ain't using a laptop with built-in mic, but a headset.

In Overwatch, we can set the input/output volume but no threshold and my mic doesn't have that option.

That's why I also use PTT.

I think that's exactly it, as well as the comparative twitchiness of the game. You don't have time to fiddle with an extra button in something like Overwatch while you can probably exchange a couple of emails and check twitter during a single FF14 GCD.

I used TeamSpeak for a long time but when my group switched to Discord there have been a litany of audio problems. My favorite is people will complain about an intermittent hiss that only stops when I talk. The hiss ends even after I've finished transmitting, and I'm not transmitting the hiss (it happens even when I'm not connected to the audio chat.)

I don't find this to be the case. I have played several games on several servers, including 40 and 50 person channels, and I have never manually adjusted volume. The only adjustment I make is to mute myself so everyone else can't hear me yell at football.

For those interested in open source/decentralized alternatives to Slack/Discord/etc, I've had a pretty great experience with both the voice and video integrated into Riot.im[0] (uses Matrix protocol[1] for chat, and Jitsi[2] for voice/video).

I've said before that I can't really justify nonfree software for something as simple as text chat.

That said, rooms with very large numbers of people trying to communicate with voice/video is one use case where maybe it makes more sense for a commercial product to solve it.

[0] http://riot.im/

[1] https://matrix.org/blog/home/

[2] https://jitsi.org/

Why is the riot "TRY now" link so damn hidden? Two clicks, one long search here https://riot.im/app/#/home No private room? Not working without google? Very bad first impression :(

Not sure what you mean by not working without google?

Iirc there is a captcha on sign up. Same as discord.

isn’t the “try now” link the big flashing button labelled “launch now!” on http://riot.im?

That's how I always start it. Unfortunately that link is on a carousel for some reason, which automatically advances. Definitely not ideal UX in my opinion.

It should be but it takes you to a messy page where nothing happens

Personally I'm sad that people are going for non-decentralized solutions these days. I thought we agreed that decentralized is the way to go.

Everyone hosting their own Ventrilo/Mumble/Ts3 servers had the huge advantage that not only one single company got all data everyone produces. Not to mention outages.

They say in the blog post

> Routing all your network traffic through Discord servers also ensures that your IP address is never leaked

Yeah, except to your centralized company servers.

They also clearly explain why decentralization doesn't work for large rooms:

> Supporting large group channels (we have seen 1000 people taking turns speaking) requires client-server networking architecture because peer-to-peer networking becomes prohibitively expensive as the number of participants increases.

It's unfortunate, but it's just the truth. The reason we aren't seeing decentralized solutions to the problem that Discord solves is technical. It's not because people don't want it or aren't trying. It just doesn't perform well enough.
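
The scaling claim in the quoted passage comes down to simple counting, sketched below. The function names are made up for illustration, and this only counts connections and outgoing streams. It ignores codecs, who is actually speaking, and the relay cost that a client-server design moves onto the server.

```python
def mesh_links(n):
    """Full-mesh peer-to-peer: one connection per pair of participants."""
    return n * (n - 1) // 2

def mesh_uploads_per_client(n):
    """In a mesh, a speaker sends a copy of its audio to every other peer."""
    return max(n - 1, 0)

def client_server_links(n):
    """Client-server: each participant holds one connection to the server."""
    return n

for n in (5, 25, 1000):
    print(n, mesh_links(n), mesh_uploads_per_client(n), client_server_links(n))
# 5 10 4 5
# 25 300 24 25
# 1000 499500 999 1000
```

At the 1000-person channel mentioned in the blog post, a mesh would need each speaking client to push 999 copies of its stream, versus one copy to a server.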

that is not an argument for centralized vs decentralized, that's an argument for client-server vs peer2peer

Not many selfhosted (eg decentralized) solutions use peer2peer because this requires NAT punching and other methods. TS3, Mumble and Ventrilo are all "proxies" in that manner and not peer to peer

Federation is good enough for practically everyone.

> > Routing all your network traffic through Discord servers also ensures that your IP address is never leaked

> Yeah, except to your centralized company servers.

Which most gamers don't care about.

What they do care about, is what follows that sentence: "preventing anyone from finding out your IP address and launching a DDoS attack against you".

Especially for high profile e-sports figures and streamers, that's a very real problem: they want to interact with their fans using chat/voice calls, without giving bad actors the means to DDoS their residential connections (causing disconnects from game matches etc).

That's also why Discord proxies all embedded media from third parties through their own servers (except for some major sites, that are white-listed).

Slightly off-topic, but:

> For clarity, we will use the term “guild” to represent a collection of users and channels — they are called “servers” in the client. The term “server” will instead be used here to describe our backend infrastructure.

I still don't understand why you ever chose the name "server" to refer to something that is many things, but not a server. Guild is a way better name, and doesn't need disclaimers about your self-chosen terminology being confusing.

I believe this is a tradition that has been grandfathered into voice chat apps. Previous ones like Ventrilo have you actually connect to a private server (that someone pays for / hosts).

I second this explanation. Additionally, from a UX perspective, if a user wanted to join your server you had to give them IP/port/pw and they'd type it in manually. So even the users were accustomed to addressing servers. "Server" entered the community lexicon as "the place we chat".

I assumed it was based on IRC servers, because each Discord server can have multiple "channels" just like IRC.

This is most likely the case. Discord's introduction was mainly to pull in the gaming audience using Skype while providing the quality behind Mumble and Ventrilo.

Users colloquially call such communities "servers" because Ventrilo and TeamSpeak historically required actual servers for each community. The terminology has stuck. Discord definitely made the correct decision calling them "guilds" internally and "servers" for user-facing material. Nearly every user has always called them servers.

They used to be called Guilds, even in the user-facing client. I'd imagine a conclusion was drawn at some point that the term Guild is far too niche even inside the already niche video gaming community; it's really only used in MMORPG-style games.

Agreed. I feel like if I told non-gamer friends to join my Discord 'guild' I'd get fewer yeses than if I called it a 'server'. In a way 'server'/'community' feel like much more inclusive terms.

And even in the niche of MMORPG games, it's just one variant of name for the (relatively) permanent groups that players usually organize themselves in. It's rather common in the fantasy/medieval-type MMORPGs where it actually has its origins, but uncommon in pretty much all other themes, especially anything scifi. Eve Online for example calls them "corporations", which is a better fit for a science fiction universe. Anarchy Online just called them "organizations", while Star Trek Online calls them "fleets".

So yes, "guild" is a really bad name for the thing pretty much anyone in the target demographic is used to calling "server", whether that thing actually was a physical or logical server or not. "Guilds" also used to have a voice "server" to host their voice chat activity, but they never equated themselves with their voice server, so even if all games out there called groups of players "guilds", it would still be a really bad choice. Someone at Discord apparently learnt on the job that "naming stuff" is one of the two hardest problems in IT ;-)

I'm a programmer who well knows what a server is to a programmer, but I'm also a gamer and Discord calling its servers servers never seemed weird to me. From the perspective of the user, the term makes sense. The servers have users, those users have permissions, they host files, they host chat, you can install plugins to them.

It would have taken my community a lot longer to switch over from our Teamspeak server [0] and come to Discord if they were called guilds.

We don't play MMOs, aren't very organized, have a very flat hierarchy, (admins, regulars, new users) and don't stick to one game very long.

The word guild comes with too much baggage, and that single word choice would likely mean we'd still be on Teamspeak to this day. I don't think I'm exaggerating that point either.

I never considered "server" a poor name, even knowing that it was probably not one dedicated physical server in reality. It's simply the right nomenclature for the target audience.

[0]: A Teamspeak server is definitely a server as I assume you define them.

Server is the de facto standard terminology that users expect here. Guild would be a horrible name since it comes with a lot of preconceptions, especially for gamers.

Discord started as a gaming app, so most likely a "guild" was their first "grouping unit" and the name is now just legacy.

Can anyone who works on the discord app explain to me: How am I impacted for having (being part of) let's say 100 servers (guilds) on my app versus having 1?

Should I quit the discords that I don't use? I feel like they stay in sync in the background and if I am in 100 servers and they all post say tons of text I will get massive lags.

Does that make sense? Is there a way to prioritize only the server I have active/I am talking in and freeze the rest while I'm full-screen/in-game/focused?

We've put a lot of work into our clients and our backend to make sure the impact of being on 1 vs 100 guilds is negligible. It helps that most of the folks building Discord are power users in a bunch of servers (so we feel the pain of poorly optimized paths early). Generally if you don't look at a server often it shouldn't affect the performance of the app / bandwidth usage. I think we have some more blog posts in the works regarding some of these topics so look forward to those!

Well, are you ever going to provide a better way to deal with 10+ servers? The left-side server UI is terrible. The icons aren't enough to figure out which server is what and it desperately needs some form of sorting or categorizing (being able to create custom groups of servers, for example per game or any arbitrary list of servers). I end up continuously pruning my list so I don't have more than 10 because it's so bad.

WoW alone forces me to have 1 server per class, then 1 server for every guild I'm associated with. Forget about any other game.

No need to be so rude and entitled about it. Especially considering you're probably not paying for the resources you're consuming. Even if you are, it's rude.

I bet it's on their todo list to handle that use case and no doubt there are other priorities.

It's been years without any improvement at all. So yeah, I'm annoyed. I'd even be happy to give them money monthly, but the apparent priority of this issue is at the bottom of their list, if it's on there at all.

If a guild is completely muted (including @everyone), and I don't open it, do you have a rough estimate of how much resources it uses? In theory, it should be zero, but in practice, is the server sending me updates about messages even if I'm muted?

It depends on the activity of the server. But generally, you're only receiving message create events and updates to the server. But you aren't receiving things like member list updates, presence or typing events from the server until you focus it initially.

Additionally, we unsubscribe you from these events if you unfocus the server for a given amount of time.
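
The behavior described above (message and guild updates always delivered, member list / presence / typing only after focusing, and dropped again after a period without focus) can be modeled roughly like this. It is purely a reader's sketch, not Discord's code: the class, the event labels, and the 300-second timeout are all hypothetical.

```python
# Illustrative labels only; these are not meant to be real gateway event names.
ALWAYS_ON = {"message_create", "guild_update"}
FOCUS_ONLY = {"member_list_update", "presence_update", "typing"}

class GuildSubscriptions:
    """Tracks which guilds currently receive the focus-only event types."""

    def __init__(self, idle_timeout=300.0):
        self.idle_timeout = idle_timeout  # hypothetical value, in seconds
        self.last_focus = {}              # guild_id -> time the guild was last focused

    def focus(self, guild_id, now):
        """The user opened (or is still viewing) this guild."""
        self.last_focus[guild_id] = now

    def expire(self, now):
        """Unsubscribe guilds that have gone unfocused for too long."""
        stale = [g for g, t in self.last_focus.items() if now - t > self.idle_timeout]
        for g in stale:
            del self.last_focus[g]

    def should_deliver(self, guild_id, event_type, now):
        self.expire(now)
        if event_type in ALWAYS_ON:
            return True
        return event_type in FOCUS_ONLY and guild_id in self.last_focus

subs = GuildSubscriptions()
print(subs.should_deliver("g1", "typing", now=0))    # never focused
subs.focus("g1", now=10)
print(subs.should_deliver("g1", "typing", now=20))   # recently focused
print(subs.should_deliver("g1", "typing", now=400))  # focus expired
```

Under this model a muted guild you never open still costs the always-on events, which matches the answer given: the cost depends on how active that server is.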

I'm a junior developer interested in WebRTC tech. How can I learn more? Thanks.

You custom-implemented many components of WebRTC; I barely got some of the parts down in my project. So this was really interesting to me.

I'm pretty sure there isn't a ton of background fetching aside from marking servers as unread. It can take a bit to actually load the messages when you click an inactive server, and it won't have any new messages if you don't have a connection (even if it had a connection when it was sent).

OT: actually what I always wonder is how Discord will make money. I mean they offer this huge free service for everybody (2.6 million concurrent voice users) over 1000 servers; that must be a huge GCloud invoice.

We don't host our voice servers on google cloud as that's actually really expensive. We rent commodity dedicated servers. Our voice fleet is basically JBOS (just-a-bunch-of-servers). We use really low-spec servers too (4 core - 8 thread xeons @ 4ghz), 8-16gb ram, shitty hard drive (as these servers literally do no IO operations aside from some logging), and have no persistent state.

> We rent commodity dedicated servers.

So not even a VPS like Linode, Discord rents physical servers across the board? Or is there a mix of AWS or some other vendor in there?

We host our voice server on dedicated hardware, not VPS. Visualization overhead for networking is too high for the cost.

Additionally, you can buy bandwidth for much cheaper from dedicated hosting providers as opposed to cloud providers. For our use case, AWS would be approximately 15,000x to 30,000x more expensive due to bandwidth pricing.

>Visualization overhead for networking is too high for the cost.

Do you mean virtualization?

If so, I recommend looking into testing this with SR-IOV based NICs and passing through a VF to the guest. Even in regular operation the latency difference between bare metal and an ixgbevf virtualized NIC all but disappears into levels well below anything that would be meaningful for voice communication.

Moving to a DPDK based poll mode driver would reduce the latency differences even further.

Edit: https://01.org/packet-processing/blogs/nsundar/2018/nfv-i-ho... some actual numbers w/ DPDK on bare metal vs vm

Disclaimer: I work for a cloud company, but SR-IOV knowledge in general is something I had from my days running a vmware environment, and not anything new :)

With potentially substantial engineering effort, including needing to hire someone with a relatively rare expertise, they could [1] eliminate that specific downside of virtualization.

However, there are remaining overhead/downsides, and virtualization may be a solution looking for a problem in their environment.

[1] Also, presumably, dependent on specific NIC hardware, but I expect they're already using something compatible. It's merely another constraint.

By no means do you need DPDK to basically eliminate the latency difference - I just wanted to point out how low latency can go in general.

A vm using SR-IOV with ixgbevf on good ol' Intel 82599 from 7 years ago will not have a latency difference noticeable to the overwhelming majority of use cases vs. bare metal.

> By no means do you need DPDK to basically eliminate the latency difference

I didn't mean to imply that was your argument.

> I just wanted to point out how low latency can go in general.

Rather, I meant that "can" isn't the same as "does", absent exceptional circumstances.

> A vm using SR-IOV

Whether this qualifies as exceptional is, of course, arguable, but I'm arguing that it is. I could understand the point that it doesn't have to be, but, to be actually convinced, I'd want to see evidence that it's well understood and well implemented enough that neither rare expertise, substantial engineering effort, nor constrained configuration (hardware or software) would be required to take advantage of it. I'd expect most technically-minded decision makers to think similarly.

>Whether this qualifies as exceptional is, of course, arguable, but I'm arguing that it is. I could understand the point that it doesn't have to be, but, to be actually convinced, I'd want to see evidence that it's well understood and well implemented enough that neither rare expertise, substantial engineering effort, nor constrained configuration (hardware or software) would be required to take advantage of it. I'd expect most technically-minded decision makers to think similarly.

Oh. My apologies for misunderstanding your point.

SR-IOV is available on basically any and all server grade NICs, and is quite simple to use. With Azure and AWS it's basically just making sure you have the proper driver installed (gotten for free on basically all modern kernels) and flipping a command switch.

If you're rolling your own virtualization stack, it's generally about as simple as any other task for that stack.

With vSphere it takes a matter of seconds: https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsp...

Similarly easy for XenServer: https://support.citrix.com/article/CTX126624

A little bit more work with the common KVM management options, but still a very simple task as far as Linux sysadmin tasks go: https://access.redhat.com/documentation/en-us/red_hat_enterp...

OpenStack is a bit more complicated, but frankly, less complicated than plenty other tasks in OpenStack: https://docs.openstack.org/mitaka/networking-guide/config-sr...

All of the real setup work has to be done at the hypervisor level, but you're primarily just doing two things: Creating VFs, and assigning them to VMs. The driver does all of the rest of the hard work. I would argue that any Linux or vSphere admin with any real amount of experience should be able to read any of the documentation I linked and be able to confidently work through it in an hour or two.

For the guest, just making sure the driver is installed should be all that's required. For ixgbevf, the ubiquitous commercial option, it's been in-tree for the Linux kernel for at least half a decade.

Once VFs are created and assigned, it largely "Just Works". The only real caveat I know of is that seamless live migration of the guest is no longer an option, because now all of the network virtualization is handled in the hardware instead of the hypervisor.

> I would argue that any Linux or vSphere admin with any real amount of experience should be able to read any of the documentation I linked and be able to confidently work through it in an hour or two.

Having glanced through those documents, I agree that it doesn't appear to be overly complex. However, considering how much CLI there was in those instructions, I'd argue that it's evidence that this feature is not what could safely be called "well implemented" (or perhaps "well integrated" would have been better for me to use) and probably not "well understood".

If it actually only ever requires that hour or two and nothing ever again and isn't brittle, that's great. If it ever needs debugging, especially if a critical performance problem crops up, a rare expert might be needed after all.

I realize my overall point is, essentially, FUD, but, absent a large enough installed base, that's not a totally outlandish stance for a decision-maker with an already-working solution.

> Once VFs are created and assigned, it largely "Just Works". The only real caveat I know of is that seamless live migration of the guest is no longer an option

If they have to be individually/manually assigned (or automated, just not already integrated into the usual VM management mechanisms), wouldn't this also prevent other forms of virtualization flexibility?

Ultimately, though, especially in this case, it seems like virtualization is a solution looking for a problem. That there may be (even nearly complete) mitigations for some performance issues doesn't mean that there won't still be some overhead and, more importantly, at scale, the virtualized options are always going to be noticeably more expensive than bare metal.

Probably the biggest jitter comes from your VPS getting preempted by other customers due to oversubscribing (CPU or network), not the relatively benign and fixed amount of packet processing overhead from virtualization itself.

For sure! But that's a matter of the provider's placement decisions, and not inherent to virtualization.

Regarding bandwidth, since that is probably the most important resource required by your voice servers: do you strategically use the mixed-calculation-type pricing that some dedicated server providers offer (where you can get large amounts of traffic included in cheap server prices because very few heavy users are subsidized by many users that only use a tiny fraction of their bandwidth share) or are you rather renting dedicated uplinks for your servers that you and your provider expect to be fully saturated most of the time?

Can't speak for Discord, but at that scale a provider is probably going to charge you either for the physical pipe (100 Mbps, 1 Gbps, 10 Gbps) or by the bandwidth used (Mbps) using the 95th percentile method.
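
For anyone unfamiliar with 95th percentile billing: the provider samples your throughput at regular intervals, throws away the top 5% of samples, and bills the highest remaining one. A minimal sketch, assuming the nearest-rank method (providers differ in sampling interval and in how they combine inbound and outbound traffic); `billable_mbps` and the sample values are made up for illustration:

```python
def billable_mbps(samples_mbps):
    """Return the 95th-percentile sample using the nearest-rank method.

    The samples are sorted, the top 5% are ignored, and the highest
    remaining sample is the billable rate.
    """
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical billing period: 100 samples, mostly 200 Mbps with short spikes.
spiky = [200] * 95 + [900] * 5
print(billable_mbps(spiky))
```

With spikes in only 5% of the samples, the billable rate stays at the 200 Mbps baseline; one more spiky sample would push it to 900.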

It really amazes me how ridiculous cloud providers are with their bandwidth pricing. Anything with user generated video is not feasible on the current cloud providers. Dedicated server providers like OVH are more than an order of magnitude cheaper. Anyone know why cloud providers have such high markup on bandwidth?

It can be at least partially explained by a mismatch between the "cloud" model and the underlying engineering (and market) reality [1].

One has to build (or buy) for peak bandwidth. Selling it pay-as-you-go, with no regard to local maxima, means one has to price that rate high enough to account for the typical (and then some) spikiness in traffic. [2]

It's not hard to imagine that something like a UGC video site might significantly increase that spikiness ratio, if only because of the sheer quantity of data involved. Moreover, it's a large quantity of data transfer per user, so even modest user growth would result in huge network use growth. As a sibling comment pointed out a cloud provider "may not really want that type of client".

Perhaps cloud providers could start charging on a more traditional-ISP 95th-percentile style basis for larger customers and engineer their networks accordingly, but then they might have to keep those customers corralled in specific datacenters, which would remove part of the value of cloud infrastructure.

[1] Forgetting that "the cloud is just somebody else's servers" also led to the delusion that one doesn't have to "worry" about hardware failures in the cloud. Fortunately, it's now common knowledge that EC2 instances are subject to disappearing due to hardware reasons and that this needs to be "worried" about (engineered around).

[2] There is a similar issue with residential electricity pricing, where consumers pay a flat rate but the utility actually pays time-of-use (potentially a much higher rate on the spot market). Somewhat related to but not identical to rooftop solar using the grid as a "free battery", since that's also time-of-use. These come up routinely on HN discussions of electric power.

Wouldn't it be cheaper for cloud providers though? They are buying more bandwidth so they can get it cheaper. Also, they are taking advantage of the fact that clients have unused bandwidth so they can overprovision and get cost savings that way as well. I would think that that SHOULD make it cheaper for clients, but the opposite seems to be the case.

> They are buying more bandwidth so they can get it cheaper.

I don't think that's actually true. The first assumption, that they are buying something from someone else is potentially flawed, and the conclusion is based on another potentially flawed assumption, that bandwidth has an inherent volume discount.

I say "potentially" flawed because these assumptions easily hold true for small enough providers and little enough bandwidth.

At large provider scale, it's probably safer to assume that they're building instead of buying, and those costs follow fairly large, discrete steps.

Increasing bandwidth means buying faster DWDM modules and, possibly, higher-end equipment that supports them. It might mean doing that for their network peer, too.

In many cases, I expect it would mean bypassing shared infrastructure like internet exchanges, which might be limited to as little as 10Gb/s or even 1Gb/s and getting direct peering arrangements (including physical connections) with other networks, including possibly reimbursing them for their costs. This can be complicated by the new peer only having just enough bandwidth to that exchange point to match the exchange's maximum bandwidth, in which case peering will require co-locating somewhere else, with all the hardware and (hopefully dark but not always possible) fiber leasing costs.

None of those costs are necessarily high if considering maximum available bandwidth, such as if they spent 5x to get 20x or even 100x the capacity. However, if they only did it for a single customer that, on average only uses 2x the bandwidth (and only peaks at 20x-100x at rare times or only on certain, unpredictable in advance, connections to peers), they experienced a volume premium, rather than a discount.
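
The arithmetic behind that last paragraph, as a small sketch. It assumes a baseline where 1 unit of cost buys 1 unit of bandwidth; `cost_per_unit` is just an illustrative helper, and the 5x / 20x / 2x figures are the ones used above:

```python
def cost_per_unit(total_cost, units):
    """Cost per unit of bandwidth, relative to a baseline of 1.0."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Spend 5x to get 20x the capacity, but the customer averages only 2x.
per_unit_of_capacity = cost_per_unit(5, 20)   # looks like a volume discount
per_unit_actually_used = cost_per_unit(5, 2)  # what the upgrade costs per used unit
print(per_unit_of_capacity, per_unit_actually_used)
```

Measured against capacity the upgrade is a quarter of the baseline price per unit, but measured against what the customer actually uses it is 2.5 times the baseline.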

> Anyone know why cloud providers have such high markup on bandwidth?

1) because they can

2) because they may not really want that type of client

3) because due to the nature of peering agreements, they want to avoid paying as much as they can

Not sure which of the above applies, but the list is very likely at least part of the reason.

This is how we will make money.


Aside from the money we already make from https://discordapp.com/nitro :)

Do you make that much money from Nitro? I've actually never seen that many ads in Discord about Nitro, and I didn't even know it existed (I've been using Discord for over a year). I mean, we actually migrated from a 10€ TeamSpeak to Discord. While it is unlikely that everybody in my group would buy Nitro, I find it amazing that you do not have many ads in your app, not even about your own service. (I mean, sometimes there is the update pop-up and some messages between the chat messages, but we basically never use that much of the chat anyway...)


> This is all in beta, so things may change by the time we’re ready to launch. What you see today will likely not be what you see in the future.

I imagine they make most of it through Nitro: https://discordapp.com/nitro

They’ve just started a game store. Encouraging integration of their social features into games will allow them to take a healthy bite out of Steam’s pie.

Related: steam has built in voice chat now.


Steam has had voice chat (group and 1-on-1) for a much longer period of time than that. I've used it on a regular basis for the last 5 years or so, well before the current updates they've made.

What they added was voice chat to static groups[0], whereas before you had to invite everyone to a custom group chat every time. Definitely a QoL improvement, but not exactly "new".

0: https://steamcommunity.com/search/groups

The new chat is a huge improvement over the old one in terms of UX, but lack of persistent history is pretty pathetic considering the amount of resources Valve has available compared to a startup like Discord.

Steam has had voice chat for a while. This article is actually wrong, it wasn't added in the redesign.

Same thing as WhatsApp. VC pay most of the bills until the network effect has cultivated the userbase to a size attractive for a buy out by a bigger fish.

WhatsApp was bringing in 10M in revenue[1] a year with the 99-cent fee. Not selling it wouldn’t have made them billionaires, but it’s certainly fuck you money for a team of 8.

[1] https://techcrunch.com/2016/02/01/whatsapp-hits-one-billion-...

I'd be interested to see how the system works during high-load events. It sounds like if a server gets overloaded, it shuts down and moves the clients to the next non-overloaded server. If the health-check used by the load assignment server is not perfectly tuned, then that failover overloads that server, shutting it down. Eventually you have no servers left.

I suppose that a perfect health check will prevent this, since the failover will assign failover traffic at exactly the level that the new server can successfully handle. But if it's wrong on the other side (rejects connections when capacity actually exists), then compute resources are wasted "just to be safe".

I imagine that estimating capacity is even more difficult since people can join and leave at any time, and the client doesn't send any packets when there is silence. So the load changes based on how talkative people are being (which means your server always crashes during the best parts of whatever you're discussing).

Anyway, I'm wondering how this all compares to the naive strategy of "pick a random server for this channel, if it crashes, bad luck".
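
The cascade worry in the first paragraph can be made concrete with a toy model. This is only a sketch under strong assumptions that are mine, not Discord's: every server has the same capacity, an overloaded server shuts down completely, and all displaced load moves as one lump to the next server in order. `cascade` and the numbers are hypothetical.

```python
def cascade(loads, capacity, first_failed=0):
    """Return the indices of servers that fail, in order.

    loads[i] is the current load on server i. When a server fails, all of
    the displaced load is offered to the next server; if that pushes it
    over capacity, it fails too and its own load joins the displaced lump.
    """
    failed = [first_failed]
    displaced = loads[first_failed]
    for index, load in enumerate(loads):
        if index == first_failed:
            continue
        if load + displaced <= capacity:
            return failed  # this server absorbs everything; cascade stops
        displaced += load
        failed.append(index)
    return failed


print(cascade([60, 60, 60, 60], capacity=100))
print(cascade([60, 30, 30], capacity=100))
```

In the first call every server is too full to absorb a neighbor's load, so all four go down; in the second the next server has headroom and the failure stops there. Spreading the displaced load over many servers instead of one is what breaks this chain.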

Perhaps I can answer some of these questions. We've dealt with scenarios where we've had to fail-over entire data-centers to other regions, or other data-centers within that region. We have a few mitigations in-place for stuff like this.

a) A voice server can sit in multiple different load categories. So it's not "best server by score", but rather "best server out of a pool of servers with a given load factor". The load factor is one of ":verylow | :low | :medium | :high | :veryhigh | :extremelyhigh | :full". When looking for a server, we have an index of "best servers by region" that's stored in memory on each node and kept synchronized by service discovery. Additionally, if we don't have enough candidate nodes in a given load category, we will grab a few from the next-best load category. The thought being that, for a given region, we'll have a large set of servers to allocate to. This prevents a server failing from thundering-herding another server.

b) A voice server can fast-fail (reject) allocation requests, and does so under some circumstances, e.g.: the rate of allocation requests for the server exceeds a threshold, or the server is at or approaching capacity. We do a lot of this fast-failing logic using semaphores around a shared resource (server allocator): https://github.com/discordapp/semaphore

c) We also run things a bit over-provisioned. We try to have enough excess capacity during peak such that we can handle the failure of an entire datacenter within a region, or an entire region to nearby geographical regions.
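
Point (a) can be sketched roughly as follows. This is a guess at the shape of the logic, not Discord's implementation: the category names come from the comment, but `candidate_servers`, `pick_server`, the `min_candidates` threshold, the dict-based index, and the uniform random pick are all assumptions for illustration.

```python
import random

# Load categories from the comment above, least loaded first.
LOAD_CATEGORIES = [
    "verylow", "low", "medium", "high", "veryhigh", "extremelyhigh", "full",
]


def candidate_servers(servers_by_load, min_candidates=3):
    """Take the best non-empty category, topping up from the next-best ones.

    servers_by_load maps a category name to a list of server ids and stands
    in for the in-memory per-region index. "full" servers are never used.
    """
    candidates = []
    for category in LOAD_CATEGORIES[:-1]:
        pool = servers_by_load.get(category, [])
        if not candidates:
            candidates.extend(pool)
        else:
            # Only "grab a few": just enough to reach the minimum pool size.
            candidates.extend(pool[: min_candidates - len(candidates)])
        if len(candidates) >= min_candidates:
            break
    return candidates


def pick_server(servers_by_load, min_candidates=3, rng=random):
    """Pick uniformly from the candidate pool so no single server is herded."""
    candidates = candidate_servers(servers_by_load, min_candidates)
    return rng.choice(candidates) if candidates else None


index = {"verylow": ["a"], "low": ["b", "c", "d"], "full": ["z"]}
print(candidate_servers(index))
```

Choosing from a pool rather than always taking the single best-scoring server is what keeps a burst of reallocations from landing on one machine.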

>I imagine that estimating capacity is even more difficult since people can join and leave at any time, and the client doesn't send any packets when there is silence. So the load changes based on how talkative people are being (which means your server always crashes during the best parts of whatever you're discussing).

We use a lot of factors to measure load on a server to group it into a load category - in addition to just traffic: we look at concurrent clients connected, concurrent voice servers allocated, packets/sec, bytes/sec.
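
The fast-fail idea from (b) boils down to a non-blocking semaphore acquire: if no slot is free, reject immediately instead of queueing. A minimal sketch using Python's standard `threading` module, not the API of the linked semaphore library; `FastFailAllocator` and `max_in_flight` are hypothetical names.

```python
import threading


class FastFailAllocator:
    """Guards a shared allocator with a bounded number of in-flight requests."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def allocate(self, do_allocate):
        """Run do_allocate() if a slot is free, otherwise return None at once."""
        # Non-blocking acquire: the caller is rejected rather than made to wait.
        if not self._slots.acquire(blocking=False):
            return None
        try:
            return do_allocate()
        finally:
            self._slots.release()


allocator = FastFailAllocator(max_in_flight=1)
# The inner request arrives while the outer one still holds the only slot.
print(allocator.allocate(lambda: ("outer", allocator.allocate(lambda: "inner"))))
```

The rejected caller can then retry against a different server from the candidate pool instead of piling onto a busy one.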

> ... and the client doesn't send any packets when there is silence.

Yes and no, there's no packets being passed over the WebRTC connection, but the server maintains a WebSocket connection for state changes. There's two, one to the guild service, which handles assignment of the voice service. Want to count the number of clients connected to a voice server? Count the number of connections.

All Discord clients are connected to Guild services and discord publishes events for voice channel updates to all clients in the guild. (Ex. when a person joins a voice channel.)

Also, for larger discord guilds, there are exclusive voice server pools for them to consume.[1] These servers are configured with more resources and are usually pretty close to exclusive to the single guild. Most of the voice servers are for the millions of other guilds though.

[1] https://discordapp.com/partners

I remember in the past the Discord devs have moved particularly overloaded servers to their own dedicated hardware to improve performance.

> Instead of DTLS/SRTP, we decided to use the faster Salsa20 encryption. In addition, we avoid sending audio data during periods of silence — a frequent occurrence especially with larger groups.

I wonder whether this would help an attacker infer voice data, with a method similar to the one from the paper "Uncovering Spoken Phrases in Encrypted Voice over IP Conversations" [1]

[1]: http://www.cs.unc.edu/~fabian/papers/tissec2010.pdf

First author of the linked paper here. It's an interesting question.

Actually the voice activation (on/off periods of sending vs silence) was the first thing we looked at in that project. There's definitely some information leakage there, but it was really hard to learn anything meaningful from it. The problem seems to be that long strings of words all get lumped into a single activation. It's really hard to discern anything about what those words are, or even what language they come from.

We got some very weak results on language detection from VAD. You can learn something about the language, but it's not very precise. For example, maybe you could tell that a given conversation is definitely not language A, B, or C, but it might still be language X, Y, or Z.

I've never tried doing pcaps with Discord voice chat, but it would be interesting to try "estimating" what was said based on the amount of voice traffic transferred. Might be harder if it still looks like standard Discord TLS "control traffic" blending in though, no?

Keystroke timing analysis has some solid research. See, e.g., Timing Analysis of Keystrokes and Timing Attacks on SSH (https://people.eecs.berkeley.edu/~daw/papers/ssh-use01.pdf), Don’t Skype & Type! Acoustic Eavesdropping in Voice-Over-IP (https://arxiv.org/pdf/1609.09359.pdf).

Deciphering content from latencies in packetized speech is likely much more difficult, but I wouldn't put much stock in it being too difficult.

Which is to say, if you're transferring high-value information assets over VoIP you should probably assume it's decipherable. That doesn't mean you should change what you're doing. You could simply say, "M'eh, I'll worry about it when it becomes a thing." But I wouldn't assume it's confidential to someone willing to invest the time to target and capture the conversations. And I might leave a few choice comments in the source code and documentation so nobody could excuse imprudent reliance on confidentiality with, "But nobody warned me".

This comment isn't really related. I obviously love Discord and have been using it pretty much since it became an option, but I wish there was a better UX for people who are in TONS of channels/servers. I find it hard to navigate all my current servers with just the icon -- especially when server owners are changing their avatars frequently.

I am in a few dozen different discords and use Ctrl+K to navigate amongst them rather than use the icons on the left-hand side of the window.

I think their UX is awful, and their client is very slow in channels with thousands of users. They should open up their protocol so alternative clients can be created, or at least some IRC gateway.

I haven't made many calls but the sound was always crisp.

> open up their protocol

Both the Rest API (that official clients use internally too), as well as the real-time WebSocket protocol are described here:



Unfortunately you can't use these APIs to create third party clients because that violates their ToS, they are meant only for writing bots. If they catch you using these APIs with your full account credentials they'll ban your account.

They can't catch you using these apis since these are the same apis the first party client runs off of

Of course they can catch you. If your client isn't 100% the same as the original, there are going to be differences which "forensic" tools are going to prove.

Forensic net requests? You can MITM the entire traffic. Mimicking its nature isn't the hardest task in the world if all the data is handed on a silver platter, and it's even easier when their client code is uncompiled JavaScript. There are always people data mining the client.

If you're thinking they're going to check the actual software. Like ass I'm going to let someone into my house to check if I'm breaking their terms of service

I agree on opening the protocol but if what you want is a gateway, it's already open enough for that.


What channels/servers are you part of that you have so many to manage?

I am in a lot of hiphop oriented servers, through which I have met a lot of people who have invited me to other hiphop oriented servers.

I also have a fair amount of different IRL/internet friend groups who all have their own channels.

I have at least 12 servers for WoW alone. 6 for Warframe. 9 EVE ones. Around 6 programming servers. Some involving cracks/hacking. Etc.

So I have an affection for Discord...so I appreciate posts like this.

>"Using the WebRTC native library allows us to use a lower level API from WebRTC (webrtc::Call) to create both send stream and receive stream."

So I'm gathering that discord's voice servers receive multiple persistent connections, then compress the audio streams for delivery to each end user. THIS part is where I can't imagine the on-the-fly cpu usage. Each client's receiving compression needs to also negate their own audio to prevent an echo effect (no point to hear your own voice), but it also means separate compression streams per user.

>" All the voice channels within a guild are assigned to the same Discord Voice server."

I imagine this helps significantly with I/O in converting live streams into 1 stream per end user. I've dealt with video compression (only in ffmpeg) and live syncing of timestamps, and I can say from experience that this is no easy feature. I understand these are audio streams (so lower overhead), but still the persistent voice server needs to handle the incoming connections, web socket heartbeats (negligible), compression (high I/O), and deliver the streams (high memory usage too).

I'm impressed, but would love to hear the specs on the media servers and their DL/UL speeds. My old setup to deliver live video (in sync and compressed) was 6 mini-itx's, 4GB of ram per board, and i3's...my bottleneck was my isp, which I solved with multiple docsis modems and an internal switch (each board had 2 ethernet ports).

We don't transcode audio/video on the server. Each stream is processed and muxed by the client. The server is merely relaying rtp packets and tagging them with a given ssrc per peer. The client does the rest of the work.

The bulk of the user-space time on the SFU is spent doing encryption (xsalsa/dtls). We also avoid memory allocations in the hot paths, using fixed-size ring buffers as much as possible.

Additionally, we coalesce sends using sendmmsg, to reduce syscalls in the write path: (http://man7.org/linux/man-pages/man2/sendmmsg.2.html)

I posted some about the specs here: https://news.ycombinator.com/item?id=17954163
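
For readers who haven't used one, a fixed-size ring buffer looks roughly like this. It is only an illustration of the data structure: the storage is allocated once and then reused. Python itself still allocates objects behind the scenes, so this does not reproduce the allocation-free property of a native implementation, and the overwrite-oldest policy when full is my assumption (the comment doesn't say what the real buffers do on overflow). `PacketRing` is a hypothetical name.

```python
class PacketRing:
    """Fixed-capacity FIFO that overwrites the oldest entry when full."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity  # allocated once, reused afterwards
        self._head = 0  # index of the oldest entry
        self._size = 0

    def push(self, packet):
        capacity = len(self._slots)
        tail = (self._head + self._size) % capacity
        self._slots[tail] = packet
        if self._size == capacity:
            # Buffer was full: the write replaced the oldest entry.
            self._head = (self._head + 1) % capacity
        else:
            self._size += 1

    def pop(self):
        """Remove and return the oldest entry, or None if empty."""
        if self._size == 0:
            return None
        packet = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return packet

    def __len__(self):
        return self._size


ring = PacketRing(3)
for sequence_number in (1, 2, 3, 4):
    ring.push(sequence_number)
print([ring.pop() for _ in range(3)])
```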

If Discord is basically proxying the raw packets from one client to the others, isn't that wasted bandwidth (for Discord, not the clients)? I understand from the post that the goal would be to mask the IPs of the users, to shoulder user privacy and the DDoS vector. Kudos on silence detection to save overhead.

So video w/audio broadcasting has to be compressed client side, then proxied through Discord's media servers to the end users. That's pretty smart...I just wish that I could send my raw stream to a LAN host so I could offload the compression, and allow my LAN host to provide delivery (I'm a Nitro user).

Would rather waste bandwidth than CPU cycles in this case. Would take way too much CPU time to mux audio streams together server-side, and then recompress. (Means we have to buffer data for each sender, deal with silence, deal with retransmits and packet drops, have a jitter buffer, etc...). No way we'd be able to hit the # of clients we want per core with that overhead. Our SFU's are intentionally very dumb for this reason.

Also, muxing server side means we can't do things like per-peer volume and muting, without having to individually mux and re-encode for each user in the channel depending on who they have muted and the volumes they have set per peer (which would explode CPU complexity even further).

So, in this case, bandwidth is cheap, let's use (and waste) some, in an effort to simplify the SFU, and also, make it more CPU efficient. Default audio stream is 64kbps (or 8 KB/sec), per speaking user.
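
A back-of-the-envelope version of that trade-off. The 64 kbps default is from the comment above; the rest is a simplification of mine: header overhead (RTP/UDP/IP) is ignored, and every participant is assumed to receive every other speaker's stream. `sfu_bandwidth_kbps` is a hypothetical helper.

```python
STREAM_KBPS = 64  # default audio bitrate per speaking user


def kbps_to_kilobytes_per_sec(kbps):
    """Convert kilobits per second to kilobytes per second."""
    return kbps / 8


def sfu_bandwidth_kbps(participants, speakers, stream_kbps=STREAM_KBPS):
    """Return (ingress, egress) in kbps for one channel on a forwarding server.

    Each speaker's stream arrives once and is relayed, unmixed, to every
    other participant in the channel.
    """
    if not 0 <= speakers <= participants:
        raise ValueError("speakers must be between 0 and participants")
    ingress = speakers * stream_kbps
    egress = speakers * max(participants - 1, 0) * stream_kbps
    return ingress, egress


print(kbps_to_kilobytes_per_sec(STREAM_KBPS))
print(sfu_bandwidth_kbps(participants=10, speakers=2))
```

Ten people in a channel with two talking at once costs the server 128 kbps in and 1152 kbps out under these assumptions, and nothing at all while everyone is silent.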

More bandwidth less cpu and complexity

I made several websocket-based apps that I want to turn into Electron desktop apps where people can use them only with their friends, but WebRTC has been really confusing and hard to know how to get started with. Especially considering ideally I don't want to have to host a server, but all the P2P JavaScript libraries that I found seem to assume you'll at least have a server for hosting the lobby to look for peers.

The fact is that you can't easily do the last part without a central server.

There might be some public or semi-public servers for this kind of thing available somewhere, or the alternative that I played with a few years ago was to compress and base64 encode the connection information into a string, and allow users to share the link with friends via whatever method they want, and that can then be expanded client side and used to establish the connection. (Or something like qr codes I've also seen used)

Sadly I never finished that project, so I don't really have any code to show you, but in theory it should work okay.
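
The encoding step on its own is simple enough to sketch. This only shows how a session description could be packed into a link-safe string and unpacked again; it says nothing about whether the peers can still complete the handshake by the time the string has been passed around. The function names and the placeholder SDP are hypothetical.

```python
import base64
import json
import zlib


def encode_description(description):
    """Pack a session description dict into a compact URL-safe string."""
    raw = json.dumps(description, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw, 9)).decode("ascii")


def decode_description(token):
    """Inverse of encode_description."""
    raw = zlib.decompress(base64.urlsafe_b64decode(token.encode("ascii")))
    return json.loads(raw.decode("utf-8"))


# Placeholder description; a real one would come from the WebRTC stack.
offer = {"type": "offer", "sdp": "v=0\r\no=- 0 0 IN IP4 127.0.0.1\r\ns=-\r\n"}
token = encode_description(offer)
print(token)
print(decode_description(token) == offer)
```

The resulting token can go in a URL fragment or a QR code; the receiving side decodes it and hands the description to its own peer connection.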

> base64 encode the connection information into a string

Doesn't work anymore.

Firefox times out the answer offer after very few seconds, which makes sharing the answer offer asynchronously impractical, which effectively killed serverless WebRTC and in turn killed any interest I had in it (for my side projects I mostly do serverless web apps, as in real serverless, not lambda functions).

Ouch! That really sucks...

Is there a ticket number or blog post or something I can research to learn a bit more about the reasoning behind this change with them?

No ticket that I know of. It's mentioned in https://blog.mozilla.org/webrtc/ice-disconnected-not/

> So every 5 seconds Firefox (version >= 49) sends another binding request no matter if the ICE transport is in use or not, and it expects the other side to reply with a binding response. If it hasn’t received a binding response for 6 consecutive binding requests, in other words no reply within the last 30 seconds, it will give up and mark the transport as failed. This results in switching the ICE connection state to ‘failed‘ and stop sending any packets over that transport.

Zero-config peer to peer is basically impossible without central servers or some large populated list of long-term seed nodes which are basically central servers.

The central servers don't have to do much, but they need to exist so new nodes on random IPs can find each other. The Internet has no native service discovery bus.

Could I avoid requiring a central server, and still use webrtc, if the two peers I want to connect are in the same lan?

Yes: https://github.com/LucasVanDongen/WebRTC-Local-Loop-iOS

Based on: https://webrtc.github.io/samples/src/content/peerconnection/...

You will need to know where the other peer lives. Which is why they expect you to have a discovery service. But typing in local IP's should work for internal networks. The hardest part is getting through NAT and firewalls and that's not a problem on your own LAN.

I think this is possible, but I'm not sure about the details. I think their IPs need to be obtained somehow though. I don't think you can use things like mDNS from web pages without a plugin.

would ip guessing work? 192.168.. … ?

It would flood the network with probes and take quite a while.
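
To put rough numbers on that, here is a small sketch that counts how many addresses a blind scan would have to probe. The per-probe timeout and the parallelism are made-up parameters, and `probe_count` and `worst_case_seconds` are hypothetical helpers; only the standard `ipaddress` module is used.

```python
import ipaddress
import math


def probe_count(cidr):
    """Number of usable host addresses in an IPv4 network."""
    return sum(1 for _ in ipaddress.ip_network(cidr).hosts())


def worst_case_seconds(cidr, timeout_s, parallel=1):
    """Time to scan every host if each probe runs until its timeout."""
    batches = math.ceil(probe_count(cidr) / parallel)
    return batches * timeout_s


print(probe_count("192.168.1.0/24"))
print(probe_count("192.168.0.0/16"))
print(worst_case_seconds("192.168.0.0/16", timeout_s=0.5, parallel=64))
```

A single /24 is 254 probes, which is tolerable, but the whole 192.168.0.0/16 range is 65534 probes, and that is before trying multiple ports per host.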

Maybe embedding nmap with the Electron app to search for active nodes, then probing. Requires some C or C++ knowledge though.

can't electron js do local pings?

You still ultimately need some channel to set up that initial P2P connection. This allows the peers to share the IP addresses at which they’re accessible, as well as the formats and codecs they each support.

There’s nothing to stop you doing this through broadcast messages on a local network, though I’m not aware of any way that could be done from within a browser. You could extract the SDP and ICE messages from the browser environment and handle that in Node though, if you were using Electron or something.

Thanks everyone, what I essentially want to do is link a phone (my app) with a laptop (running a decent browser) by showing a QR code on the latter, and establish a data channel (not voice or video).

I mean if everyone already has a way to contact their friends (discord) and say "hey let's play this game together", sharing that link does away with the need for a coordination server. But there's still the need for at least a server, even if it's just one of the clients that knows how to take charge (like Minecraft).

The link could just be encoded information that gives the hosts details for everyone else to connect to.

Yeah, I skipped over a lot of the details, but you would need someone to choose to be a "host" (at least in terms of a "host" for the connection signaling stuff), but you can also implicitly make it the person who created the link safely.

If everybody is behind NATs (without uPnP privileges), nobody can be the "host" for connection-signalling purposes.

There are public ICE/STUN servers that can facilitate NAT-punching, google even hosts one. But that's not the aspect that I thought the root comment was talking about, but more the aspect of "A is hosting a room called 'stuff', and B and C want to browse the list of rooms and connect to it".

>The fact is that you can't easily do the last part without a central server.

Looks up DHT use.

DHT is a fantastic system, but it's not easy by any means, and it still needs some "important" nodes to help bootstrap the system.

WebRTC requires some sort of external communication channel to exchange information about how to set up the connection. (Often referred to as signaling messages).

Disclaimer: Google employee, work on the WebRTC team.


After getting frustrated with all the other WebRTC libraries over the last 4 years, I finally wrote our own ( https://github.com/amark/gun/blob/master/lib/webrtc.js ) which is capable of using a set of decentralized DHT relay-peers in GUN for signaling - and once peers are already on WebRTC, they can signal (daisy-chain "DAM" as we call it) to other WebRTC peers via WebRTC!

Meaning, you don't/won't have to run any servers!!! Ping us on our chatroom if you want the list of DHT peers.


Chuck! Sounds like you'd be a very useful person to know. Feross, plenty of others, and I have been requesting additional API/protocol access over the last 4 years (some of which Firefox is adding in the libdweb extension!). Any chance we could connect and chat? Ping me at mark@gun.eco ?

Would it be a problem to keep a list of DHT peers in the git repo?

Good idea. I need to get approval they are OK with their peers being publicly listed (easy DDoS potential, we don't have code yet to [but working on it] mitigate it).

I'll start one with just mine, and then others can PR to opt-in. You want to join?

Awesome, thanks.

I'm not actually using WebRTC (or GUN) yet, but I hope to for a project relatively soon and the possibility of not requiring a STUN/ICE server is very enticing to me.

If you need to request more APIs, you should try to talk about your usecase to the WebRTC working group using the users mailing list ( https://groups.google.com/forum/#!forum/discuss-webrtc ). Depending on the question, you may also ask on the spec Github project ( https://github.com/w3c/webrtc-pc ).

Source: Google employee, in the WebRTC WG.

You will need some sort of signal transport for sending ICE negotiation data in JSON. Because the two peers need to know how to coordinate how to connect to each other before actually connecting. It can be a websocket server, REST API, email, QR codes, carrier pigeon, etc.
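As a rough illustration of how transport-agnostic that is, here is a sketch of a JSON envelope for signaling messages. The `kind`/`payload` envelope is made up for this example. The inner fields (`type`, `sdp`, `candidate`, `sdpMid`, `sdpMLineIndex`) follow the shape browsers produce when serializing session descriptions and ICE candidates, and the values below are placeholders.

```python
import json
from typing import Tuple

ALLOWED_KINDS = ("offer", "answer", "candidate")


def encode_signal(kind: str, payload: dict) -> str:
    """Wrap a session description or ICE candidate in a JSON envelope (hypothetical format)."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown signal kind: {kind}")
    return json.dumps({"kind": kind, "payload": payload})


def decode_signal(text: str) -> Tuple[str, dict]:
    """Unwrap a message produced by encode_signal."""
    message = json.loads(text)
    if message["kind"] not in ALLOWED_KINDS:
        raise ValueError(f"unknown signal kind: {message['kind']}")
    return message["kind"], message["payload"]


# Placeholder values; a real offer and candidate come from the peer connection.
offer = encode_signal("offer", {"type": "offer", "sdp": "v=0\r\n..."})
candidate = encode_signal("candidate", {
    "candidate": "candidate:1 1 UDP 2122252543 192.168.1.10 54321 typ host",
    "sdpMid": "0",
    "sdpMLineIndex": 0,
})
print(decode_signal(offer)[0], decode_signal(candidate)[0])
```

The resulting strings are plain text, so they can travel over whatever channel is available, including a websocket, a REST call or a QR code.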

Also in general, I highly recommend webrtc-adapter.

You need a STUN/TURN server, that's just how the cookie crumbles unfortunately. Blame NAT :)

I used this dockerized coturn server successfully:


There are probably some problems with this idea, but why not just use a popular social network or similar as the lobby? For example, if there is a twitter feed "@MyWebRTCLobby" that everyone follows, clients could just @ a tweet at it with current contact details.

If random Twitter users send @mentions to "@mywebrtclobby" then no other users will likely see these tweets unless they already follow each of the various Twitter users sending the @mentions.

However if you connected @mywebRTClobby with https://GroupTweet.com you could configure things so that any @mentions (from authorized users or anyone) would be converted into actual tweets from the @mywebRTClobby account so that all followers would actually see those tweets/contact details.

Haha I don't use Twitter so I'm not surprised I got something wrong... thanks for fixing that bug.

It used to be the case you could see all @mentions someone received, but alas not anymore.

Iirc there are some free servers you can use provided by Google and/or twilio.

Edit: STUN is free, Turn is not. Been a while since I worked with those.

For my game, I wrote a simple UDP to WebRTC datachannel proxy in NodeJS, using node-electron, webrtc-electron and SimplePeer. My proxy only works downstream, with the upstream signalling using websockets. I'm going to move over to an all WebRTC architecture, however.


Interesting, I am working on a realtime action game and currently have multiplayer implemented via websockets, including an authoritative server, client-side prediction and server reconciliation. That works fairly well, but due to the nature of TCP I still sometimes get visible stutter in the movement. I guess with WebRTC you can handle things more like traditional UDP? What are your experiences vs Websockets? Did you write up how you did it somewhere? Would be very interesting

I got really big stutter with head of line blocking on TCP. You can make things better by conspiring to send only 1 packet messages of about 512 bytes or less, and not too often.
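A sketch of that rule of thumb as a sender-side check: refuse anything over 512 bytes and anything sent too soon after the previous message. The class name and the 50 ms minimum interval are assumptions for illustration, since the comment only says "not too often".

```python
from typing import Optional


class SmallMessageBudget:
    """Allow only small, infrequent messages (hypothetical helper)."""

    def __init__(self, max_bytes: int = 512, min_interval_s: float = 0.05):
        self.max_bytes = max_bytes
        self.min_interval_s = min_interval_s  # assumed value, tune per game
        self._last_sent: Optional[float] = None

    def allow(self, message: bytes, now: float) -> bool:
        """Return True if `message` may be sent at time `now` (in seconds)."""
        if len(message) > self.max_bytes:
            return False  # too large to be likely to fit in a single packet
        if self._last_sent is not None and now - self._last_sent < self.min_interval_s:
            return False  # too soon after the previous message
        self._last_sent = now
        return True


budget = SmallMessageBudget()
print(budget.allow(b"x" * 400, now=0.0))   # small and first: allowed
print(budget.allow(b"x" * 2000, now=1.0))  # oversized: rejected
```

This does not remove head-of-line blocking on TCP. It only limits how much data can be stuck behind a lost segment.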

You can register for an account on my server. WebRTC downstream seems to be working at the moment. Be sure to read the instructions, however.


I am thinking of turning my game server into a serverless PaaS offering. I'm open to collaborating on this.

i emailed you regarding this ;)

There's a number of decentralized networks that could send the connection info without a single central server, e.g. torrents or Dat. OTOH the time required to connect to them and pass the info would be quite noticeable.

A single connection-initiation server could be the easiest solution. It won't need to withstand a heavy load.

I really wish there were non-browser implementations of WebRTC. So far, it seems like the standard as-it-is is defined by browser code and browsers contain the canonical implementations.

Right, so it's a chromium C++ repo that you need to create bindings for to interact with. I have no doubt that an organization like Discord has the engineering resources to either write their own bindings to WebRTC or to roll their own implementation of the components, but if I'm an individual developer that wants to interact with WebRTC without a browser, then it's pretty difficult.

One common desired use case is using WebRTC for p2p torrent communications. Right now the best way to do this is in browser, or to use an Electron app that can bridge the WebRTC clients with the standard desktop torrent clients.

There's nothing tied specifically to the browser, but it is written in C++. It seems like the discord folks just built a layer on top of it. It's been a while since I played around with it, but you just have to build the peerconnection library. This guide seems pretty close to what I remember having to do: https://webrtc.org/native-code/development/

It still leaves you writing in c++, but it lets you build a self contained server without any browser stuff. There also were java bindings which is what I used when I was experimenting with it. You just have to build your .so and .jar files.

EDIT: Here's some scala code that I wrote that uses the native bindings. (Please forgive the messy test project that I have abandoned).


There are Java Android and ObjC iOS bindings available in the repository. Not sure how those would work in a regular JRE or ObjC macOS environment, but it should be alright (but then, you might as well use the C++ API directly on macOS, just saying).

Ah that makes sense. I have seen forks that de-chromium it some that look pretty reasonable. Most of it doesn't depend on anything chromium, just happens to use its build system (for better or worse). But I believe the fork switches it to CMake

There are implementations that don't use libwebrtc/Chromium at all like pions-WebRTC[0]

Or do you want something that has a completely different API? It would be nice to be able to ignore the JS idioms, but it is tough to start from scratch.


pions' webrtc is the only third-party implementation with that level of completeness, and it's not very complete yet. It's super new (months old), but it's exciting! See issues: https://github.com/pions/webrtc/issues

There are many aspirational repos that have been started over the years with the intention to implement WebRTC but very few of them actually made good progress.

I too would love more third-party language-native implementations. Right now everyone is binding to the C++ codebase.

The real problem without having language-native implementation is that it creates a protocol rift. Things like BitTorrent vs WebTorrent, or IPFS native vs IPFS JS: It's effectively UDP/TCP vs WebRTC, clients on one end (native apps that aren't NodeJS) can't speak to clients on the other end (browsers) without a relay bridge (which is always NodeJS).

I agree. This is a major barrier to adoption of many exciting technologies, in my estimation. See for example[0]

[0] https://github.com/webtorrent/webtorrent/issues/1492

From the article: "Our desktop, iOS, and Android applications, however, make use of a single C++ media engine built on top of the WebRTC native library — specifically tailored to the needs of our users."

Does anyone have a solution to prevent notifications when a server creates a new channel and then I am notified for every message until I mute the channel.

You can mute the server, or change the default notification setting to @mentions only for the server

If you skip the use of ICE, why is it that 25% of the time I get a message relating to ICE and it gets stuck there?


> we have seen 1000 people taking turns speaking

Wow. Has anyone here been in a channel like that? I'd imagine it to be complete chaos.

But I suppose it's possible with good moderation and/or bots to ensure people take turns? Is that what they do?

I've been in a couple channels with several hundred people. The admins handled it by shouting until everyone was quiet, enforcing one person speaking at a time, and swiftly muting, moving, or banning anyone who disobeyed. I'm not sure if Discord's API is flexible enough for a bot like that.

The API is flexible enough for a bot like this, I'm contemplating opensourcing mine at some point! :)

Please do, it would be an actual act of charity! I've left a lot of interesting servers due to the voice chat being clogged.

1000 people taking their turn speaking? It must be some political rally or reddit discord channel for a large national outbreak news story

All I know is that official fortnite (discord's largest gaming server last year AFAIK) had serious amounts of disconnect issues, only on that server though

Fortnite didn't have more than 100 battlerooms (4 players max) at any given point in time.

There's a few battle royale scrimming servers with 70-100 players in one room for tournaments. But the rule is everyone needs to be muted & deafened though

Yeah. Typically it starts with a priority speaker giving a speech, ends with a Q&A moderated via text chat.

There's usually some sort of moderator, and a bot to facilitate turn taking, at least in the channels that I had these in :).

My buddies and I use Discord as an easy conference call group chat while playing games (no push to talk, mic is always on).

One thing I've found on the app is when a person/voice is far away from the mic it often glitches and doesn't send all of the audio data over like, for example, a normal FaceTime Audio call would.

Not sure why this happens, but just thought I'd put that out in the ether.

Sounds like you just need to adjust your microphone sensitivity settings? If you have the mic on voice activation and the person is far from their mic then it might be that they're just not loud enough (from the perspective of their microphone) to actually trigger Discord. That's my guess at least.
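To show why a distant speaker can fall below the trigger, here is a toy level gate that compares the RMS level of an audio frame against a threshold. This is a generic sketch, not Discord's actual voice detection, and the -40 dBFS default and the sample amplitudes are assumed values.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level of a frame in dBFS, for samples scaled to the range [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def gate_open(samples: Sequence[float], threshold_dbfs: float = -40.0) -> bool:
    """True if the frame is loud enough to be transmitted (assumed threshold)."""
    return rms_dbfs(samples) >= threshold_dbfs


near = [0.1] * 480    # speaker close to the mic, about -20 dBFS
far = [0.005] * 480   # same speaker far away, about -46 dBFS

print(gate_open(near), gate_open(far))            # far voice is cut off at -40 dBFS
print(gate_open(far, threshold_dbfs=-50.0))       # a more sensitive setting lets it through
```

Lowering the threshold lets the quieter voice through, at the cost of also letting through more background noise.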

Ah, never thought of it like that. I always perceived it as a "always on" scenario. Just found it in the settings and I'll give it a try, thank you!

> Since it’s the only service directly accessible from the public Internet, we will focus on Discord Voice server failovers.

Why is the Gateway not directly accessible from the public Internet?

It's a clustered service that is behind an SSL Load Balancer[1] that is behind Cloudflare. So not really a viable DDoS target.

[1]: https://cloud.google.com/load-balancing/docs/ssl/

I've been reading the article hoping to find out a bit more about the actual voice server but the SFU is actually not very deeply explained at all.

That’s because there isn’t a lot to it from the sounds of it in this post:


I would implement a similar dumb pipe as well for a very different application. I mean the application still needs to set up the WebSockets with the clients, which isn't trivial to code yourself.

I love discord but am I the only one that thinks that the name is hindering wider adoption? It's hard to suggest to some people that an app with that name (and to a lesser extent visual theme) is the skype+slack+others replacement they are actually looking for...

We found the name is not an issue as growth continues at a rapid pace.

We are already multiple times larger than Slack :)

I suppose you could say the same for Slack. I don't think that the connotations of the names for either particularly matter to users.

What's hindering adoption outside of the gaming community is the branding and niche focus.

"Instead of DTLS/SRTP, we decided to use the faster Salsa20 encryption."

This is just confused and sounds quite worrying :I

Judging by the votes, this may need spelling out:

Salsa20 is a stream cipher. DTLS and SRTP are higher level security protocols that use ciphers (among other things) as building blocks, to ensure things like replay protection, integrity protection, mutual authentication, secure session key agreement and forward secrecy in addition to confidentiality.

If you replace an engineered security protocol with a raw cipher and key, you create many vulnerabilities and make a much less secure system. VOIP applications are especially vulnerable to replay attacks, for instance.
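For a sense of what "replay protection" means in practice, here is a toy sliding-window check on packet sequence numbers, loosely modelled on the replay-window idea used by protocols such as SRTP and IPsec. The class name and the 64-packet window are assumptions, and this is not a description of what Discord does. In a real protocol this check is only meaningful on packets whose sequence number has been authenticated.

```python
class ReplayWindow:
    """Reject packets whose sequence number was already seen or is too old."""

    def __init__(self, size: int = 64):
        self.size = size
        self.highest = -1  # highest sequence number accepted so far
        self.bitmap = 0    # bit i set means (highest - i) was already accepted

    def accept(self, seq: int) -> bool:
        if seq > self.highest:
            # New highest: slide the window forward and mark this packet as seen.
            shift = seq - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:
            return False  # older than the window: cannot tell, so reject
        if self.bitmap & (1 << offset):
            return False  # already seen: a replay
        self.bitmap |= 1 << offset  # late but new: accept and remember it
        return True


window = ReplayWindow()
for seq in (0, 2, 1, 1, 100, 2):
    print(seq, window.accept(seq))
```

A raw cipher with a shared key provides nothing like this, so a recorded packet can be re-sent and will still decrypt cleanly.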

A better approach would be to continue using DTLS and SRTP, and use Salsa20 as the SRTP cipher.

Non-crypto-experts saying things like "we replaced <established security protocol> with <my own idea> and it's much faster" is a well-known bad sign. It often indicates the person has fumbled things without realizing it.

They say they have done it for performance. I am wondering if they wanted to optimize for the client performance or the SFU's.
