How Discord Scaled Elixir to 5M Concurrent Users (2017) (discordapp.com)
528 points by lelf on Feb 24, 2019 | 162 comments

Discord infra engineer here -- this blog post needs an update! Since then we've scaled this system much more. :)

The Fortnite Official server has exceeded 100,000 concurrent users, Discord itself is way past that 5M concurrent number, we're now using Rust in certain places to make Elixir go faster, we've built a general purpose replacement to Process.monitor that scales a whole truckload more that we're open sourcing next week at Code BEAM SF... the list goes on.

There's a lot of fun stuff going on to try to make this system even more efficient and reliable, there's a lot to do still. We run everything on a very small engineering team (there are 4 fulltime engineers on the core infrastructure, only about 40 engineers in the whole company) and we're always looking for a few more. Feel free to reach out to me (zorkian#0001 on Discord) if this blog post sounds up your alley!

For the curious, xb95 was kind enough to tell me how they call Rust from Elixir - it's via Native Implemented Functions:


> As a NIF library is dynamically linked into the emulator process, this is the fastest way of calling C-code from Erlang (alongside port drivers). Calling NIFs requires no context switches. But it is also the least safe, because a crash in a NIF brings the emulator down too.

Sounds like a pretty good use for Rust!

Wonder if they use rustler [0], which claims it cannot crash the BEAM. Looks pretty good.

[0] https://github.com/hansihe/rustler

We do indeed. Rustler is very cool!

Sonny Scroggin of Bleacher Report gave a talk about writing NIFs in Rust (using Rustler) at last year's Code BEAM SF conference -- the video's on YouTube:


Congrats! You say only about 40 engineers in the company... How big is the company and how big are your other departments relative to engineering?

Their engineering team seems pretty efficient, for 40 people. I'm curious how they're organized, what they look for in hires, their release pipeline, etc.

Still reading the article, and I'm enjoying it so far.

My only complaint is the design of your blog. The header and the footer (to subscribe on Medium) take up so much space that there is relatively little space left to read the actual content.


I can recommend the browser extension "Make Medium Readable Again":


This add-on can "Access your data for all websites".

That seems both risky and somewhat overkill considering its features. Does Firefox not support targeting a specific domain yet? Or is part of the problem that medium allows custom domains (it does, right?).

Firefox does support targeting specific domains, so the add-on specifically chose to apply to all domains by writing "https://*/*" in "permissions" in manifest.json. It probably asks for this permission because of custom domains, as you theorize.

I can see in the extension source (thanks to https://addons.mozilla.org/en-US/firefox/addon/crxviewer/) that on every page, the extension uses JavaScript to check for a top nav bar or a login nag popup and hide them if present, then applies CSS that hides five other UI elements if they are present.

Am I correct in assuming that if the add-on was not manually installed, it could be updated at any time to include malicious code? Or is that just Chrome's behavior perhaps?

I wonder if there could be a uBlock/AdBlock filter made for Medium in general to block all this. I dunno the format of filter files, but it was easy to add a uBlock rule for a particular element using the UI.

Isn't this the same for all Medium blogs?

You are right. :(

I hate this too but I think it's intentional - as once you sign in the banner will disappear on scroll

This makes me wonder what else they'll "invent" before it becomes completely unreadable

Also, a second question - in the conclusion, it says:

> Choosing to use and getting familiar with Erlang and Elixir has proven to be a great experience.

What background did the core infrastructure engineers have before tackling Discord in Elixir and Erlang?

How is webrtc handled at scale? Seems most webrtc server software doesn't perform that well, and discord I believe is using turn to proxy it all, so that's gotta be a lot of data flowing.

The tldr is we wrote our own highly efficient SFU - and we used our own transport layer that is not dtls, but xsalsa poly over udp.

> The Fortnite Official server has exceeded 100,000 concurrent users, Discord itself is way past that 5M concurrent number, we're now using Rust in certain places to make Elixir go faster.

Would this have been avoided had you started with Golang or the JVM?

Who knows? We would have had a whole other set of issues, ones that Golang and the JVM struggle with. There is more discussion in the rest of the comments in this post. BEAM/OTP provide a fantastic foundation for building distributed soft real-time systems, unlike anything any other programming language/framework/ecosystem provides.

Not sure why you are being downvoted - it seems like a genuine (good) question. Perhaps using Go could have solved the need for Rust, perhaps not.

Do you have remote positions?

Has Discord published any blog posts with technical details (configuration details) on tuning various aspects of the infrastructure (from bare metal, to VM, to container, to the application) to get those numbers? I've seen similar blogs in the past from Cloudflare, StackExchange and a few others. Those are always a fun read.

I hope to read about your experience with Rust. I hope it's the leaner type.

We've since ditched fastglobal actually, as we found that it'd take too long to recompile code as the number of processes on the BEAM VM grew. As it turns out, the cost to recompile is linear scaling up with the # of processes on the node. Once we reached the million processes per node threshold it ended up being too slow, and resource intensive to do fastglobal dynamic module recompilation.

We've since been able to squeeze out the performance from ETS we needed to be able to run our hashring. More on that in this PR here: https://github.com/discordapp/ex_hash_ring/pull/1
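For readers who haven't met a hash ring before, here is a minimal Python sketch of the general technique (consistent hashing with virtual nodes). It is not how ex_hash_ring is implemented and it does not use ETS; the class name `HashRing`, the `replicas` parameter and the node names are all made up for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring with virtual nodes (hypothetical API)."""

    def __init__(self, nodes, replicas=64):
        self._replicas = replicas
        self._points = []   # sorted hash positions on the ring
        self._owners = {}   # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        # Use the first 8 bytes of MD5 as a 64-bit ring position.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node):
        # Each node gets `replicas` positions to smooth out the distribution.
        for i in range(self._replicas):
            point = self._hash(f"{node}:{i}")
            self._owners[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def find_node(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._owners[self._points[idx % len(self._points)]]
```

The property that makes this useful for something like guild placement: when a node is removed, only the keys that lived on that node move; every other key keeps its owner.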

Also one amazing feat we've accomplished thanks to ZenMonitor (which we will open source soon as my colleague mentioned), is that we can now tolerate a guilds node failure and recover from it in ~40 seconds. We had a node fail on Friday actually due to the underlying host rebooting (GCP calls this a 'host error'), and didn't even notice until after the system had already recovered and the on-call got alerted after the fact. Back in the day this led to a cascading failure throughout the system. A guild node runs between 600k-700k concurrent Discord guilds, or servers as they're known in userland. And although we haven't done it recently, we clock a full restart of our distributed system at roughly 17 minutes (from shutdown to service fully restored).

Interesting. Do you run on top of bare gce vms? What machine types do you use? Didn't know they were being automatically rebooted under the hood.

Yes. We run bare VMs for this workload (n1-standard-X). They do in the event of hardware failure or other unexpected fault. Generally it does not get to that though, as google is fairly good at migrating your VMs off of a host if it detects it will fail soon.

Disclosure: I work on Google Cloud (and have emailed jh about Discord).

Yeah, the most straightforward failure that we can’t migrate away from is the NIC (or rack switch) failing. Obviously, if your path to get off the box is dead, that’s not going to happen :).

Others though, like the hypervisor crashing or even host kernel are also possible, but much less frequent than “Hmm, I think the network is dead, we should start a replacement VM”.

One of the many motivations for (now) only having Persistent Disk for boot disks, is that it lets us avoid a whole class of truly unrecoverable errors. There are still bad DIMMs, but monitoring for ECC failures often lets us mark the host for replacement in time for the VMs to migrate off before they actually fail.

A corollary:

As a mostly Elixir programmer for 2 years now, I find it quite amusing how the Kubernetes community tries their damnedest to emulate Erlang's OTP system -- going as far as to bolt a stricter typing system onto Golang even -- and tries to reinvent Erlang's "let it crash and get rebooted" idea.

I am however not at all convinced that "let it crash" is a good mantra when applied to entire containers. One such container can take 20+ seconds to restart and you can lose a lot more compared to the mini-processes Erlang/Elixir have which are happily left to crash and [semi-]auto-recover. Imagine if that container was ingesting events and crashed when it had 5_000 in its in-memory queue and only managed to process 50-100 of them. Or imagine 100 transactions in progress being cancelled.

I applaud the hard work of the Kubernetes team, I am just not very sure they invest their energy in the right tech stack. Sure Golang is faster than Erlang/Elixir -- by a lot, too. But is it not K8s's idea to be fault-tolerant rather than immensely quick? What does Golang's speed matter when a container needs seconds to reboot? In my eyes, K8s could have been written in Bash shell scripts... but I am probably missing something important.

Kubernetes is a complex project with a large codebase. Writing it in bash would be a nightmare to maintain.

Go makes sense as an implementation language for a system like this for many reasons:

- it doesn't require a VM (goodbye Java)

- it is a memory safe language (goodbye C++)

- it has strong types to improve reliability

- it does a good job of handling complex, multi-module code efficiently (these two say goodbye to Python)

So of the Google friendly languages you end up with Go.

And I don't know if it was planned, but Go has turned out to be a boon for contribution. A surprising number of devops engineers are willing to dip their toes in Go waters and I highly doubt they would've done so with Erlang.

I think you'd end up going the chef route, writing the control plane in Erlang and expecting users to interact in Ruby or something like it.

Also this whole conversation seems a bit off because Kubernetes is a multi-service architecture with many components. You can absolutely write operators in other languages since Kubernetes exposes an API.

It's kind of the whole point to be able to take an existing application and run it in k8s rather than a regular VM with minimal changes. Expecting developers to rewrite everything in Erlang is nuts.

> Expecting developers to rewrite everything in Erlang is nuts.

...But I never said that?

You are correct on your points and I don't disagree. Golang is certainly a much better choice than Bash indeed. You are also correct on developer willingness to work with Golang.

I am aware that K8s and Erlang/OTP are apples to oranges comparison; they serve different needs. Whereas Erlang's runtime (the BEAM VM) can give you fault-tolerance, K8s tries to do roughly the same on a higher level - it tries to give you throwaway containers that can be switched off and on at any time.

As I mentioned in my parent comment, I admire their work.

My point was that if you squint hard, it kind of looks like K8s wants to invent Erlang/OTP for infrastructure (as opposed to Erlang/OTP, which gives its guarantees per node).

> which gives its guarantees per node

There is no such thing, because Nodes (and even entire datacenters) can fail.

In other words, node failure is an infrastructure problem that is best NOT handled by your bespoke application code. Replacing failed nodes should NOT be custom code in your app, that way lies madness.

> kind of looks like K8s wants to invent Erlang/OTP for infrastructure

You can say "K8s and Erlang implement similar ideas at different levels." But you can't pretend one is a substitute for the other, nor that "Erlang has done everything that K8s can do."

E.g. Writing in Erlang doesn't magically get you:

- deploy/upgrade of the Erlang runtime, including rollback + multiple versions co-existing, including any compiled foreign code (which is required in the article).

- ability to log, monitor, probe and reroute the connections between services -- in a standard way such that the application doesn't have to be modified ("service mesh").

- Ability for an entire ecosystem of tools to inspect the versions of your application services that are deployed, because it's exposed as an API.

- A standard ecosystem of plug-ins for operators (autoscale, autoscale to EC2 spot instances, capacity planning, "best practices" of running a MySQL cluster, etc.)

None of these should ever be mixed with the application. (Unless you are Kelsey Hightower: https://github.com/kelseyhightower/hello-universe )

> There is no such thing, because Nodes (and even entire datacenters) can fail.

Gross over-simplification that I'd also call a strawman. In the face of lack of electricity, of course no computer language matters at all. What's your point?

> it doesn't require a VM (goodbye Java) -

Neither does Java, only devs clueless about Java world aren't aware of AOT compilers to native code.

Also sorry to spoil your fun, Kubernetes was originally written in Java, it was later rewritten in Go as another team took over.

Apparently a prototype was written in Java. Borg was written in c++. AFAIK Kubernetes was always a Go project, but maybe I'm wrong. Why do you think Go was chosen?

Your comment about my experience with Java was needlessly provocative so I'm going to ignore it.

It was mentioned at the FOSDEM 2019 talk about K8S.

Nowhere in my comment did I address the Java remark to you specifically.

It was targeted at all of those who bash Java without actually knowing how rich the ecosystem actually is.

You called me a "clueless dev".

No I did not, nowhere in my sentence is there a direct mention of you.

In my previous comment I explicitly mentioned it was targeted at developers who bash Java without having any clue what they are bashing, cargo cult if you wish.

So either you feel offended, because you are indeed attacking Java without having any clue about the Java eco-system, or are deciding to play victim here, when I never mentioned you directly.

This is a public forum, so I leave to everyone else and HN moderators to judge my comments, and won't reply to anything else on this thread.

> And I don't know if it was planned, but Go has turned out to be a boon for contribution. A surprising number of devops engineers are willing to dip their toes in Go waters and I highly doubt they would've done so with Erlang.

> I think you'd end up going the chef route, writing the control plane in Erlang and expecting users to interact in Ruby or something like it.

DevOps engineers have learned to be comfortable with Ruby (Chef) and Python (Ansible) so this certainly makes sense.

Kubernetes is based on Google's Borg / Omega, not OTP; they are very different things. As for your queue / DB question, there is something called transactions for that kind of problem. I hope you don't use OTP features to replace transactions, or you're looking at very serious issues.

OTP doesn't do 5% of what Kubernetes is able to provide (CPU / IO / memory quotas, liveness / readiness probes, rolling / canary / blue-green deployments).

I knew somebody would take my random examples and turn them against me. :D Serves me right for not researching several hours to find the perfect example, I guess!

I am well aware of the differences in scope between K8s and Erlang/OTP. My point was that they kind of try to do the same only on different levels.

And I am still not sold on the idea of throwaway containers. That works well if you have a hyper-network of microservices that are able to discover each other and self-heal a bigger graph of services but if you have any sort of a more classic coherent whole app... then not really. Throwing containers away and rebooting new copies isn't something that looks viable for many projects.

But hell, who knows. K8s team and their audience are really dedicated. They might change every single tool of the trade just to make K8s work well. I was mostly saying that I don't feel it brings something really radical to the table.

Let it crash is kind of overstated for Erlang too. Sure, a supervised process will restart, but you'll lose the message queue. For a gen_server that's mostly stateless (or that's mostly servicing state in mnesia or similar), it ends up being less disruptive to catch in handle_call around your real work, so a crash only throws away the one request.
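To make the trade-off concrete, here is a small Python sketch of the idea (not Erlang, and not anyone's production code): a server loop that catches failures around each request, so one bad request produces one error reply instead of throwing away everything still queued. The `handle` function and the shape of the replies are invented for the example.

```python
from collections import deque


def handle(request):
    """Hypothetical request handler; raises on a zero request."""
    return 100 // request


def serve(mailbox):
    """Process every queued request, isolating failures per request."""
    replies = []
    while mailbox:
        request = mailbox.popleft()
        try:
            replies.append(("ok", handle(request)))
        except Exception as exc:
            # Only this request is lost; the rest of the queue survives.
            replies.append(("error", type(exc).__name__))
    return replies
```

Without the `try`/`except`, the first failing request would abort `serve` and the remaining entries in `mailbox` would never be answered, which is the Python analogue of losing the message queue on a crash.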

Caveat: in the environment I work, we don't do real gen_server:call, because all of the included monitor/demonitor calls are too expensive for us. The trade-off is we only have timeouts when the server goes away during the request. If you had the monitor letting you know the request crashed the server, you could presumably do a smartish retry -- but it's still nicer to only need to do that for the crashing request, not the others that would fail simply because the server died.

Depends on the scenario of course, but when working with Elixir I tend not to use GenServers much as well. I mostly handle problems with idempotent and at-least-once-executed tasks. Works quite well 99% of the time.

This is why HashiCorp's Consul/Nomad/Vault/Terraform stack has strong support - it is way easier than setting up K8s.

Better yet: a whole pile of microservices, preferably in node.js that tries to re-invent supervision trees and of course on Kubernetes.

If you want to re-invent the whole Erlang eco-system you should just switch and call it a day, the chances that a company that needs to get a job done will be able to pull this off successfully as a side project are nil.

Earlier HN discussion from 2017 on the same post: https://news.ycombinator.com/item?id=14748028

It was really interesting to read how each problem they solved uncovered another problem further down the pipeline. It also reveals how much is going on behind the scenes when a service goes down and everybody starts accusing the developers of incompetence. :)

It also helps you learn to appreciate software that very rarely goes down at all. It's no simple accomplishment.

While 5M concurrent chat users is definitely a massive feat, it seems to me that this is pretty much this tech's (Erlang's) sweet spot, and not being able to do so would have been disappointing. Or am I missing something?

My takeaway is that Erlang/Elixir aren't designed to squeeze that much performance with that setup and that the Discord team went out of their way to optimize hot-spots to make it possible.

Erlang/Elixir are still quite impressive out of the box, even without these Discord-specific optimizations. But they probably wouldn't scale to 5M right away.

Nothing does, 5 million __concurrent__ users creating and sending new data in real time is huge. Not to mention video and sound.

Yes. In my eyes the win here isn't concurrent performance that cannot get beaten by anything else; I am pretty sure a carefully crafted Golang or Rust stack can beat Elixir any day. But they won't have the fault tolerance guarantees of the BEAM VM.

The real win IMO is the good reliability:performance ratio that Discord achieved. I feel Erlang/Elixir are excellent in optimizing this exact metric.

Golang and Rust also don’t give you a direct path to distributed actors.

Sure. This is basically the exact sort of problem Erlang is designed to solve.

But it's still cool to see it done, and a lot of people are unfamiliar with Erlang or Elixir, so the discussion is interesting.

How often do we lament people using the wrong tools for the job?

I think it’s probably healthier to cheer the people doing it right than to keep shaming everyone else.

They are attempting to expand to being a springboard for games and apps within their app. Sort of like Steam, but since you're already chatting with friends why not start up a game of League?

No I mean Erlang.

Question for the Discord team.. if you started this project today in 2019, would you still build a homegrown event sourcing system? Or would you use Kafka?

We would still use our own homegrown system. And we do use kafka internally already for a plethora of other things. Just not on the real-time chat side of things.

Why not on the real time chat stuff?

Because Kafka is definitely not a tool suited for real time chat and event distribution to millions of clients. Once a message hits our distribution it’s fanned out to clients on average 5-10ms later.

> Sending messages between Erlang processes was not as cheap as we expected, and the reduction cost — Erlang unit of work used for process scheduling — was also quite high.

This is really surprising to me, and definitely something the elixir team should look at optimizing. Sending messages should be extremely fast.

There are ways to speed up message passing. It's not slow by any means, but weird things become your bottleneck when you have 5M concurrent users. For instance, you have to build your OTP tree extra layers deep or spawn hundreds of supervisors, because just the routine calls to the master process to add / remove children begin to become a bottleneck.

In my implementation, only 1/2 a million concurrent IOT devices, I use routing tables that narrow down a process to a node and registry cluster (https://hexdocs.pm/elixir/master/Registry.html) , and from that registry it fans out to 1 of 100 supervisors for that worker type per node.

The high reduction cost is deliberate. Reductions in BEAM are arbitrary values that ERTS assigns to certain operations. Sending to a remote node has a high reduction cost, which means that the more remote sends you do, the more often your process gets rescheduled. We worked around this with manifold, by limiting the number of remote sends we do to fan out messages, and transforming them into local sends on the receiving nodes, which have a much cheaper "reduction" cost.
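The grouping trick described here can be sketched in a few lines of Python. This only illustrates the idea (one remote send per node, then local delivery on the receiving side); it is not manifold's code. Modelling a pid as a `(node, local_id)` tuple and the `send_remote` / `send_local` callbacks are assumptions made for the example.

```python
from collections import defaultdict


def group_by_node(pids):
    """Group (node, local_id) process identifiers by their node."""
    groups = defaultdict(list)
    for node, local_id in pids:
        groups[node].append(local_id)
    return dict(groups)


def fan_out(pids, message, local_node, send_remote, send_local):
    """Deliver `message` to all pids using one remote send per remote node.

    Returns the number of remote sends performed.
    """
    remote_sends = 0
    for node, local_ids in group_by_node(pids).items():
        if node == local_node:
            for local_id in local_ids:
                send_local(local_id, message)
        else:
            # One cross-node send; the receiving node fans out locally.
            send_remote(node, local_ids, message)
            remote_sends += 1
    return remote_sends
```

With N recipients spread over K remote nodes, the sender pays for K remote sends instead of N, and the cheap local sends happen on the nodes that own the recipients.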

I have no insight into how Erlang performs this process, but I'd assume Erlang sends messages almost as fast as possible, as it's a key part of the platform and has had two decades to perfect it. Most likely this is a limitation of the paradigm (the Actor model) and not the implementation.

I believe the bottleneck the author was trying to overcome here is how the Erlang VM moves processes to the back of the run queue once they hit a predetermined number of operations (preemptive scheduling) and not of the Actor Model.

If the message is sent to a remote host then perhaps the erlang:send_nosuspend function could help. I know it is used in core parts of Erlang telecom systems with several millions of users to work around performance issues.


That's something that the BEAM group would work on, although jose has contributed some patches to the compiler.

Needs some attention to the scaling down, IMO. Discord is annoying on intermittent connections. Why do sent messages not have identity such that they can go through twice?

We actually added this 2nd half of last year. Message send operations are now deduped and idempotent given a client-generated nonce. Are you still experiencing this issue?
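As a rough illustration of what "deduped and idempotent given a client-generated nonce" can mean, here is a toy Python sketch. It is not Discord's implementation; the `MessageStore` class, keying on `(author, nonce)` and the in-memory dicts are all assumptions made for the example.

```python
import itertools


class MessageStore:
    """Toy message store that dedupes sends by (author, nonce)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._by_nonce = {}   # (author, nonce) -> message id
        self.messages = {}    # message id -> content

    def send(self, author, nonce, content):
        key = (author, nonce)
        if key in self._by_nonce:
            # Retry of a send we already stored: return the same id.
            return self._by_nonce[key]
        message_id = next(self._ids)
        self._by_nonce[key] = message_id
        self.messages[message_id] = content
        return message_id
```

A client that times out and retries with the same nonce gets the original message id back and no duplicate is stored. This only works if the client reuses the nonce on the retry, which is what the replies below are questioning.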

Your nonce might not survive the "message -> disconnect -> red message -> re-send" flow.

Could you just get rid of red messages and make it "pending" until acked as "sent"?

Yes I experience this constantly, very recently, on poor internet connections

Have any Node.js/websocket implementations scaled on this order of magnitude? Would like to do a read-up of any challenges faced.

5M concurrent users is a pretty big number. There are plenty of Node.js systems/frameworks which could scale to that size with minimal effort.

SocketCluster has been used in production to service hundreds of thousands of concurrent users and it can handle millions.

I've had one report of a chat system (adult industry) which could handle 250K concurrent users using only two large servers.

Also, I once did some consulting work for a popular cryptocurrency trading platform which handled tens of thousands of concurrent users/trading bots (with very high frequency of messages). After I was done, that company didn't talk to me for 6 months straight; it turns out that they hadn't had any issues with their pub/sub cluster since.

Unfortunately SocketCluster doesn't get discussed very often among influential circles. I have no idea why because the feedback I get from users is essentially 100% positive.

I guess Node.js doesn't get much hype these days.

Am I the only one thinking that node.js is maybe not the right choice for global-wide services?

node ain't the right choice for anything

I have given up on NodeJS and moved solely to Erlang for long term projects.

I still love NodeJS! But once you learn Erlang it's hard to go back; it reminds me of PG's point about having a higher bird's-eye view.

NodeJS, Python, Haskell (!), PHP, C - all belong to the same class of coding style with the same type of problems.

You use Erlang not because of playing the testosterone game of nominal performance - but because you really want some guarantee.

( Money should be not a problem since the world seems flush with cash - if your manager is complaining its because he wants his bonus to be higher. )

Care to elaborate what these types of problems are and how Erlang fixes them?

Sure !

Think about how printf / console.log / print .... works in traditional settings.

- How would you make it so that printf doesn't crash your entire program if the console hangs.

- how would you isolate a single codebase's IO operations ?

- How would writing to console work in a multi threaded environment ? multi server ? 100 servers ?

Haskell sort of tries to answer these questions but I am not sure how it's going to work out for them. In Erlang's process-based universe all those questions have been answered already!

Sounds like reasons to prefer FP over OOP, not necessarily Erlang.

I do really appreciate Erlang / Elixir (have contributed several libraries), but the problems you describe are not uniquely solved by Erlang. Akka / Scala is another take on the whole actor based architecture, and seems to have considerably more traction (hiring talent will be easier for your manager).

You should really read jhgg's response describing the BEAM runtime's pre-emptive scheduling of its lightweight processes: https://news.ycombinator.com/item?id=19241194 . No other runtime can do that for you, not even the JVM.

> Akka / Scala is another take on the whole actor based architecture, and seems to have considerably more traction (hiring talent will be easier for your manager).

They solve a subset of the problems that Erlang's OTP solves. They don't have the entire package.

Where I live, it is easier to find local Elixir talent than Scala. There was a big Ruby community here and many of them have moved to Elixir.

> but the problems you describe are not uniquely solved by Erlang

I agree, I do not have any Scala / Akka experience so I cannot argue for or against due to ignorance.

But as you say at least the platform / language addresses these concerns.

>> How would you make it so that printf doesn't crash your entire program if the console hangs.

Why would the console hang? If that happened, it would signal a major issue at the OS level and probably not related to your application (unless you're trying to log an extremely massive string; which is a bad idea and you'd probably already have run out of memory before that could happen). I have never seen the console/stdout hanging in production and I've built some pretty high-traffic distributed systems with Node.js.

>> how would you isolate a single codebase's IO operations ?

What sort of IO operations are we talking about? Network, Disk? There are many ways to inspect different kinds of IO operations. The Node.js ecosystem offers a large number of modules which would let you achieve that.

>> How would writing to console work in a multi threaded environment ? multi server ? 100 servers ?

Node.js is perfect for running on Kubernetes. There are many K8s dashboards and tools which allow you to browse and aggregate logs from thousands of machines with very little effort. I don't see how this point has anything to do with Erlang specifically. A language-agnostic container orchestrator like Kubernetes is the best way to go over a tool which only works with a specific language.

There are Node.js frameworks which offer Kubernetes .yaml files and CLI tools which allow you to deploy a highly scalable cluster to K8s in a few minutes.

Can we all take a step back here and for a second realize that Node.JS is based upon a language/event model that was purpose-built and designed to handle client side browser operations, and has since been expanded into the server space. And Erlang/OTP/BEAM was built and designed for running reliable soft real-time distributed telecom systems.

By default (and unless you go out of your way w/ web-workers) the javascript event-loop (and thus nodes event loop) is single threaded. To work around this, you can bind many node processes to a given port to load balance requests (SO_REUSEADDR, anyone?) - or simply run many smaller instances of node (perhaps in a bunch of containers) where traffic ingresses in via some form of load balancer. The load balancing problem is unavoidable, and you will definitely need the same if you want to send requests to multiple BEAM nodes. However, BEAM can schedule your work across all the cores you give it.

But let's talk about work for a second, and about a very special thing that the BEAM VM gives you that other runtimes (whether it be Node, the JVM, Golang's, etc.), aside from the actual operating system of your computer, do not: preemptive scheduling.

Suppose you have a single core computer that's running Linux, and you have a process that is sitting there busy looping. Let's say we just make a simple script that does nothing infinitely in a loop. Does your computer grind to a halt? Most likely, no. You can still probably use your terminal, move your mouse, operate your web browser, etc... You can thank pre-emptive scheduling for that. The OS suspends the process to allow other processes to do work - hopefully in a fair manner (on linux, the CFS (aptly named Completely Fair Scheduler) does this).

Now let's say you have a single node process serving requests. Let's say that a specific kind of request requires 250ms of CPU time to compute - and does not explicitly yield back to the event loop. (You can imagine doing some processing of input data, deserialization, serialization, aggregation, etc...). During this computation, nothing else within the node process can progress. This means that requests that may not take a lot of time to compute now have to wait 250ms to be processed. Generally, I see node deployments not having single request/response request handling, but rather many concurrent requests/responses being handled at any given time, using promises/callbacks to allow the event loop to progress while waiting on IO from something else. During the periods of expensive computation from a given request handler, the entire event loop is stalled, and the response time percentiles of your requests spike. A pathological case would be something like `setTimeout(() => { while (1) {} }, 1000)` hanging your entire node process after a whole second, as the loop does not yield back to the event loop.

In BEAM, this does not exist. Processes are scheduled and pre-empted - to allow for fair utilization of the underlying computation resources (very much like how your OS does it.) This means that a computationally intensive process does not stall the event loop for all other processes, meaning that your response times and percentiles remain low for all other work within the system.

Now of course, you could hand-craft your JavaScript code to explicitly yield to the scheduler every so often, but that's a lot of work that you as a programmer are now doing that your runtime could be doing for you, and forgetting to do it could be catastrophic to the performance of your soft-realtime system.
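For reference, hand-crafted yielding in node might look something like the sketch below, which processes an array in slices and hands control back to the event loop between slices via `setImmediate`. The helper names and the chunk size are arbitrary choices for illustration, not a recommendation.

```javascript
// Hand control back to the event loop; other queued callbacks get a chance
// to run before the returned promise resolves.
function yieldToEventLoop() {
  return new Promise((resolve) => setImmediate(resolve));
}

// Sum a large array in slices, yielding between slices so other work can
// interleave. `chunkSize` is a tuning knob, picked arbitrarily here.
async function sumInChunks(values, chunkSize = 10000) {
  let total = 0;
  for (let start = 0; start < values.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, values.length);
    for (let i = start; i < end; i++) {
      total += values[i];
    }
    await yieldToEventLoop();
  }
  return total;
}

const data = Array.from({ length: 100000 }, (_, i) => i);
let otherWorkRan = false;

const pendingSum = sumInChunks(data);
// Stands in for another request arriving while the sum is still in progress.
setImmediate(() => { otherWorkRan = true; });

pendingSum.then((total) => {
  console.log(`total=${total}, other work ran meanwhile: ${otherWorkRan}`);
});
```

Every CPU-heavy code path needs this treatment, and a single loop that skips the `await` brings the stall right back.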

This is only one of the many benefits that OTP/BEAM provide over other runtimes. But one compelling enough for Discord as a company to bet on it. For a given service, we run entirely homogeneous infrastructure. We do not need to allocate or dedicate special resources to our largest servers (100k CCU/350k members), and instead can run and schedule it alongside the millions of other small servers that exist on Discord - all without negatively impacting the performance, percentiles, or soft-realtime guarantees of your chat with a few of your friends.

The NodeJS cluster module can be used to do load balancing between multiple processes on the same machine - It supports load balancing either at the OS level or at the application level. The application level 'round robin' approach is the default and leads to more even distribution between processes in terms of CPU usage. The application 'round robin' approach may be limited in terms of scalability at some point but I've done tests with 32 processes on a 32 core machine and I couldn't see the slightest sign of struggle from the process which hands off the connections to worker processes. Hopefully, eventually the OS scheduler in Linux will have improved enough to outperform the application-level LB but for now it hasn't.

In any case, you don't necessarily need load balancing at the host level; you can load balance at the cluster level only. Your load balancers (e.g. nginx or haproxy...) have their own hosts/machines in your cluster and they load balance between processes directly. Some of those may be running on the same machine but the load balancer does not distinguish between them. A random load balancing approach yields the most even distribution in my experience - you do need each process to be able to support maybe 1k concurrent users in order to get sample sizes on each process large enough for random distribution to even out, but this is easily achieved with most Node.js WebSocket libraries. They can easily support 10k concurrent connections per process with very high message throughput. If the connections are mostly idle, each process can handle 100k connections or more.
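The sample-size point can be checked with a small simulation: assign connections to processes uniformly at random and see how far the busiest or idlest process strays from the average. This is a sketch under assumed numbers (32 processes, 10 vs 1000 connections per process on average); the seeded generator is only there so runs are repeatable.

```javascript
// Small deterministic PRNG (mulberry32 style) so the simulation is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assign each connection to a uniformly random process; return per-process load.
function simulateRandomBalancing(processCount, connectionCount, random) {
  const load = new Array(processCount).fill(0);
  for (let i = 0; i < connectionCount; i++) {
    load[Math.floor(random() * processCount)]++;
  }
  return load;
}

// Largest deviation from the mean load, as a fraction of the mean.
function maxRelativeDeviation(load) {
  const mean = load.reduce((sum, n) => sum + n, 0) / load.length;
  return Math.max(...load.map((n) => Math.abs(n - mean) / mean));
}

const random = mulberry32(1);
for (const perProcess of [10, 1000]) {
  const load = simulateRandomBalancing(32, 32 * perProcess, random);
  const worst = (maxRelativeDeviation(load) * 100).toFixed(1);
  console.log(`${perProcess} connections/process on average: worst deviation ${worst}%`);
}
```

With only a handful of connections per process the imbalance is large; around a thousand per process it shrinks to a few percent, which is why random assignment works well once each process holds enough connections.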

Load balancing a bunch of connections between a bunch of processes across a cluster of nodes is pretty well understood. I don't think that's what I'm trying to say here. My post is more on the power of pre-emptive scheduling built into the runtime, and what it means for your application.

We use BEAM/OTP for way more than just holding open websocket connections. Our entire websocket layer is a few hundred lines of Elixir code - and honestly hasn't been touched in over a year - and has remained pretty much the same as we scaled from 200k ccu to well over 5m ccu. Holding open websockets and load-balancing them is pretty much a solved problem for us.

How is this related to programming languages - I think these are platform issues and design choices. How does Erlang help there (curiously interested)?

You should read up on the fault tolerance guarantees of Erlang OTP. It's really hard to summarize. Supervision trees, "let it crash and get safely restarted", super-mini-processes that have preemptive scheduling built-in, soft real-time...

TL;DR: No, no other language or framework in the world has the primitives that Erlang and Elixir have.

People on HN and Reddit really love acting non-impressed and claiming the pain points are easily solved in other languages.

My 17 years of career say this is not true at all.

Erlang's language syntax is kinda awful but that is beside the point.

The issues I have raised are 100x harder to address than some syntax.

Those are good points, but does node fail in the scale requirement?

I mean supposing we can get past console printing issues?

Node is awesome for fast and quick projects / cmd tools.

- For example I have a project involving smart electrical inverters, I need some guarantees regarding crash handling / low latency. Not a lot of scaling issues.

- With scaling involving NodeJS, I consistently had issues with exploding RAM and crashes - unable to isolate part of the codebase.

I struggled a lot to fix these problems, so while searching for a solution I came across Erlang and haven't looked back.

The problem with your reasoning is that you're trying to make your program very resilient when it shouldn't be. Let the underlying infrastructure deal with that: if it crashes, it will be re-created. BEAM and Erlang are the wrong solution for those problems because they're not language neutral. I can make my nodejs / PHP app as resilient as Erlang using Kubernetes.

Infrastructure can never get to any reasonable level of resilience alone without programs being designed for resilience and such ignorance will likely lead to nasty stuff, like cascading failures, because of massive load differences between normal operations and attempts to handle errors by recreating entire services. But Kubernetes itself is not known for its resilience, on the contrary, its reliability reputation is pretty bad at this point, unlike that of Erlang.

Design for resilience is not a small thing. I suggest reading Joe Armstrong's "Making reliable distributed systems in the presence of software errors" [1] as a starting point on this.

[1] http://erlang.org/download/armstrong_thesis_2003.pdf

I'm not saying that you should not add resilience to your program, but regarding OP's claim about "what if printf crashes": well, that's it, it will crash, deal with it. You should ensure your platform will recover if your program crashes, not try everything possible to keep your program from crashing.

"Pet vs Cattle"

The thing I don't like about Erlang is that the runtime mixes code and infra, which I think is not a good idea; it was designed before we made progress with HA platforms like Kubernetes. https://github.com/kubernetes/community/blob/master/sig-scal...

It should be separated and it's what pretty much everyone is doing nowadays.

Say you have a bug in a minor feature, that most people wouldn’t care about, but triggers the crash. Now someone’s trying to use this feature once or twice per minute.

If you rely on the infra for recovery, you’re going to be in serious trouble. There may be 300k users connected to an app instance, and you’re going to be kicking all of them out and restarting an instance every time the minor feature is called. This turns a minor bug into a full-blown outage.

Now we’re talking about a “printf crash” as an example, but in my experience the issues are more subtle. I’ve seen a Python app leak file descriptors due to a bug in an object’s destructor, which only happened when an exception was thrown in a certain place. This caused a service outage as the app looked “fine” from the infra’s point of view (it could respond to health check requests), but the functionality was dead. With Erlang processes, using a process per connection, the VM guarantees resources are cleaned up when a process dies, so that kind of issue doesn’t happen.

The Zen of Erlang goes a bit deeper into transient issues and why the supervision model helps: https://ferd.ca/the-zen-of-erlang.html

I originally thought the same until I started building ML pipelines. The best would be a direct implementation of first class module systems to handle infrastructure.

Seriously? You’d rather let a whole container crash and respin and start all the services that will rebind again hoping that everything goes as planned in the restart? I’d rather have a language in which you can’t have null pointer exceptions and you handle everything that can possibly go wrong rather than write sloppy code that causes entire processes to crash, honestly.

But Erlang is a dynamic language...

It's easy to scale to 5m websockets with almost any language; the real question is what those connections are doing. Looking at Discord, we can assume that a large % of users are idle, for example. My Discord client is always on but I do nothing on it, and I'm pretty sure that's the case for most people since it starts at Windows startup.

I actually think that Elixir / Erlang are not great for those kinds of problems because they're slow languages and consume a lot of memory. They allow you to do easy message passing + horizontal scaling but the runtime is inefficient. Java / C# (.NET Core) / Rust / C++ / Go are much faster than Elixir. (If you actually read up on it you'll see that they use a lot of C / C++ / Rust to make it fast, which is not something you need to do with the above languages.)

And for deployment / scaling just use Kubernetes or equivalent, better than BEAM trust me.

Since this blog post, we are well over 5m ccu (I don't think we have shared our current peaks publicly yet). We do egress 4m-6m websocket messages/sec through our system though at peak. So it's a lot more than just holding onto idle websockets :P

The power of BEAM is that although the raw performance may not be the best, throughput is consistent and response times stay low throughout the system - no single process (or actor) can monopolize the resources of the system. When you use a homogeneous server configuration like we do, this makes a lot of sense. Our largest guilds (100k ccu, 350k members) are on the same nodes as all of our other servers. And when they're busy, the small guilds notice no performance degradation.

Other chat products out there (that I hear use the JVM for their real time stuff) have to spin up dedicated resources to host their larger servers/clients - and even then cannot handle servers with as many users or concurrents as we can. Every single Discord server runs in a homogeneous cluster, without special dedicated resources for our largest instances.

I'm a bit curious as to the message processing overhead. BEAM and HiPE are relatively fast, but string processing of any kind was kind of slow back when I messed around with Elixir...

Use IO List rather than strings. Same result, but much much faster.


String manipulation is notoriously slow in the BEAM, mostly for implementation reasons.

Message passing is somewhat slow as well, slower than Go if you want a comparison.

But on the systems that you design on the BEAM, velocity is usually not a problem. What you usually try to do is reach a design that does not have a single bottleneck or point of failure, so that pretty much whatever happens you can just add machines.

This turns out to be a great way to design multi-{process, core} and multi-node systems.

What you usually get on a BEAM under load is 100% CPU usage, even on multi-core, but to be honest those cores are not used as efficiently as they could be.

It is a matter of tradeoffs, I can quickly write multi process software that can easily scale, but I will leave on the table some raw performance number.

Again it turns out that just raw numbers are not as important as they are simple to measure.

FWIW, that’s also why use of IO Lists is strongly encouraged rather than string manipulation. It’s also why HTML with Phoenix is so fast.
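The idea behind IO lists can be sketched outside the BEAM too: keep output as a nested list of fragments and hand each fragment to the writer, instead of concatenating strings at every step. The JavaScript below is only an analogy, with a hypothetical `writeIoList` helper; it is not how Phoenix or the BEAM implement it.

```javascript
// Walk a nested array of string fragments and hand each leaf to `write`,
// never building intermediate concatenated strings along the way.
function writeIoList(ioList, write) {
  for (const part of ioList) {
    if (Array.isArray(part)) {
      writeIoList(part, write);
    } else {
      write(part);
    }
  }
}

// Fragments can be shared and nested freely; composing them is just
// building another array, with no copying of the underlying strings.
const header = ['<h1>', 'Hello', '</h1>'];
const page = ['<html>', header, ['<p>', 'body', '</p>'], '</html>'];

const chunks = [];
writeIoList(page, (chunk) => chunks.push(chunk));
console.log(chunks.join(''));
// prints "<html><h1>Hello</h1><p>body</p></html>"
```

In a real server the `write` callback would push each fragment to the socket, so static template parts are reused rather than rebuilt for every response.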


Can you be more specific in what you mean by "processing"?

- Reading from the mailbox?

- Destructuring and binding the content?

- Operating on the data by transforming, modifying, or filtering it?


Can the ETS cache system be used by something other than Elixir?

ETS is basically a key value store with optional sorting, multiple values, limited atomic updates, and a weird query language. I'm not sure how well it compares to other key value stores, but its main differentiator is that it stores Erlang native types without a significant marshalling burden (at least for the developer -- if you store complex values, it's still plenty of work for the runtime). It wouldn't be a lot of work to build an Erlang service to expose ETS to something else, but I don't know that the data marshalling required would be worth it.

More interesting might be to expose mnesia, maybe. But even then, it might be most useful to build out your data handling in Erlang and expose a higher level API for clients in other languages.

When you have an erlang distributed app and need caching without having the overhead of converting erlang terms then ETS is the best solution. That's not to say you would want to use ETS instead of redis from a java app.

I'm continuously impressed at the quality of the software coming from Discord. By far the best messaging user experience on desktop there is.

Discord already uses Cloudflare, I'm curious what they think of Workers.

edit: I'm on macOS, not a heavy user, I used Slack / Skype / Hangouts / WhatsApp / Messenger / etc. before

Unfortunately I have the opposite experience

- Abusive trust and safety team

- Constant outages for both users and bots

- Frontend is heavily bloated

- UI designed for money grab instead of for the users

Discord's trust and safety team has come under fire lately for their (until a week ago) lax policy on "cub content" (for those unfamiliar, it's the furry equivalent of child porn). Apparently some of the people on the team may have been participants in sharing it; there are number of furries on that team (at least: Tinyfeex, Allthefoxes).

While I appreciate their technical accomplishments, this is one of those things that gives me chills about the company.

As much as I want to support Discord, their goal from when they initially started as a company has shifted in an entirely new direction. I feel betrayed as a 2015 user.

Honestly I don't know what you expected. There's no other way it could've ended. Voice, file uploads, etc are expensive.

Don't fall in love with services. (Or with people ;P)

Don't forget the weird dependency hell when it comes to writing Discord bots in Python. They require you to use an older version of Python that is "fun" to work with.

All the Discord API libraries are community built and maintained.

Why do you have to use Python?

By judging all the strings in the client frontend, their help website, and my interactions with the support team, Discord is being run by some rather immature people.

I have the opposite experience. Frequent outages/voice connectivity degradation. Font rendering was broken on many linux distros for months. And now there has been some kind of crash for about a month and still ongoing where discord becomes sluggish and eventually closes over time. (If you have a game open it can crash within one hour)

Not to mention the update after which the noise suppression would randomly mute your mic completely.

>I'm curious what they think of Workers.

We are a very heavy user of cloudflare workers. Our marketing/developer/web app are entirely served from the edge using workers - and we use workers for request authentication for downloading game chunks on our store.

Thanks a lot, I should have googled it, I would have seen your previous comment mentioning it which has complementary information[1].

[1]: https://news.ycombinator.com/item?id=17447355

I fully agree, compared to other widely used software in the gaming space (Teamspeak, Mumble, Ventrilo). The experience of joining new servers, creating new content on existing ones, and chat+voice is amazing!

I can definitely agree that Discord is a fantastic chat experience. I use both IRC and Discord, and I think Discord (if only slightly) takes the cake.

Discord has actually been using workers since before they were available to the general public. They've been using them mainly for edge caching games for the store, A/B, and promotion of clients to different testing environments.

Discord lets me pick sound output and input devices... Slack doesn't even do that!

Never in the world would I have thought a gaming chat client had more robust features than the pricey enterprise chat client

If you're on windows, the 'App Volume device preferences' settings pane has you covered

It's the replacement for the vintage volume control, while a little sub-par by any modern standard, you can pick the input/output device per application

Having gone through some of their Elixir libraries, I concluded the opposite. They have some very inefficient and ugly code in a lot of places.

Not self-hostable.

>By far the best messaging user experience on desktop there is.

No offence but that's like your opinion.

Coming from IRC and TeamSpeak, to me Discord is unnecessarily bloated and full of analytics.

>I used Slack / Skype / Hangouts / WhatsApp / Messenger / etc. before

Yeah, explains it. :P

I'm still on IRC. It is much quicker and it is much easier to be connected and active in multiple servers/channels, whereas the overview in Discord is terrible unless you use third party clients, which can get you banned from the service.

The voice service isn't very high quality and there are frequent server outages.

I use Telegram, Facebook, whatever except Discord for instant messaging, and wish Discord wasn't the default for non-professional groups.

People my age (upper 20s) praising it are always those who used Skype for gaming and never touched Teamspeak/Ventrilo/Mumble.

I was a user of all three of the gaming-centric chat / voice services and the issue always has been about centralizing different communities together rather than quality of service / technology itself. One group plays on a random guy’s Teamspeak server that’s been running in a VPS or colo for 6 years, another splinter group moved to Vent, etc. Then there were the fun times dealing with various authentication and rotating certificates (if you practiced decent security at all you had to have one) for each of these disparate communities.

As gaming itself got broader reach a lot of casual folks were left to their own devices and underserved, so that’s the big gap being seen. I know plenty of 20-somethings that know about and used the predecessors to Discord but they’re all hardcore gamers compared to even the folks closer to mid/late 30s that may have been hardcore before but don’t have the time to fiddle with these systems anymore.

It seems with Discord quality was tough to maintain at scale and we’ve got the inverse problem set.

Teamspeak/Ventrilo/Mumble could get some inspiration from some ergonomics aspects of Discord. Quickly sharable unique urls for invitations for instance, instead of server:port-username-(password)

The mumble URI lets you embed a password: https://wiki.mumble.info/wiki/Mumble_URL#Username_and_Passwo...

Teamspeak has something like this.

I'm on both Discord and IRC but like Discord better. Discord comes with multiple servers (not that easy in IRC, you must have a bouncer), formatting, logs, in an easy package and a relatively "low" footprint when compared to the competition.

You do not need a bouncer to connect to multiple servers on IRC. Any client will do that and have server lists ready for you to just check and join.

Now, you might want a bouncer to store messages and logs when offline, but not to be able to connect to multiple servers. Logging is a basic feature in IRC.

I'm precisely speaking about seeing the logs when not connected. The sense of community on IRC is lower because you need to be connected all the time and otherwise lose what everyone said.

IRC clients have had those features for at least 15-20 years now.

No, if you are not connected you lose the logs.

Discord has low footprint? It's usually the application that consumes the most RAM on my system (if we ignore the browser) and consumes a steady 1-2% of CPU even when in the systray.

And it is, of course, closed source, so no way to use an alternative client. I wish I could use an IRC gateway for it, that would be cool (I very, very rarely use the voice chat)

>no way to use an alternative client

These exist, but some are not open source. You can run discord on a vita if you wanted to.

Just because something is closed source doesn't mean it's impossible to reverse engineer. The entire client is compiled into a JavaScript webpack bundle, and between that and reading the web requests and their documentation, reimplementing some basic functionality in an alternative client is really easy.

This apparently breaks the terms of service, but they only enforce this if you're an outlier for the number of API requests you send.

The most well known 3rd party client would be Ripcord, which I use for its Slack features

> but they are only enforce this if you're an outlier for the number of API requests you send

This thing usually goes: Until the better client threatens the proprietary one, which is better monetized.

Low footprint compared to the competition. Discord is way better than IRC in practice, so IRC is no competition. I'd rather spend that RAM to have features.

I used Teamspeak and Ventrilo for gaming back in the day and much prefer Discord.

IRC lacks history (unless you have access to an always-on server, which you'll pay for one way or another) and has a userbase that is hostile to standardizing anything like user auth or rich-text formatting. I'll put up with a bit of "bloat" for the sake of not having to worry about where my bouncer is running.

I haven't been on IRC in years but auth has been standardized no? Msg the auth bot with your authentication phrase or you get your nick changed

Some servers do that. It's not standardised, not integrated with any other systems, and usually inherently subject to a race condition that renders it low security.

No history is a feature. I like knowing that, like in real life, only people who are present can read what I'm saying.

That's not the case either though - many users have bouncers that will sit in the channel the whole time, and you have no idea who might be saving logs or where they might be publishing them. It's the worst of both worlds.

>many users have bouncers that will sit in the channel the whole time

Then I know they're reading.

>and you have no idea who might be saving logs or where they might be publishing them

If I'm in a room with you nothing stops you from telling others what I've told you.

> If I'm in a room with you nothing stops you from telling others what I've told you.

Sure, but I won't have a complete written record like an IRC log, and won't be able to credibly quote.

how would one credibly quote on IRC? Isn't the chat log enough for you?

My point is that chat logs make IRC a lot more "on the record" than in-person discussions in practice (even if not in theory). Anyone who's keeping a log can quote directly from it, and those quotes are credible (if nothing else, because the person posting them doesn't know who else was keeping a log and can call out inaccurate quotes).

From time to time my ISP changes the IP address and discord totally freaks out about that. I have to fill out a captcha and have to receive a verification E-Mail. They are treating me as if I used asdf1234 as my password.

Yes, and if they don't like you enough they lock your account until you provide a phone number (no thanks)

Really neat post. Did y’all consider GenStage or another demand based approach for your overflow problem? I’d be interested in hearing about the tradeoffs between demand vs semaphore, it seems like the two have some similarities.

I think the main difference is that "demand" in GenStage tells the producer how much work the producer should do to satisfy the consumer. The consumer sets the demand, the producer does not limit it.

With how discord uses semaphores, the Consumer will not even make a demand for more events from the producer.

Other than that, I think in the case of Discord, it's more of an RPC thing, and GenStage is rather made for concurrent streaming and processing of data.
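To illustrate the semaphore side of that comparison, here is a minimal sketch of a caller-side counting semaphore: the caller checks for a free slot before doing the work and fails fast when none is available, so nothing queues up behind an overloaded service. It is written in JavaScript to match the node example earlier in the thread; the class, the function, and the limit of 2 are hypothetical and not taken from Discord's Elixir code.

```javascript
// Counting semaphore with a non-blocking acquire: callers never wait.
class Semaphore {
  constructor(max) {
    this.max = max;
    this.inUse = 0;
  }

  // Returns true and takes a slot if one is free, false otherwise.
  tryAcquire() {
    if (this.inUse >= this.max) return false;
    this.inUse++;
    return true;
  }

  release() {
    if (this.inUse > 0) this.inUse--;
  }
}

// Run `fn` only if a slot is free; otherwise fail fast instead of queueing.
function callWithLimit(semaphore, fn) {
  if (!semaphore.tryAcquire()) {
    return { ok: false, reason: 'overloaded' };
  }
  try {
    return { ok: true, value: fn() };
  } finally {
    semaphore.release();
  }
}

const limiter = new Semaphore(2);
limiter.tryAcquire(); // simulate two requests already in flight
limiter.tryAcquire();
console.log(callWithLimit(limiter, () => 'handled')); // rejected: no free slot
limiter.release(); // one in-flight request completes
console.log(callWithLimit(limiter, () => 'handled')); // accepted
```

In a demand-based design the consumer would instead tell the producer how much work it is ready for, which suits streaming pipelines better than request/response calls.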

Nice. How about fixing the client side by, you know, doing a native client?

So why can't they add basic features like organizing servers?
