Hacker News new | past | comments | ask | show | jobs | submit login
Using Rust to Scale Elixir for 11M Concurrent Users (2019) (discord.com)
332 points by aloukissas 68 days ago | hide | past | favorite | 67 comments

The perfect Hacker News title has finally been crafted.

As was pointed out last time it was posted: https://news.ycombinator.com/item?id=19947458

A glitch in the Matrix?

A glitch in the Matrex: https://github.com/versilov/matrex (sorry I couldn't resist)

Rust, Elixir, Million, Concurrent ... all that's missing is Decentralized Web :)

What about blockchain ?

That would quickly get flagged by the flamewar detector.


Better alternative: Rust leverages the power of FP in Elixir to Scale 11M Users.

And “scale”!

scalable paradigm disrupting shift on blockchain in rust

Is it web scale?

Articles SortedSet[0] is basically a pseudo-BTree with fixed depth = 2 and fixed max_size of leafs. This gives O(log n) search and O(sqrt n) inserts [1].

It would mean that insertion at beginning is worst case scenario (split + need to move all buckets), but timing of insert is actually dominated by adding size of the buckets to calculate final index.

Spending a little bit of time on research, finding https://en.wikipedia.org/wiki/Order_statistic_tree and just using G++ implementation would probably yield better result and less code to support.

G++ has __gnu_pbds which add O(log n) "find_by_order" and "order_of_key" to trees, e.g. [2].

[0] https://github.com/discord/sorted_set_nif/blob/master/native...

[1] Technically O(n/max_size + max_size) but we can assume that max_size is selected to be ~sqrt n

[2] https://www.ideone.com/8mzxGR

You can also write Elixir/BEAM NIF extensions in Nim and Zig as well using Nimler or Zigler, resp [1,2]. Also I’ve found writing native extensions for the BEAM is simpler than many other dynamic languages since it relies on immutable data. It’s roughly the same as just a function to deserialize/serialize data with some extra references to GC’ed data. Especially with dirty “NIF”s (they allow the functions to run for as long as they want without messing up the BEAM scheduler).

1: https://github.com/wltsmrz/nimler 2: https://github.com/ityonemo/zigler

In the case of BEAM NIFs, I find Rust's compiler guarantees especially compelling.

That's because a crashing NIF can take down the whole Erlang VM. If fault-tolerance is what brought you to Erlang/Elixir and OTP to begin with, as it is for many, having the server crash could be a major issue.

True, Rust has the strongest guarantees. Nim offers similar guarantees overall (checked arrays etc) and Nimler has exception checking enabled so the compilers checks if you have any possible checkable exceptions. Personally, I use Nim since it’s easier to use on embedded Linux devices by precompiling to C. Rust has some odd dependencies on the target’s C linker but doesn’t check the standard $CC or $LD variables. Zig is interesting but it offers the least memory protection.

Stop raining on the rust parade!

The most interesting thing about Discord is their ability to scale. Their product really works well in my experience, and their scale is no joke.

The have some really disciplined engineers (some of them I used to work with). I have learned a lot from these people. Erlang is everybody's secret weapon for high concurrency (like Discord) and Rust has predictable performance that is easy to tune. GC is not an option for many use cases they have. One interesting problem they run into was the GC is Go and the lack of configuration options for it. I think they are the prime example of using the right tool for the job, whatever it might be, instead of trying to sell you a single programming language that solves everything.

I've been using Discord on a near-daily basis for 9 years, in servers from 10 to 25,000 people, and I think I can count on two hands the number of actual breakages I have had with the product. It's honestly pretty nuts.

> I've been using Discord on a near-daily basis for 9 years

Wikipedia says Discord was released 5 years ago. Am I missing something?

Hm — maybe that’s in reference to it coming out of beta or something? I can tell you I made my first Discord server in 2015.

I have been using for 2-3 years, and it had loads of breakages for me. I cannot count on two hands, need more. Europe.

I've seen it go down quite a lot. https://discordstatus.com/ shows at least 5 breakages in the last week.

I agree, I always found it strange that an app aimed for gaming really has better performance than slack/teams and their massive budgets behind them. But I never saw a company that uses discord for communication.

> I always found it strange that an app aimed for gaming really has better performance than slack/teams

Makes sense to me, its users would be running games alongside the software, and would be quite cross with the software shitting itself in the middle of a co-op game. Somewhat good performance and behaviour is (or at least was originally) absolutely necessary to getting traction, a bunch of checkboxes would not suffice.

> But I never saw a company that uses discord for communication.

Currently working at a place where at least one, if not two, teams use Discord for daily chatting. Official meetings are mostly handled over Zoom but I believe their intra-team stuff tends to happen over Discord.

My view is that they solved voice communication first - that was their original use case. Reliable multi-party voice is a harder problem than text chat and means that you can't/don't just rely on Electron for everything. Their product is a way to 'sell' some good non-trivial (but not groundbreaking) engineering.

Whereas Slack is successful because it understood it's usecase well (including how to sell to enterprises).

The closest I've found is the Spotify open-source project "Backstage", which uses Discord for communication (https://github.com/backstage/backstage) - I have no idea about the rest of Spotify, mind.

Budgets are not a subtitute for programming talent and willingness to experiment it seems.

A lot of opensource teams use Rust for communication

I know that rust is the greatest latest thing but I’m still not quite convinced of it’s excellence as a communication platform.

It'd be interesting to see it compared to https://doc.rust-lang.org/std/collections/struct.BTreeSet.ht....

Their skiplist description is a little funny — typically a skiplist has one element per leaf node, and you should never need to move items between buckets. Also skiplists have logarithmic layers in the number of elements, so inserting and searching are always bounded by O(log n).

I've been using a bucketized version of skiplist for years now, I thought I invented it! Using most recently in https://github.com/infradig/trealla (a Prolog interpreter).

We needed to support arbitrary index access into the sorted set - which is why stdlib btree map doesn't work.

For binary trees, indexing can be done by saving the subtree size of each node and doing a sort of binary search. Not sure if this is fast for B-trees that have more than 2 children nodes, though.

Oh, that makes sense. I like the solution you arrived at!

> There’s a fantastic Elixir project called Rustler. It provides nice support on the Elixir and Rust side for making a safe NIF that is well behaved and using the guarantees of Rust is guaranteed not to crash the VM or leak memory

Hum, can someone talk more about these NIF and what special thing Rustler brings?

What's different from just any kind of native interop, like say Java JNI ?

So there are a few ways to do interop with non-BEAM languages.

Ports are the typical one for simpler things. You start an external command and talk to it via stdin/stdout.

NIFs are potentially much faster and are traditionally in C. But they carry the risk of crashing the VM. Rustler+Rust removes most, likely not all, but most of the risks for screwing up and killing the VM.

So NIFs are usually not recommended simply because the resilience and reliability of the BEAM VM relies on a bunch of cool strategies inside the VM and NIFs side-step all of that. The Rust ones should be far less dangerous.

NIFs run without most (all?) of the safety features the erlang VM provides, such as memory safety and scheduling. If the NIF crashes, the entire app (aka, VM) crashes.

This is anathema to some...i dunno..philosophies? expectations? of Erlang/Elixir.

NIFs written in Rust (via Rustler) should never be able to crash the VM. Unlike NIFs written in C.

Rustler brings other quality of life improvements, like easier interop (decoding/encoding values to and from) and managing the cleanup of objects that live in Rust but are referenced by Elixir.

So what happens to the VM when an index out of bounds happens in Rustler?

Is there a panic handler that prevents a crash and returns a magic value that makes sense to the caller?

The called code is made unwind-safe at the FFI boundary using `std::panic::catch_unwind`. A panic will be converted to an Erlang exception.

Except it doesn't work always,

> Note that this function may not catch all panics in Rust. A panic in Rust is not always implemented via unwinding, but can be implemented by aborting the process as well. This function only catches unwinding panics, not those that abort the process.

It works whenever it's configured to unwind, which is the default. Apparently one can force this by adding a few lines in the rustler Cargo.toml (https://doc.rust-lang.org/edition-guide/rust-2018/error-hand...), I'll create a PR.

If you compile a NIF you probably will not go out of your way to write "panic = abort" in your project. Unwinding is the default behavior.

It depends on the dependencies.

From the docs filmor linked to, if a library sets panic = unwind and a user sets that to abort, “If any of your users choose to abort, they'll get a compile-time failure.”

I was thinking more about libraries that choose to abort, and already compiled.

It's considered bad practice, as far as I know. The executable should decide that, not library.

We use rustler only as a thin wrapper to existing crates (like, for hdrhistograms). I'm far from a Rust expert, but I believe it uses std::panic::catch_unwind

> Note that this function may not catch all panics in Rust. A panic in Rust is not always implemented via unwinding, but can be implemented by aborting the process as well. This function only catches unwinding panics, not those that abort the process.

I can’t speak to the Rustler, but the JNI bindings for Rust are quite good. There are a lot of areas where you can still shoot yourself in the foot, lots of unsafe. But there are patterns they encourage that are very safe, like associating lifetimes to Java objects that are instantiated in the Rust code appropriately. It also has good interfaces for working with arrays and avoid some of the foot guns in traditional C JNI libraries.

It’s not quite possible to have no unsafe code in JNI with Rust (in my experience) but it is very easy to isolate to very small areas of your code.

I'm not sure if the other answers were good enough.

NIF is native interop. There's a bunch of provided functions for interfacing with Erlang terms and other things like that. But, it's native interop, so you can also do whatever crazy stuff you want, even if it's a bad idea.

The BEAM VM cannot and does not protect you from bad NIFs; but if you follow the guidelines and are careful not to crash, you can have a good time. Often, the scope of the native work ends up so small, it's easyish to keep it safe. For larger scope, it's more common to use a Port or a C-node; a Port is just what Erlang calls running your other code in separate OS process communicating with stdin/stdout, and a C-node is your other code running separately, possibly on another machine, connected via network sockets, acting like a dist node. For this application though, it makes more sense to use a NIF to avoid copying the data to and from a separate OS process.

I haven't used Rustler, but my understanding is it's mostly just a framework of sorts for writing NIFs in Rust. Maybe making it easier to use Erlang terms, and setting up the compile settings etc.

My brief understanding of NIFs is they're similar to unsafe in Rust where a bunch of VM guarantees are only held so long as the NIF is well-behaved. I believe there was also some limits on how long a NIF could execute before starving the Erlang scheduler[1].

[1] http://erlang.org/doc/man/erl_nif.html

The other answers have covered the technical details well, but not so much the philosophical ones.

You may have heard about how Erlang with the BEAM is a fantastic implementation of the Actor programming model. This involves lots of processes (green threads) holding a little bit of state and communicating with messages. It's both a great concurrency model and a great programming model. The standard library (OTP) then also provides excellent tools for managing these processes in supervision trees.

Ports act like another actor in a supervision tree. This actor happens to be a C program and the messages are sent and received by stdout/stdin, but otherwise it behaves like a process. You can manage its runtime state with supervision trees and all the other excellent tools that BEAM/OTP give you.

NIFs on the other hand are best used to replace something that could be a pure function in Erlang, but is a lot faster in something compiled (or because you want to leverage some existing codebase that does that).

The little that I know about NIFs is that if they crash, they bring down the entire BEAM, so it's pretty risky if not done properly. But I've seen that Rust complements Elixir (via NIFs) very well.

OMG, my startup is trying to solve a similar problem. Albeit not at this scale (YET). Looks like I have a good excuse to delve back into rust

damn, tried to pull this in as a dependency, doesnt look like it compiles on the most recent erlang versions. good reason to dig into the source this weekend.


I was wondering if Crystal could be used thus way too. Found this https://github.com/splattael/erl_nif.cr

I believe you can use almost anything as an erlang / elixir nif

The combination of Elixir and Rust has been really fascinating to watch. They compliment each other so well.

You might be thinking complement. And yes those do complement each other perfectly. The other scaling article how and why they switched from Go to Rust is pretty mind blowing too.

Yep that one.

Old article: May 17, 2019. Title should reflect that.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact