Using Rust to Scale Elixir for 11M Concurrent Users (discordapp.com)
441 points by O_H_E on May 18, 2019 | hide | past | favorite | 65 comments

Thanks for sharing. I like reading posts that show a few trials before arriving at the solution. Rust and Elixir just seem to go together.

I noticed that both a plain list and ordsets were tested. Ordsets are just sorted lists. The huge difference, I'd guess, is that with the plain list, full sorting and uniqueness checking happen after every member addition, as in add_member(M, L) -> lists:usort([M | L]), while ordsets traverses the list and inserts the element at the right place.
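For illustration, here is that same contrast sketched in Rust rather than Erlang, with a sorted Vec standing in for the sorted list (the function names here are invented for the sketch, not taken from either codebase):

```rust
// usort-style: prepend, then fully re-sort and deduplicate on every insert.
fn add_member_usort(m: u32, mut l: Vec<u32>) -> Vec<u32> {
    l.push(m);
    l.sort();  // O(n log n) per insert
    l.dedup(); // sets hold unique elements
    l
}

// ordsets-style: find the right position and insert there.
fn add_member_ordered(m: u32, mut l: Vec<u32>) -> Vec<u32> {
    match l.binary_search(&m) {
        Ok(_) => {}                    // already present: nothing to do
        Err(pos) => l.insert(pos, m),  // O(n) shift, no re-sort
    }
    l
}

fn main() {
    let a = add_member_usort(3, vec![1, 2, 4]);
    let b = add_member_ordered(3, vec![1, 2, 4]);
    assert_eq!(a, vec![1, 2, 3, 4]);
    assert_eq!(a, b);
}
```

Note that a Vec can at least find the position in O(log n) before paying the O(n) shift; Erlang's cons list pays O(n) for the position search itself. Either way each insert is linear, just with very different constant factors from the usort version.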

Also wonder how gb_sets http://erlang.org/doc/man/gb_sets.html would behave. Those should provide better asymptotic behavior as they actually implement a balanced tree data structure. Rust would still be faster but it would be an interesting comparison.

Ordsets have O(log n) time insert, remove, and lookup operations. They are definitely not lists, they must be trees.

> Ordsets have O(log n) time insert, remove, and lookup operations. They are definitely not lists, they must be trees.

To be specific, we were talking about Erlang's ordsets module. You can see it constructed as a list: https://github.com/erlang/otp/blob/d6285b0a347b9489ce939511e...

    new() -> [].
Then notice the function to add an element https://github.com/erlang/otp/blob/d6285b0a347b9489ce939511e.... It's a recursive traversal of the list to find the right place to add it.

I am surprised that this works with a list. An array list would require moving data to insert at the right place, and a linked list doesn't let you get to the nth element in O(1).

Erlang's lists are immutable cons cells (same idea as in Lisp). A cons cell is just the element plus the "tail" pointer to the rest of the list. It is written as [H | T]. The last cons cell's tail is an empty list. So a list L = [1,2] is the same as [1 | [2]] or [1 | [2 | []]].

Underneath, it is implemented like a linked list, with each node pointing to an element and then to the next node. But explicit pointer manipulation, breaking and stitching lists together like one would do in C, is not allowed. So it is all about traversing the list and then inserting at the right place.
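To make "traverse, then insert" concrete, here is a toy immutable cons list in Rust with a sorted insert. It rebuilds only the prefix before the insertion point and shares the suffix, which is essentially what the Erlang version does (all names here are invented for the sketch; this is not Erlang's implementation):

```rust
use std::rc::Rc;

// An immutable cons list, the same shape as Erlang's [H | T].
#[derive(Debug)]
enum List {
    Cons(i32, Rc<List>),
    Nil,
}
use List::{Cons, Nil};

// Insert `x` in sorted order: copy the cells before the insertion
// point, share everything after it.
fn insert_sorted(x: i32, l: &Rc<List>) -> Rc<List> {
    match l.as_ref() {
        Cons(h, t) if *h < x => Rc::new(Cons(*h, insert_sorted(x, t))),
        Cons(h, _) if *h == x => Rc::clone(l), // already present: a set
        _ => Rc::new(Cons(x, Rc::clone(l))),
    }
}

// Flatten to a Vec for inspection.
fn to_vec(mut l: &Rc<List>) -> Vec<i32> {
    let mut out = Vec::new();
    while let Cons(h, t) = l.as_ref() {
        out.push(*h);
        l = t;
    }
    out
}

fn main() {
    let l = Rc::new(Cons(1, Rc::new(Cons(3, Rc::new(Nil)))));
    assert_eq!(to_vec(&insert_sorted(2, &l)), vec![1, 2, 3]);
}
```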

The reason ordsets are implemented using plain lists is simplicity, and because they are designed for small collections: say, keeping some preferences around, or a list of options. Not storing 1M entries.

But there are other data structures and facilities for handling larger collections, like gb_trees, which are implemented as general balanced trees. Another is an in-memory table called ETS. This one can be shared between processes (though terms are actually copied in most cases when reading from and writing to it).

The thing is that linked lists can only insert sorted in O(n).

They are most definitely sorted lists. This is mentioned very clearly in the documentation: http://erlang.org/doc/man/ordsets.html

>An ordset is a representation of a set, where an ordered list is used to store the elements of the set.

That does not prevent a tree-based implementation, though.

It kind of does. And there is no need to argue about this; the code is available and very straightforward: https://github.com/erlang/otp/blob/fe2b1323a3866ed0a9712e9d1...

They don't have O(log n) insert as implemented in Erlang. It's a linear scan to find the place to insert.

It appears you don't understand sorted lists if you think a tree structure is required for those asymptotics.

Erlang's ordsets module has O(N) insertion. Since there's theoretical disagreement, here's experimental verification:

  11> f(), Ord_bench = fun(N) ->
          Random = [random:uniform(N) || _ <- lists:seq(1,N)],
          F = fun(Item, Set) -> ordsets:add_element(Item, Set) end,
          lists:foldl(F, ordsets:new(), Random)
      end.
  #Fun<erl_eval.6.99386804>
  12> timer:tc(fun() -> Ord_bench(10000) end).
  {337691,
   [1,2,3,4,5,7,8,10,12,13,14,15,16,17,18,19,21,23,24,26,27,28,
    29,36,39,40,42|...]}
  13> timer:tc(fun() -> Ord_bench(100000) end).
  {37167977,
   [1,2,3,5,7,9,10,11,12,15,16,17,18,19,20,21,23,24,25,26,27,
    29,30,33,34,35,38|...]}

So, 337 ms turns into 37 seconds, roughly 100 times slower, when I increase the data size tenfold, which is what you'd expect from O(N) insertion: N insertions at O(N) each is O(N^2) total, so 10x the data means 100x the time.

Repeating for gb_sets, the results are 49 ms versus 484 ms.

Could the misunderstanding be that you don't realise that the ordsets module specifically guarantees that the representation is an ordinary sorted list?

(I am aware of e.g. skip lists. But that is not the same Erlang data structure as a sorted list.)

No, the misunderstanding is that I was thinking of Ordsets in Rust [1], which are implemented as trees.

[1] https://docs.rs/im/10.0.0/im/#sets

I would probably use the word array if you can access any item without traversal. Just a thought.

Looks like you've implemented a square-root decomposition based solution resulting in O(sqrt(N)) insert, remove and O(sqrt(N)+M) slice operations where M is the number of elements within the requested slice [1].
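For readers unfamiliar with the technique, here is a minimal Rust sketch of a sqrt-decomposition sorted set along the lines of [1]. It illustrates the general idea only, not Discord's actual implementation; all names are invented for the sketch:

```rust
// Elements live in sorted buckets of roughly sqrt(N) items, so an insert
// costs O(sqrt N): scan up to O(sqrt N) buckets, then shift within one
// O(sqrt N)-sized bucket. Summing earlier bucket sizes on the way also
// yields the inserted element's overall index for free.
struct BucketedSet {
    buckets: Vec<Vec<u64>>,
    max_bucket: usize, // split threshold, roughly 2 * sqrt(expected N)
}

impl BucketedSet {
    fn new(max_bucket: usize) -> Self {
        Self { buckets: vec![Vec::new()], max_bucket }
    }

    // Insert `x`; returns the index it landed at, or None if already present.
    fn insert(&mut self, x: u64) -> Option<usize> {
        let mut base = 0; // total items in all earlier buckets
        let mut bi = 0;
        // Find the bucket whose range should contain x.
        while bi + 1 < self.buckets.len()
            && self.buckets[bi].last().map_or(false, |&last| last < x)
        {
            base += self.buckets[bi].len();
            bi += 1;
        }
        match self.buckets[bi].binary_search(&x) {
            Ok(_) => None, // already a member
            Err(pos) => {
                self.buckets[bi].insert(pos, x);
                if self.buckets[bi].len() > self.max_bucket {
                    // Split an oversized bucket to keep costs at O(sqrt N).
                    let half = self.buckets[bi].split_off(self.max_bucket / 2);
                    self.buckets.insert(bi + 1, half);
                }
                Some(base + pos)
            }
        }
    }
}

fn main() {
    let mut s = BucketedSet::new(4);
    assert_eq!(s.insert(5), Some(0));
    assert_eq!(s.insert(3), Some(0)); // 3 sorts before 5
    assert_eq!(s.insert(7), Some(2));
    assert_eq!(s.insert(5), None);    // duplicate
}
```

A real version would also shrink/merge underfull buckets and support remove and slice, but the bucket scan plus binary search above is the core of the O(sqrt N) bound.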

A better approach, imo, would be an order statistic tree with O(log N) insert and remove and O(log N + M) slice operations [2]. This is usually implemented as an implicit treap [3]. Furthermore, balanced binary search trees can be implemented efficiently even in a functional language like Elixir that doesn't have mutable data structures. One such example is Erlang's gb_trees [4].

[1] https://cp-algorithms.com/data_structures/sqrt_decomposition...

[2] http://www.cs.yale.edu/homes/aspnes/pinewiki/OrderStatistics...

[3] https://cp-algorithms.com/data_structures/treap.html#toc-tgt...

[4] http://erlang.org/doc/man/gb_trees.html
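To make the order-statistic idea concrete, here is a minimal Rust sketch of the size augmentation described in [2]: each node stores its subtree size, so both "index at which this element was inserted" and "element at index i" take O(height) time. Balancing rotations are omitted for brevity, so this toy version can degrade to O(N) height; a real implementation would keep the tree balanced (AVL, red-black, or treap):

```rust
struct Node {
    key: u64,
    size: usize, // number of keys in this subtree, including this node
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn size(n: &Option<Box<Node>>) -> usize {
    n.as_ref().map_or(0, |n| n.size)
}

// Insert `key`, returning its index (rank) in sorted order.
// Duplicates are not handled specially in this sketch.
fn insert(root: &mut Option<Box<Node>>, key: u64) -> usize {
    match root {
        None => {
            *root = Some(Box::new(Node { key, size: 1, left: None, right: None }));
            0
        }
        Some(n) => {
            n.size += 1;
            if key < n.key {
                insert(&mut n.left, key)
            } else {
                // Everything in the left subtree plus this node precedes `key`.
                size(&n.left) + 1 + insert(&mut n.right, key)
            }
        }
    }
}

// Key at sorted index `i` (0-based), if any.
fn select(root: &Option<Box<Node>>, i: usize) -> Option<u64> {
    let n = root.as_ref()?;
    let left = size(&n.left);
    match i.cmp(&left) {
        std::cmp::Ordering::Less => select(&n.left, i),
        std::cmp::Ordering::Equal => Some(n.key),
        std::cmp::Ordering::Greater => select(&n.right, i - left - 1),
    }
}

fn main() {
    let mut t = None;
    assert_eq!(insert(&mut t, 5), 0);
    assert_eq!(insert(&mut t, 2), 0); // 2 sorts before 5
    assert_eq!(insert(&mut t, 8), 2);
    assert_eq!(select(&t, 1), Some(5));
}
```

The same `size` field also gives O(log N + M) slices: descend with `select` to the start index, then walk M elements in order.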

> large sorted sets

Surprised to see no mention of trees, which are basically the standard data structure for this.

One thing that our blog post didn't explain well was the need to access arbitrary slices of data within the sorted set, as well as to know the index at which an item was inserted or removed. This is necessary for our usage of sorted sets: clients can subscribe to a given window of the sorted set (e.g. the top of a member list), and in order to compute the delta operations that keep the client-side list in sync with the one on the server, we need this information.

I think you can augment simple trees to support those operations.

This reads as an almost textbook description of a Cartesian tree.

Cartesian trees do not, from my understanding, provide an efficient way to get items at arbitrary indices within the data structure. In order to get the Nth item in the tree, a linear traversal is required. Furthermore, getting the index at which an item is inserted or removed requires the same traversal to accumulate the index.

The common solution is to hold the size of the subtree in each node in addition to the value, a.k.a. a treap with implicit keys.

But at this point you're adding even more bookkeeping and implementation complexity just to use a data structure that's not ideal in the first place (for performance, sequential memory layout is better than chasing pointers).

I propose a challenge: provide the interface for the data structure and the test tool.

Memory fragmentation.

They can also be implemented fairly efficiently in functional languages.

I recently wrote a parser in Elixir for a templating language. Binary pattern matching is pretty neat for this purpose, but performance is nevertheless sub-optimal. Elixir was chosen over more suitable languages because our team is Elixir-only. OP shows that delegating performance-critical computations to Rust is viable and fairly clean. That's good to know.

Have you taken a look at https://rhye.org/post/erlang-binary-matching-performance/ ? I've seen lots of posts in the erlang mailing list about how unintuitive getting optimal performance from binary pattern matching can be.

It's also worth trying on OTP-22, which is reportedly much better at optimizing binary patterns.


I have, but thanks for the link :)

Well played. Kudos for creativity (https://github.com/rusterlium/rustler). Thanks for the write-up and benchmarks.

Rustler doesn't get the notice it deserves. I see people talk about using Rust to extend Ruby and Python and Node all the time, but none of those value concurrency-safety as much as Erlang does, and in the domains people tend to use Erlang for, the difference between using C and using Rust for NIFs matters even more than in typical back-end serverland.

I have a feeling there are plenty of excellent NIFs for most situations so most people using Elixir haven't been very dependent on Rust optimizations. Especially the web people who usually don't need that level of optimization, unless you're a giant like Discord, which is relatively unique.

Although I wouldn't be surprised to see more and more of Rustler as both Elixir and Rust gain in popularity. There's probably plenty of low-hanging performance fruit for any future rustler devs.

That's just a consequence of Erlang/Elixir having a small fraction of the developer mind share of Node or Python.

Wow, very cool! Two questions:

1) One of the things that used to be the case with NIFs is that they can interfere with the BEAM's famed pre-emptive scheduling. Any chance that's what's going on here? ie: the NIF solution was fast because it's getting an unfair amount of CPU time, potentially leading to latency issues elsewhere in the app? I believe recently that was changed, though, and the BEAM can pre-empt NIFs?

2) Did you consider/benchmark a port solution before the NIF? I think of ports as a little safer, but with worse performance.

Their SortedMap operation times are quite a lot less than the 1ms recommended yield time for NIFs http://erlang.org/doc/man/erl_nif.html#lengthy_work

Ah, good point! Thanks.

I'd be curious how ETS does on that problem. The Erlang efficiency guide says an ETS table takes:

> Initially 768 words + the size of each element (6 words + the size of Erlang data). The table grows when necessary.

This might be too large if there's a ton of them, but perhaps a single table per node could be workable? In particular, ordered_set tables with write_concurrency have been improved a lot on Erlang/OTP-22:

   OTP-15128    Application(s): erts, stdlib

               ETS option write_concurrency now also affects and
               improves the scalability of ordered_set tables. The
               implementation is based on a data structure called
               contention adapting search tree, where the lock
               granularity adapts to the actual amount of concurrency
               exploited by the applications in runtime.

ETS has no facility to insert/remove an object and get the index at which the object was inserted to/removed from. This is needed in order to synchronize windows within the sorted set to our clients. We would have loved to use ETS, and use it in many places actually, and have open-sourced a few:

- https://github.com/discordapp/zen_monitor

- https://github.com/discordapp/gen_registry

- https://github.com/discordapp/ex_hash_ring

I don't know if it would be good enough, and there's certainly not a big reason for you to try, given your solution works great; but you could probably modify ETS to give you the slot number where it inserted/removed and use that as a non-exact indicator of where in the sequence it was. (Depending on how accurate you need it to be, this might work?)

It's great to see two interesting languages used together.

I wonder if an out of process solution was considered. For instance, Redis supports sorted sets: https://redis.io/commands#sorted_set

Using something like Redis saves development time and makes builds and packaging easier as no FFI is needed. However Redis comes at the cost of having a more complicated infrastructure.

On a given node, we maintain a few million sorted sets that see a query and insertion velocity on the order of millions of operations per second. When we are meticulously measuring operation time in the microsecond range to meet a performance target, a networked solution is out of the question.

Also, rustler handles compiling the rust code, and configuring the nif dynamic library details. I’d wager it’d be a ton easier than Redis infrastructure even if redis worked for the use case. The sheer amount of flexibility in the BEAM is one reason I love working in Elixir.

Really it seems like a “write once and forget” type of nif. Have you had to make a lot of updates after the initial tuning with Rustler?

Though I was curious if you tried ETS or Mnesia? They both offer sorted sets. Not sure if they just use ordsets behind the scenes.

Edit: nvm, I see you answered in another thread.

This data structure has been operating in production for almost a year now; we have not touched it for any modification after writing and deploying it. We had zero issues deploying it, and thanks to the comprehensive test suite and benchmarks, we were confident it would meet the performance objectives. It was actually a drop-in replacement for the pure Elixir data structure we had written. Deploying it was literally just a find + replace of OrderedSet to SortedSet in our code, running our test suite, and then gradually deploying the new code to production.

Impressive work!

What about a local redis instance using unix domain sockets?

The cost of making a system call and network serialization + deserialization still makes this infeasible.

This is really cool.

Meanwhile, it also shows Elixir itself is already really fast for the rest of us. Unless you're Discord or WhatsApp.

I doubt I will forget the time I first booted up an Elixir Phoenix server and got something like `[info] Sent 200 in 409µs`, and stared at that `µs` for a while before it dawned on me that it was the symbol for microseconds.

I was so used to 20-40ms that I couldn't even comprehend that it could be 80x faster. A year into playing around in Elixir's world, I'm still just so impressed by its speed.

Yeah. It's really something.

Recently I started writing a bunch of tests for my first Phoenix app.

After a day of hacking around and learning as I went, I had 63 tests, 57 of which are full-blown controller tests, and a good amount of those deal with a "current user" and read/write things from a database. I'm even using a "Faker" library to generate fake data for all of the DB attributes.

The entire test suite runs in 600ms. That's 600ms for ~60 controller tests from start to finish. That's in a worst case scenario too where the code is on a non-SSD and that code is volume mounted into a Docker container on Windows while I use WSL.

Sometimes things work so fast that I'm actually worried it's not doing what I think it's doing, but then I run --cover and see that I have 95%+ test coverage on the things I care about, so that code is getting reached.

20-40ms with what language and implementation?

Elixir is pretty slow tbh, like 3-10x slower than Java / C# / Go. Immutability / message passing is very heavy.

It's all relative. Plenty people come from Node, Ruby, and Python.

Coming from the Go world, I view Elixir as a trade of raw speed for out-of-the-box OTP niceness. But comparing Elixir to Node, you're just gaining speed and stability.

And it keeps going -- I very frequently see Go criticized for slowness by C, C++, and Rust developers. It is all about the usecase, which is why I find this topic so interesting. It's one of the few situations Elixir mid-level speed is actually a real world problem.

I get your point. Even without message passing, just in process calculation in Erlang/Elixir is also pretty slow compared to Java.

But in the domain of Web programming, scheduling efficiency is much more important than the speed of execution. If it is a huge batch operation on a single machine, Erlang/Elixir would be terrible.

Similarly, React has its fiber mechanism: it's worse according to benchmarks, but the user gets a better experience due to better scheduling.

Yeah, for numerical stuff. But for network stuff, which Erlang and its VM were created for, it's great, e.g. for web applications. Having the option to write Rust, C, etc. via NIFs is a good compromise.

Were any thoughts given to a process dictionary implementation? One (maybe supervised) process per instance of the pd_ordset type. The process would only ever have one client, so no locking would be needed. The only potential issue would be the cost of sending messages. I have seen this kind of solution bring 100x speed improvements to write-once <GB string storage over ETS tables. This could be a gen_server as well. I'm curious because the PDict is often frowned upon (especially vocally on the mailing list), yet it cannot be avoided in some parts of OTP.

Also, looking at the code leads me to believe the benchmarking of the Rust code was done without taking into account the overhead of calling a NIF, if there is any. Is this intended or am I missing something?

Last question: are the great tracing capabilities of the BEAM lost when using NIFs and would this be an issue at Discord scale?

Thanks for the interesting write up.

It seems like they are using a really weird data structure (it's not even clear whether it's sub-quadratic).

This is a textbook problem normally solved using a balanced binary tree augmented by adding a number with the size of the subtree to every node (an "order statistic tree").

I'm curious, how do they handle persistence? Is this data structure just used as a cache or?

As I understand it, Sorted Sets are an in-memory feature.

Discord's architecture is one where an entire guild stays on one machine. And the sorted set is part of a guild as well.

So, there is no need to persist it, since it is ephemeral in nature.

Edit: perhaps to expand on this a bit more, keep in mind that Discord is not a cloud-native application. They have a few strong server nodes and a Cassandra cluster (at this point I am curious if they switched to ScyllaDB) that handle all or at least most of their client-facing features.

Since they do not need to scale services, their servers are stateful. Hence, if the guild server crashes, so does the data structure; it is of no use once the server crashes and would need to be rebuilt from scratch either way.

(Disclaimer: I am not an insider, just an Elixir fanboy)

I wonder how they did the benchmarks. In Java/JVM land you would use JMH. Is there anything similar for Erlang/BEAM?

You can use some built-in modules: http://erlang.org/doc/efficiency_guide/profiling.html or some external tools like recon: https://ferd.github.io/recon/index.html

Thank you for sharing, and for going so deep on it!

Did they just reinvent B trees?

This title is the platonic ideal of an HN post.

The article was unusually well-written as well.

It would be nice if they had also gone to the same lengths with their clients in terms of efficiency...
