
Using Rust to Scale Elixir for 11M Concurrent Users - O_H_E
https://blog.discordapp.com/using-rust-to-scale-elixir-for-11-million-concurrent-users-c6f19fc029d3
======
rdtsc
Thanks for sharing. I like reading posts that show a few trials before
arriving at the solution. And Rust and Elixir just seem to go together.

I noticed that both a list and ordsets were tested. Ordsets are just
sorted lists. The huge difference, I'd guess, is because full sorting and
uniqueness checking happen after every member addition, as in add_member(M,
L) -> lists:usort([M | L]), while ordsets actually traverses the list and
finds the right place to insert the element.
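The gap can be sketched outside Erlang too. Here is a rough Rust analogy (hypothetical illustration, not code from the article): both functions maintain a duplicate-free sorted Vec, but one re-sorts the whole thing on every addition while the other does a single positioned insert.

```rust
// Two ways to add an element to a sorted, duplicate-free Vec.
// `usort_style` mimics add_member(M, L) -> lists:usort([M | L]):
// re-sort and re-deduplicate the entire list on every addition.
// `ordsets_style` mimics ordsets:add_element/2: find the slot and
// insert once. (Erlang's ordsets scans linearly; binary_search is
// just the idiomatic Vec equivalent -- the insert itself is still O(n).)

fn usort_style(mut v: Vec<u32>, m: u32) -> Vec<u32> {
    v.push(m);
    v.sort_unstable(); // O(n log n) work on every single addition
    v.dedup();
    v
}

fn ordsets_style(mut v: Vec<u32>, m: u32) -> Vec<u32> {
    if let Err(pos) = v.binary_search(&m) {
        v.insert(pos, m); // one shift, no full re-sort
    }
    v
}

fn main() {
    let v = vec![1, 3, 5];
    assert_eq!(usort_style(v.clone(), 4), vec![1, 3, 4, 5]);
    assert_eq!(ordsets_style(v.clone(), 4), vec![1, 3, 4, 5]);
    assert_eq!(ordsets_style(v, 3), vec![1, 3, 5]); // duplicate ignored
}
```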

Also wonder how gb_sets
[http://erlang.org/doc/man/gb_sets.html](http://erlang.org/doc/man/gb_sets.html)
would behave. Those should provide better asymptotic behavior as they actually
implement a balanced tree data structure. Rust would still be faster but it
would be an interesting comparison.

~~~
chongli
Ordsets have O(log n) time insert, remove, and lookup operations. They are
definitely not lists, they must be trees.

~~~
IceDane
It appears you don't understand sorted lists if you think a tree structure is
required for those asymptotics.

~~~
matthiasl
Erlang's ordsets() module has O(N) insertion. Since there's theoretical
disagreement, here's experimental verification:

    
    
    11> f(), Ord_bench = fun(N) ->
            Random = [random:uniform(N) || _ <- lists:seq(1,N)],
            F = fun(Item, Set) -> ordsets:add_element(Item, Set) end,
            lists:foldl(F, ordsets:new(), Random)
        end.
    #Fun<erl_eval.6.99386804>
    12> timer:tc(fun() -> Ord_bench(10000) end).
    {337691,
     [1,2,3,4,5,7,8,10,12,13,14,15,16,17,18,19,21,23,24,26,27,28,
      29,36,39,40,42|...]}
    13> timer:tc(fun() -> Ord_bench(100000) end).
    {37167977,
     [1,2,3,5,7,9,10,11,12,15,16,17,18,19,20,21,23,24,25,26,27,
      29,30,33,34,35,38|...]}

So, 337 ms turns into 37 seconds, roughly 100 times slower, when I increase
the data size tenfold. That quadratic blow-up is what you'd expect when each
of the N insertions costs O(N): building the whole set is O(N^2).

Repeating for gb_sets, the results are 49 ms versus 484 ms.

Could the misunderstanding be that you don't realise that the ordsets module
specifically guarantees that the representation is an ordinary sorted list?

(I am aware of e.g. skip lists. But that is not the same Erlang data structure
as a sorted list.)

~~~
chongli
No, the misunderstanding is that I was thinking of Ordsets in Rust [1], which
are implemented as trees.

[1] [https://docs.rs/im/10.0.0/im/#sets](https://docs.rs/im/10.0.0/im/#sets)

------
ikazar
Looks like you've implemented a square-root decomposition based solution
resulting in O(sqrt(N)) insert, remove and O(sqrt(N)+M) slice operations where
M is the number of elements within the requested slice [1].

A better approach, imo, would be an order statistic tree with O(log N) insert,
remove and O(log N + M) slice operations [2]. This would usually be implemented
as an implicit treap [3]. Furthermore, balanced binary search trees can be
implemented efficiently even in a functional language like Elixir that doesn't
have mutable data structures. One such example is Erlang's gb_trees [4].

[1] [https://cp-algorithms.com/data_structures/sqrt_decomposition.html](https://cp-algorithms.com/data_structures/sqrt_decomposition.html)

[2] [http://www.cs.yale.edu/homes/aspnes/pinewiki/OrderStatisticsTree.html](http://www.cs.yale.edu/homes/aspnes/pinewiki/OrderStatisticsTree.html)

[3] [https://cp-algorithms.com/data_structures/treap.html#toc-tgt-6](https://cp-algorithms.com/data_structures/treap.html#toc-tgt-6)

[4]
[http://erlang.org/doc/man/gb_trees.html](http://erlang.org/doc/man/gb_trees.html)

------
gubbrora
> large sorted sets

Surprised to see no mention of trees, which are basically the standard
data structure for this.

~~~
jhgg
One thing our blog post didn't explain too well was the need to access
arbitrary slices of data within the sorted set, as well as to know the index
at which an item was inserted or removed. This is necessary for our usage of
sorted sets: clients can subscribe to a given window of the sorted set (e.g.
the top of a member list), and in order to compute the delta operations that
keep the client-side list in sync with the one on the server, we need this
information.

~~~
kilotaras
This reads like an almost-textbook description of a Cartesian tree.

~~~
jhgg
Cartesian trees do not provide the ability to get items at arbitrary indices
within the data structure efficiently, from my understanding. In order to get
the Nth item in the tree, a linear traversal is required. Furthermore, getting
the index at which an item is inserted or removed requires the same traversal
to accumulate the index.

~~~
kilotaras
The common solution is to hold the size of each subtree in the nodes in
addition to the value, a.k.a. a treap with implicit keys.
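A minimal Rust sketch of that augmentation (hypothetical, and only the select half of the trick: this is a plain unbalanced BST with cached subtree sizes, whereas a real treap adds priorities and rotations to stay balanced). With the sizes cached, the k-th element is found by comparing k against the left subtree's size at each step, O(log n) on a balanced tree instead of a full in-order traversal.

```rust
// A BST node that caches its subtree size, enabling O(height) select.

struct Node {
    value: i64,
    size: usize, // number of nodes in this subtree, including self
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn size(n: &Option<Box<Node>>) -> usize {
    n.as_ref().map_or(0, |n| n.size)
}

/// Insert `value`, maintaining subtree sizes (duplicates ignored).
fn insert(node: &mut Option<Box<Node>>, value: i64) {
    match node {
        None => {
            *node = Some(Box::new(Node { value, size: 1, left: None, right: None }));
        }
        Some(n) => {
            if value < n.value {
                insert(&mut n.left, value);
            } else if value > n.value {
                insert(&mut n.right, value);
            } else {
                return; // set semantics: already present
            }
            n.size = 1 + size(&n.left) + size(&n.right);
        }
    }
}

/// Return the k-th smallest value (0-based), or None if out of range.
fn select(mut node: &Node, mut k: usize) -> Option<i64> {
    if k >= node.size {
        return None;
    }
    loop {
        let left = size(&node.left);
        if k < left {
            node = node.left.as_deref().unwrap(); // answer is on the left
        } else if k == left {
            return Some(node.value); // this node is the k-th
        } else {
            k -= left + 1; // skip left subtree and this node
            node = node.right.as_deref().unwrap();
        }
    }
}

fn main() {
    let mut root = None;
    for v in [50, 30, 70, 20, 40] {
        insert(&mut root, v);
    }
    let root = root.unwrap();
    assert_eq!(select(&root, 0), Some(20));
    assert_eq!(select(&root, 2), Some(40));
    assert_eq!(select(&root, 4), Some(70));
    assert_eq!(select(&root, 5), None);
}
```

The dual operation, reporting the index (rank) at which an item was inserted or removed, accumulates the same left-subtree sizes on the way down.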

~~~
pornel
But at this point you're adding even more bookkeeping and implementation
complexity just to use a data structure that's not ideal in the first place
(for performance, sequential memory layout is better than chasing pointers).

------
BasDirks
I recently wrote a parser in Elixir for a templating language. Binary pattern
matching is pretty neat for this purpose, but performance is nevertheless sub-
optimal. Elixir was chosen over more suitable languages because our team is
Elixir-only. OP shows that delegating performance-critical computations to
Rust is viable and fairly clean. That's good to know.

~~~
zinclozenge
Have you taken a look at
[https://rhye.org/post/erlang-binary-matching-performance/](https://rhye.org/post/erlang-binary-matching-performance/) ?
I've seen lots of posts in the erlang mailing list about how unintuitive
getting optimal performance from binary pattern matching can be.

~~~
ramchip
It's also worth trying on OTP-22, which is reportedly much better at
optimizing binary patterns:

[http://blog.erlang.org/OTP-22-Highlights/](http://blog.erlang.org/OTP-22-Highlights/)

------
nwrk
Well played. Kudos for creativity
([https://github.com/rusterlium/rustler](https://github.com/rusterlium/rustler)).
Thanks for the write-up and benchmarks.

~~~
kibwen
Rustler doesn't get the notice it deserves. I see people talk about using Rust
to extend Ruby and Python and Node all the time, but none of those value
concurrency-safety as much as Erlang does, and in the domains people tend to
use Erlang for, the difference between using C and using Rust for NIFs matters
even more than in typical back-end serverland.

~~~
dmix
I have a feeling there are plenty of excellent NIFs for most situations so
most people using Elixir haven't been very dependent on Rust optimizations.
Especially the web people who usually don't need that level of optimization,
unless you're a giant like Discord, which is relatively unique.

Although I wouldn't be surprised to see more and more of Rustler as both
Elixir and Rust gain in popularity. There's probably plenty of low-hanging
performance fruit for any future rustler devs.

------
losvedir
Wow, very cool! Two questions:

1) One of the things that used to be the case with NIFs is that they can
interfere with the BEAM's famed pre-emptive scheduling. Any chance that's
what's going on here? ie: the NIF solution was fast because it's getting an
unfair amount of CPU time, potentially leading to latency issues elsewhere in
the app? I believe recently that was changed, though, and the BEAM can pre-
empt NIFs?

2) Did you consider/benchmark a port solution before NIF? I think of those as
a little safer, but with not as good of performance.

~~~
fanf2
Their SortedSet operation times are quite a lot less than the 1 ms recommended
yield time for NIFs:
[http://erlang.org/doc/man/erl_nif.html#lengthy_work](http://erlang.org/doc/man/erl_nif.html#lengthy_work)

~~~
losvedir
Ah, good point! Thanks.

------
ramchip
I'd be curious how ETS does on that problem. The Erlang efficiency guide says
an ETS table takes:

> Initially 768 words + the size of each element (6 words + the size of Erlang
> data). The table grows when necessary.

This might be too large if there's a ton of them, but perhaps a single table
per node could be workable? In particular, ordered_set tables with
write_concurrency have been improved a lot in Erlang/OTP-22:

    
    
       OTP-15128    Application(s): erts, stdlib
    
                   ETS option write_concurrency now also affects and
                   improves the scalability of ordered_set tables. The
                   implementation is based on a data structure called
                   contention adapting search tree, where the lock
                   granularity adapts to the actual amount of concurrency
                   exploited by the applications in runtime.

~~~
jhgg
ETS has no facility to insert/remove an object and get the index at which the
object was inserted to/removed from. This is needed in order to synchronize
windows within the sorted set to our clients. We would have loved to use ETS
here; we do use it in many places, and have open-sourced a few projects built
on it:

- [https://github.com/discordapp/zen_monitor](https://github.com/discordapp/zen_monitor)

- [https://github.com/discordapp/gen_registry](https://github.com/discordapp/gen_registry)

- [https://github.com/discordapp/ex_hash_ring](https://github.com/discordapp/ex_hash_ring)

~~~
toast0
I don't know if it would be good enough, and there's certainly not a big
reason for you to try, given your solution works great; but you could probably
modify ETS to give you the slot number where it inserted/removed and use that
as a non-exact indicator of where in the sequence the item was. (Depending on
how accurate you need it to be, this might work?)

------
papaf
It's great to see two interesting languages used together.

I wonder if an out of process solution was considered. For instance, Redis
supports sorted sets:
[https://redis.io/commands#sorted_set](https://redis.io/commands#sorted_set)

Using something like Redis saves development time and makes builds and
packaging easier as no FFI is needed. However Redis comes at the cost of
having a more complicated infrastructure.

~~~
jhgg
On a given node, we maintain a few million sorted sets that see a query and
insertion velocity in the orders of millions of operations per second. When we
are meticulously measuring operation time in the microsecond range to meet a
performance target, a networked solution is out of the question.

~~~
elcritch
Also, Rustler handles compiling the Rust code and configuring the NIF dynamic
library details. I'd wager it'd be a ton easier than Redis infrastructure even
if Redis worked for the use case. The sheer amount of flexibility in the BEAM
is one reason I love working in Elixir.

Really it seems like a "write once and forget" type of NIF. Have you had to
make a lot of updates after the initial tuning with Rustler?

Though I was curious whether you tried ETS or Mnesia? They both offer sorted
sets. Not sure if they just use ordsets behind the scenes.

Edit: nvm, I see you answered in another thread.

~~~
jhgg
This data structure has been operating in production for almost a year now, we
have not touched it for any modification after writing and deploying it. We
had 0 issues deploying it, and thanks to the comprehensive testing suite and
benchmarks, we were confident it would meet the performance objectives. It was
actually a drop-in replacement for the pure elixir data structure we had
written. Deploying it was literally just a find + replace of OrderedSet to
SortedSet in our code, running our test suite and then gradually deploying the
new code to production.

~~~
elcritch
Impressive work!

------
namelosw
This is really cool.

Meanwhile, it also shows Elixir itself is already really fast for the rest of
us. Unless you're Discord or WhatsApp.

~~~
Thaxll
Elixir is pretty slow tbh, like 3-10x slower than Java / C# / Go. Immutability
/ message passing is very heavy.

~~~
chessturk
It's all relative. Plenty of people come from Node, Ruby, and Python.

Coming from the Go world, I view Elixir as trading speed for out-of-the-box
OTP niceness. But comparing Elixir to Node, you're just gaining speed and
stability.

And it keeps going -- I very frequently see Go criticized for slowness by C,
C++, and Rust developers. It is all about the use case, which is why I find
this topic so interesting. It's one of the few situations where Elixir's
middling speed is actually a real-world problem.

------
fenollp
Was any thought given to a process dictionary implementation? One (maybe
supervised) process per instance of the pd_ordset type. The process would only
ever have one client, so no locking needed. The only potential issue would be
the cost of sending messages. I have seen this kind of solution bring 100x
speed improvements over ETS tables for write-once, <GB string storage. This
could be a gen_server as well. I'm curious because the PDict is often frowned
upon (especially vocally on the mailing list) yet cannot be avoided in some
parts of OTP.

Also, looking at the code leads me to believe the benchmarking of the Rust
code was done without accounting for the overhead of calling a NIF, if there
is any. Is this intended, or am I missing something?

Last question: are the great tracing capabilities of the BEAM lost when using
NIFs and would this be an issue at Discord scale?

Thanks for the interesting write up.

------
devit
It seems like they are using a really weird data structure (it's not even
clear whether it's sub-quadratic).

This is a textbook problem normally solved using a balanced binary tree
augmented by adding a number with the size of the subtree to every node (an
"order statistic tree").

------
foota
I'm curious, how do they handle persistence? Is this data structure just used
as a cache or?

~~~
cuddlecake
As I understand it, sorted sets are an in-memory feature.

Discord's architecture is one where an entire guild stays on one machine, and
the sorted set is part of a guild as well.

So there is no need to persist it, since it is ephemeral in nature.

Edit: perhaps to expand on this a bit more, keep in mind that Discord is not
a cloud-native application. They have a few strong server nodes and a
Cassandra cluster (at this point I am curious whether they switched to
ScyllaDB) that handle all, or at least most, of their client-facing features.

Since they do not need to scale services, the servers they have are stateful.
Hence, if the guild server crashes, the data structure goes with it; it would
need to be rebuilt from scratch either way.

(Disclaimer: I am not an insider, just an Elixir fanboy)

------
spockz
I wonder how they did the benchmarks. In Java/JVM land you would use JMH.
Is there anything similar for Erlang/BEAM?

~~~
fgkramer
You can use some built-in modules:
[http://erlang.org/doc/efficiency_guide/profiling.html](http://erlang.org/doc/efficiency_guide/profiling.html)
or some external tools like recon:
[https://ferd.github.io/recon/index.html](https://ferd.github.io/recon/index.html)

------
EdgarVerona
Thank you for sharing, and for going so deep on it!

------
hinkley
Did they just reinvent B trees?

------
didgeoridoo
This title is the platonic ideal of an HN post.

~~~
mwerty
The article was unusually well-written as well.

------
layoutIfNeeded
It would be nice if they had also gone to the same lengths with their clients
in terms of efficiency...

