I had noticed that both a plain list and ordsets were tested. Ordsets are just sorted lists. The huge difference, I guess, is because full sorting and uniqueness checking happen after every member addition, as in add_member(M, L) -> lists:usort([M | L]), while ordsets actually traverses the list and inserts the element in the right place.
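In Elixir terms, the difference can be sketched like this (a toy sketch for illustration, not the actual ordsets implementation):

```elixir
defmodule OrdInsert do
  # Naive approach: rebuild and re-sort the whole list on every addition.
  def naive_add(m, l), do: Enum.sort(Enum.uniq([m | l]))

  # ordsets-style approach: walk the already-sorted list and splice
  # the element in at the first position where it fits.
  def add(m, []), do: [m]
  def add(m, [h | _] = l) when m < h, do: [m | l]
  def add(m, [h | t]) when m > h, do: [h | add(m, t)]
  def add(_m, l), do: l  # element already present
end
```

Both keep the list sorted and duplicate-free, but the second does a single traversal instead of a full re-sort per addition.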
I also wonder how gb_sets http://erlang.org/doc/man/gb_sets.html would behave. Those should provide better asymptotic behavior, as they actually implement a balanced tree data structure. Rust would still be faster, but it would be an interesting comparison.
To be specific, we were talking about Erlang's ordsets module. You can see it constructed as a list: https://github.com/erlang/otp/blob/d6285b0a347b9489ce939511e...
new() -> [].
Underneath, it is implemented as a linked list: each node holds an element and a pointer to the next. But explicit pointer manipulation (breaking and stitching lists together as one would do in C) is not allowed. So it is all about traversing the list and inserting at the right place.
The reason ordsets are implemented as plain lists is simplicity, and because they are designed for small collections. Say, keeping some preferences around, or a list of options. Not storing 1M entries.
But there are other data structures and facilities for handling larger collections, like gb_trees, which are implemented as general balanced trees. Another is an in-memory table called ETS. This one can be shared between processes (though terms are actually copied in most cases when reading from and writing to it).
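A minimal sketch of the ETS route from Elixir (the table name here is just an example):

```elixir
# An ordered_set ETS table keeps its keys sorted; :public allows
# other processes on the node to read and write it.
table = :ets.new(:scores, [:ordered_set, :public])
:ets.insert(table, {42, "alice"})
:ets.insert(table, {7, "bob"})

# Traversal follows key order, so the first key is the smallest.
:ets.first(table)
```

Unlike ordsets, lookups and inserts here don't degrade linearly as the table grows.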
>An ordset is a representation of a set, where an ordered list is used to store the elements of the set.
11> f(), Ord_bench = fun(N) -> Random = [random:uniform(N) || _ <- lists:seq(1,N)], F = fun(Item, Set) -> ordsets:add_element(Item, Set) end, lists:foldl(F, ordsets:new(), Random) end.
So, 337 ms turns into 37 seconds, roughly 100 times slower, when I increase the data size tenfold, which is what you'd expect from O(N) insertion.
Repeating for gb_sets, the results are 49 ms versus 484 ms.
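The gb_sets version of the same fold, re-expressed in Elixir (insertion into a balanced tree is logarithmic rather than linear):

```elixir
n = 10_000
random = for _ <- 1..n, do: :rand.uniform(n)

set =
  Enum.reduce(random, :gb_sets.new(), fn item, acc ->
    :gb_sets.add_element(item, acc)
  end)

# Duplicates are collapsed, so the set holds at most n elements.
:gb_sets.size(set)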
Could the misunderstanding be that you don't realise that the ordsets module specifically guarantees that the representation is an ordinary sorted list?
(I am aware of e.g. skip lists. But that is not the same Erlang data structure as a sorted list.)
A better approach, IMO, would be an order statistic tree with O(log N) insert/remove and O(log N + M) slice operations. This would usually be implemented as an implicit treap. Furthermore, balanced binary search trees can be implemented efficiently in a functional language like Elixir that doesn't have mutable data structures. One such example is Erlang's gb_trees.
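The "works without mutation" point is easy to see with gb_trees from Elixir: every update returns a new tree and leaves the old one intact.

```elixir
t0 = :gb_trees.empty()
t1 = :gb_trees.insert(:a, 1, t0)
t2 = :gb_trees.insert(:b, 2, t1)

# t1 is untouched by the second insert: no mutation anywhere,
# yet both versions share most of their structure internally.
{:gb_trees.size(t1), :gb_trees.size(t2)}
```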
Surprised to see no mention of trees, which are basically the standard data structure for this.
Although I wouldn't be surprised to see more and more of Rustler as both Elixir and Rust gain in popularity. There's probably plenty of low-hanging performance fruit for any future rustler devs.
1) One of the things that used to be the case with NIFs is that they can interfere with the BEAM's famed pre-emptive scheduling. Any chance that's what's going on here? I.e., the NIF solution was fast because it's getting an unfair amount of CPU time, potentially leading to latency issues elsewhere in the app? I believe that was changed recently, though, and the BEAM can now pre-empt NIFs?
2) Did you consider/benchmark a port solution before the NIF? I think of those as a little safer, but with worse performance.
> Initially 768 words + the size of each element (6 words + the size of Erlang data). The table grows when necessary.
This might be too large if there's a ton of them, but perhaps a single table per node could be workable? In particular, ordered_set tables with write_concurrency have been improved a lot in Erlang/OTP 22:
OTP-15128 Application(s): erts, stdlib
ETS option write_concurrency now also affects and improves the scalability of ordered_set tables. The implementation is based on a data structure called contention adapting search tree, where the lock granularity adapts to the actual amount of concurrency exploited by the applications in runtime.
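A sketch of that single-shared-table idea with the option enabled (the table name is hypothetical):

```elixir
# An ordered_set with write_concurrency; on OTP 22+ this uses the
# contention adapting search tree mentioned in the release note.
table =
  :ets.new(:sorted_members, [
    :ordered_set,
    :public,
    {:write_concurrency, true}
  ])

:ets.insert(table, {100, :alice})
:ets.insert(table, {5, :bob})

# Keys are ordered, so range-style traversal works via first/next.
:ets.next(table, 5)
```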
I wonder if an out-of-process solution was considered. For instance, Redis supports sorted sets: https://redis.io/commands#sorted_set
Using something like Redis saves development time and makes builds and packaging easier, as no FFI is needed. However, Redis comes at the cost of a more complicated infrastructure.
Really, it seems like a "write once and forget" type of NIF. Have you had to make a lot of updates after the initial tuning with Rustler?
Though I was curious if you tried ETS or Mnesia? They both offer sorted sets. Not sure if they just use ordsets behind the scenes.
Edit: nvm, I see you answered in another thread.
Meanwhile, it also shows Elixir itself is already really fast for the rest of us. Unless you're Discord or WhatsApp.
I was so used to 20-40 ms that I couldn't even comprehend that it could be 80x faster. A year after first playing around in Elixir's world, I'm still just so impressed by its speed.
Recently I started writing a bunch of tests for my first Phoenix application.
After a day of hacking around and learning as I went, I had 63 tests, 57 of which are full-blown controller tests, a good number of them dealing with a "current user" and reading/writing things from a database. I'm even using a "Faker" library to generate fake data for all of the DB attributes.
The entire test suite runs in 600ms. That's 600ms for ~60 controller tests from start to finish. That's in a worst case scenario too where the code is on a non-SSD and that code is volume mounted into a Docker container on Windows while I use WSL.
Sometimes things work so fast that I'm actually worried it's not doing what I think it's doing, but then I run --cover and see that I have 95%+ test coverage on the things I care about, so that code is getting reached.
Coming from the Go world, I view Elixir as trading some speed for out-of-the-box OTP niceness. But compared to Node, you're just gaining both speed and stability.
And it keeps going: I very frequently see Go criticized for slowness by C, C++, and Rust developers. It is all about the use case, which is why I find this topic so interesting. It's one of the few situations where Elixir's mid-tier speed is actually a real-world problem.
But in the domain of web programming, scheduling efficiency is much more important than raw execution speed. For a huge batch operation on a single machine, Erlang/Elixir would be terrible.
Similarly, React's Fiber mechanism can look worse in benchmarks while giving the user a better experience, thanks to better scheduling.
Also, looking at the code leads me to believe the Rust code was benchmarked without accounting for the overhead of the NIF call itself, if there is any. Is this intended, or am I missing something?
Last question: are the great tracing capabilities of the BEAM lost when using NIFs and would this be an issue at Discord scale?
Thanks for the interesting write up.
This is a textbook problem normally solved using a balanced binary tree augmented by adding a number with the size of the subtree to every node (an "order statistic tree").
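A toy Elixir sketch of that augmentation (unbalanced for brevity; a real order statistic tree would also rebalance on insert):

```elixir
defmodule OST do
  # Each node is {key, subtree_size, left, right}; nil is the empty tree.
  def size(nil), do: 0
  def size({_, s, _, _}), do: s

  def insert(nil, k), do: {k, 1, nil, nil}
  def insert({k, _, _, _} = t, k), do: t
  def insert({rk, _, l, r}, k) when k < rk do
    l2 = insert(l, k)
    {rk, 1 + size(l2) + size(r), l2, r}
  end
  def insert({rk, _, l, r}, k) do
    r2 = insert(r, k)
    {rk, 1 + size(l) + size(r2), l, r2}
  end

  # 0-based rank query: the i-th smallest key, found in O(height)
  # steps by comparing i against the left subtree's stored size.
  def select({k, _, l, r}, i) do
    ls = size(l)
    cond do
      i < ls -> select(l, i)
      i == ls -> k
      true -> select(r, i - ls - 1)
    end
  end
end
```

The stored sizes are what make rank queries (and slicing) logarithmic instead of linear.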
Discord's architecture is one where an entire guild stays on one machine, and the sorted set is part of a guild as well.
So, there is no need to persist it, since it is ephemeral in nature.
Edit: perhaps to expand on this a bit more, keep in mind that Discord is not a cloud-native application. They have a few strong server nodes and a Cassandra cluster (at this point I am curious whether they've switched to ScyllaDB) that handle all, or at least most, of their client-facing features.
Since they do not need to scale services, the servers they have are stateful. Hence, if the guild server crashes, so does the data structure; it is of no use without the server and would need to be rebuilt from scratch either way.
(Disclaimer: I am not an insider, just an Elixir fanboy)