
Why does musl make my Rust code so slow? - andygrove
https://andygrove.io/2020/05/why-musl-extremely-slow/
======
iou
Swap out the allocator: [https://users.rust-lang.org/t/optimizing-rust-binaries-observation-of-musl-versus-glibc-and-jemalloc-versus-system-alloc/8499](https://users.rust-lang.org/t/optimizing-rust-binaries-observation-of-musl-versus-glibc-and-jemalloc-versus-system-alloc/8499)
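For reference, swapping Rust's global allocator is a one-attribute change. A minimal sketch using only the standard library's `System` allocator (a jemalloc swap follows the same `#[global_allocator]` pattern, substituting e.g. `tikv_jemallocator::Jemalloc` from the tikv-jemallocator crate for `System`):

```rust
use std::alloc::System;

// Route every heap allocation in the program through the allocator
// named here. `System` just forwards to the platform malloc; the
// point of the attribute is that a non-libc allocator dropped in
// here would bypass musl's malloc entirely.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // These allocations now go through `GLOBAL`.
    let v: Vec<u64> = (0..1_000).collect();
    assert_eq!(v.iter().sum::<u64>(), 499_500);
    println!("sum = {}", v.iter().sum::<u64>());
}
```

No other code changes are needed; the attribute applies process-wide.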

~~~
andygrove
Thanks! That really does seem to be the issue, and I wouldn't have known about
this had I not asked. I will try this out and will update the blog post in ~8
hours' time.

~~~
masklinn
IME allocations are one of the main things making Rust programs slow without
diving into the more arcane stuff. So looking into unnecessary allocations
and/or the performance of the allocator would be one of the first things to
do (right after checking that you're compiling with optimisations).

Given your CPU graphs and the large number of cores, I expect musl's
allocator simply behaves very poorly under multithreading (e.g.
limited or no thread-local arenas, size-classing, etc.), leading to a lot of
crosstalk, extreme contention on allocations, etc.
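The contention hypothesis is cheap to probe with a stdlib-only sketch (thread and iteration counts here are arbitrary assumptions): time an allocate/free loop on one thread, then on several, and compare. On an allocator without per-thread arenas the multithreaded run scales badly.

```rust
use std::thread;
use std::time::Instant;

// Allocate and immediately free many small heap blocks. Under a malloc
// with no thread-local arenas, concurrent callers serialize on a shared
// lock, so the 8-thread run scales far worse than the baseline.
fn churn(iters: usize) -> usize {
    let mut checksum = 0;
    for i in 0..iters {
        let b = Box::new([0u8; 64]); // small allocation -> malloc(64)
        checksum += b[0] as usize + (i & 1);
    } // `b` dropped each iteration -> free()
    checksum
}

fn main() {
    let iters = 100_000;

    let t0 = Instant::now();
    let single = churn(iters);
    println!("1 thread:  {:?}", t0.elapsed());

    let t1 = Instant::now();
    let handles: Vec<_> = (0..8)
        .map(|_| thread::spawn(move || churn(iters)))
        .collect();
    let multi: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("8 threads: {:?}", t1.elapsed());

    // Checksums only keep the optimizer from deleting the allocations.
    assert_eq!(single, iters / 2);
    assert_eq!(multi, 8 * (iters / 2));
}
```

Running the same binary against glibc and musl makes the allocator's share of the difference visible without any profiler.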

~~~
zokier
Tbh, allocations (and related memory-management things) are often the low-
hanging fruit of big-picture optimization in many languages, incl. C++.

------
jessermeyer
For those curious, musl's malloc implementation is currently being rewritten
for higher performance and robustness; see
[https://github.com/richfelker/mallocng-draft](https://github.com/richfelker/mallocng-draft)

~~~
liuliu
Do you have any further reading on the rationale for building their own malloc
rather than integrating mimalloc or jemalloc?

~~~
jessermeyer
Nothing first-hand, but reading their principles suggests that simplicity and
ease of deployment are probably relevant:
[https://musl.libc.org/about.html](https://musl.libc.org/about.html)

~~~
liuliu
Thanks. Taking on malloc and being competitive with much less code (mimalloc
is around 6k lines, and it is the smallest I know of that is still competitive)
would be a great feat. It will be interesting to follow the development.

------
dalias
We'd be happy to address specific problems on the mailing list. I believe it's
a known issue that the Rust compiler is making really heavy use of rapid
allocation/freeing cycles, and would benefit from linking a performance-
oriented malloc replacement. Doing so is inherently a tradeoff between many
factors including performance, memory overhead, safety against erroneous usage
by programs, etc.

One statement in your post, which some readers pointed out was apparently
added later, "Others have suggested that the performance problems in musl go
deeper than that and that there are fundamental issues with threading in musl,
potentially making it unsuitable for my use case," seems wrong unless they
just meant that the malloc implementation is not thread-caching/thread-local-
arena-based. The threads implementation in musl is the only one I'm aware of
that doesn't still have significant bugs in some of the synchronization
primitives or in cancellation. It's missing a few optional and somewhat
obscure features like priority-ceiling mutexes, and Linux doesn't even admit a
fully correct implementation in some regards like interaction of thread
priorities with some synchronization primitives, but all the basic
functionality is there and was written with extreme attention to correctness,
and musl aims to be a very good choice in situations where this matters.

------
underdeserver
...Not the Intel guy, if anyone else had to pause for a second.

~~~
andygrove
I get that a lot!

~~~
hinkley
Are you familiar with SwiftOnSecurity on twitter?

Do you have any hobbies that would be out of character for Intel's Andy Grove?
I think the world has room for a fictionalized Andy Grove talking about how
to cook French pastries, train bonsai, do intermittent fasting, or prepare
for a marathon.

~~~
platinumrad
One SwiftOnSecurity is already too many.

~~~
hinkley
It's a free country. You're allowed to be wrong.

------
BubRoss
This is actually someone asking, not an investigation and explanation. There
isn't even much due diligence toward figuring it out: no profiling or
resource usage beyond CPU graphs. Also, it is musl combined with Docker that
causes the 30x slowdown.

If something is running 30x slower from linking in a different libc, I'm
guessing it should not be that difficult to narrow down the cause at least a
little bit.

~~~
MiroF
Benchmarking in Docker in general is a mistake I believe.

~~~
mschuster91
Why? The only overhead you have in Docker is on syscalls (due to permission
checks, namespaces, ...), everything else runs at 100% native speed - unlike
assisted virtualization (at least IOMMU overhead plus overhead for anything
involving the filesystem) or emulated virtualization (obvious overheads here).

~~~
edwintorok
If both sides of the comparison run inside Docker that is still a valid
comparison. With Docker the benchmark should be a bit more reproducible by
anyone.

------
pedrocr
Swapping out the allocator for jemalloc would be my first try. It's easy to do
and often results in better performance. 30x requires some kind of
pathological case though.

~~~
mrits
In a commercial product I worked on, I went against the vendor's advice and
tried out jemalloc. It took a 100GB memory hog (which took 48 hours to build
up) down to staying steady at around 2-4GB, only peaking at 100GB for a few
seconds a day.

Exact same code; I just swapped in jemalloc at the command line.

------
pjc50
.. where's the profile output?

------
termie
run a perf trace on both and see what jumps out

~~~
MiroF
He's benchmarking in Docker, and I'm not sure that perf works there.

~~~
ori_b
Then benchmark outside docker?

------
renewiltord
Post was not very illuminating. Very little content. It's pretty much an "if
musl is slow, it may be the allocator (eom)", which fits in the headline and
would have saved me the click.

