
Fewer mallocs in curl - dosshell
https://daniel.haxx.se/blog/2017/04/22/fewer-mallocs-in-curl/
======
nnethercote
There is a little-known Valgrind tool called "DHAT" (short for "Dynamic Heap
Analysis Tool") that's designed to help find exactly these sorts of excessive
allocations.

Here's an old blog post describing it, by DHAT's author:
[https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-
with...](https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with-dhat/)

Here's another blog post in which I describe how I used it to speed up the
Rust compiler significantly:
[https://blog.mozilla.org/nnethercote/2016/10/14/how-to-
speed...](https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed-up-the-
rust-compiler/)

And here is the user manual: [http://valgrind.org/docs/manual/dh-
manual.html](http://valgrind.org/docs/manual/dh-manual.html)

~~~
wyldfire
> I previously had -Ccodegen-units=8 in RUSTFLAGS because it speeds up compile
> times. ... the resulting rustc was about 5–10% slower. So I’ve stopped using
> it now.

...why is this the case?

~~~
detaro
It breaks the code into smaller modules that are processed independently, thus
limiting the scope of optimizations to those modules.

------
tom_mellior
For the benefit of others who found the description in the blog post unclear
and can't or don't want to dig through the code changes themselves: "fixing
the hash code and the linked list code to not use mallocs" is a bit
misleading. Curl now uses the idiom where the linked list data (prev/next
pointers) are inlined in the same struct that also holds the payload. So it's
one malloc instead of two per dynamically allocated list element. This
explains the "down to 80 allocations from the 115" part.

The larger gain is explained better and comes simply from stack allocation of
some structures (which live in a simple array, not a linked list or hash
table).

~~~
userbinator
_Curl now uses the idiom where the linked list data (prev/next pointers) are
inlined in the same struct that also holds the payload._

I wonder why this wasn't the original design; I've seen the "two-allocation"
method a few times in other code, and it always seemed rather silly to
allocate that separate little structure just to point to the things you're
linking together anyway, so I'm curious how that way of doing it became
somewhat commonplace.

~~~
masklinn
Mayhaps because the structure was originally stack-allocated, or because the
developer was unfamiliar or uncomfortable with intrusive data structures. Or
because the structure is used outside of linked-list contexts and the memory
overhead of the intrusive pointers was considered not worth it?

~~~
tom_mellior
> Or because the structure is used outside of linked list contexts

That use case is possible with the particular implementation that is now used
by curl:

    
    
        struct fileinfo {
          struct curl_fileinfo info;
          struct curl_llist_element list;
        };
    

You can use the payload (struct curl_fileinfo) outside of a list without
incurring any overhead.

I think one of your other points is more likely, or simply "it was fast enough
and we went for more interesting features than for such optimizations".

~~~
wbl
One of the Plan 9 C enhancements makes this even easier by permitting the
fields of these structs to be accessed directly.

~~~
generic_user
You can do this now in ISO C11 with anonymous structures and unions.

    
    
        struct A {
            int x, y;
        };
    
        struct B {
            union {
                struct A a2d;          /* named member ("2d" is not a valid
                                          identifier: it starts with a digit) */
                struct { int x, y; };  /* anonymous struct */
            };                         /* anonymous union */
            int z;
        };
    
        ...
    
        struct B foo;
        foo.x = foo.a2d.y = foo.z = 0;
    

You can access x and y directly as members of B, or through the struct A
member.

------
makerbraker
I think this is fantastic engineering work toward performance, instead of
falling back on the "RAM is cheap" line and doing nothing.

It's not every day that you see an example of someone examining and improving
old code, that will result in a measurable benefit to direct and indirect
users.

~~~
throwanem
Also, when you download FLAC files with it, they'll sound warmer.

~~~
nayuki
There is actually a corner of the web where people debate whether FLACs and
WAVs sound different.

~~~
lucideer
This is probably true in a non-blind listening comparison, though. Similar to
food tasting different when served in differently coloured receptacles[0]

[0]
[http://onlinelibrary.wiley.com/doi/10.1111/j.1745-459X.2012....](http://onlinelibrary.wiley.com/doi/10.1111/j.1745-459X.2012.00397.x/abstract)

~~~
mikeash
If you had to summarize the audiophile community in a few words, it would be
"does not understand the point of blind testing."

~~~
venture_lol
All audiophiles eventually reach the "my hearing aid is better than yours"
stage :)

------
vbezhenar
The underlying problem is that C doesn't have comprehensive standard
collections, so many developers reinvent the wheel over and over again, and
usually that wheel is far from the best in the world. If curl were written in
C++, those optimizations would be applied automatically by using the STL
collections.

~~~
tom_mellior
Somewhat true, but many C++ programs implement their own containers because
they find the STL "slow" or "bloated" or whatever. (For concreteness, both GCC
and LLVM do this.) So I guess the same might eventually happen to curl, in
which case we'd get this same kind of article on Hacker News ;-)

~~~
jononor
GCC was written in C (not C++) until a couple of years ago, and the codebase's
origins predate both the STL and C++. So it is not the greatest example.

~~~
tom_mellior
As you say, GCC switched to C++ just a few years ago, when the STL had already
existed for decades. I don't understand how you conclude that they couldn't
have just used the STL when they did the switch to C++. Implementing their own
containers was more work without any gain in portability.

------
iamalurker
My problem with excessive allocations is usually what happens in interpreted
languages. People think: hey, it's already slow-ass interpreted, so let's not
care about allocation at all.

An example I see all the time: tons of Python libraries that in the end do I/O
against a TCP socket. Sometimes what the user passes to the library can be
retained, all the way down to what goes out on the socket, as an array of
buffers.

Instead of iterating over the array and sending each block (if big enough) on
its own to the socket, the library author concatenates them into one buffer
and then sends it over the socket.

When dealing with big data, this adds lots of fragmentation and overhead
(measurable), yet some library authors don't care...

Even the basic httplib and requests have this issue when sending a large file
via POST (they concatenate it to the header, instead of sending the header and
then the large file).

------
snksnk
Optimization backed by comparative statistics. These reads are so satisfying.
Thank you for submitting.

------
faragon
Explicit dynamic memory handling in low-level languages hurts in a similar way
to garbage collectors in high-level languages: hidden and often unpredictable
execution costs (malloc/realloc/free internally usually implement O(log n)
algorithms, or worse). So the key to performance, whether you work with
low-level or high-level languages, is to use preallocated data structures when
possible. That way you get low fragmentation and fast execution, because you
avoid calling the allocator/deallocator in the case of explicit dynamic memory
handling, and lower garbage-collector pressure for the same reasons in the
case of garbage-collected languages.

~~~
maccard
An allocator doesn't have to be slow. You can implement an allocator yourself
that asks the OS for memory up front and just hands pointers back to the
calling application. If you know the order of allocations is the reverse of
the deallocations (as an example), you can do allocations with a pointer bump!

~~~
faragon
Sure. Imagine you have a process with 2^20 active allocations (e.g. 2^20 calls
to malloc()), i.e. you have a tree with the meta-information, and every time
you de-allocate and allocate you have to search through one or more trees. So
no matter how "smart" your library is about avoiding OS system calls, you
already have a hell to maintain (search through a tree or a list, delete,
split, etc.). Things get ugly when a process makes lots of dynamic memory
calls in non-trivial cases.

~~~
maccard
> you have a tree with the meta-information, and every time you de-allocate
> and allocate you have to search through one or more trees

Don't use a tree in that case? I don't see why using a tree to store
allocation info means that allocations are slow.

~~~
faragon
Whatever other data structure you use is not going to be simple, and it will
have even worse drawbacks: e.g. using hash tables for handling allocation
pools and free blocks could be even worse.

------
rumcajz
My rule of thumb is to look at the application's design and only ever use
malloc where there is a 1:N (or N:M) relationship between entities. Everything
that's 1:1 should be allocated in a single step.

~~~
areyousure
Your heuristic sounds interesting. Can you say a little more about which
entities and how to determine their relationship? Thanks.

------
21
> Doing very small (less than say 32 bytes) allocations is also wasteful just
> due to the very large amount of data in proportion that will be used just to
> keep track of that tiny little memory area (within the malloc system). Not
> to mention fragmentation of the heap.

That's not necessarily true. Modern allocators tend to use a bunch of
fixed-size buckets.

But given that curl runs on lots of platforms it makes sense to just fix the
code.

~~~
__s
& often those fixed-size buckets' smallest size is 32 bytes. It still has to
at least have a freelist

~~~
notacoward
There has to be a free list somewhere, but a single bucket only needs a
bitmap. I think the GP's point is that such a structure amortizes the cost of
the free-list metadata over more items, reducing total overhead.

~~~
21
The free list can be stored inside the empty cells, meaning you put the
pointer to the next empty cell inside the previous empty cell, and you need a
single additional pointer to store the location of the first empty cell. When
you free a cell you just make that the first empty cell.

~~~
exDM69
This isn't free either because the free list is scattered around in memory and
all that pointer chasing is bad for caches.

~~~
Someone
It isn't free, but _all that pointer chasing_ is one step for each alloc and
one for each free. There is no need to look for a 'best' block on alloc, nor
is there a need to search for a best place to insert a block on free.

Moreover, the cache hit you take for an alloc likely would have happened
anyways because, presumably, the program that made the allocation wants to
write to the allocated memory (in theory, that doesn't imply the CPU has to
_read_ the memory, but are there allocators that are that smart?)

For frees, the memory may, but need not, already be in the cache when free is
called.

------
vertex-four
Note that this pattern[0] is essentially "copy-on-write", which can be
encapsulated safely as such in a reasonably simple type (in a language with
generics) and used elsewhere. I use a similar mechanism pervasively in some
low-level web server code to use references into the query string, body and
JSON objects directly when possible, and allocated strings when not.

[0]
[https://github.com/curl/curl/commit/5f1163517e1597339d](https://github.com/curl/curl/commit/5f1163517e1597339d)

~~~
Asooka
But why not use alloca instead of always allocating 10 elements on the stack
you might not need?

Edit: I would also be tempted to remove the ufds_on_stack variable and just
check if the ufds pointer points to the stack or not.

~~~
drfuchs
Because always allocating them on the stack costs zero cycles in the typical
case, while alloca costs more than zero cycles in the typical case. And
assuming that "struct pollfd" isn't big, and the function isn't very
recursive, there's no practical downside to wasting a little stack space for
the life of the function.

(Of course, he could get rid of "bool ufds_malloc" and just see if his pointer
is NULL before calling free(); or not even bother checking, since free(NULL)
is defined to be a no-op.)

~~~
codyps
`alloca` is a simple addition to the stack pointer, so a single instruction,
presuming it isn't folded into the normal bump of the stack pointer to
allocate the fixed size local variables. There isn't really much cost to doing
a dynamic stack allocation rather than a fixed one. Variable length arrays
(VLAs) allow the same thing but can be slightly more portable.

Normal C caveats do apply here though: alloca is not part of ISO C, but it is
widely implemented. VLAs are a standard feature (mandatory in C99, optional
since C11). Neither is required to actually use the stack for storage.

Not sure if there are any platforms supported by curl that would prevent its
use of VLAs or alloca.

~~~
tjalfi
tl;dr - alloca costs, history, and why it is problematic

Alloca is somewhat more expensive on x86/x64 than a single instruction.

[0] shows the code generation for four functions that generate and sum an iota
array. I used -O1 to make the differences more apparent.

iota_sum_alloca and iota_sum_vla generate similar code. They both require a
frame pointer (RBP) and code to preserve the 16 byte alignment of the stack
frame.

iota_sum_const_alloca and iota_sum_array generate identical code. Clang
recognizes that alloca is invoked with a constant argument.

History of Alloca

Alloca was originally written for unix V7 [1]. Doug Gwyn wrote a public domain
implementation [2] in the early 80s for porting existing programs. The FSF
used Gwyn's alloca implementation in GDB, Emacs, and other programs. This
helped to spread the idea.

Problems of Alloca

[3] is a comp.compilers thread that discusses some of the issues with alloca.
Linus does not want either VLAs or alloca in the Linux kernel [4].

References:

[0] [https://godbolt.org/g/1JyXhQ](https://godbolt.org/g/1JyXhQ)

[1]
[http://yarchive.net/comp/alloca.html](http://yarchive.net/comp/alloca.html)

[2] [https://github.com/darchons/android-gdb/blob/android-
gdb_7_5...](https://github.com/darchons/android-gdb/blob/android-
gdb_7_5/gdb/gnulib/import/alloca.c)

[3]
[http://compilers.iecc.com/comparch/article/91-12-079](http://compilers.iecc.com/comparch/article/91-12-079)

[4]
[https://groups.google.com/forum/#!msg/fa.linux.kernel/ROgkTB...](https://groups.google.com/forum/#!msg/fa.linux.kernel/ROgkTBO4VYI/sQZN3R9bEV4J)

Edited for minor formatting changes.

~~~
tjalfi
[https://godbolt.org/g/XKAZOb](https://godbolt.org/g/XKAZOb) fixes the bugs in
the sample code in the parent post.

If you compile with -O2 then iota_sum_const_alloca and iota_sum_array are both
evaluated at compile time.

------
0xcde4c3db
> The point is rather that curl now uses less CPU per byte transferred, which
> leaves more CPU over to the rest of the system to perform whatever it needs
> to do. Or to save battery if the device is a portable one.

Does anyone have a general sense of how these kinds of efficiencies translate
to real-world battery life? I understand that the mechanisms
(downclocking/sleeping the CPU) are there; I'm just curious as to how much it
actually moves the needle in a real system.

~~~
maccard
Not sure in hard numbers, but mobile processors are designed to work this way
- do a small amount of work at full power and then sleep.

~~~
zkms
Race to idle~

------
hota_mazi
> There have been 213 commits in the curl git repo from 7.53.1 till today.
> There’s a chance one or more other commits than just the pure alloc changes
> have made a performance impact, even if I can’t think of any.

"I can't think of any" is not a very scientific way to measure optimizations.
Actually, this simple fact casts doubt on whether it was the malloc
optimization that led to the speedup, or any of the 200+ other commits it sits
on top of.

Why not eliminate that doubt by applying the malloc optimizations to the
previous official release? I'm a bit skeptical about the speedup myself, since
I would expect curl to be primarily I/O bound and not CPU bound (much less
malloc bound, given how little memory it uses).

~~~
dr_zoidberg
I'm not skeptical about it. Last time I did something similar to this
optimization, I had a python function that was doing some work over strings to
get a similarity metric.

For this function, a list of elements was kept inside each call, which began
empty, and a few calculations were performed to populate it before the main
loop of the function kicked in. In Python-land, this was obviously done by
declaring an empty list `precalc_values = []` and then appending to it. When
we cythonized it, the dev who took it went in with a
`cy_malloc(sizeof(int)*elements)` "dynamic array of ints", and called it a
day: 70x speedup over plain Python.

A few days later I came in, saw that code, and said "why not go with a simple
array?", to which I was told "because we don't know the size of the strings
beforehand". I did a test run with a small fixed array plus a counter (to know
up to where the array held real values, and not just zero-initialized fields)
and got a 100x speedup.

In the end we went with both functions and a wrapper that checks the length of
the strings involved and selects one or the other, because the array version
would crash hard from accessing the array out of bounds if a large string
happened to come by.

------
ape4
> This time the generic linked list functions got converted to become
> malloc-less (the way linked list functions should behave, really).

I don't see how a linked list can avoid using malloc().

~~~
rumcajz
Look for intrusive containers.

~~~
ape4
TIL, thanks.

------
amenghra
You would think curl's perf is bound by network latency/bandwidth and that
intrusive lists wouldn't make a significant difference.

------
__s
> The point here is of course not that it easily can transfer HTTP over
> 20GB/sec using a single core on my machine

2GB

~~~
fastest963
He said gigabit, or at least that's how I read it. 2900 MB/sec * 8 is over
20 Gbit/sec.

