
Glibc malloc inefficiency (2016) - ibobev
http://notes.secretsauce.net/notes/2016/04/08_glibc-malloc-inefficiency.html
======
cranekam
> but it suggests strongly that the heuristics in malloc either have a bug, or
> the parameters aren't set aggressively enough

I think the author might be being a bit naive about the role of the allocator.
Sure, it could aggressively madvise()/munmap() unused pages to return them to
the kernel but as soon as memory is needed again there's a significant
overhead to bring it back again. For desktop apps it might be fine to run with
few spare pages (and save the RAM) but doing so on a busy server would likely
result in different inefficiencies.

jemalloc used to have an option (opt.lg_dirty_mult:
https://linux.die.net/man/3/jemalloc)
to control how many dirty pages it'd allow to build up before calling
madvise(.., MADV_FREE) to give them back to the OS. This allowed one to choose
between saving CPU by maintaining free pages or saving RAM by giving them
back. Later jemallocs switched to decay-based purging of dirty pages (via a
dedicated thread) to get the best of both worlds.
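
For a concrete idea of the knobs involved: the option above could be set at
process start through the MALLOC_CONF environment variable (e.g.
MALLOC_CONF="lg_dirty_mult:8"), and allocator state can be inspected through
the mallctl() interface documented in the linked man page. A rough sketch
(option names as in that man page; newer jemalloc releases replace
lg_dirty_mult with the decay settings, so the first query may simply fail):

    /* Sketch: querying jemalloc via mallctl() (link with -ljemalloc).
     * opt.lg_dirty_mult exists only in older releases, hence the checks. */
    #include <jemalloc/jemalloc.h>
    #include <stdio.h>
    #include <sys/types.h>

    int main(void) {
        ssize_t lg_dirty_mult;
        size_t sz = sizeof(lg_dirty_mult);
        if (mallctl("opt.lg_dirty_mult", &lg_dirty_mult, &sz, NULL, 0) == 0)
            printf("lg_dirty_mult = %zd\n", lg_dirty_mult);

        size_t allocated;
        sz = sizeof(allocated);
        if (mallctl("stats.allocated", &allocated, &sz, NULL, 0) == 0)
            printf("allocated = %zu bytes\n", allocated);
        return 0;
    }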

------
camgunz
I've heard this as an argument for GC being faster/using less memory in some
cases than manual memory management. You can asynchronously return memory to
the OS in a separate thread. Boehm GC in particular is really easy to link
into an app and use instead of malloc and friends.
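
For reference, a minimal Boehm GC drop-in looks something like this (link
with -lgc; GC_MALLOC replaces malloc and there is no explicit free):

    /* Minimal Boehm GC sketch: allocations are reclaimed automatically once
     * unreachable, and the collector manages its own heap pages. */
    #include <gc.h>
    #include <stdio.h>

    int main(void) {
        GC_INIT();
        for (int i = 0; i < 1000000; i++) {
            char *p = GC_MALLOC(64);   /* collected when no longer reachable */
            p[0] = 'x';
        }
        printf("GC heap size: %zu bytes\n", (size_t)GC_get_heap_size());
        return 0;
    }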

~~~
chrisseaton
It’s the fact that automatic memory management can defragment through moving
that is key. But some manual memory managers can defragment through moving as
well - even some compatible with the malloc interface.

~~~
mschwaig
If I remember correctly, you can't do this just by re-linking with another
manual memory manager, because the application could hold pointers to things
in ways you don't understand, and then you can't move them. So there has to
be some additional indirection on every pointer access that resolves the real
location of the data in memory.

So I would think this requires some compiler support or some sort of runtime.

If you know more about this or know any cool implementations of it please
elaborate!

~~~
chrisseaton
> so there has to be some additional indirection on every pointer access

We already have this - almost all modern computers have something called
virtual memory, which allows you to de-couple your memory addresses from
where the memory actually is. This allows you to 'move' memory without
changing the addresses that you use to refer to it. You can then defragment
memory by mapping two sets of addresses to the same physical memory, as long
as the holes in the two sets overlap.

In other words, if you have two pages, one with only the first half used, the
other with only the second half used, you can copy the used part of the
second page into the gap in the first page, and map the old addresses of the
second page to the location of the first, now shared, page. You are using
twice the _address space_ (which costs just a few bytes in bookkeeping) but
have defragmented so you're only using half the _physical memory_.

You can then completely return the second page, now unused, to the kernel.
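
A minimal sketch of the page-aliasing trick on Linux (assuming a heap backed
by a memfd, since ordinary private anonymous pages can't be mapped at two
addresses; error handling omitted):

    /* Map one physical page at two virtual addresses: writes through the
     * first mapping are visible through the second, so an allocator could
     * copy live data into a shared page, alias the old addresses onto it,
     * and release the emptied page. */
    #define _GNU_SOURCE
    #include <assert.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);
        int fd = memfd_create("alias", 0);   /* anonymous but shareable */
        ftruncate(fd, page);

        char *a = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        strcpy(a, "hello");                  /* write via the first address */
        assert(strcmp(b, "hello") == 0);     /* ... visible via the second  */

        munmap(a, page);                     /* one alias can now be dropped */
        return 0;
    }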

~~~
mschwaig
Are there implementations that actually do this?

I would think the page level granularity of virtual memory makes it not that
useful for this purpose, since the used and unused parts of the two pages
would have to line up nicely and larger objects tend to cause less
fragmentation.

~~~
chrisseaton
I first saw this done in 2009 in the Hound memory allocator. I _think_ it's
been tried or implemented in recent glibc, but I wouldn't swear to that. I
think it's been shown to be useful in practice in real applications in a
couple of papers.

Fragmentation is often just one pesky object holding onto a whole page. If you
have many such pages all with one object it's likely you can map them all
together and release most of them.

------
tinus_hn
This assumes glibc can actually shuffle memory locations around, which it
can’t. Except for some special cases you can’t return memory to the kernel,
you can only stop using it and hope it’s swapped out. Perhaps you could zero
it so it compresses well.

~~~
rmind
Umm.. that's not really true.

Most userspace memory allocators are based on mmap(); using sbrk() is legacy.
The slab allocator (Bonwick, 1994) is the dominant algorithm (both in
userspace and in UNIX-like kernels), although there are various flavours and
somewhat different implementations of it. Since it uses fixed-size
allocations under the hood, they pack nicely into pages. Those pages can be
released with munmap() once they contain no used blocks. Sure, depending on
the application and workload, allocations can become quite scattered among
the pages over time, but that is a separate problem.

Also, as another comment notes, madvise() is an option too, but it doesn't
reclaim the virtual address space. On 32-bit systems VA space exhaustion can
be a problem, especially with “modern” applications.
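
To make the distinction concrete, here is a sketch of the two release paths a
slab-style allocator has for a fully-free span (error handling omitted):

    /* munmap() gives back both the pages and the address space;
     * madvise(MADV_DONTNEED) (or MADV_FREE where available) gives back the
     * pages but keeps the VA range mapped for cheap reuse. */
    #include <stddef.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1 << 20;   /* a 1 MiB span, as a slab allocator might use */
        char *span = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        span[0] = 1;                        /* ... blocks handed out, freed ... */

        madvise(span, len, MADV_DONTNEED);  /* drop pages, keep the mapping   */
        /* or, once the span is truly dead: */
        munmap(span, len);                  /* drop pages and address space   */
        return 0;
    }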

~~~
wahern
Both glibc and musl libc still use brk for most allocations. You can see it
when strace'ing simple programs. After the dynamic linker--which uses mmap
for some allocations, as it can't use malloc--has finished, you'll see brk
calls satisfying the application's allocations.

Also you can verify it in the code, e.g.
https://git.musl-libc.org/cgit/musl/tree/src/malloc/expand_heap.c?id=55a1c9c8#n56

I believe OpenBSD's malloc only uses mmap and also aggressively munmap's
memory--to better catch application bugs and to keep addresses randomized.
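
Easy to check with a toy program; under glibc, running strace -e brk,mmap on
something like the following would typically show the small allocation
extending the brk heap and the large one going through mmap:

    /* Small allocations usually come from the brk heap; allocations above
     * M_MMAP_THRESHOLD (128 KiB by default in glibc) are usually mmap'd. */
    #include <stdlib.h>

    int main(void) {
        void *small = malloc(100);       /* typically served from brk   */
        void *large = malloc(1 << 20);   /* typically a separate mmap   */
        free(small);
        free(large);
        return 0;
    }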

------
lokar
If you care a lot about this stuff, use tcmalloc and its very helpful
tc_malloc_stats().

~~~
rsecora
I agree, and the arena handling in tcmalloc solves the over-allocation seen
in ptmalloc when one thread allocates and another frees.

Nevertheless I prefer jemalloc, because I find its stats and memory profiling
awesome for finding leaks or structural over-allocation.

------
asveikau
I wonder if some particular M_TRIM_THRESHOLD value does better than the
default. The author spent some time measuring before and after forcing a trim
by attaching a debugger, but it may be more generally useful to tweak the
default.
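
For anyone who wants to experiment, the threshold can also be set from the
program itself via glibc's mallopt(3) (or the MALLOC_TRIM_THRESHOLD_
environment variable), and malloc_trim(3) forces the kind of release the
author triggered from the debugger; a rough sketch:

    /* Trim more aggressively than glibc's 128 KiB default, and force a trim
     * explicitly (the sort of trim the author forced by attaching a debugger). */
    #include <malloc.h>

    int main(void) {
        mallopt(M_TRIM_THRESHOLD, 64 * 1024);  /* release top-of-heap sooner */

        /* ... allocate and free a lot ... */

        malloc_trim(0);                        /* hand free pages back now   */
        return 0;
    }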

------
rsecora
Another option is to use jemalloc instead. Better performance with threads.

It's possible to use it without recompiling, via .so injection with
LD_PRELOAD (e.g. LD_PRELOAD=/path/to/libjemalloc.so ./yourprogram).

------
egberts1
Limit Memory Allocation (if not necessary)

Multithreaded programs often do not scale because the heap is a bottleneck.

When multiple threads simultaneously allocate or deallocate memory from the
allocator, the allocator will serialize them. Programs making intensive use of
the allocator actually slow down as the number of processors increases.

Malloc (libc) is the worst memory allocation API to use.

Programs should avoid, if possible, allocating and deallocating memory too
often, and in particular whenever a packet is received.

For the Linux kernel, kernel/driver patches are available for recycling
skbuffs (the kernel memory used to store incoming/outgoing packets).

Using PF_RING (in the driver) to copy packets from the NIC to a circular
buffer without any memory allocation increases capture performance (by around
10%) and reduces congestion issues.

Design Evolution

The basic design of malloc() is to dynamically pre-allocate a pool of memory
from the OS, from which applications then take smaller pieces. malloc() is a
standard API with a choice of different allocation algorithms underneath,
designed to mitigate expensive OS system calls, which are typically made at
program initialization time when its pool of system memory is allocated. The
first memory allocation scheme was stack-based allocation.

Next came dynamic memory allocation schemes, where linked-list and
bucket-heap mechanisms are used to divide the private heap using a size-class
approach.

Soon, garbage collection algorithms introduced the initial backend of the
memory allocation scheme; the frontend covers the usual malloc() API and
friends.

In 2006, a third pool was introduced (after the operating-system memory pool
and the library-based memory pool) called the “arena”. Arena is a jemalloc
term and is intended to deal with different memory types such as memory banks
of different speeds or NUMA architectures, as well as memory tied to each of
multiple CPU cores or even CPU affinity.

Frontend Evolution

The frontend manages the memory being given to the application.

Within the frontend of the memory allocation system, the evolution went in the
following order:

- linked-list free space
- heap-bucket size classes (eliminating an object header)
- (process) owner encoding
- single-core local allocation buffers (CLABs)
- epoch encoding
- large-size-class memory blocks by direct mmap()
- hazard pointers (safe memory reclamation for lock-free objects)
  (M.M. Michael, 2004)
- arena memory pool
- thread-specific local allocation buffers (TLABs)
- constant-time modulo synchronization (early return to OS pool, or FreeBSD
  madvise call)

Backend Evolution

The backend of the memory allocation system returns empty, straggling,
fragmented, or no-longer-used memory blocks to the OS (thereby reducing RSS).

- pool semantics: remote free-list encoding, using a Treiber stack
  (R.K. Treiber, 1986)
- buddy algorithm
- binary buddy algorithm
- BIBOP table (span-based allocator) (S. Schneider, 2006), aka local free
  list and remote free list
- segment queue (quasi-linearizability, Y. Afek, 2010)
- multi-core distributed queue (A. Haas, 2013)
- k-FIFO queue (T.A. Henzinger, 2013)

~~~
scott_s
> When multiple threads simultaneously allocate or deallocate memory from the
> allocator, the allocator will serialize them.

Memory allocators designed for multithreaded use will _not_ serialize
allocations and frees. Such allocators use thread-local data structures and
try to avoid touching non-thread-local structures that require
synchronization. I am "S. Schneider, 2006" and that was one of the main points
of that paper. Modern memory allocators (tcmalloc, jemalloc, even I believe
modern glibc) follow similar designs.
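
A toy illustration of that fast path (purely a sketch of the idea, not how
tcmalloc or jemalloc is actually implemented): with per-thread free lists,
the common malloc/free case touches no shared state at all.

    /* Each thread keeps its own free lists per size class, so alloc/free
     * normally needs no lock; only the slow path falls back to a shared
     * backend (here, plain malloc). Sizes and classes are made up. */
    #include <stddef.h>
    #include <stdlib.h>

    #define NUM_CLASSES 8    /* hypothetical size classes: 16, 32, ... 2048 */

    struct free_block { struct free_block *next; };

    static _Thread_local struct free_block *tcache[NUM_CLASSES];

    static int size_class(size_t n) {
        int c = 0;
        size_t s = 16;
        while (s < n && c < NUM_CLASSES - 1) { s <<= 1; c++; }
        return c;
    }

    void *my_alloc(size_t n) {
        int c = size_class(n);
        struct free_block *b = tcache[c];
        if (b) {                        /* fast path: thread-local, no lock */
            tcache[c] = b->next;
            return b;
        }
        return malloc((size_t)16 << c); /* slow path: shared backend        */
    }

    void my_free(void *p, size_t n) {   /* simplified: caller passes size   */
        int c = size_class(n);
        struct free_block *b = p;
        b->next = tcache[c];            /* push onto thread-local list      */
        tcache[c] = b;
    }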

~~~
egberts1
Aye, such evolution tracking we must do for the history of malloc; Thanks,
Scott.

Evolution doesn’t always mean most efficient or better. :-)

------
ncmncm
Giving memory back to the OS is usually a pessimization, especially if the
program runs more than one thread on more than one core. Therefore, not
giving it all back is not a failing, but an optimization.

~~~
roelschroeven
Is that always true? If a long-lived process (like Emacs in that blog post)
at some point temporarily uses a lot more memory, is it a good idea for it to
keep hold of that memory forever, even if it has no use for it anymore?

It could be, but it's counter-intuitive to me that processes holding on to
unused memory is better for the system.

~~~
BubRoss
No, it's total nonsense of course. Imagine a web browser with lots of tabs
open: if someone closes lots of tabs, they might be doing it specifically to
free up memory.

Even going over physical memory is not a death sentence anymore, because fast
SSDs are used for virtual memory. But because that is still slower, freeing
up memory will speed everything up, even if peak memory usage needs to go
over physical memory.

The OS also uses free memory to cache memory-mapped pages of files, so memory
held by a process means more disk IO from that standpoint as well.

Even disregarding all of that, heaps get fragmented, and if many virtual
memory allocations can be given back to the OS, the next time more memory is
needed it will be allocated in large, 'virtually contiguous' chunks that start
out without fragmentation.

