
Understanding Memory Fragmentation in Haskell - tirumaraiselvan
https://www.well-typed.com/blog/2020/08/memory-fragmentation/
======
FooBarWidget
A year ago, I researched similar memory issues in Ruby. The established
hypothesis in the community was that it was due to memory fragmentation. But I
found that the largest culprit was actually the glibc memory allocator, which
doesn't like to return memory to the OS. In multithreaded scenarios this issue
is amplified even more, due to the use of separate heap arenas per thread.

I also found a simple solution: call malloc_trim() after a GC. This reduces
memory usage by 70%.

[https://www.joyfulbikeshedding.com/blog/2019-03-14-what-
caus...](https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-
memory-bloat.html)

~~~
flohofwoe
Hmm... if your process keeps grabbing new memory from the OS even though 70%
of the memory it has already allocated is free to use, that's a sure sign of
rampant memory fragmentation. Even though there is a lot of free mapped memory
in the process overall, there are no contiguous ranges of free memory big
enough to satisfy at least some of the new allocations (so the allocator needs
to grab fresh memory pages from the OS).

This means there's a wave of new allocations moving through your address
space, leaving behind a fragmented mess. Calling malloc_trim() won't help with
the address-space fragmentation; it will only free memory pages caught up in
the mess. At some point the allocation wave will hit the top of the address
space and allocations will start to fail. Usually this is not a problem in
64-bit processes, of course, because it takes a very long time to exhaust a
64-bit address space, but in 32-bit processes this was a real problem.

~~~
nh2
This is correct. malloc_trim() can make the unused memory pages (within an
mmap() that malloc did) available for use by other processes using madvise()
(turning them from grey squares into white squares in the linked article's
visualisation), but it does leave holes in the address space.

This is what the M_MMAP_THRESHOLD tunable solves. It makes allocations larger
than that many bytes be served via their own mmap(), which can be munmap()ed
independently.

I use env MALLOC_MMAP_THRESHOLD_=65536 to reduce the RAM wasted by memory
fragmentation in my program from 6.5 GB to 0.8 GB.

The benefit of this is that you don't have to decide at which points to call
malloc_trim(). But it's expected to be a bit slower, because mmap() takes a
while. Choosing between malloc_trim() and MALLOC_MMAP_THRESHOLD_ is dual to
choosing between GC and reference counting: higher memory use for a while and
having to choose when to clean up, versus a higher per-operation cost.

~~~
labawi
Isn't this just passing work to the kernel, which coincidentally behaves in a
friendlier way than default malloc? In essence, a workaround for a bad/buggy
allocator that doesn't release or reuse pages?

If large allocations are page-aligned (hopefully at 16+ pages), they can be
individually unmapped and remapped, and I see no reason why or how individual
mmaps could in general result in less fragmentation, other than said silly
malloc.

~~~
nh2
I think you're pretty much right.

E.g.
[https://github.com/thestinger/allocator/tree/f42a6c2dffb63d5...](https://github.com/thestinger/allocator/tree/f42a6c2dffb63d5aec3c9d7f6a387ee1c7b1857a#current-
implementation) (found via Google) explains:

> The Linux kernel also lacks an ordering by size, so it has to use an ugly
> heuristic for allocation rather than best-fit. It allocates below the lowest
> mapping so far if there is room, and then falls back to an O(n) scan. This
> leaves behind gaps when anything but the lowest mapping is freed, increasing
> the rate of TLB misses.

So the libc should be able to do a better job than the kernel here.

And yes, I think it should probably work via MADV_DONTNEED to give mapped
pages back to the kernel (and perhaps PROT_NONE to also reduce commit charge
when overcommit is disabled, see
[https://github.com/thestinger/allocator/issues/18](https://github.com/thestinger/allocator/issues/18)).

I'm using `MALLOC_MMAP_THRESHOLD_` because it has a positive effect, but as
written on
[https://news.ycombinator.com/item?id=24244271](https://news.ycombinator.com/item?id=24244271),
I'm not sure why automatic trimming (which should do all of the discussed
above) does not work.

------
siraben
Having written Haskell for over a year now for personal projects, I've found
that understanding the memory model is one of the hardest aspects of the
language, which can make it frustrating to write allocation-free code
(although the compiler applies some techniques, like deforestation, to
eliminate intermediate structures entirely).

Linear types being added in GHC 8.12 would be a big deal because they would
allow programmers to write allocation-free code that uses mutable data
structures behind a pure API (as opposed to the ST monad), much like how Rust
solves this with its ownership system.

~~~
platz
I don't understand how GHC's linear types allow allocation-free code. GHC's
linear types don't give you uniqueness types in the same way Rust does.

~~~
zetalemur
> GHC's linear types don't give you uniqueness types in the same way Rust
> does

Why, though?

Is it because GHC's linear types are a superset of Rust's linear types? I
guess that could rule out some features the compiler is able to prove (or
not).

~~~
platz
GHC models linearity differently. Rust puts linearity on the types, GHC puts
linearity on the arrows.
[https://i.imgur.com/s0Mxhcr_d.webp?maxwidth=640&shape=thumb&...](https://i.imgur.com/s0Mxhcr_d.webp?maxwidth=640&shape=thumb&fidelity=medium)

~~~
senorsmile
Is Idris 2's linearity the same as Haskell's in this respect?

~~~
platz
yes

------
tirumaraiselvan
More discussion here:
[https://www.reddit.com/r/haskell/comments/id8m9w/welltyped_u...](https://www.reddit.com/r/haskell/comments/id8m9w/welltyped_understanding_memory_fragmentation/)

------
brundolf
Is this sort of multi-tier allocation used in any other garbage-collected
languages? Or is it specifically used in Haskell because of immutability
(which would presumably result in a higher-than-normal frequency of
allocation/deallocation)?

~~~
eru
Erlang also has pervasive immutability, but I'm not sure what they do for
allocation.

~~~
jlouis
Erlang has a whole memory allocation system for combating fragmentation at
the OS level. It also uses a multi-tiered approach, but note that a lot of
things are easier there because there is far less sharing going on and more
things are isolated.

------
crote
Wouldn't the GC be able to `munmap` the space between blocks?

Sure, it wouldn't solve object-level fragmentation, but at least you'd get rid
of block-level fragmentation.

~~~
chrisseaton
You can only unmap a full page, but there aren’t any full empty pages because
the memory is fragmented.

~~~
crote
According to the link, GHC creates 1MiB megablocks, consisting of 4KiB blocks.
So, each block would be a page, and a megablock consists of 256 pages.

It seems that the problem with pinning is that a megablock can end up
containing only a single live block, leaving the space for the other blocks
unused. That block can't be moved to another megablock because it is pinned,
so the GC can't free the entire megablock.

My suggestion is to unmap the pages corresponding to the empty space in the
megablock. So if a megablock only contains a single block, unmap the 255 empty
pages.

~~~
chrisseaton
> a megablock will end up containing only a single block

That's the absolute worst case.

Most likely case is many (most?) pages have one or two pinned objects in them.
It only takes a small proportion of pages to be wasted like this for it to be
a huge problem.

