
Will calling “free” or “delete” in C/C++ release the memory to the system? - ingve
https://lemire.me/blog/2020/03/03/calling-free-or-delete/
======
ncmncm
There are usually very sound reasons not to release memory back to the OS,
particularly in a multithreaded program. Each such release causes a "TLB
shootdown", in which threads on other cores are blocked while the cores'
"translation lookaside buffers" (caches of page mappings) are flushed,
followed by further stalls as their entries are re-filled.

This is another reason to prefer single-threaded processes, which are less
subject to such shootdowns, and less-coupled forms of parallelism.

Besides the TLB potholes, releasing memory means that the next time memory is
requested, the OS is obliged to zero it before the process gets to see it
again. Typically each fresh page is initially mapped copy-on-write to a shared
zero page, so the first write to it traps, and the actual zeroing happens
lazily.

As a result, freeing memory to the OS should only be essayed with the support
of a great deal of measurement of the consequences.

------
mkhn
Always appreciate your posts. Perhaps it would be good to explain the
sbrk/mmap threshold for this, and maybe also why the memset() is required to
fault in pages :-)

------
_bxg1
Interesting. It calls to mind garbage collection, actually, where the
"collection" of memory happens at some indeterminate time in the future based
on heuristics about how much there is. Of course it doesn't have the same
performance impact, probably because the collection process consists of just
freeing known chunks instead of iteratively searching for chunks that can be
freed.

~~~
DmitryOlshansky
GC doesn’t search for chunks that can be freed. Also “iteratively” - what do
you mean by this?

There is no known GC algorithm that searches for garbage to delete; the way
GCs work is exactly the opposite - find all live objects (reachable from the
roots: registers, stack, statics/globals), and the rest of the heap is by
definition garbage.

~~~
_bxg1
[https://en.m.wikipedia.org/wiki/Tracing_garbage_collection](https://en.m.wikipedia.org/wiki/Tracing_garbage_collection)

> tracing garbage collection is a form of automatic memory management that
> consists of determining which objects should be deallocated ("garbage
> collected") by tracing which objects are reachable by a chain of references
> from certain "root" objects, and considering the rest as "garbage" and
> collecting them

So it starts at the root and, in a sense, "iterates" down through the
reference tree to see which allocations are reachable, and then the inverse of
that is freed. In the OP case, by contrast, "which allocations to free" is
known at "collection" time, but there's still an extra step of actually doing
the collection.

~~~
DmitryOlshansky
The point is that tracing GC searches for things to keep and assumes the rest
is free.

Which I believe is the important distinction, and quite often it doesn’t know
or care “which allocations to free” because it does it in bulk. That allows it
to be efficient.

~~~
_bxg1
No, the important distinction is that typically GC iterates over _something_
(incurring a cost of O(N)), whereas the C++ case iterates over _nothing_
(incurring a cost of O(1)).

But my original point was just that it's interesting that there _is_ an extra
step at all in the C++ case, where we tend to think of collection as happening
"for free". It has a constant-time overhead, which is not the same as zero
overhead.

~~~
DmitryOlshansky
> important distinction is that typically GC iterates over something
> (incurring a cost of O(N))

During _collection_, and N is the size of the live set, so divided across
allocations the cost is amortized and can safely be considered O(1), much like
appending to a dynamic array, where the costly O(N) resize is amortized over N
appends.

Typically GCs allocate via bump-pointer allocation, which is actually faster
than what pretty much all libc malloc implementations do.

> C++ case iterates over nothing (incurring a cost of O(1)).

Not quite, take a look at jemalloc paper for instance:

[https://www.facebook.com/notes/facebook-engineering/scalable...](https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/)

A lookup to get the metadata for this specific allocation could be anything
from O(1) to O(lgN) depending on size and malloc strategy.

------
Iwan-Zotow
You could trim memory as much as you want, but the next malloc call will
result in (a lot of) syscalls, and those are NOT free and NOT cheap

------
Ace17
All else being equal, could the efficiency of malloc_trim depend on heap
fragmentation? Could the heap end up with "gaps" of free'd memory that
couldn't be returned to the OS independently?

~~~
flohofwoe
Not an expert in the matter, but: in my simple mental model of how memory
management works under the hood (fixed-size memory pages, and a virtual-to-
physical address indirection through a page table), only "clean" pages that
have no allocations left in them could be returned to the OS. So
fragmentation might prevent a 'mostly clean' page from being returned to the
OS because a single tiny allocation is left that pins it into the process's
address space.

However: most (all?) general-purpose memory allocators have small-allocation
buckets to minimize this effect, which basically group allocations of
specific sizes into different memory pages.

Despite that, virtual address space fragmentation in 32-bit applications is
definitely a thing. Memory might still become fragmented over time so that
"big" allocations (bigger than a memory page) fail because there's no
big-enough gap left in the process's virtual address space, despite there
being enough free physical memory available for mapping.

This usually isn't a problem in 64-bit processes, because there's always
enough room at the "front" (new memory allocations would basically move like a
tidal wave through the 64-bit address space, leaving a mess of fragmented
address space behind).

The best way to prevent memory fragmentation of course is to minimize
allocation frequency in long-running applications (ideally only allocate
memory when the program starts up).

------
flqn
I'm guessing this is because the runtime keeps a few pages around for
free-lists and the like? Interesting that there's a hump in the middle there.

------
olliej
TLDR: no more so than standard mallocs, all of which for obvious reasons cache
allocated pages.

