1: Erlang GCs a process's heap whenever that process's
heap gets full, or when you call erlang:garbage_collect().
2: Erlang stores most data associated with a process
on the process heap; there's one such heap per
Erlang process. Binaries are a special case: if
they're large (> 64 octets), the heap only
contains a reference to the binary, and the binary
itself is stored in a separate area reserved for binaries.
1+2 can result in a lot of binary garbage left lying
around. Here's how it can happen:
a) A process creates a relatively large amount of
unused heap space. This could happen by
temporarily using a large amount of heap, e.g. by
calling binary_to_list() on a large-ish binary,
doing something with the list then dropping
it. For argument's sake, let's say we made a heap
(for one Erlang process) with 30M free.
b) Now the process moves on to a new phase: creating
large but short-lived binaries. Let's say 1M
each. Those binaries don't live on the heap, only
a reference to them does. So they only consume 8
(?) octets on the heap.
c) If the references are the only thing using the
heap, then you can make 4M of them before filling
the process heap. But since they're 1M each,
they'll eat 4T of the binary heap. i.e. you'll
run out of memory.
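A minimal sketch of the a)-c) sequence above; the module name, sizes, and loop count are my own choices, not from the original program:

```erlang
%% Hypothetical module sketching the scenario above.
-module(leak_sketch).
-export([leaky/0]).

leaky() ->
    %% a) temporarily use a lot of heap; once the list is dropped, the
    %%    process is left with a large, mostly empty heap.
    Big = binary_to_list(binary:copy(<<0>>, 5000000)),
    _ = length(Big),
    churn(1000).

%% b)+c) each iteration allocates a large off-heap binary; only a small
%% ProcBin reference lands on the (huge, mostly free) process heap, so
%% the heap never fills, no GC runs, and dead binaries pile up in the
%% binary area.
churn(0) -> ok;
churn(N) ->
    _Bin = binary:copy(<<1>>, 1000000),
    churn(N - 1).
```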
As you found out, setting 'fullsweep_after' to 0
doesn't help, since a GC is never triggered. But
explicitly calling erlang:garbage_collect() does.
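For reference, the explicit-GC remedy is a one-liner; this sketch (module name and sizes are mine) forces a collection once the binary is no longer needed:

```erlang
-module(gc_fix).
-export([drop_and_collect/1]).

%% Build a large off-heap binary, use it, then force a collection so
%% the ProcBin on our heap is reclaimed and the binary can be freed.
drop_and_collect(Bytes) ->
    Bin = binary:copy(<<0>>, Bytes),
    _ = byte_size(Bin),
    true = erlang:garbage_collect(),
    ok.

%% fullsweep_after can also be set per process at spawn time, e.g.:
%%   spawn_opt(fun() -> ... end, [{fullsweep_after, 0}])
%% but, as noted above, that only matters once a GC actually runs.
```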
You can investigate a bit more using
tracing. erlang:trace/3 can generate a message
whenever the target process is GCed (the
garbage_collection entry in flaglist). If my guess
is right, then you should see that your processes
holding all the binaries are never (or rarely)
GCed. Tracing the GC is cheap; you won't notice any
performance difference if you only do it on a few
processes.
The process_info BIF can also tell you quite a bit,
e.g. the process heap size.
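A sketch of that tracing setup (module name is mine; the GC trace tags differ across OTP releases — gc_minor_start/gc_major_start on recent ones, gc_start on older ones — so the receive matches any tag):

```erlang
-module(gc_watch).
-export([watch_gc/1]).

%% Trace GCs of Pid for up to 5 seconds.
watch_gc(Pid) ->
    erlang:trace(Pid, true, [garbage_collection]),
    receive
        {trace, Pid, Tag, Info} -> {Tag, Info}
    after 5000 ->
        no_gc_seen   %% consistent with the guess: the process never GCs
    end.
```

process_info(Pid, [heap_size, total_heap_size, memory]) reports the heap sizes (in words) and total memory (in bytes) directly.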
Disclaimer #1: my knowledge about the details above may
be wrong or out of date. But the mechanism is known.
I've seen it in embedded systems and handled it in similar
ways to your approach.
Disclaimer #2: obviously I'm taking a guess. It's possible
you've run into something else entirely. That's why I
suggested some things to look at to confirm or reject my
guess.
Processes with complex state, things that can't be lost, tend to be long-lived. They should either do only very simple operations themselves, or be isolated from the risky operations on that state.
These processes will generally live higher up in the supervision tree structure and delegate the risky work to processes lower in the hierarchy; these short-lived workers will thus have their impact limited, but will also have their state known before some unit of work, a bit like an invariant. If the short-lived worker dies, then restarting it with its short-lived state is a cheap operation.
Restarting the long-lived process is difficult because its state might be a) possible to re-compute, but complex and expensive to rebuild, or b) bound to events that cannot be repeated, and so must not be lost.
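A minimal sketch of that delegation pattern, with a hypothetical module name: the long-lived state holder farms the risky work out to a monitored, short-lived process, so a crash comes back as a value instead of taking the state holder down:

```erlang
-module(delegate).
-export([run_risky/1]).

%% Run Fun in a short-lived, monitored process. If it crashes, the
%% caller (the long-lived state holder) gets {error, Reason} back
%% instead of being killed; restarting the worker is cheap because its
%% state is rebuilt from the arguments.
run_risky(Fun) ->
    {Pid, Ref} = spawn_monitor(fun() -> exit({done, Fun()}) end),
    receive
        {'DOWN', Ref, process, Pid, {done, Result}} -> {ok, Result};
        {'DOWN', Ref, process, Pid, Reason}         -> {error, Reason}
    end.
```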
Binaries are either stored on a process's private heap or in a global area where they are reference counted. That reference counting depends on ProcBin objects stored on the process heap.
The reference counts only drop when garbage collection reclaims those ProcBin objects, which is why forcing an explicit collection on a process frees the binaries it no longer references.
The system has no way of knowing that a collection should be triggered to free ref-counted memory, because collection is local to a process.
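One way to observe this, using erlang:memory(binary) (the figure is node-wide and approximate; module name is mine):

```erlang
-module(binmem).
-export([demo/0]).

%% Allocate and drop a 10 MB off-heap binary, then compare the node's
%% binary allocation before and after an explicit local collection.
demo() ->
    Before = erlang:memory(binary),
    _ = binary:copy(<<0>>, 10000000),
    true = erlang:garbage_collect(),
    After = erlang:memory(binary),
    {Before, After}.   %% After should be close to Before again
```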
That isn't what the parent comment said.
I really wish you'd take more of a BDFL approach to things.
In theory a great VM could out-perform C, but after you've chased down a few Erlang VM WTFs, there's something nice about being closer to the metal.