Hacker News
Erlang is a hoarder (andy.wordpress.com)
90 points by skeltoac on Feb 13, 2012 | hide | past | favorite | 12 comments

I think two things are interesting here. 1: "when does Erlang GC a process' heap?" and 2: "where does Erlang keep a process' data?".

1: Erlang GCs a process' heap whenever that process' heap gets full, or when you call erlang:garbage_collect() explicitly.

2: Erlang stores most data associated with a process on the process heap; there's one such heap per Erlang process. Binaries are a special case: if they're large (> 64 octets), the heap only contains a reference to the binary, and the binary itself is stored in an area reserved for binaries.
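You can see that split from a shell (a sketch; the 64-octet heap-binary limit is an implementation detail and could change between releases):

```erlang
%% Binaries up to 64 bytes live on the process heap; larger ones go
%% to the shared binary area, and the heap holds only a ProcBin
%% reference to them.
Small = binary:copy(<<"x">>, 64),   % 64 bytes: heap binary
Large = binary:copy(<<"x">>, 65),   % 65 bytes: refc binary
%% process_info/2 with 'binary' lists only the off-heap refc
%% binaries this process references, as {Id, Size, RefCount}:
{binary, Refs} = erlang:process_info(self(), binary).
%% Large (size 65) should show up in Refs; Small should not.
```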

1+2 can result in a lot of binary garbage left lying around. Here's how it can happen:

a) A process creates a relatively large amount of unused heap space. This could happen by temporarily using a large amount of heap, e.g. by calling binary_to_list() on a large-ish binary, doing something with the list, then dropping it. For argument's sake, let's say we made a heap (for one Erlang process) with 30M free.

b) Now the process moves on to a new phase: creating large but short-lived binaries. Let's say 1M each. Those binaries don't live on the heap, only a reference to them does. So they only consume 8 (?) octets on the heap.

c) If the references are the only thing using the heap, you can make roughly 30M / 8 ≈ 4M of them before filling the process heap. But since the binaries they point at are 1M each, those references pin about 4T of binary-area memory, i.e. you'll run out of memory long before the process heap fills and triggers a GC.

As you found out, setting 'fullsweep_after' to 0 doesn't help, since a GC is never triggered. But explicitly calling erlang:garbage_collect() does.
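A minimal sketch of that situation and fix (gc_demo is a made-up module name; the worker drops its binaries, but the dead ProcBins stay on its near-empty heap until a GC runs):

```erlang
-module(gc_demo).
-export([start/0, loop/0]).

%% fullsweep_after 0 alone doesn't help: with a near-empty heap, no
%% collection (full-sweep or otherwise) is ever triggered.
start() ->
    spawn_opt(?MODULE, loop, [], [{fullsweep_after, 0}]).

loop() ->
    receive
        {handle, Bin} when is_binary(Bin) ->
            %% use Bin, then drop it; its ProcBin lingers on the
            %% heap as garbage, pinning the refc binary
            loop();
        gc ->
            erlang:garbage_collect(),   % explicit GC frees them
            loop()
    end.
```

erlang:garbage_collect(Pid) can also be called from outside the process, e.g. by a periodic sweeper.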

You can investigate a bit more using tracing. erlang:trace/3 can generate a message whenever the target process is GCed (the garbage_collection entry in FlagList). If my guess is right, you should see that your processes holding all the binaries are never (or rarely) GCed. Tracing GC is cheap; you won't notice any performance difference if you only do it on a few processes.
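A sketch of that trace setup, assuming Pid is bound to a suspect process (the event names are gc_start / gc_end in releases of this era; newer OTP splits them into gc_minor_* / gc_major_*):

```erlang
%% The calling process becomes the tracer and receives a message
%% for each GC on Pid; Info carries heap_size, old_heap_size, etc.
erlang:trace(Pid, true, [garbage_collection]),
erlang:garbage_collect(Pid),          % provoke one collection
receive
    {trace, Pid, GcEvent, Info} ->
        io:format("~p: ~p~n", [GcEvent, Info])
after 1000 ->
    io:format("no GC traced~n")
end.
```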

The process_info BIF can also tell you quite a bit, e.g. the process heap size.
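For example (a sketch, assuming Pid is the suspect process; the 'binary' item also reveals how much refc binary memory the process is pinning):

```erlang
%% heap_size / total_heap_size are in machine words; 'binary' lists
%% the off-heap refc binaries as {Id, Size, RefCount}.
[{heap_size, Words},
 {total_heap_size, Total},
 {binary, Bins}] =
    erlang:process_info(Pid, [heap_size, total_heap_size, binary]),
%% Total bytes of refc binaries kept alive by this process:
Pinned = lists:sum([Size || {_Id, Size, _RefCount} <- Bins]).
```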

Disclaimer #1: my knowledge about the details above may be wrong or out of date. But the mechanism is known. I've seen it in embedded systems and handled it in similar ways to your approach.

Disclaimer #2: obviously I'm taking a guess. It's possible you've run into something else entirely. That's why I suggested some things to look at to confirm or reject my suspicion.

Might this be considered a symptom of an architecture problem? I'm far from an Erlang expert, but my understanding is that processes should generally be short-lived, except for supervisors, which should only supervise. Instead of having a single long-lived process handle a lot of large binaries, would a better design be to have a separate process handle each binary?

It depends. Some state needs to live, some needs to go. Long-lived processes should ideally do few state manipulations, or be easy to replace (so they store less state). Risky or frequent operations should be done far down in the supervision tree.

The processes with complex state, things that can't be lost, might tend to be long-lived. In these cases, they should either only do very simple operations, or be isolated from the operations on that state.

These processes will generally live higher up in the supervision tree structure and delegate the risky work to processes lower in the hierarchy; these short-lived workers will thus have their impact limited, but will also have their state known before some unit of work, a bit like an invariant. If the short-lived worker dies, then restarting it with its short-lived state is a cheap operation.

Restarting the long-lived process is a difficult thing because the state might be a) possible to re-compute, but complex to do so, or b) bound to events that cannot be repeated, and can't be lost.

In my case the processes needing GC are streaming TCP sockets. Their module is just a loop function that receives a binary, sends it to the client, then does a tail call. So there should be no reason for the process to die. They have been running indefinitely, except when the system ran into swap.
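One common mitigation for exactly this shape of process is to hibernate when the stream goes quiet (a sketch; stream_loop and the 5-second timeout are made up). erlang:hibernate/3 does a full-sweep GC and shrinks the heap before the process sleeps:

```erlang
-module(stream_loop).
-export([loop/1]).

%% Receive a binary, push it to the client socket, tail-call --
%% the loop described above. After 5s of silence, hibernate: the
%% heap is compacted and dead ProcBins (and the refc binaries they
%% pin) are released.
loop(Sock) ->
    receive
        {data, Bin} ->
            ok = gen_tcp:send(Sock, Bin),
            loop(Sock)
    after 5000 ->
        erlang:hibernate(?MODULE, loop, [Sock])
    end.
```

For a stream that never goes quiet, a counter that calls erlang:garbage_collect() every N messages achieves a similar effect.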

As described here: http://www.erlang.org/doc/efficiency_guide/binaryhandling.ht...

Binaries are either stored on a private heap or in a global area where they are reference counted. Binary reference counting depends on ProcBin objects stored on a process heap.

Reference counts are only decremented when a garbage collection occurs, so forcing an explicit collection on a process removes the dead ProcBin objects and releases the binaries they point to.

The system has no way of knowing whether a collection should be triggered to free ref-counted memory, because collection is local to a process.

Uhh, I've seen this too. Part of the reason we are migrating some code from Erlang to C.

It could be argued that this is a poor reason to switch from Erlang to C -- this is a problem with how binaries are handled in user code, not something intrinsic to the Erlang VM.

It could also be argued that you will still have memory problems, just of a different type :)

Eh, every factor involved is an implementation detail: the threshold at which binaries are heap-allocated, GC behavior, etc. I really fail to see how this is user error.

"I really fail to see how this is user error."

That isn't what the parent comment said.

What parts? I'm interested in the future of Couch(Base|DB) sans all the Apache wankery.

I really wish you'd take more of a BDFL approach to things.

Specifically we're working to get Erlang out of the data path. So the code where throughput and latency really matter is written by hand and compiles directly to machine code, and Erlang continues to do what it does best - manage distributed systems and asynchronous processes.

In theory a great VM could out-perform C, but after you've chased down a few Erlang VM WTFs, there's something nice about being closer to the metal.
