
So having written production Erlang, I have literally never had an issue with large amounts of problematic mails in the mailbox. It's like, yeah, technically it's possible, since there's no forced constraints around what kinds of messages a given process can be sent (which is intentional; after all, the process may enter a later receive statement that knows how to handle those kinds of messages), but it's really easy to do correctly. And you can always program defensively if you want to limit the kinds of messages you respond to by having a wildcard pattern match at the end of a receive.
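E.g., something like this (a toy sketch; the message shapes and do_job/1 are made up):

    loop(State) ->
        receive
            {work, From, Job} ->
                %% do_job/1 stands in for the real work
                From ! {done, do_job(Job)},
                loop(State);
            Other ->
                %% defensive wildcard: anything we don't understand gets
                %% logged and dropped instead of sitting in the mailbox
                logger:warning("unexpected message: ~p", [Other]),
                loop(State)
        end.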



If you're not running into overfull mailboxes, you're not trying hard enough! :D

Usually, it's some process that's gotten itself into a bad state and isn't making any progress (which happens, we're human), so ideally you had a watchdog on it to kill it if it makes no progress, or flag it for human intervention, or stop sending it work, because it's not doing any of it.
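Something like this, roughly (a crude sketch; here "no progress" is approximated by "mailbox keeps growing past some limit", and the names and thresholds are made up):

    watchdog(Pid, MaxQueueLen) ->
        timer:sleep(5000),
        case erlang:process_info(Pid, message_queue_len) of
            {message_queue_len, Len} when Len > MaxQueueLen ->
                %% not keeping up; kill it (or page a human, or tell
                %% upstream to stop sending it work)
                exit(Pid, kill);
            {message_queue_len, _} ->
                watchdog(Pid, MaxQueueLen);
            undefined ->
                %% the process already died on its own
                ok
        end.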

But sometimes you get into a bad place where you've got much more work coming in than your process can handle, and it just grows and grows. This can get worse if some of the processing involves a send + receive to another process and the pattern doesn't trigger the selective receive optimization, where the mailbox is marked at the make_ref() and the receive only checks messages from the marker onward, instead of the whole thing.
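For reference, the ref-tagged call shape the compiler recognizes looks roughly like this (a sketch; a real gen_server:call does more, e.g. monitoring):

    call(Server, Request) ->
        Ref = make_ref(),
        Server ! {request, self(), Ref, Request},
        receive
            {reply, Ref, Result} ->
                %% because Ref is created right before the receive and the
                %% pattern matches on it, the VM can skip every message
                %% that was already in the mailbox when make_ref() ran
                Result
        after 5000 ->
            erlang:error(timeout)
        end.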

If you miss that optimization, every receive scans the whole mailbox, which takes a long time when the mailbox is large, and tends to put you further behind, which makes the scans take longer, and so on, until you lose all hope of catching up and eventually run out of memory; possibly triggering process death if you've set max_heap_size on the process (although where I was using Erlang we didn't set that), or triggering death of the BEAM OS process when an allocation fails or the OOM killer takes it, or triggering OS-level trouble, if once BEAM sucks in all the memory the OS can't muster its OOM killer in time and just gets stuck or OOM kills the wrong thing.
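For what it's worth, setting that limit looks roughly like this (the number is arbitrary, size is in heap words, and worker/0 is a placeholder for the real loop):

    start_worker() ->
        spawn_opt(fun worker/0,
                  [{max_heap_size, #{size => 10000000,
                                     kill => true,
                                     error_logger => true}}]).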


Yep; as I mentioned elsewhere, I'm not discounting mailboxes filling faster than you can handle them, but rather the idea that you have 'problematic mails in the mailbox' - I've never had problematic mail in the mailbox. I've certainly seen mailboxes grow because they were receiving messages faster than I was processing them, and I had to rethink my design. But that isn't an issue with the mail going into the mailbox; that's an issue with my not scaling out my receivers to fit the load. As I mentioned elsewhere, that may seem like semantics, but to me it isn't: it means the language working that way makes sense, and the issue is rather that I created a bottleneck with no way to relieve the pressure (as compared to messages that can't be handled and just sit there, causing receives to take longer and longer over time).


Oh, I think I see. I don't think there's really such a thing as a problematic mail for BEAM; mails are just terms, and BEAM handles terms, no problem. A mail that contains a small slice of a very large binary would of course keep the whole referenced binary hanging around, which could be bad; a large tuple could also be bad, because if you tried to inspect the message queue through process_info from another process, that would be a big copy.
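For example, something as innocent-looking as the first call here copies the whole queue into the caller, while the second is cheap (toy function, Pid is whatever process you're inspecting):

    inspect(Pid) ->
        %% copies every message (and whatever they reference) into us:
        {messages, Msgs} = erlang:process_info(Pid, messages),
        %% just the count, which is cheap to fetch:
        {message_queue_len, Len} = erlang:process_info(Pid, message_queue_len),
        {Len, Msgs}.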

But I think maybe the original poster just meant lots of bad mail in the mailbox to mean mail that would take a long time to process, because of how the receiving process will handle it.

Or, possibly bad mail meaning (as you suggest, perhaps), mail that won't be matched by any receive, resulting in longer and longer receives when they need to start from the top.


"But I think maybe the original poster just meant lots of bad mail in the mailbox to mean mail that would take a long time to process, because of how the receiving process will handle it."

Yeah; just, if he meant that, it seems like a...weird call out. Since that's not particular to Erlang's messaging model; that's true in any system where you have a synchronization point that is being hit faster than it can execute. Seems weird to call that out as a notable problem, as such.

What's unique to Erlang, and -could- be prevented (by limiting a process to a single receive block, and having a type system prevent sending any messages not covered by that receive block), if you wanted to change the model, is the fact that I can send arbitrary messages to a process that will never match on them, and because the mailbox is a queue, those messages sit there and slow down the handling of everything that arrives after them. Hence my focusing on that; yes, that's a potential problem, no, it's not a particularly likely one.
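A contrived version of what I mean (toy names, obviously):

    %% the process only ever knows how to handle {work, Job}:
    loop() ->
        receive
            {work, Job} ->
                handle_job(Job),   %% handle_job/1 is a placeholder
                loop()
        end.

    %% but nothing stops anyone from doing this; the atom never matches,
    %% so it sits in the queue and every later receive has to scan past it:
    poison(Pid) ->
        Pid ! message_nobody_will_ever_match.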


I've also never had the issue where process mailboxes were filling faster than the messages were being consumed. If I were to run into a problem where that was an issue, I would question whether or not Erlang/Elixir was the right tool for the task, but in my experience there's always been a way to spread that load across multiple processes so that work is being done concurrently, and eventually across multiple nodes if the throughput demands keep increasing. If the workload truly does have to be synchronized, I've always had the experience that sending a message to another tool was the right answer - maybe a database query or another service, for example.


Overfull mailboxes can be problematic even with a wildcard `receive` and can happen if some branch of the control flow is slower than expected. We have some places in our code that essentially do manual mailbox handling by merging similar messages and keeping them in the process state until it's ready to actually handle them.
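Roughly along these lines, I'd guess (a sketch of the coalescing idea, not your actual code; message shapes and apply_updates/2 are made up):

    loop(State) ->
        receive
            {update, Key, Value} ->
                %% got one update; pull anything else already queued and
                %% merge updates for the same key before doing real work
                Pending = drain(#{Key => Value}),
                loop(apply_updates(Pending, State))
        end.

    drain(Pending) ->
        receive
            {update, Key, Value} ->
                drain(Pending#{Key => Value})
        after 0 ->
            Pending
        end.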


Right, but I wouldn't characterize that as problematic messages... rather problematic handling, or systemic issues with load. I.e., my fix does not change the format of the messages. Semantics, perhaps, but there's a difference between "this design decision of Erlang's caused me problems" and "this architectural/design decision of mine caused me problems".



