Thanks for sharing this. A few questions from someone interested in learning how to use BEAM-based systems in production:
* How did you debug the issue?
I know Erlang in Anger [1] is kind of written just for this but it feels pretty intense to me as a newbie. I don't know enough to tell whether the issues there are only things I'd need to worry about at larger scale but I found it pretty intimidating, to the point where I have delayed actually designing/deploying an Erlang solution. Designing for Scalability with Erlang/OTP [2] has a chapter on monitoring that I'm looking forward to reading. Wondering if there are some resources you could recommend to get started running a prod Erlang system re: debugging, monitoring, etc.
* How do you decide when to use a message queue (like SQS, RabbitMQ) vs regular processes? Do you have any guidelines you could share or is it more just, "use processes when you can; more formal MQ if interfacing with non-BEAM systems"? I struggle since conceptually each sender/receiver has its own queue via its mailbox.
Good question. First I noticed in the metrics dashboard (so it is important to have metrics) that the receiver never seemed to get the expected number of messages.
Then I noticed the message count was being reset, so I suspected something was restarting the node. Browsing through the metrics, I saw that memory usage was spiking too high and that the node was indeed restarting (uptime kept going up and then dropping back down).
I focused on memory usage. We have a small function which returns the top N memory-hungry processes. One process stood out, and I noticed its mailbox was tens of gigabytes.
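A helper of that kind can be quite small. The following is a minimal sketch using only the standard library, not the actual function from the incident; the module name `proc_top` and the shape of the result tuples are assumptions made for illustration.

```erlang
-module(proc_top).
-export([top_memory/1]).

%% Return up to N {Pid, MemoryBytes, MailboxLength} tuples, largest memory first.
%% process_info/2 returns 'undefined' for a process that has already exited;
%% the list pattern in the generator does not match it, so such pids are skipped.
top_memory(N) when is_integer(N), N >= 0 ->
    Infos = [{Pid, Mem, QLen}
             || Pid <- erlang:processes(),
                [{memory, Mem}, {message_queue_len, QLen}]
                    <- [erlang:process_info(Pid, [memory, message_queue_len])]],
    Sorted = lists:sort(fun({_, A, _}, {_, B, _}) -> A >= B end, Infos),
    lists:sublist(Sorted, N).
```

Calling `proc_top:top_memory(5).` in a remote shell lists the five largest processes along with their mailbox lengths. If the recon library is already on the node, `recon:proc_count(memory, 5)` gives a similar view without any custom code.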
Then I used recon_trace to trace that process and see what it was doing. recon_trace is written by the author of Erlang in Anger. So I did something like:
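A rate-limited trace along those lines might look like the sketch below, typed into a remote shell on the affected node. It assumes the recon library is on the code path; the registered name `my_worker`, the module name `my_module`, and the limit values are placeholders, not the ones used in the incident.

```erlang
%% Placeholder: look up the suspect process by a hypothetical registered name.
Pid = whereis(my_worker).

%% Trace calls to any function in my_module made by that process only,
%% including local (non-exported) calls, limited to 10 trace messages
%% per 1000 ms by recon_trace's built-in rate limit.
recon_trace:calls({my_module, '_', '_'}, {10, 1000}, [{pid, Pid}, {scope, local}]).

%% Stop all tracing when done.
recon_trace:clear().
```

Passing a plain integer instead of `{10, 1000}` caps the total number of trace messages rather than the rate, which is the more cautious choice for a first look at a busy process.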
The dbg module is built in and you can use that, but it doesn't have rate limiting, so it can kill a busy node if you trace the wrong thing (it floods you with messages).
The trace showed that a particular operation was taking way too long, because it was doing an O(n) operation instead of an O(1) one on each message. At a smaller scale it wasn't noticeable, but once we got to 1M-plus instances it was bringing everything down.
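The post doesn't say which operation it was. As a purely hypothetical illustration of the same class of problem, compare looking up per-instance state in a list of tuples with looking it up in a map; the module and function names here are invented for the example.

```erlang
-module(lookup_cost).
-export([find_in_list/2, find_in_map/2]).

%% O(n): scans a list of {Id, State} tuples on every call.
%% Done once per message with ~1M entries, the mailbox grows faster
%% than the process can drain it.
find_in_list(Id, Instances) when is_list(Instances) ->
    case lists:keyfind(Id, 1, Instances) of
        {Id, State} -> {ok, State};
        false -> error
    end.

%% Near-constant time: the same data held in a map keyed by Id.
find_in_map(Id, Instances) when is_map(Instances) ->
    maps:find(Id, Instances).
```

Both functions return `{ok, State}` or `error`, so swapping one for the other changes only the cost per message, not the calling code. An ETS table of type `set` would be another way to get constant-time lookups.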
To test the fix, I compiled a .beam file locally, scp-ed it to a test machine, hot-patched it (code:load_abs("/tmp/my_module")) and saw that everything was working fine.
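For anyone trying this, a slightly more defensive version of that hot-patch step is sketched below. The `hot_patch` module is an invention for this example, and the `code:soft_purge/1` call is an addition the original steps did not include: loading fails with `not_purged` if an old version of the module is still present, and `soft_purge` removes it only when no process is still running it.

```erlang
-module(hot_patch).
-export([load/1]).

%% Load the .beam at AbsPathNoExt (absolute path without the ".beam" suffix)
%% into the running node, replacing the current version of the module.
load(AbsPathNoExt) ->
    Mod = list_to_atom(filename:basename(AbsPathNoExt)),
    %% soft_purge/1 returns false if some process still runs the old code;
    %% in that case we refuse rather than kill that process.
    case code:soft_purge(Mod) of
        true ->
            code:load_abs(AbsPathNoExt);   % {module, Mod} | {error, Reason}
        false ->
            {error, old_code_in_use}
    end.
```

With the compiled file copied to `/tmp/my_module.beam`, `hot_patch:load("/tmp/my_module").` returns `{module, my_module}` on success.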
On whether to pick a message queue vs regular processes: it depends. They are very different. Regular processes are easier, simpler and cheaper, but they are not persistent: a message sitting in a mailbox is lost if the process or the node goes down. So if your messages are like "add $1M to my account" and the sender wants to just send the message and not worry about whether it was acknowledged, then you'd want something with very good persistence guarantees.
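To make the trade-off concrete, here is a minimal sketch of what acknowledging a message between plain processes looks like. The `acked_send` module and the `{From, Ref, Msg}` / `{ack, Ref}` message shapes are assumptions for this example. Note that it only detects a lost message; the message still lives only in the receiver's memory, which is the gap a persistent queue fills.

```erlang
-module(acked_send).
-export([send/3]).

%% Send Msg to Pid and wait up to Timeout ms for {ack, Ref}.
%% The receiver is expected to handle {From, Ref, Msg} and reply
%% with From ! {ack, Ref} once the message has been processed.
%% Returns ok, {error, {down, Reason}} if the receiver died first,
%% or {error, timeout}.
send(Pid, Msg, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {self(), Ref, Msg},
    receive
        {ack, Ref} ->
            erlang:demonitor(Ref, [flush]),
            ok;
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, {down, Reason}}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.
```

On `{error, _}` the sender has to decide whether to retry, and a retry can deliver the same message twice, so the receiver needs to be idempotent. Once you find yourself building that plus storage for unsent messages, a real message queue starts to look attractive.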
1. https://www.erlang-in-anger.com/
2. http://shop.oreilly.com/product/0636920024149.do