ZMQ 'sockets' aren't compatible with those calls, they use a userland equivalent of those kernel level edge triggered pollers. Integrating ZMQ with a traditional event library (like eventmachine) presents a problem at this point, as software like EM or Node typically require IO to be across real file descriptors from real sockets, something a userland library can't provide. The ZMQ devs however realized this was a hotly requested feature and so devised a way around this limitation.
The compatibility layer in ZMQ takes the form of performing some internal communication across traditional unix IPC, in the case using a pipe IIRC. In other words, for some of its internal messaging rather than simply use a function call, ZMQ will push data across a pipe. This pipe can then be exposed as a proxy for a ZMQ socket.
The downside of this strategy is that exposing FDs across software requires extreme care. Generally, it is assumed that one piece of software will have responsibility for an FD.
The actual issue in my case was that any ruby exception would cause the entire process to crash with an error about closing an already closed FD. What was happening was that given an exception both ZMQ and EM were trying to shut down all the FDs they knew about. Closing an FD that's already closed causes ZMQ to assert and crash instantly. It sounds simple once you're in the right frame of mind, but it took a good number of evenings to track down to that cause. It turned out the the EM option to not shut-down FDs was non-functional in the end. A one character patch provided the fix.
1. It shows how bugs can be something quite conceptually simple
2. It shows the value of logical, detective-like thinking in tracking these bugs.
Even as a programmer, I still think of bug-hunting as something requiring an encyclopedia knowledge of the trivial and arcane. Obviously, it looks easier in hindsight, but the OP does a great job of demonstrating how you can discover a much-overlooked flaw with the right deductive thinking (and experience with profiling tools)
I have no doubt that the OP is good at reasoning logically, but the take-home lesson here is, if you want to be good at debugging, do a lot of debugging.
Summarizing the reasoning process:
1. The program's object space doesn't contain an absurd number of small objects, so inspect the core dump
2. 95% of the core dump is leaked objects, so a random sample should contain clues to the composition of the leaked objects.
3. A repeated pattern in every leaked object indicates a common pointer, i.e. a single type of object.
4. The signature helps find what file in the program is being referenced, which indicates that the pointer's object type is a BIO struct
5. This kind of leak isn't possible in straight Ruby. So between OpenSSL and EventMachine's C/C++ code, the latter is more likely to have something awry.
6. Search the EM code for BIO constructors
7. Check each constructor call to see if the BIO instance is ultimately freed
Steps 3 and 4 are the most arcane to me...I am pretty sure I do not know enough about memory to look at a hex signature and realize where in the address space it refers to, or even the significance of it.
Noting there is an identical pointer in each object, assume this is a pointer to some kind of type definition struct. What will this have in it? There'll likely be a type name, some function pointers for standard operations, maybe a link to any more definitions, that kind of thing. This kind of arrangement is just something you do in C and C++. It's very common.
So take a look at what's at that address, under the assumption it's just a block of pointers. And then look at the first pointer (0x401 is obviously not a pointer). And in this case he got lucky, because it points to a suggestive string. (It could just as easily have pointed to another type definition, or something random. Though it's not uncommon for descriptors to have the name in the first field or two.)
Object type names are usually string literals so it's likely to point into the data segment of an EXE or DLL. (Maybe on Linux you can guess that from the address as well.) That's what the maps thing is all about - figuring out where the string might come from as a way of narrowing down the search.
Other platforms likely have similar tools, though I have yet to stumble across one as easy to use.
Also, we weren't yet sure whether this bug was reproducible in development/QA, or was only triggerable in production, so using techniques that worked directly on the running image was attractive.
This was the work of someone who knew what they were doing, of course. Had it been, say, me, then the effect and utility of the piece would have been considerably lacking.
I wrote a frontend: https://github.com/kevingadd/HeapProfiler but you can also just invoke it manually from the shell and then capture the state of the heap at your leisure.
In fewer words: It’s most likely a pointer."
As someone who has not stared at any core dump for more than about 2 seconds, I admire this level of skill.
No product announcement links, no Valley gossip rag links, no endless picking apart of every tiny Apple and Google thing.
What makes HN special are posts about making things. Everything else is just the same chatter all the other tech sites have.
"It was easy enough to work around the leak by adding monitoring and restarting the process whenever memory usage grew too large"
I was surprised, because I can not think of any other language and/or framework where "just restart the process" is done so often. I mean, this is not a common attitude among Java programmers, I don't think it is common among C programmers, and I don't think it is common among Python programmers. But it does seem to be fairly standard in the Ruby community. David Heinemeier Hansson admitted this used to happen with Basecamp:
Can anyone else tell me of a community where this is done so commonly?
> We restart HN every 5 or 6 days, or it gets slow (memory leaks). 
pg went into more depth about this somewhere, but I don't have the link on hand at the moment. Essentially, the software running it is riddled with memory leaks, but it's more time-efficient to simply reboot it every so often than it is to actually go in and fix it.
In my tests of python 2.6.6, 2.7.3, and 3.3, python 2.7.3 was by far the worst in hanging on to memory. Yet the performance increase (due to integration of simplejson) is worth the memory penalty. We use gunicorn with sync workers to serve our WSGI app and a memory watchdog to signal the worker to gracefully retire after it handles a query large enough to leave it with a large memory footprint.
Performance has been awesome and we're very happy with the result.
We need some online courses dedicated to not-beginners :)