

Tracking down a memory leak in Ruby's EventMachine - nelhage
http://blog.nelhage.com/2013/03/tracking-an-eventmachine-leak/

======
andrewvc
I found a bug in eventmachine during the time I spent writing em-zeromq, the
eventmachine binding for zeromq. The important thing to understand here is
that ZMQ is, in essence, a userland socket. Normal sockets are efficiently
monitored using the epoll system call (or one of its older alternatives, such
as select or poll).

As a userland library, however, ZMQ 'sockets' aren't compatible with those
calls; they use a userland equivalent of those kernel-level edge-triggered
pollers. Integrating ZMQ with a traditional event library (like eventmachine)
presents a problem at this point, as software like EM or Node typically
requires IO to happen across real file descriptors from real sockets,
something a userland library can't provide. The ZMQ devs realized this was a
hotly requested feature, though, and devised a way around the limitation.

The compatibility layer in ZMQ takes the form of performing some internal
communication across traditional unix IPC, in this case using a pipe, IIRC. In
other words, for some of its internal messaging, rather than simply use a
function call, ZMQ will push data across a pipe. This pipe can then be exposed
as a proxy for a ZMQ socket.
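
Consuming that proxy FD then looks roughly like the C sketch below; ZMQ_FD and
ZMQ_EVENTS are the real libzmq socket options, while the surrounding function
is contrived:

    #include <zmq.h>
    #include <poll.h>

    /* Wait for activity on a ZMQ socket using a traditional poller.
       ZMQ_FD hands back the internal pipe-like FD; it is edge-triggered,
       so once it signals you must check ZMQ_EVENTS to learn what is
       actually pending. */
    void wait_on_zmq_socket(void *zmq_sock)
    {
        int fd;
        size_t fd_len = sizeof(fd);
        zmq_getsockopt(zmq_sock, ZMQ_FD, &fd, &fd_len);

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, -1);              /* or register fd with epoll/EM */

        int events;
        size_t ev_len = sizeof(events);
        zmq_getsockopt(zmq_sock, ZMQ_EVENTS, &events, &ev_len);
        if (events & ZMQ_POLLIN) {
            /* drain messages with zmq_recv() until ZMQ_POLLIN clears */
        }
    }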

The downside of this strategy is that sharing FDs across software boundaries
requires extreme care. Generally, it is assumed that _one_ piece of software
has responsibility for a given FD.

The actual issue in my case was that _any_ ruby exception would cause the
entire process to crash with an error about closing an already closed FD. What
was happening was that, given an exception, _both_ ZMQ and EM were trying to
shut down all the FDs they knew about. Closing an FD that's already closed
causes ZMQ to assert and crash instantly. It sounds simple once you're in the
right frame of mind, but it took a good number of evenings to track down to
that cause. It turned out that the EM option to not shut down FDs was non-
functional in the end. A one-character patch provided the fix.
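
The failure mode is easy to reproduce in miniature; a contrived C sketch of
the double-close (not the actual EM or ZMQ code):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        pipe(fds);            /* stand-in for ZMQ's internal signaling pipe */

        close(fds[0]);        /* owner #1 (say, ZMQ) cleans up on error */

        errno = 0;
        int rc = close(fds[0]);   /* owner #2 (say, EM) cleans up again */
        printf("second close: rc=%d errno=%s\n", rc, strerror(errno));
        /* prints rc=-1 errno=Bad file descriptor; a library that asserts
           close() succeeded aborts right here. Worse, if the FD number was
           already reused, the second close silently destroys someone
           else's descriptor. */
        return 0;
    }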

------
twoodfin
Like danso, I admire the detective work here. I would like to point out,
though, that Xcode's Instruments utility has a fantastically useful "Leaks"
mode that will identify leaked allocations, including a stack trace. It can
attach to a running process and has a non-disastrous impact on performance,
though like most such tools it's voracious for memory.

Other platforms likely have similar tools, though I have yet to stumble across
one as easy to use.

~~~
nelhage
(author here) Yeah, there are a lot of such tools, and they can be invaluable.
In my experience, though, if you're working with a large, complex system,
including pieces like Ruby and OpenSSL which do some grody tricks internally,
they'll flag lots of false positives and noise, and it takes a lot of work to
get a development environment where they work well.

Also, we weren't yet sure whether this bug was reproducible in development/QA,
or was only triggerable in production, so using techniques that worked
directly on the running image was attractive.

------
gingerlime
_"If you’ve stared at too many Linux coredumps, as I have, that number looks
suspicious. Interpreted in little-endian, that is 0x00007f1b5358a800, which
points near the top of the userspace portion of the address space on an amd64
Linux machine.

In fewer words: It’s most likely a pointer."_

As someone who has not stared at any core dump for more than about 2 seconds,
I admire this level of skill.
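
(The decoding step itself is mechanical; the skill is knowing to try it. A
standalone C sketch of the byte-flip, using the value from the post:)

    #include <inttypes.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The eight raw bytes as they appear in the dump, low address first. */
        unsigned char raw[8] = {0x00, 0xa8, 0x58, 0x53, 0x1b, 0x7f, 0x00, 0x00};

        uint64_t v;
        memcpy(&v, raw, sizeof v);        /* x86-64 stores integers little-endian */
        printf("0x%016" PRIx64 "\n", v);  /* prints 0x00007f1b5358a800 */

        /* Userspace on amd64 Linux ends at 0x00007fffffffffff, and shared
           libraries are mmap'd just below that ceiling, so 0x00007f.. values
           are prime pointer suspects. */
        return 0;
    }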

------
pc
I wish there were more posts like this.

~~~
gruseom
Me too. I wonder what would make for more posts like this.

~~~
grey-area
Go here every day and upvote interesting articles after reading them.

<http://news.ycombinator.com/newest>

~~~
gruseom
No, what I'm saying (cf. my other comment in this thread) is that it's
actually a supply problem.

~~~
grey-area
Not sure I agree there. There are lots of interesting posts which sink on that
queue and fall off it without getting sufficient votes to be seen, while less
interesting but more controversial stuff floats to the top. All it would take
is a few score more people voting technical stories up consistently to change
the story mix.

------
lkrubner
This is slightly off-topic, but I worked on a Ruby project where we did
something just like this:

"It was easy enough to work around the leak by adding monitoring and
restarting the process whenever memory usage grew too large"

I was surprised, because I cannot think of any other language and/or
framework where "just restart the process" is done so often. I mean, this is
not a common attitude among Java programmers, I don't think it is common among
C programmers, and I don't think it is common among Python programmers. But it
does seem to be fairly standard in the Ruby community. David Heinemeier
Hansson admitted this used to happen with Basecamp:

[http://david.heinemeierhansson.com/posts/31-myth-2-rails-is-...](http://david.heinemeierhansson.com/posts/31-myth-2-rails-is-expected-to-crash-400-timesday)

Can anyone else tell me of a community where this is done so commonly?
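
(For concreteness, that workaround usually boils down to the pattern sketched
below: check your own RSS periodically and exit over a threshold, letting a
supervisor like monit or upstart restart the process. Contrived C, not the
author's actual monitoring:)

    #include <stdio.h>
    #include <stdlib.h>

    /* Read our own resident set size in kB from /proc/self/status. */
    static long rss_kb(void)
    {
        FILE *f = fopen("/proc/self/status", "r");
        char line[256];
        long kb = -1;
        if (!f) return -1;
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
                break;
        fclose(f);
        return kb;
    }

    /* Call periodically; exiting lets the supervisor restart us with a
       fresh heap, which "fixes" the leak without ever finding it. */
    void memory_watchdog(long limit_kb)
    {
        long kb = rss_kb();
        if (kb > limit_kb) {
            fprintf(stderr, "RSS %ld kB exceeds %ld kB, exiting\n", kb, limit_kb);
            exit(1);
        }
    }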

~~~
mintplant
Hacker News itself does this.

> We restart HN every 5 or 6 days, or it gets slow (memory leaks). [1]

pg went into more depth about this somewhere, but I don't have the link on
hand at the moment. Essentially, the software running it is riddled with
memory leaks, but it's more time-efficient to simply reboot it every so often
than it is to actually go in and fix it.

[1] <https://news.ycombinator.com/item?id=4730251>

~~~
koenigdavidmj
That might work for HN, but it doesn't really matter if HN is down for a
while. For something like Basecamp (the original 'restart it every few
minutes' example), it does not seem nearly as professional.

------
danso
There are two things I really like about this walkthrough:

1\. It shows how bugs can be conceptually quite simple.

2\. It shows the value of logical, detective-like thinking in tracking these
bugs.

Even as a programmer, I still think of bug-hunting as something requiring an
encyclopedic knowledge of the trivial and arcane. Obviously, it looks easier
in hindsight, but the OP does a great job of demonstrating how you can
discover a much-overlooked flaw with the right deductive thinking (and
experience with profiling tools).

~~~
icambron
Deduction is great, but what I read here was how much experience matters in
debugging. You have to know what to look and test for to provide fodder for
your reasoning capabilities, or else you're just shooting in the dark. The OP
here had a plan informed by what I'm guessing are years of experience solving
this kind of thing ("It's probably small objects. I should dump the core and
look for repeating memory patterns."). Then there's a lot of accumulated
wisdom that reprioritizes your search as you go along: "That looks like a flag
because flags tend to look like that," etc.

I have no doubt that the OP is good at reasoning logically, but the take-home
lesson here is, _if you want to be good at debugging, do a lot of debugging_.

~~~
danso
Yes, this is true, so as I said, debugging looks easier in hindsight,
especially someone else's hindsight :)

Summarizing the reasoning process:

1\. The program's object space doesn't contain an absurd number of small
objects, so inspect the core dump.

2\. 95% of the core dump is leaked objects, so a random sample should contain
clues to the composition of the leaked objects.

3\. A repeated pattern in every leaked object indicates a common pointer, i.e.
a single type of object.

4\. The signature helps find what file in the program is being referenced,
which indicates that the pointer's object type is a BIO struct.

5\. This kind of leak isn't possible in straight Ruby. So between OpenSSL and
EventMachine's C/C++ code, the latter is more likely to have something awry.

6\. Search the EM code for BIO constructors.

7\. Check each constructor call to see if the BIO instance is ultimately freed
(see the sketch below).
\---
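
In concrete terms, steps 6 and 7 are an audit for the shape of bug below (an
illustrative C sketch using OpenSSL's real BIO calls, not EM's actual code):

    #include <openssl/bio.h>

    /* The class of bug steps 6-7 hunt for: a constructor on every
       code path, a matching destructor on only some of them. */
    void setup_buffers(void)
    {
        BIO *rbio = BIO_new(BIO_s_mem());   /* constructor: allocates */
        BIO *wbio = BIO_new(BIO_s_mem());

        /* ... hand the BIOs to an SSL object, do the handshake ... */

        BIO_free(rbio);
        /* wbio is never freed on this path: every call leaks one small
           BIO struct plus its buffer, the kind of repeating small-object
           leak seen in the core dump. */
    }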

Steps 3 and 4 are the most arcane to me... I am pretty sure I do _not_ know
enough about memory to look at a hex signature and recognize where in the
address space it points, or even the significance of it.

~~~
spoiler
I am familiar with step 3, but I don't even know where to begin with step 4. I
guess it'll remain a mystery... 4ever.

~~~
to3m
The reasoning is reasonably well explained in the article, I think?

Noting there is an identical pointer in each object, assume this is a pointer
to some kind of type definition struct. What will this have in it? There'll
likely be a type name, some function pointers for standard operations, maybe a
link to any more definitions, that kind of thing. This kind of arrangement is
just something you do in C and C++. It's very common.

So take a look at what's at that address, under the assumption it's just a
block of pointers. And then look at the first pointer (0x401 is obviously not
a pointer). And in this case he got lucky, because it points to a suggestive
string. (It could just as easily have pointed to another type definition, or
something random. Though it's not uncommon for descriptors to have the name in
the first field or two.)

Object type names are usually string literals, so the name is likely to point
into the data segment of an EXE or DLL. (Maybe on Linux you can guess that
from the address as well.) That's what the maps thing is all about: figuring
out where the string might come from as a way of narrowing down the search.
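
Concretely, the pattern being guessed at looks something like the C below
(modeled loosely on OpenSSL's old bio_method_st; the exact field layout here
is a from-memory sketch, not the real definition):

    /* One shared descriptor per type; every instance points at it, which
       is why the same address repeats in every leaked object. */
    struct method_desc {
        int type;                       /* small tag like 0x401: clearly
                                           not a pointer */
        const char *name;               /* -> "memory buffer", a string
                                           literal in the library's
                                           read-only data */
        int (*write)(void *obj, const char *buf, int len);
        int (*read)(void *obj, char *buf, int len);
        int (*create)(void *obj);
        int (*destroy)(void *obj);      /* ... more operations ... */
    };

    struct object {
        const struct method_desc *method;   /* the repeated pointer */
        /* ... per-instance state ... */
    };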

------
gravitronic
THANK YOU. After looking at this from the heap perspective and the valgrind
perspective, I'd exhausted my options for tracking down a memory leak in our
production environment. This is another avenue. AWESOME.

------
pimeys
I noticed the same kind of leak in one of our apps, and this post was worth
more than gold in showing how to hunt it down.

------
rurounijones
Now if there were an online course where you could learn this kind of stuff,
I would be signed up so fast that physicists would be re-evaluating general
relativity.

We need some online courses dedicated to not-beginners :)

------
vskr
Amazing post. How much time did it take to debug and identify the issue?

~~~
spoiler
They said the overall process took 4 to 6 hours.

