
Walmart Node.js Memory Leak - btmills
http://www.joyent.com/blog/walmart-node-js-memory-leak
======
diminoten
I'm actually looking into a segfault issue deep in the bowels of a C++ addon
we have in node.js (anyone in #node.js will have seen me over the past few
weeks ask about it), but what reading this makes me realize is how woefully
underequipped I am to hunt for problems of this nature.

My problem is likely in one of our addons, but this _kind_ of debugging, this
whole genre of problem solving is entirely beyond me. How do I get to this
level? What do I need to learn? To study?

It's just a little depressing to read something like this and see how far the
road ahead goes, despite how far I've already traveled...

~~~
JoachimSchipper
Debugging severe memory corruption or memory leaks is annoying, and can
occasionally take a lot of time, but it's not necessarily _that_ bad. Here are
some pointers that may be helpful.

Tools: valgrind and gdb are obvious. But don't forget your compiler! Crank up
the warnings, and look through LLVM-clang's -fsanitize=<foo> and warning
options. (Also, if you're already on OpenBSD, check out the "S" flag to
malloc; if you're on Solaris, check out, well, the blog post.) Finally,
Boehm's conservative garbage collector has a "find memory leaks" mode, which
_looks_ useful for those cases where you can't get valgrind working. If all
else fails, shovel through the memory dump looking for repeated patterns.

Testing: try to reproduce the problem; the first iteration may look something
like "it runs out of memory after 36 hours". Then simplify: for instance, the
author of the article could have asked "does this still happen if the server
closes the connection immediately, without sending any data" and would have
found the bug very quickly. (Of course, you're likely to ask a lot of wrong
questions before hitting on the right one; experience and a full knowledge of
the system you're working on is useful but not sufficient.) Questions like
"does this happen more quickly if we ping 100 times per second instead of once
every ten minutes" are often useful as well. (Finally, just printing memory
usage every N seconds is helpful.)
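
A minimal sketch of the "print memory usage every N seconds" idea, written in Python for brevity rather than in the language of the article. The names `peak_rss` and `start_memory_logger` are made up for illustration. It assumes a Unix-like system, since the `resource` module is not available on Windows, and `ru_maxrss` is a high-water mark whose unit depends on the platform (kilobytes on Linux, bytes on macOS). A value that keeps climbing over hours is the signal to look for.

```python
import resource
import threading
import time


def peak_rss():
    """Peak resident set size of this process, in platform-dependent units."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def start_memory_logger(interval_seconds, log=print):
    """Log peak RSS every `interval_seconds`; returns an Event that stops it."""
    stop = threading.Event()

    def run():
        # Event.wait() returns False on timeout and True once stop is set,
        # so the loop logs once per interval until someone calls stop.set().
        while not stop.wait(interval_seconds):
            log(f"{time.strftime('%H:%M:%S')} peak_rss={peak_rss()}")

    # Daemon thread: it never keeps the process alive on its own.
    threading.Thread(target=run, daemon=True).start()
    return stop
```

Usage would be something like `stop = start_memory_logger(60)` at startup and `stop.set()` on shutdown. Because the number comes from the operating system, it also covers memory allocated by native code, which a heap profiler inside the language runtime would miss.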

Coding: be careful when writing code. The usual ways of improving code quality
(e.g. code reviews) work to reduce memory leaks, too. Try to run a multiple-
hour soak test every so often during development (preferably on a CI server);
it's a lot easier to debug "hey, we suddenly run out of memory after
yesterday's commits" than "well, something goes wrong in production". If
you're doing new development, consider alternatives to malloc() - arena/pool
allocation (e.g. libtalloc) is convenient and very fast if your memory use is
tree-like (e.g. a connection owns a request, which owns some memory to sort the data
before returning it). In C, goto a single chunk of cleanup-and-return code
rather than duplicating the cleanup at every place where you exit from the
function.
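
For the soak-test suggestion, here is a rough sketch of a leak check that a CI job could run, in Python for brevity. `traced_growth` and its parameters are hypothetical names, and the iteration counts are arbitrary. One caveat: `tracemalloc` only sees allocations made through Python's own allocator, so a leak inside a native addon (the kind described in the article) would not show up here; for that, process-level numbers such as RSS are the thing to watch.

```python
import tracemalloc


def traced_growth(operation, iterations, warmup=100):
    """Return net bytes of traced heap growth across `iterations` calls."""
    tracemalloc.start()
    try:
        # Warm up first so one-time caches and lazy initialisation
        # are not counted as growth.
        for _ in range(warmup):
            operation()
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            operation()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        # Note: this also stops tracing if the caller had started it.
        tracemalloc.stop()
    return after - before
```

A test would then call the request handler (or whatever is suspected) a few thousand times and fail if the growth exceeds some budget. The budget is a judgment call: too tight and the test is flaky, too loose and a slow leak slips through.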

~~~
benihana
> _Tools: valgrind and gdb are obvious._

The fact that someone is saying they don't know where to start seems to
indicate this isn't true.

------
davidw
I looked at node.js for a system I'm involved with creating, but ultimately we
went with Erlang just because it's been around a lot longer and is more stable
in terms of things like this. We're working on a semi-embedded system that
will not always be on-line or accessible for debugging. We also considered Go,
which probably would have been more familiar to C++ guys, but it was also
deemed a bit immature even if it seems like a very pleasant language to work
with.

Cool writeup though!

~~~
andreypopp
2 or 3 years ago I hit a memory leak in Erlang's stdlib's httpc... just saying.

~~~
rcb
This is impressive work by the Joyent team!

I've seen two sources of memory leaks in Erlang based systems: 1) unbounded
process message queues, and 2) passing binaries across process (pid)
boundaries.

Many beginning erlangers run into these, and they're relatively easy to
identify and correct. With a little practice, these become easy patterns to
recognize and avoid.

As far as httpc, I'm unaware of that bug -- but I can say that I recently
worked on a commercial product that leveraged httpc as a core component of the
service, and it worked fine.

~~~
andreypopp
> As far as httpc, I'm unaware of that bug -- but I can say that I recently
> worked on a commercial product that leveraged httpc as a core component of
> the service, and it worked fine.

It was fixed soon after it was discovered.

------
ambirex
Thank you, I really enjoy detailed write-ups like this. It is fascinating to
see how an engineer approaches an elusive problem.

------
jzwinck
I'd like to read more about how we can prevent this class of error going
forward. Could stronger typing or RAII or some other feature or trick have
made the bug apparent at compile time?

I made a very basic Node.js module in C++ with V8 and it was surprisingly
difficult to make a good (idiomatic JS behaviour, believably bug-free) wrapper
for a straightforward class and factory method. I say this coming from Boost
Python and Luabind, where there are some tricky parts to bind complex classes,
but simple ones are easy enough, and once written, obviously correct.

------
city41
I've been running an extremely simple Node application on 0.10.18 for a while
now and it has a very gradual memory leak. My code is just a few dozen lines,
and it all seems pretty innocent. I am also using Hapi, so I thought maybe
Hapi has a leak in it somewhere. Now I wonder if I have the same leak as
Walmart here. I just now upgraded to 0.10.22 and am curious to see where I end
up. If the leak goes away then hot damn, I got lucky :)

------
ryanseys
And a one-line fix. Damn that must be satisfying.

~~~
yen223
Reminds me of that old joke:

The office photocopier broke down, so the manager called in a repairman. The
repairman takes one look at the machine, draws an 'X' at the problem part, and
hands the manager a bill for $500. The manager was shocked at the price, and
demanded an itemized bill. The repairman simply wrote:

    
    
        Marking the 'X'              -   $1
        Knowing where to put the 'X' - $499

~~~
lstamour
I started Googling the Picasso "principle" about it being a lifetime to know
how to do it, but it turned into Googling this one instead. Found a snippet,
"Karl Steinmetz (German-born, U.S citizen), the well known electrical engineer
who worked out many details of a.c. theory and was responsible largely for the
adoption of a.c. for commercial use, was once called in by the General
Electric Company to examine a poorly performing transformer. After a few
minutes, Steinmetz marked an x on the transformer core and said, “It will work
if you take off the turns from this x to the end.” The prescription worked
well, and Steinmetz later sent G.E. a bill for his service of $10,000. The
company official thought the bill excessive and asked for the itemization.
Steinmetz then sent them a more detailed bill: For putting x on transformer
core : $1; for knowing where to put the x: $9999." It's funny that in today's
world, both Picasso and Steinmetz take "minutes" to do this, but in perhaps
earlier tellings, it took hours for Picasso to do his work and days for
Steinmetz:
[http://edisontechcenter.org/CharlesProteusSteinmetz.html](http://edisontechcenter.org/CharlesProteusSteinmetz.html)

~~~
zb
Never happened.

[http://www.snopes.com/business/genius/where.asp](http://www.snopes.com/business/genius/where.asp)

That page does suggest a possible origin for the (equally apocryphal) Picasso
fable, though, in a quote from James McNeill Whistler.

~~~
gruseom
Snopes provides no citation for the Whistler story. That prompted me to look
it up:

[http://en.wikipedia.org/wiki/James_Abbott_McNeill_Whistler#R...](http://en.wikipedia.org/wiki/James_Abbott_McNeill_Whistler#Ruskin_trial)

I had no idea that this quote has such a delightful and well-documented
origin. I'd only heard the story told about Picasso (and various mechanics and
engineers). A great example of how these things morph over time.

The story is delightful because it pitted two great Victorian aesthetes
against one another. Ruskin had said this about Whistler:

    
    
      I have seen, and heard, much of Cockney impudence before now; 
      but never expected to hear a coxcomb ask two hundred guineas 
      for flinging a pot of paint in the public's face.
    

So Whistler sued for defamation and was examined by Ruskin's lawyer:

    
    
      Holker: Did it take you much time to paint the Nocturne in Black and Gold? 
              How soon did you knock it off?
      Whistler: Oh, I 'knock one off' possibly in a couple of days – one day 
                to do the work and another to finish it.
      Holker: The labour of two days is that for which you ask two hundred guineas?
      Whistler: No, I ask it for the knowledge I have gained in the work of a lifetime.
    

The insinuation in the lawyer's question ("how soon did you knock it off?") is
hilarious!

Whistler, by the way, was a great wit and had a famous skirmish with Oscar
Wilde:

[http://quoteinvestigator.com/2013/09/05/oscar-will/](http://quoteinvestigator.com/2013/09/05/oscar-will/)

... which inspired this Monty Python classic:

[http://www.youtube.com/watch?v=UxXW6tfl2Y0](http://www.youtube.com/watch?v=UxXW6tfl2Y0)

------
charlieflowers
FYI, a typo -- "illusive" -> "elusive". (haven't read further yet, just wanted
to let you know).

------
aaronbrethorst
Wonderful blog post; major props for the engineering time expenditure. But,
why do you have an Olark chat widget that says "Contact Sales"? I don't want
to have anything to do with those schlubs! If anything, I want to talk to
serious engineers like you!

Perhaps a better call to action would be:

* Talk to us about how we can solve your problems

* Chat with us

* We can help you too

* What's up?

------
rcthompson
Ironically, this page hangs Chrome indefinitely when I try to load it. Luckily
it only hangs the tab so I can still close it. I guess I'll fire up Firefox to
see if I can actually read the article.

Edit: Actually, it loads fine in a private browsing tab, so it must be a bad
interaction with some extension. Oh well.

~~~
dfc
I am curious why you find this ironic? What is your definition of irony?

~~~
pritambaral
Chrome uses V8. Chrome is the primary user of V8.

Not supporting OP's definition of irony, whatever it is, just speculating how
OP could've thought of it.

~~~
dfc
I have a strange hang-up/interest in people's concept of irony. I am not sure
Alanis Morissette would be able to find any irony in this situation.

~~~
kbenson
I find the interest in how irony is identified and (mis)used much more
interesting than the actual use or misuse itself.

I guess you could say I have a hang-up about your hang-up, or an interest in
your interest. ;)

~~~
dfc
"Ironically" is a word that you see and hear frequently in everyday English. I
think it's interesting how varied people's concept of irony is for such a
common word. From what I can tell most people's definition is something
between "serendipity's evil twin" and "partially related." It seems in this
case OP thinks the definition is the latter. I think that sooner or later
"ironically" is going to have the same fate as "randomly" ("It is so random
we ran into you, we were just talking about you."), which I think has zero
meaning in conversational English.

To be honest I think most people's definition of irony is largely shaped by
Alanis Morissette's terribly misinformed but catchy song and the hipster d-bag
that says he has an "ironic mustache."

I think it is the evolution of language that is interesting. It seems like a
case where the more a word is used, the less precise its definition becomes,
until it carries no meaning. There has to be some linguistic jargon for this
type of situation.

~~~
kbenson
Before I posted that I googled "define irony" and was surprised to see the
first definition as something that sounded like sarcasm. The definition
appears to already be changing, making hipsters retroactively correct, which
is just another reason we can all hate them. ;)

------
patrickg_zill
That is pretty impressive - I love how they could use DTrace to scope out what
was going on.

------
retr0h
I've always loved the debugging tools in Solaris (SmartOS or whatever now).

------
batbomb
Can anyone tell me if there is reason for this in bash?

    
    
         DEST=~~/public/walmart.graphs

~~~
stewars
Not bash. '~~' gets replaced by the MANTA_USER environment variable by the
manta command line tool mput.

------
atomical
I assume that they can restart the server at intervals or use load balancing.
A few months of developer time for something like this seems excessive unless
he was working on something else as well.

~~~
spyc3r
As a former software engineer at Walmart I can tell you that a few months for
something like that is nothing to them. They employ several thousand devs at
the home office. Having one of them focus on a bug like this isn't an issue in
terms of time or money. In their minds it's worth it given the scale of the
enterprise.

------
ilaksh
I think there are still quite a few C and C++ programmers out there. To me
this is a great example of why it is better software engineering to write a
server in something like Node.js. Because rather than having a million code
bases with potential memory leaks like this one, there is just the Node code.
In ordinary JavaScript code it's impossible to cause a problem like that.

~~~
sbov
It is fairly easy to create a long running server in a GC'd language that will
continually consume more memory. Some don't like to call it a memory leak,
which is why I put it the way I did, but the effect is the same.

At the end of the day, the more that you think this is impossible the more
likely your programs will experience it. So please don't think that your
program is immune to this because you use Javascript.

~~~
tantalor
Good example might be a server process which never releases memory, so the
longer it runs the more memory it "consumes". That is, the maximum memory
required to handle any previous request.

This might be a well known solved problem, but I have heard it mentioned
before.

~~~
tehwalrus
I have an apache box that runs a bunch of PHP and flat HTML sites. I have to
set it to only use 10 processes, and to kill them every half hour, because
they all gradually swell up to 35MB each (which I imagine is where they've
loaded pretty much all the PHP on my server, independently of each other).

Without the number limit, or the kill policy, the server runs out of RAM and
crashes. (it's only a cheap one, with 512MB RAM.) Luckily it's a very low
traffic set of sites, so these limits don't break the experience. I'm glad I
didn't have to solve this problem any deeper!

~~~
driverdan
Have you enabled the GC? Some default configs disable the PHP GC because it's
SOP to run temp startups (eg mod_php) or restart them on a regular basis (like
you).

~~~
tehwalrus
ooh, no, thanks for the tip :)

I haven't had to log in for a while (about a year), so I've been happy to
leave it as it is. When I migrate over to DigitalOcean (which I've been
intending to do for ages now) I'll look into that instead!

------
joeblau
Excellent details on the sleuthing that went on to find this error. I think
it's great that there are great tools available to debug errors like this and
your write up helps me in learning more about how to go about properly
debugging my Node apps.

------
jnazario
cool writeup. while not a node.js user, i love these sorts of tours of system
internals - i always learn a lot, both specific tools and also processes of
using them.

thanks for the details, very articulate and useful stuff.

------
jokoon
we know that node.js is a bad piece of software, you don't need to remind us
about it all the time

(down vote me)

