
Tracking Down a Python Memory Leak - bbernard
https://benbernardblog.com/tracking-down-a-freaky-python-memory-leak/
======
guyzero
"What's possible, though, is accumulating Python objects in memory and keeping
strong references to them. For instance, this happens when we build a cache
(for example, a dict) that we never clear. The cache will hold references to
every single item and the GC will never be able to destroy them, that is,
unless the cache goes out of scope."

Back when I worked with a Java memory profiling tool (JProbe!) we called these
"lingerers". Not leaks, but the behaviour was similar.
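
A minimal sketch of such a "lingerer", assuming CPython; `Item`, `cache` and `load` are hypothetical names made up for the illustration, not anything from the article:

```python
import gc
import weakref


class Item:
    """Hypothetical payload object; the name is illustrative only."""

    def __init__(self, key):
        self.key = key


cache = {}  # never cleared, so every stored Item stays strongly referenced


def load(key):
    item = Item(key)
    cache[key] = item
    return item


# A weak reference lets us observe the Item without keeping it alive.
probe = weakref.ref(load("a"))
gc.collect()
still_alive = probe() is not None  # the cache still holds the Item

cache.clear()  # drop the strong references
gc.collect()
freed = probe() is None  # nothing refers to the Item any more
```

The object is reachable the whole time, so the collector is behaving correctly; the memory only comes back once the cache lets go.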

~~~
stinos
_The cache will hold references to every single item and the GC will never be
able to destroy them, that is, unless the cache goes out of scope_

Had the reverse happen a while ago and it's nasty as well: some C++ objects
were holding references to Python objects, but because the GC doesn't scan
those memory regions (they're not Python-owned after all), the Python objects
would get GC'd and all hell broke loose when the C++ code tried to access the
now-dead Python objects. The solution is forced 'lingering', i.e. applying some
RAII: adding the Python objects to a global dict and removing them when the C++
objects go out of scope.

~~~
winstonewert
Ugh... No. That's not how Python's GC works.

Python is reference counted. This was happening because you didn't increment
the reference count for python objects you were referencing from C++. Your
RAII should increment/decrement reference counts on the object, not place
objects into a global dict. That's the "correct" way to reference python
objects.

Python's GC doesn't scan memory in the same way other languages do. Instead,
it detects cycles between python objects. As long as you follow the reference
counting rules correctly, you shouldn't have to worry about it. (Unless you
need to detect cycles involving your C++ objects.)
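
A rough sketch of both mechanisms from the Python side, assuming CPython; `Node` is a hypothetical class used only for this illustration:

```python
import gc
import sys
import weakref


class Node:
    """Hypothetical class used only for this illustration."""

    def __init__(self):
        self.other = None


# Reference counting: a new strong reference raises the count.
n = Node()
before = sys.getrefcount(n)
alias = n
after = sys.getrefcount(n)  # higher than `before` on CPython
del alias

# A reference cycle: the counts never reach zero on their own.
gc.disable()  # keep the collector from running in the middle of the demo
a, b = Node(), Node()
a.other, b.other = b, a
probe = weakref.ref(a)
del a, b
alive_before_collect = probe() is not None  # the cycle keeps both alive

gc.collect()  # the cycle detector finds and frees them
alive_after_collect = probe() is not None
gc.enable()
```

In an extension, the equivalent of `alias = n` is an explicit `Py_INCREF`, and the `del` is a `Py_DECREF`.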

~~~
stinos
_Ugh... No. That 's not how Python's GC works._

You assume CPython, or some other reference counted implementation though. I'm
talking about MicroPython which uses mark&sweep. And which I obviously should
have mentioned but I guess I was tired and I didn't.

------
tantalor
Spoiler alert, the leak is in libxml2, not Python code.

~~~
Animats
This sort of thing is why calling C from Python is a marginal idea. Debugging
such C is tough, especially if it manipulates Python objects in complex ways.
There are many invariants of the Python data structures which must be
carefully maintained by hand in C code. Getting one wrong will result in
obscure bugs like the parent.

"Pickle", Python's old serializer, had a similar leak problem. It has a cache.
You can reuse a Pickle stream, which is done in interprocess communication.
But the cache of previously-sent objects wasn't being cleared at the end of
each Pickle block. I found that, did a workaround, and submitted a bug report.
Not sure it was ever fixed properly.
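
For illustration, here is how that cache (the pickler "memo") shows up in current Python 3 when one `Pickler` is reused, and how `clear_memo()` resets it. This is a sketch of the general behaviour, not a reproduction of the original bug; `payload` and `dump_size` are made-up names:

```python
import io
import pickle

payload = list(range(1000))

buf = io.BytesIO()
pickler = pickle.Pickler(buf)


def dump_size(obj):
    """Pickle obj into the shared stream and return the bytes written."""
    start = buf.tell()
    pickler.dump(obj)
    return buf.tell() - start


first = dump_size(payload)   # full serialization
second = dump_size(payload)  # memo hit: only a short back-reference is written
pickler.clear_memo()         # forget the previously sent objects
third = dump_size(payload)   # full serialization again
```

As long as the memo is not cleared, the pickler remembers every object it has sent, so a long-lived stream keeps growing unless `clear_memo()` is called between messages.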

I still have a bug report in on CPickle, a "faster" implementation of Pickle
written in C.[1] In a complicated situation with multiple threads using
CPickle, memory becomes corrupted and the program will crash. It doesn't
happen using Python Pickle, so I just quit using CPickle. The bug report got
the usual "reproduce in a simpler situation" reply to make it go away, and the
bug report remains open. It may be the same bug as this one [2] from 2012,
although I doubt it.

For parsing HTML, I use html5lib. It's slower, but it's all in Python.

[1] [http://bugs.python.org/issue23655](http://bugs.python.org/issue23655)
[2] [http://bugs.python.org/issue12680](http://bugs.python.org/issue12680)

~~~
bbernard
I've had quite a few issues with cPickle myself, so I see what you mean.

Indeed, Python packages built over C extensions can be quite hard to debug, as
seen with lxml. But what makes it even harder is the fact that lxml is partly
built using Cython... so you deal with Python code, C code generated with
Cython, and pure C code (libxml2).

~~~
dom0
While Cython - in my experience - doesn't have many bugs, it still has some,
and can, at times, generate glaringly wrong C code (iirc one simple example:
slice assignment to a uint8_t* tries to call a Python runtime function on the
uint8_t* as a PyObject*).

~~~
bbernard
Good to know! My only experience so far with Cython comes from lxml. It's a
weird language that seems to have a lot of corner cases, though. Just like
C++/CLI.

~~~
dom0
Yes. Yes it does. Also, because it works on _two_ type systems it has many
cases where you'll want to take a look at the generated C code to verify that
the "cheap path" was taken and no intermediary Python objects are constructed
or Py operators are used -- if performance matters, that is.

On the other hand it is a radically simpler way to write bindings that also
contain logic, or to write rather fast code without straying too far from
Python. Plus, it can cythonize almost all code, even very dynamic code with
closures, which will still often improve performance on its own (no
interpreter, but still Python runtime for every op). And that is then a nice
base to do further optimizations.

------
mwcampbell
The JVM community tends to prefer pure Java implementations of everything,
rather than wrapping existing C libraries as Python and Ruby do. Some may see this
as a bad thing, but it definitely has its benefits. One particularly relevant
benefit in the context of this article is that the amount of code that can
leak memory, in the conventional sense, is dramatically reduced. I suppose the
same thing is happening in the Node.js ecosystem, though I don't recall if
Node uses native code to parse XML.

~~~
greglindahl
If you don't mind potentially slow code, that's a fine thing. Once you've
measured and discovered that you're losing out on a lot of performance, it's
worth evaluating whether the risk of leaks can be baked away via careful
testing, which doesn't appear to have been done at all in the library used in
this blog post.

~~~
brianwawok
Well, pure Java code is 10x to 100x faster than pure Python code. So you aren't
exactly accepting slowness in that case.

~~~
greglindahl
If you say so. That's not what I've observed, especially if you're talking
about code where Python is the glue and all the heavy lifting is done in C or
C++.

~~~
brianwawok
I am comparing Java to Python. Not Java to Python code wrapping C++ code.
Because in that case, we would be really comparing Java to C++ code. Which I
am happy to do, but not what I did.

~~~
greglindahl
This whole thread was about memory leaks caused by not sticking to a single
language.

------
module0000
tldr; libxml2's C implementation leaked memory, author tracked it down. Kudos
to the author for their persistence in digging down to the root of the
problem. A lot of people would throw their hands up and decide to recycle the
process every <N> seconds rather than analyze it to the depth the author did.

~~~
bbernard
I'm the author of the post, so thanks a lot for your kind remarks.

Now, the problem appears to be in libxml2, but... it's only partly true. I
assure you that the best is yet to come :)

------
gravypod
> "But if we're strictly speaking about Python objects within pure Python
> code, then no, memory leaks are not possible - at least not in the
> traditional sense of the term. The reason is that Python has its own garbage
> collector (GC), so it should take care of cleaning up unused objects."

I have a hard time believing this. Java can have memory leaks so why couldn't
Python?

~~~
zipfle
I think that the author is defining memory leaks as permanently out of scope
but not deallocated memory. In that sense I don't know of anything in vanilla
Python, or Java, that would qualify as a memory leak. In the more intuitive
sense of a memory leak being any failure to make objects available to garbage
collection, (such as by retaining references to them in an unexpected place)
leading to unchecked increases in a program's memory footprint, memory leaks
are possible in either language.

------
dekhn
I've used the gc module, with get_referrers and get_referents, to track down
various leaks. This only really helps with Python-allocated objects.

It's trivial to end up with an unexpected strong reference. Weak references
are the right way to deal with cache objects, imho.
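
A small sketch of both ideas, assuming CPython; `Session`, `registry` and `weak_cache` are hypothetical names for the illustration:

```python
import gc
import weakref


class Session:
    """Hypothetical cached object; the name is illustrative only."""


registry = {}  # stands in for a forgotten strong reference
s = Session()
registry["current"] = s

# get_referrers lists the containers that point at the object.
holders = gc.get_referrers(s)
found_registry = any(h is registry for h in holders)

# get_referents goes the other way: what the container points at.
points_at_session = any(r is s for r in gc.get_referents(registry))

# A weak-valued cache does not keep its entries alive.
weak_cache = weakref.WeakValueDictionary()
weak_cache["current"] = s
del registry["current"]
del s
gc.collect()
entry_gone = "current" not in weak_cache
```

In a real hunt the list returned by `get_referrers` is usually noisy (module globals, frames, temporary lists), so expect to filter it by type or identity.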

~~~
dom0
> Weak references are the right way to deal with cache objects, imho.

Yet, I disagree ;) Whether a weakref is the correct thing to use or not
depends _entirely_ on the purpose of the cache. I often find myself using
caches where weakref would not be very useful, because it would cool the cache
a lot.
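
To make the "cooling" concrete, a sketch comparing a weak-valued cache with a bounded strong cache, assuming CPython; `Result`, `weak_lookup` and `lru_lookup` are made-up names, and `functools.lru_cache` is just one possible bounded alternative:

```python
import functools
import gc
import weakref


class Result:
    """Hypothetical expensive-to-build value; counts how often it is built."""

    built = 0

    def __init__(self, key):
        self.key = key
        Result.built += 1


weak_cache = weakref.WeakValueDictionary()


def weak_lookup(key):
    value = weak_cache.get(key)
    if value is None:
        value = Result(key)
        weak_cache[key] = value
    return value


@functools.lru_cache(maxsize=128)  # strong references, but bounded in number
def lru_lookup(key):
    return Result(key)


# The caller does not keep the result, so the weak entry vanishes at once.
weak_lookup("k")
gc.collect()
weak_lookup("k")
weak_builds = Result.built  # built twice: the cache went cold in between

Result.built = 0
lru_lookup("k")
lru_lookup("k")
lru_builds = Result.built  # built once: the second call is a cache hit
```

A weak cache only helps while some other part of the program still holds the value; if the point is to avoid recomputation, a size-bounded strong cache fits better.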

------
partycoder
Reminds me of myself tracing a memory leak in a node app by loading a core dump
into an illumos VM with mdb_v8. Not so simple/friendly/happy after all.

(You could argue that I could have generated a heap snapshot with v8-profiler,
but I was pressed for time).

~~~
bcantrill
Would be curious for detail on your experiences; we do this a lot (we
developed mdb_v8) and we've continued to extend/develop mdb_v8 to make it
easier -- but trying to debug node memory growth is not something I would
ever characterize as simple, friendly or happy (despite our best efforts).

