Hacker News
Memory Management in Python (rushter.com)
265 points by blacknessninja 79 days ago | 33 comments

I’ve found it’s easy to get much better performance in CPython by simply replacing their malloc with my own. Mine allocates a large block of memory on the stack, then pulls from it. Won’t work for large programs, but for smaller scripts it’s a good amount faster. I think this is because CPython makes something like 1000 malloc calls just to start up, so if those can be made ~instant you get a lot of benefit.

We did this in gamedev all the time with arena allocators[1].

Generally per-level or at some other temporal division. That, in-place loading and fixed-size pools for things like effects covered 90% of the allocation tricks we did to get huge performance gains.

[1] https://developers.google.com/protocol-buffers/docs/referenc...

The repo: https://github.com/JacksonKearl/cpython/blob/master/README.r...

My allocator isn’t included as it is part of a class project, but you can make your own malloc pretty easily.

Alternatively you can just do some LD_PRELOAD magic, but I don’t know much about that (and this was needed to match the class spec).

> My allocator isn’t included as it is part of a class project, but you can make your own malloc pretty easily.

Was this a class that you took, or one that you TA or tutored?

I was a TA.

> I think this is because CPython makes something like 1000 malloc calls just to start up, so if those can be made ~instant you get a lot of benefit.

This suggests that Python's pool allocator is really bad at handling start up procedures, doesn't it?

It intentionally over-allocates, but not by enough to handle the warm-up procedure.

Yes, CPython has poor startup performance due in large part to the number of allocations it makes, and one can achieve at least 60% reductions in startup time by being smarter about those allocations.

This has had me thinking about the Python/malloc story for a little bit.

It does make sense CPython is tuned more towards runtime than startup time. One is a continual cost, one is paid once and doesn't matter for a large variety of workloads.

But if CPython really is making 1000+ allocations during startup, then it is probably going to fragment badly, which slows everything down, now and into the future as well...

Seems that's not a crazy idea. [0] Zapier linked CPython to jemalloc, which tends to fragment less, and saw a 40% reduction in memory usage.

I guess if you took the time to tune jemalloc to your workload you should be able to see significant performance improvements at startup as well as in the hot path.

[0] https://zapier.com/engineering/celery-python-jemalloc/

With it being a very popular python implementation, one wonders why such optimizations haven't been implemented yet. 60% sounds huge.

Probably because overall Python startup time is still dominated by site.py/pkg_resources spidering the filesystem and importing half the world; as long as that exists for most users, it's hard to be motivated to make speedups anywhere else.

cf. https://github.com/pypa/setuptools/issues/510

The problem is so bad that it's one of the most-quoted reasons that hg is considering a gradual rewrite in Rust:


Hilariously, my pentadactyl shortcut for the oxidation plan was HG. I took that as a sign to read the article. Has hg ever considered a long-running "daemon-like" process instead? It could be instantiated automatically by the first call to hg and then it can die after it hasn't been used for a while. It would probably be racy and hard to support on multiple OS's though...

They do kind of have this: https://www.mercurial-scm.org/wiki/CommandServer

Though it looks like it's more about supplying a way to embed hg in a proprietary (or non-Python) application via subprocess communication rather than directly importing the GPL modules or having to string-parse responses. There doesn't seem to be provision for interacting with the server directly from the command line, which is odd, since this pattern is well established from Java land (bazel, for example).

The pkg_resources problem is specific to programs that use entry points (and it can be avoided), not a general CPython issue, and AFAICT is not related to Mercurial's plan to include Rust.

Interesting; I hadn't realised this, thanks for the clarification.

I'm aware of yapsy as a lightweight plugin system, but entry_points really is the gold standard in terms of Python extensibility. It's a shame it can't be made more performant.

Thanks for this, it makes for an interesting read, especially for someone who only dips his toe in python programming lightly once in a while.

Confused, how do you replace malloc with memory pulled from the stack? Wouldn't it get trampled?

Reserve a fixed amount and pull from it. That’s why I say it won’t work for larger programs. Though, you could resort to normal malloc after your stack memory runs out. This would get you the improved startup performance and runtime performance for smaller programs (those that use less memory than you initially reserve), without sacrificing anything on larger programs.

Edit: to be clear, I’m pretty sure that initial block could be reserved on the heap. The benefit comes from the large initial allocation, not from the memory being on the stack.
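To make "reserve a fixed amount and pull from it" concrete, here is a toy sketch in Python rather than C. It is not the class-project allocator mentioned above (that one isn't public); the `BumpArena` name, the 8-byte alignment and the `bytearray` fallback are all assumptions made for illustration.

```python
class BumpArena:
    """Toy bump allocator: reserve one block up front, hand out slices of it."""

    def __init__(self, capacity: int, alignment: int = 8) -> None:
        self._buffer = bytearray(capacity)   # the single up-front reservation
        self._view = memoryview(self._buffer)
        self._offset = 0                     # next free byte in the block
        self._alignment = alignment          # assumed alignment, not from the thread
        self.fallback_count = 0              # requests that did not fit

    def alloc(self, size: int) -> memoryview:
        # Round the current offset up to the next alignment boundary.
        start = -(-self._offset // self._alignment) * self._alignment
        end = start + size
        if end > len(self._buffer):
            # Block exhausted: fall back to an ordinary allocation.
            self.fallback_count += 1
            return memoryview(bytearray(size))
        self._offset = end
        return self._view[start:end]

    def used(self) -> int:
        return self._offset


arena = BumpArena(capacity=64)
a = arena.alloc(10)    # served from the start of the reserved block
b = arena.alloc(10)    # served from the next aligned offset
c = arena.alloc(100)   # does not fit in what is left, served by the fallback
```

Nothing is ever freed individually here, which is the whole trick: allocation is just an offset bump, and the block goes away in one piece.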

Do you mean you're compiling the interpreter from scratch with C/C++ code that performs this initial allocation? It's really confusing trying to decipher what you're doing, since it doesn't sound like something that you could do to an existing CPython interpreter without corrupting memory...

Yep! Building from the C source. If you check out the demo I linked in another comment, it’s just a couple additional files in the CPython repo.

Why does Python use its own allocation abstractions when the C allocator is likely doing the exact same thing underneath it? Or does it only employ these when it's directly getting pages of memory from the OS?

Author of the article here.

I don't think system allocators are clever enough to process and allocate 100-500k very small objects each minute when Python is performing something very intensive.

It's a pretty standard way to speed up allocation for dynamic languages. Game developers use similar techniques as well.

I have some stats on Python's allocator:


Can confirm for perl. We are doing the very same. It's a huge win.


We never free empty pools. Our arenas are just single linked lists, no need for the prev.


For a statically compiled perl the biggest win is to avoid arena allocation (mmap) at all. Data and code are made static. That's around 10-20% of the runtime (for short-running programs). Also we rarely free at the end. The OS does it much better than free(). Only the mandatory DESTROY calls and FileIO finalizers are executed.

Can Python actually return memory pages back to the OS, e.g. by sbrk() with negative argument?

I'm currently having a problem with this, where I load a large deep learning model into "CPU" memory then move it to the GPU, but I can't get rid of the memory reserved by the process.

I can't answer your exact question, but any large allocations/deallocations should be handled by mmap under the hood and in those cases the memory should be returned.

In your case you should first consider the possibility that there is a pointer to your model's objects that is for some reason not being released. It might simply be that even though you are moving your model to the GPU and maybe removing any of your own references, there are internal references to your model's data that are hidden from you. At least something to consider.

edit: To add to this, I'm now quite sure (though I could be wrong!), that whether python does or does not use sbrk with a negative value is beyond the scope of python. Python is making use of malloc/free under the hood:


There's some flexibility for wrapping free in different ways in that file, but it seems like it'll basically always be using free at the core. At least on my system in a debugger I just verified that. So if it's true that python by default uses malloc/free, then the question of whether sbrk with a negative number comes into play is more a question of how your libc implements malloc/free.

Of course I might be wrong, but I think that you should probably stop worrying about it at that level and instead look into object references first as I detailed above.
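One way to check for a lingering reference is a weak reference plus `gc.get_referrers`. This is only a sketch: `Model` and `hidden_cache` are hypothetical stand-ins for a real model and for whatever a framework might keep internally.

```python
import gc
import weakref


class Model:
    """Stand-in for a large object such as a loaded model (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name


# A cache that quietly keeps a second reference, as a framework might.
hidden_cache = {}

model = Model("demo")
hidden_cache["last"] = model
probe = weakref.ref(model)   # a weak reference does not keep the object alive

del model
gc.collect()
still_alive = probe() is not None   # the cache is still holding the object

# Ask the garbage collector which containers refer to the object.
holders = [r for r in gc.get_referrers(probe()) if r is hidden_cache]

hidden_cache.clear()
gc.collect()
freed = probe() is None   # no strong references remain
```

If `probe()` still returns the object after you think you've dropped everything, the list from `gc.get_referrers` is where to start looking.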

Yes, it can, by calling the free function.

Which framework do you use for deep learning? It can allocate some objects on its own.

Can you give me some stats from while the model is in use and from after it's no longer in use and no longer accessible? You can get them by calling the sys._debugmallocstats() function.
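For anyone who wants to try this, a rough sketch of taking the snapshots. `sys._debugmallocstats()` is a CPython implementation detail that writes to stderr, so the helper guards for its absence; the `dump_allocator_stats` name and the list of 100,000 objects are just illustrative stand-ins for a real workload.

```python
import sys


def dump_allocator_stats(label: str) -> bool:
    """Print small-object allocator statistics to stderr.

    Returns False when the interpreter does not provide the hook.
    """
    dump = getattr(sys, "_debugmallocstats", None)
    if dump is None:
        return False
    print(f"--- {label} ---", file=sys.stderr)
    sys.stderr.flush()
    dump()
    return True


available = dump_allocator_stats("before")
data = [object() for _ in range(100_000)]   # many small allocations
dump_allocator_stats("while holding objects")
del data
dump_allocator_stats("after release")
```

Comparing the arena and pool counts between the second and third dump shows how much of the memory the allocator was able to give back.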

The short and simple answer is because it is substantially faster (and has far less memory overhead). It is easy enough to modify the CPython source to try this.

The article helps cover why it is faster: some of the tricks for optimizing small object allocation and reuse that CPython performs are not done by the C allocator. As to why THAT is, I can only appeal to your intuition that not all applications allocate memory in the same way, so an allocator optimized for Python's behavior has an advantage.

There are a couple of ways of interpreting this question, but to answer the literal question of why not call malloc/free directly: See https://docs.python.org/3/c-api/memory.html - it supports custom user-supplied memory allocators, so calling malloc() and free() directly would break this.
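Installing an allocator through that API has to be done from C, but you can see the same indirection from the outside: CPython 3.6+ reads the `PYTHONMALLOC` environment variable at startup to choose which allocator sits behind it. A sketch, assuming a standard CPython build (the `run_with_allocator` helper is made up for this example):

```python
import os
import subprocess
import sys


def run_with_allocator(allocator: str, code: str) -> str:
    """Run `code` in a child interpreter with PYTHONMALLOC set."""
    env = dict(os.environ, PYTHONMALLOC=allocator)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# "malloc" sends every request to the C library allocator;
# "pymalloc" selects CPython's own small-object allocator.
out_malloc = run_with_allocator("malloc", "print(sum(range(10)))")
out_pymalloc = run_with_allocator("pymalloc", "print(sum(range(10)))")
```

Wrapping the two calls in a timer is a cheap way to compare the allocators on your own script without rebuilding the interpreter.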

I believe this article is talking about small object allocation (which presumably happens quite frequently in a language like Python).

> To reduce overhead for small objects (less than 512 bytes) Python sub-allocates big blocks of memory. Larger objects are routed to standard C allocator. Small object allocator uses three levels of abstraction — arena, pool, and block.
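A small sketch of the routing the quote describes. The 512-byte threshold comes from the quote; the 8-byte size-class step and counting a request of exactly 512 bytes as small are assumptions of this sketch, and the function names are invented.

```python
SMALL_REQUEST_THRESHOLD = 512   # from the quoted article
ALIGNMENT = 8                   # assumption: block sizes step in 8-byte units


def route(request_size: int) -> str:
    """Say which allocator would serve a request of this size."""
    # Assumption: a request of exactly 512 bytes still counts as small.
    if 0 < request_size <= SMALL_REQUEST_THRESHOLD:
        return "small-object allocator"
    return "system allocator"


def block_size(request_size: int) -> int:
    """Round a small request up to its size class (a multiple of ALIGNMENT)."""
    if route(request_size) != "small-object allocator":
        raise ValueError("only small requests have a size class")
    return -(-request_size // ALIGNMENT) * ALIGNMENT


small = route(24)         # goes to the arena/pool/block machinery
large = route(4096)       # handed to the standard C allocator
rounded = block_size(25)  # rounded up to the next size class
```

Because every block in a pool has the same rounded size, a freed block can be reused for the next request of that class without any searching.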

But most system allocators have affordances for small allocations, using techniques such as binning. Is that not good enough for Python?

I don't know if there's a concise summary of the decisions but you can follow the trail of discussions and benchmarks in these links: https://bugs.python.org/issue26249 https://mail.python.org/pipermail/python-dev/2016-February/1...

One explanation could be that the system allocator needs to be thread safe, and this burns time on locking. Python is single threaded, so perhaps its custom allocator is lock free.

> Python is single threaded

This is inaccurate; especially when considering the thread safety of memory allocation.

Memory management was my favorite topic to study during my time as an undergrad. I considered going back to school to get my MS in CS solely because I loved learning about this topic so much. Fascinating to see how it works under the hood in Python. Bravo!
