Generally per-level or at some other temporal division. That, plus in-place loading and fixed-size pools for things like effects, covered 90% of the allocation tricks we used to get huge performance gains.
My allocator isn’t included as it is part of a class project, but you can make your own malloc pretty easily.
Alternatively you can just do some LD_PRELOAD magic, but I don't know much about that (and writing our own was needed to match the class spec).
Was this a class that you took, or one that you TA or tutored?
This suggests that Python's pool allocator is really bad at handling startup procedures, doesn't it?
It intentionally over-allocates, but not by enough to handle the warm-up procedure.
It does make sense CPython is tuned more towards runtime than startup time. One is a continual cost, one is paid once and doesn't matter for a large variety of workloads.
But if CPython really is making 1000+ allocations during startup, then it is probably going to fragment badly, which slows everything down, now and into the future as well...
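A rough way to check this yourself: on CPython 3.4+, `-X tracemalloc` enables allocation tracing from interpreter startup, so a fresh interpreter can report how much memory it allocated just to reach the first line of user code (exact numbers will vary by version and platform):

```python
import subprocess
import sys

# Launch a fresh interpreter with tracemalloc enabled from startup and
# print how many traced bytes are still live when user code first runs.
code = (
    "import tracemalloc;"
    "current, peak = tracemalloc.get_traced_memory();"
    "print(current, peak)"
)
out = subprocess.run(
    [sys.executable, "-X", "tracemalloc", "-c", code],
    capture_output=True, text=True, check=True,
).stdout.split()
current, peak = map(int, out)
print(f"live startup allocations: {current} bytes (peak {peak})")
```

This only counts allocations made through Python's traced allocators, so it understates the true total, but it gives a feel for how much churn happens before any user code runs.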
Seems that's not a crazy idea. Zapier linked CPython against jemalloc, which tends to fragment less, and saw 40% memory gains.
I guess if you took the time to tune jemalloc to your workload you should be able to see significant performance improvements at startup as well as in the hot path.
The problem is so bad that it's one of the top quoted reasons that hg is considering a gradual rewrite to rust:
Though it looks like it's more about supplying a way to embed hg in a proprietary (or non-Python) application via subprocess communication rather than directly importing the GPL modules or having to string-parse responses. There doesn't seem to be provision for interacting with the server directly from the command line, which is odd, since this pattern is well established in Java land (bazel, for example).
I'm aware of yapsy as a lightweight plugin system, but entry_points really is the gold standard in terms of Python extensibility. It's a shame it can't be made more performant.
Edit: to be clear, I’m pretty sure that initial block could be reserved on the heap. The benefit comes from the large initial allocation, not from the memory being on the stack.
I don't think system allocators are clever enough to handle allocating 100-500k very small objects per minute while Python is doing something very intensive.
It's a pretty standard way to speed up allocation for dynamic languages. Game developers use similar techniques as well.
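The core trick is a free list: pay one up-front allocation for a pool of objects, then recycle them instead of hitting the general-purpose allocator on every request. A minimal sketch in Python (the `Particle`/`ParticlePool` names are just illustrative, in the spirit of the game-dev effects pools mentioned above):

```python
class Particle:
    __slots__ = ("x", "y", "alive")

    def __init__(self):
        self.x = self.y = 0.0
        self.alive = False


class ParticlePool:
    def __init__(self, size):
        # One up-front allocation for the whole pool.
        self._free = [Particle() for _ in range(size)]

    def acquire(self, x, y):
        # Reuse a recycled object when one is available.
        p = self._free.pop() if self._free else Particle()
        p.x, p.y, p.alive = x, y, True
        return p

    def release(self, p):
        p.alive = False
        self._free.append(p)  # recycle instead of discarding


pool = ParticlePool(1024)
p = pool.acquire(1.0, 2.0)
pool.release(p)
q = pool.acquire(3.0, 4.0)
print(p is q)  # → True: the released object was handed back out
```

CPython's pymalloc applies the same idea below the object layer, keeping freed small blocks on per-size-class free lists inside pools.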
I have some stats on Python's allocator:
We never free empty pools.
Our arenas are just singly linked lists, no need for the prev pointer.
For a statically compiled perl the biggest win is avoiding arena allocation (mmap) entirely: data and code are made static. That's around 10-20% of the runtime (for short-running programs).
Also we rarely free at the end. The OS does it much better than free(). Only the mandatory DESTROY calls and FileIO finalizers are executed.
I'm currently having a problem with this, where I load a large deep learning model into "CPU" memory then move it to the GPU, but I can't get rid of the memory reserved by the process.
In your case you should first consider the possibility that there is a pointer to your model's objects that is for some reason not being released. It might simply be that even though you are moving your model to the GPU and perhaps removing all of your own references, there are internal references to your model's data that are hidden from you. At least something to consider.
edit: To add to this, I'm now quite sure (though I could be wrong!) that whether sbrk gets called with a negative value is outside Python's control. Python is making use of malloc/free under the hood:
There's some flexibility for wrapping free in different ways in that file, but it seems like it'll basically always be using free at the core. At least on my system in a debugger I just verified that. So if it's true that python by default uses malloc/free, then the question of whether sbrk with a negative number comes into play is more a question of how your libc implements malloc/free.
Of course I might be wrong, but I think that you should probably stop worrying about it at that level and instead look into object references first as I detailed above.
Which framework do you use for deep learning? It can allocate some object on its own.
Can you give me some stats from when a model is in use and from after it's no longer in use or reachable? You can get them by calling the sys._debugmallocstats() function.
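For anyone unfamiliar with it, `sys._debugmallocstats()` dumps pymalloc's arena/pool/block statistics to stderr; calling it before and after a workload lets you compare how much the allocator is still holding. A minimal before/after sketch:

```python
import sys

# Baseline pymalloc statistics (report goes to stderr).
sys._debugmallocstats()

# Churn a large number of small objects, then drop them.
data = [object() for _ in range(100_000)]
del data

# Compare: arenas/pools retained by pymalloc even after the objects
# are gone show up in the difference between the two reports.
sys._debugmallocstats()
```

It's a private, CPython-specific function (note the leading underscore), so the output format can change between versions, but it's the easiest window into pymalloc's internal state.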
Why it is faster is something the article helps cover... some of the tricks for optimizing small-object allocation and reuse that CPython performs are not done by the C allocator. As to why THAT is, I can only appeal to your intuition that not all applications allocate memory in the same way, so an allocator optimized for Python's behavior has an advantage.
> To reduce overhead for small objects (less than 512 bytes) Python sub-allocates big blocks of memory. Larger objects are routed to standard C allocator. Small object allocator uses three levels of abstraction — arena, pool, and block.
This is inaccurate, especially when considering the thread safety of memory allocation.