Hacker News new | past | comments | ask | show | jobs | submit login
Python consumes a lot of memory – how to reduce the size of objects? (habr.com)
257 points by atomlib 10 days ago | hide | past | web | favorite | 103 comments

The article doesn't mention a hidden gem in Python's standard library - typed arrays!

Those are in package `array` and they are a very barebones version of numpy's ndarray - with just one (explicit) dimension and no overloaded operators. But if you just want to keep a bunch of numbers in a contiguous array, they can save you tons of memory.

(I know the purpose of the article is to describe a more complex data structure, but arrays can still get you very far.)

That's how you deal with that problem with Java as well: arrays of primitives. If you need to operate on million of points, you can either use Point[] which will incur something like 16 MB additional memory or use two int[] arrays which won't incur any extra overhead short of few bytes. Your code won't be pretty, but it'll be fast and you can always hide this weirdness behind pretty API.

> That's how you deal with that problem with Java as well: arrays of primitives.

It just goes to prove that you can indeed write FORTRAN in any language!

This is actually a feature - python array slicing syntax is very similar to fortran.

It will depend on your application, but you might also consider just one array of [x, y, x, y, x, y, ...] to improve cache locality.

Also known as the "Struct of Arrays" approach.

"Struct of Arrays", to be contrasted with "Array of Structs", because in Java it's actually "Array of pointers to Struct". In languages where "Array of Structs" is actually possible, the decision of which to use is less clear-cut and depends on how big the Struct is, what the access patterns are like, and whether you're trying to perform SIMD operations on multiple Structs at once.

That's not only about pointers. Every Java object incurs an overhead of around 16 bytes (architecture and JVM dependent, of course, but it's there).

Several times in Python I've used the built-in CTypes library to cut down the memory footprint by using giant Arrays of (literally C) Structs.

I was bored and had time on my hands, so I played around a bit with some of the suggestions in this thread, but it seems as though a naive ctypes approach is the worst possible approach in the two cases I tested (arrays of integers, arrays of x,y struct).

Here are my code and results:


Is there an approach that you use that produces better results than this? Because my naive approach is unmanageably worse for both performance (not recorded) and impact on Python's memory usage.

> Your code won't be pretty,

Depends on the algorithms that use that code. Two separate arrays is not a bad way to represent a vector-of-tuples. And so if you think of everything as a vectorised operation (and if that makes sense for your use case) then the code can be clean enough.

Efficiency-wise it depends on how the locality of reference falls out. There are cases where it's more cache-efficient to store the pairs next to each other, but again for big vectorised operations you might lose nothing by using two cache-ways instead of one.

Or you can just use a more memory efficient language like Java. Python and JS use tons of memory compared to strongly typed VM languages like C#, Go, Java. People lament Java memory usage but Python/Node use several times more

>People lament Java memory usage but Python/Node use several times more

The people who lament Java's memory usage never complain about Python or Ruby because... if you're worried about the JVM you would never consider Python!

I'm mostly referring to those that say the JVM is too "resource heavy" then go on to use JS/Python/Ruby instead, which have much higher runtime overhead.

In my experience many dev think the scripting language heritage means these languages have smaller/lighter runtimes

Not that I want to go down the rabbit hole, but I probably fall into the category of “JVM is heavy, but I use Python”

Sure, memory consumption and CPU perf are both pretty bad in Python, but latency and memory footprint of the runtime itself are pretty good, so it’s ideal for tooling, crons, lambdas etc. JVM is comparatively efficient once you have the JVM ready, but that sure takes tons of resources.

I’m hoping Graal changes this. I don’t like JVM-based languages, but I do like technological progress.

Python startup time can be significant for non-trivial programs as well, and it's been a big problem for Mercurial and other projects. For example, see these threads:

- https://mail.python.org/pipermail/python-dev/2014-May/134528...

- https://mail.python.org/pipermail/python-dev/2018-May/153296...

If your script is expected to be run interactively on a frequent basis, then Go, Rust, C++, or even Bash (for simple stuff) will give you much lower user-perceived latency than Python.

It's fairly easy to use Class Data Storage and include an archive in Java 12 where startup time is important. This reduces start time significantly.

JVM starts really quick (~100ms on my machine) and doesn't use much resources as long as your app is small. that's... Uncommon in Java land though. Even simply apps pull in Guava/Apache Commons and a few client libraries. This can easily be thousands of classes. Nobody thinks about it because runtime cost for loading shitloads of code is so low. But you can improve this a ton by using ProGaurd and stripping out stuff you don't need

It really depends.

If I can do it in numpy, I can probably get a lighter, faster implementation with Python than in Java. More generally, if I can do it with a Python package that's actually a fairly thin wrapper around a C, C++ or Fortran library, then Python also has a decent chance of being the easy winner.

If none of those situations apply, then yeah, typically Java ends up being more efficient.

I have seen completely the opposite in the wild. In fact I've only seen it so far in the opposite direction (Java consuming far more memory than Python) that I have to ask what on Earth you are doing with Python to have found yourself in such a position.

(I've never successfully used Node so I have no idea what that's like in memory use.)

Until you end up rediscovering the information storage skills used by databases and lower level programs: buffers with fixed sizes and implicit type associations.

I would say that Python is the right tool for some things, not for others. Why not use Go, or Nim or C to store/access/return these data in a data structure that transforms the values from/to python?

That's what C-extensions are for :P

Python is a great high-level wrapper around low-level programs (c.f. SciPy)

It's not always an option, but simply using PyPy can massively reduce memory usage without needing to change your code at all.

The PyPy site links to this blog post (from 10 years ago) with some info: https://morepypy.blogspot.com/2009/10/gc-improvements.html

And a quick search found this relatively recent post that does some measurement of it: https://dev.nextthought.com/blog/2018/08/cpython-vs-pypy-mem...

In the article consider aproach with the `dataobject` from `recordclass` library as a base object in class definition. This seem can produce less memory than PyPy.

Any type that you have "a lot" of, is a bad candidate for a class/object. Your OO program should typically never instantiate thousands of heap allocated objects at once. A triangle mesh is a good object candidate. One triangle or vertex is not. A device, a render system API handle and an image is a good candidate, one pixel is not, and so on.

So don't use a lot of objects. In C# you use a struct and arrays of them to avoid creating an object on the heap per array entry. In java you have to resort to SoA, in Python it's the same. Just because you have objects doesn't mean everything should be an object. Even languages that follow an "everything is an object"-design usually have an escape hatch such as plain value arrays.

It's not about using objects or not, it's how you compose them. A C# array of integers is an object. Each integer in that array is also an object (though granted, it's a special case). Numpy arrays and javascript typed arrays as well.

Primitives and value types are not objects in C#. The array is one object, that’s it.

It has methods and fields, it implements interfaces, and descends from the 'Object' type.

The fact that it's unboxed is an implementation detail.


What’s relevant for performance is whether an array of 1000 “things” require 1 or 1001 allocated objects, whether accessing thing N in the array requires dereferencing one or two pointers, and whether the item has a storage of 32 bits without overhead for headers. Which ones to call “objects” is semantics.

For efficient access, the array must be a consecutive array of primitives. This is the case in both C# and java for integers. In C# it’s also the case for an array of Vector2 with 2 primitives each, which isn’t the case in java.

My point is this: avoid heap allocating many things in collections. They must be raw (primitive, consecutive) data without per instance overhead and of course without heap alloc/GC cost.

While everyone haggles about the internals I have an interesting anecdote about the costs of tools libraries.

A small service that landed in my lap needed to read(only) a data source that was roughly 10k lines of yaml. No way that was going to be in any way efficient so I asked for suggestions. All the work-a-day devs(I am not) instantly said the same thing without a single real thought about it: Make it a database duh!

Long story slightly less long, Loading up the libraries to interface with a database ate between 2 and 3 times the memory(depending on the DB and lib) that simply loading the entire 10k line yml ate and offered slower performance and required more code.

SQLite was pretty darn close but in the interest of saving developers from themselves vis a vis parameterized queries, or the need to queries all together for that matter, increased the required code for zero benefit.

The service still hums along with a 10k line yaml in memory. "Worse is better" indeed.

You could of improved it with a protobuf or a json file then, but with the size of your data set, it shouldn't really matter what your using.

It can be hard to beat an in memory data structure when your data set is small enough, true.

Well, the SQLite might have been more scalable if you needed to deal with much larger YAML files. I've been in a similar situation with XML. It's fine to hold in memory until you end up running out of memory, at which point you need to look at other approaches. There's definitely an overhead to changing, but it might be worth paying if the tradeoffs make sense.

The quickest change is no change, naturally.

However, "better" is dependent on the situation. If the data is unlikely to change, stay the course. If not, then while a database might be more maintainable over the long term, even if not as efficient.

In the pure Python cases the 8 bytes per attribute are just pointers. The x, y and z are themselves full-blown objects with all the extra memory overhead that comes with it and this is not counted in the article. For example, an int object uses 28 bytes, so three of them already use up more than each of the described container objects.

The Cython and Numpy cases directly store the actual data and this has the larger effect to reduce memory.

On the other hand, the first 256 integers are singletons in python, so they aren't duplicated.

Slightly pedantic correction: This is a performance optimization in CPython. I wouldn't be surprised if other implementations have something similar, but to my best knowledge, this behaviour isn't part of the standard.

> ... but to my best knowledge, this behaviour isn't part of the standard.

To the best of my knowledge, Python isn't one such language with a standard, is it?

The standard is documented in PEPs[1]. CPython is the flagship implementation, and many PEPs discuss it directly, but if you're looking to write another Python, that's where you'd go.


0-255 (inclusive) are singletons, just to clarify.

my bad. Maybe it used to be 0-255 and I'm just too old to remember the change?

Same for dictionaries ain't it ?

Many workloads do not require more than 4GB addressable RAM per process. Linux offered a 32-bit user space with 64-bit instruction set:


As effectively many Python workloads are usually object oriented business logic and objects are mostly pointers, setting up x32 user space "halved" the memory usage. It also made execution performance faster because of better CPU cache utilisation.

Sadly, x32 was "very custom" and very hard to support. Last I heard x32 is being phased out from Linux kernel.

Looks like this little gem was in the Wikipedia article

> The best results during testing were with the 181.mcf SPEC CPU 2000 benchmark, in which the x32 ABI version was 40% faster than the x86-64 version.

Another option is to use the standard 64 bit ABI, but store object pointers compressed into 32 bits when on the heap. This lets you address perhaps 32 GB in a 32 bit value.

Could you elaborate more on how the 'compress' part works. Quite curious. I can imagine working with a base pointer and 32 bit offsets.

Java for example, can use "compressed OOPs" -- ordinary object pointers. This is a 32-bit stored pointer but shifted, as pointers to objects always point to something on multiple of 8 bytes. So the shifted address is stored, but then shifted left before being dereferenced, allowing addressing of 32GB memory with 32-bit references.

I read some longer article about the evolution and options for memory addressing in JVM, if only I could find it... but here's another one: https://wiki.openjdk.java.net/display/HotSpot/CompressedOops

There are variations on this method that can be applied to other memory management pools. For example, let's say are allocating many objects for processing, need them only within a certain span of time and have to free all of them when it's over. Well, rather than individually tracking allocations you can allow a single perhaps growing memory area, store the start offset somewhere and reference them with a shorter reference relative to that area.

Do you need to address every byte in your memory? If all your Python objects are at least two words in size, or more, then no, you only need to address every 16 bytes or so. So shift your whole pointer right a bit and you can address more than 4 GB of objects in a 32 bit pointer.

Also curious about this.

Does 64-bit instruction set provide some segments or functionality form this? How about "native" pointers coming from glib and such?

If there has to be base + offset translation on every pointer access it is way too slow.

I would also assume JavaScript VMs in browsers would be already utilising this, as web page workloads are not gigabytes (hopefully).

> If there has to be base + offset translation on every pointer access it is way too slow.

It does do this, but it's not too slow - the overhead of the translation is lower than the benefit of reduced memory transfer, increased cache space, etc. Obviously - otherwise people wouldn't be doing it.

> If there has to be base + offset translation on every pointer access it is way too slow.

The memory access instructions in the x86 ISAs can do base + offset in a single instruction.

I don't think the current V8 release has this build in yet. This something the common JVMs have been doing this since at least 2006.

> If there has to be base + offset translation on every pointer access it is way too slow.

It's time to update your intuition: memory access is usually slow, much slower than simple arithmetic.

Wow, this is really amazing. I've always felt like 64-bit pointers were a huge waste of memory, since many small tools need well below 4 GB. Cache is always going be relatively small (unless semiconductor/transistor fabrication costs drop dramatically), and smaller pointers result in better cache usage, and consequently fewer cache misses.

> Sadly, x32 was "very custom" and very hard to support. Last I heard x32 is being phased out from Linux kernel.

This is really sad to hear. Do you happen to know of any kernel discussion threads on this besides this one[1]? (I'd love to chime in, in support of the X32 ABI.) I'm a little surprised honesty, as Linus has mentioned multiple times how important it is to him that the user-facing ABI remain stable.

[1] https://lkml.org/lkml/2018/12/10/1145

Interestingly, V8 (the JS engine) is actively finishing up its own implementation of 32 bit pointer compression.


I am a C/C++ programmer mostly. Is there good documentation for python development when I care about performance and memory footprint out there as a book or something that anyone can recommend? It was fun reading this but I like to know more.

I'm a Python "snob", going on 24 years and If I ever "I care about performance and memory footprint" I'd not use Python.

Python is good enough 90%. You get faster 2x 10x, etc code by picking better algorythms or solutions. In Python, you should not be caring about 10% or 30% speed improvement. It's not worth it, It's not Python's strength.

When you need faster you go to C based libs (the python libs that need to go fast are already in C), https://en.wikipedia.org/wiki/Cython https://en.wikipedia.org/wiki/Numba etc.

Or go for [1]pypy or [2]graalpython. I know that equivalent code in graal java ee produces equivalent assembly/performance to C+current gcc, which really shows that yes changing to a JIT can pay off. The python in graalvm is early stages but it shows that the old lesson of just write the hot spot in C is no longer true. Chris Seaton shows that for Ruby/C the language Ruby interpreter/C switch is expensive. I think the same is true for Python interpreter/C switches. Something that GraalVM+Python|Ruby can optimize out.

[1] https://www.pypy.org [2] https://www.graalvm.org/docs/reference-manual/languages/pyth...

LuaJIT has great performance through the C FFI. It’s more that traditional JITs can’t optimize across FFI boundaries. Hopefully Chris will correct me if I’m full of shit.

There are other approaches which can work too, like compiling C to LLVM IR using Clang so a function can be inlined by an LLVM based JIT at runtime.

Like with JS, you might not have a choice. I mostly use Python for extending programs that use it as a embedded scripting language. Performance is still important.

High Performance Python by Ian Ozsvald and Micha Gorelick - fairly outdated at this point, but still quite relevant (I believe the authors are working on a new edition).

They have started work on the new edition if I'm not mistaken.

No draft online ?

I don't think so. This is quoted from Ian's latest mailing list [0] posting:

Second Edition of High Performance Python in the works

I'm very pleased to say that Micha and I have started work on a Second Edition of High Performance Python with O'Reilly, planned for early 2020. This book will use Python 3.7 and will add tools that barely existed 4 years ago including Dask and probably some Tensorflow, amongst lots of other goodies. I'll let you know how the book progresses via this list. So far I've updated the Profiling chapter, added some advice on 'being a highly performant developer' and have rebuilt the Cython/Numba/PyPy code for the Compiling chapter.

[0] https://ianozsvald.com/data-science-jobs/

it's written in 2014 so not that old I think

I thought the usual mantra for performance-critical functions is to implement them in C and call from py.

Not always necessary. I do a lot of ETL type work in python, and generally, the idea is to never pull everything into memory at once if you don't need to. This means leveraging generators quite a bit. Read a row, process a row, generate a row, pass it along the pipeline. I've written scripts that can process a 1k line csv file with the same memory footprint as a 10M line csv.

If you have to read everything into memory, because you're doing some sort of transforms like a list to a dict or such, explicitly deleting variables can help, as well, rather than waiting for them to go out of scope, but this should be pretty rare.

Or just port to Julia, Scheme or Common Lisp, while enjoying the power of dynamic languages with compilation to native code.

See Mike Mueller's tutorials from PyCon at https://www.youtube.com/watch?v=DGrS0uwMuHY. That's the 2018 one, but I'm guessing there is a similar one. There always seems to be a high perf tutorial of some kind.

>A significant reduction in the size of a class instance in RAM is achieved by eliminating __dict__ and__weakref__. This is possible with the help of a "trick" with __slots__:

What is the downside to the __slots__: trick? In what cases do you need the dict and weakref?

One big upside of using __slots__, which the docs tell you not to use it for, is that it will throw an error if you try to access or assign to a member that is not one of the slots. Very handy for catching bugs.

Try C++!

I do what I'm paid to do :)

Learning Python and Flask for my home projects. It’s really making me wish I was paid to write Python instead of JavaScript.

It’s not all roses! In my experience at three separate organizations, Flask apps tend to become horrific to maintain beyond a certain size, unless the original developers really know what they’re doing and are disciplined about it.

A lot of it is due to the standard patterns for small Flask apps (e.g. instantiating your app directly at the module level rather than in a factory function, putting all of your business logic into ORM classes) causing increased complexity and unintended side effects down the line. Flask apps also struggle quite a bit with performance, and the only way to do things asynchronously is with something like celery.

For Python, aiohttp is also a really interesting web framework, and the asynchronicity should be easy to grok for JS devs.

Personally, as someone who has worked with Python/Flask professionally for years, I’d pick TypeScript and any JS web framework any day!

> Flask apps also struggle quite a bit with performance, and the only way to do things asynchronously is with something like celery.

I present to you, Quart[0], a flask-like web framework using Python 3.7's asyncio framework.

It works pretty well in the single production application I have tested it in.

[0] https://pgjones.gitlab.io/quart/

Neat! I hadn’t seen this before. I’ll have to take it a spin for a side project at some point. Thanks for the recommendation!

Or help mypy / mypyc be better!

There isn't really a "downside" in a strict sense; there are just trade-offs.

The biggest one I run into is that __slots__ does not play well with multi-inheritance:


Also, slots somewhat contributes to obfuscation since slotted objects don't work in the exact same ways as regular objects do so developers may run into weird stack traces they might not immediately recognize.

>> developers may run into weird stack traces they might not immediately recognize


To me anyway, debugging code is massively harder than writing it in the first place. Nothing worse than trying to fish out some odd issue that plagues you after spending two hours writing some code, and then 4 hours sorting it out.

If you are writing code that adds or removes properties at runtime. I.e., when you want to write shoddy code.

I have written code like that in the past, and would consider doing so again in the future. When, it's appropriate, it can be hugely beneficial.

When I last did it, I was wrapping a C++ API that I needed to compare against another dataset for a merge. The C++ API (which, admittedly, I also wrote), didn't have equality or hash operators defined on the Python objects. So, I monkey-patched them in for the keys I needed. It was actually the most elegant solution I could come up with as I could then naturally use the objects in sets and easily get differences to update the appropriate dataset.

As an aside, when I wrote the python wrapper around my C++ API, I purposely didn't define equals and hash operators for the objects, despite having full information to do so on the natural keys, because I wanted the flexibility to do the monkey-patching and override how the objects were compared depending upon circumstance.

Pretty sure you can still monkeypatch the class instead of each instance, but I’m too jet lagged to think it through at the moment.

Why not competition with forwarding a.k.a a thin wrapper with the extra comparison method?

"when you want to write shoddy code" - This so much!

A downside of __slots__ is that the way they are set up, you can only have slots come from one branch of an inheritance hierarchy, so they don't work out very well for fancy OO interface / mixin type stuff.

Inheritance. Even Guido has said that people should avoid use of _slots_. Some people think that the space saving trumps the inheritance issue/type checking not inherited or some such. It's almost anti Python-like. People want to use it so they don't have to go down to something like C. I personally would not use _slots_ but then again, I'm just a nobody who writes semi-decent code for semi-decent money. YMMV.

TXR Lisp, 64 bit:

  1> (defstruct blank ())
  #<struct-type blank>
  2> (pprof (new blank))
  malloc bytes:            16
  gc heap bytes:           32
  total:                   48
  milliseconds:             0
32 bit:

  2> (pprof (new blank))
  malloc bytes:             8
  gc heap bytes:           16
  total:                   24
  milliseconds:             0
The structure instance has a pointer to its type, followed by a numeric ID (which is also in the type, but is forwarded to the instance for faster access). The ID is combined with a slot symbol to perform a cache lookup to get the offset of a slot. The numeric ID is a fixnum, which leaves a few spare bits for a couple of flags:

  struct struct_inst {                                                            
    struct struct_type *type;                                                     
    cnum id : sizeof (cnum) * CHAR_BIT - TAG_SHIFT;                               
    unsigned lazy : 1;                                                            
    unsigned dirty : 1;                                                           
    val slot[1];                                                                  
If someone wanted to shrink this, they could patch the code so that the dirty flag support, and lazy instantiation of structs is made optional (as in compiled out), and so is the forwarding of the inst->type->id to inst->id. This struct type is not known outside of struct.c, which is only some 1700 lines long. If you take out the id member from struct_inst, the C compiler will find all the places that have to be fixed up; literally a 15 minute job.

I can also think of a more substantial refactoring that would eliminate the type pointer also. There is no room in the heap object handle to store it directly: heap handles have four words in them; the COBJ ones used for structs have the type field, a class symbol, a pointer to an operations structure, and a pointer to some associated object (in this case struct_inst). All structures share the same operations structure. However, if we dynamically allocate that operations structure for each struct type, we could stick the type pointer in there, taking it out of the instance. Thus an instance size could literally just be sizeof(pointer) * no-of-instance-slots.

I am paying US$5 a month for python's memory hogging.

Running Mailman it will not run in 1GB of ram (the US$5 VPS) so I had to give it 2GB. 2GB? For a mail list server?

Had the same problem running motioneye. It will not run on a Raspberry PI Zero. It is advertised to run on a Zero (Apparently I could squeeze it on by doing some Python magic....)

What total crap is Python! What waste, what hubris, what technical failure!

I suspect the problem is actually Django - what hubris! NIH! lighttpd is running sweetly, with some rust templating in a acceptable fraction of the memory.

But moving up your stack of desire, what alternate to mailman would you live with? Five a month is $60 a year. Assuming you value your own time at professional wages, this job is worth an hour of your time at most before the opportunity cost of complaint is cheaper than replacement.

opportunity cost of complaint

No, what he actually bought for his $60/year was license to complain about mailman & python :)

The python complaint licence fee was the best deal I ever made.

You have completely missed the point. The US$5/month is not a hardship. What is a hardship is taking what used to be done in 40k Perl and require 2G. It is a affront!

What is a waste is writing a poor quality greedy web server when Apache/Lighttpd/Nginx all exist and are much much better.

What really cost me was the two weeks I spent trying to make motioneye (python) and motioneyeos (not) work as advertised. As advertised! It is not only that the python is too bloated that stops it working, it does but there are other reasons. Very clearly no body has really tested it. It is a symptom of python culture. Shoddy and arrogant.

I feel terribly burnt by it. Twice. So my stack of desire in this domain is quite small and simple: Never to hear again from python! A very personal POV, be happy in your python hacking. I saw the opportunity to have a public rant and I took it!

Seems to be an error in editing:

> ...which received a rating of [stackoverflow] (https://stackoverflow.com/questions/29290359/existence-of-mu... -python / 29419745). In additioobjects liken, it can be used to reduce the size of objects in RAM...

Is this trading space for speed or is there actually a speedup as well in some cases? With interpreters it's possible you be small and fast if you become perhaps obscure or more rigid in your structured definitions

I always hate how these things are phrased. The idea of “using a lot of memory” doesn’t exist in a vacuum. “A lot” relative to what? Do you require run-time dynamic typing features? Do you want to leverage the Python data model? If yes, then this is the memory cost of your desires.

Advice in many of the comments about other languages sounds so tone deaf. The question is not about using less memory. It’s about getting _exactly_ the feature set of Python while using less memory.

This is a pretty neat and useful comparison of how much memory different structures use in Python to achieve roughly the same goals.

I feel like the last two are cheating a bit by explicitly using 32 bit integers where the other examples seemed to use 64 bit.

No, the fields that take up 8 bytes are pointers to PyObject. (I guess this article assumes a 64-bit memory model.)

OP, if you're reading this, your article has damaged syntax at the word `additioobjects`.

The author was fixed this.

this functions u will also see in the case of JAVA s well.this problem arises due to large no. of objects are active in RAM durig the execution of a progrm especially if there ae restrictions on the total amount of availablle memory

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact