Hacker News new | past | comments | ask | show | jobs | submit login
How Python List Works (antonz.org)
82 points by tosh 13 days ago | hide | past | favorite | 35 comments

One thing to note, in the gritty details of Python's actual implementation of the list type, is that there's a little more to say about how list resizing works than is immediately obvious. The relevant source is here: https://github.com/python/cpython/blob/main/Objects/listobje...

As the article points out, the size progression, as detailed in the comment in that source, is relatively modest. Your "classic" dynamic array will simply double the size as needed. This progression does less than that: 16, 24, 32, 40, and so on. But: The underlying implementation is using the standard C library function realloc(). This will perform some amount of in-place resizing on its own, and will lend a lot to the performance of a Python list, albeit in a way that might be difficult to nail down, since it depends on the overall usage of the C library's heap.

How a Python list really works? See the underlying C implementation: https://github.com/python/cpython/blob/main/Objects/listobje... ; https://github.com/python/cpython/blob/main/Include/listobje...

For comparison, java.util.ArrayList is implemented in Java itself, and is much shorter: http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/...

Ya I was reading this thinking none of this is telling me how CPython implements it. It’s just a poor-man’s version.

One point which I think bears mentioning with respect to performance is that even with O(1) access, the performance in the real world of a list implemented by a list of pointers vs a contiguous block of memory will be very different. This is because programs often iterate over lists and a list implemented by a contiguous block of ram will have MUCH better cache locality.

AFAIK, this unfortunately wouldn’t be possible to implement in Python though because I don’t think there would be any way to know how much space to allocate for each item in the list.

In CPython at least, there's no other way. Everything is an object, even simple types like integers. And references to objects are implemented with pointers. Even a Numpy array which stores a contiguous array of simple values requires conversion to an object when you actually try to use those values.

For more efficient "lists" there are Numpy arrays https://numpy.org/doc/stable/reference/generated/numpy.array... which do have cache locality

Python does come with built-in arrays, but I've never seen them used, or heard anyone recommend them for anything.


They are recommended when you want speed, but most people writing Python probably never need that much speed (or would probably want numpy anyway), and the few who do likely have more private sources to recommend it for specific occasions. It should probably be used for more things (instead of reaching for numpy at first chance) but is still a niche tool.

The times I've used array, it was a common data structure for communicating with a C extension. You could use a Python list, but that's not just slower but more of a pain to deal with on the C side. (This was before numpy was everywhere; I don't know how much fun that might be to work with from C.)

If you're going pure Python then space is a better reason than speed. I'm not sure an array is even as fast as a list.

I think I use struct much more when interfacing with C, but I can see array being useful for arrays specifically when you expect more than a couple of elements. Both are still quite niche though.

The standard array module was more relevant in the distant past, before version 2.6 added the bytearray type. Otherwise, numpy arrays are considerably more useful in virtually every circumstance.

I had never heard of them and when I try to Google Python array v set speed it just returns list v set. I did get the official documentation but it doesn’t go into speed.

These are useful when you need a pointer to an integer for a C or C++ API.

If you have numeric value you can use https://docs.python.org/3/library/array.html

But if you're in the real world you use numpy.ndarray instead (this underpins half of the Python ecosystem, more or less).

On a similar note: Has cpython implemented tagged pointers yet? I remember there being a lot more discussion about boxed types when Java sometimes boxed primitive types (at the expense of performance).

Why python gets away with it I don't know. Back when I still did python, integers below 256 and above -3 (?) were "cached" and had no boxing overhead. In 3.6 that was still the case. Either cpython lacks tagged pointers or has the worst tagging scheme in history.

Does it have to be one or the other? You could allocate blocks of memory for the list then when it will be filled up you allocate more and add a pointer to it.

This breaks O(1) access by index, because the objects in the list may have different sizes.

Related to that, do heterogenous lists make sense as a default? I can see the argument for dictionaries easily, you might want to put a lot of different stuff inside one. However, for lists, I fail to see the advantages of heterogenous lists as a default. Especially since they are going to be way slower, and nobody seem to use arrays in Python.

> do heterogenous lists make sense as a default?

Yes, because then the list object doesn't have to do any extra work to check the types of objects. If your application needs that extra checking, you can add it in your application's code; but there's no reason to burden the basic list object with extra code that not everyone needs.

> Especially since they are going to be way slower

Way slower than what? They're slower than Python arrays, of course, but Python arrays can only store numeric values (since the amount of space each value takes up is constant and known when the array is created).

For a container that needs to be able to store objects of any type (including instances of user-defined classes), a heterogeneous list is faster than a homogeneous one, because, as above, it doesn't have to do extra work to check the types of objects. All the other extra work is there because the size in memory of the objects isn't known in advance; and that will be true even if all of the objects are of the same type, since instances still won't necessarily all have the same in-memory size (for example, consider instances of user-defined classes with different attributes).

> For a container that needs to be able to store objects of any type (including instances of user-defined classes), a heterogeneous list is faster than a homogeneous one, because, as above, it doesn't have to do extra work to check the types of objects.

My question is about that. Is storing objects of any type a common use case, compared to storing objects of one type?

> Is storing objects of any type a common use case, compared to storing objects of one type?

I'm not sure it's possible to know a general answer to that, since Python is used in such a wide variety of contexts.

In my own programming, I use lists that can contain objects of different types often enough for that use case to be significant for me.

One other thing to keep in mind is that Python has duck typing, so, for example, if you are iterating over a list and retrieving the same attribute or calling the same method on each object, that doesn't require each object to be of the same type, just to have an attribute with the required name, or a method with the required name and calling signature.

Storing objects of any size is common, because Python objects are rarely fixed size (Python integers aren't fixed size!)

At that point, you need pointers anyway, so mixed type doesn't add much overhead. Especially given that you need to handle mixed concrete types even if you have single “typechecking” types because of inheritance (including via declared ABCs and union types.

I have found, at least in my own use of python, lists are for "homogeneous" collections of indeterminate length. But these are not strictly homogeneous - there may be None entries, for example. So I do use the heterogeneous capabilities.

For heterogeneous collections with a fixed length, I tend to use a tuple. That makes more sense to me, conceptually (coming from C++).

In Python’s type annotation system, “class T or None” is represented by “typing.Optional[T]”. I think C++ has something similar for STL containers. Even numpy has NaN values. So your usage still fits some reasonable definition of homogeneous :)

Python's semantics make homogeneity less useful and hard to enforce.

Everything's passed by reference, so the list has to contain pointers even if you know all its elements have the same type. And some objects can change their type by assigning to __class__, which can happen from a distance (because everything is by reference).

A careful implementation can solve this for some builtin types. PyPy does it for ints: if a list only contains ints it's represented as a proper numeric array. But then as soon as you append some other type the whole list is converted to a more expensive more traditional representation.

PyPy's ints still have to pretend to be heap-allocated objects at all times. In particular they have to do something tricky with the id() function, which on CPython just returns an object's memory address. The return value has to be unique and consistent for as long as an object lives.

> Especially since they are going to be way slower, and nobody seem to use arrays in Python.

Arrays are widely used in Python for serious work (mostly NumPy arrays, either directly or via higher-level libraries, not the stdlib ones.)

If you use the static typing eyes, the Python list is homogenous: every element is of the type reference to python object. There's no funny business with types, just one reference after another.

That's a fair point, but if static typing was like that it would be useless. "Object" is a type, but it's pretty useless most of the time.

The item count is stored separately so that the len() function also performs in O(1) time, and does not have to count the elements each time.

Could you kindly explain this part in detail?

I always wondered why there weren't more 'hyper-collection' abstractions that just implement multiple collection types in one interface, so that you can get the respective speedups of the different collection types while still having the flexibility to use them as you wish

This is the field of datastructures. There isn't 'just' one be-all to end-all. There are so many kinds of hashing/tables, trees/balancing, hybrid (e.g. skiplists), priority queues, etc. Because there are so many possible combinations, if the choices matter it's useful to use a language that lets you make your own. This is one of the main reasons why generics for Go is a big deal. Go always had generics, but it's only been for the built-ins array/slice, map, and channel. With user-defined generics any datastructures/interfaces can be conveniently made and used without language/library/compiler support.

Its curious that for Python this is blog post material, but for Go it's intrinsic to understanding the performance and usage of slices or maps, or in Java (and other langs) would be referenced merely by the class name ArrayList. The fact that so much can be done effectively in Python without always having to think about such details is an achievement of the language and ecosystem of libraries and packages.

C arrays are basically that. Also python dicts (most python types including all user-defined classes are actually dicts underneath), Lua tables, Javascript objects, and more. In most dynamically typed languages, "the" type is actually what you're calling a hyper-collection and not just conceptually, but also in implementation

How would that work? Every time you create an object, internally it creates one of everything?

Does anyone know any good resources for learning more about algorithm times like O(1) and O(n)?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact