Hacker News new | past | comments | ask | show | jobs | submit login

Python lists are actually arrays — contiguous chunks of memory.

Just wondering here, is this guaranteed to always be the case? Practically it probably is, but does the Python spec (as in: the laguage, not one of it's implementations) say a list must be implemented using contiguous memory of slots with Python objects? That seems so low-level and C-ish. Or does the OP actually mean CPython here for instance?




Does Python actually have a spec? AFAIK, it's always been "what CPython does."

IIRC, an old joke was that Python lists are arrays and Perl arrays are lists.


It seems like a reasonable shorthand to say that Python does something, but to really mean that CPython happens to be implemented to do it that way.


Python doesn't really have a spec in the way that, say, C# does. The docs are missing things like whether the argument is positional-only, exact exceptions raised in all cases, and time complexity of common operations.

That said, I think every implementation uses contiguous memory. PyPy has list "strategies" int[], double[] and PyObject[] and probably more, switching transparently between them.


With numba or other jit tools in CPython, you have things equivalent to pypy list strategies as well. If you add jitted helper functions or Cython, you have even more control.


According to this post: https://news.ycombinator.com/item?id=18707403, "Lists aren't called arrays in Python because they're not arrays. Next!".

I really am confused now. Is it an array, or is it not?


Strictly speaking, lists are not arrays.

I'm not aware of what the specification for a list would be, this is the closest I'm aware of: https://docs.python.org/3/reference/datamodel.html#objects-v..., 3.2 > Sequences > Mutable Sequences > Lists, "The items of a list are arbitrary Python objects. Lists are formed by placing a comma-separated list of expressions in square brackets"

CPython implementation is here: https://github.com/python/cpython/blob/e42b705188271da108de4...

If I'm talking about arrays with someone else, I would expect us to be thinking of arrays in the sense of C. Both C++ and Java use the same definition. Other languages that I'm aware of (e.g. Javascript, Go) have their own notions of arrays that you can't conflate with the C notion of an array.

From the C11 spec, §6.2.5.20: "An array type describes a contiguously allocated nonempty set of objects with a particular member object type, called the element type."


Lists are very often used in the way arrays are, though.

The heterogeneous use-case (that in C, etc. is usually a struct) is often filled by either tuples or (ordered) dictionaries in Python.

If you really want something a bit closer to a C array, you can fairly easily create a UserList that enforces the non-empty requirement and only allows items of a particular type.


If you want some closer to C arrays, you can use Python array.array, which is part of the standard library.

https://docs.python.org/3/library/array.html

https://stackoverflow.com/questions/176011/python-list-vs-ar...


Thanks! I had forgotten about that module.


I think you have a point about C or C++ notion of arrays having the same type.

I'll probably rewrite that paragraph, since the main point is to tell people unfamiliar with Python that lists aren't actually linked lists.


So... Because lists are dynamically typed and heterogeneous, does that mean the underlying C is basically a contiguous segment of memory of python object references? Then in cpython the actual list contents are Python objects sitting on the heap in a non-contiguous manner?

And when the list needs to be resized and doubles itself, it's just doubling the space for object references?


> Because lists are dynamically typed and heterogeneous, does that mean the underlying C is basically a contiguous segment of memory of python object references?

Yep, it is an array of pointers: https://github.com/python/cpython/blob/3.7/Include/listobjec...

> And when the list needs to be resized and doubles itself, it's just doubling the space for object references?

Basically yes. The growth pattern is not as aggressive as doubling though: https://github.com/python/cpython/blob/3.7/Objects/listobjec...


I keep forgetting just how readable cpython source is. I'm still mentally stuck in my beginner years, where it was all Greek to me.

Thanks for sharing these links.


Interesting:

"The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ..."

What is the big-o amortized cost of using this pattern? Doubling has a easy proof that it's O(1) amortized.


The growth pattern is still exponential, instead of going x -> 2 * x, you go x -> 9/8 * x + 6. I think it should still be O(1) amortized, you just need to adjust your proof for doubling a bit.


Mathematically the best is not doubling but fibonacci. for the start doubling is used because it's much faster to compute.


> So... Because lists are dynamically typed and heterogeneous, does that mean the underlying C is basically a contiguous segment of memory of python object references?

Yes. Every Python object is held by the interpreter as a pointer to a PyObject struct on the (C) heap.


Not something I would put into production, but had a fun application of this on a side project. I had the `id` of a function (but not the function object itself) and needed to recover the function. In CPython, `id` of a function corresponds to it's memory address (not sure if that can be overridden). By casting the id to a PyObject, I was able to recover the original.

    func = ctypes.cast(int(address), ctypes.py_object).value
Was kinda cool actually seeing that work.


I guess it depends on your definition of arrays, and what you expect from them. For example, Ruby calls a similar thing array: https://docs.ruby-lang.org/en/2.0.0/Array.html

To me it is kind of self-growing array that can hold objects of different types. The C++-ish equivalent would be something like std::vector<void*> / std::vector<MagicObjectBoxType> (there is no MagicObjectBoxType of course).


To me, the difference between a list and an array is that an array is a fixed length allocation of memory storing multiple items of the same size. This way, to jump to the third item in the list, you just need to know the data type (or rather, how many registers a single item of that type takes up in memory) and the index. That way you can jump right to the correct memory address by just doing ```index * cell size```.

You compare this to, say, another fundamental data structure, the stack. In a stack you don't know how much memory each item takes up, so you can't just jump right to that item. Thus, the limitation is that you can only pop or push the last item on that stack. Once upon a time, stack-based languages were quite common.

A list, in contrast, is an ordered, polymorphic collection of objects which can have variable size. You can create resizable arrays by using a linked list. A resizable list would typically be implemented as an array with the last item being a pointer to the next array. I believe (but don't quote me on this), that Python uses links to increasingly large arrays so that there aren't too many links. That is, it dynamically increases how much memory to allocate to that array depending on how much data you're actually storing in that list.

A list is also polymorphic, meaning it can house multiple types of data. I sometimes hear a non-polymorphic, but resizable array called a "vector," but that's super confusing terminology. You can implement this by creating a resizable array of pointers, where each pointer points to a location in memory where the value, along with the data type, are stored. This way, all the pointers are the same size and you can jump right to that location by calculating it from the index, but you still get to have multiple value types.

The distinction is not universal, though. But in my humble opinion everyone who disagrees with me is wrong.


No, they are not. At "best" they are implemented "like" an array of pointers to the objects (scattered somewhere else in the memory) in the list. If you want an array (in the "c-like" sense of it) you can look at the array module https://docs.python.org/3/library/array.html


An array is a list, but a list is not always an array


It is indexable, and it is used as an array, so I guess it is an array in all implementations.

But good point, I should check this.

Dicts are definktely implemented differently in different implementations, and I mention this.


It's not implemented like that and couldn't be. They aren't arrays unless you consider them arrays of pointers but this isn't a useful definition in Python.

An array contains a number of elements of the same size. That means you can find the memory address of any element by simple arithmetic: array_loc + index * element_size.

Python lists can contain objects of any type and the objects can't be found using arithmetic like that.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: