
Optimization tricks in Python: lists and tuples - metmirr
https://rushter.com/blog/python-lists-and-tuples/
======
viraptor
There's a pretty bad mistake at the beginning:

> But you can change immutable object:

Appending to a list contained in a tuple does not change the tuple. The
immutable object does not change here.
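
Something like this shows it (plain CPython, nothing version specific):

    t = ([1, 2], 'a')
    t[0].append(3)  # mutates the list object the tuple references
    print(t)        # ([1, 2, 3], 'a'): the tuple itself is unchanged
    # t[0] = [9]    # this, on the other hand, raises TypeError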

~~~
rushter
Thanks, it's a typo, I was referring to the mutable object (list).

------
saagarjha
> Empty tuple acts as a singleton, that is, there is always only one tuple
> with a length of zero…But that doesn't apply to lists since they can be
> modified.

Well, you can get this to work with a copy-on-write implementation. I don't
know enough about the Python runtime to decide whether that's feasible or not.
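
That said, the current singleton behavior is easy to observe in CPython (it's an implementation detail, not a language guarantee):

    a = ()
    b = ()
    print(a is b)    # True: both names reference the one cached empty tuple
    print([] is [])  # False: every empty list literal creates a fresh object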

> if you want to append an item to a list of length 8, Python will resize it
> to 16 slots and add the 9th item

I find it interesting that Python has decided to treat its list resizing
algorithm as part of the formal API…
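
You can watch the over-allocation happen with `sys.getsizeof`; the exact sizes are CPython implementation details and vary between versions:

    import sys

    l = []
    prev = sys.getsizeof(l)
    for i in range(20):
        l.append(i)
        size = sys.getsizeof(l)
        if size != prev:  # the underlying allocation just grew
            print(f'len={len(l)} size={size} bytes')
            prev = size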

~~~
quietbritishjim
I don't think copy-on-write would work in Python, because you wouldn't know
which variables to update to point to the copied list:

    x = []
    y = []
    y2 = y
    y.append(1)  # should also update y2, but not x

~~~
viraptor
The implementation would have to change, but it's doable. x and y would have
to have different IDs, but they could point to a shared location with a COW
marker set. It would likely use more memory than the current empty list, but
it could potentially speed up `x[:]` of an arbitrary list where the result
doesn't actually need to change later.

~~~
quietbritishjim
Where would these unique IDs be stored? In Python, variables are just
references to objects (as are entries in lists, dictionaries, etc.). So you
would need some sort of intermediate object that kept the id and reference to
the underlying list; the underlying list would be shared but there would be a
separate stub object for every (distinct) variable. In the case of an empty
list, this probably wouldn't be any smaller than the current system of having
a totally separate list.

Copy-on-write might make a difference in the case of large non-empty lists,
but then there's another problem: you would need to trigger the copy even if
something nested within the list changed, e.g. if `x=[[1,2],3]` and you
change `x[0][0]`. One way around this would be to only allow copy-on-write
for lists of immutable things (excluding immutable things containing mutable
things, e.g. `x=[([1, 2], 3), 4]`). This is all a really significant change
from the current object model.

~~~
viraptor
The first part is pretty much what I meant. I don't see a problem with nested
lists, though; they don't need any special treatment. Let's say you use a
variable->metadata->listdata scheme where listdata is prefixed with a "user
counter", similar to a refcount but for metadata pointers. You can then modify
x[0][0], which may or may not need to decrement the counter and make a copy of
[1,2], but the outer [a,3] doesn't have to change - it points at the same list
metadata.
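
As a toy Python-level sketch of that scheme (purely illustrative; a real version would live inside the interpreter, not in Python code):

    class ListData:
        """Shared storage, prefixed with a "user counter" for metadata objects."""
        def __init__(self, items):
            self.items = items
            self.users = 0

    class CowList:
        """Per-variable metadata object; detaches the storage before any write."""
        def __init__(self, data):
            self.data = data
            data.users += 1

        def copy(self):
            return CowList(self.data)  # O(1): just bumps the user counter

        def _detach(self):
            if self.data.users > 1:  # storage is shared, so copy before writing
                self.data.users -= 1
                fresh = ListData(list(self.data.items))
                fresh.users = 1
                self.data = fresh

        def append(self, x):
            self._detach()
            self.data.items.append(x)

        def __getitem__(self, i):
            return self.data.items[i]

    x = CowList(ListData([1, 2, 3]))
    y = x.copy()  # no list data copied yet
    y.append(4)   # the copy happens here; x is untouched
    print(x.data.items, y.data.items)  # [1, 2, 3] [1, 2, 3, 4]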

~~~
quietbritishjim
What I meant was: what if the outer list had a ref count of two, like if x and
y in the following code sample were automatically merged into a single list
with a ref count of 2:

    x = [[1, 2], 3]
    y = [[1, 2], 3]
    x[0][0] = 9  # the nested list needs to notify its owner(s)

Now I think about it, I realise that this could be a harder situation to get
into than I had thought at first. As the snippet above shows, it can happen if
you have a compiler/interpreter that interns entire nested lists at once.
But assuming that COW lists are only formed by trying to copy existing lists,
I'm not sure it's possible to end up in this situation. In real Python, if
y=list(x) then changing x[0][0] _should_ change y[0][0], as you said.
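
That is:

    x = [[1, 2], 3]
    y = list(x)          # shallow copy; the inner list is shared
    x[0][0] = 9
    print(y)             # [[9, 2], 3]
    print(x[0] is y[0])  # True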

------
octopope
The free list for lists seems to reuse more than just the empty lists you
mention:

    >>> a = ['non','empty','list']
    >>> id(a)
    4363176072
    >>> del a
    >>> b = ['new', 'list']
    >>> id(b)
    4363176072

~~~
rushter
Maybe I've missed something, but you can also get the same id (it's basically
an address in memory) because of how memory allocation works. There is a
special allocator in CPython which preallocates big chunks of memory and
constantly reuses them without allocation overhead. I have an article on this
too.

~~~
octopope
Ahh... interesting. Then I guess using the id of the tuples to show that they
are using the same object doesn't exactly prove the point.

To your original point (that tuples reallocate based on length) I see that if
I delete a tuple of length 3 and then create a tuple of length 5 I see the id
is immediately changed. So that's correct.
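
Here's that tuple experiment as a script (the reuse is a CPython implementation detail, so none of this is guaranteed):

    a = (1, 2, 3)
    addr = id(a)
    del a
    b = (4, 5, 6)
    print(id(b) == addr)  # typically True: CPython keeps a free list per tuple length
    del b
    c = (1, 2, 3, 4, 5)
    print(id(c) == addr)  # typically False: a length-5 tuple is served from a different free list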

Lists on the other hand seem to keep reallocating the same address in my
limited test.

Strings seem to behave like tuples. When I delete a string and create a new
one it creates a new object with a new address.... unless the strings are of
the same length.

Perhaps this is no real revelation; I'm rather new to Python and spending my
time poking around to see how it works. :)

~~~
ccmonnett
> Then I guess using the id of the tuples to show that they are using the same
> object doesn't exactly prove the point

Without the `del a`, it does, because they both have active references. If
they were unique objects, we'd see a unique ID for `b` as long as a reference
to `a` is active.

> Strings seem to behave like tuples. When I delete a string and create a new
> one it creates a new object with a new address.... unless the strings are of
> the same length.

String _objects_ (as opposed to variables referring to them) are immutable in
Python. They tend to be allocated anew, but for optimization reasons you can
end up with cases where the string objects have the same ID. Like here:

    >>> a = 'asdf'
    >>> id(a)
    4389881424
    >>> a = 'qwerty'
    >>> id(a)
    4395015672
    >>> a = 'asdf'
    >>> id(a)
    4389881424

'asdf' has the same ID with no `del` involved because its object hadn't been
GC'd yet.

Below are some relevant links if you want to knock yourself out (some do),
but I write an awful lot of Python and even for me this is well into the
realm of "what happens when..." conversations after a few beers, or job
interview trivia.

https://rushter.com/blog/python-lists-and-tuples/

http://guilload.com/python-string-interning/

------
ris
There are better reasons to prefer tuples over lists when mutation is not
needed, namely that it prevents unexpected mutation performed by other code to
which references have been "leaked". This makes code easier to read, because
such possibilities can be ruled out in the mind of the reader.

More generally, mutable objects encourage unnecessarily side-effecty code.
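
Concretely:

    def sneaky(items):
        items.append('oops')  # mutates the caller's data in place

    as_list = ['a', 'b']
    sneaky(as_list)   # silently succeeds; as_list has changed behind your back

    as_tuple = ('a', 'b')
    sneaky(as_tuple)  # AttributeError: 'tuple' object has no attribute 'append'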

------
nerdponx
It would be nice if Python had a fixed length but mutable sequence type.
MutableTuple anyone?
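
As a toy sketch of the idea (item assignment allowed, length fixed; a real one would want fixed-size storage underneath):

    class MutableTuple:
        __slots__ = ('_items',)

        def __init__(self, *items):
            self._items = list(items)

        def __getitem__(self, i):
            return self._items[i]

        def __setitem__(self, i, value):
            self._items[i] = value  # in-place mutation is allowed

        def __len__(self):
            return len(self._items)

        # deliberately no append/insert/del: the length never changes

    mt = MutableTuple(1, 2, 3)
    mt[0] = 99
    print(list(mt))  # [99, 2, 3]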

~~~
cdancette
Why do you have the fixed-length requirement?

Regular lists seem to work perfectly well for this.

~~~
dkersten
Presumably so you don’t have to pay for the memory or performance overhead of
supporting resizing? Possibly also to more closely match the semantics of the
code with what you want to do, to make it easier to catch errors.

I find Python’s core data structures really convenient and easy to use, but
sometimes the C++ programmer in me wishes that I had more options available to
me (e.g., Python has lists and tuples; C++ has doubly linked lists, arrays,
tuples, vectors, and double-ended vectors in the standard library), the idea
being that I can choose something that is a very close match to my usage
patterns.

Having said that, I’ve never actually found it to be a limitation in real
Python code and when memory or performance matters enough that these
differences start to show, Python is usually no longer the best tool for the
job anyway, so it doesn’t really need these things. Still, sometimes I feel
like I want them anyway, probably just because I’ve done enough C++ to be used
to having them :)

~~~
heavenlyblue
double ended linked lists - deque

arrays - array, numpy arrays

vectors - numpy again
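
The first two are in the standard library:

    from array import array
    from collections import deque

    d = deque([1, 2, 3])       # O(1) appends and pops at both ends
    d.appendleft(0)

    a = array('i', [1, 2, 3])  # compact storage with a fixed element type
    a.append(4)

    # vectors: numpy (third party)
    # import numpy as np
    # v = np.array([1.0, 2.0, 3.0])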

~~~
dkersten
Sure, with third party libraries (that are mostly written in C), you can get
whatever you want. I’m just talking about what’s part of the language. In C++,
the range of data structures is huge if I use third party libraries; for
example, EASTL expands the STL containers with intrusive variants and more.

Of course, an argument could be made that looking just at the language
(without third party libraries) is unrealistic and useless, and it’s true. You
could also argue that Python libraries are easier to install and use than C++
libraries (which is also true), but when you look at the languages in
isolation, Python only has a few basic structures while C++ has many.

Again, I acknowledge that this is an academic discussion since in real life,
I’ve never actually seen it matter and, when it does, libraries like numpy
that delegate the work to C exist and are easy to use.

~~~
heavenlyblue
I don't know: if you really need to do some heavyweight maths in C++, you're
going to go the CUDA/heavyweight-intrinsics way anyway. I don't yet see how
STL containers (excuse the pun) contain the performance problem at its limit -
but otherwise why would you use C++?

Also, having these specific structures in the standard library would make you
think that what you're doing is acceptable: worrying about the overhead of a
doubly linked list when your variable assignment operation requires an access
to a Namespace object based on a variation of a HashMap.

Seriously: whenever you need a fast solution, Python doesn't stop you from
building something in C++. But trying to make Python into C++ is either an
incredibly long shot or a sign of having no idea what Python is actually
useful for.

------
hcnews
I always scoff a bit when I see Python and Optimization/Fast/Efficiency in one
statement. IMHO Python should only be used for non-complex scripts. For other
things, just use better suited languages.

~~~
rushter
My article describes the tricks which are used inside the Python
_interpreter_. Every improvement in the interpreter saves an insane amount of
computing power considering its popularity.

