Python: copying a list the right way (precheur.org)
150 points by joeyespo on Nov 5, 2011 | 72 comments

Before you jump into your code, grep it, and change every instance of [:] into list or copy, know that it isn't that easy. Most Python projects will use all four common variations of copying a list or object. Here they are with benchmark times[1]:

    b = a[:]           0.039ms
    b = list(a)        0.085ms
    b = copy(a)        0.187ms
    b = deepcopy(a)   10.592ms 
Use the first method for short lists, e.g. function or system args, where you know you have a list. The Python manual suggests this method as the fastest/best way to copy a sequence[2].

The type constructor list will convert any sequence into a list and will preserve order. If you pass it a list, all it does is return the sequence using the slice operator anyway[3]. It is slower because of the type checking, but it is implemented in C. So you can think of list() as just [:] with a type cast - no need to call it again if you know you have a list.

copy and deepcopy are implemented in Python, and are generic functions that attempt to sniff the type of the object to be copied. They will use the __copy__ magic[4] of the object if it exists, so you can override it in your objects with return self[:]. You need to use these if you have a generator or a list of non-basic types (such as lists of lists, or lists of tuples, or lists of objects). Both functions use a module-level cache and deepcopy will iterate and apply copy.

There is very little performance degradation from aliasing copy to deepcopy and using it everywhere, and it could save you time by catching bugs. (Edit: scratch that, I got my benchmark wrong - deepcopy will still be slow even if you pass it a shallow list, see comment below, thanks tedunangst)
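A minimal sketch of the __copy__ hook mentioned above, assuming Python 3. The Playlist class and its tracks attribute are made up for illustration and do not come from any real project:

```python
import copy


class Playlist:
    """Hypothetical container, used only for this illustration."""

    def __init__(self, tracks):
        self.tracks = list(tracks)

    def __copy__(self):
        # copy.copy() looks up this hook and calls it instead of its
        # generic fallback, so the inner list is sliced, not shared.
        return Playlist(self.tracks[:])


original = Playlist(["a", "b"])
clone = copy.copy(original)
clone.tracks.append("c")

print(original.tracks)  # ['a', 'b']
print(clone.tracks)     # ['a', 'b', 'c']
```

Without the hook, copy.copy() would build a new Playlist whose tracks attribute still points at the same list as the original.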

Read the source of copy and deepcopy so you can understand them and can implement your own custom version for more advanced types. Find the file:

    >>> import copy
    >>> copy.__file__
Each of these methods has its own use case; if you grep through a well-implemented project such as Werkzeug[5] you can see how each is used efficiently. For example, [:] is used when you know you have a list, such as template variables. list() is used to force a value into a list, e.g. before these vars get to other objects. copy() is used on custom data types, when making a copy of environ (which can contain almost anything) and when copying the routing table (which cannot be trusted to be a list).

[1] Benchmark times taken from: http://stackoverflow.com/questions/2612802/how-to-clone-a-li... which I had bookmarked as a reference

[2] http://docs.python.org/faq/programming.html#how-do-i-copy-an...

[3] http://docs.python.org/faq/programming.html#how-do-i-convert...

[4] http://www.brpreiss.com/books/opus7/html/page85.html

[5] https://github.com/mitsuhiko/werkzeug

I'm puzzled by your comment that there's very little degradation using deepcopy everywhere. Your numbers demonstrate quite the opposite.

Thanks for noticing - I got my numbers completely off. When I ran the benchmark on my machine it turns out it was still using the original copy.

One way to catch deepcopy bugs might be to create an autocopy function which can detect if it is a 'shallow' object and use copy, or if not use deepcopy.

I am going to try and write an implementation that doesn't slow it down too much. It might be worthwhile since copy bugs are so common in Python projects.
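One way such an autocopy helper might look, as a rough sketch only. The name autocopy and the set of types treated as "flat" are assumptions made for this example, and the scan over the list has its own cost:

```python
import copy

# Types treated as "flat" here. This tuple is an assumption for the
# sketch, not an exhaustive list of immutable types.
_ATOMIC = (int, float, complex, bool, str, bytes, type(None))


def autocopy(obj):
    """Slice-copy a list of atomic values, otherwise fall back to deepcopy."""
    if isinstance(obj, list) and all(isinstance(item, _ATOMIC) for item in obj):
        return obj[:]
    return copy.deepcopy(obj)


flat = [1, 2, 3]
nested = [[1, 2], [3, 4]]

flat_copy = autocopy(flat)      # takes the cheap slice path
nested_copy = autocopy(nested)  # takes the deepcopy path
nested_copy[0].append(99)

print(nested)       # [[1, 2], [3, 4]]
print(nested_copy)  # [[1, 2, 99], [3, 4]]
```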

I wonder whether it would be possible to optimise the Python interpreter to make deep copies copy-on-write. I suppose that would involve a lot of work for relatively little gain.

I remember that being mentioned in a PEP somewhere but it never got implemented. It might be worth implementing copy in C with copy-on-write to bring some of those benchmark numbers down.

Can you explain why `[:]` is faster for a small list (10 elements) but `list()` is faster for a larger list (100000 elements)?

    ~$ python -S -mtimeit -s "a = list(range(10))" "a[:]"
    1000000 loops, best of 3: 0.198 usec per loop
    ~$ python -S -mtimeit -s "a = list(range(10))" "list(a)"
    1000000 loops, best of 3: 0.453 usec per loop
    ~$ python -S -mtimeit -s "a = list(range(100000))" "a[:]"
    1000 loops, best of 3: 675 usec per loop
    ~$ python -S -mtimeit -s "a = list(range(100000))" "list(a)"
    1000 loops, best of 3: 664 usec per loop

If you are copying many lists, or copying many large lists, you still want to use the slice method ( new_list = old_list[:] ) as it is faster than list(). It is also the fastest method of copying lists if you consider copy.copy() and copy.deepcopy() as well.

The only caveat here is if you are copying a list of lists, you have to use copy.deepcopy() if you want the lists inside your lists to actually be copied.
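To make the list-of-lists caveat concrete, here is a small illustration (Python 3 syntax, variable names invented for the example):

```python
import copy

rows = [[1, 2], [3, 4]]
shallow = rows[:]           # new outer list, same inner lists
deep = copy.deepcopy(rows)  # new outer list and new inner lists

rows[0].append(99)

print(shallow[0])  # [1, 2, 99]
print(deep[0])     # [1, 2]
```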

Tested copy(), [:], and list():

  >>> import timeit
  >>> t1 = timeit.Timer('copy.copy(orig)','import copy;import random;orig = [random.randint(0,255) for r in xrange(100000)];')
  >>> t2 = timeit.Timer('orig[:]','import copy;import random;orig = [random.randint(0,255) for r in xrange(100000)];')
  >>> t3 = timeit.Timer('list(orig)','import copy;import random;orig = [random.randint(0,255) for r in xrange(100000)];')

  >>> print t1.timeit(10000)/10000
  >>> print t2.timeit(10000)/10000
  >>> print t3.timeit(10000)/10000
Probably not the world's most ideal testing here. I have no idea if/how Python caches, for instance. But slice notation continually comes out the slowest in this crummy example.

Not at my computer right now to whip up a simple benchmark (I'm on my phone), but there is a pretty exhaustive benchmark here: http://stackoverflow.com/questions/2612802/how-to-clone-a-li...

With smaller lists, the results in my test change:

  >>> t1 = timeit.Timer('copy.copy(orig)','import copy;import random;orig = [random.randint(0,255) for r in xrange(10)];')
  >>> t2 = timeit.Timer('orig[:]','import copy;import random;orig = [random.randint(0,255) for r in xrange(10)];')
  >>> t3 = timeit.Timer('list(orig)','import copy;import random;orig = [random.randint(0,255) for r in xrange(10)];')

  >>> print t1.timeit(10000)
  >>> print t2.timeit(10000)
  >>> print t3.timeit(10000)
Which is similar to those results.

This implies more setup cost for copy() and list(), but after that they're faster.

If not for noise, list() should always outperform copy() as copy() just calls list() internally (specifically type(l)(l)), and also incurs the cost of several interpreted function calls.

Also, the minor differences between slice and list() for large lists are likely platform dependent and highly sensitive to the details of branch prediction and cache.
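For anyone who wants to repeat the comparison, here is a rough Python 3 sketch. The helper name time_copies is invented for this example, the lambdas add the same call overhead to every candidate, and the numbers will vary by machine and interpreter version, so treat them as indicative only:

```python
import copy
import timeit


def time_copies(size, number=1000):
    """Return best-of-3 seconds per call for each copy method."""
    data = list(range(size))
    candidates = {
        "slice": lambda: data[:],
        "list": lambda: list(data),
        "copy": lambda: copy.copy(data),
    }
    return {
        name: min(timeit.repeat(func, repeat=3, number=number)) / number
        for name, func in candidates.items()
    }


for size in (10, 100000):
    print(size, time_copies(size, number=200))
```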

>The only caveat here is if you are copying a list of lists, you have to use copy.deepcopy() if you want the lists inside your lists to actually be copied.

This caught me bad once. I wish it were on the page linked in the post (and others like it).

a[:] feels a bit too much like Perl.

The irony being that in Perl, a list copy looks like this:

   @dest = @src

Yes. And why is there this widely-held obsession that people who don't know your language should be able to tell what your code does (i.e., 'readability')? When will this ever be relevant in the real world?

I don't know many non-Python programmers who like to sit down of an evening, fire up their e-reader, and peruse a few hundred lines of beautiful Python code.

Because there are extra mental operations involved in translating what the code does to what you think/know it does.

We all learn a natural language and its grammar, and use it daily. It's very important for a machine language to be as cognitively compatible with a natural language as possible (aka readability), because that is extremely important for productivity.

p.s. The above is particularly relevant to Python, because it (justifiably) prides itself on its emphasis on readability. To me, even though Python is 5 to 30 times slower than Java, it holds the keys to the future because of its commitment to readability. In the long run, that is the single most important feature of any language - other fundamentals are necessary, but not sufficient to ensure a language's longevity, and can be fixed more easily.

There are times when it's handy to have readability, albeit in one-off situations.

The most relevant one I recently encountered was working with a team and needing to get some numbers crunched - I had knowledge of scipy and was able to show the code to others who hadn't used Python but could still understand it.

Hmm, like the parent comment, you too confuse readability of a language with it being readily readable by non-programmers in that language.

Readability describes how clear, concise, non-ambiguous, etc a language's syntax is and also the code one writes is. It's not about sharing your code with others who don't know the language, it's about making actual use and collaboration in the language (in a shared project) better.

There are many people who are not experts in your programming language but need to understand it, though. When you work in an environment where you need domain people, who are hired for their domain first and are programmers second, this can matter a lot.

Perhaps, but list() is certainly easier to google/find in the docs than [:]

`[:]` is as easy to grep as anything:


If you don't know that `[:]` relates to slices then you might start by reading http://docs.python.org/tut

"""And why is there this widely-held obsession that people who don't know your language should be able to tell what your code does (i.e., 'readability')? When will this ever be relevant in the real world?"""

Em, you got it wrong. Readability is not about people "not knowing your language". It's about people knowing your language and having to read your code at a later point.

The problem with a language with poor readability is that it is hard to read even your own code written in it, because the syntax is ambiguous and funky and it involves a large mental overhead.

He's probably referring to this, from the article:

"Isn’t it better, less cryptic, and more pythonic? a[:] feels a bit too much like Perl. Unlike with the slicing notation, those who don’t know Python will understand that b contains a list."

Well I don't think that in

    b = list(a)
it's clear in any way that the purpose of "list" is to make a copy.

No, but it's clear that it returns a list out of "a".

My guess upon seeing list(a) for the first time would have been that it returns [a], which is worse than not knowing what it does.

Possibly this is because I've known lisp longer than python.

I agree. I write Python all the time, and I would have expected something like list(*a).

I can and do send bits of my Python code to non-Python developers for them to examine for things like business logic errors. The only reason this is practical is because the code is clear and explicit about what it's doing.

I also like list(s) for its clarity and universality (the same technique works with dict(m) and deque(d) for example).

The s[:] gets called faster (builtin syntax dispatches directly) than list(s), which requires a global lookup. Once called, though, they both run the same underlying code and are therefore equally fast when it comes to the actual copying.

In Python 3.3, we're adding list.copy() and list.clear() because so many people were having issues with the [:] notation for copying and clearing.
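A short illustration of the constructor idiom and the list.copy() / list.clear() methods mentioned above, assuming Python 3.3 or later. The variable names are arbitrary:

```python
from collections import deque

items = [1, 2, 3]
mapping = {"a": 1}
queue = deque([1, 2])

# The same "call the type on the object" idiom works across containers.
items_copy = list(items)
mapping_copy = dict(mapping)
queue_copy = deque(queue)

dup = items.copy()  # list method, equivalent to items[:] for a list
dup.append(4)
print(items, dup)   # [1, 2, 3] [1, 2, 3, 4]

dup.clear()         # list method, same effect as del dup[:]
print(dup)          # []
```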

Unless you actually want to turn some other type of iterator into a list, why wouldn't you use python's generic copy?

  from copy import copy
  b = copy(a)
This has the advantage of being advertised as a shallow copy, so the intent of the operation and its effects are clear and well-documented.

I've been working with Python for a year now, and I don't claim to have swum its depths.

Because copy won't necessarily work the same way on things that are not lists, which removes some of the flexibility afforded by python's duck typing. Unless you have good reason to require a list specifically, why should you?

But don't the alternatives also depend on list slicing? Heck, the title is `copying a list`. If you have a different datatype, it is the business of that type to specify how to copy using __copy__, isn't it?

Imagine a case like this:

  def example(input_data):
    l = list(input_data)
    # code that uses list-specific stuff and returns a result
The function 'example' doesn't want to touch the original input data, so it needs to make a copy of it. It also contains code that assumes operation on a list, so the copied value needs to support list-like operators. If you assume input_data is a list, you can use [:] or copy() to copy just fine, but if input_data is NOT a list then you cannot feed example a generator or some other list-like object or iterable and know for sure that it is going to work. By explicitly converting to list, you can take anything that implements __iter__, and then safely assume that the rest of your code will be working with lists. This adds a fair bit of extra flexibility to the function and can make it much easier and/or cleaner to use.

Obviously as with anything the choice of list copy method is situation-dependent. Using [:] makes sense if you can guarantee the input is a list and you need maximal speed. Using copy() makes sense if you just want a copy of the input object and don't specifically care that the copy is itself a list. Using list() makes sense if you want to be able to take in all kinds of input values and be assured that the copy is a list. Use what is best for the situation at hand.
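As a concrete version of that pattern, here is a small sketch in Python 3. The function sorted_tail and its count parameter are hypothetical and exist only to show list() accepting different kinds of input:

```python
def sorted_tail(input_data, count=2):
    """Return the `count` largest values without touching the caller's data."""
    values = list(input_data)  # accepts lists, tuples, generators, ...
    values.sort()              # list-only method, safe after the conversion
    return values[-count:]


source = [3, 1, 2]
print(sorted_tail(source))                   # [2, 3]
print(source)                                # [3, 1, 2]
print(sorted_tail(x * x for x in range(5)))  # [9, 16]
```

Swapping list(input_data) for input_data[:] would still work for the first call but would fail on the generator, which does not support slicing.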

You can't copy a generator that way.

Using list(gen) will fully consume the generator, which may be undesirable, but at least it will generate a list with the correct values.
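A quick way to see both points, sketched in Python 3 with an invented numbers() generator. The try/except is there because, as far as I can tell, copy.copy() refuses generator objects with a TypeError:

```python
import copy


def numbers():
    yield from (1, 2, 3)


gen = numbers()
try:
    copy.copy(gen)
    copied = True
except TypeError:
    # Generators do not support the copy protocol.
    copied = False

values = list(gen)    # builds the list by draining the generator
leftover = list(gen)  # nothing is left on a second pass

print(values, leftover)  # [1, 2, 3] []
```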

As somebody who doesn't know python, I find a = list(b) to be not very intuitive. b is a list already, why call list() on it?

So I don't think you gain any readability over the slicing syntax.

I've done quite a bit of Python and I agree. If you want to be "readable" or explicit that you want to copy the list, then you use copy.copy().

There is absolutely nothing wrong with [:]. This kind of judgmental critique of code is completely absurd IMO. If someone doesn't know something as basic in Python as slices, he or she shouldn't be reading Python code. If by all means you want to have people who don't know Python reading and understanding your code somehow, try copy.copy(). It's not like the performance hit is meaningful compared to list() and [:] is the fastest of the three.

Copy constructors are a much more common idiom in programming than taking a slice of an entire list.

So, his argument is that copying using

    a = b[:]
is cryptic? I'd say it follows directly from how slice notation works. If you omit the first index, the slice goes to the beginning of the list. If you omit the second index, the slice goes to the end of the list. So, if you omit both indices, everything works as you'd expect it to if you understand slice notation. Anyone working with lists in Python should understand slice notation, so I fail to see the problem here.
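The default indices can be checked directly. A tiny illustration in Python 3 with arbitrary values:

```python
b = [10, 20, 30, 40]

print(b[:2] == b[0:2])       # True: a missing start means 0
print(b[2:] == b[2:len(b)])  # True: a missing stop means len(b)

a = b[:]                     # both omitted: the whole list
print(a == b, a is b)        # True False
```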

Let's look at the line:

    list1 = list2[:]
That line means "list1 is a list that has all the elements of list2", not "list1 is a new list that has all the elements of list2". It just happens that the implementation of slices in the Python interpreter creates a new list for a slice. You may say it is 'obvious' that it works like that in Python, and I will agree with you. You might not be able to say the same thing for other languages, especially if you haven't used them before.

The problem is, even with a simple knowledge of slices, someone reading your code may not realise that list1 is actually a new list; you are depending upon a side-effect of the language. If you used the Copy module, or even the list constructor, it would have been much clearer that list1 is a new list. This increases readability, which is very important if other people, or even yourself in the future re-read the code. Personally, I prefer readability in really long programs, and it isn't like you're programming C on an embedded device here.

His argument is that Python beginners won't be familiar with slice syntax, and if they encounter it in someone else's code they won't understand.

I actually have to agree: if you come across something like

    a = b[:]
in code, you are either going to understand what it means or you are probably missing a large portion of the other things going on in the Python code.

With that said I must also state that I think

    a = list(b)
is much prettier and easier for me to understand.

They need to learn it, then. Honestly, anyone who read and comprehended my first post now understands slice notation (with the exception of the more rarely used stride argument), so it's not like it's a ton of effort. Unless the post author is claiming that slice notation itself is cryptic, I don't even understand where he's coming from.

I happen to agree that it's more intuitive to use list() instead of slice notation, but back when I had no idea about slice notation and first saw it, it caused me to wonder what else could be done with it. After some reading, I learned a lot more about different slice tricks than I would have if I were just presented with list().

You can't tell just by looking at b[:] that it creates a copy unless you know its type, e.g. if `b` is a numpy array then it doesn't copy the data, and by changing b[i] you change a[i].

list(a), copy.copy(a) are more explicit.
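The same type dependence can be shown without numpy. This sketch uses the standard library's memoryview as a stand-in (my substitution, not the numpy case the comment describes), where slicing also yields a view rather than a copy:

```python
buf = bytearray(b"abc")
view = memoryview(buf)

window = view[:]      # a new memoryview over the same bytes, no data copied
window[0] = ord("z")
print(buf)            # bytearray(b'zbc')

items = [1, 2, 3]
dup = items[:]        # for a list, the slice is an independent copy
dup[0] = 99
print(items)          # [1, 2, 3]
```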

>If you omit the first index, the slice goes to the beginning of the list. If you omit the second index, the slice goes to the end of the list.

Unless you use the stride (the third argument to the slice, the part that defaults to 1). If that is negative, then everything goes the opposite way, as in foo[::-1] (the reverse of foo).

I'm always wondering why people have to write 3 pages instead of getting to the point. You know like, just a list.

- a = b <= doesn't copy! it's just a reference

- a = list(b) <= works!

- <other methods if you like>

using list() in this manner feels a bit weird to me.

If you want to be truly explicit, why not use the built-in copy module?


    from copy import copy
    b = copy(a)   #or deepcopy(), depending on your needs

*edit: I should note that the docs for copy suggest using the slice operator.

    b = a[:]
This happens to be the fastest, and the language-recommended, way to copy lists.

I really don't buy the argument about languages being intuitive to people who don't know the language. Languages should be optimized for people who know the language. I think Matz has mentioned it somewhere regarding "principle of least surprise" and Ruby.

Knowledge of a language is not a Boolean value. I use Ruby on a daily basis, and occasionally read (and to a lesser extent write) Python, and I knew what "list(a)" does instantly, and I could only guess about "a[:]". Fortunately, both Python and Ruby have very few such gotchas, that's why I love them both.

This technique is also useful for iterating over a list and conditionally removing items. If you don't slice it you are changing the list as you iterate over it and you get unexpected results (like skipping items).

Or you could just use list comprehensions :)
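Here is what the skipping looks like in practice, next to the slice-copy fix and the list comprehension alternative. A small Python 3 illustration with made-up variable names:

```python
skipped = [1, 2, 2, 3, 4, 4]
for n in skipped:        # mutating the list while iterating over it
    if n % 2 == 0:
        skipped.remove(n)
print(skipped)           # [1, 2, 3, 4]  (some even numbers survived)

safe = [1, 2, 2, 3, 4, 4]
for n in safe[:]:        # iterate over a copy, mutate the original
    if n % 2 == 0:
        safe.remove(n)
print(safe)              # [1, 3]

odds = [n for n in [1, 2, 2, 3, 4, 4] if n % 2 != 0]
print(odds)              # [1, 3]
```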

Despite being a non-programmer, I found this article really user-friendly.

I went through my code and discovered that I seldom copy lists. I do often make selections from lists in a new list, i.e:

    b = [x for x in a if somecondition(x)]

This is analogous to javascript:

    b = a.slice()

    b = Array.apply(null, a)
(the second is fastest and arguably clearer)

Something else that's interesting is reversing a list but not in place: x[::-1]

The built-in reverse() method will do it in place.

There is also the top level reversed() function that takes some sequence and returns a reversed iterator.
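The three reversal options side by side, as a short Python 3 illustration:

```python
x = [1, 2, 3]

backwards = x[::-1]   # new reversed list, x is untouched
lazy = reversed(x)    # iterator over x, nothing copied yet
as_list = list(lazy)  # materialise the iterator

result = x.reverse()  # reverses x in place and returns None

print(backwards, as_list, x, result)  # [3, 2, 1] [3, 2, 1] [3, 2, 1] None
```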

What's wrong with `b = a + []`?

You are depending on an undefined side-effect, specifically that adding two lists returns a new list. This is a very bad habit, especially if you move to a different language that doesn't have such behaviour. When you are programming, you should be writing what you want to do, not depending on side-effects to do it for you. People reading your code (including yourself in the future) would have to spend a lot more time working out what the code does otherwise. The only real exception to this is C on embedded hardware, where you really need to use lots of these tricks.

What if you wrote `b = b + []`? Some languages might just append to b and not create a new list (Python seems to create a new list). Slices can still be seen as having the same problem. Really you should be using the copy module or the list constructor, which have the implicit guarantee of a new list.

This argument is silly. Python is not some other programming language, so when you're writing Python, it doesn't matter what other programming languages do. If you use the same idioms in every programming language that you use, you're writing bad code in every programming language that you use. So don't do that.

The reason why list(x) is better than x + [] is because list(x) works regardless of what type of iterable x is. x + [] only works on lists.
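A quick check of that difference, sketched in Python 3. The two helper functions are named only for this example:

```python
def copy_with_plus(x):
    return x + []


def copy_with_list(x):
    return list(x)


print(copy_with_plus([1, 2]))  # [1, 2]
print(copy_with_list((1, 2)))  # [1, 2]

try:
    copy_with_plus((1, 2))
    plus_failed = False
except TypeError:
    # Concatenating a tuple and a list is not defined.
    plus_failed = True
print(plus_failed)             # True
```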

x + [] or x[:] still has a readability issue: to anyone who hasn't done much Python before, it isn't immediately obvious that you want to create a new list. For personal scripts, this may not matter too much, but for code that is seen by other people, you may be inhibiting their understanding.

A good analogy is probably assuming that pointer sizes are the same as int sizes in C. This assumption was safe for many years, but broke when 64-bit came along. Slices and adding lists will probably always return new lists in python, but it is still good not to depend on such behaviour.

It's silly for you to dumb down code for people new to the language. If you treat newbies like they're dumb, they'll never learn how to write "real" code. The only way they'll do that is by reading real idiomatic code.

>Slices and adding lists will probably always return new lists in python, but it is still good not to depend on such behaviour.

Not buying this. The behavior is documented and Python has a deprecation cycle for changes in documented behavior.

The reason to write list(x) is because that's the generic way to turn anything into a list. It's not for being future proof or being easy for newbies to understand. It's because that's the right way to do it.

Python is committed to code readability. It has nothing to do with people being dumb. It has to do with making a language which is cognitively compatible with a spoken language that we have been practicing every day since birth. var[:] is obscure and requires more mental effort to parse than list(var). Because of that, list(var) is preferable to var[:] from a readability standpoint as well.

That isn't an undefined side-effect, that's the definition of the operation.

It depends on which scope you define it at. Within the python interpreter, and the guarantees it gives, it is one of the definitions. However, at a higher level, the definition is really to give one variable the value of one list appended to another; without knowing how python works, you can't guarantee that it will be a new list.

A much more interesting line of python is

b[:] = a # copy a into b, while b keeps its identity.
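To see what "keeps its identity" means, compare slice assignment with plain rebinding. A small Python 3 illustration; the alias name is just for the example:

```python
a = [1, 2, 3]
b = [9, 9]
alias = b          # a second name for the same list object

b[:] = a           # replace b's contents in place
print(alias)       # [1, 2, 3]
print(b is alias)  # True
print(b is a)      # False

b = a[:]           # rebinding: b now names a brand-new list
print(b is alias)  # False
```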

list(x) is also used for converting a sequence x to a list. x[:] is actually more explicit because it only means 'copy x'.

So, uh, why are people doing so much list copying? It's not something that occurs often in idiomatic code. I understand that it's something to be aware of, but it's just not something that is required often.

Calling methods like pop(), insert(), and remove() on a list actually affects the contents of a list rather than returning a _copy_ of the list with everything removed. For example:

    all = range(10)
    allbut2 = all.remove(2)

Actually removes 2 from all as well. Hence, you have to copy lists a lot if you are doing a lot of list creation or change from a master list.

list.remove() returns None. So it is a mistake to bind its return value to allbut2.

It follows the convention that methods that modify their object in place should return None. list.pop() is an obvious exception.
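The convention and the copy-then-mutate pattern, spelled out in Python 3. The name master stands in for the "master list" from the parent comment:

```python
master = list(range(5))

result = master.remove(2)  # mutates master and returns None
print(result, master)      # None [0, 1, 3, 4]

last = master.pop()        # the exception: pop returns the removed item
print(last, master)        # 4 [0, 1, 3]

master = list(range(5))
allbut2 = master[:]        # copy first, then mutate the copy
allbut2.remove(2)
print(master, allbut2)     # [0, 1, 2, 3, 4] [0, 1, 3, 4]
```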

You create a new list via list comprehension instead of copying and then removing:

  even = [i for i in L if i % 2 == 0] # remove odd numbers

I believe he meant:

    all = range(10)
    allbut2 = all
which is something that catches people all the time.

I'm totally aware of that. I'm just confused because in idiomatic Python, list copies without a map, filter, or reduce operation are very rare, so a list copy is usually replaced with a list comprehension or iterative loop of some sort.

For example, in Bravo, there are exactly eight list copies, all in contributed code which I didn't write, and another six in Exocet, which I also didn't write. The ones in Exocet are probably required, but the others are from code that was written without forethought.

I knew white space means a lot in Python (Disc: newbie and still learning Python!). Just found this, and thought someone out there might be able to throw some more light.

  Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
  Type "copyright", "credits" or "license()" for more information.
  >>> a=[1,2,3]
  >>> b=a
  >>> b
  [1, 2, 3]
  >>> a=[1]
  >>> b
  [1, 2, 3]
  >>> id(a)
  >>> id(b)
  >>> c = [4, 5, 6]
  >>> d = c
  >>> id(c)
  >>> id(d)
  >>> print c, d
  [4, 5, 6] [4, 5, 6]
  >>> c = [7]
  >>> print c, d
  [7] [4, 5, 6]
  >>> e = c
  >>> c.append(8)
  >>> print c, d, e
  [7, 8] [4, 5, 6] [7, 8]

c = [4,5,6] is creating a new list, and assigning it to c; not modifying the list that c pointed to. The '=' operator simply associates an object reference to a variable. Going d=c is copying the reference to an object from one variable to another.
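The same session can be condensed with the `is` operator, which compares object identity directly. This is a Python 3 restatement of the steps above, not new behaviour:

```python
c = [4, 5, 6]
d = c             # d and c name the same list
print(d is c)     # True

c = [7]           # rebinding c creates a new list, d is unaffected
print(d is c, d)  # False [4, 5, 6]

e = c             # e and c name the same list
c.append(8)       # mutation is visible through both names
print(e)          # [7, 8]
```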
