> Does this really count as mutability though? Not really. The string would be thrown away anyway so this is just an optimization to reuse the memory. The important takeaway here is that you are not allocating a new string every single time like the internet says.
This is an undocumented optimization, so you should assume you're allocating a new string every single time like the internet says.
I've been coding Python since the 1.5.2 days, and so I'll continue to use lists as intermediate string builders and then join afterwards, because I know this works in past, present, and likely future versions of Python.
Some interesting results. Joining on a generator is slower than joining on a list, because the list has a known size beforehand, which lets Python perform some optimizations, even though it's more memory intensive. Appending a string is surprisingly fast. Building an intermediate list is not.
While I do get your point, Python likes making optimizations. They rarely, if ever, make patterns slower by choice. There's nothing in any spec that says it _has_ to be slow: that's as much an implementation detail as "joining lists is fast".
In [18]: sys.version
Out[18]: '3.9.1 (default, Feb 3 2021, 07:38:02) \n[Clang 12.0.0 (clang-1200.0.32.29)]'
In [6]: strings = ["".join(random.choice(string.hexdigits) for _ in range(10)) for _ in range(10)]
In [9]: def one():
   ...:     "".join(s for s in strings)
   ...:
In [10]: def two():
   ...:     "".join([s for s in strings])
   ...:
In [11]: def three():
   ...:     x = []
   ...:     for s in strings:
   ...:         x.append(s)
   ...:     "".join(x)
   ...:
In [12]: def four():
   ...:     x = ""
   ...:     for s in strings:
   ...:         x += s
   ...:
In [13]: %timeit one()
753 ns ± 9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [14]: %timeit two()
521 ns ± 5.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit three()
696 ns ± 3.94 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [16]: %timeit four()
620 ns ± 4.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Joining on a list (as in function two()) is no more memory intensive than joining on a generator expression (function one()). This is surprising at first. Raymond Hettinger has a very good explanation.[0] In short, it's because "".join() needs to iterate over the elements twice, and so it needs to store the whole sequence of strings anyway.
You can write Python to be performant for either CPython or for PyPy. It's very difficult to write code that is performant on both. If you are targeting PyPy, be explicit about it and go full RPython on it.
If you are targeting CPython, try to be "pythonic"[0] and then drop to C/C++ (through numba, cython, nuitka, pure C or whatever way you like best) if your performance requirements need it.
I'm not familiar enough with Jython and IronPython internals to know the best way to handle performance for them. Most probably it's very different from CPython and PyPy.
So like you said, keep things reasonable and don't depend on implementation details--unless you are explicit in your target/supported interpreter and you absolutely need to squeeze the performance out of the code.
a) that surprises me, because in my short experience with it, I found that PyPy is more about simpler idioms if you are aiming for performance, a bit C-ish in style
b) it's been something like... 3 years (I think) since I last used PyPy for some serious performance testing of code. And even then it was a horrible proof of concept that barely ran, but it helped me point out that "hey, we optimized it to 3.5x faster with CPython, and PyPy is getting to 4.5x over the original code, but it's throwing a thousand errors and warnings on screen and we can't be sure it'll work in production"[0].
Probably things have changed a bit on the PyPy side, though I don't think RPython has changed much in concept.
Also I see I never got to explain the pythonic note in my previous comment: basically, whatever "pythonic" means to the team. In general I try to go for the definition that is "it's easy to read and understand what's going on with the code, and provides easy-to-use interfaces for programmers". Not an easy thing to do, but at least trying to do it helps a lot.
[0] the job was that there was a large program that was bottlenecking a process in a lab, and somebody threw the idea of "oh we should use pypy because it'll surely make things run faster". And then it was my responsibility to pick that skeleton up and take care of it.
Oh and the code was pretty inefficient to begin with, which is why we got to speed it up so much without even going into C. A quick test with Cython put it in the 7-10x range of speedups, but the client wasn't willing to handle that, so they just kept it in CPython.
Yeah; I'm pretty sure this isn't anything you can take advantage of. But I think the point of the post is an "isn't this neat?" rather than "because of an optimization, one approach isn't as bad as it could be and you should take advantage of that".
The optimization got moved to unicode objects during the great Python 3 upheaval, and AFAIK it still remains unimplemented for what are now bytes objects. The paramiko SSH library relied on it heavily; as a result, its runtime is (AFAIK still) horrific on Python 3.
This one is less painful for the devs to break, if they need to, because it only works 99% of the time. If you rely on this behaviour for correctness, rather than performance, it will set you right pretty quickly.
If you do want to take advantage of this, maybe you can go a level deeper: dig into Python's behaviour when allocating memory for strings, and figure out in what circumstances you can be guaranteed to get the same id back. E.g. maybe if you create a 28-character string, there will always be room to append 4 more.
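To see what such digging looks like in practice, here is a minimal, hedged probe: it appends one character at a time and records whether `id()` stayed stable across each `+=`. Whether any append happens "in place" is purely a CPython implementation detail (it depends on version, refcounts, and the allocator), so the sketch only observes it rather than relying on it.

```python
# Hypothetical probe: watch whether the interpreter reuses the same
# allocation when appending to a string with a single reference.
# The results are an implementation detail and may vary by version,
# build, and string size -- only the final string value is guaranteed.
s = ""
stable = []
for _ in range(40):
    before = id(s)
    s += "x"
    stable.append(id(s) == before)

print(s == "x" * 40)  # True: the observable semantics never change
print(any(stable))    # whether any append was "in place" can vary
```

Even when `id()` does stay stable, that only tells you the address was reused, not that you may depend on it.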
Right, but my thought is "at what times would it make sense to join, and not append, but this optimization would be faster than joining"? I'm guessing not very many.
Because, yeah, the fact that str += "?" is faster isn't a big deal; that's the natural, ergonomic way to deal with it, because you're appending all of one string. Likewise foo + " " + bar + "?" is probably easier to write that way, than to drop them into a list and join (but even if not, I'd be curious if it's actually any faster; this article doesn't measure). By the time you get to joining large amounts together, concatenating a CSV or something, you're going to use join, naturally, and join is going to be more performant.
So to my mind it's kind of an open question, at what points is the non-ergonomic thing going to be the faster thing? That's what "taking advantage of (an optimization)" feels like; otherwise you're just writing code and letting the performance fall where it may.
> Right, but my thought is "at what times would it make sense to join, and not append, but this optimization would be faster than joining"? I'm guessing not very many.
I’m guessing all of them. Because what it does is what join will do internally anyway, but without the overhead of the list itself, and with operations which have lower overhead than the list’s.
Yes, you are. I'm saying this is a more interesting question, and one that isn't answered. Because the reality is that if you're appending just a handful of things together, the idiomatic approach is to concatenate them, and this optimization comes into play. If you're concatenating a lot of items together... you probably already have a list, and the idiomatic approach to join them is probably faster than a for loop to concat them over and over. So the question, then, is: is there a point where idiomatically it makes more sense to join things (putting them in a list if they aren't already, but they might be; depends on the example), but this will be more efficient?
>> Does this really count as mutability though? Not really. The string would be thrown away anyway so this is just an optimization to reuse the memory. The important takeaway here is that you are not allocating a new string every single time like the internet says.
The important takeaway is that semantic immutability does not require or depend on implementation immutability. It's a 'levels of abstraction' thing. Once you let the users peek behind the curtain, however, it becomes much more difficult to be sure they cannot shoot themselves in the foot by so doing.
This does play a role if for some reason you use object IDs, e.g. when deserializing data with cross-links. Keeping a real second reference becomes important.
No. The strings having the same address doesn't mean they're the same strings. This is an interesting optimization but one Python string is being replaced by another string at the same address. Even without this optimization the address will be reused eventually.
The closest thing Python has to a specification is the Language Reference. It shies well away from performance details and aims to describe "syntax and core semantics" while acknowledging that some parts of what it specifies may just be implementation details of CPython [0].
To draw the line on what you can assume re performance, you need a combination of background knowledge, folk wisdom and common sense. (You could say "or read the code" but that could change in the next minor version: you still need common sense to know if that's likely). This is unlike, e.g. C++ where there is a specification and standards-compliant compilers must implement certain functions with the specified runtime complexity.
If a tree falls in a forest, and nobody observes it, does it make a sound? The tree can choose, because there's nobody to notice the difference.
Mutable state is not bad; shared mutable state is bad — or, at least, takes a lot of care to handle. So CPython mutates strings which are not observed by anyone else but the mutating code.
This is exactly the idea behind uniqueness types [1] and move semantics in C++ and Rust: mutation is safe as long as it's not shared.
The advice from Python core developers is not to rely on that optimization — it is fragile and implementation dependent. Instead, please use str.join() to concatenate a bunch of strings together. That is guaranteed to not be quadratic.
When Armin Rigo's brilliant optimization was added, the rationale was mitigation-of-harm: it could sometimes save users who weren't following the advice. Also, it sometimes helped with the then-common technique of building a short string with a series of concatenations done with "+".
One other personal thought as a code reviewer and teacher: If you teach someone to build strings with "+" or "+=" and the optimization prevents a negative consequence, then don't be surprised if they try this with other sequence types where no such optimization exists.
Really, just learn str.join() and itertools.chain(). Both of those scale nicely.
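As a small illustration of that advice, here is a sketch (the variable names are made up) that flattens nested iterables of strings with itertools.chain and joins exactly once at the end, avoiding both repeated concatenation and a large intermediate list:

```python
import itertools

# Flatten nested iterables of string fragments lazily, then join once.
# str.join() performs a single allocation for the final result.
chunks = [["GET ", "/index "], ["HTTP", "/1.1"]]
request_line = "".join(itertools.chain.from_iterable(chunks))
print(request_line)  # GET /index HTTP/1.1
```

Unlike repeated `+=`, this pattern is guaranteed linear across implementations, which is exactly the point of the recommendation.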
My favorite Python WTF "feature" is that integers can have the same reference, but only sometimes:
>>> a = 256
>>> b = 256
>>> a is b
True
>>> a = 257
>>> b = 257
>>> a is b
False
>>> a = 257; b = 257
>>> a is b
True
Sometimes I think of Python as the Nash Equilibrium[a] of programming languages:
It's never the absolute best language for anything, but it's hard to improve it on any front (e.g., execution speed) without hindering it on other fronts (e.g., ad-hoc interactivity), and it's never the absolute worst language for anything. Often enough, it's good enough.
Python might well be "the least-worst language for everything."
> Sometimes I think of Python as the Nash Equilibrium of programming languages
FYI: What you're describing is not a Nash equilibrium, but a Pareto optimal point [1]. They are similar in that you couldn't do any better, but a Nash equilibrium is about whether a change would cause other players to change their strategies, while Pareto optimality is only about trading off different features/dimensions.
Think of the developers as players competing against each other trying to get their ideas (PEPs) incorporated into the language, seeking individual recognition, credit, etc., and also think of languages competing against each other for developer attention, and then it will make a bit more sense why I called it a "Nash Equilibrium" :-)
Nash equilibria are mainly interesting when they are not Pareto optimal. Both the developers and users of a language, if being rational, should prefer languages to be on the Pareto frontier, but where on that frontier depends on how you weight the trade-offs.
As other commenters have pointed out, this is an implementation-specific optimization rather than a property of Python as a language.
It is, at first glance, a bit weird. But the way you should look at it is that Python the language doesn't say the two integers have the same identity, and you shouldn't assume they will. But it also doesn't say they can't be the same object. Since Python integers are immutable, having the two variables reference the same object can't create side effects, unless you're directly playing with identities and making assumptions in your code that you shouldn't make. So the implementation can have the two variables reference the same object as an optimization without breaking anything.
But this is using the seemingly harmless keyword "is" that you're supposed to use sometimes. A programmer could stumble upon one of these statements and think it's going to work reliably after it works the first time.
I used to test for None by doing what seemed to work:
if my_variable:
    do_something()
until I discovered it doesn't work if my_variable = 0 or some other falsy value besides None.
You could use '== None' instead, but it's generally recommended to use 'is None' (supposedly this is slightly faster). I don't think I've ever encountered anything else relying on 'is'. IMO the 'is' keyword was a poor language decision, given how rarely it's ever used.
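The pitfall above can be made concrete with a small sketch; the helper names here are hypothetical, chosen only to show why a truthiness check is not a None check:

```python
# Hypothetical helpers illustrating truthiness vs. identity checks.
def first_or_default(value, default="n/a"):
    if value:                # wrong: also rejects 0, "", [], False
        return value
    return default

def first_or_default_fixed(value, default="n/a"):
    if value is not None:    # right: only None means "missing"
        return value
    return default

print(first_or_default(0))        # n/a  (0 silently treated as missing)
print(first_or_default_fixed(0))  # 0
```

The `is None` form is preferred over `== None` not only for speed: a class can override `__eq__` to compare equal to None, while identity cannot be overridden.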
`is` is for identity whereas `==` is for equality. You rarely want `is` unless you're asking if two references are the same object. This is almost exclusively used for `x is False/True`, but sometimes used to denote "missing" arguments (where true/false may be valid):
missing = object()

def func(a=missing):
    if a is missing:
        raise ValueError('you must pass a')
This "numbers less than 256 are the same objects" is fairly common on the list of "wtf python", but I've never understood it. You don't use `is` like that, and you would never use it in code, because the operator you're using is not the right one for the thing you're trying to do.
Plus if this is the biggest wtf then that's pretty good going.
The "numbers less than 256 are the same objects" wasn't done so you could use "is" on them, that's just a side effect. It was done as an optimization, because those small integers are far more common than the larger ones. You save space, because you need only one copy of those small integers. And you save time, because those objects are never destroyed or recreated.
The "numbers less than 256 are the same objects" reminds me of the IntegerCache in Java, with an array storing the numbers from -128 to 127.
> It's never the absolute best language for anything, but it's hard to improve it on any front (e.g., execution speed) without hindering it on other fronts (e.g., ad-hoc interactivity),
This belief seems common, but I always wonder if anyone with familiarity with dynamic programming languages that were implemented by people who knew what they were doing (as implementers) thinks so. Self, Smalltalk and Common Lisp, for example, do much better on the ad-hoc interactivity front in non-trivial ways, while offering implementations with vastly better performance, preceding (C)Python by many years. The fact that Python has terrible execution speed is mostly due to a lack of relevant skills in the community, not some conscious engineering trade-off.
Having said that, I don't think you are wrong on Python being "the least worst language for everything" -- very few other languages have an ecosystem of remotely comparable expansiveness and quality (the top minds in several disciplines mostly use Python for their work), which alone kills off huge swathes of would-be competitors.
> Having said that, I don't think you are wrong on Python being "the least worst language for everything" -- very few other languages have an ecosystem of remotely comparable expansiveness and quality (the top minds in several disciplines mostly use Python for their work), which alone kills off huge swathes of would-be competitors.
Yes, I agree. The ecosystem is part of what makes the language "the least worst language for everything."
In [2]: (1, 2) is (1, 2)
Out[2]: True
In [3]: a, b = (1, 2), (1, 2)
In [4]: a is b
Out[4]: True
In [7]: a = (1, 2)
In [8]: b = (1, 2)
In [9]: a is b
Out[9]: False
If you run that in a script, then you get True for all statements. Reason: when running a file, the interpreter reads the entire script and can make the optimization that both variables are the same objects, since they're not mutated.
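This can be demonstrated without writing a file, by compiling a script-like chunk in one go; note this sharing is a CPython implementation detail, not a language guarantee, so the sketch only asserts equality:

```python
# Compiling both assignments as one code object lets CPython share
# equal immutable constants between them -- an implementation detail.
ns = {}
exec(compile("a = (1, 2)\nb = (1, 2)", "<demo>", "exec"), ns)

print(ns["a"] == ns["b"])   # True: equality is guaranteed by the language
print(ns["a"] is ns["b"])   # often True in CPython, but don't rely on it
```

In the REPL, each statement is compiled separately, which is why the line-by-line version can produce distinct objects.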
> Sometimes I think of Python as the Nash Equilibrium[a] of programming languages:
I think you can say that about almost any language. Each feature has its advantages and disadvantages, and even the most hated features of some languages have some reasoning behind them - so changing it would hurt some use case.
Language design is sometimes more about reasonable compromises than genius ideas.
This is outside of the spec... "is" is for testing the exact same reference, and it is only a coincidence that, to speed things up, they made smaller integers the same objects in memory. See:
He names references in his post. I highly doubt he’s confused about the difference between is and ==. It’s a weird leak of interpreter details that could, in very narrow situations, cause a bug.
Java has the same "problem" when boxing an int into a java.lang.Integer. Small integers will have the same reference (==) because there is a cache table, but larger ones won't.
This behaviour (id not changing after concatenation) is consistent with the original str object being deallocated and a new one allocated in its place with an identical id. It’s not what actually happens behind the scenes (in CPython), but it’s still an implementation detail that this demonstration does not even unambiguously expose.
CPython could do both; allocations are usually not byte-perfect, and the small remaining room at the end of string can be used as an optimization, or a new buffer can be allocated if there's no more room.
Also, cutting long strings with .strip() can use the same optimization. Allocation isn't fast.
IDK if it's used, but it's entirely plausible to implement.
> CPython could do both; allocations are usually not byte-perfect
Though it's probably not the case for strings, it should be noted that for most objects allocations are byte-perfect, because the average Python object is a bunch of pointers and a refcount: two gc pointers, a refcount, a weakref pointer, a class pointer and a dict pointer, for a total of 48 bytes (as of 3.9, ignoring all the pointees).
That's not the case for `__slots__` though: they drop the instance dict and the weakref pointer, but then each field will add a pointer to the total (then again that's space which won't be allocated as part of the instance dict).
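The trade-off described above can be observed directly; the class names here are made up for illustration, and the exact byte counts vary across CPython versions:

```python
import sys

class Plain:
    pass                      # instances get a __dict__ and __weakref__

class Slotted:
    __slots__ = ("a", "b")    # fixed fields; no __dict__, no __weakref__

p, s = Plain(), Slotted()
p.a = 1
s.a = 1

print(hasattr(p, "__dict__"))   # True
print(hasattr(s, "__dict__"))   # False
print(sys.getsizeof(p), sys.getsizeof(s))  # sizes are version-dependent
```

Note that `sys.getsizeof` on a plain instance does not include its `__dict__`, which is allocated separately, so the real saving from `__slots__` is larger than the instance sizes alone suggest.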
I was about to say that. The old buffer has to be copied to the new buffer, so a naive implementation couldn't possibly allocate them both at the same address.
Can you please explain to a rookie what the ord('1') is for? Is it just a way of writing a magic number of value 49, or is there some significance to it?
It's mostly just a mildly obfuscated way to write 49. It also happens to line up nicely with the index of the character being overwritten (index 1); ord('0') through ord('9') work likewise.
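For the rookie's benefit, the correspondence is easy to check at the REPL: `ord()` returns a character's Unicode code point, and the ASCII digits occupy 48 through 57, so `ord('1')` is just 49 spelled differently.

```python
# ASCII digits '0'..'9' map to code points 48..57.
print(ord('0'), ord('1'), ord('9'))  # 48 49 57
print(chr(49))                       # 1  (chr() is the inverse of ord())
```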
----
I'm personally kind of shocked at how big strings are - the minimum size is 6 native words in size, corresponding to the following fields: refcount, type pointer, string length, string hash (precomputed), flags, and a wstr pointer which just seems to be NULL for most strings. It seems like they could have at least merged the string length and flags together for short strings - a possible subject for a future PEP...
The CPython string implementation stores metadata to save memory. The ord('1') simply jumps to the correct memory offset in the string structure. I would think it's easier to remember ord('0') as the offset of the string byte array in the structure than the magic number 48?
> the builtin id() function, which is just the memory address.
This, crucially, is not guaranteed. Instead, id() is only guaranteed to give a unique integer value for every object valid for the lifetime of that object. So if Python deallocated your first string object and allocated a new string object, Python could give the same id() for the new string object and this would be fine; this does not necessarily mean that the two string objects are the “same” object! In fact, since the objects do not share lifetimes, they cannot be said to be the same object!
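A short sketch of this guarantee and its limits; the variable names are illustrative:

```python
# Two *live* objects never share an id, but once an object dies its id
# may be recycled for a later object -- so seeing the same id twice does
# not prove you saw the same object.
x, y = object(), object()
live_ids_distinct = id(x) != id(y)   # always True while both are alive
print(live_ids_distinct)

tmp = object()
addr = id(tmp)
del tmp            # tmp's lifetime ends; addr is now free to be reused
z = object()       # id(z) may or may not equal addr (CPython often reuses it)
```

This is exactly why the article's `id()`-based demonstration is ambiguous: identical ids across a `+=` are consistent both with in-place reuse and with deallocate-then-reallocate.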
The title seems like false advertising. It seems that strings in Python are indeed immutable. How it implements immutability is up to it (the implementation), and making a brand new allocation for each operation is not the only way of adhering to the contract.
Get back to me if you can make several references to the same string and make them step on each others’ toes by only assigning a new value to one of them.
Python string semantics are very strange, as is the way Unicode was implemented. The internal representation can be one, two, or four bytes per character, depending on the widest character. Plus I've heard that there's sometimes a UTF-8 representation in there, too.
Python strings have to be indexable by integers. But, in fact, most indexing is iteration. So a UTF-8 representation could work. You can index forwards and backwards through UTF-8 with opaque cursors. The search, match, and regular expression functions can all work on UTF-8. Returned indexes would be opaque cursors. Addition and subtraction of small integers to a cursor can be handled by indexing forwards and backwards through UTF-8. If a program does random access indexing, or uses an opaque cursor in a context where it really has to be converted to an integer, the implementation has to build an index of the entire string. Index building costs O(n), but should be rare.
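The opaque-cursor idea can be sketched in a few lines; `utf8_advance` is a hypothetical helper, not anything in CPython. A cursor is a byte offset, and advancing it means skipping UTF-8 continuation bytes (those whose top two bits are `10`):

```python
def utf8_advance(buf: bytes, pos: int) -> int:
    """Move an opaque byte cursor forward by one code point (sketch)."""
    pos += 1
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1  # skip continuation bytes (0b10xxxxxx)
    return pos

data = "héllo".encode("utf-8")
count, pos = 0, 0
while pos < len(data):
    pos = utf8_advance(data, pos)
    count += 1

print(count)      # 5 code points
print(len(data))  # 6 bytes, since é encodes as two bytes
```

Walking backwards works the same way in reverse, which is why forward/backward iteration stays O(1) per step even though integer indexing would need the O(n) index the comment describes.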
This is the same way persistent data structures are used in FP to provide immutable semantics with ~mutable performance. In Clojure the concept is extended to transients, which allow for scope-local mutation, reasoning that if the statefulness is invisible at the call site, it effectively wasn't stateful.
All of this is just making the physics of electronics and stateful low level APIs provide reasonable performance for the reasoning benefits of immutable data.
Of course if you break that fourth wall you’ll find it’s more complicated than that. And that’s interesting for understanding how it works, but it’s important not to frame that as some kind of exception or trap door. Anyone busting through that abstraction either knows they are or has more problems coming.
I appreciate the clarification. And giving it a more general name which I wasn’t aware of. I hope it was clear I was using an example and not trying to suggest this was somehow unique to Clojure. My point was to say that encapsulating mutability to provide an immutable interface with other desirable characteristics is actually commonplace.
Taking advantage of a reference count of 1, like this, must be one of very few places where reference counting can provide a performance improvement over tracing garbage collection.
There is a difference between changing internals by doing things outside the system (OS level, hardware level) and changing things with perfectly legal and correct code.
I'm not convinced it's legal and correct. You're going to risk breaking internal JVM optimisations like string interning. Whether it's specifically against the letter of the law I'm not sure, but I imagine the Java specification doesn't have much to say about the behaviour of a JVM after someone messes with its internals. That is to say, very roughly speaking, you're likely flirting with what C programmers call undefined behaviour, which is to say, it's likely not legal and correct.
Strings are immutable on the Python side. On the C side, however, PyUnicode_Resize is part of the PEP 384 Stable ABI and very specifically tries to resize the string in-place (it's normally intended for string-building).
Does Python deliberately allocate extra space at the end of each string to enable characters to be appended in-place (similar to lists)? Or does it just use space that’s available as a side-effect of memory allocation, e.g. by calling `realloc`?