> Does this really count as mutability though? Not really. The string would be thrown away anyway so this is just an optimization to reuse the memory. The important takeaway here is that you are not allocating a new string every single time like the internet says.
This is an undocumented optimization, so you should assume you're allocating a new string every single time like the internet says.
I've been coding Python since the 1.5.2 days, and so I'll continue to use lists as intermediate string builders and then join afterwards, because I know this works in past, present, and likely future versions of Python.
Some interesting results. Joining on a generator is slower than joining on a list, because the list has a known size beforehand, which lets Python perform some optimizations, even though it's more memory intensive. Appending a string is surprisingly fast. Building an intermediate list is not.
While I do get your point, Python likes making optimizations. They rarely, if ever, make patterns slower by choice. There's nothing in any spec that says it _has_ to be slow: that's as much an implementation detail as "joining lists is fast".
In [18]: sys.version
Out[18]: '3.9.1 (default, Feb 3 2021, 07:38:02) \n[Clang 12.0.0 (clang-1200.0.32.29)]'
In [6]: strings = ["".join(random.choice(string.hexdigits) for _ in range(10)) for _ in range(10)]
In [9]: def one():
   ...:     "".join(s for s in strings)
   ...:
In [10]: def two():
   ...:     "".join([s for s in strings])
   ...:
In [11]: def three():
   ...:     x = []
   ...:     for s in strings:
   ...:         x.append(s)
   ...:     "".join(x)
   ...:
In [12]: def four():
   ...:     x = ""
   ...:     for s in strings:
   ...:         x += s
   ...:
In [13]: %timeit one()
753 ns ± 9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [14]: %timeit two()
521 ns ± 5.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit three()
696 ns ± 3.94 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [16]: %timeit four()
620 ns ± 4.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Joining on a list (as in function two()) is no more memory intensive than joining on a generator expression (function one()). This is surprising at first. Raymond Hettinger has a very good explanation.[0] In short, it's because "".join() needs to iterate over the elements twice, and so it needs to store the whole sequence of strings anyway.
You can write Python to be performant for either CPython or for PyPy. It's very difficult to write code that is performant on both. If you are targeting PyPy, be explicit about it and go full RPython on it.
If you are targeting CPython, try to be "pythonic"[0] and then drop to C/C++ (through numba, cython, nuitka, pure C or whatever way you like best) if your performance requirements need it.
I'm not familiar enough with Jython and IronPython internals to know the best way to handle performance for them. Most probably it's very different from CPython and PyPy.
So like you said, keep things reasonable and don't depend on implementation details--unless you are explicit in your target/supported interpreter and you absolutely need to squeeze the performance out of the code.
a) that surprises me, because in my short experience with it, I found that PyPy is more about simpler idioms if you are aiming for performance, a bit C-ish in style
b) it's been something like... 3 years (I think) since I last used PyPy for some serious performance testing of code. And even then it was a horrible proof of concept that barely ran, but it helped me point out that "hey, we optimized it to 3.5x faster with CPython, and PyPy is getting to 4.5x over the original code, but it's throwing a thousand errors and warnings on screen and we can't be sure it'll work in production"[0].
Probably things have changed a bit on the PyPy side, though I don't think RPython has changed much in concept.
Also I see I never got to explain the pythonic note in my previous comment: basically, whatever "pythonic" means to the team. In general I try to go for the definition that is "it's easy to read and understand what's going on with the code, and provides easy-to-use interfaces for programmers". Not an easy thing to do, but at least trying to do it helps a lot.
[0] the job was that there was a large program that was bottlenecking a process in a lab, and somebody threw the idea of "oh we should use pypy because it'll surely make things run faster". And then it was my responsibility to pick that skeleton up and take care of it.
Oh and the code was pretty inefficient to begin with, which is why we got to speed it up so much without even going into C. A quick test with Cython put it in the 7-10x range of speedups, but the client wasn't willing to handle that, so they just kept it in CPython.
Yeah; I'm pretty sure this isn't anything you can take advantage of. But I think the point of the post is an "isn't this neat?" rather than "because of an optimization, one approach isn't as bad as it could be and you should take advantage of that".
The optimization got moved to unicode objects during the great Python 3 upheaval, and AFAIK it still remains unimplemented for what are now bytes objects. The paramiko SSH library relied on it heavily; as a result, its runtime is (AFAIK still) horrific on Python 3.
This one is less painful for the devs to break, if they need to, because it only works 99% of the time. If you rely on this behaviour for correctness, rather than performance, it will set you right pretty quickly.
If you do want to take advantage of this, maybe you can go a level deeper: dig into Python's behaviour when allocating memory for strings, and figure out in what circumstances you can be guaranteed to get the same id back. E.g. maybe if you create a 28-character string, there will always be room to append 4 more.
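To see what such digging looks like in practice, here is a minimal, hedged probe: it appends one character at a time and records whether `id()` stayed stable across each `+=`. Whether any append happens "in place" is purely a CPython implementation detail (it depends on version, refcounts, and the allocator), so the sketch only observes it rather than relying on it.

```python
# Hypothetical probe: watch whether the interpreter reuses the same
# allocation when appending to a string with a single reference.
# The results are an implementation detail and may vary by version,
# build, and string size -- only the final string value is guaranteed.
s = ""
stable = []
for _ in range(40):
    before = id(s)
    s += "x"
    stable.append(id(s) == before)

print(s == "x" * 40)  # True: the observable semantics never change
print(any(stable))    # whether any append was "in place" can vary
```

Even when `id()` does stay stable, that only tells you the address was reused, not that you may depend on it.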
Right, but my thought is "at what times would it make sense to join, and not append, but this optimization would be faster than joining"? I'm guessing not very many.
Because, yeah, the fact that str += "?" is faster isn't a big deal; that's the natural, ergonomic way to deal with it, because you're appending all of one string. Likewise foo + " " + bar + "?" is probably easier to write that way, than to drop them into a list and join (but even if not, I'd be curious if it's actually any faster; this article doesn't measure). By the time you get to joining large amounts together, concatenating a CSV or something, you're going to use join, naturally, and join is going to be more performant.
So to my mind it's kind of an open question, at what points is the non-ergonomic thing going to be the faster thing? That's what "taking advantage of (an optimization)" feels like; otherwise you're just writing code and letting the performance fall where it may.
> Right, but my thought is "at what times would it make sense to join, and not append, but this optimization would be faster than joining"? I'm guessing not very many.
I’m guessing all of them. Because what it does is what join will do internally anyway, but without the overhead of the list itself, and with operations which have lower overhead than the list’s.
Yes, you are. I'm saying this is a more interesting question, and one that isn't answered. Because the reality is that if you're appending just a handful of things together, the idiomatic approach is to concatenate them, and this optimization comes into play. If you're concatenating a lot of items together... you probably already have a list, and the idiomatic approach to join them is probably faster than a for loop to concat them over and over. So the question, then, is: is there a point where idiomatically it makes more sense to join things (putting them in a list if they aren't already, but they might be; depends on the example), but this will be more efficient?
>> Does this really count as mutability though? Not really. The string would be thrown away anyway so this is just an optimization to reuse the memory. The important takeaway here is that you are not allocating a new string every single time like the internet says.
The important takeaway is that semantic immutability does not require or depend on implementation immutability. It's a 'levels of abstraction' thing. Once you let the users peek behind the curtain, however, it becomes much more difficult to be sure they cannot shoot themselves in the foot by so doing.
This does play a role if for some reason you use object IDs, e.g. when deserializing data with cross-links. Keeping a real second reference becomes important.
No. The strings having the same address doesn't mean they're the same strings. This is an interesting optimization but one Python string is being replaced by another string at the same address. Even without this optimization the address will be reused eventually.
The closest thing Python has to a specification is the Language Reference. It shies well away from performance details and aims to describe "syntax and core semantics" while acknowledging that some parts of what it specifies may just be implementation details of CPython [0].
To draw the line on what you can assume re performance, you need a combination of background knowledge, folk wisdom and common sense. (You could say "or read the code" but that could change in the next minor version: you still need common sense to know if that's likely). This is unlike, e.g. C++ where there is a specification and standards-compliant compilers must implement certain functions with the specified runtime complexity.
If a tree falls in a forest, and nobody observes it, does it make a sound? The tree can choose, because there's nobody to notice the difference.
Mutable state is not bad; shared mutable state is bad — or, at least, takes a lot of care to handle. So CPython mutates strings which are not observed by anyone else but the mutating code.
This is exactly the idea behind uniqueness types [1] and move semantics in C++ and Rust: mutation is safe as long as it's not shared.
The advice from Python core developers is not to rely on that optimization — it is fragile and implementation dependent. Instead, please use str.join() to concatenate a bunch of strings together. That is guaranteed to not be quadratic.
When Armin Rigo's brilliant optimization was added, the rationale was mitigation-of-harm: it could sometimes save users who weren't following the advice. Also, it sometimes helped with the then-common technique of building a short string with a series of concatenations done with "+".
One other personal thought as a code reviewer and teacher: If you teach someone to build strings with "+" or "+=" and the optimization prevents a negative consequence, then don't be surprised if they try this with other sequence types where no such optimization exists.
Really, just learn str.join() and itertools.chain(). Both of those scale nicely.
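As a small illustration of that advice, here is a sketch (the variable names are made up) that flattens nested iterables of strings with itertools.chain and joins exactly once at the end, avoiding both repeated concatenation and a large intermediate list:

```python
import itertools

# Flatten nested iterables of string fragments lazily, then join once.
# str.join() performs a single allocation for the final result.
chunks = [["GET ", "/index "], ["HTTP", "/1.1"]]
request_line = "".join(itertools.chain.from_iterable(chunks))
print(request_line)  # GET /index HTTP/1.1
```

Unlike repeated `+=`, this pattern is guaranteed linear across implementations, which is exactly the point of the recommendation.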
My favorite Python WTF "feature" is that integers can have the same reference, but only sometimes:
>>> a = 256
>>> b = 256
>>> a is b
True
>>> a = 257
>>> b = 257
>>> a is b
False
>>> a = 257; b = 257
>>> a is b
True
Sometimes I think of Python as the Nash Equilibrium[a] of programming languages:
It's never the absolute best language for anything, but it's hard to improve it on any front (e.g., execution speed) without hindering it on other fronts (e.g., ad-hoc interactivity), and it's never the absolute worst language for anything. Often enough, it's good enough.
Python might well be "the least-worst language for everything."
> Sometimes I think of Python as the Nash Equilibrium of programming languages
FYI: What you're describing is not a Nash equilibrium, but a Pareto optimal point [1]. They are similar in that you couldn't do any better, but a Nash equilibrium is about whether a change would cause other players to change their strategies, while Pareto optimality is only about trading off different features/dimensions.
Think of the developers as players competing against each other trying to get their ideas (PEPs) incorporated into the language, seeking individual recognition, credit, etc., and also think of languages competing against each other for developer attention, and then it will make a bit more sense why I called it a "Nash Equilibrium" :-)
Nash equilibria are mainly interesting when they are not Pareto optimal. Both the developers and users of a language, if being rational, should prefer languages to be on the Pareto frontier, but where on that frontier depends on how you weight the trade-offs.
As other commenters have pointed out, this is an implementation-specific optimization rather than a property of Python as a language.
It is, at first glance, a bit weird. But the way you should look at it is that Python the language doesn't say the two integers have the same identity, and you shouldn't assume they will. But it also doesn't say they can't be the same object. Since Python integers are immutable, having the two variables reference the same object can't create side effects, unless you're directly playing with identities and making assumptions in your code that you shouldn't make. So the implementation can have the two variables reference the same object as an optimization without breaking anything.
But this is using the seemingly harmless keyword "is" that you're supposed to use sometimes. A programmer could stumble upon one of these statements and think it's going to work reliably after it works the first time.
I used to test for None by doing what seemed to work:
if my_variable:
    do_something()
until I discovered it doesn't work if my_variable = 0 or some other falsy value besides None.
You could use '== None' instead, but it's generally recommended to use 'is None' (supposedly this is slightly faster). I don't think I've ever encountered anything else relying on 'is'. IMO the 'is' keyword was a poor language decision, given how rarely it's ever used.
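The pitfall above can be made concrete with a small sketch; the helper names here are hypothetical, chosen only to show why a truthiness check is not a None check:

```python
# Hypothetical helpers illustrating truthiness vs. identity checks.
def first_or_default(value, default="n/a"):
    if value:                # wrong: also rejects 0, "", [], False
        return value
    return default

def first_or_default_fixed(value, default="n/a"):
    if value is not None:    # right: only None means "missing"
        return value
    return default

print(first_or_default(0))        # n/a  (0 silently treated as missing)
print(first_or_default_fixed(0))  # 0
```

The `is None` form is preferred over `== None` not only for speed: a class can override `__eq__` to compare equal to None, while identity cannot be overridden.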
`is` is for identity whereas `==` is for equality. You rarely want `is` unless you're asking if two references are the same object. This is almost exclusively used for `x is False/True`, but sometimes used to denote "missing" arguments (where true/false may be valid):
missing = object()

def func(a=missing):
    if a is missing:
        raise ValueError('you must pass a')
This "numbers less than 256 are the same objects" is fairly common on the list of "wtf python", but I've never understood it. You don't use `is` like that, and you would never use it in code, because the operator you're using is not the right one for the thing you're trying to do.
Plus if this is the biggest wtf then that's pretty good going.
The "numbers less than 256 are the same objects" wasn't done so you could use "is" on them, that's just a side effect. It was done as an optimization, because those small integers are far more common than the larger ones. You save space, because you need only one copy of those small integers. And you save time, because those objects are never destroyed or recreated.
The "numbers less than 256 are the same objects" reminds me of the IntegerCache in Java, with an array storing the numbers from -128 to 127.
> It's never the absolute best language for anything, but it's hard to improve it on any front (e.g., execution speed) without hindering it on other fronts (e.g., ad-hoc interactivity),
This belief seems common, but I always wonder if anyone with familiarity with dynamic programming languages that were implemented by people who knew what they were doing (as implementers) thinks so. Self, Smalltalk and Common Lisp, for example, do much better on the ad-hoc interactivity front in non-trivial ways, while offering implementations with vastly better performance, preceding (C)Python by many years. The fact that Python has terrible execution speed is mostly due to a lack of relevant skills in the community, not some conscious engineering trade-off.
Having said that, I don't think you are wrong on Python being "the least worst language for everything" -- very few other languages have an ecosystem of remotely comparable expansiveness and quality (the top minds in several disciplines mostly use Python for their work), which alone kills off huge swathes of would-be competitors.
> Having said that, I don't think you are wrong on Python being "the least worst language for everything" -- very few other languages have an ecosystem of remotely comparable expansiveness and quality (the top minds in several disciplines mostly use Python for their work), which alone kills off huge swathes of would-be competitors.
Yes, I agree. The ecosystem is part of what makes the language "the least worst language for everything."
In [2]: (1, 2) is (1, 2)
Out[2]: True
In [3]: a, b = (1, 2), (1, 2)
In [4]: a is b
Out[4]: True
In [7]: a = (1, 2)
In [8]: b = (1, 2)
In [9]: a is b
Out[9]: False
If you run that in a script, then you get True for all statements. Reason: when running a file, the interpreter reads the entire script and can make the optimization that both variables are the same objects, since they're not mutated.
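This can be demonstrated without writing a file, by compiling a script-like chunk in one go; note this sharing is a CPython implementation detail, not a language guarantee, so the sketch only asserts equality:

```python
# Compiling both assignments as one code object lets CPython share
# equal immutable constants between them -- an implementation detail.
ns = {}
exec(compile("a = (1, 2)\nb = (1, 2)", "<demo>", "exec"), ns)

print(ns["a"] == ns["b"])   # True: equality is guaranteed by the language
print(ns["a"] is ns["b"])   # often True in CPython, but don't rely on it
```

In the REPL, each statement is compiled separately, which is why the line-by-line version can produce distinct objects.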
> Sometimes I think of Python as the Nash Equilibrium[a] of programming languages:
I think you can say that about almost any language. Each feature has its advantages and disadvantages, and even the most hated features of some languages have some reasoning behind them - so changing it would hurt some use case.
Language design is sometimes more about reasonable compromises than genius ideas.
This is outside of the spec... "is" is for testing the exact same reference, and it is only a coincidence that, to speed things up, they made smaller integers the same objects in memory. See:
He names references in his post. I highly doubt he’s confused about the difference between is and ==. It’s a weird leak of interpreter details that could, in very narrow situations, cause a bug.
Java has the same "problem" when boxing an int into a java.lang.Integer. Small integers will have the same reference (==) because there is a cache table, but larger ones won't.
This behaviour (id not changing after concatenation) is consistent with the original str object being deallocated and a new one allocated in its place with an identical id. It’s not what actually happens behind the scenes (in CPython), but it’s still an implementation detail that this demonstration does not even unambiguously expose.
CPython could do both; allocations are usually not byte-perfect, and the small remaining room at the end of string can be used as an optimization, or a new buffer can be allocated if there's no more room.
Also, cutting long strings with .strip() can use the same optimization. Allocation isn't fast.
IDK if it's used, but it's entirely plausible to implement.
> CPython could do both; allocations are usually not byte-perfect
Though it's probably not the case for strings, it should be noted that for most objects allocations are byte-perfect, because the average Python object is a bunch of pointers and a refcount: two gc pointers, a refcount, a weakref pointer, a class pointer and a dict pointer, for a total of 48 bytes (as of 3.9, ignoring all the pointees).
That's not the case for `__slots__` though: they drop the instance dict and the weakref pointer, but then each field will add a pointer to the total (then again that's space which won't be allocated as part of the instance dict).
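The trade-off described above can be observed directly; the class names here are made up for illustration, and the exact byte counts vary across CPython versions:

```python
import sys

class Plain:
    pass                      # instances get a __dict__ and __weakref__

class Slotted:
    __slots__ = ("a", "b")    # fixed fields; no __dict__, no __weakref__

p, s = Plain(), Slotted()
p.a = 1
s.a = 1

print(hasattr(p, "__dict__"))   # True
print(hasattr(s, "__dict__"))   # False
print(sys.getsizeof(p), sys.getsizeof(s))  # sizes are version-dependent
```

Note that `sys.getsizeof` on a plain instance does not include its `__dict__`, which is allocated separately, so the real saving from `__slots__` is larger than the instance sizes alone suggest.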
I was about to say that. The old buffer has to be copied to the new buffer, so a naive implementation couldn't possibly allocate them both at the same address.
Can you please explain to a rookie what the ord('1') is for? Is it just a way of writing a magic number of value 49, or is there some significance to it?
It's mostly just a mildly obfuscated way to write 49. It also happens to line up nicely with the index of the character being overwritten (index 1); ord('0') through ord('9') work likewise.
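For the rookie's benefit, the correspondence is easy to check at the REPL: `ord()` returns a character's Unicode code point, and the ASCII digits occupy 48 through 57, so `ord('1')` is just 49 spelled differently.

```python
# ASCII digits '0'..'9' map to code points 48..57.
print(ord('0'), ord('1'), ord('9'))  # 48 49 57
print(chr(49))                       # 1  (chr() is the inverse of ord())
```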
----
I'm personally kind of shocked at how big strings are - the minimum size is 6 native words in size, corresponding to the following fields: refcount, type pointer, string length, string hash (precomputed), flags, and a wstr pointer which just seems to be NULL for most strings. It seems like they could have at least merged the string length and flags together for short strings - a possible subject for a future PEP...
The CPython string implementation stores metadata to save memory. The ord('1') simply jumps to the correct memory offset in the string structure. I would think it's easier to remember ord('0') as the offset of the string byte array in the structure than the magic number 48?
> the builtin id() function, which is just the memory address.
This, crucially, is not guaranteed. Instead, id() is only guaranteed to give a unique integer value for every object valid for the lifetime of that object. So if Python deallocated your first string object and allocated a new string object, Python could give the same id() for the new string object and this would be fine; this does not necessarily mean that the two string objects are the “same” object! In fact, since the objects do not share lifetimes, they cannot be said to be the same object!
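A short sketch of this guarantee and its limits; the variable names are illustrative:

```python
# Two *live* objects never share an id, but once an object dies its id
# may be recycled for a later object -- so seeing the same id twice does
# not prove you saw the same object.
x, y = object(), object()
live_ids_distinct = id(x) != id(y)   # always True while both are alive
print(live_ids_distinct)

tmp = object()
addr = id(tmp)
del tmp            # tmp's lifetime ends; addr is now free to be reused
z = object()       # id(z) may or may not equal addr (CPython often reuses it)
```

This is exactly why the article's `id()`-based demonstration is ambiguous: identical ids across a `+=` are consistent both with in-place reuse and with deallocate-then-reallocate.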
The title seems like false advertising. It seems that strings in Python are indeed immutable. How it implements immutability is up to it (the implementation), and making a brand new allocation for each operation is not the only way of adhering to the contract.
Get back to me if you can make several references to the same string and make them step on each others’ toes by only assigning a new value to one of them.
Python string semantics are very strange, as is the way Unicode was implemented. The internal representation can be one, two, or four bytes per character, depending on the widest character. Plus I've heard that there's sometimes a UTF-8 representation in there, too.
Python strings have to be indexable by integers. But, in fact, most indexing is iteration. So a UTF-8 representation could work. You can index forwards and backwards through UTF-8 with opaque cursors. The search, match, and regular expression functions can all work on UTF-8. Returned indexes would be opaque cursors. Addition and subtraction of small integers to a cursor can be handled by indexing forwards and backwards through UTF-8. If a program does random access indexing, or uses an opaque cursor in a context where it really has to be converted to an integer, the implementation has to build an index of the entire string. Index building costs O(n), but should be rare.
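The opaque-cursor idea can be sketched in a few lines; `utf8_advance` is a hypothetical helper, not anything in CPython. A cursor is a byte offset, and advancing it means skipping UTF-8 continuation bytes (those whose top two bits are `10`):

```python
def utf8_advance(buf: bytes, pos: int) -> int:
    """Move an opaque byte cursor forward by one code point (sketch)."""
    pos += 1
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1  # skip continuation bytes (0b10xxxxxx)
    return pos

data = "héllo".encode("utf-8")
count, pos = 0, 0
while pos < len(data):
    pos = utf8_advance(data, pos)
    count += 1

print(count)      # 5 code points
print(len(data))  # 6 bytes, since é encodes as two bytes
```

Walking backwards works the same way in reverse, which is why forward/backward iteration stays O(1) per step even though integer indexing would need the O(n) index the comment describes.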
This is the same way persistent data structures are used in FP to provide immutable semantics with ~mutable performance. In Clojure the concept is extended to transients, which allow for scope-local mutation, reasoning that if the statefulness is invisible at the call site, it effectively wasn't stateful.
All of this is just making the physics of electronics and stateful low level APIs provide reasonable performance for the reasoning benefits of immutable data.
Of course if you break that fourth wall you’ll find it’s more complicated than that. And that’s interesting for understanding how it works, but it’s important not to frame that as some kind of exception or trap door. Anyone busting through that abstraction either knows they are or has more problems coming.
I appreciate the clarification. And giving it a more general name which I wasn’t aware of. I hope it was clear I was using an example and not trying to suggest this was somehow unique to Clojure. My point was to say that encapsulating mutability to provide an immutable interface with other desirable characteristics is actually commonplace.
Taking advantage of a reference count of 1, like this, must be one of very few places where reference counting can provide a performance improvement over tracing garbage collection.
There is a difference between changing internals by doing things outside the system (OS level, hardware level) and changing things with perfectly legal and correct code.
I'm not convinced it's legal and correct. You're going to risk breaking internal JVM optimisations like string interning. Whether it's specifically against the letter of the law I'm not sure, but I imagine the Java specification doesn't have much to say about the behaviour of a JVM after someone messes with its internals. That is to say, very roughly speaking, you're likely flirting with what C programmers call undefined behaviour, which is to say, it's likely not legal and correct.
Strings are immutable on the Python side. On the C side, however, PyUnicode_Resize is part of the PEP 384 Stable ABI and very specifically tries to resize the string in-place (it's normally intended for string-building).
Does Python deliberately allocate extra space at the end of each string to enable characters to be appended in-place (similar to lists)? Or does it just use space that’s available as a side-effect of memory allocation, e.g. by calling `realloc`?