Weird Python Integers (kate.io)
367 points by luu on Aug 24, 2017 | 146 comments



https://github.com/adtac/exterminate

Plugging my near-useless Python library that does this and a lot of other subtle, annoying things to break programs. The library is essentially a display of how much Python actually exposes to the user and how modifiable it is.


Wow, the alternative print is extra evil ...

    >>> from exterminate import AltPrint
    >>> print("Hey! How are you?")
    Yo dawwwwg! Wuz crackalackin' yo?
... because that string is not calculated locally, but retrieved via HTTP from an API call. So everything you print is sent to an external server.

AltPrint calls Gizoogle.translate():

https://github.com/adtac/exterminate/blob/f18862ed5b2e143e18...

And Gizoogle.translate() performs an HTTP call:

https://github.com/adtac/exterminate/blob/f18862ed5b2e143e18...


print("'; drop table users")



Why does my inner self find it immensely fun to modify programs in subtle ways to watch how they break? Considering there are infinitely many ways to write junk code, and far fewer ways to make it work.


It's probably what got me interested in programming in the first place. Writing C code and learning about pointers to memory, and playing around with memory locations to see how undefined behavior could be invoked.


That's the basic idea behind mutation testing, which can turn the aim of 100% code coverage up to 11. You introduce random mutations into the code, and if a mutation does not cause at least one test to fail then either your tests aren't covering enough, or that expression/statement was unnecessary for correct program behaviour. Never got a chance to actually try it out, though.
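Never tried a real tool either, but the core loop is small enough to sketch in plain Python (hypothetical `add` function and a one-assert "suite"; real tools such as mutmut generate many mutants and run your actual tests against each):

```python
import ast

# Hypothetical function under test, as source text.
src = "def add(a, b):\n    return a + b\n"

class FlipAdd(ast.NodeTransformer):
    """One mutation operator: replace every `+` with `-`."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

# Parse, mutate, and compile the mutant into a fresh namespace.
tree = ast.fix_missing_locations(FlipAdd().visit(ast.parse(src)))
ns = {}
exec(compile(tree, "<mutant>", "exec"), ns)

# Run the "tests" against the mutant; a decent suite should kill it.
try:
    assert ns["add"](2, 3) == 5
    print("mutant killed: the tests caught the change")
except AssertionError:
    print("mutant killed: the tests caught the change")
else:
    print("mutant survived: the tests are too weak")
```

A surviving mutant points at either a missing test or dead code.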


Summary: Integers in python are full blown objects. Small numbers are stored in a central preallocated table where each entry represents one number. Setting a variable to a small integer makes it point to an entry in that table. Multiple variables may point to the same small integer objects in that table. Fooling around with the table leads to funny results.
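A quick way to see the table from a script (CPython-specific; `int()` on a string is used here to force object creation at runtime, since two literal `257`s compiled together can end up sharing one constant):

```python
a = 256
b = int("256")   # built at runtime, so no constant-pool sharing
print(a is b)    # True: both point at the preallocated 256 in the table

c = 257
d = int("257")
print(c is d)    # False: 257 is outside the table, so a fresh object is made
```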


It's just like Fortran, where everything including a constant is passed by reference, so if you assign to your arguments you might corrupt your program's only copy of 7 unless the linker put it in a read-only page.


Could you be more specific? Fortran passes arguments by reference, but writing to an argument won't mess up the "program's only copy of 7".

For example, the first output of the following program is 4, but the second is still 7.

       program main
       integer m,n

       m = 7
       call corrupt(m)
       write (*,*) m

       n = 7
       write (*,*) n
       stop
       end program main

       subroutine corrupt (a)
       integer a
       a = 4
       return
       end


I think he means something like

   call corrupt(7)
which causes a segfault for me. I suppose if you were using a compiler/platform that didn't store constants in read only memory, this might actually work.


All newer versions of Fortran will segfault, but back in the days they would not. Back in the nineties I fixed a bug we found when porting a Fortran 77 program from HP-UX to Linux. The program would segfault on Linux, but worked on HP-UX.

The reason was that in one subroutine, a parameter value was stored in a local variable, then used for computation and restored at the end. Since constants were stored in read-only memory when using g77 on Linux, but not with the f77 compiler on HP-UX, the Linux port would crash, but not the original HP-UX version. In that environment, the code you had above would have worked.


Yeah, a "fun" thing to do was to change the value of built-in constants such as pi.


That feature was required by the Indiana General Assembly.

https://en.wikipedia.org/wiki/Indiana_Pi_Bill


More like ridiculed than required.

I was going to mention Indiana but was more hoping that it wouldn't be mentioned at all.


By manually moving the constants from .const to .data, I got this program:

       program main
       integer m,n

       call corrupt(7)

       write (*,*) 7
       stop
       end program main

       subroutine corrupt (a)
       integer a
       a = 4
       return
       end
to output 4. Thanks for pointing this out, 'concede_pluto. Pretty weird!


The same is true for Java apparently: https://stackoverflow.com/a/2001861


That's only the case if you specifically treat them as full-blown objects. If you treat them as primitives they stay as primitives. Note that the type of these local variables is `Integer` rather than `int`.


Although you are technically correct (the best kind of correct), it is still a shortcoming of objects (or maybe just their implementation in Java).

It is just a bad idea to implement the '==' operator using comparison on addresses instead of on values, and I consider languages that do so pretty low-level.

In a sane language == is always a function adhering to the axioms of an equivalence relation (reflexivity, symmetry and transitivity), where a user can expect an actual value comparison - AFAIK one is advised to override `equals` in e.g. Java, but I do not think this is good design.

//edit:

There is no obvious answer, as one could argue that languages just use different names for the same concepts ('==' vs 'equals') and simply consider comparison by value or by reference more fundamental (and choose '==' for that).


Just to point out, "equal by comparing addresses" is symmetric and transitive as well. (whether you call it reflexive is an uninteresting question, if you think about it)


There is an obvious answer to which of comparison by value vs "comparison by reference" (I would have called this "comparison by identity") is more fundamental. Comparison by reference is more fundamental than comparison by value.


Yeah, well you know, that's just like your opinion, man.


On the off-chance that you're serious, I'll point out that there is no way to compare equal by identity without also comparing equal by value.


Can you elaborate on this?

  Integer a = new Integer(2);
  Integer b = a;
Here a == b and no comparison by value.

  Integer c = new Integer(2);
Then a != c but a.equals(c) (in Java). My point is that I do not agree with that and would like to have a == c, which would be the case if '==' was implemented using comparison by value.


While having '==' compare by value would be nice, it makes the == operator quite complex. If it's a general comparison operator, it would need to recursively compare all the members, and avoid looping on circular references, or risk == running forever.

Certainly, one can come up with examples of how a different operator == would be useful for certain types of objects (including Integer), but the architects of Java strongly felt that an operator should always behave the same, regardless of the type of its operands, unless the operator is + and the objects are Strings.


It's a technical/pedantic point, but

a == b is a value-comparison of the identity "value" (in Java it's the memory reference).

Or in other words, how do I know two references are equal (identity equal) without comparing the values of the references?

Logical equality devolves to the same thing at the end.


> Here a == b and no comparison by value.

Um, nonsense?

    class Intbox {
        public static void main(String[] args) {
    	Integer a = new Integer(2);
    	Integer b = a;
    	System.out.println(a == b);
    	System.out.println(a.equals(b));
        }
    }

    $ java Intbox
    true
    true


I believe your parent is emphasizing the difference between comparing a == c and a == b. Both are identity comparisons as opposed to value comparisons.

    > cat intbox.java 
    class Intbox {
        public static void main(String[] args) {
            Integer a = new Integer(2);
            Integer b = a;
            Integer c = new Integer(2);
            System.out.printf("a == b => %b\n",  a == b);
            System.out.printf("a.equals(b) => %b\n",  a.equals(b));
            System.out.printf("a == c  => %b\n", a == c);
            System.out.printf("a.equals(c) => %b\n", a.equals(c));
        }
    }

    > javac intbox.java

    > java Intbox
    a == b => true
    a.equals(b) => true
    a == c  => false
    a.equals(c) => true
Edit to add: Your point that "there is no way to compare equal by identity without also comparing equal by value" is true but doesn't really say much. (I'm not sure it isn't a tautology?) There are cases where one might want to know whether two objects are the same object, or whether they represent the same value. Which you want depends on context.

That's separate from the decision Java made that == was for identity comparison. Some people disagree with that decision.


The point is that equality of identity is objectively more fundamental than equality by any other metric. You can easily observe this by noting that any two objects which are equal in identity are necessarily equal in all other possible ways.

There is no argument to be made that "comparison by value is just as fundamental a concept as comparison by identity", or that this is just a matter of opinion.

The fact that it is possible to compare equal by value without comparing equal by identity does not contradict that it is impossible to compare equal by identity without also comparing equal by value, however you define "equal by value".

> Your point regarding that "there is no way to compare equal by identity without also comparing equal by value" is true but doesn't really say much. (I'm not sure if it isn't a tautology?)

It's not a tautology; it is one of three requirements defining an equivalence relation. (Equivalence relations are by definition reflexive, which is to say x is equivalent to x for all x.)


There are contexts where object identity is meaningless, however. In mathematics, a 2 is a 2 is a 2. This 2 versus this other 2 doesn't have meaning. This can easily trip people up, and arguing that object identity is more fundamental is meaningless. In many contexts, it's the value that's important, and object references actually get in the way. The many discussions on boxed versus unboxed and value types out there are evidence of this.

I agree that in some sense you can say that "identity comparison is more fundamental that value comparison". As I added in my first comment, I think this is likely a tautology, and doesn't do much work for you at the end of the day. The important thing is keeping the two straight and doing the appropriate comparison depending on the context.


In mathematical terms, what I'm saying is that equality is more fundamental than equivalence as defined by some other equivalence relation[1]. If you're considering integers according to their remainder when divided by 5, you can easily show that 4 is equal to 4, and that 4 is equal to -6. If you change your metric to remainder when divided by 3, you'll notice that 4 is still equal to 4, but no longer equal to -6. That is because 4 is fundamentally more similar to 4 than -6 is.

[1] And as I've already mentioned, you can see that this is true by looking at the definition of an equivalence relation. They all treat everything as equivalent to itself.


> any two objects which are equal in identity are necessarily equal in all other possible ways

Two NaNs are equal in identity but not in value, according to the IEEE spec.


I wasn't aware that IEEE floating point provided for two different kinds of equality checking? How would you invoke one or the other?

A NaN value does not compare equal to itself, because NaN is not thought of as a value at all but rather an error signal ("Not a Number"). You can check whether something is NaN, but that's not a distinction between comparing by identity and comparing by value -- it's a distinction between comparing values and checking for the presence of errors. The analogous operations to (1) comparing NaN to itself (unequal) and (2) checking whether NaN is NaN (yes), for another value such as -0, are (1b) comparing -0 to itself (equal) and (2b) checking whether -0 is NaN (no). They aren't (1c) comparing -0 to itself and (2c) checking whether -0 has the same "identity" as -0.
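For concreteness, here's how that distinction looks in Python:

```python
import math

nan = float("nan")
print(nan == nan)       # False: NaN never compares equal, not even to itself
print(math.isnan(nan))  # True: checking for the error is a separate operation
print(nan is nan)       # True: it's still a single object, so identity holds
```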


Except when that value is NaN.


I don't know Java so I may have missed something, but isn't this[0] a counter-example?

[0]: https://ideone.com/w8NhmR


No, that is not a counterexample, because while that code defines a method named "equals", the method so defined is not an equivalence relation.


Why is value-equality inherently more sane than identity-equality, in an object-oriented language? Value-equality takes an unbounded amount of time to compute, as well.


What's the meaning of the following?

    new Long(5) == new Long(5)
I know why the implementation lets me ask that question and I know why it's false, but I'd say that there's no reasonable question I'd ever want to ask using that expression.


You can think up an unreasonable example for anything though.

If I have a graph data structure then `node == node` with identity does make sense to me in lots of cases.

And what's the value equality of `(new Object()).equals(new Object())` - they have no value.


The point is that the language treats numbers as objects when they aren't.

It's not that reference semantics don't have a place, it's just that the place isn't numbers.

As for your second question, it looks like the type system has two embarrassing questions: what the heck does "new Object()" mean? Nothing worth saying.


I use `new Object()` in my code and I use it exactly for identity equality! If I want a unique ID for something I create a new object and it's guaranteed it will never be reference equality to any other object anyone else could construct.


Comparison by value tends to be more useful, and more often what a user expects "==" to mean. In some ways, "==" mirrors the mathematical =. (Not exactly, of course, since mathematics is usually making statements of fact: x = 2 means x is equal to 2, not x is assigned the value 2.)

I think Python is a good example here, where referential comparison is the `is` operator: `a is b`.
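A minimal illustration of that split:

```python
a = [1, 2]
b = [1, 2]
c = a

print(a == b)  # True:  equal values
print(a is b)  # False: two distinct objects
print(a is c)  # True:  two names bound to one object
```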


I agree, that's why I revised my answer (see edit). Probably comparison by value is more useful in higher-level languages and comparison by reference in lower-level ones.


I don't think that it was a specific design decision to make '==' be a pointer comparison instead of value comparison. It's simply a consequence of the much more general design decision that operators cannot be overloaded.

`a==b` is always an analog for the C code `a==b`, not `*a == *b` and certainly not C++'s `operator==(a, b)`


MicroPython does a much better job, using pointer alignment guarantees to pack in small integers:

https://micropython.org/

Apparently it is too hard to move Python proper over to that method because of backwards-compatibility issues with C extensions.
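The idea, sketched in Python purely for illustration (this is not MicroPython's actual code; it assumes heap pointers are at least 2-byte aligned, so the low bit of a real pointer is always 0 and can serve as the tag):

```python
def box_small_int(n):
    # Shift the value up and set the tag bit: the int lives in the word itself.
    return (n << 1) | 1

def is_small_int(word):
    # Tagged words are small ints; untagged words are real heap pointers.
    return word & 1 == 1

def unbox_small_int(word):
    # Arithmetic shift drops the tag and recovers the value (negatives too).
    return word >> 1

w = box_small_int(-42)
print(is_small_int(w), unbox_small_int(w))  # True -42
```

No heap allocation and no shared-object aliasing: every "boxed" small int is just a word.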


Guido said in some interview that they used tagged pointers in some project before Python, and it didn't work well. Apparently there is some benefit in "value is always a reference" (fewer code paths?) that outweighs the somewhat larger heap pressure.


I believe that's called "interning". It's done for strings as well.


The real ProTip is in the comments. :)


the python documentation [1] says the following:

> The current implementation keeps an array of integer objects for all integers between -5 and 256, when you create an int in that range you actually just get back a reference to the existing object. So it should be possible to change the value of 1. I suspect the behaviour of Python in this case is undefined. :-)

does anyone have any idea how they chose that range? it's a 262-wide block starting at -5, which seems incredibly arbitrary.

[1] https://docs.python.org/2/c-api/int.html


You could probably find out from the code's log. I'd guess the small positive integers are common due to e.g. iteration or len() of small collections and the like, and the very small negatives are due to things like error values.


I did some git archaeology. Here are relevant commits and bug reports:

* Initial commit with -1...99 cached:

https://github.com/python/cpython/commit/842d2ccdcd540399501...

* Cache more negative numbers, -5...99:

https://github.com/python/cpython/commit/c91ed400e053dc9f11d...

https://bugs.python.org/issue561244

* Cache more positive numbers, -5...256:

https://github.com/python/cpython/commit/418a1ef0895e826c65d...

https://bugs.python.org/issue1436243


good find! here's a summary:

it looks like they searched the standard library for negative integer literals and that's how they settled on -5 for the low end.

the high end came after introducing the `bytes` object. it went up to 256 instead of just 255 for size/len checks.


The small negatives might also be for indexing from the end of lists. `my_list[-1]` is super common.


Good point.


> We can use the Python built-in function id which returns a value you can think of as a memory address to investigate.

> [...]

> It looks like there is a table of tiny integers and each integer takes up 32 bytes.

It is the memory address but it's a "CPython implementation detail: This [return value of the id() function] is the address of the object in memory."[1]

Though you cannot use this to determine the size of an object, or rather you "shouldn't" because that assumes a very specific implementation detail, which isn't there.

If you'd like to get the size of an object, use sys.getsizeof().[2] Also keep in mind that containers in Python do not contain the objects themselves but references to them, so the returned size is the size of the container itself only, non-recursively. Read "Is Python call-by-value or call-by-reference? Neither."[3] for some more details.

[1]: https://docs.python.org/3.6/library/functions.html#id

[2]: http://docs.python.org/3.6/library/sys.html#sys.getsizeof

[3]: https://jeffknupp.com/blog/2012/11/13/is-python-callbyvalue-...
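For example (the exact byte counts are CPython- and platform-dependent; the point is what getsizeof does and doesn't count):

```python
import sys

print(sys.getsizeof(1))      # the int object's own footprint
nums = [1, 2, 3]
print(sys.getsizeof(nums))   # the list header plus three references, not the ints

# A crude one-level-deep total has to chase the references itself:
total = sys.getsizeof(nums) + sum(sys.getsizeof(n) for n in nums)
print(total)
```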


Python's memory model is amazingly simple and consistent: everything is passed and stored by value, but all values are references to objects.


I wouldn't say it's necessarily neither. While the concept of binding used in [3] is consistent, you can also simply say that functions are call-by-value and all values are references. Pretty much the same way C uses pointers when passed to a function.

BTW, I didn't like the way the difference between mutable and immutable objects was pointed out, it sounded like the language recognized and supported those two types, when in fact the difference is just in the methods exposed by the objects. Even immutable objects can be modified if you don't respect the contracts (like in the case of this article).


"Is", "is." "is"—the idiocy of the word haunts me. If it were abolished, human thought might begin to make sense. I don't know what anything "is"; I only know how it seems to me at this moment.

— Robert Anton Wilson, The Historical Illuminatus Chronicles, as spoken by Sigismundo Celine.

https://en.wikipedia.org/wiki/E-Prime

Kellogg and Bourland describe misuse of the verb to be as creating a "deity mode of speech", allowing "even the most ignorant to transform their opinions magically into god-like pronouncements on the nature of things".

Bourland and other advocates also suggest that use of E-Prime leads to a less dogmatic style of language that reduces the possibility of misunderstanding or conflict.

Alfred Korzybski justified the expression he coined — "the map is not the territory" — by saying that "the denial of identification (as in 'is not') has opposite neuro-linguistic effects on the brain from the assertion of identity (as in 'is')."


This is part of the reason why I prefer structural types (like in Typescript, Go and OCaml) over nominal types (like in most languages), as any object with the required methods is automatically an instance of such a type, instead of having to explicitly declare that it "is"/"extends"/"implements" that type.

I suspect that with dependent types, nominal types are actually a degenerate type of dependent structural type: a dependent pair of a type and a proof that a string 'name' field has some particular value, or that a list<string> 'implements' field contains some particular string (the interface name).


Which is a great approach if you're slinging JSON between different languages and implementations.

Python is fine with duck typing.


It's also really good for testing: if you want to mock some opaque third-party object, you can just create an interface with the same methods as that object and use it in your code. Nominal types often don't allow this, such as how in Java you can't make a class to which you don't have the source implement your own interface, or make it painful, like orphan instance restrictions in Rust and Haskell.


Ruby does something similar, but all Fixnum (native sized) values are 'fixed objects':

    a = 2**62 - 1
    b = 2**62 - 1

    a.object_id == b.object_id # true

    a = 2**62
    b = 2**62

    a.object_id == b.object_id # false
Ruby does automatic promotion from Fixnum (native size) to Bignum (arbitrarily large) and uses one bit of the native size as a flag to identify this, which is why 2^62 - 1 is the max instead of 2^63 - 1. Though I think this is only true of MRI, and other implementations handle it without the flag bit.

Perhaps one difference from Python is that in MRI Ruby Fixnum doesn't really even allocate an 'object', the object_id is the value in disguise. In fact all 'real' objects have even object_ids and all odd object_ids are integers:

    a = 123456789
    (a.object_id - 1) / 2 # 123456789


I wrote a blog post about this in the past. It's really fun going through the oddities of the language like this.

It caches small integers, but also literals used in the same interpreter context (I'm probably getting that last term wrong). You'll get different results if you run these from the shell as opposed to executing a script, try it out!

Here's a fun example

    >>> x = 256; x is 256
    True
    >>> x = 257; x is 257
    True
    >>> x = 257
    >>> x is 257
    False
    >>> def wat():
    ...   x = 500
    ...   return x is 500
    >>> wat()
    True


The Python interpreter compiles programs into bytecode first, and the bytecode includes instructions that load a constant from the constant pool. As an example,

    >>> def f():
    ...     x = 2222; y=2222
    ...     return x is y
    ...
    >>> f()
    True
    >>> f.func_code.co_consts
    (None, 2222)
This last line is showing the constant pool, which is just a standard Python tuple.

I believe the REPL compiles each input in a new toplevel module context, so each input gets its own constant pool.

Functions get their own constant pools, which explains the following behavior:

    >>> if True:
    ...     def f():return 2222
    ...     def g():return 2222
    ...
    >>> f() is g()
    False


I'm glad you and squeaky-clean wrote these comments. When I was experimenting in the Python REPL, I was confused by the last line here:

    >>> 100 is 100
    True
    >>> (10 ** 2) is (10 ** 2)
    True
    >>> (10 ** 3) is (10 ** 3)
    False
    >>> 1000 is 1000
    True
I used the disassembler, but I completely missed that although `1000 is 1000` and `(10 ** 3) is (10 ** 3)` both get optimized to nearly identical bytecode, they load different constants. I wrote it up in a new post and thanked you both. https://kate.io/blog/2017/08/24/python-constants-in-bytecode...


Yep yep, and only when no operations are done to it; otherwise it will end up storing the same constant multiple times and referring to each separately. For example

    >>> def f():
    ...   x = 2221+1; y=2221+1
    ...   return x is y
    >>> f()
    False
    >>> f.__code__.co_consts
    (None, 2221, 1, 2222, 2222)
    >>> dis.dis(f)
      2           0 LOAD_CONST               3 (2222)
                  2 STORE_FAST               0 (x)
                  4 LOAD_CONST               4 (2222)
                  6 STORE_FAST               1 (y)
    
      3           8 LOAD_FAST                0 (x)
                 10 LOAD_FAST                1 (y)
                 12 COMPARE_OP               8 (is)
                 14 RETURN_VALUE
Not too sure when stuff like this would be useful (it goes way beyond the "avoid using `is` unless checking references" advice), but it sure is fun.


So why is this happening? The blog post also shows it, but doesn't explain why (first example)


std_throwaway got the reasoning right in a higher comment. You're actually checking whether the object references are identical, not whether the values are identical.

The tricky bit is that everything is an object in Python, even "primitives" like integers, which can be surprising coming from some other languages. Then the trickier bit is that CPython (the most popular and standard implementation) on startup creates an object for every int from -5 to 256 and will re-use those instead of creating a new PyLongObject. This means that `256 is 256` works, because under the hood, it's short-circuiting to the same object. But `257 is 257` creates 2 different PyLongObjects, which have the same value, 257, but aren't technically the same object.

And in my above example, the reason `x = 257; x is 257` is True when on the same line, but False when on separate lines (in the REPL), is the constant pool kmill describes: if they're interpreted together, the interpreter will create and use the same constant.

Here is my entry on my company's blog about this. I won't guarantee you it's well written, but it does show bytecode and has links to the relevant C source lines at the very end.

https://www.everymundo.com/literals-other-number-oddities-py...


thanks. that makes good sense.


Lisp systems also have fixnums and bignums. For example a 64bit Common Lisp:

  MOST-NEGATIVE-FIXNUM, value: -1152921504606846976                                      
  MOST-POSITIVE-FIXNUM, value: 1152921504606846975        
Fixnums are typically stored inline in data structures (like lists, arrays and CLOS objects). Bignums will be stored as a pointer to a heap-allocated large number. Data is tagged, and thus in a 64-bit Lisp the fixnums will be slightly smaller than 64 bits. Bignums can be arbitrarily large and there is automatic switching between fixnums and bignums for numeric operations.


The range varies across implementations: the above values correspond to 60 bits; most-positive-fixnum evaluates to 4611686018427387903 in the version of SBCL I am currently running, which corresponds to 62 bits of data.


Lots of good discussion at this Stack Overflow question (2008): https://stackoverflow.com/q/306313/893 (Python “is” operator behaves unexpectedly with integers)


    In [7]: a = "foo"

    In [8]: b = "foo"

    In [9]: a is b
    Out[9]: True

    In [10]: b = "foobaljlajdfsklfjds l;kjsl;dfj ls;dfj l;skdj flsdjluejsklnm "

    In [11]: a = "foobaljlajdfsklfjds l;kjsl;dfj ls;dfj l;skdj flsdjluejsklnm "

    In [12]: a is b
    Out[12]: False
Interning seems to kick in for small strings but not for big ones.


Relevant code here:

https://hg.python.org/cpython/file/f254ceec0d45/Objects/code...

It's not actually length; it's whether the string looks like it could be a Python name, and it only happens for string literals.


This would be due to string interning, which automatically caches small strings, which are very common because attribute names are just strings.
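You can poke at this directly with `sys.intern` (CPython behavior; exactly which literals get auto-interned is an implementation detail):

```python
import sys

a = "".join(["he", "llo"])   # built at runtime: not interned
b = "hello"                  # identifier-like literal: interned by the compiler

print(a == b)              # True: same value either way
print(a is b)              # False: two distinct objects
print(sys.intern(a) is b)  # True: intern() hands back the canonical copy
```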


To be clear, the question of which strings to intern is up to the implementation of string interning. Only small strings, all strings, or even only large strings are all options that are good for different reasons.


This comment is great, points out why the article is bad. Python integers are not weird simply because a well-defined operator gives a result which goes against an intuition based on the semantics of English.


Is there any practical reason to use "is" to compare two ints (other than demonstrating integer interning)? Should doing so produce a warning?


I'll take it one step further - I don't think I've ever used `is` for anything other than comparisons to `None`.


It's useful when you want a sentinel object and you don't want to use None as the sentinel because it is a valid value. Some discussion here:

http://www.ianbicking.org/blog/2008/12/the-magic-sentinel.ht...
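The usual shape of that pattern (hypothetical `get_setting` helper):

```python
# _MISSING is a sentinel no caller can pass by accident, so None stays
# available as a legitimate stored value.
_MISSING = object()

def get_setting(settings, key, default=_MISSING):
    if key in settings:
        return settings[key]
    if default is _MISSING:   # identity test against the sentinel
        raise KeyError(key)
    return default            # None is a perfectly valid default here

settings = {"theme": None}
print(get_setting(settings, "theme"))         # None: a real stored value
print(get_setting(settings, "font", "mono"))  # mono
```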


`is` is also useful for things like detecting cycles for walking object graphs.

But I'd estimate 99% of the times I've used `is` was for comparisons to `None`, and 90% of the rest of times for comparisons to other sentinel values.


Yeah, this post is confusing. It reads as though I should expect `is` to be the same operator as `==`. But that's completely off. It's not that the integers are weird, it's that OP is misusing an operator. The topic could have been an explanation of what `is` is; instead it's confusing and not helpful.


A lot of code relies on the fact that "is" is very fast because it just compares memory addresses. In order for it to give a warning when it is used on integers, it would need to also look at the types of the objects, which would have some performance impact.

No, it is not good practice to rely on the interning of ints and short strings. It's implementation-specific behavior and relying on it can cause your code to not work on PyPy or even potentially a future version of CPython.


I guess with MyPy type annotations, you could issue a warning when types are checked if the code is using "is" on types where doing so is implementation-defined and unlikely to produce a useful result.


I hadn't thought of that sort of static analysis. I would guess that the most common variation of using "is" on integers is comparing to a literal ("if x is 3"), so conceivably the bytecode compiler could warn about that without even needing annotations.


Only if you want to determine whether two integers are the same object and not just two objects that look similar. `is` is for identity, whereas `==` is for equivalence.

Idiomatically, you generally only see `is` used for comparison to `None`. Sometimes developers want to check identity, and you'd use `is` there too (but I don't have a handy example offhand).


That depends on what your definition of "is" is.


"is" is a well defined comparison operator in Python which compares object identity. Some philosophical argument about the meaning of "is" is tangential.


The case of strings is also pretty interesting: http://guilload.com/python-string-interning/


I remember implementing this too, on a Nova 1200. When the address space is bigger than the memory, you can place those integers outside the memory. Those objects do not actually exist, in other words. Saves you memory cycles too, because you can calculate the numeric value from the address.



You can also do a similar thing in Java, as illustrated in this answer on CodeGolf stackexchange: https://codegolf.stackexchange.com/a/28818


Indeed - the integer boxing cache is even required by the language spec!

http://docs.oracle.com/javase/specs/jls/se8/html/jls-5.html#...

(here == means reference not value equality)

"If the value p being boxed is an integer literal of type int between -128 and 127 inclusive (§3.10.1), or the boolean literal true or false (§3.10.3), or a character literal between '\u0000' and '\u007f' inclusive (§3.10.4), then let a and b be the results of any two boxing conversions of p. It is always the case that a == b."


Does anyone know why Python refcounts everything -- even small integers, True, False, None...? Why not avoid it?


Sure: CPython doesn't know statically that anything is "not an object", so your proposal adds an extra test to every incref and every decref. This is perceived as probably worse for performance. In 2002, a hacked CPython interpreter (source code long since lost, sorry!) saw a 5% performance decrease on pystone but a 14% performance increase on an integer-heavy workload. https://mail.python.org/pipermail/python-dev/2002-August/027...
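You can watch those shared refcounts directly (CPython-specific; the numbers vary by version and workload, and in 3.12+ some singletons are immortal and report a fixed large count):

```python
import sys

# Every name, container slot, and stack reference anywhere in the
# interpreter bumps these counts, so they are large and ever-changing.
print(sys.getrefcount(None))
print(sys.getrefcount(1))
print(sys.getrefcount(True))
```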


Interesting, thanks for the link! Generally I've viewed writing as more expensive than reading, so I'd be curious if that behavior is still true on modern hardware.


I don't know the internals of Python, but maybe checking whether you need to refcount something basically takes as long as just going ahead and doing it all the time anyway.


That's also the same thought behind the Python practice of "it's better to ask forgiveness than permission".

If you're going to have a conditional that will evaluate to True most of the time, sometimes (in Python) it's considered better to just go ahead and try the logic that the conditional is guarding, and then catch the exception and handle it if it happens.
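A sketch of the two styles, with a hypothetical config lookup (the function names and the 8080 default are illustrative, not from the thread):

```python
def get_port_lbyl(config):
    if "port" in config:        # "look before you leap"
        return config["port"]
    return 8080

def get_port_eafp(config):
    try:                        # just try it...
        return config["port"]
    except KeyError:            # ...and ask forgiveness
        return 8080

print(get_port_eafp({"port": 9000}))   # 9000
print(get_port_eafp({}))               # 8080
```

In the common case (the key exists) the EAFP version pays nothing for the guard; only the rare failure path costs an exception.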


Yeah you might be right, it's hard to say. Generally I view writing as more expensive than reading so I don't know.


Note that it's not just write vs read, it's write vs read + branch.


Right, if it was just 1 read vs. 1 write then I would know!


I'll probably eat my own words, but generally I believe CPython is meant to be easy to reason about, easy to read the code for and understand the implementation, and not to try to be meaningfully performant. I would bet the decision to refcount everything is to keep it simple.


Possibly, though it seems they do try to optimize things (they have an entire peephole optimizer).


> That is surprising! It turns out that all “small integers” with the same value point to the same memory. We can use the Python built-in function id which returns a value you can think of as a memory address to investigate.

Unfortunately this blog post seems to miss a great opportunity to show you how you should compare integers for equality -- using the equality operator `==` and not the identity comparison `is`.

EDIT: odd, this post attracted a lot of downvotes. Please help me learn how this post could be improved.


It never ceases to amaze me that Python seems to be the only one that got this right - namely, that == should have value comparison semantics, for all types, and that some other (preferably distinctive enough, rather than something like ===) operator should always compare references.

Then you look at something like C#, where == compares values for value types and references for reference types, except that == is overloaded for some "value-like" reference types like String... it's a mess.

OTOH, Java is consistent in a sense that == never dereferences, but awkward in practice because of the need to use equals() for strings, which is too verbose for an operation that's far more common than object identity comparison.


It's still confusing in Java because there are primitive types which == compares by value. Scala gets it right: == does the right thing and there's some operator I never use (I think it's "eq"?) if you really want reference comparison.


In Java it's not confusing when you realize that == always compares values (i.e. things that can be stored in variables); it's just that for reference types, the values are references. It'd be more obvious if Java required explicit dereferencing like C++ does, but it's still consistent. Just not convenient.


> In Java it's not confusing when you realize that == always compares values (i.e. things that can be stored in variables); it's just that for reference types, the values are references.

It's still confusing. Plenty of "reference types" behave like values in all obvious senses. Why should "foo" be a reference but 45L a value?


Because "foo" is an instance of String, aka subclass of Object, while 45L is a value for the primitive type long.

I fail to see the confusion, other than for newbies.


That's pretty much saying "it is the way it is", which

a) is not even true, see floating point numbers

and

b) will be obsolete with value types.

Needless to say, there is pointless confusion created by Java's design, and there are better approaches available.

All you have to do is to adapt the semantic model from reference equality vs. value equality to identity vs. equality.

Identity checks whether the "bits" are identical, regardless of whether the bits are "references" or values, and equality is a user-defined operation.

    "Foo" equality "Foo" // True
    "Foo" identity "Foo" // Only true if they point to the same "object"
    123 equality 123 // True
    123 identity 123 // True
    Double.NaN equality Double.NaN // False
    Double.NaN identity Double.NaN // True
Which symbols you pick for "equality" and "identity" is largely arbitrary.
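Python exposes the same split: `==` on floats follows IEEE 754 (NaN is never equal, not even to itself), while `is` is raw identity:

```python
nan = float("nan")
print(nan == nan)   # False -- IEEE 754: NaN != NaN
print(nan is nan)   # True  -- identity: the very same object

# Two separately constructed NaNs are neither equal nor identical:
print(float("nan") == float("nan"))   # False
print(float("nan") is float("nan"))   # False
```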


Better yet, allow identity checks for reference types only. Value types don't have identity, per se, so the operation should be meaningless for them.


I don't think you can get away with that for theoretical and practical reasons:

1.

There is a ton of code out there which does something like

    def contains(that: Thing): Boolean =
      this.value identity that || this.value equality that
Pretty much every single collection implementation would be broken if this stopped working with value types. Additionally, you would run into issues with floating point numbers which would not be found/retrieved anymore if identity were removed.

2.

The idea to define a _sane_ definition of identity/equality across all types is there to avoid the "next-best" option: boxing primitives to wrapper classes which is both slow and has terrible semantics.

3.

I don't really think restricting identity to e.g. reference types makes sense given that equality is defined for every type. Either none of them should be available by default, or both should be.

There _are_ multiple valid ways to compare two things (consider floating point numbers for a second) and making one more privileged than the other feels wrong.
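CPython's containers implement exactly the `identity || equality` fast path sketched in point 1: membership testing checks `x is e or x == e`, which is why a NaN can still be found via the same object even though it never compares equal:

```python
nan = float("nan")
lst = [nan]

print(nan in lst)            # True  -- identity fast path rescues NaN
print(float("nan") in lst)   # False -- different object, and == fails

d = {nan: "found"}
print(d[nan])                # 'found' -- same shortcut in dict lookup
```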


> Pretty much every single collection implementation would be broken if this stopped working with value types.

Maybe collections of values should be different from collections of references. The sensible use cases for the two are quite different.

> Additionally, you would run into issues with floating point numbers which would not be found/retrieved anymore if identity were removed.

Meh, just allow NaN to compare equal to itself. Equality is supposed to be reflexive.

> The idea to define a _sane_ definition of identity/equality across all types is there to avoid the "next-best" option: boxing primitives to wrapper classes which is both slow and has terrible semantics.

Unboxed primitives don't have identity, only value equality. They align well with what's being proposed.

> I don't really think restricting identity to e.g. reference types makes sense given that equality is defined for every type. Either none of them should be available by default, or both should be.

We could build the distinction into the language, so for every type you define you explicitly choose whether it's value or reference. Scala's already halfway there with the class/case class distinction.

> There _are_ multiple valid ways to compare two things (consider floating point numbers for a second)

Disagree; comparison is so fundamental to most types that it's worth privileging. Using the wrong kind of comparison is a very common source of bugs.


> Maybe collections of values should be different from collections of references. The sensible use cases for the two are quite different.

I think all existing code disagrees with that. There has been great value derived from being able to abstract over element types.

What you are proposing would double the required number of collection classes and all of their traits, because it would require separate ones for Collection[E <: AnyRef] and for Collection[E <: AnyVal].

There is literally no reason for introducing this complexity. Go has demonstrated how poorly this idea has worked out in practice.

Additionally, this approach would make it nearly impossible to migrate reference types to value types, because it would break all users of the code.

> Meh, just allow NaN to compare equal to itself. Equality is supposed to be reflexive.

That's a complete non-option. You might not like IEEE's definition of equality, but it is what it is. Messing with it would break all existing code using floating point numbers.

> Unboxed primitives don't have identity, only value equality.

Their identity is the bits they consist of, just like identity on references is the bits of the reference.

> They align well with what's being proposed.

What is being proposed?

> We could build the distinction into the language, so for every type you define you explicitly choose whether it's value or reference.

We already have that: AnyRef and AnyVal.

> Scala's already halfway there with the class/case class distinction.

That doesn't make any sense. The case keyword is basically just a compiler built-in macro to generate some code. It is already doing way too much, and overloading it with even more semantics is not the way to go.

> Disagree; comparison is so fundamental to most types that it's worth privileging. Using the wrong kind of comparison is a very common source of bugs.

What I'm proposing improves the consistency across value and reference types so that it's always obvious which kind of comparison happens:

- identity: Low-level comparison of the bits at hand. Built into the JVM and not overridable.

- equality: High-level comparison defined by the author of the type.


> What you are proposing would double the required number of collection classes and all of their traits, because it would require separate ones for Collection[E <: AnyRef] and for Collection[E <: AnyVal].

Less than double, because not all collections make sense for both - e.g. a sorted set or sorted map only makes sense if the keys are values.

> There is literally no reason for introducing this complexity. Go has demonstrated how poorly this idea has worked out in practice.

It eliminates a common class of errors. All type-level distinctions add a bit of complexity, but we often consider them worthwhile to make.

> Additionally, this approach would make it nearly impossible to migrate reference types to value types, because it would break all users of the code.

Changing from one to the other is a radical change that should force the user to reexamine code that deals with them.

> That's a complete non-option. You might not like the IEEEs definition of equality, but this is what it is. Messing with it would break all existing code using floating point numbers.

Java already deviated from the IEEE definition with Float and Double. The sky didn't fall. Maybe strict IEEE semantics could be offered in their own type where needed, and that type would neither be value or identity-is-meaningful. (This would mean the type system wouldn't allow you to use the strict-IEEE type in any standard collection, which I think is correct behaviour; compare e.g. Haskell where for a long time you could corrupt the standard sorted set structure by inserting two NaNs).

> Their identity is the bits they consist of, just like identity on references is the bits of the reference.

That's a low-level implementation detail that may not even be true on all platforms. The language semantics should make sense.

> We already have that: AnyRef and AnyVal.

No, those are just implementation details of how they're passed around. Many AnyRef types have value semantics.

> That doesn't make any sense. The case keyword is basically just a compiler built-in macro to generate some code. It is already doing way too much, and overloading it with even more semantics is not the way to go.

Well, what I'd like in an ideal language is: no universal equality, opt-in value equality with derivation for product/coproduct types. As for references... I'm not really convinced there's a legitimate use case for comparing references, especially the implicit invisible references that the language uses to implement user classes. If we need reference comparison at all I'd rather something a bit more explicit - either an opt-in "the identity of this class is meaningful", or a notion of explicit references that were much more visible in the code (something a bit like ActorRef), or both.

> What I'm proposing improves the consistency across value and reference types so that it's always obvious which kind of comparison happens:

> - identity: Low-level comparison of the bits at hand. Built into the JVM and not overridable.

> - equality: High-level comparison defined by the author of the type.

That's very inconsistent at the language-semantics level; which things are "the bits at hand" are a low-level implementation detail that should probably be left up to the runtime to represent as best suits a particular code path. At the language level, "does 2L + 2L equal 4L?" is the same kind of question as "does "a" + "b" equal "ab"?", and both those questions are quite different from any question to which reference comparison would be the answer.


This hardly makes any sense, is not practical to implement and theoretically questionable.

It makes decisions that break existing code and introduce pointless complexity, while failing to offer any tangible benefits in return.


That's tautological. The question is not why String behaves as a reference type - obviously, because it is a reference type. The question is, why is String a reference type, when it has such obvious value semantics?

And it's a perfectly valid question. The real answer is that they wanted strings to have methods, so it had to be a class, because primitives can't have methods; and Java doesn't have value classes, so it had to be a reference type. So the reason is a deficiency in the language.

A better question is, why .NET string is a reference type (with overloaded == to make it behave like a value, even!). It could have easily been a struct. I suspect that this is one of those decisions that were made very early on to basically be like Java because that's what people expect; and then it became impossible to change without major breakage.

The real problem is that most implementations that do reference / value type distinction, mix up semantics and implementation. What we really need is the ability to say that something 1) doesn't have a meaningful identity, and 2) is immutable (in a sense that you can't replace individual components of that object; you still can replace the whole object). When you combine these two, you get something that can be implemented as a value type under the hood, but the programmer doesn't need to think about it. They can just continue treating it as a reference, just as they do in Python. There's no way to tell the difference.


Correct me if I'm wrong, but doesn't C++ use == for value comparison?


C++ uses `==` for whatever you want. But generally in C++, user-defined types will be documented as either having 'reference semantics' or 'value semantics' and all aspects of the type will conform to that.


In C++, == pretty much always uses value semantics regardless, because reference semantics are pretty much always explicit (dereference with star etc).


I think you mean reference with & (e.g. &x == &y)


No, I mean dereference if what you have are reference types (i.e. pointers in C; I'm using Java terminology here, since that's the baseline for comparison) to begin with, which is equivalent to the situation in Java. E.g. to copy the referenced object, you need to do *x = *y, while x = y copies the reference (pointer) itself. Similarly, when you compare, *x == *y compares the referenced objects, while x == y compares the references (pointers). And even for member access, you have to do (*x).foo, or the equivalent syntactic sugar x->foo.


Ruby does it, too (== is for value comparison, while .equal? is for reference comparison)


Frankly, what's the point? This is common knowledge.


Cool examples, but I'm not super concerned about the problems arising from the ability to 'use ctypes to directly edit memory'. It's actually pointers to memory blocks, not the memory contents itself: https://docs.python.org/3/library/ctypes.html If you're advanced enough to need to handle pointers to memory blocks in your python program, you are probably good enough to know not to create problems with the behavior of iterators on ranges.
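That said, ctypes can read raw memory via from_address. A heavily CPython-specific sketch (the PyObject layout is an implementation detail that varies across versions and builds, so this is illustration only, never production code) that peeks at an object's header:

```python
import ctypes
import sys

x = object()
# On typical CPython builds, the first Py_ssize_t field of a PyObject
# is ob_refcnt, and id(x) is the object's address. Not portable.
raw_field = ctypes.c_ssize_t.from_address(id(x)).value
print(raw_field, sys.getrefcount(x))
```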


Nice article. I wrote a similar piece some time ago related to booleans, in case anybody is interested: https://blog.rmotr.com/those-tricky-python-booleans-2100d5df...

And to avoid issues with is/==, we recommend that our students always use == (except for `is None`). Also related piece: https://blog.rmotr.com/avoiding-being-bitten-by-python-161b0...
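The reason `is None` is the safe spelling: `==` dispatches to a user-defined __eq__, which can claim anything. A hypothetical pathological class makes the point:

```python
class Agreeable:
    # Hypothetical: an __eq__ that claims equality with everything.
    def __eq__(self, other):
        return True

x = Agreeable()
print(x == None)   # True  -- __eq__ lied
print(x is None)   # False -- identity cannot be overridden
```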


That's unnecessarily restrictive -- I've gotten great use out of "is". You wouldn't want to use it on an integer, but I've been using namedtuples a lot lately, and "LastKnownConfiguration is CurrentConfiguration" works great to check whether anything has changed without checking all of the fields for equality.


You're very much right and it's a great suggestion. We just suggest that to our students when they're starting in order to help them avoid issues. But there are exceptions.


Fair point -- == versus "is" is a pretty subtle distinction and best avoided by novices.


For comparison, integers in clisp:

  [1]> (eq (expt 2 47) (expt 2 47))
  T
  [2]> (eq (expt 2 48) (expt 2 48))
  NIL
Explanation here: https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node17.html


But nobody would use eq in lisp for this. For numbers you would use equal, comparing the values. You only would use eq for addresses, like cons cells or vectors. E.g. in php there is == vs ===.

re python: this is of course superlame. the compiler can easily detect "is <int>" and forbid the const int table optimization. Only "is None" would make sense.



You can use sys.getrefcount() to explore these "weird integers": https://news.ycombinator.com/item?id=15093897
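A quick look at how shared the cached small-int objects are (exact numbers vary by version; since Python 3.12 small ints are immortal and report a huge sentinel count, which only makes the contrast starker):

```python
import sys

# Every interpreter-internal slot holding a reference to the cached
# `1` contributes to this count, so it is typically enormous:
print(sys.getrefcount(1))

# A big integer is a fresh object with only a couple of references
# (our variable plus the getrefcount argument itself):
big = 10 ** 100
print(sys.getrefcount(big))
```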


Stumbled across this when debugging a reference leak with a C-extension once -- small integers didn't exercise the problem, but larger ones did...


Brilliant! Very interesting and an insight into the inner workings of python. Thank you for sharing Kate!


related read:

Equal Rights for Functional Objects or, The More Things Change, The More They Are the Same (1993) by Henry Baker

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.9...


    a = 1000
    b = 1000
    a is b

The output is False only in the REPL. Run the same code from a .py script and it returns True. I read about it on SO but can't find the link!
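The likely explanation (a CPython detail): equal constants are deduplicated within a single code object, and a script or function body compiles as one code object, while the REPL compiles each line separately. A sketch:

```python
def same_code_object():
    a = 1000          # both literals land in this function's
    b = 1000          # co_consts, deduplicated to one object
    return a is b

def built_at_runtime():
    a = int("1000")   # fresh objects each call, nothing shared
    b = int("1000")
    return a is b

print(same_code_object())   # True in CPython
print(built_at_runtime())   # False in CPython
```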


"things you should never do"


It's not weird, it's pythonic.


Now change it back!


Yup, 0day hunters have fun with this behavior from time to time.


Why is it interesting for "0day hunters" in particular?


I believe it's because you can find code where it does the right thing when e.g. a username is short enough, but fails in a useful way if you increase an input string length.


If you can write to memory in a way that overwrites integer values the whole logic of a program falls apart. Imagine if an adversary could affect the result of everything you do with integers.

It's not necessarily a common thing. Most people aren't running python directly on the internet but being able to screw around with stuff like this is just another important tool in the toolbox when you're attacking something.


good guy "0day hunter"

> has 4-byte arbitrary global write primitive

> uses it to change arithmetic


That's not weird but pretty common knowledge. IIRC 1-char strings are interned too.
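The single-character cache is easy to observe (again a CPython implementation detail; chr() keeps the compiler from folding the literals into shared constants):

```python
a = chr(97)        # 'a' -- CPython caches 1-char latin-1 strings
b = chr(97)
print(a is b)      # True in CPython

c = chr(0x263A)    # outside the cached latin-1 range
d = chr(0x263A)
print(c is d)      # False in CPython -- fresh objects each time
```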



