
> The encoding of a string should never cause a comparison to fail when the two strings are equivalent except for the encoding.

If you mean comparing "file" == b"file" that's not possible on several levels.

Firstly, even if you say "just compare the bytes", the computer doesn't know what byte-format you want for "file". Sure, it's "Unicode", but is it UTF-8 or UTF-16 or what? Those choices will produce different results, and the computer cannot accurately guess the right one for you.
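To make the point about byte formats concrete, here is a minimal sketch that encodes the same text three ways. The choice of encodings is only illustrative; the explicit little-endian variants are used so that no byte-order mark is added.

```python
text = "file"

# The same four characters produce different bytes under each encoding.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

for name, data in [("utf-8", utf8), ("utf-16-le", utf16), ("utf-32-le", utf32)]:
    print(name, len(data), data)

# Only one of these byte sequences happens to equal b"file".
print(utf8 == b"file")   # True
print(utf16 == b"file")  # False
```

So "just compare the bytes" only has an answer once somebody picks the encoding.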

Secondly, that violates Python's normal rules by introducing type juggling. It's equivalent to asking for expressions like ("15"==15) or ("True"==True) to work, and involves all the same kinds of long-term problems. (Don't believe me? Work in PHP for a few years...)
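For reference, a small sketch of how Python 3 treats these cross-type comparisons today, assuming a default interpreter with no extra warning flags. Equality between unrelated types is simply False, while operations that would need real type juggling raise an error.

```python
# Equality across unrelated types evaluates to False; nothing is coerced.
comparisons = {
    '"file" == b"file"': "file" == b"file",
    '"15" == 15': "15" == 15,
    '"True" == True': "True" == True,
}
for expression, result in comparisons.items():
    print(expression, "->", result)

# Arithmetic is where the lack of type juggling becomes an error.
juggling_error = None
try:
    "15" + 15
except TypeError as exc:
    juggling_error = exc
print(type(juggling_error).__name__)  # TypeError
```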




As I said, Delphi/FreePascal have been handling this for years without issue, so "not possible" sounds like you're giving up a little too early. The encoding of a string is not the same thing as its type, and shouldn't be treated as such. Python has to know the encoding of the "file" string from the encoding of the source file. It also has to know the encoding of the b"file" string, because it is explicitly specified. That's all the information it needs to make the comparison, so it should either a) issue a compilation/runtime error if it's an invalid comparison, or b) return a proper comparison result. Returning an invalid comparison result is the worst of all possible outcomes.

As for character sets/code points:

https://stackoverflow.com/questions/130438/do-utf-8-utf-16-a...

A byte string is simply a string that is using the lower ASCII characters (< 127). The code points for "file" map cleanly to the same code points in any Unicode encoding.


A Python byte sequence (the b"file" in the example) is not necessarily a string that uses the lower ASCII characters. It's an arbitrary sequence of arbitrary bytes (not only < 127), so an equality operation comparing a string with a sequence of bytes needs to be well defined for all possible byte sequences. That includes non-ASCII bytestrings like b'\xe2\xe8\xe7\xef', which decodes to different strings in different ANSI encodings and is not valid UTF-8. Bytestring data also carries no assumption about its encoding, especially if you just received those bytes over the network or read them from a file.
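A quick sketch of that example, using cp1251 and cp1252 as two stand-ins for "ANSI encodings" (that choice of code pages is an assumption made for illustration):

```python
raw = b"\xe2\xe8\xe7\xef"

# The same four bytes decode to different text depending on the code page.
as_cp1251 = raw.decode("cp1251")  # Cyrillic letters
as_cp1252 = raw.decode("cp1252")  # accented Latin letters
print(as_cp1251, as_cp1252, as_cp1251 == as_cp1252)

# As UTF-8 the bytes are not valid at all.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
print(type(utf8_error).__name__)  # UnicodeDecodeError
```

Any str-versus-bytes equality that tried to decode first would have to pick one of these outcomes for every possible input.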

Furthermore, even for ASCII sequences like b"file", the bytes do not map to the string "file" in every Unicode encoding. For example, read as UTF-16 (little-endian), the bytes of b"file" represent "楦敬", which is a bit different from "file".
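The UTF-16 case can be checked directly. This sketch spells out the byte order explicitly, since a bare "utf-16" decode without a byte-order mark depends on the platform:

```python
raw = b"file"

as_ascii = raw.decode("ascii")
as_utf16_le = raw.decode("utf-16-le")  # pairs of bytes become single code units
as_utf16_be = raw.decode("utf-16-be")

print(as_ascii)     # file
print(as_utf16_le)  # two CJK characters, not "file"
print([hex(ord(ch)) for ch in as_utf16_le])  # ['0x6966', '0x656c']
print([hex(ord(ch)) for ch in as_utf16_be])  # ['0x6669', '0x6c65']
```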


If the "string" b"file" does not mean the ASCII string "file", but rather is supposed to be interpreted as a byte array (equivalent to just bytes in memory with no context of the individual array members being characters), then my original point still stands: such a comparison shouldn't be allowed at all and an error should be raised. Simply returning False implies that the comparison is valid for these types and that the two strings just happened not to be equal.
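For what it's worth, CPython does have an opt-in for this: the -b command-line flag warns on str/bytes comparisons, and giving it twice (-bb) turns the warning into an error. A rough sketch that runs the comparison in a child interpreter both ways (it assumes sys.executable points at a CPython 3 interpreter):

```python
import subprocess
import sys

snippet = 'print("file" == b"file")'

# Default behaviour: the comparison quietly evaluates to False.
default_run = subprocess.run(
    [sys.executable, "-c", snippet], capture_output=True, text=True
)

# With -bb the same comparison raises BytesWarning as an error.
strict_run = subprocess.run(
    [sys.executable, "-bb", "-c", snippet], capture_output=True, text=True
)

print(default_run.stdout.strip())
print(strict_run.returncode)
print(strict_run.stderr.strip().splitlines()[-1])
```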

I thought Python was strongly-typed? Am I incorrect in this regard?


Python is not strongly-typed, it's "duck-typed", i.e. everything is an object, and you should be able to hand over "X-like" objects to code that expects type X, and it should work properly if your X-like object supports all the interfaces that type X does. As part of that duck-typing, it's valid to compare anything with anything for equality without raising an error; different things are simply (by default) not equal, so the comparison returns False. For example, you can compare an object representing a database connection with the integer 5, and that is a valid comparison that returns False.

This behavior is a key requirement for all kinds of core Python data structures. For example, if bytestrings were defined to throw an error when compared to a "normal" string, many standard operations would break on a heterogeneous list (pretty much all data structures in Python can be heterogeneous regarding data types) containing both b"something" and "something". Checking whether "something" is in that list, for instance, requires the list code to perform exactly that comparison.
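A small sketch of the list case. StrictBytes is a hypothetical class invented here to show what would happen if bytes refused to be compared with str; it is not anything Python ships.

```python
items = [b"something", 5, None, "something"]

# Membership walks the list and compares each element for equality.
found = "something" in items
position = items.index("something")
print(found, position)  # True 3


class StrictBytes(bytes):
    """Hypothetical bytes type that raises when compared with str."""

    def __eq__(self, other):
        if isinstance(other, str):
            raise TypeError("cannot compare bytes with str")
        return super().__eq__(other)

    __hash__ = bytes.__hash__


strict_items = [StrictBytes(b"something"), "something"]
try:
    "something" in strict_items
    membership_error = None
except TypeError as exc:
    # The lookup fails on the first element, before the real match is reached.
    membership_error = exc
print(membership_error)
```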


> Python has to know the encoding of the "file" string by the encoding of the source file.

Not when it's a string literal like in your example.

The string "foo" has no file, not in the past, present, or reasonably-predictable future. Ditto for the literal bytes.

> It then also has to know what the encoding of the b"file" string is because it is explicitly specified.

Explicitly specified by who? Where? When?

They're just bytes, they don't have a text-encoding yet, or perhaps never: It could be a picture file, or a random seed.
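As an illustration of bytes that were never text, here is a sketch using the 8-byte PNG file signature. It is written as an ordinary b"..." literal, yet it cannot be read as ASCII:

```python
# The PNG file signature: binary data that happens to contain "PNG".
png_signature = b"\x89PNG\r\n\x1a\n"

print(len(png_signature))   # 8
print(list(png_signature))  # raw byte values, some above 127

try:
    png_signature.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
print(type(ascii_error).__name__)  # UnicodeDecodeError
```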


<< The string "foo" has no file, not in the past, present, or reasonably-predictable future. Ditto for the literal bytes. >>

Is Python not parsing/compiling the source file? Does the "foo" string constant not live in such a source file?

<< Explicitly specified by who? Where? When? >>

The "b" prefix states what the encoding is: it is a raw byte string whose bytes are assumed to correspond to their ASCII counterparts.


> Delphi/FreePascal have been handling this for years without issue

Except people don't really want to write and learn those languages?


That's not my point, and I think that you know that.



