> There's actually way more we can say about this, but this is a huge book already, haha.
Yeah, so much for "I'll try to keep this short" :P
> One difference in encoding between python and ruby I think is, in ruby 1.9+, if a string is tagged with an encoding but contains bytes invalid for that encoding (that do not represent a legal sequence of chars), you'll get an exception if you try to concat it to anything else -- even a string of the same encoding. I don't _think_ that happens in python?
True; also, strings in Python are immutable, so unless there's some weird way to access the underlying char* through the CPython C API, I don't think you can end up with an invalid sequence of bytes inside a Unicode string.
(Obviously you can end up with the replacement codepoint U+FFFD, if you set errors='replace' when decoding.)
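A quick sketch of that behavior in Python (as implemented by CPython's standard codecs): strict decoding refuses to produce a str from invalid UTF-8, while errors='replace' substitutes U+FFFD; once decoded, the str is validated codepoints, so concatenation can never raise an encoding error.

```python
# Invalid UTF-8: 0xC3 starts a two-byte sequence, but 0x28 can't continue it.
bad = b"caf\xc3\x28"

# Strict decoding (the default) refuses to produce an invalid str.
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# errors='replace' yields a valid str containing U+FFFD instead.
s = bad.decode("utf-8", errors="replace")
print(s)                  # 'caf\ufffd('
print("\ufffd" in s)      # True

# Once decoded, the str is plain codepoints: concatenating it can't raise
# an encoding error, unlike Ruby's per-string-encoding model.
print(s + "more")
```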
> Ruby also doesn't have a canonical internal encoding, strings can be in _any_ encoding it recognizes, tagged with that encoding and containing internal bytes in memory that are actually the bytes for that encoding (any one ruby knows about). I am not aware of any other language that made that choice -- I think it came about because of experience in the Japanese context, although at this point, I think anyone would be insane not to keep all of your in-memory strings in UTF-8 and transcode them on I/O, and kind of wish the language actually encouraged/required that. But, hey, I program in a U.S. context.
Yeah, some time ago I looked into the differences between Python's and Ruby's encoding handling, and I wrote down these notes that I just uploaded:
https://gist.github.com/berdario/9b6bd24cafe3817e4773
There are indeed some characters/ideograms that cannot be converted to Unicode codepoints, but even if we try to obtain them, we Westerners are none the wiser, since we cannot print them to our terminals in a UTF-8 locale.
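To make the contrast with Ruby's model concrete, here's a minimal Python sketch: a str carries no encoding tag at all (it's just a sequence of codepoints), so encodings only appear at the I/O boundaries, when decoding input bytes and encoding output bytes.

```python
# Bytes arriving from I/O, in some legacy encoding (Latin-1 here).
raw = b"caf\xe9"              # 0xE9 is 'é' in Latin-1

# Decode at the input boundary: from here on it's just codepoints,
# with no encoding attached to the string object.
s = raw.decode("latin-1")
print(s)                       # café
print(hasattr(s, "encoding"))  # False — str has no encoding tag

# Re-encode at the output boundary, in whatever encoding we want.
print(s.encode("utf-8"))       # b'caf\xc3\xa9'
print(s.encode("latin-1"))     # b'caf\xe9'
```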
About the edit you just added:
> I definitely like it better than Java, which did decide all strings had to have a canonical internal encoding (if only it wasn't the pre-unicode-consolidation "UCS-2"!! perhaps that experience, of choosing the in-retrospect wrong canonical internal encoding influenced ruby's choice)
Yes, but I think this issue is made more complex by Java's efforts to keep bytecode compatibility.
In a language like Python/Ruby, the bytecode is only an internal implementation detail that you shouldn't rely upon (you should rely only on the semantics of the source code). If the actual encoding of your Unicode strings is likewise kept an internal implementation detail, this issue could have been avoided (without switching to linear-time algorithms for string handling):
Just migrate to UTF-32 (or to a dynamic fixed-width encoding like the one in Python 3.3) as the in-memory representation when parsing strings from the source code, and everything would have continued to work.
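The "dynamic fixed-width encoding" mentioned here is CPython's flexible string representation (PEP 393, since 3.3): each str picks the narrowest fixed width — 1, 2, or 4 bytes per codepoint — that fits its widest character, so indexing stays O(1). A rough, CPython-specific way to observe it (sys.getsizeof includes object overhead, so compare strings of equal length):

```python
import sys

n = 100
latin  = "\xe9" * n        # é: fits in 1 byte per codepoint
bmp    = "\u20ac" * n      # €: needs 2 bytes per codepoint
astral = "\U0001f600" * n  # emoji: needs 4 bytes per codepoint

for label, s in [("1-byte", latin), ("2-byte", bmp), ("4-byte", astral)]:
    print(label, sys.getsizeof(s))

# Each step roughly doubles the payload, yet indexing any of them is O(1).
print(sys.getsizeof(latin) < sys.getsizeof(bmp) < sys.getsizeof(astral))  # True
```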
I think it had more to do with Han unification than with the fear of picking the "wrong encoding".