
> Statements like "first 256 code points in Unicode map to Latin-1" make little sense.

That's not true. Latin-1 is both a character set and an encoding, so "the first 256 Unicode code points map to the corresponding Latin-1 characters" is a reasonable statement.

You can also say "the first 128 Unicode code points, when encoded in UTF-8, are equal to the corresponding Latin-1 encoding".
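Both statements are easy to check. Here is a minimal sketch, assuming Python 3 and its built-in "latin-1" and "utf-8" codecs; the variable names are only illustrative:

```python
# Every Latin-1 byte value decodes to the code point with the same number.
latin1_matches = all(
    bytes([i]).decode("latin-1") == chr(i) for i in range(256)
)

# For code points 0-127, UTF-8 and Latin-1 produce the same single byte.
ascii_range_matches = all(
    chr(i).encode("utf-8") == chr(i).encode("latin-1") == bytes([i])
    for i in range(128)
)

# Above 127 the two encodings diverge: U+00E9 takes two bytes in UTF-8
# but stays a single byte in Latin-1.
e_acute_utf8 = "\u00e9".encode("utf-8")      # b'\xc3\xa9'
e_acute_latin1 = "\u00e9".encode("latin-1")  # b'\xe9'

print(latin1_matches, ascii_range_matches, e_acute_utf8, e_acute_latin1)
```

The last two lines show why the UTF-8 claim stops at 128 while the character-set claim holds for all 256.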




My point is that in Python land it is silly to call u"" strings "Unicode" strings. A Unicode string is a string in UTF-8, UTF-16, UTF-32, or one of a bunch of lesser-used encodings. For that matter, a plain "" string could be used as a Unicode string as long as it contains only ASCII. What the docs should be talking about is ASCII vs UTF-16, not ASCII/Latin-1 vs Unicode. The difference starts to matter when you ask questions like "How much memory is consumed by this string?" or "What characters can I not store in Python?" In this light, Python 3 is a big improvement: it has immutable byte arrays (bytes) and it has encoded strings (str).
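To make those two questions concrete, here is a rough sketch, assuming Python 3. It measures encoded byte lengths, not the interpreter's internal memory use, and the sample text and variable names are made up for illustration:

```python
text = "na\u00efve"  # five characters; U+00EF is outside ASCII

# The same text occupies a different number of bytes under each encoding.
sizes = {
    enc: len(text.encode(enc))
    for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
}

# An encoding can only store the characters in its repertoire:
# the euro sign (U+20AC) has no Latin-1 representation.
try:
    "\u20ac".encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

# bytes objects are immutable, so item assignment is rejected.
data = text.encode("utf-8")
try:
    data[0] = 0x4E
    bytes_are_mutable = True
except TypeError:
    bytes_are_mutable = False

print(sizes, euro_fits_latin1, bytes_are_mutable)
```

The byte counts come out as 5, 6, 10 and 20, which is why the answer to "how much memory?" depends on naming the encoding and not just saying "Unicode".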



