It also doesn't help that the classic LAMP stack has very confusing defaults and badly named functions:
* PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"
* MySQL for the longest time used latin1 as a default character set, then introduced an insufficient character set called "utf8" which only allows up to 3 bytes per character (not enough for all UTF-8-encoded code points), then introduced a proper implementation called "utf8mb4".
* MySQL connectors and client libraries often default their "client character set" setting to latin1, causing "silent" transcodes against the "server character set" and table column character sets. Also, because their "latin1" charset is more or less a binary-safe encoding, it is very easy to get double latin1-to-utf8 transcoded data into the database. This often goes unnoticed as long as data is merely received, inserted, selected, and output to a browser, until you start working on substrings or case-insensitive searches, etc.
* In Java, there are tons of methods operating on the boundary between bytes and characters that allow not specifying an encoding, which then silently falls back to an essentially arbitrary system default encoding.
* Many languages such as Java and JavaScript, as well as the Unicode variants of Win32, were unfortunately designed at a time when Unicode characters could fit into 16 bits, with the devastating result that the data type "char" is too small to store a single Unicode character. It also plays hell with substring indexing.
In short, the APIs are stacked against the beginning programmer and don't make it obvious that when you go from working with abstract "characters" to byte streams, there is ALWAYS an encoding involved.
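The double-transcode trap described above can be reproduced in a few lines of Python 3 (a sketch of the failure mode, not tied to any particular MySQL driver):

```python
s = "é"                     # U+00E9
utf8 = s.encode("utf-8")    # b'\xc3\xa9'
# The classic mistake: the bytes are already UTF-8, but a layer that
# believes it is handling latin1 "transcodes" them to UTF-8 again.
double = utf8.decode("latin-1").encode("utf-8")
print(double)               # b'\xc3\x83\xc2\xa9'
print(double.decode("utf-8"))  # 'Ã©' -- classic mojibake
```

Note that no step raises an error, which is exactly why the corruption goes unnoticed until something inspects the text more closely than a round-trip to the browser does.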
> * PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"
In the XML module, no less. I'll get round to moving those out of there eventually.
Does any programming language get Unicode right all the way? I thought Python did it mostly correctly, but with combining characters, for example, I would argue that it gets it wrong if you try to reverse a Unicode string.
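A minimal Python 3 illustration of the combining-character problem:

```python
s = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT, renders as 'é'
print(len(s))   # 2 -- two code points, though it displays as one character
print(s[::-1] == "\u0301e")  # True -- the accent now precedes the base letter
```

Reversing by code points moves the accent onto the wrong neighbour, so even a language that iterates code points correctly still "gets it wrong" here.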
My basic litmus test for "does this language support Unicode" is, "does iterating over a string get me code points?"¹
Rust, and recent versions of Python 3 (but not early versions of Python 3, and definitely not 2…) pass this test.
I believe that all of JavaScript, Java, C#, C, C++ … all fail.
(Frankly, I'm not sure anything in that list even has built-in functionality in the standard library for doing code-point iteration. You have to more or less write it yourself. I think C# comes the closest, by having some Unicode utility functions that make the job easier, but still doesn't directly let you do it.)
¹Code units are almost always, in my experience, the wrong layer to work at. One might argue that code points are still too low level, but this is a basic litmus test (I don't disagree that code points are often wrong, it's mostly a matter of what can I actually get from a language).
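As a concrete version of the litmus test: in Python 3, iterating over a string yields code points, including astral ones, rather than UTF-16 code units:

```python
s = "a\U0001D11E"  # 'a' + U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)
points = [hex(ord(c)) for c in s]
print(points)      # ['0x61', '0x1d11e'] -- code points, not surrogate halves
```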
> try to reverse a Unicode string.
A good example of where even code points don't suffice.
Lua 5.3 can iterate over a UTF-8 string. You can even index character positions (not byte positions) in a UTF-8 string. Some more information here: https://www.lua.org/manual/5.3/manual.html#6.5
I basically agree with you, but note that code points are not the same as characters or glyphs. Iterating over code points is a code smell to me. There is probably a library function that does what you actually want.
I explicitly mention exactly this in my comment, and provide an example of where it breaks down. The point, which I also heavily noted in the post, is that it's a litmus test. If a language can't pass the iterate-over-code-points bar, do you really think it would give you access to characters or glyphs?
However, it exposes the encoding directly as a sequence of 16-bit ints. In other words, if you iterate over a string or index it, you're getting those, and not codepoints (i.e. it doesn't account for surrogate pairs).
Note that this only applies to iteration and indexing. All string functions do understand surrogates properly.
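The code-unit-vs-code-point distinction being discussed can be made concrete from Python, by encoding to UTF-16 and counting units (a sketch; languages that expose 16-bit units report the unit count as the string's length):

```python
s = "\U0001D11E"                         # one code point outside the BMP
print(len(s))                            # 1 -- Python 3 counts code points
units = len(s.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes
print(units)                             # 2 -- stored as a surrogate pair
```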
I'd accept that, too. I'm not familiar w/ Swift, so it wasn't in the list above. (But I do think the default should not be code units, or programmers will use it incorrectly out of ignorance. Forcing them to choose prevents that, hence, I'll allow it)
I would go one step further and say that there is no meaningful default for these things. It all depends on the context, and there's no single context that is so common that it's the only one that most people ever see. Thus, it should always be explicit, and an attempt to enumerate or index a string directly should not be allowed - you should always have to spell out if it's the underlying encoding units, or code points, or text elements, or something else.
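To make "spell out which layer you mean" concrete, here is a naive Python sketch (the `naive_graphemes` helper is hypothetical) that groups code points into something closer to text elements by attaching combining marks to the preceding base character; real grapheme-cluster segmentation follows Unicode's UAX #29 and handles many more cases:

```python
import unicodedata

def naive_graphemes(s):
    # Naive sketch: attach combining marks to the preceding base character.
    # Proper grapheme segmentation (UAX #29) is considerably more involved.
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

s = "e\u0301f"
print(list(s))             # ['e', '\u0301', 'f'] -- three code points
print(naive_graphemes(s))  # ['e\u0301', 'f'] -- two visible characters
```

Each of the three views (code units via `s.encode(...)`, code points via `iter(s)`, text elements via something like the above) answers a different question, which is the argument for never letting plain iteration pick one silently.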
How does Clojure or ClojureScript do as they are built on top of JVM/CLR or JavaScript? I'm assuming that since they fall back to many of the primitives of their respective runtimes that they use their implementations.
I've had the least trouble when using Apple's Objective-C (NSString), and Microsoft's C# - these two at least make you take conscious decisions when transcoding to bytes.