It also doesn't help that the classic LAMP stack has very confusing defaults and badly named functions:
* PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"
* MySQL for the longest time used latin1 as a default character set, then introduced an insufficient character set called "utf8" which only allows up to 3 bytes per character (not enough for all UTF-8-encoded code points), then introduced a proper implementation called "utf8mb4".
* MySQL connectors and client libraries often default their "client character set" setting to latin1, causing "silent" transcodes against the "server character set" and table column character sets. Also, because their "latin1" charset is more or less a binary-safe encoding, it is very easy to get double latin1-to-utf8 transcoded data into the database. This often goes unnoticed as long as data is merely received, inserted, selected, and output to a browser, until you start working on substrings or case-insensitive searches, etc.
* In Java, there are tons of methods operating on the boundary between bytes and characters that allow not specifying an encoding, which then silently falls back to an essentially arbitrary system default encoding.
* Many languages such as Java and JavaScript, as well as the Unicode variants of Win32, were unfortunately designed at a time when Unicode characters could fit into 16 bits, with the devastating result that the data type "char" is too small to store a single Unicode character. It also plays hell with substring indexing.
In short, the APIs are stacked against the beginning programmer and don't make it obvious that when you go from working with abstract "characters" to byte streams, there is ALWAYS an encoding involved.
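The double-transcode trap described above can be reproduced in a few lines of Python 3 (a sketch of the failure mode, not tied to any particular MySQL driver):

```python
s = "é"                     # U+00E9
utf8 = s.encode("utf-8")    # b'\xc3\xa9'
# The classic mistake: the bytes are already UTF-8, but a layer that
# believes it is handling latin1 "transcodes" them to UTF-8 again.
double = utf8.decode("latin-1").encode("utf-8")
print(double)               # b'\xc3\x83\xc2\xa9'
print(double.decode("utf-8"))  # 'Ã©' -- classic mojibake
```

Note that no step raises an error, which is exactly why the corruption goes unnoticed until something inspects the text more closely than a round-trip to the browser does.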
> * PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"
In the XML module, no less. I'll get round to moving those out of there eventually.
Does any programming language get Unicode right all the way? I thought Python did it mostly correctly, but with combining characters, for example, I would argue that it gets it wrong if you try to reverse a Unicode string.
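A minimal Python 3 illustration of the combining-character problem:

```python
s = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT, renders as 'é'
print(len(s))   # 2 -- two code points, though it displays as one character
print(s[::-1] == "\u0301e")  # True -- the accent now precedes the base letter
```

Reversing by code points moves the accent onto the wrong neighbour, so even a language that iterates code points correctly still "gets it wrong" here.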
My basic litmus test for "does this language support Unicode" is, "does iterating over a string get me code points?"¹
Rust, and recent versions of Python 3 (but not early versions of Python 3, and definitely not 2…) pass this test.
I believe that all of JavaScript, Java, C#, C, C++ … all fail.
(Frankly, I'm not sure anything in that list even has built-in functionality in the standard library for doing code-point iteration. You have to more or less write it yourself. I think C# comes the closest, by having some Unicode utility functions that make the job easier, but still doesn't directly let you do it.)
¹Code units are almost always, in my experience, the wrong layer to work at. One might argue that code points are still too low level, but this is a basic litmus test (I don't disagree that code points are often wrong, it's mostly a matter of what can I actually get from a language).
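As a concrete version of the litmus test: in Python 3, iterating over a string yields code points, including astral ones, rather than UTF-16 code units:

```python
s = "a\U0001D11E"  # 'a' + U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)
points = [hex(ord(c)) for c in s]
print(points)      # ['0x61', '0x1d11e'] -- code points, not surrogate halves
```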
> try to reverse a Unicode string.
A good example of where even code points don't suffice.
Lua 5.3 can iterate over a UTF-8 string. You can even index character positions (not byte positions) in a UTF-8 string. Some more information here: https://www.lua.org/manual/5.3/manual.html#6.5
I basically agree with you, but note that code points are not the same as characters or glyphs. Iterating over code points is a code smell to me. There is probably a library function that does what you actually want.
I explicitly mention exactly this in my comment, and provide an example of where it breaks down. The point, which I also heavily noted in the post, is that it's a litmus test. If a language can't pass the iterate-over-code-points bar, do you really think it would give you access to characters or glyphs?
However, it exposes the encoding directly as a sequence of 16-bit ints. In other words, if you iterate over a string or index it, you're getting those, and not codepoints (i.e. it doesn't account for surrogate pairs).
Note that this only applies to iteration and indexing. All string functions do understand surrogates properly.
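The code-unit-vs-code-point distinction being discussed can be made concrete from Python, by encoding to UTF-16 and counting units (a sketch; languages that expose 16-bit units report the unit count as the string's length):

```python
s = "\U0001D11E"                         # one code point outside the BMP
print(len(s))                            # 1 -- Python 3 counts code points
units = len(s.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes
print(units)                             # 2 -- stored as a surrogate pair
```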
I'd accept that, too. I'm not familiar w/ Swift, so it wasn't in the list above. (But I do think the default should not be code units, or programmers will use it incorrectly out of ignorance. Forcing them to choose prevents that, hence, I'll allow it)
I would go one step further and say that there is no meaningful default for these things. It all depends on the context, and there's no single context that is so common that it's the only one that most people ever see. Thus, it should always be explicit, and an attempt to enumerate or index a string directly should not be allowed - you should always have to spell out if it's the underlying encoding units, or code points, or text elements, or something else.
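To make "spell out which layer you mean" concrete, here is a naive Python sketch (the `naive_graphemes` helper is hypothetical) that groups code points into something closer to text elements by attaching combining marks to the preceding base character; real grapheme-cluster segmentation follows Unicode's UAX #29 and handles many more cases:

```python
import unicodedata

def naive_graphemes(s):
    # Naive sketch: attach combining marks to the preceding base character.
    # Proper grapheme segmentation (UAX #29) is considerably more involved.
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

s = "e\u0301f"
print(list(s))             # ['e', '\u0301', 'f'] -- three code points
print(naive_graphemes(s))  # ['e\u0301', 'f'] -- two visible characters
```

Each of the three views (code units via `s.encode(...)`, code points via `iter(s)`, text elements via something like the above) answers a different question, which is the argument for never letting plain iteration pick one silently.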
How does Clojure or ClojureScript do as they are built on top of JVM/CLR or JavaScript? I'm assuming that since they fall back to many of the primitives of their respective runtimes that they use their implementations.
I've had the least trouble when using Apple's Objective-C (NSString), and Microsoft's C# - these two at least make you take conscious decisions when transcoding to bytes.