
It also doesn't help that the classic LAMP stack has very confusing defaults and badly named functions:

* PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"

* MySQL for the longest time used latin1 as its default character set, then introduced an insufficient character set called "utf8" that allows at most 3 bytes per character (not enough for all UTF-8 encoded code points), and only later added a proper implementation called "utf8mb4".

* MySQL connectors and client libraries often default their "client character set" to latin1, causing "silent" transcodes against the "server character set" and the table column character sets. And because MySQL's "latin1" is more or less a binary-safe encoding, it is very easy to end up with double latin1-to-utf8 transcoded data in the database. This often goes unnoticed as long as data is merely received, inserted, selected and output to a browser, until you start working with substrings or case-insensitive searches.

* In Java, there are tons of methods that work on the boundary between bytes and characters and allow you to omit the encoding, which then silently falls back to an almost arbitrarily chosen platform default encoding

* Many languages such as Java, JavaScript and the Unicode variants of Win32 were unfortunately designed at a time when Unicode characters could still fit into 16 bits, with the devastating result that the data type "char" is too small to store every Unicode character. It also plays havoc with substring indexing. (A short Java sketch of these pitfalls follows at the end of this comment.)

In short, the APIs are stacked against the beginning programmer and don't make it obvious that when you go from working with abstract "characters" to byte streams, there is ALWAYS an encoding involved.
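To make a couple of those pitfalls concrete, here is a minimal Java sketch (my own illustration, only standard-library calls; expected output noted in comments):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncodingPitfalls {
        public static void main(String[] args) {
            // 1. Omitting the charset means "whatever the platform default happens to be".
            byte[] ambiguous = "héllo".getBytes();                        // platform-dependent
            byte[] explicit  = "héllo".getBytes(StandardCharsets.UTF_8);  // always UTF-8

            // 2. The classic double transcode: UTF-8 bytes wrongly decoded as latin1,
            //    then re-encoded as UTF-8 ("é" turns into "Ã©").
            byte[] utf8  = "é".getBytes(StandardCharsets.UTF_8);            // C3 A9
            String wrong = new String(utf8, StandardCharsets.ISO_8859_1);   // "Ã©"
            System.out.println(Arrays.toString(
                wrong.getBytes(StandardCharsets.UTF_8)));                   // C3 83 C2 A9

            // 3. char is a UTF-16 code unit, not a Unicode character.
            String poo = "\uD83D\uDCA9";                              // U+1F4A9, one code point
            System.out.println(poo.length());                         // 2
            System.out.println(poo.codePointCount(0, poo.length()));  // 1
        }
    }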




> * PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"

In the XML module, no less. I'll get round to moving those out of there eventually.


Well, I finally wrote a patch for that: https://github.com/php/php-src/pull/2160


Does any programming language get Unicode right all the way? I thought Python did it mostly correctly, but with combining characters, for example, I would argue that it gets it wrong if you try to reverse a Unicode string.


My basic litmus test for "does this language support Unicode" is, "does iterating over a string get me code points?"¹

Rust, and recent versions of Python 3 (but not early versions of Python 3, and definitely not 2…) pass this test.

I believe that JavaScript, Java, C#, C, C++ … all fail.

(Frankly, I'm not sure anything in that list even has built-in functionality in the standard library for doing code-point iteration. You have to more or less write it yourself. I think C# comes the closest, by having some Unicode utility functions that make the job easier, but still doesn't directly let you do it.)

¹Code units are almost always, in my experience, the wrong layer to work at. One might argue that code points are still too low level, but this is a basic litmus test (I don't disagree that code points are often wrong; it's mostly a matter of what I can actually get from a language).
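For what it's worth, here is roughly what "write it yourself" looks like in Java. Character.codePointAt and Character.charCount are standard library calls, but the loop itself is my own jshell-style sketch:

    String s = "a\uD83D\uDCA9b";   // "a" + U+1F4A9 + "b": 3 code points, 4 code units

    // Naive char iteration splits the surrogate pair into two code units (4 lines printed):
    for (char c : s.toCharArray()) {
        System.out.printf("U+%04X%n", (int) c);
    }

    // Hand-rolled code point iteration yields the 3 code points:
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        System.out.printf("U+%04X%n", cp);
        i += Character.charCount(cp);
    }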

> try to reverse a Unicode string.

A good example of where even code points don't suffice.
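A small Java sketch of that (my own illustration): reversing by code points keeps surrogate pairs intact, yet still tears a combining mark off its base character.

    String s = "e\u0301";     // "é" spelled as 'e' + U+0301 COMBINING ACUTE ACCENT

    StringBuilder reversed = new StringBuilder();
    s.codePoints().forEach(cp -> reversed.insert(0, Character.toChars(cp)));

    // The accent now precedes the 'e' and attaches to the wrong thing when rendered.
    System.out.println(reversed);   // "\u0301" + "e", not "é"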


Lua 5.3 can iterate over a UTF-8 string. You can even index character positions (not byte positions) in a UTF-8 string. Some more information here: https://www.lua.org/manual/5.3/manual.html#6.5

Edit: fixed link


I basically agree with you, but note that code points are not the same as characters or glyphs. Iterating over code points is a code smell to me. There is probably a library function that does what you actually want.
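In Java, for instance, the closest standard-library fit is java.text.BreakIterator's character instance (rough sketch below; it predates extended grapheme clusters, so ICU4J is usually the better choice):

    import java.text.BreakIterator;
    import java.util.Locale;

    String s = "e\u0301x";   // "é" (e + combining acute) followed by "x"

    BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
    it.setText(s);
    for (int start = it.first(), end = it.next();
         end != BreakIterator.DONE;
         start = end, end = it.next()) {
        System.out.println(s.substring(start, end));   // should print "é", then "x"
    }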


I explicitly mention exactly this in my comment, and provide an example of where it breaks down. The point, which I also heavily noted in the post, is that it's a litmus test. If a language can't pass the iterate-over-code-points bar, do you really think it would give you access to characters or glyphs?


The default string iterator in JS (via Symbol.iterator) yields code points, not JS "characters" (UTF-16 code units), so, e.g., Array.from("💩a") is ["💩", "a"].

(I use U+1F4A9 for testing all my non-BMP needs)


If I've got the right method[1], that appears to have been added in ES6. I'm still getting up to speed there. Not supported on IE, of course.

[1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


C# is built on top of what Windows offers, so I imagine it matches a mix of DBCS and Unicode APIs.


All strings in C# are UTF-16.

However, it exposes the encoding directly as a sequence of 16-bit code units. In other words, if you iterate over a string or index it, you're getting those, and not code points (i.e. it doesn't account for surrogate pairs).

Note that this only applies to iteration and indexing. All string functions do understand surrogates properly.

On the other hand, C# (or rather .NET) has a way to iterate over text elements in a string. This is one level higher than code points, in that it folds combining characters: https://msdn.microsoft.com/en-us/library/system.globalizatio...


Better still is forcing you to say what you want to iterate over, like Swift.


I'd accept that, too. I'm not familiar w/ Swift, so it wasn't in the list above. (But I do think the default should not be code units, or programmers will use it incorrectly out of ignorance. Forcing them to choose prevents that, hence, I'll allow it)


I would go one step further and say that there is no meaningful default for these things. It all depends on the context, and there's no single context that is so common that it's the only one that most people ever see. Thus, it should always be explicit, and an attempt to enumerate or index a string directly should not be allowed - you should always have to spell out if it's the underlying encoding units, or code points, or text elements, or something else.
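A hypothetical sketch of that idea in Java (the Text type and method names are made up for illustration, not any real API):

    import java.util.stream.IntStream;

    // Deliberately no charAt()/iterator() on the wrapper itself:
    // the caller has to name the view they want.
    final class Text {
        private final String raw;
        Text(String raw) { this.raw = raw; }

        IntStream utf16CodeUnits() { return raw.chars(); }
        IntStream codePoints()     { return raw.codePoints(); }
        // Iterable<String> graphemes() would sit on top of BreakIterator/ICU.
    }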


> Does any programming language get Unicode right all the way?

Lisps with Unicode support seem to. This is a case where the Common Lisp standard's reluctance to mandate certain things paid off bigtime.


How does Clojure or ClojureScript do as they are built on top of JVM/CLR or JavaScript? I'm assuming that since they fall back to many of the primitives of their respective runtimes, they use those implementations.


> How does Clojure or ClojureScript do as they are built on top of JVM/CLR or JavaScript?

I don't know, but what you wrote sounds right. I don't really think of them as Lisps, although they have some Lisp-like features.


I've had the least trouble when using Apple's Objective-C (NSString), and Microsoft's C# - these two at least make you take conscious decisions when transcoding to bytes.



