
With UTF-8 you give up any hope of O(1) string operations. With UTF-16 you can get O(1) string operations at the cost of doing them on code points rather than graphemes. If, for more than 95% of the strings your language/program will ever see, every grapheme is a single code point, that trade-off seems worth it.

I think a lot of people totally discount the enormous cost that Latin-1 users are paying for CJK (etc) support they don’t use. Maybe it’s the right call, but it isn’t obviously so.

Personally, at least for internal use, I like the idea of specialized implementations. At (immutable) string creation time, decide whether to construct a Latin-1 string or a UTF-8 string (and possibly other fixed-width encodings) depending on what code points are present. Expose the same set of operations (e.g. slicing on grapheme clusters) but with better performance characteristics where possible.
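A minimal Python sketch of that creation-time decision, just to make the idea concrete (make_string and its returned tuple are made-up names for illustration, not an existing API):

    def make_string(s: str) -> tuple[str, bytes]:
        """Pick a representation when the (immutable) string is created:
        Latin-1 when every code point fits in one byte, UTF-8 otherwise."""
        if all(ord(c) < 0x100 for c in s):
            return "latin-1", s.encode("latin-1")   # fixed-width, 1 byte per code point
        return "utf-8", s.encode("utf-8")           # variable-width fallback

    print(make_string("café"))   # ('latin-1', b'caf\xe9')
    print(make_string("🧐"))     # ('utf-8', b'\xf0\x9f\xa7\x90')

Real implementations (e.g. CPython's PEP 393, mentioned further down the thread) use fixed-width tiers of 1/2/4 bytes rather than a UTF-8 fallback, precisely so that indexing stays O(1) per code point.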

For storage at rest and especially communication with external programs I agree you probably should emit UTF-8.




Except that with UTF-16 you still have to give up any hope of O(1) string operations, unless you don't care about accuracy. Only UTF-32 allows O(1) index-based operations.

For example, 🧐 (aka "face with monocle") needs 2 UTF-16 code units, since its code point is U+1F9D0, which is outside the BMP; in UTF-8 it's F0 9F A7 90.

And this ignores the point that others have made about sequences that combine into a single displayed character, such as the woman farmer emoji (Unicode: U+1F469 U+1F3FD U+200D U+1F33E; UTF-8: F0 9F 91 A9 F0 9F 8F BD E2 80 8D F0 9F 8C BE).

That beast needs 7 UTF-16 code units or 15 UTF-8 code units, i.e. 14 bytes vs. 15 bytes. And it is still only one displayed character.
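For anyone who wants to double-check those counts, a few lines of Python confirm them (UTF-16 code units counted as encoded bytes divided by two):

    monocle = "\U0001F9D0"                            # face with monocle
    farmer  = "\U0001F469\U0001F3FD\u200D\U0001F33E"  # woman, skin tone, ZWJ, ear of rice

    for name, s in [("monocle", monocle), ("woman farmer", farmer)]:
        utf16_units = len(s.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 code unit
        print(name, len(s), "code points,",
              utf16_units, "UTF-16 code units,",
              len(s.encode("utf-8")), "UTF-8 bytes")
    # monocle: 1 code point, 2 UTF-16 code units, 4 UTF-8 bytes
    # woman farmer: 4 code points, 7 UTF-16 code units, 15 UTF-8 bytes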


To clarify the woman farmer being 14 or 15 bytes: the example given combines four separate code points into one emoji, and is additionally a "medium skin tone woman farmer".

A woman farmer requires the "woman" and "ear of rice" emoji, with a zero-width joiner character between them. To change the skin tone from the default yellow, a "medium skin tone" (type 4) modifier is added after the woman, but before the joiner.

So the sequence "U+1F469 U+1F3FD U+200D U+1F33E" represents "woman skin-tone-4 joiner ear-of-rice". And in the UTF-8 bytes, the first four bytes are "woman", the next four are "skin-tone-4", then three for "zero-width-joiner", and finally four for "ear-of-rice".
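The same breakdown can be printed straight from Python's unicodedata tables (bytes.hex with a separator needs Python 3.8+):

    import unicodedata

    farmer = "\U0001F469\U0001F3FD\u200D\U0001F33E"
    for ch in farmer:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex(" "))
    # U+1F469 WOMAN                              f0 9f 91 a9
    # U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4  f0 9f 8f bd
    # U+200D  ZERO WIDTH JOINER                  e2 80 8d
    # U+1F33E EAR OF RICE                        f0 9f 8c be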

Emojipedia helpfully lists codepoints for each multi-character emoji: https://emojipedia.org/female-farmer/


In 2004 no one cared about face with monocle or woman farmer (I still don’t). The question was why did people pick UTF-16 back then. The answer is they were willing to make accuracy trade-offs for performance. Especially where they thought those trade-offs would only impact a small percentage of their users.


How exactly is Latin-1 privileged in this regard? Note its suffix - writing this from a region where other single-byte encodings proliferated in the 1990s, before sort-of settling to Latin-2 and/or Windows-1250 in the aughts (no, they're not the same, and the mapping is subtly broken).

If you mean "ASCII was good for our grandfathers", say so - but don't pretend that Latin-n somehow was not a bastardized set of hacks extended upon ASCII (like all "regional" encodings, single-byte or not), and shouldn't have died at least a decade ago: suddenly a wild Latin-10 string appears, and you're down to spinning up a conversion engine, praying that nothing slips through the cracks (and inevitably something does, eventually).

If your data never crosses a process boundary - by all means, keep it in Linear B, see if I care. Caveat: maintenance of your back-and-forth conversion engine will eat you alive, or iconv will obliterate the benefits of the "enormous" savings. (I also had the great idea "we'll just use a clever(tm) scheme and only have the necessary encoding parts" sometime around 2005: the memory and processing-speed improvement was marginal; the burden of juggling One Special Encoding (Latin-2 in my case) was not.)


> How exactly is Latin-1 privileged in this regard?

Latin-1 is sufficient for the languages spoken in countries representing more than 50% of global GDP. It might not be fair, but I'm not sure this is a question of fairness to begin with.


Oh. In that case, you do mean ASCII, methinks (which also works); with Latin-1 you're shooting yourself in both feet. (Looks for link to "falsehoods programmers believe about encodings".)


To my understanding, ASCII basically only covers English and Italian well, whereas Latin-1 covers most Western European languages — which includes the main languages of the Americas (Spanish, English, French, Portuguese).


Italian requires è, ù, ò, à, ì, é, ó, í and (rarely used) î

Accented letters just don't appear as prominently as in French or Spanish, and the distinction between acute and grave accents would nowadays often be lost even on native writers if it weren't for spellcheckers.


> To my understanding, ASCII basically only covers English and Italian well

And German, too: while German does have ä, ö, ü & ß, they may correctly be written as ae, oe, ue & ss.


Not really, if you can make a good living writing software for only the US/UK, France, Germany, Italy and a few more Western European countries, which is perfectly feasible. Latin-1 works perfectly fine there, and if you wanted, and didn't have to interact too much with other systems, you could just completely ignore all other encodings until UTF-8 started to get traction in the late 1990s. It even got into many RFCs as the default encoding, until the RFC editors started requiring everyone to implement UTF-8 and default to it.


Oh, you could have done that in the 1990s and well into the aughts, no doubt about that - that's exactly what happened :) I thought you were proposing Latin-1 as useful today.


Oh no, no way. I'm not 100% sure about bradleyjg though... Maybe I misunderstood them.


I think Latin-1 compatible strings are common enough to be worth optimizing for with separate code paths. At least in large projects like OSes and programming languages. That doesn’t mean I think Unicode support should be omitted.


In such cases I would think the optimization is basically for ASCII, extended to Latin-1 because it happens to be the first 256 code points of Unicode, which makes processing trivial and doesn't waste the other 128 byte values. But I figure that 99.9% of those strings would be ASCII-only.


> with Latin-1 you're shooting yourself in both feet

Can you be more specific about that? As I wrote in a sibling comment, I'm not convinced. I mean, nowadays you shouldn't use any of the ISO-8859-* encodings anymore of course, but we're talking about the 1990s here.


I beg to differ, considering that it doesn't contain the € sign.


The list of what it doesn't contain, with a 256 character space, is...long ;)


> With utf-16 you can get O(1) string operations at a cost of doing them on code points rather than graphemes.

No you cannot. UTF-16 is a variable-length encoding (up to two 16-bit code units per code point).

> I think a lot of people totally discount the enormous cost that Latin-1 users are paying for CJK (etc) support they don’t use.

If we limit ourselves to Latin-1, UTF-16 requires up to 100% more memory than UTF-8. How often do you need to randomly access a long string anyway?
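For a rough sense of the sizes involved (the doubling is exact for the ASCII subset; accented Latin-1 characters take two bytes in both encodings), a quick Python comparison:

    for s in ("plain ASCII text", "café au lait, naïveté"):
        print(len(s), "chars:",
              len(s.encode("utf-8")), "UTF-8 bytes vs",
              len(s.encode("utf-16-le")), "UTF-16 bytes")
    # 16 chars: 16 UTF-8 bytes vs 32 UTF-16 bytes
    # 21 chars: 24 UTF-8 bytes vs 42 UTF-16 bytes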


Right, code units, not code points. My mistake. Nonetheless that's what you get back from e.g. Java's charAt.


You can do O(1) string operations in UTF-8 too, if you do them at the code unit level. It's just as wrong, but it's more obvious that it's wrong because it only works for ASCII instead of only working for BMP.
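Concretely, in Python, slicing the encoded bytes at code-unit offsets shows both failure modes: byte indexing into UTF-8 is O(1) and happens to work for ASCII, code-unit indexing into UTF-16 happens to work for the BMP, and both fall apart the same way outside those ranges:

    # UTF-8: slicing by byte offset cuts the 2-byte 'ó' in half.
    print("hóla".encode("utf-8")[2:4].decode("utf-8", errors="replace"))   # '�l'

    # UTF-16: slicing by code-unit offset cuts the surrogate pair for 🧐 in half.
    b = "a🧐b".encode("utf-16-le")       # code units: 0061, D83E, DDD0, 0062
    print(b[2*2:4*2].decode("utf-16-le", errors="replace"))                # '�b'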


Bringing things full circle in this thread, this is absolutely why it is a big deal that emoji have become so popular and that people care about them, and that they are mostly homed in the Astral Plane. Now there's a giant corpus of UTF-16 data people are interacting with daily that absolutely makes it clear that you can't treat UTF-16 like UCS-2, and if you are still doing bad string operations in 2018 you have fewer excuses and more unhappy users ("why is my emoji broken?!").


Same thing I said above, except even more so. What's the most used language that can't be represented within the BMP? Bengali is one possibility, but AFAIK there's a widely used Arabic form in use as well as the traditional script.


That's kind of what happened in Python 3.3. The internal representation is flexible, depending on the string contents. It may be just 1-byte ASCII, or it may be UTF-32 if needed: https://www.python.org/dev/peps/pep-0393/
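You can see the flexible representation from within Python itself; per-character storage grows with the widest code point present (the exact byte counts depend on the CPython version and platform, so treat the output as illustrative):

    import sys

    for s in ("a" * 10, "é" * 10, "汉" * 10, "🧐" * 10):
        print(repr(s[0]), sys.getsizeof(s), "bytes")
    # 'a'  -> smallest (1 byte per char, ASCII)
    # 'é'  -> 1 byte per char, but Latin-1 (slightly larger header)
    # '汉' -> 2 bytes per char (UCS-2 range)
    # '🧐' -> 4 bytes per char (UCS-4)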


Java 11 is going to a similar model.


Does anyone know if anything similar is planned for JavaScript? It kills me that String.length (and [], codePointAt and so on) are living footguns. Though as other commenters have said, the rise of emoji has at least brought attention to the bugs caused by naive use of these tools.



