The Tragedy of UCS-2 (unascribed.com)



This is a great rundown. I started my career at Microsoft working with and on Win32/COM and saw this play out first hand.

One thing not mentioned here is the history of the "Byte Order Mark" (BOM) in Unicode.

(Not an expert here but my understanding having lived it.)

You see, given UCS-2, there are two ways to encode any codepoint: big-endian or little-endian. The idea was then to create a codepoint (U+FEFF) that you could put at the start of a text stream to signify which byte order the stream was encoded in.

Wikipedia page: https://en.m.wikipedia.org/wiki/Byte_order_mark

This then got overloaded. When loading a legacy text format, there is often the difficulty of figuring out which code page to use. When applied to HTML, there are a bunch of ways to do it and they don't always work. There are things like charset meta tags (but you have to parse enough of the HTML to find it and then restart the decode/parse). But oftentimes even that was wrong. Browsers used to (and still do?) have an "autodetect" mode where they would try to divine the code page based on content. This is all in the name of "be liberal in what you accept".

Enter UTF-8. How can you tell if a doc is US ASCII or UTF-8 if there are no other indications in the content? How does this apply to regular old text files? Well, the answer is to use the BOM. Encode it in UTF-8 and put it at the start of the text file.

But oftentimes people want to treat simple UTF-8 as ASCII, and the BOM puts high-value codepoints at the start of what would otherwise be a pure ASCII document. And everyone curses it.

Having the BOM littered everywhere doesn't seem to be as much of a problem now as it used to be. I think a lot of programs stopped putting it in, and a lot of other programs talk UTF-8 and deal with it silently. Still something to be aware of though.
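For what it's worth, the "deal with it silently" approach is tiny to implement. A minimal sketch in Python (the built-in "utf-8-sig" codec does the same thing):

    # Tolerate an optional UTF-8 BOM when reading "plain" text.
    UTF8_BOM = b"\xef\xbb\xbf"

    def read_utf8(path):
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(UTF8_BOM):
            data = data[len(UTF8_BOM):]  # drop the BOM so ASCII-assuming code isn't surprised
        return data.decode("utf-8")

    # Equivalent: open(path, encoding="utf-8-sig") strips a leading BOM automatically.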


Yeah, the BOM has gotten me a few times.

My most spectacular fail was a program that read UTF-8 or Latin-1 and wrote UTF-16, preserving but not displaying null characters. I believe this was the default behavior of HyperStudio. Every round trip would double the size of the file by inserting null bytes every other character. Soon there were giant stretches of null characters between each display character, but the displayed text never appeared to change even though the disk requirements doubled with each launch. That's how I learned about UTF-16!

Speaking of Win32/COM... is there a "tcpdump for COM"? I've got a legacy app that uses COM for IPC and I've been instrumenting each call for lack of one.


If there is anything like tcpdump for COM, it would be part of Event Tracing for Windows, but you’d probably prefer to use it via Microsoft Message Analyzer.


A tcpdump for COM would be incredible... but I'm in the same boat of having just instrumented each call individually :(

I would guess, though, that there is probably some pretty helpful code for this in the apitrace program that could be lifted out and reused, since DirectX APIs tend to involve a lot of COM. I haven't tried, though.


ï»¿ is the representation of the UTF-8 BOM byte sequence in Latin-1. If this comment were stored as Latin-1 and you assumed it was UTF-8 just because it began with that byte sequence, you would discard an important part of my message.


Indeed, Windows Notepad does exactly that (ignores ï»¿ and reads the rest as UTF-8).


I have sympathy for its authors. There is no way to really know what the right encoding is. Your options are to guess based on heuristics, allow the user to specify, or demand a particular format. Even the friendliest applications just guess and allow the user to override.


If you start using the BOM like that for UTF-8, then it's not really a byte order marker anymore.

BTW, OLE/COM is something that didn't click at all for me when I first started encountering it in the '90s. I'm kind of bummed that it seems to have been left behind because it's still a useful technology.


> How can you tell if a doc is US ASCII or UTF-8 if there are no other indications in the content?

If the document contains no codepoints above 0x7F, it is both US-ASCII and UTF-8 at the same time. If the document decodes as valid UTF-8, it’s more likely to be UTF-8 than whatever national encoding it might be (latin1? windows-1252? equivalents for other countries?)
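That check is cheap to sketch (Python; falling back to windows-1252 is just an assumption for illustration, and a real application would let the user override):

    def guess_encoding(data: bytes) -> str:
        # Non-ASCII text almost never decodes as valid UTF-8 by accident,
        # so try UTF-8 first; pure ASCII passes this test too.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "windows-1252"  # assumed legacy fallback; it never fails to decode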

With the BOM, you get more serious problems. Everyone on the way needs to know about it, and that it needs to be ignored when reading, but probably not when writing. I remember the olden days of modifying .php files from CMSes with notepad.exe, which happily, silently added a BOM to the file, and now suddenly your website displays three weird characters at the very top of the page.


It always baffled me as a native Chinese speaker that 16 bits were ever considered enough to “be plenty of room to encode every script in modern use, and then some”. Common local encodings like Big5 (also 16-bit fixed-width) were already suffering from a serious lack of code points at that time. It would be obvious if they consulted literally any experienced programmer from East Asia. Yeah, that was 1990 so it’s not an easy thing to do by any means, but you’d expect more from someone aspiring to create such a thing.


> It would be obvious if they consulted literally any experienced programmer from East Asia. Yeah, that was 1990 so it’s not an easy thing to do by any means

They did consult not just experienced East Asian programmers, but Chinese, Korean, Japanese and Vietnamese professors of Linguistics and the relevant local authorities, and had them form the Ideographic Rapporteur Group, which advised them that all the East Asian languages can be encoded in 20940 code points.

Unicode is not centrally run like so many people seem to expect. Rather, the Unicode Consortium itself acts just as the standards body that mates local standards together into a single complete whole. The Consortium makes no decisions about how Chinese is represented, other than allocating blocks of code points; all the relevant decisions about any language are made by local experts. The problem was that the early 90's were an era of boundless optimism and futurism, which in East Asia resulted in the Han Unification project, with the idea of cutting down the number of symbols in use and unifying the representation of all the languages that use a Han-derived script. That... did not go so well.


Thanks for that background. I was thinking that the (for lack of a better word) politics of CJKV and Han unification likely had a lot to do with it.

To my eyes, the letter "A" in English (Latin), Greek, and Russian (Cyrillic) looks identical. Why does it need three separate code points? Maybe it gets tricky when the relationship between uppercase and lowercase diverges. In fact, that does appear to be one of the arguments in this technical note:

https://www.unicode.org/notes/tn26/


I think the most salient point in your link is:

>Even more significantly, from the point of view of the problem of character encoding for digital textual representation in information technology, the preexisting identification of Latin, Greek, and Cyrillic as distinct scripts was carried over into character encoding, from the very earliest instances of such encodings. Once ASCII and EBCDIC were expanded to start incorporating Greek or Cyrillic letters, all significant instances of such encodings included a basic Latin (ASCII or otherwise) set and a full set of letters for Greek or a full set of letters for Cyrillic. Precedent for the purposes of character encoding was clearly established by those early 8-bit charsets.

That's true, although it reminded me of a cool charset: the Russian KOI8-R encoding[1]. This encoding was created so that if you stripped the high bit (presumably on a system unable to handle russian properly) you ended up with semi-legible latinized Russian. Quoting wikipedia:

>For instance, "Русский Текст" in KOI8-R becomes rUSSKIJ tEKST ("Russian Text") if the 8th bit is stripped; attempting to interpret the ASCII string rUSSKIJ tEKST as KOI7 yields "РУССКИЙ ТЕКСТ". KOI8 was based on Russian Morse code, which was created from Latin Morse code based on sound similarities, and which has the same connection to the Latin Morse codes for A-Z as KOI8 has with ASCII.

[1] https://en.wikipedia.org/wiki/KOI8-R
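The trick is easy to reproduce with Python's built-in koi8_r codec:

    data = "Русский Текст".encode("koi8_r")
    stripped = bytes(b & 0x7F for b in data)  # drop the 8th bit
    print(stripped.decode("ascii"))           # rUSSKIJ tEKST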


Wow, they made the 8th bit essentially be the case bit, and losing+recreating it mostly just made everything uppercase.


No, the case bit is still bit 5. However, unlike ASCII, KOI7/KOI8's Cyrillic alphabet sets bit 5 to 1 for uppercase characters. This way, even though Cyrillic text remains legible, you also have a clue that the encoding is KOI.


Bloody hell, that is awesome and I don't often say that.


I don't think uppercase/lowercase is an issue, and in fact it is something that already happens. For example, in most languages lowercase "i" has the letter "I" as its uppercase version, but not in Turkish, where it gets converted to "İ" (with the dot). In Turkish, "I" is the uppercase version of "ı" (without the dot). In other words, the same letter can be "uppercased" to a different letter depending on the current language.


I remember the politics of this well. It was all about legacy encodings (similar to the UCS-2 issue in the article). There were many encodings (“code pages” in MS land) for each language or character set, even Latin-derived ones. Just getting a single one for Cyrillic or Greek was a hurdle, with the most common in use as the “default”. True character unification would have been great, but was too much of a hurdle because you couldn’t just use an 8-bit character space plus a single offset. This was in the days of 16-bit machines, remember.

This legacy support is why there are precomposed characters.

It’s not quite the same for Han unification (people using those scripts were already used to complex gyrations). In that case the linguists, historians and other scientists were in agreement, but they neglected politics. In Latin unification nobody cares that J sounds different in every language, but because Hanzi embody semantics, Japan may have a loan character because it once sounded the same as a word in use at the time OR because it had the same meaning as a word in use at the time, so the sense of “unified” can be contentious, even though they are simply code points.


That would mix words from different alphabets when sorted in lexicographical order (and/or break this order, respectively). You could get used to it, but traditionally it is a big nope. Let fonts erase the difference, but we want to know whether a letter is Greek, Latin, Cyrillic or something else.

Sorting complex charsets may require some sort of collation, of course, but in simple cases the status quo is highly usable without one.


Lexicographic order is already language specific, and the majority of languages use only one script.


But databases/filenames/lists.. often mix them. Collation by rules of a single language isn't perfect, but highly usable.


You could also see blackletter and antiqua as two different scripts. In fact, foreign words were written in antiqua. In Unicode, however, they’re put together (except for mathematics).


One thing I can think of is that when I write in Russian, and I put text into "italics", it goes into the cursive script. So I guess the letter "a" is different at least from the roman letter.


Han Unification is something that should have been handled with more nuance, and I'll be quite surprised if at least a few linguists didn't give a nuanced answer and get ignored.

1. Let's start with a simple character (indeed the character for character) like 字. The top stroke on this character is generally written straight in Japanese (and I believe Korean), but it has a very distinct diagonal angle (top left to bottom right) in Chinese, both Traditional and Simplified. Should it be counted as the same character? I believe most linguists would say yes. If your answer for this is no, then Han Unification is a lost cause - you'd basically have to create completely separate character sets for Simplified Chinese, Traditional Chinese, Japanese and Korean and Vietnamese, and add a few extra characters to the Traditional Chinese set that are different over in Hong Kong.

2. Let's go further on with the grass radical (艹) which has 4 strokes in Traditional Chinese vs. 3 in other writing systems. Should all the characters with this radical be separate?

3. What about characters like 骨? The inner right angle stroke in the top of this character is on the left side in Simplified Chinese, and on the right side otherwise. Similar minor differences can be found for characters like 刃 and 化. These are not really the fault of simplification, but rather come from the fact that different countries chose a different Han character variant to standardize on.

A good example from another non-Western language that went through a partial unification ignoring stylistic differences in writing would be Urdu. It shares the same Unicode range as Arabic, even though the typical styling of "the same characters" in the two languages can be quite different. Here's a signboard with both Urdu and Arabic from Wikipedia - the difference is quite apparent: https://upload.wikimedia.org/wikipedia/commons/2/22/UAE_sign...

But the unification could never be perfect. Simplified characters often combine more than one traditional character. E.g. 发 could mean both hair (髮) and discharge (發). Japanese simplified the latter character to 発. Unicode assigns a different codepoint to each of them, and it would obviously be quite bad for Traditional Chinese if it didn't. It also assigns different code points where consistent stylistic simplification of radicals is involved, so all characters with the radical 見 ('see') have different codepoints from those with the simplified radical 见.

Even ignoring political tensions, Han unification was naive from the start, especially considering they knew they wouldn't be able to unify everything. They could have left a way to designate regional styling within the encoding.
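The hair/discharge example above is easy to verify; each variant really does get its own code point:

    for ch in "发髮發発":
        print(f"U+{ord(ch):04X}", ch)
    # U+53D1 发  Simplified Chinese (hair and discharge)
    # U+9AEE 髮  Traditional Chinese (hair)
    # U+767C 發  Traditional Chinese (discharge)
    # U+767A 発  Japanese shinjitai (discharge)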


This is actually covered in the original Unicode paper. See section 2.1 of https://unicode.org/history/unicode88.pdf.

The reasoning is not totally crazy; the Table of General Standard Chinese Characters even today only has about 8000 characters.


I think what they did not realize is that once a universal encoding won the battle of standards, it would have to be all-inclusive. Operating systems and platforms like Java settled on Unicode so they wouldn't have to deal with a myriad of different encodings, but the moment you support only one encoding, you'll eventually have to accommodate everyone, and it's not enough to include all commonly used characters in the world's major languages. You'll have to make a go at the long tail.

You can't leave the people who need to encode rare languages on computers underserved, so you'd have to add scripts for all the scholars recording Egyptian and Luwian hieroglyphs, Sumerian and Akkadian cuneiform, and the Tangut script. You'd probably have to eventually support as-yet-unencoded scripts such as Classical Yi, Jurchen, Aztec hieroglyphs and so on.

You'd have to add alphabets and syllabaries for small minority languages such as Lisu, Coptic, Cherokee and Vai. And you'd have to support the rather widely used writing systems of India, each with its own 128 characters.

Combine this with the over 28,000 CJK characters Unicode did end up allocating and the 11,184 discrete codepoints Unicode did end up adding for Korean Hangul [1], and you've already used 40,000 characters. Add all the long-tail scripts and you're past 65,535 well before we even reach Emoji or extremely rare CJKV characters from Mojikyo [2].

"In other words, given that the limitation to 65,536 character codes genuinely does satisfy all the world's modern communication needs with a safety factor of about four, then one can decide up-front that preserving a pure 16-bit architecture has a higher design priority than publicly encoding every extinct or obscure character form. Then the sufficiency of 16 bits for the writing technology of the future becomes a matter of our active intention, rather than passive victimization by writing systems of the past."

The engineers behind Unicode made the classic software engineering mistake of looking at their needs today and taking a factor of safety that's way too small. It's the same thing that happened with the IPv4 address range. A simple back-of-the-envelope calculation would have told us it's not going to be enough, but it was easy to forget back in the 1980s that one day children in Africa would also have personal computers connected to the grand Internet, that would need an IP address.

The fact that by version 1.0.1 (the first version of Unicode to include the CJK characters) they had already used almost half of their space, when they were planning to use less than a quarter, should have prepared them for the future. No, the reasoning behind 16 bits was always crazy. You can't design future-proof standards with a low safety factor and a condescending estimation of usage (Yi Syllabary? Who would ever use that?).

But we should probably be happy about this. A 32-bit encoding would have certainly been rejected by just about everyone as too wasteful, and software engineers back in the 80's and 90's were still unable to give up conflating codepoints with glyphs, so a multi-byte encoding (which is what we ended up having eventually with UTF-8 and UTF-16) wasn't a popular idea either.

Fortunately enough, Unicode is still much better than the mess that preceded it, and the fix for the 16-bit limitation, as painful as it was, has proven far easier than IPv6.

[1] a.k.a The "Korean Mess": http://unicode.org/pipermail/unicode/2015-June/002038.html

[2] https://en.wikipedia.org/wiki/Mojikyo


> Common local encodings like Big5 (also 16-bit fixed-width) were already suffering from a serious lack of code points at that time.

They are not 16 bits wide, but mostly 14 bits wide, separated from the single-byte portion by having the most significant bits of the bytes set. As far as I can tell there are "only" about 33,000 code points [1] in the Unihan database that are known to be officially encoded in some encoding or other throughout East Asia. This does not include unofficial but de facto encodings like the ETEN extensions to Big5, but you get the idea. This and the Han unification made it pretty reasonable to expect that the modern characters could be encoded in 16 bits.

[1] I've counted distinct code points in the "other mappings" table from the Unihan database.


There are not more than 65,000 Chinese characters in common use today. Indeed, a college graduate from this region would probably only recognize 1/10th that number, even from Taiwan or Hong Kong (where a greater number of traditional characters are used). If the goal is simply to encode one point for each character in use today, 16 bits is very likely sufficient.

That's not what Unicode does though. Unicode encodes many duplicate characters (because they are duplicated in national encoding standards and because of the round-trip compatibility requirement). Unicode also includes every possible variant that is of cultural or academic interest, including dead languages and so-called "ghost characters" that are actually printing errors in old paper dictionaries or census data but are now allocated a code point. And this is before we say anything about emoji...

So from today's perspective 16 bit encoding space is laughably small. But it is true that basically any message meant to be consumed by the general population, including CJKV languages but notably not emoji, can be fully represented with characters found on the BMP, which can be faithfully encoded in UCS-2.


But, but... 64K code points should be enough for everybody!


It's also worth noting where the terms "UCS-2" and "UCS-4" come from. For a while there was the idea of the Universal Coded Character Set[0] as an alternative to Unicode. This would have been a natively 32-bit encoding. However, it wouldn't have allowed the full 2^32 possible characters, because of its insistence that none of the 4 bytes could be any of the C0 or C1 controls. Still, it would have about 2^29 possible characters as opposed to Unicode's about 2^21.

But the insistence on avoiding C0 and C1 controls would have led to some odd codepoint numbers, and definite incompatibility with ASCII and Unicode; for instance, 'A', instead of 0x41, would have been 0x20202041! Whereas Unicode of course made sure to copy ASCII (indeed, to copy ISO-8859-1) for its first block.

Of course, this didn't come to pass because, well, people were already supporting 16-bit Unicode and didn't want to switch again. And I'm kind of glad it didn't -- while it would have meant not having to deal with the compromise that is UTF-16, imagine living in a world where 'A' is not 0x41, but rather 0x20202041! (Or imagine opening up a text file and finding three spaces between each character...)

[0]https://en.wikipedia.org/wiki/Universal_Coded_Character_Set


"imagine living in a world where 'A' is not 0x41, but rather 0x20202041! (Or imagine opening up a text file and finding three spaces between each character...)"

On the other hand, often "obviously broken" is better than "subtly broken". There's a ton of undetected encoding problems in the wild right now.


I would've mentioned this in my post but didn't feel I could do the topic justice, as I'm not immersed in it like I am the UTF-16 tragedy. This is indeed an important aside, though.


I would go further and say that UCS-4 is also a variable length encoding, mostly because of emoji and its endearing ZWJ sequences, not to mention the encoding of flags as two Regional Indicator Symbol codepoints (both in the astral plane, of course), which makes Unicode not self-synchronizing.

So the idea of ever having a fixed-length encoding for Unicode is basically impossible now. Best to just use UTF-8 for everything and logic to group it in to code points, grapheme clusters, or whatever other granularity is needed.


Those are all shaping (and bidi) issues though, independent of encoding. UCS-4 is one code unit per Unicode code point and the simplest encoding. Shaping may combine code points many-to-one or one-to-many, have ordering issues, etc. I think it is important for people to understand how shaping works and the code point/glyph distinction. Trying to duplicate the effort of the shaper when processing strings as UCS-4 is going to lead to a world of hurt. Even English text with strings like "1/4" and ligatures in the font may display as a single glyph (that's font-specific shaping, not encoding).

I've heard more than one person tell me they don't need to worry about text shaping since they are using UTF-8. (That statement doesn't make any sense.) There is a lot of confusion around the Unicode text rendering stack.


Sure. I guess what I'm trying to say is: when do you care about code point boundaries at all? In my experience, it's actually fairly rare. If you're drawing text, you care about the result after shaping. If you're moving the cursor, you care about grapheme cluster boundaries. If you're doing stuff like sorting and collation, you should use specialized Unicode libraries that get this stuff right. When is the last time you wrote code that really needed to iterate Unicode code points specifically?


> When is the last time you wrote code that really needed to iterate Unicode code points specifically?

We do it all the time. Splitting a string at U+002C or U+003B or U+000A, looking for U+007B or U+007D, etc. In my experience, it's really common.


Right, and if you're doing processing of ASCII code points (this is 95% of Markdown parsing, for example, the major exception being case folding and Unicode whitespace), then it's just as convenient to do those in UTF-8 as any other representation.
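For example (Python sketch): UTF-8 continuation bytes always have the high bit set, so an ASCII delimiter byte can never occur inside a multi-byte sequence and byte-wise splitting is safe:

    data = "α,β,γ".encode("utf-8")
    print(data.split(b","))                                      # [b'\xce\xb1', b'\xce\xb2', b'\xce\xb3']
    print([part.decode("utf-8") for part in data.split(b",")])   # ['α', 'β', 'γ']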


Those aren't Unicode code points; those are bytes.


Not on a UTF-16 platform.


Fun fact: because flags are of a political nature, and Unicode did not want to limit the length of country codes, any number of Regional Indicator Symbol code points can be placed back to back and treated as a single extended grapheme cluster.

That means that what Unicode today considers "one character" can be infinite in size.
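A sketch of how those flag sequences are built: each ASCII letter of a region code is offset into the Regional Indicator Symbol block starting at U+1F1E6.

    def flag(region_code: str) -> str:
        # "US" -> U+1F1FA U+1F1F8, which a capable font renders as the US flag
        return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region_code.upper())

    print([hex(ord(c)) for c in flag("US")])  # ['0x1f1fa', '0x1f1f8']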


And worse, the ‘operators’ are postfix or infix, so if you're handling text sequentially — and let's say, split across network packets with an arbitrary delay in between — you can't even tell you've read a complete character until you receive the next one.

Typing in many Latin languages generates diacritics using ‘dead keys’, which are prefix operators. (Originally, on a physical typewriter, these were keys that simply struck the paper without triggering the mechanism to advance the carriage.) If Unicode had taken this hint, life would be easier.

(ISO/IEC 6937 had prefix diacritics, but it was too late.)


Complete recognition before processing. It's not the law, just a good idea, but you should do it anyway. In cryptography this lesson had to keep being learned. "Oh, I received this partial data, and I processed it, and then in the next packet I received the MAC and I realised the bad guys altered the data, oops"

Process whole strings, the meaning of a partial string may not be what you hope, even without fancy writing systems and Unicode encoding.

"Give the money to Steph" - OK, will do

"...enson" - Crap, OK, somebody chase Steph and get our money back, meanwhile here's James Stephenson's money

"... once he gives you the key" - Aargh. Somebody chase down James and get the key off him.


Wow, such an interesting idea. But so many streaming algorithms etc. have people sticking megabytes of crap into one framed message.


Sure, but you can also just stack infinite accent marks.

Either way there's already a recommendation in this area. I'm not sure if it strictly fits this particular scenario but the "UAX15-D3. Stream-Safe Text Format" suggests that you should support sequences at least 31 code points long.


To be fair, I believe this was true well before flags—diacritics played this role previously, as in "Zalgo Text".


What do you think a character is? ZWJ and combining characters are needed for a few Indian languages spoken by millions of people, more than speak Dutch or Norwegian.


Some low-level text libraries (FriBidi, HarfBuzz) use UTF-32 as it makes the code simpler when you have to work at the code point level. Most code does not need that of course and is better off working at a different level than code points.


> both in the astral plane, of course

Well never mind that then; how do you like U+41,U+300 aka U+C0 latin capital letter a with grave accent. Not just in the BMP but in the part of the BMP that's only 2 bytes of UTF-8.

And there's a whole bunch of similar characters that don't have NFC encodings.
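In Python's unicodedata that round trip looks like this:

    import unicodedata

    decomposed = "A\u0300"                    # U+0041 + U+0300 COMBINING GRAVE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)
    print(composed, hex(ord(composed)))       # À 0xc0
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True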


Another aspect of this tragedy is that the creation of UTF-16 has forever limited Unicode to only 17 planes. The original UTF-8 encoding (https://tools.ietf.org/html/rfc2044) used up to six bytes, and could represent a full 31-bit range, which corresponds to 32768 16-bit planes; after UTF-16 became common, UTF-8 was limited to four bytes (https://tools.ietf.org/html/rfc3629).


The original pattern extends up to at least 36 bits (which makes the PDP-10 refugees happy).

    1  7  0xxxxxxx
    2 11  110xxxxx 10xxxxxx
    3 16  1110xxxx 10xxxxxx 10xxxxxx
    4 21  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    5 26  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    6 31  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    7 36  11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Then, you can either decide to limit the leader to one byte, giving you 42 bits…

    8 42  11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Or allow multi-byte leaders and continue forever.

    9 53  11111111 110xxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   10 58  11111111 1110xxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    ⋮
The Intergalactic Confederation will not look kindly on UTF-16.
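A sketch of an encoder for that original pattern (stopping at the 6-byte, 31-bit form from RFC 2044; modern RFC 3629 UTF-8 stops at 4 bytes):

    def encode_utf8_1994(cp: int) -> bytes:
        if cp < 0x80:
            return bytes([cp])
        for nbytes, limit in ((2, 1 << 11), (3, 1 << 16), (4, 1 << 21),
                              (5, 1 << 26), (6, 1 << 31)):
            if cp < limit:
                break
        else:
            raise ValueError("code point needs more than 31 bits")
        out = bytearray()
        for _ in range(nbytes - 1):
            out.insert(0, 0x80 | (cp & 0x3F))  # 10xxxxxx continuation bytes
            cp >>= 6
        out.insert(0, ((0xFF << (8 - nbytes)) & 0xFF) | cp)  # leading byte, e.g. 1111110x
        return bytes(out)

    print(encode_utf8_1994(0x7FFFFFFF).hex())  # fdbfbfbfbfbf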


> Or allow multi-byte leaders and continue forever.

That loses one very useful property of UTF-8: you always know, by looking at a single byte, whether it's the first byte of a code point or not. It also loses the property that no valid UTF-8 encoding is a subset of another valid UTF-8 encoding. It's better to stop at 36 bits (which keeps another useful property: you'll never find a 0xFF byte within an UTF-8 string, and you'll only find a 0x00 byte when it's encoding the U+0000 code point).


> The original pattern extends up to at least 36 bits (which makes the PDP-10 refugees happy).

Naah.. there's UTF-9 and UTF-18 for that..

https://tools.ietf.org/html/rfc4042


Why would you ever need 31 bits? That's over TWO BILLION distinct codepoints.

Only ~137K have been assigned so far, or in other words a little over 17 bits. Given that each additional bit doubles the range, even in the case that characters continue to be assigned at a constant rate, reaching that high is so far into the future that it's not worth thinking about. Unicode is around 30 years old; at the same rate, another 30 years would be needed to get to 18 bits, then another 60, 120, ...

On the other hand, it is extremely wasteful to burden all UTF-8 processing code with handling values that will literally never be encountered outside of error cases.


Well if they keep adding random emoji at the current rate....


I think that was the point - if they add at the current rate, we'll be OK. If they started adding at an exponential rate we'd have trouble, but that's highly unlikely.


Is this really forever? ISTM the old version could be reintroduced as UTF-8+ without insurmountable problems and with a good degree of compatibility.


Fun fact: for SMS, UCS-2 is used when a message requires characters outside the 128-character GSM-7 alphabet to be rendered.[1]

If you've ever noticed that entering an emoji or other non-ASCII [2] character seems to dramatically increase the size of your SMS, that's because the message is switching from 1-byte characters to 2-byte characters.

[1] https://www.twilio.com/docs/glossary/what-is-ucs-2-character...

[2] It's actually GSM-7, not ASCII, but the principle is the same: https://www.twilio.com/docs/glossary/what-is-gsm-7-character...
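A rough sketch of the size math (160/153 and 70/67 are the usual single vs. concatenated segment limits; fits_gsm7 stands in for a real GSM 03.38 alphabet check, and the GSM-7 extension table is ignored here):

    import math

    def sms_segments(text: str, fits_gsm7) -> int:
        if fits_gsm7(text):
            return 1 if len(text) <= 160 else math.ceil(len(text) / 153)
        # Otherwise the message goes out as UCS-2/UTF-16; astral characters
        # such as most emoji cost two 16-bit code units each.
        units = sum(2 if ord(c) > 0xFFFF else 1 for c in text)
        return 1 if units <= 70 else math.ceil(units / 67)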


I was a flip-phone holdout for a long time. This was not generally supported in new devices sold up to 2015 (when I acquired my first smartphone, and my experience with good phones ended).

Every flip phone I owned did not play well with non-ASCII text. Receiving a message with an emoji would render the entire message, or if I was lucky just the portion following the emoji, unrenderable. Most phones I had just fell back to rectangles, which I assume is their version of mojibake.

This was not a causal factor in my switch to a smartphone. It was annoying to have to explain character encodings to friends who didn't understand why they couldn't send me emoji-containing messages.


You sound like a blast.


Hey, could you please review the site guidelines and stick to the spirit of this site when posting here? Normally we ban accounts that attack others like this, and I actually banned yours, but decided to reverse that because you posted a more substantive comment earlier today. Basically, we're trying for more thoughtful conversation and better signal/noise than internet default.

https://news.ycombinator.com/newsguidelines.html

You might also find these links helpful for getting the intended use of HN:

https://news.ycombinator.com/newswelcome.html

https://news.ycombinator.com/hackernews.html

http://www.paulgraham.com/trolls.html

http://www.paulgraham.com/hackernews.html


I don't really understand how it can use UCS-2. I don't think UCS-2 can represent emoji.


This is actually mentioned in the submission, but it does so through surrogate pairs. In short: the first of two 16-bit code units (the high surrogate) selects a block of code points in the supplementary planes, and the second (the low surrogate) selects a code point within that block.

What's cool is that UCS-2 and UTF-16 behave the same way in this regard, so you can see how this works in JavaScript as well. There's a great HN post about the length of an emoji being 2 in JS [1] from 2017.

Practically speaking, I had fun learning this while writing an Angular app to read SMS backups [2]. It's frustrating and cool at the same time.

[1] https://news.ycombinator.com/item?id=13830177

[2] https://github.com/devadvance/sms-backup-reader-2/blob/maste...


The submission actually states (correctly): "UCS-2 can only represent the first 65536 characters of Unicode, also known as the Basic Multilingual Plane."

UTF-16 can represent all Unicode code points, through surrogate pairs when needed.

Systems that originally were designed for UCS-2 have often been upgraded to UTF-16, e.g. Windows and JS. But then it is not UCS-2 anymore. In some cases the documentation and/or function names, etc., might incorrectly talk about UCS-2 (while they are really UTF-16 now) for historical reasons, adding to the confusion.
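The surrogate-pair arithmetic itself is tiny (Python sketch):

    def to_surrogates(cp: int) -> tuple[int, int]:
        # Encode a supplementary-plane code point (U+10000..U+10FFFF) as UTF-16.
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        return 0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)  # high, low surrogate

    print([hex(u) for u in to_surrogates(0x1F600)])  # ['0xd83d', '0xde00'] for U+1F600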


If there are surrogate pairs, I thought that was more of UTF-16 rather than UCS-2. Wikipedia says UCS-2 cannot go outside the BMP range (which emojis are).

>UCS-2 differs from UTF-16 by being a constant length encoding[4] and only capable of encoding characters of BMP.

https://en.wikipedia.org/wiki/UTF-16

There does seem to be some disagreement on this topic though. I wonder if it could be described as WTF-16.

https://simonsapin.github.io/wtf-8/


JS, like Java, now has implementations that will also store strings as Latin1 when the implementation believes it is safe to do so. This results in significant memory savings [1] in most programs.

1. https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-f...


I wonder why Latin-1 and not just UTF-8


The String (and CharSequence) APIs are already warped around the assumption of underlying UTF-16 storage. In particular `charAt()` refers to a UTF-16 code unit position, and in general all methods [1] that take any kind of index are using an index into an explicit or implicit UTF-16 string. All of these functions need to be fast: certainly O(1).

So for your narrow encoding you need one that maps 1:1 (in a fixed-length way) to UTF-16. That rules out multi-byte UTF-8.

You could of course just restrict it to UTF-8 chars that fit in a single byte – but that's just ASCII! You waste a bit per character.

So you might as well use a full 8-bit encoding, and Latin-1 is convenient as it covers many more of the characters likely to be encountered in text that could use single-byte UTF-8 in the first place. It is also convenient in that Latin-1 byte values and Unicode codepoints are identical (for the first 256 codepoints), so conversion to and from UTF-16 can be very efficient (in fact there are SIMD instructions which do exactly that).

In any case, even in the scenario you could somehow use UTF-8 efficiently, it is likely to be worse in this particular scenario:

- They both use 1 byte for code points 0-127.
- Latin-1 uses 1 byte for 128-255, while UTF-8 uses 2.
- Above 255, UTF-8 uses at least two bytes, but the Java hybrid string will be using UTF-16 now, for a common case of 2 bytes.

So the only place UTF-8 wins is for the unusual case of characters outside the BMP which are 3 bytes in UTF-8 but 4 bytes in UTF-16. Those aren't common at all, and in many scenarios where you'd have a lot of them UTF-16 would still win because it represents many characters in the BMP in 2 bytes that take 3 in UTF-8.

---

[1] One might think there is an exception illustrated by the few methods that mention "code point", like `codePointAt`. Yes, these deal in code points, but their indexes are still indexes into a UTF-16 string. So they are kind of a hybrid API. They let you count N UTF-16 code units (which will represent <= N code points) into a string, and then deal with the code point at that location. That leads to weird cases like asking for the code point starting at the second half of a surrogate pair, which gives you the low surrogate alone.


Ah, thanks for explaining this. I was confused and thought that Latin-1 fit in UTF-8 in one byte, but I went and confirmed for myself you are right it does not. Hm.

In fact this means though that it isn't just the mistake of UCS-2 that results in people wanting "transparent Latin1 under the hood to save space".

Even if we had gone right to UTF-8, there would still be roughly the same cost savings there.

I can't think of any way to deal with unicode that wouldn't have similar cost savings from "transparently doing non-unicode when possible".

Well, I suppose unless you took whatever Java (et al.) are doing to mark which strings or portions of strings are actually being stored as Latin-1, and made that a Unicode encoding. That would of course have its own... challenges. If you think "transparently storing as Latin-1 when possible" is an ugly hack, you'd probably not be happier with it being made a standard Unicode encoding.

It turns out being able to represent all written human communication is really hard. Unicode actually does a pretty amazing job of it, balancing lots of trade-offs. (And the fact that it has as widespread adoption as it does is in part a testament to this; just because you make a standard doesn't mean anyone has to, or is going to, use it. Some of the trade-offs Unicode balanced were about making the adoption curve as easy as possible for existing software. Making ASCII valid UTF-8 was, I think, proven by history to be the right decision, although it involved trade-offs... such as Latin-1 not fitting in one byte :) ). Overall, Unicode is a really successful standard, technically and in terms of adoption.

But UCS-2 was a mistake that, had it been avoided, would have avoided some headaches. But, I think, not the headache where Java is motivated to "store some things transparently as Latin-1 where possible."


The alternative is to just not provide functions like charAt() that accept a codepoint index and are expected to be O(1). Instead, as Rust, Swift, and Go all do to some extent, you provide functions that accept UTF-8 byte indexes, and complain if they get an index that doesn’t correspond to the beginning of a codepoint (which can be verified in O(1) thanks to UTF-8’s self-synchronizing property). Then you can add higher-level APIs on top that provide non-random-access iterators over things like codepoints and grapheme clusters. After all, if you really want to treat a string as a list of characters, the most sensible default notion of “character” is probably a grapheme cluster rather than a codepoint - but there‘s no encoding that gives you O(1) random access to those anyway.

Under such a design, storing the string data as anything other than UTF-8 is unnecessary. In fact, it’s impossible: if you store string data as Latin-1, you can’t index into it with a UTF-8 byte index.

That does mean you can’t take advantage of the space savings of Latin-1 over UTF-8, but those are extremely minimal, since both encodings are 1 byte for all ASCII characters.


That is an "alternative", but I'm not sure it's relevant to the motivations people are having to use Latin1 for space savings.

I am not sure if the space savings really are extremely minimal or not for Latin1 vs UTF8. You are correct both encodings are one byte for all ASCII characters, but they are choosing to store as Latin1, NOT as ASCII, to get 1 byte for all Latin1 characters beyond ascii. If they didn't care about space savings for anything but ascii -- they could already have chosen UTF8 instead of Latin1. Which makes me think they care about the space savings for non-ascii Latin1.

It seems to me an orthogonal question to whether you provide functions like `charAt`, and what big-O performance-complexity.


Well, we’re talking about an optimization retrofitted onto a String class which guaranteed O(1) indexing by 16-bit code units. Given that constraint, only a fixed-width encoding can work, so UTF-8 is out. In lieu of that, you may as well pick an encoding that uses the full 8 bits, hence the choice of Latin-1 instead of ASCII.

But indexing by code units only really made sense when Unicode was limited to 16-bit codepoints. Back then, “code unit” and “codepoint” were the same thing, so every code unit had some semantic meaning of its own, rather than just being an artifact of the encoding. Thus, programmers could keep treating strings as “arrays of characters” like they were used to, except that “character” now meant a 16-bit integer instead of an 8-bit one. In truth, Unicode code points were never a perfect representation of what a human might think of as a “character”, due to the existence of combining characters like accent marks, but you could still arbitrarily chop them up and combine them and get something roughly sensible. Overall, they seemed like a reasonable compromise that preserved the notion of “array of characters” to as great an extent as possible.

But these days, a UTF-16 “code unit” can be half of a surrogate pair: an artifact of the encoding with no semantic meaning of its own, the exact thing we were trying to avoid. It now has the downsides of a variable-width encoding like UTF-8, without the space efficiency or the backwards compatibility. So if Java’s String were being designed from scratch – or if, hypothetically, Unicode had never been intended to be limited to 16-bit codepoints – there’s no way it would provide indexing by 16-bit code units.

It might instead opt to provide O(1) access to 32-bit Unicode codepoints. This is what Python 3 does, and so it has to represent strings as UCS-4 in the worst case, but it can store them as UCS-2 or Latin-1 if the string has no codepoints over 65,535 or 255, respectively. In that case, yes, you still end up with Latin-1 as an optimization. But Python has always prioritized ease of use over performance, and it had backwards compatibility constraints as well. For new designs, most people just see the overhead of UCS-4 as a bridge too far. Compared to that, it seems more attractive to bite the bullet and accept the downsides of variable-width encodings. And at that point you may as well pick UTF-8.
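You can watch CPython pick a representation directly; the exact byte counts below are CPython implementation details and vary between versions:

    import sys

    print(sys.getsizeof("a" * 1000))           # ~1049: 1 byte per char (ASCII storage)
    print(sys.getsizeof("é" * 1000))           # ~1073: still 1 byte per char (Latin-1)
    print(sys.getsizeof("\u4e2d" * 1000))      # ~2074: 2 bytes per char (UCS-2)
    print(sys.getsizeof("\U0001F600" * 1000))  # ~4076: 4 bytes per char (UCS-4)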


Java’s “String” interface included the fact that indexing a string by character is O(1), so its desire to not break interfaces rules out such choices.


Yes, and it includes that guarantee because it was designed with UCS-2 in mind. The question is what would have happened if, hypothetically, the mistake of UCS-2 had never been made.


Well, you can't get that easy O(1) charAt with UTF-8 either, so I think it'd actually be in the same spot, is what I'm saying. I think.

I guess perhaps it wouldn't ever have said it provided a charAt in O(1), okay. It may have been a mistake to think this mattered, it was just offered because it was easy. In which case I suspect there are actually few programs which would be negatively affected by changing charAt to the most efficient you can do in UTF8.

UTF-32 of course has the O(1) property, but nobody wants that memory usage.


From the article:

* Gecko is huge and it uses TwoByte strings in most places. Converting all of Gecko to use UTF8 strings is a much bigger project and has its own risks. As described below, we currently inflate Latin1 strings to TwoByte Gecko strings and that was also a potential performance risk, but inflating Latin1 is much faster than inflating UTF8.

* Linear-time indexing: operations like charAt require character indexing to be fast. We discussed solving this by adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case. This scheme will only work for ASCII strings, though, so it’s a potential performance risk. An alternative is to have such operations inflate the string from UTF8 to TwoByte, but that’s also not ideal.

* Converting SpiderMonkey’s own string algorithms to work on UTF8 would require a lot more work. This includes changing the irregexp regular expression engine we imported from V8 a few months ago (it already had code to handle Latin1 strings).


> Linear-time indexing

This is great for ASCII when you know there's no such thing as combining characters/etc, but I would like to remind everyone reading this that there's no such thing as "linear indexing" of user-perceived characters in Unicode.

User-perceived characters need processing in order to index due to Grapheme clusters potentially using many code points together. (https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundarie...)

For instance, 👨🏾‍🏫 (on my machine, a dark-skinned male teacher) is a combination of these characters:

- U+1F468 Man
- U+1F3FE Medium-Dark Skin Tone
- U+200D Zero Width Joiner
- U+1F3EB School

And knowing the byte index of the start/end of that character in a string cannot be done by just multiplying an offset by some fixed number of bytes per character.
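For the curious, that sequence spelled out in Python (one user-perceived character, four code points, even more code units and bytes):

    teacher = "\U0001F468\U0001F3FE\u200D\U0001F3EB"  # man + skin tone + ZWJ + school
    print(len(teacher))                               # 4 code points
    print(len(teacher.encode("utf-16-le")) // 2)      # 7 UTF-16 code units
    print(len(teacher.encode("utf-8")))               # 15 UTF-8 bytes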


After almost 4000 years have we concluded that the alphabet was a bad idea after all?


I would like to remind everyone reading this that there's no such thing as "linear indexing" of user-perceived characters in Unicode.

There is if you dynamically assign virtual code points to grapheme clusters.


Isn't ASCII indexing constant time and Unicode grapheme indexing ~linear time?


Correct, I seem to have quoted the wrong part of the GP comment... I meant to address this quote:

> adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case

But I got confused and addressed the linear indexing part.

But yeah, ASCII is constant time, but you can’t assume even UTF-32 can be constant time due to variable length grapheme clusters.


I interpreted the OP's question as "Why not use UTF-8 as the secondary encoding in these types of hybrid string classes which use a narrower representation when possible", not "Why not convert all strings to UTF-8".


The article has a whole paragraph dedicated to your question


The article specifies why they didn't convert everything to UTF-8 - not how they chose the encoding for that subset of lower code points.


Read the section: it describes exactly why they chose Latin-1. It literally answers the question asked, in detail...


Ah, you're right, though the first bullet point really threw me off:

>Gecko is huge and it uses TwoByte strings in most places. Converting all of Gecko to use UTF8 strings is a much bigger project and has its own risks.

But I guess they mean converting it all to support UTF-8 under the described circumstances - not just converting it all to UTF-8.


They're the same encoding, given only characters covered by Latin-1. So it's just a little more explicit to say Latin-1, as you're specifying not only the encoding, but also the set of characters.

Edit: I.e. it would be awkward to say you're implementing "UTF-8, but only for these codepoints". That would be equivalent to implementing Latin-1.

Edit: I'm thinking of standard ASCII rather than Latin-1. Above the first 128 code points, UTF-8 switches to two bytes while Latin-1 remains with one byte, so it is much simpler.


That’s not correct. Latin-1 and UTF-8 are both compatible with 7-bit ASCII but they are not the same encoding. For instance é (e acute) is a single byte in latin-1 (0xe9) but is two bytes in UTF-8 (0xc3 0xa9)


You're right! My bad - the first 128 characters are encoded the same across ASCII, UTF-8, and Latin-1, but the upper half of Latin-1 differs from UTF-8. So even just to support those first 256 code points, we jump into multi-byte UTF-8 territory, meaning complexity over Latin-1.
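Concretely:

    print("é".encode("latin-1").hex())                   # e9    (one byte)
    print("é".encode("utf-8").hex())                     # c3a9  (two bytes)
    print("A".encode("latin-1") == "A".encode("utf-8"))  # True: the ASCII range agrees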


This explanation has little to do with why. Latin-1 guarantees each character is coded into a single set of 8 bits, UTF-8 is a variable width encoding. The point of giving an encoding is so it is known how to decode it and a string passed as Latin-1 comes with guarantees about character positions and so on without parsing.


I vaguely recall making an alternate string class in the 90s that did this, also precomputing hashCode() in the constructor.


The most practical real-world solution across all operating systems is to use "UTF-8 everywhere", and only convert from and to UTF-16 or UTF-32 "at the last minute" when the string data is coming from or going into APIs that don't (yet) accept UTF-8 directly.

Also see: http://utf8everywhere.org/


Back in 2000, when I was doing the initial design of the D programming language, I was convinced that Unicode was the future (things weren't as clear then). Earlier, I had put various multibyte encodings into the Zortech C compiler (Shift-JIS, etc.).

But it wasn't clear at all whether to use UTF-8, -16 or -32. Windows and Java used UTF-16. So D became agnostic about that, supporting all three equally.

Over time it became clear that you're right, the best approach was to use UTF-8 everywhere, and last-minute convert to/from other formats.


I've always wondered why Windows doesn't allow setting the legacy code page to utf8. It would've been a nice hack to get old apps to work with Unicode, assuming they were at least somewhat MBCS aware. If the programmer assumed a character would always be one byte it might not have worked, but then that program would also have failed with a Japanese code page.


There's a fairly new experimental setting in Win10 where UTF-8 is set as the system-wide codepage, allowing to call the "narrow string" functions with UTF-8 strings.

See https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#U...

"With insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8.[a] This allows for calling "narrow" functions, including fopen and SetWindowTextA, with UTF-8 strings."


That's just awesome. I've wanted this since the win2k/xp days! Back then there were still a lot of non-Unicode programs around. It's less useful today, plus I'm not even using windows anymore.... But just seeing this happen feels great, satisfies past me.


From that same link:

Another popular work-around is to convert the name to the 8.3 filename equivalent, this is necessary if the fopen is inside a library function that takes a string filename and cannot be altered.

8.3 filenames for the win!


Though do note this is only an issue when using particular "narrow" functions. If you consistently use the recommended "wide" functions then none of these issues occur.


I saw once an explanation, probably in the blog which used to be at http://blogs.msdn.com/michkap/, but I'm not finding it anymore. IIRC, the reason was that there is a maximum number of bytes per character for a legacy multibyte code page, I don't recall if it was 2 bytes or 3 bytes (there should be a #define somewhere for it), while for UTF-8 the maximum number of bytes per character is 4 bytes. Allowing a user to set the legacy code page to CP65001 (the legacy code page number for UTF-8) could crash programs which depended on the original limit.


Michael Kaplan's blog was indeed great; Microsoft took it down, then he started blogging elsewhere, then he unfortunately died in 2015, taking his second blog with him. Happily, archives are available at http://archives.miloush.net/michkap/archive/


Yes, that was it, thanks! It's at http://archives.miloush.net/michkap/archive/2006/07/14/66571... (Can the CP_ACP be UTF-8?) which links to http://archives.miloush.net/michkap/archive/2005/02/06/36808... (Can a codepage be changed? How about which codepage a locale points to?) and http://archives.miloush.net/michkap/archive/2006/07/04/65605... (Behind Norman's 'Who needs Unicode?' post).

"It is not possible given both the current architecture (which must work in both user and kernel mode) and also the inherent assumption in several subsystems (like USER) that the ACPs maximum number of bytes per character is 2."

And the #define would have been MB_LEN_MAX.


Oh. I remember reading his blog occasionally back in the day. I faintly remember there was a disagreement with Microsoft about his blog or at least some post in it, so he moved to an external blog. Great to see someone archived that stuff.


Windows has had a codepage number for UTF-8 for a very long time. As I understand it, it's not recommended because too many things break when a multi-byte codepoint gets broken apart accidentally.


That's just a pseudo codepage unfortunately. You can use it when calling functions like MultibyteToWideChar etc. but not set it as the actual locale for processes.


Thanks, I didn't know that detail. `chcp 65001` seems to work.

I like the way they handle it in Python 3.6. They use their own internal encoding, then bypass the regular console byte API when they need to output a string, allowing full Unicode.


UTF-8 has good space behavior for English but as you get higher in the space you end up with 24 bits per character for languages that could expect 16 bits per character with another encoding.

That said, UTF-8 + gzip on the wire isn’t a bad solution.


The website addresses this issue: http://utf8everywhere.org/#asian . The short answer is that most text is not purely high code points, but contains, e.g. HTML tags.


I had fun trying to select parts of:

Приве́т नमस्ते שָׁלוֹם

On Chrome - there's a tricky usability solution to allow part selection of the Hebrew.


On Safari е́, स्ते, and שָׁ can only be selected as complete characters, which makes sense.


How much of the data transmitted or stored on any given system is actually text, as opposed to images, audio, video, machine code, application specific random stuff? Of the little bit of text, how much is HTML, CSS, JSON, any programming language; all of which would be pretty much all ASCII? And if you can find a niche application where most of your data is CJK text, is it really so much that it fits on a USB pen drive if encoded in BIG-5 but not if encoded in UTF-8?

My rough guesses at answers are 0.1%, 90%, and no. Besides, UTF-32 plus gzip isn't a bad solution, and completely ends all discussion about transfer formats... at the cost of requiring a compressor.


On Windows this isn't great advice if you're making heavy use of Windows APIs. You end up needlessly converting back and forth between UTF-16 and UTF-8.

So it often makes more sense to simply use UTF-16 for your own strings and assume UTF-16 for strings returned by the WinAPI (but handle invalid sequences appropriately).


One way to think about text, is as the only data we use that shrinks over the years. Every other type of data continues to grow.

The way this can be stretched to make sense is by thinking for example how much these have grown over 20 years.

Media (music, video, images) grows staggeringly because of 200p > 4K, 128 kbps MP3 to FLAC or whatever, and there's more of it all that's gone digital.

Code size continues to grow quite significantly even.

But text? Even the massive wikipedia's size, only a fraction is text content.

Then as memory capacity also grows it's really a back and forth battle between computing resources and all other data, except text.

Hard to think of many places 4 byte chars would matter nowadays, maybe the tiniest devices for a few apps?


Conversely as CPUs get more and more powerful the overhead of a non-fixed-width encoding also becomes negligible unless you're processing a huge amount of text. And if you process enough text that it becomes a problem then maybe size would be an issue too.

Furthermore, 4-byte chars also have the endianness problem of UTF-16.

I think the rule of thumb of "use UTF-8 unless you really have a good reason not to" still holds true today.


I think that’s a valid and interesting point that the classic time versus space computer science trade-off needs to be considered in this case for large workloads.

One thing that seems different is that you also hear concern from developers about coding complexity and the desire to always have things be as easy to reason about as possible. Yes, these are things a library should just handle for you, but it is sometimes useful to know what's happening fundamentally, or it can sometimes allow for fewer surprises.

Nevertheless, if one of the cases you mentioned became significant, I'm not sure many would say coding complexity should take precedence over your considerations.


I agree with you there, I think in the vast majority of applications these days when it comes to text processing simplicity and robustness beats other concerns.

It's unclear if UTF-32 beats UTF-8 there however. UTF-8's encoding is a bit trickier to deal with of course, but it's really not that bad. On top of that you have a few advantages, such as not having to worry about endianness as mentioned previously, but also not having to worry too much about alignment or corrupted input (you can re-synchronize UTF-8 at every codepoint since the leading byte has a unique bit pattern). And of course ASCII/C-string compatibility means that it's unlikely to be mangled by non-unicode-aware applications.

In any case the complexity probably won't be in the encoding, be it UTF-8, 16, 32 or anything else. UTF-32 might be constant-width per codepoint, but in general you don't really care about codepoints, since they can be combined in various ways to create things like country flags or complex characters with diacritics.
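
As a minimal sketch of the re-synchronization property mentioned above (plain Java over a byte[]; the helper name is mine, not from any library): continuation bytes always match the bit pattern 10xxxxxx, so from any offset you can walk back to a codepoint boundary.

    // Back up from an arbitrary byte offset to the start of the codepoint
    // containing it. Continuation bytes look like 10xxxxxx (0x80..0xBF);
    // anything else (ASCII or a lead byte) marks a codepoint boundary.
    static int codePointStart(byte[] utf8, int offset) {
        while (offset > 0 && (utf8[offset] & 0xC0) == 0x80) {
            offset--;
        }
        return offset;
    }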


> [...] when it comes to text processing simplicity and robustness beats other concerns. It's unclear if UTF-32 beats UTF-8 there however [...]

I always wondered how this affects the simplicity and efficiency of different text editor implementation strategies. I'm mainly thinking about the variable byte lengths and how they affect anything related to character indices. I'm guessing that if it seriously hinders a preferred strategy, one would just convert to plain codepoints of fixed word size internally, at the cost of greater memory usage... which actually is not that bad if you only want to support the BMP without any fancy encoding, since that would only take up 2 bytes per character internally.


CPUs aren't getting more powerful. Haven't been for years.


How do you figure that?


String-keyed hash maps are really common, and the keys tend to be ASCII. This would increase their size by a factor of 4, resulting in substantially increased memory usage and decreased performance.


Certainly, but I’m not sure myself, can you think of any scenarios that are common where the difference would be noticeable or significant?


Boot any modern OS and you’ll have megabytes if not tens of megabytes of these strings, and spend a small but significant amount of CPU time looking them up. This may not matter on your Ryzen desktop with 97 cores and 318GB of RAM, but it’s substantial on a smartphone.


Well, that seems to be an unfounded assumption that I'm considering only powerful machines, especially since possible challenges on small devices were mentioned in the OP.

In fact, I've done a fair amount of ARM coding, so I understand what you are trying to say.

Nevertheless, your comment seems to point out that smartphones may be less powerful, but it sheds no light on what the delta might be for boot performance or how you can be confident it's, say, more than a few hundred milliseconds.

Maybe it is, I haven't isolated and investigated that single difference enough to form a reasonable guess. Have you?


Any dynamic language, as all the symbols are strings.


It's not just the raw encoding that becomes easier to relax about with higher system resources - it can also be generically compressed using those same resources. I run my Windows machines with disk compression enabled, and that tames most "bloated" files and lets me fit a lot more into a 256 GB SSD.

However, there is some appeal in defining the format to a fixed low fidelity, too, since it shapes the authoring process around conveying the content through the limitations - "the medium is the message." But such formats are more likely to be the source format than the transmission format, in the same way that old video games designed for 8 kB of working memory now end up with terabytes of video footage streamed to online audiences.


Nitpick: it’s not shrinking, probably growing faster than ever, it’s just growing slower than other forms of media.


Fair enough. Agreed, it would have been expressed more clearly as some sort of relative perception.


”A long time ago, at least in computer time, in the far-flung era of 1989, the Unicode working group was really starting to get going”

I find it strange to start in 1989, and not in 1980, when Xerox had a precursor to Unicode, or 1988, when employees from Xerox and Apple coined the name “Unicode” (https://en.wikipedia.org/wiki/Unicode#History)

It certainly is a bit unfair to not mention Xerox or Apple at all.

”that the JVM is gaining the ability to transparently represent strings in memory as Latin-1, a legacy codepage, if possible.”

“Compact strings” (https://openjdk.java.net/jeps/254) shipped with JDK 9 in September 2017 (https://docs.oracle.com/javase/9/whatsnew/toc.htm#JSNEW-GUID...)


This year is mentioned on the Wikipedia page and its references as the sort of "genesis date" of all the companies' involvement and is more or less when the UCS-2 tragedy began. (This is a post specifically about UTF-16 and UCS-2, not the entire history of Unicode. That'd be... considerably longer.)

I mention in another post (Correction: In a draft of a post.) I'm still stuck on JDK 8 for various reasons out of my control, and am generally unaware of changes made in 9 or later. I'll amend the post to reflect that the feature shipped.


And the tragedy of early adoption! Microsoft looks antiquated for using UTF-16 today, but it was actually pretty cutting edge when it happened. They adopted a lot of tech pretty early too, like XML. All in all, pretty fascinating.

Windows 10 recently added the ability to use UTF-8 for the "ANSI" legacy codepage, which is awesome! I accomplished something similar previously using API hooking, but a Microsoft solution can simply do a much better job across the board. Now the only reason to set the encoding to anything else is to avoid mojibake when using non-English software. (Dear Japanese software developers: please consider using the wide Win32 APIs :( )


I was under the assumption that Linux and Mac OS have sizeof(wchar_t)==4, ergo UCS-4, because it was clear from the beginning that 16 bits wouldn't do it. A compromise "only" Windows and Java were willing to make. But they carried it all the way through the stack.

Linux goes for a byte-stream encoding (e.g. UTF-8), because it isn't (or wasn't) doing any string processing. So encoding efficiency beats the algorithmic complexity of string operations. If you need an algorithmically more efficient encoding, that's up to your application, and UCS-4 was the way to do it.


Certainly on Mac OS, sizeof(wchar_t) had nothing to do with the way text strings got stored. The Mac had an elaborate system for handling international text that used either bytes or 16-bit quantities for text storage, where the interpretation of what a byte or byte pair meant was stored separately from those bytes. https://developer.apple.com/library/archive/documentation/ma...:

”The character codes may be 1-byte or 2-byte codes, but there is nothing else in the text stream except for those codes. Using font information that your program stores separately, Script Manager routines can help you determine whether a character is 1 or 2 bytes, and other managers allow you to work with either character size.”

More cumbersome to use than modern Unicode API’s? Definitely, but it was superior to what Windows provided for years.


That's not really a Linux assumption. It's a gcc one, but you can use a 16-bit wchar_t with a flag (-fshort-wchar) too. Almost no one uses wchar_t on Unix anyway.


Wasn't really expecting this post to leave my circles, sure am glad now I checked my sources.

Look, ma, I'm on Hacker News!


Thanks to Microsoft, it is now entrenched in JavaScript and even the Language Server Protocol. Eventually they should just set UTF-16 on fire and move to UTF-8, as all the smart people already did. Microsoft has always shambled behind progress: the lack of C99 support until recent times, UCS-2, MAX_PATH (MSBuild still doesn't work with long names, in 2019!), and many other similar problems.


Yes, the language server protocol is particularly insane because it uses UTF-16 code unit positions (sort of; it's JavaScript, so it's not anything standard really), but the actual text data is sent as UTF-8! So for my sane UTF-8-speaking language server, the text gets converted from UTF-8 to UTF-16 when loaded from disk, then converted back for transmission via the LSP, then converted back to UTF-16 by my language server so I can handle the positions, and then it does everything in reverse again on the way back!

Absolute madness. I think they will fix it eventually though. The madness is too great to ignore.
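
For what it's worth, the position conversion itself is tiny on a platform whose strings are already UTF-16 (a Java sketch just to illustrate the idea; it assumes the column is in range and doesn't split a surrogate pair):

    // Convert an LSP-style UTF-16 code unit column within a line of text
    // into a byte offset into that line's UTF-8 encoding.
    static int utf16ColumnToUtf8Offset(String line, int utf16Column) {
        return line.substring(0, utf16Column)
                   .getBytes(java.nio.charset.StandardCharsets.UTF_8)
                   .length;
    }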


It's a bit unfair to pick on Microsoft here. They were an early supporter of Unicode, back when UCS-2 looked like the best and/or only way to encode Unicode.


So the next time you're asked in an interview to implement isPalindrome, be sure to ask if you need to support surrogate pairs and diacritics.
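
If you do get that question, a codepoint-aware version is only a little more work than the naive char-based one (a Java sketch; it handles surrogate pairs, but combining marks would additionally need normalization or grapheme segmentation, which it deliberately skips):

    // Palindrome check over codepoints, so surrogate pairs (e.g. emoji)
    // compare as whole characters rather than as two UTF-16 code units.
    // Combining marks/diacritics are NOT handled here.
    static boolean isPalindrome(String s) {
        int[] cps = s.codePoints().toArray();
        for (int i = 0, j = cps.length - 1; i < j; i++, j--) {
            if (cps[i] != cps[j]) return false;
        }
        return true;
    }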



I remember when the Linux distributions switched to unicode implementations of tools and there was a noticeable speed reduction of things like ls. Fortunately the processor speeds kept on improving.


I've been wondering: how much processor overhead does UTF-8 have? Not just because you can't jump N bytes into a string to get a character offset, but because each byte has to be inspected to find whether it's the start of a multi-byte sequence and, if so, how many (1-3) subsequent bytes are included.

Does that mean there are branch-prediction failures and stalls every few bytes when processing UTF-8 strings with many multi-byte characters in them?


Yes, the variable length nature of UTF-8 generally means more hard-to-predict branches. But sufficiently clever programmers can find ways around that, see Lemire's work on using SIMD to validate UTF-8 very quickly: https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...


> Not just because you can't jump N bytes into a string to get a character offset

I keep hearing people refer to this operation (mostly in HN threads), I have yet to see an appropriate use of it. So far the only uses I've seen were (for Unicode) misguided attempts at character iteration, which doesn't need this at all.


At work we have to parse various different fixed-width column files. We only need data from a handful of the columns, so we use the substring function to extract the data we need.

Of course these files are so old they predate proper encoding and are implicitly Windows-1252 or similar.


That's been my experience, I've done it but never anything that a scanning or tokenizing algorithm couldn't do.

So being able to iterate over characters is important, but constant time access to a specific character seems very rare.

I think the only case where I've wanted a specific character was when I wrote games that loaded maps from text files and those were ASCII regardless.


Some of the records are over 1k chars wide, and we need stuff near the end.

Though if one had to decode the line from UTF-8, you're iterating anyway so...


I could see a case for it if you were finding line endings in bytes mode, and then casting a buffer to a UTF-8 string. Then you'd benefit from a reverse iterator, which UTF-8 supports just fine.

I imagine the main reason people would use indexing is to scan through strings in lockstep, e.g. to implement comparison, but iterators can cover that case as well, especially if you can copy them.


CPU overhead is nowadays much less important than cache misses. UTF-8 has the fewest cache misses, Linux/BSD UTF-32 widechars the most, and Windows UCS-2 sits in between. UTF-32 wchar_t was a gigantic design failure and should have been deprecated a long time ago. There's still no POSIX u8 API in sight.

The branch misses are not that frequent or costly, as most characters don't need the subsequent branches for 2-4 byte sequences, but a cache miss costs ~50x more than a branch miss. And there are enough parallel pipelines now to avoid most UTF-8 branch misses.

The new massively parallel SIMD ways to process UTF-8 are so fast because the width is so high (256 bytes per loop), which is enough for most strings. But avoiding branching has some constant overhead to prepare the loops; for small strings it rarely pays off.


With SSE it's 2.6-3.4 cycles/byte to unpack UTF-8 into UTF-32 (ie integers). See https://github.com/KWillets/fastdecode-utf-8 .

Selection (extracting the i-th element) is actually quite fast, but there's some overhead in grabbing a whole chunk of characters and extracting all their payload bits at once.


What kind of workload are you thinking about, though? Why are you streaming these characters? What processing are you doing that doesn't already have conditional logic for each character?


In many cases the string data is copied around as-is, and iterating over the code points is hardly needed. And since 7-bit ASCII characters are still unique bytes in a UTF-8 encoded stream, common operations like splitting by comma, whitespace, slash, etc. work the same as on ASCII strings (e.g. just look at the byte stream for the splitting characters and keep the byte ranges in between intact; no UTF-8 decoding needed).
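
A minimal sketch of that property (Java over a raw byte[]; the delimiter is assumed to be a 7-bit ASCII byte): the split never needs to decode anything, because bytes below 0x80 never occur inside a multi-byte UTF-8 sequence.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Split a UTF-8 byte stream on an ASCII delimiter without decoding it.
    // Safe because 0x00-0x7F never appear inside a multi-byte sequence.
    static List<byte[]> splitOnAscii(byte[] utf8, byte delimiter) {
        List<byte[]> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < utf8.length; i++) {
            if (utf8[i] == delimiter) {
                parts.add(Arrays.copyOfRange(utf8, start, i));
                start = i + 1;
            }
        }
        parts.add(Arrays.copyOfRange(utf8, start, utf8.length));
        return parts;
    }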


> each byte has to be inspected to find whether it's the start of a multi-byte segment, how many of the subsequent bytes 1-3 are included.

Mostly no. Or at least, not if the implementation was written by somebody who knows what they're doing. A lot of multi-byte encodings do have the problem you're worried about, but UTF-8 does not, and the definition is even careful to make this true when you're processing malformed UTF-8.

The trick is that the bit pattern alone tells you whether this is a trailing byte. Nothing you could be looking for could _begin_ with a trailing byte; they only make sense in trailing position. Whether you are looking for a colon, an omega, or the dog emoji, a trailing byte won't match.


Reading this, I'm wondering whether it's possible (and if so, how) to write Java and JavaScript programs that internally use other encodings, for example UCS-4. If so, what would do the conversion (the parser, the compiler)?


I've personally done this in Java, as mentioned. Generally I use fastutil to create an IntList (which is just an int[] with some conveniences) and use Java's Character class to convert between UTF-16 and UCS-4 on-the-fly.

It's not very fast, but it makes flawless Unicode support easy. :P
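
For reference, the standard-library version of that round trip is roughly this (a sketch with a plain int[] rather than fastutil's IntList, which wraps the same idea):

    // UTF-16 String -> array of codepoints (i.e. UCS-4 / UTF-32 values).
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // And back again: codepoints -> UTF-16 String.
    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length);
    }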


There are other tragedies too. SCSU lets you represent strings reasonably compactly, but Java doesn't use it, so...


> Unix and Linux conveniently sidestepped this whole issue by just shrugging their shoulders and going "eh, a char has no defined encoding, it's just a number.".

Wrong: that's a feature of C, which all these systems are based on.

wchar_t is 32 bits on Linux, and, though also just a number, it commonly represents a Unicode code point.


I would assume that what the author meant by this is that on Unix, as far as the kernel is concerned, names of various entities (filenames and similar things) are simply sequences of bytes that it does not ordinarily try to interpret in any way. (On the other hand, SUS specifies that only a very limited set of characters is valid in a filename, but no "true Unix" actually cares about that.)

On Windows NT a filename is specified as a sequence of 16-bit wchar_ts and is supposed to be valid NFC UTF-16. OS X also expects that filenames are valid sequences of Unicode codepoints, in this case NFD.


> On Windows NT a filename is specified as a sequence of 16-bit wchar_ts and is supposed to be valid NFC UTF-16.

I would emphasise that the "supposed to be" is a convention only (similar to the Linux UTF-8 convention). The only filesystem enforced rules are that file/directory names can't include \/:*?"<>| or wchar_t(s) less than 0x20 (aka ascii control characters).

A Win32 path can also have additional rules (such as a file cannot be called "COM") but these aren't always enforced, depending on how the API is called.


Yes, that was the intent. C itself obviously just sees chars as numbers, and in C strings have no real meaning at all. The fact that the kernel treats them as opaque byte sequences is the important part.


Unrelated: I love the theme of that blog. Anyone know if it's open source?



