
The Tragedy of UCS-2 - ingve
https://unascribed.com/b/2019-02-08-the-tragedy-of-ucs2.html
======
jbeda
This is a great rundown. I started my career at Microsoft working with and on
Win32/COM and saw this play out first hand.

One thing not mentioned here is the history of the "Byte Order Mark" (BOM) in
unicode.

(Not an expert here but my understanding having lived it.)

You see, given UCS-2, there are 2 ways to encode any codepoint -- either big-
endian or little-endian. The idea was then to create a codepoint (U+FEFF) that
you could put at the start of a text stream to signify which byte order the
file was encoded in.

Wikipedia page:
[https://en.m.wikipedia.org/wiki/Byte_order_mark](https://en.m.wikipedia.org/wiki/Byte_order_mark)

This then got overloaded. When loading a legacy text format, there is often
the difficulty of figuring out which code page to use. When applied to HTML,
there are a bunch of ways to do it and they don't always work. There are
things like charset meta tags (but you have to parse enough HTML to find one
and then re-start the decode/parse). But often even that was wrong.
Browsers used to (and still do?) have an "autodetect" mode where they would
try to divine the codepage based on content. This is all in the name of "be
liberal in what you accept".

Enter UTF-8. How can you tell if a doc is US ASCII or UTF-8 if there are no
other indications in the content? How does this apply to regular old text
files? Well, the answer is to use the BOM. Encode it in UTF-8 and put it at
the start of the text file.

But often people want to treat simple UTF-8 as ASCII, and the BOM leaves a
high, non-ASCII codepoint sitting at the start of what would otherwise be a
pure ASCII document. And everyone curses it.
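
For illustration, a minimal sketch (in Python, purely as an example; the BOM handling
described above isn't specific to any language) of detecting and stripping the UTF-8 BOM,
whose encoded form is the three bytes EF BB BF:

    # The UTF-8 encoding of U+FEFF is the byte sequence EF BB BF.
    UTF8_BOM = b"\xef\xbb\xbf"
    
    def strip_utf8_bom(data: bytes) -> bytes:
        """Drop a leading UTF-8 BOM if present, otherwise return the data unchanged."""
        return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
    
    # "example.txt" is just a placeholder filename.
    raw = open("example.txt", "rb").read()
    text = strip_utf8_bom(raw).decode("utf-8")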

Having the BOM littered everywhere doesn't seem to be as much of a problem as
it used to be. I think a lot of programs stopped putting it in, and a lot of
other programs speak UTF-8 and deal with it silently. Still something to be
aware of, though.

~~~
jjoonathan
Yeah, the BOM has gotten me a few times.

My most spectacular fail was a program that read UTF-8 or Latin-1 and wrote
UTF-16, preserving but not displaying null characters. I believe this was the
default behavior of HyperStudio. Every round-trip would double the size of the
file by inserting null bytes every other character. Soon there were giant
stretches of null characters between each displayed character, but the
displayed text never appeared to change, even though the disk requirements
doubled with each launch. That's how I learned about UTF-16!

Speaking of Win32/COM... is there a "tcpdump for COM"? I've got a legacy app
that uses COM for IPC and I've been instrumenting each call for lack of one.

~~~
carey
If there is anything like tcpdump for COM, it would be part of Event Tracing
for Windows, but you’d probably prefer to use it via Microsoft Message
Analyzer.

------
uranusjr
It always baffled me, as a native Chinese speaker, that 16 bits were _ever_
considered enough to “be plenty of room to encode every script in modern use,
and then some”. Common local encodings like Big5 (also 16-bit fixed-width)
were already suffering from a serious lack of code points at the time. It
would have been obvious if they had consulted literally any experienced
programmer from East Asia. Yeah, that was 1990, so it's not an easy thing to
do by any means, but you'd expect more from someone aspiring to create such a
thing.

~~~
Tuna-Fish
> It would be obvious if they consulted literally any experienced programmer
> from East Asia. Yeah, that was 1990 so it’s not an easy thing to do by any
> means

They did consult not just experienced East Asian programmers, but Chinese,
Korean, Japanese and Vietnamese professors of linguistics and the relevant
local authorities, and had them form the Ideographic Rapporteur Group, which
advised them that all the East Asian languages could be encoded in 20940 code
points.

Unicode is not centrally run the way so many people seem to expect. Rather,
the Unicode Consortium itself acts just as the standards body that mates
local standards together into a single complete whole. The Consortium makes no
decisions about how Chinese is represented, other than allocating it code
pages; all the relevant decisions about any language are made by local experts.
The problem was that the early 90's were an era of boundless optimism and
futurism, which in East Asia resulted in the Han Unification project, with the
idea of cutting down the number of symbols in use and unifying the
representation of all the languages that used a Han-derived script. That...
did not go so well.

~~~
SloopJon
Thanks for that background. I was thinking that the (for lack of a better
word) politics of CJKV and Han unification likely had a lot to do with it.

To my eyes, the letter "A" in English (Latin), Greek, and Russian (Cyrillic)
looks identical. Why does it need three separate code points? Maybe it gets
tricky when the relationship between uppercase and lowercase diverges. In
fact, that does appear to be one of the arguments in this technical note:

[https://www.unicode.org/notes/tn26/](https://www.unicode.org/notes/tn26/)

~~~
simias
I think the most salient point in your link is:

>Even more significantly, from the point of view of the problem of character
encoding for digital textual representation in information technology, the
preexisting identification of Latin, Greek, and Cyrillic as distinct scripts
was carried over into character encoding, from the very earliest instances of
such encodings. Once ASCII and EBCDIC were expanded to start incorporating
Greek or Cyrillic letters, all significant instances of such encodings
included a basic Latin (ASCII or otherwise) set and a full set of letters for
Greek or a full set of letters for Cyrillic. Precedent for the purposes of
character encoding was clearly established by those early 8-bit charsets.

That's true, although it reminded me of a cool charset: the Russian KOI8-R
encoding[1]. This encoding was created so that if you stripped the high bit
(presumably on a system unable to handle Russian properly) you ended up with
semi-legible latinized Russian. Quoting Wikipedia:

>For instance, "Русский Текст" in KOI8-R becomes rUSSKIJ tEKST ("Russian
Text") if the 8th bit is stripped; attempting to interpret the ASCII string
rUSSKIJ tEKST as KOI7 yields "РУССКИЙ ТЕКСТ". KOI8 was based on Russian Morse
code, which was created from Latin Morse code based on sound similarities, and
which has the same connection to the Latin Morse codes for A-Z as KOI8 has
with ASCII.

[1]
[https://en.wikipedia.org/wiki/KOI8-R](https://en.wikipedia.org/wiki/KOI8-R)
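
As a quick illustration (a Python sketch, relying on its built-in koi8_r codec),
stripping the high bit reproduces exactly the latinized form quoted above:

    koi8 = "Русский Текст".encode("koi8_r")
    stripped = bytes(b & 0x7F for b in koi8)   # drop the 8th bit
    print(stripped.decode("ascii"))            # -> rUSSKIJ tEKST (note the case inversion)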

~~~
yencabulator
Wow, they made the 8th bit essentially be the case bit, and losing+recreating
it mostly just made everything uppercase.

~~~
bonzini
No, the case bit is still bit 5. However, unlike ASCII, KOI7/KOI8's Cyrillic
alphabet sets bit 5 to 1 for _uppercase_ characters. This way, even though
Cyrillic text remains legible, you also have a clue that the encoding is KOI.

------
Sniffnoy
It's also worth noting where the terms "UCS-2" and "UCS-4" come from. For a
while there was the idea of the Universal Coded Character Set[0] as an
_alternative_ to Unicode. This would have been a natively 32-bit encoding.
However, it wouldn't have allowed the full 2^32 possible characters, because
of its insistence that none of the 4 bytes could be any of the C0 or C1
controls. Still, it would have had about 2^29 possible characters, as opposed
to Unicode's roughly 2^21.

But the insistence on avoiding C0 and C1 controls would have led to some odd
codepoint numbers, and definite incompatibility with ASCII and Unicode; for
instance, 'A', instead of 0x41, would have been 0x20202041! Whereas Unicode of
course made sure to copy ASCII (indeed, to copy ISO-8859-1) for its first
block.

Of course, this didn't come to pass because, well, people were already
supporting 16-bit Unicode and didn't want to switch _again_. And I'm kind of
glad it didn't -- while it would have meant not having to deal with the
compromise that is UTF-16, imagine living in a world where 'A' is not 0x41,
but rather 0x20202041! (Or imagine opening up a text file and finding three
spaces between each character...)

[0][https://en.wikipedia.org/wiki/Universal_Coded_Character_Set](https://en.wikipedia.org/wiki/Universal_Coded_Character_Set)

~~~
tyingq
_" imagine living in a world where 'A' is not 0x41, but rather 0x20202041! (Or
imagine opening up a text file and finding three spaces between each
character...)"_

On the other hand, often _" obviously broken"_ is better than _" subtly
broken"_. There's a ton of undetected encoding problems in the wild right now.

------
raphlinus
I would go further and say that UCS-4 is also effectively a variable-length
encoding, mostly because of emoji and their endearing ZWJ sequences, not to
mention the encoding of flags as two Regional Indicator Symbol codepoints
(both in the astral plane, of course), which makes Unicode not
self-synchronizing.

So the idea of ever having a fixed-length encoding for Unicode is basically
impossible now. Best to just use UTF-8 for everything, plus logic to group it
into code points, grapheme clusters, or whatever other granularity is needed.
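
A small example of the flag point (a Python sketch; whether the flag actually renders
depends on your fonts): even in "fixed-width" UTF-32, one displayed flag is two code units.

    flag = "\U0001F1EB\U0001F1F7"            # Regional Indicator Symbols F + R
    print(len(flag))                          # 2 code points, though it renders as one flag
    print(len(flag.encode("utf-32-le")))      # 8 bytes: still two 4-byte units in UTF-32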

~~~
Jasper_
Fun fact: because flags are of a political nature, and Unicode did not want to
limit the length of country codes, any number of Regional Indicator Symbol
code points can be placed back to back and treated as an extended grapheme
cluster.

That means that what Unicode today considers "one character" can be unbounded
in size.

~~~
kps
And worse, the ‘operators’ are postfix or infix, so if you're handling text
sequentially — and let's say, split across network packets with an arbitrary
delay in between — you can't even tell you've read a complete character until
you receive the next one.

 _Typing_ in many Latin languages generates diacritics using ‘dead keys’,
which are prefix operators. (Originally, on a physical typewriter, these were
keys that simply struck the paper without triggering the mechanism to advance
the carriage.) If Unicode had taken this hint, life would be easier.

(ISO/IEC 6937 had prefix diacritics, but it was too late.)

~~~
tialaramex
Complete recognition before processing. It's not the law, just a good idea,
but you should do it anyway. In cryptography this lesson had to keep being
learned. "Oh, I received this partial data, and I processed it, and then in
the next packet I received the MAC and I realised the bad guys altered the
data, oops"

Process whole strings, the meaning of a partial string may not be what you
hope, even without fancy writing systems and Unicode encoding.

"Give the money to Steph" \- OK, will do

"...enson" \- Crap, OK, somebody chase Steph and get our money back, meanwhile
here's James Stephenson's money

"... once he gives you the key" \- Aargh. Somebody chase down James and get
the key off him.

~~~
lanstin
Wow, what an interesting idea. But so many algorithms are streaming, etc., and
people stick megabytes of crap into one framed message.

------
cesarb
Another aspect of this tragedy is that the creation of UTF-16 has forever
limited Unicode to only 17 planes. The original UTF-8 encoding
([https://tools.ietf.org/html/rfc2044](https://tools.ietf.org/html/rfc2044))
used up to six bytes, and could represent a full 31-bit range, which
corresponds to 32768 16-bit planes; after UTF-16 became common, UTF-8 was
limited to four bytes
([https://tools.ietf.org/html/rfc3629](https://tools.ietf.org/html/rfc3629)).

~~~
kps
The original pattern extends up to at least 36 bits (which makes the PDP-10
refugees happy).

    
    
        1  7  0xxxxxxx
        2 11  110xxxxx 10xxxxxx
        3 16  1110xxxx 10xxxxxx 10xxxxxx
        4 21  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        5 26  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        6 31  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        7 36  11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    

Then, you can either decide to limit the leader to one byte, giving you 42
bits…

    
    
        8 42  11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    

Or allow multi-byte leaders and continue forever.

    
    
        9 53  11111111 110xxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
       10 58  11111111 1110xxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        ⋮
    

The Intergalactic Confederation will not look kindly on UTF-16.
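
For concreteness, a small sketch of an encoder following the classic pattern extended to
7 bytes / 36 bits, as tabulated above (Python, and entirely hypothetical since no standard
defines this):

    def encode_extended_utf8(cp: int) -> bytes:
        # Classic UTF-8 bit pattern extended out to the 7-byte / 36-bit row.
        if cp < 0x80:
            return bytes([cp])
        for nbytes, bits in ((2, 11), (3, 16), (4, 21), (5, 26), (6, 31), (7, 36)):
            if cp < (1 << bits):
                lead = ((0xFF << (8 - nbytes)) & 0xFF) | (cp >> (6 * (nbytes - 1)))
                out = [lead]
                for i in range(nbytes - 2, -1, -1):
                    out.append(0x80 | ((cp >> (6 * i)) & 0x3F))  # 10xxxxxx continuations
                return bytes(out)
        raise ValueError("value too large for 36-bit extended UTF-8")
    
    print(encode_extended_utf8(0x1F600).hex())   # f09f9880, matches real UTF-8 for U+1F600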

~~~
cesarb
> Or allow multi-byte leaders and continue forever.

That loses one very useful property of UTF-8: you always know, by looking at a
single byte, whether it's the first byte of a code point or not. It also loses
the property that no valid UTF-8 encoding is a subset of another valid UTF-8
encoding. It's better to stop at 36 bits (which keeps another useful property:
you'll never find a 0xFF byte within a UTF-8 string, and you'll only find a
0x00 byte when it encodes the U+0000 code point).
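
A minimal sketch of that first property (Python, just for illustration): any single byte
tells you on its own whether it starts a code point, because continuation bytes all match
the 10xxxxxx pattern.

    def is_lead_byte(b: int) -> bool:
        # Continuation bytes are 0b10xxxxxx; everything else starts a code point.
        return (b & 0xC0) != 0x80
    
    data = "aé€😀".encode("utf-8")
    print([(hex(b), is_lead_byte(b)) for b in data])
    assert 0xFF not in data   # 0xFF can never appear in valid UTF-8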

------
devadvance
Fun fact: for SMS, UCS-2 is used when a message requires characters beyond the
basic 128-character alphabet to be rendered.[1]

If you've ever noticed that entering an emoji or other non-ASCII [2] character
seems to dramatically increase the size of your SMS, that's because the
message is switching from 7-bit characters to 2-byte characters.

[1] [https://www.twilio.com/docs/glossary/what-is-
ucs-2-character...](https://www.twilio.com/docs/glossary/what-is-
ucs-2-character-encoding)

[2] It's actually GSM-7, not ASCII, but the principle is the same:
[https://www.twilio.com/docs/glossary/what-is-
gsm-7-character...](https://www.twilio.com/docs/glossary/what-is-
gsm-7-character-encoding)
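
A rough back-of-the-envelope sketch of why this hurts (assuming the standard 140-byte SMS
payload; segments shrink a bit further once a concatenation header is needed):

    PAYLOAD_BYTES = 140
    print(PAYLOAD_BYTES * 8 // 7)   # 160 characters per segment in GSM-7
    print(PAYLOAD_BYTES // 2)       # 70 characters per segment in UCS-2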

~~~
greggyb
I was a flip-phone holdout for a long time. This was not generally supported
in new devices sold up to 2015 (when I acquired my first smartphone, and my
experience with good phones ended).

Every flip phone I owned played poorly with non-ASCII text. Receiving a
message with an emoji would render the entire message, or if I was lucky just
the portion following the emoji, unrenderable. Most phones I had just fell
back to rectangles, which I assume is their method of mojibake.

This was not a causal factor in my switch to a smartphone. It was annoying to
have to explain character encodings to friends who didn't understand why they
couldn't send me emoji-containing messages.

~~~
SynthCann
You sound like a blast.

~~~
dang
Hey, could you please review the site guidelines and stick to the spirit of
this site when posting here? Normally we ban accounts that attack others like
this, and I actually banned yours, but decided to reverse that because you
posted a more substantive comment earlier today. Basically, we're trying for
more thoughtful conversation and better signal/noise than internet default.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

You might also find these links helpful for getting the intended use of HN:

[https://news.ycombinator.com/newswelcome.html](https://news.ycombinator.com/newswelcome.html)

[https://news.ycombinator.com/hackernews.html](https://news.ycombinator.com/hackernews.html)

[http://www.paulgraham.com/trolls.html](http://www.paulgraham.com/trolls.html)

[http://www.paulgraham.com/hackernews.html](http://www.paulgraham.com/hackernews.html)

------
STRML
JS, like Java, now has implementations that will also store strings as Latin1
when the implementation believes it is safe to do so. This results in
significant memory savings [1] in most programs.

1\. [https://blog.mozilla.org/javascript/2014/07/21/slimmer-
and-f...](https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-
javascript-strings-in-firefox/)
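
A simplified model of when that optimization can apply (a Python sketch; presumably the
engines track this as strings are created): a string can be stored one byte per character
exactly when every code point fits in Latin-1.

    def fits_latin1(s: str) -> bool:
        return all(ord(c) <= 0xFF for c in s)
    
    print(fits_latin1("café"))    # True: é is U+00E9
    print(fits_latin1("naïve☃"))  # False: U+2603 SNOWMAN doesn't fit in one byte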

~~~
raverbashing
I wonder why Latin-1 and not just UTF-8

~~~
Macha
From the article:

* Gecko is huge and it uses TwoByte strings in most places. Converting all of Gecko to use UTF8 strings is a much bigger project and has its own risks. As described below, we currently inflate Latin1 strings to TwoByte Gecko strings and that was also a potential performance risk, but inflating Latin1 is much faster than inflating UTF8.

* Linear-time indexing: operations like charAt require character indexing to be fast. We discussed solving this by adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case. This scheme will only work for ASCII strings, though, so it’s a potential performance risk. An alternative is to have such operations inflate the string from UTF8 to TwoByte, but that’s also not ideal.

* Converting SpiderMonkey’s own string algorithms to work on UTF8 would require a lot more work. This includes changing the irregexp regular expression engine we imported from V8 a few months ago (it already had code to handle Latin1 strings).

~~~
ninkendo
> Linear-time indexing

This is great for ASCII when you know there's no such thing as combining
characters/etc, but I would like to remind everyone reading this that there's
no such thing as "linear indexing" of user-perceived characters in Unicode.

User-perceived characters need processing in order to be indexed, because
grapheme clusters can potentially use many code points together.
([https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundarie...](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries))

For instance, (on my machine, a dark-skinned male teacher) is a combination of
these characters:

- U+1F468 Man
- U+1F3FE Medium-Dark Skin Tone
- U+200D Zero Width Joiner
- U+1F3EB School

And the byte index of the start/end of that character in a string cannot be
found by just multiplying an offset by some fixed number of bytes per
character.
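
To make that concrete (a Python sketch of the sequence described above):

    # Man + Fitzpatrick type-5 modifier + ZWJ + School: one user-perceived character.
    teacher = "\U0001F468\U0001F3FE\u200D\U0001F3EB"
    print(len(teacher))                       # 4 code points
    print(len(teacher.encode("utf-8")))       # 15 bytes in UTF-8
    print(len(teacher.encode("utf-16-le")))   # 14 bytes in UTF-16 (three surrogate pairs + ZWJ)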

~~~
nitrogen
Isn't ASCII indexing constant time and Unicode grapheme indexing ~linear time?

~~~
ninkendo
Correct, I seem to have quoted the wrong part of the GP comment... I meant to
address this quote:

> adding a special flag to indicate all characters in the string are ASCII, so
> that we can still use O(1) indexing in this case

But I got confused and addressed the linear indexing part.

But yeah, ASCII is constant time, but you can’t assume even UTF-32 can be
constant time due to variable length grapheme clusters.

------
flohofwoe
The most practical real-world solution across all operating systems is to use
"UTF-8 everywhere", and only convert from and to UTF-16 or UTF-32 "at the last
minute" when the string data is coming from or going into APIs that don't
(yet) accept UTF-8 directly.

Also see: [http://utf8everywhere.org/](http://utf8everywhere.org/)

~~~
iforgotpassword
I've always wondered why Windows doesn't allow setting the legacy code page to
utf8. It would've been a nice hack to get old apps to work with Unicode,
assuming they were at least somewhat MBCS aware. If the programmer assumed a
character would always be one byte it might not have worked, but then that
program would also have failed with a Japanese code page.

~~~
flohofwoe
There's a fairly new experimental setting in Win10 where UTF-8 is set as the
system-wide codepage, allowing the "narrow string" functions to be called with
UTF-8 strings.

See
[https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#U...](https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8)

"With insider build 17035 and the April 2018 update (nominal build 17134) for
Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support"
checkbox appeared for setting the locale code page to UTF-8.[a] This allows
for calling "narrow" functions, including fopen and SetWindowTextA, with UTF-8
strings."

~~~
rkagerer
From that same link:

 _Another popular work-around is to convert the name to the 8.3 filename
equivalent, this is necessary if the fopen is inside a library function that
takes a string filename and cannot be altered._

8.3 filenames for the win!

~~~
ChrisSD
Though do note this is only an issue when using particular "narrow" functions.
If you consistently use the recommended "wide" functions then none of these
issues occur.

------
WhitneyLand
One way to think about text is that it's the only data we use that shrinks,
relatively speaking, over the years. Every other type of data continues to
grow.

The way this can be stretched to make sense is by thinking, for example, about
how much each of these has grown over 20 years.

Media (music, video, images) grows staggeringly: 200p to 4K, 128 kbps MP3 to
FLAC or whatever, and there's simply more of it all now that everything has
gone digital.

Even code size continues to grow quite significantly.

But text? Even of Wikipedia's massive size, only a fraction is text content.

Then, as memory capacity also grows, it's really a back-and-forth battle
between computing resources and all other data, except text.

Hard to think of many places where 4-byte chars would matter nowadays; maybe
the tiniest devices, for a few apps?

~~~
simias
Conversely as CPUs get more and more powerful the overhead of a non-fixed-
width encoding also becomes negligible unless you're processing a huge amount
of text. And if you process enough text that it becomes a problem then maybe
size would be an issue too.

Furthermore, 4-byte chars also have the endianness problem of UTF-16.

I think the rule of thumb of "use UTF-8 unless you really have a good reason
not to" still holds true today.

~~~
WhitneyLand
I think that’s a valid and interesting point: the classic time-versus-space
trade-off from computer science needs to be considered here for large
workloads.

One thing that seems different is that you also hear concern from developers
about coding complexity and the desire for things to be as easy to reason
about as possible. Yes, these are things a library should just handle for you,
but it is sometimes useful to know what’s happening fundamentally, and it can
sometimes mean fewer surprises.

Nevertheless, if one of the cases you mentioned became significant, I’m not
sure many would say coding complexity should take precedence over your
considerations.

~~~
simias
I agree with you there; I think in the vast majority of applications these
days, when it comes to text processing, simplicity and robustness beats other
concerns.

It's unclear whether UTF-32 beats UTF-8 there, however. UTF-8's encoding is a
bit trickier to deal with, of course, but it's really not that bad. On top of
that you have a few advantages, such as not having to worry about endianness
as mentioned previously, but also not having to worry too much about alignment
or corrupted input (you can re-synchronize UTF-8 at every codepoint since the
leading byte has a unique bit pattern). And of course ASCII/C-string
compatibility means that it's unlikely to be mangled by non-Unicode-aware
applications.

In any case the complexity probably won't be in the encoding, be it UTF-8, 16,
32 or anything else. UTF-32 might be constant-width per codepoint, but in
general you don't really care about codepoints, since they can be combined in
various ways to create characters, things like country flags, or complex
characters with diacritics.

~~~
tomxor
> [...] when it comes to text processing simplicity and robustness beats other
> concerns. It's unclear if UTF-32 beats UTF-8 there however [...]

I always wondered how this affects the simplicity and efficiency of different
text editor implementation strategies. I'm mainly thinking about the variable
byte lengths and how they affect anything related to character indices. I'm
guessing that if it seriously hinders a preferred strategy, one would just
convert to plain codepoints of fixed word size internally, at the cost of
greater memory usage... which actually is not that bad if you only want to
support the BMP without any fancy encoding, which would only take up 2 bytes
per character internally.

------
Someone
_”A long time ago, at least in computer time, in the far-flung era of 1989,
the Unicode working group was really starting to get going”_

I find it strange to start in 1989, and not in 1980, when Xerox had a
precursor to Unicode, or 1988, when employees from Xerox and Apple coined the
name “Unicode”
([https://en.wikipedia.org/wiki/Unicode#History](https://en.wikipedia.org/wiki/Unicode#History))

It certainly is a bit unfair to not mention Xerox or Apple at all.

 _”that the JVM is gaining the ability to transparently represent strings in
memory as Latin-1, a legacy codepage, if possible.”_

“Compact strings”
([https://openjdk.java.net/jeps/254](https://openjdk.java.net/jeps/254))
shipped with JDK 9 in September 2017
([https://docs.oracle.com/javase/9/whatsnew/toc.htm#JSNEW-
GUID...](https://docs.oracle.com/javase/9/whatsnew/toc.htm#JSNEW-GUID-
BA9D8AF6-E706-4327-8909-F6747B8F35C5))

~~~
unascribed
This year is mentioned on the Wikipedia page and its references as the sort of
"genesis date" of all the companies' involvement and is more or less when the
UCS-2 tragedy began. (This is a post specifically about UTF-16 and UCS-2, not
the entire history of Unicode. That'd be... considerably longer.)

 _I mention in another post_ (Correction: In a draft of a post.) I'm still
stuck on JDK 8 for various reasons out of my control, and am generally unaware
of changes made in 9 or later. I'll amend the post to reflect that the feature
shipped.

------
jchw
And the tragedy of early adoption! Microsoft looks antiquated for using UTF-16
today, but it was actually pretty cutting edge when it happened. They adopted
a lot of tech pretty early too, like XML. All in all, pretty fascinating.

Windows 10 recently added the ability to use UTF-8 for the “ANSI” legacy
codepage, which is awesome! I accomplished something similar before via API
hooking, but a Microsoft solution can simply do a much better job across the
board. Now the only reason to set the encoding to anything else is to avoid
mojibake when using non-English software. (Dear Japanese software developers:
please consider using the wide Win32 APIs :( )

------
yokaze
I was under the assumption that Linux and Mac OS went with
sizeof(wchar_t)==4, ergo UCS-4, because it was clear from the beginning that
16 bits wouldn't do it; a compromise "only" Windows and Java were willing to
make. But they carried it all the way through the stack.

Linux goes for a byte-stream encoding (e.g. UTF-8) because it isn't (or
wasn't) doing any string processing, so encoding efficiency beats the
algorithmic complexity of string operations. If you need an algorithmically
more efficient encoding, that's up to your application, and UCS-4 was the way
to do it.

~~~
Someone
Certainly on Mac OS, _sizeof(wchar_t)_ had nothing to do with the way text
strings got stored. The Mac had an elaborate system for handling international
text that used either bytes or 16-bit quantities for text storage, where the
interpretation of what a byte or byte pair meant was stored separately from
those bytes.
[https://developer.apple.com/library/archive/documentation/ma...](https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf):

 _”The character codes may be 1-byte or 2-byte codes, but there is nothing
else in the text stream except for those codes. Using font information that
your program stores separately, Script Manager routines can help you determine
whether a character is 1 or 2 bytes, and other managers allow you to work with
either character size.”_

More cumbersome to use than modern Unicode API’s? Definitely, but it was
superior to what Windows provided for years.

------
unascribed
Wasn't really expecting this post to leave my circles, sure am glad now I
checked my sources.

Look, ma, I'm on Hacker News!

------
xvilka
Thanks to Microsoft, it is now entrenched in JavaScript and even the Language
Server Protocol. Eventually they should just set UTF-16 on fire and move to
UTF-8, as all the smart people already did. Microsoft has always shambled
along behind progress: no C99 support until recent times, UCS-2, MAX_PATH
(MSBuild still doesn't work with long names, in 2019!), and many other similar
problems.

~~~
IshKebab
Yes the language server protocol is particularly insane because they use
UTF-16 word positions (sort of, it's JavaScript so it's not anything standard
really), but the actual text data is sent as UTF-8! So for my sane UTF-8
speaking language server the text gets converted from UTF-8 to UTF-16 when
loaded from disk, then converted back for transmission via the LSP, then
converted back to UTF-16 by my language server so I can handle the positions,
and then it does everything in reverse again on the way back!

Absolute madness. I think they will fix it eventually though. The madness is
too great to ignore.

------
dehrmann
So the next time you're asked in an interview to implement isPalindrome, be
sure to ask if you need to support surrogate pairs and diacritics.

------
java-man
Mojibake:

[https://en.wikipedia.org/wiki/Mojibake](https://en.wikipedia.org/wiki/Mojibake)

love the word.

------
pkaye
I remember when the Linux distributions switched to unicode implementations of
tools and there was a noticeable speed reduction of things like ls.
Fortunately the processor speeds kept on improving.

------
jodrellblank
I've been wondering: how much processor overhead does UTF-8 have? Not just
because you can't jump N bytes into a string to get a character offset, but
because each byte has to be inspected to find whether it's the start of a
multi-byte sequence and, if so, how many (1-3) of the subsequent bytes belong
to it.

Does that mean there are processor branch-prediction failures and stalls every
few bytes when processing UTF-8 strings with many multi-byte characters in
them?
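
In practice the sequence length is fully determined by the leading byte, so decoders
typically use a small lookup table rather than a chain of branches. A minimal sketch
(Python, and nothing like the SIMD-style fast paths real decoders use):

    # Sequence length per leading byte: ASCII, continuation (invalid as a lead),
    # then 2-, 3- and 4-byte leads; 0xF8-0xFF are invalid in modern UTF-8.
    SEQ_LEN = [1]*128 + [0]*64 + [2]*32 + [3]*16 + [4]*8 + [0]*8
    
    def utf8_lengths(data: bytes):
        i = 0
        while i < len(data):
            n = SEQ_LEN[data[i]]
            yield i, n            # n == 0 marks an invalid lead byte
            i += n or 1
    
    print(list(utf8_lengths("héllo".encode("utf-8"))))
    # [(0, 1), (1, 2), (3, 1), (4, 1), (5, 1)]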

~~~
blattimwind
> Not just because you can't jump N bytes into a string to get a character
> offset

I keep hearing people refer to this operation (mostly in HN threads), I have
yet to see an appropriate use of it. So far the only uses I've seen were (for
Unicode) misguided attempts at character iteration, which doesn't need this at
all.

~~~
magicalhippo
At work we have to parse various different fixed-width column files. We only
need data from a handful of the columns, so we use the substring function to
extract the data we need.

Of course these files are so old they predate proper encoding and are
implicitly Windows-1252 or similar.

~~~
ben509
That's been my experience, I've done it but never anything that a scanning or
tokenizing algorithm couldn't do.

So being able to iterate over characters is important, but constant time
access to a specific character seems _very_ rare.

I think the only case where I've wanted a specific character was when I wrote
games that loaded maps from text files and those were ASCII regardless.

~~~
magicalhippo
Some of the records are over 1k chars wide, and we need stuff near the end.

Though if one had to decode the line from UTF-8, you're iterating anyway so...

~~~
ben509
I could see a case for it if you were finding line endings in bytes mode, and
then casting a buffer to a UTF-8 string. Then you'd benefit from a reverse
iterator, which UTF-8 supports just fine.

I imagine the main reason people would use indexing is to scan through strings
in lockstep, e.g. to implement comparison, but iterators can cover that case
as well, especially if you can copy them.

------
finchisko
Reading this, I'm wondering whether it's possible (and if so, how) to write
Java and JavaScript programs that use a different internal encoding, for
example UCS-4. If yes, what does the conversion (the parser? the compiler?)?

~~~
unascribed
I've personally done this in Java, as mentioned. Generally I use fastutil to
create an IntList (which is just an int[] with some conveniences) and use
Java's Character class to convert between UTF-16 and UCS-4 on-the-fly.

It's not very fast, but it makes flawless Unicode support easy. :P
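
The conversion itself is just a little surrogate arithmetic. A sketch of what Java's
Character.toCodePoint/Character.toChars do, written here in Python purely for illustration:

    def surrogates_to_code_point(hi: int, lo: int) -> int:
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    
    def code_point_to_surrogates(cp: int):
        cp -= 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)
    
    assert surrogates_to_code_point(0xD83D, 0xDE00) == 0x1F600    # U+1F600
    assert code_point_to_surrogates(0x1F600) == (0xD83D, 0xDE00)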

------
cbsmith
There's other tragedies too. SCSU lets you reasonably compactly represent
strings, but Java doesn't use it so...

------
kazinator
> _Unix and Linux conveniently sidestepped this whole issue by just shrugging
> their shoulders and going "eh, a char has no defined encoding, it's just a
> number."._

Wrong: that's a feature of C, which all these systems are based on.

wchar_t is 32 bits on Linux, and, though also just a number, it commonly
represents a Unicode code point.

~~~
dfox
I would assume that what the author meant by this is that on Unix, for the
kernel itself, the names of various entities (filenames and similar things)
are simply sequences of bytes that it does not ordinarily try to interpret in
any way. (On the other hand, SUS specifies that only a very limited set of
characters is valid in a filename, but no "true unix" actually cares about
that.)

On Windows NT a filename is specified as a sequence of 16-bit wchar_ts and is
supposed to be valid NFC UTF-16. OS X also expects filenames to be valid
sequences of Unicode codepoints, in this case NFD.
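
A small illustration of the NFC/NFD difference (using Python's unicodedata module, purely
as an example): the same visible name is four code points in NFC and five in NFD.

    import unicodedata
    
    name = "café"
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", name)])  # ends ... 0xe9
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", name)])  # ends ... 0x65, 0x301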

~~~
ChrisSD
> On Windows NT filename is specified as sequence of 16 bit wchar_ts and is
> supposed to be valid NFC UTF-16.

I would emphasise that the "supposed to be" is a convention only (similar to
the Linux UTF-8 convention). The only filesystem enforced rules are that
file/directory names can't include \/:*?"<>| or wchar_t(s) less than 0x20 (aka
ascii control characters).

A Win32 path can also have additional rules (such as a file cannot be called
"COM") but these aren't always enforced, depending on how the API is called.

------
lappet
Unrelated: I love the theme of that blog. Anyone know if it's open source?

