Fun fact: for SMS, UCS-2 is used when a message requires more than 128 character...

greggyb · on Aug 3, 2019

I was a flip-phone holdout for a long time. UCS-2 messages were not generally supported on the new devices sold up to 2015 (when I acquired my first smartphone, and my experience with good phones ended).

None of the flip phones I owned played well with non-ASCII text. Receiving a message with an emoji would make the entire message, or if I was lucky just the portion following the emoji, unrenderable. Most phones I had fell back to plain rectangles, which I assume was their version of mojibake.

This was not a causal factor in my switch to a smartphone, but it was annoying to have to explain character encodings to friends who didn't understand why they couldn't send me emoji-containing messages.

SynthCann · on Aug 3, 2019

You sound like a blast.

dang · on Aug 4, 2019

Hey, could you please review the site guidelines and stick to the spirit of this site when posting here? Normally we ban accounts that attack others like this, and I actually banned yours, but decided to reverse that because you posted a more substantive comment earlier today. Basically, we're trying for more thoughtful conversation and better signal/noise than internet default.

https://news.ycombinator.com/newsguidelines.html

You might also find these links helpful for getting the intended use of HN:

https://news.ycombinator.com/newswelcome.html

https://news.ycombinator.com/hackernews.html

http://www.paulgraham.com/trolls.html

http://www.paulgraham.com/hackernews.html

Thorrez · on Aug 3, 2019

I don't really understand how it can use UCS-2. I don't think UCS-2 can represent emoji.

devadvance · on Aug 3, 2019

This is actually mentioned in the submission, but it does so through surrogate pairs. In short: the first of two 16-bit code units (the high surrogate) carries the upper bits of the code point, including which supplementary "plane" it belongs to, and the second code unit (the low surrogate) carries the remaining bits within that plane.

What's cool is that UCS-2 and UTF-16 behave the same way in this regard, so you can see how this works in JavaScript as well. There's a great HN post about the length of an emoji being 2 in JS [1] from 2017.
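For anyone who wants to see the arithmetic, here is a rough sketch of how a surrogate pair is formed (not taken from the submission). A code point above U+FFFF has 0x10000 subtracted from it, and the remaining 20 bits are split across two 16-bit code units. The function names `to_surrogate_pair` and `utf16_length` are made up for this example, and Python is used only for illustration: its `len()` counts code points, so the code-unit count that JS `.length` reports has to be computed by hand.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


def utf16_length(text: str) -> int:
    """Count UTF-16 code units (2 bytes each), i.e. what JS `.length` counts."""
    return len(text.encode("utf-16-le")) // 2


high, low = to_surrogate_pair(ord("😀"))    # U+1F600
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(len("😀"), utf16_length("😀"))        # 1 2
```

So one emoji is one code point but two code units, which is where the "length of 2" comes from.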

Practically speaking, I had fun learning this while writing an Angular app to read SMS backups [2]. It's frustrating and cool at the same time.

[1] https://news.ycombinator.com/item?id=13830177

[2] https://github.com/devadvance/sms-backup-reader-2/blob/maste...

temac · on Aug 3, 2019

The submission actually states (correctly): "UCS-2 can only represent the first 65536 characters of Unicode, also known as the Basic Multilingual Plane."

UTF-16 can represent all Unicode code points, through surrogate pairs when needed.

Systems that were originally designed for UCS-2 have often been upgraded to UTF-16, e.g. Windows and JS. But it is then not UCS-2 anymore. In some cases the documentation, function names, etc. might incorrectly talk about UCS-2 for historical reasons (while they really are UTF-16 now), adding to the confusion.
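To make the BMP distinction concrete, here is a minimal sketch of a check for whether a string could be stored in strict UCS-2. The helper name `fits_in_ucs2` is hypothetical, and the check assumes "strict UCS-2" means every code point is in U+0000..U+FFFF and is not itself a surrogate.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every code point is a BMP code point that is not a surrogate."""
    return all(
        ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
        for ch in text
    )


print(fits_in_ucs2("héllo ☺"))    # True: U+263A is inside the BMP
print(fits_in_ucs2("hello 😀"))   # False: U+1F600 is outside the BMP

# For BMP-only text, the UTF-16 bytes are one 16-bit unit per character,
# which is the same layout UCS-2 would use.
print("é".encode("utf-16-be"))    # b'\x00\xe9'
```

Anything that fails this check needs surrogate pairs, at which point the data is UTF-16 regardless of what the API calls it.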

Thorrez · on Aug 3, 2019

If there are surrogate pairs, I thought that was a UTF-16 feature rather than a UCS-2 one. Wikipedia says UCS-2 cannot represent anything outside the BMP (and most emoji are outside it).

>UCS-2 differs from UTF-16 by being a constant length encoding[4] and only capable of encoding characters of BMP.

https://en.wikipedia.org/wiki/UTF-16

There does seem to be some disagreement on this topic though. I wonder if it could be described as WTF-16.

https://simonsapin.github.io/wtf-8/
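As a small illustration of why a looser name might be wanted: a system that treats text as a bare sequence of 16-bit units can end up holding half of a surrogate pair, which strict UTF-16 does not allow. The sketch below uses Python's codecs only as an example. The strict encoder rejects a lone surrogate, while the `surrogatepass` error handler writes it through unchanged. Whether SMS stacks actually behave this way is an assumption, not something confirmed in this thread.

```python
lone_high = "\ud83d"  # a high surrogate with no low surrogate after it

try:
    lone_high.encode("utf-16-le")
    strict_ok = True
except UnicodeEncodeError:
    # Strict UTF-16 refuses unpaired surrogates.
    strict_ok = False

# "surrogatepass" emits the 16-bit unit as-is, producing ill-formed UTF-16.
raw = lone_high.encode("utf-16-le", errors="surrogatepass")

print(strict_ok)    # False
print(raw.hex())    # 3dd8
```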