Fun fact: for SMS, UCS-2 is used when a message needs characters beyond the basic 128-character alphabet to be rendered.[1]
If you've ever noticed that entering an emoji or other non-ASCII [2] character seems to dramatically increase the size of your SMS, that's because the message switches from 7-bit characters to 2-byte UCS-2 characters.
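To make the size jump concrete, here's a rough sketch (TypeScript, since JS comes up later in the thread) of the usual segment arithmetic: 160 characters per single SMS and 153 per concatenated segment for the 7-bit encoding, versus 70 and 67 for UCS-2. The fitsInGsm7 check is an assumption for illustration only — it treats the alphabet as printable ASCII, which is not the real GSM 03.38 table.

    // Rough SMS segment math. The GSM-7 membership test is a simplification
    // (assumed to be printable ASCII plus newlines), not the real GSM 03.38 alphabet.
    function fitsInGsm7(message: string): boolean {
      return /^[\x20-\x7E\r\n]*$/.test(message);
    }

    function smsSegments(message: string): { encoding: string; segments: number } {
      const len = message.length; // UTF-16 code units, so an emoji counts as 2
      if (fitsInGsm7(message)) {
        // GSM-7: 160 characters in a single SMS, 153 per segment when concatenated.
        return { encoding: "GSM-7", segments: len <= 160 ? 1 : Math.ceil(len / 153) };
      }
      // UCS-2: 70 16-bit units in a single SMS, 67 per concatenated segment.
      return { encoding: "UCS-2", segments: len <= 70 ? 1 : Math.ceil(len / 67) };
    }

    console.log(smsSegments("a".repeat(200))); // { encoding: "GSM-7", segments: 2 }
    console.log(smsSegments("hello 😀"));      // { encoding: "UCS-2", segments: 1 }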
I was a flip-phone holdout for a long time, and UCS-2 text was not generally supported in the new flip phones still being sold up to 2015 (when I acquired my first smartphone and my experience with good phones ended).
None of the flip phones I owned played well with non-ASCII text. Receiving a message with an emoji would make the entire message unrenderable, or, if I was lucky, just the portion following the emoji. Most of my phones fell back to rendering rectangles, which I assume was their version of mojibake.
This wasn't a causal factor in my eventual switch to a smartphone, but it was annoying to have to explain character encodings to friends who didn't understand why they couldn't send me emoji-containing messages.
This is actually mentioned in the submission: characters outside the BMP are handled through surrogate pairs. In short, the emoji is encoded as two 16-bit code units — a high surrogate and a low surrogate — that each carry 10 bits of the code point and together select a code point in one of the supplementary planes.
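For anyone curious about the actual arithmetic, here's a small sketch (the helper name is just for illustration) that builds the surrogate pair for a code point beyond the BMP:

    // Build the UTF-16 surrogate pair for a code point above U+FFFF.
    function toSurrogatePair(codePoint: number): [number, number] {
      const offset = codePoint - 0x10000;    // 20-bit value
      const high = 0xd800 + (offset >> 10);  // top 10 bits -> high surrogate
      const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits -> low surrogate
      return [high, low];
    }

    const [hi, lo] = toSurrogatePair(0x1f600);     // U+1F600 😀
    console.log(hi.toString(16), lo.toString(16)); // d83d de00
    console.log(String.fromCharCode(hi, lo));      // 😀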
What's cool is that UCS-2 and UTF-16 behave the same way in this regard, so you can see how this works in JavaScript as well. There's a great HN post about the length of an emoji being 2 in JS [1] from 2017.
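You can see both views of the same string directly in a JS/TS console:

    const grin = "😀";
    console.log(grin.length);                       // 2 -- two UTF-16 code units
    console.log(grin.charCodeAt(0).toString(16));   // d83d (high surrogate)
    console.log(grin.charCodeAt(1).toString(16));   // de00 (low surrogate)
    console.log(grin.codePointAt(0)!.toString(16)); // 1f600 (the actual code point)
    console.log([...grin].length);                  // 1 -- iteration is code-point aware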
Practically speaking, I had fun learning this while writing an Angular app to read SMS backups [2]. It's frustrating and cool at the same time.
The submission actually states (correctly): "UCS-2 can only represent the first 65536 characters of Unicode, also known as the Basic Multilingual Plane."
UTF-16 can represent all Unicode code points, through surrogate pairs when needed.
Systems that were originally designed for UCS-2 have often been upgraded to UTF-16 (e.g., Windows, JS), but they are then no longer UCS-2. In some cases the documentation and/or function names still incorrectly say UCS-2 (while they are really UTF-16 now) for historical reasons, adding to the confusion.
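A minimal sketch of what that upgrade means in practice: a UCS-2 view counts every 16-bit unit as a character, while a UTF-16-aware count pairs up surrogates (the function names here are hypothetical, just for illustration).

    // UCS-2 view: every 16-bit code unit is one "character".
    function countUcs2Units(s: string): number {
      return s.length;
    }

    // UTF-16 view: a high surrogate followed by a low surrogate is one code point.
    function countUtf16CodePoints(s: string): number {
      let count = 0;
      for (let i = 0; i < s.length; i++) {
        const unit = s.charCodeAt(i);
        if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
          const next = s.charCodeAt(i + 1);
          if (next >= 0xdc00 && next <= 0xdfff) i++; // skip the low surrogate
        }
        count++;
      }
      return count;
    }

    const text = "hi 😀";
    console.log(countUcs2Units(text));       // 5
    console.log(countUtf16CodePoints(text)); // 4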
If there are surrogate pairs, I thought that made it UTF-16 rather than UCS-2. Wikipedia says UCS-2 cannot represent anything outside the BMP, and emoji fall outside it.
>UCS-2 differs from UTF-16 by being a constant length encoding and only capable of encoding characters of the BMP.
[1] https://www.twilio.com/docs/glossary/what-is-ucs-2-character...
[2] It's actually GSM-7, not ASCII, but the principle is the same: https://www.twilio.com/docs/glossary/what-is-gsm-7-character...