
Fun fact: for SMS, UCS-2 is used when a message contains characters outside the default 128-character alphabet.[1]

If you've ever noticed that entering an emoji or other non-ASCII [2] character seems to dramatically increase the size of your SMS, that's because the whole message switches from 7-bit characters to 2-byte characters.

[1] https://www.twilio.com/docs/glossary/what-is-ucs-2-character...

[2] It's actually GSM-7, not ASCII, but the principle is the same: https://www.twilio.com/docs/glossary/what-is-gsm-7-character...
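
To put rough numbers on that jump, here is a minimal sketch of the arithmetic. It assumes the usual 140-octet payload for a single SMS and ignores the header overhead of multi-part messages; the names are made up for illustration.

```python
# A single SMS carries 140 octets of user data (assumption: no
# concatenation header, which would reduce the space further).
SMS_PAYLOAD_BITS = 140 * 8


def chars_per_segment(bits_per_char: int) -> int:
    """How many fixed-width characters fit in one SMS payload."""
    return SMS_PAYLOAD_BITS // bits_per_char


gsm7_limit = chars_per_segment(7)    # GSM-7: 7 bits per character
ucs2_limit = chars_per_segment(16)   # UCS-2: 16 bits per character

print(gsm7_limit, ucs2_limit)
```

So a single emoji drops the per-message budget from 160 characters to 70, which is why the counter in a messaging app suddenly jumps.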



I was a flip-phone holdout for a long time. This was not generally supported in new devices sold up to 2015 (when I acquired my first smartphone, and my experience with good phones ended).

None of the flip phones I owned played well with non-ASCII text. Receiving a message with an emoji would render the entire message, or if I was lucky just the portion following the emoji, unreadable. Most phones I had fell back to plain rectangles, which I assume is their version of mojibake.

This was not a causal factor in my switch to a smartphone. It was annoying to have to explain character encodings to friends who didn't understand why they couldn't send me emoji-containing messages.


You sound like a blast.


Hey, could you please review the site guidelines and stick to the spirit of this site when posting here? Normally we ban accounts that attack others like this, and I actually banned yours, but decided to reverse that because you posted a more substantive comment earlier today. Basically, we're trying for more thoughtful conversation and better signal/noise than internet default.

https://news.ycombinator.com/newsguidelines.html

You might also find these links helpful for getting the intended use of HN:

https://news.ycombinator.com/newswelcome.html

https://news.ycombinator.com/hackernews.html

http://www.paulgraham.com/trolls.html

http://www.paulgraham.com/hackernews.html


I don't really understand how it can use UCS-2. I don't think UCS-2 can represent emoji.


This is actually mentioned in the submission: it does so through surrogate pairs. In short: an emoji is sent as two 16-bit code units (not two bytes). Roughly speaking, the first (the high surrogate) points into a different "plane" of code points, and the second (the low surrogate) picks out a code point within it.

What's cool is that UCS-2 and UTF-16 behave the same way in this regard, so you can see how this works in JavaScript as well. There's a great HN post about the length of an emoji being 2 in JS [1] from 2017.

Practically speaking, I had fun learning this while writing an Angular app to read SMS backups [2]. It's frustrating and cool at the same time.

[1] https://news.ycombinator.com/item?id=13830177

[2] https://github.com/devadvance/sms-backup-reader-2/blob/maste...
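
For anyone who wants to see the pair directly, here is a small sketch in Python rather than JS. It encodes one emoji as UTF-16 and splits the bytes into 16-bit code units; the variable names are just for illustration.

```python
emoji = "\U0001F600"  # the grinning face emoji, code point U+1F600

# Encode as big-endian UTF-16 (no byte-order mark), then split the
# resulting bytes into 16-bit code units.
utf16_bytes = emoji.encode("utf-16-be")
code_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]

print(len(emoji))                     # Python counts code points
print([hex(u) for u in code_units])   # high surrogate, low surrogate
```

Python reports a length of 1 because it counts code points, while the two code units are what JavaScript's `.length` counts, hence the 2.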


The submission actually states (correctly): "UCS-2 can only represent the first 65536 characters of Unicode, also known as the Basic Multilingual Plane."

UTF-16 can represent all Unicode code points, through surrogate pairs when needed.

Systems that were originally designed for UCS-2 have often been upgraded to UTF-16 (e.g. Windows, JS), but then it is not UCS-2 anymore. In some cases the documentation, function names, etc. might still incorrectly say UCS-2 for historical reasons (while they really are UTF-16 now), adding to the confusion.
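
The difference is easy to see in the arithmetic. Below is a minimal sketch of how UTF-16 maps a code point above the BMP onto a surrogate pair and back; the function names are hypothetical, not from any library. A strict UCS-2 system has no such step, so anything above U+FFFF is simply out of reach.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)   # U+1F600, grinning face
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```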


If there are surrogate pairs, I thought that was more a UTF-16 thing than UCS-2. Wikipedia says UCS-2 cannot go outside the BMP range (and emoji are mostly outside it).

>UCS-2 differs from UTF-16 by being a constant length encoding[4] and only capable of encoding characters of BMP.

https://en.wikipedia.org/wiki/UTF-16

There does seem to be some disagreement on this topic though. I wonder if it could be described as WTF-16.

https://simonsapin.github.io/wtf-8/



