> Unicode is such a complicated system that I read you even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (in complexity, I mean; I guess they have their reasons).
Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
It's because Unicode doesn't allow for language switching.
It takes up to eight bytes per character in Unicode if you want to support both Chinese and Japanese in a single font using IVS (and I don't think any font actually supports this).
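A quick sketch of where that "up to eight bytes" comes from: an Ideographic Variation Selector lives in Plane 14 (4 bytes in UTF-8), and if the base character is itself outside the BMP (e.g. CJK Extension B), it is also 4 bytes. The specific codepoints below are just illustrative examples, not from the original comment.

```python
# A CJK base character followed by an Ideographic Variation Selector (IVS).
base = "\u845B"       # 葛, a BMP Han character: 3 bytes in UTF-8
ext_b = "\U00020000"  # a CJK Extension B character (Plane 2): 4 bytes in UTF-8
ivs = "\U000E0100"    # VARIATION SELECTOR-17 (Plane 14): 4 bytes in UTF-8

print(len((base + ivs).encode("utf-8")))   # 3 + 4 = 7 bytes
print(len((ext_b + ivs).encode("utf-8")))  # 4 + 4 = 8 bytes
```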
AFAICS (as far as I can search), the Simplified (PRC) and Traditional (Taiwan) Chinese encodings are called GB2312 and Big5 respectively, and both are two-byte encodings with good practical coverage. The same applies to Japanese Shift_JIS. If e.g. :flag_cc: were allowed as a start-of-language marker, one could theoretically cut that back down to two bytes per character without losing much, while actually improving language support.
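The two-bytes-per-character claim for those legacy encodings is easy to check, since Python ships codecs for all three (a small sketch; the character chosen is just a common example):

```python
# One common Han character encoded under the legacy two-byte encodings vs UTF-8.
ch = "中"
print(len(ch.encode("gb2312")))     # 2 bytes (Simplified Chinese, PRC)
print(len(ch.encode("big5")))       # 2 bytes (Traditional Chinese, Taiwan)
print(len(ch.encode("shift_jis")))  # 2 bytes (Japanese)
print(len(ch.encode("utf-8")))      # 3 bytes in UTF-8
```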
The number of characters is not the problem, the mess due to legacy compatibility is - case folding and normalization could be much simpler if the codepoints were laid out with that in mind. Also the fact that Unicode can't make up its mind whether it wants to encode glyphs (Turkish I and i, Han unification), semantic characters (e.g. Cyrillic vs. Latin letters), or just "ideas" (emojis).
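To make the normalization and case-folding pain concrete, here's a small sketch using Python's standard `unicodedata` module:

```python
import unicodedata

# The same on-screen "é" has two valid codepoint sequences that Unicode
# treats as canonically equivalent - hence the need for normalization forms.
precomposed = "\u00E9"   # é as a single codepoint (NFC)
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (NFD)
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Case folding is locale-dependent: plain .lower() is wrong for Turkish,
# where dotless ı is the lowercase of I.
print("I".lower())  # 'i' - correct for English, wrong for Turkish
```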
I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I had never seen a single character requiring four bytes (and I do know about emoji chaining for skin color, etc.).
So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)
Lots of emoji are outside the Basic Multilingual Plane and need 4 bytes in UTF-8 and UTF-16. That's without going into skin color and other modifiers and combinations.
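A sketch showing both points, using U+1F600 as an example of a non-BMP emoji: it takes 4 bytes in UTF-8 and a surrogate pair (two 16-bit code units, also 4 bytes) in UTF-16, and a skin-tone variant stacks a second non-BMP codepoint on top:

```python
# U+1F600 GRINNING FACE is in Plane 1, outside the BMP.
emoji = "\U0001F600"  # 😀
print(len(emoji.encode("utf-8")))       # 4 bytes
print(len(emoji.encode("utf-16-be")))   # 4 bytes (two 16-bit code units)
print(emoji.encode("utf-16-be").hex())  # 'd83dde00' - the surrogate pair

# A skin-tone modifier adds another non-BMP codepoint:
thumbs_up = "\U0001F44D\U0001F3FD"  # 👍 + medium skin tone modifier
print(len(thumbs_up))                  # 2 codepoints
print(len(thumbs_up.encode("utf-8")))  # 8 bytes for one visible glyph
```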