Yes, the ultra-compact encoding will not be self-sequencing. But the bit-6-conti...

marcosdumay · on Oct 10, 2019

> huge data wastage when certain languages are encoded

Is there a language that consistently uses codepoints with more than 2 bytes?

It bothers me a bit that UTF-8 is not an infinitely extendable encoding. But that isn't an important objection either: the code space is finite, but huge.

XaspR8d · on Oct 10, 2019

> Is there a language that consistently uses codepoints with more than 2 bytes?

There are definitely (small) communities using scripts that lie entirely in the SMP. For example, Mru, Adlam, Takri, Pracalit, Miao, Wancho, etc. Most of these are either historic scripts that have mostly been supplanted by unified ones (esp. Devanagari) but retain usage in some areas, or new scripts for languages that had no pre-colonial writing system and whose speakers are attempting to reclaim cultural identity.

But yes, I don't think there are major communities that consistently do so. My anecdata from a few Mandarin- and Japanese-speaking friends is that SIP characters rarely occur.

Really, if anything, emoji obsessives, mathematicians using bold/fraktur characters, and historical linguists/anthropologists would have the biggest savings.

https://en.wikipedia.org/wiki/Mru_language#Alphabet
https://en.wikipedia.org/wiki/Adlam_script
https://en.wikipedia.org/wiki/Takri_script
https://en.wikipedia.org/wiki/Pracalit_script
https://en.wikipedia.org/wiki/Pollard_script
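
A quick way to check the byte counts is to encode one character from each plane and measure the result. This is only a minimal Python sketch; the sample characters and the helper name `utf8_len` are my own picks for illustration, not something from the thread.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))

# Illustrative samples: two BMP characters, one SMP, one SIP.
samples = {
    "Latin A (BMP, U+0041)": "\u0041",
    "Devanagari KA (BMP, U+0915)": "\u0915",
    "Adlam ALIF (SMP, U+1E900)": "\U0001E900",
    "CJK Ext. B (SIP, U+20000)": "\U00020000",
}

for name, ch in samples.items():
    print(f"{name}: {utf8_len(ch)} bytes")
```

Anything outside the BMP (so every SMP and SIP character) takes 4 bytes in UTF-8, while Devanagari sits in the 3-byte range.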

kstenerud · on Oct 10, 2019

Yeah, it's fine that the encoding is not infinitely extendable. But if it didn't encode the length into the first byte, you'd have 1 extra bit for 2-byte sequences, 2 extra bits for 3-byte sequences, etc. That doubles the number of possible glyphs per 2-byte sequence (an extra 2048 glyphs in the 2-byte range). For 3 bytes, the 2 extra bits give almost 200,000 additional glyphs before having to jump to 4 bytes. This would be a huge boon to East Asian languages like Chinese and Japanese.
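
The arithmetic behind those numbers can be sketched as follows. The payload sizes are standard UTF-8 (11 bits in a 2-byte sequence, 16 bits in a 3-byte one); the extra bits are taken as an assumption straight from the comment above, and the names `UTF8_PAYLOAD_BITS`, `EXTRA_BITS`, and `extra_values` are made up for this illustration.

```python
# Payload bits per sequence length in standard UTF-8:
#   2 bytes: 110xxxxx 10xxxxxx          -> 5 + 6     = 11 bits
#   3 bytes: 1110xxxx 10xxxxxx 10xxxxxx -> 4 + 6 + 6 = 16 bits
UTF8_PAYLOAD_BITS = {2: 11, 3: 16}

# Assumption from the comment: not encoding the length in the first byte
# frees 1 extra bit in 2-byte sequences and 2 extra bits in 3-byte ones.
EXTRA_BITS = {2: 1, 3: 2}

def extra_values(n_bytes: int) -> int:
    """Additional bit patterns gained for sequences of n_bytes bytes."""
    base = UTF8_PAYLOAD_BITS[n_bytes]
    return 2 ** (base + EXTRA_BITS[n_bytes]) - 2 ** base

for n in (2, 3):
    print(f"{n}-byte sequences: +{extra_values(n)} values")
```

This counts raw bit patterns (2048 and 196,608 respectively), so it ignores overlong forms and surrogates, which would shave a little off the usable total.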

boomlinde · on Oct 10, 2019

> Is there a language that consistently uses codepoints with more than 2 bytes?

Japanese: kanji, hiragana, katakana, etc. all take at least 3 bytes in UTF-8, with some 4-byte sequences. This compares unfavorably to, for example, Shift-JIS, which encodes most of them in 2 bytes.
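
A rough way to see the difference is to encode the same text both ways with Python's built-in codecs. This is just a sketch: the sample strings and the helper name `byte_sizes` are mine, chosen for illustration.

```python
def byte_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, Shift-JIS size) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("shift_jis"))

# One kanji, one hiragana, one katakana, and a short mixed phrase.
samples = ["漢", "あ", "ア", "日本語のテキスト"]

for text in samples:
    utf8, sjis = byte_sizes(text)
    print(f"{text}: UTF-8 {utf8} bytes, Shift-JIS {sjis} bytes")

# A kanji outside the BMP (U+20BB7) needs 4 bytes in UTF-8.
# It is not encoded with shift_jis here because that codec cannot represent it.
rare_kanji_len = len("\U00020BB7".encode("utf-8"))
print(f"U+20BB7: UTF-8 {rare_kanji_len} bytes")
```

For text that is mostly kana and common kanji, this works out to roughly 3 bytes per character in UTF-8 against 2 in Shift-JIS.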