
Yes, the ultra-compact encoding will not be self-synchronizing. But the bit-6-continuation variant yields more payload bits per byte, which would give better compression in many languages. Regarding efficiency, you still have to read every byte either way. The difference is one check per sequence (every 1, 2, 3, or 4 bytes) in a more complex algorithm versus a check on every byte in a simpler algorithm. I haven't measured which performs better, but they look pretty similar.

This feels a lot like mixing transport layer metadata into the data format, potentially giving a small processing performance benefit at the cost of huge data wastage when certain languages are encoded.
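To make the "one check per sequence vs. one check per byte" comparison concrete, here is a minimal Python sketch. It is an illustration only: both function names are made up for this example, and the second function assumes a scheme in which continuation bytes are flagged as `10xxxxxx` (as they are in UTF-8) while the lead byte carries no length. Neither is tuned for speed, so it says nothing about which approach is faster in practice.

```python
def starts_by_length_prefix(data: bytes) -> list:
    """Find sequence starts by reading the length from each lead byte.

    One branch per sequence: the lead byte says how far to skip.
    """
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        b = data[i]
        if b < 0x80:                # 0xxxxxxx: 1-byte sequence
            i += 1
        elif b >> 5 == 0b110:       # 110xxxxx: 2-byte sequence
            i += 2
        elif b >> 4 == 0b1110:      # 1110xxxx: 3-byte sequence
            i += 3
        elif b >> 3 == 0b11110:     # 11110xxx: 4-byte sequence
            i += 4
        else:
            raise ValueError(f"unexpected byte 0x{b:02x} at offset {i}")
    return starts


def starts_by_continuation_flag(data: bytes) -> list:
    """Find sequence starts by testing every byte.

    Assumed scheme: any byte that is not 10xxxxxx begins a new sequence.
    """
    return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]


# 'a' (1 byte), U+00E9 (2 bytes), U+65E5 (3 bytes), U+1F600 (4 bytes)
sample = "a\u00e9\u65e5\U0001F600".encode("utf-8")
print(starts_by_length_prefix(sample))      # [0, 1, 3, 6]
print(starts_by_continuation_flag(sample))  # [0, 1, 3, 6]
```

On valid UTF-8 both functions agree, because UTF-8 carries both signals at once: the length in the lead byte and the `10` flag on every continuation byte.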




> huge data wastage when certain languages are encoded

Is there a language that consistently uses codepoints with more than 2 bytes?

It bothers me a bit that UTF-8 is not an infinitely extendable encoding. But that isn't an important objection either: the code space is finite, but huge.


> Is there a language that consistently uses codepoints with more than 2 bytes?

There are definitely (small) communities using scripts that lie entirely in the SMP (Supplementary Multilingual Plane), where every code point takes 4 bytes in UTF-8. For example, Mru, Adlam, Takri, Pracalit, Miao, Wancho, etc. Most of these are either historic scripts that have mostly been supplanted by unified ones (esp. Devanagari) but retain usage in some areas, or new scripts for languages that had no pre-colonial writing system and whose speakers are attempting to reclaim cultural identity.

But yes, I don't think there are major communities that consistently do so. My anecdata from a few Mandarin- and Japanese-speaking friends is that SIP characters rarely occur.

Really if anything, emoji obsessives, mathematicians using bold/fraktur characters, and historical linguists/anthropologists would have the biggest savings.

https://en.wikipedia.org/wiki/Mru_language#Alphabet https://en.wikipedia.org/wiki/Adlam_script https://en.wikipedia.org/wiki/Takri_script https://en.wikipedia.org/wiki/Pracalit_script https://en.wikipedia.org/wiki/Pollard_script
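A quick way to check which plane a code point lives in and how many bytes it costs in UTF-8 is sketched below. The helper name `plane_and_utf8_len` is made up for this example, and the sample code points are my own picks (the first code point of the Adlam, Takri, and Miao blocks, one Devanagari letter, and the first SIP code point), not taken from the comment above.

```python
def plane_and_utf8_len(cp: int) -> tuple:
    """Return (Unicode plane number, UTF-8 byte length) for a code point."""
    plane = cp >> 16  # plane 0 = BMP, plane 1 = SMP, plane 2 = SIP
    return plane, len(chr(cp).encode("utf-8"))


samples = {
    "Devanagari letter A (U+0905)": 0x0905,
    "Adlam block start (U+1E900)": 0x1E900,
    "Takri block start (U+11680)": 0x11680,
    "Miao block start (U+16F00)": 0x16F00,
    "SIP start (U+20000)": 0x20000,
}

for name, cp in samples.items():
    plane, nbytes = plane_and_utf8_len(cp)
    print(f"{name}: plane {plane}, {nbytes} bytes in UTF-8")
```

Anything outside plane 0 needs 4 bytes in UTF-8, while Devanagari, which sits in the BMP, needs 3.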


Yeah, it's fine that the encoding is not infinitely extendable. But if it didn't encode the length into the first byte, you'd have 1 extra bit for 2-byte sequences, 2 extra bits for 3-byte sequences, etc. That means you could double the number of possible glyphs per 2-byte sequence (12 payload bits instead of 11, for an extra 2048 glyphs in the 2-byte range). For 3 bytes, the 2 extra bits (18 payload bits instead of 16) give almost 200,000 additional glyphs before having to jump to 4 bytes. This would be a huge boon to East Asian languages like Chinese and Japanese.
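The arithmetic can be checked with a short Python sketch. The function names are made up for this example, and `flat_payload_bits` models only my reading of the proposal: every byte of a multi-byte sequence spends 2 bits on a lead/continuation marker and keeps 6 bits of payload, with no length prefix in the lead byte.

```python
def utf8_payload_bits(n_bytes: int) -> int:
    """Payload bits in an n-byte UTF-8 sequence (n = 1..4)."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n one-bits plus a zero bit form the length prefix,
    # leaving 8 - (n + 1) payload bits. Each continuation byte holds 6.
    return (8 - (n_bytes + 1)) + 6 * (n_bytes - 1)


def flat_payload_bits(n_bytes: int) -> int:
    """Payload bits in the assumed no-length-prefix scheme."""
    if n_bytes == 1:
        return 7
    return 6 * n_bytes  # 2 marker bits + 6 payload bits in every byte


for n in (2, 3, 4):
    u, f = utf8_payload_bits(n), flat_payload_bits(n)
    print(f"{n} bytes: UTF-8 {u} bits, flat {f} bits, "
          f"extra values {2**f - 2**u}")
# 2 bytes: UTF-8 11 bits, flat 12 bits, extra values 2048
# 3 bytes: UTF-8 16 bits, flat 18 bits, extra values 196608
# 4 bytes: UTF-8 21 bits, flat 24 bits, extra values 14680064
```

These are raw bit-pattern counts. They ignore overlong forms and the surrogate range, so the number of usable code points per length is somewhat lower in both schemes.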


> Is there a language that consistently uses codepoints with more than 2 bytes?

Japanese: kanji, hiragana, katakana, etc. all take at least 3 bytes each in UTF-8, with some 4-byte sequences. This compares unfavorably to, for example, Shift-JIS, which encodes these characters in 2 bytes each.
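The size difference is easy to measure with Python's built-in codecs. This is a small sketch; the sample string (three kanji, one hiragana, four katakana) and the helper name `encoded_sizes` are my own choices for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text under UTF-8 and Shift-JIS."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "shift_jis": len(text.encode("shift_jis")),
    }


# U+65E5 U+672C U+8A9E (kanji), U+306E (hiragana),
# U+30C6 U+30AD U+30B9 U+30C8 (katakana)
text = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"
print(len(text), encoded_sizes(text))
# 8 {'utf-8': 24, 'shift_jis': 16}

# A kanji outside the BMP (U+20BB7) takes 4 bytes in UTF-8.
print(len("\U00020BB7".encode("utf-8")))  # 4
```

For pure Japanese text like this, UTF-8 comes out 50% larger than Shift-JIS. Real documents that mix in ASCII markup, digits, or spaces narrow the gap, since those stay at 1 byte in both encodings.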



