
The proposed UCS-G-8 encoding [1] does exactly that. And in case UTF-16 never dies, the website also proposes extensions to UTF-16 and UTF-32.

[1] http://ucsx.org/g8




As a lay developer, I know Unicode is what you need for international character support. So what are the options? UTF-8, UTF-16, UTF-32. I will choose UTF-32 just because 32 > 16 or 8.


I'm not sure if that's a joke (maybe that's why it got downvoted).

But the answer is that, in this specific case, 32 is not better than 16, which is not better than 8. The number is the size of a code unit in bits, so it mostly affects memory efficiency, not capability: all three encodings cover the same set of code points. There are developers who think (or thought) that UTF-32 would improve random access into strings because it's a fixed-size encoding of code points, but in fact it does not, because many user-perceived characters (grapheme clusters) require multiple code points.
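To make that concrete, here's a small Python sketch (the example strings and variable names are mine, picked only for illustration). It shows one visible character taking two fixed-width UTF-32 units, so indexing by code point still lands in the middle of a character:

```python
import unicodedata

# "é" written as two code points: 'e' followed by a combining accent.
decomposed = "e\u0301"
print(unicodedata.name(decomposed[1]))  # COMBINING ACUTE ACCENT

# UTF-32 is fixed width per code point (4 bytes each, no BOM with -le).
utf32 = decomposed.encode("utf-32-le")
code_points = len(utf32) // 4
print(code_points)  # 2 code points for one visible character

# "Random access" to unit 0 returns only half of the visible character.
first = utf32[0:4].decode("utf-32-le")
print(first)  # a bare "e"; the accent is left behind

# The Japan flag emoji is a pair of regional indicator code points.
flag = "\U0001F1EF\U0001F1F5"
print(len(flag))  # 2
```

Getting at user-perceived characters takes grapheme segmentation (Unicode UAX #29), and that is a scan no matter which UTF you pick.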

All who pass must abandon the idea of random access into Unicode strings!

Once you give up on Unicode string random access, you can focus on other reasons to like one UTF over another, and then you quickly realize that UTF-8 is the best by far.

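For example, there's a misconception that UTF-8 takes significantly (e.g., 50%) more space than UTF-16 when handling CJK, but that's not obviously true in practice. Most CJK characters take 3 bytes in UTF-8 versus 2 in UTF-16, but real documents mix in ASCII (markup, digits, spaces) that takes 1 byte in UTF-8 versus 2 in UTF-16. I've seen statistical analyses showing that UTF-8 is roughly comparable to UTF-16 in this regard.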
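A rough way to see the effect yourself, as a toy sketch rather than a statistical analysis (the sample string and the `sizes` helper are made up for illustration):

```python
def sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Eight BMP characters (kanji and kana): 3 bytes each in UTF-8, 2 in UTF-16.
cjk = "日本語のテキスト"
cjk_sizes = sizes(cjk)
print(cjk_sizes)  # (24, 16): the 50% worst case for pure CJK

# The same text wrapped in a little ASCII markup.
html = '<p class="note">' + cjk + "</p>"
html_sizes = sizes(html)
print(html_sizes)  # (44, 56): UTF-8 is now the smaller one
```

How it nets out on real data depends on how much ASCII is mixed in, so measure your own corpus before assuming either way.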

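Ultimately, UTF-8 is much easier to retrofit into older systems that use ASCII, and much easier to use in libraries and programs that were designed for ASCII. It also doesn't have the fundamental 21-bit codespace limit that UTF-16 has: the original UTF-8 design could encode values up to 31 bits, and it is restricted to U+10FFFF today only to stay compatible with UTF-16.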
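Here's a quick sketch of why the ASCII retrofit works (the path is a made-up example). In UTF-8, bytes below 0x80 only ever mean ASCII, so byte-oriented code that splits on `/` or stops at NUL keeps working. UTF-16 embeds zero bytes in plain ASCII text:

```python
path = "/home/ユーザー/file.txt"

# Byte-oriented code written for ASCII can split UTF-8 on "/" safely,
# because no multibyte sequence ever contains a byte below 0x80.
raw = path.encode("utf-8")
parts = [p.decode("utf-8") for p in raw.split(b"/")]
print(parts == path.split("/"))  # True

# Every byte of the non-ASCII characters has the high bit set.
non_ascii_ok = all(b >= 0x80 for b in "ユーザー".encode("utf-8"))
print(non_ascii_ok)  # True

# UTF-8 never produces a NUL byte unless the text contains U+0000,
# so C-style NUL-terminated strings survive. UTF-16 does not have this.
utf8_has_nul = b"\x00" in raw
utf16_has_nul = b"\x00" in path.encode("utf-16-le")
print(utf8_has_nul, utf16_has_nul)  # False True
```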



