
In my opinion, UTF-8 should have been a longer variable-length encoding. Today it is:

0xxxxxxx

110xxxxx 10xxxxxx

1110xxxx 10xxxxxx 10xxxxxx

and

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The only reason not to use the remaining lead-byte patterns and add

111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

and maybe even

11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

is UTF-32; they should have dropped it and solved the code point problem this way.
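
To put numbers on the patterns above, here is a rough Python sketch that counts the payload (x) bits available at each sequence length. It assumes the lead byte contributes whatever bits follow its first 0 and each continuation byte contributes 6. The helper name payload_bits is made up for illustration, and lengths 5 to 8 are the hypothetical extension proposed here, not part of modern UTF-8.

```python
def payload_bits(total_bytes: int) -> int:
    """Count the x bits in a sequence of the given length (1 to 8 bytes).

    Lengths 1-4 match modern UTF-8; lengths 5-8 follow the hypothetical
    extension sketched in the comment above.
    """
    if not 1 <= total_bytes <= 8:
        raise ValueError("length must be between 1 and 8 bytes")
    if total_bytes == 1:
        return 7  # 0xxxxxxx
    # A lead byte for n bytes has n ones and (for n < 8) a zero, leaving
    # 7 - n payload bits; the all-ones lead byte (n == 8) carries none.
    lead_bits = max(0, 7 - total_bytes)
    return lead_bits + 6 * (total_bytes - 1)


for n in range(1, 9):
    bits = payload_bits(n)
    print(f"{n} byte(s): {bits} payload bits, {2 ** bits:,} values")
```

Under these assumptions the existing 4-byte form carries 21 bits, while the 8-byte form would carry 42.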



First I'd like to introduce you to this: http://ucsx.org/

But no, there is no particular reason to introduce a longer encoding than modern UTF-8 (which was itself shortened from the original one-to-six-byte encoding). The current code space of 1,114,112 code points is sufficient for at least the foreseeable future, because any new assignment requires demonstrable historic or current use. (Emoji are slightly different, but they still require that the underlying concept is widespread and does not significantly overlap with existing emoji. See [1].) Han characters are the largest source of new assignments to date, and they have yet to fill two of the 17 planes (which would amount to about 131K characters).
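
The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch; the variable names are mine, and the only library behaviour relied on is the built-in chr() and str.encode().

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points per plane

code_space = PLANES * CODE_POINTS_PER_PLANE  # total code points
max_code_point = code_space - 1              # highest code point, U+10FFFF
two_planes = 2 * CODE_POINTS_PER_PLANE       # the "131K" figure
four_byte_capacity = 2 ** 21                 # x bits in 11110xxx + 3 continuation bytes

print(f"code space:         {code_space:,}")
print(f"max code point:     U+{max_code_point:X}")
print(f"two full planes:    {two_planes:,}")
print(f"4-byte UTF-8 range: {four_byte_capacity:,}")

# The highest code point still fits in the 4-byte form.
print(len(chr(max_code_point).encode("utf-8")), "bytes for the last code point")
```

So the 21 payload bits of the 4-byte form already cover the whole 17-plane code space with room to spare.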

[1] https://news.ycombinator.com/item?id=26904980



