
Decoding UTF-8 with Parser Combinators - jsnell
https://ianthehenry.com/2015/1/17/decoding-utf-8/
======
lelf
> _It turns out this limitation doesn’t matter in practice: 0x10FFFF
> (1,114,111 in decimal) is almost ten times as many representable codepoints
> as there are codepoints currently defined in the latest Unicode spec, so
> it’s not like we’re going to run out of bits any time soon._

Famous last words.

~~~
TazeTSchnitzel
My personal prediction is that CJK will use up all the representable
codepoints. Why? Well, a Chinese Hanzi character essentially corresponds to
one Mandarin morpheme. And there are a lot of them. You've basically chucked
all the short Mandarin words in the dictionary into Unicode. And their
traditional forms. Oh, and the Japanese kanji. And the Korean hanja. And the
other Chinese-derived systems. Oh, and every single variant throughout
history. And their future forms.

32 bits will not be enough room to encode entire vocabularies of languages
with ideographic writing systems, let alone 21 bits. We can cope with
syllabaries and alphabets. But CJK will kill Unicode.

------
dsymonds
It's a shame the author mixes up UTF-16 and UCS-2 when he says "Because it
turns out UTF-16, another very popular encoding scheme, can’t represent
codepoints that big." in reference to 0x1FFFFF.

~~~
userbinator
It really can't. UTF-16 extends the space by 20 bits (10 bits carried in each
half of a surrogate pair), and the highest surrogate pair, DBFF DFFF, decodes
to 0x10FFFF.
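The surrogate-pair arithmetic above can be sketched directly (this is the standard UTF-16 decoding formula, shown here as a standalone helper, not code from the article):

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a codepoint."""
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    # Each surrogate contributes 10 payload bits; the 0x10000 offset
    # skips the Basic Multilingual Plane, which needs no surrogates.
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00))

# The highest possible pair, DBFF DFFF, yields exactly U+10FFFF:
print(hex(decode_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff
```

So 20 payload bits plus the 0x10000 offset tops out at 0x10FFFF, which is why that value is Unicode's ceiling.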

~~~
vorg
You can define a second-tier surrogate system in UTF-16 on the last 2 planes
(plane 0xF and plane 0x10), which are both private use, to access all
codepoints up to 0x7FFFFFFF. Use plane 0xF as leading 2nd-tier surrogates, and
plane 0x10 as trailing 2nd-tier surrogates. I've written a package in Go that
does just that:
[https://github.com/gavingroovygrover/utf88](https://github.com/gavingroovygrover/utf88)
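One plausible layout for such a second-tier scheme: each of planes 0xF and 0x10 holds 0x10000 codepoints, i.e. 16 bits of payload apiece, so a lead/trail pair can carry enough bits for 0x7FFFFFFF. The exact bit layout below is my own guess at how this could work, not necessarily what the utf88 package does:

```python
LEAD_BASE = 0xF0000    # start of plane 0xF (private use): leading half
TRAIL_BASE = 0x100000  # start of plane 0x10 (private use): trailing half

def encode_tier2(cp: int) -> tuple[int, int]:
    """Split a codepoint up to 0x7FFFFFFF into a lead/trail pair of
    ordinary (private-use) Unicode codepoints."""
    assert 0 <= cp <= 0x7FFFFFFF
    return LEAD_BASE + (cp >> 16), TRAIL_BASE + (cp & 0xFFFF)

def decode_tier2(lead: int, trail: int) -> int:
    """Recombine a second-tier pair into the original value."""
    assert LEAD_BASE <= lead <= 0xFFFFF, "lead not in plane 0xF"
    assert TRAIL_BASE <= trail <= 0x10FFFF, "trail not in plane 0x10"
    return ((lead - LEAD_BASE) << 16) | (trail - TRAIL_BASE)

lead, trail = encode_tier2(0x7FFFFFFF)
print(hex(lead), hex(trail))              # 0xf7fff 0x10ffff
print(hex(decode_tier2(lead, trail)))     # 0x7fffffff
```

Note that both halves of the pair are themselves supplementary-plane codepoints, so in actual UTF-16 each would in turn be written as an ordinary surrogate pair: four 16-bit units per extended codepoint.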

