Hacker News

And in 20-30 years we'll likely be saying the same about UTF-8.

I figured that "ANSI" would give away that I wasn't being serious since it's not actually an encoding.




> And in 20-30 years we'll likely be saying the same about UTF-8.

Well... If we will, why not? But the thing is that in 20-30 years we won't be able to invent any new writing systems that UTF-8 won't cover. Single-byte encodings were doomed because of their single-byteness. The same awaits fixed two-byte encodings like UCS-2 - we already have code points beyond U+FFFF for something that glamour hipsters call "emoji". A variable-length encoding will never become obsolete.
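The variable-length point is easy to see in practice: UTF-8 spends one byte on ASCII and stretches to four bytes for code points beyond the BMP, so the same encoding covers both ends. A minimal Python illustration:

```python
# UTF-8's variable-length design: one byte for ASCII,
# up to four bytes for the highest code points (e.g. emoji).
for ch in ["A", "ж", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+0041 -> 1 byte(s): 41
# U+0436 -> 2 byte(s): d0 b6
# U+20AC -> 3 byte(s): e2 82 ac
# U+1F600 -> 4 byte(s): f0 9f 98 80
```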


> But the thing is that in 20-30 years we won't be able to invent any new writing systems that UTF-8 won't cover.

I think you underestimate humanity's aptitude for creating things that don't fit into well-defined standards.

My (admittedly poorly stated) point wasn't that we shouldn't be moving everything over to UTF-8. I personally use it wherever possible just because it makes life easier. My point was that there are decades of things that use US-ASCII or another one of the overlapping but incompatible encodings because they were the RightThing™ to use at the time, and there's no way we're going to get rid of everything non-UTF-8 any time soon.

In 20-30 years we'll be saying "Why isn't everything in FutureText-64, it should be the only encoding. Why does anything else even exist?", and it'll be because we're saying the same about UTF-8 now.


I think you miss the point. When CP1251, KOI8-R and other crazy incompatible things came around, they came around because there was a need: ASCII didn't provide a necessary character set. Now that we have Unicode, which embodies virtually all character sets existing on Earth, we don't _really_ need either non-Unicode encodings or fixed-width UTF versions. So a move to any hypothetical FutureText-64 would give no practical gain, unlike the move from single-byte encodings to, for example, UCS-2, and then from UCS-2 to UTF-8.

But my main point is a different one: eliminate the whole zoo of single-byte and fixed-width encodings and leave one universal encoding. When (if ever) it's time to replace it, we'll do it all at once, instead of having those crazy iconv calls everywhere.


Unicode is currently limited to 21 bits (a maximum code point of U+10FFFF) for compatibility with UTF-16. Eventually we might manage to exhaust all available codepoint space, and with that we'd have to move to yet another encoding with a whole new kind of surrogate pairs. Though UTF-8 could originally handle 31 bits, that's no longer the case.
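The gap between the two ceilings is worth spelling out. UTF-16's surrogate mechanism caps the code space at U+10FFFF, while the original UTF-8 design (RFC 2279, before RFC 3629 cut it back to four bytes) could reach 31 bits:

```python
# Code-space arithmetic behind the two limits.
UTF16_MAX = 0x10FFFF            # highest code point reachable via surrogate pairs
ORIGINAL_UTF8_MAX = 0x7FFFFFFF  # 31 bits, per the original 6-byte UTF-8 design

print(f"Unicode code points today:  {UTF16_MAX + 1:,}")        # 1,114,112
print(f"Original UTF-8 ceiling:    {ORIGINAL_UTF8_MAX + 1:,}") # 2,147,483,648
```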


So I see two steps here: dropping UTF-16 altogether (well, arguably we should already, because there are plenty of code points above U+FFFF), and, when approaching the 31-bit limit, inventing something like a "zero-width codepoint joiner" to compose codes of arbitrary length.

For example, in a hypothetical alien language, a hypothetical character "rjou" would have the code 0x2300740457 (all the previous codes being exhausted). We can't express this with a single code, so we split it into two-byte parts and write "#" (U+0023), joiner, "t" (U+0074), joiner, and "ї" (Cyrillic letter yi, U+0457). As we have a joiner between these codes, we know that we must interpret and display them not as the sequence "#tї" but as the single alien character "rjou". I think you get the idea.
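To make the proposed scheme concrete: splitting an oversized code into 16-bit chunks recovers exactly the three code points in the example. A sketch of this hypothetical mechanism (nothing here is a real Unicode feature; `split_code` is made up for illustration):

```python
def split_code(code: int) -> list[int]:
    """Break an arbitrarily large code into 16-bit big-endian chunks,
    which would then be written with a hypothetical joiner between them."""
    parts = []
    while code:
        parts.append(code & 0xFFFF)
        code >>= 16
    return parts[::-1] or [0]

parts = split_code(0x2300740457)
print([hex(p) for p in parts])       # ['0x23', '0x74', '0x457']
print([chr(p) for p in parts])       # ['#', 't', 'ї']
```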


> dropping UTF-16 altogether (well, already, because there are plenty of extended codepoints above 0xFFFF)

UTF-16 can handle stuff above U+FFFF just fine, it encodes that with surrogate pairs. Are you thinking about UCS-2?

The 21-bit limit for Unicode comes from the limits of UTF-16's surrogate pairs.
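The surrogate arithmetic shows where the limit comes from: a pair carries 10 + 10 = 20 bits beyond U+FFFF, which tops out at U+10FFFF. A short Python sketch of the real UTF-16 encoding of a supplementary code point:

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Encode a supplementary code point as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                       # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

hi, lo = to_surrogates(0x1F600)            # 😀
print(hex(hi), hex(lo))                    # 0xd83d 0xde00
```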



