
>The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points. Another example is the new skin tone emoji.
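For reference, the mechanism described in the quote can be sketched in a few lines of Python. This is only an illustration: the helper name `flag` is made up here, and whether the resulting pair actually renders as a flag depends on the font and platform.

```python
# Regional indicator symbols: 26 code points starting at U+1F1E6 (letter A).
REGIONAL_INDICATOR_A = 0x1F1E6


def flag(country_code: str) -> str:
    """Map a two-letter ASCII country code to a pair of regional indicator symbols.

    `flag` is a hypothetical helper for illustration, not a standard API.
    """
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


fr = flag("FR")
# Two code points; a renderer that knows the pair may draw a single flag glyph.
print([f"U+{ord(c):04X}" for c in fr])  # ['U+1F1EB', 'U+1F1F7']
```

So one visible symbol is two code points here, which is exactly the kind of thing the rest of this thread is arguing about.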

Still not answering the question though.

For one, when the Unicode standard was originally designed it didn't have emoji in it.

Second, if what necessitates such a design is the need to accommodate the arbitrary addition of thousands of BS symbols like emoji, we could rather do without emoji in Unicode at all (or Klingon or whatever).

So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...

Using less memory (like UTF-8 allows) is, I guess, a valid concern.




It didn't have emoji, but it did have other combining characters. For some languages it's feasible to normalize them to single code points, but for other languages it would not be.

Plus, given the fact that some visible characters are made up of many graphemes, the number of single code points would be huge.
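A quick way to see the difference, as a rough sketch using Python's standard `unicodedata` module: "e" plus a combining acute accent has a precomposed code point, while "q" plus the same accent (picked here just as an example of a combination with no precomposed form) stays as two code points even after NFC normalization.

```python
import unicodedata

# "e" + COMBINING ACUTE ACCENT: a precomposed form (U+00E9) exists,
# so NFC collapses the pair into a single code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# "q" + COMBINING ACUTE ACCENT: no precomposed form exists,
# so NFC has nothing to collapse it into and it stays two code points.
q_acute = unicodedata.normalize("NFC", "q\u0301")

print(len(e_acute), len(q_acute))  # 1 2
```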

As to your second point, it seems to me to be a little closed-minded. The whole point of a universal character set was that languages can be added to it, whether they be textual, symbolic or pictographic.


>As to your second point, it seems to me to be a little closed-minded. The whole point of a universal character set was that languages can be added to it, whether they be textual, symbolic or pictographic.

Representing all languages is OK as a goal -- adding Klingon and BS emoji not so much (from a sanity perspective, if adding them meddled with having a logical and simple representation of characters).

So, it comes down to "the fact that some visible characters are made up of many graphemes, the number of single code points would be huge" and "for some languages it's feasible to normalize them to single code points, but for other languages it would not be".

Wouldn't 32 bits be enough for all possible valid combinations? I see e.g. that: "The largest corpus of modern Chinese words is as listed in the Chinese Hanyucidian (汉语辞典), with 370,000 words derived from 23,000 characters".

And how many combinations are there of stuff like Hangul? I see that's 11,172. Accents in languages like Russian, Hungarian, Greek should be even easier.
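For what it's worth, the 11,172 figure falls out of simple arithmetic: 19 leading consonants times 21 vowels times 28 trailing options (27 consonants plus "none"). A small sketch of that arithmetic, with constant and function names that are my own choice:

```python
import unicodedata

# Base code points for precomposed syllables and for the conjoining jamo.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"


def compose(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Compute the precomposed Hangul syllable from jamo indices."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


total = L_COUNT * V_COUNT * T_COUNT
han = compose(18, 0, 4)  # leading U+1112, vowel U+1161, trailing U+11AB

print(total)                  # 11172
print(f"U+{ord(han):04X}")    # U+D55C
# NFD decomposes the syllable back into its three jamo code points.
print(len(unicodedata.normalize("NFD", han)))  # 3
```

So Hangul at least is a case where every valid combination already has its own code point and the mapping is computable without tables.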

Now, having each accented character as a separate code point might take some lookup tables -- but we already require tons of complicated lookup tables for string manipulation in UTF-8 implementations IIRC.


You might be correct, and 32 bits could have been enough, but Unicode has restricted code points to 21 bits (a maximum of U+10FFFF). Why? Because of stupid UTF-16 and surrogate pairs.
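To spell out where that limit comes from, here is a rough sketch of the surrogate arithmetic. The variable and function names are mine; the ranges are the standard surrogate ranges.

```python
# 1024 high surrogates (U+D800..U+DBFF) x 1024 low surrogates (U+DC00..U+DFFF)
HIGH_COUNT = 0xDC00 - 0xD800
LOW_COUNT = 0xE000 - 0xDC00

# Surrogate pairs address the code points above the first 0x10000 (the BMP).
max_code_point = 0x10000 + HIGH_COUNT * LOW_COUNT - 1


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


print(hex(max_code_point), max_code_point.bit_length())  # 0x10ffff 21
print([hex(u) for u in to_surrogates(0x1F600)])          # ['0xd83d', '0xde00']
```

The pair for U+1F600 matches what `"\U0001F600".encode("utf-16-be")` produces, so the 21-bit ceiling is really just "everything UTF-16 can reach".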

I'm curious why you think that UTF-8 requires complicated lookup tables.


>I'm curious why you think that UTF-8 requires complicated lookup tables.

Because in the end it's still a Unicode encoding, and still has to deal with BS like "equivalence", right?

Which is not mechanically encoded in the, err, encoding (e.g. such that all characters with the same bit pattern are equivalent) but needs external tables for that.


But that's the same for UTF-16 and UTF-32. That's why I was wondering why you singled UTF-8 out, implying it needed extra handling.
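To illustrate, a minimal sketch with Python's `unicodedata`: the two spellings of "é" differ byte-for-byte in every one of the three encodings, and it takes the normalization tables (which live outside any encoding) to see that they are canonically equivalent.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

# The encoded bytes differ regardless of which encoding is used.
bytes_differ = {
    enc: composed.encode(enc) != decomposed.encode(enc)
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

# Equivalence only shows up after consulting the normalization data.
equivalent = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(bytes_differ)
print(equivalent)  # True
```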


Nah, didn't single it out, I asked why we don't have a 32-bit, fixed-size-code-point, non-surrogate-pair-BS etc. encoding.

And I added that while this might need some lookup tables, we already have those in UTF-8 too anyway (a non-fixed-width encoding).

So the reason I didn't mention UTF-16 and UTF-32 is that those are already fixed-size to begin with (strictly speaking only UTF-32; UTF-16 is fixed-size only outside surrogate pairs) and increasingly less used nowadays, except on platforms stuck with them for legacy reasons -- so the "competitor" encoding would be UTF-8, not them.


> So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...

Because language is messy. At some point you have to start getting into the raw philosophy of language and it's not just a technical problem at that point but a political problem and an emotional problem.

Take accents as one example. In English, a diaeresis is a rare but sometimes useful accent mark to distinguish digraphs (coöperate should be pronounced as two Os, not one OOOH sound like in chicken coop); the letter stays the same, it just has "bonus information". In German, an umlaut version of a letter (ö versus o) is considered an entirely different letter, with a different pronunciation and alphabet order (though further complicated by conversions to digraphs in some situations, such as ö to oe).

Which language is "right"? The one that thinks that a diaeresis is merely a modifier, or the one that thinks of an accented letter as a different letter from the unmodified one? There isn't a right and wrong here, there are just different perspectives, different philosophies, huge histories of language evolution and divergence, and lots of people reusing similar-looking concepts for vastly different needs.

Similarly, the Spanish ñ is a single letter in Spanish, but the ~ accent may be a tone marker in another language that is important to the pronunciation of the word, and a modifier to the letter rather than a letter on its own.

There's the case of the overlaps where different alphabets diverged from similar origins. Are the letters that still look alike the same letters? [1]
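For the record, Unicode's own answer in the Latin/Greek/Cyrillic case is "no, they are separate letters". A small sketch using the standard `unicodedata` module shows that the three look-alike capital As are distinct code points, and that even compatibility normalization does not merge them:

```python
import unicodedata

# Three capital letters that typically render almost identically.
lookalikes = ["A", "\u0391", "\u0410"]

names = [unicodedata.name(c) for c in lookalikes]
# NFKC is the most aggressive standard normalization; it still keeps them apart.
folded = {unicodedata.normalize("NFKC", c) for c in lookalikes}

for name in names:
    print(name)
print(len(folded))  # 3
```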

Math is a language with a merged alphabet of Latin characters, Arabic characters, Greek characters, monastery manuscript-derived shorthands, etc. Is the modern Greek Pi the same as the mathematical symbol Pi anymore? Do they need different representations? Do you need to distinguish, say, in the context of modern Greek mathematical discussions, the usage of Pi in the alphabet versus the usage of mathematical Pi?

These are just the easy examples in the mostly EFIGS space that most of HN will be aware of. Multiply those sorts of philosophical complications across the spectrum of languages written across the world, the diversity of Asian scripts, the wonder of ancient scripts, and yes, the modern joy of emoji. Even "normalization" is a hack where you don't care about the philosophical meaning of a symbol; you just need to know if the symbols vaguely look alike. Even then, Unicode offers several different kinds of normalization, because not everyone can agree on which things look alike either, and that changes with different perspectives from different languages.
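As a concrete illustration of the "different kinds of normalization" point, here is a small sketch with Python's `unicodedata`. The sample string (an "ﬁ" ligature followed by "é") is just my pick; the four forms disagree on how many code points it should be.

```python
import unicodedata

sample = "\ufb01\u00e9"  # LATIN SMALL LIGATURE FI + LATIN SMALL LETTER E WITH ACUTE

forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in forms.items():
    print(form, [f"U+{ord(c):04X}" for c in text])
# NFC  keeps both as-is            (2 code points)
# NFD  splits only the accent      (3 code points)
# NFKC splits only the ligature    (3 code points)
# NFKD splits both                 (4 code points)
```

The canonical forms (NFC/NFD) treat the ligature as its own thing, while the compatibility forms (NFKC/NFKD) decide it is "really" just f and i.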

[1] An excellent Venn diagram: https://en.wikipedia.org/wiki/File:Venn_diagram_showing_Gree...



