
> It seems to me it would be trivial to enumerate these combinations, and assign code points to them.

Far from it. Even if you limit yourself to just Latin, the number of valid (whatever “valid” even means) combinations is already unmanageably gargantuan. Just look at phonetic notation as one example of many. The basic IPA alone uses over 100 letters for consonants and vowels, plus dozens of different diacritics, many of which need to be present concurrently on the same base letter. Make the jump to extended IPA or any number of other, more specialised transcription systems – and there are plenty – and you’ll never see the end of it.
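To make that concrete, here is a small Python sketch using the standard `unicodedata` module. It compares a common combination that has a precomposed code point with an IPA-style stack that doesn't. The sample strings are my own picks for illustration, not anything from a spec.

```python
import unicodedata

def nfc_code_points(text: str) -> list[str]:
    """Return the code points of `text` after NFC normalisation."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in composed]

# "e" + combining acute: Unicode has a precomposed form (U+00E9).
common = nfc_code_points("e\u0301")

# IPA open e (U+025B) + combining tilde + combining acute:
# no precomposed form exists, so NFC leaves all three code points alone.
ipa_stack = nfc_code_points("\u025b\u0303\u0301")

print(common)
print(ipa_stack)
```

The first list collapses to a single code point and the second stays at three, and that is with only two marks on one base letter.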

Sure, it may be technically possible to create an exhaustive list of letter-and-diacritic combinations, just like you can technically create an exhaustive list of every single human on Earth, but good luck getting there. And good luck making sure you didn’t miss anything in the process.

Of course, you don’t need to limit yourself to Latin, because Unicode has 160 other writing systems to offer.

Writing systems like Tibetan and Newa, where consonants can be stacked vertically to arbitrary heights and then have vowel signs and other marks attached as a bonus.

Or Hangul, which would occupy no fewer than 1,638,750 code points if all possible syllable blocks were encoded atomically, and that doesn’t even account for the archaic tone marks, or those novel letters that North Korea once tried to establish that aren’t even in Unicode yet.
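For anyone who wants to check the arithmetic, here is a rough Python sketch. The modern figure (19 initials × 21 vowels × 28 finals, counting "no final") is the well-known size of the precomposed Hangul Syllables block. The breakdown 125 × 95 × 138 for the larger figure is my assumption about how that number factors (archaic jamo plus fillers or "no final"); I haven't stated where each count comes from, so treat it as illustrative.

```python
import unicodedata

# Modern Hangul: 19 initial consonants, 21 vowels, 27 finals plus "no final".
MODERN_BLOCKS = 19 * 21 * 28

# Assumed counts including archaic jamo (hypothetical breakdown):
# 125 initials, 95 medials, 138 finals (including "no final").
ALL_BLOCKS = 125 * 95 * 138

# A precomposed syllable still decomposes into its conjoining jamo under NFD.
jamo = [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", "한")]

print(MODERN_BLOCKS)
print(ALL_BLOCKS)
print(jamo)
```

So the precomposed block Unicode actually ships covers well under 1% of what the conjoining jamo can express.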

Or Sutton SignWriting, whose system of combining marks and modifiers is so complex that I’m not even gonna explain it here.

If you eschew combining characters then yes, you will create an encoding where every code point is at the same time a full grapheme cluster and that definitely has concrete advantages, but as a consequence you have now assigned to yourself the unenviable task of having to possess perfect, nigh-omniscient knowledge of every single thing that a person has ever written down in the entirety of human history. Because unless you possess that knowledge, you will leave out things that some people need to type on a computer under some circumstances.

Every time some scholar discovers a previously forgotten vowel sign in an old Devanagari manuscript, you need to encode not only that one new character, but every combination of that vowel sign with any of the (currently) 53 Devanagari consonants, plus Candrabindu, Anusvara, and Visarga at the very least, just in case these combinations pop up somewhere, because they’re all linguistically meaningful and well-defined.

It’s doable, in a sense, but why would you subject yourself to that if you can just make characters combine with each other instead?



