
> Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.

I believe it's over a hundred thousand (don't forget that scholars need to work with classical or obscure characters that aren't in common usage), and while it's not common, new ones are still being added. Han unification is a good cautionary example to consider any time you think something related to Unicode is easy: https://en.wikipedia.org/wiki/Han_unification

Now, there are on the order of 150K characters in Unicode, so there is definitely a lot of room even for Chinese. I'm less sure about the combinations: some languages use combining marks extensively (e.g. Navajo), and then there are emoji skin tone modifiers (multiply every emoji depicting skin by five tone variants) and zero-width joiners used for things like gender. You'd get an enormous number of permutations if you tried to precompose all of those into individual code points.
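To make the combinatorics concrete, here's a quick sketch (the specific characters are my illustrative picks, not from the comment above) of how combining marks and emoji modifiers turn one visible "character" into several code points:

```python
# One visible character built from multiple code points.

# Combining marks (as in Navajo): a + ogonek + acute accent.
s = "a\u0328\u0301"
print(len(s))   # 3 code points, one visible character

# Emoji skin tone: waving hand + medium skin tone modifier.
wave = "\U0001F44B\U0001F3FD"
print(len(wave))   # 2 code points

# ZWJ sequence: woman + zero-width joiner + laptop = "woman technologist".
tech = "\U0001F469\u200D\U0001F4BB"
print(len(tech))   # 3 code points
```

Precomposing would mean assigning a dedicated code point to every such sequence anyone might use, which is where the explosion comes from.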

This is already sounding like a ton of work, even before you get to the question of adoption. And then you have to remember that the Unicode Consortium explicitly says that diacritical marks aren't tied to any particular language's usage, so you'd either have to encode every permutation or prove that nobody uses a particular one:

https://unicode.org/faq/char_combmark.html#10
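You can see this asymmetry today with normalization: some mark combinations already have precomposed code points, but most don't, and NFC only collapses the ones that do. A small sketch with Python's standard library (my example characters, chosen for illustration):

```python
import unicodedata

# 'é' has a precomposed code point (U+00E9), so NFC collapses the sequence:
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
print(len(unicodedata.normalize("NFC", decomposed)))   # 1

# But a perfectly valid combination with no precomposed form stays a sequence:
odd = "x\u0301"          # x + COMBINING ACUTE ACCENT
print(len(unicodedata.normalize("NFC", odd)))          # 2
```

A fully precomposed Unicode would have to pick a cutoff for sequences like the second one, which is exactly the proof burden the FAQ entry describes.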

The big question here is what the benefit would be, and it's hard to come up with one beyond letting everyone treat strings as arrays of 32-bit integers. That's nice, but it doesn't seem compelling enough to take on a task of that order of magnitude.
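And even that benefit is weaker than it sounds: code points already don't line up with user-perceived characters, so indexing a 32-bit array can still split a "character" in half. A quick illustration (my example, using a flag built from two regional-indicator code points):

```python
# A French flag is two code points that render as one glyph.
flag = "\U0001F1EB\U0001F1F7"   # regional indicators F + R
print(len(flag))    # 2, even though users see one "character"
print(flag[:1])     # slicing by code point splits the flag apart
```

So "one integer per character" would still require grapheme-cluster logic on top, which undercuts the main selling point of full precomposition.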



