Ask HN: Is Unicode Designed Badly?
3 points by indil on Sept 9, 2022 | 14 comments
The more I learn about Unicode, the more complicated it gets. It was rather shocking to learn that the presence of combining characters makes most "reverse a string" programming solutions incorrect, and that strings need to be normalized to compare them. The whole thing seems so much more complicated than it should be, but perhaps that's just the nature of the problem?
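
For example, here's a quick Python sketch (using the standard unicodedata module) of the two issues I mean:

    import unicodedata

    a = "\u00e9"     # "é" as one precomposed code point
    b = "e\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT
    print(a == b)    # False: same visible text, different code points
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))   # True once both are normalized
    print(b[::-1])   # naive reversal strands the accent on the wrong side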

Was Unicode designed well? If it were designed from scratch today, with no legacy considerations, would the ideal design look like the current design? What would you change?

Being extremely ignorant of the problem space, the first thing I would consider for the chopping block would be combining characters. Just make every character a precomposed character (one code point), so there's no need for normalization. I'm curious if such a scheme could fit every code point into 32 bits, though. Would this be feasible?




Reversing a string is a useless operation in the real world. Its only application is padding out interview questions. “How to reverse a string” is also an incredibly vague question. What do you actually want me to do? Reverse code points, or code units, or grapheme clusters, or make it look like it’s written backwards? It doesn’t even make sense as a concept in most of the world’s writing systems.

It’s like giving me a list of numbers and asking me to “combine” them. What does that mean? Do I sum them up, or concatenate them, or something else entirely? A lot of string reversal solutions are “incorrect” because there isn’t even a correct question in the first place.

Even with an infinitely large code space, doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone. If Unicode had been the first digital character set ever created, it would not contain a single precomposed code point because they are utterly impractical. As such, normalisation – or at least the canonical reordering part of it – is always going to be a necessity.
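
As a quick illustration of why that canonical reordering matters, with two arbitrarily chosen marks (Python's unicodedata here, but any Unicode library would do):

    import unicodedata

    x = "a\u0323\u0301"   # a + combining dot below + combining acute
    y = "a\u0301\u0323"   # the same marks, typed in the opposite order
    print(x == y)         # False
    print(unicodedata.normalize("NFD", x) ==
          unicodedata.normalize("NFD", y))   # True: reordering makes them comparable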


>Reversing a string is a useless operation in the real world

I'm not sure why you focused on this one example; it was just meant to indicate the nature of the issue, not to cite a concrete real-world problem. There are plenty of situations where you'd want to operate on graphemes, not code points, like deleting the previous grapheme in a text editor. It would certainly help programmers write correct code if the two were the same.

>doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone

It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.


> It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.

Well, 10 code points because vowels can be capitalized and 12 because ÿ is used in other languages.
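
For what it's worth, all twelve of those already exist as precomposed code points; a quick check, assuming Python's unicodedata:

    import unicodedata

    for v in "aeiouyAEIOUY":
        case = "CAPITAL" if v.isupper() else "SMALL"
        name = f"LATIN {case} LETTER {v.upper()} WITH DIAERESIS"
        print(v, unicodedata.lookup(name))   # ä ë ï ö ü ÿ Ä Ë Ï Ö Ü Ÿ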

That's one of the easiest cases. Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.

At the end of years of work, you'd have an encoding which is easier for C programmers to think about but means all of your documents require substantially more storage than they used to.


>Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.

Perhaps this is just my ignorance talking, but it can't be that many permutations, can it? Somebody linked to https://en.wikipedia.org/wiki/Zalgo_text, which I doubt anyone would seriously want to enable. There's, what, maybe 3-4 marks typically added to chars in the most complex of cases, mostly for vowels, like Vietnamese. With 4 billion code points to work with, that seems doable. We could just throw in all permutations, regardless of past utility, to accommodate future expansions of acceptable marks. Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.

>but means all of your documents require substantially more storage than they used to.

Good point! But that comes down to a trade-off analysis between design and space. High 32-bit code point values are meant to be used too, and not shied away from.


> Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.

I believe it's over a hundred thousand (don't forget that scholars need to work with classical and/or obscure characters which aren't in common usage), and, while not common, new ones are still being added. Han unification is a good cautionary example to consider any time you think something related to Unicode is easy: https://en.wikipedia.org/wiki/Han_unification

Now, there are on the order of 150K characters in Unicode, so there is definitely a lot of room even for Chinese. I'm less sure about the combinations, because there are languages which use combining marks extensively (e.g. Navajo), and with things like emoji skin tone modifiers (multiply everything with skin by 5 variants) or zero-width joiners used to handle things like gender, you get a lot of permutations if you try to precompose them all into individual code points.
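
To make the emoji point concrete, a couple of sequences picked at random:

    thumbs_up = "\U0001F44D"            # THUMBS UP SIGN
    skin_tone = "\U0001F3FD"            # EMOJI MODIFIER FITZPATRICK TYPE-4
    print(len(thumbs_up + skin_tone))   # 2 code points, one glyph on screen
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"   # man ZWJ woman ZWJ girl
    print(len(family))                  # 5 code points for a single "family" glyph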

This is already sounding like a ton of work, even before you get to the question of getting adoption, and then you have to remember that the Unicode consortium explicitly says that diacritic marks aren't specific to a particular known language's usage so you either have to have every permutation or prove that nobody uses a particular one.

https://unicode.org/faq/char_combmark.html#10

The big question here is what the benefit would be, and it's hard to come up with one other than that everyone could treat strings as arrays of 32-bit integers. While nice, that doesn't seem compelling enough to take on a task of that order of magnitude.


> It seems to me it would be trivial to enumerate these combinations, and assign code points to them.

Far from it. Even if you limit yourself to just Latin, the number of valid (whatever “valid” even means) combinations is already unmanageably gargantuan. Just look at phonetic notation as one example of many. The basic IPA alone uses over 100 letters for consonants and vowels, plus dozens of different diacritics, many of which need to be present concurrently on the same base letter. Make the jump to extended IPA or any number of other, more specialised transcription systems – and there are plenty – and you’ll never see the end of it.

Sure, it may be technically possible to create an exhaustive list of letter-and-diacritic combinations, just like you can technically create an exhaustive list of every single human on Earth, but good luck getting there. And good luck making sure you didn’t miss anything in the process.

Of course, you don’t need to limit yourself to Latin, because Unicode has 160 other writing systems to offer.

Writing systems like Tibetan and Newa where consonants can be stacked vertically to arbitrary heights and then have vowel signs and other marks attached as a bonus as well.

Or Hangul, which would occupy no less than 1,638,750 code points if all possible syllable blocks were encoded atomically, and that doesn't even account for the archaic tone marks, or those novel letters that North Korea once tried to establish that aren't even in Unicode yet.

Or Sutton SignWriting whose system of combining marks and modifiers is so complex that I’m not even gonna explain it here.

If you eschew combining characters then yes, you will create an encoding where every code point is at the same time a full grapheme cluster and that definitely has concrete advantages, but as a consequence you have now assigned to yourself the unenviable task of having to possess perfect, nigh-omniscient knowledge of every single thing that a person has ever written down in the entirety of human history. Because unless you possess that knowledge, you will leave out things that some people need to type on a computer under some circumstances.

Every time some scholar discovers a previously forgotten vowel sign in an old Devanagari manuscript, you need to encode not only that one new character, but every combination of that vowel sign with any of the (currently) 53 Devanagari consonants, plus Candrabindu, Anusvara, and Visarga at the very least, just in case these combinations pop up somewhere, because they’re all linguistically meaningful and well-defined.

It’s doable, in a sense, but why would you subject yourself to that if you can just make characters combine with each other instead?
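
(A footnote on the Hangul point above: modern syllables already get exactly this precomposed treatment, which hints at the scale. A quick arithmetic check in Python:)

    # Modern Hangul is fully precomposed: 19 initials x 21 vowels x 28 finals
    # (including "no final"), one code point each, in the U+AC00..U+D7A3 block.
    print(19 * 21 * 28)          # 11172
    print(0xD7A3 - 0xAC00 + 1)   # 11172 -- the block size matches exactly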


Combining characters have their issues (https://en.wikipedia.org/wiki/Zalgo_text), but making string reversal trickier isn’t one of them. “Reversing” is an extremely atypical thing to do with text. I think only programming exercises and palindrome searchers do it. Why would your data structure make that easy to do?

For Unicode, a “design from scratch” design would remove duplicate legacy code points. Why have “é” both as a single code point and as “e” plus a combining character?

It also wouldn’t have any of the deprecated characters (https://en.wikipedia.org/wiki/Unicode_character_property#Dep...)

I also would remove the few special flag code points (https://home.unicode.org/the-past-and-future-of-flag-emoji/)

If “design from scratch” also means “drop the goal of encompassing old character encodings”, more code points could probably go. Why are DOS box-drawing characters in Unicode while the Atari/PET ones, for example, aren’t?

Finally, I would look into making it easier to retrieve character class from a code point (the ‘these code points are digits, these are combining marks, etc’ tables are a bit of a wart, and getting rid of them could be useful in small embedded devices).

I doubt a solution exists there that is future-proof against extension of Unicode and doesn’t blow up memory use, though, and I’m not sure any embedded devices too small to host those tables could actually use that info.
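
(For clarity, these are the per-code-point properties I mean, as exposed by Python's unicodedata; the categories and combining classes come straight from the Unicode character database:)

    import unicodedata

    for ch in "A7\u0301\u4e2d":
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))
    # U+0041 Lu 0    uppercase letter
    # U+0037 Nd 0    decimal digit
    # U+0301 Mn 230  nonspacing mark, canonical combining class 230
    # U+4E2D Lo 0    other letter (CJK)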


>Why would your data structure make that easy to do?

Reversing a string merely indicates the problem. There are many cases for operating on graphemes instead of code points. For example, deleting the previous grapheme in a text editor when pressing backspace/delete. I think most programmers assume they're dealing with graphemes when they're actually dealing with code points. See, for example, the rune type in Go.
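
A one-line illustration of the backspace case (Python slicing operates on code points):

    s = "cafe\u0301"   # "café" typed with a combining acute accent
    print(s[:-1])      # "cafe" -- a code-point backspace removes only the accent

(Getting an actual grapheme-aware delete needs a grapheme segmentation library; the standard library doesn't ship one.)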


I think it would look a lot like UTF-8 with some of the legacy parts removed (e.g. drop the non-combining characters which duplicate combining character combinations). One thing to remember is that there are a LOT of edge cases in the world, and you're looking at a lot of permutations when characters combine multiple accents, or with things like emoji, which use skin tone modifiers to avoid needing to encode every permutation. I'm not sure whether that would fit in a 32-bit code point, but I would also consider what it would do to file and network sizes — there are real costs to making almost every document substantially larger, and while we have more headroom than we used to, I'd still be surprised if that didn't result in noticeable performance regressions.

Where I would make the change isn't Unicode itself but the APIs. All of the problems you're talking about basically come down to legacy language design where people think they're working with grapheme clusters but are really working with code points. Making that more explicit in the tools would be good, similar to how Python 3 forced you to think about whether you wanted encoded binary data or a decoded string, but there's so much history here that it would be hard to do without a lot of griping from people who don't want to update decades of habit.
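
Roughly the kind of distinction I mean, borrowing the Python 3 analogy (the lengths below assume the é is typed precomposed):

    data = "héllo".encode("utf-8")   # bytes: what actually goes over the wire
    text = data.decode("utf-8")      # str: a sequence of code points
    print(len(data), len(text))      # 6 vs 5 -- and neither is what a user would
                                     # count once combining marks or emoji
                                     # sequences get involved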


Every computer science problem eventually ends with The Unicode Problem and its various agendas. Personally I avoid Unicode in my editor and use ASCII at all times. If I have to deal with Unicode, I escape it into the relevant ASCII equivalent and normalize things like emojis to ASCII. This avoids various headaches down the line, since Unicode is not cross-compatible across devices and having everything in ASCII is a saner way to approach that.


> This avoids various headaches down the line, since Unicode is not cross-compatible across devices and having everything in ASCII is a saner way to approach that.

Did you mean to say that not all programs support Unicode? It's been a long time — at least a decade — since I ran into a device which doesn't support it at all, as opposed to something like PHP code which has built-in support that simply wasn't enabled.


"..characters makes most "reverse a string"..."

has nothing with Unicode

It's thing to do with whether it's Little or Big Endian

Unicode as fixed & secured as Ascii at its era


The problem the parent describes (reversing a string) has nothing to do with endianness (that would be trivial to overcome). It's not about the byte order in the architecture, wire protocol, or format -- but about how characters are defined and combined from constituent elements in Unicode itself (way beyond the byte order).


Zalgo text provides a straightforward example of how "reversing a string" is ambiguous and ill-defined: one might conceivably want to transpose "letters" along with their respective piles of combining marks, stacked in the same order, rather than moving combining marks onto adjacent letters in a different order.

And of course which characters count as a mark-bearing letter is application-dependent, and there might be collateral requirements (e.g. if the reversed text is meant to be displayed, swapping closing and opening delimiters such as parentheses: "(note)" -> ")eton(" or "(eton)").
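
A toy sketch of that last, delimiter-mirroring variant (purely illustrative):

    # Hypothetical "display reversal": reverse by code point, then mirror paired delimiters.
    mirror = str.maketrans("()[]{}<>", ")(][}{><")
    print("(note)"[::-1].translate(mirror))   # ")eton(" becomes "(eton)"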



