The more I learn about Unicode, the more complicated it gets. It was rather shocking to learn that the presence of combining characters makes most "reverse a string" programming solutions incorrect, and that strings need to be normalized to compare them. The whole thing seems so much more complicated than it should be, but perhaps that's just the nature of the problem?
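To make the surprise concrete, here is a minimal sketch (Python 3, standard `unicodedata` module only; the word "café" is just a convenient sample I picked):

```python
import unicodedata

composed = "caf\u00e9"      # "café" with 'é' as one precomposed code point
decomposed = "cafe\u0301"   # "café" as 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False, though both render as "café"
print(unicodedata.normalize("NFC", decomposed) == composed)  # True once both are normalized

print(decomposed[::-1])   # naive reversal strands the combining accent at the front of the string
```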
Was Unicode designed well? If it were designed from scratch today, with no legacy considerations, would the ideal design look like the current design? What would you change?
I'm extremely ignorant of the problem space, but the first thing I would consider for the chopping block is combining characters. Just make every character a precomposed character (one code point), so there's no need for normalization. I'm curious whether such a scheme could still fit every character into 32 bits, though. Would this be feasible?
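To check my own understanding of what I'm proposing, here is a quick sketch (Python 3, `unicodedata`; the letters are arbitrary) showing that today only some base+mark combinations have a precomposed code point:

```python
import unicodedata

# 'e' + COMBINING ACUTE has a precomposed form, so NFC collapses it to one code point
print(len(unicodedata.normalize("NFC", "e\u0301")))   # 1  -> U+00E9

# 'q' + COMBINING ACUTE has no precomposed form, so NFC leaves it as two code points
print(len(unicodedata.normalize("NFC", "q\u0301")))   # 2
```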
Asking to “reverse a string” is an underspecified request to begin with: reverse the code units, the code points, or the grapheme clusters? It’s like giving me a list of numbers and asking me to “combine” them. What does that mean? Do I sum them up, or concatenate them, or something else entirely? A lot of string reversal solutions are “incorrect” because there isn’t even a well-defined question in the first place.
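Here is a rough sketch of that ambiguity (Python 3, standard library only). Grouping a base character with the combining marks that follow it is only an approximation of grapheme clusters, but it already produces a different “reverse” than reversing code points:

```python
import unicodedata

def clusters(s):
    """Group each base character with the combining marks that follow it.
    Only an approximation of grapheme clusters (it ignores ZWJ sequences,
    Hangul jamo, regional indicators, ...), but enough to show that
    "reverse" depends on what you treat as a unit."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch       # attach the mark to the preceding base
        else:
            out.append(ch)
    return out

s = "cafe\u0301"                        # 'e' + COMBINING ACUTE ACCENT
print("".join(reversed(s)))             # code-point reversal strands the accent
print("".join(reversed(clusters(s))))   # cluster reversal keeps the accent on its 'e'
```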
Even with an infinitely large code space, doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone. If Unicode had been the first digital character set ever created, it would not contain a single precomposed code point because they are utterly impractical. As such, normalisation – or at least the canonical reordering part of it – is always going to be a necessity.
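A small sketch of why the canonical reordering part in particular cannot be designed away (Python 3, `unicodedata`; the letter and marks are arbitrary): two marks with different combining classes can be typed in either order, and normalisation is what makes the two spellings compare equal:

```python
import unicodedata

# 'q' + COMBINING DOT BELOW (class 220) + COMBINING ACUTE ACCENT (class 230),
# typed in the two possible orders
a = "q\u0323\u0301"
b = "q\u0301\u0323"

print(a == b)   # False: different code point sequences for the same text
print(unicodedata.normalize("NFD", a) ==
      unicodedata.normalize("NFD", b))   # True: canonical reordering sorts the marks
```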