It's not bad, but it's complicated, as it requires an O(n) algorithm to jump to a specific character. Unicode should have been capped at 16 bits, and doubling text files in size is fine. A simplified, UTF-8-style alternate representation would have kept compatibility with old ASCII files.
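To make the O(n) point concrete, here's a rough sketch (the helper utf8Offset is made up for illustration): finding where the n-th code point starts in a UTF-8 buffer means looking at every byte before it to tell lead bytes from continuation bytes.

    // Minimal sketch: locating the n-th code point in UTF-8 takes a linear scan,
    // because code points occupy anywhere from 1 to 4 bytes.
    func utf8Offset(of codePointIndex: Int, in bytes: [UInt8]) -> Int? {
        var seen = 0
        for (offset, byte) in bytes.enumerated() {
            // Continuation bytes look like 0b10xxxxxx; everything else starts a code point.
            if (byte & 0b1100_0000) != 0b1000_0000 {
                if seen == codePointIndex { return offset }
                seen += 1
            }
        }
        return nil
    }

    let bytes = Array("h\u{E9}llo 🌍".utf8)
    print(utf8Offset(of: 6, in: bytes) ?? -1)   // 7: the scan walked every byte before the emoji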
Doubling text files is a waste for the most part, but what makes it tolerable is compression. Still, 16 bits would not be enough; that's only 65,536 different code points, less than half of what is currently in Unicode. 24 bits is sufficiently out of alignment with modern hardware and algorithms, so 32 bits it is for efficiency. That is now 4 times the size, and compression becomes a requirement.
In any case, UTF-8 has a place, and if you want easy manipulation and search, convert it to UTF-32 - it's fixed width.
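Roughly something like this in Swift (just a sketch; it indexes by code point, which, as noted further down, is still not the same as a user-perceived character):

    // Decode UTF-8 and keep the scalars in an array for O(1) indexing by code point.
    let utf8Bytes: [UInt8] = Array("h\u{E9}llo 🌍".utf8)   // 11 bytes
    let text = String(decoding: utf8Bytes, as: UTF8.self)

    let scalars = Array(text.unicodeScalars)   // conceptually UTF-32: one element per code point
    print(scalars.count)   // 7
    print(scalars[6])      // 🌍, reached without scanning the preceding bytes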
> Still, 16 bits would not be enough; that's only 65,536 different code points, less than half of what is currently in Unicode.
But that's only because Unicode has ventured well beyond what we consider to be text. The BMP is enough to represent all text (including math).
The BMP is mostly filled with symbols from logographic languages and currently has around 100 free code points. 16 bits simply isn't enough for the scope of capturing all written language in history.
But I don't think all written languages in history should have the same treatment when it comes to standardized data representation, or should all be standardized by the same body. It's OK for scripts that no one other than specialized researchers has used for thousands of years to be standardized separately from the Latin or Chinese scripts.
But if you don't, what would the point of a unified standard be? What should happen to glyphs that no one has used outside of research for 100 years? 1,000 years?
They should all be standardized under separate standards. You can call them "extended text", if you like. Mathematical notation isn't really supported by Unicode, either (just mathematical symbols), and that's fine. Math should have its own standard, and so should hieroglyphs, emojis, and musical notation.
What happens if you want to use two different extended text code points in one blog post? How would they interact? How do they avoid assigning the same code point to different symbols?
How would browsers support this? What's the actual plan, not just a handwave? Do you think it'd be more efficient to have to support 6 different standards than one?
> What happens if you want to use two different extended text code points in one blog post?
There are no code points if it's not text. How do you use code points for embedding a video or a picture on your blog? You don't! But if you want to treat something as if it were text, then I suggested doing something similar to an XML namespace: "the next segment is hieroglyphics, you can get their glyphs from here, and these are their indices...". That "extended text" is still not text, and it still doesn't use any Unicode code points, but it can work according to similar principles.
> Do you think it'd be more efficient to have to support 6 different standards than one?
Then why don't we let the Unicode Consortium take over standardizing video or audio? If something isn't text, why is it standardized by a text standardization body?
If you think of Unicode as a standard that eases the exchange of textual data among parties, would that change your mind? People had been putting emojis alongside text (think SMS) long before Unicode put them into the standard at the request of Google and Apple.
People have been putting images alongside text since the invention of print. That still doesn't make the images text. So it's really great that for a few years now some people have been embedding icons in their SMS messages, but I think that incorporating those fashionable 2-to-5-year-old icons into a standard that's mostly about standardizing hundreds-of-years-old text doesn't make much sense.
You can exchange text with embedded icons just as easily without requiring OS vendors to come up with their own versions of vaguely-defined pictures by... simply embedding pictures.
I can already see the people on whatever the HN of 20 years from now will be, complaining about how "bloated" Unicode is, full of thousands of symbols that no one ever uses, and calling to replace the whole thing, costing the industry even more money to replace a standard yet again.
If compression would be a requirement for 32-bit text, then it was definitely a requirement for 8-bit text 25 years ago, when memory and bandwidth were typically 1/1000th of today's. Of course it often wasn't, and isn't. And where it is, it's still a requirement with UTF-8.
> it requires an O(n) algorithm to jump to a specific character
If you are trying to index into a string by "character" you are almost certainly already doing it wrong. Meaningful indexing almost always has to be by grapheme cluster. See Swift's string API as a great example of this done right.
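A short example of what that looks like (counts assume the literal below, with "é" written as a base letter plus a combining accent): Swift's Character is a grapheme cluster, so count and String.Index-based subscripting operate on user-perceived characters, while the scalar and code-unit views expose the underlying encodings.

    // Swift indexes strings by grapheme cluster (Character), not by code unit.
    let s = "e\u{0301}🇨🇦"           // "é" as e + combining acute, followed by a flag emoji

    print(s.count)                   // 2  Characters (grapheme clusters)
    print(s.unicodeScalars.count)    // 4  Unicode scalars
    print(s.utf16.count)             // 6  UTF-16 code units
    print(s.utf8.count)              // 11 UTF-8 bytes

    // Indexing goes through String.Index, which walks grapheme boundaries:
    let second = s[s.index(after: s.startIndex)]
    print(second)                    // 🇨🇦 as a single Character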
ASCII is only 7-bit, so UTF-8 is fully compatible with it, at least when compared to ISO 2022 and other similar horrors from the same era. Are you thinking of other encodings?
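A quick sanity check of that compatibility (just a sketch): any pure-ASCII byte sequence is already valid UTF-8 and decodes back to the same text, since ASCII's 7-bit values map one-to-one onto UTF-8's single-byte range.

    // Every 7-bit ASCII byte is also a single-byte UTF-8 sequence for the same character.
    let ascii = "plain old ASCII"
    let bytes = Array(ascii.utf8)

    print(bytes.allSatisfy { $0 < 0x80 })                    // true: all bytes are 7-bit
    print(String(decoding: bytes, as: UTF8.self) == ascii)   // true: same text back out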
Emojis and hieroglyphs aside, 16 bits was not enough for CJK characters. It's a real-world problem: a character set that can't spell people's names or place names can't be universally adopted.
How do you define "character"? You typically need to take into account grapheme clusters / composed character sequences, which make jump-to-character O(n) regardless of encoding.