It's not bad, but it's complicated, as it requires an O(n) algorithm to jump to a specific character. Unicode should have been capped at 16 bits, and doubling text files in size is fine. A simplified, UTF-8-style alternate representation would have kept compatibility with old ASCII files.
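To make the O(n) point concrete, here's a rough sketch (the helper utf8Offset is made up for illustration): finding where the n-th code point starts in a UTF-8 buffer means looking at every byte before it to tell lead bytes from continuation bytes.

    // Minimal sketch: locating the n-th code point in UTF-8 takes a linear scan,
    // because code points occupy anywhere from 1 to 4 bytes.
    func utf8Offset(of codePointIndex: Int, in bytes: [UInt8]) -> Int? {
        var seen = 0
        for (offset, byte) in bytes.enumerated() {
            // Continuation bytes look like 0b10xxxxxx; everything else starts a code point.
            if (byte & 0b1100_0000) != 0b1000_0000 {
                if seen == codePointIndex { return offset }
                seen += 1
            }
        }
        return nil
    }

    let bytes = Array("h\u{E9}llo 🌍".utf8)
    print(utf8Offset(of: 6, in: bytes) ?? -1)   // 7: the scan walked every byte before the emoji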
Doubling text files is a waste for the most part, but what makes it tolerable is compression. Still, 16 bits would not be enough; that's only 65,536 different code points, less than half of what is currently in Unicode. 24 bits is sufficiently out of alignment with modern hardware and algorithms, so 32 bits it is for efficiency. That is now 4 times the size, and compression becomes a requirement.
In any case, UTF-8 has a place, and if you want easy manipulation and search, convert it to UTF-32 - it's fixed width.
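Roughly something like this in Swift (just a sketch; it indexes by code point, which, as noted further down, is still not the same as a user-perceived character):

    // Decode UTF-8 and keep the scalars in an array for O(1) indexing by code point.
    let utf8Bytes: [UInt8] = Array("h\u{E9}llo 🌍".utf8)   // 11 bytes
    let text = String(decoding: utf8Bytes, as: UTF8.self)

    let scalars = Array(text.unicodeScalars)   // conceptually UTF-32: one element per code point
    print(scalars.count)   // 7
    print(scalars[6])      // 🌍, reached without scanning the preceding bytes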
> Still, 16 bits would not be enough; that's only 65,536 different code points, less than half of what is currently in Unicode.
But that's only because Unicode has ventured well beyond what we consider to be text. The BMP is enough to represent all text (including math).
The BMP is mostly filled with symbols from logographic languages and currently has around 100 free code points. 16 bits simply isn't enough for the scope of capturing all written language in history.
But I don't think all written languages in history should have the same treatment when it comes to standardized data representation, or should all be standardized by the same body. It's OK for scripts that no one other than specialized researchers has used for thousands of years to be standardized separately from the Latin or Chinese scripts.
But if you don't, what would the point of a unified standard be? What should happen to glyphs that no one has used outside of research for 100 years? 1,000 years?
They should all be standardized under separate standards. You can call them "extended text", if you like. Mathematical notation isn't really supported by Unicode, either (just mathematical symbols), and that's fine. Math should have its own standard, and so should hieroglyphs, emojis, and musical notation.
What happens if you want to use two different extended text code points in one blog post? How would they interact? How do they avoid assigning the same code point to different symbols?
How would browsers support this? What's the actual plan, not just a handwave? Do you think it'd be more efficient to have to support 6 different standards than one?
> What happens if you want to use two different extended text code points in one blog post?
There are no code points if it's not text. How do you use code points for embedding a video or a picture on your blog? You don't! But if you want to treat something as if it were text, then I suggested doing something similar to an XML namespace: "the next segment is hieroglyphics, you can get their glyphs from here, and these are their indices...". That "extended text" is still not text, and it still doesn't use any Unicode code points, but it can work according to similar principles.
> Do you think it'd be more efficient to have to support 6 different standards than one?
Then why don't we let the Unicode Consortium take over standardizing video or audio? If something isn't text, why is it standardized by a text standardization body?
If you think of Unicode as a standard that eases the exchange of textual data among parties, would that change your mind? People had been putting emojis alongside text (think SMS) long before Unicode put them into the standard at the request of Google and Apple.
People have been putting images alongside text since the invention of print. That still doesn't make the images text. So it's really great that for a few years now some people have been embedding icons in their SMS messages, but I think that incorporating those fashionable 2-to-5-year-old icons into a standard that's mostly about standardizing hundreds-of-years-old text doesn't make much sense.
You can exchange text with embedded icons just as easily without requiring OS vendors to come up with their own versions of vaguely-defined pictures by... simply embedding pictures.
I can already see the people on whatever the HN of 20 years from now will be, complaining about how "bloated" Unicode is, full of thousands of symbols that no one ever uses, and calling to replace the whole thing, costing the industry even more money to replace a standard yet again.
If compression would be a requirement for 32-bit text, then it was definitely a requirement for 8-bit text 25 years ago, when memory and bandwidth were typically 1/1000th of today's. Of course it often wasn't, and isn't. And where it is, it's still a requirement with UTF-8.
> it requires an O(n) algorithm to jump to a specific character
If you are trying to index into a string by "character" you are almost certainly already doing it wrong. Meaningful indexing almost always has to be by grapheme cluster. See Swift's string API as a great example of this done right.
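A short example of what that looks like (counts assume the literal below, with "é" written as a base letter plus a combining accent): Swift's Character is a grapheme cluster, so count and String.Index-based subscripting operate on user-perceived characters, while the scalar and code-unit views expose the underlying encodings.

    // Swift indexes strings by grapheme cluster (Character), not by code unit.
    let s = "e\u{0301}🇨🇦"           // "é" as e + combining acute, followed by a flag emoji

    print(s.count)                   // 2  Characters (grapheme clusters)
    print(s.unicodeScalars.count)    // 4  Unicode scalars
    print(s.utf16.count)             // 6  UTF-16 code units
    print(s.utf8.count)              // 11 UTF-8 bytes

    // Indexing goes through String.Index, which walks grapheme boundaries:
    let second = s[s.index(after: s.startIndex)]
    print(second)                    // 🇨🇦 as a single Character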
ASCII is only 7-bit, so UTF-8 is fully compatible with it, at least when compared to ISO 2022 and other similar horrors from the same era. Are you thinking of other encodings?
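A quick sanity check of that compatibility (just a sketch): any pure-ASCII byte sequence is already valid UTF-8 and decodes back to the same text, since ASCII's 7-bit values map one-to-one onto UTF-8's single-byte range.

    // Every 7-bit ASCII byte is also a single-byte UTF-8 sequence for the same character.
    let ascii = "plain old ASCII"
    let bytes = Array(ascii.utf8)

    print(bytes.allSatisfy { $0 < 0x80 })                    // true: all bytes are 7-bit
    print(String(decoding: bytes, as: UTF8.self) == ascii)   // true: same text back out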
Emojis and hieroglyphs aside, 16 bits was not enough for CJK characters. It's a real-world problem: a character set that can't spell people's names or place names can't be universally adopted.
How do you define "character"? You typically need to take into account grapheme clusters / composed character sequences, which make jump-to-character O(n) regardless of encoding.