
Unicode: The people who made the internet open to anyone - danso
https://medium.com/@maggieshafer/unicode-a-story-of-corruption-connection-and-smiling-poo-598295e4af9d
======
danso
> _These articles answered a lot of my questions, but they also raised a new
> one. Every one of them mentions something called Unicode. Some of them in
> passing, as if to say, “We all know about Unicode, right? Let’s move on.”_

Anyone who came through with a typical CS/CompEng degree: did you learn about
Unicode in any depth? I didn't. We didn't even cover regular expressions. I
grew to hate Unicode but only recently have learned to respect what a
monumental achievement it is. I mean, I would hate buffer overflows (more)
if I didn't understand pointers and memory...but after understanding why they
(pointers and memory) are the way they are...They are painful, but I don't
suffer :)

This remark by Chris Granger always bothered me:

> _We don't want a generation of people forced to care about Unicode and UI
> toolkits. We want a generation of writers, biologists, and accountants that
> can leverage computers_ [1]

[1] [http://www.chris-granger.com/2015/01/26/coding-is-not-the-new-literacy/](http://www.chris-granger.com/2015/01/26/coding-is-not-the-new-literacy/)

I don't think people should be forced to decode Unicode by hand, or even
memorize little/big endian. But to not understand how fundamentally text
translates into numbers? That, from now on, humans will have a normalized way
of communicating with each other, whether by Internet or starship viewscreen,
for as long as the standard exists?
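
To make "text translates into numbers" concrete, here is a minimal Python sketch. The sample string is an arbitrary choice of mine; the point is only that each character has one code point, while the bytes depend on the encoding and byte order you pick.

```python
# Illustrative sample text: U+00E9 (e with acute) and U+20AC (euro sign).
text = "\u00e9\u20ac"

# Every character maps to one code point, which is just an integer.
code_points = [ord(ch) for ch in text]

# An encoding turns those integers into bytes. The same text gives
# different bytes under different encodings and byte orders.
utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # little endian
utf16_be = text.encode("utf-16-be")  # big endian

print([f"U+{cp:04X}" for cp in code_points])
print("UTF-8:   ", utf8.hex(" "))
print("UTF-16LE:", utf16_le.hex(" "))
print("UTF-16BE:", utf16_be.hex(" "))
```

Note that the two UTF-16 results hold the same byte pairs in swapped order, which is all "little/big endian" means here.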

Unicode is annoying, but only in the sense that going to France and not
speaking French is totally annoying.

~~~
ygra
I think they mean that Unicode as a technology should be invisible to the
average user. Just as no one should be forced to think about whether numbers
above 2 billion cause trouble, no one should be forced to remember the
differences between Latin-1, UTF-8, UTF-16LE, etc. Normal people just want
to write their language and people expect to see what they wrote. If they see
question marks, or diacritics next to their respective letters, or completely
different scripts, then things failed. But I'd argue that normal people do not
need to know how or why that failed. Even for developers with a bit of Unicode
experience it's often not trivial to recover text broken in such ways.
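
For the curious, a rough Python sketch of how those three failure modes can arise. The sample string and the choice of Latin-1 and ASCII as the "wrong" encodings are assumptions for illustration; real breakage usually involves more steps and is harder to undo.

```python
import unicodedata

# Sample text: "naive cafe" with i-diaeresis (U+00EF) and e-acute (U+00E9).
original = "na\u00efve caf\u00e9"

# 1. Completely wrong characters: UTF-8 bytes decoded as Latin-1.
mojibake = original.encode("utf-8").decode("latin-1")

# 2. Question marks: the target encoding cannot represent a character.
question_marks = original.encode("ascii", errors="replace").decode("ascii")

# 3. Diacritics next to their letters: the text is in decomposed form
#    (base letter followed by a combining mark) and the renderer does
#    not stack them.
decomposed = unicodedata.normalize("NFD", original)

# Case 1 can be reversed only because no bytes were lost on the way.
recovered = mojibake.encode("latin-1").decode("utf-8")

print(mojibake)
print(question_marks)
print(len(original), len(decomposed))
print(recovered == original)
```

Case 2 is not recoverable at all: once a character has become "?", the information is gone.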

Oh, and to answer your question: I never learned anything about Unicode in uni
either. As for myself, I've been lurking on the Unicode mailing list since
about 2008.

~~~
derefr
Sure, Unicode itself should be invisible, because Unicode is the _solution_ to
the problem. The problem itself is having tons of incompatible encodings in
use, and people _do_ have to understand what an encoding is if they're from a
place where "legacy" encodings (i.e. ones that aren't Unicode encodings) are
still in use. My mind usually leaps to Japan's continued adherence to SJIS,
but I'm sure there are others as well.
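
A small Python sketch of why the encoding still matters in that situation. The sample word is my own choice; `shift_jis` is the codec name Python's standard library uses for SJIS.

```python
# Sample text: the word "nihongo" (Japanese language), three kanji.
text = "\u65e5\u672c\u8a9e"

# The same three characters produce different bytes in each encoding.
sjis_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")

# Reading Shift_JIS bytes as if they were UTF-8 fails outright here.
try:
    sjis_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False

# Reading them with the encoding they were written in works.
round_trip = sjis_bytes.decode("shift_jis")

print(len(sjis_bytes), len(utf8_bytes))
print(misread_ok)
print(round_trip == text)
```

Nothing in the bytes themselves says which encoding produced them, so someone (or some metadata) has to know.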

------
clock_tower
For a dissenting view on Unicode, see:

[https://news.ycombinator.com/item?id=6863824](https://news.ycombinator.com/item?id=6863824)

[https://news.ycombinator.com/item?id=5362200](https://news.ycombinator.com/item?id=5362200)
(warning: this page may crash some browsers)

Anything is better than the old mess of unpredictable, ad-hoc encodings, but
Unicode is so academic and impractical that it isn't really all that much
better. We don't know if it's safe; we do know it can crash browsers with
plaintext (have fun with that second link!); and we _definitely_ know that
Unicode isn't "the solution to conflict and corruption" like the article says.

For a case of corruption that could only be done with Unicode, see the first
comment to this:

[https://news.ycombinator.com/item?id=10437619](https://news.ycombinator.com/item?id=10437619)

Unicode is better than nothing, where "nothing" is "webpages that can only
display 256 characters" or thereabouts; but it has a lot of mistakes and
clumsy characteristics. "I � Unicode" ("I Entity Unicode", using the Unicode
glyph for "unrecognized character") is about as far as affection for Unicode
should go; I used to have that on a bumper sticker.

~~~
nikdaheratik
The first and third dissents are dumb. Basically the attack is that, because
the glyphs used to represent the characters look the same but are actually
different code points, they have the potential to be abused.

But there's no reason why this would only be limited to Unicode. Any other
encoding system that keeps a Cyrillic alphabet in a separate code range from
the Latin one would have this problem. And it's much less efficient for your
text chomping program that reads Cyrillic to check all of a character range
_except_ for the 2-3 that also have existing Latin characters that look the
same as the Cyrillic ones.
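
To illustrate the look-alike issue, a rough Python sketch. The `scripts_used` helper is hypothetical and just reads the first word of each character's Unicode name; it is a crude heuristic for this example, not the confusable-detection mechanism that real software uses.

```python
import unicodedata

latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A


def scripts_used(s):
    """Guess scripts from the first word of each letter's Unicode name.

    Hypothetical helper for illustration only.
    """
    return {unicodedata.name(ch).split()[0] for ch in s if ch.isalpha()}


genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is the Cyrillic look-alike

print(latin_a == cyrillic_a)   # they render alike but compare unequal
print(scripts_used(genuine))
print(scripts_used(spoofed))
```

A mixed-script result like the second one is the kind of signal a checker can flag, without the encoding itself having to merge the two alphabets.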

And I really don't understand why "I Entity Unicode" is even a valid
complaint. Or what it's even complaining _about_. If your font doesn't have a
glyph that matches your character, it's not going to show up no matter which
encoding scheme you use.

Unicode got a lot of buy-in from publishers as it met all of their needs. The
tech companies followed suit, eventually, which is a heck of a lot better than
the dark ages when each computer company had its own custom platform for
these encodings in the U.S. alone. And then there were the individual solutions
to the problem overseas, where the wheel was reinvented for each language.

~~~
clock_tower
"Any other encoding system that keeps a cyrillic alphabet in a separate code
range from the latin one would have this problem."

Then isn't the solution to put the Cyrillic alphabet in the same space as the
Latin one when the glyphs are identical?

In general, Unicode should merge code points a lot more readily than it does;
it shouldn't have some of the strange things it does have, which hog space
that would be better used putting semi-common Chinese characters in the Basic
Multilingual Plane; and the way it handles Hangul is just embarrassing. It's
better than nothing (and better than the Codepage-<Insert Number> variety of
solutions), but that doesn't mean it's good; that's what I mean by "I Entity
Unicode".
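
For readers who haven't looked at the Hangul handling: Unicode has both a block of 11,172 precomposed syllables (U+AC00 to U+D7A3, inside the Basic Multilingual Plane) and separate conjoining jamo, with the mapping between them defined by arithmetic. A minimal Python sketch of that arithmetic, checked against the standard library's NFD normalization; the function name is mine.

```python
import unicodedata

# Constants from the Hangul syllable arithmetic in the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def decompose_hangul(syllable):
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    parts = [lead, vowel]
    if trail != T_BASE:  # index 0 means "no trailing consonant"
        parts.append(trail)
    return "".join(chr(p) for p in parts)


han = "\ud55c"  # U+D55C, the syllable "han"
by_hand = decompose_hangul(han)
by_library = unicodedata.normalize("NFD", han)

print([f"U+{ord(c):04X}" for c in by_hand])
print(by_hand == by_library)
print(L_COUNT * V_COUNT * T_COUNT)  # size of the precomposed block
```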

------
douche
Non-alphabetic languages: Why we can't encode all text in 8 or 16 bits
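
The arithmetic behind that, as a quick Python sketch. The example character is an arbitrary pick from outside the first 65,536 code points.

```python
# How many distinct values fit in one 8-bit or one 16-bit unit.
MAX_8_BIT = 2**8    # 256
MAX_16_BIT = 2**16  # 65,536

# Unicode code points run from U+0000 to U+10FFFF.
UNICODE_CODE_SPACE = 0x10FFFF + 1  # 1,114,112

# A character above U+FFFF does not fit in a single 16-bit unit,
# so UTF-16 spends two units (a surrogate pair) on it.
ch = "\U0001F236"
utf16_units = len(ch.encode("utf-16-le")) // 2

print(MAX_8_BIT, MAX_16_BIT, UNICODE_CODE_SPACE)
print(hex(ord(ch)), utf16_units)
```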

------
bobbyi_settv
> Or this essay from from New York Magazine, which explores their evolution
> and cultural implications.

Why is a link to an "essay from from New York Magazine" trying to take me to
youtube?

How do you open your article with non functioning links and then immediately
jump to criticizing the design of someone else's website which seems to work
just fine?

------
NoMoreNicksLeft
Everyone buy Klingons.

~~~
WorldMaker
Indeed. Now that the Astral Plane has opened up, I think it is past time for
Unicode to reconsider its decision on conlang alphabets and include things
like Klingon, Tengwar, Aurebesh, and similar. I suppose coming up with
criteria for conlang alphabet significance might be complicated, but certainly
scholarship qualifications similar to those used for some of the dead
languages might apply.

~~~
kevin_thibedeau
They certainly have no criteria for emoji significance. Some Japanese handheld
has an obscure character? Put it in the next standard for all to puzzle over.

~~~
derefr
You mean these ones? 🈶 🈚️ 🈸 🈺 🈷️ 🈴 🈵 🈲

Their names in the standard are kind of annoying, but actually helpful:
"Squared CJK Unified Ideograph-NNNN", where NNNN turns out to be the hex code
point of the equivalent not-in-a-square character.
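
You can check the naming pattern yourself with a short Python sketch using the standard `unicodedata` module. The `inner_ideograph` helper is a name I made up; it just parses the hex digits off the end of the character name.

```python
import unicodedata

PREFIX = "SQUARED CJK UNIFIED IDEOGRAPH-"


def inner_ideograph(symbol):
    """Return the plain ideograph named inside a squared CJK symbol."""
    name = unicodedata.name(symbol)
    if not name.startswith(PREFIX):
        raise ValueError(f"unexpected name: {name}")
    # The name ends in the hex code point of the unsquared character.
    return chr(int(name[len(PREFIX):], 16))


# Two of the symbols above, written as escapes to avoid any stray
# variation selectors: U+1F236 and U+1F237.
for symbol in ("\U0001F236", "\U0001F237"):
    inner = inner_ideograph(symbol)
    print(unicodedata.name(symbol), "->", f"U+{ord(inner):04X}", inner)
```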

Now, I've taken courses in Japanese and Chinese, but you don't usually use
characters on their own like this to mean things; so I would _guess_ the
following:

• "🈶" is meant to be used as an emblem/badge to mark items you already own,
and "🈚️" ones you don't;

• "🈸" can mean "to offer", perhaps used beside e.g. bid prices on stocks—or
could just mean "to report", as in a button to view the status of something;

• "🈺" seems to mean "to manage"—as in a button to modify the properties of
something;

• "🈷️" just means "month" (or "moon", but probably not in this case), and so
it was likely meant to be used as a button to open a time-picker to modify the
date of something;

• "🈴" seems to mean "to close" or "to pass" (as on a test); I assume it could
be used similarly to a "Done" or maybe "Apply" button on a property sheet, or
to the "Submit" button on a form;

• "🈵" means "enough", in the sense of being satisfied; I guess this could be
used for an "OK" button, or a back/up/out button when looking "inside"
something.

• "🈲" means "to prohibit" or "to restrict", but also "to endure"; I would
guess it would be used for a "Cancel" button.

I have a feeling these characters aren't used much now even on modern Japanese
mobile devices; they're likely a legacy of the early Japanese mobile web and
apps. If anyone here knows better...?

~~~
rakoo
It is only now, after years of dealing with stuff that is not ASCII (I'm not
American), that I realize the icons actually have the Unicode codepoint in
them. Thank you very much for making me see it.

~~~
derefr
Do you mean that they're boxes containing the hex digits of their respective
Unicode codepoints? I think that's actually what happens on Windows when you
don't have an emoji font installed.

They should render on most desktop OSes like this:
[http://i.imgur.com/lFQoaQe.png](http://i.imgur.com/lFQoaQe.png)

Or, on mobile OSes, more like this:
[http://i.imgur.com/M3U6Ikl.png](http://i.imgur.com/M3U6Ikl.png)

