
I Can’t Write My Name in Unicode - luu
https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name
======
sdg1
Not sure if the l33tspeak analogy is fully justified.

In the case of the "missing" letter (called khanda-ta in Bengali) in the
Bengali equivalent of "suddenly": historically, it has been a derivative of
the ta-halant form (ত + ্ + ‍ ). As the language evolved, khanda-ta became a
grapheme of its own, and Unicode 4.1 did encode it as a distinct grapheme. A
nicely written review of the discussions around the addition can be found
here:
[http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf](http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf)

I could write the author's name fine: আদিত্য. A search with the string in the
Bengali version of Wikipedia pulls up quite a few results as well, so other
people are writing it too. The final "letter" in that string is a compound
character, and there's no clear evidence that it needs to be treated as an
independent one. Even while in primary school, we were taught the final
"letter" in the author's name as a conjunct. In contrast, for the khanda-ta
case, it could be shown that modern Bengali dictionaries explicitly referred
to khanda-ta as an independent character.
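(For anyone who wants to verify this from code: a minimal sketch in Python,
using only the standard library's unicodedata module; the codepoint names come
from the Unicode character database.)

```python
import unicodedata

# Khanda-ta received its own codepoint, U+09CE, in Unicode 4.1.
print(unicodedata.name("\u09CE"))  # BENGALI LETTER KHANDA TA

# The older spelling was a sequence: ta + virama + zero-width joiner.
for ch in "\u09A4\u09CD\u200D":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```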

For me, many of these problems are more of an input issue than an encoding
issue. Non-Latin languages have had to shoehorn their scripts onto keyboard
layouts designed for Latin scripts, and that has always been suboptimal. With
touch devices we have newer ways to think about this problem, and people are
starting to try things out.

[Disclosure: I was involved in the Unicode discussions about khanda-ta (I was
not affiliated with a consortium member) and I have been involved with Indic
localization projects for the past 15 years]

~~~
chimeracoder
> I could write the author's name fine: আদিত্য

Author here.

Well, yes and no. The jophola at the end is not actually given its own
codepoint[0]. The best analogy I can give is to a ligature in English[1]. The
Bengali fonts that you have installed happen to render it as a jophola, the
way some fonts happen to render "ff" as "ﬀ" but that's not the same thing as
saying that it actually _is_ a jophola (according to the Unicode standard).

The difference between the jophola and an English ligature, though, is that
English ligatures are purely aesthetic. Typing two "f" characters in a row has
the same obvious semantic meaning as the ﬀ ligature, whereas the characters
that are required to type a jophola have no obvious semantic, phonetic, or
orthographic connection to the jophola.

[0]
[http://unicode.org/charts/PDF/U0980.pdf](http://unicode.org/charts/PDF/U0980.pdf)

[1] Some fonts will render (e.g.) two "f"s in a row as if they were a
ligature, even though it's not a true ﬀ (U+FB00).
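A quick way to see this concretely is to dump the codepoints of the name
above; a sketch using Python's standard unicodedata module, with the output
shown as comments:

```python
import unicodedata

for ch in "আদিত্য":  # the name as written above
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# U+0986 BENGALI LETTER A
# U+09A6 BENGALI LETTER DA
# U+09BF BENGALI VOWEL SIGN I
# U+09A4 BENGALI LETTER TA
# U+09CD BENGALI SIGN VIRAMA
# U+09AF BENGALI LETTER YA
```

No codepoint is named anything like "jophola"; the final glyph is whatever the
font chooses to draw for virama + ya.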

~~~
yuriks
Unicode makes extensive use of combining characters for European languages,
for example to produce diacritics (ìǒ), and even for flag emoji. A correct
rendering system will properly combine those, and if it doesn't, that's a flaw
in the implementation, not the standard. It seems like you're trying to single
out combining pairs as "less legitimate" when they're extensively used in the
standard.

~~~
chimeracoder
> Unicode makes extensive use of combining characters for european languages,
> for example to produce diacritics: ìǒ or even for flag emoji.

But it doesn't, for example, say that a lowercase "b" is simply "a lowercase
'l' followed by an 'o' followed by an invisible joiner", because no native
English speaker thinks of the character "b" as even remotely related to "lo"
when reading and writing.

> It seems like you're trying to single out combining pairs as "less
> legitimate" when they're extensively used in the standard.

I'm saying that Unicode only does it in English where it makes semantic sense
to a native English speaker. It does it in Bengali even where it makes little
or no semantic sense to a native Bengali speaker.

~~~
maxlybbert
> > It seems like you're trying to single out combining pairs as "less
> legitimate" when they're extensively used in the standard.

> I'm saying that Unicode only does it in English where it makes semantic
> sense to a native English speaker.

Well, combining characters almost never come up in English. The best I can
think of would be the use of cedillas, diaereses, and grave accents in words
like façade, coördinate, and renownèd (I've been reading Tolkien's
translation of Beowulf, and he used renownèd _a lot_).

Thinking about the Spanish I learned in high school, ch, ll, ñ, and rr are
all considered separate letters (i.e., the Spanish alphabet has 30 letters: ch
sorts between c and d, ll between l and m, ñ between n and o, and rr between
r and s; interestingly, accented vowels aren't separate letters). Unicode does
not provide code points for ch, ll, or rr; and ñ has a code point more from
historical accident than anything (the decision to start with Latin-1). Then
again, I don't think Spanish keyboards have separate keys for ch, ll, or rr.

Portuguese, on the other hand, doesn't officially include k or y in the
alphabet. But it uses far more accents than Spanish. So, a, ã and á are all
the same letter. In a perfect world, how would Unicode handle this? Either
they accept the Spanish view of the world, or the Portuguese view. Or,
perhaps, they make a big deal about not worrying about _languages_ and instead
worrying about _alphabets_ (
[http://www.unicode.org/faq/basic_q.html#4](http://www.unicode.org/faq/basic_q.html#4)
).

They haven't been perfect. And they've certainly changed their approach over
time. And I suspect they're including emoji to appear more welcoming to
Japanese teenagers than they have been in the past. But (1) combining characters
aren't second-class citizens, and (2) the standard is still open to revisions
(
[http://www.unicode.org/alloc/Pipeline.html](http://www.unicode.org/alloc/Pipeline.html)
).

~~~
darklajid
I'm coming from a German background and I sympathize with the author.

German has 4 (7 if you consider cases) non-ASCII characters: äüöß (and
upper-case umlauts). All of these are unique, well-defined codepoints.

That's not related to composing on a keyboard. In fact, although I'm German,
I'm using the US keyboard layout and HAD to compose these characters just now.
But I wouldn't need to, and the result is a single codepoint again.

~~~
the_mitsuhiko
> German has 4 (7 if you consider cases) non-ASCII characters: äüöß (and
> upper-case umlauts). All of these are unique, well-defined codepoints.

German does not consider "ä", "ö" and "ü" letters. Our alphabet has 26
letters, none of which are the ones you mentioned. In fact, if you go back in
history it becomes even clearer that those letters used to be ligatures in
writing.

They are still collated as the basic letters they represent, even if they
sound different. That we usually use the precomposed representation in
Unicode is merely a historical artifact of ISO 8859-1 and others, not because
it logically makes sense.

When you used an old typewriter you usually did not have those keys either,
you composed them.

~~~
darklajid
One by one:

I'm confused by your use of 'our' and 'we'. It seems you're trying to write
from the general point of view of a German, answering .. a German?

Are umlauts letters? Yes. [1] [2] Maybe not the best source, but please
provide a better one if you disagree so that I can actually understand where
you're coming from.

I understand - I hope? - composition. And I tend to agree that it shouldn't
matter much if the input just works. If I press a key labeled ü and that
letter shows up on the screen, I shouldn't really care if that is one
codepoint or a composition of two (or more). I do think that the history you
mention is an indicator that supports the author's argument. There IS a
codepoint for ü (painful to type..). For 'legacy reasons', perhaps. And it
feels to me that non-ASCII characters that originate in western Europe / my
home country have, for legacy reasons or whatever, better support than the
ones he is complaining about.

Typewriters and umlauts:

[http://i.ebayimg.com/00/s/Mzk2WDQwMA==/$T2eC16N,!)sE9swmYlFP...](http://i.ebayimg.com/00/s/Mzk2WDQwMA==/$T2eC16N,!\)sE9swmYlFPBP8DdJ8zFQ~~48_20.JPG)

(Basically I searched for old typewriter models; 'Adler Schreibmaschinen'
results in lots of hits like that. Note the separate umlaut keys. And these
are typewriters from... the 60s? Maybe?)

1:
[https://de.wikipedia.org/wiki/Alphabet](https://de.wikipedia.org/wiki/Alphabet)
2:
[https://de.wikipedia.org/wiki/Deutsches_Alphabet](https://de.wikipedia.org/wiki/Deutsches_Alphabet)

~~~
ptaipale
I am not entirely sure whether Germans count umlauts as distinct characters
or as modified versions of the base character. And maybe it is not so
important; they still deserve their own code points.

Note, BTW, that the Swedish and German alphabets share some non-ASCII
characters (ä, ö), while others are distinct to each language (å, ü). It is
important that the Swedish ä and the German ä map to the same code point and
the same representation in files; this way I can use a computer localised for
Swedish and type German text. Only when I need to type ü do I have to compose
it from ¨ and u, while ä and ö are right there on the keyboard.

The German alphabetical order supports the idea that umlauts are not so
distinct from their bases: it is

AÄBCDEFGHIJKLMNOÖPQRSßTUÜVWXYZ while the Swedish/Finnish one is
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ

This has obvious impacts on sorting order.

BTW, traditionally Swedish/Finnish did not distinguish between V and W in
sorting, thus a correct sorting order would be

Vasa

Westerlund

Vinberg

Vårdö

- the W drops right in the middle; it's just an older way to write V. And
Vå... is at the end of section V, while Va... is at the start.
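Locale-aware collation APIs exist for exactly this. A minimal sketch in
Python, assuming a Swedish locale is installed (locale names vary by OS, and
whether V and W still interleave depends on the collation data your system
ships):

```python
import locale

# Assumes an installed Swedish locale; the exact name varies by OS.
locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")

names = ["Westerlund", "Vårdö", "Vasa", "Vinberg"]
# strxfrm maps each string to a locale-specific sort key.
print(sorted(names, key=locale.strxfrm))
```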

~~~
Argorak
Umlauts are not distinct characters, but modifications of existing ones to
indicate a sound shift.

[http://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29](http://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29)

German has valid transcriptions to the base alphabet for those, e.g.
"Schreoder" is a valid way to write "Schröder".

ß, however, is a separate character that is not listed in the German
alphabet, especially because some subgroups don't use it (e.g. Swiss German
doesn't have it).

~~~
darklajid
Three things:

1) To avoid confusing readers who don't know German or aren't used to
umlauts: the correct transcription is base-vowel+e (i.e. ö turns into oe), so
the example given is therefore wrong. Probably just a typo, but still.

2) These transcriptions are lossy. If you see 'oe' in a word, you cannot
(always) pronounce it as an umlaut. The second e might just indicate that the
o in oe is long.

3) ß is a character in the alphabet, as far as I'm aware and as far as the
mighty Wikipedia is concerned, as I pointed out above. If you have better
sources that claim otherwise, please share them (I... am a native speaker,
but no language expert, so I'm genuinely curious why you'd think that this
letter isn't part of the alphabet).

Fun fact: I once had to revise all the documentation for a project, because
the (huge, state-owned) Swiss customer refused perfectly valid German, stating
"We don't have that letter here, we don't use it: Remove it".

~~~
Argorak
1) It's a typo, yes. Thanks!

2) Well, they are lossy in the sense that pronunciation is context-sensitive.
The number of cases where you actually turn the word into another word is
very small:
[http://snowball.tartarus.org/algorithms/german2/stemmer.html](http://snowball.tartarus.org/algorithms/german2/stemmer.html)
has a discussion.

3) You are right, I'm wrong. ß, ä, ö, ü are considered part of the alphabet.
It's not taught in school, though (at least it wasn't in mine).

Thanks a lot for making the effort and fact-checking better than I did there
:).

------
Epenthesis
Wait, "ত + ্ + ‍ = ‍ৎ" is _nothing_ like "\ + / \+ \ + / = W".

The Bengali script is (mostly) an abugida. That is, consonants have an
inherent vowel (/ɔ/ in the case of Bengali), which can be overridden with a
diacritic representing a different vowel. To write /t/ in Bengali, you
combine the character for /tɔ/, "ত", with the "vowel silencing diacritic",
"্", to remove the /ɔ/. As it happens, for "ত", the addition of the diacritic
changes the shape considerably more than it usually does, but it's perfectly
legitimate to suggest the resulting character is still a composition of the
other two (for a more typical composition, see "ঢ" + "্" = "ঢ্").

As it happens, the _same_ character ("ৎ") is also used for /tjɔ/ as in the
"tya" of "aditya". Which suggests having a dedicated code point for the
character could make sense. But Unicode isn't being completely nutso here.

~~~
ubasu
Native Bengali here. The ligature used for the last letter of "haTaat"
(suddenly) is _not_ the same as the last ligature in "aditya" \- the latter
doesn't have the circle at the top.

More generally, using the vowel silencing diacritic (hasanta) along with a
separate ligature for the vowel ending - while theoretically correct - does
not work because no one writes that way! Not using the proper ligatures makes
the text essentially unreadable.

~~~
eridius
I don't understand Bengali at all, but I'm trying to understand your second
sentence. When you say "no one writes that way", do you mean nobody hits the
keys for a letter, followed by the vowel-silencing diacritic, followed by
another vowel? Or do you mean the glyph that results from that combination of
keystrokes doesn't match how a Bengali speaker would write it on paper?

If it's the latter, isn't that an issue for the text input system to deal
with? Unicode does not need to represent how users input text; it merely needs
to represent the content of the text. For example, in OS X, if I press
option-e for U+0301 COMBINING ACUTE ACCENT, and then type "e", the resulting
text is not U+0301 U+0065. It's actually U+00E9 (LATIN SMALL LETTER E WITH
ACUTE), which can be decomposed into U+0065 U+0301. And in both forms (NFC and
NFD), the Unicode codepoint sequence does not match the order in which I
pressed the keys on the keyboard.
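That behaviour is easy to reproduce; a minimal sketch with Python's standard
unicodedata module (nothing here is OS X specific):

```python
import unicodedata

composed = "\u00E9"  # é, LATIN SMALL LETTER E WITH ACUTE (the NFC form)
decomposed = unicodedata.normalize("NFD", composed)

print([f"U+{ord(c):04X}" for c in composed])    # ['U+00E9']
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']

# The two forms are canonically equivalent and round-trip cleanly.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```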

So given that, shouldn't this issue be solved for Bengali at the text input
level? If it makes sense to have a dedicated keystroke for some ligature, that
can be done. Or if it makes sense to have a single keystroke that adds both
the vowel-silencing diacritic + vowel ending, that can be done as well.

---

If the previous assumption was wrong and the issue here is that the rendered
text doesn't match how the user would have written it on paper, then that's a
different issue. But (again, without knowing anything about Bengali so I'm
running on a lot of assumptions here) is that still Unicode's fault? Or is it
the fault of the font in question for not having a ligature defined that
produces the correct glyph for that sequence of codepoints?

~~~
ubasu
It has to do with how the text is rendered. For example, if you see the
Bengali text on page 3 of this PDF:

[http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf](http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf)

it is unreadable and incorrect Bengali. ;-)

~~~
eridius
On page 3 I see two different pieces of Bengali. One is text, and the other is
an image. I assume you're referring to the text? What makes it wrong? And what
software are you using to view the PDF? If it's wrong, it's quite possible
that the software you're using doesn't render it correctly, rather than the
document actually being wrong.

~~~
ubasu
I am referring to the first line of Bengali text.

I have viewed it in Acrobat Reader, epdfview, Chrome's PDF reader, Firefox's
PDF reader, and my iPhone's PDF reader.

The problems are that joint-letter ligatures are not used, and several vowel
signs are placed after the consonant when they should have been placed before
it.

~~~
eridius
Your description makes it sound like the text in the PDF is merely incorrect,
rather than there being any issue with Unicode.

------
microcolonel
“Whatever path we take, it’s imperative that the writing system of the 21st
century be driven by the needs of the people using it. In the end, a
non-native speaker – even one who is fluent in the language – cannot truly
speak on behalf of the monolingual, native speaker.”

Not sure how the author can simultaneously say this while criticizing CJK
unification, which makes total sense and has never been a point of contention
in the communities concerned with it.

I regularly work with both Chinese and Japanese texts, and have studied the
Japanese language for more than five years. More than not opposing CJK
unification, I benefit from it greatly. It means that I can search for the
same character (yes, the same, I said it) and get results in Japanese,
Chinese, and sometimes Korean contexts without having to consult some sort of
Unicode etymology resource.

There should not be three codepoints each for 中, 日, 力, and most other
characters. The simplified and specific post-Han forms have their own
codepoints anyway, so there is (to my eyes at least) no value in fragmenting
instances of the same character.

~~~
eloisant
Actually, yes, CJK unification is a problem for many people, including me
when I want to read Japanese on a phone bought in Europe.

Example 1. Typically, any time you want to mix the two languages, you run
into trouble.

Let's say you write a textbook for Chinese people to learn Japanese as a
second language. Or a research article in Japanese citing old Chinese
literature.

In your text, you'll have to mark specifically which parts are in Japanese
and which are in Chinese, and use different fonts for them. If you don't, the
characters look wrong.

Example 2. Most phone software doesn't switch fonts by language, so it picks
one. Let's say you buy a Japanese phone: it's a Japanese font everywhere. For
some reason, on a Samsung or Motorola phone bought in Europe, it's a Chinese
font everywhere, and it looks ugly for Japanese. Wasn't Unicode supposed to
be universal, and to let you read any language on a device bought anywhere?

On the other hand, I don't think the "I can't write my name" issue is really
a problem for Japanese people, or at least it's not a Unicode problem. Some
people have rare characters in their names. They accept and understand not
being able to type their name on a computer and having to settle for a close
character. Unicode actually provides more characters than previous
Japanese-only encodings like JIS, so there are people who can write their
names thanks to Unicode.

~~~
krakensden
Neither of your examples is solvable by the Unicode consortium, who are not
the god-emperors of fonts (nor would you want them to be).

~~~
mbessey
I respectfully disagree. If Japanese ideograms and Chinese ideograms actually
used _different code points_ (i.e. no "Han unification"), then the problem
wouldn't exist - the phone could trivially use a Japanese font for Japanese
text, and a Chinese font for Chinese text.

~~~
SiVal
No. Using different code points for the same character used in different
languages creates big problems. It would be like having different code points
for 'A' depending on whether it was used in English, Spanish, German, etc. If
you somehow ended up writing "color" with both 'o' characters from the Spanish
ABCs and the others from the English ABCs, you'd have a real mess when it came
to sorting, searching, name matching (what language is "Hans"?) etc. It is far
more convenient to allow the character sequence "color" or "Hans" to be
language independent, even if the font choices, pronunciation, sort order,
etc., are language dependent.

Chinese, Japanese, and Korean writers face similar issues. The characters they
use to write the name of China or Japan, the ten digits, the characters for
year, month, and day in dates, and so many thousands of others in Chinese
characters are what they all consider to be the same characters. That is not
all characters, but it includes so many that insisting on different code
points by language would make a real mess. Hong Kong has many characters that
are unique to HK Cantonese. So, should Cantonese have a full set of _all_
Chinese characters that are the "Cantonese characters"? How about
Shanghainese, then? Or Hakka? Teochew (Chaozhou) or a dozen Chinese languages?
Full, independent sets of all Chinese characters for each? Suppose you
accidentally used an input method in HK and wrote the name of some Beijing
gov't ministry using characters that looked identical to their Mandarin
counterparts but were entirely different codepoints? Now what? You can't find
your search term? You mess up the database and have two identical-looking
keys?

No, Han unification is not conceptually different from unifying ABCs used by
English and Spanish speakers, Cyrillic used by Russians or Serbs, etc., except
that there are many more characters, so the boundary between what should be
unified and what shouldn't contains more items in the gray zone to cause
debate. Having no Han unification at all wouldn't solve all problems, it would
create all sorts of absurdity.

~~~
mbessey
Actually, Cyrillic is an interesting case. The Unicode standard _does_ define
completely-separate codepoints for the Cyrillic letters, even for the ones
that look "just like" letters of the Latin Alphabet. Greek letters that look
exactly like Latin letters get the same treatment.

It's difficult to come up with a logical explanation for why European
languages that use their own alphabet get their own codepoints, but
ideographic languages need to be "unified", even though the actual letters as
used in those languages look different.

The "Han unification" was fundamentally a bad idea, and persists for
historical reasons. Back when (some) people thought a fixed-width 16-bit
character representation would be "enough", it made sense to try to reduce the
number of "redundant" code points. Now that Unicode has expanded to a much-
larger code space, I would think they'd choose differently.

Unfortunately, that kind of sweeping change is unlikely any time soon.

------
mlmonkey
I get the author's point, but assigning blame to the Unicode Consortium is
incorrect. The Indian government dropped the ball here. It went its separate
way with "ISCII" and other such misguided efforts instead of cooperating with
the UC. To me, the UC is just a platform.

The government is the de facto guardian of the people's interests; if it
drops the ball, it should be taken to task, not the provider of the platform.

~~~
afarrell
But then consider the implementation path to fixing the problem for a
minority linguistic group being deliberately repressed by their government:
it would require blood. If there is some alternate process to work with the
UC directly, that could be better, but it puts the UC in the position of
judging a linguistic group's claims to legitimacy.

I agree that this refutes claims that the UC was negligent, but we can still
say that they failed to be particularly assiduous in this case.

Side question: does anyone know the story that led to
[http://unicode.org/charts/PDF/U13A0.pdf](http://unicode.org/charts/PDF/U13A0.pdf)
?

EDIT: subjunctive

~~~
GFK_of_xmaspast
In specific or in general?
[http://en.wikipedia.org/wiki/Cherokee_syllabary](http://en.wikipedia.org/wiki/Cherokee_syllabary)

An increasing corpus of children's literature is printed in Cherokee syllabary
today to meet the needs of Cherokee students in the Cherokee language
immersion schools in Oklahoma and North Carolina. In 2010, a Cherokee keyboard
cover was developed by Roy Boney, Jr. and Joseph Erb, facilitating more rapid
typing in Cherokee and now used by students in the Cherokee Nation Immersion
School, where all coursework is written in syllabary.[8] The syllabary is
finding increasingly diverse usage today, from books, newspapers, and websites
to the street signs of Tahlequah, Oklahoma and Cherokee, North Carolina.

~~~
afarrell
In specific. I'm familiar with the origin of the Cherokee syllabary in
general, though HN would probably find it enlightening.

------
dominotw
I am an Indian, and it shocks me that Indians are still blaming the British
after 70 years of independence.

Is 70 years of independence not enough to make your language a "first class
citizen"? Of course Bengali is a second-class language: Bengalis didn't
invent the standard.

Can we stop blaming white people for everything? Seriously, WTF.

~~~
MichaelGG
Was British rule actually a net negative, in retrospect? Have there been
studies done using objective criteria (not emotional) over counties that were
colonies versus ones that weren't? I suppose you can't really quantify the
value of people that were destroyed by colonization, but you _can_ look at the
current population.

Also, I just gotta wonder: suppose European or other relatively
simple-to-encode languages didn't exist, and everyone used the OP's language.
How would they have handled advancing computers? I've seen photos of Japanese
typewriters and they look... unwieldy, to say the least. And graphics tech
took a while to get advanced enough to handle such languages, let alone
input. (The MS Japanese IME uses some light AI to pick the desired kanji,
right?)

Disclaimer: I don't mean this in an offensive way, just a dispassionate
curiosity.

~~~
zo1
>" _Was British rule actually a net negative, in retrospect? Have there been
studies done using objective criteria (not emotional) over counties that were
colonies versus ones that weren 't?_"

It's both a positive and a negative. But if you want to calculate whether it
was a "net" positive or "net" negative, then I'm afraid you're going to have
to quantify very difficult-to-quantify concepts. And you'll have to do it on
behalf of a lot of people, some of whom don't exist anymore to answer your
question. In essence, next to impossible.

However, particularly in the case of colonialism, I will add that because
it's such a difficult concept, it is exploited by manipulative scholars and
politicians for personal/political agendas. So you have politicians in these
countries claiming that it was a net negative, without any way to quantify
it, and without acknowledging any good that came along with it. Now, of
course, colonialism was bad, period. It can never be justified to enslave
people, and it should never have happened.

~~~
MichaelGG
Well, that's why I said without taking into account the people who were
destroyed by it. That is, if we look purely at modern-day circumstances, are
they better?

A comparison might be a country with overpopulation. If you kill x% of the
population and enact some birth control plan, then 100 years later you might
end up with a "net positive" discounting the people that were killed and the
emotional effects of their loved ones.

Are there countries in Africa that weren't colonized, or some islands that
weren't, that we could compare to ones that were and make some inferences?

~~~
zo1
[http://www.quora.com/Which-countries-in-Africa-were-never-
co...](http://www.quora.com/Which-countries-in-Africa-were-never-colonized)

Alternatively, you could compare countries that had "more" or "less"
colonialist control/influence, and see if there is any correlation with their
current performance. Though, as with all social-sciency questions, it's very
difficult to isolate the variables to get deep insight into possible causal
links.

------
nemo
I wonder if the author has submitted a proposal to get the missing glyph for
their name added. You don't need to be a member of the consortium to propose
adding a missing glyph or updating the standard. The point of the committee,
as I understand it, isn't to be an expert in all forms of writing, but to
take recommendations from scholars/experts and produce a working
implementation, though more diverse representation of language groups would
definitely be a positive change.

[http://unicode.org/pending/proposals.html](http://unicode.org/pending/proposals.html)

Also, the CJK unification project sounds horrible.

~~~
bbreier
The author's explanation of what characters Chinese, Japanese, and Korean
share is very limited. All three languages use Chinese characters in written
language to varying extents, and in some cases the differences date back
significantly less than a century. Though there are cases where the same
Chinese character as represented in Japanese writing is different from how it
is represented in Traditional Chinese writing (e.g. 国, Japanese version only
because I don't have Chinese installed on this PC), which could be different
still from how it is represented in Simplified Chinese, there are also many
instances where the character is identical across all three languages (e.g.
中). Although I am not privy to the specifics of the CJK unification project,
identifying these cases and using the same character for them doesn't sound
unreasonable.

Edit - To be clear, Korean primarily uses Hangul, which basically derives
jack and shit from Chinese characters, and Japanese uses a mixture of Chinese
characters and two alphabets that sort-of kind-of owe some of their heritage
to Chinese characters but look nothing like them. If they are talking about
unifying these alphabets, then they are out of their minds.

~~~
yummyfajitas
Nor is it unreasonable to "unify" Latin, Greek and Cyrillic:

Cyrillic ПФ vs Greek ΠΦ

Cyrillic АВ vs Latin AB

Obviously using ω for w (as he does) is stupid, but his reductio ad absurdum
is not particularly absurd.

~~~
peterfirefly
Not unifying them means that the fonts automatically work when you mix
text/names written in these alphabets. It also means that
mathematical/physical/chemical stuff (that typically uses Latin and Greek
letters together) will just work. There is a similar reasoning behind all the
mathematical alphabets in Unicode.

Furthermore, Unicode was supposed to handle transcoding from all important
preexisting encodings to Unicode and back with no or minimal loss. Since ISO
8859-5 (Cyrillic) and 8859-7 (Greek) already existed (and both included ASCII,
hence all the basic Latin letters), the ship had definitively sailed on
LaGreCy unification.

On top of that, CJK unification affected so many characters that the savings
really mattered, and it happened at a time when codepoints were only 16 bits,
so it helped squeeze the whole thing in. All continental European languages
suffered equally or worse back when all their letters had to be squeezed into
8 bits /and/ coexist with ASCII.

~~~
masklinn
> Not unifying them means that the fonts automatically work when you mix
> text/names written in these alphabets. It also means that
> mathematical/physical/chemical stuff (that typically uses Latin and Greek
> letters together) will just work.

These are already completely separate symbols. Ignoring precomposition, there
are at least 4 different lowercase omegas in unicode: APL (⍵ U+2375 "APL
FUNCTIONAL SYMBOL OMEGA"), cyrillic (ѡ U+0461 "CYRILLIC SMALL LETTER OMEGA"),
greek (ω U+03C9 "GREEK SMALL LETTER OMEGA") and Mathematics (𝜔 U+1D714
"MATHEMATICAL ITALIC SMALL OMEGA").

------
Htsthbjig
I think the author misses the point completely.

Things like this: "No native English speaker would ever think to try “Greco
Unification” and consolidate the English, Russian, German, Swedish, Greek, and
other European languages’ alphabets into a single alphabet."

The author probably doesn't realize that different European languages used
different scripts until very recently. For example, Gothic and other distinct
scripts were used in lots of books.

I have old books that take something like a week before you can read them
fast, and they are in German!

But it was chaos, and it unified into a single script. Today you open any
book, Greek, Russian, English or German, and they all use the same standard
script, although they include different glyphs. There is a convention for
every symbol. In Cyrillic you see an "A" and an "a".

In fact, any scientific or technical book includes Greek letters as something
normal.

It should also be pointed out that Latin characters are not Latin, but
modified Latin. E.g. lower-case letters did not exist in the Roman Empire;
they were introduced by other languages and "unified".

About CJK: I am not an expert, but I have lived in China, Japan and Korea,
and in my opinion it has been pushed by the governments of those countries
because it has lots of practical value for them.

Learning Chinese characters is difficult enough. If they don't simplify,
people just won't use them when they can use Hangul or kana. With smartphones
and tablets, people there are not handwriting Chinese anymore.

It makes no sense to name yourself with characters nobody can remember.

~~~
Dylan16807
Right, I was going to argue against that too. Changing fonts every letter
will always look weird. There are a lot of different ways to shape these
letters that are all counted as the same.

------
LeoPanthera
This article is imbued with its own form of curious racism. In particular, I
became suspicious of its motives at the line:

> "It took half a century to replace the English-only ASCII with Unicode, and
> even that was only made possible with an encoding that explicitly maintains
> compatibility with ASCII, allowing English speakers to continue ignoring
> other languages."

ASCII compatibility was essential to ensure the adoption of Unicode. It's not
because English speakers wanted to ignore anyone or anything; it's because
without it, Unicode would never have been adopted, and we'd be in a much
worse position today.

In other words, the explanation is technical, not racial.

~~~
chris_wot
It's sort of hilarious that he said that, given that English speakers were
able to encode every character in 7-bit ASCII. The issues around
standardising characters arose because _non-English_ characters were being
squabbled over by the French, Russians and a whole bunch of other
non-English-speaking countries.

In essence, he hasn't understood that ASCII was used as the base for Unicode
because it was widely used. In fact, it's actually ISO 8859-1 that has been
used, because of its wide coverage of a variety of languages, far more than
any of the other 8859-x character sets.

I cannot speak to anything else he's said, aside from saying that trying to
encode _all_ the world's languages is bloody hard.

Even when a limited number of countries try to nut out a standard for 128
characters, it's a nightmare. And don't forget that they were competing with
EBCDIC.

I wrote about it here:

[http://randomtechnicalstuff.blogspot.com.au/2009/05/unicode-...](http://randomtechnicalstuff.blogspot.com.au/2009/05/unicode-and-oracle.html)

------
angersock
I came in expecting to read an article bemoaning some niche language and
playing the diversity card. I was not disappointed, but as I kept reading, the
author made some very good points.

I don't really care that the organization is run by white men who speak
English, because frankly the entire computing industry and telecommunications
industry is based on that. I'm not going to argue about the original sin
there, because the facts speak for themselves: they got shit done.

What I _do_ find really disturbing, though, is that somehow both Linear A and
Linear B are included, in their (as known) entirety, while Bengali is not
fully supported: two scripts spoken and written by nobody for over a
millennium are included, whereas a language spoken and written by millions of
people is shortchanged. That's bad.

The discussion of somehow creating a CJK superset and then deviations to
support each of those languages is worse, and the author's remark about trying
to "unify" Greek, Roman, Cyrillic, and Swedish struck home.

Personally, I think that it's nice to have an ever-decreasing number of
languages to support, preferably English. The first part of that is because
it's annoying enough to unambiguously parse one human language much less
dozens, and the second part is pure convenience and small-mindedness on my
part.

That said, we don't have to buy into diversity or identity or anything else to
note that something here is amiss. This is bad allocation of engineering
resources.

EDIT:

And yes, one could make the argument that computer/telecom hardware is being
produced in Shenzhen, to which I would respond that the intellectual capital
was handled by the West. Nintendo hired SGI; Sony hired IBM. There is
certainly a wonderful and burgeoning domestic talent, but the paths were set
elsewhere, to the best of my knowledge.

~~~
com2kid
Han unification makes things _really_ hard for programmers. You end up with
code that tries to guess what language a string is in, just to pick which
font should be used!

It is an absolute nightmare and a horrid idea.

~~~
cplease
You don't determine language based on codepage. I give you ASCII text; what
language is it?

~~~
com2kid
BINGO.

But because of Han unification, all of a sudden I DO need to know the
language.

The same Unicode code point needs to be rendered differently for a user in
mainland China versus a user in Japan, or else the user may not be able to
read the text! Even if the user can read the character, they are going to
experience a degradation in reading speed and comprehension, and be generally
frustrated. Not to mention that showing the wrong character is insensitive to
the customer's culture, and if I pick just one set of characters and stick to
it, I end up (at least being accused of) promoting cultural hegemony based on
which character set I go with.
~~~
astrange
In what situations do you need to do this, but don't need to show any other
data (dates and times, localized UI, user timezone, culturally appropriate
fonts, RTLness) that involves knowing the user's languages and locale?

This can happen if the user is intentionally reading mixed-language text or
text not in their computer's UI language, of course. In that case different
CJK languages also have different preferred fonts, so having language tagging
or just guessing is pretty important.

~~~
com2kid
> In what situations do you need to do this, but don't need to show any other
> data (dates and times, localized UI, user timezone, culturally appropriate
> fonts, RTLness) that involves knowing the user's languages and locale?

For drawing a given glyph, there is normally a lookup into a font table that
involves solely the string of Unicode code points coming in.

Except if any characters are in the CJK Unified Ideographs range. Then my
function call suddenly has to jump out and read environment variables, which
are hopefully set up correctly.

My code to do a lookup into a font file should not depend upon the user's
environment variables because of a space-saving optimization made two decades
ago.
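To make that concrete, here is a hedged sketch of the special case. The Noto
family names are real fonts, but pick_font and lang_hint are hypothetical
names for illustration, not any actual renderer's API:

```python
# Hypothetical sketch of language-dependent font selection for unified
# ideographs; illustrative only, not a real text-rendering API.
CJK_UNIFIED = range(0x4E00, 0xA000)  # CJK Unified Ideographs block

FONTS_BY_LANG = {
    "ja": "Noto Sans CJK JP",       # Japanese glyph conventions
    "zh-Hans": "Noto Sans CJK SC",  # Simplified Chinese conventions
    "zh-Hant": "Noto Sans CJK TC",  # Traditional Chinese conventions
    "ko": "Noto Sans CJK KR",       # Korean hanja conventions
}

def pick_font(ch, lang_hint=None):
    if ord(ch) in CJK_UNIFIED:
        # The codepoint alone doesn't say which regional glyph to draw;
        # we need out-of-band language metadata, or we have to guess.
        return FONTS_BY_LANG.get(lang_hint, FONTS_BY_LANG["ja"])
    return "Default Sans"  # non-unified codepoints need no hint
```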

~~~
astrange
> For drawing a given glyph, there is normally a lookup into a font table that
> involves solely the string of Unicode code points coming in.

Why are you implementing OpenType? It's got working libraries already.

But if you are getting into that, glyphs in a font are stored by "glyph name",
not necessarily by code point. There's a bunch more steps than that.

- Font substitution: find fonts that cover every character in the text. The
order of your search list depends on the language.

- Text layout and line breaking: for best results, you don't want to line
break in the middle of a word, and you need to place punctuation on the
correct side of right-to-left sentences. I think both of these need
dictionaries.

- Choosing individual glyphs: it's complicated!
[http://ilovetypography.com/OpenType/opentype-features.html](http://ilovetypography.com/OpenType/opentype-features.html)

You have to read the GSUB tables and do a bunch of expected features, like
ligatures, automatic fractions, beginning of word special forms (see Zapfino),
&c. This includes language specific glyphs, but fonts can also just choose
glyphs with a random number generator.

- Drawing the glyph. Remember not to draw each one individually, or a
translucent line of overlapping characters (like in Indian languages) will
look bad.

Each glyph actually comes with a custom program to do the hinting! It's even
more complicated:
[https://developer.apple.com/fonts/TrueType-Reference-Manual/...](https://developer.apple.com/fonts/TrueType-Reference-Manual/RM05/Chap5.html)

Luckily I don't think it depends on much external state.

------
ckoerner
The comparison to the play "My Fair Lady" is not very convincing. I would
suggest the author remove it, as it weakens the argument. First, the
fictional character says 'no fewer than' as an admission that there are more.
Second, even if we consider this a valid complaint, the author himself points
out in his example that this 'common sentiment' is from a play written a
century ago. Using a fictional character from a time when the world's
knowledge of language was vastly smaller than it is today does not help
support his goals.

------
gurkendoktor
The poo emoji is not part of Unicode because rich white old cis-gendered
monolingual oppressive hetero men thought it'd be a fun idea (outrage now!!).
Emoji were adopted from existing cellphone "characters" in Japan, and Japan
is famously lagging in Unicode adoption because some Japanese names cannot
(could not?) be written in Unicode. It all just seems to be very normal
(= inefficient, quirky) design by committee.

------
ticking
Even though I'm not a native English speaker and couldn't write my name in
ASCII, I really despise Unicode.

It's broken technically and a setback socially.

Unicode itself is such an unfathomably huge project that it's impossible to
do it right: too many languages, too many weird writing systems, and too many
ways of putting mathematical notation on paper that can't be expressed. Just
look at the code pages; they are an utter mess.

Computers and ASCII were a chance to start anew, to establish English as a
universal language, spoken by everybody.

The pressure on governments that wanted to partake in the digital revolution
would have forced them to introduce it as an official secondary language.

Granted, English is not the nicest language, but it is the best candidate we
have in terms of adoption and relative simplicity (Mandarin is another
contender, but several thousand logograms are really impractical to encode).

Take a look at the open source world, where everybody speaks English and
collaborates regardless of nationality. One of the main reasons this is
possible is that we found a common language, forced on us by the tools and
programming languages we use.

If humanity wants to get rid of wars, poverty and nationalism, we have to
find a common language first.

A simple encoding and universal communication are a feature; fragmented
communication is the bug.

Besides, UTF-8 is broken because it doesn't allow constant-time random
character access and length counting.

~~~
krdln
Why do you think English is the best candidate for the universal language?
How do you define simplicity? First of all, pronunciation and spelling are
almost unrelated, and you have to learn them separately. That results in
really different accents throughout the world. Even if you look at AmE and
BrE, they differ a lot at the word level. Which one do you want to choose?
Besides, I personally find English really ambiguous and the density of idioms
in an average text repelling, although that's only a subjective opinion.

English's use of the Latin alphabet seems like a plus, but there's at least
one language that uses that simple alphabet better.

> Besides, UTF-8 is broken because it doesn't allow constant-time random
> character access and length counting.

And why would you want that? And how do you define length? Are you a troll?

~~~
ticking
English is the best candidate because it has the second-largest user base
(1.2 billion vs 1.3 billion for Mandarin)
[http://en.wikipedia.org/wiki/List_of_languages_by_total_numb...](http://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers)
and is spoken twice as much as the third most popular language, Spanish (0.55
billion).

If I got to pick the universal language, it would be Lojban (a few hundred
speakers), but that is not a realistic goal; teaching the other 6 billion
people a language that is already spoken by 1/7th of the population is at
least plausible.

> Why would you want that...

Why would you not want that?! Many popular programming languages are based on
array indexing through pointer arithmetic; having a variable-width encoding
there is a horrible idea, because you have to iterate through the text to get
to an index.

Length is the number of characters, which is just the number of bytes in
ASCII, but has to be calculated by looking at every character in UTF-8.

~~~
krdln
Even if 1.2 billion seems like a lot, that's still a small fraction of the
world's population, so any choice of universal language would force the
majority of the world to learn a new one. That's why I think winning the
popularity contest is a poor argument; we should instead focus on things like
simplicity (which I don't find in English), speed of learning, consistency,
expressiveness, etc. I'd be happy to use Lojban (it's easier for machines
too, I guess) or any other invented language. If I had to pick one of the
popular ones, I'd like Spanish more than English.

I was asking what your specific use cases are that forbid you from treating a
UTF-8 string as a black-box blob of bytes. If you're dealing with
international text, you'd rather use predefined functions. If you want to
limit yourself to ASCII, just do it and simply don't touch bytes >= 0x80.

And what is a character? Do you mean graphemes or codepoints? Or something
else? A few years ago I thought like you: that calculating length is a useful
feature. But most often, when you think about your use case, you realise
either that you don't need length or that you need some other kind of length,
like monospace width, rendered width, or some kind of entropy-based amount of
information. Twitter is the only case I know of where you really want to
count "characters", and I find it really silly: e.g. a Japanese tweet vs. an
English tweet.
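The "which length?" point is easy to demonstrate; a minimal sketch with
Python's standard library:

```python
import unicodedata

nfc = "\u00E9"                           # é as one codepoint
nfd = unicodedata.normalize("NFD", nfc)  # e + combining acute accent

print(len(nfc), len(nfd))                # 1 2  (codepoints)
print(len(nfc.encode("utf-8")),
      len(nfd.encode("utf-8")))          # 2 3  (bytes)

# Both display as the single grapheme "é", so "length" already has three
# plausible answers before display width even enters the picture.
```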

~~~
ticking
With Unicode, these predefined functions have to be large and complex. You
won't be able to use them on embedded systems, for example.

------
vacri
It sounds like the author is looking to be offended. Talking about Bengali
being the seventh-largest native language, then saying that US$18k is too
expensive for a stake in solving this problem? Emoji with skin tones,
something every human can use, arriving before a character that only Bengalis
use is taken as an "_outright insult_"?

~~~
zhemao
> Talking about Bengali being the seventh-largest native language, then
> saying that US$18k is too expensive for a stake in solving this problem?

West Bengal and Bangladesh aren't exactly the richest places in the world.

> Emoji with skin tones, something every human can use, arriving before a
> character that only Bengalis use is taken as an "outright insult"?

Nobody is being inconvenienced by the inability to send emojis with a darker
skin tone. People definitely are being inconvenienced by not being able to
write a common character of their native language.

~~~
wbkang
$18k is really a small amount of money for any governmental organization,
including those places. Even North Korea participates in the process.

~~~
webitorial
Any Bengali scholar can join the work for $75.

------
theon144
I don't understand; I don't feel like character combination using the
zero-width joiner is on the same level as 13375p34k. It looks like the
character just doesn't have a separate codepoint but is instead a composite,
while still being technically "reachable" from within Unicode, no?

~~~
cosarara97
It's like typing ` + o to get "ò", isn't it? You can argue that ò is actually
an o with that tilde, while that character is not ত + ্ + an invisible joining
character, but that's an input method thing, and there is a ৎ character after
all.

~~~
jamie_ca
They're on the same key on my keyboard, but ` is a grave, ~ is a tilde.

------
hawkice
Getting rid of CJK unification would better model actual language change in
the future (France, for instance, has a group that keeps a rigorous
definition of the French language up to date; I would enjoy giving them a
subset of the codes to define how to write French).

But the general principle sounds odd. Should 家, the simplified Chinese
character, and 家, the traditional Chinese character, have different
codepoints? Should no French be written using characters with lower, English
code points because of their need for a couple of extra standard characters?
Should Latin be written using a whole new set of code points even though it
needs no code points not contained in ASCII?

~~~
tjradcliffe
There was an academic proposal in the '90s for something called "multicode"
(IIRC) that did exactly this: every character had a language associated with
it, so there were as many encodings for "a" as there were languages that used
"a", and all of them were different; or at least every character was somehow
tagged so that the language it "came from" was identifiable.

Fortunately, it never caught on.

The notion that some particular squiggle "belongs" to one culture or language
is kind of quaint in a globalized world. We should all be able to use the same
"a", and not insist that we have our own national or cultural "a".

The position becomes more absurd when you consider how many versions of some
languages there are. Do Australians, South Africans and Scots all get their
own "a" for their various versions of English? What about historical
documents? Do Elizabethan poets need their own character set? Medieval
chroniclers?

Building identity politics into character sets is a bad idea. Unifying as much
as practically possible is a good idea. Every solution is going to have some
downsides, some of them extremely ugly, but surely solutions that tend toward
homogenization and denationalization are to be preferred over ones that enable
nationalists and cultural isolationists.

~~~
RodericDay
> Building identity politics into character sets is a bad idea. Unifying as
> much as practically possible is a good idea. Every solution is going to have
> some downsides, some of them extremely ugly, but surely solutions that tend
> toward homogenization and denationalization are to be preferred over ones
> that enable nationalists and cultural isolationists.

glib white supremacists are the best kind of white supremacists. "it's
progress!"

------
kalleboo
Re: the Han Unification debate that's going on in parallel here,

I think CJK unification makes sense from the Unicode point of view (although
if they had to choose again after the expansion beyond 16 bits, I doubt
they'd bother with the political headache). The problem stems from the fact
that only a few high-level text storage formats (HTML, MS Word, etc.) have a
way to mark what language some text is in. There's no way to include both
Japanese and Chinese in a git commit log, or in a comment on Hacker News.

Sure you can say "that's just the problem of the software developer!" but
that's what we said about supporting different character sets before Unicode,
and we abandoned that attitude. Hacker News is never going to add that feature
on their own.

What's needed is either a "language switch" character in Unicode (like the
right-to-left/left-to-right ones) or a layer on top of Unicode that
implements one and gets universally adopted in OSes and rendering systems.

------
drawkbox
It is much easier to criticize than to fix.

While it is good to bring awareness to this, we are still growing in this
area. In fact, we should applaud the efforts that have gotten us a standard
that somewhat works for most of the digital world. Does it need to evolve
further? Yes.

I am sure the engineers and multilingual people who stepped up to do Unicode
and organize it aren't trying to exclude anyone. Largely it comes down to who
has been investing the time. It may even be easier to fund and move this
along further now; it was hard to fund anything like this before the internet
really hit, when software evolution and standards largely relied on corporate
funding.

In no way should the engineers or the group that got us this far (the UC) be
chided or lambasted for progressing us to this step; this is truly a case of
no good deed going unpunished.

~~~
webitorial
Generally true, but the problem here is not an issue of bandwidth or racism.
Unicode can represent this character, but does so with two codepoints, a
technical decision the author doesn't feel is useful. He blames this on the
dominance of white people in the work (a questionable assumption, given that
he didn't link to the extensive list of international liaisons and
international standards bodies). The participants in Unicode, including a
native Bengali speaker who responded above, considered the argument presented
but chose a different path to be consistent with how other characters are
treated. The author needs to more carefully distinguish the codepoint, input,
and rendering issues raised in his argument.

------
josephschmoe
Am I the only person who thought unifying the Greco-Roman language characters
actually sounds like a good idea?

~~~
mmastrac
Yeah, I thought that "No native English speaker would ever think to try “Greco
Unification”" was a poor argument. It seems like a reasonable idea.

~~~
Someone1234
I think their argument was: the characters look the same (e.g. the first
letter of the Russian alphabet and the English A) but have different meanings.

So in this example, if you searched for the English word "Eat", that string
can also be typed entirely in Cyrillic (Е, А, and Т exist in both alphabets),
yet it means nothing remotely similar in Russian.

I don't know if they're right or wrong; I am just saying that might be the
point they were trying to make. You could make a Greco-unified Unicode set
and it would work fairly well, but you might wind up with some confusing edge
cases where it isn't clear what language you're reading (literally).

This could be particularly problematic for automation (e.g. language
detection), since in some situations any Greco-like language could look
similar to any other (particularly as the text gets shorter).
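Those lookalikes are distinct codepoints today, which is what a homoglyph
check keys off; a quick sketch with Python's standard unicodedata module:

```python
import unicodedata

latin, cyrillic = "EAT", "ЕАТ"  # visually identical in most fonts
print(latin == cyrillic)        # False: the codepoints differ

for a, b in zip(latin, cyrillic):
    print(f"U+{ord(a):04X} {unicodedata.name(a)}"
          f"  vs  U+{ord(b):04X} {unicodedata.name(b)}")
# LATIN CAPITAL LETTER E vs CYRILLIC CAPITAL LETTER IE, and so on.
```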

~~~
josephschmoe
English, French, German, Italian, Spanish and several other European languages
have mostly identical character sets and even large numbers of similar or
identical words. Computers detect these languages just fine. I think we'll be
okay.

------
wbkang
I see many comments about Han unification being a bad idea, but I am not
seeing any reason why it was such a bad idea. I am from a CJK country and I
find that it makes a lot of sense. The most commonly used characters should
be considered identical regardless of whether they are used in Chinese,
Japanese, Korean, or Vietnamese. Sure, there are some characters that are
rendered improperly depending on your font, but I don't think that makes Han
unification a fundamentally bad idea.

~~~
iso8859-1
Is it really dependent on the font, or on some language metadata? Having it
depend on the font seems stupid, since a font should ideally be able to
represent any language whose script is encodable using Unicode.

~~~
chris_wot
You have the world's greatest username and I was wondering when you would turn
up!

------
qntm
> He proudly announces that there are ‘no fewer than 147 Indian dialects’ – a
> pathetically inaccurate count. (Today, India has 57 non-endangered and 172
> endangered languages, each with multiple dialects – not even counting the
> many more that have died out in the century since My Fair Lady took place)

So, how many were there really? At the time, I mean.

~~~
pc2g4d
I believe the number of "dialects" named in My Fair Lady can be largely
explained by the lack of a clear distinction between language and dialect
over the years. From [1]: "There is no universally accepted criterion for
distinguishing a language from a dialect. A number of rough measures exist,
sometimes leading to contradictory results. The distinction is therefore
subjective and depends on the user's frame of reference."

Getting upset about Henry Higgins's estimation of the number of Indian
"dialects" in a play from many decades ago doesn't make sense to me. His
character was deliberately portrayed as a regressive lout, and terminology has
surely changed in the intervening years.

[1]:
[http://en.wikipedia.org/wiki/Dialect#Dialect_or_language](http://en.wikipedia.org/wiki/Dialect#Dialect_or_language)

~~~
chris_wot
It also doesn't make sense to criticize the Unicode Consortium for an
inaccurate count quoted from a playwright who wrote a play a century ago.

------
Torgo
Minoan script and Mormon Deseret are in there because somebody stepped up.

------
yarper
I Can Text You A Pile of Poo, But I Can’t Write My Name ... - by Aditya
Mukerjee on March 17th, 2015

Which glyph is missing from this?

I know it's not ideal, but some uncommon glyphs have always been omitted from
charsets; for example, ASCII never included
[http://en.wikipedia.org/wiki/%C3%86](http://en.wikipedia.org/wiki/%C3%86),
and it was replaced by "ae" in common usage.

[http://en.wikipedia.org/wiki/Hanlon%27s_razor](http://en.wikipedia.org/wiki/Hanlon%27s_razor)

~~~
fixermark
This is also why we've (almost) all taken to writing résumé as resume.

We've even managed to build text-based search engines that do a pretty decent
job of guessing which one we mean.

------
outworlder
> He proudly announces that there are ‘no fewer than 147 Indian dialects’ – a
> pathetically inaccurate count.

Wow. How can a country function like this? Is everyone proficient in their
native language plus a 'common' one, or are all interactions supposed to be
translated inside the same country? Regardless of historical and cultural
value, if that's the case, it seems... inefficient.

I do realize that there are more countries like this, but the number of
languages seems way too high. I am really curious how that works.

~~~
masklinn
> How can a country function like this? Is everyone proficient in their native
> language plus a 'common' one

Yes, and there may be multiple "common" languages if the country is
large/populated enough or for historical reasons (Switzerland has 4 official
languages, though official acts only have to be provided in 3 of them: German,
French, and Italian; and they only have 8 million people).

> Regardless of historical and cultural value, if that's the case, it seems...
> inefficient.

Well, telling people to fuck off with their generations-old gobbledygook and
imposing a brand new language on them tends to make them kind of restless, so
unless you're willing to assert your declarations in blood (or at least in the
specific suppression of non-primary languages)…

The latter has happened quite a bit: post-revolutionary France tried very
hard to stamp out both dialects and non-French languages until very recently
(the EU shows a distinct lack of appreciation for the finer points of
linguicide and linguistic discrimination), and the UK did the same throughout
the Empire.

> I do realize that there are more countries like this

Almost all of them, aside from former British colonies (where native languages
were by and large eradicated), at least to an extent (many countries carried
out linguistic unification as part of their rise as nation-states, with
varying levels of success).

------
lxe
> No native English speaker would ever think to try “Greco Unification” and
> consolidate the English, Russian, German, Swedish, Greek, and other European
> languages’ alphabets into a single alphabet.

This actually is a pretty good idea. Cyrillic, Latin, and other "Greco"
scripts share quite a lot of characters. There's no need for both А
([http://www.fileformat.info/info/unicode/char/0410/index.htm](http://www.fileformat.info/info/unicode/char/0410/index.htm))
and A
([http://www.fileformat.info/info/unicode/char/0041/index.htm](http://www.fileformat.info/info/unicode/char/0041/index.htm))
beyond ASCII and other legacy compatibility.
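
For what it's worth, the standard deliberately keeps them apart: even
compatibility normalization (NFKC), which folds away many lookalike
distinctions, does not map Cyrillic А onto Latin A. A quick illustrative check
in Python:

    import unicodedata

    print(unicodedata.name("\u0041"))  # LATIN CAPITAL LETTER A
    print(unicodedata.name("\u0410"))  # CYRILLIC CAPITAL LETTER A
    print(unicodedata.normalize("NFKC", "\u0410") == "\u0041")  # False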

~~~
TazeTSchnitzel
Well, it depends. Do you unify those with identical glyphs but different
origins? Do you unify only those with identical glyphs and the same origins?

------
a2tech
Ok, we all understand it sucks. What's the fix? You think Unicode is racist
and terrible. It's stupidly difficult to work with, for sure, but "racist" is
a stretch.

What do you propose? More complexity layered on a system that people don't
understand isn't really a fix.

~~~
VLM
Also, the organizational question of how to open the floodgates for unpopular
languages while still keeping Klingon and Tolkien Elvish out is something of a
mystery.

------
agumonkey
I'm not that surprised Unicode includes very shallow code points (pile of
poo), because nobody really cares about those. Bengali, on the other hand,
requires getting things right in order to satisfy its billion-plus users.

------
PSeitz
The problem here is not that these people are white; it's the languages they
don't speak.

~~~
webitorial
And a native Bengali speaker has discussed his input into the Unicode
discussions and why he disagreed with the author. What do you know about the
linguistic input into the work or the participants? Did you check the huge
international lists of liaisons that the author ignored? It costs $75 to join
Unicode.

------
azinman2
This is one of the most interesting HN post+comments I've read yet, in part
because it mixes technology with culture and history. It also takes advantage
of the diverse HN community and their own native languages.

As an American (English speaker) who has studied French, Hebrew and Japanese,
I can appreciate the complexity of input balanced against standards and the
needs of programmers.

It's a fucking hard problem, but I don't think the Unicode consortium is the
right place to lay the blame. They seem to be doing a reasonably good job of
trying to get everything in, and they really do need input from outsiders to
do this well. It requires people with linguistic & technical backgrounds,
which is probably why random governments may have a harder time providing
input.

Further, from all the points people are making about
uppercase/lowercase/hyphenation across languages, it sounds to me like there
really needs to be a super-standardized open source implementation of the
things you want to do with text, not just purely encode it. I don't think that
exists, and it might be a good place for the unicode people to branch out to.
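
One concrete example of why such a shared implementation would matter: even
uppercasing is language-dependent, and a locale-blind routine quietly gets it
wrong. A small Python sketch of the well-known Turkish "i" case (illustrative;
ICU-style libraries handle this, the bare standard library does not):

    # str.upper() knows nothing about locales:
    print("i".upper())  # "I" -- correct for English
    # In Turkish, "i" uppercases to "İ" (U+0130, LATIN CAPITAL LETTER I
    # WITH DOT ABOVE), so "I" is the wrong answer for Turkish text.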

------
bane
Out of curiosity, before Unicode came along what was the state of the art for
encoding/writing/displaying Bengali?

Sometimes I think issues with Unicode might arise because it's trying to solve
problems for languages that haven't yet arrived at a good solution for
themselves.

Latin-using languages ended up going through a very long orthographic
simplification and unification process after centuries of divergent
orthographic development. These changes all occurred to simplify or improve
some aspect of the language: improve readability, increase typesetting speed,
reduce typesetting pieces. Early personal computers even did away with
lowercase letters and some punctuation marks completely before they were
grudgingly reintroduced.

I'm more familiar with Hangul (Korean), which has sometimes complex rules for
composing syllables but has undergone fairly rapid orthographic changes once
it was accepted into widespread and official use: dropping antiquated or
dialectal letters, dropping antiquated tone markers, spelling revisions, etc.
In addition, Chinese characters are now completely phased out in North Korea
and are rapidly disappearing in the South.

It's my personal theory that the acceleration of this orthographic change has
had to do with the increased need to mechanize (and later computerize)
writing. Koreans independently came up with workable solutions to encode,
write and display their system, sometimes by changing the system, sometimes by
figuring out how to mechanically and then algorithmically encode the complex
rules for their script. It appears that a happy medium has been found and
Korean users happily pound away at keyboards for billions of keystrokes every
year.

I'm digressing here, but pre-Unicode, how had Bengali users solved the issue
of mechanically and algorithmically encoding, inputting and displaying their
own language? Is it just a case that the Unicode team is ignorant of these
solutions and hasn't transferred this hard-earned knowledge over?

(note, I came across this page as I was writing this; how suitable was this
solution?
[http://www.dsource.in/resource/devanagari-letterforms-history/devanagari_letterforms/letterforms_for_typewriter.html](http://www.dsource.in/resource/devanagari-letterforms-history/devanagari_letterforms/letterforms_for_typewriter.html))

I'm asking these questions out of ignorance, of course; I don't know the story
on either side.

On a second point, I'm deeply concerned about Unicode getting turned into a
font repository for slightly different (or even identical) characters that
just happen to end up in another language. For example, Cherokee uses quite a
few Latin letters (and numbers), plus quite a few original inventions. Is it
really necessary to store all the Latin letters it uses again? Would a reader
of Tsalagi really care too much if I used W or Ꮃ? When does a character go
from being a letter to being a specific depiction of that letter?

------
4bpp
I'd imagine that from the point of view of a Unicode consortium member, the
question as to whether to include a particular Bengali glyph some argued to be
obsolete looks more like "should we lower the threshold for what characters in
a language are considered deserving of a separate codepoint, potentially
exposing ourselves to a deluge of O(#codepoints/#characters per language)
requests for obscure variant characters until we actually run out of them",
whereas the question as to whether to include a sign for poop in the end boils
down to "can we spare O(1) codepoints to prove to the world that we are not
humourless fascists". The particular decision in this case might well be ill-
informed, but I think any judgement that the Unicode Consortium is engaging in
cultural supremacism (as opposed to doing the usual thing of wanting Anglo-
American capitalist money) is somewhat far-fetched.

The right solution, I think, would be to replace Unicode with a truly
intrinsically variable-length standard such as an unbounded UTF-8. Many of the
arguments that were fielded in favour of having the option of fixed-width
encodings seem to have melted away now that almost everything that interfaces
with users has a layer of high-level glue code, naive implementations of
strings have been deemed harmful to security, and even ostensibly "embedded"
platforms can get away with Java. Rather than having an overwhelmed committee
of American industrialists decide the fate of every single codepoint, then,
they could simply allocate a prefix to every catalogued writing system and
defer the rest to non-technical authorities whose suggested lists would only
require basic rubber-stamp sanity-checking.
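
As a sketch of what "unbounded UTF-8" could mean, here is the familiar
length-prefix scheme with the 4-byte cap simply removed (illustrative Python,
not any existing standard; the lead byte runs out of bits after 7-byte
sequences, so a truly unbounded variant would need one further convention):

    def encode_extended_utf8(cp):
        # ASCII stays a single byte with the high bit clear.
        if cp < 0x80:
            return bytes([cp])
        payload = []
        n = 1  # total length of the sequence in bytes
        while True:
            n += 1
            payload.insert(0, 0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
            cp >>= 6
            if cp < (1 << (7 - n)):  # remaining bits fit in the lead byte?
                break
        # Lead byte: n leading 1-bits, a 0, then the top bits of the codepoint.
        return bytes([((0xFF00 >> n) & 0xFF) | cp] + payload)

    # Within Unicode's current range this agrees with real UTF-8:
    assert encode_extended_utf8(0x1F4A9) == b"\xf0\x9f\x92\xa9"  # U+1F4A9 PILE OF POO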

> "You can't write your name in your native language, but at least you can
> tweet your frustration with an emoji face that's the same shade of brown as
> yours!"

This seems fairly characteristic of the apparent belief of social justice
activists - and I can't imagine that Unicode's inclusion of skin colours would
be a result of anything other than pressure by the same - that they can
improve the whole world with remedies conceived against the background of US
American race/identity politics.

~~~
webitorial
Had nothing to do with proving they were not humorless fascists. There was a
legitimate need for a universal codepoint among Japanese cellphone operators.
Keep in mind that Japanese is a language where entire concepts are frequently
represented in a single character, so this isn't perhaps as odd as you think.
"Poo" had a specific semantic in "cellphone Japanese" that the market
demanded. To be used interoperably with other Unicode characters, various
'emoji' were added to Unicode. I, for one, retain the right to be called a
humorless fascist.

------
ramviswanadha
The author's argument, that "every letter in the English alphabet is
represented, so why not every letter/grapheme in Bengali/Tamil/Telugu/name
your language", is specious at best.

~~~
zhemao
Except that the whole purpose of Unicode is to create a character encoding
that "enables people around the world to use computers in any language"
(taken from the Unicode Consortium website). Bengali is also not an obscure
language. It is the 10th most spoken language in the world and the national
language of Bangladesh.

~~~
zhemao
And besides, the argument isn't "Why aren't all Bengali characters represented
when all English characters are?" The argument is "Why aren't all Bengali
characters represented when a pile of poo is?"

~~~
webitorial
No, the argument is "Why didn't the Unicode authors make the same technical
choice I would have, based on my limited knowledge of this topic -- including
not knowing that the pile of poo was created for use in the Japanese market;
not knowing that I could participate myself for just $75, not the $18,000
claimed in my story; not understanding the nuances of international
standardization; and not reviewing the list of international liaisons to the
Unicode organization, where much of the language-specific work is done?"

------
kaptain
Most of the discussion here centers on how linguistic commitments ought to
drive decision making in determining the Unicode spec. It's already been
covered how Unicode provides a number-to-symbol (i.e. codepoint) mapping;
composition, rendering, input, etc. are left to the system implementation to
determine.
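
To make that division of labor concrete, here is an illustrative Python look
at what Unicode actually stores for the conjunct discussed elsewhere in this
thread; shaping those three codepoints into a single glyph is the font and
rendering engine's job, not the standard's:

    import unicodedata

    for ch in "ত্য":  # the conjunct that ends আদিত্য
        print("U+%04X" % ord(ch), unicodedata.name(ch))
    # U+09A4 BENGALI LETTER TA
    # U+09CD BENGALI SIGN VIRAMA
    # U+09AF BENGALI LETTER YA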

I actually worked with Lee Collins (he was my manager) and a whole bunch of
the ITG crew at Apple in the early 00's. The critique that this was just a
bunch of white dudes that only loved their own mother tongues but studied
other languages dispassionately as a superficial proof of their right to
determine an international encoding spec is, to me, misinformed and DEEPLY
offensive.

I had just gotten out of college and it was AMAZING to see how much these
people loved language! Like it was almost weird. Our department had a very
diverse linguistic background, and oftentimes, when you had a question about a
specific language property, you could just go down the hall and ask, "Is this
the way it should be?"

All the discussion here happened on the Unicode mailing lists as well as in
passing at lunch and at team meetings. Lots of people felt VERY passionately
about particular things; I just liked watching the drama. But to write an
article like this that somehow intimates that people didn't care is wrong.
People cared A LOT.

It's been touched on in a couple of comments, but a big factor in Unicode was
also the commercial viability of certain decisions. You have Apple, Adobe,
Microsoft, etc., with all of their previous attempts at
internationalization/localization. If you wanted these large companies to
adopt a new standard, you had to give them a relatively easy pathway to
conversion and adoption.

I think the article, in general, lacks perspective and dismisses the work of
the Unicode team as well as the different stakeholders. Historically,
accessible computing was birthed in the United States. The tools and processes
are naturally Ameri-centric. I'm not saying it's right, I'm just saying it is.
The fact that the original contributors to Unicode were (mostly?) American
shouldn't be a surprise; they had the most at stake in creating interoperable
systems that spanned linguistic boundaries.

[http://www.unicode.org/history/earlyyears.html](http://www.unicode.org/history/earlyyears.html)

A purely academic solution may have been better. That's not certain. But it's
pretty clear that any solution that didn't address the needs of pre-existing
commercial systems would never be adopted. I'm surprised this hasn't been
emphasized more.

I have more thoughts on how to solve the problem that the author complains
about but I'll leave that for another day.

------
ramgorur
I can type anything in Bengali without any issue -- হঠাৎ, কিংকর্তব্যবিমূঢ়,
সংশপ্তক, বর্ষাকাল...

Not sure about other "second class" languages, but this open source phonetic
(Unicode) keyboard is extremely popular in Bangladesh, and everyone does all
sorts of Bengali stuff with it --

[https://www.omicronlab.com/avro-keyboard.html](https://www.omicronlab.com/avro-keyboard.html)

------
olgeni
I searched for "white" and of course I found a few, as expected.

------
robmccoll
"in My Fair Lady (based on George Bernard Shaw’s Pygmalion)" which of course
draws on Ovid's Metamorphoses poem about Pygmalion from ancient Greek
tradition.

------
jkot
I don't speak Bengali, but some European languages have a similar problem. In
the Czech language, the letter "ch" is a single character with its own place
in the alphabet.
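
A small illustration of what "its own place in the alphabet" means in practice
(Python; assumes a cs_CZ locale is installed, which is system-dependent):
under Czech collation, words beginning with "ch" sort after words beginning
with "h".

    import locale

    words = ["cena", "chata", "cibule", "hrad"]
    locale.setlocale(locale.LC_COLLATE, "cs_CZ.UTF-8")
    print(sorted(words, key=locale.strxfrm))
    # ['cena', 'cibule', 'hrad', 'chata'] -- "ch" collates as its own letter after "h"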

------
6t6t6
What is amazing here is that everybody is talking about the character "ৎ"
instead of talking about the real issue.

I guess that the main point of the article is that the Unicode Standard is
being decided by a bunch of North American and European people, instead of
being decided by the speakers of the languages Unicode is intended for.

In my opinion, for instance, Han unification is a botch, and no Japanese user
would consider that it makes sense.

------
sirseal
Let's use UTF-16:
[http://en.wikipedia.org/wiki/UTF-16](http://en.wikipedia.org/wiki/UTF-16).
Problem solved.

------
kenko
Why was the title changed?

~~~
elchief
Because poo

------
volune
Get your hundreds of millions of native speakers off their asses and solve the
problem! The current internet is in Roman glyphs because that's what all the
people doing the work speak!

~~~
zhemao
Given the number of Indians working in the software field, I think it's safe
to say that there are quite a few Bengali speakers "doing the work" to keep
the internet running. Actually, I'd posit that the majority of programmers are
not native English speakers.

------
anon4
Or maybe you could start writing using a normal alphabet.

------
enupten
I couldn't agree with you more on this!

But there is more to it when it comes to India: it is mostly that "nobody
really cares".

It is sad that while India develops, it is rapidly leaving its many languages
behind when it comes to the computer; indeed, while Chinese/Arabic keyboards
are extremely common, one would be hard pressed to find Hindi ("Nagari"), let
alone Bengali, keyboards in India. This is despite the nauseating linguistic
jingoism in the country's political history. There has been, sadly, very
little to show for it, and things have in fact gotten worse since
independence, as with many things; see
[http://www.columbia.edu/cu/mesaas/faculty/directory/pollock_pub/classics.pdf](http://www.columbia.edu/cu/mesaas/faculty/directory/pollock_pub/classics.pdf)

I would not be surprised, at this rate, if India turns into a monoculture two
centuries hence.

P.S.: Speaking of representation, the script the OP talks of belongs to the
set of phonetically accurate scripts, one which, while populated mostly by
Indic scripts, is given (quite disgracefully) the name "abugida", after a lone
Semitic script, Ge'ez of Ethiopia, which ironically is probably derived from
one of the Indic ones. Systematic biases are far too apparent in Indology.

------
pavlov
By the title, I assumed this would be an article about how 2-year-olds can
learn to do some interesting stuff on a smartphone.

[Edit -- Sorry about this lame joke. I've suffered the consequences in
downvotes.]

~~~
spiritplumber
I thought it was going to be about illiteracy.

------
marincounty
My English professor biggest pet peeve. He told us day one.

It was:

1\. Don't use cliches in your writing. 2\. Don't ever use emoji in any
communication.

p.s. Emoji wasen't even around when I went to school. I bet Mr. Taylor woukd
have had a field day though? Never been the fan of any smiley face, even
rotated 90 degrees.

