
Known anomalies in Unicode character names (2017) - fanf2
http://unicode.org/notes/tn27/
======
adontz
I am not sure if native speakers are involved in Unicode standardization at
all.

Personal example.

In Georgian (a few million native speakers) we have historically had three
different independent alphabets: mtavruli, nuskhuri and mkhedruli. All
represent the same letters, but with different ways of writing them. The
alphabets are represented in Unicode under different code points. Only
mkhedruli is in current use (and has been for the last few centuries), and
therefore most fonts have glyphs for mkhedruli only.

Also, there is no concept of capitalization in Georgian. There is no "a" and
"A"; we have only "ა". One does not start a sentence, a word or a personal
name with some variant of a letter. Anna is ანნა in Georgian; the first and
the last letters are written the same way.

Now, for some reason, the letters of the mtavruli alphabet are registered as
the uppercase versions of the letters of the mkhedruli alphabet. Mind blown.

And since well-written software uses the rules defined by the Unicode
standard, via the ICU library or some platform-specific routines, many desktop
and mobile applications, as well as major websites, are hard to use. The first
letters of sentences are capitalized, but since mtavruli glyphs are missing
from fonts, they are rendered as ⍰.

So instead of

მაგათი დედაც ვატირე.

we see

⍰აგათი დედაც ვატირე.

and have to guess what the first letter was. But who cares, if it is not used
in the U.S., right?

~~~
CharlotteBuff
The _uppercase_ mapping of Mkhedruli is Mtavruli in Unicode, but the
_titlecase_ mapping is not. If you use the uppercase mapping to automatically
capitalise the first letter of a sentence, you are not following the
recommendations of the Unicode standard.
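
A minimal Python sketch of the distinction, assuming a Python build whose
Unicode data includes Mtavruli (Unicode 11 or later). The helper names
`titlecase_first` and `uppercase_first` are made up for illustration; only
`str.title()` and `str.upper()` are real library calls.

```python
def titlecase_first(text: str) -> str:
    """Capitalise the first character using the titlecase mapping."""
    if not text:
        return text
    # str.title() on a single cased character applies its titlecase mapping.
    return text[0].title() + text[1:]


def uppercase_first(text: str) -> str:
    """The common mistake: use the uppercase mapping for the first character."""
    if not text:
        return text
    return text[0].upper() + text[1:]


name = "ანნა"
# With the titlecase mapping, Mkhedruli should stay Mkhedruli.
print(titlecase_first(name) == name)
# With the uppercase mapping, the first letter becomes a Mtavruli code point.
print(hex(ord(uppercase_first(name)[0])))

# The two mappings also differ for Latin digraphs such as U+01C6.
dz = "\u01c6"
print(hex(ord(titlecase_first(dz))), hex(ord(uppercase_first(dz))))
```

If the second line prints a code point in the U+1C90 block, a font without
Mtavruli glyphs would show exactly the missing-glyph box described above.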

~~~
techdragon
But how many programmers are even aware of the fact Unicode has both an
Uppercase and Titlecase mapping? Let alone that they can do different things!

I routinely see code that treats titlecase as “set individual character in
this Unicode string to its upper case equivalent”... It wouldn’t surprise me
at all to find out this kind of titlecase vs uppercase mistake is a _very_
widespread issue.

~~~
TeMPOraL
I for one never heard of "titlecase" before. I guess that's what I get from
learning text encoding from APIs exposed in programming languages.

------
dwheeler
The use of the obscure word "solidus" instead of "slash" for the character "/"
is absurd. The term "slash" should at least be a formal alias, if it isn't
already. This text seems to imply it's not a formal alias, but maybe it is.

After all, the POSIX standard includes slash as a name for the character, and
that is an ISO standard:
[https://pubs.opengroup.org/onlinepubs/9699919799/](https://pubs.opengroup.org/onlinepubs/9699919799/)
What's more, Unicode refers to "\" as "backslash".

Very weird.
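
For anyone who wants to check what their own tools report, here is a rough
Python sketch using the standard `unicodedata` module. It only sees formal
names and formal aliases; the informal aliases printed in the code charts are
not exposed by this module. The helper `resolves` is a made-up name.

```python
import unicodedata


def resolves(name: str) -> bool:
    """True if `name` is a name or formal alias known to this Python build."""
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False


# Formal names of the two characters under discussion.
print(unicodedata.name("/"))
print(unicodedata.name("\\"))

# Check which spellings the lookup table accepts.
print(resolves("SOLIDUS"), resolves("SLASH"))
```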

~~~
reaperducer
_What's more, Unicode refers to "\" as "backslash"._

I'd be happy if _nobody_ ever used the word "backslash."

I heard it read in a URL in a TV commercial this week. I can't believe it.
People have been getting that wrong since AOL days.

~~~
rplst8
What would you prefer then?

~~~
reaperducer
They simply don't belong in web addresses.

------
cooper12
Considering how huge the Unicode standard is, it's more surprising that there
are so few (known) errors. CJK characters alone account for thousands upon
thousands, and these can sometimes vary by just a single stroke. I suspect the
majority of redundancies or suboptimal choices were the result of subsuming so
many existing standards though. There might also be plain research errors, but
those are probably in the more obscure blocks.

See also, ghost kanji:
[https://www.japantimes.co.jp/life/2018/10/29/language/ghost-...](https://www.japantimes.co.jp/life/2018/10/29/language/ghost-kanji-lurk-japanese-lexicon/)

~~~
duskwuff
> CJK characters alone account for thousands upon thousands

The vast majority of CJK characters have systematic names like "CJK UNIFIED
IDEOGRAPH-72AC" which aren't really subject to errors in the same way as the
more verbose names used for other scripts.
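
A small Python sketch of that pattern, using the standard `unicodedata`
module. The function `systematic_cjk_name` is a hypothetical helper that
builds the name from the code point, which is then compared with what the
library reports.

```python
import unicodedata


def systematic_cjk_name(ch: str) -> str:
    """Build a name from the prefix plus the code point in uppercase hex."""
    return f"CJK UNIFIED IDEOGRAPH-{ord(ch):04X}"


ch = "\u72ac"
print(unicodedata.name(ch))
# The derived name should match the library's name for this character.
print(systematic_cjk_name(ch) == unicodedata.name(ch))
```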

~~~
claudiawerner
I have to wonder what the systematic names refer to; is it a Chinese/Japanese
dictionary ordering in which they were assigned, stroke count (but then, in
what order are characters of the same stroke count organized), or some other
or arbitrary ordering? As in, why is the 72AC character at 72AC, not at 72AD?

~~~
duskwuff
That's a good question. Unfortunately, a lot of old Unicode process
documentation isn't available online -- you can see a couple of
relevant-sounding documents at [1], but very little from before 1999 or so is
available.

There are some pretty clear patterns to the character ordering, though -- if
you look closely, you can see big runs of characters which share radicals. For
example, characters 5000 through 500F are:

倀 倁 倂 倃 倄 倅 倆 倇 倈 倉 倊 個 倌 倍 倎 倏

all of which have the 亻radical on the left side. It's clearly not arbitrary.
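
To look at other runs the same way, a short Python sketch like the following
lists the characters in a code point range. `code_point_run` is a made-up
helper; it only prints the characters and does not consult any radical data,
so the grouping still has to be judged by eye.

```python
from typing import List


def code_point_run(start: int, end: int) -> List[str]:
    """Return the characters for code points start..end, inclusive."""
    return [chr(cp) for cp in range(start, end + 1)]


run = code_point_run(0x5000, 0x500F)
print(" ".join(run))
```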

[1]:
[https://www.unicode.org/L2/L1990/Register-1990.html](https://www.unicode.org/L2/L1990/Register-1990.html)

~~~
cynix
> all of which have the 亻radical on the left side.

Except for 倉 it seems.

~~~
a1369209993
Supposedly the two strokes at the top that look like a roof are somehow
supposed to be the same character as "亻". You could argue that it's not any
stupider than claiming that an upside-down vee with a crossbar is the same
character as a dee with the ascender cropped off, but it also isn't any _less_
stupid, so... <shrug>.

------
gfaure
Does anyone know the background behind:

> U+262B FARSI SYMBOL

> This symbol is so named because as symbol of Iran it cannot be encoded in
> ISO standards.

It would seem strange that ISO standards cannot refer to countries, so I'm
guessing this is something specific to Iran?

~~~
andyjohnson0
A quick search brought me to [1] which says:

<quote>

As noted by Roozbeh Pournader:

 _Neither Farsi, nor a symbol. In real life, it is the official emblem of the
government of the Islamic Republic of Iran._

Technically that would make it a logo and thus not a suitable candidate for
encoding. But Roozbeh also noted:

 _Exactly. The funny fact is that it has been in Unicode since 1.0..._

</quote>

How reliable this is I don't know, but it sounds plausible.

[1]
[http://archives.miloush.net/michkap/archive/2005/01/29/36320...](http://archives.miloush.net/michkap/archive/2005/01/29/363208.html)

~~~
tialaramex
It's interesting, I'd guessed it was a compatibility choice but there doesn't
seem to be any evidence for that.

Aside: Microsoft sucks for having purged blogs like that one. The archive
you're looking at captures most of what was once Michael Kaplan's blog at
Microsoft. Kaplan was terminated by Microsoft and then died, and his blog was
one of a large number of valuable blogs with insights into Windows
technologies that at some point were "tidied away" because they didn't fit
whatever nonsense brand vision somebody had that week.

~~~
mark-r
I really hate the way Microsoft rearranges their web pages seemingly on a
whim. You can't rely on a link to them being useful for any significant period
of time, and as you note useful information is lost on a frequent basis.

~~~
tialaramex
Where applicable: Look for URLs in aka.ms - this looks superficially like a
link shortener but (whether by corporate policy or just because people inside
Microsoft agree with us) they are maintained so that when the pages are all
shuffled around the aka.ms links still get you where you were going.

[https://aka.ms/RootCert](https://aka.ms/RootCert) will always be Microsoft's
Root Trust programme documentation (how a Certificate Authority like Let's
Encrypt gets themselves listed as trusted in Windows) even when Microsoft
decides that page should now be in the form of a GIF anim or a 3D bullet hell
game.

------
smcl
Known anomalies in “Known anomalies in Unicode character names”: they refer to
“háček” as “hacek”.

I wouldn’t normally pick up on something like this, but isn’t the whole point
of Unicode to have a way to represent non-English characters?

------
js8
> The "caron" should have been called hacek and combining hacek. The term
> "caron" is suspected by some to be an invention of some early standards body

That's funny. Hacek is a Czech invention (it was actually a dot originally),
but I always assumed that "caron" is the official "correct" (English) name.

------
raldi
I don't understand why some of these errors were fixed via formal aliases but
not others. Why not give them all proper aliases?

~~~
polm23
I think that for some the correct alias was already in use for another
character, like with the digrams.
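
As a rough illustration of how a formal alias behaves in practice, here is a
Python sketch using the standard `unicodedata` module, whose `lookup` has
accepted name aliases since Python 3.3. It uses the well-known misspelled name
of U+FE18; the variable names are just for illustration.

```python
import unicodedata

ch = "\ufe18"

# The formal name is immutable, so the misspelling is still what name() returns.
misspelled = unicodedata.name(ch)
print(misspelled)

# The corrected spelling exists as a formal alias and resolves to the same
# character; the original misspelled name keeps working too.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
print(unicodedata.lookup(corrected) == ch)
print(unicodedata.lookup(misspelled) == ch)
```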

------
billme
(2017) should be appended, since both the header and footer state that the
document is from 2017.

—

Also, from the parent page for that link:

“Unicode Technical Notes provide information on a variety of topics related
to Unicode and internationalization technologies.

These technical notes are independent publications, not approved by any of the
Unicode Technical Committees, nor are they part of the Unicode Standard or any
other Unicode specification. Publication does not imply endorsement by the
Unicode Consortium in any way. These documents are not subject to the Unicode
Patent Policy.”

SOURCE: [https://unicode.org/notes/](https://unicode.org/notes/)

~~~
dang
Added, thanks.

