
Ask HN: Unicode text variants? - arikrak
🅆🄷🅈 𝕚𝕤 𝕥𝕙𝕖𝕣𝕖 𝒖𝒏𝒊𝒄𝒐𝒅𝒆 𝖋𝖔𝖗 𝖉𝖎𝖋𝖋𝖊𝖗𝖊𝖓𝖙 𝓯𝓸𝓻𝓶𝓪𝓽𝓽𝓲𝓷𝓰? I thought each character was supposed to be different. Is it recognized as equivalent by all text-search programs?
======
patio11
_Is it recognized as equivalent by all text-search programs?_

The magic words you want to look for are [Unicode canonicalization], which
aspires to make that (and other string-comparison needs) actually work.
Implementation quality across the universe of programs is... mixed.

~~~
Someone
Canonicalization is more restricted and different.

More restricted in that it only treats truly identical strings that have
multiple representations the same. Normalization won't turn "foo" and "FOO"
into the same string, but it will turn "fòó" and "fo<with grave accent>o<with
acute accent>" into the same string.

Different in the sense that it creates a new string, rather than comparing two
strings. Just like you neither need nor want to do a tolower(s) when comparing
case-insensitively, you neither need nor want to normalize Unicode to do a
normalization-invariant comparison.

The unicode standard uses "equivalence" to treat "<fl ligature>" and "fl" as
equivalent (see
[http://en.wikipedia.org/wiki/Unicode_equivalence](http://en.wikipedia.org/wiki/Unicode_equivalence);
Unicode technical report on normalization at
[http://www.unicode.org/reports/tr15/tr15-18.html](http://www.unicode.org/reports/tr15/tr15-18.html))
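
A minimal Python sketch of the distinction above, using the standard library's `unicodedata` module. The helper names `canonically_equal` and `compatibility_equal` are made up for illustration. Note that this does build normalized copies in order to compare them, which is exactly the shortcut the comment says is not strictly necessary; it is just the shortest way to show the behaviour.

```python
import unicodedata

# Two spellings of the same text: precomposed letters versus base letters
# followed by combining accents (U+0300 grave, U+0301 acute).
precomposed = "f\u00f2\u00f3"
decomposed = "fo\u0300o\u0301"


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare under canonical equivalence (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def compatibility_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare under compatibility equivalence (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print(precomposed == decomposed)                   # False: different code points
print(canonically_equal(precomposed, decomposed))  # True
print(canonically_equal("foo", "FOO"))             # False: case is not folded
print(canonically_equal("\ufb02", "fl"))           # False: ligature survives NFC
print(compatibility_equal("\ufb02", "fl"))         # True: NFKC expands the ligature
```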

~~~
patio11
Thanks; you're right.

------
rchowe
To me, it seems useful in cases where the style is a part of the meaning of
the symbol. This mostly comes up in mathematics, where a letter represented in
fraktur or blackboard bold has some semantically different meaning, and this
meaning can be part of the file instead of part of the foramtting of the file.

The practical part of me agrees with patio11, though: the knowledge gained by
having these semantics inherent in the file is offset many times over by
possibly having to treat different bytes as the same character semantically.

~~~
ihuman
To add to that, some languages use some of the same characters as English,
but formatted differently in order to fit into that language's rules. For
example, Japanese characters are all the same width, so Unicode includes
numbers and letters in ｆｕｌｌ ｗｉｄｔｈ in order to fit in when English is mixed
in with Japanese.
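
For what it's worth, here is a small Python sketch of how the fullwidth letters relate to their ASCII counterparts, using only the standard `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

# "full" written with fullwidth Latin letters (U+FF46, U+FF55, U+FF4C, U+FF4C).
fullwidth = "\uff46\uff55\uff4c\uff4c"

# Each one is a separate code point with its own name and width class.
print(unicodedata.name(fullwidth[0]))              # FULLWIDTH LATIN SMALL LETTER F
print(unicodedata.east_asian_width(fullwidth[0]))  # F  (fullwidth)
print(unicodedata.east_asian_width("f"))           # Na (narrow)

# Compatibility normalization (NFKC) folds them back to plain ASCII.
folded = unicodedata.normalize("NFKC", fullwidth)
print(folded == "full")                            # True
```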

------
JohnHaugeland
Those are not actually formatting. Those are math symbols. Math relies on
formatting to convey meaning, and Unicode is expected to be able to render
math correctly without formatting. Therefore, it must be that math symbols'
formatting is actually a feature of the character.

My Ubuntu box can't render the first three so I don't know what they are.

"is there" is blackboard lettering, though Unicode insists on calling it
double-struck lettering. You may recognize the capital R in double-struck, ℝ,
as the symbol for the Reals.
[http://en.wikipedia.org/wiki/Real_number](http://en.wikipedia.org/wiki/Real_number)
Similarly, ℤ is the integers, ℂ is the complexes, et cetera. I can say this on
HN, without formatting, because unicode.

The word "unicode" is in Mathematical Bold Italic. The words "for different,"
which you probably interpret as Fraktur, are ... oh wait unicode calls them
mathematical fraktur.
[http://www.w3.org/TR/MathML2/bycodes.html#U1D58C](http://www.w3.org/TR/MathML2/bycodes.html#U1D58C)
𝖆 means a ring group. ℵ is used for the cardinality of infinite sets. 𝖌 is a
Lie algebra. Etc.

[http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbo...](http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols)

I don't know what the last set are.

~~~
xroche
"why" is squared latin capital ("Enclosed Alphanumeric Supplement" Unicode
block -- www.unicode.org/charts/PDF/U1F100.pdf); "for different" is
mathematical bold fraktur ("Mathematical Alphanumeric Symbols" Unicode block
-- www.unicode.org/charts/PDF/U1D400.pdf); "formatting" is mathematical bold
script ("Mathematical Alphanumeric Symbols" Unicode block).
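
If you would rather not look these up in the charts by hand, a short Python sketch like the one below prints the code point and official name of the first character of each word in the title. `describe` is a hypothetical helper, and the names come from whatever Unicode version your Python build ships.

```python
import unicodedata

title = "🅆🄷🅈 𝕚𝕤 𝕥𝕙𝕖𝕣𝕖 𝒖𝒏𝒊𝒄𝒐𝒅𝒆 𝖋𝖔𝖗 𝖉𝖎𝖋𝖋𝖊𝖗𝖊𝖓𝖙 𝓯𝓸𝓻𝓶𝓪𝓽𝓽𝓲𝓷𝓰?"


def describe(text: str) -> list[tuple[str, str]]:
    """Hypothetical helper: (code point, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# One sample character per word is enough to see which alphabet it uses.
for word in title.split():
    code_point, name = describe(word)[0]
    print(code_point, name)
```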

------
greenyoda
HN Search doesn't find "Unicode text variants", so it's probably not a good
idea to use these characters in article titles:

[https://hn.algolia.io/#!/story/forever/prefix/0/unicode%20te...](https://hn.algolia.io/#!/story/forever/prefix/0/unicode%20text%20variants)

What's weird is that if I copy/paste the variant characters from the article
title into HN Search, it matches a whole bunch of articles, none of which are
this one.

DuckDuckGo finds this post as its first hit if you copy/paste the title into
its search box, but not any other articles about "Unicode" or "variants" (just
lots of random junk).

Google, on the other hand, apparently canonicalizes the text, since it returns
hits on other articles about Unicode and text variants, as well as this post.

So, here we have three text search programs that behave rather differently.

Also, Firefox doesn't find any of the variant text strings on this page if you
search for the normal (ASCII) characters.
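
As a rough illustration of what a search program could do to treat these as equivalent (a sketch only, not a claim about what Google, DuckDuckGo, or HN Search actually do), compatibility normalization plus case folding maps the title back to plain letters. `search_key` is a made-up name.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical helper: fold a string to a form suitable for matching.

    NFKC replaces compatibility characters (math alphabets, squared
    letters, fullwidth forms, ligatures) with their plain equivalents;
    casefold() then removes case differences.
    """
    return unicodedata.normalize("NFKC", text).casefold()


title = "🅆🄷🅈 𝕚𝕤 𝕥𝕙𝕖𝕣𝕖 𝒖𝒏𝒊𝒄𝒐𝒅𝒆 𝖋𝖔𝖗 𝖉𝖎𝖋𝖋𝖊𝖗𝖊𝖓𝖙 𝓯𝓸𝓻𝓶𝓪𝓽𝓽𝓲𝓷𝓰?"

print("unicode" in title)              # False: no ASCII letters in the raw title
print("unicode" in search_key(title))  # True once both sides are folded
print(search_key(title))
```

The catch is that this folding is lossy: it also erases the distinctions that matter in mathematical text, so it belongs in the search index rather than in the stored data.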

~~~
redox_
You're right; MySQL is currently throwing an error (and therefore we don't
even send the item to the Algolia engine). Gonna take a look tomorrow or the
day after.

Mysql2::Error: Incorrect string value: '\xF0\x9D\x96\x86 m...' for column
'text' at row 1:

Thank you for the bug report ;)

~~~
xroche
"utf8" in mysql is limited to 16-bit range (!) ; see
[http://golem.ph.utexas.edu/~distler/blog/archives/002539.htm...](http://golem.ph.utexas.edu/~distler/blog/archives/002539.html)
- basically, you need to use "utf8mb4"
([http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m...](http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html))
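
To see why that particular value was rejected, here is a small Python sketch that decodes the four bytes from the error message above. `fits_in_bmp` and `max_utf8_bytes` are hypothetical helpers; the three-byte limit refers to MySQL's legacy "utf8" charset.

```python
import unicodedata

# The leading bytes quoted in the Mysql2::Error message.
raw = b"\xf0\x9d\x96\x86"
ch = raw.decode("utf-8")


def fits_in_bmp(text: str) -> bool:
    """Hypothetical helper: True if every character is at most U+FFFF."""
    return all(ord(c) <= 0xFFFF for c in text)


def max_utf8_bytes(text: str) -> int:
    """Hypothetical helper: longest UTF-8 encoding of any single character."""
    return max((len(c.encode("utf-8")) for c in text), default=0)


print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print(max_utf8_bytes(ch))  # 4: one byte more than legacy "utf8" can store
print(fits_in_bmp(ch))     # False: outside the 16-bit range
```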

~~~
redox_
I switched our database encoding to `utf8mb4` (thank you xroche). Now to work
out why the text is truncated in
[https://hn.algolia.io/#!/story/forever/prefix/0/unicode%20te...](https://hn.algolia.io/#!/story/forever/prefix/0/unicode%20text%20variants)
:)

------
EwanG
Anyone else seeing this (latest version of Chrome on Win 8.1) as all boxes? If
I highlight and right-click, it will ask me if I want to search for each word,
so that I was able to see this is a question about Unicode formatting. I tried
setting my fonts to Unicode fonts in the Advanced Settings, but that didn't
seem to help.

~~~
ibelimb
Chrome on iOS shows it as all boxes.

~~~
greggman
You mean Safari (or WebView) on iOS. Chrome on iOS is just using a skinned
WebView because Apple's rules don't allow anything else.

In other words, Safari on iOS shows the same boxes. When and if Apple deigns
to fix it, it will also be fixed on Chrome for iOS.

It's possible it's just a font issue. Tried pasting in FB Messenger, iOS
Notes, iOS Messages. All of them just show boxes except for 🅆🄷🅈

~~~
arm
It seems like it is both a font issue _and_ a Safari issue. Here’s a
screenshot¹ of all the characters in the string along with their Unicode
codepoints:

[http://f.cl.ly/items/2m0S0t3Y471q1X2u0W1u/Screen%20Shot%2020...](http://f.cl.ly/items/2m0S0t3Y471q1X2u0W1u/Screen%20Shot%202014-03-12%20at%201.37.32%20AM.png)

Note that other than U+0020 (SPACE), U+1F146 (SQUARED LATIN CAPITAL LETTER W),
U+1F137 (SQUARED LATIN CAPITAL LETTER H), and U+1F148 (SQUARED LATIN CAPITAL
LETTER Y), iOS doesn’t have any fonts containing glyphs for any of the
characters. (The _squared letter_ characters are covered (on iOS 7 at least)
by the fonts _Hiragino Kaku Gothic ProN_ and _Hiragino Mincho ProN_ ²).
However, for some reason, Safari on iOS is not performing font substitution in
this situation and using one of the Hiragino fonts to display them. As you
noted, this font substitution seems to be working fine in other iOS apps, as
these _squared letters_ are displaying there.

As an aside, on iOS 7, it is finally possible to install custom fonts
yourself³ via a configuration profile⁴. I’ve installed a few fonts I’ve needed
in this way⁵, and while they work perfectly in apps like Pages (for iOS),
Safari seems to completely ignore their existence. Even after installing them
and rebooting, Safari still shows the entire string as boxes⁶. (Symbola alone
has a glyph for every character in the above string, so every character should
certainly be displayable). However, copying and pasting the string of boxes
into an app like Pages shows all the characters just fine⁷.

――――――

¹ — Screenshot is of UnicodeChecker
([http://earthlingsoft.net/UnicodeChecker/](http://earthlingsoft.net/UnicodeChecker/)).

² — [http://support.apple.com/kb/HT5878](http://support.apple.com/kb/HT5878)

³ — [http://www.saturngod.net/create-custom-font-for-ios-7](http://www.saturngod.net/create-custom-font-for-ios-7)

⁴ —
[https://developer.apple.com/library/ios/featuredarticles/iPh...](https://developer.apple.com/library/ios/featuredarticles/iPhoneConfigurationProfileRef/Introduction/Introduction.html)

⁵ —
[http://f.cl.ly/items/0o360a3t1q2R3E2a2g2b/IMG_0403.jpg](http://f.cl.ly/items/0o360a3t1q2R3E2a2g2b/IMG_0403.jpg)

⁶ —
[http://f.cl.ly/items/3f1o1R2F082i3I0B1y1o/IMG_0404.jpg](http://f.cl.ly/items/3f1o1R2F082i3I0B1y1o/IMG_0404.jpg)

⁷ —
[http://f.cl.ly/items/3y1J2G0u2u1c0Y1w2Q3Z/IMG_0405.jpg](http://f.cl.ly/items/3y1J2G0u2u1c0Y1w2Q3Z/IMG_0405.jpg)

------
nikdaheratik
As rchowe said, the different formats are mostly used in higher math and have
semantic meanings specific to that context.

These should almost never be used outside that context: they have different
byte values from the usual Roman characters (which means the computer doesn't
even see them as equivalent without "help" from the programmer), may not be
supported by every browser or text search program, and may not even render
correctly on many operating systems, since the glyph may be substituted from a
different font or, on older systems, not drawn at all.

------
Elv13
From what I see, applications developed in regions that use the Latin charset
but don't speak English tend to handle this better. The reason is that people
usually want to search without accented characters and still find what they
are looking for. So issues like this one get handled earlier in the dev cycle,
as users _will_ complain.
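
One common way to get that accent-insensitive search, sketched in Python under the assumption that dropping combining marks is acceptable for your language (it is not for every one): decompose with NFD, then discard the marks. Both function names are made up for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: remove combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_contains(haystack: str, needle: str) -> bool:
    """Hypothetical helper: substring test ignoring accents and case."""
    return strip_accents(needle).casefold() in strip_accents(haystack).casefold()


print(strip_accents("f\u00f2\u00f3"))  # foo
print(accent_insensitive_contains("Un caf\u00e9 cr\u00e8me", "cafe creme"))  # True
```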

------
arikrak
Cool, this made it to the front page! I just discovered this whole area of
unicode and wanted to see if it would show up here and why it exists. I guess
the answers about it being for Math make sense.

