Ask HN: Unicode text variants?
40 points by arikrak on Mar 12, 2014 | 25 comments
🅆🄷🅈 𝕚𝕤 𝕥𝕙𝕖𝕣𝕖 𝒖𝒏𝒊𝒄𝒐𝒅𝒆 𝖋𝖔𝖗 𝖉𝖎𝖋𝖋𝖊𝖗𝖊𝖓𝖙 𝓯𝓸𝓻𝓶𝓪𝓽𝓽𝓲𝓷𝓰? I thought each character was supposed to be different. Is it recognized as equivalent by all text-search programs?



Is it recognized as equivalent by all text-search programs?

The magic words you want to look for are [Unicode canonicalization], which aspires to make that (and other string-comparison needs) actually work. Implementation quality across the universe of programs is... mixed.


Canonicalization is more restricted and different.

More restricted in that it only treats truly identical strings that have multiple representations the same. Normalization won't turn "foo" and "FOO" into the same string, but it will turn "fòó" and "fo<with grave accent>o<with acute accent>" into the same string.

Different in the sense that it creates a new string, rather than comparing two strings. Just like you neither need nor want to do a tolower(s) when comparing case-insensitively, you neither need nor want to normalize Unicode to do a normalization-invariant comparison.

The unicode standard uses "equivalence" to treat "<fl ligature>" and "fl" as equivalent (see http://en.wikipedia.org/wiki/Unicode_equivalence; Unicode technical report on normalization at http://www.unicode.org/reports/tr15/tr15-18.html)
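
For anyone who wants to poke at this, here is a minimal sketch using Python's standard `unicodedata` module (my choice of tool, nothing the thread prescribes). It compares the precomposed and combining-mark spellings of "fòó" and shows that canonical forms (NFC/NFD) leave case and the fl ligature alone, while the compatibility form NFKC folds the ligature. The variable names are just illustrative.

```python
import unicodedata

composed = "f\u00f2\u00f3"      # "fòó" using precomposed accented letters
decomposed = "fo\u0300o\u0301"  # "o" + combining grave, "o" + combining acute

# The raw strings are different sequences of code points.
raw_equal = composed == decomposed

# After canonical normalization to the same form they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

# Canonical normalization does not touch case...
case_equal = (unicodedata.normalize("NFC", "foo")
              == unicodedata.normalize("NFC", "FOO"))

# ...and leaves the fl ligature (U+FB02) alone; only NFKC/NFKD expand it.
ligature_nfc = unicodedata.normalize("NFC", "\ufb02")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb02")

print(raw_equal, nfc_equal, nfd_equal, case_equal)
print(repr(ligature_nfc), repr(ligature_nfkc))
```

Note that this sketch does normalize both sides before comparing, which is the simple approach rather than the allocation-free comparison described above.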


Thanks; you're right.


NFKD changes some characters significantly, e.g. ² changes to 2.
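
A quick way to see that for yourself, again assuming Python's `unicodedata` (the sample string is made up): the canonical form NFD keeps the superscript, while the compatibility form NFKD flattens it, so "x squared" turns into "x2".

```python
import unicodedata

superscript = "x\u00b2"  # "x²", using SUPERSCRIPT TWO

canonical = unicodedata.normalize("NFD", superscript)   # canonical: superscript kept
folded = unicodedata.normalize("NFKD", superscript)     # compatibility: becomes a plain "2"

print(repr(canonical), repr(folded))
```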


To me, it seems useful in cases where the style is a part of the meaning of the symbol. This mostly comes up in mathematics, where a letter represented in fraktur or blackboard bold has some semantically different meaning, and this meaning can be part of the file instead of part of the formatting of the file.

The practical part of me agrees with patio11, though: the knowledge gain of having these semantics inherent in the file is offset many times over by possibly having to treat different bytes as the same character semantically.


To add to that, some languages use some of the same characters as English, but formatted differently in order to fit into that language's rules. For example, Japanese characters are all the same width, so Unicode includes numbers and letters in ｆｕｌｌ　ｗｉｄｔｈ in order to fit in when English is mixed in with Japanese.
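
As a rough illustration (Python's `unicodedata` again, which is my assumption and not something the comment mentions), the fullwidth letters are separate code points whose East Asian width class is "F", and compatibility normalization maps them back to ordinary ASCII:

```python
import unicodedata

# "full width" spelled with fullwidth letters and an ideographic space (U+3000)
fullwidth = "\uff46\uff55\uff4c\uff4c\u3000\uff57\uff49\uff44\uff54\uff48"

# East Asian width class of the first fullwidth letter versus plain ASCII "f"
wide_class = unicodedata.east_asian_width(fullwidth[0])  # "F" (fullwidth)
narrow_class = unicodedata.east_asian_width("f")         # "Na" (narrow)

# NFKC folds the fullwidth forms to their ASCII counterparts.
ascii_text = unicodedata.normalize("NFKC", fullwidth)

print(wide_class, narrow_class, repr(ascii_text))
```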


Those are not actually formatting. Those are math symbols. Math relies on formatting to convey meaning, and Unicode is expected to be able to render math correctly without formatting. Therefore, it must be that math symbols' formatting is actually a feature of the character.

My Ubuntu box can't render the first three so I don't know what they are.

"is there" is blackboard lettering, though Unicode insists on calling it double-struck lettering. You may recognize the capital R in double-struck, ℝ, as the symbol for the Reals. http://en.wikipedia.org/wiki/Real_number Similarly, β„€ is the integers, β„‚ is the complexes, et cetera. I can say this on HN, without formatting, because unicode.

The word "unicode" is in Mathematical Bold Italic. The words "for different," which you probably interpret as Fraktur, are ... oh wait unicode calls them mathematical fraktur. http://www.w3.org/TR/MathML2/bycodes.html#U1D58C 𝖆 means a ring group. ℵ is used for the cardinality of infinite sets. 𝖌 is a Lie algebra. Etc.

http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbo...

I don't know what the last set are.


"why" is squared latin capital ("Enclosed Alphanumeric Supplement" Unicode block -- www.unicode.org/charts/PDF/U1F100.pdfβ€Ž) "for different" is mathematical bold fraktur ("Mathematical Alphanumeric Symbols" Unicode block -- www.unicode.org/charts/PDF/U1D400.pdfβ€Ž) "formatting" is mathematical bold script ("Mathematical Alphanumeric Symbols" Unicode block)


HN Search doesn't find "Unicode text variants", so it's probably not a good idea to use these characters in article titles:

https://hn.algolia.io/#!/story/forever/prefix/0/unicode%20te...

What's weird is that if I copy/paste the variant characters from the article title into HN Search, it matches a whole bunch of articles, none of which are this one.

DuckDuckGo finds this post as its first hit if you copy/paste the title into its search box, but not any other articles about "Unicode" or "variants" (just lots of random junk).

Google, on the other hand, apparently canonicalizes the text, since it returns hits on other articles about Unicode and text variants, as well as this post.

So, here we have three text search programs that behave rather differently.

Also, Firefox doesn't find any of the variant text strings on this page if you search for the normal (ASCII) characters.
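
For what it's worth, the difference between those behaviours can be sketched in a few lines. This is only a guess at the general technique (fold both the query and the text with NFKC plus case folding before matching), not a description of what Google, DuckDuckGo or HN Search actually do; `search_key` is a made-up helper name.

```python
import unicodedata


def search_key(text):
    """Fold compatibility variants and case so styled text matches plain text."""
    return unicodedata.normalize("NFKC", text).casefold()


# "unicode" written in Mathematical Bold Italic, as in the post title
styled = "\U0001D496\U0001D48F\U0001D48A\U0001D484\U0001D490\U0001D485\U0001D486"
query = "Unicode"

naive_hit = query in styled                             # plain substring search
folded_hit = search_key(query) in search_key(styled)    # search on folded keys

print(naive_hit, folded_hit)
```

A real search engine would do this folding at index time and would likely use a fuller mapping than NFKC plus `casefold()`; this is just the smallest version that shows the effect.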


Chromium's ctrl-f also canonicalizes, at least well enough to recognize this title.


You're right; MySQL is currently throwing an error (and therefore we don't even send the item to the Algolia engine). Gonna take a look tomorrow or the day after.

Mysql2::Error: Incorrect string value: '\xF0\x9D\x96\x86 m...' for column 'text' at row 1:

Thank you for the bug report ;)


"utf8" in mysql is limited to 16-bit range (!) ; see http://golem.ph.utexas.edu/~distler/blog/archives/002539.htm... - basically, you need to use "utf8mb4" (http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m...)


I switched our database encoding to `utf8mb4` (thank you xroche), let's understand why the text is now truncated in https://hn.algolia.io/#!/story/forever/prefix/0/unicode%20te... :)


Bing doesn't find it when searched in plain-text, while Google can find it in both their auto-complete suggestions and in Chrome's page search. +1 Google.


Anyone else seeing this (latest version of Chrome on Win 8.1) as all boxes? If I highlight and right-click, it will ask me if I want to search for each word, so that I was able to see this is a question about Unicode formatting. I tried setting my fonts to Unicode fonts in the Advanced Settings, but that didn't seem to help.


On latest Chrome in Win 7, I also see all boxes. Interestingly, the page title of the comment thread displays correctly though.


Somewhat interestingly, to me at least, the title is boxes in the page content, but renders correctly in the tab header in Chrome on Windows for me.

It renders fine in both on my mac.


IE and Firefox on Windows 8.1 render it properly. Chrome renders it as all boxes.


Chrome on iOS shows it as all boxes.


You mean Safari (or WebView) on iOS. Chrome on iOS is just using a skinned WebView because Apple's rules don't allow anything else.

In other words, Safari on iOS shows the same boxes. When and if Apple deigns to fix it, it will also be fixed on Chrome for iOS.

It's possible it's just a font issue. Tried pasting in FB Messenger, iOS Notes, iOS Messages. All of them just show boxes except for 🅆🄷🅈


It seems like it is both a font issue and a Safari issue. Here’s a screenshot¹ of all the characters in the string along with their Unicode codepoints:

http://f.cl.ly/items/2m0S0t3Y471q1X2u0W1u/Screen%20Shot%2020...

Note that other than U+0020 (SPACE), U+1F146 (SQUARED LATIN CAPITAL LETTER W), U+1F137 (SQUARED LATIN CAPITAL LETTER H), and U+1F148 (SQUARED LATIN CAPITAL LETTER Y), iOS doesn’t have any fonts containing glyphs for any of the characters. (The squared letter characters are covered (on iOS 7 at least) by the fonts Hiragino Kaku Gothic ProN and Hiragino Mincho ProN².) However, for some reason, Safari on iOS is not performing font substitution in this situation and using one of the Hiragino fonts to display them. As you noted, this font substitution seems to be working fine in other iOS apps, as these squared letters are displaying there.

As an aside, on iOS 7, it is finally possible to install custom fonts yourself³ via a configuration profile⁴. I’ve installed a few fonts I’ve needed in this way⁵, and while they work perfectly in apps like Pages (for iOS), Safari seems to completely ignore their existence. Even after installing them and rebooting, Safari still shows the entire string as boxes⁶. (Symbola alone has a glyph for every character in the above string, so every character should certainly be displayable.) However, copying and pasting the string of boxes into an app like Pages shows all the characters just fine⁷.

――――――

¹ - Screenshot is of UnicodeChecker (http://earthlingsoft.net/UnicodeChecker/).

² - http://support.apple.com/kb/HT5878

³ - http://www.saturngod.net/create-custom-font-for-ios-7

⁴ - https://developer.apple.com/library/ios/featuredarticles/iPh...

⁵ - http://f.cl.ly/items/0o360a3t1q2R3E2a2g2b/IMG_0403.jpg

⁶ - http://f.cl.ly/items/3f1o1R2F082i3I0B1y1o/IMG_0404.jpg

⁷ - http://f.cl.ly/items/3y1J2G0u2u1c0Y1w2Q3Z/IMG_0405.jpg


Chrome on OS X renders properly.


As rchowe said, the different formats are mostly used in higher math and have semantic meanings specific to that context.

These should almost never be used outside that context: they have different byte values from the usual Roman characters (which means the computer doesn't even see them as equivalent without "help" from the programmer), may not be supported by every browser or text search program, and may not even render correctly on many operating systems, since the glyph may be substituted from a different font or, on older systems, not rendered at all.


From what I see, applications developed in regions that use the Latin charset but don't speak English tend to handle this better. The reason is that people usually want to search without accented characters and still find what they are looking for. So issues like this one are handled faster in the dev cycle as users _will_ complain.
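
One common way to get that accent-insensitive behaviour, sketched with Python's `unicodedata` (my assumption about the technique, not a claim about any particular application): decompose with NFD, then drop the combining marks. `strip_accents` is a made-up helper name.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD and drop combining marks, e.g. 'fòó' -> 'foo'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


document = "Un caf\u00e9 cr\u00e8me, s'il vous pla\u00eet"
query = "cafe creme"

# Match on accent-stripped, case-folded keys.
found = strip_accents(query).casefold() in strip_accents(document).casefold()

print(strip_accents(document), found)
```

This only removes marks that decompose canonically; letters like "ø" or "ß" have no such decomposition and would need an explicit mapping.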


Cool, this made it to the front page! I just discovered this whole area of unicode and wanted to see if it would show up here and why it exists. I guess the answers about it being for Math make sense.



