
This is (ab)using unicode, for example the F is actually a Mathematical Bold Fraktur Capital F:

https://www.compart.com/en/unicode/U+1D571

This is terrible for screen readers and the like which are unable to read or understand these unicode characters making accessibility a real concern.




Things like screen readers can still apply Unicode decomposition to try to make sense of the nonsense. The Fraktur F does decompose to "F" in this particular example. A better example is something like Greek Small Letter Alpha, which does not decompose to Latin "a", even though most readers of Latin script will see it as one.

https://www.compart.com/en/unicode/U+03B1

Though there too, there are patterns screen readers can attempt to find to figure out when alpha is pretending to be Latin a.
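
To make the decomposition point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It says nothing about how any particular screen reader works; it only shows what the Unicode data reports for U+1D571 (the Fraktur F) and U+03B1 (Greek alpha). The variable names are just for illustration.

```python
import unicodedata

fraktur_f = "\U0001D571"  # MATHEMATICAL BOLD FRAKTUR CAPITAL F
alpha = "\u03b1"          # GREEK SMALL LETTER ALPHA

for ch in (fraktur_f, alpha):
    name = unicodedata.name(ch)
    folded = unicodedata.normalize("NFKD", ch)  # compatibility decomposition
    print(f"U+{ord(ch):04X} {name}: NFKD gives {folded!r}")
```

The Fraktur F folds to a plain `F`, while alpha comes back unchanged, so mapping alpha to `a` would need a separate look-alike table rather than normalization.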

That said, it's still not a great idea to use them for text anywhere. It puts a lot of burden on the reader's pattern matching skills. Not just screen readers, but human readers too; everyone reads them a bit slower, and that's before you consider the usual human skill/ability modifiers such as dyslexia that make these things so much worse, too.


I wonder actually. Google seems to understand these, you can search with them. I wouldn't be surprised if screen readers understand these symbols. They're originally for Mathematics, and I'm sure there's plenty of blind mathematicians.


> I wouldn't be surprised if screen readers understand these symbols.

I develop web sites with screen readers in mind, and I would be very surprised if they could handle this sort of thing consistently.

Google spent billions of dollars learning how to search and interpret the web. Screen reading companies don't have that kind of scratch.

Also, it's important to remember that screen readers are about more than just the blind. There are screen readers that help people learn a new language, or translate text into simplified forms for people with low education, or low attention (think of the Mac's built-in text summarizing service).

For people who have a hard time reading custom fonts because of limited sight, or limited attention, they often override custom CSS fonts with something easier to read. This sort of thing will make the page unusable.


I'm not trying to suggest screen reading is easy, but the billions of dollars Google invests are not comparable here: Google obviously does a huge range of things, and reader tools need to cover a much narrower range.

This came up at Halloween with all those pleas not to post tweets heavy with emoji due to the issues with screen readers. I get the concern, and I'd typically do my best to be inclusive with personal content and compliant with accessibility on professional content, but there is a balance to be struck - we don't need a Procrustean restriction on what are now reasonably established forms of communication, what we need is for screen reader efforts to step up and work in these cases. It may seem a challenge (especially for legacy coded readers) but other posters already linked to basic solutions that can help on the fonts and this isn't beyond the wit of human ingenuity. This is an opportunity.


I'm not saying that we need to prohibit anyone from using crazy fonts. I'm merely responding to a postulation that screen readers could probably handle the text. My response, from experience, is that they probably can't, and most people misunderstand the myriad of uses for screen reading technology.

Odd fonts like this have a place. For example, I don't expect any screen reader ever to be able to interpret a PETSCII drawing.


It's less "can they render them" and more "do they maintain their meaning in alternate renderings".

For example, it's common in these to use non-latin characters that visually resemble latin characters. But if you were to try and substitute them literally, either as their actual phonetic sound, or a latin equivalent that is semantically similar but visually different, you risk breaking the meaning.

It's a bit like when old manuscripts use letters that look like "s" and "f" in places that seem nonsensical today.

In Google's case, it's certainly worth it to write a little interpreter that maps the visual meaning to its Latin equivalent so you can get better search results.


>It's less "can they render them" and more "do they maintain their meaning in alternate renderings".

I just entered "this is a test" on the site in Safari and enabled VoiceOver. It reads it as "Mathematical Bold Fraktur Small t. Mathematical Bold Fraktur Small h" and so on.

Completely broken for those who rely on VoiceOver as a screenreader.
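
What VoiceOver announces here matches the formal Unicode character names. A small Python sketch of that, as an illustration only (it is not how VoiceOver is implemented, and `to_bold_fraktur` is a hypothetical helper): it builds the Bold Fraktur spelling of a word and lists the name of each character.

```python
import unicodedata

BOLD_FRAKTUR_SMALL_A = 0x1D586  # MATHEMATICAL BOLD FRAKTUR SMALL A

def to_bold_fraktur(word: str) -> str:
    """Map ASCII lowercase letters to Mathematical Bold Fraktur; leave the rest alone."""
    return "".join(
        chr(BOLD_FRAKTUR_SMALL_A + ord(c) - ord("a")) if "a" <= c <= "z" else c
        for c in word
    )

names = [unicodedata.name(c) for c in to_bold_fraktur("this")]
print(names)
```

Every letter carries a four-word prefix before the part a listener actually needs.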


Just configure it to not pronounce the "Mathematical Bold Fraktur Small" bits.


Wow, that's actually pretty neat: https://www.google.com/search?q=site%3Anews.ycombinator.com%...

Anyone know if there are any existing libraries that do this conversion?


The ICU [1] transliterator does this. Here is an example showing how to use it:

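    // Needs <unicode/translit.h> and <unicode/errorcode.h>.
    icu::ErrorCode status;
    auto t = icu::Transliterator::createInstance("Any-Latin; NFKD", UTRANS_FORWARD, status);
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("unicode string");
    if (status.isSuccess()) t->transliterate(text);  // transliterates text in place
    delete t;  // createInstance returns a pointer the caller owns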
[1] http://site.icu-project.org/


Here's a test page for one.

http://minaret.info/test/normalize.msp


If you use c#:

"𝕙𝕖𝕝𝕝𝕠".Normalize(NormalizationForm.FormKC);


Look up Case Folding for how this is done, most languages have libraries to handle it.


This is called unicode normalisation, specifically NFKD form. You can do it in two lines of Python. This was covered last time a similar website came up (plagiarising myself):

https://news.ycombinator.com/item?id=17752680

unicode normalisation: http://unicode.org/reports/tr15/

Nobody in that thread tried it on an actual screen reader then, either. Someone did mention that iOS got well confused by it, so there's one data point.
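
For reference, the two lines of Python mentioned above could look roughly like this, using the standard library's `unicodedata` module. The sample string is the double-struck "hello" from the C# comment, and the variable name is just for illustration.

```python
import unicodedata

plain = unicodedata.normalize("NFKD", "𝕙𝕖𝕝𝕝𝕠")
print(plain)  # hello
```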


Some of them are outside the mathematical range, that just happened to be the one I picked.

Google has a lot of resources to do normalization. When IDNA in the URL bar became common, they and other browser makers had to put resources behind defending against look-alike glyphs, to make sure that you were actually on google.com and not on some site using a homoglyph attack.

https://blog.malwarebytes.com/101/2017/10/out-of-character-h...

This may explain why Google is able to discern these use cases.


In chromium, the glyph attack defense works by normalizing to a "confusability skeleton" -- the ICU library actually does the normalization. Even within latin-1 this does some mangling: m gets normalized to rn, w to vv, etc. The query rewriting is a different problem, but it's possible it uses the confusables list as an input.
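
As a rough illustration of the skeleton idea, here is a toy Python sketch. The table is a hypothetical, hand-picked subset (the two mappings named above plus one Cyrillic look-alike), and the function names are made up; the real confusables data is far larger, and Chromium relies on ICU rather than anything like this.

```python
import unicodedata

# Toy confusables table: a hypothetical, hand-picked subset for illustration.
TOY_CONFUSABLES = {
    "m": "rn",
    "w": "vv",
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
}

def toy_skeleton(text: str) -> str:
    """Decompose, map each character through the table, then decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(TOY_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def looks_alike(a: str, b: str) -> bool:
    """Treat two strings as confusable when their skeletons match."""
    return toy_skeleton(a) == toy_skeleton(b)

print(looks_alike("microsoft", "rnicrosoft"))
print(looks_alike("p\u0430ypal", "paypal"))
```

The skeleton is only a comparison key; it is not meant to be shown to anyone, which is why mangling `m` into `rn` is acceptable.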


blind mathematicians don't write english to communicate in these unicode blocks... These blocks don't exist to be used for writing words in the english language.


Yes, I just mean to say, if you were creating a screen reader, it wouldn't be that hard to take this sort of thing into account. So, maybe they do. Either way, I'll add a little warning to the site.


Even if it is not hard to take this thing into account, it is definitely hard to take all the things into account. Adding more things doesn't help.


Ah yes.. The old "if you were creating <thing author doesn't know much about>, it wouldn't be that hard to take this sort of thing into account"


Well, we live in an age of wonder, I heard they landed a man on the moon.


Interesting. This made me wonder about small caps. Aᴘᴘᴀʀᴇɴᴛʟʏ Gᴏᴏɢʟᴇ ᴅᴏᴇsɴ'ᴛ sᴜᴘᴘᴏʀᴛ ᴛʜᴏsᴇ.


I tested the ones with circles and squares, and Google definitely doesn't understand them.
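
For comparison, this is what plain NFKD normalization does with three of these styles. It is a Python sketch that is independent of whatever Google actually does, and the `samples` and `folded` names are just for illustration.

```python
import unicodedata

samples = {
    "double-struck": "\U0001D559",  # MATHEMATICAL DOUBLE-STRUCK SMALL H
    "circled": "\u24D7",            # CIRCLED LATIN SMALL LETTER H
    "small cap": "\u029C",          # LATIN LETTER SMALL CAPITAL H
}

folded = {style: unicodedata.normalize("NFKD", ch) for style, ch in samples.items()}
for style, result in folded.items():
    print(f"{style}: {samples[style]} -> {result}")
```

The double-struck and circled letters fold to a plain `h`, but the small capital has no compatibility decomposition and comes back unchanged, so normalization alone cannot recover small-caps text.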


It's also a documented WCAG failure to use these characters without providing a text alternative (failing case: https://www.w3.org/TR/WCAG20-TECHS/F71.html)


I agree, but it has non-abuse uses that I did not realise I missed. For example just a simple smoke test has more nuances than expected ...

𝕋𝕙𝕚𝕤 𝕥𝕖𝕤𝕥 𝕥𝕖𝕩𝕥 𝕝𝕠𝕠𝕜 𝕝𝕚𝕜𝕖 𝔼𝕟𝕘𝕝𝕚𝕤𝕙 𝕓𝕦𝕥 𝕚𝕤 𝕦𝕤𝕚𝕟𝕘 𝕟𝕠𝕟-𝕒𝕤𝕔𝕚𝕚 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕒𝕟𝕕 𝕤𝕠 𝕚𝕗 𝕥𝕙𝕚𝕤 𝕣𝕖𝕞𝕒𝕚𝕟𝕤 𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖 𝕥𝕠 𝕒𝕟 𝔼𝕟𝕘𝕝𝕚𝕤𝕙 𝕤𝕡𝕖𝕒𝕜𝕖𝕣 𝕨𝕖 𝕔𝕒𝕟 𝕓𝕖 𝕣𝕖𝕒𝕤𝕠𝕟𝕒𝕓𝕝𝕪 𝕔𝕠𝕟𝕗𝕚𝕕𝕖𝕟𝕥 𝕨𝕖 𝕒𝕣𝕖 𝕙𝕒𝕟𝕕𝕝𝕚𝕟𝕘 𝕦𝕟𝕚𝕔𝕠𝕕𝕖 𝕡𝕣𝕠𝕡𝕖𝕣𝕝𝕪 𝕥𝕙𝕣𝕠𝕦𝕘𝕙𝕠𝕦𝕥 𝕥𝕙𝕖 𝕧𝕒𝕣𝕚𝕠𝕦𝕤 𝕤𝕪𝕤𝕥𝕖𝕞𝕤 𝕠𝕦𝕣 𝕕𝕒𝕥𝕒 𝕥𝕣𝕒𝕧𝕖𝕝𝕤.


Can confirm, put a hello world in codepen.io with some other text and JAWS skipped right past it.


I'd kind of expect screen readers to be able to apply Unicode compatibility decomposition http://unicode.org/reports/tr15/#Canon_Compat_Equivalence since there are many characters that are just visual variants. Ligatures like ffl or the like are at least somewhat common e.g. in PDFs. On the other hand, maybe that breaks other stuff and only whitelisted characters are converted.

Maybe a HN reader using a screen reader can describe how theirs handles these characters.
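
The "only whitelisted characters" idea could be sketched like this in Python. The ranges and the `fold_allowed` name are hypothetical choices for illustration, not taken from any real screen reader.

```python
import unicodedata

# Hypothetical allowlist: only characters in these ranges get folded.
ALLOWED_RANGES = (
    (0xFB00, 0xFB06),    # Latin ligatures such as U+FB04 (ffl)
    (0x1D400, 0x1D7FF),  # Mathematical Alphanumeric Symbols block
)

def fold_allowed(text: str) -> str:
    """Apply NFKD only to characters inside the allowlisted ranges."""
    out = []
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES):
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)

print(fold_allowed("wa\uFB04e"))  # ligature expanded
print(fold_allowed("x\u00B2"))    # superscript two left alone
```

Blanket NFKD would turn `x²` into `x2`, which is the kind of breakage that makes an allowlist attractive.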


quick rundown from my experience:

NVDA: oss, people have hacked in normalization that they can flip on when they hear something that sounds like nonsense, and then flip back off after reading that particular part.

JAWS: people have to listen to a bunch of crap if there's no alttext and will not be able to understand your content

VoiceOver OSX: people have to listen to a bunch of crap if there's no alttext and will not be able to understand your content


I'm not a regular screen reader user but VoiceOver (in macOS High Sierra) will not read the whole words, it will only say "Show HN." Because I can see there's more there, I can delve deeper and navigate character-by-character at which point it will say "f," "a," "n," "c," "y," "space," etc.


That implies that PDFs work with glyph indices, not Unicode code points. How do PDFs work for shaped languages like Hindi? In a script like that, you may have glyphs without any corresponding Unicode code point. Or does the PDF perhaps store the original unshaped text?


It's possible for a PDF to be annotated with the original Unicode text (look up the `/ActualText` feature in the spec), to support extraction of the underlying text rather than the shaped glyph stream for purposes such as copy/paste and search.

However, few PDF generators do this, and not all PDF readers have good support for it. So results vary depending on the specific tools and use-case.


It might just be the output of some particular programs that pre-bake their ligatures. All I know is that sometimes, I try to copy-paste text out of a PDF only to end up with some annoying ligatures interspersed throughout.


> This is terrible for screen readers and the like which are unable to read or understand these unicode characters making accessibility a real concern

Yes, Firefox for Android doesn't render it properly.


Here it works fine, what do you mean?


Results will depend on the available fonts your device vendor chose to include.


Could also have issues with systems that have broken 16-bit-only Unicode support (Java, Windows, ...) in which code points beyond U+FFFF have to be encoded with some surrogate pair nonsense that is likely untested in many text-handling situations.


That's like saying UTF-8 requires "nonsense" pair, triplet, or quadruplet chars. UTF-16 handles all Unicode code points just fine. UCS-2 does not. Windows transitioned from UCS-2 to UTF-16 long ago.


The problem is Windows programs, not Windows per se.

The difference is that UTF-8 gets tested in this regard; multi-byte encoding situations actually occurring in UTF-8 are not rare occurrences that only trigger on funny characters that nobody uses.

(For that matter, four-byte UTF-8 situations are in the same boat, of course, but not two- or three-.)


> (For that matter, four-byte UTF-8 situations are in the same boat, of course, but not two- or three-.)

Yeah. Notorious example here is MySQL's "utf8" column type only supporting 3-byte UTF-8 sequences.
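
To see both edge cases on one character, here is a short Python sketch using only built-in string encoding (variable names are just for illustration). It takes the Fraktur F from the top of the thread and shows its UTF-16 code units and its UTF-8 length.

```python
ch = "\U0001D571"  # MATHEMATICAL BOLD FRAKTUR CAPITAL F

utf16 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

# Split the UTF-16 bytes into 16-bit code units.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print([f"{u:04X}" for u in units])  # two units: a high and a low surrogate
print(len(utf8))                    # bytes needed in UTF-8
```

The character needs a surrogate pair in UTF-16 and four bytes in UTF-8, so a store limited to 3-byte sequences cannot hold it.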


I don't see why screen readers can't just be extended to handle characters that look like F.


Because they have uses outside of the abuse. If you always expect that they're being used improperly, then you mess up the actual use cases.


What would be an example of a string like "𝕙𝕖𝕝𝕝𝕠" appearing in a context where the screen reader should not attempt to pronounce it? Note that in math usage, a lone ℝ should be pronounced exactly the same way as a lone R.


In math usage an ℝ is usually the set of real numbers, and it can appear in equations that also use a variable R. It's not unusual, and in fact, sometimes important, when reciting math as speech to make it clear which is the "bold" or "set" or "group" R, versus which is the ordinary R.


Someone using a screen reader to learn a new language.


There are no natural languages that use mathematical blackboard bold script, although you are right that screen readers probably shouldn't try to read Cyrillic according to what English letter the characters look the most like.


Well, because it's not an F, and how would you deal with this: 𓃓 I mean it's a bull symbol.


I think mapping common lookalikes to their corresponding Latin-1 glyphs would be a broad improvement, and with websites like this collecting them it may not even be all that much work.

Although, like you say, it would be nice if screenreaders could produce verbal descriptions of iconographic symbols. That would be a lot of work but it would be helpful.


𓃓 = 𝕓𝕦𝕝𝕝


Depending on context it may not actually mean "F". Screen readers are not yet capable of fully understanding the meaning behind the text they are reading.

If for example someone uses this to write a mathematical formula, having a screen reader says "F" changes the entire meaning.

Could this be done? Absolutely. But it is something to be aware of, and it is something I noticed on Twitter, where blind users were complaining that they were unable to "read" Twitter messages using this, thereby making them second class citizens all over again.


If I were to write a math formula in ASCII, it would probably have strings of multiplied variables that would be indistinguishable from English words to a screenreader that didn't know anything about meaning. The only way it would know to sound out the letters is if they had spaces or asterisks between them. The additional glyphs would do nothing to change this.


Screen readers should faithfully render characters as sound so people with limited/no eyesight can engage with written language, not make a guess that you meant to write cursed/fancy text instead of what those characters are for.


What do you think the general development budget for a screen reader is compared to the development budget for Google Search services?


Compared to the marginal cost of implementing this small additional input sanitization, both budgets are infinite.


NVDA is open source (https://github.com/nvaccess/nvda). For me it would be an inordinately expensive use of my time, but for you?

At any rate, both budgets are not infinite, and it's infuriating to hear them described as such.


They certainly could, but resources are limited and they probably have more important tasks.


ኾᥱΛ᜶ұ, you don't?


Just use mysql < v8 to store your data. By default it has utf8mb3 anyway. Problem solved ;)



