Three possible solutions, not mutually exclusive:
* Make the browsers catch it - Chrome just shows characters that look like apple.com here, and shouldn’t.
* Whitelist, at the registry level, which character sets are allowed to be mixed - e.g. Cyrillic homoglyphs of Latin letters should only be allowed alongside Cyrillic non-homoglyphs.
* Don’t allow IDNs on gTLDs - if you want “écriture”, get écriture.fr, not .com
Obviously, the registries have a conflict of interest here, and won’t let 2 and 3 happen on .com because it would cut into Verisign's revenue.
See also https://en.wikipedia.org/wiki/IDN_homograph_attack
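The mixed-script whitelisting idea above can be sketched in a few lines of Python. This is only an illustration - it leans on the first word of each character's Unicode name as a crude stand-in for proper script detection (a real implementation would use the Unicode script property):

```python
import unicodedata

def scripts_in_label(label):
    """Return the set of scripts used in a domain label.

    Crude heuristic: the first word of each character's Unicode name,
    e.g. "LATIN SMALL LETTER A" -> "LATIN".
    """
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in label}

# Plain Latin "apple" vs. a label whose first letter is Cyrillic U+0430:
print(scripts_in_label("apple") == {"LATIN"})                    # True
print(scripts_in_label("\u0430pple") == {"CYRILLIC", "LATIN"})   # True
```

A registry or browser could then reject (or flag) labels whose script set contains a disallowed mixture, while leaving single-script names alone.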
And this domain isn't made of mixed characters.
A particularly hard case is Chinese, where many old people can't type the characters even though they can type Latin letters. That's because you have to deliberately invest time to sit down and learn an input method - a non-trivial endeavor that takes weeks of effort - and old people just aren't going to go back to school for that.
Yes. Not everybody speaks English.
>Every web user must be already used to typing Latin characters because so many major websites use them.
s/web user/existing web user/
Unicode domains are one more piece required for the net to be as inclusive as possible.
I should also add that the general attitude of "not everybody speaks English so we should adapt our tech to reduce the need for English" seems to imply the privilege of already knowing English. It is true that not everyone speaks English at this moment, but the right solution would be to teach everyone English, as it expands horizons immensely, not to balkanize the world. Languages are not equal, and English is the single most useful one. One can argue that e.g. Russian is just as good as English, but that's just not true. The amount of information available in English is immeasurably higher than in any other national language, and one has to have had the privilege of knowing English for some time (or being a native speaker) to forget that fact.
By clicking on the link in my search engine. Though of course, either that page is in a language I can easily type (so why then use that URL?), or otherwise I would not have searched for that term to begin with, whether I typed it in the URL bar or in the search field.
If I'm clicking a link on a page in a script I can read, then the point is moot too.
>(more people learn English over time and that's a beautiful thing)
I don't know. To me, giving access to people who don't (yet) speak English is a nobler goal than forcing people to learn English and the Latin script.
If they want to learn English, that's fine. But forcing them to is exclusionary.
Yes. There's a lot more content available in English, and in the end that's what made me learn it (honestly - the sole reason I started to learn English was to be able to play the talkie version of "Indiana Jones and the Fate of Atlantis"), but this was my decision. I wasn't forced into it.
But declaring English the lingua franca means that people from other cultures will be at a perennial disadvantage vis-à-vis native speakers, or that their native languages will be neglected.
There's an obvious benefit we get from everyone speaking English, but there's also a non-obvious cost to the homogenisation of cultures. Already there are languages dying (and dead), and with every language that dies, a potentially different and meaningful way of looking at the world dies with it.
In any case, I'd leave such decisions to those actually affected by them. People tend to react in surprising ways to other cultures telling them what they're worth (cf. "Balkanisation").
I can easily continue replacing "þ" and "ð" with "th" and continue removing diacritics, but it feels like being robbed of an aspect of one's language.
The other option would be to do this based on the browser's current language preference configuration. But that means that unless your domain is in ASCII, you can never be sure how it's going to be rendered in your customers' browsers, which would make IDN domains second-class compared to ASCII domains.
Is that your personal experience? Most input methods I know work by 1. knowing the Pinyin romanization of the word you want to type 2. typing it in 3. selecting the appropriate characters from a long list of candidates.
If they know the characters and Latin letters, then the only roadblock I can think of would be not knowing Pinyin. That shouldn't take weeks to learn if you already know Chinese, it's more like learning a very simple alphabet.
No. They are very, very rarely used, even in countries that do not use the Latin alphabet. Given the danger that phishing poses, I don't think they're worth the risk.
Whilst it's easy to say "just always show punycode", people who use the web in other languages would lose a lot of functionality because of it.
I could imagine a solution: collect a list of homoglyphs, and when the browser suspects an overlap, search for similarly spelt sites and warn the user that the site may be an imitation.
Of course, the URL in the address bar should then also be converted to punycode.
What other ideas are there?
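The homoglyph-list idea above can be sketched by collapsing each label to a "skeleton" and comparing skeletons against well-known names. The table below is a hand-picked, illustrative subset; a real implementation would use Unicode's full confusables data (UTS #39):

```python
# Tiny, illustrative homoglyph table: Cyrillic letters mapped to the
# Latin letters they resemble.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u04cf": "l",  # CYRILLIC SMALL LETTER PALOCHKA
}

def skeleton(label):
    """Collapse known homoglyphs to their Latin look-alikes."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in label)

def possible_imitations(label, well_known):
    """Well-known names this label could be confused with."""
    sk = skeleton(label)
    return [w for w in well_known if w != label and skeleton(w) == sk]

# The all-Cyrillic spoof of "apple" collides with the real name:
print(possible_imitations("\u0430\u0440\u0440\u04cf\u0435",
                          ["apple", "google"]))  # ['apple']
```

A browser could run this against a list of popular domains and show the warning (and the punycode form) only on a collision, rather than punishing all IDNs.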
Firefox 52.0.2 here displays it as apple.com
So I updated Firefox (which failed until I reran it) and then.. it still displays it as apple.com (Firefox 53.0)
E.g. go to about:config and set "network.IDN_show_punycode" to true to avoid this trap.
That's not a fix, that's a workaround.
EDIT: This led me to read up on how various browsers handle non-ASCII letters, which in turn helped me discover that apparently no browser supports the German sharp s ("ß"): it gets auto-expanded to "ss", although domains containing the sharp s can be registered separately from the corresponding "ss" domains -- effectively allowing people to register domains that can't be accessed in any browser without explicitly using the unreadable punycode representation.
EDIT2: It seems the fix is more fine-tuned than just showing punycode for everything. So it's still a workaround (punycode URLs are not fit for human consumption so this still actively punishes confusing domains even if they're not intentionally malicious) but it affects fewer domains than I initially feared.
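The sharp-s behaviour above is easy to reproduce with Python's built-in idna codec, which implements the older IDNA 2003 mapping (under the newer IDNA 2008 rules, ß is a valid letter in its own right, which is exactly why the two registrations can diverge):

```python
# IDNA 2003's nameprep step case-folds the sharp s to "ss", so the
# sharp-s domain collapses into the plain "ss" domain before it ever
# reaches DNS ("faß.de" is the canonical example domain):
print("fa\u00df.de".encode("idna"))  # b'fass.de'

# The pure-ASCII name round-trips unchanged:
print("fass.de".encode("idna"))     # b'fass.de'
```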
It was West-centric, yes, but it allowed for unique and legible ASCII identifiers. And it encouraged non-ASCII languages to create a unique (or mostly unique) Latin representation of their scripts - which is, in general, a good thing. It encouraged unification, using ASCII as the common ground.
Allowing Unicode characters opened a new Pandora's box, creating an unsolvable situation: either we keep the new names, making almost every string of characters potentially ambiguous, or we return to the state where ASCII-only names are the only usable ones.
Also, differentiating between ASCII and non-ASCII names doesn't solve the problem. Imagine the legitimate address is already in a non-ASCII script.
Some people in this thread seem almost eager to throw out any attempt at respecting cultures other than their own at the earliest convenient excuse.
Excluding EBCDIC, which has the same characters, can you name a major character set that doesn't start with a carbon copy of ASCII? Shift JIS starts with ASCII. Big5 starts with ASCII. Every code page starts with ASCII. Unicode, of course, starts with ASCII. Look at just about any (physical) keyboard for any language and it will support ASCII.
A proper fix would keep the domain name human-readable but differentiate between the ASCII and homoglyph versions.
How? Not my job to figure that out. If you want a random idea: the homoglyphs could be rendered differently (i.e. make the font disambiguate them). That's probably not a perfect solution but I'm not getting paid to do this.
> Block a label made entirely of Latin-look-alike Cyrillic letters when the TLD is not an IDN (i.e. this check is ON only for TLDs like 'com', 'net', 'uk', but not applied for IDN TLDs like рф).
That's neither "ridiculously clever", nor will it make (non-nefarious) IDN domains unusable.
However, it seems they don't flat out block all IDN domains but only those containing the homoglyphs. IIUC they also don't block domains containing Cyrillic homoglyphs alongside other Cyrillic characters.
This seems somewhat reasonable. I still think rendering Cyrillic in a way that makes alphabet mismatches more obvious would be a better and more future-proof solution.
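The quoted check is simple enough to sketch. Hypothetical Python, with a hand-picked confusable set standing in for Unicode's real confusables data (UTS #39):

```python
# Cyrillic letters that pass for Latin ones (illustrative subset only):
# а, е, о, р, с, у, х, and palochka (looks like "l").
LATIN_LOOKALIKE_CYRILLIC = set(
    "\u0430\u0435\u043e\u0440\u0441\u0443\u0445\u04cf"
)

def should_show_punycode(label, tld):
    """Flag a label made entirely of Latin-look-alike Cyrillic letters,
    but only when the TLD itself is plain ASCII (com, net, ...)."""
    if not tld.isascii():
        return False  # IDN TLD like рф: Cyrillic is expected there
    return bool(label) and all(ch in LATIN_LOOKALIKE_CYRILLIC for ch in label)

# The all-Cyrillic "apple" spoof on .com gets flagged:
print(should_show_punycode("\u0430\u0440\u0440\u04cf\u0435", "com"))  # True
# An ordinary Cyrillic name on an IDN TLD is left alone:
print(should_show_punycode("\u043c\u043e\u0441\u043a\u0432\u0430",
                           "\u0440\u0444"))  # False
```

Note how a label containing non-look-alike Cyrillic letters also passes, matching the observation that mixed Cyrillic names aren't blocked.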