As far as I can see, the characters used in this attack are indeed on the list. So presumably the problem is that not all domain registrars are checking this.
I don't understand why ICANN or Verisign don't enforce this. This is hardly a new problem. Does anyone know the background to why this is possible?
There is a starting advantage for ASCII because so many domains were registered from English speaking countries in the early years of the web. But non-ASCII using countries are just as vulnerable to this attack.
Although my linking to the old version was an accident, it's interesting that a) the old version explicitly says it's “for IDN”, and b) the characters used in this particular attack were already listed then.
Here's a report of this attack from 2002, emphasising that it really isn't a new problem. Why on earth haven't the domain registrars acted? Or if they have, what went wrong in this case?
U+0435 CYRILLIC SMALL LETTER IE
U+0440 CYRILLIC SMALL LETTER ER
U+0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
U+0441 CYRILLIC SMALL LETTER ES
I don't see this as being an easy or even solvable problem in general --- after all, a Russian IDN registrar could rightly argue that it should be valid to use all Cyrillic characters in IDNs; but at least for those of us who are content with English-only domain names, we can just disable IDN.
I suppose one solution would be, when IDN is detected, to show both the Unicode and Punycoded versions:
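For illustration, Python's built-in `idna` codec converts between the two forms, here using the thread's all-Cyrillic lookalike of epic.com (a sketch of the conversion, not any browser's actual UI logic):

```python
# Python's built-in "idna" codec converts between the Unicode form of a
# domain and its Punycode (ASCII-compatible) form, label by label.
unicode_domain = "еріс.com"  # all-Cyrillic lookalike of epic.com

ace_form = unicode_domain.encode("idna").decode("ascii")
print(ace_form)  # xn--e1awd7f.com

round_trip = ace_form.encode("ascii").decode("idna")
print(round_trip)  # еріс.com
```

A browser could simply render both strings side by side in the address bar.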
One could also argue that this is also a usability feature, for easy copying of IDNs by those who otherwise cannot write them.
The problem is quite old; there is a 12-year-old bug report on this very problem in Mozilla.
That looks like the best solution. The security certificate of the site shows the punycode version of the URL. If the address bar URL and the certificate URL are different, then, at least for addresses in English, that should raise an alarm.
There's a mention below ( https://news.ycombinator.com/item?id=14119826 ) that some browsers will show the IDN in the certificate too, presumably because it would otherwise be equally confusing if they didn't match (they do match). This isn't a problem that can be solved without showing both the IDN and the punycode.
From that, you can see that domains composed exclusively of those characters that are homoglyphic (or nearly so) with their ASCII equivalents would be susceptible to this:
That's not all the letters, but there are plenty of English domain names that can be written to look identical using entirely Cyrillic characters.
Everyone should learn to speak English like normal people, is that what you're trying to say?
There is a set of simple, clearly defined symbols (ASCII) that make up the URL. It must stay like that for URLs ever to be secure. If a URL can be made of tens of thousands of symbols, some of which look completely identical, URLs will never be secure.
Simple sanity checks for mixed-character URLs by the browser easily remedy this.
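Such a check could be sketched roughly like this (per domain label; a real implementation would use the Unicode Script property, e.g. the UTS #39 data, rather than this character-name heuristic, and would special-case digits and hyphens):

```python
import unicodedata

def scripts_used(label: str) -> set:
    """Heuristic script detection: the script is usually the first word
    of the Unicode character name, e.g.
    "CYRILLIC SMALL LETTER ES" -> "CYRILLIC"."""
    return {unicodedata.name(ch, "UNKNOWN").split(" ")[0] for ch in label}

def is_mixed_script(label: str) -> bool:
    return len(scripts_used(label)) > 1

print(is_mixed_script("epic"))  # False: all Latin
print(is_mixed_script("еpic"))  # True: Cyrillic 'е' followed by Latin 'pic'
print(is_mixed_script("еріс"))  # False: all Cyrillic
```

Note the last case: a label written entirely in one confusable script sails straight through a mixed-script check, which is exactly the attack being discussed here.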
Except in this particular example, the name 'epic' is written entirely with Cyrillic characters. One could argue whether a Russian company has the right to use such a name (with all-Cyrillic characters), or how it would be input --- and if you want it to be easy to input for those who don't have complete IDN support and matching ways to input those characters, you'd have to also display the punycoded version, at which point a transliterated name would be better anyway.
My point/question is more "is the issue actually words in two different character sets creating a collision or mixed alphabet collisions?" If it's the former, I'd like to see more examples to show this is a prevalent issue that wouldn't be caught by a sanity check that catches URLs with mixed character sets. (That is, how many unique words across languages are there that actually get a hit this way, in the same way Epic/Ерис do?)
As for passports, I understand your point on the universal nature of English, but International passports serve a completely different purpose than a domain name. So much of the web is non-english, save the domain name for some of them, and I'm not sure why it has to be this way. If the owners want it that way, great. If they want a domain in the local language, I'm not sure why they shouldn't be allowed to. A passport is meant for a very specific set of people (border guards, law enforcement, airline and other travel employees) to be able to verify your identity. A domain name just needs to resonate with its audience.
I just don't see why every business across the planet needs to compete for the same flowers.com address, or petstore.com, or cars.com. There are plenty of unique words and ways to communicate the same idea, I'm not certain why we need to restrict ourselves to English here. It just makes already expensive domain names even more expensive. If I have a club in San Francisco that I wanna call White Nights, there's no reason that a club in St. Petersburg should have to compete with me for the same name when they have their own term for White Nights, especially when the issue seems to not be about foreign character sets, but mixing character sets in the domain name. All ascii does work, but it's a grenade for an anthill.
Edit: an aside, I didn't realize that Safari translates to punycode. Very cool.
There's already a .ru domain. And why would it want petstore.com instead of zoomagazin.com? Everyone is able to transliterate to Latin here. Many years have passed since .рф was introduced, and almost nobody is using domains in Cyrillic. Sites that use Latin domains are much more popular. At this point it's safe to say that non-ASCII domains are a feature that nobody really needed (except for phishers, of course).
IMO it'd be fine if there was an automated flagging mechanism that would require manual review for suspicious domains, though. But good luck getting registrars to do anything more involved than a simple blacklist check.
It would absolutely have to be a pretty labor-intensive, manually-maintained database.
- Get phishing sites on Chrome/Firefox's red-flag lists (presumably this already happens?)
- Show a country flag in the URL bar based on the language/alphabet of the URL
- Change the URL bar colour for punycode domains
- Show a big banner the first time you visit a site that you've not visited before.
- Have the browser detect and warn about possible homographs based on history.
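The history-based idea could look roughly like this (a toy confusables map stands in for the real UTS #39 skeleton data, and `warn_if_homograph` is an illustrative name, not any browser's API):

```python
# Toy confusables map; the real data set is Unicode's confusables.txt.
CONFUSABLES = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "і": "i"}

def skeleton(host: str) -> str:
    """Fold confusable characters onto their ASCII lookalikes."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in host)

def warn_if_homograph(host: str, history: set) -> bool:
    """Warn when a never-visited host looks like a previously visited one."""
    if host in history:
        return False
    visited_skeletons = {skeleton(h) for h in history}
    return skeleton(host) in visited_skeletons

history = {"epic.com", "news.ycombinator.com"}
print(warn_if_homograph("еріс.com", history))  # True: resembles epic.com
print(warn_if_homograph("epic.com", history))  # False: genuinely visited
```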
Having said all that, is this really such a large problem, given that no browser has felt the need to solve it? I suspect the centralised malware lists are good enough, but scary demo URLs will never get onto one.
Languages are often spoken in more than one country. Should amazon.com fly the Union Jack or wien.gv.at a German flag? I don't think a Russian flag for Cyrillic would be appreciated in Ukraine either.
Scripts (e.g., Latin, Cyrillic, Hangul) could get their own signifier or symbol, but that does mean disallowing (or discouraging) mixed script domains — which may not be a bad idea.
Perhaps certain scripts should be limited to certain ccTLDs and generic TLDs, so that the demo URL would be disallowed in .com but permissible in .cyrl, which in turn disallows any other scripts (including Latin). It might solve the phishing angle.
It would also solve complex cases such as Japanese (which would require Chinese characters, katakana, hiragana, and some Latin characters; perhaps the full-width set?). Perhaps put those under .日本 and allow only Latin characters under .jp.
Technically feasible, though politically untenable.
Unfortunately, this would require all the big registrars to be on board for it to actually be effective.
I think a lookup table would be enough.
And then a rule that might be good to implement at registrars (please correct me if I'm missing a valid use case for this): if all the Unicode characters represent ASCII (or its 'duplicates'), reject the registration.
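That rule could be sketched like so (with a deliberately tiny lookup table; a real registrar would need the full Unicode confusables data):

```python
# Tiny illustrative table of Cyrillic characters with ASCII twins.
HOMOGLYPHS = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c",
              "х": "x", "у": "y", "і": "i", "ѕ": "s"}

def should_reject(label: str) -> bool:
    """Reject registration when the label contains lookalikes and every
    character could pass for ASCII, i.e. the whole label can be mistaken
    for a plain ASCII name."""
    has_lookalike = any(ch in HOMOGLYPHS for ch in label)
    all_ascii_like = all(ch.isascii() or ch in HOMOGLYPHS for ch in label)
    return has_lookalike and all_ascii_like

print(should_reject("еріс"))        # True: every character mimics ASCII
print(should_reject("epic"))        # False: plain ASCII, nothing to reject
print(should_reject("зоомагазин"))  # False: 'з', 'м', 'г' have no ASCII twins
```

The `has_lookalike` condition is what keeps plain ASCII registrations legal while still catching all-lookalike labels.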
Forcing punycode representation is a hack that would help English speaking countries (you don't expect non-Latin characters in typical English language website domains) but it wouldn't solve the problem.
Also, it's not as simple as "ASCII or not": outside English most European languages use a superset of ASCII glyphs mostly via diacritics or variants. It's not useful to single these glyphs out as "non-ASCII". But these languages too are affected by the same homoglyph attacks as English.
It's a hard problem and in the 21st century it certainly won't be fixed with one-off small-scale hacks that only benefit a subset of monolingual English-speaking users.
Your company name contains an umlaut? Better buy both the IDN and ASCII versions for each variant. Boom, doubled the sales of domain names.
For my purposes I don't need non-ASCII domains. Feel free to use unicorn characters in your address bar, but I'd much rather have it special-cased in mine. My security shouldn't suffer just because it's so hard to solve universally.
We don't need a universal solution.
There's an argument to be made that we should not solve problems in a fashion that only benefits English (or English and other European languages). Generally when that happens, the speakers of those languages - which are still the dominant group in internet standardization - lose all interest in fixing the problem for everyone else.
The internet is on the way to becoming a truly global medium, if it isn't there already. We need to stop designing for English first.
But security is not the place for such cavalier attitudes. I'll gladly give up my Ñ's and É's and Ç's in domains and other global identification services, in exchange for knowing that It Will Just Work, always, everywhere, safely. And once the "very smart people that really figure things out" have a solution that works with more extensive cultural/linguistic features, by all means put it in place.
How is that English first?
Maybe someone could make this as a chrome extension?
Useless to you != useless to everyone.
Chrome behaves oppositely, showing unicode for https://www.xn--e1awd7f.com/ but not showing emoji for http://xn--8q8h17eba.ga/
The SciTE text editor did reveal the difference:
Does it work on any others?
Edit: I am incorrect. Windows and Notepad are not the problem. It seems to be a peculiarity of the particular browser I am using, Cyberfox 52.0.4, which is converting the punycode as it is copied.
Edit: It looks like my anti-virus, Avast, is doing the conversion. I tested this on browsers with and without the Avast plugin, and the conversion only occurs on the browsers with the plugin.
Could it be that Notepad now delegates the handling of Unicode to the OS? I tested the above on the latest official release, Windows 10 Creators Update.
I tried the Copy/Paste of the fake URL with standard Firefox, Brave (Chrome), and Avast Safezone (Chrome). Only in Brave did the fake address copy correctly as punycode. The other two browsers have the Avast plugin as a common factor, but Brave does not.
This suggests the browser converts the punycode before it can be copied, rather than merely rendering it differently; but refreshing the page does not lead to the real site, so the conversion only happens when copying from the address bar. I am using Cyberfox 52.0.4.
Especially since the title for the linked article also calls out the offending browsers.
They look exactly the same; they are the same characters, just borrowed from one language to another.
It should just be illegal to register same-looking domains. All other solutions that force browser vendors to "do something about it" imply that the web is in English and all other languages are just add-ons.
You already mention this yourself: you don't want the browsers to "do something about it." Ok, but somebody has to. So, if not the browsers, then ICANN. But they would be using the very same algorithm, in the end, no? If so, then why not have the browsers do that, in the first place? They have more control over UI, and it is a cleaner separation of concerns and responsibility.
Somebody needs to do something. And because this is about humans (in the end, when you say "similar looking", you mean "... to humans"), better have that be in the part of the system that actually is about them: the user agent. ICANN is too slow to be able to rapidly adapt to changes in this fight.
If I give you a piece of paper with "epic -> epic" you would say "it says epic two times".
If we give the same piece of paper to a Russian, he will say "it says <insert Russian meaning here> two times". The Russian may even say it says epic (as in the English meaning), because there might be no "epic" word in Russian.
In short: it all depends on how the reader interprets the characters e|p|i|c. The same should apply to computers too.
The lowercase a is the same character in French, Russian, English etc. The fact that some decades ago some computer inventors decided that there are four types of "a" depending on the language - that's completely broken in my opinion.
I don't see any use in embedding the owning language into every character. To me, this is obviously an encoding "flaw".
So the ones that actually have to do something are the OS vendors. If they don't, then ICANN or browser vendors should do something.
- Chromium Version 57.0.2987.133
- Chrome Version 57.0.2987.133
- Firefox ESR 45.8.0
- Tor Browser 6.5.1 (based on Mozilla Firefox 45.8.0)
tested with https://www.xn--80ak6aa92e.com/ -> https://www.apple.com/
I consider myself to be quite privacy-conscious. But how do you even protect against something like this? IMO there should be a clear opt-in for Unicode URL character conversion in browsers.
And of course we should stop using passwords and switch to physical keys.
But the problem would still remain, for example if one company uses a name with the Latin "i" character and another with the Ukrainian character that looks the same.
We currently use Unbound for users.
The possibilities seem mind-boggling.
They already do this with misspelled domains.
My point is, for this segment it will never be a money problem, more a digital-culture-deficit problem. Most of them are still not proactive enough about these issues; it seems like they don't care that much. At least until they get hit hard by this. Then they will likely spend 100x on lawyers, reputation, and so on.
1. Non-SSL/TLS sites get marked as insecure. With recent events (ISPs allowed to sell your data) this seems like a no-brainer.
2. Domain-validated certificates, such as those from Let's Encrypt, get a normal status.
3. Extended Validation certificates (or whatever they are called) get the green lock to mark them as trusted.
About your first point: there's a huge chunk of the internet that doesn't even speak English, so I'd say yes.
Regarding countries that don't use latin letters, examples:
In contrast, Japanese or Chinese people almost never use English characters mixed with their own on a day-to-day basis.
The Alexa results are really interesting, but I think the whole Unicode-in-the-URL thing is recent (or at least browser support for it), right? So it would make sense for Unicode URLs not to be so popular yet. Or it might just be that way; I'd love to see, for say the top N sites, what % use punycode, to see whether it's relevant or not.
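That percentage would be easy to measure given such a list (a sketch assuming a plain list of domains, e.g. parsed from a downloaded ranking CSV; the three sample entries are stand-in data):

```python
# Punycode labels carry the "xn--" ACE prefix, so counting IDNs in a
# domain list is a simple string check per label.
domains = ["google.com", "xn--80ak6aa92e.com", "example.org"]

idn_domains = [d for d in domains
               if any(label.startswith("xn--") for label in d.split("."))]
print(f"{len(idn_domains)} of {len(domains)} use punycode labels")
```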