The Unicode consortium maintains a list of “confusable” strings[0], which is supposed to be used to prevent the registration of domains that are visually identical or near-identical to already-registered domains.
As far as I can see, the characters used in this attack are indeed on the list. So presumably the problem is that not all domain registrars are checking this.
I don't understand why ICANN or Verisign don't enforce this. This is hardly a new problem. Does anyone know the background to why this is possible?
It's not that hard to come up with legitimate domains in non-Latin script that could be read as a word written in Latin script. For example, керас is a Russian transliteration of Keras. So I imagine they would get a fair number of false positives and (fair) claims that non-ASCII languages are being oppressed. To me, a better alternative would be to limit punycode names to punycode TLDs, and make sure that no two TLDs look the same: a task of much smaller scale.
It's not necessarily a matter of oppressing non-ASCII languages if it's a symmetric relation. So if "epic.com" is registered using ASCII, that precludes registering the cyrillic version, but if it's already registered in cyrillic, that precludes it being registered in ASCII.
There is a starting advantage for ASCII because so many domains were registered from English speaking countries in the early years of the web. But non-ASCII using countries are just as vulnerable to this attack.
Yes, that was my suggestion: non-Latin domains MUST be in the non-Latin TLDs, with the additional constraint that non-Latin TLDs cannot look like already-registered Latin TLDs. In a twist of irony, the Russian word "сом" (roughly pronounced as "some") means "catfish" (the animal, not the act of catfishing, though).
Although my linking to the old version was an accident, it's interesting that a) the old version explicitly says it's “for IDN”, and b) the characters used in this particular attack were already listed then.
Here's a report of this attack from 2002, emphasising that it really isn't a new problem. Why on earth haven't the domain registrars acted? Or if they have, what went wrong in this case?
U+0435 CYRILLIC SMALL LETTER IE
U+0440 CYRILLIC SMALL LETTER ER
U+0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
U+0441 CYRILLIC SMALL LETTER ES
...which are basically indistinguishable from the corresponding ASCII letters (U+0065 U+0070 U+0069 U+0063) in most if not all fonts.
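A quick way to double-check those codepoints (a minimal sketch, assuming Python 3 and its built-in IDNA codec; the xn-- form is the one quoted elsewhere in this thread):

    # Decode the spoof domain's ASCII-compatible form and list each codepoint.
    label = b"xn--e1awd7f".decode("idna")
    for ch in label:
        print(f"U+{ord(ch):04X}  {ch}")
    # U+0435  е
    # U+0440  р
    # U+0456  і
    # U+0441  с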
I don't see this as being an easy or even solvable problem in general --- after all, a Russian IDN registrar could rightly argue that it should be valid to use all Cyrillic characters in IDNs; but at least for those of us who are content with English-only domain names, we can just disable IDN.
I suppose one solution would be, when IDN is detected, to show both the Unicode and punycoded versions: e.g. https://www.еріс.com/ alongside https://www.xn--e1awd7f.com/.
Yeah, I remember this vulnerability getting a lot of attention back when that bug was filed. I don't remember how the browsers decided to fix it, but there was some kind of fix. What happened in the last decade that changed that? And what went wrong in the browser-makers' processes for the underlying issue to be forgotten or overlooked?
>I suppose one solution would be, when IDN is detected, to show both the Unicode and Punycoded versions:
That looks like the best solution. The security certificate of the site shows the punycode version of the URL. If the address bar URL and the certificate URL are different, then, at least for addresses in English, that should raise an alarm.
>If the address bar URL and the certificate URL are different, then, at least for addresses in English, that should raise an alarm.
There's a mention below ( https://news.ycombinator.com/item?id=14119826 ) that some browsers will show the IDN in the certificate too, presumably because it would otherwise be equally confusing if they didn't match (they do match.) This isn't a problem that can be solved without showing both the IDN and punycode.
The least-impact visualisation could use colours: group the characters in the domain name by script family, then use regular colours for the largest group (so that e.g. an all-Korean domain name would look just fine) and highlight the characters that are exotic relative to the rest.
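Roughly, the grouping step could look like this (a sketch, assuming Python 3, and cheating by using the first word of each character's Unicode name as a stand-in for its script property):

    import unicodedata
    from collections import Counter

    def script_of(ch):
        # First word of the Unicode name, e.g. "LATIN", "CYRILLIC", "HANGUL".
        return unicodedata.name(ch, "UNKNOWN").split()[0]

    def exotic_chars(label):
        """Characters whose script differs from the label's majority script."""
        letters = [c for c in label if c.isalpha()]
        if not letters:
            return []
        majority = Counter(script_of(c) for c in letters).most_common(1)[0][0]
        return [c for c in letters if script_of(c) != majority]

    print(exotic_chars("\u0435\u0440\u0456\u0441"))  # all Cyrillic -> []
    print(exotic_chars("\u0435pic"))                 # mixed -> the lone Cyrillic letter is flagged

The characters returned by exotic_chars would be the ones to highlight.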
or maybe just disallow mixed-charset domains. After all there's not much use for a domain that's partially Cyrillic and partially Latin. Not that I'm aware of anyway - and the rule could be more granular if there are legit uses.
The point is that it reduces the number of potential variations. There would only be 1 or 2 possible visually-identical phishing domains for any Latin domain, as opposed to potentially hundreds or thousands as it stands now.
From that, you can see that domains composed exclusively of those characters which are homoglyphic, or nearly so, with their ASCII equivalents would be susceptible to this:
AaBCcEeFHhIiJjKMOoPpSsTXxY
That's not all the letters, but there are plenty of English domains which can be written to look the same using entirely Cyrillic characters.
When I generate all the English words that are in a UK dictionary that use the letters "acehijopsx" exclusively, and then test which are reachable domains, I come up with only about 500. Other than epic.com, the most significant ones are chase.com, sap.com, soap.com and sex.com, I think.
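For anyone who wants to reproduce the word-list half of that, here's a sketch (assuming Python 3 and a standard word list at /usr/share/dict/words; checking which of the results actually resolve as .com domains would be a separate step):

    ALLOWED = set("acehijopsx")   # ASCII letters with near-perfect Cyrillic twins

    with open("/usr/share/dict/words") as f:
        words = {w.strip().lower() for w in f if w.strip().isalpha()}

    spoofable = sorted(w for w in words if len(w) >= 3 and set(w) <= ALLOWED)
    print(len(spoofable))   # count varies with the word list
    print([w for w in spoofable if w in {"epic", "chase", "sap", "soap", "sex"}])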
I've said this for years now. This is an old exploit that would be 100% preventable with ASCII-only URLs. I remember that the last time I said this here, I got downvoted to oblivion for stating the obvious truth.
As a non-native English speaker: Yes (English is my 3rd language).
There is a set of simple, clearly defined symbols (ASCII) that make up a URL. It must stay that way for URLs to ever be secure. If a URL can be made of tens of thousands of symbols, some of which are completely identical, URLs will never be secure.
It also really changes how naming works for a lot of sites, though, and limits the non-English-speaking world to competing for website names in English when perfectly equivalent words exist in the native language. If a Russian upstart wants a simple site like зоомагазин.com or .рф, they shouldn't have to compete for the name petstore.com: that's a name they don't use, that isn't even in their country, and that can't possibly be competition. Plenty of shops and businesses here use .рф addresses alongside those that transliterate Russian to English. Limiting domains to just ASCII characters removes options and artificially inflates the value of common, simple domain names, forcing a bidding war where there doesn't have to be one. Simple sanity checks for mixed-character URLs by the browser easily remedy this.
One only has to look at things like passports, where internationally-significant naming is important, to see that English (and thus transliterated names) is the norm:
>Simple sanity checks for mixed-character URLs by the browser easily remedy this.
Except that in this particular example, the name 'epic' is written entirely in Cyrillic characters. One could argue whether a Russian company has the right to use such a name (in all-Cyrillic characters), or how it would be input --- and if you want it to be easy to input for those who don't have complete IDN support and matching ways to input those characters, you'd have to also display the punycoded version, at which point a transliterated name would be better anyway.
I'm not sure that there really are that many full-word collisions across two different alphabets. Epic versus Ерис is certainly a nice example, but how many others are there? (Also, I know there is a character that can be used to look like the Latin Ii, but every time I see the и character or sound, it's always Ии; that's not really what this is about, though.) And honestly, I do think that Ерис.com has as much right to be a site as Eric.com - it would be trivial for a browser to know whether the URL was provided in Latin characters or Cyrillic.
My point/question is more "is the issue actually words in two different character sets creating a collision or mixed alphabet collisions?" If it's the former, I'd like to see more examples to show this is a prevalent issue that wouldn't be caught by a sanity check that catches URLs with mixed character sets. (That is, how many unique words across languages are there that actually get a hit this way, in the same way Epic/Ерис do?)
As for passports, I understand your point on the universal nature of English, but International passports serve a completely different purpose than a domain name. So much of the web is non-english, save the domain name for some of them, and I'm not sure why it has to be this way. If the owners want it that way, great. If they want a domain in the local language, I'm not sure why they shouldn't be allowed to. A passport is meant for a very specific set of people (border guards, law enforcement, airline and other travel employees) to be able to verify your identity. A domain name just needs to resonate with its audience.
I just don't see why every business across the planet needs to compete for the same flowers.com address, or petstore.com, or cars.com. There are plenty of unique words and ways to communicate the same idea, I'm not certain why we need to restrict ourselves to English here. It just makes already expensive domain names even more expensive. If I have a club in San Francisco that I wanna call White Nights, there's no reason that a club in St. Petersburg should have to compete with me for the same name when they have their own term for White Nights, especially when the issue seems to not be about foreign character sets, but mixing character sets in the domain name. All ascii does work, but it's a grenade for an anthill.
Edit: an aside, I didn't realize that Safari translates to punycode. Very cool.
>зоомагазин.com or .рф they shouldn't have to compete for the name petstore.com
There's already a .ru domain. And why would it want petstore.com instead of zoomagazin.com? Everyone is able to transliterate to Latin here. Many years have passed since .рф was introduced and almost nobody is using domains in Cyrillic. Sites that use Latin domains are much more popular. At this point it's safe to say that non-ASCII domains are a feature that nobody really needed (except for phishers, of course).
The rules on homoglyph domain registration need to be tightened up and enforced.[1] As usual, ICANN is soft on making registrars obey rules. That's where the blame lies.
It's a hard problem. There are valid use cases for mixing characters from different alphabets. They'd need to detect malicious mixing without creating a prohibitive amount of false positives.
IMO it'd be fine if there was an automated flagging mechanism that would require manual review for suspicious domains, though. But good luck getting registrars to do anything more involved than a simple blacklist check.
Homograph detection isn't about prohibiting mixed characters; it's about considering letters that look the same to be the same when checking for existing duplicates.
Yeah, if I were doing it I'd probably tend to be pretty cautious - if I see any popularly-used typeface that would confuse the characters I'd mark them as homographs.
It would absolutely have to be a pretty labor-intensive, manually-maintained database.
And thus extremely error prone. I think introducing IDNs at all was a mistake given the (security) cost-benefit tradeoff, and the best solution would be to scrap them but I realize that is unlikely to happen. Domain name script is probably the last place where cultural imperialism needs to be fought, it's just empty PC posturing and a way to extract some more easy money from gullible domain owners.
Seems like the kind of thing Google Safe Browsing is there for - flagging URLs as malicious. Sure, they have to be discovered first, but at least it's a start.
Or at least, prohibit using strings of homographs to spoof entire words. This solution has the advantages that it would not restrict the detection of spoofed domains to English-only sites, and that it should be robust to the use of diacritics and non-English alphabetic characters. It would not work, though, if the spoofed site did not use homographs for the entire word.
Get phishing sites on Chrome/Firefox's red-flag lists (presumably this already happens?)
Show a country flag in the URL bar based on the language/alphabet of the URL
Change the URL bar colour for punycode domains
Show a big banner the first time you visit a site that you've not visited before.
Have the browser detect and warn about possible homographs based on history.
Having said all that, is this really such a large problem, given that no browser has yet needed to solve it? I suspect the centralised malware lists are good enough, but scary demo URLs will never get onto one.
> Show a country flag in the URL bar based on the language/alphabet of the URL
Languages are often spoken in more than one country. Should amazon.com fly the Union Jack or wien.gv.at a German flag? I don't think a Russian flag for Cyrillic would be appreciated in Ukraine either.
Scripts (e.g., Latin, Cyrillic, Hangul) could get their own signifier or symbol, but that does mean disallowing (or discouraging) mixed script domains — which may not be a bad idea.
Perhaps certain scripts should be limited to certain ccTLDs and generic TLDs, so that the demo URL would be disallowed in .com but permissible in .cyrl, which in turn would disallow any other scripts (including Latin). It might solve the phishing angle.
It would also solve complex cases such as Japanese (which would require Chinese characters, katakana, hiragana, and some Latin characters; perhaps the full-width set?). Perhaps put those under .日本 and allow only Latin characters under .jp.
Technically feasible, though politically untenable.
It's a bit weird that punycode for emojis gets rejected for .com domains, but this kind of stuff is perfectly fine. I wonder who made these decisions, and with what reasoning.
The proposed fix is invalid, as it will break all those sites that actually use international letters. I am looking especially at Japanese and Chinese, but I can see how you cannot allow other languages' characters either since, for example, Cyrillic has some characters that are really similar to Latin ones. This is a really odd situation and there's no easy fix IMO.
I think it would be useful to implement some security against this at the registrar level (until a better fix is more broadly available). For example, if I'm registering "epic.com" (the ASCII version), the registrar could suggest that I also register "epic.com" (the Cyrillic version), or vice versa. This could at least help site owners avoid phishing attacks on their own domains.
Unfortunately, this would require all the big registrars to be on board for it to actually be effective.
In order to prevent anything, you would need to register every combination of Latin and Cyrillic. For a short domain like "epic" this constitutes 16 domains. For a 7-character domain it would be 128 domains. In either case it would be a heavy multiplier on the base cost of the domain.
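The count comes from having two choices (Latin or Cyrillic) for each substitutable letter, so 2^n combinations including the all-Latin original. A sketch of the enumeration, using a hypothetical four-entry homoglyph table:

    from itertools import product

    # Hypothetical table for this example; a real one would cover far more letters.
    HOMOGLYPHS = {"e": "\u0435", "p": "\u0440", "i": "\u0456", "c": "\u0441"}  # Latin -> Cyrillic

    def variants(name):
        choices = [(ch, HOMOGLYPHS[ch]) if ch in HOMOGLYPHS else (ch,) for ch in name]
        return ["".join(combo) for combo in product(*choices)]

    print(len(variants("epic")))  # 16 == 2**4, including the all-Latin original
    # A 7-letter name where every letter has a look-alike gives 2**7 == 128 variants.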
If you remind them there would be an increase in sales - especially if they point out the danger and then upsell the ASCII version - then they'd likely implement it; at least they should A/B test it. Not great - but that's how I see them doing it.
In all the cases I've seen so far, the glyphs haven't just been similar, they've been the exact same glyphs, just in a different part of unicode.
I think a lookup table would be enough.
And then a rule that might be good to implement at registrars (please correct me if I'm missing a valid use case for this): if all the Unicode characters represent ASCII (or its 'duplicates'), reject the registration.
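A sketch of that rule (assuming Python 3 and a small, hand-maintained table of Cyrillic look-alikes; the real confusables data linked at the top of the thread is much larger):

    # Cyrillic letters that are visually (near-)identical to ASCII letters:
    # а е о р с у х і ј ѕ
    ASCII_LOOKALIKES = set("\u0430\u0435\u043e\u0440\u0441\u0443\u0445\u0456\u0458\u0455")

    def reject_registration(label):
        """Reject a non-ASCII label whose every character merely imitates ASCII."""
        if label.isascii():
            return False
        return all(ch in ASCII_LOOKALIKES or ch == "-" or ch.isdigit() for ch in label)

    print(reject_registration("\u0435\u0440\u0456\u0441"))  # True: pure homoglyph of "epic"
    print(reject_registration("зоомагазин"))                # False: legitimately Cyrillic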
Address bars should show punycode URLs in a different colour, or perhaps have an option for displaying URLs in ASCII only (turned on by default for English-locale browsers).
What English speakers tend to forget is that the problem exists outside Latin script in exact reverse: if there are non-Latin homoglyphs for Latin characters that means there are also Latin homoglyphs for non-Latin characters.
Forcing punycode representation is a hack that would help English speaking countries (you don't expect non-Latin characters in typical English language website domains) but it wouldn't solve the problem.
Also, it's not as simple as "ASCII or not": outside English most European languages use a superset of ASCII glyphs mostly via diacritics or variants. It's not useful to single these glyphs out as "non-ASCII". But these languages too are affected by the same homoglyph attacks as English.
It's a hard problem and in the 21st century it certainly won't be fixed with one-off small-scale hacks that only benefit a subset of monolingual English-speaking users.
Given that English, and in particular, ASCII domain names, have practically been the lingua franca of the Internet even for those who otherwise use a different language, I don't think it's too unreasonable to say that IDN has really caused more problems than it's solved.
I don't know much about the apparently thriving Japanese and Chinese internet communities, but there's a big chunk of the internet in Spanish. I agree English was by far the most used language on the internet, but with cheap and global internet access I'll argue that the percentage of English speakers relative to internet users is lower every day.
I lived in China for years, and not a single time did I see an IDN domain advertised; they always use regular Latin domains (number domains are particularly popular), so I don't really understand the point of this option...
For my purposes I don't need non-ASCII domains. Feel free to use unicorn characters in your address bar, but I'd much rather have it special-cased in mine. My security shouldn't suffer just because it's so hard to solve universally.
There's an argument to be made that we should not solve problems in a fashion that only benefits English (or English and other European languages). Generally, when that happens, the speakers of those languages - which are still the dominant group in internet standardization - lose all interest in fixing the problem for everyone else.
The internet is on the way to becoming a truly global medium, if it isn't there already. We need to stop designing for English first.
We need to think things through and take security seriously, before we cause more serious problems. We have all seen semi-broken characters and weird symbols appearing in text (due to fonts, codepages, encodings, or plain bad handling of unicode strings) for decades now, and have mostly shrugged it off, accepting that maybe someone else is seeing more appropriate text thanks to that.
But security is not the place for such cavalier attitudes. I'll gladly give up my Ñ's and É's and Ç's in domains and other global identification services, in exchange for knowing that It Will Just Work, always, everywhere, safely. And once the "very smart people that really figure things out" have a solution that works with more extensive cultural/linguistic features, by all means put it in place.
This is a pretty decent idea. I struggle to think how it would apply to non-English-language browsers, as they use so many Latin-character domains anyway.
Maybe someone could make this as a chrome extension?
I find the title confusing. At first I thought the post was about DNS hijacking or something similar. However, the phishing attack does not make use of identical domains but homographs.
How effective would a hashing solution be? Possibly with a colour code. Lastly, what about showing the number of Unicode and ascii characters in the URL so as not to get confused?
Windows users have a problem. The suggested solution of copying the URL into Notepad to reveal the true 'punycode' address does not work. Both Notepad and Wordpad render the 'punycode' as it appears in the address bar. Similarly for copying and pasting the URL in place, at least on firefox. A Chrome user commented that it did reveal the fake address.
Edit: I am incorrect. Windows and notepad are not the problem. It seems to be a peculiarity of the particular browser I am using, Cyberfox 52.0.4, which is converting the punycode as it is copied.
Edit: It looks like my Anti-virus, Avast is doing the conversion. I tested this on browsers with and without the Avast plugin, and the conversion only occurs on the browsers with the plugin.
Neither Notepad nor Wordpad has any mechanism for converting punycode on the fly -- the copy and paste executes correctly on my box and displays as https://xn--e1awd7f.com/ in both Notepad and Wordpad. You must be somehow copy-pasting the post-converted URL.
I tried it again with the same result. From the same Copy/Paste, Notepad showed https://www.epic.com but Scite showed https://www.????.com/. This demonstrates that I have not copypasted the post-converted URL. Interestingly, pasting the fake address into HN comment box showed https://www.xn--e1awd7f.com/
Could it be that Notepad now delegates the handling of Unicode to the OS? I tested the above on the latest official release, Windows 10 Creators Update.
I also have the latest update. Not really sure what to make of this. If you manually type the punycode URL (https://www.xn--e1awd7f.com) into your Notepad or Wordpad and then copy THAT and paste it back on the following line, does it convert?
I agree with your comment below mentioning Cyberfox -- the browser appears to be to blame, not Windows per se. The only question is why your browser is doing that. On both Firefox and Chrome, I am not getting the Unicode inserted into my clipboard, only the actual punycode. In fact, if I attempt to copy the Unicode from the address bar in either browser, it pastes as https://www.xn--e1awd7f.com/ in the text editor, not as "epic.com." Edge seems to be completely immune to this exploit; it renders the punycode as punycode in the URL (I suppose it just doesn't "understand" punycode is all.)
I suspect that my antivirus software, Avast, is converting the URL to Unicode when it is copied.
I tried the Copy/Paste of the fake URL with standard Firefox, Brave (Chrome), and Avast Safezone (Chrome). Only in Brave did the fake address copy correctly as punycode. The other two browsers have the Avast plugin as a common factor, but Brave does not.
Manually typing the URL does not result in 'on the fly' conversion. Copying/Pasting from the manually typed URL does not result in a conversion. Copying from the link in your comment above does result in a converted URL. So, you are correct. Notepad is not the culprit.
This suggests that the browser is converting the URL before it can be copied, indicating that it converts the punycode rather than rendering it, but refreshing the page does not lead to the real site, so this only happens when copying from the address bar. I am using Cyberfox 52.0.4.
When I clicked on the padlock icon of the example site https://www.xn--e1awd7f.com/ in Android Chrome, the immediate dialog box showed the real site URL. When I clicked on Details, it showed the certificate as issued to epic.com instead. Wild stuff.
Using Cyberfox on windows 10, clicking on the padlock --> more information --> view certificate showed the certificate was issued to www.xn--e1awd7f.com
The title was just recently changed to remove the offending browsers. Since it's not a universal attack, it's limited in scope to only Chrome and Firefox; why remove this important metadata?
Especially since the title for the linked article also calls out the offending browsers.
The Unicode letters could be darker, which would still be visible if you are color blind. If the highlighted characters were not ones that would be expected to be represented by Unicode, i.e. if they were not special characters or diacritics, then that should raise an alarm, especially if the user was familiar with how the genuine URL looked.
Why is it even allowed to register "epic.com" with e|p|i|c in Cyrillic when it is already registered in English/ASCII??
They look exactly the same; they are the same characters, just borrowed from one language to another.
It should just be illegal to register same-character domains. All other solutions that force browser vendors to "do something about it" imply that the web is in English and all other languages are just add-ons.
It's about separation of concerns. Who defines what is the same and what is not? It's computers, so you need to be clear about this. You can't emulate our judicial process, here: make a law with a spirit and define the corner cases as they get litigated.
You already mention this yourself: you don't want the browsers to "do something about it." Ok, but somebody has to. So, if not the browsers, then ICANN. But they would be using the very same algorithm, in the end, no? If so, then why not have the browsers do that, in the first place? They have more control over UI, and it is a cleaner separation of concerns and responsibility.
Somebody needs to do something. And because this is about humans (in the end, when you say "similar looking", you mean "... to humans"), better have that be in the part of the system that actually is about them: the user agent. ICANN is too slow to be able to rapidly adapt to changes in this fight.
The truth of the matter is that e|p|i|c are actually the same characters just borrowed from one language to another.
If I give you a piece of paper with "epic -> epic" you would say "it says epic two times".
If we give the same piece of paper to a Russian he will say "it says <insert Russian meaning here> two times". The Russian guy may even say it says epic(as in the English meaning) because there might be no "epic" word in Russian.
In short: It all depends on the reader how he interprets the characters e|p|i|c. The same should apply to computers too.
The lowercase a is the same character in French, Russian, English etc. The fact that some decades ago some computer inventors decided that there are four types of "a" depending on the language - that's completely broken in my opinion.
I don't think and I don't see any use of embedding the belonging language to every character. For me, this is obviously like an encoding "flaw".
So the ones that actually have to do something are the OS vendors. If they don't, then ICANN or the browser vendors should do something.
Well this was pretty scary. Anybody got a link to the chrome canary fix? And what about mobile browsers?
I consider myself to be quite privacy conscious. But how do you even protect against something like this? IMO there should be a clear opt-in for Unicode URL character conversion in browsers.
I think this is a problem with registrars. It's as if they arbitrarily shared access to your domain setup with another person. They should not allow registering domain names that look identical or very similar.
And of course we should stop using passwords and switch to physical keys.
Seems like it would be much easier to convince browsers to make punycode in the browser bar opt-in than to convince every registrar to stop selling certain punycode domains. For starters, who draws the line on which characters are unsafe?
Maybe that's a good idea. Punycode could be used only for names ending with a non-Latin TLD.
But the problem would still remain, for example if one company uses a name with the Latin "i" character and another with the Ukrainian character that looks the same.
I ran the name through NFKC, but it didn't help. I didn't get back the ASCII equivalents. I wonder if Unicode has a normalisation like NFKC which applies to similar looking glyphs. Anyone know?
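For reference, roughly what I ran (a minimal sketch, assuming Python 3):

    import unicodedata

    spoof = "\u0435\u0440\u0456\u0441"                     # Cyrillic "еріс"
    print(unicodedata.normalize("NFKC", spoof))            # still the Cyrillic string
    print(unicodedata.normalize("NFKC", spoof) == "epic")  # False
    # NFKC/NFKD only fold compatibility equivalents (ligatures, width variants, etc.),
    # not cross-script look-alikes. The closest thing to what I'm after seems to be
    # the "skeleton" operation in Unicode TS #39, driven by the confusables.txt data
    # linked at the top of the thread, but that isn't part of any normalization form.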
Are there any name servers that would allow me to block or redirect all non-ASCII domain names? At this point, the value of Unicode domain names is a lot less than local safety.
In general, using a password manager creates a good line of warning against phishing attacks. I assume they would not trigger autofill for those fake sites either.
Or maybe use a font that makes the difference between ASCII characters and their Unicode clones obvious, without making the result too ugly for end users.
This is bad. Perhaps big corps should buy these fake domains as fast as they can, before scammers do, and redirect them to the legit websites. Doing so, they would help protect their users as well as their brand reputation.
Wouldn't these kinds of attacks, where you can replace one or more letters and have a new domain, be subject to combinatorial explosion? I can imagine companies with longer domains having to buy 10,000+ domains to do this.
Sure, but phishing is mostly about very big brands or banks.
My point is, for this segment it will never be a money problem, more a digital-culture deficit problem. Most of them are still not proactive enough about these issues; it seems like they don't care that much. At least, until they are hit hard by this. Then they will likely spend 100x on lawyers, reputation, and so on.
One solution is to detect when whole words are rendered via punycode. This would allow the use of diacritics and apply to languages other than English. Of course, there could still be exceptions, but maybe these could be handled on a case by case basis. The problem would become more tractable, at any rate.
There is no "conventional Unicode" in hostnames. DNS is ASCII-only in practice. All non-ASCII hostnames have to be encoded in punycode for DNS to work. This is why punycode was invented.
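To make that concrete, here is roughly what happens before a lookup (a sketch, assuming Python 3; its built-in codec implements the older IDNA 2003 rules, but the ASCII-compatible form is the same for this label):

    spoof = "\u0435\u0440\u0456\u0441"       # Cyrillic е р і с, renders like "epic"
    ace = spoof.encode("idna")               # the ASCII form that actually goes to DNS
    print(ace)                               # b'xn--e1awd7f'
    print(ace.decode("idna") == spoof)       # True: the Unicode form is only for display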
I see two issues: the mere existence of punycode (does any serious website use a punycode domain?) and the ability of any website to generate an SSL certificate for itself in a matter of minutes, making the "secure" part of the UI that most users have been trained since forever to respect irrelevant.
I see you're Spanish, and so am I--how many times have you seen domains with accented letters? If I see one of those I automatically think I have to remove the accents before I type the URL in my browser.
Regarding countries that don't use latin letters, examples:
Not many, but arguably Spanish people (at least young people in Spain) always omit the graphic accents in instant messaging and often in social media, so it's not weird for us to type a URL without an é or an ó. I do know I've been asked by companies to register both the accented and unaccented names, just in case.
In contrast, Japanese or Chinese people almost never use English characters mixed with their own on a day to day basis.
The Alexa results are really interesting, but I think the whole Unicode-in-the-URL thing is recent (or at least browser support is), right? So it would make sense for Unicode URLs not to be so popular yet. Or it might just be that way; I'd love to see, for say the top N sites, what % use punycode, to see whether it's relevant or not.
Punycode has been there for over 10 years and it's almost unused. If it hasn't been a success yet, it will arguably never be a success. It's not worth it to let it live, given the hazard that phishing poses.
[0] http://www.unicode.org/Public/security/revision-03/confusabl...