
Phishing attack uses Unicode characters in domains to clone known safe sites - rukuu001
https://www.wordfence.com/blog/2017/04/chrome-firefox-unicode-phishing
======
robinhouston
The Unicode consortium maintains a list of “confusable” strings[0] that is
supposed to be used to prevent the registration of domains that are visually
identical or near-identical to already-registered domains.

As far as I can see, the characters used in this attack are indeed on the
list. So presumably the problem is that not all domain registrars are checking
this.

I don't understand why ICANN or Verisign don't enforce this. This is hardly a
new problem. Does anyone know the background to why this is possible?

[0]
[http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt](http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt)

~~~
mynegation
It's not that hard to come up with legitimate domains in non-Latin script that
could be recognized as a word written in Latin script. For example, керас is a
Russian transliteration of Keras. So I imagine they would get quite a few false
positives and (fair) claims that non-ASCII languages are oppressed. To me, a
better alternative would be to limit punycode names to punycode TLDs, and
make sure that no two TLDs look the same: a task of much smaller scale.

~~~
codedokode
Maybe non-latin domains should use non-latin TLDs? Or at least the registrar
should reserve all alternative spellings when registering a domain.

~~~
mynegation
Yes, that was my suggestion: non-Latin domains MUST be in the non-Latin TLDs,
with additional constraint that non-Latin TLDs cannot look like already
registered Latin TLDs. In a twist of irony, Russian word "сом" (roughly
pronounced as "some") means "catfish" (the animal, not the action of
catfishing though).

------
userbinator
For reference, the example is using:

    
    
        U+0435 CYRILLIC SMALL LETTER IE
        U+0440 CYRILLIC SMALL LETTER ER
        U+0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
        U+0441 CYRILLIC SMALL LETTER ES
    

...which are basically _indistinguishable_ from the corresponding ASCII (hex 65
70 69 63) in most if not all fonts.
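
To make the comparison concrete, here is a small Python sketch (standard library only) that lists the code point and Unicode name of each character in the spoofed label next to its ASCII lookalike. The variable and function names are illustrative, not taken from the article.

```python
import unicodedata

# The four Cyrillic code points listed above, and the ASCII word they imitate.
spoofed = "\u0435\u0440\u0456\u0441"
ascii_word = "epic"

def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for (cp_s, name_s), (cp_a, name_a) in zip(describe(spoofed), describe(ascii_word)):
    print(f"{cp_s} {name_s:<50} vs {cp_a} {name_a}")

# The two strings may render alike, but they never compare equal.
print(spoofed == ascii_word)  # False
```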

I don't see this as being an easy or even solvable problem in general ---
after all, a Russian IDN registrar could rightly argue that it should be valid
to use all Cyrillic characters in IDNs; but at least for those of us who are
content with English-only domain names, we can just disable IDN.

I suppose one solution would be, when IDN is detected, to show both the
Unicode and Punycoded versions:

[www.xn--e1awd7f.com][https://www.epic.com/](https://www.epic.com/)

One could also argue that this is also a usability feature, for easy copying
of IDNs by those who otherwise cannot write them.
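
A rough sketch of the "show both" idea in Python, assuming the built-in `idna` codec (which implements the older IDNA 2003 rules) is close enough for illustration. `both_forms` is a hypothetical helper, not any browser's API.

```python
def both_forms(hostname):
    """Return 'unicode [punycode]' for IDN hosts, or the host unchanged."""
    ascii_bytes = hostname.encode("idna")      # e.g. b"www.xn--e1awd7f.com"
    ascii_form = ascii_bytes.decode("ascii")
    unicode_form = ascii_bytes.decode("idna")  # labels with xn-- are decoded
    if ascii_form == unicode_form:
        return ascii_form                      # plain ASCII host, nothing to add
    return f"{unicode_form} [{ascii_form}]"

print(both_forms("www.xn--e1awd7f.com"))
print(both_forms("www.example.com"))
```

It accepts either spelling of the host as input, so the same function would also serve the copy-and-paste use case.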

~~~
Zuider
>I suppose one solution would be, when IDN is detected, to show both the
Unicode and Punycoded versions:

That looks like the best solution. The security certificate of the site shows
the punycode version of the URL. If the address bar URL and the certificate
URL are different, then, at least for addresses in English, that should raise
an alarm.

~~~
beaconstudios
no-one checks the certificate readout - it'd have to be inline for regular
users to see any benefit.

~~~
usrusr
The least-impact visualisation could be using colors: group the characters in
the domain name by script family, then use regular colors for the largest
group (so that e.g. an all-Korean domain name would look just fine) and
highlight those exotic relative to the rest.

~~~
beaconstudios
or maybe just disallow mixed-charset domains. After all there's not much use
for a domain that's partially Cyrillic and partially Latin. Not that I'm aware
of anyway - and the rule could be more granular if there are legit uses.
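
A minimal sketch of such a mixed-script check in Python. The standard library does not expose the Unicode Script property, so this approximates it with the first word of each character's Unicode name, which is an assumption that holds for Latin and Cyrillic but is only a heuristic in general. The function names are made up for the example.

```python
import unicodedata

def scripts_in(label):
    """Approximate the set of scripts used by the letters of a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # ignore digits and hyphens
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(label):
    return len(scripts_in(label)) > 1

print(is_mixed_script("p\u0430ypal"))            # Latin with one Cyrillic letter
print(is_mixed_script("\u0435\u0440\u0456\u0441"))  # all Cyrillic
```

Note that a label written entirely in one non-Latin script is not mixed, so it passes this check.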

~~~
krallja
see
[https://news.ycombinator.com/item?id=14120372](https://news.ycombinator.com/item?id=14120372)
\- this example uses all-Cyrillic

~~~
beaconstudios
the point is that it reduces the number of potential variations. There would
only be 1 or 2 possible visually-identical phishing domains for any latin
domain, as opposed to potentially hundreds or thousands as it stands now.

------
Animats
The rules on homoglyph domain registration need to be tightened up and
enforced.[1] As usual, ICANN is soft on making registrars obey rules. That's
where the blame lies.

[1]
[https://en.wikipedia.org/wiki/IDN_homograph_attack](https://en.wikipedia.org/wiki/IDN_homograph_attack)

~~~
pluma
It's a hard problem. There are valid use cases for mixing characters from
different alphabets. They'd need to detect malicious mixing without creating a
prohibitive amount of false positives.

IMO it'd be fine if there was an automated flagging mechanism that would
require manual review for suspicious domains, though. But good luck getting
registrars to do anything more involved than a simple blacklist check.

~~~
azernik
Homograph detection isn't about prohibiting mixed characters; it's about
considering letters that look the same to _be_ the same when checking for
existing duplicates.
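
As a sketch of that idea in Python: map every character to a representative lookalike, then compare the mapped forms when checking for duplicates. The `LOOKALIKES` table below is a tiny hand-picked assumption for illustration, not the Unicode confusables data, and the function names are hypothetical.

```python
# Hypothetical, hand-picked lookalike table; a real one would be far larger.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}

def skeleton(label):
    """Replace each character with its lookalike representative."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in label.lower())

def collides(candidate, registered):
    """Return the registered label the candidate visually duplicates, or None."""
    taken = {skeleton(name): name for name in registered}
    match = taken.get(skeleton(candidate))
    return match if match is not None and match != candidate else None

print(collides("\u0435\u0440\u0456\u0441", ["epic", "apple"]))  # epic
```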

~~~
nzp
It's not that simple. A lot of different characters may or may not look the
same depending on the typeface used.

~~~
azernik
Yeah, if I were doing it I'd probably tend to be pretty cautious - if I see
any popularly-used typeface that would confuse the characters I'd mark them as
homographs.

It would absolutely have to be a pretty labor-intensive, manually-maintained
database.

~~~
nzp
And thus extremely error prone. I think introducing IDNs at all was a mistake
given the (security) cost-benefit tradeoff, and the best solution would be to
scrap them but I realize that is unlikely to happen. Domain name script is
probably the last place where cultural imperialism needs to be fought, it's
just empty PC posturing and a way to extract some more easy money from
gullible domain owners.

------
mattbee
Random ideas to solve this:

Get phishing sites on Chrome/Firefox's red-flag lists (presumably this already
happens?)

Show a country flag in the URL bar based on the language/alphabet of the URL

Change the URL bar colour for punycode domains

Show a big banner the first time you visit a site that you've not visited
before.

Have the browser detect and warn about possible homographs based on history.

Having said all that, is this really such a large problem that no browser has
needed to solve it? I suspect the centralised malware lists are good enough,
but scary demo URLs will never get onto one.

~~~
Freak_NL
> Show a country flag in the URL bar based on the language/alphabet of the URL

Languages are often spoken in more than one country. Should amazon.com fly the
Union Jack or wien.gv.at a German flag? I don't think a Russian flag for
Cyrillic would be appreciated in Ukraine either.

Scripts (e.g., Latin, Cyrillic, Hangul) could get their own signifier or
symbol, but that does mean disallowing (or discouraging) mixed script domains
— which may not be a bad idea.

Perhaps certain scripts should be limited to certain CCTLDs and generic TLDs
so that the demo URL would be disallowed in .com, but permissible in .cyrl,
which in turn disallows any other scripts (including Latin). It might solve
the phishing angle.

It would also solve complex cases such as Japanese (which would require
Chinese characters, katakana, hiragana, and some Latin characters; perhaps the
full-width set?). Perhaps put those under .日本 and allow only Latin characters
under .jp.

Technically feasible, though politically untenable.

------
blauditore
It's a bit weird that punycode for emojis gets rejected for .com domains, but
this kind of stuff is perfectly fine. I wonder who made these decisions, and
with what reasoning.

------
franciscop
The proposed fix is invalid as it will break all those sites that actually use
international letters. I am looking especially at Japanese and Chinese, but I
can see how you _cannot_ allow for _other language characters_ since, for
example, Cyrillic has some really similar characters to Latin. This is a
really odd situation and there's no easy fix IMO.

------
hartz
I think it would be useful to implement some security against this at the
registrar level (until a better fix is more broadly available). For example,
if I'm registering "epic.com" (the ASCII version), the registrar could suggest
that I also register "epic.com" (the Cyrillic version), or vice versa. This
could at least help site owners avoid phishing attacks on their own domains.

Unfortunately, this would require all the big registrars to be on board for it
to actually be effective.

~~~
syrrim
In order to prevent anything, you would need to register every combination of
Latin and Cyrillic. For a short domain like "epic" this constitutes 16 domains.
For a 7 character domain it would be 128 domains. In either case it would be a
heavy multiplier on the base cost of the domain.
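
The arithmetic is 2^n for a label of n letters that each have one lookalike. A short Python sketch that enumerates the variants, assuming a hypothetical four-entry Latin-to-Cyrillic table that covers only the letters of "epic":

```python
from itertools import product

# Hypothetical Latin -> Cyrillic lookalike pairs, enough for the example.
CYRILLIC_TWIN = {"e": "\u0435", "p": "\u0440", "i": "\u0456", "c": "\u0441"}

def variants(label):
    """Yield every spelling obtained by swapping letters for their twins."""
    choices = [
        (ch, CYRILLIC_TWIN[ch]) if ch in CYRILLIC_TWIN else (ch,)
        for ch in label
    ]
    for combo in product(*choices):
        yield "".join(combo)

all_epic = list(variants("epic"))
print(len(all_epic))  # 16, i.e. 2**4, including the original ASCII spelling
print(2 ** 7)         # 128 for seven letters that all have a twin
```

The count includes the ASCII original, so the number of extra registrations is one less.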

~~~
MiddleEndian
I don't think you're allowed to combine Latin and Cyrillic in a domain name.
The issue is mainly that the two sets have identical looking characters.

------
richdougherty
Wow this is an old attack. I wonder whether we could use computer vision to
check for similar/confusing glyphs.

~~~
andai
In all the cases I've seen so far, the glyphs haven't just been similar,
they've been _the exact same glyphs,_ just in a different part of unicode.

I think a lookup table would be enough.

And then a rule that might be good to implement at registrars (please correct
me if I'm missing a valid use case for this): if all the unicode
characters represent ASCII (or its 'duplicates'), reject the registration.
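
A sketch of that rule in Python, assuming a small hypothetical `ASCII_TWINS` lookup table (a real one would be built from much larger data). The function name is invented for the example.

```python
# Hypothetical table of non-ASCII letters that look like ASCII letters.
ASCII_TWINS = {
    "\u0430": "a", "\u0441": "c", "\u0435": "e",
    "\u0456": "i", "\u043e": "o", "\u0440": "p",
}

def imitates_ascii(label):
    """True if the label is non-ASCII yet every character reads as ASCII."""
    if label.isascii():
        return False
    return all(ch.isascii() or ch in ASCII_TWINS for ch in label)

print(imitates_ascii("\u0435\u0440\u0456\u0441"))        # the demo label
print(imitates_ascii("\u043a\u0435\u0440\u0430\u0441"))  # has a letter with no twin
```

One caveat of a rule this blunt: any genuine non-Latin word that happens to consist only of lookalike letters would be rejected as well.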

~~~
pbhjpbhj
I don't think that works, "cccp" could look identical in latin and cyrillic
but represents 2 different acronyms.

~~~
andai
Oh, dang, good point.

------
unabridged
Address bars should show punycode urls in a different color or perhaps have an
option for displaying urls in ascii only (that is turned on by default for
english browsers)

~~~
pluma
What English speakers tend to forget is that the problem exists outside Latin
script in exact reverse: if there are non-Latin homoglyphs for Latin
characters that means there are also Latin homoglyphs for non-Latin
characters.

Forcing punycode representation is a hack that would help English speaking
countries (you don't expect non-Latin characters in typical English language
website domains) but it wouldn't solve the problem.

Also, it's not as simple as "ASCII or not": outside English most European
languages use a superset of ASCII glyphs mostly via diacritics or variants.
It's not useful to single these glyphs out as "non-ASCII". But these languages
too are affected by the same homoglyph attacks as English.

It's a hard problem and in the 21st century it certainly won't be fixed with
one-off small-scale hacks that only benefit a subset of monolingual English-
speaking users.

~~~
userbinator
Given that English, and in particular, ASCII domain names, have practically
been the _lingua franca_ of the Internet even for those who otherwise use a
different language, I don't think it's too unreasonable to say that IDN has
really caused more problems than it's solved.

~~~
franciscop
I don't know much about the apparently thriving Japanese and Chinese internet
communities, but there's a big chunk of the internet in Spanish. I agree
English was by far the most used language on the internet, but with cheap
and global internet access I'll argue that the % of English speakers relative
to internet users is lower every day.

~~~
Markoff
I lived for years in China and not a single time did I see an IDN domain
advertised; they always use regular latin domains (number domains are
particularly popular), so I don't really understand the point of this option...

------
homakov
How do I disable unicode in domains entirely for Chrome? Such a useless and
dangerous feature.

~~~
amake
> Such a useless and dangerous feature.

Useless to you != useless to everyone.

~~~
homakov
Last time you used it?

------
nigel182
Interestingly, Safari doesn't translate
[https://www.xn--e1awd7f.com](https://www.xn--e1awd7f.com) but it _does_
translate [http://xn--8q8h17eba.ga/](http://xn--8q8h17eba.ga/) to emoji in the
address bar.

Chrome behaves oppositely, showing unicode for
[https://www.xn--e1awd7f.com/](https://www.xn--e1awd7f.com/) but not showing
emoji for [http://xn--8q8h17eba.ga/](http://xn--8q8h17eba.ga/)

~~~
pimlottc
For me, Safari 10.1 does not show emoji in the address bar for that URL; both
the ones listed show up as raw punycode.

------
blechschmidt
I find the title confusing. At first I thought the post was about DNS
hijacking or something similar. However, the phishing attack does not make use
of identical domains but homographs.

~~~
sctb
Thanks, we updated the title from “Chrome and Firefox Phishing Attack Uses
Domains Identical to Known Safe Sites”.

------
rubatuga
How effective would a hashing solution be? Possibly with a colour code.
Lastly, what about showing the number of Unicode and ascii characters in the
URL so as not to get confused?

------
Zuider
Windows users have a problem. The suggested solution of copying the URL into
Notepad to reveal the true 'punycode' address does not work. Both Notepad and
Wordpad render the 'punycode' as it appears in the address bar. Similarly for
copying and pasting the URL in place, at least on firefox. A Chrome user
commented that it did reveal the fake address.

The Scite text editor did reveal the difference:
[http://www.ebswift.com/scite-text-editor-installer.html](http://www.ebswift.com/scite-text-editor-installer.html)

Does it work on any others?

Edit: I am incorrect. Windows and notepad are not the problem. It seems to be
a peculiarity of the particular browser I am using, Cyberfox 52.0.4, which is
converting the punycode as it is copied.

Edit: It looks like my Anti-virus, Avast is doing the conversion. I tested
this on browsers with and without the Avast plugin, and the conversion only
occurs on the browsers with the plugin.

~~~
zuminator
Neither Notepad nor Wordpad have any mechanism for converting punycode on the
fly -- the copy and paste executes correctly on my box and displays as
[https://xn--e1awd7f.com/](https://xn--e1awd7f.com/) in both Notepad and
Wordpad. You must be somehow copypasting the post-converted url.

~~~
Zuider
I tried it again with the same result. From the same Copy/Paste, Notepad
showed [https://www.epic.com](https://www.epic.com) but Scite showed
[https://www.????.com/](https://www.????.com/). This demonstrates that I have
not copypasted the post-converted URL. Interestingly, pasting the fake address
into HN comment box showed
[https://www.xn--e1awd7f.com/](https://www.еріс.com/)

Could it be that notepad now delegates the handling of Unicode to the OS? I
tested the above on the latest official release, Windows 10 Creators Update.

~~~
zuminator
I also have the latest update. Not really sure what to make of this. If you
manually type the punycode URL
([https://www.xn--e1awd7f.com](https://www.xn--e1awd7f.com)) into your Notepad
or Wordpad and then copy THAT and paste it back on the following line, does it
convert?

~~~
zuminator
I agree with your comment below mentioning Cyberfox -- the browser appears to
be to blame, not Windows per se. The only question is why your browser is
doing that. On both Firefox and Chrome, I am not getting the Unicode inserted
into my clipboard, only the actual punycode. In fact, if I attempt to copy the
Unicode from the address bar in either browser, it pastes as
[https://www.xn--e1awd7f.com/](https://www.xn--e1awd7f.com/) in the text editor, not as
"epic.com." Edge seems to be completely immune to this exploit; it renders the
punycode as punycode in the URL (I suppose it just doesn't "understand"
punycode is all.)

~~~
Zuider
I suspect that my antivirus software, Avast, is converting the URL to unicode
when it is copied.

I tried the Copy/Paste of the fake URL with standard Firefox, Brave (Chrome),
and Avast Safezone (Chrome). Only in Brave did the fake address copy correctly
as punycode. The other two browsers have the Avast plugin as a common factor,
but Brave does not.

------
falcolas
The title was just recently changed to remove the offending browsers. Since
it's not a universal attack, it's limited in scope to only Chrome and Firefox;
why remove this important metadata?

Especially since the title for the linked article _also_ calls out the
offending browsers.

------
sebastian
A good way to fix this would be coloring all Unicode characters in the URL
bar.

~~~
deckiedan
That doesn't work for colour blindness, or languages that use extended Latin
alphabet. Some letters of a word would be one colour and others another.

~~~
Zuider
The Unicode letters could be darker, which would still be visible if you are
color blind. If the highlighted characters were not ones that would be
expected to be represented by Unicode, i.e. if they were not special
characters or diacritics, then that should raise an alarm, especially if the
user was familiar with how the genuine URL looked.

~~~
sebastian
Different colour + underlined.

------
everydaypanos
Why is it even allowed to register "epic.com" with e|p|i|c cyrillic when it is
already registered with english/ascii??

They look exactly the same, they are the same characters just borrowed from
one language to another..

It should just be illegal to register same-character domains. All other
solutions that force browser vendors to "do something about it" are implying
that the web is in English and all other languages are just add-ons.

~~~
nothrabannosir
It's about separation of concerns. Who defines what is the same and what is
not? It's computers, so you need to be clear about this. You can't emulate our
judicial process, here: make a law with a spirit and define the corner cases
as they get litigated.

You already mention this yourself: you don't want the browsers to "do
something about it." Ok, but somebody has to. So, if not the browsers, then
ICANN. But they would be using the very same algorithm, in the end, no? If so,
then why not have the browsers do that, in the first place? They have more
control over UI, and it is a cleaner separation of concerns and
responsibility.

Somebody needs to do something. And because this is about humans (in the end,
when you say "similar looking", you mean "... to humans"), better have that be
in the part of the system that actually is about them: the user agent. ICANN
is too slow to be able to rapidly adapt to changes in this fight.

~~~
everydaypanos
The truth of the matter is that e|p|i|c are actually the same characters just
borrowed from one language to another.

If I give you a piece of paper with "epic -> epic" you would say "it says epic
two times".

If we give the same piece of paper to a Russian he will say "it says <insert
Russian meaning here> two times". The Russian guy may even say it says epic(as
in the English meaning) because there might be no "epic" word in Russian.

In short: It all depends on the reader how he interprets the characters
e|p|i|c. The same should apply to computers too.

The lowercase a is the same character in French, Russian, English etc. The
fact that some decades ago some computer inventors decided that there are four
types of "a" depending on the language - that's completely broken in my
opinion.

I don't see any use in embedding the language it belongs to in
every character. For me, this is obviously like an encoding "flaw".

So the ones that actually have to do something are the OS vendors. If they
don't then ICANN or browser vendors should do something..

------
shoshino
On Debian 9.0 (64-bit), with defaults:

Not affected:

\- Chromium Version 57.0.2987.133

Affected:

\- Chrome Version 57.0.2987.133

\- Firefox ESR 45.8.0

\- Tor Browser 6.5.1 (based on Mozilla Firefox 45.8.0)

tested with [https://www.xn--80ak6aa92e.com/](https://www.xn--80ak6aa92e.com/)
-> [https://www.apple.com/](https://www.apple.com/)

------
askjdlkasdjsd
Well this was pretty scary. Anybody got a link to the chrome canary fix? And
what about mobile browsers?

I consider myself to be quite privacy-conscious. But how do you even
protect against something like this? IMO there should be a clear opt-in
unicode URL character conversion in browsers.

------
codedokode
I think this is a problem with registrars. It is the same as if they
arbitrarily shared access to your domain setup with another person. They should
not allow registering domain names that look identical or very similar.

And of course we should stop using passwords and switch to physical keys.

~~~
pharrlax
Seems like it would be much easier to convince browsers to make punycode in
the browser bar opt-in than to convince every registrar to stop selling
certain punycode domains. For starters, who draws the line on which characters
are unsafe?

~~~
codedokode
Maybe that's a good idea. Punycode could be used only for names ending with
non-latin TLD.

But the problem would still remain, for example if one company uses a name
with the latin "i" character and another with the ukrainian character that
looks the same.

------
robk
Can anyone suggest a Chrome extension that flags or blocks any non ascii URLs?

~~~
Lost_BiomedE
Punycode Alert is one. It has 12 users and no reviews. I haven't looked into
it besides a quick search.

~~~
robk
Thanks!

------
_nalply
I ran the name through NFKC, but it didn't help. I didn't get back the ASCII
equivalents. I wonder if Unicode has a normalisation like NFKC which applies
to similar looking glyphs. Anyone know?
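
For what it's worth, this is easy to reproduce with Python's `unicodedata`. None of the four standard normalization forms touches the Cyrillic label, because these letters have no decomposition to Latin; NFKC only folds compatibility variants such as fullwidth letters. The lookalike mapping people usually mean here is the "skeleton" operation described in Unicode Technical Standard #39, which is separate from normalization and is not part of the Python standard library.

```python
import unicodedata

spoofed = "\u0435\u0440\u0456\u0441"  # the Cyrillic label from the demo

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, spoofed)
    # Prints the form, whether the text is unchanged, and whether it became ASCII.
    print(form, normalized == spoofed, normalized == "epic")

# By contrast, NFKC does fold fullwidth Latin letters down to ASCII.
fullwidth = "\uff45\uff50\uff49\uff43"
print(unicodedata.normalize("NFKC", fullwidth) == "epic")  # True
```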

------
protomyth
Are there any name servers that would allow me to block / redirect any non-
ASCII domain names? At this point, the value of Unicode domain names is a lot
less than local safety.

We currently use unbound for users.

~~~
Retr0spectrum
You could probably block the `xn--` prefix, which is how it appears over the
wire.

------
yawz
In general, using a password manager creates a good line of warning against
phishing attacks. I assume they would not trigger autofill for those fake
sites either.

------
hartator
Or maybe use a font that makes the difference between ASCII chars and their
Unicode clones obvious without uglifying the results too much for the end users.

------
tudorconstantin
Would this also work with email? Like, register the Cyrillic fb.com and send
emails from mark@fb.com as a valid reply-to address?

The possibilities seem mind boggling

------
roryisok
Browsers should probably disable IDN by default, at least if region settings
are not set to use Cyrillic chars

------
achairapart
This is bad. Perhaps big corps should buy, as fast as they can, these fake
domains before scammers do and redirect them to the legit websites. Doing so
they will help protect their users as well as their brand reputation.

~~~
beaconstudios
would these kind of attacks, where you can replace 1+ letters and have a new
domain, not be subject to combinatorial explosion? I can imagine companies
with longer domains having to buy 10,000+ domains to do this.

~~~
achairapart
I think Google or Facebook can afford this.

They already do this with misspelled domains.

~~~
beaconstudios
Google or Facebook sure, but medium-sized businesses might find it to be too
expensive.

~~~
achairapart
Sure, but phishing is mostly about very big brands or banks.

My point is, for this segment it will never be a money problem, more a digital
culture deficit problem. Most of them are still not proactive enough about
those issues; it seems like they don't care that much. At least, until they
are hit hard by this. Then, they will likely spend 100x on lawyers,
reputation, and so on.

------
fooker
Why not have the browser handle this?

~~~
sleepychu
Handle what? Detect when punycode is ASCII only?

~~~
Zuider
One solution is to detect when whole words are rendered via punycode. This
would allow the use of diacritics and apply to languages other than English.
Of course, there could still be exceptions, but maybe these could be handled
on a case by case basis. The problem would become more tractable, at any rate.

~~~
bzbarsky
In this case, the whole word is rendered via punycode. Are you saying that
should be allowed, or prohibited?

~~~
Zuider
Prohibited. Or at least, flagged. There is no compelling reason to use
punycode where conventional Unicode can be used.

~~~
bzbarsky
There is no "conventional Unicode" in hostnames. DNS is ASCII-only in
practice. All non-ASCII hostnames have to be encoded in punycode for DNS to
work. This is why punycode was invented.

------
andoon
I see two issues: the mere existence of punycode (does any serious website use
a punycode domain?) and the ability of any website to generate an ssl
certificate for itself in a matter of minutes, making the "secure" part of the
UI most users have been trained since forever to respect irrelevant.

~~~
franciscop
I partly agree with the second part. I think it starts to make sense to set it
as:

1\. Non SSL/TLS sites get marked as _insecure_. With recent events (ISP
allowed to sell your data) this seems like a no-brainer.

2\. Domain-validated certificates such as Let's Encrypt get a normal status.

3\. Extended Validation SSL (or however they are called) get the green lock to
mark them as trusted.

About your first point there's a huge chunk of the internet who doesn't even
speak English so I'd say yes.

~~~
andoon
I see you're Spanish, and so am I--how many times have you seen domains with
accented letters? If I see one of those I automatically think I have to remove
the accents before I type the URL in my browser.

Regarding countries that don't use latin letters, examples:

[http://www.alexa.com/topsites/countries/RU](http://www.alexa.com/topsites/countries/RU)

[http://www.alexa.com/topsites/countries/CN](http://www.alexa.com/topsites/countries/CN)

[http://www.alexa.com/topsites/countries/DZ](http://www.alexa.com/topsites/countries/DZ)

~~~
franciscop
Not many, but arguably Spanish people (at least young people in Spain) always
omit the graphic accents in instant messaging and many times in social media
so it's not weird for us to type an url without an é or an ó. I do know I've
been asked by companies to get both the accented and unaccented names just in
case.

In contrast, Japanese or Chinese people almost never use English characters
mixed with their own on a day to day basis.

The alexa results are really interesting, but I think the whole unicode in the
url thing is recent (or at least browser support), right? so it would make
sense for unicode urls not to be so popular yet. Or it might just be that way,
I'd love to see say the top N sites what % use punycode to see if it's
relevant or not.

~~~
andoon
Punycode has been there for over 10 years and it's almost unused. If it hasn't
been a success yet, it will arguably never be a success. It's not worth it to
let it live, given the hazard that phishing poses.

