
Watch out: ɢoogle.com isn’t the same as Google.com - lucodibidil
http://thenextweb.com/google/2016/11/21/google-isnt-google/
======
rurban
What about ‮goog‬le.com which is really <U+202E>goog<U+202C>le.com :)

TR36 bidi spoofs are usually worse than TR39 confusables. Hover your cursor
over it.
[http://www.unicode.org/reports/tr36/#Bidirectional_Text_Spoo...](http://www.unicode.org/reports/tr36/#Bidirectional_Text_Spoofing)

That's why browsers and DNS tools use libidn; programming languages just don't.
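
A minimal sketch of how a tool might flag this kind of spoof, assuming it is enough to look for the explicit bidi control characters (embeddings, overrides, isolates). The helper name `find_bidi_controls` is made up for illustration, and this is not what libidn does internally:

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(text):
    """Return (index, code point) pairs for every explicit bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]

# U+202E (right-to-left override) ... U+202C (pop directional formatting),
# the same pair spelled out in the comment above.
spoofed = "\u202egoog\u202cle.com"
print(find_bidi_controls(spoofed))       # positions of the two controls
print(find_bidi_controls("google.com"))  # nothing to report
```

Rejecting any hostname for which this list is non-empty is a blunt rule, but it at least makes the invisible characters visible.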

------
a3n
This is strange to me. This is clearly meant, in unicode, to be 'G' that we
all know and love. It has uselessly expanded "the alphabet" (to be western-
centric) in a confusable way.

Unicode maybe should have been three dimensional, with "concept of G" in the
2D space, and "ways of representing G" behind G, along the third axis. All
ways of representing G, whether little capital, capital, lower case, would or
at least could equate to conceptual G in the 2D space.

~~~
stevenbedrick
It actually does do something along those lines, with the "canonical" and
"compatible" equivalence rules:

[https://en.wikipedia.org/wiki/Unicode_equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence)

As mentioned by others on this thread, the real issue is not with Unicode per
se, but rather with the ways that web browsers handle it (or fail to handle
it, as the case may be).

~~~
zokier
I think it is very much an issue in Unicode that they did not define the NFKD
of ɢ to be G. As far as I can tell, the rationale is that ɢ is semantically
different because it is used in IPA. I find that pretty weak, considering the
ubiquity of smallcaps. Asking browsers to diverge (as far as equivalence goes)
from Unicode standards sounds a lot like a failure of Unicode.

------
donquichotte
Some time ago I registered [http://www.goolge.io/](http://www.goolge.io/).
Still haven't done anything with it, I guess at some point I'll just redirect
it to duckduckgo. [EDIT: now it's redirected to duckduckgo.]

This can of course be used in a malicious way. I thought about rebuilding the
homepage of the bank Credit Suisse on www.credit-siusse.ch, but that's
probably illegal.

~~~
ergot
Most browsers should forcibly transcribe this to Punycode[1]:

    
    
        https://www.𝙿𝙰𝚈𝙿𝙰𝙻.com/
    

And yet when I paste this into the latest Firefox it redirects to
[https://www.paypal.com/](https://www.paypal.com/)

No 301 redirects or anything, the browser just treats it like ASCII, which it
is clearly not, it actually happens to be Fullwidth:

[https://en.wikipedia.org/wiki/Fullwidth_form](https://en.wikipedia.org/wiki/Fullwidth_form)

Serious phishing opportunity if you ask me!

[1]
[https://en.wikipedia.org/wiki/Punycode](https://en.wikipedia.org/wiki/Punycode)

~~~
bazzargh
Nope. The browser is behaving sensibly, since you can't register that domain.
It's applying the same rules that the registrars do.

ICANN require that registries follow RFC3491 and related RFCs for name prep
before allowing a name to be registered
[https://www.icann.org/resources/unthemed-pages/idn-guideline...](https://www.icann.org/resources/unthemed-pages/idn-guidelines-2005-11-14-en).
What that one does is (among other things) NFKC normalization and
case-folding:

    
    
        irb(main):016:0> "\ufeff\uff30\uff21\uff39\uff30\uff21\uff2c"
        => "﻿ＰＡＹＰＡＬ"
        irb(main):017:0> "\ufeff\uff30\uff21\uff39\uff30\uff21\uff2c".unicode_normalize(:nfkc).downcase
        => "﻿paypal"
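
The same check can be sketched in Python with the standard `unicodedata` module. This covers only the normalization and case-folding step mentioned above, not the rest of the name prep rules, and `normalize_label` is a made-up helper name:

```python
import unicodedata

def normalize_label(label):
    """NFKC-normalize, then case-fold. Only a stand-in for the full name prep rules."""
    return unicodedata.normalize("NFKC", label).casefold()

fullwidth = "\uff30\uff21\uff39\uff30\uff21\uff2c"  # fullwidth P A Y P A L
print(normalize_label(fullwidth))
```

After this step the fullwidth spelling and the plain ASCII one are the same label, which is why no separate registration is possible.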

~~~
2T1Qka0rEiPr
Interesting. So, out of interest, why is the same not being applied for ɢ?
(When I ran it through Python's unidecode I got the roman symbol all the
same).

~~~
bazzargh
Because 'small capital g' doesn't have a compatibility decomposition to G, but
wide letter P does have a compatibility decomposition to 'normal' P. Unicode
normalization kills large classes of homograph attacks but by no means all.
Conventions against mixing scripts from different languages stop some more,
but there's no single answer.
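
The difference is visible directly in the character database. A small sketch using Python's `unicodedata` (the helper `nfkc` is just shorthand introduced here):

```python
import unicodedata

def nfkc(text):
    """Shorthand for NFKC normalization."""
    return unicodedata.normalize("NFKC", text)

small_cap_g = "\u0262"  # LATIN LETTER SMALL CAPITAL G
fullwidth_p = "\uff30"  # FULLWIDTH LATIN CAPITAL LETTER P

for ch in (small_cap_g, fullwidth_p):
    # decomposition() returns an empty string when no mapping is defined.
    print(unicodedata.name(ch), "|", repr(unicodedata.decomposition(ch)), "|", repr(nfkc(ch)))
```

The fullwidth letter carries a `<wide>` mapping to U+0050, so NFKC folds it to a plain P; the small capital has no mapping at all and passes through untouched.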

------
Entangled
Web browsers should have an option to show non-ascii chars in urls in red.

~~~
mcv
This would be a great solution. Allowing unicode characters in domain names is
just inviting trouble. I understand that people with non-Latin scripts want
domain names in their own language and alphabet, but there are way too many
unicode characters that will confuse people about legitimate-looking domain
names.

Showing non-ascii in red would be an easy solution for everybody.
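
As a rough illustration of the idea, here is a terminal-only sketch that wraps each non-ASCII character in ANSI red. A browser would style its URL bar instead, and `highlight_non_ascii` is a hypothetical name:

```python
RED, RESET = "\x1b[31m", "\x1b[0m"

def highlight_non_ascii(url):
    """Wrap every non-ASCII character in ANSI red so it stands out in a terminal."""
    return "".join(f"{RED}{ch}{RESET}" if ord(ch) > 127 else ch for ch in url)

print(highlight_non_ascii("\u0262oogle.com"))  # only the first letter is red
print(highlight_non_ascii("google.com"))       # unchanged
```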

~~~
a3n
Don't even show the suspect URL, show "THIS MIGHT BE A SCAM", with some kind
of hover over showing the URL, and some way to click to more information.

~~~
witty_username
Why?

Non-latin alphabet domain names do have legitimate uses, although they are
very rarely used.

~~~
tinus_hn
Except by a third of all people who live in China and India. Not everyone
speaks a language that is representable in the latin alphabet. In fact, a very
large percentage of people do not.

~~~
saurik
And it is then worth noting that as it stands, the attitudes of western
developers with respect to text input and name lookup have so horribly screwed
the Chinese with respect to domain names that they started using numbers
instead of letters for their major web properties.

[https://newrepublic.com/article/117608/chinese-number-websit...](https://newrepublic.com/article/117608/chinese-number-websites-secret-meaning-urls)

------
cjrd
Proud owner of [http://gïthub.com](http://gïthub.com) checking in...

~~~
y4mi
the visiblend screenshot on your projects page is dead because of an
unresolvable dns href.

the screenshot on your kmap repo[1] was dead as well, until i actually opened
it. i'm guessing the jpg isn't generated until somebody clicks on it.

enough cyberstalking for me this evening :p

[1] [https://github.com/cjrd/kmap](https://github.com/cjrd/kmap)

------
TazeTSchnitzel
[https://en.wikipedia.org/wiki/IDN_homograph_attack](https://en.wikipedia.org/wiki/IDN_homograph_attack)

~~~
talideon
Most registries did a better job on constructing their IDN tables than
Verisign did. :-(

------
orbitur
This is something that's been bugging me for years.

Why are there multiple representations of alphabet characters in Unicode? It
seems reasonable to include accent marks, but what's the benefit in having a
Cyrillic 'o' alongside a standard 'o' or the 2 or 3 other ASCII-lookalike sets
of characters?

~~~
jstimpfle
There will never be agreement on what the set of distinct characters is (or
on which characters should be included: Bitcoin logo, Facebook logo?). I see
Unicode as a necessary evil. Due to its complexity most applications should
treat Unicode text as a black box.

I never rely on Unicode for computation. When receiving Unicode I always make
sure it's in the ASCII range. It could be argued that there should never have
been Unicode domain names but I guess Western people are very lucky that ASCII
includes most of their characters...

~~~
user5994461
> When receiving Unicode I always make sure it's in the ASCII range. [...]
> Western people are very lucky that ASCII includes most of their
> characters...

Please don't spread the myth of Western languages being encodable in ASCII,
and don't pretend to support Unicode (or anything else than English) if you
filter everything to ASCII.

The _only_ Western language that is encodable in ASCII is English.

Corollary: English is the only language that can be encoded in ASCII.

The other western languages have endless issues with text being
encoded/stripped down to ASCII. e.g. French, Spanish, Portuguese, German...

~~~
jstimpfle
As a German I can attest that I can very well converse (e.g. by email) in
ASCII. Although it's convenient to use umlauts, which I do. And I also agree
that French or Spanish might be less convenient.

But that was not my point. The point was about identifiers, such as DNS names.

------
ergot
For me it just redirects to

    
    
        http://money.get.away.get.a.good.job.with.jack.ilovevitaly.com
    

The actual domain is [http://xn--oogle-wmc.com/](http://xn--oogle-wmc.com/)

Which is an Internationalized domain name[1] in punycode transcription

[1]
[https://en.wikipedia.org/wiki/Internationalized_domain_name](https://en.wikipedia.org/wiki/Internationalized_domain_name)

The G in question here is

[https://en.wiktionary.org/wiki/%C9%A2](https://en.wiktionary.org/wiki/%C9%A2)

OR

[http://charcod.es/#%C9%A2/610](http://charcod.es/#%C9%A2/610)
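
For anyone who wants to reproduce the transcription, a sketch using the `punycode` codec that ships with Python. It only does the raw Punycode step and skips the IDNA validation a real resolver applies; `to_ace` and `from_ace` are names invented for this example:

```python
ACE_PREFIX = "xn--"

def to_ace(label):
    """Encode one non-ASCII label to its xn-- form (no IDNA validation)."""
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def from_ace(ace_label):
    """Decode an xn-- label back to Unicode."""
    return ace_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

label = "\u0262oogle"  # U+0262 (decimal 610) followed by "oogle"
print(to_ace(label))
print(from_ace("xn--oogle-wmc") == label)
```

The ASCII letters are copied through as-is, and the trailing `wmc` encodes both the code point of the small capital and its position in the label.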

~~~
underyx
>ilovevitaly.com

This Vitaly guy…

I got tons of referral header spam (that shows up in e.g. Google Analytics)
for all sorts of social media buttons and EU cookie law scare tactic sites.
And then there was Vitaly who just spammed me with ilovevitaly.com, which if I
recall correctly actually was a site about himself at the time.

~~~
ergot
Wow what an odd site

------
Kenji
Unicode URLs are the devil. Too many indistinguishable characters. URLs should
stay full ASCII imho. And I say that as someone whose language requires non-
ASCII symbols.

Or, in Bruce Schneier's words: "Unicode is just too complex to ever be
secure."

~~~
rurban
But think about the poor underrepresented folks using foreign character sets?

You really need to support this 'sub café {} café()' => Undefined subroutine
café in your friendly and social programming language, otherwise you will be
accused of discrimination. Of course the two é are not normalized.

Which Unicode-friendly language really checks for mixed-script confusables?
Only Java is my guess. Even perl6 falls into this trap.

[http://unicode.org/reports/tr39/#Mixed_Script_Confusables](http://unicode.org/reports/tr39/#Mixed_Script_Confusables)
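
To make the café case concrete, a sketch of the two spellings and of comparing them after normalization. Python is used here only for illustration, and `same_identifier` is a hypothetical helper, not anything a particular parser is known to use:

```python
import unicodedata

precomposed = "caf\u00e9"   # é as one code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

def same_identifier(a, b):
    """Compare two names after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)                 # raw comparison
print(same_identifier(precomposed, decomposed))  # comparison after normalizing
```

The two strings render identically but differ code point by code point, so a symbol table keyed on the raw text treats them as different names. Normalization fixes that particular problem; it does nothing for cross-script lookalikes.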

~~~
palunon
When it is just accents, it's ok. But when your users have a language that
uses a radically different alphabet, sometimes they can't even read Latin
transliterations.

~~~
rurban
agree. but then you need to declare your exotic encoding somehow, such as in
perl via use encoding 'greek'; and then the parser does not need to guess
about mixed-script encodings on every identifier. only latin and greek are
valid, everything else is invalid.

how many languages even check for mixed script confusables? for dynamic
languages this check is much too expensive, but they are leading the "good
cause", allowing everything, and checking nothing.

------
underyx
It was a pretty nice surprise that when sending this URL in Slack it was
automatically converted to `xn--oogle-wmc.com`.

~~~
Fiahil
Slack is not doing anything. It's Google Chrome filling up your clipboard with
the "extended" version of the URL.

~~~
underyx
But when I paste it in the Slack message box it shows the ɢoogle.com version.

~~~
pvdebbe
I haven't used Slack, but I think both are following best practices here:
Chrome copies the punycoded URL to the clipboard, and Slack decodes pasted
punycode URLs into a nicer presentation.

------
SamWhited
There has been talk at the IETF of redefining IDNA2008 (the current way you
prevent issues like this) in terms of the PRECIS framework (RFC 7564). This
wouldn't exactly "solve" the problem, but it would mean that IDNA could be
more agile with respect to Unicode versions and would make it easier to react
to new problems, new confusable characters, etc. as they happen.

------
vbezhenar
What about Googlé.com and infinite number of other variations?

~~~
StavrosK
Why is everyone thinking so small? What about
[https://www.goоgle.com](https://www.goоgle.com)?

Or how about the word "gullible" isn't in the dictionary?

[http://www.dictionary.com/browse/gulliblе](http://www.dictionary.com/browse/gulliblе)

~~~
bmmayer1
Stupid question, how did you do that? What characters are you using?

~~~
freshyill
I frequently have to deal with lots of scientific, mathematical, and many
other unusual characters.

I use [http://unicode-table.com](http://unicode-table.com) to help figure out
what's what. The official Unicode specification[1] is impenetrable and really
hard to deal with.

[1]
[http://www.unicode.org/Public/UCD/latest/](http://www.unicode.org/Public/UCD/latest/)
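
If a Python prompt is handy, the standard `unicodedata` module answers the same "what is this character?" question. A small sketch; the sample string is my own construction with a Cyrillic letter in third position, not a copy of the links above, and `describe` is a made-up helper:

```python
import unicodedata

def describe(text):
    """List (code point, official name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

lookalike = "go\u043egle"  # hypothetical example: the second "o" is Cyrillic
for code_point, name in describe(lookalike):
    print(code_point, name)
```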

------
joncrocks
I believe that now that browsers support non-ASCII URLs, each of them has its
own anti-phishing scheme.

See [https://www.w3.org/International/articles/idn-and-iri/](https://www.w3.org/International/articles/idn-and-iri/)

and
[https://wiki.mozilla.org/IDN_Display_Algorithm](https://wiki.mozilla.org/IDN_Display_Algorithm)

plus [http://www.chromium.org/developers/design-documents/idn-in-g...](http://www.chromium.org/developers/design-documents/idn-in-google-chrome)

~~~
77pt77
Browsers have supported this for almost a decade.

------
hannele
Ahh, the old classic, PayPaI:
[https://en.wikipedia.org/wiki/PayPaI](https://en.wikipedia.org/wiki/PayPaI)
(uppercase 'i')

------
alessioalex
This just redirects me to [http://xn--oogle-wmc.com/](http://xn--oogle-wmc.com/)
so I know it's not the real google (using Chrome).

------
cesis
Why isn't Google Analytics filtering out this referral spam?

~~~
akerro
It's literally not their job to filter referrals... they do the opposite, they
collect referrals.

------
jahewson
Browsers already blacklist many visually similar characters, it seems that the
IPA characters need to be added to that list.

------
chaz6
I thought there were supposed to be registry rules preventing similar-looking
names from being registered as IDNs. I guess not.

~~~
shshhdhs
I believe they aren't preventative measures, but responsive. So if Google
contacts ICANN, then they may do something about it

~~~
darkr
Some registries do this automatically. Some don't.

------
Programmatic
I'm not sure how feasible this is, but wouldn't it make sense for
.com/.net/etc to be latin alphabet only and allow other domains to be
localized with unicode? I wouldn't really have a problem with 新浪首页.cn, and I
doubt I would confuse ɢoogle.ru or whatever with google.com

~~~
barkingcat
That defeats the purpose of an internationalized dns system.

The whole point of getting unicode into domain names is so we can have
新浪首页.com so that it's no longer a latin alphabet centric system.

~~~
Programmatic
Doesn't that yield a whole class of problems though that we're trying to solve
with obtuse solutions such as "let's make that character set in red so people
don't get phished"? How is that any more international and/or easy to use?

It seems that putting the allowed character set into the tld would be a pretty
user-friendly way of doing that.

Edit: As an added bonus, tlds are centrally managed, and are already
western/latin encoded. So why not customize it with a localized abbreviation
for the language or tld type?

~~~
hyperhopper
One is a matter of international standardization of a protocol. Another is a
matter of client side security for a certain type of user.

------
Roboprog
Cool! I want a cool non-alpha unicode domain. I guess "square-root" is already
taken, but there must be some cool domains left (even though nobody can
actually type them in).

Actually, some of these would probably be nice aliases for some math / science
oriented sites.

E.g. - .com

~~~
Roboprog
Meh. Markup ate my "radioactive pie" (9762 dec / 2622 hex) symbol :-(

------
hannele
I'm curious, why is it allowed to register domain names with mixed character
sets? I am behind allowing Unicode characters in domain names for the obvious
reasons, but are there compelling use cases for allowing them to be mixed?

~~~
klodolph
Technically, Unicode is only one character set. If you want to disallow
mixing, you have to disallow it on some other basis, like script. There are
many edge cases to consider, though, and many legitimate reasons to mix
scripts.
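
A deliberately crude sketch of a script-based check. Python's standard library does not expose the Unicode Script property, so this guesses the script from the first word of each letter's character name, which is only a heuristic; `scripts_in` and `is_mixed_script` are names invented here:

```python
import unicodedata

def scripts_in(label):
    """Guess scripts from the first word of each letter's Unicode name (heuristic)."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in label if ch.isalpha()}

def is_mixed_script(label):
    """True when the letters of a label appear to come from more than one script."""
    return len(scripts_in(label)) > 1

print(is_mixed_script("google"))        # plain Latin
print(is_mixed_script("go\u043egle"))   # Latin with one Cyrillic letter
print(is_mixed_script("\u0262oogle"))   # small capital G is still Latin
```

The last case is one of those edge cases: the small capital from the article is itself a Latin letter, so a pure script-mixing rule lets it through.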

------
reacweb
Maybe browsers should have a security option to whitelist characters in URLs.
When a URL uses any other character, there would be a popup with an
explanation and choices.

------
transfire
Oh, you mean Unicode Sucks(TM)? Yes. Yes it does.

