
Domain hacks with unusual Unicode characters - edent
https://shkspr.mobi/blog/2018/11/domain-hacks-with-unusual-unicode-characters/
======
jfk13
Under "How does this work?", the post refers to the text in RFC 5895:

> map characters to the "Simple_Lowercase_Mapping" property (the fourteenth
> column) in
> <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>,
> if any.

as if that were responsible for turning ℡ into TEL. But the fourteenth column
in

> 2121;TELEPHONE SIGN;So;0;ON;<compat> 0054 0045 004C;;;;N;T E L SYMBOL;;;;

is empty!

What we're actually seeing is a Compatibility Decomposition, used when Unicode
normalisation form NFKC is applied to the text.

Whether it's appropriate for browsers to be applying NFKC may be questionable.
RFC 5895 calls for the use of NFC (which would not apply mappings like this),
but it also says that

> These form a minimal set of mappings that an application should strongly
> consider doing. Of course, there are many others that might be done.

which leaves things rather open.
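
A minimal sketch of the distinction, using Python's standard `unicodedata` module (it reflects whichever Unicode version ships with the interpreter, and says nothing about what any particular browser does). It checks that U+2121 has no lowercase mapping, that its decomposition field matches the UnicodeData.txt line quoted above, and that NFKC rewrites the character while NFC leaves it alone.

```python
import unicodedata

tel = "\u2121"  # ℡ TELEPHONE SIGN

# Decomposition field from the Unicode Character Database.
decomposition = unicodedata.decomposition(tel)

# The character has no lowercase mapping, so lower() returns it unchanged.
lowered = tel.lower()

# NFC only uses canonical mappings; NFKC also applies <compat> ones.
nfc = unicodedata.normalize("NFC", tel)
nfkc = unicodedata.normalize("NFKC", tel)

print(decomposition)
print(lowered == tel, nfc == tel)
print(nfkc)
```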

------
kam
I was hoping this would work on the "Regional Indicator Symbols" used for flag
emoji, since they encode the same country codes used in ccTLDs.

🇩🇪 is actually two Unicode code points 🇩 (U+1F1E9 'REGIONAL INDICATOR SYMBOL
LETTER D') 🇪 (U+1F1EA 'REGIONAL INDICATOR SYMBOL LETTER E') that fonts display
as a single flag grapheme. But google.🇩🇪 becomes the Punycode google.xn--h77hc
and fails to resolve rather than mapping to ASCII as these other characters
do.
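
A rough way to see why the flag behaves differently, again with Python's `unicodedata` (a sketch only; the browser's own mapping tables are not consulted here). The regional indicator symbols carry no decomposition at all, so NFKC has nothing to turn into ASCII and the label can only be carried as Punycode. The comment above reports `xn--h77hc`; the sketch prints whatever the standard `punycode` codec produces rather than asserting that value.

```python
import unicodedata

flag = "\U0001F1E9\U0001F1EA"  # 🇩🇪 as two regional indicator symbols

names = [unicodedata.name(ch) for ch in flag]
decompositions = [unicodedata.decomposition(ch) for ch in flag]

# No compatibility mapping exists, so NFKC returns the input unchanged.
nfkc = unicodedata.normalize("NFKC", flag)

# The "punycode" codec does the raw encoding; IDNA adds the "xn--" prefix.
ace_label = "xn--" + flag.encode("punycode").decode("ascii")

print(names)
print(decompositions)
print(ace_label)
```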

~~~
edent
Yes, I found it rather curious to see which symbols converted back to "pure"
ASCII.

For example, ℡ goes to TEL, but ℻ goes to punycode!

~~~
tyingq

      http://℻zero.com works for me.

Apparently it just doesn't work as a TLD (Chrome/Linux).

~~~
gondo
most likely because .fax is not a valid TLD but .tel is
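
As far as normalisation alone goes, ℻ and ℡ look equivalent, which fits this explanation. The sketch below uses a hypothetical helper, `map_label`, that applies NFKC and then lowercases; this is a simplification and not the full IDNA mapping a browser performs.

```python
import unicodedata

def map_label(label):
    """Rough stand-in for browser mapping: NFKC, then lowercase."""
    return unicodedata.normalize("NFKC", label).lower()

tel_tld = map_label("\u2121")        # ℡ TELEPHONE SIGN
fax_tld = map_label("\u213b")        # ℻ FACSIMILE SIGN
host = map_label("\u213bzero.com")   # the hostname tyingq tried

print(tel_tld, fax_tld, host)
```

Both symbols map to plain ASCII here, so any difference in behaviour would have to come from what happens after the mapping, such as whether the resulting TLD exists.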

------
dec0dedab0de
This kind of stuff is fun, reminds me of decimal IP addresses.

For example, [http://3520653040](http://3520653040) should take you to HN (at
least to the IP I'm resolving for HN right now).
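
The decimal form is just the 32-bit address written as one integer. A small sketch with Python's standard `ipaddress` module converts it to dotted-quad notation and back:

```python
import ipaddress

decimal_form = 3520653040

# Interpret the integer as a 32-bit IPv4 address.
address = ipaddress.IPv4Address(decimal_form)
dotted = str(address)

# Converting the dotted form back gives the same integer.
back_to_int = int(ipaddress.IPv4Address(dotted))

print(dotted, back_to_int)
```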

Also, I know this is off topic, but that python example really bothered me.

original:

      python -c 'import sys;print sys.argv[1].decode("utf-8").encode("idna")' "℡"

Should have been

      python -c 'print "℡".decode("utf-8").encode("idna")'

Or in Python 3

      python3 -c 'print("℡".encode("idna"))'

~~~
edent
Pull Requests welcome :-)

More seriously, I just copied that code from somewhere. Why is the other way
better?

~~~
half-kh-hacker
Brevity - You don't need to `import sys` if you just use the literal character
instead of reading from argv

------
dane-pgp
This makes me wonder whether there are new ways to consider the question of
"What is the shortest possible domain name?".

An amusing approach is the one taken here:

[https://www.namepros.com/threads/worlds-shortest-domain-name...](https://www.namepros.com/threads/worlds-shortest-domain-name.1061981/)

leading to the domain used for this URL shortener:

[https://l.tl/](https://l.tl/)

However, the ccTLD for São Tomé and Príncipe allows single-letter second-level
domains, so perhaps this is a contender:

l.ﬅ
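
A quick check of what that single character would map to, assuming it is U+FB05 (LATIN SMALL LIGATURE LONG S T); the neighbouring ligature U+FB06 would normalise to the same result. The decomposition happens in two steps, ligature to long s plus t, then long s to s.

```python
import unicodedata

ligature = "\ufb05"  # assumed to be the character used in the comment
long_s = "\u017f"    # ſ LATIN SMALL LETTER LONG S

# Each step of the compatibility decomposition, as recorded in the database.
step_one = unicodedata.decomposition(ligature)
step_two = unicodedata.decomposition(long_s)

# NFKC applies the decompositions recursively.
mapped = unicodedata.normalize("NFKC", "l." + ligature)

print(step_one, "|", step_two, "|", mapped)
```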

~~~
kmm
A few TLDs have A records, like ai. or dk.

[http://ai](http://ai). [http://dk](http://dk).

~~~
hultner
How come they don't seem to resolve to anything? I tried both via my browser
and curl.

I can see the records, and if I curl the IP address for the dk records I only
get an nginx 301 redirect loop to the HTTPS version, which serves a certificate
for [https://eksempel.dk](https://eksempel.dk).

Similar experience with ai: curling the IP seems to point to a
[http://offshore.ai](http://offshore.ai) page.

Are the top-level A records used for some other protocol? Do they serve any
purpose?

~~~
foepys
I tried both in Chrome on Android and they are working fine for me.

------
krallja
@edent did you submit this to HN as
[https://xn--69f31l4t57c0mag4b613h.xn--7uh4898msjaso/🆆🆃🅵/](https://🅂𝖍𝐤ₛᵖ𝒓.ⓜ𝕠𝒃𝓲/🆆🆃🅵/)
or the ASCII variant?

edit: whoa, HN autoconverted it to Punycode. 🅂𝖍𝐤ₛᵖ𝒓.ⓜ𝕠𝒃𝓲/🆆🆃🅵/

~~~
bluejekyll
It’s a little sad that we ended up with Punycode, given that UTF-8 is such an
elegant encoding that stays compatible with ASCII.

DNS’ concern over backward compatibility is a bit of a pain sometimes. And now
we even have two competing standards, where multicast DNS (mDNS) allows UTF-8,
but “standard” DNS does not.
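
To make the contrast concrete, here is a small sketch comparing the two encodings of one label. The label `bücher` is an illustrative choice, not one from this thread, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so treat it as an approximation of what current resolvers and browsers do.

```python
label = "b\u00fccher"  # "bücher", an illustrative label

# UTF-8 keeps the ASCII letters as they are and uses two bytes for ü.
utf8_bytes = label.encode("utf-8")

# The IDNA form is pure ASCII, with the non-ASCII part moved to the suffix.
ace_bytes = label.encode("idna")

# Decoding the ASCII form recovers the original label.
round_trip = ace_bytes.decode("idna")

print(utf8_bytes)
print(ace_bytes)
print(round_trip)
```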

~~~
drewmate
An important benefit of punycode is that it provides some protection against
homograph attacks [0]. There are so many similar-looking characters in Unicode
that it seems reasonable to trim the allowed characters to a subset. Of course
it's a compromise and ASCII's not perfect, but it's a lot easier to spot
g00gle.com compared to gооgle.com.

[0]
[https://en.wikipedia.org/wiki/IDN_homograph_attack](https://en.wikipedia.org/wiki/IDN_homograph_attack)
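
A short sketch of why the second spoof is harder to spot by eye, assuming the lookalike uses Cyrillic о (U+043E) in place of Latin o. The helper `non_ascii` is hypothetical and simply lists the code points that fall outside ASCII.

```python
import unicodedata

genuine = "google.com"
digits = "g00gle.com"
lookalike = "g\u043e\u043egle.com"  # assumption: two Cyrillic small letter o

def non_ascii(name):
    """List (code point, Unicode name) for every non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in name
        if ord(ch) > 0x7F
    ]

# The ASCII form of the lookalike is visibly different from the real name.
lookalike_ace = lookalike.encode("idna")

print(non_ascii(digits))
print(non_ascii(lookalike))
print(lookalike_ace)
```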

~~~
kuschku
At the same time there are sites like flüge.de which are not reachable under
any domain except the Unicode domain, and while ü could be written as ue,
fluege.de is already owned by a competitor.

Over time, Punycode is going to cause more phishing problems in non-ASCII
countries than it's going to solve, because users aren't going to see a
difference between xn--blabla.de and xn--blablu.de if all domains are
unreadable to them.

~~~
btown
I feel like this is a browser UX problem, right? A browser designed to prevent
phishing of readers of both ASCII and non-ASCII languages might display both
the punycode and unicode versions of a website, and if a heuristic is detected
that a homograph is used that would otherwise result in an Alexa Top 100k
site, display a dialog to warn against a phishing attack. (Your flüge.de
example shouldn't trigger that warning, for instance.)

[https://github.com/phishai/phish-protect](https://github.com/phishai/phish-protect)
is an attempt to do this, but I think there's a better middle ground
for international users that doesn't simply block-by-default all punycode
domains.
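
One possible shape for such a heuristic is a mixed-script check. The sketch below is only an approximation: it guesses a character's script from the first word of its Unicode name, whereas a real implementation would use the Unicode Script property and a confusables table, and it does not attempt the comparison against a list of popular sites. The function names are hypothetical.

```python
import unicodedata

def scripts_in_label(label):
    """Approximate the scripts used in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER U WITH DIAERESIS" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed_script(label):
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in_label(label)) > 1

print(looks_mixed_script("fl\u00fcge"))          # flüge: Latin only
print(looks_mixed_script("g\u043e\u043egle"))    # Latin mixed with Cyrillic
```

A spoof written entirely in one non-Latin script would pass this check, which is part of why the comparison against known domains matters.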

------
Kaveren
I recommend setting network.IDN_show_punycode to true in Firefox via
about:config. This will help keep you safe from this phishing vector.

~~~
Ndymium
Vivaldi shows domains in punycode by default. I believe this is the only
reasonable solution, otherwise browser makers will always be playing catch-up
with exploiters.

~~~
bluejekyll
At that point, how much value is there in supporting Unicode? By only using
ascii (punycode), it pretty much eliminates the reason it exists: To allow
software to show a domain in someone’s native language.

Should we perhaps instead be restricting the domain characters to the glyphs
of ASCII (RFC 1035 compliant) and those glyphs that appear in the user's
locale, and otherwise revert to Punycode when the glyphs fall outside those
ranges?

~~~
lmm
> At that point, how much value is there in supporting Unicode? By only using
> ascii (punycode), it pretty much eliminates the reason it exists: To allow
> software to show a domain in someone’s native language.

Allowing a user to _enter_ a domain in their native language is very much
worthwhile, I'd say, even if we revert to ascii for display.

------
paulpauper
this is fascinating. given how many pixels can fit in a character, the
possibilities are in the tens of thousands at least.

It's almost possible to write a simple math paper using Unicode instead of
LaTeX.

these characters can also be used to evade username restrictions and other
spam filters when character substitution does not work.

you can even use these codes for bitcoin wallets

------
lelf
Also: wordpreß.com

