
Mimic – abusing Unicode to create tragedy - epsylon
https://github.com/reinderien/mimic
======
omgtehlion
Slightly tangential:

In Russia there is a government procurement portal where gov organizations
have to post their requests to enforce competition and best prices.

A usual tactic [1] of corrupt officials was replacing Cyrillic (Russian)
letters with their respective Latin homoglyphs so only affiliated companies
could find and win the contract.

[1]
[http://www.bbc.com/russian/rolling_news/2013/04/130409_rn_st...](http://www.bbc.com/russian/rolling_news/2013/04/130409_rn_state_auctions_improve)
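
A minimal Python sketch of how such listings could be flagged, assuming that a
word mixing Latin and Cyrillic letters is suspicious. The function names are
made up for illustration, and the script is guessed from the first word of each
character's Unicode name, which is a rough heuristic rather than a proper
script lookup.

```python
import re
import unicodedata


def scripts_in(word):
    """Return the set of script names (LATIN, CYRILLIC, ...) used by letters in word."""
    found = set()
    for ch in word:
        if ch.isalpha():
            # Heuristic: Unicode names start with the script name,
            # e.g. "CYRILLIC SMALL LETTER A" or "LATIN SMALL LETTER A".
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def mixed_script_words(text):
    """Return the words that contain both Latin and Cyrillic letters."""
    return [w for w in re.findall(r"\w+", text)
            if {"LATIN", "CYRILLIC"} <= scripts_in(w)]


# A Cyrillic word whose last letter was swapped for Latin "a" (U+0061),
# followed by an untouched Cyrillic word.
sample = "\u043f\u043e\u0441\u0442\u0430\u0432\u043a" + "a" + " \u0442\u0440\u0443\u0431"
print(mixed_script_words(sample))
```

Only the first word is reported; the second one is pure Cyrillic.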

~~~
thomasfl
Now that you have revealed the secret, Hacker News will be banned in Russia
forever.

------
b0ner_t0ner
Taylor Swift? Never heard of her:
[https://www.google.com/search?q=Τаylοr+Ѕwіft](https://www.google.com/search?q=Τаylοr+Ѕwіft)

:D

~~~
pierrec
Kind of surprised at how poorly Google handles this (I would have expected at
least a correction suggestion)! Heck, it might open the door for an obscure
blackhat/phishing technique...

~~~
andrewflnr
Forget google. Can you do this in domain names?

~~~
thousande
No,
[https://en.wikipedia.org/wiki/Domain_name#Technical_requirem...](https://en.wikipedia.org/wiki/Domain_name#Technical_requirements_and_process)
(last paragraph)

~~~
alister
No, you're mistaken. It is actually a very big problem. Earlier on the same
page you linked to, it explains that "ICANN approved the Internationalized
domain name system, which maps Unicode strings used in application user
interfaces"[1].

As a concrete example, the following are fake links to Wikipedia (and entirely
equivalent):

[http://xn--wkd-8cdx9d7hbd.org](http://xn--wkd-8cdx9d7hbd.org) (FAKE, same as
below)

[http://www.wіkіреdіа.org](http://www.wіkіреdіа.org) (FAKE, same as above)

It is true that _network protocols_ encode these internationalized domain
names in a subset of ASCII, but the _user_ sees Unicode in his browser address
bar or email. There is no restriction on how applications (like browsers)
display domain names[2]; they can use Unicode if they want. This leads to all
sorts of devious attacks[3].

[1]
[https://en.wikipedia.org/wiki/Domain_name#Internationalized_...](https://en.wikipedia.org/wiki/Domain_name#Internationalized_domain_names)

[2]
[https://en.wikipedia.org/wiki/Internationalized_domain_name#...](https://en.wikipedia.org/wiki/Internationalized_domain_name#Internationalizing_Domain_Names_in_Applications)

[3]
[https://en.wikipedia.org/wiki/IDN_homograph_attack](https://en.wikipedia.org/wiki/IDN_homograph_attack)
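
The mapping between the two forms can be sketched with Python's built-in
`idna` codec (which implements the older IDNA 2003 rules). The label below is
an assumption: it spells "wikipedia" with Cyrillic look-alikes written as
escapes, and may not match the exact code points in the fake links above.

```python
# Cyrillic look-alikes spelled out as escapes so the mix is visible in source:
# U+0456, U+0440, U+0435, U+0430; "w", "k" and "d" are plain Latin letters.
fake_label = "w\u0456k\u0456\u0440\u0435d\u0456\u0430"

# What travels over the network: an ASCII-only label with the "xn--" prefix.
wire_form = fake_label.encode("idna")
print(wire_form)

# What an application may choose to display: the Unicode form again.
print(wire_form.decode("idna") == fake_label)

# To a program the spoof is a different string from the real name.
print(fake_label == "wikipedia")
```

The plain Latin letters end up in front of the last hyphen of the encoded
label, and the non-ASCII characters are packed into the suffix after it.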

------
wodenokoto
So, one might wonder why these homographs have different code points. After
all the French A and the English A share the same code point.

It's really difficult to do the right thing here. If Greek question marks
share code point with semi-colon, it obstructs search and replace for question
marks.

Subtle differences in how Japanese and Chinese are written has led to
differently written characters sharing the same code point. It's nice that you
can easily look up most Japanese characters in a Chinese dictionary and see
how they are used in China, but it has become frustratingly hard to get
subtleties in their written form right. The Chinese version may have the line
strike through another line, while the Japanese only has it touching.

I honestly don't know how to go about posting how the same code points have
different written forms!

But it seems like it would be nice if code editors warned about text outside
ascii. You usually only want that in strings and comments.

~~~
lazyjones
> _It's really difficult to do the right thing here. If Greek question marks
> share code point with semi-colon, it obstructs search and replace for
> question marks._

Context is the key here. Greek text doesn't use the semicolon for other
purposes and searching/replacing such single characters in source code is a
terrible idea anyway (think comments, string literals...). So what is the
prohibitive failure scenario here?

Indistinguishable (for humans) characters with different code points were a
stupid idea; it's fine to abuse them in order to point out that fact.

~~~
wodenokoto
I don't know enough about Greek to comment on it deeply. Some people do
consider using the same code point for apostrophes and quotation marks
problematic (it's definitely annoying when doing word segmentation). We also
have breaking and non-breaking spaces, as well as tabs. Luckily we didn't
follow typewriter conventions and collapse several alphabetic characters!

A semicolon is not considered sentence final, whereas a question mark usually
is. This makes it easier for software to auto capitalize. So at least that's
possibly a use case.

There's also the possibility that Greek requires the top dot to be square or
circle, meaning it might in fact have subtle differences in print.

~~~
lazyjones
> _A semicolon is not considered sentence final, whereas a question mark
> usually is. This makes it easier for software to auto capitalize. So at
> least that's possibly a use case._

Such software will need and have a language setting anyway (e.g. for
hyphenation). It doesn't have to and cannot rely on code points alone, so the
characters (or rather, different uses of the semicolon) needn't have different
codepoints.

~~~
retbull
What if someone is quoting something in a different language for a translation
or similar?

~~~
DarkUranium
To be fair, that would break hyphenation as well if both languages use a Latin
alphabet (give or take a few carons and such).

------
acdha
Mac users might appreciate the great UnicodeChecker:

[http://earthlingsoft.net/UnicodeChecker/](http://earthlingsoft.net/UnicodeChecker/)

It offers a convenient utility to diff arbitrary strings, which is also quite
handy for e.g. detecting normalization discrepancies, and installs a service
so you can highlight a character in any app and use “Display character
information” to see what it actually is.

I have a Python command-line version in my PATH which displays the character
info for arbitrary input strings:
[https://github.com/acdha/unix_tools/blob/master/bin/unicode-...](https://github.com/acdha/unix_tools/blob/master/bin/unicode-characters.py)
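
For readers who just want the idea, here is a minimal sketch of that kind of
tool using only the standard `unicodedata` module. It is not the linked
script; the function name `describe` is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' line per character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}" for ch in text]


# A regular semicolon followed by its Greek look-alike (written as an escape).
for line in describe(";\u037e"):
    print(line)
```

The two characters render the same in most fonts, but the listing shows
SEMICOLON for the first and GREEK QUESTION MARK for the second.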

~~~
acdha
Using the Taylor Swift example from
[https://news.ycombinator.com/item?id=10438363](https://news.ycombinator.com/item?id=10438363)
in the comparison window looks like this:

[https://www.dropbox.com/s/9j9h5rjt4gu22hb/Screenshot%202015-...](https://www.dropbox.com/s/9j9h5rjt4gu22hb/Screenshot%202015-10-23%2014.49.30.png)

Each hex value shown can be clicked to open the Unicode character info for
that codepoint

------
motti
This sort of stuff can be the basis for many XSS attacks, see
[http://websec.github.io/unicode-security-guide/character-tra...](http://websec.github.io/unicode-security-guide/character-transformations/)

For instance, \u2329, \uFE64, \uFF1C and \u3008 can be best-fitted
automatically to \u003C (the regular '<' mark in HTML)
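
A related effect can be reproduced with Unicode normalization, which is a
different mechanism from best-fit mapping but fails in the same way: a filter
that runs before the transformation never sees the `<`. The sketch below
assumes a hypothetical filter, `naive_filter_ok`, that only looks for the
ASCII character.

```python
import unicodedata


def naive_filter_ok(text):
    """A (hypothetical) filter that only rejects the literal ASCII '<'."""
    return "<" not in text


payload = "\uff1cscript\uff1e"  # fullwidth less-than and greater-than signs
print(naive_filter_ok(payload))  # the naive check lets it through

# A later normalization step turns the fullwidth forms into ASCII.
folded = unicodedata.normalize("NFKC", payload)
print(folded)

# Which of the four characters mentioned above does NFKC fold to '<'?
for ch in ["\u2329", "\ufe64", "\uff1c", "\u3008"]:
    print(f"U+{ord(ch):04X}", "<" in unicodedata.normalize("NFKC", ch))
```

NFKC only folds U+FE64 and U+FF1C; the two angle brackets are left alone by
normalization, so catching those depends on whatever best-fit table the
platform applies.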

~~~
lisivka
It is also a good tool for checking whether Unicode is supported well: just
convert all user-visible messages and then check the program's interface for
<?> or [].

------
mattlondon
I had something similar happen in the wild to me.

I work for a "major search engine" that does a lot of advertising & marketing
stuff. To get the most out of it, we need customers to implement some
javascript on their ecommerce sites.

As is often the case, javascript code that needs to get implemented on an
ecommerce site often gets copy-pasted or emailed around a lot internally
within a customer before it reaches the right person who can add it to the
site's pages.

In this example, somewhere along the way, a normal javascript snippet got all
of the semi-colons changed from ; (U+003B) to ; (U+037E).

In case you've not already spotted it, the second ; is not a semi-colon but is
actually "Greek Question Mark"
([http://www.fileformat.info/info/unicode/char/037e/index.htm](http://www.fileformat.info/info/unicode/char/037e/index.htm)).

It was very confusing why Chrome was moaning about a semi-colon being an
illegal token. I had a genuine "Am I going mad? Seriously?" moment before I
realised what was happening.
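
A small sketch of how a pasted snippet could be scanned for this, assuming
that anything outside ASCII in the snippet is unexpected. The function name
and the sample JavaScript are invented for illustration.

```python
import unicodedata


def find_non_ascii(source):
    """Return (line, column, code point, name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, f"U+{ord(ch):04X}",
                             unicodedata.name(ch, "<no name>")))
    return hits


# Second statement ends with U+037E instead of a real semicolon.
snippet = "var a = 1;\nvar b = 2\u037e\n"
print(find_non_ascii(snippet))
```

Line and column numbers are 1-based, so the report points straight at the
character the browser is complaining about.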

~~~
slowmotiony
I would probably have to quit my job before I could figure out that problem.
May I ask how did you spot it?

~~~
falcolas
Personally, my vim status line has an indicator which shows the hex code for
the rune currently under the cursor.

After fighting against word processor quotes, it's become second nature to
double check it periodically.

~~~
iamcurious
That seems useful. Can you share it?

~~~
falcolas
Here you go:

    
    
        :set statusline=%F%m%r%h%w\ [TYPE=%Y]\ [ASCII=\%03.3b]\ [HEX=\%02.2B]\ [POS=%04l,%04v][%p%%]\ [LEN=%L]
        :set laststatus=2
    

Produces a status line like this when inserting and recording a macro (the
character under the cursor is 'm'):

    
    
        ~/.vimrc [TYPE=VIM] [ASCII=109] [HEX=6D] [POS=0123,0020][67%] [LEN=182]
        -- INSERT --recording
    

And, of course... `:help statusline`

~~~
rspeer
This doesn't seem Unicode-aware at all. I put my cursor on the character 每 and
it says:

    
    
      [ASCII=2>4] [HEX=0>4]
    

It's not ASCII, and its hex code is 6BCF.

It also says "ASCII=252" when I put the cursor over "ü". Claiming that values
over 127 are ASCII is just a malapropism.
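
The distinction is easy to check outside the editor. A short Python sketch
(the helper name is made up) that reports the code point in decimal and hex
and whether it actually falls in the ASCII range:

```python
def code_point_info(ch):
    """Return (decimal, hex, is_ascii) for a single character."""
    code = ord(ch)
    return code, f"{code:04X}", code < 128


for ch in ["m", "\u00fc", "\u6bcf"]:  # m, u with diaeresis, and U+6BCF
    print(repr(ch), code_point_info(ch), ch.encode("utf-8").hex())
```

Only "m" is ASCII; the other two are code points above 127 and take more than
one byte in UTF-8.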

------
sheraz
There is a special place in hell for anyone doing this. I'm going to watch
this repo and blacklist pull requests from anyone who forks it :-)

~~~
creshal
They share the place with coding blogs that use &nbsp; instead of spaces for
code snippets.

~~~
minikomi
Or “real” quotes in code examples..

~~~
TorKlingberg
I don't think people do this intentionally. Either the code snippet has passed
through MS Word (why?) or their blog tool is being "helpful".

~~~
jff
I once had to do a team project with another student who did all his coding in
Wordpad, god knows why. His indentation was more or less random. I wanted to
murder him.

------
pierrec
I can foresee a new phenomenon arising in stackoverflow-style sites and coding
discussion forums:

" _My simple piece of code looks perfect and should work without problems. Yet
it won't compile! Help!_"

Answer:

" _Try running `./mimic --reverse` on your source._"

~~~
probably_wrong
I actually almost submitted something in that vein once. I'd type

    
    
      > ls | wc -l
    

and get

    
    
      > bash:  wc: command not found
    

As it turns out, I need Alt+1 to type a pipe character on my keyboard. If I'm
not quick enough releasing the Alt key, I'll type Alt+Space instead of just
Space, which inserts a non-breaking space[1] on a Mac. This character is not a
space, and therefore it gave me a weird "command not found" error.

This lasted for months until I found out what the problem was - given that it
was a combination of my keyboard settings and OS, finding the root of the
error took quite some time. The hint? The "command not found" error had an
extra space in front of the unknown command.

[1] [https://en.wikipedia.org/wiki/Non-breaking_space](https://en.wikipedia.org/wiki/Non-breaking_space)
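
Why the shell trips over this can be sketched in a few lines of Python,
assuming the shell only splits words on regular spaces (the helper name is
hypothetical):

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, what Alt+Space produces in the story above


def clean_command(line):
    """Replace non-breaking spaces with regular spaces."""
    return line.replace(NBSP, " ")


typed = "ls |\u00a0wc -l"

# Splitting on a real space leaves the NBSP glued to the front of "wc",
# so the command name being looked up is not "wc" at all.
print(typed.split(" "))
print(clean_command(typed).split(" "))
```

The first split shows a single token made of the pipe, the NBSP and "wc",
which matches the extra "space" in front of the command name in the error
message.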

~~~
TazeTSchnitzel
This bit me as well, as I mentioned in a comment above. The solution I found
best was to make OS X not produce an NBSP on alt+space.

------
austinjp
I'm reminded how very useful I've found Text::Unidecode in the past.

[http://search.cpan.org/~sburke/Text-Unidecode-1.27/lib/Text/...](http://search.cpan.org/~sburke/Text-Unidecode-1.27/lib/Text/Unidecode.pm)

~~~
avian
Author of Python port of Unidecode here. I wrote a comment previously,
pointing out that Unidecode does the reverse of Mimic. But then I actually
checked the tables of characters that Mimic uses and deleted my comment.

Mimic chooses replacement characters solely based on their visual similarity
with ASCII. Unidecode, while still doing character-by-character replacements
without deeper analysis, tries to optimize the replacement tables for
transliteration of natural languages.

For example, mimic will replace Latin capital H with Greek capital eta
(U+0397), because they look similar. However, Unidecode will replace U+0397
with Latin capital E, because Latin E is typically used in place of Greek eta
when transliterating Greek text to Latin.
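
The difference is easy to see with toy tables. The sketch below does not use
either project's real data; each dictionary holds just the one entry from the
example above, and the names are invented.

```python
# One-entry tables taken from the example above; real tables are much larger.
MIMIC_STYLE = {"H": "\u0397"}     # Latin H -> Greek capital eta (looks alike)
TRANSLIT_STYLE = {"\u0397": "E"}  # Greek capital eta -> Latin E (sounds alike)


def substitute(text, table):
    """Replace each character that has an entry in table, keep the rest."""
    return "".join(table.get(ch, ch) for ch in text)


disguised = substitute("Hello", MIMIC_STYLE)
print(disguised == "Hello")                   # False, although it looks the same
print(substitute(disguised, TRANSLIT_STYLE))  # Eello
```

So running a transliteration table over mimic-style output does not give the
original text back.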

~~~
Drdrdrq
I have used the php port long ago when creating a simple website search
engine... Great project!

------
adrianN
On a Mac you (used to?) get a non-ascii space when you hit the space bar while
holding Alt or something like that. Easy to fat-finger it in any case and
looks the same in most text editors. It's a great source of fun for novice
Mac-using programmers to find out why the compiler complains.

~~~
amadahy
This is still happening as of today:

ps aux | grep foo

zsh: command not found: grep

It happens to me at least every other day.

------
sly010
Ironically I have a weird OCD where I always assume I made a typo, so I keep
deleting and retyping code a few dozen characters at a time, often in lines
where I see nothing wrong. Over time this has just become something my hands
do whenever my brain needs time to think about something else. So in a way I
developed natural immunity to said unicode tricks ;)

~~~
jobigoud
I think you're not alone. A common error I've noticed is when you make a typo
somewhere (that compiles) and copy and paste it in a different place where you
have the correctly named symbol. It's often hard to see the typo because the
eye flies over the word. So you erase and type it manually.

------
Animats
There's a set of rules used on domain names to stop homoglyph abuse
there.[1][2] Applying those rules to language identifiers would prevent this
problem. It's also useful to apply those rules to login names for forum/social
systems. The rules prevent mixed language identifiers, mixed left to right and
right to left text, and similar annoyances.

[1] [https://tools.ietf.org/html/rfc5893](https://tools.ietf.org/html/rfc5893)
[2] [http://unicode.org/reports/tr46/](http://unicode.org/reports/tr46/)

------
rbinv
I guess someone should develop an IDE/editor plugin that marks non-ASCII
characters outside of string literals.
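
For Python source, the core of such a check can be sketched with the standard
`tokenize` module. This is only an illustration with a made-up function name;
it reports non-ASCII tokens such as identifiers, and a character that is not
valid in the language at all may make the tokenizer fail instead, depending
on the Python version.

```python
import io
import tokenize


def non_ascii_outside_literals(source):
    """Return (line, column, text) for non-ASCII tokens that are not strings or comments."""
    skip = {tokenize.STRING, tokenize.COMMENT}
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in skip and not tok.string.isascii():
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits


# The identifier starts with Greek capital eta (U+0397), not Latin H.
# The non-ASCII characters in the string and the comment are left alone.
code = "\u0397ello = 1\nname = '\u00fc'  # \u00fc is fine here\n"
print(non_ascii_outside_literals(code))
```

Only the identifier on the first line is reported.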

~~~
lazyjones
> _marks non-ASCII characters outside of string literals._

Many programming languages support non-ASCII variable name characters now.

~~~
TorKlingberg
> Many programming languages support non-ASCII variable name characters now.

Just because you can do something doesn't mean you should.

It is usually worth keeping variable names and such in English to enable
international collaboration. Also non-ASCII source files can get mangled in
transit.

~~~
kuschku
Well, it does happen, though – look at this weather data from a large German
newspaper, it is in a custom format ('|' separated values) and in German:
[http://wetter.bild.de/data/meinwetter.txt](http://wetter.bild.de/data/meinwetter.txt)

It happens all the time, everywhere, that people write code and stuff in their
native language.

~~~
Dylan16807
That's data. The suggestion is about variable names.

~~~
kuschku
Well, the variable names of Bild.de (for example HTML class names) are also in
German.

It happens all the time, everywhere.

------
AUmrysh
I think the line about "Mimic substitutes common ASCII characters for obscure
homographs" has it backward. Shouldn't it say Mimic substitutes obscure
homographs for common ASCII characters?

~~~
pascalmemories
Never occurred to me before, but here "substitutes" reads to me as being
commutative. I read both as having the same meaning. (i.e. you end up with
unicode homographs replacing your ascii) Just me?

~~~
dmd
In that case, you won't mind if I substitute poison for your favorite tasty
beverage.

~~~
kps
Technically, my favorite tasty beverage _is_ poison.

------
cstross
Also GREAT if you're trying to identify untaken phishing domain names to
register for your next scam!

~~~
sheraz
wouldn't you end up with the 'xn--' ascii expansion in the url window?

~~~
sschueller
Most modern browsers will show you the unicode version.

~~~
germanier
Nowadays most modern browsers will revert to the punycode ("xn--") if there is
any chance of confusion, cf.
[https://en.wikipedia.org/wiki/IDN_homograph_attack](https://en.wikipedia.org/wiki/IDN_homograph_attack)

------
reinderien
Mimic author here... sorry, humanity...

------
Svenstaro
Wow, now that's just pure evil.

~~~
torgoguys
Yes, seriously. This is why we can't have nice things.

------
ant6n
One could name variables and functions to later identify whether code was
copied (e.g. to find out whether somebody copied some GPL code).

~~~
Kristine1975
Note to self: Run mimic --reverse on GPL code I copy.

------
tucif
Spotify used to have a security problem with these kinds of characters:

[https://labs.spotify.com/2013/06/18/creative-usernames/](https://labs.spotify.com/2013/06/18/creative-usernames/)

------
cruise02
> Replace a semicolon (;) with a greek question mark (;) in your friend's C#
> code and watch them pull their hair out over the syntax error

I'm not sure how frustrating this would be. Wouldn't most people just delete
the character immediately and type a new one?

~~~
patal
If faced with a linter error, I don't typically delete the marked stuff, write
it anew and hope fingers crossed that the error would be gone. I would try to
make sense of the message, how it applies, and what the error is. At some
point though, I definitely would pull my hair over a greek question mark.

~~~
cruise02
Only one character is going to be marked in this case, not a whole line or
section of code. Deleting it and retyping it costs one second. I guess I've
seen more than my fair share of encoding issues. I used to tutor at a
university, so students were constantly coming in with code they'd copy/pasted
out of their assignment (usually a Word doc) or from a web site.

~~~
patal
I think that's a great argument. If someone mails the code, I hope to have the
cleverness to suspect the encoding. However, I thought about a code repository
or similar where this may be an issue, but most often is not. And I have seen
some code where a wrong language character did not provoke a reasonable error,
but some arbitrary parser error that went off in another line altogether (not
necessarily C#).

------
pmlnr
This somewhat reminds me of this little entry on how "tolerant" JavaScript
is...

[https://mathiasbynens.be/notes/javascript-identifiers](https://mathiasbynens.be/notes/javascript-identifiers)

------
gnud
This can actually be used productively, to see how your app reacts to weird
input :)

------
cammsaul
The repo's README mentions a vim plugin to highlight Unicode homoglyphs. As an
Emacs user, I did a quick M-x package-list-packages, thinking I'll find at
least half a dozen equivalent Emacs packages.

To my dismay, there were _none_. So I spent the rest of my afternoon
correcting this glaring deficiency. Fellow Emacs users, protect yourself from
Unicode trolls and grab it here:
[https://github.com/camsaul/emacs-unicode-troll-stopper](https://github.com/camsaul/emacs-unicode-troll-stopper)

------
Procrastes
This seems like a useful tool for fuzz testing your dev ops person, or if you
are the dev ops person, for fuzz testing development. Fuzz for all!

------
Kristine1975
Piping the result through TTS creates weird results (on OS X):

    
    
      echo "hello world" | mimic --me-harder 100 | say

~~~
archimedespi
Can anybody provide an audio snippet for those of us who use Linux?

~~~
roryokane
I don’t have an audio snippet, but I can transcribe what the voice says on
different runs. It usually pronounces random letters individually, but
sometimes pronounces syllables with letters missing:

“L-W-R-D”, “L-L-W-R-L”, “hell-erl”, “H-L-er-D”, “eor-D”, “H-L-L-erl”,
“L-L-W-L-R-D”, “H-L-W-R-L”, “hell-W-R”

------
n-gauge
Just chuck the code into an XML validator. Any character > 127 will be flagged
as invalid.

------
foolfoolz
Is anyone aware of the reverse of this, a homoglyph normalization library? I'd
love to be able to take strings that visually look the same and compare them
against one master list, such as for spam detection.
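
The usual approach is to map every string to a "skeleton" and compare the
skeletons. A minimal sketch of that idea follows, with a tiny hand-picked
table that is purely illustrative; a real table would be built from Unicode's
published confusables data, and the function names are made up.

```python
# Tiny hand-picked table for illustration only.
CONFUSABLES = {
    "\u03a4": "T",  # GREEK CAPITAL LETTER TAU
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u0405": "S",  # CYRILLIC CAPITAL LETTER DZE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}


def skeleton(text):
    """Map each known look-alike to its ASCII prototype."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def looks_like(candidate, known):
    """True if both strings reduce to the same skeleton."""
    return skeleton(candidate) == skeleton(known)


# A spoofed name in the style of the search example earlier in the thread.
spoof = "\u03a4\u0430yl\u03bfr \u0405w\u0456ft"
print(spoof == "Taylor Swift")             # False
print(looks_like(spoof, "Taylor Swift"))   # True
```

For spam detection the master list would be stored as skeletons too, so each
incoming string needs only one lookup.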

------
r721
In cases like those I use unicodelookup.com to list suspicious characters :)

~~~
bmn_
That site does not work for me. I paste into the input field and it
automatically turns into %F0%9F%90%88.

Compare with [https://codepoints.net/](https://codepoints.net/) instead.

Edit: great, HN is broken too.

~~~
r721
Thanks, didn't know this one.

------
bradbeattie
Add the following to your ~/.vimrc to always highlight non-ascii characters:

    
    
        au BufWinEnter * let w:matchnonascii=matchadd('ErrorMsg', '[^\x00-\x7f]', -1)

------
Induane
These dang democrats done banned Ben Carson from google man!

[https://www.google.com/search?q=Ben+Сarѕоn](https://www.google.com/search?q=Ben+Сarѕоn)

------
perlancar2
Made a perl port:
[https://metacpan.org/pod/mimic](https://metacpan.org/pod/mimic) (currently
50% faster)

------
TazeTSchnitzel
In some languages which allow non-ASCII but aren't Unicode-aware (PHP, for
instance), you can add significant, invisible zero-width spaces to
identifiers.
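
Such characters can at least be found mechanically. A short sketch, assuming
that any character in Unicode's "format" category (Cf), which includes the
zero-width space, is unwanted inside an identifier; the function name is
invented.

```python
import unicodedata


def invisible_chars(text):
    """Return (index, code point, name) for format-category (Cf) characters."""
    return [(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
            for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cf"]


plain = "total"
sneaky = "to\u200btal"  # zero-width space hidden in the middle

print(plain == sneaky)          # False, although both usually render as "total"
print(invisible_chars(sneaky))
```

The two names compare as different strings even though they are typically
displayed identically.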

------
florian-f

        > var ﷺ = 1;
        < undefined

------
webXL
Hmm... I wonder if this can be used in browser source maps.

------
cevaris
Some people just want to see the world burn...

------
mrzool
Some men just want to watch the world burn.

------
ChrisArgyle
And now I know what I'm doing for April 1st next year.

------
ehosca
i smell a Notepad++ extension

------
grabcocque
YOU ARE A TERRIBLE PERSON AND I LIKE YOU

