

Can we believe our eyes? Misleading people with Unicode. - bensummers
http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx

======
byrneseyeview
This was how some folks pretended that Google had erased all mentions of
"Oracle":

[http://giorgiosironi.blogspot.com/2010/08/google-never-
remov...](http://giorgiosironi.blogspot.com/2010/08/google-never-removed-
oracle-from-its.html)

I used this to prank some people on the in-house SEO team at my last job. I'd
ask them if they had done anything that might be considered black-hat. Then
I'd send them a link to a "site:" query on Google indicating that our site had been
removed from the index.

e.g.
[http://www.google.com/search?sourceid=chrome&ie=UTF-8...](http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=site%3An%D0%B5ws.ycombinator.com)

~~~
SiVal
The wide variety of exploits such as these suggests that we need to integrate
character spoofing into the general malware detection system on devices, which
evolves over time (in the way that virus checkers evolve, with lots of human
input) to deal with known or anticipated problems.

I'm thinking of a system that combines aspects of virus checking, malware
detection, bayesian spam filtering, and spell checking.

A Unicode system can be supplied with tables of characters that could easily
be mistaken (visually) for one another. These tables, combined with
dictionaries, could spot words that could look like dictionary entries
visually but are not spelled in the ordinary way. This approach could even
spot things that have been problems for years in pure ASCII: confusion of 0
and O, of l and 1 and I, of rn and m, etc. It could also spot insertions of
non-visible characters into such things as URLs and filenames.
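
A rough sketch of the table-plus-dictionary idea, in Python. The table, the
invisible-character set, the word list, and the names `skeleton` and
`imitated_word` are all made up for illustration; a real system would need far
larger tables.

```python
from typing import Optional

# Hypothetical, hand-picked table; a real one would be far larger.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "0": "o",
    "1": "l",
    "I": "l",
}

# Characters that normally render with no visible glyph.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

DICTIONARY = {"hosts", "paypal", "google"}  # stand-in word list


def skeleton(word: str) -> str:
    """Fold a word into a rough 'what it looks like' form."""
    visible = (ch for ch in word if ch not in INVISIBLE)
    folded = "".join(CONFUSABLES.get(ch, ch) for ch in visible)
    return folded.lower().replace("rn", "m")


# Dictionary entries are folded the same way so both sides compare equal.
SKELETONS = {skeleton(w): w for w in DICTIONARY}


def imitated_word(word: str) -> Optional[str]:
    """Return the dictionary word that `word` imitates, or None."""
    if word.lower() in DICTIONARY:
        return None  # spelled the ordinary way
    return SKELETONS.get(skeleton(word))
```

With this, `imitated_word("h\u043ests")` returns `"hosts"`, while the plain
ASCII `"hosts"` returns `None`.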

Such a system would be able to spot .exe files that had names written in such
a way that the .exe extension was not visually displayed at the end of the
name. If you double-clicked such a file the first time, it could ask you if
you realized that it is a program you are about to run and not a ".jpg" as the
name might suggest. In fact, it could ask you about any file whose real
extension and apparent visual extension differed.
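
Something like the following Python sketch could compare the two extensions.
It only imitates the simplest case of bidirectional rendering (reversing the
text between an RLO and the next PDF or the end of the name), so treat
`apparent_name` as a crude stand-in for a real bidi implementation. Both
function names are invented here.

```python
import os

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def apparent_name(name: str) -> str:
    """Crudely approximate rendering: reverse each RLO...PDF run."""
    out = []
    rest = name
    while RLO in rest:
        before, _, after = rest.partition(RLO)
        run, _, rest = after.partition(PDF)
        out.append(before)
        out.append(run[::-1])  # overridden text is drawn right to left
    out.append(rest)
    return "".join(out)


def extension_mismatch(name: str) -> bool:
    """True if the stored extension differs from the one the user sees."""
    real = os.path.splitext(name)[1].lower()
    shown = os.path.splitext(apparent_name(name))[1].lower()
    return real != shown
```

For the name `"pic\u202egpj.exe"` the stored extension is `.exe`, the
approximated display is `picexe.jpg`, and `extension_mismatch` returns `True`.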

There will still be problems that will sneak through, just as today you can
phish people with subtle misspellings that don't require anything more than
ASCII.

But making this a part of the system's evolving general malware detection
system, with human-created tables and heuristics borrowed from malware
detectors, spam filters, and spell checkers, is the best solution, IMO.

~~~
nitrogen
I would augment the human-generated tables of confusable characters with
something like OCR run on each font to detect similarly-shaped characters. The
algorithm could provide a score indicating how similar any two characters are
(or maybe how similar a given character is to all other characters, combined
with statistical frequency of that character), which could be weighted and
incorporated in a malware detection heuristic.

~~~
mtraven
Overdesign alert.

How about the OS adopting the convention that any codes outside of a few
trusted (expected) alphabets get displayed in a way that makes it obvious to a
human that they aren't what they look like (eg, a bright red border or
something).

~~~
nitrogen
_Overdesign alert.

How about the OS adopting the convention that any codes outside of a few
trusted (expected) alphabets get displayed in a way that makes it obvious to a
human that they aren't what they look like (eg, a bright red border or
something)._

AIUI, there are two major reasons this wasn't done in the first place, and why
more complex solutions are necessary:

1\. Those few trusted alphabets would probably include Greek, Cyrillic, and
Latin, all of which have similar or identical characters with different
Unicode code points.

2\. The goal of Unicode support, localized domain names, etc. is for software
to be equally easy to use for all languages, rather than to favor some
languages over others.

That said, it might be advantageous to have a locale-specific approach, so
that characters not used by the current language will be highlighted. But,
that could be seen as hindering the ability of sites in one region to reach
users in another region, doesn't work well for text that includes multiple
languages, and malware writers will probably find a way to mark their
characters as expected anyway.

Edit: also, the two words "get displayed" paper over a vast amount of
complexity in the way operating systems and applications display text. It
would probably be just as much work as any of the other solutions proposed.

------
Rygu
I'm pretty shocked that I have never heard of the RLO unicode character before
this article. Let's see if it works: ppa.emorhCelgooG => ‮ppa.emorhCelgooG

~~~
yahelc
Whoa. Check out what you did to the markup of this page. <http://d.pr/HKSQ>

~~~
jerf
The HTML standard specifies that changes in text direction are bounded to the
block they occur in: [http://www.w3.org/TR/1999/REC-
html401-19991224/struct/dirlan...](http://www.w3.org/TR/1999/REC-
html401-19991224/struct/dirlang.html#h-8.2.2)

I discovered this as I was writing a paranoid HTML cleanup library and wanted
to prevent the attack where a user sticks a text-direction-change character
into the page and reverses the whole thing. As we've all just witnessed, that
can't happen in a conforming browser.

But when viewing the page as a text stream, yup, it reverses and then never
really unsticks. Everything's working as designed!

(Maybe my library should still restore the page flow after all... I never
thought of how it could mess up view source. As attacks go, it's weak sauce...
but like I said, it's meant to be really, really paranoid.)
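
For what "restoring the page flow" might look like, here is a minimal sketch
in Python (not the Haskell library, which isn't published). It assumes the
only characters that matter are the explicit embedding and override controls
and PDF; the newer isolate controls (U+2066 to U+2069) are not handled, and
`close_bidi` is a made-up name.

```python
EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def close_bidi(text: str) -> str:
    """Drop stray PDFs and append one PDF per unclosed embedding/override."""
    depth = 0
    out = []
    for ch in text:
        if ch in EMBEDDINGS:
            depth += 1
        elif ch == PDF:
            if depth == 0:
                continue  # nothing open to close, so drop it
            depth -= 1
        out.append(ch)
    return "".join(out) + PDF * depth
```

So a fragment containing a lone RLO comes back with a PDF appended, and the
text stream after it reads in its normal direction again.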

~~~
yuvipanda
Is this 'really, really paranoid' library available somewhere?

~~~
jerf
Not yet. But it'll be in Haskell anyhow, so you're probably not actually
interested :)

There are other such libraries for other languages, poke around. See for
instance <http://htmlpurifier.org/> .

~~~
yuvipanda
Even more intriguing. What exactly were you writing in Haskell that needed a
super-paranoid html sanitizer? Yet another web server/blog/cms? Or something
way cooler?

~~~
jerf
Yet another blog, except not targeted for release or anything, just to run my
own site. To replace the Django blog that runs my site. It's sort of my
entertaining diversion, you know? Working with my own fresh, clean code base
where I can try some ideas out without having to carry around a couple of man-
centuries worth of legacy code every time I step at work. The cleansing
library doubles as my HTML formatter, too, doing things like ensuring close
italic tags and such. The paranoia is half real, half fun exercise.

~~~
yuvipanda
Ah, yes - I see where you're coming from. Some day I hope to finish my blog,
written in C :D Good luck for your haskell blog :)

------
driverdan
I personally find the RLO / LRO issue much more concerning. I just tested
Chrome and Firefox and found it works in URLs. You could rewrite pyapla.com to
paypal.com and phish people easily.

~~~
driverdan
I've done a bit more testing and found that while it can be used in URLs it's
problematic. When pasted in the navigation bar it causes errors in both Chrome
and Firefox. There's probably a way to exploit this and make it work but I
don't have time to dig into it right now.

------
mcantor
It's funny to think that a hapless vimmer who happens to be running Windows
would have never noticed this, because they would simply have typed ":edit
$SYSTEMROOT\system32\drivers\etc\hosts" and gotten the real file.

(This isn't a "look how cool command line junkies are" comment; I was just
musing.)

~~~
cbr
tab completion

~~~
spicyj
On sane systems, tab completion refuses to complete if there is more than one
potential completion. Unfortunately, I seem to remember that Windows does some
silly thing where it cycles through the possibilities…

~~~
nitrogen
Vim typically cycles through the possibilities as well.

------
gwern
I remember years back on Wikipedia, clever vandals would play Unicode tricks.
It was interesting, to say the least - you'd register a name that looks
identical to a real user, vandalize, and hope the administrator would type the
name in...

This was ultimately stopped by Antispoof
([https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extens...](https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:AntiSpoof))
but the bug reports are still interesting:

\- <https://bugzilla.wikimedia.org/show_bug.cgi?id=2593>

\- <https://bugzilla.wikimedia.org/show_bug.cgi?id=2290>

~~~
carussell
Although it's not vandalism, something else that permeated Wikipedia is the
use of the Cyrillic ya (Я) in place of R, where the stylized artwork for the
article subject reverses the R (and varies it in many other ways, for sure).

The difference between the two is that this phenomenon is not wholly in the
past.

One of the biggest thorns of the situation is when editors bring up an
official or semi-official website related to the subject that uses Я, pointing
to its existence as "proof". No, that isn't proof; whoever is managing that
area of the web properties is just a jackass.

~~~
eavc
Can you point to a few examples?

~~~
carussell
<http://en.wikipedia.org/wiki/Toys_%22R%22_Us>
<http://en.wikipedia.org/wiki/Korn>

Thankfully, Cyrillic as a poor approximation of the artwork doesn't always win
out
[http://en.wikipedia.org/wiki/Talk:Superunknown#Track_name_.2...](http://en.wikipedia.org/wiki/Talk:Superunknown#Track_name_.22Superunknown.22)

And hey! It looks like discussion that I wasn't even aware of spawned on the
Toys R Us talk page about it.

------
noahc
Somewhat related to this is the ability to change "l" and "I" around when they
both look the same, basically a straight line.

This was very common in Yahoo Chat Rooms when folks would pretend to be
someone else by registering their name with the opposite of what they had
(assuming it had an "i" or "l" in it).

They would then take a screen shot of their font and copy that exactly so they
could appear to be the other person. I'll let you imagine the chaos that could
occur because of this!

~~~
Timothee
Somebody got me with that exact trick during the Charlie Sheen debacle: I was
going back and forth between the Twitter page for @CharlieSheen and
@CharIieSheen and couldn't figure out how this was possible… I didn't feel
exactly smart when I realized what was going on.

edit: on a related note, I half-jokingly tend to read RockMelt as rock-me-it…

~~~
carussell
I've seen an unblinking use of "RockMeIt" before. I'm fairly sure it was here
on HN.

------
andresmh
this reminds me of IDN domains, a few weeks ago I purchased fácebook.com and
góogle.com. I get a few hundred visits every day. I posted about it here
[https://plus.google.com/110362380602139255131/posts/NPS3VNyD...](https://plus.google.com/110362380602139255131/posts/NPS3VNyDxuJ)

~~~
pavel_lishin
I wonder how/why people end up typing á in that URL - on my machine, it takes
extra effort (OPTION-e a) to enter that.

~~~
njharman
non-english keyboards

links

~~~
andresmh
Yes. I think this is part of it. Here are the logs from a few weeks ago:
<http://j.mp/pBqfbx> I'd be curious if anyone can figure out why so many
visitors are using a browser called Netfront (apparently from Samsung
mobiles).

~~~
BrandonM
On my BlackBerry, inadvertently swiping the trackpad while typing a vowel
results in an accented version of the vowel. There might be a similar mechanism
at work in Samsung devices.

~~~
andresmh
I see. What is interesting is that most of the requests come from Samsung
devices in Spanish-speaking countries, so I guess their keyboards must make it
even easier to make the mistake.

------
matthavener
This is why "filters" that prevent XSS, etc. by removing malicious characters are
so easily breakable. This type of attack is called a canonicalization attack
(more here
[https://www.owasp.org/index.php/Canonicalization,_locale_and...](https://www.owasp.org/index.php/Canonicalization,_locale_and_Unicode))

~~~
cbr
This is not a canonicalization attack. Those attacks are based on there being
multiple ways to encode the same unicode codepoint in utf8. A utf8 decoder
should reject portions of utf8 streams that don't use the shortest possible
encoding, but not all do. If there are multiple ways to encode '<', then an
xss prevention filter is going to have trouble.

The attack described here is simpler: two unicode codepoints, roman 'o' and
cyrillic 'o', usually look identical. So by substituting cyrillic we can make
a file called 'hosts' that the operating system won't pay attention to. This
is the same problem with punycode internationalized domain names, where
paypal.com might be spelled with a cyrillic 'a' and mislead people. The fix
for domain names was to restrict what unicode you could use where. I'm not
sure what the fix is here, aside from always showing hidden files.
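
To make the lookalike concrete, a small Python sketch using only the standard
library. The helper names `describe` and `to_ascii` are invented here;
`to_ascii` leans on Python's built-in `idna` codec to show the punycode form
that a browser could display instead of the Unicode one.

```python
import unicodedata

latin = "paypal.com"
spoof = "p\u0430ypal.com"  # second letter is CYRILLIC SMALL LETTER A


def describe(text: str) -> list:
    """List each character's code point and Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def to_ascii(domain: str) -> str:
    """Encode a domain the way it travels over DNS (punycode for non-ASCII)."""
    return domain.encode("idna").decode("ascii")
```

`to_ascii(latin)` stays `paypal.com`, while `to_ascii(spoof)` yields a label
starting with `xn--`, which no longer looks like the real name.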

~~~
pilif
> A utf8 decoder should reject portions of utf8 streams that don't use the
> shortest possible encoding

so you would say that there should be no file names using the cyrillic o? So
if a russian-speaking person wants to save a file, that file name should be
rejected? Or translated into a mish-mash between cyrillic and roman
characters?

How will that work if that filename is reused on a system on which the default
font doesn't contain the roman characters (I'm sure such a thing exists) and
thus font substitution needs to happen?

The fix definitely isn't this easy. Maybe one could disallow homoglyphs of a
different language than the one dominating the current file name. But this
might be a lot of work and I doubt it's fool-proof.

~~~
cbr

        > A utf8 decoder should reject portions of utf8
        > streams that don't use the shortest possible encoding
    
        so you would say that there should be no file
        names using the cyrillic o?
    

I'm sorry, I was unclear. I should have said "don't use the shortest possible
encoding _for a code point_ ". Cyrillic 'o' is code point U+043E while roman
'o' is code point U+006F. The canonicalization attack relies on overly liberal
utf8 decoders that would allow multiple binary streams to be interpreted as,
say, code point U+006F.

This looks like the canonicalization attack, but is a different problem, one
that is not solved by fixing decoders.
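
A sketch of the decoder side, in Python. `naive_decode_2byte` is a
deliberately sloppy decoder written for this example, not any real library's
behaviour; `strict_decode` just wraps Python's own UTF-8 decoder, which
rejects overlong forms.

```python
from typing import Optional

OVERLONG_LT = b"\xc0\xbc"  # two-byte overlong form of '<' (U+003C)


def naive_decode_2byte(data: bytes) -> str:
    """Decode a 2-byte sequence without checking for the shortest form."""
    first, second = data
    return chr(((first & 0x1F) << 6) | (second & 0x3F))


def strict_decode(data: bytes) -> Optional[str]:
    """Decode as UTF-8, returning None for malformed or overlong input."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

A filter that scans the raw bytes for `b"<"` finds nothing in `OVERLONG_LT`,
yet the naive decoder turns it into `"<"`. The strict decoder returns `None`
for the same bytes.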

~~~
Sephr
The Cyrillic o isn't being interpreted as U+006F, it just looks like it. A
UTF-8 decoder doesn't know how its output is going to be used.

~~~
sesqu
That's not the point. The attack relies on some nonconforming decoders
exhibiting a many-to-one mapping of bitstreams to codepoints, changing the
semantics of the bitstream.

------
fferen
A while ago I compiled a list of unicode characters that looked like letters,
to get past curse filters. Not comprehensive, because I just manually skimmed
through a unicode table, but here it is if anyone cares:

Wide letters: Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ａ ｂ ｃ ｄ ｅ ｆ
ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ

Better looking letters: ϲ р с Ѕ І А В Е М Н О Р С а е о ѕ і ԛ

Anyone know of a better resource?

~~~
est
Someone needs to write a scanner that compares Unicode fonts visually and lists
the most alike characters.

~~~
shabble
Seems like it might be doable with a bit of hacking of the magical detexify:
<http://detexify.kirelabs.org/classify.html>

------
nodata
Seems easy enough to guard against. Highlight the characters which are
unexpected for my locale.

~~~
neilk
That's a clever idea, but I don't know how one would determine what
"unexpected" is in an increasingly international world.

~~~
adobriyan
Code point has script property and mixing it inside strings which aren't
likely to contain multiscript code points (like filenames) is a sign of
trouble.
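
A rough way to try this in Python. The standard library does not expose the
script property directly, so this sketch approximates it with the first word
of each character's Unicode name, which is an assumption that holds for
Latin, Cyrillic, and Greek letters but not in general. The function names are
invented.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Approximate the set of scripts used by the letters in `text`."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])  # e.g. "LATIN", "CYRILLIC"
    return scripts


def is_mixed_script(text: str) -> bool:
    """True if letters from more than one (approximate) script appear."""
    return len(scripts_in(text)) > 1
```

`is_mixed_script("h\u043ests")` is `True` and `is_mixed_script("hosts")` is
`False`. The check is blunt: it also flags a Cyrillic file name that merely
carries a Latin extension.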

~~~
xentronium
Filenames can contain multiscript code points.

E.g.: "Архив.gz" (Archive.gz)

~~~
pavel_lishin
But do default windows ones? I think (hope?) that the hosts file is hosts on
every system, no matter which country or locale you selected - instead of,
say, хозяин.

Perhaps limit system folders and files to ascii-only. Doesn't solve any of the
picjpg.exe issues, but it's a start.

~~~
xentronium
To be honest, I don't think it solves the problem of system file being
modified by a process of unknown origin.

------
sarenji
This is an important issue in some chat programs. I had to deal with this all
the time: malicious users using i vs. l to pose as others, and using unicode
to mess up or reverse the entire chat. One of the more interesting Unicode tricks had
characters going left, right, _and up and down_. This confused moderators
about who to kick/ban and obscured other users' text.

The solution was to implement a regex of whitelisted characters; since it's an
English-only program, this works well and is future-safe. For multiple
languages, a blacklist is probably okay, but the difficulty lies in keeping
the blacklist both complete and up to date.
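
Roughly what such a whitelist could look like in Python. The exact character
set and the 20-character limit are assumptions made for this sketch, not the
rules the chat program actually used.

```python
import re

# Hypothetical nickname policy: ASCII letters, digits, underscore, hyphen.
ALLOWED_NICK = re.compile(r"[A-Za-z0-9_-]{1,20}")


def is_allowed_nick(nick: str) -> bool:
    """Accept a nickname only if every character is on the whitelist."""
    return ALLOWED_NICK.fullmatch(nick) is not None
```

Anything outside the list, including Cyrillic lookalikes and direction
overrides, is rejected without having to enumerate it. It does nothing about
lookalikes inside ASCII itself, such as I and l.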

~~~
derleth
> malicious users using i vs. l to pose as others

It seems like it would be easy to implement a blacklist that automatically
updates based on the current crop of registered names such that filters are
applied to each registered name to generate a number of lookalike names which
are all unavailable for use. This should allow any good-faith user to register
any name they want without causing confusion.
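
One possible shape for this, sketched in Python. Instead of generating every
lookalike of a registered name, it folds each name to a canonical key and
refuses a new name whose key is already taken, which blocks the same set. The
folding table and the `NameRegistry` class are invented for illustration.

```python
# Hypothetical folding table: characters that are easy to mistake for others.
FOLD = str.maketrans({"I": "l", "1": "l", "0": "o"})


def fold(name: str) -> str:
    """Reduce a name to a key shared by all of its lookalikes."""
    return name.translate(FOLD).lower()


class NameRegistry:
    def __init__(self) -> None:
        self._taken = {}  # folded key -> name as originally registered

    def register(self, name: str) -> bool:
        """Register `name` unless it looks like an existing name."""
        key = fold(name)
        if key in self._taken:
            return False
        self._taken[key] = name
        return True
```

After `register("CharlieSheen")` succeeds, `register("CharIieSheen")` (with a
capital I in place of the l) returns `False`.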

------
mikelward
On Linux, how could you make standard tools highlight or differentiate
potentially misleading characters?

I guess the solution would have to be in the terminal emulator? Would a
blacklist of Unicode ranges be sufficient?

~~~
feydr
od -h -- I used to run into this problem a lot when I was writing grammars
against different locales

~~~
mikelward
I'm aware of od, xxd, etc.

I'm wondering if there's a way to make it more obvious that doesn't require
running od. It should be immediately apparent any time there's a file with a
whacky name, not something you find out two minutes into investigating a
compromise.
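
As a stopgap, a small wrapper could spell out anything outside printable
ASCII when listing a directory. This is only a sketch in Python with made-up
names, and it assumes ASCII is what you expect to see, which will not suit
every locale.

```python
import os


def flag_name(name: str) -> str:
    """Return `name` with non-ASCII and control characters spelled out."""
    out = []
    for ch in name:
        if " " <= ch <= "~":
            out.append(ch)  # printable ASCII is shown as-is
        else:
            out.append(f"<U+{ord(ch):04X}>")
    return "".join(out)


def list_flagged(path: str = ".") -> list:
    """List directory entries, marking those that contain flagged characters."""
    lines = []
    for entry in sorted(os.listdir(path)):
        shown = flag_name(entry)
        marker = "!! " if shown != entry else "   "
        lines.append(marker + shown)
    return lines
```

A file named with a Cyrillic о would then show up as `!! h<U+043E>sts`
instead of blending in with `hosts`.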

------
evilswan
Yes - reminds me of how several users would exploit the Bolt.com chat system
(back in the day) using upper-case 'I's as lower case 'L's to pose as
different users and cause mayhem.

~~~
hasenj
In the old days, I tried once to put a backtick ` on my username in counter-
strike as a way to prevent admins from kicking me :P (the backtick toggles the
console, and kicking requires typing 'kick <name>' from the console).

I was kicked within 2 seconds from joining the server.

~~~
Tyr42
I run a tf2 server, and we can just type /kick na`me in chat to kick you. (tf2
is like cs in that ` is the console)

------
ams6110
Isn't the real issue that in order to be vulnerable to this, you have to be
running as a user who has permission to diddle with the hosts file? Or that
your hosts file has too-liberal write permissions?

Hosts file attacks are well-known enough that on windows I always set them to
read-only, so that even administrators can't change them without first
clearing the read-only flag.

~~~
lallysingh
Check out the second example, where you can have a .exe file show up as a
.jpg, which is much, much nastier.

------
delinka
With the exe as jpg example, it'd be even more misleading if the exe used a
photo for an icon, launched the photo viewer app for a matching jpg photo, and
launched an insidious process in the background. Even harder to detect from
whence the malware came.

------
HankMcCoy
0x43E (о) is the stuff of nightmares

facebооk.com is still available ;)

------
tantalor
But, how does this work? Does Windows source all of the files in your
%SystemRoot%\system32\drivers\etc? Why does it matter what the file is named?
To hide from idiots?

~~~
omh
Windows is loading the real "hosts" file. This is hidden, and there's a non-
hidden "h_sts" file there as well.

It seems like it would only work for hiding from people casually checking.
Personally I'd open the file by typing the path myself, so I'd end up finding
the trojan's file. The same would be true for any automated anti-spyware tool.

So yes, this looks like it would only affect a very limited number of people -
technical enough to check the hosts file, but naive enough to do it manually
and not notice the other hidden file.

~~~
juliano_q
Since Windows by default doesn't show hidden files, I must say that most of my
engineer colleagues would fall for the trick. Sadly, most people that I've worked
with open the files manually.

~~~
Blarat
I find that disturbing, one of the first things I do after a clean install of
windows is to check the "Show Hidden Files", which was hard to find in Windows
7 :/ I thought most tech-savvy people did that?

~~~
shrikant
I wonder why you say it was hard to find - it's always been under the View tab
in Folder Options.

Maybe because of the Control Panel revamp, or the fact that the menu-bar is
hidden by default in Explorer windows?

~~~
nxn
I've been using Windows 7 at work for about a year, and I had absolutely no
idea that the menu will show up when I hit the alt key until exactly 2 days
ago. Every time I needed something from it I'd just rummage around all the
visible menus, not find it, curse my head off and open up the command prompt.

I'm not sure which will look dumber as a consequence of this post: me or the
Windows 7 UI.

~~~
dagw
Don't feel too bad. I've been using Windows 7 since shortly after it came out
and I had no idea until I just read your post.

------
adr_
I've had a Fасевоок Offіcіal account lying around for a year or so.

------
arvinjoar
Seems like there's a hidden benefit to always keeping a tab with the hosts
file open in notepad++. I use the hosts file from time to time, and I just
leave the tab open, never thought it could help me out security-wise though.

