The wide variety of exploits like these suggests that we need to integrate character-spoofing detection into the general malware detection system on devices, one that evolves over time (the way virus checkers evolve, with lots of human input) to deal with known or anticipated problems.
I'm thinking of a system that combines aspects of virus checking, malware detection, Bayesian spam filtering, and spell checking.
A Unicode system can be supplied with tables of characters that are easily mistaken (visually) for one another. These tables, combined with dictionaries, could spot words that visually resemble dictionary entries but are not spelled the ordinary way. This approach could even catch things that have been problems for years in pure ASCII: confusion of 0 and O, of l and 1 and I, of rn and m, etc. It could also spot insertions of non-visible characters into such things as URLs and filenames.
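A minimal sketch of what such a table-plus-dictionary check could look like (the confusable pairs and word list here are tiny placeholders, not real data):

    # Toy confusable table: map each lookalike to its "canonical" character.
    CONFUSABLES = {
        "\u043e": "o",   # Cyrillic small o -> Latin o
        "\u0430": "a",   # Cyrillic small a -> Latin a
        "1": "l",
        "0": "O",
    }

    DICTIONARY = {"hosts", "paypal", "google"}

    def skeleton(word):
        """Map every character to its canonical lookalike, if it has one."""
        return "".join(CONFUSABLES.get(ch, ch) for ch in word)

    def looks_spoofed(word):
        """Flag words not in the dictionary whose lookalike skeleton is."""
        return word not in DICTIONARY and skeleton(word) in DICTIONARY

    print(looks_spoofed("h\u043ests"))   # True: 'hosts' with a Cyrillic о
    print(looks_spoofed("hosts"))        # False: the genuine word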
Such a system would be able to spot .exe files whose names are written in such a way that the .exe extension is not visually displayed at the end of the name. If you double-clicked such a file for the first time, it could ask you whether you realized that it is a program you are about to run and not a ".jpg" as the name might suggest. In fact, it could ask you about any file whose real extension and apparent visual extension differ.
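A rough sketch of that real-versus-apparent extension check, assuming the third-party python-bidi package for the visual reordering (the filename is made up):

    # Compare the extension the OS acts on with the extension a user *sees*
    # after bidi reordering. Sketch only; assumes the python-bidi package.
    import os
    from bidi.algorithm import get_display   # pip install python-bidi

    def real_extension(name):
        """Extension the OS actually uses when opening the file."""
        return os.path.splitext(name)[1].lower()

    def apparent_extension(name):
        """Extension of the name as it is visually displayed."""
        return os.path.splitext(get_display(name))[1].lower()

    name = "party\u202egpj.exe"               # displays as "partyexe.jpg"
    if real_extension(name) != apparent_extension(name):
        print("This is really a", real_extension(name), "file,",
              "not the", apparent_extension(name), "it appears to be.")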
Some problems will still sneak through, just as today you can phish people with subtle misspellings that require nothing more than ASCII.
But making this a part of the system's evolving general malware detection system, with human-created tables and heuristics borrowed from malware detectors, spam filters, and spell checkers, is the best solution, IMO.
I would augment the human-generated tables of confusable characters with something like OCR run on each font to detect similarly-shaped characters. The algorithm could provide a score indicating how similar any two characters are (or maybe how similar a given character is to all other characters, combined with statistical frequency of that character), which could be weighted and incorporated in a malware detection heuristic.
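Something along these lines, maybe - a crude pixel-overlap score using Pillow; the font path is an assumption, so point it at whatever font your UI actually renders with:

    # Render two characters in the same font and score how alike the bitmaps
    # are. Sketch only: a real system would normalize position, size,
    # anti-aliasing, and so on.
    from PIL import Image, ImageDraw, ImageFont

    FONT = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 48)

    def render(ch, size=(64, 64)):
        img = Image.new("L", size, color=255)            # white grayscale canvas
        ImageDraw.Draw(img).text((8, 8), ch, font=FONT, fill=0)
        return img

    def similarity(a, b):
        """1.0 means pixel-identical renderings; lower means more distinct."""
        pa = list(render(a).getdata())
        pb = list(render(b).getdata())
        return sum(1 for x, y in zip(pa, pb) if x == y) / len(pa)

    print(similarity("o", "\u043e"))   # Latin o vs. Cyrillic о: close to 1.0
    print(similarity("o", "x"))        # visibly different: noticeably lower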
How about the OS adopting the convention that any codes outside of a few trusted (expected) alphabets get displayed in a way that makes it obvious to a human that they aren't what they look like (eg, a bright red border or something).
> How about the OS adopting the convention that any codes outside of a few trusted (expected) alphabets get displayed in a way that makes it obvious to a human that they aren't what they look like (eg, a bright red border or something).
AIUI, there are two major reasons this wasn't done in the first place, and why more complex solutions are necessary:
1. Those few trusted alphabets would probably include Greek, Cyrillic, and Latin, all of which have similar or identical characters with different Unicode code points.
2. The goal of Unicode support, localized domain names, etc. is for software to be equally easy to use for all languages, rather than to favor some languages over others.
That said, it might be advantageous to have a locale-specific approach, so that characters not used by the current language will be highlighted. But, that could be seen as hindering the ability of sites in one region to reach users in another region, doesn't work well for text that includes multiple languages, and malware writers will probably find a way to mark their characters as expected anyway.
Edit: also, the two words "get displayed" paper over a vast amount of complexity in the way operating systems and applications display text. It would probably be just as much work as any of the other solutions proposed.
> such a way that the .exe extension was not visually displayed at the end of the name. If you double-clicked such a file the first time, it could ask you if you realized that it is a program you are about to run and not a ".jpg"
Suggestions like this should get you a professional penalty, like a yellow card.
"Saw a specific problem, suggested an Are-you-sure dialog for this specific case, on top of that, one most people can't answer".
Wow, clicking your link (the google search for site:news.ycombinator.com) sure was scary, but since I kinda knew what it was about, I didn't freak out, and simply changed the encoding of the website to some ASCII code page, revealing the illusion.
HOWEVER, what I find extremely worrying is that the URL in the URL bar remained the same all the time, with the embedded query string site%3Anеws.ycombinator.com, and IT DIDN'T CHANGE TO ASCII when I changed the encoding of the web page. Haven't they/we learned yet that this is a very serious security issue?
They've gone back and forth on this. It is a serious security issue, but for people who regularly use non-ascii characters, it's a usability issue. A blacklist might be a usable compromise, but I bet it would feel terribly confusing - in part because of font variability.
I'm pretty shocked that I have never heard of the RLO unicode character before this article. Let's see if it works: ppa.emorhCelgooG => ppa.emorhCelgooG
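For anyone who wants to reproduce it locally, this is roughly all it takes (how it renders depends on your terminal and font):

    # U+202E (RIGHT-TO-LEFT OVERRIDE) forces everything after it to display
    # right-to-left, so the reversed text reads "forwards" again on screen.
    name = "\u202eppa.emorhCelgooG"
    print(name)         # many renderers show this as "GoogleChrome.app"
    print(repr(name))   # repr exposes the hidden \u202e that print conceals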
I discovered this as I was writing a paranoid HTML cleanup library and wanted to prevent the attack where a user sticks a text-direction-change character into the page and reverses the whole thing. As we've all just witnessed, that can't happen in a conforming browser.
But when viewing the page as a text stream, yup, it reverses and then never really unsticks. Everything's working as designed!
(Maybe my library should still restore the page flow after all... I never thought of how it could mess up view source. As attacks go, it's weak sauce... but like I said, it's meant to be really, really paranoid.)
Even more intriguing. What exactly were you writing in Haskell that needed a super-paranoid HTML sanitizer? Yet another web server/blog/CMS? Or something way cooler?
Yet another blog, except not targeted for release or anything, just to run my own site, replacing the Django blog that currently runs it. It's sort of my entertaining diversion, you know? Working with my own fresh, clean code base where I can try out ideas without having to carry around a couple of man-centuries' worth of legacy code at every step, like I do at work. The cleanup library doubles as my HTML formatter, too, doing things like making sure italic tags get closed. The paranoia is half real, half fun exercise.
Good point. It is probably the right thing to do technically, since the source view can't rely on parsing the markup. I think I'd prefer such non-printable characters to be simply escaped though.
This is an interesting page from 2006: http://digitalpbk.blogspot.com/2006/11/fun-with-unicode-and-... and seems to demonstrate to me that Firefox, Opera, and Internet Explorer all will eagerly display the RLO character, and Chrome and Safari will not.
Are Chrome and Safari broken, or are they being responsible? (Are there settings to change the behavior in any of these browsers...?)
It behaves funny in a very "simple" way. Selections have this little problem where you move your mouse over the visual representation but the selection is in the logical representation.
So say your logical text is this:
ltr LTR.
where the capital letters are RTL chars (whether because they're actually RTL or because of an RLO in the char stream). That is, the above represents the reading order. Visually this would look like this:
ltr RTL.
assuming that the paragraph directionality is left-to-right; a reader reading this text would read the letters in the order 'l', 't', 'r', ' ', then 'L', 'T', 'R' (which, if you note, matches the logical order, hence the name).
Now say you mouse down between the 't' and the 'r' and drag right until your mouse is between the 'R' and the 'T'. Those are your selection endpoints. But the selection happens on the _logical_ text, so what's selected is the 'r', the space, the 'L', and the 'T' (and even in that order). You can see that in the example above; if you mouse down between the two 'o' chars in "elgooG" and drag right to between the two 'o' chars in "Google", you get exactly this sort of behavior.
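If it helps to see the two orders side by side, here's a small sketch, assuming the third-party python-bidi package; the RTL run is Hebrew just to have something concrete:

    # Logical (stored/reading) order vs. visual (display) order.
    from bidi.algorithm import get_display   # pip install python-bidi

    logical = "ltr \u05d0\u05d1\u05d2."      # "ltr" then three Hebrew letters
    visual = get_display(logical)            # order the glyphs appear on screen

    print(list(logical))   # selection endpoints index into this sequence...
    print(list(visual))    # ...but your mouse moves over this one.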
Ah, that makes sense. I didn't think that text selection defines end points and everything between the endpoints (in the physical text) is highlighted. So, if you have:
Hi RLO elgooG
^ select ^
I expected (everything between the logical end points selected):
Hi Google
^^^^^^^ (selected)
In other words, I expected text selection to obey the RLO character as well.
Nope - selection is on logical text, not on visual text. For tons of "fun" reasons. If you want an afternoon of madness, read the Chrome or WebKit sources relating to BiDi text.
I think you're confusing your logical and physical, or I'm misunderstanding you.
Moving your mouse happens over the physical text and sets the selection endpoints. Then everything that's logically (as opposed to physically) between those endpoints is highlighted. Your "I expected" diagram is showing the text physically between the endpoints, not the text logically between them...
This is really dangerous. Luckily the status bar reveals the real URL. However it's still dangerous when you use a visually trusted address like Gmail:
I personally find the RLO / LRO issue much more concerning. I just tested Chrome and Firefox and found it works in URLs. You could rewrite pyapla.com to paypal.com and phish people easily.
I've done a bit more testing and found that while it can be used in URLs it's problematic. When pasted in the navigation bar it causes errors in both Chrome and Firefox. There's probably a way to exploit this and make it work but I don't have time to dig into it right now.
Absolutely! I wouldn't call myself "unwary", but I would totally click a file with a .jpg extension! (The article says 'Unwary people treat this file as a picture'.) EXE is especially dangerous because you can make "SEXe.jpg". Who wouldn't click that?
But I guess from now on I'll look at the chars to the left of the dot as well...
It's funny to think that a hapless vimmer who happens to be running Windows would have never noticed this, because they would simply have typed ":edit $SYSTEMROOT\system32\drivers\etc\hosts" and gotten the real file.
(This isn't a "look how cool command line junkies are" comment; I was just musing.)
On sane systems, tab completion refuses to complete if there is more than one potential completion. Unfortunately, I seem to remember that Windows does some silly thing where it cycles through the possibilities…
I remember years back on Wikipedia, clever vandals would play Unicode tricks. It was interesting, to say the least - you'd register a name that looks identical to a real user, vandalize, and hope the administrator would type the name in...
Although it's not vandalism, something else that has permeated Wikipedia is the use of the Cyrillic ya (Я) in place of R, where the stylized artwork for the article subject reverses the R (and varies it in many other ways, for sure).
The difference between the two is that this phenomenon is not wholly in the past.
One of the biggest thorns in the situation is when editors bring up an official or semi-official website related to the subject that uses Я, pointing to its existence as "proof". No, that isn't proof; whoever is managing that area of the web properties is just a jackass.
Somewhat related to this is the ability to swap "l" and "I" around when they both look the same, basically a straight line.
This was very common in Yahoo Chat Rooms, when folks would pretend to be someone else by registering that person's name with the lookalike letter swapped (assuming it had an "i" or "l" in it).
They would then take a screen shot of their font and copy that exactly so they could appear to be the other person. I'll let you imagine the chaos that could occur because of this!
Somebody got me with that exact trick during the Charlie Sheen debacle: I was going back and forth between the Twitter page for @CharlieSheen and @CharIieSheen and couldn't figure out how this was possible… I didn't feel exactly smart when I realized what was going on.
edit: on a related note, I half-jokingly tend to read RockMelt as rock-me-it…
I have a friend whose last name is McIntire, and his handle everywhere was xmcintire. I used to have a lot of fun changing my username to xrncintire - exploiting keming to impersonate him on any non-fixed-width-font system.
(name changed to protect him, he was pretty peeved that I did this - though I think he registered a few places as pavellishir to exploit the r/n similarity, too :P)
On my keyboard, it's '-a, with ' in a place that's easy to hit accidentally, and since it's combining, it doesn't print. I suspect some layouts might have OPTION-a, which would be almost as easy a mistake to make.
Yes. I think this is part of it. Here are the logs from a few weeks ago: http://j.mp/pBqfbx I'd be curious if anyone can figure out why so many visitors are using a browser called Netfront (apparently from Samsung mobiles).
On my BlackBerry, inadvertently swiping the trackpad while typing a vowel results in an accented version of the vowel. There might be a similar mechanism at work in Samsung devices.
I see. What is interesting is that most of the requests come from Samsung devices in Spanish-speaking countries, so I guess their keyboards must make it even easier to make the mistake.
This is not a canonicalization attack. Those attacks are based on there being multiple ways to encode the same unicode codepoint in utf8. A utf8 decoder should reject portions of utf8 streams that don't use the shortest possible encoding, but not all do. If there are multiple ways to encode '<', then an xss prevention filter is going to have trouble.
The attack described here is simpler: two unicode codepoints, roman 'o' and cyrillic 'o', usually look identical. So by substituting cyrillic we can make a file called 'hosts' that the operating system won't pay attention to. This is the same problem with punycode internationalized domain names, where paypal.com might be spelled with a cyrillic 'a' and mislead people. The fix for domain names was to restrict what unicode you could use where. I'm not sure what the fix is here, aside from always showing hidden files.
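To make that concrete, the two names really are different strings even though most fonts draw them identically:

    real  = "hosts"
    spoof = "h\u043ests"    # CYRILLIC SMALL LETTER O in place of the Latin o

    print(real == spoof)                   # False
    print([hex(ord(c)) for c in real])     # the Latin o is 0x6f
    print([hex(ord(c)) for c in spoof])    # the Cyrillic о is 0x43e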
> A utf8 decoder should reject portions of utf8 streams that don't use the shortest possible encoding
So you would say that there should be no file names using the Cyrillic o? So if a Russian-speaking person wants to save a file, that file name should be rejected? Or translated into a mish-mash of Cyrillic and Roman characters?
How will that work if that filename is reused on a system on which the default font doesn't contain the roman characters (I'm sure such a thing exists) and thus font substitution needs to happen?
The fix definitely isn't this easy. Maybe one could disallow homoglyphs of a different language than the one dominating the current file name. But this might be a lot of work and I doubt it's fool-proof.
> A utf8 decoder should reject portions of utf8 streams that don't use the shortest possible encoding

> so you would say that there should be no file names using the cyrillic o?
I'm sorry, I was unclear. I should have said "don't use the shortest possible encoding for a code point". Cyrillic 'o' is code point U+043E while roman 'o' is code point U+006F. The canonicalization attack relies on overly liberal utf8 decoders that would allow multiple binary streams to be interpreted as, say, code point U+006F.
This looks like the canonicalization attack, but is a different problem, one that is not solved by fixing decoders.
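For what it's worth, here's what a conforming decoder does with an overlong sequence - 0xC0 0xAF is a forbidden two-byte way of writing '/', and Python's decoder rejects it:

    print(b"\x2f".decode("utf-8"))       # '/' -- the only valid encoding
    try:
        b"\xc0\xaf".decode("utf-8")      # overlong encoding of the same '/'
    except UnicodeDecodeError as err:
        print("rejected, as it should be:", err)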
That's not the point. The attack relies on some nonconforming decoders exhibiting a many-to-one mapping of bitstreams to codepoints, changing the semantics of the bitstream.
You shouldn't restrict anything - just give users enough hints to be better informed.
I would highlight the background of any character that is not from the user's codepage in red.
E.g., if your locale settings are en-US, any character not from that codepage would get a red background (or be shown in italics - some way to signify that the character is 'foreign').
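A terminal-flavored sketch of what I mean, assuming an en-US user and treating anything outside printable ASCII as 'foreign':

    # Wrap any non-ASCII character in an ANSI red-background escape sequence.
    RED_BG, RESET = "\x1b[41m", "\x1b[0m"

    def highlight_foreign(text):
        out = []
        for ch in text:
            if 0x20 <= ord(ch) < 0x7f:          # plain printable ASCII
                out.append(ch)
            else:                               # everything else is 'foreign'
                out.append(RED_BG + ch + RESET)
        return "".join(out)

    print(highlight_foreign("h\u043ests"))      # the Cyrillic о lights up red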
This is not a canonicalization attack, although it is similar. The attacks described used semantically different strings that just happened to look like what the user was expecting.
I believe it's called a "homograph attack", or at least that's one name for it. Wikipedia has an article on the internationalized-domain-name version of the attack, and the various attempts by registrars and browsers to mitigate it: http://en.wikipedia.org/wiki/IDN_homograph_attack
A while ago I compiled a list of unicode characters that looked like letters, to get past curse filters. Not comprehensive, because I just manually skimmed through a unicode table, but here it is if anyone cares:
nodata told you. Highlight characters not expected in my locale. My locale right now is en_US.utf-8. In security-sensitive contexts like file names and domain names, it's not actually that hard to figure out which characters are a surprise for me. And if I have a file that is named entirely in non-ASCII-subset characters... which I do... then light them all up. It's OK. It won't be hard for users to figure out what's going on, even if only subconsciously.
Each code point has a Script property, and mixed scripts inside strings that aren't likely to legitimately contain multiple scripts (like filenames) are a sign of trouble.
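A crude way to check for that in Python, using the character names from unicodedata as a stand-in for the real Script property (a proper implementation would use the Unicode Scripts data or the third-party regex module's \p{Script=...} classes):

    import unicodedata

    def rough_script(ch):
        if ch.isascii():
            return "LATIN"
        # First word of the name, e.g. "CYRILLIC SMALL LETTER O" -> "CYRILLIC"
        return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

    def mixed_scripts(s):
        scripts = {rough_script(ch) for ch in s if ch.isalpha()}
        return len(scripts) > 1

    print(mixed_scripts("hosts"))        # False
    print(mixed_scripts("h\u043ests"))   # True: Latin mixed with Cyrillic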
But do the default Windows ones? I think (hope?) that the hosts file is named hosts on every system, no matter which country or locale you selected - instead of, say, хозяин.
Perhaps limit system folders and files to ascii-only. Doesn't solve any of the picjpg.exe issues, but it's a start.
This is not a perfect solution: because the RLO character isn't printed, there is nothing to highlight. There are several other non-printing characters in Unicode that I'm aware of, like the zero-width space.
I wonder if it would suffice to highlight characters not found in the primary font. I don't know exactly how font fallback works, but I doubt most fonts venture outside their locales.
Fonts increasingly venture outside of locales. It's very common for one font to have all western-European characters, or even all European (including Cyrillic) characters. This is especially true for the default system fonts.
CJK fonts often include not only CJK, but European characters as well. Almost always at least ASCII.
Ultimately, having a single font for all characters is desirable: Having to go track down more fonts because you're seeing � in your text is a pretty bad experience. Substituting other fonts is at best a kluge, as it often looks terrible.
Projects like DejaVu, which plan to eventually cover all living scripts (http://dejavu-fonts.org/wiki/Plans), are not only a good thing but are also making substantial progress.
Also, even in ASCII there are a bunch of confusable characters (all depending on the font, of course): I (eye), l (ell), 1 (one), | (vertical bar); O (oh) and 0 (zero); {} (braces) and () (parentheses); 5 (five), S (ess), and $ (dollar); rn (r-n) and m; vv (v-v) and w; etc.
Wе nееԁ tօ fⅰnⅾ Ьеttеr ѕоⅼυtions tҺаɳ vіѕυаⅼⅼУ dіѕtіɳgυіѕҺing сҺаrаⅽtеrѕ.
This is an important issue in some chat programs. I had to deal with it all the time: malicious users using i vs. l to pose as others, and using Unicode to mess up or reverse the entire chat. One of the more interesting Unicode tricks had characters going left, right, up, and down. This confused moderators about whom to kick/ban and obscured other users' text.
The solution was to implement a regex of whitelisted characters; since it's an English-only program, this works well and is future-safe. For multiple languages, a blacklist is probably okay, but the difficulty lies in keeping the blacklist both complete and up to date.
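For an English-only system the whitelist really can be that small - something like:

    import re

    # Reject any name containing characters outside this small allowed set.
    ALLOWED = re.compile(r"^[A-Za-z0-9 _.\-]+$")

    def valid_name(name):
        return bool(ALLOWED.match(name))

    print(valid_name("alice_92"))      # True
    print(valid_name("\u0430lice"))    # False: leading Cyrillic а
    print(valid_name("bob\u202e"))     # False: hidden RLO character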
It seems like it would be easy to implement a blacklist that updates automatically based on the current crop of registered names: apply filters to each registered name to generate its lookalike names, all of which become unavailable for use. This should allow any good-faith user to register any name they want without causing confusion.
I'm wondering if there's a way to make it more obvious that doesn't require running od. It should be immediately apparent any time there's a file with a whacky name, not something you find out two minutes into investigating a compromise.
Yes - reminds me of how several users would exploit the Bolt.com chat system (back in the day) using upper-case 'I's as lower case 'L's to pose as different users and cause mayhem.
In the old days, I once tried to put a backtick ` in my username in Counter-Strike as a way to prevent admins from kicking me :P (the backtick toggles the console, and kicking requires typing 'kick <name>' from the console).
I was kicked within 2 seconds of joining the server.
In systems that allow spaces in the middle of your username, there are some that do the same with the various unicode variations on white space. That's confusing even to people who know what to look for.
I have thought someone should compile a mapping of all the visually similar characters in Unicode to one canonical character. Then we can all include a check for potential impersonation in account-creation code.
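Unicode's confusables data (UTS #39) is essentially that mapping; a toy version of the account-creation check might look like this, with a placeholder mapping standing in for the real table:

    # Collapse each name to a 'skeleton' and refuse registrations whose
    # skeleton collides with an existing account's.
    CANONICAL = {
        "\u0430": "a",   # Cyrillic а -> Latin a
        "\u043e": "o",   # Cyrillic о -> Latin o
        "I": "l",        # capital I posing as lowercase l
        "1": "l",
        "0": "o",
        "rn": "m",
    }

    def skeleton(name):
        for lookalike, canonical in CANONICAL.items():
            name = name.replace(lookalike, canonical)
        return name.lower()

    taken_skeletons = {skeleton("CharlieSheen")}

    def can_register(name):
        return skeleton(name) not in taken_skeletons

    print(can_register("CharIieSheen"))   # False: the capital I collides
    print(can_register("SomebodyElse"))   # True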
Isn't the real issue that in order to be vulnerable to this, you have to be running as a user who has permission to diddle with the hosts file? Or that your hosts file has too-liberal write permissions?
Hosts file attacks are well-known enough that on windows I always set them to read-only, so that even administrators can't change them without first clearing the read-only flag.
With the exe-as-jpg example, it'd be even more misleading if the exe used a photo for an icon, launched the photo viewer app for a matching jpg photo, and launched an insidious process in the background. Even harder to detect where the malware came from.
But, how does this work? Does Windows source all of the files in your %SystemRoot%\system32\drivers\etc? Why does it matter what the file is named? To hide from idiots?
Windows is loading the real "hosts" file. That one is hidden, and there's a non-hidden lookalike "hosts" file (spelled with a different character) sitting there as well.
It seems like it would only work for hiding from people casually checking. Personally I'd open the file by typing the path myself, so I'd end up finding the trojan's file.
The same would be true for any automated anti-spyware tool.
So yes, this looks like it would only affect a very limited number of people - technical enough to check the hosts file, but naive enough to do it manually and not notice the other hidden file.
Since Windows doesn't show hidden files by default, I must say that most of my engineer colleagues would fall for the trick. Sadly, most of the people I've worked with open the files manually.
I find that disturbing; one of the first things I do after a clean install of Windows is to check "Show Hidden Files", which was hard to find in Windows 7 :/ I thought most tech-savvy people did that?
I've been using Windows 7 at work for about a year, and I had absolutely no idea that the menu would show up when I hit the Alt key until exactly 2 days ago. Every time I needed something from it I'd just rummage around all the visible menus, not find it, curse my head off and open up the command prompt.
I'm not sure which will look dumber as a consequence of this post: me or the Windows 7 UI.
It's due to the hidden menu - same drill with Office 2010 and finding "Save As..", etc. I've never been much for just using the keyboard; I like using the mouse to navigate the menus ;)
Like any other hidden file, it's not shown until you choose to show hidden files; at that point it appears, but with a semi-transparent icon.
The hosts file is very important to Windows because it maps domains to IPs. So I could actually point google.com in my hosts file at the IP of yahoo.com and send myself there. The key observation here is that the REAL hosts file was hidden, and a hosts file that only looked like a hosts file (with a different kind of o) was the non-hidden one.
The operating system sees the hosts file as normal. A person trying to debug it through the gui (with hidden files hidden) sees the dummy hOsts file and thinks that's not where the problem is.
Seems like there's a hidden benefit to always keeping a tab with the hosts file open in notepad++. I use the hosts file from time to time, and I just leave the tab open, never thought it could help me out security-wise though.
http://giorgiosironi.blogspot.com/2010/08/google-never-remov...
I used this to prank some people on the in-house SEO team at my last job. I'd ask them if they had done anything that might be considered black-hat. Then I sent them a link to a "site:" query on Google indicating that our site had been removed from the index.
e.g. http://www.google.com/search?sourceid=chrome&ie=UTF-8...