
Is Unicode Safe? - nsmalch
http://www.jefftk.com/p/is-unicode-safe
======
vezzy-fnord
The homoglyph attack is a very old and often abused technique. A related,
though not identical, one is bit flipping, where a single character in a
domain name is swapped and you prey on those who mistype it. It turns out,
however, that it isn't even necessary for someone to make a blatant error like
that...

[https://www.youtube.com/watch?v=ZPbyDSvGasw](https://www.youtube.com/watch?v=ZPbyDSvGasw)
[DEFCON 21: DNS May Be Hazardous To Your Health]

~~~
theboss
I actually purchased a few misspelled variants of domain names for the
purpose of experimentation and ended up with A LOT of traffic.

Because I got so much traffic, I decided to make the sites useful, and I wrote
a little PHP script that downloads the latest technology RSS feeds and
displays the headlines.

On one particular domain I was getting more than 20 hits per day.

------
api
Some of these attacks, such as the exe.doc one, are the fault of the use of
in-band signaling by Windows to indicate that a file is executable. You can't
do that on OSX or Linux, where attributes determine execute capability.

The equivalent domain name issues are a lot tougher, and are going to require
a character lookalike table or some other system of rules to warn the user.

~~~
notJim
I was pretty skeptical of that particular example anyway. If a user is opening
random files from untrusted sources, security has already gone out the window.

~~~
jakub_g
Well on a mass scale you're right, but if you're targeting a particular person
X, you may create a throwaway gmail account impersonating X's friend Y, and
send an innocuous-looking pdf/doc/xls/lolcat gif as an attachment. IMO lots
of people may fall prey to this kind of attack if the email looks legit.

------
_delirium
I knew about the A vs. Α vs. А issue, where visually similar/identical
characters map to different domain names. But I didn't know IDNs also could
map visually _different_ characters to the _same_ domain names. I would've
guessed that full-width characters would be punycoded as well, rather than
treated as their ASCII equivalents. Is this done with any other characters?
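
For what it's worth, the full-width behaviour can be reproduced with a minimal
sketch like the one below. It assumes Python 3, whose built-in `idna` codec
implements the older IDNA 2003 rules (nameprep, which includes NFKC
normalization). The domain names are made-up examples, not taken from the
article.

```python
import unicodedata

# Build the full-width form of "example"
# (U+FF41 is FULLWIDTH LATIN SMALL LETTER A).
fullwidth = "".join(chr(ord(c) - ord("a") + 0xFF41) for c in "example")

# NFKC compatibility normalization folds full-width letters to plain ASCII.
folded = unicodedata.normalize("NFKC", fullwidth)

# The built-in codec (IDNA 2003) applies the same folding, so the label
# comes out as plain ASCII rather than Punycode.
fullwidth_domain = (fullwidth + ".com").encode("idna")

# A Cyrillic "a" lookalike (U+0430) has no ASCII equivalent, so that label
# is Punycode-encoded ("xn--" prefix) instead.
homograph_domain = "\u0430pple.com".encode("idna")

print(folded, fullwidth_domain, homograph_domain)
```

So the two cases go in opposite directions: compatibility characters collapse
onto an existing ASCII name, while lookalikes from another script get a
distinct name.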

~~~
kijeda
Perhaps this is not so surprising. Prior to IDNs, the DNS also did case
folding so "a" and "A" would go to the same place.

One of the particular challenges with IDNs is that there are two versions of
the specification, a deprecated 2003 version and the current 2008 version. For
a few characters they provide subtly different transforms. The 2008 version
also ratchets down on a lot of nonsensical characters; they are no longer
eligible in domain names. The remaining permissible set is quite conservative
to limit some of the issues seen in the original version.
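
A small sketch of one such differing transform, assuming Python 3 (its
built-in `idna` codec follows the 2003 rules; the stdlib has no 2008
implementation, so only the 2003 side is shown). The domain is just an
illustrative name.

```python
# IDNA 2003 (nameprep) case-folds the German sharp s to "ss" before lookup.
name = "stra\u00dfe.de"  # "straße.de"
idna2003 = name.encode("idna")

print(idna2003)
```

Under the 2008 rules the sharp s is a valid character in its own right, so the
same typed name would be Punycode-encoded and could resolve somewhere else.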

~~~
ygra
This has nothing to do with UTF-8. Most of his examples use Punycode, the
filename one is UTF-16.

Don't confuse Unicode with how to represent code points in bytes, please.
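
To make the distinction concrete, here is a minimal Python 3 sketch: one code
point, several byte representations. The helper name `encode_all` is made up
for this example.

```python
def encode_all(text):
    """Return the bytes produced for the same text by several encodings."""
    names = ("utf-8", "utf-16-le", "utf-32-le", "punycode")
    return {name: text.encode(name) for name in names}

# A single code point: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
char = "\u00e9"
encodings = encode_all(char)

for name, data in encodings.items():
    print(f"{name:10} {data!r}")
```

The string is one code point long in every case; only the bytes on the wire
differ.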

------
null_ptr
Why does Unicode have such powerful control characters that can be used to
construct misleading strings? Is there a non-malicious use case for them?

~~~
jahewson
Stacked diacritics are used in Thai and other Asian languages, as well as
rarely-seen languages such as those of the Yukon.

The right-to-left control character is for embedding e.g. Arabic or Hebrew
script inside Latin text (or vice versa). It is actually a controversial
feature of Unicode as some people feel it belongs in a higher-level protocol.

Check out the examples here
[http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=...](http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=CmplxRndExamples)
for an idea of what rendering non-Western scripts can entail.

~~~
BitMastro
Thanks for the link, it's very interesting!

------
ddebernardy
Asking if utf-8 is safe on the basis of these examples seems like asking if we
should throw the baby out with the bath water -- along with the toys and the
tub for good measure.

The potential for abuse is evident, but it seems like these primarily ought to
be fixed in userland. For instance, by giving cues by highlighting characters
in widely different areas (latin vs cyrillic) or by ignoring rtl for
extensions when a string starts in ltr.

(Not to mention, in the latter case, if users are opening random docs attached
to spammy emails, utf8 is the last of your problems.)

------
twoodfin
I don't think the problem is Unicode, the problem is trusting your ability to
determine the ownership of a URL (and thus the trust that should be inherited
from its owner) based on its name. Plenty of phishing attacks work with
domains like "yahoo-password-reset.com".

If you're not seeing a valid TLS session with a certificate signed by an
issuer you trust not to allow these shenanigans, it really doesn't matter what
characters you're seeing in the URL bar.

------
rkangel
To me, we're in a Unicode transition period - 10 years ago it was almost
completely unsupported, and as it is adopted more and more, we're finding the
places it can cause issues.

Part of the problem is that a lot of the languages and tools we use predate
widespread use of Unicode and don't handle it very well. The Python 3 approach
is by far the best one I've come across (would be interested to hear of other
examples), and they needed to make a backwards incompatible change to handle
it in a way that made it harder to screw up.

It is a complex technology, and inevitably there are going to be holes, but as
in a lot of other cases, it is worth it (necessary, even), and as we move
forward our tooling, languages, libraries and practices will get better and
reduce the risk. The internet is a complex technology that can never be
completely secure. Doesn't mean it's not worth it though.

------
lyndonh
I was reading this and I'm like, Unicode (I assume UTF-8) isn't really that
complicated at all. The UTF-8 system is straightforward, no more complex than
simple run length coding. I'm also thinking that Unicode is basically a list
of glyphs in every language plus a few control codes for rendering glyphs
correctly, BOM, etc.

It's like saying a dictionary contains dangerous information.

I think the problem is software that enables Unicode input but is not willing
to handle all the different types of input. For example, it seems like a bad
idea to even let people input combined words of different languages (that's
why we have input methods that filter out bad combinations), or to dump this
on the font renderer without making sure the difference is highlighted.

~~~
rkangel
The article here relates to Unicode in general rather than any of its specific
encodings (the author themselves gets confused by this in the first
paragraph). This is quite a good article that explains the difference
(particularly of use to Python users):
[http://nedbatchelder.com/text/unipain.html](http://nedbatchelder.com/text/unipain.html)

~~~
lyndonh
Well UTF-16 and UTF-32 are less complex than UTF-8. My point is that codings
aside, it's just a big table of characters plus some command codes.

------
asdfaoeu
Safari users check out:

[http://‮moc.lapyap.m‭d2.shptech.com](http://‮moc.lapyap.m‭d2.shptech.com)

(copy and paste link)

------
wila
These unicode attacks are interesting and unicode is far too useful to stop
using. The question is what can we do to fix some of these issues?
Like the RTL character. It shouldn't be blocked as it has a valid use case,
but is there a non-malicious use case for it when surrounded by normal Latin
characters? eg: abc[RTL]def

If it's just one RTL character then that should be fairly easy to filter out.
Of course if that's the way a filter works then there will be other unicode
characters you can add to the mix and still make it look the same for an
average user and pass that particular filter.

One could identify unicode characters that belong to a particular character
set (say Latin) and see if some text contains more than one character set. Then
invoke the filter if a text has more than 2 different character sets. Of course
I can see that getting in the way of some use cases as well (text with
translations in 3 languages for example)
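
A rough sketch of that character-set count in Python 3. The stdlib does not
expose the Unicode Script property, so this assumes a heuristic: the first
word of each letter's Unicode name is treated as its script. The function
names and the `limit` parameter are made up for illustration.

```python
import unicodedata

def scripts_used(text):
    """Return the set of (heuristically guessed) scripts used by letters."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, spaces
        name = unicodedata.name(ch, "")
        if name:
            # Heuristic only: "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
            scripts.add(name.split()[0])
    return scripts

def looks_mixed(text, limit=1):
    """True if the text uses more than `limit` different scripts."""
    return len(scripts_used(text)) > limit

print(scripts_used("paypal"))        # all Latin
print(scripts_used("p\u0430ypal"))   # second letter is Cyrillic U+0430
```

A real implementation would want the actual Script property and per-label
checks for domain names, and as noted it would still trip on legitimate
multilingual text.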

------
gleenn
Some of those attacks are just awesome. Pretty scary as a web programmer.

------
TazeTSchnitzel
U+202E, the right-to-left override, is endless fun on web forums. Use it once
and all the rest of the page will be flipped.
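
For forum software that wants to guard against this, a minimal Python 3 sketch
could look like the following. It assumes that flagging or stripping every
explicit bidi formatting character is acceptable for the site, which would not
be true where right-to-left text is legitimately mixed in. The function names
are hypothetical.

```python
import unicodedata

# Bidi classes of the explicit embedding/override/isolate controls
# (U+202A..U+202E and U+2066..U+2069).
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF",
                        "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(text):
    """Return (index, code point) pairs for each bidi control in text."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES]

def strip_bidi_controls(text):
    """Return text with all explicit bidi controls removed."""
    return "".join(ch for ch in text
                   if unicodedata.bidirectional(ch) not in BIDI_CONTROL_CLASSES)

comment = "hello \u202ebackward world"
print(find_bidi_controls(comment))
print(strip_bidi_controls(comment))
```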

~~~
dspeyer
let's test this

hello ‮backward world

~~~
dspeyer
more text

------
hsmyers
While I appreciated the article, another one of his caught my eye and I
thoroughly enjoyed it:
[http://www.jefftk.com/p/teach-yourself-any-instrument](http://www.jefftk.com/p/teach-yourself-any-instrument)

------
ttflee
This reminded me of an interview with an adware author, in which he told a
story about creating unwritable registry keys and file names 'by exploiting an
“impedance mismatch” between the Win32 API and the NT API':

[http://philosecurity.org/2009/01/12/interview-with-an-adware...](http://philosecurity.org/2009/01/12/interview-with-an-adware-author)

The adware registered a key in the Windows Registry with a Unicode NULL
character in the middle of the string so that the UI of Windows failed to
display or modify that string.

~~~
erichocean
In my experience, NULL is the least-supported UTF-8 character. Whenever
software claims conformance, that's the first thing I check.

Personally, I'd have preferred they disallowed it in the standard, but it's
too late for that now. Anyone know why it was included (other than the obvious
reason that it obeys the encoding rules)?
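
A quick Python 3 sketch of why NULL tends to break things: UTF-8 encodes
U+0000 as a single zero byte, which any NUL-terminated C string API treats as
the end of the string. The sample text is made up.

```python
import ctypes

text = "key\u0000hidden"            # U+0000 in the middle of the string
encoded = text.encode("utf-8")      # NULL becomes the single byte 0x00

# A C-style buffer: .raw is every byte, .value stops at the first NUL,
# which is what NUL-terminated string functions would see.
buf = ctypes.create_string_buffer(encoded)
c_view = buf.value

print(len(text), encoded, c_view)
```

Anything after the zero byte is invisible to code that works on `c_view`,
while length-counted APIs still see the whole thing.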

------
rcthompson
Does Windows really display that "exe.doc" RTL example with the icon for a
Word Document? Or is the exe file just set to use that for its icon in order
to complete the illusion?

~~~
zxcdw
The file browser just looks at the extension, rather than the file header, so
yes.

This can be tested by creating a dummy .txt file and changing its extension
-- the icon changes although file contents remain the same.

~~~
groby_b
The extension is still .exe, though. It only _looks_ like .doc after font
rendering. And so, as the OP suspected, the .exe needs to have the proper icon
embedded. It's not provided by the file browser.
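
A minimal Python 3 sketch of the point, using a made-up file name: the stored
name ends in `.exe`, and only the rendering of the part after U+202E is
reversed.

```python
import os.path

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

# Stored name: "report", then the override, then "cod.exe".
# A bidi-aware renderer shows the tail reversed, i.e. "reportexe.doc".
name = "report" + RLO + "cod.exe"

stem, ext = os.path.splitext(name)

# Escaping non-ASCII characters makes the hidden control visible.
escaped = name.encode("unicode_escape").decode("ascii")

print(ext)
print(escaped)
```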

~~~
rcthompson
So it seems like one important thing that Windows (or any OS) should do is to
not display custom icons for untrusted executables.

------
im3w1l
𝐈𝐦𝐩𝐨𝐫𝐭𝐚𝐧𝐭 𝐀𝐧𝐧𝐨𝐮𝐧𝐜𝐞𝐦𝐞𝐧𝐭

Is another unicode classic

~~~
getdavidhiggins
Handy tool for doing this type of stuff:
[http://unicod.es/](http://unicod.es/)

------
robin_reala
If anyone’s writing a library to deal with homoglyph attacks I recommend
Unicode’s list of ‘confusables’ as a data source:
[http://www.unicode.org/Public/security/revision-06/confusabl...](http://www.unicode.org/Public/security/revision-06/confusables.txt)
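
A minimal sketch of using that data in Python 3, assuming the
`source ; target ; type # comment` line layout of confusables.txt. The three
sample lines are hand-written in that layout for illustration (not copied
verbatim from the file), and `parse_confusables` / `skeleton` are hypothetical
helper names. The real skeleton algorithm also normalizes the text, which is
skipped here.

```python
SAMPLE_LINES = [
    "# illustrative entries only",
    "0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C",
    "043E ; 006F ; MA # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O",
]

def hex_seq_to_str(field):
    """Convert a space-separated list of hex code points to a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def parse_confusables(lines):
    """Build a {confusable character: prototype string} mapping."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        table[hex_seq_to_str(fields[0])] = hex_seq_to_str(fields[1])
    return table

def skeleton(text, table):
    """Simplified skeleton: replace each character by its prototype."""
    return "".join(table.get(ch, ch) for ch in text)

TABLE = parse_confusables(SAMPLE_LINES)
print(skeleton("\u0441\u043e\u043el", TABLE))  # Cyrillic es, o, o + Latin l
```

Two strings are then treated as confusable when their skeletons are equal.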

------
nnnnni
I like and appreciate the fact that the Spotify people used the one guy's
findings to improve security (for everything that uses Twisted!) rather than
just throwing the legal system at him.

