

Remind HN: Unicode hacks - olalonde

Just a friendly reminder that some Unicode characters[1] look like spaces and should be taken into account when writing filtering/trimming functions. It's not a big deal, but it is something to keep in mind to prevent things like usernames that are basically a bunch of spaces.

[1] http://www.cs.tut.fi/~jkorpela/chars/spaces.html
======
tptacek
This is a classic web security problem; most famously, WinAPI systems have a
"flattening" function that would convert things like PRIME U+2032 into ASCII
0x27 (the tick that terminates SQL string literals). Database engines can also
interpret character sets differently than the rest of the app stack, leading
to similar problems. UTF-7 cursed WordPress for something like a year in which
multiple preauth SQL injection flaws were discovered.

The answer to these problems is whitelist filtering and neutralization; if a
character isn't known-safe, substitute its HTML entity alternative. If you're
writing blacklist filters that need to know what spaces are, you're already
playing to lose.
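
A minimal sketch of that whitelist-and-neutralize idea, in Python. The `SAFE_CHARS` set and the `neutralize` name are assumptions made up for illustration; what counts as known-safe depends on the application and the output context.

```python
import string

# Assumption: an illustrative whitelist. A real application would choose
# its own set of known-safe characters for its own output context.
SAFE_CHARS = set(string.ascii_letters + string.digits + " .,-_")


def neutralize(text: str) -> str:
    """Keep whitelisted characters; replace everything else with a
    numeric HTML character reference such as &#39;."""
    return "".join(
        ch if ch in SAFE_CHARS else f"&#{ord(ch)};"
        for ch in text
    )


print(neutralize("O'Brien <b>"))
print(neutralize("prime: \u2032"))
```

Because unknown characters are substituted rather than dropped, a browser still renders them as the user typed them, while the raw markup never contains a literal quote or angle bracket.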

~~~
perlgeek
Sorry, whitelisting isn't the answer to SQL injection - bind parameters are.

With bind parameters you can pass data out of band, and the DB engine never
tries to parse it as SQL.
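
As a rough illustration, here is what that looks like with Python's standard `sqlite3` module. The `users` table and the `find_users` helper are hypothetical, made up for this sketch.

```python
import sqlite3


def find_users(conn, name):
    # The "?" placeholder passes `name` as data, separate from the SQL text,
    # so quotes inside it are never parsed as SQL.
    cursor = conn.execute("SELECT name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


# Hypothetical in-memory table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

hostile = "alice' OR '1'='1"
print(find_users(conn, "alice"))  # matches the one row named alice
print(find_users(conn, hostile))  # matches nothing: the quote is just data
```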

~~~
tptacek
I wasn't trying to be prescriptive about SQL injection. But, it always skeeves
me out when people knee-jerk out "parameterized queries" as the answer to SQL
injection. Yes, they're better and safer and you should use them wherever you
can. But they don't "solve" SQL injection; for instance, there are query
fragments that can't be parameterized (which is why you still find SQL
injection in sortable table headers and in pagination and in custom query
builders).

Be careful.
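
One common way to handle the sortable-header case is to map user input onto a fixed set of identifiers before it ever reaches the query string. The sketch below assumes a hypothetical `users` table with `name` and `created` columns; `build_listing_query` is an illustrative name, not an API from any library.

```python
# Assumption: hypothetical column names for a hypothetical "users" table.
SORTABLE_COLUMNS = {"name", "created"}
SORT_DIRECTIONS = {"asc": "ASC", "desc": "DESC"}


def build_listing_query(sort_by: str, direction: str) -> str:
    """Build a paginated listing query. Column names and sort direction
    cannot be bind parameters, so they are checked against fixed sets."""
    if sort_by not in SORTABLE_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by!r}")
    order = SORT_DIRECTIONS.get(direction.lower())
    if order is None:
        raise ValueError(f"unsupported sort direction: {direction!r}")
    # LIMIT and OFFSET are ordinary values, so they stay as placeholders.
    return (
        f"SELECT name, created FROM users "
        f"ORDER BY {sort_by} {order} LIMIT ? OFFSET ?"
    )


print(build_listing_query("name", "desc"))
```

Only strings that already appear in the code can end up in the SQL text; anything else raises before a query is built.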

------
olalonde
Seems like Twitter is "vulnerable" to U+00A0 tweets:
<http://twitter.com/#!/olivierll/status/7852651047817216>

For those who are wondering, you can type Unicode codes directly from your
keyboard (Ubuntu: Ctrl-Shift-u, other OS:
<http://en.wikipedia.org/wiki/Unicode_input>)

~~~
Bootvis
A more innocent trick with unicode and twitter is squeezing extra characters
in a tweet by using unicode ligatures:

[http://en.wikipedia.org/wiki/List_of_precomposed_Latin_chara...](http://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode#Ligatures)

Unfortunately the number of ligatures is small, but it might come in handy.

------
VMG
Interesting - just tested it in python and everything is removed with
str.strip(), _except_ "\ufeff", which also has zero width.

    >>> print("\ufeff#")
    #
    >>> print(len("\ufeff#".strip()))
    2
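
If the goal is to trim such characters as well, one possible approach is to treat Unicode format characters (general category `Cf`, which includes U+FEFF) like whitespace at both ends of the string. This is only a sketch; `is_blank_like` and `strip_invisible` are made-up names, and whether `Cf` is the right cutoff is an assumption that depends on the application.

```python
import unicodedata


def is_blank_like(ch: str) -> bool:
    """True for whitespace and for Unicode format characters (category Cf).
    Assumption: Cf is a reasonable stand-in for 'invisible' here."""
    return ch.isspace() or unicodedata.category(ch) == "Cf"


def strip_invisible(text: str) -> str:
    """Like str.strip(), but also trims leading/trailing format characters."""
    start, end = 0, len(text)
    while start < end and is_blank_like(text[start]):
        start += 1
    while end > start and is_blank_like(text[end - 1]):
        end -= 1
    return text[start:end]


print(len(strip_invisible("\ufeff#")))
print(repr(strip_invisible("\u00a0 \ufeff ")))
```

A username check could then reject any input for which `strip_invisible` returns an empty string.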

------
olalonde
For more details on the potential visual spoofs:
<http://unicode.org/reports/tr36/#visual_spoofing>

------
stwe
‏There are also other unicode hacks like changing ‏ text direction (U+200F)‏.

~~~
stwe
It used to have funny effects on websites (browser name in title bar spelled
backwards), but it doesn't seem to work now. The above comment contains the
unicode character three times.
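
For anyone who wants to check a string for these, here is a small sketch that counts and removes the explicit bidirectional control characters. The helper names are made up for illustration, and the set covers only the explicit directional marks, embeddings, overrides and isolates.

```python
# Explicit bidirectional control characters:
# U+061C, U+200E, U+200F (marks), U+202A..U+202E (embeddings/overrides),
# U+2066..U+2069 (isolates).
BIDI_CONTROLS = (
    {"\u061c", "\u200e", "\u200f"}
    | {chr(cp) for cp in range(0x202A, 0x202F)}
    | {chr(cp) for cp in range(0x2066, 0x206A)}
)


def count_bidi_controls(text: str) -> int:
    """Number of explicit bidi control characters in text."""
    return sum(ch in BIDI_CONTROLS for ch in text)


def remove_bidi_controls(text: str) -> str:
    """Drop explicit bidi control characters, leaving everything else."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


sample = "\u200fchanging \u200f text direction\u200f."
print(count_bidi_controls(sample))
print(remove_bidi_controls(sample))
```

Removing these characters unconditionally can damage legitimate right-to-left text, so flagging them may be a safer default than stripping.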

------
citricsquid
­

~~~
citricsquid
Seems ALT+0173 works here as a "blank" character. I'm not sure of its exact
purpose, but I've never seen it dealt with and often use it as "nothing". The
only solution I've seen to properly sanitising Unicode characters is just to
disable them entirely and print their name.
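
A rough sketch of that "print their name" approach in Python, using the standard `unicodedata` module. The `show_names` helper and the angle-bracket notation are assumptions made up for this example.

```python
import unicodedata


def show_names(text: str) -> str:
    """Keep printable ASCII as-is; show every other character as its
    Unicode name in angle brackets (or its code point if it has no name)."""
    parts = []
    for ch in text:
        if ch.isascii() and ch.isprintable():
            parts.append(ch)
        else:
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            parts.append(f"<{name}>")
    return "".join(parts)


print(show_names("a\u00adb"))
print(show_names("user\u00a0name"))
```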

~~~
alanh
If you break my “typographer’s quotes” by overzealously sanitizing when you
don’t absolutely need to, you’ll end up 6' under a † if you know what I mean.
;-)

