

XSS war: a powerful Java HTML sanitizer - robicch
http://roberto.open-lab.com/2009/11/05/a-java-html-sanitizer-also-against-xss/

======
simonw
On further inspection, I don't trust your implementation at all. You're
blacklisting CSS rules and attributes rather than whitelisting them. This
means you wouldn't catch attacks like this one, for example:

[http://www.davidpashley.com/blog/computing/livejournal-
mozil...](http://www.davidpashley.com/blog/computing/livejournal-mozilla-
bug.html)

I also don't think you're the right thing when you DO find something that's
not on the whitelist - you should be escaping it rather than stripping it (we
couldn't have a discussion about your code using your system, since our XSS
examples would be stripped).

I'd suggest re-engineering to use a whitelist for everything.

------
simonw
Allowing any CSS at all is very risky indeed. There was a brilliant phishing
attack on MySpace a few years ago where the attacker constructed their own
"log in" link and used CSS absolute positioning to overlay it across the real
"log in" link in the global navigation. They stole 30,000+ accounts.

Even if you filter out "position: absolute", there's a chance people might
figure out a way to do something similar using enormous padding values or
negative margins.

Your general approach (tokenise the HTML and use a whitelist) is an OK start,
but you should be white-listing attributes as well. You should also have an
ENORMOUS set of unit tests.

You allow object and embed which is very worrying - the allowScriptAccess
attribute can allow Flash to make JavaScript calls to the parent page, for
example.

Also remember this: you're not dealing with valid HTML, you're dealing with
malicious HTML that might be designed to evade your filters but still be
handled by browser's built-in error correction code. Since the most widely
used HTML engine is closed source, there's no telling what kind of weird
constructs might be error-corrected and rendered by IE.

HTML cleansing is a mine-field.

------
bensummers
This is important reading:

<http://www.feedparser.org/docs/html-sanitization.html>

I'm not convinced it's possible to stop a browser executing code, because
there are so many possible ways a browser can be given code to execute. Not
only do you have to read all the specs for all the versions of the browsers,
you've got to find all the bugs in them too.

Case in point: An earlier version of this code didn't remove javascript from
CSS expressions, making it possible to get past it in IE6 and 7.

I gave up trying to sanitize HTML, and instead used a library to render it to
plain text and stuck it in a <pre> element with usual HTML escaping. But I
need to take a very paranoid approach in my app.

EDIT to add: I think this is probably one of the better java implementation,
and has a good whitelisting approach to HTML. However, it's let down by taking
a blacklist approach to CSS.

~~~
niyazpk
If you are doing HTML sanitization, white-listing is the way to go (both HTML
and CSS). Another thing to note is that if you are trying to get away with
sanitization by some regex magic, you are destined to fail.

In many cases, you don't have to use HTML at all. Use something like MarkDown
and it will save you a lot of time and frustration.
<http://daringfireball.net/projects/markdown/>

~~~
bensummers
My app gets people uploading test data which is provided formatted in HTML, so
I don't have the luxury of requiring a different markup language.

But you're right, using HTML is very often overkill.

------
simonw
This works:

<s c r i p t>alert('xss')</ s c r i p t>

I think older versions of IE might execute this - the LiveJournal XSS filter
has defence against this one. At the very least though it should be escaped
rather than being allowed through on to the page.

~~~
robicch
No, this not works because is html-encoded. There are three output to
HtmlSanitizer: .html and .text should be used for visualization while .val
should be used to store data as it try to preserve data inserted.

------
RyanMcGreal
> Our approach is to remove unwanted tags and properties

I'm no security guru, but isn't the best practice to have a whitelist of
accepted elements rather than a blacklist of prohibited elements?

------
robicch
I know that the black list approach to css is dangerous, but there is someone
that saw the alert on the page
[http://patapage.com/applications/pataPage/site/test/testSani...](http://patapage.com/applications/pataPage/site/test/testSanitize.jsp)?
and if yes how?

------
tdoggette
Try middle-clicking a link on that page in Chrome. Tt opens in the same tab
(for me at least), but I get normal behavior in IE7.

