
There's more to HTML escaping than &, <, >, and "  - jamesbritt
http://wonko.com/post/html-escaping
======
nbpoole
...except when there isn't. ;-)

PHP's htmlspecialchars function, which is called out specifically in this
post, is actually fine for most use-cases. If you write code like this:

    <input type="text" name="foo" value="<?php echo htmlspecialchars($_GET['bar']) ?>" />

Then an attacker can't break out of the attribute (and consequently can't do
anything bad). htmlspecialchars also accepts a flag, ENT_QUOTES, which makes
it escape single quotes too, so values in single-quoted attributes are
sanitized properly. If you don't put quotes around an attribute's value at
all, then you're still vulnerable: you also have invalid HTML.
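
A minimal sketch of the same point in Python rather than PHP, assuming the standard library's html.escape is a fair stand-in for htmlspecialchars with ENT_QUOTES. The render_input helper and both payloads are hypothetical, made up for illustration.

```python
import html


def render_input(value: str, quoted: bool = True) -> str:
    """Render an <input> whose value attribute holds escaped user input."""
    escaped = html.escape(value, quote=True)  # escapes & < > " and '
    if quoted:
        return f'<input type="text" name="foo" value="{escaped}" />'
    # Unquoted attribute: whitespace ends the value, and escaping does not touch it.
    return f'<input type="text" name="foo" value={escaped} />'


# Tries to close the double-quoted attribute and add an event handler.
quote_payload = '" onmouseover="alert(1)'
# Needs no quotes at all, only a space, to start a new attribute.
space_payload = 'x onmouseover=alert(1)'

safe_markup = render_input(quote_payload, quoted=True)
broken_markup = render_input(space_payload, quoted=False)
```

In the quoted case the payload's quotes come out as &quot; and stay inside the value. In the unquoted case the escaping function has nothing to escape, so the space alone is enough to start a second attribute.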

The real takeaway here can be found in the middle of the post: " _you must be
aware of the context in which you’re working with user input_."

~~~
wbond
Right, htmlspecialchars() works perfectly as long as you use it for an
attribute value surrounded by quotes, or if you escape a whole string. The
only way you can get into trouble is if you use it on an attribute value that
doesn't have quotes.

If you are using PHP and want to accept any user input that should be
interpreted as HTML, you basically need to be using
<http://htmlpurifier.org/>. If you are going to be accepting text input, clean
the input value to ensure proper character encoding is being used (important
for multi-byte encoding such as UTF-8) and then use htmlspecialchars(), and
make sure to specify your encoding as the third parameter.

    // Tell the browser which encoding the page uses
    header('Content-Type: text/html; charset=utf-8');

    // Return the named GET parameter, escaped for HTML output
    function get_escape($field) {
      // Drop any byte sequences that are not valid UTF-8
      $value = iconv(
        'UTF-8',
        'UTF-8//IGNORE',
        isset($_GET[$field]) ? $_GET[$field] : ''
      );
      // ENT_COMPAT escapes double quotes only, so keep attributes double-quoted
      return htmlspecialchars($value, ENT_COMPAT, 'UTF-8');
    }

    // Safely output text user input
    echo '<html><body><p title="' . get_escape('title') . '">' . get_escape('content') . '</p></body></html>';

~~~
rll
That's actually not true, and htmlspecialchars() will automatically do the
UTF-8 validation for you as long as you have set your charset correctly. But
even if you do that, you are still vulnerable.

Inside on* handlers and style attributes, the rules are different. Take
something like this:

    <?php $foo = htmlspecialchars($_GET['foo'], ENT_QUOTES); ?>
    <a href="" onmouseover="a='Fantas<?php echo $foo ?>tic';">Mouse Over Me</a>

htmlspecialchars() does its job here: it turns a single quote into &#039;.
However, inside on* and style attributes the browser decodes the &#039; entity
back into a raw single quote before the JavaScript or CSS is parsed. You need
to double-escape in this particular case to be safe, or better yet, don't use
raw on* handlers and style attributes. There are much cleaner ways to do
those, but if you have to, don't ever put user data in them because you will
mess up the escaping.
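
A rough Python sketch of why HTML escaping alone fails here, and of what double-escaping could look like. It assumes html.unescape approximates what the browser does to an attribute value before handing it to the JavaScript engine, and that json.dumps is good enough to produce a JavaScript string literal for this payload. Both are simplifications, and the payload is made up.

```python
import html
import json

payload = "';alert(1);//"

# HTML escaping only: the quote becomes an entity in the page source...
attr_unsafe = "a='Fantas" + html.escape(payload, quote=True) + "tic';"
# ...but the HTML parser decodes the attribute before the script runs.
js_seen_unsafe = html.unescape(attr_unsafe)

# Double escaping: build a JavaScript string literal first, then HTML-escape
# the whole literal for the attribute context.
js_literal = json.dumps("Fantas" + payload + "tic")
attr_safer = "a=" + html.escape(js_literal, quote=True) + ";"
js_seen_safer = html.unescape(attr_safer)
```

After decoding, the first variant is a='Fantas';alert(1);//tic'; which runs the injected call. The second is a single string assignment, because the quote that delimits the JavaScript string was chosen by json.dumps and never appears unescaped in the data.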

You can try a live example here:

<http://talks.php.net/show/flux/14>

It is not possible to write a single generic HTML escaping function that will
work in all contexts. If it were, I would have written htmlspecialchars()
differently.

There are more examples of how you can mess up even if you always quote your
attributes if your escaping function isn't smart. The UTF-7 hack was
mentioned, which is good, but the invalid UTF-8 hack wasn't explained. That
is, if you send an invalid UTF-8 sequence, like %E0 then certain browsers
(well, just IE) will lose their minds unless you make sure you don't display
that invalid UTF-8 sequence back to the user. So htmlspecialchars() does more
than just escape the set of chars you mentioned, it also validates the
characters and makes sure it never outputs an invalid UTF-8 byte sequence.

0xE0 by itself is the first byte of a 3-byte UTF-8 char and IE will simply eat
the following 2 bytes to make up the char. So if you output the raw byte 0xE0
followed by "> then, even though the byte is inside quotes, IE will eat the
following "> and replace those 3 bytes with the dreaded (?) char. More
disastrously, it will think it is still inside the quoted attribute, so the
next raw quote it sees will end the attribute and you have yourself another
quoted XSS hole.
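
To make the byte-level problem concrete, here is a small Python sketch. It is not how htmlspecialchars() works internally; it only shows a strict UTF-8 decoder rejecting the lone 0xE0 and one way to avoid echoing it back. The is_valid_utf8 helper is hypothetical.

```python
# A lone 0xE0 lead byte inside a quoted attribute, followed by the closing ">.
raw = b'value="\xe0">'


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Replace each invalid byte with U+FFFD instead of passing it through.
cleaned = raw.decode("utf-8", errors="replace")
```

A strict decoder treats 0xE0 as invalid on its own, because the byte after it is not a continuation byte, so the closing quote and angle bracket survive intact in the cleaned string.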

~~~
premchai21
> Inside on* handlers and style attributes, the rules are different.

That may be so, but I'm not sure I'd call that an “HTML” escaping problem per
se. An attribute always has an additional syntax, and you have to account for
the subsyntaxes of whatever attributes are in place—but those aren't at the
HTML layer proper, only implied by it. E.g., a's href attribute takes a URI.
onfoo attributes take JavaScript code. style attributes take CSS. So in order
to make HTML “safe” (FSVO “safe”) when it contains those attributes, you have
to make those values “safe” recursively according to their subsyntaxes—e.g.,
if you want to allow CSS with url(), then you have a URI inside CSS inside an
HTML attribute inside an HTML document inside (for instance) a UTF-8 string,
and you have to take all the layers into account.
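
A short Python sketch of escaping layer by layer, for the simpler case of a URI inside an HTML attribute. The /search path, the parameter names, and the build_search_link helper are all made up for illustration.

```python
import html
from urllib.parse import urlencode


def build_search_link(term: str) -> str:
    """Build a link whose href and label both carry the same user input."""
    # Layer 1: the value becomes a query-string component (URI syntax).
    href = "/search?" + urlencode({"q": term, "lang": "en"})
    # Layer 2: the finished URI becomes an HTML attribute value.
    href_attr = html.escape(href, quote=True)
    # The label is plain HTML text, so it only needs the HTML layer.
    label = html.escape(term, quote=True)
    return f'<a href="{href_attr}">{label}</a>'


link = build_search_link('a&b "c"')
```

The same input ends up encoded two different ways in one tag: percent-encoded and then entity-escaped inside href, and entity-escaped only in the label.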

It may be that a lot of people don't realize there are several potential
layers of syntax involved, think of it as a monolithic and simple thing, and
then get confused when it is not. Things like PHP's htmlspecialchars can
inadvertently encourage this kind of inaccurate view.

(I'm not disagreeing with you exactly, just describing from a variant
perspective. A bit of redundancy in discussion can create an antialiasing-like
effect.)

~~~
Sizlak
Is this thread a parody of how no one should use PHP?

~~~
premchai21
That certainly wasn't my intention specifically. I mentioned htmlspecialchars
because it was the example being used upthread, but any API with analogous
functionality is potentially subject to similar provisos, and if any
surrounding cultural element encourages its use without thinking through the
syntax layers, it can have a similar effect. I assumed this was implicit.

------
hedgehog
When generating output for a browser what you're really doing is writing an
HTML serializer. Kind of tricky to do right by concatenating a bunch of
strings together. Some template systems (such as Genshi for Python) actually
parse the template as HTML or XML so they understand how to encode all of your
outputs correctly for their context.

~~~
wladimir
Indeed, if you're generating the HTML/XML stream as tokens instead of as plain
text, the context-sensitive quoting can be done automatically.
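
A minimal illustration of the token-based approach using Python's standard xml.etree.ElementTree as a stand-in (this is not Genshi, and it serializes XML rather than HTML). The element, attribute values, and text are made up.

```python
import xml.etree.ElementTree as ET

title = 'foo" onmouseover="alert(1)'
text = "<script>alert(2)</script>"

# Build the element as a tree of nodes instead of concatenating strings.
link = ET.Element("a", {"href": "/user/foo", "title": title})
link.text = text

# The serializer knows which values are attributes and which are text,
# and escapes each one for its own context.
markup = ET.tostring(link, encoding="unicode")

# Parsing the output again recovers the original values unchanged.
parsed = ET.fromstring(markup)
```

Nothing in the calling code mentions escaping; the serializer applies it because it knows where each value sits in the tree.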

When I do have to generate HTML as text I usually go with escaping &<>"' and I
double quote all attribute values. Isn't this best practice?

Is anyone using single quotes or (eek) unquoted HTML attributes at all?

~~~
nostrademons
We use unquoted attributes at Google sometimes. It's for bandwidth/latency-
saving reasons. They're usually limited to literal template text (i.e. class
names, width/height attributes, etc.) and not user-generated text.

------
DanBlake
Can't believe the backtick/grave accent ( ` ) isn't mentioned anywhere in the article.

That bugger can really do some damage.

~~~
rgrove
I debated whether or not to mention this, and in the end decided I didn't want
an in-depth discussion of edge cases to overwhelm the basic message I was
trying to get across, which is that context is key.

As far as I know, ` is only an issue when using user input in innerHTML with
IE. Are there other situations where it can be harmful?

~~~
nbpoole
` can be used in place of single or double quotes around attribute values in
IE.

~~~
rgrove
My understanding (and I tested to confirm) is that IE only treats ` as an
attribute delimiter when it's assigned to an element's innerHTML value
dynamically. So this is important when working with client-side code, but not
so much when generating HTML on the server.

Am I wrong?

~~~
nbpoole
I just tried the following HTML:

    <input type="text" value=`asdf` />

In IE, the input box contained the string asdf. In other browsers, it
contained the string `asdf`
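
One possible mitigation, sketched in Python: encode the backtick as a numeric character reference on top of the usual escaping, so it can never act as an attribute delimiter. The escape_attr helper is hypothetical, not something from the article or from PHP.

```python
import html


def escape_attr(value: str) -> str:
    """Escape & < > " ' and additionally the backtick (U+0060)."""
    return html.escape(value, quote=True).replace("`", "&#96;")


escaped = escape_attr("`asdf` onmouseover=alert(1)")
```

The reference &#96; decodes back to a backtick in the attribute's value, so the displayed text is unchanged.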

~~~
rgrove
You're right. I was mistakenly testing only a limited case (described at
<http://html5sec.org/#59>). Thanks!

------
yuhong
Personally I’d just consider quoting attribute values to be best practice. And
BTW, escaping < and > is not even necessary if you are using quotes with HTML
attributes, unless you need compatibility with browsers prior to Netscape 2.

~~~
xentronium
> Personally I’d just consider quoting attribute values to be best practice.
> And BTW, escaping < and > is not even necessary if you are using quotes with
> HTML attributes, unless you need compatibility with browsers prior to
> Netscape 2.

<a href="...">Label with evil angle brackets</a><script>...

~~~
yuhong
Sorry, I forgot to say within attribute values.

~~~
rgrove
That's exactly the point of the blog post: it's all about context.

------
oconnore

    <a href="/user/foo" onmouseover="alert(1)">foo" onmouseover="alert(1)</a>

I put this example into my browser and there was no attack. Replace the second
mouseover with alert(2) and it's clear that it is never parsed as javascript.
Am I missing something? [Edit: yes, I am missing something]

EDIT: Blerg! I misread it! Disregard!

~~~
JoachimSchipper
The first alert() should work - note that it's _not_ part of the template.

The second alert() is just garbage, of course.
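
For anyone else who misreads it: the markup above is what you get when the username is dropped unescaped into a template in two places. A Python sketch, where TEMPLATE is my guess at the kind of template involved:

```python
import html

# Hypothetical template: the username is used in the href and as the label.
TEMPLATE = '<a href="/user/{name}">{name}</a>'

username = 'foo" onmouseover="alert(1)'

# No escaping: the first quote in the username closes the href attribute.
unsafe = TEMPLATE.format(name=username)

# HTML-escaped: the quotes stay inside the attribute value.
safe = TEMPLATE.format(name=html.escape(username, quote=True))
```

The href is really a URL context as well, so a real template would percent-encode the path segment before HTML-escaping it; that step is left out here to keep the example short.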

------
Yxven
I was concerned about this because I wrote an open source Python HTML
whitelist filter that escapes/eliminates everything that isn't on the
whitelist, but it sounds like my filter works as long as you quote attributes.
I guess I need to add a warning for that.

Will someone double check me?
<https://sourceforge.net/projects/htmlfilterfacto/files/>

~~~
JoachimSchipper
You'll want to check at least <http://news.ycombinator.com/item?id=2525126>,
<http://news.ycombinator.com/item?id=2525181>.

~~~
Yxven
I'll fix it. Thank you.

------
leviathan
I can't find a use case for the example the article is working with. Who in
their right mind would take user input and put it in an HTML attribute?

If your system accepts 'foo" onmouseover="alert(1)' as a username, you've got
bigger problems.

~~~
pornel
> If your system accepts 'foo" onmouseover="alert(1)' as a username, you've
> got bigger problems.

Technically that shouldn't be a problem. I can put that in HTML, in a URL, in
the database. I can even make a directory with that name and use it in shell
scripts, as long as every one of them uses correct escaping.

Bobby Tables is welcome on my systems.
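
A quick Python sketch of that claim, using only standard library calls. The users table is made up, and SQLite stands in for whatever database you use.

```python
import html
import shlex
import sqlite3
from urllib.parse import quote, unquote

name = 'foo" onmouseover="alert(1)'

# Each context gets its own escaping (or, for SQL, parameter binding).
as_html = html.escape(name, quote=True)
as_url = quote(name, safe="")
as_shell = shlex.quote(name)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))  # bound, not concatenated
stored = conn.execute("SELECT name FROM users").fetchone()[0]
conn.close()

# Each encoded form decodes back to exactly the original string.
round_trips = (
    html.unescape(as_html) == name
    and unquote(as_url) == name
    and shlex.split(as_shell) == [name]
    and stored == name
)
```

The stored value is the raw string; the escaping happens at each output boundary, not once at input time.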

