Hacker News new | past | comments | ask | show | jobs | submit login

Output sanitization is what you want to bet on. Only your website / app knows where a piece of data will be displayed, so that is when you should apply appropriate encoding of the output stream.

Yup. React has this by default with everything you render. You have to override it with “dangerouslySetHtml” if you don’t want it.

As have done other frameworks for quite some time, btw.

As other frameworks have done for quite some time* :)

As done framework time some other.

Actually, input might be the better option as one rarely needs to accept HTML or such special characters.

It's also a more common to display or use data than storing it, so you don't have that many places where you can fail when you just convert the input before storing it.

It's nice to be able to trust all data coming from the server.

Trust but verify. The only correct option is to do both.

Can you (or another commenter) give an illustration of this sort of output sanitation?

The term you probably want to look for in your web framework is "encoding".

I don't like "sanitization" personally, because it sounds like you're removing "bad stuff", but in general, "bad stuff" is not identifiable or removable because "bad stuff" is highly context-dependent, plus a lot of times the "bad stuff" is perfectly legitimate [1]. Apostrophes are "bad stuff", because they can break out of SQL queries and HTML tags, but they are also parts of people's names. Double-quotes are "bad things", but they are legitimately part of all sorts of real data. Any "sanitize(string)" function is by definition wrong because it has no place for a context to go, and it will do bad things to your data.

One of my items on my short checklist for examining an HTML templating language is "does the simplest possible way to dump out a string to the user at least do HTML encoding on the value"? That is,

    x = "<>"
    template = compileTemplate("{x}")
    template.Dump({x: x})
for whatever the simplest output of a value is, should output "&lt;&gt;"; if it outputs "<>", you've got a templating language that you ARE going to write XSS attacks in, no matter how careful you are. The time you want to dump out non-encoded text is the exception, not the rule.

Bonus points for being even more aware of the context and correctly encoding things in Javascript context vs. HTML context, etc. This isn't a magic wand that fixes everything, but in general, if it does default to a blind HTML-encode it at least means that instead of a security failure if you screw up the encoding, you'll get the user seeing some ugly stuff on their screen like &lt; instead.

[1]: Although, technically, I think it's acceptable for an HTML encoding function to just eliminate the ASCII control characters other than newline, carriage return, and tab, rather than encode them. Those are just asking for trouble, even if you encode them. Especially NUL. Even in 2019, best to keep NULs out of places they don't belong.

Exactly, sanitization is a misnomer. If you are concatenating plain text together with HTML then you have an app which is functionally broken when someone with an apostrophe in their name tries to use it -- it's not just a matter of security. The strings must be the same format (i.e both valid HTML fragments) before you concatenate them or the result will be unparsable garbage.

And the idea of "sanitization at input" is especially ridiculous: how can you know what you will be concatenating that input with until you actually do it? I.e. is it being inserted into some HTML? is It going in an attribute value or a text node? What about outputting JSON?


This is why we typically speak about defense in depth. Input sanitization works best when applied to known expected inputs, like a phone number or dob.

Output encoding is the real solution where we know where we intend any data to end up (this is how it’s displayed) so we can ensure that it’s in the correct format and that that format parser won’t interpret it as code instead of data. Ie html attribute, html, Json, JavaScript, etc.

If a user has a bracket character in any field, it's OK to allow it, as long as you don't render it directly in any HTML. You have to make sure that when you render it you render it as `&lt;` or `&gt;`, which get displayed as `<`, or `>`, but aren't interpreted as HTML.

Correct. And one reason to properly format for output, rather than sanitise input is because you do not know how the string might be used. I mean you can sanitise for HTML output, but it won't cover shell command output (i.e.: when you pass the string as a parameter to a tool via --vehicle-name=). Thus input is to be stored as is, and NEVER trusted even if some input sources "sanitise" it.

this mistake is what causes the incredibly common html entities in plain-text emails, as well as RSS article titles.

Output formatting, example: "><script src="// becomes &quot;&gt;&lt;script src=&quot;// for the web. For other types of applications, where any of these characters have special meaning and might be interpreted, these might be formatted differently for output.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact