
*NEVER* sanitize your inputs - billpg
http://blog.hackensplat.com/2013/09/never-sanitize-your-inputs.html
======
breischl
I guess the "never sanitize" headline is clickbait, but the point is valid.
"Sanitizing" input is really hard, and can provide a false sense of security.
That string has been sanitized, so it's safe! Wait, is it safe for SQL? What
about HTML? What about inside a <script> tag? What about a different database
engine, or Mongo, or Azure Tables? You are much better off giving up on the
illusion of "safe input" that sanitization gives you, and instead always
treating user input as data rather than mixing it up with your code.
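
A minimal sketch of "treat it as data" for the SQL case, using Python's standard sqlite3 module. The `students` table and the `find_student` helper are made up for illustration; the point is only that the `?` placeholder hands the value to the driver separately from the SQL text.

```python
import sqlite3


def find_student(conn: sqlite3.Connection, name: str) -> list:
    # The ? placeholder binds `name` as a value; it is never spliced into the SQL text.
    cursor = conn.execute("SELECT id, name FROM students WHERE name = ?", (name,))
    return cursor.fetchall()


# Hypothetical in-memory table, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")

hostile = "Robert'); DROP TABLE students;--"
conn.execute("INSERT INTO students (name) VALUES (?)", (hostile,))
rows = find_student(conn, hostile)
```

The name is stored and found unchanged, quotes and all, and nothing had to be stripped or escaped by hand.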

My major complaint is that after correctly identifying the solution for SQL,
he ends up with nothing to say about HTML. The right approach for rendering
user input into HTML is with the JavaScript createTextNode() function. That's
how you tell the browser that it absolutely shouldn't interpret that content
as HTML.

~~~
billpg
Thanks for that. I'll add a note mentioning createTextNode once I've had a
chance to read up on it.

------
king_magic
"But that's what we mean by "sanitize"! Then you should stop calling it that."

Ugh, eyeroll. Seriously, let's waste time arguing over what to call security
vulnerabilities & ways to address them - instead of using consistent
terminology that security-minded developers instantly recognize.

To quote the hilarious Mean Girls - "stop trying to make fetch happen".

~~~
billpg
Okay, let's keep advising people to "sanitize" inputs. Even though it's
confusing and there's another word that isn't confusing. Because reasons.

~~~
awda
There is no other word that isn't confusing. If you refuse to actually do some
cursory research into something before implementing it, you're going to get
the stupid delete-is-sanitization stuff the author describes.

"Doctor, it hurts when I do this." Don't do that!

~~~
mfisher87
Except it looks like everyone's confused and there's a lot of misinformation
or a low signal-to-noise ratio _already_.

My Google search for "input sanitization" yielded these first two results:

[http://en.wikipedia.org/wiki/Secure_input_and_output_handlin...](http://en.wikipedia.org/wiki/Secure_input_and_output_handling)

2nd page (or more with a lesser screen), under "other solutions," this is the
only line about parameterization: "In particular, to prevent SQL injection,
parameterized queries (also known as prepared statements and bind variables)
are excellent for improving security while also improving code clarity and
performance." Everything else is about filtering, blacklisting, whitelisting,
escaping.

[http://www.esecurityplanet.com/browser-security/prevent-web-...](http://www.esecurityplanet.com/browser-security/prevent-web-attacks-using-input-sanitization.html)

Discusses filtering as solution to HTML injection. Lastly discusses SQL
injection, first recommending mysql_real_escape_string(), then in the second
paragraph linking to another article about parameterization.

It's not, to an inexperienced developer (this is the web remember?), a clear-
cut best practice from just "cursory research". It's a popular tech joke with
obvious but non-optimal solutions.

~~~
awda
[https://www.google.com/search?q=input+containerization](https://www.google.com/search?q=input+containerization)

What does the inexperienced developer learn from the new search terms?

I don't know why magically using different, non-standard words would prevent a
developer from being inexperienced.

~~~
mfisher87
Why do you think that? The idea isn't to tell new developers to search for
content that doesn't exist. The idea is to teach better solutions, a small
part of which is using correct descriptive language when naming things.

"I don't know why magically using different, non-standard words would prevent
a developer from being inexperienced." It's really hard to give _any_ response
to this sort of flawless logic...

------
bevacqua
This is terribly confusing advice: "NEVER sanitize your inputs!". He means:
"just don't call it sanitizing".

~~~
msl09
Click honeypot

~~~
return0
I get the 'click' part, but why is this being upvoted as well?

------
kijin
The argument makes sense in the SQL injection example (don't escape, use
prepared statements!) but falls apart when you get to the XSS example. Now
we're just trying to redefine words.

"HTML injection" does sound cool though. Since XSS nowadays is not necessarily
about sending cookies to another site, perhaps we could adopt "HTML injection"
as a more generic term.

Now of course, the problem we're trying to fix is someone who does:

    $content = htmlspecialchars(mysql_real_escape_string(addslashes($content)));

before $content ever hits the database, without any understanding of what
those functions really do. It's a surprisingly common cargo cult among newbie
web devs. Just throw all the security-related functions together and you'll be
safe!
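
To make the damage concrete, here is a small Python sketch (not the PHP above) of what happens when HTML escaping is applied before storage and the template layer then escapes again on output. The function names are hypothetical; `html.escape` is the standard library call.

```python
import html


def cargo_cult_store(content: str) -> str:
    # Escapes for HTML at input time, before the value reaches the database.
    return html.escape(content)


def render(stored: str) -> str:
    # The output layer escapes for HTML, as it should for plain-text data.
    return html.escape(stored)


name = "Tom & Jerry's"
double_escaped = render(cargo_cult_store(name))  # escaped at input AND at output
escaped_once = render(name)                      # raw value stored, escaped at output
```

Decoding `escaped_once` as HTML gives back the original text; decoding `double_escaped` gives text with visible entity junk in it, which is the mangling users end up seeing.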

~~~
vezzy-fnord
HTML injection already is a term:
[https://www.owasp.org/index.php/HTML_Injection](https://www.owasp.org/index.php/HTML_Injection)

Same principle, but different method of exploitation. If we supply plain HTML
tags in a vulnerable parameter, it's HTML injection. If we use JavaScript (via
a script tag or whatnot), it's XSS.

------
gizmogwai
This whole post is ridiculous. The problem he poorly tries to describe was
solved by mathematicians a few millennia ago. In a single word:
CONTEXT.

A word is nothing if not bound by a context. Developers have already developed
part of this context. Design patterns names are an example of those words
defined within the context. Sanitizing input is just another.

~~~
billpg
Put yourself in the shoes of an inexperienced programmer building their first
website. You've been advised to sanitize your inputs with the example of Bobby
Tables.

You know the plain English meaning of "Sanitize". Clearly, you need to remove
those single quote characters as they are unsanitary?

~~~
gizmogwai
Put yourself in the shoes of an inexperienced mathematician proving their first
theorem. You've been advised to take care with infinity, as illustrated by two
parallels crossing at infinity.

Problem: your theorem is dealing with a discrete, countable infinity...

On a side note, the English meaning of "sanitize" is "make clean and hygienic",
nothing more. It says nothing about "removing". Other definitions are
extensions based on CONTEXT, once again.

~~~
zAy0LfpBZLC8mAC
... which is exactly what you should not do. Here, let me post this:

"If you want to create a horizontal line in HTML, you write <hr>"

See that? There is nothing "unclean" about it, hence you should not "clean"
it. You just have to encode it if you output it embedded in HTML. That's why
calling it "sanitizing" is misleading.

~~~
gizmogwai
Again, wrong.

Encoding without proper context means "convert into a coded form". Hmm, that's
not exactly what we want. So, let's add the "computing context"; now we have,
as an example, the ability to encode a WAVE file into an MP3. But wait, we lost
information here! Bummer...

Sanitization in the context of computing does not specifically mean that you
have to "encode", or better, "transcode". It means that you have to take
appropriate measures so that your input DATA cannot be interpreted as CODE by
the receiver. Bonus points if the measure you choose is lossless in
terms of the information carried by your data.

~~~
zAy0LfpBZLC8mAC
Well, yeah, "transcode" might be better, but then again there isn't really any
hard difference between "encode" and "transcode", or possibly "encode" is just
useless because it can not ever happen without an associated decoding of the
information source?

But no, in a way, you are getting it all backwards, or at least a bit
confusing.

This is how you should construct a system that processes user input:

First, the input format should be defined such that it can only describe
things that make sense within the given context, in particular it should
usually not be possible to represent in it instructions for programming
language interpreters.

Second, whenever you have to represent user input in some context, you have to
encode (well, transcode) it into the format of that context. This transcoding
generally should only change representation and not change the meaning of the
converted information.

This automatically implies that you can not "inject code". There isn't really
anything magic about "code". That's what I think is a large part of the
confusion around "sanitizing input". The input can not represent code, the
conversion does not change the meaning, so if the input can not represent
code, the transcoding obviously can not cause code to appear either, and thus
you are safe - and not only are you safe, but your system also works as it
should otherwise, which it potentially does not if you start "removing
dangerous characters".

That is why you should not "sanitize", but only validate and
encode/transcode/convert. Which you need to do anyway for your system to work
properly. Lack of injection vulnerabilities will result automatically.

------
simonw
This article almost gets it right, then screws it up with the HTML example.

Both SQLi and XSS have the same cause: concatenating strings when you are
working with active code of some sort.

They both have the same solution: you need to know the escaping rules for the
active code you are assembling.

You shouldn't be solving XSS by stripping tags (that's a great way to build a
discussion forum where no-one can talk about how to use HTML) - you should be
escaping user input before assembling it in to HTML.

To protect against dumb mistakes (because it's really easy to screw up just
once and have a huge security hole) you should use abstractions that do this
for you. If you're working with Django the ORM will do this for SQLi and auto-
escaping in the template language will do this for XSS (watch out for
variables you are outputting in a script tag context though).

Escaping, not sanitizing, should be the message.
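
On the script tag caveat: a hedged Python sketch of why it needs its own rule. Plain JSON encoding is not enough inside a script element, because the HTML parser ends the element at the first closing script tag it sees. The helper name below is made up; the approach (replacing `<`, `>` and `&` with `\uXXXX` escapes, which JSON allows) is one common way to handle it, not the only one.

```python
import json


def json_for_script_tag(value) -> str:
    # json.dumps leaves "<" and ">" as-is, so "</script>" would survive.
    # \u003c, \u003e and \u0026 are valid JSON string escapes for <, > and &.
    return (
        json.dumps(value)
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


comment = "</script><script>alert(1)</script>"
naive = json.dumps(comment)          # still contains a literal </script>
safe = json_for_script_tag(comment)  # no angle brackets left for the HTML parser
```

A JSON parser decodes `safe` back to exactly the original string, so nothing is lost; only the representation changed.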

~~~
huxley
Hate to disagree with you but you can have plenty of flexibility with element
and attribute white-lists without abandoning sanitizing. Sanitize as much of
your inputs as you are comfortable with and escape the outputs.

------
nubs
I've always seen "sanitization" as more of an output-encoding problem.

People love to consider sanitizing the inputs, but how you do so doesn't
depend on the inputs but on their specific usage - more-or-less the
_output_ of your program.

Rather than trying to think of all the ways the inputs to your program could
be abused to cause harm, I find that it is safer to start at where the output
occurs - database calls, system calls, etc. The most commonly used of these
(database calls, shell commands, etc) tend to have a variety of encoding
capabilities to ensure that when you want to stick a string in a particular
place it does exactly that regardless of whether the string came from user
input or elsewhere. For example, bind parameters for databases, or proper
escaping functions.
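
For the shell-command case, a rough Python sketch: pass the arguments as a list and skip the shell entirely, so the string lands in exactly one argv slot. The child process here is just the Python interpreter echoing its first argument back, chosen only so the example is self-contained; `echo_via_child` is a made-up name.

```python
import subprocess
import sys


def echo_via_child(user_text: str) -> str:
    # Argument list and no shell: user_text arrives as a single argv entry,
    # so ";", "&&" and "$(...)" are just characters in that entry.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", user_text],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


hostile = "report.txt; rm -rf ~ && echo $(whoami)"
echoed = echo_via_child(hostile)
```

The call site looks the same whether the string came from a user or from a constant, which is the property described above.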

If you think about it as sanitizing input it means you tend to misplace your
attention to detail and only consider the entry to your application. A single
input is often used to do multiple things through a program so you cannot
properly handle sanitization at input.

The real push should be for proper output encoding, not input sanitization.

~~~
peterwwillis
The purpose of sanitizing input is not to prevent security vulnerabilities. It
is to make sure the values taken by your program are valid. If you accept a
number range, and the user inputs a word, it's invalid input for your
parameter and your program will crash. Input sanitizing validates the input is
correct for your use. It _indirectly_ improves security, but is not itself a
practice of making an app more secure.

~~~
nubs
The term "sanitizing" is not normally used for this. As others have commented, what you
are describing is "validating" the user input. That should, of course, happen.
Many validations will result in only accepting input that happens to be safe
for many uses - i.e., if it's a valid number between 1-100 you could of course
send it to an integer field in a database without doing any special encoding,
but I wouldn't rely on my input validation doing this in my model layer.

Encoding a "safe" value doesn't make things any less safe. Failure to encode
it, however, leaves potential holes in your application. Something may bypass
input validation and be given to the database as an unsafe, unvalidated value.
Usage of the value may change (new functionality using it differently, changed
storage in database, etc) and in the new usage the value may not be safe.

Input validation is obviously something you want to do, but it should never be
relied upon for protecting from injection attacks.
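
A small sketch of doing both, in Python with the standard sqlite3 module. The `scores` table and both function names are assumptions for the example. Validation rejects bad input at the edge, and the database call binds the value as a parameter anyway.

```python
import sqlite3


def parse_percentage(raw: str) -> int:
    """Validate: accept only an integer from 1 to 100, reject everything else."""
    try:
        value = int(raw.strip())
    except ValueError:
        raise ValueError(f"not an integer: {raw!r}") from None
    if not 1 <= value <= 100:
        raise ValueError(f"out of range 1-100: {value}")
    return value


def save_score(conn: sqlite3.Connection, raw: str) -> int:
    value = parse_percentage(raw)
    # Bound as a parameter even though validation says it is "just a number".
    conn.execute("INSERT INTO scores (value) VALUES (?)", (value,))
    return value


# Hypothetical in-memory table, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (value INTEGER NOT NULL)")
saved = save_score(conn, " 42 ")
```

If the column later becomes free text, or another caller skips `parse_percentage`, the insert is still safe.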

------
SlashmanX
> Perhaps this is why some Irish people prefer to spell their name using the
> letter Ó. After years of having their name mangled by naive software
> developers, they made a new letter.

Stopped reading here as I assumed the rest of the article was satirical

------
theboss
This is stupid and I don't see anyone quite hitting the nail on the head as to
why.

People normally lump web vulnerabilities together. XSS and SQLi especially.
To prevent XSS you have to sanitize. To prevent SQLi you use parameterized
queries.

To prevent stored xss you sanitize what you put in the database. So really...
You still need to sanitize.

I've also seen people make arguments about inexperienced web programmers and
how this advice can cause them to write bad code. I think the argument is bad
because so many resources exist to help them. There is real code on Stack
Overflow, W3Schools, OWASP, and other blogs that can be copied and pasted
into their projects.

~~~
zAy0LfpBZLC8mAC
No, you don't sanitize what you put in your database, you validate what you
put in your database, and convert into the output format when using data from
the database. Sanitizing is always(!) wrong.

~~~
theboss
If you never sanitize how do you prevent xss ...

For the average web dev my approach is plenty good enough. It's funny, because
your approach still requires sanitizing.

~~~
zAy0LfpBZLC8mAC
Using validation and encoding. You check input for conformance to your data
model and reject anything that fails the validation (you tell the user about
the error and ask them to correct their mistake), and then you convert from
your data model to the output format that you are generating.

So, for example, you could have a data model of "plain text field", in that
case you check that the input is a valid character string (so no undefined
codepoints present and, for example, no syntax errors in the UTF-8 encoding if
that is what you are using). Thus you can be sure that you have only
character strings in that column of your database. Then, if you want to
output one of those strings to be displayed within an HTML page, you convert
it from plain text to HTML (replacing "<" with "&lt;", "&" with "&amp;", and
so on). That way there is no XSS possible, and also, any input the user makes
is displayed back exactly as they entered it.
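
Sketched in Python, assuming UTF-8 input arriving as bytes (the function names are hypothetical; `html.escape` and `html.unescape` are standard library):

```python
import html


def validate_plain_text(raw: bytes) -> str:
    # Validation: reject malformed UTF-8 instead of silently repairing it.
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("input is not valid UTF-8") from exc


def plain_text_to_html(text: str) -> str:
    # Conversion: same information, represented as HTML character data.
    return html.escape(text, quote=True)


stored = validate_plain_text(b'To draw a line, write <hr> & "enjoy"')
rendered = plain_text_to_html(stored)
```

The conversion is lossless: decoding `rendered` as HTML yields `stored` again, character for character.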

~~~
theboss
Depends on what you need. Depends on the input field. Another example of why
this is stupid.

------
IgorPartola
A somewhat related term, I really like "mogrify":
[http://initd.org/psycopg/docs/cursor.html#cursor.mogrify](http://initd.org/psycopg/docs/cursor.html#cursor.mogrify)

~~~
rhizome31
I've been wondering where this word comes from. The only other occurrence I
know of is the ImageMagick command of the same name. It doesn't seem to be a
real English word. What does it evoke to a native English speaker? (ESL here)

~~~
hyborg787
It's short for transmogrify:
[http://www.merriam-webster.com/dictionary/transmogrify](http://www.merriam-webster.com/dictionary/transmogrify)

Calvin & Hobbes may have played a part in popularizing the term?
[http://calvinandhobbes.wikia.com/wiki/Transmogrifier](http://calvinandhobbes.wikia.com/wiki/Transmogrifier)

------
huxley
> "Perhaps this is why some Irish people prefer to spell their name using the
> letter Ó. After years of having their name mangled by naive software
> developers, they made a new letter."

I hope this is satire. The Irish didn't "make up" the letter Ó; it was the
standard historical form but was converted into O' when the names were
anglicized.

Frankly, his advice about sanitizers seems equally suspect. I've processed a
lot of complex scientific abstracts using html5lib and Bleach without any
mangling like he describes. He must be using very naive sanitizers.

------
zAy0LfpBZLC8mAC
The overall point is very true indeed, though I think it's not made
particularly clear what the actual problem with sanitizing input is.

The problem is that you are silently changing information, and that's an
absolute no-go for reliable data processing, and the cause is that people
think of, say, html, as "some kind of text/strings".

HTML is a serialization of a tree, similarly, SQL is a serialization of a
syntax tree ... - and if you want to add plain-text user input to such a
serialized tree, you have to _convert_ it from, say, "plain text" to "HTML
character data". You have to think of them as two different data types, and so
when you want to use a value presented as one of the types as the other type,
you don't have to "sanitize" it, even calling it "escaping" is confusing - you
have to _convert_ it. And if it happens that some input can not be represented
in the target type, then you have to _validate_ the input and _reject_ broken
input.
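
One way to make the "two different data types" idea hard to get wrong is to give HTML its own type, so plain strings cannot be concatenated into it by accident. A minimal Python sketch; the `Html` class and helper names are invented here, and real template engines have their own versions of this.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """A value that is already in HTML representation."""
    markup: str

    def __add__(self, other: object) -> "Html":
        # Only Html + Html is allowed; a plain str must be converted first.
        if not isinstance(other, Html):
            return NotImplemented
        return Html(self.markup + other.markup)


def text_to_html(text: str) -> Html:
    # The conversion from "plain text" to "HTML character data".
    return Html(html.escape(text))


def paragraph(body: Html) -> Html:
    return Html("<p>") + body + Html("</p>")


page = paragraph(text_to_html("To draw a line, write <hr>"))
```

Trying `Html("<p>") + "some user text"` raises a `TypeError`, which turns a forgotten conversion into an immediate error rather than a latent injection hole.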

~~~
marcosdumay
"Convert" is ambiguous. When you read the text "<head>" from a template file,
you probably want to represent it as the HTML "<head>", but when you read it
from the database, you may, or may not want to represent it as "&lt;head&gt;".

People started using the word "sanitize" exactly because it conveys the
information that "you want to treat it differently, depending on where it
comes from". We also use the words "dirty" (sometimes "tainted") and "clean"
conserving their usual relations to "sanitize".

Now somebody wants to throw away a very concise and expressive jargon just
because some people are giving bad advice on the Internet?

~~~
zAy0LfpBZLC8mAC
This has nothing to do with whether you read it from "a (template) file" or
"the database", it's only about what _format_ it is in. If the template
contains HTML, then the conversion to HTML (for the HTML part) is the identity
function, of course, the same if the database contains HTML - if you want to
use the same thing in a plain text email, you will have to convert to plain
text. If, on the other hand, the template file or the database contains plain
text, the reverse applies: conversion to plain text is the identity function,
conversion to HTML is the usual replacement with entity references.

That you are using "dirty/tainted" and "clean" only shows how deep the
confusion is. There is some justification to use those terms when talking
about before and after validation, but other than that it's probably an
indication of confusion (which also seems to be the common usage).

Take, for example, a general plain text field for optional free-form text.
There is essentially nothing that could be validated (other than maybe that
it's a valid UTF-8 string). Now, you want to generate a plain text email using
the user input - how would you "sanitize" it?

There is nothing "better"/"cleaner" about any particular encoding, be it plain
text, HTML, SQL, or any other, they are simply different encodings, and you
have to always use the correct one, not the "best one"/"cleanest one", and you
have to always know what format the data that you are processing is in so that
you can convert correctly.

This jargon is not at all concise, actually (some people mean "remove
'strange' stuff/clean it", others mean "escape it", ...), and it makes you
think in ways that obscure the actual problem that you are solving: Conversion
between data types/data representations.

------
TomGullen
False surely, as another poster commented you want to sanitise inputs for
example for user signatures to remove Javascript and other nasties. Sanitising
inputs isn't just about protecting against SQL injections.

What the author actually means is that removing apostrophes to prevent SQL
injections can affect your data integrity, so parameterise your queries.

Alternatively, replacing single apostrophes with double apostrophes in your
queries also works, but parameterising queries is a much better practice to get
into.

~~~
IgorPartola
No, TFA is right. If the user wants to post <script>alert("I am a
hacker.")</script>, so be it. Display it literally. You do have to take care
to escape it when you are rendering your HTML. But guess what? You have to
anyways since <script> is not the only evil tag out there. XSS can be
performed a number of ways and you are not going to catch them all by removing
stuff from user input.

~~~
blincoln
No. What you are describing is a way of doing things that inherently leads to
security vulnerabilities, because it depends on someone else remembering to
include something in their code, and people will always forget to do the right
thing at least some of the time. Developers should never allow input into the
system that doesn't match their expectations about what is valid for that
value/field/etc. If you have a user text input that only requires alphanumeric
characters, space, period, and comma, then strip out any character which is
not one of those things. That field is now no longer a possible source of XSS.

~~~
vidarh
The problem is that there are plenty of fields where your attempts at
filtering will _break user expectations horribly_ if you filter the data even
remotely strictly enough to ensure security.

Such as, say, comment fields. It'd be terribly restrictive for your users if
they can't write about <script> tags on a technical forum without munging it.

And you're still not safe. All the characters needed for an SQL injection
attack, for example, commonly occur in normal English usage. All the
characters needed for XSS commonly occur too, so you'd need more restrictive
filtering.

And have fun when a bug that causes your filter to be more restrictive than it
should now means data is unretrievable because you've just stored the
sanitised output of your buggy filter.

Once you've dealt with that, you're still facing the issue of changing
filtering requirements: What is safe for HTML may not be safe for your CSV
export. What is safe for your PDF generation may not be safe for your HTML
generation, and vice versa. Suddenly you're asked to pass data via an API,
with different expectations of what a "safe" value contains. Boom.
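
A quick Python illustration of the "different outputs, different rules" point, using only the standard library. The stored value and function names are made up. The same raw string is kept once and encoded separately for each destination.

```python
import csv
import html
import io

# Stored exactly as the user typed it.
stored = 'Smith, "Bobby" <b@example.com>'


def as_html_cell(value: str) -> str:
    # HTML context: angle brackets, ampersands and quotes become entities.
    return "<td>" + html.escape(value) + "</td>"


def as_csv_row(values: list) -> str:
    # CSV context: commas and quotes matter, angle brackets do not.
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    return buffer.getvalue()


html_out = as_html_cell(stored)
csv_out = as_csv_row([stored, "ok"])
```

Neither encoding would be correct for the other destination, which is why a single "sanitized" copy in the database cannot serve both.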

In other words, if you believe that what is in your database is safe from
causing security problems, you've lost. You need to treat every piece of data
that may possibly contain user input as a potential cause of problems whenever
you output it or pass it on anywhere, whether or not you've (attempted) to
validate and restrict the input.

A typical example I used to have to deal with: Mail systems. HTML that is
entirely safe when downloaded and rendered by a mail client that contains the
HTML in a document that is just for that one e-mail, can leak data all over
the place and compromise the users account if left unfiltered when rendered on
the web server. You can't insert it pre-filtered into the database without
inserting the raw content too because the user may want to download it.

And because the only reasonably safe filtering method is white-listing tags
and CSS due to evolving standards, you will regularly have to revise the
filters and add functionality and people will be _very_ annoyed if their
e-mails still don't render correctly after you've fixed the bugs (and if you
have to tighten the filters again, you don't want to have to re-filter all the
data).

------
nraynaud
Well in HTML you can use a sandboxed iframe (or <webview> in technologies that
have it), but it's not cheap.

~~~
tokenizerrr
I just remembered visiting a website that used iframes with script tags
disabled for users' signatures. It was a pretty interesting approach.

~~~
ceejayoz
I just hope they used a proper lib like Purifier to do it, or someone's going
to have fun with `onmouseover`.

~~~
nraynaud
I think that when it's sandboxed with the proper attributes, you can't do
anything apart from trashing the content of the frame.

~~~
ceejayoz
I had no clue that was so widely supported. No IE8/9 but virtually everything
else. Neato!
[http://caniuse.com/#feat=iframe-sandbox](http://caniuse.com/#feat=iframe-sandbox)

------
mantrax4
I can't post a comment containing <script> in this comment form, because it's
"disallowed", instead of just escaping it as plain text.

The thick, thick irony of a guy who can't even follow his own advice.

~~~
zAy0LfpBZLC8mAC
He is perfectly following his own advice. Apparently, his form field takes
HTML syntax with a subset of HTML tags. Your input does not conform to that.
So, instead of silently altering what you wrote, it tells you about the
validation failure and asks you to correct your input. The input field takes
HTML, so you have to write "&lt;script>" (I suppose, haven't tested it) in
order to display "<script>" - that is perfectly consistent.

~~~
mantrax4
No form should accept just "HTML" if you don't want just any "HTML" in your
form.

I was actively trying to talk about his script example and instead I had to
second-guess his parser to get past the validator (I eventually gave up and
replaced < and > with [ and ]).

If you want to support _some_ tags, have your parser be an HTML-like DSL
language with those tags supported. Don't disallow perfectly good input.

~~~
zAy0LfpBZLC8mAC
I don't really understand what point you are trying to make, but in any case,
he has a particular input format for that form field, your input did not
conform to it, so it was rejected, nothing particularly surprising or wrong
there.

Now, I haven't tried it, but I suppose his form field expects HTML syntax?
Have you tried entering your text in HTML syntax? Was that rejected?

