
Don’t try to sanitize input – escape output - benhoyt
https://benhoyt.com/writings/dont-sanitize-do-escape/
======
jedberg
You should do both. Sanitize your inputs so they can be safely stored in
your data store, and then sanitize your output so it can be safely displayed.

We did this at reddit. We had basic SQL sanitization on the way in, and then a
full pass on the way back to the user. The advantage this gave us was that when
someone discovered a new way to hack our sanitization, all we had to do was
update the output filter and everything was magically safe.

We didn't have to do full database scans to find all the bad data and change
it.

Edit: Apparently I shouldn't have simplified "parameterization of SQL" as
"sanitize your input". I used the more generic term since I was talking about
any kind of data store. But yes, it was of course parameterized.

~~~
laumars
Are you sure that's what is happening at Reddit? You shouldn't need to
sanitise your inputs for SQL. Parameterised SQL has been a thing in some
languages for two decades now. This really is a long-solved problem.

Output is a different matter, but that's because of rendering content
safely down to HTML, JavaScript or JSON (to name a few examples). SQL
shouldn't come into the equation by that point.

~~~
tie_
This. I'm tired of people implying or outright stating that SQL injection is
an input validation problem. Why couldn't you have _foo ' OR 1=1;_ as the
title of your post? Those are all perfectly good characters as far as text
entry is concerned.

SQL injection is really a problem of how you pass parameters to your SQL
layer. Parametrized queries are the (easy and widely available) solution. If
you are concatenating input to your SQL queries, you're doing it wrong.
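As a sketch with Python's built-in sqlite3 module (the schema here is invented for illustration), that "dangerous" title is stored and read back verbatim when bound as a parameter:

```python
import sqlite3

# In-memory database for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT)")

title = "foo ' OR 1=1;"  # perfectly valid text for a post title

# The value travels out-of-band to the SQL layer; it can never
# change the structure of the statement itself.
conn.execute("INSERT INTO posts (title) VALUES (?)", (title,))
stored = conn.execute("SELECT title FROM posts").fetchone()[0]
print(stored)  # foo ' OR 1=1;
```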

~~~
simias
I blame PHP. Many webdevs active today started with it, and the standard
library's solution to injections was escaping everything half a dozen times
just in case. Because PHP being PHP nobody saw any red flags when they
implemented a function named "mysql_real_escape_string". Apparently they've
deprecated these functions since then, but the damage is done.

~~~
ivanhoe
But that hasn't been a thing for 15 years or more. PDO was added around 2005,
and even before that anyone in their right mind used the mysqli extension for
prepared statements. Since 2012 you can't even use the mysql extension without
getting a deprecation warning.

And yes, in the 90s PHP's security sucked, but that was nothing PHP-specific,
it was just the sentiment of that time. Everyone did it, in all languages. I
remember using tons of $dbh->do() in Perl's DBI back then, intentionally
avoiding prepared statements for quick and dirty stuff (and most of the
scripts back then were quick and dirty stuff). In large part it's because we
were used to building desktop apps and thinking in terms of the security that
applied to them, like being careful about your pointers and input string
lengths and stack overflows and such. The web was still a pretty new thing.

~~~
teh_klev
> But that's not a thing for 15 years or more? PDO was added around 2005

Ex-shared hosting bod here, who had the joy of managing our PHP environments
:(

Sadly, in the real world, even after the great big (and pointless) act of
deprecating and removing the mysql_* library, naive developers (and
experienced ones who should've known better) just moved onto mysqli_* or PDO
and _still_ used string concatenation with raw inputs, instead of learning how
to parameterise their queries.

Used to drive me flippin' nuts.

~~~
ivanhoe
> naive developers (and experienced ones who should've known better) just
> moved onto mysqli_* or PDO and still used string concatenation with raw
> inputs, instead of learning how to parameterise their queries.

True, I stand corrected. I've just checked and WordPress still does it just
like that:
https://github.com/WordPress/WordPress/blob/master/wp-includes/wp-db.php#L2023

------
rossdavidh
So, essentially he is saying: 1) go ahead and accept the risky thing from your
users, and store it right there in your database, but 2) make sure that you
remember, in every single place in your code where you read that out of the
database, to treat it properly, and 3) make sure that every other programmer,
now or at any time in the future, remembers to do this also, in any code they
write which reads user input out of the database and puts it on the screen.

What a bad idea. Don't leave landmines there for other maintainers of the code
to step on. Especially because the other maintainer may actually be you, six
months or a year from now.

Sanitize your inputs. Also, escape your outputs.

~~~
DagAgren
That simply does not work. You can't sanitise, escape and reproduce correctly
all at the same time.

Say you run a blog. I post a comment saying "But in this case, B<A!"

This is clearly dangerous input! But it is also exactly what I wanted to say.
How do you sanitise this? Change < to &lt; in the database? Now you have to
remember to NOT escape that again when outputting! And you have to make sure
that, say, your text resources in your UI are all also escaped the exact same
way, or you have to remember to escape them DIFFERENTLY than user-provided
input.

Or maybe you "sanitise" by stripping out dangerous characters like "<". Now
you have broken my comment.

The only strategy that is at all maintainable is to store the comment as
received, and to escape on output. Anything else is massively fragile or
broken.
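In Python, for example, the store-as-received / escape-on-output strategy is one standard-library call at render time:

```python
import html

comment = "But in this case, B<A!"  # stored exactly as received

# Escape only when rendering to HTML; the database keeps the original,
# so nothing is ever broken or double-escaped at rest.
rendered = html.escape(comment)
print(rendered)  # But in this case, B&lt;A!
```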

~~~
dictum
Maybe I'm overengineering, but couldn't you store the sanitized version as the
_normal_ value, and also store and make publicly available the original
unsanitized value in an ominously and obviously named key (say,
dangerouslyUnsanitizedValue) that happens to be easily greppable/lintable?

~~~
GuB-42
I think you are overengineering ;)

Plain text can contain anything and should be treated as such; it's that
simple.

As for security, don't assume everything in your database came from a trusted
source. Maybe there are remains from an old version of your code that didn't
sanitize, maybe you improperly used admin tools that bypassed checks.

------
0xff00ffee
These two issues have been relevant for over 20 years, longer than today's
college grads have been alive. I find it fascinating that new blog posts
explaining these two pitfalls are still being written on a regular basis. And
there are probably a million blogs with these same two examples going back
decades. This is an epic re-post in spirit.

I'm tempted to ask, "Why hasn't this been fixed yet?" Where "fixed" means,
"Something new programmers just starting off their careers don't have to jam
into their brains?"

I know the expected answer will be: it's an abstraction of a more complex
problem of understanding data and how it is used, and we can talk about how JS
and PHP have added native functions to construct custom code to address this
problem (hack cough cough hack).

But these two cases in particular stick in my craw because no fundamental
solution has presented itself yet.

So I ask again (and I've been asking this since 2004-ish):

Why do these two examples still persist? Why do the frameworks not eliminate
them by construction? This is such a repeated pattern, why is it even there?

~~~
extrapickles
It boils down to two things. One is that library/tool design typically makes
it too easy for user input to be in-band with execution. The other is that
most tutorials/guides only show you how to do things in-band.

An example of a tutorial for SQL:

SELECT first_name FROM users WHERE last_name = 'Smith'

They then have an exercise to hook this query to a text box in the program,
where through omission, the programmer is guided to use string concatenation
to build the query.

If every SQL statement a programmer saw, from the first to the last, was
parameterized, it would be much harder for them to reach for string
concatenation.
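Sketched with Python's built-in sqlite3 module (table contents invented here), the tutorial query can be parameterized from the very first example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("John", "Smith"))

# What the text box supplies is bound as a parameter, never concatenated.
last_name = "Smith"
rows = conn.execute(
    "SELECT first_name FROM users WHERE last_name = ?", (last_name,)
).fetchall()
print(rows)  # [('John',)]
```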

Most modern web development frameworks make it very hard to insert un-escaped
text into the DOM. You have to go out of your way to introduce an XSS
vulnerability in your web application with one, and most of the tutorials and
documentation about the framework warn you about using the raw HTML
functionality.

Another way to look at it is that the out-of-band way of doing things is
typically perceived as either lower in performance, harder to do and/or less
elegant (eg: C-style strings vs pascal strings).

I consider anything with user input that is done in-band (eg: escaping is a
fix) to be doomed to fail. This is similar in idea to the cryptographic doom
principle where decryption before authenticating the message is ultimately
doomed to failure.

~~~
gwd
> If from the first SQL statement to the last that a programmer saw was
> parameterized, it would be much harder for them to reach for string
> concatenation.

I dunno -- I've been doing C programming for 30-ish years now, but just
learned SQL about a year ago. _Every man page_ I looked at, as well as every
stackoverflow question, emphasized the importance of using parametrized
queries. And IIRC in Python, "only execute a single statement" is enabled by
default; if you want to execute multiple statements, you have to use a
different call. So even if you somehow manage to forget to parameterize your
queries, you'll still be safe from Little Bobby Tables.
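For what it's worth, that single-statement default can be sketched with Python's built-in sqlite3 module; the exact exception class has varied across Python versions, so this sketch catches both:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# A classic multi-statement payload, wrongly concatenated into the query.
payload = "Robert'); DROP TABLE users; --"
blocked = False
try:
    conn.execute("INSERT INTO users (name) VALUES ('" + payload + "')")
except (sqlite3.Warning, sqlite3.ProgrammingError):
    blocked = True  # execute() runs one statement at a time

print(blocked)  # True; the DROP never runs
```

Note that this only blocks multi-statement payloads; an injection that stays within a single statement (e.g. a smuggled `' OR 1=1`) is not stopped by it.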

Do SQL injection attacks still actually happen? How is it possible?

~~~
hobs
SQL injection attacks still happen, a LOT.

Making sure your query only executes a single statement is not enough to
prevent SQL injection (depending on how you concatenate the query) - you just
have to use the provided context to get/set the data you need to escalate your
permissions (eg: if it's SELECT stuff FROM table, you might be able to inject
such that your query replaces stuff, and then you can select whatever the
querying user has access to.)

2019: For its "State of the Internet" report, Akamai analyzed data gathered
from users of its Web application firewall technology between November 2017
and March 2019. The exercise shows that SQL injection (SQLi) now represents
nearly two-thirds (65.1%) of all Web application attacks. That's up sharply
from the 44% of Web application layer attacks that SQLi represented just two
years ago.

~~~
zAy0LfpBZLC8mAC
> The exercise shows that SQL injection (SQLi) now represents nearly two-
> thirds (65.1%) of all Web application attacks.

Are you sure that you don't actually mean "weird strings sent to web
applications" when you write "web application attacks"? Sending a weird string
to a web application is not an attack in any particularly relevant sense unless
there is an actual vulnerability--other than when Akamai wants to sell you
snake oil against all the danger of weird strings.

------
vxNsr
It's interesting how so much of programming is knowing about the existence of
libraries and not trying to rebuild things that already exist. I'm sure there
are thousands of people out there who sanitize inputs and outputs naively and
don't know about great libs like DomPurify.

I know I'm guilty of trying to build something from first principles only to
google it after banging my head against edge cases and finding a ready-made
library or util that with a tiny bit of finesse or modification does the job.

~~~
inapis
TBF, most of these libraries are not easily discovered. You have to luck into
the chance that the library author used similar words to your search query.
Barring major stuff (authentication, databases etc), you will rarely know
that a library exists. Search engines are still limited by the language
of humans. Unless you note down every possible library you come
across for future reference, this problem is here to stay.

Maybe a language having a vast standard library won’t suffer from this problem
but it will definitely have other problems.

~~~
vxNsr
yeah, creating a personal search db is time consuming and kinda impossible...
a while ago everyone was coming up with bookmark managers that could kinda
sorta function as a personal search db, but they still required a lot of
customization. Also they were all cloud based and didn't really function
offline.

------
billpg
I wrote a very similar piece six or seven years ago. People responded here
that I was just making semantic arguments or otherwise fell over themselves to
ignore the point I was trying to make.

It is good to see from the responses here that we've learnt absolutely
nothing.

Shameless plug:
https://blog.hackensplat.com/2013/09/never-sanitize-your-inputs.html

~~~
palant
I've been saying the same for at least a decade, e.g. in
https://palant.de/2016/03/02/why-you-should-go-with-secure-by-default-for-your-web-application/.
It's ridiculous that somebody still has to explain it.

------
firefoxd
It's definitely an interesting article, but he didn't go deep enough into the
subject of escaping the output.

I always recommend this excellent article from Joel Spolsky, "Making Wrong
Code Look Wrong":

https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/

~~~
nothrabannosir
Agreed, it seems what this article (along with most of the comments) is
dancing around is the idea of a stronger type system. String != string, e.g.
UserText vs SqlStatement. Using explicit conversion methods between those
types helps clarify the actual “boundaries” of your system, or rather the
independent parts and their individual boundaries. Joel Spolsky’s article
illustrates the problem well.

The problem stems from our simplistic type systems, which reflect a value’s
storage class (“array of bytes”) rather than its semantic type (“raw input
from user”, “sql safe TEXT literal”). Once your type system can differentiate
between those two, it can help you identify where conversion between the two
(aka “escaping” or “encoding”) is necessary. Then the problem of “dangerous
string” disappears, because there is no more string: if it’s a UserText, it
can’t be concatenated with a SqlStatement without conversion. Just like an
“int” for example, or an array.

Anyway I’m just rehashing Spolsky’s article, poorly. Don’t let my inaccurate
summary reflect negatively on his point :p
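A toy sketch of the idea in Python (the type names UserText and SqlStatement come from the comment above; the API here is invented): user text can only enter a statement through one explicit conversion point, which is where binding lives.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserText:
    value: str  # raw input from the user; not safe to splice anywhere

@dataclass(frozen=True)
class SqlStatement:
    text: str           # trusted query text, written by a programmer
    params: tuple = ()  # user values travel out-of-band

    def bind(self, user: UserText) -> "SqlStatement":
        # The one sanctioned crossing between the two types.
        return SqlStatement(self.text, self.params + (user.value,))

query = SqlStatement("SELECT * FROM posts WHERE title = ?")
query = query.bind(UserText("foo ' OR 1=1;"))

# The query text never changes shape, no matter what the user typed.
print(query.text)    # SELECT * FROM posts WHERE title = ?
print(query.params)  # ("foo ' OR 1=1;",)
```

Because UserText is not a str, accidentally concatenating it into query text fails immediately instead of silently producing an injectable query.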

------
patrec
Arrrghh, it's 2020 and developers are still too dumb to understand the
difference between a string (list of characters) and a tree (sql, html, ...).

No, the solution is not to "sanitize" input, output or both! The solution is
to not use the same type for text and trees!

We should rid the world of brain-damaged templating solutions like jinja, go
template and similar garbage that pretends your SQL or HTML is some flat
string that just needs a bit of extra contextual "escaping" magic [1].

If you interpolate into a proper AST there is no problem and you need no
"escaping". For efficiency reasons you may not want to directly interpolate
into an actual ast and then de-serialize all of it to a string again just to
output it down a socket in the next line, but that's just an optimization.

Bourne shell fucked this up as well (for an easier case of interpolation,
since there is basically no nesting) and it remains a constant source of
severe bugs and security holes in shell scripting as well. By contrast lisp
has been doing this right for literally decades:

    `(html (div ,some-string-to-interpolate ,@a-list-of-inline-elements-to-splice))
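A rough standard-library analogue in Python, using xml.etree.ElementTree as the tree (not a real HTML5 serializer, but it shows the principle): text interpolated as text can never become markup, so there is no escaping step to forget.

```python
import xml.etree.ElementTree as ET

div = ET.Element("div")
div.text = "B<A!"  # hostile-looking user text goes in as *text*

em = ET.SubElement(div, "em")  # markup is added as nodes, not strings
em.text = "markup added as nodes, not strings"

# Escaping happens as a property of serialization, not a step to remember.
print(ET.tostring(div, encoding="unicode"))
```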

[1] Yes, I get that it's "convenient" to have a "universal" solution that
works for any type of file. But HTML and SQL in particular are easily
important enough to have a correct solution, and it's not hard, it's not
inherently slower, and it's way more convenient in any real sense than
puzzling over several Rube Goldberg sanitization schemes, running some stupid
security linter and paying $$$ to pen-testers to hunt down this crap.

~~~
matheusmoreira
Parser support is non-existent in most if not all languages. Every language I
know can parse regular languages _at best_. Parsing HTML or SQL and
manipulating the resulting tree is not the first solution developers think of.

We should be able to look up some RFC, give the EBNF grammar to a library and
get a parser out of it. In order to do that today, we need to use ancient
parser generator tools. Why? A parse(grammar, input) -> tree function would be
easier to use. The Earley algorithm can receive a grammar as input.

Related: [http://langsec.org/](http://langsec.org/)

~~~
patrec
Well, I'm not some prefix fanatic, but much of the problem would not exist in
the first place if we had just used some sexp-style syntax for HTML; it would
be more pleasant to edit, and faster and much easier for both humans and
machines to parse, to boot. Another billion-dollar mistake.

So I feel a bit ambivalent about attempts to lower the cost of pushing more
overcomplex grammars out into the world. When did you last use in earnest a
non-sexp/internal DSL for something like build systems that didn't engender
in you an occasional urge to visit physical violence on its creators? But what
I'd unambiguously like to see is easier parsing of "sane" languages and the
death of Perl-style regexps.

Still, my guess would be: 95% of the trouble comes from two sources: HTML
(including JS, CSS, SVG etc., unfortunately) and SQL. And most of the
remaining 5% from bash :) So just dealing with those three would make a big
dent.

Also, things are not quite so bleak as you make them out to be: JSX is much
saner, and any established mainstream language has a conforming HTML5 parser
these days (sadly, to do it properly you also want something that deals with
the various other languages that get munged into HTML: CSS, JavaScript, gimped
XML, and here the situation is less good). SQL is thornier (and has many
wildly different dialects), but unless you need dynamic queries, parameterized
queries are available everywhere.

In fact a 1% effort/80% of the benefits approach is to not bother with parsing
at all and just use different types for e.g. HTML and text (interpolate HTML
into HTML by plain string interpolation; interpolate text, i.e. plain strings,
into HTML as escaped strings).
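A sketch of that 1%-effort approach in Python (the Html type and tag helper are invented for illustration; html.escape does the text-to-HTML conversion):

```python
import html

class Html:
    """Markup that is already safe to emit verbatim."""
    def __init__(self, markup: str):
        self.markup = markup

def to_html(value) -> Html:
    if isinstance(value, Html):
        return value  # already HTML: interpolate as-is
    # Plain text: escaped on its way across the type boundary.
    return Html(html.escape(str(value)))

def tag(name: str, *children) -> Html:
    inner = "".join(to_html(c).markup for c in children)
    return Html(f"<{name}>{inner}</{name}>")

page = tag("div", "B<A!", tag("em", "safe"))
print(page.markup)  # <div>B&lt;A!<em>safe</em></div>
```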

------
thdrdt
Of course you can validate input!

Validation != Sanitization

In this thread some people seem to confuse those.

You can validate that all input characters are UTF-8. But the moment you start
to 'sanitize' non-UTF-8 input into UTF-8 you are in trouble. It's best to
notify the user that input validation failed and that you don't accept the
input.
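A sketch of the distinction in Python (the helper name is invented): validation either accepts the bytes or rejects them, and never rewrites them.

```python
def validate_utf8(raw: bytes) -> str:
    """Accept the input only if it is valid UTF-8; never rewrite it."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Sanitizing here (errors="replace") would silently corrupt data;
        # instead, tell the caller the input was rejected.
        raise ValueError("input is not valid UTF-8") from None

assert validate_utf8("héllo".encode("utf-8")) == "héllo"

rejected = False
try:
    validate_utf8(b"\xff\xfe not utf-8")
except ValueError:
    rejected = True
```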

------
varelaz
As for XSS, I think browsers could have done more to fix this. For example,
add a tag like <unsafe> or <sandbox> for a part of the HTML that cannot access
cookies or JavaScript on the page and that disables any active components,
like iframes and objects. Developers could use it to render rich content
provided by users. Right now you can do this only with iframes and CORS, but
that's too heavy to implement. These tags could have their own CORS limits,
for example.

Why do I think it's a browser problem? The security of the output needs to be
reviewed whenever browser features are added, and only at the browser level
can it stay up to date with all new features.

~~~
brlewis
Do you think CSP doesn't do the job?
https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP

~~~
varelaz
It does, but only on the whole document. The problem is that by default it
allows everything, and people are too lazy to find out how it works and set it
up right. Also you need to allow every third party separately, which could
work badly for ads. If there were a sandbox, you could allow only what you
need for a particular part. I understand that this concept looks complex and
more like an iframe. Basically, a lot of ad content right now is rendered in
iframes without src, which are kind of sandboxes in this case.

------
peterwwillis
You don't "sanitize" input, you "sane-itize" input. The whole point of
checking the input is to make sure it's valid, not to try to scrape away cruft
you don't think is valid.

Example: phone numbers. Is the input you accepted a phone number? If you were
just "sanitizing input", you might pass it through some generic "sanitizing
function" that checks for "malicious characters" or something and strips them
out. But what you should actually be checking is: _is this a phone number?_
By making sure the input is what it's supposed to be, you not only gain a
better security posture, you also improve your program's reliability by
making it operate the way you expect it to.

Some input fields like "give me a random block of text and I'll store it" are
very hard to validate, so for those fields you can encode them as Base64 for
storage, and at output time decide how to format them safely.
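As a sketch with invented helpers (the phone pattern here is deliberately simplistic; real phone validation needs a proper spec or library):

```python
import base64
import re

def validate_phone(raw: str) -> str:
    # Ask "is this a phone number?"; don't strip "bad" characters.
    if not re.fullmatch(r"\+?[0-9][0-9 ()-]{6,18}", raw):
        raise ValueError("not a phone number")
    return raw  # accepted exactly as given

def store_free_text(raw: str) -> str:
    # Free-form text: encode for storage; decide safe formatting at output.
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")

phone = validate_phone("+1 (555) 867-5309")  # passes the check unchanged
blob = store_free_text("any <text> at all")  # opaque, storage-safe blob
```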

Also consider wrappers for any functions which pass input from a user. Perl's
Taint Mode
([https://en.wikipedia.org/wiki/Taint_checking](https://en.wikipedia.org/wiki/Taint_checking))
is a global way to enforce this, but for languages that don't have a Taint
Mode, you'll have to implement it yourself.

~~~
jessaustin
Some people really go overboard with this sort of validation, however. I've
had to argue with vendors who didn't believe email addresses can end in
something other than ".com" or ".org".

------
rawfan
It's actually quite easy:

1. Sanitize input when you actually need/want to, but at least to a degree
that it's safe to put in a database (e.g. through prepared statements).

2. Always validate input.

3. Always escape output (unless you have a reason not to).

------
ethhics
Postel’s law—be conservative in what you do, but liberal in what you accept
from others

~~~
SigmundA
Exactly what came to mind:

[https://en.wikipedia.org/wiki/Robustness_principle](https://en.wikipedia.org/wiki/Robustness_principle)

Having done plenty of interfaces between systems these are words to live by.

~~~
zAy0LfpBZLC8mAC
No, it's terrible advice, as it only causes unnecessary interoperability
problems and vulnerabilities. There is no reason anyone should need to
generate invalid input to your program, and it is never a better idea to make
every consumer more complex to deal with broken input than to make the one
producer create non-broken output.

The only robustness to invalid input you should have is that you should not
fall over when you encounter broken input, but simply reject it.

~~~
SigmundA
Both TCP (whose spec Postel wrote) and HTML follow this principle, so it seems
to have its merits.

You know what followed your principle? XHTML... and the arguments were the
same: it's not well formed, just reject it, why would you ever accept broken
input.

Sure, that makes parsing faster and simpler, and yet what actually works and
is robust in the real world is HTML...

~~~
zAy0LfpBZLC8mAC
What is your argument here?

That HTML is a platform with an extraordinary security track record? No one
has ever exploited all the ambiguities that result from the incoherent mess
that is the web?

Or is it that we never had any interoperability problems with HTML? All
browsers always reliably rendered websites consistently? "This website is
optimized for IE" never happened?

How isn't that just the best example to support my point?

As for TCP ... how is it relevant that Postel wrote the spec? Does that mean
that the vulnerabilities in TCP never happened? Or are you saying that modern
TCP implementations try to accept any crap whatsoever? (No, they don't, of
course they don't, people have actually learned that that's a bad idea.)

~~~
SigmundA
People seem to prefer web sites that render inconsistently rather than not at
all because of one little issue in the markup. It is more robust to render
something rather than nothing, and that is one big reason XHTML was abandoned.

Yes, a system that no one uses is more secure than one everybody does use.

Postel's Law is literally in the TCP RFC [1], don't you think that makes it
relevant?

1. https://tools.ietf.org/html/rfc761#section-2.10

~~~
zAy0LfpBZLC8mAC
> People seem to prefer web sites that render inconsistently rather than not
> at all because of one little issue in the markup.

Except those are not the alternatives. The alternatives are consistently
rendered websites or inconsistently rendered websites. If browsers had
strictly enforced HTML syntax from the beginning, no one would ever have built
websites with "little issues in the markup".

IP stacks do not accept randomly misformatted IP packets. The result is
obviously not that you constantly encounter internet services that you cannot
access because your IP stack is picky about broken IP packets; the result is
that no one ever sends you broken IP packets.

> It is more robust to render something rather than nothing

No, it just isn't. You are just looking at a very small part of the
consequences of this implementation strategy that indeed happens to be
positive, but completely ignoring the big picture of all the externalities and
other indirect damage that result from it.

> and is one big reason XHTML was abandoned.

Erm ... no? The reason why XHTML was abandoned was because people are
incompetent at writing software, and there existed an alternative that allowed
them to keep their idiotic practices, including all the vulnerabilities and
interoperability problems that result from those, so that's what people did.

> Yes a system that no one uses is more secure than one everybody does.

How does that follow? And what does that have to do with anything?

> Postel's Law is literally in the TCP RFC [1], don't you think that makes it
> relevant?

Relevant ... for what?

~~~
SigmundA
>Except those are not the alternatives. The alternatives are consistently
rendered websites or inconsistently rendered websites. If browsers had
strictly enforced HTML syntax from the beginning, no one would ever have built
websites with "little issues in the markup".

That's not reality. If everyone got perfectly formed input we wouldn't be
having this debate; the reality is that it occurs, so what do you do: reject
it, or accept it and try to do something with it? XHTML simply rejects
malformed markup and you get a blank page; HTML tries to make sense of it and
render something.

>IP stacks do not accept randomly misformatted IP packets. The result is
obviously not that you constantly encounter internet services that you cannot
access because your IP stack is picky about broken IP packets; the result is
that no one ever sends you broken IP packets.

So you never heard of ECN? Setting the ECN bits was technically incorrect,
depending on how pedantic you were in the interpretation, and some stacks
rejected packets if the bits weren't set to zero. Due to the robustness
principle, most stacks ignored these bits, allowing others to use them for
ECN and allowing a graceful update to the spec. The stacks that took your
stance and rejected them, however, were simply roadblocks to adoption.

>No, it just isn't. You are just looking at a very small part of the
consequences of this implementation strategy that indeed happens to be
positive, but completely ignoring the big picture of all the externalities and
other indirect damage that result from it.

I'm not ignoring anything, I am just pointing out reality: the real world is
messy, and the stacks that try to keep working under messy conditions seem to
be prevailing. It's not pretty and I don't deny the issues that arise, but
here we are, communicating on the largest, most successful computer network
ever built, using a protocol and a markup language built with Postel's law in
mind.

>Erm ... no? The reason why XHTML was abandoned was because people are
incompetent at writing software, and there existed an alternative that allowed
them to keep their idiotic practices, including all the vulnerabilities and
interoperability problems that result from those, so that's what people did.

I think most who know the history would disagree with this opinion [1]. It
was obvious to me at the time why XHTML would fail; even though I thought it a
cleaner solution, I realized that's what was holding it back. It was much
better to see your page come up with maybe a weird rendering artifact than to
have the browser render nothing and throw an error because some small part was
malformed.

>How does that follow? And what does that have to do with anything?

Because complaining about security vulnerabilities found in some of the most
used software in the world, while comparing it to something that no one uses,
doesn't help your point.

>Relevant ... for what?

Uh, gee, I don't know, maybe Postel's law is kinda relevant when discussing
TCP because Postel wrote the spec, you know, like what you asked in the post
before? What kind of game are you playing here?

1. https://thehistoryoftheweb.com/when-standards-divide/

~~~
zAy0LfpBZLC8mAC
> That's not reality. If everyone got perfectly formed input we wouldn't be
> having this debate;

Erm ... you do understand that, you know, there is feedback involved in this?
That I am obviously not saying that no one would ever have typed broken HTML
into a file if browsers had rejected broken HTML from the start?

I mean, it's even the norm for implementations of other computer languages to
be rather strict about syntax, and it doesn't hinder their popularity with the
same audience. The exact same people who produce garbage HTML do so using Perl
or PHP or Ruby or ... whatever. And whatever you otherwise think about those
languages, none of them will just make shit up when there is a syntax error in
your program; they will simply reject it. And no, that does not mean I am
claiming that no one has ever made a syntactical mistake when writing code in
those languages. But, you know, people are actually capable of fixing those
mistakes when they are pointed out to them.

> So you never heard of ECN? Setting the ECN bits was technically incorrect,
> depending on how pedantic you were in the interpretation, and some stacks
> rejected packets if the bits weren't set to zero. Due to the robustness
> principle, most stacks ignored these bits, allowing others to use them for
> ECN and allowing a graceful update to the spec. The stacks that took your
> stance and rejected them, however, were simply roadblocks to adoption.

Erm ... what? That's almost fractally wrong!?

None of the ECN problem was one of pedantry, it was simply one of a broken
specification, namely the TCP specification. "Reserved for future use. Must be
zero." is simply a bad specification. If you specify an extension mechanism,
you have to always specify how the extension mechanism is supposed to work.
What you call the pedantic interpretation is a perfectly valid interpretation
of what the text says. You are just looking at it in hindsight, with the idea
that it's supposed to support the operation of ECN, and then it's obviously a
problem--but people who implemented TCP stuff before there was ECN could not
possibly know that that is how people would expect to use this _if the TCP
specification doesn't specify that_. There is nothing wrong with extension
mechanisms that work by having the recipient discard messages with flags it
doesn't know. That's just not what ECN chose to do, but that is kinda ECN's
fault. You might just as well have ended up with a situation where someone
would have tried to build an extension that assumes that recipients discard
segments with unknown flags, and everyone would have been pointing fingers at
those who chose to ignore the flags instead, and how they were pedantic to
ignore the flags just because the specification does not explicitly say that
such segments are invalid. It's just an accident of history that most
implementations chose to ignore unknown flags, and therefore people now point
to the exception, without any basis other than them being the majority.

Also, obviously, the "robustness principle" did not allow for a graceful
update to the spec. The fact that a graceful update was not possible is the
whole reason why you mentioned ECN at all. And that is not necessarily a
result of failing to follow the robustness principle, as the robustness
principle really doesn't tell you anything useful. All you can do with it is
to point at things in hindsight and say "if everyone had built this the same
way, then things would be compatible now!" But the robustness principle is
useless for actually achieving that. For any format specification, there is an
almost infinite number of ways you can deviate from the specification where
humans could look at any individual one of those deviations and come to an
agreement as to how that deviating message could reasonably be interpreted.
And any one of those deviations could in principle be implemented as part of
the corresponding parser. But implementing a parser that "correctly"
interprets _all_ of those possible deviations is at the very least a major
undertaking, and usually even impossible due to contradictions between various
deviations when they appear in combination.

And that is why hindsight is misleading: In hindsight, you only see one
particular (small set of) deviation(s) causing interoperability problems, and
it would almost always have been possible to make every parser coherently
interpret those deviations just fine, and if everyone had done that, then you
would not have any interoperability problems. But that isn't the perspective
of someone who initially builds the implementation. They can only either
strictly follow the spec (which works perfectly if everyone does so and the
spec isn't broken) or they can increase complexity of and effort required for
their implementation an order of magnitude or more to accept close to anything
that could happen (which no one does for obvious reasons) or they can implement
a random selection of deviations they like (which then leads to
interoperability problems and the view in hindsight that everyone else could
easily have done the same, which, of course, they couldn't, because they
couldn't know what others were doing). Of course, there is a simple solution
to that last approach: If you want to implement deviations from the agreed-
upon spec but you don't want to run the risk of creating interoperability
problems, you could get together with all the other implementers and talk
about which deviations everyone is going to implement. But obviously, that's
just the first approach in disguise: After you have agreed on the deviations,
they aren't deviations anymore, you have simply created a new spec, and
everyone then strictly follows that new spec.

Essentially, what is happening here is that you see one interpretation of
something that the spec doesn't actually specify as obvious. And then you
claim that the solution to interoperability problems is that everyone does the
obvious thing. But you fail to recognize that the whole problem we are trying
to solve with specifications in the first place is that _what seems obvious is
different for different people_. Which is why this (a) can not work and (b)
obviously in practice does not work. You can not solve the problem of people
having different approaches to problems by simply saying "they should just all
have the same approach" while at the same time saying that methods to create
agreement (i.e., specifications) should not be taken too seriously.

> I'm not ignoring anything, I am just pointing out reality, the real world is
> messy and the stacks that try to keep working under messy conditions seem to
> be prevailing. It's not pretty and I don't deny the issues that arise, but
> here we are communicating on the largest, most successful computer network
> ever built using a protocol and a markup language built with Postel's law in
> mind.

Then your points are just irrelevant? I never said that broken systems can not
be successful, did I? Yes, there clearly are evolutionary advantages to
externalizing costs, and taking risks can pay off. But there are also other
parties who have to pay those externalized costs, and taking risks can also
end in a catastrophe. Externalizing costs is still an asshole move (and is
generally frowned upon by society when people understand that that is what is
happening) and whether the risks taken by the web, for example, have actually
paid off is far from obvious.

Also, possibly all of this was built with Postel's law in mind. But what I
would be interested in is whether that was to our benefit. Just because
something was a factor in creating a certain overall positive situation does
not mean that therefore that factor made that situation better than if it
hadn't been there. In particular, evolutionary success does not mean that a
different approach would not have produced a better result.

> I think most who know the history there would disagree with this opinion
> [1], it was obvious to me at the time why XHTML would fail even though I
> thought it a cleaner solution, I realized that's what was holding it back. It
> was much better to see your page come up with maybe a weird rendering
> artifact than just have the browser render nothing and throw an error if
> some small part was malformed.

How does that contradict what I said? Yes, it was obvious that XHTML would
fail due to the massive incompetence of developers ... your point being?!

> Uh gee I don't know maybe Postel's law is kinda relevant when discussing TCP
> because Postel wrote the spec you know like what you asked in the post
> before? What kind of game are you playing here?

I am not sure what kind of game you are playing, but I had the impression that
you were trying to make a point and not just state the historical fact that
that's where Postel formulated the "robustness principle". Yeah, I agree,
that's what he did. And it was a bad idea.

~~~
SigmundA
>And whatever you otherwise think about those languages, none of them will
just make shit up when there is a syntax error in your program, they will
simply reject it. And no, that does not mean that I am claiming that no one has
ever made a syntactical mistake when writing code in those languages. But, you
know, people are actually capable of fixing those mistakes when they are
pointed out to them.

Except it's pretty common now for programming languages to add quality-of-life
changes that loosen some of the strict parsing rules, such as trailing commas
or optional semicolons. Same with whitespace: many languages don't pay much
attention to it, and then you have a formatter that is strict about it (gofmt).
This is Postel's law in action: liberal acceptance, strict output. The
alternative is strict adherence to whitespace with no need for a formatter;
just have the compiler reject it and put the burden on the programmer.

>Also, obviously, the "robustness principle" did not allow for a graceful
update to the spec.

Again, your opinion is not shared historically; ECN is held up as an example of
the robustness principle having been followed in most stacks, with some
problem ones that did not, causing some issues [1].

>Then your points are just irrelevant?

Then your points are just irrelevant? We can play this game forever. Just the
fact that you are using HTML and not XHTML, and TCP under that, to write these
should make some relevant point that you can't seem to see.

>Yes, it was obvious that XHTML would fail due to the massive incompetence of
developers ... your point being?!

Or more likely all these developers weren't incompetent, myself included; just
when given the choice, the strictness of XHTML lost to the liberalness of
HTML, proving Postel's law again. Messy and robust won over clean and fragile
again, that's the point, get it?

>I am not sure what kind of game you are playing, but I had the impression
like you were trying to make a point and not just state the historical fact
that that's where Postel formulated the "robustness principle". Yeah, I agree,
that's what he did. And it was a bad idea.

>As for TCP ... how is it relevant that Postel wrote the spec? Does that mean
that the vulnerabilities in TCP never happened? Or are you saying that modern
TCP implementations try to accept any crap whatsoever? (No, they don't, of
course they don't, people have actually learned that that's a bad idea.)

Going back to your original question, since you're having a hard time
connecting the dots: Postel wrote the spec for TCP and put his law in the spec
as guidance. ECN was developed taking advantage of that principle, and most
stacks accepted the malformed packets because of it. There are other examples
of this [2]. TCP is complicated; if stacks didn't follow Postel's law, they
would never get anything done on the internet.

1. [https://tools.ietf.org/html/draft-ietf-tcpm-generalized-ecn-05#section-4.2.2.2](https://tools.ietf.org/html/draft-ietf-tcpm-generalized-ecn-05#section-4.2.2.2)

2. [https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/](https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/)

~~~
zAy0LfpBZLC8mAC
> Except its pretty common now for programming languages add quality of life
> changes that loosen some of the strict parsing rules, such as trailing
> commas or optional semi colons. Same with whitespace, many languages don't
> pay much attention to it then you have a formatter that is strict about it
> (gofmt). This is Postels law in action, liberal acceptance strict output.

Erm ... no, it's obviously not? Or at least not in a way that is relevant to
this discussion. I am obviously not objecting to specifying languages that
give you a lot of freedom in how you format things, so what is the point of
bringing up that you could interpret the robustness principle to mean just
that? I am obviously objecting to accepting input that does not conform to the
respective relevant specification, and the fact that making languages more
flexible in their formatting is often useful has no relevance to that
whatsoever.

You interpret some term to mean a broad range of things, I point out that one
of those things is a bad idea, and your defense is that one of the other
things is good ... how is that even an argument? How does that change that
what I pointed out is a bad idea?

> The alternative is strict adherence to whitespace then no need for a
> formatter, just have the compiler reject it and put the burden on the
> programmer.

No, the alternative is strict adherence to the language specification. Or,
really, it's not an alternative at all, because there is zero contradiction
between specifying a language with flexible whitespace grammar (or separator
grammar or whatever) and then strictly enforcing that grammar (and thus
obviously avoiding interoperability problems).

> Again your opinion is not shared historically, ECN is held up as an example
> of the robustness principle having been followed in most stacks, with some
> problem ones that did not causing some issues [1].

In other words: Your position is unfalsifiable? If there are no
interoperability problems due to everyone interpreting messages identically,
then that is obviously due to the robustness principle, and if there are
interoperability problems because implementations deviate in how they
interpret messages, then that is also obviously a success of the robustness
principle? Is there any scenario where that robustness principle would not
count as successful?

> Then your points are just irrelevant? We can play this game forever. Just
> the fact the you are using HTML and not XHTML and TCP under that to write
> these should make some relevant point that you can't seem to see.

How is the fact that I am using something in any way relevant to the question
of whether an alternative would have avoided interoperability problems and
vulnerabilities?

> Or more likely all these developers weren't incompetent including myself,
> just when given the choice the strictness of XHTML lost to the liberalness
> of HTML proving Postels law again. Messy and robust won over clean and
> fragile again, that's the point, get it?

How is it relevant that HTML won? How do you connect from "technology X won
over technology Y" to "therefore, technology Y would not have had fewer
interoperability problems and vulnerabilities than technology X"?

Why do you answer every question as to technical properties of a technology
with "it lost" or "it won" while completely failing to say anything at all
about the technical property being discussed?

NOONE DENIES THAT HTML WON OVER XHTML.

Also, it seems you almost completely ignored the central explanation of my
previous post, simply to repeat your previous points as if I never had said
anything. I am happy to read your explanation as to where my analysis is
wrong, but I am completely uninterested in reading over and over points that I
have repeatedly explained my disagreement with, with no insight at all into
how my reasoning is wrong.

------
kazinator
The problem with this kind of description is that input and output are often
two names for the same thing. One processing element's output is the next
one's input.

Basically, don't substitute HTML cruft onto text, except at that stage in
processing when that text is just about to be inserted into HTML.

Don't do HTML-ization prematurely.

You wouldn't encode and store data in Base64 just in case it might be needed
that way in some future processing step.
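In Python, that "escape only at the HTML stage" rule might look like the sketch below; the list standing in for a data store and the function names are illustrative, not from the article:

```python
import html

def store_comment(db, text):
    # Persist the raw text untouched; no HTML-ization here.
    db.append(text)

def render_comment(text):
    # Escape only at the moment the text enters an HTML context.
    return f"<p>{html.escape(text)}</p>"

db = []
store_comment(db, "I <3 cats & dogs")
print(render_comment(db[0]))  # <p>I &lt;3 cats &amp; dogs</p>
```

The stored value stays exactly what the user typed; only the rendering step knows (or cares) that the destination is HTML.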

------
HelloNurse
A good point, but it is obscured by a terrible choice of vocabulary: in this
article "sanitizing input" actually means specifically trying to sanitize
input of the whole system as soon as it is received (quite impossible, given
the open-ended and conflicting nature of sanitization needs) and "escaping
output" actually means sanitizing the input of a specific subsystem for a
specific purpose.

Given a faithfully persisted and assumed unsafe original text, SQL query
builders can turn everything into SQL strings or die trying, XML parsers can
check entity expansion size and other traps, HTML generation templates can
introduce fancy markup to surround arbitrary text, XML generation templates
can escape input wholesale in CDATA sections, and so on. It's the traditional
principle of separation of concerns.

------
dwheeler
I think the title here is misleading. The title is, "Don’t try to sanitize
input. Escape output."

The article itself is _only_ talking about sanitizing user input to "prevent
cross-site scripting attacks". He later on _does_ require input checking:
"Input sanitization is usually a bad idea, but input validation is a good
thing... by all means validate it and return an error if it’s invalid."

It's vital for secure programs to check their inputs and minimize what they
will accept. I _do_ think it's a good idea to reject "&" and "<" when you can.

But I also agree that in most cases you can't completely forbid all inputs
that have HTML metacharacters. In the case of _cross-site scripting_ , the
best countermeasure is output escaping. Many modern frameworks do output
escaping by default; Rails (for example) has done it for years (in Rails, any
"normal" string is automatically escaped when sent back out as HTML). A good
reason to prefer one framework over another is because it has secure defaults;
if your framework doesn't escape by default, you should consider using a
better framework.

You can't depend on just one thing to suddenly make your software secure. You
need to validate inputs so that only _valid_ inputs are accepted into your
program. You need to escape output, because there are often legitimate input
characters that must be escaped. You need to prefer tools that have safe
defaults. Use tools to scan your results, to find what you missed. It's not
rocket science, but it does require a _set_ of approaches; there is no silver
bullet for making secure software.

------
hamilyon2
Escaping output does not always work when, e.g., you have thousands of
integrated systems and don't control any of them or their upgrades.

If you don't filter malicious inputs, they will forever live in your database
and it takes one bad release of some reporting tool somewhere for your users
to become vulnerable.

~~~
wccrawford
If you rely on sanitizing before storing, you can end up with data in your
database that somehow missed being sanitized, or is maliciously entered in
your database.

You _must_ escape the outputs, no matter how hard you try to sanitize the
inputs. Losing "control" of any integrated systems means your system is
vulnerable, even if only to someone at a terminal typing things into the DB
manually.

~~~
hamilyon2
Of course you must escape your outputs, add defense-in-depth layers you don't
even need now, track how data flows in your application, patch every known
vulnerability, sandbox every single thing, implement capability-based
security, and many more. I didn't mean otherwise.

Filtering input is not sufficient. But it is not optional.

------
pwdisswordfish2
> Incidentally, the mother in the xkcd comic says, “I hope you’ve learned to
> sanitize your database inputs.” Which is somewhat confusing, but I’ll give
> Randall the benefit of the doubt and assume he meant “escape your database
> parameters”.

What a strangely roundabout way of saying that programming advice found in a
webcomic may be actually wrong.

(To be fair, that xkcd comic was a product of its time, when ‘sanitisation’
was all the rage.)

~~~
zarmin
colloquial sanitization?

------
winrid
Generally I try to sanitize the input and store the original raw value in case
we find sanitization bugs.

I'd prefer fixing the old data when needed to having overhead on every read,
at least for FastComments...

~~~
winrid
downvotes?

~~~
zAy0LfpBZLC8mAC
Yes. Sanitizing inputs is a bad idea, still.

~~~
winrid
Maybe sanitizing is the wrong word for what I'm doing. For example, I need to
strip marketing/tracking information from URLs before saving them or else
someone coming from Google will have a different URL than someone coming from
FB and then the comments won't load.

I guess I meant normalization.
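That kind of URL normalization could be sketched in Python like this; the tracking-parameter list is a hypothetical example, not FastComments' actual code:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of tracking parameters to drop; adjust for your traffic.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url):
    # Rebuild the query string without tracking parameters, so the same
    # page gets the same key regardless of where the visitor came from.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(normalize_url("https://example.com/post?id=42&utm_source=google"))
# https://example.com/post?id=42
```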

------
lenkite
Always use prepared statements to set query parameters. This will handle 99.9%
of all query use-cases. Constructing dynamic SQL from user input is a fool's
game.
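For example, with Python's sqlite3 module (table name illustrative), the hostile-looking title from upthread round-trips intact because the value never touches the SQL text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT)")

title = "foo' OR 1=1; --"  # hostile-looking input, stored as plain data

# The ? placeholder sends the value out-of-band; it is never spliced
# into the SQL string, so no escaping is needed and no injection is possible.
conn.execute("INSERT INTO posts (title) VALUES (?)", (title,))

row = conn.execute("SELECT title FROM posts").fetchone()
print(row[0])  # foo' OR 1=1; --
```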

------
water42
Don't filter input. Instead, prevent certain characters from being input in
text elements. This is a user experience problem, not a software problem. The
software can validate that a "name" is rejected if it does not follow the
front end validations, but it doesn't need to do any more than that.

Of course, this argument does not extend beyond a "name" field to more complex
fields. But more complex fields are less susceptible to introducing UX
problems if certain characters are sanitized.

------
wurp
The article misstates what 'sanitizing inputs' means.

I agree with posters who recommend passing data as parameters to methods that
don't require sanitized input (e.g. stored procedures or KeyValue APIs).

Also, sanitizing input means transforming input so you retain the original
content, but without escape or control characters. Sanitizing input does not
mean throwing part of the input away (except when you know it is meaningless
in your context, e.g. spaces at the end of a name).

------
dana321
In the example given there, use json_encode($name)

That will encode any data structure properly for output into JavaScript.
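The Python analogue is json.dumps; a rough sketch. (Caveat: neither json_encode nor json.dumps escapes `<` by default, so for inline `<script>` blocks you still need to guard against a literal `</script>` in the data.)

```python
import json

name = 'Robert"; alert(1); //'

# json.dumps produces a valid JS string literal with quotes escaped,
# so the value cannot terminate the string it is embedded in.
snippet = f"var name = {json.dumps(name)};"
print(snippet)  # var name = "Robert\"; alert(1); //";
```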

------
_nalply
It depends.

If you take arguments to some sub-system (examples are database keys like
the id of an entity instance), then you need to sanitize input.

Anyway, today I learnt something. If you have free-form data like text it
makes sense not to sanitize it because in this case sanitizing depends on the
output domain. For example < is dangerous for HTML and ' is dangerous for SQL,
and so on.

------
NohatCoder
A method I have generally found useful is to make a whitelist of safe
characters (something like alphanumeric, comma, dot and space), and escape
everything else. You might escape a bunch of stuff that technically didn't
need escaping, but the method is simple, rock solid, and doesn't mangle
anyone's names.
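One possible reading of that method, sketched in Python; the exact whitelist is an assumption based on the description:

```python
import re

# Whitelist roughly as described: alphanumerics, comma, dot, and space.
SAFE = re.compile(r"[A-Za-z0-9,. ]")

def escape_all_unsafe(text):
    # Every non-whitelisted character becomes a numeric HTML entity.
    # Nothing is dropped -- at worst it is over-escaped, never mangled.
    return "".join(ch if SAFE.match(ch) else f"&#{ord(ch)};" for ch in text)

print(escape_all_unsafe("Østein <3"))  # &#216;stein &#60;3
```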

~~~
DuckyC
My name contains Ø, and I'm guessing I would not be able to enter that with
your method. I would consider it mangling my name if I had to write o or oe.

~~~
NohatCoder
No, escape means keep, in HTML for instance Ø would become &#216; escaped, but
it is still there visible, same as every other character.

~~~
hombre_fatal
This kind of thinking is how your users end up getting emails from your buggy
service like "Hello &#216;stein &amp; friends, ..." and your JSON API
consumers encounter the same silly output.

Don't escape input. Escape based on output. Escaping doesn't mean anything
until you've also specified an output format. It's not always HTML.

~~~
NohatCoder
You are grossly misrepresenting my post, I have said nothing about whether the
escaping should be applied to input or output, please edit or delete your
comment.

------
jiveturkey
amazing that in 2020 this is still so poorly understood.

not as evidenced by TFA (which I didn't bother to read), but by the strong,
uh, "opinions" here on HN.

------
dangerface
Why not both?

I don't filter my input; I just make sure it's sane.

------
ch
One's output is another's input.

------
jancsika
In the example given, why can't the programmer simply set the textContent for
the given element to the arbitrary user input?

------
Jugurtha
"Don't tell me what the fuck to do - tell me what you did and what has worked
for you in a given context."

------
justinator
Why is Billy the Kid, a Wild West outlaw, talking like a pirate?

~~~
benhoyt
Good call! Guess I was thinking of Captain Kidd. Fixed, kind of. :-)

Then again, I'm not sure what Billy the Kid's doing surfing the web in 2020
either...

------
tobyhinloopen
This is a terrible idea.

------
AstralStorm
The sane way is to escape input. Be 8-bit safe by using hex escapes and if
necessary, use specialized Unicode hex collation rules.

Every text-processing step is supposed to have clear escape rules, although
there have been bugs where accidental unescaping does happen.

~~~
varelaz
The problem with escaping is that you need to know what you're escaping for;
escaping is needed almost everywhere (SQL, URL encoding, HTML, JSON, YAML),
and double escaping can break the content.

------
Hitton
Only escaping output has one significant disadvantage. Say that you are
escaping &. You'll get &amp;. Your user then wants to edit the text. You save
the edited text. Now when you escape and output it again, you get &amp;amp;.
Rinse and repeat.

~~~
zAy0LfpBZLC8mAC
No, the browser will decode &amp; in the HTML into the '&' character and will
submit that character back to you if it is part of some form field.
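A quick Python sketch of why the round trip is stable when you store the raw value and escape only on output; html.unescape stands in for the browser's entity decoding:

```python
import html

stored = "Tom & Jerry"

# Render pass: escape the raw value on the way out.
rendered = html.escape(stored)       # 'Tom &amp; Jerry'

# The browser decodes entities before displaying or submitting form
# fields, so what comes back is the raw character again.
submitted = html.unescape(rendered)  # 'Tom & Jerry'

# Saving what came back leaves the stored value unchanged: no &amp;amp;.
assert submitted == stored
```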

~~~
Hitton
Wow, I feel really stupid now. That means I have misunderstood how it works
and have avoided it needlessly till now (not that I do much web app
development).

------
keymone
Bad advice. Input is the boundary of your system. Always protect the boundary
of your system orders of magnitude more than its internals. That's like
programming basics.

~~~
zAy0LfpBZLC8mAC
There is nothing to protect there, the idea that some data is inherently
dangerous is nonsense.

~~~
keymone
RCE due to buffer overflows, google it.

~~~
megous
You don't need to change the data to make it less dangerous. Just fix the
program processing it. It's the program that's dangerous, not the data.

~~~
strictnein
Because you have full control over all programs in an enterprise?

------
ehsankia
So then it's fine for an input that deletes your entire database, as long as
none of that data makes it back out?

~~~
onion2k
A database query is an output. Anything that your code generates and sends to
_something else_ , whether that's a web server answering a user request, an
API call to connect with an external service, or an internal request on the
server, it's all output from your code.

No matter how good your input sanitization is, you still wouldn't ever send an
unescaped query to a database, right? That's because the query is an output.

~~~
throwawayjava
...so the blog post boils down to "sanitize all inputs that don't get piped to
/dev/null; also, there are some good libraries that will do that for you
(...by escaping outputs... but oh btw those only work sometimes of course, and
in other cases, be careful?).

In other words, for the love of god please do sanitize your inputs.

~~~
xenomachina
No.

"Sanitize inputs" means modifying the input before you even know where it's
going. It's fine for stuff like normalizing user input (eg: "strip leading and
trailing spaces") but should not be used to combat things like SQL injection
or XSS.

For issues like SQL injection and XSS you should escape on output. Outputting
HTML? HTML escape, or better yet: use templating framework that does it by
default. Outputting to SQL? SQL escape, or better yet use prepared statements
and pass in your arguments using an API that escapes by default.

In the "sanitize inputs" approach to handling these situations you can't store
"O'Hara <3 Sue" as a value, because you need to "sanitize" the apostrophe for
SQL and the less-than for HTML. In the "escape outputs" approach, you have
"O''Hara <3 Sue" in your SQL, and "O'Hara &lt;3 Sue" in your HTML, and the
user's input is preserved.
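For concreteness, a Python sketch of the "escape outputs" approach with that exact value (table and column names made up): one raw stored string, two output contexts, two different treatments.

```python
import html
import sqlite3

raw = "O'Hara <3 Sue"  # store this untouched

# HTML context: HTML-escape on the way out.
assert html.escape(raw) == "O&#x27;Hara &lt;3 Sue"

# SQL context: pass as a bound parameter instead of escaping by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", (raw,))
assert conn.execute("SELECT name FROM users").fetchone()[0] == raw
```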

~~~
throwawayjava
_> "Sanitize inputs" means modifying the input before you even know where it's
going._

Okay.

That's not how I've ever used that term or seen it used. Prepared statements
are a form of input sanitation. HTML purifiers are a form of input sanitation.
Maybe this lingo is specific to PHP-land?

In any case, "You need to know the semantics of the sink in order to know what
to do with an untrusted source" seems like an obvious truism not worth writing
about.

~~~
onion2k
_"You need to know the semantics of the sink in order to know what to do with
an untrusted source" seems like an obvious truism not worth writing about._

Given how often developers get it wrong, I don't think it's written about
enough.

Also, you say "untrusted source" here. Whether you trust the source or not is
irrelevant. You should still be escaping the output where you use data from it
in order to make sure your outputs are safe - the source could be compromised,
or broken, or sending something valid that you didn't expect. Maybe this isn't
quite so obvious after all.

------
josh_fyi
I think he is wrong on handling code as input and visible output (for sites
like StackOverflow). No need to filter such input. Escaping your strings will
handle that as well. The code <tag/>, for example, will be escaped to
&lt;tag/&gt;, appearing in the rendered page as <tag/> (but not _interpreted_
as <tag/>).

------
skyzyx
This is bad advice. Definitely sanitize on input, then escape on output. You
should never be knowingly storing unsafe input.

~~~
e12e
Actually, rather than sanitize input, I would recommend whitelist and reject
in most cases. ID should be an integer, but you get a string with spaces
around an integer? - error out. There's html in your text input? Error out.

In the vast majority of cases you control the client (web form) - anything
"surprising" will then be an error - or worse; malicious.

In the case of a json service, if the client doesn't submit valid json for
your schema / api - error out.

~~~
duncans
> There's html in your text input? - error out

How do you detect HTML? Less-than/greater-than signs? Users are now banned
from entering Less-than/greater-than signs?

~~~
hombre_fatal
I once registered for a forum to ask a question, but they had it configured so
new users couldn't submit URLs, probably to deal with spam. Their solution was
to reject posts that contained what looked like URLs which means any time you
don't put a space after a period, it's probably a URL. Like "That is
fine.Pizza is good." -> [http://fine.pizza](http://fine.pizza) detected, post
rejected

But it gets worse. I had a code block in my post and it was also detecting
URLs in my code.

`console.log` -> URL detected!

------
skybrian
It seems like this depends on your types? If you are storing user input to a
database using a text field, it's best to assume the field can contain
arbitrary text, since the database allows that. But if you're storing the
input into a number field then it _must_ be parsed as a number and you can
assume it's a number. If you constrained the number field to a certain range
then you can assume the number is in that range.

Storing arbitrary text is common, so we usually need to know how to render it
correctly. There are fancier types and constraints, though.

~~~
dTal
Each piece of code should enforce its own contracts explicitly across
communication boundaries. If your backend relies on x input being parseable as
a number, it shouldn't assume that it is just because you know that it
ultimately came from a number picker in an HTML form - it should check for
itself. This is defensive not just for security, but so that when you change
something later and screw it up you get actual error messages instead of
silent breakage.

------
Izmaki
What is horrible advice is to tell people never to sanitise input, but then
forget to switch the focus to what should be done instead. Too much time is
spent justifying the headline vs. explaining what should be done instead
and why it is more effective.

~~~
z3t4
Instead prevent "injections" by using innerText instead of innerHTML and
parameterize SQL queries instead of concatenating strings.

But you always want to sanitize user input! Ever wondered how the average age
of your user base was so high, only to discover that some users claim they are
several million years old. You don't want to sanitize, you want to sanitYize.
People writing their e-mail as street address and vice versa.
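That kind of sanity check is validation in the reject-don't-rewrite style; a rough Python sketch (the field and bounds are arbitrary examples):

```python
def validate_age(raw):
    # Validate, don't rewrite: reject nonsense instead of "fixing" it.
    try:
        age = int(raw)
    except ValueError:
        raise ValueError(f"not a number: {raw!r}")
    if not 0 < age < 130:
        raise ValueError(f"implausible age: {age}")
    return age

print(validate_age("42"))  # 42
# validate_age("4000000") would raise ValueError: implausible age
```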

------
emayljames
The examples are very outdated. In PHP you would convert the offending
characters to HTML entities >at input, yes..input<. It then only needs to be
filtered >one time<.

~~~
function_seven
So the SMS message I send from my service would look like this?

 _Alert: Bob &amp; Jane O&apos;Brien just commented on your recent post_

If you store HTML in your database, now you have to convert from that to
whatever you're outputting to later. _And_ you haven't covered the dangerous
characters for other contexts.

Just store whatever the user sent you. When outputting to various formats,
convert accordingly.

~~~
emayljames
This whole article is based on a DB row being used for everything. That is
just not reality. From a cost point of view, if you're gonna just use that
data a couple of ways, it is pointless to constantly convert it coming out.
Even then, if that was the case, convert from entities, no more costly than
what you are saying.

------
throwawayjava
This feels like a distinction without a difference.

Escaping outputs is just one way of sanitizing inputs. Sometimes it works.
Sometimes it doesn't. The author of this post even realizes that their
prognostication is not general and then offers the advice to "be sure to get
security review"...

At the end of the day, you need to make sure that any untrusted source is
treated in a safe way by every sink and does not otherwise interfere with
system specs (e.g., mangling user output). Whether that happens at line 5
(where the input is read) or line 155 (where the command is generated) doesn't
really matter. Or to be more precise, is determined by whatever design
patterns the framework developer chose.

What matters at the end of the day is that command injection isn't possible
and the system's specs (including UI/UX specs) are respected.

Crucially, both input and output constraints are informed by the nature of
both the source and the sink. Hence the existence of libraries like DomPurify
and HTMLPurifier, which consider one very particular type of sink. Sometimes
you will write code in domains where others haven't written excellent
libraries but where sanitization (of either input or output) is needed. E.g.,
embedded systems.

I'd replace the author's advice with "carefully specify the semantics of your
sources and sinks", which is ultimately what the author's actual advice
(basically, "use trusted libraries and, when not, be sure to get security
review") boils down to.

~~~
tptacek
Not really, no. Output filtering is done in the context of a specific output
domain. Input sanitization isn't; the developer who builds sanitization has to
guess at all the possible output domains.

"Filter outputs not inputs" is a very old appsec truism.

~~~
throwawayjava
Output filtering _is_ input sanitization. wtf is it that you think you are
filtering? Inputs!

 _> the developer who builds sanitization has to guess at all the possible
output domains._

No they don't. They need to carefully understand/document all the places input
might be used and ensure no command injections are possible. In some cases
(e.g., web apps, where everything is string) that works relatively well...

Until, of course, you're the one writing the input sanitization logic in the
HTML purifier / prepared statement generator. And those code bases do have
occasional CVEs. So a random PHP dev can put faith in a library, but the
system itself never gets away from having to sanitize input!

Output filtering has the complementary problem -- you need to understand every
possible input. That's not always trivial like it is in PHP-based websites.
Think about e.g. an embedded system sanitizing potentially adversarial time
series data (what does that mean / how do you detect it? Harder, right?). Or a
compiler. The blog post author even points this out: "...In these cases you’re
best off using a proper SQL parser (like this one) to ensure it’s a well-
formed SELECT query – but doing this correctly is not trivial, so be sure to
get security review."

Ultimately, "Filter outputs not inputs" is incomplete advice that kinda sorta
works well for the most part in web apps. The correct advice is, again,
"carefully specify the semantics of your sources and sinks".

~~~
wglb
>you need to understand every possible input.

This is often not possible.

When I talk to developers about this, I point to database storage as an
example. There may be computations behind the scenes that mangle the nicely
input-sanitized database contents. Concatenation with other values, string
work, data from some other system. Thus, data that was sanitized upon input is
now questionable for output.

 _This is well-intentioned, but leads to a false sense of security, and
sometimes mangles perfectly good input._

And in some applications, for example, ones that must process data in a
forensic environment, any change to the input is prohibited.

Thus, the only useful way to think about this is that the contents of the
database are toxic and must be sanitized on output. Simply working with the
input gives the programmer no useful idea about what is in the database when
it comes time to output it.

Frameworks these days help significantly by providing tools to properly
parameterize SQL. However, it is unlikely that they handle all the cases.
Consider an example where user input from a web page is used to build a column
name or table name. This isn't covered by frameworks. That needs to be
carefully processed in the code.
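
A hedged sketch of what that careful processing might look like, using
Python's sqlite3 module (the table schema and allowlist here are made up for
illustration): values go through placeholders, but identifiers like column
names can't be parameterized, so they get checked against a known-good set.

```python
import sqlite3

# Hypothetical schema: only these columns may appear in ORDER BY.
ALLOWED_SORT_COLUMNS = {"created_at", "score", "title"}

def fetch_posts(conn, author, sort_column):
    # Identifiers can't be bound as parameters, so validate via allowlist.
    if sort_column not in ALLOWED_SORT_COLUMNS:
        raise ValueError("unexpected column: %r" % sort_column)
    # Values go through ? placeholders; the driver handles quoting.
    query = "SELECT title FROM posts WHERE author = ? ORDER BY " + sort_column
    return conn.execute(query, (author,)).fetchall()
```

An injection attempt in the author value is just a weird author name, and an
injection attempt in the column name is rejected before any SQL is built.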

>Ultimately, "Filter outputs not inputs" is incomplete advice that kinda sorta
works well for the most part in web apps. The correct advice is, again,
"carefully specify the semantics of your sources and sinks".

It is in fact the primary advice that should be followed.

So sanitization of input is a good idea, but if output is not properly
encoded, somebody else is likely to profit.

~~~
throwawayjava
Sorry, this still seems like a terribly hacky way to think about code.

Again, if you write a template engine or a SQL engine, the code the library's
developer writes to determine how holes are safely filled _is literally
sanitizing input!_ You _never_ get away from sanitizing inputs, you just do it
further from the source and closer to the sink.
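
As a toy illustration of that point -- a sketch, not how any real template
engine is implemented -- here is a hole-filler whose escaping sits right next
to the sink, yet exists entirely to neutralize untrusted input:

```python
import html

def fill(template, **values):
    # The escaping happens beside the sink (the HTML string being built),
    # but what it defends against is the untrusted input itself.
    escaped = {k: html.escape(str(v), quote=True) for k, v in values.items()}
    return template.format(**escaped)
```

Calling `fill("<b>{user}</b>", user="<script>...")` produces entity-encoded
output; whether you call that "output escaping" or "input sanitization" is
exactly the distinction under dispute.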

 _> So sanitization of input is a good idea_

Right. "Don’t try to sanitize input" is bad advice. Also, the whole point of
escaping outputs is that you don't trust inputs. Escaping outputs is done to
sanitize inputs.

If by "sanitize input" you mean "add some backslashes to $_GET values like
it's 1995", well, I guess, point taken. But then, the actually good advice
should be "step back and learn how to think more systematically about your code",
not "escape outputs instead of inputs!"

