Don’t try to sanitize input, escape output (2020) (benhoyt.com)
128 points by maple3142 6 days ago | 127 comments

I think a better way to think of this may be in terms of canonicalization. Inside your application, you should decide on a single canonical way to represent data, one which fits the type of processing and expected use of the application. For example, you might decide that all strings should be UTF8, and should be interpreted (and stored) as whatever the user initially wrote. You might decide that any structured data should be parsed and then stored as protobufs in a BigTable. Or you might decide that an RDBMS is your native datastore and use whatever the native string encoding is for it, as well as parse & normalize data into tables upon input.

Then, whenever you take input, your job is to validate and encode it. If you get a Windows-1252 string, you should re-encode it to utf8 for further storage. If it has data that are invalid UTF-8 codepoints, you should either strip, replace with a replacement character, or notify the user with a validation failure. Same with structured data that fails your normalization rules - you should usually notify the user.
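A minimal Python sketch of that input step, assuming the caller already knows (or has been told) the source encoding. The function name `to_canonical_text` and its `strict` switch are hypothetical, chosen only to show the two options mentioned above: reject with a validation failure, or substitute a replacement character.

```python
def to_canonical_text(raw: bytes, declared_encoding: str = "utf-8", strict: bool = True) -> str:
    """Decode incoming bytes into the application's canonical str form.

    strict=True  -> raise ValueError so the caller can report a validation failure
    strict=False -> substitute U+FFFD for bytes that cannot be decoded
    """
    try:
        return raw.decode(declared_encoding, errors="strict" if strict else "replace")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid {declared_encoding}: {exc.reason}") from exc


# A Windows-1252 payload is decoded once at the boundary and stored as UTF-8.
text = to_canonical_text("café".encode("cp1252"), declared_encoding="cp1252")
stored = text.encode("utf-8")
```

Everything past this function can then assume ordinary text; no other part of the application needs to know which encoding the bytes arrived in.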

And when you send output, you should escape based on the intended output device. If you're putting it in an HTML page, HTML-escape it. If it's a URL, url-encode it. If it's a database query, SQL escape it. If it's a CSV, quote it.
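As a rough illustration using only the Python standard library, here is the same stored value escaped three different ways depending on where it is going. The sample value and URL are made up; the SQL case is left out because it is better handled by parameters than by escaping.

```python
import csv
import html
import io
from urllib.parse import quote

name = "Tom & \"Jerry\" <tj@example.com>, Esq."

# HTML context: &, <, > and quotes become entities.
html_fragment = f"<p>{html.escape(name)}</p>"

# URL path-segment context: percent-encode everything outside the unreserved set.
url = "https://example.com/users/" + quote(name, safe="")

# CSV context: the csv module quotes fields that contain commas or quotes.
buf = io.StringIO()
csv.writer(buf).writerow([name, 42])
csv_line = buf.getvalue()
```

The stored value never changes; only the representation written to each output does.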

Thinking in these terms keeps the internal logic of your application simple (there are no format conversions except at system boundaries), and it also gives you a lot of flexibility to preserve the user's intent and add new output formats later.

> If it's a database query, SQL escape it.

No, you parameterize it.

At least our db doesn't support parameterized lists for use with IN operator. Not often I need this but when I do I'm glad our db layer has macros for SQL with automatic string escaping.

Which db is that? Just curious, the few I have used take (?,?)

I assume they're on about an unbounded, pass array in situation (?) rather than having to dynamically build that part of the query (?,?,?...).

Done both, manually doing it looks messy, and higher chance of typos.

I've done the latter, in JDBC and I built the (?,?,...) part programmatically. However you have to be aware how many ?s are elsewhere in the PreparedStatement such that filling the query with the array contents works correctly. It'd be a much better design decision if JDBC actually supported named parameters (as JPQL does).
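For anyone who hasn't had to do this, a sketch of building the (?,?,...) part programmatically. It uses Python's sqlite3 rather than JDBC, and the table, columns, and function name are invented for the example. Only placeholders are interpolated into the SQL text; the values still go through parameter binding.

```python
import sqlite3


def select_by_ids(conn: sqlite3.Connection, ids: list[int], status: str) -> list[tuple]:
    """Run an IN (...) query with one placeholder per list element."""
    if not ids:
        return []  # "IN ()" is not valid SQL in many databases, so short-circuit
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, name FROM items WHERE status = ? AND id IN ({placeholders}) ORDER BY id"
    # Parameter order must match placeholder order: status first, then the ids.
    return conn.execute(sql, [status, *ids]).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "a", "open"), (2, "b", "open"), (3, "c", "closed")],
)
rows = select_by_ids(conn, [1, 3], "open")
```

The bookkeeping mentioned above shows up in the `[status, *ids]` line: every other `?` in the statement has to be accounted for in the right position.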

Sure but it also has a limitation of ~2000 parameters. And it's kind of a PITA to dynamically generate and set the parameters compared to just setting one parameter.

Sybase SQLAnywhere btw.

100% agree, it is ugly. I misunderstood thinking it would not allow them in the “in” statement.

Why on earth would you need more than 2000 parameters?

The list I pass to IN can have more than 2000 entries, and if I were to use parameters and my database doesn't support array parameters as mentioned...

As for why more than 2000 entries in an IN list, well, some customers generate a fair bit of data, and our program allows for bulk updating via pasting from Excel or similarly tabulated data. Depending on the case, it's necessary to grab all the entries from the db and work on them as a batch, rather than one-by-one.

Why not pass that as json or xml and then join the data?

For our DB that would require a temporary table, in which case we could just do a LOAD TABLE statement. But we do do that as well, but it's a lot more overhead than just setting a single parameter, or macro when using IN.

so you would prevent stored XSS attacks by escaping on the output step instead of the canonicalization step

Right - the way to avoid XSS is to escape on output.

Most good template languages these days implement auto-escaping of variables that are interpolated into HTML.

You still have to be careful embedding content into non-HTML contexts. One classic example there is outputting a blob of JSON inside a <script> tag - you need to make sure that you handle the case where a string could contain "</script><script>evil_code_here()</script>".
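One common way to handle that case, sketched in Python: serialize with the JSON encoder, then replace the characters that matter to the HTML parser with JSON unicode escapes. The helper name `json_for_script_tag` is hypothetical. The replacement is safe because `<`, `>` and `&` can only occur inside string literals in JSON text.

```python
import json


def json_for_script_tag(value) -> str:
    """Serialize value as JSON that can sit inside <script>...</script>."""
    return (
        json.dumps(value)
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


payload = {"bio": "</script><script>evil_code_here()</script>"}
page = f"<script>const data = {json_for_script_tag(payload)};</script>"
```

The browser's HTML parser never sees a closing script tag inside the data, while the JavaScript parser decodes the escapes back to the original string.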

React (technically JSX) has a nice feature where all output is escaped. So this doesn't work:

  const evil = "<script>alert('')</script>";
That'll output:

If you must output a raw HTML string, they make you acknowledge that you're aware of what you're doing:

  <div dangerouslySetInnerHTML={{__html: evil}} />

Also, in languages that do not treat HTML as a simple string (looking at PHP and many others), or that have libraries for exactly that, any data placed inside an HTML element as text is automatically escaped as text, with no overhead for the developer.

> Thinking in these terms keeps the internal logic of your application simple (there are no format conversions except at system boundaries),

This is an excellent point but are there any devs who don't do this? This seems like such an obvious thing to do. I mean I guess if you're dealing with tech debt and you want to upgrade from ASCII/bytes to UTF-8 you will not have this invariant for a short while but why would you not maintain the invariant in new code?

The fundamental problem is attempting to conflate a bunch of semantically-distinct things, just because they might happen to (sometimes) be represented in memory by similar byte sequences.

Such 'byte coincidences' lead to lazy, non-sensical operations, like "append this user-provided name to that SQL statement"; implemented by munging together a bunch of bytes, without thought for how they'll be interpreted.

A much better solution is to ignore whether things might just-so-happen to be represented in a similar way in memory; and instead keep things distinct if they have different semantic meanings (like "name", "SQL statement", "HTML source", "shell command", "form input", etc.). That way, if we try to do non-sensical things like appending user input to HTML, we'll get an informative error message that there is no such operation.

This isn't hard; but it requires more careful thought about APIs. Unfortunately many languages (and now frameworks) have APIs littered with "String"; ignoring any distinctions between values, and hence allowing anything to be plugged into anything else (AKA injection vulnerabilities)
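A small Python sketch of the idea, with made-up names (`Html`, `text`): markup gets its own type that is deliberately not a `str`, so appending raw user input to it fails loudly, and the only route from text to markup goes through escaping.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """Markup that is already safe to emit; deliberately not a str subclass."""
    markup: str

    def __add__(self, other: "Html") -> "Html":
        if not isinstance(other, Html):
            # Refuse to mix raw text into markup; the caller must go through text().
            raise TypeError("cannot append raw str to Html; wrap it with text()")
        return Html(self.markup + other.markup)


def text(value: str) -> Html:
    """The only way to turn untrusted text into Html: escape it."""
    return Html(html.escape(value))


greeting = Html("<b>Hello, </b>") + text("<script>alert(1)</script>")
```

Writing `Html("<b>Hello, </b>") + user_input` raises a TypeError instead of quietly producing an injection bug. Python only checks this at runtime; a statically typed language would reject it at compile time.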

It's cool to see how these posts are becoming less and less important in the wake of today's frameworks/tools protecting devs by default.

From ORMs escaping SQL, to FE frameworks escaping html/js, to browsers starting to default to same-site=lax. It feels like we've slowly pulled ourselves out of OWASP hell. Pretty nice to see!

Obviously it's still important (see log4j) to know it all especially when it's not so clear cut, but still good progress.

I think we really failed in earlier eras to get it right due to the momentum of the frameworks.

I would liken it to some of the crap building materials that were allowed in the past as new, cheap alternatives but subsequently showed failure or hazards after short service lives. Contractors were tasked with implementing these materials to stay within budget and everyone suffered the effects later.

> The parallel for SQL injection might be if you’re building a data charting tool that allows users to enter arbitrary SQL queries. You might want to allow them to enter SELECT queries but not data-modification queries. In these cases you’re best off using a proper SQL parser [...] to ensure it’s a well-formed SELECT query – but doing this correctly is not trivial, so be sure to get security review.

If you are ever in this situation, you should actually use a dedicated read-only user that can only access the relevant data. If you need to hide columns, use views. Trying to parse SQL can easily go very wrong, especially when someone (ab-)uses the edge cases of your DB.
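To make the principle concrete without a database server: SQLite has no user accounts, so this Python sketch substitutes a read-only connection (`mode=ro`) for the dedicated read-only user. On a server database the equivalent would be a role that has only been granted SELECT on the relevant tables or views. The file name, table, and `run_user_query` helper are invented for the example.

```python
import os
import sqlite3
import tempfile

# Throwaway database file with some data (stand-in for the real datastore).
path = os.path.join(tempfile.mkdtemp(), "charts.db")
admin = sqlite3.connect(path)
admin.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
admin.execute("INSERT INTO sales VALUES ('north', 10), ('south', 20)")
admin.commit()
admin.close()

# User-supplied SQL only ever runs on this read-only connection.
readonly = sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def run_user_query(sql: str) -> list[tuple]:
    return readonly.execute(sql).fetchall()


totals = run_user_query("SELECT region, amount FROM sales ORDER BY region")

try:
    run_user_query("DELETE FROM sales")
    write_blocked = False
except sqlite3.OperationalError:
    # The database engine, not a hand-written SQL parser, rejects the write.
    write_blocked = True
```

No attempt is made to inspect the query text; the restriction is enforced by the database itself.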

Discussed at the time:

Don’t try to sanitize input – escape output - https://news.ycombinator.com/item?id=22431022 - Feb 2020 (280 comments)

This solution doesn't match the problem. Even the SQL injection example shows him sanitizing the input, which is at odds with the title of the post. Log4J is a more recent example of it being too late/useless to escape the output.

This is an example of why the term "sanitize" just brings confusion and leads to incorrect software. If we say "escape" (for concatenation) or "parameterize" (for discrete arguments) instead, then there's no confusion: we know that it should be done at the point of use, because the procedure for doing so depends on that use.

Calling it "sanitization" implies that the data is somehow dirty, so naturally it should be cleaned as soon as possible, and after that it's safe. But all that accomplishes in general is corrupting the data, often in an unrecoverable way, and then opening up security vulnerabilities because the specific use doesn't happen to exactly match the sanitization done in advance.

It's great to validate the data on input and make it conform to the correct domain of values, but conflating this with output formats and expecting this to take care of downstream security as well just leads to incorrect data along with security vulnerabilities.

PHP's long-ago-removed magic quotes feature was an example of this confusion in action. It not only mangled incoming strings containing single quotes in an effort to prevent SQL injection, but did so in a way that left some databases completely exposed, depending on their quoting syntax.


SQL injection is avoided at the point of usage. Trying to sanitize your input against it is an extremely bad practice. The same is true about HTML injection (whether you call it XSS or something else).

Log4j is an example of not interpreting text that the developer was never aware was code. It's kind of the extreme opposite of escaping your text on usage.

The article says DON'T sanitize when putting it into the database. I think contextual escaping counts as "sanitizing input", so the solution of "don't try to sanitize input" is undermined.

If the user says his name is "Bob'; drop tables students --", that is what you should store on your database. Unless, of course it's not a valid name for the rest of the system.

That's such old and obvious advice that I'm surprised people keep posting it here and upvoting. And even more surprised when people keep disagreeing here.

If you're storing "Bob'; drop tables students --" in the database, you had to have sanitized your inputs, or there would be no students table.

The article title says NOT to sanitize inputs. perhaps it's that nuance doesn't fit in a headline, but eh...

The confusion is what is input and what is output. The string "Bob'; drop tables students --" should not be sanitized/encoded on *input* to the application. However, if you're not using parameterized queries, it should be encoded on *output* to the database.

Data should only be sanitized in transit and not stored in a sanitized form. That's what the article is really saying.

No you don't. You use a parameterized query: execute("INSERT INTO foo VALUES (?)", user_input)
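Spelled out as a runnable Python sqlite3 sketch (the `students` table is assumed for the example): the name goes in as a bound parameter, is stored verbatim, and the table is still there afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

user_input = "Bob'; drop tables students --"
# The value travels as a bound parameter, not as part of the SQL text.
conn.execute("INSERT INTO students (name) VALUES (?)", (user_input,))

stored = conn.execute("SELECT name FROM students").fetchone()[0]
```

Nothing was stripped or escaped in application code, and `stored` is exactly what the user typed.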

I interpreted the message as not sanitizing inputs at the point they are received, a la PHP magic quotes. Instead, escape at the output (the output to the database engine).

> a la PHP magic quotes

Up to this day, the official way to deal with XSS in .Net is by doing sanitization at the receiving point. I imagine the article is directed at that.

That sounds pretty terrible, do you have an example of some docs which demonstrate that practice?

Nowhere in the article do they use "output" to mean from the database engine; they use it to mean "outputting HTML".

The article doesn't explicitly say the words "outputting SQL to the database engine", but that's because the focus is on XSS attacks and the part about SQL injection is just an aside. Clearly it's what they were trying to imply with language like this:

> The only code that knows what characters are dangerous is the code that’s outputting in a given context. And of course use your SQL engine’s parameterized query features so it properly escapes variables when building SQL: ... This is sometimes called “contextual escaping”.

The "context" is that you are outputting to the database engine.

  > your SQL engine’s parameterized query features so
  > it properly escapes variables when building SQL
This is wrong. Parameterized queries do not build an SQL string by escaping the input. The input is actually sent to the database separately from the SQL.

Well, in all sane implementations, anyway. PHP has a PDO::ATTR_EMULATE_PREPARES option that does build SQL from a parameterized query. And, of course, Wordpress has $wpdb->prepare() that returns an SQL string with the parameter escaped. Also, so far as I know, one cannot run a prepared statement from the SQLite CLI, so no parameterized queries there either:


>This is wrong. Parameterized queries do not build an SQL string by escaping the input. The input is actually sent to the database separately from the SQL.

Your blanket observation is not necessarily true of all databases or database drivers. You found three counter-examples yourself, but there's no reason to not consider them "sane". It's not less correct than for databases that do support prepared statements in the driver protocol.

Sure, maybe it does not literally send a substituted SQL string, but in order to send the parameters "separately" from the query, do they not still eventually get concatenated into a single binary string of some form to be sent across the wire? In spirit I think the same arguments apply there, it's just that the format of the data is not strictly SQL. It's actually the wire format of the database protocol.

You are correct that the parameters go across the wire, obviously, but I've never heard of an attack in which the parameters caused any type of compromise in the wire protocol. I would highly appreciate examples if any exist.

It probably wouldn't result in an attack (unless you were dealing with a really sophisticated attacker), it's just necessary for correctness. Which is also true of all these examples: for example, people won't appreciate having backslashes wrongly inserted around legitimate characters of their names or other personal information, or having the software fail to process their request due to the characters in their name. It's not just a security concern.

In the general case there are certainly many examples of security vulnerabilities created by wrong serialization of data into the wire protocols of services, but maybe not specifically for this situation of query parameters. But maybe there are, I have no idea really. Either way, it's not the application developer's responsibility at that point, it's the responsibility of the people who developed the database driver.

For a long while, input sanitization in the web world was about modifying inputs to strip the problem areas. As such many consider escaping and sanitization to be completely different practices.

It seems like this article is using this differentiation. In my experience, it's very common. It's not worth arguing about.

The article is specifically about sanitizing inputs to prevent XSS attacks. Sanitizing input isn't a great defense against that; you need a defense that better matches the attack.

Validating or sanitizing input is a reasonably good defense against certain other things. E.g. zeroes in values you'll later divide by, when it's too late to return an error; multi-gigabyte names; information that you want to avoid storing like credit card numbers. That sort of use case doesn't really have a whole lot to do with the article, though.

Yeah little Bobby Droptables is still a thing.

What are you referring to? The SQL injection example is showing what not to do.

"So the better approach is to store whatever [data] the user enters verbatim, and then have the template system HTML-escape when outputting HTML"

With this logic, someone could use a SQL injection. It wouldn't be sanitized as the INSERT is happening, so the SQL injection would be executed.

EDIT: I know he goes on to talk about escaping characters, but the title of the post is "Don't try to sanitize input". My point is simply that SQL injections happen on input, not output. His example of escaping the SQL is at odds with the title of the post.

They're calling the SQL query "output" (from the app to the DB server). The point is that the "bad characters" depend on the context, so it's the step where you combine trusted and untrusted data that you need to think about escaping or validating.

No they're not. They're using the word "output" to mean "back into the HTML".

"So the better approach is to store whatever name the user enters verbatim, and then have the template system HTML-escape when outputting HTML, or properly escape JSON when outputting JSON and JavaScript."

The sentence immediately after that is "And of course use your SQL engine’s parameterized query features so it properly escapes variables when building SQL"

Most SQL systems have bind parameters for this sort of thing. That is a form of encoding the input. You have to encode the SQL values as well. You're basically saying if you don't use the suggested technique, the suggested technique doesn't work. Well, yeah. It has to be used consistently, all the time, every time.

Unfortunately, that's just life. There's no way around it. One way or another you're going to be doing something or you're going to get owned.

They show the solution of using parameterized queries to store the user input verbatim. What is an example of the attack you have in mind?

Sorry, how does this happen if you’re using DB parameter in the query string?

Shameless plug: NEVER Sanitize Your Inputs (by me, 2013) https://billpg.com/never-sanitize-your-inputs/

Sanitizing inputs is not what you realistically want. You should prohibit certain types of input. Whitelisting strings is what I would call it.

You should escape outputs, of course (not that anyone in 2022 thinks otherwise).

Escaping outputs alone won't work because user inputs will be stored in some database and you can't realistically predict how, when, or where they will be used, even years in the future. A user name could one day be used as a filename, opening up the possibility of a shell-based exploit. It could trigger a little-known spreadsheet formula vulnerability when exported for analysis. Novel, interesting XSS attacks are common and produced every day. It might not even be your code, but code that your client or partner organisation runs. You just never know.

One common defence is user names (and other freeform fields) should not be allowed to be arbitrary bytes.

That is defence in depth, an established practice.

Agree and Disagree. Sanitization has its place, but from a user perspective it's better to just outright reject (through validation) inputs that aren't valid.

There are often unexpected ways that data gets into the system (IT manually adding data, internal support tool to help customers add data, etc.) You need to ensure that you're properly sanitizing your input at every single input faucet and your sanitization has to predict how, when, and where it will be used by sanitizing for dangerous characters in filenames, shell, spreadsheet formula vulns, and XSS attacks.

Instead, (Or In addition to) just make the assumption that data in the database is dangerous, and ensure that you properly escape for your use case when using that data.

Using a username to create a new file? Escape for filenames based on which OS/language you're using.

Using birthdates in an excel file? Escape for excel formulas.

Using bio on an HTML page? HTML Escape.

Using username as part of a URL path? URL Escape.

And finally circle back to the fact that sanitization where you change user input without their knowledge (like the "O'brien" -> "Obrien" example in the article) makes for a frustrating user experience.

I agree, when your app does exporting, use escaping and be happy. Nobody ever challenged that. But that is not enough. You should do defence in depth. What I am saying is that you can't realistically escape for every use, because

1) once it is stored, it is usually outside of your control. You simply do not know where your data will end up, due to e.g. new integrations that will be developed in future.

2) you may not even know the proper escaping rules for the document types you are producing, due to software obscurity. Nobody I can think of escapes any csv files for excel-2001 vulnerabilities. This is just one example of software where those files can actually end up opened.

What is more economical/rational to change, your input validation or every csv/excel exporter/converter ever in existence?

Your argument against escaping is the same one against input sanitization. Literally you cannot know where all your data will go when storing it so how do you even begin to sanitize that data? No special characters? Alphanumerics only? You're going to have a bunch of modified data that users are unhappy with to start. Furthermore you have the same problem with simply not knowing all of the ways that data will get into your system, so you won't be able to reliably sanitize all of your inputs. (e.g. manual entry, internal tools, random scripts, etc.)

I understand defense in depth and by all means doing some sanitization up front can help increase your defensive posture (Often at the cost of user experience), but the proper and effective way to protect any of 100000s of use cases for your data is to ensure that each use case is escaping data relevant to that use case.

Direct responses:

1. If you simply store the data, it is not your responsibility to ensure that the data is "sanitized", but simply to ensure that the data follows the expected format. Rather *the onus of security is on the person actually Using the data.*

2. I have tested applications and written up findings where customers use internal data to generate excel files w/ PII in them, and one user could steal the PII of employees from another customer due to CSV injection. Yes these attacks are weird and niche, and even if this one attack doesn't matter, it's just an example of various weird attacks. If you are writing software where you take user data and put it into strange filetypes or systems, you should be researching those systems and writing code that doesn't break or have unexpected issues. You can find the necessary technical documents to figure out how to escape things. (https://owasp.org/www-community/attacks/CSV_Injection)
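For the CSV injection case specifically, a Python sketch that loosely follows the mitigation described on the OWASP page linked above: at export time, prefix any cell a spreadsheet would treat as a formula with a single quote, and quote every field. The function names and sample rows are made up, and the trigger list is an assumption worth checking against the current OWASP guidance.

```python
import csv
import io

# Leading characters that spreadsheet applications may interpret as a formula.
FORMULA_TRIGGERS = ("=", "+", "-", "@", "\t", "\r")


def neutralize_formula(cell: str) -> str:
    """Prefix cells a spreadsheet would evaluate so they are read as text."""
    if cell.startswith(FORMULA_TRIGGERS):
        return "'" + cell
    return cell


def export_rows(rows: list[list[str]]) -> str:
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow([neutralize_formula(c) for c in row])
    return buf.getvalue()


out = export_rows([["alice", "=HYPERLINK(\"http://evil.example\")"], ["bob", "hello"]])
```

The stored data is untouched; the prefix exists only in the exported file. One side effect to be aware of: legitimate values such as "-5" get the prefix too, so numeric columns may need separate handling.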

Finally, input validation is not input sanitization. Validation is making sure it conforms to your expected data-type (e.g. Username cannot contain special characters.) Sanitization is when the application modifies data that the user has submitted - e.g. stripping off special characters from names (O'Brien => OBrien).
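The distinction in code, as a small Python sketch with hypothetical rules (the 3-32 character username pattern is just an example): validation either accepts the value unchanged or rejects it, while sanitization silently hands back something the user did not type.

```python
import re

USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,32}")


def validate_username(value: str) -> str:
    """Validation: accept the value unchanged or reject it; never alter it."""
    if not USERNAME_RE.fullmatch(value):
        raise ValueError("username must be 3-32 letters, digits, or underscores")
    return value


def sanitize_name(value: str) -> str:
    """Sanitization: silently rewrites the input (shown only for contrast)."""
    return re.sub(r"[^A-Za-z ]", "", value)


mangled = sanitize_name("O'Brien")  # the apostrophe is dropped without telling anyone
```

With validation the user gets an error they can act on; with sanitization they find out later that their name is stored as "OBrien".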

You keep saying "Do defense in depth" but your argument is why sanitization is better. Do both, but ALWAYS escape if you have to choose.

Thank you for the good answer. English is not my main language, it is fully my fault if I didn't make my position clear.

I fully agree with everything you said. Always escaping output is the right thing. Sanitization is not what you want, like, ever.

> Literally you cannot know where all your data will go when storing it so how do you even begin to sanitize that data?

You construct a minimum regular language your inputs should conform to. In programming parlance this is called regex.

This requires knowledge of your domain and, yes, a bit of extra work.

So, the comment field on payment becomes an alphanumeric with certain unicode characters. A pain for a user who wants to send code examples, sure. But should code examples be in payment comment field?

And so on.

Identifier field? Allow only latin letters and numbers.

Custom css user wants to use? Restrict to valid css.

Name of some goods to sell? First character is alphabetical, rest alphanumeric and [-#"'*|:,] but never two consecutive special characters, even with spaces between.

Pdf? Construct a reasonable pdf regex and match for it.

You should publish your regexp and guarantee your database contents should adhere to it.
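Two of the rules above written out as Python regexes, as one possible reading rather than the definitive one. The sketch assumes ASCII letters only and assumes spaces are allowed in goods names (the "even with spaces between" wording suggests so); the constant and function names are invented.

```python
import re

# Identifier: Latin letters and digits only.
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9]+")

# Goods name: starts with a letter; then letters, digits, spaces, or one of
# - # " ' * | : ,  where a special character may not be followed (even after
# spaces) by another special character.
SPECIAL = r"""[-#"'*|:,]"""
GOODS_NAME_RE = re.compile(rf"[A-Za-z](?:[A-Za-z0-9 ]|{SPECIAL}(?! *{SPECIAL}))*")


def is_valid_identifier(value: str) -> bool:
    return IDENTIFIER_RE.fullmatch(value) is not None


def is_valid_goods_name(name: str) -> bool:
    return GOODS_NAME_RE.fullmatch(name) is not None
```

`fullmatch` is used so the whole input has to conform; a pattern that only matches a prefix would let trailing garbage through.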

I think this is practical and reasonable thing to do. It would fail against a determined sophisticated attacker, but at least it might give it hard time, and make attack a bit more detectable and a bit less effective.

That works well for things you can limit to alphanumeric, which is pretty much only usernames. For everything else there will be an exploit in some context without proper escaping. You can decrease the attack surface, but you have to weigh that against the false sense of security it might give developers.

If you are echoing a user's input back to them, what's the threat model that requires you to sanitize the output?

That said, it's obviously not worth building a "don't sanitize this" filter for that case.

Do both pls.

If you're doing both, I'd ask you what you think you're accomplishing by sanitizing input, especially when you're already escaping output.

All you're doing is corrupting the data with a ritual that seems like it's securing something, and it tends to make you think that your data is now ready to be rendered anywhere without issue.

I can't emphasize this enough. This isn't a matter of taste, like, maybe you sanitize, maybe you escape on the way out, it's all good, it all works, it's just a matter of opinion.

Sanitizing the input is wrong. Actively, objectively, unrecoverably wrong. Once you've destroyed your data you can't get it back. Huge amounts of effort have been wasted by people trying to fix and recover data that was destroyed by systems "helpfully" "sanitizing" data. God help you if you have a sequence of these systems in a row each doing their own "sanitization" before you get the data.

Do not "sanitize" your inputs. Do not tell other developers to sanitize their inputs. Do not sagely spout off on HN about the importance of sanitizing your inputs. It is wrong.

The only "sanitization" that should be done is that when encoding to the output there are sometimes things that should simply be removed. For instance, a good HTML escaping function probably ought to entirely drop nulls, not even encoding them as &#00; or anything, just drop them. Some of the other ASCII characters are straight-up illegal in HTML as well, even encoded. But all that sort of "sanitization" should be in the escaping step. If you want to reject null characters at input time, that's part of validation, not sanitization.

_Some_ sanitisation is fine. For instance, stripping leading and trailing space in some fields, case normalisation, automatic insertion of spaces in credit card numbers, that kind of thing. That is to say, you should sanitise as an affordance to the user. Given the choice between presenting an error to the user and automatic sanitisation, the latter is preferable. It's something that should be done carefully, but it's still good.

Thoughtless sanitisation is a whole different kettle.

I agree that cleanup is acceptable, and there's certainly some wiggle room in what people call cleanup vs. sanitization and such.

But when people chant "sanitize your inputs" and expect it to be treated as sage wisdom, it's in a security context, and it is wrong in that context. Sanitization is not a valid security tool. Mind you, you might be forced into it if your back is against the wall and you're working on other code that is broken and you can't fix that other code's broken failure to escape or whatever. But it's still wrong, just a wrong thing you were forced to do.

A richer point of view is more "don't destroy data you don't 100% mean to destroy". Whitespace in the wrong place or stray nulls can meet that bar. Removing characters for "security" reasons doesn't. Destroying data to prevent security issues downstream is not a good idea.

To me that sounds more like canonicalization than sanitation. Depending on your requirements it might be fine to convert the input to a canonical form before processing. If you do this, be certain to do it before validation so that you don't accidentally "canonicalize" validated input into something which wouldn't pass the validation checks.
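A short Python sketch of that ordering, under made-up rules (the username pattern and the reserved-name list are assumptions): canonicalize first, then validate the canonical form, then store exactly what was validated. Fullwidth letters make the hazard visible, because a check on the raw input would not see "admin" while the canonical form is exactly that.

```python
import re
import unicodedata

USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,32}")
RESERVED = {"admin", "root"}


def canonicalize(value: str) -> str:
    """Fold Unicode compatibility forms (NFKC) and trim surrounding whitespace."""
    return unicodedata.normalize("NFKC", value).strip()


def accept_username(raw: str) -> str:
    canonical = canonicalize(raw)  # 1. canonicalize first
    # 2. validate the canonical form, which is also what gets stored
    if not USERNAME_RE.fullmatch(canonical) or canonical.lower() in RESERVED:
        raise ValueError("invalid or reserved username")
    return canonical


fullwidth_admin = "\uff41\uff44\uff4d\uff49\uff4e"  # renders like "admin" but is a different string
```

Had the reserved-name check run on the raw input and the normalization afterwards, `fullwidth_admin` would have passed validation and then been stored as "admin".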

A key aspect of canonicalization compared to sanitation is that the result should be something that the user would consider equivalent to their original input. The most common offender in my experience is the abuse of case normalization, especially for data like email addresses which are not defined as case-insensitive (at least for the mailbox name) even if many servers treat them that way. If you don't preserve the original case (and other parts such as "+" labels whose meaning is defined by the mail server) the address may not work at all, or may result in sending messages to the wrong user.

Names, as an intimate part of the user's identity, are another area where case normalization can sometimes prove annoying or even offensive. If some legacy system requires names to be entered as all-caps US-ASCII characters, fine, but at least don't turn "O'Conner" or "MacDouglas" into "O'conner" or "Macdouglas" in some misguided attempt to ensure that just the first letter is capitalized. (And in some situations the first letter shouldn't be capitalized, e.g. the "dos Santos" in "Giovani dos Santos Ramírez"[0]—which is a single surname, not two names.)

[0] https://en.wikipedia.org/wiki/Giovani_dos_Santos

Oh, believe me. As somebody with a name that includes accents, and a surname that contains two words, with relatives whose names include internal capitalisation and apostrophes, I know _all_ about that.

The thing is that canonicalisation is a kind of sanitisation. As you mentioned, I personally prefer it to be done in real time. Sometimes it can't, however, and you have to resort to munging, which is on the nastier end of sanitisation. Here's a short story:

AFNIC run the .fr registry, and they, unlike other registries, expect you to provide a contact's given name and surname separately. The joys of French bureaucracy. At my previous job (hosting provider and domain registrar), I built the company's domain management system. The systems in front of that didn't care about the form a person's name took so long as it was present, and most other domain registries are the same. There was no sensible way to get the applicant to enter them previously (this data was taken from the billing system). This necessitated that I build a library that could parse people's names, and I ended up developing a rather large number of heuristics for doing so as accurately as possible. It only covered the Latin alphabet, as that's all AFNIC would accept at the time, but it worked.

The problem is that most don't put that kind of thought into data sanitisation, and do things such as those you mentioned. And that's why we can't have nice things.

If I'm reviewing code and someone is implementing escaping that's an immediate, massive, red flag. It's SO HARD to get right and there are many MANY libraries for doing it correctly. The scary thing is how many bugs still make it into these libraries.

Strongly prefer using an established library and see designs such as https://web.dev/trusted-types.

The downside of not sanitizing inputs is that the data may well end up in contexts or new apps altogether wherein escaping output is not reliably done or known to be necessary. In an ideal world it shouldn't happen. But, it does, especially given turnover within organizations and the long lifespans of many systems and their datasets, lack of documentation, etc.

So, this calls into question the viability of a strict policy for every scenario that encourages blithely storing <script>doEvil()</script> in the database.

Seems some level of sanitization has a place as an extra layer of defense for some use cases.

> Sanitizing the input is wrong. Actively, objectively, unrecoverably wrong.

I agree on the “unrecoverably” (sic) part, but strongly disagree on words like “objectively”. It can be bad only if the input sanitization is poorly done. If that’s poorly done, then it’s also likely that the output sanitization may be poorly done. One cannot then say that output sanitization is objectively bad because someone doesn’t know or care enough to do it properly.

This is a complex topic that deserves more attention, not hand waving away with claims that cannot stand on their own.

Validate inputs, escape outputs

Yes, but remember in a lot of cases nearly anything is valid input.

Even better- never sanitize your data.

You should only use templating systems which safely handle user data. Don't use innerHTML assignments, don't concatenate user data into SQL queries. Use existing, validated libraries for generating HTML and SQL.

>If you're doing both, I'd ask you what you think you're accomplishing by sanitizing input, especially when you're already escaping output.


I'd argue that sanitization makes things worse from that standpoint.

What exactly was transformed in some given data and for what context? What needs to be done to reverse the sanitization process if you want to see the verbatim data, if that's even possible? Now that you want to escape the output, how can you reverse the sanitization transform so that you aren't double-escaping? What were the assumptions being made when this data was sanitized and what was that transform?

In other words, it's simpler to hold the verbatim data and then ask "ok, how does it need to be escaped for this context?" than having to ask that same question with arbitrarily mangled data while worrying if the data was sufficiently escaped for this context at input-time some point in the past.
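To make the double-escaping worry concrete, here is a minimal sketch using Python's standard `html.escape`. The sample string and variable names are made up for illustration; the point is only that a value escaped at input time gets escaped again by any output layer that escapes by default.

```python
import html

verbatim = "Tom & Jerry <3"

# Input-time "sanitization": the stored value is already HTML-escaped.
stored_sanitized = html.escape(verbatim)

# Later, an output layer that escapes by default escapes it a second time.
double_escaped = html.escape(stored_sanitized)

# Keeping the verbatim value means the HTML context escapes exactly once.
escaped_once = html.escape(verbatim)

print(escaped_once)    # Tom &amp; Jerry &lt;3
print(double_escaped)  # Tom &amp;amp; Jerry &amp;lt;3
```

A browser would render the second value as the literal text `Tom &amp; Jerry &lt;3`, which is the mangling described above.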

Even beginners get almost all the mileage from parameterized SQL queries plus an HTML templating library that escapes by default, which is almost all of them these days.
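As a rough sketch of that combination, assuming an in-memory SQLite database and a hand-rolled `render_comment` standing in for a real templating library (the table and function names are hypothetical):

```python
import html
import sqlite3


def store_comment(conn: sqlite3.Connection, author: str, body: str) -> None:
    # The ? placeholders keep the values out of the SQL text entirely.
    conn.execute("INSERT INTO comments (author, body) VALUES (?, ?)", (author, body))


def render_comment(author: str, body: str) -> str:
    # Escape at output time, specifically for the HTML context.
    return f"<p><b>{html.escape(author)}</b>: {html.escape(body)}</p>"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")
store_comment(conn, "O'Malley", "<script>doEvil()</script>")

# The database hands back exactly what the user typed.
author, body = conn.execute("SELECT author, body FROM comments").fetchone()
rendered = render_comment(author, body)
print(rendered)
```

The stored row is the verbatim input; only the rendered HTML contains escaped entities.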

I think knee-jerk sanitization is a relic of the days where that wasn't common, namely <?php echo $username ?>, which wasn't necessarily the worst advice when you otherwise had to remember to echo htmlEscape($username) every single time. Fortunately, things have improved since those days.

I've used a bunch of sanitizers and never had any issues with any of them. I'm sure there are exceptions but IME they tend to mangle the kind of text which the user really has no legitimate need to enter most of the time.

Far from being a relic, this kind of defense in depth showed how much value it has in the recent log4j vulnerability.

Obviously knee jerk decisions in tech are usually bad news.

The data store may be one, but the teams and apps working on the inputs and the outputs may be disparate and different. Relying on other teams all the time to do things correctly may not be a wise approach.

Two different pieces of advice for two different things. That one is about data validation, making sure it is coherent and fits your data quality rules. This one is about data encoding, making sure it fits a different system's rules.

Both are good advice. I have seen really funny bugs where Java accepted non-ASCII digits in an IP address but the C++ control plane very much did not. If the re-serialized version had been sent to the backend, this wouldn't have been an issue.

But the domains are different. Data validation is ensuring that the information is something that your system accepts. Data encoding is used when you are serializing information. You should very likely validate on input, but not "sanitize" or encode. You do your encoding on output.

I think the domain models are different. "parse-don't-validate" is great when your users are internal and trusted (e.g. a library that does codegen - the operators of the parser are already in the codebase). When your users are potentially hostile, you should at some level have a separate validate and eject strategy.

The idea of parsing over validation is just as applicable with untrusted input as with trusted input. The idea is more about system design rather than the prevention of security vulnerabilities.

Come again? I didn't say you can't validate while parsing untrusted input. I said that for untrusted input you will probably STILL need additional, separate validation methods. The key emphasis is on negating the absolute "don't" imperative in the original aphorism.

I didn't say that that's what you said. I tried to communicate that whether or not the input is trusted is beside the point.

Unless I'm deeply confused, the idea in parse, don't validate is about doing something like this

  parseFoo :: Text -> Maybe Foo
  parseFoo t =
    if textIsAFoo t
    then Just (Foo t)
    else Nothing

  f :: Foo -> IO ()
  f = _

Rather than something like this (which seems to be more common)

  f :: Text -> IO ()
  f t = when (textIsAFoo t) g
    where g = _

I thought the idea of parse, don't validate is that your parser should contain validation logic.

So instead of

text -> generic json parser -> validate json -> ...

you would do

text -> custom json parser that stops if encounters "incorrect content" -> ...

You can think of the parser as containing validation logic — it can parse the value into a more constrained type if it conforms to validation rules, or it can fail.

The point is that once your value is in a more principled type, the rest of the system is free from having to make assumptions (and guard against potential failures) about the breadth of that value's domain.
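For readers who don't use Haskell, here is a loose Python approximation of the same idea. The `Username` type and its rules are invented for the example, and Python only checks the annotations through an external type checker, so nothing stops someone constructing `Username` directly. Treat it as a sketch of the shape, not a guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Username:
    """A value that has already passed parsing; the rules below are made up."""
    value: str


def parse_username(text: str) -> Optional[Username]:
    # Hypothetical rules: 3-20 characters, ASCII letters, digits, underscore.
    candidate = text.strip()
    allowed = all(c.isascii() and (c.isalnum() or c == "_") for c in candidate)
    if 3 <= len(candidate) <= 20 and allowed:
        return Username(candidate)
    return None


def greet(user: Username) -> str:
    # No re-checking here: the parameter type says parsing already succeeded.
    return f"Hello, {user.value}!"
```

Callers deal with the `None` case once, at the boundary, and everything downstream takes a `Username` rather than a bare `str`.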

As the article mentions, this is really only relevant to languages with proper type systems like Haskell.

I think "parse, don't validate" is an improvement on what the author of this article recommends:

« So in cases where you do need to “echo” raw user input, carefully filter input based on a restrictive whitelist, and store the result in the database. When you come to output it, output it as stored without escaping. »

I think the "parse, don't validate" approach comes out as follows:

- take the list of things you would have included on your whitelist

- add nodes for them to your internal representation for parsed markdown

- extend your markdown parser to convert html-like input into those nodes

- implement output for those nodes in a similar way to normal markdown

This way, given the "escape output" they recommend, it's harder for any variant of the input that you hadn't considered to have harmful effects.

No, the parse, don’t validate idea is completely unrelated. It’s about leveraging a type system like in Haskell. It’s about parsing a value into a type with a narrower domain which in turn minimises the amount of control flow needed to implement a sufficiently correct program.

I agree with "It’s about parsing a value into a type with a narrower domain", but I don't see how you get to the first sentence.

In their example of a markdown renderer, the internal-representation node is the type with the narrower domain.

I think what makes this hard for folks is tracking what the expected form of data is at each step of its lifecycle, especially considering people working with new and unfamiliar codebases or splitting focus on multiple projects.

There are some frameworks that try using types to solve the problem. Alternatively, the developers could throw in a comment that looks something like:

// client == submits raw data ==> web_server == inserts raw data (param. sql stmt) ==> db_server ==> returns query with raw data ==> our_function == returns html-escaped data ==> client
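The "use types" option could look roughly like this in Python. `SafeHtml`, `escape`, and `paragraph` are hypothetical names for this sketch, not any particular framework's API; the idea is only that escaped markup and raw text stop sharing the `str` type, so a type checker can flag a mix-up.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class SafeHtml:
    """Markup that is already escaped (hypothetical wrapper type)."""
    markup: str


def escape(text: str) -> SafeHtml:
    # The only intended way to turn raw text into SafeHtml.
    return SafeHtml(html.escape(text))


def paragraph(content: SafeHtml) -> SafeHtml:
    # Accepts SafeHtml only, so passing a raw str is a type error.
    return SafeHtml(f"<p>{content.markup}</p>")


page = paragraph(escape("<script>alert(42)</script>"))
print(page.markup)
```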

I think escaping output is making the same mistake as sanitizing input. What we should really be saying is "stop using string interpolation/concatenation to process generic user data".

By default, text should only ever be treated as a blob. Yes, there are circumstances where it needs to be treated otherwise but they should be seen as a giant flashing 'danger' sign indicating the need to go back to sanitizing etc.

Every time this topic comes up, the comments are full of people talking past each other because they're operating under different definitions of "sanitize", "input", and "escape".

And now in this case, we add "output" to the confusion.

Is the SQL query you send to your DB input or output?

Since you don't know where your output will end up how could you possibly know the syntax to escape it?

And how can the consumer of an arbitrary string trust that every input will have been properly escaped?

You can't escape it ahead of time, for the same reason you can't reliably block or remove "dangerous" inputs ahead of time — you can't reliably know all the places and contexts they will be used in.

So you escape at the point of use, as late as possible, when you know exactly what escaping you need.
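A small illustration of why the escaping has to wait until the context is known: the same string needs four different encodings for four destinations. This sketch uses only Python standard-library calls; the sample value is made up.

```python
import csv
import html
import io
import shlex
from urllib.parse import quote

value = "O'Malley & Sons <admin>"

as_html = html.escape(value)    # for an HTML text node or quoted attribute
as_url = quote(value, safe="")  # for a single URL query value or path segment
as_shell = shlex.quote(value)   # for one POSIX shell argument

# CSV quoting is handled by the writer, per field.
buf = io.StringIO()
csv.writer(buf).writerow([value, "has,comma"])
as_csv = buf.getvalue()

for label, encoded in [("html", as_html), ("url", as_url),
                       ("shell", as_shell), ("csv", as_csv)]:
    print(label, encoded)
```

None of the four results is safe in any of the other three contexts, which is the argument for storing the verbatim value and encoding late.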

It's also easy to forget to escape. This is why it's best to have tools and practices that automate it, e.g. HTML templating engine that escapes everything by default, e-mail composing library that automatically converts text to whatever MIME magic is required, etc.

I'm really surprised by the discussion here. It's so obviously true, and I realized this when I learned that the correct PHP function to escape a string for SQL was named mysql_real_escape_string.

¿Por qué no los dos?

Because if someone wants to register with the name O'Malley, you should not refuse them, or worse, mangle their name.

Then that sanitization is incorrect. I don't think this discussion would have any merit if we're speaking about incorrect implementations.

You can't sanitize "correctly" if you don't know where the data will be used. This is exactly why the article advocates for escaping output (e.g. immediately before inserting a string into a SQL query) rather than sanitizing input (e.g. by deleting single-quotes or other potentially problematic characters from strings as soon as they're received).

Ok, and imagine we are on an Internet forum, talking about what is correct sanitization, and I want to make the following example : <script>alert(42)</script>

Will HN remove the <> characters and make my comment incomprehensible or will it escape it on output, preserving all the meaning ?

(Well, I’ll know after hitting Reply)

Edit : good boy, HN

If someone wants to register with the username O'Malley and password O`Malley, would you let them?

That's not sanitization, that's validation. You implement a validation rule that says that the password and the username can't be the same thing, then use that to redisplay the registration form with an error message.

Sometimes, you just sanitize input for user reasons. The username and password above are different ASCII strings. The non-tech-savvy (very elderly) user does not know the difference between "`" and "'", which means they are locked out. This results in phone calls to support, which in volume result in "fix the password field".

See https://www.cl.cam.ac.uk/~mgk25/ucs/apostrophe.html

I'd call that normalization rather than sanitization, but that's my own personal terminology, not necessarily terminology that's widely used.

No, garbage in, garbage out. Sure, things like log or SQL injections should not be solved only by sanitizing. You solve them by separating data and code. But a lot of the time you really want to store data in a structured, canonical way. Usernames, for instance: it is bad if Unicode trickery lets you create multiple usernames that look the same. Or product descriptions: it is bad if your ML needs to handle HTML, and so on.
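For the username case, one possible sketch is to compare on a canonical key built with Unicode NFKC normalization plus case folding, both from the Python standard library. The function names are hypothetical, and this only catches compatibility variants (fullwidth letters, ligatures, case); look-alikes from other scripts, such as a Cyrillic "а", need separate handling.

```python
import unicodedata
from typing import Set


def canonical_username(raw: str) -> str:
    # NFKC folds compatibility characters (ligatures, fullwidth forms) into
    # their ordinary equivalents; casefold() then removes case differences.
    return unicodedata.normalize("NFKC", raw).casefold()


def is_taken(raw: str, existing_keys: Set[str]) -> bool:
    # Compare canonical keys, not the strings as typed.
    return canonical_username(raw) in existing_keys


existing = {canonical_username("admin")}

print(is_taken("\uff21dmin", existing))  # fullwidth "A" followed by "dmin"
print(is_taken("\u0430dmin", existing))  # Cyrillic "a": not caught by NFKC
```

The display name can still be stored exactly as the user typed it; only the uniqueness check uses the canonical key.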

This is wrong. If I leave a comment `'; DROP TABLE users; --` You should display it back in the app as exactly that. If you put it into an HTML attribute you escape the `'` and if you stick it in SQL you use parametrized statements.

There is nothing "wrong" with that initial input. What is wrong is pasting it into an SQL string, HTML element, HTML attribute, URL parameter or anywhere else without properly encoding it.

This is the main reason you can't "sanitize" input. You need to know what the output format is to properly encode it. There are different requirements if you are pasting it into a sed replacement command vs HTML attribute vs HTML element body. You can strip everything except a-zA-Z and cross your fingers but even that isn't necessarily sufficient for all output formats.

Maybe a better way to put it is that you should be smart about why, when, and where to sanitize your data. A comment on a forum should not remove “‘; DO BAD THINGS;”. Why would it? It is just text, probably in some UTF-8 encoding. No viable web framework will write it out in raw form unless you explicitly ask for it. In SQL you use parameters. But as I wrote in my original comment, there are several scenarios (if you work on the web, probably most cases) where you really want to make sure that what you have stored is a clean, structured, canonical data representation. Not only for your security but also for third-party consumers and analysis.

I understand that everybody who sells NoSQL solutions disagrees.

Using parameterized statements is sanitizing inputs into the database.

The database is "outside" of your application server. You communicate with the database using statements and when you get the value back from the database it is unchanged. The encoding was just for transfer, no data has actually been changed.

Every online form where a user can interact and send data back to a server is always a nightmare in terms of security. I do utilize mod_secure, but for my next project I have an idea of doing "base64" on everything in the client's browser via JavaScript, then sending it to the server and checking on the backend whether the content is valid base64. Is that a good concept?

That could work if you are just going to store things as base64.

It accomplishes nothing if you are going to decode the base64 on the backend and then use the original value as-is. If anything it's worse than nothing, because now mod_secure will just see the base64 content and might fail to detect certain attacks.
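A quick sketch of why, using Python's standard `base64` module (the payload is just the usual example string): the encoded form passes a "valid base64" check, and decoding returns the original bytes untouched.

```python
import base64

payload = "'; DROP TABLE users; --"

# What the hypothetical client-side script would send over the wire.
wire = base64.b64encode(payload.encode("utf-8")).decode("ascii")

# The proposed backend check: reject anything that is not valid base64.
decoded = base64.b64decode(wire, validate=True).decode("utf-8")

print(wire)
print(decoded == payload)
```

Any inspection layer sitting in front of the application sees only `wire`, while the application ends up handling `decoded`, which is the original input.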

Unfortunately that wouldn't help with a whole lot. The danger with input is that it could be used to e.g. escape a SQL query and delete your database. Which is why we now have parameterised queries and such to help alleviate those worries.

If you think about it the process you're describing already happens: the browser sends the user's input as (usually) UTF8 string data, then the server decodes it. Changing that process to base64 wouldn't change much.

Only if you never decode it from base64. :)

Wouldn’t base64ing your inputs bypass mod_security?

OK, thank you everyone for your responses (+1s) - I was researching this idea and couldn't find anything online - now I know why!

Wouldn't prevent XSS afaik

A strong content security policy also helps with xss

By not sanitizing input, you create an unsafe datastore which might be used by other applications later. Sanitize as soon as possible.

I think it cuts both ways, as anyone who has needed to mine an existing data set for a new purpose can attest. Having the data sanitized can make your parsing job infinitely easier, while it can simultaneously destroy data which would have been extremely helpful to the new project.

If it doesn't fit into a data standard you are enforcing, it shouldn't exist in the database. There is nothing wrong with capturing the original text in a field or separate table.

guess I'll just put that 2gb "first name" directly into my database then

That's validation, not sanitization.

Sounds like a restatement of Postel's robustness principle[1]. Did it go out of style to "be conservative in what you send, be liberal in what you accept" and we need to relearn it again?

Well, perhaps it did. History has shown the dangers of not handling malformed input well. Postel's principle has received scrutiny[2] for reinforcing those mistakes by creating a mistaken belief in robustness. More recent recommendations have been to be stricter in handling of inputs[3].

But I think there is some confusion between robustness and defensiveness. "Be liberal in what you accept" may be confused with "don't sanitize your inputs" when not sanitizing is the less liberal action. Robustness means the program should not fail if it receives input it didn't expect. A program that crashes, hangs, executes unintended shell code, mangles the data, changes the thermostat, or exhibits other undefined behavior is not being robust. To prevent that, data must be sanitized at input so that it can be processed without those side effects. The examples of programs failing at robustness have been cases where they were insufficiently defensive.

The bigger issue is that robustness doesn't scale easily. You may know how your bit of code will deal with malformed data, but what about every other library you use? Or other systems you communicate with? It becomes a backstage problem, where once someone has gained access to a restricted area it's assumed they are authorized to be there. The further down the tech stack you go the less likely the code will be defensive. That puts a burden on the public-facing sanity checks to anticipate how relaxed they can be about the input.

If you change the definition of output to include internal-outputs, then Postel's principle gets new life. That is, try not to program the entire system and ecosystem at once, but treat each software component as an island. Be liberal not only with the data you receive from the end-user, but also with return values from functions. Be conservative and escape not only your generated HTML, but also the SQL statements you dispatch to the backend. This is what input sanitizing is actually about, it's keeping the promise to the other parts of your program that your code isn't going to give them bad data. That's also what the linked article is saying, because the HTML being generated is itself one component in a chain of programs that includes the end-user's browser.

[1] https://en.wikipedia.org/wiki/Robustness_principle

[2] https://programmingisterrible.com/post/42215715657/postels-p...

[3] https://datatracker.ietf.org/doc/html/draft-iab-protocol-mai...

sanitize (client side) => confirm with user => trim+escape (server side) => insert


validate (client side) => insert using sql parameterization (escaping) => escape per context when outputting

Sanitizing is the idea that you are cleaning dangerous things from the original input (different from validating, which is disallowing users from entering characters that don't conform to what your program expects).

One BIG issue here is that validation is generally clear to the user ("That is an invalid email address") whereas sanitization normally doesn't consult or inform the user that there were changes and may result in unexpected things happening from a user perspective.

From article: name is "John O'Brien" now displays as "John OBrien" (this is a trivial example but still an issue)

The name thing is a great example of things you might not expect your users to do but are still totally valid use cases. Sanitization can be extremely frustrating from a user perspective.
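Roughly, the difference looks like this. Both functions and the length limit are hypothetical, written only to contrast the two behaviours: the sanitizer silently changes the value, while the validator leaves it alone and returns messages the form can show.

```python
from typing import List

MAX_NAME_LENGTH = 200  # hypothetical limit for this sketch


def sanitize_name(name: str) -> str:
    # Silently drops characters; the user is never told.
    return name.replace("'", "")


def validate_name(name: str) -> List[str]:
    # Changes nothing; returns messages to display next to the field.
    errors = []
    if not name.strip():
        errors.append("Name must not be empty.")
    if len(name) > MAX_NAME_LENGTH:
        errors.append(f"Name must be at most {MAX_NAME_LENGTH} characters.")
    return errors


print(sanitize_name("John O'Brien"))  # John OBrien
print(validate_name("John O'Brien"))  # []
```

The apostrophe is valid input, so validation accepts it; an oversized value gets rejected with a message instead of being trimmed behind the user's back.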

Why "escape"? Just insert. Using SQL parameters.

Insert using SQL parameters is "escaping". The parameterization ensures that the data being passed gets interpreted by the DB as the expected data type by ensuring special characters aren't interpreted as "special" in that context.

I think that is a strange use of the word "escape". You "escape" from something, in this context the query string. If parameters are not passed inside the query string then how can you say they are escaped?

At least for the database I am familiar with (mssql), the query string is one parameter in the binary protocol, and then there are the other parameters that are not the query string which are used as arguments.

Your usage here is a bit like saying that a standalone PNG file is "escaped" from the HTML document it is referenced from...since it is marked as not being HTML...
