I wrote https://github.com/microcosm-cc/bluemonday which is a pure Go HTML sanitizer inspired by https://github.com/owasp/java-html-sanitizer .
The key things to understand about HTML sanitizers:
* They must be whitelist based
* They must be aware of context
* You must sanitize ALL user input even if you don't think you're going to render it on a web page.
The book linked to in the article does not seem to understand any of the above.
The section on sanitization has the equivalent of "string replace" as the primary recommendation. Elsewhere in the XSS section a focus is on escaping content before it is rendered.
Sanitization needs to know not to run on <pre> blocks, and to escape HTML entities automatically, and to understand what links are safe and which are not.
XSS can be really interesting and quite targeted. It can be that a user-agent contains the XSS, because the target may not be the person reading the page but the admin looking at a web page of their web server logs through an analytics program on the same domain.
The bluemonday package I wrote can deal with all of these things, but that isn't the point. The point is that this is an area I know, and the book falls way short of a decent standard for creating a secure and safe web application. And if it falls short in this area (the first 2 chapters), then I would assume that it falls short in all areas.
I'm not able to make sense of this. Sanitize it for what context? SQL? JSON? HTML? Inclusion as a command-line argument? All of these, and hope that sanitizing it for one context doesn't un-sanitize it for others?
Usually this means the HTML context. Different sanitization is needed depending on _where_ in the HTML document the input is used.
For instance, if the input is used in between HTML tags (let's say $foo is user input in this PHP example):
... <body><?php echo $foo ?></body>
Therefore, to correctly sanitize this, you would call the PHP `htmlentities` function:
... <body><?php echo htmlentities($foo) ?></body>
What if foo is used in a different context?
... <body><a href='<?php echo htmlentities($foo) ?>'>...
The key problem is that `htmlentities` is not valid sanitization in the context of a URL-valued HTML attribute. In this example, you need to use `urlencode`:
... <body><a href='<?php echo urlencode($foo) ?>'>...
In my mind, the context-sensitivity of XSS is one of the key reasons why it is so prevalent.
I'm a big fan of not necessarily sanitizing, but of treating input appropriately in context. This may mean removing characters, or mapping them, or just delimiting the entire thing.
To that end, I argue that you should not necessarily sanitize on the way to the storage mechanism. You should only sanitize at the boundaries. So, a web view should make sure any strings are treated as strings. A database layer should make sure query parameters are not able to alter the query. Etc.
(All of this is trying to simply reinforce your point.)
Perhaps it is more of a use-case question. I haven't worked on comment or wiki systems, but it seems much harder to validate user input there, so maybe that is where the line should be drawn: when user input is expressly expected to contain markup?
If you accept any user input that will be displayed or processed anywhere (be it in HTML fragments, as input to SQL, etc.), then sanitize it, using bluemonday or any other sanitizer applicable to your content.
For user generated content, like comments on Hacker News, there is a pretty rich policy provided by bluemonday so you don't have to think about it and are going to be alright.
That looks like this:
p := bluemonday.UGCPolicy()
htmlOut := p.Sanitize(htmlIn)
In fact, you have it totally backwards: you're not supposed to sanitize all user input before storing it. Instead you're supposed to sanitize any user input before you output it back to your webpage.
Even more so: it's the output that dictates what sanitization you should perform, not the input. You don't do input sanitization for HTML (for XSS etc) when you store your data in your DB. Instead you should sanitize the input for SQL Injection issues. And similarly for whatever other output -- if you take user input and run a shell command, you should sanitize for shell safety, not run html sanitization.
I believe by sanitization most people mean processing content which will be rendered and not escaped; a good example is content from WYSIWYG editors. This is where sanitization libraries come into play.
You would sanitize HTML fragments before storing them in the database because you don't escape them during rendering. Text content is not sanitized before saving to the database, as you can just escape it when rendering.
As long as the data is sanitized before it can affect the storage/transport mechanism for its content type, you're good.
No, not really. Storing the user's data as-is is almost always of paramount importance. The fact that it may be output as HTML/XML/Markdown/whatever means that it really is at output time that you must sanitize/escape/quote.
That's why the moral of the Bobby Tables story isn't: "Oh, just remove all semicolons". It's "use prepared queries".
Sometimes, data really does need to be sanitized at the point of submission. If you disagree, that's more of a point about application design than appsec.
That was the point I was trying to make: sanitizing at input time is a failure from the start. There's no way to know ahead of time what outputs you're going to be producing 5 years from now. Conclusion: store all input exactly as received. (We can do that these days with form/URL encodings and whatnot.)
Ok, so now you have the data stored accurately.
Next step: you need to output to, let's say, HTML. So you just escape/quote everything appropriately and nobody gets hurt. If you do the escaping/quoting properly, there are no XSS attacks. It's really just that simple.
However, it is NOT about sanitizing at the "input" point. Do you get what I'm saying now?
(I realize that that sounds aggressive, but I really just want to force this point home. Please tell me if you disagree or find some detail in my explanation confusing. This is important for the security of the web and either I'm wrong or you're wrong or I didn't understand what you said. Let's figure out which is the case.)
There are caveats here.
I didn't specify whether the sanitize occurred on receiving user input or displaying it.
I only said, sanitize all user input.
I'm sorry, but you basically did. You said:
> You must sanitize ALL user input even if you don't think you're going to render it on a web page
Which implies that sanitizing input at display time, when you know you're rendering it to a web page, is too late. That's why people are jumping on you.

Keeping a clean database is the absolute most important thing you can do. The database isn't contextual. The data it stores can find its way into HTML pages, REST responses, SQL queries, PDF reports, XML/JSON data exports, and a ton of other formats. Each of these output formats requires a different form of sanitizing. Sanitizing before the data hits disk creates a nightmare for anyone displaying the data in a context other than the one the sanitization was performed for. So what you said originally is precisely incorrect. Only sanitize input when you know it's going to be rendered to a web page. Otherwise, leave it alone.
Now, you should be using view-layer frameworks to make that sanitization easy, automatic and the default action. When rendering to HTML, the templating language should sanitize by default and give a way for template authors to opt-out when they know the data did not come from user input. Likewise, in the SQL context, prepared statements also make it easy for the developer to do the right thing. But at no point are you speculatively sanitizing all user input. You're getting user input to disk in as pristine a format as possible and sanitizing contextually depending on how the data is outputted.
Still not convinced about the usefulness of the source, given the criticisms raised here and there.
That code -- using SHA-256 to hash a password+salt and storing it in a DB -- should not be there. Someone WILL copy-paste it.
Don't make it easy for people to do something stupid.
"However, this approach has several flaws and should not be used. It is given here only to illustrate the theory with a practical example. The next section explains how to correctly salt passwords in real life."
and goes on to use bcrypt.
IMHO, examples of dangerous code should implement one of the following mindless-script-kiddie-consultant defenses (in order of most sane to least sane):
1. use screenshots of the snippet
2. use pseudo-code
3. put in deliberate syntax mistakes
The code sample in question is in no way clearly labeled as being problematic, and a tiny statement buried in a large body of text does not change that fact.
go vet can find printf-format errors, invalid shifts, unreachable code, etc.
Even though go vet is very helpful, it is sometimes scary that the compiler allows such incorrect code to build. For instance: https://play.golang.org/p/2AVHUt5Wcf
Go already has one of the strictest compilers (by default) out there. But it is pretty typical for a great many language frameworks to support optional additional strictness, e.g. "use strict" in Perl, or various gcc flags which I cannot recall offhand. I see Go's vet as akin to those: there when you need it, but out of your way when you just need to get some prototyping done.
http.Handle("/o/", http.StripPrefix("/o/", http.FileServer(http.Dir("/"))))
See here for how to contribute: https://checkmarx.gitbooks.io/go-scp/content/howto-contribut...
> Don't read it
and listen to you instead? Wrong.
> always start with using a good framework.
> It will do everything for you