HTML Sanitizer API (developer.mozilla.org)
114 points by graderjs on May 6, 2021 | hide | past | favorite | 82 comments



For people saying this is a bad idea to do on the front-end, I believe it's a great idea to have this as a standard. Generally, the process in the past has been:

1. Library that does the heavy lifting

2. Front-end catches up by creating a new standard (seems like https://wicg.github.io/sanitizer-api/)

3. Node.js implements these methods natively (unless it already did something similar but subtly different between steps 1 and 2, in which case we are in a mess).

4. Now you can use this everywhere without depending on random libraries.

So hopefully with this coming to the front-end, it means that eventually we'll have this functionality native in Node.js at some point in the next 1-3 years!

This has been similar (though not always 1-to-1 like desired unfortunately) with many things like promises, crypto, web workers, etc. I am still waiting for native Node.js' fetch().


There are 3 correct ways (and 1 incorrect way) to secure your strings in a web application:

1. Output escaping (server-side)

2. Output sanitization (server-side)

3. Output sanitization (client-side) - i.e. this standard

4. [Wrong] Input sanitization (server-side) (not to be confused with input validation, which is good but unrelated to this conversation)

Without boring with the details of why 4 is wrong, normally you should choose either 1, 2 or 3 depending on different applications:

(1) should generally be used in almost ALL cases. Contextual escaping is orders of magnitude more secure than sanitization, and is usable by any application not rendering user-provided rich text.

Furthermore, most/many user-provided rich-text applications use a DSL like Markdown which sidesteps the need for sanitization, so you can use escaping for these cases as well.

(2) should only be used if you are rendering user-provided rich-text. These cases are quite rare, and chances are if you need this feature you have a large application with lots of experienced engineers and you have the resources to put in place appropriate caching mechanisms.

(3) is the same as (2) but offloads the work to the client (no caching necessary) and has a resultant negative impact on the performance of your app (client-side UI latency). Making browsers fast is already a gargantuan challenge for browser-makers; adding client-side sanitization to the mix will exacerbate this.

But mainly, this new standard will encourage inexperienced engineers to use (3) where (1) is much more secure for most applications.
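To make option (1) concrete, here is a minimal sketch of output escaping for the HTML text context. The `escapeHtml` helper name is made up for illustration; a real system needs different escaping rules for attribute, URL, CSS, and script contexts as well.

```javascript
// Minimal sketch of (1): escape at output time, for the HTML text context only.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, ch => `&#${ch.charCodeAt(0)};`);
}

// The template escapes every interpolated value by default:
const comment = `<img src=x onerror=alert(1)>`;
const html = `<p>${escapeHtml(comment)}</p>`;
// → <p>&#60;img src=x onerror=alert(1)&#62;</p>
```

The payload is rendered as literal text, never parsed as markup, which is why contextual escaping is so much more robust than trying to enumerate dangerous constructs.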


5. Use the proper API instead of escaping (client-side). Say you assign your non-static data to el.textContent instead of concatenating strings into something that's to be interpreted as markup/code. Or you use DOMParser to parse markup safely, and then extract data from it or prune it using a whitelist via the regular DOM API.

5a. Block all access to unsafe ways to use strings as code/markup via trusted-types CSP. (at least during development)

Conceptually the same thing you do with SQL server side. Parametrize the queries, instead of escaping and pasting into the query.
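A minimal sketch of (5), analogous to a bound SQL parameter. The `setUserText` helper name is made up; `el` is any DOM element.

```javascript
// Treat user data as data, never as markup. Assigning textContent never
// parses the value as HTML, just as a bound SQL parameter is never
// parsed as SQL.
function setUserText(el, value) {
  // WRONG: el.innerHTML = value   (value would be parsed as markup)
  el.textContent = value;          // safe: rendered as literal text
  return el;
}
```

In a browser you would call it as `setUserText(document.querySelector('#out'), userInput)`; no escaping step is needed because no parsing step ever happens.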


Great point!

I wanted to edit the comment to change (1) to (server/client), but I missed my edit timeout.

I would include your (5) within (1). `textContent` and other DOM methods like `setAttribute` are effectively secure output-escaping on the client.

Your (5a) is an excellent extra measure. In this area, I'd also add security-focused linting for (1) and (5). E.g. for (5), to ensure secure DOM methods are used, I use Mozilla's `eslint-plugin-no-unsanitized`[0] plugin for all my personal & work projects.

[0] https://github.com/mozilla/eslint-plugin-no-unsanitized/


I think that's a bit missing the point of why sanitization is wrong (e.g. 2 & 3). The core problem is that blacklisting elements is really error prone. People frequently find ways around it.

You really shouldn't do "block this element, allow this element" type sanitizing ever.

What you should do is:

1. Parse the input into an AST that only allows safe nodes. I.e. the AST should have no `<script>` node at all.

2. Write it back out as HTML.

It's more work.


> You really shouldn't do "block this element, allow this element" type sanitizing ever.

> What you should do is:

> 1. Parse the input into an AST that only allows safe nodes. I.e. the AST should have no <script> node at all.

> 2. Write it back out as HTML.

Firstly I don't see the difference between what you say you shouldn't do and what you say you should do. This is what a sanitizer does: it parses the input into an AST that only allows safe things. It does "block this element, allow this element" on an AST.

So I don't understand what things you're contrasting?

Secondly, I guess this was just brevity, but disallowing script nodes is the tip of the iceberg when it comes to sanitization. You need to deal with a plethora of vectors involving scriptable & interactivity attributes, and other "active" non-script elements.


> Furthermore, most/many user-provided rich-text applications use a DSL like Markdown which sidesteps the need for sanitization, so you can use escaping for these cases as well.

Markdown does not side-step the need for sanitization. You can embed HTML inside markdown, so you actually need sanitization for exactly the use case of rendering markdown on the client.
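To see why, consider a toy renderer (purely illustrative, not a real Markdown library) that converts `**bold**` spans and nothing else. Like this toy, most real renderers pass raw inline HTML through untouched unless explicitly told otherwise.

```javascript
// Toy Markdown renderer: handles **bold** only, and, like many real
// renderers in their default mode, leaves raw inline HTML as-is.
function toyMarkdown(src) {
  return src.replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>');
}

toyMarkdown('**hi** <img src=x onerror=alert(1)>');
// the <img> payload survives the "Markdown" step verbatim,
// so the rendered HTML still needs sanitizing (or the renderer
// must be configured to escape/strip raw HTML)
```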


> Furthermore, most/many user-provided rich-text applications use a DSL like Markdown which sidesteps the need for sanitization, so you can use escaping for these cases as well.

Be very careful about that assumption for Markdown. Many Markdown processors will pass HTML tags through unmodified, so you could end up with a script tag in Markdown.


Good point. Maybe I should've said BB :D


I'll bite - why is 2 right but 4 wrong? Is it because it's not reversible? Is it in case your sanitization rules change?


Firstly, to distinguish sanitization vs validation: sanitization lossily modifies the value, while validation rejects it outright. Validation may increase security, but the primary motivators are stability (avoiding unexpected state) and UX (relevant error messaging).

Input sanitization is intended purely as a security measure but is at best insufficient and usually reduces system security.

The first problem is that input sanitization is sanitizing without context: you don't know what threats you're fighting, because threat vectors of this variety target the output format. Will your input variable be included in an HTML template, an Atom feed, saved to some file that's parsed later, sent in a JSON API, templated into a style or script block, used in an SQL query? The possibilities aren't known at input time (or if they are, they will change as your application grows), so protecting against all of them isn't viable. Often people html-escape or blindly html-strip their inputs, regardless of whether they'll ever be used in an HTML template.
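A quick sketch of the context problem using only built-ins: the same value needs three different safe encodings depending on where it ends up, and none of them can be chosen correctly at input time.

```javascript
const name = `"/><script>alert(1)</script>`;

// HTML text context: numeric character references
const inHtml = name.replace(/[&<>"']/g, c => `&#${c.charCodeAt(0)};`);

// URL query-string context: percent-encoding
const inUrl = `/profile?user=${encodeURIComponent(name)}`;

// JSON / script context: JSON string escaping
const inJson = JSON.stringify({ user: name });

// Three contexts, three incompatible encodings. Pre-escaping at input
// time for one context corrupts the value for the other two.
```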

Given the above you might think input sanitization is at worst useless, inefficient, but not harmful to security, but that brings us to the follow-up problems: lossiness & unknown state.

If you're doing input sanitization, you're not doing input validation (at least not properly). Input sanitization is about accepting lossy values (threats removed) when you receive unexpected input. That means you're accepting unexpected input, which leads to all the problems input validation is designed to protect against.

Finally, there is of course double-escaping. As mentioned above, people often html-escape as part of input sanitization. This essentially disables your ability to do reliable secure output escaping, because it leads to double-escaped values in output (or horrible double-escape-reversal hacks in output templating code). Generally you want to be securely output-escaping everything, which means you want a system where you can rely on always receiving clean unescaped values into your output template.

I have encountered too many systems using a good, secure HTML templating library with output escaping baked in by default, where devs had to disable output escaping explicitly because some inputs were already pre-escaped. In theory you can keep track of which vars need raw output and which vars don't, but that's an unmanageable mess in practice. And it's impossible to automate enforcement.
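The double-escaping failure mode is easy to demonstrate with a minimal escaper (illustrative, not any particular library):

```javascript
// Minimal HTML escaper for the demonstration
const esc = s => s.replace(/[&<>]/g, c => ({ '&': '&amp;', '<': '&lt;', '>': '&gt;' }[c]));

const userInput = '5 < 6';
const stored = esc(userInput);   // input "sanitization": '5 &lt; 6'
const rendered = esc(stored);    // output escaping:       '5 &amp;lt; 6'
// The page now displays the literal text "5 &lt; 6" to the user.
```

The only clean fix is to store the raw value and escape exactly once, at output.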


Or in other terms: Store Raw, output safe means you can improve output safety over time. Storing safe locks you in to that mutation.


That's a really nice succinct way of putting it. Will steal.


Just don't steal my weird capitalisation hah


Looks like they're on their way to piecemeal rediscovering SGML, with config options to exclude non-permissible elements and attributes such as <script> and onclick. Except SGML can do this in a holistic, context-dependent way, which brings us to

> Issue 1: It’s unclear whether we can assume a generic context for parseFromString, or if we need to re-work the API to take the insertion context of the created fragment into account.

which of course refers to the concept of tag inference/omission in HTML/SGML not covered in this proposal at all, and similarly

> Issue 2: What about comment nodes, CDATA, etc. ?

and

> Some HTML elements require special treatment in a way that can’t be easily expressed in terms of configuration options or other algorithms. The following algorithm collects these in one place: To handle "funky elements" on a given element, run these steps: ...

(introducing the ad-hoc concept of funky elements), and

> Issue 3: The spec currently treats MathML and SVG as unknown content and therefore blocked by default. This needs to be fixed.

It's depressing that we're iterating over these kinds of issues that seemed like a solved problem in 1986 (35 years ago) when SGML was first published.


oh how I miss dsssl! xml schema was a real improvement over dtd. but xslt? not so much.


quite partial to RNG / .rnc


yes, relaxng of course is the better xml schema. schema was a leap from dtd.


It's hard to believe that JavaScript still does not have a standard function to HTML-encode a string (as in, replace < and > etc. with &lt; etc.). I know this can be solved in other ways (as in never using innerHTML), but in reality it's often ignored or custom implementations are put in place.


I think part of the reason that we don't have such a function yet is that it is essentially a one-liner if you use decimal notation for entities:

    function escape_html(s) { return s.replace(/[&<>"']/g, m => `&#${m.charCodeAt(0)};`) }

That being said, a built-in function would be convenient and faster.


One problem is that it's just as simple to implement an incorrect version of this function.


The `textContent` setter also does this. E.g. you can do this:

    const span = document.createElement('span')
    span.textContent = text
    return span.innerHTML


This does not handle quotes as far as I remember


There are many built-in functions that could also be accomplished with a one-liner akin to your example.

I hear what you're saying - there's an easy manual workaround for this requirement - but a built-in function would still add value nonetheless.


Agreed. As I said, it would be convenient, faster, and most importantly, guaranteed to be error-free.


Sorry for my ignorance, but why do you need to replace "<" and ">" etc with "&lt;" etc? If I am not wrong:

- When the user input is plaintext, you will output using .innerText, which means such replacement will actually break those special characters.

- When the user input is actual html (or contenteditable), such replacement will also break those tags. If you don't want tags in the first place, why make so that user inputs html?


Would be nice if sanitizer output was a "known safe" type instead of plain string.

With only strings, your API has no way of knowing if a string is safe (was already sanitized) or not.

For example, goog.html.SafeHtml

https://google.github.io/closure-library/api/goog.html.SafeH...


> The other method available is the Sanitizer.sanitize() method. Which is very similar to above, however returns a DocumentFragment with disallowed script and blink elements removed.


There's honestly no such thing as "safe HTML". It's all very contextual. It can be safe in one context and unsafe in another. So trying to type it is futile.

I've seen similar attempts with "taint flags" and what not. They fall very short in practice.


You can also disable all unsafe APIs, which is workable if you don't have to integrate third party code into your frontend. Then you don't have to mark anything as safe/unsafe, because it doesn't matter.


I wonder how this could work along side the trusted types api.


My thoughts as a maintainer of a HTML sanitizer in Go https://github.com/microcosm-cc/bluemonday which is also available to Python via https://github.com/ColdHeat/pybluemonday

1. Sanitizing is not difficult, defining the policy/config is.

This is difficult, as your need is not someone else's. At first glance, this proposal needs a lot more work to cover people's needs. It's good enough to start with, but it will have a lot of edges and will need to evolve.

2. If you allow a blocklist then it will be less secure.

Because people will use that by default as it's easier to say "I don't want <blink>" than it is to say "I only accept <long list of things and the myriad of when attributes are valid on which elements>". The problem here is that a blocklist requires the person writing the config to cover every scenario to be safe and unless they're a security engineer they will not... and if they were a security engineer they'd use the allowlist. Blocklists seldom deliver good security, allowlists deliver great security.

3. Provide sane defaults.

Most engineers simply do not know what is safe or not. I ship a policy in bluemonday for user generated content... it is safe by default and good enough for most people, and it can be taken and extended due to the way the API is structured so can cover other scenarios as a foundation policy.
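To illustrate the allowlist point, here is a hypothetical sketch (not bluemonday's actual API; the node shape `{ tag, attrs, children, text }` is invented for illustration). The key property: anything not explicitly allowed is dropped, so new or unknown elements fail safe by default.

```javascript
// Allowlist filtering over an already-parsed tree. The policy maps each
// permitted element to the attributes permitted on it; everything else
// (unknown elements, unknown attributes) is silently dropped.
const POLICY = {
  p: [], b: [], i: [], em: [], strong: [],
  a: ['href'],   // only these attributes survive on <a>
};

function filterNode(node) {
  if (node.text !== undefined) return node;       // text nodes pass through
  const allowedAttrs = POLICY[node.tag];
  if (allowedAttrs === undefined) return null;    // unknown element: drop
  return {
    tag: node.tag,
    attrs: Object.fromEntries(
      Object.entries(node.attrs || {}).filter(([k]) => allowedAttrs.includes(k))
    ),
    children: (node.children || []).map(filterNode).filter(Boolean),
  };
}
```

A real policy also has to validate the surviving attribute values themselves (e.g. rejecting javascript: and data: URLs in href), which is where much of the remaining difficulty lives.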

-----

I think the proposal in general: specify a standard for a sanitization API has merit. But mostly it has merit if it specifies a standard for defining sanitization policies/configuration, allowing them to be portable across different languages and systems.

The one I wrote is very heavily inspired by https://github.com/owasp/java-html-sanitizer which is the OWASP project one maintained by Mike Samuel. When I did my research before writing the Go one, this was far and away the best way to construct the policy/config and I already saw that this perspective was more valuable than whether it's a token based parser (GIGO but low memory) or a DOM builder (more memory)... no-one cares about the internals, they care about expressing what safe means to them.


Also... unless I'm wrong in my reading of the spec, data: URI href values would be allowed, which means an attacker could just use one to inject a full document that isn't sanitised.


> This is difficult as your need is not someone else's

It helps when the config accepts functions that are given suitable context about the content they're filtering.

I also agree 100% about everything you've noticed.


Wouldn't a callback style be the easiest? Instead of a complex config with so many overlapping rules, you simply use a callback for everything and let the developer decide whether they want something.


Any time I think about HTML sanitization I think about MySpace and how the engineers there never parsed user authored HTML, they just searched for substrings in the input which allowed users to craft some really funky HTML that was valid, but unconventional.

Interesting times. Also a good example of how when you work at a company as a developer, not every issue you come across you’ll have time to fix or may even know how to.

Tools proliferated years later to help with this sort of thing, but it’s interesting to think after many years that such a common concern had not really been explicitly addressed in standards as far as I know.


There's a DOMParser as a part of platform api, that I've been using for maaany years for this.

https://developer.mozilla.org/en-US/docs/Web/API/DOMParser

On top you just write some recursive tree sifting whitelist based function, for your expected input.


I'm not against this, it can be used to replace lots of custom parsing for WYSIWYG editors that paste HTML from the clipboard (for example what happens when you copy content from Word, or from a site).

But it seems awfully underspecified. There's no single way to "sanitize" HTML. It's essentially a filter. So we need a way to control what's filtered.

We need a whitelist of elements, attributes, etc. for this to be useful in practice.

I believe a better name for this would be "NormalizeHTML" (think HTMLTidy), and accept a filter specification.


If you open the specification linked from that page ( https://wicg.github.io/sanitizer-api/#sanitizer-api ), you'll see that the Sanitizer constructor accepts a config param that does exactly that, and will probably be extended before the spec is ready.


The Sanitizer constructor can take a list of allowed elements etc.: https://developer.mozilla.org/en-US/docs/Web/API/Sanitizer/S...


Oh I see, thanks. It's a good start, but the attributes should be specified both globally and per node type, I don't see this. Also a blacklist (the "drop" variations) is a really bad idea.


I made one of these as a filter written in C using Lex.

This is used by my fork of the Lurker mailing list archiver, for the sake of allowing people to post HTML into mailing lists and have it safely included in the web archives.

http://www.kylheku.com/cgit/hc/tree/

The "wl" file contains the whitelist specification.

  foo: bar baz
means that the foo element is allowed (passes through the filter), and only the bar, baz attributes of that element are allowed. If it has other attributes, they are stripped.

  !xyzzy:
is actually just a comment in the file. Lines starting with ! are ignored. Any element not whitelisted is disallowed.

This spec is translated into the wl.h header file by wl.txr.
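For illustration, a toy parser for that whitelist format, inferred purely from the description above (not from the actual wl.txr): `foo: bar baz` allows element `foo` with attributes `bar` and `baz`, and lines starting with `!` are comments.

```javascript
// Toy parser for the wl format described above. Produces a policy object
// mapping each allowed element to its allowed attribute names.
function parseWhitelist(text) {
  const policy = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('!')) continue;  // comment or blank
    const [element, attrs = ''] = trimmed.split(':');
    policy[element.trim()] = attrs.trim().split(/\s+/).filter(Boolean);
  }
  return policy;
}
```

The resulting object could feed an allowlist filter directly; any element missing from the map is disallowed, matching the fail-safe default described above.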


> I made one of these as a filter written in C using Lex.

What could possibly go wrong?

Having cut my CS teeth on both I must say both C and Lex are giant foot-cannons if used to parse untrusted input.


A number of things could go wrong.

You have to look at the actual program, rather than making assumptions about it based on the language it uses.

Flex-generated code could be a risk. But in many years, I've not seen problems with Flex scanners. TXR uses a complicated Flex scanner. In almost twelve years of developing this program, I've never had to debug issues with Flex-generated code and have never seen valgrind report any problems against that code, so I'm confident in deploying a simple Flex-generated scanner program for untrusted data.

Though Flex generates C, it isn't C. When you use Flex, you're not rolling your own scanner in C; you're relying on a widely used tool.

There are parts of the Lex input file which are C, but in my program, all they do is return integer constants. Some of them also use BEGIN to switch to a different lex state. My Lex file is very low risk. I didn't customize the input buffering or anything.

I also de-risked the rest of the program substantially by defining a token structure which is passed and returned by value. There is almost no pointer manipulation or arithmetic. There is only a small bit of manual memory management: I used strdup to copy the token text, subject to manual freeing. When a function returns a token to another, the caller becomes the owner; it calls deltok if it consumes the token (doesn't return it to anyone). The ownership protocol is very easy to verify. The program carefully accesses static arrays which represent the whitelist, and doesn't perform any arithmetic that could overflow.

Now, on the other hand, delegating HTML cleaning to browsers: what could go wrong there? Unsafe HTML should probably not be served to browsers, period; unsafe data should not escape from the back end. If you can catch bad HTML at the back end, why wouldn't you?


This seems to be something new. Anyone who can share why one would choose to sanitize on the front-end instead of the back-end? You might want to sanitize it before it goes over the wire, but still have to sanitize on the back-end as well as you can’t trust user input. You could sanitize it once you send it to the client, but you should not end up in a situation where you are the one sending possibly corrupted html to a client, right? I’m probably missing something


There's a number of handy use cases for this, but the biggest one from my point of view is another layer of defence in depth. This is something that many frontend libraries already do for you (React, Vue, etc.), where you have to explicitly opt in to rendering unsafe HTML. Making this a standardized API simply makes it more accessible and hopefully faster. It is often the case that the backend won't know what data will be rendered and what will be used in another way, so it makes sense to have a sanitization step before you render what you know on the frontend to be dynamic data. Put another way: even if there is a vulnerability or two in your backend that might sum up to an injection, if you also sanitize on your frontend, you'll still keep your users safe.


There is often a need to sanitize user input that never goes to the backend at all. There are backends you don't control from which you still need to fetch resources and present them to the client. And if this API becomes part of a JS engine, then you will be able to use it on the backend as well.


The front-end tends to use optimistic updates more and more, meaning the user's input is rendered directly, without a round trip to the server. I guess that's one of the use cases.


Maybe you're not in control of the back end.

For example, say you were writing an app that displayed RSS feeds from different sources. You don't know what might be coming over the wire, so it would be useful to sanitize it at the point of display.


Or if you work with a CMS and you don't trust them to not allow script tags


It’s a lot easier to verify that content is safe if it’s sanitized close to where it’s used. Otherwise there’s some degree of trust involved.


Client side is required to handle inputs from server properly anyway and to use the proper API so that it doesn't interpret input as markup/code.

For server output it's optional, because preventing client side code injection is not its concern.


Looking at the comments here, I think the final doc for this API should have a huge banner saying something like "this API is designed to be used for sanitizing output". Otherwise it would be easy for some folks to fall into a trap of using this function for sanitizing inputs on client, which is dangerous practice (never trust what's coming from the client).


This feels an awful lot like we're going full PHP[1]. Can't this just remain as a library someone else could implement?

[1]: https://www.php.net/manual/en/filter.filters.sanitize.php


See the spec document for the rationale. In short: the browser already has a good and safe parser, and knows best how it will treat which elements. An external parser written in JS is overhead, and is likely to be worse than, or outdated compared to, the browser's. (In general security practice, using a different parser for validation than the one used for later processing is somewhat of a smell, because it makes it more likely you can sneak something bad through a special case.)


If it’s part of browser, then it’s likely up-to-date with the latest features and quirks of the browser.


Is "PHP had something like this once" your only argument?


Given I've seen some CMSes double-escape HTML character entities and other such bad uses of sanitization filters, I think it is a reasonable concern to think about. Forcing people to source a library would make it clear what their intent is in terms of cleaning things up.


Considering the history of mysql_escape_string it's an important argument


This is from MySQLs C library.


This looks like a supplement only for the innerHTML method. Using the DOM to produce DOM artifacts has always produced clean markup.

The risk here is if the string is untrusted you are pushing untrusted and potentially risky code onto the user. The only silver lining is that now it may not break the page.


What is the point of sanitizing input in JavaScript? One could easily bypass that and still put nasty stuff in the input.

I think this is a bad idea, I just know there will be people who will make applications that will rely entirely on this API with disastrous results.


Never sanitize user data before storing it in your backend. Always sanitize user data before displaying it on the screen.


I would amend that slightly:

It's okay, but mostly a waste of time, to sanitize data before storing it.

You must sanitize data when outputting it.

Why? Because someone could get the data into storage in another way, or new vulnerabilities might be discovered that you aren't sanitizing for before storage.


The only positive I can think of: say you run a comment section on a news aggregator. You could let people freely type up their own markup in a <textarea>, accept it as-is and just throw it at clients to clean on their own. Another use I can think of: say you have an image upload service; assuming this works with SVG as well, you could just serve all sorts of potentially malicious SVG and have the client remove all the script tags.

While I'd prefer the inputs stored server-side to be pre-sanitized, I can see the benefit for just not touching it and shrugging.


You would sanitize it on the receiving end


I didn't think of that. I assumed the backend would already have sanitized it.

Either way, I think there should be a warning that it should never be used for input sanitation.


Explain "easily bypass"?


To clarify, I was assuming this was going to be used for input sanitation. If it were used for that a malicious actor could see where a form or XHR call would post to and simply post the raw input to that endpoint, bypassing any sanitation.


Oh, no it's not meant for that. It's used for sanitizing untrusted outputs for display.


IMHO, the idea here is that nowadays more and more people are using third-party libraries, which are hard to trust, and it's even worse when people `npm update` them with no idea what changed between versions.


Tangentially related (satire): http://motherfuckingwebsite.com/

>"Good design is as little design as possible."- some German motherfucker


I wonder how it handles cases like this:

<sc<script>ript>alert('XSS')</sc</script>ript>

...and other strings from https://github.com/minimaxir/big-list-of-naughty-strings


  > (new Sanitizer()).sanitizeToString(`<sc<script>ript>alert('XSS')</sc</script>ript>`)
  "ript&gt;alert('XSS')ript&gt;"


Wait why does the second example remove <blink> but the first one doesn't? (with apparently the same default/none configuration). Any rationale behind this? I'd expect both to return the same HTML, just in different formats (string vs DOM nodes)


This is the correct output of the first example, confirmed in real Firefox (note also the two spaces near the end):

  "Some text <b><i>with</i></b> tags, including a rogue script  def."


Oh so you are saying `<blink>` is always removed? That's right, it seems if it's not in the "allowElements" it's dropped:

https://wicg.github.io/sanitizer-api/#defaults

Edit: corrected "allowAttributes" => "allowElements"


<blink> would be allowElements, not allowAttributes.


Yeah, I think that's some kind of typo.


Thanks! Added a PR to the docs:

https://github.com/mdn/content/pull/4757


i read the title as "hand sanitizer API" and wondered what it was all about. side effects of the pandemic huh? :D


It saddens me that the blink tag is no more. It really defined the style of early web pages.



