With context, this article is more interesting than the title might imply. > The...

crote · 2025-12-10T17:43:00 1765388580

Yeah, I was expecting something closer to "because that's what people Google for".

A big part of designing a security-related API is making it really easy and obvious to do the secure thing, and hiding the insecure stuff behind a giant "here be dragons" sign. You want people to accidentally do the right thing, so you call your secure and insecure functions "setHTML" and "setUnsafeHTML" instead of "setSanitizedHTML" and "setHTML".
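
A minimal TypeScript sketch of that naming idea. The `Panel` class and the `escapeHtml` helper are hypothetical, and escaping everything is only a stand-in for a real sanitizer; the point is just that the short, obvious name is the safe one:

```typescript
// Hypothetical stand-in for a real sanitizer: neutralizes all markup by
// escaping HTML metacharacters.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical component that holds a markup string.
class Panel {
  private markup = "";

  // The short, obvious name is the safe path.
  setHTML(untrusted: string): void {
    this.markup = escapeHtml(untrusted);
  }

  // The dangerous path has to be spelled out at every call site.
  setUnsafeHTML(trusted: string): void {
    this.markup = trusted;
  }

  render(): string {
    return this.markup;
  }
}

const panel = new Panel();
panel.setHTML("<img src=x onerror=alert(1)>");
console.log(panel.render()); // the tag arrives escaped, not as live markup
```

Someone who reaches for `setHTML` without reading the docs gets the safe behavior, and a code reviewer can grep for `Unsafe` to find every place that opted out.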

guessmyname · 2025-12-10T18:50:40 1765392640

100%… it’s like Rust’s “unsafe” keyword, or the Rust reqwest crate naming things like danger_accept_invalid_certs(true) and danger_accept_invalid_hostnames(true) → https://docs.rs/reqwest/latest/reqwest/struct.ClientBuilder....

cess11 · 2025-12-10T18:47:24 1765392444

get_magic_quotes_gpc() and mysql_real_escape_string() had quite a bit to teach in this area.

some_furry · 2025-12-10T22:52:52 1765407172

Both of those functions were deprecated years ago.

mysql_real_escape_string() was removed in PHP 7.0.

get_magic_quotes_gpc() was removed in PHP 8.0.

https://www.php.net/mysql_real_escape_string

https://www.php.net/get_magic_quotes_gpc

The current minimum PHP version that is supported for security fixes by the PHP community is 8.1: https://www.php.net/supported-versions.php

If you're still seeing this in 2025 (going on 2026), there are other systemic problems at play besides the PHP code.

garaetjjte · 2025-12-11T17:52:55 1765475575

mysql_real_escape_string is only deprecated because there is mysqli_real_escape_string. I always wondered why it's "real"... like, is there a "fake" version of it?

cess11 · 2025-12-12T21:44:20 1765575860

Yes.

https://www.php.net/manual/en/function.mysql-escape-string.p...

https://stackoverflow.com/questions/3665572/mysql-escape-str...

One hardly even tries to do what it says on the tin; the other at least tries to be the real thing. Neither of them worked very well, however.

cess11 · 2025-12-11T08:46:40 1765442800

Hence why I chose "had" for my previous comment.

tacone · 2025-12-11T12:44:18 1765457058

Decades ago.

mubou2 · 2025-12-10T17:32:13 1765387933

The author really needs to start with that. They say "the API that we are building" and assume I know who they are and what they're working on, all the way until the very bottom. I just assumed it's some open source library.

> HTML parsing is not stable and a line of HTML being parsed and serialized and parsed again may turn into something rather different

Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous? Because it's pretty much the only thing you can do when sanitizing server-side, which we do a lot.

Edit: I also wonder how one would add, for example, rel="nofollow noreferrer" to links using this. Some sanitizers have a "post process node" visitor function for this purpose (they already have to traverse the DOM tree anyway).

crote · 2025-12-10T17:55:13 1765389313

> Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous?

The article links to [0], which has some examples of instances in which HTML parsing is context-sensitive. The exact same string being put into a <div> might be totally fine, while putting it inside a <style> results in XSS.

[0]: https://www.sonarsource.com/blog/mxss-the-vulnerability-hidi...

tobr · 2025-12-10T17:43:33 1765388613

> They say "the API that we are building" and assume I know who they are and what they're working on, all the way until the very bottom.

This is a common and rather tiresome critique of all kinds of blog posts. I think it is fair to assume the reader has a bit of contextual awareness when you publish on your personal blog. Yes, you were linked to it from a place without that context, but it’s readily available on the page, not a secret.

mubou2 · 2025-12-10T17:53:39 1765389219

Well that's... certainly a take. But I have to disagree. Most traffic coming to blog posts is not from people who know you and are personally following your posts; it's from people who clicked a link to the article someone shared or found it while googling something.

It's not hard to add one line of context so readers aren't lost. Here, take this for example, combining a couple parts of the GitHub readme:

> For those who are unfamiliar, the Sanitizer API is a proposed new browser API being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.

Easy. Can fit that in right after "this blog post will explain why", and now everyone is on the same page.

swiftcoder · 2025-12-10T18:02:04 1765389724

> Most traffic coming to blog posts is not from people who know you and are personally following your posts

Do we have data to back that up? Anecdotally, the blogs I have operated over the years tend to be sustained mostly by repeat traffic from followers (with occasional bursts of external traffic if something trends on social media).

rerdavies · 2025-12-11T01:47:53 1765417673

Your data sounds a bit anecdotal. :-P

Here's my anecdotal data. Number of blogs that I personally follow: zero. And yet, somehow, I end up reading a lot of blog posts (mostly linked from HN, but also from other places in my webosphere).

(More than a bit irritated by the "Do you have data to back that up" thing, given that you don't really have data to back up your position).

swiftcoder · 2025-12-11T07:55:51 1765439751

> (More than a bit irritated by the "Do you have data to back that up" thing, given that you don't really have data to back up your position).

It wasn't necessarily a request for you personally to provide data. I'm curious if any larger blog operators have insight here.

"person who only reads the 0.001% of blog posts that reach the HN front page" is not terribly interesting as an anecdotal source on blog traffic patterns

tobr · 2025-12-10T18:17:40 1765390660

> It's not hard

It’s also not hard to look around for a few seconds to find that information, is my point.

rerdavies · 2025-12-11T01:52:45 1765417965

What's hard in this case is that you end up making it 80% of the way through the article before you start to wonder what the heck this guy is talking about. So you have to click away to another page to figure out who the heck this guy is, then start again at the top of the article, reading it with that context in mind.

One word would have fixed the problem: "Why does the Mozilla API blah blah blah." Perhaps "The Mozilla implementation used to...". Something like that.

THAT is not hard.

LegionMammal978 · 2025-12-10T17:49:19 1765388959

They had a link in their post [0]: it seems like most of the examples are with HTML elements with wacky contextual parsing semantics such as <svg> or <noscript>. Their recommendation for server-side sanitization is "don't, lol", and they don't offer much advice regarding it.

Personally, my recommendation in most cases would be "maintain a strict list of common elements/attributes to allow in the serialized form, and don't put anything weird in that list: if a serialize-parse roundtrip has the remote possibility of breaking something, then you're allowing too much". Also, "if you want to mutate something, then do it in the object tree, not in the serialized version".

[0] https://www.sonarsource.com/blog/mxss-the-vulnerability-hidi...
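
A rough TypeScript sketch of both recommendations (a short allowlist, and mutating the tree rather than the string). Everything here is hypothetical: the `TreeNode` type stands in for a parsed DOM, and `ALLOWED`, `sanitize`, and `serialize` are illustrative names, not any library's API. It also shows one way to add rel="nofollow noreferrer" while the tree is being walked:

```typescript
// Hypothetical minimal tree type standing in for a parsed DOM.
interface ElementNode {
  tag: string;
  attrs: { [name: string]: string };
  children: TreeNode[];
}
type TreeNode = string | ElementNode;

// Deliberately short allowlist: tag name -> attributes permitted on it.
const ALLOWED: { [tag: string]: string[] } = { p: [], b: [], i: [], a: ["href"] };

function has(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns a sanitized copy of the tree; disallowed elements are dropped
// together with their subtrees.
function sanitize(nodes: TreeNode[]): TreeNode[] {
  const out: TreeNode[] = [];
  for (const node of nodes) {
    if (typeof node === "string") {
      out.push(node);
      continue;
    }
    const tag = node.tag.toLowerCase();
    if (!has(ALLOWED, tag)) continue;
    const attrs: { [name: string]: string } = {};
    for (const name of Object.keys(node.attrs)) {
      const lower = name.toLowerCase();
      const value = node.attrs[name];
      if (ALLOWED[tag].indexOf(lower) === -1) continue;
      // Only plain http(s) links survive; this drops javascript: URLs.
      if (lower === "href" && !/^https?:\/\//i.test(value)) continue;
      attrs[lower] = value;
    }
    // The mutation happens on the tree, not on a serialized string.
    if (tag === "a") attrs["rel"] = "nofollow noreferrer";
    out.push({ tag, attrs, children: sanitize(node.children) });
  }
  return out;
}

function escapeText(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serializes the sanitized tree, escaping all text and attribute values.
function serialize(nodes: TreeNode[]): string {
  let html = "";
  for (const node of nodes) {
    if (typeof node === "string") {
      html += escapeText(node);
      continue;
    }
    html += "<" + node.tag;
    for (const name of Object.keys(node.attrs)) {
      html += " " + name + '="' + escapeText(node.attrs[name]) + '"';
    }
    html += ">" + serialize(node.children) + "</" + node.tag + ">";
  }
  return html;
}

const dirty: TreeNode[] = [
  {
    tag: "p",
    attrs: { onclick: "steal()" },
    children: [
      "Hi ",
      { tag: "a", attrs: { href: "https://example.com", style: "position:fixed" }, children: ["link"] },
      { tag: "script", attrs: {}, children: ["alert(1)"] },
    ],
  },
];
console.log(serialize(sanitize(dirty)));
// → <p>Hi <a href="https://example.com" rel="nofollow noreferrer">link</a></p>
```

With only `p`, `b`, `i`, and `a` in the list, there is nothing in the output whose parsing depends on context, which is the property the roundtrip needs.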

tlb · 2025-12-10T18:31:40 1765391500

setHTML needs to support just about every element if it's going to be the standard way of rendering dynamic content. Certainly <svg> has to work or the API isn't useful.

SanitizeHTML functions in JS have had big security holes before, around edge cases like null bytes in values, or what counts as a space in Unicode. Browsers decided to be lenient in what they accept, so that means any serialize-parse chain creates some risk.

LegionMammal978 · 2025-12-10T18:55:44 1765392944

If you're rendering dynamic HTML, then either the source is authorized to insert arbitrary dynamic content onto the domain, or it isn't. And if it isn't, then you'll always have a hard time unless you're as strict as possible with your sanitization, given how many nonlocal effects can be embedded into an HTML snippet.

The more you allow, the less you know about what might happen. E.g., <svg> styling can very easily create clickjacking attacks. (If I wanted to allow SVGs at all, I'd consider shunting them into <img> tags with data URLs.) So anyone who does want to use these more 'advanced' features in the first place had better know what they're doing.
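
A small TypeScript sketch of the <img>-with-data-URL idea. The helper names `svgToImgTag` and `escapeAttr` are made up for illustration. It relies on browsers not running scripts in SVG that is loaded as an image, which is my understanding of current behavior rather than something established in this thread:

```typescript
// Escapes a value for use inside a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: wraps untrusted SVG source in an <img> tag instead of
// inlining it into the page's own DOM.
function svgToImgTag(svgSource: string, altText: string): string {
  // encodeURIComponent percent-encodes <, >, ", #, spaces and similar
  // characters, so the payload cannot break out of the src attribute.
  // It throws a URIError if the input contains a lone surrogate.
  const dataUrl = "data:image/svg+xml;charset=utf-8," + encodeURIComponent(svgSource);
  return '<img src="' + dataUrl + '" alt="' + escapeAttr(altText) + '">';
}

const untrustedSvg =
  '<svg xmlns="http://www.w3.org/2000/svg"><script>alert(1)</script><circle r="5"/></svg>';
console.log(svgToImgTag(untrustedSvg, "user-supplied drawing"));
```

The SVG still renders, but as an opaque image: its styles and markup no longer share a document with the rest of the page.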

bffjjfjf · 2025-12-10T22:23:09 1765405389

That overly reductive thinking can go back to the 80s before we had learned any lessons. There are degrees of trust. Binary thinking invites dramatic all or nothing failures.

LegionMammal978 · 2025-12-11T15:30:31 1765467031

And my point is that with HTML, there's always an extremely fine line between allowing "almost nothing" and "almost all of it" when it comes to sanitization. I'd love to live in a world where there are natural delineations of features that can safely be flipped on or off depending on how much control you want to give the source over the content, but in practice, there are dozens of HTML/CSS features (including everything in the linked article) that do wacky stuff that can cross over the lines.

mubou2 · 2025-12-10T17:58:57 1765389537

Ah, I see what they're talking about. That's a good article; my brain totally skipped over that link. Thanks.

rebane2001 · 2025-12-11T07:35:39 1765438539

> Because it's pretty much the only thing you can do when sanitizing server-side

I'd suggest not sanitizing user-provided HTML on the server. It's totally fine to do if you're fully sanitizing it, but it gets a little sketchy when you want to keep certain elements and attributes.

masklinn · 2025-12-10T18:29:12 1765391352

> Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous?

The term to look for is “mutation xss” (or mxss).