

Misusing DOM text methods - BenjaminCoe
http://benv.ca/2012/10/4/you-are-probably-misusing-DOM-text-methods/

======
yuliyp
Text is a sequence of characters. HTML is a sequence of tags and HTML-encoded
text. Some text can be interpreted as HTML. Some of that HTML can be
malicious. The bottom line: if you take text and hand it to something that
expects HTML, you will get bugs with characters such as < and &, XSS holes,
or both.

Let's look at the methods discussed in the article. textContent gives you the
text inside of an element, ignoring any tags. This text can certainly look
like HTML, and that HTML can be malicious.

createTextNode takes text and creates a node with that text as its content.
Reading innerHTML of an element that contains only that node gives you HTML
that, when rendered, is the sequence of characters that matches the text you
passed in. If you want HTML that cannot contain tags, creating a text node and
immediately reading back the HTML around it is certainly a safe way to get it.
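
A minimal sketch of that text-to-HTML conversion, for illustration only. The name textToHtml is hypothetical, not something from the article. It assumes a browser `document` when one exists; the fallback branch is my own approximation of how HTML serialization encodes text nodes (&, <, > and the non-breaking space), so the sketch also runs outside a browser.

```javascript
// Hypothetical helper: convert plain text to HTML that renders as that text.
function textToHtml(text) {
  if (typeof document !== "undefined") {
    // Browser path: wrap the text in a text node and read the markup back.
    const div = document.createElement("div");
    div.appendChild(document.createTextNode(text));
    return div.innerHTML;
  }
  // No DOM available (assumption: mirror the serializer's replacements).
  // "&" must be handled first so later entities are not double-encoded.
  return text
    .replace(/&/g, "&amp;")
    .replace(/\u00a0/g, "&nbsp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(textToHtml("hello"));  // hello
console.log(textToHtml("1 < 2"));  // 1 &lt; 2
```

The two calls show both cases: for "hello" the HTML happens to equal the text, for "1 < 2" it does not.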

In general, "escaping" is the wrong way to think about it. You have functions
which can convert text to the equivalent HTML, and you have functions which
extract the text within a DOM node. While sometimes the HTML which renders as
a given text string is the same as the string, this is definitely not always
the case.

~~~
thwarted
Agreed on the escaping. Alternatively, if you have content you want to remove
HTML from before displaying, don't generate rich DOM nodes for that content;
stick it straight into a TextNode and insert _that_ into the DOM. Per the
example in the OP, div.textContent will remove the "first level" of HTML and
leave what looks like tags, since the characters are now unescaped. So just
display this as an actual TextNode. It will show the <script> string inline (I
say string here because it's not actually a tag, since it's in a TextNode),
but that's actually the content the author intended by having it escaped in
the input.
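
A rough sketch of that approach, under the assumption that the page runs in a browser. The helper name appendAsText and its parameters are hypothetical; `doc` is expected to be the page's `document`, passed in explicitly so the function does not depend on a global.

```javascript
// Hypothetical helper: insert a string as a Text node, so the HTML parser
// never sees it and anything that looks like a tag stays plain characters.
function appendAsText(doc, container, text) {
  const node = doc.createTextNode(text);
  container.appendChild(node);
  return node;
}

// Assumed browser usage, following the example in the OP:
//   const source = document.createElement("div");
//   source.innerHTML = "&lt;script&gt;alert(1)&lt;/script&gt;";
//   // source.textContent is now the string "<script>alert(1)</script>"
//   appendAsText(document, document.body, source.textContent);
//   // The page displays those characters literally; no script element exists.
```

The point of the design is that the string is never assigned to innerHTML on the way to the page, so there is no step at which it could be parsed as markup.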

~~~
bentlegen
That's a good point. I'll edit the article to mention this.

