

Ask HN: Regular Expression to Remove Script Tags and Their Contents? - NathanKP

Hello,

I am currently working on a Ruby web parser that reads webpages and extracts
only the text. I am using the following lines of code to strip scripts,
styling, and tags:

    cleaned = html.gsub(/<[^(>|\/)]*?script.*?>[^<]*?<.*?\/script.*?>/, "")
    cleaned = cleaned.gsub(/<[^(>|\/)]*?style.*?>[^<]*?<.*?\/style.*?>/, "")
    cleaned = cleaned.gsub(/<\/?[^>]*>/, "")

The only problem is when the script tag contains a less-than sign, such as:

    <script type='text/javascript'>
        var foo=3;
        if(foo<5) {
            alert("bar")
        }
    </script>

In that case the first regular expression, the one for removing script tags
and their contents, does not work. It all boils down to this part of the
regular expression:

    [^<]*?

I can't seem to figure out a better way to make the star grab everything in a
lazy fashion (in other words, don't match the first opening script tag with
the last closing script tag in the HTML).

Has anyone else worked around this problem in the past? Are there any regular
expression ninjas who might be able to lend a hand?

I thank you in advance for your time and consideration.
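
For reference, here is a minimal sketch of the lazy match described above. It is one possible approach, not a robust solution. It assumes the markup is simple enough that each opening script or style tag is followed by a matching closing tag, and the helper name `strip_markup` is made up for illustration. In Ruby, `.` does not match newlines unless the `/m` option is given, so `.*?` with `/m` can lazily cross both line breaks and `<` characters:

```ruby
# Sketch only: strip <script>/<style> elements (including bodies that
# contain "<"), then strip the remaining tags. This is not a substitute
# for an HTML parser: comments, CDATA, and a literal "</script>" inside
# a JavaScript string are not handled.
def strip_markup(html)
  # /m lets "." match newlines and /i ignores case. ".*?" is lazy, so
  # each opening tag pairs with the nearest closing tag of the same
  # name (the \1 backreference).
  without_blocks = html.gsub(/<(script|style)\b[^>]*>.*?<\/\1\s*>/mi, "")
  # Remove whatever tags remain, keeping their text content.
  without_blocks.gsub(/<\/?[^>]*>/, "")
end

sample = <<~HTML
  <p>Hello</p>
  <script type='text/javascript'>
      var foo=3;
      if(foo<5) {
          alert("bar")
      }
  </script>
  <p>World</p>
HTML

puts strip_markup(sample)
```

The final tag-stripping pass is unchanged from the third `gsub` in the question, so it still treats any `<...>` run in ordinary text as a tag.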
======
gaius
"ninjas" <- <http://news.ycombinator.com/item?id=1591573>

Anyway, try the Ruby port of Python's BeautifulSoup:
<http://www.crummy.com/software/RubyfulSoup/>

~~~
NathanKP
_ninjas_

I was thinking of the XKCD comic where the hero swings in to save the day with
his regular expression. ;)

I will look at Beautiful Soup and see if that can work as an alternate
technique.

Thank you.

------
byoung2
<http://wonko.com/post/sanitize> _Sanitize, a whitelist-based HTML sanitizer
written in Ruby. Given a list of acceptable elements and attributes, Sanitize
will remove all unacceptable HTML from a string._

------
lhorie
You'll probably want to read this:

<http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags>

~~~
NathanKP
Okay, I get the point. Trying to write regular expressions instead of using a
real XML parser is a bad idea.

