
Can regular expressions parse HTML or not? - duck
http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/
======
bunderbunder
The correct answer doesn't just depend on what you mean by "regular
expressions", it also depends on what you mean by "parse". And perhaps also
what you mean by "HTML".

Case in point: A while back I ran into a nasty bug in our software which would
cause it to crash with a stack overflow exception. It turned out that the
problem was in how HTML Agility Pack responded to certain kinds of
particularly poorly-formed HTML documents.

Now, we had chosen to use HTML Agility Pack in part because of all the
comments to the effect of "NOOOOOO don't use regex, use Agility Pack!"
complete with links to a certain famous Stack Overflow answer that are
littered around the internet. But in analyzing the situation in more detail,
we discovered a few things: First, we were just stripping tags, so our parsing
needs were very minor compared to what HTML Agility Pack does. Second,
generating a full-on parse tree for an HTML document was kind of harming us,
in that doing so takes both time and RAM, two things we were trying to be
light on. Third, we could afford to be fairly tolerant of a missed tag or some
lost content, but the parser giving up was a minor tragedy. Noisy results were
far better than no results for our purposes.

The upshot of all this being, for our "HTML parsing" needs, regular
expressions turned out to be _exactly_ the right tool for the job. Following
that realization, it was the work of only one afternoon to fix the bug, and as
a side benefit dramatically improve both the engine's performance and the
quality of its output.
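
A minimal sketch of that kind of tolerant tag stripping, in Python purely for
illustration. The comment above doesn't say which language or patterns were
actually used, so `TAG_RE` and `strip_tags` are hypothetical names and the
pattern is an assumption, not the original fix:

```python
import re

# Assumed pattern: an HTML comment, or anything that looks like a tag
# ("<", optional "/", a letter, then everything up to the next ">").
TAG_RE = re.compile(r"<!--.*?-->|</?[A-Za-z][^>]*>", re.DOTALL)


def strip_tags(html: str) -> str:
    """Drop tag-like spans and collapse whitespace.

    There is no parse tree and no recursion, so malformed or deeply
    nested markup cannot make this give up; at worst the output is noisy.
    """
    text = TAG_RE.sub(" ", html)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(strip_tags("<div><p>unclosed <b>bold"))
```

The trade-off matches the one described: a `>` inside an attribute value ends
the match early and script contents survive as text, but nothing ever crashes.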

Long story short, what passes for common wisdom on the Internet is no
substitute for knowing what you're doing.

------
ufo
> HTML in the wild can be rather wild.

One easy way to get an idea of how complicated this can be is checking out the
parsing algorithm in the HTML5 spec [1] (section 12).

To start with, HTML is not XML and not all tags are created equal.
Void elements like <img> and <br> don't have matching close tags and contents
for things like <textarea> and <script> are treated as text (you can't nest
more tags inside). Then you have the issue of JavaScript being able to call
document.write at HTML-parsing time, something you need to hope doesn't
happen or parsing becomes intractable. Finally, there is a vast number of
rules and special cases that kick in when you have missing or mis-nested tags
or tags being put in the wrong places.

[1] <http://www.whatwg.org/specs/web-apps/current-work/multipage/>
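
A small Python sketch of the first two points (void elements and raw-text
elements). It uses the standard library's `html.parser`, which is a lenient
tokenizer and not an implementation of the HTML5 parsing algorithm; the names
`TagLogger`, `parser_view`, `naive_regex_view` and the sample markup are made
up for illustration:

```python
import re
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record the start and end tags the parser reports."""

    def __init__(self):
        super().__init__()
        self.starts = []
        self.ends = []

    def handle_starttag(self, tag, attrs):
        self.starts.append(tag)

    def handle_endtag(self, tag):
        self.ends.append(tag)


SAMPLE = (
    '<p>one<br>two</p><img src="x.png">'
    '<script>document.write("<b>hi</b>");</script>'
)


def parser_view(html):
    """Start and end tags as seen by html.parser."""
    logger = TagLogger()
    logger.feed(html)
    logger.close()
    return logger.starts, logger.ends


def naive_regex_view(html):
    """Every '<name ...>' a context-unaware regex would call an opening tag."""
    return re.findall(r"<([A-Za-z][A-Za-z0-9]*)[^>]*>", html)


if __name__ == "__main__":
    print(parser_view(SAMPLE))
    print(naive_regex_view(SAMPLE))
```

The parser reports `br` and `img` as start tags with no matching end tags, and
treats the `<b>` inside the script as text, while the naive regex counts it as
a tag.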

------
mjn
On the theoretical side, pure regular expressions (w/o PCRE extensions) can
actually parse HTML to the extent that most browsers can. The expressions
would be gigantic, unmaintainable messes, and probably not even possible to
write by hand, but nonetheless it's more of a practical problem than an issue
of formal language power.

For example, WebKit sets a maximum DOM nesting limit of 512 [1], and most
other browsers have a limit as well. A context-free language restricted to a
finite maximum production depth becomes a regular language, so WebKit's
parsing _could_ be done by an NFA or DFA, and the parser could be encoded as a
regular expression if desired. But the only plausible way to write such an
expression correctly would be to mechanically "compile" it from a grammar. So
you're going to need the grammar anyway, at which point you might as well use
it directly.

[1]
[http://trac.webkit.org/browser/trunk/Source/WebCore/page/Set...](http://trac.webkit.org/browser/trunk/Source/WebCore/page/Settings.h#L408)
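
To make the depth-bounding argument concrete, here is a toy Python sketch that
mechanically "compiles" a plain regular expression for a made-up language: text
with a single tag `<d>...</d>` nested at most `max_depth` deep. The language
and the names `nested_d_regex` and `matches` are assumptions for illustration
and have nothing to do with WebKit's actual parser:

```python
import re


def nested_d_regex(max_depth: int) -> str:
    """Build a regex (no recursion, no backreferences) for <d>...</d>
    nested at most max_depth levels deep."""
    text = r"[^<>]*"
    pattern = text  # depth 0: plain text, no tags at all
    for _ in range(max_depth):
        # depth k: text interleaved with elements whose bodies have depth k-1
        pattern = rf"{text}(?:<d>{pattern}</d>{text})*"
    return pattern


def matches(s: str, max_depth: int) -> bool:
    """True if s is well nested and no deeper than max_depth."""
    return re.fullmatch(nested_d_regex(max_depth), s) is not None


if __name__ == "__main__":
    print(matches("<d><d>x</d></d>", 1))
    print(matches("<d><d>x</d></d>", 2))
```

With one tag name the pattern grows only linearly with depth, but in this
construction every additional tag name needs its own copy of the inner
pattern, so the size multiplies at each level. That is the sense in which the
result is something you generate rather than write by hand.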

~~~
bvdbijl
Why does webkit have a limited tag depth?

~~~
coldtea
Logically, to avoid denial-of-service attacks via memory exhaustion, etc.

~~~
T-hawk
And here's the canonical example in XML:
<http://en.wikipedia.org/wiki/Billion_laughs>

------
bculkin2442
IMO the probability of a given page being well formed, or being malformed in
a manner that will not completely screw up the parser, is probably very low.
Therefore, you should use a dedicated HTML parser for parsing HTML.

~~~
johndcook
I agree as far as parsing most HTML. I use regular expressions all the time
when I'm working with HTML I've written by hand because I know what's there.

I just find it interesting that the efficacy of regular expressions can be
framed as a computer science question, a practical question, and a statistical
question.

~~~
fuzzix
In practical terms, how often are you actually _parsing_ HTML? Building a DOM,
rendering...

A knee-jerk reaction to 'HTML' and 'regular expression' being used in the same
sentence is like someone seeing 'goto' without understanding the context and
shouting "goto considered harmful!"

I recently used some combination of perl's split() and regexes to trivially
pull all links from a piece of markup. I had to suppress the "Don't parse HTML
with regular expressions!" voices echoing in my head the whole time I was
writing it. I'm OK with the code now, of course.
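
For a flavor of what that looks like, here is a rough sketch in Python rather
than Perl. It is not the code described above; `HREF_RE` and `extract_links`
are hypothetical names and the pattern is an assumption:

```python
import re

# Assumed pattern: an <a ...> tag with an href that is double-quoted,
# single-quoted, or unquoted. Good enough for markup you control.
HREF_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def extract_links(markup: str) -> list:
    """Return href values in document order; only one of the three
    capture groups is non-empty for any given match."""
    return [dq or sq or bare for dq, sq, bare in HREF_RE.findall(markup)]


if __name__ == "__main__":
    sample = '<a href="http://a.example/">A</a> <a href=c.html>C</a>'
    print(extract_links(sample))
```

It will happily pick up links inside comments or scripts and can be fooled by
attributes such as `data-href`, which is fine for a quick extraction job and
not fine for a browser.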

~~~
johndcook
Goto statements are a good example. It's easier to recite "goto is harmful"
than to say "In most situations, other control structures are more expressive
and easier to maintain than goto statements. However, there may be rare
occasions, particularly in low-level system programming, where goto statements
could be preferable."

"Don't parse HTML with regex" is good general advice, but no more an absolute
than avoiding goto statements.

~~~
Dylan16807
>rare occasions, particularly in low-level system programming, where goto
statements could be preferable

Or something as simple as "break 2;"

------
lutusp
This can easily be summarized -- if the target HTML meets a modern
specification and has no serious syntax errors, or if we're really talking
about XHTML, which has a much stricter syntax, then no problem. Otherwise
parsing the target will produce endless difficulties.

But then, if the HTML has syntax problems or contains inconsistent syntax,
then no approach is entirely reliable, which means it's not about regular
expressions any more.

------
newishuser
This post reminds me of Clarence Thomas' explanation of his short opinions.
This is the perfect example of a "5 cent idea in a $10 sentence."

Post adds nothing of substance to the following:

 _"Well-formed HTML is context-free. So you can match it using regular
expressions."_

 _"But most HTML you see in the wild is not well-formed. And just because you
can, doesn’t mean that you should."_

which wasn't even written by him.

~~~
cardine
5 cent ideas in $10 sentences is pretty much describing half of HackerNews.

------
Qantourisc
The topic in #regex on irc.freenode.org:

READ THIS FIRST: Need help? 1) Language/platform. 2) Sample string. 3) Desired
result. 4) Your attempt. | Do NOT use RegEx to parse HTML! | Do NOT tell us a
RegEx doesn't work if you aren't testing it in the language you asked about! |
Intro: <http://bit.ly/XrnV> | Home: <http://bit.ly/KEo1Gx> | FAQ:
<http://bit.ly/L949Mk> | Quiz: ? quiz | Regex website: <http://regex101.com/>

Note the part "Do NOT use RegEx to parse HTML!" ...

You can use regex to parse HTML, but it's a very bad idea. Basically you
would need to reimplement the HTML definition in regex, otherwise your regex
will screw up.

------
giardini
In general, no. Regular expressions match regular languages but HTML is a
context-free language. Every regular grammar is context-free but not all
context-free grammars are regular.

See
[http://stackoverflow.com/questions/590747/using-regular-expr...](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not)
for details. Also

[http://stackoverflow.com/questions/5175840/is-html-a-context...](http://stackoverflow.com/questions/5175840/is-html-a-context-free-language)

And then you have attempts to recover when HTML is ill-formed:
<http://en.wikipedia.org/wiki/Tag_soup>

~~~
omaranto
Is your comment aimed at people who didn't read the linked article?

