

Why you should not parse (X)HTML with a Regexp - superted
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

======
obtino
The technical explanation for this is given in comment 3 of the page and sums
it up perfectly:

"I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free
grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a
Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't
possibly hope to make this work. But many will try, some will claim success
and others will find the fault and totally mess you up."

More info: <http://en.wikipedia.org/wiki/Chomsky_hierarchy>
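
To make the quoted point concrete, here is a minimal sketch (not from the linked answer) of why nesting is the sticking point. A plain regular expression has no way to count how many `<div>` tags are still open, so it stops at the nearest `</div>`. A parser that tracks depth handles the same input. The names `NESTED` and `DepthTracker` are made up for illustration, and Python's standard-library `html.parser` is used only as a convenient stand-in for "a real parser".

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one <div> nested inside another.
NESTED = "<div>outer <div>inner</div> tail</div>"

# A typical regex attempt: lazily match from the first <div> to the nearest
# </div>. The pattern cannot track nesting depth, so it pairs the outer
# opening tag with the inner closing tag.
naive = re.search(r"<div>(.*?)</div>", NESTED).group(1)


class DepthTracker(HTMLParser):
    """Tracks <div> nesting depth and groups text by the depth it appears at."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.text_by_depth = {}

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        # Record each text chunk under the current nesting depth.
        self.text_by_depth.setdefault(self.depth, []).append(data)


tracker = DepthTracker()
tracker.feed(NESTED)
tracker.close()

print(repr(naive))            # the regex capture, cut off at the inner </div>
print(tracker.max_depth)      # deepest <div> nesting seen by the parser
print(tracker.text_by_depth)  # text grouped by nesting depth
```

The regex capture ends at the inner `</div>` and still contains a stray `<div>`, while the depth counter attributes each piece of text to the right level. Switching the pattern to a greedy match only moves the failure to inputs with sibling elements.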

------
iwwr
_Even Jon Skeet cannot parse HTML using regular expressions._

~~~
telemachos
But according to a comment on the post, "Chuck Norris _can_ parse HTML with
regex" (emphasis in original).

More seriously, when SO was newer, I remember the feeling that these things
came in waves. For a few months, there was someone who responded to nearly
_every_ offending Ruby question (or answer) by pointing out that exceptions
shouldn't be used for flow control. See these two questions by another poster
for more on HTML and regexes [1][2].

[1] <http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege>

[2] <http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser>

------
wvl
And the previous discussion:

<http://news.ycombinator.com/item?id=1487695>

------
d_r
Fortunately, BeautifulSoup saves the day for HTML parsing tasks.

(<http://www.crummy.com/software/BeautifulSoup/>)

