

You Cannot Parse HTML with Regular Expressions - johnnyg
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

======
jmount
"Some people, when confronted with a problem, think 'I know, I'll use regular
expressions.' Now they have two problems." Jamie Zawinski 1997.

~~~
83457
No one should ever use regular expressions because there is a cute quote that
says not to.

~~~
jmount
The problem is that regular expressions (in addition to not being powerful
enough to properly handle HTML) are a bit of a maintenance headache. They look
like line noise. Most regular expressions don't behave consistently across
locales (did you use \w, [a-zA-Z] or \p{Alpha}, and which of these did you
really mean?). And you can't always be sure the regexp is implemented
efficiently (by compiling to a finite automaton instead of using a
backtracking search, see <http://swtch.com/~rsc/regexp/regexp1.html> ). Often
a parser generator (whose first stage is usually a lexer, which is itself as
powerful as regular expressions) is a much better solution.
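
To make the \w versus [a-zA-Z] point concrete, here is a minimal sketch using
Python 3's `re` module. The sample string and variable names are made up for
illustration; other engines and other Python versions may behave differently.

```python
import re

# Sample text with two non-ASCII letters (i-diaeresis and e-acute).
text = "na\u00efve caf\u00e9"

# [a-zA-Z] only knows ASCII letters, so accented letters split the words.
ascii_class = re.findall(r"[a-zA-Z]+", text)

# On str patterns in Python 3, \w is Unicode-aware by default
# (it also matches digits and the underscore).
unicode_w = re.findall(r"\w+", text)

# re.ASCII narrows \w back down to [a-zA-Z0-9_].
ascii_w = re.findall(r"\w+", text, flags=re.ASCII)

print(ascii_class, unicode_w, ascii_w)
```

The same pattern text can therefore mean different things depending on flags
and engine defaults, which is part of the maintenance headache.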

------
util
Is it right that BeautifulSoup was (is?) implemented in terms of Python
regular expressions?

~~~
rbonvall
Regular expressions can be (and are) used to tokenize the code, but cannot do
the actual parsing.

What "REs can't parse HTML" means in theory is that you cannot design a regexp
that tells HTML apart from non-HTML.

The fundamental reason is that regexps cannot detect arbitrarily deep nested
structures.

In practice it is possible, because most regexp engines go beyond true regular
expressions (backreferences, and recursion in engines such as PCRE), but it
would be crazy to do so.
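
A rough sketch of the "tokenize with a regexp, parse with something else" split,
in Python. The `TAG` pattern and the `is_balanced` helper are hypothetical names
invented for this example, and the tokenizer is deliberately simplistic: it
ignores comments, CDATA, void elements written without a slash (like `<br>`),
and attribute values containing angle brackets. It only illustrates that the
nesting check needs a stack, which a true regular expression does not have.

```python
import re

# Hypothetical, simplified tag tokenizer: optional leading slash, a tag name,
# any attribute text, and an optional trailing slash for self-closing tags.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*?(/?)>")

def is_balanced(markup):
    """Return True if every opening tag is closed in the right order."""
    stack = []  # names of currently open tags; this is the unbounded memory
    for match in TAG.finditer(markup):
        closing, name, self_closing = match.groups()
        name = name.lower()
        if self_closing:
            continue  # <br/> opens and closes in one token
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # leftover open tags mean the markup is unbalanced

print(is_balanced("<div><p>hi</p></div>"))
print(is_balanced("<div><p>hi</div></p>"))
```

The regexp only finds the tokens; the stack does the part that depends on
nesting depth. For real HTML, a proper parser is still the safer choice.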

