

Can you parse html with regular expressions? - cbr
http://www.jefftk.com/news/2013-02-22

======
eamann
Obviously hasn't read the discussion here:
[http://stackoverflow.com/questions/1732348/regex-match-open-...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

RegEx is powerful, yes. But HTML (even though it's a standard) can be buggy,
inaccurate, and difficult to predict. Some people, encouraged by older books
that support the practice, don't close things like `<p>` tags. Others omit the
self-closing / on certain tags because they can (e.g. hr, br, img).

So can you use regular expressions to parse a known subset of expected HTML?
Probably. Could you use them to parse arbitrary, unfiltered, potentially broken
HTML? No. And you wouldn't expect to use RegEx to parse any other broken text
document that fails to follow a defined, knowable schema either.
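
To make the "known subset" point concrete, here is a minimal sketch in Python. The pattern, the `extract_links` helper, and both sample strings are hypothetical illustrations; the assumption is that every link is written exactly as `<a href="...">text</a>`.

```python
import re

# Assumption: links always look like <a href="...">text</a>, lowercase,
# double-quoted, with href as the only attribute. Nothing else is handled.
LINK_RE = re.compile(r'<a href="([^"]*)">([^<]*)</a>')


def extract_links(html):
    """Return (href, text) pairs for anchors that match the assumed shape."""
    return LINK_RE.findall(html)


# Markup that stays inside the known subset.
known = '<p><a href="/a">First</a> and <a href="/b">Second</a></p>'

# Markup from the wild: uppercase tag, single quotes, an extra attribute.
messy = "<p><A HREF='/a' class=x>First</a>"

print(extract_links(known))
print(extract_links(messy))
```

The first input yields both links, while the second yields nothing even though a browser would render it as a link. The failure is silent, which is the real risk with unfiltered input.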

~~~
cbr

        some people don't close things like <p>
    

In HTML5 you generally don't need to close p tags:
<http://www.w3.org/TR/html-markup/p.html>
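
A rough sketch of what a consumer has to do with unclosed p tags, using Python's standard `html.parser`. That module only tokenizes and does not insert implied end tags, so the `ParagraphCollector` class below (a hypothetical name) applies one simplified rule by hand: a new `<p>` starts a new paragraph whether or not the previous one was closed. The real omission rules cover more cases than this.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect paragraph text, treating each <p> as ending the previous one."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Simplification: a new <p> implicitly closes any open <p>.
            self.paragraphs.append("")
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.paragraphs]


print(paragraphs("<p>one<p>two</p><p>three"))
```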

    
    
        Others omit the self-closing / on certain tags
    

In HTML5 the / is meaningless and ignored:
[http://dev.w3.org/html5/spec-author-view/syntax.html#syntax-...](http://dev.w3.org/html5/spec-author-view/syntax.html#syntax-start-tag)
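
As a small illustration (this shows how Python's standard `html.parser` behaves, not the spec itself), the three spellings of a line break below all produce the same start-tag event. `TagNames` and `start_tags` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TagNames(HTMLParser):
    """Record the name of every start tag the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # For <br/>, the default handle_startendtag calls this method too.
        self.names.append(tag)


def start_tags(html):
    parser = TagNames()
    parser.feed(html)
    parser.close()
    return parser.names


for source in ("<br>", "<br/>", "<br />"):
    print(repr(source), start_tags(source))
```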

    
    
        And you wouldn't expect to use RegEx to parse
        any other broken text document that fails to
        follow a defined, knowable schema either.
    

If you want to extract information from a text document without a defined and
respected format, a RegEx will often do better than any other common tool.
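
For instance, here is a minimal sketch of pulling ISO-style dates out of free-form text that follows no schema at all. The pattern, the `find_dates` helper, and the sample string are made up for illustration, and the pattern checks only the shape of a date, not whether it is a valid one.

```python
import re

# Assumption: dates of interest are written as YYYY-MM-DD somewhere in the text.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_dates(text):
    """Return every YYYY-MM-DD-shaped substring, in order of appearance."""
    return DATE_RE.findall(text)


sample = "posted 2013-02-22 (edited 2013-02-23??), see ticket 12345-67"
print(find_dates(sample))
```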

