Can you parse html with regular expressions? (jefftk.com)
2 points by jefftk on Feb 22, 2013 | 2 comments



Obviously hasn't read the discussion here: http://stackoverflow.com/questions/1732348/regex-match-open-...

RegEx is powerful, yes. But HTML (even though it's a standard) can be buggy, inaccurate, and difficult to predict. Some people, encouraged by older books that support the practice, don't close things like `<p>` tags. Others omit the self-closing / on certain tags because they can (e.g. hr, br, img).

So can you use regular expressions to parse a known subset of expected HTML? Probably. Could you use them to parse arbitrary, unfiltered, potentially broken HTML? No. And you wouldn't expect to use RegEx to parse any other broken text document that fails to follow a defined, knowable schema either.
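
To make the distinction concrete, here is a rough sketch in Python. The pattern, the helper names, and the sample strings are all made up for illustration, and the parser class is just one simple way to tolerate unclosed `<p>` tags with the standard library's `html.parser`, not a full HTML5 tree builder.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: only matches <p>...</p> pairs that are explicitly closed.
P_PAIR = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def paragraphs_by_regex(html):
    """Works on the 'known subset': every <p> has a matching </p>."""
    return [text.strip() for text in P_PAIR.findall(html)]


class ParagraphCollector(HTMLParser):
    """Collects <p> text, treating a new <p> as implicitly closing the last one."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # list of text chunks while inside a <p>

    def _flush(self):
        # Store the paragraph collected so far, if any, and reset.
        if self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        # Flush any paragraph still open at end of input.
        super().close()
        self._flush()


def paragraphs_by_parser(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


well_formed = "<p>one</p><p>two</p>"
unclosed = "<p>one<p>two"
```

On `well_formed` both functions agree. On `unclosed` the regex finds nothing at all, while the parser-based version still recovers both paragraphs.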


    some people don't close things like <p>
In HTML5 you generally don't need to close p tags: http://www.w3.org/TR/html-markup/p.html

    Others omit the self-closing / on certain tags
In HTML5 the / is meaningless and ignored: http://dev.w3.org/html5/spec-author-view/syntax.html#syntax-...

    And you wouldn't expect to use RegEx to parse
    any other broken text document that fails to
    follow a defined, knowable schema either.
If you want to extract information from a text document without a defined and respected format, a RegEx will often do better than any other common tool.
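
As a small illustration of that last point, here is a sketch in Python that pulls dates out of text with no schema at all. The date pattern, the `find_dates` helper, and the sample string are assumptions chosen for the example; a real document would need a pattern tuned to whatever it actually contains.

```python
import re

# Assumed pattern for illustration: "Mon DD, YYYY" dates such as "Feb 22, 2013",
# with the month either abbreviated or spelled out.
DATE = re.compile(
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? "
    r"(\d{1,2}), (\d{4})\b"
)


def find_dates(text):
    """Return (month abbreviation, day, year) for each date-like match."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in DATE.finditer(text)
    ]


# Hypothetical free-form input with no consistent structure.
messy = "posted on Feb 22, 2013 >> edited: March 3, 2013 ;; no schema here"
dates = find_dates(messy)
```

There is no grammar to hand to a parser here, so a pattern that matches only the pieces you care about and skips everything else is about the best you can do.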



