

Parsing HTML with Regex - babawere
http://stackoverflow.com/a/1732454/1226894

======
ircambridge
While I agree with the answer, what most people take from it is "don't use
regex to scrape data from HTML", which isn't exactly the point of it. Parsing
HTML and scraping are two different things.

If you know the exact HTML you are working with, using regex to extract the
data is, in my opinion, a superior way of doing it: fewer lines and generally
less complexity. (Taking the name and id of an Amazon product from a
single site is different from taking all the links out of any given page.)
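
A minimal sketch of that kind of targeted extraction, in Python. The markup,
the `data-product-id` attribute and the `productTitle` id are invented for
illustration and are not taken from any real site; the point is only that the
patterns are tied to one known template.

```python
import re

# Hypothetical fragment of a product page whose exact markup is known in advance.
PAGE = '''
<div id="product" data-product-id="B00EXAMPLE">
  <span id="productTitle">Example Widget, 2-Pack</span>
</div>
'''

# Both patterns assume this exact template; any change to it breaks them.
ID_RE = re.compile(r'data-product-id="([^"]+)"')
TITLE_RE = re.compile(r'<span id="productTitle">([^<]+)</span>')


def extract_product(html):
    """Return (product_id, title), or None if the expected markup is missing."""
    id_match = ID_RE.search(html)
    title_match = TITLE_RE.search(html)
    if id_match is None or title_match is None:
        return None
    return id_match.group(1), title_match.group(1).strip()
```

Returning `None` when either pattern misses makes a template change show up
as an explicit failure rather than as silently wrong data.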

~~~
ricardobeat
With today's widely available DOM manipulation tools (jsdom, phantom, zombie)
and proper HTML parsers in javascript there is _absolutely no reason_ to use
RegExps.

~~~
chimeracoder
> (jsdom, phantom, zombie) and proper HTML parsers in javascript

Well, for starters, there's the case where you're not using JavaScript. All of
the tools you mentioned are JavaScript-based.

~~~
klibertp
There is an XPath implementation for everything under the sun now - I saw and
used one even in Erlang - and such implementations beat regexes in
readability and LOC 9 times out of 10.

The only real reason to use regexes is when dealing with HTML so broken that
parts of it are inaccessible through a parser.
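
To make the readability comparison concrete, here is a rough sketch using
Python's standard-library `xml.etree.ElementTree`, which supports only a small
XPath subset and needs well-formed input (a fuller XPath engine would be a
third-party library). The document and the `ext` class are made up for the
example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; ElementTree rejects anything that is not XML-clean.
DOC = """
<html><body>
  <ul>
    <li><a class="ext" href="http://example.com/a">A</a></li>
    <li><a href="/local">Local</a></li>
    <li><a class="ext" href="http://example.com/b">B</a></li>
  </ul>
</body></html>
"""


def ext_links_xpath(doc):
    """Collect hrefs of <a class="ext"> elements with an XPath-style query."""
    root = ET.fromstring(doc)
    return [a.get("href") for a in root.findall(".//a[@class='ext']")]


def ext_links_regex(doc):
    """Same goal with a regex; silently depends on attribute order and quoting."""
    return re.findall(r'<a\s+class="ext"\s+href="([^"]*)"', doc)
```

Both functions agree on this input, but the regex version stops matching as
soon as `href` comes before `class`, while the query does not care.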

~~~
jgraham
On the other hand, the world would really benefit from someone updating the
libxml2 HTML parser to match the HTML(5) spec, since it has popular bindings
for many other languages (e.g. lxml in Python). The current implementation is
broken in lots of (not so) edge cases. This can be a problem when there is the
choice of being fast and wrong (using libxml2) or being slow and correct
(using html5lib or one of the other high level implementations of the
standardised algorithm).

------
readymade
The answer is pretty entertaining, but in context it's pedantic to the
extreme. The poster's question was about matching opening tags that don't
contain a closing slash, which is a tiny (regular) subset of HTML. You don't
need pushdown automata to recognize these.
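
For illustration, one way such a matcher might look in Python. This is a
sketch under a stated assumption, not the pattern from the linked answer: it
assumes no attribute value contains a literal `<` or `>`.

```python
import re

# An opening tag: '<', a tag name, anything up to '>', where the '>' is not
# immediately preceded by '/'. Assumes attribute values never contain '<' or '>'.
OPEN_TAG_RE = re.compile(r'<([A-Za-z][A-Za-z0-9]*)\b[^<>]*(?<!/)>')


def open_tags(html):
    """Return every opening tag that is not self-closing."""
    return [m.group(0) for m in OPEN_TAG_RE.finditer(html)]
```

Closing tags are skipped because a letter must follow the `<`, and
self-closing tags are skipped by the negative lookbehind on the final `>`.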

English, like any other natural language, is (at least mostly) a context-free
language too, but you wouldn't go around telling people that they shouldn't
ever use regexen to match certain constructions in English text, right?

~~~
baddox
I wouldn't call a natural language context-free. They're not formal languages
at all.

~~~
readymade
I'm well aware that English isn't a formal language, that's why I added the
qualifier "mostly". The great majority of expressions in natural languages
can, in fact, be accounted for with CFGs, and purely CFG-based Phrase
Structure Grammars have been proposed (see the work of Gazdar and Pullum on
Generalized Phrase Structure Grammar, from the early 80's, if you're
interested). Many of Chomsky's original claims about the weak generative
capacity of CFGs with respect to natural language that gave rise to
transformational syntactic frameworks have since been disproven.

Whether or not there is an absolutely snug fit between CFGs formally and
natural language "in the wild", so to speak, is another topic, and rather
beside the point of the analogy. Context Sensitive Grammars are overly
expressive and Regular Grammars much too weak, for much the same reason that
they are too weak for HTML. Were there a perfect English language parser, you would
not need it in order to match regular subsets of English, just as you do not
need a full HTML parser in order to match regular subsets of HTML.

------
capkutay
I had one class where we had to build a multithreaded search engine in Java,
and parsing HTML with regex was a requirement. Regex was the downfall of about
50% of the class, and the majority of the students who did well still had
slight issues with their HTML parsing. Moral of the story is that regex is a
poor solution for HTML. Not to mention, spending hours debugging regex is one
of the least meaningful or rewarding experiences you can have as a programmer.

~~~
pjscott
The obvious workaround for that broken requirement is to use regular
expressions to handle the tokenization, and then write a simple recursive
descent parser on top of that. Even if this is playing fast and loose with the
requirements, it would work out fine, and you would almost certainly get a
good grade if you explain why you did it.
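
A small sketch of that split, assuming a toy markup with attribute-free tags
and plain text only (real HTML has attributes, comments, entities, optional
end tags and more). The regex only produces tokens; the nesting is handled by
a recursive function. All names here are hypothetical.

```python
import re

# One token per match: an opening/closing tag without attributes, or a run of text.
TOKEN_RE = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)')


def tokenize(src):
    """Yield ('open', name), ('close', name) or ('text', data) tokens."""
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        if m.group(3) is not None:
            yield ('text', m.group(3))
        elif m.group(1):
            yield ('close', m.group(2))
        else:
            yield ('open', m.group(2))
        pos = m.end()


def _parse_children(tokens, pos, parent):
    """Parse nodes until the closing tag of `parent`; return (nodes, next_pos)."""
    nodes = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == 'text':
            nodes.append(value)
            pos += 1
        elif kind == 'open':
            # Recurse: the children of this element end at its matching close tag.
            children, pos = _parse_children(tokens, pos + 1, value)
            nodes.append((value, children))
        else:
            if value != parent:
                raise ValueError(f"mismatched </{value}>")
            return nodes, pos + 1
    if parent is not None:
        raise ValueError(f"missing </{parent}>")
    return nodes, pos


def parse(src):
    """Return a nested list of text strings and (tag_name, children) tuples."""
    nodes, _ = _parse_children(list(tokenize(src)), 0, None)
    return nodes
```

For example, `parse('<div>a<b>c</b></div>')` yields one `div` node whose
children are the text `a` and a `b` node, while badly nested input raises
`ValueError` instead of producing a wrong tree.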

~~~
pidge
Yes, this ^

Because, no, you can't parse XHTML with regex, as is easily shown by the
pumping lemma and all that jazz.

But, there's no freaking reason why you can't tokenize an XML start tag with a
regex! In fact, you'll probably find that most uses of parsers in real life
have regexes to tokenize down at the level that they can handle, before using
a parser on the resulting tokens for the part that actually needs to be a CFG
(among other reasons, because a compiled FSM is a lot faster than even a
limited LALR parser).

Looking at this specific example, we can refer to the definitions for start
tags [1] and empty element tags [2] in XML, and see that all their constituent
rules form a regular language (if you don't believe me, it's not too hard to
go check for yourself). So, especially since the original question doesn't even
mention 'parsing', can we all please just shut up? (unless you actually want
to figure out the horrible mess necessary to define a regex from the spec :P )

[1] <http://www.w3.org/TR/xml11/#sec-starttags>

[2] <http://www.w3.org/TR/xml11/#dt-eetag>
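
A sketch of what that transcription could look like in Python, built up from
named pieces that mirror the productions as I read them (STag, EmptyElemTag,
Attribute, AttValue, Reference). One loud simplification: `NAME` is ASCII-only
here, whereas the real Name production allows a large set of Unicode ranges.
The variable and function names are my own.

```python
import re

NAME = r'[A-Za-z_:][A-Za-z0-9_:.\-]*'        # ASCII-only simplification of Name
S = r'[ \t\r\n]+'                            # whitespace
EQ = rf'(?:{S})?=(?:{S})?'                   # Eq ::= S? '=' S?
REFERENCE = rf'(?:&{NAME};|&#[0-9]+;|&#x[0-9A-Fa-f]+;)'
# AttValue: a quoted string without '<', with '&' only as part of a reference.
ATT_VALUE = rf'''(?:"(?:[^<&"]|{REFERENCE})*"|'(?:[^<&']|{REFERENCE})*')'''
ATTRIBUTE = rf'{NAME}{EQ}{ATT_VALUE}'
TAG_BODY = rf'<{NAME}(?:{S}{ATTRIBUTE})*(?:{S})?'

STAG_RE = re.compile(TAG_BODY + r'>')             # STag ends in '>'
EMPTY_ELEM_TAG_RE = re.compile(TAG_BODY + r'/>')  # EmptyElemTag ends in '/>'


def is_start_tag(text):
    """True if `text` is exactly one start tag under the simplified grammar."""
    return STAG_RE.fullmatch(text) is not None


def is_empty_elem_tag(text):
    """True if `text` is exactly one empty-element tag under the simplified grammar."""
    return EMPTY_ELEM_TAG_RE.fullmatch(text) is not None
```

Because attribute values are matched as quoted units, a `>` inside a value
does not end the tag early, which is where simpler `[^>]*` patterns go wrong.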

------
joshdotsmith
You know you've been on SO too long when you see this title and go, "I'm
pretty sure I know this question already." And you're right.

------
brudgers
A programmer has a problem which requires parsing.

He decides to use regex.

Now he has two problems.

~~~
kami8845
A HN user reposts a SO post from ages ago.

Another HN user decides to repost a relevant joke from ages ago.

Now HN is going down the drain.

------
draegtun
Related SO post/comment - _Oh Yes You Can Use Regexes to Parse HTML!_ -
<http://stackoverflow.com/a/4234491/12195> (HN -
<http://news.ycombinator.com/item?id=2741780>)

This comment is from Tom Christiansen of _Programming Perl_ / _Perl Cookbook_
fame, and it includes the following caveat:

_So while it certainly can be done (this posting serves as an existence proof
of this incontrovertible fact), that doesn’t mean it should be._

------
elchief
There are plenty of good, real html parsers, so there's no need to try regex.

Xpath is nice for scraping HTML, though it always turns out the stuff you need
is in the middle of a bunch of other text.

~~~
philip1209
I would imagine that valid HTML could be parsed as XML, e.g. with the Python
ElementTree XML API

~~~
aardvark179
That would be true for XHTML but not for HTML, as tags like paragraph do not
always require an end tag, which would break XML parsing.
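
A quick way to see this with the standard-library `xml.etree.ElementTree`.
The two snippets are made up; the second omits `</p>`, which HTML permits.

```python
import xml.etree.ElementTree as ET

XHTML = "<div><p>one</p><p>two</p></div>"
HTML = "<div><p>one<p>two</div>"  # end tags for <p> omitted, as HTML allows


def paragraph_texts(markup):
    """Return the text of each <p>, or None if the markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return [p.text for p in root.iter("p")]
```

The XHTML-style input parses and yields both paragraphs; the HTML-style input
makes the XML parser raise `ParseError`, so the function returns `None`.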

~~~
philip1209
I was under the impression that HTML4 introduced XML requirements, thus
requiring the </p> tag.

------
languagehacker
The ultimate regex repost!

------
WadeWilliams
In audio/video format... best when he goes off the deep end and begins
speaking in tongues towards the end of the post
<http://www.youtube.com/watch?v=pQgNRKpmFuo>

------
sinkhole
You cannot parse HTML with regex. You can find and match strings, but you
can't actually parse html with regex.

Chuck Norris can parse HTML with regex.

~~~
kristm
"asking regexes to parse arbitrary HTML is like asking Paris Hilton to write
an operating system"

This cracked me up :)

------
jacobr
If you need something fast but not necessarily 100% correct, such as for a
real-time code syntax highlighter in JavaScript, RegEx is fine.
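
A rough sketch of that trade-off, written in Python for consistency with the
other examples in this thread rather than in JavaScript. The token classes
and the keyword list are arbitrary choices for the illustration.

```python
import re
from html import escape

# One alternation, tried left to right: line comments, double-quoted strings,
# then a handful of keywords. Everything else is passed through untouched.
TOKEN_RE = re.compile(
    r'(?P<comment>//[^\n]*)'
    r'|(?P<string>"(?:\\.|[^"\\\n])*")'
    r'|(?P<keyword>\b(?:function|return|var|if|else)\b)'
)


def highlight(code):
    """Wrap recognized tokens in <span class="..."> and HTML-escape the rest."""
    out = []
    pos = 0
    for m in TOKEN_RE.finditer(code):
        out.append(escape(code[pos:m.start()]))
        out.append(f'<span class="{m.lastgroup}">{escape(m.group(0))}</span>')
        pos = m.end()
    out.append(escape(code[pos:]))
    return ''.join(out)
```

It is fast and good enough for most lines, and it is also knowingly wrong in
places: a `//` inside a regex literal, for instance, gets coloured as a
comment. For a highlighter that is usually an acceptable failure.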

------
smilekzs
Hilarious and entertaining.

------
mistercow
I reckon that _technically_ regex is a tool that can be used to parse HTML.
It's just that you could only use it in a very trivial way that would be
better suited to other tools.

~~~
michaelhoffman
No. _Technically_, you cannot _parse_ HTML with regular expressions. You can
_find_ certain strings in HTML, which is a different thing.

~~~
mistercow
At the very least, you can use regex to match individual characters as you
scan the HTML for parsing. It's an inefficient and stupid way to do it, but it
is still something you can do. And in that case, regex is technically a tool
that you are using to parse HTML, even though 99.9% of the work is being done
by non-regex code.

