
Oh Yes You Can Use Regexes to Parse HTML - draegtun
http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491
======
scott_s
I classify this as a _parser_ using regular expressions rather than _parsing_
using regular expressions. That is, his regular expressions don't parse the
document. He wrote a parser, and uses regular expressions in that parser.

~~~
pak
This is correct. As soon as I saw his lex_html function, I could see that he
was using regexes to tokenize and then basically drop into different contexts
while consuming the document, thereby tracking multiple levels of state
throughout. That's what a full-blown parser does and it goes beyond the scope
of regular languages.

By "parsing with regular expressions" most people mean applying one or two
regexes to the entire document and using, for example, the group capture
facility to extract information. That is and will always be a bad idea for
HTML, because HTML is not a regular language.

Applying regular expressions to tokenize a string for parsing is actually a
fairly standard pattern.
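That standard pattern might look something like this minimal Python sketch (illustrative only; this is not the author's lex_html, and a real HTML lexer handles many more cases):

```python
import re

# Illustrative only: regexes tokenize, while the surrounding code is free to
# track state and context, which is what a full parser adds on top.
TOKEN_RE = re.compile(r"""
    (?P<comment> <!--.*?--> )          # HTML comment
  | (?P<tag>     </?[A-Za-z][^>]*> )   # start or end tag
  | (?P<text>    [^<]+ )               # run of character data
""", re.VERBOSE | re.DOTALL)

def lex(html):
    """Yield (kind, value) token pairs; parsing proper is the caller's job."""
    for m in TOKEN_RE.finditer(html):
        yield m.lastgroup, m.group()

tokens = list(lex('<p>Hi <b>there</b></p><!-- done -->'))
```

The regexes never track nesting; they only classify chunks, and any notion of document structure lives in whatever consumes the token stream.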

~~~
justinhj
Exactly. Gnu's Flex lets the user define regexes for each state of the
parser... <http://flex.sourceforge.net/manual/Patterns.html#Patterns>

~~~
LukeShu
Flex is used by the GNU System, but it is not GNU. It is incorrect to call it
"Gnu's Flex". <http://www.gnu.org/software/flex/>

~~~
justinhj
You're quite right, thanks

------
d0m
Errr.. Did he just write an HTML parser (and then use it) to prove to everyone
that you can use regexes to solve the use-an-html-parser-instead-of-regexes
problem?!

It's as if I suggested someone use Python instead of ASM to solve a simple
problem, and then someone tried to prove me wrong by writing a Python
interpreter in ASM and then USING it to solve the same problem!

Also, that being said, I feel like the post is more of a brag: "I'm the
creator of a popular Perl book, and Perl rocks your language, here's why,
blahblahblah".

~~~
JonnieCache
Yeah, this is pretty much what seems to be going on.

I don't think anyone has ever said that regexes can't be used _in_ the parsing
of html, just that they can't be used _for_ the parsing of html. It's like
someone saying "You can't use bricks to keep warm!" and countering that by
saying "Observe! I have built a house using, among other things, bricks, and
it shelters me from the weather and keeps me warm!" Deliberately missing the
point in order to show off your housebuilding skills.

Still, it was an informative and well-written article. An article, rather than
an answer to a question. Why do people write these thousands of words on Stack
Overflow when they could publish them on their blogs? Are SO points from
confused corporate employees really worth more than the adulation of the
blogosphere? Actually, who cares. I almost lost the will to live just writing
that sentence...

~~~
statictype
I'd rather see it in Stackoverflow. It has much more visibility there than if
each person wrote it on their own blog (assuming they have one) which may not
be seen by more than a handful of friends.

------
rickmb
And for the next ten years, flawed attempts at imitating this will show up in
production code all around the world...

I'm not saying he shouldn't have (I've certainly learned something I didn't
know), but let's face it, posting this on StackOverflow is like handing a
loaded gun to a bunch of children and telling them not to pull the trigger.

~~~
rkalla

      but let's face it, posting this on StackOverflow is like 
      handing a loaded gun to a bunch of children and telling 
      them not to pull the trigger.
    

No, let's not face it. How is sharing well-thought-out, well-designed
solutions to problems ever a bad thing? Sure, there are a handful of junior
coders that now feel overly confident and will mess this up, but they'll learn.

And then there is an equal pool of good developers that are now even better
thanks to the info-share.

I acknowledge I'm being pedantic, but the _Let's all nod and pretend we are
way smarter than THAT group over there_ mindset gives me brain-diarrhea... you
see it in every community (reddit, slashdot, HN, digg, etc.) and I've never
seen it help anybody accomplish anything, anywhere... ever.

</takes off his internet-police hat>

~~~
3am
No, using regexes for this is terrible. I could write you a program that
parses HTML using GOTOs, but GOTOs have long been recognized as a bad
programming practice. I'm not going to do your research for you, but this is
not opinion; it's the result of people studying large code bases and the
defect rates within them. The same goes for overly long functions, bad
variable naming, poor commenting, and so on.

~~~
yid
A correctly written regex is not a bad programming practice. The insinuation
that most programmers can't write or debug a regex correctly is disingenuous.

~~~
3am
First, correctly written code can still be bad practice. Regexes are a
powerful tool and have appropriate uses; I disagree that this is one of those
cases, but at -4 on my previous comment, I guess most don't agree. Second, I
would bet the majority of programmers are mediocre with regular expressions at
best, and even worse at reading regexes written by other programmers, which
contributes to code-maintenance issues.

Finally, I may have been "incorrect" but "disingenuous" is an insult. I'll be
charitable and assume you're using the word wrong.

~~~
yid
Perhaps it's a difference of experience, but I really haven't met a
professional programmer who doesn't understand regexes but still insists on
using them for non-trivial tasks. Thus my usage of "disingenuous", because I
have trouble believing that such people exist, and I felt you were trying to
make a point insincerely, perhaps out of confirmation bias. I apologize if it
came across as an insult -- it wasn't intended as one.

~~~
3am
No harm, no foul. They really do exist, unfortunately.

------
masklinn
You can use _perl's string-matching facilities_ (which really are not regular
expressions at all[0]) to parse HTML.

[0] in fact it was a rather neat idea to rename those "patterns" (or something
along those lines) in Perl 6. Unfortunately this name change has been rolled
back. Shame.

~~~
lloeki
There was some form of consensus in a recent discussion, not particularly
pertaining to Perl, that the term _regex_ no longer equates to _regular
expression_.

------
michael_dorfman
Meanwhile, the highest-rated answer in StackOverflow history says otherwise:

[http://stackoverflow.com/questions/1732348/regex-match-
open-...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-
except-xhtml-self-contained-tags/1732454#1732454)

~~~
godDLL
The OP doesn't actually do regex parsing of HTML, so all is well. It does
regular ol' parsing, and uses regexes to chunk the HTML into consumables.

------
retube
One of the best and most academically proficient answers I've seen on SO. And
if I understand correctly, it turns on its head the old refrain "You can't use
regexes to parse HTML", of which I've always been a staunch proponent.

Now I understood the reason _why_ you can't use regular expressions to parse
HTML is that HTML is usually not regular. Is this true? Does this solution in
perl work because of the extended capabilities of perl regexes?

~~~
Terretta
> _Now I understood the reason _why_ you can't use regular expressions to
> parse HTML is that HTML is usually not regular. Is this true?_

From the comments:

Q. "The answer is you can't. HTML is not regular, so by definition it can't be
described by a regular expression."

A. "Your use of REGULAR in regular expressions has been irrelevant and wrong
since Ken Thompson first put backrefs into regexes around 40 years ago.
/(.)\1/ parses non-REGULAR languages perfectly well. Please stop repeating
this nonsense. – tchrist"
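tchrist's point is easy to demonstrate: with a backreference, a single pattern recognizes { ww : w nonempty }, a textbook non-regular language. A Python sketch (the observation applies equally to Perl's engine):

```python
import re

# /(.)\1/-style backreferences already take patterns beyond regular
# languages: the capture remembers an unbounded-length substring, something
# no finite automaton can do.
SQUARE = re.compile(r'^(.+)\1$')

is_square = bool(SQUARE.match('abcabc'))    # 'abc' captured, then required again
not_square = bool(SQUARE.match('abcabd'))   # no split makes both halves equal
```

So "regexes" as implemented in Perl, Python, etc. are strictly more powerful than the regular expressions of formal language theory; whether they are the right tool for HTML is a separate question.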

~~~
adobriyan
Backreferences in some sense don't work because real HTML contains misnested
tags and other bugs, so many of them that even the HTML5 spec explicitly
specifies a correction algorithm.

------
gjm11
A slightly more accurate summary of Tom Christiansen's excellent answer there
would be: "Oh yes you can use regexes to parse HTML, but you usually
shouldn't, unless what you want to do is really, really simple."

Actual quotations: "Even if my program is taken as illustrative of why you
should not use regexes for parsing general HTML -- which is ok, because I
kinda meant for it to be that"; "That was kinda my point, actually. I wanted
to show how hard it is." (the latter in response to someone else who said "You
can write a novel, like tchrist did, or you can use a DOM library and write
one line of XPath").

~~~
lloeki
That said, his HTML chunker is, dare I say, gorgeous.

If that is an example of what should not be done, I wish there was more of
them like that around.

Besides, lexing HTML in 234 lines grand total, most of them being whitespace,
(169 SLOCs according to sloccount) is impressive. Writing even a basic
non-regex-based parser is bound to take quite some space.

To me the real conclusion is not: "don't try to parse random HTML using
regexes" but "don't try to write your own wide-purpose HTML parser".

Or, as Tom put it in his SO answer:

> _The correct and honest answer is that they shouldn't attempt [trying to
> parse arbitrary HTML] because it is too much of a bother to figure out from
> scratch_

~~~
jerf
"Besides, lexing HTML in 234 lines grand total, most of them being whitespace,
(169 SLOCs according to sloccount) is impressive."

I mean no disrespect at all to tchrist, but it isn't impressive at all; not
because tchrist is wrong, but because _lexing isn't hard_. If you understand
the problem, you can almost literally read the lexer right off the standard;
indeed, that's part of the purpose of the standard. Look at it (taking HTML4
here as it's easier to see): <http://www.w3.org/TR/html401/types.html#h-6.2>
You can literally read off the lexer expression for ID and NAME right from
'must begin with a letter ([A-Za-z]) and may be followed by any number of
letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and
periods (".").'
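For instance, that spec sentence transcribes almost character for character into a pattern (Python here purely for illustration):

```python
import re

# The HTML4 ID/NAME rule, read straight off the quoted spec text: a letter,
# then any mix of letters, digits, hyphens, underscores, colons, and periods.
NAME_RE = re.compile(r'^[A-Za-z][A-Za-z0-9\-_:.]*$')

accepted = [s for s in ('nav', 'x-1_2:3.4', 'Z') if NAME_RE.match(s)]
rejected = [s for s in ('1abc', '-x', '') if NAME_RE.match(s)]
```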

Generally, if you're having a hard time putting a lexer together for some
language you're creating (bearing in mind this includes the broader definition
of "language" beyond just "programming language", which includes things like
JSON or text formats you may create ad hoc), that's a sign that you've got an
overcomplicated language on your hands. (Hi, C++! I see you over there!)

------
ratsbane
You still can't use _A_ regular expression to parse HTML. Of course you can
use a set of regular expressions and some other logic to parse HTML. There
shouldn't be anything surprising about this post.

------
jgrahamc
I think the point here is don't use regexps for this. Some time ago I got into
a spat with Eric Raymond about this very subject. Bottom line is that there
are nice libraries for HTML parsing out there. Use one:
<http://blog.jgc.org/2009/11/parsing-html-in-python-with.html>

------
draegtun
And related: this previous HN discussion of a different SO question, _RegEx
match open tags except XHTML self-contained tags_
(<http://news.ycombinator.com/item?id=1487695>)

------
jdnier
In the spirit of Tom Christiansen's lexer solution, here's a link to Robert
Cameron's seemingly forgotten 1998 article, "REX: XML Shallow Parsing with
Regular Expressions".

    
    
<http://www.cs.sfu.ca/~cameron/REX.html>
    

"""

Abstract

The syntax of XML is simple enough that it is possible to parse an XML
document into a list of its markup and text items using a single regular
expression. Such a shallow parse of an XML document can be very useful for the
construction of a variety of lightweight XML processing tools. However,
complex regular expressions can be difficult to construct and even more
difficult to read. Using a form of literate programming for regular
expressions, this paper documents a set of XML shallow parsing expressions
that can be used as a basis for simple, correct, efficient, robust and
language-independent XML shallow parsing. Complete shallow parser
implementations of less than 50 lines each in Perl, JavaScript and Lex/Flex
are also given.

"""
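The single-regex idea from the abstract can be sketched like this (a deliberately simplified illustration, NOT Cameron's actual REX expression, which also covers comments, CDATA sections, processing instructions, and malformed markup):

```python
import re

# A shallow parse: one regex alternation splits a document into a flat list
# of markup and text items, with no attempt to build a tree.
SHALLOW = re.compile(r'<[^>]*>|[^<]+')

items = SHALLOW.findall('<p>Some <b>bold</b> text</p>')
```

The flat item list is often all a lightweight transformation tool needs, which is exactly the niche the paper describes.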

~~~
jdnier
If you enjoy reading about regular expressions, Cameron's paper is
fascinating. His writing is concise, thorough, and very detailed. He's not
simply showing you how to construct the REX regular expression but also an
approach for constructing any complex regex.

I've been using the REX regular expression on and off for 10 years to solve
the sort of problem the initial poster on Stack Overflow asked about (how do I
match this particular tag but not some other very similar tag?). I've found
the regex he developed to be completely reliable.

~~~
jdnier
REX is most useful when you're focusing on lexical details of a document --
for example, when transforming one kind of text document (e.g., plain text,
XML, SGML, HTML) into another, where the document may not be valid, well
formed, or even parsable for most of the transformation. It lets you target
islands of markup anywhere within a document without disturbing the rest of
the document.

------
linuxhansl
No you can't.

HTML is a Context Free Language (Type 2 in the Chomsky hierarchy) that is
defined by a Context Free Grammar, and parsed by a stack machine.

Regular expressions can describe regular (Type 3) languages. They do not have
a stack.

Note that there is a loop in the code, so it's not just regular expressions.
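To make the missing stack concrete, here is a small sketch (a hypothetical helper, not from the article): regexes only tokenize the tags, while an explicit stack checks the context-free property of proper nesting.

```python
import re

# The regex supplies tokens; the stack supplies the unbounded memory that a
# Type 3 regular language lacks. (Self-closing and void tags are ignored
# here for brevity.)
TAG = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>')

def well_nested(html):
    stack = []
    for closing, name in TAG.findall(html):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack
```

No finite set of regex states can track arbitrarily deep nesting, which is why the stack (or an equivalent loop over state, as in the article's code) has to appear somewhere.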

------
scrrr
Isn't the theory that for a regular grammar you can use a regexp, while for a
context-free grammar you need something with a stack (i.e., a parser)?

HTML isn't regular, though, is it? So if there's not (even an implicit) stack
in his example, this won't work for the general case.

------
parenthesis
For quick-and-dirty extraction of data from HTML documents, lynx -dump can be
useful.

------
reirob
When I need to extract data from HTML I use XPath. But to do so I have to use
the combination of following tools

iconv: necessary only when the page is NOT encoded in UTF-8.

tidy: used to convert from HTML to XHTML, which is XML.

xmlstarlet: to extract data from the XML file using XPath.

I find XPath a much better and much more reliable tool for HTML data
extraction.

~~~
masklinn
> When I need to extract data from HTML I use XPath. But to do so I have to
> use the combination of following tools

I just use lxml.html (handles 99.999% of the HTML out there, add
beautifulsoup's UnicodeDammit[0] for wonky encodings) and then use lxml's
built-in xpath support on top of that.

Plus extra bonus, if the datamining paths are simple enough you can use CSS
queries instead of XPath.

[0] [http://lxml.de/elementsoup.html#using-only-the-encoding-
dete...](http://lxml.de/elementsoup.html#using-only-the-encoding-detection)

~~~
reirob
Thanks for pointing this out. I forgot to mention that I am using XPath on the
Unix/Cygwin command line or in shell scripts. Here's an example of extracting
the Hacker News title;url;points;user data:

    
    
       curl -s http://news.ycombinator.com/news | tidy -quiet -asxml -numeric -utf8 -file /dev/null | xmlstarlet sel -N x=http://www.w3.org/1999/xhtml -t -m "//x:tr[x:td[1][@class='title']]" -v "normalize-space(x:td[3][@class='title']/x:a)" -o ";" -v "x:td[3][@class='title']/x:a/@href" -o ";" -v "str:tokenize(following-sibling::x:tr[1]/x:td[2]/x:span[1], ' ')[1]" -o ";" -v "following-sibling::x:tr[1]/x:td[2]/x:a[1]" -n | xmlstarlet unesc

~~~
sukuriant
Could you put some newlines in that? For some reason, my browser (IE9) is
rendering that construct as a VERY long line and extending the page to match
it. (no code-specific scroll-bar for me)

~~~
reirob

      curl -s http://news.ycombinator.com/news | \
    	tidy -quiet -asxml -numeric -utf8 -file /dev/null | \
    	xmlstarlet sel \
    		-N x=http://www.w3.org/1999/xhtml \
    		-t -m "//x:tr[x:td[1][@class='title']]" \
    		-v "normalize-space(x:td[3][@class='title']/x:a)" -o ";" \
    		-v "x:td[3][@class='title']/x:a/@href" -o ";" \
    		-v "str:tokenize(following-sibling::x:tr[1]/x:td[2]/x:span[1], ' ')[1]" -o ";" \
    		-v "following-sibling::x:tr[1]/x:td[2]/x:a[1]" -n | \
    	xmlstarlet unesc

~~~
sukuriant
Thank you.

------
benmmurphy
It doesn't handle script tags correctly for the example cited. Anything
between <script> and </script> shouldn't be interpreted as HTML.

    <html> <head> <script type='text/javascript'> var tag = '<input
    type="hidden" name="foo" value="bar"/>'; </script> </head> <body> body
    </body> </html>

    ./html_input_rx test.html
    input tag #1 at character 57: name => "foo" type => "hidden" value => "bar"

But very cool, and it could probably be fixed to handle that case quite easily.
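A rough sketch of both the false positive and one easy fix (illustrative Python, not the original Perl program):

```python
import re

# The failure mode above: a naive tag regex "finds" the <input> that exists
# only inside a JavaScript string literal. Stripping <script> elements
# before matching is one straightforward fix.
DOC = ('<html><head><script type="text/javascript">'
       "var tag = '<input type=\"hidden\" name=\"foo\" value=\"bar\"/>';"
       '</script></head><body>body</body></html>')

INPUT_TAG = re.compile(r'<input\b[^>]*>', re.IGNORECASE)
SCRIPT = re.compile(r'<script\b.*?</script>', re.IGNORECASE | re.DOTALL)

naive = INPUT_TAG.findall(DOC)                  # false positive from the script
fixed = INPUT_TAG.findall(SCRIPT.sub('', DOC))  # script body stripped first
```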

------
profquail
For parsing HTML, I'd recommend using a purpose-built HTML-parsing library
instead of bothering with regular expressions. (Though, as the author of that
answer wrote, regexes can work just fine for parsing small snippets.)
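For example, Python's standard library ships such a purpose-built parser; a minimal sketch:

```python
from html.parser import HTMLParser

# html.parser delivers real start-tag events with attributes already split
# out, and it treats <script> content as script data rather than markup.
class InputCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            self.inputs.append(dict(attrs))

collector = InputCollector()
collector.feed('<form><input type="hidden" name="foo" value="bar"></form>')
```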

An interesting fact: you can parse HTML/XHTML (correctly) with some of the
popular regex implementations. (Note the word _implementations_.)

------
sambeau
I'm just disappointed that he didn't use the flip-flop operator ..

------
wazzupflow
best part is in the answer just below this one: "1. You can write a novel like
tchrist did..."

------
jpr
Now you have two problems.

