By "parsing with regular expressions" most people mean applying one or two regexes to the entire document and using, for example, the group capture facility to extract information. That is and will always be a bad idea for HTML, because HTML is not a regular language.
Applying regular expressions to tokenize a string for a parser is actually a fairly standard pattern.
Of course, most modern "regex" engines provide features which go beyond the strictly regular languages.
S → aS | ε (a right-recursive regular grammar; it generates the regular language a*)
Honestly, though, this entire conversation is fuzzed by the original author's blurring of the distinction between "parsing with regular expressions" and "writing a parser that uses regular expressions to tokenize". As a result, I'm not really sure where just about every other person in this discussion is coming from.
However, no regular grammar (not even a recursive regular grammar) can describe a recursive language.
All this subthread is fairly confusing because people are using the term "recursive" in a non-standard way. Recursive languages are the languages decidable by a Turing machine. They're a strict superset of the regular languages; that is, not every recursive language can be described by a regular grammar, but every regular grammar describes a recursive language.
Now, some regular grammars are left-recursive or right-recursive, which means the same symbol sometimes appears on both sides of a rule. That doesn't mean they have the power of full recursion that we find in full-blown programming languages, since they don't have the equivalent of a stack.
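For instance, the grammar S → aS | ε quoted earlier is right-recursive, yet it denotes the plainly regular language a*, which an ordinary regular expression recognizes with no stack at all. A tiny Python sketch:

```python
import re

# The right-recursive regular grammar S -> aS | eps generates exactly a*.
# Despite the recursion in the grammar, a plain regular expression suffices:
STAR_A = re.compile(r"a*")

def in_language(s: str) -> bool:
    """True iff s is derivable from S -> aS | eps."""
    return STAR_A.fullmatch(s) is not None

print(in_language("aaaa"))  # True
print(in_language(""))      # True (the eps production)
print(in_language("aab"))   # False
```

The recursion in the grammar only ever grows the string at one end, which is why a finite automaton can track it.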
It's like if I suggest that someone use Python instead of ASM to solve a simple problem, and then someone else tries to prove me wrong by writing a Python interpreter in ASM and then USING it to solve the same problem!
Also, that being said, I feel like the post is more a brag: "I'm the author of a popular Perl book and Perl rocks your language, here's why, blah blah blah."
I don't think anyone has ever said that regexes can't be used in the parsing of HTML, just that they can't be used for the parsing of HTML. It's like someone saying "You can't use bricks to keep warm!" and countering that by saying "Observe! I have built a house using, among other things, bricks, and it shelters me from the weather and keeps me warm!" Deliberately missing the point in order to show off your housebuilding skills.
Still, it was an informative and well-written article. Article, rather than an answer to a question. Why do people write these thousands of words on Stack Overflow when they could publish them on their blogs? Are SO points from confused corporate employees really worth more than the adulation of the blogosphere? Actually, who cares. I almost lost the will to live just writing that sentence...
Was he able to write an arbitrary HTML parser using regexes because HTML is a regular language, or because Perl's string matching syntax, which can match some languages that are not regular, is called "regular expressions" anyway?
I am pretty sure that, for example, an arbitrary number of nested, balanced "<div></div>" pairs is not a regular language; however, you can use modern PCRE to match it anyway.
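A quick way to see both halves of that: n nested <div>s followed by n closing tags is the classic aⁿbⁿ shape, which needs an unbounded counter that no finite automaton has; modern PCRE gets around this with recursive subpatterns. A minimal Python sketch of the counting check (illustrative only, ignoring attributes and all other tags):

```python
import re

# Balanced <div>...</div> nesting is the a^n b^n pattern: checking it needs
# an unbounded counter, which is beyond any finite automaton (and hence any
# regular expression in the strict, textbook sense).
def balanced_divs(s: str) -> bool:
    depth = 0
    for tok in re.findall(r"</?div>", s):
        depth += 1 if tok == "<div>" else -1
        if depth < 0:           # a closing tag with no matching opener
            return False
    return depth == 0

print(balanced_divs("<div><div></div></div>"))  # True
print(balanced_divs("<div></div></div>"))       # False
```

Here the regex only tokenizes; the counter in ordinary code does the non-regular part.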
The first option is easy to rule out: HTML is not a regular language. Evidence: S-Expressions are not a regular language. XML is isomorphic to S-expressions of the form (<tagname> ((<attr1> <value1>) (<attr2> <value2>) ...) <contents>). HTML is isomorphic to XHTML, which is XML.
I can't (read: don't feel like trying to) rigorously disprove that Perl's "regular expression" facility could be an HTML parser all by itself, but it looks like what's going on here is closer to a regex-based tokenizer (certainly possible, and in fact this is a very common pattern). Then, regular Perl flow control constructs are used to interpret the tokens as tags and such.
 Edit: Looking at that, it's not as solid as it was in my head when I was writing it. HTML is still not a regular language, because regular languages cannot key on nesting with uncapped depth.
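That tokenize-then-interpret split might look something like the following minimal sketch (illustrative only, not tchrist's actual code; the token classes are deliberately crude):

```python
import re

# Minimal sketch of a regex-based tokenizer: the regex recognizes individual
# tokens (open tag, close tag, text); ordinary control flow then interprets
# the token stream as structure.
TOKEN = re.compile(r"</(\w+)>|<(\w+)[^>]*>|([^<]+)")

def tokenize(html: str):
    for m in TOKEN.finditer(html):
        close, open_, text = m.groups()
        if close:
            yield ("close", close)
        elif open_:
            yield ("open", open_)
        else:
            yield ("text", text)

print(list(tokenize("<b>hi</b>")))
# [('open', 'b'), ('text', 'hi'), ('close', 'b')]
```

The regex never has to know that tags nest; that is left to whatever consumes the tokens.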
Perl regular expressions are pretty powerful. I wrote a small Perl program (just for fun, after reading the post) which can parse S-expressions. I wrote it quickly, but it shows it is possible:
I know it has bugs.
I'm not saying he shouldn't have (I've certainly learned something I didn't know), but let's face it, posting this on StackOverflow is like handing a loaded gun to a bunch of children and telling them not to pull the trigger.
> but let's face it, posting this on StackOverflow is like handing a loaded gun to a bunch of children and telling them not to pull the trigger.
And then there is an equal pool of good developers that are now even better thanks to the info-share.
I acknowledge I'm being pedantic, but the "let's all nod and pretend we are way smarter than THAT group over there" mindset gives me brain-diarrhea... you see it in every community (Reddit, Slashdot, HN, Digg, etc.) and I've never seen it help anybody accomplish anything, anywhere... ever.
</takes off his internet-police hat>
Finally, I may have been "incorrect" but "disingenuous" is an insult. I'll be charitable and assume you're using the word wrong.
True, but that's the case for posting anything non-trivial where Google can find it. And it's analogous to using non-trivial programming languages or patterns in industrial programs.
At a certain point, you have to let go of the idea that if you hide anything requiring thought and judgment from people, they won't hurt themselves.
Some people are going to foul things up no matter what you do or don't tell them, and some will excel even if you make it hard for them to find knowledge. The interesting group are those who are open to being influenced by what you have to share.
I suggest that even if only a small proportion of SO readers walk away enlightened, this article is a win. The folks who use it to justify their own failed attempts to manage arbitrary HTML with regexen would have pooched the task in some other way even if they stuck to using a parser library.
At least that's my takeaway.
> You can write a novel like tchrist did [...] – meder Nov 20 '10 at 19:36

> That was kinda my point, actually. I wanted to show how hard it is. – tchrist Nov 20 '10 at 19:38
> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
In fact, it was a rather neat idea to rename those to "patterns" (or something along those lines) in Perl 6. Unfortunately, that name change has been rolled back. Shame.
Now, as I understand it, the reason _why_ you can't use regular expressions to parse HTML is that HTML is not regular. Is this true? Does this solution work because of the extended capabilities of Perl regexes?
From the comments:
Q. "The answer is you can't. HTML is not regular so by definition it can't be described by a regular expression."
A. "Your use of REGULAR in regular expressions has been irrelevant and wrong since Ken Thompson first put backrefs into regexes around 40 years ago. /(.)\1/ parses non-REGULAR languages perfectly well. Please stop repeating this nonsense. – tchrist"
Maybe he meant /(.*)\1/?
Note: I know that backreferences make Perl regexes capable of matching non-regular languages, but this doesn't seem like an example of that.
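To make the distinction concrete: anchored to the whole string, /(.+)\1/ recognizes the "square" language { ww }, which is provably non-regular, whereas an unanchored /(.)\1/ merely detects a doubled character, a property a finite automaton handles fine. A quick check in Python, whose re module also supports backreferences:

```python
import re

# Anchored, /(.+)\1/ recognizes the "square" language { ww : w nonempty },
# which no strictly regular expression can describe.
SQUARE = re.compile(r"(.+)\1")

def is_square(s: str) -> bool:
    return SQUARE.fullmatch(s) is not None

print(is_square("abcabc"))  # True  (w = "abc")
print(is_square("abcab"))   # False
```

The engine finds the match by backtracking over candidate split points, which is exactly the kind of work a finite automaton cannot do.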
Yes. It works because Perl's "regexes" are not regular expressions (nb: almost no "regex engine" these days is limited to regular expressions). Actual regular expressions can express Chomsky type-3 grammars, whereas HTML's grammar (and that of most programming languages) is type 2.
Recognizing paired start/end tags is equivalent to recognizing palindromes, and the palindrome language is only context-free if there's a finite, fixed alphabet. But the set of HTML tags a parser needs to match isn't finite. While there are a finite number of defined tags in any particular version of the HTML specification, a parser is still supposed to be able to match start/end tags it doesn't recognize, to distinguish the case of malformed syntax from the case of correctly formed syntax that uses unrecognized tags.
An easy way to see it is to try to write out CFG productions for HTML. The intuitive way requires something like backreferences, which would make the language no longer context-free:
html = '<' + tag + '>' + html + '</' + $0 + '>'

The context-free alternative is to enumerate a separate production for every tag:

html = '<head>' + html + '</head>'
     | '<body>' + html + '</body>'
     | '<b>'    + html + '</b>'
     | ...
I don't think this is necessary in HTML, which expects every unknown tag to be ignored regardless of nesting. The exception is if you are planning to render the content using CSS or to support scripting, which may expect the unknown element to be in the DOM or to be styleable; but even there, historically, different browsers tended to do different things with unknown tags.
More on topic, XML is not even a type 2 language since you have to check for matching of tags, and the tags can come from an arbitrary set, so you can't write a context-free grammar that recognises well-formed XML.
You'd want rules like:
XML -> TAG
TAG -> "<" ([a-z]+) ">" (TAG|Text)* "<" "/" \1 ">"
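The \1 in that rule is exactly the step a context-free grammar can't take when the tag set is unbounded; an explicit stack handles it directly. A minimal sketch (illustrative only: attributes, text handling, and self-closing tags are deliberately ignored):

```python
import re

# Minimal well-formedness check for arbitrary tag names using an explicit
# stack. The tag alphabet is unbounded, so no fixed set of CFG productions
# can enumerate the name-matching that the stack comparison does here.
TAG = re.compile(r"<(/?)([a-z]+)>")

def well_formed(s: str) -> bool:
    stack = []
    for closing, name in TAG.findall(s):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

print(well_formed("<a><b></b></a>"))  # True
print(well_formed("<a><b></a></b>"))  # False
```

Note that the regex only lexes individual tags; all the matching power lives in the stack.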
HTML is more accurately a context-free language (CFL). A regular expression does not allow you to do any sort of counting or stack-based matching in a match, which is required to do things like "I just saw <div>...<a>...<img> so I better see "</img>...</a>...</div>" later on.
The reason the linked solution works is the extended capabilities of PCREs, such as backreferences and recursive patterns.
I believe the reason is that HTML is a Type 2 language in the Chomsky hierarchy (recognized by a push-down automaton), whereas a regexp describes a Type 3 language (recognized by a finite state automaton). To put it simply, parsing HTML needs a frame/state stack, and a regexp isn't advanced enough for that (it just reads left to right, with no subroutines or recursion).
I suspect you might be right about him "cheating" using Perl, but not knowing a lick of Perl, I can't say for sure one way or the other.
Edit: Apparently, back-references mean regexp isn't regular - that actually makes more sense now; they've never quite meshed with my understanding of regular languages.
Actual quotations: "Even if my program is taken as illustrative of why you should not use regexes for parsing general HTML -- which is ok, because I kinda meant for it to be that"; "That was kinda my point, actually. I wanted to show how hard it is." (the latter in response to someone else who said "You can write a novel, like tchrist did, or you can use a DOM library and write one line of XPath").
If that is an example of what should not be done, I wish there were more like it around.
Besides, lexing HTML in 234 lines grand total, most of them whitespace (169 SLOCs according to sloccount), is impressive. Writing even a basic non-regex-based parser is bound to take quite some space.
To me the real conclusion is not: "don't try to parse random HTML using regexes" but "don't try to write your own wide-purpose HTML parser".
Or, as Tom put it in his SO answer:
> The correct and honest answer is that they shouldn’t attempt [trying to parse arbitrary HTML] because it is too much of a bother to figure out from scratch
I mean no disrespect at all to tchrist, but it isn't impressive at all; not because tchrist is wrong, but because lexing isn't hard. If you understand the problem, you can almost literally read the lexer right off the standard; indeed, that's part of the purpose of the standard. Look at it (taking HTML4 here as it's easier to see): http://www.w3.org/TR/html401/types.html#h-6.2 You can literally read off the lexer expression for ID and NAME right from 'must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").'
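Reading that sentence off into a regex is literally a transcription exercise; the character class below is taken verbatim from the quoted HTML4 prose:

```python
import re

# "must begin with a letter ([A-Za-z]) and may be followed by any number of
# letters, digits ([0-9]), hyphens ('-'), underscores ('_'), colons (':'),
# and periods ('.')" -- HTML4, section 6.2, transcribed directly:
NAME = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")

def is_valid_name(s: str) -> bool:
    return NAME.fullmatch(s) is not None

print(is_valid_name("my-id_1:x"))  # True
print(is_valid_name("1bad"))       # False
```

Each lexical production in the standard turns into a one-line pattern the same way.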
Generally, if you're having a hard time putting a lexer together for some language you're creating (bearing in mind this includes the broader definition of "language" beyond just "programming language", which includes things like JSON or text formats you may create ad hoc), that's a sign that you've got an overcomplicated language on your hands. (Hi, C++! I see you over there!)
I've been using the REX regular expression on and off for 10 years to solve the sort of problem the initial poster on Stack Overflow asked about (how do I match this particular tag but not some other very similar tag?). I've found the regex he developed to be completely reliable.
HTML is a Context Free Language (Type 2 in the Chomsky hierarchy) that is defined by a Context Free Grammar, and parsed by a stack machine.
Regular expressions can describe regular (Type 3) languages. They do not have a stack.
Note that there is a loop in the code, so it's not just regular expressions.
HTML isn't regular, though, is it? So if there's not (even an implicit) stack in his example, this won't work for the general case.
iconv: necessary only when the page is NOT encoded in UTF-8
tidy: used to convert from HTML to XHTML, which is XML. I call it as shown in the pipeline below.
xmlstarlet: to extract data from the XML file using XPath.
I find XPath a much better and much more reliable tool for HTML data extraction.
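As an illustration of the XPath approach, here is a sketch using Python's standard library, whose ElementTree module supports a useful subset of XPath. The XHTML fragment and its td/a structure are made-up stand-ins for tidy's real output:

```python
import xml.etree.ElementTree as ET

# Toy stand-in for tidy's XHTML output; element and class names are
# illustrative, loosely echoing the HN scrape in the pipeline above.
xhtml = """
<html><body><table>
  <tr><td class="title"><a href="http://example.com/1">First story</a></td></tr>
  <tr><td class="title"><a href="http://example.com/2">Second story</a></td></tr>
</table></body></html>
"""

root = ET.fromstring(xhtml)
# ElementTree's findall supports XPath-style paths with attribute predicates:
stories = [(a.text, a.get("href"))
           for a in root.findall(".//td[@class='title']/a")]
print(stories)
# [('First story', 'http://example.com/1'), ('Second story', 'http://example.com/2')]
```

Once tidy has produced well-formed XML, the extraction step reduces to one declarative query instead of a hand-rolled regex.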
I just use lxml.html (handles 99.999% of the HTML out there, add beautifulsoup's UnicodeDammit for wonky encodings) and then use lxml's built-in xpath support on top of that.
Plus, as an extra bonus, if the data-mining paths are simple enough you can use CSS queries instead of XPath.
curl -s http://news.ycombinator.com/news | tidy -quiet -asxml -numeric -utf8 -file /dev/null | xmlstarlet sel -N x=http://www.w3.org/1999/xhtml -t -m "//x:tr[x:td[@class='title']]" -v "normalize-space(x:td[@class='title']/x:a)" -o ";" -v "x:td[@class='title']/x:a/@href" -o ";" -v "str:tokenize(following-sibling::x:tr/x:td/x:span, ' ')" -o ";" -v "following-sibling::x:tr/x:td/x:a" -n | xmlstarlet unesc
Or, wrapped for readability:

curl -s http://news.ycombinator.com/news | \
tidy -quiet -asxml -numeric -utf8 -file /dev/null | \
xmlstarlet sel \
-N x=http://www.w3.org/1999/xhtml \
-t -m "//x:tr[x:td[@class='title']]" \
-v "normalize-space(x:td[@class='title']/x:a)" -o ";" \
-v "x:td[@class='title']/x:a/@href" -o ";" \
-v "str:tokenize(following-sibling::x:tr/x:td/x:span, ' ')" -o ";" \
-v "following-sibling::x:tr/x:td/x:a" -n | \
xmlstarlet unesc
var tag = '<input type="hidden" name="foo" value="bar"/>';
input tag #1 at character 57:
name => "foo"
type => "hidden"
value => "bar"
But it's very cool, and it could probably be fixed to handle that case quite easily.
An interesting fact: you can parse HTML/XHTML (correctly) with some of the popular regex implementations. (Note the word implementations.)