Matching with the full PCRE feature set, as the author points out, is NP-complete. The problem is, just because you can do something doesn't mean you should.
If you ever find yourself constructing a recursively enumerable grammar (or even a CFG) using regular expressions - whether PCREs or any other variant - you should ask yourself why you aren't using a parser generator or a proper tool for creating a compiler front-end.
I hope people don't miss the author's closing point, which is the most important part:
> But don’t forget: Just because you can, doesn’t mean that you should. Processing HTML with regular expressions is a really bad idea in some cases. In other cases it’s probably the best thing to do.
I disagree that there are cases in which it's probably the best thing to do. Most languages support XPath, CSS selectors, etc., which are much better tools for matching arbitrary HTML patterns. I'm guilty of conjuring up a regex to scrape image links every now and then, but you should really only do that when your domain of expected input data is far more restricted than the actual CFG that you're dealing with.
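To make the "restricted domain" caveat concrete, here's a rough sketch comparing a quick image-link regex with Python's standard-library `html.parser`. The pattern `IMG_SRC_RE`, the `ImgSrcCollector` class and the sample strings are made up for illustration; they aren't from the article.

```python
import re
from html.parser import HTMLParser

# Hypothetical quick-and-dirty pattern: assumes lowercase <img>, src as the
# first attribute, and double quotes. Fine only if the input really looks like that.
IMG_SRC_RE = re.compile(r'<img\s+src="([^"]*)"')


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    """Return all <img> src values found by the parser."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


restricted = '<img src="a.png"><img src="b.png">'
messy = "<IMG alt='x' SRC='a.png'><img class=thumb src=b.png>"
```

On `restricted` both approaches find the same two links. On `messy` the regex finds nothing, while `img_sources(messy)` still returns both, which is roughly the difference between "my input is narrower than HTML" and "my input is HTML".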
> I'm guilty of conjuring up a regex to scrape image links every now and then
Any idea how to overcome that temptation? That is, what's wrong with the HTML-processing tool/sublanguage that makes regexes attractive instead? (For me, it's that I don't need to scrape HTML quite often enough to want to learn to use what's available. I'm probably irrationally lazy.)
Usually the problem is that the APIs are like typical XML APIs, which are a royal pain to use: you have to jump through various hoops to say which tags are of interest, in which container, and how they relate to their siblings. If you are lucky you only have to learn some sort of XPath-like syntax. Meanwhile you do a view source, can trivially see exactly which lines you want, and say sod all that nonsense - I'll use a regex.
My recommendation is to find a library that provides jQuery/CSS selector style syntax and semantics; suddenly it is a lot easier to deal with the document. For Python, for example, there is soupselect or cssselect.
Amusingly, the latter shows that the selector "div.content" translates into the XPath "descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]".
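If that XPath looks baroque, here's a small plain-Python sketch of what the predicate is checking. `has_class` is a hypothetical helper written for this comment, not part of cssselect, and `" ".join(s.split())` is only a rough stand-in for XPath's `normalize-space()`.

```python
def has_class(class_attr, name):
    """Mimic: @class and contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    if class_attr is None:
        # Corresponds to the leading "@class and" guard: no attribute, no match.
        return False
    # Collapse runs of whitespace and trim the ends (approximates normalize-space).
    normalized = " ".join(class_attr.split())
    # Pad both sides with spaces so only whole class tokens can match.
    return f" {name} " in f" {normalized} "


# A naive substring test would wrongly accept these:
naive_hit = "content" in "contents"          # True, but it is a different class
proper_hit = has_class("contents", "content")  # False
```

The space padding is the whole trick: it turns a substring search into a whole-token search, so `class="main  content wide"` matches but `class="sub-content"` doesn't.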
> "As such they can also match well-formed HTML and pretty much all other programming languages."
While you could match well-formed HTML, the inevitable follow-up will be "how do I match parsable, but not well-formed HTML"? There is a reason that XHTML was deeply unloved.
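A quick sketch of that gap, using Python's standard-library `html.parser`: the tolerant parser happily tokenizes markup that a strict nesting check rejects. `NestingChecker`, `is_strictly_nested` and the partial `VOID_TAGS` set are names invented for this illustration, and the check is far simpler than real well-formedness rules.

```python
from html.parser import HTMLParser

# Partial list of elements that never take an end tag; illustration only.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NestingChecker(HTMLParser):
    """Tracks whether every start tag is closed in strict stack order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # End tag does not match the innermost open element.
            self.balanced = False


def is_strictly_nested(html):
    """True only if all tags open and close in properly nested order."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.balanced and not checker.stack
```

Here `is_strictly_nested("<div><p>one</p></div>")` is true, but `is_strictly_nested("<ul><li>one<li>two</ul>")` is false, even though the parser reads it without complaint and omitting `</li>` is allowed in HTML.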
(also... The <center> cannot hold it is too late. .. .ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ)
It's from a very popular Stack Overflow answer a while back. One longtime user got tired of having to explain the intricacies of HTML parsing to people, so instead he posted a Lovecraftian rant about the horrors of parsing HTML with regular expressions.
As someone who has been reading The Wheel of Time, this article title makes me laugh. Backreferences as a metaphor for tapping into the power of The Dark One seems way, way too apt.
He noted that that's difficult using just the PCRE library in PHP. In Perl or C, where you can make the regex engine fire arbitrary code when certain things are matched, it becomes much more possible (but still probably not an approach you really want to take).
For those not familiar with the subject: PCREs are not regular expressions in the computer science meaning. The article implicitly states that the PCREs are regular expressions.
So, the article's message boils down to "if you make your language more powerful, you can do more with it. The PCRE language is so powerful that you can do the following with it: ..."
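A tiny illustration of the distinction, sketched with Python's `re` module rather than PCRE itself (both support backreferences): the language a^n b a^n is a textbook non-regular language, yet one backreference matches it. The names `SAME_COUNT` and `is_an_b_an` are made up for this example.

```python
import re

# (a+) captures a run of a's; \1 demands exactly the same run again after the b.
# No regular expression in the formal sense can enforce that equal count.
SAME_COUNT = re.compile(r"(a+)b\1")


def is_an_b_an(s):
    """True if s is n a's, a single b, then exactly n a's (n >= 1)."""
    return SAME_COUNT.fullmatch(s) is not None
```

So `is_an_b_an("aaabaaa")` holds while `is_an_b_an("aabaaa")` does not, which is exactly the kind of counting a finite automaton cannot do.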
Thanks for rewriting it. In fact, the email validator was also the thing I found most interesting. It makes me consider doing regex-based email checks in web apps.
No. Just no. As someone who has to maintain and build upon an RE-based parser, no. They couldn't even get the RE for URLs right in the RFC; what makes you think they can get it right for HTML?
The author declares that well-formed HTML can be recognized with a CFG, but it's far from clear to me whether that's the case. It's almost certainly not true for HTML5.