
The true power of regular expressions - saurabh
http://nikic.github.com/2012/06/15/The-true-power-of-regular-expressions.html
======
polemic
> _"As such they can also match well-formed HTML and pretty much all other
> programming languages."_

While you could match well-formed HTML, the inevitable follow-up will be "how
do I match _parsable_, but not well-formed, HTML?" There is a reason that
XHTML was deeply unloved.

(also... The <center> cannot hold it is too late. .. .ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱
TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘
̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ)

~~~
vinhboy
̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡
H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘
̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ) interesting...

~~~
chc
It's from a very popular Stack Overflow answer a while back. One longtime user
got tired of having to explain the intricacies of HTML parsing to people, so
instead he posted a Lovecraftian rant about the horrors of parsing HTML with
regular expressions.

~~~
obviouslygreen
For the curious:

<http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454>

------
Millennium
As someone who has been reading The Wheel of Time, this article title makes me
laugh. Backreferences as a metaphor for tapping into the power of The Dark One
seems way, way too apt.

------
abecedarius
This shows using PCRE to _recognize_ text following a context-free grammar
(and some context-sensitive ones). Can you make it produce a parse tree too?

~~~
dllthomas
He noted that this is difficult using just the PCRE library in PHP. In Perl or
C, where you can make the regex engine fire arbitrary code when certain things
are matched, it becomes much more feasible (but still probably not an approach
you really want to take).
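
For comparison, here is a minimal sketch of the non-regex route: a hand-written recursive-descent parser that returns a tree instead of a yes/no answer. The toy grammar (nothing but nested parentheses) and the name `parse_nested` are assumptions picked for illustration; they are not taken from the article.

```python
def parse_nested(text):
    """Parse balanced parentheses into nested lists; raise ValueError otherwise."""
    pos = 0

    def parse_group():
        # Parses zero or more "( group )" items starting at the current position.
        nonlocal pos
        children = []
        while pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_group())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        return children

    tree = parse_group()
    if pos != len(text):
        raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    return tree


print(parse_nested("(()())()"))  # prints [[[], []], []]
```

Each pair of parentheses becomes one list, so the nesting of the input is directly visible in the result, which is exactly the information a pure recognizer throws away.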

------
chimeracoder
Matching against the full set of PCRE features is, as the author points out,
NP-complete. The problem is that just because you _can_ do something doesn't
mean you _should_.

If you ever find yourself encoding a recursively enumerable grammar (or
even a CFG) in regular expressions, whether PCREs or any other variant,
you should ask yourself why you aren't using a parser generator or a proper
tool for creating a compiler front-end.

I hope people don't miss the author's closing point, which is the most
important part:

> But don’t forget: Just because you can, doesn’t mean that you should.
> Processing HTML with regular expressions is a really bad idea in some cases.
> In other cases it’s probably the best thing to do.

I disagree that there are cases in which it's probably the best thing to do.
Most languages support XPath, CSS selectors, and the like, which are much
better tools for matching arbitrary HTML patterns. I'm guilty of conjuring up
a regex to scrape image links every now and then, but you should really only
do that when your domain of expected input data is far more restricted than
the actual CFG that you're dealing with.
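
As a rough illustration of the alternative for the image-link case, here is a sketch that uses only Python's standard-library `html.parser` module. The class name `ImageSrcCollector`, the helper `image_sources`, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(html_text):
    collector = ImageSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# Mixed case, an unquoted value, and a ">" inside an attribute: the kind of
# input that tends to trip up a quick regex.
sample = '<p>Hi <IMG alt="a > b" SRC=one.png><img src="two.jpg"/></p>'
print(image_sources(sample))
```

The point is not that this is shorter than a regex, only that the tokenizer deals with quoting and case so the extraction logic does not have to.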

~~~
abecedarius
_I'm guilty of conjuring up a regex to scrape image links every now and then_

Any idea how to remove that temptation? That is, what's wrong with the
HTML-processing tool/sublanguage that makes regexes attractive instead? (For
me, it's that I don't need to scrape HTML quite often enough to want to learn
to use what's available. I'm probably irrationally lazy.)

~~~
rogerbinns
Usually the problem is that the APIs look like typical XML APIs, which are a
royal pain to use: you have to jump through various hoops to say which tags
are of interest, in which container, and how they relate to their siblings.
If you are lucky, you only have to learn some sort of XPath-like syntax.
Meanwhile you do a view source, can trivially see exactly which lines you
want, and say sod all that nonsense, I'll use a regex.

My recommendation is to find a library that provides jQuery/CSS selector-style
syntax and semantics; suddenly it is a lot easier to deal with the
document. For Python, for example, there is soupselect or cssselect.

Amusingly, the latter shows that the selector "div.content" translates into the
XPath "descendant-or-self::div[@class and contains(concat(' ',
normalize-space(@class), ' '), ' content ')]".
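
To unpack what that XPath predicate is doing, here is a small sketch of the same test in plain Python. The function name `has_class` is made up, and `str.split()` is only an approximation of XPath's `normalize-space()`, since it treats a few more characters as whitespace.

```python
def has_class(class_attr, wanted):
    """Mirror the XPath test for a whitespace-separated class token."""
    if class_attr is None:
        # Corresponds to the "@class and ..." guard: no attribute, no match.
        return False
    # normalize-space(): trim the ends and collapse whitespace runs to one space.
    normalized = " ".join(class_attr.split())
    # concat(' ', ..., ' ') pads both ends so that every token is surrounded
    # by spaces; contains(..., ' wanted ') then only matches whole tokens.
    return f" {wanted} " in f" {normalized} "


print(has_class("main content", "content"))  # prints True
print(has_class("contents", "content"))      # prints False
```

The padding trick is why the generated XPath looks so convoluted: a plain substring test would wrongly accept `class="contents"` for the selector `div.content`.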

------
lectrick
For what it's worth, here is the cited RFC 5322 email validator regexp,
rewritten in a form that Ruby's regexp parser understands:

<https://gist.github.com/4626713>

~~~
blablabla123
Thanks for rewriting it. In fact, the email validator was also the thing I
found most interesting. It makes me consider doing email regex checks in web
apps.

------
amatus
This should be titled "The true power of PCREs"

~~~
Someone
For those not familiar with the subject: PCREs are not regular expressions in
the computer science sense. The article implicitly treats PCREs as regular
expressions.

So, the article's message boils down to "if you make your language more
powerful, you can do more with it. The PCRE language is so powerful that you
can do the following with it: ..."
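
A small sketch of that gap, using Python's `re` module rather than PCRE (an assumption made for convenience; Python's engine has backreferences but not PCRE's recursion). The pattern accepts strings of the form "ww", some non-empty string over {a, b} followed by an exact copy of itself. That language is not regular, and not even context-free, so no regular expression in the textbook sense can describe it.

```python
import re

# ([ab]+) captures a non-empty word; \1 demands the same text again.
SQUARE = re.compile(r"([ab]+)\1")


def is_square(text):
    # fullmatch() requires the whole string to have the shape "ww".
    return SQUARE.fullmatch(text) is not None


print(is_square("abaaba"))  # prints True  ("aba" + "aba")
print(is_square("abba"))    # prints False (a palindrome, not a repeated word)
```

The backreference `\1` is the feature doing the work here; remove it and the pattern is an ordinary regular expression again.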

------
draegtun
I couldn't see this in there, so here is something related: _Oh Yes You Can
Use Regexes to Parse HTML_ - <http://news.ycombinator.com/item?id=2741780>

------
sebcat
No. Just no. As someone who has to maintain and build upon an RE-based parser,
no. They couldn't even get the RE right for URLs in the RFC; what makes you
think they can get it right for HTML?

------
mikegirouard
This was a great read, but the comments got even more interesting.

It's all pretty much over my head, so I couldn't figure out if StoneCypher was
trolling or if he had a real point.

~~~
mnarayan01
Seems to be trolling. I find that refusal to link to a claimed source in the
face of requests to do so almost always corresponds to trolling. Also,

> There's actually a mathematical proof out there that no regular expression
> engine will safely extract from all broken HTML

seems like it has to be false unless it's _much_ more qualified.

------
mnarayan01
The author declares that well-formed HTML can be recognized with a CFG, but
it's far from clear to me whether that's the case. It's almost certainly not
true for HTML5.

------
tolos
I was really expecting an article about parsing HTML when he said right at the
beginning:

>> You cannot parse HTML with regular expressions, because HTML isn’t regular.
>> Use an XML parser instead.

> This statement - in the context of the question - is somewhere between very
> misleading and outright wrong.

But nope, after disappointment and going back to the beginning, he says he's
not talking about HTML:

> What I’ll try to demonstrate in this article is how powerful modern regular
> expressions really are.

And not even a warning about how easy it is to write a really terrible regex.

------
jre
Really interesting read, thanks for this!

