The true power of regular expressions

chimeracoder · on Jan 24, 2013

Matching against the full set of PCRE features is, as the author points out, NP-complete. The problem is, just because you can do something doesn't mean you should.

If you ever find yourself constructing a recursively enumerable grammar (or even a CFG) using regular expressions - whether PCREs or any other variant - you should ask yourself why you aren't using a parser generator or a proper tool for creating a compiler front-end.

I hope people don't miss the author's closing point, which is the most important part:

> But don’t forget: Just because you can, doesn’t mean that you should. Processing HTML with regular expressions is a really bad idea in some cases. In other cases it’s probably the best thing to do.

I disagree that there are cases in which it's probably the best thing to do. Most languages support XPath/CSS selectors/etc., which are much better tools for matching arbitrary HTML patterns. I'm guilty of conjuring up a regex to scrape image links every now and then, but you should really only do that when your domain of expected input data is far more restricted than the actual CFG that you're dealing with.
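
A minimal sketch of the parser-based route for the image-link case, using Python's standard-library `html.parser` as a stand-in for the XPath/CSS selector tools mentioned above. The class name `ImgSrcCollector`, the function `image_links`, and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing tags are routed here as well.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_links(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Hypothetical input: a ">" inside an attribute, mixed case, and an
# <img> inside a comment, all of which tend to trip up a quick regex.
sample = """
<p>Text with <img src="a.png" alt="x > y"> and
<IMG class=big SRC='b.jpg'/> <!-- <img src="commented.gif"> -->
</p>
"""
print(image_links(sample))
```

The point is only that the parser copes with quoting, case and comments without any extra work; a quick regex is fine when the input is known never to contain such things.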

abecedarius · on Jan 24, 2013

> I'm guilty of conjuring up a regex to scrape image links every now and then

Any idea how to supersede that temptation? That is, what's wrong with the HTML-processing tool/sublanguage that makes regexes attractive instead? (For me, it's that I don't need to scrape HTML quite often enough to want to learn to use what's available. I'm probably irrationally lazy.)

rogerbinns · on Jan 24, 2013

Usually the problem is that the APIs are like typical XML APIs, which are a royal pain to use: they require various hoop jumping to say which tags are of interest, in which container, and how they relate to their siblings. If you are lucky you have to learn some sort of XPath-like syntax. Meanwhile you do a view source, can trivially see exactly which lines you want, and say sod all that nonsense - I'll use a regex.

My recommendation is to find a library that provides jQuery/CSS selector style syntax and semantics, and then suddenly it is a lot easier to deal with the document. For example for Python there is soupselect or cssselect.

Amusingly the latter shows that the selector "div.content" translates into the XPath expression "descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]".
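
The reason the translation is that long: `div.content` has to match `content` as a whole whitespace-separated token of the class attribute, not as a substring. A rough Python rendering of the same padding trick, where the helper name `has_class` is invented here and is not part of cssselect:

```python
def has_class(class_attr, wanted):
    """Roughly mimic
    contains(concat(' ', normalize-space(@class), ' '), ' wanted ').
    """
    # A missing attribute fails the "@class and" test; an empty one can
    # never contain the padded token, so both give False.
    if not class_attr:
        return False
    # normalize-space: trim, then collapse whitespace runs to one space.
    normalized = " ".join(class_attr.split())
    # Padding both sides with spaces makes every token space-delimited,
    # so a plain substring test becomes a whole-token test.
    return f" {wanted} " in f" {normalized} "


print(has_class("main  content\tbox", "content"))
print(has_class("content-wide", "content"))
```

The first call matches because `content` is one of the tokens; the second does not, because `content-wide` merely starts with it.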

meric · on Jan 25, 2013

Let's say you're correcting a 200 line HTML file, currently open in your text editor, to make it look neater. E.g. fixing tags like

  < p id='some_id'>

It would be ideal to use regular expression find and replace to look for:

  < ([a-z])

and replace with:

  <$1

Of course, be sure to review every replacement to make sure it isn't part of JavaScript or something like that.

IMO it'd be faster than writing a script and then running it against the file.
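
For reference, the same find and replace outside an editor looks roughly like this in Python, which spells the `$1` backreference as `\1`. The function name `tidy_tags` and the sample strings are made up:

```python
import re

# Same pattern as above: "<", one space, then a lowercase letter.
STRAY_SPACE = re.compile(r"< ([a-z])")


def tidy_tags(text):
    # Put the captured letter straight after "<", dropping the space.
    return STRAY_SPACE.sub(r"<\1", text)


print(tidy_tags("< p id='some_id'>Hello</p>"))
# The pattern knows nothing about context, so it rewrites script code too:
print(tidy_tags("if (a < b) { run(); }"))
```

The second call turns `a < b` into `a <b`, which is exactly the kind of hit that reviewing each replacement is meant to catch.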

polemic · on Jan 24, 2013

> "As such they can also match well-formed HTML and pretty much all other programming languages."

While you could match well-formed HTML, the inevitable follow-up will be "how do I match parsable, but not well-formed HTML"? There is a reason that XHTML was deeply unloved.

(also... The <center> cannot hold it is too late. .. .ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ)

vinhboy · on Jan 24, 2013

̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ) interesting...

chc · on Jan 24, 2013

It's from a very popular Stack Overflow answer a while back. One longtime user got tired of having to explain the intricacies of HTML parsing to people, so instead he posted a Lovecraftian rant about the horrors of parsing HTML with regular expressions.

obviouslygreen · on Jan 24, 2013

For the curious:

http://stackoverflow.com/questions/1732348/regex-match-open-...

Millennium · on Jan 24, 2013

As someone who has been reading The Wheel of Time, this article title makes me laugh. Backreferences as a metaphor for tapping into the power of The Dark One seems way, way too apt.

abecedarius · on Jan 24, 2013

This shows using PCRE to recognize text following a context-free grammar (and some context-sensitive ones). Can you make it produce a parse tree too?

dllthomas · on Jan 24, 2013

He noted that that's difficult using just the PCRE library in PHP. In Perl or C, where you can make the regex engine fire arbitrary code when certain things are matched, it becomes much more feasible (but still probably not an approach you really want to take).

amatus · on Jan 24, 2013

This should be titled "The true power of PCREs"

Someone · on Jan 24, 2013

For those not familiar with the subject: PCREs are not regular expressions in the computer science meaning. The article implicitly states that the PCREs are regular expressions.

So, the article's message boils down to "if you make your language more powerful, you can do more with it. The PCRE language is so powerful that you can do the following with it: ..."
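
A small illustration of the gap, using Python's `re` rather than PCRE itself (it supports backreferences but not PCRE's recursion). A single backreference is already enough to recognize the "squares" language, strings of the form ww, which is a textbook example of a language that is not regular. The name `is_square` is made up:

```python
import re

# (.+) captures some non-empty w, and \1 demands the same text again,
# so a full match means the whole string is w followed by w.
SQUARE = re.compile(r"(.+)\1")


def is_square(s):
    return SQUARE.fullmatch(s) is not None


for candidate in ["abcabc", "aa", "abcab", "aba"]:
    print(candidate, is_square(candidate))
```

No finite automaton can do this, since it would have to remember an arbitrarily long w before checking the second half.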

blablabla123 · on Jan 24, 2013

After all it should be no big deal binding the PCREs from your favorite language...

draegtun · on Jan 24, 2013

I couldn't see this in there so this is related: Oh Yes You Can Use Regexes to Parse HTML - http://news.ycombinator.com/item?id=2741780

lectrick · on Jan 24, 2013

For what it's worth, here is the cited RFC 5322 email validator regexp except written in a form that Ruby's regexp parser understands:

https://gist.github.com/4626713

blablabla123 · on Jan 24, 2013

Thanks for rewriting it. In fact, the email validator was also the thing I found most interesting. It makes me consider doing email regex checks in web apps.

sebcat · on Jan 24, 2013

No. Just no. As someone who has to maintain and build upon an RE based parser, no. They couldn't even get the RE right for URLs in the RFC, what makes you think they can get it right for HTML?

mnarayan01 · on Jan 24, 2013

The author declares that well-formed HTML can be recognized with a CFG, but it's far from clear to me whether that's the case. It's almost certainly not true for HTML5.

mikegirouard · on Jan 24, 2013

This was a great read, but the comments got even more interesting.

It's all pretty much over my head, so I couldn't figure out if StoneCypher was trolling or if he had a real point.

mnarayan01 · on Jan 24, 2013

Seems to be trolling. I find that refusal to link to a claimed source in the face of requests to do so almost always corresponds to trolling. Also

  There's actually a mathematical proof out there that no regular expression engine will safely extract from all broken HTML

seems like it has to be false unless it's much more qualified.

tolos · on Jan 24, 2013

I was really expecting an article about parsing HTML when he said right at the beginning:

>> You cannot parse HTML with regular expressions, because HTML isn’t regular. Use an XML parser instead.

> This statement - in the context of the question - is somewhere between very misleading and outright wrong.

But nope, after disappointment and going back to the beginning, he says he's not talking about HTML:

> What I’ll try to demonstrate in this article is how powerful modern regular expressions really are.

And not even a warning about how easy it is to write a really terrible regex.

jre · on Jan 24, 2013

Really interesting read, thanks for this!