
The true power of regular expressions (2012) - doomrobo
https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
======
Fripplebubby
I find this to be an interesting side effect of computer science as a branch
of mathematics vs applied computer science in "programming" and "development".

I remember touring the computer science building at the university I would end
up attending before I had enrolled there - some students in one room were
discussing whether there was a regular expression for detecting prime numbers.
I laughed to myself - because in my mind, that's not at all what a regular
expression was.

Sure enough, there is in fact such a regular expression (or "regex pattern",
if you prefer)[1]. Regular expressions in practice encompass BOTH the academic
level-3 regular grammar stuff, and a bunch of other stuff that's been tacked
on because it's useful in real-world applications.

[1] - This has been posted on HN before, probably a bunch of times:
[https://iluxonchik.github.io/regular-expression-check-if-
num...](https://iluxonchik.github.io/regular-expression-check-if-number-is-
prime/)

~~~
bsdetector
This regex doesn't tell whether a number is prime, it tells whether the length
of a specially crafted string is prime.

Kind of a big difference as you can make a regex that answers _any_ yes/no
question if you preprocess the input enough.

~~~
Fripplebubby
Well, saying that a number presented as unary isn't a number, to me, is the
same as saying that a number presented as decimal isn't a number. But you're
right that it is "specially crafted" rather than a general solution.

------
zokier
While true that PCRE is widely used, it is also true that there are several
regexp libraries that do not veer very far from the formal regular language
theory. Most notably Googles re2 which is based on finite automation engine.

~~~
jonstewart
Yes, there’s a claim in TFA about modern implementations going well beyond
regularity, and, well, if you consider PCRE modern, then sure, just as TS
Eliot is Modern. Now _contemporary_ regexp engines, OTOH...

------
FartyMcFarter
> So if somebody found a fast solution to a NP-complete problem, pretty much
> all of the computationally hard problems of humanity would be solved all in
> one strike.

There are plenty of interesting computationally hard problems that are
(probably) harder than NP-complete problems, e.g. PSPACE-complete problems.
For example solving chess, or more on topic, the problem of "does this regular
expression generate all strings over this alphabet"?

~~~
mikejb
Small but relevant nitpick:

 _Generalized_ chess is EXPEPACE-complete. Standard chess has an 8x8 grid with
a finite number of states. In theory, chess is solvable in constant time
(because you can provide a finite lookup table).

~~~
FartyMcFarter
If I understand correctly it's only exptime-complete if you allow an
exponential number of moves? If you only allow a polynomial number of moves
then it's PSPACE-complete.

But yes good point about specific board sizes :)

------
orcs
My takeaway from this, since the majority of it went over my head: I can
continue to parse HTML with regular expressions.

~~~
TheDong
I disagree with that takeaway.

Technically PCRE regexes are powerful and can match anything.

In reality, a complex PCRE regex will almost always be more difficult to
maintain than a parser-combinator or hand-rolled parser in a more traditional
language.

People saying "Regexes can't match HTML, use an html library" are wrong to say
regexes are incapable of it, but they're right to say to use a library meant
for the job.

Sure, you can use this regex to match emails ([http://www.ex-
parrot.com/pdw/Mail-RFC822-Address.html](http://www.ex-parrot.com/pdw/Mail-
RFC822-Address.html)), but using a more normal parsing language than PCRE
regex will result in more readable code.

The same is true for almost any regular expression that takes advantage of
PCRE features, especially backreferences.

In addition, a regexp will only match html correctly if you write a very
complex one. With a naive regexp for an html tag's contents, you'll find that
you still might match that text inside a <script> tag even though that is not
html.. so you now need to figure out when you're in a script tag and exclude
that, or if you're inside an html attribute string, and before you know it you
have a 2000 character regexp that no one else will be able to read, all
because you didn't want to use an html parsing library where getting a tag's
value correctly would be a single xpath expression or css selector away.

~~~
pierrec
There's no arguing the fact that regexes are a poor fit for HTML, but maybe
this is the wrong time to use that ridiculous email regex as an example, since
TFA features a highly readable, fully compliant email matching regex as its
main example.

~~~
dfox
What the language contains as a main example is PCRE pattern that matches
email addresses and in comparison to it's original EBNF incarnation is highly
unreadable due all the syntax noise required to graft backtracing support onto
traditional regex syntax (not to mention the fact that it's performance is
certainly highly suboptimal).

~~~
TheDong
The performance might be better in some cases actually.

In the case of perl, where regexps performance has been optimized for
significantly, the regexp actually performs better than a more normal parser.

From [http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html](http://www.ex-
parrot.com/pdw/Mail-RFC822-Address.html):

> It provides the same functionality as RFC::RFC822::Address, but uses Perl
> regular expressions rather that the Parse::RecDescent parser. This means
> that the module is much faster to load as it does not need to compile the
> grammar on startup.

Of course, if perl were a statically compiled language, the cost of compiling
the grammar could be done at compile time.

~~~
dfox
Perl5 is not too meaningful for such performance comparisons, because on one
hand it's regex implementation is very optimized while on other hand
performance of Perl5 on "normal" procedural code is horrible (eg. Perl5 is
about an order of magnitude slower on Gabriel's Takeuchi function benchmark
than CPython).

~~~
nicoburns
This result generalises to most interpreted languages though. PHP, python,
javascript, etc all have highly optimised regex engines, and regexs can
consequently be a good optimisation technique when using those languages.

------
andrewmcwatters
Wow! Fascinating! The portion displaying the regular expression for matching
email addresses as per RFC 5322 was excellent. I've not once seen a regular
expression written out this way.

------
gbacon
Related content on Stack Overflow: "Regex: How to find and extract acronyms
and corresponding definition of acronym from a text?"[0] that gives the nearly
the same RFC 5322 parser.

The reduction of 3CNF-SAT seems to be borrowed from "Reduction of 3-CNF-SAT to
Perl Regular Expression Matching"[1], whose author is careful to note that the
reduction shows that regex matching with backreferences is NP-hard. To show
that it is NP-complete, you'd also have to show that regex matching with
backreferences is also in NP, _i.e._ , that any match is checkable in
polynomial time.

Readers may also be interested in "Oh Yes You Can Use Regexes to Parse
HTML!"[2] on Stack Overflow by Tom Christiansen.

[0]:
[https://stackoverflow.com/a/18339610/123109](https://stackoverflow.com/a/18339610/123109)

[1]:
[https://perl.plover.com/NPC/NPC-3SAT.html](https://perl.plover.com/NPC/NPC-3SAT.html)

[2]:
[https://stackoverflow.com/a/4234491/123109](https://stackoverflow.com/a/4234491/123109)

------
dang
Lots of previous discussions at
[https://hn.algolia.com/?query=The%20true%20power%20of%20regu...](https://hn.algolia.com/?query=The%20true%20power%20of%20regular%20expressions&sort=byDate&dateRange=all&type=story&storyText=false&prefix&page=0)

------
Kenji
_Wrapping up As the article was quite long, here a summary of the main points:

\- The “regular expressions” used by programmers have very little in common
with the original notion of regularity in the context of formal language
theory.

\- Regular expressions (at least PCRE) can match all context-free languages.
As such they can also match well-formed HTML and pretty much all other
programming languages.

\- Regular expressions can match at least some context-sensitive languages.

\- Matching of regular expressions is NP-complete. As such you can solve any
other NP problem using regular expressions._

I wish he wrote that in the beginning of the article. Would have saved me a
lot of skimming (I already know grammar theory).

------
another-cuppa
It's only OK to parse something with regex if it was defined with regex. Far
too often I see people wanting to match postcodes and things which are not
defined by regex and your heuristic could break at any time.

~~~
zamadatix
What about postcodes makes heuristics break for regex but not a general
parser? I'd assume if the format changed it'd break both all the same.

------
sharpercoder
As a lead developer, I frown upon every instance of regex used. If it's
shorter then, say, 20 characters, it may be allowed. Longer then that will
result in a negative advice and binding advice to refactor. You implement a
language lexer & parser, or use plain string functions.

~~~
jonathanyc
I’d expect you, “as a lead developer,” to know how the tokenizer in your lexer
almost certainly works. Rather than frowning upon “every instance of regex
used,” I’d expect you to be familiar with cleaner and more maintainable ways
of writing regexes, which are, after all, just declarative ways of specifying
finite automata. An NFA is infinitely easier to characterize than some mess of
a hand written recursive descent parser, precisely because they are so
restricted.

