Maybe I wasn't paying close enough attention, but didn't he forget to mention that numbers must be displayed in unary form and just jumps right into checking "111", as if one hundred and eleven is prime, when he is actually checking if 3 is a prime.
He covered it, but I don't think he explained what was going on particularly well. I'm surprised by all of the people claiming it was a good explanation. I think if he had picked a few numbers and actually worked them through the algorithm completely it would have been much more of a useful explanation.
I agree. I’m not familiar with Python and figured the '1'*n was just a quirk to convert the int to a string, not an integral part of the process. Kind of a weird way to repeat a string, tbh, but I guess the concision is appreciated in typical Python applications like computer linguistics or something.
I don’t think he drew attention to it, because a few minutes later he highlights that it isn’t actually as simple as he first expressed, and shows the “1” * n.
Quite; motivated by that paper I implemented a decent sieve of Eratosthenes with the standard wheel factorization in Lean 4: https://github.com/ykonstant1/esiv
So in summary there is no special thing here about this being a regex: the program described by it basically just brute force tries to divide the given number by every number smaller than it, it’s just written in a way that isn’t obvious to understand.
That’s not to detract from the excellent post, just that this isn’t a mathematical trick that exploits some structure of primes but rather an incredibly clever way to write a computer program.
Part of it also that this isn't a regular language. The PCRE is more powerful a language than a Chomsky type 3 language in that there are strings that can be matched by a PCRE (such as a prime number expressed in unary) that are not recognized in a pure regular language.
I don't know why people keep pointing this out - RegEx's not being regular languages has been true for basically all of history (it is not just pcre, traditional unix (basic) regexes also have this). Most people's only experience with "regular" things have been with non-regular regexes. Grep is 51 years old at this point.
Because recognizing a prime number using an actual regular expression would prove that prime numbers form a regular language, which would be an incredible result.
And somehow, some people like me are in computer science mode when reading such sentences, such reminders wakes us up: "Oh, ok, not actually regular, not such a big deal"
> Because recognizing a prime number using an actual regular expression would prove that prime numbers form a regular language, which would be an incredible result.
It would be literally incredible, because the pumping lemma shows that it's false.
> Ah yes, I even used to teach this actually. Sorry for the understatement xD.
No worries! Mainly I was just pleased with myself for remembering the pumping lemma, and kind of wished that I knew how to contact my professor (from back in the mid-1990s, when an intro CS course was taught out of Sipser and involved almost no actual programming) and tell him that he did such a good job that it stuck with me some 30 years later!
The OP's title uses the word "regular", and it's about doing mathy things which puts the brain in math mode, so it's helpful to point out that this is only works with non-regular regexes.
The POSIX standard for regular expressions (which grep implemented half a century ago) doesn't support back references. Even I-Regexp (RFC 9485) doesn't support it.
This specification describes an interoperable regular expression (abbreviated as "regexp") flavor, I-Regexp.
I-Regexp does not provide advanced regular expression features such as capture groups, lookahead, or backreferences. It supports only a Boolean matching capability, i.e., testing whether a given regular expression matches a given piece of text.
It wasn't until '97 when PCRE was released to mimic Perl's handling of regex and some time after that that GNU grep added -P as an option (BSD doesn't appear to support PCRE).
While PCRE is a defacto standard (I been heard uttering "ugh, that only handles posix regex"), for most of the history of regex they were only as powerful as a NDFA.
POSIX EREs do not support backrefs. This goes back to the 1970s because egrep used a different regex matching algorithm to grep — egrep compiled the regex to a DFA which could not match backrefs, unlike grep’s nondeterministic algorithm — and egrep also had different syntax.
You lack theory of mind, people may "experience" regexes in practice but not make the careful distinction/connection to the elementary theory (and theorems, e.g. about the limitations of regular expressions) taught in CS majors at university, this is not some unusual disconnect, but happens often and in many disciplines whenever transferring any knowledge from academia to industry.
The thing is: regex are computer programs. The regex text is code written in a domain specific programming language that is compiled and then run against an input.
It was an important point in the design of Perl 6, now named Raku. Perl already had first class support for a powerful regex variant, in Raku, they went a step further to consider it for what they really are.
> In Raku, regexes are written in a domain-specific language, i.e. a sublanguage or slang. This page describes this language, and explains how regexes can be used to search for text patterns in strings in a process called pattern matching.
> Fundamentally, Raku regexes are very much like subroutines...
> Fundamentally, Raku regexes are very much like subroutines...
And if they're part of a grammar, they're essentially methods on a class. And a class that can be sub-classed. Or have roles mixed in, consisting of regexes.
...and the divide is effectively implemented by "multiplication", i.e. repeating the same match group (via backreference). It's one of those things that looks impossible at first, but you instantly turn to "of course that's how it works!" once you understand. That said, I still think this article is on the verbose side.
Also, strictly speaking, it's not a "regular expression" but a "regex", as backreferences make the language more powerful than regular.
AUTHOR
Larry Wall <larry@wall.org>
Maintainer: Larry Wall <larry@wall.org>
Date: 4 Jun 2002
Last Modified: 18 May 2006
Number: 5
Version: 7
This is the Apocalypse on Pattern Matching, generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).
> So in summary there is no special thing here about this being a regex
No, I think the story is that it's an incredible thing to implement a prime test in a regex. It was a pretty neat thing 20+ years ago when I first saw it and I reckon it's still pretty neat.
The "JAPH" thing was a pretty cool thing too.
perl -e '$a = q 94a75737420616e6f74686572205065726c204861636b65720a9 and
${qq$\x5F$} = q 97265646f9 and s g..g;
qq e\x63\x68\x72\x20\x30\x78$&eggee;
{eval if $a =~ s e..eqq qprint chr 0x$& and \x71\x20\x71\x71qeexcess}'
I think that this is definitely not true. There are lots of things where an "aha!" moment makes things appear conceptually much simpler after you've internalized the framework, but there are plenty of things in what I consider my area of expertise that are still hard even though I know very well how.
> In CS theory, regular languages are a strict subset of context-free languages, but regular expression implementations in mainstream programming languages are more powerful. As noulakaz.net/weblog/2007/03/18/… describes, so-called "regular expressions" can check for prime numbers in unary, which is certainly something that a regular expression from CS theory can't accomplish. – Adam Mihalcin
Commented Mar 19, 2012 at 23:50
Reminds me of work, a coworker was telling me that a new guy, who eventually quit, was really close to solving an issue with had with HTML in some fields where it doesn't belong, and then mentioned that the new guy was using regex to do it. I was like, I doubt he was actually close to solving the problem. That said, he likely would have solved our need because it really is just a subset of HTML used for stylizing text and not the full language, but it's a harder problem than people initially think.
To save a click, the regex in question is: ^1?$|^(11+?)\1+$ (it checks if a unary number is not prime.
It is kind of surprising to hear that regex can do that, but once you see the regex its kind of obvious. Its just checking if the number is 1, or if the number can be represented by 2 or more 1's repeated at least 2 times. Which is literally the definition of a prime (is the number divisible by a number >= 2)