Hacker News new | past | comments | ask | show | jobs | submit login
Demystifying the regular expression that checks if a number is prime (2016) (illya.sh)
159 points by aquir 34 days ago | hide | past | favorite | 47 comments



Also check the Matt Parker video for a more entertaining explanation: https://www.youtube.com/watch?v=5vbk0TwkokM


I appreciate that he picks a subject dealing with regexes for his Halloween video. I can think of few things more frightening or appropriate.


I wish someone explained regexes to me as concisely as Matt Parker did in that video, it would have saved me so much trouble.


Maybe I wasn't paying close enough attention, but didn't he forget to mention that numbers must be displayed in unary form and just jumps right into checking "111", as if one hundred and eleven is prime, when he is actually checking if 3 is a prime.


He covered it, but I don't think he explained what was going on particularly well. I'm surprised by all of the people claiming it was a good explanation. I think if he had picked a few numbers and actually worked them through the algorithm completely it would have been much more of a useful explanation.


I agree. I’m not familiar with Python and figured the '1'*n was just a quirk to convert the int to a string, not an integral part of the process. Kind of a weird way to repeat a string, tbh, but I guess the concision is appreciated in typical Python applications like computer linguistics or something.


Yeah, that is the very first thing he did when he showed his python code.


I don’t think he drew attention to it, because a few minutes later he highlights that it isn’t actually as simple as he first expressed, and shows the “1” * n.


The precondition that you need to first convert to a unary number makes this feel like it's almost cheating.

The regex is not totally trivial, but it's not super sophisticated either: conceptually 'just' a Sieve of Eratosthenes.


It isn’t a sieve, it’s trial division. For example, a sieve skips powers of primes but this regex tests them all.


Correct, I would be much more impressed with a regex implementing the sieve of Eratosthenes. Not that this is not amusing!


Implementing the real Sieve is quite a challenge in any language:

https://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf


Quite; motivated by that paper I implemented a decent sieve of Eratosthenes with the standard wheel factorization in Lean 4: https://github.com/ykonstant1/esiv


I stand corrected :)


To be fair, you aren’t alone! Matt Parker also said it’s the Sieve of Eratosthenes in his latest video.


So in summary there is no special thing here about this being a regex: the program described by it basically just brute force tries to divide the given number by every number smaller than it, it’s just written in a way that isn’t obvious to understand.

That’s not to detract from the excellent post, just that this isn’t a mathematical trick that exploits some structure of primes but rather an incredibly clever way to write a computer program.


Part of it also that this isn't a regular language. The PCRE is more powerful a language than a Chomsky type 3 language in that there are strings that can be matched by a PCRE (such as a prime number expressed in unary) that are not recognized in a pure regular language.

Extending finite automata to efficiently match Perl-compatible regular expressions - http://dx.doi.org/10.1145/1544012.1544037


I don't know why people keep pointing this out - RegEx's not being regular languages has been true for basically all of history (it is not just pcre, traditional unix (basic) regexes also have this). Most people's only experience with "regular" things have been with non-regular regexes. Grep is 51 years old at this point.


Because recognizing a prime number using an actual regular expression would prove that prime numbers form a regular language, which would be an incredible result.

And somehow, some people like me are in computer science mode when reading such sentences, such reminders wakes us up: "Oh, ok, not actually regular, not such a big deal"


> Because recognizing a prime number using an actual regular expression would prove that prime numbers form a regular language, which would be an incredible result.

It would be literally incredible, because the pumping lemma shows that it's false.


> because the pumping lemma shows that it's false

Ah yes, I even used to teach this actually. Sorry for the understatement xD.


> Ah yes, I even used to teach this actually. Sorry for the understatement xD.

No worries! Mainly I was just pleased with myself for remembering the pumping lemma, and kind of wished that I knew how to contact my professor (from back in the mid-1990s, when an intro CS course was taught out of Sipser and involved almost no actual programming) and tell him that he did such a good job that it stuck with me some 30 years later!


The OP's title uses the word "regular", and it's about doing mathy things which puts the brain in math mode, so it's helpful to point out that this is only works with non-regular regexes.


The POSIX standard for regular expressions (which grep implemented half a century ago) doesn't support back references. Even I-Regexp (RFC 9485) doesn't support it.

    This specification describes an interoperable regular expression (abbreviated as "regexp") flavor, I-Regexp.

    I-Regexp does not provide advanced regular expression features such as capture groups, lookahead, or backreferences. It supports only a Boolean matching capability, i.e., testing whether a given regular expression matches a given piece of text.
It wasn't until '97 when PCRE was released to mimic Perl's handling of regex and some time after that that GNU grep added -P as an option (BSD doesn't appear to support PCRE).

While PCRE is a defacto standard (I been heard uttering "ugh, that only handles posix regex"), for most of the history of regex they were only as powerful as a NDFA.


POSIX BREs do support backrefs. See section 9.3.6 BREs Matching Multiple Characters point 3 at https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1...

Backref support was added to grep between 6th edition and 7th edition unix

6th edition grep manual: http://man.cat-v.org/unix-6th/1/grep

7th edition grep manual: http://man.cat-v.org/unix_7th/1/grep

Both of those refer to ed(1) for the syntax of regular expressions

6th edition ed manual: http://man.cat-v.org/unix-6th/1/ed

7th edition ed manual: http://man.cat-v.org/unix_7th/1/ed

POSIX EREs do not support backrefs. This goes back to the 1970s because egrep used a different regex matching algorithm to grep — egrep compiled the regex to a DFA which could not match backrefs, unlike grep’s nondeterministic algorithm — and egrep also had different syntax.


Before seeing the regex, I was thinking, how can you possibly recognize a prime number with a regular language?

The answer is, you don't, this regex doesn't describe a regular language


You lack theory of mind, people may "experience" regexes in practice but not make the careful distinction/connection to the elementary theory (and theorems, e.g. about the limitations of regular expressions) taught in CS majors at university, this is not some unusual disconnect, but happens often and in many disciplines whenever transferring any knowledge from academia to industry.


The thing is: regex are computer programs. The regex text is code written in a domain specific programming language that is compiled and then run against an input.

It was an important point in the design of Perl 6, now named Raku. Perl already had first class support for a powerful regex variant, in Raku, they went a step further to consider it for what they really are.

> In Raku, regexes are written in a domain-specific language, i.e. a sublanguage or slang. This page describes this language, and explains how regexes can be used to search for text patterns in strings in a process called pattern matching.

> Fundamentally, Raku regexes are very much like subroutines...


> Fundamentally, Raku regexes are very much like subroutines...

And if they're part of a grammar, they're essentially methods on a class. And a class that can be sub-classed. Or have roles mixed in, consisting of regexes.


...and the divide is effectively implemented by "multiplication", i.e. repeating the same match group (via backreference). It's one of those things that looks impossible at first, but you instantly turn to "of course that's how it works!" once you understand. That said, I still think this article is on the verbose side.

Also, strictly speaking, it's not a "regular expression" but a "regex", as backreferences make the language more powerful than regular.

https://en.wikipedia.org/wiki/Regular_expression#Patterns_fo...


>Also, strictly speaking, it's not a "regular expression" but a "regex"

And regex is short for regular expression, so you've essentially said nothing.


https://web.archive.org/web/20100112232513/http://dev.perl.o...

    AUTHOR

    Larry Wall <larry@wall.org>


    Maintainer: Larry Wall <larry@wall.org>
    Date: 4 Jun 2002
    Last Modified: 18 May 2006
    Number: 5
    Version: 7

    This is the Apocalypse on Pattern Matching, generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).


> So in summary there is no special thing here about this being a regex

No, I think the story is that it's an incredible thing to implement a prime test in a regex. It was a pretty neat thing 20+ years ago when I first saw it and I reckon it's still pretty neat.

The "JAPH" thing was a pretty cool thing too.

perl -e '$a = q 94a75737420616e6f74686572205065726c204861636b65720a9 and ${qq$\x5F$} = q 97265646f9 and s g..g; qq e\x63\x68\x72\x20\x30\x78$&eggee; {eval if $a =~ s e..eqq qprint chr 0x$& and \x71\x20\x71\x71qeexcess}'


Everything is easy once you know how.


> Everything is easy once you know how.

I think that this is definitely not true. There are lots of things where an "aha!" moment makes things appear conceptually much simpler after you've internalized the framework, but there are plenty of things in what I consider my area of expertise that are still hard even though I know very well how.


I like this regex, which divides a number by sqrt(2):

    (?=(x(x*)).*(?=\1*$)\2+$)(?=(x\1)+(x?(x*)))(?=\4(x(x*?))\1+$)(?=.*(?=(?=\4*$)\4\5+$)(x*?)(?=\3*$)(x?(x*?))(\1+$|$\9))(?=.*(?=(?=\4*$)(?=\6*$)(?=\4\7+$)\6\5+$|$\4)(x*?)(?=\3*$)(x?(x*?))(\1+$|$\13))(?=.*(?=\12\12\9$)(x*?)(?=\3*$)(x?(x*?))(\1+$|$\17))(?*.*?(?=((?=\3*(x?(x*)))\21(x(x*?))\1+$)))(?=.*(?=\23*$)(\23\24+$))(?=.*(?=(?=\21*$)\21\22+$)(x*?)(?=\3*$)(x?(x*?))(\1+$|$\27))(?=.*(?=(?=\21*$)(?=\23*$)(?=\21\24+$)\23\22+$|$\21)(x*?)(?=\3*$)(x?(x*?))(\1+$|$\31))(?=.*(?=\30\30\27$)(x*?)(?=\3*$)(x?(x*?))(\1+$|$\35))(?=.*(?=\26\26)(?=\3*(x*))(\1(x)|))(?=.*(?=\34\34\40)(?=\3*(x*))(\1(x)|))(?=(?=(.*)\13\13\17(?=\6*$)\6\7+$)\44(x+|(?=.*(?!\16)\41|(?!.*(?!\38)\8).*(?=\16$)\41$))(\25\31\31\35){2}\43$)\20|xx?\B|
Source: https://codegolf.stackexchange.com/a/198428


Very nice! Almost as complex as the regex to parse HTML: https://stackoverflow.com/a/1732454


From the replies :)

> In CS theory, regular languages are a strict subset of context-free languages, but regular expression implementations in mainstream programming languages are more powerful. As noulakaz.net/weblog/2007/03/18/… describes, so-called "regular expressions" can check for prime numbers in unary, which is certainly something that a regular expression from CS theory can't accomplish. – Adam Mihalcin Commented Mar 19, 2012 at 23:50


Reminds me of work, a coworker was telling me that a new guy, who eventually quit, was really close to solving an issue with had with HTML in some fields where it doesn't belong, and then mentioned that the new guy was using regex to do it. I was like, I doubt he was actually close to solving the problem. That said, he likely would have solved our need because it really is just a subset of HTML used for stylizing text and not the full language, but it's a harder problem than people initially think.


I figure he quit after the nightmares of backtracking regexes consuming the world started.


The title should be: … the regex that checks if the length of a string with the same characters is a prime number.


That's not a regular expression and it's a ridiculously inefficient way to check for primality.


I think he meant to say smallest maybe? Ie its description is terse, not its runtime.


To save a click, the regex in question is: ^1?$|^(11+?)\1+$ (it checks if a unary number is not prime.

It is kind of surprising to hear that regex can do that, but once you see the regex its kind of obvious. Its just checking if the number is 1, or if the number can be represented by 2 or more 1's repeated at least 2 times. Which is literally the definition of a prime (is the number divisible by a number >= 2)


I had a hunch that maybe a lookahead might help a bit, but it turned out to be slower:

  /^(..+?)(?=\1+$)/
Edit: of course, silly me, +? is non-greedy.


Isn't this based on an expression from Abigail then at Perlmonks? https://www.masteringperl.org/2013/06/how-abigails-prime-num...


perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' <number>

I saw that by Abigail on comp.lang.perl.misc many moons ago. Here is an article about it: http://test.neilk.net/blog/2000/06/01/abigails-regex-to-test...

As far as I know, she was the genesis of this whole thing.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: