Hacker News new | comments | show | ask | jobs | submit login
A regular expression to check for prime numbers (noulakaz.net)
192 points by Moyamo on Feb 12, 2015 | hide | past | web | favorite | 67 comments



This regular expression is due to Abigail of Perl fame. Who is still very active, see http://search.cpan.org/~abigail/ for more.

However while cute, it is very slow. It tries every possible factorization as a pattern match. When it succeeds, on a string of length n that means that n times it tries to match a string of length n against a specific pattern. This is O(n^2). Try it on primes like 35509, 195341, 526049 and 1030793 and you can observe the slowdown. (Some regular expression engines may employ tricks that speed this up over the naive analysis, but none will be super speedy about it.)

Usual arithmetic is much more efficient. :-)


It's even worse, because the speed of factoring algorithms is measured against the number of digits of the binary expansion, and this is a unary expansion. So it's really O(2^n^2)


You really have to put parentheses when building "towers" of exponentials. The bound is (2^n)^2, which simplifies to 2^(2n). This is of course still horrible, but much less so than the 2^(n^2) for which what you wrote may be mistaken.


Yeah, I realized that after the edit link expired. Let's just disambiguate the expression in whatever way makes me look best :)


It's right associative, normally :)


Thank you for answering my anticipated question. I was thinking, "There is some beauty here, but this seems very slow relative to other options."


Some how the beauty of this reminds me of Lambda calculus and Church numerals.


Aha! My memory doesn't fail me... I recall her discussing it on #perl 18 or 19 years ago.


For some this will be obvious, but for others it may still be interesting:

If this was a "true" regular expression, in the sense that it describes a regular language (http://en.wikipedia.org/wiki/Regular_language), then checking whether the number is prime or not, by checking if the expression matches, would run in linear time. This would be worth a few nobel prizes at least.

Unfortunately, it is an "extended" regular expression, i.e. not regular at all in the strictest sense, and (as the article says) requires backtracking.


It's also worth noting that what this expression matches are not primes given in decimal notation; rather, the language it accepts is "111...111" with any prime number of 1s. And we know straight-up that no "true" regular expression matches this language, by the pumping lemma (http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_langu...).


Not true!

First of all there is no Nobel in math/CS.

But more importantly, when we talk about the running time of algorithms, we talk about the time in relation to how many bits it takes to describe the number. So linear in the size of the number is exponential in the number of bits, and that is not very interesting.

And a better primality test would not rock the world of cryptography. At the moment http://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_... is demonstrably good enough as a prime test that nobody really is hurting for better ones. The thing that you want is a better factoring algorithm. And the General Number Sieve is already subexponential, and so better than any regular expression could be.


First of all there is no Nobel in math/CS.

I know. An algorithm this efficient would still be worth that much. :)

But more importantly, when we talk about the running time of algorithms, we talk about the time in relation to how many bits it takes to describe the number. So linear in the size of the number is exponential in the number of bits, and that is not very interesting.

No, the regular expression described here works on numbers in unary representation. Linear to the size of the unary representation is linear to the number itself.


Unary representation is useless when speaking about complexity in number theory.

If you take a number N given in unary, convert it to binary and do trial division up to the square root, it will take O(N log N) time for conversion to binary and O(sqrt(N) * log(N)^2) for trial division (depending on your computational model, it could be O(N) and O(sqrt(N)) - I am counting bit complexity). In total, it's O(N log N). The runtime is dominated by reading the input! The complexity of trial division and the brilliant AKS algorithm is the same from this viewpoint.

Even if you had an algorithm that did not have to convert to binary and could tell in time linear to unary represenation whether a number is prime, it would be interesting trivia but nothing worthy a Nobel prize. In practice numbers are given in binary (or some other base>1 number system). To use your algorithm, you would have to convert to unary, which already means trial division would be faster.


Ah, I see, thanks for the explanation.


First of all there is no Nobel in math/CS.

He didn't say it would win a Nobel prize, just that it would be worth a few Nobel prizes. A major breakthrough in algorithmic theory could easily result in someone winning a Fields medal, and I'd say that's easily worth more than a Nobel prize.


M-R is a probabilistic primality test, with a deterministic variant for some ranges and a general-deterministic variant dependent on GRH. Lots of people want a better test, so we have Lucas-Lehmer and APRT-CL and AKS.

Better primality testing would be interesting to the crypto community, but it could rock the math world - like "Primes is in P" introducing AKS as the first deterministic polynomial time primality test.

As you said, though, better factoring would be game-changing for both cryptography AND mathematics.


I've heard that, on modern cpus, with the first 100 or so primes cached, for all but the most performance sensitive applications, MR can be run with sufficient accuracy that there's a greater likelihood that a stray neutron will flip a bit in the cpu at the time of computing MR than that MR will produce an incorrect result. Or something like that. Is it true than MR can be tuned to be as useful as a non-probabilistic test?


This is absolutely true. Every time you run it, a non-prime has probability of at most 1/4 of passing.

Run it 5 times, and you'll catch 99.9% of non-primes.

What about neutron flips? Judging from http://lambda-diode.com/opinion/investigations/transmutation... you can probably expect one every 4-5 years.

If you run MR 20 times, your odds of wrongly declaring a non-prime prime are on the same order of magnitude as the number of microseconds in a 5 year period. So yes, neutron flips and MR errors are on par with each other.

If you're concerned, just run MR a few more times. :-)


There does not exist a "true" regular expression (i.e. a FSM) to check if a number is prime.

Proving this is a common assignment in automata theory classes.


In other words, this is not a regular expression as defined mathematically, i.e. equivalent to a specific finite-state machine?


Not a regular language as mathematically defined.

Unfortunately, quite a few very-well-defined terms have been made much-less-defined by use in practice.

"regular expression" has come to mean "anything that can be specified by a Perl 'RE' expression, which can backtrack and may take exponential time", rather than the original precise definition (that I will not give here) equivalent to "anything that can be accepted by a memoryless finite state automaton"

Similarly, "real time" has come to mean "on-line" or "computation happening more or less at the same time input is generated", as opposed to the original meaning of "guaranteed response to input within a pre-specified time frame" (which has been relegated to being "hard real time", with the amortized version often called "soft real time").

One could think software construction, being somewhat of an engineering discipline, would be more precise with terms - but one would be wrong; software construction as practiced today is something between an art and a craft, and is mostly not related to engineering.


> the original precise definition (that I will not give here)

I'll give it: the regular expression syntax acceptable to (n)awk. (and Perl, etc.)


It has been posted before [2], so I'll repost my comment from back then [3]:

It's cool, but it's regexp, not regular expression -- regular expressions cannot test for primeness. The proof is a simple application of pumping lemma[1]: we want to show that language { a^p: p prime } is not regular. Suppose it is. By pumping lemma we have numbers N, n >= 1, such that for every k >= 0, we have a^(N+nk) in our language. But that means that for every k, N+nk is prime, and it fails for k = N, since N+nN = N(1+n) is not a prime, since both N and 1+n are greater than 1.

[1] - http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_langu... [2] - https://news.ycombinator.com/item?id=3391547 [3] - https://news.ycombinator.com/item?id=3391572


This is indeed beautiful. But the author limits itself to a low-level unpacking of the regex and running a few tests cases, all the while remaining mystified as to why this works.

A high-level description helps makes obvious what's happening:

  To check the primality of a number N with string matching,     
    repeat a string (say, "1") N times
    and declare N prime:
      UNLESS the repetitions match the string 0 or 1 times
      OR the repetitions can be EVENLY matched into groups of MORE than one string.


I wouldn't say he remained mystified. His exploration of the 9 case basically explained how it worked, just the author chose to leave it a bit implied for the reader to figure out. I would agree for a blog post it's usually better to hammer the point home though.


Uses back-references, so it doesn't describe a regular language in the strict sense. Still, it's pretty cool to use them to trick a string-processing engine to do integer factoring.


Also, it's easily provable (using pumping lemma) that the language: { 1^n | n is a prime number } is non-regular, so no "regular expression" (referring to strict mathematical definition here) can express it.


"Regular expression" lost the sense of "strictly only regular languages" around the time that Perl crushed the world.

It's probably time to give up on that terminology nit-pick.


Actually, I think it's still a very important distinction to make. Making this point helps encourage understanding what regular languages, backtracking, et cetera mean, which allows us as engineers use our tools more effectively.


It's not a nit-pick. True regular expression engines can run in guaranteed worst-case linear time. Non-regular regex engines generally have no such guarantee. If you're using a regular expression library to parse inputs on a server, such efficiency guarantees are a simple and effective defense against input-based denial of service attacks.


The post you are replying to is not about word usage, it is about an aspect of computer science.


> It's probably time to give up on that terminology nit-pick.

Many Perl coders are self-taught programmers with no CS background. If a man page calls something a regular expression, that is good enough for them.

Why drop to that level if you know better?

I don't want to appear "gratuitously ignorant" in front of someone who might be a decent computer scientist.

There are already ways in which I will appear ignorant due to actual ignorance; why go out of my way to dumb down the way I speak about something that I do understand?

If we are going to wreck math terminology why don't we just redefine "prime number". Let's see: 3 is prime, 5 is prime, 7 is prime, oh ... so 9 is prime 11, is prime, 13 is prime, 15 is prime ... And, what do you know: now we can use a regular regular regex for primality testing over the binary representation: why, it's just (0|1)*1. As a bonus, we can restore 1 to its proper status as a prime number and kick out the oddball 2.

Say, if Perl called an odd test function "isprime", would you start calling odd numbers prime?


I don't want to appear "gratuitously ignorant" in front of someone who might be a decent computer scientist.

Someone who is a decent computer scientist will be unlikely to care whether you say "regular expression" or "look how smart I am, I know to say that this thing Perl does is called regular expressions by ignorant plebes even though it strictly isn't, let's high-five!"

Because, y'know, someone who's actually a decent computer scientist probably has more important things to care about.


In that case, I also don't want to appear "gratuitously ignorant" in front of such a lesser computer scientist who does care. :)

Maybe one thing like this doesn't matter, but if you make a habit out of sloppy use of technical terms, it will not go unnoticed.

Analogy: one neuron firing might not exceed a threshold, but thirty seven of them might.

> "look how smart I am, I know to say that this thing Perl does is called regular expressions by ignorant plebes even though it strictly isn't, let's high-five!"

That's rather verbose, making the cheesy scare quotes finger gesture an appealing alternative.


I believe the next line in this traditional dialogue is “Language is a fluid thing and words are defined by their use!”

Your turn.


The next line is, which philosophy of meaning do you subscribe to? Ludwig Wittgenstein ("meaning is use")? Or are you an all out Humpty-Dumpty-ist ("when I use a word, it means just what I choose it to mean, not more or less")? Or maybe Sapir-Whorf: the language not only determines its own meaning, but your entire cognitive process.


I can't tell if you're defending this line of thought or making fun of it, but the term "regular language" has a specific mathematical meaning and perverting it to include how Perl uses the term is wrong, plain and simple.


I was, in fact, slightly mocking both the argument and the related argument that people who use sloppy language like to bring out in favor of using words to mean whatever they like. I certainly agree that words with precise meanings are important to use correctly.


It actually effectively lost it a bit before. Perl did not invent regular expressions, in fact Larry Wall started with a package by http://en.wikipedia.org/wiki/Henry_Spencer that itself was a reimplementation of an API that appeared in AT&T's V8 Unix that was released in 1985 and was incorporated into standard Unix tools like egrep.

Perl standardized a good enough regular expression dialect that everyone else copied it. But the divergence between "regular expression" and "strictly only regular languages" was already a lost cause.


POSIX regular expressions (file glob patterns, BRE's and ERE's) have more features than regexes in old CS papers, but they are still regular: they compile to automata with no backtracking.


The ERE described in egrep(1) has backreferences which can match non-regular languages. Are you thinking of some other notion of ERE?


He is thinking of what POSIX specified, which does not have backreferences. See http://www.regular-expressions.info/posix.html for details.


Aha, I didn't know about the difference between POSIX ERE and GNU ERE. Thank you!


I'm currently working on a ruby gem to generate examples of given regular expression. For example:

    /this|is|awesome/.examples # => ["this", "is", "awesome"]
So, I tried it out on this "prime number validator" regex:

    /1?|(11+?)\1+/.examples
      #=> ["", "1", "1111", "111111", "11111111"]
    /1?|(11+?)\1+/.examples(max_repeater_variance: 50, max_group_results: 100).uniq.sort.map(&:length).first(10)
      #=> [0, 1, 4, 6, 8, 9, 10, 12, 14, 15]

    require 'prime'
    Array(1..100) - /1?|(11+?)\1+/.examples(max_repeater_variance: 50, max_group_results: 3000).uniq.sort.map(&:length) == Prime.first(25)
      #=> true
Horribly inefficient, but pretty cool! Here's my gem if case anyone's interested: https://github.com/tom-lord/regexp-examples


Thank you very much for your great gem :)

Very usefull!


Thanks :)

It's still "under development", while I add a few missing features, but as you can see from the README these are only for fairly obscure parts of the regex language! - E.g. POSIX groups, named properties, character set intersection, ...


Nice (abuse of regexp) !

AFAICT the way the string is going to be parsed is equivalent to this:

  def isprime(n):
    if n == 0 or n == 1:
      return False
    i = 2
    while i <= n:
      if (n - i) % i == 0:
        return False
      i += 1
    return True
A parser implementation may optimize and do "while (n - i) >= 1" instead, which cuts in half the number of iterations.


A better implementation will only check to sqrt(n), as there can be no prime factor greater than that.


Yes, there can.


The point is that if there is a prime factor greater than sqrt(n), then your will have already found its co-factor by then.


Correct. I misspoke. I should have said,

A better implementation will only check to sqrt(n), as there can be no prime factor greater than that without there also being a prime factor lower than that.


I explored this technique applied to a few other problems just for fun, integer square root, coprimes, continued fractions, modulo, etc.

https://github.com/fxn/math-with-regexps/blob/master/one-lin...


It uses a string of '1' as a unary numeral representation of a number and then checks it for divisibility with 2,3,4... using brute force.


less sexy when you write

  /^(11+?)\1+$/ is a prime test on whether a unary string  > 2 is composite
then people have a real chance of figuring it out. (go ahead and try, if you haven't read the article.)

As it stands that 1 x shift is super confusing. who would have thought 1 would become a literal! we wouldn't even write it that way in english.


The thing that made this click for me was realizing the "1" in the regex was not special. The regex is just looping through integers (starting with 2) and checking if the target number is divisible by any of them. It can use any digit if the target string uses the same digit. Ruby example using 5 instead of 1:

def is_prime(n); str = "5"*n; str !~ /^5?$|^(55+?)\1+$/; end

The regex essentially says, for integer n, "does 2 ones fit evenly into n ones", then "does 3 ones fit into n ones", etc., which is the same as "does 2 fit into n", "does 3 fit into n", etc.


Thanks for sharing. This is a post I wrote way back in 2007 and, even though, as many have pointed out, the algorithm is very inefficient, it is still a good illustration of how powerful regular expressions are.

When I am explaining the basics of regular expressions to people, I always refer them to this post to make them realize that regular expressions are not only glorified search terms. In general, they are amazed :-)


An aspect of the syntax is confusing me here. I would have assumed that the expression 'x+?' would be parsed as '((x)+)?', ie. 'x*'. However, it seems the parser special-cases this as (x)(+?) instead, and interprets that as a non-greedy operator. Isn't this horribly inconsistent?


nice trick!

here a JS implementation

    function isPrime(n) {
      var re = /^1?$|^(11+?)\1+$/;
      var s = [];
      while (n -- > 0)
        s.push(1);
      return !re.test(s.join(''));
    }


If I may suggest an improvement:

  function isPrime(n) {
    var re = /^1?$|^(11+?)\1+$/,
          s = (new Array(n+1)).join('1');
  
    return !re.test(s);
  }


Why not just function isPrime(n) { return !/^1?$|^(11+?)\1+$/.test(Array(n+1).join('1')) } then? :]



man that's genius!!


We all learned this way back with the Regex Game: http://regex.alf.nu/6


Is there a way to write a (non-regular, of course) regular expression that checks for prime numbers in a non-unary base?


If there were, there would be finitely many repunit primes (via the pumping lemma and the notion that there are arbitrary large non-prime repunits).

Edit: that's not quite correct. The above would imply some simple structure in repunit primes. If you add the known asymptotical behaviour of the 'number of primes < N)' function, I think it would imply it.

Also, in binary, it would either imply there are finitely many Mersenne primes, or that there is a N such that all Mersenne numbers larger than N are prime (edit: Mersenne numbers are repunits in binary)


This one has been going around for a while. I'd be curious to see a base-2 or base-10 variant of it.


Evidently beauty is in the eye of the beholder.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: