
Regular Expression That Checks If A Number Is Prime - iluxonchik
https://iluxonchik.github.io/regular-expression-check-if-number-is-prime/
======
TimWolla
It should be noted that this is not possible with regular expressions in the
traditional sense (i.e. regular expressions only matching regular languages):
[http://math.stackexchange.com/a/181233/21405](http://math.stackexchange.com/a/181233/21405)

Because of back references, PCRE regular expressions can match non-regular
languages as well.
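
To make the point about back references concrete, here is a minimal sketch in Python (its `re` module stands in for PCRE here, and the helper name `is_doubled` is made up for illustration). The pattern accepts exactly the strings of the form "ww", a textbook example of a language that is not regular:

```python
import re

# (.+) captures some non-empty prefix w; \1 then demands the same text again.
DOUBLED = re.compile(r"(.+)\1")

def is_doubled(s: str) -> bool:
    """Return True if s is some non-empty string written twice in a row."""
    return DOUBLED.fullmatch(s) is not None

print(is_doubled("abcabc"))  # True
print(is_doubled("abcabd"))  # False
```

Without the `\1`, no finite automaton can remember an arbitrarily long prefix, which is why this needs more than a regular expression in the formal sense.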

~~~
jnordwick
One of my pet peeves is how PCRE destroyed the definition of "regular" in
regular expressions. It has basically made a huge number of programmers
illiterate as to what truly is regular in the formal sense.

~~~
conistonwater
But why should people care about what is regular _in the formal sense_?
Rather, _regular_ in this context would mean it can be recognized with a
restricted type of algorithm, which resembles the formalism.

~~~
Chinjut
Is there a standard for which additional features one can add on top of
regular expressions in the original limited sense and still be considered a
regex?

~~~
db48x
"regular language" is a mathematical term. You can add any features you want
to your matcher, but some features allow you to match languages that are not
"regular".

However, most programmers are not mathematicians or computer scientists, and
so are effectively using lay terminology which is similar to but not precisely
the same as the mathematical terminology. Most programmers only care that
their regex library allows them to concisely match strings, not that it allows
them to match regular languages.

Thus there are two standards you could follow: the precise mathematical
definition of "regular", or the lay definition.

~~~
greydius
I don't buy the layman argument. Programmers might not be mathematicians or
computer scientists, but they should be professionals and held to the same
standards as professionals in other fields. For example, a mechanical engineer
doesn't use a "layman's" definition of tensile strength when choosing materials
for a design.

~~~
eponeponepon
No, but he probably does use layman's terms when describing molecular lattices
and electron shells. If he ever does.

The point here is that users of (colloquial) regular expressions, however
professional they are or aren't, are just engaging with the surface of a
deeper set of mathematical principles that require greater nuances of meaning
to discuss than their applications do.

~~~
OskarS
I basically agree, but I would say that a slightly deeper understanding of
regular expressions is a really useful thing to know for basically all
professional programmers, if for no other reason than that it lets them
recognize the "pathological" cases where backtracking implementations (i.e.
almost all modern regex implementations) would need until the heat death of
the universe to evaluate.
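
A small sketch of such a pathological case, assuming Python's backtracking `re` engine (the pattern and the helper name `seconds_to_reject` are illustrative choices, not something from the article). On an engine like this, the time to reject the input should grow roughly exponentially with the number of "a"s:

```python
import re
import time

# Nested quantifiers: a run of "a"s can be split between the inner and outer
# "+" in exponentially many ways, and a backtracking engine tries them all
# before concluding that the trailing "b" rules out a match.
PATHOLOGICAL = re.compile(r"(a+)+$")

def seconds_to_reject(n: int) -> float:
    """Time how long the engine needs to fail on n "a"s followed by a "b"."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATHOLOGICAL.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # this input can never match
    return elapsed

for n in (8, 12, 16, 20):
    print(n, f"{seconds_to_reject(n):.4f}s")
```

The sizes are kept small on purpose; each extra character makes the rejection noticeably slower, so much larger values of `n` quickly become impractical.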

------
delsarto
tl;dr if you know a little regex

    
    
       - convert your number to a string of that length (1 = 1, 2 = 11, 3 = 111)
       - handle 0 / 1 separately
       - ^(..+?)\1+$
       - trick: the WHOLE thing has to match to return (^$)
       - first \1 match is "11", ergo string must be "1111", "111111", "11111111" to match
       - second \1 match is "111", ergo string must be "111111", "111111111" to match
       - and so on.  if a match is found before the group reaches the length of the string, it was not prime.
    

Clever trick. Look forward to being asked it in your next google interview :)
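
The steps above can be sketched in Python as follows. This is a minimal sketch rather than the article's code: it uses the `^1?$|^(11+?)\1+$` variant quoted elsewhere in this thread, and the function name `is_prime` is an illustrative choice.

```python
import re

# ^1?$ handles 0 and 1; ^(11+?)\1+$ matches any run of ones whose length is
# k copies (k >= 2) of a group of m ones (m >= 2), i.e. a composite length.
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

def is_prime(n: int) -> bool:
    """Regex primality check on the unary representation of n (n >= 0)."""
    return NOT_PRIME.match("1" * n) is None

print([n for n in range(30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```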

~~~
biot
In other words, it's the Sieve of Eratosthenes implemented via regex:
[https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes)

~~~
hexane360
Well, not really. It's closer to trial division, though even that doesn't go
through every single multiple from 1-n.

The difference is the sieve works on a large batch of numbers, and only tests
divisibility with numbers not known to be composite, and less than sqrt(n).

[https://en.wikipedia.org/wiki/Trial_division](https://en.wikipedia.org/wiki/Trial_division)
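
To illustrate the distinction, here is a rough Python sketch of both approaches (function names are illustrative). Trial division answers the question for one number at a time; the sieve crosses out multiples for a whole range at once and only sieves with numbers that have not themselves been crossed out:

```python
from math import isqrt

def is_prime_trial(n: int) -> bool:
    """Trial division: test one number against divisors up to sqrt(n)."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, isqrt(n) + 1))

def sieve(limit: int) -> list[int]:
    """Sieve of Eratosthenes: find all primes <= limit in one batch."""
    if limit < 2:
        return []
    is_candidate = [True] * (limit + 1)
    is_candidate[0] = is_candidate[1] = False
    for p in range(2, isqrt(limit) + 1):
        if is_candidate[p]:  # skip numbers already known to be composite
            for multiple in range(p * p, limit + 1, p):
                is_candidate[multiple] = False
    return [n for n, ok in enumerate(is_candidate) if ok]

print(sieve(30) == [n for n in range(31) if is_prime_trial(n)])  # True
```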

------
jsrn
Invented in 1998 by Perl hacker Abigail:

[http://neilk.net/blog/2000/06/01/abigails-regex-to-test-for-...](http://neilk.net/blog/2000/06/01/abigails-regex-to-test-for-prime-numbers/)

(check out Abigail's other JAPHs if you like stuff like this)

~~~
iluxonchik
Yep, you are correct :)

------
fxn
This regexp is totally brilliant. It solves this problem in an unexpected way,
because you think numbers, but the solution thinks strings. Not only that, the
paradox, or its beauty, is that it goes to the root of the number abstraction:
counting sticks.

I explored this technique some years later to solve a bunch of other problems,
coprimality, prime factorization, Euler's phi, continued fractions, etc. (see
[https://github.com/fxn/math-with-regexps/blob/master/one-lin...](https://github.com/fxn/math-with-regexps/blob/master/one-liners.sh)).
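
As a flavour of how the same unary trick can extend to coprimality, here is a hedged Python sketch. It is not taken from the linked repository; the pattern and the names `gcd_regex` and `coprime` are assumptions made for illustration. Two numbers are written in unary, separated by a comma, and a greedy group looks for the longest run of ones that divides both:

```python
import re

# Greedy (1+) tries the longest run first; \1* forces it to divide the left
# number exactly, and \1+ forces it to divide the right number exactly.
COMMON_DIVISOR = re.compile(r"^(1+)\1*,\1+$")

def gcd_regex(a: int, b: int) -> int:
    """Greatest common divisor of two positive integers, via unary strings."""
    match = COMMON_DIVISOR.match("1" * a + "," + "1" * b)
    if match is None:
        raise ValueError("both arguments must be positive")
    return len(match.group(1))

def coprime(a: int, b: int) -> bool:
    """Two numbers are coprime when their only common divisor is 1."""
    return gcd_regex(a, b) == 1

print(gcd_regex(12, 18))  # 6
print(coprime(7, 5))      # True
```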

------
jemfinch
I wish they'd stop calling these expressions "regular".

~~~
iluxonchik
Okay, technically it's regex, but I didn't want to get into this distinction,
since "regex" and "regular expression" are used pretty much interchangeably
(unless you're in academia :) ).

------
rrauenza
> How would we go about that? Well, all we have to do is add ? in front of the
> +. This will lead us to the <.+?> regex.

I was very confused until I realized the author's definition of 'in front'
wasn't the same as mine...

~~~
iluxonchik
What do you mean? Could you please clarify that? What did you understand by
"in front"?

~~~
Chinjut
Presumably, they thought of A as in front of B in the expression "AB" rather
than in the expression "BA".

(Which is also how I would normally think of "in front", for what it's worth!)

~~~
iluxonchik
Oh, that was probably it. That would be true if we were talking about FIFO
queues :)

~~~
thaumasiotes
I'm another one who thinks the letter "in front" is the one that comes first.
May I ask how you were thinking about the "front" and "back" of text?

~~~
iluxonchik
Well, we read from left to right, naturally that is the order of the letters,
so the one on the right is "in front" of the left one.

~~~
thaumasiotes
I guess if you imagine a person walking from the beginning of the text to the
end, _their_ front (the side with the face on it) would face the end of the
text.

But English never conceives of text in this manner; we view text as being
arranged in a _chronological_ order, where text that occurs chronologically
earlier comes "before" text that occurs chronologically later. This mirrors
the application of "before" and "after" to time in the rest of the language.
Whether you conceive of reading as the reader traveling through text from the
beginning to the end, or as text arriving at the reader, the reader will
always encounter text on the left before text on the right, and therefore the
text on the left is in front of the text on the right.

(In the only other language I'm qualified to talk about this for, Mandarin
Chinese, earlier and later time might be indicated by either of two spatial
metaphors: "up" for the past and "down" for the future ["up" is also used as a
metaphor for beginning things]; or "front" for earlier and "back" for later.
When text is read from left to right, "front" is used to indicate text on the
left.)

Here ([http://www.friesian.com/egypt.htm](http://www.friesian.com/egypt.htm))
is someone writing about the ancient Egyptian writing system, inadvertently
assuming that the front of text is its beginning and the back of text is its
end:

> Note that Egyptian glyphs have a front and a back. All the images above and
> below face to the left, [...] which indicates that the text is to be read
> from left to right. This is conformable with the usage of English and other
> European languages. However, although this would be familiar and agreeable
> to the Egyptians, Egyptian usage was ordinarily to write from right to left,
> as today is done in Hebrew and Arabic. They indicated this direction by
> having all the glyphs face to the right instead of to the left

(Egyptian glyphs often depict a person or an animal with an actual face. They
face towards the beginning of the text, not the end.)

You seem to speak English at a fully native level, based on your writeup here.
(Although you don't seem to have picked up on the idea that if a quantifier
"precedes" a '?', it must be "in front" of that '?'.) Do you have another
native language? Are you based in a country that primarily speaks some other
language? What is the metaphor that determines that later words are "in front"
of earlier words?

------
snicky
Is it just me, or does the author of this blog post have a very annoying way of
writing? He keeps explaining things and then a couple of lines below he writes
something like "remember the discussion above?" or "if you remember correctly
...". Hell, sure I remember, because I just read about this 10 seconds ago. Do
people really have such a short attention span and can't remember what was in
a previous paragraph? Repeating stuff over and over seems almost like content
farming to me.

~~~
nothrabannosir
It's SEO. The entire article is optimised for search terms relating to regular
expression, prime, etc.

On a tangent, I really wish Google would release a new algorithm that punished
this stuff. It's killing articles about common search terms. :/

------
mwpmaybe
Very interesting. Thanks for writing this up.

This is a bit more Perlish (not that I'm an authority):

    
    
        # True unless n ones are 0/1 chars or 2+ copies of a 2+ char group.
        sub is_prime {
          (1x$_[0]) !~ /^(?:.?|(.{2,}?)\1+)$/;
        }
    

But perhaps less clear to a novice reader. You can also leave off the
semicolon.

~~~
bigiain
    perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' [number]

(My first exposure to that was in '97 or so as a .sig from abigail in
comp.lang.perl.misc)

~~~
peteretep
Some asshole gave me this in an interview for a Perl dev job, and asked me
what it did.

~~~
bigiain
Wasn't at an interview, but I remember quizzing the "new guy" Perl dev with
"So explain what this one does":

    
    
      @sorted = map  { $_->[0] }
               sort { $a->[1] cmp $b->[1] }
               map  { [$_, foo($_)] }
                    @unsorted;
    

(Twenty years or so back, I could occasionally be "that asshole"... I'm better
now, honest...)
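
For readers who do not speak Perl: the snippet above is the idiom commonly known as the Schwartzian transform (decorate, sort, undecorate). A rough Python rendering follows, where `key_func` plays the role of the thread's placeholder `foo` and the function name is made up for illustration:

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

def schwartzian_sort(items: Iterable[T], key_func: Callable[[T], str]) -> list[T]:
    """Sort items by key_func(item), computing each key exactly once."""
    # Inner map: pair every item with its (possibly expensive) key.
    decorated = [(item, key_func(item)) for item in items]
    # Sort: compare only the cached keys (string comparison, like Perl's cmp).
    decorated.sort(key=lambda pair: pair[1])
    # Outer map: throw the keys away again.
    return [item for item, _ in decorated]

print(schwartzian_sort(["banana", "Apple", "cherry"], str.lower))
# ['Apple', 'banana', 'cherry']
```

In everyday Python, `sorted(items, key=key_func)` already caches keys this way; the explicit version is only there to mirror the three Perl stages.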

------
0xmohit
I suppose that this has been around for a while:

[https://news.ycombinator.com/item?id=9039537](https://news.ycombinator.com/item?id=9039537)

[http://montreal.pm.org/tech/neil_kandalgaonkar.shtml](http://montreal.pm.org/tech/neil_kandalgaonkar.shtml)

------
baristaGeek
Performing a Miller-Rabin primality test or filling a Sieve of Eratosthenes
will almost always be the way to go. However, this line of code seems like
magic and that's kind of impressive.

------
jimjimjim
how many problems do we have now?

~~~
JBiserkov
Approximately n / ln n according to
[https://en.wikipedia.org/wiki/Prime-counting_function](https://en.wikipedia.org/wiki/Prime-counting_function)
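
For a sense of how close that approximation is, here is a quick sketch that compares an exact count of primes against n / ln n (the helper name `prime_count` is illustrative, and the counting is done by plain trial division, so it is only meant for small n):

```python
from math import isqrt, log

def prime_count(n: int) -> int:
    """Count the primes <= n by trial division (fine for small n only)."""
    def is_prime(k: int) -> bool:
        return k >= 2 and all(k % d != 0 for d in range(2, isqrt(k) + 1))
    return sum(1 for k in range(2, n + 1) if is_prime(k))

for n in (100, 1000, 10000):
    print(n, prime_count(n), round(n / log(n), 1))
```

The exact counts for these three values are 25, 168 and 1229; the estimate runs somewhat below them but tracks the growth.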

------
daenney
Why does the example given by the author with L=15 state that the regex
matches once it reaches 5, instead of 3?

> So first, we’ll be testing the divisibility by 2, then by 3, then by 4 and
> then by 5, after which we would have a match.

~~~
jk563
It tests from the highest number first I believe.

> As a heads-up, I just want to say that I’m lying a little in the
> explanation in the paragraph about the ^(..+?)\1+$ regex. The lie has to do
> with the order in which the regex engine checks for multiples, it actually
> starts with the highest number and goes to the lowest, and not how I explain
> it here. But feel free to ignore that distinction here, since the regular
> expression still matches the same thing

~~~
daenney
Right, that makes sense. But in the explanation they go from 2, to 3 to 4 to
5, following along with the lie, yet picking the real solution. It's a bit
confusing.
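
One way to check which divisor the engine actually lands on is to look at the length of the captured group. The sketch below assumes Python's `re` engine, which may or may not behave like the engine used in the article, and the helper name `divisor_found` is made up for illustration:

```python
import re
from typing import Optional

LAZY = re.compile(r"^(11+?)\1+$")   # group starts at 2 ones and grows
GREEDY = re.compile(r"^(11+)\1+$")  # group starts at full length and shrinks

def divisor_found(n: int, pattern: re.Pattern) -> Optional[int]:
    """Length of the captured group when matching n ones, or None if no match."""
    match = pattern.match("1" * n)
    return len(match.group(1)) if match else None

print(divisor_found(15, LAZY))    # 3
print(divisor_found(15, GREEDY))  # 5
```

In this engine the lazy `+?` form stops at 3, while the greedy `+` form stops at 5, so a walk-through that counts up from 2 but ends at 5 appears to mix the two behaviours.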

------
ldom22
have regular expressions gone too far?

------
ipozgaj
It's basically the Sieve of Eratosthenes[1] algorithm, and it's possible to do
it with regular expressions because numbers here are represented in the
unary[2] number system, where the number of characters/tokens equals the
number itself. It's a common trick for testing various Turing-machine stuff.

[1]
[https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes)
[2]
[https://en.wikipedia.org/wiki/Unary_numeral_system](https://en.wikipedia.org/wiki/Unary_numeral_system)

~~~
conistonwater
No, it's not; this is the trial division algorithm for primality testing.

------
YeGoblynQueenne
Can't quite get this to work in vim [1]:

    
    
      \(^1\{-,1}$\)\|\(^\(11\+\)\1\+$\)
    

The two parts match what you'd expect on their own but the OR-ing screws it
up: it means the whole regex matches _everything_.

Is this something vim gets right and every other engine wrong, the other way
around, or...?

_____________

[1] I'm matching 1's rather than dots to avoid highlighting every bit of text
ever anywhere all the time.

Also, that way it's so much more pleasing to the eye and easy to read, don't
you think?
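
A possible explanation, offered as a guess rather than a confirmed diagnosis: wrapping each alternative in its own capturing group renumbers the groups, so the repeated run of ones becomes group 3 and `\1` now points at the first alternative. If Vim treats a back reference to a group that did not take part in the match as empty (an assumption not verified here), `\1\+` would then match nothing and the second branch would accept any run of two or more ones. The renumbering itself can be seen in Python, used here only as a stand-in:

```python
import re

# Same shape as the Vim pattern: each alternative sits in its own group,
# so the repeated-ones group is number 3, not number 1.
WRAPPED = re.compile(r"(^1?$)|(^(11+?)\3+$)")

# Alternative: non-capturing wrappers keep the inner group as number 1.
NON_CAPTURING = re.compile(r"(?:^1?$)|(?:^(11+?)\1+$)")

print(WRAPPED.groups)        # 3
print(NON_CAPTURING.groups)  # 1

# Both spellings flag the same lengths as "not prime".
flagged = [n for n in range(12) if WRAPPED.match("1" * n)]
print(flagged)  # [0, 1, 4, 6, 8, 9, 10]
```

In Vim the analogous changes would presumably be to write `\3` instead of `\1`, or to use `\%(` ... `\)` for the outer wrappers.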

~~~
thaumasiotes
> Also, that way it's so much more pleasing to the eye and easy to read, don't
> you think?

No, it's harder to read. For ease of reading, you need to match something that
isn't already part of the expression, like 2s or Ks.

~~~
YeGoblynQueenne
>> No, it's harder to read.

Don't worry, I had my humour removed at birth also. It grows back, eventually.

------
Senji
Our hubris will be our undoing.

------
kgdinesh
how does it fare on time complexity compared to normal tests like say AKS?

~~~
YomiK
It's very clever, but obnoxiously slow. It's useful for code golf and as a
pretty impressive party trick. But just as your banker will not be impressed
with your college funding plan of pulling a quarter out of his ear, this is
not going to make it into any real use.

Imagine naive absolute-beginner-programmer trial division. This is worse. Now
add the overhead of counting via regex backtracking and integer comparison via
matching strings. A fair number of regex engines will also start using
enormous amounts of memory.

AKS is of theoretical interest, but not really a "normal test." It's very slow
in practice, being beaten by even decent trial division for 64-bit inputs (it's
eventually faster, as expected, but it takes a while). But it is quickly
faster than this exponential-time method. The regex is in another universe of
time taken when compared to the methods typically used for general form inputs
(e.g. pretests + Miller-Rabin or BPSW, with APR-CL or ECPP for proofs).

As others have noted, "has been popularized by Perl" is because it was created
by Abigail, who is a well-known Perl programmer (though almost certainly a
polyglot). It's also been brought up many times, though it's a nice new blog
article. I hope the OP found something better when "researching the most
efficient way to check if a number is prime." In general the answer is a
hybrid of methods depending on the input size, form, input expectation, and
language. The optimal method for a 16-bit input is different than for a
30k-digit input, for example.

~~~
kgdinesh
Why is AKS only of theoretical interest? Isn't it proven to be a deterministic
test of primality?

Also, what is the fastest way to test for primality that's practically
feasible?

~~~
YomiK
It is a proven deterministic test of primality. We already had those before
AKS, and they are significantly faster than AKS (even the various
improvements). But they don't check all the boxes that are useful for stating
things in computational complexity proofs without a paragraph of weasel words.
So from the theory side of things, it's great, since there we don't
particularly care about the actual execution time, only about the asymptotic
behavior and whether it is solidly shown to be in P.

Lots of math packages have deterministic primality tests, but none use AKS as
a primary method, because AKS offers no benefits over other methods and is
many orders of magnitude slower.

For inputs of special form, there are various fast tests. E.g. Mersenne
numbers, Proth numbers, etc.

The fastest method depends on the input size and how much work you want to put
in. For tiny inputs, say less than 1M, trial division is typically fastest.
For 32-bit inputs, a hashed single Miller-Rabin test is fastest. For 64-bit
inputs, BPSW is fastest (debatable vs. hashed 2/3 test). The BLS methods from
1975 are significantly faster than AKS up to at least 80 digits, but ECPP and
APR-CL easily beat those. ECPP is the method of choice for 10k+ digit numbers,
with current records a little larger than 30k digits.
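
For comparison with the regex, here is a minimal sketch of a Miller-Rabin test in Python. Note that this is plain multi-base Miller-Rabin, not the hashed single-base variant or BPSW mentioned above; the fixed base set (the first twelve primes) is a commonly cited choice that is known to give correct answers for all inputs below 2**64, and the function name is illustrative:

```python
# First twelve primes; with these bases the test is deterministic for n < 2**64.
BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

def is_prime_mr(n: int) -> bool:
    """Miller-Rabin with fixed bases; intended for 0 <= n < 2**64."""
    if n < 2:
        return False
    for p in BASES:  # quick trial division by the small primes
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in BASES:
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

print(is_prime_mr(2**61 - 1))  # True
```

Unlike the unary regex, the work here grows with the number of digits of n rather than with n itself, which is why 61-bit inputs are answered instantly.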

------
caub
it's highly inefficient tho

