

Regularly divisible, in general - robinhouston
http://s3.boskent.com/divisibility-regex/divisibility-regex.html 

======
robinhouston
Earlier today I read a blog post (via
<http://news.ycombinator.com/item?id=1937062>), discussing the problem of
finding a regular expression that matches the binary numbers that are
divisible by 3.

The author of that post solved it more or less by trial-and-error, but – as
some commenters on that thread already pointed out – there’s actually a fairly
straightforward algorithm that works for any divisor in any base.

I thought it would be a fun exercise to implement the algorithm in JavaScript,
and here is the result. It's quite fun to play with.

Even though (obviously) I understand how it works, it's still somehow
surprising to see a regular expression that matches all the multiples of 7 in
base 10, for example. (That example is quite a hairy one!)
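
For anyone wondering why a regular expression can do this at all: the remainder of a number can be tracked one digit at a time, using only the previous remainder, so a machine with `divisor` states is enough. The sketch below walks such a machine over a numeral. The function name `remainderViaDfa` is made up for illustration and is not part of the linked page.

```javascript
// Illustrative sketch only; `remainderViaDfa` is a hypothetical name.
// The "state" is the remainder of the digits read so far.
function remainderViaDfa(numeral, base, modulus) {
  let state = 0; // the empty prefix has value 0
  for (const ch of numeral) {
    const digit = parseInt(ch, base);
    if (Number.isNaN(digit)) {
      throw new Error("'" + ch + "' is not a digit in base " + base);
    }
    // Appending a digit multiplies the value by the base and adds the digit,
    // so the new remainder depends only on the old remainder and the digit.
    state = (state * base + digit) % modulus;
  }
  return state;
}

// A numeral is a multiple of the modulus exactly when the walk ends in state 0.
console.log(remainderViaDfa("343", 10, 7) === 0);
console.log(remainderViaDfa("106", 10, 5) === 0);
```

Because only the current remainder is kept, this works for numerals far too long to fit in a machine integer, which is the same property that makes the language regular.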

------
Groxx
Yowza.

    
    
      ([05]|[16][16]*[05]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([05]|[16][16]*[05])|([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))*([05]|[16][16]*[05]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([05]|[16][16]*[05]))|([49]|[16][16]*[49]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([49]|[16][16]*[49])|([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))*([49]|[16][16]*[49]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([49]|[16][16]*[49])))([49]|[16][16]*[49]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([49]|[16][16]*[49])|([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))*([49]|[16][16]*[49]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([49]|[16][16]*[49])))*([05]|[16][16]*[05]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([05]|[16][16]*[05])|([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))*([05]|[16][16]*[05]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([05]|[16][16]*[05]))))*
    

More easily calculated as:

    
    
      /(0|5)$/
    

(divisible by 5 in base 10)

Interestingly... it doesn't seem to work. Trying it in JS, .test(input)
always returns true, while .exec(input)[0] is non-empty when the input is
divisible. Usually: "106" returns ["10", "10", undef...]. Anyone care to test
in another engine? In irb, `/...theregex.../ =~ input` always returns 0,
though mine works fine...

~~~
robinhouston
That's a good point. Certainly simpler expressions are possible in the case
where the divisor (5 in this case) is a factor of some power of the base: if
the divisor is a factor of (base)^n, then it suffices to check the last n
digits.
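
A rough sketch of that shortcut, in case it helps. The names `suffixLength` and `divisibleBySuffix` are invented for this example; it finds the smallest n for which the divisor is a factor of (base)^n, if there is one, and then looks only at the last n digits.

```javascript
// Illustrative sketch only; both function names are hypothetical.

// Smallest n such that divisor divides base^n, or null if no such n exists.
function suffixLength(base, divisor) {
  let power = 1 % divisor; // base^0 mod divisor
  // If any power is a multiple of the divisor, one of the first few is,
  // so `divisor` iterations are more than enough.
  for (let n = 0; n <= divisor; n++) {
    if (power === 0) return n;
    power = (power * base) % divisor;
  }
  return null;
}

function divisibleBySuffix(numeral, base, divisor) {
  const n = suffixLength(base, divisor);
  if (n === null) {
    throw new Error("no power of the base is a multiple of the divisor");
  }
  if (n === 0) return true; // only happens for divisor 1
  // Everything before the last n digits contributes a multiple of base^n.
  return parseInt(numeral.slice(-n), base) % divisor === 0;
}

console.log(suffixLength(10, 5));  // one digit is enough for 5 in base 10
console.log(suffixLength(10, 7));  // null: the shortcut does not apply to 7
console.log(divisibleBySuffix("1236", 10, 4));
```

For a divisor such as 7 in base 10 no suffix length works, which is where the general construction is needed.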

I didn't want to add a special case for that, because this is intended as a
demonstration of the general algorithm. I do think it's remarkable that such a
simple algorithm works. The actual algorithm is just this code:

    
    
      /**
       * Generate a DFA that reads the string representation of a number in
       * base `base' and computes its numeric value modulo `modulus'.
       * (`digits', `DFA' and `C' are not shown here; they come from the rest
       * of the page's source.)
       */
      function modular_dfa(base, modulus) {
        if (base < 2 || base > digits.length) {
          throw new Error("base (" + base + ") is out of range");
        }
      
        var dfa = DFA.create();
        for (var i = 0; i < modulus; i++) {
          dfa.addState(i);
        }
        for (var i = 0; i < modulus; i++) {
          for (var j = 0; j < base; j++) {
            dfa.addTransition(i, (i * base + j) % modulus, C(digits.charAt(j)));
          }
        }
      
        return dfa;
      }
    
      /**
       * Generate a regular expression that matches only multiples
       * of `divisor', when expressed in base `base'.
       */
      function divisibility_regex(base, divisor) {
        var dfa = modular_dfa(base, divisor);
        for (var i = 1; i < divisor; i++) {
          dfa.eliminateState(i);
        }
        return dfa.transitionRegex(0, 0).toString();
      }
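
The snippet above leans on a `DFA` object, a `digits` string and a `C` helper that are not shown in this thread. For anyone who wants something to paste into a console, here is a self-contained sketch of the same idea: build the remainder transitions, then eliminate states 1 and up, keeping a regex fragment for each remaining edge. It is not the code from the page; every name in it (`DIGITS`, `union`, `group`, `divisibilityRegexSketch`) is made up for this example, and it makes no attempt to simplify the output.

```javascript
// Illustrative sketch only, NOT the implementation behind the linked page.
const DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz";

// Join two alternatives; null stands for "no path".
function union(a, b) {
  if (a === null) return b;
  if (b === null) return a;
  return a + "|" + b;
}

// Wrap in a non-capturing group so concatenation and * bind correctly.
function group(s) {
  return "(?:" + s + ")";
}

function divisibilityRegexSketch(base, divisor) {
  if (base < 2 || base > DIGITS.length) throw new RangeError("base out of range");
  if (divisor < 1) throw new RangeError("divisor must be at least 1");

  // R[i][j] holds regex source for the labels leading from remainder i
  // to remainder j, or null when there is no such edge.
  const R = Array.from({ length: divisor }, () => new Array(divisor).fill(null));
  for (let i = 0; i < divisor; i++) {
    for (let d = 0; d < base; d++) {
      const j = (i * base + d) % divisor;
      R[i][j] = union(R[i][j], DIGITS[d]);
    }
  }

  // Eliminate states divisor-1, ..., 1. Every path i -> k -> j through the
  // eliminated state k is folded into the direct edge i -> j.
  for (let k = divisor - 1; k >= 1; k--) {
    const loop = R[k][k] === null ? "" : group(R[k][k]) + "*";
    for (let i = 0; i < k; i++) {
      if (R[i][k] === null) continue;
      for (let j = 0; j < k; j++) {
        if (R[k][j] === null) continue;
        R[i][j] = union(R[i][j], group(R[i][k]) + loop + group(R[k][j]));
      }
    }
  }

  // Only state 0 is left; any number of round trips from 0 back to 0 is a
  // multiple. The anchors are part of the pattern.
  return new RegExp("^" + group(R[0][0]) + "*$");
}

const multipleOf3InBinary = divisibilityRegexSketch(2, 3);
console.log(multipleOf3InBinary.source);
console.log(multipleOf3InBinary.test((21).toString(2)));
```

For base 2 and divisor 3 the result is, once the redundant groups are stripped, the familiar `^(0|1(01*0)*1)*$`. Like the generated expressions quoted in this thread, it also accepts the empty string.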
    

I'm not sure why you couldn't make it work. Perhaps it wasn't clear that you
need to anchor the expression with ^ at the beginning and $ at the end? I just
tried it in Ruby, and it works just fine:

    
    
      re = /^
        (
          [05]
         |[16][16]*[05]
         |([27]|[16][16]*[27])([27]|[16][16]*[27])*([05]|[16][16]*[05])
         |([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))
          ([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))*
          ([05]|[16][16]*[05]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([05]|[16][16]*[05]))
         |(
           [49]|[16][16]*[49]
          |([27]|[16][16]*[27])([27]|[16][16]*[27])*([49]|[16][16]*[49])
          |([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))
           ([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))*
           ([49]|[16][16]*[49]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([49]|[16][16]*[49]))
          )(
           [49]|[16][16]*[49]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([49]|[16][16]*[49])
          |(
           [38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))
           ([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))*
           ([49]|[16][16]*[49]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([49]|[16][16]*[49]))
          )*(
           [05]|[16][16]*[05]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([05]|[16][16]*[05])
          |([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))
           ([38]|[16][16]*[38]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([38]|[16][16]*[38]))*
           ([05]|[16][16]*[05]|([27]|[16][16]*[27])([27]|[16][16]*[27])*([05]|[16][16]*[05]))
          )
        )*$/x;
    
      0.upto(1000) { |i|
        puts((i.to_s =~ re) ? "#{i}: yes" : "#{i}: no")
      }
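
The effect of the missing anchors can be reproduced in JavaScript with a much smaller pattern. As a stand-in, the sketch below uses `(0|11*0)*`, which is what state elimination gives for divisor 2 in base 2; the variable names are only for illustration.

```javascript
// Illustrative sketch: the same pattern with and without anchors.
const unanchored = /(?:0|11*0)*/;
const anchored = /^(?:0|11*0)*$/;

// Without anchors the outer * can match zero repetitions, so the empty
// string at position 0 always counts as a match and test() is always true.
console.log(unanchored.test("111"));     // true, although 7 is odd
console.log(unanchored.exec("111")[0]);  // "" (the empty match)

// exec() returns the longest prefix that happens to be a multiple,
// which is why "106" produced "10" with the base-10 pattern.
console.log(unanchored.exec("101")[0]);  // "10"

// With ^ and $ the whole input has to be consumed.
console.log(anchored.test("101"));       // false (5 is odd)
console.log(anchored.test("110"));       // true  (6 is even)
```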

~~~
Groxx
Yup, I didn't anchor it. Can I suggest adding that to the generated regex, as
it really is part of it? And I fully approve of non-special-casing it, it's
just amazing how big it is for such a "simple" case :)

It's too bad this didn't get up higher, it's quite interesting. Good job, in
any case!

~~~
robinhouston
Thanks.

In fact I’ve already added the anchors, since I guessed this might be the
problem you were having.

I'm also wondering whether it would be possible to generate shorter regular
expressions by using a different algorithm for converting the DFA into a
regex. I'm using state elimination, which was straightforward to implement,
but perhaps the results would be shorter using Brzozowski's algorithm? I'll
have a play with that, when I get a spare hour.

I'm a little disappointed more people didn't find this interesting, because I
would have enjoyed discussing it, but it was a fun hack in any case!

~~~
Groxx
Submission time seems to play a pretty big role. Looks like this went up
around 7pm (here, US Central) - for future reference, try ~4pm for your
target, when people get off work ;)

I'm almost certain I saw an article here roughly a couple of months ago that
had an example of how to "shrink" regexes to a shorter functional equivalent.
Can't find it now, though :\

~~~
robinhouston
Interesting. If you can dig that out, I'd love to see it.

The problem of reducing a regular expression to a minimal equivalent is
PSPACE-complete, so there's not much hope for a practical exact algorithm, but
I don't know what the state of the art is on approximation algorithms for
minimisation.

