
Sat solver on top of regex matcher - justinucd
https://yurichev.com/news/20200621_regex_SAT/
======
Thorrez
> Another practical usage I've heard: match "string" or 'string', but not
> "string'.

You don't need backreferences for that:

    
    
        '[^']*'|"[^"]*"
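
A quick way to check this in Python (the pattern is the one above; the test strings are made up):

```python
import re

# The alternation above: each branch requires a matching pair of quotes,
# so a mixed pair like "string' is rejected.
pattern = re.compile("'[^']*'|\"[^\"]*\"")

for s in ["'abc'", '"abc"', '"it\'s"', '"abc\'']:
    print(repr(s), bool(pattern.fullmatch(s)))
```

The first three match; only the mixed-quote `"abc'` is rejected.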

~~~
throw681158
Won't work if you're already inside a string, or if there are escaped quotes in
the string. It also won't work if you have two or more double-quoted strings
that both contain an apostrophe.

~~~
Thorrez
Backreferences don't really help with those problems.

> Won't work if you're already in a string

This doesn't make sense. How can you search for a string if you're already in
a string? I can't think of a realistic situation where that would be useful or
even really possible.

> or if there are escaped quotes in the string.

Solvable:

    
    
        '(\\'|\\\\|[^'\\])*'|"(\\"|\\\\|[^"\\])*"
    

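As a sanity check, here's the same idea in Python, written with an explicit escape-sequence branch (`\\.` for "backslash plus any character"), which prevents the loop from ever consuming a bare closing quote:

```python
import re

# Escape-aware variant: a string body is any run of escape sequences
# (backslash + any char) or ordinary chars that are neither quote nor backslash.
pattern = re.compile(r"'(?:\\.|[^'\\])*'|\"(?:\\.|[^\"\\])*\"")

print(bool(pattern.fullmatch(r"'it\'s'")))  # True: escaped quote inside
print(bool(pattern.fullmatch(r"'a\\'")))    # True: escaped backslash at the end
print(bool(pattern.fullmatch("'a'b'")))     # False: two strings aren't swallowed as one
```
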
> Also won't work if you have two or more double quoted strings that both
> contain an apostrophe.

The regex in my previous comment already solves that. See:
[https://repl.it/repls/SolidCapitalProgram](https://repl.it/repls/SolidCapitalProgram)

~~~
b2gills
Raku allows strings inside of strings. Of course it does this by way of
embedded closures.

    
    
        "abc{ "def" }"
    

This allows the nesting to be arbitrarily deep.

    
    
        "a{ "b{ "c{ "d{ "e{ "f" }g" }h" }i" }j" }k"
        → "abcdefghijk"
    

This can be handy to generate the correct string.

    
    
        my $count = 3;
        "I went to $count place{ "s" if $count ≠ 1 } today"

~~~
Thorrez
Interesting, thanks for pointing out a use case. But I don't think
backreferences will help with that; it needs to be parsed by something more
powerful than a regex.

And that example reminds me that Bash can do something similar:

    
    
        echo "$(echo "$(echo "$(echo "hi")")")"

~~~
b2gills
The Rakudo implementation actually uses Raku regexes to parse Raku. To be
fair, though, it is a lot easier to do that with the redesigned regexes that
Raku has.

Basically you can use backreferences for that if you also allow the regex to
be recursive.

    
    
        my $regex = /
          :ratchet
          $<q> = (<["']>) # the beginning quote
    
          {}:my $q = ~$<q>; # put it into a more normal lexical var
    
            # capture between " and {
            $<l> = ( [ <!before $q> <-[{}]> ]* )
    
            [
              [
                :sigspace
                ｢{｣
                    <self=&?BLOCK>? # recurse
                ｢}｣
              ]
    
              {$q = ~$<q>}
    
              # capture between } and "
              $<r> = ( [ <!before $q> <-[{}]> ]* )
            ]?
    
          "$q" # match the end quote
    
          # pass the combined string parts upwards
          { make ($<l> // '') ~ ($<self>.ast // '') ~ ($<r> // '') }
        /;
    
        ｢'a{ "b{ "c{ "d{ 'e{ "f" }g' }h" }i" }j" }k'｣ ~~ /^ <r=$regex> $ { make $<r>.ast }/;
    
        say $/.ast;
        # abcdefghijk
    

Note that `Regex` is a subtype of `Block`. That is why `&?BLOCK` can be used
as a reference to the regex itself.

`<foo=bar>` is a way to call `bar`, but also save it under the name of `foo`.
`$<foo> = …` is a way to capture `…` and save it under the name of `foo`.

---

It is a lot nicer and more modular when you use regexes as part of a grammar:

    
    
        # use Grammar::Tracer;
        grammar String::Grammar {
          token TOP { <strings> }
    
          rule strings {
            # at least one string
            # if there are more than one they are separated by ~
            <string> + % ｢~｣
          }
    
          token string {
            $<q> = <["']>
    
            # set a dynamic variable to the quote character
            {}:my $*quote = ~$<q>;
    
            <string-part>*
    
            "$<q>"
          }
    
          # multiple tokens that act like one
          # which is nicer than using |
          proto token string-part {*}
          multi token string-part:<non> {
            [ <-[{}]> <!after $*quote> ]+
          }
          multi token string-part:<block> {
            <block>
          }
    
          rule block {
            ｢{｣ ~ ｢}｣ <strings>?
          }
        }
    
        class String::Actions {
          method TOP     ($/) { make     $<strings>.ast }
          method strings ($/) { make [~] @<string>».ast }
          method string  ($/) { make [~] @<string-part>».ast }
          method block   ($/) { make     $<strings>.ast }
    
          method string-part:<non>   ($/) { make ~$/ }
          method string-part:<block> ($/) { make $<block>.ast }
        }
    
        say String::Grammar.parse(
            ｢"a{ "b{ "c{ "d{ "e{ "f" }g" ~ "zz" }h" }i" }j" }k"｣,
            :actions( String::Actions ),
        ).ast;
        # abcdefgzzhijk
    

A `token` is just a `regex` with `:ratchet` mode turned on (which prevents
backtracking). A `rule` is just a `token` with `:sigspace` also turned on
(which makes it easier to deal with optional whitespace).

Every instance of `<foo>` is basically a method call.

`make` is about generating an `.ast` to pass up and out of the parse. In this
case the only thing the actions class does is return what would be the
resulting string if it were compiled in Raku.
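
For comparison outside Raku, the same nesting can be handled with a dozen lines of recursive descent. This Python sketch (the function name and the exact brace/space handling are my own, not taken from the Raku grammar above) returns the flattened string:

```python
def parse_string(s, i=0):
    # Parse a quote-delimited string starting at s[i]; blocks in braces
    # recurse into a nested string. Returns (flattened_text, next_index).
    quote = s[i]
    i += 1
    parts = []
    while s[i] != quote:
        if s[i] == "{":
            i += 1                       # skip '{'
            while s[i] == " ": i += 1    # skip spaces
            inner, i = parse_string(s, i)
            parts.append(inner)
            while s[i] == " ": i += 1    # skip spaces
            assert s[i] == "}"
            i += 1                       # skip '}'
        else:
            parts.append(s[i])
            i += 1
    return "".join(parts), i + 1

print(parse_string('"a{ "b{ "c" }d" }e"')[0])  # abcde
```

Because it tracks the opening quote character, mixed single and double quotes nest just as in the Raku examples.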

------
PaulHoule
That is quite literally a formal proof that "regex+backreferences" is NP-
complete, since SAT is the index NP-complete problem.
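
For the curious, the heart of the reduction fits in a few lines. This Python sketch is my reconstruction of the classic `(x?)`-group encoding (the exact layout in the linked article may differ): each optional group is one truth assignment, and each clause becomes a `,x` segment that only a true literal can match, so the regex engine's backtracking does the search.

```python
import re

def sat_via_regex(n_vars, clauses):
    # clauses: list of clauses; a literal is +i (var i true) or -i (var i
    # false), variables numbered from 1.
    # Each (x?) group optionally consumes an 'x'; x* soaks up the rest,
    # so backtracking enumerates all truth assignments.
    pattern = "^" + "(x?)" * n_vars + "x*"
    subject = "x" * n_vars
    for clause in clauses:
        # A clause segment ',x' must be matched by some true literal:
        # \i matches 'x' iff var i is true; \ix matches 'x' iff var i is false.
        alts = "|".join(rf"\{abs(l)}" if l > 0 else rf"\{abs(l)}x" for l in clause)
        pattern += f",(?:{alts})"
        subject += ",x"
    m = re.match(pattern + "$", subject)
    if m is None:
        return None
    return [bool(m.group(i)) for i in range(1, n_vars + 1)]

print(sat_via_regex(2, [[1, 2], [-1, 2]]))  # [True, True]
print(sat_via_regex(1, [[1], [-1]]))        # None (unsatisfiable)
```
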

~~~
simonebrunozzi
This line is super smart, and yet, even though I should know a lot about regex
and NP-completeness, my head feels dizzy as I try to make full sense of it. A
sign I'm getting old or dumb, perhaps :(

Joking aside: I'd love for you to elaborate a bit more on this. I'm pretty
sure I would benefit a lot from a more expanded, "dumber" explanation.

~~~
ColinWright
>> That is quite literally a formal proof that "regex+backreferences" is NP-
complete, since SAT is the index NP-complete problem.

> _I'd love for you to elaborate a bit more on this._

I'm not the original poster, but I'll have a go.

SAT is NP-Hard. In other words, literally any NP problem can be
efficiently[0] converted to SAT, and any solution to the SAT instance can then
be efficiently[0] converted back into a solution to the original problem.

Example: think of the problem of factoring integers. Someone gives you an
integer to factor; with a little work you can create a SAT instance, solve
that, and then read off the factorisation of the original integer. SAT is, in
some real sense, at least as hard as INT.

So there is a proof that SAT is at least as hard as _every_ NP problem. That's
what we call "NP-Hard".

Now someone has shown that they can solve SAT problems by using
regex+backtrack. That means that every NP problem can be converted to SAT,
then converted to regex+backtrack, solved, and the solution to the original
read out from the result.

Thus regex+backtrack is at least as hard as every NP problem.

Now SAT itself is also in NP. The combination of being in NP _and_ being
NP-Hard is called "NP-Complete", or NPC. So SAT is an example of a problem
that's NPC.

What has _not_ been shown (I think) is that regex+backtrack is in NP. Showing
that a solution to regex+backtrack implies a solution to SAT shows that
regex+backtrack is NP-Hard.

If the linked article also showed that regex+backtrack is in NP, then it would
be NPC. And we can see that regex+backtrack _is_ in NP, because verifying an
alleged match is a polynomial-time operation.

So regex+backtrack is NPC.
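
(The "verifying is cheap" step is just evaluating the CNF formula under the claimed assignment, which is linear in the formula size; a throwaway Python sketch:)

```python
def check_cnf(assignment, clauses):
    # assignment: dict var -> bool; literal +i is satisfied iff assignment[i],
    # literal -i is satisfied iff not assignment[i].
    # The formula holds iff every clause has at least one satisfied literal.
    return all(
        any(assignment[abs(l)] == (l > 0) for l in clause)
        for clause in clauses
    )

print(check_cnf({1: True, 2: True}, [[1, 2], [-1, 2]]))  # True
print(check_cnf({1: True}, [[1], [-1]]))                 # False
```
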

    
    
                 +--------------------+
                 |                    |
      NP-Hard -> |                    |
                 |   ,------------.   |
                  \ /              \ /
                   X  NP-Complete   X
                  / \              / \
                 /   `------------'   \
          NP -> |                      |
                |                      | 
                .    +------------+    ,
                 \   | Polynomial |   /
                  `--+------------+--'
    

[0] For a technical definition of "efficient"

~~~
sacado2
Am I missing something? I read it the other way: all CNF instances can be
rewritten as regexp + backreferences, meaning re + backreferences are _at
least as_ general as SAT, not _at most as_. Meaning they could be higher up in
the polynomial hierarchy.

~~~
ColinWright
As always with these things, there's a non-zero chance that I've misspoken
somewhere. I'm going to "think out loud" on this so people can follow the
thought process.

> _I read it the other way:_

OK ...

> _all CNF instances can be rewritten as regexp + backreferences,_

By CNF you are referring to instances of the SAT problem. So yes, if you have
an instance of the SAT problem, it can be re-written as an instance of
regex+backtrack.

> _meaning re + backreferences are at least as general as SAT,_

Yes, the regex+backtrack problem is at least as hard as the SAT problem.

> _... not at most as._

Where did I say that? Here's a stripped-down summary of my comment:

* SAT is NP-Hard.

* Now someone has shown that they can solve SAT problems by using regex+backtrack. _(That's the linked article.)_

* Thus regex+backtrack is at least as hard as every NP problem.

* SAT is NP, so it's NPC

* regex+backtrack can be seen to be in NP.

* So regex+backtrack is NPC.

So rewording that:

* The linked article shows regex+backtrack >= SAT.

* Independently we observe that checking an alleged regex+backtrack solution is a polynomial task, therefore regex+backtrack is in NP.

* SAT is in NPC, therefore regex+backtrack <= SAT (because regex+backtrack is in NP).

* Thus regex+backtrack = SAT (for some definition of "=")

So, I think you must have misread something ... I think everything I've
written is correct as stands.

~~~
sacado2
> we observe that checking an alleged regex+backtrack solution is a polynomial
> task

That's the point I missed at first. That's good news, because I was pretty
sure Perl regexes were accidentally Turing-complete; I don't know why.

~~~
ColinWright
Cool.

This sort of thing can be really tough to follow because it's all deeply
intertwingled. Glad I got it right.

Cheers!

------
awirth
This reduction is really cool. I love reductions like this.

Is there a general consensus to use "regular expression" to refer to the
actual regular ones and "regex" to refer to the non-regular variants?

~~~
chubot
I wouldn't say so, but I use the term "regular language" if I mean the
mathematical concept.

~~~
robinhouston
I don’t think it’s pedantic to say that a regular language is not the same
thing as a regular expression. The difference between syntax and semantics is
real and important.

~~~
chubot
(late reply) Right that's what I'm saying. Who said it was pedantic? :)

------
sacado2
One of the cool features of SAT solvers is that they always terminate (if
you're patient enough). Aren't regexes, especially with backreferences, Turing-
complete though? If so, they could get caught in an infinite loop, meaning
they are more general than the SAT problem.

~~~
dmichulke
Programming languages are more general than the problems they solve. (=
feature, not bug)

Still, yes, you can mess up your "add 1 to the input" program and make it run
infinitely.

~~~
sacado2
Yeah, I meant it the other way: if those regexps are Turing-complete, not all
of them have an equivalent CNF representation, contrary to what the article
seems to state in its first paragraph (and title).

That being said, regexps were not initially meant to be "programming
languages", so I'm not sure about the "feature, not bug" part. I'd rather have
a notation that would let me solve, for instance, the "HTML tag matching"
problem _and_ would be guaranteed to always terminate, than one that also lets
me implement Conway's game of life.

------
klyrs
That "popcnt1" is also known as a 1-hot constraint.

[https://en.m.wikipedia.org/wiki/One-hot](https://en.m.wikipedia.org/wiki/One-hot)
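
In CNF terms, the usual way to write a one-hot ("popcnt1") constraint is one at-least-one clause plus pairwise at-most-one clauses. A small Python sketch (the pairwise encoding; the function name is mine):

```python
from itertools import combinations

def exactly_one(variables):
    # CNF clauses forcing exactly one of `variables` to be true.
    clauses = [list(variables)]  # at least one is true
    # at most one: no two may be true simultaneously
    clauses += [[-a, -b] for a, b in combinations(variables, 2)]
    return clauses

print(exactly_one([1, 2, 3]))
# [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]
```
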

------
punnerud
    
    
        time python3 solver.py fred.cnf
    

Took 9 minutes and 10 seconds on a RPi 3 running Ubuntu 20.04, consuming 100%
CPU and 1% of its 1024 MB of RAM.

~~~
nurettin
This is actually amazing. My Python programs rarely run at 100% CPU, whereas
C++ binaries usually do. I always thought Python's inefficiency caused the
drop in CPU utilization.

~~~
nromiun
I don't know how efficient it is, but I have always been able to peg all
cores with the multiprocessing module. Even something useless like "x * x" is
more than enough for 800%.
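
For reference, a minimal sketch of that pattern (the `busy` function is just an illustrative stand-in for real work):

```python
import multiprocessing as mp

def busy(n):
    # pointless CPU-bound work: sum of squares below n
    x = 0
    for i in range(n):
        x += i * i
    return x

if __name__ == "__main__":
    # Pool() defaults to one worker process per CPU core,
    # so this map keeps every core at 100% while it runs
    with mp.Pool() as pool:
        results = pool.map(busy, [10_000_000] * mp.cpu_count())
    print(len(results))
```

Because each worker is a separate process, this sidesteps the GIL, which is why all cores get pegged.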

