
Rover: Reasoning over Rules - davidfoster
https://rule-reasoning.apps.allenai.org/
======
YeGoblynQueenne
Seems to be overfitting to statistical regularities in the dataset, or in any
case it completely ignores the facts and rules you give it and draws the
answer from who knows where:

    
    
      Metals ermuf electricity. 
      Insulators do not ermuf electricity. 
      If something is made of gudranga then it is metal. 
      Nails are made of gudranga.
      
      Nails conduct electricity.
      
      
      ROVER prediction:
      
            Nails conduct electricity.   True  (confidence = 0.99)
    
    

Yes, it can tell that nails ermuf electricity:

    
    
      ROVER prediction:
      
          Nails ermuf electricity.   True  (confidence = 0.99)
    
    

However, it also thinks that nails gudranga electricity:

    
    
      ROVER prediction:
      
          Nails gudranga electricity.   True  (confidence = 0.99)
    

So, in short, it is very determined to find that Nails Y electricity,
for whatever Y, whether or not Y is something that relates nails to
electricity.

[https://rule-reasoning.apps.allenai.org/?p=Metals%20ermuf%20...](https://rule-reasoning.apps.allenai.org/?p=Metals%20ermuf%20electricity.%20%0AInsulators%20do%20not%20ermuf%20electricity.%20%0AIf%20something%20is%20made%20of%20gudranga%20then%20it%20is%20metal.%20%0ANails%20are%20made%20of%20gudranga.&q=Nails%20ermuf%20electricity).
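
For comparison, here are the same facts and rules written directly as a
logic program (a minimal sketch; the predicate names are mine, and the
"insulators" rule is omitted since it plays no role in these queries). A
sound prover derives only the conclusion that actually follows:

      % The demo's facts and rules as Horn clauses.
      ermuf(X)  :- metal(X).              % Metals ermuf electricity.
      metal(X)  :- made_of(X, gudranga).  % Made of gudranga => metal.
      made_of(nails, gudranga).           % Nails are made of gudranga.
    
      % ?- ermuf(nails).      % true: follows by chaining the two rules
      % ?- conduct(nails).    % conduct/1 is never defined, so there is
      %                       % no way to derive it as True
      % ?- gudranga(nails).   % likewise underivable, despite ROVER's 0.99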

------
andreyk
Here's a link to the paper:
[https://arxiv.org/abs/2002.05867](https://arxiv.org/abs/2002.05867)

Shortened Abstract: "AI has long pursued the goal of having systems reason
over _explicitly provided_ knowledge, but building suitable representations
has proved challenging. Here we explore whether transformers can similarly
learn to reason (or emulate reasoning), but using rules expressed in language,
thus bypassing a formal representation. We provide the first demonstration
that this is possible, and characterize the extent of this capability. To do
this, we use a collection of synthetic datasets that test increasing levels of
reasoning complexity (number of rules, presence of negation, and depth of
chaining). We find transformers appear to learn rule-based reasoning with high
(99%) accuracy on these datasets, and in a way that generalizes to test data
requiring substantially deeper chaining than in the training data (95%+
scores). We also demonstrate that the models transfer well to two hand-
authored rulebases, and to rulebases paraphrased into more natural language."

The performance numbers are pretty impressive IMO. But learning from
synthetic datasets is pretty perilous; it's hard to say whether it'll
generalize well. Kudos to them for putting out a live demo.

~~~
YeGoblynQueenne
Thanks for the link to the paper. Here's my very informal, mini-review.

Their synthetic language is basically Prolog disguised as natural-ish
English: their "rules" and "facts" are Horn clauses (a conjunction of
"premises" that imply a "conclusion") with implicitly universally
quantified variables, and their negation is Negation As Failure under a
closed-world assumption. They restrict their rules to a single variable
and their predicates to binary relations only (xRy), and even transform
monadics to dyadics (is(Alan,Big) rather than big(Alan)).
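
To make that concrete, here is roughly what their restricted language
looks like written out as actual Prolog (a minimal sketch; isa/2 stands
in for their is(), which would clash with Prolog's built-in arithmetic
is/2):

      % Monadic properties become dyadic facts:
      isa(alan, big).                         % is(Alan,Big), not big(Alan)
      % Rules have one implicitly universally quantified variable and
      % binary relations only:
      isa(X, metal) :- made_of(X, gudranga).  % "If something is made of
                                              %  gudranga then it is metal."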

While it's easy to see how their technique could be useful in Natural
Language Processing, they do a very poor job of supporting their claim
that it's useful in formal theorem proving, 'as a kind of limited "soft
theorem prover"'. Well, what do "kind of limited" and "soft" mean? They
make no attempt to compare their work to actual theorem provers; they
only compare it to other work with transformers. And yet there is ample
scope for comparison with theorem provers: since they try to derive
statements from other statements in a restricted Horn-clause language,
they could directly compare their work to a Resolution-based theorem
prover (well, Prolog) that does exactly that, and explain the advantages
and disadvantages of their approach. But they do nothing of the sort.
They fail completely to place their work in the context of formal
theorem proving, even as they claim repeatedly that it fits right into
it.

Suffice it to say that their claims of "theorem proving" are far from
convincing. Cursory testing with their live demo (as in my other comment
in this conversation) suggests that their architecture does not
"emulate" the behaviour of a theorem prover; it just does what language
models do best: capture statistical regularities in their training
dataset. That is useful in certain contexts, for sure, but it is nothing
like "emulating" the "i/o" of a theorem prover.

From my point of view, as someone coming from the symbolic reasoning side of
things, this is a very sloppy attempt to tackle some hard problems that
already have perfectly serviceable solutions, but without showing any
willingness to put in the work to understand those problems, or why they are
hard in the first place.

I'm sorry to rip into this work so harshly, but this is typical of the
worst that modern machine learning research has to offer: an
ill-informed attempt to impress with bold claims, aimed at an audience
that lacks the background to evaluate them adequately. It's disturbing
to think that this work will henceforth be cited as "evidence that
transformers can perform reasoning" and that it may initiate a trend of
more claims along the same lines.

------
asacalowww
Just tried a set of classic non-monotonic reasoning statements and it didn't
like it much: [https://rule-reasoning.apps.allenai.org/?p=Penguins%20are%20...](https://rule-reasoning.apps.allenai.org/?p=Penguins%20are%20birds%0ABirds%20can%20typically%20fly%0APenguins%20cannot%20fly%0ATweety%20is%20a%20bird&q=Can%20tweety%20fly%3F)

~~~
smoyer
That one seemed to work for me too ... I'm wondering if there is something
weird going on with the links (or is the transformer training as we go?).

~~~
asacalowww
Mm, I don't think so. It's a non-monotonic problem: given only that
Tweety is a bird, Tweety might also be a penguin, so the system can't
assert either way whether Tweety can fly or not. Non-monotonic reasoning
is especially useful in many real-world scenarios where a derived logic
system doesn't have tidy yes/no answers throughout.
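
For reference, here is the classic logic-programming treatment of those
statements, with "typically" expressed as negation as failure (a minimal
sketch; the predicate names are mine):

      % Default reasoning with negation as failure (NAF): a bird flies
      % unless it is provably a penguin.
      :- dynamic penguin/1.                 % the penguin table may be empty
      bird(X)  :- penguin(X).               % Penguins are birds.
      flies(X) :- bird(X), \+ penguin(X).   % Birds can typically fly.
      bird(tweety).                         % Tweety is a bird.
    
      % ?- flies(tweety).                   % true: not provably a penguin
      % ?- assertz(penguin(tweety)), flies(tweety).
      %                                     % now false: adding a fact
      %                                     % withdrew a conclusion, which
      %                                     % is what "non-monotonic" means

Note that this is the closed-world reading, where Prolog answers "yes"
as long as no penguin fact is known; under the open-world reading you
describe, the honest answer to "Can Tweety fly?" is "unknown".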

~~~
YeGoblynQueenne
The funny thing is that they report results on some hand-crafted
examples of reasoning over birds flying or not (by Marek Sergot, from
Imperial), and they have a little table where all the language models
perform almost perfectly.

And yet, as you have found, they do not. Or perhaps they do, but you
have to run the same experiment a few thousand times (it's a tiny
dataset of five clauses or so) until you get the good results, eh?

I'm just being a big old meanie now.

------
scribu
It doesn't seem very precise. For example, it fails to distinguish
between "Is" and "Is like":

[https://rule-reasoning.apps.allenai.org/?p=A%20pear%20is%20a...](https://rule-reasoning.apps.allenai.org/?p=A%20pear%20is%20a%20type%20of%20fruit.%0AA%20pear%20is%20like%20an%20apple.&q=An%20apple%20is%20a%20fruit.%0AAn%20apple%20is%20a%20pear.%0AA%20pear%20is%20an%20apple.%0AA%20fruit%20is%20a%20pear).
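
In a logical encoding those would be two distinct relations with
different inference behaviour; a minimal Prolog sketch (relation names
are mine) of why none of those four queries actually follows from the
two premises:

      % "is a type of" (category) and "is like" (similarity) as distinct
      % relations: similarity licenses no category inference at all.
      type_of(pear, fruit).            % A pear is a type of fruit.
      like(pear, apple).               % A pear is like an apple.
    
      % ?- type_of(pear, fruit).       % true (it is a stated premise)
      % ?- type_of(apple, fruit).      % underivable: being *like* a pear
      %                                % does not make an apple a fruit
      % ?- type_of(pear, apple).       % underivable: "like" is not "is a"
      % ?- type_of(fruit, pear).       % underivable: "is a type of" is
      %                                % not symmetric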

Edit: Added more test cases

~~~
smoyer
Did you edit your link (and therefore test cases)? ... I thought the results
looked good the first time but they have changed since.

