
Generative Language Modeling for Automated Theorem Proving - tosh
https://arxiv.org/abs/2009.03393
======
Tarean
Apparently it takes around 1 V100 GPU hour per proof. Wonder if something like
this might make it into developer tooling eventually.

Writing down some nontrivial invariant like 'no deadlocks are possible' and
getting the result some time later seems like it could be useful. In program
verification, 99% of proofs can be quickly (and decidably) dispatched with
an SMT solver; the last 1% is probably quantifier instantiation, which a
language model might tackle.
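The split described above - cheap decidable ground queries plus a hard instance-selection problem - can be illustrated with a toy sketch in pure Python (no real SMT solver here; `ground_check` is a hypothetical stand-in for a ground SMT query):

```python
from itertools import product

def ground_check(fact):
    """Stand-in for a decidable ground-level check (e.g. one SMT query)."""
    x, y = fact
    return x + y == y + x  # commutativity holds for these instances

def instantiate(axiom_vars, candidate_terms):
    """Enumerate instances of a forall-quantified axiom over candidate terms.
    Real solvers use heuristics (triggers/patterns) to pick instances;
    a language model could instead rank which ones to try."""
    return list(product(candidate_terms, repeat=len(axiom_vars)))

instances = instantiate(("x", "y"), [0, 1, 2])
assert all(ground_check(inst) for inst in instances)
print(f"checked {len(instances)} ground instances")  # checked 9 ground instances
```

The hard part in practice is that the space of candidate instances is unbounded; the sketch just brute-forces a finite domain.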

~~~
spolu
Completely agreed.

------
visarga
This shows transformers are capable of generating more than plain text.

Transformers are probably the most general architecture to date, they can
handle plain text, images, math and speech, but the penalty is that they need
lots of training data and compute.

~~~
grandmczeb
Most transformer models require that the data is Euclidean (e.g. a sequence or
image). Non-Euclidean deep learning methods like graph neural networks (GNNs)
are much more general and can accept almost arbitrarily structured data
(social networks, point clouds, trees, etc). You can actually model
transformers as a special case of GNNs[1].

[1] [https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa](https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa)
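The "special case" relationship can be sketched in a few lines of plain Python (illustrative only; real GNNs and transformers use learned, weighted aggregation, not a plain mean):

```python
def message_pass(features, adjacency):
    """One round of mean-aggregation over an adjacency list:
    each node averages its own feature with its neighbours'."""
    new_features = []
    for node, neighbours in enumerate(adjacency):
        incoming = [features[n] for n in neighbours] + [features[node]]
        new_features.append(sum(incoming) / len(incoming))
    return new_features

# A path graph 0-1-2: only adjacent tokens exchange information.
path = [[1], [0, 2], [1]]
# A complete graph: every token "attends" to every other, as in self-attention.
complete = [[1, 2], [0, 2], [0, 1]]

feats = [1.0, 2.0, 6.0]
print(message_pass(feats, path))      # [1.5, 3.0, 4.0] - local mixing only
print(message_pass(feats, complete))  # [3.0, 3.0, 3.0] - global mixing
```

A transformer layer is roughly the complete-graph case with attention weights in place of the uniform mean, which is the sense in which article [1] calls transformers a special case of GNNs.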

~~~
bollu
All smooth manifolds can be realised as submanifolds of euclidean space. This
is the famous Nash embedding theorem:
[https://en.wikipedia.org/wiki/Nash_embedding_theorem](https://en.wikipedia.org/wiki/Nash_embedding_theorem)

Hence, it's unclear if the power gained by GNNs is actually 'useful'.

~~~
grandmczeb
What do you mean by useful? GNNs get better performance than other methods on
lots of real tasks. It's the same as CNNs being useful in practice despite not
having more "theoretical" power than MLPs.

~~~
lostmsu
Which tasks where GNNs get better performance have standardized benchmarks?

------
mensetmanusman
Are there an infinite number of theorems that could be proven? Or are we only
interested in the class of mathematical truths that are interesting to humans?

------
YeGoblynQueenne
I've only given a cursory reading to the paper, so my concerns may be
addressed somewhere in it. This is admittedly more rigorous work than previous
efforts to train language models to perform automated theorem proving and it's
admirable that the authors have actually reached out to the community around
Metamath (the classical automated theorem prover used in the paper) to
understand the extent to which their work is useful to that community.

That said, I don't understand the motivation of the work. Why is it necessary
to teach a language model to perform automated theorem proving, when there are
already approaches that can perform automated theorem proving, that outperform
language models at it, and that do not have the high costs associated with
training language models? Indeed, Metamath is based on an extremely simple
principle of variable substitution, which is just what it sounds like, and
comparing the complexity of a modern, GPT-based language model to such a
simple approach is not flattering for the language model approach.
Besides, if I understand correctly, Metamath was used as the "ground truth"
for evaluation and testing of the trained language model and the language
model didn't get anywhere near 100% performance, meaning it's far behind the
ability of Metamath in automated theorem proving. So we have a very simple
approach that works very well on one side and an extremely complex approach
that doesn't work very well on the other. The comparison is, again, not
flattering for the language model approach.
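For a rough illustration of how simple substitution-based checking is (a hypothetical mini-syntax, not actual Metamath notation): a proof step is valid if some assignment of the schema's variables makes the axiom match the goal exactly.

```python
def apply_subst(tokens, subst):
    """Replace schema variables by their assigned token strings."""
    out = []
    for t in tokens:
        out.extend(subst.get(t, [t]))
    return out

# Axiom schema: from "ph" infer "( ph -> ph )" (variables are whole tokens).
schema = ["(", "ph", "->", "ph", ")"]
subst = {"ph": ["x", "=", "x"]}

goal = ["(", "x", "=", "x", "->", "x", "=", "x", ")"]
assert apply_subst(schema, subst) == goal
print("step verified by substitution")
```

That the whole verification principle fits in a dozen lines is exactly what makes the comparison with a large trained model so stark.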

I can see that GPT-ƒ (the language model in the paper) has contributed some
shorter proofs to some theorems in Metamath's database, which seems to be very
much appreciated by the Metamath community, but on the other hand, it seems
that GPT-ƒ was not able to prove _new_ theorems (and a shorter version of a
theorem proved by GPT-ƒ was found by hand, by one of the Metamath users, a
little after the GPT-ƒ results were announced on the Metamath google group).

It seems to me it would be a long and arduous road for language models to
achieve parity with classical automated theorem provers - if indeed such
parity can be achieved at all. And all this hard work, for what, when there
already exist perfectly serviceable automated theorem provers with none of the
drawbacks of language models? Why reinvent a wheel, only to come up with a
new wheel that can only roll one way and only on Tuesdays?

Finally, despite the more rigorous approach in this paper, the trend of only
comparing neural net approaches to other neural net approaches continues. An
automated theorem prover approach of any stripe should be compared to the
state of the art in automated theorem proving, not just to other efforts using
a similar approach! Imagine if I wanted to present a new image recognition
algorithm based on automated theorem proving and omitted comparison with
Convolutional Neural Nets!

~~~
alexlyzhov
> Besides if I understand correctly, Metamath, was used as the "ground truth"
> for evaluation and testing of the trained language model and the language
> model didn't get anywhere near 100% performance, meaning it's far behind the
> ability of Metamath in automated theorem proving.

I'm new to this, but the wiki says Metamath refers to a language plus a proof
checker. It doesn't refer to a method of proof search, right? Isn't it just a
database of proofs? I'm confused.

Can you please shine light on traditional approaches that are better than DL
approaches at proof search? And on how the setting they are evaluated in is
different from the evaluation setting in this paper?

~~~
YeGoblynQueenne
Proof verification is traditionally considered an automated theorem proving
task. In short, being able to automatically decide the validity of a proof is
almost as useful as being able to generate a valid proof automatically.

A lot of work has gone into automated theorem proving since the early 20th
century. The wikipedia page is a place to start if you are interested in more
details:

[https://en.wikipedia.org/wiki/Automated_theorem_proving](https://en.wikipedia.org/wiki/Automated_theorem_proving)

This work includes proof checkers, for which wikipedia also has a page, though
it doesn't have a lot of meat on its bones and you'll have to follow the links
to the systems listed yourself:

[https://en.wikipedia.org/wiki/Proof_assistant](https://en.wikipedia.org/wiki/Proof_assistant)

I'm not an expert on proof checkers, but I hear a lot about Coq and Lean,
lately, so you might want to start there.

Traditional approaches to automated theorem proving, in the sense of
generating new proofs (rather than checking the validity of proofs) include,
for example, Robinson's resolution algorithm that is at the heart of the logic
programming language Prolog. Such traditional approaches are "better than DL"
in the sense that they are provably sound (they only return correct results)
and complete (they return all correct results) under certain assumptions.

Given the impossibility of empirically checking the proof of every possible
theorem, e.g. in first order logic, given finite resources, such systems are
typically evaluated theoretically, i.e. by proofs of their properties. So for
instance, there are proofs of the soundness and completeness of resolution for
the purpose of deciding entailment between FOL theories given as sets of Horn
clauses. I keep returning to resolution because I happen to be familiar with
it, but there are other approaches, of course, though I believe many use some
kind of inference rule (e.g. Modus Ponens) and a search procedure usually
assisted by some heuristics to find the best next step in a proof.
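For a flavour of what unification - the workhorse of resolution - looks like, here is a bare-bones syntactic unifier in Python (simplified: no occurs check, so it is not a faithful Robinson implementation). Terms are tuples `(functor, args...)`; variables are strings starting with an uppercase letter, Prolog-style.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, subst):
    """Follow variable bindings until we hit a non-variable or an unbound var."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst=None):
    """Return a substitution making a and b equal, or None if none exists."""
    subst = dict(subst or {})
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        subst[a] = b
        return subst
    if is_var(b):
        subst[b] = a
        return subst
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

# parent(X, bob) unifies with parent(alice, Y):
print(unify(("parent", "X", "bob"), ("parent", "alice", "Y")))
# {'X': 'alice', 'Y': 'bob'}
```

Resolution then repeatedly unifies a goal with the head of a clause and replaces it with the clause's body, which is essentially Prolog's execution model.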

~~~
lostmsu
> In short, being able to automatically decide the validity of a proof is
> almost as useful as being able to generate a valid proof automatically.

This statement is not known to be true, and is definitely false in practice at
the moment. Validating a proof is in P, so generating one is by definition in
NP, and very likely NP-complete.

~~~
YeGoblynQueenne
I'm sorry, but what is "practice at the moment"? I'm not an expert on
interactive theorem provers.

~~~
lostmsu
It means there's no known polynomial algorithm.

~~~
YeGoblynQueenne
What does this have to do with "practice at the moment"? Don't you mean
"practice" as in what people actually do, in practice? I don't understand
where complexity comes into that. From what I can tell, work in generating
new proofs, in the style of Logic Theorist, has dried up and interactive
theorem provers are very popular, even among some mathematicians, so they
clearly are useful.

In any case, far be it from me to discourage opportunistic nitpicking or pedantry,
my favourite sports, but, amicably, I don't understand what you are trying to
tell me. I'm curious and I'd like to know. If you have the time and patience,
please clarify.

~~~
lostmsu
Checking proof validity is a trivial task. Generating new proofs is not.

~~~
YeGoblynQueenne
Trivial? I don't think so. I mean, the article above describes a new state of
the art established by a large language model trained with much expense and
effort that achieved a measly 50%-ish score.

But in any case, I think you are trying to say that if something has
polynomial time complexity then it's not useful, compared to a problem for
which we don't know a polynomial time algorithm. I don't think that works out
like that. For example, we have sorting algorithms with polynomial time
complexity and I don't think anyone would say they're not useful.

~~~
lostmsu
They generate proofs in the article, not validate existing ones. The algorithm
to validate a proof is known and AFAIK does not need much improvement (e.g. it
is already practical).

~~~
YeGoblynQueenne
You're right, the trained models evaluated in the paper did not perform proof
verification, which was handled by an external, hand-crafted Metamath
verifier.

However, the trained models didn't generate new _proofs_ either. Rather, they
assigned log-probabilities to proof _steps_ (a substitution unifying a theorem
to a goal) and the log-probabilities were then used to select the best next
goal to expand during a traditional, hand-crafted best-first search (the
authors say that the search ended up being breadth-first "most of the time",
but it was coded as best-first, from their descriptions).
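The search loop described above can be sketched roughly as follows (a toy stand-in: the `scores` table plays the role of the model's log-probabilities, and the "goals" are just integers counting down to a proved state):

```python
import heapq

def best_first(start, expand, logprob, is_proved, limit=100):
    """Expand goals in order of cumulative log-probability (highest first,
    so we store the negated sum in a min-heap)."""
    frontier = [(0.0, start, [])]  # (negated cumulative logprob, goal, path)
    while frontier and limit > 0:
        limit -= 1
        neg_lp, goal, path = heapq.heappop(frontier)
        if is_proved(goal):
            return path + [goal]
        for nxt in expand(goal):
            heapq.heappush(frontier,
                           (neg_lp - logprob(goal, nxt), nxt, path + [goal]))
    return None

# Toy proof space: edges (goal, next_goal) scored by the "model".
scores = {(3, 2): -0.1, (3, 1): -2.0, (2, 1): -0.1, (2, 0): -0.3, (1, 0): -0.1}
path = best_first(3,
                  lambda g: [n for (a, n) in scores if a == g],
                  lambda g, n: scores[(g, n)],
                  lambda g: g == 0)
print(path)  # [3, 2, 1, 0]
```

The neural net only supplies the `logprob` scores; the search itself is entirely classical, which is the point being made above.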

Anyway this is a bit sideways to your point. I still think you're nitpicking
about my use of the word "useful" and applying it without much sensibility to
an argument about complexity that you are interested in. I didn't say anything
about complexity. I said that proof verification is very useful. I also don't
agree that it's "trivial". There are simple enough algorithms to perform it,
but, for example, Metamath's "variable substitution" is, from what I can
tell, essentially unification, and while unification is efficient, it was only
described - I was going to say by Robinson, but Wikipedia tells me that
Herbrand had a unification algorithm before that - quite late in the history
of mathematics (the early 20th century). Your comment above about
how proof verification is "trivial" is only true in hindsight.

I said above I don't mind nitpicking, but I'm actually getting a bit tired
with it. My original comment made a point about the lack of clear motivation
in the discussed paper. Your comment chose to address a tiny point made as an
aside and in response to another comment. What is the purpose of that? What
have we learned from this discussion about the triviality of proof
verification and the NP-completeness of proof generation? Was this a
productive exchange for either of us, or was the whole point of this to show
off how much we respectively know to make the other commenter feel a fool?

