Generative Language Modeling for Automated Theorem Proving (arxiv.org)
88 points by tosh 19 days ago | hide | past | favorite | 28 comments

Apparently it takes around 1 V100 GPU hour per proof. Wonder if something like this might make it into developer tooling eventually.

Writing down some nontrivial invariant like 'no deadlocks are possible' and getting the result some time later seems like it could be useful. In program verification, 99% of proofs can quickly (and decidably) be dispatched with an SMT solver; the last 1% is probably quantifier instantiation, which a language model might tackle.
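The 'no deadlocks' invariant can be made concrete with a toy exhaustive check over a finite model. This is a hypothetical two-thread lock model invented for illustration, not an SMT encoding; a real verifier would discharge such a condition with a solver:

```python
from collections import deque

def deadlock_free(prog1, prog2):
    """Exhaustively explore all interleavings of two lock programs.

    A program is a list of ('acq', lock) / ('rel', lock) steps.
    A state is (pc1, pc2, held), where held is a set of (lock, owner).
    Returns True iff no reachable state leaves a live thread with no move."""
    progs = (prog1, prog2)
    start = (0, 0, frozenset())
    seen, frontier = {start}, deque([start])
    while frontier:
        pc1, pc2, held = frontier.popleft()
        pc = (pc1, pc2)
        taken = {lock for lock, _ in held}
        moves = []
        for tid in (0, 1):
            if pc[tid] == len(progs[tid]):
                continue  # this thread has finished
            op, lock = progs[tid][pc[tid]]
            if op == 'acq' and lock not in taken:
                moves.append((tid, held | {(lock, tid)}))
            elif op == 'rel':
                moves.append((tid, held - {(lock, tid)}))
        if not moves and (pc1 < len(prog1) or pc2 < len(prog2)):
            return False  # a live thread exists but nothing can step: deadlock
        for tid, new_held in moves:
            nxt = (pc1 + (tid == 0), pc2 + (tid == 1), new_held)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True
```

Acquiring the two locks in the same order everywhere passes the check; acquiring them in opposite orders reaches the classic deadlock state. An SMT solver does something analogous symbolically, without enumerating states.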

Completely agreed.

This shows transformers are capable of generating more than plain text.

Transformers are probably the most general architecture to date, they can handle plain text, images, math and speech, but the penalty is that they need lots of training data and compute.

Most transformer models require that the data be Euclidean (e.g. a sequence or an image). Non-Euclidean deep learning methods like graph neural networks (GNNs) are much more general and can accept almost arbitrarily structured data (social networks, point clouds, trees, etc.). You can actually model transformers as a special case of GNNs[1].

[1] https://towardsdatascience.com/transformers-are-graph-neural...
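The "special case" reduction is easy to see concretely: single-head self-attention is message passing on a complete graph, with edge weights given by softmax-normalised dot products. A minimal plain-Python sketch (projections omitted, i.e. identity Q/K/V, which is a simplification):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(nodes):
    """Single-head self-attention as message passing on the complete graph:
    every node aggregates messages (values) from every node, weighted by
    softmax-normalised dot-product scores."""
    out = []
    for q in nodes:
        # edge weights from this node to all nodes
        scores = [sum(a * b for a, b in zip(q, k)) for k in nodes]
        weights = softmax(scores)
        # weighted sum of neighbour "messages"
        agg = [sum(w * v[d] for w, v in zip(weights, nodes))
               for d in range(len(q))]
        out.append(agg)
    return out
```

A GNN restricted to a fully connected graph with this aggregation rule computes the same thing; sparser graphs give the more general case.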

All smooth manifolds can be realised as submanifolds of euclidean space. This is the famous Nash embedding theorem: https://en.wikipedia.org/wiki/Nash_embedding_theorem

Hence, it's unclear if the power gained by GNNs is actually 'useful'.

What do you mean by useful? GNNs get better performance than other methods on lots of real tasks. It's the same as CNNs being useful in practice despite not having more "theoretical" power than MLPs.

Which tasks where GNNs get better performance have standardized benchmarks?

> but the penalty is that they need lots of training data and compute.

I think it depends on how noisy and unstructured the data is (e.g. human language); a transformer will likely approximate a few well-correlated datapoints just fine.

Here the train set was only 1000 proofs (with an average of 90 steps per proof), so it is decidedly a small dataset (but highly structured and noiseless).

(30k proofs to be precise, valid and test are each 1k)

Are there an infinite number of theorems that could be proven? Or are we only interested in the class of mathematical truths that are interesting to humans?

I've only given a cursory reading to the paper, so my concerns may be addressed somewhere in it. This is admittedly more rigorous work than previous efforts to train language models to perform automated theorem proving and it's admirable that the authors have actually reached out to the community around Metamath (the classical automated theorem prover used in the paper) to understand the extent to which their work is useful to that community.

That said, I don't understand the motivation of the work. Why is it necessary to teach a language model to perform automated theorem proving, when there are already approaches that can perform automated theorem proving, that outperform language models at it, and that do not have the high costs associated with training language models? Indeed, Metamath is based on an extremely simple principle of variable substitution, which is just what it sounds like, and comparing the complexity of a modern, GPT-based language model to such a simple approach is not flattering for the language model approach. Besides, if I understand correctly, Metamath was used as the "ground truth" for evaluation and testing of the trained language model, and the language model didn't get anywhere near 100% performance, meaning it's far behind the ability of Metamath in automated theorem proving. So we have a very simple approach that works very well on one side and an extremely complex approach that doesn't work very well on the other. The comparison is, again, not flattering for the language model approach.

I can see that GPT-f (the language model in the paper) has contributed some shorter proofs of some theorems in Metamath's database, which seems to be very much appreciated by the Metamath community, but on the other hand, it seems that GPT-f was not able to prove new theorems (and an even shorter version of a proof found by GPT-f was found by hand, by one of the Metamath users, a little after the GPT-f results were announced on the Metamath google group).

It seems to me it would be a long and arduous road for language models to achieve parity with classical automated theorem provers - if indeed such parity can be achieved at all. And all this hard work - for what, when there already exist perfectly serviceable automated theorem provers with none of the drawbacks of language models? Why reinvent a wheel, only to come up with a new wheel that can only roll one way and only on Tuesdays?

Finally, despite the more rigorous approach in this paper, the trend of only comparing neural net approaches to other neural net approaches continues. An automated theorem proving approach of any stripe should be compared to the state of the art in automated theorem proving, not just to other efforts using a similar approach! Imagine if I wanted to present a new image recognition algorithm based on automated theorem proving and omitted comparison with convolutional neural nets!

> Besides, if I understand correctly, Metamath was used as the "ground truth" for evaluation and testing of the trained language model, and the language model didn't get anywhere near 100% performance, meaning it's far behind the ability of Metamath in automated theorem proving.

I'm new to this, but the wiki says Metamath refers to a language plus a proof checker. It doesn't refer to a method of proof search, right? Isn't it just a database of proofs? I'm confused.

Can you please shine light on traditional approaches that are better than DL approaches at proof search? And on how the setting they are evaluated in is different from the evaluation setting in this paper?

Proof verification is traditionally considered an automated theorem proving task. In short, being able to automatically decide the validity of a proof is almost as useful as being able to generate a valid proof automatically.

A lot of work has gone into automated theorem proving since the early 20th century. The wikipedia page is a place to start if you are interested in more details:


This work includes proof checkers, for which wikipedia also has a page, though it doesn't have a lot of meat on its bones and you'll have to follow the links to the systems listed yourself:


I'm not an expert on proof checkers, but I hear a lot about Coq and Lean, lately, so you might want to start there.

Traditional approaches to automated theorem proving, in the sense of generating new proofs (rather than checking the validity of proofs) include, for example, Robinson's resolution algorithm that is at the heart of the logic programming language Prolog. Such traditional approaches are "better than DL" in the sense that they are provably sound (they only return correct results) and complete (they return all correct results) under certain assumptions.

Given the impossibility of empirically checking the proof of every possible theorem, e.g. in first order logic, given finite resources, such systems are typically evaluated theoretically, i.e. by proofs of their properties. So for instance, there are proofs of the soundness and completeness of resolution for the purpose of deciding entailment between FOL theories given as sets of Horn clauses. I keep returning to resolution because I happen to be familiar with it, but there are other approaches, of course, though I believe many use some kind of inference rule (e.g. Modus Ponens) and a search procedure, usually assisted by some heuristics, to find the best next step in a proof.
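For readers who haven't met resolution: in the propositional case it fits in a few lines. This is a naive saturation loop for illustration, not how Vampire or a Prolog engine actually implements it:

```python
from itertools import combinations

def resolve(c1, c2):
    """All resolvents of two clauses. A clause is a frozenset of literals;
    a literal is a (name, polarity) pair."""
    out = []
    for name, pol in c1:
        if (name, not pol) in c2:
            out.append((c1 - {(name, pol)}) | (c2 - {(name, not pol)}))
    return out

def entails_empty(clauses):
    """Saturate the clause set under resolution. Returns True iff the empty
    clause is derivable, i.e. the set is unsatisfiable. Sound and complete
    for propositional logic, though worst-case exponential."""
    clauses = {frozenset(c) for c in clauses}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolve(c1, c2):
                if not r:
                    return True  # derived the empty clause: contradiction
                new.add(frozenset(r))
        if new <= clauses:
            return False  # saturated without a contradiction
        clauses |= new
```

To decide whether a theory entails a formula, you negate the formula, add it to the theory's clauses, and check for unsatisfiability; first order resolution adds unification on top of this.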

> In short, being able to automatically decide the validity of a proof is almost as useful as being able to generate a valid proof automatically.

This statement is not known to be true, and is definitely false in practice at the moment. Validating a proof is a task in P (polynomial in the length of the proof), so generating proofs is by definition in NP, and very likely NP-complete.
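The asymmetry is easy to see: checking a Hilbert-style proof is a single pass over its lines, while finding one means searching the space of all line sequences. A toy checker (the encoding of formulas as strings and `('->', p, q)` tuples is invented for this sketch):

```python
def check_proof(axioms, proof, goal):
    """Check a Hilbert-style proof: each line must be an axiom or follow
    from two earlier lines by modus ponens (from p and ('->', p, q), infer q).
    A single pass over the proof, so checking is polynomial in its length -
    in contrast with *finding* the proof, which is a search problem."""
    derived = []
    for line in proof:
        ok = line in axioms or any(
            ('->', p, line) in derived for p in derived
        )
        if not ok:
            return False
        derived.append(line)
    return bool(derived) and derived[-1] == goal
```

For example, from axioms {p, p->q, q->r}, the proof [p, p->q, q, q->r, r] of r checks in one pass; a prover, by contrast, has to discover that sequence.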

I'm sorry, but what is "practice at the moment"? I'm not any expert on interactive theorem provers.

It means there's no known polynomial algorithm.

What does this have to do with "practice at the moment"? Don't you mean "practice" as in what people actually do, in practice? I don't understand where complexity comes into that. From what I can tell, work on generating new proofs, in the style of Logic Theorist, has dried up, and interactive theorem provers are very popular, even among some mathematicians, so they clearly are useful.

In any case, far be it from me to discourage opportunistic nitpicking or pedantry, my favourite sports, but, amicably, I don't understand what you are trying to tell me. I'm curious and I'd like to know. If you have the time and patience, please clarify.

Checking proof validity is a trivial task. Generating new proofs is not.

Trivial? I don't think so. I mean, the article above describes a new state of the art established by a large language model trained with much expense and effort that achieved a measly 50% ish score.

But in any case, I think you are trying to say that if something has polynomial time complexity then it's not useful, compared to a problem for which we don't know a polynomial time algorithm. I don't think that works out like that. For example, we have sorting algorithms with polynomial time complexity and I don't think anyone would say they're not useful.

They generate proofs in the article, not validate existing ones. The algorithm to validate a proof is known and AFAIK does not need much improvement (e.g. it is already practical).

You're right, the trained models evaluated in the paper did not perform proof verification, which was handled by an external, hand-crafted Metamath verifier.

However, the trained models didn't generate new _proofs_ either. Rather, they assigned log-probabilities to proof _steps_ (a substitution unifying a theorem to a goal) and the log-probabilities were then used to select the best next goal to expand during a traditional, hand-crafted best-first search (the authors say that the search ended up being breadth-first "most of the time", but it was coded as best-first, from their descriptions).
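The search loop itself is classical; a generic sketch, where `expand` and `score` are hypothetical stand-ins for the Metamath step generator and the model's log-probabilities:

```python
import heapq

def best_first_search(root_goal, expand, score, max_nodes=10_000):
    """Best-first proof search: expand(goal) yields candidate subgoals and
    score(goal) is e.g. a model log-probability; higher-scoring goals are
    expanded first. In this simplified sketch, a goal that expands to
    nothing counts as closed and is returned; real proof search also has
    to track which subgoals jointly close a proof."""
    # heapq is a min-heap, so push negated scores to pop the best first.
    frontier = [(-score(root_goal), root_goal)]
    visited = set()
    while frontier and len(visited) < max_nodes:
        _, goal = heapq.heappop(frontier)
        if goal in visited:
            continue
        visited.add(goal)
        children = list(expand(goal))
        if not children:
            return goal  # closed goal found
        for child in children:
            heapq.heappush(frontier, (-score(child), child))
    return None  # budget exhausted
```

With a flat `score`, the priority queue degenerates into FIFO order, which matches the authors' remark that the search behaved breadth-first "most of the time".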

Anyway, this is a bit sideways to your point. I still think you're nitpicking about my use of the word "useful" and applying it without much sensibility to an argument about complexity that you are interested in. I didn't say anything about complexity. I said that proof verification is very useful. I also don't agree that it's "trivial". There are simple enough algorithms to perform it, but, for example, Metamath's "variable substitution" is, from what I can tell, essentially unification, and while unification is efficient, it was only described quite late in the history of mathematics (the early 20th century) - I was going to say by Robinson, but Wikipedia tells me that Herbrand had a unification algorithm before that. Your comment above about how proof verification is "trivial" is only true in hindsight.
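For reference, the unification underlying Metamath-style substitution fits in a screenful. A Robinson-style sketch; the term encoding (lowercase strings as variables, other strings as constants, tuples as compound terms) is a convention invented for this example:

```python
def is_var(t):
    # convention for this sketch: lowercase strings are variables
    return isinstance(t, str) and t[:1].islower()

def walk(t, subst):
    """Follow variable bindings to a representative term."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    """Occurs check: does variable v appear inside term t?"""
    t = walk(t, subst)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, subst) for a in t[1:])

def unify(t1, t2, subst=None):
    """Return a most general unifier (a dict) of t1 and t2, or None."""
    if subst is None:
        subst = {}
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if is_var(t1):
        return None if occurs(t1, t2, subst) else {**subst, t1: t2}
    if is_var(t2):
        return unify(t2, t1, subst)
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and len(t1) == len(t2) and t1[0] == t2[0]):
        for a, b in zip(t1[1:], t2[1:]):  # same functor: unify argument-wise
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None  # clash: different functors or constants
```

The occurs check is exactly the subtlety that makes this nontrivial: without it, unifying x with F(x) would "succeed" and produce an infinite term.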

I said above I don't mind nitpicking, but I'm actually getting a bit tired of it. My original comment made a point about the lack of clear motivation in the discussed paper. Your comment chose to address a tiny point made as an aside, in response to another comment. What is the purpose of that? What have we learned from this discussion about the triviality of proof verification and the NP-completeness of proof generation? Was this a productive exchange for either of us, or was the whole point of this to show off how much we respectively know and make the other commenter feel a fool?

After our conversation I had a look around and I found Vampire, which rang a bell. This is what deep learning theorem provers must beat:

The World Championship for Automated Theorem Proving


More to the point, they probably need to beat this:


Vampire is an automated theorem prover that has been winning some category of the world championship since 1999.

See, automated theorem provers can do pointless, stupid SOTA benchmarks, too.

Notice that the 2018 winner in the LTB category was MaLARea 0.6, an NN-based theorem prover.

History repeats itself. Programs that use hand-crafted heuristics compete with NN-based programs for a time and are then left behind. Like Stockfish and Leela Zero.

MaLARea won a single category in two years, out of 11 categories in a competition running since 1996 (though not all categories were used each year). Furthermore, there were five other winners in the same category since the first entry of MaLARea. I think your enthusiasm is a little premature.

It's just an observation that handcrafted heuristics, which are required to mitigate combinatorial explosion, time after time give way to heuristics produced by machine learning.

Haar cascades, various OCR techniques, rule-based translation, chess heuristics for alpha-beta search, pre NN speech recognition techniques etc.

I'm not sure I like the "heuristics" terminology with respect to machine learning, but let's agree to it, for simplicity.

Now, in automated theorem proving (ATP hence) there aren't only heuristics. For example, the resolution rule (used by the Vampire theorem prover) is not a heuristic, but an inference rule that comes with soundness and completeness results. It's what I would call a principled choice, i.e. one fully justified by theory. As to combinatorial explosion, in resolution that is not mitigated by heuristics, but by restricting the problem, e.g. to definite logic programs, etc.

See, the problem with ATP is not that we don't know good ways to perform ATP tasks - we do. We have very good algorithms that do that kind of thing. Algorithms that come with theoretical guarantees of optimality. Complexity results. Soundness and completeness results. Decidability results. All that jazz. So there is no real reason to look for approximations of the same algorithms, as might be learned by a neural network - because why approximate something for which you already have a precise definition?

The real problem with ATP is the huge, unmanageable size of the hypothesis spaces - and the many negative results, e.g. the undecidability of arbitrary first order logic languages, etc. These are not even efficiency problems that can really be addressed by better algorithms or heuristics - they're fundamental problems associated with having finite resources, in a universe with finite time, and having to explore an infinite domain. These are not the kinds of problems one solves by throwing a lot of data at an optimisation procedure.

Contrast this with chess. The search space for chess moves is finite, though large. We don't have a principled "solution" for chess, like a closed-form solution, or an inference rule- just some search procedures guided by heuristics (minimax and its ilk, including MCTS). Here, an approximation is a good alternative, because really that's all we've been able to do so far. In ATP on the other hand, there are principled solutions and we don't have to rely on heuristics and approximations. So, one has no reason to expect that neural nets can do well in ATP just because they've done well in chess.

As to the observation that neural nets have done well in diverse domains, there's a bit of survivorship bias at play here, I think, because neither of us is really in a position to know in how many domains neural nets have been applied and didn't do so well, since that kind of work never sees the light of day - unlike the success stories, like chess or Go. So again, you can't make very good predictions by looking at what has gone before.

In any case, who knows. I might have to eat my words at some point. Never say never and all that.

EDIT: Sorry about this part of my earlier comment: "I think your enthusiasm is a little premature." Re-reading it, it comes across as snarky and dismissive. That was not my intention, or if it was at the time of writing, I regret it.

OK it looks like this http://www.tptp.org/CASC/ is the standard benchmark for theorem provers. But I don't know the difference between evaluation schemes for Metamath and for CASC and I don't know why DL people evaluate on Metamath instead.

Isn't GPT-f SOTA on Metamath test though?
