Writing down some nontrivial invariant like 'no deadlocks are possible' and getting the result some time later seems like it could be useful. In program verification, 99% of proof obligations can be quickly (and decidably) dispatched with an SMT solver; the remaining 1% is probably quantifier instantiation, which a language model might tackle.
Transformers are probably the most general architecture to date: they can handle plain text, images, math and speech. The penalty is that they need lots of training data and compute.
Hence, it's unclear if the power gained by GNNs is actually 'useful'.
I think it depends on how noisy and unstructured the data is (e.g. human language); a transformer will likely approximate a few well-correlated datapoints just fine.
That said, I don't understand the motivation of the work. Why is it necessary to teach a language model to perform automated theorem proving, when there are already approaches that can perform automated theorem proving, that outperform language models at it, and that do not have the high costs associated with training language models? Indeed, Metamath is based on an extremely simple principle of variable substitution, which is just what it sounds like, and comparing the complexity of a modern, GPT-based language model to such a simple approach is not flattering for the language model approach. Besides, if I understand correctly, Metamath was used as the "ground truth" for evaluation and testing of the trained language model, and the language model didn't get anywhere near 100% performance, meaning it's far behind the ability of Metamath in automated theorem proving. So we have a very simple approach that works very well on one side and an extremely complex approach that doesn't work very well on the other. The comparison is, again, not flattering for the language model approach.
I can see that GPT-ƒ (the language model in the paper) has contributed some shorter proofs to some theorems in Metamath's database, which seems to be very much appreciated by the Metamath community, but on the other hand, it seems that GPT-ƒ was not able to prove new theorems (and a shorter version of a theorem proved by GPT-ƒ was found by hand, by one of the Metamath users, a little after the GPT-ƒ results were announced on the Metamath google group).
It seems to me it would be a long and arduous road for language models to achieve parity with classical automated theorem provers - if indeed such parity can be achieved at all. And all this hard work - for what, when there already exist perfectly serviceable automated theorem provers with none of the drawbacks of language models? Why reinvent a wheel, only to come up with a new wheel that can only roll one way and only on Tuesdays?
Finally, despite the more rigorous approach in this paper, the trend of only comparing neural net approaches to other neural net approaches continues. An automated theorem proving approach of any stripe should be compared to the state of the art in automated theorem proving, not just to other efforts using a similar approach! Imagine if I wanted to present a new image recognition algorithm based on automated theorem proving and omitted comparison with Convolutional Neural Nets!
I'm new to this, but wiki says Metamath refers to a language + proof checker. It doesn't refer to a method of proof search, right? Isn't it just a database of proofs? I'm confused.
Can you please shine light on traditional approaches that are better than DL approaches at proof search? And on how the setting they are evaluated in is different from the evaluation setting in this paper?
A lot of work has gone into automated theorem proving since the early 20th century. The wikipedia page is a place to start if you are interested in more details:
This work includes proof checkers, for which wikipedia also has a page, though it doesn't have a lot of meat on its bones and you'll have to follow the links to the systems listed yourself:
I'm not an expert on proof checkers, but I hear a lot about Coq and Lean, lately, so you might want to start there.
Traditional approaches to automated theorem proving, in the sense of generating new proofs (rather than checking the validity of proofs) include, for example, Robinson's resolution algorithm that is at the heart of the logic programming language Prolog. Such traditional approaches are "better than DL" in the sense that they are provably sound (they only return correct results) and complete (they return all correct results) under certain assumptions.
Given the impossibility of empirically checking the proof of every possible theorem, e.g. in first order logic, given finite resources, such systems are typically evaluated theoretically, i.e. by proofs of their properties. So for instance, there are proofs of the soundness and completeness of resolution for the purpose of deciding entailment between FOL theories given as sets of Horn clauses. I keep returning to resolution because I happen to be familiar with it, but there are other approaches, of course, though I believe many use some kind of inference rule (e.g. Modus Ponens) and a search procedure, usually assisted by some heuristics to find the best next step in a proof.
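To make "resolution" concrete, here is a toy propositional resolution refutation in Python - a sketch of the inference rule itself, not of an industrial prover like Vampire, and all names and the clause encoding are my own:

```python
# Minimal propositional resolution by saturation. Clauses are
# frozensets of literals; a literal is a string, negation is "~".

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Yield every resolvent of two clauses."""
    for lit in c1:
        if negate(lit) in c2:
            yield frozenset((c1 - {lit}) | (c2 - {negate(lit)}))

def refutes(clauses):
    """Saturate under resolution. True iff the empty clause is
    derived, i.e. the clause set is unsatisfiable."""
    clauses = set(clauses)
    while True:
        new = set()
        for a in clauses:
            for b in clauses:
                if a == b:
                    continue
                for r in resolve(a, b):
                    if not r:               # empty clause: contradiction
                        return True
                    new.add(r)
        if new <= clauses:                  # nothing new: satisfiable
            return False
        clauses |= new

# Modus ponens as refutation: from p and p -> q (i.e. ~p ∨ q),
# adding ~q must be unsatisfiable, so q is entailed.
kb = [frozenset({"p"}), frozenset({"~p", "q"}), frozenset({"~q"})]
print(refutes(kb))  # True
```

In the propositional case the set of possible clauses is finite, so saturation terminates - which is exactly the kind of guarantee the soundness and completeness results above are about.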
This statement is not known to be true, and is definitely false in practice at the moment. Validating proofs is a task in P, so generating proofs is by definition in NP, and very likely NP-complete.
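The cheapness of proof validation is easy to see concretely: a Hilbert-style checker only has to confirm that each line is an axiom or follows by modus ponens from two earlier lines, which is at most quadratic in the proof's length. The encoding below is my own toy choice, not Metamath's:

```python
# A toy Hilbert-style proof checker: formulas are atoms (strings)
# or implications encoded as tuples ("->", antecedent, consequent).

def imp(a, b):
    return ("->", a, b)

def check(axioms, proof, goal):
    """True iff `proof` derives `goal` from `axioms`, where each line
    is an axiom or follows by modus ponens from earlier lines."""
    derived = []
    for line in proof:
        ok = line in axioms or any(
            prev == imp(earlier, line)   # prev is (earlier -> line)
            for prev in derived
            for earlier in derived
        )
        if not ok:
            return False
        derived.append(line)
    return bool(derived) and derived[-1] == goal

# From axioms p and p -> q, a three-line proof derives q.
axioms = {"p", imp("p", "q")}
proof = ["p", imp("p", "q"), "q"]
print(check(axioms, proof, "q"))  # True
```

Finding the list `proof` is the hard part; checking it, as above, is the easy part.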
In any case, far be it from me to discourage opportunistic nitpicking or pedantry, my favourite sports, but, amicably, I don't understand what you are trying to tell me. I'm curious and I'd like to know. If you have the time and patience, please clarify.
But in any case, I think you are trying to say that if something has polynomial time complexity then it's not useful, compared to a problem for which we don't know a polynomial time algorithm. I don't think that works out like that. For example, we have sorting algorithms with polynomial time complexity and I don't think anyone would say they're not useful.
However, the trained models didn't generate new _proofs_ either. Rather, they assigned log-probabilities to proof _steps_ (a substitution unifying a theorem to a goal), and the log-probabilities were then used to select the best next goal to expand during a traditional, hand-crafted best-first search (the authors say that the search ended up being breadth-first "most of the time", but it was coded as best-first, from their descriptions).
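That search loop - a score ranking which open goal to expand next - can be sketched with a priority queue. The scoring function here is a toy stand-in for the model's log-probabilities, and all names and the goal structure are my own, not GPT-f's actual interface:

```python
# Best-first search where a scoring function (standing in for a
# model's log-probabilities) decides which open goal to expand next.
import heapq
import itertools

def best_first(start, expand, score, is_proved, budget=1000):
    """Expand up to `budget` goals, always taking the best-scored one."""
    counter = itertools.count()          # tie-breaker for equal scores
    frontier = [(-score(start), next(counter), start)]
    while frontier and budget > 0:
        _, _, goal = heapq.heappop(frontier)   # highest score first
        if is_proved(goal):
            return goal
        budget -= 1
        for child in expand(goal):
            heapq.heappush(frontier, (-score(child), next(counter), child))
    return None

# Toy instance: goals are integers, "proved" means reaching 0, and
# the score prefers smaller goals (mimicking "more promising" steps).
result = best_first(
    start=7,
    expand=lambda n: [n - 1, n - 2] if n > 0 else [],
    score=lambda n: -n,
    is_proved=lambda n: n == 0,
)
print(result)  # 0
```

If the score were uniform, the queue would degenerate into roughly breadth-first order - which may be why the authors observed breadth-first behaviour "most of the time".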
Anyway, this is a bit sideways to your point. I still think you're nitpicking about my use of the word "useful" and applying it, without much sensibility, to an argument about complexity that you are interested in. I didn't say anything about complexity. I said that proof verification is very useful. I also don't agree that it's "trivial". There are simple enough algorithms to perform it, but, for example, Metamath's "variable substitution" is, from what I can tell, essentially unification, and while unification is efficient, it was only described quite late in the history of mathematics (early 20th century) - I was going to say by Robinson, but Wikipedia tells me that Herbrand had a unification algorithm before him. Your comment above about how proof verification is "trivial" is only true in hindsight.
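Since I brought up unification: here is a minimal Robinson-style unifier over terms encoded as nested tuples, e.g. ("f", "X", ("g", "Y")). Variables are uppercase strings. This is a sketch with my own naive encoding, not Metamath's actual algorithm:

```python
# Robinson-style unification with occurs-check.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def substitute(t, subs):
    """Apply a substitution to a term, chasing variable chains."""
    if is_var(t):
        return substitute(subs[t], subs) if t in subs else t
    if isinstance(t, tuple):
        return tuple(substitute(x, subs) for x in t)
    return t

def occurs(v, t, subs):
    """True iff variable v occurs in term t under subs."""
    t = substitute(t, subs)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, x, subs) for x in t)

def unify(a, b, subs=None):
    """Return a substitution making a and b equal, or None."""
    subs = dict(subs or {})
    a, b = substitute(a, subs), substitute(b, subs)
    if a == b:
        return subs
    if is_var(a):
        if occurs(a, b, subs):
            return None
        subs[a] = b
        return subs
    if is_var(b):
        return unify(b, a, subs)
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subs = unify(x, y, subs)
            if subs is None:
                return None
        return subs
    return None

# f(X, g(Y)) unifies with f(a, g(b)) under {X: a, Y: b}.
print(unify(("f", "X", ("g", "Y")), ("f", "a", ("g", "b"))))
# {'X': 'a', 'Y': 'b'}
```

Simple in hindsight, as I said - but it still took until the early 20th century for anyone to write it down.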
I said above I don't mind nitpicking, but I'm actually getting a bit tired of it. My original comment made a point about the lack of clear motivation in the discussed paper. Your comment chose to address a tiny point made as an aside, in response to another comment. What is the purpose of that? What have we learned from this discussion about the triviality of proof verification and the NP-completeness of proof generation? Was this a productive exchange for either of us, or was the whole point to show off how much we respectively know and make the other commenter feel a fool?
The World Championship for Automated Theorem Proving
More to the point, they probably need to beat this:
Vampire is an automated theorem prover that has been winning some category of the world championship since 1999.
See, automated theorem provers can do pointless, stupid SOTA benchmarks, too.
History repeats itself. Programs that use hand-crafted heuristics compete with NN-based programs for a time, and then they are left behind. Like Stockfish and Leela Zero.
Haar cascades, various OCR techniques, rule-based translation, chess heuristics for alpha-beta search, pre NN speech recognition techniques etc.
Now, in automated theorem proving (henceforth ATP) there aren't only heuristics. For example, the resolution rule (used by the Vampire theorem prover) is not a heuristic, but an inference rule that comes with soundness and completeness results. It's what I would call a principled choice, i.e. one fully justified by theory. As to combinatorial explosion, in resolution that is not mitigated by heuristics, but by restricting the problem, e.g. to definite logic programs, etc.
See, the problem with ATP is not that we don't know good ways to perform ATP tasks - we do. We have very good algorithms that do that kind of thing. Algorithms that come with theoretical guarantees of optimality. Complexity results. Soundness and completeness results. Decidability results. All that jazz. So there is no real reason to look for approximations of the same algorithms, as might be learned by a neural network - because why approximate something for which you already have a precise definition?
The real problem with ATP is the huge, unmanageable size of the hypothesis spaces - and the many negative results, e.g. the undecidability of arbitrary first order logic languages, etc. These are not even efficiency problems that can really be addressed by better algorithms or heuristics - they're fundamental problems associated with having finite resources, in a universe with finite time, while having to explore an infinite domain. These are not the kinds of problems one solves by throwing a lot of data at an optimisation procedure.
Contrast this with chess. The search space for chess moves is finite, though large. We don't have a principled "solution" for chess, like a closed-form solution, or an inference rule- just some search procedures guided by heuristics (minimax and its ilk, including MCTS). Here, an approximation is a good alternative, because really that's all we've been able to do so far. In ATP on the other hand, there are principled solutions and we don't have to rely on heuristics and approximations. So, one has no reason to expect that neural nets can do well in ATP just because they've done well in chess.
As to the observation that neural nets have done well in diverse domains, there's a bit of survivorship bias at play here, I think, because, for example, neither you nor I are really in a position to know in how many domains neural nets have been applied and didn't do so well, since that kind of work never sees the light of day - unlike the success stories, like chess or Go. So again, you can't make very good predictions by looking at what has gone before.
In any case, who knows. I might have to eat my words at some point. Never say never and all that.
EDIT: Sorry about this part of my earlier comment: "I think your enthusiasm is a little premature." Re-reading it, it comes across as snarky and dismissive. That was not my intention, or if it was at the time of writing, I regret it.
Isn't GPT-f SOTA on Metamath test though?