The maths goes over my head, but this paragraph was very interesting:
> Tao then kicked off an effort to formalize the proof in Lean, a programming language that helps mathematicians verify theorems. In just a few weeks, that effort succeeded. Early Tuesday morning of December 5, Tao announced that Lean had proved the conjecture without any “sorrys” — the standard statement that appears when the computer can’t verify a certain step. This is the highest-profile use of such verification tools since 2021, and marks an inflection point in the ways mathematicians write proofs in terms a computer can understand. If these tools become easy enough for mathematicians to use, they might be able to substitute for the often prolonged and onerous peer review process, said Gowers.
I helped a very little with the formalization (I filled in some trivial algebraic manipulations, and I was there for some Lean technical support).
It's exciting to see how quickly the work was done, but it's worth keeping in mind that a top mathematician leading a formalization effort is very exciting, so he could very easily scramble a team of around 20 experienced Lean users.
There aren't enough experienced Lean users to go around yet for any old project, so Gowers's point about ease of use is an important one.
Something that was necessary for the success of this was years of development that had already been done for both Lean and mathlib. It's reassuring that mathlib is being developed in such a way that new mathematics can be formalized using it. Like usual though, there was plenty missing. I think this drove a few thousand more lines of general probability theory development.
> Something that was necessary for the success of this was years of development that had already been done for both Lean and mathlib
Yes but the bulk of the work on this project was background as well. So the ease of use problem should be solving itself over time as more and more prereqs get filled in.
BTW, I find it interesting that mathlib is apparently also getting refactored to accommodate constructive proofs better, as part of the porting effort to Lean 4. This might encourage more CS- and program-verification minded folks to join the effort, and maybe some folks in math-foundations too (though Lean suffers there by not being able to work with the homotopy-types axioms).
Yes, sure, and that's how mathlib usually gets developed, project by project and PRing missing background material. I'm one of the mathlib maintainers, and I wanted to be sure there's acknowledgment here for the years of Boubaki-style work among hundreds of volunteers to build up the mathematics. In a way, now that the paper's been formalized it's time for the really hard work, which is figuring out how to restructure it (and mathlib) so that it can easily support future work. That's writing libraries for you.
Where are you getting the idea mathlib is being refactored for constructive proofs? I wouldn't say that's on the road map. In fact, the core Lean developers are uninterested in pursuing supporting this.
Since you're here, are you able to say why constructive proofs are interesting to program verification or (T)CS people? I've never heard anyone in TCS even know what constructivity is -- Lean supports their case where you write a program/algorithm and then (with classical logic) prove correctness and other properties. What would constructive logic give them?
I'll mention that the "Lean way" for constructivity is that you write `def`s for constructions. These can depend on classical reasoning for well-typedness if you want (like for being sure a number is in range when indexing an array), but there is a computability checker. In particular, the compiler sees if there is a way to lower the `def` into some simply typed IR, and this IR can be further compiled to C. If you trust the compiler, then you can trust whether it's real construction.
(Of course, you can also go and prove a definition evaluates to some value if you don't want to trust the evaluator.)
> [W]rite a program/algorithm and then (with classical logic) prove correctness. What would constructive logic give them?
Compared to the usual case of writing the "algorithm" (i.e. the construction/decision part) and "proof" separately, it basically offers better structuring and composability. Imagine writing an "algorithm" that itself relies for correctness on non-trivial preconditions, or must maintain some invariants etc. Proving correctness of a complete development that might involve wiring up several of these together is quite possible if you can use constructive proofs, but becomes quite tedious if you have to work with the two facets separately.
> Where are you getting the idea mathlib is being refactored for constructive proofs?
It's not being refactored for constructive proofs, but proofs that do use classical reasoning are being marked as such in a more fine-grained fashion rather than leaving it as the default for the entire development. Which makes it at least possible to add more constructive content.
Yeah, I can imagine that, and I've used sigma types to carry it out (`Subtype`s in particular, or perhaps custom structures). Why does the proof in the sigma type have to be constructive?
Certainly this is the road to "dependently typed hell" and it's good to avoid mingling proofs with data if you can, but avoiding the law of the excluded middle doesn't make any of that any easier.
This is also a bit of a distraction since what I really mean regarding TCS is that in my experience is that they do not see the proofs themselves as being objects-with-properties. They're perfectly happy with non-constructive reasoning to support constructions. I am looking to be convinced that constructive logic is worth it, but this mathematician hasn't seen anything compelling beyond work on foundations themselves.
Regarding mathlib, could you say what in particular you're talking about, since it's not ringing a bell, and I follow the changelogs.
> Regarding mathlib, could you say what in particular you're talking about, since it's not ringing a bell, and I follow the changelogs.
There's been several commits where general appeals to classical reasoning in some development have been replaced by more fine-grained assumptions of decidability for some objects, or marking uses of classical reasoning in some specific proofs. This has not happened throughout mathlib of course, just in some specific places. But it might still show one way of making the development a bit friendlier to alternate foundations.
> ...but avoiding the law of the excluded middle doesn't make any of that any easier.
Constructivist developments are not about "avoiding the law of excluded middle". They're more about leveraging the way we already structure mathematical proofs (and mathematical reasoning more generally) to make it easier to understand where simple direct reasoning has in fact been used, and a theorem can thus be said to be useful for such purposes. If all we did was proofs by contradiction and excluded middle, there would be no point to it - but direct proof is actually rather common.
> Why does the proof in the sigma type have to be constructive?
It doesn't, if the constructibility part has been stated separately (e.g. stating that some property is decidable, or that some object is constructible). That's in fact how one can "write the program/algorithm and the logic separately" in a principled way.
> I am looking to be convinced that constructive logic is worth it, but this mathematician hasn't seen anything compelling beyond work on foundations themselves.
This paper https://www.ams.org/journals/bull/2017-54-03/S0273-0979-2016... provides a reasonable summary of the relevant arguments. It's notable that a number of mathematicians are doing work that involves, e.g. the internal language of topoi, which pretty much amounts to invoking constructive foundations despite starting from what's otherwise a "classical" setting. That they go to all this trouble to do this certainly suggests that they find that work compelling enough. And indeed, it has practical implications wrt. e.g. alternate foundations for analysis, differential geometry etc.
Many times, Decidable assumptions are added simply because they appear in the terms in a theorem statement, and doing so makes applying such a theorem easier. There's the technical annoyance that Decidable instances are equal but not necessarily defeq, so if the wrong instances are present (such as the classical ones) then you need simp-like automation to rewrite instances before unifying. Maybe this is what you're seeing?
There have also been attempts to make polynomials computable, which peppers everything with decidability, but it's not a good data type for real large-scale computations, so it's not clear what the future is there. Maybe the defeqs are worth it, I don't know.
Re constructibility, this is all at too high of a level to really know what you're talking about, or why it's beneficial to write proofs themselves in a certain way. I'm not really even sure what you mean by "constructive". To me, I see no problem with writing a recursive definition that requires a non-constructive proof of termination by well-founded recursion -- is that constructive to you?
AIUI, the fact that well-founded recursion is encoded as "noncomputable" in Lean is just a wart/quirk of that particular system; it can be justified in dependent type theory which is quite constructive and used by other systems such as Coq.
It's not `noncomputable` in Lean though. Lots of computable functions use it.
I've read Bauer's paper before btw. I think topoi are cool, but I don't think it's a complexity worth thinking about for everyone else who isn't concerned about these things.
Since I have been part of many such grinds that are now big data sets, I can honestly tell you that it is one of the best things in the world. I know that those thousands of hours of my work has led to 100x savings for other people. I also met people and could work on problems I would never get in professional life.
Since you seem to be an expert on this matter I have always wondered whether at some point it becomes faster to prove novel theorems using a theorem prover then doing it on paper. I imagine that quite a bit of time is wasted in proving things that are already proven. In a similar manner that if there would be no libraries, just snippets of code in papers, in computer programming a lot of time would be wasted writing the same things again. I imagine a "proof finding" tool like hoogle is a "function finding" tool could provide a lot of value here. What is your perspective on this?
The biggest benefit of a formalized theorem library is not so much proving entirely new theorems (though it helps a little bit I suppose) but refactoring the existing developments to make them more elegant, easier to understand, and enhance their generality. In fact, it's already the case that the average formal proof is written to be somewhat higher in abstraction and more straightforward in reasoning than the informal counterpart.
This kind of refactoring work is quite hard to do with paper proofs alone, since one cannot essily be sure whether they've preserved correctness wrt. the original development. Having a proof assistant really is a big help there.
> The biggest benefit of a formalized theorem library is not [...] but [...]
[citation needed]
I think there are many benefits. Hard to claim that your favourite one is the biggest benefit.
In general, I think it's a bit weird that you are repeatedly (also other HN threads, and sibling comments in this one) making unfounded claims about Lean/mathlib, to the point where you are telling maintainers how their system works. And if they explain that you are misunderstanding the system, you ignore their correction and bring up the next (or the same) unfounded claim.
Whoops, you're right. Here's what Fields medalist Terence Tao has to say about it: https://terrytao.wordpress.com/2023/12/05/a-slightly-longer-... "...sometimes, after a proposition has been proven, someone in the project realizes that in order to apply the proposition neatly in some other part of the project, one has to slightly modify a hypothesis ... Often one can just modify that hypothesis and watch what the compiler does. ... [W]ith well-designed proofs, the process of modifying a proposition is often substantially easier than writing a proof from scratch. (Indeed it is this aspect of formalization where I think we really have a chance in the future of being in a situation where the task is faster to perform in a formal framework than in traditional pen-and-paper (and LaTeX) framework.)"
> And if they explain that you are misunderstanding the system
I don't think that's a fair description of the sibling threads. If a mathlib maintainer says "no, we're not going to avoid LEM/by-contradiction everywhere in our proof library, that would be silly" because that's what most mathematicians today think constructivism means, as in rejecting the bulk of existing math entirely as meaningless - and then they add "but yes, there are places where we want to work with our own assertions about decidability, and not let classical reasoning mess that up" I think it's entirely fair to call the latter pretty close to a constructivism-friendly approach.
(Keep in mind that most practicing mathematicians don't work with foundations, so the fact that misconceptions like the above would be widespread is not surprising. The remaining argument is about the increased complexity of constructive reasoning, and it's entirely fair for a practical development to want to avoid that, and just work with classical statements. After all, most of the mathematical literature is indeed classical; it does not bother with the computational aspect or with the perceived messiness of numerical analysis.)
> we want to work with our own assertions about decidability, and not let classical reasoning mess that up
I am not confident that you understood what I meant. It has nothing to do with the proofs themselves, but a mathematically uninteresting matter about whether or not you can rw/simp using the theorem due to the statement incidentally containing concrete Decidable instances (it's simply a weakness of these tactics that they are not able to make use of the fact that types are Subsingletons, so if Decidable instances aren't free variables the theorems might not be applicable). There are some old parts of the library that come from probably Lean 2, like Finset, that are fairly constructive, and, since the most developed finite sets are these Finsets, and many of their operations require Decidable instances in their definitions, many theorems about finite sets mention Decidable instances. It's a matter of how much engineering time is available among all the volunteer contributors (along with some fond feelings for Finset since it's been around so long) that we haven't switched over to using the subtype of Set satisfying Set.Finite, which is very non-constructive but would make our lives easier.
I'll emphasize that this rw/simp issue is not about classical reasoning. Even if you banish classical reasoning from the library, it plagues Decidable instances everywhere, and special attention must be made to ensure that theorem statements contain no concrete Decidable instances. Getting this right is an annoying distraction that does not enable any new features or insight.
I'll also emphasize that mathlib theorems do not have Decidable instance hypotheses that are only used in their proofs. Last I checked, there's even a linter for this.
> I don't think that's a fair description of the sibling threads.
Since we're talking about others' "misconceptions", am I right that you weren't aware that in Lean you can do well-founded recursion using a classical termination proof and get a computable function? That's the sort of thing I was trying to make sure we were on the same page about. I brought up LEM to feel out your position because that's what plenty of people mean by constructivism, and I honestly have not been able figure out what constructivism means to you and what sorts of concrete benefits you think constructivist proofs give you. If you're able to interleave a construction with classical reasoning arguing that the construction is well founded, then what is it that the CS and program verification folks are missing from Lean? That's what I'm missing here. (And if there's a good concrete answer the folks who develop Lean itself would be interested.)
To be clear, there are plenty of things missing from Lean that would definitely help CS and program verification folks, but as far as I know none of these features touch upon constructivism.
While I wouldn't say that there are no benefits to constructive reasoning, and I support the work people do on foundations, all these Decidable instances really add to the difficulty that people experience learning to do mathematics in Lean.
I'm sure there are people out there doing program verification that think about it using topos theory, but I suspect it is very few. That's hardly a justification to put resources into making proofs be constructive mathlib-wide.
The basic definition of wellfoundedness in Lean and mathlib is in fact constructive, same as in Coq. There are technical implications to being able to reason 'classically' about well-founded relations but they apparently are fairly minor, having to do with the "no infinite chain" definition of well-foundedness. https://proofassistants.stackexchange.com/questions/1077/wel... -- and actually, this requires a form of choice in addition to classical reasoning. Mathlib knows about this, it's in https://leanprover-community.github.io/mathlib_docs/order/or... .
I think I've emphasized above that constructivism is more about preserving and leveraging the implied computational content of existing proofs than insisting on constructive proofs everywhere - the latter in particular adds a lot of complexity even when it's actually feasible, so I agree that Mathlib shouldn't do that. But when something can be phrased as a computable construction or decision procedure "for free" simply by relying on straightforward direct reasoning and eschewing classical assumptions, it makes sense to enable that, including as part of a mathematical library.
I'm by no means an expert on interactive theorem proving -- I'm just a Lean user who knows a lot about how Lean works, and I got into it while procrastinating finishing my math PhD.
I think writing novel theorems directly in Lean is similar to the idea of architecting a large software system by jumping right into writing code. Certainly for small things you can just code it, but past a certain level of complexity you need a plan. Paper is still a useful tool in programming, and theorem proving is just a kind of programming.
During my PhD I occasionally used Lean to check whether some ideas I had made sense. They were more of a technical flavor of idea than what I'd usually think of a novel theorem being, and where Lean was helpful was in keeping track of all these technical details that I didn't trust myself to get completely right.
Speaking of roadmaps, one tool large formalization efforts use are blueprints, an idea developed by Patrick Massot. Here's the one for Tao's project: https://teorth.github.io/pfr/blueprint/dep_graph_document.ht... On this webpage there are also LaTeX versions of every main theorem and links to their Lean equivalents. The original paper was not formal enough to directly formalize in a theorem prover, and this blueprint represents the non-trivial additional creative effort that went into just planning out how to code it.
There's also an AI-driven free-form text tool that recently appeared: https://www.moogle.ai/
A funny thing about libraries saving people from wasting effort redoing things is that, while on one hand this is true that having centralized repositories of knowledge invites people to contribute to them and reuse their work, on the other what can happen is that philosophical differences can cause schisms and result in n-fold duplicated work. (Consider the endless web frameworks out there!) Even Lean could be such a duplication since there's already Coq, Agda, HOL, etc. So far, Lean's mathlib has managed to remain a single project, which I think is an important experiment in seeing if all of mathematics can be unified in a common language. Mathlib is a strange compared to normal software libraries, since its goal is to eventually consume every downstream project that's ever created. If this model can continue to scale, then that makes the question of whether something's been formalized yet or not simpler. Though, even now, there's so much code in mathlib that sometimes you need to go to leanprover.zulipchat.com and ask, and someone familiar with the right corner of the library will usually answer reasonably quickly.
One silver lining for n-fold duplication is that it's usually not a complete reinvention, and this gives variants of ideas an opportunity to explored that otherwise would succumb to the status quo. I feel that even when someone re-formalizes a theorem, it's an additional opportunity to evaluate how this knowledge fits into the larger theory.
> on the other what can happen is that philosophical differences can cause schisms and result in n-fold duplicated work
It would indeed be nice to have a principled answer to the issue of differing foundations, including e.g. working in ZF(C) vs. type theory. (The dirty little secret there - and what makes mathlib more popular than it might be otherwise - is that while most practicing mathematicians appeal to ZF(C) as their default foundation they don't actually work with it in a formal sense, and the way they do reason about math is a lot easier to account for in type theory. See also https://math.andrej.com/2023/02/13/formalizing-invisible-mat... ) I guess the closest thing we have to that is logical frameworks, which are designed to accommodate even very weak/general foundations at the outset and "bridge" theorem statements and proofs across when needed.
( https://westurner.github.io/hnlog/#story-38435908 Ctrl-F "TheoremQA" (to find the citation and its references in my local personal knowledgebase HTML document with my comments archived in it), manually )
> TIL i^4x == e^2iπx ... But SymPy says it isn't so (as the equality relation automated test assertion fails); and GeoGebra plots it as a unit circle and a line, but SymPy doesn't have a plot_complex() function to compare just one tool's output with another.
That's a lot of links to take in, and I don't do really anything with ML, but feel free to head over to https://leanprover.zulipchat.com/ and start a discussion in the Machine Learning for Theorem Proving stream!
My observation at the moment is that we haven't seen ML formalize a cutting-edge math paper and that it did in fact take a lot of experience to pull it off so quickly, experience that's not yet encoded in ML models. Maybe one day.
Something that I didn't mention is that Terry Tao is perhaps the most intelligent, articulate, and conscientious person I have ever interacted with. I found it very impressive how quickly he absorbed the Lean language, what goes into formalization, and how to direct a formalization project. He could have done this whole thing on his own I am sure. No amount of modern ML can replace him at the helm. However, he is such an excellent communicator that he could have probably gotten well-above-average results from an LLM. My understanding is that he used tools like ChatGPT to learn Lean and formalization, and my experience is that what you get from these tools is proportional to the quality of what you put into them.
Communication skills, Leadership skills, Research skills and newer tools for formal methods and theorem proving.
The example prompts in the "Teaching with AI" OpenAI blog post are paragraphs of solution specification; far longer than the search queries that an average bear would take the time to specify.
Is there yet an approachable "Intro to Arithmetic and beyond with Lean"? What additional resources for learning Lean and Mathlib were discovered or generated but haven't been added to the docs?
Perhaps to understand LLMs and application, e.g. the Cuttlefish algorithm fills in an erased part of an image; like autocomplete; so, can autocomplete and guess and check (and mutate and crossover; EA methods and selection, too) test [math] symbolic expression trees against existing labeled observations that satisfy inclusion criteria, all day and night in search of a unified model with greater fitness?
> If these tools become easy enough for mathematicians to use, they might be able to substitute for the often prolonged and onerous peer review process,
> This is a list of unusually long mathematical proofs. Such proofs often use computational proof methods and may be considered non-surveyable.
> As of 2011, the longest mathematical proof, measured by number of published journal pages, is the classification of finite simple groups with well over 10000 pages. There are several proofs that would be far longer than this if the details of the computer calculations they depend on were published in full.
> In the philosophy of mathematics, a non-surveyable proof is a mathematical proof that is considered infeasible for a human mathematician to verify and so of controversial validity.
I wonder how much of mathematics is locked away from us as it requires a level of intelligence we may never have. Why do we assume proofs should be simple little equations or even just a few pages? Elegance? Why should the universe be so elegant? Is it because elegant structures are more efficient and use less energy?
Edit: I know some proofs like Fermat's last theorem were like dozens of pages or more, but I realize that is still within the comprehension of a well trained and gifted human being.
Take for example the Standard Model Lagrangian (which is a sum of nonlinear fields and used to predict some but not all of particle physics (i.e. n-body gravity, superfluids, antimatter or not, Copenhagen interpretation or not, etc.)),
[ LaTeX rendering of the Standard Model Lagrangian, and other as-yet unintegrated equations ]
How elegant are these? Is it ever proven that they are of minimal complexity in their respective domains piece-wisely?
Learning of Entropy, I had hoped you know. But then that's just classical Shannon entropy, and not quite the types of quantum entropy described in for example the Quantum discord Wikipedia article, and then that's still not quite quantum fluidic complexity (with other field effects discounted, of course); so is there an elegant quantum fluidic thermodynamic basis for it all and emergence?
Quasiparticles display emergent behavior.
Virtual particles have or haven't mass independent of 2-body gravity.
Gravity alone sometimes produces mass, or photons at least.
Regardless, things have curl and Divergence.
And so Quantum Chaos: what can it predict? Can it can do quantum gravity effects in Superfluids at scale?
Progress in quantum CFD would require a different architecture to prove low error of a model for predictions in superfluids.
And so how many error-corrected qubits exist to simulate a gas fluid in space (in microgravity) is the limiting factor in checking the sufficiency of a grander unified model from here, too.
And also for how long qubits can be stored; we can save save a integer and a float for longer than human timescale with error corrected distributed storage networks, but we can't store the output wave function(s*) from quantum computer simulations for more than a second.
So prove it means QC, and that's not what I see here.
I doubt that he was intending a burn here. He's not talking about user interface issues or the like.
He's most probably, I would assume, talking about the well-known issue that formalisation tools require knowledge that a standard mathematician doesn't possess, such as the names of all the Lean tactics, how to locate the right theorem to use in mathlib (the Lean library), the necessity to write mathematics in a very formal language which is based on dependent type theory rather than based on sets (which is what most mathematicians are used to), etc.
Moreover, formalised mathematics does not work with natural language (and perhaps can't), and it will not accept informal or intuitive arguments. There are a lot of very fiddly things that have to be attended to. One can't assume that the reader is a skilled mathematicians who can easily fill in trivial details, as when writing a maths paper for expert readers.
But the Lean people are acutely aware of all of this. In my opinion he's not so much offering them much needed criticism so much as acknowledging one of the known barriers to mathematicians from outside the formalisation community getting into formalisation.
This is a major point of the article, that real, serious mathematicians have broken through that barrier recently, and it's not the first time. So things have improved enough to show that it is possible for seriously committed mathematicians to do this, even famous ones!
Elan makes it actually quite easy. It's among the better package managers out there, comparable to cargo.
There's a single bash script to install both the runtime and the package manager on a variety of platforms, and the integrated build system and native module bundler is both powerful and isomorphic, so all you need to learn is Lean. Without any prior knowledge of Lean I was able to patch existing build scripts to support platform specific C extensions in less than half an hour.
Yesterday I was wondering if it's possible to have an AI that can "recreate" the mathematics as we know it, and even go further and explore unexplored paths.
Though prduced by a video explaining how AI helped to find optimization in the computation of matrix multiplication.
You always hear "the map is not the territory". Which is to say, names are applied, models are asserted and all our fine ideas about the world are our own contrivance, not the world's.
Every map that is made leaves out an enormous amount of the territory because that detail is irrelevant to the map being used or created. The more detailed the map is, the more one should recognise just how much more detail is being ignored.
If you want to get close, just go to the territory itself and be there directly.
All too often, we use mathematics as a map for the actual world around and to keep that map manageable, we have to ignore much of the actual world.
Finding a proof is a search in a large space, akin to searching from abstractions to concretions. LLMs don’t do anything like this, and so you’re looking at the planning problem again. It’s not clear to me how framing this particular problem as a language problem is helpful in any case.
IIUC, any sensible way of "pairing up" these things will mean that anything you get out will be true. But the search might take millennia, and the outcome might be nothing (equivalently, "the LLM's conjecture is false").
The final state in chess is a single* state which yes, then branches out to N checkmate configurations and then N*M one-move-from-checkmates, and so on. (*Technically it's won/lost/draw.)
The equivalent final state in theorem proving is unique to each theorem so such a system would need to handle an additional layer-of-generalization.
Is this how some of the more advanced chess engines work, or even the not so advanced ones, where there's a point at which it stops searching the forward move tree in greatest depth, and instead starts searching backwards from a handful of plausible (gross move limit-bound) checkmate states looking for an intersection with a shallow forward search state?
Language models are currently the best AI systems that solve general reasoning problems, including mathematical ones. So it seems more than obvious to model the problem of finding proofs as a language problem.
I contributed a few trivial proofs to this project, and I tried enlisting GPT-4, Copilot, Moogle, and Morph Prover to help out (I did not try LLMStep or ReProver).
Out of these:
- GPT-4 answered maybe one or two questions about syntax
- Copilot autocompleted maybe 0.5%
- Moogle was helpful much more often but still often pointed to the wrong theorem in the right file; in most of those cases I could have just done go-to-def to get to the right file and scroll through it
Note that these results are actually very good! And Terence Tao seems to have a positive opinion as well. But we're very far away from automatically proving conjectures IMO.
I will say that Lean has great metaprogramming facilities to accept any type of AI innovation that might emerge. Currently I find the most helpful tactics to be `aesop?`, which is a classical search tactic that tries a bunch of substitutions and simplifications; and `exact?`, when you know there's a trivial proof but not sure of the name. But the UX for AI can be as simple as, for example, typing the `gptf` or `llmstep` tactic, and that is fully hooked into the Lean state (goals, hypotheses, etc). So there's a lot of opportunity for incremental progress in improving existing workflows (which consist of dispatching trivial manipulations with search/simplification tactics)
> Logical reasoning, i.e., deductively inferring the truth value of a conclusion from a set of premises, is an important task for artificial intelligence with wide potential impacts on science, mathematics, and society. While many prompting-based strategies have been proposed to enable Large Language Models (LLMs) to do such reasoning more effectively, they still appear unsatisfactory, often failing in subtle and unpredictable ways. In this work, we investigate the validity of instead reformulating such tasks as modular neurosymbolic programming, which we call LINC: Logical Inference via Neurosymbolic Computation. In LINC, the LLM acts as a semantic parser, translating premises and conclusions from natural language to expressions in first-order logic. These expressions are then offloaded to an external theorem prover, which symbolically performs deductive inference. Leveraging this approach, we observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate. On ProofWriter, augmenting the comparatively small open-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5 and GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%, respectively. When used with GPT-4, LINC scores 26% higher than CoT on ProofWriter while performing comparatively on FOLIO. Further analysis reveals that although both methods on average succeed roughly equally often on this dataset, they exhibit distinct and complementary failure modes. We thus provide promising evidence for how logical reasoning over natural language can be tackled through jointly leveraging LLMs alongside symbolic provers. All corresponding code is publicly available at https://github.com/benlipkin/linc
But "theorem provers" don't prove anything, they just check existing proofs for validity. That's a much less demanding task than coming up with a proof in the first place.
Open conjectures are usually solved by inventing new mathematical frameworks that embed the context of the existing question. Inventing new definitions is trivial, but there's no rules that constrain the space of moves. And inventing _interesting_ new definitions is extremely hard. I wouldn't bet on LLMs doing this ever.
What would be nice is an LLM to aid in translation between mathematics and a theorem proving language like Lean. That might fuel a revolution in how mathematics is done.
People have already trained models to assist suggestion tactics. They then linked it up to ChatGPT to interactively solve proofs.
In this scenario, ChatGPT asks the model for tactic suggestions, applies it to the proof and uses the feedback from Lean to then proceed with the next step.
FYI, The programmatic interface to Lean was written by an OpenAI employee who was on the Lean team a few years ago.
Also, check out Lean’s roadmap. They aspire to position Lean to becoming a target for LLMs because it has been designed for verification from the ground up.
As math and compsci nerds contribute to mathlib, all of those proofs are also building up a huge corpus that will likely be leveraged for both verification and optimization.
If AI can make verification a lot easier, then we’re likely going to see verification change programming similarly to the way it changed electronics.
> it has been designed for verification from the ground up.
It has a lot of places where you assume things work in a certain way and a lot of ways to assume it (and you can prove `false` if any of those assumptions aren't correct). I'd say it's a lot more geared towards formalizing math than verification.
There's some more things, e.g., the `Expression -> C-code` "compiler" isn't verified and neither is the C-compiler. Those could be fixed with some work (that nobody is working on currently, as far as I can tell). But the first issue seems pervasive. I don't think Lean will become a standard tool for software verification or significantly advance the discipline. Of course, I'd love to be wrong.
If the foundations are workable (and they seems to be, though IIRC there is a quotienting construct that's quite ad hoc and that people have complained about), the fact that you can additionally "assume" other things is fairly irrelevant, these are no different than extra axioms.
It's not useful to prove that "your program does what you want, provided the following 10000 functions implemented in C do what we assumed they do". Almost certainly at least one of them doesn't.
There isn't even a straightforward (let alone verified) way to list all assumptions of this type for some proof (you can print the axioms that a definition depends on, but that's not the same thing). It's possible to prove `false` in Lean using functions you'd normally use to make pretty much any useful program.
For trivial stuff, it might be able to even now (but those can probably be solved algorithmically as well). For more complex stuff, they are not even scratching the surface, I would believe — LLMs can’t really do long, complex thoughts/inferences, which are essential for coming up with proofs, or even just to solve a sudoku (which they can’t do — no, writing a program (which was likely part of its training set) and executing that doesn’t count).
It will never happen. Some open math problems are equivalent to the halting problem; and we know that finite computers can’t solve the halting problem.
Solving sudoku is a pure exercise in proving what the solution is.
But most sudoku tend to be unsatisfying in that you're forced to use trial and error. The more formal term would be "proof by cases": once your fun proving has reached its limits, you've shown that a particular cell must contain either 5 or 7. If you show that that cell containing a 5 will lead to a contradiction, you've created an impeccable proof that the cell contains a 7.
The distinction you've stated would only come into play if the number of possible solutions to a puzzle differs from one; if there are no solutions, you have inconsistent premises and your proofs will be valid but not sound. If there are several solutions, then guessing and being right will, as you describe, create a proof that a solution exists, but it won't prove that that solution is unique or characterize what valid solutions are like, both of which are common mathematical goals.
Proof verification, though still hard to automate, seems at least tractable. I'm not a mathematician by any means, but something tells me automated proof search, in the general case, would require solving the halting problem, which means it is impossible even in principle.
In the general case, proof search is obviously impossible, for the reason you state. But the case need not be general. A proof search machine that takes in a theorem and outputs `"proof: ..." | "antiProof: ..." | "idk"` would be quite powerful indeed.
More seriously, the benchmark would be something along the lines of "and it is better at finding proofs than humans are", like how chess engines haven't solved chess, but they are so much better than humans that they always win in a competition between the two.
There's no good reason to think that humans are capable of finding proofs that are hard to find algorithmically.
Clearly the prevalence of idk's in the output is the metric for determining power.
As for the human-vs-computer, I think the most powerful solutions will be those that combine human and computer prowess, similar to how a Human+Stockfish beats Stockfish^. To that end I've been working on Lean search tools that keep a human firmly in the loop. Terence has actually called out a tool I helped make as being quite helpful in his work, which is rather neat.
^ I heard this to be true at some point, but cannot find any reference on it at the moment.
> similar to how a Human+Stockfish beats Stockfish^
> ^ I heard this to be true at some point, but cannot find any reference on it at the moment.
Known as centaur chess (amongst other things). It was certainly the case at one point. I strongly suspect that it isn't still the case, but it's hard to say for absolute certainty because it never developed a serious competitive community like engine and non-engine chess has. I can't find anyone listing any results post 2017 [1]. It was barely the case in 2017, and Stockfish has improved a lot since then [2].
[2] E.g. here's someone running stockfish 14 (current from July 2 2021 - April 18 2022) against stockfish 8 (current from November 1 2016 to February 1 2018). Stockfish 14 won 61 games and drew 3: https://www.reddit.com/r/chess/comments/oov714/64_game_match... (PS: They don't document time controls and I'm not sure how scientific this test was, but it matches with what I would expect, so eh)
Not to discourage you from your project with lean. At the very least I would expect there to be a transitional period, likely of decades, where computer+human is better than computer.
Given that algorithm design is very much a field of mathematics, it’s in principle (and probably in practice) not too difficult to adapt the infrastructure you mentioned to support a workflow where you say “I want an algorithm that on input x computes y” (more formally, ofc), and the tool outputs exactly that for you
Math proofs are of NP complexity. If you had access to a non deterministic Turing machine you could enumerate all possible proofs of a given length and check them all in poly time.
That does not say anything about LLMs though. Personally, I believe they could be quite helpful to mathematicians in a way similar to copilot for software programming.
This is only true for very specific kinds of proofs, and doesn't apply to stuff that uses CiC like Lean 4. This is because many proof steps proceed by conversion, which can require running programs of extremely high complexity (much higher than exponential) in order to determine whether two terms are equal. If this weren't the case, it would be much much easier to prove things like termination of proof verification in CiC (which is considered a very difficult problem that requires appeal to large cardinal axioms!). There are formalisms where verifying each proof steps is much lower complexity, but these can be proven to have exponentially (or greater) longer proofs on some problems (whether these cases are relevant in practice is often debated, but I do think the amount of real mathematics that's been formalized in CiC-based provers suggests that the extra power can be useful).
This is all very interesting but it seems that we're just taking different views on what is the instance size. If it is the length of the theorem statement in some suitable encoding and the goal is to find a proof, of any possible length, then yeah, this is way too hard.
I'm taking the view that the (max) length of the proof can be taken as a parameter for the complexity because anything too long would not have any chance of being found by a human. It may also not be trusted by mathematicians anyway... do you know if the hardware is bug free, the compiler is 100% correct and no cosmic particle corrupted some part of your exponential length proof? It's a tough sell.
I'm talking about the proof length, not the length of the proposition. In CiC, you can have proofs like (for example) "eq_refl : eq <some short term that takes a very long time to compute> <some other short term that takes a very long time to compute>" (where both terms can be guaranteed to terminate before you actually execute them, so this isn't semidecidable or anything, just really slow). In other words, there exist short correct proofs that are not polynomial time to verify (where "verify" is essentially the same as type checking). So "find a proof of length n for this proposition" is not a problem in NP. You can't just say "ah but proofs that long aren't meaningful in our universe" when the proof can be short.
Could you give me a reference? This is not something I'm familiar with.
Can you claim that this equivalence proof is not in NP, without requiring this specific encoding? I would be very surprised to learn that there is no encoding where such proofs cannot be checked efficiently.
> Can you claim that this equivalence proof is not in NP, without requiring this specific encoding? I would be very surprised to learn that there is no encoding where such proofs cannot be checked efficiently.
There are encodings where such proofs can be checked efficiently, such as the one that enumerates every step. Necessarily, however, transcribing the proof to such an encoding takes more than polynomial time to perform the conversion, so the problems aren't equivalent from a P vs. NP perspective (to see why this is important, note that there is a conversion from any 3SAT formula to DNF with
a worst-case exponential size blowup, where the converted formula can be solved in LINEAR time! So the restriction to polynomial time convertible encodings is really really important in proofs of reductions in NP, and can't be glossed over to try to say two problems are the same in the context of NP-completeness).
Yes, executing a calculation as part of a proof after separately proving that it is the "right" calculation to solve the problem is quite a common way of proceeding. Even common things like algebraic simplifications of all kinds (including solving equations, etc.) can be seen as instances of this, but this kind of thing has notably come up in the solution of famous problems like the 4-color problem, or the Kepler conjecture - both of which have been computer-verified.
A proof longer than the size of the universe is also pretty useless and probably not something we need to worry about.
Like i guess you are saying we couldn't really use such a machine to determine whether a certain conjecture just has a very long proof/disproof or is actually undecidable. Which sure, but umm i think that is the sort of problem most mathematicians would love to have.
The real reason non-deterministic turing machines can't help us is that they dont actually exist.
Of course it would, you would enumerate lengths too. If the lengths need to be larger than polynomially bounded then we can be sure it would never be found by a human anyway.
> If the lengths need to be larger than polynomially bounded then we can be sure it would never be found by a human anyway.
We cannot, as a human might be able to intuitively devise proof that might be quite large in the target language but easily constructible in a separate one.
Then in that target language, found by a clever human, you could do the same type of enumeration...
My whole point is that humans simply cannot process/create by themselves any truly long proof (we can obviously create a process for that). Therefore enumeration puts everything achievable by humans in the NP complexity. Feel free to disagree, this is not so much a theorem as more of a thesis.
> Therefore enumeration puts everything achievable by humans in the NP complexity.
This is a severe misunderstanding of NP. Hindley Milner type inference is worst case doubly exponential, for example, and we solve that all the time. We just happen to rarely hit the hard instances. I think the statement you're trying to make is something more along the lines of https://en.wikipedia.org/wiki/Skolem%27s_paradox, which is definitely fascinating but doesn't have much to do with P vs NP. Notably, this paradox does not apply to higher order logics like CiC with more than one universe, which lack countable models (which is why your intuitions about provability may not apply there).
No misunderstanding about NP here for sure. As I said, this is about as much of a thesis as Church Turing is about what can be computed.
I have no clue about CiC, lean and whatnot. It was never my field and I don't doubt there can be some very weird things you can do with some fancy logic models that are rarely discussed by non logicians.
What I'm claiming is that anything a human could prove without a computer could be done by a non deterministic machine in poly time under a very reasonable assumption of proof length. I'm baking in the assumption of proof length, so I can claim something about NP...
Let me put it this way: if you can write a paper with your proof and I can check it with only my brain without a computer helping me, then a poly time algorithm could also check that proof. Unless you believe something exotic about how the brain computes, how would it not be the case?
> Let me put it this way: if you can write a paper with your proof and I can check it with only my brain without a computer helping me, then a poly time algorithm could also check that proof. Unless you believe something exotic about how the brain computes, how would it not be the case?
What does "without a computer" have to do with anything, and what do you mean by saying "in polynomial time" when restricted to a specific instance of the problem? To talk about solutions being checkable in polynomial time in the size of the input (which is what NP actually means, more or less, since every problem in NP can be encoded as a 3SAT formula with at most polynomial blowup), you have to actually specify some (infinite) class of problems first, otherwise "in polynomial time" doesn't really mean anything. Then, you have to specify a polynomial time verification algorithm for every instance in the class.
T he problem is that "proofs of length N" in CiC doesn't have a polynomial time verification algorithm (for checking proof correctness) in the size of the proof (and as I mentioned elsewhere, attempts to translate to an encoding that does can result in exponential or much worse blowup in the proof size, which does not allow you to put them in the same complexity class in this context). Moreover, it's very unclear how you'd even begin to carve out the complexity class of "proofs that can be verified quickly" because that pretty much requires you to write an algorithm to determine the complexity of the proof, which includes being able to determine the complexity of arbitrary functions in CiC (which as you can imagine, is not easy in such an expressive system! AFAIK, the totality proof requires axioms beyond ZFC, which means that the proof is not constructive and cannot provide any specific upper bound on the number of steps taken by an expression).
This means that there is no polynomial function f(n) such that a nondeterministic Turing Machine could always terminate after f(n) steps and (correctly) report "yes" when there is a proof with n or fewer steps and "no" otherwise, and it is rather unlikely that there is a decision procedure to restrict the proofs to "proofs that run in polynomial time in the size of the proof input" so you could practically apply your claim just to that subset of proofs (though I don't know whether the absence of such a decision procedure has actually been proven, but generally speaking problems like this are very rarely decidable for interesting programming languages like CiC, even if they are total). It does not mean that a human, or human with a computer, couldn't come up with a clever short proof in O(n) tokens that takes longer than f(n) steps to check on some particular instance of the problem for some particular f... because why would it?
I think what you're trying to say (to make your argument hopefully what you had in mind) is that if humans can verify a proof in "a reasonable amount of time", then a nondeterministic Turing Machine could produce the proof in "a reasonable amount of time." Which might well be true, but the point is that in CiC, how long the proof takes to verify has very little to do with the size of the proof, so this claim has nothing to do with proof verification being in P (and therefore, has nothing to do with proof search being in NP, which it isn't for CiC). More to the point, this is pretty much true of any problem of interest--if it takes more than a reasonable amount of time in practice to verify whether a solution is correct (where verification can include stuff like "I ran polynomial-time UNSAT on this instance of the problem" in a universe where P=NP), it isn't practically solvable, regardless of what fancy methods we used to arrive at the solution and regardless of the length of the solution. So I don't find this particularly insightful and don't think it says anything deep about humanity, mathematics, or the universe.
Anyway, I'm mostly just pointing out that even if P=NP, you can't really leverage that to discover whether a short proof exists in CiC in polynomial time. More concretely, there isn't an algorithm to convert an arbitrary CiC formula to a 3SAT formula without greater than polynomial size blowup (we know this because we can run non-elementary algorithms in CiC!). So even if we had a polynomial time 3SAT algorithm, which would solve a great many things, it wouldn't solve proof synthesis, which is another way to phrase "proof synthesis in CiC isn't in NP."
First of all, thank you for a thorough response. I'll need to take time to read it (and the refs in your other reply) with the care it deserves.
Basically I'm talking about the subset of proofs that could be done with pen and paper and pass a careful refereeing process. And you got that interpretation in your reply. You say that it is not insightful and that is of course an opinion you are entitled to.
To me it is an interesting observation that NP is a complexity class powerful enough to basically obsolete us. The context of this conversation/thread is using Lean to formalize a proof and whether in the near future, integrating AI models would potentially give us novel tools for discovering new mathematics. We routinely solve NP hard problems in practice for many instances, so I think drawing this parallel here was relevant.
I do agree with you that there would still be out of reach proofs even with an efficient algo for NP, but as some other person replied, it would be a nice problem to have...
> Basically I'm talking about the subset of proofs that could be done with pen and paper and pass a careful refereeing process. And you got that interpretation in your reply. You say that it is not insightful and that is of course an opinion you are entitled to.
Well, that part was me being subjective, but the objective part is that this subset of proofs doesn't necessarily have a verification algorithm "in NP." A very short proof that took time doubly exponential (in some sense) in the size of the proof could still feasibly be solved with pen and paper, because the proof was short even though the verification time was very long (or the verification was long due to a technicality--I describe an example below), depending on the exponent and the proof size. While at the same time, a short proof that took time polynomial (in some sense) in the size of the proof could take longer than the age of the universe to verify and be out of reach for both humans and computers, depending again on the degree of the polynomial and the proof size (and, since again proof verification in CiC is not reducible to 3SAT, proof search that incorporates non-polynomial functions plausibly might not even benefit from any speedup if P = NP!). I say "in some sense" because it's unclear exactly what function f to fix, since pretty much any decidable, total function you can think of grows more slowly than "max steps needed to typecheck a term of length n in CiC" (outside of other meta-level functions that similarly don't provide practical bounds on running time).
(To give a concrete example of where human verification might be much faster than computer verification of the same step, human mathematicians can easily use "tricks" when they know two results will be the same, without actually proving them equivalent, modifying the proof, or imparting extra knowledge to a verifier. For example, in Coq, by default you use Peano naturals, which are convenient for proofs but have very poor complexity--which means exponentiation actually takes exponential time to compute with them. Humans verifying a proof step will recognize this and just do exponentiation the normal way regardless of which inductive definition of naturals is being used, while for a computer to do so the person writing the proof has to deliberately switch out Peano naturals for digital naturals. It is entirely possible that the proof size when using non-Peano nats will be larger than the proof size using Peano nats, since induction on non-Peano nats is much more tedious, while the verification time is exponentially lower using non-Peano nats! So this is an example where proof search for a short proof and proof search for a fast to verify proof may be in conflict. It's also an example of somewhere where even though it's not explicitly stated in the proof, and even though different formulations of nats can't exactly be proved "fully equivalent" in CiC, humans verifying a proof understand that "in spirit" it doesn't really matter whether you use Peano nats or not.
Indeed, if you work with formal verification a lot, you will find you spend significant amounts of time optimizing your proofs, and learning tricks like whether to make an input to an inductive definition indexed or parametric, which are things that human mathematicians don't care about at all but which are very significant to the machine verifier).
In other words, I don't think being practical to verify can be described completely by either the complexity of verification or the size of the proof, and I don't think it's that useful to conflate this with a problem being in NP since (1) CiC proof synthesis isn't actually in NP, or even in the polynomial hierarchy, and (2) being in NP has a much more precise technical definition that only somewhat overlaps with "practical to verify."
There’s always another encoding, all you prove is that there doesn’t exist a proof in the encoding you selected less than the length you specified, it does nothing at all to disprove the existence of a short proof in general.
Right, and this is also the current status of handmade mathematics. All we know is that we did not find a proof yet with everything that has been tried. This typically means that a problem is harder the more stuff has been thrown at it and failed.
A non deterministic Turing machine would absolutely crunch through math theorems, I really don't know why there is so much push back against what I am stating. Basically, if you had such a machine, there essentially would no longer be any room left for proving stuff the manual way, you would get completely outclassed by them.
The real consequence for me though is that this is a strong evidence (call it faith) against P=NP.
> A non deterministic Turing machine would absolutely crunch through math theorems
Can you formalize that statement in any useful way? It seems you believe proof validation to be in NP, when it most certainly is not. Accordingly an N-TM would be of relatively little power against the problem.
It's rather difficult to provide a good formalization but let me give it a shot.
Suppose that mathematicians write papers with pen and paper in a subset of natural language without ambiguity (you wish!). What they write as proofs can be checked thoroughly for correctness by some other human (referee) otherwise it will not be published (again, we wish!).
Now for a paper with N tokens (or chars or whatever measure of length) how much computation can the referee reasonably dedicate to the task of checking that proof. Unless you believe something special about the human brain, I'd claim that poly time on N is a reasonable assumption. Now this means that checking is poly(N) for human checkable proofs of length N.
This is my argument that non deterministic Turing machines would absolutely crunch through everything that we could do in the current model of mathemticsl research.
The class of instances is that of theorems which a human could write a hand made proof and have it verified by another human. Recall that we are discussing formalization and what AI could achieve in this space, so I think formulating what we currently can achieve is a nice contrast with what could come in the future.
10 years ago a crack professor unit was defunded by an academic tribunal for misconduct they didn't commit. These men promptly escaped from a maximum teaching schedule to the research underground. Today, still wanted by the activist community, they survive as proovers of fortune. If you have a problem that no one else can solve, and if you can find them, maybe you can hire The A-Team.
My favourite bit of every episode was the iconic ὅπλισις when Prof. Dr. P.H.D. Baracus first adds structure preserving maps to every object in sight then uses the snake lemma to weld them together with connecting homomorphisms. "I pity the fool who doesn't chase diagrams"
I got a little sick of the recurring joke where Prof. Dr. Baracus would complain about flat, two-dimensional, infinite surfaces, and then the team would trick him into using one anyway.
Goddamn that's a long way to go to make that joke.
Worth the journey, but those are brain cells I haven't used in decades. I really wish I could garbage-collect some of the 1980s pop culture neurons and put them to better use.
> Tao then kicked off an effort to formalize the proof in Lean, a programming language that helps mathematicians verify theorems. In just a few weeks, that effort succeeded. Early Tuesday morning of December 5, Tao announced that Lean had proved the conjecture without any “sorrys” — the standard statement that appears when the computer can’t verify a certain step. This is the highest-profile use of such verification tools since 2021, and marks an inflection point in the ways mathematicians write proofs in terms a computer can understand. If these tools become easy enough for mathematicians to use, they might be able to substitute for the often prolonged and onerous peer review process, said Gowers.