I like this quote: "sometimes, performance is about more than just getting work done a little faster. When something takes a lot less time, it changes the way you interact with the computer. A process that takes hours goes on the nightly build server; a process that takes minutes might be a compile that runs on your local machine; a process that takes seconds is a progress bar; and a process that takes milliseconds might happen in an editor between keystrokes."
Most verification systems work in terms of hours or days. Changing the order of magnitude means that verification can happen constantly.
The paper mentions that it verifies Metamath set.mm, one of the largest bodies of formalized mathematics, in less than 1 second. Here's some more info for the curious:
* My Gource visualization of the development of set.mm over time: https://www.youtube.com/watch?v=XC1g8FmFcUU
* My video "Metamath Proof Explorer: A Modern Principia Mathematica" (which gives overall background/context of Metamath): https://www.youtube.com/watch?v=8WH4Rd4UKGE
* Metamath Proof Explorer (MPE) aka "set.mm" - the home page for this body of formalized mathematics: http://us.metamath.org/mpeuni/mmset.html
> Most verification systems work in terms of hours or days.
What kind of verification usually takes hours? I just checked how long it takes to verify Raft using Coq and Verdi (a pretty sizeable proof), and the complete pipeline takes around 25 minutes (with Docker container creation, fetching dependencies, and what not). Verification of proof terms is a relatively easy task for a computer, as easy as type checking. Coming up with the proofs, on the other hand, is hard.
The 25 minute value is only the time to verify a single Coq proof. I'm measuring the time to verify ALL proofs of an entire system.
The Metamath Rust verifier normally verifies 23,000+ proofs at a time. The actual verification time is just under 1 second for the entire set. When you include the time for container creation, fetching dependencies, and so on, it typically takes ~48 seconds to verify the entirety of set.mm. Here's an example: https://travis-ci.org/metamath/set.mm/builds/602369640
You could run 23,000 Coq jobs in parallel. I wouldn't want to pay for it :-). And it would complete verification of all the proofs in 25 minutes, vs. the 200msec of Metamath Zero or the ~1 second of the Metamath Rust implementation.
That's not to say that Coq is a terrible system. Coq is really interesting! But verification time is not something it has prioritized.
> The 25 minute value is only the time to verify a single Coq proof.
Logically it is a single proof, but technically it consists of quite a few theorems and lemmas, so this is not a valid comparison. Also, proofs in Coq are typically structured and pretty high-level because of all the automation, while metamath proofs look like core dumps :)
${
$d A a b v w x y $. $d B a b v w x y $. $d F a b v w x y $.
$( The ` 2nd ` (second member of an ordered pair) function restricted to a
one-to-one function ` F ` is a one-to-one function of ` F ` onto the
range of ` F ` . (Contributed by Alexander van der Vekens,
4-Feb-2018.) $)
f1o2ndf1 $p |- ( F : A -1-1-> B -> ( 2nd |` F ) : F -1-1-onto-> ran F ) $=
( vx vy va vv vb vw c2nd syl cv cfv wceq wi wcel wa wrex ex com23 wf1 crn
cres wfo ccnv wfun wf1o f1f fo2ndf weq wral f2ndf cxp wss fssxp cop ssel2
elxp2 sylib anim12dan fvres adantr ad2antlr eqeq12d op2nd eqeq12i funopfv
wf vex f1fun anim12d eqcom biimpi eqeqan12d simpl anim12i f1veqaeq sylan2
opeq12 syl6 com14 syl6bi pm2.43i syld impcom syl5bi sylbid adantl adantlr
com13 com12 wb eleq1 bi2anan9 anbi2d fveq2 simpllr imbi12d imbi2d 3imtr4d
simpr rexlimdvva rexlimivv imp mpcom ralrimivv dff13 df-f1 simprbi dff1o3
sylanbrc ) ABCUAZCCUBZJCUCZUDZXNUEUFZCXMXNUGXLABCVHZXOABCUHZABCUIKXLCBXNU
AZXPXLCBXNVHZDLZXNMZELZXNMZNZDEUJZOZECUKDCUKXSXLXQXTXRABCULKXLYGDECCCABUM
ZUNZXLYACPZYCCPZQZYGOXLXQYIXRABCUOKYIYLXLYGYIYLXLYGOZYAFLZGLZUPZNZGBRFARZ
YCHLZILZUPZNZIBRHARZQYIYLQZYMYIYJYRYKUUCYIYJQYAYHPYRCYHYAUQFGYAABURUSYIYK
QYCYHPUUCCYHYCUQHIYCABURUSUTYRUUCUUDYMOZYQUUCUUEOZFGABYNAPZYOBPZQZYQUUFUU
IYQQZUUBUUEHIABUUJYSAPZYTBPZQZQZUUBUUEUUNUUBQZYIYPCPZUUACPZQZQZXLYPXNMZUU
AXNMZNZYPUUANZOZOZUUDYMUUNUUSUVEOZUUBUUIUUMUVFYQUUSUUIUUMQZUVEUURUVGUVEOY
IUURUVGUVEUURUVGQZUVBXLUVCUVHUVBYPJMZUUAJMZNZXLUVCOZUVHUUTUVIUVAUVJUURUUT
UVINZUVGUUPUVMUUQYPCJVAVBVBUUQUVAUVJNUUPUVGUUACJVAVCVDUVKGIUJZUVHUVLUVIYO
UVJYTYNYOFVIGVIVEYSYTHVIIVIVEVFUVHXLUVNUVCUVGUURXLUVNUVCOZOXLUURUVGUVOXLU
URYNCMZYONZYSCMZYTNZQZUVGUVOOXLCUFZUURUVTOABCVJUWAUUPUVQUUQUVSYNYOCVGYSYT
CVGVKKXLUVGUVTUVOUVNUVGUVTXLUVCUVNUVGUVTUVLOOUVTUVNUVGUVNUVLUVTUVNUVPUVRN
ZUVGUVNUVLOOUVQUVSYOUVPYTUVRUVQYOUVPNUVPYOVLVMUVSYTUVRNUVRYTVLVMVNXLUVGUV
NUWBUVCXLUVGUVNUWBUVCOOXLUVGQZUWBUVNUVCUWCUWBFHUJZUVOUVGXLUUGUUKQUWBUWDOU
UIUUGUUMUUKUUGUUHVOUUKUULVOVPABYNYSCVQVRUWDUVNUVCYNYOYSYTVSSVTTSWAWBWAWCW
ATWDWJWETWFWGTSWHWKWIVBUUOYLUURYIUUNYJUUPUUBYKUUQYQYJUUPWLUUIUUMYAYPCWMVC
YCUUACWMWNWOUUOYGUVDXLUUOYEUVBYFUVCUUNUUBYBUUTYDUVAYQYBUUTNUUIUUMYAYPXNWP
VCYCUUAXNWPVNUUOYAYPYCUUAUUIYQUUMUUBWQUUNUUBXAVDWRWSWTSXBSXCXDXESTXEXFDEC
BXNXGXKXSXTXPCBXNXHXIKCXMXNXJXK $.
$}
> Logically it is a single proof, but technically it consists of quite a few theorems and lemmas, so this is not a valid comparison.
I think it's 100% valid; indeed, the Metamath and Metamath Zero verifiers are verifying much more. I'm measuring the time to verify ALL proofs of an entire system, including every theorem and lemma. The Coq measure you're quoting is the time to verify just a single proof with its lemmas. Coq is slower at verifying "a single proof of a theorem including all its lemmas" (though a complex one) than Metamath is at verifying "proofs of all theorems including all lemmas". To be fair, Coq is not designed to do this quickly, so that should not be surprising.
> Also proofs in Coq are typically structured and pretty high-level because of all the automation, while metamath proofs look like core dumps :)
You're showing the "compressed" format. That's simply its internal storage format, which is optimized to reduce storage space. Humans do not need to work in that format (and usually don't). Normally, for final display, the proofs are rendered as HTML on the Metamath Proof Explorer website.
Note that in the HTML display every step is justified by a hyperlink; you can click on the hyperlink to see the justification (recursively). For example, step 3 uses the justification "syl"; if you click on "syl" you'll find that it says that if ⊢ (𝜑 → 𝜓) and ⊢ (𝜓 → 𝜒) then ⊢ (𝜑 → 𝜒). The theorem syl is itself proved, and you can click on its links to see how it derives all the way back to definitions and axioms.
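For reference, here is roughly how syl is stated in set.mm (quoting from memory, with the comment and the proof body elided); the $e lines are the hypotheses and the $p line is the conclusion:
${
  syl.1 $e |- ( ph -> ps ) $.
  syl.2 $e |- ( ps -> ch ) $.
  syl $p |- ( ph -> ch ) $= ... $.
$}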
If you just want plain text, you can see the proof in text format by installing "metamath" and running this:
> metamath 'read set.mm' 'show proof f1o2ndf1'
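And if you want to re-verify every proof in the database yourself, the same tool's "verify proof" command should do it (this classic C implementation takes longer than the Rust verifier mentioned above, but it performs the same verification), e.g.:
> metamath 'read set.mm' 'verify proof *' 'exit'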
As far as "structured" goes, I think structuring proofs is necessary for any system. A complex proof in Metamath is typically built up from axioms, previous proofs, and lemmas as needed. I think that's necessary for any math verification system.
It goes without saying that metamath is not designed to be read from the source directly. To be perfectly fair, Coq isn't either; you can kind of get the gist but to really understand what's going on you have to start up Coq and look at proof states. Metamath as it's meant to be seen looks like this: http://metamath.tirix.org/f1o2ndf1.html .
I agree that calling it "one proof" is a mischaracterization, but comparing all of Raft to all of set.mm is probably about right. I would guess set.mm is 10-100 times larger than Raft in terms of number and complexity of actual proof content, but it checks in a fraction of the time, in part because of that "core dump". You might think it ugly, but have you ever looked at a .vo file? The difference is metamath checks its 'cache' in to version control.
> The difference is metamath checks its 'cache' in to version control.
As I said in the initial post, there is an entire world of difference between checking the proof term and, well, proving. Coq can dump proof terms that can also be verified relatively quickly, but typically people prefer to run proof scripts in CI. And proof scripts actually search for the proof, rather than simply describe it. So Travis run time is not a terribly precise metric.
As for comparing the number of lemmas... In Coq one has "magical" tactics such as `omega' or `lattice'. Metamath doesn't have tactics (?), so I presume that's the reason they need so many lemmas in the standard library; just like you need 100x more code in asm than in Haskell.
Metamath has tactics, but they aren't part of metamath per se; they are part of the tools that you use to write proofs. You don't make the rest of the world run the same proof search you did hundreds of times, finding the same proof every time, when you could just write down the proof.
I'm aware this is an entirely unfair comparison. Coq and Metamath aren't doing anywhere near the same amount of stuff, so of course Metamath will be faster. But the question is: are you doing what is necessary? Is running omega all the time a requirement for large scale proof checking, or is it just the way Coq does things? I think you will find that it's not necessary in the abstract, but it's also pretty hard to write Coq any other way. (To be absolutely clear, I don't mean don't use omega, I mean use omega once and generate a proof, and don't make the rest of the world have to re-run your proof script.)
The sizes of things are actually hard to gauge in Metamath if you want to compare LOC to the amount of human work, because what you see in a Metamath file is actually the output of some honest-to-goodness "tactic" or high-level user interface. No one is writing asm here.
If you are not writing lemmas, and are relying on Coq tactics to rediscover the proof all the time, well, that's just demonstrably slow and redundant. In Metamath we use a combination of tactics to find proofs and optimizers to extract common lemmas and use the library of existing lemmas effectively; the Coq model leaves no space for the optimizer, so you get something... suboptimal.
The verification of entire proof libraries (like the Coq standard library, or the Isabelle Archive of Formal Proofs) usually ranges from hours to days, as do some of the larger computer science projects like seL4 and CakeML.
You are right, verification is much faster than coming up with proofs, and that explains the majority of the disparity. But then why are we reinventing proofs so much? This is clearly needless work, and there are a thousand ways to solve the problem, but even recognizing that it is a problem is difficult in certain circles.
I encourage you to try Metamath. When starting a new axiom system from scratch, I found that I was not capable of writing anything that would cause lag; I wasn't able to make Metamath the slow part of my workflow. No matter how I approached it, the slowest parts of my work were my typing (with my fingers) and my thinking (with my mind and whiteboard).
No other formal proof assistant has been this fast for me. The next closest one is Idris, which is type-theory-driven (fancy phrase for "slow") and takes a few moments to round-trip between the REPL and the text editor.
My rule of thumb says: if your fingers are the bottleneck, then your programming language is not powerful enough (too verbose). Working with higher-level abstractions is good for your health: the brain doesn't get RSI.
> which is type-theory-driven (fancy phrase for "slow")
Syntax-only formalisms are faster than semantically laden formalisms; a compiler's parser is usually faster than its backend. Metamath is always going to be faster than Coq, Lean, Isabelle, HOL, Agda, Idris, or indeed much weaker systems like Haskell.
There's no RSI in working with Metamath. What I mean is that the workflow is largely dependent on me knowing what I want to prove and roughly how I want to prove it, rather than Metamath chewing CPU.
The author, Mario Carneiro, is one of the top three smartest people I know, and one of the nicest (and many others at Ohio State, where he did his undergrad, would agree). Glad to see his research getting Hacker News recognition.
I went to high school with Logan and with Mario. Mario was in my AP Comp Sci class with Mrs. White. I remember being proud of my forays into encryption, concurrency, and network programming.
And then I looked over and Mario had implemented a graphing calculator (as in, take "y = x^2 + 4x -10" and emit a graph) in the windows command prompt. Then, when that was boring, he started researching and implementing the graphics algorithms for drawing lines at various slopes with minimal distortion. And then, when that was boring, he made his graphing calculator support polar coordinates (something I understood only as "oops, my graphing calculator is misconfigured").
Also, he was ~12 years old (IIRC) having skipped a few grades.
I've met a lot of smart people between Carnegie Mellon, Facebook Infrastructure, and the Linux Kernel community, but Mario's one of two or maybe three at the absolute top in terms of intellectual horsepower.
For those who are more into watching talks on YouTube than reading papers, and are interested in the low level hardware formalization aspect of this project, there is a recording of the ITP 2019 presentation here: "x86 verification from scratch" https://www.youtube.com/watch?v=7hAShC6K_vA
I recently learned about some limitations in both the Coq and Lean theorem provers after watching Kevin Buzzard's video on the formalization of math in Lean. He made some claims about quotient support in Coq, so I asked[1] the Coq developers directly. I recommend reading their discussion and explanations.
This is very cool work, and I'm happy to see it. I've been fascinated with Metamath for a very long time, finding its simplicity extremely appealing, and have also made a number of attempts (under the Ghilbert name) to fix the problems Mario so eloquently identifies. But I never quite got it to gel. It's clear Mario has the technical chops, and I hope this publication gets the MM family a little more academic respectability and notice.
For those who don't know much about computer-verified proofs, the way I'd sum it up is that there's a complexity crisis going on (similar to the "software crisis" which might not be completely solved but which has seen significant progress). And that Metamath Zero is one of the more promising projects out there aimed at this problem.
Software gets more complicated, and so bugs become more common. Hardware gets faster, so speed pressures on software are reduced, so it gets more layered and hence slower, and also more complicated. More layers of abstraction mean more things to get wrong, and the world is increasingly reliant on software, so correctness becomes a serious concern.
That sums up the "software crisis" as it relates to correctness. Most theorem provers have been swept up in this. They are written in high level languages and frameworks made by people who paid no thought to verification and only slightly more than most to correctness. What's more, even some theorem prover languages have dubious correctness or semantics; Rust and Dafny are both languages that make a big deal about formal correctness but have no formal spec. (Rust is working on it.) Coq is not known to be consistent (the literature is full of "here is a proof that some subsystem of Coq has this and that property" but no one can handle the whole thing), not to mention that it has had a few soundness bugs in its history, one of which (Reals violate EM) wasn't even clearly a bug because no one knew what the model was supposed to be.
Agda is a free-for-all as far as I can tell, and Isabelle has some strange thing with axiomatic typeclasses that takes it out of the HOL world. HOL Light is pretty decent metatheory-wise, and the logic at least has been formalized inside itself, but the implementation has not, and OCaml's Obj.magic is still there, ready to cause inconsistency if you feel like using it.
Lean has not had any soundness bugs to date, and the theory is well understood, but it is a massive C++ application and there are doubtless dozens of implementation bugs in there waiting to be found.
None of the big theorem provers are really appropriate targets for a correctness proof, because either it is not known what that would even mean, or it's trivially false and we all try to pretend otherwise, or it's an intractable open question.
These are the paragons of correctness in our world.
Can we do better? What does it even mean to do better? This paper aims to find out.
Lean has independent proof checkers, and Coq could have, too. The fact that these are big systems doesn't harm the de Bruijn principle: you only need to trust the checkers…
(btw I couldn't find the source code for the proof checker in the paper, is it available?)
The fast proof checker I talk about is mm0-c: https://github.com/digama0/mm0/tree/master/mm0-c . It is (deliberately) bare bones, and I'm planning on formalizing approximately the assembly that you get from running that through gcc.
There is a second proof checker for MM0 in mm0-hs: https://github.com/digama0/mm0/tree/master/mm0-hs , which has a few more features, like the MM1 server mode for interactive feedback, as well as a verifier that reads a text version of the proof file (the .mmu format, not described in the paper).
The footnotes are supposed to be links to the github repo but I got the versions mixed up and submitted a partially anonymized version to arXiv.
Do we even know if it's possible to prove the correctness of a theorem-prover, in a very general sense? At some point, that would imply that some theorem-prover would have to prove its own correctness (either by proving itself directly or by proving itself through a loop of theorem-provers). This seems intuitively like it would start to run up against Gödel-style incompleteness problems.
It does, but only if you state the theorem in a particular way. Let's say that you have a verifier A that checks theorems in Peano Arithmetic (PA). Inside that logic, you can define what a computer is, how it executes, what PA is, what a verifier is, and what it means for a program to verify theorems of PA. Then you can have a sequence of bytes (verifier B), and prove that they represent a program that verifies theorems of PA.
What have you proven? Well, verifier A (using PA) proves that "verifier B proves theorems of PA". That means that "if B says phi is true, then PA |- phi". We would have a hard time actually running verifier B inside the logic of verifier A (that entails running all the steps of the computation B does as theorem applications inside verifier A), but even if we did, we would obtain a proof of "PA |- phi". If we assume A correctly implements PA, then that means PA |- "PA |- phi". In general, this is the best we can do. If we further assume that PA is a sound axiom system, i.e. that the facts it proves about numbers won't have concrete counterexamples, then this strengthens to PA |- phi and then to phi itself, so we've learned that phi is true.
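Spelled out as a rough chain (with quotes meaning "encoded as an arithmetic statement"):
  1. Verifier A proves, in PA: for every phi, "B accepts phi" implies "PA |- phi".
  2. Running B on phi (even if only in principle, inside A's logic) gives "B accepts phi".
  3. Combining these yields PA |- "PA |- phi".
  4. Only by additionally assuming PA is sound do we get PA |- phi, and then phi itself.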
The plan is to prove inside verifier A (that is, MM0) the statement "Here is a sequence of bytes. It is an executable that when executed has the property that if it succeeds on input "T |- phi", then T |- phi." The bootstrapping part is that the sequence of bytes in question is in fact the code of verifier A. In order to support statements of the form "here is a sequence of bytes", the verifier also has the ability to output bytes as part of its operation (it's not just a yes/no oracle), and assert facts about the encoding of those bytes in the logic.
Seems like the ideal would be a number of theorem provers, each capable of verifying the next, with the first one simple enough to be verified with pen and paper.
Provers in the LCF tradition have tiny cores that avoid having to trust lots of code. There are also reflective provers that can admit self-verified extensions. One project, called Milawa, had several layers of this.
(I'm the author of the paper BTW.) Regarding TLA+ and CakeML: TLA+ seems like a good idea, Leslie Lamport talks a lot of sense in most of his papers. I especially recommend "Composition: A Way to Make Proofs Harder" (https://lamport.azurewebsites.net/pubs/lamport-composition.p...). It sums up a lot of my feelings about "simplifying" a problem by adding more layers and generalizing the layers so that they have more work to do individually. I'm not sure I'm totally on board the idea of using temporal logic; my plan is to just use plain first order logic to talk about the states of a computer.
Regarding CakeML, I'm really impressed by the work they've done, and I think they more than anyone else are leading the charge on getting verification as pervasive as possible at every level (especially the low levels, where historically it has been difficult to get past the closed source barrier). That said, I think they are not doing an optimal job at being a bootstrap. They have a system that does that and so much more, so if you just want a bootstrap it's way overkill. And you pay for that overkill: a 14-hour compilation per target (see https://cakeml.org/regression.cgi).
(I edited my comment to remove my questions about TLA+ and CakeML, when I got to the "Related work" part of the paper. I was also wondering about Milawa but couldn't remember the name! Kudos sir!)
There is something about the process of getting optimal performance in a system that tells you whether you are moving in the right direction.
The universe speaks to you directly, in the count of cycles consumed, albeit somewhat garbled by incidentals of the CPU implementation and, particularly, caching. But the implementers of the CPU are chasing correctness and optimal performance too, so are always working toward the same goal, at their level. Even the caching, when you get down to counting cycles, reveals truths about your computation.
So, a proof system that is several orders of magnitude faster than another is that much closer to ultimate truth, because there is simply less room for significant errors of comprehension to hide in.
This is the reason that efficient languages and runtime systems are, and will always be, important. Garbage collection is an evil because it muddies the water, corrupting the connection between what your program says and what actually happens in the machine.
It seems that metamath's representation is a basic primitive for encoding maths at a syntactic level. It uses simple rules to recursively apply syntactic substitution transformations in order to build up a complete, ostensibly true statement, encoded in a way that a small proven-true statement checker can ~linearly (!) follow along the proof and find it logically sound. It seems weird to me that it's even possible to cross the boundary from syntactic representation to semantic meaning so seamlessly.
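Concretely (if I'm reading the set.mm source right), modus ponens is just the axiom below, and "applying" it in a proof means substituting concrete expressions for ph and ps and checking that the substituted hypothesis strings literally match statements already established; there is no evaluation or type inference involved, which is presumably why the checker can walk the proof almost linearly:
${
  min $e |- ph $.
  maj $e |- ( ph -> ps ) $.
  ax-mp $a |- ps $.
$}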