I like this quote: "sometimes, performance is about more than just getting work done a little faster. When something takes a lot less time, it changes the way you interact with the computer. A process that takes hours goes on the nightly build server; a process that takes minutes might be a compile that runs on your local machine; a process that takes seconds is a progress bar; and a process that takes milliseconds might happen in an editor between keystrokes."
Most verification systems work in terms of hours or days. Changing the order of magnitude means that verification can happen constantly.
The paper mentions that it verifies Metamath set.mm, one of the largest bodies of formalized mathematics, in less than 1 second. Here's some more info for the curious:
* My Gource visualization of the development of set.mm over time: https://www.youtube.com/watch?v=XC1g8FmFcUU
* My video "Metamath Proof Explorer: A Modern Principia Mathematica" (which gives overall background/context of Metamath): https://www.youtube.com/watch?v=8WH4Rd4UKGE
* Metamath Proof Explorer (MPE) aka "set.mm" - the home page for this body of formalized mathematics: http://us.metamath.org/mpeuni/mmset.html
What kind of verification does usually take hours? Just checked how long it takes to verify Raft using Coq and Verdi (pretty sizeable proof), and complete pipeline takes around 25 minutes (with Docker container creation, fetching dependencies, and what not). Verification of proof terms is a relatively easy task for a computer, as easy as type checking. Coming up with the proofs, on the other hand, that is hard.
The Metamath Rust verifier normally verifies 23,000+ proofs at a time. The actual verification time is just less than 1 second for the entire set. When you include the time for container creation, fetching dependencies, and so on, it typically takes ~48 seconds for it to verify the entirety of set.mm (which includes thousands of proofs). Here's an example: https://travis-ci.org/metamath/set.mm/builds/602369640
You could run 23,000 Coq jobs in parallel. I wouldn't want to pay for it :-). And it would complete verification of all the proofs in 25 minutes, vs. the 200msec of Metamath Zero or the ~1 second of the Metamath Rust implementation.
That's not to say that Coq is a terrible system. Coq is really interesting! But verification time is not something it's prioritized.
Logically it is a singe proof, but technically it consists of quite a few theorems and lemmas, so this is not a valid comparison. Also proofs in Coq are typically structured and pretty high-level because of all the automation, while metamath proofs look like core dumps :)
$d A a b v w x y $. $d B a b v w x y $. $d F a b v w x y $.
$( The ` 2nd ` (second member of an ordered pair) function restricted to a
one-to-one function ` F ` is a one-to-one function of ` F ` onto the
range of ` F ` . (Contributed by Alexander van der Vekens,
f1o2ndf1 $p |- ( F : A -1-1-> B -> ( 2nd |` F ) : F -1-1-onto-> ran F ) $=
( vx vy va vv vb vw c2nd syl cv cfv wceq wi wcel wa wrex ex com23 wf1 crn
cres wfo ccnv wfun wf1o f1f fo2ndf weq wral f2ndf cxp wss fssxp cop ssel2
elxp2 sylib anim12dan fvres adantr ad2antlr eqeq12d op2nd eqeq12i funopfv
wf vex f1fun anim12d eqcom biimpi eqeqan12d simpl anim12i f1veqaeq sylan2
opeq12 syl6 com14 syl6bi pm2.43i syld impcom syl5bi sylbid adantl adantlr
com13 com12 wb eleq1 bi2anan9 anbi2d fveq2 simpllr imbi12d imbi2d 3imtr4d
simpr rexlimdvva rexlimivv imp mpcom ralrimivv dff13 df-f1 simprbi dff1o3
sylanbrc ) ABCUAZCCUBZJCUCZUDZXNUEUFZCXMXNUGXLABCVHZXOABCUHZABCUIKXLCBXNU
I think it's 100% valid, indeed, the Metamath and/or Metamath Zero verifiers are verifying much more. I'm measuring time to verify ALL proofs of an entire system, including every theorem and lemma. The Coq measure you're quoting is the time to verify just a single proof with its lemmas. Coq is slower at verifying "a single proof of a theorem including all its lemmas" (though a complex one) than Metamath is at verifying "proofs of all theorems including all lemmas". To be fair, Coq is not designed to do this quickly, so that should not be surprising.
> Also proofs in Coq are typically structured and are pretty complex because of all the automation, while metamath proofs look like core dumps :)
You're showing the "compressed" format. That's simply its internal storage format, which is normally optimized to reduce storage space. Humans do not need to work in that format (and usually don't). Normally for final display the proofs are shown in HTML, for example:
Note that in the HTML display every step is justified by a hyperlink; you can click on the hyperlink to see the justification (recursively). For example, step 3 uses the justification "syl"; if you click on "syl" you'll find that it says that if ⊢ (𝜑 → 𝜓) and ⊢ (𝜓 → 𝜒) then ⊢ (𝜑 → 𝜒). The theorem syl is itself proved, and you can click on its links to see how it derives all the way back to definitions and axioms.
If you want to see just simple text, you can see the proof in text format by installing "metamath" and running this:
> metamath 'read set.mm' 'show proof f1o2ndf1'
As far as "structured" goes, I think structuring proofs is necessary for any system. A complex proof in Metamath is typically built up from axioms, previous proofs, and lemmas as needed. I think that's necessary for any math verification system.
I agree that calling it "one proof" is a mischaracterization, but comparing all of Raft to all of set.mm is probably about right. I would guess set.mm is 10-100 times larger than Raft in terms of number and complexity of actual proof content, but it checks in a fraction of the time, in part because of that "core dump". You might think it ugly, but have you ever looked at a .vo file? The difference is metamath checks its 'cache' in to version control.
As I said in the initial post, there is an entire world of difference between checking the proof term and, well, proving. Coq can dump proof terms that can also be verified relatively quickly, but typically people prefer to run proof scripts in CI. And proof scripts actually search for the proof, rather than simply describe it. So, travis run time is not a terribly precise metric.
As for comparing the number of lemmas... In Coq one has "magical" tactics such as `omega' or `lattice'. Metamath doesn't have tactics (?), so I presume that's the reason they need so many lemmas in standard library; just like you need 100x time more code in asm than in Haskell.
I'm aware this is an entirely unfair comparison. Coq and Metamath aren't doing anywhere near the same amount of stuff, so of course Metamath will be faster. But the question is: are you doing what is necessary? Is running omega all the time a requirement for large scale proof checking, or is it just the way Coq does things? I think you will find that it's not necessary in the abstract, but it's also pretty hard to write Coq any other way. (To be absolutely clear, I don't mean don't use omega, I mean use omega once and generate a proof, and don't make the rest of the world have to re-run your proof script.)
The sizes of things are actually hard to gauge in Metamath if you want to compare LOC to amount of human work, because what you see in a Metamath file are actually the output of some honest to goodness "tactic" or high level user interface. No one is writing asm here.
If you are not writing lemmas, and are relying on Coq tactics to rediscover the proof all the time, well, that's just demonstrably slow and redundant. In Metamath we use a combination of tactics to find proofs and optimizers to extract common lemmas and use the library of existing lemmas effectively; the Coq model leaves no space for the optimizer, so you get something... suboptimal.
You are right, verification is much faster than coming up with proofs, and that explains the majority of the disparity. But then why are we reinventing proofs so much? This is clearly needless work, and there are a thousand ways to solve the problem but even recognizing that it is a problem is difficult in certain circles.
No other formal proof assistant has been this fast for me. The next closest one is Idris, which is type-theory-driven (fancy phrase for "slow") and it takes a few moments to round-trip between the REPL and the text editor.
My rule of thumb says: if fingers are the bottleneck, then your programming language is not powerful enough (too verbose). Working with higher level abstractions is good for health: brain doesn't get RSI.
> which is type-theory-driven (fancy phrase for "slow")
There's no RSI in working with Metamath. What I mean is that the workflow is largely dependent on me knowing what I want to prove and roughly how I want to prove it, rather than Metamath chewing CPU.
And then I looked over and Mario had implemented a graphing calculator (as in, take "y = x^2 + 4x -10" and emit a graph) in the windows command prompt. Then, when that was boring, he started researching and implementing the graphics algorithms for drawing lines at various slopes with minimal distortion. And then, when that was boring, he made his graphing calculator support polar coordinates (something I understood only as "oops, my graphing calculator is misconfigured").
Also, he was ~12 years old (IIRC) having skipped a few grades.
I've met a lot of smart people between Carnegie Mellon, Facebook Infrastructure, and the Linux Kernel community, but Mario's one of two or maybe three at the absolute top in terms of intellectual horsepower.
That sums up the "software crisis" as it relates to correctness. Most theorem provers have been swept up in this. They are written in high level languages and frameworks made by people who paid no thought to verification and only slightly more than most to correctness. What's more, even some theorem prover languages have dubious correctness or semantics; Rust and Dafny are both languages that make a big deal about formal correctness but have no formal spec. (Rust is working on it.) Coq is not known to be consistent (the literature is full of "here is a proof that some subsystem of Coq has this and that property" but no one can handle the whole thing), not to mention that it has had a few soundness bugs in its history, one of which (Reals violate EM) wasn't even clearly a bug because no one knew what the model was supposed to be.
Agda is a free for all as far as I can tell, and Isabelle has some strange thing with axiomatic typeclasses that takes them out of the HOL world. HOL Light is pretty decent metatheory-wise, and the logic at least was formalized inside itself, but the implementation has not, and OCaml's Obj.magic is still there, ready to cause inconsistency if you feel like using it.
Lean has not had any soundness bugs to date, and the theory is well understood, but it is a massive C++ application and there are doubtless dozens of implementation bugs in there waiting to be found.
None of the big theorem provers are really appropriate for trying to prove correct, because either it is not known what this even means or it's trivially false and we all try to pretend otherwise, or it's an intractable open question.
These are the paragons of correctness in our world.
Can we do better? What does it even mean to do better? This paper aims to find out.
(btw I couldn't find the source code for the proof checker in the paper, is it available?)
There is a second proof checker for MM0 in mm0-hs: https://github.com/digama0/mm0/tree/master/mm0-hs , which has a few more features like the MM1 server mode to give interactive feedback, as well as a verifier that reads a text version of the proof file (not described in the paper, the .mmu format).
The footnotes are supposed to be links to the github repo but I got the versions mixed up and submitted a partially anonymized version to arXiv.
What have you proven? Well, verifier A (using PA) proves that "verifier B proves theorems of PA". That means that "if B says phi is true, PA |- phi". We would have a hard time actually running verifier B inside the logic of verifier A (that entails running all the steps of the computation B does as theorem applications inside verifier A), but even if we did, we would obtain a proof of "PA |- phi". If we assume A correctly implements PA, then that means PA |- "PA |- phi". In general, this is the best we can do. If we assume further that PA is a sound axiom system, i.e. when it proves facts about numbers then we won't find concrete counterexamples, then this strengthens to PA |- phi and then to phi itself, so we've learned that phi is true.
The plan is to prove inside verifier A (that is, MM0) the statement "Here is a sequence of bytes. It is an executable that when executed has the property that if it succeeds on input "T |- phi", then T |- phi." The bootstrapping part is that the sequence of bytes in question is in fact the code of verifier A. In order to support statements of the form "here is a sequence of bytes", the verifier also has the ability to output bytes as part of its operation (it's not just a yes/no oracle), and assert facts about the encoding of those bytes in the logic.
Regarding CakeML, I'm really impressed by the work they've done, and I think they more than anyone else are leading the charge on getting verification as pervasive as possible at every level (especially the low levels, where historically it has been difficult to get past the closed source barrier). That said, I think they are not doing an optimal job at being a bootstrap. They have a system that does that and so much more that if you just want a bootstrap it's way overkill. And you pay for that overkill - 14 hour compilation per target (see https://cakeml.org/regression.cgi).
(I edited my comment to remove my questions about TLA+ and CakeML, when I got to the "Related work" part of the paper. I was also wondering about Milawa but couldn't remember the name! Kudos sir!)
Full disclosure: I'm one of the co-authors.
The universe speaks to you directly, in the count of cycles consumed, albeit somewhat garbled by incidentals of the CPU implementation and, particularly, caching. But the implementers of the CPU are chasing correctness and optimal performance too, so are always working toward the same goal, at their level. Even the caching, when you get down to counting cycles, reveals truths about your computation.
So, a proof system that is several orders of magnitude faster than another is that much closer to ultimate truth, because there is simply less room for significant errors of comprehension to hide in.
This is the reason that efficient languages and runtime systems are, and will always be, important. Garbage collection is an evil because it muddies the water, corrupting the connection between what your program says and what actually happens in the machine.