
Metamath Zero, a bootstrapping theorem prover - giomasce
https://arxiv.org/abs/1910.10703
======
dwheeler
This is really interesting work.

I like this quote: "sometimes, performance is about more than just getting
work done a little faster. When something takes a lot less time, it changes
the way you interact with the computer. A process that takes hours goes on the
nightly build server; a process that takes minutes might be a compile that
runs on your local machine; a process that takes seconds is a progress bar;
and a process that takes milliseconds might happen in an editor between
keystrokes."

Most verification systems work in terms of hours or days. Changing the order
of magnitude means that verification can happen constantly.

The paper mentions that it verifies Metamath set.mm, one of the largest bodies
of formalized mathematics, in less than 1 second. Here's some more info for
the curious:

* My Gource visualization of the development of set.mm over time: [https://www.youtube.com/watch?v=XC1g8FmFcUU](https://www.youtube.com/watch?v=XC1g8FmFcUU)

* My video "Metamath Proof Explorer: A Modern Principia Mathematica" (which gives overall background/context of Metamath): [https://www.youtube.com/watch?v=8WH4Rd4UKGE](https://www.youtube.com/watch?v=8WH4Rd4UKGE)

* Metamath Proof Explorer (MPE) aka "set.mm" \- the home page for this body of formalized mathematics: [http://us.metamath.org/mpeuni/mmset.html](http://us.metamath.org/mpeuni/mmset.html)

~~~
h91wka
> Most verification systems work in terms of hours or days.

What kind of verification does usually take hours? Just checked how long it
takes to verify Raft using Coq and Verdi (pretty sizeable proof), and complete
pipeline takes around 25 minutes (with Docker container creation, fetching
dependencies, and what not). Verification of proof terms is a relatively easy
task for a computer, as easy as type checking. Coming up with the proofs, on
the other hand, that is hard.

~~~
dwheeler
The 25 minute value is only the time to verify a single Coq proof. I'm
measuring the time to verify ALL proofs of an entire system.

The Metamath Rust verifier normally verifies 23,000+ proofs at a time. The
actual verification time is just less than 1 second for the entire set. When
you include the time for container creation, fetching dependencies, and so on,
it typically takes ~48 seconds for it to verify the entirety of set.mm (which
includes _thousands_ of proofs). Here's an example: [https://travis-
ci.org/metamath/set.mm/builds/602369640](https://travis-
ci.org/metamath/set.mm/builds/602369640)

You could run 23,000 Coq jobs in parallel. I wouldn't want to pay for it :-).
And it would complete verification of all the proofs in 25 minutes, vs. the
200msec of Metamath Zero or the ~1 second of the Metamath Rust implementation.

That's not to say that Coq is a terrible system. Coq is really interesting!
But verification time is not something it's prioritized.

~~~
h91wka
> The 25 minute value is only the time to verify a single Coq proof.

Logically it is a singe proof, but technically it consists of quite a few
theorems and lemmas, so this is not a valid comparison. Also proofs in Coq are
typically structured and pretty high-level because of all the automation,
while metamath proofs look like core dumps :)

    
    
      ${
        $d A a b v w x y $.  $d B a b v w x y $.  $d F a b v w x y $.
        $( The ` 2nd ` (second member of an ordered pair) function restricted to a
           one-to-one function ` F ` is a one-to-one function of ` F ` onto the
           range of ` F ` .  (Contributed by Alexander van der Vekens,
           4-Feb-2018.) $)
        f1o2ndf1 $p |- ( F : A -1-1-> B -> ( 2nd |` F ) : F -1-1-onto-> ran F ) $=
          ( vx vy va vv vb vw c2nd syl cv cfv wceq wi wcel wa wrex ex com23 wf1 crn
          cres wfo ccnv wfun wf1o f1f fo2ndf weq wral f2ndf cxp wss fssxp cop ssel2
          elxp2 sylib anim12dan fvres adantr ad2antlr eqeq12d op2nd eqeq12i funopfv
          wf vex f1fun anim12d eqcom biimpi eqeqan12d simpl anim12i f1veqaeq sylan2
          opeq12 syl6 com14 syl6bi pm2.43i syld impcom syl5bi sylbid adantl adantlr
          com13 com12 wb eleq1 bi2anan9 anbi2d fveq2 simpllr imbi12d imbi2d 3imtr4d
          simpr rexlimdvva rexlimivv imp mpcom ralrimivv dff13 df-f1 simprbi dff1o3
          sylanbrc ) ABCUAZCCUBZJCUCZUDZXNUEUFZCXMXNUGXLABCVHZXOABCUHZABCUIKXLCBXNU
          AZXPXLCBXNVHZDLZXNMZELZXNMZNZDEUJZOZECUKDCUKXSXLXQXTXRABCULKXLYGDECCCABUM
          ZUNZXLYACPZYCCPZQZYGOXLXQYIXRABCUOKYIYLXLYGYIYLXLYGOZYAFLZGLZUPZNZGBRFARZ
          YCHLZILZUPZNZIBRHARZQYIYLQZYMYIYJYRYKUUCYIYJQYAYHPYRCYHYAUQFGYAABURUSYIYK
          QYCYHPUUCCYHYCUQHIYCABURUSUTYRUUCUUDYMOZYQUUCUUEOZFGABYNAPZYOBPZQZYQUUFUU
          IYQQZUUBUUEHIABUUJYSAPZYTBPZQZQZUUBUUEUUNUUBQZYIYPCPZUUACPZQZQZXLYPXNMZUU
          AXNMZNZYPUUANZOZOZUUDYMUUNUUSUVEOZUUBUUIUUMUVFYQUUSUUIUUMQZUVEUURUVGUVEOY
          IUURUVGUVEUURUVGQZUVBXLUVCUVHUVBYPJMZUUAJMZNZXLUVCOZUVHUUTUVIUVAUVJUURUUT
          UVINZUVGUUPUVMUUQYPCJVAVBVBUUQUVAUVJNUUPUVGUUACJVAVCVDUVKGIUJZUVHUVLUVIYO
          UVJYTYNYOFVIGVIVEYSYTHVIIVIVEVFUVHXLUVNUVCUVGUURXLUVNUVCOZOXLUURUVGUVOXLU
          URYNCMZYONZYSCMZYTNZQZUVGUVOOXLCUFZUURUVTOABCVJUWAUUPUVQUUQUVSYNYOCVGYSYT
          CVGVKKXLUVGUVTUVOUVNUVGUVTXLUVCUVNUVGUVTUVLOOUVTUVNUVGUVNUVLUVTUVNUVPUVRN
          ZUVGUVNUVLOOUVQUVSYOUVPYTUVRUVQYOUVPNUVPYOVLVMUVSYTUVRNUVRYTVLVMVNXLUVGUV
          NUWBUVCXLUVGUVNUWBUVCOOXLUVGQZUWBUVNUVCUWCUWBFHUJZUVOUVGXLUUGUUKQUWBUWDOU
          UIUUGUUMUUKUUGUUHVOUUKUULVOVPABYNYSCVQVRUWDUVNUVCYNYOYSYTVSSVTTSWAWBWAWCW
          ATWDWJWETWFWGTSWHWKWIVBUUOYLUURYIUUNYJUUPUUBYKUUQYQYJUUPWLUUIUUMYAYPCWMVC
          YCUUACWMWNWOUUOYGUVDXLUUOYEUVBYFUVCUUNUUBYBUUTYDUVAYQYBUUTNUUIUUMYAYPXNWP
          VCYCUUAXNWPVNUUOYAYPYCUUAUUIYQUUMUUBWQUUNUUBXAVDWRWSWTSXBSXCXDXESTXEXFDEC
          BXNXGXKXSXTXPCBXNXHXIKCXMXNXJXK $.
      $}
    

...

~~~
digama0
It goes without saying that metamath is not designed to be read from the
source directly. To be perfectly fair, Coq isn't either; you can kind of get
the gist but to really understand what's going on you have to start up Coq and
look at proof states. Metamath as it's meant to be seen looks like this:
[http://metamath.tirix.org/f1o2ndf1.html](http://metamath.tirix.org/f1o2ndf1.html)
.

I agree that calling it "one proof" is a mischaracterization, but comparing
all of Raft to all of set.mm is probably about right. I would guess set.mm is
10-100 times larger than Raft in terms of number and complexity of actual
proof content, but it checks in a fraction of the time, in part because of
that "core dump". You might think it ugly, but have you ever looked at a .vo
file? The difference is metamath checks its 'cache' in to version control.

~~~
h91wka
> The difference is metamath checks its 'cache' in to version control.

As I said in the initial post, there is an entire world of difference between
checking the proof term and, well, proving. Coq can dump proof terms that can
also be verified relatively quickly, but typically people prefer to run proof
scripts in CI. And proof scripts actually search for the proof, rather than
simply describe it. So, travis run time is not a terribly precise metric.

As for comparing the number of lemmas... In Coq one has "magical" tactics such
as `omega' or `lattice'. Metamath doesn't have tactics (?), so I presume
that's the reason they need so many lemmas in standard library; just like you
need 100x time more code in asm than in Haskell.

~~~
digama0
Metamath has tactics, but they aren't part of metamath per se; they are part
of the tools that you use to write proofs. You don't make the rest of the
world run the same proof search you did hundreds of times, finding the same
proof every time, when you could just write down the proof.

I'm aware this is an entirely unfair comparison. Coq and Metamath aren't doing
anywhere near the same amount of stuff, so _of course_ Metamath will be
faster. But the question is: are you doing what is _necessary_? Is running
omega all the time a requirement for large scale proof checking, or is it just
the way Coq does things? I think you will find that it's not necessary in the
abstract, but it's also pretty hard to write Coq any other way. (To be
absolutely clear, I don't mean don't use omega, I mean use omega _once_ and
generate a proof, and don't make the rest of the world have to re-run your
proof script.)

The sizes of things are actually hard to gauge in Metamath if you want to
compare LOC to amount of human work, because what you see in a Metamath file
are actually the output of some honest to goodness "tactic" or high level user
interface. No one is writing asm here.

If you are not writing lemmas, and are relying on Coq tactics to rediscover
the proof all the time, well, that's just demonstrably slow and redundant. In
Metamath we use a combination of tactics to find proofs and optimizers to
extract common lemmas and use the library of existing lemmas effectively; the
Coq model leaves no space for the optimizer, so you get something...
suboptimal.

------
loganfrederick
The author, Mario Carniero, is both one of the top three smartest and nicest
people I know (and many other Ohio State students where he did his undergrad
would agree). Glad to see his research getting Hacker News recognition.

~~~
alexgartrell
I went to high school with Logan and with Mario. Mario was in my AP Comp Sci
class with Mrs. White. I remember being proud of my forays into encryption,
concurrency, and network programming.

And then I looked over and Mario had implemented a graphing calculator (as in,
take "y = x^2 + 4x -10" and emit a graph) in the windows command prompt. Then,
when that was boring, he started researching and implementing the graphics
algorithms for drawing lines at various slopes with minimal distortion. And
then, when that was boring, he made his graphing calculator support polar
coordinates (something I understood only as "oops, my graphing calculator is
misconfigured").

Also, he was ~12 years old (IIRC) having skipped a few grades.

I've met a lot of smart people between Carnegie Mellon, Facebook
Infrastructure, and the Linux Kernel community, but Mario's one of two or
maybe three at the absolute top in terms of intellectual horsepower.

------
digama0
For those who are more into watching talks on YouTube than reading papers, and
are interested in the low level hardware formalization aspect of this project,
there is a recording of the ITP 2019 presentation here: "x86 verification from
scratch"
[https://www.youtube.com/watch?v=7hAShC6K_vA](https://www.youtube.com/watch?v=7hAShC6K_vA)

------
xvilka
I recently learned about some limitations in both Coq and Lean theorem prover
after watching the video by Kevin Buzzard on formalization of math in Lean. He
did some claims about quotient support in Coq, so I just asked[1] Coq
developers straightly. I recommend to read their discussion and explanations.

[1]
[https://github.com/coq/coq/issues/10871](https://github.com/coq/coq/issues/10871)

------
raphlinus
This is very cool work, and I'm happy to see it. I've been fascinated with
Metamath for a very long time, finding its simplicity extremely appealing, and
have also made a number of attempts (under the Ghilbert name) to fix the
problems Mario so eloquently identifies. But I never quite got it to gel. It's
clear Mario has the technical chops, and I hope this publication gets the MM
family a little more academic respectability and notice.

------
jkingdon2000
For those who don't know much about computer-verified proofs, the way I'd sum
it up is that there's a complexity crisis going on (similar to the "software
crisis" which might not be completely solved but which has seen significant
progress). And that Metamath Zero is one of the more promising projects out
there aimed at this problem.

~~~
lliiffee
Would you mind elaborating? I'd love to hear more.

~~~
digama0
Software gets more complicated, and so bugs become more common. Hardware gets
faster, so speed pressures are reduced on software, so it gets more layered
and hence slower; and also more complicated. More layers of abstraction means
more things to get wrong, so correctness becomes a serious problem. The world
is increasingly reliant on software, so correctness becomes a serious concern.

That sums up the "software crisis" as it relates to correctness. Most theorem
provers have been swept up in this. They are written in high level languages
and frameworks made by people who paid no thought to verification and only
slightly more than most to correctness. What's more, _even some theorem prover
languages_ have dubious correctness or semantics; Rust and Dafny are both
languages that make a big deal about formal correctness but have no formal
spec. (Rust is working on it.) Coq is not known to be consistent (the
literature is full of "here is a proof that some subsystem of Coq has this and
that property" but no one can handle the whole thing), not to mention that it
has had a few soundness bugs in its history, one of which (Reals violate EM)
wasn't even clearly a bug because no one knew what the model was supposed to
be.

Agda is a free for all as far as I can tell, and Isabelle has some strange
thing with axiomatic typeclasses that takes them out of the HOL world. HOL
Light is pretty decent metatheory-wise, and the logic at least was formalized
inside itself, but the implementation has not, and OCaml's Obj.magic is still
there, ready to cause inconsistency if you feel like using it.

Lean has not had any soundness bugs to date, and the theory is well
understood, but it is a massive C++ application and there are doubtless dozens
of implementation bugs in there waiting to be found.

None of the big theorem provers are really appropriate for trying to prove
correct, because either it is not known what this even means or it's trivially
false and we all try to pretend otherwise, or it's an intractable open
question.

These are the paragons of correctness in our world.

Can we do better? What does it even mean to do better? This paper aims to find
out.

~~~
stouset
Do we even know if it's _possible_ to prove correctness of a theorem-prover,
in a very general sense? At some point, that would imply that some theorem-
prover would have to prove its own correctness (either by proving itself
directly or by proving itself through a loop of theorem-provers). This seems
intuitively like it start to run up against Gödel-style incompleteness
problems.

~~~
digama0
It does, but only if you state the theorem in a particular way. Let's say that
you have a verifier A, that checks theorems in Peano Arithmetic (PA). Inside
that logic, you can define what a computer is, how it executes, what PA is,
what a verifier is, and what it means for a program to verify theorems of PA.
Then you can have a sequence of bytes (verifier B), and prove that they
represent a program that verifies theorems of PA.

What have you proven? Well, verifier A (using PA) proves that "verifier B
proves theorems of PA". That means that "if B says phi is true, PA |- phi". We
would have a hard time actually running verifier B inside the logic of
verifier A (that entails running all the steps of the computation B does as
theorem applications inside verifier A), but even if we did, we would obtain a
proof of "PA |- phi". If we assume A correctly implements PA, then that means
PA |- "PA |- phi". In general, this is the best we can do. If we assume
further that PA is a sound axiom system, i.e. when it proves facts about
numbers then we won't find concrete counterexamples, then this strengthens to
PA |- phi and then to phi itself, so we've learned that phi is true.

The plan is to prove inside verifier A (that is, MM0) the statement "Here is a
sequence of bytes. It is an executable that when executed has the property
that if it succeeds on input "T |- phi", then T |- phi." The bootstrapping
part is that the sequence of bytes in question is in fact the code of verifier
A. In order to support statements of the form "here is a sequence of bytes",
the verifier also has the ability to output bytes as part of its operation
(it's not just a yes/no oracle), and assert facts about the encoding of those
bytes in the logic.

------
dwheeler
If you're interested in the main book on Metamath, it is "Metamath: A Computer
Language for Mathematical Proofs" by Norman Megill & David A. Wheeler. You can
get the hardcover here: [http://www.lulu.com/shop/norman-megill-and-david-a-
wheeler/m...](http://www.lulu.com/shop/norman-megill-and-david-a-
wheeler/metamath-a-computer-language-for-mathematical-
proofs/hardcover/product-24129769.html)

Full disclosure: I'm one of the co-authors.

------
dwheeler
FYI, the current publicly-available code for Metamath Zero is here:
[https://github.com/digama0/mm0](https://github.com/digama0/mm0)

------
ncmncm
There is something about the process of getting optimal performance in a
system that tells you whether you are moving in the right direction.

The universe speaks to you directly, in the count of cycles consumed, albeit
somewhat garbled by incidentals of the CPU implementation and, particularly,
caching. But the implementers of the CPU are chasing correctness and optimal
performance too, so are always working toward the same goal, at their level.
Even the caching, when you get down to counting cycles, reveals truths about
your computation.

So, a proof system that is several orders of magnitude faster than another is
that much closer to ultimate truth, because there is simply less room for
significant errors of comprehension to hide in.

This is the reason that efficient languages and runtime systems are, and will
always be, important. Garbage collection is an evil because it muddies the
water, corrupting the connection between what your program says and what
actually happens in the machine.

------
infogulch
It seems that metamath's representation is a basic _primitive_ for encoding
maths at a syntactic level. It uses simple rules to recursively apply
syntactical substitution transformations in order to build up a kind of
complete, ostensibly true _statement_. A statement that is encoded in a way
that a small proven-true statement checker can ~linearly (!) follow along the
statement of the proof and find it logically sound. It seems weird to me that
it's even possible to cross the boundary from syntactic representation to
semantic meaning so seamlessly.

