Hacker News new | comments | ask | show | jobs | submit login
Certigrad: bug-free machine learning on stochastic computation graphs (github.com)
99 points by kg9000 on July 10, 2017 | hide | past | web | favorite | 52 comments

I find the appeal of formally proven languages somewhat confusing. All you're doing is moving the bugs from the source code to the specification. (Alternately, you can think of source code as a 'specification' for a compiled program. You still have to transfer the same amount of information to the computer.)

> All you're doing is moving the bugs from the source code to the specification.

The value of doing this can vary, but there are some cases in which the gain is immense and indisputable. Suppose you are writing a compiler optimization. Your source code could be arbitrarily complicated, but your specification is simply "an optimized program always behaves the same as the original".

Are things like "the output must be equal to the output of this other thing" easily expressed to a prover? If so this seems like it'd be awesome for optimisation as well.

Yes, this is easy to express in a prover. A naive implementation can always serve as a specification for a sophisticated one.

This sounds very interesting indeed. Can this be done with Coq? Do you have any tutorials or such on doing this?

It can definitely be done with Coq. The typical recommendation is to read through and do the exercises in the freely-available Software Foundations book [1], which will teach you how to do that in Coq.

That said, while it was educational and an extremely good introduction to interactive theorem proving, as I went through the book I became increasingly disappointed with how Coq-style proofs were so laborious and difficult to achieve, as often you had to prove even the simplest things step by step. You also had to go through many chapters before you finally arrived at proving correctness of optimization.

I later found an alternative which I can't recommend highly enough. If you read through and do the exercises in the also freely-available Concrete Semantics book [2], you will also learn how to do exactly what you want, but you will learn it much sooner and in a much easier way.

In the latter book, you will be using the Isabelle/HOL prover instead of the Coq prover, though. The language looks slightly more idiosyncratic (I'm referring to the weird extra quotes - you will know what I mean), but despite that, to me it was worth a hundred times over: Isabelle/HOL has the most amazing proof automation I've seen - it will not only find proofs all by itself quite often, automatically using hundreds of theorems it already knows or that you have already proved before, but sometimes it will even automatically spit out human-readable proofs, which will teach you how the proof can be done (which is useful when you are stumped and had no clue how to move forward).

Isabelle/HOL can also be configured to automatically find errors in your specification and in your implementation - it has built-in automatic quickcheck / nitpick functionality, which can often tell you that you're trying to prove something that is not provable even before you attempt to prove it. IMHO, all of these things contribute to a significantly better user experience (as a newbie, at least).

If you are interested, give it a try. I highly recommend it!

[1] https://softwarefoundations.cis.upenn.edu/current/index.html [2] http://concrete-semantics.org/

I second this.

Isabelle/HOL has much better proof automation than any prover based on Curry/Howard. That's primarily because HOL is a classical logic, while Curry/Howard is constructive (yes, I know we can do classical proofs inside constructive logic, e.g. with proof irrelevance). Almost all automated provers are also classical. Having comparably powerful proof automation to Curry/Howard based systems is an open research problem.

   weird extra quotes
That's because Isabelle is not a prover, but a 'meta-prover' used for implementing provers, e.g. Isabelle/HOL. In practise HOL is the only logic implemented in Isabelle with enough automation to be viable, so we don't really use Isabelle's 'meta-prover' facilities in practise.

> Having comparably powerful proof automation to Curry/Howard based systems is an open research problem.

Isabelle/HOL is essentially isomorphic to a subset of Lean when we assume classical axioms, and any automation you can write for Isabelle, you can write for that subset of Lean. It is often easy to generalize automation techniques to support the rest of Lean (i.e. dependent types) as well, but sometimes doing so may introduce extra complexity or preclude the use of an indexing data structure. See https://arxiv.org/abs/1701.04391 for an example of generalizing a procedure (congruence closure) to support dependent types that introduces very little overhead.

This is interesting. It will take me some time to digest this paper.

As an aside: is there a terse description of Lean's type-theory? Also: Lean is Curry/Howard based, it's not an LCF-style architecture?

There are older papers that describes Lean's meta theory, https://leanprover.github.io/publications/, the first paper, and the one on elaboration. Unfortunately they both describe the version present in Lean 2, which still had the HoTT support we removed in Lean 3.

The new tactic paper, set to appear at ICFP '17, also briefly describes Lean's meta-theory. https://leanprover.github.io/papers/tactic.pdf

At a high level our meta-theory is very similar to Coq's with a few important departures, a big difference the remove of pattern matching and fixpoints from the kernel, we instead have primitive recursors/elminators/recursion principles.

The lead author of Lean (Leo De Moura) has built some of the worlds most advanced automated reasoning tools, and his goal has always to unify type theory and automation. Our goal is to have the benefits of dependent type theory coupled with the high level of automation present in SMT and from tools like Isabelle.

Thanks for the link to the paper.

De Moura is giving a talk tomorrow in Cambridge at the workshop in computer aided proofs called "Metaprogramming with Dependent Type Theory". I wanted to go, but won't be able to attend. I guess that talk will mostly be about the content of the paper. Now I don't feel so bad for being unable to attend.

Lean is LCF style in the sense that there is a small kernel, and all proofs written by the user and generated by the automation are checked by this kernel. This kernel (i.e. type checker) is much smaller than that of Coq or Agda.

It is _not_ LCF style in the sense that there is only a "theorem" datatype with proof operations. Lean proofs are type-checked expressions.

It is hard to find a terse description of the Calculus of Constructions or CIC. Lean's logic essentially consists of Martin-Löf type theory + (noncumulative) universes with Impredicative Prop + inductive type (where nested and mutual induction is compiled down to a simpler forms with only one eliminator) + quotient types + function and Prop extensionality. And optionally choice to gain a classical logic. Most other features, like a Agda like dependent pattern matching, or mutual and nested inductive types are compiled down to more basic features.

    there is only a "theorem" datatype 
You mean that there is NOT only a "theorem" datatype? In contrast to Curry/Howard provers, the LCF approach forgets proofs, it guarantees soundness by giving you access to proof rules only through the "theorem" datatype (which is the key trusted computing base). To be sure the "theorem" datatype may internally maintain a proof object (e.g. for the purposes of program extraction), but that's not necessary.

Sorry, I was unclear. I meant that most LCF style theorem provers only have one theorem datatype to carry proof information.

   Can this be done with Coq?
This can be done with any prover.

Simplifying a bit, an algorithm is a function of type A -> B. If you have two implementations, f1 :: A -> B and f2 :: A -> B, say one slow and simple (f1), one fast and complicated (f2), then you can verify that f2 does what it is supposed to do in two steps:

1. Show that for all a in A we have f1 a == f2 a.

2. Show that f1 meets the specification.

If you trust that f1 does what it is supposed to do, then you can skip (2).

> your specification is simply "an optimized program always behaves the same as the original"

I wouldn't say it is simple, since it is undecidable for general programs.

Daniel's point is that the specification itself is simple, he is saying nothing about the complexity of proving that your program matches the specification.

The decidability of the specification isn't important in this context. It is known there is no procedure to build a proof of this specification for any language or optimization, but as humans we can prove that it holds for a specific configuration of compilers, and optimizations, and have done so many times, for example CompCert, Alive, etc.

The specification is simple in the sense that it is easy to understand what it states and to confirm that it has the intended meaning. This does not mean that all proofs of the specification will be simple, but once you do construct a machine-checkable proof of the specification, your level of confidence that the system is correct will be extremely high (commensurate to the simplicity of the specification) and will not depend on the complexity of the proof.

The point is that the specification is short and human understandable: you get confidence that bugs are unlikely to exist in the specification. Now if you also can prove that your optimization passes fulfill the specification you've obviously gained a lot of value. It doesn't matter that deciding the correctness of arbitrary program transformations is undecidable if you happen to succeed in proving that the ones actually implemented are correct.

You used to have bugs in the ambiguous requirements in English, any formal specs which are precise, and the code. Formal specifications plus verification does indeed mostly move bugs to the formal specs or proofs. That mostly eliminates two classes of bugs. Then, it ensures that your extracted or verified code probably corresponds to your specs. It might even been proven free of (common errors here). With certifying compiler, the ML or C code might also be proven to compile to assembly without errors.

Quite a bit of improvement there. Developer is not just moving things on a ledger. That developer is preventing entire classes of problems plus reducing problems in what remains.

The specification is a lot smaller than the code, and so it's easier to read and manually verify that it's correct.

   specification is a lot 
   smaller than the code
I doubt that this is the case in general.

How can the (full) specification of a program be smaller than the program? This would mean we can always compress a program into a smaller program (after all a specification can be seen as another program -- alternatively a program can be seen as its own specification).

It's not true for all cases, but it's true for most real-world cases, especially for typical combinations of specification tools and traditional (imperative) programming languages.

A real-world program is usually more complex and intricate than a specification, often a lot more complex. This is usually done in the interest of optimization / performance, so that you program runs at least somewhat efficiently.

A specification is usually a lot simpler, because typically you only need to express the simplest possible way to do or to describe something. Even if the specification ends up looking like a very simple, naive and horribly inefficient program to a typical programmer, it doesn't matter: it's not going to be executed, it's only going to be used to prove that your program behaves in the same way, even though your program can be arbitrarily more complex.

Also, in your specification, sometimes you don't even need to express how to do something - you only need to express what the result looks like.

For example, let's say you want to prove that a sorting algorithm is implemented correctly. You literally only have to specify "after running the function sort() on an array, the resulting array is sorted. A sorted array means that for any X and Y elements of the array, if the index of X is less than the index of Y, then the value of X is less than or equal to the value of Y".

It's easy to see how that specifies correct sorting behavior, even though it doesn't even say how the sorting is supposed to be done. In contrast, the actual implementation of the sorting algorithm can be way more complex or difficult to see that is correct than that, especially if your program is written in a programming language prone to errors, with pointer manipulation and potential undefined behavior (such as C, for instance).

   A specification is usually 
   a lot simpler, because ...
I used to think that too, but after verifying some algorithms, I have become sceptical of that belief. Instead I conjecture that on average the full specification of an algorithm is at best proportional in length to the algorithm itself. There are two main issues:

- Most verification does not tackle the full specification, but rather some aspect.

- Verification needs many intermediate specifications (e.g. induction hypotheses in proofs), which need to be accounted for.

I'm happy to be convinced otherwise.

   prove that a sorting 
   algorithm is 
I produced the full verification of Quicksort (probably the first), and I needed to produce a large amount of intermediate specifications for auxiliary lemmas. The full specification was much longer than the algorithm.

You sound like you have more experience in this area than I do, so please correct me if I'm wrong.

In the context of the original poster, the size of the specification that matters with respect to being a potential source of errors is the specification of the final (initial?) theorems that you want to prove, not the actual proof, nor the intermediate lemmas/theorems, induction hypothesis, etc. The latter ones can be all incorrect for all you care, but it doesn't matter because the theorem prover will always tell you that your implementation does not implement the specification unless the proof and the intermediate theorems are also correct.

In the sorting example, what matters is that you correctly specify what is a sorted array and that the result of the sort() function should be a sorted array with the same elements as the input. If you decide to prove that your implementation of Quicksort implements that specification, you can use whatever implementation code you want and whatever intermediate lemmas or proofs you want, but again, if you make any mistake in implementing the algorithm or specifying the intermediate lemmas and proofs or even the final proofs, the theorem prover will complain loudly.

If you simply mean that the total size of all the specification, including proofs and intermediate theorems and proofs can be as large or larger than the actual implementation, I agree with you, that can be the case and that is definitely a problem, because that is one of the reasons why most programmers and the industry in general won't find formal verification to be worth it.

That said, SMT solvers (if you are so inclined) and proof automation in general can really help to reduce the size of the complete specification, including intermediate results and proofs (this is the main reason why I prefer Isabelle/HOL to Coq).

(As an aside, I was pretty sure I knew of an example where a real-world system was verified with an SMT solver and where the complete specification and proofs ended up being around 10% of the code size of the implementation, but for the life of me I cannot find that example... perhaps I am misremembering).

If by full verification you mean verifying all the way down to the machine code, I would argue that this kind of verification probably belongs in the compiler (e.g. CompCert or CakeML) rather than being something that the typical programmer should do.

But perhaps, and most likely, I have not understood you completely :)

   The latter ones can 
   be all incorrect for 
   all you care
Good point. Maybe we can call this the TSP (trusted specification base)?

In my experience the TSP is subtle and contains most of the intermediate definitions too. A few years back I had a very large Isabelle/HOL proof invalidated because of some small mistake in an obscure part of the specification that I never though could be important. It required a redesign from scratch of almost all of the proof.

   only have to specify "after 
   running the function sort() 
   on an array,
I'm afraid that's not enough in practise: you will also have to specify (and verify) things like: the implementation does not touch memory except the array to be sorted (and some auxiliary variables). If you don't, it will typically be difficult to use the proof of the sorting function in the context of the verification of a larger program.

   all the way down 
   to the machine code
No, that's an orthogonal dimension.

In your second point you are referring to fact that is needed in a calling context when proving a large spec, that means it is a top level specification, and therefor trusted. You don't need to prove an implementation only touches the allocated array to verify it as a sorting algorithm, its because there is some other part of the top level specification that requires that.

For example I recently verified a JIT compiler, and the compiler spec was functional correctness but we really wanted to ensure it was correct AND only used the memory we allocated for it. In this case we had to grow the specification to be larger, its not longer just being correct, but isolated in some way. This of course has a trickle effect where we need intermediate facts, and proofs in order to be able to prove the top level theorem.

As some posters remarked above you should still be able to give a concise top level specification that is much smaller then the implementation. You really just want to say (where m is pre-sort, and m' is post-sort):

    (forall addr m m',
      addr < addr_of(array) \/ addr_of(array) + length < addr -> read addr m = read addr m')

We have a spec very similar to this in said JIT compiler, where we prove it doesn't write outside its input and output buffers. The top level spec is no more then a few definitions and a few lines of spec, compared to the JIT compiler which is 10,000s of lines of proof & implementation it is tiny and very easy to audit.

You can also make proofs obligations like this easier by using regions, a framing rule, or some programming model that gives some amount of memory isolation.

   calling context
In compositional verification, you generally want to prove as much as possible about the piece of code at hand -- irrespective of calling context.

   spec very similar to this
Yes, I was using something similar. Note that this is right only in a purely sequential setting ...

   I recently verified a JIT compiler,
That sounds exciting. That must have been a major piece of work. Anything published yet?

With JIT compilers the issue of self-modification becomes pertinent. I know that Magnus Myreen has verified a small JIT compiler a few years back but I don't remember exactly what theorem he proves. IIRC he was using a tailor-made Hoare logic for x86 code that basically moved some of the complexity of the spec into the logic.

  I'm afraid that's not enough in practise: you will also have to specify (and verify) things like: the implementation does not touch memory except the array to be sorted (and some auxiliary variables). If you don't, it will typically be difficult to use the proof of the sorting function in the context of the verification of a larger program.
Depending on the programming language being proved, you don't need to specify that the algorithm doesn't touch other memory, nor which variables it uses as temporary variables, because it's implicit in the definition of the language or the types themselves (or the fact that you are not referencing global variables).

See [1] for an example of a formally proved implementation of an array search algorithm (implemented in the Dafny language, in this case. You can press the play button to see if the theorem prover approves of this implementation).

As you can see, in Dafny there's no need to prove that no other memory is touched, as it's implicit in the type of the method. However, even this example is not as good as it could be because:

- Ideally, you wouldn't need to specify that function arguments aren't null. This would have been implicit had the implementation language (Dafny) not had the concept of null (like Rust doesn't, for instance).

- Also ideally, you would refactor the pre- and post-conditions into a separate predicate, like it was done for sorted(), so that you could prove any array search function to be correct using the same short specification (and if a bug is found in the specification, the fix would be applied to all implementations of array search functions automatically).

Regarding having a large Isabelle/HOL proof invalidated, that sounds very interesting and would love to read more about it! I can definitely admit I have never done any large proofs yet, so I am interested in hearing about your experience.

Ideally there should be some way of isolating parts of a large specification such that a bug in one part could not affect the other parts (similar to how functional languages guarantee that a function can never read anything but its input, and can never write to anything but its output), but I have absolutely no idea if this is even logically possible, as I am still just a beginner in this area.

Thanks for the interesting discussion!

[1] http://rise4fun.com/Dafny/loJb

   Depending on the programming 
   language being proved
That's true, GCed languages are simpler in this regard. But for low-level languages you have to do this 'by hand', and it can be complicated.

   not had the concept of null
That doesn't make the problem go away, you just have another case in the destructors (e.g. pattern matching) that you need to account for logically. Just kicking the can down the road.

   would love to read more 
   about it
Nothing published. Long story. I'll tell you in person, should we meet some day.

But your "intermediate specifications" are not part of your specification! The specification says that the result is a sorted version of the input. It does not care about the induction hypothesis in your program. It might be technically necessary depending on the verification condition generator you are using. When you verify a functional program using Isabelle, Lean, or Coq your statement is that the output is a sorted version of the input.

Yes. Mostly.

But in practise, you need to specify what sortedness means, and in a generic sorting algorithm that means axiomatising the comparison predicate, including its side-effects. Note for example the C standard library implementation of Quicksort does take an arbitrary predicate, which may not be transitive, may have arbitrary side effects etc. The behaviour of sorting becomes pretty complicated in this scenario.

Verification ≠ specification.

How is that the case in this specific example? It looks a lot harder to check for correctness.

> The specification is a lot smaller than the code, and so it's easier to read and manually verify that it's correct.

>> How is that the case in this specific example? It looks a lot harder to check for correctness.

Author here.

Here is the specification for the stochastic backpropagation algorithm:


The preconditions do not need to be inspected carefully because we prove that the models of interest satisfy them (example: https://github.com/dselsam/certigrad/blob/master/src/certigr...)

Here is the part of the specification that needs to be inspected carefully:


The syntax may seem strange to those unfamiliar with Lean, and a few of the functions involved may not be self-explanatory without reading their definitions, but the statement is conceptually simple: the stochastic backpropagation algorithm correctly computes unbiased estimates of the gradient of the expected loss.

It is subjective, but I personally think that this specification is vastly easier to understand and to check for correctness than the actual implementation of the stochastic backpropagation algorithm.

> You still have to transfer the same amount of information to the computer.

Different target languages to transfer the information to may result in somewhat uncorrelated defects, allowing finding defects through mutual comparison.

You can test a generator by writing a recognizer and vice versa.

Errors in formal proof of correctness are somewhat uncorrelated with errors made in tests and errors made in implementation. By doing all 3 you can diversify to reduce risk.

You have to make sure your spec is right, but you also have to make sure your tests are right, and people are pretty happy with the idea of testing.

Yeah, I'm not saying that automated testing (whether through unit tests or through theorem proving) is bad, just that the two approaches seem pretty much equivalent. All these strong claims about "proven correctness" get thrown about but I'm not sure it's very useful when all you're proving is that the computer did what you told it (which you know anyway - the problem is almost invariably that you told it to do the wrong thing.)

They are fundamentally not equivalent, you can not guarantee that a system respects a safety property with testing. You can prove that certain concrete instances, i.e your test cases inputs are safe, but you don't know where there exists a path in your program where the safety property is violated.

This is incredibly important in critical systems because it might not be possible to even detect a safety violation has occurred, for example file system corrupts data, distributed consensus algorithm commits to the wrong value, compiler slightly mis compiles your code.

We want to be ensure there exists no path where something "bad" can occur, this is what verification gives us over testing.

There is always the fundamental problem of needing to communicate what we want to the computer, but the goal is to do it minimal way, and then ensure the program we write exactly matches the minimal implementation.

This seems to be a newly hot topic, whether to fight adversarial inputs or looking for efficiencies. I bookmarked a few relevant blogs (but really need to read more deeply)




and another preprint: https://arxiv.org/abs/1706.10268

Meta to the Certigrad project, for those mystified by the 'lean' programming language, here is a good start: https://leanprover.github.io/programming_in_lean/programming...

PDF warning for those of us on mobile.

Previous discussion of the associated paper: https://news.ycombinator.com/item?id=14658832

> Note: building Certigrad currently takes ~15 minutes and consumes ~7 GB of memory.

Why does it take so long and use so much memory?

Author here.

Building Certigrad involves replaying all tactic scripts in the entire project to reconstruct all of the formal proofs, and then checking each of the formal proof objects in Lean's small trusted kernel. Proving (and checking) the main correctness theorem for stochastic backpropagation is very fast. The vast majority of the time and memory is spent verifying that a specific machine learning model (AEVB) satisfies all the preconditions for backprop. This involves proving several technical conditions, e.g. that various large terms are (uniformly) integrable. We have not experimented much with simplification strategies, and there is probably a lot of room for improvement in bringing these numbers down. It would also be good to provide an option to build the system without reconstructing the proofs; checking the proofs is analogous to running the entire test suite, and most users do not do this for every tool they build.

Could this problem have been (partly) avoided by going for an LCF-based prover such as HOL or Isabelle/HOL, rather than a system based on the Curry-Howard correspondence?

I do not even know how I would have built Certigrad in Isabelle/HOL in the first place. In my first attempt to build Certigrad, I used a non-dependent type for Tensors (T : Type), and I accumulated so much technical debt that I eventually gave up and started again with a dependent type for tensors (T : Shape -> Type). This was my only major design error in the project, and development went smoothly once I made this change.

ITTs like Lean are more expressive as logics than HOL, but I doubt that Certigrad needs even a fraction of HOL's power. So anything you need that you express as types in Lean you should be able to express as theorem in HOL. Presumably that is inconvenient (= technical debt), but why?

Tensors are a good example: In a proof you might want to do induction over the dimensions of a tensor. This means your type of tensors needs to contain all tensors of all dimensions. But working on the type of all tensors is not so nice anymore: a lot of algebraic properties do not hold, they only hold for tensors of a specific dimension. An example is the Tensor library in the Deep_Learning AFP entry.

Now in Isabelle most strutures are encoded as type class, but when your type is not anymore in the type class suddenly you need to proof a lot of theorems yourself, and the automation does not work as nice as with type classes. Generally, in HOL provers you want at least that your type at least has nice algebraic properties on the entire type. If this is not the case proofs get much more convoluted. Isabelle's simplifier supports this case using conditional rewrite rules, but it still does not work as nice as using type inference to handle this cases.

In dependent type theorem provers it isn't a problem to proof everything on tensors with a fixed dimension. When a proof requires to do induction on the dimension itself these kind of shape can always constructed in the proof itsefl.

The memory consumption will surely be better, as HOL or Isabelle will only store the fact that a theorem was proven, but not how. But then Lean can store the proofs and has two independent external type checkers. Isabelle has the infrastructure to do this, but it can not cope with it after a certain size (Isabelle's HOL-Proof does not even contain all of HOL/Complex_Main). In my experience Lean feels much faster than Isabelle, so I would guess the proof will take much longer than in Lean. But I don't have any concrete measures.

In Isabelle one might want to trust the code generator to evaluate certain computations in ML.

One of my PhD students will soon have to learn an interactive prover. I was to recommend Isabelle/HOL because it's got the best automation, but maybe I should consider Lean (and learn it along with him).

I worry slightly about Lean being immature, and lacking a big library eco-system in 2017. OTOH, it's also good to be at the cutting edge.

> Machine learning systems are also difficult to engineer because it can require substantial expertise in mathematics (e.g. linear algebra, statistics, multivariate analysis, measure theory, differential geometry, topology) to even understand what a machine learning algorithm is supposed to do and why it is thought to do it correctly.

It's like Teen Talk Barbie.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact