
Formal Verification: The Gap Between Perfect Code and Reality - tom_mellior
https://raywang.tech/2017/12/20/Formal-Verification:-The-Gap-between-Perfect-Code-and-Reality/
======
Animats
Having been down this road [1], a big problem is that the people doing this
tend to be too much into the theory and not enough into wanting programs not
to crash. That's some of the Coq world.

It doesn't have to be this way. Microsoft's work in this area led to the
Static Driver Verifier [2], which was a huge success. Windows crashes due to
kernel drivers breaking the rest of the kernel have pretty much disappeared.
They were once (Windows XP and earlier) responsible for over half of Windows
crashes.

Most of the things you want to prove about programs to make sure they don't
crash are mathematically trivial. Detecting buffer overflows and off-by-one
errors seldom requires elaborate proofs. If they do, your program design
probably has a problem. Simple proofs can be handled by a prover which uses a
Nelson-Oppen decision procedure. This is a complete decision procedure for
formulas which contain only addition, subtraction, multiplication by
constants, Boolean operators, structures, and arrays. This covers most data
access operations. It's fully automatic and complete - the algorithm always
succeeds within its problem space. You need one of those for the easy parts.
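
To give a flavor of that problem space, here's a minimal sketch of such a goal
checked with the modern Z3 solver's Python bindings (my stand-in, not the
prover described here):

    # A typical bounds-check goal: pure linear arithmetic, decidable
    # automatically. Z3 (pip install z3-solver) stands in here for that
    # class of prover; the original system predates it.
    from z3 import Ints, And, Implies, Not, Solver, unsat

    i, n = Ints("i n")
    # If 0 <= i < n - 1, then a[i + 1] is in bounds for an array of length n.
    goal = Implies(And(0 <= i, i < n - 1), And(0 <= i + 1, i + 1 < n))

    s = Solver()
    s.add(Not(goal))           # search for a counterexample
    assert s.check() == unsat  # none exists: the bounds check is proved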

For the hard parts, we used the Boyer-Moore prover. (Sources are available and
I brought it back to life on Github.[3] It runs 3000x faster than it did in
1982.)

The interface with the language is crucial to usability. Most efforts get this
wrong. They put the verification stuff in separate files, or in comments. It
should be part of the normal code. We built the verification language into the
compiler, so you used the same source and the same syntax for compiling and
verification. There were additional statements - ASSERT, ENTER, EXIT, etc. The
same compiler front end was used for both compiling and verification. The
verifier itself ran off the byte code from the compiler front end. All type
and size issues had been decided at that point, so the verifier was looking at
unambiguous operations such as "pop two operands from stack, add in 16-bit
twos complement arithmetic, push on stack". This captured the semantics of the
program the same way the compiler did.

The user never sees the easy stuff. Every subscript use generates a bounds
check assertion which has to be proved. If the prover is able to prove the
assertion true, the user doesn't see it. Only the errors are visible. Another
trick is that you don't write "assert(a and b and c)", you write
"assert(a); assert(b); assert(c);". Those generate separate proof goals, and
if only one fails, the user gets error messages about only that one. Also,
when you write
"assert(a); assert(b);", the second one is proved under the assumption the
first one is true. This prevents cascading errors.
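
A loose Python rendering of the splitting idea (my sketch; the real system
used Pascal with a built-in ASSERT statement):

    # My sketch in Python; the original system was Pascal-based.
    def clamp(x, lo, hi):
        result = max(lo, min(x, hi))
        # One combined assertion: a single proof goal, a single opaque failure.
        assert lo <= result and result <= hi
        # Split assertions: two separate goals, failures reported one by one,
        # and the second is proved under the assumption that the first holds.
        assert lo <= result
        assert result <= hi
        return result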

A key point here is that verification is meaningless unless you prove _all_
the theorems. If you write "assert(false)" other errors go away, and you're
left with one simple error, which you cannot eliminate. You don't get to
dismiss it. It's possible to write an assertion that's equivalent to
"assert(false);", such as "assert(0 > 1)"; This is why I'm so negative on the
arguments from some of the Rust crowd that a little undefined behavior is OK.
It can blow at any seam.

When you needed to prove something hard, you wrote

    
    
        assert(a);
        assert(b);
    

where a could be proved by the Nelson-Oppen prover from the code before it,
and b was chosen such that if you assumed b, the Nelson-Oppen prover could
prove
anything required past that point. Now you had one error message, that the
system couldn't prove "a implies b". So then someone had to try to prove that
with the Boyer-Moore prover. The Boyer-Moore prover is mostly automatic, but
often someone had to provide intermediate theorems to get to a hard one. Those
were all machine-checked. If this succeeded, the Boyer-Moore prover generated
a theorem that a implies b for use by the verifier. Rerunning the program
verifier would then succeed. Such theorems could be reused with different
variable bindings, so if the same problem came up again, it didn't require
more work with the Boyer-Moore system.

If you're generating vast amounts of user-visible proof work, something is
wrong.

The whole system I'm describing is now on Github.[4] It's decades-old code,
but I've ported about 80% of it to modern compilers. The Franz LISP to Common
LISP port of the prover isn't working well, and the author of that code died
years ago. If anyone wants to work on that, go ahead.

[1] [http://www.animats.com/papers/verifier/verifiermanual.pdf](http://www.animats.com/papers/verifier/verifiermanual.pdf)
[2] [https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/static-driver-verifier](https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/static-driver-verifier)
[3] [https://github.com/John-Nagle/nqthm](https://github.com/John-Nagle/nqthm)
[4] [https://github.com/John-Nagle/pasv](https://github.com/John-Nagle/pasv)

~~~
nickpsecurity
I appreciate your detailed post. Very interesting, esp. with Nelson-Oppen
being new to me. I remember Boyer-Moore from studying the CLI stack. Your
comment brings three things to mind:

1. It implies, esp. with the links, that one should both look into and build
on the BM prover. Whereas my foray showed that the high-assurance sector moved
on from BM to ACL2, with the latter used in all sorts of stuff (esp.
hardware). The Hardin et al. link below is my favorite at the moment. Your
GitHub page on nqthm even says to avoid BM in favor of ACL2 for new stuff.
Anything you want to add on that, such as things ACL2 could use from BM, or in
general? I know from other works that new and improved doesn't mean better in
every way. Also, doesn't ACL2 lack a tiny proof kernel checking its steps like
the LCF provers, or am I misremembering? I have seen work embedding it in HOL,
though.

2. It reminded me about the Gypsy Verification Environment that got used on
lots of B3/A1-style projects. I didn't realize that was on the FTP site. It
might be interesting to get it into runnable shape for reliving history and
seeing how things were done, similar to what you appear to be doing with those
other works. I know it dates to the mid-70's, but the mid-to-late 1980's paper
on applying the 2.0+ version doesn't mention your work at all in its
references, even though it was done in a similar time frame. Why do you think
that is? Since it doesn't, what's better or worse about the approaches if we
compare your Pascal-F method to GVE? Side note: the report in the tarball
describing the Gypsy, Middle Gypsy, and Micro Gypsy languages was really good
on readability vs detail included. I always love that.

3. The current contenders in verified, low-level programming are the mature
SPARK Ada tool and Microsoft's Dafny. Would you recommend skipping the Pascal
tech to just enhance one of them directly, or extracting new languages to
their form for use with their tooling? The latter was my plan if I build
anything, with SPARK as the target, as many have done. The Jasmin language
similarly used Dafny. Anything you think especially SPARK Ada could learn
about automating practical analyses from Pascal-F, Gypsy, or BM/ACL2?
Interestingly, SPADE-Pascal, SPARK's predecessor, was developed in the
mid-80's as well.

As usual, just trying to see how these points in history might connect in old
lessons or new opportunities.

[http://www.ccs.neu.edu/home/pete/acl206/slides/hardin.pdf](http://www.ccs.neu.edu/home/pete/acl206/slides/hardin.pdf)

~~~
Animats
I'm not suggesting one should use the original Boyer-Moore theorem prover
today. It's of historical interest. I made it runnable again because I used it
when it was state of the art, and I'd previously ported it once. But it's
1980s technology.

I like their constructive approach to mathematics. They build number theory up
from something close to the Peano axioms, one proof at a time. This is good
for soundness. I built on that and didn't add axioms. Proving McCarthy's
"axioms" for arrays in Boyer-Moore theory was possible, and was done. Instead
of explicit quantifiers, you have recursive functions which are proved to
terminate. That gets you out of many annoying logic knots associated with
quantifiers. Like undecidability.

I've been out of this for a long time. There's been considerable progress. But
the tooling still seems not to be all that programmer-friendly. It could be.
We had that partly figured out. What we didn't have was enough CPU power to
make it go fast enough to be useful.

I'd hoped Rust would go in for more verification. C/C++ has so much undefined
behavior that huge efforts have to be made just to nail down the language.
Rust is much better in that respect.

People get carried away with abstraction in this area. I wouldn't try to
verify macros or generics before they're instantiated. That requires elaborate
and iffy type theory. Instantiate them, look at the generated code (say at the
LLVM input level) with hard types for everything, and work on that. That's
unambiguous. Then build a system with enough automation that the user doesn't
have to do work for each instantiation. That's what computers are for. Don't
try to be too clever.

~~~
touisteur
Automated proof is what you get with SparkAda. With the 2014 version you get
executable contracts (thanks to DbC added to Ada2012) and lots of help from
the gnatprove tool (that acts as a frontend - meaning it implements all
runtime checks the compiler implements - to why3 or other provers or static
analysis tools - see the inclusion of codepeer recently). Then once you've
generated all the verification conditions you run z3, cvc4, alt-ergo... You
define the amount of effort you're allowing and if they fail to prove your VCs
you get immediate feedback (in your dev IDE) and you drill down to 'why' and
'how to help the prover' quickly once you've spent some time doing it (most of
the time it bogs down to 'state your unstated assumptions'). And if you're not
lucky but still motivated you can quickly jump into Coq or Isabelle with your
VCs.
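
For readers who haven't seen Ada 2012 contracts, the idea is roughly this (a
loose Python analogy of my own, not actual SPARK syntax): the same conditions
execute as run-time checks under test and become verification conditions for
the provers.

    # Loose Python analogy of an executable contract (not real SPARK syntax):
    # the Pre/Post conditions both run as checks during testing and are
    # discharged statically by gnatprove's back-end provers.
    def saturating_add(x: int, y: int, cap: int) -> int:
        assert cap >= 0              # Pre
        result = min(x + y, cap)
        assert result <= cap         # Post
        return result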

They're very pragmatic. They went for a subset of Ada that is deemed
'provable' but still largely usable. You get lots of nice typing features and
RT checks in Ada.

There are still some difficult areas: aliasing, floating point, and
user-friendliness, but the tech is working.

As for generics, that's the tactic SparkAda adopted. It's disturbing at first
('they went the lazy way!') but it's pragmatic and it works great. In fact,
sometimes you'd be inclined to transform a procedure, function, or package
into a generic, because with correctly restricted, specific types and
parameters the proof effort is easier.

We're at a point where proof of absence of runtime errors is almost
achievable by mere programmers. We'll soon be asking 'do we go for
test+fuzzing+quickcheck for robustness, or proof?'.

As for automation, they're always working on small but useful stuff. The
latest one I can think of is loop unrolling. It might look funny, but writing
loop invariants and variants is tedious. They're slowly (lead bullets) working
up small automation steps.

------
wizeman
I almost completely disagree with the author of this article.

Yes, formally verified systems still have bugs due to the TCB (the trusted
computing base). But that's because the TCB is precisely the _only_ part of
the system that is not formally verified! So of course it's going to have
bugs.

If so many crashing and correctness bugs were found in the (small) trusted
part of formally-verified systems (well, in reality only 16 bugs were found in
3 entire formally verified systems, but I digress), imagine how many similar
bugs there are in systems that are not formally verified at all!

If anything, what this article should be arguing is that operating systems,
proof-finding and verification tools, and compilers should _also_ be formally
verified (and some already are), so that even bugs in these parts of the
system are also eliminated.

The only part of the article that I agree with is that it's not feasible to
write 10x more proof code than real code.

But IMHO, this is why proof automation is really important to have - tools
like Isabelle/HOL, Why3 and SMT solvers like Z3 provide this (the latter of
which the author mentions).

Coq, on the other hand, has much less automation and the result is that it's
much harder to use and results in larger proof code. Yet, it's seemingly
inexplicably much more popular - perhaps it's the catchy name?

~~~
mafribe
Coq has a much stronger logic (albeit constructive) than HOL. That is
_sometimes_ convenient. Most users of proof assistants don't get far enough
for powerful automation to be unavoidable.

In addition, Coq benefits from a network effect: Coq started earlier,
targeting university teaching, primarily at UPenn with Pierce's Software
Foundations [1].

[1]
[https://softwarefoundations.cis.upenn.edu](https://softwarefoundations.cis.upenn.edu)

~~~
wizeman
That Coq has a stronger logic is an argument I hear often, but is that really
so important when the result is that it's so hard to use that no real-world
software engineer is ever going to bother?

Maybe with HOL I will not be able to prove some deep theoretical mathematical
result that I could with Coq's Calculus of Constructions, but if I can
otherwise much more easily prove >99% of my code to be bug-free, then I would
argue that perhaps it's the better "tool" for the job...

~~~
seanwilson
> Maybe with HOL I will not be able to prove some deep theoretical
> mathematical result that I could with Coq's Calculus of Constructions, but
> if I can otherwise much more easily prove >99% of my code to be bug-free,
> then I would argue that perhaps it's the better "tool" for the job...

Is this true though? What program properties can Isabelle easily automate the
proofs of that Coq cannot?

~~~
wizeman
> Is this true though? What program properties can Isabelle easily automate
> the proofs of that Coq cannot?

Almost all?

Granted, I haven't used Coq very much, so correct me if I'm wrong, but isn't
it true that in Coq, with the exception of basic proofs like simple arithmetic
expressions, in order to prove a theorem you have to at least explicitly know
and write the names of the lemmas that are needed to prove it (apart from how
they fit together)?

In Isabelle/HOL, many times you literally only need to write "try" and
Isabelle will find the proof of the theorem for you, including the lemmas that
were needed to prove it and how they fit together (assuming these lemmas were
already proven before). If it turns out that your theorem is false, it can
show you an example of inputs where it fails.

This auto-proof functionality I'm describing is mostly thanks to Sledgehammer
[0], the auto-counterexample finding is due to Nitpick (and another tool which
I can't remember), although there are a few other automation tools that are
also very useful.

Granted, you still have to have a high-level understanding of how you want to
prove your code, but I would argue this eliminates a lot of the work and
(wasted?) time you need to accomplish the final goal...

[0] [http://isabelle.in.tum.de/website-Isabelle2009-1/sledgehammer.html](http://isabelle.in.tum.de/website-Isabelle2009-1/sledgehammer.html)

~~~
seanwilson
Do you have any example proof scripts of some basic algorithms where the
proofs can be almost fully automated in Isabelle?

~~~
wizeman
See [1] for an example (edit: although this may not be the best example
nowadays, see below):

Quoting:

> Consider the conjecture "length(tl xs) <= length xs", which states that the
> length of a list’s tail (its "cdr") is less than or equal to the length of
> the entire list.

> Thanks to Vampire, Sledgehammer finds the following proof: by (metis
> append_Nil2 append_eq_conv_conj drop_eq_Nil drop_tl tl.simps(1))

Those are all the lemmas that were needed to prove the theorem. They were
selected completely automatically.

AFAIU the "metis" procedure tries to fit the lemmas together so that they
prove the theorem, and I believe the HOL kernel gives high assurance that if
"metis" succeeds, then the theorem was proven correctly (but my understanding
may be wrong here).

Note that Sledgehammer may even automatically find and return a human-readable
proof; for the example above, it returns something like this:

    
    
      proof -
        have "tl [] = []" by (metis tl.simps(1))
        hence "∃u. xs @ u = xs ∧ tl u = []" by (metis append_Nil2)
        hence "tl (drop (length xs) xs) = []" by (metis append_eq_conv_conj)
        hence "drop (length xs) (tl xs) = []" by (metis drop_tl)
        thus "length (tl xs) <= length xs" by (metis drop_eq_Nil)
      qed
    

Edit: That said, note that at least nowadays Isabelle/HOL seems to be able to
prove the above theorem by simplification alone ("by simp"), so this may not
be the best example.

[1] [http://people.irisa.fr/Thomas.Genet/ACF/BiblioIsabelle/isabelleOverview.pdf](http://people.irisa.fr/Thomas.Genet/ACF/BiblioIsabelle/isabelleOverview.pdf)

~~~
cmrx64
Metis is a first-order automated theorem prover. The metis tactic will handle
encoding the goal as a first-order problem, using ~just the rules you provide
it, invoking the actual metis prover, and reconstructing a proof (in the
kernel logic) from the output proof. So it is impossible to prove things that
are false using the metis tactic (i.e., metis is entirely outside the TCB).

Re: the simplifier, I feel like it's a bad example because coming up with good
simp rules, and knowing when to add them and when not to when writing new code
in Isabelle/HOL, is almost an art.

------
Aardappel
This is similar to the conclusion I came to after doing my PhD in a CS
department that had lots of people enamored by formal methods: that the
usefulness of it is inversely proportional to the complexity of the project to
be verified.

Sad that this apparently hasn't changed much in the 20 years since.

I say "enamored", because formal methods seems to me like a field that gets a
lot of attention because the possible outcome of it is so damn attractive: if
we indeed could easily prove software correct, that would be the biggest
advance in software, EVER. But is a bit of a fallacy that just because the
imagined end result is so elegant and desirable, that there must be a path to
it.

Also, any gains that can be claimed by formal methods are hard to evaluate in
a vacuum, as using any other techniques that involve spending 10x more time
thinking about / working with some code can possibly have equally beneficial
reductions in bug counts. Most software that has undesirable bugs is such
because no time/money investment is deemed worth it to make it better, not
because formal methods is the only way to improve it.

I find it funny that meanwhile, mainstream software is going in the opposite
direction, with ever larger systems written in dynamically typed languages
like JavaScript and Python, or in overly simplistic static type systems like
those of Java and Go. You'd think that if formal methods were our far-away
desirable future, we'd be incrementally moving in that direction by using more
expressive type systems (Haskell, Rust..) and maybe even dependent typing. The
invariants that can be encoded in such systems can rule out usefully large
classes of bugs with way lower effort than formal methods, but somehow
movement on that front is slow. If so few people care to even grab the
low-hanging fruit, why do formal methods people think we can jump from chaos
to order in one step?

~~~
toast0
For me, the problem with formal methods is that a prerequisite is a (formal)
specification, and I've never seen one of those in 20 years of software work.

~~~
Aardappel
Yes. And the ones I did encounter had zero "testing" done to see if they
corresponded to the real world.

Hence why we need more incremental tools, that allow us to encode more rules
into the (type) systems we already use.

~~~
catnaroek
> And the ones I did encounter had zero "testing" done to see if they
> corresponded to the real world.

You can excuse human beings for being unable to write the programs they want,
but if they can't even _specify_ the programs they want, they don't deserve
the epithet Homo _sapiens_.

------
gchadwick
All of my formal verification experience comes from using it in hardware, so
the following comment may not apply so well to software where I have no
experience of formal methods (beyond learning about it at university).

I think the mistake the author is making here is seeing formal verification as
an all or nothing affair. Either you formally prove your program to be correct
or it's useless. I would agree with them that doing this is a huge amount of
effort and potentially just not feasible for real systems.

However there's a huge amount of value in formal verification as an extra tool
to assist you in your verification efforts. By combining traditional
verification techniques (normal testing in software terms) with formal methods
you have a whole new way to find bugs, one which works rather differently and
hence tends to find different bugs. You may not get anywhere near the confidence
you've formally proved everything (for one I find the more interesting
properties just never get proven, but leaving the tool to run for a while
often throws up bugs that are tricky to find via other means) but it does
accelerate your verification efforts.

It's also a useful tool for exploring certain scenarios you as the designer
(or programmer) may be worried about. You can overconstrain it to just look at
certain behaviour and you can ask it to show you some specific execution that
hits a particular property. Maybe there's some code path that's very hard to
stimulate for whatever reason and you want to check out its functionality.
You can get the formal tool to hit it and then see what it does.

~~~
ghostly_gray
Hi, author here. I definitely agree the valuable part is thinking through the
verification process, which forces you to devote a lot of time to
understanding the system and adds resilience in edge cases. Someone on Twitter
brought that up:
[https://twitter.com/jetapiador/status/943909582279233536](https://twitter.com/jetapiador/status/943909582279233536).
Perhaps that's what I should have concluded my argument with.

------
oldgradstudent
This is an excellent blog post.

If you believe a proof of correctness produced by a formal verification tool,
I have a bridge to sell you.

For a proof of correctness, you need to model the environment, and write a
specification or a golden model. These are just as prone to bugs as the system
you are trying to verify (even assuming the tools are correct).

That is not to say formal verification is useless. Developing a proof will
force you to formalize your intuitive grasp of the system and its interfaces.
Counterexamples produced by the tools will often help to find deep bugs that
would be almost impossible to find any other way. And after many iterations of
debugging both the system and the proof, you might gain a lot of confidence in
the proof and the system.

But even after all of that, trusting a formal proof as a guarantee of
correctness is insane.

~~~
ique
What you're saying is just "trusting any system is insane", which sure, it
might be (depending on the criticality of said system).

But that's not very insightful or informative, how does that drive any sort of
decision-making or improvement process?

Do you have any suggestion on how to improve the correctness of systems that
work better than formal verification?

~~~
oldgradstudent
Formal verification techniques can be very useful, especially for finding
deep, often insanely complex bugs.

The process of developing the proof can produce a lot of insight about the
system and expose hidden assumptions that were never considered while
designing the system.

It is the final proof itself which is the problem. It does not contribute
anything except for terminating the debugging process.

Because the specification can be just as complex as the system, the proof can
only provide a false sense of security. A correctness theatre.

~~~
seanwilson
> Because the specification can be just as complex as the system

Can you give an example of this? I don't find this true in general at all e.g.
writing the specification of a sorting algorithm is much easier than
implementing most practical sorting algorithms.
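
For instance, the full functional spec of sorting fits in a couple of lines,
however intricate the implementation that has to satisfy it (a minimal Python
sketch of my own):

    # Minimal sketch: the spec of sorting is "ordered and a permutation of
    # the input", whether the implementation is insertion sort or a heavily
    # optimised heapsort.
    from collections import Counter

    def satisfies_sort_spec(inp, out):
        ordered = all(a <= b for a, b in zip(out, out[1:]))
        permutation = Counter(inp) == Counter(out)  # multiset equality
        return ordered and permutation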

> the proof can only provide a false sense of security. A correctness theatre.

You could use this argument to say that unit tests, type systems, code review
etc. are correctness theatre. All these things prevent mistakes.

~~~
mafribe
> Can you give an example of this?

It's pretty clear that a _fully_ formal specification of a program must be
(roughly) as complex as the program itself, for otherwise you could construct
an infinitely descending chain of ever simpler specifications which leads to a
contradiction.

The reason that you found the specification of a sorting algorithm simpler
than the sorting algorithm is that you probably have not fully specified the
algorithm. Consider the sorting algorithm in C's standard library, which
allows the user to specify a comparison function. Now imagine you'd pass
something like the following comparison function:

    
    
       def compare(x, y):
           print("I'm comparing %i with %i" % (x, y))
           if x < y: return 1
           if x == y: return 0
           return -1
    

It almost completely exposes the inner workings of sorting. Hence a full spec
needs to account for this.

~~~
seanwilson
> It's pretty clear that a fully formal specification of a program must be
> (roughly) as complex as the program itself, for otherwise you could
> construct an infinitely descending chain of ever simpler specifications
> which leads to a contradiction.

That's like saying a theorem statement must be roughly as complex as the
proof.

Also consider that it's much harder to reason about (and prove) the
correctness of heavily optimised versions of algorithms compared to naive
implementations. For example, a proof of correctness for heapsort is more
involved than one for insertion sort.

------
fmap
Whenever I read a post like this I have to wonder: was there as much
resistance to writing tests for your code before that became common practice?
Many of the arguments seem to apply to test driven development in the same way
as they do to formal verification:

- You have to go over every single line of code (to show full functional
correctness|to achieve full test coverage).
- The amount of code that goes into (verifying|testing) an application is on
the same order of magnitude as the application itself!
- There are still bugs in (unproven|untested) parts of the application!
- It's not perfect, so why should we use it at all?

At the end of the day, formal verification is slowly but surely becoming
mainstream. Whether through more advanced type systems, static analysis,
domain specific model checking, or through interactive theorem proving. We're
finally at the point where whole program verification is starting to become
realistic for some domains and it will only become easier from here.

---

As for the article itself, others have already pointed out some specific
problems. Let me just add one more: Simulation proofs do not scale, but are
far from the only option. I would encourage you to look into more high level
verification techniques using axiomatic semantics, such as Iris in Coq.

~~~
canes123456
Is this a serious question? Yes, people are still saying the same thing about
test-driven development.

------
guhcampos
Formal Verification is the state of the art verification method for hardware
these days, with most of the ASICs in the market going through it, and
recently many SoC's and complex chips like GPUs and non-RISC CPUs. It does not
mean that other methods of verification are dying or are ever going to. FV is
just as good as the combination of the tools and the people working on it, and
it's an extremely difficult discipline to work on.

FV for software is a whole other beast: we simply lack the compute power to
thoroughly verify even simple software properly, and the tools are still in
their infancy.

My personal opinion, having worked for the lead FV company in the market for
years, is that FV for software is probably sufficiently far-fetched that
Machine Learning methods for verification might be there before FV for
software becomes a reality. The reason is: while for hardware it is imperative
to have 100% correctness, since the cost of bugs is absurdly high, for
software having 90-something percent is mostly acceptable, so if it's cheaper
to let a neural network (or something else, I'm no ML expert) estimate the
correctness of the software, that's what's going to happen.

------
mpweiher
Very interesting. For example:

"The guarantees of a formally verified system rest entirely on the assumptions
of the system, including its trusted computing base (TCB). The TCB of a
verified system includes the specs, which are manually written and hoped
(fingers crossed!) to be correct, the verification tools (e.g., Coq engine,
Z3, compiler), and the runtime infrastructure (OS, hardware). This should be a
huge red flag. For instance, if you’ve used Z3, you’ll agree that it’s black
magic to everyone except the developers, and my professors confess that
they’ve found correctness bugs, especially in decidability theories that are
not commonly used."

~~~
mbid
I don't quite understand why Z3 is part of the TCB. Isn't Z3 generally used to
generate proofs which are then verified by the respective proof assistant?

~~~
tom_mellior
Yes, if you use Z3 as part of a proof assistant like Isabelle/HOL or F-Star.

No, if you use Z3 as part of some other system that does not have its own
primitive notion of proofs and proof checking. Many tools are like this: Dafny
is one, I think, as is Why3, and the Alive system for checking the correctness
of LLVM peephole optimizations. They simply trust Z3.

It's also used in lots of other things that are not "verifiers" as such. For
example, to find new compiler optimizations by superoptimization:
[https://arxiv.org/abs/1711.04422](https://arxiv.org/abs/1711.04422) (And sure
enough, if you search the paper for "Z3", the first hit is a paragraph where
the authors describe that they once had a miscompilation due to a bug in Z3.)

------
Silhouette
It seems to me that the arguments here are fair, but maybe the conclusion
isn't.

 _But the guarantees that we expect are much stronger than a correct protocol,
handshake, or compiler optimization. We need the entire system to be reliable,
and we need proof effort to be minimal for widespread adoption to be
feasible._

Ideally, yes, of course that is true. But even if it is not currently true,
having _some_ aspects of a system formally proven still means we can focus our
limited resources on the other aspects and the integration, and that's still
beneficial.

------
godelmachine
Adrian Colyer's review of "An Empirical Study on the Correctness of Formally
Verified Distributed Systems" covered in The Morning Paper

[https://blog.acolyer.org/2017/05/29/an-empirical-study-on-the-correctness-of-formally-verified-distributed-systems/](https://blog.acolyer.org/2017/05/29/an-empirical-study-on-the-correctness-of-formally-verified-distributed-systems/)

Thank me later :)

~~~
platz
No!

------
davidgrenier
_Although the CompCert paper explicitly says that the parsing step, as well as
the assembling/linking steps, are not verified, this kind of bug is a huge
blow to the credibility of formal methods. It really makes you wonder what
sort of bugs formal verification is supposed to catch._

That's strange, I reach the exact opposite conclusion from this statement.

~~~
ghostly_gray
Author here. I guess what I was trying to say is that CompCert as a system has
been trusted for over 10 years, yet only now do we find such a silly
correctness bug. People won't read the paper to find out that parsing is
unverified, they'll just trust CompCert because they hear it's been rubber-
stamped as "formally verified". And that's dangerous.

~~~
nickpsecurity
A better way to do it is to pull up C compilers, get their bug rates, and
compare them to CompCert's for both the unverified and verified parts. The
first tests formal specs + a safe language, while the latter tests the
verification. You can use smaller C compilers if you want to be fair. Two or
three spec errors versus what you find in other compilers would corroborate
the CompCert team's claims. Likewise for verified TLS versus ad hoc protocols
or implementations.

------
crb002
The gap is IDEs becoming automated theorem provers. You put in the type
signature of what you want and keep adding unit tests until the IDE finds the
correct code for you.

~~~
marcosdumay
Unit tests are not formal proofs.

Their philosophies are completely different. Formal proofs aim at proving the
nonexistence of bugs, while tests (any kind of them) aim at revealing bugs in
case you add them.

You can sometimes generate code from a proof, but you can't do that from
tests.

------
mannykannot
I have begun to think of formal methods as a sort of tar pit. The idea is
attractive to people who are concerned about correctness and are aware that
you can reason about program behavior, but they do not yet provide a solution
to the practical problems of writing correct software. On the other hand, this
is not an excuse for abandoning any attempt at reasoning about programs, and
settling for trial-and-error as the only way to write software.

------
zitterbewegung
Formal methods have been verifying real world systems using ACL2 for a long
time. Also, sqlite has 730 times the amount of test code:
[https://sqlite.org/testing.html](https://sqlite.org/testing.html)

Formal verification is not a magic bullet, though, and it doesn't make your
systems bulletproof.

------
he0001
Isn’t formal verification subject to the problem of “when to stop proving
correctness”? Even if you can prove one part you still have to prove that it’s
used correctly? Like in the WPA2 case?

~~~
wizeman
You don't need to go all the way and specify/prove everything in order to
achieve significant assurances of absence of (at least certain classes of)
bugs.

See the section "Use SPARK at the right level for you" [0] for an example of
how you can approach formal verification.

You can start proving certain things, like that your code is free from run-
time errors, which already significantly eliminates a lot of bugs.
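
As a tiny illustration of one such "no run-time errors" obligation (a Z3
sketch of my own, not SPARK's actual toolchain): showing a division can never
trap under its precondition.

    # My illustration with Z3's Python bindings (pip install z3-solver), not
    # SPARK's toolchain: under the precondition n > 0, the divisor (n + 1)
    # in an expression like total / (n + 1) can never be zero.
    from z3 import Int, Implies, Not, Solver, unsat

    n = Int("n")
    vc = Implies(n > 0, n + 1 != 0)   # the division-safety check

    s = Solver()
    s.add(Not(vc))                    # look for a counterexample
    assert s.check() == unsat         # none: the run-time check can't fail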

Then you can start proving certain properties, as was done for WPA2, to
make sure that at least those properties hold and are bug/error free.

Finally, you can achieve full functional correctness, which gives you a
significantly higher assurance that your implementation really does satisfy
your functional specification. You can start doing this in parts of the code,
and progressively increase all the way to the entire code base if you so wish.

Every added formal verification step gives you higher assurance that your code
works as intended.

Sure, you can never prove that your code is 100% bug free (this would require
the functional specification itself to be bug free, and AFAIK there is no way
to verify this), but I would argue that 99% verification is much better than
0% verification.

[0] [https://www.adacore.com/sparkpro](https://www.adacore.com/sparkpro)

------
wwarner
What's really needed is a mathematical analysis of the size of the bug space
compared to the size of the proving code required to cover it. I.e., to get to
99.5% confidence that a system of size N is bug-free, it would take a proving
system of size f(N). Is f polynomial, a small exponent, a large exponent? If
it's a large exponent, then the more you take on, the farther you get from
your goal, and narrowly targeted efforts are the best we can do.

~~~
mcguire
Off the top of my head, I would suspect that to be off in uncomputability-land.

------
bitL
Some people work on monads over linear logic that supposedly could simulate
any imperative program and make it available for formal verification, so we
will see. OTOH I definitely wouldn't want to spend time proving the
correctness of every single line of code I write; it would make programming a
"no fun" zone and require a rocket-science level of understanding.

~~~
wizeman
> I definitely wouldn't want to spend time proving the correctness of every
> single line of code I write; it would make programming a "no fun" zone and
> require a rocket-science level of understanding.

Fortunately, you don't have to. Formal verification can be done as little or
as much as you want (and personally, I find it's quite fun, but that's
subjective).

You can look at it like it's testing on steroids - you can not only verify
that some part of your code works for certain inputs, but you can verify that
it works for _all_ inputs! (for the definition of "work" that you decide to
give it).
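
Concretely (a toy sketch of my own, using Z3): a test checks a property at a
few sample points, while a proof rules out counterexamples for every input at
once.

    # Toy contrast: testing samples the input space; proof covers all of it.
    from z3 import Int, Solver, unsat

    def double(x):
        return x + x

    # Testing: the property "double(x) is even" holds at these points...
    for v in [0, 1, -7, 10**9]:
        assert double(v) % 2 == 0

    # Proving: ...and for every integer, since no counterexample exists.
    x = Int("x")
    s = Solver()
    s.add(double(x) % 2 != 0)
    assert s.check() == unsat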

~~~
bitL
Yeah, it's fun to some extent; I still remember proving the whole of Peano
arithmetic and then using that to prove functional programs. But doing it at
real-world detail level I would probably go crazy fast; just proving that some
floating point computation does what it's supposed to would be extremely
tricky due to bit-level stuff and the sparsity of representable numbers that
fit into xyz bits of floating point.

------
ENGNR
Interesting

I'm hoping the formal verification crowd will start to influence ethereum
developers. You have small amounts of relatively simple code that absolutely
must be correct, or at least predictable; it seems like a perfect match.

~~~
troymc
That's already happening. For example, see
[https://runtimeverification.com/blog/?p=496](https://runtimeverification.com/blog/?p=496)

~~~
nuclx
Also see [https://quantstamp.com](https://quantstamp.com)

------
mcguire
Wrong question.

The question that needs to be answered is not, " _But the guarantees that we
expect are much stronger than a correct protocol, handshake, or compiler
optimization. We need the entire system to be reliable, and we need proof
effort to be minimal for widespread adoption to be feasible._ " (Pretend
that's a question, ok?) The question that needs answering is, "what is the
easiest way to provide (strong) guarantees that the code is reliable?"

If you wanted to write a filesystem, how much work would you have to put into
it before you could consider it "reliable" using traditional techniques? If it
is 10x more than it would take to simply write the code, then "MIT’s FSCQ file
system, developed over about 1.5 years using Coq, the complete system was 10x
more code than a similar unverified filesystem. Imagine that — 2000 lines of
implementation become 20,000 lines of proof!" doesn't sound so bad, does it?
And 10x is not necessarily a bad estimate---unit testing alone would account
for 2x, at a rough guess, and not only does it not cover all possible
executions, it also has a reputation in this neck of the woods of focusing on
the happy path.

"An Empirical Study on the Correctness of Formally Verified Distributed
Systems" has been discussed here before
([https://news.ycombinator.com/item?id=14441364](https://news.ycombinator.com/item?id=14441364))
but I'd like to mention one quote from it:

" _Finding 9: No protocol bugs were found in verified systems, but 12 bugs
were reported in the corresponding components of unverified systems. This
result suggests that recent verification methods developed for distributed
systems improve the reliability of components that, despite decades of study,
are still not implemented correctly in real-world deployed systems. In fact,
all unverified systems analyzed had protocol bugs in the one-year span of bug
reports we analyzed._ "

(And speaking as a guy who has some network protocol experience, some of the
bugs that were found in the "shim" layers were...likely caused by
inexperience: no, TCP does not preserve write boundaries on the _recv_ end,
and yes, UDP will duplicate packets under some error conditions.)

I will admit that many of the author's specific complaints are completely
true. This quote should be in big letters in the introduction to any formal
verification work: " _On top of all that, if any spec or implementation
changes even slightly, all the specs, impls, and proofs that depend on it will
change as well. This makes even small changes a huge pain._ " And Coq is
something of a pain in the rump. (I encourage anyone with a few spare hours to
work through _Software Foundations_
([https://softwarefoundations.cis.upenn.edu/](https://softwarefoundations.cis.upenn.edu/));
it's a lot of fun but at the same time very frustrating. [That should be
"many, many spare hours". --ed.]) The least damning thing you can say about
Coq is that it's not very well engineered. On the other hand, SMT solvers like
the kind of thing behind SPARK and Frama-C are brilliant when they succeed in
finding proofs, but when they fail... well, they light up a "?" on the
dashboard and expect the experienced driver to know what the problem is.
Further, they're rather limited on the programming features you can use with
them---Dafny and the Java verifiers have garbage-collected languages,
alleviating some of the pain.

But still, to address the author's fundamental issue,

" _I went into the class believing that formal verification is the future —
the only solution to a world of software ridden with bugs and security issues.
But after recent events and a semester of trying to apply formal methods, I’m
a serious skeptic. In this post, I’ll discuss why I think formal verification
has a long way to go — and why it just doesn’t work right now._ "

Ya' got a better option?

~~~
marcosdumay
> Ya' got a better option?

I don't think any clear-cut answer exists. For some applications (like your FS
example) formal proofs are a complete no-brainer. In other cases, you are
better served by many tests. In yet other cases, even tests are too much, and
you will do better matching your code's form to the specification as well as
possible. Not to mention the cases where spec validation is the real issue,
and all your attention will be there.

It's not even an either-or issue. You can have all of those situations
appearing in different places of the same code base.

------
nickpsecurity
[Copying/pasting my reply to the author on Disqus since I'm at work.]

Good write-up on the difficulties. The first I saw was Gutmann's paper that
set back the hype quite a bit:

[http://www.cypherpunks.to/~peter/04_verif_techniques.pdf](http://www.cypherpunks.to/~peter/04_verif_techniques.pdf)

His paper and yours both oversell the problems while underselling the
successes. It's best to keep the big picture in mind. Yes, it's extremely
hard. Yes, we shouldn't rely exclusively on it or set our expectations too
high. On Gutmann's end, the Orange Book of TCSEC (later, EAL7 of Common
Criteria) required many activities like I included in my own security
framework:

[http://pastebin.com/y3PufJ0V](http://pastebin.com/y3PufJ0V)

Each was supposed to check the others. Each time, you want experienced people
doing it, since it's the people that make these processes work. What's mostly
missing in this post, and esp. in Gutmann's, is all the evidence of defect
reduction. Most of the works he cites were tracking problems they found
throughout their lifecycle. They all said the formal specifications caught
bugs due to ambiguity. Some found formal proofs beneficial, often mentioning
"deep" errors they likely wouldn't have found with testing or code review.
TLA+ use at Amazon was a great example, where errors showed up 30+ steps into
protocols. Most of the old verifications found they caught errors just by
simplifying their specs for the prover, even if they never ran the proofs. And
for high-assurance security, the A1-class systems would go through 2-5 years
of analysis and pentesting, where evaluators said they did way better than
systems using mainstream practices. Most of those can't even tell you how many
covert channels they have.

If you’re interested in methods that helped in practice, I dropped a start on
a survey paper on Lobste.rs trying to shorten several decades worth of
material into an outline post:

[https://pastebin.com/uyNfvqcp](https://pastebin.com/uyNfvqcp)

Now, all that said, full proof is hard, with a lot of problems. You hit the
nail on the head about the infrastructure problem. I've always strongly
encouraged people to concentrate on building up one or two projects to have
infrastructure for the most common ways of solving problems in programming,
with as much automation as possible. They have at least been building most
things on Coq, Isabelle/HOL, HOL4, and ACL2. Having prior work to build on has
been helping them. As you've seen, though, they've not made this a priority:
the tooling is still horrible. So, horribly-difficult work with
horribly-difficult tooling. That leads to the next proposal I made that might
help you or some of your readers who want the 80/20 rule of most benefits
without most of the problems:

[https://news.ycombinator.com/item?id=15781508](https://news.ycombinator.com/item?id=15781508)

[https://lobste.rs/s/n7v658/your_thoughts_on_this_advice_those](https://lobste.rs/s/n7v658/your_thoughts_on_this_advice_those)

Here’s some of those methods on top of the TLA+ and Alloy links:

[https://www.eiffel.com/values/design-by-contract/introduction/](https://www.eiffel.com/values/design-by-contract/introduction/)

[https://www.win.tue.nl/~wstomv/edu/2ip30/references/design-by-contract/index.html](https://www.win.tue.nl/~wstomv/edu/2ip30/references/design-by-contract/index.html)

[http://www.anthonyhall.org/c_by_c_secure_system.pdf](http://www.anthonyhall.org/c_by_c_secure_system.pdf)

[https://www.amazon.com/Building-High-Integrity-Applications-SPARK/dp/1107040736](https://www.amazon.com/Building-High-Integrity-Applications-SPARK/dp/1107040736)

Here’s my recommendation in general. Learn lightweight methods such as Meyer’s
Design-by-Contract for most stuff, TLA+ for concurrency/order, Alloy for
structure, SPARK for low-level code, and so on. TLA+ on distributed stuff
seems to sell people the most. With a formal method in hand, then apply this
methodology: memory-safe language + lightweight method to model-check
important stuff + Design-by-Contract + tooling to generate tests from
specs/contracts + fuzzer running with runtime checks enabled. This combination
will bring the most quality improvement for least effort. In parallel, we have
the experts formally verifying key components such as kernels, filesystems,
network stacks, type systems, compilers, security protocols, and so on. They
do so modelling real-world usage with each one coming with contracts and
guidance on exactly how they should be used. The baseline goes up dramatically
with specific components getting nearly bullet-proof.

Also, you said you wanted the entire system to be reliable. There was one that
set the gold standard on that by noting how everything fails and building
systematic mitigations for it, in prevention and/or recovery:

[http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf](http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf)

[https://en.wikipedia.org/wiki/NonStop_(server_computers)](https://en.wikipedia.org/wiki/NonStop_\(server_computers\))

The OpenVMS clustering software alone let their systems run for years at a
time. One ran 17 years at a rail yard or something supposedly. The NonStop
systems took that to hardware itself with them just running and running and
running. Applying a combination of balanced assurance like I describe, NonStop
techniques for reliability, and Cambridge’s CHERI for security should make one
hell of a system. You can baseline it all with medium-assurance using tools
that already exist with formal verification applied to a piece of the design
at a time. The work by Rockwell-Collins on their commercially-deployed
AAMP7G’s shows that’s already reachable.

[http://www.ccs.neu.edu/home/pete/acl206/slides/hardin.pdf](http://www.ccs.neu.edu/home/pete/acl206/slides/hardin.pdf)

So, there’s you some weekend reading. Merry Christmas! :)

