
Proof of linearizability in Raft using Coq - jtgi
https://github.com/uwplse/verdi/pull/16
======
r0naa
Amazing job! It had been mentioned on the Raft mailing list that a proof was
in progress, but to be honest I did not expect anything to come out for a
year or so. Forgive my lack of faith! :-)

Well done!

edit: to provide some context

\- Raft is a distributed consensus algorithm that is seen by many as a viable
alternative to *-Paxos because of its relative simplicity. It was created by
D. Ongaro and J. Ousterhout (Tcl/Tk!)

\- D. Ongaro's dissertation includes a TLA+ specification of Raft; TLA+ is a
model checker (see the link in the next bullet for a more detailed explanation
of what a model checker/theorem prover does)

\- TLA+ is a model checker; Coq is a proof assistant. See
[https://stackoverflow.com/questions/22418448/can-coq-be-used...](https://stackoverflow.com/questions/22418448/can-coq-be-used-easily-as-a-model-checker)
for a more detailed explanation.

\- Verdi is a Coq framework for writing formal proofs about distributed
systems.

\- Doug Woos and James Wilcox made a proof of Raft using Verdi; Verdi helps
you figure out whether your implementation of X (in this case X=Raft) meets
its specifications (leader safety, strong consistency, etc.)

Link to the Verdi website:
[http://verdi.uwplse.org/](http://verdi.uwplse.org/)

edit: as noted by ahelwer, TLA+ is not a model checker but a language to
describe a distributed system's specifications. I was referring to TLC which
can work with TLA+.

~~~
dougws
Hey, I'm Doug Woos--thanks for this excellent summary! It's worth noting that
the Raft proof was completed by a team of people, including me, my research
partner James Wilcox
([http://homes.cs.washington.edu/~jrw12/](http://homes.cs.washington.edu/~jrw12/)),
and the other folks listed on our web page at
[http://verdi.uwplse.org/](http://verdi.uwplse.org/).

~~~
r0naa
Apologies to your research partner for leaving him out, it's edited!
Congratulations on the proof by the way, I am looking forward to this week-end
so I can have some time to appreciate it with more depth! :-)

~~~
dougws
Thanks! Let us know if you have any questions about the proof itself--
unfortunately, it's not particularly well-documented, and Coq isn't super easy
to read in the first place.

~~~
notsony
Can you recommend and (1) books and (2) online lectures/courses on proofs
which use Coq? Thanks!

~~~
dougws
At this point, the standard intro text is Software Foundations [1]. I highly
recommend it; it will teach you Coq and also probably make you a better
programmer. After SF, Certified Programming with Dependent Types [2] gets more
into the practice of proving serious programs correct. These books are both
available online in the form of literate Coq files.

As far as online lectures, OPLSS [3] often has Coq lectures which are quite
good.

[1]
[http://www.cis.upenn.edu/~bcpierce/sf/current/index.html](http://www.cis.upenn.edu/~bcpierce/sf/current/index.html)

[2] [http://adam.chlipala.net/cpdt/](http://adam.chlipala.net/cpdt/)

[3]
[https://www.cs.uoregon.edu/research/summerschool/summer15/](https://www.cs.uoregon.edu/research/summerschool/summer15/)

~~~
notsony
Thanks!

------
Confusion
What isn't yet clear to me about these kinds of proofs: how can you verify
that this proof is correct? How can you know there isn't a wrong assumption or
wrong bit of 'code' in those 5500 lines? Would it have been impossible for
Voevodsky to make his original mistake in such a formalization? Obviously the
point of his entire homotopy type theory project is that he thinks he couldn't
have. But why are mistakes (near) impossible here?

~~~
wilcoxjay
This is a great question that permeates all verification work. In the lingo,
we call the set of assumptions a proof makes the "trusted computing base" (see
[https://en.wikipedia.org/wiki/Trusted_computing_base](https://en.wikipedia.org/wiki/Trusted_computing_base)).

I listed some of the assumptions we made in another comment
[https://news.ycombinator.com/item?id=10018985](https://news.ycombinator.com/item?id=10018985).

The question you ask was believed for a long time to be the death knell of
formal verification. It was persuasively argued in "Social Processes and
Proofs of Theorems and Programs"
[https://www.cs.umd.edu/~gasarch/BLOGPAPERS/social.pdf](https://www.cs.umd.edu/~gasarch/BLOGPAPERS/social.pdf)
that any proof of a complex system is necessarily at least as complex. Thus
there would be no reason to trust the proof any more than the original code.

The breakthrough in verification came when we started using _machine-checked_
proofs. In this approach, you write a short, simple, and trusted proof
checker. You can then verify complex systems using complex proofs that are
checked by the simple checker. Then the only possible errors in the proof can
come from the proof checker being wrong. Because the checker is simple and
general (ie, it can check all kinds of proofs, not just proofs about a
particular program), it is more trustworthy.

Just to reiterate (including some content from my other comment): we can be
wrong only if we have written the wrong definition of linearizability,
misstated the correctness theorem, or if there is a bug in the proof checker
or in Coq's logic itself.
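
To make the "simple checker" idea concrete, here is a toy Coq proof. The
tactics used to build a proof can be arbitrarily elaborate, but when Coq sees
`Qed` its small, general-purpose kernel re-checks the proof term they
produced:

```coq
(* The tactic script below constructs a proof term; Coq's small
   trusted kernel type-checks that term when it reaches Qed. *)
Theorem add_0_r_example : forall n : nat, n + 0 = n.
Proof.
  induction n as [| n' IH].
  - reflexivity.                      (* 0 + 0 = 0 holds by computation *)
  - simpl. rewrite IH. reflexivity.   (* S n' + 0 = S (n' + 0) = S n'   *)
Qed.

(* Inspect the underlying proof term the kernel actually checked. *)
Print add_0_r_example.
```

Trust then rests on the kernel, not on the tactic scripts or the 5500 lines
of proof development.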

------
mjb
Extremely cool stuff.

Questions for the researchers, if they are reading:

\- Does this proof cover any liveness concerns (weak fairness, deadlock
freedom, etc) in addition to the safety property of linearizability?

\- If not, what would it take to extend this model to cover liveness? Is this
even a good starting point?

~~~
dougws
We are reading! This is a great question. We currently don't prove anything
about liveness. We'd love to work on this.

As you probably know, Raft and other consensus algorithms are not guaranteed
to be live in all situations. But subject to some assumptions about the
frequency of failure, they are guaranteed to make progress.

In Verdi, systems are verified with respect to semantics for the network they
are running on. Our semantics currently don't include any restrictions about
how often failures can happen; a failure "step" can occur at any time. We're
not sure what the best way is to introduce this kind of restriction, but we've
got a couple ideas. One would be to guarantee that the total number of
failures has some finite bound which is unknown to the system itself but which
is available during verification. Another would be to model failure
probabilistically. We will probably end up doing at least one of these things
in the next year or so :).
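
A rough sketch of the idea in Coq (all names below are invented for this
comment, not Verdi's actual definitions): an unrestricted semantics lets a
crash step fire at any time, while a bounded variant threads through a
failure budget that the system itself never sees.

```coq
(* Illustrative sketch only; these names are made up and are not
   Verdi's actual definitions. *)
Parameter node world : Type.
Parameter deliver : node -> world -> world.  (* deliver a message *)
Parameter crash   : node -> world -> world.  (* fail a node       *)

(* Unrestricted semantics: a failure "step" can occur at any time. *)
Inductive step : world -> world -> Prop :=
| Deliver : forall n w, step w (deliver n w)
| Crash   : forall n w, step w (crash n w).

(* One possible restriction: pair the world with a finite failure
   budget k, unknown to the system but visible during verification. *)
Inductive stepb : nat * world -> nat * world -> Prop :=
| BDeliver : forall k n w, stepb (k, w) (k, deliver n w)
| BCrash   : forall k n w, stepb (S k, w) (k, crash n w).
```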

~~~
mjb
Thanks. That sounds like extremely interesting research. Being able to say
things about liveness relative to probability of failure (or the distribution
of probability of failure) would be very interesting.

------
nernst
This is interesting work. I'm curious how confident we can be that the TLA+
proof from Diego Ongaro was correctly represented in Verdi/Coq. This still
seems like a manual, hard-to-verify process.

~~~
wilcoxjay
Hey, I'm James Wilcox, another member of the Verdi team.

This is a good question. More broadly, what do you have to trust in order to
believe our claims?

In addition to all the usual things (like the soundness of Coq's logic and the
correctness of its proof checker), the most important thing you need to trust
is the top-level specification. In our case, this is linearizability (have a
look at
[https://github.com/uwplse/verdi/blob/master/raft/Linearizabi...](https://github.com/uwplse/verdi/blob/master/raft/Linearizability.v#L262-L266)
for the key definition; the key theorem statement is at
[https://github.com/uwplse/verdi/blob/master/raft-proofs/EndT...](https://github.com/uwplse/verdi/blob/master/raft-proofs/EndToEndLinearizability.v#L396-L403)). If you can convince yourself
that these correspond to your intuitive understanding of linearizability, then
you don't need to trust any other theorem statement or definition in the
development.
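
For intuition, a deliberately simplified trace-based statement of
linearizability might look like the following in Coq; this is invented for
illustration and is much coarser than the definition in Linearizability.v:

```coq
(* Simplified sketch, not the actual Verdi definition. *)
Parameter op result : Type.
Definition history := list (op * result).

(* A predicate describing legal *sequential* behavior of the service. *)
Parameter seq_spec : history -> Prop.

(* h and s contain the same operations with the same results, and s
   respects the real-time order of non-overlapping operations in h. *)
Parameter equivalent : history -> history -> Prop.

(* Every history the system can produce has a matching sequential one. *)
Definition linearizable (h : history) : Prop :=
  exists s : history, equivalent h s /\ seq_spec s.
```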

If you actually want to run our code, then you need to make several other
assumptions. Our system runs via extraction to OCaml, so you must trust the
extractor and the OCaml compiler and runtime. In addition we have a few
hundred lines of OCaml to hook up the Coq code to the real world (eg, writing
to disk and putting bits on the network).
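
For readers who haven't seen it, extraction is a built-in Coq facility. The
definition below is just a placeholder, but the commands are the standard
ones:

```coq
(* Extraction turns verified Coq definitions into OCaml source.
   `double` here is only a stand-in for a verified system. *)
Require Extraction.
Extraction Language OCaml.

Definition double (n : nat) : nat := n + n.

(* Writes double.ml; from this point the extractor, the OCaml
   compiler, and the runtime all join the trusted computing base. *)
Extraction "double.ml" double.
```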

To respond more directly to your question about Diego's proof, I can tell you
we referred to it often to get the high-level idea. But the TLA model differs
in several respects from our implementation of Raft in Verdi. Most
importantly, our code is an _implementation_ in the sense that you can run it.
This means that it resolves all nondeterminism in the specification.
Furthermore, there is no need to manually check that what we implemented
matches the TLA model, unless that is your preferred means for convincing
yourself that we really did implement Raft.

------
quibit
What exactly is a consensus algorithm and how do you prove linearizability?

~~~
macintux
Kyle Kingsbury has a good look at linearizability and other forms of
consistency:
[https://aphyr.com/posts/313-strong-consistency-models](https://aphyr.com/posts/313-strong-consistency-models)

~~~
quibit
That was a great read. Thanks for a new blog to put on my RSS feed!

~~~
baq
The "Call Me Maybe" series there is an absolute must-read for any post-Web
2.0 developer.

