SPARK would seem to be the definitive baseline to compare all the others against.
Is it just me, or is describing the functional correctness of a program with another programming language asking for more problems?
http://adam.chlipala.net/papers/ImpurePOPL10/ -- The field is moving too fast for me to keep up, but last I checked, the theoretical comp. sci. folks were still using a methodology similar to this one. I haven't read many POPL or ICFP papers in the last few years, so if someone is more in touch with cutting-edge research, please jump in.
A) It must be possible to automatically and completely mechanically ensure that implementation & proof don't get out of sync. Having a formal proof for an algorithm which isn't the algorithm you've actually ended up implementing doesn't help you one whit :). This is relatively simple for unit tests and property-based tests like QuickCheck: just observe that the number of tests run actually increases over time and never decreases unexpectedly. (Actually, Continuous Integration tools rarely seem to support tracking the latter.)
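To make the QuickCheck point concrete, here's a minimal sketch in Python (names like `insert_sorted` and `prop_insert_preserves_sortedness` are made up for illustration; real QuickCheck is a Haskell library with far richer generators). The key property is that the check lives next to the code, so changing the implementation automatically re-runs it against the spec:

```python
import random

def insert_sorted(xs, x):
    """Implementation under test: insert x into an already-sorted list."""
    for i, y in enumerate(xs):
        if x <= y:
            return xs[:i] + [x] + xs[i:]
    return xs + [x]

def prop_insert_preserves_sortedness(trials=1000):
    """Property-based check in the QuickCheck spirit: random inputs,
    checked against a declarative statement of what the function must do.
    Because it runs with the code, implementation and 'proof' can't
    silently drift apart."""
    for _ in range(trials):
        xs = sorted(random.randint(-100, 100)
                    for _ in range(random.randint(0, 20)))
        x = random.randint(-100, 100)
        assert insert_sorted(xs, x) == sorted(xs + [x]), (xs, x)
    return trials  # a CI job could track that this count never shrinks

print(prop_insert_preserves_sortedness())  # → 1000
```

The returned trial count is exactly the number a CI dashboard would need to watch to notice tests "unexpectedly decreasing".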
B) There must be (in some sense) at least some non-trivial difference between the implementation & "proof". (At the absurd extreme: bisimulation doesn't prove anything if the two implementations are one and the same.) For example, in Idris the implementation is basically just code while the proof is the accompanying types -- which ensures they aren't trivially saying the same thing, even though the types greatly constrain what the implementation can do.
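The classic illustration of "proof in the types, implementation in the code" is a length-indexed vector; here's a sketch in Lean 4 syntax (the Idris version with `Vect` is analogous). The type of `append` states the theorem -- the result has the combined length -- while the body is ordinary code:

```lean
-- A vector whose length is part of its type.
inductive Vec (α : Type) : Nat → Type where
  | nil  : Vec α 0
  | cons : α → Vec α n → Vec α (n + 1)

-- The type is the "proof": any implementation that type-checks
-- must return a vector of the combined length. (The index is
-- written m + n so each case holds by definitional equality.)
def append : Vec α n → Vec α m → Vec α (m + n)
  | .nil,       ys => ys
  | .cons x xs, ys => .cons x (append xs ys)
```

Note the non-trivial gap between the two layers: the type says nothing about *which* elements come out, and the code says nothing about lengths, yet the checker ties them together.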
I think that the allure of "practical" dependent types (e.g. Idris) may be that such languages trivially obey both properties. Of course you still end up with a "trusted computing base" which is non-trivial in such cases.
Thing is, almost all the problems in real systems have come from coding and interface errors, especially hidden assumptions. Formal verification and static analysis are great at dealing with exactly that: they prevent and catch many more problems than they introduce, even though errors do occur in specs and proofs. It's a combination of a numbers game and applying proven principles.
Hope that makes sense.
The specification is also easier to write, because it isn't subject to performance constraints and uses a different set of abstractions than your main code.
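A small Python sketch of that asymmetry (hypothetical names; any spec/implementation pair would do): the spec side states what "sorted" *means*, declaratively and with zero regard for cost, while the implementation side is where the performance engineering lives.

```python
from collections import Counter

# --- Spec side: declarative, different abstractions, no performance concerns ---
def satisfies_sort_spec(xs, ys):
    """ys is a correct sort of xs iff it is ordered and a permutation of xs."""
    ordered = all(a <= b for a, b in zip(ys, ys[1:]))
    same_elements = Counter(xs) == Counter(ys)  # multiset equality
    return ordered and same_elements

# --- Implementation side: cares about efficiency (merge sort, O(n log n)) ---
def merge_sort(xs):
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

xs = [5, 3, 8, 1, 3]
assert satisfies_sort_spec(xs, merge_sort(xs))
```

The spec is a handful of obviously-correct lines; the implementation is where all the fiddly index arithmetic (and hence most of the bugs) lives.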