Hacker News new | comments | show | ask | jobs | submit login
Why Writing Correct Software Is Hard and Why Math (Alone) Won’t Help Us [video] (youtube.com)
90 points by DanielRibeiro 377 days ago | hide | past | web | 27 comments | favorite

Oh, another halting problem argument.

He's quite correct that you can write programs for which there's no faster way to find out if they halt than running them. His example is one that tests Goldbach's conjecture by brute force. A more practical example is trying to break a crypto key by brute force trials. Or a Bitcoin miner.

His analysis applies to the worst case. Most programs aren't like that. If your program is that close to undecidability, you probably need some kind of a loop counter to keep it under control.

Microsoft has a Static Driver Verifier for Windows kernel drivers. It's a formal proof of correctness system using symbolic execution. It tests whether a driver will halt (return, really), call any of the driver APIs incorrectly, or violate the execution model by subscripting out of range, using a bad pointer, or other crash-type error. It doesn't try to verify that the driver will run the device correctly, just that it won't take the entire kernel down. This is the main reason that driver-caused crashes stopped being a major source of Windows problems in Windows 7 and later. Before the Static Driver Verifier, more than half of Windows crashes were caused by driver bugs.

Microsoft's experience is that 5% to 10% of the time, the Driver Verifier can't decide if a program will get to a bad state. The Driver Verifier has explored too many cases and gives up. This is usually an indication of a flaky driver. If the Verifier can't tell if the driver will hang, the programmer who wrote it probably can't either. So the driver needs to be fixed. This does not turn out to be a serious problem in practice.

(I've worked on proof of correctness systems. There are lots of hard problems, but this just isn't one of them.)

I'm the speaker. First of all let me say that much of what I do these days formal specification and verification, so I am aware of what's possible in practice and what isn't, and the blog post does go into some detail. Also, I am well aware of Microsoft's great accomplishments with driver verification -- SLAM was a landmark project -- but those results don't seem to translate well to other domains. As I say in the beginning, the main purpose of the talk is to explain why the problem is so hard, not to list the special cases that are feasible in practice (although I do quickly go over some approaches).

Nevertheless, it is not true that all the complexity theory results I present are worst case. Of particular interest is the result by Schnoebelen from 2005 (25:16 in the talk, but it's better to look it up in the post) that showed that verification isn't FPT (fixed-parameter tractable), which is very strong evidence that we cannot find a good method way to restrict programs or structure them in a certain way -- even in practice -- so that verification becomes tractable. Another is the result Schnoebelen quotes of symbolic model checking (28:20, but better in the post), i.e. the ability to use syntactic constructs and abstractions to aid in verification), which, as he says, has been shown to hold in practice and is now considered an empirical fact. Of course, the fact that feasible verification of nonlocal properties is unsolved and that virtually all efforts attempt to tackle special cases (or don't care about feasibility), shows that the problem is very hard in practice, and I talk about that towards the end of the talk/post.

Also, the Goldbach example has little to do with brute force. It is about the difficulty proving a seemingly simple reachability problem on a 10-line program. That the program itself uses brute force is irrelevant, and only used to make it simple and short.

There are lots of things you can verify if you don't choose to bang your head against unsolved problems in number theory.

The first stage is simply verifying that the program stays within the language - no integer overflow or underflow, no subscripts out of range, no bad pointers. If you can't do that for a program, the program is just broken. It should be possible to do this for much "unsafe" Rust code, such as collection classes. To deal with collection classes, you need to be able to talk about an array for which only some of the elements are valid. Rust lacks syntax for that, but many verification systems have ways of talking about such things. Simple inductive proofs can handle situations such as growing or compacting a collection.

The next step is verifying validity semantics for data structures. Is a tree valid before and after an insertion - that sort of thing. There are good ways to do that. One way is to add information to the tree, such as back pointers and depth counts, and verify that the updating operations maintain the invariant that back and forward pointers are consistent, and that depth counts decrease towards the root. That info doesn't have to be carried along into the executable code if it can't affect the results. We did this in the Pascal-F verifier, and it was easy to explain this to programmers as a way of inserting checking code that had no overhead.

Specifying the problem formally remains tough. There are formal specification languages. I used one of the first, SRI's SPECIAL. They're useful for some kinds of problems. Industrial control often has well-defined constraints on the overall system. Databases do something which can be simply defined; consider a database as a tuple store with linear search. An efficient implementation of a database is quite complex, but the desired end result is defined by a simple model.

Security properties can often be addressed formally, and DoD has put a lot of work into that. Anyway, enough for tonight.

Have you watched the talk or read the post?

Yes, unfortunately.

From the paper: "I hope it’s clear by now that it’s unlikely we’ll find a general useful category of programs for which verification is affordable, as verifying even the most simple computational model, the FSM, is in general also infeasible."

The basic approach of this paper is to generalize the problem until it becomes unsolvable. Others less inclined to claim the problem is impossible have succeeded in verifying some useful systems.

Useful reading:

"Formal Verification of floating point RTL at AMD."[1] After the famous Pentium floating point bug, AMD decided to use formal proofs of the correctness of their FPU, starting from the Verilog representation used to create the chip. They found several bugs, which were corrected before making the chip.

L4 microkernel verification: "The binary code of the seL4 microkernel correctly implements the behaviour described in its abstract specification and nothing more. Furthermore, the specification and the seL4 binary satisfy the classic security properties called integrity and confidentiality."[2]

[1] http://www.russinoff.com/papers/paris.ps [2] https://ts.data61.csiro.au/projects/TS/l4.verified/

I remember that old ACL2 paper, they didn't perform verification on just any RTL module written in Verilog, they restricted it to a subset of Verilog that most matched their ACL2 language, it was a lisp like functional language, and their ACL2 system was descendant of the Boyer-Moore thereom prover Nqthm. Anyway they translated that subset of Verilog into ACL2, and then their prover would try do verification by using term rewriting. To use their prover you had to basically find a functional version of whatever program you wanted to verify. On top of that, there was no guarantee that it would perform verification successfully, you would either get a result or ACL2 would just hang/or fail! Its mode of use is very similar to Microsoft's SLAM you mentioned earlier, if it took too long then you really had no choice but to rewrite. Verifying any Verilog module was way outside its capabilities and by today's standards it wasn't a very sophisticated mechanical prover.

Anyway, the way verification is done these days is too reduce the state space as much as possible and check the FSM, that's not "generalizing the problem until it becomes unsolvable". There are techniques post 1990's to do just that for example abstractions and minimization, partial order reduction, and more famously symbolic model checking. But user pron is right, most useful programs in production are just not feasible to formally verify because even after using many of the state of the art techniques to reduce the state space of the resulting FSM, the FSM is still too large. The option you have today, if verification is important to you, is to structure your program so that its verification by mechanical provers is tractable. I apologize for any typos/grammar errors, I'm on my phone. Also for anyone interested in the history of model checking check out Turing Laureate Edmunde Clarke's paper "Birth of Model Checking", if your interested his textbook "Model Checking", and a more recent text is "Principles of Model Checking" by Baier&Katoen.

BTW, one of the coolest modern model checking techniques is "counterexample guided abstraction refinement" (CEGAR) which automatically soundly reduces (abstract) the state space, tries to prove the property, and if it fails it refines the abstraction (enlarging the state space) and tries again.

> The basic approach of this paper is to generalize the problem until it becomes unsolvable

This is simply not true. The talk doesn't say it's unsolvable. It explains why it is hard. It is based on results of researchers in the software verification community. Unfortunately, I don't know how to teach a complex subject without starting with the general and then moving on to the specifics -- which is exactly how the talk is structured: from decidability to complexity, to total programs, to FSMs, and finally to verification methods. How can I explain the idea of reducing the state space by abstract interpretation and type systems, before explaining why the size of the state space is what matters most? If you have a better way of introducing a theory, I'd love to hear it. Honestly, I have no idea how you'd understand verification tools if you don't know that, say, verifying an FSM language is PSPACE-hard. That's at the very core of verification and the design of the tools.

The last two (out of four) chapters of the post explore what is solvable and how. This is how chapter 3 beings:

Now we know why writing correct programs is hard: because it has to be. But while we cannot verify all programs all the time – regardless of how they’re written – there’s nothing stopping us from verifying some programs some of the time for some properties. As we’ll see, none of the software verification methods manage to overcome the limitations we have explored, and it is precisely those limitations that affect the design and tradeoffs of every verification approach.

The idea is that if you understand the problem, namely that the size of the state space is what determines the difficulty of verification, you'd understand how those methods work when they do and why they fail. Very broadly, there are two approaches: reduce the state-space in a sound way (this is called abstraction) and hope that the property you wish to verify is captured by the abstract state space (this is how SLAM works), or not, and work really hard.

If you claim that verification is easy, then you're wrong both in theory and in practice. If you claim that some useful properties are easy and some may be verifiable with a lot of effort, then you're in complete agreement with experience and what I said (which is no more than a summary of known results in verification).

> Others less inclined to claim the problem is impossible have succeeded in verifying some useful systems.

In my job, I formally specify and verify large, complex distributed systems. My success comes from an understanding of the difficulty, which helps guide the way towards what's possible, rather than from wishful thinking. Just as you can't succeed in programming if you don't understand the difficulty (and therefore the importance of readability and structure, for example), if you don't understand the difficulty in verification, you simply will not succeed. You'll either bang your head against the wall or fail to understand how to help the tools.

> L4 microkernel verification

The effort that went into writing the seL4 -- a mere 9KLOC C program -- was, IIRC, over 20 person-years. Someone on that team told me they could have done it in 5 if they had had all the tooling in advance, but we're still talking about significant cost (5x instead of 20x), carried out by a team of (rare!) experts trained in the task. Plus, seL4 was "dumbed down" significantly so that verification would be viable at all, even at that effort. They didn't verify a general micro-kernel, but wrote a much simplified microkernel so that it would be amenable to verification.

For seL4, as I understand it their trials and tribulations also included false starts, like trying CompCert before moving to Magnus Myreen's generate logic from binary code and then meet in the middle approach.

And they've released all this, GPLed, so hopefully others (including maybe myself when I get the time) can continue doing this more efficiently.

Absolutely, but if you ask them, they will tell you that they don't think their process could be done efficiently enough for general purpose software. More importantly: it shouldn't be. End-to-end verification is absolutely unnecessary for all but the most critical pieces of software. No one believes this can be done easily.

When doing software verification we must keep our eyes open, know that it is hard and why:

* There is no general solution

* The difficulty is not a result of code structure, organization or syntactic abstractions.

* The difficulty cannot be avoided with languages -- through totality or abstractions (although some aspects regarding local properties can certainly be aided by language design; see below).

* The difficulty cannot be modularized away.

* The main issue is the size of the state space.

So the main idea is reduction of the state space (and heuristic approaches -- like SAT solvers -- that allow us to tackle larger state spaces). For reduction of the state space to be useful we need to find properties that survive the abstraction and are still important (memory safety is a classic example). The more local the property, the greater the chance that it can survive reductions. Which local properties are important is an empirical question which must be studies.

As for the rest, it's up to careful design, use of simplified algorithms, a lot of grunt work, and a lot of experience, which often involves a manual reduction of the state space.

> Turing’s solution? Types!

>> [Quoting Turing] We shall need a number of efficient librarian types to keep us in order.

Uh... I think "librarian types" means "people who think like librarians". It's a common colloquialism in some places: you have your entrepreneur types, your manager types, your artist types, and your librarian types.

It was a joke! Clearly, it didn't land...

Of course he meant people, but it may surprise some to learn (although it shouldn't; he wrote about a lot of topics) that Turing actually did write about type theory: https://projecteuclid.org/euclid.jsl/1183410407

Oops, sorry! I thought for a moment that it might be a joke, but I couldn't see any other indication of that. Probably it came across better on the video than the transcript.

A driver is a pretty small program, and as you say, the "verifier" ("checker" would have been a better term) doesn't even try to verify that the driver will run the device correctly.

Actually proving correctness of large programs like compilers and operating systems is not considered an easy problem. You surely know that; I don't know what you're thinking when you dismiss the problem.

What proof-of-correctness systems have you worked on?

Proof-of-correctness is an irrelevent concept. You can't specify the correct solutions to all edge cases for non trivial problems. You can however add constraints and test for those especially when you are willing to deal with a bunch of edge cases manually.

First time I did static analysis I knew there was a lot of edge cases I could detect, but not deal with. But, in practice my code dealt with 99.97% percent of cases which was easily enough be be extremely useful.

Interesting. Does that verifier also check if pointer values always remain within valid range?

From the YouTube link:

Extended transcript with more results and full references can be found here:


Thanks so much for posting this. I probably would have missed out on the content otherwise, and it's a fantastic survey. I hadn't seen the 2005 result that decomposing programs doesn't make them easier to prove!

What an excellent and integrated tour.

Saying math (alone) won't help you write correct software is like saying math (alone) won't help us create good AI.

Of course that's true.

In either case, it's a combination of math and more ad-hoc heuristics. But you can push a lot of that into math, and you should, and we should keep doing that.

By "math" I didn't mean the use of math as a tool, as done in, say, physics, but the view of computer science as a branch of the academic discipline of mathematics, which normally deals with very different problems from those computer science does. For example, much of physics relies on numerical methods, which, while mathematically justified themselves, are not really the tool of choice for mathematicians, because their concerns lie elsewhere. But physics regularly tackles problems for which there are no closed-form solutions -- this is the norm rather than the exception -- so the mathematical tools used by physicists and mathematicians are quite different. Computer science is more similar to physics in that regard.

> Complexity is essential. It cannot be tamed, and there is no one big answer. The best we can do [...] is apply lots of ad hoc techniques

I think this is right -- and yet wrong. In fact, I would say that it's only by accepting the correctness of this statement that we can begin to glimpse how it might be wrong.

It is right that there is no one big answer that will work for any program. There is no algorithm, no language, no logic, no formalism, no decision procedure that will solve the problem, and there never will be.

But I'll wager that we can get much much better at "apply[ing] lots of ad hoc techniques" -- to the point that in practice, for the programs we actually care about verifying, it will eventually be possible to automate most of the verification process even for our largest programs.

If you look at what deep learning is -- to take one example of how this could happen -- it's essentially a method for generating ad hoc techniques by the zillions. When we train a network we generally neither know nor care what features it learns to detect at each layer. Each such feature detector is an ad hoc heuristic for something, but we don't even need to know what.

My suspicion is that the current state of machine learning, as exciting as it is, is still in its infancy, and the techniques are going to get much more powerful yet.

And the thing is, as humans, I believe we do often understand why our programs work. That means that a significant fraction of the time, we have in our heads informal correctness proofs that are actually right. I said "a significant fraction"! I am keenly aware how often we get them wrong, but I think that if we were practically never right about these things, our software wouldn't work even as well as it does. Sure, our short-term memory is minuscule, and our long-term memory unreliable. We make logic errors, we overlook corner cases by the thousands, and we simply forget required preconditions and postconditions. Anyone who writes software for a living knows how bad we are at it, and yet we can do it to a point.

What we have that the machine doesn't have is an amazing pattern-matching ability. We're seeing that machines are getting a lot better at pattern matching. What the machine has that we don't is massive reliable memory and absolute indefatigability.

Just to be clear, I'm not suggesting that pattern matching is all that stands between us and proving Goldbach's conjecture. (If that were true, we would have done it.) I'm saying that the proof tasks involved in verifying real-world programs are not difficult in that way. What makes them hard is not that they require great leaps of mathematical insight; a lot of them aren't even very long. (What makes them hard is, to return to the author's theme, complexity. To prove any particular fact about a large program requires figuring out what facts about the program are even relevant to one's task. Even just at the AST level, it would take many millions of facts (as relational tuples, say) to describe a large program. Naive brute-force search in a space like that would be beyond hopeless.)

> it's essentially a method for generating ad hoc techniques by the zillions

But in the end, the network is just a single algorithm. Complexity theory doesn't care that from your point of view you consider it to be zillions of ad hoc techniques. Complexity theory tells us which problems are learnable and which aren't: https://en.wikipedia.org/wiki/Probably_approximately_correct...

> Anyone who writes software for a living knows how bad we are at it, and yet we can do it to a point.

I agree, which is why I think that with enough empirical research there's hope (remember, the point of the talk is that math alone won't help, but empirical research might), but don't expect miracles. What we want is something that far exceeds what we think we know about our programs. That gap between what we think we know and what we actually know is where all bugs lie.

> Naive brute-force search in a space like that would be beyond hopeless

What complexity theory says is that knowledge comes at a price. If you want to know something you must pay that price. Your algorithm for knowing it is irrelevant. Knowing some things about programs can get really expensive, no matter the technique.

The PAC learning framework takes as one of its starting assumptions that the learner cannot be instructed as to the form of the answer. I think that's a stronger restriction than we need for this purpose. It's okay if we have to provide the learner with new concepts, as long as it can remember them and apply them appropriately after that.

(I'm not actually sure that deep learning, at least as it exists today, is going to be particularly useful for this problem. I was just trying to come up with an example that people would be familiar with.)

In the conclusion of the talk I say that the only way forward towards anything resembling a sufficiently general approach to verification. Any form of statistical analysis (which is what many current learning algorithms are) is simply an algorithmic view of that very statement. One of the attendees even asked if machine learning could assist in that empirical study, to which the answer is, of course! But it is important to understand that this is what the learning algorithms would do: empirically study human-written programs. I don't think there's currently a whole lot of research in that direction, though.

Yes, I think we agree on a lot of this. Where we seem to disagree is on the prospects for a "silver bullet". As much respect as I have for Fred Brooks, I have always thought he was too quick to dismiss the potential of AI (though admittedly, when he first wrote that essay, AI was much better known for its failures than its successes). In particular, this dismissive, oddly moralistic quote from Turski, that you've requoted, I think is absolute horseshit. (For one thing, I'd wager that the biotech people will actually solve the hair loss problem within the next 10 years. I sure hope so :-)

But we agree that it's essential to understand what has been proven to be impossible. There will be no algorithm (in the sense of an effective procedure that works on any input) for computing correctness of programs. Indeed, all the familiar tools of computer science are inadequate: languages, logics, indeed formalisms of any kind. They may all have roles to play, but none of them can be the silver bullet in and of itself. It is exactly as you say, that the breakthrough, if there is to be one, must come through empirical analysis of actual programs.

But we now live in a world where, at long last, a machine has beaten the world champion Go player. Many experts thought that wouldn't happen for decades. If we can do that, I think that means we are finally starting to have solutions to problems of perception and pattern-matching that have been stymieing us this whole time. I am very excited.

I probably came out sounding a little too combative at the beginning. I think you've written an excellent summary here (modulo the Turski quote :-), and I think it's kicked off a very interesting conversation. (Even Animats, I think, doesn't disagree with us as much as he/she initially appears to.)

> In my job, I formally specify and verify large, complex distributed systems.

I am actually working on a project to put some code where my mouth is: to build a reasoning system that uses AI and ML and learns to reason about programs. It's a long-term thing and I'm probably a couple of years from having anything to demo, but I'd love to learn more from you about the problems you encounter. Are you by any chance in Silicon Valley? If so, let's get together. My email is in my profile.

> I have always thought he was too quick to dismiss the potential of AI

I have no idea what AI is (in a practical context), and I think no one else does, either. Do you mean the statistical algorithms that people call machine learning?

> But we now live in a world where, at long last, a machine has beaten the world champion Go player.

It is very important to understand how that was achieved and the difference between that and verification. It's unknown how good Google's Go program is compared to some arbitrary opponent, as the algorithm it employs is not the algorithm to "win" at Go (which is far too computationally expensive to be feasible). It is "merely" able to beat other human players, but that is mostly because humans can beat human players, and the program learned from humans. Humans, however, do not efficiently verify other humans' programs, and that makes the problem very different.

Don't get me wrong: I think that statistics are absolutely necessary, and that machine learning could certainly help. But "help" does not necessarily mean "solve". At the end of the day, even if we pretend machine learning does resemble intelligence (which it does not, as any expert would tell you; we're still in the vicinity of zero IQ), we must remember that intelligence -- unlike what some people like to believe -- is not a very good "general problem solving" algorithm; it is a reasonable general algorithm for problems that humans tend to confront in many situations. For example, humans are even worse than computers in finding solutions to NP complete problems, and are in general terrible at solving many problems. In general, humans are good at forming conceptually deep but undetailed theories. Is verifying programs one of those problems that intelligence can solve? It could be, but we have no good theory that can give us intuition into whether it is or it isn't. I do know that a model checker was far better than me when reasoning about programs.

But instead of hypothesizing, a good first step would be to gather statistics, and to use machine learning in analyzing those statistics. I think it's more likely that machine learning would uncover important patterns that we could then exploit in more robust algorithms.

Thanks for the invitation, but I am very far away from Silicon Valley.

Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact