Rethinking software testing: perspectives from the world of hardware (2019)

throwaway984393 · on July 10, 2021

Stakes are 100% the reason most tech is crap. You can't get away with shipping buggy hardware, people would ask for their money back. You can't get away with shipping a medical device that hurts people, you'd get sued. You can't get away with building a car that randomly turns on the brakes while you're going 70MPH, nobody would buy your cars (and you'd get sued).

With consumer software, there are no stakes. The business knows it can make money with buggy software. The developers know the business doesn't care. In fact, most people in the business don't seem to care if it works at all. The only time they care is if there is a legal consequence, or if they will get fired. They are not always stupid, but they are always lazy.

And that is never going to change. There are parts of this world that are perfectly acceptable as just being shitty, and at least 95% of software in the world fits into that. Things don't get better without a motivation.

ChrisMarshallNY · on July 10, 2021

> And that is never going to change.

Not necessarily, but the cure could be worse than the disease.

The problem is that the stakes are getting higher and higher for all software. Every app developer is trying to turn their program into a PID collector, and they are storing the PID in leaky buckets.

I mean, why would I get an account setup screen to play a solitaire game? Sheesh.

But politicians and bureaucracies have a nasty habit of imposing fairly ridiculous and/or out-of-date remedies.

They could make it a requirement to fax your PID to play solitaire.

mrh0057 · on July 10, 2021

People do care, but there is a lack of choice. Another issue with enterprise software is that the people using the software are generally not the ones buying it. It comes down to whether it checks the right boxes.

Businesses do care; it is just much more difficult than people realize to make software simple and easy to use. On top of that, most software projects fail or run way over budget, so they think there is no other way. They end up settling, since at least they got something that sort of works, which is better than nothing.

mrjin · on July 11, 2021

Well, nothing is perfect, and hardware is no exception. We have to admit that all projects have three primary constraints (time, resources and scope) and one core yield: quality. A defect or bug is really a mismatch between expectations and implementation; it does not matter whether it's software or hardware, and hardware bugs can be more fatal than software bugs in some situations. There are lots of examples: Therac-25, Infinia Hawkeye 4, Boeing 737 Max, and almost all recent Intel/AMD CPUs, which are more relevant to this topic. It's really hard to tell whether these should count as software or hardware defects; either way there would be valid points to argue otherwise.

Nonetheless, I do agree with your assertion that "Stakes are 100% the reason most tech is crap." Because of the constraints mentioned earlier, there must be some trade-off or you can never wrap up the project. Stakes just make things worse, and they tend to push engineers to cut corners in pursuit of profit. The only thing that looks better for hardware is that if you go too far cutting corners, it will be unstable at best or will not work at all. For software, defects can easily be covered up and the end user might never notice them in the first place.

catern · on July 11, 2021

>You can't get away with shipping buggy hardware, people would ask for their money back.

Why not? Intel shipped buggy hardware to everyone for years, it was called "Meltdown".

stitched2gethr · on July 11, 2021

And people have not forgotten.

ChrisMarshallNY · on July 10, 2021

I spent 30 years at hardware companies, writing software.

I'm fairly familiar with this conundrum.

The single biggest issue that hardware folks seem to have with software, is refusal to accept that software, in order to be valuable, has to run by different rules from hardware.

They. Just. Can't. Accept. That.

If it works for hardware, it must work for software. Attempts to explain that the agility inherent in software is a huge advantage, are met with stony glares, and accusations that "programmers are lazy slobs that hate quality."

So software developers are sidelined, treated like crap, not supported in efforts to achieve quality processes (it must be the same as hardware, or it's "lazy slob" stuff), insulted, tripped up, and basically ridiculed. It totally sucks.

Software is different from hardware; even firmware is different, these days. That simple assertion needs to be the baseline for any true reconciliation.

But I agree that modern software is of appallingly bad quality, and there's a dearth of lessons learned/mentoring in Quality practices.

Most practices (like TDD) are useful, but still fairly nascent, and tend to be "one-trick ponies." They will work nicely for some projects, but can be disasters for others.

Quality is a continuum; not a point. In hardware, we can get away with treating it as a point, but we hobble software, when we try to do it the hardware way.

Since leaving my last job, I have been working with ideas I developed during my career, that were never supported by the companies I worked for. These ideas are working fairly well, but they won't work for everyone.

detaro · on July 10, 2021

> The single biggest issue that hardware folks seem to have with software, is refusal to accept that software, in order to be valuable, has to run by different rules from hardware.

Except for the moment where the hardware contains a mistake, then it's on the software to be flexible and contort to somehow workaround the hardware issue, because software is magic and can adapt to everything.

That said, I see plenty of companies that are not as bad, that actively work to change, or that at least accept that they don't "get" software entirely and that they get better results if they leave it to those who do.

And also agreed in the other direction: Lots of resources about building software today describe things that work for a SaaS product you can deploy 5x a day. You can't do that if you ship hardware, especially hardware that can't phone home daily for a software fix.

ibushong · on July 10, 2021

I was doing ASIC design and verification for 5 years before I switched to software and I think it led me to write much higher quality code and tests. I was used to making 100% sure everything was correct, because fixing bugs in silicon is extremely difficult and expensive. You are also forced to spend more time defining architecture and interfaces ahead of time, which can be very useful/important for certain software projects.

Integration tests usually seem to find the most important bugs, but on hardware they take hours or days to run. With that limitation removed in software development, they're almost too good to be true.

I'd also be interested in seeing how formal verification could be applied to software development.

bradleyjg · on July 10, 2021

There are some software projects that look a lot like hardware and would benefit from crossover learnings. There are also lots of software projects where the hard part is not building out the code but instead figuring out what to build. A lot of the differences in common practices are due to that, not because software developers are foolish or lazy.

To give concrete examples:

Intel knows what it needs to accomplish in order to win back the top spot. There are tons of challenges to get there, but the end goal is clear. That’s an entirely different kind of problem from Twitter needing to figure out how to monetize.

namibj · on July 10, 2021

They did do formal verification for a microkernel: https://sel4.systems

It's afaik the most complex piece of software with full formal verification to date.

exdsq · on July 10, 2021

Formal verification is used in software development! Check out Coq, Lean, Agda, Lean, TLA+, or any links when you google those tools :)

fouric · on July 10, 2021

Lean: so good, we mentioned it twice.

I've heard that ACL2 is also a thing that exists.

To me, one of the most intimidating parts of getting into formal verification is deciding which system to use - as you mentioned, there are several. What are the strengths and weaknesses of each?

exdsq · on July 11, 2021

Hahahaha good catch. Perhaps if I'd used Lean on that string it wouldn't have happened.

I think the big choice is between using lightweight or heavy tools. Things like Coq and Agda take a lot of time to learn and implement but can derive 'correct' Haskell code which you can use, while TLA+ is more like a blueprint tool to check an algorithm before you write it in {favourite_language}. I think it's practical to learn TLA+, but if it's purely for intellectual satisfaction Wadler has a good book called Programming Language Foundations in Agda (I think?) which is available for free online.

triska · on July 10, 2021

This article contains many good points.

However, in my opinion, it is still overly lax: The aim should always be complete coverage.

The article hints at this with:

“Test every combination of events? Test every output for every combination of events? Test every possible corner case? You would need an absolutely enormous number of tests in order to achieve that!”

However, it does not really follow through on it, and even continues with a dedicated section on "Why Random Testing Works". At least symbolic execution with coarse approximations should be used to cover all cases subsumed by more abstract domains.

The reason for this is that the number of possible cases is usually so large, and the areas where problems lie often so small, that random tests are unlikely to discover all mistakes. The article mentions precisely this situation: "The bug only manifested itself during a small set of overlapping corner cases." QuickCheck and fuzzing are popular and do find mistakes in practice. However, how many mistakes are not found by these tools, if such random approaches already discover mistakes?

In my opinion, an ideal test case only terminates if a mistake is found. If not, it will simply keep on searching, potentially for months, years, ... You need to be able to start the search for mistakes and keep it running for as long as you want.

Logic programming languages like Prolog excel at writing test cases, because they provide search as an implicit language feature. It is easy to write a query that asks: "Are there cases where such-and-such holds?", and the system will search for them exhaustively. The ability to concisely write test cases that cover vast areas of the search space is a major attraction of logic programming languages.
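As a rough illustration of "search for cases where such-and-such holds", here is a minimal Python sketch rather than Prolog. The function `midpoint_u8` and its bug are hypothetical, invented for this example; the point is only that a lazy generator can enumerate a whole (small) input space and hand back every violating case, for as long as you keep asking.

```python
from itertools import product
from typing import Iterator, Tuple


def midpoint_u8(a: int, b: int) -> int:
    """Hypothetical function under test: midpoint of two 8-bit values.

    The sum is truncated to 8 bits before halving, which is a deliberate bug.
    """
    return ((a + b) & 0xFF) // 2


def counterexamples() -> Iterator[Tuple[int, int]]:
    """Lazily yield every input pair that violates the property
    min(a, b) <= midpoint_u8(a, b) <= max(a, b)."""
    for a, b in product(range(256), repeat=2):
        m = midpoint_u8(a, b)
        if not (min(a, b) <= m <= max(a, b)):
            yield (a, b)


# Ask the search for its first answer; keep calling next() for more.
first = next(counterexamples(), None)
print(first)
```

The search space here is only 65,536 pairs, so exhaustive enumeration is cheap; for realistic inputs the same loop would need a smarter enumeration order or an abstraction of the domain.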

throwamon · on July 10, 2021

> In my opinion, an ideal test case only terminates if a mistake is found. If not, it will simply keep on searching, potentially for months, years, ... You need to be able to start the search for mistakes and keep it running for as long as you want.

That's a good point, but the fundamental problem is that any sufficiently complex space takes way more than a practical amount of time to be completely traversed (easily reaching or surpassing age-of-the-universe magnitudes), so isn't that where fuzz testing comes in? If you use something like Prolog, isn't it going to perform the search in an ordered manner, drastically reducing the chances that a hard-to-spot mistake is found? And at least in the general case, don't you have to restart the search from scratch every time the logic changes?

zhd · on July 11, 2021

These are basically the problems that https://hypofuzz.com is designed to solve.

- You write standard property-based tests with Hypothesis, which fit into a traditional unit-test workflow (run with pytest, or unittest, or whatever)

- then HypoFuzz runs indefinitely, using feedback-guided fuzzing to find bugs that random generation is too slow for

- restarting the search on each new commit (or daily, etc) begins by replaying every distinguishable input found earlier so you don't have to start from scratch. There is _some_ loss of context, but that's inherent in testing different code.

Plus other nice stuff like allocating more compute time to tests which are finding new behaviour faster, for maximum efficiency. The roadmap includes exploiting SAT solvers to find very very rare behaviour, git history to focus on recently changed code, statistical risk analysis ("what's the expected time to next bug"), and also a budget-mode ("my compute costs $0.08/hr, stop when you expect a bug to cost $100 or more to find"). It's a good time to be working on testing!
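The "replay earlier inputs first, then keep exploring" idea from the list above can be sketched with the standard library alone. This is not the Hypothesis or HypoFuzz API; `search`, `prop`, and the saved-input list are hypothetical names used only to show the shape of the loop, and the exploration step is plain random generation rather than feedback-guided fuzzing.

```python
import random
from typing import Callable, Iterable, Optional


def search(prop: Callable[[int], bool],
           saved_inputs: Iterable[int],
           budget: int,
           seed: int = 0) -> Optional[int]:
    """Return the first input that falsifies prop, or None.

    Inputs saved from earlier runs are replayed first, so a change in the
    code under test does not force the search to start from scratch.
    """
    for x in saved_inputs:
        if not prop(x):
            return x
    rng = random.Random(seed)
    for _ in range(budget):
        x = rng.randrange(-1_000_000, 1_000_000)
        if not prop(x):
            return x
    return None


def prop(x: int) -> bool:
    """Hypothetical property that fails for exactly one rare input."""
    return x != 424_242


# With the rare input saved from an earlier run, no random search is needed.
replayed = search(prop, saved_inputs=[7, 424_242], budget=0)
print(replayed)
```

Purely random generation would hit that single failing value very rarely, which is the gap that replay and guided search are meant to close.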

jka · on July 12, 2021

Improving coverage is a good goal, although (depending on the codebase) I think it can provide a false sense of robustness.

Generally more coverage is good, yes; but there can be an element of risk around "testing two inputs in isolation", to borrow terminology from the article.

You may have coverage for every line of function A and every line of function B, but that doesn't guarantee that the results of the two - when combined dynamically at runtime - are going to behave the way that is expected and intended.
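A small sketch of that situation, using two hypothetical functions invented for this example. Each one is fully covered by its own test, yet the combination is wrong because the units do not match.

```python
def parse_percent(text: str) -> float:
    """Return the number in a string like '50%' (so '50%' gives 50.0)."""
    return float(text.rstrip("%"))


def apply_discount(price: float, fraction: float) -> float:
    """Reduce price by a fraction in [0, 1] (0.5 means half off)."""
    return price * (1.0 - fraction)


# Each test executes every line of its function: 100% line coverage.
assert parse_percent("50%") == 50.0
assert apply_discount(200.0, 0.5) == 100.0

# The combination still misbehaves: 50.0 is passed where 0.5 was expected.
combined = apply_discount(200.0, parse_percent("50%"))
print(combined)
```

A coverage report would show nothing missing here; only a test that exercises the two functions together exposes the negative price.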

I like your idea about tests that are continuously searching. In theory I also think it'd be possible to have tests that run on a codebase near-immediately as a developer types (even between commits); ideally exercising the data flow paths that are most directly affected by each syntax tree change first, and expanding out from there.

pjc50 · on July 11, 2021

> The aim should always be complete coverage.

The problem with this is the requirement to build a completely controllable test fixture including OS syscalls. In practice that's very difficult.

How many people test their javascript in a sandbox that allows for arbitrary fault injection of calls into the browser API or DOM?

> If not, it will simply keep on searching, potentially for months, years, ... You need to be able to start the search for mistakes and keep it running for as long as you want.

You also need to ship, and in many startup environments you don't even have a deadline, it's just "as soon as possible".

Fuzzing is very effective at finding security issues. It does seem reasonable to try to design the security boundary of an application in a way amenable to fuzzing.

triska · on July 11, 2021

To clarify: Tests as I see them and described them above do not need to stop when the software is shipped. They are meant to run indefinitely and can continue also long after the software is shipped, and independent of the environment in which it was developed. For example, I found several mistakes in released versions of GNU Emacs with test cases that repeatedly do the same thing, over and over, until they reveal a crash or memory leak. A few examples of this approach:

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=496

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=576

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=726

Fuzzing is a useful approach and does reveal many issues with software. As I see it, the fact that fuzzing already tends to reveal mistakes in software is a good motivation to strive for methods that yield greater coverage based on systematic exhaustive search, so that also mistakes that occur in very rare situations are reliably detected.

satisfice · on July 10, 2021

Yet another article on testing written by someone who defines testing in whatever way conveniently ignores the icky part: people.

Every time you want to say “manual” testing, ask yourself why you never say “manual programming.” Testing is a human, social process, just as programming is. Output checking can be automated, but so what? It’s not the essence of testing any more than running a compiler is the essence of programming.

megous · on July 10, 2021

Reference models are fun, but for anything more than a simple web app, if you try to make them sufficiently different in implementation, you end up re-implementing not only your business logic but also RDBMS functionality in your test suite.

Verifying the state of the app against the reference model after each testing step that modifies that state also gets out of hand quickly. Even more so if you try to do it via public interfaces only.

And there's no guarantee your model is right.
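For readers who have not seen the technique, this is roughly what checking against a reference model after every step looks like at toy scale. `SortedBag` and `check_against_model` are hypothetical names for this sketch, and the model is deliberately naive (a plain list that is re-sorted for each comparison).

```python
import bisect
import random
from typing import List, Optional


class SortedBag:
    """Hypothetical implementation under test: keeps items sorted on insert."""

    def __init__(self) -> None:
        self.items: List[int] = []

    def add(self, x: int) -> None:
        bisect.insort(self.items, x)

    def remove(self, x: int) -> None:
        # Remove one occurrence of x; do nothing if x is absent.
        i = bisect.bisect_left(self.items, x)
        if i < len(self.items) and self.items[i] == x:
            del self.items[i]


def check_against_model(steps: int, seed: int) -> Optional[int]:
    """Apply random operations to both the implementation and the model.

    Returns the first step at which their states differ, or None.
    """
    rng = random.Random(seed)
    bag = SortedBag()
    model: List[int] = []
    for step in range(steps):
        x = rng.randrange(10)
        if rng.random() < 0.6:
            bag.add(x)
            model.append(x)
        elif x in model:
            bag.remove(x)
            model.remove(x)
        else:
            bag.remove(x)  # removing a missing item must be a no-op
        if bag.items != sorted(model):
            return step
    return None


print(check_against_model(steps=500, seed=1))
```

Even here the model duplicates most of the behaviour being tested, which is the cost described above, and a mistake shared by both sides would go unnoticed.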

thequantizer · on July 10, 2021

Having a background in electrical and software engineering, I found this post very fun to read and was nodding along with some of the conclusions.

dboreham · on July 10, 2021

Also transitioned from hardware design to software. Same cognitive dissonance at much of what I see surrounding testing software.

JulianMorrison · on July 11, 2021

Two QA tools that have received less notice than they should have, in programming:

1. The Ada feature of being able to specify the legal range of your type independent of its machine representation. Maybe it's stored in a system sized integer, but you want its legal values to be between 2 and 53. In Ada you can tell the type system this, and it will be enforced statically. You simply can't assign a value of 1, 0, -1, and code that could will be rejected as illegal.

2. The testing technique I saw in Haskell, where you specify the contract of the function to the testing tool, and it fuzzes inputs across the whole legal range of the type, trying to find corner case inputs that break it. This seems to be a more advanced version of informally specifying the contract by explicitly programming out every corner case yourself.
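Both ideas can be approximated in Python, with a big caveat: Python cannot reject out-of-range values at compile time, so this sketch checks the range at runtime only. `Level` and `next_level` are hypothetical names, and the contract check simply walks the whole legal range instead of using a fuzzing tool.

```python
from dataclasses import dataclass

LOW, HIGH = 2, 53  # legal range taken from the example above


@dataclass(frozen=True)
class Level:
    """Hypothetical integer type whose legal values are LOW..HIGH inclusive."""
    value: int

    def __post_init__(self) -> None:
        # Runtime check only; this is an assumption of the sketch.
        if not (LOW <= self.value <= HIGH):
            raise ValueError(f"{self.value} is outside {LOW}..{HIGH}")


def next_level(level: Level) -> Level:
    """Hypothetical function under test: step up by one, saturating at HIGH."""
    return Level(min(level.value + 1, HIGH))


# Contract check over the whole legal range of the type: the result must be
# a legal Level (the constructor enforces that) and must never decrease.
for v in range(LOW, HIGH + 1):
    assert next_level(Level(v)).value >= v

try:
    Level(1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Because the range is part of the type, the test loop does not have to guess which inputs are legal; it reads the bounds from the same place the type does.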

baybal2 · on July 10, 2021

It needs to be said that software formal verification is much, much harder than hardware verification, and it will put much harder limits on the way you design programs.

I suspect it would not be rare for a software project written this way to have more work hours spent on verification, than the development itself.

The use of shared code would be even more pervasive, to the point that a whole library would likely be replaced rather than changing a single line of code.

Hardware verification is very much like a pyramid. Verified cells are made of verified, and characterised to death semiconductor devices. Verified gates are made of verified cells. Verified logic components are made of verified logic gate designs. Verified libraries are built upon verified logic components. Verified blocks built upon verified libraries. And verified macroblocks built upon verified blocks.

bcrl · on July 11, 2021

The author only barely touches on the issue of complexity that results from feature creep, which is the most fundamental difference between hardware and software. Hardware feature sets by their very nature have to be completed many months before tapeout occurs (sometimes years for extremely complex hardware like CPUs). Software often has to adapt to customer requirements by adding new features over the course of weeks. Software ends up becoming orders of magnitude larger and more complex than hardware as a result. Given budget constraints, testing is naturally going to be less thorough as a result.

MasterIdiot · on July 10, 2021

I remember a talk at some conference where someone who did both software and hardware development for Intel talked about "Hardware development cycle for software developers" or something like that, and the cost of simulating CPUs or fabricating them for a round of testing was absolutely mind-blowing. Also, there's the fact that a 1-in-a-billion bug will happen multiple times a second on a modern CPU.
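The arithmetic behind that last point is short. The clock speed and one-operation-per-cycle figure below are assumptions for illustration, not numbers from the talk.

```python
CLOCK_HZ = 3_000_000_000   # assumed: a single core running at 3 GHz
OPS_PER_CYCLE = 1          # assumed: one relevant operation per cycle
ONE_IN = 1_000_000_000     # the bug triggers once per billion operations


def expected_hits_per_second(clock_hz: int, ops_per_cycle: int, one_in: int) -> float:
    """Expected number of times per second the rare condition occurs."""
    return clock_hz * ops_per_cycle / one_in


hits = expected_hits_per_second(CLOCK_HZ, OPS_PER_CYCLE, ONE_IN)
print(hits)  # 3.0
```

With more cores or more operations per cycle the rate only goes up, so "rare" at the level of a single operation is not rare at the level of a running chip.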

P.S. Did anyone else happen to run into this talk? I tried to find it on YouTube and couldn't find anything related.

colin_mccabe · on July 12, 2021

Given what we know about the latest flaws in CPUs, such as Spectre and Meltdown, maybe we need to rethink hardware testing using perspectives from the world of software instead.

Jtsummers · on July 10, 2021

Several postings previously, but no discussion on any of them.