
Assertions in Production Code? (2008) - ScottWRobinson
http://www.drdobbs.com/architecture-and-design/assertions-in-production-code/228700788
======
tedd4u
I recently discussed this with an engineering manager for a notable consumer
iOS app with tens of millions of downloads. While discussing how to better
reduce their crash rate, his team debated whether to enable assertions in
their production iOS app. As any iOS developer knows, even with a great crash
reporter and database, there are a lot of crashes that are hard to figure out.

One faction believed you should never intentionally crash the app, while
another faction believed the app was going to crash anyway, or maybe do
something worse.

They decided to test it out and during a sprint that was already mostly about
bug fixes where they would do extra testing before release, they enabled
asserts, and closely monitored their crash reporter.

What they found was that the number of crashes did not change much, but the
cardinality went down significantly. The learning was that code executing past
a disabled assertion may be in one of n different bad states, each of which
might lead to a different type of crash. They now had better high-level
information about what was causing crashes (knowing which asserts were wrong),
and it helped them reduce their crash rate much more than raw crash reports
without asserts would have (including cases where the crash was in iOS code,
not app code).

~~~
bsaul
That's a really interesting result. It's the kind of thought you might have,
but knowing that it actually worked on a massively used production app is
really valuable.

Anyone else had a similar experience?

~~~
pfranz
I work in visual effects for film writing internal tools. Usually they're
tools that take data from department to department or the department uses
internally; fairly short processes that serialize data in between.

One place I worked had a team that was very adamant about not having much
error checking. Not much of any QC process, either: wait for someone to
complain about bad data and respond. Honestly, this worked really well for
small, skunkworks-type projects that needed to be nimble. As you would expect,
when errors did happen, it was because of bad data from further up. You really
had to know the system well to be productive (the more cynical among us
thought the developers liked this because they could look like heroes).

I prefer to error early and clearly, and I got a lot of pushback. To be fair
to them, oftentimes the errors would be irrelevant to the specific thing they
needed, and it would have been preferable to ignore them and carry on. It felt
like I was imposing a bunch of bureaucracy and yak shaving. But bringing on
new people and scaling the amount of data going through would have been
impossible without more structure.

~~~
WalterBright
I find it sad that there seems to be no conventional wisdom in the software
business about the proper way to do this. Everyone starts from scratch and
tries to learn it the hard way.

------
WalterBright
If you want some fun (fun as in ruining your day, don't try this without
backing up first) fill up your system disk to near capacity, then try to run
various apps and system utilities.

All kinds of bizarre things start happening, usually bad. The cause is simple:
most of these programs are written in C, and C programmers rarely check to see
if writes to the disk succeed.
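For illustration, a minimal sketch (not from the comment) of what checking those writes looks like in C; `save_data` is a hypothetical helper:

```c
#include <stdio.h>

/* Hypothetical helper: write a buffer to a file, checking every step
   that can fail on a full disk. Note that fclose() must be checked too,
   since buffered data is flushed (and can fail) there. */
int save_data(const char *path, const void *data, size_t len)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;                        /* couldn't even create the file */
    if (fwrite(data, 1, len, f) != len) { /* short write: disk likely full */
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)                   /* the final flush can fail here */
        return -1;
    return 0;
}
```

The point is not the helper itself but that every one of the three calls can fail, and each failure needs a path back to the caller.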

For example, a few months ago, my Windows box would crash every time it would
auto-update. I did a lot of cursing about this, as how could Microsoft do
this? I eventually realized that the disk was nearly full. Cleared out a few
gigs, and the auto-update started working.

This is 2018.

It's not specific to Windows, either. It happens with every OS I've ever tried
it with, including Linux. No message like "disk full", or "failed to write
file". Just erratic random behavior and weird things happening.

I always try to keep at least 10% of disk free.

~~~
tonyedgecombe
I suspect it's the same with memory, nobody checks that malloc doesn't return
0.
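For the record, the check itself is trivial; a sketch in C (`dup_string` is a made-up example function):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Made-up example: duplicate a string, actually checking malloc's result. */
char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy == NULL) {       /* the check that rarely gets written */
        fprintf(stderr, "out of memory allocating %zu bytes\n", n);
        return NULL;          /* caller must be prepared for this */
    }
    memcpy(copy, s, n);
    return copy;
}
```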

~~~
WalterBright
That's right, but at least these days you'll get a seg fault if you try to use
it.

Back in the bad old DOS days, NULL pointers pointed at the interrupt dispatch
table, so running out of memory (awfully common on a 640K machine) meant you
trampled all over the operating system. It was so bad that I'd defensively
reboot my machine constantly while debugging.

DOS extenders saved the day, because they ran code in protected mode. I never
developed code in real mode again, I just ported fully debugged code to it.

------
WalterBright
Author here. AMA. My other two articles on the topic:

Safe Systems from Unreliable Parts
[https://www.digitalmars.com/articles/b39.html](https://www.digitalmars.com/articles/b39.html)

Designing Safe Software Systems Part 2
[https://www.digitalmars.com/articles/b40.html](https://www.digitalmars.com/articles/b40.html)

------
dougk16
There's a good approach to asserts I adopted years ago for your typical low-
risk CRUD/web/mobile app that I'm guessing ~90% of us are working on (as
opposed to aviation software like in the article): use asserts as you normally
would, but in production have them fire an event to Google Analytics or some
such. This is particularly useful for those middle-ground situations where a
full app crash isn't justified because things will probably chug along fine,
but nonetheless you want to know something's screwy.

For debug/testing mode I usually pop a toast or dialog so the tester knows to
report something to me even if the app is otherwise working fine.

You often say to yourself "Well I _know_ this assert will never trigger but
I'll put it here just in case". Then you check your analytics next day after
release and get humbled.

~~~
vram22
>You often say to yourself "Well I know this assert will never trigger but
I'll put it here just in case". Then you check your analytics next day after
release and get humbled.

That is exactly what asserts are for: checking for things that "cannot happen"
- but do sometimes happen.

------
roland35
Software development often does seem like a struggle between
reliability/robustness and safety/correctness. Figuring out what to do with
asserts falls in this debate!

For embedded development I have often had ASSERTs log out to a serial port
with the function name/line # and continue operating - this can then be
connected to any logging device. This makes the system a little more robust
(it won't crash, right away at least) but less correct/safe.
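A rough sketch of that kind of log-and-continue ASSERT in C (the macro name, the fallback, and the use of stderr in place of the serial port are all illustrative):

```c
#include <stdio.h>

/* Log-and-continue assert: reports the condition, function name, and
   line number, then lets execution proceed (stderr stands in for the
   serial port). */
#define ASSERT(cond)                                              \
    do {                                                          \
        if (!(cond))                                              \
            fprintf(stderr, "ASSERT failed: %s at %s:%d\n",       \
                    #cond, __func__, __LINE__);                   \
    } while (0)

/* Example use: the system keeps running on a contract violation,
   at the cost of correctness (here, a made-up fallback of 0). */
static int safe_divide(int a, int b)
{
    ASSERT(b != 0);
    return b == 0 ? 0 : a / b;
}
```

This is exactly the robust-but-less-correct trade-off described above: the log entry preserves the evidence, but the bad state keeps executing.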

During development testing we often had complaints of machines freezing due to
asserts, which always increased the priority of those bugs! Definitely a good
thing in the long run to fix those bugs and make the system more correct.
Stopping all execution may not be the correct 'safe state' for an airplane
though!

~~~
WalterBright
> Stopping all execution may not be the correct 'safe state' for an airplane
> though!

It absolutely is. It's (literally) a disaster to allow code that has entered
an unknown state to control aircraft functions. There is NO WAY such a design
would EVER be certified by the FAA.

~~~
cryptonector
In a fly-by-wire design (which is de rigueur now), stopping all operations
would be way worse than continuing.

~~~
WalterBright
That is certainly not how fly by wire systems are designed. The FAA would
never certify a design that relied on allowing software in an unknown state to
access critical flight controls.

Here's how it's done:

Assertions in Production Code
[https://www.digitalmars.com/articles/b14.html](https://www.digitalmars.com/articles/b14.html)

Safe Systems from Unreliable Parts
[https://www.digitalmars.com/articles/b39.html](https://www.digitalmars.com/articles/b39.html)

Designing Safe Software Systems Part 2
[https://www.digitalmars.com/articles/b40.html](https://www.digitalmars.com/articles/b40.html)

~~~
gizmo686
I've never written safety-critical software, but I believe the point is that
"stop all execution" is not always an adequate response to an unknown state;
this does not imply that "proceed as if nothing went wrong" is an acceptable
response. I would imagine the response to an unknown state would be "return to
a known state". This could be accomplished by rebooting (returning the
software itself to a known good state), passing control to a redundant system
(possibly done by halting all execution of the affected system, but the design
needs to consider more than that), or falling back to some form of safe mode,
where the bad state is not present but safety-critical functionality can
continue.

~~~
WalterBright
When you're at 30,000 feet, nobody wants to go experimenting with a system
controlling critical flight operations that has gone into an unknown state.

The pilot often has the option of rebooting the system and trying again, but
that's dangerous. On an episode of "Aviation Disasters" (I don't remember the
exact details), the pilot got a warning of a failed system. After a consult
with the ground, they told him to reboot it. After a while, it failed again.
They told him not to worry about it and to reboot it again.

It (the airplane, that is) crashed.

Aircraft systems are designed to be decoupled from each other as much as
possible, so failures do not propagate. In particular, they must not propagate
to the backup system, or the "safe mode". The engineers do work hard at this,
and sometimes they don't get it right, and another bitter lesson is learned.

The "safe mode" system must be physically and electrically decoupled from the
normal system.

An example of getting it wrong: I've seen demonstrations on TV of people
hacking in through the wireless key locks on a car to take control of the
brakes. That's seriously bad engineering - not so much the vulnerability to
hacking, because ____ happens, but the fact that the door lock system is
connected to the brake system. It's cowboy engineering at its worst.

~~~
gizmo686
Keeping systems isolated is an orthogonal concern. The question is whether
rebooting the system is safer than shutting it off. The fact that shutting it
off should be safe is a separate concern. At a minimum, shutting it off
reduces your redundancy; and if the system fails again (even in an undetected
way), that should be fine, because you (should) have enough redundancy to
handle an arbitrary system failing in unexpected ways.

------
alangpierce
At my work (hosted enterprise webapp), both the frontend and backend have both
an `assert` function and an `assertAndContinue` function. `assert` always
crashes on failure, and `assertAndContinue` crashes in dev and logs an error
and keeps going in prod. Each time we want to verify a runtime assumption, we
decide which type of assert to use. We prefer `assertAndContinue` (and I push
for it in code review), but there are reasons to use plain assert:

• In some cases, crashing is a better user experience than proceeding in an
unknown state. Usually this is on the backend when data integrity is at risk,
but could be on the frontend when we're at risk of making a server call with
the wrong arguments. But usually, pretty much anything that doesn't crash is
better than crashing.

• `assertAndContinue` requires that you have a reasonable fallback. If code is
going to crash the next line anyway, there's no point in `assertAndContinue`.
In most cases, it's easy, but sometimes it takes real engineering, like
building an error state into the UI component that skips downstream code.
When there isn't an obvious fallback, using `assertAndContinue` is a judgement
call based on the difficulty of implementing a real fallback, the likelihood
of the bug actually happening, and severity if the bug does trigger.
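The comment describes a webapp, but the same split can be sketched in C; `ASSERT` and `ASSERT_AND_CONTINUE` are the hypothetical names from the comment, with the standard `NDEBUG` macro standing in for the prod/dev distinction:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hard assert: always crashes on failure. */
#define ASSERT(cond)                                                   \
    do {                                                               \
        if (!(cond)) {                                                 \
            fprintf(stderr, "assert failed: %s (%s:%d)\n",             \
                    #cond, __FILE__, __LINE__);                        \
            abort();                                                   \
        }                                                              \
    } while (0)

#ifdef NDEBUG
/* Production build: log the failure and keep going. */
#define ASSERT_AND_CONTINUE(cond)                                      \
    do {                                                               \
        if (!(cond))                                                   \
            fprintf(stderr, "assert failed (continuing): %s (%s:%d)\n",\
                    #cond, __FILE__, __LINE__);                        \
    } while (0)
#else
/* Development build: behave like a hard assert so bugs get noticed. */
#define ASSERT_AND_CONTINUE(cond) ASSERT(cond)
#endif
```

Each call site then picks the failure mode per assumption, which mirrors the per-assert decision the comment describes.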

~~~
wvenable
I have a difficult time imagining any time that assertAndContinue makes sense.
If a fallback is necessary and recovery possible, I would have asserts throw
and use normal exception handling for that.

~~~
WalterBright
Recovery is never possible if the program has entered an unanticipated and
unknown state.

~~~
temac
So assertAndContinue makes no sense, right?

On my side I don't really see its point anyway. If the code is actually ready
to handle such failures, this is merely an error trace, so it should just be
called that.

~~~
WalterBright
> assertAndContinue

I would never allow such a construct in any code I was in charge of.

~~~
laurentl
> assertAndContinue

OTOH, it’s the perfect title for a sequel to _Halt and Catch Fire_.

------
js8
I completely agree with the approach outlined in the original post. (It's also
called defensive programming.)

Not failing fast and hard when the application is in an unexpected state
(which is what an assert should check for) is robust only on the surface.

I think we have the debate because it's not just about adding asserts. If you
want to add asserts to an existing codebase, essentially making it fail fast
and hard, you need a system architecture that is capable of dealing with said
failure. If your codebase isn't architected as such, you will probably make
things worse for the user by adding extra asserts.

I think that's what the people who are against asserts are afraid of, and
rightfully so. But the answer shouldn't be "do not use asserts"; the answer
should be to make the system more resilient to failure first.

------
rkeene2
In my view, asserts are just a way of defining the contract of an interface
that cannot be specified to the compiler/runtime any other way. They are a
workaround for the limitations of those systems; or rather, a way to extend a
system that lacks infinite flexibility.

For example, the NaCl "randombytes" function has the prototype:

    void randombytes(unsigned char *buffer, unsigned long long length);

It has no way to report an error -- its contract is that it must fill the
buffer with high quality random bytes before returning.

And the OpenBSD function "arc4random_buf" has the prototype:

    void arc4random_buf(void *buf, size_t nbytes);

So we have a potential type difference between the two interfaces for the
amount of data that we can generate.

If you wanted to back randombytes() by arc4random_buf(), you have several
options:

1. Handle the case within randombytes() where the (unsigned long long) value
exceeds the maximum value of (size_t) by making repeated calls to
arc4random_buf();

2. Assert, at compile time, that (size_t)'s range falls within (unsigned long
long), so there can be no problem (since arc4random_buf() similarly cannot
fail);

3. Redefine the contract of randombytes() such that specifying a buffer length
that exceeds (size_t)'s maximum value is invalid. This is really the same as
#2, but instead of failing to compile if (SIZE_MAX < ULLONG_MAX), the program
will compile and run fine as long as the contract is not violated. The assert
in this case is a guard against a violation that should never occur, and helps
to enforce the contract during development.
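A sketch of option 3 in C (the arc4random_buf here is a local stand-in so the example is self-contained; the real one lives in OpenBSD's libc and fills the buffer with high-quality random bytes):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for OpenBSD's arc4random_buf so this compiles anywhere;
   the real function fills buf with cryptographic randomness. */
static void arc4random_buf(void *buf, size_t nbytes)
{
    memset(buf, 0xAB, nbytes);
}

/* Option 3: the contract now says length must fit in size_t; the
   assert guards that contract during development and testing. */
void randombytes(unsigned char *buffer, unsigned long long length)
{
    assert(length <= (unsigned long long)SIZE_MAX);
    arc4random_buf(buffer, (size_t)length);
}
```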

With regards to production use of assert, catching a contract violation there
means a lot of things have gone wrong. How likely things are to go wrong, and
how much damage they do when they do go wrong, can vary from contract to
contract. Various kinds of asserts may be used based on such things, from
compile-time, to run-time debug, to run-time production, to an inline
debugging framework. There are trade-offs and no single answer for all cases.

------
neves
It is important to note that the use of assertions is one of the few
development practices that has empirical evidence supporting it:
[https://www.microsoft.com/en-us/research/publication/assessing-the-relationship-between-software-assertions-and-code-qualityan-empirical-investigation/](https://www.microsoft.com/en-us/research/publication/assessing-the-relationship-between-software-assertions-and-code-qualityan-empirical-investigation/)

------
jacquesm
The larger a codebase, the bigger the value of a crash 'close' to the root
cause of the error. Assertions help you get - sometimes substantially - closer
to the root cause of a problem, and any assertion that does not hold would
lead to a crash - or worse, silent data corruption - anyway, so by all means,
leave them on.

------
gilgoomesh
The article argues in favor of assertions in production code but acknowledges
that the author's experience is from a very specific field with very
specialized constraints and extensive design around recovering from aborted
components.

This advice is not obviously applicable to different kinds of programming with
different constraints and different environments. As with so many things: one
solution does not fit all problems.

~~~
Nomentatus
You're gainsaying, but others have dropped empirical evidence on the other
side, averring a low to no downside; and my experience is all on the other
side too. It would help if you gave some lived examples.

------
sbradford26
I think part of what is argued is that people need to look at their software
and determine what failure modes are not acceptable. It is impossible to
design a system that won't fail so the focus should be on designing parts that
detect and handle failures. For those parts you may want to pull in techniques
from aerospace such as command monitor setups, assertions, fail-safe modes,
and force restarts.

Clearly not all software is flight critical but a lot of software could be
improved by treating some parts of it as such.

------
rdtsc
> High reliability is not achieved by making perfect designs, it is achieved
> by making designs that are tolerant of failure. Runtime checking is
> essential to this, as when a fault is detected the program can go into a
> controlled state doing things like

Very good point. That's exactly the philosophy behind Erlang (the language and
its VM; it also extends to Elixir and other BEAM languages).

It has isolated process heaps and process hierarchy supervisors. That is the
most critical part of the deal, because it means crashes and failures are
controlled and only a small part of the system is affected, without the
failure spidering out and putting the rest of the system in an unknown state.

Microservices or just using OS processes instead of threads can kind of
emulate that. But you can only have so many OS processes, as they can be
pretty heavyweight. And with microservices there's a whole other stack of
stuff involved, as opposed to just the language environment.

> I find these options far preferable than going into an unknown state and
> praying.

The idea of avoiding "unknown" state is key. That's also where crashing and
restarting comes in - to get back to a known state. Of course, you'd also want
to log the failure so someone can eventually fix it. But, if a system is well
designed, someone won't have to do it at 4 in the morning. I've seen systems
crash and restart for weeks. Sounds terrible at first, but the alternative
would have been having a service that's down completely and waking someone up
in the middle of the night to fix the issue.

------
Const-me
In my code, most asserts become log messages in release builds. This helps a
lot when troubleshooting problems, since customers don't have debuggers
installed. I often work on desktop software; unfortunately, it's impossible to
achieve avionics levels of reliability because I don't have the luxury of a
controlled hardware and software environment.

------
dwohnitmok
It doesn't fix everything, but judicious use of private data constructors and
custom types can go a long way towards reducing the need for asserts. They can
bubble up potential violations of preconditions and force handler code to be
written.

Some Scala examples, but they're hopefully quite language agnostic:

[https://m.youtube.com/watch?v=Csj3lzsr0_I](https://m.youtube.com/watch?v=Csj3lzsr0_I)
[https://m.youtube.com/watch?v=keTId618iOs](https://m.youtube.com/watch?v=keTId618iOs)

------
carapace
Most people don't know that Python has a '-O' optimize CLI option:

    $ python -h
    usage: python [option] ... [-c cmd | -m mod | file | -] [arg] ...
    Options and arguments (and corresponding environment variables):
    ...
    -O     : optimize generated bytecode slightly; also PYTHONOPTIMIZE=x
    -OO    : remove doc-strings in addition to the -O optimizations
    ...

It does two things (that I know of):

1.) Assertions are removed. The assertion code does not appear in the
bytecode.

2.) Any code guarded by an "if __debug__:" statement is also removed.

YMMV
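For comparison, C's assert() behaves analogously under the NDEBUG macro: building with -DNDEBUG removes assert() calls from the compiled code, which is also why assert arguments must be free of side effects. A small illustration:

```c
#include <assert.h>

/* The argument to assert() is not evaluated at all when NDEBUG is
   defined, so side effects inside asserts silently disappear. */
static int calls = 0;

static int bump(void)
{
    calls++;
    return 1;
}

/* Returns 1 in a normal build, 0 when compiled with -DNDEBUG. */
static int asserts_ran(void)
{
    calls = 0;
    assert(bump());
    return calls;
}
```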

------
gnachman
By default macOS apps do not terminate on certain assertions within certain
event handlers, because exceptions are caught by the framework. The app just
does weird stuff after this happens. This is terrible. Once I figured out how
to make it crash I fixed a handful of bugs and the number of reports of
inexplicable behavior went way down.

~~~
saagarjha
I'm sure you're aware of this, but you can change this with
NSSetUncaughtExceptionHandler:
[https://developer.apple.com/documentation/foundation/1409609-nssetuncaughtexceptionhandler](https://developer.apple.com/documentation/foundation/1409609-nssetuncaughtexceptionhandler).
By default Cocoa registers an exception handler that swallows exceptions and
asks the user whether they want to continue after the exception is raised.

------
philipov
The biggest problem with assertions in Python code is that they throw a
generic AssertionError when you should be throwing a custom exception class
instead. You can't catch the right error (and nothing but the right error)
that way.

------
United857
The one time that it's actually a good idea to intentionally crash an app in
production is when continuing past the failed assert could e.g. cause
irreversible corruption of important data or, worse, lead to a security
vulnerability/exploit.

e.g. situations where you're writing to a buffer/disk, dealing with raw
pointers, etc. (which hopefully shouldn't happen with good design, but is
sometimes unavoidable)

~~~
WalterBright
My work is cut out for me :-(

------
WalterBright
Famous Last Words: "It's just a buffer overflow, nothing to worry about."

------
draw_down
Soft assertions. They crash under test and in development, but log in
production. Monitor the occurrences in production and alert if they go above
some threshold rate. The assertions should have an owning team, and
potentially a DRI. This info should be a required argument to the soft-assert
function. Otherwise you’re just making a mess.

Hard assertions, for things that should never ever happen, or where crashing
is preferable to continuing.

------
pmichalina
Assertions, throws, and try-catch are code smells. Really bad design choice
considering we have more elegant solutions like monads.

~~~
js8
There are three main ways to reduce errors in code: tests, asserts and
abstractions. All three have advantages and disadvantages and generally deal
with different classes of errors.

In this thread, we discuss asserts, not abstractions. A monad is an
abstraction. (On the other hand, many people only see tests and forget about
the other two solutions, which are just as important.)

~~~
pmichalina
Tests are good of course, but that assumes the tests are valid. I’d rather
trust the compiler to ensure runtime safety.

