Faults, errors, and failures (btmc.substack.com)
83 points by sirwhinesalot on Feb 17, 2024 | 61 comments



>> Errors are almost always the result of faults. Barring cosmic rays, hardware issues or really unusual race conditions between the application and the operating system, if an error occurs it is because the programmer screwed up and introduced a bug. <<

The 'cosmic ray' category has taken in an awful lot of additional territory since computers started running multiple programs at once and sharing resources between programs and with the world in general. About 60 years ago, perhaps lightheartedly, it was suggested that computer instruction sets should contain a branch-on-chipbox-full instruction. The chipbox, a shared resource, was where the cardpunch disposed of the confetti produced by punching cards. As we were learning that anything that could go wrong would, it was logically inferred that if a full chipbox was ignored long enough while punching cards, the computer would find a way to stop, fail, or burn down the building, and no existing software could prevent that possibility on its own. A comparable situation in this century is that typical computers today allow the operator to 'adjust' the system clock, yet only a very small fraction of software is written to accommodate all the possible consequences of non-monotonic time, time being a shared resource. And if you do write software to handle those consequences, what do you do when the program has no way of telling whether it is running on a system where non-monotonic time is allowed or not, catastrophic or all in a day's work?
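
For a concrete illustration (a minimal Rust sketch of mine, not the parent's): the wall clock can go backwards, and Rust's `SystemTime::duration_since` surfaces that as an `Err`, while `Instant` is monotonic by contract.

    use std::time::{Duration, Instant, SystemTime};

    fn main() {
        // Wall-clock time is a shared resource the operator can adjust;
        // duration_since returns Err if the clock went backwards between
        // the two readings.
        let t0 = SystemTime::now();
        match SystemTime::now().duration_since(t0) {
            Ok(elapsed) => println!("wall clock elapsed: {elapsed:?}"),
            Err(e) => println!("clock moved back by {:?}", e.duration()),
        }

        // For measuring intervals, Instant is monotonic by contract,
        // so this subtraction cannot go negative.
        let start = Instant::now();
        let elapsed: Duration = start.elapsed();
        println!("monotonic elapsed: {elapsed:?}");
    }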


I always love seeing these historical anecdotes on hackernews and I couldn't be happier to see one on my own post. Thank you!


I liked the article so I wanted to give you some feedback. Hope it is useful to you!

- I don't think the definitions of error and failure are 100% correct as stated. Looking at the IEEE definition that you reference, I interpret error as meaning the difference between the value that is stored in the program and the correct/intended value. For example, if we expect to have a value of 100 but in fact have 110, the error is 10. I don't think that whether the value is observed or not is what categorizes it as either an error or a failure. If I run my program in the debugger and find that a value is off from what it is supposed to be, does that shift it from an error to a failure?

- One point I think you should have leaned more into is how language constructs and tools can help prevent failures, or cause more of them if they are bad. You bring up Haskell and Rust, and how they systematically reduce the number of faults a programmer can make. You also bring up exceptions introducing a lot of complexity. These two examples are great individually; putting them together and comparing them would have been powerful. Maybe a section that argues why Rust omitting exceptions makes it a better language.

- A side note since I also hate exceptions: did you know that the most common (and accepted?) way to communicate exceptions in C# is via doc comments written manually by humans. Good luck statically analyzing that!

- A lot of the text revolves around the terms error, failure, and fault and how people use these in communication, often with different ideas of what the words mean. Even the titles (jokingly? "correctingly"?) reference this. Even with the definition at the start, the ambiguity of these terms was not dispelled. I think a major part of that was the text using the terms both as you defined them and in the common "misunderstood" versions. A strategy you could have deployed here is to pick less overloaded words and stick to those throughout the article. For example (without saying these are the best terms for the job), instead of fault, error, and failure: defect, deviation, and detected problem.

- A note on the writing style. Many words are quoted, and many sentences use parentheses to further explain something. At least to me, these things make the text a bit jumpy when overused. I would try to rewrite sentences that end with a parenthetical by asking myself "what is missing in the sentence so I don't need to resort to parentheses?". Don't be afraid to break a long sentence into many!

Hope my comments come off as sincere; if not, then that's on me! Good luck with your continued writing.


> A side note since I also hate exceptions: did you know that the most common (and accepted?) way to communicate exceptions in C# is via doc comments written manually by humans. Good luck statically analyzing that!

Java having checked exceptions is the primary reason I’m sticking with that language. Many libraries don’t use them, unfortunately, but an application that embraces them systematically is bliss in terms of error handling, because at any place in the code you always know exactly what can fail for what non-bug reasons.


They had the right idea but implemented it poorly (overly verbose to work with, as is much of Java). The end result was people taking too many shortcuts.


It is funny: usually, Java is used as the prime example of why checked exceptions are bad.


Constructive feedback is always appreciated!

The only thing I'll comment on is the IEEE stuff. I was taught these terms in a university course on fault tolerance. You'll find slides from various courses using them like this or similar if you search on Google, and that particular IEEE standard was mentioned as the source (I never personally read it). I have read a later standard that rather than defining error specifically, mentions all the various ways in which the term is used.

The thing is, the actual standard is irrelevant; it wasn't meant as an appeal to authority. Rather, it's a source of 3 related terms (fault/error/failure) that can be used to refer to the 3 distinct ideas discussed throughout the post.

Your suggestions for alternative names are just as valuable and just as useless: neither the ones in the standard nor your own are generally agreed upon. My hope was that by using a somewhat common triple I would avoid pointless discussion of the terms themselves, rather than of the ideas discussed in the post.

As this hackernews comment section demonstrates, it was all for naught ;)


> did you know that the most common (and accepted?) way to communicate exceptions in C# is via doc comments written manually by humans.

Well, the accepted way to communicate them in Python is "we don't". I think C++ follows that same principle, but the ecosystem is extremely disconnected, so YMMV.

Java tried to do a new and very good thing by forcing the documentation of the exceptions. But since generics sucked for the first ~20 years of the language, and were never applied to exceptions anyway, the results were bad and discouraged anybody else from trying.


I think for dynamic languages exceptions are just a fact of life, and it doesn't really make much sense to worry about them: you can't rely on the type system to remind the programmer of all the cases they need to handle.

So thinking in terms of failure handling is the way to go.


There is a cost in trying to force the language to find bugs for you. More is not always better. Unlike with a linter, you can't just ignore false positives from a compiler; working around them takes real effort.

Not having exceptions in the language creates a tradeoff as well. This may lead to either ignoring errors or adding non-linear boilerplate between where the issue is detected and where the code can handle it, negatively impacting readability and refactoring.


Yup, see the section on handling failures in the post. Though note that I use "exceptions" to refer to a very particular language feature, rather than the mechanism. Rust panics and Go panics work like exceptions but are meant to be used differently. Panics are good, as are exceptions when used like panics.
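
For illustration, a minimal Rust sketch of that split (my example, not from the post): expected failures travel through `Result` in the signature, while a violated invariant, i.e. a bug, panics.

    use std::num::ParseIntError;

    // An expected failure: the input is outside the function's control,
    // so the possibility of failure is part of the signature.
    fn parse_port(s: &str) -> Result<u16, ParseIntError> {
        s.trim().parse::<u16>()
    }

    // A fault surfacing as a loud failure: violating the documented
    // precondition is a bug in the caller, so we panic instead of
    // returning an error the caller would have to "handle".
    fn average(samples: &[f64]) -> f64 {
        assert!(!samples.is_empty(), "average() requires at least one sample");
        samples.iter().sum::<f64>() / samples.len() as f64
    }

    fn main() {
        match parse_port("8080") {
            Ok(p) => println!("port {p}"),
            Err(e) => eprintln!("bad port: {e}"),
        }
        println!("{}", average(&[1.0, 2.0, 3.0]));
    }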


FWIW, the creators of Rust themselves distance the language from “if it compiles, it works”, because this is obviously not true.

If your definition of “works” ignores behavioural requirements, then I suppose.


This is why I like the Either monad found in functional programming. You either have your return value or the error. No exception handling nonsense.


Is it really that much different? You still need to handle a Left value, and that's a lot like handling an exception.


It is different because this way all possible results and failures are known as type information. How to handle erroneous state is the responsibility of the result consumer, given the availability of a rich API to unwrap, bind or transform error variants.
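
A small Rust sketch of what that looks like (Rust's `Result` playing the role of Either; the `ConfigError` example is made up): the failure variants are ordinary type information, and the consumer chooses whether to unwrap, bind, or transform them.

    #[derive(Debug)]
    enum ConfigError {
        Missing(String),
        Malformed(String),
    }

    fn lookup(key: &str) -> Result<String, ConfigError> {
        match key {
            "retries" => Ok("3".to_string()),
            _ => Err(ConfigError::Missing(key.to_string())),
        }
    }

    fn main() {
        // and_then binds the next fallible step, map_err transforms the
        // error variant, unwrap_or_else picks a fallback at the very end.
        let retries: u32 = lookup("retries")
            .and_then(|raw| {
                raw.parse()
                    .map_err(|_| ConfigError::Malformed(raw.clone()))
            })
            .unwrap_or_else(|e| {
                eprintln!("falling back to default: {e:?}");
                0
            });
        println!("retries = {retries}");
    }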


That was the case in Java with checked exceptions.


Some thoughts. 1/ I think that it's not always possible to modify the domain. For example, I could have a function that takes the name of a file as a parameter and returns a CanBeWritten object. Then I could have a function that opens a file in write mode and takes an object of this type as a parameter.

The issue is that between the moment I acquire this object and the moment I use it, the file could, in fact, become non-writeable. (There was a post on hn about this idea of using the type system like this: https://news.ycombinator.com/item?id=35053118 ).

I think you focus a lot on software issues and neglect the hardware ones. But it's a choice.

Still my thoughts (but at this point you've already understood that the entire post was going to be like that): I think that when a fault is detected (when it becomes a failure, if I follow your definitions), an attempt to fix the problem and return to a normal state can actually fail - by incorrectly fixing the issue. Like: you have three copies of the same integer (redundancy) and one of them has a bit flipped. You decide that the one different from the other two is the incorrect one. You detected a problem, you tried to fix it. But it could be the case that two bitflips occurred at the same position.

There is no definitive solution to that, but documenting all the detected problems AND the fixes applied to them would help.
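
A toy Rust sketch of how that repair can go wrong (my example, not the parent's): majority voting over three copies happily "fixes" the one remaining correct copy when the same bit flipped in two of them.

    // Triple redundancy by majority vote: return the value at least
    // two copies agree on, or None if all three differ.
    fn vote(a: u32, b: u32, c: u32) -> Option<u32> {
        if a == b || a == c {
            Some(a)
        } else if b == c {
            Some(b)
        } else {
            None
        }
    }

    fn main() {
        let original: u32 = 0b1010;
        let corrupted = original ^ 0b0100; // one flipped bit

        // Single bitflip: the vote repairs it correctly.
        assert_eq!(vote(original, corrupted, original), Some(original));

        // Same bit flipped in two copies: the vote "repairs" the one
        // remaining correct copy and keeps the corrupted value.
        assert_eq!(vote(original, corrupted, corrupted), Some(corrupted));
    }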

And for the error messages ... Well, my position is that most of the time they are useless for the end user. They can be useful for the developer. For the end user, the best error message (if such a message is required) is something unique enough to be copy-pasteable on Google to find a solution that the user will not understand but will be able to apply.

I used to consider (when I started computer science) that an algorithm is like going from point A to point B on a city map. There is essentially one "good" path and a huge quantity of "wrong" paths where you can get lost. And by trying to find your way, you can make the situation even worse.


Thanks for the input!

1 - Yes, when it comes to things that touch the hardware or the OS, it's hard to encode them at the type-system level since they can change from under you. This is a great example where it is useful to handle some faults at the type level (i.e., file might be missing, remember to check) while handling others as failures (the file became read-only out of nowhere... better abort what I was doing).

2 - Yup, trying to fix errors often makes it worse, which is why simply restarting is often the best way to go :)


> (i.e., file might be missing, remember to check)

There is no point in performing an existence check on a file before opening it, because regardless of whether the check returns true or not, you still have to handle the error cases from `FileOpen`, because the file might have disappeared[1] between the call to `FileExists` and the subsequent `FileOpen`.

If you have to always handle a return of `FileDoesntExist` when performing `FileOpen`, there's literally no point in checking beforehand.

[1] Even if it didn't, there's a myriad of other errors that can happen when opening the file, such as `PermissionDenied`, `FileLocked`, `IsDirectory`, etc ... and you need to handle every single one of them! You don't necessarily have to handle them individually, you can handle them as a group like in the pseudocode below, but it still makes the call to `FileExists` pointless.

    err = FileOpen(someFileName);
    if (err != OK) {
        handleFileError(err);
    }
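
The same point in Rust, as a minimal sketch: the open call is the check, and the error kinds you don't care to distinguish can be handled as a group, just like in the pseudocode above.

    use std::fs::File;
    use std::io::ErrorKind;

    fn main() {
        // No FileExists probe: opening is the check, so there is no
        // window in which the file can disappear between two calls.
        match File::open("some_file.txt") {
            Ok(file) => println!("opened: {file:?}"),
            Err(e) if e.kind() == ErrorKind::NotFound => {
                eprintln!("file does not exist");
            }
            // PermissionDenied, directories, locks, ... as a group.
            Err(e) => eprintln!("could not open file: {e}"),
        }
    }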


I never suggested calling FileExists separately, no idea where you got that from. The article explicitly refers to modeling the codomain of the function accurately, in this case something like Result<File, FileError>


> I never suggested calling FileExists separately, no idea where you got that from.

I got it from this:

>>> (i.e., file might be missing, remember to check)

What did you mean by "remember to check"?


You can't use the Result sum type without first checking whether it is the valid case or the error case. The type system won't let you forget to check; it reminds you with red squiggles in your IDE. This is a good feature to have.


A graceful message instead of letting the entire process crash is a way of handling even unexpected errors, e.g. a 5XX on the web. Without anticipating them, some backends will completely crash.


I think the article is wrong. Radiation/cosmic-ray crashes require BFT (Byzantine Fault Tolerance), because a corrupted component can send a 1 to one node and a 0 to another, even if it is not malicious. For example, see [1]. Formally, CFT (Crash Fault Tolerance) does not handle this case.

[1] https://www.usenix.org/system/files/conference/atc12/atc12-f...


Awesome paper you shared, but I don't really see how my article is wrong from reading it? The focus of my article is on programmer-introduced faults, not hardware failures.


I didn't want to sound pedantic, but for an audience that has specific knowledge about faults it is odd to avoid mentioning even once the BFT kind of fault when you bring up cosmic rays. It is part of the field of fault tolerance and it is a different type of fault.


> error, which is an unobserved, incorrect internal state

For example, the amount of available space on the system drive is not an internal state. However, once the number reaches zero, failures of all software are very likely to happen. The software will fail regardless of static type systems or unit test coverage.

In my experience, external things like that (not enough disk space, not enough memory, unsupported instructions, broken TCP connections) cause a large percentage of failures in the software I’m developing.
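
A minimal Rust sketch of that point (the path is hypothetical): no amount of static typing removes the `Result` from the write call, because the disk is external state that can run out underneath you at any moment.

    use std::fs::File;
    use std::io::Write;

    fn main() -> std::io::Result<()> {
        let mut f = File::create("/tmp/out.log")?;
        // This can fail with "no space left on device" (ENOSPC) no
        // matter how well-typed the program is; the environment is
        // outside what the type system can pin down.
        if let Err(e) = f.write_all(b"hello\n") {
            eprintln!("write failed: {e} (os error {:?})", e.raw_os_error());
            return Err(e);
        }
        Ok(())
    }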


I have used the MissingThing approach a few times, including as a specific pre-existing NullRecord in an SQL db to avoid having null FKs, but also as a NotPassed singleton for default func args, etc.

In some cases it worked perfectly (like in math: 0 is just another number, as long as you don't divide by it); sometimes it needed extra "arithmetic" but still worked... and sometimes it did not work and was abandoned, as handling nulls was much easier than handling the ThisSpecial stuff.

Sometimes it is possible to avoid the zero thing altogether, but that needs a magnitude more thinking at higher levels of abstraction.
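
For what it's worth, a minimal Rust sketch of that trade-off (the `Customer` and `no_customer` names are hypothetical): the sentinel behaves like "just another number" for some operations, but others still need a special case, which is exactly where an explicit optional type would pay off instead.

    #[derive(Debug, Clone, PartialEq)]
    struct Customer {
        id: u64,
        name: String,
    }

    // The MissingThing approach: one pre-existing sentinel record
    // instead of a null foreign key.
    fn no_customer() -> Customer {
        Customer { id: 0, name: "<none>".into() }
    }

    // Code that only formats or counts rows can treat the sentinel
    // like any other record: "0 is just another number".
    fn label(c: &Customer) -> String {
        format!("#{} {}", c.id, c.name)
    }

    // ...but operations that must not run on "nothing" still need the
    // special case, the "division by zero" of this scheme. This is
    // where Option<Customer> would force the check at compile time.
    fn send_invoice(c: &Customer) -> Result<(), String> {
        if *c == no_customer() {
            return Err("no customer attached".into());
        }
        Ok(())
    }

    fn main() {
        let c = no_customer();
        println!("{}", label(&c));
        println!("{:?}", send_invoice(&c));
    }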

btw: has anyone heard of an array-of-negative-size? Like a hole that eats (if positives are pegs): 5 + -3 == 2, so appending such a 3-hole to a 5-peg array would shorten it to the first 2 elements only.


Do you write for yourself or for other people?


They're pure brain dumps on whatever is on my mind at the time. But if other people find my incoherent rambling useful, then it's worth sharing on the interwebs. Hope you got something out of it!


I realize that it may be a rude question, or too direct, but I thought it could cut the conversation down into something quicker. That's all. You can write whatever you want of course... but the modern world has put a crazy highway in our brains, my friend. Give me the info like you would give a soldier an order. I would fall in love with you for it.


All of my close friends who read the post had the same complaint, no offense taken. It's hard to keep my ADD-addled brain creatively entertained and write a coherent, straight-to-the-point kind of post at the same time.

More time in the oven (editing) is something I need to invest in.


:hot-beverage:


What a great text with useful references and links. Kudos to the author, who is also the OP.


Do let me know if I got anything wrong or missed something.


I hate sites that make things pop up when you select text.


Apologies for that, it's a substack thing... I used to have a self hosted website but wasted more time tweaking the theme than writing :(


I agree with everything in this article. Can we work together lol


If you're asking I'm guessing your current work colleagues don't keep these distinctions in mind. Neither do mine x). They're great folk though!


(Because the author redefines error to tautologically mean an unhandled condition.)


The article deserves a bit better so I replaced the baity title with a more representative subhead.


Thanks, honestly I might just do it on the article itself as well.


Not my definition, the IEEE Standard Glossary of Software Engineering Terminology definition. Yes, the title is clickbait (I mention that's intentional in the post), because people keep mixing up faults and failures, be it Elm's approach (all in on faults) or Erlang's approach (all in on failures).

Without properly defined terminology, any discussion of the difference in focus becomes unproductive.

EDIT: In your defense the later standards appear to have made "error" a uselessly wishy-washy term again, so eh. The terms fault/error/failure as defined in the post are still used in the study of fault tolerance.


The fact that a standard exists somewhere for something doesn't mean that it's an accepted-in-practice usage. Or even vaguely imply.

You probably don't have a jar of SRM 2387 peanut butter on your shelf, for instance.

The three-term separation makes plenty of sense and when I see all three together in a doc it's generally clear that they mean something along those lines, but it's far from normal use.

---

Late edit: to make this perhaps a bit more useful / constructive: because of ^ this, it's being perceived as a clickbait title. And perception is all that matters in "is this title clickbait", because it's what determines the level of attention someone is willing to pay at the beginning.

Clickbait-title leads to clickbait-pushback. You don't have to change it of course, and it fits with the post's narrative, but I can pretty much guarantee that it'll keep causing this kind of reaction in some (many? few? idk) people.


Fair enough, I was taught the terminology in a university course on fault tolerance.

Personally I wish their usage were more standardized; in fact, if it were, I wouldn't have had to explicitly mention that they're taken from IEEE 610.12-1990 and that I'm using them to split "error" into 3 separate but related ideas for clarity.

Either way, I made it clear in the post that I didn't make up the terms, so no idea what the original commenter is on about.


Yeah, it's not a totally unknown standard or anything. But I feel pretty safe claiming that "bug" (unobserved in code) / "error" (handled in code) / "crash/fatal/unhandled error" (obvious) is both far more common and far clearer at a glance... which is probably why it's more common.

In technical documents, the standard can be useful for being very explicit and unambiguously identifiable. I'm far from convinced that it deserves use elsewhere tho.


I'm replying to you Groxx but hopefully panzi will also see it. I definitely see where both of you are coming from. The issue (I feel) is that if I were to do as you suggest, and just use 3 commonly used but not agreed upon terms (bug, error, crash), then the original comment of this thread would be justified in their criticism (that I'm making up my own definitions for clickbait).

I actually had to present my own definitions for terms in my article about OOP, so I could explain why I like objects but not OOP, because the terminology in practice is so diluted as to be next to useless :(


If it helps I think you're right to take this approach.

In my experience, defining error as 'handled' or handleable in code is often not particularly helpful. Are the values from errno(3) always considered handled?

EINVAL is almost always, in "your" terms, a fault; ENOSPC or EPIPE could likely only be handled as complete failures.

What I see too often is code which propagates


I might push for a different set of divisions there tbh.

errno(3) is handle-able because the information exists and it is possible to use it. It's the same as any other "a problem occurred" signaling mechanism in that sense. Its main sin is being out-of-band of the trigger, so it's extremely easy to forget.

Whether it is handled (checked) in code is a subdivision of handle-able. Some signaling mechanisms are better about preventing not-handled than others, depending on context.

Whether you can recover from it (do something else, try again until success, etc) is also a subdivision of handle-able, and is completely unrelated to whether it was handled or not. And I think I can claim that in literally all cases the "recover-ability" is also completely unrelated to the kind of problem (ENOSPC vs EINVAL) - it only depends on what you are trying to do right now, which depends on the rest of the program and the user intent. If it was inherently unrecoverable, it wouldn't be errno(3), it'd just never return (e.g. kill your process, infinite loop, etc).

Under that framework, ENOSPC is just a normal handle-able error. It's frequently a fatal failure that is easy to forget, but it's easy to come up with something that expects and recovers from it, e.g. a lossy caching tool. Similarly, EINVAL is an unrecoverable error if you are a tool that fails on bad input, like a compiler, despite being easily recoverable in some cases (probing for feature support and gracefully degrading, perhaps).
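
A sketch of that lossy caching example in Rust (my invention, to make the recover-ability point concrete): the very same write error that would be fatal in a commit path is absorbed here, because the program's intent tolerates losing a cache entry.

    use std::fs;

    // A lossy cache: failing to persist an entry (disk full,
    // permissions, whatever) is expected and recoverable here, even
    // though the identical error would be fatal in, say, a database
    // commit path. Recover-ability depends on intent, not the errno.
    fn cache_put(path: &str, value: &[u8]) -> bool {
        match fs::write(path, value) {
            Ok(()) => true,
            Err(e) => {
                eprintln!("cache write skipped ({e}); serving uncached");
                false
            }
        }
    }

    fn main() {
        let stored = cache_put("/tmp/cache_entry", b"expensive result");
        println!("cached: {stored}");
    }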


I too learned that terminology at university almost 20 years ago. Haven't heard it used like that since. Yes, precise language would be useful for communication, but in this case it feels like a lost cause. Nobody uses these terms like that.


> Nobody uses these terms like that.

Yeah - it feels a bit like "Mebibyte". Sure, it's technically the correct name of a base-2 Megabyte. But people just use "Megabyte" and use context to figure out if people mean base-10 or base-2.


The problem is that people writing norms are (typically) people without any contact with reality. In theory the norms are great; in practice they have no place. Best example: the ISO stack vs. TCP/IP.


My unprincipled compromise is that I spell it MiB and pronounce it 'megabyte'. Spoken language is lousy with ambiguity already, I can't recall a time when I've ever had to clarify which I meant in conversation.


Mebibyte is well defined. Error is not.


But as pointed out, “Megabyte” is not clearly defined in informal usage.

Informally, it has two specific meanings, and a third useful fuzzy meaning when the distinction between 10^6 and 2^20 does not matter.


If you try to assert a specific definition of a specific term like "HTTP error 503" then you have some authority. But if you try to assert a specific definition of a generic term like "error" then you don't really have the same authority, and you can't be surprised when other systems don't follow your definitions.


Efforts like the "IEEE Standard Glossary of Software Engineering Terminology" were from an era in which we learned what didn't work, something linguistics had learned way before CS.

The IEEE didn't go 'wishy-washy'; they accepted that at that scope, 'descriptive' was the appropriate approach and 'prescriptive' was inappropriate for a broad context.

The same thing happened with most English dictionaries.

https://www.merriam-webster.com/grammar/descriptive-vs-presc...

The “Ubiquitous Language” concept from DDD is probably the best example here.

Domain-specific language is always polysemous at a wider scope. So DDD typically defaults to allowing domain experts (think accountants, not accounting software developers) to object to a rigorous “Ubiquitous Language” for interfacing with each other. This is important because when domain experts object to a formal definition as proposed, they are possibly concerned about nuances in that language that are important.

Obviously they may just be pedantic, but the default assumption is that they are trying to convey something important about the domain of discourse.

The prescriptive model of enterprise architecture utterly failed to live up to its promise, in part because word-sense ambiguity is a pervasive characteristic of natural language even at the scope of an enterprise.


Fair comments; I wasn't trying to use the IEEE standard as an appeal to authority, just clarifying that I didn't make up the terms.

Sadly as you say, these formal definitions don't end up agreed upon, which is why it is always important to clarify what you mean when you use a certain word (which I hopefully did in the post).


I hate engineers playing lawyer, and also I do not know the IEEE definition, but according to ISO 26262 your definition is wrong: a fault is undetected, an error is detected (but may or may not be handled by, for example, redundancy), and an unhandled error that results in a deviation of required system behaviour is a failure.

Also note: a fault can be a bug in SW, but also an unexpected behaviour of the HW.


It's not playing lawyer, it's agreeing upon some terminology for the purpose of discussion. I was accused of making up terms; I didn't, I used existing terms from one particular standard. The very reason I used existing terms from a standard was to avoid the very accusation I got anyway in this hackernews thread, but that's the internet for you.

I even mention in the article that the term fault is used differently in the context of hardware faults, and that different standards have different meanings for the same terms, which is why I also lay them out in the article and don't expect you to know the standard.


I also felt the same as loeg after finishing the article. If you simply state that invalid states cannot be handled, I believe most people here would concur. And in the end, I do think that this is a very significant insight that you describe. But using a narrow definition of errors to mean invalid state disregards any prior literature and programming manuals that already talk in terms of "error handling" and asks readers to suspend all of that while reading the article.



