Hacker News
Undefined Behavior != Unsafe Programming (regehr.org)
119 points by steveklabnik on Feb 14, 2017 | 70 comments

> Undefined behavior is the result of a design decision: the refusal to systematically trap program errors at one particular level of a system. The responsibility for avoiding these errors is delegated to a higher level of abstraction.

Sounds a bit like an argument for Checked Exceptions: If you're going to write code for one layer of abstraction, your code should only emit errors that match its own tier. (And not raw naked primeval stuff occurring many layers down.) ( http://imgur.com/iYE5nLA )

It's also not entirely true... undefined behavior also exists because it's a serious pain to define every detail of a language. The only non-toy language I know of that is completely defined is Standard ML... and even its authors felt the work to completely define it wasn't entirely worth the effort.

Java is also completely defined and pretty completely specified. But don't confuse unspecified behavior with undefined behavior. Unspecified behavior means that one of several things can happen, but it's unspecified which. Undefined behavior means that absolutely anything can happen, including things that are not part of normal application behavior.

> Unspecified behavior

I feel like "unspecified" still sounds too lax. How about "implementation-defined"?

I'm thinking of scenarios like: "When this occurs, The Java Virtual Machine implementation may choose to either X or Y. If it does Y, then it shall throw a Z exception."

> and the authors felt the work to completely define it wasn't entirely worth the effort.

Can you add a source link? I would be interested in looking it up.

Eiffel is also completely defined.

It is one of the few languages that has denotational semantics as part of the language definition.

It was also one of the first languages to fix the null reference problem (called Void-safety in Eiffel speak).

Had the compilers been a bit cheaper, instead of targeted at enterprise customers, many more of us might be enjoying writing safe, natively compiled software in Eiffel.

And HTML. Every sequence of bytes is an HTML document with some defined meaning. It may not be valid HTML (the spec does define parse errors), but it will parse to the same representation in all conforming parsers.

This reminds me of a recent conversation I had with someone about Nim: I suggested it to them and they rejected it based on the premise that compiling to C is "bad" because of undefined behaviour. But now I find out that even LLVM IR has undefined behaviour...

AFAIK Gödel's incompleteness theorems imply that _any_ language will have at least some undefined behaviour.

No, you're confusing "undefined" with "unprovable". There are several "safe" languages where every well-formed program has well-defined semantics (for example, most interpreted languages).

Undefined behavior means that some well-formed programs have semantic holes: the specification says it is up to you to ensure your program never reaches certain illegal states. If it does, the language says "I told you not to go there; now I can do whatever I want." The reason is that by assuming the program cannot reach those states, the compiler can generate better code (think omitted bounds checks).

What Gödel's theorem says (and the more CS-specific version of it is Rice's theorem [1]) is that you cannot have an algorithm that decides an arbitrary non-trivial property of a program's semantics. But if the language is safe, the property of being well-defined is trivial (all programs have it), and trivial properties are the only exceptions to the theorem.

[1] https://en.wikipedia.org/wiki/Rice's_theorem

Correct. Rice's theorem limits the ability to check a program, e.g. with a compiler.

I think you're stretching Gödel too far. Perhaps most "fundamentally", it speaks to a limitation on finite first-order logic systems, and "computer languages" generally don't impose a finiteness constraint (a language which does would not be Turing Complete).

Though if I'm missing the point you're making, it would probably be quite interesting to see it in more depth.

Hardly. It implies that any language could get stuck or fail to terminate, but UB is something different.

That would be Turing's Halting problem, but yeah, undecidables permeate both concepts.

Undecidable and undefined are completely different concepts. Note that Turing's original proof operates on "Turing Machines"; they are completely, mathematically-rigorously defined, but there are questions you can ask about them that are undecidable. Arguably part of the point of the proof is precisely to show that definedness does not imply decidability, a thing we now take for granted but that mathematics as a human endeavor was still working through at the time. (If I recall my timelines correctly, this was not the first such demonstration, but the old habits had not yet died and the understanding had not yet fully percolated through.)

Nothing stops you from creating a computer language with absolutely no undefined behavior, except the general difficulty of fully defining behavior and the practical risk of overspecifying it to the point where it can't be implemented efficiently on real hardware. You also face the possibility that your model of the real hardware is flawed, if for no other reason than CPU bugs, so in practice you may not be able to manifest your fully specified language 100.00000%. But it's not impossible in principle.

I completely agree with the gist of your comment, but you're confusing "undefined", in the sense meant in the article, with "unspecified".

"Undefined" refers to certain properties of certain languages, like C, where an operation can produce any of an unenumerated set of behaviors, including behaviors that go well beyond normal program behavior. For example, in C, writing to a pointer that's pointing outside of allocated application memory can do things like change the current function's return address; in C++ it may change an object's vtable. The set of possible behaviors that can ensue is too large to enumerate and depends heavily on unforeseen factors.

What you're referring to is unspecified behavior. For example, in C, in a call like foo(f(), g()), the order in which f() and g() are evaluated is unspecified, but the outcome can only be one of a small set of possibilities, and cannot cause undefined behavior (like changing the function's return address, etc.). (Note that foo(i++, i++) is actually undefined in C: two unsequenced modifications of i.)

Creating a language with no undefined behavior is quite easy and is the norm in many high-level languages. Fully specifying the behavior of the language is harder, and you're right that often there is intentionally unspecified -- i.e., implementation dependent -- behavior to allow for efficient implementations on different platforms by the language's compiler/runtime, OS or hardware. For example, in Java, changing the values of non-volatile fields concurrently by multiple threads has an unspecified result, and this is intentional.

The real heart here is that program analyses have opened doors to optimizations that cannot be represented at the level of the language (LLVM IR). This is super convenient for compiler authors who want a flexible IR for optimization purposes. It's the same thing that happens with C UB too: optimizations will take advantage of what the compiler knows, not what the programmer believes.

> The essence of undefined behavior is the freedom to avoid a forced coupling between error checks and unsafe operations.

Maybe they should remain coupled - after all, they're intimately related, the error check is what makes the "unsafe" operation reasonable. For a program to remain correct it is vital that the error check remains adequately coupled to the undefined behaviour it's preventing - e.g. if an operation that would do something weird on overflow is being used, the link to our reason for believing that overflow can't happen in this case should be made explicit. It should be possible to do this in a way that has zero overhead in the final machine code (e.g. a richer type system at the LLVM bytecode level).

> Either way, UB at the LLVM level is not a problem for, and cannot be detected by, code in the safe subsets of Swift and Rust.

But this is not true, there are things that leak through and cause big headaches for the designers of Rust.

designers of Rust or implementors of the rust compiler?

(honest question)

I think the two sets (designers of the Rust language) and (rustc implementors) intersect really well.

I think that the distinction should matter more if/when someone tries another implementation of the language.

> Undefined behavior is the result of a design decision: the refusal to systematically trap program errors at one particular level of a system.

Huh? This is quite a bit of a false dichotomy... Illegal operations could also result in unspecified behaviour (e.g. it's not specified what result an integer overflow gives, but the rest of the program must continue normally).

> unsafety of machine code

In what way is the machine code unsafe? AFAIK the CPU will always try to execute the code, the worst that can happen is some kind of a trap.

I'm not sure I understand you. What is the difference between undefined and unspecified behavior?

If it was decided that an overflow would generate an error, then it was a design decision to trap errors at that level. The program could crash so that everybody knows something is wrong.

If it was decided that an overflow would just overflow, then it sounds like a refusal to trap the error. The program could continue in an unexpected state.

Maybe it's better to crash the program so you know something is wrong.

> I'm not sure I understand you. What is the difference between undefined and unspecified behavior?

They explained in the very next sentence: "e.g. it's not specified what result an integer overflow gives, but the rest of the program must continue normally".

(In the C standard, "implementation-defined" marks unspecified behavior where a compiler must document its choice of semantics; plain "unspecified" behavior need not be documented.)

> For example, it is obvious that a safe programming language can be compiled to machine code, and it is also obvious that the unsafety of machine code in no way compromises the high-level guarantees made by the language implementation.

I agree with the thrust of the statement, but this isn't nearly as simple as is stated. Compiler bugs exist, and the more standards your code has to interact with on the way to machine code (e.g. C spec => LLVM spec => X86 spec) the dicier this becomes, not to mention the assumptions about the APIs exposed to you by your OS and libraries you use.

Duly upvoted, but I'd like to note that the problem with so-called “unsafe languages” isn't undefined behavior per se, but rather the lack of appropriate tools (e.g. a formal axiomatic semantics) to reliably write programs with no undefined behavior.

As long as you use the new semantics, the unsafe parts of the language are barred to you, so it's like a different language.

I'm not talking about changing languages, but rather about giving existing languages more rigorous specifications. My beef with C, C++, unsafe Rust, etc. isn't the presence of undefined behavior in arbitrary programs (for which the solution is very simple: don't write arbitrary programs!), but rather the absence of a calculational method [0] for designing programs that don't have undefined behavior (for which a formal semantics is an indispensable prerequisite). To convince yourself that a C program doesn't have undefined behavior, you need to interpret the English text of the C standard, and that's what I object to.

[0] The calculations needn't be performed by a machine. If they are performed by a machine, you have a safe language. If they must be performed by humans, you have an unsafe language, which is nevertheless a lot more usable than C and friends.
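For instance (my sketch, not the commenter's), with Dijkstra-style weakest preconditions one can calculate the guard that rules out overflow rather than argue about it in English. For an increment of a machine integer bounded above by $N$:

```latex
% wp of an increment against the no-overflow postcondition x <= N:
wp(\texttt{x := x + 1},\; x \le N) \;=\; (x + 1 \le N) \;=\; (x < N)
```

So guarding the increment with a check that x < N is, by calculation rather than by appeal to prose, sufficient to avoid the undefined case.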

fwiw this appears to be a goal of the people working on speccing out unsafe Rust. You also don't need to worry about interpreting any English specs today, because there just aren't any specs at all yet!

I'd have to interpret the Rustonomicon, which is still written in English. :-p

There are some tools aimed roughly in this direction, such as https://github.com/TrustInSoft/tis-interpreter.

A major difficulty here is that undefined behaviour is a property of a particular execution of a C program, rather than a static property of the program itself. Tools that dynamically detect UB are useful, but will not demonstrate that there are no inputs for which a program will go wrong.

As powerful as an interpreter might be for running programs, it isn't the right tool for designing programs. Validating a program design requires static analysis, to be performed either mechanically (e.g., type checking) or manually (e.g., manipulating predicate transformers in the way advocated by Dijkstra).

If you take JS and add Flow, or Python and add MyPy, and then accept the judgment of the checker and don't use the parts it doesn't check, then it becomes a language subset -- and sort of a different language. Like if you have array bounds annotations and check all indexing operations, C arrays become Pascal arrays.

That's the thing - I'm not talking about using any mechanical checkers. I'm talking about using a language definition that makes it easier for humans to prove things about their programs.

Okay, that's interesting. How would that work in practice? It doesn't seem like it would be easy to apply in a timely way. If the rules can be unambiguously checked by humans, would they not be more efficiently checked by machines?

Proving theorems isn't just performing routine checks. There are creative aspects, like finding the right loop invariants, at which machines fare absolutely miserably.

OTOH, machines excel at exhaustive enumeration and enforcing separation of concerns: that's why ML-style algebraic data types, pattern matching and parametric polymorphism are such a boon for high-level programming.

So there's room for both human and machine work in program verification. What we should be interested in is finding the right division of labor between humans and machines.

One can use clang and gcc (via -fsanitize=undefined) to either report a runtime error when undefined behaviour is detected, or to do that and abort the program.

There are several things, like signed integer overflow, that often cannot be detected at compile time.

I think the OP is referring to full-blown verification systems, like some dependently typed languages have (although they aren't the only ones). If your program's behavior is defined because you wrote it that way (i.e. not by chance), you would have an argument for why the undefined behavior is not invoked that could be translated into a formal proof in most sophisticated (i.e. ZFC-ish) formal systems.

Formal verification, yes. Mechanical verification (dependent types, SAT/SMT solvers, model checkers, etc.), not necessarily. I don't mean to insult the efforts of the mechanical verification community, but it's preposterous to suggest that the only way to verify programs is to use a computer. Hey, we have brains! We can use them too!

To my knowledge the undefined behavior sanitizers are not complete. It is also unclear how you can both optimize your program (exploiting undefined behavior semantics) and provide a runtime check without completely killing performance.

Use it during your unit tests; clang claims the runtime cost is small, so it might not matter either way.

As useful as debugging tools are, they are no replacement for design tools.

HN title filter ate the "!". The title must say "!=" instead of "=".

Gah! Thanks. I guess I should use the words in the future. I'm used to copy/pasting titles, given the HN guidelines.

There's always the option to use <>. Though I'm not sure how common it is.

There is also ≠.

Where do I buy a keyboard that emits that?

You already have one: ≠

That works in the Linux command line, and in this text field on Linux: Ctrl+Shift+U, then the hex code point (2260 for ≠), then Enter.

Of course now you have to find the code. I've always had luck with wikipedia. Also I'm assuming my locale, etc etc works the same for everyone.


Similar methods available for windows and mac, I imagine.

If you're in Linux, and you have a key mapped to Compose, you can hit "Compose = /" or "Compose / =" to get ≠.

On Mac I believe it's just Option-=.

Typing the more unusual characters has always been way easier on Macs. (Literally since Macs were invented.)

> Typing the more unusual characters has always been way easier on Macs.

That used to be the case, but I find the Compose key in X is awesome: as others have noted, Compose / = → ≠ (and → is Compose - >)

That Compose system (which I haven't tried yet) does seem like a vast improvement!

On Mac, hit control+command+space, type "not equal", and hit down then left, choosing ≠.

Certainly longer than typing !=, but far easier than looking up hex values.

Actually if you just hit "Alt" + "=" on the Mac it will insert the ≠

And on Linux, there is a Mac style keyboard available. I use that plus the compose key for maximum special character support.

In Emacs, using a normal keyboard: “C-\ TeX RET \ne”.

With a standard keyboard, and an OS with support for compose sequences: compose + / + =

You can play with ~/.Xmodmap, maybe.

Or when lacking unicode, #

When I read pound/hash/octothorpe it does not yield "not equal" in my mind. I would use "!=" and if someone else used "#" I would struggle to understand its meaning.

It was quite common in Niklaus Wirth languages, for example:

    PROCEDURE & Init(id: LONGINT);
    BEGIN
        IF id # NofPhilo-1 THEN
            first := id; second := id+1
        ELSE
            first := 0; second := NofPhilo-1
        END
    END Init;

... you know you can edit your article title at any time ...

I don't have the option to right now. So not literally any time.

https://news.ycombinator.com/edit?id=13648333 maybe? I think that's how the mods access it IIRC.

Nope, doesn't have a way of editing.


Perhaps the title filter should leave alone ! if followed immediately by non-whitespace.
