GCC Undefined Behavior Sanitizer – Ubsan (redhat.com)
58 points by nkurz on Oct 16, 2014 | 27 comments



It seems like this is checking for undefined behavior in the binary that is produced (accesses to undefined memory, division by zero), rather than for behavior that is "undefined" by the C standard (it's a signed int, whose overflow the standard leaves undefined, so I'll optimize by silently removing your overflow check).
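
For example, a minimal sketch of the kind of check that can get silently dropped (whether it actually is dropped depends on the compiler and optimization level):

  #include <limits.h>
  #include <stdio.h>

  /* intended as an overflow check, but it relies on signed overflow (UB);
     an optimizing compiler may assume the overflow never happens and fold
     this to "return 0" */
  static int will_overflow(int x) {
      return x + 1 < x;
  }

  int main(void) {
      printf("%d\n", will_overflow(INT_MAX));  /* may print 1 at -O0 and 0 at -O2 */
      return 0;
  }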

See here for more: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

At first I was excited, because I thought this sanitizer would help with these cases, which I usually find much harder to debug than a division by zero or a dereference of a NULL pointer that causes an actual crash. Am I wrong? Does this sanitizer actually help with these cases? Is there anything else in the pipeline that will?


Some of what ubsan does sounds like Valgrind's feature set, e.g. detecting uninitialised variables and accesses outside of allocated memory, except that it can catch a few more bugs.

The fact that undefined behaviour is in practice a real pain has prompted some work on C variants which replace most instances of UB with more reasonable implementation-defined behaviour instead, e.g. http://blog.regehr.org/archives/1180 (discussed before here: https://news.ycombinator.com/item?id=8233484 )


[deleted]


> Ubsan should adopt a couple of other safety improvements from C++, ...

Ubsan is not a bucket for a bunch of "safety improvements"; it's a tool for detecting undefined behavior at runtime. GCC has many tools for detecting problems in your code, and there are some options which enable several such diagnostics at the same time. Note that ubsan doesn't include asan, which (in my mind) is way more useful than ubsan.

However, an implicit conversion from void * is very common in C. As an observation, most C programmers do not cast the result of malloc(). If you use an explicit cast on your malloc() on Stack Overflow, C programmers will come out of the woodwork to explain not only why it's unnecessary, but why the cast can hide undefined behavior[1], and so you shouldn't do it! So this is a no-go. Maybe you can ask for it as an optional warning if you really want it.

The other one you asked for is already available: -Wwrite-strings. If you want an error, -Werror=write-strings.
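
For instance, a small sketch of what that flag catches (the write itself is undefined behaviour with or without the warning):

  int main(void) {
      char *s = "hello";   /* with -Wwrite-strings the literal has type const char[6],
                              so this initialization is diagnosed */
      s[0] = 'H';          /* modifying a string literal is undefined behaviour anyway */
      return 0;
  }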

I suggest browsing the docs, if you are interested in getting more warnings:

https://gcc.gnu.org/onlinedocs/gcc-4.9.1/gcc/Warning-Options...

[1] If you forget to include <stdlib.h>, then you will get an implicit declaration for malloc() which returns int. An explicit cast from int to e.g. char* is NOT what you want. You can, of course, turn on -Werror=implicit-function-declaration. However, if you just omit the cast, you will always get an error because the int can't be implicitly converted to a pointer type.
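
To spell [1] out with a small sketch (it relies on pre-C99 implicit-declaration rules; a modern compiler will at least warn about the implicit declaration itself):

  /* <stdlib.h> deliberately not included, so malloc() gets implicitly
     declared as returning int */
  int main(void) {
      char *p = (char *) malloc(10);  /* cast silences the diagnostic and hides the bug */
      char *q = malloc(10);           /* no cast: the int-to-pointer assignment is diagnosed */
      (void) p; (void) q;
      return 0;
  }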


> On a platform that doesn't have alignment requirements, it's not a bug to access misaligned data

Consider the following statement:

> On a platform that has well-defined signed overflow, it's not a bug to perform signed overflow.

If you've ever cranked up GCC's optimization settings on a program that performs signed overflow, you know this simply isn't true.

In general, the fact that the underlying hardware supports a certain set of machine semantics does not guarantee that the implementation will extend the semantics of the C or C++ abstract machine to match. It of course may do so—after all, it may do whatever it pleases once your program exhibits undefined behaviour—but unless you are specifically targeting your implementation's C- or C++-like dialect instead of ISO C or C++, accessing unaligned memory is a bug.

> Ubsan should adopt a couple of other safety improvements from C++

I agree that these should be adopted, but disagree that it should be ubsan which does so. These are statically detectable errors, so it should be the compiler which implements the detection, be it as a QOI extension or as part of the language standard.


  > unless you are specifically targeting your implementation's 
  > C- or C++-like dialect instead of ISO C or C++, accessing 
  > unaligned memory is a bug.
I'm ignorant. What exactly are the standard-defined C language requirements for alignment? I'd presumed the standard only defined relative alignments, and absolute alignment was only required by the implementation and hardware. But I'm not familiar with the spec. Is reading a 4-byte int from a non-multiple-of-four address explicitly undefined behavior?


Specific alignment requirements are up to the implementation. Nevertheless, the standard can (and does) specify that violating them is UB.


> Is reading a 4-byte int from a non-multiple-of-four address explicitly undefined behavior?

A type in C or C++ has an alignment, queryable through C's _Alignof and C++'s alignof. This is not necessarily the same as its size, though there is a relationship.

Given a pointer-to-cv-T, you may only access memory through that pointer if the address it encodes is aligned to "_Alignof(cv-T)"/"alignof(cv-T)". If it is not appropriately aligned, accessing memory through it is undefined behaviour.

In other words, the standards do not specify the alignment of any type (except to say that it's always at least 1, IIRC), but they do require that you respect alignments.
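
To make that concrete, a minimal C11 sketch (assuming _Alignof(int) > 1 on the platform):

  #include <stdio.h>

  int main(void) {
      _Alignas(int) char buf[2 * sizeof(int)] = {0};
      /* buf itself is aligned for int, so buf + 1 is not; reading an int
         through it is undefined behaviour, and is the kind of access that
         ubsan's alignment check reports at runtime */
      int *p = (int *) (buf + 1);
      printf("_Alignof(int) = %zu, misaligned read = %d\n", _Alignof(int), *p);
      return 0;
  }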


Interesting, thanks. I've recently been working on various optimizations that involve writing 32-bit integers with unaligned 128-bit vector writes, where I advance the pointer by a multiple of 4 (but not of 16) between writes, thus overwriting the tail of the previous write. Your explanation would seem to imply that there is no possible way this can be "spec correct". Since I need intrinsics and inline assembly for this anyway, strict spec adherence isn't a priority, but it's nice to learn more.
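
Roughly like this, for what it's worth (a simplified sketch with SSE2 intrinsics; the names are made up, and the real code picks the advance via a conditional move rather than a parameter):

  #include <emmintrin.h>  /* SSE2 intrinsics */
  #include <stdint.h>

  /* store all 16 bytes every time, but only "keep" the low n 32-bit lanes
     by advancing the output pointer by n; the next store overwrites the
     tail of this one */
  static uint32_t *store_lanes(uint32_t *out, __m128i v, unsigned n /* 1..4 */) {
      _mm_storeu_si128((__m128i *) out, v);  /* unaligned 128-bit store */
      return out + n;                        /* advance by a multiple of 4 bytes, not 16 */
  }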


I'm wondering why you wouldn't use intrinsics to begin with, since most architectures have multiple load/store operations for vector types and using intrinsics you get to specify which load/store operation you use.


In the case I'm speaking of, I can often handle the vector writes with intrinsics --- GCC, Clang, and ICC will accept unaligned pointers without problem. But the reason I'm using this write strategy (advancing output by a variable) is to avoid branch prediction errors by converting a jump table to a conditional move. Often (half the time?) I can't figure out a way to get the intrinsics to produce the assembly I want, much less across all three of those compilers. Combining the intrinsics with the conditional moves is what requires the inline assembly.

Other places this comes up are addressing modes, operands that work directly on memory, and register allocation. For example, Haswell is capable of doing 2 loads and a store in a single cycle, but only if the store address is a "simple" base plus offset (no indexing). Convincing a compiler to produce a particular addressing mode is somewhat like teaching a blind person to drive a stick shift while you too are wearing a blindfold. Particularly if you are optimizing for a particular processor (and don't need to support Windows), inline assembly is considerably more reliable than depending on a compiler to consistently generate sane code.


Given that we are going to stay with C and C++ for a while, these kinds of tools should be part of the toolbox of any developer using said languages, and at the very least there should be a build with all the knobs turned on.


s/ a while/ever/


I am pretty aware that C and C++ are going to outlive me, but that doesn't mean they will be the systems programming languages of choice until the Sun ceases to exist.

Being someone who enjoys C++ doesn't mean I have to agree with all of its design decisions.

Nor will I refuse to use C if that is what the customer pays me for, even if I dislike it. Professionalism comes before personal opinions.


Not necessarily. Even the C++ committee now accepts that most of the C++ code that will ever be written has already been written.


I have no idea where you got that claim from. I know many members of the C++ committee, and they expect new C++ to continue to be written; otherwise why would they bother with C++11/14/17?


But do they expect more billions of new lines of C++ to be written than have ever been written before?


Could you please clarify? Do you mean that more than 50% of the eventual total lines of C++ have already been written as of today (Peak C++, anyone?)?

Or do you mean that a very large percentage of the C++ lines yet to be written are expected to be endless variations and downright repetitions of the lines already written as of today?


I've tried catching this little piece of undefined behaviour, but I was not able to (gcc 4.9, -fsanitize=undefined):

  #include <stdio.h>
  int main() {
    return printf("a") + (printf("b") + printf("c"));
  }


What's undefined about that?

The order of evaluation is only unspecified, i.e. you can't know which permutation of a, b, c you'll get, but you will end up with one of them (as long as I/O doesn't fail).


Thanks to cygx and readerrrr for correcting me. The difference between undefined and unspecified is (to me) very subtle and often escapes me when I have to apply it later.


That behavior is unspecified. The result might vary across compilers, but there is no undefined behavior.

If your program can handle the functions being called in a different order, you are safe, which is different from UB, where the compiler can (theoretically) do anything.
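
To make the distinction concrete, a small sketch contrasting the two:

  #include <stdio.h>

  int main(void) {
      int i = 0;
      /* unspecified: prints "ab" or "ba", but a is 2 either way */
      int a = printf("a") + printf("b");
      /* undefined: i is modified twice without sequencing (GCC's compile-time
         -Wsequence-point warning covers this, not ubsan) */
      i = i++ + a;
      return i;
  }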


It's been a long time since I've done C/C++ programming. Is the behaviour undefined because of an overflow (the sum of the three printf() return values, each a count of characters written, might exceed INT_MAX)?

In the general case, that is. I imagine your program (as written) won't see an integer larger than 3.


The order of the printf() calls is in question. The order of evaluation is unspecified but not undefined.


I am not familiar with undefined behaviour. Why exactly can't such behaviour be caught at compile time?


For an easy example, consider the following function:

    int next(int x) { return x + 1; }
This invokes undefined behavior in the case where x = INT_MAX. Yet if this produced a compile-time diagnostic, it would be tremendously irritating and generally useless. Worse than useless, as it would swamp real diagnostics.
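
Which is part of why ubsan checks at runtime instead: with -fsanitize=undefined, only the call that actually overflows gets reported. A minimal sketch:

  #include <limits.h>

  static int next(int x) { return x + 1; }

  int main(void) {
      int a = next(41);       /* fine, never reported */
      int b = next(INT_MAX);  /* signed overflow: ubsan reports this call at runtime */
      (void) a; (void) b;
      return 0;
  }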


There are at least three ways I can see to answer your question, all of them answering it in different ways:

1. It's undecidable, because precisely detecting undefined behavior according to the C and C++ standards in general reduces to the halting problem. It's perfectly defined to have a program that dereferences a null pointer in a branch that is never actually taken at runtime (see the sketch after this list).

2. The type systems of C and C++ were not designed to be restrictive/expressive enough to disallow undefined behavior, for a multitude of reasons (simplicity of implementation [especially in 1973], backwards compatibility, concerns over performance).

3. Defining away all undefined behavior via the runtime system (e.g. NaCl or asm.js) requires a sophisticated runtime and incurs a performance penalty.

The first explains why C and C++ compilers cannot catch all undefined behavior. The second explains why the C and C++ languages do not disallow undefined behavior. The third explains why most C/C++-based systems do not disallow undefined behavior at runtime.
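
For point 1, a minimal sketch: whether this program's behaviour is undefined depends entirely on its runtime input, so no static check can be both precise and complete:

  #include <stdio.h>

  int main(int argc, char **argv) {
      int *p = NULL;
      (void) argv;
      if (argc > 100)              /* a branch that is almost never taken */
          printf("%d\n", *p);      /* null dereference: UB, but only if we get here */
      return 0;
  }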


How does that compare to undefined behavior detection with Valgrind?



