
What Every C Programmer Should Know About Undefined Behavior #2/3 - ryannielsen
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html
======
fragsworth
Kind of off topic, but I am curious: is this one of the advantages of
functional programming - the notion that you can prove or disprove certain
things about the code, and therefore let the compiler optimize, based on such
proofs, to your heart's content?

~~~
Peaker
Indeed, the more advanced a type system you have, the more you can optimize.
A nice example of this is the zip function.

In Haskell:

    
    
        zip :: [a] -> [b] -> [(a, b)]
        zip [] _ = []
        zip _ [] = []
        zip (x:xs) (y:ys) = (x,y) : zip xs ys
    

Note how each iteration needs to check both lists for emptiness.

In a more advanced type system (or using more advanced type hackery in
Haskell) you can have length-indexed lists. That is: lists that have 2 type
parameters or "indexes" instead of 1. The first is, as usual, the type of the
elements inside the list; the extra one is a natural number indicating the
length of the list.

So the function zip becomes something like:

    
    
        zip :: List N a -> List N b -> List N (a, b)
        zip [] [] = []
        zip (x:xs) (y:ys) = (x,y) : zip xs ys
    

The compiler checks the code and verifies, at compile-time, that both the
input lists and the resulting list are all of the same length. This also means
that you only need one runtime emptiness check instead of 2. If the N is
concretely known at compile-time, you may need 0 (though that case might be
caught by inlining/loop unrolling optimizations anyway).

~~~
cwzwarich
In an unsafe language like C or C++, you would just assert that the two
containers are the same length and use parallel iterators. The type system
here lets you optimize away some of the overhead of using a safe language, but
it doesn't let you write code that is more optimized than the obvious unsafe
code, at least in this example.
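
Roughly, in C that approach might look like the sketch below (identifiers such
as zip_into and pair_t are made up for illustration, not from the article);
the length relation is checked once at run time by the assert rather than
being tracked by the type system:

      #include <assert.h>
      #include <stddef.h>

      typedef struct { int first; int second; } pair_t;

      /* "Zip" two arrays by asserting equal lengths and iterating in parallel. */
      void zip_into(const int *a, size_t a_len,
                    const int *b, size_t b_len,
                    pair_t *out)
      {
          assert(a_len == b_len);        /* one run-time length check */
          for (size_t i = 0; i < a_len; i++) {
              out[i].first  = a[i];      /* parallel walk over both inputs */
              out[i].second = b[i];
          }
      }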

Also, in this example you're using a singly linked list, which is usually a
performance loss compared to a more appropriate data structure. CPUs are
designed to favor contiguous arrays.

~~~
Peaker
> In an unsafe language like C or C++, you would just assert that the two
> containers are the same length and use parallel iterators. The type system
> here lets you optimize away some of the overhead of using a safe language,
> but it doesn't let you write code that is more optimized than the obvious
> unsafe code, at least in this example.

Well, you can choose any two of: 1. Fast (less run-time), 2. Safe (cannot
crash), 3. Simple types (none of the advanced type hackery). Many choose 1+3
or 2+3, but advanced types let you choose 1+2, which in many cases is
remarkable; it is somewhat surprising that it is even possible.

Also note that runtime checks, in many cases, do not give you safety. For
example, normally, you have an array indexing operation like:

    
    
      readArray :: Array a -> Int -> IO a
    

Ignore the "IO" if you are unfamiliar with Haskell's IO type. This is unsafe
whether or not you have index bounds checking. A runtime check will only
convert one kind of error (corruption/segfault) to another (runtime
exception).

In these (common) cases, your _only_ option for getting safety is advanced
type hackery. Something like:

    
    
      index :: (size:N) -> Array size a -> Fin size -> a
    

where Fin N is the type of integers ranging between 0 and (N-1).

> Also, in this example you're using a singly linked list, which is usually a
> performance loss compared to a more appropriate data structure. CPUs are
> designed to favor contiguous arrays

Nothing about this type hackery is specific to singly linked lists; advanced
type hackery is applicable to pretty much any data structure. Also note that
the singly linked lists used in zip may not actually be represented by
pointer-chasing singly-linked lists. They may be "fused" together into
efficient loops that process the input directly.

~~~
cwzwarich
> Well, you can choose any two of: 1. Fast (less run-time), 2. Safe (cannot
> crash), 3. Simple types (none of the advanced type hackery). Many choose 1+3
> or 2+3, but advanced types let you choose 1+2, which in many cases is
> remarkable; it is somewhat surprising that it is even possible.

I don't think it's really that surprising that it's possible. Most programs
don't rely on any deep mathematical properties for their correctness, so the
ability to formalize a proof that they are correct and encode it into a type
system doesn't surprise me at all. If it is possible to do it on a large
program without maintenance or scalability problems, then I will be more
impressed.

You also present a false trichotomy. Haskell is not actually a completely
'safe' language; it still has exceptions that can cause a program to "go
wrong" and terminate unexpectedly. You can write 'fast' code with it, but it
isn't as fast as what I could write in a lower-level language.

> Nothing about this type hackery is specific to singly linked lists; advanced
> type hackery is applicable to pretty much any data structure. Also note that
> the singly linked lists used in zip may not actually be represented by
> pointer-chasing singly-linked lists. They may be "fused" together into
> efficient loops that process the input directly.

As the data structures get further away from algebraic data types, the type
hackery gets more and more involved. If you start working with arbitrary
mutable data structures that arise in practice, the type hackery becomes a
topic for a PhD thesis rather than something usable in practical programming
today.

I don't know of any compiler that actually does a reasonable job converting
uses of data structures like lists and association lists into arrays and hash
tables in general. There are compilers (like GHC) that handle special cases
but fall over beyond that. I think this falls into the territory of the
"sufficiently smart compiler" fallacy. Also, making data contiguous is just
the simplest of a common set of data representation optimizations. Does any
compiler automatically convert a list into an intrusive doubly linked list or
eliminate the use of a visited stack for DFS by stealing a bit from vertices
or reversing pointers?

~~~
Peaker
> Haskell is not actually a completely 'safe' language; it still has
> exceptions that can cause a program to "go wrong" and terminate
> unexpectedly. You can write 'fast' code with it, but it isn't as fast as
> what I could write in a lower-level language.

I never meant to imply Haskell was a "completely safe" language. That is an
impossibility; even totality does not imply complete safety. My zip example is
actually a point against Haskell (as length-indexed lists are not yet actually
in use in the Haskell ecosystem).

Haskell can encode low-level programs that are almost C-level, and in that
(ugly) style it can probably reach the performance you can get with
lower-level languages (excluding hand-optimized assembly, perhaps).

> As the data structures get further away from algebraic data types, the type
> hackery gets more and more involved. If you start working with arbitrary
> mutable data structures that arise in practice, the type hackery becomes a
> topic for a PhD thesis rather than something usable in practical programming
> today.

Mutable data structures do not have to "arise in practice". You can have pure
semantics with mutable performance (e.g: Clean's uniqueness types, or
Haskell's ST).

Lots of people's PhD theses are in actual use in practical programming today.
A PhD thesis often discovers a technique and makes it accessible for
real-world programs today.

> I don't know of any compiler that actually does a reasonable job converting
> uses of data structures like lists and association lists into arrays and
> hash tables in general. I think this falls into the territory of the
> "sufficiently smart compiler" fallacy.

Converting these is _not_ a good idea because you would change the
complexities of the code (which is, IMO, the heart of the fallacy). A prepend
to an a-list or a list is O(1), but to an array or hash table, it is (in the
worst case) worse.

But what Haskell's primary compiler, GHC, does do, is fuse together list
processing such that lists can disappear altogether into efficient loops.

As for arrays and hash tables, these are not amenable to efficient "pure"
modification (unless you use uniqueness types as in Clean), but they have
reasonable alternatives that are pure and persistent: Sequence allows O(logN)
indexing, O(1) amortized (O(logN) worst-case) prepend/append. Tries and search
trees allow for quick lookups and quick pure modification. These are easy to
reason about.

> Also, making data contiguous is just the simplest of a common set of data
> representation optimizations.

This conflicts with O(1) prepend. If you want contiguous allocation, you can
use different data structures.

> Does any compiler automatically convert a list into an intrusive doubly
> linked list or eliminate the use of a visited stack for DFS by stealing a
> bit from vertices or reversing pointers?

If you need to remove/insert items in the middle of a list, you would probably
not use a linked list. So a doubly linked list is not a good optimization to
apply automatically.

If your point is that code with simpler mathematical semantics, which is easy
to reason about, might take a constant-factor performance hit in some cases --
then I agree. You cannot _always_ have simplicity, verifiability, and speed.
But often you can.

Also note that purity affords many optimizations that are unavailable in non-
pure languages (e.g: rewrite rules like: map f . map g --> map (f . g)).

~~~
gsg
O(1) prepend doesn't conflict with good contiguity at all. Deque is a useful
data structure with (possibly amortised) O(1) prepend and append, O(1)
indexing and excellent contiguity.
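
One common way to get those bounds is a growable ring buffer. A rough C
sketch (illustrative names, not any particular library's API; error handling
elided):

      #include <stdlib.h>

      typedef struct {
          int    *buf;    /* contiguous storage */
          size_t  cap;    /* allocated slots */
          size_t  head;   /* index of the first element */
          size_t  len;    /* number of elements */
      } deque;

      static void deque_grow(deque *d) {
          size_t new_cap = d->cap ? d->cap * 2 : 8;
          int *nb = malloc(new_cap * sizeof *nb);
          for (size_t i = 0; i < d->len; i++)       /* unwrap into the new buffer */
              nb[i] = d->buf[(d->head + i) % d->cap];
          free(d->buf);
          d->buf  = nb;
          d->cap  = new_cap;
          d->head = 0;
      }

      void deque_push_back(deque *d, int v) {       /* amortised O(1) append */
          if (d->len == d->cap) deque_grow(d);
          d->buf[(d->head + d->len) % d->cap] = v;
          d->len++;
      }

      void deque_push_front(deque *d, int v) {      /* amortised O(1) prepend */
          if (d->len == d->cap) deque_grow(d);
          d->head = (d->head + d->cap - 1) % d->cap;
          d->buf[d->head] = v;
          d->len++;
      }

      int deque_at(const deque *d, size_t i) {      /* O(1) indexing */
          return d->buf[(d->head + i) % d->cap];
      }

Start from a zero-initialized struct (deque d = {0}); both ends grow in
amortised constant time while the elements stay within a single contiguous
allocation.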

------
gabi38
Is it just me, or is LLVM's idea that "If arithmetic on an 'int' type (for
example) overflows, the result is undefined.. For example, knowing that
INT_MAX+1 is undefined allows optimizing "X+1 > X" to "true"." incredibly
stupid? Who said that an undefined number is bigger than a defined one??

~~~
perokreco
It is not that the result of operation is undefined, the operation of
INT_MAX+1 itself is undefined(actually the entire program becomes undefined
when you do it), so the compiler can do whatever it pleases. According to the
C standard, INT_MAX+1 might as well lead to formatting of your hard drive. So,
it is safe to assume that x!=INT_MAX and that x+1>x for all x, because if
x=INT_MAX your program is undefined, so compiler can do whatever it wants,
including setting it to true or outputting an error or formatting the hard
drive.
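
A tiny sketch of how that assumption shows up in generated code (the function
name is made up for illustration); a compiler that exploits the undefinedness
of signed overflow is free to compile this to an unconditional "return 1":

      int fits_after_increment(int x) {
          /* Signed overflow is undefined, so the compiler may assume
             x + 1 > x always holds and fold the comparison to 1. */
          return x + 1 > x;
      }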

~~~
gabi38
This kind of logic is the exact cause of the optimization problem described.
We all know that in practice INT_MAX+1 is negative, and that many "non-
standard" programs depend on it, but still, the optimizer sticks to a C
standard that no one follows just because it gives it a nice optimization
opportunity. Bad..

~~~
jrockway
"Bad"? You use C because you want speed above all else. If you don't know the
language and need hand-holding, don't use C. Compilers are going to optimize C
-- that's why it usually runs so fast.

~~~
gabi38
This is bad because the optimizer ignores the imperfect reality of the big
crowd of non-standard programs. It actually punishes non-standard programs
(and probably the majority of programs out there are not 100% standard). So it
is a bad and patronizing optimization.

~~~
ori_b
How many programs are there that depend on SIGNED overflow wrapping around to
negative? I can't think of very many - the only places I've seen potential
signed overflow that wasn't a bug were where people were _checking_ for
overflow.

In those cases, just use unsigned math to do the checks.
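
For example, a sketch of such a check (assuming the usual two's-complement int
with no padding bits; the function name is made up): every operation below is
on unsigned values, so there is no undefined behavior to trip over.

      #include <limits.h>

      /* Returns nonzero if a + b would overflow a signed int. */
      int add_overflows(int a, int b) {
          unsigned int ua = (unsigned int)a;  /* conversion is defined: value mod 2^N */
          unsigned int ub = (unsigned int)b;
          unsigned int us = ua + ub;          /* unsigned addition wraps, no UB */
          /* Overflow occurred iff the sum's sign bit differs from the sign
             bit of both operands (only possible when the operands agree in sign). */
          return (int)(((ua ^ us) & (ub ^ us)) >> (sizeof(int) * CHAR_BIT - 1));
      }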

------
rwmj
Interesting that they say there is no way to detect undefined behaviour in C
code. This is precisely what the following project is all about:

<http://www.astree.ens.fr/>

~~~
vog
The cited project is about proving that a C program has _no_ undefined
behaviour. If such a proof can't be found, the code may still be okay.

So the existence of that tool doesn't contradict the statement of the article.

------
peterbotond
the way to write good solid code is to use more than 2 compilers, different
optimisation levels, etc. i mostly write code that runs on at least 3 different
os, and 2 different hardware platforms. ... and it just works. no prog-lang can
be faulted for bad programming using one. learn the lang, and use it. :-)

gcc still gets the pointer dereference and assignment to what a pointer points
to, due to deferring assignments. these edge cases and idioms of any lang need
to be looked at with academic eyes. :-)

~~~
astrange
Do you fuzz your inputs? Run with the clang undefined integer overflow
checker? Run your program on hardware with 16-bit int?

~~~
peterbotond
yes, random bits flipped on good input. yes, i run anything that can help me
and covers the problem. nowadays, 32 and 64 bit are what i write code for. oh,
the days of 16bit ints on win3.11 for workgroups are gone. :-)

llvm is an excellent tool, hats off, and thanks.

------
CamperBob
What would possess someone to write

    
    
      if (i < 0 || buffer+i-128 >= buffer)
       

... instead of

    
    
      if ((i < 0) || (i >= sizeof(buffer)))
    
    ?

~~~
jws
Many of the strange things that happen in C language edge cases come about
from macro expansion or code generation.

Unrelated to the issue, but the superfluous parentheses added in the second
example are interesting. I think every C coder has a level of comfort with the
15 sets in the operator precedence rules, and beyond that they throw
parentheses at it. I'm fine with the rules until I use a _<<_, _>>_, _&_,
_^_, or _|_ [1]. Then I drag out the parentheses.

[1] I swear the positions of the ones I named are placed alphabetically rather
than in relation to their mathematical function.

~~~
gsg
& and | have the wrong precedence because in primordial C they were both
either short-circuiting boolean or bitwise depending on context.

:(
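
The classic trap (a made-up example, not from the article) is that == binds
tighter than &, so a flag test without parentheses doesn't do what it looks
like it does:

      #include <stdio.h>

      #define FLAG 0x4

      int main(void) {
          int x = 0x4;
          if (x & FLAG == FLAG)       /* parses as x & (FLAG == FLAG), i.e. x & 1 */
              printf("looks set\n");  /* not printed, since x & 1 is 0 */
          if ((x & FLAG) == FLAG)
              printf("really set\n"); /* printed */
          return 0;
      }

Modern compilers at least warn about this (gcc/clang -Wparentheses), which is
some consolation.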

