A glimpse of undefined behavior in C (chris-cole.net)
107 points by shawndumas on Nov 30, 2013 | 64 comments



Knowing exactly how sequence points work might be an advanced topic, but I am not sure I'd hire a junior C/C++ programmer if he has not heard of sequence points. If somebody has, then he clearly won't write something like this code. I am not 100% sure of how they work either (haven't programmed C for 4 years), so I just avoid this kind of unnecessary complication in the code, if not for myself, then for the guy who comes after me, who, if he has a bad day, may inadvertently cause some trouble.

Really, that's one of the reasons why Java still works well at big companies. The code is usually so verbose and 'un-smart' that it's quite hard for programmers with an inferiority complex to obfuscate it when they're trying to prove that they're smart. It's a shame, however, that when the language is primitive you can always create a big framework to obfuscate stuff instead. :)


I am not 100% sure of how they work either (haven't programmed C for 4 years), so I just avoid this kind of unnecessary complication in the code

Hear, hear. I am considered the "language lawyer" of my embedded group and often get asked questions about C minutiæ. It isn't uncommon that my answer is "I don't know how that works, because I would never write something that requires an answer to that".

Modern compilers are blind to shorthands etc.; the only useful knowledge about "C" (really C compiler & hardware) minutiæ relates to (a) how to convince the compiler to optimize certain high-level constructs (e.g. loop unrolling), (b) how to convince the compiler to emit certain low-level constructs (e.g. SIMD instructions), and (c) how the hardware behaves (e.g. the cache model). Generally anything else – you can rewrite it so you don't have to think too hard about how C works.
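
As a sketch of (a) and (b) (my own illustration, not the commenter's; the exact effect depends on the compiler and flags), C99's restrict qualifier is one such hint: it tells the compiler two pointers don't alias, which is often what stands between a simple loop and unrolled or SIMD code:

    #include <stddef.h>

    /* restrict (C99) promises the compiler that dst and src never overlap,
       which typically frees it to unroll and/or vectorize this loop; without
       that promise it must assume a store to dst[i] could change src[j]. */
    void scale(float *restrict dst, const float *restrict src,
               size_t n, float k)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = k * src[i];
    }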

EDIT: Understanding type promotion is an exception to this. The integer type hierarchy is unfortunately (a) deeply baked into C and (b) mostly brain-dead – C mostly conflates physical integer width with modular arithmetic and provides no non-modular integer types, often leading to subtle software bugs (see last week's story about binary search).
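
For instance, a minimal sketch of the sort of bug that binary-search story was about (my own illustrative code, not taken from the story): with plain int indices the obvious midpoint formula can overflow, and because signed overflow is undefined the failure mode isn't even predictable.

    int binary_search(const int *a, int n, int key)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            /* int mid = (lo + hi) / 2;     can overflow when lo + hi > INT_MAX */
            int mid = lo + (hi - lo) / 2;   /* always stays within [lo, hi]     */
            if (a[mid] < key)
                lo = mid + 1;
            else if (a[mid] > key)
                hi = mid - 1;
            else
                return mid;
        }
        return -1;   /* not found */
    }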

EDIT: So is understanding const-ness. At least you can ignore this if you don't get it (or if C doesn't, as is the case with certain nested const types).
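
For the curious, a sketch of the kind of nested-const wrinkle alluded to here (my own example, not the commenter's): C will not implicitly convert char ** to const char **, even though it looks harmless, so adding const one level down often forces a cast.

    void print_all(const char **strings);   /* promises not to modify the chars */

    void caller(void)
    {
        char buf[] = "hello";
        char *s = buf;
        /* print_all(&s);                  rejected: char ** does not implicitly
                                           convert to const char **             */
        print_all((const char **)&s);   /* a cast is required, which loses part
                                           of the safety const was meant to add */
    }

(The rule exists because allowing that implicit conversion would open a loophole for writing to const data through an alias.)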


Const-ness is really different. It's a concept I really miss from other languages as it _adds_ more semantic information to the code. It does not obscure, it clarifies. It's a way to give orders instead of giving recommendations.


I really liked const-ness as well; in functional languages const-ness comes for free as all state is immutable by default -- one of the reasons I jumped to Scala.


Just FYI for readers, the term "sequence point" has been deprecated in C++11. There is now a richer notion of operations being sequenced before/after each other or unsequenced. The motivation for this was to improve C++'s memory model in the face of concurrency. (It does not change the results of this discussion, though :) )

More info here: http://blogs.msdn.com/b/vcblog/archive/2007/06/04/update-on-... and here http://en.cppreference.com/w/cpp/language/eval_order


Aside: this test code would have been a bit easier to debug if it had used decimal bases. That is, replace

  int a[] = {10,20,30};
  int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
with

  int a[] = {1,2,3};
  int r = 1 * a[i++] + 10 * a[i++] + 100 * a[i++];
Then if your program outputs r = 111, it's obvious that it's doing

  a[0] + 10*a[0] + 100*a[0]
and if it outputs 321, it's obvious that it's doing

  a[0] + 10*a[1] + 100*a[2]
No disassembly required.


Alternatively, just think a moment longer and realize that the lower limit of each value (10) already produces 60, and thus each accessed value must be 10.


What an annoying post. Talks like he knows a lot, when he doesn't know this most basic of basic points of how C works. It's understandable to not know everything, even to have the odd hole in your basics like this. But the know-it-all voice (and for such a basic topic) "I explained operator precedence..." is just going to make you someone nobody wants to work with. This is not undefined behavior, it's one of the first things you learn about C after encountering the pre/postfix operators IF you pay attention to the details, which, with C, you should know to do.


> This is not undefined behavior, it's one of the first things you learn about C after encountering the pre/postfix operators IF you pay attention to the details, which, with C, you should know to do.

What exactly is not undefined behavior? The post describes a statement where a variable is modified multiple times between sequence points. What does the standard say about this?

N1256, 6.5 says

  Between the previous and next sequence point an object
  shall have its stored value modified at most once by the
  evaluation of an expression.
J.2 Undefined behavior

  The behavior is undefined in the following circumstances:
  [..]
  * Between two sequence points, an object is modified more
    than once, or is modified and the prior value is read
    other than to determine the value to be stored (6.5).
What's with the know-it-all voice?


It's a common stylistic choice. Exaggerate how arrogant and smart you thought you were, only for it to be pricked by reality, end up humbled.


Yes, don't change the same variable in multiple places in the same statement in C.

In some cases the behavior is well defined, if there are sequence points between the assignments, but I would rather keep my code simple.
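
A minimal sketch of the well-defined cases being referred to (my own example, not from the article): operators like && and the comma operator introduce sequence points, so modifying the same variable twice around them is fine, while a plain + is not.

    #include <stdio.h>

    int main(void)
    {
        int i = 0;
        int a = i++ && i++;   /* defined: && sequences its left operand first
                                 (and short-circuits here, since i++ yields 0)   */
        int b = (i++, i++);   /* defined: the comma operator is a sequence point */
        /* int c = i++ + i++;    undefined: + imposes no ordering on the i++'s   */
        printf("%d %d %d\n", a, b, i);
        return 0;
    }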


Sequence points are an interesting topic on their own.

http://stackoverflow.com/a/4176333


I think they are the only thing relevant in the article. The introductory example is just one example of how not knowing about sequence points leads to bugs. There are many more, and it makes for a much more interesting read to first learn about sequence points and then apply that knowledge to various examples.

This article goes about it the wrong way.


The way I learned it in C a long time ago was that "post increment" meant "post statement increment".


There was a popular myth way back in the 90's that preincrement was always faster than postincrement because the latter could generate a temporary. As if this were the case:

    /* x++ */
    inline int postincrement(int *x) {
      int temp = *x;
      *x = *x + 1;
      return temp;
    }

    /* ++x */
    inline int preincrement(int *x) {
      *x = *x + 1;
      return *x;
    }

But in typical usage, where the expression value is not used, it doesn't make any difference:

    for (int i = 0; i < 10; ++i) {

If one looks at the assembly, it's clearly equivalent to:

    for (int i = 0; i < 10; i++) {

Incidentally, the latter seems more readable.


I always thought this "myth" was confined to C++ code (iterators) where there may in fact be code that does something analogous to what your "postincrement()" does.

Compilers are sometimes smart enough to remove the unnecessary operation in C++ (e.g. switch to ++x themselves), but I always use ++i for the same reason people simplify their usage of C in between sequence points --- it's an easy transformation, and why tempt fate?


The compiler can't necessarily optimise i++ away in C++, because the operator++ or the destructor of the temporary iterator object returned may have side effects. It's still not quite equivalent to ballard's postincrement() though, because the C++ standard allows that temporary copy operation to be optimised away.


I was actually asked in an interview at Google in 2011 whether pre- or postincrement was faster in a snippet of code where the expression's value wasn't used. When I said that I was pretty confident that the compiler would produce the same assembly in both cases, the interviewer insisted that pre- was faster based on the same logic you mentioned.

To her credit, she didn't penalize me for disagreeing with her and I still got the offer.


This is one of those things I've had in my head for years, and as a result of this I habitually use pre-increment in code, even in other languages. I think I haven't actually believed the "myth" myself, though, for quite a while.


It's a little weirder though:

  int main(int argc, char *argv[]) { int a = 0; printf("%i %i %i\n", a, a++, a++); }

will give "0 0 1" (gcc 4.2.1), the increment "shouldn't" happen until after the ; if you're going with post statement. I think your rule would expect "0 0 0" with a being 2 after the printf.

BUT! you get a warning, so that's nice.


The issue here is that the order of function argument evaluation is unspecified (and modifying a twice with no intervening sequence point is undefined). If the commas weren't inside of a function call, they would result in sequence points. So, this would be well defined:

   int a = 0, b, c;
   b = a++, c = a++;
But, there is no defined order in which to evaluate function arguments.
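
One way to keep the apparent intent of that printf while staying well defined (a sketch of my own, not the parent's code) is to evaluate the arguments yourself, in an explicit order, before the call:

    #include <stdio.h>

    int main(void)
    {
        int a = 0;
        int first  = a;      /* each full statement ends at a sequence point, */
        int second = a++;    /* so the order of the increments is now fixed   */
        int third  = a++;
        printf("%i %i %i\n", first, second, third);
        return 0;
    }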


The example given in the article shows all the code, but your example involves a function so the code in question is hidden. The behaviour partially depends on how the function is written.

"0 0 1" is what I would expect from your code, and identifier a should end up as 2.


I can't help but put this in the same bag as JavaScript's lack of block scope, exemplified by the famous undesired effects with for-loop counters.


Exactly. This is precisely to spec, and I got the answer of 60 immediately on looking at the code.


This is incorrect; it is undefined behavior according to the spec: http://www.reddit.com/r/programming/comments/1rrefp/a_glimps...


Ok, so do any other compilers handle it differently?


Yep.

Clang says:

  zsh% clang -o sequencepoints sequencepoints.c
  sequencepoints.c:7:18: warning: multiple unsequenced modifications to 'i' [-Wunsequenced]
    int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
                   ^            ~~
  1 warning generated.
And prints:

  zsh% ./sequencepoints
  140


Practically every new version of gcc adds a new optimization that recognizes some new form of undefined behavior and then rewrites your function to do whatever it wants.

Classic example: signed integer overflow. It worked for decades. Then one day it didn't.

If you want to know about this particular example: https://news.ycombinator.com/item?id=6824514 (not personally confirmed)
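
For reference, a sketch of the classic pattern (illustrative only; whether a given compiler actually deletes the check depends on the version and flags):

    #include <limits.h>

    /* The programmer means "detect wraparound", but signed overflow is
       undefined, so an optimizer may assume x + 1 never wraps and fold
       this whole test to 0, silently removing the check. */
    int naive_overflow_check(int x)
    {
        return x + 1 < x;
    }

    /* A well-defined way to ask the same question. */
    int overflow_check(int x)
    {
        return x == INT_MAX;
    }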


That isn't really an accurate description of the issue: it's not the case that it was "working" and then "broken" by GCC maintainers. It has always been unsupported and has not worked in specific situations, but for 99% of code it appeared to GCC users that it was supported. The signed integer overflow behaviour was never guaranteed by old versions of GCC, and code exploiting it would not always be compiled with the "expected" behaviour. It's just that, as the GCC optimizer has improved, there are more circumstances in which it performs optimisations that hinge on that assumption.

Compiler users generally want something that "just works" and doesn't do anything unexpected, but in the case of a low-level language like C, doing away with undefined behavior essentially would mean pessimistically avoiding many optimisations on the 99%+ of straightforward, reasonable code out there in favor of not doing anything surprising on the remaining fraction of dubious code that depends on certain things happening in scenarios where behavior is undefined according to the C standard. There are languages that make that choice, but C isn't one of them.


That is my point. De facto working code is not working code.

I have mixed feelings about the integer overflow issue because it's so easy to trigger, unlike contrived triple post-increment examples. And it usually results in a security problem. For very little benefit, IMO.


Exactly. De facto defined is usually just as important to consider as de jure.


ah, sorry, I'm stuck in C[whatever it was I learned in high school]


Why? What reason could there possibly be that such well known, gaping holes in the standard have persisted for all these years and multiple standard revisions? C90, C99, C11 and now C14... If performance or backward compatibility is a concern, surely an optional macro like __STDC_STRICT__ could be suggested in the standard with clear expectations that could be relied on.


That's a reasonable question. It's a shame this comment gets downvoted instead of the various people falling all over themselves to show off how smart they are by posting incorrect guesses about how undefined behavior doesn't matter.

Here's a good article that addresses the question "Why have undefined behavior?"

http://blog.regehr.org/archives/213

This one linking to the above is also worth reading:

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...


People seem to be forgetting that an implementation includes an execution environment as well as just a compiler, and that a perfectly valid (and, to my mind, very much recommended) option for undefined behaviour is to behave in some documented manner characteristic of the platform. I don't see this "no obligation" stuff as a great idea, not when it results in so much surprising behaviour - and especially not for simple situations where there's something obvious that the compiler could do, that would surprise absolutely nobody familiar with the system in question.

I always imagined undefined behaviour was there to avoid tying implementations' hands, by having the standard not mandate things that vary in practice. But it seems that people are assuming it's there to give compiler writers carte blanche to do whatever they like, and then point at the standard as justification. Given how much stuff could potentially be added to C to improve it, I don't know why people are spending all this time trying to figure out all the ways in which the letter of the law allows them to confuse the programmer.

See also somebody else's rant about strict aliasing: http://robertoconcerto.blogspot.co.uk/2010/10/strict-aliasin...


Hopefully, because most in-practice use of C avoids undefined (or complexly defined) behavior.


An entire blog post focused on a daily recurring SO question. Too bad those people won't know to search for "sequence point".


I observed this back in school using a simpler expression: i = 0; i = i++;

The answer will be different with different compilers. I have used this as an interview question ever since, not to get the right answer but to understand the candidate's thought process in solving the problem and his/her understanding of operator precedence :-)


One of the things we train any newcomer to our company is to never do what this man does: a complex single line with expected behavior based on your (arbitrary) conventions.

You always write multiple lines, with comments on each one.

It sounds ridiculous, but this simple thing has made some kinds of bugs impossible: the ones that are right in front of you but that you can't see in a million years. The atomic operation in code is the line; you can't debug a complex line (1).

It is painful forcing people to do that, people tend to hate being told what to do, but at the same time they love the outcome so much. In the end everybody loves it.

1. With assembly you are debugging an instance of your code. You are not debugging what will be created with any compiler, any OS or architecture.
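
For illustration, here is roughly what that policy does to the article's expression (a sketch assuming the original a[] = {10,20,30} example; the comments are mine):

    int a[] = {10, 20, 30};
    int i = 0;
    int r;

    r  = 1 * a[i];   /* first term uses a[0]        */
    i++;             /* advance to the next element */
    r += 2 * a[i];   /* second term uses a[1]       */
    i++;
    r += 3 * a[i];   /* third term uses a[2]        */
    /* r is now 10 + 40 + 90 = 140, with no reliance on evaluation order */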


While I agree the code presented is, plain and simple, just bad, and that the first step towards a solution is splitting it up, I really don't get why you want comments for every line. Or do you maybe not mean literally every line?

I mean suppose I split the first part of the statement like this:

  int r = a[ i ];
  ++i;
These lines are self-explanatory. It's pretty hard to argue they need a comment? Especially when used in a function that already should have a name/comment explaining what it does?


Lesson I took away:

The ++ operator should be deprecated except when it's the only operation on that line.


I like using it when I'm filling an array with an indeterminate number of elements. I've used this pattern a few times:

    foo foos[255];
    int ct = 0;

    if (<condition for adding first element>)
        foos[ct++] = foo1;

    if (<condition for adding second element>)
        foos[ct++] = foo2;

    ...

    for (int i = 0; i < ct; i++) {
        /* do something with foos[i] */
    }
IMHO, this is more concise and readable than separating the increment and assignment into two lines.


I think users of for/while loops will disagree. Don't forget that =, += , ... can all be abused in the same way.


    for (int i=0;i<10;i++)
would count as three lines for purposes of that rule, and the

    i++
as one. I could have been more precise, but I thought it would be understood.

The point is, to be on the safe side, don't use ++/-- in any line (in the sense of ;{}-delimited statement) that is also doing something else.

>Don't forget that =, += , ... can all be abused in the same way.

I didn't, and people should use the same safeguards around them, i.e. don't mix them within lines that do other things, "cleverness" or "C golf" be damned.


Interesting, my first guess would have been '60' since the old C89 / K&R C had the behavior

   (prefix ops) => (statement) => (postfix ops)
Which made code like

   *a++ = *b--; 
change the pointers after the copy, as opposed to having them change before the copy.

It is interesting that this has become compiler defined.


It hasn't. You're not modifying a single variable twice between two sequence points in that example.


That's really all there is to it, isn't it? I've never understood why some people feel the urge to modify the same variable multiple times in the same statement. It doesn't help readability, it's harder to reason about and worst of all leads you directly into the land of undefined behavior, where everything that can go wrong, will go wrong.


Yeah, the need to shove as much code as possible into one line seems to go away with experience. Once you have to debug and maintain that crap you realize very quickly that one liners are not necessarily something to strive for.


I used to strive for clever solutions. The shorter the better. I still like to look at them but I want them as far away as possible from my code base. Sure, if a solution is clever and efficient, while still readable, maintainable and easy to understand when documented properly, I'd use it. However, my general rule of thumb these days is: "Write this code, so an utter idiot can read it". Because that's what's going to happen: One day I'll have to read it again, and I'll feel like an idiot, trying to understand what once went through my brain.

That's something that took me way too long to learn and I feel that novice programmers should embrace that more. We all like to feel like Einstein sometimes, but a big code base is not the place for solutions only we can understand... temporarily.


Also, it's not implementation ("compiler") defined, it's undefined behavior.


I think I stumbled into this once, but didn't have the time at the moment to check what the problem was... I simply rewrote the code to remove the postfix increments. The good thing with C is that you can decide not to shoot yourself in the foot.


I just ran this after compiling it with clang-500.2.79 based on LLVM 3.3svn, and the output was 140. Maybe GCC is to blame?


It's undefined in the spec. So GCC and clang are both correct.


It's undefined behaviour, so you won't get a consistent answer.


Isn't this actually called "implementation-defined" behaviour in the standard rather than "undefined" behaviour? I generally get a segfault for undefined behaviors.


It's undefined. That means the compiler and runtime can choose to segfault, do something that looks right, silently corrupt your program, launch the nuclear missiles, whatever. And there's no requirement that this behavior be repeatable, so the code can work with debug flags turned on but fail in a release build.

You can never rely on your compiler's implementation of undefined behavior.


> You can never rely on your compiler's implementation of undefined behavior.

I wish I could somehow get my coworker to understand this. He's mostly "Yeah, well, I know it's undefined, but there's really no way this could be anything else than <foo>. I know what the CPU does there."


I believe this is "undefined" because the standard doesn't say what the behaviour is at all, as opposed to saying "the behaviour is implementation defined". I'm going from second-hand knowledge and it's been years since I looked at C (and never past a high school level of understanding).


Not "well-defined" is the proper terminology.


No. You might be right if we were talking about "proper use" in English as a spoken language, but "undefined behavior" is a technical term defined by the C and C++ ISO standards.


I see, okay, I did not know "undefined behavior" was a term defined by the ISO standard. It's pretty unfortunate that I have to pay $30 to see the ISO standard...

http://isocpp.org/std/the-standard


What is the difference? I do not see "well-defined" in the "Terms, definitions, and symbols" part of the standard.


It is 'de facto' defined.

http://en.wikipedia.org/wiki/De_facto

:)


"de facto" defined means nothing. Compilers change and break programs that depend on undefined behavior all the time.



