
Fixing a 37-year-old bug by merging a 22-year-old fix - tscherno
http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/head/head.c?rev=1.18&content-type=text/x-cvsweb-markup
======
kazinator
This fix leaves a variable uninitialized, and depends on funny logic to later
give it a value.

The general structure should be

    
    
      FILE *fp = stdin; /* sane default */
    

then override with a new pointer from fopen if necessary.

In fact, looking at this more closely, note how fp is scoped to the body of
the for loop. That's where it should be defined:

    
    
       for (firsttime = 1; ; firsttime = 0) {
         FILE *fp = stdin;
    
         /* logic flips it to fp = fopen(...) as necessary */
    
       }
    

[Note: the C compiler Joy was working with probably didn't support block scope
declarations except at the top of the function. This comment applies to
merging the fix using the environment of today.]

Note how the fclose at the bottom of the loop ends up closing stdin in the
case that fp == stdin! It ensures that this only happens once, but you have to
analyze it to see that this is true. I would write this as:

    
    
         if (fp != stdin) /* defensive! */
           fclose(fp);
    

Making it an infinite loop terminated by calling exit is unnecessary. There is
already a continue in the loop, and a break in an inner loop; another break
instead of the exit, and a "return status;" at the bottom wouldn't hurt.
Better yet:

    
    
      /* Enter loop whenever there are args left to
         process, or even if there are no args if it is
         the first time through (to process stdin instead). */
    
      for (firsttime = 1; firsttime || *argv; firsttime = 0) {
        FILE *fp = stdin;
    
        if (*argv) {
           /* have args; do the fopen to replace stdin */
           /* set status = 1 and continue if cannot fopen */
        }
    
        /* do head logic */
    
        if (fp != stdin)
          fclose(fp);
      }
    
      return status;
    

C itself was only a couple of years old when that code was written originally,
and the systems were small. Whereas I've been using it since 1989. Not
criticizing the original!
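For concreteness, the restructured loop above can be fleshed out into a compilable sketch (option parsing omitted; the function name `head_loop` and the NULL-terminated `argv` convention are mine, not OpenBSD's):

```c
#include <stdio.h>

/* Sketch of the proposed structure: fp defaults to stdin inside the loop
 * body, fopen overrides it when there are args, and the fclose at the
 * bottom is guarded so stdin is never closed. */
static int head_loop(char **argv, long linecnt)
{
    int firsttime, status = 0;
    int ch;

    for (firsttime = 1; firsttime || *argv; firsttime = 0) {
        FILE *fp = stdin;
        long lines = linecnt;

        if (*argv) {
            if ((fp = fopen(*argv++, "r")) == NULL) {
                status = 1;
                continue;       /* next arg; the for-step still runs */
            }
        }

        /* head logic: copy the first `lines` lines */
        while (lines > 0 && (ch = getc(fp)) != EOF) {
            putchar(ch);
            if (ch == '\n')
                lines--;
        }

        if (fp != stdin)        /* defensive! */
            fclose(fp);
    }
    return status;
}
```

With no args, the `firsttime` flag gets the loop entered exactly once, processing stdin; with args, each file gets one iteration and a failed fopen just sets status and moves on.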

~~~
saurik
I would argue your insistence on giving the variable a "sane default" is a
matter of opinion. I personally am much happier seeing a variable left
uninitialized than ever seeing it double-assigned: to me, a fundamental
semantic logic error has occurred if you are overriding the value of this kind
of variable, especially as part of a conditional later where it might not be
clear whether or not the override happened. I really would prefer to look at
the variable being assigned and think "ok, now we know the value". The
compiler is entirely able to determine whether or not the variable has been
initialized in all possible control flows for a case like this (at least, once
we move the scope of the variable to inside the for loop, of course), so there
is no safety benefit to hoisting that default assignment out, nor is it saving
a meaningful amount of code: it only serves to obfuscate the "default" control
flow by making what had been an explicit if/else into an implicit if/default
:/.
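A minimal sketch of that style (the function name is hypothetical): the variable gets no initializer, every path assigns it exactly once, and a compiler's uninitialized-use analysis (e.g. gcc/clang with -Wall plus optimization) can check the definite assignment:

```c
#include <stdio.h>

/* fp has no "sane default"; both branches assign it, so the compiler can
 * verify the definite assignment, and a future path that forgets an
 * assignment gets flagged instead of silently inheriting a default. */
static FILE *open_input(const char *name)
{
    FILE *fp;                    /* deliberately left uninitialized */

    if (name == NULL)
        fp = stdin;              /* the explicit default branch */
    else
        fp = fopen(name, "r");   /* may be NULL; caller checks */

    return fp;
}
```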

~~~
pierrebai
Always initialize your variables. This is not an aesthetic argument; there
are multiple software-engineering arguments against failing to initialize
variables.

For starters, you don't have prescience. You can't prevent future versions
from growing new alternate code paths. As the code evolves, checking that
every initialization-free variable still gets initialized is a burden and
easily overlooked. Then there are more potent evolutions that make it just
impossible: the code gets generalized, or moved into a library, or gains a
context object to be able to run multiple instances. Or it gains a state
machine to handle complex combinations of options.

You're relying on the compiler telling you the variable will get initialized.
But the compiler may change. Or the human compiling it might not look at
warnings.

OTOH, the cost of double-initialization is mostly a myth. The cases where the
compiler can tell that the variable may not get initialized, and warn you, are
exactly the cases where the optimizer can see the double init and optimize it
out. I've looked at generated code and seen compilers properly eliminate
redundant stores.

You're better off betting on optimization and getting bitten by the occasional
double store than getting bitten by uninitialized variables and handing system
penetrators yet another free lunch.
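A sketch of the redundant-store claim (names illustrative): on the path where the override runs, the defensive default is a dead store, and gcc and clang at -O1 and above will generally remove it as ordinary dead-store elimination (typical behavior, not something the standard guarantees):

```c
#include <stdlib.h>

/* `count` gets a defensive default, then may be overwritten once. The
 * first store is dead on the arg != NULL path; optimizing compilers
 * typically eliminate it, so the "double init" costs nothing at runtime. */
static int line_count(const char *arg)
{
    int count = 10;                          /* defensive default */

    if (arg != NULL)
        count = (int)strtol(arg, NULL, 10);  /* override */

    return count;
}
```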

~~~
dllthomas
If you use static analysis, defensively initializing to a wrong value may mean
that your tooling doesn't tell you when you fail to correctly set it _from_
that wrong value.
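An illustrative sketch of that failure mode (the unit-conversion example is hypothetical): with the defensive initializer in place, the forgotten assignment draws no maybe-uninitialized warning and the function quietly returns a wrong value:

```c
/* With `long scale;` (no initializer), compiling with warnings enabled
 * would flag the KILO path as possibly uninitialized. With the defensive
 * default, the bug is silent: to_bytes(KILO, n) returns n, not n * 1024. */
#define BYTES 0
#define KILO  1

static long to_bytes(int unit, long n)
{
    long scale = 1;      /* defensive init to a wrong-for-KILO value */

    if (unit == BYTES)
        scale = 1;
    /* the `unit == KILO` assignment (scale = 1024) was forgotten */

    return n * scale;
}
```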

~~~
kazinator
I have seen this argument numerous times before.

It doesn't hold water, because potentially initializing variables to wrong
values is just as much a _consequence_ of using that static tool as it is a
consequence of an "always initialize" coding habit.

The static tool produces a report of uninitialized variables, or potentially
uninitialized ones. The developers react to this by fixing all those
instances, so that the tool is shut up. Once that happens, they have a code
base for which the tool doesn't report anything, just like the other
developers who always initialize.

There is no reason to believe that either team has a lower or higher rate of
incorrect-initial-value bugs. Or is there?

I would argue that the developer who is writing the code, at that moment,
quite possibly has a better idea of what the correct initial value should be,
than someone who is running a tool across a large code base and following up
(by him or herself).

Really, the coding convention should be "always initialize, with the correct
value which doesn't need to be overwritten later."

Even in imperative languages, we should program as "functionally" as is
reasonably possible.

Imperative code that mutates local variables should only do so if it
implements iteration. If the code has no loops and is setting local variables
conditionally, that shows that it's trying to influence the behavior of some
common code where the control flow paths merge --- instead of putting it into
a function!

In the specific program we are looking at, for instance, the function is
trying to use the head processing block at the end, but over different input
streams. This can be done by putting it into a function, so then there is no
assignment to the fp variable:

    
    
       if (!*argv) {
         head_processing(stdin);
       } else {
         FILE *fp = fopen(*argv, "r");
         /* check for fopen failure ... */
         head_processing(fp);
         fclose(fp);
       }
    

We can eliminate assignments out of loops, too, but that requires tail
recursion:

    
    
       /* pseudo code */
       int head(int argc, char **argv, int first_time)
       {
         if (argc == 0 && first_time) {
           return head_processing(stdin);
         } else if (argc == 0) {
           return 0;
         } else {
           FILE *fp = fopen(*argv, "r");
           /* head_processing checks for null */
           int status = head_processing(fp);
           fclose(fp);
           
           /* tail call, if status was good */
           return (status != 0) 
                  ? status : head(argc - 1, argv + 1, 0); 
         }
       }
    

C programmers will tend not to be comfortable with this style, and it requires
tail call support in the compiler to not generate O(N) stack. So for pragmatic
reasons, you can't make a coding convention out of this.

~~~
dllthomas
No tool will help you much if you don't know how to use it. In this case, if
you use the tool wrong, one failure mode puts you in the same situation as not
using the tool and defensively initializing. That is a _very_ poor argument
that the tool does no good.

~~~
kazinator
That argument isn't that the tool does no good. The tool is useful even if you
follow the practice of always initializing: it helps catch accidental
deviations from that practice.

~~~
dllthomas
Well sure. My point is that the tool basically devolves into that kind of
checking if you treat it poorly, and can be more useful if you don't. If you
make a habit of initializing variables with the wrong value, it's _less_
useful.

------
userbinator
The extremely odd for-loop structure feels to me like someone was trying very
hard to avoid a goto, and in the process obfuscated the flow quite a bit more.
The special case of head'ing stdin is similar to the classic "loop and a half
problem" (which is really only a problem if you religiously abstain from goto
use.) I'd consider this a more straightforward way to do it:

    
    
        FILE *f;
        int i = 1;
        ...
        if(argc < 2) {
           f = stdin;
           goto do_head;
        }
        ...
        while(i<argc) {
           f = fopen(argv[i], "r");
           /* check for failure to open */
           ...
          do_head:
           /* head'ing code goes here */
           ...
           fclose(f);
           i++;
        }
    
    

That being said, OpenBSD code is still far more concise and readable than GNU;
compare coreutils' head.c here:

[http://code.metager.de/source/xref/gnu/coreutils/src/head.c](http://code.metager.de/source/xref/gnu/coreutils/src/head.c)

GNU head has a bit more functionality, but I don't think the increase in
complexity - nearly 10x more lines - is proportional to that.

~~~
Flow
Here's the version used in OS X. It's a bit bigger but not GNUishly big.

[http://www.opensource.apple.com/source/text_cmds/text_cmds-8...](http://www.opensource.apple.com/source/text_cmds/text_cmds-87/head/head.c)

~~~
userbinator
Looks like a small extension to the BSD one, that can also handle head'ing by
byte count.

(Although I think the use of malloc() and strcpy() to handle the -[0-9]+ case
seems rather unnecessary...)

------
ceejayoz
In a 24-year-old version control system.

 _edit:_ Wow, not sure what's offensive here...

~~~
tux3
People here downvote what they _don't like_

~~~
ceejayoz
I'm well aware of _that_; the question is _why don't they?_ I found it a
neat aspect that a long-lived bug with a long-lived PR is in a long-lived
VCS.

~~~
DanBC
Your single-line comment might have appeared dismissive of the version
control system used by the OpenBSD people, especially because it's something
that often comes up in threads about BSD.

Threads involving OpenBSD can involve fierce downvoting so it's probably a
good idea to be clear about what you're saying.

~~~
res0nat0r
I think all of this could be easily avoided if HN stopped allowing people to
try and silence valid comments that they happen to disagree with.

The site is now more political than ever and this just seems to fuel the
"silence what you don't agree with" type voting, which I don't think is a good
thing.

------
kazinator
This link to the commit log is slightly better:

[http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/head/he...](http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/head/head.c)

You're one click away from the diff.

~~~
0x0
What's with the previous commit about a "DDOS" - it's just changing exit(0)
into exit(status)?

~~~
kazinator
There must be more context to that. CVS doesn't have change sets; just from
looking at this log, we don't know what else may be related to this.

Maybe some important shell scripts hang in a loop because they expect head to
report an unsuccessful termination status when a file doesn't exist. Maybe
some situation exists where a privileged system script can be fooled into
looping by an unprivileged user.

Here is the mailing list discussion:

[https://www.mail-archive.com/misc@openbsd.org/msg132628.html](https://www.mail-archive.com/misc@openbsd.org/msg132628.html)

There is no report of any actual DDoS. Craig Skinner was really just testing
tools on nonexistent files!

So the commit comment just follows from some general hypothesis that incorrect
exit statuses from tools can be exploited in some way by attackers.

~~~
0x0
Looks like that's it, and the commit log was probably just sarcastic then. :)

------
cordite
One of my coworkers found a bug causing data-integrity problems that had been
around for 11 years.

He and his team lead had no idea how such a bug slipped past 11 years of
peer QA through all the edits on that module / activity.

It was as simple as converting to string and back and hitting an edge case on
internationalization with commas vs periods.

~~~
pacaro
My favourite decimal-separator i18n bug was in a product where we were reading
a time-dilation factor from a config file. In the default shipping file (which
basically nobody ever changed) it was specified as "1.0". We would then read
this using (this is in .NET) Double.Parse(), which honors the user's locale if
none is specified, so for all users in Europe, and probably elsewhere, this
was interpreted as 10 ('.' is a grouping construct and more or less ignored).
So the software was running 10x slower than expected.

~~~
darklajid
'.' would be weird here as well (and we had our share of issues with early
online banking systems: think "I want to pay 10 $currency, let's put 10.00" -
and the bank uses the '.' as a delimiter for thousands, i.e. 1.000,00, when
you meant 10,00).

That said, my favorite localization issues stem from CH/Switzerland. Numbers
are formatted as 1'000.00 there (someone correct me), which is especially odd:

1) That's a weird decimal separator. DE uses a ',' and as far as I know FR
uses a ',' as well. I _assume_ IT uses a ',' too - so why is this a '.' in CH?

2) The ' really kills a lot of naive parsers and causes quite a bit of
frustration in other fields (think OCR - some overly fat 1'000 might turn into
a 11000 with bad luck/crappy image quality or (marginally better) lead to 1
_000 where_ means 'I think here's a character, but no clue what that might
be')
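A hedged sketch of a parser that tolerates the Swiss apostrophe separator but rejects the OCR junk mentioned above (the helper name and buffer size are arbitrary):

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical parser for "1'000.00"-style input: apostrophes are dropped
 * as group separators; any other non-numeric character (e.g. the '?' an
 * OCR pass might emit) rejects the whole string via *ok = 0. */
static double parse_swiss(const char *s, int *ok)
{
    char buf[64];
    size_t n = 0;

    *ok = 0;
    for (; *s != '\0'; s++) {
        if (*s == '\'')
            continue;                       /* group separator: skip */
        if (!isdigit((unsigned char)*s) && *s != '.' && *s != '-')
            return 0.0;                     /* OCR junk etc.: reject */
        if (n + 1 >= sizeof buf)
            return 0.0;                     /* absurdly long: reject */
        buf[n++] = *s;
    }
    buf[n] = '\0';
    *ok = 1;
    return strtod(buf, NULL);
}
```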

~~~
wastedhours
Different currencies have different conventions?

~~~
darklajid
I'm confused.

Yes, different conventions exist (in this case the thread drifted to bugs
based on locales, so obviously we're talking about different conventions
here). I... don't understand what you're saying.

(And my original post is missing an asterisk in two places, as in 1?000 as an
OCR result, for example - I messed up the formatting in the middle of the
night.)

~~~
wastedhours
Sorry - my point was counter to yours about DE, FR and IT having the same
conventions, so it's odd CH doesn't too. They all use the same currency,
whereas CH doesn't; hence why there'd be a different one (?). E.g. we use
£1,000.00 in the UK.

~~~
dalai
DE, FR and IT have had the same currency only for the last 15 years or so.
One would expect them to be using the same conventions because of the history
and relationships between CH and all of these countries.

------
ape4
I like that the -number option is "obsolete". I still use it.

~~~
alexmchale
That's pretty funny. I'd say that the majority of the time that I call head or
tail, it's using that syntax.

~~~
cerberusss
Exactly, and I bet that it's not going away any time soon, "obsolete" or not.

------
ahmett
Here is the diff:

[http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/head/he...](http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/head/head.c.diff?r1=1.17&r2=1.18&f=h)

------
gojomo
If OpenBSD has anyone young enough, it would've been a fun twist to have
someone who wasn't even born at the time of the fix make the commit.

------
Scuds
> for (firsttime = 1; ; firsttime = 0) {

does this line predate the while loop?

or boolean values in C?

~~~
ghshephard
What's the common way of doing this in C? Is it normally something like this?:

    
    
      firsttime=True
      while (True) {
      ..
      ..
      ..
      firsttime=False
      }

~~~
Someone
Depends on the person. There is likely a correlation with what century you
learned programming in, but certainly one with whether you have experience
with Pascal-ish languages.

People educated in the Wirth way would say "that's not a for loop; there is no
way to tell how often it will iterate when it is first entered" and write it
as a while loop.

Old-school C programmers seem to think one iteration construct is enough for
everybody and like to write any iteration as a for loop, ideally with a few
commas and no actual statements, as in the classic strcpy:

    
    
      char *strcpy(char *dst, CONST char *src)
      {
        char *save = dst; 
        for (; (*dst = *src) != '\0'; ++src, ++dst);
        return save;
      }
    

(from
[http://www.ethernut.de/api/strcpy_8c_source.html](http://www.ethernut.de/api/strcpy_8c_source.html))

Programmers with some Wirth-style training do at least something like

    
    
      char *
      strcpy(char *s1, const char *s2)
      {
        char *s = s1;
        while ((*s++ = *s2++) != 0)
    	;
        return (s1);
      }
    

(from
[http://www.opensource.apple.com/source/Libc/Libc-167/gen.sub...](http://www.opensource.apple.com/source/Libc/Libc-167/gen.subproj/ppc.subproj/strcpy.c))

[If you google, you can find way more variations on this. For example,
[http://fossies.org/dox/glibc-2.20/string_2strcpy_8c_source.h...](http://fossies.org/dox/glibc-2.20/string_2strcpy_8c_source.html)
does a do {...} while(...) that I am not even sure is legal C; it uses
offsets into the source to write into the destination.]

From your example I guess you are definitely in the Wirth (Pascal-ish) camp,
beyond even that second stage. Classic C programmers would either use 0 and 1
for booleans or #define macros FALSE and TRUE (modern C has true booleans, but
that extra #include to get them increases compilation time, so why do it?).

I am in the Pascal-ish camp. I like to say C doesn't have a for loop because
its 'for' is just syntactic sugar for a while loop.
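The "syntactic sugar" claim can be made concrete, with one wrinkle: `continue` still runs the for loop's step expression, but skips past the step in a naive while desugaring (a sketch, with illustrative functions):

```c
/* Counting odd numbers below n, once with for and once with the while
 * desugaring: for (init; cond; step) body  =>  init; while (cond) { body; step; }
 * The bodies differ only where `continue` appears. */
static int count_odds_for(int n)
{
    int i, odds = 0;

    for (i = 0; i < n; i++) {
        if (i % 2 == 0)
            continue;        /* the step expression i++ still runs */
        odds++;
    }
    return odds;
}

static int count_odds_while(int n)
{
    int i = 0, odds = 0;

    while (i < n) {
        if (i % 2 == 0) {
            i++;             /* must duplicate the step before continue */
            continue;
        }
        odds++;
        i++;
    }
    return odds;
}
```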

~~~
chris_wot
I find the for loop easier to read.

------
aalbertson
Looking through all these comments, my only response is "effing nerds..." <3

