
Some dark corners of C - fayimora
https://docs.google.com/presentation/d/1h49gY3TSiayLMXYmRMaAEMl05FaJ-Z6jDOWOz3EsqqQ/edit?usp=sharing
======
eliben
Most of these "dark corners" have been in C for at least 25 years and have
been repeated over and over for at least 20.

My main take-away from this is that Google Drive seems like a nice way to put
presentations online :-)

~~~
pooriaazimi
> _My main take-away from this is that Google Drive seems like a nice way to
> put presentations online :-)_

Don't. I've been trying to access the presentation for 10 minutes and it won't
allow me:

    
    
        Wow, this file is really popular! Some tools might be unavailable until the crowd clears.
    

and then I get redirected to
[https://support.google.com/accounts/bin/answer.py?hl=en&...](https://support.google.com/accounts/bin/answer.py?hl=en&answer=32050)
(which is stupid, because there's nothing cached/cookied for google. In fact,
I'm in Firefox's "Private Browsing")

~~~
Flimm
For me, it works, but it makes each slide an entry in my browser's history,
which I hate.

~~~
qznc
You can link to certain slides this way. I would consider this a good thing.

------
gjulianm

        #define struct union
        #define else
    

That's evil. I'll have to do it in someone's code someday, just to have some fun.

But, apart from that, it's a really nice compilation. I didn't know about the
compile-time checks of array sizes, but I have a question. What if I pass to a
function declared

    
    
        int foo(int x[static 10])
    

this pointer

    
    
        int* x = (int*) calloc(20, sizeof(int));
    

Does the compiler skip the check? Does it give me a warning?

EDIT: Funnily enough, on Mac it doesn't give any warning, either for pointers
or for undersized arrays (i.e., foo(w[5]) doesn't give a warning). And I've
compiled with -std=c99 -pedantic -Wall.

~~~
tobiasu
Last time this came up on HN, it was thought to be a clang feature.

Edit: While we're talking about dark corners, please stop casting functions
that return void *. If your code lacks the declaration of the function, the
compiler will assume pre-ANSI C semantics and generate code returning an
_int_.

On machines where pointers do not fit in ints (basically all 64-bit machines),
you just silently (due to the cast there is no warning) truncated a pointer.
Worse, it may appear to work depending on the malloc implementation and how
much memory you allocate.

We have to fix these kinds of bugs in OpenBSD a lot; please help by typing
less and letting the compiler warn you about silly mistakes :-)

And yes, C++ fucked this up for C. I'll leave it to Linus to say something
nice about that..

~~~
rjek
It's a C99 feature, and clang is the only compiler I know of that produces
the diagnostic (and only in later versions).

BTW, I wrote this talk.

~~~
gjulianm
I didn't know that. To answer my previous question: clang doesn't emit a
warning when passing pointers of any size to the function foo.

And by the way, nice talk, it's great learning these dark secrets of C.

~~~
rjek
The compiler can't know the size at compile time with a naked pointer like it
can with an array. [static 1] is handy to say the argument must not be NULL
(i.e. that it isn't optional), however.

~~~
gjulianm
Yes, but I expected some kind of "You're passing a pointer as an array of size
n. I can't check the size, but you should make sure you've checked it".

~~~
ori_b
I can't imagine that would be anything but noise. 99% of my function calls
have pointers passed through, not arrays.

------
rwmj
I remember when the Pentium F00F bug was reported, I tested it by doing:

    
    
        char main[] = { 0xf0, 0x0f, 0xc7, 0xc8, 0xc3 };
    

(and yes, my machine -- a Pentium MMX -- hung solid and I was rather shocked!)

~~~
tobinfricke
whoah - I find that construction astonishing.

My gcc compiles it with only this warning:

    
    
      foo.c:2:6: warning: ‘main’ is usually a function [-Wmain]
    

hah!

~~~
gsg
It won't execute though:

    
    
        [23] .got.plt          PROGBITS        0804954c 00054c 000014 04  WA  0   0  4
        [24] .data             PROGBITS        08049560 000560 000010 00  WA  0   0  4  <---
        [25] .bss              NOBITS          08049570 000570 000008 00  WA  0   0  4
    
        66: 0804840a     0 FUNC    GLOBAL HIDDEN   14 __i686.get_pc_thunk.bx
        67: 08049568     5 OBJECT  GLOBAL DEFAULT   24 main  <---
        68: 08048278     0 FUNC    GLOBAL DEFAULT   12 _init
    

The main symbol is a relocation in .data, not .text, which is as you would
expect given that declaration. You might be able to get around that by doing
something like

    
    
        unsigned char code[] = { 0xf0, 0x0f, 0xc7, 0xc8, 0xc3 };
    
        int main(void)
        {
            ((void (*)())code)();
            return 0;
        }
    

But these days NX will usually ruin the fun.

~~~
taralx
Works if you add this line:

char main[] __attribute__((section(".text")));

(You get a warning from the assembler.)

~~~
gsg
So it does.

I didn't know gcc attributes included that kind of thing. I've really gotta
dig through the manual some time.

------
kps

        int x = 'FOO!';
    

will _not_ make demons fly out of your nose: it is not _undefined behaviour_.
It is guaranteed to produce a value; the specific value is _implementation
defined_ (that is, one that the compiler vendor has decided and documented),
but it is an integer value, not a demon value.

I'm sure, though, that someone sooner or later will be bitten by code like

    
    
        int x = 'é';
    

which is equally implementation-defined.

~~~
apaprocki
On big-endian machines, the order of characters is preserved. Because of that,
I've noticed this trick used in old network/protocol code where the intent was
to use integer values in binary headers while maintaining easy readability if
you are looking at hex/ASCII side-by-side, e.g.,

int x = 'RIFF';

.. if you were packing a WAVE file header.

~~~
kragen
The potential big advantage of this construct is that you can use it in
switch() statements, which you can't do with strings. But it's probably better
to use enum values, because the implementation-definedness spoils the other
potential great advantage of this technique (that you can serialize these
multicharacter literals nicely; consider SMTP implementations looking for
'HELO', 'MAIL', etc.).

------
gatherknwldg
C's corners aren't very dark. It's a small enough language that it's easy to
explore them. Things can get ugly when programmers decide to abuse the
preprocessor because the language isn't complicated enough for them, but
thankfully most C programmers have a distaste for such shenanigans. C++ is
down the hall and around the corner, if you want darkness.

------
dysoco
Someone should do "Dark corners of C++".

Nevermind, it would take more than the Lord of the Rings trilogy.

~~~
popee
LOL

~~~
popee
ROFLCOPTER

------
wtracy
Very cool.

I remember hearing that the disallowance of pointer aliasing was the main
reason a Fortran compiler could produce code that outperformed code from a C
compiler: it allows the compiler to perform a whole class of optimizations.

It would appear that the restrict keyword lets C programs regain that class of
compiler optimizations.

~~~
gjulianm
It's pretty well explained here

<http://en.wikipedia.org/wiki/Restrict>

------
copx
It is telling that these "dark corners" all seem harmless compared to what you
can find in certain other languages which shall not be named.

~~~
dualogy
> which shall not be named

So... not that _telling_ then.

~~~
PommeDeTerre
It's obviously JavaScript and PHP that are being referred to.

Of the C "dark corners" that are problematic, it'd be extremely rare to run
into them in most real-world code. You'd have to intentionally go out of your
way to write code that will trigger them, and this code often looks obviously
suspicious.

It's very much the opposite with JavaScript and PHP. A world of pain and
danger opens up the moment you do something as simple as an equality
comparison. The problems that can and will arise are well documented, so I
won't repeat them here, but it's a much worse (and unavoidable) situation than
when compared to C, C++, Java, C#, Python, Ruby or other mainstream languages.

~~~
popee
Agreed. Every time I get back to C it's like coming home. But first you must
study it hard to make it your home. On the other hand, JavaScript (the language
I'm using at my current job) is like living 'Groundhog Day' with every day
finishing in suicide. Well, I'm not saying JavaScript is a bad language; there
are some really great things about it, but it's designed with a loaded gun
held to your head the whole time :-) I'd also put C++ on the list of dangerous
languages, because it tries to fix C's problems while introducing OOP (and, in
the newest standard, lambdas and more), so now you have a huge base for new
and exciting ways to kill yourself. It's not even funny that simple languages
like Lua are gaining users every day.

------
grn
It's also worth pointing out that buffers passed to strcpy, memcpy, etc. _must
not_ overlap. Otherwise it results in _undefined behavior_.

~~~
lttlrck
That's stdlib though, not the language.

~~~
caf
The standard library is part of the language - all hosted implementations must
provide it.

This allows, for example, compilers to replace a `memcpy()` call that has a
constant size argument with direct loads/stores.

------
optymizer
I wrote a compiler for a subset of C, and I'm happily aware of all of these
'dark corners'. That's why I would always recommend writing a compiler for a
language if you _really_ want to understand the language.

------
jimmaswell
"What would be the smallest C program that will compile and link?"

The author got this wrong; that would be an empty file, which is what once won
the IOCCC for smallest self-replicating program.

~~~
angersock
That is one of the finest examples of being technically correct--the best kind
of correct. Spec is fulfilled but everybody knows the answer is useless.

------
halayli
Just FYI: if you know C and want to take it to the next level, Expert C
Programming: Deep Secrets is one of the best books out there.

[http://www.amazon.com/Expert-Programming-Peter-van-Linden/dp/0131774298](http://www.amazon.com/Expert-Programming-Peter-van-Linden/dp/0131774298)

------
shurcooL
Reading <http://golang.org/ref/spec> is such joy after having lived through
C/C++ for the last many years. I still love C++, but if I can get away without
having to use it, then I'm all for it.

------
shaneeb
Just like everything else, programming languages have evolved: from assembly
to Fortran to C to Java/C# (just saying; no exact sequence implied). I don't
think the languages we have now, far from perfect as they may be, would have
been possible without the "dark corners" of the older languages. We learned
from them and made better languages. So I say: show respect to the old
languages, learn from them, and keep improving languages/tools... Everybody is
happy.

~~~
pilgrim689
We haven't all learned... <https://www.destroyallsoftware.com/talks/wat> :P

~~~
shaneeb
Yup. People have talked about the issues in C for years but few talk about
"modern" languages. Ruby anyone?

------
pjungwir
There is a wonderful book about the trickier parts of C called Deep C Secrets
(with a fish on the cover :-). It is a great second or third book after K&R.

~~~
fabriceleal
Very amusing book.

------
nonpme
Some slides show how a particular function is expressed in assembly. I know
nothing about that language (I mean assembly; I know C and even like it), and
when I tried to find out how to learn it I ran into some problems. I don't
know where or how to start. Can someone point me to good resources or starting
points? (I prefer Linux to Windows, if that matters.)

(Sorry for the offtopic.)

~~~
kps
You shouldn't read too much into the assembly output from any particular
compiler (except maybe dmr's for the PDP-11), but the de facto standard
command line option "-S" will cause a *nix compiler to generate a ".s" file
containing assembly rather than a binary.

~~~
nonpme
Wow, I didn't know about the -S option, thanks for the tip! I know it may not
be optimal assembly code, but it's still interesting code to read.

------
arihant
I am almost certain the pointer aliasing thing could be fixed by providing the
proper optimization flag at compile time. I remember back in introductory
systems classes, we saw mind-boggling optimizations from GCC at O3 - the
pointer example is so trivial it must be optimized by the compiler!

~~~
brigade
It isn't; there are very few flags that allow the compiler to perform
optimizations not allowed by the language standard. Aliasing is not one of
them for any compiler I know of. In fact, there are usually flags to go the
opposite direction and assume all pointers alias because so many people write
code that violates the standard (and results in GCC optimizing the code to
behave differently than the author intended.)

------
graycat
For "dark corners of C", when I was writing C code I had several serious
concerns. Below I list eight such in roughly descending order on
'seriousness':

First, what are malloc() and free() doing? That is, what are the details, all
the details and exactly how they work?

It was easy enough to read K&R, see how malloc() and free() were supposed to
be used, and to use them, but even if they worked perfectly I was unsure of
the correctness of my code, especially in challenging situations, expected
problems with 'memory management' very difficult to debug, and wanted a lot of
help on memory management. I would have written my own 'help' for memory
management if I had known what C's memory management was actually doing.

'Help' for memory management? Sure: Put in a lot of checking and be able to
get out a report on what was allocated, when, by what part of the code, maybe
keep reference counters, etc. to provide some checks to detect problems and
some hints to help in debugging.

That I didn't know the details was a bummer.

It was irritating that K&R, etc. kept saying that malloc() allocated space in
the 'heap' without saying just what they meant by a 'heap', which I doubt was
a 'heap' as in heap sort.

Second, the 'stack' and 'stack overflow' were always looming as a threat of
disaster, difficult to see coming, and to be protected against only by mud
wrestling with obscure commands to the linkage editor or whatever. So, I had
no way to estimate stack size when writing code or to track it during
execution.

Third, doing data conversions with a 'cast' commonly sent me into outrage
orbiting Jupiter.

Why? Data conversion is very important, but a 'cast' never meant anything. K&R
just kept saying 'cast' as if they were saying something meaningful, but they
never were. In the end 'cast' was just telling the type checking of the
compiler that, "Yes, I know, I'm asking for a type conversion, so get me a
special dispensation from the type checking police.".

What was missing were the details, for each case, on just how the conversion
would be done. In strong contrast, when I was working with PL/I, the
documentation went to great lengths to be clear on the details of conversion
for each case of conversion. I knew when I was doing a conversion and didn't
need the 'discipline' of type checking in the compiler to make me aware of
where I was doing a conversion.

Why did I want to know the details of how the conversions were done? So that I
could 'desk check' my code and be more sure that some 'boundary case' in the
middle of the night two years in the future wouldn't end up with a divide by
zero, a square root of a negative number, or some such.

So, too often I wrote some test code to be clear on just what some of the
conversions actually did.

Fourth, that the strings were terminated by the character null usually sent me
into outrage and orbit around Pluto. Actually I saw that null terminated
strings were so hopeless as a good tool that I made sure I never counted on
the null character being there (except maybe when reading the command line).
So, I ended up manipulating strings without counting on the character null.

Why? Because commonly the data I was manipulating as strings could contain any
bytes at all, e.g., the data could be from graphics, audio, some of the
contents of main memory, machine language instructions, output of data
logging, say, sonar data recorded on a submarine at sea, etc. And, no matter
what the data was, no way did I want the string manipulation software to get a
tummy ache just from finding a null.

Fifth, knowing so little about the details of memory management, the stack,
and exceptional condition handling, I was very reluctant to consider trying to
make threading work.

Sixth, arrays were a constant frustration. The worst part was that I could
write a subroutine to, say, invert a 10 x 10 matrix but then couldn't use it to
invert a 20 x 20 matrix. Why? Because inside the subroutine, the 'extents' of
the dimensions of the matrix had to be given as just integer constants and,
thus, could not be discovered by the subroutine after it was called. So,
basically in the subroutine I had to do my own array indexing arithmetic
starting with data on the size of the matrix passed via the argument list.
Writing my own code for the array indexing was likely significantly slower
during execution than in, say, Fortran or PL/I, where the compiler writer
knows when they are doing array indexing and can take advantage of that fact.

So, yes, no doubt as tens of thousands of other C programmers, I wrote a
collection of matrix manipulation routines, and for each matrix used a C
struct to carry the data describing the matrix that PL/I carried in what the
IBM PL/I execution logic manual called a 'dope vector'. The difference was,
both PL/I and C programmers pass dope vectors, but the C programmers have to
work out the dope vector logic for themselves. With a well written compiler,
the approach of PL/I or Fortran should be faster.

It did occur to me that maybe other similar uses of the C struct 'data type'
were the inspiration for Stroustrup's C++. For more, originally C++ was just a
preprocessor to C, and at that time and place, Bell Labs, with Ratfor,
preprocessors were popular. Actually writing a compiler would have permitted a
nicer language.

Seventh, PL/I was in really good shape some years before C was started and had
subsets that were much better than C and not much more difficult to compile,
etc. E.g., PL/I arrays and structures are really nice, much better than C, and
mostly are surprisingly easy to implement and efficient at execution. Indeed,
PL/I structures are so nice that they are in practice nearly as powerful as
objects and often easier and more intuitive to use. What PL/I did with scope
of names is also super nice to have and would have helped C a lot.

Eighth, the syntax of C, especially for pointers, was 'idiosyncratic' and
obscure. The semantics in PL/I were more powerful, but the syntax was much
easier to read and write. There is no good excuse for the obscure parts of C
syntax.

For a software 'platform' for my startup, I selected Windows instead of some
flavor of Unix. There I wanted to build on the 'common language runtime' (CLR)
and the .NET Framework. So, for languages, I could select from C#, Visual
Basic .NET, F#, etc.

I selected Visual Basic .NET and generally have been pleased with it. The
syntax and memory management are very nice; .NET is enormous; some of what is
there, e.g., for 'reflection', class instance serialization, and some of what
ASP.NET does with Visual Basic .NET, is amazing. In places Visual Basic
borrows too much from C and would have done better borrowing from PL/I.

~~~
to3m
I think C might make more sense if you are more familiar with assembly
language. I learned C because real-mode x86 looked so fantastically ugly
(looking back, a rare instance of youthful good taste). 0-terminated strings
and stack allocation were quite familiar to me (though I never used stack
allocation myself because it made the disassembly hard to read) and the
overall model made perfect sense.

~~~
graycat
"I think C might make more sense if you are more familiar with assembly
language."

I've written some assembler in the machine language of at least three
different processors. On one machine I was surprised that my assembler code
ran, whatever it was, 5-8 times faster than Fortran. Why? Because I made
better use of the registers. Of course, that Fortran compiler was not very
'smart', and smarter compilers are quite good at 'optimizing' register usage.
I will write some assembler again if I need it, e.g., for

R(n+1) = (A*R(n) + B) mod C

where A = 5^15, B = 1, and C = 2^47. Why that calculation? For random number
generation. Why in assembler? Because I basically want to take two 64-bit
integers, accumulate the 128-bit product in two registers, then divide the
contents of those two registers by a 64-bit integer and keep the 64-bit
remainder. Due to the explicit use of registers, you usually need to do this
in assembler.

But at one point I read a comment: For significantly long pieces of code, the
code from a good compiler tends to be faster than the code from hand coded
assembler. The explanation went: For longer pieces of code, good compilers do
good things for reducing execution time that are mostly too difficult to
program by hand which means that the assembler code tends to be using some
inefficient techniques.

------
hamidr
That's fun, because I remember this "x+++y;" as a question in one of my
university entrance exams!

~~~
rwmj
That sounds like a university to avoid.

~~~
norswap
It's necessarily a bad question, it makes you think about how parsers work.

But for an entrance exam, it's slightly hardcore :)

~~~
tsahyt
To get it right you have to be able to pick it apart according to specified
rules. Being able to work with formally specified rules is an integral part in
the study of computer science (also, other STEM majors). I'd say it's a
perfectly valid question, as long as someone points out that one should stay
away as far as possible from this sort of code.

~~~
kstenerud
The question assumes that you KNOW the rule, which is highly unlikely unless
you've either been bitten by it or have read through the spec enough times to
catch it.

Unless you know the actual parsing rules, there's no way to know if a real
parser would be greedy or not (or perhaps it might try to be clever?). This is
nothing more than a trivia question, which does not test aptitude or
intelligence.

~~~
robotresearcher
It does test knowledge. Nothing wrong with knowledge.

I expect they asked some other questions too.

~~~
kstenerud
It tests esoteric (aka borderline useless) knowledge. There's a big difference
between that and, say, knowing how to use something actually useful like
double pointers.

I had no idea how the C parsing algorithm worked for +++ et al, and I'm an
expert C programmer. Then again, I'd also never use such ridiculous constructs
in production code.

~~~
norswap
It's not that esoteric. You need to see it in a broader scope than simply
something the C standard specifies. It's about how parsing is traditionally
done: by splitting the input into tokens, using longest-match when multiple
tokens fit the beginning of the input. Then you can use the tokens to do
things.

Even if you don't know much about the subject, you can still reason about it
in an interesting way. Seeing that it is ambiguous is already a good
observation. You can then propose ways to resolve the ambiguity and touch
(willingly or not) on the topics of operator precedence, associativity, and
greedy matching.

Those topics are not only relevant to parsing, either; for instance,
associativity is an important concept for list operations such as folding (a
right fold is different from a left fold).

------
homeomorphic
A very interesting read!

By the way, shouldn't the right hand side text on slide 7 (the final part of
slide 7) talk about the pointers z and x, instead of the values pointed at?
(Aside: How do I write "asterisk x" on HN without getting an italicized x?)

------
NelsonMinar
int x = 'FOO!';

Took me a while to understand this; single quotes define single characters,
and for some reason C decided to allow multi-character character constants but
leave their value implementation-defined. Discussion:
[http://zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html](http://zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html)

------
JoeAltmaier
Fun!

Lots more ambiguities in C++, but it's a challenge to find them in C. My
favorite: [] is just '+'

~~~
popee
[] really is a very well-known trick question. Well, it was taught at my
university before they switched to another language. I think they don't even
teach C there anymore. Shame.

------
colanderman
Is there a way to disable the fade-in? It makes scanning impossible.

~~~
rjek
View it in the editor instead.

------
lysium
What's the point in 'count up vs. count down'?

~~~
rcfox
Integer subtraction with a result of 0 sets the same status bit as comparing
one value to another, so you can get away without the compare instruction when
counting down. It might not sound like a lot, but it can be meaningful in a
tight loop.

I don't know why the author chose to change the syntactic structure of the
loop though, since it hides the point.

You have to be careful when counting down though. If you're accessing an
array, you might be tempted to do this:

    
    
        for(size_t i = bar_len - 1; i >= 0; --i) {
            foo(bar[i]);
        }
    

It looks innocent enough, but size_t is unsigned, so i >= 0 will always be
true. (Of course, using -Wall and -Wextra will warn you about this.)

------
simarpreet007
Ah this just made my day! :)

------
dakimov
I don't get it. What will happen if you violate the language semantics? They
call it 'dark corners'? If you hit your head against a wall, it will hurt. Is
it a 'dark corner' of life?

Overall, the presentation is very weak, like something from a fresh graduate.

~~~
Peaker
Did you notice that not all of these corners violated any clause of the
standard?

I've got quite a bit of experience with C, and I haven't heard of the "static"
array size feature before, which seems extremely useful.

~~~
popee
No comma operator on the slides. What a pity :-)

