Some dark corners of C (docs.google.com)
352 points by fayimora on Mar 9, 2013 | 172 comments



Most of these "dark corners" have been in C for at least 25 years and have been repeated over and over for at least 20.

My main take-away from this is that Google Drive seems like a nice way to put presentations online :-)


> My main take-away from this is that Google Drive seems like a nice way to put presentations online :-)

Don't. I've been trying to access the presentation for 10 minutes and it won't allow me:

    Wow, this file is really popular! Some tools might be unavailable until the crowd clears.
and then I get redirected to https://support.google.com/accounts/bin/answer.py?hl=en&... (which is stupid, because there's nothing cached/cookied for google. In fact, I'm in Firefox's "Private Browsing")


The right way to distribute slides is with the "published" presentation link, instead of a link to the editor.


For me, it works, but it makes each slide an entry in my browser's history, which I hate.


You can link to certain slides this way. I would consider this a good thing.


In Chrome, it will still show the last history item with a different host at the bottom. So maybe use a better browser ;)


    #define struct union
    #define else
That's evil. I'll have to do it in someone's code some day just to have some fun.

But apart from that, it's a really nice compilation. I didn't know about the compile-time checks of array sizes, but I have a question. What if I pass to a function declared

    int foo(int x[static 10])
this pointer

    int* x = (int*) calloc(20, sizeof(int));
Does the compiler skip the check? Does it give me a warning?

EDIT: Funnily enough, on my Mac it doesn't give any warning, neither for pointers nor for undersized arrays (i.e., passing an int w[5] to foo doesn't warn). And I compiled with -std=c99 -pedantic -Wall.
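For reference, a minimal sketch of the undersized-array case (whether you get a warning apparently depends on the compiler; newer clang versions seem to diagnose it):

    int foo(int x[static 10]) { return x[0]; }

    int main(void)
    {
        int w[5] = {0};
        return foo(w);   /* newer clang: warns that the array argument is too small */
    }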


Last time this came up on HN, it was thought to be a clang feature.

Edit: While we're talking about dark corners, please stop casting the result of functions that return void *. If your code lacks a declaration of the function, the compiler will assume pre-ANSI C semantics and generate code that treats the return value as an int.

On machines where a pointer does not fit in an int (basically all 64-bit machines), you have just silently truncated a pointer (due to the cast there is no warning). Worse, it may appear to work depending on the malloc implementation and how much memory you allocate.
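A minimal sketch of that failure mode (assuming an LP64 target and a compiler that still accepts implicit function declarations, as most do with only a warning):

    /* <stdlib.h> deliberately omitted, so malloc has no declaration */
    #include <string.h>

    int main(void)
    {
        /* With no prototype the compiler assumes malloc returns int; the cast
           silences the pointer/integer mismatch (older compilers stay completely
           quiet), so the upper 32 bits of the real pointer are thrown away. */
        char *p = (char *) malloc(64);
        strcpy(p, "hello");   /* may crash, or may appear to work, depending on the address */
        return 0;
    }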

We have to fix these kinds of bugs on OpenBSD a lot, please help by typing less and let the compiler warn you about silly mistakes :-)

And yes, C++ fucked this up for C. I'll leave it to Linus to say something nice about that..


> If your code lacks the declaration of the function, the compiler will assume pre ANSI-C semantics and generate code returning an int.

Better: please help by compiling your C code with -Wimplicit-function-declaration (included in -Wall), and fixing all the problems it reports. Then you won't have to worry about this problem, or a bunch of other problems.


All-too-common response: "Oh, but they're just warnings mumble compiles and runs 'cleanly' mumble..."


    -Wall -Wextra -Werror


It's a C99 feature, and clang's the only compiler I know of that produces the diagnostic (and only in later versions)

btw; I wrote this talk.


I didn't know that. To answer my previous question: clang doesn't emit a warning when passing a plain pointer to the function foo.

And by the way, nice talk, it's great learning these dark secrets of C.


The compiler can't know the size at compile time with a naked pointer the way it can with an array. [static 1] is handy, however, for saying the argument must not be NULL, as it might be if it were optional.


Yes, but I expected some kind of "You're passing a pointer as an array of size n. I can't check the size, but you should make sure you've checked it".


I can't imagine that would be anything but noise. 99% of my function calls have pointers passed through, not arrays.


But a pointer is not the same thing as an array: a pointer does not carry the size of the allocated space, which an array does in the same scope.


Good, because I was missing a slide:

= Bitfields =

Not even once.

;)


Everybody knows about bitfields :)


Bitfields have some crazy fun dark corners, though. For instance, which values can this bitfield hold:

    int b:1;
Signed two's-complement n-bit integers can hold values from -2^(n-1) to 2^(n-1)-1, so in theory a one-bit signed bitfield can hold 0 and -1, but many compilers get that one wrong. Always declare one-bit bitfields as unsigned. (The Sparse static analyzer will warn you about this.)
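A tiny illustration of the pitfall; the stored value is implementation-defined, and the -1 here is just what common two's-complement targets do:

    #include <stdio.h>

    struct s {
        signed int b:1;   /* a 1-bit signed field can only hold 0 and -1 */
    };

    int main(void)
    {
        struct s x;
        x.b = 1;                /* 1 doesn't fit; the result is implementation-defined */
        printf("%d\n", x.b);    /* commonly prints -1 */
        return 0;
    }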


At my university they always told me to cast the return value of void * functions; I thought it was just to silence a warning along the lines of "implicit conversion of void* to int*" or similar.

Thanks, I will take that into account.


That is true in C++, but not in C, where void* needs no cast.

People tended to confuse C++ with C more in the past, as they had not diverged as much as they have now.


An annoying thing about C++. Anyone know the history/rationale for this change?


It is supposed to give you more checking by disabling the implicit conversion from void* to any other pointer type. This makes sense in C++, since casting a pointer to a class can trigger an address adjustment if the class of the instance pointed to uses multiple inheritance (and maybe in other cases?). No such adjustment can happen if the source type is void*, because then you don't know what the source type really is.


As far as I know, this is exactly the rationale, actually. Say you have:

  struct a { int a; };
  struct b { int b; };
  struct c : a, b { };

  c myc;
  c* p1 = &myc;
  b* pb1 = p1; // correctly points to myc's base b, which is at an offset

  void* p2 = &myc;
  b* pb2 = p2; // would not point at the b subobject; C++ makes this a compile error
If I need void*-casting code to compile on both C and C++ compilers, I use a macro like this:

  #ifdef __cplusplus
  #define STATIC_CAST(T, EXPR) static_cast<T>(EXPR)
  #else
  #define STATIC_CAST(T, EXPR) (EXPR)
  #endif
This leaves the conversion to be implicit in C, and uses the stricter static_cast in C++ to catch certain types of likely-unsafe conversions, such as the aforementioned cast from int to pointer.
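For instance, a hypothetical call site that compiles as both C and C++ (assuming the macro above and <stdlib.h>):

  #include <stdlib.h>

  int main(void)
  {
      int *p = STATIC_CAST(int *, malloc(10 * sizeof *p));
      free(p);
      return 0;
  }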


> While we're talking about dark corners, please stop casting functions that return void *

The problem is, if I want my C code to compile with MSVC, it has to compile as C++ - and even if I abhor Windows for development myself, a lot of developers are using MSVC.

I just wish Microsoft would update their C compiler, at least to C90. But then I suppose the standard has only been around for 23 years, and nobody really uses C anyway.


> The problem is, if I want my C code to compile with MSVC, it has to compile as C++

As someone who has compiled a lot of .c files that do not cast void pointers with cl.exe, I'd say no, this is not true... Maybe what you're trying to say is that your code relies on C99 features that are also present in C++? IMO if that's what's holding you back it's much easier to just write C89, maybe with the occasional ifdef, than to suddenly write some crummy C/C++ hybrid. That or just use mingw. (Unless you're doing SEH. I'm not aware of a good way to do SEH with mingw.)


You're suggesting I write C89 in 2013 just to support MSVC? C89 in which you can't even mix data and code?

I think casting void pointers is very much the lesser of two evils. And of course I can use MinGW, but the majority of Windows developers still insist on using MSVC.


C++ is not a superset of C[0]. If you want to use MSVC and newer features, then you're not writing C anymore; you're just writing an awkward C++ program.

[0] http://en.wikipedia.org/wiki/Compatibility_of_C_and_C%2B%2B


Nope, I'm definitely writing C99, but it just so happens that MSVC is happy enough pretending it's C++.

Definitely writing C99 in that I compile the exact same source files with -std=c99 -Wall -pedantic when using a real C compiler.


> C89 in which you can't even mix data and code?

I don't think this is as big a deal as you're making it out to be. Maybe I'm just used to it, but this is one area where I consider the C89 way to be better on stylistic grounds. (I wrote about this on a stack exchange site some time ago: http://programmers.stackexchange.com/questions/75039/why-dec...)

Realistically I don't think C99 is all that revolutionary, and the fact is there are plenty of people writing C89 in 2013. Let's look at some things C99 adds over C89:

* Mixed declarations and code

IMO not a huge deal, for reasons I give above.

* VLAs

A nice feature, but people have been using malloc for similar purposes for ages, so it's hard to say it's any better than "nice to have". I'd also argue that something that leads me to potentially consume arbitrary amounts of stack space based on runtime decisions is probably not universally good, and I'm pretty cautious about using VLAs even when given the choice.

* Last field of a struct can be a variable-sized array, e.g. struct foo { int num_data; int data[]; };

I like this pattern and have used it a lot, including with VC++ by having a zero-sized array at the end of the struct. This is one place where I use an ifdef to bridge the gap, because I recently noticed LLVM started doing weird things here if you don't do it the "legit" C99 way (see the allocation sketch after this list).

* Cosmetic issues like // comments

VC++ already supports this in C89 files as a non-portable extension.

* Portable typedefs like uint32_t, etc.

These are also pretty handy. But if you're building with VC++ you're probably using the non-portable Microsoft ones anyway. (I think VC++ 2010 added the C99 versions too.)

* bool type

Again, every platform and every reasonably sized library works around this with a typedef and some macros.

* Slightly different behaviors for some libc functions

I notice this most for snprintf, where the return value means something different [and better] in C99 than in the pre-C99 version [which Microsoft follows]. Again there are non-portable solutions for this that you can use ifdefs for.

* Named initializers for struct members

Cool feature. I'd hardly say I couldn't live without it, though. The only place I've really seen it used extensively is in the Linux kernel.

* Stuff that no one uses, such as "static" as used in these slides, or complex numbers, or whatever.

Not a problem since these don't see wide use.
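As a small sketch of the flexible-array-member pattern from the list above (the struct and helper are just placeholders, not anything from VC++ or LLVM):

    #include <stddef.h>
    #include <stdlib.h>

    struct foo {
        int num_data;
        int data[];             /* C99 flexible array member */
    };

    struct foo *make_foo(int n)
    {
        /* one allocation holds the header plus n trailing ints */
        struct foo *f = malloc(offsetof(struct foo, data) + n * sizeof(int));
        if (f)
            f->num_data = n;
        return f;
    }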


Mixed declarations and code are absolutely essential in a modern C programming style. It's all very well to declare things like "your functions should be shorter" or "you shouldn't use temporary variables in macros", but it's just not practical in real world code. In fact, it's so impractical they amended the spec 14 years ago.

Then there are things like the portable integer types - which you decided aren't necessary because everyone is writing Win32 code (???), the designated initializers which I use every day to make my struct initializing code safe from changes in the struct itself, for loops, a consistent bool type ... heck, I could go on all day if you hadn't already dismissed the rest of the standard as "stuff that no one uses".

I really don't mean to come across as aggressive, but this kind of ignorant and dismissive attitude is so rampant in the tech scene at the moment. And it's getting us nowhere.


I don't mean to make you feel like you're inching towards the aggressive; I just feel like the distance between C89 and C99 is not a very big one. Even in C99 I tend to make some perhaps conservative style choices, and in this view the C99 stuff really does seem like "niceties" rather than essentials. The difference between C and C++ feels much bigger to me.

And then after all, you initially explained the need for this in terms of Windows-specific code (or at least that's how I read it), so that's the angle I took...


It's better to

    typedef void (*fptr_t)();

and cast to fptr_t instead of void *. But the bigger problem is how to force users to cast it back to the proper prototype, because casting it back to something else and calling it gives UB.


"On machines where pointers do not fit ints (basically all 64bit machines), you just silently (due to the cast there is no warning) truncated a pointer. Worse, it may work depending on the malloc implementation and how much memory you allocate."

Wow! Ugly! Scary! Another good reason to know in fine detail just what a cast does. At one point my Visual Basic .NET code actually calls some old C code, and in time I will need to convert it to 64-bit addressing. So I will keep in mind that with 64 bits I have to be especially careful about pointers and C.


Well, the bug only appears if you didn't include the header file declaring malloc (or any other void*-returning function). If you did, there's no problem.

It's a pretty easy thing to leave out, missing a required header, but if you always compile under -Wall it'll catch this and many other problems as well.


Thanks for the details, but I remain concerned about taking a language with some clearly tricky aspects, one so closely identified with 16- and 32-bit computing, into 64-bit computing.


It seems to me that the problem is calling an undeclared function (when there are flags to catch this), not the cast itself.


It's impossible to have a static decision procedure about dynamic properties of programs, such as the size of dynamically allocated memory areas (Rice's theorem). So, it is necessary to either include false positives (correct programs rejected) or false negatives (incorrect programs accepted).

Sound static analyzers fall into the first case, but require a lot of work to become precise enough to be usable (i.e., to reduce the number of false alarms). Compilers fall into the second case, in the sense that they don't have to honor such a clause. And in the C99 standard it's actually a "shall" (it just couldn't honor a "must" in that case):

"If the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the size expression."


> It's impossible to have a static decision procedure about dynamic properties of programs, such as the size of dynamically allocated memory areas (Rice's theorem).

You cannot have a general procedure, but with the help of the programmer/user of the compiler, you can prove all kinds of things.


The purpose of

  int foo(int x[static 10])
is not to produce a warning - that's just a nice possible side-effect (and only in some cases).

The real purpose is to allow the compiler to optimise the compilation of the foo() function itself, under the assumption that x will always point to the first element of an array of at least 10 elements.
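A hedged sketch of the sort of thing this enables (hypothetical function, not from the slides): because the compiler may assume x points at no fewer than 10 valid ints, it can unroll or vectorize the loop and issue wide loads without worrying about faulting past the end of a smaller buffer.

    int sum10(const int x[static 10])
    {
        int sum = 0;
        for (int i = 0; i < 10; i++)
            sum += x[i];    /* the caller guarantees x[0..9] are accessible */
        return sum;
    }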


What possible optimization would a compiler be able to do, given that arrays are only ever implicit in compiled C?


If it knows the address is valid, it can use a speculative load. If it knows there are enough entries, it can use a wide load. Without that knowledge, such instructions could trigger SEGV due to an invalid pointer or the wide load spilling over a page boundary.


At least or exactly 10 elements?


At least


Huh, I didn't know that. I just checked the standard and you appear to be correct, so, just for the record:

"... if the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the size expression."


About evil preprocessor constructs, there is a nice collection in the comments of this John Regehr post: http://blog.regehr.org/archives/574


Isn't re-#defining language keywords disallowed by the C standard? It's just that most compilers don't complain about it.


The C++ standard has a clause prohibiting macros that re-#define keywords if the translation unit also #includes a standard header. I guess this is to clarify whether the standard-library functionality is expected to still work even in the face of such a #define, by specifying that implementors don't need to worry about that situation.

I don't believe C has any such restrictions, though.


The preprocessor is independent, and a macro is expanded before the C compiler sees it.


The preprocessor is a part of the standardized translation process, and if the standard says that certain things are not allowed in a well-formed C program, it does not matter at which stage the compiler is.


I'd be glad if you could pinpoint the location in the C standard where this constraint is given.


C99 Section 7.1.2.4 (on standard headers): "The program shall not have any macros with names lexically identical to keywords currently defined prior to the inclusion."

You can redefine keywords in your code, but they must not be defined when including standard headers (for obvious reasons).


Could someone explain to us newbies what the #define code does?...


It replaces the first word with what comes after. For example, #define foo bar is roughly equivalent to s/foo/bar/g, so #define else would just remove all else keywords. I'll let you guess what it does to a program.
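A tiny sketch of the effect, with the #define placed after the standard header (see the standards discussion elsewhere in this thread):

    #include <stdio.h>

    #define else            /* every 'else' keyword now expands to nothing */

    int main(void)
    {
        int x = 5;
        if (x > 0)
            printf("positive\n");
        else                          /* vanishes after preprocessing, so the   */
            printf("not positive\n"); /* next statement runs unconditionally    */
        return 0;
    }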


Not to forget #define struct union, which will overlay all struct members on the same storage. Gives me chills even to think about it...


Can anyone think of a reason why all the examples are keywords? I would think if you were trying to cause some trouble, redefining standard functions would cause all kinds of chaos, e.g.

  #define memmove memcpy
  #define strncmp memcmp
  #define rand fork
  #define free(x)
  #define strerror(n) strerror((n)+1)
  #define memset(addr, byte, len) memset(addr, ~(byte), len)


The #define else one is scarier to me. #define struct union will almost certainly crash immediately at runtime, as soon as one member that's a pointer gets overwritten by an int or by a pointer to a different type. #define else just runs every else clause unconditionally: the code is likely still valid, it just does something very different from what the author intended.

Both of them would be a real pain to debug.


Both would be easy to notice in a debugger. Finding the cause, as you imply, would be tricky unless you remember to check the preprocessed output.


It's a great memory saver.


I remember when the Pentium F00F bug was reported, I tested it by doing:

    char main[] = { 0xf0, 0x0f, 0xc7, 0xc8, 0xc3 };
(and yes, my machine -- a Pentium MMX -- hung solid and I was rather shocked!)


whoah - I find that construction astonishing.

My gcc compiles it with only this warning:

  foo.c:2:6: warning: ‘main’ is usually a function [-Wmain]
hah!


It won't execute though:

    [23] .got.plt          PROGBITS        0804954c 00054c 000014 04  WA  0   0  4
    [24] .data             PROGBITS        08049560 000560 000010 00  WA  0   0  4  <---
    [25] .bss              NOBITS          08049570 000570 000008 00  WA  0   0  4

    66: 0804840a     0 FUNC    GLOBAL HIDDEN   14 __i686.get_pc_thunk.bx
    67: 08049568     5 OBJECT  GLOBAL DEFAULT   24 main  <---
    68: 08048278     0 FUNC    GLOBAL DEFAULT   12 _init
The main symbol lands in .data, not .text, which is what you would expect given that declaration. You might be able to get around that by doing something like

    unsigned char code[] = { 0xf0, 0x0f, 0xc7, 0xc8, 0xc3 };

    int main(void)
    {
        ((void (*)())code)();
        return 0;
    }
But these days NX will usually ruin the fun.


Works if you add this line:

char main[] __attribute__((section(".text")));

(You get a warning from the assembler.)


So it does.

I didn't know gcc attributes included that kind of thing. I've really gotta dig through the manual some time.


Historically, http://www.ioccc.org/1984/mullender.c used this technique in the Obfuscated C Code Contest in 1984 (hint at http://www.ioccc.org/1984/mullender.hint).


main is probably the only symbol this works with; data is generally put into non-executable sections/pages.


Others will probably be possible, albeit compiler-specific. The IBM xlc compiler / linker chooses to implement C static initializers by simply prefixing them with __sinit_, which tells the linker to automatically glue a call to it into init before calling main. I haven't tried this specific trick in combination with that, but if I had to make a bet it would work exactly the same way.


mainisusuallyafunction.blogspot.com


    int x = 'FOO!';
will not make demons fly out of your nose: it is not undefined behaviour. It is guaranteed to produce a value; the specific value is implementation defined (that is, one that the compiler vendor has decided and documented), but it is an integer value, not a demon value.

I'm sure, though, that someone sooner or later will be bitten by code like

    int x = 'é';
which is equally implementation-defined.


On big-endian machines, the order of characters is preserved. Because of that, I've noticed this trick used in old network/protocol code where the intent was to use integer values in binary headers while maintaining easy readability if you look at hex/ASCII side by side, e.g.,

int x = 'RIFF';

.. if you were packing a WAVE file header.


The big potential advantage of this construct is that you can use it in switch statements, which you can't do with strings. But it's probably better to use enum values, because the implementation-definedness undermines the other potentially great advantage of this technique (that you can serialize these multicharacter literals nicely; consider SMTP implementations looking for 'HELO', 'MAIL', etc.).
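For example, a sketch of that SMTP-style dispatch (the integer packing of 'HELO' is implementation-defined, so this is only as portable as the compilers you have actually checked):

    #include <stdio.h>

    static void dispatch(int cmd)
    {
        switch (cmd) {          /* can't switch on a string, but can on an int */
        case 'HELO': printf("hello\n");   break;
        case 'MAIL': printf("mail\n");    break;
        default:     printf("unknown\n"); break;
        }
    }

    int main(void)
    {
        dispatch('HELO');   /* most compilers warn about multi-character constants */
        return 0;
    }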


It was pretty common in classic Mac OS and PalmOS for writing OSType constants. I vaguely remember that for some time gcc did different things with this construct depending on whether the target OS was Mac OS/PalmOS or anything else.


    int x = 'A';
is also implementation-defined.


C's corners aren't very dark. It's a small enough language that it's easy to explore them. Things can get ugly when programmers decide to abuse the preprocessor because the language isn't complicated enough for them, but thankfully most C programmers have a distaste for such shenanigans. C++ is down the hall and around the corner, if you want darkness.


Someone should do "Dark corners of C++".

Never mind, it would take more than the Lord of the Rings trilogy.


You'd probably have better luck with a "Light corners of C++" talk.


LOL


ROFLCOPTER


Very cool.

I remember hearing that the prohibition on pointer aliasing was the main reason it was possible for a Fortran compiler to produce code that could outperform code from a C compiler: it allows the compiler to perform a whole class of optimizations.

It would appear that the restrict keyword lets C programs regain that class of compiler optimizations.
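A minimal sketch of the kind of function where restrict buys that back (hypothetical example): with restrict the compiler may assume dst and src never overlap, so it is free to vectorize and reorder the loads and stores much as a Fortran compiler would.

    #include <stddef.h>

    void scale(float *restrict dst, const float *restrict src, float k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = k * src[i];   /* no aliasing assumed between dst and src */
    }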


It's pretty well explained here

http://en.wikipedia.org/wiki/Restrict


It is telling that these "dark corners" all seem harmless compared to what you can find in certain other languages which shall not be named.


I'm not sure about that. Not using "restrict" properly can lead to extremely hard-to-diagnose errors which can only be resolved by reading the generated assembler. I've seen several C programs that use "restrict" everywhere as a magic "go faster" device without understanding what it means...

The automatic conversions in JavaScript and PHP seem pretty harmless by comparison.


>Not using "restrict" properly can lead to extremely hard-to-diagnose errors

But "restrict" is a low-level micro-optimization, those tend to be tricky. I don't think a sane C programmer would sprinkle that keyword all across the source base, because as you have pointed out it can cause hard-to-diagnose errors.

In contrast, the automatic conversions in JavaScript and PHP are an "always on" feature you cannot avoid.


> which shall not be named

So... not that telling then.


It's obviously JavaScript and PHP that are being referred to.

Of the C "dark corners" that are problematic, it'd be extremely rare to run into them in most real-world code. You'd have to intentionally go out of your way to write code that will trigger them, and this code often looks obviously suspicious.

It's very much the opposite with JavaScript and PHP. A world of pain and danger opens up the moment you do something as simple as an equality comparison. The problems that can and will arise are well documented, so I won't repeat them here, but it's a much worse (and unavoidable) situation than when compared to C, C++, Java, C#, Python, Ruby or other mainstream languages.


Agreed. Every time I get back to C it's like coming back home. But first you must study it hard to make it your home. On the other hand, JavaScript (the language I'm using at my current job) is like living 'Groundhog Day', with every day finishing in suicide. Well, I'm not saying JavaScript is a bad language; there are some really great things about it, but it's designed with a loaded gun held to your head all the time :-) I'd also put C++ on the list of dangerous languages, because it tries to fix C's problems while introducing OOP (and, in the newest standard, lambdas and more), so now you have a huge base for new and exciting ways to kill yourself. It's not even funny that simple languages like Lua are gaining more users every day.


The most egregious example in common usage: PHP.

http://me.veekun.com/blog/2012/04/09/php-a-fractal-of-bad-de...


Hah, a nice article. :-) My favourite sentence from the PHP documentation:

string create_function ( string $args , string $code )

"Creates an anonymous function from the parameters passed, and returns a unique name for it."

So there you are. Of course you can choose to be an anonymous value, but you'll get a name assigned by the state, for free. :-) Fascinating logic, captain.


In this case it isn't so weird, since PHP pre-5.3 didn't have first-class functions. To pass a function around, you would use a variable containing a string with its name.

create_function was a way to:

A) not have to define a function separately

B) not get problems with the function being redefined, since each call would create a new function

C) fake closures by generating code.

All of this should be moot by now, since PHP has real anonymous functions with closures.


That's true (except for C, because I can't see how a create_function-defined function closes over its environment), but what I had in mind was the way in which the author of the documentation, obviously one of the PHP core developers, talks casually about "returning the name of an anonymous function". It shows just how twisted the logic of these people is.

But I suppose that goes naturally hand in hand with the cargo cult approach to language design.


> That's true (except for C, because I can't see how a create_function-defined function closes over its environment),

That's why it would be a fake closure.

    $foo = someNumber();
    create_function('', 'return '.$foo.';');
You would generate a new string to be evaluated as the function body each time.

> talks casually about "returning the name of an anonymous function". It shows just how much twisted the logic of these people is.

I think you read too much into this. The point of an anonymous function isn't to make it not have a name, but to be able to define it where it is needed instead of referring to some specific function name in your code.

It also fits well with the way PHP handles "pointers", by storing the name of a variable in another variable.

    $foo = 42;
    $bar = 'foo';
    print($$bar); // Prints 42.


"The point of an anonymous function isn't to make it not have a name, but to be able to define it where it is needed instead of referring to some specific function name in your code."

Actually, its point is to make it a value that can be referenced from any number of bindings (associations between names and values) in any number of scopes. What you're saying is just a consequence of this.

"It also fits well with the way PHP handles "pointers", by storing the name of a variable in another variable."

Which only shows the deficiency, since any such indirect reference should never refer to a name, but to a binding instead.


I think we all know what language (s)he meant.

But about those dark corners, I guess the point wasn't to present any particularly nasty gotchas, but rather some precious little lesser-known tricks. C has plenty of very well-known features you can be bitten by (mostly related to memory management, of course). While the presentation reiterates some of them, the most valuable parts are about various _good_ parts of the language which are rarely heard of (viz. the usage of `static` inside brackets).


It's also worth pointing out that buffers passed to strcpy, memcpy, etc. must not overlap. Otherwise it results in undefined behavior.


It has come up in the past that this distinction is a historical artifact rather than a necessity. Linus tried to get Ulrich to change this in glibc, but it was not changed.

http://www.sourceware.org/bugzilla/show_bug.cgi?id=12518


That's stdlib though, not the language.


The standard library is part of the language - all hosted implementations must provide it.

This allows, for example, compilers to replace a `memcpy()` call that has a constant size argument with direct loads/stores.


memmove() has proper memory moving semantics, and can deal with overlapping buffers.
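A small illustration of the difference (sketch):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[] = "abcdef";
        memmove(buf + 1, buf, 5);   /* overlapping regions: memmove handles this */
        /* memcpy(buf + 1, buf, 5);    with overlap this would be undefined behavior */
        printf("%s\n", buf);        /* prints "aabcde" */
        return 0;
    }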


I wrote a compiler for a subset of C, and I'm happily aware of all of these 'dark corners'. That's why I would always recommend writing a compiler for a language if you _really_ want to understand the language.


"What would be the smallest C program that will compile and link?"

The author got this wrong; that would be an empty file, which once won the IOCCC for the smallest self-replicating program.


That is one of the finest examples of being technically correct--the best kind of correct. Spec is fulfilled but everybody knows the answer is useless.


Just FYI: if you know C and you want to take it to the next level, then Expert C Programming: Deep C Secrets is one of the best books out there.

http://www.amazon.com/Expert-Programming-Peter-van-Linden/dp...


Reading http://golang.org/ref/spec is such joy after having lived through C/C++ for the last many years. I still love C++, but if I can get away without having to use it, then I'm all for it.


Just like everything else, programming languages have evolved: from assembly to Fortran to C to Java/C# (just saying, no exact sequence implied). I don't think the languages we have now, far from perfect as they may be, would have been possible without the "dark corners" in the older languages. We learnt from them and made better languages. So I say show respect to the old languages, learn from them, and keep improving languages/tools... Everybody is happy.


We haven't all learned... https://www.destroyallsoftware.com/talks/wat :P


Yup. People have talked about the issues in C for years but few talk about "modern" languages. Ruby anyone?


There is a wonderful book about the trickier parts of C called Deep C Secrets (with a fish on the cover :-). It is a great second or third book after K&R.


Very amusing book.


Some slides show how a particular function is expressed in assembly. I know nothing about that language (I'm talking about assembly; I know C and even like it), and when I tried to find out how to learn it I ran into some problems. I don't know where I should start, how I should start, etc. Can someone point me to good resources or starting points? (I prefer Linux to Windows, if that matters.)

(Sorry for the off-topic.)


The assembly used was relatively simple and targets x86-64 Linux (you can tell it's not for Windows by how function arguments are passed).

You can actually get a firm grasp of the basics just by reading chapter 3 from Computer Systems: A Programmer's Perspective (http://csapp.cs.cmu.edu/public/samples.html) and practice writing some simple command line programs.


You shouldn't read too much into the assembly output from any particular compiler (except maybe dmr's for the PDP-11), but the de facto standard command line option "-S" will cause a *nix compiler to generate a ".s" file containing assembly rather than a binary.
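For example, with a gcc- or clang-style driver:

    cc -S -O2 foo.c     # writes foo.s containing the generated assembly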


Wow, I didn't know about the -S option, thanks for the tip! I know it may not be optimal assembly code, but it's still interesting code to read.


I am almost certain the pointer aliasing thing could be fixed by providing the proper optimization flag at compile time. I remember back in introductory systems classes we saw mind-boggling optimizations from GCC at -O3; the pointer example is so trivial it must be optimized by the compiler!


It isn't; there are very few flags that allow the compiler to perform optimizations not allowed by the language standard. Aliasing is not one of them for any compiler I know of. In fact, there are usually flags to go the opposite direction and assume all pointers alias because so many people write code that violates the standard (and results in GCC optimizing the code to behave differently than the author intended.)


You can only optimise if you know that globally it is never passed aliasing pointers, which can't be known at compile time, since a separately compiled object may be linked which does provide x == z.


For "dark corners of C", when I was writing C code I had several serious concerns. Below I list eight such in roughly descending order on 'seriousness':

First, what are malloc() and free() doing? That is, what are the details, all the details and exactly how they work?

It was easy enough to read K&R, see how malloc() and free() were supposed to be used, and to use them, but even if they worked perfectly I was unsure of the correctness of my code, especially in challenging situations; I expected problems with 'memory management' to be very difficult to debug, and wanted a lot of help with memory management. I would have written my own 'help' for memory management if I had known what C's memory management was actually doing.

'Help' for memory management? Sure: Put in a lot of checking and be able to get out a report on what was allocated, when, by what part of the code, maybe keep reference counters, etc. to provide some checks to detect problems and some hints to help in debugging.

That I didn't know the details was a bummer.

It was irritating that K&R, etc. kept saying that malloc() allocated space in the 'heap' without saying just what they meant by a 'heap', which I doubt was a 'heap' as in heap sort.

Second, the 'stack' and 'stack overflow' were always looming as a threat of disaster, difficult to see coming, and to be protected against only by mud wrestling with obscure commands to the linkage editor or whatever. So, I had no way to estimate stack size when writing code or to track it during execution.

Third, doing data conversions with a 'cast' commonly sent me into outrage orbiting Jupiter.

Why? Data conversion is very important, but a 'cast' never meant anything. K&R just kept saying 'cast' as if they were saying something meaningful, but they never were. In the end 'cast' was just telling the type checking of the compiler that, "Yes, I know, I'm asking for a type conversion, so get me a special dispensation from the type checking police.".

What was missing were the details, for each case, on just how the conversion would be done. In strong contrast, when I was working with PL/I, the documentation went to great lengths to be clear on the details of conversion for each case of conversion. I knew when I was doing a conversion and didn't need the 'discipline' of type checking in the compiler to make me aware of where I was doing a conversion.

Why did I want to know the details of how the conversions were done? So that I could 'desk check' my code and be more sure that some 'boundary case' in the middle of the night two years in the future wouldn't end up with a divide by zero, a square root of a negative number, or some such.

So, too often I wrote some test code to be clear on just what some of the conversions actually did.

Fourth, that the strings were terminated by the character null usually sent me into outrage and orbit around Pluto. Actually I saw that null terminated strings were so hopeless as a good tool that I made sure I never counted on the null character being there (except maybe when reading the command line). So, I ended up manipulating strings without counting on the character null.

Why? Because commonly the data I was manipulating as strings could contain any bytes at all, e.g., the data could be from graphics, audio, some of the contents of main memory, machine language instructions, output of data logging, say, sonar data recorded on a submarine at sea, etc. And, no matter what the data was, no way did I want the string manipulation software to get a tummy ache just from finding a null.

Fifth, knowing so little about the details of memory management, the stack, and exceptional condition handling, I was very reluctant to consider trying to make threading work.

Sixth, arrays were a constant frustration. The worst part was that I could write a subroutine to, say, invert a 10 x 10 matrix but then couldn't use it to invert a 20 x 20 matrix. Why? Because inside the subroutine, the 'extents' of the dimensions of the matrix had to be given as just integer constants and, thus, could not be discovered by the subroutine after it was called. So, basically in the subroutine I had to do my own array indexing arithmetic starting with data on the size of the matrix passed via the argument list. Writing my own code for the array indexing was likely significantly slower during execution than in, say, Fortran or PL/I, where the compiler writer knows when they are doing array indexing and can take advantage of that fact.

So, yes, no doubt like tens of thousands of other C programmers, I wrote a collection of matrix manipulation routines, and for each matrix used a C struct to carry the data describing the matrix that PL/I carried in what the IBM PL/I execution logic manual called a 'dope vector'. The difference is that both PL/I and C programmers pass dope vectors, but the C programmers have to work out the dope vector logic for themselves. With a well written compiler, the approach of PL/I or Fortran should be faster.
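A rough sketch of the do-it-yourself 'dope vector' approach being described here (names and layout are purely illustrative):

    #include <stdlib.h>

    struct matrix {
        size_t rows, cols;
        double *data;              /* rows * cols elements, stored row-major */
    };

    /* the index arithmetic a Fortran or PL/I compiler would generate for you */
    static double *mat_at(struct matrix *m, size_t i, size_t j)
    {
        return &m->data[i * m->cols + j];
    }

    static struct matrix mat_new(size_t rows, size_t cols)
    {
        struct matrix m = { rows, cols, calloc(rows * cols, sizeof(double)) };
        return m;
    }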

It did occur to me that maybe other similar uses of the C struct 'data type' were the inspiration for Stroustrup's C++. For more, originally C++ was just a preprocessor to C, and at that time and place, Bell Labs, with Ratfor, preprocessors were popular. Actually writing a compiler would have permitted a nicer language.

Seventh, PL/I was in really good shape some years before C was started and had subsets that were much better than C and not much more difficult to compile, etc. E.g., PL/I arrays and structures are really nice, much better than C, and mostly are surprisingly easy to implement and efficient at execution. Indeed, PL/I structures are so nice that they are in practice nearly as powerful as objects and often easier and more intuitive to use. What PL/I did with scope of names is also super nice to have and would have helped C a lot.

Eighth, the syntax of C, especially for pointers, was 'idiosyncratic' and obscure. The semantics in PL/I were more powerful, but the syntax was much easier to read and write. There is no good excuse for the obscure parts of C syntax.

For a software 'platform' for my startup, I selected Windows instead of some flavor of Unix. There I wanted to build on the 'common language runtime' (CLR) and the .NET Framework. So, for languages, I could select from C#, Visual Basic .NET, F#, etc.

I selected Visual Basic .NET and generally have been pleased with it. The syntax and memory management are very nice; .NET is enormous; some of what is there, e.g., for 'reflection', class instance serialization, and some of what ASP.NET does with Visual Basic .NET, is amazing. In places Visual Basic borrows too much from C and would have done better borrowing from PL/I.


I think C might make more sense if you are more familiar with assembly language. I learned C because real-mode x86 looked so fantastically ugly (looking back, a rare instance of youthful good taste). 0-terminated strings and stack allocation were quite familiar to me (though I never used stack allocation myself because it made the disassembly hard to read) and the overall model made perfect sense.


"I think C might make more sense if you are more familiar with assembly language."

I've written some assembler in the machine language of at least three different processors. On one machine I was surprised that my assembler code ran, whatever it was, 5-8 times faster than Fortran. Why? Because I made better use of the registers. Of course, that Fortran compiler was not very 'smart', and smarter compilers are quite good at 'optimizing' register usage. I will write some assembler again if I need it, e.g., for

R(n+1) = (A*R(n) + B) mod C

where A = 5^15, B = 1, and C = 2^47. Why that calculation? For random number generation. Why in assembler? Because basically I want to take two 64-bit integers, accumulate the 128-bit product in two registers, then divide the contents of the two registers by a 64-bit integer and keep the 64-bit remainder. Due to the explicit usage of registers, you usually need to do this in assembler.

But at one point I read a comment: For significantly long pieces of code, the code from a good compiler tends to be faster than the code from hand coded assembler. The explanation went: For longer pieces of code, good compilers do good things for reducing execution time that are mostly too difficult to program by hand which means that the assembler code tends to be using some inefficient techniques.


You are hypothesizing that someone whose language before C was IBM PL/1 is unfamiliar with assembly languages. This seems like an extremely improbable hypothesis; I suggest you seek another explanation for his or her dissatisfaction.


> Fourth, that the strings were terminated by the character null usually sent me into outrage and orbit around Pluto.

Everything is about tradeoffs. Fortran uses space-padded strings with no null terminator. On the positive side, this forces everyone to explicitly pass the length they mean instead of relying on more work at runtime to figure out when to stop by looking for the null sentinel. Passing explicit lengths is good practice in C anyway because you usually avoid having to scan the contents multiple times / multiple calls to strlen at different levels in the stack. While everything should be better in the Fortran case, the class of bugs that persist are even more hard-to-find bugs because poorly written code mis-calculates the length, ignores it, etc., stomping over adjacent memory. This probably won't crash, and since other code has to use an explicit length when accessing the buffer, you usually won't notice the problem at the source of the issue. Contrast that with C, where you're more likely to see an issue immediately as soon as the string is used or passed to something else.

tl;dr Poor programming is poor programming in any language.


Yup.

With PL/I the maximum length of the string is set when the string is allocated, usually dynamically during execution. The length can be given as a constant in the source code or set from computations during execution. There is also a current length <= the maximum length. When passing that string to a subroutine, the subroutine has to work a little to discover the maximum string length, but, by in effect 'hiding' both the current and maximum length from the programmer of the subroutine, the frequency of some of the errors you mentioned should be reduced.

In Visual Basic .NET, the maximum length of any string is the same, as I recall, 2 GB. Then having the strings be 'immutable' was a cute approach, slightly frustrating at times but otherwise quite nice and a good way to avoid the problems you mentioned.

But, of course, the way I actually used strings in C was close to the way they were supported in Fortran.

And, of course, likely 100,000+ C programmers wrote their own collection of string handling routines that use a struct to keep all the important data on the string, say, allocated or not, a pointer to the allocated storage, the maximum allocated length, the current length, etc. (multi-byte character set anyone?) and then pass just a pointer to the struct instead of a pointer to the storage of the string; in this way, again, they should reduce the frequency of some of the errors you mentioned.
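A sketch of that kind of counted-string struct (field names are purely illustrative):

    #include <stddef.h>

    struct counted_str {
        size_t cap;     /* maximum (allocated) length */
        size_t len;     /* current length; the bytes may include embedded nulls */
        char  *data;    /* not assumed to be null-terminated */
    };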


1) malloc() and free() are just library calls; they're not first-class citizens of the language. K&R and other good C references describe their public interface well, and that's all you need to know to use them effectively. The public interface encapsulates the implementation details: good software engineering in my book.

2) Usage of the stack reflects C's low-level, high-performance "portable assembler" roots. Choosing the stack size, and avoiding allocating too much on the stack, are familiar problems for assembly programmers too. I remember back in the 80s some C programmers were high-level guys going down and some were assembly guys going up. Only one of these groups would ever try to put 100,000-character arrays on the stack :-)

3) C strings are admittedly idiosyncratic, but with practice you can grow to love them and be very productive with them. But they are only a good match for textual data. If you are trying to use C strings for things like audio samples, sorry, you are doing it wrong.

4) C casts are useful when you understand the machine representation of the types you are working with. Typical use cases arise when you are bit twiddling, for example writing hardware drivers, etc. If you have no particular interest in the machine representation of your data, then the presence of C casts in your code is a red flag. They aren't needed for normal computational tasks.

5) Fair enough; you can make a decent threading library in C, but it's not for the faint-hearted or inexperienced.

6) Personally, I don't use multi-dimensional arrays in C much. I suspect you are probably right: they are just a weak part of the language. I could potentially be persuaded otherwise by someone more proficient.

7 and 8) I don't know much about PL/I, so I will not comment in depth. I suspect you are exhibiting the 'mother' syndrome here. You learned PL/I first, and that's what you fell in love with. I'd probably look at PL/I and think, why don't they do it like C? C is such a nice balance of terse yet capable. Far from being obscure, I'd judge the C pointer syntax to be a miracle of concise elegance, etc. etc.


1) On malloc() and free(), right, I was free just to write my own. I should have. At various times since for various reasons, I have just written my own.

On your

"K&R and other good C references describe their public interface well and that's all you need to know to use them effectively."

I want more. By analogy, all you need to drive a car is what you see sitting behind the steering wheel, but I also very much want to know what is under the hood.

Generally I concluded that for 'effective' 'ease of use', writing efficient code, diagnosing problems, etc., I want to know what is going on at least one level deeper than the level at which I am making the most usage.

Your example of putting a 100,000-byte array on the stack illustrates this: without knowing something about what is going on one level deeper, that seems like an okay thing to do.

2) My remark about the stack is either not quite correct or is not being interpreted as I intended. For putting an array on a push-down stack of storage, I am fully aware of the issues. But on a 'stack', maybe also the one used for such array allocations (that PL/I called 'automatic'; I'm not sure there is any corresponding terminology in C), there are also the arguments passed to functions. It seemed that this stack size had to be requested via the linkage editor, and if too little space was requested then just the argument lists needed for calling functions could cause a 'stack overflow'. A problem was that it was not clear how much space the argument lists took up.

Then there was the issue of passing an array by value. As I recall, that meant that the array would be copied to the same stack as the arguments. Then one array of 100,000 bytes could easily swamp any other uses of the stack for passing argument lists.

But even without passing big 'aggregates' by value or allocating big aggregates as 'automatic' storage in functions, there were dark threats, difficult to analyze or circumvent, of stack overflow. To write reliable software, I want to know more, to be able to estimate what resources I am using and when I might be reaching some limit. In the case of the stack allocated by the linkage editor for argument lists, I didn't have that information.

3) Sure, I could make use of the strings in C as C intended, just as you state, just for textual data, but I'd also have to assume a single-byte character set.

I thought that that design of strings was too limited for no good reason. That is, with just a slightly different design, C could have had strings that would work for text with a single-byte character set along with a big basket of other data types. That's what was done in Fortran, PL/I, Visual Basic .NET, and the string packages people wrote for C.

The situation is similar to what you said about malloc(): All C provided for strings was just a pointer to some storage; all the rest of the string functionality was just in some functions, some of which, but not all, needed the null termination. So, what I did with C strings was just use the functions provided that didn't need the null terminations or write my own little such functions.

As I mentioned, I didn't struggle with null terminated strings; instead right from the start I saw them as just absurd and refused ever to assume that there was a null except in the case when I was given such a string, say, from reading the command line.

It has appeared that null terminated strings have been one of the causes of buffer overflow malware. To me, expecting that a null would be just where C wanted it to be was asking too much for reliable computing.

4) On casts, we seem not to be communicating well.

Data conversions are important, often crucial. As I recall in C, the usual way to ask for a conversion is to ask for a 'cast'. Fine: The strong typing police are pleased, and I don't mind. And at times the 'strongly typed pointers' did save me from some errors.

But the question remained: Exactly how are the conversions done? That is, for the set D of 'element' data types -- strings, bytes, single/double precision integers, single/double precision binary floating point, maybe decimal, fixed and/or floating -- for any distinct a, b in D, is there a conversion from a to b and, if so, what are the details of how it works?

One reason to omit this from K&R would have been that the conversion details were machine dependent, e.g., depended on being on a 12, 16, 24, 32, 48, or 64 bit computer, signed magnitude, 2's complement, etc.

Still, whatever the reasons, I was pushed into writing little test cases to get details, especially on likely 'boundary cases', of how the conversions were done. Not good.

Sure, this means that I am a sucker for using a language closely tied to some particular hardware. So far, fine with me: Microsoft documents their software heavily for x86, 32 or 64 bits, from Intel or AMD, and now a 3.0 GHz or so 8-core AMD processor costs less than $200. So I don't mind being tied to x86.

On PL/I: Thankfully, no, it was nowhere near the first language I learned. Why thankfully? Because the versions I learned were huge languages. Before PL/I I had used Basic, Fortran, and Algol.

PL/I was a nice example of language design in the 'golden age' of language design, the 1960s. You would likely understand PL/I quickly.

So, PL/I borrowed nesting from Algol, structures from Cobol, arrays and more from Fortran, exceptional condition handling from some themes in operating system design, threading (that it called 'tasking' -- current 'threads' are 'lighter in weight' than the 'tasks' were -- e.g., with 'tasks' all storage allocation was 'task-relative' and was freed when the task ended), and enough in bit manipulation to eliminate most uses of assembler in applications programming. It had some rather nice character I/O and some nice binary I/O for, say, tape. It tried to have some data base I/O, but that was before RDBMS and SQL.

In the source code, subroutines (or functions) could be nested, and then there were some nice scope of name rules. C does that but with only one level of nesting; PL/I permitted essentially arbitrary levels of nesting which at times was darned nice.

Arrays could have several dimensions, and the upper bound and lower bound of each could be any 16 bit integers as long as the lower was <= the upper -- 32 bit integers would have been nicer, and now 64 bit integers. Such array addressing is simple: Just calculate the 'virtual origin', that is, the address of the array component with all the subscripts 0, even if that location is out in the backyard somewhere, and then calculate all the actual component addresses starting with the virtual origin and largely forgetting about the bounds unless have bounds checking turned on. Nice.

A structure was, first-cut, much like a struct in C, that is, an ordered list of possibly distinct data types, except each 'component' could also be a structure so that really was writing out a tree. Then each node in that tree could be an array. So, could have arrays of structures of arrays of structures. Darned useful. Easy to write out, read, understand, and use. And dirt simple to implement just with a slight tweak to ordinary array addressing. So, it was just an 'aggregate', still all in essentially contiguous, sequential storage. So, there was no attempt to have parts of the structure scattered around in storage. E.g., doing a binary de/serialize was easy. The only tricky part was the same as in C: What to do about how to document the alignment of some element data types on certain address range boundaries.

Each aggregate has a 'dope vector' as I described. So, what was in an argument list was a pointer to the dope vector, and it was like a C struct with details on array upper and lower bounds, a pointer to the actual storage, etc.

PL/I had some popularity -- Multics was written in it.

For C, PL/I was solid before C was designed. So, C borrowed too little from what was well known when C was designed. Why? The usual reason given was that C was designed to permit a single pass compiler on a DEC mini-computer with just 8 KB of main memory and no virtual memory. IBM's PL/I needed a 64 KB 360/30. But there were later versions of PL/I that were nice subsets.

It appears that C caught on because DEC's mini computers were comparatively cheap and really popular in technical departments in universities; Unix was essentially free; and C came with Unix. So a lot of students learned C in college. Then as PCs got going, the main compiled programming language used was just C.

Big advantages of C were (1) it had pointers crucial for system programming, (2) needed only a relatively simple compiler, (3) had an open source compiler from Bell Labs, and (4) was so simple that the compiled code could be used in embedded applications, that is, needed next to nothing from an operating system.

The C pointer syntax alone is fine. The difficulty is the syntax of how pointers are used or implied elsewhere in the language. Some aspects of the syntax are so, to borrow from K&R, 'idiosyncratic' that some examples are puzzle problems where I have to get out K&R and review.

To me, such puzzle problems are not good.

I will give just one example of C syntax:

i = j+++++k;

Right: thanks to maximal munch that actually tokenizes as ((j++)++) + k, which doesn't even compile; you have to write i = j++ + ++k; to get "add 1 to k; add that k to j and assign the result to i; then add one to j". Semi-, pseudo-, quasi-great.

I won't write code like that, and in my startup I don't want us using a language that permits code like that.


Well, I certainly salute your passion. I am not nearly dedicated enough to go through this point by point. The stack issue comes down to this: C uses the assembly (i.e. machine) stack. It is an almost ridiculously simple mechanism, ideally suited to pass parameters and allocate 'automatic' (yes, this is a C term too) data. Avoid large aggregates and arrays on the stack because stack space is limited. Provided you adopt a conservative approach, you never have to worry; 90% of a 2 KB stack in a small embedded system is typically safety factor/headroom.

My personal view is that C offers a perfect tradeoff between simplicity and capability; it has a magical quality that has made it the single most important computer language for nearly 40 years and on into the foreseeable future. Increasingly its importance is as a layer that more programmer-friendly technology sits upon, but it's no less important for that.

I've read that the difference between chess and go (the oriental game, not golang) is that if little green men on Alpha Centauri play a game that resembles chess, it will almost certainly be identical to go. Go is simple enough that it is almost inevitable. For me it's almost the same thing (I stress almost) with computer languages and C.

One final point: C syntax is ultimately a matter of taste. If you find this to be a completely obvious, correct and straightforward way of doing a non-overlapping C string copy:

  while( *src )
    *dst++ = *src++;
  *dst = '\0';
Then you 'get' C. If you find it a confusing monstrosity, maybe C isn't your language.


So, with just 2KB, that machine stack is not the stack (of 'dynamic descendancy', that is, the conceptual stack of routines called but not yet returned) for automatic storage. Good to know. So, if I pass only pointers or 'element' variables, then a 2KB stack for parameter lists should be okay for small to medium programs -- unless the programmer actually believed his computer science course that said recursive routines were good things!

Yes, for your code example, I 'get it'! It's cute! No doubt it's cute.

So, to get 'full credit', dst and src are 'strings', that is, essentially just pointers. Since C pointers have a data type with a length, here the data type of these pointers is byte, or character, or some such with length 1 byte.

Starting off, since src points to a C string that obeys the C convention of null termination, we know that the string has exactly one null byte, its last byte. So any byte except the last one is not null. So, if the string has length more than 1, then as we enter the While, src points to the first byte of the string (or at least the first byte we want to copy); *src is that byte; and *src is not null, that is, not 0, that is, it tests as true for the While (you need to know that 0 tests as false and anything else tests as true). So we execute the statement following the While.

That statement says: copy byte *src to byte *dst and then, before considering the semicolon and before collecting $100, increment both src and dst by the length of their data types, that is, by 1 byte. So now src points to the next byte in its string and dst points to the target of the next byte in its string. Then we return to the While and do its test again. If we have more bytes to move, we just rinse and repeat the above. Else src points to the last byte of its string, which has to be the null byte, so that the byte itself, *src, is the null byte, that is, 0, that is, false in the While statement. Then we leave the While statement, that is, move past its semicolon. So, net, the While statement moves all the non-null bytes we want moved. Of course, when we get to the next statement, dst points to the last byte of its string, that is, the byte that is to be its null byte (just why that byte is not null already is possibly an issue). At any rate, we want that last byte to be the null byte, so the last statement assigns it.
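
Folding that walkthrough back into the code as comments (just a sketch; the function name is mine, not from the parent):

    void copy_string(char *dst, const char *src)
    {
        while (*src)             /* the byte src points at; the 0 terminator tests false */
            *dst++ = *src++;     /* copy one byte, then advance both pointers by 1 */
        *dst = '\0';             /* the loop stops before the terminator, so write it */
    }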

For more, 'src' abbreviates 'source' and 'dst' abbreviates 'destination'.

So, yes, it's possible to describe this stuff in English.

So, there's a problem, a significant problem: I 'documented' your code. Okay, but your code is for a string copy. There should be some documentation, but it belongs with the documentation of a string operation for the language. That is, even just for the documentation, the strings and the copy operation need to be 'abstracted' to a higher level, where they can be documented and learned once, setting aside the need to document them in the code.

Of course, I would write such a copy loop, assuming I wanted to use a loop, using Fortran's Do-Continue, PL/I's Do-End, VB's For-Next. I don't recall the Algol syntax. The PL/I syntax is the same as in Rexx which I use heavily.

Of course in PL/I and VB, I would use the substring function instead of a loop.

And I would fear that heavy use of syntax as sparse as this example would be more error prone. And for more complicated operations, I would fear that neither God nor the programmer understood the code. I can understand that on some processors with some compilers such C syntax could lead to faster code, but I'm not thrilled about digging into x86 assembler enough to be sure. Also, now it's tough to know what fast code is, due to out-of-order execution, speculative execution, parallel execution, pipelines, three levels of cache, the cache being set associative, and cache line invalidates when you have several cores. But computers are so fast now I don't much care, and if I did care I would notice that making such code faster would not make the code in the runtime library or the operating system faster and, thus, might not do much for the actual performance of my application.

I don't really mind your example; it's actually not sparse enough to be a serious problem. But I'm not thrilled by the example because I don't take pleasure in that clever sparsity and, again, I'm afraid that it could result in bugs that could hurt my business.

Then there's the issue, which I regard as significant, that with that code the C compiler knows very little about the 'meaning' of what is being done. You and I can look, read, guess, and agree that we are copying the tail of one C string to another (or possibly the same string, if initially src equals dst). Okay, you and I can guess that. But the poor compiler can't, and if it tried, I might get torqued when in that loop I'm actually not working with C strings but doing something else. So the compiler will have a heck of a time checking string lengths and not writing on the wrong storage. So I'd rather have strings as a higher level construct than just a pointer to some storage from malloc(), so that the language can help me debug my code. I've programmed so many errors in array bounds and string lengths that I don't want to be without some good checking at runtime, at least in some 'debug' mode.
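
The kind of higher level construct meant here can be sketched even in C; the type and function names below are mine, just to make the run-time check concrete:

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* A length-carrying string, so a copy can check capacity at run time --
       a sketch of the idea, not any particular library. */
    typedef struct {
        size_t len;   /* bytes in use    */
        size_t cap;   /* bytes available */
        char  *data;
    } str;

    static void str_copy(str *dst, const str *src)
    {
        assert(src->len <= dst->cap);          /* the check a bare char* lacks */
        memcpy(dst->data, src->data, src->len);
        dst->len = src->len;
    }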

Next, as for being 'fast': on at least IBM 360/70, etc. computers, that code sample would actually be slow! Why? Because that instruction set has a single instruction to copy all the bytes in a range of sequential virtual addresses. So, if the compiler knows it is working with a string and knows it is compiling for that instruction set, it can replace the whole loop with just one instruction.

There was an old remark on execution speed in the IBM PL/I program logic manual: people could complain about PL/I being slower than Fortran, but if PL/I was used at all carefully, it was faster than Fortran, and one of the main reasons was that PL/I, but not Fortran, had strings in the language. So Fortran programmers wrote collections of string handling functions/subroutines, and the internal logic was much as in your example: move one byte at a time. And, as in the standard C library, do this by calling a function/subroutine, with its overhead. PL/I's compiled code for strings was in-line and blew the doors off anything in Fortran. Today, and for similar reasons, Visual Basic .NET has a chance to be faster at string manipulation than C code such as your example. Further, it's super tough to do something with VB strings that would mess up memory.

I don't find your example "a confusing monstrosity", but I greatly prefer to bet my business on VB instead of C/C++.


C strings are performant but place a lot of responsibility on the programmer. C++ strings offer a less lightweight but more accessible, easier to use facility. Rather typically of the higher level string abstractions that are standard in non-C languages (and which can be constructed as library functions in C), they rely on memory allocation and so will often be less performant.

My little example is basically the strcpy() standard facility. Maybe a better example would be a construct that (roughly) could replace memcpy():

  while( n-- )
    *dst++ = *src++;
This sort of thing just appeals to me as simple and obvious computing - there's no cleverness to it - and certainly no need to break it down exhaustively to understand it. I think whether this sort of thing appeals might have something to do with prior experience - in my case as an engineer and assembly language programmer.

The equivalent to my memcpy() snippet on the original x86 machines was simply this:

  rep movsb
Put the count in cx, the source ptr in si, and the dest ptr in di, and the REP prefix will repeat the MOVe String Byte instruction, decrementing cx each time, until it hits zero.
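
In C today the usual way to get that kind of block move is to call memcpy and let the compiler expand it inline; a minimal sketch (the wrapper name is mine):

    #include <string.h>

    /* Compilers typically expand this inline to the platform's block move
       (e.g. rep movsb on x86) or to vectorized code, rather than a byte loop. */
    void copy_fast(char *dst, const char *src, size_t n)
    {
        memcpy(dst, src, n);
    }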


Incidentally, writing standard business apps in VB instead of C or C++ makes complete sense to me.


C is not perfect. It has its problems (strings, ++, horrible type syntax, no memory allocation, architecture idiosyncrasies like type width). However, it's reliable, fast, and you can basically memorize the language and compile it by hand if necessary. No other language will reliably run on many systems that fast with that much existing code.

anyway, if you think you can prevent bad code by using restrictive languages, you're gonna have a bad time. Any language can be abused. Just don't abuse it, treat your code with respect.

Also I'm pretty sure j+++++k has undefined behavior so you should be shot if you write it.

> in my startup I don't want us using a language that permits code like that.

Well I hope you don't run unix or windows, python or ruby, Firefox, chrome, ie, safari, or opera, or use a smartphone.


"C is not perfect." Yup, it has some "dark corners" or whatever want to call its flaws.

"No other language will reliably run on many systems that fast with that much existing code." Yup, and just such reasons are why at times I used it. It's in effect also why I'm using some of it now although mostly my code is in Visual Basic .NET (VB): The Microsoft VB documentation is fairly clear ('platform invoke' or some such) on how to call C from VB. Well, I have some old Fortran code I want to call from VB, do have the old Watcom Fortran compiler, but do not have details on how to call Fortran from VB. So, I used the old Bell Labs program f2c to convert the Fortran to C, used the Microsoft C compiler to compile and link to a DLL, then call the DLL from VB. And actually it works. And in effect the reason I can do this is what you said: C is so popular, for the reasons you gave, etc., that Microsoft went to the trouble to say how to call C from VB. Microsoft didn't do that for Fortran, PL/I, Algol, Ada, etc. You are correct that the popularity of C is important.

"anyway, if you think you can prevent bad code by using restrictive languages," Right. Each such restriction eliminates only some cases of bad code.

> Any language can be abused. Just don't abuse it, treat your code with respect.

Right. There is "When a program is written, only God and the programmer understand it. Six months later, only God." Well, so that I could read my code six months later, I wrote only easy to read code. So, I would write

n = n + 1

and not the short version, and would never write i+++++j.

> > in my startup I don't want us using a language that permits code like that.

> Well I hope you don't run unix or windows, python or ruby, Firefox, chrome, ie, safari, or opera, or use a smartphone.

You lost me: I'm using VB and find nearly all the syntax to be fine, that is, easy to learn, read, and write. And the main reason I'm not using C# is what it borrowed from C/C++ syntax. I'm using C only when really necessary. Sure, I use Windows and Firefox; if they are written in C/C++, that's their issue. But by staying with VB, I am successful with my goal of

> in my startup I don't want us using a language that permits code like that.

> Also I'm pretty sure j+++++k has undefined behavior so you should be shot if you write it.

As I recall, I actually tried it once, and it compiled and ran as I explained. And, as you explain, if it works in one heavily used C compiler, then it should work the same in all of them. If I look at j+++++k, I suspect that it parses according to the BNF just one way, with no ambiguity. So I don't have to write, say,

(j++) + (++k)


According to the linked presentation, slide 13:

"The C specification says that when there is such an ambiguity, munch as much as possible. (The "greedy lexer rule".)"

So j+++++k turns into:

j++ ++ + k

Which is clarified on the next slide.


Wow!

I would have guessed that j++ ++ was not legal syntax.

So, I was wrong: There are two ways to parse that mess. So, there is ambiguity. And the way they resolve the ambiguity is their 'greedy' rule! Wow!

Net, that tricky stuff is too tricky for me.

There was a famous investor in Boston who said that he invests only in companies any idiot could run, because the chances were too high that too soon some idiot would be running the company.

Well, I want code, or at least language syntax, that any idiot can understand, for now, me, and later some of the people that might be working for me!

You are way ahead of me on C, and you leave me more afraid of it than I was. But then I was always afraid of it and, in particular, never wrote ++.


Okay, some clarity from actually running some simple code! Or if K&R didn't make a lot of details clear to me in my fast reading, then maybe some simple test cases will!!!

So, my first issue was the statement for C

     i = j+++++k;
So, to make some tests, I dusted off my ObjectRexx script for doing C compiles, links, and execution.

Platform: Windows XP SP3 with recent updates. And apparently somehow I have

     Visual Studio 2008 x86 32 bit
installed, and it has relevant "tools", e.g., a C/C++ compiler, linker, etc.

I don't use IDEs or Visual Studio and, instead, apparently like a significant fraction of readers at HN, write code with my favorite text editor (e.g., KEdit) and some command line scripts (using ObjectRexx, which is elegant, but for better access to Windows services, etc., likely I should convert to Microsoft's PowerShell).

So, I typed in some C code and tried to compile it. Then I encountered again one of the usually unmentioned problems in computing: Software installation and system management. Several hours later I had a C/C++ 'compile, load, and go' (CLG) script working, but my throat was sore from screaming curses at the perversity of 'system management' -- a project of a few minutes with a prerequisite of several hours of system management mud wrestling.

For the mud wrestling, the first problem was, since my last use of C, I had changed my usual boot partition from D to E. Next the version of C installed on E was different from that on D. And the installation on D would not run when E was booted. Bummer.

Next, the C compiler, linker, etc. want a lot of environment variables. Fine with me; generally I like the old PC/DOS idea of environment variables.

However, apparently Microsoft was never very clear on just what software, when, could change the environment variables where. At least I wasn't clear.

So, booting from my partition E, the C/C++ tools want environment variables set as in

     E:\Program Files\Microsoft Visual Studio 9.0\Common7\Tools\vsvars32.bat
Okay. Nice little BAT file.

If I run the BAT file from a console window, it changes the environment variables as needed by C/C++. But in console windows I run a little 'shell script' I wrote in ObjectRexx. It has a few nice features for directory tree walking, etc. But when I run the BAT file from the command line of a console window that is running my little shell script, after the BAT file is done and returns, the environment variables have been restored to what they were before running the BAT file. If I use a statement, say,

     set >t1
at the end of the BAT file, then file t1 shows that the environment variable values were indeed changed while the BAT file was still running.

So, sure, there is a 'stack' of invocations of processes, applications, or whatever in the console window and its address space, and, somehow, since my shell script was in the stack, when the BAT file quit, the stack and its collection of environment variables were popped back to what they had been.

But eventually I relented, gave up on the idea of this little project taking just a few minutes, slowed down, thought a little, read some old notes, discovered that I should change the environment variables within my ObjectRexx script, using an ObjectRexx function for that purpose, as needed by C/C++ CLG, found the needed changes, implemented them, and, presto, got a C/C++ CLG script that works while my shell script is running and while I am booted from my drive E.

On to the C question:

For 'types', the test program has

     int i, j, k;
For

     i = j+++++k;
my guess was that this would parse only one way,

     i = (j++) + (++k)
and be legal. And, as I recall (though I likely no longer have good notes), some years ago on OS/2, PC/DOS, or an IBM mainframe,

     i = j+++++k;
was legal.

Not now! With the C/C++ tools with

     Visual Studio 2008 x86 32 bit
statement

     i = j+++++k;
gives C/C++ compiler error message

     error C2105:  '++' needs l-value
So, that's an L-value or 'left value' or something that the 'operator' ++ can increment.

So, it wasn't clear how the compiler was parsing. So, I tried

     i = j++ ++ +k;
and it also resulted in

     error C2105:  '++' needs l-value
So, likely the ++ that is causing the problem is the second one.

So, I tried

    i = (j++)++ + k;
and still got

     error C2105:  '++' needs l-value
Then I tried

    i = j++ + ++k;
and it worked as one would hope: k was incremented by 1 and added to j, the sum was assigned to i, and then j was incremented by 1.

Then I tried

    i = j+++k;
Surprise! It's legal! j and k are added and the sum is assigned to i, and then j is incremented by 1.

So, I had long concluded that to understand some of the tricky, sparse syntax of the language, not clearly explained in K&R, I have to write and run test cases as here. Bummer. But, as below, here I'm significantly wrong.

Possible to make sense out of this?

Maybe: if you start reading

Brian W. Kernighan and Dennis M. Ritchie, 'The C Programming Language, Second Edition', ISBN 0-13-110362-8, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

in "Appendix A: Reference Manual" on page 191, then hear about 'tokens' and 'white space' to separate tokens.

Okay, no doubt + and ++ are such 'tokens'.

Continuing, right away on page 192 we have

"If the input stream has been separated into tokens up to a given input character, the next token is the longest string of characters that could constitute a token."

I would have said "up to and including a given input character", but K&R are 'sparse'!

So, with this parsing rule, in

     j+++k
the tokens are

     j
     ++
     +
     k
which is essentially

     (j++) + k
which is legal, but in

     j+++++k
the tokens are

     j
     ++
     ++
     +
     k
which would be essentially

     (j++)++ + k
where the second ++ does not have an 'L-value' to act on.
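
A toy lexer makes that longest-match rule concrete; this is my code, not from K&R, and it handles only the characters in this expression:

    #include <stdio.h>

    /* Maximal munch for just these tokens: grab "++" whenever two '+' are
       available, otherwise "+". Run on "j+++++k" it prints: j ++ ++ + k
       -- exactly the breakdown above. */
    int main(void)
    {
        const char *p = "j+++++k";
        while (*p) {
            if (p[0] == '+' && p[1] == '+') {   /* longest match first */
                printf("++ ");
                p += 2;
            } else if (p[0] == '+') {
                printf("+ ");
                p += 1;
            } else {
                printf("%c ", *p);              /* identifiers are one char here */
                p += 1;
            }
        }
        printf("\n");
        return 0;
    }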

So, my remark that

     j+++++k
can parse only one legal way is irrelevant because that is not how the C parsing works.

Basically I was assuming a 'token getting' parsing rule like one I've implemented a few times in my own work: there are tokens and delimiters, and a 'token' is the longest string of characters bounded by delimiters but not containing a delimiter. The delimiters are white space, (), etc.

K&R seems to have a point: My parsing rule would have trouble with just

     j>=k
and, instead would require writing

     j >= k
which I do anyway.

Generally, though, the C syntax is sparse and tricky, so tricky it stands to be error prone.

Back to writing Visual Basic .NET.


Don't think the K&R book is the standard. The standard now exists and is detailed enough for what C aims at being. As for doing math in C or wanting managed allocation, well, there are better languages for that (and that was even more widely known twenty years ago, for the math part...)

You seem to have found something that works well for your needs, so everything is good.


I confess: When I was writing C, K&R was the standard! Good to see that now there are better versions of C with more detailed documentation.

The last time I had to write some C, I just refreshed my C 'skills' with K&R and reading some of my old code.

For your

"You seems to have found some that works well for your needs so everything is good."

I agree: I looked at Java early on and didn't like it. From some of the comments and links here at HN, I see that Java has made progress since then. Indeed, some of what I like in Visual Basic .NET (I say ".NET" because there is an earlier version of Visual Basic that is quite different and less 'advanced') seems to have come from Java. So, now I'm glad to have the progress of Java and/or Visual Basic .NET and will return to C only when necessary.

Actually, the last time I worked with C, I wrote only a few lines of it! Instead, I took some Fortran code, washed it through the famous Bell Labs program f2c (apparently abbreviates 'Fortran to C') to translate to C, slightly tweaked the C, compiled it into a DLL, and now call it from Visual Basic .NET.

Maybe what will be waiting for me in the lower reaches is C programming on an early version of Unix without virtual memory and without a good text editor on a slow time sharing computer using a B/W character terminal, 24 x 80!


"heap" goes back at least to Algol 68, where you could write (using case stropping)

    REF INT i = HEAP INT;   # sort of like C++ "new" #
    REF INT i = LOC INT;    # allocates from the stack #

or the shorter forms

    HEAP INT i;
    LOC INT i;


You got me! I wrote a little Algol 60 at one time and heard nice things about Algol 68 but never looked at it.

Since heap is the word used in heap sort, it's fair to say that the second use of that word was a misuse. I don't know which use was second and don't really care, but I did want to know the details of the dynamic memory allocation used by the C malloc() and free(). I just would have appreciated an explanation of what malloc() and free() were doing, so that I could write some code, as I described, to 'help' me monitor what my code was doing with memory. Sure, writing a good system for 'garbage collection' complete with reference counts and memory compactification is difficult, but what malloc() and free() were doing was likely not very tricky. I just wish K&R had documented it.


Um ... but K&R did. Chapter 8, section 7, "A Storage Allocator". Yes, it's simple, but it's there.


Yup, there's a version of each of malloc() and free() there.

Maybe that's what was being used in the common versions of C. If so, then for whatever reason I missed out on that. I kept seeing in the book where they kept saying that malloc() allocated storage in the 'heap' without being clear on what they meant by a 'heap', although in this thread there is an explanation that 'heap' was also used in Algol 68. Whatever, when they said 'heap' with no explanation, they blew me away.

Once I was one of a team of three researchers that did a new language. Eventually it shipped commercially. We needed a lot of dynamic storage allocation. Our approach was to start with an array of pointers: say, for i = 1, 2, ..., 32 or some such, s(i) was a pointer to the start of storage for chunks of size 2^(i-1) + 1 to 2^i. So an allocation of 10 bytes would be served from 16 bytes, where 16 = 2^i for i = 4. That is, i = 4 handles requests of size 9 through 16. Etc.

So, right away at the start of execution, for relatively large j, have the operating system allocate a block of storage of size 2^j. Then for i < j, if we need a block of storage for allocations handled by i, get that from storage handled by i + 1, etc., up to j, where we actually get some storage. That is, if i = 4 needs storage, get that from i = 5, which handles requests up to size 32 = 2^5.

For each i, the allocated blocks are chained together in a linked list and so are the free blocks. So, for an allocation, look first at the end of the linked list of free blocks.

In principle, after enough uses of, call it, free(), we could return some storage for i to the storage for i + 1, but we didn't bother doing that.
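
A rough sketch of that scheme in C (the names are mine, and it skips the splitting-from-the-next-larger-class step, just asking the OS whenever a class's free list is empty):

    #include <stddef.h>
    #include <stdlib.h>

    #define NCLASSES 32

    /* One free list per power-of-two size class; class i roughly serves
       requests of 2^(i-1)+1 .. 2^i bytes (the header eats a little). */
    typedef struct block {
        struct block *next;   /* next free block of the same class */
        int cls;              /* size class, so xfree knows where to return it */
    } block;

    static block *free_list[NCLASSES];      /* the array s(i) described above */

    static int size_class(size_t n)         /* smallest i with 2^i >= n + header */
    {
        size_t need = n + sizeof(block);
        int i = 0;
        while (((size_t)1 << i) < need)
            i++;
        return i;
    }

    void *xalloc(size_t n)
    {
        int i = size_class(n);
        block *b = free_list[i];
        if (b)                               /* reuse a block freed earlier */
            free_list[i] = b->next;
        else {                               /* none free: get a fresh 2^i block */
            b = malloc((size_t)1 << i);
            if (!b)
                return NULL;
            b->cls = i;
        }
        return (void *)(b + 1);              /* user data starts after the header */
    }

    void xfree(void *p)
    {
        block *b = (block *)p - 1;           /* recover the header */
        b->next = free_list[b->cls];         /* push onto its class's free list */
        free_list[b->cls] = b;
    }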

It always seemed to me that on a virtual memory system where the page size was a power of 2 (aren't they all?), this approach to dynamic memory allocation would be quite good.

Later I was using an old version of Fortran, got a big block of storage from the operating system as just an array (right, as a common block known to the linkage editor), and wrote code such as above to have versions of malloc() and free().

If what is there in K&R in 8.7 is what was actually being used in the versions of C I used, then I blew it by not writing some code at least to report on storage allocated, freed, 'fragmentation', when allocated, etc. Basically I was highly concerned that in a relatively complicated program with just malloc() and free() I would make some mistakes in storage allocation and get some bugs that gave symptoms only occasionally and that would be a total pain to diagnose. "Last Tuesday after running for four hours we got some strange data in the file and then it blew up." Great! It reminds me of one of those arrangements of a few thousand dominoes on edge where when one tips over they all go, a house of cards, an electric power system with no circuit breakers, a dense city built with no fire safety codes, etc.


Did you ever check the documentation for the C compiler you were using? Every C compiler I've used has always come with extensive documentation, which included documentation of the standard C calls, like malloc() and free().

Even today, the GNU C compiler (which is what I mostly use) has non-standard extensions to malloc()/free() that allow you to obtain information that is otherwise not mentioned in the C Standard (GNU defines mtrace() and malloc_hook, for instance, which trace each allocation, and allow you to peek into malloc).
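
For instance, a minimal sketch using glibc's mtrace(); this is glibc-specific, not standard C, and you run the program with the MALLOC_TRACE environment variable pointing at a log file:

    #include <mcheck.h>
    #include <stdlib.h>

    int main(void)
    {
        mtrace();                    /* start logging malloc/free calls */
        char *p = malloc(100);       /* recorded in the MALLOC_TRACE log */
        free(p);
        char *leak = malloc(50);     /* never freed: shows up as a leak */
        (void)leak;
        muntrace();                  /* stop logging */
        return 0;
    }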

From your descriptions, it sounds like you wrote C back in the 70s or 80s. It's changed a bit since then.


I wrote C in the 1990s on IBM mainframes, PC/DOS, and OS/2. I used some IBM mainframe, Microsoft, and OS/2 documentation. At one time I wrote some C code callable from PL/I for the TCP/IP calls, available to C but not to PL/I. I wrote some little utilities in C on OS/2, e.g., for sorting files of strings. Recently I wrote a grand, very carefully considered solver for Ax = b where A is an m x n matrix, x is n x 1, and b is m x 1. So A need not be square and, if square, need not be non-singular. I was fairly careful about numerical accuracy, etc.

For the C documentation, I recall only one point: Due to all the different addressing options on x86, the Microsoft C compiler had a crucial but nearly secret command line switch we needed. I found the switch only by some strange efforts. It was not in the documentation. None of the documentation I had was much beyond just K&R.


If you ever program in C again, use valgrind to help debug your programs: http://valgrind.org/

It's particularly useful for bugs related to memory.
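
For instance, a tiny buggy program (the file and variable names are mine) and a typical invocation:

    /* buggy.c -- reads one past the end of a heap buffer and never frees it;
       valgrind's default memcheck tool reports both. */
    #include <stdlib.h>

    int main(void)
    {
        int *a = malloc(4 * sizeof *a);
        int x = a[4];                /* invalid read, one past the end */
        (void)x;
        return 0;                    /* 'a' is never freed: reported as a leak */
    }

Compile with debugging info (gcc -g buggy.c -o buggy) and run valgrind --leak-check=full ./buggy; the report points at the out-of-bounds read and the leaked block.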


Thanks. I just made a note of that!

So, someone else dug into the details of how C manages memory and wrote some code to help people find problems; makes good sense.


Not exactly - it's kind of like executing a C program in a sandbox. It intercepts all memory allocation requests and accesses, and can tell you if you access memory you did not request.

But how C programs behave regarding memory is well known and understood. However, it does require understanding basic memory concepts related to the operating system itself.


That's fun. Cause I remember this "x+++y;" as a question in one of my university entrance exams!


That sounds like a university to avoid.


It's not necessarily a bad question; it makes you think about how parsers work.

But for an entrance exam, it's slightly hardcore :)


To get it right you have to be able to pick it apart according to specified rules. Being able to work with formally specified rules is an integral part of the study of computer science (also, other STEM majors). I'd say it's a perfectly valid question, as long as someone points out that one should stay as far away as possible from this sort of code.


The question assumes that you KNOW the rule, which is highly unlikely unless you've either been bitten by it or have read through the spec enough times to catch it.

Unless you know the actual parsing rules, there's no way to know if a real parser would be greedy or not (or perhaps it might try to be clever?). This is nothing more than a trivia question, which does not test aptitude or intelligence.


It does test knowledge. Nothing wrong with knowledge.

I expect they asked some other questions too.


It tests esoteric (aka borderline useless) knowledge. There's a big difference between that and, say, knowing how to use something actually useful like double pointers.

I had no idea how the C parsing algorithm worked for +++ et al, and I'm an expert C programmer. Then again, I'd also never use such ridiculous constructs in production code.


It's not that esoteric. You need to see it in a broader scope than simply something the C standard specifies. It's about how parsing is traditionally done: by splitting the input into tokens, using the longest match when multiple tokens fit the beginning of the input. Then you can use the tokens to do things.

Even if you don't know that much about the subject, you can still reason about it in an interesting way. Seeing that it is ambiguous is already a good observation. You can then propose ways to resolve the ambiguity and touch (willingly or not) upon the topics of operator precedence, associativity, and greedy matching.

Those topics are not only relevant in parsing either, for instance associativity is an important concept for list operations such as folding (a right fold is different from a left fold).


Yupe. :|


A very interesting read!

By the way, shouldn't the right hand side text on slide 7 (the final part of slide 7) talk about the pointers z and x, instead of the values pointed at? (Aside: How do I write "asterisk x" on HN without getting an italicized x?)


int x = 'FOO!';

Took me a while to understand this; single quotes define single characters, and for some reason C decided to allow multiple-character character constants but leave their value implementation-defined. Discussion: http://zipcon.net/~swhite/docs/computers/languages/c_multi-c...
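
A small demonstration; what it prints is implementation-defined, though gcc and clang typically pack the four character codes into the int (here 464f4f21):

    #include <stdio.h>

    int main(void)
    {
        int x = 'FOO!';          /* multi-character constant; value is
                                    implementation-defined (gcc warns via
                                    -Wmultichar but accepts it) */
        printf("%x\n", x);
        return 0;
    }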


Fun!

Lots more ambiguities in C++, but it's a challenge to find them in C. My favorite: [] is just '+'.
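
That is, a[i] is defined as *(a + i), and since addition commutes, i[a] names the same element; a toy demonstration:

    #include <stdio.h>

    int main(void)
    {
        int a[] = { 10, 20, 30 };
        /* a[1] is *(a + 1), which is also *(1 + a), i.e. 1[a] */
        printf("%d %d %d\n", a[1], *(a + 1), 1[a]);   /* prints: 20 20 20 */
        return 0;
    }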


[] really is a very well known trick question. Well, it was used at my university before they switched to another language. I think they don't even teach C there any more. Shame.


Is there a way to disable the fade-in? It makes scanning impossible.


View it in the editor instead.


What's the point in 'count up vs. count down'?


Integer subtraction with a result of 0 sets the same status bit as comparing one value to another, so you can get away without the compare instruction when counting down. It might not sound like a lot, but it can be meaningful in a tight loop.

I don't know why the author chose to change the syntactic structure of the loop though, since it hides the point.

You have to be careful when counting down though. If you're accessing an array, you might be tempted to do this:

    for(size_t i = bar_len - 1; i >= 0; --i) {
        foo(bar[i]);
    }
It looks innocent enough, but size_t is unsigned, so i >= 0 will always be true. (Of course, using -Wall and -Wextra will warn you about this.)
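
One common way around it (my phrasing, not the parent's) is to decrement inside the condition, so the index never has to go below zero:

    for (size_t i = bar_len; i-- > 0; ) {
        foo(bar[i]);
    }

The body then sees i running from bar_len - 1 down to 0, and the loop stops cleanly once the test on 0 fails.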


Ah this just made my day! :)


I don't get it. What will happen if you violate the language semantics? They call it 'dark corners'? If you hit your head against a wall, it will hurt. Is it a 'dark corner' of life?

Overall, the presentation is very weak, like something from yesterday's graduate.


Did you notice that not all of these corners violated any clause of the standard?

I've got quite a bit of experience with C, and I haven't heard of the "static" array size feature before, which seems extremely useful.


No comma operator on the slides. What a pity :-)


I am glad we all aren't experts like you, then; some of the stuff I knew from before, and some of it was genuinely enlightening.

So I am glad it was posted; it helped me, and this comment page was also something to both smile/laugh at and learn from. Thanks


I see this site is read mainly by dumb ignorant children.

Screw this 'community'. It sucks ass.

Also, the design of this site is awful. And the engineering skills of Mr. PG The Greatest apparently suck as well.


Is banging your head against the wall violating the semantics of life?


Is my thought so complicated that it needs additional explanation?

By the way, I'm an expert C/C++ programmer, so my opinion matters.


A self-proclaimed expert, one must add.


I don't know dakimov and what his history may be (maybe it's even a language issue, or that he simply doesn't "speak HN" yet - I also sometimes find myself out of touch with the culture on this site, like he seems to be), but I can vouch for what he's saying. I find it difficult to read discussions about C on HN. A lot of the discussion reads as if a bunch of kids who seemingly learned javascript or ruby just yesterday are going way out of their element talking about C. An experienced and competent C programmer would not be surprised by anything in these slides, except perhaps the slide about the novel use of "static", because it's an obscure C99 feature that nobody really uses (in the same way that most people would not recognize that, for example, C99 specifies compiler support for complex numbers).


I am an experienced and self-proclaimed competent C programmer, and I was surprised enough by the "static" feature -- which I already love -- that I immediately upvoted this, and I'm going to spread this information after testing that it is actually usable with the common tools we use.

Sure, the other stuff was either UB or less interesting to experienced C programmers, but I'm not sure why that should be a problem. If you already know all of these, then you're not the intended audience. You can comment from a more experienced position, or just move along.

Bragging about knowing all of those and, even worse, claiming they are just silly UB, all in a condescending tone as dakimov did, is a very stupid thing to do.



