
C Programming Guidelines - colinprince
https://github.com/btrask/stronglink/blob/master/SUBSTANCE.md
======
ggreer
I'm the author of The Silver Searcher (aka ag)[1]. I wrote my first C program
almost 20 years ago, and I have some not-very-nice things to say about this
guide. I hope the author won't take it personally.

First, Zed Shaw's _Learn C the Hard Way_ is not worth reading. I agree with
most of Tim Hentenaar's criticisms.[2] It's confusing, incomplete, and overly
opinionated. If you want to get started with C, read the K&R book. It's not
perfect, but it's better than anything else I know of.

As for the guide itself, the quality of its advice varies greatly. Some bits
are great. Some are good. Some are controversial. Some are dangerously double-
edged. And some are Not Even Wrong. It would take too long to go into detail,
so I'll just say: Please don't blindly follow this guide. Explore other
codebases (including the ones the author linked to). Talk to other
programmers. Write your own projects, and get feedback from those who are more
knowledgeable.

You might notice that this advice generalizes to every language. That's
because C isn't special. If anything, the language itself is simpler than
most. C has just had more time to accumulate cruft, both technical and
cultural. So don't feel intimidated. Once you learn those bits of historical
trivia, you'll be fine.

1\. [https://github.com/ggreer/the_silver_searcher](https://github.com/ggreer/the_silver_searcher)

2\. [http://hentenaar.com/dont-learn-c-the-wrong-way](http://hentenaar.com/dont-learn-c-the-wrong-way)

~~~
btrask
Thanks for the comment. I'm the author of the article. I'd really
appreciate some more specific feedback, particularly the parts that you think
are dangerous or "not even wrong."

I'll take a look at the Silver Searcher's code and see what pearls of wisdom I
can gain.

~~~
ggreer
In hindsight, my comment seems rather curt and dismissive. Thanks for taking
my criticisms like a champ.

Honestly, I don't think ag is very good code. It's worth reading if you want
to see optimizations like memory-mapped I/O, pthreads, Boyer-Moore strstr,
etc. But if I had to grade the quality of the code, I'd give it a C-. Many
parts have grown organically, and I haven't had time to restructure them.
Also, ag's use case is not exactly mission-critical. Nobody is going to die if
it crashes or errors out. Add my time constraints and… the result is some
shoddy code.

I'll add "C Programming Guidelines" to my queue of blog posts to write, but I
make no guarantees about when I'll get to it. Sorry.

~~~
btrask
Cool, I'm looking forward to it.

------
asveikau
This seems like a pretty good list. Focus on error handling is spot-on, most
good C is just simply that.

One of my favorite tricks is to make your success case and your error cases
look largely the same.

Say you have N allocations and N things that can fail. Lots of people get into
trouble when they do:

    
    
        void *p, *q, *r, *s;
    
        p = a();
        if (!p) { return ERROR; }
        q = b();
        if (!q) { free(p); return ERROR; }
        r = c();
        if (!r) { free(p); free(q); return ERROR; }
        // etc..
    

And this madness just goes on and on. Add a new allocation? You potentially
have to change n cleanup blocks.

I breathe a lot easier when I can do [note we abuse the fact that free(NULL)
is a no-op]:

    
    
           void *p = NULL;
           void *q = NULL;
           void *r = NULL;
    
           p = a();
           if (!p) goto cleanup;
           q = b();
           if (!q) goto cleanup;
           r = c();
           if (!r) goto cleanup;
    
        cleanup:
           free(p);
           free(q);
           free(r);
    

What if the buffer at "r" needs to return something to the caller? It turns
out it's really easy to transfer ownership in this pattern:

    
    
           // Success case.  Give `r` to the caller.
           *callerPointer = r;
           r = NULL;
    
           // Then we fall through to the failure block...
           // r is null now so free(r) is harmless.
        cleanup:
           free(p);
           free(q);
           free(r);
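
Put together, the whole pattern might look like this (a minimal sketch; `a()`, `b()`, `c()` stand in for any fallible allocators, and `build` is a made-up name):

```c
#include <stdlib.h>

/* Stand-ins for the comment's a(), b(), c(). */
static void *a(void) { return malloc(16); }
static void *b(void) { return malloc(16); }
static void *c(void) { return malloc(16); }

/* Returns 0 on success and hands ownership of c()'s buffer to the caller;
   on any failure, returns -1 and every partial allocation is freed. */
int build(void **out) {
    int rc = -1;
    void *p = NULL, *q = NULL, *r = NULL;

    p = a();
    if (!p) goto cleanup;
    q = b();
    if (!q) goto cleanup;
    r = c();
    if (!r) goto cleanup;

    *out = r;   /* success: transfer ownership to the caller */
    r = NULL;   /* so the free below is a harmless no-op */
    rc = 0;

cleanup:
    free(p);    /* free(NULL) is a no-op, so no per-site cleanup blocks */
    free(q);
    free(r);
    return rc;
}
```

Adding a fourth allocation means adding two lines, not touching n cleanup blocks.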

~~~
mieko
This is The Way To Do It. I take minor issue with calling free(NULL) abuse,
though (I know you get it, but allow me to soapbox). I've known programmers
who get squeamish relying on this, as if it's a hack, even though it's been
mandated by the standard forever. It's like they believe someone would put
together a fully-functional libc and miss that simple requirement somehow.

I think the C standard library would be easier to work with if they would've
taken this further, and required, for example, fclose(NULL) to be a no-op.
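
For what it's worth, nothing stops you from writing that wrapper yourself (a hypothetical `xfclose`; the name is ours, not libc's):

```c
#include <stdio.h>

/* Gives fclose() the same NULL-tolerance the standard guarantees for
   free(): closing NULL is a successful no-op. */
int xfclose(FILE *fp) {
    return fp ? fclose(fp) : 0;
}
```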

~~~
asveikau
If you want to get nitpicky, you could also say there is performance cost to
jumping into libc.so for free, then having it do a null check for you, then
jump back to your code. Maybe that's worse than doing the null check more
"locally".

But I do make a point whenever I'm writing C, if I write a cleanup function
for a new type of struct or whatever, cleaning up null is always a no-op. It
just makes life slightly less tedious. Having to remember which functions do
this and which don't, I'll admit, is annoying.
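
A sketch of that convention, with a made-up `widget` type:

```c
#include <stdlib.h>

/* Hypothetical struct; the point is the cleanup contract. */
typedef struct widget { char *name; } widget;

/* Destroying NULL is a deliberate no-op, mirroring free(NULL), so
   callers can clean up unconditionally. */
void widget_destroy(widget *w) {
    if (!w) return;
    free(w->name);
    free(w);
}
```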

~~~
lucozade
I would venture that, if a few (essentially no-op) .so function calls are
likely to affect performance to a point that you care, allocating and de-
allocating heap memory under those same conditions might be a bit more of an
issue.

Depends on the detail of the use case of course, but assuming that NULL
cleanup is more the exception than the rule this is definitely a case of
profile it first for me.

------
hyc_symas
"Put one statement conditionals on same line as condition: if (rc < 0) goto
fail;"

This is bad advice. A source-level debugger only works in source line numbers;
if you ever need to set a breakpoint to find out when an error condition
occurs you need to be able to set it on the action statement "goto fail" and
not be bothered by the breakpoint tripping every time the if() gets evaluated.

Always separate the conditional from the statement.
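
Concretely, with the action on its own line the debugger can break on the error path alone (sketch):

```c
/* Separating the action from the condition gives the debugger a line of
   its own: a breakpoint on the "goto fail;" line fires only when the
   error path is actually taken, not on every evaluation of the if. */
int check(int rc) {
    if (rc < 0)
        goto fail;      /* break here: trips only on error */
    return 0;
fail:
    return -1;
}
```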

~~~
fishanz
I totally agree, and that immediately popped out at me. When I started coding I
would do this to 'save space', until burned by the breakpoint as you point
out. Then I started putting it on the next line, indented, until a more
seasoned developer told me about the gotchas of not having braces around a
'one-liner'. So now what used to take one line takes 4 lines, but I'm sure it
will circumvent a 'goto fail' bug at some point!

~~~
ufo
It's only 3 lines if you use the One True Brace Style.

~~~
fishanz
Well, when I was learning c, Egyptian brackets apparently weren't even a
thing. I have co-opted their use in certain places - I think it's a nice way
to easily identify a lambda function, but I don't think I would ever use it in
pure c. I simply like the way braces line up even if it takes up an extra line
.. Kinda like reading HTML, you'd never put an opening tag at the end of a
line, right? Anyway, I'm sure this topic has been exhausted a long time ago in
a different post.

------
caf
_Use signed char for strings..._

What? No. Use plain "char" for strings (which may be signed or unsigned,
depending on the environment).

If you use "signed char" in an environment where plain "char" is unsigned then
a comparison like:

    
    
      string[i] == '\xff'
    

will always be false.
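
A sketch that shows the mismatch without depending on the platform: exactly one of the two comparisons below is true, in opposite directions on signed-plain-char platforms (typical x86) and unsigned-plain-char platforms (e.g. ARM Linux):

```c
/* '\xff' is an int whose value follows plain char's signedness:
   -1 where plain char is signed, 255 where it is unsigned. */
int xff_mismatch(void) {
    signed char sc = '\xff';    /* -1 on two's complement */
    unsigned char uc = '\xff';  /* 255 */
    /* One comparison matches the literal, the other never can. */
    return (sc == '\xff') != (uc == '\xff');
}
```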

~~~
metromint
It won't make a difference for strings. But in C you can use a char to do
math, and there it will make a difference.

In fact, when working in constrained memory environments, like embedded 8-bit
applications, a char will often be used to do math, and then it makes a big
difference. This is because there is no byte type by default in C.

~~~
nitrogen
_This is because there is no byte type by default in C._

There is in C99 and newer: _int8_t_ or _uint8_t_.

~~~
Tomte
No, char (not signed char, unsigned char, uint8_t or int8_t) is the "byte
type". By definition.

~~~
nitrogen
CHAR_BIT must be _at least_ 8, but a char is not necessarily the "byte" the
original comment meant. The C standard defines "byte" differently from current
common usage. To get exactly 8 bits of the desired signedness (which may not
be possible on systems with CHAR_BIT greater than 8), one must use _[u]int8_t_.

\----

From a draft version of the C standard:

The typedef name intN_t designates a signed integer type with width N , no
padding bits, and a two’s complement representation. Thus, int8_t denotes such
a signed integer type with a width of exactly 8 bits.

The typedef name uintN_t designates an unsigned integer type with width N and
no padding bits. Thus, uint24_t denotes such an unsigned integer type with a
width of exactly 24 bits.

These types are optional. However, if an implementation provides integer types
with widths of 8, 16, 32, or 64 bits, no padding bits, and (for the signed
types) that have a two’s complement representation, it shall define the
corresponding typedef names.
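
For illustration, `uint8_t` arithmetic behaves like the 8-bit "byte math" the original comment wants (a small sketch):

```c
#include <stdint.h>

/* uint8_t arithmetic wraps modulo 256: the operands promote to int,
   and the result truncates on conversion back to uint8_t. This is
   often exactly what embedded 8-bit math wants. */
uint8_t add8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}
```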

~~~
Tomte
Yes, exactly. Why are you telling me this?

If we're talking about C, we should use the technical definition of "byte" _as
used by C_.

Everywhere else, sure, let's view byte as a 8 bit entity. That's so
predominant that everything else would be nit-picking.

But not in the context of the C language.

~~~
nitrogen
I was commenting more for the benefit of the original commenter who said there
is no "byte" type.

------
geofft
The security brag list at the bottom linking qmail and seL4 is a little funny.

qmail has the property that (if it is indeed secure, which is a disputed
claim) it's secure by virtue of djb's superhuman awesomeness. I'm happy to
believe that djb can come up with secure elliptic curves, but I'm under no
delusions that reading all his papers and coming up with my own elliptic
curves is in any way a good idea. Similarly, even if I take the djb way of
software writing to heart, I don't see a reason to believe that _I_ could
write software with qmail's level of robustness. I am just not a world-class
expert the way that he is.

seL4, meanwhile, has a formal model of the C programming language in the proof
assistant Isabelle/HOL, a formal model of what they want their C codebase to
do (and not do), and a formal proof that, in fact, their C codebase has the
properties they hope that it has. The C is handwritten, but it may as well not
be; you can't do feature additions, significant refactoring, etc. to the C
without needing to replicate those all in a language that is completely unlike
C.

If these are the best arguments in favor of writing secure software in C, you
don't really need any counterarguments....

~~~
pjmlp
If you read the history of C, written by Ritchie, he says that Johnson created
lint because it was already clear in 1979 that not everyone was able to write
correct C code.

And here we are with millions of dollars/euros/yen/... spent on all manner of
analyzers, theorem provers, and hardware extensions (MPX, CHERI) to work around
language flaws.

------
jmspring
Ugh --

Older C isn't better when it comes to security and avoiding bugs

K&R C is much beloved, but that innocent era is gone

We can't afford to fight each other, or C will really die

\--

Older doesn't mean there wasn't error handling and comprehensive bug checking.
Some of us have been writing defensive code for years. Age of "style" does not
imply approach to security. OpenBSD has a great deal of C code and is security
conscious.

Really, articles like these are superfluous given that there are shitty
coders out there in any language.

~~~
pjmlp
There is only one that allows for security exploits by design.

------
okasaki
> Immediately clear pointers after freeing

This can be misleading...

    
    
        void foo(char *buf) {
            ...
            free(buf); 
            buf = NULL;
        }
    
        void bar() {
            char *b = malloc(123);
            foo(b);
    
            // b isn't NULL...
        }
    

My advice is: always compile and run with AddressSanitizer. Take the time to
fix errors/leaks as soon as possible.
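
If you actually want the caller's pointer cleared, the callee has to take a pointer to it (a sketch; `free_and_clear` is a made-up name):

```c
#include <stdlib.h>

/* Taking a char ** means the caller's own pointer is the one being
   cleared; assigning NULL to a by-value copy, as in the snippet above,
   never reaches the caller. */
void free_and_clear(char **p) {
    free(*p);
    *p = NULL;
}
```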

~~~
weland
Why are you freeing b in foo?

It's confusing if you're mallocing in one scope and freeing it in another one
(foo is _especially_ confusing, besides being a disaster ready to happen). But
if you write bar() like this:

    
    
        void bar() {
            char *b = malloc(123);
    
            foo(b);
    
            free(b);
            b = NULL;
        }
    

it looks all right.

Not that I'm advocating this kind of use. I have... mixed feelings about it.
But I've seen it used and it doesn't have to be confusing.

Its primary merit is that it makes use-after-free cases immediately obvious,
since they're going to trigger a segfault.

But there are better ways to achieve that.

~~~
okasaki
It's just a minimal example.

> Its primary merit is that it makes use-after-free cases immediately obvious,
> since they're going to trigger a segfault.

But then it hides double free errors.

    
    
        int count = 0;
    
        void foo_destroy(struct foo **f) { free(*f); count--; *f = NULL; }
    
        void bar() {
            struct foo *f = malloc(sizeof(struct foo));
    
            foo_destroy(&f); // ok
            foo_destroy(&f); // no error, but count is now wrong
        }

~~~
weland
Unless I'm reading your example incorrectly, it's pretty easy to catch those
if, in foo_destroy, you ensure that you don't try to free a null pointer.
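
For example (a sketch; in debug builds the assert turns a second destroy into an immediate failure at the offending call site, and defining NDEBUG removes the check):

```c
#include <assert.h>
#include <stdlib.h>

struct foo { int x; };

/* The first destroy nulls the caller's pointer; a second destroy sees
   *f == NULL and the assert fires instead of silently double-counting. */
void foo_destroy(struct foo **f) {
    assert(*f != NULL && "double free of struct foo");
    free(*f);
    *f = NULL;
}
```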

> It's just a minimal example.

I know -- but it tends to be confusing if you free() something in another
scope than the one where you're malloc()-ing it. That's why your example, even
if minimal, seemed confusing.

The obvious exception is if you have some reference counting mechanism/GC
mechanism in place -- but then the wise thing to do is to _not_ manually touch
pointers all over the code.

------
zatkin
Personal preference does not imply that it's good, especially for things like

"Single line if possible: if(rc < 0) return rc;"

~~~
voltagex_
Lines like that cause conditions to be missed (e.g. Apple's goto fail bug)

    
    
        if (rc < 0)
        {
          return rc;
        }
    

always IMO.

~~~
adrusi
Only if they're not on the same line though. When you have

    
    
        // ...
        if (rc < 0)
          return rc;
        // ...
    

It's easy to confuse the indentation for a block and end up with something
like the Apple bug.

But when you don't break the line there's no confusion:

    
    
        // ...
        if (rc < 0) return rc;
        // ...

~~~
fredophile
I don't see how this is any less confusing.

    
    
      if (rc < 0) return rc;
        goto fail;
    

is no less confusing to me than

    
    
      if (rc < 0)
        return rc;
        goto fail;
    

Both can easily be misunderstood when you're quickly scanning through a bunch
of code. Screen space isn't at such a premium that you can't afford a couple
extra lines in a conditional statement.

~~~
adrusi
How would you end up with

    
    
        if (rc < 0) return rc;
          goto fail;
    

Why would you indent that next line? Presumably because you were trying to add
it to the existing if statement's non-existent block. But why would you do
that if you don't see a block?

The problem with

    
    
        if (rc < 0)
          return rc;
    

is that we see the indentation and infer that there must be a block, so it's
tempting to try to add to the block.

When you keep your entire conditional statement on one line, it looks, to
someone scanning the code, like just any other line. It doesn't look like a
block at all, in fact it looks more like a function call.

~~~
fredophile
> Why would you indent that next line?

There are lots of reasons someone might indent the next line. Maybe they were
in a rush. Maybe they were distracted by something else while they were
working. Maybe they were tired. Maybe they were drunk.

Mistakes will happen. Ideally wrong code will look wrong. All of the following
examples do the same thing. If you know the code is supposed to call Foo() and
Bar() if bVar is true, which of these examples make it obvious that something
is wrong?

    
    
      if (bVar) Foo();
        Bar();
    
      if (bVar)
        Foo();
        Bar();
    
      if (bVar)
      {
        Foo();
      }
        Bar();
    
      if (bVar)
      {
        Foo();
      }
      Bar();
    

The third, followed by the fourth, makes the mistake most obvious to me.

------
jasonwatkinspdx
"Use struct thing thingptr[1]; to get pointers with inline storage ... You
should pretty much always use -> instead of . ... I think this is the most
important trick for making C code beautiful"

I don't really get what they're after with this section. Can anyone shed some
light?

~~~
andymurd
It is similar to:

    
    
        struct thing athing;
        struct thing *thingptr = &athing;
    

Then there should be no confusion as to whether you write:

    
    
        athing.member = 12;
    

or:

    
    
        thingptr->member = 12;
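
The `[1]` trick gets you both at once: stack storage plus a name that decays to a pointer (sketch):

```c
struct thing { int member; };

/* An array of one element: the object has automatic ("inline") storage,
   but the array name decays to a pointer, so -> works everywhere, just
   as it would for a malloc'd struct. */
int demo(void) {
    struct thing thingptr[1];
    thingptr->member = 12;      /* same as thingptr[0].member = 12 */
    return thingptr->member;
}
```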

~~~
fsloth
If I saw a declaration like that in a code I was reviewing I would politely
ask not to use this pointless level of indirection. This adds a tiny bit of
complexity to the code and gives nothing back.

~~~
nitrogen
It can also turn a single memory access into two memory accesses if the
structure and the pointer can't be coalesced by the optimizer.

------
mahyarm
If all it takes is one human screw-up with 3 lines of code to get full remote
execution, you will inevitably get security bugs with that code. A best-
practices guide will not work, because people will not execute on their best
practices 100% of the time. A language specification, or at least a linter,
will enforce it 100% of the time in a practical comparison. This is why people
want to sunset C-family languages.

~~~
tmuir
I do mostly embedded programming, which is overwhelmingly dominated by C. The
reason I think C will probably never be replaced as the dominant low level
language is inertia. Every microprocessor manufacturer knows they have to
provide drivers, libraries, and board support packages for their parts. If
they don't, no one will use those processors, because of the extra effort
involved, and the fact that other companies do provide them. All of these
drivers, libraries, and board support packages are written in C. Writing them
in something other than C is basically equivalent to not writing them at all,
because instead of forcing developers to reinvent the wheel, you're forcing
them to use a language most of them don't know. In the end, it's the exact
same problem: a large amount of effort required to get to the point where you
can start implementing your application.

This same logic applies to real time operating system vendors. Everything is
written in C. The minute they start writing their code in a different language
is the minute their competitors become that much more attractive. On top of
that, one of the major selling points of the major RTOSes is that they have
mature code bases with decades of bug fixes and products using them.

Perhaps this presents an opportunity for disruption, but it seems like a steep
hill to climb.

------
woodman

      Performance
        The restrict keyword should be used rarely if ever
    

wut?

~~~
btrask
It's usually penny wise, pound foolish.

~~~
woodman
In the context of performance, how so?

------
haberman
> Do not worry about 1% speedups. SQLite has gotten significantly faster from
> lots of ~1% boosts ...And it'll never be as fast as MDB

What MDB is being referred to here?

~~~
sleepydog
If I had to guess from looking at his git repo, he's referring to LMDB:

[http://symas.com/mdb/](http://symas.com/mdb/)

sqlite and lmdb aren't exactly equivalent, but I guess that is his point; you
can gain a lot of performance just by understanding your use case and what
tradeoffs you can make.

------
jimjag
Bad C programmers will simply fault C itself for their own deficiencies. I
find most of this advice really bad and not recommended.

------
deathanatos
> Use `signed char` for strings

If you're dealing with POSIX / the C stdlib, mirror it and use char* . (The
string version of "Respect the integer types used by each API."; if you're
coding for Windows you'll likely need to use LPWSTR or whatever they call it
now.) But: if you're storing UTF-8 string data, you _want_ unsigned char (and
maybe uint8_t[2]): if you start trying to inspect the code units, esp. to
decode them to code points, you're going to need to do comparisons and bit
manipulations on them, and it's too easy to write `code_unit < 0x80` or
something similarly silly (e.g., while masking bits to see if it's a
continuation byte). It's like the article earlier today called "Should I use
Signed or Unsigned"[1] states: "Prefer unsigned when using bitwise operators".
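
For example, a minimal code-point counter only behaves as intended when the code units are viewed as unsigned (sketch; `utf8_len` is a made-up helper):

```c
#include <stddef.h>

/* Counts UTF-8 code points by skipping continuation bytes (0b10xxxxxx).
   The cast to unsigned char makes the mask-and-compare behave the same
   regardless of plain char's signedness. */
size_t utf8_len(const char *s) {
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        if ((*p & 0xC0) != 0x80)    /* not a continuation byte */
            n++;
    return n;
}
```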

> Limit recursion

Recursion is fine; you simply need to be aware of your worst case stack depth.
You're looking, really, for the complexity of your stack usage, or the big-oh
of your stack size. If you're doing something like recursing down an in-memory
_balanced_ binary tree, it's O(log(n)) in the worst case: ~64 stack frames on
a 64-bit CPU. Can you fit that many on the stack? Certainly that's heavier
than a normal function call, but it's still a finite, computable size.

If you can't compute that / prove an upper bound, then I agree.

> Aside: any language that uses keywords like public and private to determine
> visibility instead of placement/scoping is bad

If not with keywords, how should visibility be determined? (How does
"placement/scoping" determine visibility?)

> Use get/init or create/copy/alloc in function names to indicate ownership
> versus borrowing for return values

Inevitably, I feel like you end up creating some structure, let's call it a
Foo. So you then have a Foo * Foo_create(), a Foo_destroy(Foo * ). Some action
performable on Foo? Foo_reticulate_splines(Foo* , …). Combined with,

> You should generally track used and unused space in dynamic arrays,

and the overhead of writing functionality to deal with dynamic arrays in C,
the entire section about managing strings, and you've come to the very reason
I left C: I'm already writing C++. Foo_create is a ctor; Foo_destroy a dtor;
Foo_reticulate_splines a member function. Combine that with vectors & strings
taking care of day-to-day memory management, and I'm left with little reason
to use C.[3] RAII adds some degree of object lifetime management that simply
isn't present in C (but does not add the compile-time checks that I'm starting
to love Rust for.)
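
The pattern being described, sketched out (Foo and reticulate_splines are the placeholder names from the comment):

```c
#include <stdlib.h>

typedef struct Foo { int splines; } Foo;

Foo *Foo_create(void) {                 /* "ctor": caller owns the result */
    return calloc(1, sizeof(Foo));
}

void Foo_reticulate_splines(Foo *f) {   /* "member function" */
    f->splines++;
}

void Foo_destroy(Foo *f) {              /* "dtor"; NULL-safe like free() */
    free(f);
}
```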

[1]: [http://blog.robertelder.org/signed-or-unsigned-part-2/](http://blog.robertelder.org/signed-or-unsigned-part-2/)

[2]: I know C++ got some decent types for Unicode _very_ recently; I'm not
aware if C got those.

[3]: I am _not_ suggesting C++ to be the language you _should_ be using. I
simply cannot justify ever using C over it.

~~~
pjmlp
I never liked C.

Back when I got to learn it, around 1993, I had already used Turbo Pascal 3,
5.5 and 6.0.

C seemed so primitive and unsafe in regards to Turbo Pascal features.

However the same person that got me into C, also showed me C++.

There it was, a language that could be used as safely as Turbo Pascal,
provided one made use of the language features instead of the C copy-paste
compatibility.

When Turbo Pascal stopped being a possibility, I joined ranks with the C++
side, in the C vs C++ flame wars.

Even taught C++ at the university as a senior student.

Only used C instead of C++, when asked to do so by university teachers or
customers.

------
jeffreyrogers
For memory allocation you often don't need to use malloc. Many (most?)
programs can mmap a large region of memory and allocate from that. Then when
you're done you just unmap the region. This is much easier to use since you
only have to remember to unmap once as opposed to handling each allocated
pointer individually.

Edit: Here is a paper that touches a bit on what I'm describing:
[http://people.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf](http://people.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf)

Also, to clarify, I'm not saying allocate yourself a big piece of memory and
then reimplement malloc on top of that. You literally just get a big piece of
memory and every time you want to allocate something new from it add the
offset to the base address, return that to the user, and store the new offset.
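
A minimal sketch of what's being described (all names here are illustrative, and `malloc` stands in for `mmap` to keep the sketch portable):

```c
#include <stddef.h>
#include <stdlib.h>

/* A region ("bump") allocator: hand out successive offsets from one
   big block; freeing the region frees everything at once. */
typedef struct { char *base; size_t used, cap; } region;

region region_create(size_t cap) {
    region r = { malloc(cap), 0, cap };
    return r;
}

void *region_alloc(region *r, size_t size) {
    size = (size + 15) & ~(size_t)15;       /* keep 16-byte alignment */
    if (!r->base || r->cap - r->used < size)
        return NULL;                        /* region exhausted */
    void *p = r->base + r->used;
    r->used += size;                        /* bump the offset */
    return p;
}

void region_destroy(region *r) {            /* one call frees everything */
    free(r->base);
    r->base = NULL;
}
```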

~~~
rc4algorithm
No no no no no. When you do this, you are:

1) Reinventing malloc for no good reason

2) Opening yourself up to a whole world of security holes

See Bob Beck's talk on the first 30 days of LibreSSL. Heartbleed wouldn't have
been nearly as catastrophic if OpenSSL just used the system memory allocator
like normal people.

~~~
jeffreyrogers
I don't understand how this could lead to a security hole any more easily than
using malloc could. Can you give an example? The system memory allocator is
using mmap internally anyways (on Linux and anywhere else using the Doug Lea
allocator at least) and a simple region based allocator is only a couple dozen
lines of code.

The semantics are simple as well:

    
    
        region create_region(size_t size);
        void * allocate(region *r, size_t size);
        void destroy_region(region *r);
    

Edit: also, you're not reinventing malloc for no good reason. A region based
allocator is much faster, possibly at the expense of using more space than the
system allocator (because you might reserve more space than you actually
need).

~~~
adrusi
The point is that when you use malloc, an attacker exploiting a buffer overrun
can't easily guess at the offset they need to find useful data. When you
allocate everything in a big region, your process can be expected to read past
the object bounds, so an attacker might be able to probe the address space
(also the memory layout might be more deterministic). If they try to probe
like that in a program that allocates with malloc, they'll just segfault the
program.

~~~
jeffreyrogers
Yes, this is a reasonable objection.

