Teaching C (regehr.org)
368 points by mpweiher on May 10, 2016 | 152 comments



In my quest to learn C very well over the past few years, I've come to the conclusion that C is best understood if you think about it in terms of the way that an assembly language programmer would think about doing things. An example of this would be if you consider how switch statements work in C. Switch statements in C don't really compare to switch statements that you find in other languages (eg. https://en.wikipedia.org/wiki/Duff%27s_device).
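
For readers who haven't run into it, here is a rough sketch of Duff's device in its classic formulation. It works because a C switch is essentially a computed jump, so case labels may legally land inside another control structure:

    /* Sketch of Duff's device: case labels land inside a do/while.
     * Classic use case: "to" is a memory-mapped output register, so it is
     * deliberately not incremented.  Assumes count > 0. */
    void send(short *to, short *from, int count)
    {
        int n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }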

The issue that many students face in learning low-level C is that they don't learn assembly language programming first anymore; they come from higher-level languages and move down. Instead of visualizing a Von Neumann machine, they know only of syntax, and for them the problem of programming comes down to finding the right magic piece of code online to copy and paste. The ideas of stack frames, heaps, registers, and pointers are completely foreign to them, even though they are fundamentally simple concepts.


> I've come to the conclusion that C is best understood if you think about it in terms of the way that an assembly language programmer would think about doing things.

I don't agree. That leads people to incorrect conclusions like "int addition wraps around on overflow" (mentioned in the article), "null pointer dereferences are guaranteed to crash the program", "casting from a float pointer to an int pointer and dereferencing it is OK to do low-level tricks", and so forth. C is a language that implements its own semantics, not the semantics of some particular machine. Confusing these two ideas has led to lots and lots of bugs, many detailed in John's other blog posts.

It might be useful to teach these intuitions to beginning programmers who already know assembly language before learning C (though are there more than a vanishingly small number of those anymore?). But teaching assembly language as part of, or as some sort of prerequisite for, teaching C strikes me as a waste of time and likely to lead to wrong assumptions that students will eventually have to unlearn.


I agree with both of you to some extent. Despite the fact that C implements its own semantics, those semantics are downright bizarre and hard to gain intuition about unless you have some mental model of the machine.

For example, here are some of the things that confused me when I first learned C:

    - why isn't there an actual string data type? (just char*)
    - why do some people use "char" to store numbers?
    - whats the deal with pointers?
    - why are pointers and arrays kinda sorta interchangeable?
Until I learned how things work at the assembly language level, I could not gain an intuitive understanding for why C works the way it does.


Understanding that requires understanding that the program's memory is an array of bytes where everything is stored. You do not need to know assembly to know that.


You don't really need to know assembly to understand those, though. Usually classes will go over the basics - how things are stored in memory, the stack and the heap, etc., and that's enough to answer those questions.


If all you tell me about the machine is its memory layout, you haven't told me nearly enough to explain the oddities of C.

For example, I could imagine a machine with identical memory layout to C, but that supported a hugely parallel, variable-size data bus where an operation like "x = y" (string assignment by value) could copy an entire string in a single operation.

The reason C doesn't support this is because generally at the assembly language layer you can only operate on memory in register-sized chunks, and every load or store of a register's worth of memory takes time. So assignment of a string by value requires a loop, just like it does in C.
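
To make that concrete, here is a hedged sketch of what string assignment "by value" has to look like in C, essentially the loop that strcpy performs under the hood:

    /* Copying a string's contents requires a loop over its bytes,
     * because the length is only discovered at run time by finding '\0'.
     * (Assumes dst is large enough to hold src.) */
    char *copy_string(char *dst, const char *src)
    {
        char *d = dst;
        while ((*d++ = *src++) != '\0')
            ;   /* one byte per iteration, including the terminator */
        return dst;
    }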


I'm not sure I understand your example. Knowing how arrays (or pointers) are handled in C is enough to understand why you can't have string assignment by value. How C treats things is all you really need to know if you want to use the language. This isn't just true for C. In my experience, people tend to have more difficulty with string assignment in Java than C, but you usually don't hear people say that they need to go to a lower level to really understand Java.

Understanding the why is interesting - like it is for any language. And like in many languages, the why tends to be complicated and somewhat arbitrary at times. If you're really interested in the why rather than the what, a book on the history of C will probably be more useful than learning assembly.


> Knowing how arrays (or pointers) are handled in C is enough to understand why you can't have string assignment by value.

True, but I think knowing how arrays (and pointers to arrays) work in C is one of the main hurdles for a lot of people who are just starting out.

This became especially apparent to me after recently helping my friend get familiar with C.

I can see how some people can get confused by this when you consider the fact that structures can be copied using a straight assignment, while arrays can't.

People naturally try to find similarities when learning something new, so it took a little while for him to _really_ get it. I think his mind kept trying to think of a structure essentially as an array of variables, when that's not really the case.


It is not easy, but so far the best way to wrap a newcomer's head around most of C's oddities (in this context) is to explain two things: 1) memory location and size, 2) run time vs. compile time.

Then it becomes apparent why one cannot copy strings "by assignment" but can copy structs: it is in general impossible to know the runtime size of a string at compile time. C strings have no structure known at compile time. Structures, on the other hand, are there to enforce structure on data.

There is a pretty neat real-world analogy here: the copy machine. A string copy must be done character by character, in the same way that a book or a document folder would have to be copied page by page. An engineering drawing, on the other hand, must be copied whole. It may contain references to other drawings, and you can still, with some struggle, extract individual parts, but it is copied as a whole. This analogy relies on the fact that drawings are single-page, but it nicely encapsulates the "strings are arrays are pointers" idea: the folder may be empty or may be a single page, but it is impossible to know without checking.


    struct { char name[30]; } tgt, src = { "Einstein" };
    tgt = src; 
Since there are no strings, it cannot be true that strings are pointers. They can be arrays, and array sizes are known at compile time. It's just a 40-year old cop-out that we can't copy by assignment all types based on their `sizeof` size.


> <...> it is in general impossible to know runtime size of string at compile-time <...>

You are stepping on the same rake as beginners: generalizing from a specific case instead of applying the general case to specific circumstances. I stressed the phrase "in general". You may treat it as a cop-out, or you can say that there are no special-case semantics here. Not sure if it was intentional, but your example is rather tricky, until you step out of the box and see that we are no longer dealing with strings/arrays/pointers here, but with structs, which have somewhat different semantics.

> Since there are no strings

Yes, there is no explicit string type in the language, but somehow we do use strings in C. Semantics. We can semantically treat a particular block of memory as a string, time series, binary tree, etc. There is simply no special case (explicit language support) for strings.

> it cannot be true that strings are pointers. They can be arrays, and array sizes are known at compile time.

What about `malloc`? What about passing arrays between compilation units? I have covered this in SO[1]. Note that I never explicitly pass pointers, yet `sizeof()` thinks I do. Array sizes can, in some circumstances, be known at compile time in a specific program block, but not in general.

I'd say there are roughly three types of languages (core language, no stdlib, etc.) in this context: 1) those that provide common special-case exceptions, 2) those that wrap all cases in an easy-to-use interface, 3) those that provide general-case syntax. Examples: 1) languages with `=` and `eq` (Perl?), 2) languages with object identity (Python?), 3) C.

[1]: http://stackoverflow.com/questions/19589193/2d-array-and-fun...
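
A minimal illustration of that point (hypothetical function names): inside the callee, the "array" parameter has already decayed to a pointer, which is what sizeof reports.

    #include <stdio.h>

    /* The parameter is declared as an array, but it is really an int *. */
    static void callee(int arr[10])
    {
        printf("callee: sizeof(arr) = %zu\n", sizeof(arr));  /* sizeof(int *) */
    }

    int main(void)
    {
        int arr[10];
        printf("caller: sizeof(arr) = %zu\n", sizeof(arr));  /* 10 * sizeof(int) */
        callee(arr);
        return 0;
    }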


> For example, I could imagine a machine with identical memory layout to C, but that supported a hugely parallel, variable-size data bus where an operation like "x = y" (string assignment by value) could copy an entire string in a single operation.

You mean like x86 (at least as far as the ISA is concerned) [1]? :)

I mean, this isn't just me being pedantic and annoying: I think it goes to show that C's machine is quite different from a real machine.

[1]: http://x86.renejeschke.de/html/file_module_x86_id_279.html


I thought about bringing this up as well. I'm not sure what you mean by "C's machine is quite different from a real machine". Do you mean (as I was thinking) that even assembly language is frequently not close enough to the machine to be used to guide the programmer trying to write high efficiency code?

I'm constantly surprised by how poorly documented the actual operation of current processors is, and how few people seem to care. In one way, this means that the abstraction is working, and no longer does anyone need to look behind the curtain. In another way, like the move to teaching only higher level languages, it feels like something essential is being lost.


What I mean is that C is defined in terms of a virtual machine with absolutely no restrictions on what happens if you step outside the boundaries of that machine's defined operations. That's in contrast to real machines, which typically have much less undefined behavior.


I actually thought of that when I wrote my comment. But still, notice that (1) the instructions take as their input registers that point to the data (ie. a char pointer), not some machine-level idea of a "string", and (2) the cost of these instructions is still O(n), even though you don't have to write the loop manually.


In a C compiler implemented for that architecture, copying a value of memory line size could very well use an instruction that does it all at once if the registers are large enough or a DMA intrinsic is exposed. Really that's the point. C is like a near-asm language that standardizes across ISAs with the opportunity for the compiler, libraries, or programmer to do something more clever on a specific system. It requires dedication and patience, but in the end is generally a good middle ground for low level work.


Copying a string by doing x = y is rare among languages. Most copy references like C does. An example of an outlier that does a string copy like you want would be C++ on its string class.

You can copy/initialize strings in C without ever writing a single loop by using strcpy. As for hardware, you need not have something so exotic. x86 for instance has a string copy instruction. The C strcpy function is often compiled to it.
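
A small sketch of the no-explicit-loop version; the loop still happens inside the library, possibly as a single string-copy instruction on x86:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char src[] = "hello";
        char dst[sizeof src];      /* destination must be at least as large as src */

        strcpy(dst, src);          /* copies every byte up to and including '\0' */
        printf("%s\n", dst);
        return 0;
    }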


Is strcpy not implemented with a loop?


To add to the oddities:

You can copy entire structures by value by doing "x = y".

This ends up being implemented as a memory copy loop, but I'm not sure what happens with the padding bytes. There's a chance they get copied as well, but I really don't know.

In the code base I work with, it's common to see arrays inside of a structure which they're the only member of. This makes it a little easier to copy them, though I'm not sure if that was the intended purpose.

Something like this would be defined in a header...

    typedef struct { char myArray[50]; } Test_Struct_T;
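
A minimal sketch of how that wrapper gets used (names taken from the typedef above): the struct assignment copies the embedded array, which a bare array assignment cannot do.

    #include <stdio.h>

    typedef struct { char myArray[50]; } Test_Struct_T;

    int main(void)
    {
        Test_Struct_T a = { "hello" };
        Test_Struct_T b;

        b = a;                      /* copies the whole 50-byte array member */
        printf("%s\n", b.myArray);

        /* By contrast, bare arrays are not assignable:
         *     char x[50], y[50];
         *     y = x;               -- error: array type is not assignable
         */
        return 0;
    }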


It's not specified what happens to the memory padding bytes on structure assignment.

Hence: you can't use memcmp() to reliably compare structures for equality.
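
A hedged sketch of why, using a hypothetical struct: padding bytes can hold anything, so a reliable equality test compares fields rather than raw bytes.

    struct pair {
        char tag;      /* typically followed by padding bytes */
        int  value;
    };

    /* Compare field by field; memcmp() would also compare the
     * indeterminate padding bytes between "tag" and "value". */
    int pair_equal(const struct pair *a, const struct pair *b)
    {
        return a->tag == b->tag && a->value == b->value;
    }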


> You can copy entire structures by value by doing "x = y".

That wasn't in the original language, though it's a pretty old extension.


> That wasn't in the original language, though it's a pretty old extension.

That's why I don't recognize that feature. I guess we didn't have it back when I was doing a lot of C.


I believe C++14 adds the first sane C-family array type:

    template<typename T, size_t N> class array { T data[N]; };


std::array is in C++11


> - why are pointers and arrays kinda sorta interchangeable?

Because apparently writing ptr = &arr[0] like in older systems programming languages was too hard to implement.


Why do some people use chars to store numbers?


Because unlike other programming languages, C didn't define a type called byte, so developers were forced to use char for what is a byte in saner systems programming languages.

With the caveat that unless you precede them with signed/unsigned modifiers, it is not portable.


Because uint8_t is comparatively new. C99, IIRC, though many OSes defined their own earlier.


They need a char-sized number, and memory used to be a tightly-budgeted commodity.

Reminds me of this story of a game developer who magically got their game under the limit at the 11th hour

http://www.dodgycoder.net/2012/02/coding-tricks-of-game-deve...


Although char should not be used to store numbers, since its signedness is implementation-defined (gcc has the switch -funsigned-char, I think, to make plain char unsigned instead of signed). signed char/unsigned char are OK.
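
A tiny sketch of that portability trap; whether this prints -1 or 255 depends on the implementation (and on flags like gcc's -funsigned-char):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        char c = (char)0xFF;
        /* Plain char may be signed or unsigned; CHAR_MIN tells you which. */
        printf("c = %d, CHAR_MIN = %d\n", c, CHAR_MIN);
        return 0;
    }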


You're right that trusting the assembly intuitions leads to danger when confronted with a standard compliant optimizing compiler. But on the other hand, unless one understands how processors actually execute code, I don't think it's possible to write high performance C.

And if one isn't writing for high performance, probably one shouldn't be using C. I like the suggestion in a sibling about teaching C in the context of programming a microcontroller. I think this might bridge the gap a bit: encourage the right intuitions, without creating dangerous misconceptions.


Very true. I counter this cross-platform assembly notion whenever I can, as they're not the same thing. One reason for C's weirdness, and the only assembly it is really tied to, is that it was specifically designed to utilize a PDP-11 well. It and anything depending on it is mapped to what makes sense on a PDP-11. We don't use PDP-11's today.

So, the C model neither makes sense for nor matches the assembler of today's CPUs. We've certainly developed ways to implement it efficiently. It wasn't designed for that, though.


Why is the PDP-11 so different from today's CPUs, and what about it made C implementations for it more efficient? The only thing I can think of is postincrementing registers. Other than that, the instruction set seems to me to be remarkably close to what you could see on a contemporary CPU.


A contemporary x86, RISC, mixed (ISA + accelerators), or what other CPU? I think CPU is a broad term. :) Anyway, Wikipedia has a detailed write-up that assembly experts can base a comparison on:

https://en.wikipedia.org/wiki/PDP-11_architecture

It wasn't that the PDP-11 made C implementations more efficient. It's that C was a BCPL derivative specifically designed to compile easily and run fast on their PDP-11. That's why I can't overemphasize C's actual history vs. the lore that people repeat. It's literally an ALGOL-family language with every feature that couldn't compile on 60's and 70's era hardware chopped off, with some extensions added later.

http://pastebin.com/UAQaWuWG

Worked fine for a PDP-11. Yet, forcing its memory model or tradeoffs into a language used on different hardware can cause unnecessary problems. In contrast, Hansen's Edison language deployed on PDP-11 had only five statements (extreme simplicity haha) but would map efficiently to most architectures. As would Pascal and Modula-2 that inspired it & were safer.

http://brinch-hansen.net/papers/1981b.pdf

https://en.wikipedia.org/wiki/Modula-2


> A contemporary x86, RISC, mixed (ISA + accelerators), or what other CPU? I think CPU is a broad term. :) Anyway, Wikipedia has a detailed write-up that assembly experts can base a comparison on:

Contemporary x86 and RISC CPUs are what I was comparing the PDP-11 instruction set to. I don't see any fundamental differences. Painting with very broad strokes, the PDP11 ISA looks reasonably close to x86. And those minor differences in more modern RISC actually map better to C than the PDP 11 does -- for example, status flags being replaced with jumping based on register contents. Implicit widening to words is a weak mismatch for x86, but it's a pretty good match for modern risc (no need to mask out top bits in registers), etc.

I looked at your links, and I'm still not seeing how other C maps better to a PDP-11 than it does to modern CPUs. The only thing I'm seeing in the pastebin rant is that CPUs are fast enough and memories are big enough today to support more expensive features, which I can agree with.

Again:

> Worked fine for a PDP-11. Yet, forcing its memory model or tradeoffs into a language used on different hardware can cause unnecessary problems.

What parts of its memory model or tradeoffs made it into C? I can't find any specifics that you're basing these claims on, only assertions that it's true.

In fact, the usual complaint associated with C is that it left the memory model so loosely specified -- initially to allow it to match any hardware -- that optimizing compilers can use the looseness to do really strange things to your code.


"The only thing I'm seeing in the pastebin rant is that CPUs are fast enough and memories are big enough today to support more expensive features, which I can agree with."

Fair enough haha. OK, my memory loss is hurting me on examples. It might have just been the little things adding up. I do recall two from security work: reverse stacks and prefix strings. MULTICS, UNIX's predecessor, had both, with significant reliability and security benefits. The reason C had null-terminated strings was the PDP-11's hardware and one personal preference/opinion:

"C designer Dennis Ritchie chose to follow the convention of NUL-termination, already established in BCPL, to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in his experience, less convenient than using a terminator."

Now, on the reverse stack, my memory is cloudy. Common stacks have incoming data flow toward the stack pointer in a way that can clobber it, even leading to hacks. MULTICS had data flow away from the stack pointer, with an overflow dropping into newly allocated memory or raising an error. C language (and most) implementations use a regular stack. I think it was because PDP hardware expected that, with a reverse stack requiring high-penalty indirection. I could be wrong, though. I know a reverse stack on x86 gets a performance penalty, and key traits of x86 come from the PDP-11. A CISC with a reverse stack would have problems with C.

The pointer stuff. Lots of the pointer stuff, esp arrays, comes from efficiency needs for running on a PDP-11. This by itself is why we can't map C easily to safer or high-level hardware. The CPU at crash-safe.org, jop-design.com, and Ten15 VM come to mind. PDP-11 model doesn't support safety/security so neither does C.

These are a few that come to mind that carry over into modern work trying to go against C's momentum. Hardware, software, and compiler work.


> Now, on reverse stack, my memory is cloudy. Common stacks have incoming data flow toward the stack pointer in a way that can clobber it, even leading to hacks.

That's not a restriction of C, but a way to get more out of your memory on a restricted system: if your heap grows up and your stack grows down (or vice versa), then you can keep growing both until the two meet, at which point you've used all the available memory. However, if they both grow in the same direction, you need to statically decide how much to give each one, which will lead to waste if you're not using much stack or heap:

    [heap-->|           |<--stack]
vs:

    [heap-->|    |stack-->|      ]
But, again, not something that C cares about; you have a number of architectures like Alpha (IIRC) where the program break and the top of stack move in the same direction.


Gotcha. Appreciate the tip.


Can you explain what you mean by a reverse stack? Is that a stack that grows upwards like the heap? Why does this incur a penalty?


I originally learned about it and other issues in a paper by the people (Schell & Karger) that invented INFOSEC:

https://www.acsac.org/2002/papers/classic-multics.pdf

Really old stuff. Relevant quote: "Third, stacks on the Multics processors grew in the positive direction... if you actually accomplished a buffer overflow, you would be overwriting unused stack frames rather than your own return pointer, making exploitation much more difficult."

I can't find the original paper showing the penalty on x86/Linux. However, this one does the same thing for different reasons with many details:

http://babaks.com/files/TechReport07-07.pdf

Key point: "The direction of stack growth is not flexible in hardware and almost all processors only support one direction. For example, in Intel x86 processors, the stack always grows downward and all the stack manipulation instructions such as push and pop are designed for this natural growth."

So, on such an architecture, you can't directly use the stack operations to do the job: must implement extra instructions without hardware acceleration. The stack on x86 is effed-up and insecure by design. If C's stack is fixed, there's still a mismatch between it and x86 ASM. Itanium at least provided stack protection among other security benefits.


Interesting. Thanks for the links. Isn't C's stack pretty much tied to the hardware? Curious what you mean by "If C's stack is fixed"? How could that be implemented, changing the run time?


You change the compiler to emit different things, like a reverse stack or whatever your protection model is. As far as implementation, they describe it on p5 (PDF p7) of the paper above (not the MULTICS paper). It's actually brilliant now that I read it, as the naive approach they avoid is, IIRC, what the other academics did on Linux/GCC. The performance overhead hit 10% easily due to x86's stack approach. I think the worst case was even higher. This team effectively tricks the CPU with simple instructions (e.g. addition/subtraction) without invoking memory, to get it to a worst case of 2%. Clever.

Note: I'm not saying this is sufficient to stop stack smashing. Just that reverse stacks are a better idea than the ludicrous concept of making unknown amount and quality of data flow toward the stack pointer. Definitely reduced risk a bit but how much takes more assessment.


Interesting stuff, thanks.


I'm gonna go out on a limb and guess "memory model"; both consistency and coherence issues.


In that case, I'm not really aware of any language that exposes this to the user, outside of assembly.


Many languages restrict or eliminate the use of pointers. C doesn't. So, it exposes whatever model it expects underneath to its users.


I think you do have a point about possibly misleading people in relation to signed overflow, since that is common hardware behaviour. But for the point about null pointer dereference, I believe thinking in terms of assembly would actually help people realize that referencing memory location 0 (if that is what null is defined to be) will not immediately crash your program. It is not special; it's just another memory location.

Your point about teaching assembly as being a waste of time also has some merit. Of course people still do program in assembly, but it is only becoming more and more rare to actually need someone to program assembly, which is why it just isn't emphasized as much any more and it makes C seem like an even stranger language for someone who started on Python or Javascript.


Because dereferencing a null pointer is undefined behavior, the compiler is free to assume it won't ever happen. This, in turn, can lead to the compiler optimizing or reordering code in a counter-intuitive manner, completely changing the behavior of a program that you thought you understood.

The compiler might emit machine code that attempts to read the memory at location 0. It would also be perfectly within its rights to optimize away that branch of code (if it can prove that it always dereferences null). The code might even appear to work now, and completely break in the next version of the compiler.


The page containing address zero (NULL) is almost always unmapped in environments with virtual memory, specifically to catch NULL pointer dereferences. That's not the source of the problem. The problem is that dereferencing a NULL pointer is "undefined behavior," and modern compiler writers abuse every instance of undefined behavior to implement negligible optimizations that subtly break programs for little gain.


> I believe thinking in terms of assembly would actually help people realize that referencing memory location 0 (if that is what null was defined to), will not immediately crash your program. It is not special, it's just another memory location.

No, it is special. Dereferencing it is undefined behavior. It is not guaranteed to result in a load at address zero.


The cited incorrect assumption was "null pointer dereferences are guaranteed to crash the program", which is, as you implied, not true because it won't necessarily crash the program. C places additional restrictions which say that the program can actually do anything (undefined behaviour).

The main reason I keep coming back to assembly language with C, is that C cannot do anything that assembly language cannot do. It only places additional restrictions on assembly, which is a bit easier to grasp (bitwise operations, add, subtract etc.). Once you understand the fundamental operations of the processor, you can start to learn the copious corner cases that is the C programming language.


> The main reason I keep coming back to assembly language with C, is that C cannot do anything that assembly language cannot do.

That's basically true. Except that an optimizer is allowed to transform things as long as the observable behavior remains the same. And undefined behavior means more than "just that one line is undefined" ( http://blog.llvm.org/2011/05/what-every-c-programmer-should-... , http://blog.llvm.org/2011/05/what-every-c-programmer-should-... , http://blog.llvm.org/2011/05/what-every-c-programmer-should-... ). These two facts conspire to make undefined behavior surprising to many programmers.

For instance -- using one of Lattner's examples -- on some architectures it's a little expensive to check if a loop variable has wrapped around. So the optimizer will omit the check for wraparound if it can prove wraparound is impossible. That is, if it can prove that n + 1 > n. But the Standard only requires wraparound for unsigned integer types. So the optimizer can also omit the check if it can prove either that n + 1 > n or that n is a signed integer. In this case, a bounded loop turns into an infinite loop, but "undefined behavior" includes that kind of transformation.
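
A hedged sketch of that kind of transformation. The programmer expects this loop to stop once i overflows and goes negative, but since signed overflow is undefined, the compiler may assume i + 1 > i always holds and delete the test:

    /* Looks bounded, but an optimizer may legally turn it into an
     * infinite loop: the i >= 0 check can be removed because signed
     * overflow "cannot happen". */
    void looks_bounded(void)
    {
        for (int i = 0; i >= 0; i++) {
            /* ... do work ... */
        }
    }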

More to the point, Linux had a severe security bug where they dereferenced a pointer, then checked if the pointer was NULL before returning the value ( https://lwn.net/Articles/342330/ ). The optimizer removed the check for NULL because if the pointer was valid when it was dereferenced, the check was unnecessary; and if the pointer was NULL when it was dereferenced, then dereferencing it was undefined behavior, and "remove a NULL check" is a valid transformation under undefined behavior. Then moving that load to a different point in the function is also a valid transformation (as long as it doesn't affect observable behavior), and a few more transformations could make the NULL dereference do something completely different from what you expect.
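
A rough sketch of the shape of that bug (hypothetical struct and field names, not the actual kernel code):

    /* Hypothetical example of a NULL check that follows the dereference. */
    struct device { int flags; };

    int get_flags(struct device *d)
    {
        int flags = d->flags;   /* dereference happens first */
        if (d == NULL)          /* optimizer may delete this check: */
            return -1;          /* "d was already dereferenced, so it can't be NULL" */
        return flags;
    }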


> C cannot do anything that assembly language cannot do.

I know where you're going with this, but I don't agree with this statement. C provides the abstraction of structs and unions, which do not exist in assembly. Arrays are also an abstraction which does not exist in assembly - yes, assembly provides the mechanisms to easily compute addresses for arrays, but that is different than having a named entity for many objects. The direction you're going is that C is a thin abstraction over assembly, which I agree with. But that is different than saying C has a one-to-one mapping, which is what I think your statement implies.


With modern compilers and modern CPUs, C isn't a particularly thin abstraction. Many serious misconceptions stem from naive assumptions about how C will be translated into assembly which are invalidated through compiler code transformations- often with results that surprise even expert compiler maintainers. Please do not spread the idea that C has any straightforward or predictable relationship with assembly language.


"more straightforward and more predictable relationship with assembly language than any other language in the world" is the right way to say it.


That's a rather bold assertion. How would you back it up? What about high-level assembly languages, ISA-agnostic IRs or languages like Forth or Oberon for which compilers intentionally perform only simple optimizations, if any?


My name is Robert Elder and I approve this message.


Has anyone ever seen any behaviour other than a segfault in the wild, though? Would any real-world compiler author decide that some other behaviour was reasonable? I can't imagine making that decision myself.


A classic problem is code like this:

    printf("a=%p *a=%d", a, *a);
    if (a == NULL) return;
An optimizing compiler will likely remove the NULL check completely because the printf above it has undefined behavior if a is NULL.


On some microcontrollers I know a pointer to address 0 really is just that. I tested compiling the following

    unsigned short *a = 0;
    ((void (*)(void))a)();
which may cause a jump to address 0 (usually the reset vector).

On R8C with the NC30 compiler there are no warnings and the output is:

    MOV.W:Q #0H, -2H[FB]
    MOV.W:G -2H[FB], R0
    MOV.W:Q #0H, R2
    JSRI.A  R2R0
It jumps to 0. But on PIC 16 with the XC compiler, a warning is emitted ("warning: (1471) indirect function call via a NULL pointer ignored") and the statement is not compiled at all.


On BSD on the VAX, address 0 used to happen to hold 0, so dereferencing NULL got you zero. That was at the time a sufficiently popular platform that some programs assumed the behaviour (eg, that a null pointer could be used as an empty string), and you can still find older docs on the web exhorting the reader to avoid "all the world's a VAX" syndrome.


I think on DOS in real mode address 0 was not special. Some(?) compilers would add code to programs to check that location on program exit to see if its value had changed and emit an error.

(I think I read about this somewhere, but it was a while ago, and I never really programmed for DOS (well, one little program to set the serial port to some specific settings, but that was about ten lines including whitespace and a little comment explaining what the code did).)


On MacOS Classic (no memory protection) you can happily write to address zero and replace whatever is stored there (I think Apple wisely left that memory address unused, though).


Of course it has its own semantics. I've seen the idea of "the 'C' virtual machine". You got a running start on the differences it has from most assemblers (each of which will have semantic differences from each other).

I think his point that "The ideas of stack frames, heaps, registers, and pointers are completely foreign to them..." is the critical point.


It still seems easier to present C as "a bit like assembly except for..." than "a bit like Java except for...".


>I've come to the conclusion that C is best understood if you think about it in terms of the way that an assembly language programmer would think about doing things

I agree with this. I found another article that made the transition for someone coming from something like Python or Ruby easier. It shows how to use gdb as a sort of C REPL. https://www.recurse.com/blog/5-learning-c-with-gdb

The quick feedback loop certainly makes things like pointers, arrays, etc, more clear.
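
A short sketch of that workflow, assuming a trivial scratch.c (hypothetical name) compiled with debug info; print evaluates C expressions on the spot and x examines the raw bytes behind a pointer:

    $ gcc -g -O0 -o scratch scratch.c
    $ gdb ./scratch
    (gdb) break main
    (gdb) run
    (gdb) print sizeof(long)
    (gdb) print argv[0]
    (gdb) x/8xb argv[0]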


Wow. I've been flirting around with learning C for awhile now (coming from a couple dynamic languages, including python), and this is by far the most helpful read about pointers I've ever come across. Thanks a ton.




Heck yeah. Thanks for the tip. Btw, new link is here:

https://github.com/zsaleeba/picoc


+1 to this. GDB and valgrind are the best things about writing C.


Valgrind is the best thing since sliced bread - saved me so much time and agony...


One thing that I've found enlightening is when you have an IDE that allows you to step through mixed source and assembly. You can see your data being slopped around and processed. (int a = 10; becomes ld R4,10)


Personal anecdote: C was the first language I learned while I was in middle school. I taught myself. I admit that I have gained a deeper understanding of C after I recently took Computer Architecture class, but I don't think learning assembly is essential to understanding the C language, stack/heap, and pointers. When I was first learning C, a simple memory diagram with simple description was sufficient. Perhaps the problem is learning a high level language first (most schools start with python or java)? Maybe students struggle with the transition or it isn't explained well enough. I couldn't say since I started with C. I'm interested if anyone had this problem because I tutor students.


My experience was similar. I taught myself C during high school, and found transitioning to higher level languages to be fairly simple after that. C was taught as one of the later languages in my university course, and many of my friends (that started with higher level languages) struggled to pick it up.


> The issue that many students face in learning low-level C is that they don't learn assembly language programming first anymore; they come from higher-level languages and move down. Instead of visualizing a Von Neumann machine, they know only of syntax, and for them the problem of programming comes down to finding the right magic piece of code online to copy and paste. The ideas of stack frames, heaps, registers, and pointers are completely foreign to them, even though they are fundamentally simple concepts.

What about teaching C in the context of something like AVR programming, where you have to worry about those sort of things because there simply isn't any abstraction on top? That was where I first encountered C and I think learning it in such a constrained environment helped me appreciate/understand the utility of C a lot more.
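
For instance, a minimal avr-libc style sketch of the kind of bare-metal C involved (assuming an ATmega-class part, and that F_CPU is defined so the delay macro works):

    #include <avr/io.h>
    #include <util/delay.h>

    int main(void)
    {
        DDRB |= (1 << DDB5);         /* make PB5 (the usual LED pin) an output */
        for (;;) {
            PORTB ^= (1 << PB5);     /* toggle the pin by writing the register directly */
            _delay_ms(500);
        }
    }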


Pascal and Basic are also an option when targeting AVR:

http://www.mikroe.com/mikropascal/avr/

http://www.mikroe.com/mikrobasic/avr/


To add to the others noting the differences between C's behaviour and the underlying assembly, I recommend Chris Lattner's (the creator of LLVM) series of posts about Undefined Behaviour in C:

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

(HN truncates the text of the URL, but they're all different)

I've been coding C professionally for over a decade, as required for firmware/embedded development, and those posts have instilled the fear of god in me.


> ... and those posts have instilled the fear of god in me.

But why? Yeah, hitting UB can be a terrifying idea but rarely happens in practice.

In two decades of C programming, I have hit a UB bug exactly once, when a piece of code was run on an ARM platform for the first time. It took a little bit of staring at disassembly and reading some docs to sort out, but it wasn't the end of the world.

Understanding the basic cases is a good idea but the darkest corners of undefined behavior are only important if you're a compiler writer like Chris Lattner is.


A solid understanding of undefined behavior is required if you want to write software in C that is secure, that will behave properly, and that is portable.

Undefined behavior has been the source of numerous security exploits in the past, and will only get worse as modern optimizing compilers become more advanced.


I took to C like a duck to water, probably because I was just coming from having extensive experience programming the PDP-11 in assembler, and C looks and behaves a lot like 11 assembler. (Such as integral and floating point promotion rules.)


> In my quest to learn C very well over the past few years, I've come to the conclusion that C is best understood if you think about it in terms of the way that an assembly language programmer would think about doing things. An example of this would be if you consider how switch statements work in C.

Require students to write a toy operating system and compiler in C. They will understand things obscenely well by the time that they are finished.

Another option is to introduce students to DTrace and require them to use it to answer questions about kernel and userspace behavior. Ask the right questions and they will learn all of the things you want them to learn from the process of answering the questions.


>and for them the problem of programming comes down to finding the right magic piece of code online to copy and paste.

This seems like a particularly uncharitable thing to say about people who write code in a 'high level' programming language. You're alleging not just that they don't understand the fundamental low level workings of a computer, but that they don't even understand how to write new programs in their language?


Uncharitable...probably.

Though there is some truth to that. It's far more likely that copy-and-paste X is a workable solution in a higher-level language than a lower-level one.

A high-level language emphasizes portability and abstraction; a low-level language emphasizes performance and implementation details.

So... the comment sounds harsh but in reality is a reflection of the success of high-level languages.


When you want to fully grok C, knowing some assembly and techniques does help. It's irrelevant for those learning the language, though. They don't need to know about the CPU-saving effects of fallthrough or lazy evaluation. They're still busy tripping up on mixing arrays and pointers and incrementing the wrong one, on null pointer assignments, or on forgetting the break in a switch. When you have some years' time and are exploring the edges of the language, that knowledge undoubtedly helps.

It was rare for students 25 years ago to learn assembly first. Can't say it made it harder to learn or that students had issues (I spent a couple of years in the 90s teaching C part time to contractors). They had issues with language beginner things: pointer arithmetic, or confusing pointers/arrays, but I can't ever remember anyone having a particular issue with switch. Fallthrough was a C thing; they accepted it quite happily, and forgot break sometimes, as learners do. People seemed to have far, far more difficulty getting comfortable with C++ and OO than C. The new C++ programmer was much more dangerous than the new C programmer!

C was often taught as first "real" language. You'd introduce pointers and here's how that aspect of computers work as part of the same scribble on the whiteboard. Same for memory allocation, stacks, heaps and byte sizes/packing. The fact that C was so directly close to those concepts made grasping them that much easier.

We lost a lot when we moved beyond expecting people to be aware of those basics. PHP isn't even sure itself what data is. Being able to pack your data or have app data that's optimal would be appreciated by those "few" smartphone users outside SV where dropping data or fallback to GPRS happens often. Data is rarely thought of in terms of size, it's just a blob of some types/objects. Little surprise when the app spits JSON of epic size and spends half its time "thinking".


> Little surprise when the app spits JSON of epic size and spends half its time "thinking"

At least it's not XML...


It's wonderful that the article mentions the Duff's device. I still remember the mind-boggling effect when I first met it. It taught me many things. Once you understand it you will never make mistakes with switch/case in C anymore.


C really only clicked for me when I was taking my computer architecture class in college - working through the Patt & Patel book[1], along with the Tanenbaum book[2], and building an entire 8-bit CPU from the gates up in a simulator.

I've seen so many people that just can't wrap their heads around pointers, but it makes so much more sense when you've gotten down to the nitty-gritty level and built up from there.

[1] http://www.amazon.com/Introduction-Computing-Systems-gates-b...

[2] http://www.amazon.com/Structured-Computer-Organization-Andre...


Let me add the perspective of an old dinosaur that learned to program before C had been invented.

C maps nearly 1:1 onto simple processor and memory models, and most importantly, gets out of your way and lets you get on with solving your system programming problems. Before C, just about any meaningful system programming task required a dive into assembly language. In that context, C was a huge win. It is also what makes C the language of choice for embedded development today.

Of course, system programming problems are not the bread-and-butter of most developers today -- and a good thing, too. We can now build on top of solid systems and concentrate on delivering value to the customer at much higher levels of abstraction: the levels of abstraction that are meaningful to customers.

I dearly love Python because it allows me to work at levels of abstraction that are meaningful to the user's problem. I dearly love C when I want to wiggle a pin on an ARM Cortex-M3.

In my mind, CS education should start by teaching problem decomposition and performance analysis using a language like Python that provides high levels of abstraction and automated memory management. Then, just like assembly language was a required CS core course back in my day, students today should spend a semester implementing and measuring the performance of some of the data structures that they have been getting "for free" so that they understand computing at a fundamental level. Some will go on to be systems programmers, and will spend more time at the C level. Some won't ever look at C again, and that is OK.

In the end, CS education is about how to solve problems through the application of mechanical computation. The languages will evolve as our understanding of the problems evolves and our ability to create computing infrastructure evolves. CS education should be about creating people who can contribute to (and keep up with) that evolution.


"In my mind, CS education should start by teaching problem decomposition and performance analysis using a language like Python that provides high levels of abstraction and automated memory management. Then, just like assembly language was a required CS core course back in my day, students today should spend a semester implementing and measuring the performance of some of the data structures that they have been getting "for free" so that they understand computing at a fundamental level. Some will go on to be systems programmers, and will spend more time at the C level. Some won't ever look at C again, and that is OK."

I agree with that. I've proposed it myself. I'll add that I prefer them starting with a more type- and memory-safe, but low-level, language like Component Pascal so they can learn low-level thinking but appreciate safety features & good language definitions when they learn C afterward. One can emulate a decent bit of that in C and HLL that compile to C. Maybe they'll remember enough to make something valuable.


The whole issue of the role of types is one of my current meditations. Python's duck typing is great for getting out of your way and getting on with your job. Usually. I'm just starting to work through some Julia tutorials so that I can experiment with multiple dispatch. But I think the way to teach reasoning about types is to learn you some Haskell. I can't see doing a lot of production programming in Haskell today, but I think learning Haskell is a good mind-stretching exercise (just as learning Prolog was, back when Prolog mattered).


I agree Haskell will stretch my mind on the subject when I get around to learning it. It's exactly that reason that I've put it off for so long. ;)

As far as industry goes, I'm not feeling doubts as strong as yours, looking at the quantity and diversity of this list:

https://wiki.haskell.org/Haskell_in_industry

There's some real ass-kicking going on in industry with Haskell even if it's niche use and has obstacles/issues from that perspective.


Do it, Haskell is a great learning exercise.

I never get to use it at work, but it helped improve my skills when using FP concepts in C++, JVM and .NET languages.


I might. My idea is to use either it or more pure ML/Ocaml to code the prototype up then convert each function into imperative C, Rust, or Ada. I figure a subset of Haskell should make that achievable. So, I can use all the tools like QuickCheck and QuickSpec for Haskell along with functional style to knock out many issues. Then, direct translation to imperative with no global issues, equivalence testing generated from Haskell tests, and static analysis to ensure local code is proper.

That seL4 matched Haskell and C... along with open-sourcing a key tool (AutoCorres)... makes me think this is achievable. My method isn't formal proof but would be more accessible. What do you think?


You can already generate C from Haskell using the jhc compiler:

http://repetae.net/computer/jhc/

It doesn't support all the language features, though.

Besides their standard one, GHC also has an additional LLVM backend and an older, now deprecated, C backend:

https://downloads.haskell.org/~ghc/7.6.3/docs/html/users_gui...

The UHC also has a C backend, but it is not fully implemented:

http://foswiki.cs.uu.nl/foswiki/Ehc/UhcUserDocumentation#A_6...

Sounds like a good idea. On the other hand, have you ever used Frama-C? I've never used it, but it seems to be used in this kind of scenario.


Thanks for links. So, here we go.

re generating C. The key thing is whether it generates human-readable and -editable C. Most of those tools are used to just feed and piggyback on a C compiler. In my scheme, the Haskell is like the high-level, executable spec, with C being equivalent. Such tools might be a start on my goal. I like that JHC has no garbage collector. That's promising for the same reason Rust having no GC is. :)

re Frama-C. Oh, yeah, that's good thinking, as it's already been used for plenty of C verification, even the standard library. I was thinking of encoding Haskell specs in Frama-C somehow, but I'm not sure, as I'm not a formal methods specialist. Anyway, you might like where such techniques got their start for mainstream languages:

http://apotheca.hpl.hp.com/ftp/pub/dec/SRC/research-reports/...


Thanks for the link. I really appreciate that HP has kept DEC papers alive; they are my main source into the Modula-3 universe, but they are hard to search for.

Another possibility is using something like the "Tiger book" to write an ML -> C compiler for the C subset you care about, but by then maybe contributing to Ada (GNAT), Rust or Swift would be a better use of time.

I also forgot to mention in my previous comment that Idris and F* also have C backends, but they might suffer from the same problems.


GHC's C backend did not by any means generate anything a human would enjoy reading. It ran fast, though, on several platforms, which was the point.


As I suspected. Thanks for the confirmation. I'm guessing human-readable output might have to be done at an earlier stage in the compiler to keep it close to original code. Or do it high-assurance style with each intermediate representation in C to let reader see what transformed code means.


Btw, discovered something trying to update my research collection for memory safety and static analysis. Turns out, a prior technique for analyzing C that looked promising also has a Haskell project. That's Liquid Types.

Links describing it & to various projects http://goto.ucsd.edu/~rjhala/liquid/haskell/blog/about/

C verifier download http://goto.ucsd.edu/csolve/


> I dearly love Python because it allows me to work at levels of abstraction that are meaningful to the user's problem. I dearly love C when I want to wiggle a pin on an ARM Cortex-M3

What keeps you from wriggling that pin in Python?

Turing completeness dictates that any Turing complete language is equivalent to any other Turing complete language. In addition, you can do high levels of abstraction in C. The preprocessor and void pointers allow you to do some rather nice things. Structured programming also lets you build things up. The advantage that Python enjoys is that it comes with a large number of library functions already provided and there are easily discoverable third party libraries. There are other differences, although every difference has a trade-off. Garbage collection bloats memory requirements. Being interpreted means errors that could be caught in advance at compile time occur at runtime.


> Turing completeness dictates that any Turing complete language is equivalent to any other Turing complete language.

Sorry to pick on you, but this is a good example of misuse of this fact in an argument where it's not really relevant. Turing-completeness only relates to functions that take some string as input and produce a string as output (since that's all the Turing machine model can do). In contrast, real-world programming languages interact with a machine or operating system, and not all languages provide the same interfaces or even run on the same machines.

I can easily implement a Turing-complete language whose only allowed system calls are reading from stdin and writing to stdout. Despite being Turing-complete, it will never be able to spawn threads, connect to a network socket, or even allocate memory on the heap.


> What keeps you from wriggling that pin in Python?

Did you catch that "Cortex-M3" part after "ARM"? No MMU, no OS, very limited SRAM -- CPython doesn't go there. I am, however, a huge fan of MicroPython, but that discussion is too OT for this thread.


I'm currently at uni studying CS and recently finished my 'Programming in C' unit. The teacher said from the get-go that it would be challenging compared to other languages that we had used to date (mainly Java) and that quite a few students struggle with it. Once I got my head around pointers and debugging through GDB/Valgrind, the unit became immensely enjoyable and rewarding.

We didn't use any fancy IDEs and were told to stick to Vim. We also had to compile with the flags -ansi -Wall -pedantic, which alerted you not only to errors but also to warnings if the code didn't meet the C90 (I think) standard. It was a lot of work crammed into 13 weeks, but it had a few assignments which I thoroughly enjoyed:

Tic Tac Toe (Ramming home using pointers, 2D arrays, bubble sort for the Scoreboard).

Debugging a bug-riddled program (My favourite).

Word Sorter (Using dynamic memory structures, memory management by having no leaks, etc).

The debugging one was very different from most other assignments I had done at uni to date, and the teacher said he recently introduced this assignment because the university had received feedback that students' debugging skills weren't the greatest. They could write what they were asked to just fine, but when it came to debugging preexisting issues quite a few struggled. We got given a program with around 15 bugs, and you got marks depending on what was causing each bug and a valid solution to fix it. This forced us to use tools such as GDB and Valgrind to step through the program, see where the issue was, and be much more methodical.
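
The basic loop was roughly this (hypothetical file names):

    gcc -ansi -Wall -pedantic -g -o wordsort wordsort.c
    valgrind --leak-check=full ./wordsort words.txt
    gdb ./wordsort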

I really enjoyed C and when I find a bit of time outside of work and study I'd like to explore it more.


A long-standing gripe of mine: When I clicked through to his example "cute little function" in Musl, I found myself mentally adding comments to work through all the "cuteness". If that's the case, IMO, it's either too cute or needs more comments - not sure how much time I've spent picking apart kernel code just to figure out what the hell some of it does, but it's definitely not time well spent.

EDIT: Meant to add: fantastic article, wish my Intro to C instructor had read it...


I spent about twenty minutes looking at this function trying to figure out what it does. It looks like it uses a neat trick to compare native integers instead of bytes for speed, but I just couldn't figure out what it does. Then I realised it's part of the standard library and "man memchr" told me what it does...


I mean, it's definitely a neat way to optimize comparisons for the CPU's word size. The problem is with completely unexplained things like:

  #define ONES ((size_t)-1/UCHAR_MAX)
Reading through the loop, it seems as though it will create, e.g., 0x01010101 for a 32-bit machine with 8-bit bytes. And sure enough, if you calculate ((2^32)-1)/255 that's exactly what you get. But I never would've known that without going through the code and proving to myself that the definition of ONES actually makes sense.

If you write code like this and there are never any bugs, then fine, I guess. But there will be bugs.
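
A hedged, self-contained sketch of the trick; the ONES/HIGHS/HASZERO definitions here follow the common bit-hack formulation rather than quoting musl verbatim:

    #include <limits.h>
    #include <stdio.h>

    #define ONES   ((size_t)-1 / UCHAR_MAX)           /* 0x0101...01: a 1 in every byte */
    #define HIGHS  (ONES * (UCHAR_MAX / 2 + 1))       /* 0x8080...80: the high bit of every byte */
    #define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)  /* nonzero iff some byte of x is zero */

    int main(void)
    {
        size_t no_zero_byte  = 0x11223344;
        size_t has_zero_byte = 0x11003344;
        printf("ONES  = %#zx\n", ONES);
        printf("HIGHS = %#zx\n", HIGHS);
        printf("%#zx -> %s\n", no_zero_byte,  HASZERO(no_zero_byte)  ? "has a zero byte" : "no zero byte");
        printf("%#zx -> %s\n", has_zero_byte, HASZERO(has_zero_byte) ? "has a zero byte" : "no zero byte");
        return 0;
    }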


The whole ONES... HASZERO shenanigans is somewhat explained here: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInW... as well as the entry after it.

Edit: And most likely in the excellent book "Hacker's Delight" as well.


The core of what makes C elegant is that basically everything that looks atomic is atomic, in the sense of taking O(1) time and memory (at least until C99 introduced its abominable variable-length arrays, and perhaps some other features I'm forgetting about).

Absent macro obfuscation, it is easy to reason about what a snippet of C does and how it translates down to machine code, even taken out of context. In C++, something as innocent as "i++;" could allocate heap memory and do file I/O.

The downside is that C code can become quite verbose, and to do anything useful, it takes a lot of ground work to basically set up your own DSL of utility functions and data structures. For certain applications, this is an acceptable tradeoff and gives a great deal of flexibility. I think teaching this bottom-up approach to programming can be quite useful - in a way, it mirrors the SICP approach, albeit from a rather different angle.

The question is, why are there not more languages that have the same paradigm, but also add basic memory safety, avoid spurious undefined behavior, provide namespaces, with a non-stupid standard library, etc.?


Building on pjmp's comment, this quote...

"The question is, why are there not more languages that have the same paradigm, but also add basic memory safety, avoid spurious undefined behavior, provide namespaces, with a non-stupid standard library, etc.?"

...basically just described Modula-3. It meets your requirements, was easy to read, had concurrency, had a decent stdlib which had some formal verification, and could act as low-level as you needed with the "UNSAFE" keyword. Brilliant design given all the tradeoffs it balanced. It had some commercial uptake and was used in CVSup for FreeBSD.

https://en.wikipedia.org/wiki/Modula-3

Note: It's important not to write it off once you see "garbage collection." The GC was optional, with a single keyword determining whether you or it handles a specific variable. Lets one pick and choose their battles with fate. :)

Note 2: The Obliq distributed programming language was an interesting project based on Modula-3. The SPIN OS, written in Modula-3, let you link code into a running kernel in a type-safe and memory-safe way for reducing context switches for performance.


Much of the code in the fantastic C Interfaces and Implementations [0] is inspired by Modula-3. It's a great book to work through after finishing K&R for anyone wanting to learn how to write safe and reusable C code.

(Regular HN readers will recognize the author of the top review on Amazon.)

[0] http://www.amazon.com/Interfaces-Implementations-Techniques-...


Wow. Interesting to see the two converge that way. That's the most glowing review of a book I've seen in a while. Guess I'm going to have to get it just in case. :)


> The question is, why are there not more languages that have the same paradigm, but also add basic memory safety, avoid spurious undefined behavior, provide namespaces, with a non-stupid standard library, etc.?

You mean Algol, NEWP, Mesa, PL/I, PL/M, Modula-2, Ada and similar?

A few of them are older than C.


This is why I made a BASIC-like language with 4GL-style utility functions that extracted to C. I knew the effects of each one. BASIC didn't have C's issues and compiled pretty directly. Compiled fast, too.

Not recommending this route today, as I just happened to start with BASIC. More like combining a simple, easy-to-compile, safe-by-default language without GC, with macros, that extracts to portable C. That should make programming in C easier while providing all the benefits of C, as my system did.

I've been eyeballing Nim language for this since HN commenters pointed out it's close to the goal already:

https://github.com/nim-lang/Nim/wiki/Nim-for-C-programmers

Someone might even be using a subset of it as a high-level C language. I'd be interested to know if anyone reading is doing something like that. Plus any other language besides Nim that can closely map and extract to C without its issues.


Isn't that the space Rust is going after? That's certainly what I find interesting about it.


I realize you weren't using the word in this way, but you want to be careful with that word "atomic." See http://stackoverflow.com/questions/1790204/in-c-is-i-1-atomi... , particularly the third answer.
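
To illustrate the distinction (a sketch of mine, not taken from the linked answers): plain `i++` is a read-modify-write that two threads can interleave, whereas C11's atomic_fetch_add is guaranteed to be atomic. Build with something like `cc -std=c11 -pthread`.

    #include <stdio.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static int plain;              /* plain++ from two threads is a data race */
    static atomic_int counted;     /* atomic read-modify-write */

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            plain++;                         /* load, add, store: not atomic */
            atomic_fetch_add(&counted, 1);   /* atomic in C11 */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* "plain" typically comes up short of 2000000; "counted" never does. */
        printf("plain = %d, counted = %d\n", plain, atomic_load(&counted));
        return 0;
    }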


Great post! Universities tend to teach a very small subset of C, just enough to make a tic-tac-toe application or something silly.

I learned C by myself many years ago, but it's only recently that I have been using it for big projects.

Reading Redis' source code was a great aid; xv6 is also amazing for learning systems programming.

Learn C The Hard Way is also a good read, but not as your main book, since it goes too fast. Other invaluable resources are: Beej's Guide to Network Programming and Beej's Guide to Unix Interprocess Communication

A good advanced book is Advanced Programming in the Unix Environment


Books I have read so can personally recommend are (in no particular order)

    Programming in C
    C Primer Plus
    K&R (obviously)
    21st Century C
    Modern C (also mentioned in this post)
    Understanding and Using Pointers in C


I loved "Expert C Programming: Deep C Secrets" by Peter Van der Linden. Great deep dive on the minutiae of important details (the difference between arrays and pointers as function arguments, etc)


I do agree, that book is really great, both very in-depth and at the same time entertaining to read.

Unfortunately, it appears to have been out of print for a while now.

Another book I can highly recommend is "The New C Standard: A Cultural and Economic Commentary" (http://www.knosof.co.uk/cbook/cbook.html). It takes apart the C language standard (C99) pretty much sentence by sentence, explains what it means and also contrasts how C99 is similar to or different from other languages (C++ mostly, but also, say, Fortran or Pascal).


+1 for Beej. I barely knew C when I enrolled in network programming, so trying to fill my C gaps alongside learning network programming was challenging. Someone recommended Beej. Then it was just a matter of C.


Is there a commentary on Redis source code? Always been fascinated how they implement the data structures.



This is awesome. Thanks a lot.


A few people mentioned xv6. What aspects did you find amazing for your learning process?


APUE is a great reference book, but I didn't find myself reading it linearly.


Although a C opponent, I find this to be a good writeup. I hope more C students see it. Particularly, the author focuses on introducing students to exemplar code, libraries with stuff they can study in isolation, and making habit of using checkers that knock out common problems. This kind of approach could produce a better baseline of C coder in proprietary or FOSS apps.

Only thing I didn't like was goto chain part. I looked at both examples thinking one could just use function calls and conditionals without nesting. My memory loss means I can't be sure as I don't remember C's semantics. Yet, sure enough, I read the comments on that article to find "Nate" illustrating a third approach without goto or extreme nesting. Anyone about to implement a goto chain should look at his examples. Any C coders wanting to chime in on that or alternatives they think are better... which also avoid goto... feel free. Also, Joshua Cranmer has a list there of areas he thought justified a goto. A list of great alternatives to goto for each might be warranted if such alternatives exist.
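
For concreteness, here's roughly the shape being discussed: the usual goto cleanup chain next to one goto-free variant. This is my own sketch, not Nate's examples from the article comments:

    #include <stdio.h>
    #include <stdlib.h>

    /* The classic goto cleanup chain: one exit path, resources
       released in reverse order of acquisition. */
    int with_goto(const char *path) {
        int rc = -1;
        char *buf = NULL;
        FILE *f = fopen(path, "r");
        if (!f)
            goto out;
        buf = malloc(4096);
        if (!buf)
            goto close_file;

        /* ... use f and buf ... */
        rc = 0;

        free(buf);
    close_file:
        fclose(f);
    out:
        return rc;
    }

    /* One goto-free alternative: acquire everything up front, then
       release unconditionally with null-safe cleanup. */
    int without_goto(const char *path) {
        FILE *f = fopen(path, "r");
        char *buf = f ? malloc(4096) : NULL;
        int rc = (f && buf) ? 0 /* ... use f and buf ... */ : -1;

        free(buf);          /* free(NULL) is a no-op */
        if (f)
            fclose(f);
        return rc;
    }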

The only improvement I could think of right off the bat would be including lightweight formal methods for C, or stuff like the Ivory language, which is immune to many C problems by design and extracts to C. Not saying they're a substitute for learning proper C so much as useful tools for practitioners that are often left out. The Astrée analyzer and safe subsets of C probably deserve mention, too, given the defect reduction they're achieving in the safety-critical embedded sector.


Although this is an interesting post, I'm disappointed from a pedagogical point of view. The article covers these topics, in this order:

1. What book do we assign?

2. What should we lecture?

3. What sort of code review work should we have students do?

4. What kind of assignments should we use? But only to say that he won't cover it in the article!

This is almost the exact opposite order of what is most useful in terms of learning. Yes, some people (especially autodidactic and well-focused students) are able to learn tremendous amounts on their own through books. But books are a relatively poor tool for teaching compared to active learning methods. Lecture can be great, but it is usually passive and worse than useless.

I want to acknowledge the importance of defining what you will teach and what successful (end-of-course) students look like and how to assess them. After you've decided that, it is proper to devise assignments and assessments, and then to decide on lectures and supplemental materials that support students in completing the assignments and assessments successfully. The time students spend should be active and practical - not that readings can't be provided, but they should be on-point and meaningful. Proper application of Instructional Design principles and theories of learning can make a world of difference for students.

But kudos for thinking about it, kudos for thinking about feedback mechanisms, and kudos for

PS: Obviously, I believe C has a great place in the curriculum - shouldn't leave undergrad without it!


I love K&R, 21st century C, Understanding C pointers, Deep C secrets but I think they are a little complicated for beginners. I wouldn't bother opening them until you have written a couple of small programs in C or you have good experience in other languages.

When I first learned C in highschool I got a few books on C which all seemed to have the word 'Beginner' in the name. 'Absolute Beginners Guide to C' is one I remember in particular. I think having multiple books is pivotal because as a beginner if you encounter an explanation that doesn't make sense to you it is very hard to reason around it. You probably have very little prior knowledge, almost everything you know and learn up to the point where you get stuck will be contained in that single book, and if you don't know any other languages you can't make any connections to help yourself out. The reason the second, third, or fourth book is so important is that it will have a slightly different explanation that might make something click in your brain.


I thought this was an excellent post. C has changed in lots of important ways in the last 10-20 years. The changes are both convenient (far better tooling) and inconvenient (much less forgiving of undefined behavior). Those of us who use C professionally have had to pick up most of these changes by osmosis. This was a really great run-down on how you'd bring a newbie up to speed with the state of the field.


'"even what seems like plain stupidity often stems from engineering trade-offs"'

This has truly become something I try to keep in mind, considering a) I've later, sometimes long after starting on someone else's code base, learned a useful rationale for why they did some of the previously more inscrutable things in their code, and b) ended up writing a few things like that myself.

Documentation is key to understanding these systems, but it isn't sufficient. Often you are presented with a nicely documented mega-function which anyone can read through, but which is very hard to reuse a portion of when needed. In breaking it apart into smaller chunks, you necessarily scatter some of the reasoning about why a particular approach was taken away from where it was originally used, or at least from where the weird behavior is required. You can either reproduce large chunks of the documentation at many different points in the code base, and hope it doesn't get out of date as the systems it describes in other files are slowly changed, or keep the documentation strictly pertaining to the code immediately around it, in which case the knowledge of how the systems interact can get lost.

Whenever you encounter code that seems to make no sense, it's better to assume there's some interesting invisible state that you need to grok, than that the programmer was an imbecile or amateur. The latter may be true, but assuming that from the beginning rarely leads to a better outcome.

Edit:

I'll share my favorite example of this. At a prior job, we had a heavily used internal webapp written in Perl circa 1996. It was heavily modified over the years by multiple people, but by the time I was looking in on it in 2012, it was a horror story we used to scare new devs. The main WTF was that it was implemented as one large CGI which eschewed all use of subroutines in favor of labels and goto statements, of which there were copious amounts. The really confusing part was that they were used exactly as you would expect a sub to be used, just with setting a few variables and a jump instead, so we always scratched our heads as to the reasoning for this. There was even a comment along the lines of "I hate to use goto statements, but I don't know a better way to do this, so we're stuck with this."

Fast forward a couple of years, and I'm migrating the webapp to a newer system and Perl, and I discover the reason for this. At some point it was converted to be a mod_perl application, and the way mod_perl for Apache works is to take your entire CGI and wrap it in a subroutine, persist the Perl instance, and call the subroutine on each request. The common problem with this is that any subroutines within your CGI can then easily create closures if they use global variables. The goto statements really were intended to be used just like subroutines, and were likely switched to in an attempt to easily circumvent this problem. Now, there are better methods to combat this, such as sticking your subroutines in a module and having your CGI (and then mod_perl) just call that module, which is what I ended up converting the code to do. But the real take-away is that the original decision, as impossible to defend as it seemed, was actually based in a real-world trade-off, and at the time it was made may have actually been the correct call.


The Harvard CS50 course on edX does a pretty good job of teaching C, especially if you do the recommended reading and the "hacker level" psets, which draw on the book Hacker's Delight (2nd edition).

There is some initial magic, where they have you include cs50.h, which is full of black-box functions, at the beginning, but other than that it's a good example of teaching beginner C.


If part of a CS course I think C is an excellent first language. Perhaps not for someone wanting to learn about software development on their own though.

It seems to me that while we know how to teach C properly today, not many places do, because they don't practice what they preach.


One thing C does is force you to think about how the computer actually uses memory. So many languages abstract away the memory management, and you end up with stuff like Java programs that constantly and aggressively thrash the hell out of the processor cache by creating and (later) destroying objects for everything they do. Cache misses are one of the biggest performance killers on modern CPUs. You also see programmers do stuff like add characters to a string one at a time, even though each addition creates a whole new string and discards the old one.

If you do stuff like that a lot, you can easily end up wondering why your program is so slow even though the profiler says there are no standout slow parts. Everything is just uniformly slow, because the coding style doesn't consider the amount of work each statement maps to at the machine level.
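
The string example, translated into C terms (a toy sketch of mine, with made-up helper names): appending one character at a time with a fresh allocation each time is quadratic overall, while tracking a capacity and growing it geometrically makes each append amortized constant and keeps the data contiguous and cache-friendly.

    #include <stdlib.h>
    #include <string.h>

    /* Naive append: each call may copy the whole string again,
       so building an n-char string this way costs O(n^2). */
    char *append_slow(char *s, char c) {
        size_t len = s ? strlen(s) : 0;
        char *t = realloc(s, len + 2);
        if (!t) { free(s); return NULL; }
        t[len] = c;
        t[len + 1] = '\0';
        return t;
    }

    /* Amortized O(1) append: track capacity and double it as needed. */
    struct buf { char *data; size_t len, cap; };

    int buf_push(struct buf *b, char c) {
        if (b->len + 2 > b->cap) {
            size_t cap = b->cap ? b->cap * 2 : 16;
            char *d = realloc(b->data, cap);
            if (!d)
                return -1;
            b->data = d;
            b->cap = cap;
        }
        b->data[b->len++] = c;
        b->data[b->len] = '\0';
        return 0;
    }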


There are plenty of statically compiled languages which don't abstract away memory management and are still far more robust to use than C; just start by looking at Modula-2. Programming in C makes a lot of sense in many situations compared to C++ or Python or JavaScript, just to name a variety, but not so much when compared to the many other languages that really compete in the same space as C but did not catch the same attention.


Learning C has the advantage of opening up a ton of Unix software for you. Modula-2 not so much.


I was a TA for a CS program that taught C-with-classes C++, and I have to disagree. C was far too low-level for the students. In a course where you're just trying to teach the basics of flow control and translating a problem into code, the foibles of C — from what a segfault is to trying to explain why you must read past the end of the file to get an EOF — just get in the way.
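
The EOF one in particular: the thing to internalize is that EOF is only reported after a read has already failed, so you check the read's return value rather than looping on feof(). A minimal sketch (mine, not from any course material):

    #include <stdio.h>

    /* Test the return value of the read itself -- not "while (!feof(f))". */
    int count_lines(FILE *f) {
        int lines = 0;
        int c;
        while ((c = fgetc(f)) != EOF) {
            if (c == '\n')
                lines++;
        }
        return lines;
    }

    int main(void) {
        printf("%d\n", count_lines(stdin));
        return 0;
    }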

Since someone always brings it up — I do believe at some point students should be exposed to a low-level language that doesn't do GC, to expose them to memory management. Just not as a first language. I'd rather give them Python to start, let them get their feet on the ground and somewhat comfortable with it — give them success to start and hook their interest. C just tended to defeat the students, and needlessly. I spent so much time explaining things that, to them, must have felt like a rocket scientist telling them why their water bottle rocket exploded into flames on the pad.


There is not one C. There are multitudes of C. C is a recombination of the programming language with the compiler, with the platform, with the code convention of choice, with the libraries chosen, with the operating system (if there is one).

And C needs knowledge of all those fields combined to be really used freely. If you don't know one of them, you will be like a wanderer on a frozen lake, doomed to trust those who know to guide you by ramming in posts of no return wherever the ice gets thin.

It's also about taking a sledgehammer to all those certainties people have about computers from marketing and from personal experience as consumers.


I find that many books that teach C either assume the reader knows how to program, or are too complex for them to understand.

I was going to write a Kindle book, a beginner's guide to C using the Code::Blocks IDE, because it is FOSS cross-platform software. I found out it was a lot harder than I thought it would be.

I learned C in 1987 at a community college and still have the book, which was written for Microsoft C; we used Turbo C and Quick C for some of the assignments. Most of the programs I wrote can still compile, and those that get errors or side effects can be debugged easily.


CS degree from pitt.edu, 1996, and C was not required. But a friend and I took it as an elective. We did not want to get out of school with a CS degree and no C.


We had to do the exercises of "Software Development 1" in C; I also had 2 years of C in high school, but it was only basic stuff.

Most of university was Java and "use what you want", with only a bit of C++ for "Computer Graphics 2" (which I never did).

I found it a bit sad, but on the other hand I never needed it.


Wow. I started at a community college and they had two tracks for an Associate's in CS: Java and C++. Whichever you opted for, you took one class in C to get started, then three semesters of that language. You also had to take an assembler class or data structures and algorithms.

Transferred that to a UC and we definitely touched more ASM and C in the OS courses.


If not C, what did they have you use for systems programming? Like, in your Operating Systems course?


Pitt grad here, though a decade after the OP. Pitt has 3 courses related to this stuff:

  * CS 447: http://cs.pitt.edu/schedule/courses/view/447
  * CS 449: http://cs.pitt.edu/schedule/courses/view/449
  * CS 1550: http://cs.pitt.edu/schedule/courses/1550
447 and 449 are required, 1550 is optional.

447 is almost entirely MIPS assembly, and goes into hardware architecture as well https://people.cs.pitt.edu/~childers/CS0447/

449 is a C class, using K&R 2nd edition: https://people.cs.pitt.edu/~jmisurda/teaching/cs449/2164/cs0...

1550 is more specifically about operating systems, using Tanenbaum's book: https://people.cs.pitt.edu/~jmisurda/teaching/cs1550/2164/cs...

So at least today, there's one class that's all C stuff. Almost all of the rest was Java, while I was a student.


I went to Pitt from 1996-2000. In the first OS course, we did have some programming assignments in C, but it wasn't a good course and the assignments were mostly just using the Windows API. In the second OS course, the professor and his student had been working on an operating system simulator in Java (DORITOS, http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=736855...) and it was used to teach real-time systems programming without needing to get down to real hardware.


Turbo Pascal! I wrote this whole thing to simulate file system inodes in Pascal. We also did Ada and Smalltalk. But the main language for everything back in my day was Pascal. Java was just starting when I was leaving.


"This claim that positive signed overflow wraps around is neither correct by the C standard nor consistent with the observed behavior of either GCC or LLVM. This isn’t an acceptable claim to make in a popular C-based textbook published in 2015."

Perhaps someone could explain what I'm missing. It's exactly the behavior that I see using gcc-4.8 and Apple llvm-7.3.


Read the content linked in the post about undefined behavior. Signed overflow is undefined, not implementation-defined. Clang and GCC treat it as such unless you pass -fwrapv. That means they assume it cannot happen and feed that assumption into optimization passes and code generation. It's worse than the result potentially being different: a program with signed overflow may crash, corrupt data, or worse, and it happens in practice. One common example is overflow checks being optimized out when they work by trying the operation and then checking whether it overflowed. As the compilers get smarter, the problems will grow. They barely do any integer range analysis right now...
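
A concrete example of the overflow-check pattern that gets removed (a sketch of mine; exactly what each compiler does depends on version and flags):

    #include <limits.h>
    #include <stdio.h>

    /* Broken: if x == INT_MAX, "x + 1" is signed overflow, which is
       undefined, so an optimizing compiler may assume it never happens
       and fold this whole test to 0. */
    int will_overflow_broken(int x) {
        return x + 1 < x;
    }

    /* Portable: compare against the limit before doing the arithmetic. */
    int will_overflow_ok(int x) {
        return x == INT_MAX;
    }

    int main(void) {
        int x = INT_MAX;
        printf("%d %d\n", will_overflow_broken(x), will_overflow_ok(x));
        return 0;
    }

Compiling it at -O0 and then at -O2 (or with -fwrapv) is usually enough to see the two checks diverge.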


First piece in a while to make me optimistic about college curriculum priorities.

Not sure it's possible to teach green frosh 'why does industry use an old language' and the static analysis ecosystem (easier to teach skills than wisdom). But I applaud these people for trying. This feels like real programming.


Knowing assembly is a good first step and a prerequisite to being useful on an embedded project.


All good points. But teach all of this in one semester? Poor students...


For all assignments, tell the students to use the most appropriate language. Plot twist: all assignments are for high level applications and C isn't the most appropriate.



