>Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. (See Brooks p. 102.)
I came to the same conclusion, after a while. In the end, code is there to process data. (There is metaprogramming, and things like state machines, but most programs do hold data in some kind of structure to be processed.)
>Simple rule: include files should never include include files.
For small programs, that's a good rule. For larger code bodies, it's counterproductive. If my module uses foo, there's no reason for my module to know that foo depends on bar. If the next version of foo drops its bar dependency and now uses baz which also requires fat, foo's clients shouldn't have to be rewritten.
You can ameliorate the breakage sometimes by keeping dependencies out of header files (and that's generally a good idea), but sometimes you can't.
My personal rule is exactly the opposite: every header should compile with no prerequisites. For that reason, the vast majority of my foo.h files are included as the very first line of foo.c. If foo.c doesn't compile, I know to fix foo.h.
Pike's concern that lexical analysis is the most expensive compiler phase is way out of date.
>>Simple rule: include files should never include include files.
> Yes, please!
Let's say I have a C++ class defined in a header that uses std::string. So I require that anyone who uses my header already has <string> included in version 1.0.0. Then, in version 1.1.0, I add to the class in the header so that it now requires std::map. If the caller doesn't have <map> included, there will now be a compilation error! How is that better than simply including string and map in the class header itself?
Because then people know what they are using. And that is more valuable than fixing simple compiler errors.
I've seen programs that have an "uberinclude", usually named "common.h" or something like that, that includes not only standard libs but also other headers. It is horrible. There is no way to understand what code written with that mindset is doing, other than slaving over a notebook trying to make sense of it. Granted, all-encompassing/nested headers don't necessarily mean unreadable code. But from my experience with my own code, clear and separate headers (including standard ones) make me write clear and separate .c files.
Yeah I agree splitting functionality into separate headers is a good idea. But I feel like the rule "don't include headers in headers" is good advice in lots of situations.
Redis is often thrown around as an example of an excellent, clean C source code base. Look at server.h. It has 34 include statements. Would you rather that every C file that includes server.h have to include these 34 files before including server.h? Seems insane to me.
>Redis is often thrown around as an example of an excellent, clean C source code base. Look at server.h. It has 34 include statements. Would you rather that every C file that includes server.h have to include these 34 files before including server.h? Seems insane to me.
Good point. This is exactly why I would want that, yes.
Most of the advice holds true, but there's one bit you should happily ignore: don't uglify your code with external include guards; that is, put your include guards in the included file.
There is one exception to this rule which I think still holds true: MSVC doesn't perform this optimization correctly, perhaps due to its love affair with precompiled headers.
I personally prefer solving that issue with #pragma once, as I find the risk of an include guard name clash higher than the chance of the same file being reachable through two different hardlinks or copies, let alone compiling anything on a network share. It's also a lot more readable.
Lots of good advice from a long-time programmer (https://en.wikipedia.org/wiki/Rob_Pike). I have bookmarked these notes before, because his simple but effective tips resonate with me. From the article:
> Algorithms, or details of algorithms, can often be encoded compactly, efficiently and expressively as data rather than, say, as lots of if statements.
I have read elsewhere about moving your code complexity into your data, but I can't find that other article; it's hard to find any mention of this strategy at all. But I have found it to be true. Moving details from PHP into the database results in shorter code overall. The first example that comes to mind is replacing a bunch of if-statements with one or more columns in the database, like for some kind of categorization.
Generalize the business logic into things you do all the time, with data that you look up as part of the context.
In that way your program is actually more flexible, there aren't magic numbers or magic things that happen only in some cases. Those have been moved to co-residence with the data they belong with.
These are instructive to look at now. Important to recall the '1989' date of course, but with hindsight...
I love Rule 5 ("choose the right data structures and the algorithms will almost always be self-evident"), especially when combined with STL and strong typing. There is a degree of irony in taking this advice from Pike, given the design of Go.
Much of the material on complexity assumes you are hand-coding things from scratch. I am happy to take on complexity (as long as I understand what I'm getting into) from well-designed libraries rather than building something "simple and robust" from scratch. The statement about binary trees vs splay trees is illustrative; unless I need to see the bare data structure again I would much rather take on something from the STL, complex or otherwise.
Generally speaking all this stuff is good, but I'd add that his note on include files is dated and I wouldn't recommend it anymore. Any compiler worth half its salt at this point will recognize the include guard pattern and not parse through included files multiple times, so the worry about wasting tons of time due to that pattern is largely gone.
The big problem with what he's suggesting is that if you're designing a fairly big system with lots of decently small headers (which is generally good: simple headers with easy-to-read APIs are good), you'll end up with a crazy number of includes in every file, and if you change something to use a new dependency, then you'll have to change every location it is included in as well. There is something to be said for avoiding things like circular dependencies, but this requirement really doesn't make that any harder to happen, and just creates more problems and annoyances; it is not a very scalable solution.
If you look at the Linux kernel source (which is arguably one of the largest and most successful C programs), each source file has around 10 to 30 includes at the top (or more in some cases), and that's with the headers including other headers. If instead Linux had taken the approach Rob is recommending, that number would probably be an order of magnitude larger and extremely hard to manage, even if they combined a bunch of the headers they have together to reduce the total number (which, again, I would consider a huge anti-pattern).
I think I agree that includes-within-includes are probably a necessity.
But I am not sure that it allows things to be broken into "simple headers with easy-to-read APIs". That is true for headers that are small and only #include system headers and other basic dependencies. But if your own code has a nest of interrelated headers, then a single big header file is probably easier to read.
Well, I mean, perhaps I should have clarified: "simple headers with easy-to-read APIs" has to be looked at on a case-by-case basis. I would argue that as a general rule it is what you should aim for if possible (and I find it is possible in a lot of cases; if it's not, things may be getting a bit too interconnected), but you're right that there are situations where things are too complicated to split up easily, so a single header covering multiple things makes sense. I think that if you're constantly in that situation, though, you should reevaluate how you're designing your components and headers.
Doesn't every compiler support '#pragma once' these days too, making include guards largely pointless (and like most copy/paste patterns, error-prone) boilerplate?
Yes and no. Many compilers support #pragma once, but whether your build environment handles it is another matter. Having your headers on network drives, or having symlinks to your headers, etc can cause issues with #pragma once handling (depending on the implementation) that can often be hard to track down, especially if building with multiple compilers. Include header guards do not have these issues.
> For example, binary trees are always faster than splay trees for workaday problems.
I assume by this Pike means unbalanced binary trees (if not, then red-black trees are decidedly not simpler than splay trees, especially if you need to delete). In that case, I don't really believe it. Nobody uses unbalanced binary trees anymore for good reason: they have awful performance when you do something as simple as inserting keys in sorted order.
I'd like to see Pike vs. the MISRA guidelines (https://en.wikipedia.org/wiki/MISRA_C). Rob's notes here are not about the kind of safety critical, hopefully small, programs that MISRA claims to improve.
It would be interesting to hear his thoughts about just those kinds of programs, and whether pointers and function pointers are still helpful.
It's useful to apply the Steve Yegge political axis metaphor[1] here. C forces a fairly conservative approach because it stocks a full arsenal of footguns. Pike and most of the early Unix guys, though, are about as liberal as they can be within C's constraints. MISRA, on the other hand, is Idaho survivalist camp conservative.
Both parties have good reason for their ideologies. Early Unix programs were tiny enough that it was easy to keep an entire application's rules in your head and take full advantage of them. So it makes sense to play fast and loose. MISRA-compliant systems, OTOH, are developed by large teams where no member understands the whole system, and the consequences of getting something wrong are measured in attorney man-years.
Lysator has been around since forever. The English information page [1] says it was founded in 1973. I remember it from the early days of the Internet and possibly even from Usenet days.
Among other things it has a large repository of historical documents and papers on C programming and standardization dating back to the 1980s. See [2].
For what it's worth, I bookmarked the latter link in 2001.
Don't know but interestingly the Pike programming language was developed at Linköping University. Perhaps Rob Pike has gained some kind of demigod status there?
If it's anything like most universities I knew (especially in the 90s and early 2000s), then their CS department websites can host all kinds of documents and papers such as this, from ESR's stuff to "Worse is Better" to Beej's guides, etc.
You can't downvote a reply on your own post on HN.
Maybe I was a bit snarky, but if I left it to the first line what would the point be? If it's not a good example, then what is a better version of that example? How can you criticize something if you can't compare it to an alternative?
Well, maxPA, or MaximumPhysicalAddress both satisfy different goals more effectively. maxphysaddr is almost an example in my mind of how not to name things, in that it relies on abbreviation conventions that different programmers may not use consistently. As you said, within a codebase it's probably no big deal. I just think it's a bad example of a good name. I don't agree with your logic about criticism requiring an alternative.
If the purpose of criticism is to improve the state of the art, then what is the point of unexplained criticism without alternatives?
Other purposes of criticism don't seem appropriate here.
The abbreviations max and addr are found in colloquial English. Even phys is used in abbreviations like phys ed (physical education). Phys is the start of about 20 distinct words. Among them, physical has been the frontrunner in frequency since the late 1820s. [0]
The only part of the abbreviation which would be confusing to a non-programmer, is the specific use of address in this context; but if you are working in a language which directly manipulates memory addresses, you should be able to expect that of the reader.
maxPA is completely ambiguous (maximum pool allocations? maximum potential acceleration?), and it would need to be recalled, rather than recognized. On the other hand, MaximumPhysicalAddress is ridiculously huge, and would be impractical to use inline in an expression; at exactly twice the length of maxphysaddr, it conveys the exact same amount of information while being less practical to type, format, and read. Indeed, people might end up copying the variable to a local name just to fit it in their expressions. In C, it also conflicts with the common convention of making typedef'd struct names uppercase.
To this I would like to draw attention to the mention of context in this example, which I think is important and often overlooked when naming variables.
The context given for maxphysaddr is that of a global variable. Pike also mentions loop indices, which have very local scope. A loop variable can be very short because you know everything about it simply by reading the few lines around it. A global must encode much more information in its name, because its context is spread globally. If the name is not good, the developer has to spend extra time looking for appropriate context, and aside from the added frustration, time is money.
In some ways one could suggest that variable name length should be proportional to its scope. This isn't quite true, however, because we should then find that MaximumPhysicalAddress should provide greater value than maxphysaddr. Instead, we find that MaximumPhysicalAddress is a worse name due to its length. They are however both unambiguous and provide the same semantic information. This suggests that perhaps a rule of thumb should be that the name unambiguously encode the necessary semantic information proportional to its context, and no more.
Naming is however hard, in general. It seems picking good examples is equally difficult.
That's exactly my point. I did not offer MaximumPhysicalAddress originally as an alternative, because as you said it is not really any less ambiguous in this case.
However, a new programmer approaching the codebase still has to remember WHICH words are abbreviated and which variables rely on which abbreviations.
Since MaxPhysAddr is already 11 characters long, and I can type MaximumPhysicalAddress in almost exactly the same amount of time (the overhead is essentially negligible) I see no benefit to using the abbreviated form.
One must remember that maxphysaddr actually carries MORE information than MaximumPhysicalAddress, specifying that the words are all abbreviated. One can argue that the abbreviations used are conventional, but if the convention is not already known then the information required is even greater. That information is not relevant to the variable's function, and thus adds an unnecessary burden to the programmer. IMHO.
But, again, it is not a good example, and I do not have a suggestion for an alternative.
"Other purposes of criticism don't seem appropriate here" - on that I will take your word; you've demonstrated quite an aptitude for inappropriate criticism.
"Ridiculously huge" ? MaximumPhysicalAddress vs maxphysaddr, let's see, it is 22 vs 11 characters. Both are much longer than I'd want to type if speed or code byte size was a problem. For readability, the primary goal of all source code other than assembly (or code going out over-the-wire), it's not even close.
If you really think maxphysaddr is as readable as MaximumPhysicalAddress, you hopefully recognize at least that not everyone would think that way.
As for maxPA being ambiguous, well, yes. My point was that it's shorter. Perhaps one of the copied names programmers can use within their own expressions.
Of course, they can use a find-replace and replace all those copies with MaximumPhysicalAddress in nearly an instant. Which is why it's probably the name I'd use unless I had to minimize byte size. If I had to minimize byte size, I would not waste time trying to compromise between length and readability.
Otherwise I do not think there is ever a reason to compromise readability.
> maxphysaddr is almost an example in my mind of how not to name things, in that it relies on abbreviation conventions that different programmers may not use consistently
Pike deals with this very issue in the article:
> np is just as mnemonic as nodepointer if you consistently use a naming convention from which np means ``node pointer'' is easily derived
I dealt with that very issue in my comment - "it relies on abbreviation conventions that different programmers may not use consistently".
Different programmers at different times follow different conventions, no matter how overwhelmingly obvious they are at the time. Changing adds to the already huge burden of context switching.
How can you possibly account for that though? Different programmers speak different natural languages, yet most public codebases still use English. Certain terms will fall in and out of fashion over the course of time. Consistency within the project is the most important thing we can control.