
Programs to Read - mpweiher
http://wiki.c2.com/?ProgramsToRead
======
lucaslee
I saw the Linux kernel was recommended many times here, but how many people
actually read it? Where do you even start? The Linux kernel has around 60,000
files and 25 million lines of code...

I think smaller projects are better for learning purposes. If you are
interested in reading some smaller projects, check out my project here
[https://github.com/CodeReaderMe/awesome-code-reading](https://github.com/CodeReaderMe/awesome-code-reading).

~~~
ken
Nobody ever wrote a 25MLOC program from start to finish, so I don't think it
makes much sense to read it that way.

I'd read it the way it was written: from the beginning. What's Linux 0.01 look
like? What's the next changeset after that release look like? What was
necessary to add the driver for your favorite device? What changes were made
for your particular CPU?

Programs are not static works (except maybe TeX and Metafont). They exist in
the form they do in order to be amenable to changes. So look at the changes
that drove it.

~~~
afarrell
Have you actually done this? What was your experience?

~~~
ken
Not for Linux, but it's how I approach new programs I have to work on.

I can't decipher this 1000-line function, but it came from somewhere. What did
it start out as? That's what the author originally intended it to be. What
caused it to grow? That's what features someone else thought it needed.
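
A rough sketch of how that reading order could be automated, assuming the project lives in git and `git` is on the PATH. The helper names `parse_log` and `file_history` are made up for this example. The reversal is done in Python rather than by passing `--reverse` to git.

```python
import subprocess


def parse_log(text):
    """Turn tab-separated `git log` output into (hash, date, subject)
    tuples, ordered oldest first."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice so a tab inside the subject is preserved.
        commit, date, subject = line.split("\t", 2)
        entries.append((commit, date, subject))
    entries.reverse()  # git log prints newest first
    return entries


def file_history(path, repo="."):
    """Return the history of one file, following renames, oldest first."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--follow", "--date=short",
         "--format=%h%x09%ad%x09%s", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_log(out)
```

The first entry is what the file started out as; running `git show` on each hash in turn shows what caused it to grow.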

~~~
jammygit
How long does this sort of thing take? It's a really neat approach, but it
seems huge.

------
kornish
Another list of reads worth mentioning is the Architecture of Open Source
Applications book [1]. Each application is small (< 500LOC) and contains
design rationale in an approachable literate programming style.

[1]: [http://aosabook.org/en/index.html](http://aosabook.org/en/index.html)

------
wwweston
Knuth vs McIlroy's solutions for a Word Frequency programming challenge:

[http://www.leancrew.com/all-this/2011/12/more-shell-less-
egg...](http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/)

Knuth's solution has its own elegance and interest including various
efficiencies and from-scratch facilities.

McIlroy's will make you wonder how much you're wasting in time and complexity.
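
For context, the challenge is to read a text and report the k most frequent words with their counts. Below is a minimal Python sketch of the task itself. It is neither Knuth's nor McIlroy's code; `top_words` is a name invented here, and treating a word as a run of ASCII letters is an assumption of this sketch.

```python
import re
from collections import Counter


def top_words(text, k):
    """Return the k most frequent words in text as (word, count) pairs."""
    # Lowercase first so "The" and "the" count as the same word.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(k)
```

Like McIlroy's pipeline, it leans entirely on existing facilities instead of building its own data structures.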

~~~
YZF
Anywhere I can see Knuth's solution online? Can't seem to find the link in
there.

~~~
wwweston
Here's the paper:

[https://www.cs.tufts.edu/~nr/cs257/archive/don-
knuth/pearls-...](https://www.cs.tufts.edu/~nr/cs257/archive/don-
knuth/pearls-2.pdf)

(ACM ref, too:
[https://dl.acm.org/citation.cfm?id=315654](https://dl.acm.org/citation.cfm?id=315654)
)

Also, I found a blog post which links to a C version of the exercise in the
spirit of Knuth's approach, and a Haskell version in the spirit of McIlroy's:

[https://franklinchen.com/blog/2011/12/08/revisiting-knuth-
an...](https://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-
word-count-programs/)

------
jolmg
So, is wiki.c2.com something that people keep writing on or is it really
readonly nowadays? Is there a way to become a member? I just mistakenly signed
up to hypothes.is from the right side-bar thinking it was c2.com.

~~~
tabtab
Ward Cunningham decided to close c2 to new material in part because he got
tired of battling a determined vandal who was all bent out of shape over
grammer and spailing, and second to try out his new "Federated Wiki" concept.
So far, Federated Wiki is a flop in my opinion.

C2 was interesting in that software developers and IT professionals tried to
categorize debates and disagreements over software engineering issues. Almost
nothing was ever settled, but you got different perspectives. For example, is
the "best" software about maximizing economics & psychology (human grokking),
or symbolic parsimony? Is it better to hire 5 geniuses or 10 ordinary
engineers?

~~~
amyjess
> Ward Cunningham decided to close c2 to new material in part because he got
> tired of battling a determined vandal who was all bent out of shape over
> grammer and spailing

It's honestly pretty upsetting that one vandal is enough to ruin one of the
oldest and most significant institutions on the web for everybody.

~~~
lozenge
I think it has value as a snapshot archive as well. The article on extreme
programming would look quite different today.

~~~
tabtab
The reworked version doesn't show the original indentation of discussion
trees, making it harder to read. All bullet points have been flattened to one
level, compared to the original, which showed many levels.

------
hangonhn
I went through the list and the comment about Apache Lucene really piqued my
interest. I found the Lucene code in GitHub but some of the sources have more
boilerplate Java comments than code! Anyways, I remembered Java Docs are often
excellent so here's the Lucene Java Docs:

[https://lucene.apache.org/core/7_5_0/core/index.html](https://lucene.apache.org/core/7_5_0/core/index.html)

From my experience, studying the API Docs for Java and using it as a guide to
the code makes understanding the code much easier.

The code is here:

[https://github.com/apache/lucene-solr](https://github.com/apache/lucene-solr)

------
akkartik
One claim that caught my eye: TeX being unbuggy because it's written in the
Literate Programming style. I like LP but that's a big claim. Occam's razor
suggests a more likely answer: it was written by Knuth! A great programmer,
and also one who grew up programming non-interactively:
[http://ed-thelen.org/comp-hist/B5000-AlgolRWaychoff.html#7](http://ed-thelen.org/comp-hist/B5000-AlgolRWaychoff.html#7)

~~~
deathanatos
On the one hand, I see your point. But on the other hand, if a master of some
skill says "This is how I achieve the results I achieve." — perhaps imitating
that _might_ just be valuable. Does a dancer not get better by learning to
imitate (at least at first) a master? (Though perhaps eventually, a deeper
understanding of the artform might take hold, at which point experimentation
and improvisation allow the dancer to explore undiscovered variations of the
artform.)

And I don't think it is just that Knuth does it that catches my attention
either. Ideally, I think (if I understand the premise of literate
programming), literate programming would force the programmer to take the time
to explain the code being written, and ideally, the _rationale_ behind why
that code solves the problem at hand. And to do that, the programmer must
first understand the code being written, and I think upon attempting to do
that, they'll find they have not sufficiently thought the problem through.

I fear too many of the coders I know would not honestly attempt it; they are
in too much of a hurry to get the code written.

Now, I don't think literate programming is the only way to accomplish that;
simply writing good docstrings and comments might suffice to a good degree
(and _good_ ones: I too often review code where the docstring is a basic
repetition of the function name and arguments, or where the comments explain
_what_ is being done, but not _why_ ). But far too often I see no comments, no
docstrings, no tests, just miles and miles of code, and so it is no wonder
that Knuth is what he is.

~~~
akkartik
No disagreement there! Though you may violently agree or disagree with this
thing I wrote a couple of years ago:
[http://akkartik.name/post/literate-programming](http://akkartik.name/post/literate-programming)

I also just looked at your profile, and it resonates a lot. Tell me what you
think of mine:
[http://akkartik.name/post/about](http://akkartik.name/post/about)

~~~
svat
I'm not the person you're replying to, but I think you mean
[http://akkartik.name/about](http://akkartik.name/about) (and not
[http://akkartik.name/post/about](http://akkartik.name/post/about)).

And though you didn't ask me, I read it (having been annoyed by your LP post
for a few years now) and it strongly resonates with me — the mission, and
everything until the start of the “Constraints on a solution” section. IMO the
way to give more people more understanding of programs is not to write an
entire new programming language / operating system and hope that enough people
switch to it, but to work on delivering that “understanding” for existing
programs in existing languages. It may be harder, but is more likely to be
useful, and you also get feedback on what kind of understanding people want
and lack.

~~~
akkartik
Many thanks for the correction, and for the comment! It is absolutely likely
that delivering the understanding for existing programs is the way to go.
Unfortunately I just don't know enough for that yet. It's a hard chicken-and-
egg problem: to understand the current stack I need precisely the concepts and
tools that I'm trying to work out.

Over the last few years I've switched back and forth between the two extremes.
I spent a couple of years off and on learning how operating systems (OpenBSD,
Sortix, a little bit of Linux) work. I've gained some hard-won facility for
poking around inside GNU package sources. So it's yet possible that I'll find
a way to make progress on an existing stack.

It also depends what the criterion for 'success' is when we consider this
"strike out vs work within" tradeoff. If the goal is some level of adoption
then it's a no-brainer that working with existing platforms is the way to go.
But I may be content just to figure out the right answer for myself.

Working within an existing platform requires a time commitment to a single (or
a tiny subset) of the many projects that our platforms have balkanized into.
After all that time commitment, effecting change from within requires a level
of politics that is definitely not my strong suit. These projects have real
users and justifiably shouldn't be giving me the time of day anytime soon.

One alternative I've considered is forking a mature platform. It remains on my
radar, but at the moment I think the drop-off in benefits the instant I hit
'fork' is way too great. Consider a platform like OpenBSD that frankly is
created by way smarter people than myself and is way more mature. The level of
adoption it gets from being POSIX compliant is so minuscule; would making
incompatible changes really be _that_ counter-productive? It's worth asking if
you're baiting big to catch small.

On some level my real target is to change the customs that influence how open
source projects are governed. Even if I managed to overcome all the previous
hurdles, it still seems impossible within the existing framework to do things
like encourage more forks in projects, or convince more people to cull their
dependencies, or read their sources. There's just too much baggage. Starting
afresh may paradoxically make a new way easier to see.

This is the "idea maze" as I see it. I'd love to hear more, what you think of
it.

~~~
svat
Well it's a noble goal and I certainly don't want to discourage you, however
you go about it! Perhaps you will learn something new and useful.

But just to make my meaning clear, by “existing programs in existing
languages”, I meant doing it in a way that does _not_ require effecting change
at all. What understanding is it possible to deliver for an existing,
complicated, messy codebase? For example, you correctly noted that early
versions tend to be easier to understand, and code tends to accumulate
complexity that makes the global structure less clear. This is true and
something I use often: use "blame" to look at where a particular chunk of code
was introduced, and look at the corresponding change, along with its
description/commit message, which is often simpler. (And I see you've written
a tool
([http://akkartik.name/post/wart-layers](http://akkartik.name/post/wart-layers))
for those who choose to stay
conscious of this and write code in a particular way.) But most software today
is available with version history. So, what if this were easier? E.g. imagine
if when you view code there's a slider that you can move back and forth to see
older or newer versions, while the changes fade out. Or, imagine highlighting
the “base” of the code versus the less important changes. Or something;
experimentation will reveal what tends to be useful for existing codebases.
(And if people find the tools useful, that may even effect change in how code
is written, as authors get feedback on what the tool thinks versus their
mental understanding, and tweak until there's a match. I've seen Typescript
being sold not for some putative benefits of typing on code correctness but
simply for enabling IDE autocomplete for instance.)
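
As a rough sketch of the "highlight the base" idea, assuming git: the output of `git blame --line-porcelain <file>` can be parsed to flag which lines still date from the oldest commit that blame reports. The helper names `parse_blame` and `mark_base` are hypothetical, and "last touched by the oldest commit" is only one possible definition of "base".

```python
import re

# Each blamed line starts with "<commit hash> <orig line> <final line> ...".
HEADER = re.compile(r"^([0-9a-f]{40,64}) \d+ \d+")


def parse_blame(porcelain):
    """Parse `git blame --line-porcelain` output into one
    (commit, author_time, text) tuple per source line."""
    rows, commit, when = [], None, None
    for line in porcelain.split("\n"):
        if line.startswith("\t"):
            # A leading tab marks the actual source line.
            rows.append((commit, when, line[1:]))
        elif line.startswith("author-time "):
            when = int(line.split()[1])
        else:
            match = HEADER.match(line)
            if match:
                commit = match.group(1)
    return rows


def mark_base(rows):
    """Pair each source line with True if it comes from the oldest
    commit that blame reports for the file."""
    oldest = min(when for _, when, _ in rows)
    return [(when == oldest, text) for _, when, text in rows]
```

A viewer could then dim the lines flagged `False`, which roughly approximates fading out the later changes.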

The broader point is that, to me it seems that your writing and efforts have
the implied assumption that understanding of global structure is hard to
acquire because everyone is making _mistakes_ , and if everyone is just
careful to do things differently, the difficulties will disappear and
understandable programs will magically emerge. That is something worth
investigating, but I think there's a good chance that perhaps not everyone is
making mistakes (as even programs written by the best programmers tend to
become hard to understand eventually), and/or that it's not feasible for
everyone to be super careful when trying to get things done. (Rather, they're
making tradeoffs, and are likely to make similar tradeoffs in future.) Not all
the accumulated complexity may be accidental; some is inherent in the fact
that the problem in the real world _does_ have messy corner cases (as Spolsky
said: [https://www.joelonsoftware.com/2000/04/06/things-you-
should-...](https://www.joelonsoftware.com/2000/04/06/things-you-should-never-
do-part-i/)). Similarly most of the causes you identified (backwards
compatibility considerations, churn in personnel, vestigial features) (and
those identified elsewhere, e.g. in “out of the tar pit”) can be real and
unavoidable: there may be features that are needed only for (say) users of old
systems but still cannot simply be removed. The best that can be hoped for is
to make this fact clearer, not to make them go away.

Finally, there's also the fact that “understanding” is not a property inherent
in the system (code, program, whatever) itself, but something that grows in
the head of the reader. (Perhaps trying to influence the _writer_ is not the
best way...) And different readers come with different questions and goals,
and at least as far as the first paragraph of your
[http://akkartik.name/about](http://akkartik.name/about) goes, may need
different sorts of help in different contexts. It's unlikely a fixed
organization of the program is going to satisfy everybody.

An example: error handling. We've all seen functions that spend only a few
lines doing their “main” job and many more lines checking for errors and
dealing with them. This can obscure what the main job of the function is, and
make it appear as though error-handling is the main part. (Aside: Knuth
observes this causes a psychological barrier against writing too much error
handling, while with his literate programming one shunts off the error-handling
to a different section/module, and one tends to write better error-handling
there.) But consider
[https://danluu.com/postmortem-lessons/](https://danluu.com/postmortem-lessons/)
which says “If you care
about building robust systems, the error checking code is the main code!” So
depending what kind of understanding a reader is looking for at a certain
time, sometimes they may want to understand the “happy” path and sometimes the
error handling.
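
A toy illustration of the two views, in plain Python rather than literate programming proper. The function names are invented for this sketch. The first version interleaves the checks with the one line of real work; the second moves the same checks into their own section, so a reader can pick the happy path or the error handling.

```python
def mean_interleaved(values):
    """One line of real work, buried under the checks that guard it."""
    if values is None:
        raise ValueError("values is None")
    if len(values) == 0:
        raise ValueError("values is empty")
    if not all(isinstance(v, (int, float)) for v in values):
        raise TypeError("values must be numbers")
    return sum(values) / len(values)


def check_values(values):
    """The same checks, moved to their own section."""
    if values is None:
        raise ValueError("values is None")
    if len(values) == 0:
        raise ValueError("values is empty")
    if not all(isinstance(v, (int, float)) for v in values):
        raise TypeError("values must be numbers")


def mean(values):
    """Happy path: readable at a glance, with the checks one call away."""
    check_values(values)
    return sum(values) / len(values)
```

Both versions behave identically; only what the reader sees first differs.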

Similarly in general: given a program, sometimes we want to understand roughly
how it's organized / what its major components are, sometimes we want to
understand the precise boundaries/interfaces between these components,
sometimes we want to understand the sequence of operations the program
performs and sometimes the frequency, sometimes we want to understand what it
does in the “steady state” and sometimes what it does at startup or shutdown
or some corner case. And in fact not always the global structure of a program
but sometimes only enough to understand how it solves a specific problem — if
I want to make a change to (say) Firefox in an afternoon, I may just want to
know how it does (say) font fallback.

All that said, yours is an interesting project, for the reasons you
mentioned, and I look forward to what comes out of it. Apologies for a verbose
comment; I'll stop here :-)

~~~
akkartik
I really appreciate the detailed comment! I love talking about this stuff.

---

You're absolutely right that code reading is a non-linear activity and that
won't ever change.[1] I'm not trying to make all code reading a linear
activity; that would be truly quixotic. I want to keep the code organized for
the convenience of the writer but still provide an initial orientation for
newcomers, when they aren't concerned with error handling or precise interface
boundaries.

---

Finding the right version to look at is part of the solution, as you noticed.
But there's a second half: making sure the information newcomers need is
actually in the repo. Somewhere, _anywhere_. My sense is that existing
codebases don't actually contain all the information needed to truly
comprehend them. The context the system runs in, and all the precise issues it
guards against. Tests are a huge help here, but I'm constantly making changes
to code that I tested in some other terminal or browser window with some
complex workflow. Then I often save the code in one window and close the other
window. That's a huge loss of information, and it's compounding over and over
again in current platforms. All because there are manual tests we can do that
aren't easily encoded as automated tests.

It's certainly possible to port my ideas to an existing stack so that more
tests can be represented. But how do we recover all the knowledge that has
been lost so far?

---

I don't think the problem is that authors make mistakes[2]. They understand
the domain far better than an outsider like me. No, the problem is that
authors don't capture everything that is in their heads, and that means that
_I_ make mistakes when I try to build on their work.

One selfish reason I care about making the global structure of my codebase
comprehensible is the very slight hope that others can take it in new
directions and add expertise I won't ever gain by myself, in a way that I and
others can learn from.

---

 _"...most of the causes you identified (backwards compatibility
considerations, churn in personnel, vestigial features) can be real and
unavoidable: there may be features that are needed only for (say) users of old
systems but still cannot simply be removed. The best that can be hoped for is
to make this fact clearer, not to make them go away."_

If the codebase is more comprehensible and captures all compatibility and
other concerns as tests, then it becomes easy to fork. Users of old systems
could stay on one fork and others could be on a simpler fork that deletes the
compatibility tests and the code that implements them. That way they aren't
paying a complexity penalty for what they don't use. The tests would make it
tractable to exchange code between the two forks, even if they diverge pretty
far over time. Or so I hope.

---

Thank you again for your detailed feedback. Now I have a sense of a few more
issues to avoid in my writing.

[1] That's partly my complaint with typography in LP: we end up polishing one
happy path to death.

[2] If you notice places in my writing that strengthen this impression that
I'm trying to reduce mistakes, I'd really appreciate hearing about them. I'm
explicitly trying to avoid the failure mode that I think projects like TUNES
fall into, of trying to come up with the one perfect architecture.

------
e12e
Seems like this thread needs a link to:
[http://aosabook.org/en/index.html](http://aosabook.org/en/index.html)

------
ryanmccullagh
Question, where does one start, when planning to read the Linux kernel? There
is so much code. I have read it, but randomly. I have read contents of net/
kernel.

I have read the main method where it attempts to launch pid1 of /bin/bash,
etc.

Is there a really good place to start reading? For example how does Linux talk
to the hard drive?

What's the first thing that happens in the kernel?

~~~
varjag
Start skimming through the included documentation files. You don't have to do
a thorough read, but the Documentation folder is structured in a way mirroring
the organization of the kernel. They will help you to understand the
structure, and you can get deeper understanding of the essential parts that
way.

Once you have a mental map of major areas of the kernel it will be easier to
relate to the sourcetree.

There used to be a number of great books on fairly esoteric issues of the
kernel (like, comprehensive explanations of network stack). However most of
them seem to be stuck at 2.6 and are no longer very up to date.

------
tomxor
> Some are GreatProgramsToRead _and some are not_

Indeed! It took me a while to appreciate (because it can be painful). Reading
bad programs can be very educational, especially when you try to infer the
decisions or processes that caused it to be bad in a particular way.

------
riffraff
Many years ago I enjoyed Code Reading[0] which is a whole book which basically
just discusses snippets from open source codebases.

It might be a bit old (I honestly don't remember the specifics) but most of
the code was already "mature" at the time so I think it would still be valid.

[0]
[https://en.wikipedia.org/wiki/Code_Reading](https://en.wikipedia.org/wiki/Code_Reading)

------
kureikain
If you like this kind of code-reading list, I run a newsletter at
[https://betterdev.link/](https://betterdev.link/) where we have a small
section that includes one or two interesting GitHub repos per language per
issue.

------
injb
The Subversion API should be on this list too. I had occasion to do some
development with it before and it's beautifully written.

------
potta_coffee
I often recommend the Flask source code, if you're into Python. Very clean and
well structured, great example code.

------
qwerty456127
Such a small list, such a humble selection of programming languages :-(

------
sneak
+ nginx

~~~
beefhash
On the topic of C, I'd also strongly recommend mandoc[1]. It solves a number
of hard problems (indexing and searching of man pages, rendering and parsing
markup and translating said markup to HTML, PDF and tty output). The code
remains fairly accessible. Definitely one of the codebases I regularly refer
to for style and practices.

[1] [https://mandoc.bsd.lv/cgi-bin/cvsweb/](https://mandoc.bsd.lv/cgi-bin/cvsweb/)

~~~
kristapsdz
Thanks! :)

------
dfrunza
Programs to read ... first entry 'JavaUnit'.. SKIP!

------
oraraonaro
Is there a resource with the best snippets of the code sources listed here?
Can't be bothered to trawl through for the good stuff.

