
Technical Papers Every Programmer Should Read (At Least Twice) - icey
http://blog.fogus.me/2011/09/08/10-technical-papers-every-programmer-should-read-at-least-twice/
======
papaf
I read a random sampling paper recently which took a simple problem and
approached it in ways much more elegant than mine.

It's almost a toy problem, but I found the paper really interesting:

[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.138.7...](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.138.784)

~~~
mietek
Fun fact: I was asked to find a solution to this problem during an interview
at a proprietary trading firm last year.

I wish I had read this paper before.

~~~
psykotic
The algorithm for the simplest variant of that problem (take a single random
sample from a list of a priori unknown length) can be derived by simple
inductive reasoning. If the list has size 1, the problem is trivially solved.
Otherwise suppose we have a list (x:xs) and we recursively solve the problem
for xs. If all the recursion returns is the random sample from xs, we clearly
have insufficient information to take a random sample from (x:xs), so we need
to strengthen the induction hypothesis by having the recursion also return the
length of xs. Let (r', n') be the random sample and length for xs. Then the
result for (x:xs) is (r, n) where n = n' + 1 and r = x with probability 1/n or
else r'.

Another way to derive this algorithm is to notice that any solution must
implicitly or explicitly compute the length of the list as a byproduct. If you
start by writing down the recursive function for computing the length of a
list, composed with the straightforward recursive function for taking a random
sample from a list of known length, you can apply a standard fusion and
deforesting transformation to combine them into a single pass, and you end up
with the same algorithm as above.
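
Both derivations land on the same single-pass procedure, usually called reservoir sampling (here with a reservoir of size 1). A minimal sketch in Python; note it runs head-first, keeping element i with probability 1/i, whereas the induction above consumes the tail first, but the probability argument is symmetric:

```python
import random

def random_sample(xs):
    """Return (r, n): a uniform random element of xs, and its length.

    Single pass over a sequence of unknown length; the i-th element
    (1-indexed) replaces the current sample with probability 1/i,
    which leaves every element with overall probability 1/n.
    """
    r, n = None, 0
    for x in xs:
        n += 1
        if random.randrange(n) == 0:  # true with probability 1/n
            r = x
    return r, n
```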

Here's a fun problem to ponder. Find the most frequently occurring element of
a list in O(n) time and O(1) space. You may assume that this element occupies
more than half of the list's entries.
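
(Spoiler below; stop here if you want to work it out yourself. The standard answer is the Boyer-Moore majority vote, which can be sketched as:)

```python
def majority_element(xs):
    """Most frequent element of xs in O(n) time and O(1) extra space.

    Correct only under the promise that some element occupies more
    than half of the entries: each occurrence of the candidate is
    paired off against a differing element, so a true majority
    element always survives with a positive count.
    """
    candidate, count = None, 0
    for x in xs:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate
```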

~~~
DigitalBison
>Then the result for (x:xs) is (r, n) where n = n' + 1 and r = x with
probability 1/n or else r'.

Do you mean r = x:r' instead of r = x?

~~~
psykotic
No, r has the same type as x. The logic behind that probability split is
simple: We have r' (the random sample taken from the tail list xs) and x, and
we have to choose between them with some probability based on n. Clearly x
should occur with a 1 in n chance.

When you approach it inductively the way I did, there really isn't any choice
in the matter, which is why this algorithm design technique is so powerful.
Udi Manber developed the technique in his paper Using Induction to Design
Algorithms from 1988 and later used it throughout his great old book
Introduction to Algorithms: A Creative Approach from the early 1990s. Here's
the paper in case you're curious:
[http://akira.ruc.dk/~keld/teaching/algoritmedesign_f05/Artik...](http://akira.ruc.dk/~keld/teaching/algoritmedesign_f05/Artikler/05/Manber88.pdf)

~~~
ssp
This way of thinking also makes it easy to derive the max-sum algorithm from
Programming Pearls:

We have to compute max-sum for (x:xs), so first solve it for xs. If the max-
sum is within xs, we are done; if not, it somehow involves x. To find out, the
recursion has to also return the best sum of the beginning of the list.

A small case analysis can find out what the new max-sum is: x, x +
max-begin(xs), or max-sum(xs)? And the new max-begin is either x or x +
max-begin(xs).

The naive recursive implementation will require linear space though, so there
is an extra step to find out how to eliminate the recursion.
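
Eliminating the recursion gives the classic linear-time, constant-space loop (often called Kadane's algorithm). A sketch that keeps the (max-sum, max-begin) pair from the induction, folding from the right so the recursion becomes a loop:

```python
def max_subsum(xs):
    """Maximum sum over all contiguous sublists of a non-empty list.

    max_begin tracks the best sum of a sublist starting at the
    current head, exactly the strengthened hypothesis above.
    """
    max_begin = max_sum = xs[-1]
    for x in reversed(xs[:-1]):
        max_begin = max(x, x + max_begin)  # x alone, or x extending the old best beginning
        max_sum = max(max_sum, max_begin)  # best seen so far
    return max_sum
```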

~~~
psykotic
Yes, Manber derives the linear-time algorithm for the maximal subsum problem
exactly like that in his textbook. As you say, if x is involved in a maximal
subsum, it must extend the tail's maximal prefix sum, so you return that in
addition to the maximal subsum.

Regarding recursion vs iteration, Manber generally develops the right
induction hypothesis gradually, using informal language ("remove the element
and solve the smaller problem") in the process, and only turns that into a
precise algorithm once the final induction hypothesis has been found, so he
doesn't present the recursion elimination as a separate step. That works well
pedagogically. His main idea is to guide the student's intuitions rather than
formally derive a program that's correct by construction the way someone like
Richard Bird might have done it. (Bird has an interesting derivation of the
maximal subsum algorithm that begins with the brute-force algorithm written as
a combinator-based functional program and gradually transforms it in a
correctness-preserving way using program calculation techniques until arriving
at the fast linear-time program.)

------
wunki
Archived the papers mentioned by fogus and converted the .ps files to .pdf.
For those who want quick access to them, download here:

<http://c.wunki.org/A2s0>

~~~
Luyt
Thanks for the effort you took to make these papers more accessible. I've
downloaded them to my computer.

------
wgrover
Seeing Leslie Lamport's first paper (about braids, for a high school math
journal) makes me want to put together a list of computer scientists' first
papers. Now I've got two favorites:

Leslie Lamport, "Braid Theory": [http://research.microsoft.com/en-
us/um/people/lamport/pubs/b...](http://research.microsoft.com/en-
us/um/people/lamport/pubs/bxscience.pdf)

Don Knuth, "The Potrzebie System of Weights and Measures", Mad Magazine:
<http://upload.wikimedia.org/wikipedia/en/5/52/Potrzeb.jpg>

------
scott_s
A Fast File System for UNIX:
<http://www.cs.berkeley.edu/~brewer/cs262/FFS.pdf> Modern filesystems are
based on this design. They add significant features - like journaling - but by
reading this paper, you will form a basis for how modern file systems work.

The Google Filesystem: <http://labs.google.com/papers/gfs.html> Great example
of what you can design when you decide certain things - like storing small
files - are not important.

MapReduce: Simplified Data Processing on Large Clusters:
<http://research.google.com/archive/mapreduce.html> Great example of taking an
existing idea and using it to achieve high performance.

The Implementation of the Cilk 5 Multithreaded Language:
<http://supertech.csail.mit.edu/papers/cilk5.pdf> Apple's Grand Central
Dispatch uses a lot of ideas from this paper.

------
mitultiwari
Nice collection of papers. Thanks for sharing!

Here is a set of papers about distributed systems, which I read in a course
and found very useful for getting a good understanding of distributed systems
research so far: <http://www.cs.utexas.edu/~dahlin/Classes/GradOS/index.html>

------
zephyrfalcon
What would be useful is a site that collects links to CS papers, then for each
paper lets users (who presumably actually read that paper) comment on it,
and/or rate it. Or does such a beast already exist?

~~~
charlieok
Mendeley [1] does some of what you're talking about. Primarily it's a way to
organize your own collection of papers and access them from native or web apps
on various devices. It also has some social features, like connecting users
with similar interests and sharing tags on papers.

Its "Computer and Information Science" category [2] lists the papers with the
most "readers" which I think in this case means "Mendeley users who have added
this paper to their collection". Its most read paper in that category is the
MapReduce paper [3]. Oddly, as I look at it now, I don't see a feature for
rating or commenting on the paper, which could be a useful addition.

Of course, the authoritative method of rating papers is citations of other
papers, right? That's what led to PageRank after all, and most of the main
sites for finding papers have used citation counts for a long time.

[1] <http://www.mendeley.com>

[2] <http://www.mendeley.com/computer-and-information-science/>

[3] [http://www.mendeley.com/research/mapreducemerge-
simplified-r...](http://www.mendeley.com/research/mapreducemerge-simplified-
relational-data-processing-on-large-clusters/)

~~~
charlieok
Oops. Correction: The MapReduce paper doesn't have the most readers in
Mendeley. I just saw it at the top of the "Popular" list on the computer and
information science page. So they must have some other way they're sorting
that list.

------
jules
Other great papers:

Flapjax: A Programming Language for Ajax Applications:

[http://www.cs.brown.edu/~sk/Publications/Papers/Published/mg...](http://www.cs.brown.edu/~sk/Publications/Papers/Published/mgbcgbk-
flapjax/)

Embedded probabilistic programming:

<http://okmij.org/ftp/kakuritu/dsl-paper.pdf>

Memoization of Top-down Parsing:

<http://arxiv.org/PS_cache/cmp-lg/pdf/9504/9504016v1.pdf>

------
tybris
Why? Most of the contents of these papers have made their way into books and
programming languages. There's nothing wrong with getting it from a secondary
source.

~~~
scott_s
I find that reading primary sources reminds me that _I_ can be a primary
source. It reminds me that these ideas that I rely on were thought up by
people, not handed down from Mt. Olympus.

The historian in me is also just fascinated to see the source of an idea that
has made a large impact.

------
jroll
"All papers are freely available online"

Why not have links to each one?

~~~
fogus
Sorry. My Markdown syntax was bad. There are now links to the articles marked
by ↗.

~~~
Tichy
Can't find the link for "Organizing Programs Without Classes". The Oracle site
says "click here to download", but there is nothing to click?

~~~
patrickyeon
I see a postscript version here:
[http://www.cs.ucsb.edu/~urs/oocsb/self/papers/organizing-
pro...](http://www.cs.ucsb.edu/~urs/oocsb/self/papers/organizing-
programs.html)

~~~
Tichy
Thanks!

------
russellallen
Copies of most of the Self papers in PDF can be downloaded from the official
Self language website:
<http://selflanguage.org/documentation/published/index.html>

------
bbg
I'm going to be that guy:

"as a compliment to this paper" --> complement

"no higher complement." --> compliment

Thanks for a cool post.

~~~
Confusion
Such comments are more appropriately posted at the blog itself.

~~~
mahmud
A blog is as good as a research paper, to a blind bat.

------
Luyt
I'd like to see Joel Spolsky's epic article on unicode added to the list:
<http://www.joelonsoftware.com/articles/Unicode.html>

"if [...] you don't know the basics of characters, character sets, encodings,
and Unicode, and I _catch_ you, I'm going to punish you by making you peel
onions for 6 months in a submarine. I swear I will."

~~~
papaf
That's an interesting article you linked to, but the original post is about
peer-reviewed academic papers.

Meta comment: I find that downvoting people who aren't being rude or abusive
is bad manners.

------
ajross
I don't mean to speak badly of this bibliography. Of the papers in the list
I've read, they're all great. And I'll add the others to my list and make sure
I get to them.

But I think the premise of the blog post is a little flawed. Reading papers is
a poor way to make yourself a better programmer. Read them in spare moments,
sure, but spend your time reading _code_ , not paper. At best, these things
will help you avoid some design mistakes in the code you write for yourself in
your own sandboxes.

In the real world, you're dealing with code written by people who haven't read
these papers. And this is where you're going to spend all your time:
maintaining and fixing and enhancing stuff that missed all the advice in the
papers. This is especially true of "day job" programmers, but it's true at
startups too, even at seed-stage ones.

Admonitions to do stuff like re-read "Out of the Tar Pit" every six months
are just bad advice to my mind. They're a good way to convince yourself you're
smarter than everyone else. They're a bad way to get better at debugging.

~~~
wglb
I would heartily disagree. As an example, reading about Vector Clocks by
Lamport had an immediate effect on how I thought about a real-world problem,
and it has stayed with me since.

My first reading of "GOTOs considered harmful" led to the phase of my career
where I was all about structured programming. (For some perspective, much HLL
programming then was in Fortran, with its three-branch IF statement.)

 _Admonitions to do stuff like re-read "Out of the Tar Pit" every six months
are just bad advice to my mind. They're a good way to convince yourself you're
smarter than everyone else. They're a bad way to get better at debugging._ The
very best way to get better at debugging is to not put bugs there in the first
place. The Tar Pit paper is one step in that direction. And it is important to
keep at that because our habits keep trying to put state in.

So while it sounds like you are saying "get your inspiration from reading bad
code", I suspect that you don't really mean that.

~~~
cellularmitosis
"The very best way to get better at debugging is to not put bugs there in the
first place"

Actually, I've come to the opposite conclusion recently.

During college, as well as for all of my personal projects, and during the
early part of my career, I had the fortune of mostly working on projects for
which I was the sole coder from start to finish.

As you say, when you're the only variable in the equation, you eliminate the
need for strong debugging skills by following a set of best practices which
tends to prevent creation of bugs in the first place.

And then I inherited my first seriously brain-damaged codebase. 16,000 lines
of "what not to do". All of my favorite practices listed in the c2 wiki were
completely ignored. Massive copy-paste coding. State BOOLs distributed
throughout the app, making it extremely fragile and tightly coupled. Thread
safety? What's that? Hell, they couldn't even avoid giving methods and
variables ambiguous names, even in the simplest of cases ("buttonClick". wtf?
is that a command (e.g. clickButton) or an event (e.g. buttonClicked)?)

I was wholly unprepared for debugging this monstrosity. My first instinct was
to rewrite the whole thing. In fact, a co-worker of mine had attempted exactly
that before I was assigned to this beast, but introduced his own set of show-
stopping bugs before being pulled off onto another project, ultimately wasting
two weeks of work (his branch was abandoned) and souring the waters for any
such attempt on my part.

For several months I forced myself to avoid the urge to rewrite, and just
slogged through it until I finally started to understand what was going on.

I have to say, forcing myself to continue to chew on something sour when every
instinct told me to spit it out really improved my ability to follow what was
happening in a foreign codebase. Now, even when bughunting relatively sane
codebases, I feel I'm more quickly able to zero in on the problem.

~~~
wglb
I agree with what you are saying in the context of taking on another's
program. In similar situations, I have also attempted to rewrite the body of
code. Once it was a necessity, as the code was perhaps 50% wishful thinking
and never worked. I have tried in other cases, unwisely, and failed.

But when confronted with a pile of bad but working software, I have done what
you are saying here--dive in and understand what is happening.

------
rhizome
Is this title still referring to the object mentor essays that were posted a
couple weeks ago?

~~~
jleader
The post is a response to the original object mentor list of 10 papers; this
post claims to be oriented more toward technical depth and less toward
philosophy and design.

------
nithinag
Thanks for the post.

------
zackattack
Any other entrepreneurs integrating this into their standard employee training
manual?

