I get the algorithm, but what tool(s) did you use to generate the page itself?

RiderOfGiraffes · on Dec 16, 2010

OK, here's the script that drives the process.

    echo Retrieving index
        curl http://www.paulgraham.com/articles.html > data0

    echo Extracting URLs
        grep -ho "<a href[^>]*>[^<]*<.a>" data0 | grep -v http | grep -v RSS > data1

ComputeRankings takes the list of essays, extracts the outgoing links from each one to build a digraph, and then computes the PageRank of each page.

CreatePGER takes the results and an HTML template and glues them together to create the rankings page.

    echo Creating rankings page
        ./ComputeRankings.py > ComputeRankings.out
        ./Create_PGER.py > PaulGrahamEssaysRanking
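
For readers who want to see the ranking step concretely, here is a minimal sketch of PageRank by power iteration. It is not the actual ComputeRankings.py, which isn't shown in this thread; the function name, the damping factor of 0.85, the iteration count, and the essay names are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared equally by all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank along each outgoing link.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical essay names and links, for illustration only.
example_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

if __name__ == "__main__":
    ranks = pagerank(example_links)
    for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(f"{page}  {score:.4f}")
```

The ranks always sum to 1, so they can be read as the share of time a random reader following links would spend on each essay.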

MakeGraph outputs dot files for the giant component and the "other nodes" graph.

    echo Creating DOT file
        ./MakeGraph.py > links0.dot

Then I use neato to lay out the graphs. In each case I run it a few hundred times and pick the result that has the largest boxes. Since the output size is fixed, bigger boxes mean a more compact layout, and that seems to be a good heuristic.

    echo Create Giant.png
        ./LayoutGraph Giant 11,40

    echo Create Other.png
        ./LayoutGraph Other   8,5

Finally I create the HTML you see using an HTML template.

    echo Create page with map
        ./Create_PGE.py > PaulGrahamEssays
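
The template step can be sketched roughly as follows. This is only a guess at the general shape, not the real Create_PGER.py or Create_PGE.py, neither of which is shown here; the template text, the placeholder names `$title` and `$rows`, and the function name are hypothetical.

```python
from html import escape
from string import Template

# Hypothetical page template; the real one is not shown in the thread.
PAGE = Template("""<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
<ol>
$rows
</ol>
</body>
</html>
""")


def render_rankings(title, ranked):
    """`ranked` is a list of (name, url, score) tuples, already sorted best first."""
    rows = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(name)}</a> ({score:.4f})</li>'
        for name, url, score in ranked
    )
    return PAGE.substitute(title=escape(title), rows=rows)


if __name__ == "__main__":
    print(render_rankings("Essay rankings", [("An Essay", "essay.html", 0.3942)]))
```

Escaping the names and URLs before substitution keeps a stray `&` or `<` in an essay title from breaking the generated page.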

There's a small lie in this. The web site is actually a statically generated "wiki". When it was first devised, I didn't have the facility to run scripts on my host. I generate the pages as plain text with some mark-down, then generate the entire site off-line and upload only the parts that have changed.

If you try to edit a page, your suggested new version gets emailed to me, where it goes through an aggressive spam filter. Then it sits in my inbox for me to decide if it's a good change. If so, I trigger a refresh. Some people have passwords that trigger the refresh automatically, without my intervention, so it really does work as a limited-access wiki.

Er, right. Is that what you wanted to know?

RiderOfGiraffes · on Dec 16, 2010

Here's a stripped-down version of the "LayoutGraph" script. In essence, lay out the graph 100 times and score each attempt according to the width of the biggest box. Save each attempt in a file with the score in its name, then pick off the best. That way I can compare attempts with each other while it runs.

    GraphName=$1
    GraphSize=$2

    echo Laying out graph $1

    LayoutParms="-Gstart=random -Nshape=box -Goverlap=false -Gsplines=true"

    for n in 9 8 7 6 5 4 3 2 1 0
    do
        for m in 9 8 7 6 5 4 3 2 1 0
        do
            neato ${GraphName}.dot $LayoutParms -Gsize="${GraphSize}" -Tpng -o ${GraphName}.png -Timap -o ${GraphName}.map
            score=`sed "s/,/ /g" ${GraphName}.map | gawk '{printf("%04d\n",$5-$3)}' | sort -rn | head -1`
            mv ${GraphName}.png ${GraphName}.png_$score
            mv ${GraphName}.map ${GraphName}.map_$score
            echo $n$m $score `ls -l ${GraphName}.png_* | tail -1`
        done
    done

    cp `ls ${GraphName}.png_0??? | tail -1` ${GraphName}.png
    cp `ls ${GraphName}.map_0??? | tail -1` ${GraphName}.map

    ls -r ${GraphName}.png_0??? | tail -n +2 | xargs rm
    ls -r ${GraphName}.map_0??? | tail -n +2 | xargs rm
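
The scoring pipeline in that script (sed, gawk, sort, head) reads the image map that neato writes with -Timap and takes the largest value of x2 minus x1, which is the width in pixels of the widest box. Here is the same idea as a small sketch. It assumes each node appears as a line of the form `rect URL x1,y1 x2,y2` with integer coordinates, and unlike the original pipeline it skips any line that isn't a `rect` line. The function names and sample data are made up for illustration.

```python
def layout_score(imap_text):
    """Return the width in pixels of the widest rect in an image map."""
    widest = 0
    for line in imap_text.splitlines():
        # After replacing commas, a rect line reads: rect URL x1 y1 x2 y2
        fields = line.replace(",", " ").split()
        if len(fields) >= 6 and fields[0] == "rect":
            widest = max(widest, int(fields[4]) - int(fields[2]))
    return widest


def pick_best(attempts):
    """`attempts` maps a label to image-map text; return the best-scoring label."""
    return max(attempts, key=lambda label: layout_score(attempts[label]))


if __name__ == "__main__":
    # Hypothetical map output from two layout attempts.
    attempts = {
        "attempt_00": "base referer\nrect a.html 5,5 77,53\nrect b.html 100,5 190,53\n",
        "attempt_01": "base referer\nrect a.html 5,5 65,53\n",
    }
    for label, text in attempts.items():
        print(label, layout_score(text))
    print("best:", pick_best(attempts))
```

Here the first attempt wins, because its widest box is 90 pixels against 60 for the second.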

slug · on Dec 17, 2010

A small tip for the for loops:

  for n in {9..0}; do echo $n ; done

or

  for n in $(seq 9 -1 0); do echo $n ; done

RiderOfGiraffes · on Dec 17, 2010

Cool - thanks. I did actually know that, but I never got around to changing it to the more efficient version, and it's nice to be reminded.