Here is the war journal:
Quote of the day.
"It doesn't matter what you do with apache, if there is a problem you can always solve it with mod_rewrite. The question is how." -- jacquesm
If it's the latter, did you consider distributing the crawlers to other users? Some sort of system where people with spare bw/clock cycles can do your work for you and free up your bw/clock cycles to receive and parse the data? Would writing something like that up have taken more time than it would have saved?
Regardless, congratulations on your accomplishment. It really is impressive.
I took some shortcuts there, so I'm fairly sure that a portion of what I've downloaded is in duplicate, but that will be resolved in a merge step.
Right now the files are spread out over 7 machines: the one I started on is the 'master', and the 6 others each hold a portion of the data.
Each of the 6 others was told to fetch only from a restricted area of GeoCities, but the master had no such restriction, so chances are there is some duplication between the master and the individual slaves.
Merging all the data and importing the user accounts is going to take a couple of days at least; it's quite a collection of files. I have no stats yet, but when I'm done I'll do a write-up on the main statistics.
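
For the curious, a merge step like that could look roughly like the sketch below. It is only an illustration under a few assumptions, not the actual script: it assumes each machine's download has already been copied into a local directory tree with the same relative layout, and the names `merge_trees` and `file_digest` are made up for the example. A file whose path and content already exist in the destination is skipped as a duplicate; a file with the same path but different content is reported and left alone.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def merge_trees(sources, dest):
    """Copy every file from each source tree into dest, skipping duplicates.

    Hypothetical helper: `sources` is a list of directory trees (e.g. the
    master first, then each slave), `dest` is the merged output tree.
    Returns (copied, skipped, conflicts), where conflicts lists relative
    paths that exist in dest with different content.
    """
    dest = Path(dest)
    copied = skipped = 0
    conflicts = []
    for src in map(Path, sources):
        for path in sorted(src.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(src)
            target = dest / rel
            if target.exists():
                # Same path already merged: compare content before deciding.
                if file_digest(target) == file_digest(path):
                    skipped += 1
                else:
                    conflicts.append(rel)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied += 1
    return copied, skipped, conflicts
```

Listing the master tree first means its copy wins when a slave fetched the same file, and the `conflicts` list shows where two machines saw different versions of the same page.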