Very interesting. I'm unclear about one thing though -- were the bandwidth caps per account, or per user/session + account?
If it's the latter, did you consider distributing the crawlers to other users? Some sort of system where people with spare bw/clock cycles can do your work for you and free up your bw/clock cycles to receive and parse the data? Would writing something like that up have taken more time than it would have saved?
Regardless, congratulations on your accomplishment. It really is impressive.
Oh, now I understand. Then distributing it would still work? (Or would have worked, I guess.) Sorry to belabor the point, I'm just learning about this stuff, and I want to make sure I understand it, and you seem like you might be able to answer the question :)
Yes, absolutely. That's how most of the work got done. The biggest problem when you start distributing it is avoiding duplication.
I took some shortcuts there, so I'm fairly sure that a portion of what I've downloaded is duplicated, but that will be resolved in a merge step.
Right now the files are spread out over 7 machines; the one I started on is the 'master', and then there are 6 others that each have a portion of the data on them.
Each of those has been told to fetch only from a restricted area of GeoCities, but the master had no such restrictions, so chances are there is some duplication between the master and the individual slaves.
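The partitioning scheme above can be sketched roughly like this. This is just an illustration of the general idea (assigning each worker a disjoint slice of the crawl via a stable hash), not the actual code used; the names and the worker count are assumptions:

```python
# Hypothetical sketch of deterministic work partitioning: each slave only
# fetches URLs that hash into its own bucket, so slaves never overlap.
# (The duplication described above comes from the master, which fetched
# without such a restriction.)
import hashlib

NUM_WORKERS = 6  # assumed: the six slave machines

def worker_for(url: str) -> int:
    """Map a URL to exactly one worker via a stable content hash."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS

def my_share(urls, worker_id):
    """Return only the URLs this worker is responsible for fetching."""
    return [u for u in urls if worker_for(u) == worker_id]
```

Because the hash is deterministic, every machine can compute the assignment locally with no coordination traffic, which is what makes this kind of split cheap to run.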
Merging all the data and importing the user accounts is going to take a couple of days at least; it's quite a collection of files. I have no stats yet, but when I'm done I'll do a write-up on the main statistics.