
Oh, now I understand. So distributing it would still work? (Or would have worked, I guess.) Sorry to belabor the point, I'm just learning about this stuff and want to make sure I understand it, and you seem like you might be able to answer the question :)



Yes, absolutely. That's how most of the work got done. The biggest problem once you start distributing it is avoiding duplication.

I took some shortcuts there, so I'm fairly sure that a portion of what I've downloaded is duplicated, but that will be resolved in a merge step.

Right now the files are spread out over 7 machines: the one I started on is the 'master', and there are 6 others that each hold a portion of the data.

Each of those has been told to fetch only from a restricted area of geocities, but the master had no such restriction, so chances are there is some duplication between the master and the individual slaves.
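To make the "restricted area" idea concrete, here is a minimal sketch of one way to split the work so that no two machines fetch the same page. It is not the code I actually ran. It assumes URLs look like http://www.geocities.com/Area51/..., treats the first path segment as the "area", and the names worker_for and should_fetch are made up for illustration.

```python
import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 6  # the six non-master machines; adjust as needed


def worker_for(url: str, num_workers: int = NUM_WORKERS) -> int:
    """Map a URL to a worker index based on its first path segment.

    Assumption: the first path segment identifies the area of the site.
    A cryptographic hash is used instead of the built-in hash() because
    hash() on strings is randomized per process, so different machines
    would disagree about who owns which area.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    area = segments[0].lower() if segments else ""
    digest = hashlib.sha1(area.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def should_fetch(url: str, my_index: int) -> bool:
    """Return True only on the worker that owns this URL's area."""
    return worker_for(url) == my_index
```

Each worker would call should_fetch(url, my_index) before downloading and skip anything it does not own. An unrestricted machine like my master skips that check, which is exactly where the duplication comes from.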

Merging all the data and importing the user accounts is going to take a couple of days at least. It's quite a collection of files. I have no stats yet, but when I'm done I'll do a write-up on the main statistics.
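For the curious, a merge with duplicate removal could look roughly like the sketch below. This is an illustration under assumptions, not my actual merge script: it assumes each machine's download is a directory tree with the same relative layout, and merge_trees and file_digest are hypothetical names.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def merge_trees(sources, dest):
    """Copy files from each source tree into dest, skipping duplicates.

    A file whose relative path already exists in dest is skipped when the
    contents match, and recorded as a conflict when they differ.
    Returns (copied, skipped, conflicts).
    """
    dest = Path(dest)
    copied, skipped, conflicts = 0, 0, []
    for src in map(Path, sources):
        for path in sorted(src.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(src)
            target = dest / rel
            if target.exists():
                if file_digest(target) == file_digest(path):
                    skipped += 1  # exact duplicate, nothing to do
                else:
                    conflicts.append(rel)  # same path, different bytes
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied += 1
    return copied, skipped, conflicts
```

Listing the master first means its copy wins on exact duplicates, and anything in the conflicts list (same path, different content, e.g. a page that changed between fetches) is left for manual review rather than silently overwritten.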
