
http://archiveteam.org and http://archive.org have been working together since May backing up http://geocities.com. http://archive.org has done two full crawls using a shared seed list that http://archiveteam.org and http://archive.org have been modifying and sending back and forth, since there isn't an index for geocities anywhere to be found. I'm surprised that the http://reocities.com guys got a full crawl done at all in only 6 days.



Let's say it's 2 TB of data; that's 16,000,000,000,000 bits.

At 50 Mbits / second that's:

16,000,000,000,000 / 50,000,000 = 320,000 seconds.

That's roughly 89 hours. I used a little more than that because I was off to a slow start and messed up on the filenames (I never expected them to be case-insensitive), and there is the bloody bandwidth limiter at Yahoo that keeps tripping you up.
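
Spelled out as a quick shell check (same figures as above; actual throughput will of course vary):

bits=16000000000000          # 2 TB = 2e12 bytes = 16e12 bits
rate=50000000                # 50 Mbit/s
seconds=$((bits / rate))     # 320,000 seconds
echo "$seconds s = $((seconds / 3600)) h"   # prints 320000 s = 88 h (integer division, just shy of 89)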

I really don't want to see the message 'Service Temporarily Unavailable' ever again.

-----


wget -p -r -l 0 -nc -U "I can't believe it's not Googlebot/2.1" -i "$FILE"

... and you wouldn't have.
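
For anyone without the wget man page handy, a commented copy of that invocation ($FILE being whatever seed list you feed it):

# -p    also fetch page requisites (images, CSS) for each page
# -r    recurse into linked pages
# -l 0  no depth limit (0 is treated as unlimited)
# -nc   no-clobber: skip files already present locally
# -U    send this User-Agent string; looking like Googlebot sidesteps the throttle
# -i    read the start URLs from $FILE
wget -p -r -l 0 -nc -U "I can't believe it's not Googlebot/2.1" -i "$FILE"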

-----


Hehe, that's a good trick :)

Reminiscent of the Compaq one:

BIOS NOT (C) IBM 1982

Or something to that effect!

Here is what I use, now modified with your trick:

wget -r -nv -np -nc -i $URLFILE

With a separate process feeding it files in batches of 50,000.

That way you get one wget process to do a boatload of work instead of firing up a new one for every file. That also helps with re-using connections.
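
A minimal sketch of that batching (split, the batch_ prefix and urls.txt are just my illustration, not the actual feeder process):

# Cut the master URL list into files of 50,000 lines each,
# then let a single wget chew through each batch.
split -l 50000 urls.txt batch_
for URLFILE in batch_*; do
    wget -r -nv -np -nc -i "$URLFILE"
done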

-----


Last time I checked, Geocities didn't support HTTP keep-alive. They do, however, support gzip compression, which wget doesn't support. Also, as you noticed, including "Googlebot" in the User-Agent works around the bandwidth limit.
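
If you want to check that sort of thing yourself, one rough way (the URL below is just a placeholder) is to look at the response headers:

# Content-Encoding: gzip  -> the server compressed the body
# Connection: close       -> no keep-alive on this connection
curl -s -D - -o /dev/null -H 'Accept-Encoding: gzip' \
  'http://www.geocities.com/SomeNeighborhood/1234/' \
  | grep -Ei '^(content-encoding|connection):'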

Did you archive regional versions (de.geocities.com, etc.) too? I see that (any-subdomain).reocities.com is currently just an alias for reocities.com. I have ~12 GB of archived data from de.geocities.com (couldn't find more links and didn't have the time later on), saved in the same format as wget (mtime set according to the Last-Modified header, etc.), if you are interested.
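
For the curious, setting the mtime from Last-Modified can be reproduced roughly like this with GNU curl/touch (URL and LOCALFILE are placeholders, not the script actually used):

# Copy the server's Last-Modified header onto the saved file,
# roughly what wget does when it writes a download to disk.
LM=$(curl -sI "$URL" | sed -n 's/^[Ll]ast-[Mm]odified: *//p' | tr -d '\r')
[ -n "$LM" ] && touch -d "$LM" "$LOCALFILE"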

Anyway, good work!

-----


I thought about doing those too (esp. uk is large) but never got around to it, so yes, if you have it I would be much obliged.

Can you send me an email on how to receive the data?

-----


We snagged a bunch of the other domains.

-----


the reocities guys

I think that describes Jacques pretty well; he's a substantial team unto himself.

-----



