
Let's say it's 2 TB of data; that's 16,000,000,000,000 bits.

At 50 Mbits / second that's:

16,000,000,000,000 / 50,000,000 = 320,000 seconds.

That's roughly 89 hours. I used a little more than that because I was off to a slow start and messed up on the filenames (I never expected them to be case insensitive), and then there is the bloody bandwidth limiter at Yahoo that keeps tripping you up.
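
For what it's worth, the same back-of-the-envelope sum as a quick shell check (just the numbers above, nothing measured):

# 2 TB * 8 bits, pulled at 50 Mbit/s
echo $(( 2 * 10**12 * 8 / (50 * 10**6) ))        # 320000 seconds
echo $(( 2 * 10**12 * 8 / (50 * 10**6) / 3600 )) # 88 hours (88.9 really)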

I really don't want to see the message 'Service Temporarily Unavailable.' ever again.




wget -p -r -l 0 -nc -U "I can't believe it's not Googlebot/2.1" -i "$FILE"

... and you wouldn't have.


Hehe, that's a good trick :)

Reminiscent of the Compaq one:

BIOS NOT (C) IBM 1982

Or something to that effect!

Here is what I use, now modified with your trick:

wget -r -nv -np -nc -U "I can't believe it's not Googlebot/2.1" -i "$URLFILE"

With a separate process filling the URL file in batches of 50,000 at a time.

That way you get one wget process to do a boatload of work instead of firing up a new one for every file. That also helps with re-using connections.
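
Roughly, the batching looks like this (a sketch only; urls.txt and the batch- prefix are made-up names, and the wget flags are the ones from above):

# Split the master URL list into chunks of 50,000 lines
split -l 50000 urls.txt batch-

# Hand each chunk to one long-running wget process
for URLFILE in batch-*; do
    wget -r -nv -np -nc -U "I can't believe it's not Googlebot/2.1" -i "$URLFILE"
done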


Last time I checked, Geocities didn't support HTTP keep-alive. They do, however, support gzip compression, which wget doesn't support. Also, as you noticed, including "Googlebot" in the User-Agent works around the bandwidth limit.
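
If you really wanted the gzip, one possible workaround (untested against Geocities, and the URL and filename here are just placeholders) is to request it explicitly and decompress afterwards, since wget will store the compressed bytes but won't decode them:

# Request gzip explicitly; wget saves the response body as-is
wget -nv -U "I can't believe it's not Googlebot/2.1" \
     --header="Accept-Encoding: gzip" \
     -O somepage.html.gz "http://www.geocities.com/SomeUser/index.html"

# Decompress by hand (only works if the server actually sent gzip)
gunzip somepage.html.gz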

Did you archive the regional versions (de.geocities.com, etc.) too? I see that (any-subdomain).reocities.com is currently just an alias for reocities.com. I have ~12 GB of archived data from de.geocities.com (couldn't find more links and didn't have the time later on), saved in the same format wget uses (mtime set according to the Last-Modified header, etc.), if you are interested.

Anyway, good work!


I thought about doing those too (esp. uk is large) but never got around to it, so yes, if you have it I would be much obliged.

Can you send me an email on how to receive the data?


We snagged a bunch of the other domains.



