
Hehe, that's a good trick :)

Reminiscent of the Compaq one:

BIOS NOT (C) IBM 1982

Or something to that effect!

Here is what I use, now modified with your trick:

wget -r -nv -np -nc -i $URLFILE

With a separate process filling $URLFILE with batches of 50,000 URLs.

That way one wget process does a boatload of work instead of firing up a new one for every file. It also helps with re-using connections.
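
Roughly, the batching part can be as simple as splitting the master URL list and looping over the chunks. A sketch only; the $ALLURLS variable and the urlchunk_ prefix are placeholders, not what I actually run:

split -l 50000 "$ALLURLS" urlchunk_
for f in urlchunk_*; do
  # one long-lived wget per 50,000-URL batch
  wget -r -nv -np -nc -i "$f"
done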




Last time I checked, Geocities didn't support HTTP keep-alive. They do however support gzip compression, which wget doesn't support. Also, as you noticed, including "Googlebot" in the User-Agent works around the bandwidth limit.
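
For reference, the User-Agent workaround with wget looks something like this (the exact UA string here is only an example, not necessarily what was used):

wget -r -nv -np -nc \
  --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  -i "$URLFILE"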

Did you archive the regional versions (de.geocities.com, etc.) too? I see that (any-subdomain).reocities.com is currently just an alias for reocities.com. I have ~12 GB of archived data from de.geocities.com (couldn't find more links and didn't have time later on), saved in the same format wget produces (mtime set from the Last-Modified header, etc.), if you are interested.
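
In case it helps sanity-check the format: a rough sketch (made-up URL and path) of stamping a saved file with the server's Last-Modified time, which is what wget does by default:

# grab the Last-Modified header and apply it as the file's mtime
lm=$(curl -sI "http://de.geocities.com/example/page.html" | tr -d '\r' \
     | awk -F': ' 'tolower($1) == "last-modified" {print $2}')
touch -d "$lm" "de.geocities.com/example/page.html"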

Anyway, good work!

-----


I thought about doing those too (especially uk.geocities.com, which is large) but never got around to it, so yes, if you have it I would be much obliged.

Can you send me an email about how to receive the data?

-----


We snagged a bunch of the other domains.

-----



