Huge props to you for taking this initiative, but are/were you aware of Archive Team and our process of backing up Geocities since April? We found ways around the bandwidth limit months ago, and will be mirroring/distributing the data as well. We own http://geociti.es, for example.
We're somewhere in the 1TB range of data, and we're still finding new stuff, by the way.
Anyway, feel free to 'complete' your set from mine, but let's please coordinate that so we don't kill this poor little server :)
I'm wondering if Yahoo! increased the available bandwidth (or maybe everyone else just stopped using it, freeing up what was apparently available), so that when you got to it, it was nice and zippy compared to when the Archive Team hit it earlier in the year.
Even the mail server is doing double duty :)
The only thing still doing what it's intended for is my main webserver; everything else is going flat-out. There is some risk of duplication, but I'll take care of that later.
I'm getting nearly 150 Mbit/sec at peak, so I really can't complain.
I have to hand it to my provider, though: we get transit times that are just about unbelievable, between 30 and 50 ms RTT when it's quiet, and still under 150 ms when it's busy. That helps a lot.
I could do with a break :)
Hardest working week of the last decade for me.
There are lots of bits and pieces that were hard to get to but I think I got most of it.
At 50 Mbit/sec that's:
16,000,000,000,000 bits / 50,000,000 bits/sec = 320,000 seconds.
That's just under 89 hours. I used a little more than that because I was off to a slow start and messed up the filenames (I never expected them to be case-insensitive), and there is the bloody bandwidth limiter at Yahoo that keeps tripping you up.
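The arithmetic above can be checked in the shell (assuming ~2 TB of data, i.e. 16×10^12 bits, and a sustained 50 Mbit/sec):

```shell
# Back-of-the-envelope transfer time for ~2 TB at 50 Mbit/sec.
bits=16000000000000
rate=50000000                 # 50 Mbit/sec sustained
seconds=$((bits / rate))      # 320,000 seconds
hours=$((seconds / 3600))     # integer division: 88 (just under 89 hours)
echo "${seconds} seconds, about ${hours} hours"
```

Any throttling or restarts on top of that only push the total up, which matches the "a little more than that" above.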
I really don't want to see the message 'Service Temporarily Unavailable' ever again.
... and you wouldn't have.
Reminiscent of the Compaq one:
BIOS NOT (C) IBM 1982
Or something to that effect!
Here is what I use, now modified with your trick:
wget -r -nv -np -nc -i "$URLFILE"
With a separate process building $URLFILE in batches of 50,000 files. That way one wget process does a boatload of work instead of firing up a new one for every file, which also helps with re-using connections.
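A minimal sketch of that batching setup, assuming a master list `all-urls.txt` (the batch size and filenames here are illustrative, not the exact ones used):

```shell
# Split the master URL list into batches of 50,000 and feed each
# batch to a single long-lived wget process. One wget per batch
# re-uses connections instead of forking a new process per file.
split -l 50000 all-urls.txt batch-
for URLFILE in batch-*; do
    wget -r -nv -np -nc -i "$URLFILE"
done
```

With -nc (no-clobber), a crashed batch can simply be re-run: files already on disk are skipped.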
Did you archive the regional versions (de.geocities.com, etc.) too? I see that (any-subdomain).reocities.com is currently just an alias for reocities.com. I have ~12 GB of archived data from de.geocities.com (I couldn't find more links and didn't have the time later on), saved in the same format wget uses (mtime set according to the Last-Modified header, etc.), if you are interested.
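For reference, the mtime convention mentioned above can be replicated outside wget too. A sketch using curl and GNU touch (the URL and filenames are illustrative; curl's -R/--remote-time flag does the same in one step):

```shell
# Set the saved file's mtime from the server's Last-Modified header,
# matching the wget-style archive format described above.
# (URL and filenames are illustrative.)
url="http://de.geocities.com/example/index.html"
curl -s -o page.html -D headers.txt "$url"
lm=$(sed -n 's/^[Ll]ast-[Mm]odified:[[:space:]]*//p' headers.txt | tr -d '\r')
[ -n "$lm" ] && touch -d "$lm" page.html
```

Keeping mtimes this way preserves when each page was last changed on Geocities, which is the only per-file timestamp that survives the mirroring.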
Anyway, good work!
Can you send me an email on how to receive the data?
I think that describes Jacques pretty well; he's a substantial team unto himself.