At 50 Mbit/s that's:
16,000,000,000,000 / 50,000,000 = 320,000 seconds.
That's roughly 89 hours. I used a little more than that because I got off to a slow start and messed up the filenames (I never expected them to be case-insensitive), and there is the bloody bandwidth limiter at Yahoo that keeps tripping you up.
I really don't want to see the message 'Service Temporarily Unavailable' ever again.
... and you wouldn't have.
Reminiscent of the Compaq one:
BIOS NOT (C) IBM 1982
Or something to that effect!
Here is what I use, now modified with your trick:
wget -r -nv -np -nc -i $URLFILE
With a separate process feeding it URLs in batches of 50,000 files.
That way one wget process does a boatload of work instead of firing up a new one for every file. That also helps with reusing connections.
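For what it's worth, here's a rough sketch of how that batching could look (the names all_urls.txt and batch_ are just placeholders, not the actual files used):

# split the master URL list into 50,000-line batches
split -l 50000 all_urls.txt batch_
# run one wget per batch, so connections get reused within each run
for URLFILE in batch_*; do
    wget -r -nv -np -nc -i "$URLFILE"
done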
Did you archive the regional versions (de.geocities.com, etc.) too? I see that (any-subdomain).reocities.com is currently just an alias for reocities.com. I have ~12 GB of archived data from de.geocities.com (I couldn't find more links and didn't have the time later on), saved in the same format wget uses (mtime set according to the Last-Modified header, etc.), if you are interested.
Anyway, good work!
Can you send me an email about how to receive the data?