
Hi, jacquesm.

Huge props to you for taking this initiative, but are/were you aware of Archive Team and our process of backing up Geocities since April? We found ways around the bandwidth limit months ago, and will be mirroring/distributing the data as well. We own http://geociti.es, for example.

We're somewhere in the 1TB range of data, and we're still finding new stuff, by the way.

I think I have 2+ TB altogether right now, and I pulled that in 6 days, which makes me wonder how there can be such a big difference?

Anyway, feel free to 'complete' your set from mine, but let's please coordinate that so we don't kill this poor little server :)

I remember Jason (of the Archive Team) telling me that Yahoo! had very little bandwidth that they were allocating toward Geocities and that downloading was horridly slow. Getting even 1MB/s was nearly impossible.

I'm wondering if Yahoo! increased the available bandwidth (or maybe everyone else just stopped using it, increasing what was apparently available) so that when you got to it then it was nice and zippy compared to when the Archive Team hit it earlier in the year.

That's entirely possible. I have no idea how they were doing it; I have about 20 different IPs in the farm that is doing this, 8 machines in total.

Even the mail server is doing double duty :)

The only thing that is still doing what it is intended for is my main webserver, everything else is going flat-out. There is some risk of duplication but I'll take care of that later.

I'm getting nearly 150MBit/sec peak so I really can't complain.

I have to hand it to my provider though: we get transit times that are just about unbelievable, between 30 and 50 ms RTT when it's quiet, and still under 150 when it's busy. That helps a lot.

Well, there's no way we're going to coordinate without a rsync being somewhere in the mix. There's no rush, we can discuss it after the screaming dies down. I'm interested as well. At worst, it means we saved even more data, which works for me.

It's cool.

I could do with a break :)

Hardest working week of the last decade for me.

There are lots of bits and pieces that were hard to get to but I think I got most of it.

http://archiveteam.org and http://archive.org have been working together since May backing up http://geocities.com. http://archive.org has done two full crawls using a shared seed list that the two groups have been modifying and sending back and forth, since there isn't an index for Geocities anywhere to be found. I'm surprised that the http://reocities.com guys got a full crawl done at all in only 6 days.

Let's say it's 2T of data, that's 16,000,000,000,000 bits.

At 50 Mbits / second that's:

16,000,000,000,000 / 50,000,000 = 320,000 seconds.

That's just under 89 hours. I used a little more than that because I got off to a slow start and messed up the filenames (I never expected them to be case insensitive), and there's the bloody bandwidth limiter at Yahoo that keeps tripping you up.
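For anyone checking the arithmetic, it works out in the shell (using the 2 TB and 50 Mbit/s figures quoted above):

```shell
# 2 TB of data expressed in bits:
bytes=2000000000000
bits=$((bytes * 8))        # 16,000,000,000,000 bits
rate=50000000              # 50 Mbit/s sustained
seconds=$((bits / rate))
echo "$seconds seconds"    # prints: 320000 seconds (just under 89 hours)
```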

I really don't want to see the message 'Service Temporarily Unavailable.' ever again.

wget -p -r -l 0 -nc -U "I can't believe it's not Googlebot/2.1" -i "$FILE"

... and you wouldn't have.

Hehe, that's a good trick :)

Reminiscent of the Compaq one:


Or something to that effect!

Here is what I use, now modified with your trick:

wget -r -nv -np -nc -i "$URLFILE"

With a separate process putting the URLs into files in batches of 50,000.

That way you get one wget process to do a boatload of work instead of firing up a new one for every file. That also helps in re-using the connections.
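A minimal sketch of that batching arrangement, assuming the master URL list sits in a single file (the 50,000 figure is from the comment above; filenames here are invented, and the demo uses a tiny list and batch size so it can run standalone):

```shell
# Demo with a tiny list and batch size; the real run used -l 50000.
printf '%s\n' url1 url2 url3 url4 url5 > all_urls.txt  # stand-in URL list
split -l 2 all_urls.txt batch_       # produces batch_aa batch_ab batch_ac
ls batch_*
# Each batch then feeds one long-lived wget process, so connections
# get reused instead of firing up a new wget per file:
# for f in batch_*; do wget -r -nv -np -nc -i "$f"; done
```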

Last time I checked, Geocities didn't support HTTP keep-alive. They do however support gzip compression, which wget doesn't support. Also, as you noticed, including "Googlebot" in the User-Agent works around the bw limit.

Did you archive regional versions (de.geocities.com, etc.) too? I see that (any-subdomain).reocities.com currently is just an alias for reocities.com. I have ~ 12 GB of archived data from de.geocities.com (couldn't find more links and didn't have the time later on), saved in the same format as wget (mtime set according to Last-Modified header etc.) if you are interested.

Anyway, good work!

I thought about doing those too (esp. uk is large) but never got around to it, so yes, if you have it I would be much obliged.

Can you send me an email on how to receive the data?

We snagged a bunch of the other domains.

> the reocities guys

I think that describes Jacques pretty well; he's a substantial team unto himself.
