Tell HN: My 'mystery project': I couldn't sleep, so I backed up GeoCities (reocities.com)
250 points by jacquesm on Oct 26, 2009 | 65 comments

Hi, jacquesm.

Huge props to you for taking this initiative, but are/were you aware of Archive Team and our process of backing up Geocities since April? We found ways around the bandwidth limit months ago, and will be mirroring/distributing the data as well. We own http://geociti.es, for example.

We're somewhere in the 1TB range of data, and we're still finding new stuff, by the way.

I think I have 2+ TB altogether right now, and I did that in 6 days; it makes me wonder how there can be such a big difference?

Anyway, feel free to 'complete' your set from mine, but let's please coordinate that so we don't kill this poor little server :)

I remember Jason (of the Archive Team) telling me that Yahoo! had very little bandwidth that they were allocating toward Geocities and that downloading was horridly slow. Getting even 1MB/s was nearly impossible.

I'm wondering if Yahoo! increased the available bandwidth (or maybe everyone else just stopped using it, increasing what was apparently available) so that when you got to it then it was nice and zippy compared to when the Archive Team hit it earlier in the year.

That's quite possible. I have no idea how they were doing it; I have about 20 different IPs in the farm that is doing this, 8 machines in total.

Even the mail server is doing double duty :)

The only thing that is still doing what it is intended for is my main webserver, everything else is going flat-out. There is some risk of duplication but I'll take care of that later.

I'm getting nearly 150MBit/sec peak so I really can't complain.

I have to hand it to my provider though, we get transit times that are just about unbelievable: when it's quiet, between 30 and 50 ms rtt, and when it's busy still under 150. That helps a lot.

Well, there's no way we're going to coordinate without a rsync being somewhere in the mix. There's no rush, we can discuss it after the screaming dies down. I'm interested as well. At worst, it means we saved even more data, which works for me.

It's cool.

I could do with a break :)

Hardest working week of the last decade for me.

There are lots of bits and pieces that were hard to get to but I think I got most of it.

http://archiveteam.org and http://archive.org have been working together since May backing up http://geocities.com. http://archive.org has done two full crawls using a shared seed list that http://archiveteam.org and http://archive.org have been modifying and sending back and forth, since there isn't an index for geocities anywhere to be found. I'm surprised that the http://reocities.com guys got a full crawl done at all in only 6 days.

Let's say it's 2T of data, that's 16,000,000,000,000 bits.

At 50 Mbits / second that's:

16,000,000,000,000 / 50,000,000 = 320,000 seconds.

That's 88 hours. I used a little more than that because I was off to a slow start and messed up on the filenames (I never expected them to be case insensitive), and there is the bloody bandwidth limiter at Yahoo that keeps tripping you up.
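That back-of-envelope calculation can be checked in a few lines of shell (the figures are the ones quoted above; the variable names are my own):

```shell
# Sanity-check the transfer-time estimate: 2 TB at a sustained 50 Mbit/s.
DATA_BITS=$((2 * 8 * 10**12))   # 2 TB expressed in bits
RATE=$((50 * 10**6))            # 50 Mbit/s
NEEDED=$((DATA_BITS / RATE))    # seconds of sustained transfer
HOURS=$((NEEDED / 3600))
echo "$NEEDED seconds = $HOURS hours"   # 320000 seconds = 88 hours
```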

I really don't want to see the message 'Service Temporarily Unavailable.' ever again.

wget -p -r -l 0 -nc -U "I can't believe it's not Googlebot/2.1" -i "$FILE"

... and you wouldn't have.

Hehe, that's a good trick :)

Reminiscent of the Compaq one:


Or something to that effect!

Here is what I use, now modified with your trick:

wget -r -nv -np -nc -i "$URLFILE"

With a separate process putting files into batches of 50,000.

That way you get one wget process to do a boatload of work instead of firing up a new one for every file. That also helps in re-using the connections.
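A minimal sketch of that batching setup (the file names, batch size, and synthetic URL list are assumptions on my part, and the wget invocations are echoed rather than run so the commands can be inspected first):

```shell
# Generate a synthetic stand-in for a real seed list (hypothetical URLs).
seq 1 120000 | sed 's|^|http://geocities.com/page/|' > urls.txt

# Split the list into batches of 50,000 so one long-lived wget process
# handles a boatload of files and can re-use its connections.
split -l 50000 urls.txt batch_

for f in batch_*; do
    echo wget -r -nv -np -nc \
         -U "I can't believe it's not Googlebot/2.1" -i "$f"
done
```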

Last time I checked, Geocities didn't support HTTP keep-alive. They do however support gzip compression, which wget doesn't support. Also, as you noticed, including "Googlebot" in the User-Agent works around the bw limit.

Did you archive regional versions (de.geocities.com, etc.) too? I see that (any-subdomain).reocities.com currently is just an alias for reocities.com. I have ~ 12 GB of archived data from de.geocities.com (couldn't find more links and didn't have the time later on), saved in the same format as wget (mtime set according to Last-Modified header etc.) if you are interested.

Anyway, good work!

I thought about doing those too (esp. uk is large) but never got around to it, so yes, if you have it I would be much obliged.

Can you send me an email on how to receive the data ?

We snagged a bunch of the other domains.

the reocities guys

I think that describes Jacques pretty well; he's a substantial team unto himself.

For all of those who were wondering what my 'mystery project' was: it's an all-out effort to back up all of geocities.com in 6 days before closing time.

It was a lot of hard work, with the help from Abi and some others.

I think we got most if not all of it.

The account restoration process will probably take the better part of the week, there is just too much raw data to do it all in one go. There are still a whole pile of integrity checks to be done and broken links to be repaired.

Please be kind to the server, it is still doing a lot of very hard work in the background. On the homepage there is a status indicator that shows you how many accounts and files have been restored. How many accounts and files eventually will be restored I can't tell you right now but my guess is that we've managed to save a very large portion of geocities.

Great idea! My wife was just lamenting the loss of her high-school web years. Hopefully they'll live on through reocities. Nice design too, especially considering the time crunch.

Abi did an awesome job, he did the whole thing in under 5 hours.

I have no idea how long it will take to process all the raw data; it's spread out over a whole pile of machines right now, and I'm pulling it in batch by batch to integrate it into the main site.

But it's as close to a 'drop-in' replacement as I could think of.

Thanks Jacques. Super cool project -- was a pleasure to work with you!


Here is the war journal:




Quote of the day.

"It doesn't matter what you do with apache, if there is a problem you can always solve it with mod_rewrite. The question is how." -- jacquesm

Very interesting. I'm unclear about one thing though -- were the bandwidth caps per account, or per user/session + account?

If it's the latter, did you consider distributing the crawlers to other users? Some sort of system where people with spare bw/clock cycles can do your work for you and free up your bw/clock cycles to receive and parse the data? Would writing something like that up have taken more time than it would have saved?

Regardless, congratulations on your accomplishment. It really is impressive.

Per account, the same user session would be able to see other accounts.

Oh, now I understand. Then distributing it would still work? (Or would have worked, I guess.) Sorry to belabor the point; I'm just learning about this stuff, and I want to make sure I understand it, and you seem like you might be able to answer the question :)

Yes, absolutely. That's how most of the work got done. The biggest problem when you start distributing it is to avoid duplication.

I took some shortcuts there, so I'm fairly sure that a portion of what I've downloaded is in duplicate, but that will be resolved in a merge step.

Right now the files are spread out over 7 machines, the one I started on is the 'master', and then there are 6 others that have a portion of the data on them.

Each of those has been told to fetch only from a restricted area of geocities, but the master one had no such restrictions, so chances are there is some duplication between the master and the individual slaves.

Merging all the data and importing the user accounts is going to take a couple of days at least, it's quite a collection of files. I have no stats yet but when I'm done I'll do a write-up on the main statistics.
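One common way to get that kind of partitioning without overlap (a sketch of the general technique, not the actual reocities setup) is to route each URL to a worker by hashing it, after lowercasing to account for the case-insensitive paths:

```shell
# Deterministically assign a URL to one of N workers: every machine
# then owns a disjoint slice of the namespace, so no two machines
# fetch the same page.
N=7
url="http://geocities.com/SiliconValley/1234/index.html"

# Lowercase first, since Geocities paths were case-insensitive.
key=$(printf '%s' "$url" | tr 'A-Z' 'a-z' | md5sum | cut -c1-8)
worker=$((0x$key % N))
echo "fetch $url on worker $worker"
```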

That was a great read - nice work jacquesm!

Interesting!! Bookmarking to see what other gremlins pop up!

What's the copyright position here?

In the UK, transitive copies for the purposes of display, caching &c. have been cleared as non-infringing actions. Google's caches link back to the original author by way of attribution, for example. But archiving and reproduction without attribution in any way?

Also, whilst Google may have been given a pass in robots.txt (were sub-sites allowed individual robots files? I never had one) to crawl the site, declaring oneself as Googlebot in order to spider and archive the whole site could well show bad faith?

Just wondered if you'd discussed the copyright position, perhaps with Geocities. Maybe there was a disclaimer that effectively released content as PD, I doubt it though.

A brief discussion on webmasterworld, http://www.webmasterworld.com/foo/3898789-2-30.htm , but more interestingly an idea to "rape" geocities for content for ad serving sites, see http://ducedo.com/free-content-geocities/ .

Someone on webmasterworld considered whether Google might back-rate based on content so that highly rated content pages that disappear from Geocities (&c.) could be given a boost in the SERPs.

OT: did you use the current username-based addressing too, or are you only linking the old "campus" names? Can't remember mine, it was in RT somewhere IIRC.

Good stuff Jacquesm. That's one of the most beautiful things about the internet, things can literally live on forever.

Simple as you take it, in this one thread alone there are at least two copies of significant portions of geocities, and I am sure there are others out there (not even aware of HN) that have done the same.

Oh how I love technology.

WP mentions a couple more sites doing some archiving:


> To fix links pointing to old GeoCities pages, we provide you with a small Firefox Greasemonkey script.

This is a nice touch.

I think the Internet Archive got involved in this too. I asked the head of the IA if they could just get Yahoo to give them the hard disks and stuff them wholesale in the IA. But in the end, the IA put a 'spider' page on one of the main Geocities FAQ pages on Yahoo, which is not bad I guess.

The IA is distinct from Archive Team BTW.

You can scroll back for a bit in screen. C-a [ (or, apparently, C-a Esc) goes into copy mode, in which you can scroll using the arrow keys. Exit copy mode by wailing on Esc until it gives up. Be careful not to leave a screen sitting in copy mode and expect the process in the screen to keep running.

Just a heads up: The "some interesting pages" link (http://reocities.com/tablizer/) on this page (http://reocities.com/newhome/makingof.html) returns a 404.

Don't worry, I've got them.

Restoring all this is going to take some time, it's spread out over a number of machines right now.

This is the master copy:


But that does not include all the other boxes, just this one.

edit: Ok, it's fixed now. Thanks again!

Quite an achievement in such a short timespan!

Nitpicks: the frontpage says "an verification method"; should be "a". And, of course, validation: http://vldtr.com/?key=reocities.com

Hey Jeroen,

I fixed that thanks!

As for validation, I'm painfully aware of it, that's entirely my doing not Abi's. I will fix those errors asap but I have to concentrate on getting the user data in there right now, which is still quite a job.

The design was imported in a great hurry and I absolutely suck at CSS and anything else that is design/formatting related.

Give me tables any day :)

But I will get around to it.

If you feel like helping out shoot me an email ;)

I love the internet for things like this. Awesome job. I remember back in the day hoarding hundreds of geocities accounts to store and distribute MP3s . . . and of course I owned the copyrights to all of those. . . .

If those files are still there then jacques is going to be serving those from his own pages very shortly ...

Ha. That's great. Too bad I didn't have gmail back then, because my site names would totally be archived.

FYI: In "Making of" you're linking to http://reocities.com/tablizer which is a 404.... probably want to fix that ;-D

Ok, I bumped that one in the restoration queue.

Should have thought of that before; there was another mention of it below, but I didn't think I'd be able to work around it without messing things up.


Awesome work. Seriously, mad props are in order for a bang-up hack job.

Great project, mate!

Do you plan to host reocities indefinitely, or is this just a stopgap measure before you can donate the collection to another organization?

I can host it just about forever, I own & operate ww.com, which has a fairly large traffic bill anyway.

I figure if I put some 'friendly' ads on it the thing should pay for itself and that's good enough for me.

The kind of corporate superstructure that Y! puts on top of its products is what makes it inviable, not the concept of free hosting by itself.

If you figure that bandwidth in bulk costs around $3 / Mbit / month then you can serve an awful lot of pages to make back that 3 dollars.

Geocities pages weigh in at about 25K apiece from my meagre sample, so based on that cost per Mbit that's 13 million pageviews, plus a bit thrown in for server depreciation.
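That estimate holds up, assuming a 30-day month and decimal 25 KB pages (both assumptions on my part):

```shell
# How many 25 KB pages does one sustained Mbit/s serve per month?
BYTES_PER_SEC=$((1000000 / 8))       # 1 Mbit/s in bytes/s
MONTH=$((30 * 24 * 3600))            # seconds in a 30-day month
PAGES=$((BYTES_PER_SEC * MONTH / 25000))
echo "$PAGES pages per Mbit-month"   # 12960000, i.e. ~13 million
```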

I'm not too scared about doing that.

If the need arises I can put the whole thing in a foundation to keep it alive forever.

I'm paranoid, and my first thought is that if you put 'friendly' ads on there, it'll only be so long until someone comes along and calls what you're doing profiting from copyright infringement. What's the deal with copyright on geocities stuff? I'm not well-versed in this matter.

As far as I know that's exactly the situation that there was before, after all Yahoo also had ads all over the place (and those were 'non friendly', as in popups and stuff like that).

Bandwidth costs $, I'll take the risk as long as it is manageable; if it goes over that then it will have to make some money. Not much, but enough to keep going.

The copyright of the materials is totally clear, it lies with the original authors, not with me.

But since they were being 'hosted' before in an environment that made their sites disappear on an hourly basis, and that will no longer happen, they might even see it as an improvement; hard to tell at this point in time.

If someone owns a piece of it and doesn't want it on there I'm sure they'll tell me, it's not as if I'm hiding.

To me it's on the order of the preservation of the 'stone age' of the internet; if I can only preserve it 'offline' for my own gratification that would be a useless thing, it has to live on. If you're willing to sponsor the bandwidth then we can look at that, that would be an easy way to keep it completely advertising free. Personally I would prefer that, but if it is to be done out of pocket then that will only go so far. If I have to drop a grand on it per month to keep it ad free then I'll do that. If it is more than that then there will have to be some other way to make it pay for itself. Maybe a donation button (though I don't think those work very well; I'm one of the few people I know that actually does donate to projects that I use), or some other mechanism.

Time will tell. But without the data it all stops, so that had to come first.

If someone owns a piece of it and doesn't want it on there I'm sure they'll tell me, it's not as if I'm hiding.

That's not how copyright works - "Well Your Honour I was selling those DVDs in public if the film distributors didn't want me to they can just ask, so no fine for me??"

Plus if you're putting ads with this you can't exactly say you're not making a commercial enterprise out of it. I'd leave it to someone with lots of lawyers.

I wish I could remember what my Geocities site was called. It was before Yahoo! bought them, so it has been quite a while.

Give me a segment of text that was in there and I'll scan for it.

I remember that I wrote "Under construction" and "Please visit my guestbook" as a link on my site. Can you find it?

I don't have a geocities page, but that is an awesome offer!

So many times I hear people saying: I had a page, but can't remember what it was.

I wish I could forget my Geocities pages. The internet has a tendency to record all the stupid things you do when you're young.

Are you suggesting a business model ;) ?

Yes! You could easily sell users a service that helped them hunt down and delete their old embarrassing geocities sites, if you could figure out a way to confirm that they are indeed the authors.

People could use Google, but you could put extra work into brewing up some special sauce that would, for example, let them find their sites with only a combination of vague memories, such as their neighborhoods, or when they created it, or the types of things they linked to, or the background music their site had (mine had the Mission Impossible song) or the type of content (mine had lots of animated gifs). Google wouldn't care enough to do that.

If users find their content on their own, they can always request it to be removed, all you're selling them is a tool to help them do it.

I wonder if people would really pay for it? It would be fun to find out. I would help build it.

I'd do it for free, regardless. It's their content after all.

There's a sketch on my notepad here about authenticating 'lost' content. It's not easy and there will be a lot of stuff that needs special casing, but I think it can be done.

That's evil genius thinking right there. There are some newsgroup postings Google has archived for eternity that I would pay good money to get rid of. Hosting this might get expensive after a while; are you going to put up ads anywhere, or do you plan on eating the cost?

Is it really that hard to delete a newsgroup posting?

Do you still have control of the from address of the posting?

I've deleted stuff from newsgroup archives before; had some hoops to jump through, but it was possible.

I've been racking my brain and it's a lost cause. I think it was my first dive into websites and I seriously doubt there was anything of consequence (I was in middle school). Certainly nothing I can remember other than I know I had an account and remember having to choose a neighborhood. I appreciate the offer though.

Mine was in the Bunker! It was related to Red Alert 1 and had lots of Under Construction images :)

Is there any small chance that you could release some of the code that you used to scrape it? I'm interested in archiving some site and wondering what you used to execute it? Just scripting a lot of wgets?

Yes, just a bunch of wgets. That's the principle anyway.

But it is quite a bit more involved, because you somehow have to avoid duplication and retrying stuff that simply doesn't exist. Then there's the problem that the URLs weren't case-sensitive, which causes wget to retrieve much more than necessary.

The code I wrote is pretty geocities specific, I highly doubt it has any value outside of that (other than a sustained DDOS maybe ;) ).
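The case-insensitivity trap can be neutralized with a normalization pass before fetching; a sketch with made-up URLs:

```shell
# Two URLs differing only in case are the same Geocities page, so
# lowercase everything before deduplicating to keep wget from
# fetching the same content twice.
printf '%s\n' \
  "http://geocities.com/Area51/Vault/9999/" \
  "http://geocities.com/area51/vault/9999/" \
  | tr 'A-Z' 'a-z' | sort -u > deduped.txt
wc -l < deduped.txt   # 1
```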

It looks like someone already pointed it out, but you should definitely talk to the Archive Team on IRC: http://www.archiveteam.org/index.php?title=IRC_Channel

They helped me out with an idea I had a while back on shortened URLs. The same thing needs to happen with shortened URLs, because once a shortened URL service goes down, all the links are lost...

Now do you need to buy more print cartridges?


Super nice work.

I wonder about two things:

Did anybody try to contact Yahoo to get a copy of the server content?

If you were running out of machines, why not try Amazon's EC2? Would be pretty sweet having web 1.0 saved by web 2.0 tech.

Because it's pretty expensive for bandwidth intensive applications.

If they were anywhere near competitive I would have signed up a long time ago; as it is, there really is no point.

Softlayer.com cloud servers have free incoming bandwidth, and outgoing traffic is cheaper than ec2.
