Huge props to you for taking this initiative, but are/were you aware of Archive Team and our process of backing up Geocities since April? We found ways around the bandwidth limit months ago, and will be mirroring/distributing the data as well. We own http://geociti.es, for example.
We're somewhere in the 1TB range of data, and we're still finding new stuff, by the way.
Anyway, feel free to 'complete' your set from mine, but let's please coordinate that so we don't kill this poor little server :)
I'm wondering if Yahoo! increased the available bandwidth (or maybe everyone else just stopped using it, increasing what was apparently available), so that when you got to it, it was nice and zippy compared to when the Archive Team hit it earlier in the year.
Even the mail server is doing double duty :)
The only thing that is still doing what it is intended for is my main webserver; everything else is going flat-out. There is some risk of duplication but I'll take care of that later.
I'm getting nearly 150MBit/sec peak so I really can't complain.
I have to hand it to my provider though, we get transit times that are just about unbelievable: between 30 and 50 ms RTT when it's quiet, and still under 150 when it's busy. That helps a lot.
I could do with a break :)
Hardest working week of the last decade for me.
There are lots of bits and pieces that were hard to get to but I think I got most of it.
At 50 Mbit/s that's:
16,000,000,000,000 bits / 50,000,000 bits/sec = 320,000 seconds.
That's 88 hours. I used a little more than that because I was off to a slow start and messed up on the filenames (I never expected them to be case-insensitive), and there is the bloody bandwidth limiter at Yahoo that keeps tripping you up.
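The arithmetic above is easy to double-check (a throwaway sketch; 16 Tbit corresponds to roughly 2 TB of data):

```shell
# Sanity check: 16 Tbit of data over a sustained 50 Mbit/s link.
seconds=$(( 16000000000000 / 50000000 ))  # = 320,000 seconds
hours=$(( seconds / 3600 ))               # = 88 hours (and change)
echo "$seconds seconds, about $hours hours"
```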
I really don't want to see the message 'Service Temporarily Unavailable.' ever again.
... and you wouldn't have.
Reminiscent of the Compaq one:
BIOS NOT (C) IBM 1982
Or something to that effect!
Here is what I use, now modified with your trick:
wget -r -nv -np -nc -i "$URLFILE"
With a separate process putting the URLs into batches of 50,000.
That way you get one wget process to do a boatload of work instead of firing up a new one for every file. That also helps in re-using the connections.
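A minimal sketch of that batching setup (the function and file names are made up; the original scripts weren't posted):

```shell
#!/bin/sh
# Split a master URL list into 50,000-line batches and run one wget
# per batch, so a single process handles a boatload of files and the
# HTTP connections get reused.
fetch_in_batches() {
    url_list=$1
    split -l 50000 "$url_list" batch_
    for URLFILE in batch_*; do
        wget -r -nv -np -nc -i "$URLFILE"
    done
}
# e.g. fetch_in_batches all_urls.txt
```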
Did you archive regional versions (de.geocities.com, etc.) too? I see that (any-subdomain).reocities.com currently is just an alias for reocities.com. I have ~ 12 GB of archived data from de.geocities.com (couldn't find more links and didn't have the time later on), saved in the same format as wget (mtime set according to Last-Modified header etc.) if you are interested.
Anyway, good work!
Can you send me an email on how to receive the data?
I think that describes Jacques pretty well; he's a substantial team unto himself.
It was a lot of hard work, with the help from Abi and some others.
I think we got most if not all of it.
The account restoration process will probably take the better part of the week; there is just too much raw data to do it all in one go. There are still a whole pile of integrity checks to be done and broken links to be repaired.
Please be kind to the server, it is still doing a lot of very hard work in the background. On the homepage there is a status indicator that shows you how many accounts and files have been restored. How many accounts and files eventually will be restored I can't tell you right now but my guess is that we've managed to save a very large portion of geocities.
I have no idea how long it will take to process all the raw data; it's spread out over a whole pile of machines right now, and I'm pulling it in batch by batch to integrate it into the main site.
But it's as close to a 'drop-in' replacement as I could think of.
Here is the war journal:
Quote of the day.
"It doesn't matter what you do with apache, if there is a problem you can always solve it with mod_rewrite. The question is how." -- jacquesm
If it's the latter, did you consider distributing the crawlers to other users? Some sort of system where people with spare bw/clock cycles can do your work for you and free up your bw/clock cycles to receive and parse the data? Would writing something like that up have taken more time than it would have saved?
Regardless, congratulations on your accomplishment. It really is impressive.
I took some shortcuts there, so I'm fairly sure that a portion of what I've downloaded is in duplicate, but that will be resolved in a merge step.
Right now the files are spread out over 7 machines, the one I started on is the 'master', and then there are 6 others that have a portion of the data on them.
Each of those has been told to fetch only from a restricted area of geocities, but the master one had no such restrictions, so chances are there is some duplication between the master and the individual slaves.
Merging all the data and importing the user accounts is going to take a couple of days at least, it's quite a collection of files. I have no stats yet but when I'm done I'll do a write-up on the main statistics.
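A rough sketch of what such a merge step could look like (the function name and paths are made up; the actual merge scripts weren't posted). The master's copy wins whenever both trees contain the same path:

```shell
#!/bin/sh
# Merge one slave machine's download tree into the master tree,
# copying only files the master doesn't already have.
merge_tree() {
    src=$1; dst=$2
    ( cd "$src" && find . -type f ) | while read -r f; do
        if [ ! -e "$dst/$f" ]; then
            mkdir -p "$dst/$(dirname "$f")"
            cp "$src/$f" "$dst/$f"
        fi
    done
}
# e.g. merge_tree /data/slave1/geocities /data/geocities
```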
In the UK, transitive copies for the purposes of display, caching &c. have been cleared as non-infringing actions. Google's caches link back to the original author by way of attribution, for example. But archiving and reproduction without attribution in any way?
Also, whilst Google may have been given a pass in robots.txt to crawl the site (were sub-sites allowed individual robots files? I never had one), declaring oneself as Googlebot in order to spider and archive the whole site could well show bad faith?
Just wondered if you'd discussed the copyright position, perhaps with Geocities. Maybe there was a disclaimer that effectively released content as PD, I doubt it though.
A brief discussion on webmasterworld, http://www.webmasterworld.com/foo/3898789-2-30.htm , but more interestingly an idea to "rape" geocities for content for ad serving sites, see http://ducedo.com/free-content-geocities/ .
Someone on webmasterworld considered whether Google might back-rate based on content so that highly rated content pages that disappear from Geocities (&c.) could be given a boost in the SERPs.
OT: did you use the current username-based addressing too, or are you only linking the old "campus" names? Can't remember mine; it was in RT somewhere IIRC.
Simple as you take it, in this one thread alone there are at least two copies of significant portions of geocities, and I am sure there are others out there (not even aware of HN) that have done the same.
Oh how I love technology.
This is a nice touch.
The IA is distinct from Archive Team BTW.
Restoring all this is going to take some time, it's spread out over a number of machines right now.
This is the master copy:
But that does not include all the other boxes, just this one.
edit: Ok, it's fixed now. Thanks again!
Nitpicks: the frontpage says "an verification method"; should be "a". And, of course, validation: http://vldtr.com/?key=reocities.com
I fixed that thanks!
As for validation, I'm painfully aware of it; that's entirely my doing, not Abi's. I will fix those errors ASAP but I have to concentrate on getting the user data in there right now, which is still quite a job.
The design was imported in a great hurry and I absolutely suck at CSS and anything else that is design/formatting related.
Give me tables any day :)
But I will get around to it.
If you feel like helping out shoot me an email ;)
Should have thought of that before; there was another mention of it below but I didn't think I'd be able to work around it without messing things up.
Great project, mate!
I figure if I put some 'friendly' ads on it the thing should pay for itself and that's good enough for me.
The kind of corporate superstructure that Y! puts on top of its products is what makes it unviable, not the concept of free hosting by itself.
If you figure that bandwidth in bulk costs around $3 / Mbit / month then you can serve an awful lot of pages to make back that 3 dollars.
Geocities pages weigh in at about 25K apiece from my meagre sample, so based on that cost per Mbit that's 13 million pageviews per $3, plus a bit thrown in for server depreciation.
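The arithmetic behind that 13 million figure, for anyone checking (this assumes 25,000-byte pages and a 30-day month):

```shell
# $3 buys 1 Mbit/s for a month; how many 25 KB pages is that?
bytes_per_sec=125000                          # 1 Mbit/s = 125,000 bytes/s
bytes_per_month=$(( bytes_per_sec * 3600 * 24 * 30 ))
pages=$(( bytes_per_month / 25000 ))          # ~12,960,000 pageviews
echo "$pages pageviews per \$3 of bandwidth"
```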
I'm not too scared about doing that.
If the need arises I can put the whole thing in a foundation to keep it alive forever.
Bandwidth costs $; I'll take the risk as long as it is manageable. If it goes over that then it will have to make some money, not much but enough to keep going.
The copyright of the materials is totally clear, it lies with the original authors, not with me.
But since they were previously 'hosted' in an environment that made their sites disappear on an hourly basis, and that will no longer happen, they might even see it as an improvement. Hard to tell at this point in time.
If someone owns a piece of it and doesn't want it on there I'm sure they'll tell me, it's not as if I'm hiding.
To me it's on the order of the preservation of the 'stone age' of the internet. If I can only preserve it 'offline' for my own gratification that would be a useless thing; it has to live on.

If you're willing to sponsor the bandwidth then we can look at that, that would be an easy way to keep it completely advertising free. Personally I would prefer that, but if it is to be done out of pocket then that will only go so far. If I have to drop a grand on it per month to keep it ad free then I'll do that. If it is more than that then there will have to be some other way to make it pay for itself. Maybe a donation button (though I don't think those work very well; I'm one of the few people I know that actually does donate to projects that I use), or some other mechanism.
Time will tell. But without the data it all stops, so that had to come first.
That's not how copyright works - "Well Your Honour I was selling those DVDs in public if the film distributors didn't want me to they can just ask, so no fine for me??"
Plus if you're putting ads with this you can't exactly say you're not making a commercial enterprise out of it. I'd leave it to someone with lots of lawyers.
So many times I hear people saying: I had a page, but can't remember what it was.
People could use google, but you could put extra work into brewing up some special sauce that would, for example, let them find their sites with only a combination of vague memories, such as their neighborhoods, or when they created it, or the types of things they linked to, or the background music their site had (mine had the mission impossible song), or the type of content (mine had lots of animated gifs). Google wouldn't care enough to do that.
If users find their content on their own, they can always request it to be removed, all you're selling them is a tool to help them do it.
I wonder if people would really pay for it? It would be fun to find out. I would help build it.
There's a sketch on my notepad here about authenticating 'lost' content. It's not easy and there will be a lot of stuff that needs special casing, but I think it can be done.
Do you still have control of the from address of the posting?
I've deleted stuff from newsgroup archives before; had some hoops to jump through, but it was possible.
But it is quite a bit more involved because you somehow have to avoid duplication and retrying of stuff that simply doesn't exist. Then there's the problem that the URLs weren't case sensitive, which causes wget to retrieve much more than necessary.
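One way to sidestep the case problem is to deduplicate the URL list case-insensitively before handing it to wget. A sketch, with made-up sample URLs:

```shell
#!/bin/sh
# Sample list containing a case-only duplicate (made-up URLs):
printf '%s\n' \
  'http://www.geocities.com/Area51/index.html' \
  'http://www.geocities.com/area51/INDEX.html' \
  'http://www.geocities.com/SoHo/page.html' > urls.txt

# Keep only the first occurrence of each URL when compared
# case-insensitively; geocities served Index.html and index.html
# as the same file, so fetching both wastes bandwidth.
awk '!seen[tolower($0)]++' urls.txt > urls_unique.txt
```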
The code I wrote is pretty geocities specific, I highly doubt it has any value outside of that (other than a sustained DDOS maybe ;) ).
They helped me out with an idea I had a while back on shortened URLs. The same thing needs to happen with shortened URLs, because once the shortening service goes down, all the links are lost...
I wonder about two things:
Did anybody try to contact Yahoo to get a copy of the server content?
If you were running out of machines, why not try Amazon's EC2? Would be pretty sweet having web 1.0 saved by web 2.0 tech.
If they were anywhere near competitive I would have signed up a long time ago; as it is, there really is no point.