
Tell HN: My 'mystery project': I couldn't sleep, so I backed up GeoCities - jacquesm
http://www.reocities.com/
======
textfiles
Hi, jacquesm.

Huge props to you for taking this initiative, but are/were you aware of
Archive Team and our process of backing up Geocities since April? We found
ways around the bandwidth limit months ago, and will be mirroring/distributing
the data as well. We own <http://geociti.es>, for example.

We're somewhere in the 1TB range of data, and we're still finding new stuff,
by the way.

~~~
jacquesm
I think I have 2+ TB altogether right now, and I did that in 6 days, which makes
me wonder how there can be such a big difference?

Anyway, feel free to 'complete' your set from mine, but let's please
coordinate that so we don't kill this poor little server :)

~~~
tibbon
I remember Jason (of the Archive Team) telling me that Yahoo! was allocating
very little bandwidth toward Geocities and that downloading was horridly slow.
Getting even 1MB/s was nearly impossible.

I'm wondering if Yahoo! increased the available bandwidth (or maybe everyone
else just stopped using it, increasing what was apparently available), so that
when you got to it, it was nice and zippy compared to when the Archive Team hit
it earlier in the year.

~~~
jacquesm
That's quite possible. I have no idea how they were doing it; I have about
20 different IPs in the farm that is doing this, 8 machines in total.

Even the mail server is doing double duty :)

The only thing that is still doing what it is intended for is my main
webserver; everything else is going flat-out. There is some risk of
duplication but I'll take care of that later.

I'm getting nearly 150 Mbit/sec at peak, so I really can't complain.

I have to hand it to my provider though, we get transit times that are just
about unbelievable: between 30 and 50 ms RTT when it's quiet, and still under
150 when it's busy. That helps a lot.

------
jacquesm
For all of those who were wondering what my 'mystery project' was, it's an
all-out effort to back up all of geocities.com in 6 days before closing time.

It was a lot of hard work, with help from Abi and some others.

I think we got most if not all of it.

The account restoration process will probably take the better part of the
week; there is just too much raw data to do it all in one go. There are still
a whole pile of integrity checks to be done and broken links to be repaired.

Please be kind to the server; it is still doing a lot of very hard work in the
background. On the homepage there is a status indicator that shows you how
many accounts and files have been restored. How many accounts and files will
eventually be restored I can't tell you right now, but my guess is that we've
managed to save a very large portion of geocities.

~~~
abyssknight
Great idea! My wife was just lamenting the loss of her high-school web years.
Hopefully they'll live on through reocities. Nice design too, especially
considering the time crunch.

~~~
jacquesm
Abi did an awesome job; he did the whole thing in under 5 hours.

I have no idea how long it will take to process all the raw data; it's spread
out over a whole pile of machines right now, and I'm pulling it in batch by
batch to integrate it into the main site.

But it's as close to a 'drop-in' replacement as I could think of.

~~~
abinoda
Thanks Jacques. Super cool project -- was a pleasure to work with you!

------
mahmud
Hah!

Here is the war journal:

<http://www.reocities.com/newhome/makingof.html>

:-)

[Edit:

Quote of the day.

"It doesn't matter what you do with apache, if there is a problem you can
always solve it with mod_rewrite. The question is _how_." -- jacquesm]

~~~
milkshakes
Very interesting. I'm unclear about one thing though -- were the bandwidth
caps per account, or per user/session + account?

If it's the latter, did you consider distributing the crawlers to other users?
Some sort of system where people with spare bw/clock cycles can do your work
for you and free up your bw/clock cycles to receive and parse the data? Would
writing something like that have taken more time than it would have saved?

Regardless, congratulations on your accomplishment. It really is impressive.

~~~
jacquesm
Per account; the same user session would be able to see other accounts.

~~~
milkshakes
Oh, now I understand. Then distributing it would still work? (Or would have
worked, I guess.) Sorry to belabor the point, I'm just learning about this
stuff, and I want to make sure I understand it, and you seem like you might be
able to answer the question :)

~~~
jacquesm
Yes, absolutely. That's how most of the work got done. The biggest problem
when you start distributing it is avoiding duplication.

I took some shortcuts there, so I'm fairly sure that a portion of what I've
downloaded is duplicated, but that will be resolved in a merge step.

Right now the files are spread out over 7 machines; the one I started on is
the 'master', and then there are 6 others that each have a portion of the data
on them.

Each of those has been told to fetch only from a restricted area of geocities,
but the master one had no such restrictions, so chances are there is some
duplication between the master and the individual slaves.

Merging all the data and importing the user accounts is going to take a couple
of days at least; it's quite a collection of files. I have no stats yet, but
when I'm done I'll do a write-up on the main statistics.
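
Very roughly, the split and the merge work like the sketch below (a minimal
illustration, not the actual scripts; the neighbourhood list, slave count and
file layout are assumptions):

    # hypothetical sketch: partition the crawl per neighbourhood, then merge the copies
    import os, shutil

    NEIGHBOURHOODS = ["Area51", "SiliconValley", "Heartland", "SoHo"]  # assumed subset
    SLAVES = 6

    def assign(neighbourhood):
        # each slave fetches only its own slice of the URL space
        return NEIGHBOURHOODS.index(neighbourhood) % SLAVES

    seen = set()  # relative paths already imported into the master tree

    def merge(src_root, dst_root):
        # master and slaves may overlap, so keep only one copy of each path
        for dirpath, _, files in os.walk(src_root):
            for name in files:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, src_root)
                if rel in seen:
                    continue  # duplicate of a file another machine already delivered
                seen.add(rel)
                dst = os.path.join(dst_root, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(path, dst)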

------
pbhjpbhj
What's the copyright position here?

In the UK, transient copies for the purposes of display, caching &c. have
been cleared as non-infringing acts. Google's caches link back to the
original author by way of attribution, for example. But what about archiving
and reproduction without any attribution?

Also, whilst Google may have been given a pass in robots.txt (were sub-sites
allowed individual robots files? I never had one) to crawl the site, declaring
oneself as Googlebot in order to spider and archive the whole site could well
show bad faith?

Just wondered if you'd discussed the copyright position, perhaps with
Geocities. Maybe there was a disclaimer that effectively released content as
PD; I doubt it though.

There's a brief discussion on webmasterworld,
<http://www.webmasterworld.com/foo/3898789-2-30.htm>, but more interestingly
an idea to "rape" geocities for content for ad-serving sites, see
<http://ducedo.com/free-content-geocities/>.

Someone on webmasterworld considered whether Google might back-rate based on
content so that highly rated content pages that disappear from Geocities (&c.)
could be given a boost in the SERPs.

OT: did you use the current username-based addressing too, or are you only
linking the old "campus" names? I can't remember mine; it was in RT somewhere
IIRC.

------
marcamillion
Good stuff Jacquesm. That's one of the most beautiful things about the
internet: things can literally live on forever.

Any way you take it, in this one thread alone there are at least two copies
of significant portions of geocities, and I am sure there are others out there
(not even aware of HN) that have done the same.

Oh how I love technology.

~~~
teeja
WP mentions a couple more sites doing some archiving:

<http://en.wikipedia.org/wiki/Geocities#Closure>

------
RyanMcGreal
> To fix links pointing to old GeoCities pages, we provide you with a small
> Firefox Greasemonkey script.

This is a nice touch.

------
zandorg
I think the Internet Archive got involved in this too. I asked the head of the
IA if they could just get Yahoo to give them the hard disks and stuff them
wholesale into the IA. But in the end, the IA put a 'spider' page on one of the
main Geocities FAQ pages on Yahoo, which is not bad I guess.

The IA is distinct from Archive Team BTW.

------
swolchok
You can scroll back for a bit in screen. C-a [ (or, apparently, C-a Esc) goes
into copy mode, in which you can scroll using the arrow keys. Exit copy mode
by wailing on Esc until it gives up. Be careful not to leave a screen sitting
in copy mode and expect the process in the screen to keep running.

------
flooha
Just a heads up: The "some interesting pages" link
(<http://reocities.com/tablizer/>) on this page
(<http://reocities.com/newhome/makingof.html>) returns a 404.

~~~
jacquesm
Don't worry, I've got them.

Restoring all this is going to take some time; it's spread out over a number
of machines right now.

This is the master copy:

<http://org.reocities.com/tablizer/>

But that does not include all the other boxes, just this one.

edit: Ok, it's fixed now. Thanks again!

------
jeroen
Quite an achievement in such a short timespan!

Nitpicks: the frontpage says "an verification method"; should be "a". And, of
course, validation: <http://vldtr.com/?key=reocities.com>

~~~
jacquesm
Hey Jeroen,

I fixed that, thanks!

As for validation, I'm painfully aware of it; that's entirely my doing, not
Abi's. I will fix those errors ASAP, but I have to concentrate on getting the
user data in there right now, which is still quite a job.

The design was imported in a great hurry and I absolutely suck at CSS and
anything else that is design/formatting related.

Give me tables any day :)

But I will get around to it.

If you feel like helping out, shoot me an email ;)

------
lloydarmbrust
I love the internet for things like this. Awesome job. I remember back in the
day hoarding hundreds of geocities accounts to store and distribute MP3s . . .
and of course I owned the copyrights to all of those. . . .

~~~
pbhjpbhj
If those files are still there then jacques is going to be serving those from
his own pages very shortly ...

~~~
lloydarmbrust
Ha. That's great. Too bad I didn't have gmail back then, because my site names
would totally be archived.

------
ComputerGuru
FYI: In "Making of" you're linking to <http://reocities.com/tablizer> which is
a 404.... probably want to fix that ;-D

~~~
jacquesm
Ok, I bumped that one up in the restoration queue.

Should have thought of that before; there was another mention of it below, but
I didn't think I'd be able to work around it without messing things up.

Thanks!

~~~
ComputerGuru
Awesome work. Seriously, mad props are in order for a bang-up hack job.

Great project, mate!

------
aminuit
Do you plan to host reocities indefinitely, or is this just a stopgap measure
before you can donate the collection to another organization?

~~~
jacquesm
I can host it just about forever; I own & operate ww.com, which has a fairly
large traffic bill anyway.

I figure if I put some 'friendly' ads on it, the thing should pay for itself,
and that's good enough for me.

The kind of corporate superstructure that Y! puts on top of its products is
what makes it unviable, not the concept of free hosting by itself.

If you figure that bandwidth in bulk costs around $3 / Mbit / month, then you
can serve an awful lot of pages to make back those 3 dollars.

Geocities pages weigh in at about 25K apiece from my meagre sample, so based
on that cost per Mbit that's roughly 13 million pageviews a month, plus a bit
thrown in for server depreciation.
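
For what it's worth, the back-of-the-envelope arithmetic behind that figure
(assuming a fully used 1 Mbit/s link and a 30-day month; all inputs are rough):

    # rough sanity check of the 13 million pageview estimate
    seconds_per_month = 30 * 24 * 3600                    # ~2.6 million seconds
    bytes_per_month = 1000000 / 8 * seconds_per_month     # 1 Mbit/s sustained ~ 324 GB
    page_size = 25 * 1000                                 # ~25 KB per page, small sample
    print(bytes_per_month / page_size)                    # ~13 million pageviews per Mbit/month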

I'm not too scared about doing that.

If the need arises I can put the whole thing in a foundation to keep it alive
forever.

~~~
PostOnce
I'm paranoid, and my first thought is that if you put 'friendly' ads on there,
it'll only be a matter of time until someone comes along and calls what you're
doing profiting from copyright infringement. What's the deal with copyright on
geocities stuff? I'm not well-versed in this matter.

~~~
jacquesm
As far as I know that's exactly the situation there was before; after all,
Yahoo also had ads all over the place (and those were 'non-friendly', as in
popups and stuff like that).

Bandwidth costs $; I'll take the risk as long as it is manageable. If it goes
over that then it will have to make some money. Not much, but enough to keep
going.

The copyright of the materials is totally clear: it lies with the original
authors, not with me.

But since they were previously being 'hosted' in an environment that made
their sites disappear on an hourly basis, and that will no longer happen, they
might even see it as an improvement. Hard to tell at this point in time.

If someone owns a piece of it and doesn't want it on there, I'm sure they'll
tell me; it's not as if I'm hiding.

To me it's on the order of preserving the 'stone age' of the internet; if I
could only preserve it 'offline' for my own gratification that would be
useless, it has to live on.

If you're willing to sponsor the bandwidth then we can look at that; that
would be an easy way to keep it completely advertising-free. Personally I
would prefer that, but if it is to be done out of pocket then that will only
go so far. If I have to drop a grand on it per month to keep it ad-free then
I'll do that. If it is more than that, then there will have to be some other
way to make it pay for itself. Maybe a donation button (though I don't think
those work very well; I'm one of the few people I know who actually does
donate to projects that I use), or some other mechanism.

Time will tell. But without the data it all stops, so that had to come first.

~~~
pbhjpbhj
_If someone owns a piece of it and doesn't want it on there I'm sure they'll
tell me, it's not as if I'm hiding._

That's not how copyright works - "Well Your Honour, I was selling those DVDs
in public; if the film distributors didn't want me to, they could just ask, so
no fine for me??"

Plus, if you're putting ads on this, you can't exactly say you're not making a
commercial enterprise out of it. I'd leave it to someone with lots of lawyers.

------
jonknee
I wish I could remember what my Geocities site was called. It was before
Yahoo! bought them, so it has been quite a while.

~~~
jacquesm
Give me a segment of text that was in there and I'll scan for it.

~~~
ars
I don't have a geocities page, but that is an awesome offer!

So many times I hear people saying: I had a page, but can't remember what it
was.

~~~
stuartjmoore
I wish I could forget my Geocities pages. The internet has a tendency to
record all the stupid things you do when you're young.

~~~
jacquesm
Are you suggesting a business model ;) ?

~~~
milkshakes
Yes! You could easily sell users a service that helped them hunt down and
delete their old embarrassing geocities sites, if you could figure out a way
to confirm that they are indeed the authors.

People could use Google, but you could put extra work into brewing up some
special sauce that would, for example, let them find their sites with only a
combination of vague memories, such as their neighborhoods, or when they
created them, or the types of things they linked to, or the background music
their site had (mine had the Mission Impossible song) or the type of content
(mine had lots of animated gifs). Google wouldn't care enough to do that.

If users find their content on their own, they can always request that it be
removed; all you're selling them is a tool to help them do it.

I wonder if people would really pay for it? It would be fun to find out. I
would help build it.

~~~
jacquesm
I'd do it for free, regardless. It's their content after all.

There's a sketch on my notepad here about authenticating 'lost' content. It's
not easy and there will be a lot of stuff that needs special casing, but I
think it can be done.

------
tibbon
Is there any small chance that you could release some of the code that you
used to scrape it? I'm interested in archiving some sites and am wondering
what you used to execute it. Just scripting a lot of wgets?

~~~
jacquesm
Yes, just a bunch of wgets. That's the principle anyway.

But it is quite a bit more involved, because you somehow have to avoid
duplication and retrying of stuff that simply doesn't exist. Then there's the
problem that the URLs weren't case sensitive, which caused wget to retrieve
much more than necessary.

The code I wrote is pretty geocities-specific; I highly doubt it has any value
outside of that (other than as a sustained DDOS maybe ;) ).
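
For the case-insensitivity part, one way around it is to normalize URLs before
handing them to wget so the same page isn't fetched under several
capitalizations (a hypothetical sketch, not the code I actually ran):

    # dedup case-variant URLs before fetching
    # (geocities served /SoHo/1234 and /soho/1234 as the same page)
    import subprocess

    seen = set()

    def fetch(url):
        key = url.lower()
        if key in seen:
            return  # already fetched under another capitalization
        seen.add(key)
        subprocess.call(["wget", "-q", "-x", "-N", url])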

------
schindyguy
It looks like someone already pointed it out, but you should definitely talk
to the Archive Team on IRC:
<http://www.archiveteam.org/index.php?title=IRC_Channel>

They helped me out with an idea I had a while back on shortened URLs. The same
thing needs to happen with shortened URLs, because once a URL-shortening
service goes down, all the links are lost...

------
bitwize
Now do you need to buy more print cartridges?

<http://www.penny-arcade.com/comic/2002/7/1/>

------
tomjen2
Super nice work.

I wonder about two things:

* Did anybody try to contact Yahoo to get a copy of the server content?

* If you were running out of machines, why not try Amazon's EC2? Would be
pretty sweet having web 1.0 saved by web 2.0 tech.

~~~
jacquesm
Because it's pretty expensive for bandwidth-intensive applications.

If they were anywhere near competitive I would have signed up a long time ago;
as it is, there really is no point.

~~~
diegomsana
Softlayer.com cloud servers have free incoming bandwidth, and outgoing traffic
is cheaper than on EC2.

