The Geocities Torrent (~1TB of awesomeness) (textfiles.com)
238 points by aditya on Oct 27, 2010 | 86 comments

For those that missed it, here's jacquesm's epic story of backing up geocities. http://reocities.com/newhome/makingof.html

and the discussion: http://news.ycombinator.com/item?id=903567

Does this torrent contain the fruits of the collaboration planned in that discussion?

I like the concept of "Digital Heritage" as they call it.

I think preserving digital heritage may become an important issue in the next few years. As the world begins to recognize the historic legacy of the web, it may someday become as important as preserving physical historical landmarks.

It's time to accept that the internet may be the greatest legacy 21st century civilization leaves behind. It likely isn't going anywhere, and it is the most complete archive of our lives, one that may live on for generations after we're dead.

For all we know, the things we type here could be preserved longer than the Pyramids of Giza, should someone make sure to back them up regularly. I hope that someone does.

Charles Stross has an excellent novel, "Glasshouse", a major plot point of which is that due to shoddy digital preservation practices, far-future historians know virtually nothing about ≈1950 through ≈2050 or so. What little they know is pieced together from fragmentary bits and pieces of evidence, with results that are by turns hilarious and horrifying when "put into practice" by historical re-enactors. Kind of reminded me of David Macaulay's "Motel of the Mysteries" in some ways.

I'll second the positive review of "Glasshouse." The lack of knowledge portrayed in the book is also an interesting commentary on one of the often unsung evils of DRM. That license server certainly won't be running 300 years from now.

Historians learn most about ancient societies by digging through their trash. We have left more trash that takes ages to degrade than all prior societies put together.

Only if they get serious about recycling, history will be at stake.

Wow. This site is amazing. Tons of gems: http://www.textfiles.com/underconstruction/

Was under construction the first web meme? All these people couldn't have arrived at the same metaphor individually (as obvious as it is).

I remember back in '95 building websites that were basically just animated GIFs (no judging, I was a kid). I used some under construction ones because I had seen them everywhere and I was constructing!

Check out http://www.metafilter.com/85695/Please-Be-Patient-This-Page-... for a person's story of how he made the first animated GIFs.

Heh. I'd say that comment made it to Hacker News last year, but I already linked it in the Mefi thread! And then I got server push animation working on my server, & yes, it's really as hideous as people say.

Keep in mind that the idea of a document that might be different every time you pull it up was still kinda new.

So maybe the "construction" part was a meme, but I think the idea that you needed to warn people that this homepage wasn't "finished" yet was universal.

Hard to say if it was the first or not. It may have been "Here's my list of interesting web pages (most of them were probably links to other people's lists of interesting pages, very recursive.)" I don't think the word meme existed then either.

The word was coined in 1976 by Richard Dawkins, but I don't know when it became well known.

His documentaries (Get Lamp and BBS) are also pretty interesting and well worth watching/purchase.

I for one will make this one torrent that I will perpetually seed until my ISP tries to stop me.

Jason Scott is a pretty cool dude; random story: He was doing the JetBlue unlimited month of travel and showed up in Pittsburgh a few weeks ago to talk at CMU. He was giving a talk at Seton Hill college the next day (about 45 mins away), and posted on twitter looking for a ride. A friend of mine forwarded the tweet to me, and I got in touch with Jason and ended up giving him a ride the next day. I had very little clue who he was before I met him, but he gave me a copy of Get Lamp and some good discussion in return.

Sounds like a good candidate for an Amazon public data set (http://aws.amazon.com/publicdatasets/). Although they do peak at 1 TB.

I'm actually sad that I didn't put my first website on geocities now! I had my own web hosting from my ISP, so I had a fabulously easy to remember url of "homepages.tig.com.au/~liedra" which was lost as soon as my family upgraded to a cable connection from dialup. And no, archive.org didn't manage to catch it :( I think it had a page devoted to Nick Cave and some terrible poetry! Go go websites of a 17 year old! :)

The interesting thing was that at the time my friends and I (who had ISP-based homepages) looked down on Geocities because it was "lame" comparatively. Now I'm sad that I don't have any records of that original page (possibly on an ancient CD-R though? but most of those early ones have degraded now...)

My first pages were on my ISP which offered a subdomain! I paid £2 a month extra (on top of call charges) for the privilege. My friends thought I was crazy but I showed them when that very same subdomain impressed someone enough to give me a web dev job ("You have a subdomain? Impressive").

I also looked down on geocities/angelfire sites and I still think I got the better deal out of it - my first stuff was too embarrassing to live on for eternity in the depths of a torrent.

This is fantastic.

The early web is a treasure trove of an interesting time in history. It was the first time average people could just write public documents to express themselves.

Naturally the pages were terrible, covered in things that look good the first time you see it, pointless opinions and personal shrines to obscure relics of pop-culture.

The web is still the same, but more everyday. Companies work day and night to have a web presence, and "using the internet" is synonymous with replying to status and 'liking' things.

Geocities, AOL Homepages, and tripod are landmarks of the first time in history someone could just make a page about themselves, or something they liked and _anyone_ could see it. It was society making paintings on caves.

Unfortunately, these sites don't produce revenue, and never will, so from a corporate point of view, they are worthless.

The early era of the web is like trying to find rare music. Of course there is a modern site, a torrent, or some convenient way to find most of what you want. What you find is, at best, the same thing everyone else finds. The old web is full of non-technical people earnestly trying to make something, not a startup, not to sell a book, just trying to put something together, which is largely lost in the ease of "List your favorite bands".

Not that it was better, or more insightful, simply that it is a huge body of primitive work that is unlikely to be recreated. These things should be stored, if for no other reason than we can see the bloviated opinions of mensans, the C-style poetry of 90's sysadmins, or just the insane ramblings of people who think like Gene Ray, but don't have the perseverance to keep up timecube.

The sites are a labor of love, no matter the revenue, and it annoys me to no end that AOL or Yahoo has the power to simply delete these old sites because they don't make business sense, to businesses that don't even know what they are doing.

Anyway, as someone who mirrored a few old HomePages and Geocities sites, and backs up pieces of the old internet whenever I can find them, this is a breath of fresh air.

Hey, everyone, Jason Scott (the textfiles.com guy) here.

Just wanted to address that reocities.com has even more than I do, and more than what's in the torrent. If you want to browse geocities, like ye old days, go visit reocities. This data release is never meant to be "all of geocities" just "a lot of geocities" (and all I have).

I am ALL for a 2.0 from jacquesm. :)

How many "web rings" fit into 1 TB?

Wow, when did those disappear? That phrase immediately brought me back to ~1994. :)

Disappear? webring.com is still there. I dare you to enter "goth poetry" in its search window. Double dare, as a matter of fact.

       Skeletal Lovers
   Two dead people
   Embraced even in death
   They lie there for eternity
   Together forever
   Memories turned to dust
   Laughter and sin forgotten
   Nothing but pale white bone
   Nothing to complain about
   Only the two of them
   Forever together

Oh the teenage humanity! Curse you mhd! Curse you!

I don't think they disappeared, they just went into "blog rolls"

I had someone email me a few days ago asking for code for an old webring script I'd written in 1998. I was a little amazed that 1. anyone would still want such a thing, and B. someone had found a listing for my 12yr old PHP script and thought it worth using. I haven't had the code in my possession for at least 10 years, and being so old, I'm sure it was full of all kinds of security loopholes. No idea where he came across it, but apparently some resource site somewhere on the internet still lists it.

Sounds like a great excuse to test those "Unlimited Diskspace!" and "Unlimited Bandwidth!" claims of shared providers.

They don't work. The second you start getting any serious traffic you get a warning. I had a video (Flash movie) that got popular, and they sent me a warning after only 2 gigs of bandwidth.

Granted those 2 gigs were used up in something like 5 minutes, but nevertheless, you'd think they'd give you a little more to play with.

I remember creating my first site on Geocities: a South Park fan site with links to download episodes (linked to another site hosting them, of course). That obviously became the most popular feature. Think I will have to get the torrent, or at least part of it, for a trip down memory lane. It is the digital equivalent of the 80's haircut.

I'd love to be able to search those contents. I'm pretty sure I had a few Geocities sites, but I'm not going to download a terabyte to see if it's in there.

I imagine at least somebody will download to a server and host them all there. Might grab a new 1TB drive into one of my servers and do it if I'm bored enough...

I've been doing that for the last 12 months, and the collection is significantly larger than that.


Ah ha! So here's where you've all been hiding. http://www.reocities.com/TimesSquare/1266/

That page is by a young 'hacker' named Simon Liu.

I wonder if it is this Simon Liu:


"Surviving Distributed Denial of Service Attacks"

See? This is why reocities is awesome. The Simon Lius of the world should have access to their old Descent scores.

I'd love to know what sort of infrastructure you're running this on. I'm in a course for my Masters for library school that deals with similar sorts of problems in maintaining and preserving for the "long term" digital materials.

From your site it sounds like you wrote a script using wget to harvest the files and another to check them against versions that were still up. What do you do on the server end now to ensure that the files are still working correctly? Are you running periodic checksums on them or the like? Finally, are you looking for any help from an interested novice?

I have a large database table that stores the md5 hashes of all the files and there is a script that can compare all of the contents of the site with the hashes in the files (and with a second copy if that's what it would come to).

Some bitrot is inevitable but I think it's under control for now.
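For anyone curious what that kind of check looks like, here is a minimal sketch, not the actual reocities script. It assumes the hashes live in a plain dict rather than a database table, and the names `md5_of_file`, `build_manifest` and `find_bitrot` are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 hash."""
    root = Path(root)
    return {
        str(path.relative_to(root)): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_bitrot(root, manifest):
    """Return the relative names of files that are missing or no longer match."""
    root = Path(root)
    damaged = []
    for relative_name, expected in manifest.items():
        path = root / relative_name
        if not path.is_file() or md5_of_file(path) != expected:
            damaged.append(relative_name)
    return sorted(damaged)
```

Run `build_manifest` once after harvesting, store the result somewhere safe, and run `find_bitrot` periodically. Anything it reports can then be restored from the second copy.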

As for help, yes, but right now I'm pretty swamped in other stuff, the next round of work on reocities will likely come after the new year.

Have you considered using something like MogileFS? It'd be perfect for this sort of situation.

Let me know if you're interested in this or have any questions -- I've dealt a good bit with systems like this in the past, and would love to give you a hand.

You know where to find me :)

And yes, of course I'm interested. But right now no time.

It is truly amazing to see one of my old sites reappear in this form. Also, there appears to be another mirror, oocities.com

I checked, my very first website is thus far not there. I'll wait until you're finished, and if it doesn't appear I can send the offline backup. I've been carting it around for 10 years, laughing at the animated graphics and frames.

While I think the web is probably a better place with reocities than without it, slapping ads on this content feels a tad bit slimy to me. I'm curious to hear your take on the ethical and legal implications of it.

The ads are actually a boost for a HN'er.

And let's suppose for a second that they were not (as they have not been in the past), reocities has cost a fairly large amount of money to date (instead of made money, as you suggest), not a single person that has asked me to remove the content has ever commented on the presence of the ads, and neither has anybody that has found their stuff again because I backed it up.

On the contrary, the reactions have been almost 100% positive with a very few exceptions.

As a former Geocities user, I am very pleased he has placed ads on the site. Hopefully the ad revenue will cover the hosting costs indefinitely.

If he makes a small profit, great! Hopefully that will encourage archivists of the future.

(instead of made money, as you suggest)

I apologize if that's how you read my post; I certainly didn't intend to suggest that or to belittle the amount of work and money that you've put into this project.

If you put up half of the money that went into it sight unseen then we'll call it even, deal? ;)

This is from the Google stable of morals - if people don't know you're profiting from their copyright then it's OK.

Presumably you have no legal right to redistribute any of the Geocities stuff?

The fact you spent time and money on it is neither here nor there, that's not a moral|legal argument. Presumably there have been some exceptions (you intimate that), what sort of percentage does that amount to? If you calculate that as a true reflection of the whole population then how many people's copyright do you suppose you've infringed, knowingly, against their wishes?

If someone takes your published material and slaps ads on it and republishes without your knowledge is that all good with you? (I'm guessing you may say yes here!).

It's not just about economics, it's probably as much about moral rights. People believe that their Geocities content died and was buried.

Of course they are pretty minor "offences" (or at least appears so) across a large population - akin to being a spammer or somesuch. Actually scratch that this is simply like copying others blog posts on a massive scale.

> Actually scratch that this is simply like copying others blog posts on a massive scale.

No, this is saving those blog posts from extinction.

I copied my content from Geocities to another provider and then on to my own site (eventually). I'm sure I'm not alone. Very little of that content is still live anywhere but there is a little I think.

Also, why do you get to be the arbiter of whether others' content can be allowed to disappear or not? If I own the copyright then it's within my rights to have all copies destroyed. For example, you secretly keeping a copy (as you've not notified copyright holders AFAICT) is infringing my moral right to control that work.

Not to toot my own horn here, but let me just give you one sample of the kind of email I get about this project, I'll leave you to judge the rest of them by this single one (it just happens to be the last one and more in this vein would be excessive and embarrassing):


I don't know who you are, but I just want say you are an angel. I thought I had laid the very first website I ever built on geocities to rest, and words cannot describe my utter and complete surprise to find it resurrected on reocities.

You say this project is a labor of love -- and that is exactly how I felt about my own website. It was, for at least 10 years of my life, the best expression of who I was, and it means so much to be able to relive that era again.

You are truly doing a public service by preserving early relics of Internet culture. I can only imagine that generations from now, when people are digging into the history of the web, they will be fascinated by what you've saved.

Thank you SO MUCH for doing what you do!




I'm not going to argue morality here with you any further, you apparently have a bee up your bonnet about this, but as they say, no good deed goes unpunished, there is no reason why this would be an exception.


>I'm not going to argue morality here with you any further, you apparently have a bee up your bonnet about this, but as they say, no good deed goes unpunished, there is no reason why this would be an exception.

I'm arguing the application of copyright law (I'm not for the law as it stands incidentally) and for the moral rights of producers of copyrighted work. You appear to be arguing that it's fine to break the law because how you do it makes some people happy. The same rationale (at a different scale) makes speeding OK for teenagers if it impresses their mates and they're lucky enough not to have killed anyone yet.

The problem is that if we allow what you've done (which I don't dislike, indeed I'd consider myself an admirer in general of what I've seen of your work) for anyone then we allow copying other's blog posts adding adverts and putting them on one's own website, we allow copying books that are still in copyright and republishing them, etc..

It's a technicality but important to the case in point IMO. The email from webmistress is grateful for you saving her from not having properly backed up her work, not from having infringed copyright. If you now copy her current website and display it as your own with ads, will she be happy? I'd warrant no, not until the point at which she deletes it all by accident and comes to you because she hasn't backed up. Is this an argument against copyright, probably, not a great one but still it is one.

The fact that you're hearing the positive results is going to be largely selection bias.

If you've bothered reading then thanks for your responses and for not going ad hominem on my ass.



The point of difference here though is that

(1) laws have been set aside many times when the net benefit for the common good outweighed the rights of the individual

(you can still disagree that that is a good thing though)

(2) such exemptions apply to libraries and other 'violators' that serve a different goal than piracy (for instance, preservation and access)

(3) in this case those that benefit the most are the original copyright holders

(4) there is a procedure in place to deal with those copyright holders that do not want their information out there

For the record, a lawyer here in the Netherlands who is fairly knowledgeable on copyright law has reviewed the whole thing and thinks there is absolutely no problem defending my actions (just in case there would have been, I would have done whatever his advice would have been).

It's been up for a year, the one time someone threatened to sue (of course, some hotshot lawyer with a corporate page on geocities :) ), he backed off and became real nice once he realized that no judge was ever going to sign off on him suing for damages and whatnot without first asking politely to remove the stuff.

Laws are there to be respected. In exceptional cases - such as the going out of business of a repository of this size - you can break them if you go about it nicely and try to limit the damage as much as you can.

There are other people out there that have also made copies of all this data that have turned the whole thing in to an adsense fest complete with SEO spam tactics. That might be a better target for your anger.

Lastly, how much would you give for a copy of the library of Alexandria ?

I'm sure that geocities can not on average be compared with the quality of what was stored there but you'd be surprised by some of the stuff that I've found amongst the wreckage and we'd all be culturally poorer if it had gone to waste.

>Laws are there to be respected. In exceptional cases - such as the going out of business of a repository of this size - you can break them if you go about it nicely and try to limit the damage as much as you can.

That's not how the law works here. You break the law whether you're held to account for it or not. Copyright law in Europe is stricter in many ways than in the US (WRT personal use for example).

>That might be a better target for your anger.

Grrr, I'm soooo anngggrrry. Really, I'm quite calm. /rageface

>Lastly, how much would you give for a copy of the library of Alexandria ?

A lot. Probably not my first born though. This hits at the correct route for attacking poor law. Obviously in Alexandria there was no copyright, it was all PD.

People are welcome to mark their pages PD (or some other liberal license; this is the legal procedure for your #4) and HTML5 should (does? via microformats?) allow a license (CC, PD, C, CL, FDL, whatever) to be applied and readily parsed so that you could stay within the law and still do your white knight deal-y.

Moreover those who wish for people to be able to copy without restriction should petition for a change in the law.

The law is an ass but you're stuck with it. I don't consider the value of the stuff you've saved (as much as I've seen, certainly not an in depth study) to be that high that civil disobedience should be practised in order to preserve it.

> I don't consider the value of the stuff you've saved (as much as I've seen, certainly not an in depth study) to be that high that civil disobedience should be practised in order to preserve it.

And that's where we disagree. Talk to a researcher in 500 years or so to get the better reasons why compared to the ones that I can give to you today.

But what I would not give to have the NASA pages about the space shuttle flights that I helped put out on the net back.

Those are gone forever, I wished someone had broken copyright law to preserve them.

>And that's where we disagree.

I'm sure there are plenty of examples of the type of page preserved by the likes of archive.org.

>Those are gone forever, I wished someone had broken copyright law to preserve them.

Issue all your stuff PD and then people won't need to break the law to do what you appear to want them to do.

It is digital archeology. The content was as good as gone, just someone dug it up and put a big banner ad over it and is selling lemonade on the side.

The AwesomenessReminder banner ads are perfect, though. Completely fit the style and the era.

Yes, I put them up there as a joke initially because Zachary came up with his animated gifs but then I thought oh, what the heck and left them up. I'm sure he's not making much money of them - if any - but it looked like a perfect fit.

His new site looks much better actually: http://www.awesomenessreminders.com/

>I'm sure he's not making much money of[f] them - if any - but it looked like a perfect fit.

I think you're being disingenuous here. You've argued that you have a right to the content and to put your ads on it so why bother trying to spin that action as a minor benefit.

If it's no benefit then save yourself and everyone else the bandwidth and remove the adframe. If it is a benefit then keep it and stand by your conviction that that is justified.

PS: for your mate, the .logo {padding-left: } looks to be about 7px too much so it doesn't align with "Bring happiness [...]" strapline and the body text. I'm using FF3.6.11 on Kubuntu.

And after only two random clicks, I'm at a (bad) foot fetish site. Well, it was either that, or a "Mr. T ate my <censored>"…

Great idea and name!

Have you grabbed the torrent they're offering?

I would be very surprised if there is anything in there that is not in reocities yet but I will certainly do a comparison.

I'll have to run that against all the deletion requests as well for this content because chances are fairly large that lots of it has already been removed at the request of the owners.

Have you tried de-duping files to save space? I imagine there's a finite number of MIDI files and "under construction" GIFs that could just be symlinked to save a ton of space...
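A rough sketch of what I mean, assuming files count as duplicates only when their contents are byte-identical. SHA-256 is my choice of hash here, `dedupe_with_symlinks` is a made-up name, and you'd want to try this on a copy first since it deletes files.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path):
    """Return the hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_with_symlinks(root):
    """Replace later copies of identical files with relative symlinks.

    Keeps the first file seen (in sorted path order) for each digest and
    returns the number of files that were replaced.
    """
    root = Path(root)
    first_seen = {}
    replaced = 0
    # Materialise the listing before modifying anything on disk.
    for path in sorted(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        original = first_seen.setdefault(file_digest(path), path)
        if original is path:
            continue  # first copy of this content, keep it as a real file
        path.unlink()
        path.symlink_to(os.path.relpath(original, start=path.parent))
        replaced += 1
    return replaced
```

Note this keeps the number of directory entries the same, so it would only help with disk space, not with a filesystem that struggles with the sheer file count.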

Space is not the problem, I've got plenty of that. The bigger problem is that the underlying filesystem is having some problems dealing with the total number of files.

Total disk space usage on the array that holds reocities is about 10.5 T (that includes a master copy though) and runs to 233 million files.

Have you considered implementing a NoSQL solution to store the files? I've successfully done something similar with HBase.

No, but that's a good idea, I'll definitely look in to that when the next round of reocities improvements rolls around.

On second reading, I don't actually see a torrent. Did I miss something?

are you worried about copyright infringement suits?

Not one bit.

It was a situation where the common good outweighed the benefits of respecting copyright law handily, besides that the only thing that changed is the machine where the content was hosted.

I do not pass it off as my own, have a very clearly established procedure for removing the data at the request of the copyright holder.

Think of it as a hosting provider going out of business and a new one taking over the corpse of the old one. That it happened without contacting the owners is simply because the vast majority of the owners does not have contact information to begin with.

See the reocities faq for some more info on this:


You can replace most geocities.com URLs with reocities.com and it will appear.
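If you have a pile of old bookmarks, the swap is easy to script. This is only a sketch: it assumes the path is kept as-is and handles just the bare and `www.` hostnames, and `to_reocities` is a name I made up.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed host mapping; other geocities subdomains are left untouched.
HOST_MAP = {
    "geocities.com": "reocities.com",
    "www.geocities.com": "www.reocities.com",
}


def to_reocities(url):
    """Rewrite a geocities.com URL to reocities.com, leaving other URLs alone."""
    parts = urlsplit(url)
    new_host = HOST_MAP.get(parts.netloc.lower())
    if new_host is None:
        return url
    return urlunsplit(parts._replace(netloc=new_host))


print(to_reocities("http://www.geocities.com/TimesSquare/1266/"))
# prints http://www.reocities.com/TimesSquare/1266/
```

Whether a given page actually exists on the other side is a separate question, since not everything was saved.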

I remember starting out in Geocities and then in Angelfire. Those were the days when you had to submit your site to the Yahoo directories :D I actually made money selling ads from Commission Junction back then. It wasn't much but it felt great.

Yay for the preservation of a generation. My early attempts of hacking together something in Notepad as a kid should be in there somewhere.

Now if only Yahoo donated the geocities.com domain to whoever has the guts to keep this archive online...

YEAH my first website lives on!

I can't for the life of me remember what I put out on geocities but it was probably something to do with Star Wars. I think my username back then was Fett82. Good times...

For anyone in Australia, it would be cheaper to pay them to put the data on a HDD, wrap it in bacon and hand deliver it - rather than download the torrent.

Really? I'm seeding it from Adelaide. There are options besides whatever Telstra shovels your way in most areas, you just have to poke around.

How would you like your bacon cooked?

I like it crispy.

Dude, get yourself a $50/mo TPG account. Unlimited. (sometimes including landline phone!)

That was in Sydney though, last year.

This may be the largest public torrent ever created!

Did Alex Stefanov's collection of free online math resources make it in? It was an awesomely comprehensive list.

1TB of internet nostalgia.

Argh, they're gonna revive and store the stuff I was so ashamed of that I hoped it would never resurface.

On reocities.com you can request to have information removed.

If you want sites removed from this torrent... better ask now, before the torrent's up.

You can sue them, it is copyright infringement and your moral rights are probably more important than any commercial rights in this case.

What happens on the internet... stays on the internet.
