Help Us Keep the Archive Free, Accessible, and Private (archive.org)
377 points by aaronbrethorst on Nov 29, 2016 | 80 comments

The Internet Archive is one of the crown jewels of the Internet. It's one of the things that I feel we were promised in the early days of technology, and it actually has managed to exist despite the massive commercialization of the Internet. In many ways it's the future we were promised, and it's an infinite pile of stuff so deep and wide that you could never buy another piece of entertainment and survive almost entirely off of the holdings in the archive and still not even scratch what's in there.

Pretty much. I think it's infinitely more important than anything else on the internet. Yet probably the least used considering the amount of content available. But if in a thousand years you want a record of what happened here now, that's what you really really need. Everything else is superfluous.

The Internet Archive is a modern Library of Alexandria. The latter was destroyed intentionally or accidentally, nobody knows for sure, but the point is that we have the technology to ensure it doesn't happen again.

Jason Scott has more on the backup: http://ascii.textfiles.com/archives/5110

The current Library of Alexandria, the Bibliotheca Alexandrina, has a copy of the Internet Archive from 1996 to 2007.[1] This was intended to be the first of several backup locations for the Archive. But they have not updated it since 2007, and many links return "Temporarily unavailable". They're lucky it survived the collapse of the Egyptian government.

[1] http://www.bibalex.org/isis/frontend/archive/archive_web.asp...

Note that Jason Scott is part of Archive Team, which has its own project to back up the Internet Archive:

http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK http://iabak.archiveteam.org/

If there was a simple all-in-one device you could buy to become part of IABAK, I'd probably do it. Embedded firmware, off-the-shelf terabyte hard drive... they could double or triple their userbase.

I wonder if the NextCloud Box could be adapted for this? https://nextcloud.com/box/ It's basically a hard drive in an enclosure, space for a raspberry pi 2 and some software on an sd card [Snappy Ubuntu Core as OS, nextcloud etc installed].

I've thought about building an IA.BAK extension for typical NAS software like Synology's DSM, QNAP's QTS, etc. Unfortunately, I've found that these platforms don't come close to the developer friendliness we've come to expect from app or extension stores - probably one of the reasons why there aren't really all that many extensions available on those devices.

Anyway, I'm hoping that situation will improve - unused NAS space would be a pretty big addition to IA.BAK, and you can't beat the UX of just installing an extension on a device you might already own.

Let's appreciate the comments section on the Archive page for a second. So many kind people. What a beautiful corner of the internet.

Went to the blog expecting a shit fight (I assumed your comment was sarcastic), left pleasantly surprised.

Things aren't going so well for me, this year. Not for the first year, in a row.

I'm going to respond by making donations I've been deferring, such as to archive.org .

Tomorrow...? Who knows?

Waiting doesn't work. I'm going to do what I can, now. Maybe it'll help, and hopefully I'll feel a little better about myself.

My thanks to Jason and all, for every time I've found a resource I was seeking mirrored and preserved, for my use and for posterity.

I'll add that a lot of "older" pages seem to -- still -- be more useful than many newer ones. The archive isn't just about maintaining some record "for posterity". It proves useful in current circumstances, daily.

And... No one should be able to make our history "go away."

No one should... but that doesn't mean they don't/can't.


from the linked page

> so no one will ever be able to change the past just because there is no digital record of it. The Web needs a memory, the ability to look back.

I thought the issue is/was that site owners (original, or purchasing the domain afterwards) could via a robots.txt remove their site from the archive?

or has this changed and now no matter what happens if the archive crawls a site on a date it stays no matter if 10 years down the road somebody buys the domain and decides to retroactively erase everything?

Not just the site owners either. I'm guessing not many people on here saw the early days of Gamergate, but one of the things that stoked the fire was the Internet Archive kindly erasing their archives of other websites' pages discussing the mess on request from a person being discussed after their hosting providers were leaned on to give them the boot, changing the past to fit the correct official narrative. It's one of the reasons why no-one involved with that uses them for archiving anything. (Wasn't doxing or anything like that, just explaining what happened was enough.)

In our modern hyperpoliticized era, one has to wonder if the Internet Archive is actually trustworthy for anything contentious anymore.

Adding a robots.txt file to a domain doesn't cause them to delete their archive of it, only to hide it.
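To make the hide-versus-delete distinction concrete, here is a minimal Python sketch. It is not the Archive's actual code: the `SNAPSHOTS` list and the `visible_snapshots` helper are hypothetical, and the `ia_archiver` user-agent string is assumed to be the one the Wayback Machine honors. The only real API used is the standard library's `urllib.robotparser`. The point is that the site's current robots.txt is consulted at display time, so stored captures stay on disk but stop being shown.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical stored captures: (capture date, URL). Nothing here is ever deleted.
SNAPSHOTS = [
    ("2004-03-01", "http://example.com/index.html"),
    ("2009-07-15", "http://example.com/blog/post.html"),
]

def visible_snapshots(snapshots, current_robots_txt, agent="ia_archiver"):
    """Return the captures that may be displayed under the site's *current* robots.txt."""
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    # The filter is applied to every capture, old or new: this is the retroactive part.
    return [(day, url) for day, url in snapshots if parser.can_fetch(agent, url)]

open_rules = "User-agent: *\nAllow: /\n"
blocking_rules = "User-agent: ia_archiver\nDisallow: /\n"

print(len(visible_snapshots(SNAPSHOTS, open_rules)))      # prints 2
print(len(visible_snapshots(SNAPSHOTS, blocking_rules)))  # prints 0, yet both captures are still stored
```

If the blocking rules are later removed, the same stored captures become visible again, which is what separates hiding from deletion.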

Which is one and the same to the public.

IMO, they need to stop applying robots.txt retroactively if they want to be considered a valid archive.

The problem is that the Internet Archive exists on legally shaky ground. Neither they nor anyone else has a right to archive copyrighted web content and display it to the public. They manage to continue doing so in part because they're clearly non-commercial. They also manage to continue doing so because they voluntarily respond to robots.txt, even retroactively.

Libraries/archives have no special exemption from copyright law, which is actually a good thing, because otherwise libraries would presumably need to be licensed in some way by the government to qualify for special treatment.

Why not look at WHOIS information when getting an update, and then class a site as 'different' based on whether that changes? In most cases, a new domain owner usually means the site isn't the same as the earlier versions.

You'd then just have to stop the archive indexing/showing content after the WHOIS information changed, while leaving the stuff before it intact. Maybe you'd then have a nice form to report pages you want removed/hidden (for the edge cases), or even a separate robots.txt/meta declaration you can make confirming you're the same person that owns the site. After all, most of the reasons why sites go missing aren't deliberate attempts to rewrite history, but domain squatters not wanting holding pages indexed.

Feels like it'd be so easy to implement robots.txt in a more logical way on the Internet Archive.
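The WHOIS cut-off idea above could look roughly like this in Python. This is only a sketch under stated assumptions: the function name, its parameters, and the notion that reliable registrant-change dates are available are all hypothetical, and nothing here reflects how the Archive actually works.

```python
from datetime import date

def visible_under_whois_rule(snapshot_dates, ownership_changes, blocked_since):
    """Apply a blocking robots.txt only to captures made under the current owner.

    snapshot_dates:    dates on which captures were taken
    ownership_changes: dates on which the WHOIS registrant changed (assumed known)
    blocked_since:     date the blocking robots.txt first appeared
    """
    # The current owner took over at the latest registrant change before the block.
    changes_before_block = [d for d in ownership_changes if d <= blocked_since]
    if not changes_before_block:
        # Same owner throughout, so the block hides everything.
        return []
    owner_since = max(changes_before_block)
    # Captures from before the takeover belong to the previous owner and stay visible.
    return [d for d in snapshot_dates if d < owner_since]

captures = [date(2001, 5, 1), date(2005, 8, 20), date(2012, 2, 3)]
print(visible_under_whois_rule(captures, [date(2010, 1, 1)], date(2014, 6, 1)))
```

With these sample dates, the 2001 and 2005 captures stay visible and only the 2012 capture, taken under the new registrant, is hidden.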

It's been suggested, but there's no way to automatically do it correctly. The whois info might be anonymized, in which case a change means nothing at all. It might just be someone's name and address, with no way of verifying who that someone works for. Also meaningless. Better just to default to something safe, and spend your manpower on something more important.

    > the Internet Archive exists on legally shaky ground
Not least because of the EU's 'right to be forgotten'.

That doesn't seem likely to apply.

Key point given the current climate: if the Trump presidency adds a restrictive robots.txt to all .gov domains, they will prevent the Internet Archive Wayback Machine from showing history on any of those domains.

Not only would this obliterate public access to the Obama, Bush, Clinton era government websites in the archive, it'd prevent the use of the Wayback Machine for keeping track of Trump's shifting agendas, as demonstrated on his web domain recently.

Government documents including, I assume, web pages are in the public domain.

That's irrelevant; the issue isn't copyright, but the implementation of the Internet Archive, which applies robots.txt retroactively to archived versions of pages as well as respecting it currently.

But the reason the Internet Archive applies robots.txt retroactively is copyright, as explained in a sibling to the grandparent.

So someone* should make a clone of archive.org but built on Tor. Maybe call it Torchive. And with no hiding because of robots.txt and possibly prevent users from looking at/editing the data on their server so a person who wants to clear themselves can't flood the net with nodes wanting data on this person and then wipe them all.

*someone who has the skills to make it, because I don't, at least right now

Ok, but, I'm very thankful I was able to remove the site I made 20 years ago when I was 14.

I wonder if Wikimedia can do something to help the internet archive. They are sitting on huge amounts of money and their goals are somewhat similar.

Just to dispel an oft-repeated notion, the Wikimedia Foundation is not really sitting on that much money. From the horse's mouth:

> Our reserves vary throughout the year but are generally around 1 year of revenue. The typical recommendation for stable and successful nonprofits is to have between 6 months and 2 years of reserves. (https://www.quora.com/Wikipedia-in-2015-Why-does-Wikipedia-a...)

It's only responsible for a non-profit to keep money for when people donate less or when a lot of money is needed. In fact it would be unwise to only ask for donations when you're on the verge of bankruptcy. Not to mention that no matter how efficiently you run an organization, at the size of readers served by Wikipedia you do need to meet the demand by spending more resources. I have to mention this because so many people use it as justification not to donate, "oh I don't want to fund fat cat Jimbo's private jet" or whatever, especially when the foundation is very transparent compared to other non-profits.

That being said I do agree with the sentiment since Wikipedia uses the Internet Archive heavily to access citations that have long since 404ed and they both believe in access to information. Personally as a Wikipedia editor, I use the Internet Archive daily to access websites that are no longer online and to fight linkrot. It's invaluable for the functioning of the encyclopedia, especially when reacting to dead links as opposed to pre-emptively saving sources in Webcite or Archive.is. Wikimedia does appear on the donors acknowledgement page by the way (http://archive.org/donate/donors.php), though I'm sure you meant in a more hands-on manner.

They spend a lot of money on sub-projects that hardly go anywhere. They could cancel a bunch of those to put that money and effort into Archive instead. Maybe even a neutral, high-impact version of what Yahoo Directories used to do. Focus on all the stuff people might need or want to know curated into a collection with the Archive references on top of the Archive sponsorship.

That would be quite a circle, given that Wikipedia was inspired by GNUhoo / NewHoo / ODP, which was the first crowd-sourced Internet thing.

The Internet Archive uses blekko's slashtag data, which is a commercial effort founded by the ODP team -- blekko, now owned by IBM Watson.

I don't fully agree with this. I like that Archive.org exists, but I don't really mind if most of the archive would come to disappear. There is a lot of garbage being generated on the web and I really don't think there is sense in saving it.

On the other side, Wikipedia is knowledge, pure knowledge, and this is worth preserving in my opinion.

I agree about the tide of garbage, but on the other hand garbage is intensely interesting to historians. Think what one could do with a decent sentiment analysis AI and billions of comments on news stories for example. By themselves many of them are just nonsensical ranting for one or other political viewpoint, but in the aggregate you could probably identify significant historical tipping points that inflected much earlier than 'official' indicators.

This comment from a long-ago article [1] about saving many Usenet postings that would otherwise have been lost applies:

That’s why not only the very earliest Usenet posts, before Spencer started archiving in 1981 (Usenet began in 1979) but even some of the posts in the 1980s are still lost. It’s too bad; today, wouldn’t more of us rather see what was being said about abortion in 1984 than sift through the arcana of bug fixes in systems that have probably been long since retired? “It was perfectly reasonable from the viewpoint of stuff that we might want to use again, but a little sad from today’s viewpoint,” Spencer admits.

[1] http://www.salon.com/2002/01/08/saving_usenet/ A great read BTW.

It sounds like you have only ever used the WayBackMachine. Their audio collection is phenomenal. But that just scratches the surface. TV News archive, texts, games, and so much more.

Ah I didn't think of those. Good point.

I wish one of the ultra rich people who are in tech would donate a huge amount and keep essential projects like this alive.

The Archive is absolutely a worthy cause. Most people know the WayBack machine (although I wonder how many know where the name comes from), but that's not all the Archive's got. Music, Audio, Video, so much incredible content.

And that's not to mention their software library. Sketch (Jason Scott) seems to be the driving force behind it. As much as it's backed by ugly hacks (emulators compiled to JS. Yuck) it's pretty magical to be able to boot up, say, Fantasy World Dizzy in a web browser, and just play it, no install required.

It's hard to reconcile "ugly hack" and "magical".

All magic is ugly hacks!

Fair enough.

Anyways, thanks for uploading Fantasy World Dizzy. Now I can experience the horror the same way that the children of the '80s and '90s did.

Internet Archive is great, both for the fun of having frequent copies of my web site going back almost 20 years, and it is important for preserving digital history.

Fortunately storage and bandwidth costs will keep decreasing so more replicas can be built over time. I just made a contribution.

BTW, I was in their building in SF in June for the Decentralized Web conference - a fantastic location, and I recommend that you visit.

Have they recovered fully from the fire?

Jason Scott, Internet Archive. We've recovered from the fire in terms of book scanning and operations. We have not rebuilt on the spot where the building that burned down was - several possibilities have been floated but nothing has happened in that direction. And of course there were some nice awards and mementos in that building that are gone. But on the whole, we're good regarding that. People were kind and we worked hard to replace the lost resources.

That's really good to hear.

On a side note, I love textfiles.com. thanks for providing such a cool and important resource.

I don't know about any fire there. When did it happen?

Canada seems a weird choice for this sort of thing, if freedom of speech is a worry, given stuff like this:


Care to expand on this? The wiki page describes some legislation that has since been repealed and I can't even deduce what the legislation exactly said from that article.

The brief story is that Canada does not have free speech guarantees to the same extent that the US, for example, does. The Canadian Charter of Rights and Freedoms says:

"2. Everyone has the following fundamental freedoms:

(a) freedom of conscience and religion;

(b) freedom of thought, belief, opinion and expression, including freedom of the press and other media of communication;

(c) freedom of peaceful assembly; and

(d) freedom of association."

So far, so good. But it also has a section, the so-called "limitations clause", that states:

"The Canadian Charter of Rights and Freedoms guarantees the rights and freedoms set out in it subject only to such reasonable limits prescribed by law as can be demonstrably justified in a free and democratic society."

The Charter does not define what constitutes "reasonable" or "demonstrably justified", so it was left up to the courts to rule on that. The current interpretation is known as the Oakes test, and is actually fairly sensible.

However, the problem remains that this basically gives the government the ability to restrict freedom of speech, if such restriction can be "demonstrably justified". Consequently, for a long time, Canadian law prohibited a fairly broad category of speech labeled as "hate speech", and said prohibition was found by the courts to be consistent with the Charter.

It had also created a special tribunal to deal with the purported violations of one of the laws in question (specifically, Section 13), which operated under principles somewhat different from the regular court system. The article I linked to was about that. You can read the law here:


This particular law was, indeed, repealed by the Harper government. However, it only dealt with Section 13 law. There are other laws in Canada that are still in force that regulate "hate speech"; in particular:



Furthermore there's nothing precluding any future government from enacting a law to restore Section 13 and reinstate the Commission - all it takes is a simple majority in the legislature. Some people have called for the Trudeau government to do just that, although it did not indicate the desire to do that so far.

The other issue is that the Charter can be circumvented by both the federal and the provincial governments by their use of the Notwithstanding Clause, which is as follows:

"(1) Parliament or the legislature of a province may expressly declare in an Act of Parliament or of the legislature, as the case may be, that the Act or a provision thereof shall operate notwithstanding a provision included in section 2 or sections 7 to 15.

(2) An Act or a provision of an Act in respect of which a declaration made under this section is in effect shall have such operation as it would have but for the provision of this Charter referred to in the declaration.

(3) A declaration made under subsection (1) shall cease to have effect five years after it comes into force or on such earlier date as may be specified in the declaration.

(4) Parliament or the legislature of a province may re-enact a declaration made under subsection (1).

(5) Subsection (3) applies in respect of a re-enactment made under subsection (4)."

In other words, the legislature can effectively limit any fundamental freedom (this is Section 2, the one that includes freedom of speech and expression), and the only thing that they need to do so is 1) declare that they're doing it, and 2) renew that declaration every 5 years.

So far, the only instance of the Notwithstanding Clause used to limit freedom of speech that I'm aware of is its use by the legislature of Quebec in the 80s to pass their language protection laws (that mandated use of French in certain public signage etc). However, it could, in theory, also be used for "hate speech" laws and other similar restrictions.

The general point is that, in terms of both actual and potential curtailment of the freedom of speech, Canada offers far fewer guarantees than the US does. While the Trump administration has expressed some hostility towards the concept of free speech already, actually acting on it would put them on a collision course with the Supreme Court and its currently standing Brandenburg v. Ohio ruling interpreting the First Amendment, which provides extremely broad free speech protections, far exceeding anything that Canada has in the Charter, even ignoring the Notwithstanding Clause.

In terms of other countries that have laws and legal checks and balances comparable to those in the US, the only one that I happen to know of is Estonia. But I'm sure there are others, it just needs researching. For something like the Internet Archive, which is archiving materials that can be contentious, I would expect legal freedom of speech to be a very strong consideration when picking jurisdictions in which to operate.

Thanks for the thoughtful response. I'd actually forgotten about the notwithstanding clause entirely.

Still though, Canada's legislation seems no more restrictive than most other (Non-U.S.) democracies. That's according to my quick read of https://en.wikipedia.org/wiki/Freedom_of_speech_by_country (so take it for what it's worth). I mean, surely there are some that are marginally better but it doesn't seem like there are any obvious leaders here. Maybe I'm missing something though.

Given that, I don't see how Canada would be a bad choice for a mirror. Especially given the other distinct advantages. Physical proximity being an obvious one (it's probably much more cost effective to build some servers pre-loaded with data and drive them up versus almost any other option). Same time zone, same language, and general political/social/economic stability are probably also pretty key. And then there are other threat considerations (eg. the Baltics being so close to Russia) that come into play.

I mentioned Estonia before. So far as I know, their level of protection is the same as in the US - restricting speech requires imminent danger stemming from that speech. So no political speech, no matter how hateful, can be restricted, unless it is inciting imminent violence. It also has fairly lax libel laws, which is also a benefit.

Geographic proximity has both upsides and downsides - the downside is that something that affects US is also more likely to affect Canada than any other nation (except, perhaps, Mexico).

As far as threat consideration, you have a point there - but I think that having a distributed network of server mirrors is part of mitigating any such sudden threats against any particular one. In a sense, something like a Russian invasion can probably be treated similarly to, say, a possibility of a major earthquake on the West Coast disrupting infrastructure.

But yes. I do see how Canada is probably the easiest to set up for someone in US. If they just want something done right now, as immediate mitigation, and consider better options later, it makes sense.

What do you do about things you "don't" want backed up? Say old portfolio or social media sites tied to your name? If you can wayback any site doesn't this present some issues to sanitizing your online footprint?

If you control the site, you can use robots.txt if you really want to. Though I'd think carefully about if you really want to do this.

If someone else owns the site or if it's a social media site, you'd have to see what the site owner will do. There's probably not much you can do on your own to prevent the site from being archived.

archive.org respects robots.txt so you can exclude what you want I guess, I'm not sure whether or not that's a great thing though.

Preferably people would think more carefully before disallowing / I guess. I have many times been disappointed that info is entirely gone because archive.org respects robots.txt and the site is now offline forever.

I suppose it depends on whether you place greater value on historical accuracy or personal image. No doubt lots of people have published silly or embarrassing things years ago, but those things are still real.

I've donated, and bought some stickers. You guys should get some cooler swag :)

I donated. It feels really good to help such a good cause, however little it may be.

I don't have a lot of money to give (the holidays are expensive), is there a way to make a continuing contribution monthly? Is there a way I can volunteer my time? I live in the bay area.

Yes - they do have options for a monthly recurring donation. I just went for a one-time donation, but it is definitely an option.

Link: https://archive.org/donate/

Why is it important to keep it "Reader Private" (as in the article title)? What does that actually mean?

I almost never use this service, but I'm happy to throw $5/month at it.

I prefer to donate to them rather than wikipedia these days.

Is there Internet Archive of YouTube?

Jason Scott, Internet Archive. We do archive Youtube videos but not, you know, every single one.

Internet archive is a great project. It has been allowed to be created, in the first place. Then it was and is allowed to exist. This is what I like about the modern, freer, liberal western democratic nations.

Then there are the great people who spend their time, energy and resources to make such things tick. A great thank you to all those philanthropic people behind the Internet archive and similar such projects. It's because of you, people like me have a hope to learn something significant and with a relatively low cost footprint.

I learned many things thanks to FSF, GNU, Gutenberg, Wikipedia, Internet archive and currently the scihub. I spent only about $10 per month for internet access. Could I even imagine getting such high-class knowledge at such a low cost? I did not spend ridiculously high fees for college and still could learn a lot in history, economics, and some things from science, math, technology, engineering and many fields of knowledge. In fact, most of my significant education happened on the Internet, thanks to such projects.

I love the USA and the modern liberal western world who made such things happen. Hats off.

Disclaimer: I am from a third world country. $10 p.m. was an expensive thing for me for a large time.

PS: I hope to be able to contribute more to such projects soon. I do contribute a rather insignificant amount as compared to the scale of things.

They want to preserve the data.

How about this:

1. Program an app that asks the user whether archive.org could store, say, 1 gig of encrypted data on their hard drive. It wouldn't be mandatory, but you could help if desired. It would just sit on your hard drive. That gig of data would be changing on a regular basis. (Big data centers could offer to take in data. Hell, they could have another tax write-off at the end of the year.)

2. After all the data has been distributed around the world, the data transfer would start over again, but on different computers. In a short amount of time you might have millions of computers with part of The Internet Archive sitting idle on users' hard drives. The end result is the users would be worker bees, waiting for the queen to call them home. (In the end, you might have 1000 computers with the same block of data on their hard drive. Why? Because computers don't last forever.)

3. If we had a catastrophe, once the new Internet Archive was repaired/restored, the data lying dormant on millions of hard drives would come home to papa in an orderly manner.

4. It would remind people of the importance of preserving history. It would bring more attention to The Internet Archive. It would bring in a sense of team. Why not try it until this 592c3 gets their donations?

5. Yes--this is off the top of my head. I would need to put more thought into it.
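The replication idea in the list above can be sketched in a few lines of Python. This is a toy model, not a description of IA.BAK or any real system: the function names, the `replicas` parameter, and the volunteer identifiers are all hypothetical, and encryption and the actual data transfer are left out entirely.

```python
import random

def assign_blocks(block_ids, volunteers, replicas=3, seed=0):
    """Map each block to `replicas` distinct volunteers, chosen pseudo-randomly.

    Requires replicas <= len(volunteers). The seed only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    return {block: rng.sample(volunteers, replicas) for block in block_ids}

def recoverable_blocks(assignment, online):
    """Return the blocks that still have at least one copy on an online volunteer."""
    online = set(online)
    return {block for block, holders in assignment.items() if online & set(holders)}

volunteers = [f"volunteer-{i}" for i in range(10)]
assignment = assign_blocks(range(5), volunteers)

# After a catastrophe, ask whichever volunteers are still reachable for their copies.
survivors = volunteers[:4]
print(sorted(recoverable_blocks(assignment, survivors)))
```

Raising `replicas` trades volunteer disk space for a better chance that every block survives when some machines disappear, which is the reasoning behind keeping the same block on many computers.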

Distributed archiving feels like oral tradition. I'm smiling.

How about a distributed, social archive of music, dedicated to high quality and bit-level verification, with metadata and user discussion? With a ratio economy that rewarded contributions?

French police recently raided the infrastructure of such a place, and now it's gone. It was around for 8 years.

what.cd ? it seems like it wasn't that distributed. Also it was illegal; I love music, it appears to me they were music lovers before pirates, am sure there was a lot of rare and valuable content, but it was illegal, I can't really be too sad if they're busted.

Who knows, maybe users will organize in a different way to make a more legal repository of music.

What.cd reminded me a lot of the abandonware sites.

I may have allegedly been a pirate years ago. And I may have allegedly spent a lot of time at various abandonware sites because I like older games. And those sites seemed great. They had strict rules on what could and could not be uploaded and very much took the approach of "this is an archive". Then GoG launched and made it reasonable to buy those older games in a format that would (usually) work on modern systems.

Great right? We won! Nope. All four of the sites I used to (allegedly) frequent had responses ranging from "They aren't the creators so we are still going to let you upload these files" to "Some of our uploads are in iso format so it is still required". Hell, one even allowed people to upload the gog installers.

I am mostly good with archive.org (I have some reservations but feel them to be a net good), but my general experience is that most "archive" sites tend to just be pirates who think they understand the legal system.

It was distributed among around 115,000 people. That it was illegal doesn't make it any less valuable -- in fact, it might make it more valuable, since it was a rebellion against unjust, culture-destroying copyright law.

The Internet Archive worked closely with What.cd to archive the metadata that was painstakingly maintained there. They may even have snatched the perfect FLACs, and are shipping them offshore, so they can be made available when and if the US allows things to enter the public domain again.

What's the legal distinction between what.cd and the Internet Archive?

What archive.org backs up is already public content ? not paid one. There are exceptions (books and videos) but I assume they are negotiated with rights owners. Did what.cd do this too ? I don't know how they operate, I only heard about them last week.

> What archive.org backs up is already public content ? not paid one.

Legally, that's irrelevant (except maybe for calculating damages). Publicly available content is just as copyrighted, and paid content may be in the public domain (e.g. printed copies of Oliver Twist).

Barring an explicit license, one can't copy any content on any website, except for simply displaying it (there's an implicit license). And you certainly can't re-distribute it.

> There are exceptions (books and videos) but I assume they are negotiated with rights owners.

Why do you assume that, when anyone can upload them?


I assume that because archive.org is a massive open public fucking website, not a closeted circle like what.cd, requiring invitations to even log in apparently.

I'll also assume that it's as easy to upload copyrighted material as it is for the rights owner to remove it.

You're totally right about the license of publicly available content. I handwaved over it, assuming that people still wouldn't mind backup by a third party as long as it doesn't damage them (and again I'll assume archive.org accepts removal when demanded... which I'm gonna check right now).

pse: https://archive.org/about/faqs.php#Rights

https://archive.org/about/faqs.php#Movies (search for Who owns the rights to these movies?)

I agree with the idea of what archive.org claims to be doing but it doesn't seem right the way that they are going about it.

The exclusion specifically of any ISIS supporting articles and videos makes it seem that archive.org is not truly interested in creating an archive for future generations but is instead interested in creating an archive which supports their political/religious beliefs.

Cataloging and archiving Islamic State videos doesn't mean that one endorses their beliefs or supports the organization.

It's a shame that what could've been an organization for good has become an Islamophobic political organization.

You're confusing archiving and public access. Most archives don't have public access to all of their materials.


Don't feed the trolls, people.

Not supporting ISIS doesn't make you Islamophobic.
