
Wikipedia and Internet Archive partner to fix 1M broken links on Wikipedia - The_ed17
https://blog.wikimedia.org/2016/10/26/internet-archive-broken-links/
======
jacquesm
In the long term the Internet Archive will likely be the major supplier of
references to Wikipedia. Webpages don't live forever; hopefully the Internet
Archive does. It's an extremely valuable resource: the Archive and Wikipedia
are amongst the most valuable digital assets we have.

~~~
The_ed17
I've been an editor on Wikipedia for years, and it's simply amazing how many
web pages I referenced in 2008–09 have disappeared. Digital archivists have
their hands full.

~~~
dredmorbius
I had to migrate my archived articles off Readability prior to the EOL at the
end of September. I'd only used the service for a couple of years, and hit
about a 5% bitrot rate, this on fairly significant articles.

One of the more curious cases was CSIRO (Australia's national science and
research organisation) which seems to have not only deliberately purged a fair
amount of data (Graham Turner's work specifically), but has a robots.txt in
place which blocks archival by TIA. That strikes me as ... downright curious.
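
As a rough illustration of how a robots.txt rule can shut out an archive crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents below are hypothetical and are not CSIRO's actual file; `ia_archiver` is the user-agent token historically associated with the Internet Archive's crawler, and the helper name `is_allowed` is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
# This is NOT the file CSIRO actually served.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://www.csiro.au/en/Portals/Multimedia/CSIROpod/Growth-Limits.aspx"
archive_ok = is_allowed(SAMPLE_ROBOTS_TXT, "ia_archiver", page)
other_ok = is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", page)
```

With a file like this, the archive crawler is refused everywhere while other crawlers are unaffected, which is the kind of selective block described above.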

~~~
toomuchtodo
Could you reply with CSIRO links blocked by robots.txt? I run my own instance
of ArchiveTeam's ArchiveBot (which archives links provided regardless of
robots.txt), and would be happy to put the content into cold storage.

~~~
dredmorbius
I'll need to check my Readability dump.

~~~
toomuchtodo
No rush; I'll bookmark this thread to check two weeks from now.

~~~
dredmorbius
Found it. Apparently the robots.txt has been fixed:

404: http://www.csiro.au/en/Portals/Multimedia/CSIROpod/Growth-Limits.aspx

Now available:
http://web.archive.org/web/20120508210658/http://www.csiro.au/en/Portals/Multimedia/CSIROpod/Growth-Limits.aspx

That's among the specific links which _wasn't_ being served by TIA earlier.
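
For readers unfamiliar with the snapshot URL shape used above, a small sketch of how it breaks down: a 14-digit capture timestamp (`YYYYMMDDhhmmss`) followed by the original URL. The function name `split_wayback_url` is hypothetical, and this only handles the plain `http://web.archive.org/web/<timestamp>/<url>` form seen in this thread.

```python
from datetime import datetime
from typing import Tuple

WAYBACK_PREFIX = "http://web.archive.org/web/"


def split_wayback_url(snapshot_url: str) -> Tuple[datetime, str]:
    """Split a snapshot URL into (capture time, original URL)."""
    if not snapshot_url.startswith(WAYBACK_PREFIX):
        raise ValueError("not a web.archive.org snapshot URL")
    rest = snapshot_url[len(WAYBACK_PREFIX):]
    # Everything before the first "/" is the timestamp; the rest is the
    # original URL, which itself contains further slashes.
    timestamp, _, original = rest.partition("/")
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original


captured_at, original_url = split_wayback_url(
    "http://web.archive.org/web/20120508210658/"
    "http://www.csiro.au/en/Portals/Multimedia/CSIROpod/Growth-Limits.aspx"
)
```

Here `captured_at` comes out as 8 May 2012, 21:06:58, so the snapshot linked above predates the 404.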

~~~
toomuchtodo
Awesome. I've emailed CSIRO to try to track down that podcast included in the
article that was not archived.

~~~
dredmorbius
Thanks, I really appreciate this.

dredmorbius@gmail.com if you happen to track that down.

~~~
toomuchtodo
Emailed. Also, archiving all of the current version of csiro.au, just in case.

~~~
dredmorbius
It seems to be climate, CO2, and limits-related work that is most prone to
being censored.

~~~
toomuchtodo
Not surprising based on the political climate in AU currently.

------
eriknstr
The archive.is guy provides mirrors of rotten links to Wikipedia also,
although not as the result of any official agreement with Wikipedia, just on
his own initiative, which I think was nice of him.

Encyclopedia Dramatica is generally not a reputable source of truth, being
the site that it is, but while looking for some more information on archive.is
mirroring of links from Wikipedia articles, I found an article on ED that I
found interesting. It is heavily advocating one side of the story but at least
it backs it up with some links, which is rather seldom on ED (most links on ED
usually go to other pages on ED in my experience).

[https://encyclopediadramatica.se/Archive.is](https://encyclopediadramatica.se/Archive.is)

~~~
necessity
It's amazing to me how Wikipedia ends up being a reasonably good website with
such a cancerous community behind it.

~~~
TorKlingberg
What are you referring to?

~~~
corobo
I'm not necessarily agreeing with the OP here and I don't even know what the
community is like but this seems a decent page to start[1]. I do like how
Wikipedia keeps a page on its own controversies - I mean it makes sense, but
I like that they're open about it.

[1]
https://en.wikipedia.org/wiki/List_of_Wikipedia_controversies

------
shortformblog
Excellent news. Should note that today is the 20th anniversary of the Internet
Archive:
https://blog.archive.org/2016/10/26/making-the-web-more-reliable-20-years-and-counting/

------
qwertyuiop924
I'm really glad this is happening. Wikipedia needs to clean up their broken
links, and this could help the archive get a wider sampling of websites, so as
to preserve more data.

Websites going offline is a huge problem. For example, the now-famous thread
from which sleepsort originated (on 4chan's /prog/ textboard) isn't archived
anywhere: textboard threads are immortal, so nobody thought to archive any
threads until dis.4chan.org went down for good.

Thankfully, some bright spark managed to save the sqlite databases for most of
the boards on dis to the Internet Archive, so I was able to track down the
thread eventually.

------
pmiller2
This is a huge step forward for Wikipedia as an authoritative source of
information. Glad to see this happening. :)

OT: I considered applying to the Internet Archive last time I was looking for
work, but their office is too hard to commute to coming from the East Bay. :(

~~~
brudgers
I agree that it's a big step forward for encyclopedias. Not just as a 'source
of truth' but also in terms of automating away a lot of the routine editorial
maintenance that needs to happen at Wikipedia's scale.

------
ideonexus
This whole discussion reminds me of how all MySpace content was destroyed in a
rash corporate decision years ago. Just like that, five years of the most
popular social networking site on the World Wide Web and all its history were
wiped out:

http://activehistory.ca/2013/06/myspace-is-cool-again-too-bad-they-destroyed-history-along-the-way/

Unfortunately, the Internet Archive was only able to get the non-logged-in
version of the site. All those loud, obnoxious profile pages users spent
endless hours working on? We only have oral histories now to remember them.

~~~
smsm42
I wish I could get all my old horrible homepages back. There was a time you'd
have to torture me to admit I had anything to do with _that_, but now I would
probably be proud of them again. It's history now.

------
caf
It'd be great if StackOverflow approached the Internet Archive about doing the
same for their broken links, too.

------
sengork
Internet Archive should look into distributed models such as IPFS for storage
of the archived sites.

~~~
toomuchtodo
The IPFS team is working with the Internet Archive on this.

~~~
sengork
Excellent, it may be a better way to provide spare bandwidth/storage similar
to their https://archive.org/details/archiveteam-warrior

~~~
toomuchtodo
For sure. ArchiveTeam has explored providing mirrors of the entire Archive
[+], but IPFS is a perfect fit for the task.

[+]
http://www.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK

------
felipesabino
As I have clicked on several broken links already, I am wondering how many
links per article, as an absolute number or a percentage, are likely to be
broken.

I might be way off, but doesn't 1M seem like a low number for Wikipedia's
size? What is that as a percentage of the total number of links? Does anyone
know?

------
youdontknowtho
Wow. I really love the internet archive as a project. This is a great usage.
Looking forward to seeing how that will work out.

I wonder if they will publish a list of replaced links after the fact?

------
h1d
What's stopping Wikipedia from just archiving the referenced pages on edit?

It would be far more reliable than depending on Internet Archive when it may
not have the page archived and more likely the time of the archive would
differ from the time it was referenced.

It would cost some more disk space and bandwidth, which of course is already
pressuring them but in turn would greatly improve usability and reliability.
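
One way to picture the "time of the archive would differ" concern is to ask the Wayback Machine which snapshot is closest to the date a reference was added. The sketch below only builds a query for the Internet Archive's availability endpoint and reads a response body; it makes no network call. The helper names are hypothetical, the sample response is hand-written to match the shape that endpoint is documented to return, and nothing here reflects how Wikipedia's actual tooling works.

```python
import json
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: str = "") -> str:
    """Build a query URL asking for the snapshot closest to timestamp."""
    params = {"url": url}
    if timestamp:
        # Timestamp is YYYYMMDD (or longer), e.g. the date of the edit.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the closest snapshot URL from a response body, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hand-written sample response, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20160101000000/http://example.com/a",
            "timestamp": "20160101000000",
            "status": "200",
        }
    }
})

query = availability_query("http://example.com/a", "20160101")
snapshot = closest_snapshot(SAMPLE_RESPONSE)
missing = closest_snapshot('{"archived_snapshots": {}}')
```

An empty `archived_snapshots` object is the "not archived" case, which is exactly the gap an archive-on-edit approach would close.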

~~~
digi_owl
Likely some interpretation of how Wikipedia is not to be a primary source.

------
raverbashing
One corner case that exists: content is linked on Wikipedia, and then that
content is taken down due to a copyright violation.

(I suppose Archive.org would be asked to take the content down)

~~~
sp332
Archive.org will take content down for certain reasons, but they have a pretty
broad copyright exemption as a non-profit archive.

------
torrent-of-ions
Why does the headline say "to fix 1M broken links" when the article says it's
already been done?

~~~
cooper12
Yeah that's a bit confusing. I'd attribute it to the "press release" nature of
Wikimedia's blog where they mean that they have _already_ partnered with the
Internet Archive and are announcing it after the fact.

------
45h34jh53k4j
❤️💛💚💙 Internet Archive ❤️💛💚💙

There are few more noble pursuits than archiving the sum of human knowledge.

------
alecco
On a side note, it makes me very sad how Wikipedia editors are often pushing
some political agenda. I'm relying on it for fewer and fewer topics. Clearly
nothing that can be affected by US politics or SJW-style controversies.

