
More than 9M broken links on Wikipedia are now rescued - infodocket
https://blog.archive.org/2018/10/01/more-than-9-million-broken-links-on-wikipedia-are-now-rescued/
======
noufalibrahim
I worked at the Archive for a few years remotely. It permanently altered my
view of the tech world. Here are the notable differences. I think these would
apply to several non-profits, but this is my experience.

1\. There was no rush to pick the latest technologies. Tried and tested was
much better than new and shiny. Archive.org was mostly old PHP and shell
scripts (at least the parts I worked on).

2\. The software was just a necessity. The data was what was valuable.
Archive.org itself had tons of kludges and several crude bits of code to keep
it going, but the aim was to keep the data secure, and it did that. Someone
(maybe Brewster himself) likened it to a ship traveling through time. Several
repairs with limited resources have permanently scarred the ship, but the
cargo is safe and pristine. When it finally arrives, the ship itself will be
dismantled or might just crumble, but the cargo will be there for the future.

3\. Everything was super simple. Some of the techniques used to run things
were absurdly simple, and purposely so, to help keep the whole thing
manageable. Storage formats were straightforward so that even if a hard disk
from the archive were found in a landfill a century from now, the contents
would be usable (unlike if it were some kind of complex filesystem spanning
multiple disks).

4\. Brewster, and consequently the crew, were all dedicated to protecting the
user, e.g. [https://blog.archive.org/2011/01/04/brewster-kahle-receives-the-zoia-horn-intellectual-freedom-award/](https://blog.archive.org/2011/01/04/brewster-kahle-receives-the-zoia-horn-intellectual-freedom-award/).
There was code and stuff in place to avoid even accidentally collecting data,
so that even if everything was confiscated, the user identities would be safe.

5\. There was a mission. A serious social mission. Not just "make money" or
"build cool stuff" or anything. There was a buzz that made you feel like you
were playing your role in mankind's intellectual history. That's an amazing
feeling that I've never been able to replicate.

Archive.org is truly one of the most underappreciated corners of the world
wide web. It gives me faith in the positive potential of the internet.

~~~
philipps
Thank you for sharing this. I’ve donated (small amounts of) money to Mozilla
and Wikipedia in the past. Your post makes me consider donating to archive.org
this year.

Edit: typo

~~~
nojvek
How does archive.org make money? I imagine their storage costs must be quite
high.

~~~
toomuchtodo
They don’t. They are a non-profit 501c3 charity that relies on donations.

~~~
scrollaway
I think what the parent meant is "how does archive.org pay their bills?".

~~~
toomuchtodo
I thought I was answering that. Where did I go wrong?

~~~
scrollaway
Are you saying they don't pay their bills?

~~~
toomuchtodo
I'm saying they pay their bills (utilities, hardware costs, salaries) with
donations, US dollars obtained from those donating.

> How does archive.org make money?

Donations
([https://projects.propublica.org/nonprofits/organizations/943242767](https://projects.propublica.org/nonprofits/organizations/943242767))

> I imagine their storage costs must be quite high.

No, they aren't. Building and hosting your own storage is cheap. Same reason
Backblaze and Dropbox built their own storage systems.

~~~
scrollaway
> _Building and hosting your own storage is cheap_

Archive.org uses S3 extensively. Not exactly cheap.

~~~
toomuchtodo
Can you provide a citation? To my knowledge, the Archive does not use Amazon's
S3 storage system; it exposes an S3-like API (which it refers to in places as
"S3" [1]) on top of its own internal storage system [2].

[1]
[https://archive.org/help/abouts3.txt](https://archive.org/help/abouts3.txt)

[2] [https://archive.org/web/petabox.php](https://archive.org/web/petabox.php)

~~~
noufalibrahim
To the best of my knowledge, the Archive has its own machines to store data.
It _is_ an Archive, and one of the principles was to have the know-how to
preserve data even if the cloud providers disappear.

------
jonah-archive
If you're curious to learn more about us, we're hosting our big annual event
in SF this Wednesday (Oct 3)! Details:
[https://blog.archive.org/2018/08/20/save-the-date-building-a-better-web-internet-archives-annual-bash/](https://blog.archive.org/2018/08/20/save-the-date-building-a-better-web-internet-archives-annual-bash/)

~~~
jhabdas
If you're not hosting that site P2P yet you're just centralizing the
distributed Web.

~~~
jonah-archive
[http://dweb.archive.org](http://dweb.archive.org) \-- previously:
[https://news.ycombinator.com/item?id=17685682](https://news.ycombinator.com/item?id=17685682)

------
ravenstine
Archive.org is such a wonderful institution.

The other day, I discovered that the Wayback Machine has been archiving
YouTube videos in full HD. Most videos aren't on there, of course, and it
seems to only go back as far as ~2012 (HTML5 video switchover?), but some of
them are there.

Y'all will be getting more donations from me. :)

~~~
jxramos
I was wondering if something like that existed. A lot of times I add to my
YouTube Watch Later list, which can often be much later indeed; by the time I
get around to a video, it's not unusual for it to have been removed by the
user or deleted for unknown reasons. And there's no text left to even see what
the video was.

~~~
pymai
That does my head in. I have so many videos on most of my playlists that I'm
hardly ever able to guess what the missing video was.

~~~
ravenstine
This is why I back up videos with youtube-dl. Internet history can be
completely erased at a moment's notice, and some things ought to belong to the
public if they would otherwise never see the light of day again.

~~~
dylan604
I'd suggest a slight revision to your theorem:

> Internet history can be completely erased at a moment's notice (unless it
would be embarrassing for you later in life)...

~~~
ravenstine
Yes, indeed.

------
fermienrico
I've always wondered this: how does Archive.org work in terms of storage? The
internet is massive, and caching every single site periodically for years on
end, isn't that an unreasonably huge amount of data?

Edit: I just checked Wikipedia, it says they're using about 15 PB of storage.

Edit 2: 15 PB cost => 15,000 TB x $30/TB = $450,000. Of course, that's a
back-of-the-napkin cost (no maintenance, power, etc.). That's not too bad
actually.

~~~
bnewbold
The Archive currently has about 46 Petabytes of content ("bytes archived"),
and over 120 PB of raw disk capacity; the difference is due to data
replication, "currently filling" storage, non-storage infrastructure, etc.

We save a lot on web content storage by de-duplicating "revisits" when the
page hasn't changed. This saves a whole lot for content like jQuery served
from a common CDN URL; it doesn't work well when there is a page counter or
any trivially changing content on a page.
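
The revisit idea can be sketched in a few lines; the record layout here is
illustrative, not WARC's actual revisit-record format:

```python
import hashlib

store = {}      # (url, payload digest) -> timestamp of the full capture
records = []    # what we'd write out: full responses or tiny revisit pointers

def capture(url, body, ts):
    digest = hashlib.sha1(body).hexdigest()
    key = (url, digest)
    if key in store:
        # identical payload seen before: store only a pointer, no body
        records.append((ts, url, "revisit", store[key]))
    else:
        store[key] = ts
        records.append((ts, url, "response", digest))

capture("http://cdn.example/jquery.min.js", b"/*! jQuery */", "20180101")
capture("http://cdn.example/jquery.min.js", b"/*! jQuery */", "20180201")
```

The second capture stores no payload at all, which is why unchanged CDN
content is nearly free; any byte-level change (like a page counter) produces a
new digest and a full record.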

If you are interested in the storage back-end, it's actually pretty simple:
HTTP requests/responses are concatenated and compressed into WARC files (sort
of like .tar.gz) that get stored on regular old ext4 filesystems. An index of
"what URL captures are in what WARC files on what servers" is continuously
generated in the form of, basically, a giant sorted (and sharded) .tsv file.
Replay requests on web.archive.org look up the URL and timestamp, get a
reference to a machine, file, and file offset, and make an HTTP/1.1 range
request for the content in question. There are a bunch of other details, like
checking robots.txt status, but the core design is super simple, cheap, and
(relatively) easy to operate at scale.
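
A toy version of that lookup path, with illustrative field names (not the
actual index schema):

```python
import bisect

# sorted (url, timestamp) index mapping to (machine, warc file, offset, length)
index = [
    ("example.com/", "20180101000000", ("host3", "crawl-001.warc.gz", 51200, 4096)),
    ("example.com/", "20180601000000", ("host7", "crawl-042.warc.gz", 1024, 2048)),
]
keys = [(u, t) for u, t, _ in index]

def lookup(url, timestamp):
    # find the latest capture at or before the requested timestamp
    i = bisect.bisect_right(keys, (url, timestamp)) - 1
    if i >= 0 and keys[i][0] == url:
        return index[i][2]
    return None

host, warc, offset, length = lookup("example.com/", "20180301000000")
# the replay service would then fetch just that slice of the WARC file with
# an HTTP/1.1 header like: Range: bytes=51200-55295
```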

Apart from web crawl content (including, these days, "heavy" video content
which is difficult to de-dupe), we have a large amount of live recorded TV,
scanned books (raw photos), etc.

(I currently work at IA)

~~~
Nition
It would be nice if you could store a diff, so a page with a changing counter
would only have to store the changed counter after the initial save.

~~~
murukesh_s
Wondering the same. What if it used the git protocol itself? Not sure how
efficient git is, but if it is, it would be a relatively easy change.

~~~
eythian
Git doesn't store diffs, it stores whole files.

~~~
rakoo
No, git packs do store base blobs and diffs, which are not visible to end
users; the git plumbing hides this and only presents blobs to the user.

------
Hard_Space
It's an invaluable resource. In the past week I realized it is quite likely
that a site I used to work at, which hosts a lot of my portfolio, might
disappear or be heavily amended. So I located and indexed a complete list of
my articles there -- and I was even able to click a button and create an
archive for the few pages that IA hadn't bothered to index. (I wish I had
known about this before; it was a shock to find that IA can be quite
selective, and that a page you were hoping was there simply isn't, and is now
irretrievable.)

But as one other commenter here has mentioned, you're only a robots.txt
amendment away from the oblivion that the entire IMDb comments section fell
into [1], so a good archiving system is essential. I use Save Page WE (no
affiliation) on Waterfox:

[https://addons.mozilla.org/en-US/firefox/addon/save-page-we/](https://addons.mozilla.org/en-US/firefox/addon/save-page-we/)

[1]
[https://news.ycombinator.com/item?id=13571893](https://news.ycombinator.com/item?id=13571893)

~~~
zmw
The Wayback Machine isn’t Googlebot; it doesn’t crawl the web, so there’s no
such thing as “hadn’t bothered to index”... Someone, be it a human or a bot,
needs to submit a page for archival.

Programmatically submitting to the Wayback Machine is trivial enough, so I
have cron jobs backing up most of my static sites (in their entirety)
periodically.
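
The endpoint involved is just `https://web.archive.org/save/` followed by the
page URL; a minimal sketch of such a cron job's core (the page list is a
placeholder):

```python
from urllib.request import Request, urlopen

SAVE = "https://web.archive.org/save/"

def save_request(page_url):
    # a plain GET to the Save Page Now endpoint asks the Wayback Machine
    # to capture the page; identify your script politely in the User-Agent
    return Request(SAVE + page_url,
                   headers={"User-Agent": "site-backup-cron/1.0"})

pages = ["https://example.com/", "https://example.com/about.html"]  # placeholder
requests = [save_request(p) for p in pages]
# for req in requests: urlopen(req)  # uncomment in the actual cron job
```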

------
modeless
This just reminded me to donate. I've used the archive several times just in
the past few days to resolve 404s on old gamedev blogs. I'm amazed how often
what I'm looking for is in the archive, considering how big the internet is
and how niche the content I'm looking for is. Truly an amazing resource,
comparable to Wikipedia in value.

~~~
ahmedalsudani
To everyone considering donating, please set up a monthly donation if it’s
within your means!

I used to donate every time something reminded me of the value of the archive.
Now I just think “that’s why I have a monthly pledge!”

~~~
toomuchtodo
Also, your employer may match charitable contributions (up to a predefined
amount). Check if they do! It’s effectively free money for the Internet
Archive.

~~~
modeless
Googlers can donate with two clicks on G-Give, with Google matching. I highly
recommend it.

------
bobochan
I got a call a few years ago from a member of a humanitarian organization that
had accidentally lost, with no backups, a significant percentage of their web
site detailing projects they had completed over many years. The people who had
done the work had moved on, and they were frantic that the record was gone
forever, but the Wayback Machine had almost perfect captures to restore
everything.

~~~
toomuchtodo
There is a GitHub project out there where you specify the site, and it will
rebuild the content locally from Wayback captures. Something to consider for
last-resort recovery.

EDIT: [https://github.com/oduwsdl/warrick](https://github.com/oduwsdl/warrick)

------
Cogito
They mention at the end of the article that 'content drift' may be a bigger
issue than link rot; when the content of the post is simply changed rather
than missing, it is much harder to notice.

Is there a scalable way to monitor Wikipedia links to see if the content is
changed after originally being posted?

They are already storing every link in the Internet Archive when it gets
added, so there should be a reference point to compare against.

One easy option would be to make Internet Archive links available for every
single link on Wikipedia, even if it hasn't rotted yet. So a 'live' link to
the current content, and an archive link for what it was at the time of
linking.
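
A sketch of such a drift check, assuming hypothetical helper names and a crude
regex strip in place of real boilerplate removal:

```python
import hashlib
import re

# Fingerprint a page's visible text so markup-only redesigns don't trigger
# a drift alarm (a real system would use proper content extraction).
def fingerprint(html):
    text = re.sub(r"<[^>]+>", " ", html)       # crude tag stripping
    text = " ".join(text.split()).lower()      # normalize whitespace and case
    return hashlib.sha256(text.encode()).hexdigest()

# at link-time, store fingerprint(live_html) alongside the archive snapshot;
# a later re-fetch with a different fingerprint flags the link for review
at_link_time = fingerprint("<p>The study found a 5% effect.</p>")
redesigned   = fingerprint("<div class=new><p>The study found a 5% effect.</p></div>")
drifted      = fingerprint("<p>The study found a 50% effect.</p>")
```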

~~~
Sujan
Interesting aspect!

The biggest problem with this would probably be how to recognize whether the
"content" changed. A site can change its entire design, navigation, footer,
and header and still have exactly the same "content". For a human being this
is simple enough to see, but a tool might have problems with it.

~~~
Cogito
Yes, this is a fundamental issue if you want to do this at scale.

There are a few approaches to this already, using tools like outline.com to
pull the content out of the cruft, but I don't know how many of these are
general-purpose and how many are purpose-built for each site (and maintained
for the current version of the site, perhaps?)

As seen in the article, most links are to a small number of sites, so perhaps
hard coding the content extraction would be feasible, especially for an
initial study.

I think it would be interesting to see just how many links have identical
content, but you're right that the number will be greatly skewed if there are
any ads or similar included.

------
marknadal
And, for those who don't know, you can help host the Internet Archive now by
running a P2P/decentralized backup of it:

[https://news.ycombinator.com/item?id=17685682](https://news.ycombinator.com/item?id=17685682)

~~~
jacquesm
Thank you for pointing this out, I will see what I can do to help, this is the
sort of thing I'm more than happy to dedicate resources to. I still have a 36
drive enclosure lying around that would make a nice bit of storage if I can
get it to be silent enough for the home.

~~~
marknadal
This is the repo Mitra has been working on for this:
[https://github.com/internetarchive/dweb-mirror](https://github.com/internetarchive/dweb-mirror)

------
LoSboccacc
And they are one robots.txt away from cancellation. For all the good they do,
retroactively applying robots exclusions to their crawler is a terrible thing.
Luckily there are alternatives going forward.

~~~
slededit
They keep the data and just don’t display it. The last thing they need is a
court order demanding they delete it.

Sure, the archive is useful today, but its primary purpose is retaining
information for future generations. If that means placating copyright holders,
it's worth the cost.

~~~
rovr138
> They keep the data and just don’t display it.

I’ve read about the robots.txt mentioned before but hadn’t seen this
mentioned. Any idea if they have this somewhere on the site?

~~~
slededit
Sure, see here: [https://archive.org/post/133690/robotstxt-only-gives-temporary-removal-of-pages](https://archive.org/post/133690/robotstxt-only-gives-temporary-removal-of-pages)

------
solarkraft
I spend quite some time on archive.org. Of course the Wayback Machine is
great, but I am mostly interested in old digitized media. There really is some
great stuff on there -- but what is really missing is organization (and a less
broken-seeming website, I guess). It doesn't help much to have a great archive
if no one can find anything. User curation would help a lot with this.

~~~
movedx
Found any gems you want to share?

What you're doing is essentially digital archaeology, which is super cool. In
50-100 years, if not sooner, people will be digging through digital
"graveyards" for evidence of this and that. That's so intriguing.

~~~
sixdimensional
I like this term “digital archaeology”. I find myself saying it a lot when
dealing with “only” 20 year old database data in my day job. Apparently, but
not surprisingly, it’s a real thing[1]!

[1]
[https://en.m.wikipedia.org/wiki/Digital_archaeology](https://en.m.wikipedia.org/wiki/Digital_archaeology)

~~~
InternetUser
There's long been a saying that "Once it's out there (on the Internet), it's
forever," but I used to save links in a Microsoft Word document, and I went
through them a few years later and almost none of them worked anymore. The
years in which they were saved was 2006 to 2009, and the year I went through
them was 2012. The links were from MySpace (which totally overhauled the
entire site and all content), Facebook (where users had deleted their profiles
or pictures), Tumblr (where bloggers rename their blogs, which change the URL,
or they wipe them clean, or delete their blogs), YouTube (tons of videos and
whole accounts have been deleted because of copyright infringement, whether by
the account holder or by YouTube itself), Blogspot (same, but also that some
bloggers made their blogs private, perhaps to prevent spam-comments or
trolling), Yahoo articles (which I see Yahoo deletes after some time),
Style.com (Vogue magazine's website of all runway shows, which are now on
Vogue.com instead, with a different URL structre), and dozens of other
websites that don't exist anymore.

I think the statement about "stuff that's out there" really only applies to
famous or public people, where leaked and/or damning photos or videos are
quickly copied, saved, and rehosted by websites all over the world, including
Twitter, Pinterest, and other platforms. For instance, while Google Images
fastidiously won't show you the hacked photos of "Jennifer Lawrence naked," as
Google sought to avoid a $100M lawsuit [0], Bing Images, once you turn off
Safe Search, shows plenty of sites that host the pictures, with the most
frequent such site being a German-based one called "OhFree," though there are
at least 3 Blogspot sites as well, I suppose ironically.

[0] > "We've removed tens of thousands of pictures," says the web giant -
[https://www.hollywoodreporter.com/thr-esq/google-responds-jennifer-lawrence-attorneys-737656](https://www.hollywoodreporter.com/thr-esq/google-responds-jennifer-lawrence-attorneys-737656)

------
iso-8859-1
There was big drama in 2012. [http://Archive.is](http://Archive.is) was
proactively archiving Wikipedia links. An unauthorized bot (RotlinkBot) was
linking to Archive.is. The bot was banned.

I liked how fast Archive.is was at archiving, and how much cleaner its UI was.
And since it proactively archived links, it still happens today that a dead
reference link will be archived on Archive.is but not in the Wayback Machine.

See
[https://en.wikipedia.org/wiki/User:RotlinkBot](https://en.wikipedia.org/wiki/User:RotlinkBot)

~~~
duckington
I'm guessing Archive.is will probably disappear within the next 5 years,
taking all its data down with it.

Nobody knows who owns or maintains the site, and recently the mysterious owner
started taking donations to keep the site running. It's a commercial
enterprise.

Slick UI or not, Archive.org's longevity seems far more assured.

------
twelvechairs
The internet archive is great for static pages but what will happen for
today's interactive content with complex data stored across different domains?

~~~
tomatotomato37
They save JavaScript, Flash apps, and even some downloads too. Just recently I
used them to get an old Flash game from a studio that went bust a couple of
years ago.

------
ta3216
If a tree falls in the forest and nobody is there to hear it, does it make a
noise? If IA is storing copyrighted (noarchive) content but not displaying it,
does that make it acceptable?

~~~
Benjamin_Dobell
Yes, because copyright only lasts for a finite duration. In 100+ years, when
the original rights-holders have disappeared, or when copyright expires and
the original rights-holders have no incentive to keep the originals, the IA
and similar archiving efforts will be able to make their copies available as a
matter of historical importance.

------
agumonkey
a worthy donation candidate ;)

