Notes on bookmarks from 1997 (notes.pinboard.in)
220 points by spindritf on Aug 31, 2014 | 52 comments



> First, donate to the Internet Archive: http://archive.org/donate/

This is a good step, but there's one massive problem with the Internet Archive: If the domain registration lapses (or is sold), the new owner can direct the Internet Archive to remove archived copies of the site previously hosted on that domain[0].

In other words, even if the archives are there today, they may not be there when you want them a year (or more) from now. I've been bitten by this in the past.

This means that the Internet Archive wouldn't have been able to prevent or mitigate some of the cases of link rot described here, as the content could have been scrubbed from the archive.

EDIT In case it wasn't clear, I still support the Internet Archive - I just want to make sure people know this issue exists and that it's not a "silver bullet" for this problem.

[0] I'm fairly certain this is still the case; if they have changed their policy recently I would be pleasantly surprised.


Hi, author here. You're right, of course. Of the 77 "200 OK But Gone" URLs, 35 had copies in the archive (more had copies than that, but only 35 were from the right year), while 5 (three domains total) had been blocked by a later robots.txt and 1 had been "excluded" from the Wayback Machine, which I assume is the removal option you mention.

That's 6 redacted URLs of 77, versus 35 good ones. I'm not going to not donate to the Internet Archive and help preserve access to those 35 because of those 6.

EDIT: In the interest of rigor, of the 76 "404 Not Found" URLs, 4 URLs (three domains) had been blocked by a later robots.txt. 45 had relevant content preserved and accessible. That's 10 redactions total.

Also, when the Internet Archive imports third-party captures, like those from Archive Team, they are included irrespective of the robots.txt at the time of capture; robots.txt is then used only to manage display of the content.

It's LOCKSS: Lots Of Copies Keep Stuff Safe. The Internet Archive isn't the only solution; we need more archives. But it's a start, and getting all the bookmarking services to contribute to it, and to improve and open up their caches, would be a big step forward.


It would shock me if there's a working copy of 'rm' allowed anywhere near the Internet Archive. They take this stuff down for compliance, but my dream is that the data lives on, waiting for a saner day when the legal climate for archivists gets a little warmer.


It's tough though, supportive as I am of the Internet Archive's goals. How is an "archivist" different from a random individual who scrapes stuff off the Internet and rehosts it? In the aggregate, the Internet Archive looks different from the typical person who is copying articles and blog posts, wrapping them up in ads, and displaying them. The IA is non-profit, doesn't run ads, etc. It also respects robots.txt. But it's not clear to me what legal regime would let the Internet Archive operate free and clear without also covering cases that most would agree are shady.


The only difference between science and screwing around is writing it down.

You can be trained as an archivist. You can get a Master's and a PhD in archival practice. There are industry-standard procedures and codes of ethics. There's a very specific understanding of what is important to save, how to save it, and how to document its context and its provenance.

That's why the Internet Archive requires a certain fidelity of capture (WARCs) that a screenshot service or a citation tool doesn't provide.

That's also why they are legally a library. Libraries have particular copyright exemptions for preservation. A typical person doesn't. But you generally have the right to make backups for your own use, and so you can also donate those backups.

It's like if you were a famous person who bought a newspaper and a book, and when you died your personal effects were donated to your alma mater, which put on a big exhibit of your life and times; that newspaper (your backup of the original that lives on the publisher's hard drives) and that book (your backup of the original that some author wrote) are there, too. No-one is conferring any rights to the content; the publisher still owns the newspaper and the author still owns the book, but that was your copy, and now it's available for everyone to see.


>That's also why they are legally a library.

Citation? I'm not aware of the IA having any special status.

A physical newspaper or book isn't a backup. It's a physical artifact that falls under first sale doctrine. The same doesn't apply to digital.


> How is an "archivist" different from a random individual who scrapes stuff of the Internet and rehosts it?

If the random individual is presenting it in the same way as archive.org (namely, citing the source), then I don't see a difference.


It's acting like a library; libraries have often wanted to have a copy of everything so that people can research it.

These often have exemptions written into law to allow them to do what they do, so I hope the IA is covered.


But libraries can't freely republish out of print books, which is about what the IA is doing. The equivalent would require the archive to have a room you'd visit with a terminal connected to the archive.


Unless it's child porn, say.


I'm not sure I understand this. Just because I buy/sell a domain in 2014, that doesn't give me the right to retroactively change what was there in 2007.

I'd like to hear more about the basis for such a policy; its only use case seems to be censorship of the past, which the Archive is clearly not aligned with.


Speculating on motivations: they are resource-constrained. Spending those resources on keeping thousands of pages available to the public, rather than on entering (or inviting) a protracted legal battle over a few contested domains, is a reasonable tradeoff.

As far as the mechanics of it, they follow a policy that explicitly recommends respecting robots.txt:

https://archive.org/about/faqs.php#Rights

http://www2.sims.berkeley.edu/research/conferences/aps/remov...


I worry about this a lot. Archive.org is doing an amazing job, but it seems like so much of our culture is getting lost precisely because it's online. My box of Popular Science magazines from the 1950s is amazing; even the ads are interesting. The difficulty of archiving this stuff is worrying, as is the discovery experience for someone 60 or 70 years down the road. I can pick up a crusty magazine and flip through it, cutting across a whole bunch of topics and their surrounding context. There are lots of silos trying to save bits of the web, but if this guy had put a bunch of papers in a box in 1997 and it hadn't been eaten by mold, all of it would still be around to browse today.


Absolutely. We know how to archive paper pretty well. Your box of magazines will last another 20 years in your garage. Maybe twice that in a climate-controlled vault.

But they're still made of cheap paper, and will eventually decompose, or your garage will flood, or bugs will eat them, or mold will be particularly bad one year. And, like anything physical, unless someone makes a pilgrimage to your garage (or that vault), no-one else will ever (re)discover them, or share them. That's the problem with physical things: they can be controlled, hoarded, kept secret, or destroyed.

Born-digital artifacts, and digitized physical ones, can be much more discoverable than physical ones. They often can be repurposed and remixed much more easily. And they absolutely can be copied and distributed widely in ways that physical artifacts can't.

Sticking to paper isn't the answer. Figuring out better ways to index, preserve and distribute digital artifacts is.


I agree that paper per se isn't the advantage, but the fact that something comes in a relatively unencumbered and self-contained unit does help a lot with digital archiving, and so far that's still more common with paper. A library can build a more or less complete digital archive of a paper newspaper simply by subscribing to it and setting up a routine digitization process. Building a complete archive of an online news site is much harder: you have to figure out how to scrape it in full, and many sites actively try to stop you from doing so. As publications have discovered that selling access to their own archives (or "vault", as some call it) is profitable, they are even less likely to want free third-party archives to be available.

The best case for "born digital" archiving at the moment is probably publications that still produce a full PDF version of each "issue". For example, a library could archive The New Inquiry simply by subscribing and downloading each month's subscribers-only PDF, rather than trying to scrape the website. A publication that put its full content in an RSS feed would also be fairly easy to "subscribe" to and archive. But most don't have an easy way of subscribing to full updates that can be archived and stored/viewed independently of the original site, and many are even kind of hostile to the idea.
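
(Purely as an illustrative sketch, not a description of anything a real library runs: polling such a full-content feed and dumping each entry to disk could look roughly like the Python below, where the feed URL and the 'archive' directory are made up for the example.)

    import os
    import feedparser  # pip install feedparser

    FEED_URL = 'https://example.com/full-content.rss'  # hypothetical full-text feed

    os.makedirs('archive', exist_ok=True)
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # Prefer the full body if the publisher includes one; fall back to the summary.
        if 'content' in entry:
            html = entry.content[0].value
        else:
            html = entry.get('summary', '')
        # Derive a crude filename from the entry id (or link).
        name = entry.get('id', entry.link).replace('/', '_').replace(':', '_')
        with open(os.path.join('archive', name + '.html'), 'w', encoding='utf-8') as f:
            f.write(html)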

Journals are another problem. Paper journal subscriptions go into a library's permanent holdings, while digital journal subscriptions often give them cheaper per-issue prices, but in return it's more like renting access to the journal-controlled archive, which goes away if the library cancels the subscription. Most don't allow a library to mass-download the archive into the library's own digital holdings, though some do have some self-archival arrangements (e.g. a library might be allowed to self-archive issues that came out during their period of subscription, but not the whole back catalogue).


>The best-case for "born digital" archiving at the moment are probably publications that still produce a full PDF version of each "issue".

I'd go so far as to say that publishers who do not do this and provide these to the Library of Congress should receive much less copyright protection from the government. Copyright and preservation should go hand in hand.


> Copyright and preservation should go hand in hand.

How would this apply to out-of-print books?


This is the main reason I adopted Zotero as my citation manager: it saves a local copy of the resources I want to cite, preventing link rot and paywalls from obstructing a return to an old research topic.

If your bookmarking habit is articles rather than web services, your bookmark manager needs a way of saving the bookmarked content.


That helps you, but that doesn't help everyone else.

Consider asking Zotero to do full WARC captures, which you could then donate to the Internet Archive and have them be included in the Wayback Machine, which would benefit everyone.


Hi, the Zotero project is open source and non-profit, currently getting the majority of its funding from the Roy Rosenzweig Center for History and New Media at George Mason University. Following the open source model, it depends on people like you to help out with coding new features. If doing full WARC captures is something that you feel is important, you might want to help out and contribute the necessary code. Zotero's code base is pretty approachable. I've personally submitted fixes to the project. Go here for more information on how to get involved: https://www.zotero.org/getinvolved/


Thinking about this further, this might also be a really useful tie-in for Wikipedia. It also increasingly relies on Web citations, many of which fail.

If cited material were automatically archived and submitted to the Internet Archive, that could be even more useful. It's also worth noting that this would inherently archive information that has been deemed relevant and citable.


They're starting to do this: http://blog.archive.org/2013/10/25/fixing-broken-links/

We have started crawling the outlinks for every new article and update as they are made – about 5 million new URLs are archived every day. Now we have to figure out how to get archived pages back in to Wikipedia to fix some of those dead links. Kunal Mehta, a Wikipedian from San Jose, recently wrote a prototype bot that can add archived versions to any link in Wikipedia so that when those links are determined to be dead the links can be switched over automatically and continue to work. It will take a while to work this through the process the Wikipedia community of editors uses to approve bots, but that conversation is under way.


That's great news.


Oooh! I like this.


I use Stache; it stores screenshots and web archives, perfect for this kind of stuff: http://getstache.com


Hi! You could ask Stache, too, to do full WARC captures, which you could then donate to the Internet Archive and have them be included in the Wayback Machine, which would benefit everyone.


Is Stache Mac-only? And is there a way to change from iCloud? E.g., I would prefer Dropbox.


Thanks, I hadn't heard of Zotero before. So it's like Evernote, except the data is stored locally on your own computer?


It's intended as a reference manager for academics, but in addition to PDFs, you can also save other resources like web pages.

The interface isn't really designed for using it as an Evernote replacement, but for some use cases at least, it would be workable.


Is there an update to the (now ancient) W3C "Cool URIs Don't Change"?

http://www.w3.org/Provider/Style/URI.html


From the article, the clear winner would be "cool URIs enable Wayback Machine archival".


I have around 4000 bookmarks from ~2003 to 2009 when I stopped hoarding them. Afraid to check how many of them will work.


It's a terrible problem. Maciej Ceglowski is doing his own study on link rot: https://blog.pinboard.in/2014/08/researching_link_rot/ He says my numbers match his own so far, about 5% per year. So I'd guess ~2200 bookmarks are dead, and maybe you could get ~1000 of them back from the Wayback Machine.
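
(A back-of-the-envelope sketch of that guess, assuming a flat 5% of the original 4000 lost per year and roughly 11 years since 2003; if the 5% compounds instead, the number comes out lower, so treat it as a rough range rather than a precise figure.)

    bookmarks = 4000
    years = 11                                       # ~2003 to 2014
    dead_flat = bookmarks * 0.05 * years             # ~2200 dead (flat 5% of the original per year)
    dead_compound = bookmarks * (1 - 0.95 ** years)  # ~1700 dead if the 5%/yr compounds
    print(round(dead_flat), round(dead_compound))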

I have 28MB of personal bookmark files I'll be post-processing, as well as ~47,000 links from a shared, private bookmarking service dating from 2005 through 2011. I'm not looking forward to it.


NB: wget's '--spider' option can be used to test link integrity (though it only checks that the link returns 200, not that the content is intact).
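
(If you'd rather script it, here's a minimal sketch in Python using the requests library; 'bookmarks.txt' is just a made-up file with one URL per line. Like --spider, it only checks the response status, not whether the content is still what you bookmarked.)

    import requests  # pip install requests

    def status_of(url, timeout=15):
        try:
            r = requests.head(url, allow_redirects=True, timeout=timeout)
            if r.status_code in (405, 501):  # some servers reject HEAD; retry with GET
                r = requests.get(url, allow_redirects=True, stream=True, timeout=timeout)
            return r.status_code
        except requests.RequestException:
            return None  # DNS failure, timeout, connection refused, etc.

    with open('bookmarks.txt') as f:
        for url in (line.strip() for line in f if line.strip()):
            print(status_of(url) or 'ERR', url)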


Here's my take on a solution to the problem of linkrot: https://www.purplerails.com/

Major points:

* Automatically saves an exact copy of pages (no need to explicitly bookmark) in the background.

* Data is encrypted with your password before being sync'd to cloud.

* Search through your pages.

* Works as a Chrome browser extension. No need for a native app.


Hi! Same thing here as with the other examples mentioned in this thread: this only helps you.

If you save a page but someone else needs it, they're out of luck.

But, if, in addition to making you a private, encrypted archive, they also tested to see if the URL was publicly visible and, if so, made a WARC of it, then they could package up all those WARCs for donation to the Internet Archive, and everyone could benefit.


> If you save a page but someone else needs it, they're out of luck.

There is a sharing feature to solve this problem. :)

But I agree with your point.

I actually looked into WARC earlier but didn't have the bandwidth to do it in my first version. When I implement the ability to download your data, I'll try hard to use WARC. Unless there's some brain damage in the format: I hope not! :)


You have to save the WARC-required stuff on the initial capture, because it's a dump of the client/server conversation as well as the content. But thanks for thinking about it!
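
(For what it's worth, one way to capture that conversation without writing the format by hand is to record the HTTP traffic as you fetch; here's a minimal sketch using the warcio Python library, with the URL and output filename invented for the example.)

    from warcio.capture_http import capture_http
    import requests  # warcio's docs note requests should be imported after capture_http

    # Records the full request/response exchange into a WARC file as the fetch happens.
    with capture_http('example.warc.gz'):
        requests.get('https://example.com/some/page.html')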

Here are some previous comments with links that might be useful:

https://news.ycombinator.com/item?id=6506032

https://news.ycombinator.com/item?id=6671152


In Firefox 3, the default value for the lifetime of entries in the browser history was changed from 9 days to 90 days. In subsequent releases, it was changed to "indefinite, or whatever's reasonable, after applying some heuristics for the machine we're on".

A while back, I imagined bringing page content itself—and not just choice metadata like its URL and title—into the purview of the browser's history backend, too, effectively enabling WYGOIWYGO (What You Got Online Is What You Get Offline).

(I started off, funnily enough, not trying to imagine the next logical step in the "moar history" march, but instead with the Internet Archive in mind. I was trying to think of a way that would give ordinary plebs a zero-effort way to add to the Wayback Machine actual archived content in the same way that Alexa and the Internet Archive were slurping in data from the Alexa Toolbar about what pages are getting hits out there.)

After the stuff that happened last January with Aaron Swartz, I was even motivated to write up some use cases and gave it the codename "Permafrost":

Ashley just wants all the content he bookmarks (or simply accesses) to be always available to him, without being frustrated months or years from now by 404s, service shutdowns, and web spiders stretched too thin, allowing his favored content to slip through the cracks of their archiving efforts.

- < https://wiki.mozilla.org/Permafrost >

Even so, it remains one of those projects that I should really get around to kicking off someday, but may never end up starting, much less get close to "completing".


If I need to rely on your private domain to access my own research, how is it different or less risky than Diigo etc.? I'm waiting for an extension that lets me keep my own full activity data locally (and optionally use your cloud in addition).


Understood. Thx for the comment. The ability to download your data in a well-documented format (possibly WARC) is coming soon. I hope you will try out PurpleRails in the meanwhile. Thx again!


Funny, I used to work for DataRealm (impressed the OP capitalized it right). I wasn't there when they sold the serve.com domain, but it was a big deal for them. Everything internal used serve.com (including email addresses), and switching to Datarealm.com required a lot of changes. As a small business it's hard to say "cool URLs don't change" when you're just trying to get by.


> Every URL saved in more than one place increases the likelihood that their content will survive as domains change owners.

Surely it's obvious that, for backup purposes, the more copies the better. Then why is he advocating cloud services, including the Wayback Machine? There's no doubt it's important for everyone to save web pages locally, to create as many copies as possible and prevent the disaster of the Library of Alexandria from happening again.


Because it does someone no good if you have a copy of an old page they're looking for, and they have no way to find you, and you have no way of providing it to them. Centralized services like libraries and archives provide that.

You absolutely should save everything yourself. But you should also give copies to as many centralized services as possible. Lots of copies keeps stuff safe.


"Centralized" was not the word you were looking for. Decentralized systems can serve the searching functionality as proved by many (pure) P2P file-sharing protocols.


I hadn't realized I could write notes on Pinboard! I've written long descriptions for my bookmarks on occasion, though.


I didn't either. What a nice feature.

(BTW: Pinboard is one of my favorite one-person companies, the other being Tarsnap. I love that both seem to be run by people interested in building something and are not just looking for an exit strategy.)


Notes was on the to-do list very early on for Delicious, but it was never allowed to proceed. Glad to see it working.


Bookmarks? I just print-to-PDF. Offline, permanent, safe.

Bookmarks have never been useful.


A print of a page is not the same as a bookmark. Rather than a link to a point in time (which is what your print out is), a bookmark is a pointer to the latest version of a page. That has many benefits if the page is actively updated. The cost of that advantage is that it can fail completely if the page is taken down.

Ideally a browser should cache a page when you bookmark it and optionally refresh the cache when you revisit. That way, if you get an error you could look back at the cached version instead.

In fact, that'd be a damn useful browser plugin.


But that only helps you. That doesn't help anyone else who needs that old reference.

Saving things in a way that those backups can be shared and distributed, like WARC files donated to the Internet Archive, can help ensure that when your hard disk crashes and your backups fail and you lose those PDFs, the things you wanted to save will still be out there.


"... a VRML tutorial is now a video about birth control."

VRML tutorials are all fundamentally about birth control.



