

Notes on bookmarks from 1997 - spindritf
https://notes.pinboard.in/u:vitorio/05dec9f04909d9b6edff

======
slackpad
I worry about this a lot. There's archive.org doing an amazing job but it
seems like so much of our culture is getting lost because it's online. My box
of Popular Science magazines from the 1950s is amazing - even the ads are
interesting. The difficulty in archiving this stuff is worrying, as is the
discovery experience for someone 60 or 70 years down the road. I can pick up a
crusty magazine and flip through it, cutting across a whole bunch of topics
and their surrounding context. There are lots of silos trying to save bits of
it, but if this guy put a bunch of papers in a box in 1997 and it didn't get
eaten by mold, then _all_ of it would have been around to browse today.

~~~
vitovito
Absolutely. We know how to archive paper pretty well. Your box of magazines
will last another 20 years in your garage. Maybe twice that in a climate-
controlled vault.

But they're still made of cheap paper, and will eventually decompose, or your
garage will flood, or bugs will eat them, or mold will be particularly bad one
year. And, like anything physical, unless someone makes a pilgrimage to your
garage (or that vault), no-one else will ever (re)discover them, or share
them. That's the problem with physical things: they can be controlled,
hoarded, kept secret, or destroyed.

Born-digital artifacts, and digitized physical ones, can be _much_ more
discoverable than physical ones. They often can be repurposed and remixed much
more easily. And they absolutely can be copied and distributed widely in ways
that physical artifacts can't.

Sticking to paper isn't the answer. Figuring out better ways to index,
preserve and distribute digital artifacts is.

~~~
_delirium
I agree that paper _per se_ isn't the advantage, but the fact that something
comes in a relatively unencumbered and self-contained unit does help a lot
with digital archiving. That is so far still more common with paper on
average. A library can build a more or less complete digital archive of a
paper newspaper by simply subscribing to it and setting up a routine
digitization process. Building a complete archive of an online news site is
much harder: you have to figure out how to scrape it in full, and many sites
actively try to stop you from doing so. As publications have discovered that
selling access to their own archives (or "vault" as some call it) is
profitable, they are even less likely to want free third-party archives to be
available.

The best cases for "born digital" archiving at the moment are probably
publications that still produce a full PDF version of each "issue". For
example, a library could archive The New Inquiry by simply subscribing and
downloading each month's subscribers-only PDF, rather than trying to scrape
the website. A publication that put the full content in an RSS feed would also
be fairly easy to "subscribe" to and archive. But most don't have an easy way
of subscribing to full updates that can be archived and stored/viewed
independently of the original site, and many are even kind of hostile to the
idea.
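
The RSS case can be sketched with nothing but the standard library. This is an
illustrative sketch only: the feed contents, URLs, and the `archive_feed` name
are all invented, and a real archiver would fetch the publication's feed URL on
a schedule rather than use an inline string.

```python
import xml.etree.ElementTree as ET

# An invented full-content feed, standing in for a real publication's RSS.
FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example Journal</title>
<item><title>Issue 1</title><link>http://example.com/1</link><description>Full text of issue 1.</description></item>
<item><title>Issue 2</title><link>http://example.com/2</link><description>Full text of issue 2.</description></item>
</channel></rss>"""

def archive_feed(xml_text):
    """Extract (title, link, body) per item -- everything needed to
    store the content independently of the original site."""
    root = ET.fromstring(xml_text)
    return [(i.findtext("title"), i.findtext("link"), i.findtext("description"))
            for i in root.iter("item")]

entries = archive_feed(FEED)
```

If the feed carries the full article body (not just a teaser), each tuple is a
self-contained unit that can be stored and viewed without the original site.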

Journals are another problem. Paper journal subscriptions go into a library's
permanent holdings, while digital journal subscriptions often give them
cheaper per-issue prices, but in return it's more like renting access to the
journal-controlled archive, which goes away if the library cancels the
subscription. Most don't allow a library to mass-download the archive into the
library's own digital holdings, though some do have some self-archival
arrangements (e.g. a library might be allowed to self-archive issues that came
out during their period of subscription, but not the whole back catalogue).

~~~
noonespecial
> The best cases for "born digital" archiving at the moment are probably
> publications that still produce a full PDF version of each "issue".

I'd go so far as to say that publishers who do not do this and provide these
to the Library of Congress should receive much less copyright protection from
the government. Copyright and preservation should go hand in hand.

~~~
walterbell
> Copyright and preservation should go hand in hand.

How would this apply to out-of-print books?

------
wodenokoto
This is the main reason I adopted Zotero as my citation manager - it saves a
local copy of the resources I want to cite, preventing link rot and paywalls
from blocking my return to an old research topic.

If your bookmarking habit leans toward articles rather than web services, your
bookmark manager needs a way of saving the bookmarked content.

~~~
vitovito
That helps you, but that doesn't help everyone else.

Consider asking Zotero to do full WARC captures, which you could then donate
to the Internet Archive and have them be included in the Wayback Machine,
which would benefit everyone.

~~~
dredmorbius
Thinking about this further, it might also be a really useful tie-in for
Wikipedia, which increasingly relies on Web citations, many of which fail.

If cited material were automatically archived and submitted to the Internet
Archive, this would be even more useful. It's also worth noting that this
inherently archives information that has been deemed relevant and citable.

~~~
vitovito
They're starting to do this: [http://blog.archive.org/2013/10/25/fixing-
broken-links/](http://blog.archive.org/2013/10/25/fixing-broken-links/)

 _We have started crawling the outlinks for every new article and update as
they are made – about 5 million new URLs are archived every day. Now we have
to figure out how to get archived pages back into Wikipedia to fix some of
those dead links. Kunal Mehta, a Wikipedian from San Jose, recently wrote a
prototype bot that can add archived versions to any link in Wikipedia so that
when those links are determined to be dead the links can be switched over
automatically and continue to work. It will take a while to work this through
the process the Wikipedia community of editors uses to approve bots, but that
conversation is under way._

~~~
dredmorbius
That's great news.

------
DanBC
Is there an update to the (now ancient) W3C "Cool URIs Don't Change"?

[http://www.w3.org/Provider/Style/URI.html](http://www.w3.org/Provider/Style/URI.html)

~~~
dredmorbius
From the article, the clear winner would be "cool URIs enable Wayback Machine
archival".

------
purplerails
Here's my take on a solution to the problem of linkrot:
[https://www.purplerails.com/](https://www.purplerails.com/)

Major points:

* Automatically saves an exact copy of pages (no need to explicitly bookmark) in the background.

* Data is encrypted with your password before being synced to the cloud.

* Search through your pages.

* Works as a Chrome browser extension. No need for a native app.

~~~
vitovito
Hi! Same thing here as with the other examples mentioned in this thread: this
only helps you.

If you save a page but someone else needs it, they're out of luck.

But, if, in addition to making you a private, encrypted archive, they also
tested to see if the URL was publicly visible and, if so, made a WARC of it,
then they could package up all those WARCs for donation to the Internet
Archive, and everyone could benefit.

~~~
purplerails
> If you save a page but someone else needs it, they're out of luck.

There is a sharing feature to solve this problem. :)

But I agree with your point.

I actually looked into WARC earlier but didn't have the bandwidth to do it in
my first version. When I implement the ability to download your data, I'll try
hard to use WARC. Unless there's some brain damage in the format: I hope not!
:)

~~~
vitovito
You have to save the WARC-required stuff on the initial capture, because a
WARC is a dump of the client/server conversation as well as the content. But
thanks for thinking about it!
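
To make that concrete, a minimal WARC/1.0 "response" record can be assembled
by hand. This is a stdlib-only sketch of the record layout (the function name
and the `http_raw` payload are invented; real tools like the Internet
Archive's crawlers handle this): note that what gets wrapped is the raw HTTP
exchange, not just the rendered page.

```python
from datetime import datetime, timezone
import uuid

def warc_response_record(url, http_raw):
    """Build one minimal WARC/1.0 'response' record.

    http_raw is the verbatim HTTP response (status line, headers,
    body) as received on the wire -- a WARC stores the conversation,
    not just the page content.
    """
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", url),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_raw))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in headers)
    return head.encode("utf-8") + b"\r\n" + http_raw + b"\r\n\r\n"

raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = warc_response_record("http://example.com/", raw)
```

This is why a capture has to happen at request time: once a tool has thrown
away the status line and headers and kept only the page, the record above can
no longer be reconstructed.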

Here are some previous comments with links that might be useful:

[https://news.ycombinator.com/item?id=6506032](https://news.ycombinator.com/item?id=6506032)

[https://news.ycombinator.com/item?id=6671152](https://news.ycombinator.com/item?id=6671152)

------
rospaya
I have around 4000 bookmarks from ~2003 to 2009, when I stopped hoarding
them. I'm afraid to check how many of them still work.

~~~
vitovito
It's a terrible problem. Maciej Ceglowski is doing his own study on link rot:
[https://blog.pinboard.in/2014/08/researching_link_rot/](https://blog.pinboard.in/2014/08/researching_link_rot/)
He says my numbers match his own so far, about 5% per year. So I'd guess ~2200
bookmarks are dead, and maybe you could get ~1000 of them back from the
Wayback Machine.
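
The arithmetic behind that guess, as a back-of-the-envelope sketch (the
function names are mine, and the ~5%/year figure is the one quoted above; a
linear model loses 5% of the original set per year, a compounding one loses 5%
of the survivors):

```python
# Back-of-the-envelope link-rot estimates at ~5%/year.
def dead_linear(n, years, rate=0.05):
    """Dead links if 5% of the *original* set is lost each year."""
    return round(n * min(1.0, rate * years))

def dead_compound(n, years, rate=0.05):
    """Dead links if 5% of the *survivors* is lost each year."""
    return round(n * (1 - (1 - rate) ** years))

# 4000 bookmarks saved 2003-2009, checked in 2014 (so 5-11 years old):
oldest = dead_linear(4000, 11)    # linear loss on the oldest cohort
typical = dead_compound(4000, 8)  # compounding loss at the average age
```

Linear loss over the oldest cohort gives 2200 dead, matching the ~2200 guess;
compounding at the average age gives about 1350, so the true figure likely
falls somewhere in between.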

I have 28MB of personal bookmark files I'll be post-processing, as well as
~47,000 links from a shared, private bookmarking service dating from 2005
through 2011. I'm not looking forward to it.

------
chimeracoder
> First, donate to the Internet Archive:
> [http://archive.org/donate/](http://archive.org/donate/)

This is a good step, but there's one massive problem with the Internet
Archive: If the domain registration lapses (or is sold), the new owner can
direct the Internet Archive to _remove_ archived copies of the site previously
hosted on that domain[0].

In other words, _even if the archives are there today_, they may not be there
when you want them a year (or more) from now. I've been bitten by this in the
past.

This means that the Internet Archive wouldn't have been able to prevent or
mitigate some of the cases of link rot described here, as the content could
have been scrubbed from the archive.

_EDIT:_ In case it wasn't clear, I still support the Internet Archive - I just
want to make sure people know this issue exists and that it's not a "silver
bullet" for this problem.

[0] I'm fairly certain this is still the case; if they have changed their
policy recently I would be pleasantly surprised.

~~~
idlewords
It would shock me if there's a working copy of 'rm' allowed anywhere near the
Internet Archive. They take this stuff down for compliance, but my dream is
that the data lives on, waiting for a saner day when the legal climate for
archivists gets a little warmer.

~~~
ghaff
It's tough though--supportive as I am of the Internet Archive's goals. How is
an "archivist" different from a random individual who scrapes stuff off the
Internet and rehosts it? In the aggregate, the Internet Archive looks
different from the typical person who is copying articles and blog posts,
wrapping them up in ads, and displaying them. The IA is non-profit, doesn't
run ads, etc. It also respects robots.txt. But it's not that clear to me what
the legal regime would be that allows the Internet Archive to function free
and clear and doesn't hit cases that most would agree are shady.

~~~
vitovito
The only difference between science and screwing around is writing it down.

You can be trained as an archivist. You can get a Masters and a PhD in
archival practice. There are industry-standard procedures and codes of ethics.
There's a very specific understanding of what is important to save, how to
save it, and how to document its context and its provenance.

That's why the Internet Archive requires a certain fidelity of capture
(WARCs) that a screenshot service or a citation tool doesn't provide.

That's also why they are legally a _library_. Libraries have particular
copyright exemptions for preservation. A typical person doesn't. But you
generally have the right to make backups for your own use, and so you can also
donate those backups.

It's like if you were a famous person, and you bought a newspaper and a book,
and when you died your personal effects were donated to your alma mater who
put on a big exhibit of your life and times, that newspaper (your backup of
the original that lives in the hard drives of the publisher) and that book
(your backup of the original that some author wrote) are there, too. No-one's
conferring any rights to the content; the publisher still owns the newspaper,
and the author still owns the book, but that was your copy that is now
available for everyone to see.

~~~
ghaff
>That's also why they are legally a library.

Citation? I'm not aware of the IA having any special status.

A physical newspaper or book isn't a backup. It's a physical artifact that
falls under the first-sale doctrine. The same doesn't apply to digital copies.

------
bigethan
Funny, I used to work for DataRealm (impressed the OP capitalized it right).
I wasn't there when they sold the serve.com domain, but it was a big deal for
them. Everything internal used serve.com (including email addresses), and
switching to Datarealm.com meant a lot of changes. As a small business it's hard
to say "cool URLs don't change" when you're just trying to get by.

------
sugarfactory
> Every URL saved in more than one place increases the likelihood that their
> content will survive as domains change owners.

Surely it's obvious that more copies are better for backup purposes. Then why
is he advocating cloud services, including the Wayback Machine? There's no
doubt that it's important for everyone to save web pages locally, creating as
many copies on earth as possible, to prevent a disaster like the Library of
Alexandria from happening again.

~~~
vitovito
Because it does someone no good if you have a copy of an old page they're
looking for, and they have no way to find you, and you have no way of
providing it to them. Centralized services like libraries and archives provide
that.

You absolutely should save everything yourself. But you should also give
copies to as many centralized services as possible. Lots of copies keeps stuff
safe.

~~~
sugarfactory
"Centralized" was not the word you were looking for. Decentralized systems
can provide search functionality too, as proven by many (pure) P2P
file-sharing protocols.

------
pronoiac
I hadn't realized I could write notes on Pinboard! I've written long
descriptions for my bookmarks on occasion, though.

~~~
RexRollman
I didn't either. What a nice feature.

(BTW: Pinboard is one of my favorite one-person companies, the other being
Tarsnap. I love that they both seem to be run by people interested in building
something, not just looking for an exit strategy.)

------
ibisum
Bookmarks? I just print-to-PDF. Offline, permanent, safe.

Bookmarks have never been useful.

~~~
onion2k
A print of a page is not the same as a bookmark. Rather than a link to a
point in time (which is what your printout is), a bookmark is a pointer to the
latest version of a page. That has many benefits if the page is actively
updated. The cost of that advantage is that it can fail completely if the page
is taken down.

Ideally a browser should cache a page when you bookmark it and optionally
refresh the cache when you revisit. That way, if you get an error you could
look back at the cached version instead.

In fact, that'd be a damn useful browser plugin.
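
The plugin's logic fits in a few lines. Everything here is hypothetical (the
class name, the injected `fetch` function, the in-memory "site" used as a
stand-in for the network), just to show the snapshot-then-fallback flow:

```python
class CachingBookmarks:
    """Bookmark manager that snapshots pages and falls back to the
    snapshot when the live page is gone."""

    def __init__(self, fetch):
        self.fetch = fetch   # url -> page bytes; raises if unreachable
        self.cache = {}      # url -> last good snapshot

    def bookmark(self, url):
        """Save the bookmark and snapshot the page immediately."""
        self.cache[url] = self.fetch(url)

    def visit(self, url):
        """Return the live page and refresh the cache; if the live
        fetch fails, serve the stored snapshot instead."""
        try:
            page = self.fetch(url)
            self.cache[url] = page
            return page
        except Exception:
            return self.cache[url]

# Demo: the page is updated, then the site disappears entirely.
pages = {"http://example.com/": b"v1"}

def fetch(url):
    if url not in pages:
        raise IOError("site is down")
    return pages[url]

bm = CachingBookmarks(fetch)
bm.bookmark("http://example.com/")       # snapshot v1
pages["http://example.com/"] = b"v2"
live = bm.visit("http://example.com/")   # live page, cache refreshed
del pages["http://example.com/"]         # site goes away
fallback = bm.visit("http://example.com/")
```

The key design choice is refreshing the cache on every successful visit, so
the fallback is the latest version you actually saw, not a stale first capture.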

------
thrownaway2424
"... a VRML tutorial is now a video about birth control."

VRML tutorials are all fundamentally about birth control.

