
Internet Data Is Rotting - walterbell
http://theconversation.com/your-internet-data-is-rotting-115891
======
gambler
This needs to be solved on the protocol level. Of course, the players who have
control over our protocols are exactly the people who don't want this to be
solved at all.

The next best thing would be to redefine what "bookmarking" is. When I
bookmark a page, I want it to be permanently stored on my local machine and
full-text indexed. In fact, it's rather ridiculous that after 25 years
browsers don't have anything of this sort. Unfortunately, the most popular
browser in the world is controlled by the same people who control our
protocols.

If I ever get the energy, I will attempt to write a browser extension for
this.

~~~
fooker
wget -r ?

~~~
gambler
No. There are some cases where it is useful to download many pages in a batch,
but what I am talking about is, effectively, partial local replication. The
bookmarking I describe should create a tiny (but useful) version of the web on
your computer.

It needs to be seamless. It needs to be searchable. It would be incredibly
useful if it captured the relationships between pages (links) in addition to
the pages themselves, so you could navigate offline.

I can see many use cases for a kind of "super bookmark mode" where the browser
automatically stores all the pages you visit within a certain domain.
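
Roughly, the capture-and-index half could look like this; wget, lynx, and
sqlite3 here are just stand-ins for whatever a browser would do natively, and
the ~/.bookmarks layout is illustrative:

    # save one bookmark locally and make it full-text searchable
    url="https://example.com/article"
    dir="$HOME/.bookmarks"
    mkdir -p "$dir"

    # fetch the page plus the assets it needs to render offline
    wget --page-requisites --convert-links --adjust-extension \
         --directory-prefix="$dir" "$url"

    # index the page text so bookmarks become full-text searchable
    sqlite3 "$dir/index.db" \
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body);"
    body="$(lynx -dump -nolist "$url" | sed "s/'/''/g")"
    sqlite3 "$dir/index.db" "INSERT INTO pages VALUES ('$url', '$body');"

    # search later with:
    #   sqlite3 ~/.bookmarks/index.db "SELECT url FROM pages WHERE pages MATCH 'terms';"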

~~~
asdff
I would loooove this. Open one of your bookmarks and you can't tell it's the
local copy unless you look at the local path sitting in the URL bar. Maybe
make an option to fetch a live copy for sites like the front page of HN that
are in constant turnover, but it would be great. Storage has never been
cheaper, let me hoard!

You can even take it a step further and create a viewable timeline of all the
pages you've visited, for when you have something on the tip of your tongue
but need to retrace your steps and logic to get there. Browsing history is
kinda lackluster for this: my last ten entries on Firefox are three Hacker
News articles with "Add Comment" interspersed half a dozen times in that
list. If I had something along the lines of Zotero's timeline for papers, I
could probably find stuff way easier.

------
dlkf
> As of last fall, its Wayback Machine held over 450 billion pages in 25
> petabytes of data. This would represent .0003% of the total internet.

> Universities, governments and scientific societies are struggling to
> preserve scientific data in a hodgepodge of archives, such as the U.K.’s
> Digital Preservation Coalition, MetaArchive, or the now-disbanded
> collaborative Digital Preservation Network.

Like any conservation work, the benefits are incredibly easy to ignore - until
something goes awry / stops getting funding and suddenly it's too late.
Consequently it's easy to have a myopic view of the issue.

These organizations are doing very important work, and I hope that internet
users and governments don't take them for granted.

------
mnl
We should aim at browsing the Internet _by date_. We're moving everything
there while ignoring the fact that, as it stands, there is no built-in
permanence. After Gutenberg we grew accustomed to permanence: it wasn't that
easy to lose every copy of an important document. Now it is; things disappear,
and we're drifting into a cultural bubble that's impossible to trace back.

The Internet Archive is doing God's work, but it's not enough. If you don't
have the URL of a site that is gone, you probably won't find any reference to
it after every online hyperlink to it has disappeared as well. It might then
become inaccessible after a while: stored, yet gone anyway.

~~~
dredmorbius
"Browsing by date" is pretty much Brewster Kahle's ideal, and is what the
Internet Archive's Wayback Machine approximates, thanks in large part to WARC
storage.

I'd also like to see a distinction between the idea of web servers, which are
really _publishers_ , and where _archives_ are kept. Ideally _not_ all in one
single store, a/k/a the Internet Archive, but replicated fairly widely.

------
zshbleaker
Things have gotten far worse in China recently.

Baidu Tieba, which could be considered the Reddit of China, just made all
posts from before 2017-01-01 inaccessible. And a number of other online forums
are doing the same thing for political reasons.

~~~
cbluth
would you elaborate on the political reasons?

~~~
sandworm101
If you want to rewrite history, it doesn't help to have lots of old copies
lying around. Any data not under government control is a threat.

~~~
swgdo
Isn't Baidu under government control anyway?

~~~
moate
Yes, but the effort to go through and scrub that much data to ensure only the
ideas you want out are getting out would be massive. Seems easier to just shut
the door and board up the room than to try and clean it.

------
mwest
Back in the mists of time, I used to use the wwwoffle proxy. It was great for
slow, intermittent links, but also had the benefit of keeping an offline
archive of whatever you'd browsed.

Project's still there, although not sure how well it does with the modern web.

[http://www.gedanken.org.uk/software/wwwoffle/](http://www.gedanken.org.uk/software/wwwoffle/)

There are a bunch of more modern variations too:

[https://archivebox.io/](https://archivebox.io/) - "Your own personal
internet archive"

[https://getpolarized.io/](https://getpolarized.io/) (as seen on HN
previously)

[https://github.com/kanishka-linux/reminiscence](https://github.com/kanishka-linux/reminiscence)

[https://github.com/fake-name/ReadableWebProxy](https://github.com/fake-name/ReadableWebProxy)

~~~
dredmorbius
Sadly, a lot of old-school proxies (squid, privoxy) are stymied by SSL/TLS
connections.

I think we're due for the idea that a proxy can be designated as a trusted
intermediary, most especially if it's run on a personal basis. I'm sure this
presents security issues, but it also avoids some.

~~~
thaumasiotes
> I think we're due for the idea that a proxy can be designated as a trusted
> intermediary, most especially if it's run on a personal basis.

We have that idea now; you designate the proxy as a trusted intermediary by
accepting its certificate. The chain looks something like this:

    
    
        You: browser, take me to https://youtube.com
        Browser: proxy, get me https://youtube.com
        Proxy: YouTube, get me /
        YouTube: I'm youtube.com -- here's a certificate signed
                 by the government of Egypt that proves it. And
                 here are the contents of /
        Proxy (to browser): I'm youtube.com -- here's a
                            self-signed certificate attesting to
                            that. And here are the contents of /
        Browser (to user): SECURITY ALERT! SECURITY ALERT!
    

Configure your browser to accept that certificate, and your proxy can handle
its own connection to youtube and just pretend, to your browser, that it is
youtube.
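
If you want to try this today, mitmproxy is one concrete example (my
suggestion, not the only option): it generates its own CA on first run and
serves the certificate for you to install:

    pip install mitmproxy
    mitmproxy --listen-port 8080
    # point the browser's proxy settings at localhost:8080, then visit
    # http://mitm.it (only reachable through the proxy) to download and
    # install mitmproxy's CA certificate -- no more SECURITY ALERT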

~~~
dredmorbius
Does Chrome's (and other browsers') MITM-attack detection prevent this? That's
my understanding.

[https://comodosslstore.com/blog/google-chrome-63-will-warn-y...](https://comodosslstore.com/blog/google-chrome-63-will-warn-you-of-man-in-the-middle-attacks.html)

~~~
thaumasiotes
I'm answering based mostly on having read that link. It looks like the
protection applies only in the case where an error is being surfaced. The
problem Chrome wants to address is that users will click past the SECURITY
ALERT.

If you properly configure your own CA, then the TLS error triggering this
behavior won't occur, and there is no security problem for Chrome to put its
foot down on -- your proxy is providing a _valid_ certificate for whatever
domain, as far as Chrome is concerned, not an invalid one.

Compare
[https://support.portswigger.net/customer/portal/articles/178...](https://support.portswigger.net/customer/portal/articles/1783085-installing-burp-s-ca-certificate-in-chrome).

> The Chrome browser picks up the certificate trust store from your host
> computer. By installing Burp's CA certificate in your computer’s built-in
> browser (e.g. Internet Explorer on Windows, or Safari on OS X), Chrome will
> automatically make use of the certificate.

> When the Burp CA certificate has been installed for your built-in browser,
> restart Chrome and you should be able to visit any HTTPS URL via Burp
> without any security warnings.
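
For a personal proxy, minting your own CA is a couple of openssl invocations
(names and lifetime below are illustrative; the proxy itself must support
signing leaf certificates with a custom CA):

    # generate a self-signed CA certificate and private key
    openssl req -x509 -new -nodes -newkey rsa:2048 \
        -keyout proxy-ca.key -out proxy-ca.crt \
        -days 3650 -subj "/CN=My Personal Proxy CA"
    # import proxy-ca.crt into the OS trust store; as quoted above,
    # Chrome picks its trusted roots up from the host computer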

~~~
dredmorbius
Thanks. This is something I've got some plans on.

------
patrick5415
Yeah, it’s annoying that links get broken. But maybe it’s better this way.
There’s something about modern tech that has turned all of us into digital
hoarders. I (we?) have backups and backups of backups and redundant RAID
servers with every version of every file, so that no byte shall ever perish. I
still have essays that I wrote in high school nearly 20 years ago. To what
end? I’m partial to material minimalism. Why not data minimalism?

~~~
Hextinium
My common response is that while there are real-world costs to having too much
physical stuff, data is pretty cheap to keep around. You can fit a RAID 5 box
with 30 TB of space in a shoebox for $15/month, which is enough to keep pretty
much any content you ever consume, so if you ever want something again it
still exists. My parents and grandparents hoarded files and documents to no
end, and as I've been going through some of them there is mostly garbage, but
there are some real gems in the rough. Willingly disposing of the internet
through our own negligence is something I don't advocate, because there is
real value in what we save for the next generation.

~~~
quanticle
There are real-world costs to information hoarding as well, when it's done by
people who are not you and whose incentives are not aligned with your own.

I'm glad that data is rotting on the Internet. In fact, I'll go one step
farther and say that data should rot more quickly on the Internet. There's no
reason that some scummy marketing algorithm should have access to my high-
school social media posts. There's no reason that governments or private
individuals with a grudge ought to be able to go through the history of
everything I've written, no matter how off-the-cuff, in order to dredge up
something that makes me look like an undesirable when it's taken out of
context.

If an individual wants to save a particular piece of information, that should
be their choice. Otherwise, by default, information ought to disappear from
the Internet.

~~~
Hextinium
I think that for what you are talking about, bit rot is good. People should
have their stuff deleted after a while, but there is a sense of permanence in
internet culture, and that should change to one of impermanence. More bit rot
may actually bring that about, prompting more people to back up their own
stuff instead of leaving it to these companies.

------
hashkb
> Then there is also a problem of software preservation: How can people today
> or in the future interpret those WordPerfect or WordStar files from the
> 1980s, when the original software companies have stopped supporting them or
> gone out of business?

This issue in particular we have great solutions for (open formats / text),
but they are of course less profitable than only-my-app-can-read-this formats.

~~~
Crinus
FWIW those particular formats are widely understood even if they are
proprietary (well, at least in WordStar's case). And as long as the software
runs (be it natively or via an emulator or VM), you can always open and
convert/print the files (e.g. you could use vDOS to run WordStar or whatever
and use its printer emulator functionality with Windows' PDF printer to create
a PDF from the WordStar files).

------
Causality1
I read somewhere that the lifespan of the average hyperlink is only about two
years.

I count myself lucky I was introduced to the HTTrack archiver program many
years ago, and thus have complete offline copies of many of my favorite
websites of the early '00s.

~~~
adossi
Can you give some examples of these 'favorite websites'? I'm interested in
knowing what kind of website would be so interesting that I would want an
entire offline copy of it. (Besides maybe Wikipedia)

~~~
Causality1
Mostly defunct webcomics but also some of the small personal sites that
documented and collected resources for particular events or strange people. A
lot of those arose out of the SomethingAwful forums. For example, there was a
guy named Brian who wrote batshit insane fanfiction about himself. One of
these sites archived the fiction, interviews with Brian, videos, recordings of
collaborative-reading Skype parties, etc., all neatly on one site and now
safely tucked away on my drives. Now I have a little piece of nostalgia from
2004 I can step back into.

~~~
rchaud
Are you able to navigate through the sites using the original links? I notice
that on the Wayback Machine, internal site links only work if that particular
page was also archived.

~~~
Causality1
Yes. It also lets you designate a particular number of "steps" outside the
original site it will also archive. So I give it a site and a 1-step limit,
I'll get the site and also any individual other webpage linked to somewhere on
the original site. It doesn't do so well with modern sites that full of CDN-
hosted content and pages that depend on data from two dozen different domains
to function properly but it's great for old pre-web2.0 stuff.
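
If you're scripting it rather than using the GUI, I believe the equivalent is
the external-depth option (worth double-checking against `httrack --help`):

    # mirror a site, following links at most 1 step outside the original domain
    httrack "https://example.com/" -O ./example-mirror --ext-depth=1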

------
dev_dull
I’m okay with internet rot and you should be too. I’m not sure where we got
the idea that “our data must be preserved forever”. This can be especially
harmful for teens and young adults whose indiscretions now follow them
forever.

Think of the privilege you had when you were younger. You could do something
stupid and nobody could whip out a high def camera to record it and make it
part of your history forever.

Let it rot.

~~~
oceanplexian
I'm _not_ OK with it, because otherwise you are whitewashing history.

For example, I have recordings of the Colbert Report going back to ~2005. Some
of his skits released during that time would be classified as "hate speech" in
2019. Of course, he and the mainstream broadcasting companies would love it if
you didn't think about that. There are plenty of news clips and interviews
where mainstream politicians (on the Left AND the Right) casually dismiss gay
marriage. Powerful tech influencers like Mark Zuckerberg would love it if
their IMs disappeared from the Internet. The examples go on and on.

~~~
everdrive
I think if the last 10 years have taught us anything, it's that preserving the
past does nothing to impede the changing moral zeitgeist. More records simply
mean more people to attack for holding an opinion that has simply gone out of
fashion. If the past decade wasn't characterized by tribalism and moral
hysteria I'd be more inclined to worry about stringent preservation.

At this point, I'm not really comfortable with what we're preserving.

~~~
asdff
It's simple. Don't say ignorant things on the internet, especially when it's
tied into your real life identity. Lots of people struggle with this simple
rule, because they have a compulsion to share their own myopic opinion
(usually the same viral opinion as everyone else in their echo chamber) and
contribute to the noise. If I was a public figure I wouldn't be on any social
media platform at all, at best it wastes your already limited and highly
valuable time.

That being said, there's a lot of valuable information on the internet that is
absolutely worth preserving. Scrape off the layer of social media and the
internet is still a place of learning and problem solving. I do my own repairs
on my car and the amount of times I've found a well written photo essay on a
particular fix in a random honda forum, only to find the linked pictures
broken, is astounding and a shame to say the least.

~~~
everdrive
Who gets to decide what's ignorant? What if I'm being sensible about what's
ignorant, but I can't predict how society will feel in the future?

------
dredmorbius
Among the handy tools for saving and accessing present, at-risk, and/or rotted
data are bookmarklets.

I've recently added two to my browser, "Open in Wayback Machine" and "Save in
Wayback Machine". Respectively:

    
    
        javascript:void(window.open('https://web.archive.org/web/*/'+location.href));
    
        javascript:void(window.open('https://web.archive.org/save/'+location.href));
    

This makes opportunistic archival and reference easy. There are also Wayback
Machine / Internet Archive browser extensions.

(These are from the Internet Archive, not my work.)

For bulk archival, lists of URLs can be submitted to the IA's save address:

    
    
        https://web.archive.org/save/<URL>
    

(Used in the bookmarklet above as well.)

This can be automated with a simple shell script using any console or script-
based HTTP agent, such as curl, wget, lynx, etc.
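
For example, a minimal loop over a file of URLs (the file name and pacing are
my own choices, not anything the IA prescribes):

    # submit each URL in urls.txt to the Wayback Machine's save endpoint
    while IFS= read -r url; do
        curl -s -o /dev/null "https://web.archive.org/save/$url"
        sleep 5   # pacing out of politeness, not a documented rate limit
    done < urls.txt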

------
sun_n_surf
I actually have a different problem -- not sure it is one that I can legally
solve.

I have 10 years of lovingly curated YouTube video playlists which, now when I
look into the older ones, are a barren wasteland of "Video removed" or "Video
not available". It is heartbreaking. Is there any way I can prevent this from
happening?

~~~
boardwaalk
I’d download and store the videos locally with youtube-dl.

~~~
AlbertoGP
I concur; youtube-dl is what I’ve been using for the last few years: whenever
I find a YT video I might want to watch again, I now immediately download it.
I learned the need for that the hard way.

Check out its options here: [https://github.com/ytdl-org/youtube-dl/blob/master/README.md...](https://github.com/ytdl-org/youtube-dl/blob/master/README.md#options)

With --add-metadata you can embed the YT video description in the video file.
The downloaded video file name will contain the YT identifier so you can still
match them back if needed.

There is another option (--write-info-json) to save the metadata to a separate
JSON file if you prefer that.

To download your playlists, give it each playlist’s URL instead of a video
URL:

    
    
        youtube-dl --add-metadata --ignore-errors 'https://www.youtube.com/watch?v=8GW6sLrK40k&list=RDQMc4l8l2aQrNo'
    

That example URL includes a specific video from the list, but will download
all of them. It works just the same if you only give it the `list` parameter,
but all links to playlists I’ve seen point to one of their videos.

The option --ignore-errors will jump over the unavailable videos instead of
stopping.

Edit to add: If you want to download your playlists into separate directories,
with each video's file name including its original index in the playlist, see
these examples in youtube-dl’s documentation: [https://github.com/ytdl-org/youtube-dl/blob/master/README.md...](https://github.com/ytdl-org/youtube-dl/blob/master/README.md#output-template-examples)
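
For instance, something along these lines should give you one directory per
playlist with index-prefixed file names (the template fields are from that
README; the playlist URL is a placeholder):

    youtube-dl --add-metadata --ignore-errors \
        -o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' \
        'https://www.youtube.com/playlist?list=<YOUR_PLAYLIST_ID>'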

------
darepublic
I think it's good that data is lost. Only items where someone gives enough of
a duck to save them should be preserved. It's not as if physical paper
content, which ends up recycled or in a landfill 99.999% of the time, is any
different. It's true that digital formats change, but fighting that is the
cost of preservation. A museum of software needs to also preserve the context
in which software was run in order to save it from the mists of time, albeit
temporarily.

~~~
jplayer01
I feel like you've never had to look for information on how to do something
where the only decent source is entirely gone, or consists mostly of pictures
which are also mostly gone. Maybe somewhere at some point in time somebody
saved it, but that copy gets lost and never makes its way online again so you
can find it. A lot of information is being lost this way, and I'm not sure why
we should be fine with that.

------
DFXLuna
This seems like a good time to plug running a storage server on your local
network. You can pick up old workstations off eBay for $100. Stick a couple of
drives in one, load it up with data to preserve, and then put encrypted
backups in the cloud. Backblaze B2 is something like $0.001 per gigabyte.
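
For the encrypted-backup half, a minimal sketch with rclone (assuming you've
already run `rclone config` to create a B2 remote and a "crypt" remote named
b2crypt on top of it; the names are illustrative):

    # sync the storage server's archive to an encrypted B2 remote
    rclone sync /srv/archive b2crypt:archive --progress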

It's a fun experiment with clear, practical use.

~~~
kwhitefoot
Not quite that cheap: $0.005 per GB. $6 per month for personal unlimited
backup, though.

Trying the trial now.

Thanks for bringing it to my attention.

------
colonelpopcorn
I hate to be this guy, but isn't this why printers and physical books exist?

~~~
papln
If books are all we need, why did anyone bother creating the Internet?

~~~
xfitm3
Computers were a mistake.

------
dredmorbius
The recent shutdown of Google+ was another case of this.

As one of the people helping coordinate information among those still using
the platform and hoping to migrate off of it, I found discovering the Archive
Team's GoogleMinus project this past January a huge boost. It ended up being
the largest archival project undertaken to date, at 1.6 PB, and succeeded in
capturing 98% of all G+ profiles, now stored at the Internet Archive.[1]

While it had long been obvious that the project was ill-starred, the shutdown
announcement came as a surprise, and Google's tools, communications, and
support for individuals, and far more importantly _groups_ , looking to
continue their existence off the platform were abysmal.

I don't fault Google for killing the service -- I was surprised it survived as
long as it had. I _do_ fault Google for _how_ they did so. And that episode
was hardly the worst in history.

One of the lesser-known parts of G+ was its Communities. In the process of the
shutdown we came to realise that there were over 8 million of these, about
50,000 with 1,000 or more members, of all descriptions. Many were frivolous or
worse, but many were not. And all were stuck in a very hard spot by Google's
actions.[2]

Even preservation of _individual_ data does very little for groups, and is one
of the issues we're considering in the post mortem of the G+ mass migration,
intended to be of use to others.[3]

________________________________

Notes:

1. For those preferring not to have their content archived, the IA WBM
respects DMCA requests, and as Google+ posts are all listed under the user's
account, requesting removal is exceedingly straightforward.

2. Characteristics of number and size are collected here, compiled by me,
based in part on data provided by Friends+Me:
[https://social.antefriguserat.de/index.php/Migrating_Google%...](https://social.antefriguserat.de/index.php/Migrating_Google%2B_Communities#Google.2B_Community_Characteristics_and_Membership)

3. Discussion at Reddit and elsewhere. Compilation at the PlexodusWiki.
[https://old.reddit.com/r/plexodus/comments/boa97x/g_migratio...](https://old.reddit.com/r/plexodus/comments/boa97x/g_migration_post_mortem_what_went_well_what_went/)
[https://social.antefriguserat.de/index.php/G%2B_Migration_Po...](https://social.antefriguserat.de/index.php/G%2B_Migration_Post_Mortem)

------
bcaa7f3a8bbc
Another problem with the present-day WWW is that even archiving all the data
is far from enough to preserve the history! That's because the Web has a dual
role: (a) as a protocol, or a medium of communication, and (b) as the
software, or the user interface.

Good history preservation should allow you to somehow "browse" it, as if the
historical system were still alive. How the website worked and how it was used
are all part of the history. If old operating systems and programs are
preserved, there is no reason _not_ to preserve websites in this way.

Back in the old days, many systems were federated and/or distributed, which
means the software and the protocol were two separate entities. You used a
newsreader, which spoke the NNTP protocol to obtain news from a Usenet
newsgroup. If you want to preserve history, you can (a) archive the newsreader
program with its source code, and (b) archive all the data on the NNTP server.
That's exactly what has been done already: if you load a Usenet archive into
your newsreader, you'd have pretty much the same experience Larry Wall had
browsing Usenet back in the late '80s; at worst you'd need to write a
compatible "mock" server, but that's all. On the other hand, few of the early
BBS systems have been preserved; once the server is gone, everything is gone.

The transformation to the web means that now the platform (a web community) =
the protocol (backend database format) = the user interface (HTML/CSS);
they're all tightly coupled together. This creates several problems:

(1) The "internal state" cannot be archived. A website is a system with
constantly updating parameters, and often they are not stored. Simple
examples: (a) on Hacker News, I cannot retroactively see what was shown on the
frontpage yesterday; (b) a user changes his/her avatar, and now we have no
idea what the old avatar looked like; (c) an early user has been banned from
the forum, and now his/her personal profile is inaccessible; (d) on some
social media platforms, an old post may sometimes be raised from the dead by
renewed interest ( _look how stupid this comment was!_ ), and now it's
suddenly flooded with new posts, leaving no trace of how it used to look.

(2) The "reader/user interface" cannot be archived. You must have seen
something like this: the website frontend was changed, superficially; lots of
"conservative" users complained; but the point is that the old frontend and
its "look and feel" are now lost. If it was a simple CSS file, there's a
chance of bringing it back, but if it was a major rewrite of the frontend
code, that history is gone forever. And over the lifetime of a website, the
design and architecture are likely to change many times.

As a result, _even if a website and all its content are still alive, it may
already have been a shadow of its past for a long time_ ; never mind
preserving it! And currently there are two ways to archive the web, both
flawed:

(1) Preserve the HTML at the surface. It's good for single pages, but you
cannot browse a website this way at all. None of the buttons on the website
would work.

(2) Preserve the database: for example, using the API to save posts, or
dumping the database. The frontend and reader are not preserved. Using Hacker
News as an example, every single post is archived, but it's far from the full
experience; at the very least you should be able to click someone's username
and see all their posts.
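
You can see problem (2) first-hand with HN's public Firebase API: it hands you
the raw item, but nothing of the site around it (the item id is just my
comment linked at the end of this post):

    curl -s "https://hacker-news.firebaseio.com/v0/item/19562650.json"
    # returns the post's JSON -- no page, no threading, no profile links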

Now more and more websites are powered by JavaScript, which makes the problem
even worse. You are literally running a program on your computer without any
control over it. Once the platform is gone, no archive can save you.

What is the solution? I guess there's no full solution, but there are some
possibilities:

(1) Wikipedia-like websites already have built-in version control, but it's
very difficult to browse a historical version of the entire website. Systems
like this could improve the frontend / user interface to allow a user to
"lock onto" a historical date (see the sketch after this list).

(2) When building an all-JavaScript website, spend some energy building a
plain HTML version as well; it may help avoid the coming digital dark age.

(3) If you are going to close a website, it may be a good idea to make your
internal backups of the database and codebase from different years publicly
available, with sensitive information removed, and allow everyone to set up
and run a replicated version. It's infeasible for a big website, but it may be
workable for a small community.

And I can imagine archaeologists of the 22nd century digging into the old
backup tapes of Reddit and attempting to rerun the system.

But ultimately, it's a problem that needs to be addressed by protocols and
software designed with archiving and preservation in mind.

---

BTW: A few weeks ago I wrote a lengthy comment on the fundamental conflict
between history preservation and personal privacy, using Usenet as an example;
you may find it interesting.

* [https://news.ycombinator.com/item?id=19562650](https://news.ycombinator.com/item?id=19562650)

------
return1
And that's a great thing! There is no reason to maintain everything; in fact,
the entire function of our brains is to filter useful information out of a
deluge of sensory input. The internet figures out what to keep and what to
throw away; the hard thing seems to be willingly making it _forget_ stuff.

------
1121redblackgo
If it is rotting, then it will be a great time for scavengers and carrion
feeders--maybe even the 'era of'.

------
scarejunba
I don't care. In fact, I prefer it this way. Death in species is an
evolutionary advantage. So too with culture. We mustn't let it ossify.

For the whitewashing thing, it will happen anyway. Only vigilance can protect
against rewriting. Websites can be altered. There is no provenance.

I'm not convinced infinite recall is useful.

~~~
mnl
I don't think it's for us to decide.

We've been mourning the loss of the Library of Alexandria for circa 1,500
years, and I can't say we've gotten radically wiser during the last 25.

Letting it all go is definitely the cheaper and more convenient attitude...
for us. But we might be leaving nothing to build upon for future generations.
They should have the same opportunity we have to ignore what they want.

~~~
est31
And this is an important era in the history of our species. We just invented
an almost-free medium of communication that reaches across the globe.
This is as big as the discovery of fire.

