Where did the web archive go? (arxiv.org)
94 points by Hard_Space 30 days ago | 23 comments

If you don't want to click:

> Abstract. To perform a longitudinal investigation of web archives and detecting variations and changes replaying individual archived pages, or mementos, we created a sample of 16,627 mementos from 17 public web archives. Over the course of our 14-month study (November 2017–January 2019), we found that four web archives changed their base URIs and did not leave a machine-readable method of locating their new base URIs, necessitating manual rediscovery. Of the 1,981 mementos in our sample from these four web archives, 537 were impacted: 517 mementos were rediscovered but with changes in their time of archiving (or Memento-Datetime), HTTP status code, or the string comprising their original URI (or URI-R), and 20 of the mementos could not be found at all.

If you don't want to read a PDF: https://www.arxiv-vanity.com/papers/2108.05939/

Relevant: "Cool URIs Don't Change": https://www.w3.org/Provider/Style/URI

First thing that came to mind as well.

I honestly thought that Jakob Nielsen had earned his keep by getting the basics enough mind share to make the situation better than fond memories for turn-of-the-century web designers.

Having been surrounded by designers at the time, all of whom became developers soon after, I believe the loss of elementary design and UI knowledge is due to the vacuum left behind when the on-ramp to an Internet job became programmatic instead of visual arts.

However, possibly the same effect can't happen in the same way again, because early in the century coding for the web meant mainly fancy graphics and Flash, nothing related to, let alone connected with, the plumbing.

(ActionScript was good enough for a friend, a novice but dedicated and earnest programmer, to write a Flash demo of an insulin regulator device that was so good his simulation from the specifications was used as a reference through to certification of the C and assembly embedded implementation. An ESP8266 for ActionScript would be awesome. This guy lost Windows for Linux the instant the penny dropped that the greatest, most important barrier to this profession, for the majority, is voodoo FUD and a matter of self-confidence and reading the right books. This most recent generation doesn't seem to have an equivalent reading culture, similar to how we assembled libraries of O'Reilly titles in our offices and homes. Everyone I remember purchased additional titles in fields that they didn't know about; every single time they bought required references, something additional got smuggled into patiently awaiting brains. I copied this at my office, letting anyone add an extra title for every ten references. This almost immediately became a proper budget for a library, but suddenly the requests dried up unexpectedly. We realised the attraction was that, having snuck the expense onto the job budget, people took the books home to read, only afterwards bringing them back into circulation out of unnecessary sensitivity. This is why I passionately hate hermetically sealed corporate systems. Creativity needs inspiration, even from a little Lisp (or whatever) learning larceny.)

HTTP/1.1 Status Code Definitions:


It has always bugged me, e.g. to get 403: Forbidden when the intended message and situation is 401: Unauthorised.
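For what it's worth, the distinction fits in a few lines. This is only a minimal sketch, not any particular server's logic: `access_status` and its two boolean parameters are hypothetical names made up for illustration. The idea is that 401 answers "who are you?" (credentials missing or invalid), while 403 answers "I know who you are, and the answer is no."

```python
from http import HTTPStatus


def access_status(authenticated: bool, permitted: bool) -> HTTPStatus:
    """Pick a status for a protected resource (hypothetical helper)."""
    if not authenticated:
        # No valid credentials were presented: ask the client to authenticate.
        return HTTPStatus.UNAUTHORIZED  # 401
    if not permitted:
        # The identity is known, but it is not allowed to access this resource.
        return HTTPStatus.FORBIDDEN  # 403
    return HTTPStatus.OK  # 200
```

A server that returns 403 to an anonymous visitor for a page that would work after login is the mix-up being complained about here.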

I am mystified to this day why HTTP, perhaps through a rapidly enforced HTTP/1.2 spec, didn't make it mandatory for servers to enforce strict management of URI errors in particular, together with a CRUD layer of enforcement and even FTP protocol extensions for providing a UI to enable intelligent use as standard. For example:

if you follow the standard date formats including the date in the filename, it's trivial to implement

300: Multiple Choices

whenever you confirm that you really do want to upload


after _National_News_Headlines_for_1995.12.24.htm
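One possible reading of the idea above, as a rough sketch only: if every filename ends in a date in a fixed format, a server can group files by the part before the date and answer 300 when more than one dated version exists. Everything here is an assumption for illustration, including the `<stem>_<YYYY.MM.DD>.<ext>` layout, the `resolve_dated` name, and the sample filenames (modelled on the one in this comment, not taken from a real site).

```python
import re
from http import HTTPStatus

# Assumed filename layout: <stem>_<YYYY.MM.DD>.<ext>
DATED_NAME = re.compile(
    r"(?P<stem>.+)_(?P<date>\d{4}\.\d{2}\.\d{2})\.(?P<ext>\w+)"
)


def resolve_dated(stem, existing):
    """Return (status, candidates) for a request naming only the stem."""
    matches = []
    for name in existing:
        m = DATED_NAME.fullmatch(name)
        if m and m.group("stem") == stem:
            matches.append((m.group("date"), name))
    # Zero-padded YYYY.MM.DD sorts chronologically as plain text.
    matches.sort()
    names = [name for _, name in matches]
    if not names:
        return HTTPStatus.NOT_FOUND, names
    if len(names) == 1:
        return HTTPStatus.OK, names
    # Several dated versions exist: let the client choose among them.
    return HTTPStatus.MULTIPLE_CHOICES, names


files = [
    "National_News_Headlines_for_1995.12.25.htm",
    "National_News_Headlines_for_1995.12.24.htm",
    "Weather_for_1995.12.24.htm",
]
status, candidates = resolve_dated("National_News_Headlines_for", files)
```

With the sample list, the headlines stem has two dated versions and so gets the 300 treatment, while `Weather_for` resolves to a single file.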

A bit funny to have a paper on arxiv.org investigating archive.org (and europarchive.org and archive-it.org and so on).

It does not investigate archive.org (although it did use it to track changes in the archives that were studied).

I will pay for or sponsor any project that reports on the existence of archive efforts, historical and current and planned.

The existence of Archive.Org has led me to conclude that everyone has assumed that the problem is solved. We all know - here - how far that is from any consistent definition of reality.

If you are serious, I'm down to do something like this.

Who archives the archive?

Everyone who runs ArchiveBox at home :)


My father used to hum or whistle this song, EDIT: [Who Looks After] The Caretaker's Daughter [When The Caretaker's Busy Taking Care?], whenever appropriate.


Needs more Latin

Isn't "archive" already Latin?

Quis archivet ipsos archives?

Assuming Roman numbers count, I suggest arx4.


Quis tuetur ipse tuendum.

(Loose amateur translation, note that verbs for tabularium and archivum were not attested in my brief check.)

(that should be ipsum and I noticed too late to edit)


We detached this subthread from https://news.ycombinator.com/item?id=28238860.


This isn't reddit. Not only will this go over the heads of many HN users here, it's against the site's guidelines.


The guidelines aren't really comprehensive. A lot of the content in HN is curated by the community, which is guided by tribal knowledge and unspoken rules.

Fun and jokes are notoriously hard to get across in HN, which actually makes it all the more rewarding when it works. When the "fun" is just completely absent of originality and personality--just another regurgitated meme pattern of zero effort--many people just downvote and flag because the community doesn't want it.

Hacker News is not anti-fun, it's anti low-effort fun.
