DOI use on Reddit – results from recent dump (crossref.org)
24 points by afandian on Oct 1, 2015 | 18 comments



Please correct me if I'm wrong, but based on the author's description of DOIs, they're not completely linkrot-proof. The issue is that a DOI is just a link. The link itself could disappear, or the publisher, or even the metadata. I don't know what steps publications take to preserve DOIs, but it sounds like on their own they only protect against page moves. To really combat linkrot you need to provide as much metadata as possible, mainly title and authors, but other data could also help with tracking stuff down. Lastly, even if you figure out what the DOI points to, the PDF might be gone; that's why archiving services like the Wayback Machine or WebCite are so important. I'm glad they're doing something about it though; I find that a lot of my bookmarks 404 after only a few months. Webmasters need to realize that a URL is just like an address and that you need to set up a forwarding address.


Author here. You're not wrong. There's nothing magical about DOIs or everyone would use them for everything and there would be no link rot. You can only update links if you know when and where things are moving, and you're still around to do it.

Scholarly publishing is an area where documents need to stick around for a long time, be able to be cited in a well-understood way, etc etc.

Crossref is a trade association of scholarly publishers, formed to solve exactly this problem. Part of the deal when a publisher joins Crossref is an obligation to keep the links up to date and to have a plan for what happens if the publisher suddenly vanishes into thin air.

Initiatives like CLOCKSS address the archiving question http://www.clockss.org/clockss/Home . The CLOCKSS homepage has a good overview.

And to the 'tracking stuff down' point, the DOI is resolvable (i.e. a link you can click) but it's also an ID that you can look up in a database. DOIs predate the web and may live longer than it. Having a single, 'official' ID (whatever it looks like) is better than searching by metadata.


> Having a single, 'official' ID (whatever it looks like) is better than searching by metadata.

As someone who works with scholarly data, having IDs for things is absolutely invaluable. There's a lot you can do with DOIs quickly and easily that would be prohibitively difficult if you had to use human-added metadata.
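
As a rough illustration of the kind of thing an ID makes easy, here is a minimal sketch that pulls DOIs out of free text, such as a comment dump. The regular expression is a common heuristic and an assumption on my part, not an official DOI grammar, and `find_dois` is a hypothetical helper name.

```python
import re

# Assumed heuristic: "10.", a 4-9 digit registrant prefix, "/", then a suffix
# running up to the next whitespace, quote, or angle bracket. Real DOI
# suffixes can contain almost any printable character, so this will miss
# or mangle some edge cases.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return the distinct DOIs found in text, in order of first appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;)")
        # DOIs are case-insensitive, so normalise before de-duplicating.
        doi = doi.lower()
        if doi not in found:
            found.append(doi)
    return found
```

Doing the same with titles and author names would need fuzzy matching; with an ID it is a pattern match and a set.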


> An obligation to keep the links up to date and a plan for what happens if the publisher suddenly vanishes into thin air.

It's good to hear that scholars are getting together and trying to mitigate the volatility of URLs. I'll definitely need to read more on the subject to understand the pros and cons so thank you for your introduction in the article; many specialist blogs tend to only write for their specific audience. It seems like DOIs are an ample solution as long as they aren't a single point of failure.



Sorry, but what is a DOI?


As described in the blog post (although not fully).

A DOI is a Digital Object Identifier.

A digital object is something like a scholarly article. People need to be able to find it by URL, cite it, and refer to it with an ID. The URL is no good because it could change. If you cite a URL and it 404s in a few years' time, that's link rot.

So a DOI is a link that points to the item. It 'belongs' to the publisher who owns the article. They are obliged to update the link to point to the new location when it moves. Hence the DOI link always (in theory) works.

From a user's perspective, a DOI is just a URL that you visit that redirects you to a different URL. From inside, it's a link resolver system backed by a database that publishers keep updated through a registration agency (e.g. Crossref).

Different Registration Agencies do different things. Crossref has a tonne of metadata, contributed by publishers. Have a play here: http://api.crossref.org/works

There's a good Wikipedia article: https://en.wikipedia.org/wiki/Digital_object_identifier
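
To make the two views of a DOI concrete (a link you visit, and an ID you look up), here is a minimal sketch. The host names come from this thread; the function names are hypothetical, and the assumption that the API wraps the record in a "message" field should be checked against the Crossref API documentation.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def resolver_url(doi, resolver="https://doi.org/"):
    """The 'link' view: a URL that redirects to the current landing page."""
    return resolver + quote(doi, safe="/")


def crossref_metadata_url(doi):
    """The 'ID' view: a lookup of the same DOI in Crossref's metadata API."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def fetch_crossref_metadata(doi):
    """Fetch the metadata record (needs network access).

    Assumes the JSON response carries the record under "message".
    Only works for DOIs registered with Crossref, not other agencies.
    """
    with urlopen(crossref_metadata_url(doi), timeout=10) as response:
        return json.load(response)["message"]
```

For example, `resolver_url("10.1371/journal.pone.0115253")` gives the clickable link, while `crossref_metadata_url` with the same DOI gives the database lookup.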


DOIs work for referencing resources because someone, e.g. a scholarly publisher, has a self-interest in making sure that DOI-based links keep working. They put a concerted effort into making that happen. When referencing "web at large" resources that reside on any kind of web server out there, the admin of that web server is typically not very interested in putting in the work to keep links working. Hence, the person putting in a link to a "web at large" resource, or the platform on which such a link is published, carries the burden of keeping the link working. Our Robust Links work (http://robustlinks.mementoweb.org/) is all about how that can be done. As someone in this thread says, it involves "metadata". But not the kind of metadata that's mentioned in his post; it's all more webbie than that. Note that link rot is not the only problem for "web at large" resources. Content drift is another: content at the end of a link changes. See also my comment in the Mozilla Science discussion "What makes code a research object" (https://github.com/mozillascience/code-research-object/issue...). And see our extensive study regarding reference rot in scholarly communication (http://dx.doi.org/10.1371/journal.pone.0115253).


Thanks for the reply and the links. The approach taken by Robust Links is close to what I had in mind in regards to actually archiving the content. Looks like it's a good solution to content drift and it's more passive than WebCite. The timestamp benefit is also great. Maybe you could also include some sort of hash on the local end so users could see at a glance whether the nearest archive differs at all. Of course that's limited, since it uses external archives and might not work with dynamically generated pages. I'll have to check out your comment [0], which looks like a lot of thought went into it, and the research paper.

[0]: https://github.com/mozillascience/code-research-object/issue...
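
The hash idea could look something like this. It is only a sketch of the suggestion in this comment, not anything Robust Links is known to do; the function names are hypothetical and SHA-256 is an arbitrary choice.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Hash the page bytes as they were when the link was created."""
    return hashlib.sha256(body).hexdigest()


def has_drifted(fingerprint_at_link_time: str, archived_body: bytes) -> bool:
    """True if an archived copy differs from what was originally linked.

    Any byte-level change counts, so dynamically generated pages (ads,
    timestamps, session tokens) would show as drifted even when the
    meaningful content is unchanged.
    """
    return content_fingerprint(archived_body) != fingerprint_at_link_time
```

Storing the fingerprint next to the link would let a reader compare it against whichever archived snapshot they end up with.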


The article's description of a DOI is a bit vague. It isn't just a link. When you register a DOI, you have to register the metadata, which is then stored with your DOI registration agency (like Crossref or DataCite). For example, 10.15783/C7F59T is a DOI for one of our datasets. If you follow http://doi.org/10.15783/C7F59T in your browser then you get taken to our data archive, but if that goes away then you can still query the metadata at DataCite (http://data.datacite.org/10.15783/C7F59T).

But what happens if doi.org goes down? http://crosstech.crossref.org/2015/01/problems-with-dx-doi-o...


Author here.

Most readers on the blog know all about DOIs so that section wasn't meant to describe everything about them. I welcome feedback though.

What happens if doi.org goes down? In the first instance you can use an alternative resolver, e.g. http://dx.doi.org/10.5555/12345678 can be resolved against http://dx.crossref.org/10.5555/12345678

In the second instance, we're as open as possible about it and discuss the shortcomings of the system! It's not perfect, but nothing is. We think it's the best solution.
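
For what it's worth, the alternative-resolver fallback can be sketched in a few lines. The two resolver hosts are the ones named above; whether they still answer is not something this sketch assumes, and the function names are hypothetical.

```python
from urllib.request import urlopen

# Resolver bases taken from this comment; treat the list as an assumption.
RESOLVERS = ("http://dx.doi.org/", "http://dx.crossref.org/")


def default_fetch(url):
    """Follow redirects and return the final URL (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.geturl()


def resolve_with_fallback(doi, resolvers=RESOLVERS, fetch=default_fetch):
    """Try each resolver in turn and return the first landing URL found."""
    last_error = None
    for base in resolvers:
        try:
            return fetch(base + doi)
        except OSError as error:
            # urllib's URLError and HTTPError, and socket timeouts, are all
            # OSError subclasses, so this covers an unreachable resolver.
            last_error = error
    raise RuntimeError(f"no resolver could resolve {doi}") from last_error
```

The `fetch` parameter is there so the loop can be exercised without touching the network.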


Thank you for the clarification. It clears up a lot; it seems I gave the designers of the DOI too little credit. Your last question brings some food for thought. The naive answer is to "just decentralize it", but smarter people than I have been trying to tackle a similar issue with certificate authorities. Sometimes software isn't really the easiest solution compared with open dialogue and cooperation.


How does a DOI help at all, period? Once I click the link and the DOI redirects me, when I share I'm going to share the actual link, not the DOI.


This is a very significant problem indeed. In our reference rot study (http://dx.doi.org/10.1371/journal.pone.0115253), we found that a very significant number of papers were referenced by means of their landing page URL instead of by their DOI URL. As part of the "Signposting the Scholarly Web" work, we looked into this issue and thought that using a "canonical" link from the landing page URL to the DOI URL would be a good step.

It needs to be understood that "canonical" was really introduced in the context of search engine optimization and is supported by web search engines. In essence, if a landing page had a "canonical" link to the DOI URL, the content of the landing page would be indexed under the DOI URL.

For browsers, the desired behavior would be to bookmark a page that has a "canonical" link under the canonical URL instead of the page's URL. At this time, to the best of my knowledge, this is not supported by browsers. However, it seems to me that it should not be impossible to achieve this, if, for example, the Persistent Identifier community (DOI, handle, PURL, ark, ...) would lobby for it. Alternatively, one could write an RFC and define another link relation type specifically for this relation type between landing page and persistent identifier. And lobby browser manufacturers to implement the "right" bookmarking behavior for it. But, the difference between "canonical" and this special-purpose relation type might be too subtle to make it through IANA registration scrutiny.
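
As a sketch of the bookmarking behavior described here, this is roughly what a client would have to do: look for a "canonical" link in the landing page and prefer it over the page's own URL. No browser is claimed to work this way; the class and function names are hypothetical, and only the HTML `<link>` element is handled, not the HTTP Link header.

```python
from html.parser import HTMLParser


class CanonicalLinkFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel is a space-separated list of case-insensitive tokens.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def url_to_bookmark(html, page_url):
    """Return the canonical URL if the page declares one, else page_url."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical or page_url
```

If a publisher's landing page declared its DOI URL this way, `url_to_bookmark` would return the DOI URL and the shared link would be the persistent one.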


A publisher should display the DOI on their landing page for the article, e.g. the top matter of http://journals.plos.org/plosone/article?id=10.1371/journal.... . If you want to share a persistent link, you can.

In the context of scholarly articles, if you're reading an article, the references will be linked via DOI, e.g. here: http://journals.plos.org/plosone/article?id=10.1371/journal....

Where this all gets interesting is non-traditional scholarship, e.g. sharing on social media where there are lots of 'publishers' (twitter, reddit) and they don't particularly care about best practice from the scholarly publishing industry. How do you get someone to share the persistent link? You can't force them. It's not a solved problem, and probably never will be.

Herbert Van de Sompel is working in this area http://public.lanl.gov/herbertv/home/ with things like canonical link headers that indicate the canonical URL of a given page. Here's a presentation he did on the subject, which he calls 'signposting the scholarly web': https://www.force11.org/video/signposting-scholarly-web


Because publishers often change their systems, so links can break. URLs can also be very long and difficult to include in text documents. Some journals I've published in have moved between several publishers. The DOI is an identifier which is resolved to point to the current location of the document. The DOI identifies an object, not the location of an object, as described here: http://www.doi.org/factsheets/DOIIdentifierSpecs.html


So... why not just use the 'DOI' as the canonical URI then?

Is it just so they can try to extract some SEO magic juice from the URI?


That's the idea. We would like people to. We have about 5000 publisher members. Co-ordinating an entire industry can take time.

As to why people don't... I wish I knew.



