

Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts [pdf] - dj-wonk
http://2014.eswc-conferences.org/sites/default/files/papers/paper_106.pdf

======
nwf
First off, let me say that I'm always happy to see people thinking about the
robustness of scientific data. It's a thing we do not do well at all at
present, and it should be treated as far more urgent, given its importance to
the enterprise. However, this work has a number of small problems and mostly
seems like rehashing (no pun intended) well-trodden ground.

Like so many similar works, this fails to cite the magnet: URI scheme (see,
for starters,
[http://en.wikipedia.org/wiki/Magnet_URI_scheme](http://en.wikipedia.org/wiki/Magnet_URI_scheme)),
of which trusty URIs and the cited ni: URI scheme both seem to be small
subsets. Introduced in 2002, magnet URIs already defined a way of stably
identifying an immutable object and providing one (or more!) suggestions for
retrieval, which the present paper calls "authorities" but which are likely
better viewed as caches; one cache may be authoritative, but that's optional.
The "modules" the paper defines are probably better encoded as MIME types
(and could be integrated into a magnet URI as "x.mime=.../..." attributes;
the draft standard sadly does not have a field for MIME type), rather than
introducing yet another namespace for describing document types.
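
For concreteness, such a magnet URI might look roughly like this (the hash
and cache URL are invented for illustration, and the URI is wrapped across
lines for readability; "xt" and "as" are standard magnet fields, while
"x.mime" is the experimental attribute suggested above):

    magnet:?xt=urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C
        &dn=paper_106.pdf
        &as=http%3A%2F%2Fexample.org%2Fcache%2Fpaper_106.pdf
        &x.mime=application%2Fpdf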

Speaking of caches, the paper's assertion that "any artifact that is available
on the web for a sufficiently long time will remain available forever" is
extremely worrying; the search engines of the Internet (other than Internet
Archive, perhaps) are not altruistic entities out to serve your data forever.
Their caches cannot and must not be depended upon by the scientific community;
we must host our own data or pay for its archival, as much as that may be
painful. There Ain't No Such Thing As A Free Lunch.

The trick for deriving self-reference is analogous to how IP packets carry
their own checksum; it's an old trick, dating back to at least RFC 791
(section 3.1, under the heading Header Checksum; earlier RFCs do not seem to
mention it) and almost surely older still, and probably merits a citation of
something. The use of the same technique for Skolemization is cute, providing
a nice workaround for RDF's poor handling of existentials.
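
For readers unfamiliar with it, the trick is: compute the digest over the
document with the self-reference slot blanked out, then substitute the digest
into the slot; verification reverses the substitution. A minimal sketch in
Python (all names here are hypothetical, it assumes a single digest slot per
document, and the paper's actual algorithm additionally sorts and normalizes
the RDF first):

    import hashlib
    import re

    PLACEHOLDER = "0" * 64  # hypothetical blank slot for the hex digest

    def seal(doc):
        # Hash the document with the slot still blank, then fill it in.
        digest = hashlib.sha256(doc.encode()).hexdigest()
        return doc.replace(PLACEHOLDER, digest, 1)

    def verify(doc):
        # Find the embedded digest, blank it back out, and recompute.
        m = re.search(r"[0-9a-f]{64}", doc)
        if m is None:
            return False
        blanked = doc[:m.start()] + PLACEHOLDER + doc[m.end():]
        return hashlib.sha256(blanked.encode()).hexdigest() == m.group()

    sealed = seal("data ... hash: " + PLACEHOLDER + " ... more data")
    assert verify(sealed)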

The performance numbers are worrying; streaming a search-and-replace pass (to
transform out the self-references) followed by a SHA-256 verification through
177GB of data should not take 29 hours, especially given that the data is
already sorted. CheckSortedRdf and CheckLargeRdf both exhibit linear time in
figure 3, suggesting that the data being verified is already sorted (which
would be consistent with the earlier assertion that the existing
implementation only generates sorted files); a better comparison would be to
show CheckLargeRdf on randomized inputs, as all we see now is the overhead of
a pre-processing pass that is, essentially, just verifying the sortedness of
the input.
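
Concretely, the whole check can be a single streaming pass with constant
memory, along these lines (the self-reference pattern below is a guess based
on the paper's "RA" module identifier and a 43-character base64 digest, and
stands in for whatever transformation the paper's tools actually perform):

    import hashlib
    import re
    import sys

    # Hypothetical pattern for embedded trusty-URI self-references.
    SELFREF = re.compile(rb"RA[A-Za-z0-9_-]{43}")

    h = hashlib.sha256()
    with open(sys.argv[1], "rb") as f:
        for line in f:  # one pass, O(1) memory; disk should dominate
            h.update(SELFREF.sub(b" ", line))
    print(h.hexdigest())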

~~~
tkuhn
(disclaimer: I am an author of the paper)

Thanks for your comments. First off: yes, most (perhaps all) of the applied
methods are not novel; some of them have been around for a long time. We
claim novelty only in how these existing methods are combined to solve the
problem of data availability and integrity on the web.

Yes, the magnet URI scheme is highly related, and we probably should have
referred to it in one way or another. However, there are crucial features that
magnet links do not provide (as far as I know): you cannot generate a hash
that represents content on a more abstract level than byte sequences (MIME
types by themselves don't solve that problem), nor can you have self-
references. Each of the features from our list of requirements is supported
by some existing approach, but (to our knowledge) no approach supports all of
them at the same time.
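
To illustrate the byte-level vs. abstract-level distinction with a toy
example (the sorting step loosely mirrors what our RDF module does, though
the real algorithm also normalizes the serialization itself; nothing here is
the paper's actual code):

    import hashlib

    def byte_hash(doc):
        return hashlib.sha256(doc.encode()).hexdigest()

    def graph_hash(doc):
        # Sort the statements before hashing, so the digest depends on
        # the triple set rather than on the serialization order.
        return byte_hash("".join(sorted(doc.splitlines(keepends=True))))

    # The same two triples, serialized in different orders:
    doc_a = "<s> <p> <o1> .\n<s> <p> <o2> .\n"
    doc_b = "<s> <p> <o2> .\n<s> <p> <o1> .\n"

    assert byte_hash(doc_a) != byte_hash(doc_b)    # byte level: different
    assert graph_hash(doc_a) == graph_hash(doc_b)  # abstract level: same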

In terms of search engines caching research data, I agree! We shouldn't put
too much trust in existing providers, but should instead build a dedicated
decentralized infrastructure for scientific purposes (this is what I am
working on now).

I am sure the performance measures can be improved (incremental cryptography
might let us get rid of sorting altogether). The shape of the curve, however,
is not much affected by whether the statements are already sorted (they are
not sorted for TransformRdf and TransformLargeRdf!).

I hope this clarifies some things.

~~~
nwf
Thanks for your response; it does clarify things.

But I don't think I understand your concern about abstract hashing and why it
would need to be something fundamentally new. Both the order normalization
and the self-reference are simply preprocessing stages on your data, albeit
of slightly different forms. The sortedness requirement, I think, is captured
by MIME type parameters (the "charset=" in "text/html;charset=UTF-8"), as
sortedness does not change the fact that the document is an RDF graph. For
the placeholder trick, I think you're right that you'd want something like a
"text/rdf+selfref" MIME type to indicate that the document is not in fact
valid RDF until preprocessing has been performed. All told, your RDF module
would be described in MIME as something like "text/rdf+selfref;sorted=".

~~~
tkuhn
Right, I guess you could fold everything into a new MIME type, but I think
that would be quite a weird thing to do and wouldn't really be faithful to
the idea of MIME types. Nobody would use this MIME type directly for files;
it would only stand for some internal intermediate representation (I will not
be able to convince people using RDF to switch to my strange new format
instead of TriG or N-Quads!). That means there would be two MIME types
involved for a single file: the actual type (such as application/rdf+xml or
application/trig) and then the type for normalization and hash calculation
(something like "text/rdf+selfref;sorted="). I think this shows that MIME
types are not a straightforward solution to the given problem, and it
justifies introducing this new level and a new scheme for the trusty URI
modules (e.g. "RA").

------
runeks
I've often thought about this when including e.g. style sheets and JavaScript
for Bootstrap (or any other highly popular .css or .js file) in HTML: every
time, I have to look up the URL for the file in question, and I have to trust
that this URL is reliable and will continue to be reliable.

It would be a lot easier if I could just specify the SHA-256 hash of the file
in question, like so:

    <link rel="stylesheet" href_sha256="f0fa7e4b0123ff9618fc51f1e54c0842072605412113a6daaf23758c67952d0c">

instead of the current solution:

    <link rel="stylesheet" href="https://netdna.bootstrapcdn.com/bootswatch/3.0.0/flatly/bootstrap.min.css">

Specifying the hash allows the browser to retrieve this file from the quickest
source, whether that be the global cache (maybe I visited some other web page
that used the same file), or the browser knows a list of CDNs that have the
file, and fetches from the fastest one.

It avoids having to specify the source of the file. As a developer, I really
don't care where the file comes from -- I don't particularly want it to be
served from netdna.bootstrapcdn.com, but there's no sane standard for
specifying otherwise.
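
Roughly, I imagine the browser-side logic working like this (everything here,
the mirror list, the cache, and the names, is made up for illustration):

    import hashlib
    import urllib.request

    # Hypothetical mirror list and in-memory cache, keyed by hex digest.
    MIRRORS = ["https://cdn-a.example/by-hash/", "https://cdn-b.example/by-hash/"]
    CACHE = {}

    def fetch_by_hash(hex_digest):
        # Global cache first, then each known mirror; always verify the
        # bytes against the requested digest before trusting them.
        if hex_digest in CACHE:
            return CACHE[hex_digest]
        for base in MIRRORS:
            try:
                data = urllib.request.urlopen(base + hex_digest).read()
            except OSError:
                continue
            if hashlib.sha256(data).hexdigest() == hex_digest:
                CACHE[hex_digest] = data
                return data
        raise LookupError("no source served content matching the hash")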

~~~
nwf
Agitate browser implementers to add support for magnet: URIs and you can do
exactly that without needing to add new attributes to HTML. :)

There's a bug for firefox here:
[https://bugzilla.mozilla.org/show_bug.cgi?id=528148](https://bugzilla.mozilla.org/show_bug.cgi?id=528148)

~~~
runeks
Would certainly be nice to avoid tons of different href_sha256, href_sha512,
href_sha3-512, etc. attributes.

But having the browser identify files via hashes is vastly different from
having it fetch files via BitTorrent.

I think it makes more sense to pursue each separately, even though magnet URI
support would eventually supersede any existing implementation of hash
support.

------
tkuhn
I am the main author of the paper. Happy to discuss and clarify if needed.

~~~
edwintorok
What happens to other URLs embedded in a document that you link to with a
trusty URI (other than self-references)? For example, your document could
include images and JavaScript that could completely change the meaning of the
document while keeping the hash of the document the same.

Do you require all URLs contained in the document to be trusty URIs too?

~~~
tkuhn
No, there can be contained URIs that are not trusty (in fact, the majority of
contained URIs will probably be of that type). You can verify the entire
reference tree as long as you follow trusty links, but of course this cannot
go on indefinitely. Furthermore, not all resources have the form of what I
call a "digital artifact" (e.g. foaf:knows does not stand for a digital
artifact); such URIs reach out to the real world, and they might not even
return a representation, i.e. they might not be URLs.
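
As a toy sketch of that reference-tree walk (a self-contained simplification:
an in-memory "web", trusty URIs as plain content hashes without the
self-reference trick, and whitespace-separated tokens treated as links; none
of this is the actual trusty URI format):

    import hashlib

    WEB = {}  # uri -> document body

    def publish(body):
        uri = "trusty:" + hashlib.sha256(body.encode()).hexdigest()
        WEB[uri] = body
        return uri

    def is_trusty(uri):
        return uri.startswith("trusty:")

    def verify(uri, body):
        return uri == "trusty:" + hashlib.sha256(body.encode()).hexdigest()

    def verify_tree(uri, seen=None):
        # Verify this artifact, then recurse into its trusty links only;
        # non-trusty URIs (like foaf:knows) are simply not followed.
        seen = set() if seen is None else seen
        if uri in seen:
            return True
        seen.add(uri)
        body = WEB[uri]
        if not verify(uri, body):
            return False
        return all(verify_tree(t, seen)
                   for t in body.split() if is_trusty(t))

    leaf = publish("just a leaf document")
    root = publish("refers to " + leaf + " and to http://example.org/")
    assert verify_tree(root)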

------
dj-wonk
I'd be interested in hearing from others about (1) other papers and
implementations along these lines and (2) their thoughts on the approach
outlined in the paper.

~~~
dj-wonk
"This document describes the CCNx protocol, the transport protocol for a
communications architecture called Content-Centric Networking (CCN) built on
named data. CCN has no notion of host at its lowest level — a packet "address"
names content, not location. The CCNx protocol efficiently delivers named
content rather than connecting hosts to other hosts. Every packet of data may
be cached at any CCNx router — combined with intrinsic support for multicast
or broadcast delivery this leads to a very efficient use of the network when
many people are interested in the same content."

from
[http://www.ccnx.org/releases/latest/doc/technical/CCNxProtoc...](http://www.ccnx.org/releases/latest/doc/technical/CCNxProtocol.html)

------
dj-wonk
See also:
[https://news.ycombinator.com/item?id=5622209](https://news.ycombinator.com/item?id=5622209)

