
Archival Identifiers for Digital Files - todsacerdoti
https://blog.adamretter.org.uk/archival-identifiers-for-digital-files/
======
lcrz
> Whilst we could adopt Multihash here, it's much more complex than we need
> (famous last words?!?).

So OP has the option of adopting an existing (and clear) spec, with existing
libraries (and an existing governance structure), and because of what feels
like a not-invented-here mentality ("yeah well, I can't imagine needing this
complexity"), OP has decided to develop a new standard: without an existing
spec, without existing libraries, and without an existing governance structure.

All these ideas in online data sharing, semantics and interoperability are
great, but I think people underestimate how much effort it actually is to
create a good standard, a good spec, and how exponentially more effort it
takes to keep such a standard in a shape that's working for more people.

The extensibility of OP's identifier is extremely narrow. Just look at how
many hash functions and properties are already supported by the IPFS
content id:
[https://ipfs.io/ipfs/QmXec1jjwzxWJoNbxQF5KffL8q6hFXm9QwUGaa3...](https://ipfs.io/ipfs/QmXec1jjwzxWJoNbxQF5KffL8q6hFXm9QwUGaa3wKGk6dT/#title=Multicodecs&src=https://raw.githubusercontent.com/multiformats/multicodec/master/table.csv)

OP's choice of hash function is even in there.
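For what it's worth, the encoding itself is small: a Multihash is just a varint function code, a varint digest length, then the digest. A rough Python sketch (assuming, per the linked table, `0xb220` as the code for blake2b-256, which stands in here for OP's BLAKE choice since the original BLAKE-256 isn't in Python's stdlib):

```python
import hashlib

def encode_varint(n: int) -> bytes:
    # Unsigned LEB128, as used by the multiformats specs.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def multihash(data: bytes, code: int = 0xB220) -> bytes:
    # <varint code><varint digest length><digest>
    digest = hashlib.blake2b(data, digest_size=32).digest()
    return encode_varint(code) + encode_varint(len(digest)) + digest
```

The prefix is what makes the identifier self-describing: a reader can tell which hash produced it without any out-of-band agreement.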

There is a reason why IPFS is already at a second version of the spec: they
used the first version and found it lacking for their use cases. Every new
standard is doomed to make similar mistakes unless it explicitly accounts for
them in its design.

~~~
adamretter
> So OP has the option of adopting an existing (and clear) spec, with existing
> libraries (and an existing governance structure), and because of what feels
> like a not-invented-here mentality ("yeah well, I can't imagine needing this
> complexity"), OP has decided to develop a new standard: without an existing
> spec, without existing libraries, and without an existing governance structure.

I think perhaps my article was too light on the requirements that we face as a
National Archive, or perhaps the Digital Preservation and Access domains are
just quite specialised.

The article was designed as a thought experiment to solicit feedback on one
possible idea. The project is currently in a Proof-of-Concept phase and so
such experiments are quite valid. We certainly have not yet decided on this
scheme.

That being said, Archives have very different requirements for their
identifier schemes and standards than other organisations. It is not so much a
"not-invented-here" as a "must-work-here" mentality. We always look around at
existing standards and specifications before developing anything ourselves,
and we have to have good reasons not to use something that already exists,
i.e. that it won't work in an archival preservation context.

> All these ideas in online data sharing, semantics and interoperability are
> great, but I think people underestimate how much effort it actually is to
> create a good standard, a good spec, and how exponentially more effort it
> takes to keep such a standard in a shape that's working for more people.

I hope I have a fairly good idea what it takes to establish a standard. I was
an invited expert to three W3C Working Groups, and I have also participated in
a number of Community standards groups ;-)

> The extensibility for OP's identifier is extremely narrow. Just look at how
> many hash functions and properties are currently already supported by the
> IPFS

I think that missed the point of what is required. For use in an archival
context we don't need to support every hash scheme out there. Quite the
opposite: we want to support as few as possible. In 100 or 1000 years we don't
want digital archaeologists to have to work hard to access the data. Ideally
we will support just 1 hash scheme right now... something robust that will
likely have no intentionally crafted collisions for many years (e.g.
Blake-256). If and when a collision is developed for that hash, then we will
support a second hash algorithm, and so on ;-)
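As a concrete sketch of that single-scheme approach (not OP's actual design: Python's stdlib has no BLAKE-256, so blake2b truncated to 256 bits stands in, and the "b2b-256" label is invented for illustration):

```python
import hashlib

SCHEME = "b2b-256"  # the one scheme supported "right now"

def archival_id(data: bytes) -> str:
    # One labelled scheme; the label exists only so that a second
    # scheme can be added if/when the first is ever broken.
    return f"{SCHEME}:{hashlib.blake2b(data, digest_size=32).hexdigest()}"

def verify(identifier: str, data: bytes) -> bool:
    scheme, _, digest = identifier.partition(":")
    if scheme != SCHEME:
        raise ValueError("unknown scheme")  # abort rather than guess
    return digest == hashlib.blake2b(data, digest_size=32).hexdigest()
```

A digital archaeologist then only ever has to reimplement one hash function per era, rather than the whole multicodec table.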

I actually really like IPFS and Multihash, but are they suitable for use as
archival identifiers within a National Archive?

~~~
lcrz
Hi OP,

First of all, let me say I'm not questioning your expertise or your good
intentions. I'm a great fan of (and following closely) a few standards or
proposed standards championed by the W3C, mostly related to RDF, verifiable
credentials and digital identity, so a bit different from digital archiving.

To be a bit more constructive than my comment above: I think if the aim is
rediscoverability in 100 to 1000 years, what your current proposal is missing
is semantics. Bits are just bits. Discovering what the bits mean is actually a
great deal more difficult than checking whether a checksum is still valid.
Indeed, this gives bits a certain color. A question that you pose is 'If I
change the name of the file, is it still the same digital file?', but you do
not go further into an answer. While a filename might be an arbitrary label,
the color of the bits (or the intent of the creator/archiver of the file)
might, in my opinion, be worth including.

------
nayuki
I generally agree with the ideas he wrote, such as:

> a clean and mutually-beneficial separation between cataloguing and (digital)
> preservation activities can be established

> the file's path and/or name are not suitable for use as an identifier; in no
> small part due to both their transient nature, and inability to be combined
> with files from other systems which may give rise to naming conflicts

> UUIDs as identifiers; hashes as identifiers

In my article
[https://www.nayuki.io/page/designing-better-file-organizatio...](https://www.nayuki.io/page/designing-better-file-organization-around-tags-not-hierarchies),
I start from the same basis and explore deeper structures and issues. I argue
that file paths and URLs are indeed bad identifiers, and that the content of a
file should determine its name.

I propose that a tag consists of some descriptive info like a string, plus a
hash reference to the target file being described. The unique feature of my
proposition is that the tag is itself a first-class file, which can in turn be
tagged or referenced by more metadata.
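A minimal sketch of that idea (field names invented for illustration): a tag is just bytes holding a label plus the hash of its target, so the tag can itself be hashed and tagged in turn.

```python
import hashlib
import json

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def make_tag(label: str, target: bytes) -> bytes:
    # The tag is itself a first-class file: plain bytes on disk.
    return json.dumps(
        {"label": label, "target": content_hash(target)},
        sort_keys=True,
    ).encode()

photo = b"...jpeg bytes..."
tag = make_tag("volleyball club", photo)      # tag describing the photo
meta_tag = make_tag("verified 2020-06", tag)  # tagging the tag itself
```

Because tags are ordinary content-addressed files, nothing special is needed to layer metadata on metadata.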

~~~
magicalhippo
> Example: You have a folder named “School” for all the documents and files
> related to school activities. You have a folder named “Photos” where you
> store thousands of pics taken over years. Now you take a couple of photos of
> activities in volleyball club. Where should you store these photos?

Hah this takes me back. My grandpa got his first computer when he was almost
80, running Windows 3.1. He had a non-trivial personal library, and had used
index cards[1] to keep track.

Fast forward a year, and I'm fixing some issue on his computer. When
recreating the issue, he shows me how he saved a copy of each email he sent as
a file on disk, so he could easily find it later.

He had saved not just one copy, no. He had made a directory tree matching the
Dewey decimal system[2], and saved one copy of the email in each relevant
subdirectory. You know, like he would write down the details of a book on each
relevant index card...

So an email with a friend would go into three or four 3-level deep sub
directories, depending on the topics discussed.

After showing me this, and me standing there in silence, shocked... he turned
and asked "is there a better way of doing this?"

I had to admit that no, not really. Not in Windows 3.1.

[1]:
[https://en.wikipedia.org/wiki/Index_card](https://en.wikipedia.org/wiki/Index_card)

[2]:
[https://en.wikipedia.org/wiki/Dewey_Decimal_Classification](https://en.wikipedia.org/wiki/Dewey_Decimal_Classification)

------
zvr
People interested in such topics should check out the SoftWare Heritage
persistent IDentifiers (SWHIDs):
[https://docs.softwareheritage.org/devel/swh-model/persistent...](https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html)
They provide references to files and directories (and even specific releases
or commits), and they are now an official IANA URI scheme.
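For a single file, the SWHID is (if I'm reading the spec right) the Git-compatible blob hash of its bytes, which makes it trivial to reproduce. A sketch:

```python
import hashlib

def swhid_content(data: bytes) -> str:
    # SWHID "content" objects reuse Git's blob hashing:
    # sha1("blob <length>\0" + data), rendered as swh:1:cnt:<hex>.
    header = f"blob {len(data)}\0".encode()
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
```

The nice property is that anyone with a Git implementation can already verify these identifiers; see the linked spec for directory, release, and snapshot objects.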

~~~
adamretter
I think SWHID are very interesting and pretty well thought out.

Whilst they can be used for individual digital files, they look like they are
designed for hierarchical file-systems.

We are actually trying to move away from the concept of a hierarchical model
for our digital files. Instead we want a more flexible and abstract model,
where one or more hierarchies might describe an arrangement of files.

------
adamretter
Hi, I'm the author. I actually already posted this article earlier -
[https://news.ycombinator.com/item?id=23611403](https://news.ycombinator.com/item?id=23611403)

Should I respond here or in the earlier post?

~~~
cpach
IMHO: It’s best to respond here, since this post is on the front page now, has
comments already and the other one is off the front page and has no comments.

~~~
adamretter
Thanks cpach. I have posted my replies here. Is it normal to post something to
HN, then have someone else repost the same thing just a few minutes later?

~~~
cpach
I don’t think it happens too often. Having a canonical URL for an article
helps, since otherwise the HN dupe-checker will not catch the dupe.

------
sanqui
Interesting article, but I expected it to deal with preserving information
like file source, location, and metadata, which is IMO still an unsolved
problem in digital archiving.

Specifically avoiding the use of Multihash in favor of a new scheme doesn't
sound very forward-thinking; it's not as complex as the author suggests (to
begin with, you can just abort for every prefix but the one you recognize).
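That "abort on any unrecognized prefix" approach really is tiny to implement, e.g. (a sketch; `0xb220` is assumed to be the blake2b-256 code from the multicodec table):

```python
def read_varint(buf: bytes) -> tuple:
    # Decode an unsigned LEB128 varint; return (value, bytes consumed).
    value, shift = 0, 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")

KNOWN = {0xB220}  # assumption: the one hash code this archive recognizes

def check_prefix(multihash: bytes) -> None:
    code, _ = read_varint(multihash)
    if code not in KNOWN:
        raise ValueError(f"unsupported hash code 0x{code:x}")
```

So supporting "as few hashes as possible" and using the Multihash encoding aren't mutually exclusive: the table only lists what the prefix may mean, not what an implementation must accept.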

~~~
adamretter
> preserving information like file source, location, and metadata

Sorry about that. It's definitely something we have been working on. Without
going into too much detail here, we have: 1. the physical Digital File itself
(i.e. the bitstream), which we might name with the ACID; and 2. one or more
Digital File metadata objects for each file that reference the physical file.
The metadata object has an instance-specific identifier (a unique id) and
links to the physical file via its ACID. The metadata object holds all of the
properties about that instance of the physical file, e.g. its name.
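A loose sketch of that two-object split (field names invented for illustration, with blake2b standing in for the actual ACID hash):

```python
from dataclasses import dataclass
import hashlib
import uuid

@dataclass(frozen=True)
class DigitalFile:
    acid: str          # content-derived identifier of the bitstream

@dataclass
class FileMetadata:
    instance_id: str   # unique per metadata instance
    acid: str          # link to the physical file
    name: str          # transient properties live here, not on the file

bits = b"...bitstream..."
physical = DigitalFile(acid=hashlib.blake2b(bits, digest_size=32).hexdigest())
meta = FileMetadata(instance_id=str(uuid.uuid4()),
                    acid=physical.acid,
                    name="minutes-1953.pdf")

# Renaming touches only the metadata object; the ACID never changes.
meta.name = "council-minutes-1953.pdf"
```

This is what answers the article's "is it still the same digital file?" question: the bitstream's identity survives any number of renames.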

------
j-pb
I expected another abomination like DOI.

I've seen librarians proudly announce that they've made software archivable by
giving git commit hashes a DOI _shudder_. Replacing a, for all intents and
purposes, unforgeable id with a short number wrapped in a URL that can be
man-in-the-middled is NOT a bright idea.

Finding Blake-256 was a pleasant surprise; content-aware hashing is the only
good way to do it.

------
chaz6
It is nice that the author recognizes the possibility of a hash collision.
Time and time again I see UUIDs or other hashes used in systems with the
assumption that a collision is impossible.
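For anyone wanting to sanity-check that assumption, the usual birthday-bound estimate is enough: the chance of at least one collision among n random b-bit values is roughly n²/2^(b+1). A quick sketch:

```python
def collision_probability(n: int, bits: int) -> float:
    # Birthday-bound approximation: p ≈ n^2 / 2^(bits + 1).
    # Valid while p is small; it overstates p as it approaches 1.
    return n * n / 2 ** (bits + 1)

# Even a trillion files under a 256-bit hash:
p = collision_probability(10**12, 256)
assert p < 1e-50  # accidental collision is negligible
```

The practical takeaway is that the realistic risk is a deliberately broken algorithm (as happened to MD5 and SHA-1), not random chance; which is why planning for algorithm migration, as the author does, matters more than collision handling.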

