
Multihash – self-describing hashes - _prometheus
http://multiformats.io/multihash
======
_prometheus
Hey everyone. The SHAttered break made us release a new website for
Multiformats: [http://multiformats.io](http://multiformats.io) which includes
a page explaining Multihash --
[http://multiformats.io/multihash](http://multiformats.io/multihash) --

The page walks through a bunch of examples of how a multihash works in
practice, explains why it is useful to do this, and answers some FAQs.

I'm writing a post about all this that I'll push out soon, but wanted to drop
the site here now.

Please, as you make new systems or start to upgrade your systems out of SHA1
(or MD5!!!), please use Multihash so that your upgrades will be much, much
easier in the future. It makes rolling over to a new hash a matter of
recompiling some tools, instead of facing the horrible breakages that occur
when tools and scripts assume certain things (like that git's hash digest
will always be 160 bits... now git faces breaking a lot of things and may
have to stick with 160 bits of a truncated SHA3 or BLAKE2...)
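
To make the layout concrete, here's a minimal sketch in Python -- the 0x12
code for sha2-256 comes from the multihash table; the helper name is just
illustrative:

    import hashlib

    # Multihash layout: <fn code><digest length><digest>.
    # 0x12 is the table code for sha2-256; both prefix fields are varints,
    # but values below 128 fit in a single byte.
    SHA2_256 = 0x12

    def multihash_sha2_256(data: bytes) -> bytes:
        digest = hashlib.sha256(data).digest()
        return bytes([SHA2_256, len(digest)]) + digest

    print(multihash_sha2_256(b"hello").hex())
    # -> "1220...": fn code 0x12, length 0x20 (32 bytes), then the digest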

Oh also-- definitely check out BLAKE2, it's a fantastic hash function from
Zooko, JP Aumasson, Samuel Neves, and Christian Winnerlein --
[https://blake2.net](https://blake2.net) -- use it for all your hashing
needs! So much faster than SHA2, and it has the same "unbrokenness" that SHA3
enjoys. (And given the deep cryptanalysis that has gone into BLAKE, I believe
Keccak is not particularly safer.)

The Multiformats site also has a Multiaddr description, but that's far from
complete and doesn't have examples yet. The other multiformats aren't well
explained on the website yet. Sorry, coming soon :) we wanted to get this out
ASAP -- PRs accepted at
[https://github.com/multiformats/website](https://github.com/multiformats/website)
(oh and yes, TLS certs coming too)

~~~
mtdewcmu
You could probably omit the length, because the length is probably already
known implicitly. E.g. git knows the length of its hashes without having to
read the hash. If the length < the full hash, it can be assumed to be
truncated.

~~~
cmrx64
Some hashing algorithms have a configurable output length. You need to encode
the length, or have it as part of the type, and it's more uniform to just have
it separate from the type.
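
BLAKE2b is a concrete case: Python's hashlib exposes the configurable output
size directly. A quick sketch (sizes picked arbitrarily):

    import hashlib

    # Same function, three different digest lengths -- the type code alone
    # can't tell you how long the output is.
    for size in (20, 32, 64):
        d = hashlib.blake2b(b"hello", digest_size=size).digest()
        print(size, d.hex())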

~~~
mtdewcmu
I guess it depends on how precious space is. A naked hash is very information-
dense. In certain applications, inserting a prefix of several bytes to each
hash makes a difference. OTOH, if the hashes end up being inserted into a
table in MySQL, then space is probably not that precious.

------
DannyBee
IMHO: Using varint for this is silly, and just wastes decoding time. (Doing
it because proto/etc did it, when they have _wildly_ different use cases, is
a bad idea.)

The example they purport to give even proves it: " For example, a 100 TB file
in IPFS may have as many as 400 million subobjects, which would mean 400
million hashes.

400,000,000 hashes * (7 - 2) bytes = 2 GB"

Right, so uh, that's 0.002% of that 100tb. (IE who cares)

(This already appears to be the difference between what it would normally
take, at 7 bytes, and what it takes with varint, which is ~17 bits per int)

But assume we go crazy, and use 8 bytes per hash.

we have 3.2 gigabytes there. vs 800 megabytes for your varint encoding. Or,
for our 100 terabyte file, approximately, "who cares" vs "who cares".

if you could transfer at a gigabyte per second, the version with varint would
take ~28 hours. The version with fixed-size 8 bytes would take ... 2 seconds
longer.

Remind me again how this is showing it's a good idea?

Also, if you really wanted to use variable size integers, variable byte
encoding is pretty much the slowest way to do it.

See, e.g,
[https://arxiv.org/pdf/1209.2137.pdf](https://arxiv.org/pdf/1209.2137.pdf)

Using variable-sized integers here doesn't buy meaningful space savings, as
shown above, but it does make it more annoying to splat these in and out of
memory.
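
(For reference, the variable byte encoding in question is the LEB128-style
7-bits-per-byte scheme; a rough sketch, helper names mine:)

    # 7 payload bits per byte; high bit set on every byte except the last.
    def varint_encode(n):
        out = bytearray()
        while True:
            b = n & 0x7F
            n >>= 7
            out.append(b | (0x80 if n else 0))
            if not n:
                return bytes(out)

    def varint_decode(buf):
        n, shift = 0, 0
        for i, b in enumerate(buf):
            n |= (b & 0x7F) << shift
            shift += 7
            if not b & 0x80:
                return n, i + 1  # (value, bytes consumed)
        raise ValueError("truncated varint")

    print(varint_encode(0x12).hex())   # "12" -- one byte for small codes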

~~~
_prometheus
the table is so small you don't actually need to decode varints. any serious
application needing speed would build a lookup table for all the common values
-- in fact we will. size does matter. speed does matter.
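
a sketch of the kind of table I mean, keyed on the first two prefix bytes
(codes from the multihash table; names are illustrative):

    # Map common two-byte prefixes straight to (function, digest length),
    # so hot paths never decode varints at all.
    PREFIX_LUT = {
        b"\x11\x14": ("sha1", 20),
        b"\x12\x20": ("sha2-256", 32),
        b"\x13\x40": ("sha2-512", 64),
    }

    def classify(mh: bytes):
        entry = PREFIX_LUT.get(mh[:2])
        if entry is None:
            raise ValueError("uncommon prefix, fall back to full varint decode")
        return entry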

~~~
nwmcsween
It seems cargo-culted though, and IMO using a LUT makes it more annoying; why
not just suffix or prefix the hash with a delimiter and type?

------
advisedwang
Looks awesome, so long as it's easy to ensure attackers can't quietly make you
use a weak hash. Some JWT implementations quietly allowed the "none" algorithm,
which allowed an attacker to forge creds [1]. Any similar "pluggable" system
needs to learn from this and be careful to avoid letting attackers pick weak
algorithms.

[1] [https://auth0.com/blog/critical-vulnerabilities-in-json-
web-...](https://auth0.com/blog/critical-vulnerabilities-in-json-web-token-
libraries/)

~~~
_prometheus
That's right. It's really important to make sure there are restrictions on
which hashes to use if your system is receiving hashes and only checking them
for self-consistency.

Particularly relevant is "Crypto Extensibility" (formats and protocols being
able to extend a protocol) vs. "Crypto Agility" (using Crypto Extensibility
to run a large variety of algorithms simultaneously, with the key feature
that one can be downgraded to an old/possibly broken hash).

AGL describes it well here:
[https://www.imperialviolet.org/2016/05/16/agility.html](https://www.imperialviolet.org/2016/05/16/agility.html)

---

I've filed
[https://github.com/multiformats/multihash/issues/70](https://github.com/multiformats/multihash/issues/70)
to track this.

------
smsm42
I'm not sure I like that there's a fixed table purporting to contain IDs for
all hashes ever used in the universe. It sounds both inflexible - each time
somebody uses a new parameter or a new hash modification, it has to be
updated - and hard to use - you have to drag this table around in all code
that uses the protocol and remember to update it all the time. And all of
this seems to be in service of saving a couple of bytes - maybe it would be
better to have a more verbose but more flexible scheme that makes it
relatively easy to add and identify hashes and read them with many tools?

I have the same reservations about the variable-int thing - it would
sometimes save 2-3 bytes, but it makes processing with simple tools much
harder: many tools make it easy to cleave arbitrary strings into fixed pieces
without knowing a string's specific content, but variable-length fields make
that significantly harder. Is the gain really worth the complications?

I've seen the FAQ, but I'm not sure I agree with the authors that what they
did is actually best...

~~~
panic
_you have to drag this table around in all code that uses the protocol and
remember to update it all the time_

Won't some kind of table always be necessary unless the type contains actual
bytecode implementing the hash algorithm?

~~~
LeoPanthera
For a few extra bytes you could have the ASCII name of the hash function.
"sha256" is pretty clear, for example.

~~~
jnwatson
And then your code would still need a table mapping the string to the
implementation.

~~~
smsm42
Yes, but that's just one table - for the implementations the code supports. I
don't need the table just to understand what this hash is - only to verify
it. And if I want to add my private hash, I don't need to update shared
tables or worry that some other hash has the same ID as mine. I can just use
a reasonably unique string.

------
niftich
I wrote about Multihash the other day [1], but in that comment I refrained
from outright criticism. However, my biggest reservation about Multihash is
that it deliberately pollutes the value space of a hash digest, while in good
faith claiming to domain-segregate the value space of a digest. It's got no
magic number, no magic anything, really; nor can it, since the output of a
hash digest is an opaque binary value with any and all outputs equally likely.

Therefore, while before multihash there was sometimes no way to figure out
which hash function produced a given digest, after multihash there is no way
to figure out whether a particular digest is actually a multihash value, or a
real raw digest. The fact that it's a multihash value has to be inferred from
context or out-of-band, just like before, in which case it brings little
additional value that the particular hash function is encoded _inside_ the
value, when it could just as well have been communicated in the same external
context.

Then comes the problem of the lookup table, which was left overridable instead
of globally prescriptive, so without additional context you don't even know
whether to use the global lookup, or someone else's special one.

I don't find their arguments laid out on the website persuasive. Really, this
is all information that should have been put somewhere else, not into the
digest, without affecting any of the "insecure warning" features that
implementations may bring. For IPFS, the first part of the URI would have been
an excellent place, function names and parameterizations and all, or if
they're trying to save space like they admit, the values from the global
lookup encoded as a path element. Like it or not, each hash function creates
its own sub-tree of all of its possible digests, and multihash values on their
own are not suitable to globally and unambiguously describe resources, because
they're only deterministic if it's known that the context is "just like the
defaults in IPFS". And for these reasons, I feel it muddies content-addressing
with no real gain, while all of its good ideas could have been implemented
differently inside of IPFS to better effect.

[1]
[https://news.ycombinator.com/item?id=13726589](https://news.ycombinator.com/item?id=13726589)

------
dwheeler
Multihash looks cool. It looks a little tricky to apply directly to git,
though. The problem is that git and git users treat the first few bytes
specially. In git, directories are split on the first few bytes, and users
particularly use the first few bytes to distinguish between commit ids.

I wonder if multihash might be useful in git if slightly reordered (so that at
least the first several bytes of the hash digest come first).

I'm thinking out loud; I'd be interested in hearing others' thoughts.

~~~
_prometheus
sounds interesting. we're definitely interested in figuring out ways for these
protocols and formats to be as useful to others as they can be

------
cryptarch
Cool to find this on HN!

Just last Thursday I added this to an "Oh fuck, we're still using MD5 for
passwords?" issue; Multihash should provide a really easy way of migrating to
a new hash once we've "Multihashified" all the existing MD5 hashes in the db.

~~~
tpetry
You should rather switch to a password-optimized hashing algorithm like
bcrypt or argon2, and not repeat the mistake of using a general-purpose fast
hashing algorithm.

~~~
cryptarch
You should rather assume less. It's a codebase I inherited and need to
housebreak as this company's first sysadmin.

I hadn't gotten around to reading the Multihash table; it does not contain
suitable hash functions, so I'm considering using a different protocol or a
custom table.

~~~
ianburrell
Multihash doesn't work for storing password hashes, both because the
cryptographic hash algorithms are too fast for password hashing, and because
password hashes need extra info like the salt and parameters like the work
factor for Bcrypt.

There are two standardized formats that can be used for storing password
hashes in one field. There is the RFC2307 format from LDAP which prefixes the
algorithm like "{MD5}". There is also the crypt format which uses prefixes
like "$1$".

There isn't a standard RFC2307 scheme for Bcrypt, but one can use "{CRYPT}"
for the crypt format, or "{BCRYPT}" with a crypt-style bcrypt hash.
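
To illustrate, a sketch of telling the two conventions apart (the helper name
is mine):

    # RFC 2307 (LDAP) style:  {SCHEME}hash-text
    # crypt(3) style:         $id$options$salt-and-hash
    def password_hash_scheme(stored):
        if stored.startswith("{") and "}" in stored:
            return stored[1:stored.index("}")]   # e.g. "MD5", "CRYPT"
        if stored.startswith("$"):
            return stored.split("$")[1]          # e.g. "1" (md5crypt), "2b" (bcrypt)
        return "unknown"

    print(password_hash_scheme("{CRYPT}$2b$12$abcdefghijklmnopqrstuv"))  # CRYPT
    print(password_hash_scheme("$2b$12$abcdefghijklmnopqrstuv"))         # 2b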

~~~
cryptarch
That's why one of my options is a custom hash algorithm table; Multihash has
explicit support for using your own hash functions. AFAIK part of
implementing a custom hash function table entry in Multihash is a protocol to
encode metadata alongside the hash, e.g. by appending the salt to the hash,
perhaps separated by some special character (e.g. base64-encoded hash + ":" +
base64-encoded salt). This is not part of the table, so maybe that would make
it inconvenient to use. I haven't read enough of the code to know.

I do think it would be much nicer to have this in the Multihash library
itself. Easy hash function migration is a great feature to have when you have
to store user passwords.

I also find standardized text-based formats for storing multiple types of
hashes in one field interesting, but I'm not sure if it's worth the effort to
implement myself vs. extending Multihash. I like that Multihash encapsulates
the parsing logic and that it has libraries in many languages, so I can easily
interact with the DB entries from e.g. scripts or new codebases.

I think the cool thing about Multihash is the very broad platform support, the
small scope, the convenient API to interact with a DB + hash function table,
and the default function table; I'd only be missing out on the default
function table if I were to use a custom function table.

------
besselheim
Seems rather like the ASN.1/DER based encodings already used in crypto to
describe hash outputs, except using an integer rather than an OID to describe
the hash type.

------
jnwatson
Crypto has a long history of using ASN.1 for its data types. I'd prefer
defining the data types in that and calling it a day. It isn't like DER would
be any harder to parse than a new byte-oriented custom format like this.

------
runeks
> It is useful to write applications that future-proof their use of hashes,
> and allow multiple hash functions to coexist.

Is the suggestion that applications, which have now switched to SHA-256,
should serialize this using Multihash, in order to be able to easily switch to
a new hash function in 15 years, when SHA-256 is broken?

I think, for most cases, it makes more sense for a protocol/application to
just define that v1 uses SHA-256, and, if this ever breaks, we'll update the
protocol/application version (rather than change the protocol's hash
serialization format). Unless you explicitly need to support many different
hash functions.

For example, let's say your app needs to securely hash many files, and store
all these hashes somewhere. Would you rather serialize thousands of hashes,
all of the same type, each with a prepended identifier and length? Or would
it make more sense to put the hash function identifier in a single place, and
use this for all the hashes? So perhaps Multihash would benefit from
separating the versioning scheme from the serialization format of the hash
itself?
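
To sketch the alternative (the header format here is invented purely for
illustration):

    import hashlib

    files = [b"file one", b"file two", b"file three"]
    digests = [hashlib.sha256(f).digest() for f in files]

    # Name the function once for the whole collection, then store raw
    # fixed-size digests -- one identifier instead of one per hash.
    blob = b"sha2-256\x00" + b"".join(digests)
    print(len(blob))  # 9-byte header + 3 * 32 bytes = 105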

~~~
fpgaminer
The article discusses a separate issue: tooling assuming the hash length
won't change. That issue came up with the recent SHA1 break and git. Lots of
git tooling assumes the hash is 160 bits. For example, a website displaying
git history might be formatted for the hash field to be exactly 320 pixels
wide, enough to fit a hex-encoded 160-bit string in an 8px monospace font. If
suddenly the hash needs to be 256 bits, the site needs to be redesigned.

Multihash seems to argue that using a TLV encoding on the hash will make it
clear that the hash length is dynamic, so tooling will be designed with that
in mind.

Simply assuming that you can rev the version of your protocol in the future
doesn't communicate to outside developers that, for example, hash lengths in
your protocol could change in the future.

I'm not sure I agree with Multihash's argument, though, and on average I
agree with your methodology in systems I design. Keeping the crypto fixed per
version makes implementations and tooling simpler, and thus easier to debug,
with less chance of [INSERT WORD]Bleed. That outweighs the supposed
future-proofing of things like multihash, in my mind, and I also suspect a
lot of designers won't take the hint even when presented with a TLV-encoded
hash.

------
jvehent
ASN.1 just called and wants its encoding back...

------
nickcw
MD5 doesn't seem to be in the table...

That is going to make it harder to change code to use multihash and then
migrate away from MD5 gradually.

MD5 may be broken as a crypto hash, but there are probably more MD5 hashes in
the wild than any of the others in the table (eg one on every S3 object!), so
not having a representation for it seems short-sighted.

~~~
_prometheus
Yeah you're right, we've been meaning to add MD5. Filed issue:
[https://github.com/multiformats/multihash/issues/71](https://github.com/multiformats/multihash/issues/71)

------
philsnow
I like the idea of multihash. Using a registry mapping hash names to numbers
seems like it might hinder adoption, though.

If I want to use a hash that isn't yet in multihash's registry (or will never
be because reasons, but I nonetheless want to use it), what do I do?

I could make a forked copy of the registry where I allocate my own number to
the hash, but then I have to guess very well which numbers will not be used by
future allocations in multihash. If I guess wrong, I might have to go through
every hash I've encoded with my forked registry and re-encode them, which
could be very onerous if they're spread across S3, Glacier, cold storage, etc.

If there is a well-defined set of numbers that are set aside for personal /
non-sanctioned use, I have the same problem when examining hashes from other
parties, because they may have chosen a different number for a particular
hash, or we may have conflicts. Likewise, when a hash "graduates" from the
personal space to the official mapping, all the old hashes are now encoded
"wrong": they'll keep working, as long as you ensure the hash keeps a number
in the personal space, but now you have to choose, when making new hashes,
whether to use the official number "for consistency (with the standard)" or
the personal one, also "for consistency (with your existing hashes)".

It looks like every hash in multihash has a number (like 0x13), a name (like
"sha-512"), and a description ("The SHA512 Cryptographic Hash Function") [0].
I suggest letting people use e.g. "{name}:{len}:{hash}" if they choose to make
the tradeoff in favor of readability / portability (with lower performance and
marginally higher storage cost).

Then when those hashes get read by some other person's copy of multihash, it
will either map the {name} to the local number (even if that's different from
the number in the multihash library that encoded the hash), or, if the {name}
doesn't show up, it can throw an appropriate error. This is forwards
compatible, and I think it will allow more people to be comfortable adopting
multihash, because they won't have to think so much up-front about how to
encode hashes that aren't in the registry.
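
A sketch of the parsing side (table contents here are just examples):

    # Parse the suggested "{name}:{len}:{hash}" form against a local registry.
    KNOWN = {"sha2-256": 0x12, "sha-512": 0x13}

    def parse_readable(s):
        name, length, hexhash = s.split(":", 2)
        if name not in KNOWN:
            raise ValueError("unknown hash name: " + name)
        digest = bytes.fromhex(hexhash)
        if len(digest) != int(length):
            raise ValueError("length field does not match digest")
        return KNOWN[name], digest

    print(parse_readable("sha2-256:32:" + "ab" * 32))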

[0]
[https://github.com/multiformats/multihash/pull/41/files#diff...](https://github.com/multiformats/multihash/pull/41/files#diff-71c19e973ffcdb60141b3cc511acadb4R59)

~~~
_prometheus
Codes reserved for app-specific things should satisfy your concerns. coming
soon:
[https://github.com/multiformats/multicodec/issues/39](https://github.com/multiformats/multicodec/issues/39)

One thing you should consider about using full names (like "sha2-256") is that
you still have to keep a table. What does "sha2-256" even mean? you know from
context. you know that maps to a particular algorithm and particular set of
implementations from context.

Writing a table and distributing it is cheaper than writing a new
implementation of a new hash function and distributing that. The
"distributing code" part still has to happen either way.

I think what you would like is for some body (NIST + all the others who make
hash functions) to maintain the registry. That's fine, we can hand this table
to IANA (though that's harder to update than a PR on Github).

~~~
philsnow
Fair point about ambiguity of what a given string means.

> I think what you would like is if some body (NIST + all others who make hash
> functions) was maintaining the registry.

Not really, that sounds ungainly.

------
y7
I wonder if storing the hashing parameters next to the hash is the right
approach to take for systems. Your system isn't going to support all hashes
(adds a lot of library bloat, you don't want to support weak hashes), so
you'll probably want one specific algorithm/parameter set, with room for
upgrades in case weaknesses get discovered (not likely to occur often). A
single byte/varint, starting at 0 and incrementing for every new parameter set
sounds reasonable. A downside would be interoperability, but if that's your
main concern then perhaps a human-readable string representation of the hash
algorithm/parameters would be preferable.
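
Sketching what I mean (the function choices are just illustrative):

    import hashlib

    # One version byte per pinned parameter set; bump it on upgrade.
    VERSIONS = {
        0: lambda data: hashlib.sha256(data).digest(),
        1: lambda data: hashlib.blake2b(data, digest_size=32).digest(),
    }
    CURRENT = 1

    def make_hash(data):
        return bytes([CURRENT]) + VERSIONS[CURRENT](data)

    def check_hash(stored, data):
        return stored[1:] == VERSIONS[stored[0]](data)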

What are the benefits of using Multihash?

~~~
_prometheus
Truncations happen in many sizes. The combinations here are not helpful: N
functions * M commonly chosen sizes. Simple assumptions: 20 functions * 10
sizes = 200 codes. already not good.

You can print a human-readable version of multihash as:

sha2-256-20B-<hex>

Using the same values from the table. But again, it is preferable that it
travels in-band with the values, as otherwise it will likely get chopped up
and placed into an out-of-band context.

------
barrystaes
I don't get the trend of security hashes / encryption trying to describe
themselves with a header; it just makes breaking the hashed / encrypted data
faster... no humans needed, fully automated hacking at scale!

~~~
pwdisswordfish
Kerckhoffs' principle: security should depend on the secrecy of the keys, not
of the algorithm. Do you object to PGP or PKCS#7 because they attach metadata
to the encrypted data, so that the attacker knows which cipher/key they should
be cracking?

------
mtdewcmu
This seems kind of like a cross between a UUID and a plain hash. Perhaps it
would be better to incorporate the hash into a UUID. UUIDs already have
metadata.

~~~
mtdewcmu
Uuids are too short, but I think that Uuids are the relevant prior art for
this type of thing.

------
tehlike
Why not just encode a proto:

    message Hash {
      optional HashFunction hash_function = 1;
      optional bytes digest = 2;
    }

