
Humanhash: Human-readable representations of digests - zacharyvoase
https://github.com/zacharyvoase/humanhash
======
beagle3
By far the best solution to this problem that I've seen or used is:
<https://github.com/singpolyma/mnemonicode> ; original docs here:
<http://web.archive.org/web/20101031205747/http://www.tothink.com/mnemonic/>

------
feralchimp
Anyone doing something about the UX of security is a friend.

I agree with various commenters' would-be-nice features (larger word list for
more bits represented per word, better phonetic distance between items in the
word list), and have one of my own (an implementation in C with no library
dependencies would be nice).

Solid effort overall.

------
mseebach
It might be useful for confirming that you have the right file over the phone,
or for other crude applications - but you're checking such a small subset of
the digest that crafting a fake payload with the same humanhash becomes
downright trivial.

With that in mind, the code could be much simpler: each possible pair of hex
digits gets assigned to one of 256 words, and the humanhash is the words for
the first n pairs of the digest.
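
A minimal sketch of that naive scheme (the wordlist here is a stand-in, not
the one from the repo):

    import hashlib

    # Stand-in wordlist: 256 distinct words, one per possible byte value.
    WORDS = ["word%03d" % i for i in range(256)]

    def naive_humanhash(digest, n=4):
        """Map each of the first n digest bytes straight to a word."""
        return "-".join(WORDS[b] for b in digest[:n])

    naive_humanhash(hashlib.sha256(b"example").digest())  # => four words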

Or am I missing something big?

~~~
zacharyvoase
I wanted the humanhash to depend on the whole of the input, so I use a
compression function (similar to a checksum) rather than simply selecting the
first `n` bytes of input.
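
A sketch of the general idea, under my assumption of how such a checksum
might look (not necessarily the exact code in the repo): split the digest
into N segments and XOR each segment down to one byte, so every input byte
affects the result.

    from functools import reduce
    from operator import xor

    def compress(digest, target=4):
        """Fold the whole digest into `target` bytes."""
        seg = len(digest) // target
        # One output byte per segment; leftover bytes fold into the last.
        chunks = [digest[i * seg:(i + 1) * seg] for i in range(target - 1)]
        chunks.append(digest[(target - 1) * seg:])
        return bytes(reduce(xor, chunk, 0) for chunk in chunks)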

The humanhash is _not_ intended to guarantee any level of collision
protection, it's simply supposed to be a human-memorable representation of a
much longer underlying digest. When computers exchange representations, they
should certainly use the full digest.

However, even with only four bytes, any two values have only a 1 in 2^32
(about 4.3 billion) chance of colliding. So for user interfaces, where a user
merely has to distinguish between different values, or even for a situation
where two people want to check that they're talking about (probably) the same
thing, it remains useful.

~~~
waitwhat
Your choice of words seems poor. Many of the words sound similar, even words
next to each other in the list such as "winner"/"winter", "march"/"mars" and
"six"+"ten"/"sixteen". You might want to look at e.g.
<http://en.wikipedia.org/wiki/PGP_word_list> to see how others have solved the
same problem.

It doesn't appear that you've made any attempt to secure this humanhash
against a malicious third party.

If the input to humanhash really is the result of a cryptographically-strong
hash function, then at first glance it looks like the "XOR everything into an
n-byte buffer" is no stronger (and potentially actually weaker) than the naive
"just use the first n bytes". (Disclaimer: not my field, although I did take a
couple of formal crypto classes once.)

If the input to humanhash is arbitrary, and you aren't protecting against a
malicious third party, then why not use a much-better understood CRC rather
than "XOR everything into an n-byte buffer"?

EDIT: On reflection, the input to humanhash has to be short (because the
compression function needs to know the length of the input and so can't
operate on a stream) and of known length (because you're just xor'ing, and
it's trivial to get two inputs to "compress" to the same thing by padding one
with NUL bytes.)
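
Assuming the compression really is a cyclic "XOR everything into an n-byte
buffer", the NUL-padding collision is easy to demonstrate:

    def xor_fold(data, n=4):
        """Cyclically XOR every input byte into an n-byte buffer."""
        buf = bytearray(n)
        for i, b in enumerate(data):
            buf[i % n] ^= b
        return bytes(buf)

    # XOR with zero is a no-op, so NUL padding never changes the result.
    assert xor_fold(b"some digest") == xor_fold(b"some digest" + b"\x00" * 7)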

~~~
zacharyvoase
You're correct, humanhash was never intended to be secure. I felt that, for
usability, I'd have to surrender some collision protection in exchange for
brevity. I was originally taking the first N bytes, but I decided to use a
simple XOR-based checksum over N chunks of the input instead. This may have
been cargo cult cryptography.

I hadn't seen the PGP word list; thanks a lot for that. I would integrate that
list this very second if I could, but I'm concerned that PGP owns the
copyright to the list.

~~~
Arelius
It's generally accepted, with modern cryptographic hash functions, that it's
better to just take the first N bytes rather than run a naive checksum over
the hash.

------
gwillen
I have a slightly more complex scheme that eliminates some of the drawbacks of
humanhash, while adding some of its own. Given a word-hash and the number of
bits per word that were used to generate it, it gives back the original
numeric hash in the obvious way: hash each word, truncate the hash, and
concatenate the bits. In the other direction, given a numeric hash, a wordlist
to draw from, and a desired number of bits per word, it finds all the words in
the dictionary that truncate to the appropriate hash bits, offers the user a
choice of them, and concatenates the chosen words.
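
A rough sketch, where the per-word hash (SHA-256, truncated) and the chunk
layout are my own assumptions, not gwillen's exact choices:

    import hashlib

    def word_bits(word, k):
        """First k bits of a hash of the word."""
        h = hashlib.sha256(word.encode()).digest()
        return int.from_bytes(h, "big") >> (len(h) * 8 - k)

    def decode(words, k):
        """Words back to the numeric hash: hash, truncate, concatenate."""
        value = 0
        for w in words:
            value = (value << k) | word_bits(w, k)
        return value

    def encode_choices(value, nbits, wordlist, k):
        """For each k-bit chunk, the candidate words the user picks from."""
        mask = (1 << k) - 1
        chunks = [(value >> (nbits - k * (i + 1))) & mask
                  for i in range(nbits // k)]
        return [[w for w in wordlist if word_bits(w, k) == c] for c in chunks]

Decoding is deterministic; encoding offers every word in the list whose
truncated hash matches each chunk.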

This has the nice property that a given set of words uniquely translates back
to a specific hash, and that the user can use their favorite wordlist as long
as it's large enough. If you were to use it for recognition rather than
memorization (and wanted fewer words to remember) you would have to manually
decide where to truncate the hash before converting it. If you wanted to pick
a specific wordlist, you could do that.

I originally came up with this as a system for helping to memorize IPv6
addresses; it also works as a system for passphrase generation (first generate
a random number with the desired number of bits of entropy, then convert to
words.)

------
lubutu
This seems to work by "compressing the input digest to a fixed number of
bytes, then mapping those bytes to one of 256 words." So since there are only
256 words, you end up with a lot less information than you put in, unless you
use some sixteen words (one per byte of a 16-byte digest).

It would be better if we had a much larger dictionary, since we could map
more bytes to a single word -- with a list of 65,536 words we would need only
8 words per 16-byte hash, which I think is a lot more manageable.
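
The arithmetic, as a quick sketch: each word carries log2(wordlist size) bits.

    import math

    def words_needed(digest_bytes, wordlist_size):
        bits_per_word = math.log2(wordlist_size)
        return math.ceil(digest_bytes * 8 / bits_per_word)

    words_needed(16, 256)    # => 16 words at 8 bits each
    words_needed(16, 65536)  # => 8 words at 16 bits each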

~~~
zacharyvoase
I wouldn't mind using a list of 65,536 words, but I came up with the existing
list by hand, and it took me a good half-hour. I guess it would be possible to
generate a wordlist, or even combine phonemes to make fake but valid-sounding
words (similar to a password generator); still, as a proof of concept I feel
the current list has value.
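
For what it's worth, a tiny sketch of the phoneme-combination idea (the
consonant and vowel sets here are arbitrary choices of mine):

    from itertools import product

    CONSONANTS = "bdfgklmnprstvz"  # 14 consonants
    VOWELS = "aeiou"               # 5 vowels

    # 14 * 5 = 70 CV syllables; two per word gives 70 ** 2 = 4,900
    # pronounceable fake words like "kavo"; three per word would give
    # 343,000, enough for a 16-bit-per-word list.
    SYLLABLES = [c + v for c, v in product(CONSONANTS, VOWELS)]
    FAKE_WORDS = [a + b for a, b in product(SYLLABLES, repeat=2)]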

Note that I never intended this to have all the uniqueness of the original
digest. It's merely supposed to be 'good enough' for UI purposes.

~~~
wisty
You could use NLTK's WordNet to generate a bunch of words of certain kinds.
For a proof of concept, most UNIX systems already ship a list of words,
usually in /usr/share/dict/words (OS X) or sometimes /usr/dict/words.

/usr/share/dict/words is likely to be pretty big, over 200k words, but it's
still human-readable. You could prune it by removing proper nouns (i.e.
anything capitalized).
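
A sketch of that pruning (the path and length thresholds are assumptions):

    with open("/usr/share/dict/words") as f:
        words = [line.strip() for line in f]

    # Keep lowercase alphabetic entries only: this drops proper nouns
    # and possessives, and limits words to a memorable length.
    pruned = sorted({w for w in words
                     if w.isalpha() and w.islower() and 3 <= len(w) <= 7})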

~~~
muxxa
It might also be a good idea to take the Metaphone of each word and only
select words from the dictionary whose Metaphones are unique.
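
A sketch of that filter, assuming the third-party jellyfish library (which
provides a metaphone function):

    import jellyfish  # third-party: pip install jellyfish

    def dedupe_by_sound(words):
        """Keep one word per Metaphone key, so survivors sound distinct."""
        seen, kept = set(), []
        for w in words:
            key = jellyfish.metaphone(w)
            if key and key not in seen:
                seen.add(key)
                kept.append(w)
        return kept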

------
splitbrain
To make this useful, both parties verifying a key need to use exactly the
same word list. So either there needs to be a standardized wordlist, or the
words have to be generated by the algorithm itself (e.g. using syllables, as
others have suggested). Standardizing an algorithm is usually simpler than
standardizing an arbitrary word list.

------
yread
Very cool. Looking at the source, it seems to further digest the digest down
to only 4 bytes by default. It would be cool to increase the size of the
dictionary to, say, 12 bits per word (instead of 8).

------
elg0nz
I wonder if the author has considered SSH Randomart as a good solution to
this problem. IMHO our brains are better suited to finding differences in
images than in sequences of words, however verbose those might be.

------
ssp
The hashes would be easier to remember if they were sentences that said
something funny, for example "Green elephant riding a bicycle" or "Golden
ostrich selling a train".

~~~
zokier
Like Unique Error Phrases <http://news.ycombinator.com/item?id=2991764>

------
nodata
PGP presents something readable like this for key fingerprint confirmation
over the phone. I think it uses the entire thing, though.

------
username3
like tripphrases? <http://worrydream.com/tripphrase/>

