There is a fourth category of hash functions: find similar items.
For English words, this is usually done with soundex / metaphone / doublemetaphone. Fun fact, soundex was developed over 100 years ago to help immigrants find their families in USA even if they didn't know how to spell their last name. Soundex is supported by most DBs.
For images, there are hashes like ahash/phash/dhash/whash/blockmeat/colormoment/colorhas/marrHildreth .
Most popular Audio hash is acoustid. There was also EchoPrint (with two incompatible versions), but once EchoNest was bought out by Spotify the public support and development into echoprint died. Some groups of people tried to keep echoprint alive but I am unaware if it is still used by anyone publicly.
Soundex is still out there. If you live in Illinois, Florida, or Wisconsin your drivers license number is a soundex hash of your name combined with your birthday.
TIL! Fascinating! Of course my first thought was.. “what about hash collisions?” And naturally each state has selected their own resolution. In Illinois you can apparently be detained for a while due to a hash collision with a suspect.. plate numbers aren’t unique apparently!?
License plate numbers are unique. Two somewhat similar names can resolve to the same Drivers License number, but in those (extremely rare) cases it's just a matter of checking the actual name against the warrant.
While researching Soundex in the mid/late 80's, I of course soundex'd my last name and realized that it was the first part of my state's driver's license #.
If I remember correctly, Apple's Voxel Net (~2018 neural network, used for doing machine learning on sparse lidar point-cloud data) uses a simple "spatial hash" for this purpose, using the coordinate of the voxel as a key.
Yeah, hashing for spatial data is a fairly common technique (and not invented for that paper). It totally makes sense for applications where there is a need to index sparse data in a potentially very large or even unbounded physical space. There's some hashes that are specialized for keys that are 3-dimensional points or voxel coordinates, but almost any off-the-shelf hash will do if applied to say, the concatenated indices of a voxel in a given coordinate frame.
I think you're conflating "functions used to build a hash table" with "functions used to build a dictionary ADT." A hash table doesn't have to serve the purpose of storing items in a dictionary ADT. A hash table where everything that's similar collides into the same buckets is just as useful for finding "a linked list of similar items" in O(1), as a hash table where everything gets its own unique bucket is useful for finding items in O(1).
It's true that these functions are in some sense very different to traditional hash functions, because they are designed to create "collisions" (or at least be similar) for elements that are similar. But I still think it is intuitive to use the term, maybe qualified as a "similarity hash" (I've also seen "perceptual hash" or "semantic hash"). In both cases we are describing a (potentially) large object with a short code. In normal hashes, the hash is supposed to be the same for objects that are exactly the same - in similarity hashes, that is softened to similar hashes for similar objects.
They are called similarity digests or sometimes "fuzzyhashing". For generic files there are Context Triggered Piecewise Hashing (CTPH), Trend Micro Locality Sensitive Hash (TLSH) and Similary Digest Hash (SDHash)
hash functions are ultimately just a mapping of an input set of high/infinite cardinality to an output set of fixed/finite cardinality
you correctly point out that some hash functions exist which map input words in some specific language into similarity groups based on some concept of semantics
but this is essentially unrelated to the topics raised in the linked article
It's not written super clearly (though the article itself is an interesting and ambitious piece of technical writing) but the impression I get is that he's referring to the role of hash functions in cryptosystems: for signatures, transcript hashes, key derivation, channel binding, and things like that. Cryptographic hash functions are the glue that binds crypto protocols together.
But you can also trivially turn a hash function into a cipher and encrypt with it (apologies if I missed an explanation of this in the article). Just hash a key and a counter together to create a keystream and XOR your plaintext to it. That's how Salsa20 and ChaCha20 work. (Interestingly, the reverse process --- converting a block cipher into a hash function --- is where we historically get our cryptographic hash functions from).
I’m familiar with hash functions for signatures, however I’m confused about the mention for encryption, in particular with respect to the context of common schemes like AES and RSA. I guess hashing a key is an option though.
You can spitball a hash-based stream cipher with any hash function in just a couple lines of python. Take a 128 bit key string, and then hash it with an incrementing counter to get successive 32 bytes of keystream data, just like you would with AES in CTR mode.
For English words, this is usually done with soundex / metaphone / doublemetaphone. Fun fact, soundex was developed over 100 years ago to help immigrants find their families in USA even if they didn't know how to spell their last name. Soundex is supported by most DBs.
For images, there are hashes like ahash/phash/dhash/whash/blockmeat/colormoment/colorhas/marrHildreth .
Most popular Audio hash is acoustid. There was also EchoPrint (with two incompatible versions), but once EchoNest was bought out by Spotify the public support and development into echoprint died. Some groups of people tried to keep echoprint alive but I am unaware if it is still used by anyone publicly.