Hash Functions (bretmulvey.com)
94 points by rdpintqogeogsaa on June 3, 2023 | 34 comments



There is a fourth category of hash functions: find similar items.

For English words, this is usually done with soundex / metaphone / doublemetaphone. Fun fact, soundex was developed over 100 years ago to help immigrants find their families in USA even if they didn't know how to spell their last name. Soundex is supported by most DBs.
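
As a rough illustration, classic (American) Soundex fits in a few lines of Python. This is a simplified sketch; real implementations, including the SOUNDEX() built into most databases, handle a few more edge cases:

    # Simplified classic Soundex: keep the first letter, map consonants to
    # digits, drop vowels, collapse adjacent duplicate codes, pad to 4 chars.
    CODES = {c: d for d, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}

    def soundex(name: str) -> str:
        name = name.upper()
        first, rest = name[0], name[1:]
        digits = []
        prev = CODES.get(first)      # the first letter's code suppresses an immediate repeat
        for c in rest:
            code = CODES.get(c)
            if code is not None and code != prev:
                digits.append(str(code))
            if c not in "HW":        # same codes separated only by H/W count once
                prev = code
        return (first + "".join(digits) + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))   # both R163; similar names collide on purpose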

For images, there are hashes like ahash/phash/dhash/whash/blockmean/colormoment/colorhash/marrHildreth.

The most popular audio hash is AcoustID. There was also EchoPrint (with two incompatible versions), but once EchoNest was bought by Spotify, public support for and development of EchoPrint died. Some groups tried to keep EchoPrint alive, but I'm unaware whether anyone still uses it publicly.


Soundex is still out there. If you live in Illinois, Florida, or Wisconsin your drivers license number is a soundex hash of your name combined with your birthday.

http://www.highprogrammer.com/alan/numbers/dl_us_shared.html


TIL! Fascinating! Of course my first thought was.. “what about hash collisions?” And naturally each state has selected their own resolution. In Illinois you can apparently be detained for a while due to a hash collision with a suspect.. plate numbers aren’t unique apparently!?


License plate numbers are unique. Two somewhat similar names can resolve to the same Drivers License number, but in those (extremely rare) cases it's just a matter of checking the actual name against the warrant.


While researching Soundex in the mid/late 80's, I of course soundex'd my last name and realized that it was the first part of my state's driver's license #.


Intriguing. I just checked mine, and California doesn't appear to do this.

If you're comfortable sharing, I'd like to know which state you're referring to. Cheers.


Another poster shared this link: http://www.highprogrammer.com/alan/numbers/dl_us_shared.html

It has the encoding formula for Wisconsin, Illinois, and Florida (which match the parent's comment) but there may be others.


Mine was Illinois. I got it in the early 1980's, and moved away shortly thereafter, so no idea if they still do.

Weirdly, I remember that # even now. Soundex-???-2DigitYearOfBirth-???

Looks like at least the format has changed since then.


If I remember correctly, Apple's VoxelNet (a ~2018 neural network for machine learning on sparse lidar point-cloud data) uses a simple "spatial hash" for this purpose, using the coordinate of the voxel as a key.


Yeah, hashing for spatial data is a fairly common technique (and not invented for that paper). It totally makes sense for applications that need to index sparse data in a potentially very large or even unbounded physical space. There are hashes specialized for keys that are 3-dimensional points or voxel coordinates, but almost any off-the-shelf hash will do if applied to, say, the concatenated indices of a voxel in a given coordinate frame.
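
A minimal sketch of that kind of spatial hash, assuming nothing about VoxelNet specifically (the multiply-and-XOR constants are the commonly cited ones from the graphics literature):

    # Bucket sparse 3D points by voxel using a simple spatial hash.
    from collections import defaultdict

    def voxel_key(x, y, z, cell=0.5):
        # Which voxel does the point fall into?
        return (int(x // cell), int(y // cell), int(z // cell))

    def spatial_hash(key, n_buckets=1 << 20):
        i, j, k = key
        # Classic large-prime XOR hash; any decent hash of the triple works,
        # including Python's built-in hash((i, j, k)).
        return ((i * 73856093) ^ (j * 19349663) ^ (k * 83492791)) % n_buckets

    grid = defaultdict(list)
    for pt in [(0.1, 0.2, 0.3), (0.4, 0.1, 0.2), (10.0, -3.2, 0.7)]:
        grid[spatial_hash(voxel_key(*pt))].append(pt)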


Those are examples of useful functions. They are not examples of useful hash functions.


I think you're conflating "functions used to build a hash table" with "functions used to build a dictionary ADT." A hash table doesn't have to serve the purpose of storing items in a dictionary ADT. A hash table where everything that's similar collides into the same buckets is just as useful for finding "a linked list of similar items" in O(1), as a hash table where everything gets its own unique bucket is useful for finding items in O(1).
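
A toy sketch of that: the bucket key is a made-up "similarity hash" (in practice you'd use Soundex, a perceptual hash, locality-sensitive hashing, and so on), so one lookup returns a list of similar items:

    # Hash table keyed by a similarity hash: each bucket collects items
    # that are "close", not items that are identical.
    from collections import defaultdict

    def sim_key(word):
        # Purely illustrative similarity hash: lowercase, strip vowels.
        w = word.lower()
        return w[0] + "".join(c for c in w[1:] if c not in "aeiouy")

    buckets = defaultdict(list)
    for w in ["Smith", "Smyth", "Smithe", "Jones"]:
        buckets[sim_key(w)].append(w)

    print(buckets[sim_key("Smith")])   # ['Smith', 'Smyth', 'Smithe'] in O(1)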


It's true that these functions are in some sense very different to traditional hash functions, because they are designed to create "collisions" (or at least be similar) for elements that are similar. But I still think it is intuitive to use the term, maybe qualified as a "similarity hash" (I've also seen "perceptual hash" or "semantic hash"). In both cases we are describing a (potentially) large object with a short code. In normal hashes, the hash is supposed to be the same for objects that are exactly the same - in similarity hashes, that is softened to similar hashes for similar objects.


Functions that create collisions can still be hashes, can't they?

Things like CRCs and checksums are loosely considered hash functions: https://en.wikipedia.org/wiki/List_of_hash_functions

MD5 is a classic example of a hash function with a low (but still real) chance of collision.
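
Both are one import away in Python, if you want to poke at them:

    # CRC32 is 32 bits, so collisions are easy to hit; MD5 is 128 bits,
    # so accidental collisions are vanishingly rare (though deliberate
    # ones can be constructed).
    import hashlib, zlib

    data = b"hello world"
    print(zlib.crc32(data))                 # 32-bit integer checksum
    print(hashlib.md5(data).hexdigest())    # 5eb63bbbe01eeed093cb22bb8f5acdc3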


Specifically, these functions do not provide "uniform distribution".


That isn’t a requirement of an algorithm called a hash function


From the article:

> For a function to be useful as a hash function, it must exhibit the property of uniform distribution

It's also listed as the first property on the Wikipedia page:

https://en.wikipedia.org/wiki/Hash_function#Uniformity
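
For a concrete sense of what that uniformity property means, here's a quick empirical check (a sketch; SHA-256 just stands in for any well-distributed hash):

    # Bucket a bunch of keys and compare the counts: a well-distributed
    # hash keeps buckets close to even, a degenerate "hash" piles
    # everything into one bucket.
    from collections import Counter
    import hashlib

    def bucket(key, n=8):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big") % n

    keys = [f"user-{i}" for i in range(10_000)]
    print(Counter(bucket(k) for k in keys))   # roughly 1250 per bucket
    print(Counter(0 for _ in keys))           # constant "hash": all 10,000 in bucket 0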


I can’t find the text you quoted. The article starts off:

> A hash function is any function that can be used to map data of arbitrary size to fixed-size values

And the first sentence in the section you link says

> A good hash function should map the expected inputs as evenly as possible over its output range.

It doesn’t have to be “good” to be a hash function.


That's not a meaningful definition of a hash function, then.

    def hash(val):
        return 0


Yes, this is a hash function. Whether it’s “useful” or “good” is the purview of the architect.

Tangential edit: https://xkcd.com/221/


The best thing about that one is that it seeds the panel with a random number every time you refresh


That has a pretty uniform distribution..


Today you would use embeddings optimized for similarity search.


Embedding models are still essentially hash functions, just very complex ones.


Moreover, and crucially, they are learned.


Robert C. Russell (and Margaret King Odell) invented SOUNDEX, which is described in US Patent 1,261,167 (1918); see also Knuth (1973), TAOCP vol. 3.

https://patentimages.storage.googleapis.com/31/35/a1/f697a3a...


They are called similarity digests, or sometimes "fuzzy hashing". For generic files there are Context Triggered Piecewise Hashing (CTPH, e.g. ssdeep), Trend Micro Locality Sensitive Hash (TLSH), and Similarity Digest Hash (sdhash).
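
A rough sketch of what using one looks like, assuming the python-ssdeep bindings (ssdeep.hash / ssdeep.compare); the exact API may differ by version:

    # CTPH (ssdeep): similar files get similar digests, and compare()
    # returns a 0-100 match score instead of a yes/no equality check.
    import ssdeep

    a = b"The quick brown fox jumps over the lazy dog. " * 100
    b = a.replace(b"lazy", b"sleepy")

    h1, h2 = ssdeep.hash(a), ssdeep.hash(b)
    print(ssdeep.compare(h1, h2))   # high score: the inputs are near-duplicates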


hash functions are ultimately just a mapping of an input set of high/infinite cardinality to an output set of fixed/finite cardinality

you correctly point out that some hash functions exist which map input words in some specific language into similarity groups based on some concept of semantics

but this is essentially unrelated to the topics raised in the linked article


My favorite use is for distribution / sharding of work in distributed systems, or for distributed locality of work in such a system.
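
The basic version is just a modulus over the hash; a sketch (real systems usually layer consistent or rendezvous hashing on top so that resizing the cluster only moves a fraction of the keys):

    # Hash-based sharding: the same key always lands on the same shard,
    # and keys spread roughly evenly across shards.
    import hashlib

    def shard_for(key, n_shards):
        h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
        return h % n_shards

    for key in ["user:42", "user:43", "order:9001"]:
        print(key, "-> shard", shard_for(key, 4))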


I don’t understand the encryption use case. Can someone elaborate?


It's not written super clearly (though the article itself is an interesting and ambitious piece of technical writing) but the impression I get is that he's referring to the role of hash functions in cryptosystems: for signatures, transcript hashes, key derivation, channel binding, and things like that. Cryptographic hash functions are the glue that binds crypto protocols together.

But you can also trivially turn a hash function into a cipher and encrypt with it (apologies if I missed an explanation of this in the article). Just hash a key and a counter together to create a keystream and XOR your plaintext to it. That's how Salsa20 and ChaCha20 work. (Interestingly, the reverse process --- converting a block cipher into a hash function --- is where we historically get our cryptographic hash functions from).


I'm familiar with hash functions for signatures; however, I'm confused about the mention of encryption, in particular in the context of common schemes like AES and RSA. I guess hashing a key is an option, though.


Modern hash functions pretty much offer encryption out of the box, like Gimli or the BLAKE2 family (I think they call it XOF mode).

This is pretty much thanks to the sponge construction.
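
Python's standard library doesn't ship Gimli or BLAKE2X, but it does include a sponge-based XOF (SHAKE256 from the SHA-3 family), which shows the same arbitrary-length-output behavior:

    # An XOF lets you squeeze out as many bytes as you like, which is
    # exactly what you need for a keystream.
    import hashlib

    keystream = hashlib.shake_256(b"key material" + b"nonce").digest(64)
    print(keystream.hex())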


You can spitball a hash-based stream cipher with any hash function in just a couple of lines of Python. Take a 128-bit key, then hash it with an incrementing counter to get successive 32-byte chunks of keystream, just like you would with AES in CTR mode.
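
Something like this, as a toy sketch (SHA-256 over key-plus-counter as the keystream; no nonce, no authentication, not for real use):

    # Toy CTR-style stream cipher built from a hash function.
    import hashlib

    def keystream(key, n_bytes):
        out, counter = b"", 0
        while len(out) < n_bytes:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n_bytes]

    def xor_cipher(key, data):
        return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

    key = b"0123456789abcdef"                  # 128-bit key
    ct = xor_cipher(key, b"attack at dawn")
    print(xor_cipher(key, ct))                 # decrypts with the same operation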



