
An Introduction to Hashing in the Era of Machine Learning - kqr2
https://blog.bradfieldcs.com/an-introduction-to-hashing-in-the-era-of-machine-learning-6039394549b0
======
bcheung
Another interesting thing that wasn't mentioned: the index can place
semantically similar items close together.

For example, look at Word2Vec. It "indexes"/"hashes" semantically similar
words into the same "bucket". This means you can search for semantically
similar data just by inspecting what is nearby. This has all kinds of
applications.
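
Roughly the idea, as a minimal sketch with gensim (the toy corpus and
parameters here are placeholders, not anything from the article):

```python
# Minimal sketch: train tiny Word2Vec embeddings, then look up "nearby"
# (i.e., semantically similar) words. Real uses need far larger corpora.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["dogs", "and", "cats", "are", "pets"],
]

model = Word2Vec(sentences, vector_size=32, min_count=1, seed=42)

# Words that land close together in the vector space are the "same bucket".
print(model.wv.most_similar("cat", topn=3))
```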

~~~
jzoch
Locality-sensitive hashing (what you describe) is really useful for clustering
and doing nearest-neighbor searches. It's used in digital fingerprinting,
genome sequencing, and other interesting areas.
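
For intuition, here's a minimal random-hyperplane LSH sketch for cosine
similarity (all names and parameters are illustrative):

```python
# Random-hyperplane LSH: each hyperplane contributes one signature bit
# (which side of the plane the vector falls on). Similar vectors fall on
# the same side of most planes, so they tend to share a bucket.
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(vec, planes):
    """Hash a vector to a hashable bit signature, one bit per hyperplane."""
    return tuple((planes @ vec) > 0)

dim, n_planes = 64, 16
planes = rng.standard_normal((n_planes, dim))

v = rng.standard_normal(dim)
w = v + 0.05 * rng.standard_normal(dim)   # a near-duplicate of v
u = rng.standard_normal(dim)              # an unrelated vector

print(lsh_signature(v, planes) == lsh_signature(w, planes))  # likely True
print(lsh_signature(v, planes) == lsh_signature(u, planes))  # likely False
```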

------
rurban
The article completely leaves out cache-conscious hashing. And before I would
improve the hash function with some extremely costly ML, I would rather find a
perfect hash, which is cheap.
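
For intuition, a toy version of the idea: brute-force a salt until a simple
hash is collision-free on the fixed key set (real tools like gperf or CHD are
far smarter about this):

```python
# Toy perfect hash: for a fixed key set, search for a salt under which a
# salted hash maps every key to a distinct slot. With table_size == len(keys)
# this is a minimal perfect hash for that set.
import hashlib

def salted_slot(key, salt, table_size):
    h = hashlib.blake2b(key.encode(), salt=salt.to_bytes(8, "little"))
    return int.from_bytes(h.digest()[:8], "little") % table_size

def find_perfect_salt(keys, table_size):
    for salt in range(1_000_000):
        slots = {salted_slot(k, salt, table_size) for k in keys}
        if len(slots) == len(keys):   # no collisions: perfect for this set
            return salt
    raise RuntimeError("no salt found; try a larger table")

keys = ["apple", "banana", "cherry", "date"]
salt = find_perfect_salt(keys, table_size=len(keys))
print({k: salted_slot(k, salt, len(keys)) for k in keys})
```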

But he linked to my SMHasher.

~~~
tebba-lebba
Are you talking about cache utilization in the context of collision handling,
or hash functions themselves? I'd love to learn more about cache-conscious
hashing, could you suggest a paper to read, or algorithm to search for?

I Googled "Cache-Conscious Hashing" but didn't quickly find anything promising
:(.

Thanks!

~~~
rurban
The hash function itself is mostly cache-independent, unless you count those
variants which skip long strings.

You can cache the hash in the entry itself or not. You can also compress the
entries, though mostly with linear collision structures.
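
Roughly what "cache the hash in the entry" means, as a sketch (CPython's dict
stores the hash alongside each entry for the same reason):

```python
# Storing the computed hash inside each entry lets lookups do a cheap
# integer compare before the (potentially expensive) full key compare,
# and lets table resizes avoid rehashing keys entirely.
class Entry:
    __slots__ = ("hash", "key", "value")
    def __init__(self, key, value):
        self.hash = hash(key)   # computed once, cached with the entry
        self.key = key
        self.value = value

def find(buckets, key):
    h = hash(key)
    for entry in buckets[h % len(buckets)]:
        if entry.hash == h and entry.key == key:
            return entry.value
    return None

buckets = [[] for _ in range(8)]
for k, v in [("alpha", 1), ("beta", 2)]:
    e = Entry(k, v)
    buckets[e.hash % len(buckets)].append(e)
print(find(buckets, "beta"))  # -> 2
```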

Best paper "Cache-Conscious Collision Resolution in String Hash Tables”,
Askitis 2005.

------
ape4
Tons of stuff about hashing (that we already know) and not much on ML hashing.

~~~
joe_the_user
I can't see how your statement can be true:

 _" In response to the findings of the Google/MIT collaboration, Peter Bailis
and a team of Stanford researchers went back to the basics and warned us not
to throw out our algorithms book just yet. Bailis’ and his team at Stanford
recreated the learned index strategy, and were able to achieve similar results
without any machine learning by using a classic hash table strategy called
Cuckoo Hashing"_

"We can do your ml hashing with ordinary hashing" is a statement about both
ordinary hashing and ml hashing, indeed a fairly strong statement about each.

Whether it's true is a different matter but to say "nothing about ml hashing"
seems unsupportable.
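
For anyone unfamiliar with it, a minimal cuckoo-hashing sketch (illustrative
only; not the Stanford implementation):

```python
# Cuckoo hashing: two tables, two hash functions; every key lives in one of
# its two candidate slots. On collision, evict the resident item into its
# alternate slot, possibly triggering a chain of evictions.
class CuckooTable:
    def __init__(self, size=8, max_kicks=32):
        self.tables = [[None] * size, [None] * size]
        self.size, self.max_kicks = size, max_kicks

    def _slot(self, key, i):
        return hash((i, key)) % self.size   # two derived hash functions

    def insert(self, key, value):
        item = (key, value)
        for _ in range(self.max_kicks):
            for i in (0, 1):
                s = self._slot(item[0], i)
                if self.tables[i][s] is None:
                    self.tables[i][s] = item
                    return
                # Evict the occupant; it gets retried in the other table.
                self.tables[i][s], item = item, self.tables[i][s]
        raise RuntimeError("cycle detected; a real table would rehash/grow")

    def get(self, key):
        for i in (0, 1):
            entry = self.tables[i][self._slot(key, i)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

t = CuckooTable()
t.insert("a", 1)
t.insert("b", 2)
print(t.get("a"), t.get("b"))  # -> 1 2
```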

------
tabtab
Sort of reminds me of Factor Tables:
[https://github.com/RowColz/AI/blob/master/Factor_Tables.pdf](https://github.com/RowColz/AI/blob/master/Factor_Tables.pdf)

One can "compress" the Factor Tables using statistical "stereotyping" as a
kind of a lossy learning technique. You have systemic control over the size-
versus-accuracy tuning of the hashing (indexing).

Similarly, we learn to recognize or respond to patterns without remembering
each specific instance of the pattern we encounter, which can also be seen as
a lossy form of hashing.

------
jacksmith21006
One of my favorite papers is the one from Jeff Dean:

[https://arxiv.org/abs/1712.01208](https://arxiv.org/abs/1712.01208) The Case
for Learned Index Structures
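
The core idea, as a toy sketch with a single linear model standing in for the
paper's staged models:

```python
# Learned index sketch: fit a model that predicts a key's position in a
# sorted array (i.e., approximate the keys' CDF), then correct the guess
# with a bounded local search using the model's worst-case error.
import bisect
import numpy as np

keys = np.sort(np.random.default_rng(1).integers(0, 1_000_000, size=10_000))
positions = np.arange(len(keys))

# "Learn" the CDF with a least-squares line.
slope, intercept = np.polyfit(keys, positions, deg=1)
preds = np.clip(keys * slope + intercept, 0, len(keys) - 1)
max_err = int(np.ceil(np.max(np.abs(preds - positions))))  # error bound

def lookup(key):
    guess = int(np.clip(key * slope + intercept, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)
    return i if i < len(keys) and keys[i] == key else None

print(lookup(int(keys[1234])))  # finds the stored key's index
```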

~~~
moab
Some extra experimental evaluation of learned indices:
[https://dawn.cs.stanford.edu/2018/01/11/index-baselines/](https://dawn.cs.stanford.edu/2018/01/11/index-baselines/)

------
gleenn
Really well written article. Sadly, it looks like the research is all on
read-only, in-memory datasets because the training step is expensive.

------
asdsa5325
An Introduction to Hashing, more like it

------
carlmr
Cool that he used the photo of the Stuttgart (Germany) library.

