Hacker News new | comments | ask | show | jobs | submit login
An Introduction to Hashing in the Era of Machine Learning (bradfieldcs.com)
247 points by kqr2 9 months ago | hide | past | web | favorite | 20 comments

Another interesting thing that was not mentioned was that the index can place semantically similar items close together.

For example, look at Word2Vec. It "indexes" / "hashes" words that are semantically similar to the same "bucket". This means that you can search for semantically similar data just by inspecting what is near by. This has all kinds of applications.

Locality sensitive hashing (what you describe) is really useful for clustering and doing nearest-neighbor searches. Its used in digital fingerprinting, genome sequencing, and other interesting areas.

That's what word2vec is trying to model, it cannot do it because natural language semantics are person-2-world relationships.

It groups words together based on their distributional characteristics, and with the right parameter tuning, that can be a good-enough model of similarity.

Oh, cool idea. I had not heard the term Semantic Hashing before, but I'm going to do some research now.

Thanks for sharing!

Completely leaving out cache-conscious hashing. And before I would improve the hash function by some extremely costly ML, I would rather find a perfect hash. Which is cheap.

But he linked to my SMHasher.

Are you talking about cache utilization in the context of collision handling, or hash functions themselves? I'd love to learn more about cache-conscious hashing, could you suggest a paper to read, or algorithm to search for?

I Googled "Cache-Conscious Hashing" but didn't quickly find anything promising :(.


The hash function itself is mostly cache independent, unless you count those variants which skip long strings.

You can cache the hash in the entry itself or not. You can compress the entries, but mostly using linear collision structures.

Best paper "Cache-Conscious Collision Resolution in String Hash Tables”, Askitis 2005.

Submitted a PR for your SMHasher repo: https://github.com/rurban/smhasher/pull/41 .

Tons of stuff about hashing (that we already know) and not much on ML hashing.

I can't see how your statement can be true:

"In response to the findings of the Google/MIT collaboration, Peter Bailis and a team of Stanford researchers went back to the basics and warned us not to throw out our algorithms book just yet. Bailis’ and his team at Stanford recreated the learned index strategy, and were able to achieve similar results without any machine learning by using a classic hash table strategy called Cuckoo Hashing"

"We can do your ml hashing with ordinary hashing" is a statement about both ordinary hashing and ml hashing, indeed a fairly strong statement about each.

Whether it's true is a different matter but to say "nothing about ml hashing" seems unsupportable.

It's oddly formatted (I thought it was a quote at first), but the article includes this:

>(If you’re already familiar with hash tables, collision handling strategies, and hash function performance considerations; you might want to skip ahead, or skim this article and read the three articles linked at the end of this article for a deeper dive into these topics.)

The cuckoo hashing was new to me.

Sort of reminds me of Factor Tables: https://github.com/RowColz/AI/blob/master/Factor_Tables.pdf

One can "compress" the Factor Tables using statistical "stereotyping" as a kind of a lossy learning technique. You have systemic control over the size-versus-accuracy tuning of the hashing (indexing).

Similarly, we learn to recognize or respond to patterns without remembering each specific instance of the pattern we encounter, which can be seen as a lossy form of hashing also.

One of favorite papers is the paper from Jeff Dean

https://arxiv.org/abs/1712.01208 The Case for Learned Index Structures

Some extra experimental evaluation of learned-indices: https://dawn.cs.stanford.edu/2018/01/11/index-baselines/

These structures are particularly interesting as they lend themselves well to visualization and interactive navigation. I was happy to see that Ed Chi is one of the co-authors and comes from an HCI / Info Vis background.

Really well written article. Sadly looks like the research is all on read-only, in-memory datasets because the training step is expensive.

An Introduction to Hashing, more like it

Cool that he used the photo of the Stuttgart (Germany) library.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact