
Show HN: All-pair similarity search on millions of sets in Python and on laptop - ekzhu
https://github.com/ekzhu/SetSimilaritySearch
======
mikece
I see the terms "python" and "search" and "millions of sets" and think data
science... except that in the data science contexts with which I'm familiar
we're looking at billions of records among petabytes of data. I know the
article is about what can be done on a laptop, but I'm left wondering whether
this is a neat small-scale proof of concept or something that scales; I need
to research more when I've had coffee rather than bourbon.

~~~
stuartaxelowen
Data science is about drawing conclusions from data, not the size of it. Most
data science happens on datasets numbering in the hundreds rather than in the
millions.

~~~
lifeisstillgood
This is interesting - I am a bit confused about Big Data, data science, etc.
If I lay out my thoughts, would you mind redirecting me?

\- We have lots of data in different databases and just need a unified view
(ETL / data warehousing) - it's where most data in most businesses is trapped.
Next steps: common data definitions across the company, top-level imposition
to get a grip.

\- We can pull data together but need it to undergo what-if analysis or
aggregation for reporting. This is usually regulatory or data warehousing?

All the above are "size of Enterprise Oracle / other RDBMS". You could have
billions of records here, but usually the billions come from dozens of
databases with millions each ...

Big Data seems to be at the point of trying to do the ETL / data warehousing
for those dozens of different databases - put it all into a map-reduce-friendly
structure (Spark, Hadoop) and then run general queries - data provenance
becomes a huge issue then.

Then we have the data science approach of data in sets / key-value stores that
I would classify as predictive - K-nearest neighbour etc.

I suspect I am wildly wrong in many areas, but I'm just trying to get it
straight.

~~~
mijamo
I don't understand your point; you're making complexity out of a simple
concept, imo.

Data science: the science of using data to draw conclusions. Can be hundreds
or thousands of datapoints. Can be billions. Does not matter.

Big data: the subset of data science applied to "big" datasets, where the most
trivial approaches reach their limits. It does NOT mean billions of datapoints
either; it basically just means the data is not well suited for a spreadsheet
anymore.

~~~
lifeisstillgood
I guess my point is how to classify different environments - to know which
tool to use when.

------
waterhouse
Regarding the similarity function used: I haven't studied it myself, but read
this a long time ago:

[http://www.benfrederickson.com/distance-metrics/](http://www.benfrederickson.com/distance-metrics/)
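
A quick anchor for that discussion: Jaccard similarity, the usual measure in
this kind of set similarity search, is just intersection over union. A minimal
sketch with made-up example sets:

      def jaccard(a, b):
          # Jaccard similarity: |A ∩ B| / |A ∪ B|, a value in [0, 1].
          a, b = set(a), set(b)
          return len(a & b) / len(a | b) if (a or b) else 1.0

      print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared / 4 total = 0.5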

------
a-dub
Which hash function are you using for the minhashes in the LSH benchmark?
Example code from datasketch seems to indicate SHA-1. Is there a good reason
for that? Have you tried murmur? I wonder if it improves runtime.

~~~
ekzhu
I am sure Murmur3 would improve hashing performance, but I doubt it would
improve the overall indexing time very much. I can give it a try.

Update:

In IPython using pyhash library (C++):

      import pyhash
      h = pyhash.murmur3_32()
      timeit h(b"test")
      703 ns ± 4.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
      import hashlib
      timeit hashlib.sha1(b"test")
      217 ns ± 5.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

~~~
a-dub

      import pyhash
      h = pyhash.murmur3_32()
      timeit h(b"test")
      576 ns ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
      import hashlib
      timeit hashlib.sha1(b"test").digest()
      518 ns ± 5.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
      import mmh3
      timeit mmh3.hash(b"test")
      156 ns ± 0.704 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    

Worth keeping in mind that pyhash and mmh3 close out the hash immediately,
whereas hashlib needs digest() to call sha1_done().

That said, mmh3 seems to give a respectable 70% speedup! (assuming 32-bit
hashes are acceptable) ... actually, let's compare apples to apples:

      timeit mmh3.hash128(b"test")
      180 ns ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    

Still pretty awesome for the little test!

...

And for something more realistic:

      timeit hashlib.sha1(b"test"*4096).digest()
      21.9 µs ± 594 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
      timeit h(b"test"*4096)
      6.81 µs ± 38 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
      timeit mmh3.hash128(b"test"*4096)
      3.03 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

An order of magnitude! (well, close)

~~~
ekzhu
Interesting. I thought the Python builtin hashlib was more convenient (and
more random). But yes, you are right: a good implementation of the murmur3
hash is much faster.

~~~
a-dub
SHA-1 is specifically built to have special properties as a secure hash
function. Murmur, as I understand it, comes from a different world: fast,
non-cryptographic hashing.

I also had a gander at some more of the datasketch source. I notice that you
compute H_n(x) as H_n(x) = (a_n + b_n * H_0(x)) with a_n, b_n being random
seeds....

That's pretty cool. I was doing it as H_n(x) = H_{n-1}(x|n) and thought that
would be pretty quick, but applying one random round from precomputed seeds
directly to a single hash value looks much faster.
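
A rough sketch of that construction, for anyone following along (the names,
the Mersenne-prime modulus, and SHA-1 as the base hash are my assumptions for
illustration, not the actual datasketch code):

      import hashlib
      import random

      P = (1 << 61) - 1  # a Mersenne prime, a common modulus for this trick

      def h0(element):
          # Base hash H_0: first 8 bytes of SHA-1 as an integer (SHA-1 only to
          # match the datasketch example above; any decent hash would do).
          return int.from_bytes(hashlib.sha1(element.encode()).digest()[:8], "big")

      def make_seeds(num_perm, seed=1):
          # One (a_n, b_n) pair of random seeds per MinHash permutation.
          rng = random.Random(seed)
          return [(rng.randrange(0, P), rng.randrange(1, P)) for _ in range(num_perm)]

      def minhash_signature(elements, seeds):
          base = [h0(e) for e in elements]  # each element hashed exactly once
          # H_n(x) = (a_n + b_n * H_0(x)) mod P; take the min per permutation.
          return [min((a + b * h) % P for h in base) for a, b in seeds]

      seeds = make_seeds(128)
      s1 = minhash_signature({"a", "b", "c", "d"}, seeds)
      s2 = minhash_signature({"a", "b", "c", "e"}, seeds)
      # The fraction of matching slots estimates Jaccard similarity (truth: 0.6).
      print(sum(x == y for x, y in zip(s1, s2)) / len(s1))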

------
moultano
If you were curious about the reference to MinHash in the OP, I just wrote a
gentle guide to the MinHash family of algorithms (including our recent
research extending it to probability distributions):
[https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsxyg4467-6/](https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsxyg4467-6/)

------
bra-ket
How does it compare to Spotify “annoy” LSH
[https://github.com/spotify/annoy](https://github.com/spotify/annoy)
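
For context, a minimal annoy usage sketch (the dimensionality and data below
are made up): annoy indexes dense vectors with random-projection trees, so to
use it for set similarity you'd first embed each set as a vector (e.g. a
MinHash signature), whereas the library in the OP operates on the sets
themselves.

      import random

      from annoy import AnnoyIndex

      f = 64  # vector dimensionality (made up for illustration)
      index = AnnoyIndex(f, "angular")  # angular ~ cosine distance
      for i in range(100):
          # Add items by integer id; the vectors here are random placeholders.
          index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
      index.build(10)  # build 10 random-projection trees
      print(index.get_nns_by_item(0, 5))  # 5 approximate nearest neighbours of item 0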

