
Which hash function are you using for the minhashes in the LSH benchmark? The example code from datasketch seems to indicate SHA-1. Is there a good reason for that? Have you tried murmur? I wonder if it improves runtime.
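
For context, the datasketch default (if I'm reading the source right) just truncates the SHA-1 digest to an unsigned 32-bit integer, so any hash that yields the same kind of integer can stand in for it. A rough sketch of the two side by side (my own helper names, not datasketch's API):

  import struct
  import hashlib
  import mmh3  # third-party: pip install mmh3

  def sha1_hash32(data):
      # Roughly what the datasketch default does: keep the first 4 bytes
      # of the SHA-1 digest as an unsigned 32-bit integer.
      return struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]

  def murmur_hash32(data):
      # MurmurHash3 already produces a 32-bit value; mask to make it unsigned.
      return mmh3.hash(data) & 0xFFFFFFFF

  print(sha1_hash32(b"test"), murmur_hash32(b"test"))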



I am sure Murmur3 would be faster per hash, but I doubt it would improve the overall indexing time very much. I can give it a try.

Update:

In IPython using pyhash library (C++):

  import pyhash
  h = pyhash.murmur3_32()
  timeit h(b"test")
  703 ns ± 4.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

  import hashlib
  timeit hashlib.sha1(b"test")
  217 ns ± 5.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


  import pyhash
  h = pyhash.murmur3_32()
  timeit h(b"test")
  576 ns ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

  import hashlib
  timeit hashlib.sha1(b"test").digest()
  518 ns ± 5.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

  import mmh3
  timeit mmh3.hash(b"test")
  156 ns ± 0.704 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Worth keeping in mind that pyhash and mmh3 close out the hash immediately, whereas hashlib needs digest() to call sha1_done().

That said, mmh3 seems to give a respectable 70% speedup (assuming 32-bit hashes are acceptable)... actually, let's compare apples to apples with the 128-bit variant:

  timeit mmh3.hash128(b"test")
  180 ns ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Still pretty awesome for the little test!

...

And for something more realistic:

  timeit hashlib.sha1(b"test"*4096).digest()
  21.9 µs ± 594 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  timeit h(b"test"*4096)
  6.81 µs ± 38 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

  timeit mmh3.hash128(b"test"*4096)
  3.03 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

An order of magnitude! (well, close)


Interesting. I thought the Python builtin hashlib was more convenient (and more random). But yes, you are right: a good implementation of the murmur3 hash is much faster.


SHA-1 is specifically built to have the special properties of a secure hash function. Murmur, as I understand it, comes from the world of fast, non-cryptographic hashing (hash tables and the like), so it doesn't pay for those guarantees.

I also had a gander at some more of the datasketch source. I notice that you compute H_n(x) as H_n(x) = a_n + b_n * H_0(x), with a_n, b_n being random seeds....

That's pretty cool. I was doing it as H_n(x) = H_(n-1)(x | n) and thought it would be pretty quick, but applying an affine transform with precomputed random seeds to a single hash value looks much faster.
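
To make the comparison concrete, here is a toy sketch of the two schemes (entirely my own code, assuming mmh3 and NumPy, 32-bit hash values, and a power-of-two modulus rather than whatever modulus datasketch actually uses):

  import numpy as np
  import mmh3

  NUM_PERM = 128
  rng = np.random.RandomState(1)
  # Precomputed random seeds a_n, b_n, one pair per minhash "permutation".
  A = rng.randint(1, 2**31, size=NUM_PERM).astype(np.uint64)
  B = rng.randint(1, 2**31, size=NUM_PERM).astype(np.uint64)
  MASK = np.uint64(0xFFFFFFFF)

  def minhash_affine(tokens):
      # Hash each token once, then derive all NUM_PERM values from the
      # precomputed seeds: H_n(x) = a_n + b_n * H_0(x) (mod 2**32 here).
      mins = np.full(NUM_PERM, MASK, dtype=np.uint64)
      for t in tokens:
          h0 = np.uint64(mmh3.hash(t) & 0xFFFFFFFF)
          mins = np.minimum(mins, (A + B * h0) & MASK)
      return mins

  def minhash_rehash(tokens):
      # The chained H_n(x) = H_(n-1)(x | n) idea costs one fresh hash per
      # token *per index*; approximated here with per-index seeds.
      mins = [2**32] * NUM_PERM
      for t in tokens:
          for n in range(NUM_PERM):
              mins[n] = min(mins[n], mmh3.hash(t, n) & 0xFFFFFFFF)
      return mins

  print(minhash_affine([b"foo", b"bar"])[:4])

The affine version calls the underlying hash once per token, while the chained version calls it NUM_PERM times per token, which is presumably where most of the difference in indexing time comes from.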


Might also want to test with a bigger buffer? There are likely to be fixed per-call overheads, especially for libraries which are C/C++ wrappers, and in general for hash function setup.
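
Something like this makes the fixed per-call overhead visible (my own snippet; it just reuses the hashlib and mmh3 calls from the timings above with the standard timeit module):

  import timeit
  import hashlib
  import mmh3

  for size in (4, 64, 1024, 16384, 262144):
      buf = b"x" * size
      # Total seconds for 10,000 calls; multiply by 100 to get microseconds per call.
      t_sha = timeit.timeit(lambda: hashlib.sha1(buf).digest(), number=10000)
      t_mur = timeit.timeit(lambda: mmh3.hash128(buf), number=10000)
      print(f"{size:>7} bytes  sha1: {t_sha*100:8.2f} us  murmur3: {t_mur*100:8.2f} us")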



