- We have lots of data in different databases and just need a unified view (ETL / data warehousing) - it's where most data in most businesses is trapped. Next steps: common data definitions across the company, top-level imposition to get a grip
- We can pull data together but need it to undergo what-if analysis or aggregation for reporting. This is usually regulatory or data warehousing?
All the above are "size of Enterprise Oracle / other RDBMS". You could have billions of records here but usually billions comes from dozens of databases with millions each ...
Big Data seems to be at the point of trying to do the ETL / data warehousing for those dozens of different databases - put it into a MapReduce-friendly structure (Spark, Hadoop) and then run general queries - data provenance becomes a huge issue then.
Then we have the data science approach of data in sets / key-value stores that I would classify as predictive - K-nearest neighbour etc.
I suspect I am wildly wrong in many areas but just trying to get it straight
Data science: the science of using data to draw conclusions. Can be hundreds or thousands of data points. Can be billions. Does not matter.
Big data: the subset of data science applied to "big" datasets, where the most trivial approaches reach their limits. It does NOT have to mean billions of data points; it probably just means the data is no longer well suited to a spreadsheet.
At the core of the author’s Show HN is an exact algorithm implementation / port for the all-pair similarity search. One of the steps of an all-pair similarity search, metric K-center, is an NP-complete problem.
So we’ve got an exact algorithm that needs to solve an NP-complete problem to produce a result, making it at least as hard.
Any speed increases to such an algorithm in the millions of data points is awesome! If you’ve got billions of data points chances are you can distill it down to millions, and if that’s possible you’d get an exact result. Or you could use a heuristic algorithm, some sort of polynomial-time approximation, which can scale to billions, and still get you a good-enough result.
1 - https://static.googleusercontent.com/media/research.google.c...
This is not correct. It's very obvious that all-pair similarity search can be solved in O(n^2) calls to the similarity metric, as stated in the readme. So unless the metric itself falls outside P, this problem is easy (but still hard to scale up in practice, of course)
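To make the O(n^2) point concrete, here is a minimal brute-force sketch. It assumes the items are Python sets and uses Jaccard similarity as the metric; the names `jaccard` and `all_pairs` are made up for illustration and are not taken from the project's readme.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets, by convention)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def all_pairs(items, threshold, sim=jaccard):
    """Return (i, j, score) for every pair scoring >= threshold.

    Calls `sim` exactly n * (n - 1) / 2 times, i.e. O(n^2) calls.
    """
    result = []
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        score = sim(a, b)
        if score >= threshold:
            result.append((i, j, score))
    return result


docs = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}, {"a", "b", "c"}]
pairs = all_pairs(docs, 0.5)
print(pairs)
```

So it is polynomial as long as the metric is, but the quadratic number of calls is exactly what stops this from scaling in practice.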
Data science is about what you do with the data, not about how big the data is.
In IPython using pyhash library (C++):
h = pyhash.murmur3_32()
703 ns ± 4.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
217 ns ± 5.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
h = pyhash.murmur3_32()
576 ns ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
518 ns ± 5.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
156 ns ± 0.704 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
That said, mmh3 seems to give a respectable 70% speedup! (assuming 32bit hashes are acceptable) ... actually, let's compare apples to apples:
180 ns ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
And for something more realistic:
21.9 µs ± 594 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
6.81 µs ± 38 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.03 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I also had a gander at some more of the datasketch source. I notice that you compute H_n(x) by x_n = (a_n + b_n * H_0(x)) with a_n, b_n being random seeds....
That's pretty cool, I was doing it by H_n(x) = H_n-1(x|n) and thought it would be pretty quick, but just applying a random round directly after to one hash value from precomputed seeds looks much faster.
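For anyone following along, this is roughly how I understand the trick, as a rough sketch and not a copy of the datasketch source. The prime modulus, the 32-bit mask, SHA-1 as H_0, and all the names (`base_hash`, `make_seeds`, `signature`) are my own assumptions.

```python
import hashlib
import random

PRIME = (1 << 61) - 1     # assumed modulus: a Mersenne prime above the 32-bit range
MAX_HASH = (1 << 32) - 1  # assumed mask: keep signature values at 32 bits


def base_hash(token):
    """H_0: the first 4 bytes of SHA-1, read as an unsigned 32-bit integer."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_seeds(num_perm, seed=1):
    """Precompute one random (a_n, b_n) pair per simulated hash function."""
    rng = random.Random(seed)
    return [(rng.randrange(0, PRIME), rng.randrange(1, PRIME))
            for _ in range(num_perm)]


def signature(tokens, seeds):
    """MinHash signature of a non-empty token set.

    Each token is hashed once with H_0; every H_n is then just
    (a_n + b_n * H_0(x)) mod PRIME, masked to 32 bits.
    """
    base = [base_hash(t) for t in tokens]
    return [min(((a + b * h) % PRIME) & MAX_HASH for h in base)
            for a, b in seeds]


def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


seeds = make_seeds(128)
s1 = signature({"the", "quick", "brown", "fox"}, seeds)
s2 = signature({"the", "quick", "brown", "dog"}, seeds)
print(estimate_jaccard(s1, s2))
```

The expensive hash runs once per token, and each extra "hash function" costs only one multiply, add and modulo, which is presumably where the speedup over rehashing x|n comes from.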