
Counting exactly the number of distinct elements: sorted arrays vs. hash sets? - deafcalculus
http://lemire.me/blog/2017/05/23/counting-exactly-the-number-of-distinct-elements-sorted-arrays-vs-hash-sets/
======
contravariant
If you use radix sort the sorted array approach becomes O(N). And since the
actual order is irrelevant, radix sort will work for any fixed size data
structure (or even objects of varying size if you just sort and split the
array according to size first).

And if you're worried that for objects of size k, radix sort runs in O(n k),
rather than O(n), I'd like to point out that n*k = N is simply the size in
memory, so it is in fact still linear in the size of the input, so it truly is
O(N).

For string-like objects using a radix tree might be better, although the
result is pretty similar.

~~~
micahbright
>[https://en.wikipedia.org/wiki/Radix_sort](https://en.wikipedia.org/wiki/Radix_sort)

>However, in general w cannot be considered a constant: if all n keys are
distinct, then w has to be at least log n

We can't ignore the log n, unless the number of distinct elements is _small_
with respect to n. Since this wouldn't be known a priori, we can't make that
assumption.

Also, note that if we know, a priori, that the number of distinct elements is
small wrt N, we can simply use the hash set for O(N) without any of the
problems he goes to outline in the article.

~~~
rocqua
OP actually addressed the w from wikipedia. He calls it the size of the
objects k. The point being, run-time remains proportional to the actual in-
memory size of whatever you are trying to count.

Heck, for the example of 64 bit integers, w is fixed at 64.

Note also that the quote from Wikipedia concerns distinct elements. It is
rather inherent to the problem we are solving that not all elements are
distinct.

~~~
micahbright
It _still_ doesn't improve any on a quick-sort because quicksort is going to
run in O(n _log n)=O(n_ 64)=O(N) as well. So what's the point?

~~~
contravariant
It's hardly fair to assume you can compare two k-bit objects in O(1),
logically it should be at least O(k) (got to read every bit at some point).
Which would make quicksort O(k n log n) rather than radix sort's O(k n).

------
vmarsy
> I choose 64-bit integers, but strings would do fine as well.

No it would not.

Sorting 64 bits integers in a vector is fast obviously since they are stored
contiguously in memory, so you get to benefit from all the caching layers (L1,
L2, L3, TLB) nicely.

Since string are of unknown size, the actual string content is stored
somewhere in the heap and reference it via a pointer [1]. If you have a vector
of strings, to compare string i and string j, you will need to make those 2
pointers dereference , which will jump to 2 random memory locations.

With the hash table you already pay for 1 indirection, which is really what is
slow here. With a vector<string> you now also need to pay for 1 indirection,
so the performance compared to a vector<int> would be very different.

[1] [https://stackoverflow.com/questions/42049778/where-is-a-
stds...](https://stackoverflow.com/questions/42049778/where-is-a-stdstring-
allocated-in-memory)

~~~
TazeTSchnitzel
Assuming strings live on the heap. Small strings could be stored in the vector
itself.

------
vicaya
I'd expect better write up from Daniel. std::unordered_set is implemented with
chained buckets, which incurs extra memory references. Cursory google revealed
that an open addressing hash set would be about 3x faster, which would handily
beat the sort version.

~~~
rurban
This, and the sort solution is also incredibly bad. The third unique step does
unnecessary moving of all the remaining values. Multiple times. You just need
to iterate over the sorted array and skip i++ when the value is the same as
the previous.

------
tromp
Just weeks ago, I had to pay out a $5000 bounty on my Cuckoo Cycle proof of
work scheme [1], because I had wrongly assumed that hash sets were faster,
given that the hash set reduces to a simple bitmap.

Where the article considers up to 10M elements, Cuckoo Cycle deals with about
a billion elements, thus answering the question of what happens when cranking
up the data size. It turns out that despite using 32x to 64x more memory than
the bitmap, sorting is about 4x faster.

Blog entry [2] explains how Cuckoo Cycle reduces to a counting problem.

[1] [https://github.com/tromp/cuckoo](https://github.com/tromp/cuckoo) [2]
[http://cryptorials.io/beyond-hashcash-proof-work-theres-
mini...](http://cryptorials.io/beyond-hashcash-proof-work-theres-mining-
hashing)

------
Paul-ish
The article isn't concerned with estimates, but if you have large amounts of
data, approximate numbers may be sufficient. I am a fan of HyperLogLog. It can
be used as an online algorithm, so you don't have to keep all values memory.

It could be useful if you want to do something like estimate the number of
unique links posted to twitter in week. (if you could get that data)

[https://en.wikipedia.org/wiki/HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)

~~~
amelius
It works only for uniformly distributed data.

~~~
allenz
It works for arbitrary data because

> a hash function is applied to each element in the original multiset to
> obtain a multiset of uniformly distributed random numbers with the same
> cardinality as the original multiset

~~~
amelius
Good point

------
TwoBit
How do we know the observed behaviour isn't mostly due to the hash set node
allocator? If you want speed, you don't use global malloc.

~~~
chubot
But in-place sort doesn't use malloc() at all. Problem solved.

Custom allocators are nice in theory but come with practical costs. And it's
easier to do worse than your system malloc() these days. Counterpoints:

\- Why custom allocators/pools are hard: [http://yosefk.com/blog/why-custom-
allocatorspools-are-hard.h...](http://yosefk.com/blog/why-custom-
allocatorspools-are-hard.html)

\- Reconsidering Custom Memory Allocation:
[https://people.cs.umass.edu/~emery/pubs/berger-
oopsla2002.pd...](https://people.cs.umass.edu/~emery/pubs/berger-
oopsla2002.pdf)

I think it's good to test your assumptions and this article is a nice example
of that.

------
chubot
Nice article. There's been a lot of ink spilled about the word histogram
problem and the Unix/McIlroy solution vs. Knuth's solution, e.g.:

[http://www.leancrew.com/all-this/2011/12/more-shell-less-
egg...](http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/)

I always liked McIlroy's solution because it's shorter. But it bugged me that
in theory it was O(n log n) and not O(n).

But this post gives some good evidence that this doesn't matter for problems
of most practical sizes. And I've actually used McIlroy's solution in practice
and I've never found speed to be an issue.

Of course it's sorting variable length strings rather than fixed-length
records, so there's probably some subtlety. But I like this article because it
tests your assumptions.

------
hoytech
Judy arrays might be good for this. For the 64 bit int example you could use
Judy1:

[http://judy.sourceforge.net/doc/index.html](http://judy.sourceforge.net/doc/index.html)

~~~
hoytech
... and I wrote a blog post testing this :)

[https://logperiodic.com/blog/2017/05/counting-distinct-
eleme...](https://logperiodic.com/blog/2017/05/counting-distinct-elements-
judy-arrays)

------
LgWoodenBadger
It's an interesting use-case where the data is all known up-front.

I'd be interested to see an incrementally-increasing benchmark. I'd imagine
the cache-miss penalty from the HashSet is countered by the continuous re-
sorting/copying/moving from the Array solution.

But that's why we measure, right?

------
kevinwang
Wow, I would definitely not expect that.

------
woliveirajr
10 rem Using GW-Basic

15 end_of_list=10000000

20 For x=1 to x=end_of_list

30 a(x) = a(x) + 1

40 next x

50 rem Done counting, print the result

60 for x=1 to x=end_of_list

70 print "Value " x " has " a(x) " repetitions"

80 next x

~~~
contravariant
You're going to build an array of 2^64 ints?

~~~
woliveirajr
It would build an array of n distinct elements. And for efficiency, this
comment [0] is about the drawbacks of the solutions used in the article.

[0]
[https://news.ycombinator.com/item?id=14407404](https://news.ycombinator.com/item?id=14407404)

~~~
contravariant
For the line "70 print "Value " x " has " a(x) " repetitions"" to work you'd
need an array with 2^64 elements. That said the code as you wrote it would
just result in an array filled with 10 million entries, all of which are 1
(assuming the array is initialized to 0).

~~~
woliveirajr
You're right, I was just thinking in the counting of repeated values. Can't
edit to correct now...

