Hacker News new | past | comments | ask | show | jobs | submit login
Counting exactly the number of distinct elements: sorted arrays vs. hash sets? (lemire.me)
46 points by deafcalculus on May 23, 2017 | hide | past | favorite | 27 comments

If you use radix sort the sorted array approach becomes O(N). And since the actual order is irrelevant, radix sort will work for any fixed size data structure (or even objects of varying size if you just sort and split the array according to size first).

And if you're worried that for objects of size k, radix sort runs in O(n k), rather than O(n), I'd like to point out that n*k = N is simply the size in memory, so it is in fact still linear in the size of the input, so it truly is O(N).

For string-like objects using a radix tree might be better, although the result is pretty similar.


>However, in general w cannot be considered a constant: if all n keys are distinct, then w has to be at least log n

We can't ignore the log n, unless the number of distinct elements is small with respect to n. Since this wouldn't be known a priori, we can't make that assumption.

Also, note that if we know, a priori, that the number of distinct elements is small wrt N, we can simply use the hash set for O(N) without any of the problems he goes to outline in the article.

OP actually addressed the w from wikipedia. He calls it the size of the objects k. The point being, run-time remains proportional to the actual in-memory size of whatever you are trying to count.

Heck, for the example of 64 bit integers, w is fixed at 64.

Note also that the quote from Wikipedia concerns distinct elements. It is rather inherent to the problem we are solving that not all elements are distinct.

It still doesn't improve any on a quick-sort because quicksort is going to run in O(nlog n)=O(n64)=O(N) as well. So what's the point?

It's hardly fair to assume you can compare two k-bit objects in O(1), logically it should be at least O(k) (got to read every bit at some point). Which would make quicksort O(k n log n) rather than radix sort's O(k n).

For the quick-sort case, log n need not be 64. That presumes no duplicates. In the problem to be solved. duplicates matter. Now, in practice it seems unlikely that more than 2^64 integers will need to be sorted, but for the rigor of asymptotic notation, you need to consider that case.

I'd like to add that the tree-structure needed for radix sort is going to make it useless in practice. Trees are essentially synonymous with cache-misses.

> I choose 64-bit integers, but strings would do fine as well.

No it would not.

Sorting 64 bits integers in a vector is fast obviously since they are stored contiguously in memory, so you get to benefit from all the caching layers (L1, L2, L3, TLB) nicely.

Since string are of unknown size, the actual string content is stored somewhere in the heap and reference it via a pointer [1]. If you have a vector of strings, to compare string i and string j, you will need to make those 2 pointers dereference , which will jump to 2 random memory locations.

With the hash table you already pay for 1 indirection, which is really what is slow here. With a vector<string> you now also need to pay for 1 indirection, so the performance compared to a vector<int> would be very different.

[1] https://stackoverflow.com/questions/42049778/where-is-a-stds...

Assuming strings live on the heap. Small strings could be stored in the vector itself.

I'd expect better write up from Daniel. std::unordered_set is implemented with chained buckets, which incurs extra memory references. Cursory google revealed that an open addressing hash set would be about 3x faster, which would handily beat the sort version.

This, and the sort solution is also incredibly bad. The third unique step does unnecessary moving of all the remaining values. Multiple times. You just need to iterate over the sorted array and skip i++ when the value is the same as the previous.

Just weeks ago, I had to pay out a $5000 bounty on my Cuckoo Cycle proof of work scheme [1], because I had wrongly assumed that hash sets were faster, given that the hash set reduces to a simple bitmap.

Where the article considers up to 10M elements, Cuckoo Cycle deals with about a billion elements, thus answering the question of what happens when cranking up the data size. It turns out that despite using 32x to 64x more memory than the bitmap, sorting is about 4x faster.

Blog entry [2] explains how Cuckoo Cycle reduces to a counting problem.

[1] https://github.com/tromp/cuckoo [2] http://cryptorials.io/beyond-hashcash-proof-work-theres-mini...

The article isn't concerned with estimates, but if you have large amounts of data, approximate numbers may be sufficient. I am a fan of HyperLogLog. It can be used as an online algorithm, so you don't have to keep all values memory.

It could be useful if you want to do something like estimate the number of unique links posted to twitter in week. (if you could get that data)


It works only for uniformly distributed data.

It works for arbitrary data because

> a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset

Good point

How do we know the observed behaviour isn't mostly due to the hash set node allocator? If you want speed, you don't use global malloc.

But in-place sort doesn't use malloc() at all. Problem solved.

Custom allocators are nice in theory but come with practical costs. And it's easier to do worse than your system malloc() these days. Counterpoints:

- Why custom allocators/pools are hard: http://yosefk.com/blog/why-custom-allocatorspools-are-hard.h...

- Reconsidering Custom Memory Allocation: https://people.cs.umass.edu/~emery/pubs/berger-oopsla2002.pd...

I think it's good to test your assumptions and this article is a nice example of that.

Nice article. There's been a lot of ink spilled about the word histogram problem and the Unix/McIlroy solution vs. Knuth's solution, e.g.:


I always liked McIlroy's solution because it's shorter. But it bugged me that in theory it was O(n log n) and not O(n).

But this post gives some good evidence that this doesn't matter for problems of most practical sizes. And I've actually used McIlroy's solution in practice and I've never found speed to be an issue.

Of course it's sorting variable length strings rather than fixed-length records, so there's probably some subtlety. But I like this article because it tests your assumptions.

Judy arrays might be good for this. For the 64 bit int example you could use Judy1:


... and I wrote a blog post testing this :)


It's an interesting use-case where the data is all known up-front.

I'd be interested to see an incrementally-increasing benchmark. I'd imagine the cache-miss penalty from the HashSet is countered by the continuous re-sorting/copying/moving from the Array solution.

But that's why we measure, right?

Wow, I would definitely not expect that.

10 rem Using GW-Basic

15 end_of_list=10000000

20 For x=1 to x=end_of_list

30 a(x) = a(x) + 1

40 next x

50 rem Done counting, print the result

60 for x=1 to x=end_of_list

70 print "Value " x " has " a(x) " repetitions"

80 next x

You're going to build an array of 2^64 ints?

It would build an array of n distinct elements. And for efficiency, this comment [0] is about the drawbacks of the solutions used in the article.

[0] https://news.ycombinator.com/item?id=14407404

For the line "70 print "Value " x " has " a(x) " repetitions"" to work you'd need an array with 2^64 elements. That said the code as you wrote it would just result in an array filled with 10 million entries, all of which are 1 (assuming the array is initialized to 0).

You're right, I was just thinking in the counting of repeated values. Can't edit to correct now...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact