And if you're worried that for objects of size k, radix sort runs in O(n k) rather than O(n), I'd like to point out that n*k = N is simply the size of the input in memory, so it is in fact still linear in the input size: truly O(N).
For string-like objects, using a radix tree might be better, although the result is pretty similar.
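For the fixed-size case, a minimal LSD radix sort sketch (my own illustration, assuming 64-bit keys processed one byte per pass, not code from the article) makes the O(n k) accounting concrete: each of the k = 8 passes touches all n elements.

    // Rough sketch: LSD radix sort of 64-bit keys, one byte per pass.
    // k = 8 digit passes over n elements, i.e. O(n*k) = O(N) work.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    void radix_sort(std::vector<uint64_t>& v) {
        std::vector<uint64_t> buf(v.size());
        for (int pass = 0; pass < 8; ++pass) {
            const int shift = pass * 8;
            size_t count[256] = {0};
            for (uint64_t x : v) ++count[(x >> shift) & 0xFF];         // histogram this byte
            size_t pos[256];
            size_t sum = 0;
            for (int d = 0; d < 256; ++d) { pos[d] = sum; sum += count[d]; }
            for (uint64_t x : v) buf[pos[(x >> shift) & 0xFF]++] = x;  // stable scatter by this byte
            v.swap(buf);
        }
    }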
>However, in general w cannot be considered a constant: if all n keys are distinct, then w has to be at least log n
We can't ignore the log n, unless the number of distinct elements is small with respect to n. Since this wouldn't be known a priori, we can't make that assumption.
Also, note that if we know, a priori, that the number of distinct elements is small with respect to n, we can simply use the hash set for O(N) without any of the problems he goes on to outline in the article.
Heck, for the example of 64-bit integers, w is fixed at 64.
Note also that the quote from Wikipedia concerns distinct elements. It is rather inherent to the problem we are solving that not all elements are distinct.
I'd like to add that the tree structure needed for radix sort is going to make it useless in practice. Trees are essentially synonymous with cache misses.
No it would not.
Sorting 64-bit integers in a vector is obviously fast, since they are stored contiguously in memory, so you benefit nicely from all the caching layers (L1, L2, L3, TLB).
Since strings are of unknown size, the actual string content is stored somewhere on the heap and referenced via a pointer.
If you have a vector of strings, then to compare string i and string j you need to dereference those two pointers, which means jumping to two random memory locations.
With the hash table you already pay for one indirection, which is really what is slow here. With a vector<string> you now also pay for one indirection, so the performance compared to a vector<int> would be very different.
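If you want to see that indirection cost for yourself, a rough sketch of the comparison could look like the following (the data and sizes are made up, not from the article, and the small-string optimization can blunt the effect for short strings):

    // Sort the same random values once as raw 64-bit integers (contiguous)
    // and once as heap-allocated strings (one pointer chase per comparison).
    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t n = 10'000'000;
        std::mt19937_64 rng(42);
        std::vector<uint64_t> ints(n);
        for (auto& x : ints) x = rng();

        std::vector<std::string> strs;
        strs.reserve(n);
        for (uint64_t x : ints) strs.push_back(std::to_string(x));  // string bodies live on the heap

        auto timed = [](auto&& f) {
            auto t0 = std::chrono::steady_clock::now();
            f();
            auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double>(t1 - t0).count();
        };

        double t_int = timed([&] { std::sort(ints.begin(), ints.end()); });  // contiguous, cache-friendly
        double t_str = timed([&] { std::sort(strs.begin(), strs.end()); });  // pointer chasing per comparison
        std::printf("ints: %.2fs  strings: %.2fs\n", t_int, t_str);
    }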
Where the article considers up to 10M elements, Cuckoo Cycle deals with about a billion elements, thus answering the question of what happens when you crank up the data size. It turns out that despite using 32x to 64x more memory than the bitmap, sorting is about 4x faster.
Blog entry  explains how Cuckoo Cycle reduces to a counting problem.
It could be useful if you want to do something like estimate the number of unique links posted to Twitter in a week (if you could get that data).
> a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset
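For concreteness, here is a toy sketch of that hashing step in the spirit of a crude Flajolet-Martin estimator (not HyperLogLog proper; std::hash stands in for a proper 64-bit hash, and a real estimator averages many such observations):

    // Hash every element to a (roughly) uniform 64-bit value, track the most
    // leading zeros seen, and use 2^max_zeros as a crude distinct-count estimate.
    #include <bit>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    uint64_t estimate_distinct(const std::vector<std::string>& items) {
        std::hash<std::string> h;                   // stand-in for a good 64-bit hash
        int max_zeros = 0;
        for (const auto& s : items) {
            uint64_t x = h(s);
            int zeros = std::countl_zero(x | 1);    // | 1 caps this at 63
            if (zeros > max_zeros) max_zeros = zeros;
        }
        return uint64_t(1) << max_zeros;            // very rough single-observation estimate
    }

    int main() {
        std::vector<std::string> links = {"a", "b", "a", "c", "b", "d"};
        std::printf("estimate: %llu\n", (unsigned long long)estimate_distinct(links));
    }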
Custom allocators are nice in theory but come with practical costs. And it's easier to do worse than your system malloc() these days. Counterpoints:
- Why custom allocators/pools are hard: http://yosefk.com/blog/why-custom-allocatorspools-are-hard.h...
- Reconsidering Custom Memory Allocation: https://people.cs.umass.edu/~emery/pubs/berger-oopsla2002.pd...
I think it's good to test your assumptions and this article is a nice example of that.
I always liked McIlroy's solution because it's shorter. But it bugged me that in theory it was O(n log n) and not O(n).
But this post gives some good evidence that this doesn't matter for most problems of practical size. And I've actually used McIlroy's solution in practice and I've never found speed to be an issue.
Of course it's sorting variable-length strings rather than fixed-length records, so there's probably some subtlety. But I like this article because it tests your assumptions.
I'd be interested to see an incrementally-increasing benchmark. I'd imagine the cache-miss penalty from the HashSet is countered by the continuous re-sorting/copying/moving from the Array solution.
But that's why we measure, right?
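Something like this rough harness would show it (made-up data and sizes, not the article's benchmark; raise the upper bound to taste):

    // For growing n, time "count the distinct elements" two ways:
    // insert everything into a hash set vs. sort a copy and scan for runs.
    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <unordered_set>
    #include <vector>

    int main() {
        std::mt19937_64 rng(1);
        for (std::size_t n = 100'000; n <= 10'000'000; n *= 10) {
            std::vector<uint64_t> data(n);
            for (auto& x : data) x = rng() % (n / 2);   // force some duplicates

            auto t0 = std::chrono::steady_clock::now();
            std::unordered_set<uint64_t> seen(data.begin(), data.end());
            auto t1 = std::chrono::steady_clock::now();

            std::vector<uint64_t> copy = data;
            auto t2 = std::chrono::steady_clock::now();
            std::sort(copy.begin(), copy.end());
            auto distinct = static_cast<std::size_t>(
                std::unique(copy.begin(), copy.end()) - copy.begin());
            auto t3 = std::chrono::steady_clock::now();

            std::printf("n=%zu distinct=%zu (hash agrees: %s) hash=%.3fs sort=%.3fs\n",
                        n, distinct, seen.size() == distinct ? "yes" : "no",
                        std::chrono::duration<double>(t1 - t0).count(),
                        std::chrono::duration<double>(t3 - t2).count());
        }
    }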
10 rem assumes the input values are in v(1) .. v(end_of_list), each between 1 and max_value
20 for x = 1 to end_of_list
30 a(v(x)) = a(v(x)) + 1
40 next x
50 rem Done counting, print the result
60 for x = 1 to max_value
70 print "Value "; x; " has "; a(x); " repetitions"
80 next x