
Finding unique items: hash vs. sort - douglasorr
https://douglasorr.github.io/2019-09-hash-vs-sort/article.html
======
BeeOnRope
You could use a faster sort implementation, such as radix sort, which is O(n)
and also probably faster in practice when well implemented.

One option is this one [1] which I actually wrote as part of this exact task:
de-duplicating a list of integers, as part of working on [2].

I wrote about the types of speedups you can expect with radix sort here [3] -
it depends on the size of the input elements and their distribution (e.g., if
many top bits are zero, radix sort will be much faster), but in my test case I
see a ~5x speedup for moderate or large input sizes (thousands of elements or
more).

Of course, the same could be said about hash tables...
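
To give the flavor, here's a minimal sketch of an 8-bit-digit LSD radix sort
for 32-bit keys. It's nothing like as tuned as [1], just the shape of the
algorithm:

    #include <array>
    #include <cstdint>
    #include <vector>

    // LSD radix sort for 32-bit unsigned ints: four passes of 8-bit
    // counting sort, from least to most significant byte.
    void radix_sort(std::vector<uint32_t>& v) {
        std::vector<uint32_t> buf(v.size());
        for (int shift = 0; shift < 32; shift += 8) {
            std::array<size_t, 257> count{};  // count[b+1] = #keys with digit b
            for (uint32_t x : v) count[((x >> shift) & 0xFF) + 1]++;
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];  // digit offsets
            for (uint32_t x : v) buf[count[(x >> shift) & 0xFF]++] = x;
            v.swap(buf);
        }
    }

Each pass is a stable counting sort, which is what lets the passes compose
byte by byte.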

[1] https://github.com/travisdowns/sort-bench/blob/master/radix7.cpp

[2] https://lemire.me/blog/2019/05/07/almost-picking-n-distinct-numbers-at-random/

[3] https://travisdowns.github.io/blog/2019/05/22/sorting.html

~~~
xvector
Can you also hash every key to effectively turn them into a number, and then
perform radix sort?

If you hash with SHA-256, then sorting the resulting fixed-width digests seems
like it'd just take O(256N) = O(N).

What am I missing? Clearly this cannot work, because otherwise everyone would
do this rather than accepting O(N log N) as the norm!

~~~
jstanley
Calculating SHA256 is neither O(1) nor cheap.

Your idea is interesting and might be useful in some applications! But
calculating a SHA256 takes time linear in the length of the thing you're
hashing. Even if you assume fixed-size keys, SHA256 isn't _fast_. It's not
designed to be fast.

You'd likely need an extremely large N before O(N) SHA256 computations beat
O(N log N) comparisons.

~~~
jlokier
> But calculating a SHA256 takes time linear in the length of the thing you're
> hashing.

Ahem, so does hashing.

And so does sort comparison.

The GP's idea is actually quite effective for medium-to-large objects.

As a bonus, you can cache and store the SHA256 of each object for future in-
set testing and uniqueness-finding without looking at the objects themselves.
(You cannot do this with non-cryptographic hashes or sorting by comparisons.)

This is basically what Git does.
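
A minimal sketch of the pattern in C++, with a toy digest standing in for
SHA256 (a real version would call a crypto library; collision resistance is
what makes the cached-digest trick safe):

    #include <functional>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Toy stand-in digest, for shape only; substitute a real SHA256.
    // The point: each object is read once, and all later comparisons
    // touch only the fixed-size digests.
    static std::string digest(const std::string& blob) {
        return std::to_string(std::hash<std::string>{}(blob));
    }

    std::vector<std::string> unique_blobs(const std::vector<std::string>& blobs) {
        std::unordered_set<std::string> seen;
        std::vector<std::string> out;
        for (const auto& b : blobs)
            if (seen.insert(digest(b)).second)  // digest not seen before
                out.push_back(b);
        return out;
    }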

------
aidenn0
All this article has shown is that std::sort is much better optimized than
std::unordered_map in the C++ standard library used.

Every std::unordered_map implementation I've used is surprisingly slow.
Normally hash tables are slow because it's so easy to write a slow hash table,
but std::unordered_map is slow for other reasons, which I have had explained
to me and then promptly forgot :(.

I also find it strange that unordered_map is used rather than unordered_set.
Not sure if it would make a performance difference, but if we are going for
naive implementations, that should be at the top of the list.

~~~
noctune
One reason why unordered_map is slow is that it has to be a chained hash map,
because the spec doesn't allow iterator invalidation when resizing.

~~~
bertr4nd
I believe unordered_map does invalidate iterators on resize, but does not
invalidate references to keys/data, which means the data can't be stored
inline with the hash table structure.

------
umvi
I've seen this before, where big-O-obsessed co-workers love to make every
algorithm and task O(1) by using hash maps, etc., but are then flabbergasted
when "inferior" O(n) code written by someone else outperforms their
theoretical perfection by a long shot.

I have to remind them that big-O is about scaling, not absolute speed, and
that technically a dictionary lookup + sleep(5) is still O(1).

~~~
chillee
I understand the sentiment, but O(n) vs. O(1) and O(n^2) vs. O(n log n) are
huge jumps in complexity. Even with relatively small sizes like N=100, you're
already starting with a two-orders-of-magnitude disadvantage.

The example in this post is a log N factor. log N grows slowly: you'd need
1000 elements to get one order of magnitude, and you'll never run into two
orders of magnitude.

If you can come up with reasonable code where an algorithm that is a factor
of N worse in complexity is faster in practice, I'd love to see it.

~~~
mijoharas
I was about to say "the example in the post is sorting, so it's n * log(n)".

Upon thinking about it, the example compares the relative speed of hashing
every item (O(n) * O(1) == O(n)) and sorting (O(n * log(n))).

That means the ratio of the two is a log(n) factor: n * log(n) / n == log(n).

Good point.

As a side point, I think the main point of the post is that the time taken by
the hash solution increases at a greater rate than the sort + unique solution.
Now, that doesn't make sense with our big-O notation.

This is because we don't have a simple hashmap; what we have is a dynamically
resizing hashmap with no preset capacity, and the resize operations are order
n. Now, how often can we expect to resize? I think most hashmaps double their
capacity, so there will be log(n) resize operations that each take O(m)
(where m is the size of the hashmap when it needs to be resized).

Now at this point, I can't be bothered to think further about it. That feels
like it should be less than O(n * log(n)), but it's kinda close. Either way,
it's definitely larger than the simpler O(n) case we're thinking about.

~~~
jlokier
Dynamically resizing a hash table by doubling its size when it reaches a
threshold capacity is _amortised O(1)_ per insert.

If there are deletions, both operations are amortised O(1) too, provided the
implementation is sensible and shrinks at a lower load threshold than it
grows at.

It's the same sort of pattern as dynamically resizing a growing string. As
long as you resize by a constant factor > 1, rather than by a fixed
increment, the number of resizes is low enough that the O(n) cost of copying
is amortised over the prior appends.
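
Concretely, with doubling, the copies done by all the resizes over n inserts
form a geometric series:

    copies = 1 + 2 + 4 + ... + n/2 + n  <  2n
    =>  O(n) total copying across n inserts  =>  amortised O(1) per insert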

> Either way, it's definitely larger than the simpler O(n) case we're thinking
> about.

It actually isn't larger than O(n), setting aside constant factors.

If a hash map's time grows faster than O(n) for n insertions starting from
empty, that suggests an implementation problem.

~~~
mijoharas
I should have thought about things more. I think this is a case of
confirmation bias on my part.

I thought "what about dynamically resizing" and quickly googled complexity of
rehashing a hash which confirmed my thought that it was n.

I guess I didn't think that since we must have inserted n values in at this
point, we could just say "insertion is O(1)" and the constant factor would be
"performing two hashes" if we are going to a point where it's resizing, i.e.
pay the cost of hashing once and rehashing once.

that feels like it makes sense. I'm being a bit hand-wavy as I don't want to
sit down with pen and paper.

I retract my incorrect statement above. (I can no longer edit it to add
postscript retraction.)

------
kccqzy
The default hash functions in most C++ implementations are surprisingly slow;
implementations tend to aim for hash quality rather than speed.

I could be totally wrong, but I suggest trying a simpler hash such as FNV. It
might outperform the sort-based approach.

I also suggest replacing the hash table implementation with something better,
such as absl::flat_hash_map. The C++ std::unordered_map is hampered by
compatibility requirements: the designers wanted these unordered containers
to be a good substitute for the preexisting std::map, which necessitated
design decisions, such as pointer stability, that are unnecessary for most
applications.
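
For reference, FNV-1a is only a few lines; a 64-bit sketch:

    #include <cstdint>
    #include <string_view>

    // FNV-1a, 64-bit variant: xor in each byte, then multiply by the
    // FNV prime. Very cheap per byte, though weaker than heavier mixers.
    uint64_t fnv1a(std::string_view s) {
        uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
        for (unsigned char c : s) {
            h ^= c;
            h *= 0x100000001b3ULL;           // FNV prime
        }
        return h;
    }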

~~~
chillee
Actually, I believe the default hash functions have terrible hash quality -
they're literally the identity function for integers.

I do agree that unordered_map is extremely slow.

------
mlochbaum
The hash table implementation can be much faster. Dyalog APL's Unique (∪)
computes exactly the same function using a dedicated hash table.

            ∪ 'AABBADCAB'
      ABDC
            ≢∪ a←{⍵[?⍨≢⍵]} {⍵,⍵[(?⍴)⍨≢⍵]} 5e5?2e9  ⍝ 1e6 ints with 50% unique
      500000
            cmpx '∪a'  ⍝ Time taken by ∪a in seconds
      1.8E¯2

So 18ns per element on a million elements (at 3.5GHz). Characteristics of the
hash table we use include:

- Hashing with the Murmur3 mixing function (at the top of [1]), not the full
hash function

- Open addressing with ints stored directly in the hash table

- Sentinel value for empty slots chosen to be outside the argument's range

- [edit] Preallocated fixed-size table. It should be resizable when the
argument is large enough; I will fix that some day.

We know the range of the argument since we check it in order to possibly apply
small-range code, which can use an ordinary lookup table rather than a hash
table. In 18.0, which features a largely rewritten Unique implementation (not
much faster here), I applied some careful logic in order to use the first
entry of the argument as the sentinel rather than trying to find a value not
in the hash table.

[1] http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html
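
In C++ terms, a table with roughly those characteristics might look like this
sketch (my paraphrase, not Dyalog's actual code; the caller supplies a
sentinel known to be outside the argument's range):

    #include <cstdint>
    #include <vector>

    // Murmur3's fmix64 finalizer, used as the mixing step.
    static uint64_t fmix64(uint64_t x) {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33;
        return x;
    }

    // Open addressing with values stored directly in a fixed-size,
    // power-of-two table; preserves first-seen order like Unique.
    std::vector<int64_t> unique(const std::vector<int64_t>& v, int64_t sentinel) {
        size_t cap = 1;
        while (cap < v.size() * 2) cap <<= 1;   // load factor <= 0.5
        std::vector<int64_t> table(cap, sentinel);
        std::vector<int64_t> out;
        for (int64_t x : v) {
            size_t i = fmix64((uint64_t)x) & (cap - 1);
            while (table[i] != sentinel && table[i] != x)
                i = (i + 1) & (cap - 1);        // linear probing
            if (table[i] == sentinel) {         // first occurrence: keep it
                table[i] = x;
                out.push_back(x);
            }
        }
        return out;
    }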

~~~
mlochbaum
But Dyalog's sorting actually beats its own Unique!

            cmpx '{(1,2≠/⍵)/⍵} {⍵[⍋⍵]} a'
      1.3E¯2
            ({(1,2≠/⍵)/⍵} {⍵[⍋⍵]} a) ≡ ({⍵[⍋⍵]} ∪a)
      1

The function {⍵[⍋⍵]} (index the argument by its own grade) sorts a vector
ascending. {(1,2≠/⍵)/⍵} gets unique elements from a sorted numeric vector by
taking all those elements which are first or unequal to their predecessor:
2≠/⍵ tests for inequality on all pairs of adjacent elements, and / uses a
boolean vector on the left to filter elements on the right.

The second line tests that this unique-sort code gives the same result as
sorting the result of Unique.

We use a radix sort for vectors of 4-byte or larger numbers. Unfortunately we
can't use sorting to implement Unique in this way because Unique has to
preserve the ordering in the original argument. However, it could be used to
implement the sort-Unique or Unique-sort combination.
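
For comparison, the same sort-then-drop-adjacent-duplicates shape in C++ is
the classic sort + unique + erase idiom:

    #include <algorithm>
    #include <vector>

    // Sort so duplicates become adjacent, then keep each element that
    // differs from its predecessor, like (1,2≠/⍵)/⍵ on a sorted vector.
    void sort_unique(std::vector<int>& v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
    }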

------
sagarm
std::unordered_map is usually slow while std::sort is quite fast.
std::unordered_map's API includes pointer stability across map resizes, which
requires storing the contents of the map out-of-line in a separately allocated
memory block.

This results in poor cache utilization and higher memory management overhead.

The SwissTable family of containers is much faster if you do not have this
requirement. See https://abseil.io/about/design/swisstables for more on the
optimizations that make them fast.

I wrote a quick benchmark to compare sorting, std::set, std::unordered_set,
and ska::flat_hash_set (an older version of SwissTable, I believe), and
flat_hash_set was generally ~2.4x faster, even up to 100M integers.

https://pastebin.com/y5UsPek5

------
rurban
What he didn't show in the main article, only in subsequent code, is that
std::unordered_map is way too slow to be useful. With a proper fast map (he
called it custom_map), hashing outperforms sort-unique until the data set
exceeds the L3 cache size.

~~~
nartz
Right, well, it's clear that at least in Java he doesn't pre-allocate the
size of the hashmap. Thus, when the hashmap hits its capacity, the resize
operation has to copy all of the data into a new map, which makes this no
longer O(N) but something like O(N log N) or O(N^2), depending.

~~~
matvore
No, it's still O(N) amortized. Imagine the hashmap doubles in capacity
whenever it is full or hits its max load factor. Then you end up with O(N)
copy operations in total, since O(2N) = O(N).

~~~
yxhuvud
While your objection stands, repeated resizings of a hash table can have a
big impact on the actual wall-clock time spent, and it is a bad benchmark
that doesn't include the optimization of preallocating.

------
herf
Memory latency is everything once a hash table gets big enough. Anytime you
see ~100ns, you should think: random read from main memory!

------
rwem
The implementations in the benchmark are all pretty naive. You might get a
different outcome with more careful implementations of the functions.

------
alecco
Related: Sort vs. Hash revisited [joins] (VLDB 2009)

https://15721.courses.cs.cmu.edu/spring2016/papers/kim-vldb2009.pdf

It's interesting that they predicted sort would surpass hashing once
registers got to 512 bits wide, and now we have AVX-512. TBH, their SSE4 sort
implementation was quite complex (but beautiful).

------
magicalhippo
This is a case where the optimal strategy depends a lot on your input, I think.

For example, long ago I wrote a small program to find file duplicates. My
first step was to use the fact that files with different lengths can't be
duplicates. Thus only files which had the same length had to be checked
further.

For files on your average hard drive, that simple test screens the vast
majority of them.
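
In today's C++ the screening step might look something like this sketch (my
original program long predates std::filesystem):

    #include <cstdint>
    #include <filesystem>
    #include <map>
    #include <vector>

    namespace fs = std::filesystem;

    // Bucket files by byte length; only files sharing a length can be
    // duplicates, so buckets of size 1 are screened out immediately.
    std::vector<std::vector<fs::path>> candidate_groups(const fs::path& root) {
        std::map<std::uintmax_t, std::vector<fs::path>> by_size;
        for (const auto& entry : fs::recursive_directory_iterator(root))
            if (entry.is_regular_file())
                by_size[entry.file_size()].push_back(entry.path());

        std::vector<std::vector<fs::path>> groups;
        for (auto& [size, paths] : by_size)
            if (paths.size() > 1)  // unique length => can't be a duplicate
                groups.push_back(std::move(paths));
        return groups;
    }

Only the surviving groups then need byte comparison or hashing.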

------
mda
If the author had just used a fast hash table (Swiss table, etc.), the
results would immediately look very different.

------
macdice
If the keys are in random order, and the hash table is larger than L3, I bet
you can make the hash version faster by looking ahead N items and issuing
__builtin_prefetch() on the hash table array. (There are papers on this for
hash joins in databases.)
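
Something along these lines, sketched against a toy open-addressing table
(the mixing function and the lookahead distance are placeholders to tune;
__builtin_prefetch is the GCC/Clang builtin):

    #include <cstdint>
    #include <vector>

    // Toy stand-in mixing function, not from the article.
    static uint64_t mix(uint64_t x) { return x * 0x9e3779b97f4a7c15ULL; }

    // Insert keys into a power-of-two open-addressing table, issuing a
    // prefetch for the slot of the key AHEAD positions further on, so
    // the main-memory miss overlaps with probing the current key.
    // Assumes nonzero keys (0 marks an empty slot).
    void insert_all(std::vector<uint64_t>& table, const std::vector<uint64_t>& keys) {
        const size_t mask = table.size() - 1;   // table size is a power of two
        const size_t AHEAD = 16;                // lookahead distance, tune per machine
        for (size_t i = 0; i < keys.size(); i++) {
            if (i + AHEAD < keys.size())
                __builtin_prefetch(&table[mix(keys[i + AHEAD]) & mask]);
            size_t slot = mix(keys[i]) & mask;
            while (table[slot] != 0 && table[slot] != keys[i])
                slot = (slot + 1) & mask;       // linear probing
            table[slot] = keys[i];
        }
    }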

------
maury91
You could use a modified version of merge sort where, if an element already
exists, you don't add it (you end up keeping only unique elements); this
saves the extra O(n) pass where you remove the duplicates.
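
A sketch of such a merge step, assuming both halves are already sorted and
internally deduplicated:

    #include <vector>

    // Merge two sorted, duplicate-free runs, emitting equal elements
    // only once, so no separate dedup pass is needed afterwards.
    std::vector<int> merge_unique(const std::vector<int>& a, const std::vector<int>& b) {
        std::vector<int> out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])      out.push_back(a[i++]);
            else if (b[j] < a[i]) out.push_back(b[j++]);
            else { out.push_back(a[i]); i++; j++; }  // equal: emit once
        }
        while (i < a.size()) out.push_back(a[i++]);
        while (j < b.size()) out.push_back(b[j++]);
        return out;
    }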

------
xvector
I wonder how the following compare:

- Putting a bloom filter in front of the hash table

- Hashing the keys and performing radix sort on them

------
redis_mlc
My favorite is gratuitous ORDER BY clauses. I suggest you look at your SQL
and see whether they're actually needed.

~~~
perl4ever
Not having an ORDER BY when you need one is vastly worse, and more common,
than having one you don't need.

~~~
Waterluvian
I'm curious. Can you explain why?

~~~
perl4ever
People tend to believe the order of query results without ORDER BY is
deterministic.

~~~
yxhuvud
Also, the query optimizer is more eager to use indices when the query is ordered.

