
Removing duplicates from lists quickly - signa11
http://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/
======
CarolineW
Note: this is only removing _adjacent_ duplicate values.

~~~
signa11
From the article:

> To set a reference, suppose I generate 1024 random numbers in the range
> [0,1024) and I sort them. This will generate a few repeated values. I want
> to remove them.

So this removes all the duplicates.

~~~
CarolineW
The code, as given, removes only adjacent repeated values. If the data you
feed it is sorted then yes, all repeats will be removed, but that's not part
of the code he's writing and analysing, that's just how he generates his
sample data.

And I can see why the sample data would be generated like that. If you simply
generate random values then it's unlikely there will be many adjacent repeats.
Sorting the data makes every repeated value adjacent, and by adjusting the size
of the array/vector and the range from which the random numbers are chosen you
can control the expected number of repeats.
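
A rough sketch of that setup, assuming a `std::mt19937` generator with a fixed
seed (the article may well use a different generator; `make_sorted_sample` and
`count_adjacent_repeats` are names made up for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: n random values drawn from [0, range), then sorted
// so that equal values end up next to each other.
std::vector<std::uint32_t> make_sorted_sample(std::size_t n,
                                              std::uint32_t range,
                                              std::uint32_t seed = 42) {
    assert(range > 0);  // an empty range has nothing to draw from
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::uint32_t> dist(0, range - 1);
    std::vector<std::uint32_t> v(n);
    for (auto& x : v) x = dist(gen);
    std::sort(v.begin(), v.end());
    return v;
}

// Hypothetical helper: number of positions whose value equals its left neighbour.
std::size_t count_adjacent_repeats(const std::vector<std::uint32_t>& v) {
    std::size_t repeats = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        repeats += (v[i] == v[i - 1]);
    }
    return repeats;
}
```

With `n` fixed, shrinking `range` pushes the repeat count up: once `range` is
smaller than `n`, at least `n - range` repeats are guaranteed.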

I also observe that his opening example - 1,1,2,3,3,3,4,0,0 - is _not_ sorted,
so the sorting really is just to give examples that are worth analysing.
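
To make the distinction concrete, here is a minimal sketch that uses
`std::unique` as a stand-in for the article's routines (`dedup_adjacent` is a
hypothetical helper name, not something from the article):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: collapse each run of equal neighbouring values to a
// single value. std::unique only compares neighbours, so repeats that are
// further apart are left alone.
std::vector<int> dedup_adjacent(std::vector<int> v) {
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}
```

On the opening example 1,1,2,3,3,3,4,0,0 this yields 1,2,3,4,0. On 1,2,1,2
nothing is removed, because no two equal values are neighbours; on the sorted
version 1,1,2,2 the result is 1,2.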

So to repeat, the code only removes adjacent duplicate values.

~~~
signa11
> So to repeat, the code only removes adjacent duplicate values.

Yup. The author wants to see how far things can be taken, and with SIMD in
play you can get to approx. 1 cycle per array element, which is an order of
magnitude better than C++ std::unique.

~~~
gus_massa
The speedup is interesting.

I especially like reading another example where killing an `if` easily
improves performance. `if`s are bad!
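
For anyone who hasn't seen the trick, a sketch of what killing the `if` can
look like. This is my own reconstruction under the assumption of 32-bit
values, not the article's exact code, and the function names are made up:

```cpp
#include <cstddef>
#include <cstdint>

// Branchy version: store a value only when it differs from the previous one.
std::size_t unique_branchy(std::uint32_t* data, std::size_t n) {
    if (n == 0) return 0;
    std::size_t pos = 1;
    std::uint32_t prev = data[0];
    for (std::size_t i = 1; i < n; ++i) {
        std::uint32_t cur = data[i];
        if (cur != prev) {
            data[pos++] = cur;
        }
        prev = cur;
    }
    return pos;  // number of values kept at the front of data
}

// Version without an `if` in the loop body: always store, and advance the
// write position by the 0/1 result of the comparison.
std::size_t unique_branchless(std::uint32_t* data, std::size_t n) {
    if (n == 0) return 0;
    std::size_t pos = 1;
    std::uint32_t prev = data[0];
    for (std::size_t i = 1; i < n; ++i) {
        std::uint32_t cur = data[i];
        data[pos] = cur;       // safe: pos <= i, and data[i] was already read
        pos += (cur != prev);  // keep the store only if the value changed
        prev = cur;
    }
    return pos;  // number of values kept at the front of data
}
```

Both return the same result; whether the second one actually compiles to
branch-free code, and how much it helps, depends on the compiler and the data.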

The SIMD version is much faster still, but here we can start a discussion
about portability and maintainability. It's a trick to use only in very
special cases.

Anyway, removing the consecutive duplicates of a vector is an O(N) operation,
but sorting the vector is an O(N log(N)) operation. So in any real example the
sorting part will be slower than the removing part. To make the whole
operation quicker it's better to improve the sorting part.
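
If you want to check this on your own data, a minimal sketch that times the
two phases separately. It uses `std::sort` and `std::unique` rather than the
article's routines, and `PhaseTimes` and `time_sort_then_unique` are
hypothetical names:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type: time spent in each phase, plus the number of
// distinct values that remain.
struct PhaseTimes {
    std::chrono::nanoseconds sort_time;
    std::chrono::nanoseconds unique_time;
    std::size_t distinct;
};

// Sort a copy of the input, then drop adjacent duplicates, timing each step.
PhaseTimes time_sort_then_unique(std::vector<int> v) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    std::sort(v.begin(), v.end());                      // O(N log N)
    auto t1 = clock::now();
    v.erase(std::unique(v.begin(), v.end()), v.end());  // O(N)
    auto t2 = clock::now();
    return {std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0),
            std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1),
            v.size()};
}
```

A single call on a small vector is too noisy to mean much; run it on large
inputs and repeat the measurement before drawing conclusions.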

It's a nice article, but sweeping the slowest part of the operation under the
rug is a bit misleading.

