
What to use instead of std::set [pdf] - ingve
http://lafstern.org/matt/col1.pdf
======
taeric
A fun trivia point that I have never seen anyone exploit: in many cases a
linear search with a sentinel value can outperform a binary search. It is
amusing to see the lengths people go to in order to optimize collections that
will never be more than 1k in size, only to actually be working against the
things computers are really good at.
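
For anyone who hasn't seen the trick, here is a minimal sketch of a sentinel search. The name `sentinel_find` is made up for illustration, and whether it actually beats a binary search depends on the data and the hardware, so measure before relying on it.

```cpp
#include <cstddef>
#include <vector>

// Linear search with a sentinel: temporarily append the key so the scan
// loop needs no bounds check, then restore the vector before returning.
// Returns the index of the first match, or data.size() if the key is absent.
std::size_t sentinel_find(std::vector<int>& data, int key) {
    data.push_back(key);          // sentinel guarantees the loop terminates
    std::size_t i = 0;
    while (data[i] != key) {      // one comparison per element
        ++i;
    }
    data.pop_back();              // remove the sentinel again
    return i;                     // i == data.size() means "not found"
}
```

The loop body has a single comparison per element, which is the point: the index check is replaced by the guarantee that the key is always found.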

~~~
ludwigschubert
Computer science curricula usually contain an introductory algorithms and data
structures class, but these don't usually put focus on topics like cache
locality or branch prediction.

Do you agree? And if so, would that—at least partially—explain your
observation?

~~~
Analemma_
My undergraduate algorithms course did point out that memory hierarchies and
branch predictors laugh at big-O notation, and that you should always (as
Roboprog said) measure measure measure instead of just assuming a particular
algo is best. It was one little piece of material in a course with a lot of
ground to cover though, so it's hard to say how well it stuck with the
class...

------
krinchan
Set is just sort of weird in other languages; I've never looked at set in the
STL, and I'm sort of surprised at the implementation there. Almost every other
language's library tends to implement sets as a special case of a hash map
without values, just keys.

This leads a lot of people down a weird alley where they think they're getting
a list that prevents duplicates and then have all sorts of weird behavior when
they iterate over it. However, 90% of the time they don't care about order and
the set is nowhere near large enough to really show the overhead you're
incurring over a normal linked list or array list.

Still, it's funny when they finally run into either of those issues and
someone has to explain that the entire point of a set is really a data
structure optimized around answering the question, "Do you contain x?" and not
"Prevent duplicate entries."

I've actually used sets before, but for the purpose of having a sort of
"memory" about nodes the code has seen before while looking for cycles in a
graph. Even then, that was hardly production code and more a slow running data
clean up operation.
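
A rough sketch of that "memory" idea, assuming a directed graph stored as an adjacency map. This is not the original cleanup code; `Graph`, `has_cycle_from` and `has_cycle` are names invented for the example.

```cpp
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical graph type: node id -> ids of the nodes it points to.
using Graph = std::unordered_map<int, std::vector<int>>;

// Depth-first walk. `on_path` remembers nodes on the current path,
// `done` remembers nodes already fully explored without finding a cycle.
bool has_cycle_from(const Graph& g, int node,
                    std::unordered_set<int>& on_path,
                    std::unordered_set<int>& done) {
    if (on_path.count(node)) return true;   // seen on this path: cycle
    if (done.count(node)) return false;     // already cleared earlier
    on_path.insert(node);
    auto it = g.find(node);
    if (it != g.end()) {
        for (int next : it->second) {
            if (has_cycle_from(g, next, on_path, done)) return true;
        }
    }
    on_path.erase(node);
    done.insert(node);
    return false;
}

// True if any directed cycle is reachable from a node that has an entry in g.
bool has_cycle(const Graph& g) {
    std::unordered_set<int> on_path, done;
    for (const auto& entry : g) {
        if (has_cycle_from(g, entry.first, on_path, done)) return true;
    }
    return false;
}
```

Both sets are only ever asked "do you contain x?", which is exactly the question a set is built for.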

~~~
Roboprog
Uh, OK: "Is x a duplicate?" -> "Do you contain x?"

~~~
krinchan
Not _exactly_. Typically, iterating over a set is generally not what's
intended when someone designs a set structure. This in turn becomes a problem
in some languages.

95% of the time it simply doesn't matter. I mean, people iterate over maps all
the time in day-to-day stuff, which is also a lookup-oriented structure.

But the other 5% of the time, when people expect list-like behavior or
performance from a set and don't get it, they act like this is a totally new
thing for them even though they've been programming for years. :-/

EDIT: To clarify, it's really hard to explain to some folks that sets are
optimized to answer quickly if they contain an X and Lists are optimized to
quickly let you do a thing across all elements. Sets just _happen_ to give you
values back most of the time, but the real question they're trying to answer
is if you have previously placed a known value into the set. Returning the
actual contents of the Set is not actually a requirement at the more abstract,
theoretical level though most standard libraries provide that.

In some libraries, actually getting at the contents and iterating over a set
is quite hard. I've seen highly optimized sets for massive data that don't
actually store the items, but a highly optimized, almost range-encoded version
(hashes 0xabcd through 0xafaa are in the set) of the data.

Like I said, the distinction between a list that disallows duplicate entries
and a set really doesn't show up until you get into some pretty large scale
things that most programmers don't ever use.

However, I feel understanding the distinction and how trying to answer one
question (Do you contain X?) can lead to design decisions that make answering
another question (What are all the unique values out of the values I just
added to you?) more difficult leads to better programmers. That is, if they
have the skills to abstract the thought exercise and apply it to other
situations.

~~~
Roboprog
Thanks for the follow up. I don't expect a particular order if I spin through
the contents of a set, but it hadn't occurred to me how expensive it might be
simply to _get_ the sequence.

This makes a good example of when "polymorphism" or "strategy patterns" are
useful to swap in and out _how_ you do some related things to fit what
actually ended up happening in an app.

------
jdbernard
For those wondering what value a CS degree has, this is the kind of material
that is covered in a data structures class in a good CS program.

------
Roboprog
Suggested title: "Why you shouldn't use set in C++ STL, ..."

~~~
ginko
Or just "Why you shouldn't use std::set, ..."

~~~
Roboprog
That, and C++ just isn't my cup of tea. Meyers's "Effective C++" book convinced
me of that quite some time ago.

If I was in the video game industry, I guess it would be the only game in town
though. Glad I'm not.

~~~
Impossible
While AAA games haven't given up C++ and probably won't for a long time, many
games in the indie and mobile space are made without any of the creators
writing a line of C++, in favor of C#, Lua, Java or JavaScript.

------
greg7mdp
Actually, if you don't need the entries to be sorted, the sparse_hash_set from
https://github.com/greg7mdp/sparsepp
is much faster than std::set or a sorted vector (except for iteration) and
barely uses more memory than the sorted vector.

~~~
Ono-Sendai
Or use unordered_set.

~~~
greg7mdp
Well, you certainly could, but unordered_set will use significantly more
memory than the sorted vector or sparse_hash_set, and is actually slower than
sparse_hash_set.

------
leecarraher
To me set implies the mathematical definition of a Set, not just a list that
prevents duplicates. As such it should be optimized for the various set
operations: union, intersect, compliment, sym_difference, isdisjoined,
issubset.
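
As a sketch of what those operations can look like, the standard `<algorithm>` header already provides several of them for any sorted range, so a sorted `std::vector` works as the container. The wrapper names below are made up for illustration, and both inputs are assumed to be sorted and free of duplicates.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// All functions assume both inputs are sorted ascending with no duplicates.

std::vector<int> union_of(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

std::vector<int> intersection_of(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

std::vector<int> sym_difference_of(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                  std::back_inserter(out));
    return out;
}

// True if every element of `sub` also appears in `super`.
bool is_subset(const std::vector<int>& sub, const std::vector<int>& super) {
    return std::includes(super.begin(), super.end(), sub.begin(), sub.end());
}
```

Each of these is a single linear merge pass over the two inputs.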

For just avoiding duplicates, a list and a Bloom filter is probably a
faster data structure than a set.
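
A toy sketch of that combination, assuming `int` elements, a fixed 1024-bit filter and two ad hoc hash functions. `FilteredList` and everything inside it are hypothetical, and this is untuned and unbenchmarked.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// A list that keeps insertion order and rejects duplicates. The Bloom
// filter answers "definitely new" cheaply; only a "maybe present" answer
// falls back to scanning the list.
class FilteredList {
public:
    // Returns true if x was appended, false if it was already present.
    bool push_back_unique(int x) {
        const std::size_t s1 = slot1(x);
        const std::size_t s2 = slot2(x);
        if (bits_[s1] && bits_[s2]) {
            // Possible duplicate (or a false positive): confirm by scanning.
            if (std::find(items_.begin(), items_.end(), x) != items_.end()) {
                return false;
            }
        }
        bits_[s1] = true;
        bits_[s2] = true;
        items_.push_back(x);
        return true;
    }

    const std::vector<int>& items() const { return items_; }

private:
    static constexpr std::size_t kBits = 1024;  // assumed filter size

    static std::size_t slot1(int x) {
        return std::hash<int>{}(x) % kBits;
    }
    // Second hash: multiplicative mixing of the first (an arbitrary choice).
    static std::size_t slot2(int x) {
        const std::uint64_t h = static_cast<std::uint64_t>(std::hash<int>{}(x));
        return static_cast<std::size_t>((h * 0x9E3779B97F4A7C15ull) >> 32) % kBits;
    }

    std::bitset<kBits> bits_;
    std::vector<int> items_;
};
```

Once the filter fills up, most inserts degrade to the linear scan, so the filter size has to match the expected element count.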

~~~
kazinator
complEment is tricky. How do you propose that it be encoded?

You don't literally want to create a complement set and populate it with every
possible object that is not in the original set (but _could_ be, according to
type).

A clone of the set with a complement flag? That could work.

At least until someone constructs a complement of a set, and then wants to
iterate over it, oops!
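
A minimal sketch of the flag idea, with a made-up `FlaggedSet` class over `int`. Membership tests keep working after complementing, and iteration is exactly the part that stops making sense.

```cpp
#include <unordered_set>

// Hypothetical set with a complement flag. When the flag is set, `items_`
// holds the elements that are NOT in the set.
class FlaggedSet {
public:
    void insert(int x) {
        if (complemented_) {
            items_.erase(x);    // x is no longer excluded
        } else {
            items_.insert(x);
        }
    }

    bool contains(int x) const {
        const bool stored = items_.count(x) != 0;
        return stored != complemented_;   // flip the answer when complemented
    }

    // Cost: one copy of the stored items plus a flipped flag.
    FlaggedSet complement() const {
        FlaggedSet c = *this;
        c.complemented_ = !complemented_;
        return c;
    }

    // A complemented set has no sensible finite sequence to iterate over.
    bool is_enumerable() const { return !complemented_; }

private:
    std::unordered_set<int> items_;
    bool complemented_ = false;
};
```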

------
ybaumes
There's actually an item about that in Scott Meyers's Effective STL book.

Item 23: Consider replacing associative containers with sorted vectors.
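
For reference, a bare-bones sketch of the idea (not the book's code; `SortedVectorSet` is a name made up here): keep a `std::vector` sorted and use binary search for lookups.

```cpp
#include <algorithm>
#include <vector>

// A set of ints stored as a sorted vector without duplicates.
class SortedVectorSet {
public:
    // Inserts x at its sorted position. Returns false if already present.
    bool insert(int x) {
        auto it = std::lower_bound(data_.begin(), data_.end(), x);
        if (it != data_.end() && *it == x) {
            return false;
        }
        data_.insert(it, x);   // O(n) shift, but contiguous memory
        return true;
    }

    // O(log n) lookup over contiguous storage.
    bool contains(int x) const {
        return std::binary_search(data_.begin(), data_.end(), x);
    }

    // Elements in ascending order.
    const std::vector<int>& items() const { return data_; }

private:
    std::vector<int> data_;
};
```

Inserts are linear, so this suits workloads that mostly look things up after an initial fill.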

