

Using the Right Datastructure for the job - rohankshir
http://rohankshir.github.io/2015/05/15/choosing-the-right-datastructure/

======
ninjakeyboard
I think the point about measuring is good, and even if we're only talking
about hash maps and vectors, the point is clear to me. It's probably
predictable that a vector would be at least on par with a small map, but the
impact on performance becomes clear when we consider calling that operation
millions of times a second. I think you're right about the use case as well -
e.g. in building distributed systems with consistent hashing to determine
which node to use. That's a real use case where routing logic performance may
actually have some impact (though still probably not the worst part of the
problem). E.g. Cassandra by default uses a hash ring with 64 vnodes, and
routing happens all over the cluster via what become 'co-ordinating nodes'.

At any rate, I'm always hammering the "measure measure measure" message
whenever talking about performance as well just because I see how brutally
wrong I am every time I take guesses about where time is spent. Every engineer
worth their salt (I hope) knows from experience that we are almost always
wrong about where the time is. It's a good message for us all to be reminded
of.
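
To illustrate, here's a rough sketch of that kind of ring lookup (assuming a
std::map as the ring and std::hash for placement; the node names and vnode
count here are made up, not Cassandra's actual scheme):

    // Rough consistent-hashing ring sketch (illustrative only; node names, the
    // vnode count and the use of std::hash are assumptions, not Cassandra's scheme).
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<std::size_t, std::string> ring;  // hash position -> node
        std::hash<std::string> hasher;

        // Place a number of virtual nodes per physical node on the ring.
        for (const std::string node : {"node-a", "node-b", "node-c"}) {
            for (int v = 0; v < 64; ++v) {
                ring[hasher(node + "#" + std::to_string(v))] = node;
            }
        }

        // Route a key: the first vnode clockwise from the key's hash owns it.
        const std::string key = "user:42";
        auto it = ring.lower_bound(hasher(key));
        if (it == ring.end()) it = ring.begin();  // wrap around the ring
        std::cout << key << " -> " << it->second << "\n";
    }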

------
cmrx64
The contents of the article disagree with the title and the introduction! Why
stop at these "CS II" datastructures? There are plenty of "exotic" data
structures for integer lookup that could be evaluated, starting with tries.
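
To make that concrete, here's a rough sketch of a bitwise trie over 32-bit
keys (not from the article, just an indication of the kind of structure
meant):

    // Rough bitwise-trie sketch for 32-bit integer membership (illustrative only):
    // one node per bit, walking from the most significant bit downwards.
    #include <cstdint>
    #include <iostream>
    #include <memory>

    struct TrieNode {
        std::unique_ptr<TrieNode> child[2];
        bool terminal = false;  // a key ends at this node
    };

    void insert(TrieNode& root, std::uint32_t key) {
        TrieNode* node = &root;
        for (int bit = 31; bit >= 0; --bit) {
            int b = (key >> bit) & 1u;
            if (!node->child[b]) node->child[b] = std::make_unique<TrieNode>();
            node = node->child[b].get();
        }
        node->terminal = true;
    }

    bool contains(const TrieNode& root, std::uint32_t key) {
        const TrieNode* node = &root;
        for (int bit = 31; bit >= 0; --bit) {
            node = node->child[(key >> bit) & 1u].get();
            if (!node) return false;
        }
        return node->terminal;
    }

    int main() {
        TrieNode root;
        insert(root, 42);
        std::cout << contains(root, 42) << " " << contains(root, 7) << "\n";  // 1 0
    }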

------
tantalor
Not much mention of how much memory these choices require. What about a Bloom
filter?

https://en.wikipedia.org/wiki/Bloom_filter

~~~
lorenzhs
Bloom filters are unsuitable for implementing a cache. For one, they provide
membership queries, not lookups. In the article, the problem was reduced to
membership queries, so this would not be an issue in that specific setting,
but it limits their usability in practice.

Second, Bloom filters have false positives. In the use case of cache lookups,
you'd usually be fine with false negatives (querying the database even if we
didn't really have to, i.e. a bit more work) but not false positives (what do
you do then?).

Third, Bloom filters don't support deletion unless you use counting Bloom
filters, which would reduce the space advantage dramatically. Normally, your
cache will be of limited size, so you will need deletion for eviction.

Bloom filters are not the right data structure for the job.
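
For reference, a toy sketch of the structure in question - it only answers
"possibly present?" queries, can report false positives, and has no way to
delete or to return a stored value (the sizes and the double-hashing trick
here are arbitrary choices):

    // Toy Bloom filter sketch (illustrative only): add() and possiblyContains(),
    // but no deletion and no way to retrieve a stored value.
    #include <bitset>
    #include <cstddef>
    #include <functional>
    #include <iostream>

    class BloomFilter {
    public:
        void add(int key) {
            for (std::size_t i = 0; i < kHashes; ++i) bits_.set(index(key, i));
        }
        // May return true for keys that were never added (false positive),
        // but never returns false for a key that was added.
        bool possiblyContains(int key) const {
            for (std::size_t i = 0; i < kHashes; ++i)
                if (!bits_.test(index(key, i))) return false;
            return true;
        }
    private:
        static constexpr std::size_t kBits = 1 << 16;
        static constexpr std::size_t kHashes = 4;
        std::bitset<kBits> bits_;
        static std::size_t index(int key, std::size_t i) {
            // Double hashing: combine two hash values to simulate k hash functions.
            std::size_t h1 = std::hash<int>{}(key);
            std::size_t h2 = std::hash<long long>{}(static_cast<long long>(key) * 1099511628211LL);
            return (h1 + i * h2) % kBits;
        }
    };

    int main() {
        BloomFilter bf;
        bf.add(42);
        std::cout << bf.possiblyContains(42) << "\n";  // 1
        std::cout << bf.possiblyContains(7) << "\n";   // probably 0, but could be 1
    }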

~~~
tantalor
Don't you have that backwards? The question a Bloom filter solves is, "Does my
database possibly have a value for this key?"

If a cache gives a false negative (bloom filter cannot) then you immediately
return a null response without querying the database, but the database does
have the key; this is incorrect behavior.

If a cache gives a false positive (bloom filter can) then you query the
database and get a null response since the database does not have the key;
this is correct behavior, although it has a performance penalty.
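
A sketch of the arrangement described here, with the filter sitting in front
of the database rather than acting as the cache itself (the filter is faked
with a plain set purely for illustration; all names are placeholders):

    // Sketch of a membership filter used in front of a database (illustrative only).
    #include <iostream>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    // Stand-in for a Bloom filter: a set we pretend can give false positives.
    struct ApproxFilter {
        std::unordered_set<int> keys;
        bool possiblyContains(int k) const { return keys.count(k) > 0; }
    };

    std::optional<std::string> lookup(int key,
                                      const ApproxFilter& filter,
                                      const std::unordered_map<int, std::string>& db) {
        // Definite negative: the key is certainly not in the database, skip the query.
        if (!filter.possiblyContains(key)) return std::nullopt;
        // Possible positive: query the database; a false positive only costs a wasted query.
        auto it = db.find(key);
        if (it == db.end()) return std::nullopt;  // still the correct answer
        return it->second;
    }

    int main() {
        std::unordered_map<int, std::string> db{{1, "one"}, {2, "two"}};
        ApproxFilter filter{{1, 2, 3}};  // 3 acts as a "false positive" relative to db
        std::cout << lookup(1, filter, db).value_or("miss") << "\n";  // one
        std::cout << lookup(3, filter, db).value_or("miss") << "\n";  // wasted query, miss
        std::cout << lookup(9, filter, db).value_or("miss") << "\n";  // filter says no, skip db
    }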

~~~
lorenzhs
> Don't you have that backwards? The question bloom filter solves is, "Does my
> database possibly have a value for this key?"

No, since the cache is there to avoid having to query the database. Querying
the database happens on a cache miss, and if there is no data there, then
that's the answer.

It seems that we are talking about different applications.

------
SamReidHughes
The most important piece of code in this blog post is right here:

    
    
        for (size_t i = 0; i != NUM_KEYS; ++i) {
            int key = dis(gen);
            for (size_t j = 0; j != NUM_2ND_KEYS; ++j) {
                int key2 = dis(gen);
    

A whole bunch of 2nd keys are generated for each "1st" key. In the parts of
the chart where blue beats red, where you've got 2000 "1st" keys, with a
MAX_VAL of 100000, each query only has a 2% chance of hitting a "1st" key and
experiencing any sort of vector traversal. If you merely generated NUM_KEYS *
NUM_2ND_KEYS combinations of 1st and 2nd independent keys, the cost of what
once were rare long traversals would be de-amortized into a bunch of tiny
traversals that don't benefit from locality.

If you only run queries on keys actually present in the data set (which is a
far more typical use case than guessing points in a sparse data set), you'll
soon regret the use of a vector.

Edit: It's also worth pointing out that the implementation with a vector is
missing the values, which it would need in order to be functionally
equivalent. Changing it to a vector<pair<int, int>> will hurt your vector
traversal times. (Replacing the vector with a std::unordered_map<int, int>
will improve your performance.)
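
A sketch of the shape of that change (not the article's benchmark, just the
two lookups side by side, with placeholder data):

    // Illustrative comparison: key/value vector scanned linearly vs. unordered_map lookup.
    #include <algorithm>
    #include <iostream>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::pair<int, int>> vec{{1, 10}, {5, 50}, {9, 90}};
        std::unordered_map<int, int> map{{1, 10}, {5, 50}, {9, 90}};

        int key = 5;

        // Linear scan over pairs: each element is now twice as wide as in a
        // vector<int>, which is the locality cost mentioned above.
        auto vit = std::find_if(vec.begin(), vec.end(),
                                [key](const std::pair<int, int>& p) { return p.first == key; });
        if (vit != vec.end()) std::cout << "vector: " << vit->second << "\n";

        // Hash lookup: expected O(1) regardless of where the key sits.
        auto mit = map.find(key);
        if (mit != map.end()) std::cout << "map: " << mit->second << "\n";
    }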

------
mattnewport
While it's true that "measure everything" is a good ideal to aim for, I think
this article also ends up demonstrating why that's actually hard in practice.
One of the reasons it is hard is that "the right datastructure for the job"
often depends very much on the details of "the job", including things like the
expected distribution of inputs, and also on what "right" means for the
particular job (e.g. do we prefer the best possible average times, or do we
accept worse averages in order to avoid bad worst-case times?).

The first example raises as many questions as it answers. In practice I'd
consider using a _sorted_ vector with binary search lookup over a hash map in
situations where key insertion / deletion either happens entirely up front or
is very infrequent relative to lookup. The benchmark here does have all
insertions up front but doesn't test a sorted vector with binary search. The
problem description suggests, however, that insertions and deletions will
happen in the real use case, though we don't really know with what relative
frequencies.
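
For concreteness, a minimal sketch of the sorted-vector approach, assuming
keys and values stored inline as pairs (the types and data are placeholders):

    // Minimal sorted-vector lookup sketch (illustrative only):
    // build once, sort once, then answer lookups with binary search.
    #include <algorithm>
    #include <iostream>
    #include <utility>
    #include <vector>

    int main() {
        // All insertions happen up front, matching the scenario described above.
        std::vector<std::pair<int, int>> data{{9, 90}, {1, 10}, {5, 50}};
        std::sort(data.begin(), data.end());  // pair comparison sorts by key first

        int key = 5;
        // lower_bound on the key: O(log n) with good locality on contiguous storage.
        auto it = std::lower_bound(data.begin(), data.end(), key,
                                   [](const std::pair<int, int>& p, int k) { return p.first < k; });
        if (it != data.end() && it->first == key)
            std::cout << "hit: " << it->second << "\n";
        else
            std::cout << "miss\n";
    }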

If we were to use a sorted vector, performance in the use case described would
be affected by the size of the cached data if the data was stored 'inline' (as
a vector<pair<key, value>>) and if it was stored elsewhere would depend on how
we accessed it after finding the key. The best choice there would in turn
depend on the relative frequencies and distribution of insertions and
deletions and the percentage of cache hits.

In practice I think simple synthetic benchmarks like this end up being of
limited value. Usually you're better off implementing something simple with
reasonable theoretical performance based on your expectations of the
distribution of data you'll see, then profiling / benchmarking with
representative data (ideally captured real world data) once you have something
working. Making guesses / approximations up front about the data distribution
you'll see in practice can be just as misleading as relying on your intuitions
about performance without measuring.

Sometimes you have a good understanding of the data you'll have to deal with
up front (or even have the data available for testing), in which case you may
be able to do a better job of building meaningful benchmarks before starting a
full implementation, but in my experience that's relatively rare (though in
other domains it may be more common).

