
Mapping Strings in C++ - iKlsR
https://cdacamar.github.io/data%20structures/algorithms/benchmarking/mapping-strings/
======
jsnell
The benchmark is kind of dodgy.

- It looks up the same key on every iteration, so every level of the trie
stays in the cache, no matter how large the nominal data set size is.

- The setup appears to be such that even though the key is 128 bytes, it's
probably not going to share a prefix longer than 4-5 bytes.

- The code constructs a temporary string from the 128-byte key for every
lookup in the hash table (by going from a string to a char pointer and back
to a string), but not for the trie.

So it first eliminates the problems of a trie (especially the cache
inefficiency) through an unrealistic test setup, and then has a bug that
introduces probably a factor of 2 of unnecessary overhead for the hash table.
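
A minimal sketch of that third point, with hypothetical names and not the article's actual code: with the key held as a char pointer, every hash-table probe materializes a temporary std::string (an allocation plus a 128-byte copy), while the trie can walk the raw characters directly. Since C++20, a transparent hash and equality pair lets std::unordered_map::find accept a std::string_view with no temporary:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <string_view>
    #include <unordered_map>

    std::unordered_map<std::string, int> m;

    // Each call implicitly constructs std::string(key): one allocation
    // and one copy per lookup, which the trie lookup never pays.
    bool contains(const char* key) {
        return m.find(key) != m.end();
    }

    // C++20 heterogeneous lookup: a transparent hasher plus
    // std::equal_to<> lets find() take a string_view directly.
    struct string_hash {
        using is_transparent = void;
        std::size_t operator()(std::string_view sv) const {
            return std::hash<std::string_view>{}(sv);
        }
    };
    std::unordered_map<std::string, int, string_hash, std::equal_to<>> m2;

    bool contains2(std::string_view key) {
        return m2.find(key) != m2.end();  // no temporary string
    }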

~~~
nightcracker
There's an incredibly simple algorithm for such an access pattern. Store all
values in a vector, without regard for order. Then when an element is
requested, find its index i with a linear scan and swap elements i and i/2.

In an access pattern where the same elements are requested over and over this
very quickly brings those elements to the front, giving really fast access
times. This super simple data structure would totally blow the rest out of the
water on this benchmark, and should illustrate the issue with it quite well.
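
A sketch of that structure, with hypothetical names (just the swap-to-i/2 heuristic described above, not any library API):

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Self-organizing unsorted vector: a hit at index i is swapped with
    // index i/2, so frequently requested keys migrate toward the front
    // and later scans for them terminate early.
    struct SelfOrganizingMap {
        std::vector<std::pair<std::string, int>> items;

        int* find(const std::string& key) {
            for (std::size_t i = 0; i < items.size(); ++i) {
                if (items[i].first == key) {
                    std::swap(items[i], items[i / 2]);  // promote the hit
                    return &items[i / 2].second;
                }
            }
            return nullptr;  // not present
        }
    };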

~~~
user5994461
And enable OpenMP afterwards, to process the 1 million strings in parallel
using all the available cores.

~~~
KayEss
I hate it when libraries try this sort of thing. I've already used all of the
cores higher up in the program logic, where it's far more effective; if the
library does that too, all it's going to achieve is a net slowdown in
throughput.

Don't blindly throw work at multiple cores low down in algorithms. Amdahl's
law tells you that the payoff is likely low, and if the program already makes
proper use of the cores you're just going to slow things down.

~~~
user5994461
Agreed that libraries need to think about how programs will interact with
them.

That program is just a benchmark to map millions of strings to integers;
it's valid to optimize its processing.

------
bshlgrs
"At around 1,000,000 elements our unsorted vector takes around 32 minutes to
lookup a string."

I don't at all believe that C++ can only iterate through 2000 strings a
second. Python takes an imperceptibly small time to iterate through an array
of a million strings when I test it in my REPL, and I imagine C++ would
either be as fast or somewhat faster. This number is so ridiculous that it
makes me very skeptical of the rest of the article.

~~~
martincmartin
The x-axis is the number of iterations, i.e. the number of lookups. The
collection is always size 1,000,000. Also, each of those lookups is for the
exact same entry, so after the first time, it'll be all cache hits. Well,
except for the std::vector, which is what you're talking about... :)

~~~
jstimpfle
Strings are length 10 to 100. A cache line is 64 bytes. A cache miss is
< 1000 cycles, i.e. < 10^3 ns. That makes < 1 s for 10^6 strings.

Furthermore adjacent strings are likely to be adjacent in memory, from which I
would naively expect cache prefetching to succeed.

Additionally, the curve is quadratic (if anything), not linear. I bet the
author looked up all n strings from the vector of n strings.

~~~
aarongolliver
I think the curves are linear because the x-axis is log scale but the y-axis
is not.

~~~
jstimpfle
Gotcha. It's probably still quadratic, just judging from the values. You
can't need 32 minutes for 10^6 100-byte comparisons (which typically abort
after the first unequal character; not that that matters).

~~~
aarongolliver
It's definitely a strange set of graphs:

- Mixed (log/linear) scales on the axes.

- Only the last two columns of points on each graph matter, because the
left-hand side is all zeros (an artifact of the y-axis not being log scale).

- The y-axis is sometimes in thousands-of-mega-nanoseconds (with two
significant digits) and sometimes in giga-nanoseconds (with one).
Giga-nanoseconds are seconds!

- The colors for each line sometimes change between graphs.

------
Animats
Why is the time for map lookup going up so fast for large numbers of strings?
Is he running out of memory and thrashing? Is the implementation of map
broken? Does the benchmark include growing the map? (If you include growth
time, map insertion is O(N log N), because you need log N recopies of size
N.)

~~~
martincmartin
Update: The author is always using a collection of 1,000,000 entries. The
horizontal axis is the number of lookups into a collection of 1,000,000
entries, i.e. each lookup always touches ~20 entries in the tree. So the
x-axis is logarithmic, but the y-axis is linear, and we expect the graph to
be linear.

Original comment: Good question. The x-axis is logarithmic, the y-axis is
linear. Without cache effects, std::map and lower_bound should take
logarithmic time, so should be a straight line on the graph. Also,
unordered_map should be constant time, independent of size. So it's either
cache effects (including swapping / thrashing), or there's some other
overhead that's dominating the whole thing, and the author isn't measuring
what they think.

------
ot
If you don't need sorted search (lower/upper bound) or prefix search I doubt
even a good trie would beat a good hash table. Have you tried
google::dense_hash_map? std::unordered_map is slower because it uses chaining
to resolve collisions.

~~~
jnordwick
Tries have the incredibly useful feature that they fail fast, and match fast
when you get to a unique prefix. With a hash you always need to process the
full string, though.
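
A toy sketch of the fail-fast property (over lowercase ASCII only, and not the article's trie): the lookup returns as soon as a character has no outgoing edge, whereas a hash table must consume every byte of the key just to compute the hash.

    #include <array>
    #include <memory>
    #include <string_view>

    struct Node {
        std::array<std::unique_ptr<Node>, 26> next;  // edges for 'a'..'z'
        bool terminal = false;                       // ends a stored key?
    };

    bool contains(const Node* n, std::string_view key) {
        for (char c : key) {
            n = n->next[c - 'a'].get();
            if (!n) return false;  // fail fast at the first dead prefix
        }
        return n->terminal;
    }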

~~~
ot
Unless you're talking about extremely long strings, computing a hash takes far
less time than a cache miss.

~~~
jnordwick
Yeah, the most cache-friendly structure is going to win, but failing fast
means no strcmp test if you happen to get a collision, which is a potential
miss to fetch the characters. And if the hash uses bucket chaining, the
programming gods help you.

In an interview I was once asked to calculate the latency of a hash fetch in
Java with the JDK String, hit and miss. It all basically boils down to how
many cache misses you're going to have. I literally just counted up the
memory accesses, counted up the hits and misses, then gave an answer for
cold and hot caches. Then we worked on rewriting it.

------
toth
Interesting write-up. There was a claim I didn't understand, though:

> Let’s fix this. An easy way to get our std::vector implementation in-line
> with the rest of the containers is we can sort it and utilize an algorithm,
> std::lower_bound, in order to speed up our lookup times.

Unless I'm missing something, isn't this wrong? The point of the vector is
that v[enum_value] gives you the name of enum_value as a string. Once you sort
it this relationship no longer holds, unless it so happens that the vector was
already sorted (which happens to be the case for "circle", "square",
"triangle").

~~~
SoapSeller
You are absolutely right - and the benchmark in the repository doesn't even
try to get the value - it only checks for the existence of the key.

However, it can easily be fixed by using something like vector<tuple<string,
types_t>> and supplying predicates to both std::sort and std::lower_bound
that only consider the first element of the tuple. There will be some
performance hit, but it should be minimal.
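
A sketch of that fix, using pair instead of tuple; get_type and types_t::num_types are the post's names, the rest is illustrative:

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    enum class types_t { circle, square, triangle, num_types };

    // Sorted vector of (name, value) pairs: sorting keeps each enum value
    // attached to its name, unlike sorting a bare vector<string>.
    std::vector<std::pair<std::string, types_t>> v = {
        {"circle", types_t::circle},
        {"square", types_t::square},
        {"triangle", types_t::triangle},
    };
    // Call std::sort(v.begin(), v.end()) once after building; pair's
    // default ordering compares the string first, which is exactly what
    // lower_bound assumes below.

    types_t get_type(const std::string& name) {
        auto it = std::lower_bound(
            v.begin(), v.end(), name,
            [](const auto& elem, const std::string& key) {
                return elem.first < key;  // compare by name only
            });
        if (it == v.end() || it->first != name) return types_t::num_types;
        return it->second;
    }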

~~~
toth
I see, that makes sense. I guess at that point you have something very much
like std::map, except it doesn't keep itself sorted; you have to do it
yourself.

I guess for cases like this where you initialize it once and never change it
again it could even be a better choice (i.e., faster initialization).

~~~
AstralStorm
The main difference is that map cannot be reserved ahead of time.

------
utopcell
The performance gap between unordered_map<> and the custom implementations
seems to be less than 2x. This will quickly evaporate if one uses an
efficient hash map such as dense_hash_map<>.

~~~
rocqua
For the trie, things would be a lot better if the words were longer. All the
'inefficiencies' of a trie lie in the pointer chasing. The great thing is
that you can process a string 'on-line'.

If things start getting really big, there are some tricks with LCP arrays
and constant-time RMQ (range minimum queries) that have great theoretical
performance. I haven't seen that stuff used in practice, though.

------
Joeri
I love how it's always the native code developers who worry the most about
performance implications, when they often have the least to gain from doing
so.

It must be something about being close enough to the metal to realize and care
about what happens.

~~~
khedoros1
Hopefully, if you're using C++, it's because the performance implications
matter. And hopefully, you're nitpicking the implementation of the mapping
algorithm for the same reason. That's the only good reason to be close to
the metal anymore: a requirement for some particular speed target that none
of the "good enough" solutions can meet.

Otherwise, why wouldn't I build my program in Python or something? If
selecting the right data structure just based on Big-O notation will get me to
a good enough solution, digging down is just a waste of time, like saving
microseconds when network latency is 1000 times the problem.

~~~
pklausler
There are other reasons to use C++ besides performance, and other reasons to
not use Python besides performance.

~~~
khedoros1
Sure, but it's one of the biggest go/no-go reasons. Preexisting codebase,
available libraries suitable to the problem domain, platform limitations,
desire to obfuscate code, developer experience, desire for speed of
development, and whatever else I'm forgetting would be other reasons to choose
a particular set of technologies.

But the comment I was responding to was about performance implications, and
whether or not development was "close to the metal", so my response was too
(using C++ and Python as examples of lower-abstraction and higher-abstraction
languages).

------
jnordwick
Tries can be memory hogs and the techniques for level and path compression can
add significant overhead. It would have been interesting to see memory usage.

One of the more interesting data structures for things like this is the
ternary search tree, where each subtree starts with a common prefix. That
would have been an interesting comparison.

[https://en.m.wikipedia.org/wiki/Ternary_search_tree](https://en.m.wikipedia.org/wiki/Ternary_search_tree)

~~~
mtanski
Overhead, fragmentation, pointer chasing, malloc() bound.

It's one of those data structures that makes perfect sense in CS theory but is
only applicable in a limited set of real world problems.

For now we're beholden to the implementations of our architectures, their
quirks and side effects.

~~~
rocqua
Suffix tries are an even better example. Donald Knuth once heralded them as
the greatest algorithmic breakthrough of the '70s, but in practice they
don't quite perform.

The point of suffix tries is finding substrings. I.e., build a suffix trie
of all of Shakespeare's works in linear time and memory (linear w.r.t. the
total length of the text). Then, given a potential quote of length K, see if
it is in Shakespeare's work in O(K) time. The big draw is that the
complexity of substring lookups does not depend on the length of the big
text against which you are matching. This finds practical use in genome
sequencing.

~~~
danieldk
Suffix arrays, on the other hand, have worse complexity in the typical
implementation (O(n) construction, O(k + log(n)) search [1]). However, they
work extremely well in practice, because they use (contiguous) arrays.

[1] Though there is an O(k) search approach too:
[http://www.sciencedirect.com/science/article/pii/S1570866703000650](http://www.sciencedirect.com/science/article/pii/S1570866703000650)
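
A minimal sketch of the idea (naive O(n^2 log n) construction for brevity, not the linear-time algorithms referenced above): sort every suffix start position, then answer substring queries by binary search over the sorted suffixes.

    #include <algorithm>
    #include <numeric>
    #include <string_view>
    #include <vector>

    // Suffix array: indices of all suffixes of `text`, sorted by the
    // suffixes they denote.
    std::vector<int> build_suffix_array(std::string_view text) {
        std::vector<int> sa(text.size());
        std::iota(sa.begin(), sa.end(), 0);
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return text.substr(a) < text.substr(b);
        });
        return sa;
    }

    // O(k log n) substring query: binary-search the pattern against the
    // length-k prefixes of the sorted suffixes.
    bool contains(std::string_view text, const std::vector<int>& sa,
                  std::string_view pattern) {
        auto it = std::lower_bound(
            sa.begin(), sa.end(), pattern,
            [&](int pos, std::string_view p) {
                return text.substr(pos, p.size()) < p;
            });
        return it != sa.end() &&
               text.substr(*it, pattern.size()) == pattern;
    }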

~~~
rocqua
If I recall correctly, the linear search approach is based on constant-time
range minimum queries on the suffix array. That feels rather likely to also
be worse in real-life applications.

Heck, as far as I can tell, this approach is just another way of storing a
tree.

------
ericseppanen
In the first chart, the units for the Y axis appear to be "giga-nanoseconds".

That's... interesting.

~~~
sillysaurus3
Also either the best or worst name for a band, depending on what music you're
into.

~~~
archgoon
Well, from the band's perspective, it's preferable to the nanosecond gig.

------
mtanski
Based on experience, I'm going to form an educated guess (hypothesize) and
say that both std::map and the author's trie container are primarily bound
by pointer chasing. Obviously one should test this.

If that's the case, use a C++ btree_map implementation.

~~~
adzm
I'm fond of sorted vectors as well, such as boost's flat_map. The worst-case
performance can be surprising! But eventually, if it gets too big, you'll
need a btree.

~~~
pekk
Can you share an instance where you actually needed a btree? What made it a
true need (a business need of some kind, maybe) and not just a measurable
difference?

~~~
AstralStorm
Generally, the requirement for range queries and piercing queries (does
range [x, y] contain object o? is object o overlapping the range described
by object p?), plus the requirement to maximize immutability for reasons of
thread safety.

Range queries do not really work well on sorted vectors, even if you have as
many indices as needed. Immutability and race freedom are even more complex.

With a tree, copy-on-write solves many problems (and can be much cheaper
than copying the whole structure). If not, you can atomically replace
subtrees in a safe way.

------
wmu
I'm wondering why the author didn't try binary search on a sorted
std::vector. It should be as fast as std::map (if not faster).

I also did some experiments with lookups in std::map with strings as keys:
[http://0x80.pl/notesen.html#stl-map-with-string-as-key-access-speedup-3-04-2010](http://0x80.pl/notesen.html#stl-map-with-string-as-key-access-speedup-3-04-2010)

------
jstarks
It would be interesting to see a zoomed in view of how things perform at small
numbers of elements. Perhaps an unsorted vector is actually the best up to a
certain point, for example.

~~~
Sharlin
No, the charts should use a log-log scale. As it is, only the x-axis is
logarithmic, which makes the behavior confusingly exponential-looking and
also hides the differences at small sizes.

------
rleigh
The plots would have benefitted from a log scale on the y axis, or both axes.
The behaviour at the low end can't be clearly visualised since it's squished
to the baseline by the exponential behaviour for high x values.

------
cbsmith
This seems fairly involved for a problem that should be solved in the
database...

I mean seriously, this is the kind of thinking that, at least in theory, is
going on inside a relational database. If you have a table mapping these
strings to values (or, in a more sane database, the other way around), you
would just do the join and be done with it.

------
user5994461

        auto get_type(const std::string& type) {
            auto e = m.find(type);
            if (e == std::end(m)) return type_t::num_types;
            return e->second;
        }
    

I don't understand the point of that code. Use "map.at(key)" and you get the
value from the map. No need for a function and branching and whatever.

If you really want to mess around, you compare "map.at(key)" against
"map[key]". at() is const; [] is not const and creates the key if it's
missing (if I remember correctly).

To conclude, if you really, really want to show off your optimization
skills, you optimize return types: "auto &&" vs "auto &" vs "auto" vs a few
other ones.

Can't help you with that last one. One decade of C++ and still struggling
with references, value copies, lvalue references...

That reminds me how bad C++ is a mess. 5 minutes of optimizations and my brain
is already hurting. Good thing I moved to DevOps. More pay, less hassle.

~~~
charles-salvia
The point is apparently to return "type_t::num_types" if the key doesn't exist
in the map. Using at() throws an exception if the key doesn't exist.

~~~
user5994461
That's an artificial requirement; you can do the same with at(), or change
the requirement.

Catching all errors to return a default type is an arguable decision.

------
_pdp_
I wrote a 3D engine in C++ many years ago and it still feels wrong. At least
with C you get the power to do things the way you want to, and yes, it is
mostly unsafe if you don't know what you are doing. IMHO C++ is a failed
attempt to reinvent an already-working language. C should not be extended.
There are other languages, like Go and Rust, that provide good performance
while not bastardizing C.

~~~
ar15saveslives
Define "bastardizing C". Which improvement do you consider as "failed
attempts"? RAII, constructors/destructors, so you don't need to write
spaghetti code to free resources reliably? Templates, so you can write sort()
function once, for all comparable types? Lambdas? Or maybe GLib is somehow
better than stl/boost?

C++ saves TONS of time and effort in our projects, thank god that I don't need
to write in plain C anymore.

~~~
pg314
> Templates, so you can write sort() function once,

C has a generic sort: qsort [1] (or mergesort, heapsort or radixsort if you
have specific requirements).

[1] [https://linux.die.net/man/3/qsort](https://linux.die.net/man/3/qsort)

~~~
janoc
Good luck chasing bugs then, though. qsort() will be totally happy comparing
apples to oranges - i.e., there is zero type safety.

Templates give you compile-time type checking; that's why one doesn't pass
void pointers like this anymore but uses templates to implement generic
functions.
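
A small illustration of the difference (a hypothetical snippet, not from either codebase): qsort's comparator takes void pointers, so a type mismatch compiles silently, while std::sort type-checks the comparator against the element type.

    #include <algorithm>
    #include <cstdlib>
    #include <iterator>

    // Wrong on purpose: the elements are doubles, but the comparator
    // reinterprets them as ints. The compiler cannot object, because
    // qsort only sees void pointers.
    int cmp_as_int(const void* a, const void* b) {
        return *(const int*)a - *(const int*)b;
    }

    void demo() {
        double xs[] = {3.0, 1.0, 2.0};
        std::qsort(xs, 3, sizeof(double), cmp_as_int);  // compiles; sorts garbage

        // The template version deduces double; a comparator written for
        // the wrong type, e.g. [](const char* a, const char* b) {...},
        // would be rejected at compile time.
        std::sort(std::begin(xs), std::end(xs));
    }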

~~~
pg314
> Good luck chasing bugs then, though. qsort() will be totally happy comparing
> apples to oranges - aka there is zero type safety.

That's a fair point. However, I do not tend to make many of the mistakes that
would be caught by the C++ type system. But different people tend to make
different kinds of mistakes.

The type bugs that happen tend to be pretty obvious and easy to debug. Good
luck to you debugging template code :) I've had a harder time debugging C++
code than C code. Again, YMMV.

~~~
koja86
I agree that YMMV.

Although sometimes "I do not tend to make many of the mistakes that would be
caught by the C++ type system" might be related to "how to use C++ type system
so it would catch mistakes people tend to make".

