
How can dereferencing the first character take longer when the string is longer? - pplonski86
https://blogs.msdn.microsoft.com/oldnewthing/20181205-00/?p=100405
======
userbinator
This is also why data alignment/structure padding can be slower than packing
all the data together. Cache usage is very important in CPUs now, because in a
lot of cases the bottleneck is the memory. Padding is effectively wasted
memory.

~~~
dpark
Is this true even on x86? On most platforms[1] misaligned access will simply
crash your program and you definitely will not execute faster (but you’ll
probably terminate faster). On x86 you’ll pay a hefty penalty for the
misaligned access and I would expect that it likely costs more than the cache
optimization saves (but maybe not).

Of course you can probably optimize your struct packing anyway. Shuffling
members around can get a lot of benefit without a bunch of internal padding,
though you might still have some trailing padding.

[1] I guess these days the platforms are essentially just x86 and ARM.

~~~
CyberDildonics
It depends far more on memory access patterns. If memory is accessed linearly
(backwards or forwards, these days) the prefetcher will get it for you before
you get to it.

Alignment matters much more when using atomics. x64 can do unaligned atomic
access, although (I think) there is a penalty for crossing a cache line
boundary with atomics. There is also the issue of 'false sharing' - two
threads needing to sync cache lines even though the bytes they are accessing
don't overlap.

128 bit (16 byte) atomic operations need to be aligned on 128 bit boundaries
or they will crash.
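The usual fix for false sharing is to pad each hot counter out to its own cache line. A hedged sketch (the struct name and iteration counts are illustrative; 64 bytes is the line size on current x86-64 parts):

```cpp
#include <atomic>
#include <thread>

// Each counter gets its own 64-byte-aligned slot, so the two threads
// below never invalidate each other's cache line on an increment.
struct PaddedCounters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

long run() {
    PaddedCounters c;
    std::thread t1([&] {
        for (int i = 0; i < 100000; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (int i = 0; i < 100000; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

Without the `alignas`, both atomics would typically land on one line and every increment would ping-pong that line between the cores.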

(if anyone sees something inaccurate here feel free to correct me)

~~~
stcredzero
_Alignment matters much more when using atomics._

Someone should make a lockless library and call it, "Family." Most people will
think it is a reference to the _Fast and the Furious_ , but it would actually
be a reference to _Dune._

------
scarface74
Slightly off topic and I feel a little hypocritical for saying this since I’m
always railing against the need for every developer to know leetCode and
algorithms, but some of the responses make me believe that every developer
needs to have at least some understanding of lower level languages like C
and/or assembly at some point.

------
herf
Remember it takes 100ns to get a value from main memory (after you miss all
the caches). With random-access data structures the total size tells you the
speed.

Here's a test to see how your CPU does:
[https://github.com/emilk/ram_bench](https://github.com/emilk/ram_bench)
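The core of such a benchmark is pointer chasing: a sketch in the spirit of ram_bench (not its actual code), where each slot stores the index of the next slot, arranged as one random cycle via Sattolo's algorithm so the prefetcher can't guess the next address. As `n` grows past each cache level, nanoseconds per hop climb toward that ~100ns main-memory latency:

```cpp
#include <chrono>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Average nanoseconds per dependent load when chasing a random cycle
// through n slots. Each load's address depends on the previous load,
// so the hops cannot be overlapped or prefetched.
double ns_per_hop(std::size_t n) {
    if (n < 2) return 0.0;
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::size_t j = rng() % i;      // j in [0, i): Sattolo's algorithm,
        std::swap(next[i], next[j]);    // guarantees one full cycle
    }
    std::size_t cur = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t hop = 0; hop < n; ++hop) cur = next[cur];
    auto t1 = std::chrono::steady_clock::now();
    if (cur == n) return -1.0;          // use cur so the loop isn't elided
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(n);
}
```

Running this for sizes from a few KB to a few GB traces out the L1/L2/L3/DRAM latency staircase.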

------
stabbles
In C++ with short string optimization short strings are on the stack while
longer strings are on the heap. That's the difference between a cache hit and
a cache miss.
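You can observe SSO directly with a heuristic check: under SSO the string's data pointer points inside the string object itself rather than at a separate heap block. This is implementation-defined in principle, but the check below holds for the mainstream libraries (libstdc++, libc++, MSVC):

```cpp
#include <cstdint>
#include <string>

// True if the character buffer lives inside the std::string object
// (the small-string case) rather than in a separate heap allocation.
bool stored_inline(const std::string& s) {
    auto obj = reinterpret_cast<std::uintptr_t>(&s);
    auto buf = reinterpret_cast<std::uintptr_t>(s.data());
    return buf >= obj && buf < obj + sizeof(std::string);
}
```

A short string like `"hi"` reports inline storage; a 100-character string does not, because it had to spill to the heap.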

~~~
sigi45
How does this make a difference?

The article states that it becomes slower when two strings no longer fit
together in one cache line. That is independent of using heap or stack.

~~~
jonhohle
I’m assuming the OP meant that there is a higher likelihood of SSO strings (24
bytes)§ being in the same cache line (64 bytes)§ when they are stored in the
same stack frame.

Two small strings allocated on the heap at different times have a lower
likelihood of being in the same cache line. Even within consecutive mallocs
the addresses might not fall within the same line. Unless the memory is
allocated explicitly for locality, there’s a higher likelihood (>=0) that
memory fragmentation causes the allocations to be split on different lines.

Ultimately, this can be controlled on either the heap or the stack. SSO seems
to be optimizing for the stack case, which probably works pretty well for most
cases (especially for oblivious developers).

§ on recent x86-64 processors

~~~
sigi45
But I don't create a string on the stack first.

First comes an allocation on the heap, then probably a copy onto the stack.

------
austincheney
I suspect that in order to access any character index of a given string you
must access the entire string and then the specified index. The longer a
string the more indexes there are from which to access. I suspect the speed
differences between short and long strings are negligible and possibly not
something most tools can detect in isolation. There are second and third order
consequences that compound over other operations when executed in a loop
though.

~~~
dpark
> _I suspect that in order to access any character index of a given string you
> must access the entire string and then the specified index. The longer a
> string the more indexes there are from which to access._

Are you saying that you believe accessing a single character (index) requires
reading the entire string? This is incorrect. In C, you might have to read the
first N characters in order to read the N+1th character, but that’s a bounds-
checking cost due to null-terminated strings(and is avoided if the length is
known). And in any case, you need not access anything past the character you
wish to read.

The article answers the question. It’s a well-known locality effect.

~~~
austincheney
The locality effect only explains access via cache versus memory. Just a
paragraph later the article mentions this:

> Reading the first character from the string adds another memory access, and
> the characteristics of that memory access vary depending on the length of
> the string.

Provided access were always to memory, such that strings in an array of
strings exceed cache size, it would not be a locality issue and string length
would still result in a performance variance.

~~~
dpark
I'm confused by your response. You quote the article, but you seem to disagree
with it. The author explicitly states that this _is_ a data locality effect.
The access characteristics vary _because_ of cache effects, as the article
explicitly states: _"When the strings are short...they are more likely to
occupy the same cache line.... As the strings get longer...fewer strings will
fit inside a cache line. Eventually...you don't gain any significant benefit
from locality."_

> _Provided access were always to memory, such that strings in an array of
> strings exceed cache size, it would not be a locality issue and string
> length would still result in a performance variance_

If you always miss the cache, then performance will no longer vary with string
length.

~~~
austincheney
If the strings are short and everything always lived in the cache I don't
think the author would have posted this in the first place. If the strings are
always long and never fit in the cache the performance cannot be related to
cache or anything related to cache vs memory access. In this case I don't
think the author would have posted about variable performance when accessing
the strings from memory if it were something they had not observed.

Locality only applies in a third middle state when some strings are short
enough to fit in the cache and when others are not. In this case the degrading
performance is predictable from the number of memory lookups. If the string is
in cache there are no memory lookups. If it is in memory there are two
lookups: one for the string and a second for the index into that string.
Finally, the article mentions the second lookup varies according to the length
of the
string, which suggests different performances for strings of different length
even if both reside in memory.

It's not that I disagree with the article, but rather with the finality of
assumptions drawn from the article. Logically you cannot access a specified
index on a string without first accessing the string and then navigating to
the given index. There is instruction time in all of that and it isn't solely
confined to locality.

~~~
dpark
> _If the strings are short and everything always lived in the cache I don't
> think the author would have posted this in the first place._

Raymond Chen didn't post this because he had trouble understanding the
behavior. He posted this as an exercise for his readers. There is zero chance
that Raymond Chen doesn't fully understand this behavior.

Also, the strings being short doesn't mean they're all in the cache. If your
strings are 4 chars each but you're storing a billion of them, you won't have
them all in cache, but cache effects will still matter.

> _Locality only applies in a third middle state when some strings are short
> enough to fit in the cache and when others are not. In this case the
> degrading performance is predictable to the number of memory look ups. If
> the string is in cache there are no memory look ups. If it is in memory
> there are two look ups: one of the string and a second for the index upon
> that string. Finally the article mentions the second look up varies
> according to the length of the string, which suggests different performances
> for strings of different length even if both reside in memory._

Look at the code again. All the strings have the same length for a given
execution. There is no case here where "some strings are short enough to fit
in the cache and when others are not".

I think your mental model of cache behavior and why it's relevant here is
wrong. The question isn't whether all the strings fit into cache. Nor whether
individual strings are "too large" to fit into the cache. The former is
uninteresting and the latter is irrelevant because we aren't looking at the
entire string, only the first character, which will most certainly fit in the
cache.

Modern CPUs will pull in an entire cache line at a time. If you access the
first character of a string, you'll pull in the surrounding N bytes as well.
If your string is long, you'll only pull in bytes from that string. (You will
also pull in the bytes preceding your string if your string doesn't begin on
the start of the cache line.) If it's short and other strings are adjacent,
you'll pull in bytes from the next string (or strings) as well. This is where
the data locality bit matters. If your strings are very short and your layout
is sequential, the cache is extremely efficient at minimizing main memory
lookups, and your comparison of one string will pull several following strings
into the cache "for free".
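The arithmetic behind that "for free" effect is simple. A small sketch (64 bytes is the typical x86-64 line size; for simplicity this assumes the first string starts on a line boundary):

```cpp
#include <cstddef>

constexpr std::size_t kLineSize = 64;  // typical x86-64 cache line

// How many contiguous fixed-length strings land on one cache line:
// the first access pulls in all of them at once.
constexpr std::size_t strings_per_line(std::size_t len) {
    return len >= kLineSize ? 1 : kLineSize / len;
}
```

With 4-byte strings, one miss loads 16 strings; at 32 bytes, only 2; at 64 bytes and beyond, every string costs its own line fill, and the locality benefit is gone.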

> _Logically you cannot access a specified index on a string without first
> accessing the string and then navigating to the given index._

I don't understand what your mental model of string access looks like, but
accessing a character within a string in C is literally a single pointer
addition and a dereference. There is no "navigating" nor is there any
meaningful notion of "accessing the string". You access characters
individually by offsetting a pointer and dereferencing.

~~~
austincheney
Excellent. As somebody who does not write C this is the answer I needed.

~~~
dpark
Glad to help.

The key thing I realized after I wrote my last comment is that you've
(probably) been thinking in terms of an object cache. Your comments make a lot
of sense if the cache is something like a key/value store that you shove an
address and an object into. If you can't fit the string in the cache, it won't
be in the cache at all. If you want to access a character in the string, you
need to access the string first (pull it from the cache) and then access the
character you want.

CPU caches don't map very well to that model. CPU caches make a lot more sense
if you think about the system memory as one enormous array of contiguous
bytes. The CPU doesn't know anything about objects. It just knows about
different-sized chunks of bytes (pages, cache lines) that it can read from the
array (system memory). The cache doesn't hold objects (and cannot, because the
CPU has no concept of an object). The cache just holds a bit of memory pulled
from the array. The benefits of CPU caches come from recency (you'll probably
touch the same memory again very soon) and data locality (the CPU will already
read in an entire chunk, which is awesome if you're going to touch nearby
memory). Your strings are embedded in this enormous array. The "chunking"
behavior means that adjacent strings will be loaded together if they're short
enough. The shorter, the better, as observed here.

From a quick look, this appears to be one of the better explanations of CPU
caches online:

[https://courses.cs.washington.edu/courses/cse378/09wi/lectur...](https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec15.pdf)

