
C performance mystery: delete unused string constant - dolmen
https://github.com/google/wuffs/commit/49023afd48aab68febc55a3561b2e8a0e2635533
======
dzdt
Lots of performance mysteries, particularly in micro-benchmarks, are explained
by how either code or data aligns to memory cache lines. A seemingly unrelated
change will cause memory locations to move. No idea if that is the case here
or not but it looks like a possibility.

~~~
userbinator
This is one of the reasons why I think microbenchmarks are either not very
useful or completely misleading --- they'll favour things like insanely
unrolled loops or aligned data with lots of padding, when in fact on a macro-
scale those things will certainly affect performance negatively due to cache
misses.

My preferred approach when optimising is to go for size first; then if more
speed is desired, carefully apply size-increasing optimisations to areas which
may benefit, testing with a _macro_ benchmark to judge any improvement.

~~~
YZF
If the macro scale is dominated by the code you're optimizing, which is often
the case, let's say some inner loops of a codec, signal processing, etc. then
the cache misses that may follow aren't really a concern. Obviously(?) taking
some random code that doesn't run frequently and trying to optimize it won't
make a difference. Microbenchmarks are pretty useful in this context since
they let you iterate faster. More generally the question is what is the
bottleneck, if the bottleneck is fetching stuff into caches then that's what
you optimize for, if the bottleneck is some computation that is running from
caches then that's what you optimize...

~~~
jeffbee
The big cautionary tale of microbenchmarking is memcpy. If you look at the Gnu
memcpy, it's ridiculously long and obviously been guided by extensive
benchmarking. And memcpy often shows up in profiles, so it is a big target.
But in real systems a small simple loop will perform better due to icache
pressure.
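
The "small simple loop" alternative being contrasted with glibc's tuned memcpy could be sketched like this (a hypothetical `tiny_memcpy`, not any real libc's code): a handful of instructions that barely touch the instruction cache.

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* A minimal byte-wise copy loop: a few instructions of code,
       versus the kilobytes of a fully tuned SIMD memcpy. The name
       is illustrative, not from any real libc. */
    static void *tiny_memcpy(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }

    int main(void) {
        char src[16] = "hello, world";
        char dst[16] = {0};
        tiny_memcpy(dst, src, sizeof src);
        assert(memcmp(dst, src, sizeof src) == 0);
        return 0;
    }

Per byte it is far slower than a vectorized memcpy in a microbenchmark, which is exactly the tension the comment describes.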

~~~
MaxBarraclough
> in real systems a small simple loop will perform better due to icache
> pressure

Do you have a source for this? Seems surprising that GCC would leave such
low-hanging fruit. G++ makes the effort to reduce _std::copy_ to a _memmove_
call when it can, or at least it did so in some cases in 2011. [0]

Related to this: does GCC treat memcpy differently when it can determine at
compile-time that it's just a small copy?

[0]
[https://stackoverflow.com/a/4707028/](https://stackoverflow.com/a/4707028/)
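
On the small-copy question: mainstream compilers (GCC and Clang at least) typically do recognize a memcpy whose size is a compile-time constant and inline it as one or two plain loads and stores, with no library call. A sketch of the common type-punning idiom that relies on this:

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* memcpy with a compile-time-constant size: compilers usually
       lower this to a single 8-byte load, not a call. */
    static uint64_t load_u64(const unsigned char *p) {
        uint64_t v;
        memcpy(&v, p, sizeof v);  /* size known at compile time */
        return v;
    }

    int main(void) {
        unsigned char bytes[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        uint64_t v = load_u64(bytes);
        /* Round-trip instead of asserting an endianness-dependent value. */
        unsigned char back[8];
        memcpy(back, &v, sizeof back);
        assert(memcmp(back, bytes, 8) == 0);
        return 0;
    }

Compiling with `-O2` and inspecting the assembly is the quick way to confirm what your particular compiler does.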

~~~
Gibbon1
The problem is that with superscalar processors, the correspondence between
number of instructions and speed breaks down, partly because the processor does
its own optimization on the fly and can do multiple things in parallel.

A programmer should be careful about second guessing the compiler. And a
compiler should be careful about second guessing the processor.

~~~
MaxBarraclough
I'm not sure if you're implying this is premature optimisation. It isn't.

It's a performance-sensitive standard-library function, the kind of thing that
deserves optimisation in assembly. It's also the kind of problem that can be
accelerated with SIMD, but that necessarily means more complex code. That's
why the standard library implementations aren't always dead simple.

Here's a pretty in-depth discussion [0]. They discuss CPU throttling, caches,
and being memory-bound.

[0]
[https://news.ycombinator.com/item?id=18260154](https://news.ycombinator.com/item?id=18260154)

------
nigeltao
I am the commit author.

I wrote that commit on my laptop. I was unable to reproduce this on my
desktop.

Re "why does this happen?", it certainly looks like an alignment thing. A more
interesting question for me is "what should I do about it?" If future non-
trivial changes affect the micro-benchmark numbers, how do I ensure that it's
signal and not noise? Do I sprinkle some alignment directives throughout my
code and hope for the best? Once the immediate symptoms are gone, how do I
know that I've added enough?

Somebody suggested (off-list): "use compiler flags like the combo `-ffunction-
sections -falign-functions=N` for values like 16, 32, 64 to help diagnose
these issues quickly. You can also look at perf counters to find the problems,
but each problem has a different counter so that can be hard. Once you know
you have a problem, you can usually write code defensively against the issue.
But it requires knowing a lot about the micro-architecture. Things like
minimizing branch density, data dependency graph height, etc."
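
For what it's worth, the "sprinkle some alignment directives" option could look like this minimal sketch (C11 `_Alignas`; the 64-byte figure is a common x86 cache-line size, an assumption rather than a guarantee):

    #include <assert.h>
    #include <stdalign.h>
    #include <stdint.h>

    /* Pin hot data to a cache-line boundary so that unrelated edits
       elsewhere in the binary cannot shift it across a line. */
    static _Alignas(64) unsigned char hot_table[256];

    int main(void) {
        assert(((uintptr_t)hot_table % 64) == 0);
        hot_table[0] = 1;
        assert(hot_table[0] == 1);
        return 0;
    }

GCC and Clang also accept `__attribute__((aligned(64)))` for the same purpose, at the cost of a non-standard extension.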

That's all very well (and better suggestions than nothing), but I'm hesitant
to hill-climb using different compiler flags from what my users generally do.
I also want to avoid over-fitting to my primary (day-to-day) machine or to a
particular version of a particular C compiler.

I've also been pointed to
[https://github.com/ccurtsinger/stabilizer](https://github.com/ccurtsinger/stabilizer)
but it sounds tied to LLVM 3.1 and hasn't had any substantial updates since
2013.

~~~
acqq
> I wrote that commit on my laptop. I was unable to reproduce this on my
> desktop.

Then please modify the comment in the source to state that.

The described behavior (the speed results unpredictably changing slightly in
both directions) is actually "normal" on notebooks under most of the possible
thermal configurations, and is not something one should even try to "fix"
until the observed effects are consistent and big enough to be repeatable
without tight thermal controls.

Edit: of course the pushed _commit_ can't be changed. But the _comment_ (if it
is visible in the source -- haven't checked that) can. There should be some
kind of a visible "resolution" of the question in the repo.

~~~
nigeltao
I haven't written a comment in the source yet, as I haven't further
investigated and implemented a work-around yet to hang the comment on. It's
low on the priority list.

------
ufo
When weird performance things like this happen, it can be helpful to test in
other environments. If the performance change does not happen on other
operating systems or hardware, then it suggests that the weird performance you
are observing could be due to an unusual coincidence in your particular
system.

That said, if you do want to figure out why deleting that global variable made
a difference, then using Linux's perf tool might give you more information to
work with. One time I had a weird program where inserting a NOP instruction in
a certain location made it run twice as fast. After investigation we found out
that the difference was in branch prediction. The presence or absence of that
NOP instruction affected the addresses of the jump targets of the inner
loop's switch statement. For some reason, in the version without the NOP
instruction those addresses resulted in lots of branch mispredictions, perhaps
because of a collision in the branch predictor's hash tables.

~~~
MauranKilom
Are you sure it wasn't just instruction alignment? Inserting nops before loop
jump targets to align the first loop body instruction to 8 or 16 bytes is a
very common x86 thing most compilers do. See e.g.
[https://reverseengineering.stackexchange.com/a/2930](https://reverseengineering.stackexchange.com/a/2930).

~~~
ufo
Would that explain the large difference we observed in the "branch-misses"
statistic when we ran it under "perf stat"?

------
kjgkjhfkjf
Removing `const char* wuffs_base__note__i_o_redirect = "@base: I/O redirect";`
removes not only the literal string, but also a global non-constant pointer.
This perhaps affects various optimizations.

They should have used `const char* const wuffs_base__note__i_o_redirect =` or
(preferably) `const char wuffs_base__note__i_o_redirect[] =`.
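
The difference between these definitions is what the compiler must emit (a sketch with hypothetical names, not Wuffs's actual code): the plain pointer form creates a writable pointer object in the data segment plus the string bytes, the const-pointer form makes the pointer itself read-only, and the array form emits only the bytes with no pointer object at all.

    #include <assert.h>
    #include <string.h>

    const char *note_ptr             = "@base: I/O redirect"; /* mutable pointer + bytes */
    const char *const note_const_ptr = "@base: I/O redirect"; /* const pointer + bytes   */
    const char  note_array[]         = "@base: I/O redirect"; /* bytes only              */

    int main(void) {
        assert(strcmp(note_ptr, note_array) == 0);
        assert(sizeof note_array == strlen(note_array) + 1); /* array size is the string */
        assert(sizeof note_ptr == sizeof(char *));           /* just a pointer object    */
        return 0;
    }
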

~~~
tom_mellior
I can't think of a plausible way in which an unused global would affect
compiler optimizations. Could you elaborate?

~~~
HelloNurse
The pointer probably precedes interesting variables in the same data segment,
which therefore shift to a different address that is cached differently.

~~~
tom_mellior
Sure, but that's not a compiler optimization issue. The compiler will emit the
same instructions, they will just execute at a different speed if the data is
elsewhere.

------
SomeoneFromCA
I remember changing a _constant_ int into a _volatile global variable_, and my
code became 4 times faster. It was on Ubuntu 16.04. I might actually find the
code and post it here later.

~~~
sfblah
I worked on some code once for an app that would crash if you added a comment
in a specific place. We removed the comment and then added a comment asking
developers not to re-add the comment.

~~~
IAmLiterallyAB
Was it a language where the comment would be compiled out ahead of time? Or
still present at runtime like JS

~~~
quietbritishjim
I was thinking the same thing. It would be very odd if it had an effect on C
or C++ (even more than the effect in the original article) because comments
are removed even before preprocessing. They do affect the __LINE__ macro
though, so it is vaguely conceivable for it to have an effect on program
behaviour.
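
A minimal illustration of that `__LINE__` loophole, the one observable way a comment's mere presence can leak into C program behaviour:

    #include <assert.h>

    /* Adding or removing a comment line shifts every subsequent
       __LINE__ value, even though the comment itself compiles away. */
    static int line_before(void) { return __LINE__; }
    /* this comment occupies one source line */
    static int line_after(void)  { return __LINE__; }

    int main(void) {
        /* The two functions are two lines apart because the comment
           between them counts as a line. */
        assert(line_after() - line_before() == 2);
        return 0;
    }
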

------
acqq
As far as I see, the source code of the library is not plain C but .go, which
is then transpiled to .c

Anyway, even if what we see in the resulting .c is "just some missing string",
it's not necessarily obvious how the strings are supposed to be handled in the
whole .c domain. Specifically, it can be that some other constants (not even
necessarily string constants, but constants in the same segment where the
string constants are!) are being used very frequently by the code being
measured (let's call them "hotter" constants), and the removal of that one
string constant simply affected the placement of the "hotter" constants.

In short, I suspect that the deletion of the constant is not the only way to
get different results, but that the different results would also happen even
when changing the order of the definition of constants, without removing any
of them.

So the way I would attack that problem, if it would be desired to fix the
performance issues, is: I'd measure the access of all the constants in the
same segment, and identify which are used during the whole run of these
benchmarks -- those are "hot." The solution is then to modify the code to be
less dependent on such constants inside the "hot" loops. Often the number of
constants that need fixing is low, but it can be otherwise.

So I believe it's not that much a mystery as it appears to be. (I have even
more specific experiences and also suggestions. And I'm also looking for a new
dream _remote_ job. Anybody needs this kind of expertise, for some reasonable
longer term, or shorter term but seriously paid?)

~~~
terminalcommand
Watching used constants and trimming them to fit the cache seems like a good
idea. I'm a total noob PC-architecture wise so apologies beforehand, but
doesn't the CPU handle the caching of most used memory locations itself? If
there are an equal number of hot constants, how does changing the order of
them help performance? CPU is going to cache the most used locations anyway.
In this example, the constant was never used, so there is no reason for the
CPU to cache it.

BTW, adding some contact info to your HN profile could help in the job search.
Best of luck!

~~~
acqq
> doesn't the CPU handle the caching of most used memory locations itself?

Yes, but caches aren't "never failing magic". You can think of them as
mechanisms that save some work in some scenarios. If the work thrown at them
runs up against their limitations, worse results aren't surprising.

> If there are an equal number of hot constants, how does changing the order
> of them help performance? CPU is going to cache the most used locations
> anyway.

The caches have some limitations by design. It's easy to construct the code
which stresses the caches more, and sometimes just the position of elements
accessed can influence the "congestion" points due to the changed mapping
between the addresses and the elements accessed.
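
A sketch of that mapping: in a typical set-associative cache, the set index is a few middle bits of the address, so two variables whose addresses differ by a multiple of (line size × number of sets) compete for the same set no matter how "hot" each one is. The parameters below are illustrative (roughly a 32 KiB, 8-way L1).

    #include <assert.h>
    #include <stdint.h>

    #define LINE_SIZE 64u
    #define NUM_SETS  64u   /* 32768 bytes / (64-byte lines * 8 ways) */

    /* Which cache set an address maps to, under the model above. */
    static unsigned cache_set(uintptr_t addr) {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    int main(void) {
        uintptr_t a = 0x10000;
        uintptr_t b = a + LINE_SIZE * NUM_SETS;  /* aliases the same set */
        uintptr_t c = a + LINE_SIZE;             /* lands in the next set */
        assert(cache_set(a) == cache_set(b));
        assert(cache_set(a) != cache_set(c));
        return 0;
    }

Shifting constants by even a few bytes can change which hot data ends up contending for the same set, which is one mechanism behind the effects described above.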

> adding some contact info to your HN profile

Thanks. I still hope that if there's a real interest the contact happens in
spite of not being exceptionally easy -- also a kind of filter.

------
robalni
This reminds me of when I made a program faster by adding a print statement to
it. I am not able to find the program but I was able to create a new one as
similar to it as I remember and it seems that this weird trick works in this
program too.

When I uncomment the puts("hi") line in the code below, the time it takes to
run the program consistently changes from 5.6 to 5.4 seconds on my machine if
I compile without optimizations.

    #include <stdio.h>

    int main() {
        long tri_side = 1;
        long tri_area = 1;
        long sq_side = 1;
        long sq_area = 1;
        while (sq_side < 1000000000) {
            if (sq_area == tri_area) {
                printf("tri:%ld sq:%ld area:%ld\n", tri_side, sq_side, sq_area);
                //puts("hi");
            }
            if (sq_area < tri_area) {
                sq_area += sq_side * 2 + 1;
                sq_side += 1;
            } else {
                tri_area += tri_side + 1;
                tri_side += 1;
            }
        }
    }

~~~
guerrilla
I wasn't able to reproduce this, but I'm wondering if it could have to do with
buffer flushing. puts may implicitly do whatever fflush(stdout) does. [1]

[1]. [https://linux.die.net/man/3/fflush](https://linux.die.net/man/3/fflush)

------
gomijacogeo
Looks like other people are onto the same thing - alignment.

Especially if this is threaded code, what's probably happened is that
something (likely some sort of locking primitive) that fit on one cache line
now straddles two. The reverse is also possible, where two items were landing
on different cache lines and are now creating a false sharing problem.

It's likely a global and likely in the .bss (which comes immediately after
.data, which is why it has alignment troubles when static strings change).
It's usually pretty easy to binary search your way to the problematic module
and variable.

~~~
nigeltao
It's single-threaded.

------
viraptor
I'm not sure I understand why they reverted the change. You get random small
+/- changes in performance with the change, but you get slightly cleaner code.
What's the reason not to want that?

~~~
rybosworld
They might be hesitant to check in a change that causes a behavior they don't
understand. This is a good mindset imo.

~~~
viraptor
I feel like there's a (sometimes blurry) line between being hesitant about
dangerous changes (good thing) and hoping that if you ignore something you
don't understand, it will keep working as you expect (not a good thing).

Purely leaving an unused variable in place due to its weird impact on
performance is in the second category for me. But maybe there are other
aspects - that's why I'm curious and asking about it.

------
rurban
He should experiment with shortening his ENV vars one byte at a time, to see
similar benchmarking effects. The process environment affects alignment, and
if you're unlucky, bad alignment can cost a few percent.

------
munro
I've never heard of this "wuffs" project before; apparently it's a programming
language. It looks like a wild project. Does anyone have any details on what
this is?

~~~
nigeltao
Start at
[https://github.com/google/wuffs/tree/master/doc](https://github.com/google/wuffs/tree/master/doc)

------
AdmiralAsshat
Isn't this usually where someone sufficiently versed in Assembly would look at
the generated assembler code and figure out what changed?

~~~
pascal_cuoq
You need to look at the disassembly of the generated binary to make sense of
this sort of performance variation (paying attention to line cache boundaries
for code and data), and even so, it is highly non-trivial. The performance
counters found in modern processors sometimes help
([https://en.wikipedia.org/wiki/Hardware_performance_counter](https://en.wikipedia.org/wiki/Hardware_performance_counter)
).

[https://www.agner.org/optimize/microarchitecture.pdf](https://www.agner.org/optimize/microarchitecture.pdf)
contains the sort of information you need to have absorbed before you even
start investigating. In most cases, it's not worth acquiring the expertise for
5% one way or the other in micro-benchmarks. If you care about these 5%, you
shouldn't be programming in C in the first place.

And then there is this anecdote:

My job is to make tools to detect subtle undefined behaviors in C programs. I
once had the opportunity to report a signed arithmetic overflow in a library
that its authors considered, rightly or wrongly, to be performance-critical.
My suggestion was:

… this is not one of the subtle undefined behaviors that we are the only ones
to detect, UBSan would also have told you that the library was doing something
wrong with “x + y” where x and y are ints. The good news is that you can write
“(int)((unsigned)x + y)”, this _is_ defined and it behaves exactly like you
expected “x + y” to behave (but had no right to).

And the answer was “Ah, no, sorry, we can't apply this change, I ran the
benchmarks and the library was 2% slower with it. It's a no, I'm afraid”.

The thing is, I am pretty sure that any modern optimizing C compiler (the
interlocutor was using Clang) has been generating the exact same binary code
for the two constructs for years (unless it applies an optimization that
relies on the addition not overflowing in the “x + y” case, but then the
authors would have noticed). I would bet a house that the binary that was 2%
slower in benchmarks was byte-identical to the reference one.
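
The rewrite in question, for the record: signed overflow in `x + y` is undefined behaviour, but unsigned arithmetic wraps, so routing through unsigned gives defined two's-complement wrapping. (Strictly, converting an out-of-range value back to int is implementation-defined, but every mainstream compiler defines it as wrapping.)

    #include <assert.h>
    #include <limits.h>

    /* Defined-behaviour wrapping addition, as suggested above. */
    static int wrapping_add(int x, int y) {
        return (int)((unsigned)x + (unsigned)y);
    }

    int main(void) {
        assert(wrapping_add(1, 2) == 3);
        assert(wrapping_add(INT_MAX, 1) == INT_MIN); /* wraps instead of UB */
        return 0;
    }
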

~~~
voldacar
If I may ask, what was the use case for this code that they cared so much
about a 2% difference in benchmarks? Aerospace? Game engine? Packet routing?

~~~
sbierwagen
I wouldn't expect aerospace, since I have been told embedded programmers in
that field routinely disable compiler optimization, in the chance that a
compiler bug or overzealous UB exploitation might introduce a bug into
previously working code. Hard realtime requirements demand _fast_ code, but
not necessarily _efficient_ code.

------
kev009
VTune will help

------
thrownaway954
I have to admit... I get a "geek on" whenever I read C programmer commit
messages. I don't know what it is but those people know how to write detailed,
interesting and often entertaining commit messages. This one was a perfect
example of all 3.

------
moonchild
> deleting this one line of code can have a dramatic effect on seemingly
> unrelated performance micro-benchmarks. Some numbers get better (e.g. +5%),
> some numbers get worse (e.g. -10%). The same micro-benchmark can get faster
> on one C compiler but slower on another

There's a fundamental disconnect that makes it difficult for humans to reason
about performance in computer programs. Because the speed of light is so slow,
computer architecture as we know it will always rely on cache and OoO to be
fast. The human brain does seem to work out of order, but it's only used to
thinking about a world that runs in order. When we use theory of mind, we
don't model other people's minds, we use our own as a model for theirs; see
mirror neurons[1].

Because of this, standard code benchmarks are not very useful, unless they can
demonstrate order-of-magnitude speedups. Even something like a causal
profiler[2][3][4], which attempts to control for the volatile aspects of
performance, is of limited use; it cannot control for all variables and its
results will likely be invalidated by the same architectural variation it
tries to control for. Instead (with respect to performance) we should focus on
three factors:

- Code maintainability

- Algorithmic complexity

- Cache coherency

Everything else is a distraction.

1.
[https://en.wikipedia.org/wiki/Mirror_neuron](https://en.wikipedia.org/wiki/Mirror_neuron)

2.
[https://www.youtube.com/watch?v=r-TLSBdHe1A](https://www.youtube.com/watch?v=r-TLSBdHe1A)

3.
[https://arxiv.org/pdf/1608.03676v1.pdf](https://arxiv.org/pdf/1608.03676v1.pdf)

4. [https://github.com/plasma-umass/coz](https://github.com/plasma-umass/coz)

~~~
dleslie
There's a major caveat: if you know the hardware won't change and you control
the software stack throughout.

Game consoles, unikernels and the like apply here.

