
But I was helping the compiler - aw1621107
https://pankajraghav.com/2020/08/16/RVO.html
======
comex
Two important points:

\- For std::array of uint8_t – an in-place container of a plain-old-data type
– the move constructor and the copy constructor, if not optimized away, will
do the exact same thing! Both boil down to copying the bytes from the old
object to the new object. So there is never any benefit to calling std::move
on such an object. This is different from out-of-line containers like
std::vector, where the std::vector object itself consists only of a pointer
and some lengths. In that case, the move constructor just copies the pointer
and length values themselves and nulls them out in the original object,
without copying the underlying data. But the copy constructor has to allocate
a whole new heap buffer, copy the underlying data, and initialize the new
vector object with _that_ pointer.

\- The godbolt links are building without optimization enabled (without -O
settings)! GCC apparently still performs (non-mandatory) NRVO in this case,
probably because that optimization is implemented on the frontend rather than
the backend, and the frontend tends to care less about -O settings. But GCC
does not perform the usual suite of backend optimizations, including inlining
and memcpy elimination. If optimizations were enabled, GCC might be able to
optimize both versions into the same thing, depending on what actually happens
in “// Do some computation”. It doesn’t make sense to make performance
comparisons with optimization disabled, unless you’re trying to improve the
performance of debug builds themselves.

~~~
jfkrorkt
Am I right to say that using std::move is even a logical mistake here?

The purpose of move is to "steal" the underlying memory, but if you did that,
you would leave the class's gfx variable without backing storage. So basically
you would need to reallocate it after every call to get_display_pixels. The
code only works because the author uses array, which is statically allocated,
so there is nothing to steal. It would break were they to switch to vector.

Either the author doesn't understand what the purpose of move is, or if this
is really their intent, it's a terrible design, since get_display_pixels looks
like an idempotent getter function, but it's not.

~~~
tylerhou
> So the code only works because the author uses array, which is statically
> allocated and there is nothing to steal. It would break were they to switch
> to vector.

No, because get_display_pixels returns by value and not by reference. So the
move constructor will move from a newly constructed value, _not_ from the
instance's gfx member.

If get_display_pixels returned by mutable reference and the author used
std::vector, yes, the code might break.

Nit: array is not statically allocated; it's stack/automatically allocated.

~~~
w0utert
It would not break, but it’s still a logical fallacy to std::move something
returned by value, no? Or is there a valid use case here?

I strictly use std::move in case of explicit transfer of ownership, and not to
try to outsmart the compiler to ‘optimize’ something. As the article shows the
compiler is smart enough to do all this by itself, it doesn’t need to be told
to try to elide copies. But maybe I’m missing some valid use case where it
would make sense to add the std::move in these kinds of situations?

~~~
tylerhou
You're right, there's not really a point in std::move-ing something returned
by value, because it's already an rvalue.

------
tylerhou
1) It's useless to reason about the performance of generated assembly unless
you use -O{1,2,3}. Another commenter says that they're surprised std::move
isn't inlined; that's because -O{1,2,3} isn't passed.

2) If you care that much about performance, you can avoid all copies by
returning by reference in this case.

3) The author seems to misunderstand std::move — in this case, std::move can
_only_ pessimize performance and never optimize because the move constructor
of std::array is the exact same as the copy constructor.

For example, move constructors help performance for std::vector because the
move constructor will copy the backing buffer's pointer instead of copying the
entire backing buffer itself.

Since std::array<T, N> is a thin wrapper around T[N], the entire array is
stored inline in the object (on the stack here), and there is no possible
optimization a move could make: stack values _must_ be copied (or otherwise
initialized) in move constructors.
In fact, since std::array is an aggregate type, there isn't even a user-
defined move constructor (or copy constructor)! [https://github.com/gcc-
mirror/gcc/blob/master/libstdc%2B%2B-...](https://github.com/gcc-
mirror/gcc/blob/master/libstdc%2B%2B-v3/include/std/array#L112)

~~~
remar
Do you have a quick explanation or resource that explains why [1] is true?

Just curious, as I've wanted to get better at understanding assembly output by
gcc/clang but don't think I've come across this advice explicitly before.

~~~
tylerhou
I should maybe rephrase [1] to: if you're deciding between two different ways
of writing the same function for performance reasons, you should use the
optimized assembly output to decide (or the same compiler flags as what you
use in production). There can be big differences between the unoptimized and
optimized code: [https://godbolt.org/z/fszhjj](https://godbolt.org/z/fszhjj).

Unoptimized assembly IS useful if you just want to learn or understand
assembly itself.

------
notacoward
This is the thing that always drives me nuts about C++: the compiler always
trying to second-guess me, or me having to second-guess it. If I had wanted to
spend time thinking about compiler internals I would have made that choice
decades ago (after I actually did work on compilers a bit along with
everything else). If I want low-level control to achieve best performance or
efficiency, I'll just use C and be careful to avoid its deficiencies. Been
there, done that, I'm actually pretty good at it. Then I'll link that with
code written in a language that lets me express higher-level intent at a
decent level of abstraction, which is impossible with the $#@! compiler in my
face all the time.

~~~
jfkrorkt
On the other hand, it's basically impossible to write in C something as fast
as Eigen, because it's exactly the compiler that gets into your face which can
optimize away the intermediate layers of abstractions, in a way that would
require 10 times the amount of code in C.

~~~
voldacar
Which also means it is basically impossible to read some Eigen code and
mentally visualize the compiled asm. Though there are certainly situations
where that doesn't matter, it can be quite annoying.

~~~
ncmncm
It just takes some practice.

------
lilyball
Why does adding the std::move() change the generated code though? It changes
the prvalue to an xvalue, but std::move is equivalent to a static_cast, and
shouldn’t both the prvalue and the xvalue versions end up calling the same
std::array move constructor?

Edit: I just took a look at the actual godbolt links. And it looks like the
std::move version actually puts a call to std::move in the generated assembly.
I’m kind of surprised this is actually a literal function call, I would have
assumed std::move would be always inlined.

~~~
edflsafoiewq
The version without std::move does not call a constructor because of copy
elision. The get_display_pixels function constructs the disp_pixels object in
place.

In the version with std::move, it first copy-elides the function call into a
temporary, then move-constructs the disp_pixels object from the temporary. ie
as if

    auto tmp = emulator.get_display_pixels();
    const auto disp_pixels = std::move(tmp); // move constructor called here
    // lifetime of tmp ends

So the version with std::move is invoking an extra move-constructor and
destructor call.

~~~
iainmerrick
But shouldn’t those extra calls be optimized away to nothing, if they don’t
have any real side-effects? That’s what I would naively assume.

~~~
edflsafoiewq
I expect so, yes.

------
jeffbee
Depending on what is in the body of the function, I usually prefer to see an
output iterator taken as a function parameter, instead of functions returning
whole collections. In the article's style the caller is forced to deal with
std::array even if that's not suitable at the call site. With an output
iterator and a template function the caller gets to pick whatever suits the
purpose best. Not having to think about how the output was allocated and
deleted is just a bonus of this style.

------
knorker
Never use std::move on an rvalue. That is always wrong, and shows a
misunderstanding of std::move.

std::move doesn't move. It should maybe have been called
std::lvalue_to_rvalue.

~~~
knorker
… and I meant to add: which makes it clear that you should never pass an
rvalue into std::lvalue_to_rvalue.

------
CyberRabbi
You don’t need a copy here. A const reference to the internal array is all you
need, and it’s faster.

------
platinumrad
In addition to the significant issues pointed out by comex and tylerhou, NRVO
is not in play here as `gfx` is not a non-volatile object with automatic
storage duration.

------
marta_morena_25
You really don't need to read assembly in any but the rarest cases. What you
need is a profiler to give you the hotspots, and you need to stop prematurely
optimizing on a line-by-line basis. Just stop it. Optimize big-O performance
(but also don't go wild unless there is a business reason), and keep your
fingers off micro-optimizations unless the profiler points at one, and only if
there is a business reason as well...

~~~
gen220
Totally agreed. Getting comfortable with a profiler is a superpower.

FWIW, this is also one of Rob Pike's rules of programming:

> Rule 1. You can't tell where a program is going to spend its time.
> Bottlenecks occur in surprising places, so don't try to second guess and put
> in a speed hack until you've proven that's where the bottleneck is.

It's implicit, but the "proven" part implies use of a profiler.

~~~
pkolaczk
This is a matter of experience. If I have a tight loop looping millions of
times but I have a heap allocation inside, I can be 99% sure this would be
slow, without looking at the profiler and if there is an easy way to avoid it
without obfuscating the code (e.g simply taking it out of the loop, or using
the stack instead of the heap), I'd just do it. Profiling is a waste of time
in such cases.

Also, bottlenecks happen in surprising places, and allowing slow/bad code just
because the profiler doesn't scream about it _on the development machine_ is a
recipe for surprising performance issues in the future. The slow code might
become a problem when a user has a slightly different task for the program.

~~~
gen220
I agree with you, too, that experience helps.

But I think the aphorism covers the case you’re describing too.

It’s possible that your loop is slow, but if you’re working on a program of
sufficient size and complexity, you simply _won’t know_ if it’s _the_
bottleneck without using a profiler.

The purpose of the statement is to save you the trouble of 10xing the
performance of a function so that it takes 0.1% of all execution time instead
of 1%. In sufficiently complex programs, without a profiler you won't really
know if a loop is taking 1%, 10%, or 80%.

