
Why is this C++ code faster than my hand-written assembly (2016) - signa11
https://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat
======
abainbridge
A couple of weeks ago I'd never heard of Peter Cordes. Now the linked article
is the third time I've seen his work. He's doing a fine job of fixing
Stackoverflow's low-level optimization knowledge. Not so long ago, all I seemed
to find there were people saying things like, "well, you shouldn't optimize
that anyway", or, "modern computers are very complex, don't even try to
understand what's happening".

~~~
mark-r
If I ever see Knuth's quote "premature optimization is the root of all evil"
in response to a question again, I think I'll puke. Not only is it hard for
outsiders to know what's premature and what isn't, but sometimes it's nice to
make a habit of doing things the faster way when you have two choices that are
otherwise indistinguishable. For example, I try to use ++x instead of x++, even
though 99% of the time it makes no difference.
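(The habit only ever pays off for class types such as iterators, where x++ has to construct a copy of the old value. A small sketch with a hypothetical instrumented type, C++17 for `static inline`:)

```cpp
#include <cassert>

// Hypothetical type that counts copies: pre-increment mutates in place,
// post-increment must copy the old value before mutating. For plain ints
// the compiler erases the difference entirely.
struct Counter {
    int value = 0;
    static inline int copies = 0;

    Counter() = default;
    Counter(const Counter& o) : value(o.value) { ++copies; }

    Counter& operator++() { ++value; return *this; }  // pre-increment: no copy
    Counter operator++(int) {                         // post-increment: copies
        Counter old(*this);
        ++value;
        return old;
    }
};
```

With this type, `++c` leaves `Counter::copies` untouched while `c++` bumps it at least once -- the (tiny, usually irrelevant) cost the habit avoids.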

~~~
nulagrithom
In my opinion, _any_ optimization done before taking the naive approach is
usually premature optimization.

That might sound a little extreme, but in the past 5 years I've run into
exactly 1 problem that was solved by busting out the profiler and optimizing.
In that same time, I can't count on all my digits the number of features that
didn't ship, estimates that were overshot, deadlines that were slipped, etc
etc. I've even been part of a team that ran out of runway while popping open
jsPerf to choose between !! and Boolean(). Our app was fast as hell -- too bad
no one will ever get to use it.

If you're expending cycles choosing between ++x and x++ and you're not ahead
of schedule, please stop.

~~~
mark-r
That was my point, I'm _not_ expending cycles choosing between ++x and x++.
I've just chosen a different default than most of the code I've seen, and you
still need to realize when the default doesn't do what you want - but that's
usually obvious.

Sorry to hear about your unsuccessful projects, that's a bummer. I hope that
premature optimization wasn't a major part of the blame for any of them.

~~~
ben0x539
It's so irritating that golang has x++ but not ++x. I never remember until
shit isn't compiling. Grr!

------
kazinator
TL;DR: > _If you think a 64-bit DIV instruction is a good way to divide by
two, then no wonder the compiler's asm output beat your hand-written code._

Once (maybe 25 years ago?) I came across a book on assembly language
programming for the Macintosh.

The authors wrote a circle-filling graphics routine which internally
calculated the integer square root in assembly language, drawing the circle
using the y = sqrt(r * r - x * x) formula!

What is more, the accompanying description of the function in the book
featured sentences that were _boasting_ about how it draws a big circle in a
small amount of time ("only" a quarter of a second, or some eternity of that
order) because of the blazing speed of assembly language!

How could the authors have used, say, MacPaint, and not have been aware that
circles and ellipses can be drawn practically instantaneously on the same
hardware: fast enough for interactive drag-and-drop resizing?

~~~
jacquesm
Bresenham's line algorithms and the adaptation of the general principle to
circles and arcs are absolute gems. I've used those over and over in the first
two decades of my career and I never ceased to be impressed with the elegance
and speed.

Surprising that people writing a book 25 years ago would not have been aware
of this work.

[https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm](https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm)

[https://en.wikipedia.org/wiki/Midpoint_circle_algorithm](https://en.wikipedia.org/wiki/Midpoint_circle_algorithm)
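(For readers who haven't seen it, a minimal sketch of the integer-only midpoint circle algorithm -- hypothetical helper name. One octant is walked with an incrementally updated decision variable, no sqrt or division anywhere, and each point is mirrored eight ways:)

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Rasterize a circle of radius r centred at (cx, cy) using the midpoint
// circle algorithm: integer arithmetic only, one octant mirrored 8 ways.
std::vector<std::pair<int, int>> circle_points(int cx, int cy, int r) {
    std::vector<std::pair<int, int>> pts;
    int x = r, y = 0, err = 1 - r;  // decision variable for the midpoint test
    while (x >= y) {
        int xs[] = { x, -x,  x, -x,  y, -y,  y, -y };
        int ys[] = { y,  y, -y, -y,  x,  x, -x, -x };
        for (int i = 0; i < 8; i++)
            pts.push_back({ cx + xs[i], cy + ys[i] });
        y++;
        if (err < 0) {
            err += 2 * y + 1;            // midpoint inside: keep x
        } else {
            x--;
            err += 2 * (y - x) + 1;      // midpoint outside: step x inward
        }
    }
    return pts;
}
```

Every emitted point lands within one pixel of the true circle, which is why it was fast enough for interactive resizing even on 1980s hardware.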

~~~
mark-r
Yes, it's absolutely brilliant. The line drawing algorithm is also applicable
to other problems, whenever you need to interpolate between two integer
values.
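(For example, the same error-accumulator trick works as a general integer interpolator; a sketch with a hypothetical helper name, `steps >= 1` assumed:)

```cpp
#include <cassert>
#include <vector>

// Bresenham-style integer interpolation: produce steps+1 values running
// from y0 to y1, no floating point and no per-step division. The error
// accumulator carries the fractional part that a float lerp would hold.
std::vector<int> lerp_int(int y0, int y1, int steps) {
    std::vector<int> out;
    int dy = y1 - y0;
    int sign = dy >= 0 ? 1 : -1;
    int ady = dy * sign;                 // |dy|
    int err = steps / 2;                 // start mid-cell to round evenly
    int y = y0;
    for (int i = 0; i <= steps; i++) {
        out.push_back(y);
        err += ady;
        while (err >= steps) { err -= steps; y += sign; }
    }
    return out;
}
```

The same shape shows up in line drawing, audio resampling, and timer scheduling -- anywhere you step one integer quantity against another.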

~~~
kazinator
Bresenham is applicable to other conic sections and functions.

I proved this back as an undergrad: I used Bresenham to plot the y = K/x
hyperbolic curve.

I had this idea that since 1/x can be interpolated with Bresenham without
doing division, somehow that could be applicable to the perspective
transformation when walking over texture maps in 3D rendering.

~~~
jacquesm
Hehe. That's so cool, this is something I did without knowing any of the
formal math behind it when writing a small 3D game engine (after seeing Doom).
It seemed to be the shortest path to a solution and it worked very well.

Then, after getting it to work I replaced the interpolator with a bunch of
assembly starting from the intermediary representation the compiler output.

I unfortunately didn't date that source file but I do remember I was living in
Amstelveen when I wrote it so this was about 23 years ago, summer of '94.

We made the textures with one of the first affordable and commercially
available digital cameras:

[http://www.digicamhistory.com/1991.html](http://www.digicamhistory.com/1991.html)

------
payne92
tl;dr -- the asm author used DIV to divide by a constant 2

More fundamentally: it's theoretically possible to at least match compiled
code performance with assembly, because you could just write the code the
compiler generates.

BUT, it requires a LOT of experience.

Modern compilers "know" a lot of optimizations (e.g. integer mult by fixed
constant --> shifts, adds, and subtracts). Avoiding pipeline stalls requires a
lot of tedious register bookkeeping, and modern processors have very
complicated execution models.
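(Sketches of two such strength reductions, with hypothetical helper names -- the identities are the point; real compilers pick these patterns from per-target cost tables:)

```cpp
#include <cassert>
#include <cstdint>

// x * 10 rewritten as shifts and adds: 10*x == 8*x + 2*x.
uint64_t mul10_shift_add(uint64_t x) {
    return (x << 3) + (x << 1);
}

// Unsigned division by two is a single right shift - the exact case from
// the question, where the asker reached for 64-bit DIV instead.
uint64_t div2(uint64_t x) {
    return x >> 1;
}
```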

It's almost always better to start with a compiler-generated critical section
and see if there are possible hand optimizations.

~~~
bradford
In a university computer architecture course, we were challenged to write a
quicksort routine in assembly. We were also asked to compare the assembly we
authored with assembly compiled from C++ (after we authored our own solutions,
of course).

It was an amazing crash-course on just how good compilers have become at
optimizing. Not a single student could hand craft assembly that was faster
than the compiler output. The teacher of the course was able to generate
assembly that was slightly faster, and he stated that in order to do so, he
had to greatly exploit his in-depth knowledge of the processor's pipeline
system. That was roughly year 2000, and I'm sure compilers have only become
better at their job since then.

All in all, excellent learning experience. I've since encountered several
instances where developers assert superior assembly skills, and by default I'm
silently skeptical of their claims.

~~~
krylon
Gathering and exploiting in-depth knowledge of a CPU's internals has become
more difficult over time, too, I think.

At least for x86/amd64 - with out-of-order execution, branch prediction and
whatnot, one not only has to know the architecture, but also the specific
implementation the code will run on. And knowledge of the deep internals of
CPUs made by Intel or AMD (or VIA? are they still around?) is not easy to come
by.

------
bluedino
>> Have you examined the assembly code that GCC generates for your C++
program?

A very polite way of saying, "why are you even using assembly, when you don't
understand assembly?"

~~~
eterm
And why wouldn't they be polite?

The OP clearly understands assembly well enough to start doing Project Euler
type problems, which is a good way to learn basic programming in any language.
They got a working solution in assembler, which is more than many people here
would be able to do, I suspect.

And they're looking to expand their knowledge by asking on stack-overflow
about something they don't understand.

So why do you think they should be met with rudeness and hostility?

~~~
marsRoverDev
Because apparently it's a rite of passage / hazing ritual to go on IRC /
stackoverflow and be flamed for asking simple questions. It's been that way
since the dawn of time.

~~~
bshimmin
The ritual isn't really complete until someone has complained about the people
complaining about people asking simple questions and someone else has posted
the ESR document about how to ask smart questions, a document so
astonishingly, vastly patronising that it probably ought to be compulsory
reading for every developer starting out, with the addendum, "If you've made
it all the way through this soul-crushing drivel, congratulations! It can't
get worse than this!"

------
AdmiralAsshat
The question was more interesting than the answer.

tl;dr version--the author's hand-written assembly was poor.

I guess the more interesting takeaway is "Just because it's assembly doesn't
mean it's _good_ assembly."

~~~
pdpi
That takeaway is more or less uniformly true, though. It also often comes up
as saying that $LANGUAGE is slower than C or C++. Your algorithms aren't
naturally faster just because you're writing in C++. You don't magically stop
allocating and double-buffering all over the place just because you're writing
C++. In fact, coming in from the likes of Java you're liable to underestimate
just how much (relatively expensive) copying is going on if you're careless.

What C and C++ give you (and Assembly gives you even more of) is control. If
you can, and know how to, capitalise on that control, you _will_ get more
performance. But those requirements are non-trivial.

------
ericfrederich
For fun I ported the C++ to Python and Cython without any kind of mathematical
or programmatic optimizations. C++ was 0.5 seconds, then Python was 5.0
seconds. Cython, which was the same exact code as Python except sprinkled with
"cdef long" to declare C types, was just 0.7 seconds.

------
SeanDav
General comment and not aimed at this specific instance:

Just because you are writing in assembler does not mean it is going to run
faster than the same code in a compiled language. Decades of research and who
knows how many man-years of effort have gone into producing efficient compiled
code from C, C++, Fortran, etc.

Your assembly skills have to be of quite a decent order to beat a modern
compiler.

BTW: The answer to the question on Stack Overflow by Peter Cordes is a must-
read. Brilliant.

~~~
15155
> Your assembly skills have to be of quite a decent order to beat a modern
> compiler.

Start with compiler output, add intrinsics where you can see help is needed,
benchmark, repeat.

Pathologically-slow ASM is pretty rare from modern compilers in my experience.

~~~
jzwinck
> Pathologically-slow ASM is pretty rare from modern compilers

Here are some I've found:

[https://stackoverflow.com/questions/45496987/gcc-optimizes-f...](https://stackoverflow.com/questions/45496987/gcc-optimizes-fixed-range-based-for-loop-as-if-it-had-longer-variable-length) (horrific codegen for known-size C++11 loops in member functions, all GCC versions prior to 8, which is not yet released)

[https://stackoverflow.com/questions/43651923/gcc-fails-to-op...](https://stackoverflow.com/questions/43651923/gcc-fails-to-optimize-aligned-stdarray-like-c-array) (SIMD opportunity squandered when C++11 features are used)

[https://stackoverflow.com/questions/42263537/gcc-sometimes-d...](https://stackoverflow.com/questions/42263537/gcc-sometimes-doesnt-inline-stdarrayoperator) (failure to inline trivial operators)

[https://stackoverflow.com/questions/26052640/why-does-gcc-im...](https://stackoverflow.com/questions/26052640/why-does-gcc-implement-isnan-more-efficiently-for-c-cmath-than-c-math-h) (C isnan() not efficient, for many years)

------
iamjk
The people who write "article answers" like this on SO are the real MVP's of
the web.

------
raphlinus
Apologies if this is somewhat off-topic for the thread, but I suspect this
will be a fun puzzle for fans of low-level optimization. The theme is
"optimized fizzbuzz".

The classic fizzbuzz will use %3 and %5 operations to test divisibility. As we
know from the same source as OP, these are horrifically slow. In addition, the
usual approach to fizzbuzz has an annoying duplication, either of the strings
or of the predicates.

So, the challenge is, write an optimized fizzbuzz with the following
properties: the state for the divisibility testing is a function with a period
of 15, which can be calculated in 2 C operations. There are 3 tests for
printing, each of the form 'if (...) printf("...");' where each if test is one
C operation.

Good luck and have fun!
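(For concreteness, here is one family of states that seems to satisfy the "period 15 in 2 C operations" constraint -- a sketch, not necessarily the intended answer, and the unconditional trailing newline is a small cheat on the three-test rule:)

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// A 45-bit word carries three flag bits (number / Fizz / Buzz) for each of
// the 15 steps. Multiplying by 8 mod 2^45 - 1 rotates it left by 3 bits,
// so the state update is two C operations with period 15.
constexpr uint64_t MOD = (1ULL << 45) - 1;

uint64_t initial_state() {
    uint64_t s = 0;
    for (int t = 0; t < 15; t++) {
        int i = t + 1;
        uint64_t fizz = i % 3 == 0, buzz = i % 5 == 0;
        uint64_t flags = (!fizz && !buzz) | (fizz << 1) | (buzz << 2);
        s |= flags << ((45 - 3 * t) % 45);  // lands at bits 0..2 after t rotations
    }
    return s;
}

void fizzbuzz(int upTo) {
    uint64_t s = initial_state();
    for (int i = 1; i <= upTo; i++) {
        if (s & 1) printf("%d", i);
        if (s & 2) printf("Fizz");
        if (s & 4) printf("Buzz");
        printf("\n");            // unconditional newline: the cheat
        s = s * 8 % MOD;         // the 2-operation, period-15 update
    }
}
```

Folding the newline into three printf-only tests is left as the remaining puzzle.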

~~~
mikeash
Good nerdsnipe.

I decided to reject your conditions and replace them with my own, for no
apparent reason. Mine has some of the redundancy you wish to eliminate, but
the loop body is completely branchless, aside from the call to printf, of
course, which we're apparently ignoring.

    
    
        #include <stdio.h>
        
        #define NUM "\x06" "%zu\n\0"
        #define FIZZ "\x07" "Fizz\n\0"
        #define BUZZ "\x07" "Buzz\n\0"
        
        const char *x =
            NUM
            NUM
            FIZZ
            NUM
            BUZZ
            FIZZ
            NUM
            NUM
            FIZZ
            BUZZ
            NUM
            FIZZ
            NUM
            NUM
            "\xa6" "FizzBuzz\n";
        
        void fizzbuzz(size_t upTo) {
            for(size_t i = 1; i <= upTo; i++) {
                printf(x + 1, i);
                x += *x;
            }
        }
        
        int main(int argc, char **argv) {
            fizzbuzz(100);
        }

~~~
raphlinus
Good candidate for most evil use of the assumption that char is signed.
Incidentally, that's not going to be true on arm[0].

[0]: [http://blog.cdleary.com/2012/11/arm-chars-are-unsigned-by-
de...](http://blog.cdleary.com/2012/11/arm-chars-are-unsigned-by-default/)

~~~
mikeash
Haha, whoops! At least it's easy to fix.
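(The fix is presumably just reading the byte through `signed char`. A sketch of the same table with that one change, refactored to return a string purely so it's easy to check:)

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// mikeash's jump-table fizzbuzz with the portability fix: read the
// length/offset byte through `signed char`, so the "\xa6" (-90) back-jump
// still works on platforms where plain char is unsigned (e.g. ARM).
#define NUM  "\x06" "%zu\n\0"
#define FIZZ "\x07" "Fizz\n\0"
#define BUZZ "\x07" "Buzz\n\0"

static const char *x =
    NUM NUM FIZZ NUM BUZZ FIZZ NUM
    NUM FIZZ BUZZ NUM FIZZ NUM NUM
    "\xa6" "FizzBuzz\n";

// Returns the output instead of printing it, purely for testability.
std::string fizzbuzz(size_t upTo) {
    std::string out;
    char buf[32];
    for (size_t i = 1; i <= upTo; i++) {
        std::snprintf(buf, sizeof buf, x + 1, i);
        out += buf;
        x += (signed char)*x;   // the fix: was `x += *x;`
    }
    return out;
}
```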

------
bjoli
I know it is not the point of the question, but that problem would benefit
greatly from memoization. Calculate it recursively and memoize the result of
every step. With all the neat trickery they are doing with assembly, they
could easily go sub-10ms.

I whipped together a short PoC in Chez Scheme, and it clocks in at about 50ms
on my 4-year-old laptop.
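(A sketch of that memoized approach in C++, with hypothetical helper names -- cache chain lengths for start values below the limit, so most chains terminate in a table hit after a few steps:)

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Memoized Collatz chain length; values at or above cache.size() are
// computed but not cached (chains overshoot the limit routinely).
uint64_t collatz_len_memo(uint64_t n, std::vector<uint32_t>& cache) {
    if (n < cache.size() && cache[n] != 0) return cache[n];
    uint64_t next = (n & 1) ? 3 * n + 1 : n / 2;
    uint64_t len = 1 + collatz_len_memo(next, cache);
    if (n < cache.size()) cache[n] = (uint32_t)len;
    return len;
}

// Start value below `limit` (limit >= 2) with the longest chain.
uint64_t longest_start_memo(uint64_t limit) {
    std::vector<uint32_t> cache(limit, 0);
    cache[1] = 1;  // base case: the chain "1" has length 1
    uint64_t best = 1, best_len = 1;
    for (uint64_t i = 2; i < limit; i++) {
        uint64_t len = collatz_len_memo(i, cache);
        if (len > best_len) { best_len = len; best = i; }
    }
    return best;
}
```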

~~~
Veedrac
When everything is in cache and you're calculating sequentially, definitely,
but equally why not just precompute the whole range and do a binary search?
Once you're looking for better algorithms the whole problem just falls away!

That said, I do think it's neat that the fastest brute-force variant is a
rough factor-2 from your naïve memoized version. It might even win if it was
fighting a hyperthread for cache space! Just shows how much throughput modern
CPUs have... if only they were used this well all the time ;).

~~~
shoo
> but equally why not just precompute the whole range and do a binary search?

what do you mean?

~~~
Veedrac
The code finds the longest Collatz sequence for start values below N, where N
is a 32-bit integer. There are only ~1k possible record-holders, so just list
them all. Given N, search for the largest record below it and return its
corresponding iteration count.
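(Concretely, a sketch of that precompute-then-search idea, with hypothetical helper names:)

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

using Rec = std::pair<uint64_t, uint64_t>;  // (start value, chain length)

// Record-setting start values: each has a longer chain than every smaller
// start. Sorted ascending by start value by construction.
std::vector<Rec> build_records(uint64_t limit) {
    std::vector<Rec> rec;
    uint64_t best = 0;
    for (uint64_t i = 1; i < limit; i++) {
        uint64_t n = i, len = 1;
        while (n != 1) { n = (n & 1) ? 3 * n + 1 : n / 2; len++; }
        if (len > best) { best = len; rec.push_back({ i, len }); }
    }
    return rec;
}

// Longest chain among start values strictly below n (n >= 2 assumed, so at
// least the record for start value 1 exists below n).
uint64_t longest_below(const std::vector<Rec>& records, uint64_t n) {
    auto it = std::lower_bound(records.begin(), records.end(), Rec{ n, 0 });
    return std::prev(it)->second;
}
```

The table itself is tiny, so every query after the one-time precompute is a handful of comparisons.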

------
elcapitan
tldr: compiler replaces /2 with a shift.

~~~
jandrese
Which makes it sound like the author shouldn't have been trying to hand
assemble for speed. Or maybe treat it as a learning exercise.

~~~
michaelt

      Or maybe treat it as a learning exercise.
    

Is there any other reason to complete a Project Euler question?

------
msimpson
> If you think a 64-bit DIV instruction is a good way to divide by two, then
> no wonder the compiler's asm output beat your hand-written code...

Compilers employ multitudes of optimizations that will go overlooked in hand-
written ASM unless you, as the author, are very knowledgeable. End of story.

------
coldcode
When I started programming on an Apple II+, assembly was important. Today
there are likely only a few people in the world who truly understand what any
particular CPU family is actually doing well enough to beat the compiler in
some cases, and they are probably the ones writing the optimizer. The 6502 was
fun to code for, and the tricks were mighty clever, but you could understand
them.

~~~
abainbridge
> Today there are likely only a few people in the world who truly understand
> what any particular CPU family is actually doing sufficiently to beat the
> compiler

I'm not sure that's true. There are hundreds of compilers in the world, being
maintained by thousands of developers. Then there are all the JITs. And the
people who make the standard library implementations. And performance critical
stuff in game engines, OS kernels, hardware drivers, high-frequency trading.
Then there's the embedded space. And then there's Fabrice Bellard.

My previous employer, Cambridge Silicon Radio (one of hundreds of similar
companies nobody's heard of) had dozens of people on the staff that worked on
this kind of thing. I have friends at ARM, Broadcom, Samsung and Raspberry Pi
that mess around with processor designs for a living. This is just my little
experience of the industry. There are armies of these people.

------
takeda
Not too surprising answer: "your assembly sucks"

------
m3kw9
Because the compiler has optimized it better than you.

------
smegel
> but I don't see many ways to optimize my assembly solution further

I can't do it therefore it must be impossible!

------
barrkel
This was a borderline help vampire question, but it ended up working out well,
probably for nerd-sniping reasons.

