
I translated a simple C program to x86_64 and it was slower - spiffytech
https://ecc-comp.blogspot.com/2020/04/i-translated-simple-c-program-to-x8664.html
======
jdsully
The assembly is not bad for a beginner but far from optimal. A few things that
stand out from a quick glance:

1) Try to load things sequentially; don't load from offset 32, then 0, then 16.

2) Don't use the loop instruction.

3) This code:

    
    
      dec  n
      cmp  n, 0
      jne  1b
    

Can be replaced with:

    
    
      sub n, 1
      jnz 1b
    

Which will actually be executed by the processor as a single instruction
(macro-op fusion).

4) Interleave your expensive instructions with less expensive ones. Try to
interleave multiple dependency chains to let the processor see more of the
parallelism.

Your two divisions, one after another, will be limited by the available
execution units capable of doing the divide.

5) Lastly, align your loop labels to 16-byte offsets. The assembler will do
this for you with the ALIGN directive.
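
For example, points 3 and 5 together might look like this in GAS (the counter
register and the loop body here are hypothetical, not the author's code):

    
    
      .align 16        # 5) align the loop label to a 16-byte boundary
    1:
      # ... loop body, with loads in sequential order (point 1) ...
      sub  rcx, 1      # 3) sub/jnz macro-fuse into a single op
      jnz  1b
    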

~~~
_bxg1
Are there linters that will overlay compiler-tier optimization hints on top of
handwritten assembly?

~~~
CalChris
The closest thing would be the _Godbolt Compiler Explorer_:

[https://godbolt.org/](https://godbolt.org/)

~~~
pjmlp
It is a nice tool for online exploration, but it is no VTune.

------
petermcneeley
You have five bodies and 16 SSE registers. The entire state of your simulation
can fit into register space, and you never need to access memory during the
stepping part of your code. You can unroll all the gravity interactions, so
you end up with one large branchless, memoryless block of code. Now that it's
completely inline, you can rearrange your dependencies based on the expected
latency and throughput of the operations
([https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=MMX,SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2&text=sqrt&expand=5385)).

Then after that you can merge operations where you can. (With SSE4, the most
you are going to get is 2x, because you are using doubles.)

You may think the full inlining is cheating, but the compiler has the same
information, as your bodies list is entirely constant. (Since your dt and your
masses are constant, they can also potentially be folded.)
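
As a rough sketch (with a hypothetical register assignment, Intel syntax; not
the author's code), one unrolled interaction would then look something like:

    
    
      # hypothetical: xmm0 = body0 [x,y], xmm4 = body1 [x,y]
      movapd  xmm14, xmm0
      subpd   xmm14, xmm4      # [dx, dy]
      movapd  xmm15, xmm14
      mulpd   xmm15, xmm15     # [dx^2, dy^2]
      haddpd  xmm15, xmm15     # dx^2 + dy^2 in both lanes (SSE3)
      # ... then add dz^2, sqrt, scale by masses, update velocities:
      # still no loads or stores
    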

~~~
sesuximo
Full inlining might be slower if the branch is easy to predict

------
Porygon
This is the n-body problem from the Computer Language Benchmark Game with the
attribution to the original author removed.

[https://benchmarksgame-
team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-
team.pages.debian.net/benchmarksgame/program/nbody-gcc-3.html)

The faster programs usually use SIMD.

[https://benchmarksgame-
team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-
team.pages.debian.net/benchmarksgame/performance/nbody.html)

~~~
MaxBarraclough
Any idea what's going on with the _mem_ values in the listing in your second
link?

Wouldn't expect a C++ implementation to use 200x the memory of a C
implementation. Different algorithms presumably? Or a significant compiler
optimisation being missed?

~~~
heeen2
Maybe because the C++ version includes the C++ standard library on top of
stdio?

~~~
MaxBarraclough
Ah, of course. So the problem doesn't involve any real allocations, and it's
just the language overhead that we're seeing.

------
MattPalmer1086
Sometime around 2000, I tried to hand optimise an image processing routine in
x86 assembly. Previously I'd only done 6502.

My first attempt was massively slower than the compiled code. I had to get to
grips with all the pipelining that was going on and other hardware
optimisations.

When I discovered putting a NOP in a tight loop sped up the code, I realised I
wasn't going to beat the compiler most of the time!

~~~
ijidak
Really insightful comment.

It's true. C compilers are so optimized that the days of rewriting in assembly
have long passed for most of us.

~~~
lizardmancan
It is the dance between compilers and chip designs that unmakes hand work.

------
jchw
I think the point was adequately made. You can, of course, write better
assembly, and you can of course write the exact same assembly a compiler
would. But if all of the suggestions for optimization are things compilers
would do anyways, you are better off coding in a higher level, portable
language in the first place, and only dropping down to asm when it is
sufficiently beneficial.

All nitpicking about the quality of the assembly is valid of course, but it
does inadvertently help prove the point, especially since most or all of these
things are things compilers do today.

~~~
dkersten
Well, you also have to consider that most of us write a lot of high-level code
and very little, if any, assembly, so we shouldn't expect to be good at
writing high-performance assembly without practicing. I've seen some rather
amazing high-performance hand-written assembly, by people who write it often,
but I certainly wouldn't be able to do it without a lot of study and practice.

Of course, for most people and most uses, it's not worth the effort.

~~~
jchw
I frequently make this point so I may sound like a broken record, but I
believe this is more or less a fallacy.

Again, the point isn’t that better assembly couldn’t be written. It’s that it
most likely wouldn't be significantly better than the compiler because all of
the suggestions are things compilers would be doing anyways. There are some
cases where this isn’t true, especially when dealing with vectorization, but
those are mostly just exceptions (and intrinsics often offer easier ways to do
such optimizations...)

But here’s the point that I feel is often ignored when it comes to programming
language debates in general: just because you are experienced with and aware
of advanced usages of the environment you’re programming in, does not mean the
complexity and especially cognitive overhead of said complexity has
disappeared. Looking at the C version, it doesn’t really look especially
optimized, which is not really something that you would see in assembler, at
least not in my opinion. Complexity adds up over time; abstractions are the
antidote to that problem.

On top of that, assembly language is obviously not portable, which IMO is even
more reason to use a high level language and drop to asm only when needed; you
can easily swap implementations and have a fallback for architectures that
aren’t specifically optimized.

~~~
dkersten
If you don't study and practice it, how can you possibly know how to write
good code (in any language)? In assembly, if you've never been exposed to e.g.
the SIMD instructions, aligned memory access, caches, branch prediction or
instruction-level parallelism, how can you expect to write performant assembly
code? Experience and knowledge don't just appear.

I'm not arguing that it's worth it or that it's easy to beat the compiler. I
certainly am not going to bother writing assembly (maybe some intrinsics for
SIMD, but certainly not raw assembly, outside of embedded systems, although
even then it's usually not really worth it).

I’m simply saying that you can’t expect to be good at something unless you
practice it.

But that doesn't mean people shouldn't try to learn. For example, somebody has
to implement the optimisations in the compiler, and that person needs to have
a great understanding of how to produce high-performance assembly code. Plus,
learning new things is always worthwhile if you have the time.

~~~
pjmlp
In Assembly, even if you manage to beat the compiler, it might be a Pyrrhic
victory, because the win might be lost when trying the same benchmark on
another CPU or after getting a microcode update.

During the '80s and early '90s it was a different matter, because CPUs were
dumb, hardware was relatively static, especially on 8- and 16-bit consumer
systems, and high-level optimizers were pretty dumb given the resource
constraints of those platforms.

~~~
dkersten
I'm not debating whether or not it's a worthy endeavour, though; I'm only
saying that you can't expect good performance out of assembly code unless you
practice writing high-performance assembly code. Most of us have a lot of
experience with high-level languages, so it makes sense that we can write
well-performing high-level code, but we shouldn't expect to just “drop down to
assembly” and get a performance boost. That also doesn't mean it's never
possible, for the people who actually do this a lot (e.g. the x264 people
writing hand-crafted SSE/AVX code).
~~~
pjmlp
It is a herculean effort trying to master modern Assembly.

Back in the day, you could easily know all opcodes for a given CPU, and their
clock cycle timings.

This is the SIMD guide for Intel CPUs:

[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)

Which is only a tiny subset of all the opcodes that a modern Intel CPU is able
to understand, let alone what AMD also offers.

You need tools like VTune from each CPU vendor to actually understand the CPU
clock timings of each opcode in micro-ops (the microcode execution unit).

While you can master a specific subset, like the AVX instructions, mastering
Assembly back to back like in the old days happens only when writing Assembly
for something like small PIC microcontrollers.

Trying to master a language like C++ is easier, which says a lot about what
modern CPUs look like.

------
nneonneo
These days, it’s not only compilers that are optimizing for the particulars of
the CPU, but CPUs optimizing for compiled code. My favorite examples of this
are the inconsistent and sometimes downright crappy performance of “rep”
string instructions and the “loop” instruction. Both seem like ideal easy
things to write for handwritten assembly, yet the performance of both
constructs can be quite awful on certain modern CPUs compared to the naive
loops that compilers output. Much of this can be blamed on the fact that
compilers rarely use either construct, so chipmakers had no reason to make
either of them efficient (and, indeed, seem to have actually pessimized them
on some chips!).

~~~
userbinator
REP MOVS is still extremely fast --- and also small. It'll copy cacheline-
sized chunks if the size is large enough.
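
For reference, the construct in question looks like this (Intel syntax; the
register setup is the standard convention for the string instructions):

    
    
      # with rsi = src, rdi = dst, rcx = byte count
      rep  movsb       # copy rcx bytes from [rsi] to [rdi]
    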

LOOP is a bit of a weird case. I've seen it benchmark both slower and faster
than dec/jnz depending on the surrounding instructions.

~~~
netrikare
The loopXX instructions do not use the CPU's LSD (Loop Stream Detector), while
the cmp/jnz construct takes advantage of it. This speeds up some small loops.
Also, there are some rules in the Intel manuals for instructions within a
cmp/jnz loop, like no mismatched push/pop, etc.

Now, does anyone know why?

~~~
userbinator
_like no mismatched push/pop, etc._

My guess is virtual stack pointer update prediction latency.

To expand on that, Intel's CPUs have had for a long time a separate piece of
hardware dedicated to a "virtual" stack which speeds up push/pop instructions.
If pushes and pops are not mismatched, then all stack operations can stay
entirely within that and there's no need to update the "real" stack pointer
nor stack entries upon leaving the loop.

~~~
netrikare
Thank you for your answer! Any idea why loops would not use the LSD when
programmed using loopXX instructions, but would when cmp/jnX is used?

------
optimiz3
Author really needs to put the assembly under a profiler. Others have
mentioned things like poor instruction selection (loop).

In my experience it's good to first attempt a rewrite in C using intrinsics.
This gets you thinking about data and register layout at a high level and
lets you better identify mid-level optimizations before committing to
assembly.
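
For instance, a minimal sketch of that intermediate step for this n-body code,
assuming the [x,y]/[z,0] register layout discussed elsewhere in the thread
(the names here are made up; the point is just to make the layout explicit
before writing any asm):

    
    
      /* compile with -msse3 */
      #include <immintrin.h>
      
      /* hypothetical layout: xy holds [x,y], z0 holds [z,0.0] */
      typedef struct { __m128d xy, z0; } vec3;
      
      /* squared distance between two positions, all in xmm registers */
      static inline double dist2(vec3 a, vec3 b) {
          __m128d dxy = _mm_sub_pd(a.xy, b.xy);          /* [dx, dy] */
          __m128d dz  = _mm_sub_pd(a.z0, b.z0);          /* [dz, 0]  */
          __m128d s   = _mm_add_pd(_mm_mul_pd(dxy, dxy),
                                   _mm_mul_pd(dz, dz));  /* [dx2+dz2, dy2] */
          s = _mm_hadd_pd(s, s);                         /* sum lanes (SSE3) */
          return _mm_cvtsd_f64(s);
      }
    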

------
fallat
lol, I did not expect this. I submitted this myself yesterday and it got 2
upvotes.

To the people with suggestions, especially jdsully: thank you! I will probably
give these a shot.

To the people who say "blah blah someone who doesn't know x giving an opinion
on y blah blah": ok, what is wrong with sharing an experience? I'm not sure
how to interpret this other than that you take what I write way too seriously.
It's an explorative piece...

Edit: I wanted to mention too that it's really cool reading others' stories
around this topic. Thank you for sharing!

------
userbinator
The author picked a case in which compilers can be faster, because it turns
out to be the sort of code that compilers are usually optimising for: code
that contains plenty of scientific/numerical computations. Thus the result is
not so surprising.

On the other hand, with branchy "business logic"/general-purpose code and
algorithms that can't really be vectorised, it's pretty easy to beat a
compiler with handwritten Asm --- on speed, size, or often both.

~~~
drivebycomment
Please show us a proof - I have yet to see any non-toy program / routine with
complex control flow written in assembly beating a modern optimizing compiler.

You can optimize a small routine. Anything that's not a toy rapidly becomes
impossible to optimize by hand.

Just the register allocation itself can be a gigantic pain to do by hand, even
with the limited register set of x86, let alone on RISCs with much larger
register files.

EDIT: Added clarification in the wording.

~~~
userbinator
[http://menuetos.net/](http://menuetos.net/)

 _Just the register allocation itself can be a gigantic pain to do by hand,
even with a limited register set of x86,_

Actually, that's one of the main advantages of using Asm --- compilers are
relatively horrible at register allocation, because they don't take a more
"holistic" view of the program, and end up shuffling values between registers
or registers and memory far more often. This is why handwritten Asm has a
unique "texture" to it.

 _let alone other RISCs with much larger register files_

There's much less room for optimisation with handwritten Asm on a RISC,
because with x86 the opportunities are precisely those the compiler can't
easily see, unlike a very uniform and boring RISC.

~~~
drivebycomment
You say those things. Please show me an example of a real-world,
general-purpose, non-toy-sized program or routine that is hand-optimized and
shown to perform better than the optimizing compiler.

~~~
augustt
OpenBLAS

~~~
gnufx
GEMM is a fairly special case, but the case is unproven as far as I can see.
The OpenBLAS kernels are certainly bigger than they need be, but it's not
clear to me they satisfy the request (especially addressing "holistic"
treatment of registers), and "OpenBLAS" isn't a performance measurement.

Since I'm interested in this, I'd like to see the case to reproduce with an
analysis of the performance difference. BLAS developers have made statements
about GCC optimization failures (at least vectorization) that aren't correct
in my experience with at all recent GCC. BLIS' generic C kernel already runs
at ~60% (I forget the exact number) of the speed of the Haswell assembler code
for large DGEMM without any tuning attempt, perhaps with GCC extensions. (I
didn't check whether the blocking is right for Haswell or pursue an analysis
with MAQAO or something.)

------
magicalhippo
Not too long ago while optimizing some x86 code on my i7-4900k, I discovered
that at least for this code it was appreciably faster to do

    
    
        mov [mem], reg
        div [mem]
    

rather than simply

    
    
        div reg
    

I'm no asm guru, but I expected the L1 cache to be slower than a register, and
certainly didn't expect a round-trip via L1 to be several percent faster.

The reason I stumbled upon this was that the compiler emitted the first
variant and I tried to optimize it to the second, but lost speed. That's when
I decided I'd stick to optimizing my high-level code...

FWIW this div was in a somewhat tight loop, but sadly I've totally forgotten
the rest of the details. I do recall adding nops here and there to play with
loop alignment; it wasn't that.

~~~
temac
Modern processors are so complex that changes sometimes have paradoxical
consequences, simply in a kind of chaotic way. That seems to be the case in
what you describe. Those are typically not robust effects though, and changing
seemingly completely unrelated things might put the perf right back at what
you expect.

~~~
magicalhippo
Indeed. Having slept on it, I do recall someone mentioning that my performance
discrepancy could be related to busy load/store ports. If the issue is that it
doesn't have free reg-reg capacity but reg/mem and mem/reg are otherwise
available, then it kinda makes sense, I guess.

But in that case the performance could totally change again if just one more
instruction is added.

------
Veedrac
This comparison is a bit unreasonable because the Benchmarks Game has a heavy
selection effect going on; if the compiler generates bad code for a program,
people won't submit it.

There are plenty of instances where compilers go haywire and generate code far
worse than even a half-competent novice would, because compilers don't know
the same things about your code that you do.

~~~
igouy
> if the compiler generates bad code for a program, people won't submit it

Seems just to be speculation.

Might we just as easily speculate that _if the compiler generates bad code
for a program,_ but it still looks a lot better than the comparison
interpreters, then _people will submit it_?

------
nkurz
@fallat:

I started to take a look at your problem, but I wasn't able to run the code
that you posted to test.

For the C code, the "#include" statements at the top are missing their
<header.h>. Easy to solve now that you posted the link to the original, but
maybe you could fix?

For the assembly, you are relying on "list.macro", which I couldn't find.
Could you post this, or even better link to a repository that can be used?

Separately, it would probably be helpful to post the exact model number of
your processor. Speed doesn't (shouldn't?) really matter, but the "generation"
is essential.

Also, I don't think you mention the number of iterations you are running for
your timing. [I see now in the assembly that you seem to be using 5000000?]

Post a bit more info, and someone here (maybe me, maybe not) will figure out
what's causing the slowdown.

~~~
fallat
Oh whoops, I'll add it in. EDIT: It's a damn blogger encoding error.

Yep, it's 5 million iterations.

I will update the post with list.macros immediately!

This is my CPU: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz

Edit: it is updated. :) list.macros has been inlined in the snippet.

~~~
nkurz
Great, I'll take a look in a bit, although it might take me until tomorrow to
have time to do much with it.

In the meantime, I'll mention that my first quick discovery is that clang
seems to be significantly faster than gcc on the standard C code. The ratio
changes with different versions and compilation options, but on Skylake with
"-Ofast -march=native" I find clang-6.0 to be almost twice as fast as gcc-8.
So if you have clang installed, check and see if it might be a better
baseline.

Also, what system are you running? This shouldn't make a difference with
execution speed, but will make it easier to make tool suggestions. If you are
running some sort of Linux, now would be a good time to get familiar with
'perf record'!

Edit:

> Intel(R) Pentium(R) CPU N3700 @ 1.60GHz

Hmm, that's a "Braswell" part, which unfortunately isn't covered in Agner's
standard guide to instruction timings
([https://www.agner.org/optimize/microarchitecture.pdf](https://www.agner.org/optimize/microarchitecture.pdf))
and I'm not familiar with its characteristics. This might make profiling a
little more approximate.

~~~
fallat
Debian buster. And if you're going to make a comparison (I'm just saying this
just in case), you must force it to use only SSE3 and no later. :)

I added the build instruction for the assembly but I'll add it here too: gcc
nbodies.s -no-pie -o nbodies.out

Shoot me an email if you get around to it! :D

I'll check out perf record.

~~~
nkurz
OK, I'm now able to test your code!

The preliminary results I get disagree with what you are seeing. I'm not sure
if this is my error, your error, or just genuine differences between
processors. Specifically, on Skylake, I get your assembly to be much faster
than GCC, although slightly slower than Clang. And that's without trying to
use options to limit the compiler:

    
    
      gcc-8 -Ofast -march=native bodies.c -Wall -lm -o bodies_gcc_Ofast_native_c
      perf stat bodies_gcc_Ofast_native_c 5000000
      2,031,645,589 cycles       #    3.691 GHz
      2,181,200,813 instructions #    1.07  insns per cycle
    
      gcc bodies.S -o bodies_S
      perf stat bodies_S
      1,293,433,853 cycles         #    3.691 GHz
      2,641,011,827 instructions   #    2.04  insns per cycle
    
      clang-6.0 -Ofast -march=native bodies.c -Wall -lm -o bodies_clang_Ofast_native_c
      perf stat bodies_clang_Ofast_native_c 5000000
      1,158,569,067 cycles       #    3.691 GHz
      2,331,116,659 instructions #    2.01  insns per cycle
    

Not having understood the assembly yet, my guess from skimming is that the
compiler isn't able to vectorize this code well, and thus the SSE/AVX
distinction isn't going to matter much. "-Ofast" should be comparable to "-O3"
here, although I don't recall the exact differences. I didn't use "-no-pie"
with your assembly, but I don't think it matters.

Are you able to do a similar comparison and report results with "perf stat"?
Cycles and "instructions per cycle" are going to be better metrics to compare
than clock time. No hurry. I'm East Coast US, and done for the night.

Edit: A quick glance at "perf record" and "perf report" suggests that clang is
slightly faster than you (on Skylake, using these options) because it's making
use of fused multiply-adds. Which slightly contradicts what I said about the
SSE/AVX distinction not mattering, although it's only a minor effect. For both
routines, the majority of the time spent is in the chain starting with the
division. I'm wondering if there is some major difference in architecture with
Braswell --- perhaps it only has a single floating-point multiplication unit
or something? Or one of the operations is relatively _much_ slower.

Edit 2: Looking at
[https://en.wikichip.org/wiki/intel/cores/braswell](https://en.wikichip.org/wiki/intel/cores/braswell),
I now see that Braswell is the name of the "system on a chip", and the CPU
microarchitecture is Airmont. This isn't in Agner either, but at least I've
heard of it!
[https://en.wikichip.org/wiki/intel/microarchitectures/airmon...](https://en.wikichip.org/wiki/intel/microarchitectures/airmont)
says Airmont is basically the same as Silvermont, which Agner does cover, so
we should be able to figure out timings. See page 320 here:
[https://www.agner.org/optimize/instruction_tables.pdf](https://www.agner.org/optimize/instruction_tables.pdf).

Edit 3: I hadn't actually looked at your code yet. So you are trying to
vectorize, it's just that you are limited in doing so because you can only do
2 doubles at a time. But since you are working in 3D, this doesn't fit evenly,
so you do [x,y] then [z,null]. Thus you expected it to be something like 1.5x
faster on your processor, and instead it comes out 2x slower. I think the
issue might be that your processor doesn't really have full vector units ---
instead it's doing some sort of emulation. Look closely at the timings on Page
338 of the instruction table I linked in Edit 2. Note that DIVPD takes just
about twice as long as DIVSD. Then check Skylake on Page 252 -- same time for
packed and single XMM. While this isn't the full answer, I think it's a strong
hint at the issue. The quickest fix (if you actually want to make this faster
on your processor) is to use the scalar instruction for the [z,null] case.
This isn't going to help much overall, since the packed divide still takes
twice as long, but it might at least get them back to parity with the compiler! If
you actually want higher speed from your vectorization, you may have to switch
to a processor that has better vectorized speeds.
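
A sketch of that quick fix, with hypothetical registers; DIVSD computes only
the low lane, so the [z,null] divide stops paying the packed-divide penalty:

    
    
      divpd  xmm0, xmm2    # [x,y]: packed divide (~2x DIVSD on Silvermont)
      divsd  xmm1, xmm3    # [z,null]: scalar divide, low lane only
    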

~~~
fallat
I just want to say: wow! You are showing me something I bet many others are
really not aware of: degraded SSE performance on these types of processors!
Your DIVSD vs DIVPD comment makes a lot of sense too. Man, I feel this HN
thread has been just a gold mine of this kind of information.

What is the speed with restriction to SSE3? That would finalize the tests.

Do you mind if I directly quote you in a follow up post? This is really good
stuff.

~~~
nkurz
> What is the speed with restriction to SSE3? That would finalize the tests.

I think this is the right incantation to restrict to SSE3:

    
    
      clang-6.0 -O3 bodies.c -msse3 -Wall -lm -o bodies_clang_O3_sse3_c
      perf stat bodies_clang_O3_sse3_c 5000000
      1,470,870,787 cycles       #    3.691 GHz
      3,391,153,749 instructions #    2.31  insns per cycle
    
      gcc-8 -O3 bodies.c -msse3 -Wall -lm -o bodies_gcc_O3_sse3_c
      perf stat bodies_gcc_O3_sse3_c 5000000
      2,256,550,525 cycles       #    3.691 GHz
      3,306,361,186 instructions #    1.47  insns per cycle
    

It would be interesting to analyze the difference between GCC and Clang here.
Just glancing without comprehension, it looks like one big difference is that
Clang might be calling out to a different square root routine rather than
using the assembly builtin. Hmm, although it makes me wonder if maybe that
library routine is using some more advanced instruction set?

> Do you mind if I directly quote you in a follow up post?

Sure, but realize that I'm speculating here. Essentially, I'm an expert (or at
least was a couple years ago) on integer vector operations on modern Intel
server processors. But I'm not familiar with their consumer models, and I'm
much less fluent in floating point. The result is that I know where to look
for answers, but don't actually know them off hand. So quote the primary
sources instead when you can.

Please do write a followup, and send me an email when you post it. My address
is in my HN profile (click on username). Also, I might be able to provide you
remote access to testing machines if it would help you test.

------
ksaj
I would compile the C code with the -S option to output your C code as
assembler, and then compare the two to see what decisions were made by the
compiler that made the compiled C code run faster than your hand-wrung code.
The code is short enough that you should be able to relate the different
parts and easily see the "improvements".
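
For example, something like the following (the -masm=intel flag is optional,
but makes the output easier to compare against Intel-syntax source):

    
    
      gcc -O3 -S -masm=intel bodies.c -o bodies_compiler.s
    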

This is how I learned to program assembler for Linux. I already knew TASM /
MASM / a86 assembler languages for DOS/Windows - all similar but different in
their own regards - but gas is a different un-backward beast, and of course
the system calls are totally different between Linux and DOS/Windows.

For example, here are 6 different ways to program helloworld on an ARM cpu
(Raspberry Pi) running Raspbian Linux:
[https://github.com/ksaj/helloworld](https://github.com/ksaj/helloworld) I
kept each of them as close to the same as possible, so that it is easy enough
to see the differences in how each call is set up before execution.

I'm sure if you added timer code before and after the example code, you'd very
quickly discover which methods are more optimal and then begin to dissect why
that would be, and how to use that knowledge in future programs.

Keep in mind that the order of opcodes will impact execution speed, because of
the many techniques modern CPUs use to speed things up (branch prediction,
encoding techniques, entirely rewriting opcodes to something else with the
same functionality - e.g. most compilers understand that mov ax,0, sub ax,ax
and xor ax,ax all do the same thing), etc.

------
aeyes
I have a good real-world counterexample: the game RollerCoaster Tycoon was
written in assembly and ran at full speed on a Pentium 1.

OpenRCT2 is a code conversion to C++. You can turn off all the visual tweaks
they have implemented on top of the original game, and it still requires
significantly more CPU time than the original. And that's with a compiler
being able to take advantage of modern CPU instructions, while the original
game can't do that.

~~~
temac
Maybe the slowness is more due to a poor high-level design for the problem (at
least if you want a highly optimized result)? E.g. AoS everywhere instead of
SoA where needed. I don't know, I have not checked the source code, but it is
a possibility.

------
Franciscouzo
I came to the same result three years ago by translating the exact same
problem into x86-64 asm; I guess I didn't think of it as something worth
sharing.

[https://github.com/franciscouzo/nbody-
asm](https://github.com/franciscouzo/nbody-asm)

~~~
bogomipz
Unfortunately the links in the README file in that repo no longer resolve
([https://benchmarksgame.alioth.debian.org/](https://benchmarksgame.alioth.debian.org/))

Might you or anyone else have the updated link(s) to the nbody problem?

~~~
brobinson
[https://benchmarksgame-
team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-
team.pages.debian.net/benchmarksgame/performance/nbody.html)

Looks like the subdomain was changed at some point.

------
mhh__
All this usually proves is that compilers are smarter than you think.

You can get some gains from SIMD code (i.e. autovectorization is hard), e.g.
when you've laid out data deliberately but the compiler doesn't quite see it,
but modern CPUs are so complicated at executing even scalar instructions that
I wouldn't bother. I think optimizing memory access is more productive half
the time anyway; most programs don't spend that much time number crunching.

~~~
crest
This program is a special case because the whole state fits in registers;
unless the compiler detects this, a human should be able to beat the compiler.
Even unrolled, the code would fit in the L1 instruction cache.

------
ggm
Some optimisation can only be performed after the fact (inasmuch as a profile
of a real run against real data informs the branch prediction and the choices
made in speculative execution and ordering).

Some execution fails: hardware has pathological instruction sequences which
force suboptimal choices to be made.

Alternative algorithms cannot always be identified from an instruction
sequence. What if a higher-order function identified better seeds and led to
faster detection in a large field of candidate solutions?

------
enitihas
What is the best book for learning the nuances of x86 assembly?

~~~
pjmlp
It is now about 20 years old, but you can start with "Zen of Assembly
Language" and carry on with "Graphics Programming Black Book".

[https://github.com/jagregory/abrash-zen-of-
asm](https://github.com/jagregory/abrash-zen-of-asm)

[https://github.com/jagregory/abrash-black-
book](https://github.com/jagregory/abrash-black-book)

------
chj
I recently had some fun writing arm64 assembly. My experience is that
hand-written assembly is faster if you apply the same cheats that the compiler
uses. Three things to keep in mind: 1) make use of the registers; 2) avoid
branches; 3) get familiar with the instruction set and make good use of it.

[https://litchie.com/2020/03/arm64-fib](https://litchie.com/2020/03/arm64-fib)

------
kazinator
I once found some book on MC68000 programming on the Macintosh. In a chapter
on graphics, the book presented an assembly routine for drawing a filled
circle and bragged about how machine code gets it down to just .25 seconds. I
went, "what???" and read the code: it was calling an integer square root
subroutine for every scan line of the circle. Not even a good integer square
root routine.

~~~
saagarjha
[https://www.folklore.org/StoryView.py?story=Round_Rects_Are_...](https://www.folklore.org/StoryView.py?story=Round_Rects_Are_Everywhere.txt)
might be relevant?

~~~
cheerlessbog
Interesting, but it left me wondering what his fast routine for drawing
rounded rects was. The squarical formula?

~~~
Someone
Here’s my guess: the Mac’s drawing routines all did clipping in software. That
was accomplished by having drawing of rectangles, circles, ovals, polygons and
rounded rectangles (for various operations such as erasing or filling them,
drawing a frame along the outside, inverting the bits inside) all come down
to:

- compute the set of bits (called a ‘region’) to operate on

- call the appropriate function to erase/fill/… that region

I would guess drawing rounded rectangles was done this way:

- create a region for a circle with radius equal to the radius of the corners.

- insert horizontal parts into each row of bits in the region to ‘stretch’ the
region horizontally into a rounded rectangle that has the correct width, but
is only as high as the circle.

- insert vertical parts into each column of bits in the region to ‘stretch’
the region vertically into a rounded rectangle that has the correct width and
height.

The first step was identical to the code for drawing circles; the region data
structure made the last two operations cheap; they did not require any memory
allocations. You had to walk the entire data structure, but for small corner
radiuses, it wasn't that large. Also, one could probably optimize for speed by
doing that while creating the region for the circle.

I can’t find a good description of the region data structure, but
[https://www.folklore.org/StoryView.py?project=Macintosh&stor...](https://www.folklore.org/StoryView.py?project=Macintosh&story=I_Still_Remember_Regions.txt)
might be enough for some to figure out how it, conceptually, worked.

~~~
pengaru
My guess is they didn't even need to be fast, just render once when the window
is configured and keep a cache of rounded corners in the window system
complete with the masks. They're tiny, and symmetrical.

~~~
Someone
Computing once when the window is created or resized is what happened, yes,
but that doesn't work when using them for buttons (unlike MS Windows, where
every control is a ‘Window’, on Mac OS controls inside a window didn't have
their own drawing context), or when using them for window content (e.g. in a
drawing program).

And remember: the original Mac had about 28 _kilo_ bytes of RAM free for
applications. The system unloaded icons, code and fonts that were available on
disk all the time. Few objects were ‘tiny’ at the time.

------
aj7
“I haven't looked particularly close at it but I'm sure someone could spot why
mine is almost 2x as slow.”

Isn’t that the whole point?

------
tambre
Maybe aligning functions would help. The compiler likely does that for the C
version.

------
teleforce
Interesting, but personally I'd be more interested in the feasibility of an "X
programming language with GC translated to C, and the C version was slower"
type of article.

------
CodeWriter23
Compiler beats novice assembly programmer ¯\\_(ツ)_/¯

------
e12e
Maybe I missed it, but did the author list compiler options? I wonder if the
handwritten assembler was slower than a compile with optimizations off, too.

~~~
205guy
From the article: “The gcc options for the C code used was -O3 -lm, because my
processor doesn't support AVX“.

------
toddinsights
One thing I was told in college is that compilers are much better at assembler
than people.

------
meche123
"Let me quickly set the scene for when I started: I had zero exposure to
x86_64 before this."

So, to sum it up: a guy who has no clue writes worse code than that generated
by a decent compiler.

------
arm64future
> The saying that "processors are optimizing for C" is totally correct.

I thought that the latest consensus on HN was that C is nowhere close to the
hardware.

~~~
comex
C is close to the interface the hardware presents you – that is, the
instruction set – but far from what the hardware actually _does_.

Caching, out-of-order execution, branch prediction, speculation... all of
those things are just as opaque to assembly as they are to C.

~~~
msla
> C is close to the interface the hardware presents you – that is, the
> instruction set – but far from what the hardware actually does.

C isn't even close to that anymore, now that so many processors have SIMD ISAs
which C can't usefully model.

