FWIW, I ran it on a MacBook Pro (13-inch, 2019, Four Thunderbolt 3 ports), 2.4 G...

namibj · on Jan 6, 2021

You _need_ to use -mnative because it otherwise retains backwards compatibility to older x86.

FabHK · on Jan 6, 2021

  (base) Coding % cc -mnative two-three.c
  clang: error: unknown argument: '-mnative'

  (base) Coding % cc -v
  Apple clang version 12.0.0 (clang-1200.0.32.28)
  Target: x86_64-apple-darwin20.2.0
  Thread model: posix

astrange · on Jan 7, 2021

It's spelled "-march=native" in gcc and "-arch x86_64h" in clang.

It doesn't make much difference, though: autovectorization doesn't work very well, and there isn't much special optimization for newer x86 CPUs.

namibj · on Jan 8, 2021

All recent Intel Core-i microarchitectures require using full vector width loads to max out L1d bandwidth, because the load/store units don't actually care about the width of a load, as long as it doesn't cross a cache line (in which case the typical penalty is an additional cycle).

Using only 128-bit-wide instructions on a core that has 512-bit hardware gives you a quarter of the L1d bandwidth.
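
To make the arithmetic above concrete, here is a minimal sketch in C. It assumes the predefined macros `__AVX512F__`, `__AVX__` and `__SSE2__`, which GCC and Clang set according to the target flags. The helper names `target_vector_bits` and `bandwidth_ratio` are hypothetical, and the ratio models only the claim that a load costs the same regardless of its width. It is not a measurement.

```c
#include <assert.h>

/* Widest SIMD register width, in bits, that the compiler was allowed to
   target for this translation unit. Relies on macros predefined by GCC
   and Clang; returns 0 when none of them is set (e.g. a non-x86 target). */
static int target_vector_bits(void) {
#if defined(__AVX512F__)
    return 512;
#elif defined(__AVX__)
    return 256;
#elif defined(__SSE2__)
    return 128;
#else
    return 0;
#endif
}

/* Assumption from the comment above: a load occupies the load unit for
   the same time whatever its width, so peak L1d bandwidth scales with
   the width of each load. Returns how many times more data the full
   hardware width moves per load than the width actually used. */
static int bandwidth_ratio(int hw_bits, int used_bits) {
    assert(used_bits > 0 && hw_bits % used_bits == 0);
    return hw_bits / used_bits;
}
```

Printing `target_vector_bits()` from a build with default flags and again from a build with `-march=native` (the gcc spelling given above) shows whether the flag actually widened the target. `bandwidth_ratio(512, 128)` gives the factor of 4 mentioned above.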

africanboy · on Jan 6, 2021

There must be something wrong there. On my late 2014 laptop, which has

    Type: DDR4
    Speed: 2133 MT/s

I get

    two  : 27.1 ns (3x)
    two+ : 28.6 ns (2.2x)
    three: 39.7 ns (3x)

which is not much, considering this is an almost 6 years old system with 2x slower memor

FabHK · on Jan 6, 2021

Dunno, I didn't reboot and didn't close all other programs (browser, editor, mail, calendar, notes, editor)... Top shows

Load Avg: 2.36, 2.01, 1.97 CPU usage: 2.10% user, 3.39% sys, 94.49% idle