
FWIW, I ran it on a MacBook Pro (13-inch, 2019, Four Thunderbolt 3 ports), 2.4 GHz Quad-Core Intel Core i5, 8 GB 2133 MHz LPDDR3:

  two  : 49.6 ns  (x 5.5)
  two+ : 64.8 ns  (x 5.2)
  three: 72.8 ns  (x 5.6)
EDIT to add: above was just `cc`. Below is with `cc -O3 -Wall`, as in Lemire's article:

  two  : 62.8 ns  (x 7.1)
  two+ : 69.2 ns  (x 5.5)
  three: 95.3 ns  (x 7.3)


You _need_ to use -mnative because it otherwise retains backwards compatibility to older x86.


  (base) Coding % cc -mnative two-three.c
  clang: error: unknown argument: '-mnative'

  (base) Coding % cc -v
  Apple clang version 12.0.0 (clang-1200.0.32.28)
  Target: x86_64-apple-darwin20.2.0
  Thread model: posix


It's spelled "-march=native" in gcc (clang accepts that spelling too), and Apple's clang also has "-arch x86_64h" for targeting Haswell and later.

It doesn't make much difference though: autovectorization doesn't work very well, and there is not a lot of special optimization for newer x86 CPUs.
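
If you want to see what a given flag actually enables, one rough check is to look at the compiler's predefined macros. This is only a sketch: `widest_vector_isa` is a made-up helper name, and it reports what the compiler was allowed to target, not whether the autovectorizer used it.

```c
/* Hypothetical helper: reports the widest x86 vector extension the
   compiler was allowed to target for this translation unit, based on
   the predefined macros that gcc and clang set from -march / -m flags. */
const char *widest_vector_isa(void) {
#if defined(__AVX512F__)
    return "AVX-512F (512-bit)";
#elif defined(__AVX2__)
    return "AVX2 (256-bit)";
#elif defined(__AVX__)
    return "AVX (256-bit)";
#elif defined(__SSE2__)
    return "SSE2 (128-bit)";
#else
    return "no x86 SIMD macro defined";
#endif
}
```

Printing the result from `main` once with plain `cc` and once with `-march=native` should show whether the flag changed the target at all on your machine.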


All recent Intel Core-i microarchitectures require using full vector width loads to max out L1d bandwidth, because the load/store units don't actually care about the width of a load, as long as it doesn't cross a cache line (in which case the typical penalty is an additional cycle).

Only using 128 bit wide instructions on a core that has 512 bit hardware results in 4x less L1d bandwidth.
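
To spell out the arithmetic: if a load port retires one load per cycle no matter how wide the load is, peak bandwidth scales linearly with load width. A minimal sketch, where the function name and the two-loads-per-cycle figure used below are illustrative assumptions rather than numbers for any particular core:

```c
/* Peak L1d load bandwidth in bytes per cycle, assuming every load
   takes one port slot regardless of its width (and does not cross a
   cache line). loads_per_cycle is an assumed port count. */
unsigned l1d_bytes_per_cycle(unsigned loads_per_cycle,
                             unsigned load_width_bits) {
    return loads_per_cycle * (load_width_bits / 8);
}
```

With an assumed 2 loads per cycle, 128-bit loads give 2 * 16 = 32 bytes per cycle and 512-bit loads give 2 * 64 = 128 bytes per cycle, which is the 4x gap.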


There must be something wrong there. On my late 2014 laptop with

    Type: DDR4
    Speed: 2133 MT/s
I get

    two  : 27.1 ns (3x)
    two+ : 28.6 ns (2.2x)
    three: 39.7 ns (3x)
which is not much, considering this is an almost 6-year-old system with 2x slower memory.


Dunno, I didn't reboot and didn't close all other programs (browser, editor, mail, calendar, notes, editor)... Top shows

Load Avg: 2.36, 2.01, 1.97 CPU usage: 2.10% user, 3.39% sys, 94.49% idle



