Then my results disagree with yours. Do you have the source code to your benchma...

geophile · on June 21, 2014

By the way, I typically see #2 slightly and consistently faster than #1, which doesn't make sense to me. But I wouldn't expect a 3x difference in favor of #1.

geophile · on June 21, 2014

http://pastebin.com/DtsYEsYL

detrino · on June 21, 2014

The compiler is entirely eliminating your loop because you never do anything with the sum variable. Also, signed integer overflow is UB so you should use unsigned integers or compile with the non-standard -fwrapv option.

geophile · on June 21, 2014

I'm not convinced of this. I modified the code to do something with the sums. #2 is now slightly slower than #1, but the absolute times and the ratios are otherwise about the same.

http://pastebin.com/vbUSdKGc

detrino · on June 21, 2014

With your modified code I am now seeing #1 run about 3x faster than #2. I also still strongly suspect you weren't measuring anything at all before your modifications. Benchmarking is tricky and its hard to say if you are really measuring what you think you are. This is especially true with UB in your loop. It would be best to separate the program into 3 and then investigate the assembly.

detrino · on June 21, 2014

Also make sure you are compiling with optimizations on.

geophile · on June 21, 2014

Now working in an Ubuntu VM, since I know the tools better than I do XCode. I reduced TRIALS to 20, to reduce testing time. I also replaced summation with xor. My current code is here: http://pastebin.com/QcNu2A69

Here is what I see with default optimization:

    jao@minizack:/tmp$ gcc -std=c99 x.c
    jao@minizack:/tmp$ ./a.out
    INTERNAL time/scan (usec): 29608.000000		xor: 1722783152
    EXTERNAL UNSORTED time/scan (usec): 26688.105263		xor: 1722783152
    EXTERNAL SORTED time/scan (usec): 279771.105263		xor: 1722783152
    SORTED/UNSORTED: 10.482989

and with -O3:

    jao@minizack:/tmp$ gcc -std=c99 -O3 x.c
    jao@minizack:/tmp$ ./a.out
    INTERNAL time/scan (usec): 3636.421053		xor: 1722783152
    EXTERNAL UNSORTED time/scan (usec): 12114.263158		xor: 1722783152
    EXTERNAL SORTED time/scan (usec): 242141.263158		xor: 1722783152
    SORTED/UNSORTED: 19.988113

The INTERNAL time is 8x faster. EXTERNAL UNSORTED is 2x faster. EXTERNAL SORTED hasn't sped up much at all. My conclusions:

- With -O3, the difference between INTERNAL (#1) and EXTERNAL UNSORTED (#2) is close to what you are seeing.

- Loops cannot be optimized away in this version.

- The fact that -O3 speeds up EXTERNAL UNSORTED by 2x but EXTERNAL SORTED only a little makes sense if the latter is slow due to cache misses.

detrino · on June 21, 2014

Well, benchmarking without optimizations enabled isn't very useful :)