Sorry, I don't quite understand the explanation... Do you have some simple pseud...

Coding_Cat · on Feb 9, 2017

>Sorry, I don't quite understand the explanation...

define a

  struct S{char flag, data[63];}

then do

  foreach s in std::vector<S>(big number){
    if s.flag == 0{ //set this variable to be very rarely 0
     do_something_with(s.data);
    }

this will load the entire struct into memory each iteration because loads from memory happen on cacheline granularity (i.e. 64 bytes at a time)

then the optimized version is

>And are you using the same number of instructions in both cases?

  //define S as struct of arrays now
  struct S{char* flags, (*data)[63];}
  s = initialize_S(big number);
  for(i = 0; i < big number; i++){
    if s.flags[i] == 0{ //set this variable to be very rarely 0
     do_something_with(s.data, i);
    }

This one will load 64 flags into the L1 cache at a time. because the L1 cache is much much faster than main memory access time will be ~64x faster for the checking of the flags. Because it is rare for the loop to enter the if, this means the overal speed will be ~64x faster. As proven by the graph I posted where it is about 16x faster (because I use an int instead of a byte as flag).

Syntax is slightly different between SoA and AoS, but functionally they are the same except for how multiple S's are laid out in memory.

>And are you using the same number of instructions in both cases? >Note also: If you are moving data structures of different sizes, you are not using the same number of instructions(calculations), as the challenge required.

The same number of instructions is an arbitrary and impossible task. I'm not going to handwrite assembly for some internet points, I only do that for fun ;). Do they perform the same amount of cpu work though, yes. They are functionally the same and use the same types of instructions. That's about as good as anyone can get it. In any case, the program is completely limited by memory throughput not computation. I could make one of the inner loops a sqrt and the other a nop and the difference would be the exact same.

If you really want to see it in assembly to prove it uses the same amount of instructions, then it's just a mov rax, [rcx], cmp rax, r8, je INNER_FUNCTION, inc rcx {sizeof(S),1} depending on the method. Add 0x90 where appropriate to match instruction counts. The rest is left as an exercise to the reader.