
Languages like Java and JavaScript don't let you lay data out in memory directly, nor do they give you much control over how memory is accessed, so any performance benchmark involving C is entirely superficial.

I can write 2 programs in C, both of which iterate over the same number of elements and perform the same calculations on the same amount of data, and have one take 500ms and the other take 8s. It's all a matter of how you lay things out in memory.




Emscripten uses a TypedArray for the C heap in asm.js, which has exactly the same data layout as if the C code were compiled natively. This gives you (more or less) the same performance advantages for properly arranged data as native code.


You actually can lay things out in memory using Typed Arrays.

There's some overhead to reads/writes though (there is memory safety, after all).


What's amazing is that in some engines, if the JIT can determine that you are within the bounds of the array, it can disable those checks for the duration of the loop.

So something like for (x = 0; x < arr.length; x++) {...} can run at pretty much the same speed as C code.


That's a bold claim. Sure, it is theoretically possible to have two versions of the same C program take either 500ms or 8ms purely due to memory layout.

But I would like to challenge you to actually do it! I.e. the same number of calculations on the same amount of data, with a factor-of-60 runtime difference and only the memory layout as the actual difference between the two implementations.

Up for it?


I'll take that challenge. Here's how to do it, with a little bit of sneaky interpretation of 'memory layout':

Have an algorithm which takes a large struct S and looks at a subset Sf to determine what to do with S: for some values of Sf use all of S, otherwise skip it (e.g. when the distance between 2 particles is below a threshold, calculate the force).

Now have a very low pass rate for the filter, so that total time ~= time taken to read all of Sf. Make Sf a single byte and S >= 64 bytes (or page-sized, if you really want to beat it).

And now compare array-of-structs vs struct-of-arrays ;). You should see a ~64x performance difference in the asymptotic case.

AoS will use one byte per cacheline read; SoA will use all 64 bytes per cacheline. If you're memory bound, this will translate to an almost 64x speed difference. There are other ways to achieve the same effect, but that is the gist of getting 64x performance: using 1 byte vs 64. If you want to go really crazy, use a single bit and a bitfield array for the SoA version; with a bitfield, the passing chance doesn't have to be that low.

I happened to have to present on AoS vs SoA today, so I have a less extreme benchmark showing the difference (~16x, because it filters over 32-bit ints):

http://imgur.com/a/HuXFR


So you can write code that is 64x slower when you specifically optimize for a slowdown. Cool.

Can you run that in WebAssembly and get the same slowdown? If so, most of the same things that allow C/C++ to be faster than other languages on native systems will also allow them to be faster in the browser.


Yeah, this will affect any language that stores structs as POD (plain old data), as long as it has somewhat sensible alignment rules and allocation (i.e. doesn't hide every struct behind a pointer, doesn't align chars to 8 bytes).
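If you want to sanity-check what a given compiler actually does with such a struct, here's a quick C++ sketch (using the 64-byte example struct from this thread):

  #include <cstddef>

  struct S { char flag; char data[63]; };

  // With sensible alignment rules, chars are 1-byte aligned, so the
  // struct packs into exactly one 64-byte cacheline with no padding.
  static_assert(sizeof(S) == 64, "unexpected padding");
  static_assert(alignof(S) == 1, "unexpected alignment");
  static_assert(offsetof(S, data) == 1, "flag and data should be adjacent");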


Sorry, I don't quite understand the explanation...

Do you have some simple pseudo-code?

And are you using the same number of instructions in both cases?

Note also: if you are moving data structures of different sizes, you are not using the same number of instructions (calculations), as the challenge required.


>Sorry, I don't quite understand the explanation...

define a

  struct S { char flag; char data[63]; };  // 64 bytes: exactly one cacheline
  
then do

  std::vector<S> items(big_number);
  for (const S& s : items) {
    if (s.flag == 0) {  // arrange the data so this is very rarely true
      do_something_with(s.data);
    }
  }
This will load the entire struct from memory on each iteration, because loads from memory happen at cacheline granularity (i.e. 64 bytes at a time).

then the optimized version is

  // define S as a struct of arrays now
  struct S {
    char* flags;         // all flags contiguous in memory
    char (*data)[63];    // payloads stored separately
  };

  S s = initialize_S(big_number);
  for (size_t i = 0; i < big_number; i++) {
    if (s.flags[i] == 0) {  // again, very rarely 0
      do_something_with(s.data[i]);
    }
  }
This one will load 64 flags into the L1 cache at a time. Because the L1 cache is much, much faster than main memory, checking the flags will be ~64x faster. And because the loop rarely enters the if, the overall speed will be ~64x higher, as shown by the graph I posted, where it is about 16x faster (because I use an int instead of a byte as the flag).

Syntax is slightly different between SoA and AoS, but functionally they are the same except for how multiple S's are laid out in memory.

>And are you using the same number of instructions in both cases?

>Note also: If you are moving data structures of different sizes, you are not using the same number of instructions (calculations), as the challenge required.

Requiring the same number of instructions is arbitrary and impossible to satisfy exactly; I'm not going to handwrite assembly for some internet points, I only do that for fun ;). Do they perform the same amount of CPU work, though? Yes. They are functionally the same and use the same kinds of instructions, which is about as good as anyone can get it. In any case, the program is completely limited by memory throughput, not computation. I could make one of the inner loops a sqrt and the other a nop and the difference would be exactly the same.

If you really want to see it in assembly to prove it uses the same number of instructions: it's just mov rax, [rcx]; cmp rax, r8; je INNER_FUNCTION; then advance rcx by sizeof(S) or by 1, depending on the method. Add 0x90 (NOP) where appropriate to match instruction counts. The rest is left as an exercise to the reader.
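If anyone wants to run the whole thing, here's a self-contained C++ sketch of the comparison (the element count, the 1-in-256 pass rate, and the payload-summing stand-in for do_something_with are all illustrative choices, not the original benchmark):

  #include <array>
  #include <chrono>
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  constexpr std::size_t kCount = 1 << 21;  // "big number": ~256 MB total, well past any cache
  constexpr char kRare = 0;                // the rarely-occurring flag value

  struct Item { char flag; char data[63]; };  // AoS element: one full cacheline

  // Stand-in for do_something_with: just sum the payload.
  static long consume(const char* d) {
    long s = 0;
    for (int i = 0; i < 63; i++) s += d[i];
    return s;
  }

  int main() {
    // AoS: flag and payload interleaved, one struct per cacheline.
    std::vector<Item> aos(kCount);
    // SoA: all flags contiguous, payloads in a separate array.
    std::vector<char> flags(kCount);
    std::vector<std::array<char, 63>> data(kCount);

    for (std::size_t i = 0; i < kCount; i++) {
      char f = (i % 256 == 0) ? kRare : 1;  // ~1-in-256 pass rate
      aos[i].flag = f;
      flags[i] = f;
    }

    long total = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (const Item& s : aos)               // pulls in a full 64-byte line per 1 useful byte
      if (s.flag == kRare) total += consume(s.data);
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < kCount; i++)  // 64 useful flags per line; payloads read only on a hit
      if (flags[i] == kRare) total += consume(data[i].data());
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
      return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("AoS: %lld ms, SoA: %lld ms (checksum %ld)\n",
                (long long)ms(t0, t1), (long long)ms(t1, t2), total);
  }

You won't see the full 64x on real hardware (prefetching and the flag scan itself eat into it), but the gap should be dramatic whenever the working set is much larger than the last-level cache.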


Any entry-level optimization course will make you implement something like that: do a matrix-matrix addition, but traverse it in row-major order in one program and in column-major order in the other. If the matrix is big enough, you will easily get a 10x difference. For even larger sizes (once you thrash the TLB), you get an additional factor.

You might not get up to 60, but this is just a simple matrix addition. Even plain matrix multiplication might be enough to get a 60x difference just by optimizing for the memory hierarchy. I expect that high-performance memory-bound programs are more complex than matrix multiplication, so the parent comment's claim doesn't seem too unlikely to me.
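For the curious, a minimal C++ sketch of that exercise (the 4096x4096 size is just an illustrative pick that's comfortably bigger than the caches; time each function separately on N*N-sized vectors to see the gap):

  #include <cstddef>
  #include <vector>

  constexpr std::size_t N = 4096;  // 64 MB per float matrix: far larger than cache

  // Matrices are stored row-major: a[i*N + j], so consecutive j are adjacent in memory.
  void add_row_major(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c) {
    for (std::size_t i = 0; i < N; i++)
      for (std::size_t j = 0; j < N; j++)   // walks memory sequentially
        c[i * N + j] = a[i * N + j] + b[i * N + j];
  }

  void add_col_major(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c) {
    for (std::size_t j = 0; j < N; j++)
      for (std::size_t i = 0; i < N; i++)   // jumps 16 KB per step: a cache miss per element
        c[i * N + j] = a[i * N + j] + b[i * N + j];
  }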


Sorry, I didn't mean to question whether it is possible. I just think that a factor of 60 sounds tricky to achieve if we are just talking about RAM (no disk involved).

The first time I encountered this myself was doing texture mapping on the 486 back in the mid 90s. Texels would be laid out by row, and the mapper would draw horizontally. Texture maps that fit within the L1 cache would draw fine no matter which rotation they were drawn at.

However, texture maps that were larger than the L1 cache would only draw fine as long as the texture wasn't significantly rotated. If you rotated the polygon by 90 degrees, you'd see a significant drop in frames per second, because you'd skip a whole row between drawing each texel, which would effectively cause a lot of cache misses. OK, I didn't manage to explain that well, but I hope it is understandable.

Obviously, I have encountered this many times since, in many variants (including disk reads/writes), and I definitely agree that a factor of 10 should be possible.

I guess I'm just wondering whether the difference between RAM and L1 has reached a factor of 60, and how easy it is to come up with something where the cache misses all the time, or whether caches have become smarter.


Googling puts RAM latency (avoiding caches) on modern processors at 40-60 cycles.


Exactly... Now, that alone could make it tricky to get a factor-of-60 difference.


Write a simplistic O(N^3) algorithm.

When N is 100, all fixed costs in the inner loop are multiplied by 1,000,000. So a nanosecond difference is now a millisecond. It's pretty easy to do worse than that with boneheaded data layout (row-major vs column-major is a pretty common mistake).
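The matrix-multiply version of this, as a C++ sketch (naive ijk order vs the cache-friendly ikj order; N is an illustrative size):

  #include <cstddef>

  constexpr std::size_t N = 1024;

  // Naive order: the inner loop walks b down a column (stride N floats),
  // taking a cache miss on essentially every element of b.
  void matmul_ijk(const float* a, const float* b, float* c) {
    for (std::size_t i = 0; i < N; i++)
      for (std::size_t j = 0; j < N; j++) {
        float sum = 0;
        for (std::size_t k = 0; k < N; k++)
          sum += a[i * N + k] * b[k * N + j];  // column-wise access to b: slow
        c[i * N + j] = sum;
      }
  }

  // Same arithmetic, reordered loops: every array is now walked row-wise.
  // Requires c to be zero-initialized.
  void matmul_ikj(const float* a, const float* b, float* c) {
    for (std::size_t i = 0; i < N; i++)
      for (std::size_t k = 0; k < N; k++) {
        float aik = a[i * N + k];
        for (std::size_t j = 0; j < N; j++)
          c[i * N + j] += aik * b[k * N + j];  // sequential access: fast
      }
  }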


No...

If you use a "bonehead" data structure, each memory reference will force a cache miss, i.e. take q CPU cycles instead of 1.

If the algorithm has a running time of O(N^3), it means it uses some number of instructions <= c * N^3 to complete. For simplicity's sake, let's say 1 operation is identical to 1 CPU cycle.

Now, if each operation takes q CPU cycles instead of 1 due to a cache miss on every operation, the running time of your algorithm is still bounded by q * (c * N^3) cycles.

For the example with N=100: the fast version takes c * 1,000,000 operations to complete, for some value of c. For the slow version, as each operation is now q times slower, it will take q * c * 1,000,000.

I.e. it takes q times as long, not q * q * q.

What you were saying is that if you run an O(N) algorithm on a 100 MHz CPU, it will take 10 times as long as on a 1000 MHz CPU, but an O(N^3) algorithm would take 1000 times as long. And that is obviously not true: both will take 10 times longer.


No

I'm saying that a poor choice of data structure will have an amplified impact on an O(N^3) algorithm.

I was giving it as an example of how different implementations of the same algorithm can have massively different runtimes on the same processor.


If the algorithm has a running time of O(N^3), that implies it takes at most c * N^3 operations to complete, for some constant c, regardless of data structures. That's the definition of asymptotic running time.

If we disregard c, and set N = 100, that is 1,000,000 operations.

We have two situations: 1) we use a good layout, where each operation is fast because every memory access comes from cache; or 2) we use a bad layout, where each operation is slow because we take a cache miss and have to go to RAM. These access times are independent of the specific algorithm and depend only on the CPU architecture.

Let's say the fast operations each take 1 second, and the slow operations each take f seconds.

In the fast case, with good memory layout, the algorithm completes in 1,000,000 seconds. In the slow case, with bad memory layout, the algorithm completes in f * 1,000,000 seconds.

The time each operation takes does not depend on the size of N. In practice, you'll typically get a lot of cache hits even with a bad memory layout, so the difference may be less than a factor of f for smaller values of N (where a cubic algorithm gets more cache hits), but it will be upper bounded by f. I.e. there is an upper bound on how much slower you can make an algorithm by changing the memory layout.


Obviously N and f are independent (although in practice they might not be: your N may dictate how often a cache miss occurs).

Right, so now let's assume that our O(N^3) algorithm is simply a triple nested loop, and that we have an expensive operation f, but that through our choice of data structures we can choose in which loop level that operation occurs. All other costs are constant.

If f is in the outer loop, it happens N times, in the middle loop N^2, in the inner loop N^3.

So our choice of data structure impacts our running time by any factor from linear up to the cube of N.
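In sketch form (f is a stand-in for whatever fixed expensive cost the layout forces, e.g. a cache-missing load; purely illustrative):

  #include <cstddef>

  volatile long sink;            // keeps the stand-in cost from being optimized away
  void f() { sink += 1; }        // the expensive operation

  void run(std::size_t N) {
    for (std::size_t i = 0; i < N; i++) {
      // f() here would be paid N times
      for (std::size_t j = 0; j < N; j++) {
        // f() here would be paid N^2 times
        for (std::size_t k = 0; k < N; k++) {
          f();                   // here it is paid N^3 times
        }
      }
    }
  }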

I was simply answering the question of whether it is really possible for two implementations of the same algorithm to have such disparate run times, and it's trivially possible.

Another way of saying it is that the mistake need not look expensive on cursory inspection; a simple way to get very different runtimes is through the accumulation of a large number of small costs.

Maybe the important point is to use a profiler.

I've recently gained a 60x improvement in the real-world run time of what the literature considers the best known algorithm for a particular problem. The speedup didn't come from improving the algorithm; it came from improving data locality, removing memory allocations, and general modern code hygiene (to be fair, the canonical implementation is ~20 years old).


>>Languages like Java and JavaScript don't let you lay data out

You can do enough magic in Java, if need be. It's just not easy, and you'd be using arrays (not objects) or direct ByteBuffers.

---- Btw, the entire test runs in L1, so memory layout is irrelevant. It's just not a good test.



