First author here. The dot product example was compiled with GCC 4.7.2 using -O3 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops; see the caption to Figure 5. What compiler options would you have suggested? The Haskell version only used SSE instructions, not AVX; we should have made this clear in the paper.
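For context, the kind of kernel those flags target looks roughly like the sketch below (this is an assumed stand-in, not the paper's actual benchmark code); -ffast-math is what licenses GCC to reassociate the accumulation so -ftree-vectorize can split it across SSE lanes.

```c
#include <stddef.h>

/* Sketch of a C dot product of the sort GCC auto-vectorizes.
 * Hypothetical; the paper's benchmark kernel is assumed, not quoted.
 * Compile with: gcc -O3 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops */
float dot(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];  /* -ffast-math permits reassociating this sum */
    return acc;
}
```

Without -ffast-math the strict left-to-right float summation order inhibits packed vectorization of the reduction.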
The more complex examples are in Section 5.2; see Figure 8. Granted, we would have liked to do more, but deadlines are deadlines...
Figure 7 shows Haskell-generated AVX instructions, albeit ones using only the lower 128 bits. That code would not run on an SSE4.2-capable Nehalem, for instance, since VEX-encoded instructions require AVX support.
There are some other slight CPU-related inaccuracies in the paper. Prefetching is repeatedly mentioned, even though its effect is negligible when the memory access pattern is perfectly linear (the hardware prefetcher already handles that case); unaligned loads are mentioned as a performance hit, but they are essentially free on the test processor (a 2600K, i.e. Sandy Bridge).
Matrix multiplication would perhaps be a better example to show the power of clever prefetching.
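To illustrate the point (this is a hypothetical sketch, not code from the paper): in a naive matrix multiply, the column-strided walk over the second operand is exactly the pattern the hardware prefetcher handles poorly, so a software hint such as GCC's __builtin_prefetch can pay off there, unlike in the linear dot-product loop.

```c
#include <stddef.h>

/* Hypothetical sketch: naive n x n matmul (row-major) with a software
 * prefetch on the column-strided accesses to b. __builtin_prefetch is a
 * GCC builtin; it is only a hint and never faults, even past the array. */
void matmul(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++) {
                /* hint a few iterations ahead along the strided column */
                __builtin_prefetch(&b[(k + 8) * n + j]);
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
}
```

(In practice one would block the loops for cache first; the prefetch hint is just the simplest way to show an access pattern where it is not redundant.)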
The Haskell assembly uses the 3-operand AVX instructions, but not the 256-bit YMM registers. The example is simple enough that the difference between 2-operand and 3-operand encodings may not matter, but in a more complex function with register spills it could. (-mavx should be enough to enable this.)
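The encoding difference is easiest to see from a single intrinsic (a hypothetical example, not taken from the paper): the same 128-bit operation compiles to the destructive 2-operand SSE form (`addps xmm0, xmm1`, clobbering a source) without -mavx, and to the non-destructive 3-operand VEX form (`vaddps xmm0, xmm0, xmm1`) with it. That extra freedom is what helps the register allocator once values start spilling.

```c
#include <xmmintrin.h>

/* Hypothetical 4-wide a*x + y. With gcc -O2 this lowers to mulps/addps
 * (2-operand, destructive); with gcc -O2 -mavx it lowers to
 * vmulps/vaddps (3-operand, non-destructive VEX encoding). */
__m128 axpy4(__m128 a, __m128 x, __m128 y) {
    return _mm_add_ps(_mm_mul_ps(a, x), y);
}
```

The computed result is identical either way; only the instruction encoding, and hence the pressure on the allocator, changes.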