First author here. The dot product example was compiled with GCC 4.7.2 using -O3 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops; see the caption to Figure 5. What compiler options would you have suggested? The Haskell version only used SSE instructions, not AVX; we should have made this clear in the paper.
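For context, the kind of kernel those flags target looks roughly like the sketch below (this is an assumed stand-in, not the paper's actual benchmark code); -ffast-math is what licenses GCC to reassociate the accumulation so -ftree-vectorize can split it across SSE lanes.

```c
#include <stddef.h>

/* Sketch of a C dot product of the sort GCC auto-vectorizes.
 * Hypothetical; the paper's benchmark kernel is assumed, not quoted.
 * Compile with: gcc -O3 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops */
float dot(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];  /* -ffast-math permits reassociating this sum */
    return acc;
}
```

Without -ffast-math the strict left-to-right float summation order inhibits packed vectorization of the reduction.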
The more complex examples are in Section 5.2; see Figure 8. Granted, we would have liked to do more, but deadlines are deadlines...
Figure 7 shows Haskell-generated AVX instructions, albeit ones using only the lower 128 bits. That code would not run on an SSE4.2-capable Nehalem, for instance, since VEX-encoded instructions require AVX support.
There are some other slight CPU-related inaccuracies in the paper. Prefetching is repeatedly mentioned, even though its effect is negligible when the memory access pattern is perfectly linear (the hardware prefetcher already handles that case); unaligned loads are mentioned as a performance hit, but they are essentially free on the test processor (a 2600K, i.e. Sandy Bridge).
Matrix multiplication would perhaps be a better example to show the power of clever prefetching.
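To illustrate the point (this is a hypothetical sketch, not code from the paper): in a naive matrix multiply, the column-strided walk over the second operand is exactly the pattern the hardware prefetcher handles poorly, so a software hint such as GCC's __builtin_prefetch can pay off there, unlike in the linear dot-product loop.

```c
#include <stddef.h>

/* Hypothetical sketch: naive n x n matmul (row-major) with a software
 * prefetch on the column-strided accesses to b. __builtin_prefetch is a
 * GCC builtin; it is only a hint and never faults, even past the array. */
void matmul(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++) {
                /* hint a few iterations ahead along the strided column */
                __builtin_prefetch(&b[(k + 8) * n + j]);
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
}
```

(In practice one would block the loops for cache first; the prefetch hint is just the simplest way to show an access pattern where it is not redundant.)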
The Haskell assembly uses the 3-operand AVX instructions, but not the 256-bit YMM registers. The example is simple enough that the difference between 2-operand and 3-operand encodings may not matter, but in a more complex function with register spills it could. (-mavx should be enough to enable this.)
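The encoding difference is easiest to see from a single intrinsic (a hypothetical example, not taken from the paper): the same 128-bit operation compiles to the destructive 2-operand SSE form (`addps xmm0, xmm1`, clobbering a source) without -mavx, and to the non-destructive 3-operand VEX form (`vaddps xmm0, xmm0, xmm1`) with it. That extra freedom is what helps the register allocator once values start spilling.

```c
#include <xmmintrin.h>

/* Hypothetical 4-wide a*x + y. With gcc -O2 this lowers to mulps/addps
 * (2-operand, destructive); with gcc -O2 -mavx it lowers to
 * vmulps/vaddps (3-operand, non-destructive VEX encoding). */
__m128 axpy4(__m128 a, __m128 x, __m128 y) {
    return _mm_add_ps(_mm_mul_ps(a, x), y);
}
```

The computed result is identical either way; only the instruction encoding, and hence the pressure on the allocator, changes.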