
First author here. The dot product example was compiled with GCC 4.7.2 -O3 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops; see the caption to Figure 5. What compiler options would you have suggested? The Haskell version only used SSE instructions, not AVX; this should have been made clear in the paper.

The more complex examples are in Section 5.2; see Figure 8. Granted, we would have liked to have done more, but deadlines are deadlines...



Figure 7 shows Haskell-generated AVX instructions, albeit only using the lower 128 bits. That code would not run on an SSE4.2-capable (but AVX-less) Nehalem, for instance.

There are a few other slight CPU-related inaccuracies in the paper. Prefetching is repeatedly mentioned, even though its effect is negligible with a perfectly linear memory access pattern, where the hardware prefetcher already does the job; unaligned loads are cited as a performance hit, but they are essentially free on the test processor (Core i7-2600K, Sandy Bridge).

Matrix multiplication would perhaps be a better example to show the power of clever prefetching.


The Haskell assembly is using the three-operand AVX instructions, but not the 256-bit registers. The example is simple enough that the difference between two-operand and three-operand encodings may not matter, but in a more complex function with register spills it could. ("-mavx" should be enough to get them.)

All that said, it is still very impressive work!



