

Using SIMD for hardware acceleration - eerpini
http://krishnakanthmallikc.blogspot.com/2011/02/using-simd-for-hardware-acceleration.html

======
skylan_q
Great post. Vectorization is one of the easiest ways to increase per-thread
performance.
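
As a rough illustration of the kind of loop that vectorizes well, here is a minimal sketch in C. The function name `saxpy` and its parameters are my own choices for illustration, not taken from the linked post. The `restrict` qualifiers promise the compiler that `x` and `y` do not overlap, which is often what lets GCC or Clang auto-vectorize the loop at higher optimization levels (the exact flags depend on compiler and version).

```c
#include <stddef.h>

/* Hypothetical example: y[i] += a * x[i] for every i in [0, n).
 * Each iteration is independent and the accesses are contiguous,
 * so the compiler is free to process several elements per SIMD
 * instruction. `restrict` asserts that x and y do not alias. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Whether the loop actually gets vectorized is worth checking in the compiler's optimization report or the generated assembly rather than assuming.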

This gives me an excuse to post one of my favorite articles from Intel. It
shows that the biggest performance gains for the author's problem come from
optimizing memory accesses.

[http://software.intel.com/en-us/articles/superscalar-program...](http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-1/)

Seeing as memory read/write instructions make up about 40-50% of the x86 code
out there (from what I've heard), tuning memory accesses seems to be a great
way to improve performance.
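
To make the memory-access point concrete, here is a small sketch in C of one common trick for matrix multiply: reordering the loops. This is not necessarily what the linked article does, and the function names `matmul_ijk` and `matmul_ikj` are made up for illustration. Both assume square, row-major `n x n` matrices stored in flat arrays.

```c
#include <stddef.h>

/* Naive order: the inner loop walks down a column of b, so successive
 * reads of b are n elements apart and tend to miss the cache when n
 * is large. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Interchanged order: the inner loop walks along a row of b and a row
 * of c, so every access in the inner loop is contiguous. The result
 * is the same product; only the order of the additions changes. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t idx = 0; idx < n * n; ++idx)
        c[idx] = 0.0;

    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

The arithmetic is identical in both versions; the only difference is the order in which memory is touched. How much that helps depends on the matrix size and the cache hierarchy, so it is worth measuring on the target machine.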

~~~
eerpini
Yes, memory access patterns seem to be the most common bottleneck for
parallel code. I was implementing a parallel version of quicksort recently
and I have similar stories to tell. Optimize the code to avoid frequent cache
misses and you end up with a near-optimal speedup.

