

A curious SIMD assembly challenge: the zigzag - DarkShikari
http://x264dev.multimedia.cx/?p=232

======
alecco
What a great blog!

I'm a n00b wrt SIMD, but is it really better to do unaligned loads instead of
aligned loads + shift or shuffle?

According to latency tables on Core 2, loading 128 bits from memory to an SSE
register costs:

      * movdqa: latency of 3cy, reciprocal throughput 1cy
      * movdqu: latency of 3-8cy, reciprocal throughput 4cy

Also movdqu seems to have 9 µops vs. only 1 µop for movdqa. Wouldn't there be
bad throughput on those chunks of 4 movdqu?

[All this _only_ according to Intel Manuals and Agner's tables, i.e. no
testing]

How do you guys test implementation performance?

Thanks!

~~~
DarkShikari
movdqu is 2 issue cycles, not 4. What he may be alluding to is the huge cost
for loading across a cacheline, which is ~14 issue cycles (equivalent to an L1
cache miss). This doesn't apply to loads which are _exactly_ split across a
cacheline, i.e. 8 bytes on each side of the cacheline. You should be able to
use this information to figure out how unaligned loads are implemented on all
Intel chips.
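
To make the cacheline cases above concrete, here is a rough sketch in C that classifies a 16-byte load by its address. The 64-byte line size and the helper names (line_offset, crosses_cacheline, is_even_split) are assumptions for illustration only; nothing here measures actual cost.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes: 64-byte cache lines, 16-byte (128-bit) SSE loads. */
enum { CACHELINE = 64, VEC = 16 };

/* Offset of the load's first byte within its cache line. */
static unsigned line_offset(uintptr_t addr)
{
    return (unsigned)(addr % CACHELINE);
}

/* True if a 16-byte load starting at addr touches two cache lines. */
static bool crosses_cacheline(uintptr_t addr)
{
    return line_offset(addr) > CACHELINE - VEC;
}

/* True for the even split: 8 bytes on each side of the line boundary. */
static bool is_even_split(uintptr_t addr)
{
    return line_offset(addr) == CACHELINE - VEC / 2;
}
```

With these assumptions, a load at line offset 48 still fits in one line, offsets 49 through 63 straddle two lines, and offset 56 is the evenly split case.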

It _may_ be worthwhile for Core 2 to do a bunch of aligned loads and palignr
them together, but I didn't feel like testing, as it would certainly have been
slower on my i7. Patches welcome ;)
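
For anyone who wants to try the aligned-loads idea, a minimal sketch with C intrinsics might look like the following. It is not the x264 code: the fixed 4-byte offset and the function names are made up for illustration, and it uses SSE2 byte shifts plus an OR so it builds without SSSE3. On SSSE3 hardware the shift/shift/OR sequence should be replaceable by a single palignr, i.e. `_mm_alignr_epi8(hi, lo, 4)` from `<tmmintrin.h>`.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Fetch the 16 bytes starting 4 bytes past a 16-byte-aligned pointer,
 * using two aligned loads (movdqa) and byte shifts. The offset is fixed
 * at 4 because the shift counts must be compile-time constants. */
static __m128i load_offset4_aligned(const uint8_t *base)
{
    assert(((uintptr_t)base & 15) == 0);
    __m128i lo = _mm_load_si128((const __m128i *)base);        /* bytes 0..15  */
    __m128i hi = _mm_load_si128((const __m128i *)(base + 16)); /* bytes 16..31 */
    /* Drop the low 4 bytes of lo; move the low 4 bytes of hi to the top. */
    return _mm_or_si128(_mm_srli_si128(lo, 4), _mm_slli_si128(hi, 12));
}

/* Reference: the same 16 bytes via one unaligned load (movdqu). */
static __m128i load_offset4_unaligned(const uint8_t *base)
{
    return _mm_loadu_si128((const __m128i *)(base + 4));
}

/* Returns 1 if both approaches produce identical bytes. */
static int loads_agree(void)
{
    _Alignas(16) uint8_t buf[32];
    for (int i = 0; i < 32; i++)
        buf[i] = (uint8_t)i;
    __m128i a = load_offset4_aligned(buf);
    __m128i b = load_offset4_unaligned(buf);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

This only checks that the two approaches return the same data. Which one is faster depends on the CPU and would have to be measured.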

