movdqu is 2 issue cycles, not 4. What he may be alluding to is the huge cost of a load that crosses a cacheline boundary, which is ~14 issue cycles (equivalent to an L1 cache miss). This doesn't apply to loads that are split exactly in half across the boundary, i.e. 8 bytes on each side of it. You should be able to use this information to figure out how unaligned loads are implemented on all Intel chips.
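
To make the cases concrete, here is a rough sketch in C that classifies a 16-byte load by address. It assumes 64-byte cachelines, and the function names are made up for illustration. It only tells you which case an address falls into; it doesn't measure anything.

```c
#include <stdint.h>

/* Assumption: 64-byte cachelines. */
#define CACHELINE_BYTES 64u
/* A movdqu load is 16 bytes wide. */
#define LOAD_BYTES 16u

/* How many bytes of a 16-byte load at addr land in the first cacheline
 * it touches. A result of 16 means the load stays inside one line. */
static unsigned bytes_in_first_line(uintptr_t addr)
{
    unsigned offset = (unsigned)(addr % CACHELINE_BYTES);
    unsigned room = CACHELINE_BYTES - offset;
    return room < LOAD_BYTES ? room : LOAD_BYTES;
}

/* Nonzero if the load straddles a cacheline boundary. */
static int crosses_cacheline(uintptr_t addr)
{
    return bytes_in_first_line(addr) < LOAD_BYTES;
}

/* Nonzero if the load is split 8/8 across the boundary, the case
 * described above as not paying the large penalty. */
static int is_even_split(uintptr_t addr)
{
    return bytes_in_first_line(addr) == LOAD_BYTES / 2;
}
```

With these assumptions, a load at offset 48 within a line fits exactly, a load at offset 49 crosses the boundary, and a load at offset 56 is the 8/8 split.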

It may be worthwhile on Core 2 to do a bunch of aligned loads and palignr them together, but I didn't feel like testing it, as it would certainly have been slower on my i7. Patches welcome ;)