Another useful trick for reading the last few bytes, especially when the buffer cannot be padded and a full-width read would cross a memory page boundary: do an aligned load that stops at the page boundary, then use PSHUFB to shuffle the last few bytes into position in the register. This is safe, because an aligned load never crosses a page boundary, and it is fast.
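
To make that concrete, here is a minimal sketch in C with SSE intrinsics. It is one way to do it under stated assumptions, not a definitive implementation. The name `load_tail_ssse3` is hypothetical, `n` is assumed to be between 1 and 16, and the `target` attribute assumes GCC or Clang on x86. If the tail fits inside one aligned 16-byte block, it loads that block and lets PSHUFB shift the bytes down. If the tail spans two aligned blocks, both blocks contain valid bytes, so a plain unaligned load from `p` cannot reach an untouched page.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Hypothetical helper: return the n (1..16) bytes starting at p in the low
 * lanes of an XMM register, zero-filled above them, without touching any
 * page that holds no valid byte. */
__attribute__((target("ssse3")))
static __m128i load_tail_ssse3(const uint8_t *p, size_t n)
{
    const __m128i iota = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, 12, 13, 14, 15);
    /* 0xFF in every lane i >= n. PSHUFB writes zero to any lane whose
     * index byte has bit 7 set. */
    const __m128i drop = _mm_cmpgt_epi8(iota, _mm_set1_epi8((char)(n - 1)));
    const size_t off = (size_t)((uintptr_t)p & 15);

    if (off + n <= 16) {
        /* The tail fits in one aligned block: load the block, then move
         * bytes off..off+n-1 down to lanes 0..n-1. */
        __m128i block = _mm_load_si128((const __m128i *)(p - off));
        __m128i idx = _mm_add_epi8(iota, _mm_set1_epi8((char)off));
        return _mm_shuffle_epi8(block, _mm_or_si128(idx, drop));
    }

    /* The tail spans two aligned blocks, and both hold valid bytes, so the
     * 16 bytes at p stay inside mapped memory. Only the surplus lanes need
     * to be zeroed. */
    __m128i raw = _mm_loadu_si128((const __m128i *)p);
    return _mm_shuffle_epi8(raw, _mm_or_si128(iota, drop));
}
```

Note that the safety argument is about hardware page protection. The loads still read bytes outside the buffer, which the C standard does not define and which tools such as AddressSanitizer or Valgrind will likely report.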