On all recent Intel Core-i microarchitectures you need full-vector-width loads to max out L1d bandwidth. The load/store units don't actually care about the width of a load: a narrow load occupies a load port for a cycle just like a full-width one does, as long as it doesn't cross a cache-line boundary (in which case the typical penalty is one additional cycle).
Using only 128-bit instructions on a core that has 512-bit hardware therefore gets you just a quarter of the achievable L1d bandwidth.
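To put rough numbers on that, here is a back-of-the-envelope sketch of the arithmetic. It assumes a simplified model in which each load port completes one load per cycle regardless of width, and it assumes 2 loads per cycle; that figure and the function names are illustrative assumptions, not something stated above. Cache-line splits are ignored.

```python
def peak_l1d_bytes_per_cycle(load_width_bits: int, loads_per_cycle: int = 2) -> int:
    """Peak L1d load bandwidth in bytes/cycle under the simplified model.

    loads_per_cycle=2 is an assumed value for illustration; check the
    figure for the specific microarchitecture you care about.
    """
    # Each load moves load_width_bits / 8 bytes, and the number of loads
    # per cycle does not depend on how wide each load is.
    return loads_per_cycle * load_width_bits // 8


def fraction_of_peak(used_width_bits: int, hardware_width_bits: int) -> float:
    """Fraction of peak bandwidth reached when using narrower loads."""
    return used_width_bits / hardware_width_bits


if __name__ == "__main__":
    for width in (128, 256, 512):
        print(f"{width}-bit loads: {peak_l1d_bytes_per_cycle(width)} bytes/cycle")
    print(f"128-bit on 512-bit hardware: {fraction_of_peak(128, 512):.2f} of peak")
```

Under these assumptions the loads-per-cycle term is constant, so bandwidth scales linearly with load width, which is where the factor of 4 between 128-bit and 512-bit loads comes from.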