Using intrinsics leaves a lot of performance on the table if there is any register contention. Even Intel's compiler doesn't do that good a job. I've seen gains of over 2x by hand coding.
Sure, especially with some operations, e.g. code that relies on "shuffle" operations [1], which take constants that program the byte reordering. Most compilers have problems with complex functions, where register pressure causes some of the "shuffle" constants to be moved in and out of registers (even when using the "register" keyword to 'help' the compiler).
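To make the shuffle case concrete, here is a minimal sketch in C, assuming an x86 target with SSSE3 and a GCC- or Clang-style compiler (the `target` attribute is specific to those). The function name `bswap32_buffer` and the chosen mask are purely illustrative, not taken from any particular codebase. It shows the kind of constant meant above: a 16-byte control mask that tells `_mm_shuffle_epi8` (the PSHUFB instruction) where each output byte comes from.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Hypothetical helper: reverse the byte order inside every 32-bit word.
 * Assumes n_bytes is a multiple of 16. */
__attribute__((target("ssse3")))
void bswap32_buffer(uint8_t *dst, const uint8_t *src, size_t n_bytes)
{
    /* Shuffle control constant: output byte i is taken from input byte mask[i].
     * Ideally this stays in one XMM register for the whole loop. */
    const __m128i mask = _mm_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4,
                                       11, 10, 9, 8, 15, 14, 13, 12);

    for (size_t i = 0; i < n_bytes; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        v = _mm_shuffle_epi8(v, mask);  /* byte reordering driven by the mask */
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
}
```

In a loop this small the mask should have no trouble staying in a register. The problem described above appears in larger functions that need several such masks plus live data at once: x86-64 has 16 XMM registers (8 in 32-bit mode), so the compiler may end up reloading masks from memory inside the hot loop, which is where hand-written assembly can do better.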