Interesting. Is there a reason for this? I was under the impression that volatile only required that the accesses actually happen, not that the accesses had to happen in a manner considered "boring". Is the issue that volatile is also demanding that the ordering remain consistent, and the SSE instruction is not capable of guaranteeing that?
(edit:) In fact, that instruction, and a small handful of others (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) do seem to cause re-orderings. On x86, at least, any other form of optimization should continue to be allowed (involving cache-lines, etc.), but you are definitely right: this instruction's usage would not be. :(
The easy way of reasoning about what optimizations the compiler can do with a volatile location is to think "If this were actually a memory-mapped IO port, would this compiler optimization change the observed behaviour".
One problem is that volatile is in practice often used for writing to memory mapped ports. I suspect in that situation using multi-memory address instructions might lead to pain. Of course in x64 such things might be less common / not make sense, but in general if you say volatile you are saying "do every read and write I tell you to, in the order I tell you to".