> The only solution which comes to my mind is to use 'volatile' for the memory access, but that will never be fast.
When you wipe memory for cryptographic purposes you are already insisting that the writes actually happen, so volatile will not cost you anything you were not prepared to pay. (To be clear, you would of course not use the memory through a volatile-qualified pointer in normal operation: you would add that qualifier only when you went to wipe it.)
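For what it's worth, here is a minimal sketch of that idea in C. The name `secure_wipe` is made up for illustration, and the sketch assumes the compiler treats stores through a volatile-qualified pointer as real volatile accesses even though the underlying buffer was not declared volatile. Mainstream compilers do this in practice as far as I know, but it is an assumption rather than something I would call guaranteed.

```c
#include <stddef.h>

/* Hypothetical helper: zero a buffer using volatile stores so the
 * compiler cannot drop the writes as dead stores. The buffer itself
 * is used as ordinary (non-volatile) memory everywhere else. */
static void secure_wipe(void *buf, size_t len)
{
    /* The volatile qualifier is added only here, at wipe time. */
    volatile unsigned char *p = (volatile unsigned char *)buf;

    while (len--) {
        *p++ = 0;
    }
}
```

The byte-at-a-time loop is deliberately boring: the point is that every store has to be emitted, not that it is the fastest way to clear memory.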
Interesting. Is there a reason for this? I was under the impression that volatile only required that the accesses actually happen, not that the accesses had to happen in a manner considered "boring". Is the issue that volatile is also demanding that the ordering remain consistent, and the SSE instruction is not capable of guaranteeing that?
(edit:) In fact, that instruction and a small handful of others (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) do seem to allow re-ordering: they are non-temporal stores, which are weakly ordered. On x86, at least, any other form of optimization (involving cache lines, etc.) should continue to be allowed, but you are definitely right: using this instruction would not be. :(
The easy way of reasoning about what optimizations the compiler can do with a volatile location is to ask: "If this were actually a memory-mapped IO port, would this compiler optimization change the observed behaviour?"
One problem is that volatile is in practice often used for writing to memory-mapped ports. I suspect that in that situation, using instructions that touch several memory addresses at once might lead to pain. Of course on x64 such things might be less common or not make sense, but in general if you say volatile you are saying "do every read and write I tell you to, in the order I tell you to".
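To make the memory-mapped IO way of thinking concrete, here is a small sketch. Everything in it is hypothetical: `fake_ctrl_reg` is an ordinary variable standing in for a device register, and `pulse_reset` is an invented name. On real hardware the pointer would refer to an address taken from the device's documentation.

```c
#include <stdint.h>

/* Hypothetical stand-in for a device control register. */
static volatile uint32_t fake_ctrl_reg;

/* Pulse a reset line: write 1, then write 0, then read the register back.
 * Without volatile, the compiler could drop the first store as dead and
 * fold the read into a constant. With volatile, both stores and the load
 * must be emitted, in this order. */
static uint32_t pulse_reset(volatile uint32_t *reg)
{
    *reg = 1u;   /* assert reset  */
    *reg = 0u;   /* release reset */
    return *reg; /* read back the current value */
}
```

If the register were a real device, merging the two stores or reordering them would change what the hardware sees, which is exactly the kind of optimization volatile rules out.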