
What does __asm__ __volatile__ do in C? - peter_d_sherman
https://stackoverflow.com/questions/26456510/what-does-asm-volatile-do-in-c
======
peter_d_sherman
Excerpt:

"The __volatile__ modifier on an __asm__ block forces the compiler's optimizer
to keep the code as-is. Without it, the optimizer may decide the block can be
either removed outright, or lifted out of a loop and cached.

This is useful for the rdtsc instruction like so:

__asm__ __volatile__("rdtsc" : "=a" (a), "=d" (d));

This asm statement has no input operands, so the compiler might assume the
value can be cached. Volatile is used to force it to read a fresh timestamp.
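
To make the two outputs easier to use, they can be combined into one 64-bit value. This is only a minimal sketch, assuming GCC or Clang on x86; the helper name read_tsc is hypothetical and not part of the quoted answer.

```c
#include <stdint.h>

/* Hypothetical helper (not from the quoted answer): returns the x86
   time-stamp counter as a single 64-bit value. */
static inline uint64_t read_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    /* rdtsc puts the low 32 bits in EAX ("=a") and the high 32 bits in
       EDX ("=d"). __volatile__ stops the compiler from reusing the
       result of an earlier, identical asm statement. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0; /* placeholder so the sketch still compiles on non-x86 targets */
#endif
}
```

Two back-to-back calls should then return two separate readings rather than one cached value.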

When used alone, like this:

__asm__ __volatile__ ("");

It will not actually execute anything. You can extend this, though, to get a
compile-time memory barrier that won't allow reordering any memory access
instructions:

__asm__ __volatile__ ("" ::: "memory");
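
A small sketch of how such a barrier might be used, assuming GCC or Clang. The macro name COMPILER_BARRIER, the two globals, and the function publish are hypothetical names made up for illustration.

```c
/* Hypothetical macro name wrapping the empty asm statement with a
   "memory" clobber. It emits no instruction. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

int payload; /* illustrative shared data */
int ready;   /* illustrative flag */

/* Store the payload, then set the flag. The barrier asks the compiler
   not to move memory accesses across it, so the store to payload is
   emitted before the store to ready. It does not constrain the CPU. */
void publish(int value)
{
    payload = value;
    COMPILER_BARRIER();
    ready = 1;
}
```

Because nothing is emitted for the barrier, this only constrains the compiler; ordering as seen by other cores would still need a CPU-level fence or atomics.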

The rdtsc instruction is a good example for volatile. rdtsc is usually used
when you need to time how long some instructions take to execute. Imagine some
code like this, where you want to time r1 and r2's execution:

__asm__ ("rdtsc" : "=a" (a0), "=d" (d0));

r1 = x1 + y1;

__asm__ ("rdtsc" : "=a" (a1), "=d" (d1));

r2 = x2 + y2;

__asm__ ("rdtsc" : "=a" (a2), "=d" (d2));

Here the compiler is actually allowed to cache the timestamp, and valid output
might show that each line took exactly 0 clocks to execute. Obviously this
isn't what you want, so you introduce __volatile__ to prevent caching:

__asm__ __volatile__("rdtsc" : "=a" (a0), "=d" (d0));

r1 = x1 + y1;

__asm__ __volatile__("rdtsc" : "=a" (a1), "=d" (d1));

r2 = x2 + y2;

__asm__ __volatile__("rdtsc" : "=a" (a2), "=d" (d2));

Now you'll get a new timestamp each time, but there is still a problem: both
the compiler and the CPU are allowed to reorder all of these statements. The
asm blocks could end up executing after r1 and r2 have already been
calculated. To work around this, you'd add some barriers that force
serialization:

__asm__ __volatile__("mfence;rdtsc" : "=a" (a0), "=d" (d0) :: "memory");

r1 = x1 + y1;

__asm__ __volatile__("mfence;rdtsc" : "=a" (a1), "=d" (d1) :: "memory");

r2 = x2 + y2;

__asm__ __volatile__("mfence;rdtsc" : "=a" (a2), "=d" (d2) :: "memory");

Note the mfence instruction here, which enforces a CPU-side barrier, and the
"memory" clobber in the asm block, which enforces a compile-time
barrier. On modern CPUs, you can replace mfence;rdtsc with rdtscp for
something more efficient."
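
For anyone wanting to try the rdtscp variant, here is a rough sketch, assuming GCC or Clang on an x86 CPU that supports rdtscp. The names read_tscp and ticks_for_add are hypothetical. One detail the excerpt does not mention: rdtscp also writes ECX, so that register has to be listed as an output or clobber.

```c
#include <stdint.h>

/* Hypothetical helper: read the time-stamp counter with rdtscp. */
static inline uint64_t read_tscp(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi, aux;
    /* rdtscp writes EAX, EDX and also ECX (the IA32_TSC_AUX value), so
       ECX is declared as an output. "memory" is the compile-time barrier. */
    __asm__ __volatile__("rdtscp"
                         : "=a"(lo), "=d"(hi), "=c"(aux)
                         :
                         : "memory");
    (void)aux; /* unused in this sketch */
    return ((uint64_t)hi << 32) | lo;
#else
    return 0; /* placeholder so the sketch still compiles on non-x86 targets */
#endif
}

/* Hypothetical usage: TSC ticks elapsed around one addition. The store to
   *sum stays between the two reads because of the "memory" clobber; the
   register-only add itself may still be scheduled elsewhere. */
static uint64_t ticks_for_add(int x, int y, int *sum)
{
    uint64_t start = read_tscp();
    *sum = x + y;
    uint64_t end = read_tscp();
    return end - start;
}
```

rdtscp waits for earlier instructions to finish but does not hold back later ones, so careful benchmarks often add a fence after it as well. The tick count for a single addition will mostly reflect the cost of the two reads themselves.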

