
CPU Introspection: Intel Load Port Snooping - matt_d
https://gamozolabs.github.io/metrology/2019/12/30/load-port-monitor.html
======
drinfinity
I find it interesting that somebody dedicates his or her life to figuring out
how these CPUs work when exactly that information is just lying around in some
vault in Santa Clara.

~~~
shaklee3
If we're talking about dedicating their life, that's likely to be Agner Fog:
[https://www.agner.org/optimize/](https://www.agner.org/optimize/)

He's put out the most detailed third-party documentation on Intel and AMD
processors that I've ever seen.

------
vardump
I think I'll disable hyperthreading [0]. :-)

This interesting novel technique [1] can provide a unique window into the CPU
core black box, helping us better understand how the CPU works internally when
it comes to otherwise invisible loads and stores.

While I can't think of any scenario immediately, intuitively I feel this could
be useful for those of us seeking to squeeze everything out of a system.

Not the particular case about low level TLB miss / page walk mechanics (we
already know TLB misses are bad), but perhaps there are other situations where
existing performance counters don't provide as detailed information.

[0]: Yes, I'm aware INVLPG (Invalidate TLB Entry) used in this blog post is a
privileged ring 0 instruction. But there's clearly a leak regardless.

[1]: Using one hyperthread to "spy" on the other hyperthread on the same core.
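Setting up the co-resident arrangement is straightforward on Linux, where sysfs
exposes which logical CPUs share a physical core. A minimal sketch (the helper
names are mine, not from the blog post) that parses the
`thread_siblings_list` format so a "spy" and "victim" thread can each be pinned
onto sibling hyperthreads:

```python
def parse_siblings(s):
    """Parse a /sys/.../topology/thread_siblings_list string such as
    '0,4' or '0-1' into a sorted list of logical CPU ids."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def sibling_of(cpu=0):
    """Return the hyperthread sibling of `cpu`, or None if SMT is off."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        siblings = parse_siblings(f.read())
    others = [c for c in siblings if c != cpu]
    return others[0] if others else None
```

Each thread can then pin itself with `os.sched_setaffinity(0, {cpu})` so the
two stay co-resident on one core for the duration of the measurement.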

~~~
gamozolabs
Hehe, hyperthreading has some issues. This issue technically works
single-threaded, but it's hard for sensitive data to survive a context switch. That
being said, this issue is mitigated in all common OSes and latest microcode.

I'll be curious as to what there is to learn from this. It's more of a
longshot goal for me: learn how things work, develop accurate uarch models,
and then learn more from those models than I could by guess-and-check against
hardware results.

Hard to say if it'll go well....

~~~
vardump
> That being said, this issue is mitigated in all common OSes and latest
> microcode.

Emphasis on _this issue_, eh?

> I'll be curious as to what there is to learn from this...

I'm also very interested to see what you and the community can discover using
this trick!

~~~
gamozolabs
I'll eat my hat for this, but effectively the mitigation to this is clearing
all caches and internal buffers in the CPU on each context switch.

I'm sure we'll see more types of leaks, but unless they're actively fetching
invalid data [1], there isn't much sensitive data to leak anymore.

I don't think there is much during speculation that can load _new_ data during
that window.

[1]: So far almost every CPU bug has leaked something in an internal cache.

~~~
CalChris
A _machine clear_ clears the pipeline. Does it clear these internal caches?
There is, of course, no machine clear instruction. Could you construct a
machine clearing sequence, insert it into the context switch code and test
your hypothesis?

~~~
gamozolabs
The legacy `verw` instruction has been extended via microcode to flush
internal buffers (load buffers, store buffers, etc.). Any serializing
instruction should (hopefully) cause a pipeline flush. This is the mitigation
Intel made available to OS developers and should be what is being used.

------
herendin2
I read this. It's quite interesting, but I think it's short a few clear
definitions of basic terms.

Could anyone help explain, in a few sentences, what is a Load Port, and why is
it interesting in this context?

It appears to be some type of indicator of the proportional time slice given
to certain opaque internal processes which are not normally visible to users.

~~~
bertr4nd
CPUs issue instructions through a handful of execution ports, each of which
services a set of instruction types (e.g., arithmetic, memory loads/stores,
vector instructions). For example, here are Skylake's ports:
[https://en.wikichip.org/wiki/intel/microarchitectures/skylak...](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_\(server\)#Scheduler_Ports_.26_Execution_Units)

So getting a trace from the load ports is basically a trace of all memory
accesses in the system. Something particularly cool about this work is that
you can even see loads that are hidden from software, like a hardware page
table walk.

------
naveen99
I don't understand why Intel and AMD can't give us access to the CPU cache the
same way Nvidia does with CUDA for local and shared memory on the GPU. I just
don't buy the hand-waving that the CPU can magically do branch prediction
better than a programmer with a static C compiler who actually knows what they
want in the future. Maybe if they had offered cache control on the Itanium or
Phi, they wouldn't have had to cancel them; the need to reprogram user
software didn't stop CUDA.

~~~
loa_in_
I agree that we should be able to tell the processor which branch is more
likely. Even something as simple as a flag to select between "I prefer you
take any jump you encounter", "I prefer you skip all jumps in this macroblock
for speculation purposes", and "try to predict smartly" (i.e. how it works
now) would give a determined programmer everything they need to make sure
their program is executed optimally.

~~~
gamozolabs
Back in Pentium 4 days you could use the DS and CS override prefixes on
conditional branches to hint taken and not taken, respectively.

Kinda neat, but it's not a thing anymore.

Some more info here:
[https://stackoverflow.com/questions/14332848/intel-x86-0x2e-...](https://stackoverflow.com/questions/14332848/intel-x86-0x2e-0x3e-prefix-branch-prediction-actually-used)

