Parallel Thread Execution. Think of its instructions as opcodes for NVIDIA GPUs.
They are a bit more complex than your traditional CPU opcodes (the lowest level of abstraction accessible to users), as you can specify cache behavior, memory barriers, etc.
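A minimal sketch (mine, not DeepSeek's code) of what that looks like in practice: CUDA C++ lets you drop to PTX through inline asm, and that is where these extra knobs live. ld.global.cg (cache in L2 only, bypassing L1) and membar.gl (GPU-scope memory barrier) are both documented; the kernel and variable names are my own.

```cuda
#include <cstdio>

__global__ void ptx_demo(const float* in, float* out) {
    float v;
    // Documented cache operator: cache the load in L2 only, bypassing L1.
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(in));
    // Documented memory barrier at GPU (device) scope.
    asm volatile("membar.gl;");
    *out = v * 2.0f;
}

int main() {
    float *in, *out, h = 21.0f;
    cudaMalloc(&in, sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemcpy(in, &h, sizeof(float), cudaMemcpyHostToDevice);
    ptx_demo<<<1, 1>>>(in, out);
    cudaMemcpy(&h, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h);  // 42.0
    return 0;
}
```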
There are documented combinations of parameters for those instructions, but if you fuzz (search for new combinations, randomly or systematically, hoping some will work the way you want) you can find undocumented ones with unexpected effects or with advantages of various kinds (not polluting caches, speed, ...). A sketch of that idea follows below.
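Hedged sketch of that fuzzing loop, under my own assumptions (DeepSeek hasn't published how they searched): enumerate modifier combinations, wrap each in a tiny PTX kernel, and ask ptxas whether it assembles for a given arch. The modifier lists and file name are illustrative; a real search would also run the survivors and measure behavior.

```cuda
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main() {
    // Candidate modifiers; some combinations are documented, others are not.
    std::vector<std::string> l1 = {"", ".L1::no_allocate", ".L1::evict_last"};
    std::vector<std::string> l2 = {"", ".L2::64B", ".L2::128B", ".L2::256B"};
    for (const auto& a : l1) {
        for (const auto& b : l2) {
            std::string op = "ld.global.nc" + a + b;
            // A minimal PTX module containing one load with the candidate op.
            std::string ptx =
                ".version 8.0\n.target sm_90\n.address_size 64\n"
                ".visible .entry k(.param .u64 p) {\n"
                "  .reg .b64 %rd<2>;\n  .reg .f32 %f<2>;\n"
                "  ld.param.u64 %rd1, [p];\n"
                "  " + op + ".f32 %f1, [%rd1];\n"
                "  ret;\n}\n";
            FILE* f = fopen("cand.ptx", "w");
            fputs(ptx.c_str(), f);
            fclose(f);
            // Does ptxas accept it for Hopper? Assembling is only step one;
            // you still have to check what it does and how fast it is.
            int rc = system("ptxas -arch=sm_90 cand.ptx -o /dev/null 2>/dev/null");
            printf("%-45s %s\n", op.c_str(), rc == 0 ? "assembles" : "rejected");
        }
    }
    return 0;
}
```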
That is the case, for example, for ld.global.nc.L1::no_allocate.L2::256B, which DeepSeek uses: it provides significant acceleration while being reliable (although it does not work on all architectures, so they have a way to disable it).
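For illustration, here is one way such an instruction can be wrapped with an opt-out, given that it doesn't work everywhere. This is my own sketch: the ENABLE_NC_NO_ALLOCATE macro and the Hopper-only guard are assumptions, not DeepSeek's actual code.

```cuda
// Non-coherent load that asks L1 not to allocate a line and hints a 256B
// L2 prefetch, falling back to a plain read-only-cache load elsewhere.
__device__ __forceinline__ float ld_nc_no_allocate(const float* p) {
#if defined(ENABLE_NC_NO_ALLOCATE) && (__CUDA_ARCH__ >= 900)  // hypothetical opt-in
    float v;
    asm volatile("ld.global.nc.L1::no_allocate.L2::256B.f32 %0, [%1];"
                 : "=f"(v) : "l"(p));
    return v;
#else
    return __ldg(p);  // portable fallback via the read-only data cache
#endif
}
```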
Gonna check what SASS it gets translated to and whether it makes any sense.
I wonder if they have a SASS assembler for Hopper (built either by reverse-engineering nvdisasm or by fuzzing instructions + nvdisasm + staring hard) and don't want to say it out loud :p
Much of the hype around DeepSeek is due to their extraordinarily low training and inference costs. They achieved this by optimizing their training code, apparently using PTX in addition to CUDA. PTX is kind of an intermediate assembly language for NVIDIA GPUs and people are eager to see how it was used.