
Intel Processor Graphics Gen11 Architecture [pdf] - ingve
https://software.intel.com/sites/default/files/managed/db/88/The-Architecture-of-Intel-Processor-Graphics-Gen11_R1new.pdf
======
dragontamer
Major thoughts that come to my mind:

* Shared / Local memory is moved from L3 cache (in Gen9) to inside of a subslice (I presume L2 or L1 speeds?? Hard to tell... from this overview). SLM is now independent from the cache-structure, so it works far more similarly to AMD / NVidia GPUs. SLM can be used for fast gather/scatter operations without going through "The Dataport" (Intel's load/store to L3).

* Bunch of DX12 features -- Hard for me to fully understand, since I'm not really a graphics programmer. Coarse Pixel Shading looks cool, and should reduce bandwidth to DDR4 (a far bigger deal on iGPUs which are very bandwidth limited).

* Seems to be 4-levels of cache? L3 is in the iGPU, but the L3 of the CPU is called "LLC" (last-level cache). So it looks like iGPU L4 coincides with the CPU's L3 cache? Am I reading this correctly?

~~~
baybal2
In my opinion, a laggy (high-latency) cache is nowhere near as scary on a GPU
as it is on a CPU. GPU caches are said to be all about prefetch logic.

~~~
dragontamer
I agree in principle, but SLM isn't a "cache", or at least its equivalent in
AMD and NVidia GPUs isn't. AMD LDS and NVidia Shared Memory are extremely
fast. The best-case throughput is one AMD LDS operation every 2 clock cycles.

That's 64 threads each performing a 32-bit load or store (64x32-bit
load/store operations) in just 2 clock cycles (assuming the ideal case with
no bank conflicts)... per AMD compute unit. You ain't got nothing on the power
of this gather/scatter to AMD LDS. We're talking ~TBps bandwidth here, not
GBps... with incredibly low latency guarantees.
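
To put rough numbers on that claim, here is a minimal back-of-envelope sketch. The 64 lanes, 32-bit accesses and 2 cycles per operation come from the comment above; the compute-unit count and clock speed are illustrative assumptions (roughly a Vega 64 class part), not measurements.

```python
# Back-of-envelope peak LDS bandwidth with no bank conflicts.
# compute_units and clock_hz are assumed values for illustration only.

def lds_bandwidth_bytes_per_sec(lanes=64, bytes_per_lane=4, cycles_per_op=2,
                                compute_units=64, clock_hz=1.5e9):
    """Peak bytes per second moved through LDS across the whole GPU."""
    bytes_per_cycle_per_cu = lanes * bytes_per_lane / cycles_per_op  # 128 B/clk
    return bytes_per_cycle_per_cu * compute_units * clock_hz

per_cu = lds_bandwidth_bytes_per_sec(compute_units=1)
total = lds_bandwidth_bytes_per_sec()
print(f"per CU: {per_cu / 1e9:.0f} GB/s, whole GPU: {total / 1e12:.1f} TB/s")
# prints "per CU: 192 GB/s, whole GPU: 12.3 TB/s"
```

Under those assumptions a single compute unit is already in the hundreds of GB/s, and the aggregate lands in the TB/s range.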

My understanding is that NVidia would perform this operation on its shared
memory at a similar speed (under similar "no bank conflicts" situations) per
SM. NVidia shares its "shared memory" with L1 cache (so yeah, the
architecture is different from AMD's), but its performance is also huge and
measured in ~TBps.
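
The "no bank conflicts" caveat can be illustrated with a simplified model. This is a sketch, assuming 32 banks that are each one 32-bit word wide and that lanes reading the same word are served by a single broadcast; real hardware has more rules than this, and `conflict_degree` is a hypothetical helper, not a vendor API.

```python
from collections import Counter

def conflict_degree(addresses, num_banks=32, word_bytes=4):
    """Largest number of distinct words that land in the same bank.

    1 means conflict-free; N means the access is serialized into N passes.
    Lanes touching the same word are counted once (treated as a broadcast).
    """
    words = {addr // word_bytes for addr in addresses}
    per_bank = Counter(word % num_banks for word in words)
    return max(per_bank.values())

lanes = range(32)
unit_stride = [4 * lane for lane in lanes]    # consecutive 32-bit words
stride_two = [8 * lane for lane in lanes]     # every other word
stride_32 = [128 * lane for lane in lanes]    # every lane hits bank 0

print(conflict_degree(unit_stride),
      conflict_degree(stride_two),
      conflict_degree(stride_32))
# prints "1 2 32"
```

In this model a stride of 32 words turns one fast access into 32 serialized ones, which is why the peak numbers only hold for well-behaved access patterns.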

Note that this gather/scatter is synchronized between all 64 threads of an AMD
wavefront, or all 32 threads of an NVidia warp. As such, you have ~2 cycles of
latency between thread communications in the ideal case.

As such, SLM isn't "cache". It's the simplest (and 2nd fastest) way to share
data between threads in a workgroup (there are harder-to-use "shuffle"
intrinsics available, if you have static thread communications, such as those
found in a sorting network... or in a "reduce" operation). And Intel's
implementation of it in Gen9 was far slower than its competitors'. Intel is at
least claiming that they fixed this problem in Gen11.
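
For readers who haven't seen the "shuffle" style of reduce, here is a plain-Python simulation of the idea. It is only a sketch: `shuffle_down` is a hypothetical helper loosely modelled on CUDA's `__shfl_down_sync`, and the list stands in for the registers of the lanes in one warp.

```python
def shuffle_down(values, delta):
    """Lane i reads the value held by lane i + delta.

    A lane whose source would be out of range keeps its own value.
    """
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]

def warp_reduce_sum(values):
    """Sum across lanes in log2(n) shuffle steps; the total ends up in lane 0."""
    vals = list(values)
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    delta = n // 2
    while delta >= 1:
        shifted = shuffle_down(vals, delta)
        vals = [a + b for a, b in zip(vals, shifted)]
        delta //= 2
    return vals[0]

print(warp_reduce_sum(range(32)))  # prints 496
```

The communication pattern is fixed ahead of time (16, 8, 4, 2, 1 lanes away), which is what makes it a fit for shuffles rather than shared memory. Only lane 0 is guaranteed to hold the full sum at the end.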

I'm cautiously optimistic: 3rd party benchmarks on Gen11 iGPUs would be nice
to confirm if they really sped up SLM as much as it looks like.

---------

EDIT: Now that I think of it, maybe you were replying to my Point #3. In which
case... yeah, I agree with you. It's hard to tell what part of my post you
were addressing though.

~~~
baybal2
I was thinking that cache lookups were not their strong suit, since they have
a fancy 4+ level cache. This is what I would've expected in a regular ARM SoC.

The more complex the cache architecture, the laggier your main RAM has to be
to justify it. That's what I thought.

------
jdashg
Coarse Pixel Shading is a cool smarter-not-harder approach for filling high-
dpi displays on lower-horsepower iGPUs. The section on it shows comparisons,
and it does look useful, though it requires integration by applications.
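
A rough feel for the savings, as a sketch only: this counts pixel-shader invocations for one full-screen pass, assuming a 2x2 coarse pixel rate and ignoring triangle edges, overdraw and MSAA. `shader_invocations` is a hypothetical helper, not part of any graphics API.

```python
import math

def shader_invocations(width, height, coarse_w=1, coarse_h=1):
    """Invocations for a full-screen pass where each coarse_w x coarse_h
    block of pixels shares one shading result."""
    return math.ceil(width / coarse_w) * math.ceil(height / coarse_h)

full = shader_invocations(3840, 2160)
coarse = shader_invocations(3840, 2160, coarse_w=2, coarse_h=2)
print(full, coarse, full / coarse)
# prints "8294400 2073600 4.0"
```

At 4K, a 2x2 rate shades about as many samples as a 1080p frame would, while still rasterizing at full resolution.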

Adaptive Sync, woo! That's going to make it so much nicer on anything that
misses the 60Hz budget, which is going to be more likely on these weaker
iGPUs.
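
To see why missing the budget hurts more without it, here is a simplified timing model. It is a sketch, assuming a steady render time, vsync on the fixed-refresh path, and a panel whose adaptive range covers the frame time (the lower bound of the panel's range is ignored).

```python
import math

def frame_interval_ms(render_ms, refresh_hz=60.0, adaptive=False):
    """Time between displayed frames for a steady per-frame render time."""
    period = 1000.0 / refresh_hz
    if adaptive:
        # The panel waits for the frame, but cannot refresh faster than its max rate.
        return max(render_ms, period)
    # Fixed refresh with vsync: the frame is held until the next vblank.
    return math.ceil(render_ms / period) * period

fixed = frame_interval_ms(20.0)
synced = frame_interval_ms(20.0, adaptive=True)
print(f"{fixed:.1f} ms vs {synced:.1f} ms")
# prints "33.3 ms vs 20.0 ms"
```

In this model a 20 ms frame drops a fixed 60Hz display to an effective 30 fps, while an adaptive display shows it at 50 fps.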

For compressed texture formats, they call out BC and ETC/EAC, but I'm worried
that they didn't include ASTC, which is The New Hotness. Compressed texture
format fragmentation is a tough issue to deal with and to solve, but the
sooner we can standardize on a common good format, the better.

Dedicated per-subslice Shared Local Memory (previously part of L3) is going to
see use, for sure. 64KB with 1-byte alignment sounds great.

Transparent dynamic lossless compression is adding sRGB support, which I'm
really surprised wasn't already happening.

~~~
ksec
>but I'm worried that they didn't include ASTC

ASTC has been a standard part of the iGPU since Skylake (Gen 9), so it should
still be there unless they deliberately took it out.

~~~
jdashg
I thought it might, but didn't remember. That's reassuring! It was worrisome
to see it left out.

------
ngneer
How come GPU instruction set architectures are relatively under-documented
compared to x86? If anything, better results can be attained if more is known:

https://research.google.com/pubs/archive/45226.pdf

~~~
monocasa
Because they don't want to be tied to an ISA and its backwards compatibility
requirements. They've heavily changed the underlying ISA many times in the
past twenty years or so, even shifting paradigms. They've gone from horizontal
microcode, to VLIW, to RISC-esque SIMD, and want the freedom to keep changing
it up.

~~~
ChuckMcM
I am guessing that is a big part of it; avoiding patent litigation is another.
There are a zillion patents around GPUs, and it is difficult to prove that
someone is using one without more information about the internals. Back when I
signed the S3 NDA/License agreement to get access to their underlying
architecture, there were specific clauses in that agreement that said I would
agree to indemnify them if anyone brought patent litigation against them as
a result of information I disclosed. It was unusual enough for me to
remember it all these years later.

------
baybal2
Very surprised to see such a detailed description made public.

I know that Intel's fab people are the only ones in the industry who are
public about their current process in academic publications.

TSMC's process engineers, on the other hand, jokingly call the famously
paranoid TSMC NDA an "omerta".

~~~
godelmachine
May I ask what’s an Omerta?

~~~
coldtea
A "code of silence" or an "enforced silence" about a subject.

Originally from the Mafia, which had strict rules about what could be spoken
about, and about not talking to the police and such (not just for Mafia
members, but for everybody in their region).

