
Intel Xe-HP Graphics: Early Samples Offer 42 TFLOPs of FP32 Performance - rbanffy
https://www.anandtech.com/show/16018/intel-xe-hp-graphics-early-samples-offer-42-tflops-of-fp32-performance
======
aidenn0
I would buy a discrete GPU from Intel without seeing any benchmarks. The only
system I own in which desktop compositing actually works on Linux is the Intel
one.

Both AMD and Nvidia drivers are dumpster fires in terms of stability (and for
Nvidia, I've tried both nouveau and the binary drivers).

~~~
winter_blue
What problems have you experienced with AMD drivers?

I'm asking because Nvidia has been an absolute pain on my laptop (e.g. sleep
is terribly broken), and I'm considering a switch over to the Ryzen 4700U,
which has pretty powerful integrated Radeon graphics.

So I'm trying to decide if I should switch to the i7-1065G7 instead, which
has good integrated (Intel) graphics.

This is a really important question to me; I would appreciate any answers.

~~~
Athas
I have been running open source AMD drivers on my desktop since I got it
(early 2018, Vega 64 GPU). They work well, and after so many years, it felt
like a revelation to have problem-free high-performance 3D acceleration after
a fresh installation of a default Linux kernel. The only place where AMD is
still wonky is when you want to do GPGPU, but that is mostly down to AMD's
byzantine and schizophrenic software strategy (just try to pin down which
parts of ROCm you need, or what they do). You don't have to worry about that
for graphics, though, as the amdgpu driver is in the kernel, and the default
Mesa OpenGL works perfectly with it.

------
aspaceman
Much more interested in architecture design and memory hierarchy than flops.
Anything interesting going on in caching or memory hardware?

All the problems I work on benefit more from memory bandwidth and cache
latency than from raw FLOPS. I imagine others are in the same boat.

I was hoping this would be the start of some more architecture diversity,
like Apple's tile-based deferred rendering.

~~~
winter_blue
In terms of memory architecture -- I've heard that memory is a bottleneck for
the GPU, specifically the time it takes to move stuff from RAM (main memory)
to the GPU's RAM/memory. If it is such a big bottleneck, then why don't we
(yet) see a powerful GPU sharing the same die as the CPU and
accessing/using/sharing RAM with the CPU (like integrated GPUs do)? Then
there'd be a _zero_ bottleneck. You would just load whatever into RAM, and
just give the ( _super-powerful integrated_ ) GPU a pointer/address. Bam,
done. Why hasn't this happened yet? / _What am I missing/misunderstanding
here?_
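
To make the "just hand it a pointer" part concrete: CUDA's unified memory
already exposes exactly this programming model in software today. A minimal
sketch (standard CUDA runtime API; only the toy kernel is made up):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Toy kernel: double every element in place.
    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        // One allocation, visible to both CPU and GPU; no explicit copies.
        cudaMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; i++) x[i] = 1.0f;
        scale<<<(n + 255) / 256, 256>>>(x, n);  // hand the GPU the same pointer
        cudaDeviceSynchronize();                // wait, then read back on the CPU
        printf("x[0] = %f\n", x[0]);            // prints 2.000000
        cudaFree(x);
    }

But on a discrete card the runtime still migrates those pages over PCIe
behind the scenes, so the question is really about the hardware: why is there
no die where that pointer never has to cross a bus at all?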

~~~
nordsieck
> I've heard that memory is a bottleneck for the GPU, specifically the time it
> takes to move stuff from RAM (main memory) to the GPU's RAM/memory. If it is
> such a big bottleneck, then why don't we (yet) see a powerful GPU sharing the
> same die as the CPU and accessing/using/sharing RAM with the CPU (like
> integrated GPUs do)? Then there'd be a zero bottleneck. You would just load
> whatever into RAM, and just give the (super-powerful integrated) GPU a
> pointer/address. Bam, done. Why hasn't this happened yet? / What am I
> missing/misunderstanding here?

That is not the only bottleneck involved.

Historically, GPUs have used GDDR RAM as opposed to general-purpose DDR
memory. One of the key differences between GDDR and DDR is the bus width,
which can be as large as 1024 bits (per stack, in the case of HBM), compared
to conventional RAM with a 64-bit bus width (although dual channel is
effectively 128 bits). This much wider bus results in much higher memory
bandwidth, which is generally necessary to feed the truly enormous number of
functional units in a GPU.
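
Back-of-the-envelope, peak bandwidth is roughly (bus width / 8) × per-pin
data rate. Plugging in two well-known parts:

    dual-channel DDR4-3200:   (128 / 8)  × 3.2 GT/s  ≈  51 GB/s
    Vega 64 HBM2 (2048-bit):  (2048 / 8) × 1.89 Gbps ≈ 484 GB/s

so the gap is roughly an order of magnitude.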

I suppose you could ask: why doesn't everyone just standardize on GDDR?

1. This would dramatically increase cache line size: a 64-byte line is one
burst of eight transfers on a 64-bit bus, so a 1024-bit bus with the same
burst length implies lines on the order of 1 KB. I don't have data, but I
assume this would generally be bad.

2. My recollection (though I don't have a source for this) is that DDR has
lower latency than GDDR, so for branchy code (which CPUs often have to deal
with, but GPUs typically don't), DDR can actually be faster.

3. DDR is cheaper to manufacture. Aside from being higher volume, a lower bus
width just makes it simpler to manufacture.

~~~
jandrese
> One of the key differences between GDDR and DDR is the bus width, which can
> be as large as 1024 bits (per stack, in the case of HBM), compared to
> conventional RAM with a 64-bit bus width

Does this mean there are over a thousand traces between the GPU chip and the
memory chips? If that's the case, it would be pretty clear why regular
motherboards don't use it: the sockets for the chips would be enormous!
You're talking about roughly doubling the pin count vs. a 64-bit memory bus
on a modern LGA socket.
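
(To put numbers on it: going from 64 to 1024 data lines alone adds ~960
traces, before counting the extra command/address and power/ground pins they
would bring along, i.e. on the order of an entire LGA115x socket's worth of
lands.)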

~~~
thechao
The on-die Larrabee traces were 3072 wires wide.

------
fancyfredbot
Looks a lot more interesting than the Xeon Phi ever did. If they can provide
the huge memory bandwidth this will need to keep it fed, and if they can offer
a decent programming model, then this could be very competitive. I suspect
they can do these things, and that the next challenge for them is going to be
optimized software. NVIDIA have a massive lead in terms of software support
for their accelerators, so I can see this being a challenge.

~~~
gnufx
They're pushing "oneAPI" for programming. They rather have to deliver this
time on the Aurora supercomputer.

~~~
pjmlp
Except they are building on top of SYCL, which is already late to the game, as
CUDA does C++17 as well and is polyglot, whereas oneAPI is C++ only and
hopes someone else will create bindings, Khronos-style in not understanding
how to drive widespread adoption.

~~~
gnufx
Perhaps I should have said "for what it's worth". I'm not clever enough for
C++, but it does seem to be what Livermore, for instance, are committed to.
The oneAPI propaganda does talk about other bindings, which I haven't looked
for. The question might be how it works with OpenMP 5 offload, which was
intended for portable performance.

------
strictnein
I know they're going to have a gaming GPU in 2021, and GFLOPs aren't
everything, but:

> One Tile: 10588 GFLOPs (10.6 TF) of FP32

> NVIDIA RTX 2080: 10.07 TFLOPS - FP32
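
For reference, these FP32 numbers are just ALU count × 2 FLOPs per clock (one
fused multiply-add) × clock speed; e.g. the 2080's 2944 CUDA cores × 2 ×
~1.71 GHz reference boost ≈ 10.07 TF.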

~~~
ivalm
That's actually pretty bad; it will potentially be worse than the 3070 (which
it will compete against)...

~~~
dr_zoidberg
That'd place them in a slightly better position than AMD, which has been in
the (discrete) GPU market for years. All in all, this is still theoretical,
and we'll have to see how they behave under real workloads (be that games or
compute).

Intel's iGPUs have always looked great on paper and then hit a wall in the
real world.

~~~
ivalm
The expectation is that the new Navi (which will probably come out earlier
than Xe-HPG) will have 80 CUs [0]; it is essentially 2x a 5700 XT [1]. This
should put it in the 15-20 TFLOP range, 1.5-2x an Intel single tile.

[0] [https://www.pcgamesn.com/amd/big-navi-rdna2-80-cu-rumour](https://www.pcgamesn.com/amd/big-navi-rdna2-80-cu-rumour)

[1] [https://en.m.wikipedia.org/wiki/Radeon_RX_5000_series](https://en.m.wikipedia.org/wiki/Radeon_RX_5000_series)

~~~
dr_zoidberg
Lately I've been more interested in CPU announcements and haven't been
following GPUs much at all (I've been super tired of all the hype around
Ampere, which has been going on for months now).

Considering that, it means Intel is coming into the dGPU market with a
not-stellar device, from a company that has historically had driver issues
and underdelivered in practice... Even more pressure to prove their worth,
then.

------
m0zg
The number is a bit misleading: the quoted performance is for the "4-tile"
configuration. Per-tile, this is still markedly slower than NVIDIA.

------
MR4D
So if four tiles = 42 teraflops, then does that mean 25 of these would
produce a petaflop (25 × 42 TF = 1,050 TF)?

Wow.

I’d imagine these things are super expensive.

~~~
ivalm
You need a thermal and memory bandwidth solution for a multi-tile setup, so
you can't easily scale it up too much, although companies like Cerebras do
show us a path toward "very very large chips".

~~~
MR4D
Agreed. But just the thought that a petaflop could fit in a spare bedroom is
pretty neat.

------
tweedledee
Does anyone else here think the coprocessor on the NVIDIA Ampere looks a lot
like the RC 18? If so, that should post some crazy perf numbers.

------
MangoCoffee
Is it going to be based on 14nm+++? AMD/Nvidia are on 7nm.

~~~
formerly_proven
10nm+++, actually.

~~~
ivalm
I thought Intel said during Architecture Day that GPUs would be done at
outside fabs (presumably TSMC or Samsung).

~~~
formerly_proven
> We also know, due to disclosures made at Intel’s Architecture Day, that it
> is set to be built on Intel’s 10nm Enhanced SuperFin (10ESF, formerly 10++,
> formerly 10+++) manufacturing process, which we believe to be a late 2021
> process.

~~~
ivalm
You're right, apparently only Xe-HPG is made fully at an external fab.

[https://www.anandtech.com/show/15974/intels-xehpg-gpu-unveil...](https://www.anandtech.com/show/15974/intels-xehpg-gpu-unveiled-built-for-enthusiast-gamers-built-at-a-thirdparty-fab)

