

Anandtech: AMD Radeon HD 7970 Review (28nm, new architecture) - zdw
http://www.anandtech.com/print/5261/amd-radeon-hd-7970-review

======
gcp
I wish someone would write an explanation of the current NVIDIA and AMD
architectures with the terms translated to their CPU equivalents, instead of
the current half-marketing-speak that uses similar terms for different things
between the two vendors.

I'm finding it exceedingly hard to make any guess whether certain algorithms
are worth trying to port over or not, because the explanations are almost
incomprehensible.

To give an example, the VLIW->SIMD unit transition in this architecture
compared to NVIDIA's scalar units.

As far as I understand the VLIW vs SIMD difference, a VLIW instruction is more
powerful than a SIMD one because one "long instruction" can contain different
operations on different data, whereas SIMD applies the same operation to
multiple data.

Traditionally, VLIW was entirely statically scheduled, which put all the
burden on the compiler. Because graphics cards recompile all shaders anyway,
it's not a bad fit.

So, now AMD changed their units from 16x4 VLIW to groups of 16xSIMD engines.
The advantage here is that you no longer have to have groups of 64 similar ALU
operations, but can get by with groups of 16 similar ALU operations.
Conversely, there should be more such groups, i.e. more control logic compared
to the old design. To top that off, there are improvements that allow one to
schedule multiple threads over the GPU at once.

Am I on the right track here?

If I am, what's the minimum number of identical operations that you must do to
achieve reasonable throughput? 16 identical operations for 100% efficiency?
Put differently, if I have code that requires different operations (due to
branches, i.e. effectively conditional operations) on each computation stream
that I'm pulling through, what factors are going to limit the effective speed
I get out of this?
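
Here's the toy model I have in my head of what divergence costs (the 16-wide
lane count and the "execute both sides with lanes masked off" behaviour are
just my assumptions):

    # Toy model of branch divergence on a SIMD unit.
    # Assumptions (mine): 16 lanes, and a divergent if/else is handled by
    # executing each side once with the inactive lanes masked off.
    LANES = 16

    def efficiency(takes_if_side):
        """takes_if_side[i]: does lane i take the 'if' side of the branch?"""
        passes = int(any(takes_if_side)) + int(not all(takes_if_side))
        return 1.0 / passes     # fraction of issued slots doing useful work

    print(efficiency([True] * LANES))              # 1.0 -> uniform branch
    print(efficiency([True] * 8 + [False] * 8))    # 0.5 -> divergent branch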

~~~
DiabloD3
VLIW merely means instructions are set up in blocks with clearly defined starts
and ends (AMD calls these blocks 'clauses'). This technique is also called
MIMD (multiple instruction multiple data, as opposed to SIMD, single
instruction multiple data).

VLIW exists to exploit very wide instruction parallelism while clearly
delineating which instructions have dependencies on other instructions;
clauses (at least on the Radeon) usually end on a memory write or a complex
branch.

Radeon clauses are up to 128 instructions long, and manage 5 ALUs (on 4xxx,
5xxx, and 68xx: 4 of them identical, 1 able to do double precision math and
transcendentals), or 4 ALUs (on 69xx: all 4 identical, with no specialized ALU
needed for DP and transcendentals).

The compiler optimizes dependency flow and ALU usage, instead of requiring
dedicated hardware common in CPU design. This means far less silicon is
dedicated to the task, and instruction scheduling is far more predictable and
optimized.

Your suggestion that AMD used 16 ALUs per pipe under VLIW is wrong. The change
is from 4/5 VLIW ALUs to 16 ALUs that can now execute SIMD instructions. The
compute units (the head end that synchronizes multiple pipes to perform one
task in parallel) still use VLIW-like clauses to synchronize pipe usage.

Your suggestion that this new arch allows you to schedule multiple threads on
the GPU at once is nonsensical: the correct term for a pipe is "hardware
thread", and on a GPU like the 5870 you have 320 hardware threads (1600 ALUs);
you already schedule all of them at the same time for massively parallel
execution.

What has changed is that the CUs now support running clauses from different
shaders at the same time, using some CUs for one task and some for another,
and I believe they may also be able to have clauses from different shaders
loaded at the same time and switch between them without overhead; on VLIW
Radeons, a shader change-out has a high context switch penalty.

The only thing the new GCN arch really does is allow the ALUs to operate on
SIMD instructions which allows higher instruction packing. This does not mean
they do not use VLIW-like clauses, and it doesn't mean it is like Nvidia's
design (which the media keeps repeating).

Nvidia's ALUs are free-form stream processors and do not have a clear
beginning or end to each clause (as the hardware does not exploit hardware
thread synchronization); they frequently suffer from cache misses and pipeline
stalls, and they cannot easily exploit instruction level parallelism.

In addition, Nvidia does not exploit deep pipelining. On VLIW Radeons, the
instruction pipeline is multistaged and 4 instructions deep, so by the time
you are submitting the 5th instruction you are getting the results from the
first and instructions 2, 3, and 4 are still being processed. This allows much
easier synchronization between ALUs since they all read/write to the same set
of registers.
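
A toy timeline of what I mean, purely to illustrate the 4-deep behaviour (not
actual Radeon timing):

    # Model of a 4-deep pipeline: the result of instruction N only becomes
    # readable while instruction N+4 is being issued.
    DEPTH = 4
    for issue_slot in range(1, 9):
        done = issue_slot - DEPTH
        ready = f"result of #{done}" if done >= 1 else "nothing yet"
        print(f"issuing #{issue_slot}: {ready} is available")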

The addition of SIMD instructions to this design allows much higher data
throughput and much higher instruction packing; instead of executing, say, two
Bitcoin hashes per VLIW4/5 group, each instruction winding around the group
for maximum ALU efficiency, you can run 4 (or however many GCN uses for SIMD,
most likely 4 or 8) hashes at the same time as a SIMD operation and not
require complex compiler maneuvering to do a clearly instruction parallel
operation (thus 16x4 hashes per group).

Now, ultimately, nothing of what I've written actually matters. OpenCL is a
black box on purpose; it doesn't matter how the implementation executes it as
long as it does so correctly and efficiently. AMD is betting that GCN is more
efficient for the given silicon real estate.

~~~
Tuna-Fish
> Your suggestion that AMD used 16 ALUs per pipe under VLIW is wrong. The
> change is from 4/5 VLIW ALUs to 16 ALUs that can now execute SIMD
> instructions.

Actually, the previous architecture was a setup of 16x SIMD, where each of the
SIMD operations was a 5/4-wide VLIW. So calling that 320 hardware threads is
wrong -- in Cypress there were really only 20 front-ends, which drove these
bundles of 80 ALUs in groups of 5x16. Also, it was a 4-long barrel processor,
so you had to schedule a SIMD "wavefront" of 64 "threads" for each unit.

In the new version each CU still has 4x16 ALUs like it had in Cayman, but now
each of the 4 SIMD units of 16 elements can be scheduled independently by a
different hardware thread.

~~~
DiabloD3
There are 20 CUs, but each controls multiple sets of VLIW5/4 in parallel. AMD
claims 1600 ALUs on the 5870, so 1600/5 = 320. I'm not sure how they are
factoring in the 4-deep pipeline for the barrel, but I'm pretty sure it doesn't
mean there are only 80 actual ALUs.

The R700 Programming Manual seems to indicate my interpretation is correct,
although if you can provide evidence that I'm misinterpreting it, I'm all
ears.

~~~
Tuna-Fish
There are 1600 ALUs, but they are grouped in VLIWs of 5 elements, which are
grouped in SIMD groups of 16 VLIWs. So there are 20 front-ends, and reading a
single bundle in one of them will instantly make 80 ALUs execute an
instruction.

The barrel is essentially used to extend the vector registers from 16-elem to
64-elem, and a 64-"thread" wavefront, consisting of 5 VLIW'd instructions, is
essentially the smallest amount of work that R700 can do.
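
In numbers, using the figures above:

    # Cypress (5870) as I read it: 20 front-ends, each driving a 16-wide
    # SIMD of 5-slot VLIW bundles, barreled over 4 cycles.
    front_ends = 20
    simd_width = 16
    vliw_slots = 5
    barrel     = 4

    print(front_ends * simd_width * vliw_slots)   # 1600 ALUs total
    print(simd_width * vliw_slots)                # 80 ALUs fire per bundle read
    print(simd_width * barrel)                    # 64-"thread" wavefront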

------
unwind
When describing the (quite interesting, and well-described) "partially
resident textures" technology, which is in turn inspired by John Carmack's
MegaTexture technology, the review states:

 _For AMD’s technology each tile will be 64KB, which for an uncompressed 32bit
texture would be enough room for a 4K x 4K chunk._

Isn't this off by a factor of 1,000? 4K x 4K is 16M texels, which at 32 bits
per texel would require 64 MB. A chunk of 64 KB cannot hold that. They repeat
the "64KB" value for the chunk size many times, not sure in what direction
they're wrong, really. I guess if I kept up more with graphics tech, the
answer would be obvious. :)
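
Spelling out the arithmetic I'm doing (assuming 4 bytes per uncompressed
32-bit texel):

    tile_bytes = 64 * 1024                 # the 64KB tile from the article
    bpp        = 4                         # uncompressed 32-bit texel
    texels     = tile_bytes // bpp         # 16384 texels per tile
    print(int(texels ** 0.5))              # 128 -> a 128x128 tile, not 4K x 4K

    full_4k    = 4096 * 4096 * bpp         # an actual 4K x 4K, 32-bit texture
    print(full_4k // (1024 * 1024))        # 64 (MB), ~1000x bigger than 64KB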

~~~
Retric
4kx4k pixels is a higher resolution than all but the most extreme gaming
system can display (ignoring the fact textures are wrapped around 3d objects).
So, I think the idea is 64KB chunks out of an arbitrary image that could in
theory be 64 MB. AKA, you get to have 1,000 of those chunks for your 4kx4k
image some of which are loaded into memory.

~~~
VoxelBoy
4Kx4K textures can easily be displayed even on my Macbook Pro's 9400M card. It
does cost a fair amount of memory; depending on whether and how it's
compressed, whether it has mipmaps, whether it contains an alpha channel,
etc., it can be anywhere between ~16 and ~64 MB, but it's very doable.
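
Roughly where my ~16-64 MB range comes from (the per-texel sizes are my own
ballpark assumptions):

    texels = 4096 * 4096
    print(texels * 4 / 2**20)          # 64.0 MB  uncompressed RGBA8
    print(texels * 1 / 2**20)          # 16.0 MB  ~1 byte/texel compressed
    print(texels * 4 * 4 / 3 / 2**20)  # ~85 MB   uncompressed + full mip chain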

~~~
onemoreact
Yeah, but how many of those 16 million texels can be mapped to the display at
once?

A: Not all that many, which is why breaking that up into smaller chunks and
enabling a virtual 4K texture without compression is a good idea.

------
Natsu
With all those benchmarks out there already, I wonder if they'll ever start
using BitCoin mining as one?

~~~
DuncanIdaho
That would be hard to do and pretty pointless.

Do not forget that mining bitcoins becomes progressively harder with time.

Thus, to get any sort of meaningful review, they would have to re-test the
whole field of graphics cards each time a new one comes out.

Not feasible.

Edit: Ok, so I don't know anything about Bitcoin mining, thanks for
clarifications.

~~~
jl6
No, there is a very simple, well-defined and difficulty-independent metric of
Bitcoin mining performance: hashes per second.
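
To illustrate: the benchmark figure is just hashes counted over wall-clock
time, so difficulty never enters into it (the numbers below are made up):

    hashes_done  = 2.4e11          # hashes counted during a 10-minute run
    elapsed_secs = 10 * 60
    print(f"{hashes_done / elapsed_secs / 1e6:.0f} MH/s")   # 400 MH/s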

------
jvoorhis
I'm curious about the larger memory bus width. GPUs are still considered sub-
par for real-time audio applications because of the memory bottleneck, despite
their numerical computing power.

~~~
barrkel
I expect that's down to latency rather than throughput. Back in the day, I
recall some people working with audio preferring ISA sound cards to PCI sound
cards, because the latency was worse with PCI; but PCI has orders of magnitude
more bandwidth.

~~~
sliverstorm
It would be unsurprising if latency were the issue. Think about it: a GPU with
a turnaround time of 1/60th of a second is plenty fast for any standard
monitor.

~~~
vilya
Only if you're uploading data to it just once per frame. And you're not doing
stereo.

~~~
sliverstorm
Sure, 1/60th of a second may not be the exact right number, but the point is,
I would think that traditionally a GPU designer wouldn't exactly have a tight
latency budget.

Even if you're in stereo (1/120) and you write a new image to the buffer 100
times per frame (1/12000), that still gives you roughly 40,000 cycles on a
500MHz clock.
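
Spelling that budget out (the 500MHz clock is just a round assumed figure):

    clock_hz = 500e6                  # assumed GPU clock
    deadline = 1 / 12000              # 120Hz stereo x 100 buffer writes/frame
    print(int(clock_hz * deadline))   # ~41,666 cycles per write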

------
ashwinurao
I am surprised the 6990 outperforms this card in so many tests!

~~~
woadwarrior01
It shouldn't really be surprising. A 6990 is essentially two 6970s on the same
board. We'll probably have to wait for a 7990 to have a fair comparison.

~~~
DiabloD3
It is two underclocked 6970s. Two 6970s in Crossfire are approximately 6%
faster than a 6990, or 20% faster than two 6950s.

------
zeratul
Nowadays graphics cards are so fast that neither games nor APIs can keep up.

No, really: when will we get something as simple as OpenMP to do our data
mining on GPUs? There is more data to be processed than there are games to
play.

~~~
mrb
What do you mean, APIs cannot keep up? It is perfectly possible to write a
GPGPU app that utilizes practically all the resources; e.g. Bitcoin miners,
which use the OpenCL API, have a typical ALU utilization ratio of ~95%+.

~~~
DiabloD3
Closer to 98%+ on DiabloMiner or newest phatk2 (now that phateus finally
decided to catch up).

------
DiabloD3
In before Bitcoin mining comment

