

AMD’s “heterogeneous Uniform Memory Access” coming this year in Kaveri - pedrocr
http://arstechnica.com/information-technology/2013/04/amds-heterogeneous-uniform-memory-access-coming-this-year-in-kaveri/

======
vardump
Currently there's a big problem in GPGPU computing: high latency until the
computation results are available. That can be tens of milliseconds, which
significantly limits the types of tasks you can efficiently offload to the
GPU. I understand AMD's hUMA/HSA is supposed to address this problem.
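
For concreteness, a minimal sketch of that round trip in CUDA terms
(illustrative names, error checks omitted) -- it's the copies and
synchronization around the kernel launch, more than the kernel itself, that
account for the latency, and it's exactly what a shared address space would
let you skip:

    // The classic discrete-GPU offload: copy in, launch kernel, copy out.
    // Each step adds latency before the CPU can see the results.
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void offload(float *host, int n) {
        float *dev;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&dev, bytes);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // copy in
        scale<<<(n + 255) / 256, 256>>>(dev, n);               // launch
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // wait + copy out
        cudaFree(dev);
    }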

But there's another problem: current CPU memory buses are connected to two
or more DDR3 memory channels, and DDR3 simply doesn't have sufficient
bandwidth for high-performance graphics and GPGPU computing, especially when
shared with the CPU.
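
For rough scale, peak theoretical rates (with a then-current discrete card
for comparison):

    dual-channel DDR3-1600:  2 x 64 bit x 1600 MT/s / 8 = 25.6 GB/s  (shared with the CPU)
    Radeon HD 7970 (GDDR5):  384 bit x 5.5 Gbit/s / 8   = 264 GB/s   (GPU-private)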

Intel's Haswell will have the CPU and GPU on-package, together with a shared
128 MB eDRAM "L4 cache" at 64 GB/s. I believe that should enable low-latency,
high-performance memory sharing.

I don't understand AMD's bandwidth story. Does the GPU share one memory
controller with the CPU and have another private one, for example connected
to GDDR5? I don't see how hUMA could work efficiently over a PCIe bus either,
so I guess hUMA applies only to APUs, where the CPU and GPU share a die.

How does AMD provide the bandwidth?

~~~
seanmcdirmid
I think this hUMA/HSA is a value proposition and not meant for high-end
graphics or GPGPU, which can easily stream through GBs of data very quickly
(high-end cards have > 4 GB on board). Even Haswell strikes me as a value
product; everything is great until your problem doesn't fit into your cache.

The transparent memory hierarchy is still quite expensive, and there are lots
of performance benefits, at least at the high end, to managing it yourself.

~~~
scott_s
There are a lot of performance benefits to managing the memory hierarchy
yourself. But I think the demise of the Cell demonstrated that not enough
people are willing to do it to justify it as an architectural decision.

~~~
seanmcdirmid
In the video game market sure, but in the HPC market, CUDA rules.

~~~
scott_s
I was talking about a single chip. There was a lot of interest in Cell in
the HPC market, but it also died.

For applications that can tolerate the latency, offloading computation to the
GPU is such a multiplier in performance that people put up with manually
managing the hierarchy -- but even there, it's rather coarse memory
management.

------
pedrocr
I'm surprised it's taken this long to get unified memory access across CPU
and GPU. Carmack has been asking for it for a while now, and that's just for
the traditional application of a GPU (graphics). For GPGPU this would be
massive. It's probably also one of the few places AMD can really compete with
Intel, as they have both CPU and GPU tech.

~~~
mtgx
I think Nvidia will do it too, starting with Tegra 6 (Denver/Maxwell) and
beyond, but in mobile devices. I think ARM can already share memory between
Cortex-A15 CPUs and its Mali T600 line of GPUs right now (ARM is also part
of the HSA Foundation).

<http://regmedia.co.uk/2011/08/19/hc_cohere_small.jpg>

~~~
pedrocr
That's a good point. I was thinking about this in terms of the traditional x86
fight between Intel and AMD. I could see Nvidia going beyond mobile with it
though. Consider this playbook for Nvidia:

1) Build a 64-bit ARM chip with the latest GPU tech (not the
generations-behind stuff they've been using on mobile) and unified addressing
with the memory controller on die.

2) Add a bunch of fast RAM, some fast flash for storage, and a gigabit
Ethernet chip to build a blade server.

3) Start selling these to people who have GPGPU-type workloads today. This
should easily be cheaper and more power-efficient than the equivalent Intel
solution.

4) As 64-bit ARM becomes more performance-competitive with x86 (is that
happening?) and people get used to developing for ARM, move to take over
more traditional CPU workloads.

It would make sense for them, as they've already tried to enter the x86
market with their motherboard/chipset business and were mildly successful.
Back then they were just trying to build a better x86 part and got squeezed
by Intel; here they'd be using the classic Innovator's Dilemma strategy of
coming from a lesser product (ARM) to dominate the market.

One of the most interesting points in Ars' previous articles about AMD was
that before they decided to buy ATI, they were actually considering Nvidia,
but the sticking point was that Nvidia's CEO wanted to be CEO of the joint
company. It makes you wonder what could have come out of AMD+Nvidia with
Jen-Hsun Huang at the helm.

~~~
new299
> As 64-bit ARM becomes more performance-competitive with x86 (is that
happening?)

Correct me if I'm wrong, but I don't think there are actually any 64-bit ARM
cores available to consumers yet. A quick Google search suggests they are
just getting these cores into ICs now:

<http://hexus.net/tech/news/cpu/53661-arms-64-bit-cortex-a57-taped-out/>

~~~
vardump
Not for consumers, right. But Applied Micro's ARMv8 64-bit X-Gene should be
available:
<http://www.apm.com/products/x-gene>

~~~
new299
Interesting, I hadn't seen that before -- thanks. It's not clear to me
whether I can actually buy one yet; their contact page doesn't list the
64-bit servers, which is weird. Anyway, I've pinged them for more
information and would be interested in playing with one.

------
fulafel
Low-barrier intermixing of GPU and CPU code is a pretty ambitious project on
the software side; they need to implement it well and get developers on
board. I hope AMD can pull it off.

Their fate depends on leveraging their GPU lead over Intel, and it's much
easier to port code to the GPU if you don't have to completely rewrite it
around the old-school GPU data-shuffling requirements.
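
To make the data-shuffling point concrete: with a shared virtual address
space the GPU can walk the same pointer-based structures the CPU built,
instead of the programmer flattening them into a contiguous buffer, copying
them across, and rewriting the traversal with indices. A minimal sketch,
using CUDA's managed-memory allocator as the closest analogue (illustrative;
not AMD's API):

    #include <cuda_runtime.h>

    struct Node { float value; Node *next; };

    // The GPU dereferences pointers the CPU built; no serialization
    // step and no explicit copies.
    __global__ void sum_list(const Node *head, float *out) {
        float s = 0.0f;
        for (const Node *n = head; n != nullptr; n = n->next)
            s += n->value;
        *out = s;
    }

    // Nodes live in a single address space visible to both sides.
    Node *push(Node *head, float v) {
        Node *n;
        cudaMallocManaged(&n, sizeof(Node));
        n->value = v;
        n->next = head;
        return n;
    }

A single-thread launch (sum_list<<<1, 1>>>(head, out), then
cudaDeviceSynchronize()) is deliberately silly; the point is that the
traversal code didn't have to change to move to the GPU.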

This might have a better chance on the PS4/Xbox 720 side, with
one-size-fits-all hardware and hopefully fewer driver problems.

------
nkurz
_hUMA addresses this, too. Not only can the GPU in a hUMA system use the CPU's
addresses, it can also use the CPU's demand-paged virtual memory. If the GPU
tries to access an address that's written out to disk, the CPU springs into
life, calling on the operating system to find and load the relevant bit of
data, and load it into memory._

Is demand-paging actually relevant, or just a poorly chosen example of what
would theoretically be possible? I'd think that in an application where the
worry is memory transfer speed, one wouldn't ever want to be swapping to disk.
Better a swift death by the OOM killer than drowning in molasses.

More generally, do swap files still have a useful role to play in high
performance computing? I'd think the window between "fits in RAM so no need
for a swap file" and "ever so slightly larger than RAM so we can quickly page
in what we need" is thin and growing thinner.

Sharing a virtual address space, transferring directly to and from RAM, and
hardware cache synchronization sound like real advantages, though.

~~~
scott_s
It allows people to treat memory on the GPU just as they do on CPUs. You are
correct: you will probably not achieve high performance if you are paging in
from disk all the time. But what if you only need it done once every ten
minutes? Consider the amount of programmer effort required to do that
manually. This is part of sharing the virtual address space.
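
For a concrete picture of that model -- roughly what it later looked like as
CUDA's managed memory, used here as an analogue rather than AMD's actual API
-- one allocation serves both sides, and the runtime moves the data on
demand instead of the programmer orchestrating every copy:

    #include <cuda_runtime.h>

    __global__ void increment(float *a, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1.0f;
    }

    int main() {
        size_t n = 1 << 20;
        float *a;
        cudaMallocManaged(&a, n * sizeof(float));  // visible to CPU and GPU
        for (size_t i = 0; i < n; i++) a[i] = 0.0f;  // CPU writes directly
        increment<<<(int)((n + 255) / 256), 256>>>(a, n);  // same pointer on GPU
        cudaDeviceSynchronize();  // afterwards a[0] == 1.0f on the CPU
        cudaFree(a);
        return 0;
    }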

------
nly
Doesn't sharing a virtual memory context with the GPU increase the cost of
context switching? Also, which CPU core shares context with the GPU? Or are we
talking about a fixed mapping (like the kernel)?

~~~
Symmetry
I'd assume they'd handle sharing memory between a CPU and a part of the GPU
the exact same way they'd handle sharing between two CPU cores.

------
api
This sounds _really, really, really freaking cool_. I am overjoyed to see AMD
not throwing in the towel and conceding the entire high-end CPU market to
Intel. A monopoly there would threaten Moore's law.

I can think of a lot of cool things to do with hUMA. I might have to get one
and dust off my once very strong interest in evolutionary computation
(strongly biomorphic genetic algorithms, artificial life, etc.). EC can do
very interesting things -- it's the only "AI" technique I am aware of that
can be genuinely creative -- but it eats CPU cycles for breakfast.

It would also be great for building a practical virtual machine based on a
fully homomorphic cryptosystem for "blind cloud computing" -- where the VM
host has no idea what the VM is doing. All kinds of neato stuff is waiting
for this kind of computing platform to become practical.

------
alisnic
Pardon me, but this is fucking awesome!

------
machbio
Off topic: AMD codenames it Kaveri; let me guess, the better part of
development was done by South Indians...

~~~
glaze
Kaveri also means "buddy" in Finnish.

------
ignostic
AMD (and sometimes Intel too) often goes on about its new technology, yet
customers have shown they don't care. There are really only two things that
matter to buyers: price and "speed." Anything else is just PR hype for the
investors.

~~~
lucian1900
"Speed" is most certainly PR hype. There is no single objective measure of the
"speed" of any particular hardware.

Unified CPU and GPU memory is very interesting.

