
AMD and ARM's new CPU/GPU virtual ISA - Symmetry
http://semiaccurate.com/2011/06/22/amd-and-arm-join-forces-at-last/
======
sitkack
This makes sense, esp. with regard to the scalar core that is being
included in the CU. Most likely that scalar core is ARM-derived.

I made noises that AMD needed to drop 32-bit and SSE and just focus on
OpenCL. What is being speculated actually reaches much farther than that, and
is a better design. Thanks for the clue stick.

Also, more corroborating evidence: Anton Chernoff, one of the main creators of
FX!32 (<http://en.wikipedia.org/wiki/FX!32>), works for AMD Boston. FX!32
was/is one of the most badass dynamic recompilation engines I have ever
witnessed. It could load x86 executables and effectively JIT them to Alpha.
This is the kind of technology needed to support a virtual ISA.

Anton Chernoff <http://www.linkedin.com/pub/anton-chernoff/1/59/6b2>

[http://www.usenix.org/publications/library/proceedings/useni...](http://www.usenix.org/publications/library/proceedings/usenix-nt97/full_papers/chernoff/chernoff.pdf)

------
bascule
Holy crap, memory barriers! It's a modern architecture that's finally caught
up to where Symbolics was at in the '80s. Seriously though, this is good news
for concurrent garbage collected languages.

~~~
jgmatpdx
Barriers are slow, and the more cores hit the barrier, the slower it
gets. The best news on that front was from the new GPU architecture parts,
where they admitted that they're still going to expose the relaxed-consistency
memory model.

~~~
bascule
"Barriers are slow" in what way? Azul was hitting 10-20ms pauses while
successfully garbage collecting 500GB+ heaps using hardware memory barriers on
their Vega architecture, which had up to 768 CPUs.

Cliff Click, Azul's principal JVM architect, seemed somewhat ambivalent about
how much hardware transactional memory actually helped them hit those targets
(i.e. it's not a silver bullet), but he chided Sun for not including it on
their Niagara chips when they were providing both the language runtime and the
hardware platform.

I'll say it again: Getting them on a commodity architecture is a great thing
for concurrent garbage collected languages.
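For readers tripping over the terminology: the "barrier" that matters to a garbage collector is a tiny piece of bookkeeping run on heap reads or writes so the collector can work concurrently with the program, which is a different thing from a hardware memory fence. A minimal sketch of one textbook scheme, a card-marking write barrier, in C (this is a generic illustration, not Azul's hardware read-barrier mechanism; all names here are made up):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative card-marking write barrier: the heap is divided into
 * fixed-size "cards"; every pointer store marks its card dirty so a
 * concurrent collector only rescans dirty regions. A generic textbook
 * scheme, not Azul's hardware read barrier. */

#define HEAP_SIZE 4096
#define CARD_SIZE 256
#define NUM_CARDS (HEAP_SIZE / CARD_SIZE)

static uint8_t heap[HEAP_SIZE];
static uint8_t card_table[NUM_CARDS];

/* Store a pointer-sized value into the heap and mark its card. */
static void write_ref(size_t offset, uintptr_t value) {
    memcpy(&heap[offset], &value, sizeof value);
    card_table[offset / CARD_SIZE] = 1;   /* the write barrier */
}

/* Collector side: count how many cards need rescanning. */
static int dirty_cards(void) {
    int n = 0;
    for (int i = 0; i < NUM_CARDS; i++)
        n += card_table[i];
    return n;
}
```

The point of hardware support is that the per-store mark (or, in Azul's case, a per-load check) costs roughly nothing, instead of an extra instruction or two on every heap access.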

------
andrewcooke
this is awesome. i had no idea anything like this was coming. is there
anything else out there that confirms this? and will it extend upwards to
include bulldozer too?

~~~
Symmetry
You can find the whole main presentation here:
[http://developer.amd.com/documentation/presentations/assets/...](http://developer.amd.com/documentation/presentations/assets/Phil%20Rogers%20Keynote%20Final%206.14.2011.pdf)
I'm presuming stuff coming from amd.com is confirmation. It should certainly
include future products; what I would worry about is to what extent existing
graphics cards will be able to take advantage of this.

Generally, I'd say this looks a lot like Clang, but with better ways to express
memory parallelism at least. No way to know at this point to what extent this
will catch on, but it looks interesting.

~~~
xpaulbettsx
I think it's bigger than that - the idea that a GPU can execute syscalls is
pretty crazy if you really buy into it (i.e. the entire _OS_ is running in
this heterogeneous environment)

~~~
jgmatpdx
This is possible now [1]. FSA will (hopefully) make it significantly more
efficient.

[1]
[http://www.idav.ucdavis.edu/publications/print_pub?pub_id=10...](http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1039)

------
MarkSweep
The part where they mention graphics becoming pre-emptable kinda solves
Microsoft's concerns about WebGL DoS attacks.

------
regehr
Wouldn't a virtual ISA be against the interests of hardware designers who
benefit from ISA lockin?

~~~
nivertech
Intel and NVidia want a large piece of a small pie.

AMD and ARM want a small piece of a large pie.

~~~
jgmatpdx
Even NVIDIA isn't crazy enough to try to get everyone to use the same ISA
again; they're trying to get you to buy into their compiler chain, not into
the architecture of a particular generation of GPU.

~~~
nivertech
Is PTX really an ISA? It's also virtual, but very limited ... Unlike PTX, FSAIL
will work on both CPUs and GPUs, and will even allow system calls from discrete
GPUs.

~~~
jgmatpdx
PTX seems quite reasonable to me, but I don't think it's really intended to be
a programmer-facing abstraction. NVIDIA wants you to target CUDA, and they'll
(try to) ensure that CUDA has sufficient performance advantages to make it
worth your while to do so.

------
cswetenham
I hope I'm not the only one who finds this article poorly written; I got a
better understanding from the list of bullet points on the slide than I did
trying to follow the rambling paragraphs below them.

------
stcredzero
I wonder if Apple will use this with Grand Central?

~~~
wtallis
It wouldn't be hard, except for Apple being in bed with Intel.

As they exist today, GCD/libdispatch blocks are very similar to OpenCL native
kernels, but with more automatic scheduling and implicit data transfers (which
are almost no-ops when the kernel is running on the host device).

On an AMD Fusion system with a shared memory controller, you don't have the
huge latency of transferring your kernel arguments to the GPU (and with
libdispatch blocks, you are generally transferring less data), so a block
could be executed on the GPU with hardly any more overhead than on a different
CPU core.

The only problem would be that you would need universal binaries of your
blocks to also target the GPU (or LLVM IR to be finalized at runtime). As
Apple has complete control over their very modern and flexible LLVM-based
toolchain, implementing this would be very straightforward, even without the
fruits of an AMD/ARM collaboration. (Though it sounds like ARM may help AMD be
able to provide an array of scalar processors that would have good performance
characteristics for typical dispatch_async uses.)

------
baq
some damn convincing speculation, i must say.

------
za
Does this portability compromise performance at all?

~~~
iam
It compromises performance in a similar way to using any bytecode vs. final
machine code. It remains to be seen how much VM support will be required to
make this work, as that's typically the largest overhead cost of using non-
machine code.

If anything, the exact same FSAIL will run faster and faster on the exact same
hardware as newer JITs come out.
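The overhead being discussed can be sketched with a toy example: an interpreter over a portable bytecode does the same arithmetic a native function would, plus a decode/dispatch step per instruction; a JIT's whole job is to translate once and eliminate that step. The opcodes and names below are invented for illustration, not FSAIL:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stack-machine bytecode. The portable encoding costs a
 * decode/dispatch step per instruction that native code avoids;
 * a JIT removes that cost by translating to machine code once,
 * and a better JIT removes more of it on the same hardware. */

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

static int interpret(const int *code) {
    int stack[16];
    size_t sp = 0;
    for (size_t pc = 0;;) {
        switch (code[pc++]) {            /* per-instruction dispatch */
        case OP_PUSH: stack[sp++] = code[pc++]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}
```

For example, the program `PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT` computes (2 + 3) * 4; native code would do the same two arithmetic ops with zero dispatch overhead.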

