
The Vision Intel, AMD and Nvidia Are Chasing: Heterogeneous Computing - nkurz
http://wccftech.com/intel-amd-nvidia-future-industry-hsa/
======
varelse
I think the emphasis on unified memory is a bit of a red herring here. In my
experience, heterogeneous computing means optimizing data placement such that
the parallel-friendly number crunching is done entirely on the GPU while I/O,
process control, and telemetry are orchestrated by the CPU.

Failure to consider data placement destroys much of the potential perf gain
from getting the above right, IMO. And unified memory won't make this any better
if BW within a GPU is ~1 GB/s and system bandwidth is ~10x less. It will just
sweep the problem under the rug.

Further, I have seen attempts to smear parallel computations system-wide,
treating all the ALUs as equivalent, but this frequently breaks determinism
when the compilers for the two different processor architectures reassociate
floating-point operations differently, producing numerically different
implementations of the same computation. And I see no resolution to that
problem right now.

That said, GPUs have advanced enormously in the past decade. Branching is
nowhere near the big deal it once was, the cache (L1 and register file) for a
GPU core (an SMX, not NVIDIA's silly redefinition of "core" to mean a SIMD
lane) is surprisingly large, and insanely fast atomic ops can resolve a lot of
potential concurrency issues (even deterministically, if one prudently employs
fixed-point math).
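
To make the fixed-point trick concrete, here is a minimal sketch (the kernel
name, scale factor and placeholder input are mine, purely for illustration) of
accumulating float contributions deterministically with an integer atomicAdd;
integer addition is associative, so the result does not depend on thread
ordering:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Convert each contribution to fixed point and accumulate with an integer
    // atomic. Integer addition is associative, so the result does not depend
    // on the order in which threads run (unlike a float atomicAdd).
    __global__ void accumulate_fixed(const float* values, int n, long long* acc) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // 20 fractional bits; watch for range/overflow in real code.
            long long fixed = llrintf(values[i] * (float)(1 << 20));
            atomicAdd((unsigned long long*)acc, (unsigned long long)fixed);
        }
    }

    int main() {
        const int n = 1 << 16;
        float* d_vals;
        long long* d_acc;
        cudaMalloc((void**)&d_vals, n * sizeof(float));
        cudaMalloc((void**)&d_acc, sizeof(long long));
        cudaMemset(d_vals, 0, n * sizeof(float));  // placeholder input (all zeros)
        cudaMemset(d_acc, 0, sizeof(long long));
        accumulate_fixed<<<(n + 255) / 256, 256>>>(d_vals, n, d_acc);
        long long acc = 0;
        cudaMemcpy(&acc, d_acc, sizeof(acc), cudaMemcpyDeviceToHost);
        printf("sum = %f\n", acc / (double)(1 << 20));
        cudaFree(d_vals);
        cudaFree(d_acc);
        return 0;
    }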

~~~
exDM69
> I think the emphasis on unified memory is a bit of a red herring here. In my
> experience, heterogeneous computing means optimizing data placement

Unified memory definitely doesn't change this aspect. It just makes it a whole
lot easier to do from a software engineering standpoint.

It is quite painful to deal with GPU buffers, memory mappings and DMA
transfers. Unified memory makes this a bit easier and, as an added bonus, eases
latency hiding by allowing the driver to do DMA transfers transparently, and
reduces the driver overhead of poking the CPU and GPU page tables.

~~~
varelse
I'd suggest creating a templated buffer class that manages data
uploads/downloads and exposes the appropriate pointer on the appropriate
device. And from an algorithmic standpoint, don't pingpong large buffers of
data between the CPU and GPU. I've been following this practice for 7 years
now and it seems to work well. Ergo I really don't need unified memory and I'm
pretty happy with the world of NUMA. Sadly, I suspect I'm an outlier.
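
As a rough sketch of what such a class might look like with the CUDA runtime
API (the class and member names are invented for illustration; error handling,
streams and pinned memory are left out):

    #include <cstddef>
    #include <vector>
    #include <cuda_runtime.h>

    // Owns a host copy and a device copy of the same data and exposes whichever
    // pointer is appropriate for where the work runs. Transfers are explicit so
    // large buffers aren't ping-ponged between CPU and GPU by accident.
    template <typename T>
    class DualBuffer {
    public:
        explicit DualBuffer(std::size_t n) : host_(n), device_(nullptr), n_(n) {
            cudaMalloc(reinterpret_cast<void**>(&device_), n * sizeof(T));
        }
        ~DualBuffer() { cudaFree(device_); }

        T* host() { return host_.data(); }
        T* device() { return device_; }
        std::size_t size() const { return n_; }

        // Call these only at well-chosen points in the pipeline.
        void upload() {
            cudaMemcpy(device_, host_.data(), n_ * sizeof(T), cudaMemcpyHostToDevice);
        }
        void download() {
            cudaMemcpy(host_.data(), device_, n_ * sizeof(T), cudaMemcpyDeviceToHost);
        }

    private:
        std::vector<T> host_;
        T* device_;
        std::size_t n_;
    };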

I think the real pain point is that GPU programming requires the ability to
parallelize seemingly serial algorithms in a strongly-typed language, in an
age where 90+% of the engineers I know do all their coding in single-threaded,
weakly-typed Python/R/Ruby/Javascript, so they never get the opportunity to
develop the relevant mindset. As an aside, getting close to floating-point
speed of light (SOL) on a CPU with SSE/AVX often requires a mindset similar to
GPU programming (but sssshhhhh... don't tell the "recompile and run" peeps
about that, because it doesn't fit their comforting narrative).
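
As a rough illustration of that mindset (a sketch only, assuming an
FMA-capable x86 CPU and a length that is a multiple of 8; a real kernel would
handle the tail and use several accumulators to hide FMA latency):

    #include <immintrin.h>

    // Dot product with AVX + FMA intrinsics: the loop is expressed in terms of
    // 8-wide vectors rather than scalars, which is the same "think in lanes"
    // shift that GPU code requires.
    float dot_avx(const float* a, const float* b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb, element-wise
        }
        // Horizontal reduction of the 8 partial sums.
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float sum = 0.0f;
        for (int i = 0; i < 8; ++i) sum += lanes[i];
        return sum;
    }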

And it's not a problem for me personally as a former C and assembler videogame
and embedded system developer but it is a huge problem to find such talent
these days because Python/R/Ruby/Javascript.

All IMO of course.

~~~
exDM69
> I'd suggest creating a templated buffer class that manages data
> uploads/downloads and exposes the appropriate pointer on the appropriate
> device.

This is the easy part.

The hard part is getting this right in the big picture, to avoid having the
CPU wait for the GPU and vice versa. Pipeline stalls will easily ruin your
performance, to the point where single-threaded CPU code runs faster because
of the high latency of data transfers.

This is where modern GPU memory management really helps. For example, in
OpenGL you can create buffers with GL_MAP_PERSISTENT_BIT and
GL_MAP_COHERENT_BIT, map them once right after creation, and keep the pointer
to the mapped buffer for the entire lifetime of the buffer (see
ARB_buffer_storage). When you do this,
you're responsible for the synchronization so that the CPU and GPU never touch
the same parts of the buffer concurrently. CUDA probably has something
similar.

This adds some complexity to the application but now it's done explicitly in
the application and there are no surprises. And the CPU/GPU page tables don't
have to be modified, TLBs flushed and so on, so you get a bit of extra perf
from there.
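
Roughly what that looks like in code (a sketch; assumes an OpenGL 4.4 /
ARB_buffer_storage context with the entry points already loaded, and the
fence-based synchronization reduced to a comment):

    #include <GL/glcorearb.h>  // assumes the GL 4.4 entry points have been loaded

    // Create an immutable buffer, map it once, and keep the pointer for the
    // buffer's whole lifetime. The application must ensure the GPU is done with
    // a region (glFenceSync / glClientWaitSync) before the CPU rewrites it.
    void* create_persistent_buffer(GLsizeiptr size, GLuint* out_buf) {
        const GLbitfield flags =
            GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

        glGenBuffers(1, out_buf);
        glBindBuffer(GL_ARRAY_BUFFER, *out_buf);
        glBufferStorage(GL_ARRAY_BUFFER, size, nullptr, flags);    // immutable storage
        return glMapBufferRange(GL_ARRAY_BUFFER, 0, size, flags);  // stays valid
    }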

And I totally agree with the point about the mindset of single-threaded
programmers with dynamic languages. I was once put into a project where a
scientific application was written in Python and some "big matrix" operations
were done in C. The matrix ops were at the top of the profile, so they assumed
they could be made faster by simply "porting" them over to GPGPU.

Well, the matrices weren't that big (a few kilobytes) and the operations were
done sequentially, one by one, in the application. To make it work, the
operations would have had to be batched, but this would have meant a big
refactoring of the entire Python codebase. This was way too complex for the
chemists and physicists who had written the code. They were really smart
people with the domain knowledge, but they didn't possess the mindset to think
about making it efficient.
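
For what the batching would have looked like on the GPU side, one option is
cuBLAS's strided batched GEMM, which multiplies all the small matrices in a
single call instead of one launch per matrix (a sketch only; assumes square
NxN matrices packed contiguously in device memory and a cuBLAS version that
provides this entry point):

    #include <cublas_v2.h>

    // One call multiplies `batch` pairs of small NxN matrices, already resident
    // on the device and laid out back to back (stride N*N elements).
    void batched_multiply(cublasHandle_t handle, const float* A, const float* B,
                          float* C, int N, int batch) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                  N, N, N,
                                  &alpha,
                                  A, N, (long long)N * N,
                                  B, N, (long long)N * N,
                                  &beta,
                                  C, N, (long long)N * N,
                                  batch);
    }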

~~~
varelse
The approach I've taken to this dilemma is to work directly with the
scientists to understand the quantities they are trying to compute rather than
attempt to port the code itself (yes, I actively choose refactoring over
porting).

Once the underlying algorithm has been deconstructed, one then reconstructs it
parallelized in CUDA/OpenCL/whatever. In my experience, the implementation
language and APIs can easily lead one to a suboptimal realization of
algorithms.

Once done, the original Python code can serve as a reference implementation
for conformance and unit testing. Of course, if you have 1M+ lines of legacy
code, this probably won't work. But in my experience so far, the core
algorithms have been less than 50K lines of code, and usually less than 10K.

IMO sometimes one has to stand one's engineering ground to persuade them to
make the right decision. And as a bonus, one ends up a ground-floor domain
expert at the end.

Finally, I turned down a very prestigious project 4+ years ago because they
wouldn't let me take the above approach. They opted instead to do it entirely
with compiler directives. That project still hasn't officially shipped. It
has, however, driven a significant amount of improvement in said compiler
directives.

------
kctess5
I've been working a lot with GPUs recently, and so far it's been a mixture of
really awesome, and really hateful. Dealing with memory is a huge pain, and
the language limitations are pretty bad. So many workloads these days are _so_
parallelizable it seems like a shame that the difficulty of working with GPUs
turns people away from them. I strongly look forward to the day when this
stuff is "fully mainstream" and it receives more attention from the compiler
folks. Having a unified memory pipeline would be an excellent first step.

I can't wait to play with a Tegra X1 SoC! That's a whole lot of power in an
embedded device. Would be excellent for real time embedded video processing or
computer graphics.

------
fulafel
The elephant in the room is that language development seems basically stalled.
Meanwhile the bad old "C for GPUs" languages we are stuck with are CUDA, Metal
and OpenCL, of which OpenCL is no good wrt compiler/tools/driver quality and
the other two are proprietary and thus of limited potential.

~~~
exDM69
Well everything is about to change once we get the SPIR-V intermediate
language for Vulkan (and possibly OpenGL and CL too). It's an intermediate
language with some similarity to LLVM IR.

That should enable third parties to author compilers targeting SPIR-V, allowing
other programming languages without having to use C as an intermediate target.

~~~
markus2012
What about HSAIL / LLVM: [https://github.com/HSAFoundation/HLC-HSAIL-Development-LLVM](https://github.com/HSAFoundation/HLC-HSAIL-Development-LLVM)

(HSA Intermediate Language LLVM support)

It would seem an IR is already available and it's already integrated on an
existing LLVM branch. From skimming around it seems AMD has been working hard
to do what is necessary to get things merged into LLVM mainline.

> My highest priority is to get the backend upstreamed as soon as possible, so
> I would appreciate feedback about any kinds of blocking issues on that.
>
> Thanks
>
> Matt Arsenault

[http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-May/085545.h...](http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-May/085545.html)

~~~
exDM69
I'm sure this work hasn't gone unnoticed in SPIR-V development; I wouldn't
even be surprised if it's partially the same guys doing it. That said, I
haven't been involved in that, so I don't know for sure.

LLVM IR has a few weak points for this kind of use. It has
architecture-dependent parts, and the format has changed in the past, so it is
coupled to LLVM's development cycle. You can argue whether that's a good or bad
thing, but it's a thing to be considered.

There were two earlier versions of SPIR, coupled with different LLVM versions.
SPIR-V is independent of LLVM but still semantically similar to LLVM IR, so
compiling back and forth between the two isn't too difficult. AFAIK there have
been discussions about adding a SPIR-V backend to LLVM trunk (can't find the
mailing list threads).

The SPIR-V spec is already out and there have been some projects using it
already, although you can't run it on a GPU yet (unless you work for a GPU
vendor, that is).

[https://www.khronos.org/spir](https://www.khronos.org/spir)

------
misja111
I find the article a bit overoptimistic about the future and applicability of
heterogeneous computing.

The problem with delegating tasks to GPGPUs is that GPGPU cores can do only
a very limited subset of what a CPU can do.

First of all, each GPGPU core has only a small amount of fast local memory
available. In most applications, tasks need to be able to access databases,
large in-memory data structures and whatnot. The GPGPU architecture is not
designed for that. When data outside the core's local memory is needed, the
transfer to and from individual cores is costly, and when too much of that has
to happen, the benefit of the many GPGPU cores rapidly disappears.

Second, a GPGPU core does not have access to other components, e.g. IO
controllers. So it can't directly read from disk or write to the network. This
again limits the types of work that a GPGPU core can do by a lot. For all
network and disk access it needs to communicate with the CPU, and again, this
communication is slow.

And finally there's a difficulty of another nature. Neither OpenCL nor CUDA
integrates nicely with higher-level languages such as Java. The article
mentions Aparapi, which is one solution that tries to deal with this by
compiling Java code to OpenCL. The trouble with this is that Java code lives
in a different world from OpenCL; code that works well for Java might perform
horribly on the GPGPU, because the GPGPU's architecture, with its memory
locality, is so different. So to write Java code that will be compiled to
well-performing OpenCL, you have to code it with OpenCL in mind, and you have
to be aware of the manner in which Aparapi will convert your code to OpenCL.
This defeats the purpose of Aparapi; it might very well be easier to just write
your code in OpenCL so that at least you have a clear view of what is going on.

~~~
venomsnake
So GPGPU is useful for computation intensive/ no branching map reduce of
vectors that can be fit into GPU memory.

Can we have a smart VM that profiles the code, spots that kind of code, and
puts it on the GPU?

~~~
exDM69
> So GPGPU is useful for computation intensive/ no branching map reduce of
> vectors that can be fit into GPU memory.

Modern GPUs can do full dynamic branching and process data structures that
aren't flat (i.e. vectors/matrices). They're not restricted to such simple
tasks any more.

I don't think we'll ever have VMs that are smart enough to make decisions like
this better than an engineer can. Or even close. What we really need is smart
engineers who better understand what the technology is suitable for and can
make good decisions about when to employ it.

~~~
michaelt
That's interesting to hear! I took the Udacity CUDA course a few years ago,
and at that time they were very keen on memory access coalescing, and if a
warp had two control paths they were executed sequentially.

As such, I didn't think I could figure out how to make use of it for my
applications (merging sorted lists of integers, Dijkstra's algorithm, etc.), as
typically you're branching and jumping around memory all over the place.

Can you suggest anywhere I can learn about these advances that mean the same
rules don't apply?
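
To make the divergence rule concrete, here is a purely illustrative kernel
(not from the course) in which lanes of the same warp disagree on a predicate,
so the two paths are executed one after the other with the inactive lanes
masked off:

    // Lanes within a warp take different paths depending on the data, so the
    // hardware runs the two paths back to back, masking off the inactive lanes.
    __global__ void divergent(const int* in, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (in[i] % 2 == 0)      // neighbouring lanes disagree here
                out[i] = in[i] * 2;
            else
                out[i] = in[i] + 1;
        }
    }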

------
acd
Will the future not be a CPU, GPU, FPGA integrated circuit? At least for some
server applications that would make sense, to gain maximum efficiency.

[http://www.theplatform.net/2015/07/29/why-hyperscalers-and-c...](http://www.theplatform.net/2015/07/29/why-hyperscalers-and-clouds-are-pushing-intel-into-fpgas/)

~~~
dogma1138
FPGAs are never as efficient as dedicated silicon; they are flexible, but that
flexibility comes at a huge cost in both performance and materials. FPGAs are
used in situations where it's not commercially viable to spin your own chips
and where you can pass the high cost of FPGAs on to your customer.

If anything, the "future" might be in on-demand core configs, similar to how
modern ARM SoCs are built from a combination of "small" and "big" cores with
different features and performance headroom, which allows for more efficient
power usage in small devices.

So core-specific optimizations for different applications (database, web
server, video streaming, etc.), especially when combined with user-programmable
microcode, might be quite possible and much better than FPGAs. Basically, you
might have a dedicated CPU (that you might even be able to partially design
yourself by submitting VLSI code or something similar) which is specifically
tuned for your application, e.g. a web application server CPU with 4 big cores
that handle executing the application code itself, plus additional smaller
cores that handle things like HTTP, SSL, session cache, etc., each with
dedicated registers/extensions optimized for those actions, e.g. LZMA or AES
extensions.

~~~
sklogic
FPGAs with larger cells (e.g., entire ALUs) can potentially beat all the shit
out of GPUs in most use cases (because of the better control over memories).
You'll have the best of both worlds here - routing flexibility of FPGAs and
fast ASIC cells.

~~~
dogma1138
That's not really true. FPGA CPUs tend to be very slow relative to the speed
of the FPGA and to comparable ASICs.

Yes, you can program the FPGA to do, say, LZMA very quickly, but you can just
as well build a dedicated SIMD ASIC that will do the same at a much lower cost.

If you design your CPU to include big cores with the current ISA combined with
small mission-specific SIMD ASICs, it will be better than building a CPU with
additional FPGA cores, or building a CPU on an FPGA, which is insanity to
begin with.

I would concede that FPGAs are more flexible, but if you can build more or less
flexible SIMD ASICs that are user-configurable, you will get flexibility similar
to FPGAs with better performance. Still, eventually there's a limit to how many
non-generic operations the CPU handles, and there will be a fairly low limit on
how many of those will deserve dedicated silicon before you over-optimize and
lose performance.

~~~
sklogic
I am talking about really _large_ but generic cells, much bigger than the
current DSP slices. This way the FPGA becomes a programmable dataflow machine
with all the heavy lifting done by the ASIC ALU slices. And it is much more
flexible than hoarding very specific ASIC blocks.

Any problem I was solving on GPUs recently would have been implemented 100x
more efficiently on such an architecture.

~~~
dogma1138
Large and generic cells seem to be exactly what the current GPU architectures
bring. And where did the 100x number come from?

~~~
sklogic
The current GPUs have to deal with a cumbersome, often ill-fitting memory
hierarchy. What an FPGA-like model would bring is a much more flexible way to
configure the layout of the local memories. Source: experience of getting over
a 100x difference in performance with memory optimisations alone.

------
zubirus
The MPI+OpenMP duality has been around in the HPC community as a form of
heterogeneous computing. Recently, with the addition of CUDA to the mix, HPC
codes have added yet another code path; I'm skeptical about the
maintainability of such systems. It's sad, because the three tools exist to
serve the same data-parallel paradigm, albeit in different configurations. In
an alternative universe where Plan 9 had succeeded, I could see this issue
being addressed by the OS, but I imagine a well-designed library+runtime could
fill this gap perfectly well. Intel's TBB addresses the shared-memory problem
with a work graph; could we extend that philosophy to the three tools?
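
For reference, that work-graph style looks roughly like this in TBB's flow
graph API (a toy sketch with made-up node bodies; whether the same abstraction
could also dispatch to MPI ranks or GPU streams is exactly the open question):

    #include <cstdio>
    #include <tbb/flow_graph.h>

    // A two-stage work graph: a squaring node feeds a serial printing node.
    // TBB schedules the node bodies across the shared-memory cores; the edges
    // carry the data between them.
    int main() {
        tbb::flow::graph g;

        tbb::flow::function_node<int, int> square(
            g, tbb::flow::unlimited, [](int x) { return x * x; });

        tbb::flow::function_node<int, tbb::flow::continue_msg> print(
            g, tbb::flow::serial, [](int x) {
                std::printf("%d\n", x);
                return tbb::flow::continue_msg();
            });

        tbb::flow::make_edge(square, print);
        for (int i = 0; i < 8; ++i) square.try_put(i);
        g.wait_for_all();
        return 0;
    }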

------
avmich
Did anybody try to use APL or another array-oriented language for these
multicore chips?

------
chisleu
Many modern SoC octacore systems are HMP. I don't think the author is really
grasping too far into the future. These systems are here now. Google's new
Nexus phones are 6-8 core HMP as well.

------
MattSteelblade
I am tired of the misuse of Moore's Law...

