

AMD releases APPML source code, creates clMath library - jhartmann
http://developer.amd.com/community/blog/2013/08/13/amd-releases-appml-source-code-creates-clmath-library/

======
chrisballinger
Even though these libraries are open source, they cannot be built without the
proprietary, binary-only "AMD APP SDK" [1], which is only available for Linux
and Windows. Bummer.

1\. [http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/](http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/)

~~~
jhartmann
Actually it needs any OpenCL SDK, not the AMD APP SDK specifically. If you were
building for an Altera FPGA or an NVidia card, you would use their OpenCL SDK.
An OpenCL implementation is typically an LLVM frontend plus a backend for the
hardware in question (or something similar). What they are releasing is in fact
all the source for their FFT and BLAS implementations. They aren't opening up
the compiler here, just the software that runs on top of it. The compiler
itself probably has lots of stuff in it they consider competitive advantage,
so it will probably be a little while before they open that.
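To illustrate the layering: a library like clBLAS ships OpenCL C kernel source (plus host code), and whichever vendor's OpenCL SDK is present compiles that source for the device at runtime. A minimal sketch of the split, using an invented SAXPY kernel rather than actual clBLAS source:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: the kind of OpenCL C kernel source a library like
// clBLAS ships. The vendor's OpenCL compiler turns this string into
// device code at runtime, for whatever hardware is present (GPU, CPU, FPGA).
const std::string saxpy_kernel = R"(
__kernel void saxpy(float a,
                    __global const float* x,
                    __global float* y) {
    size_t i = get_global_id(0);   // one work-item per element
    y[i] = a * x[i] + y[i];
}
)";

// Plain C++ with the same semantics, for reference: y = a*x + y.
void saxpy_ref(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + y[i];
}
```

The open-sourced part is everything above the compiler: kernels like this plus the host code that enqueues them.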

------
jhartmann
I noticed this today as I was searching around for some opencl image routines,
and thought it was of general interest to the community. I really think this
is an awesome thing, and the availability of a high performance open source
BLAS that can be compiled to a wide array of OpenCL capable hardware is just
great news.

------
albertzeyer
I have recently started using ViennaCL
([http://viennacl.sourceforge.net/](http://viennacl.sourceforge.net/)). It has
a Boost uBLAS-like interface and has backends for OpenMP, OpenCL, CUDA and
uBLAS.

It is in very active development and the community is very nice and helpful.

I also don't think anything similar exists, i.e. a library which can easily do
the calculations on both the CPU and the GPU (via OpenCL).
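For a sense of what that uBLAS-like interface abstracts, here is a plain-CPU reference of a matrix-vector product; the ViennaCL equivalent appears only as a comment (the one-liner is based on its documented `viennacl::linalg::prod` interface, not checked against a specific release):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ViennaCL mirrors Boost uBLAS, so the product below is roughly
//   y = viennacl::linalg::prod(A, x);
// and that one line dispatches to OpenMP, OpenCL, or CUDA depending
// on how the library is configured. Here, the same operation (y = A*x)
// spelled out on the CPU, so the semantics being offloaded are explicit.
std::vector<float> matvec(const std::vector<std::vector<float>>& A,
                          const std::vector<float>& x) {
    std::vector<float> y(A.size(), 0.0f);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}
```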

~~~
oneofthose
There are also VexCL [0] and Boost.Compute [1]. Both are quite capable. I
am working on something similar myself; it is called Aura, but it still has a
long way to go [2]. ViennaCL, VexCL and Boost.Compute focus on maximizing
programmer productivity. Converting existing code from, e.g., Matlab to
accelerator hardware is trivial using these libraries, and you get excellent
performance quickly. Furthermore, NT2 doesn't get nearly as much publicity as
it deserves [3]. Not only does it provide an incredible number of functions
ready to use, but it also exploits the vector processing capabilities of
modern CPUs. These capabilities are too often ignored by developers or left
to suboptimal compiler optimizations to exploit.

In my own library Aura, I focus on maximizing performance (over developer
convenience). The target audience is developers of real-time applications
who need every last drop of performance from their hardware while still
maintaining a sane, cross-platform API. I strive for a Boost.Asio for
accelerator developers. Aura already has a rudimentary wrapper for clFFT;
clBLAS is in the works. So the idea is, for each platform, to utilize the
optimal vendor-supplied library functions and combine them behind a coherent
interface.

[0] [https://github.com/ddemidov/vexcl](https://github.com/ddemidov/vexcl)

[1] [https://github.com/kylelutz/compute](https://github.com/kylelutz/compute)

[2] [https://github.com/sschaetz/aura](https://github.com/sschaetz/aura)

[3] [https://github.com/MetaScale/nt2](https://github.com/MetaScale/nt2)
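A minimal sketch of that dispatch idea, with invented names (Aura's actual API may look nothing like this): one entry point, and a per-platform vendor routine selected behind it.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of "a coherent interface over vendor libraries":
// the caller asks for a GEMM; the library picks the best backend.
enum class Backend { OpenCL, Cuda, Cpu };

// In a real library each case would call the vendor routine
// (clblasSgemm, cublasSgemm, a CPU BLAS); here we only report
// which path would be taken, to show the dispatch structure.
std::string gemm_dispatch(Backend b) {
    switch (b) {
        case Backend::OpenCL: return "clBLAS";
        case Backend::Cuda:   return "cuBLAS";
        case Backend::Cpu:    return "CPU BLAS";
    }
    return "unknown";
}
```

The user-facing call stays the same on every platform; only the selected implementation changes.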

~~~
albertzeyer
Thanks a lot for the links! By the way, the VexCL author is also active in the
ViennaCL community.

At the moment, I want to re-implement some of the functionality of Theano, but
in C++. In particular, I also want to implement deep neural networks. And I
want to structure my code so that I can easily switch between different
computation backends later on, such as a CPU or multiple CPUs (hopefully with
vector processing), some GPUs (e.g. via OpenCL), or even a multi-machine
cluster.

I found many libraries that do one of these things very well, but only very
few that support multiple backends like ViennaCL does. For example,
Boost.Compute only supports OpenCL but not the CPU. VexCL, as far as I
understand, also does not support CPU calculations.

In what state is Aura? And would it be a good fit for my needs?

~~~
oneofthose
I didn't know about Theano, thanks for the hint. I have been focused on C++
libraries only, apparently missing a huge body of work.

CPU is supported through OpenCL by these libraries. Remember, OpenCL code can
run on CPUs. As for Aura, it is pre-alpha, not a lot of functionality there
yet, I'm still figuring out the interface. So not usable yet, but keep an eye
on it, it will be.

~~~
albertzeyer
Oh, I wasn't aware that OpenCL can also run on the CPU; I always thought it
was GPU-only. Thanks for pointing that out! How fast is it? Is it comparable
to uBLAS or ATLAS or so? Can it scale to multiple CPU cores? Does it use SSE
or similar techniques? Or does that depend on the implementation? What can I
expect on a common desktop PC?

~~~
oneofthose
I'm not overly familiar with OpenCL on the CPU myself; I only know it works.
But people are doing it [0,1]. Whether or not it uses SSE/multiple cores
depends on the specific OpenCL backend that is used, but it should use both.
I know Intel does this for both their Phi and regular CPUs.

For me, the most important thing in both CUDA and OpenCL is the programming
model. It allows us to describe data-parallel problems and the related data
(in)dependence explicitly. Compilers should be (and already are) able to
generate efficient code from this. It is not as nice as it could be; we have
to write kernels by hand, etc. But there are libraries that make our lives
easier (we discussed them earlier), and there is also C++ AMP, which tries to
integrate better. Still, while we have all these options and can solve most
of our problems with more or less effort and elegance, I believe there must
be something better out there: the right way to describe data-parallel and
task-parallel problems, as well as concurrency, etc. Maybe the FP guys are
on to something, I don't know. I'll be on the lookout.

[0]
[http://www.pds.ewi.tudelft.nl/fileadmin/pds/homepages/shenji...](http://www.pds.ewi.tudelft.nl/fileadmin/pds/homepages/shenjie/papers/Shen_PDP2013.pdf)
[1]
[http://comparch.gatech.edu/hparch/papers/lee_plc2013.pdf](http://comparch.gatech.edu/hparch/papers/lee_plc2013.pdf)
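The model in a nutshell, sketched on the CPU: a kernel is a function of a global index, invoked once per work-item, and because each invocation touches only its own index, the runtime is free to map the invocations onto threads, SIMD lanes, or GPU cores. This is an illustration of the idea, not any real runtime's API:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Emulate the OpenCL/CUDA execution model sequentially: run `kernel`
// once per work-item, passing it its global id. A real runtime would
// execute these invocations in parallel, which is safe precisely
// because the model forces each one to work on its own index.
void run_kernel(std::size_t global_size,
                const std::function<void(std::size_t)>& kernel) {
    for (std::size_t gid = 0; gid < global_size; ++gid)
        kernel(gid);
}
```

Usage mirrors an OpenCL vector-add kernel: `run_kernel(n, [&](std::size_t gid) { c[gid] = a[gid] + b[gid]; });` expresses the data independence explicitly, just like `get_global_id(0)` indexing does in OpenCL C.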

------
foxhill
As cool as this is, it's dated August.

It's still (practically) impossible to run a proper LINPACK benchmark with
open-source tools on AMD GPUs. Still, this is a step in the right direction
and, more importantly, a big blow to CUDA.

~~~
ykl
How exactly is this a blow to CUDA though? NVIDIA has been shipping CUDA
versions of BLAS and FFT (see CUBLAS and CUFFT) for years now.

~~~
jhartmann
While CUBLAS is available out there, it cannot be run on such a wide array of
hardware. This is probably of great interest to startups wanting to do things
on FPGAs and mobile devices: they now have a well-optimized open source math
library to use as a potential building block, and it has an Apache License.
CUBLAS is only good for NVidia cards. So I actually think this could be a big
deal, and could potentially reduce the popularity of CUDA over time.

~~~
jjoonathan
I sure hope it does. I have paid entirely too much for my predecessor's CUDA
lessons through the premium NV charges for their cards.

Unfortunately, I'm pretty sure NV is entrenched at this point. My AMD card and
my resolution to port everything I needed to use lasted about 3 months, and
that was without any CUBLAS dependencies :(

