
Neanderthal 0.8.0: CPU and GPU support on Linux, Windows, and OS X - dragandj
http://neanderthal.uncomplicate.org/articles/news/release-0.8.0.html
======
savodj
Hello, Windows 10 user here. How can I be sure I'm using optimized atlas? I
tried building it earlier, but with no success. I found some info on atlas
forum for older versions, but couldn't get it to work.

~~~
dragandj
Easy: when it works, it uses whatever ATLAS version is installed on the system,
which means it will use the optimized ATLAS that you provided. That does mean
you have to build ATLAS yourself, which has been problematic on Windows, BUT
the author recently released version 3.10.3, and one of the major improvements
is that it builds fine on Windows with Cygwin and MinGW.
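For reference, a rough sketch of the usual out-of-tree ATLAS build, which is the same flow under Cygwin/MinGW; the `-b 64` flag and the tarball name are illustrative, and ATLAS's own INSTALL.txt is the authority for your exact CPU and platform:

```shell
# ATLAS must be built outside its source tree.
# Disable CPU frequency scaling first: ATLAS tunes itself by timing
# kernels, and configure will complain if the clock is throttling.
tar xjf atlas3.10.3.tar.bz2          # unpacks into ATLAS/
mkdir ATLAS-build && cd ATLAS-build
../ATLAS/configure -b 64             # 64-bit build; see INSTALL.txt for more flags
make build                           # runs the (long) auto-tuning search and build
make check                           # sanity-test the tuned libraries
make install                         # installs to the configured prefix
```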

------
dragandj
Author here, feel free to ask me anything.

------
osivertsson
Not knowing what this was, I pressed the "Neanderthal" link at the top left,
but this got me a 404 at
[http://neanderthal.uncomplicate.org/articles/](http://neanderthal.uncomplicate.org/articles/)

Apart from that, looks good!

~~~
dragandj
Fortunately, that link should only lead to the main page of the site. I'll
repair that immediately.

------
akssri
Sweet! Did you consider using AutoGEMM.py from clBLAS instead of a static GEMM
kernel?

I was considering using a polyhedral compiler macro (like PPCG) for writing
OpenCL kernels in matlisp, but it's not clear how optimal this would be.

~~~
dragandj
clBLAS is, in my opinion, hard to build AND hard to integrate.

On top of that, this approach gives better performance in most cases even on
AMD, and especially on Nvidia. I have AMD hardware myself, but it is better to
create a more broadly useful library, so I avoided clBLAS :)

When I need to write my own OpenCL kernels, I use ClojureCL - it gives me easy
management while still retaining full control of the kernels and their
performance.

~~~
akssri
I found that the latest version of clBLAS achieves a fantastic ~4 TFLOPS on
Fiji (on 2^n matrices). NVBLAS has probably had more resources allocated to it
than clBLAS. I'd be positively surprised if the kernels in Neanderthal beat
those. Do you plan on adding benchmarks for the GPU calls? I can help run the
clBLAS benchmarks, if you like, since I have a tuned setup.

If you want to take a look at it, the AutoGemm generator seems to be a simple
Python script written to overcome the limitations of the C preprocessor. I was
considering using its tiling structure, since I already have a Lisp->OpenCL
compiler in place (and have had no luck beating it). See, for instance,
for instance,

[https://github.com/matlisp/matlisp-opencl/blob/master/tests/...](https://github.com/matlisp/matlisp-opencl/blob/master/tests/blocked-gemm.lisp#L25)

[https://github.com/matlisp/matlisp-opencl/blob/master/src/te...](https://github.com/matlisp/matlisp-opencl/blob/master/src/tensor/copy.lisp#L15)

~~~
dragandj
What is the theoretical max FLOPS on that Fiji card? I achieve 3.75 TFLOPS on
Hawaii, which has much less power than Fiji...

~~~
akssri
I think it's about 5.6 TFLOPS. Wow, 3.75 TFLOPS on Hawaii is very good indeed;
I agree this is not something clBLAS would beat by a wide margin, if at all.

~~~
dragandj
Judging by this page,
[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processin...](https://en.wikipedia.org/wiki/List_of_AMD_graphics_processin...),
Fiji has more than 8 TFLOPS.
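As a sanity check on these figures, the theoretical single-precision peak of a GCN card is shader cores x clock x 2 FLOPs per cycle (one fused multiply-add per core per cycle). The core counts and clocks below are the commonly quoted spec-sheet numbers for the flagship Hawaii (R9 290X) and Fiji (Fury X) parts, so treat them as assumptions:

```python
# Theoretical single-precision peak: shader cores x clock (GHz) x 2 FLOPs
# per cycle (one fused multiply-add per core per cycle on GCN).

def peak_tflops(shader_cores, clock_ghz, flops_per_cycle=2):
    """Theoretical peak throughput in TFLOPS."""
    return shader_cores * clock_ghz * flops_per_cycle / 1000.0

hawaii = peak_tflops(2816, 1.0)   # R9 290X: ~5.6 TFLOPS peak
fiji = peak_tflops(4096, 1.05)    # R9 Fury X: ~8.6 TFLOPS peak

# 3.75 TFLOPS measured on Hawaii is roughly two thirds of its peak.
print(f"Hawaii peak: {hawaii:.1f} TFLOPS ({3.75 / hawaii:.0%} achieved at 3.75)")
print(f"Fiji peak:   {fiji:.1f} TFLOPS")
```

So the ~5.6 TFLOPS figure above matches Hawaii's peak rather than Fiji's, which is consistent with the Wikipedia table showing Fiji above 8 TFLOPS.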

~~~
akssri
Ah, is this
[https://github.com/CNugteren/CLBlast](https://github.com/CNugteren/CLBlast) ?

I was looking at,
[https://github.com/uncomplicate/neanderthal/blob/master/src/...](https://github.com/uncomplicate/neanderthal/blob/master/src/opencl/uncomplicate/neanderthal/opencl/kernels/amd_gcn/blas.cl)

~~~
dragandj
Yep, CLBlast. My old kernels are from the pre-CLBlast era. Now they are
deprecated.

------
santaclaus
Looks cool! Is there any chance of getting sparse operations, at some point?

~~~
dragandj
Of course. Sparse operations are on the TODO list. I would have already added
them if I needed them, so there are two options:

1) Wait until I need them.

1a) Become active in the community and bug me often enough that I realize how
important it is :)

2) Contribute sparse library integrations (I'll help).

