Neanderthal 0.8.0: CPU and GPU support on Linux, Windows, and OS X (uncomplicate.org)
53 points by dragandj on Oct 9, 2016 | hide | past | favorite | 17 comments



Hello, Windows 10 user here. How can I be sure I'm using an optimized ATLAS? I tried building it earlier, but with no success. I found some info on the ATLAS forum for older versions, but couldn't get it to work.


Easy: if it works, it uses the ATLAS version that is installed on the system, which means that when it works, it uses the optimized ATLAS that you provided. That does mean you have to build ATLAS yourself, which is problematic on Windows, BUT the author recently released version 3.10.3, and one of the major improvements is that it builds fine on Windows with cygwin and mingw.


Author here, feel free to ask me anything.


Not knowing what this was I pressed the "Neanderthal" link at the top left, but this got me a 404 at http://neanderthal.uncomplicate.org/articles/

Apart from that, looks good!


That link should just lead to the main page of the site. I'll fix it immediately.


Sweet! Did you consider using AutoGEMM.py from clBLAS instead of a static GEMM kernel?

I was considering using a polyhedral compiler macro (like PPCG), for writing OpenCL kernels in matlisp, but it's not clear how optimal this would be.


clBLAS is, in my opinion, hard to build AND hard to integrate.

On top of that, this approach gives better performance in most cases, even on AMD, and especially on Nvidia. I have AMD hardware myself, but it is better to create an overall more encompassing library, so I avoided clBLAS :)

When I need to write my own OpenCL kernels, I use ClojureCL - it gives me easy management while still retaining full control of the kernels and their performance.


I found the latest version of clBLAS on Fiji achieves a fantastic ~4 Tflops (on 2^n matrices). NVblas has probably had more resources allocated to it than clBLAS. I'd be positively surprised if the kernels in Neanderthal beat those. Do you plan on adding benchmarks for the GPU calls? I can help running the clBLAS benchmarks, if you like, since I have a tuned setup.
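For reference, GEMM throughput figures like the one above are conventionally computed by counting 2·n³ floating-point operations for an n×n multiply (a multiply and an add per inner-loop step). A minimal sketch of the accounting, with an illustrative timing, not a measured one:

```python
def gemm_flops(n):
    # An n x n GEMM does ~n multiplies and ~n adds per output element,
    # for n*n outputs: conventionally counted as 2 * n^3 flops.
    return 2 * n ** 3

def tflops(n, seconds):
    # Achieved throughput, in Tflops, for one n x n GEMM.
    return gemm_flops(n) / seconds / 1e12

# Hypothetical example: a 4096 x 4096 GEMM finishing in ~34 ms
# would sustain roughly 4 Tflops.
print(round(tflops(4096, 0.034), 2))
```

The same formula is what benchmark suites divide by wall-clock time to report Gflops/Tflops.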

If you want to take a look at it, the AutoGemm generator seems to be a simple python script written in order to overcome the limitations of the C preprocessor. I was considering using its tiling structure, since I already have a Lisp->OpenCL compiler in place (and have had no luck beating it). See, for instance,

https://github.com/matlisp/matlisp-opencl/blob/master/tests/...

https://github.com/matlisp/matlisp-opencl/blob/master/src/te...
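The tiling structure AutoGemm generates can be sketched in plain Python as a toy blocked multiply. This is only an illustration of the traversal order, not the generated OpenCL; the function name and tile size are made up for the example:

```python
def blocked_gemm(A, B, n, tile=2):
    # Toy blocked matrix multiply over n x n row-major lists of lists.
    # Real GPU kernels map each (i0, j0) tile to a work-group and stage
    # A/B tiles in local memory; here the loop nest just shows the
    # tile-by-tile traversal that improves data reuse.
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

The result is identical to a naive triple loop; only the memory-access pattern changes, which is where the tuning space (tile shapes, unrolling) that AutoGemm explores comes from.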


What is the theoretical MAX flops on that Fiji card? I achieve 3.75 TFLOPS on Hawaii, which has much less power than Fiji...


I think it's about 5.6 Tflops. Wow, 3.75 Tflops on Hawaii is very good indeed; I agree this is not something that clBLAS would beat by a wide margin, if at all.


Judging by this page, https://en.wikipedia.org/wiki/List_of_AMD_graphics_processin..., Fiji has more than 8 TFLOPS.
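For GCN cards, the theoretical single-precision peak is stream processors × clock × 2, since a fused multiply-add counts as two flops per cycle. Plugging in the commonly listed specs (Fiji/Fury X: 4096 stream processors at ~1050 MHz; Hawaii/R9 290X: 2816 at ~1000 MHz), the arithmetic matches both figures in this thread:

```python
def peak_tflops(stream_processors, clock_mhz):
    # One FMA per stream processor per cycle = 2 flops.
    return stream_processors * clock_mhz * 1e6 * 2 / 1e12

print(peak_tflops(4096, 1050))  # Fiji (Fury X): ~8.6 Tflops
print(peak_tflops(2816, 1000))  # Hawaii (R9 290X): ~5.6 Tflops
```

So the 5.6 Tflops figure is Hawaii's peak, and Fiji's is indeed above 8 Tflops.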



Yep, CLBlast. My old kernels are from the pre-CLBlast era. Now they are deprecated.


Ah, you're right; 5.6 Tflops is for Hawaii. That's doubly impressive. I'll make sure to try your kernel then. Thank you!


Actually, the man who writes those amazing kernels is Cedric Nugteren. I call HIS kernels.


Looks cool! Is there any chance of getting sparse operations, at some point?


Of course. Sparse operations are on the to-do list. I would have already added them if I needed them, so there are two options:

1) Wait until I need them.

1a) Become active in the community and bug me often enough that I realize how important it is :)

2) Contribute sparse library integrations (I'll help).
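For anyone wondering what "sparse operations" boil down to, the core primitive is usually a matrix-vector product over a compressed format such as CSR. A minimal sketch in Python, following the standard CSR convention (the names here are generic, not anything from Neanderthal's API):

```python
def csr_matvec(data, indices, indptr, x):
    # y = A @ x for A stored in CSR form: `data` holds the nonzeros
    # row by row, `indices` their column positions, and `indptr` the
    # offsets where each row's nonzeros begin and end.
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
data, indices, indptr = [1, 2, 3, 4, 5], [0, 2, 2, 0, 1], [0, 2, 3, 5]
print(csr_matvec(data, indices, indptr, [1, 1, 1]))  # [3.0, 3.0, 9.0]
```

Integrating an existing sparse BLAS (as option 2 suggests) means wiring formats like this one to a tuned backend rather than hand-writing the loops.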



