Neanderthal 0.8.0: CPU and GPU support on Linux, Windows, and OS X (uncomplicate.org)
53 points by dragandj on Oct 9, 2016 | hide | past | favorite | 17 comments



Hello, Windows 10 user here. How can I be sure I'm using an optimized ATLAS? I tried building it earlier, but with no success. I found some info on the ATLAS forum for older versions, but couldn't get it to work.


Easy: if it works, it uses the ATLAS version that is installed on the system, which means that when it works, it uses the optimized ATLAS that you provided. That does mean you have to build ATLAS yourself, which is problematic on Windows, BUT the author recently released version 3.10.3, and one of the major improvements is that it builds fine on Windows with cygwin and mingw.


Author here, feel free to ask me anything.


Not knowing what this was I pressed the "Neanderthal" link at the top left, but this got me a 404 at http://neanderthal.uncomplicate.org/articles/

Apart from that, looks good!


That link should just lead to the main page of the site. I'll fix it immediately.


Sweet! Did you consider using AutoGEMM.py from clBLAS instead of a static GEMM kernel?

I was considering using a polyhedral compiler macro (like PPCG), for writing OpenCL kernels in matlisp, but it's not clear how optimal this would be.


clBLAS is, in my opinion, hard to build AND hard to integrate.

On top of that, this approach gives better performance in most cases, even on AMD, and especially on Nvidia. I have AMD hardware myself, but it is better to create an overall more encompassing library, so I avoided clBLAS :)

When I need to write my own OpenCL kernels, I use ClojureCL - it gives me easy management while still retaining full control of the kernels and their performance.


I found the latest version of clBLAS on Fiji achieves a fantastic ~4 Tflops (on 2^n matrices). NVblas has probably had more resources allocated to it than clBLAS. I'd be positively surprised if the kernels in Neanderthal beat those. Do you plan on adding benchmarks for the GPU calls? I can help running the clBLAS benchmarks, if you like, since I have a tuned setup.
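For reference, GEMM throughput figures like the one above are conventionally computed by counting 2·n³ floating-point operations for an n×n multiply (a multiply and an add per inner-loop step). A minimal sketch of the accounting, with an illustrative timing, not a measured one:

```python
def gemm_flops(n):
    # An n x n GEMM does ~n multiplies and ~n adds per output element,
    # for n*n outputs: conventionally counted as 2 * n^3 flops.
    return 2 * n ** 3

def tflops(n, seconds):
    # Achieved throughput, in Tflops, for one n x n GEMM.
    return gemm_flops(n) / seconds / 1e12

# Hypothetical example: a 4096 x 4096 GEMM finishing in ~34 ms
# would sustain roughly 4 Tflops.
print(round(tflops(4096, 0.034), 2))
```

The same formula is what benchmark suites divide by wall-clock time to report Gflops/Tflops.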

If you want to take a look at it, the AutoGemm generator seems to be a simple python script written in order to overcome the limitations of the C preprocessor. I was considering using its tiling structure, since I already have a Lisp->OpenCL compiler in place (and have had no luck beating it). See, for instance,

https://github.com/matlisp/matlisp-opencl/blob/master/tests/...

https://github.com/matlisp/matlisp-opencl/blob/master/src/te...
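The tiling structure AutoGemm generates can be sketched in plain Python as a toy blocked multiply. This is only an illustration of the traversal order, not the generated OpenCL; the function name and tile size are made up for the example:

```python
def blocked_gemm(A, B, n, tile=2):
    # Toy blocked matrix multiply over n x n row-major lists of lists.
    # Real GPU kernels map each (i0, j0) tile to a work-group and stage
    # A/B tiles in local memory; here the loop nest just shows the
    # tile-by-tile traversal that improves data reuse.
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

The result is identical to a naive triple loop; only the memory-access pattern changes, which is where the tuning space (tile shapes, unrolling) that AutoGemm explores comes from.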


What is the theoretical MAX flops on that Fiji card? I achieve 3.75 TFLOPS on Hawaii, which has much less power than Fiji...


I think it's about 5.6 Tflops. Wow, 3.75 Tflops on Hawaii is very good indeed; I agree this is not something that clBLAS would beat by a wide margin, if at all.


Judging by this page, https://en.wikipedia.org/wiki/List_of_AMD_graphics_processin..., Fiji has more than 8 TFLOPS.
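For GCN cards, the theoretical single-precision peak is stream processors × clock × 2, since a fused multiply-add counts as two flops per cycle. Plugging in the commonly listed specs (Fiji/Fury X: 4096 stream processors at ~1050 MHz; Hawaii/R9 290X: 2816 at ~1000 MHz), the arithmetic matches both figures in this thread:

```python
def peak_tflops(stream_processors, clock_mhz):
    # One FMA per stream processor per cycle = 2 flops.
    return stream_processors * clock_mhz * 1e6 * 2 / 1e12

print(peak_tflops(4096, 1050))  # Fiji (Fury X): ~8.6 Tflops
print(peak_tflops(2816, 1000))  # Hawaii (R9 290X): ~5.6 Tflops
```

So the 5.6 Tflops figure is Hawaii's peak, and Fiji's is indeed above 8 Tflops.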



Yep, CLBlast. My old kernels are from the pre-CLBlast era. Now they are deprecated.


Ah, you're right; 5.6 Tflops is for Hawaii. That's doubly impressive. I'll make sure to try your kernel then. Thank you!


Actually, the man who writes those amazing kernels is Cedric Nugteren. I call HIS kernels.


Looks cool! Is there any chance of getting sparse operations, at some point?


Of course. Sparse operations are on the to-do list. I would have already added them if I needed them, so there are two options:

1) Wait until I need them.

1a) Become active in the community and bug me often enough that I realize how important it is :)

2) Contribute sparse library integrations (I'll help).
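For anyone wondering what "sparse operations" boil down to, the core primitive is usually a matrix-vector product over a compressed format such as CSR. A minimal sketch in Python, following the standard CSR convention (the names here are generic, not anything from Neanderthal's API):

```python
def csr_matvec(data, indices, indptr, x):
    # y = A @ x for A stored in CSR form: `data` holds the nonzeros
    # row by row, `indices` their column positions, and `indptr` the
    # offsets where each row's nonzeros begin and end.
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
data, indices, indptr = [1, 2, 3, 4, 5], [0, 2, 2, 0, 1], [0, 2, 3, 5]
print(csr_matvec(data, indices, indptr, [1, 1, 1]))  # [3.0, 3.0, 9.0]
```

Integrating an existing sparse BLAS (as option 2 suggests) means wiring formats like this one to a tuned backend rather than hand-writing the loops.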



