Hacker News | bsprings's comments

For most formats it achieves >90% of the performance of the hand-tuned assembly implementations in cuBLAS.


When I originally wrote the post in 2013, the GPU compilation part of Numba was a product (from Anaconda Inc., née Continuum Analytics) called NumbaPro. It was part of a commercial package called Anaconda Accelerate that also included wrappers for CUDA libraries like cuBLAS, as well as MKL acceleration on the CPU.

Continuum gradually open sourced all of it (and changed their name to Anaconda). The compiler functionality is all open source within Numba. Most recently they released the CUDA library wrappers in a new open source package called pyculib.

Some other minor things changed, such as what you need to import. Also, the autojit and cuda.jit functionality is a bit better at type inference, so you don't have to annotate all the types to get it to compile.

We thought it was a good idea to update the post in light of all the changes.


Tensor Cores: 120 TFLOP/s mixed-precision (peak). Typo in your table: V100 FP64 is 7.5 TFLOP/s.


(Post author here.) Curious to hear more details about your workload, because a 5+-year-old Fermi would truly be hard pressed to outperform Maxwell or even a Kepler K40, let alone Pascal.


It's parameter sweeps of delay differential equations, one simulation per thread. This requires a lot of complex array indexing and global memory access, so the arithmetic intensity is far from optimal. Still, it's a real-world workload that benefits hugely from GPU acceleration.

Moving from a GTX 480 to a Kepler or Maxwell card, the spec-sheet numbers go up, but my measured performance doesn't. I might have a corner case, but before investing in new hardware, I would want to benchmark first rather than blindly follow the numbers.


People bought 400-series cards for their compute performance long after they were outdated. If your software wanted an NVIDIA card, it was either that or step up to a Quadro. People bought the first Titan for the same reason.


(Post author here.) Yes, FP16 has been supported in NVIDIA GPUs as a texture format "forever" -- since before it was incorporated into the IEEE 754 standard. Indeed, what is new in GP100 is hardware ALU support (and note that denormals run at full speed, which is even more important for lower-precision formats).
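For reference, the binary16 format's range, including the denormals mentioned above, can be inspected from the CPU with nothing but the Python standard library (struct has supported the IEEE 754 half-precision 'e' format code since Python 3.6):

```python
import struct

def fp16_from_bits(bits):
    """Reinterpret a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

# binary16 layout: 1 sign bit, 5 exponent bits, 10 fraction bits
print(fp16_from_bits(0x0400))  # smallest normal: 2**-14 ~ 6.1e-5
print(fp16_from_bits(0x0001))  # smallest denormal: 2**-24 ~ 6.0e-8
print(fp16_from_bits(0x7BFF))  # largest finite value: 65504.0
```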


FYI, nvprof works quite well with MPI, as described in this blog post by Jiri Kraus: http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-profi...

To use nvprof with MPI, you just need to ensure nvprof is available on the cluster nodes and run it as your mpirun target, e.g. `mpirun ... nvprof ./my_mpi_program`

You can have it dump its output to files that the NVIDIA Visual Profiler (NVVP) is able to load. You can even load the output from multiple MPI ranks into NVVP to visualize them on the same timeline, making it easier to spot issues.


"unlike a real datacenter, it's only good for floats." That's actually not true. You'll find integer throughput is rather high on GPUs also. And the memory bandwidth is very high too. "miniature 4-function calculators": make that 5, where the fifth function is a really fast special function unit that can do very fast sin, cos, sqrt, 1/sqrt, and many other functions.


Hi varelse, can you tell me more about your profiling use case? nvprof should support MPI profiling scenarios, but perhaps yours is different. I'd love to know details so I can help improve the product. Feel free to contact me at first initial last name at nvidia.com (name is Mark Harris).


This work was not done by NVIDIA, it was done by MapD, a startup. NVIDIA is promoting the work of a partner; it's a guest post on the NVIDIA developer blog.


This is part two in an in-depth series on Neural Machine Translation by Kyunghyun Cho, a leading expert on machine translation (postdoc at U. Montreal, joining the NYU faculty in the fall).

