Hacker News
How I Found CUDA, Or: Rewriting the Tag Engine (marcgravell.com)
49 points by philliphaydon on May 9, 2016 | 14 comments

This post suggests that Stack Overflow has about 30,000 distinct tags, which seems like a good fit for a bitmap index.


Not that that's the whole solution, but certainly something I'd look into before digging into CUDA.

I can tell that you wanted a clear distinction between implementations for comparison's sake. And I don't doubt that you could probably achieve higher performance in each respective branch with CUDA and C++ for the GPU and CPU branches.

I get why you'd choose CUDA over OpenCL for pure GPGPU code, but when faced with the requirement of having both a CPU and a GPU version, why not use OpenCL to keep both code paths as close as possible? Did you write it so both branches live in a single build that selects the optimal path at runtime by inspecting what hardware is available? Or do you maintain two separate builds that are deployed to different servers?

You can actually go the other route too. There's a library called Thrust that provides an STL-like interface to CUDA, as well as "glue" functions that go between the finely-tuned stuff.

One of the things Thrust supports is an OpenMP backend, so it can emit code that runs on a CPU instead of a GPU. I've never tried it so I don't know how good it is, particularly on the collective operations. But I assume if what you want is a "#pragma omp parallel for 1-to-N" then it's probably great. Construct your operation as a functor and use a thrust::for_each from 1 to N.

Hmm, interesting, I had no idea that existed. Looking forward, I have high hopes for Vulkan becoming the foundation for cross-platform, cross-vendor heterogeneous computing.

As it stands, the raw API is pretty rough to use, but I am excited to see what interfaces will be built on top of it to make heterogeneous computing more accessible and maintainable. I'm thinking something like PyOpenCL for Vulkan.

Works reasonably well, but it absolutely ruins your compile times.

Yeah, I totally agree. It's not just the OpenMP backend, it's Thrust in general. I had a project that was probably 5-10k LOC and it took like 2 minutes to compile with Thrust. I'm pretty sure nvcc doesn't handle the templating very well.

A couple of comments here: GPGPU programming works really well if you can construct your problem as a sorting problem (e.g. "sort results so interesting tags are at the start of the array"). From what I've seen this is reliably the most successful approach.

One problem arises when the number of output items from a given input item is unpredictable. You typically want to run a prefix scan within a block to compact the block's output into a single chunk, until your shared-memory buffer is full or you've reached the end of the data set. Then you do an atomic increment on a global "slots-used" counter for your output buffer and write the whole chunk at once. This is a lot more complex than sorting, but reasonably fast.

If you want to guarantee that your output will fit in the buffer, you need to run the operation twice. The first run is a "dummy" that writes no output and just figures out the last input element that can write data without overflowing the output buffer. In some cases it may be preferable to just allocate a reasonable guesstimate of the 95% case and not write past the end if a true worst-case scenario happens.

Try like hell to keep everything in registers or shared memory and avoid writing intermediate data. Heavy use of global memory kills performance, both directly (every trip to global memory incurs a lot of latency) and indirectly (GPUs show the greatest speedup when working on the largest possible working set).

Surprised people are still using CUDA. Is OpenCL still not as good? Or is the vendor lock-in not a problem?

> is the vendor lock-in not a problem?

It so happened that, early on, almost nobody in the (previously tiny) industry specializing in GPU compute used cards other than NVIDIA's. NVIDIA listened to that niche market, releasing increasingly performant hardware for compute needs, while other manufacturers kept ignoring the niche and focusing on graphics. As a result, very few on the software side have a burning desire to do free work to support cards made by anyone other than the Big Green, even if they had equivalent performance (they likely don't).

This may seem strange to web devs who remember the hard-fought war against Microsoft's lock-in on the Internet via IE. But NVIDIA is seen in a different light: a benevolent company without which the GPU-for-compute platform wouldn't even exist.

AMD GPUs have consistently had higher compute performance than Nvidia's. The Fury X, for example, consistently outperforms the Titan X at half the price.[1] That's why AMD has completely dominated the bitcoin/cryptohashing scene: money is involved, so people are willing to spend time optimizing for different hardware.

Nvidia captured this market using typical vendor lock-in tactics: they didn't support OpenCL 1.2 until many years after it was released, and they still don't support OpenCL 2.0+.

[1] http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-...

I should have qualified "compute". I meant single- and double-precision floating-point operations that are not graphics. Yes, AMD GPUs outperform NVIDIA on cryptohashing benchmarks which require high integer performance.

There are multiple versions of OpenCL, with different hardware vendors supporting different versions. Each vendor supplies its own OpenCL compiler and drivers, which may not generate code as performant as other vendors' or third-party compilers and drivers targeting the same hardware. NVIDIA's support for OpenCL has been lacking. CUDA also has more hardware-specific features than OpenCL, which may or may not be relevant for performance depending on the problem. There are also other competing parallel programming models, such as OpenACC, which for example is promoted by Intel for Xeon Phi. Apple has also been criticized by developers over the state of its OpenCL support. Not the greatest story for a standard.

The last time I really dove into this was a couple of years ago. At that point CUDA had more features, which allowed for more complex data structures, leading to smarter algorithms and an all-around less painful time programming.

OpenCL has come along nicely in the meantime, but I suspect CUDA has maintained its lead and now has momentum with many developers.

&& / ?: may not be "branching operations"; the compiler could determine that the operands have no side effects (q->score is used in both branches, and q must be either a valid or an invalid pointer in both) and collapse that down to just register-select operations and math internally. Or it may involve predication, with only the assignment of i to a register previously filled with 0 being predicated. The compiler has internal heuristics to determine what it should accomplish via math only versus predication versus branching, based on how long the code sequences are, whether it can determine if the predicates are warp-uniform, etc. You would need to look at the SASS to see what is really going on. Furthermore, the hardware may have more or fewer functional units for whichever kind of operation you think of substituting for the original; let the compiler choose first.

However, this is not the level of optimization one should be concerned with first (or even fifth) in a problem like this in CUDA. This kind of algorithm should be global memory bandwidth bound only. Concentrate on scanning through the data once and compacting as you go, so you touch as little memory as possible.
