
How I Found CUDA, Or: Rewriting the Tag Engine - philliphaydon
http://blog.marcgravell.com/2016/05/how-i-found-cuda-or-rewriting-tag_9.html
======
SloopJon
This post suggests that Stack Overflow has about 30,000 distinct tags, which
seems like a good fit for a bitmap index:

https://mattwarren.org/2015/08/19/the-stack-overflow-tag-engine-part-2/

Not that that's the whole solution, but certainly something I'd look into
before digging into CUDA.
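For anyone unfamiliar with the idea: a bitmap index keeps one bitmap per tag, with one bit per question, so intersecting two tags is a word-wise AND. A minimal sketch (all names hypothetical, not from the linked post):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One bitmap per tag: bit i set => question i carries this tag.
struct TagBitmap {
    std::vector<uint64_t> words;

    void set(size_t question) {
        size_t w = question / 64;
        if (w >= words.size()) words.resize(w + 1, 0);
        words[w] |= uint64_t(1) << (question % 64);
    }

    bool test(size_t question) const {
        size_t w = question / 64;
        return w < words.size() && ((words[w] >> (question % 64)) & 1);
    }
};

// Questions tagged with both a and b: AND the bitmaps word by word.
TagBitmap intersect(const TagBitmap& a, const TagBitmap& b) {
    TagBitmap out;
    size_t n = std::min(a.words.size(), b.words.size());
    out.words.resize(n);
    for (size_t i = 0; i < n; ++i)
        out.words[i] = a.words[i] & b.words[i];
    return out;
}
```

With ~30,000 tags and a few million questions this stays small enough to keep in RAM, and compressed variants shrink it further.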

------
gnarbarian
I can tell that you wanted a clear distinction between implementations for
comparison's sake. And I don't doubt that you could probably achieve higher
performance in each respective branch with CUDA and C++ for the GPU and CPU
branches.

I get why you'd choose CUDA over OpenCL for pure GPGPU code, but when faced
with the requirement of having both a CPU and a GPU version, why not use
OpenCL to keep both code paths as close as possible? Did you write it so both
branches live in a single build that selects the optimal path at runtime by
looking at what hardware is available? Or do you maintain two separate builds
that are deployed to different servers?

~~~
paulmd
You can actually go the other route too. There's a library called Thrust that
provides an STL-like interface to CUDA, as well as "glue" functions that go
between the finely-tuned pieces.

One of the things Thrust supports is an OpenMP backend, so it can emit code
that runs on a CPU instead of a GPU. I've never tried it, so I don't know how
good it is, particularly on the collective operations. But I assume that if
what you want is a "#pragma omp parallel for" from 1 to N, then it's probably
great. Construct your operation as a functor and use a thrust::for_each from
1 to N.
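The functor-plus-for_each pattern looks roughly like this. Shown here with std::for_each over an index range as a plain-C++ stand-in for thrust::for_each with a counting iterator; the functor and its fields are hypothetical examples, not from the article:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical per-element operation, written as a functor so the same
// body could be handed to thrust::for_each on a GPU.
struct ScoreTag {
    const int* weights;  // would be a device pointer under Thrust
    int* scores;
    void operator()(int i) const { scores[i] = weights[i] * 2; }
};

// CPU stand-in for:
//   thrust::for_each(counting_iterator(0), counting_iterator(n), ScoreTag{...});
void score_all(const std::vector<int>& weights, std::vector<int>& scores) {
    std::vector<int> idx(weights.size());
    std::iota(idx.begin(), idx.end(), 0);  // indices 0..n-1
    std::for_each(idx.begin(), idx.end(),
                  ScoreTag{weights.data(), scores.data()});
}
```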

~~~
maccard
Works reasonably well, but it absolutely ruins your compile times.

~~~
paulmd
Yeah, I totally agree. It's not just the OpenMP backend, it's Thrust in
general. I had a project that was probably 5-10k LOC and it took like 2
minutes to compile with Thrust. I'm pretty sure nvcc doesn't handle the
templating very well.

------
paulmd
A couple of comments here: GPGPU programming is really good if you can
construct your problem as a sorting problem (e.g. "sort results so interesting
tags are at the start of the array"). From what I've seen this is reliably the
most successful approach.

One problem arises when the number of output items produced by a given input
item is unpredictable. You typically want to run a prefix scan within a block
to compact the block's output into a single chunk, until your shared-memory
buffer is full or you've reached the end of the data set. Then you do a single
atomic increment on a global "slots-used" counter for your output buffer and
write the chunk in one go. This is a lot more complex than sorting, but
reasonably fast.
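A CPU model of that pattern, using std::atomic in place of atomicAdd and a local vector in place of the shared-memory buffer (predicate and names are hypothetical; on the GPU the compaction step would be a block-wide prefix scan rather than a serial loop):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Each "block" compacts its matching items into a local staging buffer,
// then reserves a contiguous range in the global output with ONE atomic
// add and writes the whole chunk at once.
void emit_block(const int* input, std::size_t begin, std::size_t end,
                std::vector<int>& output, std::atomic<std::size_t>& slots_used) {
    std::vector<int> staging;              // stands in for shared memory
    for (std::size_t i = begin; i < end; ++i)
        if (input[i] % 2 == 0)             // hypothetical "interesting" predicate
            staging.push_back(input[i]);

    // One atomic increment per block, not per element.
    std::size_t base = slots_used.fetch_add(staging.size());
    for (std::size_t j = 0; j < staging.size(); ++j)
        output[base + j] = staging[j];
}
```

The payoff is that contention on the global counter scales with the number of blocks, not the number of output elements.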

If you want to guarantee that your output will fit in the buffer, you need to
run the operation twice. The first run is a "dummy" that writes no output; it
just figures out the last input element that can write data without the output
buffer overflowing. In some cases it may be preferable to allocate a
reasonable guesstimate covering the 95% case and simply not write past the end
if a true worst-case scenario happens.
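The two-pass scheme reduces to something like this (the predicate is a hypothetical stand-in for the real filter):

```cpp
#include <cstddef>
#include <vector>

// Pass 1: a "dummy" run that only counts outputs, writing nothing, so the
// exact buffer size is known before any real output is produced.
std::size_t count_pass(const std::vector<int>& input) {
    std::size_t n = 0;
    for (int x : input)
        if (x > 0) ++n;                  // count only, no writes
    return n;
}

// Pass 2: the real run, into a buffer sized by pass 1 (guaranteed to fit).
std::vector<int> write_pass(const std::vector<int>& input) {
    std::vector<int> out;
    out.reserve(count_pass(input));
    for (int x : input)
        if (x > 0) out.push_back(x);
    return out;
}
```

You pay for reading the input twice, which is why the "allocate for the 95% case" shortcut can win when the input is large and overflows are tolerable.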

Try like hell to keep everything in registers or shared memory and avoid
writing intermediate data. Heavy use of global memory kills performance, both
directly (hitting global memory incurs a lot of latency) and indirectly (GPUs
see the greatest speedup when working on the largest possible working set).

------
kanwisher
Surprised people are still using CUDA. Is OpenCL still not as good? Or is the
vendor lock-in not a problem?

~~~
etangent
> is the vendor lockin not a problem?

It so happened that, early on, almost nobody in the (previously tiny) industry
specialized in using GPU cards for computation used any cards other than
NVIDIA's. NVIDIA listened to that niche market by releasing increasingly
performant hardware for compute needs, while other manufacturers kept ignoring
the niche and focusing on graphics. As a result, very few on the software side
have a burning desire to do free work to support cards made by anyone other
than Big Green, even if they had equivalent performance (they likely don't).

This may seem strange to web devs who remember the hard-fought war against
Microsoft's lock-in on the Internet via IE. But NVIDIA is seen in a different
light: as a benevolent company without which the GPU-for-compute platform
wouldn't even exist.

~~~
Aeolos
AMD GPUs have consistently had higher compute performance than Nvidia's. The
Fury X, for example, consistently outperforms the Titan X at half the
price.[1]
That's why AMD has completely dominated the bitcoin / cryptohashing scene:
money is involved, so people are willing to spend time optimizing for
different hardware.

Nvidia captured this market using typical vendor lock-in tactics: they didn't
support OpenCL 1.2 until many years after it was released, and they still
don't support OpenCL 2.0+.

[1] http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/24

~~~
etangent
I should have qualified "compute". I meant single- and double-precision
floating-point operations that are not graphics. Yes, AMD GPUs outperform
NVIDIA on cryptohashing benchmarks which require high integer performance.

------
jhj
&& and ?: may not be "branching operations": the compiler could determine that
the operands have no side effects (q->score is used in both branches, and q
must be a valid or invalid pointer in both) and collapse that down to just
register-select operations and math internally. Or it may involve predication,
with only the assignment of i to a register previously filled with 0 being
predicated. The compiler has internal heuristics to determine what it should
accomplish via math only versus predication versus branching, based on how
long the code sequences are, whether it can determine that the predicates are
warp-uniform, etc. You would need to look at the SASS to see what is really
going on. Furthermore, there may be more or fewer functional units in the HW
for each kind of operation you think of substituting for the original; let the
compiler choose first.

However, this is not the level of optimization one should be concerned with
first (or even fifth) in a problem like this in CUDA. This kind of algorithm
should be global memory bandwidth bound only. Concentrate on scanning through
the data once and compacting as you go, so you touch as little memory as
possible.

