Not that that's the whole solution, but certainly something I'd look into before digging into CUDA.
I get why you'd choose CUDA over OpenCL for pure GPGPU code, but when faced with the requirement of having both a CPU and a GPU version, why not use OpenCL to keep both code paths as close as possible? Did you write it so both branches live in a single build that can select the optimal path at runtime by looking at what hardware is available? Or do you maintain two separate builds that are deployed to different servers?
One of the things Thrust supports is an OpenMP backend, so it can emit code that runs on a CPU instead of a GPU. I've never tried it so I don't know how good it is, particularly on the collective operations. But I assume if what you want is a "#pragma omp parallel for 1-to-N" then it's probably great. Construct your operation as a functor and use a thrust::for_each from 1 to N.
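Roughly what that looks like, assuming the OpenMP backend is selected at build time with -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP (the same source compiles against the CUDA backend by default); the functor and its per-element work here are just placeholders:

    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>

    struct do_work {
        float *data;                          // raw pointer captured by the functor
        __host__ __device__ void operator()(int i) const {
            data[i] *= 2.0f;                  // stand-in for the real per-element work
        }
    };

    void run(float *data, int n)
    {
        // The "#pragma omp parallel for 1-to-N" equivalent: one call,
        // and the backend decides whether it runs on CPU threads or a GPU.
        thrust::for_each(thrust::counting_iterator<int>(0),
                         thrust::counting_iterator<int>(n),
                         do_work{data});
    }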
As it is in the raw, it's pretty rough to use, but I'm excited to see what interfaces will be built on top of it to make heterogeneous computing more accessible and maintainable. I'm thinking something like PyOpenCL, but for Vulkan.
One problem can be when the number of output items from a given input item is unpredictable. You typically want to run a prefix-scan within a block to compact the block's output into a single chunk until your shared-memory buffer is full, or you've reached the end of the data set. Then, you do an ATOMIC increment on a global "slots-used" counter for your output buffer, and write as a big chunk. This is a lot more complex than sorting, but reasonably fast.
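A stripped-down sketch of that pattern, with the caveat that it substitutes a shared-memory atomic counter for the in-block prefix scan (simpler to show, same idea of reserving one contiguous chunk per block); the predicate and buffer size are made up:

    #include <cuda_runtime.h>

    __device__ unsigned int g_slots_used;        // global "slots-used" counter

    __global__ void compact_kernel(const int *in, int *out, int n)
    {
        __shared__ int s_buf[256];               // per-block staging buffer (assumes blockDim.x <= 256)
        __shared__ unsigned int s_count;         // items this block will write
        __shared__ unsigned int s_base;          // block's offset into the output

        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        if (tid == 0) s_count = 0;
        __syncthreads();

        // Each thread decides whether its element produces output and
        // grabs a slot in the block's shared-memory buffer.
        if (gid < n && (in[gid] & 1)) {          // placeholder predicate: keep odd values
            unsigned int slot = atomicAdd(&s_count, 1u);
            s_buf[slot] = in[gid];
        }
        __syncthreads();

        // One atomic on the global counter reserves a contiguous chunk.
        if (tid == 0) s_base = atomicAdd(&g_slots_used, s_count);
        __syncthreads();

        // Write the block's compacted output as a single big chunk.
        for (unsigned int i = tid; i < s_count; i += blockDim.x)
            out[s_base + i] = s_buf[i];
    }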
If you want to guarantee that your output will fit in the buffer, you need to run the operation twice. The first run is a "dummy" that writes no output; it just figures out the last input element that can write data without the output buffer overflowing. In some cases it may be preferable to just allocate a reasonable guesstimate of the 95% case and not write past the end if a true worst-case scenario happens.
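The sizing pass could look something like this (a rough sketch using Thrust; count_outputs and the overall shape are hypothetical, not the parent's actual code):

    #include <thrust/device_vector.h>
    #include <thrust/transform_scan.h>
    #include <thrust/binary_search.h>
    #include <thrust/functional.h>

    struct count_outputs {
        __host__ __device__ int operator()(int x) const {
            return x & 1;                        // placeholder: odd inputs emit one item
        }
    };

    // Returns how many input elements can be processed before an output
    // buffer of size out_capacity would overflow.
    int find_cutoff(const thrust::device_vector<int> &in, int out_capacity)
    {
        // Dummy pass: per-element output counts, scanned into running totals.
        thrust::device_vector<int> totals(in.size());
        thrust::transform_inclusive_scan(in.begin(), in.end(), totals.begin(),
                                         count_outputs(), thrust::plus<int>());

        // The first input whose cumulative output no longer fits marks the cutoff.
        return thrust::upper_bound(totals.begin(), totals.end(), out_capacity)
               - totals.begin();
    }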
Try like hell to keep everything in registers or shared memory and avoid writing intermediate data. Heavy use of memory (in the order-of-complexity sense) kills performance, both directly (hitting global memory incurs a lot of latency) and indirectly (GPUs give the greatest speedup when working on the largest possible working set).
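One concrete version of that advice, assuming Thrust is in play: fuse the transform into the reduction so the intermediate values stay in registers instead of being written out to global memory.

    #include <thrust/device_vector.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>

    struct square {
        __host__ __device__ float operator()(float x) const { return x * x; }
    };

    float sum_of_squares(const thrust::device_vector<float> &v)
    {
        // The naive two-step version (transform into a temporary vector,
        // then reduce it) writes and re-reads n floats of intermediate data.
        // The fused call below keeps each squared value in a register.
        return thrust::transform_reduce(v.begin(), v.end(), square(),
                                        0.0f, thrust::plus<float>());
    }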
It so happened that, early on, almost nobody in the (previously tiny) industry specializing in using GPUs for computation had used any cards other than NVIDIA's. NVIDIA listened to that niche market by releasing increasingly performant hardware for compute needs, while other manufacturers kept ignoring the niche and focusing on graphics. As a result, very few people on the software side have a burning desire to do free work to support cards made by anyone other than the Big Green, even if they had equivalent performance (they likely don't).
This may seem strange to web devs who remember the hard-fought war against Microsoft's lock-in on the Internet via IE. But NVIDIA is seen in a different light: as a benevolent company without which the GPU-for-compute platform wouldn't even exist.
Nvidia captured this market using typical vendor lock-in tactics: they didn't support OpenCL 1.2 until many years after it was released, and they still don't support OpenCL 2.0+.
OpenCL has come along nicely in the meantime, but I suspect CUDA has maintained its lead and now has momentum with many developers.
However, this is not the level of optimization one should be concerned with first (or even fifth) in a problem like this in CUDA. This kind of algorithm should be global memory bandwidth bound only. Concentrate on scanning through the data once and compacting as you go, so you touch as little memory as possible.