I don't know if it's still the case, but in the past CUDA/OpenCL kernels would do the execution work for every path in the CFG and only write the results for the path actually taken to global memory.
wyldfire: good point, I should have brought that up!
For my use case (neural network design) there was no divergence between threads, so every thread ran exactly the same path (the for loop is unrolled at runtime by the compiler). If my if/else if block had divergent paths, though, you would be correct.
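To make the uniform vs. divergent distinction concrete, here's a toy Python model of how one warp might step through an if/else. It's only a sketch: `run_warp`, `WARP_SIZE`, and the lambdas are made-up names for illustration, and it models branch serialization with lane masks, not what any particular compiler or GPU generation actually emits (short branches may be predicated instead, in which case both bodies are issued regardless).

```python
# Toy model of SIMT branch handling for a single warp. Not real GPU code.
WARP_SIZE = 32  # assumed warp width, for illustration only


def run_warp(inputs, cond, then_fn, else_fn):
    """Simulate `out = then_fn(x) if cond(x) else else_fn(x)` across one warp.

    Returns (outputs, passes), where `passes` counts how many branch bodies
    the warp as a whole had to step through.
    """
    mask = [cond(x) for x in inputs]  # one predicate bit per lane
    out = [None] * len(inputs)
    passes = 0

    if any(mask):  # at least one lane takes the "then" side
        passes += 1
        for lane, x in enumerate(inputs):
            if mask[lane]:  # lanes with a false predicate sit idle
                out[lane] = then_fn(x)

    if not all(mask):  # at least one lane takes the "else" side
        passes += 1
        for lane, x in enumerate(inputs):
            if not mask[lane]:
                out[lane] = else_fn(x)

    return out, passes


data = list(range(WARP_SIZE))

# Uniform: every lane agrees on the condition, so only one body runs.
_, uniform_passes = run_warp(data, lambda x: x >= 0, lambda x: x * 2, lambda x: -x)

# Divergent: even and odd lanes disagree, so the warp walks both bodies.
divergent_out, divergent_passes = run_warp(
    data, lambda x: x % 2 == 0, lambda x: x * 2, lambda x: -x
)

print(uniform_passes, divergent_passes)
```

In the uniform case the warp pays for one branch body; in the divergent case it pays for both, even though each lane only keeps the result from its own side.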