* That my batch transfers to VRAM are done in a sensible way in the dataloader and don't hide CPU-bound preprocessing
* That my batch size is large enough
* That the model is big enough to actually benefit from the GPU (even convolutional models can run faster on the CPU at specific sizes)
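The first point, keeping transfers from serializing with CPU preprocessing, is basically a prefetching pattern. Here's a minimal pure-Python sketch of the idea (the name `prefetch_batches` and the queue depth are mine, not from any library); in PyTorch the equivalent knobs are roughly `DataLoader` workers plus `pin_memory=True` and `.to(device, non_blocking=True)`:

```python
import queue
import threading

def prefetch_batches(make_batch, n_batches, depth=2):
    """Yield batches while a producer thread preprocesses the next ones,
    so CPU-side work overlaps with (simulated) GPU work instead of
    running in series with it. Hypothetical sketch, not a real DataLoader."""
    q = queue.Queue(maxsize=depth)  # bounded, so the producer can't run away
    sentinel = object()

    def producer():
        for i in range(n_batches):
            q.put(make_batch(i))  # CPU-bound preprocessing happens here
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# Usage: each batch arrives already preprocessed while the previous one "trains".
batches = list(prefetch_batches(lambda i: [i] * 4, n_batches=3))
```

The bounded queue is the important part: depth 2 or so is enough to hide preprocessing latency without hoarding memory.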
That's good enough to go from a CPU-bound pattern to a GPU-bound one, but I don't really get a detailed understanding of the spectrum between the two, so I'm definitely going to try this tool in the future, especially since it's so easy to add.
On the subject of optimization tricks, I haven't really found any magic bullets. You can't always just increase the batch size until you hit 100% util, because of the performance implications. FP16 precision has never done anything for me, weirdly. And my preprocessing is never CPU-bound unless I do dumb shit in it, so rewriting it in C++ would do nothing.
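On why bigger batches raise util but with diminishing returns: each batch pays a fixed launch/overhead cost plus per-sample compute, so throughput flattens out quickly. A toy cost model (all numbers made up, just to show the shape of the curve):

```python
def throughput(batch_size, launch_overhead=0.002, per_sample=0.001):
    """Samples/sec under a toy cost model: a fixed per-batch overhead
    plus linear per-sample compute. Purely illustrative numbers."""
    return batch_size / (launch_overhead + batch_size * per_sample)

# Bigger batches amortize the fixed overhead, but gains flatten fast,
# and none of this captures the optimization/convergence cost of
# very large batches, which is the real reason you can't just crank it.
for bs in (1, 8, 64, 512):
    print(bs, round(throughput(bs)))
```

Past the knee of that curve you're trading memory (and possibly model quality) for almost no extra throughput.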