Still an issue, just not where you think. For recent, more efficient CNN architectures, _data augmentation_ becomes the bottleneck when it runs on a single thread. So Python has to resort to either queues and async (the TF approach, which in practice performs worse than PyTorch) or multiprocessing (the PyTorch approach, which works better but is ugly AF under the covers). I would absolutely love to use a multicore-capable language there. The machine does have several dozen cores, after all.
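
To make the multiprocessing route concrete, here is a rough sketch using only the standard library. It is not how PyTorch does it internally, just the general shape of the idea: farm per-sample augmentation out to a pool of worker processes. The names `augment` and `augment_batch` are made up for illustration, the "image" is a plain nested list, and the only augmentation is a random horizontal flip.

```python
import random
from multiprocessing import Pool


def augment(task):
    """Hypothetical augmentation: randomly flip a 2D 'image' horizontally.

    `task` is a (seed, image) pair. Each sample carries its own seed so the
    result does not depend on which worker process happens to handle it.
    """
    seed, image = task
    rng = random.Random(seed)
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return [list(row) for row in image]


def augment_batch(images, workers=4, seed=0):
    """Augment a batch in parallel across `workers` processes (assumed default: 4)."""
    tasks = [(seed + i, img) for i, img in enumerate(images)]
    # Every task and every result is pickled and copied between processes.
    # That overhead is part of what makes this approach clunky.
    with Pool(processes=workers) as pool:
        return pool.map(augment, tasks)


if __name__ == "__main__":
    batch = [[[1, 2, 3], [4, 5, 6]] for _ in range(8)]
    for image in augment_batch(batch, workers=2):
        print(image)
```

The `if __name__ == "__main__":` guard matters: on platforms that spawn rather than fork, worker processes re-import the module, and without the guard they would try to start pools of their own.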