
Python is just the glue language. All the heavy lifting happens in CUDA, cuBLAS, cuDNN, and the like.
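To illustrate: a minimal sketch (assuming PyTorch and a CUDA-capable GPU) where the Python code only describes the operation, and the actual math runs in a GPU kernel:

    import torch

    # Two large matrices allocated on the GPU.
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    # This one Python line dispatches to a cuBLAS GEMM kernel;
    # none of the arithmetic happens in the interpreter.
    c = a @ b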

Most memory-saving optimizations come from using lower-precision numbers (float16 or less), quantization (int8 or int4), sparsification, etc. But this is all handled by the underlying framework, e.g. PyTorch. See the sketch below.
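A rough sketch of what that looks like from the Python side (the toy model here is just a placeholder; the real work happens inside PyTorch's backends):

    import torch
    import torch.nn as nn

    # Toy model standing in for a real network (hypothetical example).
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

    # float16: roughly halves memory for weights and activations.
    model_fp16 = model.half()

    # int8 dynamic quantization of the Linear layers (CPU inference path).
    model_int8 = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )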

There are C++ implementations too, but they optimize for different aspects. For example: https://github.com/OpenNMT/CTranslate2/



