Very nice!
I wanted to do something like this too, but then I would miss out on proper CUDA acceleration and lose performance compared to using libtorch.
I wrote a forgettable llama implementation on top of https://github.com/LaurentMazare/tch-rs (Rust bindings for PyTorch's libtorch).
Still not ideal, but at least you get the same GPU performance you would get with PyTorch.
...And then I spotted Candle, a new ML framework by the same author: https://github.com/huggingface/candle
It's all in Rust and self-contained, a huge undertaking, but it looks very promising. They already have a llama2 example!