Interesting point. Nvidia has been improving integer performance for quantized inference on its GPUs a lot. It might be a lot of work, but could this NNUE approach be scaled up to the point where running it on a GPU would be worthwhile?

For single-input "batches" (which seems to be what's used now?) it might never be worthwhile, but if multiple positions could be searched in parallel and their NN evaluations batched, this might start to look tempting.
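
To make that concrete, the sketch below is roughly what I have in mind (everything in it is hypothetical: Position, gpu_evaluate_batch and the batch size are made-up names, not anything from Stockfish or Leela). Search threads hand their leaf positions to a shared batcher and block on a future, and the batcher flushes a full batch to the GPU in one call, so the per-call latency is amortized over many evaluations.

    // Hypothetical sketch: parallel search threads submit leaf positions,
    // and a full batch is evaluated on the GPU in one call.
    #include <cstddef>
    #include <future>
    #include <mutex>
    #include <vector>

    struct Position { /* board state (placeholder) */ };

    // Stand-in for real batched GPU inference; returns dummy scores here.
    std::vector<float> gpu_evaluate_batch(const std::vector<Position>& batch) {
        return std::vector<float>(batch.size(), 0.0f);
    }

    class EvalBatcher {
    public:
        // Called by a search thread at a leaf; blocks until the batch
        // containing this position has been evaluated.
        float evaluate(const Position& pos) {
            std::promise<float> promise;
            std::future<float> result = promise.get_future();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                pending_.push_back(pos);
                promises_.push_back(std::move(promise));
                if (pending_.size() >= kBatchSize) flush_locked();
            }
            return result.get();
        }

    private:
        static constexpr std::size_t kBatchSize = 256;

        // Must be called with mutex_ held.
        void flush_locked() {
            std::vector<float> scores = gpu_evaluate_batch(pending_);
            for (std::size_t i = 0; i < scores.size(); ++i)
                promises_[i].set_value(scores[i]);
            pending_.clear();
            promises_.clear();
        }

        std::mutex mutex_;
        std::vector<Position> pending_;
        std::vector<std::promise<float>> promises_;
    };

A real version would also have to flush partial batches on a timeout (otherwise the last few threads wait forever), and it only pays off if the search can actually keep enough positions in flight at once.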

I'm not sure what the effect of running PVS with multiple parallel search threads is. Presumably the cost of each thread searching with less up-to-date information means you hit the scaling ceiling much sooner than with MCTS-like searches, since the pruning is far more sensitive to having current information about the principal variation.

Disclaimer: My understanding of PVS is very limited.




Sure, if someone can come up with an approach to run an NNUE (efficiently updatable) network on GPUs, that might really be another breakthrough. But at first glance it looks to me like this could be very difficult, because AFAIK the SF search is relatively complicated. Even for Leela, implementing an efficient batched search on multiple GPUs seems to be difficult (some improvements are coming with Ceres), and Leela uses a much simpler MCTS search. That doesn't mean Leela's search could not be improved: it does not necessarily give higher priority to forced sequences of moves (at least not explicitly) or to high-risk moves, which is IMHO why she sometimes misses relatively simple tactics.
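
For anyone wondering what the "efficiently updatable" part buys you, here is a rough sketch of the first-layer trick (sizes and names are made up for illustration, this is not Stockfish's actual code): only a handful of sparse input features change per move, so the first layer's output is patched incrementally instead of being recomputed, which is great for a CPU walking a search tree one position at a time and awkward to turn into the big batched matrix multiplies GPUs like.

    // Illustrative sketch of NNUE's incremental accumulator update.
    #include <cstdint>
    #include <vector>

    constexpr int kNumFeatures = 41024;    // e.g. a HalfKP-sized feature space (illustrative)
    constexpr int kAccumulatorSize = 256;  // illustrative first-layer width

    // First-layer weights: one int16 column per sparse, binary board feature.
    // In a real net these are trained; here they only exist for illustration.
    int16_t first_layer_weights[kNumFeatures][kAccumulatorSize];

    struct Accumulator {
        int16_t values[kAccumulatorSize];
    };

    // After a move, only a few features flip (pieces appearing/disappearing on
    // squares), so instead of redoing the full first-layer matrix multiply we
    // just add/subtract a few weight columns. Cheap with int16 SIMD on a CPU,
    // but inherently a per-position, make/unmake-style update.
    void update_accumulator(Accumulator& acc,
                            const std::vector<int>& added_features,
                            const std::vector<int>& removed_features) {
        for (int f : added_features)
            for (int i = 0; i < kAccumulatorSize; ++i)
                acc.values[i] += first_layer_weights[f][i];
        for (int f : removed_features)
            for (int i = 0; i < kAccumulatorSize; ++i)
                acc.values[i] -= first_layer_weights[f][i];
    }

As far as I understand, the layers after the accumulator are tiny, so on a CPU the whole evaluation stays in the small-integer SIMD sweet spot; the incremental update tied to the search's make/unmake is exactly the part that is hard to batch on a GPU.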


I guess the simplest approach to porting NNUE to GPUs would be to run a complete instance per GPU thread (i.e. concurrent, not parallel, evaluation).
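
Roughly like this toy CUDA kernel, I suppose (layer sizes and the plain float math are made up for illustration; the real net is quantized and shaped differently). Each thread evaluates its own position end to end, so nothing is shared or batched across threads.

    // Toy kernel: one complete (tiny) network evaluation per GPU thread.
    #include <cuda_runtime.h>

    constexpr int kIn = 256;     // illustrative input size
    constexpr int kHidden = 32;  // illustrative hidden size

    __global__ void eval_per_thread(const float* inputs,  // [num_positions * kIn]
                                    const float* w1,      // [kHidden * kIn]
                                    const float* b1,      // [kHidden]
                                    const float* w2,      // [kHidden]
                                    float b2,
                                    float* scores,        // [num_positions]
                                    int num_positions) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= num_positions) return;

        const float* x = inputs + idx * kIn;
        float hidden[kHidden];

        // Each thread runs the whole network by itself: easy to port, but the
        // GPU's wide matrix hardware is left mostly idle.
        for (int j = 0; j < kHidden; ++j) {
            float acc = b1[j];
            for (int i = 0; i < kIn; ++i) acc += w1[j * kIn + i] * x[i];
            hidden[j] = acc > 0.0f ? acc : 0.0f;  // ReLU
        }
        float out = b2;
        for (int j = 0; j < kHidden; ++j) out += w2[j] * hidden[j];
        scores[idx] = out;
    }

    // Host-side launch would look like:
    //   eval_per_thread<<<(n + 255) / 256, 256>>>(d_in, d_w1, d_b1, d_w2, b2, d_out, n);

Simple to write, but every thread does scalar work and the efficient-update trick is lost, so it's not obvious this would beat a well-vectorized CPU implementation.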



