I don't think vertex buffers are the ideal storage mechanism for lattice methods. I used CUDA to rasterize a plain buffer, but I have two GPUs, and the one doing the sim wasn't the one rendering. It's better this way if you are looking to run the sim thousands of cycles per second but only render at 60 fps. There's a lot of extra data that could be eliminated by simply using a linear block of memory instead of a vertex buffer. Depending on the goals of the sim, efficient rendering should be a lesser priority than efficient simulation speed. A sketch of what I mean is below.
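As a minimal sketch of the linear-memory idea (not taken from my actual frameworks; W, H, and the D2Q9 layout are illustrative assumptions), the whole lattice is just a contiguous, double-buffered device allocation with no per-vertex attributes:

    // Lattice state as a plain linear device buffer, ping-ponged between steps.
    #include <cuda_runtime.h>
    #include <utility>

    constexpr int W = 1024, H = 1024, Q = 9;   // grid size, distributions per cell

    __global__ void step(const float* src, float* dst) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= W || y >= H) return;
        size_t cell = (size_t(y) * W + x) * Q;   // contiguous indexing, no color/uv strides
        for (int q = 0; q < Q; ++q)
            dst[cell + q] = src[cell + q];       // collision + streaming would go here
    }

    int main() {
        float *a, *b;
        cudaMalloc(&a, size_t(W) * H * Q * sizeof(float));
        cudaMalloc(&b, size_t(W) * H * Q * sizeof(float));
        dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
        for (int i = 0; i < 1000; ++i) {         // many sim steps per rendered frame
            step<<<grid, block>>>(a, b);
            std::swap(a, b);                      // ping-pong the two buffers
        }
        cudaDeviceSynchronize();
        cudaFree(a); cudaFree(b);
    }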
Here are two very similar frameworks I made for exploring parallelizable lattice sims:
(Note: the video is of a skewed version due to hexagonal lattice coords - this is fixed in the latest commits)
To OP: you may enjoy studying the transformations between cellular automata and partial differential equations.
The PDF below is a gem, a good introduction to the techniques that let you take any reasonable PDE and derive a CA with equivalent dynamics:
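As a toy illustration of the PDE-to-CA direction (my own example, not from the PDF): forward Euler plus central differences turns the 1D heat equation u_t = D u_xx into a purely local update rule, i.e. a continuous-state CA:

    // Explicit discretization of u_t = D * u_xx. Each cell's next state
    // depends only on itself and its two neighbors -- a CA update rule.
    __global__ void heat_ca_step(const float* u, float* u_next, int n,
                                 float D, float dt, float dx) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i <= 0 || i >= n - 1) return;            // hold boundary cells fixed
        float lap = (u[i - 1] - 2.0f * u[i] + u[i + 1]) / (dx * dx);
        u_next[i] = u[i] + D * dt * lap;             // stable for D*dt/dx^2 <= 0.5
    }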
I agree that there is much that could be optimized. The choice to use vertex buffers was driven by the overarching goal of implementing the whole thing in GLSL. It really is just a small playground project to generate some nice visualizations. If we wanted to actually use the results beyond displaying them, the tight coupling between simulation and rendering would certainly become a hindrance.
Using CUDA as you describe definitely seems to be the way to go for larger-scale work; most GPU-based LBM codes that are actually used in research seem to be built on it.
If you did want to increase performance, you could try to use OpenGL for rendering but write your actual sim in CUDA C++.
The whole interop API is listed here: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART_...
Essentially it wouldn't involve any host/device memory transfers; everything would stay on the GPU.
Then you could limit your rendering thread to 60 fps while running the CUDA kernel non-stop.
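A rough outline of that interop path (GL context setup and error checking omitted; simKernel is a placeholder for the actual update):

    #include <cuda_gl_interop.h>

    __global__ void simKernel(float* state);    // placeholder for the actual update

    static cudaGraphicsResource* res = nullptr;

    // One-time setup: register an existing GL buffer object (created with
    // glGenBuffers/glBufferData elsewhere) with CUDA.
    void registerVbo(GLuint vbo) {
        cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsMapFlagsWriteDiscard);
    }

    // Per sim step: map the buffer, get a raw device pointer, run the kernel,
    // unmap. No host round-trip -- the data never leaves the GPU.
    void simStep(dim3 grid, dim3 block) {
        float* dptr;
        size_t bytes;
        cudaGraphicsMapResources(1, &res, 0);
        cudaGraphicsResourceGetMappedPointer((void**)&dptr, &bytes, res);
        simKernel<<<grid, block>>>(dptr);
        cudaGraphicsUnmapResources(1, &res, 0);
    }

The render thread then draws from the same VBO at its own 60 fps cadence.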
I haven't actually tested this out, though, because I had two GPUs for the sims above, and the beefy one running the sim was in TCC instead of WDDM mode (no attached display allowed). So I had the universe state buffer transferred to host memory, and then to the second GPU for rendering to the attached display.
I am not sure how much of a speed gain TCC over WDDM really provides, but NVIDIA says it makes "some difference."
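For reference, that host-staged path looks roughly like this (device IDs and sizes are illustrative; it assumes two CUDA-visible GPUs):

    #include <cuda_runtime.h>

    int main() {
        const size_t N = 1024 * 1024, bytes = N * sizeof(float);
        float *simState, *drawState, *hostStage;

        cudaSetDevice(0);                      // GPU 0 (TCC): runs the sim
        cudaMalloc(&simState, bytes);
        cudaSetDevice(1);                      // GPU 1 (WDDM): drives the display
        cudaMalloc(&drawState, bytes);
        cudaMallocHost(&hostStage, bytes);     // pinned host staging buffer

        // Per rendered frame: D2H off the sim GPU, then H2D onto the render GPU.
        cudaSetDevice(0);
        cudaMemcpy(hostStage, simState, bytes, cudaMemcpyDeviceToHost);
        cudaSetDevice(1);
        cudaMemcpy(drawState, hostStage, bytes, cudaMemcpyHostToDevice);

        // Where the topology allows peer access, a single
        // cudaMemcpyPeer(drawState, 1, simState, 0, bytes) skips the host hop.
    }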
Jos Stam, Stable Fluids

It's significantly older, but it's still good, and it goes into the details for those (like me) who don't have a background in fluid dynamics. I had a lot of fun implementing it on the side a while back.
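For anyone skimming, the heart of the paper is the unconditionally stable semi-Lagrangian advection step: trace each cell backwards through the velocity field and sample the old field there. A rough sketch in grid units (my own naming, clamped boundaries, not Stam's exact formulation):

    // Semi-Lagrangian advection: backtrace each cell along the velocity field
    // and bilinearly sample the previous field q0 into q1.
    __global__ void advect(const float* q0, float* q1,
                           const float* vx, const float* vy,
                           int W, int H, float dt) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= W || y >= H) return;
        int i = y * W + x;

        // Backtrace, clamped so the bilinear stencil stays in bounds.
        float px = fminf(fmaxf(x - dt * vx[i], 0.0f), W - 1.001f);
        float py = fminf(fmaxf(y - dt * vy[i], 0.0f), H - 1.001f);

        // Bilinear interpolation of the previous field.
        int x0 = (int)px, y0 = (int)py;
        float fx = px - x0, fy = py - y0;
        int j = y0 * W + x0;
        q1[i] = (1 - fy) * ((1 - fx) * q0[j]     + fx * q0[j + 1]) +
                     fy  * ((1 - fx) * q0[j + W] + fx * q0[j + W + 1]);
    }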
More of his papers:
What struck me most was how much this article resonated with my own experience in GPU-based simulation. I went for a particle-based method instead and had to make do with GPU hardware and drivers as they were back in 2011. But despite the difference in computational model, and all the changes brought by the evolution of GPUs and compute shaders, the main struggles are still the same: discretization and parallelization.
Brings back many great memories.