For a company that relies on GPU effects for everything, it's amazing how bad they are at supporting OpenGL. I remember a few years ago Valve basically wrote an open letter to Apple to fix their shit, and... well, you can play Portal now, but the state of OpenGL on OS X is still a disaster.
The details of CPU-GPU buffer coherency are hardware specific, but using a larger alignment can help hit the fast path. At worst, this ends up being a read-modify-write operation to VRAM over a DMA transfer.
The alignment of individual writes should be at least 256 bytes, preferably more. A multiple of a CPU (or GPU) page size may help: 4k, 64k, and 2M are good alignments.
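To make the alignment arithmetic concrete, here's a minimal sketch of the usual round-up helper (the name `align_up` is mine, not from the article) that you'd use to place each write on a 256-byte or page-size boundary:

```c
#include <stddef.h>

/* Round `offset` up to the next multiple of `align`.
 * `align` must be a power of two (256, 4096, 65536, ...). */
static size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}
```

Any suballocation offset handed to the GL then becomes `align_up(cursor, 256)` (or your page size) instead of the raw cursor.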
In modern OpenGL (with GL_ARB_buffer_storage), D3D, or Vulkan, you can use a persistently/coherently mapped buffer with explicit sync points. That makes this problem go away, but it requires a bit more programming effort to set up correctly.
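The sync-point bookkeeping for a persistently mapped buffer usually looks like a few per-frame slots rotated in round-robin order. This is a sketch of just that rotation logic (the struct and function names are illustrative, not a real API); the comments mark where the actual GL fence calls (`glFenceSync` / `glClientWaitSync`) would go:

```c
#include <stddef.h>

#define FRAMES_IN_FLIGHT 3

typedef struct {
    size_t offset;      /* byte offset of this frame's region in the mapped buffer */
    int fence_pending;  /* stand-in for a GLsync object */
} FrameSlot;

typedef struct {
    FrameSlot slots[FRAMES_IN_FLIGHT];
    size_t slot_size;
    unsigned frame;
} Ring;

static void ring_init(Ring *r, size_t slot_size) {
    r->slot_size = slot_size;
    r->frame = 0;
    for (int i = 0; i < FRAMES_IN_FLIGHT; i++) {
        r->slots[i].offset = (size_t)i * slot_size;
        r->slots[i].fence_pending = 0;
    }
}

/* Begin a frame: make sure the GPU is done with this slot, then return the
 * offset the CPU may now write to through the persistent mapping. */
static size_t ring_begin_frame(Ring *r) {
    FrameSlot *s = &r->slots[r->frame % FRAMES_IN_FLIGHT];
    if (s->fence_pending) {
        /* real code: glClientWaitSync(s->fence, ...); glDeleteSync(s->fence); */
        s->fence_pending = 0;
    }
    return s->offset;
}

/* End a frame: record a fence so we know when the GPU has finished reading. */
static void ring_end_frame(Ring *r) {
    FrameSlot *s = &r->slots[r->frame % FRAMES_IN_FLIGHT];
    /* real code: s->fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0); */
    s->fence_pending = 1;
    r->frame++;
}
```

With three slots the CPU only ever blocks if it gets more than three frames ahead of the GPU, which is the "explicit sync" trade-off mentioned above.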
As Scott mentioned in the comments section of the article, the solution usually looks like this (D3D12, for example, has removed dynamic resources entirely):
> So this means if you glBufferSubData into the same address between every drawcall, the OpenGL driver is forced to execute the command and then sync before copying the new data over the old data, which is really slow.
>
> What you are meant to do instead is create a buffer with glBufferData that is large enough to hold at least one frame's worth of uniforms. Then use glBufferSubData with a **different offset** for each drawcall's uniforms, so they end up next to each other in GPU memory. Treat it as a circular buffer.
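The circular-buffer scheme boils down to a tiny suballocator. Here's a sketch under the assumptions above (names like `UniformRing` are mine); the GL call the offset feeds into is shown as a comment:

```c
#include <stddef.h>

/* Ring suballocator for per-drawcall uniform data inside one big buffer
 * created once with glBufferData. */
typedef struct {
    size_t capacity;  /* total buffer size in bytes */
    size_t head;      /* next free byte */
    size_t align;     /* write alignment, e.g. 256 */
} UniformRing;

/* Reserve `bytes` at an aligned offset, wrapping when the buffer is full. */
static size_t uniform_ring_alloc(UniformRing *r, size_t bytes) {
    size_t aligned = (r->head + r->align - 1) & ~(r->align - 1);
    if (aligned + bytes > r->capacity) {
        /* Wrap to the start. Real code must ensure (via a fence or by sizing
         * the buffer for several frames) that the GPU is done reading here. */
        aligned = 0;
    }
    r->head = aligned + bytes;
    /* Caller then does:
     *   glBufferSubData(GL_UNIFORM_BUFFER, aligned, bytes, data); */
    return aligned;
}
```

Each drawcall writes to a fresh region, so the driver never has to stall and copy new data over data the GPU is still reading.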
Edit: also, I think the blog author had no problem reaching the waiting diagnosis despite the tools not splitting the time into idle and working parts, since he zeroed in early on option 3, "The GPU is waiting on the CPU (or vice versa) for something".
He also didn't reach the usual solution of double- or triple-buffering everything that is uploaded to the GPU.
Another way would be to fill a bigger buffer with all the different uniform values at once, and then point each drawcall at the right part of it.
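For the fill-it-all-at-once approach, the per-draw offsets handed to glBindBufferRange must be multiples of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT (queried at runtime; 256 is a common value and assumed in the test here). A sketch of the offset computation, with the hypothetical `draw_uniform_offset` name being mine:

```c
#include <stddef.h>

/* Offset of drawcall `draw_index`'s uniforms inside one pre-filled buffer,
 * padding each entry out to the UBO offset alignment. */
static size_t draw_uniform_offset(size_t draw_index,
                                  size_t uniform_size,
                                  size_t ubo_align) {
    size_t stride = (uniform_size + ubo_align - 1) / ubo_align * ubo_align;
    /* Per draw, real code would then do:
     *   glBindBufferRange(GL_UNIFORM_BUFFER, binding, ubo,
     *                     draw_uniform_offset(i, sizeof(Uniforms), ubo_align),
     *                     sizeof(Uniforms)); */
    return draw_index * stride;
}
```

One glBufferSubData (or mapped write) uploads everything, and the drawcalls just rebind ranges, which is cheap.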
To be honest, the Windows OpenGL drivers are the ones at fault here. Instead of reusing the old data store, they seem to allocate a new one, which is exactly what glBufferData does. However, the performance improvement from implementing glBufferSubData correctly is not enough to offset the huge amount of code that relies on glBufferSubData behaving like glBufferData. So I can see why a driver developer would opt for this behavior.