
Graphics Drivers - andyjohnson0
http://www.shiningrocksoftware.com/2015-12-13-graphics-drivers/
======
overgard
My experience with Apple is that you should only use API calls that have been
around for 5+ years, because everything else is essentially broken.

For a company that relies on GPU effects for everything, it's amazing how bad
they are at supporting OpenGL. I remember a few years ago Valve basically
wrote an open letter to Apple to fix their shit, and .. well you can play
Portal now, but the state of OpenGL on OSX is still a disaster.

~~~
sleepybrett
But now there is Metal, so I imagine that means that OpenGL will lag even
more.

~~~
TazeTSchnitzel
Yep, they've stopped adding GL stuff :/

~~~
randyrand
Why not allow the graphics manufacturers to write the OpenGL code, like on
Windows?

------
exDM69
So, it seems like the renderer is doing a lot of small updates to buffers,
causing pipeline stalls. The article does not mention the size, offset and
alignment used to update uniform buffers, which may be a huge factor here.

The details of CPU-GPU buffer coherency are hardware specific, but using a
larger alignment can help hit the fast path. At worst, this ends up being
a read-modify-write operation to the VRAM over a DMA transfer.

The alignment of individual writes should be at least 256 bytes, preferably
even more. A multiple of a CPU (or GPU) page size may help: 4k, 64k, 2M are
good alignments.
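
To make the alignment advice concrete, here is a minimal sketch of rounding a write offset up to a power-of-two boundary. The helper name `align_up` is made up for illustration, and the 256-byte and page-size values are just the ones suggested above; real code would ask the driver (for example via `GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT`) rather than hard-code them.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment.

    alignment must be a positive power of two (256, 4096, 65536, ...).
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Adding (alignment - 1) and masking off the low bits rounds up.
    return (offset + alignment - 1) & ~(alignment - 1)


# Illustrative values only: a write that would start at byte 300
# is pushed to the next 256-byte or 4k boundary.
start_256 = align_up(300, 256)
start_4k = align_up(300, 4096)
```

An offset that is already aligned comes back unchanged, so the helper can be applied unconditionally before every write.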

In modern OpenGL (w/GL_ARB_buffer_storage)/D3D/Vulkan, you could use a
persistent/coherent mapped buffer and explicit sync points. It makes this
problem go away, but requires a bit more programming effort to set up
correctly.

~~~
kllrnohj
It's obviously not a hardware limitation if it works fine in Windows.

~~~
mentat
Drivers can map the DMA regions differently.

------
speps
The problem is with the tool he's using. It doesn't seem to show the "idle"
time when calling the OpenGL method. He's very likely hitting a CPU/GPU stall
while the data is uploaded to VRAM.

As mentioned by Scott in the comments section of the article, the solution is
usually like this (D3D12 has removed dynamic resources for example [0]) :

    
    
      So this means if you glBufferSubData into the same
      address between every drawcall, the opengl driver is
      forced to execute the command then sync before copying
      the new data over the old data, which is really slow.
    
      What you are meant to do instead is create a buffer
      with glBufferData that is large enough to hold at least
      one frame worth of uniforms. Then use glBufferSubData
      with a **different offset** for each drawcall’s uniforms,
      so they end up next to each other in GPU memory. Treat
      it as a circular buffer.
    

[0] [https://msdn.microsoft.com/en-us/library/windows/desktop/dn8...](https://msdn.microsoft.com/en-us/library/windows/desktop/dn899125%28v=vs.85%29.aspx)
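
The circular-buffer idea in the quote can be sketched as CPU-side offset bookkeeping. This is only an illustration under assumptions: the class name `UniformRing`, the 256-byte alignment and the capacity are all made up, no GL calls are issued, and whether this actually helps on a given driver is not something the sketch can show.

```python
class UniformRing:
    """Hands out a different byte offset for each draw call's uniforms."""

    def __init__(self, capacity: int, alignment: int = 256):
        if capacity <= 0 or capacity % alignment:
            raise ValueError("capacity must be a positive multiple of alignment")
        self.capacity = capacity
        self.alignment = alignment
        self.head = 0  # next free byte in the buffer

    def allocate(self, size: int) -> int:
        # Pad each allocation to the alignment so offsets stay aligned.
        padded = -(-size // self.alignment) * self.alignment
        if padded > self.capacity:
            raise ValueError("allocation larger than the whole ring")
        if self.head + padded > self.capacity:
            # Wrap around. A real renderer must first be sure the GPU is
            # done reading this region (e.g. with a fence), which is why
            # the buffer should hold at least one frame of uniforms.
            self.head = 0
        offset = self.head
        self.head += padded
        return offset


ring = UniformRing(capacity=1024)
offsets = [ring.allocate(64) for _ in range(5)]
```

In a renderer, each returned offset would be the offset passed to glBufferSubData for that draw call, instead of always writing to offset 0.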

~~~
fulafel
Or maybe not the solution: Scott has posted a "never mind" followup to his own
comment.

Edit: also I think the blog author had no problem getting to the waiting
diagnosis despite the tools not splitting the time into idle and working
parts, since he zeroed in early on the option 3, "The GPU is waiting on the
CPU (or vice versa) for something"

~~~
speps
But his conclusion was wrong in my opinion. Updating many, many subparts of an
array in random order will always be slower than updating it all at once.

He also didn't reach the usual solution of double- or triple-buffering
everything that is uploaded to the GPU.

------
gavanwoolery
Juggling uniforms and buffers is something that will get messy fast unless you
take the time to organize it early on. I wrote my own preprocessor that can
dynamically bake values into the actual shader code (by injecting text and
recompiling the shader during run time) - which is useful for putting in
values that are determined by the CPU but more or less constant through the
life of the program (the preprocessor allows other neat tricks like
automatically binding a slider to a variable to test a range of values for
that variable). This also saves time where possible by minimizing CPU-GPU
communication. During one test I did (which certainly is not conclusive) I
noticed uniform buffers performed slower than using normal uniforms - not in
uploading the uniforms but rather reading them in the code. This may have been
due to needing to compute an extra offset or something.
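
As a rough illustration of baking CPU-side values into shader text, here is a minimal sketch. It is not the preprocessor described above: the function name `bake_constants`, the shader and the constant names are all hypothetical, and it only injects `#define` lines after the `#version` directive, leaving recompilation to the caller.

```python
def bake_constants(source: str, constants: dict) -> str:
    """Return GLSL source with one #define per constant injected.

    GLSL wants #version before other directives, so the defines go
    right after it (or at the top if there is no #version line).
    """
    lines = source.splitlines()
    defines = [f"#define {name} {value}" for name, value in constants.items()]
    insert_at = 0
    for i, line in enumerate(lines):
        if line.lstrip().startswith("#version"):
            insert_at = i + 1
            break
    return "\n".join(lines[:insert_at] + defines + lines[insert_at:]) + "\n"


# Hypothetical shader that reads two values as compile-time constants
# instead of uniforms.
SHADER = """#version 330 core
out vec4 color;
void main() {
    color = vec4(vec3(FOG_DENSITY), 1.0) / float(LIGHT_COUNT);
}
"""

baked = bake_constants(SHADER, {"FOG_DENSITY": 0.25, "LIGHT_COUNT": 4})
```

The baked string would then be handed to the usual shader compile step whenever one of the values changes, which only pays off for values that change rarely.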

------
scoopr
Hm, doesn't glBufferSubData update one and the same bit of memory on the
GPU, making the driver wait for the draws that depend on the previous
values to go through before letting the updated values through? You could use
glBufferData to allocate a new bit of memory (with a NULL pointer it would
just discard the old values).

Another way would be to fill a bigger buffer with all the different uniform
values at once, and then point to the right bits of it in the draw calls.
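
The "one bigger buffer" approach can be sketched as packing every draw's uniforms into a single blob and remembering where each one starts. This is an assumption-laden illustration: the function name `pack_draw_uniforms`, the one-vec4-per-draw layout and the 256-byte alignment are made up, and the GL calls are only named in comments.

```python
import struct


def pack_draw_uniforms(draws, alignment=256):
    """Pack one vec4 (four floats) per draw into a single byte blob.

    Returns (blob, offsets), where offsets[i] is the start of draw i.
    """
    blob = bytearray()
    offsets = []
    for color in draws:
        # Zero-pad so every draw's data starts on an aligned offset.
        blob.extend(b"\x00" * ((-len(blob)) % alignment))
        offsets.append(len(blob))
        blob.extend(struct.pack("<4f", *color))
    return bytes(blob), offsets


# Three hypothetical draws, each with its own color uniform.
blob, offsets = pack_draw_uniforms(
    [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0), (0.0, 0.0, 1.0, 1.0)]
)
# A renderer would upload blob once (glBufferData) and then select
# offsets[i] per draw call (glBindBufferRange) instead of re-uploading.
```

The padding wastes some memory per draw, but it means the buffer is written once per frame rather than once per draw call.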

~~~
sharpneli
Yeah. glBufferData should work fine in this case.

To be honest, the Windows OpenGL drivers are the ones at fault here. Instead of
reusing the old data store they seem to allocate a new one, which is just what
glBufferData does. However, the performance improvement from implementing
glBufferSubData correctly is not enough to offset the huge amount of code that
relies on glBufferSubData behaving like glBufferData. So I can see why a
driver developer has opted for this behavior.

------
jobvandervoort
This is a nice follow up to this post [0] on porting a Windows game to OSX.

[0]:
[https://news.ycombinator.com/item?id=10545143](https://news.ycombinator.com/item?id=10545143)

------
tux1968
Makes this post about Vulkan timely...

[https://news.ycombinator.com/item?id=10728156](https://news.ycombinator.com/item?id=10728156)

~~~
vetinari
Vulkan is not supported on OSX or iOS. Instead, you are getting Metal.

~~~
sleepybrett
.. You've got. Ships with iOS 8+ and OS X 10.11 (El Capitan)

------
pjmlp
And this is the fun of "portable" OpenGL, one code path per
driver/platform....

