Graphics Drivers (shiningrocksoftware.com)
237 points by andyjohnson0 on Dec 14, 2015 | 20 comments

My experience with Apple is that you should only use API calls that have been around for 5+ years, because everything else is essentially broken.

For a company that relies on GPU effects for everything, it's amazing how bad they are at supporting OpenGL. I remember a few years ago Valve basically wrote an open letter to Apple to fix their shit, and .. well you can play Portal now, but the state of OpenGL on OSX is still a disaster.

But now there is Metal, so I imagine that means that OpenGL will lag even more.

Yep, they've stopped adding GL stuff :/

Why not allow the graphics manufacturers to write the OpenGL code like on Windows?

So, it seems like the renderer is doing a lot of small updates to buffers, causing pipeline stalls. The article does not mention the size, offset and alignment used to update uniform buffers, which may be a huge factor here.

The details of CPU-GPU buffer coherency are hardware specific, but using a larger alignment can help hit the fast path. At worst, this ends up being a read-modify-write operation to the VRAM over a DMA transfer.

The alignment of individual writes should be at least 256 bytes, preferably even more. A multiple of a CPU (or GPU) page size may help: 4k, 64k, 2M are good alignments.
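
A minimal sketch of the alignment arithmetic described above, in C. The helper names `align_up` and `block_offset` are made up for illustration, and 256 is just the figure from the comment; on a real OpenGL driver the minimum offset alignment for uniform buffer ranges can be queried with `GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT`.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment`.
   `alignment` must be a nonzero power of two (256, 4096, 65536, ...). */
size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Byte offset of the index-th uniform block when every block of
   `block_size` bytes is padded out to a multiple of `alignment`. */
size_t block_offset(size_t index, size_t block_size, size_t alignment)
{
    return index * align_up(block_size, alignment);
}
```

With 80-byte blocks and 256-byte alignment, each block gets its own 256-byte slot, so no write straddles a slot used by another draw.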

In modern OpenGL (w/GL_ARB_buffer_storage)/D3D/Vulkan, you could use a persistent/coherent mapped buffer and explicit sync points. It makes this problem go away, but requires a bit more programming effort to set up correctly.

It's obviously not a hardware limitation if it works fine in Windows.

Drivers can map the DMA regions differently.

The problem is with the tool he's using. It doesn't seem to show the "idle" time when calling the OpenGL method. He's very likely hitting a CPU/GPU stall while the data is uploaded to VRAM.

As mentioned by Scott in the comments section of the article, the solution is usually like this (D3D12 has removed dynamic resources for example [0]) :

  So this means if you glBufferSubData into the same
  address between every drawcall, the opengl driver is
  forced to execute the command then sync before copying
  the new data over the old data, which is really slow.

  What you are meant to do instead is create a buffer
  with glBufferData that is large enough to hold at least
  one frame worth of uniforms. Then use glBufferSubData
  with a **different offset** for each drawcall’s uniforms,
  so they end up next to each other in GPU memory. Treat
  it as a circular buffer.
[0] https://msdn.microsoft.com/en-us/library/windows/desktop/dn8...
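
The circular-buffer scheme in the quote above could be sketched roughly like this. It only covers the CPU-side offset bookkeeping; the `UniformRing` struct and `ring_alloc` function are hypothetical names, and the returned offset is what would be passed to `glBufferSubData`. Deciding when it is safe to wrap (i.e. when the GPU has finished with the old data) is left to the caller and is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU-side bookkeeping for a uniform buffer used as a ring. */
typedef struct {
    size_t capacity;   /* total size the buffer was created with */
    size_t alignment;  /* power of two, e.g. 256 */
    size_t head;       /* next free byte offset */
} UniformRing;

/* Reserve `size` bytes and return the offset to write them at.
   Wraps back to offset 0 when the request does not fit in the
   space remaining at the end of the buffer. */
size_t ring_alloc(UniformRing *ring, size_t size)
{
    size_t mask = ring->alignment - 1;
    assert(ring->alignment != 0 && (ring->alignment & mask) == 0);

    size_t padded = (size + mask) & ~mask;   /* round up to alignment */
    assert(padded <= ring->capacity);

    if (ring->head + padded > ring->capacity)
        ring->head = 0;   /* wrap: assumes the GPU is done with this region */

    size_t offset = ring->head;
    ring->head += padded;
    return offset;
}
```

Sizing the buffer for at least one frame of uniforms, as the quote suggests, means a wrap happens at most about once per frame.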

Or maybe not the solution: Scott has posted a "never mind" followup to his own comment.

Edit: also I think the blog author had no problem getting to the waiting diagnosis despite the tools not splitting the time into idle and working parts, since he zeroed in early on the option 3, "The GPU is waiting on the CPU (or vice versa) for something"

But his conclusion was wrong in my opinion. Updating many many sub parts of an array in random order will always be slower than updating it all at once.

He also didn't reach the usual solution to double or triple buffer everything that is uploaded to the GPU.
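
For what double or triple buffering means in terms of offsets, here is a small sketch. It assumes one large buffer split into `FRAMES_IN_FLIGHT` equal regions; that constant and the function name are invented for the example. A real renderer would also need a fence per region to confirm the GPU has finished reading it before it is reused, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed number of regions: 3 for triple buffering, 2 for double. */
enum { FRAMES_IN_FLIGHT = 3 };

/* Start offset of the region the CPU should write during this frame.
   Consecutive frames rotate through the regions, so the CPU never
   writes into the region used by the frame submitted just before. */
size_t frame_region_offset(unsigned long frame_number, size_t region_size)
{
    assert(region_size > 0);
    return (size_t)(frame_number % FRAMES_IN_FLIGHT) * region_size;
}
```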

Did you not bother reading the article? At least half of it is talking about it being a stall and trying to figure out why it's stalling.

My point was that his conclusion seems wrong, but I didn't make that clear. He didn't fix the stalling, he just found a workaround which seems more like a hack.

Juggling uniforms and buffers is something that will get messy fast unless you take the time to organize it early on. I wrote my own preprocessor that can dynamically bake values into the actual shader code (by injecting text and recompiling the shader during run time) - which is useful for putting in values that are determined by the CPU but more or less constant through the life of the program (the preprocessor allows other neat tricks like automatically binding a slider to a variable to test a range of values for that variable). This also saves time where possible by minimizing CPU-GPU communication. During one test I did (which certainly is not conclusive) I noticed uniform buffers performed slower than using normal uniforms - not in uploading the uniforms but rather reading them in the code. This may have been due to needing to compute an extra offset or something.
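
As an illustration of the "bake values into the shader text" idea (not the commenter's actual preprocessor, which isn't shown), a value known on the CPU can be turned into a `#define` line and placed ahead of the shader body before compiling. The function name `bake_define` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "#define NAME float(VALUE)\n" into `out`.
   Returns the number of characters written (excluding the NUL),
   or -1 if the line did not fit in `out_size` bytes.
   Wrapping the value in float(...) keeps it a float even when
   it prints without a decimal point (e.g. 2).
   Assumes the default "C" numeric locale for the decimal point. */
int bake_define(char *out, size_t out_size, const char *name, float value)
{
    assert(out != NULL && name != NULL);
    int n = snprintf(out, out_size, "#define %s float(%.9g)\n", name, value);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

Since `glShaderSource` accepts an array of strings, the `#version` line, the generated defines and the unchanged shader body can be passed as three separate pieces, keeping `#version` first as GLSL requires.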

Hm, doesn't glBufferSubData update the same bit of memory on the GPU, making the driver wait for the draws that depend on the previous values to go through before letting the updated values through? You could use glBufferData to allocate a new bit of memory (with a NULL pointer it would just discard the old values).

Another way would be to fill a bigger buffer with all the different uniform values at once, and then point to the right bits of it from the drawcalls.
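
A rough sketch of the "one bigger buffer" approach: pack every draw's uniforms into a CPU staging array at an aligned stride, upload it once, then give each draw its own range. `DrawUniforms` and `pack_uniforms` are hypothetical; the per-draw binding would typically be `glBindBufferRange` with offset `i * stride`, which is not shown because it needs a live GL context.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-draw uniform block (80 bytes). */
typedef struct {
    float model[16];
    float tint[4];
} DrawUniforms;

/* Copy `count` uniform blocks into `staging`, one per aligned slot.
   Returns the stride in bytes between consecutive slots; draw i
   then reads its uniforms from offset i * stride. */
size_t pack_uniforms(unsigned char *staging, size_t staging_size,
                     const DrawUniforms *draws, size_t count,
                     size_t alignment)
{
    size_t mask = alignment - 1;
    assert(alignment != 0 && (alignment & mask) == 0);

    size_t stride = (sizeof(DrawUniforms) + mask) & ~mask;
    assert(count * stride <= staging_size);

    for (size_t i = 0; i < count; ++i)
        memcpy(staging + i * stride, &draws[i], sizeof(DrawUniforms));
    return stride;
}
```

The staging array can then go to the GPU in a single upload per frame instead of one small update per draw.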

Yeah. glBufferData should work fine in this case.

To be honest, the Windows OpenGL drivers are the ones at fault here. Instead of reusing the old data store they seem to allocate a new one, just as glBufferData does. However, the performance improvement from implementing glBufferSubData correctly is not enough to offset the huge amount of code that relies on glBufferSubData behaving like glBufferData. So I can see why a driver developer has opted for this behavior.

This is a nice follow up to this post [0] on porting a Windows game to OSX.

[0]: https://news.ycombinator.com/item?id=10545143

Makes this post about Vulkan timely...


Vulkan is not supported on OSX or iOS. Instead, you are getting Metal.

.. You've already got it. Ships with iOS 8+ and OS X 10.11 (El Capitan).

And this is the fun of "portable" OpenGL, one code path per driver/platform....
