> Nowadays renowned industry luminaries include shader snippets in their GDC presentations where trivial transforms would have resulted in a faster shader... When the best of the best aren't doing it right, then we have a problem as an industry.

This is an amazing presentation, and there are many valuable points on how to optimize, but I don't love this "motivation". The presentation would be better without it, IMO. "Here's how to make the fastest shaders when you need them" would feel better to me than "you should always optimize, even if that's not your immediate goal".

The best are "doing it right", but a presentation isn't always the place to do it right. A GDC presenter should be making concepts easier to understand, and optimizing makes things harder to understand. Optimizations are often hardware and compiler and platform and language specific.

On top of that, the most important thing to do when optimizing a shader is to profile the thing. Memorizing every possible instruction and trying to out-think the compiler and hardware will lead to surprises, guaranteed. I just hit a case two days ago where I optimized an expression from 24 flops to 13, and the shader slowed down by 15%. The reason was data dependencies that I couldn't see, even in the assembly.
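To make the "fewer flops, slower shader" effect concrete, here is a minimal sketch of one way it can happen. It is not the shader from the comment above; it is a hypothetical cubic evaluated two ways, with a made-up `Node` helper that counts operations and the length of the longest chain of dependent operations. Horner's form uses fewer operations, but every operation waits on the previous one. Estrin's form does one more operation but has a shorter chain.

```python
# Hypothetical illustration: operation count vs. dependency-chain depth
# for two ways of evaluating a + b*x + c*x^2 + d*x^3.

class Node:
    """A value in an expression graph that remembers what it depends on."""
    def __init__(self, value, deps=()):
        self.value = value
        self.deps = deps
        # depth = longest chain of dependent operations needed to produce this value
        self.depth = 1 + max(d.depth for d in deps) if deps else 0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other))


def count_ops(node, seen=None):
    """Count the distinct operation nodes reachable from `node` (inputs count as 0)."""
    seen = set() if seen is None else seen
    if id(node) in seen or not node.deps:
        return 0
    seen.add(id(node))
    return 1 + sum(count_ops(d, seen) for d in node.deps)


def horner(a, b, c, d, x):
    # Fewest operations, but a single serial chain.
    return ((d * x + c) * x + b) * x + a


def estrin(a, b, c, d, x):
    # One extra multiply, but the two halves are independent of each other.
    return (a + b * x) + (x * x) * (c + d * x)


a, b, c, d, x = (Node(v) for v in (1.0, 2.0, 3.0, 4.0, 2.0))
h = horner(a, b, c, d, x)
e = estrin(a, b, c, d, x)
print("horner:", h.value, "ops =", count_ops(h), "depth =", h.depth)
print("estrin:", e.value, "ops =", count_ops(e), "depth =", e.depth)
```

Which form is faster on a real GPU depends on the compiler, fused multiply-add support, and how much other work is available to fill the gaps, so this only shows why a flop count alone can mislead. Profiling is still the only way to know.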

> The reason was data dependencies that I couldn't see, even in the assembly.

I don't understand why that would matter. Aren't GPUs in-order? I don't know the low-level architecture of GPUs at all.

An easy explanation is that you can think of GPUs as being massively hyperthreaded. So, when one thread hits a data stall, another thread picks up to use the ALU resources until it hits a stall, and so on through many, many threads before it cycles back to the original. But, data stalls are very long. And, if you don't have enough ALU for the other threads to work on before they stall too, you'll end up back on the first thread waiting for data anyway.
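The latency-hiding picture above can be sketched as a toy simulation. This is a deliberately simplified model, not a description of any real GPU: one shared ALU, each thread does `alu_cycles` of work and then stalls for `mem_latency` cycles, and the scheduler picks the first thread that is not stalled. The function name and all parameter values are assumptions chosen for illustration.

```python
def alu_utilization(num_threads, alu_cycles, mem_latency, total_cycles=10_000):
    """Fraction of cycles in which the single shared ALU does useful work (toy model)."""
    ready_at = [0] * num_threads             # cycle at which each thread's data has arrived
    remaining = [alu_cycles] * num_threads   # ALU cycles left before the thread's next load
    busy = 0
    for cycle in range(total_cycles):
        # Issue one ALU cycle from the first thread that is not waiting on memory.
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                remaining[t] -= 1
                busy += 1
                if remaining[t] == 0:
                    # The thread issues a load and stalls until the data returns.
                    ready_at[t] = cycle + 1 + mem_latency
                    remaining[t] = alu_cycles
                break
        # If no thread was ready, the ALU idles this cycle.
    return busy / total_cycles


for threads in (1, 5, 10):
    print(threads, "threads, 4 ALU cycles per load:", alu_utilization(threads, 4, 36))
print("10 threads, 1 ALU cycle per load:", alu_utilization(10, 1, 36))
```

In this model, more threads hide more of the stall until the ALU is saturated. Cutting the ALU work per load (the last line) brings the idle time back even with the same thread count, which is the "back on the first thread waiting for data" case.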

If you want to understand low-level GPU architecture, https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-... is a great intro.

They are parallel; a data dependency cannot be pipelined as easily.

So no, they are not in-order.

Radeons do not have speculative execution, but that does not make them in-order.

Radeon compute units are in-order though, in the usual sense of the word, and I'd love to hear about it if there really were an out-of-order GPU. It'd be rather surprising.

One thing they do have for dealing with data dependencies is that load (and texture fetch, etc.) instructions don't block. Instead, there's a separate instruction for waiting on the result of a previous load.
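A rough sketch of why the split between "issue the load" and "wait for the load" matters, even on a strictly in-order machine. The tiny instruction set here (`load`, `wait`, `alu`) and the 100-cycle latency are invented for illustration and do not correspond to any real ISA.

```python
def run(program, load_latency=100):
    """Total cycles for a tiny in-order instruction stream (hypothetical ISA).

    ("load", tag) issues in 1 cycle; its data arrives load_latency cycles later.
    ("wait", tag) blocks until the load with that tag has completed.
    ("alu",)      takes 1 cycle.
    """
    cycle = 0
    done_at = {}
    for instr in program:
        if instr[0] == "load":
            cycle += 1
            done_at[instr[1]] = cycle + load_latency
        elif instr[0] == "wait":
            cycle = max(cycle, done_at[instr[1]])
        else:  # "alu"
            cycle += 1
    return cycle


# Each load is waited on immediately: the two latencies add up.
serial = [("load", "a"), ("wait", "a"), ("alu",),
          ("load", "b"), ("wait", "b"), ("alu",)]

# Both loads are issued first, then waited on: the latencies overlap.
overlapped = [("load", "a"), ("load", "b"),
              ("wait", "a"), ("alu",),
              ("wait", "b"), ("alu",)]

print("serial:", run(serial))          # 204
print("overlapped:", run(overlapped))  # 103
```

Both streams execute strictly in program order. The difference comes only from where the waits are placed, which is something the compiler decides and which a small source change can move.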

My understanding:

GPUs are (generally) in-order within each thread, but they are pipelined. The pipeline is filled with instructions that are ready to execute from across many threads. If all threads have an unmet dependency (previous instruction or memory access), the pipeline will stall.

GPU compilers prefer to inline everything, and they try to reuse partial results if they can, so it’s easy to get out of order dependencies in places you might not expect.
