Low-Level Thinking in High-Level Shading Languages (2013) [pdf] (humus.name)
81 points by pablode on Jan 24, 2018 | 10 comments

> Nowadays renowned industry luminaries include shader snippets in their GDC presentations where trivial transforms would have resulted in a faster shader... When the best of the best aren't doing it right, then we have a problem as an industry.

This is an amazing presentation, and there are many valuable points on how to optimize, but I don't love this "motivation"; the presentation would be better without it, IMO. "Here's how to make the fastest shaders when you need them" would feel better to me than "you should always optimize, even if that's not your immediate goal".

The best are "doing it right", but a presentation isn't always the place to do it right. A GDC presenter should be making concepts easier to understand, and optimizing makes things harder to understand. Optimizations are often hardware-, compiler-, platform-, and language-specific.
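
To make "trivial transforms" concrete, here's the kind of thing meant, as a toy HLSL illustration of my own rather than a snippet from the deck:

    // Written the "mathematician" way: an add followed by a multiply.
    float remapA(float x)
    {
        return (x + 1.0) * 0.5;
    }

    // Algebraically the same, but already in a*b+c form, so it maps to a
    // single mad. The compiler usually can't do this for you, because the
    // two forms round differently in floating point.
    float remapB(float x)
    {
        return x * 0.5 + 0.5;
    }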

On top of that, the most important thing to do when optimizing a shader is to profile the thing. Memorizing every possible instruction and trying to out-think the compiler and hardware will lead to surprises, guaranteed. I just hit a case two days ago where I optimized an expression from 24 flops to 13, and the shader slowed down by 15%. The reason was data dependencies that I couldn't see, even in the assembly.
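
For a flavor of why fewer ops can still lose, here's a standard textbook illustration (not the actual shader I hit this with): the same cubic polynomial written two ways in HLSL, with c packed as (c0, c1, c2, c3) in xyzw.

    // Horner form: only three mads, but each one has to wait for the previous.
    float cubicHorner(float x, float4 c)
    {
        return ((c.w * x + c.z) * x + c.y) * x + c.x;
    }

    // Estrin-style form: four operations, but the first three are independent
    // of each other, so there's more for the scheduler to overlap.
    float cubicEstrin(float x, float4 c)
    {
        float lo = c.y * x + c.x;   // c1*x + c0
        float hi = c.w * x + c.z;   // c3*x + c2
        float x2 = x * x;
        return hi * x2 + lo;
    }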

> The reason was data dependencies that I couldn't see, even in the assembly.

I don't understand why that would matter. Aren't GPUs in-order? I don't know the low-level architecture of GPUs at all.

An easy explanation is that you can think of GPUs as being massively hyperthreaded. So, when one thread hits a data stall, another thread picks up and uses the ALU resources until it hits a stall, and so on through many, many threads before it cycles back to the original. But data stalls are very long, and if you don't have enough ALU work for the other threads to chew on before they stall too, you'll end up back on the first thread, still waiting for data.
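
Rough numbers to make that concrete (ballpark figures for illustration, not any particular chip's spec): if a memory fetch costs on the order of 400 cycles and each thread has roughly 40 cycles of independent ALU work queued up before it needs the result, you need something like 400 / 40 = 10 threads resident per execution unit just to keep the ALUs fed; with fewer, everybody ends up waiting on memory together.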

If you want to understand low-level GPU architecture, https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-... is a great intro.

They are parallel; a data dependency cannot be pipelined as easily.

So no, they are not in-order.

Radeons do not have speculative execution, but that does not make them in-order.

Radeon compute units are in-order, though, in the usual sense of the word, and I'd love to hear about it if there really were an out-of-order GPU. It'd be rather surprising.

One thing they do have for dealing with data dependencies is that load (and texture fetch, etc.) instructions don't block. Instead, there's a separate instruction for waiting on the result of a previous load.
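
(On GCN, if I remember correctly, that's the s_waitcnt instruction: the load is issued early and the shader only stalls at a later s_waitcnt placed just before the result is actually used.)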

My understanding:

GPUs are (generally) in-order within each thread, but they are pipelined. The pipeline is filled with instructions that are ready to execute from across many threads. If all threads have an unmet dependency (previous instruction or memory access), the pipeline will stall.

GPU compilers prefer to inline everything, and they try to reuse partial results if they can, so it's easy to pick up data dependencies in places you might not expect.
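
A hypothetical sketch of what that looks like (names are mine): two helpers that look independent in source, but after inlining the compiler will typically compute dot(n, l) once and feed both from it, so they land on the same dependency chain.

    // Two helpers that look independent in source...
    float lambert(float3 n, float3 l)  { return saturate(dot(n, l)); }
    float halfWrap(float3 n, float3 l) { return saturate(dot(n, l) * 0.5 + 0.5); }

    float4 main(float3 n : NORMAL, float3 l : TEXCOORD0) : SV_Target
    {
        // ...but after inlining, the compiler will typically compute dot(n, l)
        // once and feed both results from it, so both outputs now sit on the
        // same dependency chain.
        return float4(lambert(n, l), halfWrap(n, l), 0.0, 1.0);
    }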

Nice that you came across this presentation; just a small note to say that pretty much everyone in the industry knows about it :)

I write shaders as a hobbyist game developer (I'm not in the gaming/graphics industry). I appreciated the submission. Shader code is often counterintuitive; I tend to think like a mathematician and not an optimizer when I read code examples, and I always wonder why the code is written the way it is. This paper makes it clearer!

One question: On slide 13, the author outlines some of the limitations on the compiler. In the ~4 years since the paper was published, have compilers gotten better at solving some of these problems? Or are these fundamental problems that will more or less always be present?

Don't count on it; write it out (unless it's much uglier, in which case still write it out, but leave the original code in a comment). New architectures, new platforms, a mess of shader languages, and the practice of customizing shaders for big titles lead to driver/compiler teams being chronically understaffed and overworked.
