I'm pretty excited about this. Geometry shaders have been a poor fit, but mesh & task shaders seem poised to replace them and be much more performant (if also more complex conceptually).
For those unfamiliar (reorganizing a bit for clarity):
> A new, two-stage pipeline alternative supplements the classic attribute fetch, vertex, tessellation, geometry shader pipeline. This new pipeline consists of a task shader and mesh shader:
> Task shader: a programmable unit that operates in workgroups and allows each to emit (or not) mesh shader workgroups
> Mesh shader: a programmable unit that operates in workgroups and allows each to generate primitives
> [Mesh and Task] shaders bring the compute programming model to the graphics pipeline as threads are used cooperatively to generate compact meshes (meshlets) directly on the chip for consumption by the rasterizer. Applications and games dealing with high-geometric complexity benefit from the flexibility of the two-stage approach, which allows efficient culling, level-of-detail techniques as well as procedural generation.
> The mesh shader stage produces triangles for the rasterizer, but uses a cooperative thread model internally instead of using a single-thread program model, similar to compute shaders. Ahead of the mesh shader in the pipeline is the task shader. The task shader operates similarly to the control stage of tessellation, in that it is able to dynamically generate work. However, like the mesh shader, it uses a cooperative thread model and instead of having to take a patch as input and tessellation decisions as output, its input and output are user defined.
> This simplifies on-chip geometry creation compared to the previous rigid and limited tessellation and geometry shaders, where threads had to be used for specific tasks only, as shown in figure 3.
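The two-stage flow quoted above can be sketched roughly like this (plain Python standing in for GPU workgroups; every name here is illustrative, not any real shader API):

```python
# Illustrative sketch of the two-stage task/mesh dispatch model.
# Each "task workgroup" decides how many mesh workgroups to launch (possibly 0),
# and each "mesh workgroup" cooperatively emits a small batch of primitives.

def task_stage(task_workgroup_count, payload_fn):
    """Each task workgroup emits (mesh_count, payload) -- user-defined output."""
    launches = []
    for gid in range(task_workgroup_count):
        mesh_count, payload = payload_fn(gid)  # e.g. culling: mesh_count may be 0
        for mid in range(mesh_count):
            launches.append((mid, payload))
    return launches

def mesh_stage(launches, mesh_fn):
    """Each mesh workgroup produces vertices + triangle indices (a meshlet)."""
    meshlets = []
    for mid, payload in launches:
        vertices, triangles = mesh_fn(mid, payload)
        meshlets.append((vertices, triangles))
    return meshlets

# Example: 4 task workgroups; the odd-numbered ones are "culled" (emit nothing),
# each survivor launches 2 mesh workgroups.
launches = task_stage(4, lambda gid: (0, None) if gid % 2 else (2, gid))
meshlets = mesh_stage(launches, lambda mid, gid: ([(gid, mid, 0.0)] * 3, [(0, 1, 2)]))
print(len(launches), len(meshlets))  # 4 4
```

The key point the quote makes is that both stages' inputs and outputs are user-defined, unlike the patch-in/tessellation-factors-out contract of the tessellation control stage.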
I'm pretty excited as well. This is another big step away from legacy designs that implemented the OpenGL 1.0 API directly in hardware. In case that sounds judgmental: back in the 3dfx Voodoo days it was awesome to have this kind of hardware.
What's really great here is that DirectX 12 (and now Vulkan too) lets the engine access the underlying GPU hardware without this legacy baggage.
Geometry shaders in general always felt awkward with a lot of gotchas and unintuitive aspects. This, though more complex and abstract, seems a lot easier to use.
Am I understanding mesh shading correctly if I'm explaining it like this?
In the traditional rendering pipeline you unpack triangles on the CPU and send them to the GPU, which maps them to screen space and then shades the pixels into the framebuffer.
Because of modern demands we're wanting to render insane amounts of triangles.
Mesh shaders alleviate some of the bottlenecks by adding a step in the GPU pipeline, ahead of the other steps, that can generate triangles. So for example you could upload a smaller number of triangles to the GPU and have the mesh shaders refine that geometry by procedurally filling in the details. You could even send it no triangles at all but, for example, the locations where you want grass and the direction of the wind, and have the mesh shaders generate all of the grass geometry.
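That grass idea could be sketched like this (plain Python standing in for a mesh workgroup; the function and its parameters are hypothetical, just to show the amplification pattern of few inputs producing many triangles):

```python
# Sketch: turn grass root positions + a wind direction into blade geometry,
# the way a mesh workgroup might amplify a few input points into many triangles.

def grass_blades(roots, wind, height=1.0, width=0.1):
    """Each root point becomes one two-triangle quad, tip bent along the wind."""
    verts, tris = [], []
    wx, wz = wind
    for (x, y, z) in roots:
        base = len(verts)
        verts += [
            (x - width, y, z),                     # bottom-left
            (x + width, y, z),                     # bottom-right
            (x + width + wx, y + height, z + wz),  # top-right (pushed by wind)
            (x - width + wx, y + height, z + wz),  # top-left (pushed by wind)
        ]
        tris += [(base, base + 1, base + 2), (base, base + 2, base + 3)]
    return verts, tris

verts, tris = grass_blades([(0, 0, 0), (2, 0, 0)], wind=(0.3, 0.1))
print(len(verts), len(tris))  # 8 4 -- two input points became four triangles
```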
Excuse the layman's language; I'm just a casual observer with an imprecise memory. The tech is cool, though. I wonder why it took so long for such a feature to arise; is it because there's more general-purpose compute on GPUs now?
That's just one of the many things you can do with them.
Really, mesh shaders are just vertex shaders where you see 128 (or whatever the meshlet size is) vertices at once, so things like tessellation, or amplifying geometry from 1 primitive to anywhere in [2, 128], just naturally fall out of it.
But even if you just use them for regular triangle processing like you used vertex shaders before, you can do things like cull full triangle groups (a full meshlet) instead of culling per-triangle, which can be a lot faster and more efficient.
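Coarse per-meshlet culling might look like this sketch (Python, an illustrative bounding-sphere-vs-frustum-plane test; not any particular engine's API):

```python
# Sketch: cull whole meshlets with one bounding-sphere test each, instead of
# testing every triangle. One cheap test can reject ~128 triangles at once.

def sphere_outside_plane(center, radius, plane):
    """plane = (nx, ny, nz, d); the sphere is fully outside if
    dot(n, center) + d < -radius."""
    nx, ny, nz, d = plane
    cx, cy, cz = center
    return nx * cx + ny * cy + nz * cz + d < -radius

def cull_meshlets(meshlets, frustum_planes):
    """Keep only meshlets whose bounding sphere isn't fully outside any plane."""
    survivors = []
    for (center, radius), triangles in meshlets:
        if not any(sphere_outside_plane(center, radius, p) for p in frustum_planes):
            survivors.append(triangles)
    return survivors

# One plane facing +x at x = 0: meshlets entirely at x < -radius are rejected.
planes = [(1.0, 0.0, 0.0, 0.0)]
meshlets = [(((5.0, 0.0, 0.0), 1.0), "visible_tris"),
            (((-5.0, 0.0, 0.0), 1.0), "culled_tris")]
survivors = cull_meshlets(meshlets, planes)
print(survivors)  # ['visible_tris']
```

In a real task shader each workgroup would run one of these tests and simply emit zero mesh workgroups for a rejected meshlet.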
You can also use various features that used to be compute-only. Like shared memory.
You could quite easily have groups of triangles share data which was kind of annoying to do before.
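For example (a Python sketch; `flat_shade_meshlet` and its shapes are made up for illustration): compute one value cooperatively and let every triangle in the group reuse it, which is the analogue of `shared`/`groupshared` storage in a real mesh shader.

```python
# Sketch: compute a per-meshlet value once in "shared memory" and reuse it for
# every triangle in the group -- awkward in the per-vertex shader model, natural
# when a whole meshlet's threads cooperate in one workgroup.

def flat_shade_meshlet(vertices, triangles):
    # One cooperative computation, stored where the whole group can see it.
    cx = sum(v[0] for v in vertices) / len(vertices)
    cy = sum(v[1] for v in vertices) / len(vertices)
    cz = sum(v[2] for v in vertices) / len(vertices)
    shared_centroid = (cx, cy, cz)
    # Every triangle reads the shared value instead of recomputing it.
    return [(tri, shared_centroid) for tri in triangles]

verts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
result = flat_shade_meshlet(verts, [(0, 1, 2)])
print(len(result))  # 1
```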
Countless more things...
Almost everything mesh shaders can do was already possible with a compute shader that spits out triangles, which a regular vertex shader then consumes and passes on to the rest of the pipeline. But with mesh shaders you cut out the writing and reading of that intermediate buffer completely, as your output is produced directly from the typical vertex data and then handed on to the ROP units.
I think that about sums it up. The beauty of mesh shaders is that they don’t define what your input interface looks like, as long as you produce something in the end that the rasterizer can consume. Mesh shaders define two levels of compute shaders: one corresponds to an abstract “object” (which can be really anything you want) and one corresponds to a more concrete “mesh” (a set of to-be-rasterized primitives). The “object shader” can dynamically decide how many “meshes” will be generated and spawn the appropriate number of “mesh shaders”. This gives you a level of indirection to do pretty much anything you want. It’s a generalization of the traditional rendering pipeline that supports user-defined types and structures and makes a lot of things simpler and more logical. It’s also a step toward a more unified GPU programming model that actually exposes the massively parallel nature of the GPU hardware.
I am currently experimenting with polygon triangulation in the mesh shaders on Apple platforms and I’m rather happy with the results. The performance is excellent and the client code is radically simplified.
I believe mesh shaders are the future of the geometry pipeline for graphics, especially given that both of the new consoles have an implementation of them. Freeing the mesh pipeline from the legacy of OpenGL and similar APIs is going to allow a lot of new techniques and implementations.
Mesh shaders are, in a way, a new version of what the PlayStation 2 did back in the day, where you had a vector processor that would output triangle lists to be rasterized. Its flexible mesh pipeline allowed very interesting techniques, like the multi-layer fur meshes in Shadow of the Colossus, or the multiple types of smooth LOD interpolation seen in other games. With mesh shaders, we can now emit geometry from a programming model that is more or less normal compute shaders. Each threadgroup outputs a small mesh directly into the rasterizer units of the GPU, so the flexibility is quite great.
Game developers are running into the limits of the current vertex shader model, as you can see with trickery like doing mesh processing in compute shaders and writing the final indices into a big linked list through parallel compaction and similar. With mesh shaders, all of that becomes both more optimal and simpler to write. Techniques like Nanite become a fair bit faster using mesh shaders than emulating them by writing indices into memory, as it avoids that whole memory and synchronization round trip. Completely removing the disliked geometry shaders is also a great feature.
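The compute-shader workaround described here, sketched in Python (illustrative only; a real implementation does this on the GPU with atomic counters on a buffer, not a Python list):

```python
# Sketch of the vertex-shader-era workaround: a compute pass culls triangles
# and compacts the survivors' indices into one big buffer (a plain integer
# stands in for an atomic append counter), which a later draw reads back from
# memory. Mesh shaders skip this round trip by feeding the rasterizer directly.

def compact_surviving_indices(triangles, is_visible):
    out_buffer = []
    counter = 0                     # stands in for atomicAdd on a GPU counter
    for tri in triangles:
        if is_visible(tri):
            out_buffer.append(tri)  # "reserve slot + write" in a real shader
            counter += 1
    return out_buffer, counter

tris = [(0, 1, 2), (2, 3, 0), (4, 5, 6)]
visible, n = compact_surviving_indices(tris, lambda t: t[0] != 4)
print(n, visible)  # 2 [(0, 1, 2), (2, 3, 0)]
```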
It's great to finally have a cross-platform implementation of mesh shaders, but I think the implementation leaves a bit to be desired, given that it seems to be a directly translated Vulkan version of the DX12 implementation. I'm not sure how it's going to scale to hardware other than the two big desktop vendors'.
The biggest issue with mesh shaders here is that the cache sizes and recommended limits vary depending on the GPU. To take the most advantage of mesh shaders and get great speed out of them, there are fast-path counts for triangles, vertices, and shader memory usage. Those vary by vendor, so I'm not sure how developers are going to deal with that correctly other than shipping multiple versions of the shader per vendor.
I find them cool, because in a way it is like going back to the Amiga days, but with better hardware support for this kind of stuff, or Cell stuff like you point out.
In a way this is already what companies like Otoy are doing by using CUDA for 3D rendering.
After Nanite, it seems to me that mesh shaders are just a transitory point before everything inevitably moves primarily to raster in compute. Sticking to fixed-function hardware doesn't make sense unless that hardware is faster.
(As I understand it, Nanite does use mesh shaders as a fallback for larger triangles, but those naturally require less oomph in the hardware pass.)
radv added support for it just recently, but it requires "gang submission" support in amdgpu (it's about handling parallel submissions) that's still not merged into the upstream Linux kernel, so mesh shaders are disabled by default.
Nanite on consoles already uses mesh shaders. On PC it emulates them through some trickery, but switching it to real mesh shaders would be only a small change in code.
- https://developer.nvidia.com/blog/introduction-turing-mesh-s...