This approach would fall flat on its face with large primitives:
- The for loop in the rasterizing code would produce frame buffer memory accesses with very large strides between the lanes of a warp, so the accesses would not coalesce
- It doesn't parallelize across the pixels in a primitive, so in the extreme case a single big triangle would be rasterized by a single thread
- A HW rasterizer has a lot more things it needs to do (compute barycentrics, keep around enough state to launch a fragment shader, etc.)
- Quad occupancy. Conventional rasterizers work in units of 2x2 quads, because you need neighbouring pixels to calculate the derivatives used for mipmap selection when texturing. The hardware is designed around quads, and all lanes in a quad must come from the same polygon, which leaves 3/4 of the hardware downstream of the rasterizer unused with single-pixel polygons.
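
To put a number on the quad occupancy point, here is a minimal sketch that counts how many 2x2 quads a set of covered pixels touches and what fraction of the launched lanes do useful work. The function name `quad_occupancy` and the pixel-list input are assumptions made for illustration; real hardware tracks coverage very differently.

```python
def quad_occupancy(covered_pixels):
    """Fraction of launched quad lanes that shade a covered pixel.

    covered_pixels: iterable of (x, y) integer pixel coordinates that one
    polygon covers. Hypothetical helper, not taken from any real API.
    """
    covered = set(covered_pixels)
    if not covered:
        return 0.0
    # A 2x2 quad is identified by the pixel coordinate divided by 2.
    quads = {(x // 2, y // 2) for x, y in covered}
    # Every touched quad launches all 4 lanes, covered or not.
    lanes_launched = 4 * len(quads)
    return len(covered) / lanes_launched


# A single-pixel polygon touches one quad and uses 1 of its 4 lanes.
single_pixel = [(5, 7)]
# A quad-aligned 4x4 block fills every lane of the 4 quads it touches.
aligned_block = [(x, y) for x in range(4, 8) for y in range(4, 8)]

print(quad_occupancy(single_pixel))
print(quad_occupancy(aligned_block))
```

The single-pixel case is the 3/4 waste described above: one lane shades, three lanes exist only to provide neighbours for the derivatives.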
The last two points are problems that conventional rasterizers have, but if you are writing a compute rasterizer you can conveniently skip a lot of that work.
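
Going back to the large-primitive case, here is a rough model of why one thread per triangle breaks down. It assumes a 32-lane warp that runs in lockstep, with each lane looping over the pixels in its own triangle's bounding box, so the whole warp takes as long as its slowest lane. `WARP_SIZE` and `warp_utilization` are hypothetical names, and the model ignores memory behaviour entirely.

```python
WARP_SIZE = 32  # assumed warp width; varies by hardware


def warp_utilization(bbox_pixel_counts):
    """Useful lane-iterations divided by lane-iterations the warp spends.

    bbox_pixel_counts: one loop trip count per lane, i.e. the number of
    bounding-box pixels that lane's triangle has to visit.
    """
    if len(bbox_pixel_counts) != WARP_SIZE:
        raise ValueError("expected one trip count per lane")
    longest = max(bbox_pixel_counts)
    if longest == 0:
        return 1.0  # nothing to do, nothing wasted
    useful = sum(bbox_pixel_counts)
    # In lockstep, every lane is occupied until the longest loop finishes.
    spent = WARP_SIZE * longest
    return useful / spent


# 32 tiny triangles with 2x2 bounding boxes: every lane finishes together.
all_small = [4] * WARP_SIZE
# 31 tiny triangles plus one with a 256x256 bounding box.
one_large = [4] * (WARP_SIZE - 1) + [256 * 256]

print(warp_utilization(all_small))
print(warp_utilization(one_large))
```

With uniformly tiny triangles the warp is fully used. With a single large triangle in the mix, 31 lanes sit idle for almost the whole loop, which is the "single big triangle would be single threaded" problem in miniature.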