The instruction pointer is synchronized across all lanes, giving you fewer states to reason about.
Then GPUs mess that up by letting us run blocks/thread groups independently, but now GPUs have highly efficient barrier instructions that line everyone back up.
It turns out that SIMD's innate guarantee of instruction synchronization at the lane level is why warp-based / wavefront coding is so efficient, though: none of those barriers are necessary anymore.
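As a minimal CUDA sketch of that difference (hypothetical kernel names, block size assumed to be 256): a block-wide sum has to barrier after every step, while a warp-wide sum over 32 lanes needs no barrier at all, because the lanes share one instruction stream.

    #include <cuda_runtime.h>

    // Block-wide reduction: warps communicate through shared memory,
    // so an explicit barrier is required after every step.
    // (Assumes blockDim.x == 256 for brevity.)
    __global__ void blockSum(const float* in, float* out) {
        __shared__ float buf[256];
        int t = threadIdx.x;
        buf[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();                        // line everyone back up
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (t < stride) buf[t] += buf[t + stride];
            __syncthreads();                    // and again, every iteration
        }
        if (t == 0) out[blockIdx.x] = buf[0];
    }

    // Warp-wide reduction: the 32 lanes already execute in lockstep,
    // so register shuffles replace shared memory and no block barrier is needed.
    __device__ float warpSum(float v) {
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;                               // lane 0 ends up with the warp's sum
    }

(The _sync shuffles take an explicit participation mask because Volta and newer can let lanes diverge, but within a converged warp the synchronization is part of the execution model rather than a separate barrier instruction.)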
We use threads to solve all kinds of things, including 'More Compute'.
SIMD is limited to 'More Compute' (unable to process I/O like sockets concurrently, or other such thread patterns). But as it turns out, more compute is a problem that many programmers are still interested in.
Similarly, you can use async patterns for the I/O problem (which seem to be more efficient than threads anyway).
--------
So when we think about a 2024-style program, you'd have SIMD for compute-limited problems (Neural Nets, Matrices, Raytracing). Then Async for sockets, I/O, etc.
Which puts traditional threads in this weird jack-of-all-trades position: not as good as SIMD methods for raw compute, not as good as Async for I/O. But threads do both.
Fortunately, there seem to be problems with both a lot of I/O and a lot of compute involved simultaneously.
It's not just I/O, it's data pipelining. Threads can be used to do a lot of different kinds of compute in parallel. For example, one could pipeline a multi-step computation, like a compiler: make one thread for parsing, one for typechecking, one for optimizing, and one for codegen, and then have functions move as work packages between the threads. Or, one could have many threads each running every stage in serial for different functions in parallel. Threads give programmers the flexibility to do a wide variety of parallel processing (and sometimes even get it right).
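A rough sketch of that first shape (hypothetical stage names, plain host-side C++): one thread per stage, connected by blocking queues that carry the work packages.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Hypothetical work package: one function moving through the pipeline.
    struct Job { std::string name; std::string payload; };

    // A tiny blocking queue; each stage pops from one and pushes to the next.
    class JobQueue {
        std::queue<Job> q;
        std::mutex m;
        std::condition_variable cv;
        bool closed = false;
    public:
        void push(Job j) { { std::lock_guard<std::mutex> l(m); q.push(std::move(j)); } cv.notify_one(); }
        void close()     { { std::lock_guard<std::mutex> l(m); closed = true; } cv.notify_all(); }
        bool pop(Job& j) {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&]{ return !q.empty() || closed; });
            if (q.empty()) return false;        // closed and fully drained
            j = std::move(q.front()); q.pop(); return true;
        }
    };

    int main() {
        JobQueue parsed, typechecked;

        // Stage 1: "parse" (stands in for real parsing work).
        std::thread parser([&]{
            for (auto name : {"foo", "bar", "baz"})
                parsed.push({name, "ast"});
            parsed.close();
        });

        // Stage 2: "typecheck", feeding stage 3.
        std::thread typechecker([&]{
            Job j;
            while (parsed.pop(j)) { j.payload += "+types"; typechecked.push(std::move(j)); }
            typechecked.close();
        });

        // Stage 3: "codegen".
        std::thread codegen([&]{
            Job j;
            while (typechecked.pop(j)) std::printf("emitted %s (%s)\n", j.name.c_str(), j.payload.c_str());
        });

        parser.join(); typechecker.join(); codegen.join();
    }

The second shape just replaces the per-stage threads with N identical workers that each run parse -> typecheck -> codegen end to end for different functions.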
IMHO the jury is still out on whether async I/O is worth it, either in terms of performance or in terms of the complexity that applications incur trying to do it via callback hell. Many programmers find synchronous I/O to be a really, really intuitive programming model, and the lowest levels of the software stack (i.e. syscalls) are almost always synchronous.
The ability to directly program for asynchronous phenomena is definitely worth it[0]. Something like scheduler activations, which imbues this into the threading interface, is just better than either construct without the other. The main downside is complexity; I think we will continuously improve on this but it will always be more complex than the inevitably-less-nimble synchronous version. Still, we got io_uring for a reason.
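For what it's worth, a single io_uring operation is not that scary at the API surface. Here is a minimal liburing read (error handling elided, and it blocks on the one completion just to stay short; the complexity only shows up when you juggle many in-flight operations):

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);            // small submission/completion queues

        int fd = open("/etc/hostname", O_RDONLY);    // arbitrary example file
        char buf[256];

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   // describe the read...
        io_uring_submit(&ring);                              // ...and submit it

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);              // a real server would reap many completions here
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }

The interesting part is the split between describing the operation (prep + submit) and reaping its completion later, which is what lets one thread keep many I/Os in flight.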
Fair. It's not like GPUs are entirely SIMD (and as I said in a sibling post, I agree that GPUs have substantial traditional threads involved).
-------
But let's zoom into raytracing for a minute. Intel's raytracing implementation (and indeed, the DirectX model of raytracing) consolidates ray dispatches in rather intricate ways.
Intel will literally move the stack between SIMD lanes, consolidating rays into shared misses and shared hits (to minimize branch divergence).
There are some new techniques in today's SIMD models that cannot easily be described by the traditional threading models.
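You can get a feel for the lane-level bookkeeping with a small warp-compaction sketch (this is not Intel's or DXR's actual mechanism, just the general "pack rays into coherent groups" idea, assuming a block size that's a multiple of 32): lanes vote on who hit, hits get packed densely into a queue, and a later pass shades that queue with far less divergence.

    #include <cuda_runtime.h>

    // Hypothetical, stripped-down ray record and hit test, for illustration only.
    struct Ray { int id; };
    __device__ bool intersect(const Ray& r) { return (r.id & 1) == 0; }  // stand-in for BVH traversal

    // Warp-level compaction: lanes whose rays hit pack them into a dense queue,
    // so a later shading pass runs over coherent work instead of divergent lanes.
    __global__ void traceAndCompact(const Ray* rays, int n,
                                    Ray* hitQueue, int* hitCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        bool hit = (i < n) && intersect(rays[i]);

        unsigned mask = __ballot_sync(0xffffffffu, hit);     // which lanes hit?
        int lane = threadIdx.x & 31;
        int rank = __popc(mask & ((1u << lane) - 1));        // my position among the hitters

        int base = 0;
        if (lane == 0)                                       // one atomic per warp, not per ray
            base = atomicAdd(hitCount, __popc(mask));
        base = __shfl_sync(0xffffffffu, base, 0);            // broadcast the base to the warp

        if (hit)
            hitQueue[base + rank] = rays[i];
    }

Intel's scheme goes further and actually migrates stack contents between lanes, but the payoff is the same: the expensive, divergent work ends up running over packed, like-minded lanes.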
GPUs in particular have a very hyperthread/SMT like model where multiple true threads (aka instruction pointers) are juggled while waiting for RAM to respond.
Still, the intermediate organizational step where SIMD gives you a simpler form of parallelism is underrated and understudied IMO.