My understanding:

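GPUs (generally) execute each thread's instructions in order, but the hardware is pipelined. The scheduler keeps the pipeline full by issuing whichever instructions are ready to execute, drawn from many threads (in practice, from groups of threads such as warps or wavefronts). If every thread is waiting on an unmet dependency (the result of a previous instruction or an outstanding memory access), nothing can issue and the pipeline stalls.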
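
To make that concrete, here is a toy Python model of the idea. It is only a sketch, not how any real GPU works: the `simulate` function, the single issue slot per cycle, the fixed `latency`, and the rule that every instruction depends on the previous one in its thread are all simplifying assumptions of mine.

```python
def simulate(num_threads, instrs_per_thread, latency):
    """Toy in-order, multithreaded issue model (hypothetical, for illustration).

    Assumptions: one instruction may issue per cycle; within a thread, each
    instruction depends on the previous one, whose result becomes available
    `latency` cycles after it issues.
    Returns (total_cycles, stall_cycles).
    """
    ready_at = [0] * num_threads                   # earliest cycle each thread may issue
    remaining = [instrs_per_thread] * num_threads  # instructions left per thread
    cycle = 0
    stalls = 0
    while any(remaining):
        for t in range(num_threads):
            # Issue from the first thread that has work and a met dependency.
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                break
        else:
            # No thread was ready this cycle: the issue slot goes unused.
            stalls += 1
        cycle += 1
    return cycle, stalls


if __name__ == "__main__":
    for n in (1, 2, 4):
        total, stalled = simulate(num_threads=n, instrs_per_thread=4, latency=4)
        print(f"{n} thread(s): {total} cycles, {stalled} stalled")
```

In this model, with a latency of 4 cycles, a single thread spends 9 of its 13 cycles stalled, while four threads are enough to issue an instruction on every cycle. That is the sense in which having many threads hides latency even though each thread is in-order.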

