But you don't have thousands of cores. You have a medium amount of cores (64 in the latest Vega 64 GPU) with a high amount of ALUs (and threads) per core. When the GPU executes an instruction for a thread it looks if the other threads execute the same instruction and then utilises all ALUs at once. This is great for machine learning and HPC where you often just have a large matrix or array of the same datatype that you want to process but most of the time this isn't the case.