
The CPU for complex stuff and GPU for simple number crunching is a really popular narrative. It's true in one sense and nonsense in another, depending entirely on the software being run.

If your program has one or two threads and spends all its time doing branchy control flow, a CPU will run it adequately and a GPU very poorly. If the program has millions of mostly independent tasks, a GPU will run it adequately and a CPU very poorly. Those are the two limiting cases, though, and quite a lot of software sits somewhere in the middle.

The most concise distinction to draw between the hardware architectures is what they do about memory latency. We want lots of memory, which means it ends up far from the cores, so you have to do something while you wait for accesses to it. Fast CPUs use branch predictors and deep pipelines to keep the cores busy; fast GPUs keep a queue of coroutines ready to go and pick a different one to step along when waiting for memory. That's roughly why CPUs have a few threads - all the predictor and wind-back logic is expensive. It's also why GPUs have many - no predictor or wind-back logic, but you need a lot of coroutines ready to go to keep the latency hidden.
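That latency-hiding trick can be sketched in a few lines. This is a toy model, not any real scheduler: the latency, access counts, and round-robin policy are all made up for illustration.

```python
MEM_LATENCY = 10  # cycles a memory access takes in this toy model

def run(num_threads, accesses_per_thread=4):
    """Cycles to finish all work when one core round-robins between
    resident threads, switching away whenever a thread stalls on memory."""
    ready_at = [0] * num_threads            # cycle each thread can next issue
    remaining = [accesses_per_thread] * num_threads
    cycle = 0
    while any(remaining):
        runnable = [i for i in range(num_threads)
                    if remaining[i] and ready_at[i] <= cycle]
        if not runnable:                    # everyone is waiting on memory
            cycle = min(ready_at[i] for i in range(num_threads)
                        if remaining[i])
            continue
        i = runnable[0]                     # issue one compute cycle, then
        remaining[i] -= 1                   # this thread waits on its access
        ready_at[i] = cycle + 1 + MEM_LATENCY
        cycle += 1
    return cycle

print(run(1))   # one thread: stalls dominate the runtime
print(run(11))  # enough threads resident: the stalls overlap and vanish
```

With one thread the core idles through every memory wait; with latency-plus-one threads there is always someone ready to issue, so the core never idles even though each individual access is just as slow.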

Beyond that, there's nothing either CPU or GPU can do that the other cannot. They're both finite approximations to Turing machines. There are some apparent distinctions from the hosted OS abstraction where the CPU threads get "syscall" and the GPU threads need to DIY their equivalent but the application doesn't care. Threads on either can call fprintf just fine. It makes a bit of a mess in libc but that's alright (and done in LLVM already).




Thanks, that is a good summary.

> If the program has millions of mostly independent tasks, a GPU will run it adequately and a CPU very poorly.

Now I would say:

CPU - optimized to execute threads with independent instruction streams and data use.

GPU - optimized to execute threads with common instruction streams and data layouts.

CPUs

As you noted: optimizing conditional branches is one reason CPU cores are more complex and larger.

CPUs also handle the special tasks of being the overall “host”: I/O, etc.

---

GPUs

One instruction stream over many cores greatly reduces per core footprints.

Both sides of conditional code are often taken, by different cores, so branch prediction is also dispensed with.

(All cores in a group step through all the instructions of both the if-true and else-false clauses, but each core only executes one branch and is inactive for the other: independent per-core execution, without independent code branching. When all cores in a group do take the same branch, the untaken clause can be skipped.)
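A minimal sketch of that masked execution in plain Python (the lane values, the abs() example, and the `simt_abs` name are all just illustrative; real hardware does this with per-lane predicate masks, not loops):

```python
def simt_abs(lanes):
    """One instruction stream over several lanes: the group walks through
    BOTH sides of the branch; a mask decides which lanes commit results."""
    mask = [x < 0 for x in lanes]   # lanes for which the if-branch is taken
    out = list(lanes)
    # if-true clause: every lane steps through it, only masked lanes commit
    for i in range(len(lanes)):
        if mask[i]:
            out[i] = -lanes[i]
    # else clause: the complementary lanes commit (here: keep the value)
    for i in range(len(lanes)):
        if not mask[i]:
            out[i] = lanes[i]
    return out

print(simt_abs([-3, 5, -1, 2]))  # every lane stepped through both clauses
```

Each lane spends time in both clauses but only writes during the one its mask enables, which is why divergent branches cost roughly the sum of both sides.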

Both CPU and GPU can swap between threads or thread groups, to keep cores busy.

Both rely on a memory hierarchy, from small and fast per-core storage out to large shared storage (registers, cache levels, RAM).



