> single-threaded nonvectorized C wastes on the order of 97% of your computer's computational power
Can you elaborate on what this means exactly? For example, is there some reasonable C code that runs 33 times slower than some other ideal code? In what sense are we wasting 97% of our computer's computational power?
Roughly, that 3000× breaks down as 18× from multithreading, 3× from SIMD instructions, 15× from tuning access patterns for locality of reference, and 3× from turning on compiler optimization options. This is a really great slide deck!
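For what it's worth, multiplying those factors out gives 18 × 3 × 15 × 3 = 2430, so "roughly 3000×" is in the right ballpark.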
I was assuming "single-threaded nonvectorized C" already had compiler optimization turned on and locality of reference taken into account. As the slide deck notes, you can get some vectorization out of your compiler — but usually it requires thinking like a FORTRAN programmer.
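To make that concrete, here's a minimal sketch (the function and flags are illustrative, not from the slide deck) of the kind of loop most compilers will auto-vectorize once you enable optimization, e.g. `gcc -O3 -march=native saxpy.c`; at `-O0` the same loop runs one scalar element per iteration:

```c
#include <stddef.h>
#include <stdio.h>

/* restrict promises the compiler that x and y don't alias, which is often
 * what it needs before it will emit SIMD loads/stores for this loop. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    saxpy(y, x, 2.0f, 8);
    printf("%g\n", y[7]); /* 16 */
    return 0;
}
```

With GCC you can ask it to report which loops it vectorized, e.g. with `-fopt-info-vec`, which is a quick way to see whether you've managed to think enough like a FORTRAN programmer.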
So I think in this case reasonable C code runs about 54× slower than Leiserson's final code (the 18× from multithreading times the 3× from SIMD, since we're assuming the locality and compiler-flag work is already done). However, you could probably get a bigger speedup in this particular case with GPGPU. Other cases may be harder to speed up on a GPU but may allow a bigger SIMD speedup. So I think my 97% is generally in the ballpark.
A big problem is that we can't apply this level of human effort to optimizing every subroutine. We need better languages.
This is great! Which of these do you think could be extended to general-purpose programming without the HPC expert? Taichi and DAPP seem to be aimed at that goal, but you seem to be implying they don't reach it yet?
You can use them without the HPC expert; Halide, for example, has a good autotuner and has been used by Google and Adobe to create image filters for mobile devices.
I'm still kind of a newb myself, but from what I understand these are special CPU instructions that let you execute the same instruction in parallel against multiple data points. This lets you eke out a lot more performance. It's how simdjson[1] is able to outperform all other C++ JSON parsers.
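At the source level it looks roughly like this (a tiny hand-rolled SSE sketch, not how simdjson is actually written, and it assumes an x86-64 machine; the same pattern scales to wider AVX registers):

```c
/* One addps instruction adds four packed floats at once. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 floats into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* 4 additions in a single instruction */
    _mm_storeu_ps(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]); /* 11 22 33 44 */
    return 0;
}
```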
8 cores times 4 SIMD lanes is a 32× speedup; that's where "97%" comes from, as explained in the note I linked to.
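Spelled out: 8 cores × 4 SIMD lanes = 32 "slots" the machine can fill every cycle. Single-threaded, non-SIMD code occupies 1 of them, so the wasted fraction is 1 − 1/32 = 31/32 ≈ 96.9%, i.e. about 97%.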
It's pretty variable: some things we haven't figured out how to speed up with SIMD, sometimes we have a GPU, sometimes we can get 8 or 16 SIMD lanes out of 128-bit SSE3 or AVX, or 32 of them out of 256-bit AVX, sometimes you only have four cores, and sometimes `make -j` is enough parallelism to win back the 8× factor from the cores (though not the SIMD or GPGPU factors). But I think 97% is a good ballpark estimate in general.
Just to clarify, I was only estimating a speedup of 4× from vectorization, while the other 8× comes from multithreading.
Fifteen years ago we thought regular expression matching and compilers were unlikely to benefit from vectorization, but now we have Hyperscan and Co-dfns, so it turns out they do. So I think it's likely that we will figure out ways to do a wider variety of computations in vectorized ways, now that the rewards are so great.
As an example, if you are checking a boolean flag (1 bit) on an object, and it ends up being a cache miss (and x86_64 cache line size is 64 bytes), then your computer just went through all the expense of pulling in 512 bits from RAM yet it only used 1 of them. You are achieving 0.2% of the machine's possible throughput.
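A sketch of the usual fix (the names and layout here are mine, and `__builtin_popcountll` assumes GCC or Clang): keep the flags packed in their own bit array instead of one per 64-byte object, so the 512 bits a cache line delivers are all bits you actually wanted:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct obj {                   /* 64 bytes: the flag drags in 63 unused bytes */
    uint8_t flag;
    uint8_t payload[63];
};

/* Array-of-structs: roughly one cache line touched per flag checked. */
size_t count_flags_aos(const struct obj *objs, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += objs[i].flag & 1;
    return count;
}

/* Packed bitmap: 512 flags per cache line, and easy to vectorize too. */
size_t count_flags_bitmap(const uint64_t *bits, size_t n_words)
{
    size_t count = 0;
    for (size_t i = 0; i < n_words; i++)
        count += (size_t)__builtin_popcountll(bits[i]);
    return count;
}

int main(void)
{
    uint64_t bits[2] = {0xFFu, 1u};               /* 9 flags set */
    printf("%zu\n", count_flags_bitmap(bits, 2)); /* prints 9 */
    return 0;
}
```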
I imagine that if you could make the most of the vector instruction set in your code (where instructions operate on a vector of data at once instead of one element at a time), you'd get a huge performance boost for "free". GP seems to be working on a VM that lets you do that (a lot of it was flying over my head though, need some coffee).