While you're right that memory is the bottleneck in many CPU-intensive programs, the idea that you need low-level access to work around it is far from obvious. For example, several well-known people have argued that C makes it harder to use memory effectively, because pointers make the compiler's job so difficult (Fortran still beats C on most numerical benchmarks).
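To make the pointer argument concrete, here is a minimal sketch of the aliasing problem. The function names are made up for illustration, and how much the two versions differ in speed depends entirely on your compiler and target. In the first version the compiler has to assume that dst and src may overlap, which limits how freely it can reorder or vectorize the loads and stores. The C99 restrict qualifier in the second version is a promise from the programmer that they do not overlap, which is roughly what Fortran assumes for array arguments by default.

```c
#include <stddef.h>

/* dst and src may alias as far as the compiler knows, so a store to
 * dst[i] could change a later src[j]. The compiler must either keep the
 * accesses in order or emit a runtime overlap check before vectorizing. */
void scale_add(double *dst, const double *src, double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}

/* Same loop, but restrict promises that dst and src do not overlap.
 * Calling this with overlapping arrays is undefined behavior. */
void scale_add_restrict(double *restrict dst, const double *restrict src,
                        double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Both functions compute the same result on non-overlapping arrays; the only difference is what the optimizer is allowed to assume.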
I think we are at a point where architectures differ so much that, even though controlling memory access patterns is in theory more powerful, in practice it is impossible to get right unless you can spend an insane amount of time on it. The difference between the P4 and the Core Duo, for example, is enormous as far as organizing memory accesses goes. This is exactly like ASM vs C: you can still beat C with ASM, but doing so by hand across all architectures is almost impossible.