Caches, memory hierarchies, out-of-order execution, etc. are hidden from assembly as well as C. One reason for this, which isn't mentioned in your comment (or the ACM article), is not that everyone loves C, but rather that most software has to run on a variety of hardware, with differing cache sizes, power consumption targets, etc. Pushing all of that optimization and fine-tuning off to the hardware means that software isn't forced to work only on the exact computer model it was designed for.
The author also mentions that alternative computation models would make parallel programming easier, but this neglects the numerous problems that aren't parallelizable. There's a reason why we haven't switched all of our computation to GPUs.
I don't think I completely agree with your sentiment. While we want our software to run everywhere (at least across the x86 family, regardless of feature sets), we also want to make sure it performs well. This is especially important in areas where we (ab)use the hardware to the fullest (games, science, rendering, etc.).
To enable these performance optimizations, we have taught our compilers tons of tricks, like the -march and -mtune flags. We also allow our compilers to generate reckless code with -ffast-math, or add tons of assembly or vectorization hints to libraries like Eigen.
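A minimal sketch of why those flags matter (assuming GCC or Clang; file and build line are just illustrative): strict IEEE floating-point semantics forbid reassociating a reduction, so a plain sum loop usually won't vectorize at -O2 alone, while -ffast-math (or the narrower -fassociative-math) lets the compiler split it across SIMD accumulators.

    /* sum.c -- illustrative only. Build e.g. with:
     *   gcc -O2 -march=native -ffast-math -S sum.c
     * and compare the generated assembly with and without -ffast-math. */
    #include <stddef.h>

    double sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];   /* serial dependence on s under strict FP rules */
        return s;
    }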
We write benchmarks like STREAM, or other tools that measure core-to-core latency, or measure execution time with different data lengths to detect cache sizes, associativity, and whatnot. Then we use this information to tune our code or compiler flags to maximize the speed of the software at hand.
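A rough sketch of that last trick (all names and sizes here are illustrative; a random permutation would defeat the hardware prefetcher better than this strided cycle):

    /* latency.c -- walk a pointer chain over growing working sets and time
     * each dependent load; jumps in ns/access roughly mark the L1/L2/L3
     * boundaries. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double chase(const size_t *chain, size_t iters)
    {
        struct timespec t0, t1;
        size_t i = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t k = 0; k < iters; k++)
            i = chain[i];                  /* dependent loads, no overlap */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (i == (size_t)-1) puts("");     /* keep the loop from being optimized away */
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
    }

    int main(void)
    {
        for (size_t kb = 4; kb <= 64 * 1024; kb *= 2) {
            size_t n = kb * 1024 / sizeof(size_t);
            size_t *chain = malloc(n * sizeof *chain);
            if (!chain) return 1;
            for (size_t j = 0; j < n; j++)           /* touch one element per 64-byte line */
                chain[j] = (j + 64 / sizeof(size_t)) % n;
            printf("%6zu KiB: %.2f ns/access\n", kb, chase(chain, 10 * 1000 * 1000));
            free(chain);
        }
        return 0;
    }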
If caches and other parts of the system were visible to assembly, we could ask the processor about their properties and optimize directly for them, even do data-allocation tricks or prefetching without guesswork (which some architectures support via programmable external prefetch engines), instead of tuning in the dark with half-informative data sheets, undisclosed AVX frequency behavior, or techniques like running perf and looking at cache-miss percentages, IPC, and other counters to make educated guesses about how a processor behaves.
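On Linux with glibc, for instance, some of this is already queryable (a sketch, not a portable guarantee: these sysconf names are glibc extensions and may report 0 on some systems):

    /* probe.c -- ask the OS/CPU for cache geometry instead of guessing. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("cache line: %ld B\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L1d: %ld B\n",        sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L2:  %ld B\n",        sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3:  %ld B\n",        sysconf(_SC_LEVEL3_CACHE_SIZE));
        /* With known sizes you can pick tile sizes, or issue explicit
         * prefetches with the GCC/Clang builtin, e.g.:
         *   __builtin_prefetch(&data[i + 16], 0, 3);                  */
        return 0;
    }

But that's the OS doing us a favor, not the ISA exposing itself.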
Yes, not everything can be run in parallel, and I don't want to move all computation to GPUs with FP16 half-precision math, but we can at least agree that these systems are designed to look like PDP-11s from a distance, and our compilers are the topmost layer of this "emulation", doing all kinds of tricks. Trying to deliver this performance in an opaque way is why we have Spectre and Meltdown, for example: cases where these abstractions and mirrors break down.
If our hardware were more transparent to us, we could arguably optimize our code selectively and more easily, if it had switches labeled "Auto / I know what I'm doing" for certain features.
Intel tried to take this to the max (do all the optimization in the compiler) with Itanium. The architecture was so dense, it seems, that it failed to float.