Modern CPUs can look ahead in the instruction stream and run 4 or more instructions simultaneously. This isn't easy: the core has to respect data dependencies, where one instruction needs the output of a previous one. When something depends on a load instruction that missed in the cache, the CPU can keep going farther ahead and do other work while waiting for the load to complete.
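To make the data-dependency point concrete, here is a minimal sketch in C (the function names are made up for illustration). The first loop carries every addition through one variable, so each add has to wait for the one before it. The second keeps four independent running sums, which gives an out-of-order core several additions it can overlap. Whether this is actually faster depends on the compiler and the hardware, and because the additions happen in a different order, the floating-point result can differ in the last bits for general inputs.

```c
#include <stddef.h>

/* One accumulator: each add needs the previous sum, so the adds form a serial chain. */
double sum_chained(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: the four adds in each iteration do not depend on each other. */
double sum_split(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```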

This "speculative out-of-order execution" requires a huge number of transistors, because every clock cycle the core has to consider many combinations of upcoming instructions it might be able to execute, and that burns some extra power. So although most of the basic ideas were known by the late 90s, each generation's extra transistors let cores do more and more in parallel.

Also, faster and larger caches cause fewer stalls.

Also, modern cores are better at predicting branches, so they can start executing instructions past a branch before knowing which way the branch will go. If the core guessed wrong, it has to undo the results of those instructions, so it needs a lot of extra complexity to track every side effect that might have to be canceled.
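As a rough illustration of where prediction matters, here are two ways to count bytes above a threshold in C (hypothetical function names). In the first, the core has to guess the outcome of the `if` for every element, which is cheap on predictable data and expensive on random data. The second adds the 0-or-1 result of the comparison unconditionally, so there is nothing to guess inside the loop body. This is only a sketch: an optimizing compiler may well turn both into the same machine code.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy: the core must predict the outcome of the `if` for every element. */
size_t count_above_branchy(const uint8_t *v, size_t n, uint8_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold)
            count++;
    }
    return count;
}

/* Branchless: the comparison yields 0 or 1, which is added unconditionally. */
size_t count_above_branchless(const uint8_t *v, size_t n, uint8_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] > threshold);
    return count;
}
```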

Also, SIMD parallelism has gotten much better. Some modern cores can do 8 floating-point operations per cycle using AVX2 or Neon. While older SIMD systems had very limited instruction sets, you can do a lot with modern ones. x86 SIMD instructions can process 32 bytes at a time. With a great deal of cleverness, you can do some byte-stream operations in less than one cycle per byte. See https://arxiv.org/abs/2010.03090
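To give a flavor of processing many bytes per operation, here is a small portable sketch in C. It is not real SIMD: it uses an ordinary 64-bit integer to test 8 bytes at once for the high bit (a trick sometimes called SWAR, "SIMD within a register"), as a stand-in for the 32-byte registers mentioned above. The function name `is_ascii` is just for illustration, and real SIMD code would use vector intrinsics instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns true if no byte in buf[0..n) has its high bit set (i.e. all ASCII). */
bool is_ascii(const unsigned char *buf, size_t n) {
    const uint64_t high_bits = 0x8080808080808080ULL;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof word);   /* safe unaligned 8-byte load */
        if (word & high_bits)                  /* one test covers 8 bytes */
            return false;
    }
    for (; i < n; i++)                         /* leftover bytes, one at a time */
        if (buf[i] & 0x80)
            return false;
    return true;
}
```

The same idea scales up with wider registers: one AND plus one test covers the whole register, so the cost per byte drops as the register gets wider.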

GPUs generally do 32 parallel floating point operations per core per cycle, with hundreds of cores.

Also, main memory is gradually getting slightly faster and wider.

Lastly, more cores are good. Back when the most cores you could get was 4, it was barely worth writing parallel software, because all the locking slowed things down almost as much as the 4 cores sped things up. But high-end Xeons can have 40+ cores, which makes it worth the hassle of writing parallel code. And GPUs have thousands of cores, so it's worth a lot of complication to make use of them.
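One common way to sidestep the locking cost is to give each thread its own private partial result and combine them once at the end. Here is a minimal sketch in C using an OpenMP reduction, assuming a compiler with OpenMP support turned on (for example `-fopenmp` with GCC or Clang). The function name is hypothetical. If OpenMP is not enabled, the pragma is ignored and the loop simply runs on one core with the same result.

```c
#include <stddef.h>

/* Sum an array across cores. With OpenMP enabled, each thread accumulates into
   its own private copy of `total`, and the copies are added together once at
   the end, so the loop body never takes a lock. */
long long parallel_sum(const int *v, size_t n) {
    long long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long long i = 0; i < (long long)n; i++)
        total += v[i];
    return total;
}
```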