"Even lock-free algorithms will not be parallel enough. They rely on instructions that require communication and synchronization between cores’ caches."
Azul's Vega 3 with 864 cores/640GB mem (2008) with Azul JVM apparently works fine using lock-free java.util.concurrnt.* classes & would appear to be a counter point to the very premise of the OP.
It is also probably more likely we will see new drastic rethink of memory managers and cooperation between h/w and s/w designers (kernel /compiler level). Right now, everything is sitting on top of malloc() and fencing instructions. What is more painful? Write non-deterministic algorithms or bite the bullet and update h/w and kernels and compilers? See Doug Lea's talk at ScalaDays 2011 ( @66:30)
And this is not to mention anything about FP and STM approach to the same issue.