Like right now the L1 cache will have latencies of 1 or 2 cycles, and the L2 cache 15; this is due to the overheads of cache coherency protocols, moving the data around the chip; it's not that the memory's slower, it's all SRAM.
They are probably referring to enterprise workloads. Here you have large working sets (so caches are less useful) and you want maximum throughput. Clever multithreading (finegrained) can reduce effective latency by scheduling many (32?) processes at the same time, executing an instruction from each in round-robin fashion (see Sun Niagara). In that case, you can sometimes dump the L1 cache, and you would be able to get rid of the memory hierarchy.
There's also probably a benefit wrt hard drives/secondary storage; you can obviously make system storage very fast, which might improve random access times considerably. BUT this is probably not going to be transformative; it'll improve certain types of accesses, but current algorithms are already very highly tuned to spatial and temporal locality of reference. Furthermore, you'll still see these structures win out, because they can take advantage of hardware prefetching more easily.