Afaict the gist of the hotpads paper is this:
Drop the existing way we organize processor caching and instead make it look like generational GC: objects can move between different cache levels, and pointers can be rewritten to point to the level the pointee now lives in.
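To make the pointer-rewriting idea concrete, here's a minimal sketch (my own illustration, not the paper's design): each "pad" is one level of the hierarchy, a pointer encodes which pad holds the object, and moving an object rewrites every stored pointer to it. The `Pad`/`move` names and the dict-based storage are all assumptions for illustration.

```python
class Pad:
    """One level of the object hierarchy; a tiny dict keyed by object id (assumption)."""
    def __init__(self, name):
        self.name = name
        self.slots = {}
        self.next_id = 0

    def alloc(self, payload):
        oid = self.next_id
        self.next_id += 1
        self.slots[oid] = payload
        return (self.name, oid)   # a "pointer" records which pad holds the object

def move(ptr, pads, dst, holders):
    """Move the pointee to pad `dst`, rewriting every stored pointer to it."""
    pad_name, oid = ptr
    payload = pads[pad_name].slots.pop(oid)
    new_ptr = pads[dst].alloc(payload)
    for holder in holders:                 # the pointer-rewriting pass
        for slot, value in holder.items():
            if value == ptr:
                holder[slot] = new_ptr
    return new_ptr

pads = {"pad0": Pad("pad0"), "pad1": Pad("pad1")}
p = pads["pad0"].alloc("some object")
root = {"child": p}
q = move(p, pads, "pad1", [root])
# root["child"] now points at the object's new home in pad1
```

The real hardware of course doesn't scan all holders; the point is only that a move changes the pointer's encoded location, so anything still holding the old pointer must be updated.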
"We have written the RTL for these circuits and synthesized them at 45nm using yosys and the FreePDK45 standard cell library. The compression circuit requires an area equivalent to 810 NAND2 gates at a 2.8 GHz frequency. The decompression circuit requires an area equivalent to 592 NAND2 gates at a 3.4 GHz frequency. These frequencies are much higher than typical uncore frequencies (1-2 GHz), and a more recent fabrication process would yield faster circuits."
But they didn't give enough information to say how much headroom there is: no details about how shallow or deep the circuit is, the voltage, or...
And if there weren't power limitations we'd already have cores running at even higher clock speeds. So a solution that depends too heavily on running something faster may never be relevant enough to use in practice. That's why I pointed to that assumption: I think it has to be understood before the solution can be considered usable.
It reminds me somewhat of an optimization that my colleague Brian Hackett implemented on the Spidermonkey JS engine, which would use runtime-type-tracking infrastructure to discover objects that had specialized constituents (i.e. a slot was always a boolean, and had never not been a boolean).
The system would notice this and then transition the object to a new layout with non-value-boxed entries where appropriate. Of course this included de-optimization hooks to allow objects to transition back to the general "boxed" layout if mutations to the object de-specialized the slot.
This technique delivered some significant performance improvements when it came to computationally heavy, type-stable code (such as what we find in the Octane benchmarks). It wasn't as effective on type-unstable code.
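The specialize-then-deoptimize dance described above can be sketched roughly like this (my own illustration of the general technique, not SpiderMonkey's actual implementation; all class names are hypothetical):

```python
class Boxed:
    """General layout: every value carried with an explicit type tag."""
    def __init__(self):
        self.slot = ("unknown", None)

    def write(self, value):
        self.slot = (type(value).__name__, value)
        return self

    def read(self):
        return self.slot[1]

class BoolSpecialized:
    """Specialized layout: the slot is stored as a raw bool, no tag."""
    def __init__(self, raw):
        self.raw = raw

    def write(self, value):
        if isinstance(value, bool):
            self.raw = value      # fast path: assumption still holds
            return self
        # De-optimization hook: this write broke the "always a bool"
        # assumption, so transition back to the general boxed layout.
        return Boxed().write(value)

    def read(self):
        return self.raw
```

Type-stable code stays on the `BoolSpecialized` fast path forever; a single type-unstable write falls back to `Boxed`, which is why the win shows up mainly on type-stable workloads.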
I'd expect that for most traditional standalone programs written in a statically typed, managed language (e.g. Java), this sort of approach has a lot of promise.
Seems this cannot be read without first understanding "object based memory hierarchies", which is new to me. This coauthor's page has a ton of related papers: http://people.csail.mit.edu/poantsai/
I do remember the joy of doubling 16MB of RAM to 32MB.
And that scene from Johnny Mnemonic was priceless for me: