BTW, the amount of groundbreaking technological innovation going on in the Java space these days at Oracle (where I work) -- be it optimizing compilers like Graal, GCs like ZGC, or low-overhead production profiling (JFR) -- is quite phenomenal.
See https://stefan-marr.de/papers/oopsla-marr-ducasse-meta-traci... for a discussion.
Is there a brief summary of what Graal etc. are doing differently to overcome the 1990s-style partial-evaluation performance problems?
There exist several partial evaluators and benchmarks for the use case of Futamura projections.
I will admit that I know too little about the internals of Graal, but I haven't seen any papers describing how the traditional issues with partial evaluation have been solved for the general case. My guess is that all of this still only works and yields good results under certain assumptions and designs, i.e. just throwing any generic interpreter code at the partial evaluator will not necessarily yield a compiler that is particularly good.
The idea of general partial evaluation and the Futamura projections is, well, rather general: take code plus a static input and produce a program specialized to that input. If the code is an interpreter and the static input is a program, the result is compiled code for that program, and so on.
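To make that concrete, here's a minimal hand-worked sketch of what a partial evaluator automates (plain Java, purely illustrative; the names are mine, not any Graal/Truffle API):

```java
class PeSketch {
    // A generic function; `exp` plays the role of the static input.
    static int pow(int base, int exp) {
        int result = 1;
        for (int i = 0; i < exp; i++) {
            result *= base;
        }
        return result;
    }

    // Partially evaluating pow with exp = 3 fixed unrolls the loop and
    // folds the constants, leaving this residual program:
    static int pow3(int base) {
        return base * base * base;
    }
}
```

The first Futamura projection is the same move applied to interpret(program, input) with the program held static: the residual program is compiled code for that program.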
It's rather hard to deliver on the promise that this always works and, in particular, yields performant results on arbitrary code and inputs. And I don't think Graal delivers this either (not that it has to).
It's possible that Graal can be considered the first industry compiler for a full mainstream language that does a high degree of partial evaluation and allows tooling around it (i.e. Truffle, for implementing language interpreters that can be optimized into compilers). However, it would be disingenuous to claim it's the first compiler/partial evaluator to yield practical results from applying the Futamura projections.
(You didn't quite claim that, but the previously "theoretical" part kind of goes there.)
> Writing language interpreters for our system is simpler than writing specialized optimizing compilers. However, it still requires correct usage of the primitives and a PE mindset. It is therefore not straightforward to convert an existing standard interpreter into a high-performance implementation using our system.
> Our experience shows that all code that was not explicitly designed for PE should be behind a PE boundary. We have seen several examples of exploding code size or even non-terminating PE due to infinite processing of recursive methods. For example, even the seemingly simple Java collection method HashMap.put is recursive and calls hundreds of different methods that might be processed by PE too.
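For the curious, the boundary idiom the paper describes looks roughly like this in Truffle (a sketch based on Truffle's CompilerDirectives.TruffleBoundary annotation; consult the Truffle docs for the exact contract):

```java
import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;

import java.util.HashMap;
import java.util.Map;

final class SymbolTable {
    private final Map<String, Object> symbols = new HashMap<>();

    // Without a boundary, PE would try to process HashMap.put and the
    // hundreds of methods reachable from it, risking code-size explosion
    // or non-terminating partial evaluation. The annotation stops PE
    // here and emits a plain call instead.
    @TruffleBoundary
    Object define(String name, Object value) {
        return symbols.put(name, value);
    }
}
```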
In isolation, this sounds like a description of how HotSpot works, no? (For the uninitiated: HotSpot's JITs make dangerous assumptions, optimize accordingly, and discard the compiled code if those assumptions ever break.)
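A rough illustration of that speculate/deoptimize cycle (ordinary Java; the comments describe typical JIT behaviour, not guaranteed output):

```java
interface Shape { double area(); }

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

final class Square implements Shape {
    final double s;
    Square(double s) { this.s = s; }
    public double area() { return s * s; }
}

class Speculation {
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape sh : shapes) {
            // If profiling has only ever seen Circle at this call site,
            // the JIT speculates it is monomorphic, devirtualizes and
            // inlines Circle.area(). The first Square to reach the site
            // breaks that assumption: the compiled code is thrown away
            // (deoptimization), execution falls back to the interpreter,
            // and the method is later recompiled with the new profile.
            sum += sh.area();
        }
        return sum;
    }
}
```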
First of all, Graal is a compiler that serves as a HotSpot compiler when running as a Java bytecode JIT. (HotSpot is the name of OpenJDK's JVM: it includes two compilers, with Graal serving as a third, an interpreter, several GCs, and various other runtime features.) But yes, HotSpot's default optimizing compiler, called C2, also does speculative optimizations and deoptimizes on "mistakes". Graal is more general in the sense that it's easier to teach it various optimizations for many languages.

C2 is very good, but because Graal is believed to be easier to maintain, it may match and surpass it one day (it already surpasses it for some important workloads). Project Metropolis investigates the possibility of eventually making Graal the default optimizing compiler in HotSpot (https://openjdk.java.net/projects/metropolis/).