I found this paragraph confusing: is it talking about data prefetchers (which would make sense b/c of the mention of short prefetches) or branch predictors (which would make sense b/c of the mention of TAGE and Perceptron)?
Or, to put that another way, this reads to me like the probabilistic equivalent of a compiler doing dead code elimination on unconnected basic blocks. The L1 predictor is marking L1 cache lines as “dead” (i.e. LRU) when no recently-visited L1 cache line branch-predicts into them.
The way this works is that there's a fast (L1) predictor that can make a prediction every cycle, or at worst every two cycles, and it initially steers the front end. At the same time, the slow (L2) predictor is working on its own prediction, but it takes longer: it either has a throughput limit (e.g., one prediction every 4 cycles) or a long latency (e.g., 4 cycles from its last input to a new prediction). If the slow predictor ends up disagreeing with the fast one, the front end is "re-steered", i.e., repointed at the new path predicted by the slow predictor.
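Here's a minimal sketch of the idea as I understand it. The structure (L1 steers immediately, L2 arrives later and overrides on disagreement) is from the description above; all names, latencies, and the stand-in predictors are my own assumptions, not any real microarchitecture:

```python
# Toy model of an overriding ("fast/slow") branch predictor front end.
# The fast L1 predictor steers fetch every cycle; the slow L2 predictor
# delivers its (more accurate) prediction a few cycles later and
# re-steers the front end if it disagrees. All numbers are illustrative.
import random

RESTEER_BUBBLE = 3  # hypothetical fetch bubble when L2 overrides L1

def front_end(branches, l1_predict, l2_predict):
    """branches: iterable of branch PCs. Returns total fetch-bubble cycles."""
    bubbles = 0
    for pc in branches:
        fast = l1_predict(pc)  # available immediately, steers fetch now
        slow = l2_predict(pc)  # conceptually arrives a few cycles later
        if slow != fast:
            # L2 disagrees: repoint fetch at the L2 path. Much cheaper
            # than a full misprediction because nothing has executed yet.
            bubbles += RESTEER_BUBBLE
    return bubbles

if __name__ == "__main__":
    random.seed(0)
    branches = range(1000)
    # Hypothetical stand-ins: L1 agrees with L2 ~90% of the time;
    # L2 plays the role of a TAGE-like predictor's answer.
    l1 = lambda pc: random.random() < 0.90
    l2 = lambda pc: True
    print("re-steer bubble cycles:", front_end(branches, l1, l2))
```

So the cost of the scheme is (disagreement rate) x (re-steer bubble), paid in fetch cycles rather than in flushed, half-executed instructions.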
This re-steer takes only a few cycles, so it is much cheaper than a branch misprediction: the wrong-path instructions haven't started executing yet, so the bubble may be entirely hidden, especially if IPC isn't close to the machine's maximum (as it usually isn't).
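Some back-of-envelope arithmetic on why the bubble can disappear (all numbers hypothetical, just to make the "hidden if IPC is below max" point concrete):

```python
# A fetch bubble is hidden as long as the instruction queue between
# fetch and execute has enough buffered work to cover it.
fetch_width = 8  # instructions fetched per cycle (assumed)
ipc = 2.0        # sustained instructions executed per cycle (assumed)
bubble = 3       # cycles fetch is stalled by the re-steer (assumed)

drained = ipc * bubble  # instructions consumed while fetch is stalled
refill = drained / (fetch_width - ipc)  # cycles to rebuild the queue
print(f"queue drains by {drained:.0f} insns; "
      f"refilled {refill:.1f} cycles after fetch resumes")
# With these numbers: if the queue held >= 6 instructions when the
# re-steer hit, execute never starves and the bubble is fully hidden.
```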
Just a guess, though: performance counter events suggest that Intel may use a similar fast/slow mechanism.