To keep this from happening too often, they added a(nother) feedback loop to the whole process: if a sequence of instructions causes a mis-speculation often enough, it's added to a table, so that when it's encountered again, the control unit doesn't speculatively execute an instruction whose result would most likely have to be thrown away anyway.
I don't know if there's prior art to this, but the basic idea is fairly straightforward. I wouldn't be surprised to learn that IBM or DEC engineers knew about this prior to 1996.
1. In a processor capable of executing program instructions in an execution order differing from their program order, the processor further having a data speculation circuit for detecting data dependence between instructions and detecting a mis-speculation where a data consuming instruction dependent for its data on a data producing instruction of earlier program order, is in fact executed before the data producing instruction, a data speculation decision circuit comprising:
a) a predictor receiving a mis-speculation indication from the data speculation circuit to produce a prediction associated with the particular data consuming instruction and based on the mis-speculation indication; and
b) a prediction threshold detector preventing data speculation for instructions having a prediction within a predetermined range.
Whether or not there was a prior implementation of this, I don't know – but it's also obviously more than a simple saturating counter branch predictor or whatever.
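To make elements (a) and (b) of the claim a bit more concrete, here's a rough Python sketch of how I read them. Everything in it is my own assumption, not something taken from the patent or from any real core: the class and method names, keying the table by the load's PC, the counter ceiling of 3, the threshold of 2, and decrementing on a successful speculation.

```python
class DataSpeculationDecision:
    """Toy model of claim elements (a) and (b); a sketch, not a real design."""

    def __init__(self, threshold=2, max_count=3):
        # Both values are hypothetical, picked only for illustration.
        self.threshold = threshold  # at or above this, speculation is blocked
        self.max_count = max_count  # saturating ceiling for each counter
        self.counters = {}          # assumed key: PC of the data consuming instruction

    def report(self, load_pc, mis_speculated):
        """(a) Predictor: update the prediction from a mis-speculation indication."""
        count = self.counters.get(load_pc, 0)
        if mis_speculated:
            count = min(count + 1, self.max_count)
        else:
            # Assumption: a speculation that turned out fine lowers the counter.
            count = max(count - 1, 0)
        self.counters[load_pc] = count

    def may_speculate(self, load_pc):
        """(b) Threshold detector: block speculation once the prediction is in range."""
        return self.counters.get(load_pc, 0) < self.threshold


decision = DataSpeculationDecision()
load_pc = 0x400                      # made-up address of a load instruction
decision.report(load_pc, True)       # first mis-speculation: still allowed
decision.report(load_pc, True)       # second one reaches the threshold
print(decision.may_speculate(load_pc))  # False
```

The per-instruction counter plus a cut-off is only the decision part of the claim. It says nothing about which store the load should then wait for, which is where the synchronization side comes in.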
So inductive patent expansion. Take previous innovation, add 1, profit.
What about L0 and L4 caches? Or renaming sets of renamed registers? The problem, as others have outlined, is that patents are not concrete enough. Simply describing a problem is enough to be granted a patent. The value of the description in most patents is zero, which afaik is the opposite of their intended effect as a record and transfer of technology.
The innovation of the UWM paper was the MDPT and DDST, then merging them for practical manufacturing reasons, and then studying the trade-offs with a simulator to arrive at a very efficient system.
For comparison, here is the IBM patent for the bigger, more expensive approach used in Power4:
(obviously the processor does this on a much lower level than this code)