Maybe I'm missing something (I only did a quick read) but aren't you explicitly ...

Maybe I'm missing something (I only did a quick read) but aren't you explicitly telling the model to re-explore low density regions of the action space? Essentially turning of the exploration (and turning down exploitation) with a weighting towards low density regions?

As not an RL person (I'm in generative), have people not re-increased the exploration variable after the model has been initially trained? It seems natural to vary that ee trade-off.