More likely they are using compilers that support their private extensions. I call BS on all vendor-supplied benchmark results, SiFive's included, as they don't really represent the results users are likely to get without extensive tuning.
ADD: it's unlikely to be the amount of cache and more likely their quite long L2 hit latency, which is a lot longer (in wall-clock time) than that of, say, a Cortex-A72.
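To make the "wall-time" point concrete: hit latency is usually quoted in core cycles, so what the program actually waits is cycles divided by clock frequency. A rough sketch follows. The function name and the numbers are made up for illustration and are not measurements of the TH1520, the Cortex-A72, or any other real chip.

```python
def hit_latency_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a cache hit latency from core cycles to nanoseconds."""
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    # 1 GHz means 1 cycle per nanosecond, so ns = cycles / GHz.
    return cycles / clock_ghz

# Placeholder numbers only, NOT taken from any datasheet or benchmark.
core_a = hit_latency_ns(cycles=30, clock_ghz=1.5)
core_b = hit_latency_ns(cycles=20, clock_ghz=2.0)
print(f"{core_a:.1f} ns vs {core_b:.1f} ns, ratio {core_a / core_b:.2f}x")
```

The upshot is that a cycle count alone doesn't tell you much when comparing two SoCs; you need the clock as well to get the time a load actually stalls for.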
L2 size, L2 speed, whatever, they're both down to the goodness of the SoC design, not the ISA.
For that matter, the 20% difference you found could be compiler maturity. And it's small enough that it's hard to notice without sitting two machines doing the same thing next to each other, or using a stopwatch.
But yeah, could well be that TH1520 has a duff L2 cache design in some way.
Much more shocking to me than the Pi 4 beating the TH1520 by 20% is the in-order dual-issue, 20% slower clock speed JH7110 beating it by 13% on a GNU toolchain build.