Up to a point, larger caches can improve aggregate performance in higher-latency processor designs; beyond that point, the added capacity isn't performance- or cost-effective. Conversely, lower-latency designs mean cache misses are cheaper, so the L1 and particularly the L2 caches can be smaller.
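To make that tradeoff concrete, here is a back-of-the-envelope sketch using the standard average memory access time formula, AMAT = hit time + miss rate * miss penalty. Every latency and miss rate in it is invented for illustration; nothing here is a measurement of any real processor:

```c
/* Toy AMAT model: hit_time + miss_rate * miss_penalty, per cache level.
   All numbers are invented for illustration only. */
#include <stdio.h>

static double amat(double l1_hit, double l1_miss,
                   double l2_hit, double l2_miss, double mem)
{
    return l1_hit + l1_miss * (l2_hit + l2_miss * mem);   /* all in ns */
}

int main(void)
{
    /* Slow memory (150 ns): a bigger, slower-to-hit L2 pays off. */
    printf("slow mem, small L2: %.2f ns\n", amat(1, 0.05,  5, 0.20, 150));
    printf("slow mem, big L2:   %.2f ns\n", amat(1, 0.05, 10, 0.10, 150));

    /* Fast memory (40 ns): the same bigger L2 becomes a slight loss. */
    printf("fast mem, small L2: %.2f ns\n", amat(1, 0.05,  5, 0.20, 40));
    printf("fast mem, big L2:   %.2f ns\n", amat(1, 0.05, 10, 0.10, 40));
    return 0;
}
```

With the 150 ns memory, the bigger L2 wins (2.25 ns vs 2.75 ns average access); with the 40 ns memory it loses slightly (1.70 ns vs 1.65 ns), which is the whole point about lower-latency designs tolerating smaller caches.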
For a reasonable comparison of what changing the latencies within a design can provide, there is a LANL write-up from the Alpha microprocessor environment. The Alpha EV7 had (for its time) low interprocessor and low memory latency, using toroidal processor links, as compared with its Alpha EV68 predecessors and with hierarchical or bus-based systems:
Among the x86 designs, the Nehalem-class Xeon processors have substantially better latencies (around 27 ns local and 54 ns remote) than previous generations of Xeon processors, and rather better than the Alpha latencies discussed in the LANL document. That means the effects of different cache sizes and access patterns can change from generation to generation.
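Latencies like those are usually measured by pointer chasing: following a chain of dependent loads through a buffer far larger than the caches, so every load has to go to memory. Here is a minimal sketch of that technique; the buffer size and step count are arbitrary assumptions, and measuring the remote (cross-socket) figure would additionally require pinning the thread and its memory to different NUMA nodes, e.g. with numactl:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ENTRIES (16 * 1024 * 1024)   /* 128 MiB of pointers: dwarfs the caches */
#define STEPS   10000000L

int main(void)
{
    size_t *chain = malloc(ENTRIES * sizeof *chain);
    if (!chain) return 1;

    /* Sattolo's algorithm builds one big random cycle, so the
       hardware prefetcher can't guess the next address. */
    for (size_t i = 0; i < ENTRIES; i++) chain[i] = i;
    for (size_t i = ENTRIES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long s = 0; s < STEPS; s++)
        p = chain[p];                /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%zu)\n", ns / STEPS, p);
    free(chain);
    return 0;
}
```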
Branches, too, can play havoc with the instruction stream and with the efficacy of caching and of instruction decode. Branch often and unpredictably, and performance can suffer. Deeply pipelined designs take bigger hits from branches, because a misprediction forces more in-flight instructions to be flushed and refetched.
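The classic way to see the branch effect is to run the same data-dependent branch over random data and then over sorted data: the work is identical, but the predictor goes from coin-flipping to near-perfect. A sketch follows (the array size and the 128 threshold are arbitrary choices, and note that at higher optimization levels a compiler may turn the branch into a conditional move, which hides the effect):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static double timed_sum(const int *data, long long *sum)
{
    struct timespec t0, t1;
    long long s = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        if (data[i] >= 128)          /* taken ~50% of the time on random data */
            s += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *sum = s;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    if (!data) return 1;
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    long long s1, s2;
    double random_order = timed_sum(data, &s1);
    qsort(data, N, sizeof *data, cmp);   /* same values, predictable branch */
    double sorted_order = timed_sum(data, &s2);

    printf("random: %.3f s  sorted: %.3f s  (sums %lld / %lld)\n",
           random_order, sorted_order, s1, s2);
    free(data);
    return 0;
}
```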
The OS can't dedicate parts of the L1 cache to different applications (the CPU doesn't offer any feature that would allow it to), nor would it be a good idea to do so.
Though I wonder if that's true of all SMT chips; do any chips implement per-thread L1 caches for exactly this reason?
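One thing that is easy to demonstrate, though, is what happens when two threads have to share cache hardware they can't partition: if their counters land in the same cache line, the line ping-pongs between the hardware contexts (false sharing), while padding them a line apart removes the contention. A sketch, assuming 64-byte lines and leaving thread placement to the scheduler; pinning both threads to the SMT siblings of one core (e.g. with pthread_setaffinity_np on Linux) would exercise the shared-L1 case specifically. Compile with -pthread:

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Two counters 8 bytes apart (almost certainly one cache line),
   then two counters forced at least a 64-byte line apart. */
static struct { volatile long a; volatile long b; } same;
static struct { volatile long a; char pad[64]; volatile long b; } apart;

static void *bump(void *p)
{
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static void run(volatile long *x, volatile long *y, const char *label)
{
    struct timespec t0, t1;
    pthread_t a, b;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, bump, (void *)x);
    pthread_create(&b, NULL, bump, (void *)y);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%-20s %.2f s\n", label,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void)
{
    run(&same.a,  &same.b,  "same cache line:");
    run(&apart.a, &apart.b, "separate lines:");
    return 0;
}
```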