A 30% allocation speedup for small-ish objects, achieved by prefetching the memory for the next thread-local allocation while performing the current one.
Even in the large-object case, where the average allocation time barely changes, the variability of those times drops substantially.
Just last week, I was helping with some performance tuning on an application that has to match a set of criteria against tens of thousands of rules in around 30ms under fairly constant load. Hitting our latency/throughput targets took a lot of tuning, especially around GC. Among other things, it was important to make sure that most, if not all, of a request could be handled in the TLAB.
I could definitely see something like this giving us a few more milliseconds per request, and I'm going to try it out in the next round of performance work - always nice to have a bit of headroom.