Yes, this is strongly related to inlining. Without the warmup code, C2 most likely inlines your lambdas into the code generated for the for each method. At least, I have seen this behavior with similar APIs (you can use a tool called JITWatch to visualize compiled and inlined methods). However, this does not scale.
Last time I checked, C2 inlines at most two implementations at a polymorphic call site (in this case, the line that calls your lambda). If you pass in more lambda functions, it won't inline them and might de-optimize existing compiled code to remove previously inlined functions (there are also cases where one heavily used implementation gets inlined and the others are invoked via a regular function call). Graal does not seem to have the same limit, but when I tested it two years ago it had other limits, and performance averaged out to about the same.
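To make the call-site point concrete, here is a minimal sketch. The class name CallSiteDemo and the method sumMapped are made up for illustration and are not from the original benchmark. Every lambda expression compiles to its own class, so the single applyAsInt line inside sumMapped sees three different receiver types once all three lambdas have been passed through it.

```java
import java.util.function.IntUnaryOperator;

public class CallSiteDemo {
    // Hypothetical stand-in for the API's for-each method. The single
    // op.applyAsInt(...) line is the call site being discussed.
    public static long sumMapped(int[] data, IntUnaryOperator op) {
        long sum = 0;
        for (int value : data) {
            sum += op.applyAsInt(value);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }

        // Three lambdas means three distinct implementation classes.
        IntUnaryOperator inc = x -> x + 1;
        IntUnaryOperator dbl = x -> x * 2;
        IntUnaryOperator neg = x -> -x;

        long total = 0;
        // Run enough rounds that the JIT compiles sumMapped. All three
        // lambdas flow through the same call site inside it.
        for (int round = 0; round < 20_000; round++) {
            total += sumMapped(data, inc);
            total += sumMapped(data, dbl);
            total += sumMapped(data, neg);
        }
        System.out.println(total);
    }
}
```

Running this with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining should show what the JIT decided for that call site, though the exact output differs between JVM versions. Dropping two of the three lambdas and comparing the output is a quick way to see the difference.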
Thus, increasing the max inline level (which you can already do in Java 8 and 11) does not necessarily help with this particular problem. What you would want is for the compiler to inline the for each method and the local lambdas into the methods that use the API (the benchmark methods in our case), i.e., to copy the hot loop somewhere higher up in the call stack and then optimize it there. But apparently this is easier said than done.
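As a rough illustration of what "copying the hot loop higher up" would amount to, here is a hand-written sketch. The names HoistedLoopDemo, sumMapped, viaApi and handInlined are hypothetical and not part of the API under discussion. viaApi is what the benchmark method looks like in source. handInlined is roughly what you would hope the compiler ends up with after inlining both the for each method and the lambda into the caller.

```java
import java.util.function.IntUnaryOperator;

public class HoistedLoopDemo {
    // Hypothetical API method: one call site shared by every caller.
    public static long sumMapped(int[] data, IntUnaryOperator op) {
        long sum = 0;
        for (int value : data) {
            sum += op.applyAsInt(value);
        }
        return sum;
    }

    // Caller as written against the API: the lambda goes through the
    // shared call site inside sumMapped.
    public static long viaApi(int[] data) {
        return sumMapped(data, x -> x * 2);
    }

    // The same computation with the loop and the lambda body copied into
    // the caller by hand. There is no call site left to be polymorphic.
    public static long handInlined(int[] data) {
        long sum = 0;
        for (int value : data) {
            sum += value * 2;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(viaApi(data) == handInlined(data));
    }
}
```

Both methods compute the same result. The only difference is whether the loop body depends on a call site that other callers also feed lambdas into.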
But again, this is probably only relevant for workloads where there is very little work to do for each item.
I looked around and indeed this still appears to be the case. Megamorphic call sites will not be inlined.
From reading mailing list messages, the main reason for this shortcoming appears to be that C2 is such a mess that nobody wants to touch it, so they're writing a new compiler (Graal) instead.
If optimizations start getting targeted towards Graal I'll be pissed. Oracle gimps the performance of the free version, the one thing you should never ever do with a programming language.