This is important work, and at least partially addressed in other benchmark harnesses (e.g. github.com/google/benchmark has DoNotOptimize and ClobberMemory https://github.com/google/benchmark/blob/7fad964a94da9364af1...). By and large this whole area is criminally under-examined, IMO.
What I'd really like is for someone to put together a suite of statistical tests that reject poor performance measurements. At a bare minimum, your performance samples should look like draws from IID random variables. If they don't, you haven't reached steady state, there are correlations between repeated runs, or something similar is going on.
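To sketch what such a screen might look like (this is my own toy illustration, not an existing tool): run two cheap checks on the sample sequence — lag-1 autocorrelation, and a Wald-Wolfowitz runs test around the median — and reject the measurement if either looks wrong. The thresholds here are arbitrary assumptions, and a real suite would want more tests (stationarity, changepoint detection, etc.):

```python
import math
import random
import statistics

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation; near 0 for IID data."""
    mean = statistics.fmean(xs)
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    return cov / var

def runs_test_z(xs):
    """Wald-Wolfowitz runs test around the median; returns a z-score.
    A large |z| suggests the sequence isn't exchangeable, e.g. a drift
    toward steady state or correlations between repeated runs."""
    med = statistics.median(xs)
    signs = [x > med for x in xs if x != med]
    n1 = sum(signs)
    n2 = len(signs) - n1
    if n1 == 0 or n2 == 0:
        return 0.0
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n1 + n2
    expected = 2 * n1 * n2 / n + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n)) / (n * n * (n - 1))
    return (runs - expected) / math.sqrt(var)

def looks_iid(samples, max_autocorr=0.2, max_runs_z=4.0):
    """Crude screen: reject samples that obviously aren't IID.
    Thresholds are ad hoc; tune for your sample sizes."""
    return (abs(lag1_autocorr(samples)) < max_autocorr
            and abs(runs_test_z(samples)) < max_runs_z)

# Simulated timings: one steady-state run, one still warming up.
rng = random.Random(0)
steady = [1.0 + rng.gauss(0, 0.01) for _ in range(500)]
warming_up = [2.0 - 0.002 * i + rng.gauss(0, 0.01) for i in range(500)]

print(looks_iid(steady))      # passes the screen
print(looks_iid(warming_up))  # rejected: strong downward drift
```

The warm-up sequence fails both checks: the drift makes neighboring samples highly correlated, and it produces far fewer median-crossing runs than an exchangeable sequence would.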