On one hand, they made a performance optimisation to a library - cool.
On the other hand, the title and the conclusion talk about the difference turbo boost makes in a CPU, which seems completely missing from the article's contents...
He was using a slower buffer serialisation library (for reasons unexplained), and when challenged he set out to prove his choice was the fastest, but pretty soon found out he was wrong. Then he wanted to run the test himself, and along the way he discovered how his processor made benchmarking very noisy in turbo mode.
Without knowing how many runs were made and the other test conditions, it's difficult to know what actually changed. There's far too much noise in the table to come to the conclusion that turbo was the culprit.
We're only talking about a ~30% difference in clock rate between 3.2 GHz and 4.2 GHz. If there's more change than that across turbo vs. non-turbo, something else in the benchmark setup is messed up.
It's quite significant for AVX512.
This really screws with benchmarking unless you are on something newer than Haswell and you isolate the core you are executing on.
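For reference, a minimal sketch of what pinning a benchmark to one core can look like on Linux (the core number is an arbitrary example; a proper setup would also reserve that core with isolcpus= on the kernel command line):

    import os
    import time

    # Pin this process to a single core (core 3 is an arbitrary example;
    # ideally one that has been reserved with isolcpus= at boot).
    os.sched_setaffinity(0, {3})

    def bench(n=1_000_000):
        start = time.perf_counter_ns()
        total = 0
        for i in range(n):
            total += i
        return time.perf_counter_ns() - start

    samples = sorted(bench() for _ in range(30))
    print("fastest:", samples[0], "ns  slowest:", samples[-1], "ns")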
- SD increase of 768x
- Mean increase of 1.2 to 2.3x
- But n=3?
You need a lot more samples to make a reasonable statement about whether Turbo had an impact or not. Noise is completely overpowering your measurements.
AFAICT, you're measuring SD of the time (nanoseconds) to run your task, and mean is around 9000ns. 1ns SD is abnormally low, and 768ns SD seems about right (this is a desktop machine, after all). But I couldn't really figure this out from the article.
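To make the n=3 point concrete, here's a quick simulation (the mean and SD are just the rough numbers from this thread, assumed Gaussian) showing how unstable a standard deviation estimated from 3 samples is:

    import random
    import statistics

    # Rough numbers from the thread: ~9000 ns mean, ~768 ns SD, assumed Gaussian.
    TRUE_MEAN, TRUE_SD = 9000, 768

    def estimated_sd(n):
        xs = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
        return statistics.stdev(xs)

    for n in (3, 30, 300):
        estimates = sorted(estimated_sd(n) for _ in range(1000))
        print(f"n={n}: SD estimate ranges from {estimates[0]:.0f} to {estimates[-1]:.0f} ns")

With n=3 the estimated SD routinely swings by an order of magnitude, so a single 768x ratio between two such estimates doesn't say much on its own.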
- capped freqs: stddev increased 768x, in the non-turbo-to-turbo direction
- in the other direction it is 2.3x. So I was just acknowledging your previous point about the 20% of samples in that table
I appreciate what you're trying to do, and I'm intensely interested in this issue (I have a feeling that we face very similar issues in our day jobs) -- but I can't see a reasonable way to interpret your data that supports your assertion that Turbo meaningfully increases variance.
There's just not enough data.
The primary way turbo introduces variance is due to the turbo transitions forced by the "max turbo ratio varies by active core count" behavior. E.g., on my Skylake chip, the turbo speed is either 3.5, 3.3, 3.2 or 3.1 GHz depending on whether 1, 2, 3 or 4 cores are active.
Let's say you are running a single-threaded benchmark with "nothing else" running on the box. Of course other cores will still occasionally fire up to handle scheduler ticks, interrupts from your network card, background processes, whatever. Every time this happens, the chip has to immediately undergo a frequency transition which leaves it running at 3.3 GHz. The main problem isn't the lowered speed, it's the fact that the transition itself puts the chip into a halted state for 10 to 20 µs, presumably to allow the multiplier transition to occur, for voltages to stabilize, etc. Especially for short, precise benchmarks, this shows up as a lot of noise.
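A rough way to see those stalls from user space is to time a short fixed workload in a tight loop and look at the outliers; a sketch (the workload and thresholds are arbitrary):

    import statistics
    import time

    def tiny_task():
        # A few microseconds of arbitrary work.
        s = 0
        for i in range(2000):
            s += i * i
        return s

    deltas = []
    for _ in range(20_000):
        t0 = time.perf_counter_ns()
        tiny_task()
        deltas.append(time.perf_counter_ns() - t0)

    med = statistics.median(deltas)
    spikes = sum(1 for d in deltas if d > med + 10_000)  # >10 µs over the median
    print(f"median={med:.0f} ns, spikes={spikes}, worst={max(deltas)} ns")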
Turning off turbo fixes it, but so does just setting the max turbo ratio to the "all cores" value (3.1 GHz in my case) since then you don't have these forced transitions.
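For what it's worth, on Linux with the intel_pstate driver both fixes can be applied via sysfs (needs root; the paths differ for other cpufreq drivers, and the 3.1 GHz value is just my chip's all-core ratio from above):

    from pathlib import Path

    # Option 1: disable turbo entirely (intel_pstate only, needs root).
    Path("/sys/devices/system/cpu/intel_pstate/no_turbo").write_text("1")

    # Option 2: cap every core at the all-core turbo ratio instead
    # (3.1 GHz here, per the chip above; scaling_max_freq takes kHz).
    for policy in Path("/sys/devices/system/cpu/cpufreq").glob("policy*"):
        (policy / "scaling_max_freq").write_text("3100000")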
This was the most interesting tidbit for me. It wouldn’t occur to me to run benchmarks on a laptop (I guess due to my advancing years...)
That's most of the JS on the page. That being said, the page transfers ~500 KB and loads ~30 MB into memory, which shouldn't be that substantial. Most of it lies in the asciinema player (which should be mostly JITing) or an array.
Is that the asciinema player, or the "video" it plays? Either way: I thought part of the point of asciinema was that it compresses better for terminal capture? Just how much data was stored by that particular session?
Hard to tell, just shows up as 'JIT' in the profiler.
> I thought part of the point of asciinema was that it compresses better for terminal capture?
Better than what? 30 MB of heap for a video source/player combination isn't that awful. It's on par with most videos that your web browser will try to play.