The Effects of CPU Turbo: 768X Stddev (alexgallego.org)
100 points by ingve 7 months ago | 23 comments

With credit to the author, it feels like they found out something interesting, but the article feels like a mess.

On one hand, they made a performance optimisation to a library - cool.

On the other hand, the title and the conclusion talk about the effect of turbo boost in a CPU, which seems completely missing from the article's contents.

A few connecting transitions are missing, but basically I understand it thus:

He was using a slower buffer serialisation library (for reasons unexplained), and when challenged set out to prove his choice was the fastest, but pretty soon found out he was wrong. Then he wanted to do the test himself, and along the way he discovered how his processor made benchmarking very noisy in turbo mode.

Thanks for summarizing it better than I did. I'll edit and make that a bit more clear.

Plenty of other data points in the table show reduced stddev with turbo enabled. The 768x line looks like an anomaly.

Without knowing how many runs were made and the other test conditions, it's difficult to know what actually changed. There's far too much noise in the table to come to the conclusion that turbo was the culprit.

We're only talking about a 30% difference in clock rate for 3.2GHz vs 4.2GHz. If there's more change than that across turbo vs. non-turbo, something else in the benchmark setup is messed up.
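For reference, a quick sketch of the clock-rate arithmetic behind that figure. The two frequencies are the ones quoted above; the function name is only for illustration.

```python
def relative_speedup(base_ghz: float, turbo_ghz: float) -> float:
    """Return the fractional clock increase going from base to turbo."""
    return turbo_ghz / base_ghz - 1.0


# 3.2 GHz capped vs 4.2 GHz turbo, as quoted in the comment above.
speedup = relative_speedup(3.2, 4.2)
print(f"clock increase: {speedup:.1%}")
```

So a purely frequency-bound workload should speed up by roughly a third at most, which is why a much larger swing points at something other than the clock itself.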

Sure. It should not be that much. There is definitely some weird stuff with turbo, mind you, particularly because heavy AVX2 or AVX512 loads change the max turbo frequency for the cores they are running on (in Broadwell) or for all cores (in Haswell).


It's quite significant for AVX512.

This really screws with benchmarking unless you are on something newer than Haswell and isolate the core you are executing on.
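A minimal sketch of the pinning half of that advice, assuming Linux, where Python exposes `os.sched_setaffinity`. The helper name `pin_to_core` is made up for illustration. Pinning only keeps the benchmark on one core; actually keeping other work off that core needs something more, such as the `isolcpus` kernel parameter.

```python
import os


def pin_to_core(core: int) -> bool:
    """Try to pin the current process to a single core.

    Returns True on success, False if the platform or the current
    affinity mask does not allow it.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False  # not available outside Linux
    try:
        if core not in os.sched_getaffinity(0):
            return False  # core is not in our allowed set
        os.sched_setaffinity(0, {core})
    except OSError:
        return False
    return True


if hasattr(os, "sched_getaffinity"):
    # Pick the lowest-numbered core we are allowed to run on.
    chosen = min(os.sched_getaffinity(0))
    print("pinned:", pin_to_core(chosen))
```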

~20%. The difference, however, is a max of ~2.3x in that direction. 100% of the code is there; what do you suggest is 'messed up'? I probably ran the benchmarks hundreds of times over 7 days trying to understand. The particular run I posted seemed aligned with the others, but that was the stddev of 3 runs as reported by Google Benchmark with --benchmark_repetitions=3. I raised it to 10 and saw not much difference. That was the point of the 768X anomalies: the values are never like that w/ capped frequencies.

Hang on, so we've got:

- SD increase of 768x

- Mean increase of 1.2 to 2.3x

- But n=3?

You need a lot more samples to make a reasonable statement about whether Turbo had an impact or not. Noise is completely overpowering your measurements.

AFAICT, you're measuring SD of the time (nanoseconds) to run your task, and mean is around 9000ns. 1ns SD is abnormally low, and 768ns SD seems about right (this is a desktop machine, after all). But I couldn't really figure this out from the article.
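To illustrate why n=3 is a problem, here is a small simulation sketch. It draws two batches from the same distribution and looks at how far apart their sample standard deviations can land. The ~9000 ns mean is the ballpark mentioned above; the 50 ns sigma, the trial count and the function name are assumptions chosen only for the demonstration.

```python
import random
import statistics


def stddev_ratio_spread(n: int, trials: int, seed: int = 0) -> float:
    """Largest ratio seen between the sample stddevs of two n-sample
    batches drawn from the SAME normal distribution."""
    rng = random.Random(seed)
    worst = 1.0
    for _ in range(trials):
        a = [rng.gauss(9000.0, 50.0) for _ in range(n)]
        b = [rng.gauss(9000.0, 50.0) for _ in range(n)]
        sa, sb = statistics.stdev(a), statistics.stdev(b)
        worst = max(worst, max(sa, sb) / min(sa, sb))
    return worst


# Identical underlying noise in both batches; only the sample count differs.
print("n=3:  ", stddev_ratio_spread(3, 2000))
print("n=100:", stddev_ratio_spread(100, 2000))
```

With only three samples per batch, large stddev ratios show up by chance even though nothing changed between the two batches, so a single large ratio is weak evidence on its own.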

No no.

- capped freqs: stddev 768x in the non-turbo-to-turbo direction

- in the other direction it is 2.3x. So I was just acknowledging that about 20% of the samples in that table agree with your previous point

25% of 'samples' in the stddev table show absolutely no difference between Turbo and non-Turbo. Assuming that this line of argument makes sense (it doesn't), that leaves merely 55% of samples that support your case.

I appreciate what you're trying to do, and I'm intensely interested in this issue (I have a feeling that we face very similar issues in our day jobs) -- but I can't see a reasonable way to interpret your data that supports your assertion that Turbo meaningfully increases variance.

There's just not enough data.

Turbo and Hyperthreading generate a ton of noise when benchmarking. Disabling both results in a one or two order of magnitude reduction in outlier results (maybe you want to call it noise, but it is not symmetric like classic noise).
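A hedged sketch for checking both settings before a run. It assumes Linux with the `intel_pstate` driver (for the `no_turbo` file) and a kernel that exposes the SMT control file; on other setups the files are simply absent and the helper reports `None`. The function names are illustrative.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs locations; present only on some Linux/Intel configurations.
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")
SMT_CONTROL = Path("/sys/devices/system/cpu/smt/control")


def read_sysfs(path: Path) -> Optional[str]:
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def describe_noise_sources() -> dict:
    """Report whether turbo is disabled and what the SMT control file says."""
    no_turbo = read_sysfs(NO_TURBO)
    return {
        # "1" in no_turbo means turbo is disabled; None means unknown.
        "turbo_disabled": None if no_turbo is None else no_turbo == "1",
        # Typically "on" or "off" (among other values); None means unknown.
        "smt_control": read_sysfs(SMT_CONTROL),
    }


print(describe_noise_sources())
```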

The primary way turbo introduces variance is through the turbo transitions forced by the "max turbo ratio varies by active core count" behavior. E.g., on my Skylake chip, the turbo speed is either 3.5, 3.3, 3.2 or 3.1 GHz depending on whether 1, 2, 3 or 4 cores are active.

Let's say you are running a single-threaded benchmark with "nothing else" running on the box. Of course other cores will still occasionally fire up to handle scheduler ticks, interrupts from your network card, background processes, whatever. Every time this happens, the chip has to immediately undergo a frequency transition which leaves it running at 3.3 GHz. The main problem isn't the lowered speed, it's the fact that the transition itself puts the chip into a halted state for 10 to 20 us, presumably to allow the multiplier transition to occur, for voltages to stabilize, etc. Especially for short, precise benchmarks, this shows up as a lot of noise.

Turning off turbo fixes it, but so does just setting the max turbo ratio to the "all cores" value (3.1 GHz in my case) since then you don't have these forced transitions.
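A toy model of that effect, as a sketch only: each simulated run takes about 9000 ns (the ballpark mean mentioned elsewhere in the thread), and a small fraction of runs also eat a 10 to 20 us halt. The 5 ns baseline jitter and the 1% stall probability are invented numbers, not measurements.

```python
import random
import statistics


def simulate(runs: int, base_ns: float, jitter_ns: float,
             stall_prob: float, seed: int = 0) -> list:
    """Simulated per-run timings in ns, with occasional transition stalls."""
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        t = rng.gauss(base_ns, jitter_ns)
        if rng.random() < stall_prob:
            # Halt of 10 to 20 us while the frequency transition happens.
            t += rng.uniform(10_000, 20_000)
        samples.append(t)
    return samples


quiet = simulate(10_000, 9000.0, 5.0, stall_prob=0.0)
noisy = simulate(10_000, 9000.0, 5.0, stall_prob=0.01)

print("stddev without stalls:", statistics.stdev(quiet))
print("stddev with stalls:   ", statistics.stdev(noisy))
print("median with stalls:   ", statistics.median(noisy))
```

In this model the median barely moves while the standard deviation grows by orders of magnitude, which matches the description of one-sided outliers rather than symmetric noise.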

I see what you are saying. I did run it hundreds of times, but didn't record the results. The problem with benchmarking is that it is incredibly slow, haha. I'll run it for a few hours with a couple of other tips people sent me and post a follow-up.

> Ensure that your BIOS says performance when connected to AC

This was the most interesting tidbit for me. It wouldn’t occur to me to run benchmarks on a laptop (I guess due to my advancing years...)

Thanks. That machine has a server processor and 64GB of ECC memory. Specs are pretty similar to a small server. But yeah, hardware has changed a ton :)

Minor: "Friday, June 28 2018". June 28th was actually a Thursday this year.

... I guess I was tired when I wrote it. I meant to write 29th! Thanks!!

That page seems to make Android Chrome hang.

If that's happening to you, then it's probably an issue with asciinema and your particular version of Android's Chrome.

That's most of the JS on the page. That being said, the page transfers ~500Kb, and loads into memory ~30Mb, which shouldn't be that substantial. Most of it lies in the asciinema player (which should be mostly JITing), or an array.

> loads into memory ~30Mb

Is that the asciinema player, or the "video" it plays? Either way: I thought part of the point of asciinema was that it compresses better for terminal capture? Just how much data was stored by that particular session?

I think it compresses file size/network transfer, but in order for playback to occur, the video must be decompressed into memory, even though the terminal colors will be mostly the same color hex.

> Is that the asciinema player, or the "video" it plays?

Hard to tell, just shows up as 'JIT' in the profiler.

> I thought part of the point of asciinema was that it compresses better for terminal capture?

Better than what? 30Mb of heap for a video source/player combination isn't that awful. It's on par with most videos that your web browser will try to play.

No issues like that here. On a Huawei Honor 7x with Android 8.0.0

That one is empty, old, and has no upvotes.
