
Performance variation in ‘identical’ processors - johndcook
http://shape-of-code.coding-guidelines.com/2020/01/05/performance-variation-in-2386-identical-processors/
======
0xff00ffee
In the 90's I was an architect on Intel's Willamette (Pentium 4) thermal
throttle (TT1). TT1 "knocked the teeth" out of clock cycles if the checker-
retirement unit (CRU, the hottest part of the die) got too hot. This evolved
into TT2/Geyserville (where you move up/down the V/F curve to actively stay
under the throttle limit). We were browbeaten by upper management to prove
this would not visibly impact performance, and worked on one of the MANY MANY
software simulators written throughout the company to prove it. (It was
actually my favourite job there.) This is when the term "Thermal Design Power"
arrived: coined by top marketing brass to avoid using "Max Power", which was
far higher. It is possible to have almost a 2x difference between max power
(running a "power virus", which Intel was terrified of, from chipsets to
graphics to CPUs) and what typical apps use (thermal design power).
Performance was a bit dodgy on a few apps, but not significantly so compared
to run-to-run variation. (Remember this is 1995-1997, after the half-arsed
Pentium fiasco in 1993 when Motorola openly mocked Intel for having a 16W
CPU... FDIV wasn't a thermal fiasco, but it was a proper cock-up.)
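The TT2/Geyserville idea, stepping up and down the V/F curve to ride just under the thermal limit, can be sketched as a toy control loop. Everything here (operating points, trip temperature, hysteresis band) is invented for illustration and is not Intel's actual algorithm:

```python
# Toy sketch of TT2/Geyserville-style throttling: step down the V/F curve
# when over the thermal limit, step back up when there is headroom.
# All numbers are invented for illustration.
FREQ_GHZ = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]  # hypothetical operating points
T_LIMIT = 85.0    # throttle trip point (degrees C)
HYSTERESIS = 5.0  # dead band to avoid ping-ponging between adjacent points

def next_operating_point(temp_c: float, idx: int) -> int:
    if temp_c > T_LIMIT and idx > 0:
        return idx - 1  # too hot: move down the curve
    if temp_c < T_LIMIT - HYSTERESIS and idx < len(FREQ_GHZ) - 1:
        return idx + 1  # headroom: move back up
    return idx          # hold inside the hysteresis band
```

A loop like this is one reason "identical" parts can diverge: two chips with slightly different thermal behaviour settle at different operating points for the same workload.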

Die are sorted based on something called a bin split: die are binned
immediately after wafer sort based on their leakage. (There are special
transistors implanted near the scribe lines that indicate tons of
characteristics, as well as DFX units throughout the die, rings of 20
inverters that oscillate, which also reveal tons of data on how the die
behaves. However, testing those buggered DFX circuits takes an enormous
amount of time, and you can't slow down wafer sort, so there are proxies.)

The bins are designed in such a way to maximize profit and performance based
on the die characteristics. Thermal throttle plays a role in this and each bin
(among various vectors) is allowed some tolerance, which is exactly what OP
has discovered. However, this has been going on for coming up on 30 years! So
nothing really new here, I just thought I'd let you know that of course Intel
is aware of this, and they never claim performance numbers outside of the
tolerance allowed for thermal throttle.
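The bin-split idea can be caricatured in a few lines. The thresholds, units, and bin names below are entirely invented; a real sort flow uses many more vectors (leakage proxies, ring-oscillator readings, voltage margins) and an actual profit model:

```python
# Caricature of speed binning from wafer-sort proxies. A ring-oscillator
# frequency stands in for raw transistor speed, and leakage current for
# power; every threshold here is invented for illustration.
def assign_bin(ro_mhz: float, leakage_ma: float) -> str:
    if ro_mhz < 900:
        return "scrap"         # too slow for any shipping grade
    if ro_mhz >= 1100 and leakage_ma < 40:
        return "top-bin"       # fast and efficient: highest price point
    if leakage_ma >= 80:
        return "low-freq-bin"  # leaky die get a lower V/F point to cap power
    return "mid-bin"
```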

------
mlyle
There seems to be little attempt to ensure the ambient environment of the
processors is isothermal. That is, there's likely significant chassis-to-
chassis thermal variation that is as big as or bigger than any difference in
thermal budget between processors.

Even things like the characteristics of the application of the thermal paste
and the roughness of the individual fan ducts can matter, beyond the obvious
bottom of rack vs. top of rack, position of processor in individual chassis,
etc effects.

tl;dr-- probably most of what is measured is not silicon to silicon variation.

~~~
fyp
I have a pair of "identical" GPUs which have dramatically different
performance. Swapping the order didn't help. It turns out it was because one
was blowing hot air into the other. Sigh.

~~~
satori99
I have seen this happen before. In the 1990's I was an audio engineer for a
radio station that had just spent millions on digital suites with hard drives
that would crash regularly, destroying in-progress work.

The station actually got techs to fly to Australia to try to diagnose the
problem -- A bunch of SCSI disks stacked vertically was causing the topmost
drive to overheat.

~~~
toolslive
Had a similar experience with a JBOD chassis. The vibrations of the spinning
drives caused resonance on some positions. The cure was to attach a patch of
duct tape with a washer at a certain point on the chassis.

~~~
iforgotpassword
Heh I remember in the late 90s I put some rubber between the hdds and case to
greatly reduce the noise of my desktop PC.

~~~
the-dude
I have seen rubber 'sleeves' for hdds

~~~
java-man
I don't understand why this isn't a standard practice.

~~~
mnw21cam
That sounds like a really easy way to make a hard drive overheat.

Way back, I had a 4GB SCSI hard drive that was extra tall, extra fast, extra
noisy, and extra hot. I constructed a thick rubber box around it to try to
noise-proof it a little, but I also had to strap a CPU cooler to it and have
the airflow enter the box and exit through a hole after wrapping all the way
round the hard drive. It worked. But just wrapping it in rubber would have
been a very quick way to cook the drive. This is a drive that, if just left
running bare on a table, would get too hot to touch.

Hard drives produce a heck of a lot less heat these days.

~~~
java-man
An elastic mount does not have to be a suffocating box.

Plus (and I could be wrong here), the mounts are not really designed to be
heat conductors; the rails are plastic on my high-end Dell workstation. I
suppose the heat is extracted via airflow.

------
kens
Interesting, but the discussion of "an atom here and there" affecting the
performance doesn't make sense. The manufacturing variations are much larger
than an atom or two. This variation is part of the motivation for "binning" of
processors, testing them and then selling them at different performance levels
based on how they turn out.

------
big_chungus
This isn't new. Processors are often the same template for multiple models and
manufacturers "bin" based on quality and just turn off the bad parts. This is
why it's more expensive to produce a nicer processor: the yield rates are much
lower. There is still significant variation in models, though, known as
silicon lottery. This is why some chips overclock or undervolt much better
than others. There's even one site that sells chips that basically go through
extra binning to ensure a better product:
[https://siliconlottery.com/](https://siliconlottery.com/)

------
quotemstr
Whether or not this article controls properly for temperature, it introduced
me to violin plots:
[https://en.wikipedia.org/wiki/Violin_plot](https://en.wikipedia.org/wiki/Violin_plot)
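For reference, a violin plot takes only a few lines with matplotlib. The benchmark timings below are synthetic, generated purely to show the shape of the plot:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Synthetic per-processor benchmark times: each "processor" gets its own
# mean and spread, loosely mimicking the article's per-chip distributions.
samples = [rng.normal(loc=100 + 2 * i, scale=1 + i, size=500) for i in range(4)]

fig, ax = plt.subplots()
ax.violinplot(samples, showmedians=True)
ax.set_xlabel("processor")
ax.set_ylabel("benchmark time (s)")
fig.savefig("violin.png")
```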

------
userbinator
I'm surprised that, even at the same frequency, there is still some pretty
large variation. I wonder if that's due to other sources of noise that are
often ignored by a lot of people running benchmarks (e.g. background
processes, SMM, ME, etc.)

 _e.g., memory, which presumably have their own temperature characteristics_

To my knowledge (and this is based on DDR3 and older), memory frequencies are
essentially fixed because the transceivers on both ends need to sample in the
_middle_ of a bit cell, and to do that they need to know the clock period,
which must not change once it's known. There's a delay-locked loop (DLL) in
the RAM which generates a phase-shifted local reference clock.

If the processors could all be locked to one constant frequency (i.e. all the
power/performance "dynamic tuning" stuff disabled) that would help show
whether there are other sources of noise. This of course also assumes the
clock generators are identical.
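On Linux that kind of lock can be attempted through cpufreq's sysfs interface. A hedged sketch: it assumes root and a driver that supports the `userspace` governor (e.g. acpi-cpufreq; intel_pstate in active mode does not), and it takes the sysfs root as a parameter only so it can be exercised against a fake directory tree:

```python
import glob
import os

def pin_frequency(khz: int, sysfs: str = "/sys/devices/system/cpu") -> None:
    """Try to fix every core at one frequency via cpufreq sysfs nodes."""
    # Disable turbo where the intel_pstate knob exists.
    no_turbo = os.path.join(sysfs, "intel_pstate", "no_turbo")
    if os.path.exists(no_turbo):
        with open(no_turbo, "w") as f:
            f.write("1")
    for policy in glob.glob(os.path.join(sysfs, "cpufreq", "policy*")):
        # The userspace governor lets us request an exact frequency in kHz.
        with open(os.path.join(policy, "scaling_governor"), "w") as f:
            f.write("userspace")
        with open(os.path.join(policy, "scaling_setspeed"), "w") as f:
            f.write(str(khz))
```

Even with this, SMM and the ME remain outside the OS's control, so some noise sources can't be switched off from userspace.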

------
ChuckMcM
This would be a fun experiment to run at Google, much bigger data set would be
available.

------
touisteur
Ugh... Had two PCs with the exact same spec (CPU, motherboard, RAM, same
product numbers, etc.), all the same versions of Linux and firmware. There
was a 15% performance difference between the two machines. Turned out to be
some BIOS tuning, not really related to performance. 15%...

~~~
dspillett
_> not really related to performance. 15%..._

I would say that if it makes a 15% difference it _was_ related to
performance... It might not have been an option on a screen specifically
labelled as performance-related, but that could just be a bad menu layout.

I'd be interested to know what the option was, if you have any specific
memory of it?

------
p1necone
The variances are likely to be even more noticeable nowadays as modern (x86 at
least) processors do a lot more automatic overclocking based on temperatures.

------
doe88
aka Silicon Lottery.

