
Why has CPU frequency ceased to grow? (2014) - Osiris
https://software.intel.com/en-us/blogs/2014/02/19/why-has-cpu-frequency-ceased-to-grow
======
Symmetry
That was a bit misleading in some ways. First, in pipelining you'll typically
measure how long a pipeline stage is in FO4s, which is to say the delay it
takes for one inverter to drive 4 copies of itself. Intel will typically
design its pipeline stages to have 16 FO4s of delay. IBM is more aggressive
and will try to work it down to 10. But of those 10, 2 are there for the
latches you added to create the stage and 2 are there to account for the fact
that a clock edge doesn't arrive everywhere at exactly the same time. So if
you take one of those 16 FO4 Intel stages and cut it in half you won't have
two 8 FO4 stages but two 10 FO4 stages. And since those latch transistors take
up space and energy you've got some severe diminishing returns problems.
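
To spell the arithmetic out (my restatement of the comment above, not
something from the article): a 16 FO4 stage is really about 12 FO4 of logic
plus 4 FO4 of fixed overhead (2 for latches, 2 for clock skew), so

    16 FO4 stage   = 12 FO4 logic + 4 FO4 overhead
    split in half:   12/2 + 4 = 10 FO4 per stage
    frequency gain:  16/10 = 1.6x instead of the hoped-for 2x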

One thing that's changed as transistors have gotten smaller is that leakage
has gotten to be more of a problem. You used to just worry about active
switching power but now you have to balance using higher voltage and lower
thresholds to switch your transistors quickly with the leakage power that that
will generate.

And finally, velocity saturation is more of a problem on shorter channels,
making the drain current rise more linearly with gate voltage rather than
quadratically.

~~~
deepnotderp
I don't think anyone uses U/LVT transistors at small geometries; the leakage
would be a nightmare.

~~~
trsohmers
I know a lot of people using LVT transistors in 28 and 16/14nm processes,
including relatively low power (mobile and embedded) designs. I personally
have used LVT-variant SRAM blocks for both our 28nm and 16nm designs, and ULVT
cells manually placed on the critical path of Neo's FPU in our 28nm chip.

~~~
deepnotderp
I should've rephrased: I don't know of anyone* who uses ULVTs exclusively, to
answer the parent's point about using ULVTs to increase speed.

* Okay, I know of _some_ people, but their design is different.

------
mjfl
I took a class that went over this in depth like 3-4 years ago. Basically the
message was that serial performance is saturating, and the only way to get
speed improvements in the future is going to be by exploiting parallelism.
However, most programmers, and programming languages, remain stuck in a
serial-by-default paradigm. I'm surprised that there hasn't emerged a
"parallel-by-default C++" kind of language + hardware system to exploit it to
keep things going forward. I find the apparent stagnation extremely
depressing.

~~~
quadcore
[https://golang.org/](https://golang.org/)

In case you don't know, Golang _goroutines_ are a marvel of parallelism. They
are coroutines which are dispatched onto a few OS threads. So you can use 100%
of a multi-core CPU and yet spawn, say, 10K of those _light threads_ without
worrying about context switches PLUS have them all run concurrently. I've
found that Golang is one of those rare languages, like Lisp, that actually
change the way you think about programming. It makes you feel much more
powerful.

If you don't know the language, I suggest running the following and watching
your CPU activity and memory (or any other metric):

    
    
      import "time"
      func main() {
        for i := 0; i < 10000; i++ { 
          go func() { 
            for { 
              time.Sleep(time.Second) 
            }
          }() 
        }
      }

~~~
tzahola
Your example will _not_ run in parallel. The Go runtime will schedule your
goroutines _concurrently_, but they will be run by a single OS thread, and
consequently on a single CPU core.

Once you actually execute on multiple CPU cores (by increasing GOMAXPROCS),
you'll have the same kind of race conditions in Go as in any other imperative
language (inb4 the Rust Evangelism Strike Force saying "except Rust").

~~~
quadcore
Wrong. Goroutines are _not_ simply coroutines.

GOMAXPROCS has defaulted to the number of cores since Go 1.5.

~~~
tzahola
Prove it.

Prove it by replacing Sleep in your example with some number crunching, and
show how it scales with the number of cores in your CPU.

~~~
ta2384428
[https://imgur.com/a/DNpw3](https://imgur.com/a/DNpw3)

Running the following code.

[https://play.golang.org/p/k_rRxNAyb0i](https://play.golang.org/p/k_rRxNAyb0i)

I can assure you that Go runs across all processors by default.

[https://docs.google.com/document/d/1At2Ls5_fhJQ59kDK2DFVhFu3...](https://docs.google.com/document/d/1At2Ls5_fhJQ59kDK2DFVhFu3g5mATSXqqV5QrxinasI/edit)
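
For anyone who doesn't want to click through, a minimal sketch of that kind of
test (my own rough version, not the playground snippet above) just replaces
the Sleep with some arithmetic and compares one goroutine against one
goroutine per core:

      package main

      import (
        "fmt"
        "runtime"
        "sync"
        "time"
      )

      // burn keeps a core busy with pointless integer arithmetic.
      func burn(n int) uint64 {
        x := uint64(1)
        for i := 0; i < n; i++ {
          x = x*6364136223846793005 + 1442695040888963407
        }
        return x
      }

      func main() {
        const work = 200000000
        cores := runtime.NumCPU()
        fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0), "NumCPU =", cores)

        // Serial baseline: the whole workload on a single goroutine.
        start := time.Now()
        for i := 0; i < cores; i++ {
          burn(work)
        }
        fmt.Println("serial:  ", time.Since(start))

        // Parallel: the same total workload, one goroutine per core.
        start = time.Now()
        var wg sync.WaitGroup
        for i := 0; i < cores; i++ {
          wg.Add(1)
          go func() {
            defer wg.Done()
            burn(work)
          }()
        }
        wg.Wait()
        fmt.Println("parallel:", time.Since(start))
      }

On a multi-core machine the parallel run should finish in roughly 1/NumCPU of
the serial time, without touching GOMAXPROCS at all.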

~~~
grkvlt
That's not what he's saying. He knows Go can use all processors if GOMAXPROCS
is set correctly; the argument seems to be that there will be race conditions
just like with any other threading, which seems pretty self-evident to me:
yes, multi-threaded code can have concurrency issues, film at 11...

------
slivym
Seems like an incredibly long-winded way of saying 'To go faster you either
need to split up each instruction into lots of parts or increase the voltage
for the transistors. We've split the instructions as much as we can, and power
consumption is proportional to voltage cubed, so it's not a scalable plan.'
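
For reference, the usual first-order reasoning behind "voltage cubed" (my
restatement, not something from the article):

    P_dynamic ≈ a · C · V² · f     (switching power)
    f_max ∝ V (roughly)            (higher voltage lets transistors, and thus the clock, switch faster)
    raise f by raising V  =>  P ∝ V³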

~~~
IIAOPSW
More important than mere power consumption, we don't have a way to remove the
waste heat generated. Dennard's law (like Moore's law, but for power
consumption per transistor) ended about 10 years ago, at exactly the same time
clock speeds stopped improving. There are actually a few computers out there
that run at around 10 GHz, but they all have impractical cooling systems.

If there were ever a return to exponential scaling, we would very soon run
into the Landauer limit.

~~~
deepnotderp
> If there were ever a return to exponential scaling, we would very soon run
> into the Landauer limit.

No we wouldn't. We're around ~10,000X off and would run into thermal danger
zones long before.

~~~
IIAOPSW
BUT... the Landauer limit is over-optimistic, because unless your computer
runs at absolute zero you need to keep redundant copies of each bit for error
correction. Transistors are implicitly error-correcting in the sense that each
bit is represented by a current of a few thousand electrons.

Factoring in the redundancy requirement, we are likely only off by somewhere
between 100x and 1000x. If there were ever a return to exponential
technological improvement, we would run out of road after a few years.

~~~
deepnotderp
That's my point though, the bottleneck is not going to be Landauer's Limit.

------
navjack27
Re: all the programming replies.

Preface, I'm not a programmer, I'm a hardware guy.

It's all well and good to make sure your programs and future programs can be
run in a parallel fashion, but there is a big hole in that, and it's the
operating system's method of handling cores and threads.

Let's use Folding@home as an example. Very multithreaded. Now let's use, at
first, the Ryzen 1800X as the hardware we'll run it on. We have 8 physical
cores, arranged as two four-core CCX modules, each with its own level 3 cache.
As you use your system and you are also folding, even in the newest Linux
kernel, data and instructions might get evicted and bounced around and take
latency hits, and thus performance hits. Nothing really locks the work to
cores or threads taking locality into account. You can adjust this with htop
and set the affinity of each folding thread manually.

Beyond AMD, even Intel still has similar issues with the 8700K. Hell, in
general, efficient multithreading seems like a tough compromise for OS
development. "Users" want things to be smooth upon interaction, so you have
preemption. Work wants to get done, but it also wants to be a good citizen to
the rest of the system.

Developers are going to have to learn about, and keep up to date with, much
more than a fancy new language. You're going to have to learn each new CPU
inside and out and how each OS treats it.

~~~
vondur
How did the BeOS designers make the BeOS so good at multiprocessing? I
remember how well the operating system scaled with more than one CPU.

~~~
da_chicken
Probably because it ran on PowerPC. If I remember my systems design class from
20 years ago, RISC makes implementing or scaling multiprocessing easier.

~~~
nradov
Intel processors essentially are RISC now internally. They just expose more
complex instructions as a higher level API.

~~~
JetSetWilly
They are not RISC internally. In fact, I read an interview with an Intel
engineer saying that an Intel CPU has >10,000 uops. That's not in any sense
reduced!

And it doesn't make sense to apply the term "RISC" to internal CPU design
anyway.

------
jedbrown
Notably, single-thread performance of code that is not friendly to
vectorization has not stagnated despite stagnant clock frequency. Indeed,
SPECint performance continues to grow exponentially, albeit more slowly since
~2004.

[https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/#more-760](https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/#more-760)

~~~
Retric
Depends on the code; you can write ASM that runs as fast on an old P4 as on a
modern i7. Just access random RAM locations and modern CPUs suck.

~~~
jedbrown
Increasing CPU clock speed also does not reduce DRAM latency.

~~~
Retric
We are talking about the sum of latencies. The CPU needs to do something
_AND_ you need to fetch from DRAM. Modern CPUs have increased the worst-case
overhead of fetching from DRAM to decrease the average case.

That's usually a good trade-off, until someone wants to make your CPU look
terrible.
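
A made-up illustration of the "sum of latencies" point (the numbers are mine,
purely for scale):

    per iteration:  t_compute + t_DRAM_miss
    at 2 GHz:       10 ns + 60 ns = 70 ns
    at 4 GHz:        5 ns + 60 ns = 65 ns

Doubling the clock buys only ~7% here, because the DRAM term doesn't shrink.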

------
whoisthemachine
My senior design course focused on asynchronous (clock-less) cryptography
circuits; after learning of these, I looked into asynchronous general purpose
processors, and learned that ARM actually designed an asynchronous processor
back in the 2000's [0].

While they've never quite taken off (the extra gates decrease speed and are
harder to manufacture), with the recent side-channel attacks on processor
pipelines, I've been hopeful that I would see something pop up. Imagine a
world where our processors run without a clock!

[0]
[https://www.eetimes.com/document.asp?doc_id=1299083](https://www.eetimes.com/document.asp?doc_id=1299083)

~~~
rcxdude
Wouldn't asynchronous CPUs increase the number of side channels? AFAIK in an
asynchronous circuit every aspect of the calculation may affect the time it
takes to complete.

~~~
whoisthemachine
The attacks I'm aware of use processor timings to measure the effects their
programs are having. However, because an asynchronous circuit doesn't complete
tasks on a reliable cycle, you can't measure how the pipeline is being
affected by your program in the same way. You could find new ways to force the
processor to act in a reliable manner that you could measure, but that may
change quite randomly from processor to processor, or even from moment to
moment, depending on external environmental factors.

------
tagrun
I understand that companies heavily invested in silicon would like to portray
it that way, but CPU frequencies haven't ceased to grow.

DARPA manufactured a THz transistor made of InP back in 2014.

Silicon isn't the only semiconductor in nature, and others are actively being
researched.

Also, "when you increase the frequency you increase the power" (which is their
argument) doesn't explain why they can't increase the frequencies. That was
always the case even back in 1960s.

What they actually need to explain is why they can't make the silicon more
power-efficient anymore; all toy-physics arguments (such
approximations/linearizations work only for a very limited range of
frequencies, if they do at all, anyway meaning their scaling-relations aren't
universal like they're trying to portray and those coefficients they ignore
aren't constant across voltage, frequency, materials, ... either; you almost
never get such simple and universal answers in condensed matter physics even
for much simpler problems) mentioned there could have been made 50 years ago
as well, but silicon CPU frequencies did go up.

~~~
tarlinian
Transistor switching frequency has almost nothing to do with processor clock
speeds, which are almost entirely limited by wire RC delay. You will get
increased drive currents by switching to higher-mobility materials, but the
performance improvement over strained silicon isn't that large.

~~~
tagrun
That's just a plumbing problem, which can be solved by lowering the
temperature or using a different material with lower resistivity. Yes, it'll
probably cost more, but it's a problem that can readily be solved.

But if your switching frequency is slow, it doesn't matter if you use a
superconductor for the wires. It is the switching frequency that truly sets
the limits on gate times, which in turn determine how fast your CPU is.

For the record, SiGe is also very promising in terms of switching speeds.
There have been experiments showing near-THz frequencies.

~~~
deepnotderp
> That's just a plumbing problem, which can be solved by lowering the
> temperature or using a different material with lower resistivity. Yes, it'll
> probably cost more, but it's a problem that can readily be solved.

No it's not. What is this magical material with ultra low resistance? And how
do you plan to reduce capacitance?

Btw, manufacturing terahertz speed transistors is very difficult. There are
Mott FETs which will switch at 10 terahertz, but they're incredibly hard to
manufacture and very power hungry.

~~~
d-sc
I can’t speak on their usability in circuits. However, materials that show
properties characteristic of zero resistance exist. I’ve used them at work
before.

[https://en.wikipedia.org/wiki/Superconductivity](https://en.wikipedia.org/wiki/Superconductivity)

~~~
deepnotderp
And how do you propose to cool every chip to cryogenic temperatures?

Not to mention the manufacturing challenge of integrating superconductors into
chips (I think InP would be the easiest candidate, and that's saying
something...)

------
jacksmith21006
In some cases it's because it isn't necessary. Take the Gen 1 Google TPUs:
they use a 700 MHz clock rate but process 65,536 multiply-accumulates at the
same time. Very simple instructions.
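
Back-of-envelope, using figures from the paper linked below:

    65,536 MACs × 700 MHz × 2 ops per MAC ≈ 92 × 10¹² ops/s

which is roughly the peak throughput the paper quotes, despite the modest
clock.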

Here is a great paper comparing it to conventional silicon running at far
higher clock rates.
[https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf](https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf)

Now what will be interesting is whether this new architecture can be used for
more traditional CS functions.

I love this paper from Jeff Dean on using TPUs instead of a CPU to replace a
B-tree, for example.

[https://research.google.com/pubs/pub46518.html](https://research.google.com/pubs/pub46518.html)

This also solves our multithreading issue: basically it's parallel from the
ground up.

We get a round peg for a round hole.

------
smnscu
Let me take the opportunity to plug my favorite computer architecture course,
from ETH, which was posted here recently.

[https://www.youtube.com/playlist?list=PL5Q2soXY2Zi9OhoVQBXYF...](https://www.youtube.com/playlist?list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_)

[https://safari.ethz.ch/architecture/doku.php](https://safari.ethz.ch/architecture/doku.php)

------
simias
I find their explanation of the pipelining issue slightly confusing, probably
because they tried to simplify it to the extreme:

>One could object to this and note that due to shorter clock ticks, the small
steps will be executed faster, so the average speed will be greater. However,
the following diagram shows that this is not the case.

Said diagram shows that the two-clock-tick step locks the pipeline, i.e. you
can't execute the first clock tick of the next instruction if you're still
running the 2nd part of the previous one. When would this be the case? Isn't
the entire point of pipelining to divide a function into smaller steps that
can be run in parallel? If you can split "step 3" across two clock cycles,
couldn't you effectively subdivide it into two steps that could run in
parallel?

I suppose that eventually you run into the issue that adding additional
pipeline stages increases the logic size, which in turn causes it to run
slower, or something like that. I wish the document were a little more
specific; after all, it doesn't hesitate to throw the physical formulas for
power dissipation into the 2nd part, so clearly it's not afraid to dig into
technical details.

~~~
til
> If you can split "step 3" across two clock cycles, couldn't you effectively
> subdivide it into two steps that could run in parallel?

There is a difference between splitting "step 3" across two clock cycles and
splitting "step 3" into two separate steps. The underlying assumption here is
that "step 3" is indivisible. E.g. if "step 3" were a memory access with a
latency of 500 picoseconds, you couldn't just split it into two steps and make
it load faster.
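
A rough way to see the limit (my restatement, not the article's): with N
stages, a total logic depth t_logic, and a fixed per-stage overhead t_o for
latches and clock skew,

    T_clock ≥ t_logic / N + t_o
    T_clock ≥ t_slowest_indivisible_step + t_o

so as N grows, the clock period approaches t_o plus whatever step (like that
500 ps access) you can't split any further, while every extra stage costs
area, energy, and branch-misprediction penalty.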

------
rbobby
> But remember, wrong overclocking can harm not only processor but you as
> well.

Don't date robots!

------
M_Bakhtiari
They fail to mention that dividing instructions into more pipeline stages
means a greater branch misprediction penalty, which in turn prompts reckless
speculative execution schemes to compensate.

------
dschuetz
The rule of thumb in chip design: your chip clock can only be as fast as your
slowest logic stage allows. Complex logic circuitry slows down potential clock
rates significantly. You can make your logic gates switch faster, thus
allowing longer signal paths, but that has a huge energy trade-off, as the
article states.

Multi-core design now seems to compensate for slower clock rates, but it also
has its trade-offs. It makes software more complex. In the case of the CISC
architecture Intel established, that's a big trade-off, since CISC is supposed
to make its processors easier to program, as opposed to RISC. I don't think
that CISC is a good choice when it comes to massive parallelism.

But, since chip design is so expensive and is considered state of the art
high-tech, we'll need to deal with everything that chip makers throw at us. Or
do we?

~~~
pjc50
The RISC/CISC "tradeoff" is mostly a non-issue at the higher end of processor
design: everything is now a hybrid. On one side you have ARM64, whose SIMD and
floating point extensions hardly qualify as "reduced", and on the other you
have Intel systems with a suspiciously RISC-like internal architecture fed by
a decoder for the "legacy" CISC instruction set.

It still matters at the small end, which is why Cortex-M exists.

> Or do we?

A startup can design its own chips, but good luck getting anyone to _use_
them.

~~~
kec
It worked for P.A. Semi (eventually).

------
mysterypie
The article doesn't answer the question at the fundamental level. The closest
it gets is this: _" Increased frequency depends heavily on the current level
of technology and advances cannot move beyond these physical limitations."_

Certainly, Moore's law is just an observation and cannot go on forever. Would
it be fair to say that we've simply reached the point where we can no longer
"keep up" with Moore's observation because the technology is getting harder,
and not because we've actually reached any limit of _physics_?

~~~
Sammi
You're skipping over the main point of the article, which is that the reason
it's hard to increase the clock rate is that it is limited by the slowest step
that has to complete in one tick.

And the main way to make an instruction faster is to split it into more steps,
but all instructions have by now been split as much as is possible while still
having them operate correctly.

~~~
hinkley
But this shouldn’t be true for a superpipelined processor, right?

Or put it another way, let’s say phase 3 contains several important
instructions that cannot be reduced to a length of less than 1.7 clock ticks.
If the pipeline stalls you have other pipelines that won’t.

Or you get crazy and put in 2 copies of the slow paths of phase 3 and one
takes the even ticks and the other the odd ones.

------
ksec
Until we have another material that could replace silicon. Not sure if we will
see this happen in the next ten to twenty years.

~~~
tedsanders
The 60 mV/dec limit applies to all materials at room temperature, not just
silicon. Beating it will require a fundamentally different type of device
operation. Agreed that changes in materials will require massive investment
and learning.
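
For reference, that limit is the thermal subthreshold slope of a conventional
FET:

    SS_min = (kT/q) · ln(10) ≈ 60 mV per decade of current at T ≈ 300 K

It depends on temperature and on the thermionic mode of operation, not on the
channel material, which is why devices like tunnel FETs try to change the mode
of operation rather than the material.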

~~~
Symmetry
Sure, but many materials have _much_ higher electron and hole mobility than
silicon has.

~~~
deepnotderp
He's talking about the Boltzmann limit of MOSFETs.

------
otabdeveloper1
Exponential growth can't be infinite. Who knew! Mind blown!

------
stringer
I've been looking at Haskell, Rust and Go to help with parallelism, but
decided to go with a lesser-known language: Pony. I haven't used actors much
yet, but it looks really promising.

~~~
lmm
Particularly given you haven't used actors, what advantages does Pony give you
over Haskell?

~~~
stringer
Right now I find Pony more approachable than Haskell. Maybe because I never
fully grokked monads (probably my fault for not persevering enough); I always
had problems composing monads.

Also, the promise of Pony is a garbage collector that runs concurrently with
program execution, and since I want to write low-latency server code this
feature sounds very appealing.

------
AlfeG
It's funny to read the comment section of the Russian source article. People
argue about the need for multicore processors in a desktop PC. Especially when
I'm reading those comments from a 16-core machine.

------
yAnonymous
>temperature

Using a proper thermal interface material in their CPUs would be a start...

When you can decrease the temperature of Intel CPUs by 20°C with delidding,
the heat argument seems rather contrived.

------
nothis
"Only _you_ can prevent overclocking fires!"

Nicely written. Seems like the intended headline was "why it's bad to
overclock", though!

------
chx
It's really a matter of cost and cooling: the IBM z13 runs at 5 GHz, the z14
at 5.2 GHz.

~~~
jacquesm
No, it's a matter of economics. Those zXX chips are not cheap, nor is what
they interface to.

------
ponyous
There is a new relevant episode of changelog's podcast that talks about CPU
advancements.

[https://changelog.com/podcast/284](https://changelog.com/podcast/284)

------
api
Cost/benefit, AFAIK: higher clock speeds require advanced cooling, use more
power, etc., and it's been possible to get more speed at lower cost by
increasing the transistor count instead.

------
drudru11
Are there any CPUs out there with FPGAs tacked on that are available to the
hobbyists/gamer/build your own PC crowd?

~~~
rphlx
There are some hobbyists using the 28nm Xilinx Zynq, a hardened
circa-2009-cell-phone dual-core ARM with on-die FPGA. One popular board is the
[https://www.crowdsupply.com/krtkl/snickerdoodle](https://www.crowdsupply.com/krtkl/snickerdoodle)

------
mozumder
Asynchronous CPUs or bust.

Or, better yet, wave pipelining...

------
MaxBarraclough
It's there if you want it, but liquid helium doesn't come cheap.

------
swarnie_
"CPU manufactures will not allow a meltdown to happen."

No one in the office understands why I'm laughing....

~~~
hackme1234
The spectre of meltdown is haunting CPU manufacturers.

~~~
krylon
Great, now I have to clean all that tea off of my display. ;-P

~~~
MichaelMoser123
And I thought humor was verboten around here... Does HN have an exception in
the book that applies to Intel?

~~~
chimprich
I don't think humour as such is frowned upon on HN. If I were to attempt to
write down the unwritten rules, I would say that posts that are just jokes
tend to go down badly, but jokes that make a point, or serious posts written
with some wit, are generally accepted.

------
vdfs
_> But there are also strong concerns that the increased frequency will raise
the CPU temperature so much that it will cause an actual physical melt down.
Note that many CPU manufactures will not allow a meltdown to happen_

~~~
garmaine
At least they have a sense of humor.

Edit: Oh, 2014. The sweet irony.

------
Majora320
(2014)

~~~
sctb
Thanks! Updated.

------
logicallee
Guys, I'll be honest: I found it really odd that the article didn't talk
about the speed of light and die size constraints. (c / 4 GHz = 7.49 cm; if
you double that frequency you have half that distance in which to put
components between any two clock ticks.)

But there are limits to my hubris: this is on intel.com, so I'm going to go
with "I'm the one missing something". Are the speed of light, and the number
of transistors you can put in that path (due to die size), just not practical
constraints? Neither is mentioned.

~~~
pjc50
The key factors in integrated circuit delay have more to do with capacitance:
in order to switch a gate's transistor from off to on, the driving gate has to
charge the capacitance of the connecting wire and of the driven gate. Making
features closer together increases their mutual capacitance.

(Source: I worked on this for a chip design software company. The delay
approximation was based entirely on R/L/C modelling and had no terms for the
speed of light per se. If I remember rightly it was calculated in integer
picometers; I definitely remember it emitting an error message if you had
more than 2cm of wire in any one net!)
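
Illustrative numbers (mine, not the parent's, and only order-of-magnitude):
for 1 mm of fine-pitch on-chip wire with, say, r ≈ 1 kΩ/mm and c ≈ 0.2 pF/mm,

    distributed RC delay ≈ 0.5 · r · c · L² ≈ 0.5 · (1 kΩ/mm)(0.2 pF/mm)(1 mm)² = 100 ps
    time of flight       ≈ 1 mm / (c₀/2)   ≈ 7 ps   (signal velocity in the on-chip dielectric)

so the RC charging term dominates the speed-of-light term by more than an
order of magnitude, which is why the tools model R/L/C rather than propagation
delay.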

~~~
v_lisivka
I.e. the problem is the state of the processor, which needs to be erased. So
we could make a stack of stateless processors which readily accept fresh data,
because they only need to charge their capacitors rather than discharge them,
and which are then discharged after use. A kind of multi-core design, but with
each core used for only 1/n of the time, at e.g. 1 THz. Unlike a parallel
system, sequential calculation would run faster in such a setup.

~~~
rcxdude
Discharging and charging capacitance is generally pretty symmetric. I don't
think what you're suggesting would provide much benefit.

~~~
v_lisivka
But heating and cooling are not symmetrical. We can heat a processor much
faster than we can cool it. So, if we need to cool a processor 10x faster than
we can, just use 10x more processors and switch between them in order, letting
each one cool after its stint in overclocked mode. Using this simple
technique, the frequency could be raised by a few GHz, which is important for
serial computations.

