
Why have CPUs been limited in frequency to around 3.5Ghz for so many years? - NARKOZ
http://www.reddit.com/r/askscience/comments/ngv50/why_have_cpus_been_limited_in_frequency_to_around/
======
ChuckMcM
The answer is 'simple'; it comes in four parts:

1) Transistors dissipate the most power when they are switching (this is
because they pass through their linear region). So the more often they switch,
the more power they dissipate; thus the power dissipation of any transistor
circuit is proportional to its switching frequency.

2) Silicon stops working reliably above 150 degrees C; this is due to thermal
effects in the silicon overwhelming its electrical characteristics. One way to
think about it is that at 150 degrees C the silicon lattice is vibrating so
hard from the heat that the electrons can't move reliably anymore.

3) Lithography and manufacturing techniques have increased the number of
transistors per square millimeter of silicon exponentially over the years,
with the transistor count observed to double every 18 months or so.

4) The ability to channel heat from silicon to the outside air is limited by
the thermal conductivity of silicon and the ceramics used to encase it.

So those four parameters create a 'box' which is sometimes called the 'design
space.' If you build a chip that is inside the box it works, if one of the
parameters goes outside the box it fails.
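
To make that 'box' concrete, here's a back-of-envelope sketch in Python; every
constant (switch energy, activity factor, thermal resistance) is an
illustrative guess, not data for any real chip:

    # Toy 'design space' check: dynamic power grows with transistor count and
    # clock frequency (points 1 and 3), and the die has to stay below ~150 C
    # given a fixed thermal path to the air (points 2 and 4).
    SWITCH_ENERGY_J  = 0.5e-15   # assumed energy per transistor switch
    ACTIVITY         = 0.05      # assumed fraction of transistors switching
    THETA_JA_C_PER_W = 0.5       # assumed junction-to-air thermal resistance
    T_AMBIENT_C, T_MAX_C = 40.0, 150.0

    def junction_temp(transistors, freq_hz):
        power_w = transistors * ACTIVITY * SWITCH_ENERGY_J * freq_hz
        return T_AMBIENT_C + power_w * THETA_JA_C_PER_W, power_w

    for freq_ghz in (1, 2, 3, 4, 5):
        t, p = junction_temp(transistors=2e9, freq_hz=freq_ghz * 1e9)
        status = "inside the box" if t < T_MAX_C else "outside the box"
        print(f"{freq_ghz} GHz: {p:5.0f} W, Tj = {t:5.0f} C -> {status}")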

In the great MHz race of 1998, clock rates were pushed up, which drove heat
dissipation up, and transistor counts were rising too, so you got an n^2
effect in terms of heat dissipation. The race was powered by consumers who
used the single number to compare machines (spec wars). It was unsustainable.

The end of that war came when AMD introduced multi-core (two CPUs for the
price of one!) architectures. And Intel had the largest design failure in its
history when it scrapped the entire Pentium 4 microarchitecture after
realizing it would never get to, much less past, 4GHz as they had promised.

AMD proved that for user-visible throughput, multiple cores could give you a
better net gain than a faster CPU. This sidestepped the n-squared problem by
having twice the number of transistors running at the same clock rate, so the
heat only went up linearly rather than quadratically.

It was a pretty humbling moment for Intel.

That begat the 'core wars' where Intel and AMD have worked to give us more and
more 'cores'. The heat problem was still there, but it was managed with
transistor design since the frequencies were staying flat.

Recently, some new transistor designs and system micro-architectures have
combined with the inevitable flattening of performance gains from multiple
cores (see Amdahl's law) to give us 'turbo-boost' type solutions, where only
one core runs at higher speed (sidestepping Amdahl) at the expense of
down-clocking or even turning off other cores (sidestepping the frequency
component of power increases).

Another technology on the horizon (which used to be only for the military
guys) is SoD or Silicon-on-Diamond. Diamond is a wonderful conductor of heat,
so if you make your processor on a diamond substrate you can pull lots of
heat out into an attached cooling system. Get ready, though, for 7" x 7" heat
management assemblies that attach to these things. Or, alternatively, a new
ATX-type form factor that includes something that looks like a power supply
but is a chiller that attaches via tubes to the processor socket.

~~~
TimGebhardt
It wasn't the multiple cores that allowed AMD to steal market share from
Intel. It was the original Athlon chips around the turn of the century. They
were the first chips that ran at a lower clock speed but produced far superior
benchmark numbers in just about any benchmark. Combine that with the much
higher prices that Intel charged for inferior products and AMD had a great run
until Intel ditched the Pentium 4 line.

<http://en.wikipedia.org/wiki/Athlon>

Dual core chips came along much later.

~~~
wtallis
The original Athlon was all about clock speed and the race past 1GHz. The P4
quickly surpassed it, though, and left the Athlon competing on the basis of
cost (P4s required RDRAM, which was much more expensive). In the fall of 2001,
the P4 hit 2GHz and added DDR SDRAM support before the Athlon XP showed up at
1.5GHz, and the Athlon XP was never really able to clearly beat Intel's
fastest. It wasn't until the Athlon 64 that AMD was able to win across the
board with significantly lower clock speeds.

------
jacquesm
I'm probably in a very small minority but I think that the 'free ride' we got
from Moore's law in terms of transistor density leading to an increase in
clock frequency has caused us to be locked in to a situation that is very much
comparable to the automobile industry and the internal combustion engine.

If it weren't for that we'd have had to face the dragons of parallelization
much earlier and we would have a programming tradition solidly founded on
something other than single threaded execution.

Our languages would likely have had parallel primitives and would be able to
deal with software development and debugging in a multi-threaded environment
in a more graceful way.

Having the clock frequency of CPUs double on a fairly regular beat has allowed
us to ignore these problems for a very long time, and now that we do have to
face them we'll have to unlearn a lot of what we take to be unalterable.

I'm not sure about much regarding the future; the one thing I do know is that
it is parallel, and if you're stuck waiting for the clock frequency increases
of tomorrow you'll be waiting for a very long time, possibly forever.

~~~
tezza
I think we would just have run into _Amdahl's Law_ earlier.

<http://en.wikipedia.org/wiki/Amdahl%27s_law>

.

Basically you don't get a free pass for parallelising things.

.

If 90% of a task is parallelizable then the maximum speedup you can get with
_infinite cores_ is 10x
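
A two-line Python sketch of that limit, just plugging numbers into Amdahl's
formula:

    def amdahl_speedup(parallel_fraction, cores):
        # the serial part stays fixed; only the parallel part shrinks
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

    for cores in (2, 4, 16, 1024, 10**9):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
    # tends toward 1 / (1 - 0.9) = 10x no matter how many cores you add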

~~~
jacquesm
You're right, there is no free pass.

Let me give you one example of how the wrong habits have become embedded in
how we think about programming because of the serial nature of former (and
most current) architectures:

One of the most elementary constructs that we use is the loop. Add four
numbers: set some variable to 0, add in the first, the second, the third and
finally the fourth. Return the value to the caller. The example is short
because otherwise it would be large, repetitive text. But you can imagine
easily what it would look like if you had to add 100 numbers in this way, and
'adding numbers' is just a very dumb stand-in for a more complicated reduce
operation.

That 'section' would be called a critical section, and you'd be stuck, unable
to optimize it any further.

But on a parallel architecture that worked seamlessly with some programming
language what could be happening under the hood is (first+second) added to
(third+fourth). You can expand that to any length list and the time taken
would be less than what you'd expect based on the trivial sequential example.
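
A rough Python sketch of the two shapes of the same sum, the sequential chain
versus the pairwise tree; the setup overhead described below is exactly what
keeps the second form from paying off at this tiny granularity today:

    def sequential_sum(xs):
        # the classic loop: each add depends on the previous one,
        # so n numbers take n-1 dependent steps
        total = 0
        for x in xs:
            total += x
        return total

    def tree_sum(xs):
        # pairwise reduction: (first+second) and (third+fourth) are
        # independent, so a parallel machine could do each level at once;
        # the depth is ~log2(n) instead of n-1
        if len(xs) == 1:
            return xs[0]
        mid = len(xs) // 2
        return tree_sum(xs[:mid]) + tree_sum(xs[mid:])

    nums = list(range(1, 101))
    assert sequential_sum(nums) == tree_sum(nums) == 5050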

Right now you'd have to code that up manually, and you'd have to start two
'threads' or some other high level parallel construct in order to get the work
done.

As far as I know there is no hardware, compiler, operating system or other
architectural component that you could use to parallelize at this level. The
bottleneck is the overhead of setting things up, which should be negligible
with respect to the work done. So parallelism would have to be built into the
fabric of the underlying hardware, with extremely low overhead from the
language/programmer's point of view, in order to be able to use it in
situations like the one sketched above.

Those non-parallelizable segments might end up a lot shorter once you're able
to invoke parallel constructs at that lower level.

~~~
algoshift
FPGAs are where parallel processing is alive and well.

I've done a lot of real-time image processing work on FPGAs. The performance
gains one can create are monumental. One of my favorite examples (one that
parallels yours) is the implementation of an FIR (Finite Impulse Response)
filter --a common tool in image processing.

How it works: Data from n pixels is multiplied by an equal number of
coefficients, the results are added and then divided by n. There are more
complex forms (polyphase) but the basics still apply.

Say n=32. You multiply 32 pixel values by 32 coefficients. Then you sum all 32
results, typically using a number of stages where you sum pairs of values. For
32 values you need five summation stages (32 values -> 16 -> 8 -> 4 -> 2 ->
result). After that you divide and the job is done.
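
A software sketch of that structure in Python, assuming the 32-tap example
above; on the FPGA each multiply and each adder stage is its own piece of
hardware, so every stage works on a different sample in the same clock cycle:

    def fir_sample(pixels, coeffs):
        # stage 1: 32 independent multiplies (all parallel in hardware)
        products = [p * c for p, c in zip(pixels, coeffs)]
        # stages 2-6: sum pairs, 32 -> 16 -> 8 -> 4 -> 2 -> 1
        # (assumes a power-of-two number of taps)
        values = products
        while len(values) > 1:
            values = [values[i] + values[i + 1]
                      for i in range(0, len(values), 2)]
        # final stage: divide by n
        return values[0] / len(pixels)

    print(fir_sample(list(range(32)), [1] * 32))   # average of 0..31 = 15.5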

Due to pipelining this means that you are processing pixel data through the
FIR filter at the operating clock rate. The first result takes 8 to 10 (or
more) clock cycles to come out of the pipe. After that the results come out at
a rate of one per clock cycle.

If the FIR filter is running at 500MHz you are processing five hundred million
pixels per second. Multiply that by 3 to process RGB images and you get to 1.5
billion samples per second.

BTW, this is independent of word width until you start getting into words so
wide that they cause routing problems. So, yes, with proper design you could
process 1.5 billion 32-bit words per second. Try that on a Core i7 using
software.

For reference, consumer HD requires about 75MHz for real-time FIR processing.

A single FPGA can have dozens of these blocks, equating to a staggering data
processing rate that you simply could not approach with any modern CPU, no
matter how many cores.

This is how GPUs work. Of course, the difference is that the GPU hardware is
fixed, whereas FPGAs can be reconfigured with code.

My point is that we know how to use and take advantage of parallelism in
applications that can use it. Most of what happens on computers running
typical consumer applications has few --if any-- needs beyond what exists
today. When it comes to FPGAs and this form of "extreme parallel
programming", if you will, we use languages like Verilog and VHDL. Verilog
looks very much like C and is easy to learn. You just have to think hardware
rather than software.

The greater point might also be that when a problem lends itself to
parallelization a custom hardware approach through a highly reconfigurable
FPGA co-processor is probably the best solution.

The next major evolution in computing might very well be a fusion of the CPU
with vast FPGA resources that could be called upon for application specific
tasks. This has existed for many years in application-specific computing,
starting with the Xilinx Virtex 2 PRO chips integrating multiple PowerPC
processors atop the FPGA array. It'd be interesting to see the approach reach
general computing. In other words, Intel buys Xilinx and they integrate a
solution for consumer computing platforms with APIs provided and supported by
MS and Apple.

~~~
justincormack
FPGAs have been the next big thing for at least 15 years, during which time
the consumer GPU has made huge performance increases and is turning into a
general-purpose parallel vector unit. It still seems to me the GPU is on track
to fill this role and the FPGA will stay niche; has anything changed that
would alter this?

~~~
wtallis
What's changed is the definition of GPU. 15 years ago, the most
programmability a GPU had was a few registers to switch between a handful of
hardcoded modes of operation. For the past several years, GPUs have been
something like 80% composed of Turing-complete processors that can run
arbitrary software, and they've more recently gained features like a unified
address space with the host CPU, and the ability to partition the array of
vector processors to run more than one piece of code at the same time. These
features are 100% irrelevant to realtime 3d rendering.

GPU architectures don't exactly resemble FPGAs when viewed up close, but for
the right kind of problem, an application programmer can now deal with a GPU
in the same manner as an FPGA. (In fact, with OpenCL, you can use the exact
same APIs.)

------
xxcode
Higher frequency means that you have to run at a higher core voltage (this
is, in part, because gate propagation delay is inversely proportional to bias
voltage, i.e., the voltage corresponding to a 'high' or 1 bit). You have to
decrease the propagation delay so that the clock tick gets everywhere in the
processor quickly (so one part of the clock doesn't lag the other parts, which
is called clock skew). So in order for things to run faster, you'd have to run
it at a higher V(bias). Now that means higher thermal costs (things get
hotter) - heat produced grows roughly with the square of the voltage.

So it's now mostly a thermal management problem. This is the primary problem
in the newer chips. Even though we can pack in more transistors, we can't move
signals among them faster without a higher V, which makes the chip run too
hot.
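
A rough sketch of why that bites so hard, assuming as a first-order model that
achievable frequency tracks voltage roughly linearly and that dynamic power
goes as C*V^2*f, so power grows roughly with the cube of frequency (the
baseline numbers are made up):

    # Illustrative baseline: a 3 GHz part at 1.0 V burning 80 W.
    BASE_F_GHZ, BASE_V, BASE_P_W = 3.0, 1.0, 80.0

    def power_at(freq_ghz):
        v = BASE_V * (freq_ghz / BASE_F_GHZ)     # assume f roughly tracks V
        return BASE_P_W * (v / BASE_V) ** 2 * (freq_ghz / BASE_F_GHZ)

    for f in (3.0, 3.5, 4.0, 5.0):
        print(f"{f} GHz -> ~{power_at(f):.0f} W")   # ~80, ~127, ~190, ~370 W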

Therefore, our solution is to use the extra transistors to create a separate
new processor, running with a different clock, so the 'tick' doesn't have to
reach all parts of this rather big chip, but just needs to be synchronized
intra-core.

~~~
shabble
I'm curious if asynchronous logic will get brought back in as time goes on.
Multi-core is essentially desynchronising the individual cores since (iirc)
clock generation and distribution is one of the most expensive parts of the
processor, in both energy and size.

Historically, it lost out because of the added complexity of
handshaking/synchroniser logic scattered all over the place, and the mess a
mishandled metastable condition propagating through the system could cause.
Now that we've got transistors to burn, however, and routing is no longer done
on large empty floors with crepe tape...

~~~
copper
Well, Handshake did make an asynchronous ARM core not too long ago, and
Achronix has a nice business in async FPGAs, so the idea isn't dead. That
said, I'm not sure (though I'd love to be proved wrong) that the idea will
really go big again. Synchronous chips are still easier to reason about.

By far the most expensive real estate on a typical processor is on-chip RAM,
and that really does need a clock. Sure, clock-tree synthesis is complex
enough that you may even still be able to start up a company selling a tool
for it, but it's still something a few reasonably competent engineers can do
within a few weeks.

~~~
scarmig
Chuck Moore has also recently released a clockless asynchronous chip with 144
cores.

Though it has to be programmed in an obscure dialect of colorForth...

------
microarchitect
This is a good example of why science isn't a popularity contest. The top
voted reply makes some vague noises about vt-scaling and leakage. It then
claims that we "don't get a very good "off" if the threshold voltage is too
low". This is incorrect. Leakage doesn't degrade the logic values for CMOS-
style logic, which is the vast majority of the digital logic in the world.

The real issue, which the OP may or may not have been trying to get at, is
that leakage power eats into the chip's power budget. Since we'd rather not
burn our budget on leakage, we reduce leakage by increasing the threshold
voltage.
But unfortunately, this comes at the cost of frequency.
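
A small sketch of that trade-off, using the textbook subthreshold-leakage
exponential and a crude alpha-power-law stand-in for gate speed; the constants
are illustrative, not from any real process:

    S_MV_PER_DECADE = 90.0   # assumed subthreshold slope

    def relative_leakage(vt_volts):
        # leakage drops ~10x for every S mV added to the threshold voltage
        # (expressed relative to a hypothetical Vt = 0 device)
        return 10 ** (-(vt_volts * 1000.0) / S_MV_PER_DECADE)

    def relative_fmax(vt_volts, vdd=1.2, alpha=1.3):
        # less gate overdrive (vdd - vt) means slower gates, hence lower fmax
        return ((vdd - vt_volts) / vdd) ** alpha

    for vt in (0.2, 0.3, 0.4):
        print(f"Vt={vt}: leakage x{relative_leakage(vt):.1e}, "
              f"fmax x{relative_fmax(vt):.2f}")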

The other very important issue is that of power density. There's a famous
graph that Shekhar Borkar of Intel [1] made showing that if we continued to
ignore power dissipation issues as we had in the past, our chips would run
ridiculously hot (IOW, they wouldn't work at all because they'd just burn
themselves to death).

There are also other issues like the fact that your wires start acting like
antennas at 5+ GHz and that reliability concerns like electromigration and
dielectric breakdown are getting worse with newer technologies.

A much more recent and reliable (not to mention highly-cited) reference on
scaling and power issues is [2].

[1] I found a copy here: <http://www.nanowerk.com/spotlight/id1762_1.jpg> [2]
<http://www-vlsi.stanford.edu/papers/mh_iedm_05.pdf>

~~~
mbell
While I agree that power is the primary issue, leakage does have a correlation
with "degrading the off" value.

Specifically, the effect of drain-induced barrier lowering becomes very
significant in the sub-threshold region as the channel length is reduced. This
has the effect of both increasing drain current (leakage) and making the
transistor more difficult to turn off; that is, a larger gate bias is required
to turn the transistor off.

While "leakage" isn't the cause, you do get leakage and a harder to turn off
transistor at the same time, at least for this type of leakage.

~~~
microarchitect
Hmm. In a CMOS structure with a weakly turned-off device, you'd need the
effective off current to be more than about 1/20th of the effective on current
for the logic value to degrade by roughly 5%. Did we have such leaky
transistors at 130nm/90nm when the shift happened?

A quick calculation I did with a 90nm model for which I have some parameters
(vt=0.3, vdd=1.2, subthreshold slope=90mV/decade and DIBL coeff=0.1) seems to
suggest that leakage would still be < 1/100th of the on current.
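
Roughly that estimate written out (a sketch; the exact equations and constants
used above may differ):

    vdd, vt = 1.2, 0.3   # volts (the parameters quoted above)
    S       = 0.090      # subthreshold slope, 90 mV/decade
    eta     = 0.1        # DIBL coefficient

    # In the off state (vgs = 0, vds = vdd) the subthreshold current falls one
    # decade for every S volts of gate voltage below threshold; DIBL
    # effectively lowers the threshold by eta * vds.
    decades_below_vt = (vt - eta * vdd) / S
    print(decades_below_vt)   # 2.0 decades below the at-threshold current

    # The true on current (vgs = vdd, well above threshold) is higher still
    # than the at-threshold current, so I_off/I_on < 10**-2, i.e. < 1/100th.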

~~~
mbell
I believe we're talking about two separate issues, maybe this will help:
<http://en.wikipedia.org/wiki/Drain_Induced_Barrier_Lowering>

~~~
microarchitect
I don't think so. The effect of DIBL is an increase in leakage current with
the drain to source voltage.

The first-order model I was taught in school was that leakage current not
considering DIBL used to be proportional to exp(vgs-vt) but thanks to DIBL
leakage is now proportional to exp(vgs - vt + eta*vds). The eta here is the
DIBL coefficient, which AFAIK is 0.1 or thereabouts.

~~~
mbell
The first-order model falls apart as channel length decreases.

------
JoshTriplett
We've gotten very close to physical limits. Quick back-of-the-envelope
estimate: if electrons traveled at the speed of light through silicon (they
don't), then in one cycle at 3GHz an electron could travel 0.1 meters. In
reality, electrons in silicon travel quite a bit slower than that. Net result:
electrons can barely cross the diameter of the chip in one cycle, even without
gate propagation delays and other factors limiting work done per cycle.
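
The arithmetic behind that estimate:

    C_M_PER_S = 3.0e8   # speed of light in vacuum

    for freq_ghz in (3, 10, 100):
        cm_per_cycle = C_M_PER_S / (freq_ghz * 1e9) * 100
        print(f"{freq_ghz} GHz: light travels {cm_per_cycle:.1f} cm per cycle"
              " (signals in silicon travel a good deal less)")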

So, if you want data from a tiny distance away, such as a local register, you
can grab it and do something simple with it in one cycle. If you need data
from any further away, forget about it; cache takes longer, another core takes
even longer, and memory takes _far_ longer.

~~~
Confusion
Fun fact: electrons in a metallic wire under a potential difference move only
on the order of mm/s. See e.g. [1]

As electrical signals still travel at about half the speed of light, that
seems beside the point. However, electrical signals do not need to span the
entire width of a chip in order to have an effect. That is true for your
example of accessing memory, but not for the actual calculations happening
inside the CPU.

On each clock cycle, all transistors are 'fed' by the transistors before them.
The state of a transistor only depends on the states of the immediately
connected transistors on the clock cycle before it. That means the signal
only has to travel the length of a single transistor on each clock cycle. We
could have CPUs working at 3THz and a much larger focus on the difference
between CPU-bound and IO-bound tasks.

[1] <http://amasci.com/miscon/speed.html>

~~~
jacquesm
The easiest to understand analogy to illustrate this that I've found to date
is to imagine a tube full of marbles, push one more marble in on one end and
another one will pop out _instantly_ on the other end. As long as the marbles
are all the same colour it is as though the marble you pushed in has
miraculously teleported to the far end of the tube. It appeared to have moved
at the speed of light even though in reality all the marbles have shifted only
an amount equal to the diameter of a single marble.

~~~
Confusion
There is a problem with that analogy: if you only apply pressure to a single
marble, the marble at the other end will _not_ be pushed out instantly. In
fact, the 'signal' will travel with the speed of sound in the material of the
marbles and you can measure the time difference it takes for the last marble
to move after the first one has moved. The marbles will be slightly compressed
and decompressed while moving, accounting for the extra length needed to
accommodate the 'not moving instantly'. [1]

A better analogy would be one where all marbles are pushed simultaneously.
That is more like what happens in logic circuits.

[1] People doubting relativity often try this thought experiment: I have an
incompressible metal bar of a lightyear long. I press on one side. The other
side must move instantaneously and not after 1 second (or longer). In fact,
this proves the reverse: relativity is incompatible with the existence of
incompressible metal bars. As we know, all metal bars are in fact
compressible, so that is not a problem. And with compressible metal bars, the
thought experiment fails, because the push will travel with the speed of
sound.

~~~
jacquesm
That's true, but it is only an analogy, and like every other analogy it breaks
down at some level (after all, it isn't the 'real thing').

The fact that there is a pressure wave set up in the materials is possible
because marbles are made of some material (glass, stone, metal, whatever). If
you wanted a 'perfect' picture you'd have to explain about electron migration
in detail and then we're looking at a completely different picture.

You'd not have a pressure wave in an electron to begin with, and they're not
'pushing' against adjacent electrons either.

But it serves well to show how a slow move can have an apparent instantaneous
effect at a distance.

~~~
ScottBurson
_[Electrons are] not 'pushing' against adjacent electrons_

Sure they are. Like charges repel.

Charge density in this case plays the same role as mass density does in the
case of a sound wave. It's a very good analogy.

------
kashifr
According to the Sandia Cooler white paper (PDF):
[http://prod.sandia.gov/techlib/access-control.cgi/2010/10025...](http://prod.sandia.gov/techlib/access-control.cgi/2010/100258.pdf)
the limit is due to a "Thermal Brick Wall", basically due to a lack of
progress in heat exchanger technology.

I have visualized this "Brick-Wall" as a graph with data from Wikipedia pdf:
<http://dl.dropbox.com/u/3215373/Thermal-Brick-Wall.pdf>

------
iradik
Donald Knuth on multicore (2008):

Andrew: Vendors of multicore processors have expressed frustration at the
difficulty of moving developers to this model. As a former professor, what
thoughts do you have on this transition and how to make it happen? Is it a
question of proper tools, such as better native support for concurrency in
languages, or of execution frameworks? Or are there other solutions?

Donald: I don’t want to duck your question entirely. I might as well flame a
bit about my personal unhappiness with the current trend toward multicore
architecture. To me, it looks more or less like the hardware designers have
run out of ideas, and that they’re trying to pass the blame for the future
demise of Moore’s Law to the software writers by giving us machines that work
faster only on a few key benchmarks! I won’t be surprised at all if the whole
multithreading idea turns out to be a flop, worse than the "Itanium" approach
that was supposed to be so terrific—until it turned out that the wished-for
compilers were basically impossible to write.

Let me put it this way: During the past 50 years, I’ve written well over a
thousand programs, many of which have substantial size. I can’t think of even
five of those programs that would have been enhanced noticeably by parallelism
or multithreading. Surely, for example, multiple processors are no help to
TeX.[1]

How many programmers do you know who are enthusiastic about these promised
machines of the future? I hear almost nothing but grief from software people,
although the hardware folks in our department assure me that I’m wrong.

I know that important applications for parallelism exist—rendering graphics,
breaking codes, scanning images, simulating physical and biological processes,
etc. But all these applications require dedicated code and special-purpose
techniques, which will need to be changed substantially every few years.

Even if I knew enough about such methods to write about them in TAOCP, my time
would be largely wasted, because soon there would be little reason for anybody
to read those parts. (Similarly, when I prepare the third edition of Volume 3
I plan to rip out much of the material about how to sort on magnetic tapes.
That stuff was once one of the hottest topics in the whole software field, but
now it largely wastes paper when the book is printed.)

The machine I use today has dual processors. I get to use them both only when
I’m running two independent jobs at the same time; that’s nice, but it happens
only a few minutes every week. If I had four processors, or eight, or more, I
still wouldn’t be any better off, considering the kind of work I do—even
though I’m using my computer almost every day during most of the day. So why
should I be so happy about the future that hardware vendors promise? They
think a magic bullet will come along to make multicores speed up my kind of
work; I think it’s a pipe dream. (No—that’s the wrong metaphor! "Pipelines"
actually work for me, but threads don’t. Maybe the word I want is "bubble.")

From the opposite point of view, I do grant that web browsing probably will
get better with multicores. I’ve been talking about my technical work,
however, not recreation. I also admit that I haven’t got many bright ideas
about what I wish hardware designers would provide instead of multicores, now
that they’ve begun to hit a wall with respect to sequential computation. (But
my MMIX design contains several ideas that would substantially improve the
current performance of the kinds of programs that concern me most—at the cost
of incompatibility with legacy x86 programs.)

Source: <http://www.informit.com/articles/article.aspx?p=1193856>

~~~
mrich
Knuth comes across as living in a bit of denial here... Sure the frequency
speedups were much nicer, but faced with hardware limits that make this
impossible, he should point out that the future lies in finding algorithms
best suited to multicore. They surely deserve their own volume of TAOCP, and I
don't see how this would be wasted research for the next 10+ years.

There are some amazing lock-free data structures/algorithms out there which
should be taught in any CS curriculum.

Intel has been releasing some great tools to aid with multicore development
since they realized years ago that this was the only way to get more
performance.

<http://software.intel.com/en-us/parallel/>

~~~
sliverstorm
_Knuth comes across as living in a bit of denial here... Sure the frequency
speedups were much nicer, but faced with hardware limits that make this
impossible_

It's easier to say "the hardware guys are dropping the ball" than to
acknowledge silly things like "Physics".

Now, he could try to argue that hardware should be responsible for
parallelizing things under the covers, and that would be a bit more
reasonable - though I would still have to disagree. CS folks are the ones
supposed to be discovering algorithms; find them, and have the hardware guys
stick them in once they are known.

------
DiabloD3
I wish people would quit asking this question. This is just another case of
the MHz myth. Who cares what the clock speed is if modern CPUs can execute
instructions 2-4x faster per clock than they did 30 years ago.

Go look at the Bulldozer design, 2 hardware decoder/scheduler engines, 4
integer ALUs, 2 fp ALUs, all per core...

A shared L3 cache per socket that is owned by the memory controller and is
socket-local to other sockets (ie, all memory controllers conspire to cache
system memory efficiently and synchronously know what is cached across all
sockets)...

And the memory controllers also accept memory requests from ANY core on the
Hypertransport bus, no matter which socket, and multi-socket boards commonly
have one memory bank per socket, thus 4 sockets of dual channel DDR3-1600
would indeed give 820 gbit/sec of bandwidth that can be accessed (almost) in
full by any individual core[1]...

The ALUs have execution queues, and any thread (currently 2 per core) can
schedule instructions on it to maximize ALU packing...

And you can now buy Bulldozers for Socket G34 that have 16 threads/8 cores per
socket, and G34 boards usually have 4 sockets.

So again, who cares what the MHz is?

[1]: A 4-socket setup has 4 or 6 Hypertransport links; on AM3+ and G34+
sockets this would be HTX 3.1 16-bit-wide links running at 3.2GHz, or 204
gbit/sec per link.

On a 4 socket ring, that would be 204 gbit/sec off each neighbor's memory
bank, plus another 204 gbit/sec (also the speed of dual channel DDR3-1600)
from the local memory bank, thus leading to 612 gbit/sec that could be
theoretically saturated by a single core.

On a 4 socket full crossbar, it would be the full 820 gbit/sec.
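
The arithmetic behind those figures (a sketch; it assumes the 16-bit HT 3.1
links are double-pumped and counts both directions of a link, and that
DDR3-1600 is 64 bits per channel):

    # HyperTransport 3.1, 16 bits wide, 3.2 GHz, double data rate
    ht_per_direction_gbit = 3.2 * 2 * 16        # 102.4 Gbit/s each way
    print(ht_per_direction_gbit * 2)            # ~204.8 Gbit/s per link

    # Dual-channel DDR3-1600: 1.6 GT/s * 64 bits * 2 channels
    local_bank_gbit = 1.6 * 64 * 2              # ~204.8 Gbit/s
    print(local_bank_gbit)

    # 4-socket ring: local bank + 2 neighbours ~ 3 * 204.8 ~ 614 Gbit/s
    # 4-socket full crossbar: all 4 banks      ~ 4 * 204.8 ~ 819 Gbit/s
    # (the post above rounds these to 612 and 820)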

~~~
dchest
_Who cares what the clock speed is if modern CPUs can execute instructions
2-4x faster per clock than they did 30 years ago._

If you could increase clock speed on these modern processors twofold, would
their performance improve?

~~~
jlouis
Not necessarily. The fact is that you _could_ probably do exactly that by
making the pipeline deeper. But making the pipeline deeper has a price: your
control logic becomes much more complex and emptying the pipeline is more
expensive. The question then is: does the price you pay become larger than the
gain? As this is not happening, I have a good guess :)
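
A toy model of that trade-off, with constants invented purely for
illustration: a fixed amount of logic split into more pipeline stages clocks
higher, but each mispredicted branch flushes more stages and the per-stage
latch overhead grows, so the net gain flattens and then reverses:

    # Arbitrary time units; a mispredict is assumed to cost a full pipeline.
    LOGIC_DELAY, LATCH_DELAY = 20.0, 1.0
    BRANCH_RATE, MISPREDICT_RATE = 0.2, 0.1

    def throughput(depth):
        cycle_time = LOGIC_DELAY / depth + LATCH_DELAY
        cycles_per_instr = 1.0 + BRANCH_RATE * MISPREDICT_RATE * depth
        return 1.0 / (cycle_time * cycles_per_instr)

    for depth in (10, 20, 30, 40, 50):
        print(depth, round(throughput(depth), 3))
    # climbs, then flattens and falls: the higher clock stops paying for the
    # costlier flushes and the growing latch overhead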

Historically, the Pentium 4 is an example of a MHz-oriented design. The
pipeline was made ridiculously deep. This meant that when you had heavy
floating-point-intensive work, the Pentium 4 was very, very fast. Essentially
this was because you could fill up the pipeline in the CPU and there are
relatively few jumps in that code. It was just a processing job. On the other
hand, the P4 suffered for almost any other task, even to the point where a
beefier Pentium 3 (Tualatin) could beat it. Intel chose a path which was based
on the P3 design for their Pentium-M CPU, and that is the design which guided
Core, Core 2 Duo, Core i7 and so on.

So it is possible, but it is not a given that it would lead to faster CPUs.
MHz is but one of the knobs you can turn. The problem we are facing is that
all the parallelism we can squeeze out of a sequential program has already
been squeezed out. Modern CPUs dynamically parallelize what can be run
concurrently. But to gain more, now, you need to write programs in a style
that exposes more parallelism.

Welcome to parallel programming :)

------
glimcat
Because of scaling limits. The stuff we're already getting uses very highly
doped silicon. You could go smaller or faster if you could dope it more to
keep the field characteristics viable, but more doping would screw up the
silicon lattice. "More" also generally means an increase of orders of
magnitude.

There are a few alternative processes and materials, but they're costly as all
hell and hard to do in bulk.

------
Symmetry
The main issue is that leakage current has started to become a problem[1].
Back in the day designers could just scale down features and rely on the
reduced capacitance of the smaller areas to lower power usage enough to let
them put in more logic. Unfortunately that only reduces active power, not
leakage power. Transistors have started leaking more now that they're smaller.
You can reduce leakage by lowering the voltage your processor operates at, but
that also causes a reduction in the frequency that your transistors flip at,
because your logic voltage becomes smaller with respect to the transistor
threshold voltage, meaning less current per unit of charge you have to move.
Modern devices are also tending to run up against saturation velocity[2] now,
limiting their switching speed still further.

You certainly could increase clock speed by increasing the voltage you put
into a chip, and just accepting that you're going to have more leakage current
and more wasted power. But we're already close to the edge of what chips can
dissipate right now. You can try having less logic between clock latches,
meaning you have a higher clock speed for your switching delay. However, this
increases the ratio of latches to everything else, so it's a matter of
diminishing returns. It also means that you've pushed your useful logic
further apart, and now you have more line capacitance too. Finally, you can
decrease the temperature of your silicon to substantially reduce the amount of
leakage you get. This will let you raise the voltage safely, letting you
attain faster switching speeds. The only problem is that this requires
expensive cooling devices.

[1]<http://en.wikipedia.org/wiki/Leakage_%28semiconductors%29>
[2]<http://en.wikipedia.org/wiki/Saturation_velocity>

------
tibbon
So what confuses me is that we've seen tech demos (and overclockers) push CPU
speed to 4 to 10GHz. Are those gains so artificial that they just can't be
replicated at scale in the public market?

We can get them to go faster, just seemingly not for the public.

It was weird to buy my new MacBook Pro and find that, after 3+ years, it was
0.2GHz slower than my old one. Now of course it has more cores, better
instructions, etc... but it was still a weird thing.

Apple's done a great job of explaining the benefits of new ones to the
consumers - they don't. I'm not being sarcastic. At the end of the 90's (and
still largely today for most PC companies) it's all about specs. Apple instead
sold what you can do with each computer, in a good, better, best format. I bet
a large number of computers sold at the Apple store never have the sales
associate mention the clock speed.

~~~
j-g-faustus
Here's an article describing what it takes to reach those speeds:
<http://www.tomshardware.com/reviews/5-ghz-project,731.html>

The article is a few years old, and the clock limits have changed, but the
principles of reaching them are the same.

As for replicating it at large in the public market, I don't anticipate that
liquid nitrogen feeds will be ubiquitous anytime soon :)

~~~
apaprocki
The top IBM POWER7 CPU is clocked at 5GHz and isn't cooled with liquid
nitrogen.

~~~
j-g-faustus
As I said, that was a few years ago. Today you can get 5 GHz with Sandy Bridge
on air (although it might entail testing a few dozen CPUs to find one that can
reach that high).

But the OP was referring to speeds up to 10 GHz; those still need something
special. Here's a current Guinness record, 8.4 GHz using liquid helium:
<http://www.youtube.com/watch?v=UKN4VMOenNM>

Other materials might potentially clock higher, an IBM research prototype chip
reached 500GHz: <http://news.bbc.co.uk/2/hi/technology/5099584.stm>

------
jeremysalwen
>What would happen if you added a second compressor loop with its cold side on
the hot side of the first one? What if you added a third one? You've now got 3
stage cascade phase change(compressors cool stuff by compressing gas into a
liquid, then letting it suck up thermal energy while decompressing/evaporating
back into a gas elsewhere, i.e. phase change) cooling on one end, and a
spectacularly inefficient heater on the other.

Now wait a second... that's not just wrong, that's precisely the _opposite_ of
reality. If you have a heat pump, it _must_ be _more_ efficient than a simple
heating element. If it at _all_ cools the processor, by the simple laws of
thermodynamics, it must heat the room more than any process which simply
converts the work directly to room heat.
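
The energy balance makes this concrete (numbers are arbitrary):

    # First law for a heat pump: heat dumped at the hot side equals the heat
    # pulled from the cold side plus the compressor work.
    def heat_into_room(heat_removed_from_cpu_w, compressor_work_w):
        return heat_removed_from_cpu_w + compressor_work_w

    work = 300.0                        # watts of electricity into compressor
    print(heat_into_room(0.0, work))    # 300 W: no cooling, same as a resistor
    print(heat_into_room(100.0, work))  # 400 W: cooling means MORE room heat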

------
foolinator
CPU speed isn't a bottleneck anymore. These days improving caching, threading,
and increasing the bus/memory speed have been the primary contributors to
speeding up a computer. Once those bottlenecks close a bit you'll see those
numbers begin to rise again.

------
redthrowaway
I'm regularly impressed with the mods over at /r/asks Jen e and how high
they've been able to keep the SNR.

~~~
redthrowaway
That would be /r/askscience. Thanks iPhone, helpful as always.

------
its_so_on
this is why:

<http://www.google.com/search?q=c+%2F+3.5+Ghz>

that's a theoretical maximum. You increase GHz, you get a shorter theoretical
maximum length of path electricity can take through your chip. An i7 is not
that small. You would have to shrink things further and further to get a
smaller chip with shorter paths in it, which is what new fabrication methods
have been about.

Obviously this is very difficult. Sure, chips could be even faster if they
were just a few atoms across, but who would expect you to do meaningful
computation in that size, or to be able to manufacture that.

~~~
perokreco
This is completely false. You can read why in the comments above.

------
karolist
No idea. I have my E8400 still kicking ass at 4.2GHz on air.

~~~
karolist
Ok, let me explain to the downvoters.

As the top comment on reddit currently states, there's no issue with the
speed of light; the issue is heat. And there's no "3.5GHz" limit: stock,
mass-produced cooling solutions limit us.

People get all sorts of crazy clock speeds by improving their cooling; mine's
on a Zalman 9500 cooler and has no stability issues running at 4.2GHz (stock
3.0GHz).

