
Zombie Moore's Law shows hardware is eating software - miiiiiike
http://www.theregister.co.uk/2016/09/22/the_evolution_of_moores_law_suggests_hardware_is_eating_software/
======
janekm
This article is based on a faulty premise. The A10 processor is still far from
the performance of recent Intel CPUs (a quick browse of Geekbench shows about
2x the single-core and 4x the multi-core performance for Intel). Apple is
closing in on the limits of Moore's law so quickly not because of a different
model of computation but because those gains had not yet been realised for
mobile CPUs. As the performance gap gets smaller, it is likely that the
year-on-year improvements in processor performance will also slow down for
Apple CPUs. Which is not to take away from the achievements of the A10 design
team: considering performance per watt, this chip is incredible.

~~~
brudgers
I don't disagree. What I found interesting is the bigger picture of hardware.

By which I mean: I think 'performance' is a slippery eel of a concept. The A10
fits in my pocket and the Xeon Phi does not...or rather it doesn't in a way
that provides me with useful computations...and the latest i7 doesn't hold a
candle to the GPU in my $40 graphics card when I want to rotate the 16 million
pixels my camera stuffs into a RAW file...and if I want to build a Kubernetes
cluster, I can throw Raspberry Pis at the problem.

~~~
nkurz
_the latest i7 doesn't hold a candle to the GPU in my $40 graphics card_

I think I understand your point (unless you meant iPhone 7?), but
interestingly, recent Intel processors have impressively powerful graphics
processors built in:
[https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf](https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf).
So if you were using the full capacity of that i7, I think it would beat your
graphics card handily. The issue is that practically no one is writing code to
use the full capabilities of these modern CPUs. A friend did a write-up here:
[http://lemire.me/blog/2016/09/19/the-rise-of-dark-circuits/](http://lemire.me/blog/2016/09/19/the-rise-of-dark-circuits/)

Even without the built-in GPU, I'd bet that the right software running on that
i7 would blow away that $40 graphics card. I think the problem is that we
don't really have the right tools for writing low-level multi-core software.
Image rotation parallelizes and vectorizes really well, and the x64 side of
those processors has excellent vector capabilities, but it still requires
hand coding to get top performance.
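
As a concrete sketch of why that workload is so friendly (my own toy example,
not anyone's production code): every output pixel of a 90-degree rotation
depends on exactly one input pixel, so rows can be farmed out to cores and
the inner loop left to the auto-vectorizer. Compile with e.g. gcc -O2
-fopenmp:

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate an image 90 degrees clockwise: src is h rows of w pixels,
   dst is w rows of h pixels. Every iteration is independent. */
void rotate90_cw(const uint32_t *src, uint32_t *dst, size_t w, size_t h) {
    #pragma omp parallel for              /* rows split across cores */
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            dst[x * h + (h - 1 - y)] = src[y * w + x];
}
```

(Getting from there to peak throughput is the hand-coding part: cache-aware
tiling, wide loads and stores, and so on.)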

 _if I want to build a Kubernetes cluster, I can throw Raspberry Pis at the
problem_

A thought experiment: how fast a cluster could you make from a few dozen
iPhone 7's connected wirelessly? The processors are surprisingly fast, and I
think they support 802.11ac at gigabit-plus speeds. Could you possibly do
distributed computing on an ad-hoc network of iPhones that happen to be
nearby? An app with a sandboxed work queue that accepts local connections?
There are lots of reasons it makes no practical sense, but it would be quite
a demo.
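
Just to make the thought experiment concrete, one node of that demo could be
as dumb as the sketch below (all of it invented for illustration: the port,
the "sum these floats" task, the 4-byte length prefix; a real iOS app would
sit behind Apple's networking APIs and sandbox rules):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_port = htons(9000);                /* made-up local port */
    a.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(srv, (struct sockaddr *)&a, sizeof a);
    listen(srv, 8);
    for (;;) {                               /* one task per connection */
        int c = accept(srv, NULL, NULL);
        uint32_t n;
        if (c >= 0 && read(c, &n, sizeof n) == sizeof n) {
            n = ntohl(n);                    /* task: n floats to sum */
            float *v = malloc((size_t)n * sizeof *v);
            size_t need = (size_t)n * sizeof *v, got = 0;
            ssize_t r;
            while (got < need &&
                   (r = read(c, (char *)v + got, need - got)) > 0)
                got += (size_t)r;
            float sum = 0.0f;
            for (uint32_t i = 0; i < n; i++) sum += v[i];
            write(c, &sum, sizeof sum);      /* result to the coordinator */
            free(v);
        }
        if (c >= 0) close(c);
    }
}
```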

~~~
brudgers
An Intel i7-6700 will do about 200 single-precision GFLOPS [1] and costs about
$300 [2]. An Nvidia GeForce GT 710 will do about 350 single-precision GFLOPS
[3] and costs about $40 [4]. A pixel pipeline is one of those 'embarrassingly
parallel' workloads, and the software I use, Darktable [5], is tuned to take
advantage of GPU parallelism...the tools are there in no small part due to the
gaming industry.
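
For reference, those peak numbers fall out of the usual throughput formula
(my own back-of-envelope; the GT 710's 192 shaders at 954 MHz are from [4],
and note the ~200 GFLOPS i7 figure in [1] is a sustained measurement, not a
theoretical peak):

```latex
\text{peak flop/s} = \text{cores} \times \frac{\text{flops}}{\text{cycle}}
  \times \text{clock}
\qquad \text{GT 710: } 192 \times 2 \times 0.954\,\text{GHz}
  \approx 366\ \text{GFLOPS}
```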

Thinking about an iPhone cluster, the hurdle seems to be software that is
designed to create friction against implementing such a thing: drivers and
firmware in particular.

[1]: https://www.pugetsystems.com/labs/hpc/Skylake-S-i7-6700K-and-i5-6600K-for-compute-maybe-697/

[2]: http://www.newegg.com/Product/Product.aspx?Item=N82E16819117560

[3]: http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=100007709&IsNodeId=1&Description=nvidia&bop=And&SrchInDesc=710&Page=1&PageSize=36&order=BESTMATCH

[4]: https://www.techpowerup.com/gpudb/1990/geforce-gt-710

[5]: https://www.darktable.org/usermanual/ch10s02s04.html.php

~~~
nkurz
_An Intel i7-6700 will do about 200 single-precision GFLOPS [1] and costs
about $300 [2]. An Nvidia GeForce GT 710 will do about 350 single-precision
GFLOPS [3] and costs about $40 [4]._

You are right, and I stand (mostly) corrected. Alternatively said for Skylake:
with an unrolled 256-bit FMA loop, each (physical) core can issue up to two
8x32-bit FMAs per cycle (32 single-precision flops, counting the multiply and
the add in each lane), alongside two vector loads and a vector store.

At 4 GHz that is about 128 Gflop/s peak per core, call it 100 Gflop/s after
loop overhead. For 4 cores this is roughly the same 350 Gflop/s as the $40
graphics card you pointed to. And realistically, if you run for any length of
time you'll probably be thermally throttled back to something closer to the
200 Gflop/s you cite.
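
For concreteness, a minimal sketch of that loop shape (my own illustration):
one 8-wide FMA per iteration, fed by two vector loads and followed by a
vector store. A streaming kernel like this is actually load/store-bound at
roughly one FMA per cycle; reaching the two-FMA-per-cycle peak takes more
arithmetic per byte loaded. Compile with e.g. gcc -O2 -mavx2 -mfma:

```c
#include <immintrin.h>   /* AVX2 + FMA intrinsics */
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i]; n assumed to be a multiple of 8 */
void saxpy8(float alpha, const float *x, float *y, size_t n) {
    __m256 va = _mm256_set1_ps(alpha);         /* broadcast alpha */
    for (size_t i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);    /* vector load */
        __m256 vy = _mm256_loadu_ps(y + i);    /* vector load */
        vy = _mm256_fmadd_ps(va, vx, vy);      /* 8x32-bit FMA: 16 flops */
        _mm256_storeu_ps(y + i, vy);           /* vector store */
    }
}
```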

My general question stands about why no one is interested in using the
built-in, Tflop/s-capable side of the Skylake die, but I was wrong to think
that a well-tuned desktop processor could come anywhere close to the
price/performance of a cheap graphics card for raw flop/s. Thanks for pointing
out the real numbers.

~~~
sliken
Don't forget the i7-6700 is the one with the lame graphics.

Try the i7-6770HQ if you want better graphics performance.

------
okket
This perspective fits into the top-down evolution narrative: Write highly
specialised solutions in the most general way in software until essential,
generic parts are discovered and stable enough to be translated into
transistor logic. Repeat.

Or like the Bitcoin "evolution": CPU -> GPU -> FPGA -> ASIC (this is a very
simple, single-purpose optimisation, but it can illustrate part of the
picture).

Only optimising transistor speed, transistor size, or the whole CPU/GPU
package obviously has limitations and may be a dead end.

~~~
usrusr
Dead end is far too negative. There is nothing wrong with improving general
purpose computing. It's the very best thing you can do - when you can. Just be
ready to adapt (by going for specialization) when you can't.

------
eximius
Huh, not where I thought this article was going from the title. I thought it'd
be about software bloat eating into the gains given by hardware (which, in
hindsight, is exactly backwards from the title).

------
gaius
What is old is new again - this is the architecture of the Amiga! And before
that the C=64 and the Atari 800.

~~~
untilHellbanned
Why was it a non-starter for Intel and what made Apple use it? I'm curious
because I don't know about computer architecture.

~~~
_ph_
Intel seems to be captured in its own x86 world. There were no big changes to
the instruction set for ages; even the x64 extensions were designed by AMD. So
while Intel still excels at manufacturing, the few projects for breaking out
of their box failed (Itanium, Larrabee). And also: why are there no Intel
chips yet that include all the nice things USB-C can offer to the market,
e.g. Thunderbolt and the newest DisplayPort?

~~~
dogma1138
>Intel seems to be captured in its own x86 world.

Eh, no.

>even the x64 extensions were designed by AMD.

AMD (not working solo) published the spec in 2000; AMD, Intel, VIA and a few
others each have pretty different implementations of that spec, and Intel's
x64 extension is drastically different from AMD's implementation.

Intel has been adding new extensions with virtually every generation; just
look at how many generations of virtualization support extensions we've had.

>the few projects for breaking out of their box failed (Itanium, Larrabee).

Itanium died because the industry didn't want to take a "RISC" (;)).

Larrabee isn't dead; in fact it is very much alive, just inside your CPU. The
AVX extension set and the silicon that supports it are Larrabee's vector
processing units. Intel more or less had the insight to see that Larrabee
wasn't really going anywhere and that they could shrink it and implement it
in every CPU within a few years.

Larrabee was interesting, but it was neither here nor there. Intel went on to
implement the good parts of it in each of their own CPUs, and for HPC/GPGPU-
style computing it went a different way by releasing the Xeon Phi.

>And also: why are there no Intel chips yet who include all the nice things
USB-C can offer to the market? E.g. Thunderbolt, newest DisplayPort.

This just shows your utter lack of understanding of the technology landscape.
I was tempted to cynically ask what version of Thunderbolt you have on your
non-Intel machine to see if you would fall for it, but alas, I don't want to
spread misinformation.

USB-C doesn't bring Thunderbolt or DisplayPort to the market; USB-C is merely
mechanically compatible with those two. Thunderbolt is Intel's proprietary
technology, and it's up to the system integrator to decide whether to
implement it or not. Don't buy cheap-ass laptops and you'll get the latest
Thunderbolt.

As for DisplayPort, could you please tell me what the "latest" is? If it is
1.4, there isn't a single desktop GPU that "technically" supports it yet; even
the GTX 1080 is only certified for 1.2 (in theory it can support 1.3/1.4, but
the ink on those specs isn't dry yet, and the certification process is, well,
meh). Thunderbolt 3 is the only current interface that is actually certified
for DP 1.3; it should probably support DP 1.4, but that spec was only
finalized in March this year, so....

~~~
qwertyuiop924
Actually, Itanium died because the perf sucked. And it had no compat.

So yeah...

~~~
dogma1138
The performance didn't suck that much, not for the "RISC" part. The problem
was that too many IA-64 applications relied on the x86 emulation, which was,
well, yeah, very shitty.

~~~
userbinator
The Itanium is a VLIW, not a RISC (although they have some similarities). It
executes "bundles" of 3 instructions in parallel. It was excellent in certain
benchmarks because the compiler could schedule sequences to make full use of
this; but it turns out that general-purpose code isn't really so parallel all
the time, so the compiler would have to fill 1 or 2 of the 3 slots in each
bundle with NOPs, wasting space (thus cache usage and fetch bandwidth) and
leaving much of the CPU's execution units idle.

It was great at highly parallel benchmarks, but much slower than contemporary
x86 (the P4, which wasn't that great either) for everything else.

~~~
dogma1138
Yeah, it was VLIW/EPIC, but I like the puns :P

As for the performance, I don't have enough experience to pinpoint exactly
all the problems, but I think it really ended up being a software issue.

Most complaints I've encountered were about the "emulation" part with the
dynamic translation libraries and instruction set emulation.

Overall it seems to me like they just tried to tackle too many things at the
same time: have some compatibility with x86, take on SPARC/PowerPC, and do
HPC, mainframes, and tons of other applications.

That said, this is still a mid-90s ISA for a very niche market; with all
these issues it's surprising it lasted so long. I never understood why HP
kept paying Intel to produce these chips in the first place.

------
samfisher83
I think one of the things that is incorrect in this article is its take on
the software used to design ASICs (custom chips). Synopsys, Cadence, Mentor,
and Magma (when it was around) all made pretty good tools. The thing is, they
weren't free. Also you had to use TCL, but they could take RTL (Verilog, the
design language) to GDS (what gets fabbed). Heck, there always seemed to be
some startup claiming they'd come up with a better routing tool or a better
timing tool. They weren't super complicated to use. If you had too many gates
you had to script a bunch of hacks, but it wasn't too hard.

------
ChuckMcM
Simple summary: you can use transistors in your chips to speed up software.
That comes at the cost of flexibility, of course, but I think it was one of
the secrets of the iPad's initial success.

That said, I think the ability to make inexpensive custom chips is going to
power a wave of new hardware gizmos. Unfortunately, all of that capacity is
in China, and so if you don't read Chinese datasheets you're probably not
going to be able to use those chips. (Not that autotranslation is "bad", just
that it doesn't always make technical documentation actionably
understandable.)

------
on_and_off
This might also explain why Google seems more and more interested in
designing its own chips for mobile devices, just like Apple.

That and the underwhelming offering on Android compared to what Apple has been
able to deliver.

~~~
colechristensen
Why is it underwhelming? What do people do with phones?

I exchange text, sound, and pictures with people. Sometimes I play stupid
games. The sound/pictures/text sharing long ago hit a limit where there's a
big question as to why a person needs a faster processor. Better compute per
watt helps, sure. But what else? What besides bloat requires a Moore's law
increase in a phone processor?

What do I want to do with my phone that I don't yet know I want?

~~~
a3n
> Why is it underwhelming? What do people do with phones?

What did people do with phones in the 1890s? Even your closed set is much,
much more than that.

What did people do with computers in the 50s, or PCs in the 80s? We do a lot
more than that now.

With greater capability comes more, even if we don't yet know what "more" will
be.

~~~
colechristensen
Windows 95 is 21 years old; my first computer ran it. It's been over two
decades, and I don't do anything with a shiny new MacBook Pro that I didn't
do with Windows 95. I browse the web, read news and comments, watch videos,
and exchange text with people. Sure, the video resolution is better, but that
has a lot more to do with the network than with the computer hardware. The UI
is a little shinier, but I'm not enabled to do anything new that I couldn't
do before.

If I don't buy a new computer every few years, what I have will be nearly
worthless, because software has a funny habit of finding ways to spend all
available resources even while doing most of the same things. I just don't
think I would be that disappointed if I was still using Win 95.

> With greater capability comes more, even if we don't yet know what "more"
> will be.

If you discount network speed and pixel density, "more" hasn't amounted to
much in two decades as far as I can see. Miniaturization means I can carry a
laptop around with me, OK.

Likewise with phones. Android and iOS are about 8 years old now. What can we
do that we couldn't then? _Not much_ , as far as I can see.

We're definitely past the point of diminishing returns when it comes to
resolution, and near or already at the point where your eye physically
couldn't tell if the pixel density were higher. So what's left?

------
bramen
How bad is debugging logic running purely on hardware nowadays? Is there
anything more user-friendly than tapping specific lines and watching the
output on an oscilloscope?

~~~
dnautics
Usually you start at the Verilog level with simulators. (I've done this
professionally with zero hardware experience and discovered many critical
bugs in a chip that's now taped out. Of course, the real engineers had to fix
them.) Now that it's gone physical, after a series of very low-level tests,
they will take similar software, adapt it to the hardware interface instead,
and repeat all the tests.

~~~
bramen
Ah makes sense. Thanks for the info, will look into this.

------
ismail
Interestingly, Sony went the other way with the PS4, moving to x86. Any ideas
why?

~~~
esjeon
- Heterogeneous architecture made game development a lot harder.

- Engines had to be re-implemented for the PS3, because Cell is simply too
different. (Note: the Xbox is almost a PC, so it attracts lots of devs.)

- The Cell architecture isn't good for games, honestly. The Cell CPU has a
bunch of stream processors, but only one (or a few?) general-purpose cores.

~~~
snuxoll
You're correct; honestly it seemed like Sony was expecting ALL the graphics
work to be done on the SPUs. The whole design of the Cell Broadband Engine is
basically one SMT PowerPC core (the PPU) and a bunch of stream processors
(the SPUs). Considering their choice of an extremely weak GPU compared to the
Xenos in the 360, a lot of games had to move a bunch of work off the GPU and
write it for the SPUs instead. The SPUs were nowhere near as nice to work
with: they couldn't just pull data from system memory without being passed a
pointer translated to an address they could DMA from, and you couldn't just
throw GL or RSX commands at them, so you were stuck writing all the
nitty-gritty yourself.
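
For flavor, this is roughly what that dance looked like on the SPU side (a
from-memory sketch against the Cell SDK's spu_mfcio.h, built with spu-gcc;
the function name and arguments are mine):

```c
#include <spu_mfcio.h>   /* Cell SDK MFC (DMA) intrinsics */
#include <stdint.h>

/* Pull `size` bytes of main memory into SPU local store. The PPU has to
   hand us `ea`, the 64-bit effective address -- the SPU cannot simply
   dereference an ordinary pointer into system memory. */
void fetch_chunk(uint64_t ea, void *local_buf, uint32_t size) {
    const uint32_t tag = 0;
    mfc_get(local_buf, ea, size, tag, 0, 0); /* DMA: main mem -> local store */
    mfc_write_tag_mask(1 << tag);            /* select our tag group */
    mfc_read_tag_status_all();               /* block until DMA completes */
}
```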

------
digi_owl
We really do have a hardon for that Apple SoC, don't we?

------
peter303
"Hey Clark Kent, its time for another bienniel 'End of Moores Law' piece"

------
arbuge
TL;DR: Moore's law can no longer be counted on for performance gains, so
speeding things up will now depend on replacing general-purpose hardware with
hardware specifically designed to implement specific algorithms.

Of course, work on custom chips has been going on for as long as chips have
been a thing. The article just underlines that this is now pretty much the
only way forward.

~~~
drdre2001
Aren't optical transistors another option though?

~~~
smaddox
Optical switches are extremely large compared to contemporary transistors, and
there's no obvious way to scale them smaller than the diffraction limit.
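
To put a rough number on "extremely large" (my own back-of-envelope, assuming
telecom-wavelength light confined in silicon, with index n ≈ 3.5):

```latex
d \approx \frac{\lambda}{2n}
  = \frac{1550\,\text{nm}}{2 \times 3.5}
  \approx 220\,\text{nm}
```

That's an order of magnitude above the ~14 nm features of current transistor
processes, before you even get to packing density.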

~~~
drdre2001
Hopefully this kind of problem will be solved within a decade. Do you think
that's a reasonable time scale, or is this problem even unsolvable?

~~~
eloff
Like all research into the unknown it might prove easy (5-10 years) it might
prove hard (20 years) it might always be 50 years away (hehe fusion) or it
might just not be possible (faster than light travel).

