
Gains in instructions-per-clock start to flatten out. And that's where the gains were coming from in the last years. Some time ago a paper was posted here that showed how even if you have an infinite amount of transistors, you will still be limited in the range of 3-10 instructions-per-clock for typical programs.

Clock speeds seem to have leveled off and IPC will only see another gain of 50-100%. Single-threaded performance is close to the limit. What comes after that? Is this the end?




> Gains in instructions-per-clock start to flatten out. And that's where the gains were coming from in the last years.

This is commonly claimed but it's actually false for x86_64 desktop parts. For a single-core scalar integer workload the IPC boost from i7-2700K to i7-7700K was maybe 20-25% on a great day, but the base frequency increase was a further 20%, and the max boost frequency increase ~15%. The frequency increase matters about as much as the IPC increase.
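To make the arithmetic concrete: if single-thread performance is modeled as IPC times clock frequency (a simplification that only holds while the workload is not memory-bound), the two gains multiply. A rough sketch using the figures quoted above; the function name is hypothetical:

```python
def combined_speedup(ipc_gain, freq_gain):
    """Speedup under the simple model: performance = IPC * clock frequency."""
    return ipc_gain * freq_gain

# Figures from the comment above, written as multipliers (i7-2700K -> i7-7700K).
ipc_low, ipc_high = 1.20, 1.25      # IPC boost of 20-25%
boost_freq, base_freq = 1.15, 1.20  # ~15% max boost, ~20% base frequency

low = combined_speedup(ipc_low, boost_freq)
high = combined_speedup(ipc_high, base_freq)
print(f"combined speedup: {low:.2f}x to {high:.2f}x")
```

Under that model the total comes to roughly 1.4x to 1.5x, with frequency contributing nearly as much as IPC.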


When was the last time we saw a 50-100% performance gain in cars? airplanes? spacecraft?

Was it the end of those industries?

Welcome to mature technology.


> spacecraft

Economically? I think this year or last.


Economically, SpaceX is about 3% cheaper than Arianespace.

That's not a 50% or 100% improvement.

Maybe they'll get that improvement once they run recycled rockets all the time, but not before that.


@mlvljr: Your account seems to be shadowbanned, I can’t reply to your comment.

Currently, SpaceX has prices around 56-62 million USD per launch of a normal satellite (with a weight and orbit where they can recover the first stage).

Arianespace launches such lighter satellites in pairs, always two at once, at a price of around 60 million USD per satellite.

The Chinese launchers offer the same at around 70 million USD per launch.

So, the prices aren’t that different.

But, for launches from reused rockets, SpaceX is damn cheap. The first launch on a reused rocket cost below 30 million USD.

So, to recap: Today, in the best case, SpaceX is between 4 and 13% cheaper than the next competitor. But in a few years, once they launch mostly reused rockets, they’ll be around 50 to 60% cheaper than the next competitor.


I imagine that SpaceX will keep improving its cost/kg to orbit and, with re-usables, reach a launch expense of half the current cost pretty quickly. Until someone else can compete, though, they could just increase their profit per launch enormously. Musk needs some serious capital for his Mars plans. I hope his global satellite internet provider concept works (I can't wait to have an option other than AT&T or Comcast) and brings in the big bucks. Then he won't need to make money on launches and can drop launch prices to close to cost to help all space activities. Maybe he'll even start selling re-usable rockets to other launch companies. Can't wait to see that day.

Long term, Musk is shooting for a ~100x reduction in launch costs to make a Mars colony feasible. Hope he makes it.


Isn't this an even further argument for cloud computing? If cost savings all come from having more cores at the same price, but end user devices can't put all those cores to work, having more of the compute intensive work happen on the back end amortized over many end users seems like the only way to benefit from improvements in cores per chip.


Memory and storage. Still big gains to be had there. Imagine if your whole hard drive was RAM speed.

Also more specialised cores e.g. DSP, and customisable hardware i.e. FPGA.


I distinctly remember a benchmark (which my google-fu is currently unable to find) between Intel chips with and without the Iris chip. Under similar conditions (base/turbo clock and core count), the Iris chip had about a 20% performance advantage.

It wasn't explained in the benchmark, but the only reason I could imagine was that the Iris chip worked as an L4 cache, because the benchmark was not doing graphics stuff. That is what the Iris chip does: it sits right there in the socket with a whole bunch of memory that the iGPU can use or that otherwise works as an L4 cache.

It's also a great way to do (almost) zero cost transfers from main memory to (i)GPU memory -- you'd do it at the latency of the L3/L4 boundary. With Intel, that unlocks a few GFLOPs of processing power -- in theory; your code would have to be adapted to work this in a reasonable way, of course.

To sum things up, I agree with you, memory is a path that holds big speedups for processors. Don't know if "the Iris way" is the best path, but it indeed showed promise. Shame that Intel decided to lock it up for the ultrabook processors mostly.


I think the end point will be a massive chip with fast interconnects and a (relatively) huge amount of on-die memory talking over a fast bus to something like NVMe on steroids.

My new Thinkpad has NVMe and the difference is huge compared to my very fast desktop at work, which has SATA-connected SSDs.


GPUs:

http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.h...

http://www.anandtech.com/show/7603/mac-pro-review-late-2013/...

This is behind much of the interest in machine learning these days. Deep learning provides a way to approximate any computable function as the composition of matrix operations with non-linearities. It does this at the cost of requiring many, many times the computing power. But much of this computing cost can be parallelized and accelerated effectively on the GPU, so with GPU cores still increasing exponentially, at some point it's likely to become more effective than CPUs.
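For anyone who wants to see "composition of matrix operations with non-linearities" spelled out, here is a minimal sketch in plain Python. The helper names are hypothetical, and the weights are hand-picked rather than trained; they make a two-layer network compute |x| exactly, using the identity |x| = relu(x) + relu(-x).

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Element-wise non-linearity: max(0, t)."""
    return [max(0.0, t) for t in v]

def forward(layers, x):
    """Apply each (W, b) layer in turn, with ReLU after every layer but the last."""
    for i, (W, b) in enumerate(layers):
        x = [s + bi for s, bi in zip(matvec(W, x), b)]
        if i < len(layers) - 1:
            x = relu(x)
    return x

# Hand-picked (not learned) weights: hidden layer computes [x, -x],
# ReLU zeroes the negative one, output layer adds the two together.
abs_net = [
    ([[1.0], [-1.0]], [0.0, 0.0]),  # hidden layer: 1 input -> 2 units
    ([[1.0, 1.0]], [0.0]),          # output layer: 2 units -> 1 output
]

print(forward(abs_net, [-3.0]))
print(forward(abs_net, [2.5]))
```

Training replaces the hand-picked numbers with learned ones, and the cost is dominated by the `matvec` calls, which is exactly the part GPUs parallelize well.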


"Deep learning provides a way to approximate any computable function as the composition of matrix operations with non-linearities."

Thanks, and I wish this sentence had been one of the first things I read when I was trying to figure out exactly what Deep Learning really meant. It's much more comprehensible than the semi-magical descriptions that seem far more prevalent in introductory articles.

It's also fascinating that a seemingly simple computing paradigm is so powerful, kind of like a new Turing Machine paradigm.


"Deep learning provides a way to approximate any computable function as the composition of matrix operations with non-linearities."

This actually describes neural networks in general, not so much "deep learning".

Deep learning comes from being able to scale up neural networks from only a few 10s or 100s of nodes per layer to thousands and 10s of thousands of nodes per layer (and of course the combinatorial explosion of edges in the network graph between layers), coupled with the ability to process and use massive datasets to train with, and ultimately to run data through the trained model.

This has mainly been enabled by the cheap availability of GPUs and other parallel architectures, coupled with fast memory interconnects (both to hold the model and to shuttle data in/out of it for training and later processing) and the CPU (probably disk, too).

But neural networks have almost always been represented by matrix operations (linear algebra); it's just that there wasn't the data, nor the vast (and cheap) numbers of parallelizable processing elements available to handle it. The closest architectures I can think of that could potentially have done it in the 1980s/90s would be those from Thinking Machines (the Connection Machines) and probably systolic array processors (which were pretty niche at the time, mainly from CMU):

https://en.wikipedia.org/wiki/Systolic_array

https://en.wikipedia.org/wiki/WARP_(systolic_array)

These latter machines started to prove some of what we take for granted today, in the form of the NAVLAB ALVINN self-driving vehicle:

http://repository.cmu.edu/cgi/viewcontent.cgi?article=2874&c...

Of course, today it can be done on a smartphone:

http://blog.davidsingleton.org/nnrccar/

The point, though, is that neural networks have long been known to be most effectively computed using matrix operations, it's just that the hardware wasn't there (unless you had a lot of money to spend) nor the datasets - to enable what we today call "deep learning".

That, and AI winters didn't help matters. I would imagine that if somebody in the late 1980s had asked for 100 million to build or purchase a large parallel processing system of some form for neural network research - they would've been laughed at. Of course, no one at that time really knew that what was needed was such a large architecture, nor the amount of data (plus the concept of convolutional NNs and other recent model architectures weren't yet around). Also - programming for such a system would have been extremely difficult.

So - today is the "perfect storm" of hardware, data, and software (and people who know how to use and abuse it, of course).


I don't think GPUs are a particularly good solution for these, they aren't the future and won't be around for mass-deployment that much longer.


It seems the author is down the 'deep learning' rabbit hole.

>> It does this at the cost of requiring many, many times the computing power. But much of this computing cost can be parallelized and accelerated effectively on the GPU, so with GPU cores still increasing exponentially, at some point it's likely to become more effective than CPUs.

So can any matrix operation. Sadly, there aren't that many algorithms that are efficiently represented as one.


That's quite a statement - what will replace GPUs for the ever increasing amount of ML work being done?


TPU-like chips, though they can be (partially) included on GPUs as well, as is the case with the latest NVidia/AMD GPUs.


There's nothing special about the TPU. The latest GPUs are adding identical hardware to the TPU, and the name "GPU" is a misnomer now since those cards are not even intended for graphics (no monitor out). GPUs will be around for a very long time, just not doing graphics.


Yep. Simply the core idea of attacking memory latency with massive parallelization of in-flight operations rather than large caches makes sense for a lot of different workloads, and that probably isn't going to change.
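A back-of-the-envelope way to see why so many in-flight operations are needed is Little's law (concurrency = throughput * latency). The sketch below uses a hypothetical helper and made-up round numbers, not measurements of any particular chip:

```python
def requests_in_flight(bandwidth_bytes_per_ns, latency_ns, request_bytes):
    """Little's law: concurrency = throughput * latency.

    Returns how many memory requests must be outstanding at once
    to keep the memory system saturated, rounded up.
    """
    bytes_in_flight = bandwidth_bytes_per_ns * latency_ns
    return -(-bytes_in_flight // request_bytes)  # ceiling division on integers

# Illustrative, assumed numbers: 500 bytes/ns (i.e. 500 GB/s) of bandwidth,
# 400 ns memory latency, 64-byte requests.
needed = requests_in_flight(500, 400, 64)
print(needed)
```

With those assumed figures, thousands of requests have to be outstanding at any moment, which is far more than a handful of CPU cores typically keep in flight and is what a GPU's thousands of resident threads are there to supply.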


> Some time ago a paper was posted here that showed how even if you have an infinite amount of transistors, you will still be limited in the range of 3-10 instructions-per-clock for typical programs.

Do you know which paper that was? I would have thought that with infinite transistors you could speculatively execute all possible future code paths and memory states at the same time and achieve speedup that way.


Oldie but goodie:

http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-93-6.pdf

Speculation can only take you so far. How do you speculatively execute something like:

   a = a + b[x];
?

You can't even speculatively fetch the second operand until you have real values for b and x.

Trying to model all possible values explodes so much faster than all possible control paths that it's only of very theoretical interest.
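A toy model of the data-dependence limit described above: assume a machine with unlimited execution units and perfect branch prediction, where each instruction starts the moment its inputs are ready. Everything here is an illustrative assumption (the function names, the 4-cycle load, the 1-cycle add), and the pointer-chasing variant is my own addition to show the worst case where each index comes from the previous load.

```python
def dataflow_ipc(instrs):
    """instrs: list of (latency, deps), where deps are indices of earlier
    instructions. Returns (cycles, ipc) when only data dependences limit issue."""
    finish = []
    for latency, deps in instrs:
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency)
    cycles = max(finish)
    return cycles, len(instrs) / cycles

def reduction_loop(n, load_latency=4):
    """n iterations of a = a + b[x_i], with every x_i known up front:
    the loads are independent, but the adds chain through a."""
    instrs, prev_add = [], None
    for _ in range(n):
        instrs.append((load_latency, []))            # load b[x_i]
        load = len(instrs) - 1
        instrs.append((1, [load] if prev_add is None else [load, prev_add]))
        prev_add = len(instrs) - 1                   # a = a + loaded value
    return instrs

def pointer_chase_loop(n, load_latency=4):
    """n iterations of x = b[x]; a = a + x: each load needs the previous load."""
    instrs, prev_load, prev_add = [], None, None
    for _ in range(n):
        instrs.append((load_latency, [] if prev_load is None else [prev_load]))
        prev_load = len(instrs) - 1
        instrs.append((1, [prev_load] if prev_add is None else [prev_load, prev_add]))
        prev_add = len(instrs) - 1
    return instrs

for name, build in [("reduction", reduction_loop), ("pointer chase", pointer_chase_loop)]:
    cycles, ipc = dataflow_ipc(build(1000))
    print(f"{name}: {cycles} cycles, IPC {ipc:.2f}")
```

Even with infinite hardware, the reduction stays just under 2 instructions per clock and the pointer chase sits at about 0.5, because the dependence chain sets the cycle count rather than the number of execution units.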


It's not the end, if we as software developers can stop counting on the hardware folks to improve performance and do the hard work necessary to parallelize our apps. (This includes migrating components to use SIMD and/or GPUs as appropriate.)



