Difficulties in Saving Moore’s Law (nextplatform.com)
55 points by jonbaer on June 27, 2016 | 36 comments

No mention of moving to 3D?

I am assuming that having RAM and CPU (and other functions such as a GPU or other hardware accelerators) on the same die on different levels (physically close, i.e. nanometers or micrometers apart) would cut latency by orders of magnitude.

3D is already done with NAND flash so I am assuming it is heat that is the problem.

Just my 10c

Average distance from points in a unit circle to center is 2/3; average distance from points in a unit sphere to center is 3/4. Or if you prefer squares and cubes, you are essentially replacing a sqrt(2) diagonal with a sqrt(3) triagonal.
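A quick numerical check of those two averages, as a minimal sketch. The function name `mean_distance_to_center` and the rejection-sampling approach are illustrative choices, not something from the comment. For a point drawn uniformly from a d-dimensional unit ball the closed form is d/(d+1), which gives 2/3 for the disk and 3/4 for the sphere.

```python
import math
import random

def mean_distance_to_center(dim, samples=100_000, seed=0):
    """Estimate the mean distance to the origin for points drawn
    uniformly from the unit ball in `dim` dimensions."""
    rng = random.Random(seed)
    total = 0.0
    accepted = 0
    while accepted < samples:
        # Draw from the enclosing cube and keep only points inside the ball.
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        r = math.sqrt(sum(x * x for x in point))
        if r <= 1.0:
            total += r
            accepted += 1
    return total / accepted

for dim in (2, 3):
    estimate = mean_distance_to_center(dim)
    exact = dim / (dim + 1)
    print(f"dim={dim}: estimate={estimate:.4f}, closed form={exact:.4f}")
```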

Even worse, surface area available for heat dissipation grows quadratically, while volume and therefore heat generated grows cubically. That's why 3D memory is a thing (most memory cells are not active at any given time), but seriously 3D processors (trigate notwithstanding) are not.

Yes, but your diameter can shrink significantly if you have more than a single plane of thickness. Currently the thickness of the die is less than 1% of the typical diameter, so that 2/3 -> 3/4 increase is negligible compared to the >100x decrease in diameter for the same number of transistors.

The biggest issue (IMO) will be heat dissipation and manufacturing time.
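The tension between a smaller footprint and heat can be put into a rough scaling sketch. It assumes, hypothetically, that a fixed transistor budget and a fixed total power are split evenly over `layers` identical layers, and that heat leaves only through the footprint face. `stack_tradeoff` and `layers_for_shrink` are made-up names for illustration.

```python
import math

def stack_tradeoff(layers):
    """Return (side_scale, flux_scale) relative to a single-layer die."""
    # Footprint area shrinks by 1/layers, so the side shrinks by 1/sqrt(layers).
    side_scale = 1.0 / math.sqrt(layers)
    # Same total power through 1/layers of the area: heat flux grows by `layers`.
    flux_scale = float(layers)
    return side_scale, flux_scale

def layers_for_shrink(factor):
    """Layers needed to shrink the footprint side length by `factor`."""
    return factor ** 2

for n in (1, 4, 16, 100):
    side, flux = stack_tradeoff(n)
    print(f"{n:>3} layers: side x{side:.2f}, heat flux x{flux:.0f}")
```

Under these assumptions, a 10x shorter side needs 100 layers and pushes 100x the heat through each unit of footprint area, which is one way to see why heat is the first objection raised.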

Well, biological brains are 3D processors, and have been a thing for billions of years.

You might be surprised how little 3D they are. Look at the top-of-the-line model: the human cortex is only about 3 mm thick, and composed of only six layers of neurons. It may look like a big lump of meat, but that's because it's folded to fit more area into the limited volume available in the skull. If you unfold it, you get this very thin sheet with a whopping area of 2500 cm^2.
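To put those two numbers on the same scale, here is a small worked calculation. Treating the unfolded sheet as a square is an assumption made purely for illustration.

```python
import math

area_cm2 = 2500.0     # unfolded cortex area quoted above
thickness_mm = 3.0    # cortex thickness quoted above

# Side of a square with the same area, converted from cm to mm.
side_mm = math.sqrt(area_cm2) * 10.0
# Thickness as a fraction of that side length.
aspect = thickness_mm / side_mm

print(f"side = {side_mm:.0f} mm, thickness/side = {aspect:.2%}")
```

That works out to a sheet roughly 50 cm on a side whose thickness is about 0.6% of its width, so it is geometrically much closer to a flat die than to a cube.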

It could be that CPUs are very complex and each layer adds a significant threat of errors. They are currently at around 13 layers.

It could be that the physical complexities of adding more layers are just too expensive.

That helps with wire delay but not with transistor delay. For workloads that are bound by cache latency (a lot of them) it should be a really big deal but not for more execution bound workloads. And in the end it might give a few more factors of two improvement but it will only get us so far.
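The split between latency-bound and execution-bound workloads can be illustrated with Amdahl's law. This is a sketch under the assumption that only the cache-latency-bound share of run time gets faster; `overall_speedup` and the example fractions are hypothetical, not measurements.

```python
def overall_speedup(latency_bound_fraction, latency_speedup):
    """Amdahl's law: only the latency-bound share of run time is accelerated."""
    remaining = 1.0 - latency_bound_fraction
    return 1.0 / (remaining + latency_bound_fraction / latency_speedup)

# Hypothetical workloads, each given a 10x cut in memory latency.
for name, fraction in (("cache-latency bound", 0.8), ("execution bound", 0.1)):
    print(f"{name}: {overall_speedup(fraction, 10.0):.2f}x overall")
```

With these made-up fractions, the latency-bound workload gains about 3.6x while the execution-bound one gains about 1.1x.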

> I am assuming that having RAM and CPU (and other functions such as GPU or other hardware accelerators)

We already have RAM (cache) and a GPU (integrated graphics) on the same die. But that doesn't work as a replacement for professional/gaming graphical workloads.

Interesting that a guy from Intel says cache coherence doesn't scale; I kinda thought it scales pretty well, especially if actual data sharing is the exceptional case that must be made to work correctly, and non-shared data is the rule and must be made fast.

Cache coherence has already scaled a lot further than I expected it to. I take him to mean that, while it has scaled remarkably far already, it won't scale forever and we are starting to push the limits on it.

> especially if actual data sharing is the exceptional case

The problem is that cache coherence does not scale, given (note emphasis) that a certain amount of data is shared among CPUs.

You cannot circumvent the problem by choosing to not share much data between cores, because then you have changed the problem!

What example workload cannot be implemented efficiently if your cores communicate using coherent caches, but can be implemented efficiently if they work in some other way?

Yeah, what do these guys from Intel know about things like caching.

Ouch! Appeal to authority stings. But in response, I pitted mighty Google against Intel and in the first ten hits for "cache coherence scalability paper" I found the one that prompted me to write my original comment:

From http://people.ee.duke.edu/~sorin/papers/tr2011-1-coherence.p...

"Using amortized analysis, we show that interconnection network traffic per miss need not grow with core count and that coherence uses at most 20% more traffic per miss than a system with caches but not coherence. Using hierarchical design, we show that storage overhead can be made to grow as the root of core count and stay small, e.g., 2% of total cache size for even 512 cores.

Consequently, we predict that on-chip coherence is here to stay. Computer systems tend not to abandon compatibility to eliminate small costs, such as the costs we find for scaling coherence."

And now the quote from TFA:

"Do we really need cache coherency across a hundred cores or a thousand cores on a die?"

The paper's 512 cores are in the middle of the range where TFA suggests the problems become serious enough to consider breaking compatibility and forgoing cache coherence. Why is a paper that supports its claims with concrete analysis less credible than an unsupported offhand remark from someone at Intel, a company that made a fortune from all of its CPU designs except that infamous one where they abandoned compatibility?

BTW a basic understanding of coherence might lead one to doubt the claims about lack of scalability. When you hit the cache, you know that you own the data so it's fast. When you miss the cache, it's a very slow operation anyway, the overhead of checking whether to fetch the data from external memory or another cache can't be that bad. And if data travels a lot from one cache to another because of true or false sharing, that's the fault of the code doing that. Coherence is not supposed to make sharing blindingly fast, just to make it correct, and to make false sharing correct as well.

Of course none of this will be relevant if there's no substrate on which to put a thousand x86 cores, which right now there isn't.

> BTW a basic understanding of coherence might lead one to doubt the claims about lack of scalability. When you hit the cache, you know that you own the data so it's fast. When you miss the cache, it's a very slow operation anyway, the overhead of checking whether to fetch the data from external memory or another cache can't be that bad. And if data travels a lot from one cache to another because of true or false sharing, that's the fault of the code doing that. Coherence is not supposed to make sharing blindingly fast, just to make it correct, and to make false sharing correct as well.

The problem isn't so much that missing the cache is slow, but rather how do you make a processor know that it has to look the data value up in another cache instead of main memory? The two main options are either to basically broadcast every request to every cache it might be in, or to store in main memory which cache it's in--you have a design tradeoff between high interconnect traffic or high memory metadata usage.
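The traffic side of that tradeoff can be sketched with a toy message count. It assumes, for illustration only, that a broadcast miss probes every other cache once, and that a directory miss costs one request, one forward per sharer, and one data reply. Real protocols add acknowledgements and other traffic, and both function names are made up.

```python
def broadcast_messages_per_miss(cores):
    """Snooping: the requester probes every other core's cache."""
    return cores - 1

def directory_messages_per_miss(sharers=1):
    """Directory: one request to the directory, one forward per sharer,
    and one data reply back to the requester."""
    return 1 + sharers + 1

for cores in (8, 64, 512):
    print(f"{cores:>3} cores: broadcast={broadcast_messages_per_miss(cores)}, "
          f"directory={directory_messages_per_miss()}")
```

In this model broadcast traffic per miss grows linearly with core count, while directory traffic depends only on how many caches actually hold the line. The price is the storage needed to track the sharers.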

You don't store it in main memory, you store it in the directory, aka snoop filter, or whatever you want to call it. The paper actually estimated the overhead and came up with the number 2% for 512 cores, not that high.

The directory is basically main memory, as far as provisioning bits of storage are concerned.

The number 2% is also achieved only if you use a three-tier organization of processors, and (IIRC) you need to limit the cache broadcasts to within each tier, which puts some sharp constraints on how you lay out cores in a chip. Intel's many-core chips, IIRC, are laid out in a single ring connecting all the cores, which is a completely different topology from the one the paper assumes.
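A rough sketch of why tiering matters for directory storage. This is not the paper's accounting. It assumes one presence bit per tracked child and a 64-byte cache line, and it ignores tags, state bits, and how many entries each level provisions. `sharer_vector_overhead` is a hypothetical helper.

```python
def sharer_vector_overhead(fanout, line_bytes=64):
    """Sharer-vector bits per tracked line, as a fraction of the line's data bits."""
    return fanout / (line_bytes * 8)

# Flat directory: one presence bit for each of 512 cores.
flat = sharer_vector_overhead(512)
# Three tiers with a fan-out of 8 each (8 * 8 * 8 = 512): 8 bits per entry.
tiered = sharer_vector_overhead(8)

print(f"flat: {flat:.1%} per entry, three-tier: {tiered:.2%} per entry")
```

Under these assumptions a flat sharer vector is as large as the cache line it tracks, while each tiered entry costs under 2% of a line. That is at least consistent with the point that the low figure depends on the hierarchy.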

If that's true why didn't they incorporate this into their products? They had 5 years. That's plenty of time in this industry.

All multi-core Intel chips have cache coherence. Intel might not have incorporated the paper's approach to scalable cache coherence because they don't have a substrate to put 512 x86 cores on.

The real question is - can we really know that there is nothing better than CMOS?

Imagine an old star that is undergoing a collapse - the matter infalls under its own gravity, accelerating faster and faster. At some point, the matter runs out of density - the neutrons slam into each other and the whole mass of the star going inwards at a good fraction of c tries to undergo an abrupt stop.

The leisurely free fall turns into the violence of inertia. Then, either that violence is enough, and the collapse continues towards a black hole, or it's not enough, and all we get is a boring neutron star.

Will the economical inertia behind Moore’s Law push it past each technology's clang of limits all the way towards some sort of singularity, or will it one day come to an abrupt halt against an immovable wall of the structure of the universe?

The latter, obviously.

Would that wall be CMOS? We can't really know until we hit it.

There are quite a few options besides CMOS, all with desirable properties, and some of them with potential feature sizes smaller than silicon's. The problem is that none of these are economical at scale. That is where the problem lies: it's an economic issue, not so much a technical issue.

Bismuth, gallium arsenide and other variations on that theme, and various superconductors used for processing elements all have the potential to surpass, or have already surpassed, silicon on various parameters. But none have done so in a way that would allow massive adoption.

So for now the 'economical inertia', as you put it so eloquently, seems to be exactly where the problem is. Inertia is only useful to get you past a hump when you're still moving forward; when you're already stuck, it is a hindrance.

Any new technology will have a really hard time competing economically with an industrial process that has been refined for five decades. The question is whether there are people willing to pay a hefty premium for the new technology to finance its development.

It's a bit like the combustion engine. A completely different tech (such as optical) may be a better bet.

It took me too long to find this information:

From the article - possible replacements for CMOS include:

Rapid single flux quantum devices
Carbon nanotube field-effect transistor
Silicon nanowire FET
Spintronics, various types
Tunnel junction devices, e.g. Tunnel field-effect transistor
Indium antimonide transistors
Photonics, optical computing
Graphene nanoribbons
Molecular electronics

> Gallium-Arsenide

I have read somewhere that this actually integrates fairly well into CMOS.

At the end of the article it states:

> “There is nothing on the horizon today to replace CMOS,” he says. “Nothing is maturing, nothing is ready today. We will continue with CMOS. We have no choice. Not just Intel, but everybody. If you want this industry to grow, then we need to be doing things with CMOS.”

So is it time to write efficient code and start working hard on easy multi-threading? Because if we hit "the wall", all consumers can do is buy the fastest CPU with the most cores they can get, and then they will never be able to upgrade again. So those resources are what we will have to work with.

I think when we actually hit The Wall, we'll keep on improving (at a slowed pace) for a while afterwards based on the not-so-low-hanging fruit we've ignored up till then.

Frankly, I'm kind of interested in the prospect of specialized hardware coming back.

The trend since the mid-1980s, when the Intel 80386 was introduced, has been for specialized chips to be replaced due to commodity chips beating them comprehensively in single-core speed; problems which once required specialized hardware and core designs were either solved or obviated by massive improvements in scalar hardware.

Now that this brute force method is starting to peter out, we might see a rise of new, specialized designs once again, to solve specific problems which gain substantial speedups from very specific kinds of parallelism. Systolic arrays are one example. We're already doing something like this by pressing GPUs into service as computational hardware, but it can go farther.

This is definitely already starting to happen.

Modern server CPUs have FPGAs in them. I was able to speed up Blockchain transaction signing and verification by two orders of magnitude by utilizing Intel's built-in FPGAs.

Mobile phone SoCs already have semi-specialized processors in them. Microsoft's Hololens has something they call an "HPU", a Holographic Processing Unit. Google just made public that they have a TensorFlow ASIC. Google's phone radar thingy has its own ASIC if I'm not mistaken.

I have been wondering for a long time if it would be possible to make a computational device out of subatomic particles. But I guess if possible that would take hundreds of years.

I wonder if hardware restraints might actually improve software. Instead of "what the hardware giveth, the software taketh away," the industry will instead begin to say, "well, this is all the performance we've got so let's make the code half the size it is right now and eliminate 50% of the bugs."

The Singularitarians aren't going to be too happy about that. They've been spending all their time getting ready for the Strong AI future built on the inevitability of Moore's law and now it may never come.

Maybe biotech will lead the next tech revolution. I can see people doing all sorts of unholy stuff with CRISPR. It's a technology, though, whose development I can't see the regulatory agencies really supporting, at least in the U.S. It's almost too powerful a technology.

It's sort of how not much has happened in the way of nuclear power design lately. So much can go wrong with that technology that the regulatory agencies never approve anything.

I wonder if we'll reach a point where all the socially acceptable technologies have been developed and technologies like advanced nuclear or radical CRISPR biotech that could take things to the next level are permanently forbidden because they are too powerful and dangerous.

Speaking as a singularitarian: The only thing that has stopped is free scaling of photolithography.

Our most advanced computing systems fill the equivalent volume of a large snowflake. Scaling down is over; the future is scaling up and sideways. (Power consumption, price...)

The only unit that ever really mattered is amortized flops per inflation adjusted dollar.

I think it's pretty obvious that raw computing power alone isn't enough to get True AI, and there's no guarantee we'll need a stupidly powerful processor for AI in the first place.

Also, we mainly ignore parallelisation because it's hard to reason about, which makes it easy pickings for AI to improve on if we do need stupidly large amounts of computing power. Although if it's inherently linear we're just SOL I guess.

> Maybe biotech will lead the next tech revolution

This. Almost certainly blending biological ideas with more traditional chip manufacturing. 3D chips, self healing, cooling integrated through the chip.

Augmenting our brains with implants. Maybe even expanding parts of our brains.
