My prognosis as a bit of an insider popping in and out of Shenzhen.

All guns are pointed at _memory_

Memory is an uncompetitive industry, a cash cow unseen in history, comparable only to oil. The SEL empire is built not on top of Galaxy Notes, but on a pile of memory chips.

The easiest way to get an order of magnitude improvement right away is to put more memory on die and closer to execution units and eliminate the I/O bottleneck, but no mem co. will sell you the memory secret sauce.
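
A quick back-of-the-envelope of what "closer to the execution units" buys you for a chain of dependent accesses (the latency figures below are generic ballpark numbers I'm assuming, not measurements of any specific part):

    # Ballpark illustration of the memory wall; the latencies are assumed,
    # generic figures, not measurements of any particular chip.
    SRAM_LATENCY_NS = 2.0    # on-die SRAM, a few cycles away
    DRAM_LATENCY_NS = 80.0   # off-chip DRAM, reached over an I/O interface

    accesses = 1_000_000     # a chain of dependent (synchronous) accesses

    # When each access depends on the previous one, the latency can't be hidden:
    t_sram = accesses * SRAM_LATENCY_NS * 1e-9
    t_dram = accesses * DRAM_LATENCY_NS * 1e-9
    print(f"on-die SRAM  : {t_sram * 1e3:.2f} ms")
    print(f"off-chip DRAM: {t_dram * 1e3:.2f} ms ({t_dram / t_sram:.0f}x slower)")

With those assumed numbers the dependent chain runs ~40x slower out of DRAM, which is where the "order of magnitude right away" comes from.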

Not only is memory made on proprietary equipment, but decades of research were done entirely behind the closed doors of the Hynix/SEL/Micron triopoly hydra, unlike in the wider semi community, where even Intel's process leaks out a bit in their research papers.

SEL makes a lot of money not only selling you the well-known rectangular pieces, but also by effectively forcing all top-tier players into buying their fancy interface IP if they want to jump on the bandwagon of the next DDR generation earlier than others: https://www.design-reuse.com/samsung/ddr-phy-c-342/ . This makes them want to keep the memory chip a separate piece even more.

Many companies have tried to break the cabal, or work around them, but with no results. Even Apple's only way around this was just to put a whopping 13 megs of SRAM on die.

Swapping the classical von Neumann style CPU for a GPU or the trendy neural-net streaming processor changes little when it comes to _hardware getting progressively worse_ at running synchronous algorithms because of memory starvation.

You see, the first-gen Google TPU is rumored to have the most severe memory starvation problems, as do embedded GPUs without the steroid-pumped memory buses of gaming-grade hardware.

When the PS3 came out, its outstanding results on typical PC benchmark tasks were wrongly attributed to its 8 DSP cores, even though those were not used in any way. It was all due to it reverting back to skinnier memory that is friendlier to synchronous operation. The amazing SPU performance was all thanks to that too: DSP-style loads benefited enormously from nearly synchronous memory behaviour.




David Patterson already figured out that the separation of memory and processor was the real bottleneck back in 1996-97. His IRAM designs for general-purpose computer systems that integrate a processor and DRAM onto a single chip were done in the Berkeley Intelligent RAM (IRAM) Project: http://iram.cs.berkeley.edu/

Probably didn't work out for the reasons you mention.


Can you please tell me what is stopping big companies like Intel and AMD from implementing a processor the way the IRAM project does? I am really curious to know more about this.


Completely different process technology. When Intel or AMD add memory onto their chips it's static RAM (SRAM). SRAM cells are large, expensive, and take up a lot of space. DRAM cells are small and dense (just one transistor and one capacitor).
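
To put rough numbers on that density gap (the cell sizes below are commonly quoted ballpark figures in F², i.e. multiples of the squared minimum feature size, not any vendor's actual numbers):

    # Rough density comparison using ballpark textbook cell sizes (in F^2,
    # the squared minimum feature size) -- illustrative only.
    SRAM_6T_CELL_F2   = 120   # six transistors per bit
    DRAM_1T1C_CELL_F2 = 6     # one transistor plus one capacitor per bit

    ratio = SRAM_6T_CELL_F2 / DRAM_1T1C_CELL_F2
    print(f"A DRAM cell is roughly {ratio:.0f}x smaller than a 6T SRAM cell,")
    print("so the same silicon area holds on the order of 20x more bits of DRAM.")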

Next question: What prevents Intel or AMD from combining a logic process with a DRAM process, or Samsung and others from combining logic into their DRAM chips?

Integrating CMOS logic and DRAM might be impossible to do in a way where the price/speed beats using separate chips. Combining two processes increases the price. CPU/GPU makers don't have the latest DRAM knowledge, and the reverse is also true, so a partnership is required.

Then there are technological problems: there are yield differences between the processes, and CPUs/GPUs operate at high temperatures, so the DRAM technology needs to adjust, or there needs to be a halfway solution.

It's possible that at some point in the future a new technology called STT-MRAM (STT = Spin-Transfer Torque) could replace low-density DRAM and SRAM, and it could be integrated with logic because it can use existing CMOS manufacturing techniques and processes. It will take time.


Your argument mixes up memory as in DRAM, where the market structure is as described, and memory as in CPU cache, which is something entirely different.

If Intel wants to add more cache, they simply paste the template again. What's limiting on-die caches is the competition for space: chip yields sink roughly in proportion to die size.


What's the reason they can't make bigger chips? Is it just more costly?


Yields of functioning chips drop off quickly as you increase the size, because each single transistor has an independent risk of a defect. Individual error rates are extremely small, but the number of components that can fail is enormous.
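
A toy yield model makes the drop-off concrete (the defect density here is an arbitrary illustrative value, not real fab data):

    import math

    # Toy Poisson yield model: yield = exp(-defect_density * die_area).
    defects_per_cm2 = 0.2   # arbitrary illustrative value

    for die_area_cm2 in (0.5, 1.0, 2.0, 4.0):
        y = math.exp(-defects_per_cm2 * die_area_cm2)
        print(f"{die_area_cm2:3.1f} cm^2 die -> {y * 100:5.1f}% of dies defect-free")

For small defect counts this is roughly linear in area, which is the "yields sink roughly in proportion to die size" rule of thumb above.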


You don't need every transistor to work though. If you can detect when a module has a broken transistor, you can simply disable that module and sell the rest of the chip. Divide the cache into multiple small modules, and it is not a big deal if you have to deactivate one because of a broken transistor. You would probably be deactivating it anyway for market segmentation.
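
Continuing the toy model above, a rough sketch of how much tolerating one dead module helps (the module count and per-module defect probability are made-up illustrative values):

    # Toy redundancy model: split the cache into n modules and tolerate
    # one dead module.  Numbers are made-up illustrative values.
    n_modules    = 16
    p_module_bad = 0.03   # chance any given module has a killer defect

    p_all_good = (1 - p_module_bad) ** n_modules
    p_one_bad  = n_modules * p_module_bad * (1 - p_module_bad) ** (n_modules - 1)
    p_sellable = p_all_good + p_one_bad   # full part, or a binned-down part

    print(f"all modules good      : {p_all_good * 100:5.1f}%")
    print(f"exactly one module bad: {p_one_bad * 100:5.1f}%")
    print(f"sellable (<= 1 bad)   : {p_sellable * 100:5.1f}%")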


Yeah! Here is one example of a company working on that:

> The REX Neo architecture gains its performance and efficiency improvements with a reexamining of the on chip memory system, but retains general programmability with breakthrough software tools.

https://insidehpc.com/2017/02/rex-neo-energy-efficient-new-p... (check out the linked video)


We at Vathys are working on this as well, for deep learning; check out our Stanford EE380 talk for more: http://web.stanford.edu/class/ee380/Abstracts/171206.html


Can you provide a source on the memory starvation for the TPUs? Also, are we talking about generation 1, 2, or both?

It did not seem to be an issue in the paper on the first generation.

https://cloud.google.com/blog/big-data/2017/05/an-in-depth-l... An in-depth look at Google's first Tensor Processing Unit (TPU ...


In the paper on the first-generation TPU, in section 7, they estimate that there would have been impressive gains in speed, both absolute and per watt, if they'd had enough design time to give it more memory bandwidth:

https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...
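
Section 7 of that paper is essentially a roofline argument; a stripped-down sketch of the same reasoning, with stand-in numbers rather than the paper's actual figures, looks like this:

    # Roofline-style sketch: attainable throughput is capped by either peak
    # compute or memory bandwidth.  All numbers are stand-ins, not figures
    # from the TPU paper.
    peak_ops_per_s = 90e12   # hypothetical peak ops/s of the accelerator
    mem_bw_bytes_s = 30e9    # hypothetical memory bandwidth in bytes/s

    for ops_per_byte in (10, 100, 1000, 10000):   # arithmetic intensity of a layer
        attainable = min(peak_ops_per_s, ops_per_byte * mem_bw_bytes_s)
        util = attainable / peak_ops_per_s
        print(f"{ops_per_byte:6d} ops/byte -> {attainable / 1e12:5.1f} Tops/s "
              f"({util * 100:5.1f}% of peak)")

With low arithmetic intensity the memory system dictates throughput almost entirely, which is the starvation being discussed.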


This is what was intensively discussed by the first trial users, and later admitted by the Google TPU team themselves at Hot Chips 29.

Even with the monstrous HBM2 memory, they still have it.

It is probably hard to predict which matrix set to prefetch when you are dealing with a neural net, so you get cache misses there too.


What do you mean by the secret sauce for memory? And who is preventing whom from using it? Certainly Intel, AMD, and Nvidia should have access to the technology. Also, I'm not an expert, but there are complexities in mixing DRAM and CPU designs on the same die.


DRAM processes use slower, higher-capacitance, lower-leakage RCAT transistors. Not good for logic, but who cares when logic is only ~20% of your die?

The other mammoth problem, however, is scaling the deep-trench capacitors.


DRAM processes do not lend themselves well to high speed. But FPGAs show that low speed with wide busses can still accomplish a lot.

However, if you have a great idea for logic and DRAM on the same die, you are unlikely to be able to build it economically, since you can't get the DRAM IP.
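
On the wide-bus point above: bandwidth is just width times transfer rate, so a slow but very wide on-die bus can match or beat a fast, narrow external one (the configurations below are arbitrary examples, not specs of real parts):

    # Bandwidth = bus width * transfer rate.  The configurations are
    # arbitrary examples, not specs of real parts.
    def bandwidth_gb_s(bus_bits, mt_per_s):
        return bus_bits / 8 * mt_per_s * 1e6 / 1e9

    print(f"  64-bit bus @ 3200 MT/s: {bandwidth_gb_s(64, 3200):6.1f} GB/s")
    print(f"1024-bit bus @  200 MT/s: {bandwidth_gb_s(1024, 200):6.1f} GB/s")
    print(f"4096-bit bus @  500 MT/s: {bandwidth_gb_s(4096, 500):6.1f} GB/s")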


As far as bringing CPUs and memory together goes, here was one of my favorite attempts that used DRAM processes:

https://news.ycombinator.com/item?id=16350560


What is SEL?


With the mention of the Galaxy Note I'm assuming Samsung Electronics.


Ah, EL from ELectronics. Search was turning up nothing.


Samsung



