
Cerebras’s giant chip will smash deep learning’s speed barrier - pheme1
https://spectrum.ieee.org/semiconductors/processors/cerebrass-giant-chip-will-smash-deep-learnings-speed-barrier
======
geomark
The article talks about a few things that they call inventions, like making
interconnections across what would normally be scribe lines. But I personally
worked on wafer scale integration about 25 years ago and we were already doing
that. We called it inter-reticle stitching. The technology was ancient back
then - 0.5 micron feature size on 4 inch wafers - but the wafer scale
techniques are applicable to modern technologies. In particular, developing a
yield model that informs your on-chip redundancy choices and designing built-
in self test and selection circuitry so that you can yield large chips. The
chip we developed was so large that only two would fit on a wafer. We got 50%
yield on a line that was far from mature at the time. The company lacked the
vision to do anything with what they had developed. To them it was just a chip
for which there were few customers. The suits didn't know how to make bank
with this methodology that could yield nearly arbitrarily complex chips in
nearly any target process.
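
To make the yield-model point concrete, here's a minimal sketch (Python, with invented areas and defect densities, not the actual model we used) of the calculation that drives redundancy choices: a Poisson yield model per block, plus a "k-of-n blocks must work" redundancy term.

```python
from math import comb, exp

def block_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Poisson model: probability a block of a given area has zero defects."""
    return exp(-area_cm2 * defects_per_cm2)

def redundant_yield(n_blocks: int, n_required: int, y_block: float) -> float:
    """Probability that at least n_required of n_blocks are defect-free."""
    return sum(comb(n_blocks, k) * y_block**k * (1 - y_block)**(n_blocks - k)
               for k in range(n_required, n_blocks + 1))

# Invented numbers: a 40 cm^2 wafer-scale device split into 400 blocks,
# built on an immature line running 0.1 defects/cm^2.
y_block = block_yield(40.0 / 400, 0.1)
print(f"monolithic (no spares): {block_yield(40.0, 0.1):.2%}")   # ~1.8%
print(f"with 8 spare blocks:    {redundant_yield(400, 392, y_block):.2%}")
```

Even a handful of spare blocks moves the yield from "hopeless" to "routine", which is why the yield model and the redundancy architecture have to be designed together.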

Edit: There were a number of papers and conference proceedings published back
then but not much shows up when searching Google. Here's one discussing the
issues and results of field stitching: [https://fdocuments.in/document/ieee-comput-soc-press-1992-in...](https://fdocuments.in/document/ieee-comput-soc-press-1992-international-conference-on-wafer-scale-integration-589dfe172703a.html)

From 1992, so yeah, field stitching is not a recent invention.

~~~
DaniFong
Great post, but I would like to add that the critical question for whether an
_invention_ becomes a useful _innovation_ is usually not "is this novel?" but
rather "is there a currently viable project here, with people who care about
the thing, genuine motivation, persistence, and adequate resources?"

In other words, the question is less whether the idea is new to the universe
than whether the effort behind it is.

I would say it's certainly at a different scale and a different time. And we
should be super thankful that the commercial interest is such that we can try
out new chip designs in a different domain now; you can really imagine
rethinking the kinds of things that are possible once you're truly at scale
here.

------
mark_l_watson
I don’t know if this mega-chip will be successful, but I like the idea. Before
I retired I managed a deep learning team that had a very cool internal product
for running distributed TensorFlow. Now in retirement I get by with a single
1070 GPU for experiments - not bad, but something much cheaper, with much more
memory, and much faster would help so much.

I tend to be optimistic, so take my prediction with a grain of salt: I bet
that within 7 or 8 years there will be an inexpensive device that will blow
away what we have now. There are so many applications for much larger
end-to-end models that will put pressure on the market for something much
better than what we have now. BTW, the ability to efficiently run models on my
new iPhone 11 Pro is impressive, and I have to wonder if the market for super
fast hardware for training models might match the smartphone market. For this
to happen, we need a "deep learning rules the world" shift. BTW, off topic,
but I don't think deep learning gets us to AGI.

~~~
corporate_shi11
It's also my impression - from my modest exposure to DL over the past two
years as a student taking courses - that deep learning must be overcome to
reach AGI.

Specifically, gradient descent is a post hoc approach to network tuning, while
human neural connections are reinforced at the moment they fire together.
The post hoc approach restricts the scope of the latent representations a
network learns because such representations must serve a specific purpose
(descending the gradient), while the human mind works by generating
representations spontaneously at multiple levels of abstraction without any
specific or immediate purpose in mind.

I believe the brain's ability to spontaneously generate latent representations
capable of interacting with one another in a shared latent space is
functionally enabled by the paradigm of neurons 'firing and wiring' together.
I also believe it is the brain's ability to spontaneously generate
hierarchically abstract representations in a shared space that is the key to
AGI. We must therefore move away from gradient descent.
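
To make the contrast concrete, here's a toy sketch (shapes and rates invented; neither rule is how any production system actually trains): the gradient step requires an external objective, while the Hebbian step depends only on local co-activity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)             # pre-synaptic activity
W = 0.1 * rng.standard_normal((4, 8))  # synaptic weights
y = W @ x                              # post-synaptic activity
lr = 0.01

# Gradient descent: the update is post hoc, driven by an external objective.
target = rng.standard_normal(4)        # some supervised target signal
grad = np.outer(y - target, x)         # dL/dW for L = 0.5 * ||W @ x - target||^2
W_sgd = W - lr * grad

# Hebbian rule: the update depends only on local co-activity -- "neurons that
# fire together wire together". No target, no gradient, no global objective.
# (The raw rule is unstable; variants like Oja's rule add normalization.)
W_hebb = W + lr * np.outer(y, x)
```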

~~~
mantap
Don't forget the human brain takes about 7 to 8 hours off every day to
rejiggle itself, to use a scientific term. The brain's architecture is better
than having a separate training stage, but it's by no means able to learn
continually without stops and starts.

~~~
rckoepke
You see this in young puppies (3-6 months old) a lot as well. They get
irritable/exhausted after 15-30 minutes of training, and usually don't seem to
learn anything at all during the training activity itself. Then they pass out
("nap") for 30 minutes, and when they wake up they do the trick/skill
perfectly.

Same thing as humans, just more obvious/visible.

------
varelse
I am far more excited by the underlying Wafer Scale Integration moonshot than
I am by any AI benchmarks here. I know it's trendy to think there can only be
one w/r to the AI Iron Throne but nope, not the case, everyone is writing
bespoke code in production where the money is made. Well, almost everyone,
Amazon seems to be the odd duck but they're a bunch of cheapskate thought
leaders anyway (except for their offers to junior engineers in their desperate
hail mary attempt to catch up with FAIR and DeepMind, but... I... digress...).

Which is to say that graphs written to run specifically on Cerebras's giant
chip will smash deep learning's speed barrier for graphs written to run best
on Cerebras's giant chip. And that's great, but it won't be every graph, there
is no free lunch. Hear me now, believe me later(tm).

But if we can cut the cost of interconnect by putting a figurative
datacenter's worth of processors on a chip, that's genuinely interesting, and
it has applications far beyond the multiplies and adds of AI. But be very wary
of anyone wielding the term "sparse", for it is massively overloaded and every
single one of its definitions is a beautiful and unique snowflake w/r to
efficient execution on bespoke HW.
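
For a taste of that overloading, here's a toy sketch (sizes invented) of just two of the many things "sparse" can mean, each with very different implications for hardware:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, density, bs = 1024, 0.05, 32

# Meaning 1: unstructured sparsity -- nonzeros scattered anywhere. Compact to
# store, but with irregular memory access patterns that wide SIMD units hate.
unstructured = sp.random(n, n, density=density, random_state=0, format="csr")

# Meaning 2: block sparsity -- similar density, but nonzeros confined to dense
# 32x32 tiles. Each tile is a small dense matmul, which maps far more
# naturally onto vector/tensor hardware.
nb = n // bs
mask = rng.random((nb, nb)) < density
mask[np.arange(nb), np.arange(nb)] = True  # keep every block row/col occupied
blocks = [[rng.standard_normal((bs, bs)) if mask[i, j] else None
           for j in range(nb)] for i in range(nb)]
block_sparse = sp.bmat(blocks, format="bsr")

print(unstructured.nnz, block_sparse.nnz)  # both "sparse" by nonzero count,
                                           # very different compute patterns
```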

~~~
01100011
I just wonder about the reliability of a system that large. Sure, it's mostly
used for machine learning where we don't seem to care as much, but what is the
average MTBF of a chip this large? How many chips actually make it out of
production?

Also, is this something that will likely scale up, or will this style of
design hit a wall (power dissipation?) faster than, say, silicon-interconnect
fabric?

Time will tell if this is the new path forward or just a curious footnote in
the history of semiconductors.

~~~
why_only_15
They built the chip specifically so that it can tolerate failures in some of
the cores. I wonder whether it can do that adaptation only once, or whether it
can detect failures automatically and route around them.

------
Zenst
A chip that size, imagine the yield. Equally, cooling - it has to be water
based, as a heatsink that size would be on par with a small anvil and the
weight would be a serious issue. Though it's hard to say, as there are no
pictures of it in play, alas; all they say is that "20 kilowatts being
consumed by each blew out into the Silicon Valley streets through a hole cut
into the wall", which does somewhat beg for a picture and just raises more
questions.

Why would they make a chip this big when AMD is showing that a chiplet design
approach is cheaper and more scalable on so many levels? Let alone the yields.

Equally, Arm's approach of utilising the back of the chip for power delivery:
[https://spectrum.ieee.org/nanoclast/semiconductors/design/ar...](https://spectrum.ieee.org/nanoclast/semiconductors/design/arm-shows-backside-power-delivery-as-path-to-further-moores-law)

A wafer-scale chip like this, using that approach, would save so much power.
But again, yields will be a factor, and I imagine this is not a cutting-edge
process node: as nodes mature, yields improve, so an older node would have a
better yield and be more suitable for such wafer-scale chips. The article
doesn't mention what is used, though. I have read in the past that it would
use Intel's 10nm, but this article mentions TSMC. Another article I read says
they used a 16nm node
( [https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer...](https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer-scale-engine-half-square-foot-silicon-chip/) ),
which, given the point above about node maturity, is understandable.

~~~
gimmeThaBeet
I'm really curious about the benefits of their implementation. It's far beyond
my grasp to make any serious criticisms, and I don't really want to doubt
them; it just seems a pretty radical departure even from the current direction
of innovation.

The way they paint it, it sounds like they're putting in redundant cores to
account for failures in what I would call the 'first line' cores, i.e. there
are cores that are only used if some primary ones aren't working?

But intuitively that doesn't make a whole lot of sense given the parallel
nature. Maybe they are just putting in 101% of the specified cores, and if
there's a ~1%, hopefully uniform-ish, core failure rate then it's all gucci?

I guess my question is probably similar to yours, what are you giving up with
yield-enhancing redundancy of a behemoth die vs integrating a bunch of
confirmed working chiplets together?

~~~
phonon
The CEO says 1-1.5%.

"Cerebras approached the problem using redundancy by adding extra cores
throughout the chip that would be used as backup in the event that an error
appeared in that core’s neighborhood on the wafer. “You have to hold only 1%,
1.5% of these guys aside,” Feldman explained to me. Leaving extra cores allows
the chip to essentially self-heal, routing around the lithography error and
making a whole-wafer silicon chip viable."

[https://techcrunch.com/2019/08/19/the-five-technical-challen...](https://techcrunch.com/2019/08/19/the-five-technical-challenges-cerebras-overcame-in-building-the-first-trillion-transistor-chip/)
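
As a back-of-the-envelope sanity check on that figure (the per-core defect probabilities below are invented), a Poisson model shows why a ~1% reserve goes a long way: the spares only need to cover the expected defect count plus some slack.

```python
from math import exp, lgamma, log

def prob_spares_suffice(n_cores: int, p_defect: float, n_spares: int) -> float:
    """P(defective cores <= n_spares) under a Poisson(n * p) approximation,
    summed in log space so large means don't underflow."""
    lam = n_cores * p_defect
    return sum(exp(k * log(lam) - lam - lgamma(k + 1))
               for k in range(n_spares + 1))

# 400k cores with 1% (4,000) held aside as spares, at various per-core
# defect probabilities. The reserve holds until defects approach ~1% of cores.
for p in (1e-5, 1e-4, 1e-3, 1e-2):
    print(f"p_defect={p:.0e}: P(spares suffice) = "
          f"{prob_spares_suffice(400_000, p, 4_000):.3f}")
```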

------
bcatanzaro
Reminds me of that other great prediction of a GPU killer from IEEE Spectrum
back in 2009:

[https://spectrum.ieee.org/computing/software/winner-multicor...](https://spectrum.ieee.org/computing/software/winner-multicore-made-simple)

~~~
Google234
What went wrong with Intel's MIC (Xeon Phi) project? I can't find a
comprehensive account of it from HPC people. The idea seemed pretty sound: a
large die, simpler circuits, and much more parallelism, all in the x86 line.

~~~
liuliu
I vaguely remember that at the dawn of deep learning (2013 to 2014), there
were talks and hopes that Xeon Phi would smash the performance of Nvidia GPUs.
However, the samples people got arrived too late (I believe at the end of
2014) and the performance figures were disappointing. It might just be that
the software was not there yet, unfortunately. But then the wheels moved
forward and everyone started buying Nvidia chips for their datacenters.

------
wbhart
From the perspective of an outsider, I can't see how a company like this could
survive. They claim on the one hand to have done something really amazing and
are at the stage where they are looking for customers. Normally, you'd expect
them to be touting performance figures to win those customers. Instead,
they've decided to keep the performance secret. And they've managed to find
some "expert" who says this is normal.

Does anyone here have expertise in this area? Is this the model for a
successful company in this area?

~~~
recursivecaveat
As someone who works for another startup in this area, building the chip is
only half the battle. The other half is tooling for compiling benchmark
networks onto the chip in a performant manner. With 400k cores and their
'duplicate and re-route' defect strategy, this might literally be the most
challenging compilation target ever made. It probably stacks up absolutely
terribly in every metric right now. That's not to say it can't get better, but
most of the people I've talked to don't think the megachip will ultimately
amount to much more than a clever marketing ploy.

~~~
Veedrac
A bit baffled by this, because on every axis I look at, this seems like a
dream of a compilation target.

* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.

* Model parallel alone is full performance, no need for data parallel if you size to fit.

* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.

* Fully asynchronous/dataflow by default, only need minimal synchronization between forward/backward passes.

I genuinely don't know how you'd build a simpler system than this.
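
As a sketch of the "size to fit" point (core counts and layer sizes invented, and nothing to do with Cerebras's actual toolchain): carve the core grid into per-layer regions proportional to parameter counts, so the whole model sits in on-chip SRAM and activations simply flow through.

```python
# Statically partition a grid of cores among layers by parameter count.
TOTAL_CORES = 400_000

layers = {              # hypothetical network: layer name -> parameter count
    "embed":   8_000_000,
    "block_1": 16_000_000,
    "block_2": 16_000_000,
    "head":    4_000_000,
}

total_params = sum(layers.values())
placement, cursor = {}, 0
for name, params in layers.items():
    n = max(1, round(TOTAL_CORES * params / total_params))
    placement[name] = range(cursor, cursor + n)  # a contiguous strip of cores
    cursor += n

for name, cores in placement.items():
    print(f"{name:8s} cores {cores.start:>6}-{cores.stop - 1:>6} "
          f"(~{layers[name] / len(cores):.0f} params/core)")
```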

~~~
jcranmer
Having worked on compilers for pretty weird architectures, I can say it's
generally the case that the less like a regular CPU your architecture is, the
more difficult it is to compile for.

In particular, when you change the system from having to worry about how to
optimally schedule a single state machine to having to place operations on a
fixed routing grid (à la FPGA), the problem becomes radically different, and
any looping control flow becomes an absolute nail-biter of an issue.
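
A toy illustration of the difference (graph, grid, and greedy heuristic all invented): instead of scheduling instructions in time, the compiler must place operations in space and pay latency for every hop between producer and consumer. A loop in the graph means data flowing backward across the grid, which is where the nail-biting starts.

```python
from itertools import product

# A tiny dataflow graph, listed in topological order (producers first).
ops = ["load_a", "load_b", "mul", "add", "relu", "store"]
edges = [("load_a", "mul"), ("load_b", "mul"),
         ("mul", "add"), ("add", "relu"), ("relu", "store")]

GRID = 3  # a 3x3 grid of cores
free = set(product(range(GRID), range(GRID)))
place = {}

def cost(op, cell):
    """Total Manhattan distance from op's already-placed producers."""
    return sum(abs(cell[0] - place[u][0]) + abs(cell[1] - place[u][1])
               for u, v in edges if v == op and u in place)

for op in ops:  # greedy placement: not optimal, just illustrative
    best = min(free, key=lambda cell: cost(op, cell))
    place[op] = best
    free.remove(best)

total = sum(abs(place[u][0] - place[v][0]) + abs(place[u][1] - place[v][1])
            for u, v in edges)
print(place, "total wire length:", total)
```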

~~~
Veedrac
Remember that you aren't compiling arbitrary programs. Neural nets don't
really have any local looping control flow, in the sense that data goes in one
end and comes out the other. You'll have large-scale loops over the whole
network, and each core might have a loop over small, local arrays of data, but
you shouldn't have any sort of internal looping involving different parts of
the model.

~~~
tachyonbeam
It's pretty common to have neural networks with both recurrent nets processing
text input and convolutional layers. A classic case is visual question
answering (is there a duck in this picture?) - a simple example of looping
over one part of the model. Ideally you want that looping to be done as
locally as possible, to avoid wasting time having a program on a CPU
dispatching, waiting for results, and controlling data flow.

Having talked to someone at Cerebras, I also know that they don't just want to
do inference with this, they want to accelerate training as well. That can
involve much more complex control flow than you might think. Start reading
about automatic differentiation and you will soon realize that it's complex
enough to basically be its own subfield of compiler design. There have been
multiple entire books written on the topic, and I can guarantee you there can
be control-flow-driven optimizations in there (e.g.: if x == 0, then don't
compute this large subgraph).
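
To gesture at what that looks like, here's a minimal reverse-mode tape sketch (scalars only, invented for illustration). The data-dependent branch means the tape, and therefore the backward pass, depends on runtime values - exactly the kind of control flow a static graph compiler has to contend with.

```python
tape = []  # records (output, [(input, local_gradient), ...]) in execution order

class Var:
    def __init__(self, value):
        self.value, self.grad = value, 0.0

    def __mul__(self, other):
        out = Var(self.value * other.value)
        tape.append((out, [(self, other.value), (other, self.value)]))
        return out

    def __add__(self, other):
        out = Var(self.value + other.value)
        tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

def backward(out):
    """Walk the tape in reverse execution order, accumulating gradients."""
    out.grad = 1.0
    for node, parents in reversed(tape):
        for parent, local_grad in parents:
            parent.grad += node.grad * local_grad

def model(x):
    if x.value == 0.0:   # data-dependent branch: the "large subgraph" below
        return Var(0.0)  # is never recorded, hence never differentiated
    return x * x + x     # stand-in for an expensive subgraph

x = Var(3.0)
y = model(x)
backward(y)
print(y.value, x.grad)  # 12.0 and d(x*x + x)/dx = 2x + 1 = 7 at x = 3
```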

~~~
Veedrac
I would be surprised if Cerebras was trying to handle any recurrence inside
the overall forward/backward passes. It seems like a lot of difficulty (as
mentioned) for peanuts.

I don't get your point about training. Yes, it's backwards rather than
forwards, and yes it often has fancy stuff intermixed (dropout, Adam, ...),
but these are CPUs, they can do that as long as it fits the memory model.

------
michelpp
The members of the GraphBLAS forum have discussed this chip a couple of times.
There's a lot of research on making deep neural networks more sparse, not just
by pruning a dense matrix, but by starting with a sparse matrix structure de
novo. Lincoln Laboratory's Dr. Jeremy Kepner has a good paper on Radix-Net
mixed-radix topologies that achieve good learning ability with far fewer
neurons and lower memory requirements. Cited in the paper is a network
constructed
with these techniques that simulated the size and sparsity of the human brain:

[https://arxiv.org/pdf/1905.00416.pdf](https://arxiv.org/pdf/1905.00416.pdf)

It would be cool to see the GraphBLAS API ported to this chip, which from what
I can tell comes with sparse matrix processing units. As networks become
bigger, deeper, but sparser, a chip like this will have some demonstrable
advantages over dense numeric processors like GPUs.
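
As a flavor of "sparse de novo" (random fixed fan-in with invented sizes; not Kepner's Radix-Net construction), here's a layer whose weight matrix never exists in dense form:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

def sparse_layer(n_in: int, n_out: int, fan_in: int) -> sp.csr_matrix:
    """A layer that is sparse from birth: each output unit connects to a
    random fixed-size subset of inputs (not pruned from a dense matrix)."""
    rows = np.repeat(np.arange(n_out), fan_in)
    cols = np.concatenate([rng.choice(n_in, size=fan_in, replace=False)
                           for _ in range(n_out)])
    vals = rng.standard_normal(n_out * fan_in) / np.sqrt(fan_in)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n_out, n_in))

W1 = sparse_layer(4096, 1024, fan_in=32)   # ~0.8% density instead of 100%
W2 = sparse_layer(1024, 256, fan_in=32)

x = rng.standard_normal(4096)
h = np.maximum(W1 @ x, 0.0)                # sparse matvec + ReLU
y = W2 @ h
print(f"params: {W1.nnz + W2.nnz:,} vs dense {4096*1024 + 1024*256:,}")
```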

------
giacaglia
I've written about the challenges that Cerebras went through and what's next:
[https://towardsdatascience.com/why-cerebras-announcement-is-...](https://towardsdatascience.com/why-cerebras-announcement-is-a-big-deal-6c8633ffc49c)

------
rsp1984
This fits perfectly into the narrative of yesterday's discussion on HN [1].

Deep Neural Nets are something of a brute-force approach to machine learning.
Training efficiency is horrible compared with other ML approaches, but hey, as
long as we can trade +5% of classification performance for +500% of NN
complexity and throw more money at the problem, who cares?

I see a dystopian future where much better and much more efficient approaches
to ML exist, but nobody's paying attention because we have Deep Neural Nets in
hardware and decades of infrastructure supporting it.

[1]
[https://news.ycombinator.com/item?id=21929709](https://news.ycombinator.com/item?id=21929709)

~~~
justicezyx
Well, if a better algorithm can't beat DNNs in a realistic product setting,
then how can you say it's better after all?

And if the algorithm is indeed better, how could DNNs dominate and turn things
into a dystopia...

~~~
someguyorother
What economists call path dependence.

The alternative algorithm would be better than DNNs if the same amount of
effort were put into creating special-purpose hardware, libraries, and so on;
but in the dystopia, it's not fully-refined DNNs vs a fully-refined
alternative algorithm - it's fully-refined DNNs vs an alternative algorithm
running on hardware and software optimized for DNNs.

The alternative algorithm always looks unappealing because the playing field
historically favors DNNs, and so it never takes off in the dystopia.

~~~
justicezyx
You are repeating the OP's own reasoning fallacy...

DNNs emerged from being an underdog. Their superiority was proven by the
technology and by the market.

What you said is of course not wrong, but it can never be proven right: the
moment you switch the roles, the same argument favors the other side.
------
m0zg
They did build some valuable tech, no question there, but be sure to account
for the typical startup hyperbole. By the time you can get your hands on this
(if that ever happens), the hyperbole will converge a bit closer to reality,
the tradeoffs will become apparent, etc, and you'll discover that it is not,
in fact, going to "smash" barriers of any kind in any practical sense.

From TFA: "Cerebras hasn’t released MLPerf results or any other independently
verifiable apples-to-apples comparisons."

That's all you really need to know.

------
ZhuanXia
Them shunning benchmarks is pretty lame.

~~~
baybal2
The guy who runs Cerebras has a history of quickly selling companies that then
went nowhere. He bets everything on the wow effect, and sells to trend-chasing
suckers.

Less-than-stellar benchmarks would ruin the "magic".

------
green-eclipse
The Cerebras chip really stands out in terms of the chip industry's
relationship to Moore's law. Look at the graphs in this article for reference:

[https://medium.com/predict/cerebras-trounces-moores-law-with...](https://medium.com/predict/cerebras-trounces-moores-law-with-first-working-wafer-scale-chip-70b712d676d0)

~~~
atq2119
That article is hogwash. Sure, the Cerebras "chip" is impressive. But the idea
that it will _accelerate_ Moore's law and usher in the singularity is just
nonsense. Nobody has even made serious efforts to use deep learning for
physical design, and its scope for improving designs is limited at best even
in theory.

If this were aimed at solid-state physics and materials research, then
_maybe_ one could be carefully optimistic about a genuine breakthrough via
something like room-temperature, standard-pressure superconductivity. As it
stands, I call blind hype.

~~~
mlyle
Yah, it's BS. But it may be teaching TSMC a whole lot about making larger
chips with good yield, and the across-reticle interconnect technology is
impressive too and may find some general applicability (e.g. it sounds like
something AMD might like).

~~~
atq2119
Oh, for sure there are things to be learned from this. The responsibility for
yield doesn't lie with TSMC though, but with the logic design: to make this
kind of integration work, your design has to be able to tolerate a fault
essentially anywhere on the wafer surface.

This isn't magic, of course: keep in mind that we already have SRAM with extra
capacity for fault tolerance, and binning multi-core chips based on the number
of functioning cores has been standard for a long time.

~~~
mlyle
Design to tolerate failures is only one variable in yields. Wafer-scale
integration exercises to the limit both our ability to tolerate defects and
our ability to minimize them.

------
gfodor
I’m a know-nothing when it comes to this area, but I shouted expletives at
least twice when I read this article. This is crazy.

