
Cerebras Systems unveils a 1.2T transistor chip for AI - modeless
https://venturebeat.com/2019/08/19/cerebras-systems-unveils-a-record-1-2-trillion-transistor-chip-for-ai/
======
modeless
There are far more transistors in this chip than neurons in the human brain.
In 100 of these chips, there are more transistors than there are synapses in
the human brain.
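A quick back-of-the-envelope check of those ratios, with the caveat that neuron and synapse counts are only rough literature estimates:

    # Rough sanity check (brain figures are order-of-magnitude estimates)
    transistors_per_chip = 1.2e12   # Cerebras WSE
    neurons = 8.6e10                # ~86 billion neurons in a human brain
    synapses = 1.0e14               # commonly cited 10^14-10^15 range

    print(transistors_per_chip / neurons)          # ~14 transistors per neuron
    print(100 * transistors_per_chip / synapses)   # ~1.2 transistors per synapse across 100 chips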

I don't mean to suggest that transistors are equivalent to neurons or
synapses; clearly they are very different things (though it is _not_ clear
that neurons are much more computationally powerful than transistors, despite
assertions from many people that this is the case). But I think it is still
useful to compare the complexity of structure of chips vs. the human brain. We
are finally approaching the same order of magnitude of complexity in
structure.

Also note that this is not manufactured on TSMC's highest density process.
Assuming TSMC's 3nm process development is successful, that will probably be
6x denser than the 16nm process used here.

~~~
neural_thing
I have written a book about the true computational power of pyramidal neurons.
Spoiler: they are A LOT more powerful than transistors. The book is free @
[http://www.corticalcircuitry.com/](http://www.corticalcircuitry.com/)

~~~
thelittleone
I'm a few pages in and have found it fascinating, humorous, compelling, and
understandable even with zero background in the field. Thanks for sharing.

~~~
neural_thing
Thank you for the kind words!

------
ChuckMcM
Wafer scale integration, the stupid idea that just won't die :-)

Okay, I'm not quite that cynical, but I was avidly following Trilogy Systems
(Gene Amdahl started it to make supercomputers using a single wafer).
Conceptually awesome; in practice, not so much.

The thing that broke down in the '80s was that different parts of the system
evolve at different rates. As a result, your wafer computer was always going
to be sub-optimal at something, whether memory access or a new I/O channel
standard. Changing that part meant all new wafers, and fab companies knew that
every time you change the masks, you have to re-qualify _everything_. Very
time consuming.

I thought the AMD "chiplet" solution to making processors that could evolve
outside the interconnect was a good engineering solution to this problem.

Dave Ditzel, of Transmeta fame, was at one point pushing a 'stackable' chip.
Sort of 'chip on chip', like some cell-phone SoCs have for memory, but
generalized to allow stacking more than just 2 chips. The problem becomes
getting the heat out of such a system, as only the bottom chip is in contact
with the substrate and the top chip with the case. Conceptually, though, it's
another angle where you could replace parts of the design without new masks
for the other parts.

I really liked the SeaMicro clusters (tried to buy one at the previous company
but it was too far off-axis to pass the toy-vs-tool test). Perhaps Cerebras
will solve the problems of WSI and turn it into something amazing.

~~~
samstave
Haven't heard Transmeta mentioned in a LONG time.

I recall when I was at Intel in 1996 and used to work a few feet from Andy
Grove... and I would ask naive questions like

"how come we can't stack multiple CPUs on top of each other"

and make naive statements like:

"When Google figures out how to sell their services to customers (GCP) we are
fucked" (this was said in the 2000s when I was on a hike with Intel's then
head of tech marketing, not 1996) ((During that hike he was telling me about a
secret project where they were able to make a proc that had 48 cores)) (((I
didn't believe it and I was like "what the fuck are they going to do with
it?"))) -- welp, this is how the future happens. And here we are.

and ask financially stupid questions like:

"what are you working on?" response "trying to figure out how to make our ERP
financial system handel numbers in the billions of dollars"

I made a bunch of other stupid comments... like "Apple is going to start using
Intel procs" and was yelled at by my Apple fanboy: "THAT'S NEVER GOING TO
FUCKING HAPPEN."

But I'm just a moron.

---

But Transmeta... there was a palpable fear of them at Intel at that time...

~~~
foobiekr
A lot of the transmeta crowd, at least on the hardware side, went on to work
at a very wide variety of companies. Transmeta didn't work, and probably was
never going to work, as a company and a product, but it made a hell of a good
place to mature certain hardware and software engineers, like a VC-funded
postdoc program. I worked with a number of them at different companies.

~~~
PopeDotNinja
I was at Transmeta. It was good for my career!

~~~
samstave
So what specifically are you doing now??

~~~
PopeDotNinja
I was a QA Engineer at Transmeta, got an MBA, worked in tech recruiting for
several years, and now I'm a software engineer.

------
program_whiz
An amazing stride in computing power. I'm not convinced that the issue is
really about more hardware. While more hardware will definitely be useful once
we understand AI, we still don't have a fundamental understanding of how AGI
works. There seem to be more open questions than solved ones. I think what we
have now is best described as:

"Given a well stated problem with a known set of hypotheses, a metric that
indicates progress towards the correct answer, and enough data to
statistically model this hypothesis space, we can efficiently search for the
local optimum hypothesis."

I'm not sure doing that faster is going to really "create" AGI out of thin
air, but it may be necessary once we understand how it can be done (it may be
an exponential time algo that requires massive hardware to execute).
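As a toy instance of exactly that description (purely illustrative code): a fixed, well-stated objective, the loss itself as the progress metric, and an efficient search for a local optimum.

    # Minimize f(x) = (x - 3)^2 by following its gradient 2(x - 3).
    def gradient_descent(grad, x0, lr=0.1, steps=100):
        x = x0
        for _ in range(steps):
            x -= lr * grad(x)   # step toward the local optimum
        return x

    print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # -> ~3.0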

~~~
gridlockd
What's there to understand? We know how intelligence "happened". We just need
to build a few million human-sized brains, attach the universe, and simulate a
few billion years of evolution. Whatever comes out of that could just be
called "AGI" by definition.

~~~
yters
We don't know that's how intelligence happened. That's just speculation based
on materialist assumptions. Intelligence is most likely immaterial, i.e.
abstract concepts, free will, mathematics, consciousness, etc. In which case,
it's beyond anything in this physical universe.

~~~
visarga
> it's beyond anything in this physical universe

Your way of thinking leads to dualism, which has proven to be a bad approach
to the problem of consciousness. Dualism doesn't explain anything; it just
moves the problem outside of the 'material' realm into fantasy la-la land.

~~~
yters
Hey, if materialism cannot explain reality, then why stick with a bad
hypothesis? Sticking your fingers in your ears and calling any alternative
'la-la land' sounds pretty anti-intellectual.

~~~
visarga
I am not a materialist, I am a physicalist. And la-la land is just a fitting
metaphor for thinking that conscious experience is explained by a theory that
can't be proven or disproven. If you think your consciousness is received by
the brain-antenna from the macrocosmic sphere of absolute consciousness (or
God, or whatever you call it), then remember your mother and father, and the
world that supported you and fed you experiences. They are responsible for
your consciousness. Your experience comes from your parents and the world, and
you are missing the obvious while embracing a nice fantasy. And if you don't
make babies, you won't extend your consciousness past your death. Your parents
did, and here you are, all conscious and spiritual.

~~~
yters
How do you prove/disprove physicalism?

------
Traster
I'd love to see some metrics on whether this idea has any merit, because
obviously any problem that can be done on a massive wafer can be done on
multiple smaller wafers. The question is: are the compromises to get something
working on a massive wafer killing performance to the point where just
splitting the problem up efficiently would have been better? It's also
important to remember that problems aren't just going to happen to be exactly
the size of this wafer: they're either huge in scale or not. If not, you don't
need this wafer; if so, you probably need to partition your problem onto
multiple copies of this wafer anyway. Take their comparison to an Nvidia GPU:
it might be 57 times bigger, but I can rent 1,000 Nvidia GPUs from AWS at the
drop of a hat (more or less).

So yes, maybe they've done some interesting stuff, but they need some decent
benchmarks to show off before we can really distinguish whether this is the
Boeing 787 Dreamliner or the Spruce Goose.

~~~
dgacmu
You get much higher bandwidth (at lower power cost) between chiplets than if
you used a multi-chip design or even an interposer.

The drawback is that you suffer a really interesting thermal problem and have
to figure out what to do with the wafer space that doesn't fit in your square
-- probably creating lower-scale designs that you sell.

The second drawback is that you can't really match your available computation
to memory in the same way you can with a more conventional GPU. So you have to
be able to split your model across chips and train model-parallel. The
advantage is that model-parallel lets you throw a heck of a lot more
computation at the problem and can help you scale better than using only data
parallelism.

Model-parallel training is typically harder than data-parallel because you
need high bandwidth between the computation units. But that's exactly what
Cerebras's design is intended to provide.
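A minimal sketch of that distinction in plain NumPy, with a hypothetical two-device cut (real frameworks shard across physical devices, of course):

    import numpy as np

    W = np.random.randn(512, 512)   # model weights
    x = np.random.randn(64, 512)    # a minibatch of 64

    # Data-parallel: every device holds a full copy of W; the batch is split.
    y_data_parallel = np.concatenate([x[:32] @ W, x[32:] @ W])

    # Model-parallel: W itself is split across devices; activations must be
    # exchanged between them, which is why inter-device bandwidth dominates.
    y_model_parallel = np.concatenate([x @ W[:, :256], x @ W[:, 256:]], axis=1)

    assert np.allclose(y_data_parallel, y_model_parallel)  # same math, different cut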

You also have a yield management issue, where you have to build in the
capability to route around dead chips, but that's not too nasty a technical
detail. But if your "chip"-level yield (note that their chip is still a
replicated array of subunits) is too low, it kills your overall yield. So
they're going to be conservative with their manufacturing to keep yield high.

It's not obviously broken, but it's certainly true we need benchmarks -- and
not just benchmarks, but time for people to come up with models that are
optimized for training/inference on the Cerebras platform, which will take
even longer.

~~~
nrp
Why even make the final giant chip rectangular? I get that the exposure
reticles are rectangular, but since this is tiling a bunch of connected
chiplets, why not use the full area of the wafer?

~~~
petra
You need to make sure your I/O lines are imprinted at the edge of the wafer.

It's much easier to do that by putting the I/O lines at the ends of each
"chip", vs. a circle that cuts through the middle of a "chip".

------
voldacar
This is very neat, but I really wish the press would say "neural net" instead
of "AI". "AI" just means a computer program that has some ability to reason
about data similarly to a human; neural nets are a subset of that.

I guess "AI" gets you the clicks, though.

~~~
curiousgal
> neural net

Which is basically matrix multiplication.
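In the sense that a dense layer's forward pass is a matrix multiply plus a cheap nonlinearity:

    import numpy as np

    x = np.random.randn(32, 128)    # batch of 32 inputs
    W = np.random.randn(128, 64)    # learned weights
    b = np.zeros(64)                # biases
    y = np.maximum(x @ W + b, 0)    # multiply, add, ReLU -- that's the layer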

~~~
voldacar
Everything is matrix multiplication at some level!

~~~
guenthert
If you have a hammer ...

~~~
ivalm
In the sense of Hamiltonians as linear operators on quantum states (such as
the state of the universe)...

------
CaliforniaKarl
This is one of the first of what will probably be a number of announcements
out of the Hot Chips conference, happening now through Tuesday on the Stanford
campus.

[https://www.hotchips.org](https://www.hotchips.org)

------
groundlogic
> The 46,225 square millimeters of silicon in the Cerebras WSE house 400,000
> AI-optimized, no-cache, no-overhead, compute cores

> But Cerebras has designed its chip to be redundant, so one impurity won’t
> disable the whole chip.

This sounds kinda clever to a semi-layperson. Has this been attempted before?
Edit: at this scale. Not single-digit-core CPUs being binned, but 100k+-core
chips with some kind of automated core deactivation on failure.

~~~
baybal2
> This sounds kinda clever to a semi-layperson. Has this been attempted
> before?

Yes, by Trilogy Systems, and it went bust spectacularly. It raised over $200M
of capital (back in the early eighties!) and turned out to be the biggest
financial failure in Silicon Valley's history.

[https://en.m.wikipedia.org/wiki/Trilogy_Systems](https://en.m.wikipedia.org/wiki/Trilogy_Systems)

~~~
groundlogic
First: thanks, this was exactly the kind of historical knowledge I hoped would
show up in this thread!

Gene Amdahl ran this; geez, no surprise they got funded.

Do you happen to know how many "compute units" this chip was designed to
handle?

[https://en.m.wikipedia.org/wiki/Trilogy_Systems](https://en.m.wikipedia.org/wiki/Trilogy_Systems)

> These techniques included wafer scale integration (WSI), with the goal of
> producing a computer chip that was 2.5 inch on one side. At the time,
> computer chips of only 0.25 inch on a side could be reliably manufactured.
> This giant chip was to be connected to the rest of the system using a
> package with 1200 pins, an enormous number at the time. Previously,
> mainframe computers were built from hundreds of computer chips due to the
> size of standard computer chips. These computer systems were hampered by
> chip-to-chip communication, which both slowed down performance and consumed
> much power.

> As with other WSI projects, Trilogy's chip design relied on redundancy, that
> is replication of functional units, to overcome the manufacturing defects
> that precluded such large chips. If one functional unit was not fabricated
> properly, it would be switched out through on-chip wiring and another
> correctly functioning copy would be used. By keeping most communication on-
> chip, the dual benefits of higher performance and lower power consumption
> were supposed to be achieved. Lower power consumption meant less expensive
> cooling systems, which would aid in lower system costs.

Edit: '"Triple Modular Redundancy" was employed systematically. Every logic
gate and every flip-flop were triplicated with binary two-out-of-three voting
at each flip-flop.' This seems like it should it should complicate things
quite a bit more dramatically.. they were doing redundancy at a gate-level,
rather than a a CPU-core level.
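The voting logic itself is tiny; a sketch of the two-out-of-three majority function (the real cost is that every gate and flip-flop is tripled, plus the voters):

    # Two-out-of-three majority vote, the core of triple modular redundancy.
    def vote(a: int, b: int, c: int) -> int:
        return (a & b) | (b & c) | (a & c)

    # A single faulty replica is outvoted by the two correct ones.
    assert vote(1, 1, 0) == 1
    assert vote(0, 0, 1) == 0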

------
wolf550e
whitepaper: [https://www.cerebras.net/wp-content/uploads/2019/08/Cerebras-Wafer-Scale-Engine-Whitepaper.pdf](https://www.cerebras.net/wp-content/uploads/2019/08/Cerebras-Wafer-Scale-Engine-Whitepaper.pdf)

I don't expect good yields from a chip that takes up the whole wafer. They
must disable cores and pieces of SRAM that are damaged. How is this
programmed?

~~~
keveman
> How is this programmed?

Full disclosure: I am a Cerebras employee.

There is extensive support for TensorFlow. A wide range of models expressed in
TensorFlow will be accelerated transparently.

~~~
gwern
Looking at the whitepaper, I'm a little surprised how little RAM there is for
such an enormous chip. Is the overall paradigm here that you still have
relatively small minibatches during training, but each minibatch is now vastly
faster?

~~~
Veedrac
“full utilization at any batch size, including batch size 1”

[https://www.cerebras.net/](https://www.cerebras.net/)

~~~
gwern
That doesn't really mean anything. It (and any other chip) had _better_ be
able to run at least batch size 1, and lots of people claim to have great
utilization... It doesn't tell me if the limited memory is part of a
deliberate tradeoff akin to a throughput/latency tradeoff, or some intrinsic
problem with the speedups coming from other design decisions like the sparsity
multipliers, or what.

~~~
Veedrac
Most of the chip is already SRAM, I'm not really sure what else you would
expect?

18 GiB × 6 transistors/bit ≈ 0.93 trillion transistors

~~~
gwern
Well, it could be... _not_ SRAM? It's not the only kind of RAM, and the choice
to use SRAM is certainly not an obvious one. It could make sense as part of a
specific paradigm, but that is not explained, and hence why I am asking. It
may be perfectly obvious to you, but it's not to me.

~~~
Veedrac
You basically have the option between SRAM, HBM (DRAM), and something new. You
can imagine the risks with using new memory tech on a chip like this.

The issue with HBM is that it's much slower, much more power hungry (per
access, not per byte), and not local (so there are routing problems). You
can't scale that to this much compute.

~~~
gwern
But HBM and other RAMs are, presumably, vastly cheaper otherwise. (You can
keep explaining that, but unless you work for Cerebras and haven't thought to
mention it, talking about how SRAM is faster is not actually an answer to my
question about what paradigm is intended by Cerebras.)

~~~
Veedrac
They say they support efficient execution of smaller batches. They cover this
somewhat in their Hot Chips talk, e.g. "One instance of NN, don't have to
increase batch size to get cluster scale perf" from the AnandTech coverage.

If this doesn't answer your question, I'm stuck as to what you're asking
about. They use SRAM because it's the only tried and true option that works.
Lots of SRAM means efficient execution of small batch sizes. If your problem
fits, good, this chip works for you, and probably easily outperforms a cluster
of 50 GPUs. If your problem doesn't, presumably you should just use something
else.

------
navaati
Wow! 56 (!) times bigger than the biggest Nvidia Volta chip, a single
46-square-millimeter die (no chiplets like recent AMD chips), and a whopping
18 GB of SRAM (that's like 18 GB of CPU cache, basically)!

I don't know if you guys are used to that scale, but I find it monstrous!

~~~
14
I was a little confused when I read your comment and 46 square millimeters. I
am guessing you meant 46k square millimeters, as the article stated 46,225
square millimeters, which, yes, you are right, is monstrous. Very cool! As a
caregiver I often discuss with my clients the cool things that have come into
existence in their lifetime, and I wonder, when I get old, what things my
children will reflect back on and say, "Grandpa, you were alive when X was
invented? How neat." Personally I am hoping it is fusion power.

~~~
duderific
The young people I work with are always shocked when I tell them there was no
internet (or at least, nothing terribly useful at the consumer level) when I
went to college.

------
lnsru
I am just wondering why there is no similar product as an FPGA array. As far
as I know, that's the cheapest way to see if there is product/market fit for a
semiconductor product, and high-speed transceivers as well as memory
controllers are included in FPGAs. This single-wafer approach looks very
interesting to me. I was an intern at Infineon some time ago, working on
device distribution characterization across a 200 mm wafer. The chips in the
middle performed 2-3x better than those at the border. So how does Cerebras's
chip manage this issue? Are the middle parts throttled, or are low-performing
areas near the wafer's border disabled? How much does it cost?.. I can imagine
it being shipped on a thermal pad with liquid nitrogen cooling below. There
must be some wires bonded for the interface to the host. Very interesting
technical project. I am very curious who the clients are for such a huge
specialized chip.

~~~
morphle
An FPGA would make it a much more generic product, so you could sell to more
markets. But you would lose a factor of 200 (a factor of 1000 with traditional
FPGA design) to make the transistors reconfigurable.

If you leave the wafer intact, you get 70,000 mm2. Cerebras cut off almost
half of the wafer.

At 7nm you would get 2.7 trillion transistors with 300 mm wafers, more with
450 mm wafers. You disable reticle-sized areas with impurities or damage at
runtime.

You can cool it with immersive liquid cooling. Instead of wire bonding you can
stack chips on top or use free space optics [1].

[1]
[https://www.youtube.com/watch?v=7hWWyuesmhs](https://www.youtube.com/watch?v=7hWWyuesmhs)

------
sp332
There's a photo on the homepage:
[https://www.cerebras.net/](https://www.cerebras.net/) It's just about 8.5"
across. You could just barely print a 1:1 photo of it on a letter-sized sheet
of paper, with less than a half-millimeter margin on each side.
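The arithmetic checks out for a square die of 46,225 mm²:

    import math

    side = math.sqrt(46_225)             # exactly 215.0 mm on a side
    letter_width = 8.5 * 25.4            # 215.9 mm
    print((letter_width - side) / 2)     # ~0.45 mm of margin per side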

~~~
ohazi
I wonder if they bond it to a copper slab. At that scale, the tiniest amount
of PCB flex would probably shatter the die/wafer...

~~~
oITAZt
They do not. They have "developed [a] custom connector to connect wafer to
PCB": [https://imgur.com/a/sXxGbiD](https://imgur.com/a/sXxGbiD)

~~~
ohazi
I think "cold plate" is the slab.

~~~
oITAZt
Yes, it is probably a copper slab (they never say), but there is no electrical
connection, as the rest of the slides make clear:
[https://imgur.com/a/Rbd7e4D](https://imgur.com/a/Rbd7e4D)

It just provides a thermal connection for water cooling. The electrical
connection is made through the PCB (and probably through a thick copper plate
on the opposite side of the PCB).

------
goldemerald
Is there any comparison of neural-net training speed between one of these
chips and a typical one? I'd be interested to see how long it takes to train
an ImageNet classifier on one of these compared to other hardware.

------
samcheng
I remember watching the (excellent!) Tesla Autonomy Day presentation, and an
analyst was asking about 'chiplets' but was met with a bit of dismissal from
the Tesla team.

Maybe THIS is what the analyst had in mind! Pretty cool stuff, although I
question how interconnect / inter-processing-unit communication would work.

Notably, no benchmarks in the press release...

------
rightbyte
"The Cerebras software stack is designed to meet users where they are,
integrating with open source ML frameworks like TensorFlow and PyTorch"

What's the instruction set? They don't say.

I assume you need to program in some Verilog-ish macro-assembler DSL for that
monster contraption. Python is probably not what works well...

~~~
sanxiyn
Your CoreML model can run on Apple Neural Engine, but Apple doesn't expose
that hardware's instruction set. This probably works similarly.

------
rbanffy
There are more details here: [https://fortune.com/2019/08/19/ai-artificial-intelligence-cerebras-wafer-scale-chip](https://fortune.com/2019/08/19/ai-artificial-intelligence-cerebras-wafer-scale-chip)

~~~
bhassel
> In another break with industry practice, the chip won’t be sold on its own,
> but will be packaged into a computer “appliance” that Cerebras has designed.
> One reason is the need for a complex system of water-cooling, a kind of
> irrigation network to counteract the extreme heat generated by a chip
> running at 15 kilowatts of power.

15 kW, yikes.

~~~
dmitrygr
So, Azul Systems 2.0? clever(ish) hardware with good(ish) results, for too
much $$$ for anyone to actually buy?

~~~
rbanffy
The main difference is that what Azul built had more limits - once you run
your workload well enough, there is little incentive to have more compute
power.

When it comes to ML, the more compute power you throw at it, the better.

------
natpalmer1776
I know very little about this sub-field, but from a layman's perspective, and
given the competition between Intel and Nvidia in the field of deep learning,
I would not be surprised if Intel tried to acquire this company for its
single-chip designs.

------
ineedasername
Would this really be more efficient in terms of cost/performance? It seems the
specialized nature of the chip pushes the price high enough that you could
build equivalent systems with traditional hardware for the same or less, and
it would all be known quantities rather than working with something brand new
and not as well understood.

~~~
mv4
Specialized AI chips don't seem like a very good business idea to me.

The way we do things in AI today may be completely different from how we do
them tomorrow. It's not like there's a standard everyone has agreed on.

There is a very real risk these specialized, expensive devices will go the way
of the Bitcoin ASIC miner (which saturated secondary markets at a fraction of
its original cost).

Source: I do ML consulting and build AI hardware.

~~~
p1esk
The way we do things in AI today is multiplication of two large matrices. Just
like we did it 30 years ago:
[http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf](http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf)

~~~
ivalm
Sure, but Cerebras isn't just multiplying two large matrices; they are
multiplying two large, very sparse matrices, relying on the ReLU activation to
maintain sparsity in all of the layers. We already have BERT/XLNet/other
transformer models moving away from ReLU to GELU, which does not result in
sparse matrices. "Traditional" activations (tanh, sigmoid, softmax) are not
sparse either.
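A quick way to see the difference, using the standard tanh approximation of GELU in NumPy:

    import numpy as np

    x = np.random.randn(1_000_000)
    relu = np.maximum(x, 0)
    gelu = 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    # ReLU produces exact zeros a sparse datapath can skip; GELU does not.
    print((relu == 0).mean())   # ~0.5: half the multiplies are skippable
    print((gelu == 0).mean())   # ~0.0: almost nothing is exactly zero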

~~~
p1esk
Good point. I think it's a safe bet to focus on dense dot products in hardware
for the foreseeable future. However, in their defense:

1. It's not clear that supporting sparse operands in hw would result in
significant overhead.

2. DL models are still pretty sparse (I bet even those models with GELU still
have lots of very small values that could safely be rounded to zero).

3. Sparsity might have some benefits (e.g.
[https://arxiv.org/abs/1903.11257](https://arxiv.org/abs/1903.11257)).

------
kingosticks
How do they go about packaging something this big? Is this supposed to be
used, or is it just a headline?

~~~
kingosticks
There's some information on this towards the end of the slides at
[https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning](https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning).

Sadly there's very little concrete info on the packaging methodology (hardly a
surprise), which to me is the only truly novel thing about this chip. But it
must cost an absolute fortune.

------
temac
Impressive, but is this a good idea?

Either you need a crazy thermal solution, or it must be way less thermally
dense than smaller chips can be. And is it really that much of an advantage to
stay on-chip compared to going through a PCB, if the distances are crazy?

~~~
gnode
I'm surprised this hasn't been done before; a monolithic wafer can achieve
denser interconnects, and is arguably simpler than a multi-chip module.

On the other hand, multi-chip modules can combine the working chips in low-
yield wafers, whereas a monolithic wafer would likely contain many failed
blocks, uselessly taking up space / distancing the working chips.

Cooling isn't really a problem, as the TDP scales with area as in other chips.
Water cooling or heat pipes can transport the heat away to a large heat sink.
3D / die-stacked chips have a harder cooling problem, potentially requiring
something like an intra-chip circulatory system.

~~~
bryanlarsen
> I'm surprised this hasn't been done before

It has, a couple of people have linked to
[https://en.wikipedia.org/wiki/Wafer-scale_integration](https://en.wikipedia.org/wiki/Wafer-scale_integration)

------
bmh
In their whitepaper they claim "with all model parameters in on-chip memory,
all of the time", yet this entire 15 kW monster has only 18 GB of memory.

Given the memory-vs-compute ratios you see in Nvidia cards, this seems
strangely low.

~~~
Veedrac
18GB is huge! An NVIDIA V100 has 6MB of L2 cache. HBM is off-chip, and vastly
(~100x) slower.

~~~
baybal2
18GB of very fast memory will still be just as hard to keep fed with data as
that 6MB cache.

~~~
Veedrac
The idea is that the whole model resides in the fast memory, so you don't need
to ‘keep it fed’.

~~~
dmitrygr
44K/core is very little memory

~~~
Veedrac
Indeed, but cores are only responsible for small fragments of the network, so
don't need huge amounts of memory.

~~~
dmitrygr
Unless you need to multiply large matrices, where you need access to very
large rows and columns...like in...ML applications

~~~
Veedrac
That's what the absurdly fast interconnect is for. You send the data to where
the weights are.

~~~
dmitrygr
Absurdly fast != Single cycle

It will be physically impossible to access that much memory single cycle at
anything approaching reasonable speeds. I suppose you could do it at 5Hz :)

~~~
Veedrac
A core receives data over the interconnect. It uses its fast memory and local
compute to do its part of the matrix multiplication. It streams the results
back out when it's done. The interconnect doesn't give you single-cycle access
to the whole memory pool, but it doesn't need to.
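A rough sketch of that weight-stationary idea (hypothetical tiling in NumPy; the real fabric obviously streams data between physical cores rather than slicing arrays):

    import numpy as np

    W = np.random.randn(1024, 1024)
    # 4 "cores", each keeping its own tile of W resident in local memory.
    tiles = [W[i * 256:(i + 1) * 256, :] for i in range(4)]

    def layer(x):
        # x is sent to every core; each multiplies against its resident tile,
        # and the partial outputs are concatenated downstream.
        return np.concatenate([x @ t.T for t in tiles], axis=-1)

    y = layer(np.random.randn(8, 1024))   # equals x @ W.T, computed tile by tile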

~~~
dmitrygr
I think it is telling that in one sentence there is a claim that it is faster
than Nvidia, and in another, a claim that it runs TensorFlow. I do not think
this architecture could do both at once: it could not run TensorFlow fast
enough (not enough local fast memory) to compete even with a moderate array of
GPUs.

------
mschuster91
18 GByte of memory with _1 clock cycle_ latency? That's impressive.

~~~
bryanlarsen
I assume that each byte of that memory has a 1 clock cycle latency to only one
of the 400,000 cores on the wafer. That's about 45KByte of memory per core; 1
clock cycle latency to a block that small is quite reasonable.
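The division behind that figure:

    total_sram = 18e9          # bytes of on-wafer SRAM
    cores = 400_000
    print(total_sram / cores)  # 45,000 bytes, i.e. ~45 KB per core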

------
The_rationalist
ASICs do not support CUDA. There is a fork of TensorFlow with OpenCL support
from AMD, but I doubt people will use it for this ASIC. So how can
TensorFlow/CNTK/PyTorch use such hardware?

~~~
sanxiyn
The exact same way TensorFlow supports TPU: by writing another backend. TPU
doesn't support CUDA either.
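For reference, this is roughly what the TPU path looks like in TensorFlow 2.x: the model code is unchanged, and a device-specific backend does the lowering. (How a Cerebras backend would plug in is speculation on my part; this only shows the TPU precedent.)

    import tensorflow as tf

    # Standard TPU setup: resolve the device, initialize it, and wrap the
    # model in a distribution strategy backed by the TPU's own compiler.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.experimental.TPUStrategy(resolver)

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        model.compile(optimizer="adam", loss="mse")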

~~~
The_rationalist
So instead of having common optimized kernels for AMD, Intel, all the ARM
firms, all FPGAs, and all ASICs, each vendor is reinventing the wheel in its
own backend? Not so surprising ^^

------
fnord77
since the article didn't post a picture of the bare chip:

[https://i.imgur.com/cMo4w0C.jpg](https://i.imgur.com/cMo4w0C.jpg)

------
p1esk
But... can I play Crysis - oops sorry - can I train ResNet on this thing?

------
riku_iki
I wonder what the price and power consumption will be. Does it need a
specialized server with a huge power supply module?

~~~
morphle
A 300mm wafer at the 16nm node costs more than $6,500 apiece. The power
consumption would be more than 20kW if all transistors were in use
simultaneously. We are designing a special reconfigurable AC-DC and DC-DC
power router (inverter) to supply this huge amount of power. Our WSI design
will be immersion-cooled.

------
jaboutboul
There goes Intel. Behind the curve again...

------
panabee
Can anyone comment on the differences between Cerebras and the other chip
startups trying to rethink semiconductor architecture for AI? What are the
main technical forks?

------
mv4
While this is certainly a very impressive achievement, I am personally
interested in small and light AI.

World's tiniest AI chip? That would get me excited!

~~~
p1esk
What's "AI chip"? You can build a full adder, and call it "an AI chip".

~~~
mv4
By that I mean any chip, from something generic (a GPU) to something
specialized (vision, NLP, etc.): any chip that makes training or running
TF/Caffe models faster.

~~~
p1esk
Faster than what? A tiny chip will be slower than a large chip.

~~~
mv4
Faster while having the same form factor, energy usage, and cost.

~~~
p1esk
Even under those constraints you can build something that's either fast or
general. Pick one.

