
A Look at Celerity’s Second-Gen 496-Core RISC-V Mesh NoC - rbanffy
https://fuse.wikichip.org/news/3217/a-look-at-celeritys-second-gen-496-core-risc-v-mesh-noc/
======
pjc50
Interesting. There have been a few of these super-manycore processors before;
their main characteristic is being quite hard to program effectively due to
the need to partition the work and think very hard about memory bottlenecks.

> The Vanilla-5 cores are a 5-stage in-order pipeline RV32IM cores so they
> support the integer and multiply extensions

So, roughly comparable to a high-speed Cortex M.

> instead of using caches, the entire memory address space is mapped across
> all the nodes in the network using a 32-bit address scheme. This approach,
> which also means no virtualization or translation, simplifies the design a
> great deal.

The diagram shows each core has icache and dcache; what they've ditched is
cache coherency. That certainly makes it simpler to implement but now the
cores have to be responsible for their own coherency. Also, none of your
protected mode operating system nonsense - this is designed to run a single
program and get everything out of the way. Every core can potentially
overwrite any other core's memory, and if it does so you won't know until you
have a cache miss. Good luck figuring that one out in the debugger.

This is very clearly intended for the sort of AI or image processing workload
where you can clearly partition it two-dimensionally across the array to
identical nodes, and then have those nodes collaborate locally by passing
messages across the edges.
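The partition-then-exchange pattern described above can be sketched in plain Python (a toy simulation, not the Celerity toolchain): each "core" owns a slice of a 1-D array and fetches one halo element from each neighbour before computing a 3-point stencil locally, mirroring the edge-message traffic on the mesh.

```python
# Toy sketch: partition a 1-D stencil across a row of "cores", exchange
# halo elements with neighbours, compute locally. Function names are my own.

def stencil_global(data):
    """Reference: 3-point average over the whole array (edges clamped)."""
    n = len(data)
    return [(data[max(i - 1, 0)] + data[i] + data[min(i + 1, n - 1)]) / 3
            for i in range(n)]

def stencil_partitioned(data, n_cores):
    n = len(data)
    chunk = n // n_cores
    slices = [data[c * chunk:(c + 1) * chunk] for c in range(n_cores)]
    out = []
    for c, local in enumerate(slices):
        # "Message passing across the edges": one halo element from each
        # neighbouring core, clamped at the array boundary.
        left = slices[c - 1][-1] if c > 0 else local[0]
        right = slices[c + 1][0] if c < n_cores - 1 else local[-1]
        padded = [left] + local + [right]
        out.extend((padded[i - 1] + padded[i] + padded[i + 1]) / 3
                   for i in range(1, len(padded) - 1))
    return out

data = list(range(16))
assert stencil_partitioned(data, 4) == stencil_global(data)
```

The point is that each core only ever touches its own slice plus two halo words, so the working set stays local and no coherency machinery is needed.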

~~~
samps
> The diagram shows each core has icache and dcache; what they've ditched is
> cache coherency.

This is not quite true: the local data memories are _not_ caches, i.e., they
do not implicitly move memory in from a more distant tier in the memory
hierarchy. They are just plain explicitly managed local memories (sometimes
called "scratchpads" to distinguish them from caches).

------
ncmncm
I like that permitting only writes to other cores' memory simplifies the
design. If you want to read, you have to write a request to that core by
storing the request in a place it will look, and tell it where to put the
answer. And all the cores have to check for such requests.

It is kind of surprising that that is acceptable. I suppose the usual case is
that each core already knows what its neighbors will want to see, and sends
those values before each neighbor needs them.
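The write-only protocol described above can be modelled in a few lines of Python (my own sketch, not Celerity's actual API): a core never loads from another core's memory; it writes a request into the owner's mailbox naming a reply slot, and the owner answers with a remote store back into the requester's memory.

```python
# Toy model of "reads via remote writes". Class and method names are
# hypothetical, chosen only to illustrate the two-step protocol.

class Core:
    def __init__(self, mem_words):
        self.mem = [0] * mem_words   # local scratchpad, remotely writable
        self.mailbox = []            # incoming read requests

    def remote_read(self, owner, addr, reply_slot):
        # Step 1: write the request where the owner will look.
        owner.mailbox.append((addr, self, reply_slot))

    def service_requests(self):
        # Step 2: each core periodically checks for requests and answers
        # with a remote *write* into the requester's memory.
        while self.mailbox:
            addr, requester, reply_slot = self.mailbox.pop(0)
            requester.mem[reply_slot] = self.mem[addr]

a, b = Core(8), Core(8)
b.mem[3] = 42
a.remote_read(b, addr=3, reply_slot=0)   # a asks b for word 3
b.service_requests()                      # b stores the value into a.mem[0]
assert a.mem[0] == 42
```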

~~~
JerryTy
This technique is known as remote store programming (RSP), which allows for
very convenient producer-consumer communication. It lends itself well to AI
workloads, since reading from a foreign core is rather rare compared to
writing to one (e.g. the next stage).
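The producer-consumer flow works out to something like the following sketch (a hypothetical buffer layout, not the real RSP runtime): the producer stores its result straight into the consumer's local memory and sets a flag, so the consumer only ever reads its own memory.

```python
# Toy RSP-style pipeline stage. Names are my own; in real hardware the
# "remote store" would be a plain store to a remote core's address range.

class Stage:
    def __init__(self):
        self.inbox = None    # local memory the previous stage writes into
        self.full = False    # flag set by the producer's remote store

def produce(value, consumer):
    # Remote stores into the consumer's scratchpad: data first, then flag.
    consumer.inbox = value
    consumer.full = True

def consume(stage, f, next_stage):
    # Check the locally visible flag, then forward f(x) downstream.
    assert stage.full, "nothing to consume yet"
    stage.full = False
    produce(f(stage.inbox), next_stage)

s1, s2 = Stage(), Stage()
produce(10, s1)                  # e.g. output of a previous network layer
consume(s1, lambda x: x * x, s2)
assert s2.inbox == 100 and s2.full
```

Note the data-then-flag ordering: the consumer never observes the full flag before the payload has landed, which is what makes the pattern work without remote reads.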

------
nickik
Another project that goes for a manycore design is by Princeton and ETH
Zürich. Their design targets somewhat fewer cores, I think.

See: [https://github.com/PrincetonUniversity/openpiton#support-for...](https://github.com/PrincetonUniversity/openpiton#support-for-the-ariane-rv64imac-core)

------
rs23296008n1
Ok, if two processors read/write the same address what happens? Could I assume
the reads are coherent but writes are not?

If this is the case then I assume I'd just need to design-wise restrict one
processor from writing to another's area, yes?

I've worked on worse. Quite promising set of ideas from my limited reading.

~~~
rbanffy
One processor can't read from the other's memory. It requests what it wants
and tells the other where it should be put.

------
ilaksh
I think it's interesting to compare this to a dual 64 core AMD system (so 128
cores) or similar in terms of the programming model or performance.

~~~
rbanffy
This is completely different from EPYC (and the Xeon Phi). The x86s present a
single coherent memory image and lots of identical CPU cores, whereas this one
doesn't. In some aspects it resembles a Cell processor, with the difference
that the "SPUs" (the tiny in-order cores) have some direct access to the main
memory and are mostly binary compatible with the "PPUs" (the five large RISC-V
cores).

~~~
ilaksh
I know that it's a different architecture.

The question is whether it's worthwhile to adopt that new programming model
and architecture. What can be achieved that is not possible, or not
price-competitive, with something boring and old-fashioned like 128 x64 cores?

I am all for new architectures. But the trick is getting people to make the
effort to make use of them.

I think that when I write a good comment it gets buried and the only time
comments are really popular is when I write something trite or silly that a
lot of people are already thinking.

~~~
anon73044
>I think that when I write a good comment it gets buried and the only time
comments are really popular is when I write something trite or silly that a
lot of people are already thinking.

Probably more to do with the gamification of user-generated content. You see
the same thing on imgur and reddit: the well-thought-out comments don't get
upvoted nearly as much as the memes and in-jokes.

>I am all for new architectures. But the trick is getting people to make the
effort to make use of them.

IIRC at the time there were a lot of complaints against the Cell processor
that it was "too hard" to program for.

>[https://www.cnet.com/news/sony-ps3-is-hard-to-develop-for-on...](https://www.cnet.com/news/sony-ps3-is-hard-to-develop-for-on-purpose/)

>[https://www.gtplanet.net/playstation-3-cell-more-powerful-mo...](https://www.gtplanet.net/playstation-3-cell-more-powerful-modern-chips/)

>[https://www.gtplanet.net/the-ps3-era-was-a-nightmare-for-pol...](https://www.gtplanet.net/the-ps3-era-was-a-nightmare-for-polyphony-digital-says-kazunori/)

>What can be achieved that is not possible or is price compatible with
something boring and old-fashioned like 128 x64 cores etc

That's probably the main issue today: x86-64 is cheap and nearly all the
problems can be fixed in software. Changing architectures and instruction sets
is too big of an upfront cost for most people/companies to deal with, and I
think that is why we're only seeing Google and Amazon starting to look for
other solutions.
>[https://aws.amazon.com/ec2/graviton/](https://aws.amazon.com/ec2/graviton/)
>[https://www.google.com/amp/s/www.wired.com/2017/04/building-...](https://www.google.com/amp/s/www.wired.com/2017/04/building-ai-chip-saved-google-building-dozen-new-data-centers/amp)

~~~
rbanffy
> IIRC at the time there were a lot of complaints against the Cell processor
> that it was "too hard" to program for.

Indeed. Interestingly, GPUs were initially very hard to program (and still are
kind of a pain). What made them viable was the introduction of practical
development tools, which Cell never had. This machine shares (some of) the
instruction set between the fat and the puny cores, which makes it a much
easier target to program.

