A Look at Celerity’s Second-Gen 496-Core RISC-V Mesh NoC (wikichip.org)
53 points by rbanffy 5 days ago | 15 comments





Interesting. There have been a few of these super-manycore processors before; their main characteristic is being quite hard to program effectively due to the need to partition the work and think very hard about memory bottlenecks.

> The Vanilla-5 cores are 5-stage in-order pipeline RV32IM cores so they support the integer and multiply extensions

So, roughly comparable to a high-speed Cortex M.

> instead of using caches, the entire memory address space is mapped across all the nodes in the network using a 32-bit address scheme. This approach, which also means no virtualization or translation, simplifies the design a great deal.

The diagram shows each core has icache and dcache; what they've ditched is cache coherency. That certainly makes it simpler to implement but now the cores have to be responsible for their own coherency. Also, none of your protected mode operating system nonsense - this is designed to run a single program and get everything out of the way. Every core can potentially overwrite any other core's memory, and if it does so you won't know until you have a cache miss. Good luck figuring that one out in the debugger.

This is very clearly intended for the sort of AI or image processing workload where you can clearly partition it two-dimensionally across the array to identical nodes, and then have those nodes collaborate locally by passing messages across the edges.


> The diagram shows each core has icache and dcache; what they've ditched is cache coherency.

This is not quite true: the local data memories are not caches, i.e., they do not implicitly move memory in from a more distant tier in the memory hierarchy. They are just plain explicitly managed local memories (sometimes called "scratchpads" to distinguish them from caches).
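
A minimal sketch of that distinction, assuming a hypothetical local_spm[] region and plain memcpy for the explicit transfers (illustrative names only, not Celerity's actual API):

    #include <stdint.h>
    #include <string.h>

    #define SPM_WORDS 1024
    static uint32_t local_spm[SPM_WORDS];  /* this core's explicitly managed local memory */

    /* With a cache, a plain load would pull the data in implicitly.
       With a scratchpad, the fill and write-back are steps in the program.
       Assumes n <= SPM_WORDS. */
    void process_block(const uint32_t *remote_src, uint32_t *remote_dst, int n)
    {
        memcpy(local_spm, remote_src, n * sizeof(uint32_t));  /* explicit fill */
        for (int i = 0; i < n; i++)
            local_spm[i] *= 2;                                /* compute out of local memory */
        memcpy(remote_dst, local_spm, n * sizeof(uint32_t));  /* explicit write-back */
    }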


> what they've ditched is cache coherency. That certainly makes it simpler to implement but now the cores have to be responsible for their own coherency.

Like the quote says: “There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.”


I like that only permitting writes to other cores' memory simplifies the design. If you want to read, you have to write a request to that core by storing the request in a place it will look, and tell it where to put the answer. And all the cores have to check for such requests.

It is kind of surprising that that is acceptable. I suppose the usual case is that each core already knows what its neighbors will want to see, and sends those values before each neighbor needs them.
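
A rough sketch of that request/reply pattern, with invented mailbox structures and the assumption that each core's scratchpad is remotely writable at a known address (not Celerity's real runtime API):

    #include <stdint.h>

    /* A request slot in each core's remotely writable local memory. */
    typedef struct {
        volatile uint32_t  valid;     /* written last, after the rest of the request */
        volatile uint32_t  index;     /* which local word the requester wants */
        volatile uint32_t *reply_to;  /* where, in the requester's memory, to store it */
    } request_t;

    extern request_t         my_inbox;   /* in this core's local memory */
    extern volatile uint32_t my_data[];  /* values other cores may ask about */

    /* Every core has to poll for such requests and answer them with a
       remote store into the requester's memory. */
    void poll_inbox(void)
    {
        if (my_inbox.valid) {
            *my_inbox.reply_to = my_data[my_inbox.index];  /* remote store of the answer */
            my_inbox.valid = 0;                            /* free the slot */
        }
    }

    /* Requester side: fill in the neighbour's inbox, publish it via the valid
       flag, then spin on our own memory until the reply lands there. */
    void remote_read(request_t *their_inbox, uint32_t idx, volatile uint32_t *reply_slot)
    {
        *reply_slot = 0xFFFFFFFFu;           /* sentinel: "no reply yet" (assumes data never equals it) */
        their_inbox->index    = idx;
        their_inbox->reply_to = reply_slot;
        their_inbox->valid    = 1;           /* publish the request last */
        while (*reply_slot == 0xFFFFFFFFu)
            ;                                /* local poll; the answer arrives as a remote store */
    }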


This technique is known as remote store programming (RSP), which allows for very convenient producer-consumer communication. It lends itself well to AI workloads, since reading from a foreign core is rather rare compared to writing to one (e.g., the next stage).
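
A sketch of that producer-consumer shape, again with made-up names for a buffer and flag that live in the consumer's scratchpad but are remotely writable by the producer:

    #include <stdint.h>

    #define TILE_WORDS 64

    /* Producer's view: the next stage's buffer and ready flag, mapped into
       this core's address space for remote stores. */
    extern volatile uint32_t neighbour_buf[TILE_WORDS];
    extern volatile uint32_t neighbour_ready;

    void produce(const uint32_t *results)
    {
        for (int i = 0; i < TILE_WORDS; i++)
            neighbour_buf[i] = results[i];  /* remote stores into the next stage */
        /* Depending on the NoC's ordering guarantees, a fence may be needed
           here so the data lands before the flag does. */
        neighbour_ready = 1;
    }

    /* Consumer's view of the same memory: spin on its own local flag (no
       network traffic), then read its own local buffer; it never issues a
       remote read. */
    extern volatile uint32_t my_buf[TILE_WORDS];
    extern volatile uint32_t my_ready;

    uint32_t consume(void)
    {
        while (!my_ready)
            ;                               /* cheap local poll */
        my_ready = 0;
        uint32_t acc = 0;
        for (int i = 0; i < TILE_WORDS; i++)
            acc += my_buf[i];
        return acc;
    }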

Ask what memory protection there is before one core overwrites state used by another. Then ask again how you synchronize those writes from one core with reads from another.

Yeah, it's not even eventual consistency. They get away with it by using a heavy mailbox message passing setup for synchronization and separate address spaces.


Another project that goes for a manycore design is OpenPiton, by Princeton and ETH Zürich. Their design targets somewhat fewer cores, I think.

See: https://github.com/PrincetonUniversity/openpiton#support-for...


Ok, if two processors read/write the same address what happens? Could I assume the reads are coherent but writes are not?

If this is the case, then I assume I'd just need to restrict one processor from writing to another's area by design, yes?

I've worked on worse. Quite promising set of ideas from my limited reading.


One processor can't read from the other's memory. It requests what it wants and tells the other where it should be put.

I think it's interesting to compare this to a dual 64-core AMD system (so 128 cores) or similar, in terms of the programming model or performance.

This is completely different from EPYC (and the Xeon Phi). The x86s present a single coherent memory image and lots of identical CPU cores, whereas this one doesn't. In some respects it resembles a Cell processor, with the difference that the "SPUs" (the tiny in-order cores) have some direct access to main memory and are mostly binary compatible with the "PPUs" (the five large RISC-V cores).

I know that it's a different architecture.

The question is whether it's worthwhile to adopt that new programming model and architecture. What can be achieved that is not possible, or not price competitive, with something boring and old-fashioned like 128 x86-64 cores?

I am all for new architectures. But the trick is getting people to make the effort to make use of them.

I think that when I write a good comment it gets buried and the only time comments are really popular is when I write something trite or silly that a lot of people are already thinking.


>I think that when I write a good comment it gets buried and the only time comments are really popular is when I write something trite or silly that a lot of people are already thinking.

Probably more to do with the gamification of user-generated content. You see the same thing on Imgur and Reddit: the well-thought-out comments don't get upvoted nearly as much as the memes and in-jokes.

>I am all for new architectures. But the trick is getting people to make the effort to make use of them.

IIRC at the time there were a lot of complaints against the Cell processor that it was "too hard" to program for. >https://www.cnet.com/news/sony-ps3-is-hard-to-develop-for-on...

>https://www.gtplanet.net/playstation-3-cell-more-powerful-mo...

>https://www.gtplanet.net/the-ps3-era-was-a-nightmare-for-pol...

>What can be achieved that is not possible, or not price competitive, with something boring and old-fashioned like 128 x86-64 cores?

That's probably the main issue today: x86-64 is cheap and nearly all the problems can be fixed in software. Changing architectures and instruction sets is too big of an upfront cost for most people/companies to deal with, and I think that is why we're only seeing Google and Amazon starting to look for other solutions. >https://aws.amazon.com/ec2/graviton/ >https://www.google.com/amp/s/www.wired.com/2017/04/building-...


> IIRC at the time there were a lot of complaints against the Cell processor that it was "too hard" to program for.

Indeed. Interestingly, GPUs were initially very hard to program (and still are kind of a pain). What made them viable was the introduction of practical development tools, which Cell never had. This machine shares (some of) the instruction set between the fat and the puny cores, which makes it a much easier target to program.


You should normalize by die size. An 8-core Zen 2 die is 75mm^2, so you'd get 200 of these cores on 16nm for each Zen 2 core on 7nm.
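
Rough arithmetic behind that normalization, taking the comment's figures at face value (75 mm^2 for an 8-core Zen 2 die, 200 Celerity cores per Zen 2 core's worth of area); the implied per-tile area is a back-of-the-envelope derivation, not a published number:

    #include <stdio.h>

    int main(void)
    {
        double zen2_die_mm2  = 75.0;   /* 8-core Zen 2 die on 7 nm (figure from the comment) */
        double zen2_cores    = 8.0;
        double ratio         = 200.0;  /* claimed Celerity cores per Zen 2 core's area */

        double area_per_zen2 = zen2_die_mm2 / zen2_cores;  /* ~9.4 mm^2 per Zen 2 core */
        double area_per_tile = area_per_zen2 / ratio;      /* implied ~0.047 mm^2 per 16 nm tile */

        printf("Zen 2 core: %.2f mm^2, implied Celerity tile: %.3f mm^2\n",
               area_per_zen2, area_per_tile);
        return 0;
    }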

If it's going to be compared to something though, I think a GPU would make much more sense.



