
Hey, CEO of REX here... would be happy to answer questions.

P.S. We're hiring Chisel developers! If you don't know Chisel but have RTL experience and want to learn, we'd love to have you learn on the job! Check out our website: http://rexcomputing.com

Chisel is potentially a revolution in hardware design, and I am following it intently, but I have not yet heard of anyone producing an actual chip with it at a standard fab. I am trying to make the case for trying it at my work, but until we get more feedback from people in industry, I think it may be too risky an endeavour. It would be great if you could share your experiences or point me to some papers that would help me build that argument.

UC Berkeley has taped out ~10-12 chips designed entirely in Chisel, taken through the standard flow and fabbed at TSMC (as low as 28nm), and all have functioned. My startup has had minimal problems using Chisel and going through both Cadence and Synopsys tools (most if not all of the problems were user error :P)

Once we get closer to having silicon in hand, I'd love to publish our experience as both a startup making a new processor in this day and age, along with using Chisel and other new tools.

So is the idea that the compiler would try and optimize it so that both code and data would be kept local in the scratchpad memory and if there was a scratchpad "miss" the cores would DMA the needed memory locations from DRAM to the scratchpad?

DRAM, or preferably a closer core. The memory on chip is all physically addressed, and part of a flat global address space. The first 128KB of the address space is core 0's memory, then the next 128KB is core 1's, and so on to core 255. When a core accesses a memory region not in its own local scratchpad, the request hops along the network on chip (with one cycle per hop) to get to the core which has the needed memory address. The compiler would try to keep the data a core needs in that core's local scratchpad, or if it can't, as close as possible. Even in the worst-case scenario where a core needs to access the memory in the opposite corner (core 0 accessing core 255), it is still only 32 cycles to access it (less than the ~40 cycles it takes to access L3 cache on an Intel chip).

The NoC is also entirely non-blocking... a router is able to read/write to its core's scratchpad and do a passthrough in the same cycle.
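To make the address map above concrete, here is a minimal Python sketch of how a flat physical address could be split into an owning core and an offset, and how a hop count might be estimated. The 128KB-per-core and 256-core figures come from the comment above; the 16x16 row-major mesh layout, the Manhattan-distance routing, and the function names are assumptions made purely for illustration, not REX's actual design.

```python
SCRATCHPAD_BYTES = 128 * 1024  # 128KB of scratchpad per core (from the thread)
NUM_CORES = 256                # cores 0..255 (from the thread)
MESH_WIDTH = 16                # ASSUMPTION: 256 cores laid out as a 16x16 grid


def owning_core(addr):
    """Return (core_id, offset) for a physical on-chip address."""
    if not 0 <= addr < NUM_CORES * SCRATCHPAD_BYTES:
        raise ValueError("address is outside on-chip scratchpad memory")
    return divmod(addr, SCRATCHPAD_BYTES)


def hops(src_core, dst_core):
    """Manhattan distance between two cores on the ASSUMED row-major mesh."""
    src_row, src_col = divmod(src_core, MESH_WIDTH)
    dst_row, dst_col = divmod(dst_core, MESH_WIDTH)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


if __name__ == "__main__":
    core, offset = owning_core(3 * SCRATCHPAD_BYTES + 64)
    print(core, offset)   # address falls 64 bytes into core 3's scratchpad
    print(hops(0, 255))   # corner-to-corner distance on the assumed grid
```

On this assumed layout the corner-to-corner distance comes out to 30 hops. The sketch only counts hops, so it does not try to account for whatever else contributes to the 32-cycle figure quoted above.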

I'm a layman, but the information on your website reminds me both of what GreenArrays are doing and of the '80s INMOS transputers. People might want to know how those compare with what you're working on.

What differentiates your company's Neo chip from Adapteva's Epiphany co-processor?

A number of things... The first thing is that Neo is not a coprocessor, it is a fully independent many core processor. To quickly go over the basics:

1. Neo has a 64 bit core, and conforms to the IEEE 754-2008 floating point standard... Epiphany is 32 bit, and is not fully IEEE compliant (along with only being capable of single precision FP).

2. The existing Epiphany chips cap out at 32KB of local memory per core (with the Epiphany IV having a total of 2MB of on chip memory), while the planned Neo chip will have 128KB of local memory per core (32MB of on chip memory).

3. Epiphany is limited to using its 4 eLink (based on ARM's AXI interface) connectors to access the outside world, and would typically be connected to either other Epiphany chips or to its host processor. Each eLink port only supports 1.6GB/s of bidirectional traffic, giving a total of 6.4GB/s of aggregate chip bandwidth. For Neo, we have developed a new 96GB/s (bidirectional, 48GB/s each way) interface with either 3 or 4 interfaces per chip, giving an aggregate chip-to-chip bandwidth of 288-384GB/s.

4. Neo can directly address DRAM attached to it, instead of having to go through a host processor.

5. Neo is a quad-issue VLIW core (capable of 1 64 bit ALU op, 1 64 bit FPU op or 2 32 bit FPU ops, and 2 load/store ops every cycle) compared to Epiphany's standard superscalar core (capable of 1 32 bit ALU op, 1 32 bit FPU op, and 1 load/store op per cycle).

All of this adds up to Neo actually being a commercially viable (for industry, not hobbyists) processor. Above all, memory bandwidth has been what kills Epiphany and completely prevents it from reaching its advertised performance.
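For anyone who wants to check the memory and bandwidth totals in points 2 and 3, the arithmetic can be sketched in a few lines of Python. The per-core and per-port figures are the ones quoted above; the 256-core count for Neo comes from the earlier address-map description, and the 64-core count for Epiphany IV is inferred from 2MB / 32KB. The helper names are made up for this example.

```python
KB_PER_MB = 1024


def on_chip_memory_mb(cores, kb_per_core):
    """Total on-chip scratchpad memory in MB."""
    return cores * kb_per_core / KB_PER_MB


def aggregate_bandwidth_gbs(ports, gbs_per_port):
    """Aggregate off-chip bandwidth in GB/s across all ports."""
    return ports * gbs_per_port


# Point 2: local memory totals
epiphany_iv_mem = on_chip_memory_mb(64, 32)    # 64 cores inferred from 2MB / 32KB
neo_mem = on_chip_memory_mb(256, 128)          # 256 cores, 128KB each

# Point 3: aggregate chip bandwidth
epiphany_bw = aggregate_bandwidth_gbs(4, 1.6)  # 4 eLink ports
neo_bw_low = aggregate_bandwidth_gbs(3, 96)    # 3 interfaces
neo_bw_high = aggregate_bandwidth_gbs(4, 96)   # 4 interfaces

if __name__ == "__main__":
    print(epiphany_iv_mem, neo_mem)
    print(epiphany_bw, neo_bw_low, neo_bw_high)
```

Taking the quoted figures at face value, the results match the 2MB vs 32MB and 6.4GB/s vs 288-384GB/s totals given in the list.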
