

Startup to Open-Source Parallel CPU - trsohmers
http://www.eetimes.com/document.asp?doc_id=1324759

======
lavomme
First of all, good luck to you guys.

I've worked at a start-up company similar to yours.

We developed a 256-core RISC processor in which all cores used a single shared
memory, instead of a separate memory block per core with DMA for transfers.

How do you intend to synchronize work between the different cores? How will a
compiler abstract away the memory synchronization? Which programming language
is going to be used? So many questions, as this is such a complex area of
computing...

From my personal experience of over 5 years developing such a chip at a
start-up company, the cost of production will probably be a huge obstacle.
Good luck!

~~~
trsohmers
Thanks! If you don't mind me asking, what was the name of the company?

Technically, you don't need to synchronize the cores... it's a MIMD/MPMD
system that is not in lockstep. It is up to the programmer (with help from the
compiler) to make sure you don't do anything too stupid ;) As for programming
languages, C and Fortran are the big ones for us. We hope that once our LLVM
backend is improved, you'll be able to run anything you like on the cores
themselves. As for the programming model, the three we like best are CSP (Go),
Actor (Erlang), and PGAS (C, Fortran, Chapel, a few other research ones). If
you're familiar with SHMEM, that's the closest thing we can think of
currently.
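
To give a flavor of that SHMEM/PGAS style, here's a minimal standard OpenSHMEM
sketch (just an illustration of the model; it is not our actual API):

    /* Minimal OpenSHMEM example: every PE writes its ID into the next
     * PE's memory with a one-sided put, then everyone synchronizes.  */
    #include <shmem.h>
    #include <stdio.h>

    int main(void) {
        shmem_init();
        int me = shmem_my_pe();        /* this core's ID            */
        int n  = shmem_n_pes();        /* total number of cores/PEs */

        static long src = 0, dst = 0;  /* symmetric (globally addressable) */
        src = me;

        /* One-sided put: write our value into the next PE's memory. */
        shmem_long_put(&dst, &src, 1, (me + 1) % n);
        shmem_barrier_all();           /* explicit synchronization point */

        printf("PE %d received %ld\n", me, dst);
        shmem_finalize();
        return 0;
    }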

When it comes to cost, we're trying to develop as much as possible ourselves
to reduce licensing costs. We've been able to do a pretty good job (if I do
say so myself) as just two people so far with no capital. Fabrication costs
are the killer, at roughly $500k per shuttle run and $5m-$7m for a mask set
when we actually go to full production.

~~~
lavomme
[http://plurality.com/](http://plurality.com/) - the website is not very good
and the company is dead. You can read more on Wikipedia:
[https://en.wikipedia.org/wiki/Plurality_%28company%29](https://en.wikipedia.org/wiki/Plurality_%28company%29)

All our cores shared the same memory for both data and instructions, and we
based our workload synchronization on hardware instead of software. It yielded
such a huge speedup in execution time that most companies simply dismissed our
results as fake. :)

Choosing a programming language is crucial - we went with C and a declaration
language for tasks. Today (4 years after we closed the company) I am not sure
whether it was the best decision. The simpler the parallel definitions in the
code, the better. Programmers get confused easily.

------
trsohmers
I'm the founder and CEO of REX... Check out our website for a brief overview
([http://rexcomputing.com](http://rexcomputing.com)) and feel free to ask
questions here!

~~~
alain94040
Impressive for your age. But how is this any different from the millions of
identical proposals that never got anywhere? Packing plenty of ALUs with some
sort of basic network hasn't really worked out in real life. The first block
diagram in the article is so basic it's concerning.

Apologies for being critical; I wish you the best.

~~~
trsohmers
The article is incorrect in saying that the (very simplistic... it's supposed
to just show the basic components per core, not how it actually functions)
diagram of a single core is the whole chip... our chip design has 256 of those
cores (so in total, 256 64-bit ALUs and 256 64-bit FPUs). We have a custom
mesh network that allows for atomic operations from any core to any adjacent
core, and DMA operations that allow a read or write to any core's local
scratchpad memory from any other scratchpad memory.

In comparison to other architectures, we have chosen to stick to RISC instead
of some crazy VLIW or very long pipeline scheme. In doing this, we limit
compiler complexity while still having a very simple/efficient core design,
and thus hopefully keep every core's pipeline full and free of hazards. The
idea is that we just want a bunch of very simple and focused SPMD cores, so
that we can have a MIMD/MPMD chip.
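
To sketch what those DMA operations might look like from software (the
rex_dma_* names below are purely hypothetical placeholders, not our real API,
and the stubs just simulate the transfer with memcpy so the example runs):

    /* Hypothetical sketch of core-to-core scratchpad DMA on a mesh.
     * rex_dma_put/rex_dma_wait are placeholder names, not a real API. */
    #include <string.h>
    #include <stdio.h>

    static void rex_dma_put(void *remote_dst, const void *local_src,
                            size_t len) {
        memcpy(remote_dst, local_src, len); /* stand-in for the DMA engine */
    }
    static void rex_dma_wait(void) { /* would block on DMA completion */ }

    int main(void) {
        double my_tile[4] = {1, 2, 3, 4};
        double neighbor_scratchpad[4] = {0}; /* pretend remote scratchpad */

        /* Queue a put into the neighbor's scratchpad, overlap compute,
         * then wait for completion before reusing the source buffer.   */
        rex_dma_put(neighbor_scratchpad, my_tile, sizeof my_tile);
        /* ... more FLOPs here while the DMA engine moves data ... */
        rex_dma_wait();

        printf("neighbor got %.0f\n", neighbor_scratchpad[3]);
        return 0;
    }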

We are currently fixing the bugs on our single core FPGA demo, and hope to
have our full 256 core cycle accurate simulator done by ~January/February. We
want to release that (and our currently very early compilers) to the public
ASAP.

------
solobratsche
Also wishing you good luck, and recommending a change of focus: first figure
out what kind of real-world problems this architecture can handle, and how to
handle them best. The architecture will perform very well if the number of
operations that can be done on a set of data fitting into the scratchpad
memory is large, so that communication overhead is small, or if the problem
can be mapped onto the grid in a way that only requires communication with
neighbours. However, I would assume that typical real-world big-data problems
don't fulfill these requirements, and problems that do fulfill them also run
well on classical architectures. As soon as you start to need a lot of data
transfer, only the cores close to the border of the grid will be able to work
as they receive data, the ones at the border will be busy forwarding data to
and from memory, and the innermost ones will just wait...

Therefore, before investing a lot of effort, money, and time into hardware
that many others are also building in very similar ways, and then asking the
community to figure out how to use it, spend the energy on innovative ideas
for actually using such architectures efficiently, i.e. languages, profiling
and debugging tools, ... That's where the real innovation is needed and where
there is a lot of room for improvement. And if you want to stay with hardware,
it is probably much easier to convince investors if you can show, in a
simulation or FPGA prototype, that 2 or 3 real-world applications with very
bad performance on classical machines achieve a substantial boost on your
architecture...
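
To make the trade-off concrete, here is a back-of-envelope check one can do;
all the numbers below are placeholder assumptions, to be replaced with the
real figures for the chip and the application:

    /* Back-of-envelope compute-vs-communication check for a scratchpad
     * core. All numbers are placeholders; plug in the real figures.   */
    #include <stdio.h>

    int main(void) {
        double flops_per_sec   = 1e9;    /* per-core peak, e.g. 1 GFLOP/s */
        double link_bytes_sec  = 16e9;   /* per-link bandwidth            */
        double bytes_moved     = 128e3;  /* refill a 128 KB scratchpad    */
        double flops_performed = 1e6;    /* work done per refill          */

        double t_compute = flops_performed / flops_per_sec;
        double t_comm    = bytes_moved / link_bytes_sec;

        /* If t_comm dominates, the inner cores starve and only the cores
         * near the border do useful work, as argued above. */
        printf("compute %.2e s vs communication %.2e s -> %s-bound\n",
               t_compute, t_comm,
               t_compute > t_comm ? "compute" : "bandwidth");
        return 0;
    }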

------
nkurz
_Unlike GPUs and other SIMD accelerators, Neo's MIMD processor design
leverages independent program counters and instruction registers in each core
to allow different operations to be performed in parallel on separate pieces
of data._[1]

Other than the grid interconnect, how does the architecture differ from Xeon
Phi? In particular, what allows it to get such dramatically higher power
efficiency? I'd have guessed that Intel's smaller process would make it
difficult to match.

[1] It feels awkward that this sentence is included twice on such a short
page.

~~~
trsohmers
Thanks for that... I didn't notice that we had a repeated section (I just
updated it with the proper text for that section).

As for comparison to the Phi... it is the fact that our actual core size (and
thus the whole chip) is MUCH smaller. The Phi's cores are actually based on
the original Pentium architecture (they are just downsized P54Cs) with added
AVX instructions. Contrary to popular belief, they are not full x86-64 cores.

Intel has not released the official die size for the Phi, but has said it is
about ~5 billion transistors (at a 22nm process), and independent
"guesstimates" have pegged the die at 600-700mm^2 (there is one place that
says 350mm^2, but that is false). The top of the line 61-core Xeon Phi uses
300W, with a theoretical peak of 1.2 TFLOPs of double precision performance.
That gets you to about 4 GFLOPs/Watt.

In comparison, our entire chip will be under 100 million gates, with each core
(excluding memory) being around 100k gates. At a 28nm process, our core size
(without memory) is a little under 0.1mm^2. Our theoretical peak performance
per compute chip is 256 GFLOPs double precision, while it should use around 3
Watts, giving us a performance per watt ratio of ~85 GFLOPs/Watt.
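
For anyone who wants to check the arithmetic (using exactly the figures quoted
above):

    /* Quick check of the perf/W arithmetic above (figures as quoted). */
    #include <stdio.h>
    int main(void) {
        printf("Xeon Phi: %.1f GFLOPs/W\n", 1200.0 / 300.0); /*  4.0 */
        printf("REX:      %.1f GFLOPs/W\n",  256.0 /   3.0); /* 85.3 */
        return 0;
    }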

Intel has even said that their next generation Xeon Phi, made on their 14nm
process, will be at 14 to 16 GFLOPs/Watt. At SC14 last week, they made a soft
announcement that the following generation, on a 10nm process, will be in the
~2018 timeframe, and it is estimated at only around 25 GFLOPs/Watt.

The bottom line is that Intel is just following Moore's law and sticking to
big and complex systems, which, while retaining legacy compatibility, kill you
when it comes to efficiency.

~~~
nkurz
Interesting, and a great answer. The usual estimates I've seen put the power
efficiency overhead of the legacy x86 layer as small enough not to be a major
factor
([http://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa-power-struggles.pdf](http://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa-power-struggles.pdf)).
But I suppose as you shrink the core smaller and smaller, that small
mostly-fixed difference becomes a greater and greater part of the total power
budget.

~~~
trsohmers
I tend to dismiss most people who continue the RISC/CISC debate to this day
(Intel effectively conceded in ~2006... all modern x86 processors are actually
RISC processors at the lowest level; they just decode x86 instructions and
translate them into Intel's internal RISC-like micro-ops).

Then again, I don't think most popular "RISC" CPUs are very RISC-like (does
that make me a RISC hipster?). Even when making things general purpose, there
is no reason for ARM processors to be out-of-order, do branch
prediction/speculative execution, or any of these other crazy things... it
reduces efficiency in the long run.

Take a look at this, for instance:
[http://chip-architect.com/news/2013_core_sizes_768.jpg](http://chip-architect.com/news/2013_core_sizes_768.jpg)
... an AMD CPU and an ARM chip are virtually the same size at the same process
node. I find it insane that one of our cores is a bit less than 1/5th the size
of a Cortex-A7, and can do more FLOPs than it. Then again, we have focused on
doing exactly that, but still.

~~~
nkurz
Your 0.1 mm^2 was without memory, right? I don't know the exact numbers, but
to be fair, the ratio does become somewhat closer when you discount the area
on the A7 used for memory:
[http://www.arm.com/images/Single_Cortex-A7_core_layout_image.jpg](http://www.arm.com/images/Single_Cortex-A7_core_layout_image.jpg)

~~~
trsohmers
That is correct, but even with memory we should only be around 0.2mm^2 to
0.3mm^2, while having 4x the SRAM of the Cortex-A7.

------
TallGuyShort
My first thought was how this compares to Adapteva's Epiphany architecture
(best known for its use in their crowd-funded Parallella boards), so I was
happy to see this project was inspired by that one. Andreas Olofsson from
Adapteva has tweeted often about how difficult it is to get funding as a
silicon startup, though. I think the statement "it's a tough sell" will prove
to be a colossal understatement. But these are impressive kids and it's a
great concept - I wish them luck.

~~~
trsohmers
I was working with the Parallella boards for a long time... but there are
fundamental flaws in the architecture (missing instructions, it is only
32-bit, not IEEE754-2008 compliant, etc.) that make Epiphany not really
suitable for the markets we are trying to address. When we decided to go out
and make our own, we knew there were going to be a lot of difficulties, but
the idea is that if we are going to do a startup and give it our all, why not
try to do something big?

Because we are using mostly open source tools for our development (such as
Chisel: chisel.eecs.berkeley.edu), our development time has decreased and
productivity has gotten a huge boost compared to if we were just writing
Verilog. We also made the decision not to use off-the-shelf IP, which, while
more difficult to verify, we think makes for a much better system. Compare
this to the Epiphany implementations, which, while having a custom ISA and
basic core components, used an off-the-shelf ARM interconnect, an
off-the-shelf ARM memory compiler, and many other things to minimize
development time and verification. While our approach is a bit more difficult
upfront, since we are keeping our components very simple, our verification is
nowhere near as complex as a "normal" processor's. Plus we don't have to pay
$500k-$1m+ in upfront licensing fees.

------
sargun
Does anyone know what the actual CPUs are? It mentions it has 64 registers. My
guess is ARM / MIPS, based upon:
[http://en.wikipedia.org/wiki/Processor_register](http://en.wikipedia.org/wiki/Processor_register)

How much scratch memory is there? Is it SRAM?

Are they licensing anyone's IP for the interconnect or CPU? What's the
bandwidth of the interconnect? Is it packet-oriented? How does fair routing
work?

~~~
trsohmers
It's a custom ISA that we have developed... we developed it in parallel with
RISC-V (before they released their public 2.0 ISA), but we have diverged a
bit... ours is a static 64-bit ISA with no option for VLIW expandability, and
our FPU (and thus its instructions) is able to do two single precision
IEEE754-2008 FLOPs per cycle, and one double precision FLOP per cycle. We have
also added a set of DMA instructions.

We currently have 128KBytes of dual-ported SRAM per core (which is physically
part of the core, not a giant array somewhere else on the die). It has
single-cycle latency to the core's registers and to the Network-on-Chip
router.

The on-chip mesh network is a custom 128-bit-wide link going core to core. The
router can do a read or write to SRAM per cycle AND allow a passthrough to
another core in the same cycle.

Our chip-to-chip interconnect is a custom 64-bit (72-lane) parallel interface
allowing 48GB/s. There are two of these (unidirectional) interfaces per side,
giving you a total of 8 of these interfaces per chip.
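
Spelling out the per-lane data rate those numbers imply (counting only the 64
data bits per transfer; what the extra 8 of the 72 lanes carry is left aside
here):

    /* Per-lane rate implied by the interface numbers above. */
    #include <stdio.h>
    int main(void) {
        double iface_bytes_s = 48e9;  /* 48 GB/s per interface   */
        double data_bits     = 64;    /* data bits per transfer  */
        double xfer_rate     = iface_bytes_s / (data_bits / 8);  /* 6e9 */
        /* one bit per data lane per transfer -> per-lane bit rate */
        printf("%.1f Gb/s per data lane\n", xfer_rate / 1e9);    /* 6.0 */
        return 0;
    }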

~~~
trsohmers
I realized I didn't fully answer the chip-to-chip interconnect questions... it
is an extremely simple parallel interface (that I would not even call a "bus",
as that would imply it has a lot more control logic than it does). Instead of
having a full SerDes per lane, our solution is to use a very simple latch and
buffer, along with PLLs on each chip, to get a MUCH (50-70%) smaller and more
power efficient point-to-point connection between chips. As such, it is not
packet based, and we are really just focusing on moving a 64-bit word per
cycle.

~~~
mechagodzilla
I asked this previously without seeing that you had answered it here. So
you're claiming to have a 64-bit parallel interface, without a SERDES, running
at 6 Gb/s per lane? How are you maintaining bit alignment between lanes? Do
you have any in-line signal conditioning or anything (CTLE, DFE, etc.)?
Parallel interfaces are rarely run faster than 1 Gb/s, so 6 Gb/s sounds
unlikely.

~~~
trsohmers
Relevant paper (by Intel Research, actually)... there are quite a few
differences (we are keeping it a lot simpler on the tx and rx ends, but that
is some of our secret sauce... I can talk about it offline if you are really
interested)

[http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=648778...](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6487788&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6487788)

~~~
mechagodzilla
Getting anything approximating Intel's work, assuming you are rolling all your
own IP, is pretty ambitious (and probably a good deal more work than your core
design), even at _only_ 6 Gbit/s (faster than PCIe Gen2, btw). Just curious -
have you ever taped out a chip on a modern process before?

~~~
trsohmers
We aren't trying to reach the 128GB/s number listed in that paper, only
48GB/s... and our design is much simpler than theirs. In our design, each pin
is simply a buffer and a latch, synchronized with all the other pins in the
interface by a PLL... much easier to implement and run than a serializer for
even a single pin.

I myself have not taped out something on a modern process, but I have advisors
who have. My co-founder and I do have nanofab experience, so we understand the
physical complexities of fabrication first hand.

~~~
mechagodzilla
Ah, so the idea is to have 64 parallel bits coming in at 6 Gbits/s, along with
a slower clock that you multiply up to 6 GHz and use to sample the inputs?
That will be quite tricky to get working 1) without any analog signal
conditioning on the inputs (or outputs), and 2) without inter-bit skew making
it impossible to meet timing on your inputs. Best of luck to you, but the
interface alone sounds likely to be problematic.

~~~
trsohmers
That's the basic idea, but we can run it at 3GHz if we do DDR, or 1.5GHz doing
QDR. There is some extra magic there which I don't want to talk about publicly
just yet ;)

The biggest problem (even with our solutions for skew and crosstalk) is just
the number of pins/traces on the board, but that's not unsolvable... nothing a
~10-layer PCB can't solve.
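
The clock arithmetic behind those rates, for clarity:

    /* 6 Gb/s per pin at different signaling rates (arithmetic only). */
    #include <stdio.h>
    int main(void) {
        double lane_gbps = 6.0;
        printf("SDR: %.1f GHz clock\n", lane_gbps / 1.0); /* 6.0 */
        printf("DDR: %.1f GHz clock\n", lane_gbps / 2.0); /* 3.0 */
        printf("QDR: %.1f GHz clock\n", lane_gbps / 4.0); /* 1.5 */
        return 0;
    }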

------
vardump
I wonder what kind of penalties there are for transmitting data to neighboring
nodes, and likewise for receiving. If, for example, every node receives data
from a neighboring node, does a single fused multiply-add, and transmits the
result to a neighboring node, how many FLOPS do you get out of the whole
thing?

How big a chunk of computation do you need to do in a node for this to be
effective?

~~~
trsohmers
Our theoretical double precision performance per core (based on running at
1GHz) is 1 GFLOP per second, but that is based on doing an add or a multiply
each cycle. In the case of just doing FMAs, you would be doing 1.5 GFLOPs.
This gives you a (theoretical peak) 256 to 384 double precision GFLOPs per
chip. Our bandwidth between cores is 16GB/s, while the required bandwidth for
doing 1 GFLOP is 8GB/s.

Our single precision numbers are actually double those of double precision
(unlike a GPU or most other SIMD systems, which have independent FP32 and FP64
FPUs, we have a single combined unit). While our ISA is pure 64-bit, we have
packed 32-bit FPU instructions for doing two single precision FLOPs per cycle.
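
Spelled out with the figures above:

    /* Peak FLOPs arithmetic for the numbers above (figures as quoted). */
    #include <stdio.h>
    int main(void) {
        int    cores   = 256;
        double add_mul = 1.0;  /* GFLOP/s per core at 1 GHz       */
        double fma     = 1.5;  /* GFLOP/s per core, FMA-only code */
        printf("add/mul peak DP: %.0f GFLOPs\n", cores * add_mul); /* 256 */
        printf("FMA peak DP:     %.0f GFLOPs\n", cores * fma);     /* 384 */
        printf("packed SP peak:  %.0f GFLOPs\n", 2 * cores * fma); /* 768 */
        return 0;
    }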

~~~
vardump
So how many FMAs can I do if, for each FMA, I also receive 16 bytes of data
from a neighboring node and send 16 bytes of data to a neighboring node? Or is
the data transfer free; is the neighboring node memory mapped? If so, how does
synchronization work?

Edit: didn't notice the same was asked before too. Regardless, how many FMAs
can be executed in the scenario I gave, also sending 16 bytes and receiving 16
bytes per FMA?

~~~
trsohmers
The superscalar design means the Load/Store unit can operate independently
from the FMA and DMA units, i.e. an FMA operation on register operands will
not interfere with other operations elsewhere on the chip.

Synchronization should be handled through a dataflow-driven program design.
Any sort of mutex/semaphore/etc. will have to be software defined or interrupt
driven.

In regards to the neighboring node being memory mapped... are you asking about
another compute chip or another node (with the 16 compute chips + GaMMU)? All
of the compute chips in a grid are part of the same flat global memory map,
and have DMA capabilities between each other. Once you get out of the compute
grid (which is managed by the GaMMU), that is a separate memory space, but it
can be accessed through another layer, through something like MPI for
instance.
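
As a sketch of that software-defined, dataflow style of synchronization (C11
atomics stand in here for whatever the hardware actually provides; this is an
illustration of the pattern, not our actual primitives):

    /* Producer/consumer handshake through a flag in "scratchpad" memory. */
    #include <stdatomic.h>
    #include <stdio.h>

    static double     buf;        /* "scratchpad" data word */
    static atomic_int ready = 0;  /* software-defined flag  */

    static void producer(double x) {
        buf = x;                  /* write the data first   */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    static double consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                     /* spin until data arrives */
        return buf;
    }

    int main(void) {
        producer(42.0);           /* would run on core A */
        printf("%f\n", consumer()); /* would run on core B */
        return 0;
    }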

~~~
vardump
I guess this is one of those cases where you just need to get your hands dirty
to really understand it.

Can you do reasonably efficient [arbitrary size] fixed point arithmetic on
your hardware? Do you have 64-bit add with carry and 64-bit multiply with a
128-bit result? I'm interested in 64.64 and 128.128. The needed operations are
add, sub, and multiply.

I think compute grids like these could very well be an important part of
computing in the future, maybe even the most important part; I've thought so
ever since I first saw an article about the transputer. Grids or VLIWs, sadly,
software is always the pain point. I wish you luck; please get this working
and right.

