
A Tiny Chip That Could Disrupt Exascale Computing (2015) - GUNHED_158
http://www.nextplatform.com/2015/03/12/the-little-chip-that-could-disrupt-exascale-computing/
======
trsohmers
Founder of REX here, and surprised to see this posted here. Happy to answer
any questions, and you can check my comment history for some of my prior posts
on REX.

We've had some really great progress that we hope to share in the near future,
so stay tuned.

EDIT: Since this article is over a year old, we have made a lot of progress,
and have recently taped out our first chip. We haven't officially posted a job
opening, but we are very shortly going to be looking for software engineers
that would love to work on our architecture. Feel free to shoot me an email if
you're interested!

~~~
mechagodzilla
Did you stick with the parallel, SERDES-less interfaces for your interchip
I/O? 48 GB/s implies a pretty high signalling rate to not have a CTLE, DFE,
etc.

Why 3 interchip links? What network topology are you planning to use to scale
to large numbers of chips? If you're still using parallel I/O, how are you
planning to communicate beyond a single PCB?

What memory interface are you using? The article seems to confuse your
interchip links with your memory controller.

~~~
trsohmers
We have partnered with a startup (we'll announce who soon enough) who shared a
lot of ideas about chip-to-chip I/O with me. While they call it a SerDes, it is
in fact a source synchronous (clock forwarded) link that sends 5 bits over 6
wires. It is silicon proven, and is capable of up to 125Gb/s over 12mm while
being a little over 10x more energy efficient (in terms of pJ/bit) than other
available VSR SerDes. Obviously it is short reach over PCB, but we imagine
(yet to be tested) we can extend that reach a bit more using a more exotic PCB
laminate (Megtron, Rogers, etc), or going over wire (tested to go over 6
inches using a HuberSuhner SMA cable). Right now, we are only using it to go
between chips in a Multi Chip Module, or under 12mm on a PCB. A big bonus: as
of a month ago, it is a JEDEC standard!
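
For a rough sense of what a ~10x pJ/bit advantage means at these data rates,
here is a back-of-the-envelope sketch (the 0.5 and 5.0 pJ/bit figures are
placeholders I picked to illustrate the gap, not REX's or anyone's published
numbers):

    # Link power = data rate * energy per bit. Values are illustrative only.
    def link_power_watts(gbit_per_s, pj_per_bit):
        return gbit_per_s * 1e9 * pj_per_bit * 1e-12

    print(link_power_watts(125, 0.5))  # ~0.06 W at 125 Gb/s, assumed 0.5 pJ/bit
    print(link_power_watts(125, 5.0))  # ~0.63 W for a ~10x less efficient SerDes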

Most of the information in the linked article is very outdated (~16 months
old), so we have decided to ditch the idea of having a separate DRAM and
"External I/O" and just have our chip-to-chip on all four sides of the chip.
The chip-to-chip interface uses the same protocol as our Network On Chip, and
expands in the same 2D mesh. We also have a sketched-out plan for directly
interfacing this I/O with HBM dies that can be in the same MCM package. As far
as supporting other memories/IOs, we are leaning
towards having "adapter chips" that would convert our chip-to-chip interface
to DDR4, Ethernet, Infiniband, etc.
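
For readers not familiar with mesh NoCs, here is a minimal sketch of
dimension-order (XY) routing, the textbook scheme for a 2D mesh; REX has not
said this is exactly their routing algorithm, so treat it as illustrative:

    # Minimal dimension-order (XY) routing on a 2D mesh: route along X first,
    # then along Y. Illustrative only, not necessarily REX's actual scheme.
    def xy_route(src, dst):
        x, y = src
        hops = []
        while x != dst[0]:
            x += 1 if dst[0] > x else -1
            hops.append((x, y))
        while y != dst[1]:
            y += 1 if dst[1] > y else -1
            hops.append((x, y))
        return hops

    print(xy_route((0, 0), (3, 2)))  # 5 hops: (1,0) (2,0) (3,0) (3,1) (3,2)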

As far as bandwidth numbers, our aggregate bandwidth for this test chip we
have just taped out (16 cores + 2 chip-to-chip I/O macros on TSMC 28nm, 12mm^2
in size) is 60GB/s, though for the planned production chip we will be over
256GB/s. I have a good feeling we will be a fair margin higher than that, but
I would rather under promise and over deliver.

~~~
tmzt
Would it be possible to interface with HyperTransport or QPI? Can you name the
JEDEC standard?

~~~
trsohmers
I highly doubt that a direct interface would be possible with either of them,
though if you really wanted it, you could make an adapter (though fat chance
Intel would open up QPI enough to allow for it). We haven't officially
announced the partnership, though I can point you at JESD247.

------
dewster
From the article: “Caches and virtual memory as they are currently implemented
are some of the worst design decisions that have ever been made,” Sohmers
boldly told a room of HPC-focused attendees at the Open Compute Summit this
week.

As a lay processor designer, I couldn't agree more. I don't like VLIW, but
this architecture makes a lot of sense. I think it has taken until now for
compiler technology to catch up with what is possible in hardware.

Almost all the good ideas in computing were mined out long ago; the trick, I
think, is to get the computing world to give up on those which are holding
things back (cold dead hands if necessary).

------
Nomentatus
This is a 2015 story; I remember reading it then. A Google News search shows
only a couple of articles this year about Rex Computing and only one tiny bit
of news, that they're at tapeout. That's probably par for the course for a
startup creating product (or prototype) one.
[http://semiengineering.com/power-centric-chip-architectures/](http://semiengineering.com/power-centric-chip-architectures/)

also a speaking engagement: [http://insidehpc.com/2016/01/call-for-papers-supercomputing-frontiers-in-singapore/](http://insidehpc.com/2016/01/call-for-papers-supercomputing-frontiers-in-singapore/)

and a comment elsewhere that mentions another approach: the "Mill CPU of Mill
Computing"

As I recollect (perhaps quite wrongly) Itanium (VLIW) failed because compiler-
writers couldn't really be bothered or couldn't mount the learning curve. So
I'm most curious about what progress is being made on the compiler side.

~~~
trsohmers
You are correct that we have already taped out; we haven't made any
announcements yet, but we will be talking publicly about it in the future with
a big focus on the "magic" on the software side.

You can read my comments on the Mill architecture elsewhere on HN (not a fan
of stack machines), but my biggest disappointment in them is the fact that
they have been working on Mill for ~10 years with a team ranging from 5 to 20
(from what I have heard) and have yet to get to silicon, while we have gone
from a completely custom architectural idea to tapeout in ~11 months from
closing our first seed funding.

The big technical failure point for Itanium (in my opinion) is the fact that
Intel took the relatively pure VLIW research by Josh Fisher @ HP Labs and
tried to add a ridiculous number of features (and attempted x86 compatibility)
that impacted the ability to statically schedule instructions. The resulting
bastard architecture Intel called "EPIC" (rather than VLIW) made it very
difficult for the compiler to generate instruction-parallel code, since Intel
added a huge amount of nondeterminism into the architecture that
goes against the original VLIW tenets. If your compiler has to assume the
worst case latency for all instructions and memory operations, you are going
to have a bad time.
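
A toy sketch (my own, not Intel's or REX's toolchain) of why that matters: in a
statically scheduled machine, every instruction that depends on a load gets
pushed out by whatever latency the compiler must assume, so an indeterminate
(worst-case) latency bloats the whole schedule.

    # Toy static schedule for a dependent chain: load -> add -> mul.
    # The only variable is the latency the compiler must assume for the load.
    def issue_cycles(assumed_load_latency):
        cycles = {"load": 0}
        cycles["add"] = cycles["load"] + assumed_load_latency  # waits on the load
        cycles["mul"] = cycles["add"] + 1                      # 1-cycle ALU assumed
        return cycles

    print(issue_cycles(2))    # deterministic, scratchpad-style latency
    print(issue_cycles(100))  # worst-case miss latency a cache forces you to assume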

~~~
gpderetta
Itanium failed for the same reason every other VLIW failed as a general
purpose CPU: there just isn't enough information at compile time to model the
dynamic properties of a program. In fact many of Itanium's additions (strange
instruction packing, alias disambiguation hardware) were attempts at
overcoming this issue.

The only moderately successful general purpose VLIWs are Transmeta's Crusoe and
the related Denver, and they use a runtime translation layer to collect the
required dynamic information.

~~~
trsohmers
The vast majority of the dynamic parts of a program that matter for scheduling
(both when it comes to ILP/avoiding hazards within a core and when it comes to
handling memory management for our scratchpad based memory system) are due to
indeterminate latencies for memory accesses and executing instructions (due to
variable length pipelines). Throw in horrible (for determinism) things like
out-of-order execution and branch prediction, and no wonder a compiler
can't determine things statically! While we are not really targeting general
purpose (though I would say we have the capability to evolve to it in the
future), it seems painfully obvious to me where these issues have been in any
general-leaning VLIW attempt in the past, and I can't understand the clinging
to bad architectural decisions made by hardware folks 30 years ago who could
not imagine what software would be capable of in the future. </rant>
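
As a concrete picture of what memory management on a scratchpad-based system
can look like in software, here is a hypothetical double-buffering sketch
(dma_fetch and compute are stand-ins of mine, not a real REX API): the next
tile is fetched while the current one is processed, so memory latency is hidden
by construction rather than by a cache.

    # Hypothetical double buffering on a software-managed scratchpad.
    # dma_fetch and compute are stand-ins, not a real REX API.
    def dma_fetch(tile):
        return tile                 # stand-in for an async DMA into scratchpad

    def compute(data):
        return sum(data)            # stand-in for the per-tile kernel

    def process_tiles(tiles):
        results = []
        pending = dma_fetch(tiles[0])                        # prefetch first tile
        for i in range(len(tiles)):
            nxt = dma_fetch(tiles[i + 1]) if i + 1 < len(tiles) else None
            results.append(compute(pending))                 # overlap with next fetch
            pending = nxt
        return results

    print(process_tiles([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]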

Targeting general purpose from the get-go is a bad idea, but it is NOT
impossible to do efficiently and without sacrificing performance. You just need
a well-defined and constrained architecture, and a clean way to describe it.

~~~
gpderetta
You have your causality relations reversed: the reason that branch prediction
and dynamic caches exist is that jump targets and working sets are hard or
impossible to compute statically.

Even in the restricted world of HPC, GPGPUs have been moving from statically
scheduled, exposed-pipeline VLIW machines to more conventional SIMD with caches,
virtual memory and branch prediction (no meaningful OoO yet as the large
amount of thread parallelism can hide the memory latency).

Also, GPGPUs have the benefit of the large, lucrative GPU gaming market to pay
for their development. How can a pure HPC machine be competitive in this
market? Even for Intel, Xeon Phi is more of a prestige project than something
actually meant to make money.

~~~
trsohmers
I've spent a long time debating with VLIW haters (among whom I presume you
count yourself), but I'd love to see any citations you have for your claim that
my causality is reversed, as I have a ton of evidence (to be fair, not published
yet) on my side. While not as generally applicable as our architecture,
you can take a look at basically any DSP from the past 15 years and see that
VLIW works great from a performance and efficiency standpoint when your data
is in a constrained form. We're showing that a compiler can structure a lot of
different types of data (and the code required to actually operate on it)
effectively if there are enough constraints on the hardware. Fairly pointless
to try to convince you without documentation on hand for all parties, but hope
you'll take a look in a couple of months.

As far as market, we are going after a decent-sized market where the customers
care the most about efficiency and performance, and are not only willing but
very eager to swap out their current solutions for whatever is best. As the
typical startup claims, we are able to do it for a fraction of the cost and in
a fraction of the time of one of the big guys, and have a solution that is 10x
better than what is out there. NVIDIA boasts that they spent $1 billion
developing the Pascal architecture, and they sell the Tesla series GPUs based
on it at $5,000+ a unit. We've shown we can prototype something that can
theoretically beat it for under $2 million, and our hope/bet is that we can
take it to market (and actually beat it by an order of magnitude) for less than
$25 million. That's just HPC, which doesn't include the very interesting
high-end DSP area that is now using very expensive and power-hungry FPGAs for
wireless baseband solutions, a space we think is a very good fit for us.

~~~
p1esk
Just to clarify: are you trying to compete with Nvidia, or with Intel? If
you're going against GPUs, is your chip something that can run neural networks
(better than Nvidia)?

~~~
trsohmers
Short answer: If we were to implement SIMD FP16 support similarly to the dual
FP32 mode we have planned for our FP64 FPU, we would be able to easily match GPU
performance by throwing more cores at the problem, while still being more
efficient. While neural nets/machine learning is interesting, and we could
potentially enable it in new forms as we can provide a desktop GPU's
capability in a much smaller/lower power form factor, it is not our main
focus. As the other commenter noted, there are ASICs that do a good job at
that, though since we are more generally programmable than those sorts of
ASICs, we would be able to handle changes in algorithms over time while some
may not be able to.
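
The arithmetic behind "throwing more cores at the problem" is simple enough to
sketch; the core count below is the planned 256, but the clock and
flops-per-cycle figures are placeholders of mine, not REX's numbers:

    # Back-of-envelope FLOPS scaling if a 64-bit FPU can be split into
    # 2x FP32 or 4x FP16 lanes. Clock and per-cycle figures are assumptions.
    cores, ghz, fp64_flops_per_cycle = 256, 1.0, 2          # FMA counted as 2 flops
    for precision, lanes in [("FP64", 1), ("FP32", 2), ("FP16", 4)]:
        print(precision, cores * ghz * fp64_flops_per_cycle * lanes, "GFLOPS")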

The more interesting problems for us are things that GPUs can't do well, such
as level 1 (vector) and level 2 (matrix-vector) BLAS operations. While most
GPUs (and CPUs when utilizing SIMD instructions) only get a couple of percent
of the performance on level 1 and level 2 BLAS compared to level 3
(matrix-matrix), we are equally performant across all three (and at a very high
percentage of theoretical peak).
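
For anyone who doesn't have the BLAS levels memorized, the distinction being
made (shown with numpy purely for clarity) is that levels 1 and 2 do O(n) or
O(n^2) work on a comparable amount of data, so there is little reuse and they
end up memory-bandwidth bound on most CPUs and GPUs, while level 3 does O(n^3)
work on O(n^2) data:

    # The three BLAS levels being compared, illustrated with numpy.
    import numpy as np

    n = 1024
    x, y = np.random.rand(n), np.random.rand(n)
    A, B = np.random.rand(n, n), np.random.rand(n, n)

    axpy = 2.0 * x + y   # Level 1: vector-vector,  O(n)   flops on O(n)   data
    gemv = A @ x         # Level 2: matrix-vector,  O(n^2) flops on O(n^2) data
    gemm = A @ B         # Level 3: matrix-matrix,  O(n^3) flops on O(n^2) data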

~~~
p1esk
Interesting. Which applications require vector-vector or matrix-vector
operations as opposed to matrix-matrix?

------
amelius
> there is no virtual memory translation happening, which in theory, will
> significantly cut latency (and hence boost performance and efficiency). This
> means that there is one cycle to address the SRAM, so “this saves half the
> power right off the bat just by getting rid of address translation from
> virtual memory.”

In protected mode (i.e., what the kernel is using), will an Intel processor
not also disable virtual memory lookup? Couldn't we just recompile scientific
software to a protected mode environment to get those same benefits?

Also, I think it is more useful and fair to compare against a GPU than a
general purpose CPU.

(As an aside, I don't see where the reduced latency gives such a big
advantage. There will be latency anyway, so in any case your software has to
deal with waiting in an efficient way (doing useful stuff in the meantime).
Shaving off some latency will only help if your software design was bad to
begin with.)

------
SeanDav
It would just be great to get a decent chip that does not have built-in,
unblockable back doors, like those on Intel, AMD and probably ARM.

~~~
mvdwoord
I see a good opportunity for government to make this a reality. I'm not per se
a fan of government regulation for many things, but I don't see this moving
forward very fast otherwise. There are initiatives left and right (e.g. Talos),
but if a significantly large government body (the EU?) were to make it a
requirement, that might change the game. Lobbyists would probably convince them
otherwise (you need closed HW to catch terrorists... etc).

------
ridgeguy
I'm curious about the thermal issues.

From the article, the power density is (4 W)/ (0.1mm^2), or 40W/mm^2. Intel's
Haswell chip has a TDP of ~ 65W, an area of 14.7mm^2, for a power density of
4.4W/mm^2.

Is this power density a cooling challenge?

~~~
trsohmers
First note: The article is ~16 months old, so is outdated on some measures.
I've corrected the numbers below, but in either case, you seem to have confused
the size of a core with the size (and power) of an entire chip consisting of
multiple cores.

After tapeout of our first test chip, the final size for one of our cores is
0.27mm^2 (including the SRAM that makes up the scratchpad memory) on TSMC's
28nm process. We actually came in using fewer gates than originally
anticipated, and our size without SRAM is a little less than 0.01mm^2.

Now, just going by what is in the linked article: the diagram comparing sizes
is for _single cores_ (0.1mm^2 estimate back then for a Neo core,
14.5mm^2 for a single Intel Haswell core). The power numbers in the table
below that are for entire chips. You are quoting 65W for a single core, which
is incorrect... The 65W Haswell chip I believe you may be referring to is the
4770S, which is 4 cores @ 65 watts, and looks like it has a die size of
177mm^2.

Calculating this out using our current numbers, our planned full 256 core chip
has changed a bit (doubled the performance since last year, doubled the power
due to adding more stuff), and we estimate the TDP to now be 8 watts over a
~100mm^2 die, which gives us a power density of 0.08W/mm^2. Intel would then have
65W / 177mm^2 = 0.367W/mm^2.
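
A quick check of that arithmetic:

    # Power density from the figures quoted above.
    print(8 / 100)   # REX estimate: 0.08 W/mm^2
    print(65 / 177)  # Haswell 4770S: ~0.367 W/mm^2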

As would make sense in the case where we are claiming lower power operation,
our power density is also lower.

~~~
ridgeguy
Thanks very much for clarifying. That the 4W didn't apply to a single core
fell through a cognitive crack.

The power density is impressively low, indeed. Looking forward to more info in
Sept.

------
gpderetta
This chip was discussed on RealWorldTech a while ago:
[http://www.realworldtech.com/forum/?threadid=151566](http://www.realworldtech.com/forum/?threadid=151566)

Let's say it wasn't well received.

------
KKKKkkkk1
There is nothing to disrupt. Exascale computing is a hoax perpetrated on the
US government by unscrupulous hardware vendors. Kudos to Rex for grabbing a
piece of that action.

