
Epiphany-V: A 1024-core 64-bit RISC processor - ivank
https://www.parallella.org/2016/10/05/epiphany-v-a-1024-core-64-bit-risc-processor/?
======
Coffeewine
This is fascinating:

The Epiphany-V was designed using a completely automated flow to translate
Verilog RTL source code to a tapeout ready GDS, demonstrating the feasibility
of a 16nm “silicon compiler”. The amount of open source code in the chip
implementation flow should be close to 100% but we were forbidden by our EDA
vendor to release the code. All non-proprietary RTL code was developed and
released continuously throughout the project as part of the “OH!” open source
hardware library.[20] The Epiphany-V likely represents the first example of a
commercial project using a transparent development model pre-tapeout.

~~~
LeifCarrotson
RTL = Register Transfer Level, and EDA = Electronic Design Automation, for
anyone else who was curious. I don't know what GDS stands for, but context
indicates it's the actual physical description that's used to make the part.

But I'm confused about what part of this is open and not open. Do they mean
that they imported their Verilog into a proprietary tool, which generates the
design? That doesn't make it open source in practice.

~~~
adapteva
HW design is not that different from SW design. Comp table below:

    HW          SW
    Verilog --> C/Java/etc
    EDA     --> GCC/LLVM
    GDS     --> Binary (elf)

The GDS is completely tied up in NDAs due to the foundry. The EDA
combines/translates open source code with proprietary blobs to produce a
"super secret" GDS binary blob that gets sent to the foundry for
manufacturing.

~~~
ChrisRus
> HW design is not that different from SW design.

Shouldn't be. But it is.

~~~
AceJohnny2
Except the economics are vastly different. The complexity and cost of
manufacturing, the computationally intensive cost of simulation and various
checks and optimizations (be it clock timing or mask optimizations to etch
features _that are smaller than the wavelength used to etch them_ ), all mean
that you can't just "compile and publish", and turnaround times are months,
not hours.

And there are no open-source toolchains for any of this. Implementing a SW
compiler is a student project; why isn't implementing an RTL compiler one too?

~~~
zanny
Nothing about the time frames or even production costs justifies the disparity
in how proprietary and closed hardware manufacturing is. For the exact reason
hardware and software are different, open sourcing your patterning toolchain
has nothing to do with your competitive advantage in actually having built
foundries with functioning lithography. The cost is in the latter; the former
is just abuse of position for power over the end user.

If anything, it hurts your bottom line. You would probably get more third-
party interest in having custom hardware printed if the toolchains were more
open. It is not a question of price, it's a question of exposure.

I'm not even talking about the 12-20nm stuff. That is still crazy expensive
because the hardware and software R&D was huge and these companies are
hoarding their toys like preschoolers because of a prisoner's dilemma in
regards to competitive advantage. But older 45-100nm plants, though often
still in use, remain just as inaccessible as ever to most hobbyist hardware
enthusiasts.

~~~
AceJohnny2
_> The cost is in the latter, the former is just abuse of position for power
over the end user._

Exactly, hence my question about "student projects" which is really about why
aren't there more OSS projects that challenge this. Is it because of the lack
of platforms to experiment on, or the inherent difficulty of the task?

~~~
seanp2k2
Thinking about this, yeah, it'd be amazing to e.g. have a community-driven
forum with some DIY CPU designs (Lisp machines!) and an affordable (let's say
under $1k per chip) way to get them made. We'll probably get there eventually,
but I'm not aware of where progress on this front stands.

------
adapteva
I am here, if anyone has questions. AMA! Andreas

~~~
crudbug
What would the cost estimate be for a PCIe board? For the chip, if this thing
reaches consumer hands?

Are you planning any production samples for research / universities / DARPA ?

~~~
adapteva
The chip is about the same size as the Apple A10, so in terms of silicon area
it's in the consumer domain, but price will only come down to consumer levels
if shipments get into millions of units. Big companies take a leap of faith
and build a product hoping that the market will get there. Small companies get
one shot at that. With University volumes and shuttles, we are talking 100x
costs. So the $300 GPU PCIe type boards become $10K-$30K with NRE and small
scale production folded in.

~~~
runeks
You should look into alternative financing methods.

How long is the period from needing the cash to pay for production to
availability in retail, roughly?

If it's all about volume, accumulating orders over a long period using some
non-reversible payment method could, perhaps, get you into millions of units.
It's all about how long people are willing to wait in order to save on per-
chip unit costs.

------
valarauca1
Two things immediately jump out

    
    
    Custom ISA extensions for deep learning, communication, and cryptography
    DARPA/MTO
    Autonomous drones, cognitive radio
    

The radar geeks are gonna love getting their hands on a ~250 GFLOP, 4 watt
processor.

~~~
mamcx
I have a naive question based on my dreams:

Is it possible to design a CPU that switches ON DEMAND between parallel and
linear operation? So, if we have 1000 cores, it switches to 10 cores with 10 x
10 the linear power?

In my dreams this was very useful, but I wonder how feasible it could be ;)

~~~
pjc50
No.

Basically the limiting factor in most designs isn't so much arithmetic as
fetches and branches, especially cache misses. These are inherently linear
operations: if you need to fetch from memory and then jump based on the
result, for example.

Superscalar 'cheats' somewhat by spending area to keep the pipeline fed,
through branch prediction and suchlike.

The nearest thing is the graphics card, which has a very large number of
arithmetic units but less flow control, so you can run the same subroutine on
lots of different data in parallel.

Highly multicore chips make a different tradeoff: external memory bandwidth is
_very_ limited. Ideal for video codecs etc where you can take a small chunk
and chew heavily. Very bad for running random unadapted C code, Java etc.
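A toy illustration of the point above, in Python (nothing Epiphany-specific;
the function names are invented for this sketch): elementwise work splits
cleanly across cores, while a fetch-then-branch chain forces one step at a
time no matter how many cores you have.

```python
from multiprocessing import Pool

def scale(x):
    # Elementwise: each result is independent, so any core can take any
    # chunk -- the GPU-style data parallelism described above.
    return x * 2.0

def chase(next_index, start, steps):
    # Pointer chasing: every load depends on the previous result, so step
    # i+1 cannot begin until step i's fetch completes -- inherently linear.
    i = start
    for _ in range(steps):
        i = next_index[i]
    return i

if __name__ == "__main__":
    data = list(range(1000))
    with Pool(4) as pool:
        doubled = pool.map(scale, data)           # parallelizes cleanly
    ring = [(i + 1) % 1000 for i in range(1000)]  # each entry points at the next
    print(chase(ring, 0, 500))                    # prints 500; serial either way
```

Superscalar hardware attacks exactly the `chase`-shaped dependency chains with
prediction and speculation; a sea of simple cores does not.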

------
zelon88
Did I read the specs wrong or are they claiming a 12x - 15x performance
improvement over the Ivy Bridge Xeon in GFLOPS/watt? In a <2w package?
[http://www.adapteva.com/wp-
content/uploads/2013/06/hpec12_ol...](http://www.adapteva.com/wp-
content/uploads/2013/06/hpec12_olofsson_publish.pdf)

~~~
dnautics
That's not unreasonable.

~~~
dnautics
I should clarify: presumably the Parallella's RISC does away with a lot of the
superscalar features of x86 which are embedded in the Xeon Phis.

One way to think about it is that things like branch prediction and
speculative and out of order execution are like real-time JITting of your
code.

Not having that silicon can make things way more efficient.

------
Tistel
I wonder if the Erlang/BEAM VM could take advantage of it. Erlang would be a
beast. if any of the pure functional languages get running on it (for easy
parallel), watch out. Nice work!

~~~
meta_AU
Things like Seastar[0] and Rust's zero cost futures would also make good use
of many cores.

[0] [http://www.seastar-project.org/](http://www.seastar-project.org/)

------
technological
Anyone looking for cached link for the website

[http://webcache.googleusercontent.com/search?q=cache:https:/...](http://webcache.googleusercontent.com/search?q=cache:https://www.parallella.org/2016/10/05/epiphany-
v-a-1024-core-64-bit-risc-processor/)?

Related Report - [https://www.parallella.org/wp-
content/uploads/2016/10/e5_102...](https://www.parallella.org/wp-
content/uploads/2016/10/e5_1024core_soc.pdf)

------
mechagodzilla
The linked paper mentions a 500 MHz operating frequency, as well as mentioning
a completely automated RTL-to-GDS flow. 500 MHz seems extraordinarily slow for
a 16nm chip - was this just an explicit decision to take whatever the tools
would give you so as to minimize back-end PD work? Also, given the performance
target (high flops/w), how much effort did you spend on power optimization?

~~~
adapteva
The paper stated that the 500MHz number was arbitrary (we had to fill in
something for people to compare to). Agreed that 500MHz with 16nm FinFET is
ridiculously slow. We are not disclosing actual performance numbers until
silicon returns in 4-5 months. The 28nm Epiphany-IV silicon ran at 800MHz.

------
cordite
But can I run Erlang on it?

~~~
adapteva
Hah! You thought you would get us with that one. :-) Here is the link to the
Erlang OTP developed at Uppsala University for Epiphany.

[https://github.com/margnus1/otp](https://github.com/margnus1/otp)

~~~
cordite
Is this actually running Erlang processes on the epiphany cores or just erlang
spawning special processes on the epiphany cores? I've seen the latter and was
not impressed.

~~~
adapteva
This is actually a cut-down Erlang OTP running on the Epiphany cores. It's not
ready for production, but it's interesting research. See the README.

~~~
cordite
Sweet! Though the README does not identify what is "cut down", the status, or
what remains to be vetted.

------
sargun
Would anyone be interested in Epiphany dedicated servers a la Raspberry Pi
colocation ([https://www.pcextreme.com/colocation/raspberry-
pi](https://www.pcextreme.com/colocation/raspberry-pi))?

I've always wanted to play with these units, but buying one doesn't make a lot
of sense for me (where would I put it?). I would be super interested in making
them accessible to folks.

------
weatherlight
What are the benefits/advantages of choosing something like this over a
traditional Arm/x86 or a GPU? My knowledge in this area is limited. :)

~~~
Stubb
Best I can tell, Epiphany is designed as a co-processor, so it's not booting
the OS and relies on a host (like an ARM/x86) to run the show and issue
commands.

The Epiphany cores have significantly more functionality than GPU cores, so
they're useful for things beyond computing FFTs and other number-crunching
tasks. For example, you could map active objects one-to-one onto Epiphany
cores.

------
convolvatron
I read through the PDF summary and it doesn't look as if the shared memory is
coherent (which would be silly anyway). But I couldn't find any discussion of
synchronization support. Given the weak ordering of non-local references, it
seems difficult to map a lot of workloads. My real guess is that I haven't
seen part of the picture.

~~~
tomcam
Not a hardware genius here. What does coherent memory mean?

~~~
rthille
As I understand it: If memory is coherent then all cores see the same values
when they read the same location at the same time. Stated another way, the
result of a write to a location by one core is available in the next instant
to all other cores, or they block waiting for the new value.
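A toy simulation of what goes wrong without that guarantee (this models no
real hardware; the `Core` class is invented for illustration): each "core"
keeps a private cache over one shared memory, and nothing invalidates the
other core's copy on a write.

```python
class Core:
    """One 'core' with a private cache over a shared memory (a plain dict)."""

    def __init__(self, shared):
        self.shared = shared
        self.cache = {}  # private; no other core can invalidate entries here

    def read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.shared[addr]  # fill cache on first access
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.shared[addr] = value  # write-through, but no invalidation broadcast

shared = {0: 1}
a, b = Core(shared), Core(shared)
print(b.read(0))  # prints 1 -- B now holds a cached copy
a.write(0, 42)    # A updates shared memory...
print(b.read(0))  # still prints 1 -- stale; a coherent system would show 42
```

Real coherence protocols (MESI and friends) fix this by broadcasting
invalidations on writes, which is exactly the hardware cost a chip like this
avoids by not having caches at all.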

~~~
tomcam
Thank you all for that help. Did not see definitions elsewhere in post.

------
loeg
What's the practical application of a chip like this?

~~~
adapteva
In general it was built for math and signal processing (broad field). Within
those fields, more specifically it was designed initially for real time signal
processing (image analysis, communication, decryption). Turns out that makes
it a pretty good fit for other things as well (like neural nets..). Here is
the publication list showing some of the apps. (for later, server is flooded
now): [http://parallella.org/publications](http://parallella.org/publications)

~~~
teraflop
Google cache:
[https://webcache.googleusercontent.com/search?q=cache:gpEOQO...](https://webcache.googleusercontent.com/search?q=cache:gpEOQOZcORgJ:https://www.parallella.org/publications/+&cd=1&hl=en&ct=clnk&gl=us)

------
adapteva
PAPER: [https://www.parallella.org/wp-
content/uploads/2016/10/e5_102...](https://www.parallella.org/wp-
content/uploads/2016/10/e5_1024core_soc.pdf)

(access until we resolve the hosting issues, wordpress completely hosed...)

------
witty_username
Prepend cache: to the URL to view Google's cached version of this website.

~~~
mden
For the lazy -
[http://webcache.googleusercontent.com/search?q=cache%3Ahttps...](http://webcache.googleusercontent.com/search?q=cache%3Ahttps%3A%2F%2Fwww.parallella.org%2F2016%2F10%2F05%2Fepiphany-
v-a-1024-core-64-bit-risc-
processor%2F%3F&oq=cache%3Ahttps%3A%2F%2Fwww.parallella.org%2F2016%2F10%2F05%2Fepiphany-
v-a-1024-core-64-bit-risc-
processor%2F%3F&aqs=chrome..69i57j69i58.5527j0j4&sourceid=chrome&ie=UTF-8)

------
rpiguy
Wow, from Kickstarter to DARPA funding! How did I miss that?

~~~
agumonkey
They went surprisingly silent after the KS boards. I falsely assumed they had
left the business or gone to work as employees. Delightfully surprised they
found ways to keep at it.

------
kirrent
For those interested, Andreas did an interview on the Amp hour a while ago.
[http://www.theamphour.com/254-an-interview-with-andreas-
olof...](http://www.theamphour.com/254-an-interview-with-andreas-olofsson-
adatevas-ampliative-abacus/)

Congrats to everyone at adapteva. I remember talking to a couple of
researchers who were using the prototype 64 core epiphany processor who seemed
excited at how it could scale. I wonder how excited they'd be about this.

------
AnimalMuppet
1024 64-bit cores? Cool. Very impressive.

64 MB on-chip memory? For 1024 cores? That's 64 KB per core. That seems rather
inadequate... though for some applications, it will be plenty.

~~~
adapteva
You need to think of it as aggregate memory, not as per-core memory, to use it
effectively. Are you aware of a chip with more than 64MB of on-chip RAM?

~~~
orbifold
The latest generations of IBM Power processors have >64MB L3 caches on chip.
The Power 7+ has 80MB per chip, the 12 core Power 8 96MB, according to
Wikipedia the Power 9 will have 120MB.

~~~
adapteva
Good data! That puts e5 in good company with some big-iron heavies.

------
thechao
Is there a mirror anywhere?

~~~
mmastrac
[https://webcache.googleusercontent.com/search?q=cache:XCsT2e...](https://webcache.googleusercontent.com/search?q=cache:XCsT2efmgU0J:https://www.parallella.org/2016/10/05/epiphany-
v-a-1024-core-64-bit-risc-processor/+&cd=1&hl=en&ct=clnk&gl=ca)

This PDF is a great technical overview as well:
[https://www.parallella.org/wp-
content/uploads/2016/10/e5_102...](https://www.parallella.org/wp-
content/uploads/2016/10/e5_1024core_soc.pdf)

------
Animats
So each processor has 64KB of local memory and network connections to its
neighbors?

The NCube and the Cell went down that road. It didn't go well. Not enough
memory per CPU. As a general purpose architecture, this class of machines is
very tough to program. For a special purpose application such as deep
learning, though, this has real potential.

------
thesz

        Cray had always resisted the massively parallel solution to high-speed computing, offering a variety of reasons that it would never work as well as one very fast processor. He famously quipped "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
    

I cannot see how this thing can be programmed efficiently (to at least 70% of
computing capacity, as most vector machines can be programmed for).

------
algorithm314
The ISA is epiphany or risc-v?

~~~
jamesaross
It's backward compatible with Epiphany-III...so it's still Epiphany ISA with
new instructions.

~~~
algorithm314
I have read that, but in the past he wrote a blog post saying that RISC-V
would be used as the ISA in future products. So maybe 64-bit RISC-V with
backwards compatibility with Epiphany? (it sounds a bit strange)

~~~
milcron
The Epiphany core is a co-processor, and the "main" processor is a couple of
ARM cores to run Linux/other.

Maybe in the future they will offer boards with Risc-V main processors, and
Epiphany co-processors.

I'm not sure how feasible 1024 Risc-V cores would be (although it sounds
awesome). Epiphany cores were designed for this sort of thing.

~~~
adapteva
Agree, but people have all kinds of pre-conceived notions about co-processors
so let's clarify some things: e5 can't self-boot, doesn't have virtual memory
management, and doesn't have hardware caching, but otherwise they are "real"
cores. Each RISC core can run a lightweight runtime/scheduler/OS and be a
host.

------
pjc50
Interesting, but for a very specialized market, somewhere in the corner
between GPU and FPGA. Closest existing offer might be Tilera?

Site is currently slashdotted so I can't comment on details like how much DRAM
bandwidth you might actually have.

~~~
nickpsecurity
Tilera is what I thought of, too. It's actually where I'm getting my ideas of
applications for Epiphany-V. They did a lot of the early proving work on
architectures like this. Example: first 100Gbps NIDS I saw used a few Tilera
chips to do that.

~~~
planteen
Kind of off topic, but are there any low-end/hobbyist Tilera boards? The Linux
kernel has support for it. I've always thought you could stress multi-threaded
code in interesting ways by running it on tons of cores.

------
tiggilyboo
Good to see this here! I actually wrote a paper analyzing this architecture
for one of my bachelor classes. It's been a few years, but:
[http://simonwillshire.com/papers/efficient-
parallelism/](http://simonwillshire.com/papers/efficient-parallelism/)

------
jokoon
What I don't understand about computer chips is how relevant the FLOPS unit
really is, because in most situations what limits computation speed is the
memory speed, not the FLOPS.

So for example a big L2 or L3 cache will make a CPU faster, but I don't know
if a parallel task is always faster on a massively parallel architecture, and
if so, how I can understand why that is the case. It seems to me that
massively parallel architectures just distribute the memory throughput in a
more intelligent way.

~~~
adapteva
You have to look at all the numbers (I/O, on-chip memory, flops, threads) and
see if the architecture fits your problem. Some algorithms, like matrix-matrix
multiplication, are FLOPS bound. It's rare to see an HPC architecture (don't
know if there is one?) that can't reach close to the theoretical flops with
matrix-matrix multiplication. Parallel architectures and parallel algorithm
development go hand in hand.
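A back-of-envelope way to see why matrix-matrix multiply is the algorithm that
reaches peak flops: its arithmetic intensity (flops per byte of external
memory traffic) grows with the matrix size, while something like vector
addition is stuck at a constant. Rough sketch, assuming n x n double-precision
operands each touched once in external memory:

```python
def matmul_intensity(n):
    flops = 2 * n ** 3            # n^3 multiply-adds = 2n^3 flops
    bytes_moved = 3 * n * n * 8   # read A and B, write C; 8 bytes per double
    return flops / bytes_moved    # = n/12: data reuse grows with n

def vecadd_intensity(n):
    flops = n                     # one add per element
    bytes_moved = 3 * n * 8       # read x and y, write z
    return flops / bytes_moved    # = 1/24 regardless of n: bandwidth bound

print(matmul_intensity(1024))  # ~85 flops per byte: compute bound
print(vecadd_intensity(1024))  # ~0.042 flops per byte: memory bound
```

The catch is that exploiting that reuse requires keeping tiles of the matrices
on-chip, which is exactly what a large aggregate of small local stores is for.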

------
protomyth
The website is erroring out for me, so I wonder what the motherboard situation
will be like for this chip. It would be really nice to be able to buy an ARM
like we can buy an x86.

------
mhurd
Truly inspirational in showing what largely one person can do, even in these
times of huge fabs, expensive masks, and difficult modern design rules.

[http://meanderful.blogspot.com.au/2016/10/adapteva-tapes-
out...](http://meanderful.blogspot.com.au/2016/10/adapteva-tapes-out-
epiphany-v-1024-core.html)

------
wbsun
The website is down. Maybe a good opportunity to demonstrate the scalability
improvement of such a 1024-core processor?

------
bra-ket
How do I connect external RAM to it, and what would be the CPU-to-memory
bandwidth in that case?

~~~
adapteva
External RAM, up to 1 petabyte, would be connected through an external FPGA
containing an Epiphany link, some glue logic, and a memory controller.

~~~
sargun
From my understanding the Zynq's memory controller can only handle ~4GB of
memory. Am I missing something? Is there a way to connect more than 4GB -- if
so, I'd be very interested.

~~~
adapteva
Larger FPGAs support 64-bit addressing with custom memory controllers.

~~~
sargun
With the new chip, is there a memory controller on the board, or will you
still need the FPGAs?

Even with the new MPSoC, I think the memory controller is limited to 8GB.

Do you know what the most efficient cost / GB config is for an Epiphany +
memory controller or FPGA?

------
jdmoreira
that's going to provide some interesting race conditions for sure :D

------
imre
What are the possible applications? i.e. how would one make use of all the
cores? Is it more like GPU programming?

------
noelwelsh
Tying in to earlier discussion on C
([https://news.ycombinator.com/item?id=12642467](https://news.ycombinator.com/item?id=12642467)),
it's interesting to imagine what a better programming model for a chip like
this would look like. I know about the usual CSP / message passing stuff, and
a bit about HPC languages like SISAL and SAC. Anyone have links to more modern
stuff?

------
api
Wish I was still working on genetic programming and digital artificial life.
This would be barrels of fun.

------
rjammala
Seems like this url is really popular; I get this connection error:

Error establishing a database connection

~~~
imaginenore
Mirror: [http://archive.is/Zj9hG](http://archive.is/Zj9hG)

------
tempodox

      Error establishing a database connection
    

Overload thru request storm?

------
erichocean
Any chance of adding 16-bit floating point support in Epiphany-VI?

------
bikamonki
Error establishing a database connection

------
laxk
403 error now for the entire site.

------
the_duke
Page is overwhelmed.

Can anyone provide a summary?

------
liveoneggs
too bad the entire site is returning a 500 error now

~~~
aerodog
I can't see anything on the site. Is this for sale, or just a proposed
architecture? Amazon seems only to be selling your 16-core device. Was there a
64-core one? Can't access your product offering.

~~~
reportingsjr
The tapeout is apparently at the foundry and they are expecting chips back in
4-5 months. (I gathered this info from a google cache of their blog)

------
new299
I think my thoughts on the parallella stuff still hold:

[http://41j.com/blog/2012/10/my-take-on-the-adapteva-
parallel...](http://41j.com/blog/2012/10/my-take-on-the-adapteva-parallella/)

Basically this is a recurring theme in computing, but the whole custom
massively parallel thing rarely works out.

