The Epiphany-V was designed using a completely automated flow to translate Verilog RTL source code to a
tapeout ready GDS, demonstrating the feasibility of a 16nm “silicon compiler”. The amount of open source
code in the chip implementation flow should be close to 100% but we were forbidden by our EDA vendor
to release the code. All non-proprietary RTL code was developed and released continuously throughout the
project as part of the “OH!” open source hardware library. The Epiphany-V likely represents the first
example of a commercial project using a transparent development model pre-tapeout.
But I'm confused about which parts of this are open and which aren't. Do they mean that they fed their Verilog into a proprietary tool, which generates the design? That doesn't make it open source in practice.
Verilog --> C/Java/etc
EDA --> GCC/LLVM
GDS --> Binary (elf)
The GDS is completely tied up in NDAs due to the foundry. The EDA combines/translates open source code with proprietary blobs to produce a "super secret" GDS binary blob that gets sent to the foundry for manufacturing.
> Verilog --> C/Java/etc
> EDA --> GCC/LLVM
> GDS --> Binary (elf)
Verilog --> imperative language
EDA --> IDE + compiler
GDS --> Assembly
Reminds me of Alan Kay's comment "hardware is just software which has crystallized early"
Shouldn't be. But it is.
And there are no open-source toolchains for any of this. Implementing a SW compiler is a student project; why isn't implementing an RTL compiler?
If anything, it hurts your bottom line. You would probably get more third-party interest in having print-outs of custom hardware if the toolchains were more open. It's not a question of price, it's a question of exposure.
I'm not even talking about the 12-20nm stuff. That is still crazy expensive because the hardware and software R&D was huge, and these companies are hoarding their toys like preschoolers because of a prisoner's dilemma over competitive advantage. But older 45-100nm plants, which are often still in use, remain just as inaccessible as ever to most hobbyist hardware enthusiasts.
Custom circuit boards are coming down in price, maybe custom lithography will come down in price at some point to be accessible to hobbyists / startups.
Exactly, hence my question about "student projects", which is really about why there aren't more OSS projects that challenge this. Is it because of the lack of platforms to experiment on, or the inherent difficulty of the task?
Sometimes I wish somebody with deep pockets (or maybe a semiconductor company) would buy an ailing EDA company and just open-source all of these design tools; things would move much faster for open-source h/w design.
Hardware programming is way, way too low-level. Think assembler programming, only lower.
This is why video controller HW takes 9 months for a group of 5 engineers and 2 programmers, while driver software for said video controller can be written in a month by one graduate student.
The languages are also either very dirty or very expensive.
As an example of expensiveness, one license for the cool, shiny Bluespec SystemVerilog compiler can cost you 2-3 yearly salaries of one of your engineers. Yes, it reduces lines of code (3 times) and error density (another 3 times), but nonetheless.
An example of dirtiness in Verilog: a sized number literal has three parts - the size (a regular decimal integer with non-significant underscores, like 10_00 for a thousand), the base, expressed by the regexp "'[Ss]?[hHoObBdD]", and the value of the literal. These are three separate lexemes. You can use the preprocessor definition "`define WEIRD(n,b,s) s b n" to construct sized literals backward: WEIRD(dead,'h,42) for 0xdead with size 42. As you can see, the value part of a literal can (and will) be matched by the regular identifier rule. The compiler itself seems more or less straightforward to me, though.
An example of dirtiness in VHDL: construction of a record whose first field is a character can be written as "RECORD'(')')" - we have successfully constructed a record with the character field set to ')'. The single quote mark is either the start of a character literal (as in 'c'), the prefix of an attribute (NAME_OF_ENUMERATED_VALUE'SUCC), or part of a qualified expression (typed construction of a value) as exemplified above. VHDL was also one of the first languages to introduce operator and function overloading, including, but not limited to, overloading on function return types.
Good luck implementing all of this when you are student.
I wrote a 5-stage RISC processor with it for school; it was quite simple and easy to abstract.
If hardware were more competitive, industry coding practices would be more efficient. Instead, their own self-conception of pain points prevents them from going after this low-hanging fruit.
I wrote something like that a long time ago: https://github.com/thesz/hhdl (even before CLaSH)
I had a translation algorithm from pure Haskell code to the HHDL internals. I even wrote a MIPS clone using it (and it simulated OK).
There's just no market for that.
I'm hoping (as is the author with http://qbaylogic.nl/) that the market for FPGA soft(?)ware will suck less. Best case it pushes pressure on the fabs for ASICs, but we'll see.
There is one fully open source flow, but currently only targeting Lattice iCE40 chips: Project IceStorm. http://www.clifford.at/icestorm/
That said, the synthesis tool (Yosys) can actually synthesize netlists suitable for Xilinx tools, as well. In theory any company could probably add a backend component to Yosys to support their chips. arachne-pnr/icetools can only target iCE40 chips, still.
That said, it all works today. I recently have been working on a small 16-bit RISC machine using Haskell/CLaSH as my HDL, and using IceStorm as the synthesis flow. This project wouldn't have been possible without IceStorm - the proprietary EDA tools are just an unbelievable nightmare that otherwise completely sap my will to live after several attempts...
 Like how I had to sed `/bin/sh` to `/bin/bash` in 30+ shell scripts, to get iCEcube2's Synplify Pro synthesis engine to work. WTF?
 Or other great "features", like locking down iCE40-HX4K chips with 8k-usable LUTs to 4k LUTs artificially, through the PR/synthesis tool, to keep their products segmented. I mean, I get the business sense on this one (easier to do one fab run at one size), but ugh.
Especially when you're working with RF, or when you're doing commercial products, or when you have a strict timeline and limited resources.
In a software project, development is limited only by human resources; you can't realistically blame the computer for being too slow to compile your code, and there are no "defects" when your users download your code.
Even designing simple stuff without the fab's component libraries for old processes would be a daunting task. (For some context: something circa the Sega Dreamcast era -- 350 nm/4 layers or thereabouts -- is doable by a single talented 4th-year undergraduate, with a fair bit of ease, for his capstone (senior-year) project, given the component libs. Without the tooling, he'd be lost.)
I'm sure Adapteva wanted to open source their final files which went to the fab for tape-out, but you could bet your bottom dollar if they did, a take-down letter would be sent to Github and Adapteva would be slammed with a lawsuit.
SPICE is/was the original open-source project that came out of UC Berkeley in the '70s if you want to go from zero-to-tape-out on an entirely open source stack but it's no trivial task. http://opencircuitdesign.com/links.html has some auxiliary resources, and IIRC there's a Linux distribution with a pretty good toolkit with even things like analog simulators for RFIC (though, as the late-great Bob Pease of NatSemi said - "never trust the simulator" ;)).
Side-note: Adapteva - your work is fascinating, so much so that I read your entire set of ref docs for the Epiphany. I'm in the Boston area; let me buy y'all a coffee at Diesel, as I'd love to pick your brains.
 - (Grey-area legality content) - Here's an example of the documentation of the libs you'd be using - normally even these documents are under lock and key: http://www.utdallas.edu/~mxl095420/EE6306/Final%20project/ts...
This looks like a master's-level thesis project directory by the course number (didn't go to U of T :D) at 180 nm sizing.
So, if you have a 16nm silicon compiler, I encourage you to pull a Gaisler with a presentation on how you do that with key details and synthetic examples designed to avoid issues with EDA vendors. Or just use Qflow if possible.
[edit: was thinking of the wrong Gaisler, still will pass]
Speaking from experience, even getting purpose-built compilers like ICC to apply "simple" optimizations like fused-multiply-add to matrix multiply is non-trivial.
Taking jpeg decoding as a concrete example of why modern compilers fall over, you have two high-level choices: (1) the compiler automatically translates a generic program into one that can be vectorized using the instructions on the target platforms. This will probably involve reworking control flow, loops, heap memory layout, malloc calls, etc, and will require changing the compressed / decompressed images in ways imperceptible to humans (the vector instructions often have different precision/rounding properties than non-vector instructions). This is well beyond the state of the art.
(2) Find a programmer that deeply understands the capabilities of all the target architectures and compilers, who will then write in the subset of C/Java/etc that can be vectorized on each architecture.
I think you'll find there are many more assembler programmers than there are people with the expertise to pull off (2), and that using compiler intrinsics is actually more productive anyway.
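To make (2) concrete, here's roughly what the intrinsics route looks like for a dot-product inner loop (my own sketch, assuming AVX2+FMA and a length that's a multiple of 8; compile with -mavx2 -mfma):

    #include <immintrin.h>

    /* Dot product with explicit fused multiply-add. */
    float dot_fma(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb */
        }
        /* Horizontal sum of the 8 lanes. */
        float tmp[8];
        _mm256_storeu_ps(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3]
             + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    }

Note that accumulating across 8 lanes changes the order of the additions relative to scalar code -- exactly the kind of imperceptible-to-humans rounding difference mentioned in (1).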
I don't agree that SIMD is so specialized. It is needed wherever you have operations over arrays of items of the same type, including memcmp, memcpy, strchr, unicode encoders/decoders/checkers, operations on pixels, radio or sound samples, accelerometer data, etc.
Compilers have latency and dependency models for specific CPU arch decoders/schedulers/pipelines. Compiler authors agree that compilers should learn to do good autovectorization. But it's hard. So people use assembly.
> human assembly optimization is unlikely to be better than a modern compiler
> Most developers can't beat LLVM
Then you pointed out some specific examples where a human can beat a compiler.
Seems like you two agree, but then you go and call what he is saying "a myth". I think I need some clarification.
Prior to this, my understanding was that if the developer provides the compiler with good information (types, const, no pointer aliasing) and in general makes the code easy to optimize, then the compiler can do much better than most humans most of the time; of course a domain expert with all the knowledge the compiler has, and willing to expend a huge amount of time, can beat the compiler. It just seems that beating the compiler is rarely cost (time, money, people, etc...) efficient.
Is my understanding close in your opinion?
If what your program does can be sped up using vector registers/instructions (e.g. DSP, image and video processing) then you want to do that because x4 and x8 speedups are common. Current autovectorisers are not very good. If it is not the most trivial example like "sum of contiguous array of floats", you'll want to write SIMD assembly or intrinsics or use something like Halide. In practice projects end up using nasm/yasm or creating a fancy macro assembler in a high level language.
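For reference, the trivial case and a small variation that usually defeats the autovectorizer (my own example, not from any particular codebase; gcc/clang typically need -O3 plus -ffast-math to vectorize the first loop, since reassociating float adds changes rounding):

    /* Sum of a contiguous array of floats: the textbook case. */
    float sum(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* A small change usually defeats it: the early exit makes the trip
       count data-dependent, and most autovectorizers fall back to scalar. */
    float sum_until_negative(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            if (a[i] < 0.0f) break;
            s += a[i];
        }
        return s;
    }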
The choice to use assembly is economics, and it's all a matter of degree. How much performance is left on the table by the compiler? How many C lines of code take up 50% of the cpu time in your program? How rare is the person who is able to write fast assembly/SIMD code? How long does it take to write correct and fast assembly/SIMD code for only the hot function for 4 different platforms (e.g. in-order ARM, Apple A10, AMD Jaguar, Haswell)?
If you think "25%, 100k LoC, very rare, man-years" then you conclude it's not worth it. If you think "x8, 20 lines, only as rare as any other good senior engineer, 50 hours" then you conclude it's stupid to not do the inner loop in assembly.
What are the numbers in practice? I don't know. In practice, all the products that have won in their market and can be sped up using SIMD have hand coded assembly or use something like Halide and none of them think the compiler is good enough.
const most certainly is used by optimizers: https://godbolt.org/g/kLmGr4
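Roughly the kind of thing that godbolt link shows (a minimal sketch of my own, not the exact code in the link):

    /* A static const table: the compiler knows it can never change,
       so the lookup can be folded to the constant 30 at compile time. */
    static const int table[4] = { 10, 20, 30, 40 };

    int lookup_known(void) {
        return table[2];   /* compiles down to "return 30" */
    }

    /* Contrast: const on a pointer parameter is only a promise that this
       function won't modify the data; callers may, so no folding here. */
    int lookup_param(const int *t) {
        return t[2];       /* still a real load */
    }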
The willingness of C compilers to (ab)use undefined behavior for optimization is one of the main criticisms against it.
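The canonical example (my own minimal version): because signed overflow is undefined, the compiler is allowed to assume it never happens.

    /* Signed overflow is UB, so a compiler may assume x + 1 > x always
       holds for int and fold this to "return 1" -- even though at runtime
       x == INT_MAX would wrap around on most hardware. */
    int always_true(int x) {
        return x + 1 > x;
    }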
Sure, I can file bug reports in those cases, and I would attempt to if possible -- but it also doesn't meaningfully help any users who suddenly experience the problem. At some point I'd rather just write the core bit a few times and future proof myself (and this has certainly happened for me a non-zero amount of times -- but not many more than zero :)
as well as 'What every compiler writer should know about programmers, or "Optimization" based on undefined behaviour hurts performance'
The second paper is so biased it hurts. It hardly attempts to hide this bias; on the second page it starts referring to one group of people as "clueless" and never justifies it by describing what being clued in would look like.
The second paper also has a strong assumption that compilers should somehow maintain their current undefined behavior going forward. It is almost as though the paper author thinks a compiler can somehow divine what the programmer wants without referring to some pre-agreed upon document, such as the standard for the language.
The second paper also talks only about performance and not about any other real world concern, like maintainability, reliability or portability.
The paper sets up straw men when it trots out code with bugs (that loop on page 4) and then complains that a pre-release version of the compiler does something unexpected. Of course non-conforming code breaks when compiled. Of course pre-release compilers are buggy.
The paper's author wants code to work the same on all systems even when the code conveys unclear semantics. That is unreasonable.
Even though I disagree with the author I try to understand some of his perspective.
A 100th of the cost and half the performance is, granted, wishful thinking on my part. But I believe the important point is that with a sufficient productivity gain, this technology can reduce the old, non-automated way to something akin to writing software libraries in assembly. Writing software libraries in assembly is useful, but few bother to do it because they'd rather just buy more hardware. Churning out twice as many chips, once you have your design finished, isn't really that much more expensive, as I understand it.
Why? Is there anything that could be done to change that?
It is an open source RISC based ISA along with open source implementations of example processor cores. Then you could have had a processor that was completely open and did not include any proprietary code.
Are you planning any production samples for research / universities / DARPA ?
How long is the period from needing the cash to pay for production to availability in retail, roughly?
If it's all about volume, accumulating orders over a long period using some non-reversible payment method could, perhaps, get you into millions of units. It's all about how long people are willing to wait in order to save on per-chip unit costs.
Also congrats! This is brilliant engineering to get a chip like this into production silicon as a small team.
How much did the prototype MPW(?) silicon cost?
We can't disclose MPW costs. Chip was funded by DARPA. For standard MPW costs, check with MOSIS.
Not saying we won't have chips with all cores working, just saying you shouldn't count on it.
In a tile-based CPU, error topology matters. A string of broken cores, or a broken core at the edge, is likely worse than a broken core with all 4 (or 8?) neighbors working.
It might be easier to work around broken SRAM bits than just skipping a whole core.
That way you could always have the same pipeline layout and not need to compute it dynamically.
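Something like a remap table built once at boot would do it (a rough sketch with invented names, not anything Adapteva has described):

    /* Build a logical->physical core map that skips cores marked defective,
       so software always sees logical cores 0..n_good-1 in the same order
       regardless of which physical cores are broken. */
    #define ROWS 32
    #define COLS 32

    typedef struct { int row, col; } coord_t;

    int build_core_map(unsigned char defective[ROWS][COLS],
                       coord_t map[ROWS * COLS]) {
        int n_good = 0;
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                if (!defective[r][c]) {
                    map[n_good].row = r;
                    map[n_good].col = c;
                    n_good++;
                }
        return n_good;   /* number of usable cores */
    }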
How many DRAM ports?
Also: any more information on the ISA extensions for communications/deep learning?
No reason to drive a 1024-core chip to the broad market when most applications aren't ready to use 16 cores. With this chip we focus on customers and partners who have proven that they have mastered the 16-core platform.
Yet magically they have no problem taking advantage of massively parallel GPUs...
Most applications don't use 16 CPU cores because they don't need them.
Not competitors yet. They have awesome silicon in the field, we just taped out...
But seriously, I'm tremendously curious about the use for this with video processing. Have there been any good benchmarks with that?
Here's one from ARL:
Just kidding. Nobody's perfect. :)
Awesome to see the 1024 cpu epiphany taped out! Congratulations! Any plan to put these into a card computer for easy programming and evaluation? EDIT: nevermind on the question, I see the response below.
Would like to say that your Kickstarter was one of the best-communicated, most smoothly run Kickstarter campaigns that I have ever backed.
Here are direct links if you are in a rush:
I didn't see it addressed in the paper, how does this compare WRT discrete DSP chips? Are you targeting ease of programming instead of raw FMAD/etc?
In Epiphany, the programmers are challenged by the manycore and an SRAM size cliff (so 0 or 1 in terms of pain).
It depends...but I personally prefer having one big dragon to slay rather than 10 little ones.
Are you upstreaming qemu, uboot, Linux, GCC, GDB etc changes?
Will we see a Debian port for this?
For Parallella: Linux upstream, uboot might be as well? Runs Debian, Ubuntu, etc
Custom ISA extensions for deep learning, communication, and cryptography
Autonomous drones, cognitive radio.
What does that mean?
Is it possible to design a CPU that switches ON DEMAND between parallel and linear operation? So, if we have 1000 cores, it switches to 10 with the linear power of 10 x 10?
In my dreams this would be very useful, but I wonder how feasible it could be ;)
Basically the limiting factor in most designs isn't so much arithmetic as fetches and branches, especially cache misses. These are inherently linear operations - if you need to fetch from memory and then jump based on the result, for example.
Superscalar 'cheats' somewhat by spending area to keep the pipeline fed, through branch prediction and suchlike.
The nearest thing is the graphics card, which has a very large number of arithmetic units but less flow control, so you can run the same subroutine on lots of different data in parallel.
Highly multicore chips make a different tradeoff: external memory bandwidth is very limited. Ideal for video codecs etc where you can take a small chunk and chew heavily. Very bad for running random unadapted C code, Java etc.
Xeon, Power, etc are kind of power pigs anyway, though they've got a lot of absolute oomph to show for it.
One way to think about it is that things like branch prediction and speculative and out of order execution are like real-time JITting of your code.
Not having that silicon can make things way more efficient.
Related Report - https://www.parallella.org/wp-content/uploads/2016/10/e5_102...
I'm unable to find the feature branch bringing Parallella support to OTP https://github.com/margnus1/otp/branches
Maybe it was merged upstream already?
You came a long way since I saw you in London in 2013. 1024 cores came sooner than 2020! Amazing job.
I've always wanted to play with these units, but buying one doesn't make a lot of sense for me (where would I put it?). I would be super interested in making them accessible to folks.
The Epiphany cores have significantly more functionality than GPU cores, so they're useful for things beyond computing FFTs and other number-crunching tasks. For example, you could map active objects one-to-one onto Epiphany cores.
edit: and also a global wired-or for a barrier.
The problem becomes a lot easier if you can reduce the multiple-writer case to the single-writer case. One idea that occurred to me is that since you have 1024 cores, it might make sense to dedicate a small fraction of them (say, 1/64) to synchronization. When you need to send a message to another process, you write to a nearby "router" that has a dedicated buffer to receive your data. The router can then serialize it with respect to other messages and put it into the receiver's buffer.
Basically, you'd end up defining an "overlay network" on top of the native hardware support; you pay a latency cost, but you gain a lot of flexibility.
EDIT: I may be completely wrong about the first paragraph; it looks like the TESTSET instruction might actually be usable on remote addresses. I assumed it didn't because the architecture documentation doesn't say anything about how such a capability would be implemented. But if it works, it would drastically simplify inter-node communication.
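Roughly what I had in mind for the router, as a plain shared-memory sketch (all names invented; on the Epiphany the slots would live in the router core's local SRAM and be written over the mesh):

    /* Each sender core owns exactly one slot in the router's inbox, so every
       slot has a single writer. The router drains the slots in a fixed order
       and forwards to the receiver, which serializes the messages. */
    #define N_SENDERS 16

    typedef struct {
        volatile unsigned int full;   /* 0 = empty, 1 = message present */
        volatile unsigned int data;
    } slot_t;

    typedef struct {
        slot_t inbox[N_SENDERS];
    } router_t;

    /* Sender side: spin until our slot is free, then publish.
       On a weakly ordered interconnect you'd need a fence or an in-order
       delivery guarantee between writing data and setting full. */
    void send(router_t *r, int my_id, unsigned int msg) {
        while (r->inbox[my_id].full)
            ;                          /* back-pressure */
        r->inbox[my_id].data = msg;
        r->inbox[my_id].full = 1;      /* single writer per slot */
    }

    /* Router side: scan slots round-robin and forward in a fixed order. */
    void route_once(router_t *r, void (*deliver)(unsigned int)) {
        for (int i = 0; i < N_SENDERS; i++) {
            if (r->inbox[i].full) {
                deliver(r->inbox[i].data);
                r->inbox[i].full = 0;
            }
        }
    }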
I was talking about the DMA mode in which every write to a special register (which may be coming from a different core) gets "redirected" to the subsequent byte of the DMA target region. This can work as a queue with multiple enqueuers, but it has bounded size (after the size is exhausted, messages get lost) and operates on single-byte messages.
I don't remember how this works with external memory (including cores from different chips).
For some context, the x86 memory model gives you an almost consistent view of memory. The behavior is roughly as if the memory itself executes reads/writes in sequential order, but writes may be buffered within a processor in FIFO order before being actually sent to memory. Internally, the memory actually isn't that simple -- there are multiple levels of cache, and so forth -- but the hardware hides those details from you. Once a write operation becomes globally visible, you're guaranteed that all of its predecessors are too.
From what I can see from a quick overview of the Epiphany documentation, it doesn't have any caches to worry about, but it gives you much weaker guarantees about memory belonging to different cores. For one thing, there's no "read-your-writes" consistency; if you write to another core and then immediately try to read the same address, you might read the old value while the write is still in progress. For another, there's no coherence between operations on different cores, so if you write to cores X and then Y, someone else might observe the write to Y first (e.g. because it happens to be fewer hops away).
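In practice that means message passing needs an explicit handshake instead of reading back what you just wrote. A generic illustration (my own sketch, not e-lib code; the ordering caveat in the comment matters):

    typedef struct {
        volatile unsigned int payload;
        volatile unsigned int ready;   /* written by the other core */
    } mailbox_t;

    /* Each core keeps one mailbox in its own local memory; the other core
       writes into it over the mesh, and polling is always on local memory. */
    void sender(mailbox_t *receivers_box, mailbox_t *my_box) {
        receivers_box->payload = 42;
        /* On a weakly ordered fabric this second write could be observed
           before the payload; a real design needs a fence or an
           in-order-delivery guarantee between the two. */
        receivers_box->ready = 1;
        while (!my_box->ready)          /* wait for the ack in *my* memory,
                                           instead of re-reading what I wrote */
            ;
        my_box->ready = 0;
    }

    void receiver(mailbox_t *my_box, mailbox_t *senders_box) {
        while (!my_box->ready)
            ;
        unsigned int v = my_box->payload;
        (void)v;                        /* ... consume the message ... */
        my_box->ready = 0;
        senders_box->ready = 1;         /* acknowledge back to the sender */
    }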
Epiphany-V does not have caching. You explicitly move data around in software. Some software abstractions are better than others.
It hasn't really got mindshare, though, in the sense that players like Qualcomm have all but ignored it and would rather work on proprietary comms schemes.
(access until we resolve the hosting issues, wordpress completely hosed...)
Congrats to everyone at adapteva. I remember talking to a couple of researchers who were using the prototype 64 core epiphany processor who seemed excited at how it could scale. I wonder how excited they'd be about this.
64 MB on-chip memory? For 1024 cores? That's 64 KB per core. That seems rather inadequate... though for some applications, it will be plenty.
You've just described the general architecture of the Connection Machine, a late-80s/early-90s-era supercomputer that was used for modeling weather, stocks, and other things. It was fairly useful in its time.
I think we will end up with systems with 64GB of memory, but which instead of 8 cores with 8GB each, have 1M cores with 64 KB memory each. We just need to learn how to write code that makes the most out of that, which is probably a lot more than what you can do with current systems.
And this Epiphany thing is something like the first step in that direction.
This is actually how all modern mobile GPUs work and it's highly vectorizable. The partitioning obviously needs to know the whole scene but that's much more lightweight than rendering.
From what I've heard from my ex-gamedev contacts, movies are heading that route in a big way because the turnaround time of raytracing is so long that it's really hurting the creative process.
This PDF is a great technical overview as well: https://www.parallella.org/wp-content/uploads/2016/10/e5_102...
The NCube and the Cell went down that road. It didn't go well. Not enough memory per CPU. As a general purpose architecture, this class of machines is very tough to program. For a special purpose application such as deep learning, though, this has real potential.
Cray had always resisted the massively parallel solution to high-speed computing, offering a variety of reasons that it would never work as well as one very fast processor. He famously quipped "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
Maybe in the future they will offer boards with Risc-V main processors, and Epiphany co-processors.
I'm not sure how feasible 1024 Risc-V cores would be (although it sounds awesome). Epiphany cores were designed for this sort of thing.
Site is currently slashdotted so I can't comment on details like how much DRAM bandwidth you might actually have.
So for example a big L2 or L3 cache will make a CPU faster, but I don't know if a parallel task is always faster on a massively parallel architecture, and if so, how can I understand why it is the case? It seems to me that massively parallel architectures are just distributing the memory throughput in a more intelligent way.
Even with the new MPSoC, I think the memory controller is limited to 8GB.
Do you know what the most efficient cost / GB config is for an Epiphany + memory controller or FPGA?
Error establishing a database connection
Can anyone provide a summary?
Basically this is a recurring theme in computing, but the whole custom massively parallel thing rarely works out.