The Epiphany-V was designed using a completely automated flow to translate Verilog RTL source code to a
tapeout ready GDS, demonstrating the feasibility of a 16nm “silicon compiler”. The amount of open source
code in the chip implementation flow should be close to 100% but we were forbidden by our EDA vendor
to release the code. All non-proprietary RTL code was developed and released continuously throughout the
project as part of the “OH!” open source hardware library.[20] The Epiphany-V likely represents the first
example of a commercial project using a transparent development model pre-tapeout.
RTL = Register Transfer Logic, and EDA = Electronic Design Automation, for anyone else who was curious. I don't know what GDS stands for, but context indicates it's the actual physical description that's used to make the part.
But I'm confused about what part of this is open and not open. Do they mean that they imported their Verilog into a proprietary tool, which generates the design? That doesn't make it open source in practice.
The GDS is completely tied up in NDAs due to the foundry. The EDA combines/translates open source code with proprietary blobs to produce a "super secret" GDS binary blob that gets sent to the foundry for manufacturing.
Except the economics are vastly different. The complexity and cost of manufacturing, the computationally intensive cost of simulation and various checks and optimizations (be it clock timing or mask optimizations to etch features that are smaller than the wavelength used to etch them), all mean that you can't just "compile and publish", and turnaround times are months, not hours.
And there are no open-source toolchains for any of this. It's a student project to implement a SW compiler; why isn't it a student project to implement an RTL compiler?
Nothing about the time frames or even production costs justifies the disparity in how proprietary and closed hardware manufacturing is. For the exact reason that hardware and software are different, open sourcing your patterning toolchain has nothing to do with your competitive advantage in actually having built foundries with functioning lithography. The cost is in the latter; the former is just abuse of position for power over the end user.
If anything, it hurts your bottom line. You would probably get more third-party interest in having print-outs of custom hardware if the toolchains were more open. It is not a question of price, it's a question of exposure.
I'm not even talking about the 12-20nm stuff. That is still crazy expensive because the hardware and software R&D was huge, and these companies are hoarding their toys like preschoolers because of a prisoner's dilemma in regards to competitive advantage. But older 45-100nm plants are often still in use and yet remain just as inaccessible as ever to most hobbyist hardware enthusiasts.
If it were really that easy then hobbyists would have found a way to do it on their own by now (e.g. 3D printing). You can't just demand that someone open their billion dollar fabs to amateur hobbyists. It is very likely that if a fab is still operating at a certain process, it's because they have profitable business churning through it. If it's not profitable, they retool or close it down. An idle fab is money down the drain, and it's really doubtful hobbyists would be able to fill the gap with a bunch of one-off production runs, while likely needing a lot of hand holding.
Custom circuit boards are coming down in price, maybe custom lithography will come down in price at some point to be accessible to hobbyists / startups.
> The cost is in the latter, the former is just abuse of position for power over the end user.
Exactly, hence my question about "student projects", which is really about why there aren't more OSS projects that challenge this. Is it because of the lack of platforms to experiment on, or the inherent difficulty of the task?
Thinking about this, yeah, it'd be amazing to e.g. have a community-driven forum with some DIY CPU designs (lisp machines!) with an affordable (let's say under $1k per chip) way to get them made. We'll probably get there eventually, but I'm not aware of where progress on this front is.
this. I always say this: the real credit for the success of open source software goes to gcc (egcs for old timers), which allowed developers to make executable code unencumbered by NDAs & royalties.
sometimes I wish somebody with deep pockets (or maybe a semiconductor company) would buy an ailing EDA company and just open source all those design tools; things would move much faster for open source h/w design.
In software, one line of a state machine does a myriad of things - computes the new state, reads input, writes output, etc, etc. In hardware, one line of a state machine computes one bit of the acknowledgement of having read the input, if you are lucky.
Hardware programming is way, way too low level. Think of assembler programming, only even lower.
This is why a video controller in HW takes 9 months for a group of 5 engineers and 2 programmers, while driver software for said video controller can be written in a month by one graduate student.
The languages are also either very dirty or very expensive.
As an example of expensiveness, one license for the cool shiny Bluespec SystemVerilog compiler can cost you 2-3 yearly salaries of one of your engineers. Yes, it reduces line count (3 times) and error density (another 3 times), but nonetheless.
An example of dirtiness in Verilog: the sized number literal has three parts - the size (a regular decimal integer with non-significant underscores, like 10_00 for a thousand), the base, matched by the regexp "'[sS]?[bBoOdDhH]", and the value of the literal. These are three separate lexemes. You can use the preprocessor definition "`define WEIRD(n,b,s) s b n" to construct sized literals backward: WEIRD(dead,'h,42) for 0xdead with size 42. As you can see, the value part of the literal can (and will) be matched by the regular identifier rule. The compiler itself seems more or less straightforward to me right now, though.
An example of dirtiness in VHDL: construction of a record whose first field is a character can be written as "RECORD'(')')" - we have successfully constructed a record with its character field set to ')'. The single quote mark is either the start of a character literal (as in 'c'), the prefix of an attribute (NAME_OF_ENUMERATED_VALUE'SUCC), or part of the qualified-expression construction of a value exemplified above. VHDL was also one of the first languages that introduced operator and function overloading, including, but not limited to, overloading on the return types of functions.
Good luck implementing all of this when you are a student.
Look up clash-lang.org. Haskell modules -> Verilog+VHDL, with a simple compilation model so you're not leaving performance on the table.
I wrote a 5-stage RISC processor with it for school; it was quite simple and easy to abstract.
If hardware were more competitive, industry coding practices would be more efficient. Instead, their own self-conception of pain points prevents them from going after this low-hanging fruit.
Cool! But note that Clash is actually compiling Haskell (i.e. analogous to GHCJS or something), rather than being an EDSL.
I'm hoping (as is the author with http://qbaylogic.nl/) that the market for FPGA soft(?)ware will suck less. Best case it pushes pressure on the fabs for ASICs, but we'll see.
> And there are no open-source toolchains for any of this.
There is one fully open source flow, but currently only targeting Lattice iCE40 chips: Project IceStorm. http://www.clifford.at/icestorm/
That said, the synthesis tool (Yosys) can actually synthesize netlists suitable for Xilinx tools as well. In theory any company could probably add a backend component to Yosys to support their chips. arachne-pnr/icetools can still only target iCE40 chips, though.
That said, it all works today. I recently have been working on a small 16-bit RISC machine using Haskell/CLaSH as my HDL, and using IceStorm as the synthesis flow. This project wouldn't have been possible without IceStorm - the proprietary EDA tools are just an unbelievable nightmare that otherwise completely sap my will to live after several attempts...[1][2]
[1] Like how I had to sed `/bin/sh` to `/bin/bash` in 30+ shell scripts, to get iCEcube2's Synplify Pro synthesis engine to work. WTF?
[2] Or other great "features", like locking down iCE40-HX4K chips with 8k-usable LUTs to 4k LUTs artificially, through the PR/synthesis tool, to keep their products segmented. I mean, I get the business sense on this one (easier to do one fab run at one size), but ugh.
It is[0], and electrical engineering students make them pretty regularly; it's just much more expensive and complicated if you actually want to make a chip with the output of one instead of just simulating it.
Especially when you're working with RF, or when you're doing commercial products, or when you have a strict timeline and limited resources.
In a software project, development is only limited by the human resources; you can't realistically blame the computer for being too slow to compile your code, and there are no "defects" when your users download your code.
The limiting factor is your 'building blocks': the component libraries (with their cells, IO and what-have-you) that your fab (e.g. TSMC) gives to your design software house (e.g. Mentor, Synopsys, Cadence) for a specific process (e.g. $integer-um|nm CMOS). A production run is usually built off of these heavily NDA'd building blocks, locked down by contract[0] (and that's assuming you have the cash to buy time for that tape-out!).
Even designing simple stuff for old processes without the fab's component libraries would be a daunting task. (For some context, something circa the Sega Dreamcast era -- 350 nm / 4 layers or thereabouts -- is well in the realm of what a talented single 4th-year undergraduate could design with a fair bit of ease for his capstone (senior-year) project, given the component libs. Without the tooling, he'd be lost.)
I'm sure Adapteva wanted to open source their final files which went to the fab for tape-out, but you could bet your bottom dollar if they did, a take-down letter would be sent to Github and Adapteva would be slammed with a lawsuit.
SPICE is/was the original open-source project that came out of UC Berkeley in the '70s, if you want to go from zero to tape-out on an entirely open-source stack, but it's no trivial task. http://opencircuitdesign.com/links.html has some auxiliary resources, and IIRC there's a Linux distribution with a pretty good toolkit, with even things like analog simulators for RFIC (though, as the late, great Bob Pease of NatSemi said - "never trust the simulator" ;)).
Side note: Adapteva - your work is fascinating, so much so that I read your entire set of ref docs for the Epiphany. I'm in the Boston area; let me buy y'all a coffee at Diesel, as I'd love to pick your brains.
--
[0] - (Grey area legality content) - Here's an example of the documentation of the libs you'd be using - normally even these documents are lock&keyed: http://www.utdallas.edu/~mxl095420/EE6306/Final%20project/ts...
This looks like a masters-level thesis project directory by the course number (didn't go to U of T :D) @ 180 nm sizing.
Probably RTL would be more correctly known as "Register Transfer Level" as in a level of abstraction, in contrast to for example the lower "gate" level of abstraction.
I might be wrong, but if they automated the flow from RTL to GDS, the timing might not be optimal. I understand that with their lack of resources this is unavoidable, but in a normal chip design flow the backend timing ECO is critical to achieving high frequency across all timing corners.
Yes, we are leaving 2X on the table in terms of peak frequency compared to well-staffed chipzilla teams. Not ideal, but we have a big enough lead in terms of architecture that it kind of works.
The comment above said you couldn't release the info due to the EDA vendor. However, people like Jiri Gaisler have released their methodologies via papers that just describe them with artificial examples. Others use non-manufacturable processes and libraries (like NanGate's) so the EDA vendors' feelings don't get hurt about results that don't apply to real-world processes. ;)
So, if you have a 16nm silicon compiler, I encourage you to pull a Gaisler with a presentation on how you do that with key details and synthetic examples designed to avoid issues with EDA vendors. Or just use Qflow if possible.
I'll pass for now...Gaisler is in the business of consulting, we survive by building products. I am happy to release sources, but it's completely up to the EDA company.
[edit: was thinking of the wrong Gaisler, still will pass]
Damnit. No promises, but would you consider putting it together if someone paid your company to do it under an academic grant or something? Quite a few academics are trying to do things like you've done, so there's a small chance one might go for that.
That's concurrency, throughput, and load-balancing of web servers connected to pipes of certain bandwidth. It's not the same as parallel execution of CPU-bound code on a tiled processor. You could know a lot about one while knowing almost nothing about the other.
That seems analogous to human assembly optimization vs a compiler. But the time to market is greatly reduced, designs can be vetted and a 2.0 that is optimized for frequency can be shipped later.
IIRC, human assembly optimization is unlikely to be better than a modern compiler nowadays. Same thing could very well happen for this "automated flow" if it starts incorporating its own optimization techniques.
That is a myth. Most developers can't beat LLVM. LLVM can't beat the handcrafted assembly in libjpeg-turbo or x264 or openssl or luajit by compiling the generic C alternative.
In response to the other replies: I'm not sure about luajit, but the other two examples involved a programmer hand crafting algorithms around specific special purpose CPU instructions -- vector processing and video compression hardware, if I remember the details of x264 correctly. This is so specialized and architecture specific that it probably doesn't make sense to push it into the compiler.
Speaking from experience, even getting purpose-built compilers like ICC to apply "simple" optimizations like fused-multiply-add to matrix multiply is non-trivial.
Taking jpeg decoding as a concrete example of why modern compilers fall over, you have two high-level choices: (1) the compiler automatically translates a generic program into one that can be vectorized using the instructions on the target platforms. This will probably involve reworking control flow, loops, heap memory layout, malloc calls, etc, and will require changing the compressed/decompressed images in ways imperceptible to humans (the vector instructions often have different precision/rounding properties than non-vector instructions). This is well beyond the state of the art.
(2) Find a programmer that deeply understands the capabilities of all the target architectures and compilers, who will then write in the subset of C/Java/etc that can be vectorized on each architecture.
I think you'll find there are many more assembler programmers than there are people with the expertise to pull off (2), and that using compiler intrinsics is actually more productive anyway.
x264 does not use any video compression hardware. It uses only regular SIMD.
I don't agree that SIMD is so specialized. It is needed wherever you have operations over arrays of items of the same type, including memcmp, memcpy, strchr, unicode encoders/decoders/checkers, operations on pixels, radio or sound samples, accelerometer data, etc.
Compilers have latency and dependency models for specific CPU arch decoders/schedulers/pipelines. Compiler authors agree that compilers should learn to do good autovectorization. But it's hard. So people use assembly.
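A minimal sketch of what that hand-written SIMD looks like in practice (my example, not the parent's; x86 SSE2 via intrinsics rather than raw assembly). A saturating add over pixel bytes is one vector instruction per 16 pixels, but the equivalent clamp written in plain C is the kind of pattern autovectorizers often miss:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* dst[i] = min(a[i] + b[i], 255) for 8-bit pixels, 16 at a time.
       For brevity this sketch assumes n is a multiple of 16. */
    void add_pixels_saturating(uint8_t *dst, const uint8_t *a,
                               const uint8_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
        }
    }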
> human assembly optimization is unlikely to be better than a modern compiler
You said:
> Most developers can't beat LLVM
Then you pointed out some specific examples where a human can beat a compiler.
Seems like you two agree, but then you go and call what he is saying "a myth". I think I need some clarification.
Prior to this my understanding was that if the developer provides the compiler good information with type, const, avoids pointer aliasing and in general makes the code easy to optimize that the compiler can do much better than most humans most of the time, but of course a domain expert willing to expend a huge amount of time with all the knowledge the compiler would have can beat the compiler. It just seems that beating the compiler is rarely cost (time, money, people, etc...) efficient.
Making C compilers for different architectures output great code from the same source is really hard. E.g. "const" is not used by optimizers because it can be cast away. Interpreters, compression routines, etc. can always be sped up using assembly.
If what your program does can be sped up using vector registers/instructions (e.g. DSP, image and video processing) then you want to do that because x4 and x8 speedups are common. Current autovectorisers are not very good. If it is not the most trivial example like "sum of contiguous array of floats", you'll want to write SIMD assembly or intrinsics or use something like Halide. In practice projects end up using nasm/yasm or creating a fancy macro assembler in a high level language.
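A quick sketch to make the "trivial example" concrete (mine, not the original commenter's code, and the flags are my assumption): even this plain float-sum reduction only autovectorizes once the compiler is allowed to reassociate floating-point adds, e.g. with -ffast-math on GCC/Clang, because the vector version changes the order of the additions. Anything with shuffles, saturation, or mixed element widths is usually back to intrinsics or assembly.

    #include <stddef.h>

    /* The textbook reduction. Integer versions of this loop vectorize at
       higher optimization levels; the float version typically needs
       -ffast-math (or an explicit simd-reduction pragma) before the
       compiler will emit packed adds. */
    float sum_floats(const float *x, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }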
The choice to use assembly is economics, and it's all a matter of degree. How much performance is left on the table by the compiler? How many C lines of code take up 50% of the cpu time in your program? How rare is the person who is able to write fast assembly/SIMD code? How long does it take to write correct and fast assembly/SIMD code for only the hot function for 4 different platforms (e.g. in-order ARM, Apple A10, AMD Jaguar, Haswell)?
If you think "25%, 100k LoC, very rare, man-years" then you conclude it's not worth it. If you think "x8, 20 lines, only as rare as any other good senior engineer, 50 hours" then you conclude it's stupid to not do the inner loop in assembly.
What are the numbers in practice? I don't know. In practice, all the products that have won in their market and can be sped up using SIMD have hand coded assembly or use something like Halide and none of them think the compiler is good enough.
> Making C compilers for different architectures output great code from same source is really hard. e.g. "const" is not used by optimizers because it can be cast away.
Check out the cppcon 2016 presentation by Jason Turner and watch how eagerly the compiler optimizes away code when const is enabled on values. Cool presentation too, and it uses Godbolt's tool.
https://www.youtube.com/watch?v=zBkNBP00wJE
If it's not at least able to match handcrafted assembly using intrinsics, you should file bugs against LLVM. There is no theoretical reason why compilers shouldn't be able to match or beat humans here: these problems are extremely well studied.
Sometimes consistency is desirable, as well as performance. Compilers are heuristic. They evolve and get better, but they can mess up, and it's not always a fun time to find out why the compiler made something that was performance sensitive suddenly do worse, intrinsics or not -- from things like a compiler upgrade, or the inlining heuristic changing because of some slight code change, or because it's Friday the 13th (especially when it's something horridly annoying like a solid 2-3% worse -- at least with 50% worse I can probably figure out where everything went horribly wrong without spending a whole afternoon on it). This is a point that's more general than intrinsics, but I think it's worth mentioning.
Sure, I can file bug reports in those cases, and I would attempt to if possible -- but it also doesn't meaningfully help any users who suddenly experience the problem. At some point I'd rather just write the core bit a few times and future proof myself (and this has certainly happened for me a non-zero amount of times -- but not many more than zero :)
"using intrinsics" is a cop out: you are essentially doing the more complicated part of translating that sequence of generic C code into a rough approximation of a sequence of machine instructions and leave the compiler to do the boring and simpler parts, like register allocation, code layout and ordering of independent instructions.
Compilers are smart at some things and not so smart at others. I can beat the compiler in tight inner loops almost every time, but it will also do insanely clever things that I'd never think of!
Slides without the talk are not my favorite - have a link to the talk?
The second paper is so biased it hurts. It hardly attempts to hide this bias; on the second page it starts referring to one group of people as "clueless" and never justifies it by describing what being clued in would be.
The second paper also has a strong assumption that compilers should somehow maintain their current handling of undefined behavior going forward. It is almost as though the paper's author thinks a compiler can somehow divine what the programmer wants without referring to some pre-agreed-upon document, such as the standard for the language.
The second paper also talks only about performance and not about any other real world concern, like maintainability, reliability or portability.
This paper is setting up straw men when it trots out code with bugs (that loop on page 4) and then complains that a pre-release version of the compiler does something unexpected. Of course non-conforming code breaks when compiled. Of course pre-release compilers are buggy.
The paper's author wants code to work the same on all systems even when the code conveys unclear semantics. That is unreasonable.
To give credit to the paper's author, that no-op loop is part of the SPEC benchmark suite, and the author feels that code in that benchmark is being treated as privileged by compiler authors.
Even though I disagree with the author, I try to understand some of his perspective.
There's a gap between "humans can't write assembly better than the compilers" and "there's nothing humans can do to help the compiler write better code".
Depends. You won't beat llvm if your code uses strictly intrinsics. Some things, like adding carry bits across 64-bit arrays, might need to be done by hand, because of special knowledge about your data that is not generalizable.
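As a concrete sketch of the carry case (mine, not the parent's; using the x86 _addcarry_u64 intrinsic, which maps to the add-with-carry instruction): propagating a carry chain across an array of 64-bit limbs is exactly the kind of thing you spell out by hand, because portable C has no direct way to say "add with the carry from the previous word".

    #include <immintrin.h>  /* _addcarry_u64, x86-64 only */
    #include <stddef.h>

    /* dst = a + b over n 64-bit limbs, least significant limb first;
       returns the carry out of the top limb. */
    unsigned char add_limbs(unsigned long long *dst,
                            const unsigned long long *a,
                            const unsigned long long *b, size_t n)
    {
        unsigned char carry = 0;
        for (size_t i = 0; i < n; i++)
            carry = _addcarry_u64(carry, a[i], b[i], &dst[i]);
        return carry;
    }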
I agree completely, it's still impressive to me that they presumably managed a competitive offering with such a system. I imagine having it be a highly homogeneous design also helped.
The interesting question, to me at least, is how much cheaper this chip is - with its suboptimal maximum clock rate - compared to a chip from a non-automated flow. If peak clock rate is one half, but cost is one hundredth, I'd say it's a spectacular achievement.
A 100th in costs and one half in performance is, granted, wishful thinking on my part. But I believe the important point is that with a sufficient productivity gain, this technology can reduce the old, non-automated way to something akin to writing software libraries in assembly. Writing software libraries in assembly is useful, but few bother to do it because they'd rather just buy more hardware. Churning out twice as many chips, once you have your design finished, isn't really that much more expensive, as I understand it.
It is an open source RISC based ISA along with open source implementations of example processor cores. Then you could have had a processor that was completely open and did not include any proprietary code.
The chip is about the same size as the Apple A10, so in terms of silicon area it's in the consumer domain, but price will only come down to consumer levels if shipments get into millions of units. Big companies take a leap of faith and build a product hoping that the market will get there. Small companies get one shot at that. With university volumes and shuttles, we are talking 100x costs. So the $300 GPU PCIe-type boards become $10K-$30K with NRE and small-scale production folded in.
You should look into alternative financing methods.
How long is the period from needing the cash to pay for production to availability in retail, roughly?
If it's all about volume, accumulating orders over a long period using some non-reversible payment method could, perhaps, get you into millions of units. It's all about how long people are willing to wait in order to save on per-chip unit costs.
I had a friend who mentioned that it was very difficult to get the 64-core Parallellas with fully functional Epiphany-IV chips. Are these yield problems going to continue with Epiphany-V, or can we expect a full 1024 functional cores per chip?
It would be a BIG mistake to assume 1024 working cores. If you want to scale your software you should take a look at Google/Erlang and others. Not reasonable to demand perfection at 16nm and below...
Not saying we won't have chips with all cores working, just saying you shouldn't count on it.
In a tile-based CPU, error topology matters. A string of broken cores, or a broken core at the edges, is likely worse than a broken core with all 4 (or 8?) neighbors working.
Impossible to characterize without high volume silicon or accurate yield models. We can say that historically, most failures are in SRAM cells and they are limited to a few bits (the core still works!), and that in general only one out of N cores will fail. For argument's sake, let's assume the whole network always works, but 1 CPU may be broken (this is what needs to be confirmed later). Does that help?
Yes, you can call it scratchpad or SRAM. The point is that there is no hardware caching. The local SRAM is split into 4 separate banks so it is "effectively" 4-ported. DRAM controllers are up to the system designer. This is handled by the FPGA (like previous Epiphany chips).
Not going to happen in the near term. There is no way to meet the price point needed to compete in the low cost SBC market with the Epiphany-V. Believe it or not, the $99 Parallella was priced too high to reach mass adoption.
Sure, there will be evaluation boards, they just won't be generally available at digikey and won't cost $99. More information about custom ISA will be disclosed once we have silicon back.
Well, the Parallella has shipped to over 10,000 people and it's still selling at Amazon and DK, so no, the dream is not dashed in any way. The number of publications and frameworks around Parallella is growing every month...
No reason to drive a 1024-core chip to the broad market when most applications aren't ready to use 16 cores. With this chip we focus on customers and partners who have proven that they have mastered the 16-core platform.
I think you're underestimating the requirements and mastery of cloud companies. Something like an Amazon lambda could virtualize 4 cores per instance and host 256 lambda execution units on a single chip. The use cases are endless
Unless the architecture has changed drastically from the earlier Epiphany, they can't be virtualised like that, and each core is way too slow to be suitable for lambda except for software written specifically to take advantage of the parallelism of the architecture.
You still need to recompile code for the new architecture, and taking full advantage of it wisely is not easy... but may be worth it in many use cases. Part of the problem is that it's not 100% clear which use cases these are and how to market it. Probably unit calculation per watt is the most likely performance advantage, but it's still amazingly hard to sell people on that sometimes
Some parallel algorithms will scale to bigger (more parallel) chips the way binary programs got more performance with higher clock frequencies. That's the holy grail..
Congrats again on getting an amazing amount done on budget. The part that jumped out more than usual was you soloing it to stay within budget. Pretty impressive. How did you handle the extensive validation/verification that normally takes a whole team on ASICs? Does your method have a correct-by-construction aspect and/or automate most of the testing or formal stuff?
Modern SOCs might have 100 complex blocks. We had 3 simple RTL blocks (9 hard macros). Top level communication approach was "correct by construction". Nothing is for free.
Hours were over a 12-month period, but yes... the pace was relentless. All ambitious projects, including many Kickstarter projects, get done because creators end up working for free for essentially thousands of hours. In this case, we were on a fixed-cost budget so those hours were "my problem".
Emacs? Ahem. I would like to return the parallella I purchased in the kickstarter campaign...
Just kidding. Nobody's perfect. :)
Awesome to see the 1024 cpu epiphany taped out! Congratulations! Any plan to put these into a card computer for easy programming and evaluation? EDIT: nevermind on the question, I see the response below.
Would like to say that your Kickstarter was one of the best-communicated, most smoothly run Kickstarter campaigns that I have ever backed.
Hopefully you guys have ECC on your 64MB of SRAM, otherwise the mean time to bit flip due to Single Event Upset (SEU) is around 400 days (based on 200 FIT/Mb, i.e. 200 upsets per Mb per billion device-hours, from previous experience).
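Rough arithmetic behind that ~400-day figure, taking the quoted 200 FIT/Mb at face value:

    64 MB = 512 Mb
    512 Mb x 200 FIT/Mb = 102,400 upsets per 10^9 device-hours
    10^9 / 102,400 ~ 9,800 hours ~ 410 days between bit flips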
No ECC on chip, but we do have column redundancy. We are pushing the envelope in terms of SEUs, making an assumption that the right programming model and run time will be able to compensate for high soft error rates. It's a contentious point, but basically our thesis is that with 1024 cores on a single chip, cores are "free" and it "should" be possible to avoid putting down very expensive ECC circuits on every memory bank (x4096). Some of our customers don't notice all bit flips because they have things like Turbo/Viterbi... channels aren't perfect...
Thanks, sounds like lots of parallels (har har) to the SPUs on the PS3, which got a bad rep but I thought were great if you went in with the right approach.
I see that there is a llvm backend at https://github.com/adapteva/epiphany-llvm, but it hasn't been updated in a while. Are there any plans on upstreaming/contributing and maintaining a backend for llvm?
We are quite happy with our GCC port so LLVM hasn't been a priority. If anyone wants to take over the port, please do! We could give financial assistance for getting it completed, but the budget would be modest.
First of all, congrats, this is very impressive. Second of all, I've been thinking a lot about how proprietary GPU computation and especially VR is these days. Any interest or plans for the future in specialized hardware development for VR?
I'm seeing NIDS for 10+Gbit links, DDOS mitigation, cache appliance for web servers, Erlang accelerator, BitTorrent accelerator, and so on. Quite a few possibilities. Also, something like this might be tuned for hardware synthesis, formal verification, or testing given all the resources that requires. Intel has a nice presentation showing what kind of computing resources go into their CPU work:
Is it possible to design a CPU that ON-DEMAND switches between parallel and linear operation? So, if we have 1000 cores, it switches to 10 with the linear power of 10 x 10?
In my dreams this was very useful, but I wonder how feasible it could be ;)
Basically the limiting factor in most designs isn't so much arithmetic as fetches and branches. Especially cache misses. These are inherently linear operations - if you need to fetch from memory and then jump based on the result, for example.
Superscalar 'cheats' somewhat by spending area to keep the pipeline fed, through branch prediction and suchlike.
The nearest thing is the graphics card, which has a very large number of arithmetic units but less flow control, so you can run the same subroutine on lots of different data in parallel.
Highly multicore chips make a different tradeoff: external memory bandwidth is very limited. Ideal for video codecs etc where you can take a small chunk and chew heavily. Very bad for running random unadapted C code, Java etc.
Could be excellent for a dense automatic isolating array microphone; a thousand other things. I'd love to see Parallella in embedded, they set a great example.
That's an older paper, but yes there have been more than one independent study showing 25x boost in terms of energy efficiency. See Ericsson FFT paper, OpenWall bcrypt paper, and others at parallella.org/publications.
The Epiphany gains are certainly only achievable for massively pipelinable or embarrassingly parallel operations with very little intermediate state (e.g. streaming data, neural software, etc), not for random access large memory footprint crunching like the Xeon. There simply isn't the per-core memory (64KB), or external memory bandwidth, to go around otherwise.
Xeon, Power, etc are kind of power pigs anyway, though they've got a lot of absolute oomph to show for it.
I wonder if the Erlang/BEAM VM could take advantage of it. Erlang would be a beast. If any of the pure functional languages get running on it (for easy parallelism), watch out. Nice work!
The linked paper mentions a 500 MHz operating frequency, as well as mentioning a completely automated RTL-to-GDS flow. 500 MHz seems extraordinarily slow for a 16nm chip - was this just an explicit decision to take whatever the tools would give you so as to minimize back-end PD work? Also, given the performance target (high flops/w), how much effort did you spend on power optimization?
The paper stated that the 500MHz number was arbitrary (had to fill in something for people to compare to). Agree that 500MHz with 16nm FinFET is ridiculously slow. We are not disclosing actual performance numbers until silicon returns in 4-5 months. 28nm Epiphany-IV silicon ran at 800MHz.
Is this actually running Erlang processes on the epiphany cores or just erlang spawning special processes on the epiphany cores? I've seen the latter and was not impressed.
I've always wanted to play with these units, but buying one doesn't make a lot of sense for me (where would I put it?). I would be super interested in making them accessible to folks.
Best I can tell, Epiphany is designed as a co-processor, so it's not booting the OS and relies on a host (like an ARM/x86) to run the show and issue commands.
The Epiphany cores have significantly more functionality than GPU cores, so they're useful for things beyond computing FFTs and other number-crunching tasks. For example, you could map active objects one-to-one onto Epiphany cores.
I read through the pdf summary and it doesn't look as if the shared memory is coherent (which would be silly anyway). But I couldn't find any discussion about synchronization support. Given the weak ordering of non-local references it seems difficult to map a lot of workloads. My real guess is that I haven't seen part of the picture.
It comes back to the programming model. Synchronization is all explicit. See publication list. Includes work on MPI, BSP, OpenMP, OpenCL, and OpenSHMEM. The work from US army research labs on OpenSHMEM is especially promising. It's a PGAS model.
If you're looking for weird synchronization primitives, look at the documentation of the DMA controller. It has a mode in which it stores bytes that are written to a particular address into a memory range, in the order the writes arrive. I haven't figured out a reasonable way to use that with multiple writers (except the trivial case of having a byte-based stream with bounded size), though.
Yeah, I was thinking about that problem too. (It's not safe to blindly write somewhere unless you can be sure that nobody else is going to simultaneously clobber your data. You can't do any kind of atomic test-and-set or compare-and-swap operation on remote memory, so you don't have the usual building blocks for things like queues or semaphores.)
The problem becomes a lot easier if you can reduce the multiple-writer case to the single-writer case. One idea that occurred to me is that since you have 1024 cores, it might make sense to dedicate a small fraction of them (say, 1/64) to synchronization. When you need to send a message to another process, you write to a nearby "router" that has a dedicated buffer to receive your data. The router can then serialize the message with respect to other messages and put it into the receiver's buffer.
Basically, you'd end up defining an "overlay network" on top of the native hardware support; you pay a latency cost, but you gain a lot of flexibility.
EDIT: I may be completely wrong about the first paragraph; it looks like the TESTSET instruction might actually be usable on remote addresses. I assumed it didn't because the architecture documentation doesn't say anything about how such a capability would be implemented. But if it works, it would drastically simplify inter-node communication.
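A minimal sketch of the single-writer reduction described above (generic C, not the Epiphany SDK; the slot layout and constants are made up): give every sender its own slot in the receiver's local memory, so each location has exactly one writer and no test-and-set is needed. The payload is written before the sequence number, relying on writes from one core to another being delivered in order.

    #include <stdint.h>

    #define MSG_WORDS 4

    /* One slot per sender, placed in the receiver's local SRAM.
       Each field has a single writer: the sender owns seq/payload,
       the receiver owns ack. */
    struct slot {
        volatile uint32_t seq;                 /* written only by the sender   */
        volatile uint32_t ack;                 /* written only by the receiver */
        volatile uint32_t payload[MSG_WORDS];  /* written only by the sender   */
    };

    /* Sender keeps its own sequence counter locally (reading back your
       own remote writes is not reliable), writes the payload first and
       publishes by bumping seq last. */
    void send_msg(struct slot *s, uint32_t *my_seq, const uint32_t *msg)
    {
        while (s->ack != *my_seq)       /* previous message not consumed yet */
            ;
        for (int i = 0; i < MSG_WORDS; i++)
            s->payload[i] = msg[i];
        *my_seq += 1;
        s->seq = *my_seq;               /* publish */
    }

    /* Receiver side: poll local memory, consume, then acknowledge. */
    int recv_msg(struct slot *s, uint32_t expected, uint32_t *out)
    {
        if (s->seq != expected)         /* nothing new from this sender */
            return 0;
        for (int i = 0; i < MSG_WORDS; i++)
            out[i] = s->payload[i];
        s->ack = expected;              /* tell the sender the slot is free */
        return 1;
    }

A "router" core as described above would just own many such slots and forward each message into the real receiver's slot, paying the extra hop for serialization.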
IIRC TESTSET is usable: it just sends a message that causes that to happen, but you don't learn whether the test succeeded.
I was talking about the DMA mode in which every write to a special register (which may be coming from a different core) gets "redirected" to the subsequent byte of the DMA target region. This can work as a queue with multiple enqueuers, but it has bounded size (after the size is exhausted, messages get lost) and operates on single-byte messages.
The easiest way to think about it is that remote access is order-preserving message passing with a separate message network for reads (as it truly is), so:
0. Local reads and writes happen immediately.
1. Writes from core X to core Y are committed in the same order in which they happen.
2. Reads of core Y from core X are performed in the same order in which they are executed, and they are performed sometime between when they get executed and their result is used.
3. Reads can be reordered WRT writes between the same pair of cores (so you _don't_ see your writes).
I don't remember how this works with external memory (including cores from different chips).
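A tiny illustration of rules 1 and 3 (a sketch only, plain C rather than anything Epiphany-specific; buf and flag are assumed to point at locations in core Y's local memory, written remotely by core X):

    /* Both locations live in core Y's local memory; core X writes them
       through its remote mapping, core Y reads them locally. */
    volatile int *buf;
    volatile int *flag;

    void produce_on_X(int v)
    {
        *buf  = v;    /* rule 1: committed at Y before the flag below   */
        *flag = 1;
        /* Rule 3: reading *flag back here may still return the old
           value, so your own remote read is not an ordering fence.     */
    }

    int consume_on_Y(void)
    {
        while (*flag == 0)    /* rule 0: local reads are immediate      */
            ;
        return *buf;          /* safe: payload arrived before the flag  */
    }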
As the other comments have said, it basically has to do with the level of consistency between different processors' views of the shared memory space. (There are some semantic differences between "consistency" and "coherence" that I'm going to ignore.)
For some context, the x86 memory model gives you an almost consistent view of memory. The behavior is roughly as if the memory itself executes reads/writes in sequential order, but writes may be buffered within a processor in FIFO order before being actually sent to memory. Internally, the memory actually isn't that simple -- there are multiple levels of cache, and so forth -- but the hardware hides those details from you. Once a write operation becomes globally visible, you're guaranteed that all of its predecessors are too.
From what I can see from a quick overview of the Epiphany documentation, it doesn't have any caches to worry about, but it gives you much weaker guarantees about memory belonging to different cores. For one thing, there's no "read-your-writes" consistency; if you write to another core and then immediately try to read the same address, you might read the old value while the write is still in progress. For another, there's no coherence between operations on different cores, so if you write to cores X and then Y, someone else might observe the write to Y first (e.g. because it happens to be fewer hops away).
As I understand it: If memory is coherent then all cores see the same values when they read the same location at the same time. Stated another way, the result of a write to a location by one core is available in the next instant to all other cores, or they block waiting for the new value.
In general it was built for math and signal processing (broad field). Within those fields, more specifically it was designed initially for real time signal processing (image analysis, communication, decryption). Turns out that makes it a pretty good fit for other things as well (like neural nets..). Here is the publication list showing some of the apps. (for later, server is flooded now): http://parallella.org/publications
Dynamically switching carrier frequencies to make better use of the spectrum. It is somewhat related to software-defined radio, in that SDR's are typically used to prototype cognitive radio.
It hasn't really got mindshare though in the sense players like Qualcomm have all but ignored it and would rather work on proprietary comms schemes.
They went surprisingly silent after the KS boards. I falsely assumed they had left the business or gone employee. Delightfully surprised they found ways to keep searching.
Congrats to everyone at adapteva. I remember talking to a couple of researchers who were using the prototype 64 core epiphany processor who seemed excited at how it could scale. I wonder how excited they'd be about this.
The latest generations of IBM Power processors have >64MB L3 caches on chip. The Power 7+ has 80MB per chip, the 12 core Power 8 96MB, according to Wikipedia the Power 9 will have 120MB.
Consider that many instruction and data caches are at the 16-32 KB scale. It's obviously a big criticism of the microarchitecture but you have a linear tradeoff between number of cores and available core memory. One core with 64 MB of memory seems less useful than 1024 cores with 64 KB of memory each (which can directly access all other core memory). But 65,536 cores with 1KB of memory each doesn't sound very useful either.
Thanks for articulating. As you know, there is no right answer as it depends on workload. Now if we could only build a specific chip for every application domain....
In fact, you have two trade-offs. One is what you said - that for a fixed amount of memory, the more cores, the less memory you have per core. The second trade-off is the transistor budget - the more space you use for cores, the less space you have left for memory.
The third trade off is cycle time; the larger the memory, the longer it takes to access it. This is why L1 caches are typically 16-64 KiB and despite that access is typically 2-3 cycles. However, 3+ cycles is difficult to hide in an in-order processor like this.
> But 65,536 cores with 1KB of memory each doesn't sound very useful either.
You've just described the general architecture of the Connection Machine[0], a late-80s/early-90s era supercomputer that was used for modeling weather, stocks, and other things. It was fairly useful in its time.
I think the right way to think about this is the following: scaling "up" is basically over with CPUs. Now we need scaling "out". This means learning how to make use of many more smaller cores, rather than just a few larger ones. Here communication becomes the problem and, indirectly, affects how you design and implement software. Scaling is becoming a software problem: how can you take advantage of 1024 cores with just 64KB of memory each, in a world where terabyte-sized is the daily business?
I think we will end up with systems with 64GB of memory, but which instead of 8 cores with 8GB each, have 1M cores with 64KB of memory each. We just need to learn how to write code that makes the most out of that, which is probably a lot more than what you can do with current systems.
And this Epiphany thing is something like the first step in that direction.
Unfortunately not, at least not for the real-world type of scenes that you see in movies / cartoons. Textures and high-polygon models take a ton of space.
Depends. If you do 3D rendering with triangles and shaders you can divide your buffers into tiles based on storage size and stream vertex/shader commands.
This is actually how all modern mobile GPUs work and it's highly vectorizable. The partitioning obviously needs to know the whole scene but that's much more lightweight than rendering.
From what I've heard from my ex-gamedev contacts, movies are heading that route in a large way, because the turnaround time of raytracing is so long that it's really hurting the creative process.
So each processor has 64KB of local memory and network connections to its neighbors?
The NCube and the Cell went down that road. It didn't go well. Not enough memory per CPU. As a general purpose architecture, this class of machines is very tough to program. For a special purpose application such as deep learning, though, this has real potential.
Cray had always resisted the massively parallel solution to high-speed computing, offering a variety of reasons that it would never work as well as one very fast processor. He famously quipped "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
I cannot see how this thing can be programmed efficiently (to at least 70% of computing capacity, as most vector machines can be programmed for).
I have read it, but in the past he wrote a blog post saying that RISC-V will be used as the ISA in future products. So maybe 64-bit RISC-V with backwards compatibility with Epiphany? (It sounds a bit strange.)
I have two excuses for why RISC-V didn't make it in. My February RISC-V post stated that we will use RISC-V in our next chip. We were already under contract for this chip, so I was referring to the next chip from now. I had hopes of sneaking it into this chip, but ran out of time. Both lame excuses, I know. I am firmly committed to RISC-V in some form in the future. For clarity, I am not talking about replacing the Epiphany ISA with a RISC-V ISA.
Agree, but people have all kinds of pre-conceived notions about co-processors so let's clarify some things: e5 can't self-boot, doesn't have virtual memory management, and doesn't have hardware caching, but otherwise they are "real" cores. Each RISC core can run a lightweight runtime/scheduler/OS and be a host.
Jan Gray stuffed 400 RISC-V cores into a Xilinx Kintex UltraScale KU040 FPGA (and the KU115 is three times larger, not to mention the Virtex UltraScale range).
I think a heterogeneous product was implied in that post, but I don't blame you for the confusion. The Epiphany-V is still homogeneous because of the time/funding constraints.
Tilera is what I thought of, too. It's actually where I'm getting my ideas of applications for Epiphany-V. They did a lot of the early proving work on architectures like this. Example: first 100Gbps NIDS I saw used a few Tilera chips to do that.
Kind of off topic, but are there any low-end/hobbyist Tilera boards? The Linux kernel has support for it. I've always thought you could stress multi-threaded code in interesting ways by running it on tons of cores.
What I don't understand with computer chips is how relevant the FLOPS unit really is, because in most situations what limits computation speed is always the memory speed, not the FLOPS.
So for example a big L2 or L3 cache will make a CPU faster, but I don't know if a parallel task is always faster on a massively parallel architecture, and if so, how can I understand why it is the case? It seems to me that massively parallel architectures are just distributing the memory throughput in a more intelligent way.
You have to look at all the numbers (I/O, on-chip memory, flops, threads) and see if the architecture fits your problem. Some algorithms, like matrix-matrix multiplication, are FLOPS bound. It's rare to see an HPC architecture (don't know if there is one?) that can't get close to the theoretical flops with matrix-matrix multiplication. Parallel architectures and parallel algorithm development go hand in hand.
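Back-of-envelope for why matrix-matrix multiply ends up FLOPS bound (my numbers, assuming a blocked implementation that keeps working tiles on chip and 8-byte double-precision elements):

    n x n matmul:             ~2*n^3 FLOPs
    unavoidable DRAM traffic: ~3*n^2 * 8 bytes (read A and B, write C once)
    arithmetic intensity:     2*n^3 / (24*n^2) = n/12 FLOPs per byte
    n = 4096  ->  ~340 FLOPs per byte moved

So for large matrices the floating-point units, not memory bandwidth, become the limit; most other kernels have far lower intensity and hit the memory wall described above.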
The website is erroring out for me, so I wonder what the motherboard situation will be like for this chip. It would be really nice to be able to buy an ARM like we can buy an x86.
From my understanding the Zynq's memory controller can only handle ~4GB of memory. Am I missing something? Is there a way to connect more than 4GB -- if so, I'd be very interested.
Tying in to earlier discussion on C (https://news.ycombinator.com/item?id=12642467), it's interesting to imagine what a better programming model for a chip like this would look like. I know about the usual CSP / message passing stuff, and a bit about HPC languages like SISAL and SAC. Anyone have links to more modern stuff?
i can't see anything on the site. is this for sale or just a proposed architecture? amazon seems only to be selling your 16-core device. was there a 64-core one? can't access your product offering.