Hacker News new | past | comments | ask | show | jobs | submit login
Epiphany-V: A 1024-core 64-bit RISC processor (parallella.org)
429 points by ivank on Oct 5, 2016 | hide | past | favorite | 233 comments

This is fascinating:

The Epiphany-V was designed using a completely automated flow to translate Verilog RTL source code to a tapeout ready GDS, demonstrating the feasibility of a 16nm “silicon compiler”. The amount of open source code in the chip implementation flow should be close to 100% but we were forbidden by our EDA vendor to release the code. All non-proprietary RTL code was developed and released continuously throughout the project as part of the “OH!” open source hardware library.[20] The Epiphany-V likely represents the first example of a commercial project using a transparent development model pre-tapeout.

RTL = Register Transfer Logic, and EDA = Electronic Design Automation, for anyone else who was curious. I don't know what GDS stands for, but context indicates it's the actual physical description that's used to make the part.

But I'm confused about what part of this is open and not open. Do they mean that they imported their Verilog into a proprietary tool, which generates the design? That doesn't make it open source in practice.

HW design is not that different from SW design. Comparison table below:

HW SW Verilog --> C/Java/etc EDA --> GCC/LLVM GDS --> Binary (elf)

The GDS is completely tied up in NDAs due to the foundry. The EDA combines/translates open source code with proprietary blobs to produce a "super secret" GDS binary blob that gets sent to the foundry for manufacturing.

For anyone else who was confused by everything being on one line:

    HW          SW
    Verilog --> C/Java/etc
    EDA     --> GCC/LLVM
    GDS     --> Binary (elf)

For everyone still confused

  Verilog --> imperative language
  EDA --> IDE + compiler
  GDS --> Assembly

> HW design is not that different from SW design.

reminds me of Alan Kay's comment "hardware is just software which has crystallized early"

> HW design is not that different from SW design.

Shouldn't be. But it is.

Except the economics are vastly different. The complexity and cost of manufacturing, the computationally intensive cost of simulation and various checks and optimizations (be it clock timing or mask optimizations to etch features that are smaller than the wavelength used to etch them), all mean that you can't just "compile and publish", and turnaround times are months, not hours.

And there are no open-source toolchains for any of this. Implementing a SW compiler is a standard student project; why isn't implementing an RTL compiler?

Nothing about the time frames or even production costs justifies the disparity in how proprietary and closed hardware manufacturing is. For the exact reason hardware and software are different, open-sourcing your patterning toolchain has nothing to do with your competitive advantage in actually having built foundries with functioning lithography. The cost is in the latter; the former is just abuse of position for power over the end user.

If anything, it hurts your bottom line. You would probably get more third-party interest in having print-outs of custom hardware if the toolchains were more open. It is not a question of price, it's a question of exposure.

I'm not even talking about the 12-20nm stuff. It is still crazy expensive because the hardware and software R&D was huge, and these companies are hoarding their toys like preschoolers because of a prisoner's dilemma with regard to competitive advantage. But older 45-100nm plants, often still in use, are just as inaccessible as ever to most hobbyist hardware enthusiasts.

If it were really that easy then hobbyists would have found a way to do it on their own by now (e.g. 3D printing). You can't just demand that someone open their billion dollar fabs to amateur hobbyists. It is very likely that if the fab is still operating at a certain process, it's because they have profitable business churning through it. If it's not profitable, they retool or close it down. An idle fab is money down the drain, and it's really doubtful hobbyists would be able to fill the gap with a bunch of one-off production runs, while likely needing a lot of hand holding.

Custom circuit boards are coming down in price, maybe custom lithography will come down in price at some point to be accessible to hobbyists / startups.

> The cost is in the latter, the former is just abuse of position for power over the end user.

Exactly, hence my question about "student projects" which is really about why aren't there more OSS projects that challenge this. Is it because of the lack of platforms to experiment on, or the inherent difficulty of the task?

Thinking about this, yeah, it'd be amazing to e.g. have a community-driven forum with some DIY CPU designs (lisp machines!) and an affordable (let's say under $1k per chip) way to get them made. We'll probably get there eventually, but I'm not aware of where progress on this front stands.

this. I always say this: the real credit for the success of open source software goes to gcc (egcs for old timers), which allowed developers to make executable code unencumbered by NDAs & royalties.

sometimes I wish somebody with deep pockets (or maybe a semiconductor company) would buy an ailing EDA company and just open-source all those design tools; things would move much faster for open-source h/w design.

In software, a line of state-machine code does a myriad of things: computes the new state, reads input, writes output, etc, etc. In hardware, a line of state-machine code computes one bit, say the acknowledgement of having read the input, if you're lucky.

Hardware programming is way, way too low-level. Think assembler programming, then go even lower.

This is why a video controller's HW takes 9 months for a group of 5 engineers and 2 programmers, while driver software for said video controller can be written in a month by one graduate student.

The languages are also either very dirty or very expensive.

As an example of expensiveness: one license for the cool, shiny Bluespec SystemVerilog compiler can cost you 2-3 yearly salaries of one of your engineers. Yes, it reduces line count (3 times) and error density (another 3 times), but nonetheless.

An example of dirtiness in Verilog: the sized number literal has three parts - the size (a regular decimal integer with non-significant underscores, like 10_00 for a thousand), the base, expressed by the regexp "'[sS]?[bBoOdDhH]", and the value of the literal. These are three separate lexemes. You can use the preprocessor definition "`define WEIRD(n,b,s) s b n" to construct sized literals backward: `WEIRD(dead,'h,42) for 0xdead with size 42. As you can see, the value part of the literal can (and will) be matched by the regular identifier rule. The compiler itself seems more or less straightforward to me, though.

An example of dirtiness in VHDL: construction of a record whose first field is a character can be written as "RECORD'(')')" - we have successfully constructed a record with its character field set to ')'. The single quote mark is either the start of a character literal (as in 'c'), the prefix of an attribute (NAME_OF_ENUMERATED_VALUE'SUCC), or part of a qualified expression for a typed value, as exemplified above. VHDL was one of the first languages that introduced operator and function overloading, including, but not limited to, overloading on the return types of functions.

Good luck implementing all of this when you are a student.
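The "three separate lexemes" point about Verilog sized literals can be sketched with a toy lexer. This is a hypothetical simplification (not a real Verilog lexer), just to show that size, base, and value match independently:

```python
import re

# Toy sketch, NOT a real Verilog lexer: sized literals are three separate
# lexemes -- size, base, value -- which is why macro expansion can assemble
# them out of order and whitespace between them is legal.
SIZE  = re.compile(r"\d[\d_]*")          # e.g. 42, or 10_00 for a thousand
BASE  = re.compile(r"'[sS]?[bBoOdDhH]")  # binary/octal/decimal/hex base marker
VALUE = re.compile(r"[0-9a-fA-FxXzZ_]+") # value digits; identifier-like text matches too

def lex_sized_literal(text):
    """Split a contiguous sized literal like 42'hdead into its three lexemes."""
    size = SIZE.match(text)
    base = BASE.match(text, size.end())
    value = VALUE.match(text, base.end())
    return size.group(), base.group(), value.group()

print(lex_sized_literal("42'hdead"))  # ('42', "'h", 'dead')
```

A real lexer also has to allow whitespace between the three tokens, which is exactly what makes the `WEIRD` macro trick above legal.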

Look up clash-lang.org. Haskell modules -> Verilog+VHDL with a simple compilation model, so you're not leaving performance on the table.

I wrote a 5-stage RISC processor with it for school; it was quite simple and easy to abstract.

If hardware were more competitive, industry coding practices would be more efficient. Instead, their own self-conception of pain points prevents them from going after this low-hanging fruit.


I wrote something like that a long time ago: https://github.com/thesz/hhdl (even before Clash)

I had a translation algorithm from pure Haskell code to the HHDL internals. I even wrote a MIPS clone using it (and it simulated OK).

There's just no market for that.

Cool! But note that Clash is actually compiling Haskell (i.e. analogous to GHCJS or something), rather than being an EDSL.

I'm hoping (as is the author with http://qbaylogic.nl/) that the market for FPGA soft(?)ware will suck less. Best case it pushes pressure on the fabs for ASICs, but we'll see.

> And there are no open-source toolchains for any of this.

There is one fully open source flow, but currently it only targets Lattice iCE40 chips: Project IceStorm. http://www.clifford.at/icestorm/

That said, the synthesis tool (Yosys) can actually synthesize netlists suitable for Xilinx tools, as well. In theory any company could probably add a backend component to Yosys to support their chips. arachne-pnr/icetools, though, can still only target iCE40 chips.

That said, it all works today. I recently have been working on a small 16-bit RISC machine using Haskell/CLaSH as my HDL, and using IceStorm as the synthesis flow. This project wouldn't have been possible without IceStorm - the proprietary EDA tools are just an unbelievable nightmare that otherwise completely sap my will to live after several attempts...[1][2]

[1] Like how I had to sed `/bin/sh` to `/bin/bash` in 30+ shell scripts, to get iCEcube2's Synplify Pro synthesis engine to work. WTF?

[2] Or other great "features", like locking down iCE40-HX4K chips with 8k-usable LUTs to 4k LUTs artificially, through the PR/synthesis tool, to keep their products segmented. I mean, I get the business sense on this one (easier to do one fab run at one size), but ugh.
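For anyone curious, the IceStorm flow mentioned above is, roughly, a short command pipeline. This is a sketch based on the project's documented tools; exact flags, the device name (-d 1k), and the .pcf pinout file are examples only and vary by board and version:

```shell
# Rough sketch of the open-source iCE40 flow (Yosys + arachne-pnr + IceStorm).
yosys -p "synth_ice40 -blif top.blif" top.v        # synthesize Verilog to a BLIF netlist
arachne-pnr -d 1k -p top.pcf -o top.asc top.blif   # place & route for an iCE40 HX1K
icepack top.asc top.bin                            # pack the ASCII layout into a bitstream
iceprog top.bin                                    # flash to the dev board over USB
```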

It is[0], and electrical engineering students make them pretty regularly; it's just much more expensive and complicated if you actually want to make a chip with the output of one instead of just simulating it.


I don't see how it shouldn't be; it's an entirely different set of constraints?

Yes, it is.

Especially when you're working with RF, or when you're doing commercial products, or when you have a strict timeline and limited resources.

In a software project, development is limited only by human resources; you can't realistically blame the computer for being too slow to compile your code, and there are no "defects" when your users download your code.

The limiting factor is your 'building blocks' (component libraries with their cells, IO and what-have-you) that your fab (e.g. TSMC) gives to your design software house (e.g. Mentor, Synopsys, Cadence) for a specific process (e.g. $integer-$um|nm CMOS); a production run is usually built off of heavily NDA'd building blocks locked down by contract[0] (and that's assuming you have the cash to buy time for that tape-out!).

Even designing simple stuff without the fab's component libraries for old processes would be a daunting task. (For some context: something circa the Sega Dreamcast era -- 350 nm/4 layers or thereabouts -- is well within the realm of what a talented single 4th-year could design with a fair bit of ease for a capstone (senior-year) project, given the component libs. Without the tooling, he'd be lost.) I'm sure Adapteva wanted to open source the final files which went to the fab for tape-out, but you could bet your bottom dollar that if they did, a take-down letter would be sent to GitHub and Adapteva would be slammed with a lawsuit.

SPICE is/was the original open-source project that came out of UC Berkeley in the '70s, if you want to go from zero to tape-out on an entirely open source stack, but it's no trivial task. http://opencircuitdesign.com/links.html has some auxiliary resources, and IIRC there's a Linux distribution with a pretty good toolkit, with even things like analog simulators for RFIC (though, as the late, great Bob Pease of NatSemi said - "never trust the simulator" ;)).

Side-note: Adapteva - your work is fascinating, so much so that I read your entire set of ref docs for the Epiphany. I'm in the Boston area, let me buy y'all a coffee at Diesel as I'd love to pick your brains.


[0] - (Grey area legality content) - Here's an example of the documentation of the libs you'd be using - normally even these documents are lock&keyed: http://www.utdallas.edu/~mxl095420/EE6306/Final%20project/ts... This looks like a masters level thesis project directory by the course number (didn't go to U of T:D) @ 180 nm sizing.

Probably RTL would be more correctly known as "Register Transfer Level" as in a level of abstraction, in contrast to for example the lower "gate" level of abstraction.

Graphic Data System. AKA GDSII

I might be wrong, but if they automated the flow from RTL to GDS, the timing might not be optimal. I understand that, given their lack of resources, this is unavoidable, but in the normal chip design flow the backend timing ECO is critical to achieving high frequency across all timing corners.

Yes, we are leaving 2X on the table in terms of peak frequency compared to well staffed chipzilla teams. Not ideal, but we have a big enough lead in terms of architecture that it kind of works.

The comment above said you couldn't release the info due to the EDA vendor. However, people like Jiri Gaisler have released their methodologies via papers that just describe them with artificial examples. Others use non-manufacturable processes and libraries (like NanGate's) so the EDA vendors' feelings don't get hurt by results that don't apply to real-world processes. ;)

So, if you have a 16nm silicon compiler, I encourage you to pull a Gaisler with a presentation on how you do that with key details and synthetic examples designed to avoid issues with EDA vendors. Or just use Qflow if possible.

I'll pass for now...Gaisler is in the business of consulting, we survive by building products. I am happy to release sources, but it's completely up to the EDA company.

[edit: was thinking of the wrong Gaisler, still will pass]

Damnit. No promises, but would you consider putting it together if someone paid your company to do it under an academic grant or something? Quite a few academics are trying to do things like you've done; there's a small chance one might go for it.

Btw, your site is down right now.

It's pretty ironic that parallella.org is down on an article about high parallelism because, apparently, it cannot take HN-front-page load levels.

That's concurrency, throughput, and load-balancing of web servers connected to pipes of certain bandwidth. It's not the same as parallel execution of CPU-bound code on a tiled processor. You could know a lot about one while knowing almost nothing about the other.

That seems analogous to human assembly optimization vs a compiler. But the time to market is greatly reduced, designs can be vetted and a 2.0 that is optimized for frequency can be shipped later.

IIRC, human assembly optimization is unlikely to be better than a modern compiler nowadays. Same thing could very well happen for this "automated flow" if it starts incorporating its own optimization techniques.

That is a myth. Most developers can't beat LLVM. LLVM can't beat the handcrafted assembly in libjpeg-turbo or x264 or openssl or luajit by compiling the generic C alternative.

In response to the other replies: I'm not sure about luajit, but the other two examples involved a programmer hand crafting algorithms around specific special purpose CPU instructions -- vector processing and video compression hardware, if I remember the details of x264 correctly. This is so specialized and architecture specific that it probably doesn't make sense to push it into the compiler.

Speaking from experience, even getting purpose-built compilers like ICC to apply "simple" optimizations like fused-multiply-add to matrix multiply is non-trivial.

Taking jpeg decoding as a concrete example of why modern compilers fall over, you have two high-level choices: (1) the compiler automatically translates a generic program into one that can be vectorized using the instructions on the target platforms. This will probably involve reworking control flow, loops, heap memory layout, malloc calls, etc, and will require changing the compressed/decompressed images in ways imperceptible to humans (the vector instructions often have different precision/rounding properties than non-vector instructions). This is well beyond the state of the art.

(2) Find a programmer that deeply understands the capabilities of all the target architectures and compilers, who will then write in the subset of C/Java/etc that can be vectorized on each architecture.

I think you'll find there are many more assembler programmers than there are people with the expertise to pull off (2), and that using compiler intrinsics is actually more productive anyway.

x264 does not use any video compression hardware. It uses only regular SIMD.

I don't agree that SIMD is so specialized. It is needed where ever you have operation over arrays of items of the same type, including memcmp, memcpy, strchr, unicode encoders/decoders/checkers, operations on pixels, radio or sound samples, accelerometer data, etc.

Compilers have latency and dependency models for specific CPU arch decoders/schedulers/pipelines. Compiler authors agree that compilers should learn to do good autovectorization. But it's hard. So people use assembly.

yellowapple said:

> human assembly optimization is unlikely to be better than a modern compiler

You said:

> Most developers can't beat LLVM

Then you pointed out some specific examples where a human can beat a compiler.

Seems like you two agree, and then you go and call what he is saying "a myth". I think I need some clarification.

Prior to this my understanding was that if the developer provides the compiler with good information via types and const, avoids pointer aliasing, and in general makes the code easy to optimize, then the compiler can do much better than most humans most of the time; but of course a domain expert, willing to expend a huge amount of time and armed with all the knowledge the compiler would have, can beat the compiler. It just seems that beating the compiler is rarely cost (time, money, people, etc...) efficient.

Is my understanding close in your opinion?

Making C compilers for different architectures output great code from the same source is really hard. E.g. "const" is not used by optimizers because it can be cast away. Interpreters, compression routines, etc. can always be sped up using assembly.

If what your program does can be sped up using vector registers/instructions (e.g. DSP, image and video processing) then you want to do that because x4 and x8 speedups are common. Current autovectorisers are not very good. If it is not the most trivial example like "sum of contiguous array of floats", you'll want to write SIMD assembly or intrinsics or use something like Halide. In practice projects end up using nasm/yasm or creating a fancy macro assembler in a high level language.

The choice to use assembly is economics, and it's all a matter of degree. How much performance is left on the table by the compiler? How many C lines of code take up 50% of the cpu time in your program? How rare is the person who is able to write fast assembly/SIMD code? How long does it take to write correct and fast assembly/SIMD code for only the hot function for 4 different platforms (e.g. in-order ARM, Apple A10, AMD Jaguar, Haswell)?

If you think "25%, 100k LoC, very rare, man-years" then you conclude it's not worth it. If you think "x8, 20 lines, only as rare as any other good senior engineer, 50 hours" then you conclude it's stupid to not do the inner loop in assembly.

What are the numbers in practice? I don't know. In practice, all the products that have won in their market and can be sped up using SIMD have hand coded assembly or use something like Halide and none of them think the compiler is good enough.

> Making C compilers for different architectures output great code from same source is really hard. e.g. "const" is not used by optimizers because it can be cast away.

const most certainly is used by optimizers: https://godbolt.org/g/kLmGr4

The willingness of C compilers to (ab)use undefined behavior for optimization is one of the main criticisms against it.

Check out the cppcon 2016 presentation by Jason Turner and watch how eager the compiler optimizes away code when const is enabled on values. Cool presentation too, and uses Godbolt's tool https://www.youtube.com/watch?v=zBkNBP00wJE

I think the argument is not against unwillingness, but when and how.

If it's not at least able to match handcrafted assembly using intrinsics, you should file bugs against LLVM. There is no theoretical reason why compilers shouldn't be able to match or beat humans here: these problems are extremely well studied.

Sometimes consistency is desirable, as well as performance. Compilers are heuristic. They evolve and get better, but they can mess up, and it's not always a fun time to find out why the compiler made something that was performance sensitive suddenly do worse, intrinsics or not -- from things like a compiler upgrade, or the inlining heuristic changing because of some slight code change, or because it's Friday the 13th (especially when it's something horridly annoying like a solid 2-3% worse -- at least with 50% worse I can probably figure out where everything went horribly wrong without spending a whole afternoon on it). This is a point that's more general than intrinsics, but I think it's worth mentioning.

Sure, I can file bug reports in those cases, and I would attempt to if possible -- but it also doesn't meaningfully help any users who suddenly experience the problem. At some point I'd rather just write the core bit a few times and future proof myself (and this has certainly happened for me a non-zero amount of times -- but not many more than zero :)

You may want to read this Mike Pall post about the shortcomings of high level language compilers regarding interpreters: http://article.gmane.org/gmane.comp.lang.lua.general/75426

"using intrinsics" is a cop out: you are essentially doing the more complicated part of translating that sequence of generic C code into a rough approximation of a sequence of machine instructions and leave the compiler to do the boring and simpler parts, like register allocation, code layout and ordering of independent instructions.

Compilers are smart at some things and not so smart at others. I can beat the compiler in tight inner loops almost every time, but it will also do insanely clever things that I'd never think of!

Common misconception, see Proebsting's Law, "The Death of Optimizing Compilers":


as well as 'What every compiler writer should know about programmers or “Optimization” based on undefined behaviour hurts performance '


Slides without the talk are not my favorite - have a link to the talk itself?

The second paper is so biased it hurts. It hardly attempts to hide this bias; on the second page it starts referring to one group of people as "clueless" and never justifies it by describing what being clued-in would look like.

The second paper also has a strong assumption that compilers should somehow maintain their current undefined behavior going forward. It is almost as though the paper author thinks a compiler can somehow divine what the programmer wants without referring to some pre-agreed upon document, such as the standard for the language.

The second paper also talks only about performance and not about any other real world concern, like maintainability, reliability or portability.

This paper sets up straw men when it trots out code with bugs (that loop on page 4) and then complains that a pre-release version of the compiler does something unexpected. Of course non-conforming code breaks when compiled. Of course pre-release compilers are buggy.

The paper's author wants code to work the same on all systems even when the code conveys unclear semantics. That is unreasonable.

Why even write a book about it? effectively a no-op

To give credit to the paper's author that no-op is part of the SPEC benchmark suite and the author feels that code in that benchmark is being treated as privileged by compiler authors.

Even though I disagree with the author I try to understand some of his perspective.

There's a gap between "humans can't write assembly better than the compilers" and "there's nothing humans can do to help the compiler write better code".

Depends. You won't beat LLVM if your code uses strictly intrinsics. Some things, like adding carry bits across 64-bit arrays, might need to be done by hand, because of special knowledge about your data that is not generalizable.

unless you have a language which allows you to express that knowledge of the data

I agree completely, it's still impressive to me that they presumably managed a competitive offering with such a system. I imagine having it be a highly homogeneous design also helped.

Design symmetry and regularity was the key. Harder to achieve that with a heterogeneous architecture.

The interesting question, to me at least, is how much cheaper this chip is - with its suboptimal maximum clock rate - compared to a chip from a non-automated flow. If peak clock rate is one half, but cost is one hundredth, I'd say it's a spectacular achievement.

100th in costs and one half in performance is, granted, wishful thinking on my part. But I believe the important point is that with a sufficient productivity gain, this technology can reduce the old, non-automated way to something akin to writing software libraries in assembly. Writing software libraries in assembly is useful, but few bother to do it because they'd rather just buy more hardware. Churning out twice as many chips, once you have your design finished, isn't really that much more expensive, as I understand it.

> but we were forbidden by our EDA vendor to release the code.

Why? Is there anything that could be done to change that?

You should investigate the RISC-V project.

It is an open source RISC based ISA along with open source implementations of example processor cores. Then you could have had a processor that was completely open and did not include any proprietary code.


I am here, if anyone has questions. AMA! Andreas

What will the cost estimate be for a PCIe board? A chip? If this thing touches consumer hands.

Are you planning any production samples for research / universities / DARPA ?

The chip is about the same size as the Apple A10, so in terms of silicon area it's in the consumer domain, but price will only come down to consumer levels if shipments get into the millions of units. Big companies take a leap of faith and build a product hoping that the market will get there. Small companies get one shot at that. With university volumes and shuttles, we are talking 100x costs. So the $300 GPU PCIe-type boards become $10K-$30K with NRE and small-scale production folded in.

You should look into alternative financing methods.

How long is the period from needing the cash to pay for production to availability in retail, roughly?

If it's all about volume, accumulating orders over a long period using some non-reversible payment method could, perhaps, get you into millions of units. It's all about how long people are willing to wait in order to save on per-chip unit costs.

What type and size memory can the Epiphany-V support?

Also congrats! This is brilliant engineering to get a chip like this into production silicon as a small team.

How much did the prototype MPW(?) silicon cost?

Up to 1 petabyte supported theoretically through FPGA interfaces.

We can't disclose MPW costs. Chip was funded by DARPA. For standard MPW costs, check with MOSIS.


I had a friend who mentioned that it was very difficult to get the 64-cores Parallellas with fully-functional Epiphany-IV chips. Are these yield problems going to continue with Epiphany-V or can we expect a full 1024 functional cores per chip?

It would be a BIG mistake to assume 1024 working cores. If you want to scale your software you should take a look at Google/Erlang and others. Not reasonable to demand perfection at 16nm and below...

Not saying we won't have chips with all cores working, just saying you shouldn't count on it.

So what can we count on?

In a tile based CPU error topology matters. A string of broken cores or a broken core at the edges is likely worse than a broken core with all 4 (or 8?) neighbors working.

Impossible to characterize without high volume silicon or accurate yield models. We can say that historically, most failures are in SRAM cells and they are limited to a few bits (the core still works!) and that in general only one out of N cores will fail. For argument's sake, let's assume the whole network always works, but 1 CPU may be broken (this is what needs to be confirmed later). Does that help?

Yes, that helps.

It might be easier to work around broken SRAM bits than just skipping a whole core.

That way you could always have same pipeline layout and not need to compute it dynamically.

You refer to the per-CPU SRAM as "memory" rather than "cache". It's just addressable local memory?

How many DRAM ports?

Yes, you can call it scratchpad or SRAM. The point is that there is no hardware caching. The local SRAM is split into 4 separate banks so it is "effectively" 4-ported. DRAM controllers are up to the system designer. This is handled by the FPGA (like previous Epiphany chips).

What are the chances of seeing a new Parallella SBC with an Epiphany-V coprocessor coupled with a RISC-V main processor?

Not going to happen in the near term. There is no way to meet the price point needed to compete in the low cost SBC market with the Epiphany-V. Believe it or not, the $99 Parallella was priced too high to reach mass adoption.

How about an evaluation board which plugs into the mezzanine connectors of the ZC706 evaluation kit? Something similar to the AD9361 FMCOMMS3/5 [1]

Also: any more information on the ISA extensions for communications/deep learning?

[1] https://wiki.analog.com/resources/eval/user-guides/ad-fmcomm...

Sure, there will be evaluation boards, they just won't be generally available at digikey and won't cost $99. More information about custom ISA will be disclosed once we have silicon back.

Having seen that the $99 price point was too high, is one of your goals still "supercomputing for everyone"? Or has that dream been dashed?

Well, the Parallella has shipped to over 10,000 people and it's still selling at Amazon and DK, so no, the dream is not dashed in any way. The number of publications and frameworks around Parallella is growing every month...

No reason to drive a 1024 core chip to the broad market when most applications aren't ready to use 16 cores. With this chip we focus on customers and partners who have proven that they have mastered the 16-core platform.

>No reason to drive a 1024 core chip to the broad market when most applications aren't ready to use 16 cores.

Yet magically they have no problem taking advantage of massively parallel GPUs...

Most applications don't use 16 CPU cores because they don't need them.

I think you're underestimating the requirements and mastery of cloud companies. Something like an Amazon lambda could virtualize 4 cores per instance and host 256 lambda execution units on a single chip. The use cases are endless

Unless the architecture has changed drastically from the earlier Epiphany, they can't be virtualised like that, and each core is way too slow to be suitable for lambdas, except for software written specifically to take advantage of the parallelism of the architecture.

You still need to recompile code for the new architecture, and taking full advantage of it wisely is not easy... but may be worth it in many use cases. Part of the problem is that it's not 100% clear which use cases these are and how to market it. Probably computation per watt is the most likely performance advantage, but it's still amazingly hard to sell people on that sometimes.

Some parallel algorithms will scale to bigger (more parallel) chips the way binary programs got more performance with higher clock frequencies. That's the holy grail..

Congrats again on getting an amazing amount done on budget. The part that jumped out more than usual was you soloing it to stay within budget. Pretty impressive. How did you handle the extensive validation/verification that normally takes a whole team on ASICs? Does your method have a correct-by-construction aspect and/or automate most of the testing or formal stuff?

Modern SOCs might have 100 complex blocks. We had 3 simple RTL blocks (9 hard macros). Top level communication approach was "correct by construction". Nothing is for free.

That makes sense. Appreciate the explanation.

4100 hours in about ten months (according to the PDF). Did you really put in 100 hour work weeks?

Hours were over a 12 month period, but yes... the pace was relentless. All ambitious projects, including many kickstarter projects, get done because creators end up working for free for essentially thousands of hours. In this case, we were on a fixed cost budget so those hours were "my problem".

#1 on HackerNews is worth it. Congratulations, man!

Are your competitors GPUs and or Xeon Phi? What is programming on this chip and how is the instruction set designed?


http://adapteva.com/docs/epiphany_arch_refcard.pdf http://adapteva.com/docs/epiphany_arch_ref.pdf

Not competitors yet. They have awesome silicon in the field, we just taped out...

Vim or Emacs? :trollface:

But seriously, I'm tremendously curious about the use for this with video processing. Has there been any good benchmarks with that?

Emacs? Ahem. I would like to return the parallella I purchased in the kickstarter campaign...

Just kidding. Nobody's perfect. :)

Awesome to see the 1024 cpu epiphany taped out! Congratulations! Any plan to put these into a card computer for easy programming and evaluation? EDIT: nevermind on the question, I see the response below.

Would like to say that your kickstarter was one of the best communicated most smoothly run kickstarter campaigns that I have ever backed.

:-) Thanks for making me laugh :-)

Hopefully you guys have ECC on your 64MB of SRAM, otherwise the mean time to a bit flip due to Single Event Upset (SEU) is around 400 days (based on 200 FIT/Mb, i.e. failures per billion hours, from previous experience).

No ECC on chip, but we do have column redundancy. We are pushing the envelope in terms of SEUs, making an assumption that the right programming model and run time will be able to compensate for high soft error rates. It's a contentious point, but basically our thesis is that with 1024 cores on a single chip, cores are "free" and it "should" be possible to avoid putting down very expensive ECC circuits on every memory bank (x4096). Some of our customers don't notice all bit flips because they have things like Turbo/Viterbi ..channels aren't perfect...

This is a bit of a dumb question; when do you feel your site is going to be back up? I would actually rather like to buy a Parallella...

I know...it's painful, we honestly weren't expecting this.

Here are direct links if you are in a rush:

Amazon: https://www.amazon.com/Adapteva/b/ref=bl_dp_s_web_9360745011...



Cool, stuff for sure.

I didn't see it addressed in the paper, how does this compare WRT discrete DSP chips? Are you targeting ease of programming instead of raw FMAD/etc?

In modern DSP chips programmers have to contend with: VLIW, SIMD, pipelines, caches, and multicore.

In Epiphany, the programmers are challenged by the manycore and an SRAM size cliff (so 0 or 1 in terms of pain).

It depends...but I personally prefer having one big dragon to slay rather than 10 little ones.

Thanks, sounds like lots of parallels (har har) to the SPUs on the PS3, which got a bad rep but I thought were great if you went in with the right approach.

I see that there is a llvm backend at https://github.com/adapteva/epiphany-llvm, but it hasn't been updated in a while. Are there any plans on upstreaming/contributing and maintaining a backend for llvm?

We are quite happy with our GCC port so LLVM hasn't been a priority. If anyone wants to take over the port, please do! We could give financial assistance for getting it completed, but the budget would be modest.

What is your software story for this thing?

Are you upstreaming qemu, uboot, Linux, GCC, GDB etc changes?

Will we see a Debian port for this?

For Epiphany: GCC is upstream already, working on GDB upstreaming. There is no Linux, qemu, or u-boot.

For Parallella: Linux upstream, uboot might be as well? Runs Debian, Ubuntu, etc


So what do you run on Epiphany if there is no Linux?

First of all, congrats, this is very impressive. Second of all, I've been thinking a lot about how proprietary GPU computation and especially VR is these days. Any interest or plans for the future in specialized hardware development for VR?

When are dev-boards coming out?

Can it run off Power over Ethernet? That would be interesting.

Sure...but probably not with all cores running full throttle. Would need to build an appropriate board.

Two things immediately jump out

    Custom ISA extensions for deep learning, communication, and cryptography
    Autonomous drones, cognitive radio
The radar geeks are gonna love to get their hands on ~250GFLOP, 4watt processor.

I'm seeing NIDS for 10+Gbit links, DDOS mitigation, cache appliance for web servers, Erlang accelerator, BitTorrent accelerator, and so on. Quite a few possibilities. Also, something like this might be tuned for hardware synthesis, formal verification, or testing given all the resources that requires. Intel has a nice presentation showing what kind of computing resources go into their CPU work:


I don't believe the chip has enough I/O for a 10Gb nic.

The issue is not bandwidth, it's the system cost. The chip has 1024 IO pins. (more than enough for MANY 10Gb nics...)

"The issue is not bandwidth, it's the system cost."

What does that mean?

I have a naive question based in my dreams:

Is it possible to design a CPU that switches ON-DEMAND between parallel and linear operation? So, if we have 1000 cores, it could switch to 10 cores with the linear power of 10 x 10?

In my dreams this was very useful, but I wonder how feasible it could be ;)


Basically the limiting factor in most designs isn't so much arithmetic as fetches and branches, especially cache misses. These are inherently linear operations - if you need to fetch from memory and then jump based on the result, for example.

Superscalar 'cheats' somewhat by spending area to keep the pipeline fed, through branch prediction and suchlike.

The nearest thing is the graphics card, which has a very large number of arithmetic units but less flow control, so you can run the same subroutine on lots of different data in parallel.

Highly multicore chips make a different tradeoff: external memory bandwidth is very limited. Ideal for video codecs etc where you can take a small chunk and chew heavily. Very bad for running random unadapted C code, Java etc.

There has been a bunch of academic research about this topic under names like core fusion and dynamic multicore. A recent sample: https://www.microsoft.com/en-us/research/wp-content/uploads/...

Sure: it's called a superscalar CPU.

This is sort of what Hyperthreading is. Though you'll notice the ratios are not as good as what you want.

yes. It's called FPGA.

Could be excellent for a dense automatic isolating array microphone; thousand other things. I'd love to see Parallella in embedded, they set a great example.

Did I read the specs wrong or are they claiming a 12x - 15x performance improvement over the Ivy Bridge Xeon in GFLOPS/watt? In a <2w package? http://www.adapteva.com/wp-content/uploads/2013/06/hpec12_ol...

That's an older paper, but yes, there has been more than one independent study showing a 25x boost in terms of energy efficiency. See the Ericsson FFT paper, the OpenWall bcrypt paper, and others at parallella.org/publications.

The Epiphany gains are certainly only achievable for massively pipelinable or embarrassingly parallel operations with very little intermediate state (e.g. streaming data, neural software, etc), not for random access large memory footprint crunching like the Xeon. There simply isn't the per-core memory (64KB), or external memory bandwidth, to go around otherwise.

Xeon, Power, etc are kind of power pigs anyway, though they've got a lot of absolute oomph to show for it.

That's not unreasonable.

I should clarify: presumably the Parallella's RISC does away with a lot of the superscalar features of x86 which are embedded in the Xeon Phis.

One way to think about it is that things like branch prediction and speculative and out of order execution are like real-time JITting of your code.

Not having that silicon can make things way more efficient.

I wonder if the Erlang/BEAM VM could take advantage of it. Erlang would be a beast. If any of the pure functional languages get running on it (for easy parallelism), watch out. Nice work!

Things like Seastar[0] and Rust's zero cost futures would also make good use of many cores.

[0] http://www.seastar-project.org/

Pony would be even better, but for this we would need a llvm toolchain, not just gcc.

The linked paper mentions a 500 MHz operating frequency, as well as mentioning a completely automated RTL-to-GDS flow. 500 MHz seems extraordinarily slow for a 16nm chip - was this just an explicit decision to take whatever the tools would give you so as to minimize back-end PD work? Also, given the performance target (high flops/w), how much effort did you spend on power optimization?

The paper stated that the 500MHz number was arbitrary (had to fill in something for people to compare to). Agree that 500MHz with 16nm FinFET is ridiculously slow. We are not disclosing actual performance numbers until silicon returns in 4-5 months. 28nm Epiphany-IV silicon ran at 800MHz.

But can I run Erlang on it?

Hah! You thought you would get us with that one.:-) Here is the link to the Erlang OTP developed at Uppsala University for Epiphany.


Is this actually running Erlang processes on the epiphany cores or just erlang spawning special processes on the epiphany cores? I've seen the latter and was not impressed.

This is actually a cut down erlang otp running on the Epiphany cores. It's not ready for production, but it's interesting research. See the README.

Sweet! Though the README does not identify what is "cut down" or the status and what remains to be vetted.

Hey Andreas,

I'm unable to find the feature branch bringing Parallella support to OTP https://github.com/margnus1/otp/branches Maybe it was merged upstream already?

You came a long way since I saw you in London in 2013. 1024 cores came sooner than 2020! Amazing job.

My second favourite comeback, right after "but did you win the putnam".

Would anyone be interested in an Epiphany dedicated servers a la Rasberry Pi collocation (https://www.pcextreme.com/colocation/raspberry-pi)?

I've always wanted to play with these units, but buying one doesn't make a lot of sense for me (where would I put it?). I would be super interested in making them accessible to folks.

What are the benefits/advantages of choosing something like this over a traditional Arm/x86 or a GPU? My knowledge in this area is limited. :)

Best I can tell, Epiphany is designed as a co-processor, so it's not booting the OS and relies on a host (like an ARM/x86) to run the show and issue commands.

The Epiphany cores have significantly more functionality than GPU cores, so they're useful for things beyond computing FFTs and other number-crunching tasks. For example, you could map active objects one-to-one onto Epiphany cores.

I read through the pdf summary and it doesn't look as if the shared memory is coherent (which would be silly anyway). But I couldn't find any discussion about synchronization support. Given the weak ordering of non-local references it seems difficult to map a lot of workloads. My real guess is that I haven't seen part of the picture.

It comes back to the programming model. Synchronization is all explicit. See publication list. Includes work on MPI, BSP, OpenMP, OpenCL, and OpenSHMEM. The work from US army research labs on OpenSHMEM is especially promising. It's a PGAS model.

got it, thanks. it looks like the per-node memory controller has an atomic test and set

edit: and also a global wired-or for a barrier.

If you're looking for weird synchronization primitives, look at the documentation of the DMA controller. It has a mode in which it stores bytes written to a particular address into a memory range in the order the writes arrive. I haven't figured out a reasonable way to use that with multiple writers (except the trivial case of having a byte-based stream with bounded size), though.

Yeah, I was thinking about that problem too. (It's not safe to blindly write somewhere unless you can be sure that nobody else is going to simultaneously clobber your data. You can't do any kind of atomic test-and-set or compare-and-swap operation on remote memory, so you don't have the usual building blocks for things like queues or semaphores.)

The problem becomes a lot easier if you can reduce the multiple-writer case to the single-writer case. One idea that occurred to me is that since you have 1024 cores, it might make sense to dedicate a small fraction of them (say, 1/64) to synchronization. When you need to send a message to another process, you write to a nearby "router" that has a dedicated buffer to receive your data. The router can then serialize the message with respect to other messages and put it into the receiver's buffer.

Basically, you'd end up defining an "overlay network" on top of the native hardware support; you pay a latency cost, but you gain a lot of flexibility.

EDIT: I may be completely wrong about the first paragraph; it looks like the TESTSET instruction might actually be usable on remote addresses. I assumed it didn't because the architecture documentation doesn't say anything about how such a capability would be implemented. But if it works, it would drastically simplify inter-node communication.

IIRC TESTSET is usable: it just sends a message that causes that to happen, but you don't learn if the test succeeded.

I was talking about the DMA mode in which every write to a special register (possibly coming from a different core) gets "redirected" to the next byte of the DMA target region. This can work as a queue with multiple enqueuers, but it has bounded size (after the size is exhausted, messages get lost) and operates on single-byte messages.

The easiest way to think about it is that remote access is order-preserving message passing with a separate message network for reads (as it truly is), so:

0. Local reads and writes happen immediately.

1. Writes from core X to core Y are committed in the same order in which they happen.

2. Reads of core Y from core X are performed in the same order in which they are executed, and they are performed sometime between when they get executed and when their result is used.

3. Reads can be reordered WRT writes between the same pair of cores (so you _don't_ see your writes).

I don't remember how this works with external memory (including cores on different chips).

Not a hardware genius here. What does coherent memory mean?

As the other comments have said, it basically has to do with the level of consistency between different processors' views of the shared memory space. (There are some semantic differences between "consistency" and "coherence" that I'm going to ignore.)

For some context, the x86 memory model gives you an almost consistent view of memory. The behavior is roughly as if the memory itself executes reads/writes in sequential order, but writes may be buffered within a processor in FIFO order before being actually sent to memory. Internally, the memory actually isn't that simple -- there are multiple levels of cache, and so forth -- but the hardware hides those details from you. Once a write operation becomes globally visible, you're guaranteed that all of its predecessors are too.

From what I can see from a quick overview of the Epiphany documentation, it doesn't have any caches to worry about, but it gives you much weaker guarantees about memory belonging to different cores. For one thing, there's no "read-your-writes" consistency; if you write to another core and then immediately try to read the same address, you might read the old value while the write is still in progress. For another, there's no coherence between operations on different cores, so if you write to cores X and then Y, someone else might observe the write to Y first (e.g. because it happens to be fewer hops away).

It applies to architectures with caches https://en.wikipedia.org/wiki/Cache_coherence

Epiphany-V does not have caching. You explicitly move data around in software. Some software abstractions are better than others.

As I understand it: If memory is coherent then all cores see the same values when they read the same location at the same time. Stated another way, the result of a write to a location by one core is available in the next instant to all other cores, or they block waiting for the new value.

Thank you all for that help. Did not see definitions elsewhere in post.

What's the practical application of a chip like this?

In general it was built for math and signal processing (broad field). Within those fields, more specifically it was designed initially for real time signal processing (image analysis, communication, decryption). Turns out that makes it a pretty good fit for other things as well (like neural nets..). Here is the publication list showing some of the apps. (for later, server is flooded now): http://parallella.org/publications

In the paper they are suggesting deep learning, self-driving cars, autonomous drones and cognitive radio.

What is cognitive radio?

Dynamically switching carrier frequencies to make better use of the spectrum. It is somewhat related to software-defined radio, in that SDR's are typically used to prototype cognitive radio.

It hasn't really got mindshare, though, in the sense that players like Qualcomm have all but ignored it and would rather work on proprietary comms schemes.

Dynamic spectrum management, changing channels based on current usage and other factors.


Maybe that's what they call speech recognition?

PAPER: https://www.parallella.org/wp-content/uploads/2016/10/e5_102...

(access until we resolve the hosting issues, wordpress completely hosed...)

Prepend cache: to the URL to view Google's cached version of this website.

Wow from Kickstart to DARPA funding! How did I miss that?

They went surprisingly silent after the KS boards. I falsely assumed they had left the business or gone employee. Delightfully surprised they found ways to keep going.

For those interested, Andreas did an interview on the Amp hour a while ago. http://www.theamphour.com/254-an-interview-with-andreas-olof...

Congrats to everyone at adapteva. I remember talking to a couple of researchers who were using the prototype 64 core epiphany processor who seemed excited at how it could scale. I wonder how excited they'd be about this.

1024 64-bit cores? Cool. Very impressive.

64 MB on-chip memory? For 1024 cores? That's 64 KB per core. That seems rather inadequate... though for some applications, it will be plenty.

You need to think of it as aggregate memory, not per-core memory, to use it effectively. Are you aware of a chip with more than 64MB of on-chip RAM?

The latest generations of IBM Power processors have >64MB L3 caches on chip. The Power 7+ has 80MB per chip, the 12 core Power 8 96MB, according to Wikipedia the Power 9 will have 120MB.

Good data! That puts e5 in good company with some big-iron heavies.

I realize comparing to Intel is unfair but I think the Skylake Iris Pro 580 has 128MB on chip RAM. https://en.wikipedia.org/wiki/Intel_HD_and_Iris_Graphics#Sky...

That's eDRAM though, and it's really on-module rather than on-chip. It's a separate die on the same module as the main SoC.

64 MB static RAM, no less. You've built a huge-ass static RAM chip and thrown in some local processing. (-:

Consider that many instruction and data caches are at the 16-32 KB scale. It's obviously a big criticism of the microarchitecture but you have a linear tradeoff between number of cores and available core memory. One core with 64 MB of memory seems less useful than 1024 cores with 64 KB of memory each (which can directly access all other core memory). But 65,536 cores with 1KB of memory each doesn't sound very useful either.

Thanks for articulating. As you know, there is no right answer as it depends on workload. Now if we could only build a specific chip for every application domain....

In fact, you have two trade-offs. One is what you said - that for a fixed amount of memory, the more cores, the less memory you have per core. The second trade-off is the transistor budget - the more space you use for cores, the less space you have left for memory.

The third trade off is cycle time; the larger the memory, the longer it takes to access it. This is why L1 caches are typically 16-64 KiB and despite that access is typically 2-3 cycles. However, 3+ cycles is difficult to hide in an in-order processor like this.

> But 65,536 cores with 1KB of memory each doesn't sound very useful either.

You've just described the general architecture of the Connection Machine[0], a late 80's/early 90's era supercomputer that was used for modeling weather, stocks, and other items. It was fairly useful in its time.


I think the right way to think about this is the following: scaling "up" is basically over with CPUs. Now we need scaling "out". This means learning how to make use of many more smaller cores, rather than just a few larger ones. Here communication becomes the problem and, indirectly, affects how you design and implement software. Scaling is becoming a software problem: how can you take advantage of 1024 cores with just 64KB of memory each, in a world where terabyte-sized data is daily business?

I think we will end up with systems with 64GB of memory, but which instead of 8 cores with 8GB each, have 1M cores with 64KB of memory each. We just need to learn how to write code that makes the most out of that, which is probably a lot more than what you can do with current systems.

And this Epiphany thing is something like the first step in that direction.

Exciting times.

PS3 SPUs had 256KB; you'll want to vectorize your data anyway if you want to take advantage of this.

You can't always vectorize your data. Like if you want to do highly parallel 3D rendering, you need the whole scene accessible to each core.

Of course, in that situation, the scene probably fits in 64 MB, so it's not really a limitation.

Unfortunately not, at least not for the real-world type of scenes that you see in movies / cartoons. Textures and high-polygon models take a ton of space.

Depends. If you do 3D rendering with triangles and shaders you can divide your buffers into tiles based on storage size and stream vertex/shader commands.

This is actually how all modern mobile GPUs work and it's highly vectorizable. The partitioning obviously needs to know the whole scene but that's much more lightweight than rendering.

From what I've heard from my ex-gamedev contacts, movies are heading that route in a large way because the turnaround time of raytracing is so long that it's really hurting the creative process.

The current top500 machine has 64KB scratchpad per processing core and seems to be capable of running real HPC applications well <http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-....

Depending on the application this may be a reasonable trade-off.

Is there a mirror anywhere?

So each processor has 64KB of local memory and network connections to its neighbors?

The NCube and the Cell went down that road. It didn't go well. Not enough memory per CPU. As a general purpose architecture, this class of machines is very tough to program. For a special purpose application such as deep learning, though, this has real potential.

    Cray had always resisted the massively parallel solution to high-speed computing, offering a variety of reasons that it would never work as well as one very fast processor. He famously quipped "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
I cannot see how this thing can be programmed efficiently (to at least 70% of computing capacity, as most vector machines can be programmed for).

Is the ISA Epiphany or RISC-V?

It's backward compatible with Epiphany-III...so it's still Epiphany ISA with new instructions.

I have read it, but in the past he wrote a blog post saying that RISC-V will be used as the ISA in future products. So maybe 64-bit RISC-V with backwards compatibility with Epiphany? (It sounds a bit strange.)

I have two excuses for why RISC-V didn't make it in. My February RISC-V post stated that we will use RISC-V in our next chip. We were already under contract for this chip, so I was referring to the next chip from now. I had hopes of sneaking it into this chip, but ran out of time. Both lame excuses, I know. I am firmly committed to RISC-V in some form in the future. For clarity, I am not talking about replacing the Epiphany ISA with a RISC-V ISA.

The Epiphany core is a co-processor, and the "main" processor is a couple of ARM cores to run Linux/other.

Maybe in the future they will offer boards with Risc-V main processors, and Epiphany co-processors.

I'm not sure how feasible 1024 Risc-V cores would be (although it sounds awesome). Epiphany cores were designed for this sort of thing.

Agree, but people have all kinds of pre-conceived notions about co-processors so let's clarify some things: e5 can't self-boot, doesn't have virtual memory management, and doesn't have hardware caching, but otherwise they are "real" cores. Each RISC core can run a lightweight runtime/scheduler/OS and be a host.

Jan Gray stuffed 400 RISC-V cores into a Xilinx Kintex UltraScale KU040 FPGA (and the KU115 is three times larger, not to mention the Virtex UltraScale range).


I think a heterogeneous product was implied in that post, but I don't blame you for the confusion. The Epiphany-V is still homogeneous because of the time/funding constraints.

Interesting, but for a very specialized market, somewhere in the corner between GPU and FPGA. Closest existing offer might be Tilera?

Site is currently slashdotted so I can't comment on details like how much DRAM bandwidth you might actually have.

Tilera is what I thought of, too. It's actually where I'm getting my ideas of applications for Epiphany-V. They did a lot of the early proving work on architectures like this. Example: first 100Gbps NIDS I saw used a few Tilera chips to do that.

Kind of off topic, but are there any low-end/hobbyist Tilera boards? The Linux kernel has support for it. I've always thought you could stress multi-threaded code in interesting ways by running it on tons of cores.

Good to see this here! I actually wrote a paper analyzing this architecture for one of my bachelor classes, been a few years but: http://simonwillshire.com/papers/efficient-parallelism/

What I don't understand with computer chips is how relevant the FLOPS number really is, because in most situations what limits computation speed is the memory speed, not the FLOPS.

So for example a big L2 or L3 cache will make a CPU faster, but I don't know if a parallel task is always faster on a massively parallel architecture, and if so, how I can understand why that is the case. It seems to me that massively parallel architectures are just distributing the memory throughput in a more intelligent way.

You have to look at all the numbers (I/O, on-chip memory, flops, threads) and see if the architecture fits your problem. Some algorithms like matrix matrix multiplication are FLOPS bounds. It's rare to see a HPC architecture (don't know if there is one?) that can't reach close to the theoretical flops with matrix matrix multiplication. Parallel architectures and parallel algorithm development go hand in hand.

The website is erroring out for me, so I wonder what the motherboard situation will be like for this chip. It would be really nice to be able to buy an ARM like we can buy an x86.

Truly inspirational in showing what largely one person can do, even in these times of huge fabs, expensive masks, and difficult modern design rules.


The website is down. Maybe a good opportunity to demonstrate the scalability improvement with such 1024-core processor?

how do I connect external RAM to it, and what would be the cpu-to-memory bandwidth in that case

External RAM, up to 1 Petabyte would be connected through an external FPGA, containing an epiphany link, some glue logic, and a memory controller.

From my understanding the Zynq's memory controller can only handle ~4GB of memory. Am I missing something? Is there a way to connect more than 4GB -- if so, I'd be very interested.

Larger FPGAs support 64-bit addressing with custom memory controllers.

With the new chip, is there a memory controller on the board, or will you still need the FPGA's?

Even with the new MPSoC, I think the memory controller is limited to 8GB.

Do you know what the most efficient cost/GB config is for an Epiphany + memory controller or FPGA?

thanks, Andreas. is there an existing solution, or is it a DIY project at this point?

that's going to provide some interesting race conditions for sure :D

What are the possible applications? i.e. how do you potentially make use of all the cores? Is it more like GPU programming?

Tying in to earlier discussion on C (https://news.ycombinator.com/item?id=12642467), it's interesting to imagine what a better programming model for a chip like this would look like. I know about the usual CSP / message passing stuff, and a bit about HPC languages like SISAL and SAC. Anyone have links to more modern stuff?

Wish I was still working on genetic programming and digital artificial life. This would be barrels of fun.

Seems like this URL is really popular, I get this connection error:

Error establishing a database connection

  Error establishing a database connection
Overload thru request storm?

Any chance of adding 16-bit floating point support in Epiphany-VI?

Error establishing a database connection

403 error now for the entire site.

Page is overwhelmed.

Can anyone provide a summary?

too bad the entire site is returning a 500 error now

i can't see anything on the site. is this for sale or just a proposed architecture? amazon seems only to be selling your 16-core device. was there a 64-core one? can't access your product offering.

The tapeout is apparently at the foundry and they are expecting chips back in 4-5 months. (I gathered this info from a google cache of their blog)

I think my thoughts on the parallella stuff still hold:


Basically this is a recurring theme in computing, but the whole custom massively parallel thing rarely works out.
