Could someone explain how this is different from GPU computing and regular multi-core CPU computing?
I realize there is a difference...but I'm not quite sure I grasp it yet. GPU computing is a lot of parallel math computations with limited shared memory. I'm assuming the Epiphany CPU is more capable than the simple GPU math units?
How's it different from multi-core CPUs? Just the sheer quantity of cores they have packed in there?
I did my master's thesis on GPGPU, so maybe I can help out a bit. I'm not yet too familiar with Epiphany's design, however. From what I can tell, what sets it apart most from multicore CPUs is a different memory architecture: the individual cores seem to be optimized for accessing adjacent memory locations, as well as the locations belonging to their direct neighbors. On this point the architecture seems similar to GPUs, although GPUs have a very different memory architecture again; to the programmer it might look similar, however, especially when using OpenCL.
The main point where Epiphany diverges from GPUs is that the individual cores are complete RISC processors. That could be a big plus when it comes to branching and subroutine calls (although NVIDIA is catching up on the latter point with Kepler 2). On GPUs, kernel subroutines currently all need to be inlined, and on a branch the cores not executing the current path simply sleep; Epiphany cores seem to be more independent in that regard. I still expect an efficient programming model for Epiphany to be along the same lines as CUDA/OpenCL, though - which is a good thing, btw. That model has been very successful in the high-performance community and it's actually quite easy to understand - much easier than cache optimization on a CPU, for example.
If we compare Epiphany to a CPU, what's mainly missing is the CPU's cache hierarchy, hyperthreading, long per-core pipelines, SSE on each core, and possibly out-of-order execution and intricate branch prediction (not sure on those last ones). The missing caches might be a bit of a problem. The memory bandwidth they specify seems pretty good to me, but from personal experience I'd add another 20-30% to the achievable bandwidth if you have a good cache (which GPUs have had since Fermi, for example). The other simplifications I actually like a lot - to me it makes much more sense to have a massively parallel system where you can just specify everything as scalar, instead of jumping through all the SSE and hyperthreading hoops like on CPUs. Optimizing for a CPU is quite a pain compared to these new models.
Assuming you're programming it in OpenCL, it's effectively a GPU with many more SMs but a narrower SIMD width. If they were to give it, say, 16-way predicated SIMD with incomplete IEEE compliance on par with the Cell (~4M transistors per core plus a wider internal bus), it would become a very interesting processor IMO, with ~1.4 TFLOPS per 64-core Epiphany board. At the very least, they'd get bought out if they built such a beast and undercut NVIDIA, AMD, and Intel. Just sayin'...
In the meantime, leave the fast atomic ops, ECC, and full IEEE compliance to the GPUs and Xeon Phis of the world until you have the transistor budget to go after them...
> If they were to give it, say, 16-way predicated SIMD
I think that would completely defeat the purpose of the architecture, as it'd massively bloat the transistor count per core. Their roadmap is for 1000+ independent cores on a single chip, not stopping at 64 per board.
And there's the problem: my personal bias from years and years of GPU programming is that I'd rather target 4 cores with 16-way SIMD than 64 cores each with scalar, or to quote Seymour Cray - "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
Besides, this is 28 nm technology and 15x15 mm, no? That's 225 mm^2. AMD's 28 nm Tahiti is 365 mm^2 with 4.3B transistors, making this thing ~2.7B transistors give or take, or ~41M transistors per core. Adding 4M transistors (source: a 4-way SIMD unit on a Cell chip is about 1M transistors) makes each core <10% larger in exchange for 16x the floating-point power. Unless I'm missing something, I'd build that chip in a minute...
Which is to say I don't want 1000+ wimpy cores - it'll get smashed by Amdahl's Law - when I can have ~900 brawny cores. NVIDIA and AMD have been exploring this space for almost a decade now and to start over without considering what they may have gotten right and what they have learned while doing so seems a little daft to me.
> I'd rather target 4 cores with 16-way SIMD than 64 cores each with scalar
You're assuming problems that are suitable for SIMD. If you have problems suitable for SIMD, use a GPU. Lots of problems are NOT suitable for SIMD.
If those 64 data streams all happen to require branches regularly, for example, your 4x 16-way SIMD is going to be fucked.
> Besides, this is 28 nm technology and 15x15 mm, no?
Where did you get that idea? Their site states 2.05mm^2 at 28nm for the 16 core version, i.e. ~0.13mm^2 per core.
So by your math, more like ~24M transistors, or ~1.5M per core. Your estimated die size is 70% larger than what they project for their future 1024 core version...
This is a ludicrous argument when arguing for a GPU architecture instead. A GPU architecture fares far worse for many types of problems, because what is parallelizable on a system with 64 general purpose cores may degenerate to 4 parallel streams on your example 4-core, 16-way SIMD.
There are plenty of problems that do really badly on GPUs because of data dependencies.
> when I can have ~900 brawny cores
Except you can't. Not at that transistor count, and die size, anyway.
> NVIDIA and AMD have been exploring this space for almost a decade now and to start over without considering what they may have gotten right and what they have learned while doing so seems a little daft to me.
Have they? Really? They've targeted the embarrassingly parallel problems with their GPUs, rather than even trying to address the multitude of problems their GPUs will simply run mostly idle on, leaving those to CPUs with massive, power-hungry cores and low core counts. I see no evidence they've tried to address the type of problems this architecture is trying to accelerate.
Maybe the type of problem this architecture is trying to accelerate will turn out to be better served by traditional CPUs after all, but we know that problems that don't spend most of their time executing the same operations across a wide data path are not well served by GPUs.
That said, this is where the R&D done by AMD and NVIDIA has expanded what is amenable to running on a GPU. Specifically, instructions like warp vote and fast atomic ops can alleviate a lot of branching in algorithms that would otherwise be divergent. It's not a panacea, but it works surprisingly well, and it's causing the universe of algorithms that run well on GPUs to grow, IMO.
What I worry about with Parallella is that by having only scalar cores, and lots of them, it has solved branch divergence in exchange for potential collisions when reading and writing data in memory. The ideal balance of SIMD width versus core count is a question AMD, Intel, and NVIDIA are all investigating right now. But again, at that transistor count there's no room for SIMD...
There is certainly something to what you say. The advantage of the GPU model is that the ALUs can occupy a much higher percentage of your die if each core is less independent. Independent threads are not necessarily what you need on an accelerator card - that's what you have CPUs for anyway.
Why plow a field with 1024 chickens, when you can plow it with 1M worms?
The GA144's F18 core has ~20 thousand transistors and is asynchronous; make the die the size of an Opteron, wait until you can pack 20B transistors on a die, and you get one million cores.
It's closer to regular multi-core CPU computing than GPU computing. It's general purpose cores.
What sets it apart is that the cores are tiny, with little per-core memory (though all cores can transparently access each other's memory as well as main memory), so the architecture is well suited to scaling up the number of cores at quite low power consumption.
So for problems that can be parallelized reasonably well, but with more complex data dependencies than what a GPU is good for, this might be a good fit.
I'd put it somewhere in the middle between GPUs (for embarrassingly parallel tasks) and general purpose CPUs with high throughput per core.
Also, this looks like it could fit in the power envelope of really small embedded systems, e.g. cellphones and tablets.
Before more developers have these systems, it'll be hard to say how useful they'll be, but the architecture looks exciting.
That's why I supported it - I really want to see how this type of architecture can be exploited, and whether or not it'll prove to be cost effective and/or simpler to work with than GPUs for the right type of problems.
IMO this combines some of the worst features of Cell (e.g. local memory and DMA) and of GPUs, and while the power efficiency is good, the absolute performance is very low. For a parallel noob using OpenMP/OpenCL I don't think it's any better than a desktop PC, because programming it is going to feel the same and performance is going to be equal or lower. And if you don't use the libraries then you're in low-level ninjas-only land — the extremely simple and flexible hardware is good in theory, because you can use it in many different ways, but it also doesn't help you or give any hints about how to exploit it properly.
It's not meant to compete with a desktop PC, or with a mass produced GPU.
It's meant to be a development platform for solutions based on their architecture, and a way for people to get familiar with the development model: there's an existing 64-core version of their chip, and future versions putting 1000+ cores on a board are the eventual target.
That it's also a reasonably capable platform to run Linux on (on the ARM chip) so you can do development directly on the board is an added bonus.
If the architecture is not good then people don't want to get familiar with it. I'm skeptical that even an "eventual" version with hundreds of cores would be worth using.
Well, clearly at least 3700 people want to get familiar with it based on the number of backers so far, which is pretty good for a niche platform like this.
Typical multicore CPUs don't have nearly as many cores as Parallella. Also, from what I can tell from www.adapteva.com/introduction, the power consumption is much lower and the interconnect is different: in Parallella, cores are laid out in a grid, and each core can only talk directly to its neighbors.