I'm really disappointed about how shallow the discussions about Adapteva are, and have been, on HN.
To remind everyone, the H = hacker. This device is a godsend, as far as I'm concerned. For the first time ever I get fully documented access to a compute array on a chip. No, the architecture wasn't designed for anything specific, like graphics, but that means I don't get bogged down in details I don't care about, like some obscure memory hierarchy.
The chip is plain, simple, low-power, and begging for people to have an imagination again. Stop asking what existing things you can do with it, ask what future things having something like this on a SoC would enable.
Also, you should really be thinking about the chip at the instruction level, writing toy DSL-to-asm compilers. Thinking along the lines of "oh yeah, I'll use OpenCL so I can be hardware agnostic" is never going to let you see what's possible with it. If you read the docs you'll see what a simple and regular design it is, perfect for writing your own simple tooling.
It's been a long time, but I feel like a kid again. Like when I first discovered assembly on my 8086. Finally, a simple device I can tinker with, play with, and wring performance out of.
This is exactly my take on the Parallella board, and it's the exact reason why I donated to the Kickstarter back in the Fall.
I want to try my hand at writing a real, efficient, many:many message-passing API on top of SHM. It's something I've been interested in for a while (and am doing in a side project for x86_64). Not because it hasn't been done a thousand times before, but because it's neat.
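For what it's worth, the core of such a message-passing layer over shared memory can be tiny. Here's a minimal sketch in plain C11: a single-producer/single-consumer ring buffer, one of which you could give to every ordered pair of cores to get many:many. All names and sizes here are illustrative, not from any Parallella SDK.

```c
/* Minimal single-producer/single-consumer ring buffer over shared memory.
   One building block for a many:many message-passing layer: give every
   ordered pair of cores one of these rings. Sizes are illustrative. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 64            /* power of two so indices can be masked */
#define MSG_BYTES  32

typedef struct {
    _Atomic uint32_t head;       /* advanced by the consumer */
    _Atomic uint32_t tail;       /* advanced by the producer */
    uint8_t slot[RING_SLOTS][MSG_BYTES];
} ring_t;

/* Returns 0 on success, -1 if the ring is full. len must be <= MSG_BYTES. */
static int ring_send(ring_t *r, const void *msg, size_t len)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS) return -1;               /* full */
    memcpy(r->slot[tail & (RING_SLOTS - 1)], msg, len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* Returns 0 on success, -1 if the ring is empty. */
static int ring_recv(ring_t *r, void *msg, size_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail) return -1;                            /* empty */
    memcpy(msg, r->slot[head & (RING_SLOTS - 1)], len);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}
```

The acquire/release pairing is what lets this run without locks: the consumer only sees a slot after the producer's memcpy into it is visible.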
I want to write a compiler for the Parallella. Not because there aren't compilers already, but because I've never written a compiler that targets a RISC architecture before. I've never written a compiler that respects pipelining.
I want to write a parallelized FFT based on the original paper for the Parallella. I've used various FFT libraries before, but never actually implemented an FFT straight up. Why? Not because it's never been done before, but just because it's an idea that appeals to me. And for practice parallelizing algorithms...
I want to write a raytracer for the Parallella. Not because I haven't written a raytracer before, but because I think that I'll be able to do something interesting with a Parallella raytracer that I haven't done before: real-time (ish) raytracing. Not because that hasn't been done before, but because it'd be neat to build.
I want to build a distributed physics engine. Not because there aren't excellent open-source physics engines (Bullet, ODE, etc.) -- but because I find the problem interesting. It's something I've wanted to do for a while, but never got around to. Why? Because it's interesting.
I could go on, but I'll stop here. The Parallella, I think, is a catalyst for a lot of small projects that I've wanted to do for a while. The Parallella is my excuse to spend time on random projects that will never go anywhere beyond a page on my website describing what they are, plus a link to the source code.
And, you know what? That seems perfect to me. That's why I want a Parallella, and that's why I'm eagerly awaiting mine within the next month or three. (Hopefully!)
Sounds cool, but worth pointing out that the Parallella is Zynq-based, and so comes with a Xilinx FPGA built into the SoC that includes the dual ARM cores. The FPGA provides the "glue" for the Epiphany chip to talk to the CPU, but there's plenty of spare capacity.
The more the merrier, though. I wish I had time to play with FPGAs - I have a Minimig (an Amiga reimplementation where the custom chips are all in an FPGA) and I'm on the list for an FPGA Replay (targeting FPGA reimplementations of assorted home computers, including the Amiga, and arcade machines).
Well stated. When I got an OLPC the point wasn't that it was a "good" laptop, it was that it was a "documented" laptop, and that is rare indeed. What I find particularly amusing are the comments on things like the Raspberry Pi (and Parallella) which talk about all the things they do with "real" computers that these things cannot do. It's amusing because when the first 8-bit machines, the Mark-8, the Altair 8800, and the KIM-1, came out, the exact same criticisms were leveled against them by the same sorts of people. People who were using the existing infrastructure (mini-computers, at the time of the Altair) whining that you couldn't run multi-user, you had no storage to speak of, you probably couldn't even assemble a program much less compile one on them, etc. etc. etc. The Apple II got huge amounts of disrespect from the establishment as a "baby's toy", a kids' toy designed to look like a computer. Then VisiCalc hit and those voices faded.
Not enough people understand what is and what isn't possible in a parallel computer setup. Creating a really really simple setup is a great way to let them get their heads around it without getting the information overload shutdown.
"...I'll use OpenCL so I can be hardware agnostic..."
Using OpenCL is eminently sensible for a large swath of applications. And hardware agnosticism is beneficial for a large number of reasons. This is only the first "accessible" platform that has been released... we can expect many, many more. Writing a DSL -> asm compiler for each one can get a bit tiresome.
Many apps can benefit from running in more than one place. For instance, I scaled out an authoritative game server using OpenCL recently. Because I used OpenCL, I could run it on Amazon GPGPU instances as well as locally on the server here. Two different sets of cards... same codebase... the only difference was that instead of hundreds of thousands of users on the local server, you can support millions on Amazon. There is something to be said for that sort of flexibility.
Couple that with a clustered vert.x event bus, and you begin to see the power such a system might bring... and that's just for a trivial application like gaming!!!
Imagine the benefits accrued to other, more complex, applications!
You should think carefully prior to being dismissive of hardware agnosticism. If you want to tinker and determine what is possible with this particular platform, by all means, use platform specific tools. However, if you want to bring the power of parallel processing to bear in solving problems wherever you find them... thinking about hardware agnosticism really is a "must-do" as my daughter would say.
"I'm really disappointed about how shallow the discussions about Adapteva are, and have been, on HN. To remind everyone, the H = hacker."
Thank you for saying this. I wanted to write about this but never found the time. I have started reading HN less and less because there is no longer substance in the comments; they usually focus on something that makes up 10% of the article. I actually find it weird at times. I am depending more and more on the weekly newsletters, but still miss the old comments, which were optimistic, on point, and taught me something about the subject central to the article.
When I was 13, I started to tinker around with electronics. I didn't really have a proper grasp of electronics beyond electronics playsets at 8-10. I designed some circuits using the bits of digital logic knowledge I had, but they called for dozens of chips to implement simple tasks. Then I stumbled across 8-bit AVRs. The possibilities, compared to my feeble steps around XOR gates and flip-flop ICs... they put me in awe. So I started playing around with them, and tried to push bounds, optimize code, addressing, buses, multiplexing, etc. And then I found out about the 32-bit ARM chips that came for a similar price to the AVRs. Again, the possibilities... And then CPLDs and FPGAs, neural network ICs, and so on. Every single time, this rush of excitement. I'm an addict, aren't I?
And the Adapteva is similar. From what it sounds like, you just have this brute power at your fingertips. I'll read into the specs a little bit, but it sounds frankly awe-inspiring even before opening them up. Want to PWM-control 500 LEDs individually? From the sound of this article, the thing has the power for double that (off the top of my head). Want to create stereoscopic 3D models using a camera and an IMU in real time? It sounds like it can do that (back-of-the-envelope calculation). Want to overlay your own 3D models onto it and display them using the Oculus Rift?
What I'm trying to get across is: if you're not as excited as if you'd found out Santa is real, you're not excited enough.
EDIT: THERE'S A D*"§ FPGA ON THAT THING!!!! I'll stop datasheeting to avoid hyperventilation.
I can think of two: documentation and simplicity.
Comparing Parallella with GPUs only in performance is missing the point. The board is open and quite understandable for non-experts like me. It's a platform for learning and experimenting, like a Raspberry Pi but more geeky. You will probably have an easier time tinkering with this and getting it to do useful things, that's the point.
I feel, like the parent commenter, quite excited about this.
In the comment thread on the article someone points out that the Adapteva chip doesn't do double precision floating-point, which limits its usefulness (to put it mildly). If the goal is to provide people with a low-cost platform to experiment with parallel programming, surely a decent NVidia card gives you less expensive (given you can plug it into a PCI slot and it will work) access to more CPUs that run faster and do more.
In 32 or so years of programming, I've hardly ever done anything that needed, or used, floats. It may limit its usefulness, but most of what people tend to want double precision for is incidentally also stuff that is easily vectorized, in which case a GPU will crush it anyway.
And a "decent NVidia card" doesn't allow me to combine arbitrary independent C programs to each individual core, and doesn't give me full low level guides for hardware access. It's a completely different beast.
Well, you can still do double-floats, combining two 32-bit floats for greater precision. While that doesn't get you full double precision, it just might be enough. And of course you can extend the same idea to implement quad-floats and so on.
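For the curious, the building block of float-float arithmetic is the classic Dekker/Knuth "two-sum": it recovers the rounding error of a float addition exactly, giving you a hi/lo pair whose unevaluated sum carries roughly double the significand. A minimal sketch (it requires strict IEEE arithmetic, i.e. no -ffast-math; a full library would add mul, div, etc. in the same style):

```c
/* "float-float" representation: a value stored as an unevaluated sum
   hi + lo of two 32-bit floats. ff_two_sum is the exact-addition
   primitive (Knuth's TwoSum); it must be compiled without -ffast-math. */
typedef struct { float hi, lo; } ff_t;

static ff_t ff_two_sum(float a, float b)
{
    ff_t r;
    r.hi = a + b;                              /* rounded sum */
    float bv = r.hi - a;                       /* the part of b that made it in */
    r.lo = (a - (r.hi - bv)) + (b - bv);       /* exact rounding error of a+b */
    return r;
}
```

For example, adding 1.0f and 1e-8f in plain float just yields 1.0f; here the 1e-8f survives in the lo component instead of being lost.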
One interesting application could be realtime 3D rendering, because this is an area with small overhead. I know that the chip does not support double-precision floating point, but that could be simulated with fixed-point integers.
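A quick illustration of the fixed-point idea: a Q16.16 format (16 integer bits, 16 fraction bits in an int32_t), with multiplication going through a 64-bit intermediate. The format choice and names here are just for illustration, not anything the chip mandates.

```c
/* Q16.16 fixed-point: value = raw / 65536. A renderer on integer-only
   (or float-only) cores could chain these for transforms and shading. */
#include <stdint.h>

typedef int32_t q16_t;
#define Q16_ONE (1 << 16)

static q16_t q16_from_int(int x)        { return (q16_t)(x * Q16_ONE); }
static q16_t q16_mul(q16_t a, q16_t b)  /* widen to 64 bits, shift back */
{
    return (q16_t)(((int64_t)a * b) >> 16);
}
```

So 0.5 is represented as Q16_ONE / 2, and q16_mul(Q16_ONE / 2, q16_from_int(10)) gives the representation of 5.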
I agree with Shamanmuni that the great advantage of the Parallella chip over GPUs is its openness (full documentation). It's a practical study tool for real parallel-programming tasks that many students can afford.
"I fundamentally disagree that SoCs are different than desktop technology. They're just smaller. High-end architectures are already driven to be as power-efficient as possible, so when you cut them down they're still efficient. For example, Kabini is an "SoC" that has the same GCN CU as a "desktop" Kaveri or a discrete Radeon."
The Parallella doesn't seem inherently more appropriate for embedded devices; it just depends on your requirements. Kabini would be embarrassingly power-hungry in plenty of embedded applications, while the Parallella might be laughably slow in plenty of other embedded applications.
Don't forget, by the way, that "embedded" doesn't mean "battery".
You're nitpicking. I already gave a few examples of where Parallella would be a good fit. To remind you, let's revisit the OpenCV for robotics application. The Parallella is shaping up to be a great device for OpenCV applications, do you at least admit that?
You're approaching the question from a different angle. The key word is SoC. Think embedded performance, rather than desktop performance.
Just to give you a few examples... OpenCV for robotics platforms, cheap low-power SDR capable of transmission, SIP encryption and compression. One might argue you could stick a GPU in a robot, I'd personally want something better suited to the task (lower power).
I fundamentally disagree that SoCs are different than desktop technology. They're just smaller. High-end architectures are already driven to be as power-efficient as possible, so when you cut them down they're still efficient. For example, Kabini is an "SoC" that has the same GCN CU as a "desktop" Kaveri or a discrete Radeon.
This is wrong. For example, low-power embedded ARM chips are not simply cut-down high-end x86 chips. If you optimize for power usage instead of raw performance, there are many design decisions that come out differently, resulting in a design that is qualitatively different and not just a scaled-down powerhouse.
ZenoArrow is talking about embedded SoCs, though. While Kabini is an SoC, it isn't really suited for embedded applications. These chips usually have a ton of GPIO, built-in support for different communication protocols, analog-to-digital converters, a lower power draw, etc. etc.
Exactly. I'm very excited to get my board and break into it. It really is intended to be a testbed for multiprocessing and hacking.
One thing that I'd like to see is what other people do with this product. I think that will really be the best part of this board. I'd equate it to Minecraft (if I may be so bold). They didn't create a computer in Minecraft, they created the possibility of creating a computer, and that was enough. That's how I see this board.
My biggest question is what do I need to know to use this? How can I write things that take advantage of this massive parallelization? Is there any reading anyone would recommend? Or maybe some basic examples for writing GPU based parallel software?
I'm so stoked about this board! I've been really wanting a board to try out hybrid computing/HPC with for a long time now, but everything has been a bit beyond my reach/justification price-wise. I got so excited reading about this board. The sky is the limit with this hardware, and now it'll be that much more available to everybody.
Thinking about everything I can do with this board is making my head explode!
I'm really happy that you clarified this for me. I was a bit confused because the newest KickStarter video focuses mainly on a young girl using the computer to surf the web and doesn't really detail anything that has been mentioned in these comments. After reading everything, it seems awesome though.
I'm actually thinking Adapteva has a big future in current areas of growth.
1) On the mobile side, you can have Epiphany, their compute fabric, as a unit directly on the mobile SoC. You can do codec offload, like WebP, WebM, SILK/Opus. You can do basic computer vision for augmented-reality applications, or image recognition. Or perhaps physics: integrate gyro output, position the device in absolute three-space. I dunno, the point is the compute is open, there for exploitation. It's not like OpenCL, where I have to beg for the drivers to be available, correct, or performant. Nor is it like Qualcomm's Hexagon, where who knows if I can use it, and I sure as hell won't without signing an NDA.
2) As far as cloud and heterogeneous compute goes, again I see an embedded Epiphany being useful. Everybody whines about various things, for example the missing double precision. Firstly, it's not like the architecture can't be extended in the future. But more importantly, they miss little details. Each node in Epiphany can branch and do integer work. You can see it doing wire-speed protobuf de/coding and other parallel data shuffling of long-lived data that could be compressed or interleaved somehow.
I'm more of a low-power, cloud kind of guy. So that's what I'll be playing with the most when I get my hands on the kit. That and maybe some parallel graph rewriting. Who knows, the sky's the limit.
I wish to pick one nit: This is not the first chipset to be fully documented and have this sort of massively-parallel structure. GreenArrays is producing hardware right now if you want to go play: http://www.greenarraychips.com/
GreenArrays' eval board is $450, though. Parallella is $99 with everything. Parallella's original intent was to fit in an Altoids tin, but it turns out rounded corners are expensive ;). Would have been a nice case, though.
Ok, since when is the ARM Cortex fully documented, and what depth does your comment add to this discussion? Your emotional appeal is more shallow than any other comment here! Throw in some programmer jargon and nostalgia, and a smiley and your comment is practically garbage. :)
There's so much negativity in this thread. Wasn't the whole idea that these guys had plans and an architecture to scale up to the order of a teraflop by 2014, and 20 by 2022? And look, they're shipping! This first chip may not be impressive, but I'll welcome a new player to the market who has big plans to innovate.
I don't get the negativity either. If you look at the architecture manual, this is like a cheap Tilera. It's an interesting programming model (lots of cores in a shared-memory SMP with weak memory ordering), and the CPUs are pretty vanilla RISC architectures. For $99, it's a great way to play with something that has the properties of the kinds of CPUs you might see in a future supercomputer.
I'm pretty skeptical. Having played with the Tilera, I'm not sure it gives you enough of a benefit to warrant the extra effort. The Parallella also looks a lot like a Tilera; I do wonder if there might be IP issues there down the line.
I also still think our best bet for this kind of thing is multicore ARM systems.
Not to take anything away from Parallella, but you can also play around with something that's used in today's supercomputers by buying any GeForce 5xx and starting to program in CUDA.
A fully documented ARM architecture as an accelerator chip is certainly interesting though. It will take time until the software tooling catches up, but the initial buzz in the HPC community about those ARM newcomers is certainly there. I'd give them a good chance in the long run to outrun Intel MICs and catch up to NVIDIA Tesla.
What I'd like to see next is a PCI express expansion card using this technology. See, one of the great benefits about Tesla cards is that you can swap them out in your supercomputers just like you do it with RAM - and you get the newest chip architecture, as long as your PCI bus can handle the load. For multi purpose systems you often still like to have a good number of x86 cores in there however.
But then they've included some really weird wording. They know that people are hostile to that wording, yet they chose to continue to use it.
When you're educating people it's important to be clear about terminology.
Having said that, I think it's neat, and I wish them luck. I think they're missing one of the main points of the RPi's success - it is dirt cheap. At $35 people will take a risk. At $90 people need to think about it. That might sound odd on HN where people tend to have a lot more disposable income.
I am very happy to see this coming out. The negativity found here is really disappointing. I will be very happy to get hold of a good number of cores running with a hardware memory model that is more weakly ordered than x86.
As someone who uses supercomputers, I'm not sure I entirely understand the market of this product. It's really cool and I'd love to have one to tinker with, but due to its high parallelization, I see no benefit of using this over a graphics card. I'm not sure if $99 can get you a GPU that reaches 90 GFlops though... perhaps that's where the benefit lies.
EDIT: After reviewing their website, I notice they state
> One important goal of Parallella is to teach parallel programming...
In this respect, I can see how this is useful. Adapting scientific software to GPUs can be difficult and isn't the easiest thing to get into for your average person. This board, with its open-source toolkit and community could make this process a lot easier.
You can't do individual branching for every single strand of computation on a typical GPU. With their chips they are trying to create a "third" category between GPUs (highly parallel, but with few independent instruction streams) and normal CPUs.
And the $99 is not for a mass-produced card or chip, but for an initial small-production-run computer that includes one of the Parallella chips.
You pay for a dual-core ARM computer based on the Zynq SoC (which means you get an FPGA in the deal) with an Epiphany chip. Getting a Zynq dev-board for that price in itself makes it worth it for a lot of people.
I think this may just be a novel way to sell a dev board for their custom silicon and get some of that heavy kickstarter press coverage.
If you figure that what they're really trying to do is get people familiar with it and see how well it might augment one of their existing ARM products it starts to make a lot of sense.
For instance, I have a low-end 4-bay ARM-based NAS. Its insanely modest specs (1.6GHz single core + 512MB RAM) are actually quite sufficient for most NAS tasks. But it's really more like a home server platform, as they have all sorts of addons that include things like CCTV archiving, DVR, IP PBX - you get the picture. But if you really start treating it like a general-purpose server, you quickly realize that some common workloads perform horribly on that ARM core, and it's frustrating.
It can easily push 800mbps or so with NFS, SMB, or CIFS, but if you want rsync+ssh you're looking at less than a tenth of that, because of the various FP needs of that chain. Native rsync with no ssh/no compression does somewhat better, but still poorly, due to its heavy use of cryptographic hash functions for delta transfers.
There are plenty of other examples: filesystem compression, repairing multipart files with par2 (kind of like RAID for file sets), face detection, file-integrity hashing. And if it could do on-the-fly video transcoding (don't even think about it), it could happily replace another full system I have running a Plex server.
There are probably a lot of devices where the designers default to ARM but have to skip features that are FP-heavy. If somebody in the firm has played around with a chip you can just drop in without changing your SoC or toolchain, that starts sounding pretty good, I'd guess - and likely still far cheaper than an Atom SoC.
It's an interesting (read: niche) market, to be sure. It's the same market as someone who might buy 8 Raspberry Pi boards or 4 ODroid U2 machines for the purpose of learning about parallel computation.
The Epiphany chip (the coprocessor on these boards) is supported as of GCC 4.8, so we may also see some novel ways to offload work to this chip in the future.
Comparing it with a GPU is a natural discussion to have. I believe there is more information on their website to answer the question. But maybe someone else can explain the programming paradigm difference.
Each of the 64 cores can independently run arbitrary C/C++ code, which is much more flexible than a GPU. Each core has 32KB of local memory, which can also be accessed by the other cores, and there's 1GB of external memory too.
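To make the "other cores can access it" part concrete: in my reading of the Epiphany architecture manual, all cores share one flat address space, with the upper 12 bits of an address selecting a core by mesh row/column and the low 20 bits giving an offset within that core's window. Treat the exact bit layout below as my reading, not gospel:

```c
/* Sketch of the Epiphany-style flat global address map (as I read the
   architecture manual): address = {6-bit row, 6-bit col, 20-bit offset},
   so each core owns a 1MB-aligned window containing its 32KB of SRAM. */
#include <stdint.h>

static uintptr_t global_addr(unsigned row, unsigned col, uint32_t offset)
{
    uint32_t coreid = (row << 6) | col;            /* 12-bit mesh coordinate */
    return ((uintptr_t)coreid << 20) | (offset & 0xFFFFF);
}

/* e.g. a core could poll a flag in a neighbour's SRAM:
   volatile int *flag = (int *)global_addr(my_row, my_col + 1, 0x4000); */
```

No message-passing hardware API needed: a plain load or store to a computed address reaches the other core's memory.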
I wonder how much work it would take to port cpuminer to this platform for use in mining bitcoins and litecoins. Probably not worth it for bitcoins with the new FPGA hardware but could be interesting for litecoin depending on the memory access speed to main memory.
Interesting architecture. I like how well-documented everything is. Usually, either the low-level ISA for accelerator chips is not documented at all (like with GPUs), or detailed documentation is only available under NDA, and only proprietary development tools are available (like with FPGAs).
First of all, which NP-hard pathfinding problem are you talking about? When I hear "pathfinding" I think "shortest path", which is (deterministic) polynomial time (the exact class depends on which variant of the problem, but even Floyd-Warshall is Θ(|V|³)).
Anyway, no. If you have an NP-hard problem, and you want an exact answer (i.e. you are denying yourself approximate solutions), and you want to solve it for large inputs, unless you have either proven that P=NP by construction (heh), or you have a non-deterministic computing machine (heh), you're basically screwed. Going parallel isn't going to help, any more than a hypothetical billion-GHz serial CPU is going to help. Asking this question suggests a fundamental lack of understanding about what is interesting (or rather, infuriating) about NP-hard problems.
Parallel processing models give you, at best, linear speedup. If your problem is O(too-big), and your input is large, linear speedup doesn't help, no matter how much linear speedup you have.
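To put the complexity-class point in concrete terms, here is the polynomial algorithm name-dropped above: Floyd-Warshall solves all-pairs shortest paths in Θ(|V|³), so no amount of parallelism is needed to make it tractable. A minimal sketch:

```c
/* All-pairs shortest paths via Floyd-Warshall: Theta(|V|^3), i.e.
   polynomial, and thus firmly outside the NP-hard discussion above.
   d[i][j] holds edge weights on entry (INF = no edge), distances on exit. */
#include <limits.h>

#define NV  4
#define INF (INT_MAX / 2)    /* halved so INF + w cannot overflow */

static void floyd_warshall(int d[NV][NV])
{
    for (int k = 0; k < NV; k++)          /* allow k as an intermediate hop */
        for (int i = 0; i < NV; i++)
            for (int j = 0; j < NV; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
}
```

The k loop carries a dependency, but the inner i/j loops parallelize cleanly, which is exactly the linear-speedup situation described above.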
Say you're running a problem of size N on a processor with a cache of size N/2. Adding another processor with another N/2 cache means your whole problem now fits in cache, so you'll probably suffer fewer cache misses, and thus could end up running more than 2x faster.
Hmn. I don't find those examples compelling, but maybe I'm just missing something.
I don't have a thought about the wikipedia backtracking example.
Regarding cache-vs-RAM, or regarding RAM-vs-disk, I see no reason you cannot take the work that the parallel units do, and serialize it onto a single processor. Let me consider the example you gave.
Initially, you have two processors with caches of size N/2, and a problem of size N on disk. You effortlessly split the problem into two parts at zero cost (ha!), load the first part on procA, the second part on procB. You pay one load-time. Now you process it, paying one processing-time, then store the result to disk, paying one store-time. Now you effortlessly combine it at zero cost (again, ha!), and you're done.
In my serial case, I do the same effortless split, I load (paying one load), compute (paying one compute), then store (paying one store). Then I do that again (doubling my costs). Then I do the same effortless combine. My system is 1/2 as fast as yours.
In short, I think "superlinear speedup" due to memory hierarchy is proof-by-construction that the initial algorithm could have been written faster. What am I missing?
OK, bear with me as we get slightly contrived here...(and deviate a little from my earlier, somewhat less thought out example).
Say your workload has some inherent requirement for random (throughout its entire execution) access to a dataset of size N. If you run it on a single processor with a cache of size N/2, you'll see a lot of cache misses that end up getting serviced from the next level in the storage hierarchy, slowing down execution a lot. If you add another processor with another N/2 units of cache, they'll both still see about the same cache miss rate, but cache misses then don't necessarily have to be satisfied from the next level down -- they can instead (at least some of the time) be serviced from the other processor's cache, which is likely to be significantly faster (whether you're talking about CPU caches relative to DRAM in an SMP system or memory relative to disk between two separate compute nodes over a network link).
Hmn. I think the reason there's superlinear speedup in the paper you linked is because the requests must be serviced in order. If you only care about throughput, and you can service the requests out-of-order, then you can use LARD in a serial process too, to improve cache locality, and achieve speed 1/Nth that of N-machine LARD. But to serve requests online, you can't do that reordering, so with one cache you'd be constantly invalidating it, thus the increased aggregate cache across the various machines results in superlinear speedup.
So, mission accomplished! I now believe that superlinear speedup is a real thing, and know of one example!
I find it incredibly dishonest of Adapteva to equate it to a "theoretical 45 GHz CPU". There are much better ways to talk about the performance level of their hardware than that metric, especially given the rest of the text in their Kickstarter pitch is aimed at people who need to inherently understand the hardware's execution model in order to program it effectively.
The computing industry has established language and metrics to discuss computing performance and, while the waters often get muddied when the hardware is wide, that's a step too far.
Hum, no. Sandy/Ivy Bridge can only execute 4 double-precision operations per cycle per core, in the form of two SSE instructions per cycle (one instruction doing adds, the other doing muls, executed by different units).
Doing 8 double-precision operations per cycle would translate to either four 128-bit SSE instructions or two 256-bit AVX instructions per cycle, which is not possible (unless I haven't kept track of the latest AVX capabilities).
Why do you say the Parallella is a 45GHz computer?
We have received a lot of negative feedback regarding this number so we want to explain the meaning and motivation. A single number can never characterize the performance of an architecture. The only thing that really matters is how many seconds and how many joules YOUR application consumes on a specific platform.
Still, we think multiplying the core frequency (700MHz) by the number of cores (64) is as good a metric as any. As a comparison point, the theoretical peak GFLOPS number often quoted for GPUs is really only reachable if you have an application with significant data parallelism and limited branching. Other numbers used in the past by processor vendors include: peak GFLOPS, MIPS, Dhrystone scores, CoreMark scores, SPEC scores, Linpack scores, etc. Taken by themselves, datasheet specs mean very little. We have published all of our data and manuals, and we hope it's clear what our architecture can do. If not, let us know how we can convince you.
Why do you say the Parallella is a 45GHz computer?
This is a Kickstarter project. In order to be successful, we need to attract as much attention from spam blogs as possible. To do that, facts are not particularly useful. What we need is something exciting. If we say we have 64 cores, that's not exciting. 64? I've forgotten how to count that low. Similarly, if we say we have a 700MHz processor, most people listening to us talk will actually start laughing in our faces. So that's no good. But thanks to our mathematical forefathers, there are many ways to make small numbers big. We could add the two numbers, saying we have a 764MHz machine. But that's not exciting, and the units don't work. We could divide the two numbers, yielding 10.94MHz. The units work, but that number is even smaller! Finally, we could try multiplication! And, boy, does that deliver! 45GHz!
TLDR: The only reason you're here is because of our misleading and dishonest claim. But now you're here. Please cough up your hard-earned cash which we might not use to go on a nice tropical vacation. You can trust us, we'd never mislead you...
Their answer to 'Why do you call the Parallella a supercomputer?' is also pretty curious: ;)
> The Parallella project is not a board, it's intended to be a long term computing project and community dedicated to advancing parallel computing.
> The current $99 boards aren't considered supercomputers by 2012 standards, but a cluster of 10 Parallella boards would have been considered a supercomputer 10 years ago.
Wait, what? :D (emphasis mine)
> Our goal is to put a bona-fide supercomputer in the hands of everyone as soon as possible, but the first Parallella board is just the first step. Once we have a strong community in place, work will begin on PCIe boards containing multiple 1024-core chips with 2048 GFLOPS of double-precision performance per chip. At that point, there should be no question that the Parallella would qualify as a true supercomputing platform.
They are clearly wrong. The purpose of higher clock rates is to produce a given answer in a smaller amount of time (latency). The purpose of adding more processors (cores) is to produce more answers in a given time (throughput). They are free to report their results using any standard measurement of throughput. Their answer is weasely.
But then, clock rate by itself tells you little about either latency or throughput. What really matters is how much work per cycle you get done, how fast you can move I/O, and the cost of throughput per watt (and whether the system can meet your requirements at all).
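A toy numeric illustration of that latency/throughput split (all numbers are hypothetical): one fast core gets a single answer back sooner, while an array of slow cores produces more answers per second on independent jobs.

```python
# Latency vs. throughput, with made-up numbers.
fast_job_s = 1.0   # seconds per job on one hypothetical fast core
slow_job_s = 4.0   # seconds per job on one hypothetical slow core
slow_cores = 16    # size of the slow-core array

# Latency: time to the FIRST answer. The fast core wins.
latency_fast = fast_job_s                   # 1.0 s
latency_slow = slow_job_s                   # 4.0 s

# Throughput: answers per second over a big batch of independent jobs.
# The core array wins despite each core being 4x slower.
throughput_fast = 1.0 / fast_job_s          # 1.0 jobs/s
throughput_slow = slow_cores / slow_job_s   # 4.0 jobs/s

print(latency_fast < latency_slow)          # True: fast core wins on latency
print(throughput_slow > throughput_fast)    # True: array wins on throughput
```

Which of the two you should optimize for depends entirely on whether you need one answer soon or many answers eventually.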
They are "clearly wrong" when talking to geeks about specific types of problems. For most people this means nothing, and multiplying is fine. And for a lot of situations where you are considering batch jobs, multiplying is fine as a quick illustration. It is not as if the raw numbers tell you anything anyway, since the characteristics of the system are so unusual.
They miscalculated how people would interpret it, and got burned. But they've been clear about what it is they actually mean the whole time.
I don't see a big difference between this and "petaflops" measurements that are the de-facto standard in bragging about supercomputers. You really only hit that peak performance for embarrassingly-parallel problems, but unless you have a specific workload or benchmark to talk about, it's the best you have and is a fairly well-accepted practice in the industry.
On a related note "petaflops" would be a great name for a pet bunny.
"this board should deliver about 90 GFLOPS of performance, or --in terms PC users understand-- about the same horse-power as a 45GHz CPU."
This is wrong.
A 4-core 3.0 GHz x86-64 processor delivers more GFLOPS than the Parallella: 96 GFLOPS with SSE instructions, because each core can execute 8 single-precision FLOPs (4 adds and 4 muls) each cycle. And yes, when Parallella claims 90 GFLOPS, they mean single precision.
For example, for the same price as the Parallella, you can get a $100 Phenom II X4 965 (4 cores, 3.4 GHz, 125W) delivering 109 GFLOPS. Count on $200 to include a minimal mobo/RAM/PSU (if all you care about is raw GFLOPS).
The Parallella may not beat anything on GFLOPS/Watt or GFLOPS/$, but if it can maintain ease of development (x86-64's stronghold) while doing not too badly on those two metrics (dominated by GPUs), it may be a good compromise and have a shot at succeeding in the HPC market.
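For what it's worth, the peak figures traded in this thread all fall out of one line of arithmetic: cores × clock (GHz) × FLOPs per cycle per core. A quick sketch using the per-cycle counts assumed in the comments above (treat those counts as the assumptions here):

```python
# Theoretical peak single-precision GFLOPS = cores * GHz * FLOPs/cycle/core.
def peak_gflops(cores, ghz, flops_per_cycle_per_core):
    return cores * ghz * flops_per_cycle_per_core

# Generic 4-core 3.0 GHz x86-64 with SSE: 4 adds + 4 muls per core per cycle.
print(peak_gflops(4, 3.0, 8))   # 96.0 GFLOPS
# Phenom II X4 965: 4 cores at 3.4 GHz, same 8 FLOPs/cycle assumption.
print(peak_gflops(4, 3.4, 8))   # 108.8 -> the ~109 GFLOPS quoted above
# Epiphany-64: 64 cores at 700 MHz, one fused multiply-add (2 FLOPs)/cycle.
print(peak_gflops(64, 0.7, 2))  # 89.6 -> the ~90 GFLOPS Adapteva quotes
```

These are of course theoretical peaks; no real workload sustains them, which is the whole point of the "weasely marketing" complaint upthread.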
Exactly right. ARM's lure isn't really its current performance for supercomputing; it's rather the expectation that it will hit the next big performance wall much later than x86, because its simple architecture is suitable for the maximum number of cores per die area. Give it 2-3 years and we might have the big step in supercomputing architecture at hand.
Can anyone explain the practical differences between something like this and a GPGPU approach? It doesn't sound particularly performant compared to modern GPUs otherwise. Maybe they add in some more general-purpose instructions for a little more flexibility?
A typical GPU can execute a small number of threads on a large number of streams of data carefully laid out in memory. Every time you want to do something conditionally on just one data stream, you waste a lot of capacity.
In contrast, the Epiphany chips can execute individual threads on each core in parallel on data either local to the core (fastest), on any other core, or in separate main memory.
The current Epiphany chips aren't too spectacular, since the core count is "low". They can "only" execute 16 individual instruction streams in parallel. But that's on a chip the size of your fingernail, and their roadmap is aiming for 1024-core chips.
They're effectively aiming for people to find ways of making effective use of simple, small, power efficient cores for problems that are not "data parallel" enough to be efficiently done on GPU's.
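To put a rough number on the divergence cost mentioned above, here is a toy model (not real GPU code; the 32-lane warp size and cycle counts are purely illustrative) of lockstep SIMT execution: when lanes disagree on a branch, both sides run serially with the non-participating lanes masked off.

```python
# Toy SIMT divergence model: a "warp" of lanes runs in lockstep, so an
# if/else where lanes disagree executes BOTH paths back to back, with
# non-participating lanes masked off (their cycles are wasted).
def simt_branch_cost(lane_takes_if, if_cycles, else_cycles):
    taken = sum(lane_takes_if)
    cost = 0
    if taken > 0:                    # at least one lane needs the if-path
        cost += if_cycles
    if taken < len(lane_takes_if):   # at least one lane needs the else-path
        cost += else_cycles
    return cost

divergent = [i % 2 == 0 for i in range(32)]  # alternating lanes disagree
print(simt_branch_cost(divergent, 10, 10))   # 20: both paths run serially

uniform = [True] * 32                        # every lane agrees
print(simt_branch_cost(uniform, 10, 10))     # 10: only one path runs
```

On a MIMD design like Epiphany, each core has its own program counter, so every core just takes its own branch and the divergent case still costs 10 cycles per core.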
Depends a lot on the type of problem, and I think that's going to be what makes or breaks them. They have some good examples, but you're right, it's a hard problem and one of the reasons it's so important for them to get these dev boards out.
I backed this project simply because it's a great idea to build a very small highly parallel computer that runs on very little power. Maybe this one won't hit it out of the park but it might give other people ideas. Building the first one of anything is always hard. Add a little serendipity and we might get an entirely new use for computers.
Just saying that I could do more with a $99 graphics card sort of misses the point.
I love this board, but keep in mind that the entire premise of this board is parallel execution of _separate instruction streams_. Given the performance people are getting from GPUs for bitcoin mining, I presume those calculations parallelize extremely well across few instruction streams - for that, a normal GPU is likely to be a far better choice.
> But when we consider the consuming power of ASIC platform, I think this board has strength.
No, it can't possibly have.
SHA is half bitshifts-by-constants. On an ASIC platform, those essentially refactor to no-ops. There is no way, no how general-purpose hardware could ever possibly get anywhere near even a piss-poor special-purpose ASIC for this task. If you think otherwise you simply don't understand the domain. Those 600-watt ASIC systems contain multiple chips and run at tens of GHashes/s. That 5-watt chip, if it's very, very good, might maybe break 40 MHash/s.
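To make the "refactor to no-ops" point concrete: SHA-256's mixing functions are rotations by fixed constants XORed together. In software every rotate costs instructions; in an ASIC a rotate by a constant is pure wiring (the bits are simply routed to different positions), so it costs zero gates and zero time. A sketch of one such function, Σ0 from the SHA-256 spec (FIPS 180-4):

```python
# 32-bit rotate right. In software this is real work (shifts, OR, mask);
# in an ASIC, rotating by a CONSTANT is just wire routing: free.
def rotr(x, n, bits=32):
    return ((x >> n) | (x << (bits - n))) & 0xFFFFFFFF

# SHA-256's big Sigma-0: three constant rotates XORed together.
def big_sigma0(x):
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)

print(hex(big_sigma0(0x6A09E667)))  # 0xce20b47e
```

Multiply those savings across the 64 rounds of SHA-256, twice per bitcoin hash, and it's clear why fixed-function silicon is in an entirely different league from any general-purpose core.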
It's nowhere near fast enough. My 7970s can push out about 1.3 GHash/s and combined they are capable of around 7 TFLOPS. When (/if) they release the BFL Jalapeño, it'll run at 5 GHash/s and be powered by USB. 90 GFLOPS is equivalent to a decent processor, but nowhere near powerful enough for bitcoin mining.
While I find this quite exciting from a pure developer perspective, it also reminded me that I haven't had anything I'd call a desktop box in quite some time.
If I were ever to get a desktop machine again, it would have to be cheap and light; I definitely don't want anything clunky, otherwise a laptop seems preferable to me.
There don't seem to be many products that fill that gap: Intel's NUC is too expensive, the Raspberry Pi too slow. Apple's Mac mini seems like the best proposition in this segment.
I wonder if the Parallella could be used not only as a development platform, but also as a desktop computer?
It won't run any fancy games, that's clear, but it may actually be usable for browsing, watching videos and office duties.
Wow, I remember seeing the original Kickstarter for this and thinking "this will never see the light of day", yet here it is. I still find it a bit of an odd product, neither for hobby nor business, but it sure is cheap.
They seem to actually have support from some of the hardware manufacturers. From Update #31 "Much gratitude goes out to the component manufacturers who really “got it” (Xilinx, Analog Devices, Intersil, Micron, Microchip, Samtec all deserve special thanks). Without their help we would be losing $100 per board!"
So, the backers are getting a Very Good Deal, with the hopes that a successful launch will make demand high enough to make the $99 viable with volume.
A supercomputer is a cluster of machines connected by a high-throughput, low-latency interconnect.
Hundreds of servers connected together with 1 Gigabit Ethernet is still a "grid cluster"; you need at least 10 Gigabit Ethernet (over iWARP) or InfiniBand (RDMA) to be considered a supercomputer.
This is marketing B.S.! And the B.S. is "emphasized" by the 90 GFLOPS = 45GHz thing: 90 GFLOPS is what you'd get from a single 45GHz "ALU" (presumably one doing a multiply-add, MADD, per cycle), not a full-fledged CPU (like an i7 or Xeon, which has 4-8 cores with 3 ALUs each), as readers might infer.
The general idea is similar - lots of cores with distributed SRAM memory and some shared DRAM, all sitting on a 2D mesh network. The main difference is that Epiphany is made of custom simple RISC cores, while Xeon Phi uses 1st-gen Pentiums with huge SIMD FPUs slapped on for higher FP throughput (and TDP).
I wish these things had just a bit more memory. Most of the interesting algorithms I work with (bioinformatics) really want 4G of memory. A lot of them you can squeeze down to 2G but 1G is just out of the question.
I think this is cool, but wouldn't learning OpenCL be more future-proof for someone wanting to get into parallel processing? There seems to be more drive behind GPU development than specialist hardware like this.
This is really cool. Since it runs Linux, I assume it can run the JVM, correct? That's incredibly powerful, as even GPU programming requires bridge libraries. And what, $99? That's incredible. I'm going to get one...
This is perfect for numerical computing applications like software-defined radio or image processing, which can now be done on embedded platforms. I'll definitely be ordering a board when they're available.
Can someone explain to me how a $99 computer can have 45GHz of processing power, while an i7 costs 3x that for 1/10th the clock speed? What does this $99 board miss out on that my i7 is capable of?
First of all, the 45GHz figure definitely isn't a valid comparison against modern x86 chips - thanks to multiple cores and SIMD instructions they reach a few dozen GFLOPS at stock frequencies.
Furthermore, x86 chips pack all of their performance into a low number of cores, which makes them much more useful for common scalar code. And if 20-times-higher scalar performance isn't enough to convince you to pay a premium, the complexity required to achieve that level of scalar performance is definitely enough to discourage Intel from selling you i7s for $99.
This is kind of a novelty. Your i7 has way, way more power for jobs which only use a few cores. Most normal jobs are like that, so unless you have specific requirements, the i7 is going to give you much better performance.
Of course if you learn on Adapteva then your knowledge may not translate to the "worse" architectures that are used in the real world. If you want to learn parallel programming, the computer you already have supports threads, CSP, actors, OpenMP, OpenCL, etc.
I wonder how those in the performance-computing sector feel about running a proprietary supervisor with built-in DRM on each and every CPU? Raspberry Pi users might not care for hobbyist applications, but I doubt any serious scientist is going to overlook that.
Intel platforms have a very similar risk via SMM and the platform code & controller. It's less advanced, but it can easily exert full control over the system without the OS allowing it, minus access to some registers and on-die cache. It could DMA in or out of the GPU memory as well.
Whether your SoC vendor forces a secure supervisor to load is up to them, and I'd be surprised if an HPC builder had trouble finding vendors to supply parts with a totally controllable boot chain.
I'm sure there are ways to obscure it, but there are just as many ways on x86 platforms, the only real difference being that you could pull the EPROM, reflash it, and inspect the other board components. There are also plenty of evil things you can put in an SoC without relying on TrustZone.
Bottom line: you have to trust your vendor. If you want an SoC integrated, and a fab monitored, by a business/state that is politically aligned with yours, it is probably just a matter of paying a premium.
So how fast is this really? It doesn't sound like much of a supercomputer to me. If it were so super for $99, it'd have been hyped everywhere already and gamers would not buy desktops anymore. It rather sounds like a platform to practice multithreading on.
A supercomputer is a machine that's I/O bound instead of CPU bound, at least as a first approximation.
You'll get lots of specs thrown at you, like: in mid-2013 a supercomputer means using X, Y, and Z technologies. But that's just a longer-format version of the above.
A pessimist usually warps the definition to a machine that's primarily programmer limited rather than CPU or I/O limited, LOL.
Over the decades, as parallelism has become popular, it's drifted more toward being financially limited than anything else; in the long run this is probably going to be the new definition: an overall system whose performance is limited solely by economics. You might think that's all computers, but not so - there are plenty that are inherently limited by architecture to low performance, or limited by programming to single-core/single-thread tasks.
The biggest bummer of supercomputers in the parallel era is that no one is doing anything about latency. It's nice that your 2000-processor design with 20-deep pipelines can really churn stuff out after enormous latency, but the olden-days pursuit of low latency as the route to speed was pretty interesting technologically. Hilariously, you'll even get noobs who don't understand the difference between latency and speed, or who claim there isn't one.
These guys are completely dishonest. I saw their kickstarter video where they said that for $99 you could have "a computer many times faster than anything on the market ZOMG".
Yeah, maybe it's faster for all those times during the day when you calculate matrix chain products. But for largely single-threaded tasks, like EVERYTHING you do on a day to day basis, it's going to be significantly slower than your average dual-core i3.
I backed them on kickstarter, and I don't remember seeing any claim like what you claim to have seen.
To me it was always clear that the current models are not particularly fast. They may be fast "per watt", and if they succeed in their roadmap, then their future 1024 core chips may be fast for the subset of problems that they are suitable for.
In the meantime, the kickstarter page is/was careful to focus on this as a stepping stone, and developer platform for playing with the technology first and foremost, and not as being about delivering some incredibly fast computer for end users.
If anything, they've provided an extreme amount of data, down to cycle counts for memory accesses and the instruction set, and they've dumped a lot of code in our laps, including drivers etc. The final unit actually comes with a faster version of the Zynq SoC than what they promised, after Xilinx apparently gave them an amazing deal.