To remind everyone: the H in HN stands for hacker. This device is a godsend, as far as I'm concerned. For the first time ever I get fully documented access to a compute array on a chip. No, the architecture wasn't designed for anything specific, like graphics, but that means I don't get bogged down in details I don't care about, like some obscure memory hierarchy.
The chip is plain, simple, low-power, and begging for people to have an imagination again. Stop asking what existing things you can do with it, ask what future things having something like this on a SoC would enable.
Also, you should really be thinking about the chip at the instruction level, writing toy DSL-to-asm compilers (sketch at the end of this comment). Thinking along the lines of "oh yeah, I'll use OpenCL so I can be hardware agnostic" is never going to let you see what's possible with it. If you read the docs you'll see what a simple and regular design it is, perfect for writing your own simple tooling.
It's been a long time, but I feel like a kid again. Like when I first discovered assembly on my 8086. Finally a simple device I can tinker with, play, and wring performance out of.
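To make "toy DSL-to-asm compiler" concrete, here's roughly the scale I have in mind: a page of C that compiles postfix arithmetic into three-address code. The mnemonics (mov/iadd/imul) are illustrative placeholders, not actual Epiphany opcodes -- pull the real ones from the architecture reference.

    /* Toy DSL -> "asm": compile a postfix expression to register code.
     * Mnemonics are placeholders, not the real Epiphany ISA. */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    int main(void) {
        const char *p = "a b + c *";          /* DSL source: (a + b) * c */
        int stack[16], sp = 0, next_reg = 0;  /* trivial register allocator */
        char tok[16];
        while (sscanf(p, "%15s", tok) == 1) {
            p += strspn(p, " ") + strlen(tok);
            if (isalpha((unsigned char)tok[0])) {   /* operand: load it */
                printf("  mov  r%d, [%s]\n", next_reg, tok);
                stack[sp++] = next_reg++;
            } else {                                /* operator: pop 2, push 1 */
                int b = stack[--sp], a = stack[--sp];
                printf("  %s r%d, r%d, r%d\n",
                       tok[0] == '+' ? "iadd" : "imul", a, a, b);
                stack[sp++] = a;
            }
        }
        return 0;
    }

Obviously a real compiler needs a parser and register spilling, but that's the point: the design is regular enough that this kind of tooling is a weekend project, not a thesis.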
I want to try my hand at writing a real, efficient, many:many message-passing API on top of SHM (a minimal sketch of the building block is below, after this list). It's something I've been interested in for a while (and am doing in a side project for x86_64). Not because it hasn't been done a thousand times before, but because it's neat.
I want to write a compiler for the Parallella. Not because there aren't compilers already, but because I've never written a compiler that targets a RISC architecture before. I've never written a compiler that respects pipelining.
I want to write a parallelized FFT, based on the original paper, for the Parallella (also sketched below). I've used various FFT libraries before, but never actually implemented an FFT straight up. Why? Not because it's never been done before, but just because it's an idea that appeals to me. And for practice parallelizing algorithms...
I want to write a raytracer for the Parallella. Not because I haven't written a raytracer before, but because I think that I'll be able to do something interesting with a Parallella raytracer that I haven't done before: real-time (ish) raytracing. Not because that hasn't been done before, but because it'd be neat to build.
I want to build a distributed physics engine. Not because there aren't excellent open-source physics engines (Bullet, ODE, etc.) -- but because I find the problem interesting. It's something I've wanted to do for a while, but never got around to. Why? Because it's interesting.
I could go on, but I'll stop here. The Parallella, I think, is a catalyst for a lot of small projects that I've wanted to do for a while. The Parallella is my excuse to spend time on random projects that will never go anywhere beyond a page on my website describing what they are, plus a link to the source code.
And, you know what? That seems perfect to me. That's why I want a Parallella, and that's why I'm eagerly awaiting mine within the next month or three. (Hopefully!)
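Since those two are the most concrete, here's what I mean. For the message-passing API, the building block is just a lock-free single-producer/single-consumer ring per sender/receiver pair (N:M messaging composes N*M of them). A minimal sketch in C11, assuming both sides can see the same shared memory:

    #include <stdatomic.h>
    #include <string.h>

    #define SLOTS 64                /* must be a power of two */
    #define MSG_SIZE 32

    typedef struct {
        _Atomic unsigned head;      /* advanced only by the producer */
        _Atomic unsigned tail;      /* advanced only by the consumer */
        char msg[SLOTS][MSG_SIZE];
    } channel;

    int ch_send(channel *c, const void *m) {
        unsigned h = atomic_load_explicit(&c->head, memory_order_relaxed);
        unsigned t = atomic_load_explicit(&c->tail, memory_order_acquire);
        if (h - t == SLOTS) return 0;            /* ring is full */
        memcpy(c->msg[h % SLOTS], m, MSG_SIZE);
        atomic_store_explicit(&c->head, h + 1, memory_order_release);
        return 1;
    }

    int ch_recv(channel *c, void *m) {
        unsigned t = atomic_load_explicit(&c->tail, memory_order_relaxed);
        unsigned h = atomic_load_explicit(&c->head, memory_order_acquire);
        if (h == t) return 0;                    /* ring is empty */
        memcpy(m, c->msg[t % SLOTS], MSG_SIZE);
        atomic_store_explicit(&c->tail, t + 1, memory_order_release);
        return 1;
    }

And for the FFT: the serial radix-2 version is a dozen lines, and the recursion already shows where the parallelism lives -- the two half-size sub-transforms are independent, so they can go to different cores. A textbook sketch, not optimized, n a power of two:

    #include <complex.h>
    #include <math.h>

    /* Out-of-place recursive radix-2 Cooley-Tukey. The stride walks the
     * even/odd interleaving without copying the input around. */
    void fft(const double complex *x, double complex *out, int n, int stride) {
        if (n == 1) { out[0] = x[0]; return; }
        fft(x,          out,       n/2, 2*stride);  /* even-index half */
        fft(x + stride, out + n/2, n/2, 2*stride);  /* odd-index half  */
        for (int k = 0; k < n/2; k++) {
            double complex t = cexp(-2.0*I*M_PI*k/n) * out[k + n/2];
            double complex e = out[k];
            out[k]       = e + t;                   /* butterfly combine */
            out[k + n/2] = e - t;
        }
    }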
An FPGA allows you to explore a lot of things that just aren't possible on a traditional CPU, no matter how parallel.
The more the merrier, though. I wish I had time to play with FPGAs - I have a Minimig (an Amiga reimplementation where the custom chips are all in an FPGA) and I'm on the list for an FPGA Replay (targeting FPGA reimplementations of assorted home computers, including the Amiga, and arcade machines).
Do you know how its Spartan-6 XC6SLX9 compares to the Zynq 7010 on the Parallella?
If they are roughly equal, I guess the Parallella is a better deal since it has an ARM too.
Too bad Icarus Verilog can't synthesize at all anymore; Xilinx ISE is very heavyweight and not FOSS, and both of those boards depend on it.
Not enough people understand what is and what isn't possible in a parallel computer setup. Creating a really really simple setup is a great way to let them get their heads around it without getting the information overload shutdown.
Using OpenCL is eminently sensible for a large swath of applications. And "hardware agnosticism" is beneficial for a large number of reasons. This is only the first "accessible" platform that has been released... we can expect many, many more. DSL -> asm on each can get a bit tiresome.
Many apps can benefit from running in more than one place. For instance, I scaled out an authoritative game server using OpenCL recently. Because I used OpenCL, I could run it on Amazon GPGPU instances as well as locally on the server here. Two different sets of cards... same codebase... the only difference was that instead of hundreds of thousands of users on the local server, you can support millions on Amazon. There is something to be said for that sort of flexibility.
Couple that with a clustered vert.x event bus, and you begin to see the power such a system might bring... and that's just for a trivial application like gaming!!!
Imagine the benefits accrued to other, more complex, applications!
You should think carefully prior to being dismissive of hardware agnosticism. If you want to tinker and determine what is possible with this particular platform, by all means, use platform specific tools. However, if you want to bring the power of parallel processing to bear in solving problems wherever you find them... thinking about hardware agnosticism really is a "must-do" as my daughter would say.
Thank you for saying this; I wanted to write about this but never found the time. I have started reading HN less and less because there is no longer substance in the comments, and they usually focus on something that makes up 10% of the article; I actually find it weird at times. I am depending more and more on the weekly newsletters, but I still miss the old comments, which were optimistic, on point, and taught me something about the subject central to the article.
Back on topic: "What Adapteva has done is create a credit-card sized parallel-processing board". This is so cool; as a Linux user I hope to get my hands on one of these ASAP! Take a look at the Kickstarter video: http://www.kickstarter.com/projects/adapteva/parallella-a-su...
> Stop asking what existing things you can do with it
And the Adapteva is similar. From what it sounds like, you just have this brute power at your fingertips. I'll read into the specs a little bit, but it sounds frankly awe-inspiring even before opening them up. Want to PWM-control 500 LEDs individually? From what it sounds like in this article, the thing has the power for double that (off the top of my head). Want to create stereoscopic 3D models using a camera and an IMU in real time? It sounds like it can do that (rule-of-thumb calculation). Want to overlay your own 3D models onto it and display them using the Oculus?
What I'm trying to get across is: if you're not as excited as if you'd found out Santa is real, you're not excited enough.
EDIT: THERE'S A D*"§ FPGA ON THAT THING!!!! I'll stop datasheeting to avoid hyperventilation.
I asked that and came up blank. And I haven't seen answers from anyone else, either. Has Adapteva themselves shown any examples where their chip beats a GPU?
Not sure if world first or AMD's first, but it was around this timeframe, 2007:
"AMD Delivers First Stream Processor with Double Precision Floating Point Technology"
And a "decent NVidia card" doesn't allow me to combine arbitrary independent C programs to each individual core, and doesn't give me full low level guides for hardware access. It's a completely different beast.
I agree with Shamanmuni that the great advantage of the Parallella chip over GPUs is openness (full documentation). It's a practical study tool for real parallel-programming tasks that many students can afford.
You didn't really read what I said, did you? A key factor for embedded electronics is power draw. Based on a quick Google, AMD Kabini uses approximately 15W of power:
On the other hand, the 64-core Parallella uses approximately 2W:
Hope you can start to see the difference now.
The Parallella doesn't seem inherently more appropriate for embedded devices; it just depends on your requirements. Kabini would be embarrassingly power-hungry in plenty of embedded applications, while the Parallella might be laughably slow in plenty of other embedded applications.
Don't forget, by the way, that "embedded" doesn't mean "battery".
Just to give you a few examples... OpenCV for robotics platforms, cheap low-power SDR capable of transmission, SIP encryption and compression. One might argue you could stick a GPU in a robot; I'd personally want something better suited to the task (lower power).
One thing that I'd like to see is what other people do with this product. I think that will really be the best part of this board. I'd equate it to Minecraft (if I may be so bold). They didn't create a computer in Minecraft, they created the possibility of creating a computer, and that was enough. That's how I see this board.
How can we get started with this?
Thinking about everything I can do with this board is making my head explode!
Sounds like an ideal use of Forth.
Do you see applications in an embedded sense, or are you looking at it to augment a regular computer's capability?
1) On the mobile side, you can have Epiphany, their compute fabric, as a unit directly on the mobile SoC. You can do codec offload, like WebP, WebM, SILK/Opus. You can do basic computer vision for augmented-reality applications, or image recognition. Or perhaps physics: integrate gyro output, position the device in absolute three-space. I dunno, the point is that the compute is open, there for exploitation. It's not like OpenCL, where I have to beg for the drivers to be available, correct, or performant. Nor is it like Qualcomm's Hexagon, where who knows if I can use it, and I sure as hell won't without signing an NDA.
2) As far as cloud and heterogeneous compute goes, again I see an embedded Epiphany being useful. Everybody whines about various things, for example missing double precision. Firstly, it's not like the architecture can't be extended in the future. But more importantly, they miss little details: each node in Epiphany can branch and do integer work. You can see it doing wire-speed protobuf de/coding and other parallel shuffling of long-lived data that could be compressed or interleaved somehow.
I'm more of a low-power, cloud kind of guy. So that's what I'll be playing with the most when I get my hands on the kit. That and maybe some parallel graph rewriting. Who knows, the sky's the limit.
And yes, I gave $99, hoping to have one soon.
As a former Forth hacker I was enthusiastic at first glance at the GA, but 128 bytes per core was really disappointing. What could that amount of RAM be useful for?
I'm pretty skeptical; having played with the Tilera, I'm not sure it gives you enough of a benefit to warrant the extra effort. The Parallella also looks a lot like a Tilera; I do wonder if there might be IP issues there down the line.
I also still think our best bet for this kind of thing is multicore ARM systems.
A fully documented architecture as an accelerator chip (paired with ARM) is certainly interesting, though. It will take time until the software tooling catches up, but the initial buzz in the HPC community about those ARM newcomers is certainly there. I'd give them a good chance in the long run to outrun Intel MICs and catch up to NVIDIA Tesla.
What I'd like to see next is a PCI Express expansion card using this technology. See, one of the great benefits of Tesla cards is that you can swap them out in your supercomputers just like you do with RAM - and you get the newest chip architecture, as long as your PCI bus can handle the load. For multi-purpose systems you often still like to have a good number of x86 cores in there, however.
But then they've included some really weird wording. They know that people are hostile to that wording, yet they chose to continue to use it.
When you're educating people it's important to be clear about terminology.
Having said that, I think it's neat, and I wish them luck. I think they're missing one of the main points of the RPi's success: it is dirt cheap. At $35 people will take a risk; at $90 people need to think about it. That might sound odd on HN, where people tend to have a lot more disposable income.
Sounds like a glorious playground :)
EDIT: After reviewing their website, I notice they state
> One important goal of Parallella is to teach parallel programming...
In this respect, I can see how this is useful. Adapting scientific software to GPUs can be difficult and isn't the easiest thing to get into for your average person. This board, with its open-source toolkit and community could make this process a lot easier.
And the $99 is not for a mass-produced card or chip, but for an initial small-production-run computer that includes one of the Parallella chips.
You pay for a dual-core ARM computer based on the Zynq SoC (which means you get an FPGA in the deal) with an Epiphany chip. Getting a Zynq dev-board for that price in itself makes it worth it for a lot of people.
If you figure that what they're really trying to do is get people familiar with it and see how well it might augment one of their existing ARM products it starts to make a lot of sense.
For instance, I have a low-end 4-bay ARM-based NAS. Its insanely modest specs (1.6GHz single core + 512MB RAM) are actually quite sufficient for most NAS tasks. But it's really more like a home server platform, as they have all sorts of addons that include things like CCTV archiving, DVR, IP PBX - you get the picture. But if you really start treating it like a general-purpose server, you quickly realize that some common workloads perform horribly on that ARM core, and it's frustrating.
It can easily push 800Mbps or so with NFS, SMB, or CIFS, but if you want rsync+ssh you're looking at less than a tenth of that, because of the various FP needs of that chain. Native rsync with no ssh/no compression does somewhat better, but still poorly, due to its heavy use of cryptographic hash functions for delta transfers.
There are plenty of other examples - filesystem compression, repairing multipart files with par2 (kind of like RAID for file sets), face detection, file-integrity hashing. And if it could do on-the-fly video transcoding (don't even think about it), it could happily replace another full system I have running a Plex server.
There are probably a lot of devices where the designers default to ARM but have to skip features that are FP-heavy. If somebody at the firm has played around with a chip you can just drop in without changing your SoC or toolchain, that starts sounding pretty good, I'd guess - and likely still far cheaper than an Atom SoC.
The Epiphany chip (the coprocessor on these boards) is supported as of GCC 4.8, so we may also see some novel ways to offload work to this chip in the future.
This is ideal cluster material.
SDK docs: http://www.adapteva.com/wp-content/uploads/2013/04/epiphany_...
The topology reminds me of this paper, "The Landscape of Parallel Computing Research: A View from Berkeley": http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-18...
Let me show you the AMD Southern Islands ISA specs: http://developer.amd.com/wordpress/media/2012/10/AMD_Souther...
Anyway, no. If you have an NP-hard problem, and you want an exact answer (i.e. you are denying yourself approximate solutions), and you want to solve it for large inputs, unless you have either proven that P=NP by construction (heh), or you have a non-deterministic computing machine (heh), you're basically screwed. Going parallel isn't going to help, any more than a hypothetical billion-GHz serial CPU is going to help. Asking this question suggests a fundamental lack of understanding about what is interesting (or rather, infuriating) about NP-hard problems.
Parallel processing models give you, at best, linear speedup. If your problem is O(too-big), and your input is large, linear speedup doesn't help, no matter how much linear speedup you have.
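To put a number on "doesn't help": the best case is perfectly linear speedup, so T_p >= T_1/p. If T_1 = Theta(2^n), then even a billion cores (p = 2^30) gives T_p = 2^n / 2^30 = 2^(n-30) -- meaning that in the same wall-clock time you can now handle inputs all of 30 items larger. That's what exponential means.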
Most commonly, yes, though in practice superlinear speedups can occur in some situations. (Not that this negates your overall point, just a nitpick.)
Wikipedia also points out possibilities at an algorithmic level: https://en.wikipedia.org/wiki/Speedup#Super_linear_speedup
I don't have a thought about the wikipedia backtracking example.
Regarding cache-vs-RAM, or regarding RAM-vs-disk, I see no reason you cannot take the work that the parallel units do, and serialize it onto a single processor. Let me consider the example you gave.
Initially, you have two processors with caches of size N/2, and a problem of size N on disk. You effortlessly split the problem into two parts at zero cost (ha!), load the first part on procA, the second part on procB. You pay one load-time. Now you process it, paying one processing-time, then store the result to disk, paying one store-time. Now you effortlessly combine it at zero cost (again, ha!), and you're done.
In my serial case, I do the same effortless split, I load (paying one load), compute (paying one compute), then store (paying one store). Then I do that again (doubling my costs). Then I do the same effortless combine. My system is 1/2 as fast as yours.
In short, I think "superlinear speedup" due to memory hierarchy is proof-by-construction that the initial algorithm could have been written faster. What am I missing?
Say your workload has some inherent requirement for random (throughout its entire execution) access to a dataset of size N. If you run it on a single processor with a cache of size N/2, you'll see a lot of cache misses that end up getting serviced from the next level in the storage hierarchy, slowing down execution a lot. If you add another processor with another N/2 units of cache, they'll both still see about the same cache miss rate, but cache misses then don't necessarily have to be satisfied from the next level down -- they can instead (at least some of the time) be serviced from the other processor's cache, which is likely to be significantly faster (whether you're talking about CPU caches relative to DRAM in an SMP system or memory relative to disk between two separate compute nodes over a network link).
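A back-of-the-envelope version of that, with made-up but plausible numbers (cache hit 1ns, remote cache 40ns, DRAM 100ns; a 50% miss rate when the cache holds half the working set; and half the misses hitting the sibling's cache in the two-processor case):

    1 proc:  avg access = 0.5(1) + 0.5(100)             = 50.5 ns
    2 procs: avg access = 0.5(1) + 0.25(40) + 0.25(100) = 35.5 ns

Each processor also only performs half the accesses, so the overall speedup is 2 x (50.5 / 35.5), roughly 2.8 -- better than the "ideal" 2x.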
For a more concrete example somewhat related (though not entirely congruent) to the above scenario, see http://www.ece.cmu.edu/~ece845/docs/lardpaper.pdf (page 211).
So, mission accomplished! I now believe that superlinear speedup is a real thing, and know of one example!
The Zynq 7020 arguably has enough spare capacity and I/O ports to implement a 3d torus with ~GByte/s throughput for each link.
The computing industry has established language and metrics to discuss computing performance and, while the waters often get muddied when the hardware is wide, that's a step too far.
> This board should deliver about 90 GFLOPS of performance, or - in terms PC users understand - about the same horse-power as a 45GHz CPU.
Edit: They state the real fact and then give another figure explicitly stating it's an attempt to translate this into a metric the average user can somewhat relate to.
According to http://en.wikipedia.org/wiki/FLOPS#Computing it seems that they're off by a factor of two, but I'm guessing that's just an honest mistake.
Second edit: I was under the impression that this was the result of dumbing-down by a journalist; however, it seems it's from Parallella itself. That is a bit disingenuous indeed.
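For anyone who wants to check the arithmetic behind the two figures:

    64 cores x 700 MHz                        = 44.8 ~ 45 "GHz"
    64 cores x 700 MHz x 2 FLOPS/cycle (FMAC) = 89.6 ~ 90 GFLOPS

So the "45GHz CPU" comparison implicitly assumes a CPU that does exactly 2 FLOPS per cycle; real desktop cores do more than that per cycle, which is where the factor-of-two complaints come from.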
Also, the boards for the backers feature a Zynq-7020 SoC by Xilinx, which sports a 1.3M-gate FPGA available to the user. That ain't bad either.
So the statement "PC users understand" is false.
Doing 8 double-precision operations per cycle would translate to either four 128-bit SSE instructions or two 256-bit AVX instructions per cycle, which is not possible (unless I haven't kept up with the latest AVX capabilities).
Why do you say the Parallella is a 45GHz computer?
We have received a lot of negative feedback regarding this number so we want to explain the meaning and motivation. A single number can never characterize the performance of an architecture. The only thing that really matters is how many seconds and how many joules YOUR application consumes on a specific platform.
Still, we think multiplying the core frequency (700MHz) times the number of cores (64) is as good a metric as any. As a comparison point, the theoretical peak GFLOPS number often quoted for GPUs is really only reachable if you have an application with significant data parallelism and limited branching. Other numbers used in the past by processor vendors include peak GFLOPS, MIPS, Dhrystone scores, CoreMark scores, SPEC scores, Linpack scores, etc. Taken by themselves, datasheet specs mean very little. We have published all of our data and manuals, and we hope it's clear what our architecture can do. If not, let us know how we can convince you.
This is a Kickstarter project. In order to be successful, we need to attract as much attention from spam blogs as possible. To do that, facts are not particularly useful. What we need is something exciting. If we say we have 64 cores, that's not exciting. 64? I've forgotten how to count that low. Similarly, if we say we have a 700MHz processor, most people listening to us talk will actually start laughing in our faces. So that's no good. But thanks to our mathematical forefathers, there are many ways to make small numbers big. We could add the two numbers, saying we have a 764MHz machine. But that's not exciting, and the units don't work. We could divide the two numbers, yielding 10.94MHz. The units work, but that number is even smaller! Finally, we could try multiplication! And, boy, does that deliver! 45GHz!
TLDR: The only reason you're here is because of our misleading and dishonest claim. But now you're here. Please cough up your hard-earned cash which we might not use to go on a nice tropical vacation. You can trust us, we'd never mislead you...
> The Parallella project is not a board, it's intended to be a long term computing project and community dedicated to advancing parallel computing.
> The current $99 boards aren't considered supercomputers by 2012 standards, but a cluster of 10 Parallella boards *would have been considered a supercomputer 10 years ago*.
Wait, what? :D (emphasis mine)
> Our goal is to put a bona-fide supercomputer in the hands of everyone as soon as possible, but the first Parallella board is just the first step. Once we have a strong community in place, work will begin on PCIe boards containing multiple 1024-core chips with 2048 GFLOPS of double-precision performance per chip. At that point, there should be no question that the Parallella would qualify as a true supercomputing platform.
But then, clock rate alone tells you nothing about latency or throughput. What really matters is how much work per cycle you get done, how fast you can move I/O, and the cost of throughput per watt (and whether the system can meet your requirements at all).
They miscalculated how people would interpret it, and got burned. But they've been clear about what it is they actually mean the whole time.
On a related note "petaflops" would be a great name for a pet bunny.
This is wrong.
A 4-core 3.0 GHz x86-64 processor delivers more GFLOPS than the Parallella: 96 GFLOPS with SSE instructions, because each core can execute 8 single-precision operations, 4 adds and 4 muls, each cycle. And yes, when Parallella claims 90 GFLOPS, they mean single precision.
For example, for the same price as a Parallella, you can get a $100 Phenom II X4 965 (4-core, 3.4 GHz, 125W) delivering 109 GFLOPS. Count on $200 to include a minimal mobo/RAM/PSU (if all you care about is raw GFLOPS).
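(Checking those figures:

    4 cores x 3.0 GHz x 8 SP ops/cycle (4 adds + 4 muls) =  96   GFLOPS
    4 cores x 3.4 GHz x 8 SP ops/cycle                   = 108.8 ~ 109 GFLOPS

so both numbers are internally consistent.)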
The main advantage that Parallella has with its exotic architecture over x86-64 is a better GFLOPS/Watt metric. But if you care about this metric you should consider GPUs, which beat Parallella: http://parallelis.com/parallela-supercomputing-for-all-of-us...
Parallella may not beat anything on GFLOPS/Watt and GFLOPS/$, but if they can maintain ease of development (x86-64's stronghold) while doing not too badly on these two metrics (dominated by GPUs), they may be a good compromise and may have a shot at succeeding in the HPC market.
In contrast, the Epiphany chips can execute an individual thread on each core, in parallel, on data that is either local to the core (fastest), in another core's local memory, or in separate main memory (see the sketch below).
The current Epiphany chips aren't too spectacular, since the core count is "low": they can "only" execute 16 individual instruction streams in parallel. But that's on a chip the size of your fingernail, and their roadmap is aiming for 1024-core chips.
They're effectively aiming for people to find ways of making effective use of simple, small, power-efficient cores for problems that are not "data parallel" enough to be done efficiently on GPUs.
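To make "data in another core's memory" concrete: as I read the Epiphany architecture reference, every core's local SRAM also appears at a global alias in one flat 32-bit address space, with the core's mesh (row, col) coordinates packed into the upper address bits, so a remote access is just an ordinary load or store that the mesh routes for you. Field widths below are my reading of the manual (12-bit core ID above a 20-bit local offset) -- double-check before relying on them:

    #include <stdint.h>

    /* Build the global alias of a local address on the core at (row, col).
     * Upper 12 bits = core ID (6-bit row, 6-bit col); lower 20 bits = the
     * core-local offset, i.e. a 1MB address window per mesh node. */
    static inline void *global_addr(unsigned row, unsigned col, uint32_t local) {
        uint32_t id = (row << 6) | col;
        return (void *)((id << 20) | local);
    }

    /* e.g. raise a flag in the core at mesh position (2,3):
     *   *(volatile uint32_t *)global_addr(2, 3, 0x4000) = 1;  */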
How difficult would it be, in a practical sense, to keep all of the cores on something like this "fed" with enough information to get benefits from its concurrency?
I mean, to "feed" all 64 cores enough data/code so they can all "do something" concurrently is one hell of a job all on its own!
Just saying that I could do more with a $99 graphics card sort of misses the point.
Let me know if there's anything else that I can do to help.
No, it can't possibly have.
SHA is half bit-shifts and rotates by constants. On an ASIC platform, those essentially reduce to no-ops. There is no way, no how, that general-purpose hardware could ever get anywhere near even a piss-poor special-purpose ASIC at this task. If you think otherwise you simply don't understand the domain. Those 600-watt ASIC systems contain multiple chips and run at tens of GHashes/s. That 5-watt chip, if it's very, very good, might maybe break 40MHash/s.
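If you want to see what "bit-shifts by constants" means in practice, here are the two big sigma functions from SHA-256 (rotation constants per FIPS 180-4):

    #include <stdint.h>

    #define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

    /* SHA-256 Sigma0/Sigma1, straight from the spec: */
    static uint32_t Sigma0(uint32_t x) { return ROTR(x,2) ^ ROTR(x,13) ^ ROTR(x,22); }
    static uint32_t Sigma1(uint32_t x) { return ROTR(x,6) ^ ROTR(x,11) ^ ROTR(x,25); }

A general-purpose core burns real instructions and cycles on every one of those rotates and XORs. In an ASIC, a rotate by a constant is literally just how the wires are routed between gates: zero logic, zero time. That's the gap, and no number of general-purpose cores closes it.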
> For example a Radeon 6990 has 5.2 teraFLOPS of computing power and yields roughly 800 megahash/s in bitcoin mining.
That was in July 2011. Mining is harder now.
If I were ever to get a desktop machine again, it would have to be cheap and light; I definitely don't want anything clunky, otherwise a laptop seems preferable to me.
There don't seem to be many products that fill that gap: Intel's NUC is too expensive, the Raspberry Pi too slow. Apple's Mac mini seems like the best proposition in this segment.
I wonder if the Parallella could not only be used as a development platform, but also as a desktop computer?
It won't run any fancy games, that's clear, but it may actually be usable for browsing, watching videos and office duties.
That they've actually managed to make it price-competitive with a lot of cheap ARM computers, despite sporting a Zynq (an ARM SoC with a built-in FPGA), is amazing.
So the backers are getting a Very Good Deal, with the hope that a successful launch will make demand high enough to make the $99 price viable at volume.
Hundreds of servers connected together with 1 Gigabit Ethernet is still a "grid cluster"... you need at least 10 Gigabit Ethernet (with iWARP) or InfiniBand (RDMA) to be considered a supercomputer.
This is marketing B.S.! The B.S. is "emphasized" by the 90GFLOPS = 45GHz thing: the 90 GFLOPS figure describes a single 45GHz "ALU" (perhaps an ALU doing a multiply-add, MADD, op), not a full-fledged CPU (like an i7 or Xeon, which has 4-8 cores, with each core having 3 ALUs), as readers might infer.
For example, the i7 3770K does 121.6 GFLOPS at "only" 3.5GHz (ref: table on page 2, http://elrond.informatik.tu-freiberg.de/papers/WorldComp2012...)
Measuring performance in GHz is soooo Pentium III! The whole thing is very misleading, and I don't like that!
Supercomputer? Not even funny! It's a super Raspberry Pi. That's it!
It looks like (from info on Wikipedia pages) the Xeon Phi 3100 gets about 3.3 GFLOPS/watt, whereas the Epiphany E64G401 manages about 50 GFLOPS/watt.
So something like 10 of these might compare to one Xeon Phi, and still be cheaper in terms of hardware, and much cheaper in terms of power consumption.
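Using the commonly quoted datasheet peaks (and noting it's a bit apples-to-oranges: double precision for the Phi, single precision for the Epiphany):

    Xeon Phi 3100:    ~1003 GFLOPS (DP) / 300 W ~  3.3 GFLOPS/W
    Epiphany E64G401:  ~102 GFLOPS (SP) /   2 W ~ 50   GFLOPS/W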
Phi has shared GDDR and distributed caches, and the Phi cores and caches are connected through a bidirectional ring interconnect, not a 2D mesh network. Still similar, but less so.
Furthermore, x86 chips pack all of their performance into a low number of cores, which makes them much more useful for common scalar code. And if 20-times-higher scalar performance isn't enough to convince you to pay a premium, the complexity required to achieve that level of scalar performance is definitely enough to discourage Intel from selling you i7s for $99.
Whether your SoC vendor forces a secure supervisor to load is up to them, and I'd be surprised if an HPC builder had trouble finding vendors to supply parts with a totally controllable boot chain.
I'm sure there are ways to obscure it, but there are just as many ways on x86 platforms; the only real difference is that you could pull the EPROM, reflash it, and inspect the other board components. There are also plenty of evil things you can put in an SoC without relying on TrustZone.
Bottom line: you have to trust your vendor. If you want an SoC integrated and fab-monitored by a business/state that is politically aligned with yours, it is probably just a matter of paying a premium.
And the Raspberry Pi probably doesn't run any secure-mode hypervisor either.
Prior to the popularity of mining using GPUs, it would have been the shizzle.
Today's ASIC-based systems will hash circles around it.
Wikipedia also seems to say it's "a fast computer": "a computer at the frontline of current processing capacity, particularly speed of calculation"
You'll get lots of specs thrown at you, like "in mid-2013 a supercomputer means using X, Y, and Z technologies". But that is just a longer-format version of the above.
A pessimist usually warps the definition to a machine that's primarily programmer-limited rather than CPU- or I/O-limited, LOL.
Over the decades, as parallelism has become popular, the definition has drifted more toward being financially limited than anything else; in the long run this is probably going to be the new definition: an overall system whose performance is limited solely by economics. You might think that's all computers; not so - there are plenty that are inherently limited by architecture to low performance, or limited by programming to single-core/single-thread tasks.
The biggest bummer of supercomputers in the parallel era is that no one is doing anything about latency. It's nice that your 2000-processor design with 20-deep pipelines can eventually, after enormous latency, really churn stuff out, but the olden-days pursuit of low latency as the route to speed was pretty interesting technologically. Hilariously, you'll even get noobs who don't understand the difference between latency and speed, or claim there isn't one.
Not sure if this machine qualifies under that, but it is at least one competing definition to just "cutting edge and very fast".
Yeah, maybe it's faster for all those times during the day when you calculate matrix chain products. But for largely single-threaded tasks, like EVERYTHING you do on a day to day basis, it's going to be significantly slower than your average dual-core i3.
To me it was always clear that the current models are not particularly fast. They may be fast "per watt", and if they succeed in their roadmap, then their future 1024 core chips may be fast for the subset of problems that they are suitable for.
In the meantime, the Kickstarter page is/was careful to pitch this as a stepping stone and a developer platform for playing with the technology first and foremost, not as delivering some incredibly fast computer for end users.
If anything, they've provided an extreme amount of data, down to cycle counts for memory accesses and the instruction set, and they've dumped a lot of code in our laps, including drivers etc., and the final unit actually comes with a faster version of the Zynq SoC than what they promised, after Xilinx apparently gave them an amazing deal.