>> Wish that O(n^3) algorithm could execute in real-time? Now it's possible! We encourage you to push the limits of our platform and disregard what was previously intractable.
One of the most important things to note about asymptotic analysis is that the speed of your computer is immaterial. An O(n^3) algorithm will still scale cubically, irrespective of how it's implemented. A blazing fast, parallelised solution just reduces the constant multiple in front of the n^3, but ultimately n^3 will outgrow that constant. I'm sure you know this, but it's disingenuous to say that you can disregard an algorithm's complexity because of your data / results. A particular algorithm with a particular space complexity (also important) was run on a particular input and was much faster - granted, that's impressive - but it isn't definitive enough to justify labels like '1000x faster' and 'disregard what was previously intractable'.
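To make the constant-factor point concrete, here's a toy back-of-the-envelope model (plain Python; the cost constants are invented for illustration, not measurements of any real hardware):

    # Toy model: runtime ~ c * n^3. A "1000x faster" platform only
    # shrinks the constant c; it does not change the cubic growth.
    def runtime_seconds(n, c):
        return c * n ** 3

    c_cpu = 1e-9          # hypothetical per-n^3 cost on a CPU
    c_gpu = c_cpu / 1000  # same algorithm, constant reduced 1000x

    for n in (1_000, 10_000, 100_000):
        print(n, runtime_seconds(n, c_cpu), runtime_seconds(n, c_gpu))
    # n = 1,000:   1 s on the CPU vs 0.001 s on the GPU - looks "real-time"
    # n = 10,000:  the GPU run is already back up to 1 s
    # n = 100,000: ~1,000 s on the GPU, despite the 1000x constant

Growing n by 10x multiplies the work by 1,000x (10^3), which exactly cancels the advertised constant-factor speedup.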
Also, O(n^3) is considered polynomial, which is absolutely not intractable. The whole point of intractable problems is that they are too complex to solve in a reasonable amount of time (given the current best known methods). This might be very useful for applications that are currently too slow to run in real-time, but it is absolutely not a solution to problems of much higher-order (read: non-polynomial) complexity.
i.e. polynomial usually correlates with tractable, but not necessarily?
My knowledge of this is rusty though, so please correct me if that's inaccurate.
Solving problems caused by misuse of familiar tools (SQL, Hadoop, Excel, etc) seems to be a good meta-strategy for finding good business niches.
Memory-bound jobs are not an issue either. In fact, 12GB Nvidia and AMD GPUs were announced this week.
I would be careful about phrasing it as "memory bound is not an issue". There's a distinction between memory bandwidth (which GPUs excel at as long as the data is resident), memory latency (which GPUs are pretty bad at), and memory capacity. 12GB is slim in the grand scheme of things; my laptop has 16GB and you can get machines with >1TB of memory. Outside of the GPU's memory limits, you're relying on streaming data to/from the device, and potentially bound on PCIe bandwidth, or partitioning the problem across multiple GPUs/machines.
PCIe is the least of our concerns at 16GB/s. Some of our tests indicate that our GPU analyzes data at a rate of ~5GB/s. If we can feed it the data in that amount of time, we're in business. We're also parallelizing disk reads, which is one bottleneck at this point.
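To spell out the arithmetic (only the 16GB/s PCIe and ~5GB/s GPU figures come from our tests; the per-disk read rate below is just an assumption for illustration):

    pcie_gbps = 16.0   # host -> device bandwidth
    gpu_gbps = 5.0     # rate at which the GPU analyzes data
    disk_gbps = 0.5    # assumed sequential read rate of a single disk

    print(pcie_gbps / gpu_gbps)   # ~3.2: the bus can feed the GPU with headroom
    print(gpu_gbps / disk_gbps)   # ~10: drives reading in parallel to keep up

So the PCIe transfer shouldn't starve the kernel; keeping the pipeline fed from storage is the harder part, which is why we're parallelizing the disk reads.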
Also, have you considered using Xeon Phi instead of GPUs? Seems like this would be a perfect use case, and it would probably be much easier to work with x86.
 - http://cs.olemiss.edu/heroes/persistentThread.pdf
The very fitting application is MapReduce, which is embarrassingly parallel.
Please, could you quote the CPU and GPU for which you got that 350x speedup? Also, for the 1000x speedup.
(also, it would be interesting to know the CPU/GPU that gave the original 1 hour to 0.2 sec (18,000x speedup) - I'm guessing other factors like network latency, a low-end CPU vs. a high-end GPU, optimized code, etc. were part of it.)
Also, MapReduce is not embarrassingly parallel. Of its phases, map is the only one that is embarrassingly parallel; the input parsing, shuffle, and reduction phases are not.
We're certainly not the first to describe mapreduce as "embarrassingly parallel," but I can defend myself nonetheless:
A shuffle can execute in parallel, with each unit re-indexing into another unit. Reduction is parallelized over each key, and even finer granularity can be achieved with a tree-based approach:
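Here's a minimal sketch of what I mean by a tree reduction (plain Python with a thread pool, purely for illustration; in practice each level's combines would be per-key GPU work, and the combine function is whatever the reducer does):

    from concurrent.futures import ThreadPoolExecutor

    def tree_reduce(values, combine):
        # Pairwise-combine until one value remains: log2(n) levels,
        # and all combines within a level are independent of each other.
        with ThreadPoolExecutor() as pool:
            while len(values) > 1:
                pairs = [(values[i], values[i + 1])
                         for i in range(0, len(values) - 1, 2)]
                reduced = list(pool.map(lambda p: combine(*p), pairs))
                if len(values) % 2:   # odd element carries to the next level
                    reduced.append(values[-1])
                values = reduced
        return values[0]

    print(tree_reduce(list(range(10)), lambda a, b: a + b))  # 45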
In fact, I used this tree reduction approach to merge non-disjoint clusters of friend groups in my social network clustering algorithm.
Parallelizing the input parsing is problem-specific, but even that is possible in many cases.
Also, what are the limits like for input data sizes? I've done a little OpenCL, but I've never gone past the GPU RAM size.
We could also provide what you're suggesting outside of Hadoop, no problem!
"What are the limits like for input data sizes?" --- There are no limits. You can store your data in AWS, and we will crunch it for you. That said, initially, for IO-bound and disk-bound jobs, ParallelX might not be ideal. This is a problem we are solving as we scale.
Thanks for the feedback! We appreciate it!
Spark is a new computing framework out of Berkeley's AMPLab (https://amplab.cs.berkeley.edu/software/), and it might be an interesting platform to target.
It's being adopted by Twitter, Yahoo, Amazon (http://www.wired.com/wiredenterprise/2013/06/yahoo-amazon-am...), and it's now commercially backed by Databricks (http://databricks.com/), which just received funding from Andreessen Horowitz.
I guess I've only delved into GPU stuff for dense matrix math, though, which is a pretty bad fit with Hadoop. Maybe you guys can come up with some other use-cases for them.
I do a lot of work with graph clustering algorithms and have done a little with GPUs, and I have to say this is, at best, rather unexpected.
That's not a fair CPU/GPU comparison at all.
We're just sharing our story at this point. Next, we'll be sharing our whitepaper and data. We'd rather start gathering feedback now, as opposed to after writing 100k LoC for the compiler.
There's a lacuna in the story between Codentical and ParallelX: 1. people were willing to pay for it; 2. a spark went off; 3. "validate before you build"
You say you did have paying users, so why did you stop? What was the spark? I'm guessing that there weren't enough paying users - but that's not stated. This makes an inexplicable gap as you race down the homestretch of the story. It's not a huge problem, but since the rest of the story is so great, it's a shame to mar it.
Also, the story would be absolutely compelling if you mentioned the specific CPU, GPU and task that gave the incredible speedups. Your reader wonders, "Why aren't they mentioned?"
As I stated, we had a good number of people "willing to pay for it," but they weren't paying customers yet.
We will include this information on our website soon! :)
I would be extremely hesitant to invest the engineering resources to use a product where the founders themselves don't have any conviction in their previous companies. I'd hate to be the victim of another flight of fancy, where the response is "Sorry, but we're pivoting."
This is the first time we'll be working full time, in the same city, if that matters for anything.
The lesson with FB is more nuanced, though - if you want a startup then dependence is bad, but you can still have a very successful lifestyle-type business on their platform and do well.
You're definitely right about the lesson with FB. I'm just bitter about it :)
Or maybe, with all that time spent building apps in college, you forgot to attend your statistics class?
I know little about GPU compilers though. What are the key factors to make this successful?
There are a number of factors that will make this product successful:
1. The compiler needs to work well.
2. We need to teach average developers that they can build new applications/algorithms that were previously unfeasible.
3. We need to reduce Hadoop costs for large corporations.
Simple enough, right? :) Also worth noting that ParallelX will be available as a Heroku add-on as well (with a freemium plan!)
How would this overlap with the upcoming support for GPU computing target for Java 9, being developed by Oracle and AMD under the HSA foundation?
Assuming it really gets integrated, that is.
"The microprocessors that made the microcomputer possible had originally been developed to run traffic lights and vending machines. They had never been intended to power computers. The first entrepreneurs to attempt this were laughed at; the computers they had created looked hardly worthy of the name--they were so small and could do so little. But they caught on with just enough people for whom they saved time, and slowly, the idea took off."
Always reserve the right to change your API TOS!
Facebook is notorious for doing this too.
The deadline was 5PM PST - how many days (or hours) did you have to comply?