>> Wish that O(n^3) algorithm could execute in real-time? Now it's possible! We encourage you to push the limits of our platform and disregard what was previously intractable.
One of the most important things to note about asymptotic analysis is that the speed of your computer is immaterial. An O(n^3) algorithm will experience cubic slowdown irrespective of how it's implemented. A blazing-fast, parallelised solution just reduces the constant multiple in front of the n^3, but ultimately the n^3 term will outgrow any constant-factor savings. I'm sure you know this, but it's disingenuous to say that you can disregard an algorithm's complexity on the strength of one set of results. A particular algorithm with a particular space complexity (also important) was run on a particular input and was much faster - granted, that's impressive - but it isn't definitive enough to justify labels like '1000x faster' and 'disregard what was previously intractable'.
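To make that concrete (an illustrative back-of-the-envelope, not anyone's actual benchmark): grant the GPU version a constant 1000x speedup and see what happens when the input grows by 10x.

    T_serial(n)  = c * n^3
    T_gpu(n)     = (c / 1000) * n^3
    T_gpu(10n)   = (c / 1000) * (10n)^3 = c * n^3 = T_serial(n)

A 10x larger graph eats the entire 1000x speedup; the complexity class hasn't moved an inch.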
Also, O(n^3) is considered polynomial, which is absolutely not intractable. The whole point of intractable problems is that they are too complex to solve (given the current best-known methods). This might be very useful for applications that are currently too slow to run in real time, but it is absolutely not a solution to problems of much higher-order (read: non-polynomial) complexity.
If this is the Moyes algorithm that does graph clustering, I think you have to ask yourself a couple of things:
1. Can a parallel algorithm that optimizes each node individually get the same result as a serial algorithm that optimizes them sequentially? (This is not at all obvious, and for some graph clustering algorithms the answer is "no".)
2. Is an O(n^3) total-work algorithm really that much better than one of the many existing subquadratic graph clustering algorithms out there?
I don't doubt that you have, although I would be curious about the mechanism. Spindles are spindles, whether they're feeding a GPU or a CPU.
I would be careful about phrasing it as "memory bound is not an issue". There's a distinction between memory bandwidth (which GPUs excel at as long as the data is resident), memory latency (which GPUs are pretty bad at), and memory capacity. 12 GB is slim in the grand scheme of things; my laptop has 16 GB and you can get machines with >1 TB of memory. Beyond the GPU's memory limits, you're relying on streaming data to/from the device and potentially bound on PCIe bandwidth, or on partitioning the problem across multiple GPUs/machines.
PCIe is the least of our concerns at 16 GB/s. Some of our tests indicate that our GPU analyzes data at a rate of ~5 GB/s. If we can feed it the data at that rate, we're in business. We're also parallelizing disk reads, which are one bottleneck at this point.
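Rough numbers, just to show the headroom (assuming those sustained rates hold):

    PCIe 3.0 x16 transfer:  ~16 GB/s
    GPU consumption rate:    ~5 GB/s
    headroom:                16 / 5 ≈ 3x

So the bus has roughly 3x of slack before it becomes the constraint; keeping the disks ahead of that is the harder part.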
Right. These problems are all addressable. We're using a streaming scheme to ensure that the bus bandwidth is fully utilized for each GPU. Moreover, a dispatcher will intelligently partition and distribute jobs across multiple hosts.
A lot of academic work has been done on using GPUs for general-purpose workloads. I would recommend familiarizing yourself with the literature if you have not already. Some work that may be interesting to you is the idea of Persistent Thread Groups.
Also, have you considered using Xeon Phi instead of GPUs? Seems like this would be a perfect use case, and it would probably be much easier to work with x86.
GPU acceleration sounds great... until someone actually generalizes the framework, implements applications, and measures even the most trivial workload. There are some niche applications where you may get a 2-10x benefit, but 1000x is either measurement error, an apples-to-bananas comparison, or a new marketing gimmick. Unless you have a very fitting application, it is very easy for a GPU to have less performance than a CPU. I would be very curious what application could actually have a 1000x performance enhancement. Even memory bandwidth, the main advantage of GPUs, does not differ by that amount.
(not parent) Yes, since high-end GPUs have thousands of cores, it makes sense (though I wouldn't expect one GPU core to be faster than one CPU core).
Please, could you quote the CPU and GPU for which you got that 350x speedup? Also, for the 1000x speedup.
(also, it would be interesting to know the CPU/GPU that gave the original 1 hour to 0.2 sec (18,000x speedup) - I'm guessing other factors like network latency, a low-end CPU vs. a high-end GPU, optimized code, etc. were part of it.)
I should clarify -- initially I/O bound, but with pipelining, we are able to "impedance match" the flow of data coming off the backing store with the GPU's computational progress. I suppose the better word is "I/O heavy," rather than "I/O bound."
We're certainly not the first to describe mapreduce as "embarrassingly parallel," but I can defend myself nonetheless:
A shuffle executes in parallel where each unit re-indexes into another unit. Reduction is parallelized over each key, and even finer granularity can be achieved by considering a tree-based approach:
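For illustration, a tree-based reduction over a single key's values might look like this minimal fork/join sketch (plain Java for clarity rather than our generated OpenCL, and the class name is made up for the example):

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Tree-based reduction: split the value range in half, reduce the halves
    // in parallel, then combine the two partial results.
    class TreeSum extends RecursiveTask<Long> {
        private final long[] values;
        private final int lo, hi;

        TreeSum(long[] values, int lo, int hi) {
            this.values = values; this.lo = lo; this.hi = hi;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= 1024) {              // small leaf: reduce sequentially
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += values[i];
                return sum;
            }
            int mid = (lo + hi) >>> 1;
            TreeSum left = new TreeSum(values, lo, mid);
            TreeSum right = new TreeSum(values, mid, hi);
            left.fork();                         // left half runs on another worker
            return right.compute() + left.join();
        }
    }

    // usage: long total = new ForkJoinPool().invoke(new TreeSum(values, 0, values.length));

On a GPU the same shape becomes log(n) passes, each halving the number of active work-items.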
Why bring Hadoop into it? MR is a paradigm for parallel programming, and having it compiled into OpenCL for you is hugely convenient. But the best Hadoop use cases always leverage massive storage capacity: people who do ML on Hadoop are doing it because the data is already in Hadoop, not because it seemed like a nifty thing to do. Do you actually leverage HDFS and the Hadoop scheduling infrastructure to run ParallelX jobs? Or can you safely jettison them and just sell this as: "Write your ML/graph analysis/whatever code in Java, have it execute on a GPU"?
Also, what are the limits like for input data sizes? I've done a little OpenCL, but I've never gone past the GPU RAM size.
"Do you actually leverage HDFS and the Hadoop scheduling infrastructure to run ParallelX jobs?" -- Yes.
We could also provide what you're suggesting outside of Hadoop, no problem!
"What are the limits like for input data sizes?" --- There are no limits. You can store your data in AWS, and we will crunch it for you. That said, initially, for IO-bound and disk-bound jobs, ParallelX might not be ideal. This is a problem we are solving as we scale.
Yeah, it's unclear how the Hadoop cluster comes into play here. However, in general, a Spark cluster (http://spark.incubator.apache.org/) will have better IO performance than Hadoop, and Spark code is much simpler than equivalent Hadoop code.
Are you porting string processing routines to OpenCL? The grid model of GPU computing seems to be a bit of a mismatch with the jagged-edge model of a bunch of variably-sized records in two 1 TB files that I want to join in my typical Hadoop use case.
I guess I've only delved into GPU stuff for dense matrix math, though, which is a pretty bad fit with Hadoop. Maybe you guys can come up with some other use cases for them.
" My co-founder, Chuck Moyes, discovered a clustering algorithm that identified friend groups in a given social graph. However, the algorithm took an hour to run parallelized across 6 computers. Then Chuck had the clever idea of coding the algorithm on a single GPU. The run time? 0.2 seconds! WHOA."
I do a lot of work with graph clustering algorithms and have done a little with GPUs and I have to say this is, at best, rather unexpected.
This was my reaction as well. I think there should be a whitepaper explaining the comparison as many GPU/FPGA application acceleration companies tend to do. I would say a typical GPU speedup would be in the 10-20x range with 100x being possible for highly "regular" data parallelism. I have no doubt the GPU wallclock time is correct so I am guessing the CPU implementation is just very poor.
We used Parallel Python to implement the first version of the clustering algorithm. Although the CPUs weren't particularly beefy, the implementation was by no means poor. It's worth mentioning that the laptop GPU wasn't powerful either.
In that case, how much of your 18000x speedup is due to the GPU and how much of it is due to re-implementing the algorithm in a compiled language? Python->C/C++/etc is a 100x-1000x speedup right off the bat.
Love the serendipity moral; loved surviving the vicissitudes. The story reads well.
There's a lacuna between Codentical and Parallel X: 1. people willing to pay for it; 2. a spark went off; 3. "validate before you build"
You say you did have paying users, so why did you stop? What was the spark? I'm guessing that there weren't enough paying users - but that's not stated. This leaves an inexplicable gap as you race down the homestretch of the story. It's not a huge problem, but since the rest of the story is so great, it's a shame to mar it.
Also, the story would be absolutely compelling if you mentioned the specific CPU, GPU and task that gave the incredible speedups. Your reader wonders, "Why aren't they mentioned?"
Not to be negative, but how soon before these founders quit ParallelX for their next great idea?
I would be extremely hesitant to invest the engineering resources to adopt a product whose founders don't have any conviction in their own previous companies. I'd hate to be the victim of another flight of fancy, where the response is "Sorry, but we're pivoting."
To be fair, Facebook shutting down GraphMuse left us with no other choice but to abandon the project, as mentioned in the article. Greekdex still has active users, and we'll still push the occasional update on there.
1000x? I won't believe it until you can provide a simple, open source example illustrating your technology, or at least a technical write-up on your testing methodology and associated raw data from the benchmarks.
Or maybe, with all that time building apps in college, you forgot to attend your statistics class?
There are a number of factors that will make this product successful:
1. The compiler needs to work well.
2. We need to teach average developers that they can build new applications/algorithms that were previously infeasible.
3. We need to reduce Hadoop costs for large corporations.
Simple enough, right? :) Also worth noting that ParallelX will be available as a Heroku add-on (with a freemium plan!)
Yes, we are in our own separate niche. Rather than optimizing the computation of individual loops and the like, we're attacking the problem from a "Big Data" perspective, where we seek to optimize the processing of large data sets. We're talking about special filesystems (a la FB's Haystack) that optimize disk sector layout, coupled with RAID schemes, to bring data to the GPU as efficiently as possible and overcome I/O bottlenecks. We are initially interop'ing with Hadoop, and we are going to extend this to support Pig and Hive queries (along with database support). I really do think this is an "apples and oranges" comparison, and it's great that they are taking the C++ AMP route of parallelizing loops (perhaps with some parallel primitives mixed in), but at the same time, there's a lot more to the problem we are trying to solve than just parallel compute.
"It was originally designed for traffic lights and vending machines, whereas it ended up fueling the computer revolution" ... I thought silicon chips/cpus were designed for military purpose initially, such as radars ?
"The microprocessors that made the microcomputer possible had originally been developed to run traffic lights and vending machines. They had never been intended to power computers. The first entrepreneurs to attempt this were laughed at; the computers they had created looked hardly worthy of the name--they were so small and could do so little. But they caught on with just enough people for whom they saved time, and slowly, the idea took off."
One can already run Java->OpenCL workloads on Hadoop using https://code.google.com/p/aparapi/ ; you don't need to buy any magic sauce. It compiles kernels written in Java down to OpenCL, which can run on SSE or GPU backends.
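For anyone who hasn't seen it, the programming model looks roughly like this (a small sketch from memory, using class names from the com.amd.aparapi package of the Google Code version, so treat the details as approximate):

    import com.amd.aparapi.Kernel;
    import com.amd.aparapi.Range;

    public class VectorAdd {
        public static void main(String[] args) {
            final int n = 1024;
            final float[] a = new float[n], b = new float[n], sum = new float[n];
            // ... fill a and b ...

            // Aparapi translates the bytecode of run() into an OpenCL kernel at runtime.
            Kernel kernel = new Kernel() {
                @Override public void run() {
                    int i = getGlobalId();   // one work-item per array element
                    sum[i] = a[i] + b[i];
                }
            };
            kernel.execute(Range.create(n));
            kernel.dispose();
        }
    }

If the kernel can't be translated (unsupported bytecode, no OpenCL device), Aparapi falls back to running it in a Java thread pool.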