
From a Frat Social Network to a GPU Compiler for Hadoop - tonydiv
https://medium.com/game-changing-ideas/c3c516146599
======
daniel-levin
From the product website:

>> Wish that O(n^3) algorithm could execute in real-time? Now it's possible!
We encourage you to push the limits of our platform and disregard what was
previously intractable.

One of the most important things to note about asymptotic analysis is that the
speed of your computer is immaterial. An O(n^3) algorithm _will_ experience
cubic slowdown, irrespective of how it's implemented. A blazing fast,
parallelised solution just reduces the constant multiple in front of the n^3,
but ultimately, n^3 will outgrow that constant. I'm sure you know this, but
it's disingenuous to tell users they can disregard an algorithm's complexity
on the strength of one benchmark. A particular algorithm with a particular
space complexity (also important) was run on a particular input and was much
faster - granted, that is impressive - but it is not definitive enough to
justify labels like '1000x faster' and 'disregard what was previously
intractable'.
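
To make that concrete: if a serial run takes

    T(n) = c * n^3

then an ideal N-way parallel version takes

    T_par(n) = (c / N) * n^3

Doubling n multiplies both by a factor of 8; the hardware only shrank c, not
the growth rate.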

Also, O(n^3) is polynomial, which is absolutely not intractable. The whole
point of intractable problems is that they are too complex to solve (given the
current best known methods). This might be very useful for applications that
are currently too slow to run in real-time, but it is absolutely not a
solution to problems of much higher (read: non-polynomial) complexity.

~~~
Hydraulix989
Right, we're not changing the complexity itself; N-ary parallelization in the
ideal case modifies the runtime s.t. C' = C_0 / N.

~~~
daniel-levin
By Amdahl's law, that idealisation assumes that your program is totally
parallelisable. Very few real world applications actually are. Out of
interest, was yours? (Or at least very close to it?)
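
For reference, with serial fraction B, the speedup on N processors is

    S(N) = 1 / (B + (1 - B) / N)

which approaches N only as B approaches zero.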

~~~
Hydraulix989
Yes, our algorithm performed an O(N^2) loop over the graph vertices so B for
us was very close to 1.

~~~
daniel-levin
OK, but I think you mean B was close to zero then?

~~~
Hydraulix989
Fair enough.

------
fiatmoney
Hadoop & MapReduce are usually an iffy choice for compute-bound workloads and
definitely a bad choice for memory-bandwidth bound problems.

Solving problems caused by misuse of familiar tools (SQL, Hadoop, Excel, etc)
seems to be a good meta-strategy for finding good business niches.

~~~
tonydiv
We've proven that we can also speed up IO-bound and disk-bound workloads.

Memory-bound jobs are not an issue either. In fact, 12GB Nvidia and AMD GPUs
were announced this week.

~~~
fiatmoney
I don't doubt that you have, although I would be curious about the mechanism.
Spindles are spindles, whether they're feeding to a GPU or a CPU.

I would be careful about phrasing it as "memory bound is not an issue".
There's a distinction between memory bandwidth (which GPUs excel at as long as
the data is resident), memory latency (which GPUs are pretty bad at), and
memory capacity. 12GB is slim in the grand scheme of things; my laptop has
16GB and you can get machines with >1TB of memory. Outside of the GPU's memory
limits, you're relying on streaming data to/from the device, and potentially
bound on PCIe bandwidth, or partitioning the problem across multiple
GPUs/machines.
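
Outside those limits, the streaming case tends to look something like this
sketch (Dataset and processOnGpu are hypothetical stand-ins for the storage
layer and device runtime):

    // Illustrative only: push a dataset far larger than device memory
    // through the GPU one resident-sized chunk at a time.
    interface Dataset {
        long length();
        byte[] readChunk(long offset, int len);    // disk/network bound
    }

    static void streamThroughGpu(Dataset data) {
        final long CHUNK = 1L << 30;               // 1GB slices that fit on the device
        for (long off = 0; off < data.length(); off += CHUNK) {
            int len = (int) Math.min(CHUNK, data.length() - off);
            byte[] buf = data.readChunk(off, len); // then a PCIe transfer
            processOnGpu(buf);                     // device compute; overlap these
        }
    }

    static void processOnGpu(byte[] chunk) { /* hand off to the device runtime */ }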

~~~
tonydiv
Correct.

PCIe is the least of our concerns at 16GB/s. Some of our tests indicate that
our GPU analyzes data at a rate of ~5GB/s. If we can feed it data at that
rate, we're in business. We're also parallelizing disk reads, which is one
bottleneck at this point.
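
(In other words, the bus has roughly 3x headroom: PCIe can move 16GB in the
second the GPU takes to analyze 5GB, so sustaining ~5GB/s off the disks is
the number we need to hit.)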

------
nemonemo
GPU acceleration sounds all great... until anyone actually generalizes the
framework, implements applications, and measures even the most trivial
workload. There may be some niche applications where you get a 2-10x benefit,
but 1000x is either measurement error, an apples-to-bananas comparison, or a
new marketing gimmick. Unless you have a very fitting application, it is very
easy for a GPU to have less performance than a CPU. I would be very curious
what application could actually see a 1000x performance enhancement. Even
memory bandwidth, the main benefit of a GPU, does not differ by that amount.

~~~
tonydiv
We are not making up these numbers. Even for an IO-bound workload, we are
seeing 350x speedup.

The very fitting application is MapReduce, which is embarrassingly parallel.

~~~
nemonemo
This is a very bold statement. IO-bound means that the bottleneck is IO, and
a GPU has little benefit for IO.

Also, MapReduce is not embarrassingly parallel. Of its phases, map is the only
embarrassingly parallel one; the input parsing, shuffle, and reduction phases
are not.

~~~
Hydraulix989
I should clarify -- initially I/O bound, but with pipelining, we are able to
"impedance match" the flow of data coming off the backing store with the GPU's
computational progress. I suppose the better word is "I/O heavy," rather than
"I/O bound."

We're certainly not the first to describe MapReduce as "embarrassingly
parallel," but I can defend myself nonetheless:

A shuffle executes in parallel where each unit re-indexes into another unit.
Reduction is parallelized over each key, and even finer granularity can be
achieved by considering a tree-based approach:

[http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x...](http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf)

In fact, I used this tree reduction approach to merge non-disjoint clusters of
friend groups in my social network clustering algorithm.
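
As a rough sketch of that pattern (plain Java on the CPU for readability; on
a GPU, each stride step of the outer loop is one parallel launch over the
inner loop):

    // Tree-based sum reduction: each step combines element i with
    // element i + stride, halving the active range until one value
    // remains -- log2(n) steps instead of a serial n-step fold.
    static int treeReduceSum(int[] data) {
        int n = data.length;                    // assumes n is a power of two
        for (int stride = n / 2; stride > 0; stride /= 2) {
            for (int i = 0; i < stride; i++) {  // runs in parallel on a GPU
                data[i] += data[i + stride];
            }
        }
        return data[0];
    }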

Parallelizing the input parsing is problem-specific, but even that is possible
in many cases.

------
alanctgardner2
Why bring Hadoop into it? MR is a paradigm for parallel programming, and
having it compiled into OpenCL for you is hugely convenient. But the best
Hadoop use cases always leverage massive storage capacity: people who do ML on
Hadoop are doing it because the data is already in Hadoop, not because it
seemed like a nifty thing to do. Do you actually leverage HDFS and the Hadoop
scheduling infrastructure to run ParallelX jobs? Or can you safely jettison
them and just sell this as: "Write your ML/graph analysis/whatever code in
Java, have it execute on a GPU"?

Also, what are the limits like for input data sizes? I've done a little
OpenCL, but I've never gone past the GPU RAM size.

~~~
tonydiv
"Do you actually leverage HDFS and the Hadoop scheduling infrastructure to run
ParallelX jobs?" \-- Yes.

We could also provide what you're suggesting outside of Hadoop, no problem!

"What are the limits like for input data sizes?" \--- There are no limits. You
can store your data in AWS, and we will crunch it for you. That said,
initially, for IO-bound and disk-bound jobs, ParallelX might not be ideal.
This is a problem we are solving as we scale.

Thanks for the feedback! We appreciate it!

~~~
jbooth
Are you porting string processing routines to OpenCL? The grid model of GPU
computing seems to be a bit of a mismatch with the jagged-edge model of a
bunch of variably-sized records in two 1TB files that I want to join on in my
typical Hadoop use-case.

I guess I've only delved into GPU stuff for dense matrix math, though, which
is a pretty bad fit with Hadoop. Maybe you guys can come up with some other
use-cases for them.

~~~
Hydraulix989
Yes, we are porting string processing routines to OpenCL. We are going to have
many of the Java libraries accessible within the compiled GPU code, including
file system I/O.
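
For the jagged records jbooth mentions, the usual trick (not specific to us)
is a flat buffer plus an offsets array, so work-item i can find record i
without pointer chasing. A sketch:

    import java.nio.charset.StandardCharsets;

    // Pack variable-length records into one flat byte buffer plus an
    // offsets array; record i occupies flat[offsets[i] .. offsets[i+1]).
    static byte[] pack(String[] records, int[] offsets) {
        int pos = 0;                           // offsets.length == records.length + 1
        for (int i = 0; i < records.length; i++) {
            offsets[i] = pos;
            pos += records[i].getBytes(StandardCharsets.UTF_8).length;
        }
        offsets[records.length] = pos;
        byte[] flat = new byte[pos];
        for (int i = 0; i < records.length; i++) {
            byte[] b = records[i].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(b, 0, flat, offsets[i], b.length);
        }
        return flat;                           // ship flat + offsets to the device
    }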

------
sanskritabelt
" My co-founder, Chuck Moyes, discovered a clustering algorithm that
identified friend groups in a given social graph. However, the algorithm took
an hour to run parallelized across 6 computers. Then Chuck had the clever idea
of coding the algorithm on a single GPU. The run time? 0.2 seconds! WHOA."

I do a lot of work with graph clustering algorithms and have done a little
with GPUs and I have to say this is, at best, rather unexpected.

~~~
groups
I genuinely don't understand what "'at best' 'rather' 'unexpected'" means.

~~~
tonydiv
I'm not sure either! Either way, implementing a clustering algorithm on a GPU
was thinking out of the box for us :)

~~~
sanskritabelt
What it means is I'm hesitant to believe an 18000x speedup.

~~~
enos_feedler
This was my reaction as well. I think there should be a whitepaper explaining
the comparison, as many GPU/FPGA application acceleration companies tend to
provide. I would say a typical GPU speedup is in the 10-20x range, with 100x
possible for highly "regular" data parallelism. I have no doubt the GPU
wall-clock time is correct, so I am guessing the CPU implementation is just
very poor.

~~~
tonydiv
We used parallel Python to implement the first version of the clustering
algorithm. Although the CPUs weren't particularly beefy, the implementation
was by no means poor. It's worth mentioning that the laptop GPU wasn't
powerful either.

~~~
sanskritabelt
In that case, how much of your 18000x speedup is due to the GPU and how much
of it is due to re-implementing the algorithm in a compiled language?
Python->C/C++/etc is a 100x-1000x speedup right off the bat.

That's not a fair CPU/GPU comparison at all.

~~~
tonydiv
You're right, it isn't a perfectly fair comparison. However, even if we had
coded it in C++, the run time difference would still have been significant.

We're just sharing our story at this point. Next, we'll be sharing our
whitepaper and data. We'd rather start gathering feedback now, as opposed to
after writing 100k LoC for the compiler.

------
6ren
Love the serendipity moral; loved surviving the vicissitudes. The story reads
well.

There's a lacuna between _Codentical_ and _ParallelX_: 1. people willing to
pay for it; 2. a spark went off; 3. "validate _before_ you build"

You say you did have paying users, so why did you stop? What was the spark?
I'm guessing that there weren't enough paying users - but that's not stated.
This makes an inexplicable gap as you race down the homestretch of the story.
It's not a huge problem, but since the rest of the story is so great, it's a
shame to mar it.

Also, the story would be absolutely compelling if you mentioned the specific
CPU, GPU and task that gave the incredible speedups. Your reader wonders, "Why
aren't they mentioned?"

~~~
tonydiv
We had paying users for GraphMuse, not Codentical. GraphMuse was shut down by
Facebook, whereas we quit Codentical.

As I stated, we had a good number of people "willing to pay for it," but they
weren't paying customers yet.

We will include this information on our website soon! :)

------
steven2012
Not to be negative, but how soon before these founders quit ParallelX for
their next great idea?

I would be extremely hesitant investing the engineering resources to use a
product where the founders themselves don't have any conviction in their
previous companies. I'd hate to be the victim of another flight of fancy,
where the response is "Sorry, but we're pivoting."

~~~
tonydiv
ParallelX has been an idea in the making for 2.5 years. Your skepticism is
valid, and only as we continue developing, iterating, and working with
customers will we convince you that we're serious.

This is the first time we'll be working full time, in the same city, if that
matters at all.

------
idoh
Hey Tony - nice to see that you've bounced back after GraphMuse! I was the PM
for Bingo and GraphMuse really did help with the #'s, and Facebook really did
shut the whole thing down with no notice.

The lesson with FB is more nuanced though - if you want a startup then
dependence is bad, but you can have a very successful lifestyle-type business
on them and do well.

~~~
tonydiv
Hey Idoh! I'm happy you saw our post, and thank you for the kind words.

You're definitely right about the lesson with FB. I'm just bitter about it :)

------
iamleppert
1000x? I won't believe it until you can provide a simple, open source example
illustrating your technology, or at least a technical write-up on your testing
methodology and associated raw data from the benchmarks.

Or maybe all that time building apps in college you forgot to attend your
statistics class?

~~~
enos_feedler
There should definitely be some kind of whitepaper accompanying this marketing
claim.

~~~
tonydiv
We are working on that paper right now. If you're interested, feel free to
provide us your email on the site.

------
laglad
I've known Tony now for 6 years. He's smart and extremely resilient - two
great qualities of an entrepreneur.

I know little about GPU compilers though. What are the key factors to make
this successful?

~~~
tonydiv
I can't thank you enough!

There are a number of factors that will make this product successful:

1. The compiler needs to work well.

2. We need to teach average developers that they can build new
applications/algorithms that were previously unfeasible.

3. We need to reduce Hadoop costs for large corporations.

Simple enough, right? :) Also worth noting that ParallelX will be a Heroku
add-on as well (with a freemium plan!)

------
jerven
Very nice read! How does this compare to the work by AMD, ARM, and Oracle to
bring GPU compute to the JVM in Java 8? Do you see this as competition, or do
you feel you are in a separate niche?

~~~
Hydraulix989
Yes, we are in our own separate niche. Rather than optimizing the computation
of individual loops and the like, we're attacking the problem from a "Big
Data" perspective where we seek to optimize processing large data sets. We're
talking about special filesystems a la FB's Haystack to optimize disk sector
layout, coupled with RAID schemes, to bring data to the GPU as efficiently as
possible and overcome I/O bottlenecks.

We are initially interop'ing with Hadoop, and we are going to extend this to
support Pig and Hive queries (along with database support). I really do think
this is an "apples and oranges" comparison, and it's great that they are
taking the C++ AMP route of parallelizing loops (perhaps with some parallel
primitives mixed in), but at the same time, there's a lot more to the problem
we are trying to solve than just parallel compute.

------
pjmlp
Great work!

How would this overlap with the upcoming GPU computing target for Java 9,
being developed by Oracle and AMD under the HSA Foundation?

Assuming it really gets integrated, that is.

~~~
Hydraulix989
From what I understand, the GPU computing target is a loop parallelizer a la
Microsoft's C++ AMP.

------
Keyframe
"Extraordinary claims require extraordinary evidence".

------
guiomie
"It was originally designed for traffic lights and vending machines, whereas
it ended up fueling the computer revolution" ... I thought silicon chips/cpus
were designed for military purpose initially, such as radars ?

~~~
tonydiv
Not from what I've read, but I could be wrong:

 _" The microprocessors that made the microcomputer possible had originally
been developed to run traffic lights and vending machines. They had never been
intended to power computers. The first entrepreneurs to attempt this were
laughed at; the computers they had created looked hardly worthy of the name--
they were so small and could do so little. But they caught on with just enough
people for whom they saved time, and slowly, the idea took off."_

[http://www.fastcompany.com/3002810/aha-moments-made-paul-grahams-y-combinator-possible](http://www.fastcompany.com/3002810/aha-moments-made-paul-grahams-y-combinator-possible)

------
catdaddy
Great read - I've known Tony for some time, and seen him through a few of his
ventures, and this strategy seems like an incredibly savvy way of solving a
real problem. Congrats, Tony!

~~~
tonydiv
Who is this? Either way, thank you very much :D

------
sitkack
One can already run Java->OpenCL workloads on Hadoop using
[https://code.google.com/p/aparapi/](https://code.google.com/p/aparapi/); you
don't need to buy any magic sauce. It compiles kernels written in Java down to
OpenCL, which can run on CPU (SSE) or GPU backends.
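
A kernel there is just an overridden run() method. A minimal sketch against
Aparapi's Kernel/Range API:

    import com.amd.aparapi.Kernel;
    import com.amd.aparapi.Range;

    public class SquareExample {
        public static void main(String[] args) {
            final int n = 1 << 20;
            final float[] in = new float[n];
            final float[] out = new float[n];
            for (int i = 0; i < n; i++) in[i] = i;

            Kernel kernel = new Kernel() {
                @Override public void run() {
                    int i = getGlobalId();       // one work-item per element
                    out[i] = in[i] * in[i];
                }
            };
            kernel.execute(Range.create(n));     // bytecode -> OpenCL on first run
            kernel.dispose();
        }
    }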

------
j_s
_Then we received this email from Facebook’s Platform Policy Team [Notice of
Violation] I used the Wayback Machine to see when the clause was added… 3 days
prior_

Always reserve the right to change your API TOS!

~~~
tonydiv
While I realize that a TOS is always subject to change, I still think it's a
low blow to pull something like that. They intended it to be a one-way
conversation as well. We had no way to respond.

Facebook is notorious for doing this too.

~~~
scott_usa
I love how the email from FB began with "hi," as if it were a note from your
dear grammy checking in with you.

The deadline was 5PM PST - how many days (or hours) did you have to comply?

~~~
tonydiv
We had less than 3 days to comply if I recall. Considering we had multiple
customers in a handful of countries, it wasn't much time.

------
truthteller
Where is the paper for this novel clustering algorithm?

~~~
tonydiv
Here you go -
[http://www.eecs.berkeley.edu/~isabelle/papers/cci.pdf](http://www.eecs.berkeley.edu/~isabelle/papers/cci.pdf)

------
bouk
Cool story, and good luck with your company!

------
booop
Great read, and ParallelX looks great too.

------
officialjunk
i like this piece. great journey. thanks for sharing

~~~
tonydiv
You're very welcome. I appreciate the kind words.

------
crowhack
Enjoyed this :)

------
pbhojwani
Great read!

