
HPAT – A compiler-based framework for big data in Python - rbanffy
https://github.com/IntelLabs/hpat
======
tedmiston
The abstract of the related paper, _HiFrames: High Performance Data Frames
in a Scripting Language_ [1], describes HPAT better than the description in the
repo:

> Data frames in scripting languages are essential abstractions for processing
> structured data. However, existing data frame solutions are either not
> distributed (e.g., Pandas in Python) and therefore have limited scalability,
> or they are not tightly integrated with array computations (e.g., Spark
> SQL). This paper proposes a novel compiler-based approach where we integrate
> data frames into the High Performance Analytics Toolkit (HPAT) to build
> HiFrames. It provides expressive and flexible data frame APIs which are
> tightly integrated with array operations. HiFrames then automatically
> parallelizes and compiles relational operations along with other array
> computations in end-to-end data analytics programs, and generates efficient
> MPI/C++ code. We demonstrate that HiFrames is significantly faster than
> alternatives such as Spark SQL on clusters, without forcing the programmer
> to switch to embedded SQL for part of the program. HiFrames is 3.6x to 70x
> faster than Spark SQL for basic relational operations, and can be up to
> 20,000x faster for advanced analytics operations, such as weighted moving
> averages (WMA), that the map-reduce paradigm cannot handle effectively.
> HiFrames is also 5x faster than Spark SQL for TPCx-BB Q26 on 64 nodes of
> Cori supercomputer.

[1]: [https://arxiv.org/abs/1704.02341](https://arxiv.org/abs/1704.02341)
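
The weighted moving average the abstract cites is just a sliding dot product over the series. A plain-NumPy sketch of the sequential operation (function name and sample weights here are illustrative; HiFrames's contribution is compiling loops like this into parallel MPI/C++):

```python
import numpy as np

def weighted_moving_average(prices, weights):
    """WMA: dot each length-len(weights) window of `prices` with `weights`."""
    n = len(weights)
    out = np.empty(len(prices) - n + 1)
    for i in range(len(out)):
        out[i] = prices[i:i + n] @ weights
    return out

print(weighted_moving_average(np.array([1.0, 2.0, 3.0, 4.0]),
                              np.array([0.5, 0.5])))
# [1.5 2.5 3.5]
```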

------
chrisaycock
The linked papers demonstrate the performance gains in a specific arena:
static compilation around a loop instead of mere data parallelism of the
iterations. For example:

    
    
        for i in range(num_iterations):
            some_function(i)

A library like Spark can execute the function in parallel, but must return
control to the host language for each iteration. That leads to a restart of
the distributed environment.

HPAT is a compiler. The function is executed in parallel, just like in Spark,
but the distributed environment doesn't return control to the host language
between iterations. Instead, the loop is run in its entirety in the
distributed environment.

Spark's _driver_ must farm work to _executors_ during each iteration. HPAT
removes that runtime overhead.
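
The difference can be sketched in plain Python with a thread pool standing in for the cluster (all names here are illustrative, not Spark or HPAT APIs): per-iteration dispatch returns to the host after every task, while whole-loop dispatch hands the entire loop over once.

```python
from concurrent.futures import ThreadPoolExecutor

def some_function(i):
    # stand-in for a distributed computation on chunk i
    return i * i

def per_iteration_dispatch(num_iterations):
    # Spark-like: the host submits each iteration as a separate task,
    # paying scheduling/coordination overhead every time around the loop.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(some_function, range(num_iterations)))

def whole_loop_dispatch(num_iterations):
    # HPAT-like: the whole loop is shipped as one unit; control returns
    # to the host only once, after all iterations have finished.
    def run_all(n):
        return sum(some_function(i) for i in range(n))
    with ThreadPoolExecutor() as pool:
        return pool.submit(run_all, num_iterations).result()

print(per_iteration_dispatch(10), whole_loop_dispatch(10))  # 285 285
```

Both return the same answer; the point is where the coordination overhead is paid.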

------
noobermin
Ugh, MPI. I guess it uses MPI behind the scenes, but MPI is the wrong level of
abstraction for most things today. This[0] convinced me of that.

[0] [https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it...](https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html)

------
zitterbewegung
From a usability standpoint this is really interesting. If I don't have to
think about moving to Spark and can instead just code within the restrictions
of a subset of Python, that is really neat.
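
For a feel of that subset: HPAT's README shows a Monte Carlo pi kernel that is plain NumPy plus a `@hpat.jit` decorator. The sketch below runs without HPAT installed (decorator commented out); distributing the work is what the decorator would add.

```python
import numpy as np
# import hpat  # with HPAT installed: uncomment and decorate with @hpat.jit

def calc_pi(n):
    # Ordinary NumPy code; HPAT compiles and distributes it unchanged.
    x = 2 * np.random.random(n) - 1
    y = 2 * np.random.random(n) - 1
    return 4 * np.mean(x ** 2 + y ** 2 < 1)

print(calc_pi(10 ** 6))  # roughly 3.14
```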

~~~
tedmiston
It's especially interesting because of all of the other implications of moving
to Spark, like trying to debug app code that runs PySpark, which runs Scala
Spark on the JVM. It's very difficult to trace errors all the way through, and
there's no way to use tools like pdb. It takes me so much longer to debug
errors in Spark code.

------
gravypod
From their documentation:
[https://intellabs.github.io/hpat/_build/html/source/notsuppo...](https://intellabs.github.io/hpat/_build/html/source/notsupported.html)

"Similarly, function calls should also be deterministic. The below example is
not supported since function f is not known in advance"

What if the if statement is evaluable at compile time?
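
A sketch of the kind of pattern the docs rule out, reconstructed for illustration (variable names are hypothetical): a static compiler cannot type a call site when the function object itself is chosen at runtime, but an equivalent branch over known calls is fine.

```python
import numpy as np

n, flag = 3, True

# Not compilable by a static approach: which function `f` names depends
# on a runtime value, so the call f(n) cannot be typed in advance.
f = np.zeros if flag else np.ones
A = f(n)

# Compilable rewrite: each call site names a single known function.
B = np.zeros(n) if flag else np.ones(n)

print(A, B)  # [0. 0. 0.] [0. 0. 0.]
```

If `flag` happened to be a compile-time constant, constant folding could in principle resolve `f` statically, which is exactly the question here.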

------
The_Amp_Walrus
So this framework generates MPI-compliant code, which is intended to be sent
out to a bunch of supercomputer nodes and executed.

What kind of existing infrastructure would you need to actually run this? I
assume you would need to install agents that understand MPI on all your nodes
and run some sort of centralised master agent to send out messages and
supervise the whole thing. Is there a standard way to do this?

~~~
jzwinck
Yes, you need to install an MPI implementation such as Open MPI or MPICH. You
probably also want to install a scheduler such as the one formerly known as Sun
Grid Engine
([https://en.m.wikipedia.org/wiki/Oracle_Grid_Engine](https://en.m.wikipedia.org/wiki/Oracle_Grid_Engine)),
Torque, or Slurm. SGE and Torque provide a program called qsub, which launches
multiple copies of a job in parallel across machines (Slurm's equivalent is
sbatch). Perfect for MPI.
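
A sketch of what a submission might look like; the exact flags, parallel environment name, and the `run_job.sh` wrapper are hypothetical and scheduler-specific:

```shell
# SGE-style: request a 64-slot MPI parallel environment for the job
qsub -pe mpi 64 run_job.sh

# where run_job.sh would contain something like:
#   mpiexec -n 64 python my_hpat_script.py
```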

A vanilla compiler will suffice for MPI code, so you don't need anything
special at build time.

------
conceptoriented
HPAT looks pretty promising. I wonder how they managed to significantly
increase the performance of shuffling and sorting, which are known to be quite
difficult operations in map-reduce.
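
For intuition, the shuffle at the heart of those operations is a hash partition followed by an all-to-all exchange (MPI_Alltoallv in MPI terms). A toy single-process sketch, with lists standing in for ranks (all names illustrative):

```python
def shuffle(local_rows_per_rank, nranks):
    """Hash-partition each rank's (key, value) rows and exchange them."""
    # outbox[sender][dest]: rows that `sender` will send to `dest`
    outbox = [[[] for _ in range(nranks)] for _ in local_rows_per_rank]
    for sender, rows in enumerate(local_rows_per_rank):
        for key, value in rows:
            outbox[sender][hash(key) % nranks].append((key, value))
    # the "all-to-all": rank r gathers its partition from every sender
    return [[row for sender_box in outbox for row in sender_box[r]]
            for r in range(nranks)]

print(shuffle([[(0, "a"), (1, "b")], [(2, "c"), (1, "d")]], 2))
# [[(0, 'a'), (2, 'c')], [(1, 'b'), (1, 'd')]]
```

After the exchange each rank holds every row for its keys and can sort or aggregate locally; the papers' claim is that emitting this as compiled MPI/C++ beats routing it through a generic map-reduce shuffle.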

