
A Python Compiler for Big Data - bluemoon
http://continuum.io/blog/blaze
======
ezl
I just want to point this out because I feel like there's a good chance a lot
of people won't have gotten this far:

 _Because our implementation does not explicitly depend on Python we are able
to overcome many of the shortcomings of the Python runtime such as running
without the GIL and utilising real threads to dispatch custom Numba kernels
running at near C speed without the performance limitations of Python._

~~~
freyrs3
Yes, using Numba we can just-in-time compile numeric Python logic straight
down to machine code, so naturally we can achieve some pretty impressive
numbers on kernel execution.

In case people didn't reach the bottom, here are the links to the repo and
the docs. The project is still in its early stages, but it is public and
released under a BSD license.

* <http://blaze.pydata.org/docs/>

* <https://github.com/ContinuumIO/blaze>

------
rpm4321
Bit of a tangent, but I'm wondering if anyone here has had any luck with
Cython?

I'm starting to run into some performance bottlenecks with Python, and so I'm
just now looking at Cython, PyPy, Psyco, and... gasp... C.

From what little I've read, Cython is supposed to be as easy as adding some
typing and modifying a few loops here and there, and you are in business.

~~~
frozenport
I would go with C/C++, as the ways to address performance there are well
studied. There are many tools out there, like callgrind or nvvp, that will
make it relatively pain-free.

I can narrow down performance problems in C/C++ quite quickly, but neither I
nor anybody I know has done much of this for Python. Many people I work with
consider a Python implementation a prototype, while Fortran/C/C++ is mature
_real_ code worthy of attention.

The only real downside is that C/C++ requires a little knowledge of
POSIX/Linux or Windows. This represents a learning curve, but once you are
over it, the skills are quite durable and long-lasting.

~~~
winter_blue
> Fortran/C/C++ is mature _real_ code worthy of attention

Just be prepared for Drew Houston, Paul Graham et al. to come after you
cracking their whips... (tongue in cheek)

~~~
frozenport
I'm not scared of a man who speaks with a Lisp!

------
greenonion
So is there anyone using Python for machine learning in production systems
(i.e. not just for prototyping)? I would love to do it, but it seems
Java/Mahout is a safer choice, performance-wise.

I wonder whether Blaze is a step in that direction.

~~~
law
I use Python for nearly all of my ETL processes that involve text processing.
Even in production systems, I'd be hard-pressed to name any significant
performance issues. Python facilitates implementing algorithms in a functional
style, which I tend to prefer over the imperative style (e.g., Java). With
C++11 and Boost, I'm able to translate my Python code to C++ while preserving
the functional style, which has immensely simplified prototyping/deploying
NLP/ML algorithms while simultaneously begetting enormous performance gains. I
see Python as an extremely viable alternative to Java.
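As a purely illustrative example of that functional style (the data here is made up), a small token-normalization step can be composed from pure functions rather than mutating loops:

```python
# Normalize, deduplicate, and order tokens using function composition
# instead of an imperative accumulate-in-a-loop style.
raw = ["The", "quick", "Brown", "FOX", "the"]

tokens = sorted(set(map(str.lower, raw)))
print(tokens)  # ['brown', 'fox', 'quick', 'the']
```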

~~~
greenonion
You've got me a bit confused here. If I understand correctly what you're
saying, you're still using Python for prototyping the core algorithms and C++
in actual production systems. I'm not saying Python is not good for production
systems in general; I'm wondering whether it is good enough for real-world
implementations of machine learning algorithms.

Also, I believe most people would consider Java an alternative to C++,
hence all the Java-based Apache projects, such as Mahout, Solr, etc.

~~~
law
I use Python in production for text pre-processing and other ETL-related
processes, which is part of a larger reinforcement learning approach.
Additionally, I use Python to prototype the core ML algorithms, which I
sometimes re-implement in C++. However, for many of those algorithms, numpy
actually performs identically to BLAS in C++.
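One likely reason for that parity (my reading, not the commenter's): NumPy delegates dense linear algebra to the same BLAS libraries a C++ program would call directly, so the Python overhead disappears for large operations. A small sketch:

```python
import numpy as np

# Matrix multiplication in NumPy is dispatched to the linked BLAS
# backend (dgemm for float64), the same routine C++ code would call.
a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.arange(6, dtype=np.float64).reshape(3, 2)

c = a @ b
print(c)  # [[10. 13.]
          #  [28. 40.]]
```

`np.show_config()` reports which BLAS implementation a given NumPy build is linked against.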

~~~
greenonion
I get it now, thanks. It's very interesting, maybe I will give Python for ML a
chance!

------
davidf18
It would be great to eventually have a GPU version as well (as in the cases of
Matlab and R). I saw a brief demo of Matlab on a Mac Retina Pro 15 where the
GPU version ran 30x faster than the CPU version.

~~~
freyrs3
GPU support is definitely planned and already available in NumbaPro[2]. Here's
a video of Travis Oliphant's talk about targeting CUDA through Numba:

[1] <http://www.ustream.tv/recorded/26973799>

[2] <https://store.continuum.io/cshop/numbapro>

------
Caligula
I read about Continuum after the fellow who developed NumPy left a few days
ago to work there. I am curious to see actual projects using Continuum, so
some sort of write-ups would be welcome.

~~~
omni
You're being downvoted because Travis Oliphant, the original author of Numpy,
is also a co-founder of Continuum Analytics.

~~~
hyperbovine
I figure he's talking about <http://news.ycombinator.com/item?id=4931027>. Not
sure where the downvoting comes in...

------
andrewcooke
how does this compare to theano? it seems like some of the ideas are similar?

<http://deeplearning.net/software/theano/>

in general, i like (ie i don't see a better solution than) the idea of having
an AST constructed via an embedded language that is implemented by a library.
but it does have downsides - integration with other python features is going
to be much more limited (it gives the _illusion_ of a python solution, but in
practice you're off in some other world that only looks like python).

are there more details? i guess the AST is fed to something that does the
work. and that something will have an API and be replaceable. but is that
something also composable? does it have, say, a part related to moving data
and another to evaluating data? so that you can combine "distributed across
local machines" with "evaluate on GPU"?

~~~
freyrs3
> how does this compare to theano? it seems like some of the ideas are
> similar?

It's quite similar; we just take some of the ideas further and try to
generalize the data storage to include backends that data scientists use more
frequently (e.g. SQL, CSV, S3, etc.). We're very friendly with the Theano
developers and hope to bridge the projects with a compatibility layer at some
point.

> (it gives the illusion of a python solution, but in practice you're off in
> some other world that only looks like python).

I would argue that's what makes Python a great numeric language, and NumPy so
successful. You get a high-level language where you can express domain
knowledge, but also a 1:1 mapping to fast code execution at the C level. Blaze
is the continuation of that vision.

> i guess the AST is fed to something that does the work. and that something
> will have an API and be replaceable.

Precisely: we build up an intermediate form called ATerm out of the
constructed expression objects, do type inference and graph rewriting, and
then pattern-match our layout, metadata, and type information against a number
of backends to find the best one to perform execution. Or, if needed, we build
a custom kernel with Numba, informed by all the type and data-layout
information we've inferred from the graph.

We don't aim to solve all the subproblems in this area (expression
optimization passes, distributed scheduling), but I think we have a robust
enough system that others can build extensions to Blaze to do expression
evaluation in whatever fashion they like.

> are there more details?

Yes! See: <http://blaze.pydata.org/>

------
lucian1900
Interesting approach to modelling data that lives elsewhere, in fact quite
similar to SQLAlchemy's.

~~~
piqufoh
... but you can't use numpy operations efficiently on SQLAlchemy data.

~~~
lucian1900
That's not what I meant. Both this and SA turn Python expressions into
expressions to be run elsewhere, on data that isn't necessarily in the
process's memory.
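SQLAlchemy Core shows that pattern directly: Python operators build an expression tree that is compiled to SQL and executed by the database, not in-process. A small sketch (assumes SQLAlchemy 1.4+; the table and column names are made up):

```python
from sqlalchemy import column, select, table

# Lightweight table description; no database connection involved.
points = table("points", column("x"), column("y"))

# Python operators build an expression tree; no data is touched yet.
query = select(points.c.x).where(points.c.y > 5)

# Compiling the tree yields SQL for the database to run elsewhere.
print(str(query))
# SELECT points.x
# FROM points
# WHERE points.y > :y_1
```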

