
DARPA gives $3M to Continuum Analytics to improve Python for big data - astaire
http://www.drdobbs.com/tools/us-defense-agency-feeds-python/240151767
======
pjscott
More info from Continuum Analytics:

<http://continuum.io/blog/continuum-wins-darpa-xdata-funding>

The money is going to develop Blaze and Bokeh:

<http://blaze.pydata.org/>

<https://github.com/ContinuumIO/Bokeh>

~~~
primelens
Better D3 integration from within Python would be awesome.

~~~
paddy_m
Bokeh has three components: BokehJS, which runs in the browser; the Bokeh
server; and the Bokeh Python client library.

BokehJS draws plots in the browser with canvas. It maintains a data model
of plot objects (plots, renderers of different types, tools) with Backbone.js,
and holds a connection to the Bokeh server via web sockets.

The Bokeh server keeps track of plot objects and other models with Redis,
persisting them in documents.

The Python client connects to the server and either creates new plots or
updates existing ones. You can create a plot with Python and get a URL to
embed that plot in any web page. This means your Python code can just be a
script that doesn't have to worry about serving up a web page.

We intend to write client libraries in other languages, like Java, R, and
Ruby, to name a few.
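
The workflow described above (server tracks plot documents, client registers a
plot and gets back an embeddable URL) can be sketched with a toy in plain
Python. This is not Bokeh's actual API; the class, URL scheme, and document
format are all made up for illustration:

```python
# Toy sketch of the Bokeh client/server split described above.  A "server"
# keeps plot documents in a dict (standing in for Redis), and a "client"
# registers a plot and receives an embed URL in return.
import uuid

class ToyPlotServer:
    """Stands in for the Bokeh server: tracks plot documents by id."""
    def __init__(self, base_url="http://localhost:5006"):
        self.base_url = base_url
        self.documents = {}                       # doc_id -> plot object graph

    def store(self, plot):
        doc_id = uuid.uuid4().hex
        self.documents[doc_id] = plot
        return f"{self.base_url}/plot/{doc_id}"   # URL to embed anywhere

    def update(self, doc_id, plot):
        self.documents[doc_id] = plot             # client pushes updates later

server = ToyPlotServer()
url = server.store({"type": "line", "data": [1, 4, 9]})
print(url)  # embed this URL in any page; the script itself can now exit
```

The point of the design is the last line: once the plot lives on the server,
the Python process that created it has no further obligations.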

~~~
siddboots
So... the server would maintain a set of grammar-of-graphics-like objects,
which would be replicated as Backbone models? In theory we could use whichever
graphics libraries we want on top of the Backbone layer, right?

Bokeh is starting to sound like everything I've ever wanted.

------
dekhn
Blaze has the potential to be great. Numpy/Scipy (and previously, Numeric and
numarray) are, at least to scientists, some of the most useful libraries in
Python. Being able to express computation in a high level language, and having
it run in native code, yet still be able to access the data in python, is
often a major productivity enhancer.

I'm still a bit skeptical Blaze will catch on. It's ambitious, and the Python
numeric community has had to chase each new implementation over the years,
with serious backwards-compatibility issues and adoption costs each time. I
would actually far prefer numpy, with its limited functionality, over a much
more complex system without any adopters.

~~~
siddboots
Having a bit of a look at the docs^1, I don't really see it as "yet another
arrays package". The API exposes generalized table/array operations, and a
subset of those features can be implemented using NumPy's ndarrays. If the
interface is clean, then I don't think adoption will be an issue: there simply
isn't anything else that does this.
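
The "generalized operations" idea, one API dispatching over multiple concrete
backends with ndarray as just one of them, can be sketched in miniature. This
is a toy illustration, not Blaze's actual design:

```python
# Toy illustration of "generalized array operations": one `total` API that
# works against whichever backend implements it -- numpy's ndarray being
# just one possible backend.
import numpy as np

def total(data):
    """Sum elements through a generic protocol, not a concrete type."""
    if hasattr(data, "sum"):        # numpy ndarray (or anything sum()-able)
        return float(data.sum())
    return float(sum(data))         # plain Python sequences

print(total(np.arange(5)))          # ndarray backend
print(total([0, 1, 2, 3, 4]))       # list backend, same answer
```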

You are absolutely right that both of these projects are extremely ambitious,
but, given the $3 mil grant, I can't help but be optimistic.

1 [http://blaze.pydata.org/docs/overview.html#blaze-is-a-genera...](http://blaze.pydata.org/docs/overview.html#blaze-is-a-generalization-of-numpy)

------
eliben
This is great! Continuum Analytics are doing cool stuff for numeric / big data
Python and it's great to see the language getting more and more traction in
this domain.

~~~
chm
Will this mean better scientific computing with Python?

~~~
hazov
Depends on what you call scientific computing. Continuum Analytics appears to
have a data/stats focus (I now work as a statistician and people speak well of
their products, though I've never used them). There, I believe Python has the
potential to displace R and Octave/SciLab/MatLab as the language for light
academic work. The CFD (Computational Fluid Dynamics) and CEM (Computational
Electrodynamics) folks generally run on C, C++, and Fortran, or whatever runs
numerical linear algebra fastest; I doubt Python will take that post anytime
soon.

~~~
dchichkov
I doubt that you will find it easy to beat well-written numeric Python code
with a naive C, C++, or Fortran approach.

Python is not inherently slower than C, C++ or Fortran. It is just a language,
after all. If you have a fast implementation of a computational engine, like
Numpy, you can reach the speed of C++ in computational tasks. If you have an
optimizer that simplifies your math expressions before running them, you can
do better than a naive C++ approach. An optimizer and computational engine
that optimally offloads work from the CPU to the GPU can give an order of
magnitude advantage over naive C++/BLAS code.
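
The "fast computational engine" claim is easy to see in miniature: the same
reduction written as a pure-Python loop and as a vectorized NumPy call, which
runs its inner loop in compiled C code. The speed gap is the point; the
assertion only checks that both give the same answer:

```python
# Same reduction two ways: interpreter-bound loop vs. one call into
# compiled code.  On a typical machine the NumPy version is orders of
# magnitude faster for large inputs.
import numpy as np

x = np.random.rand(100_000)

def sum_squares_python(values):
    acc = 0.0
    for v in values:        # every iteration pays interpreter overhead
        acc += v * v
    return acc

def sum_squares_numpy(values):
    return float(np.dot(values, values))   # one call into compiled BLAS

a = sum_squares_python(x)
b = sum_squares_numpy(x)
assert abs(a - b) < 1e-6 * a               # same answer, wildly different cost
```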

And to give you a concrete example: there is a nice Python library called
Theano that does just that.

~~~
coldtea
> _I doubt, that you will find it easy to beat well written numeric Python
> code with naive C, C++ or Fortran approach._

Huh? People do it all the time. Naive C/C++ code (naive, but not leaking
memory or using the wrong algorithmic complexity) beats Python hands down --
and can be more than 10-20 times faster.

> _Python is not inherently slower than C, C++ or Fortran. It is just a
> language after all._

An interpreted language, with a not-that-good interpreter and garbage
collector, non-primitive integers, and other such things holding it back.
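
The "non-primitive integers" point is easy to demonstrate: every CPython int
is a full heap object, far larger than a machine word:

```python
# Every CPython integer is a heap-allocated object carrying a type pointer
# and reference count, not a bare machine word -- one reason tight numeric
# loops in pure Python are slow.
import sys

print(sys.getsizeof(1))        # typically 28 bytes on 64-bit CPython, not 8
print(sys.getsizeof(10**100))  # arbitrary-precision ints grow further still

# A million-element list stores a million *pointers* to such objects:
nums = list(range(1_000_000))
print(sys.getsizeof(nums))     # size of the pointer array alone
```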

> _If you have a fast implementation of computational engine, like Numpy, you
> can reach speeds of C++ in computational tasks._

That's because the "fast implementation" is NOT written in Python.

~~~
dchichkov
Let's go down to a concrete example and consider a typical numeric problem,
say, estimating the cross-entropy gradient of some function over your data.

With Python (and the Theano computational engine) you can write something
along these lines:

    
    
        def cost(goal, prediction):
            crossEntropy = -goal * log(prediction) - (1 - goal) * log(1 - prediction)
            return mean(crossEntropy)
    
        prediction = 1 / (1 + exp( ....  -dot(x,w) - b + ...))
        gradW, gradB = grad(cost(y, prediction) + (w**2).sum(), [w,b])
    

And then just apply that function to a matrix containing your data. That's it.
When you apply the function, it is interpreted and converted into a
computation graph; the graph is optimized, parts of the computation are
offloaded to the GPU with memory transfers between the host and GPU taken care
of, and you get the result back in a user-friendly and efficient Numpy array.

The resulting computation will be nearly optimal, limited only by memory
bandwidth, CPU-GPU bus bandwidth, and the GPU FLOPS rate. With luck you can
get close to the theoretical maximum of your GPU's floating point performance.
And all in a few lines of Python.
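
For reference, here is what Theano's `grad` derives symbolically, written out
by hand with plain NumPy for the cross-entropy term of a logistic model
(omitting the regularizer), and checked against a finite-difference estimate.
All names and data here are made up for the sketch:

```python
# Hand-derived gradient of the mean cross-entropy of a logistic model --
# the thing Theano computes symbolically from the few lines above.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
x = rng.normal(size=(n, d))
y = (rng.random(n) > 0.5).astype(float)   # binary targets
w = rng.normal(size=d)
b = 0.1

def cost(w, b):
    p = 1.0 / (1.0 + np.exp(-x @ w - b))  # sigmoid predictions
    eps = 1e-12                            # guard the logs
    return float(np.mean(-y * np.log(p + eps) - (1 - y) * np.log(1 - p + eps)))

# Analytic gradient: dL/dw = X^T (p - y) / n, dL/db = mean(p - y)
p = 1.0 / (1.0 + np.exp(-x @ w - b))
gradW = x.T @ (p - y) / n
gradB = float(np.mean(p - y))

# Finite-difference check on one coordinate of w:
h = 1e-6
w_plus = w.copy(); w_plus[0] += h
numeric = (cost(w_plus, b) - cost(w, b)) / h
assert abs(numeric - gradW[0]) < 1e-4
```

Writing this out once makes the appeal of the symbolic version obvious: the
analytic gradient has to be re-derived by hand every time the model changes.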

Now consider the same in C++. Yes, it can be done. But there are simply no
open-source libraries available that do all of that. The closest open-source
implementation I know of is gpumatrix, a port of the C++ Eigen library to the
GPU, and it doesn't come close to what is available in Python. So if you want
to match the performance of these few lines of Python in C++, good luck
studying CUDA or OpenCL and implementing the computation engine correctly on
the first try.

(disclaimer) I'm not in any way affiliated with OP and I actually use (and
like) C/C++ a lot.

------
ahaefner
I saw a presentation about Blaze at PyData. The CEO of Continuum is the author
of Numpy, and he said that with Blaze they are doing everything he would have
done differently.

Also, Blaze is attempting to abstract the concept of having huge arrays
distributed across multiple systems, clusters, clouds, whatever. So imagine
having large distributed arrays that you can easily perform computations on.
Could be very powerful if it works as planned.
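
The core idea, computing over an array too large to hold in one place by
combining partial results from chunks, can be sketched with plain NumPy.
Blaze itself would hide this chunking behind an ordinary array API; this toy
just makes the mechanism visible:

```python
# Toy of the out-of-core/distributed idea: a mean over chunks that never
# materializes the full array.  Each chunk could live on another machine,
# on disk, or in the cloud.
import numpy as np

def chunked_mean(chunks):
    """Mean over a sequence of array chunks without concatenating them."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += float(chunk.sum())   # only one chunk resident at a time
        count += chunk.size
    return total / count

data = np.arange(10.0)
chunks = [data[:3], data[3:7], data[7:]]
assert chunked_mean(chunks) == data.mean()
```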

------
toolslive
"If they can learn an easy language, they won't have to rely on an external
software development group to complete their analysis."

Although scientists earn more credit here than business people, allow me to
make a skeptical remark:

This is exactly the same reason they had in the past regarding with SQL.
"Business people can't wait for programmers to run their queries; if they can
learn an easy language, they won't have to rely on programmers to do that for
them..."

In the end, the business people did not do SQL; they just force developers to
do it for them, using SQL.

History repeats itself, perpetuating fallacies

~~~
yummyfajitas
Many business people do write SQL. Typically in a poorly administered Access
database, but they do write SQL.

Scientists also already use several easy languages (R, MATLAB, SAS,
increasingly Python) rather than relying on outside developers.

------
indeyets
It would be really cool if part of this money went toward funding PyPy's NumPy
effort (they still need $15K to be fully funded).

------
ateev23
This will surely make me learn Python over Ruby.

------
wilfra
Article calls it both a 'donation' and an 'investment' - which is it?

~~~
gojomo
I'd guess 'grant' would be a better term than either for what's happening
here. AFAIK, DARPA doesn't take equity, but does make grants based on intended
deliverables.

('Investment' was used in the article in the general sense of "an investment
in our children's future" or "investing in shared infrastructure", rather than
expecting a specific liquid return like cash dividends or an appreciated exit
event.)

------
qompiler
All I see in these libraries are standard data types that are already
available in Python. Are these libraries even being developed by
mathematicians? It looks like someone hired a bunch of first-year computer
science students to churn out wrappers.

~~~
mmcnickle
There are many new data types introduced by these libraries, distinct from the
standard library types. The N-dimensional array is the most notable one.
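
A quick demonstration that the ndarray is genuinely a new data type rather
than a wrapper over Python lists: it has a fixed machine dtype, a shape, and
whole-array arithmetic that nested lists don't support:

```python
# ndarray vs. nested list: fixed dtype, shape, and elementwise arithmetic
# on the whole array at once.
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)        # (2, 3)
print(a.dtype)        # a machine dtype, e.g. int64
print(a * 2 + 1)      # elementwise arithmetic on the whole array

# The equivalent with nested lists needs explicit loops:
lists = [[1, 2, 3], [4, 5, 6]]
doubled = [[v * 2 + 1 for v in row] for row in lists]
assert (a * 2 + 1).tolist() == doubled
```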

