As someone who uses R for just about all of my ML/data analysis needs, I'm surprised not to see Theano[0] mentioned. SciPy, scikit-learn, pandas etc. are great and all, but there's not much there that's really different from what you get with R (except, of course, having it all in a general-purpose language). But Theano (plus its related deep learning tools) really stands out for me as something the R tool chain can't compete with.
I feel like eventually I should become as fluent with SciPy/Scikit-learn/pandas as I am with R, but learning Theano well is much higher on my list.
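For anyone who hasn't tried it, here's a minimal sketch of the kind of thing that makes Theano different (my own toy example, not from the docs): you describe a computation symbolically, Theano derives the gradient for you, and the whole thing gets compiled to fast native (optionally GPU) code.

    import theano
    import theano.tensor as T

    x = T.dvector('x')                       # symbolic vector
    loss = T.sum(x ** 2)                     # symbolic expression built from it
    grad = T.grad(loss, x)                   # gradient derived automatically
    f = theano.function([x], [loss, grad])   # compiled down to native code
    print(f([1.0, 2.0, 3.0]))                # [array(14.0), array([2., 4., 6.])]

That combination of symbolic differentiation and compilation in one place is what I mean by "can't compete".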
If you're interested in theano primarily because of deep learning, I highly recommend you check out pylearn2 (not too much documentation, but docs here: http://deeplearning.net/software/pylearn2/ and source on github here: github.com/lisa-lab/pylearn2 ).
Pylearn2 is a set of deep learning algorithms implemented with theano. The LISA (deep learning) group at the University of Montreal (same group that created theano) maintains this library and puts a lot of the code they use for their papers in pylearn2. pylearn2 thus makes it quite easy to use a lot of state of the art algorithms, such as maxout.
Thanks! This is exactly the sort of thing I've been looking for. I really want to experiment with some deep learning techniques for some problems I have, but the start-up cost of "oh yeah, you have to build all the tools yourself right now" keeps putting me off.
> except of course having it all in a general purpose language
But that's the point though. A lot of data analysis is just munging numbers / getting things in shape so you can actually do the analysis, and so being able to do that in a general purpose language is a breath of fresh air.
Indeed, R libraries are superior to Python's and R is more of a lingua franca in the data world, so if you are 'just' doing data it is probably the better choice.
But I would never attempt to build a production system in R. So if you want to go from research to production in the same language, or as the same programmer, Python has all the advantages. You also have the rpy2 route for missing libraries, though that is not the same as doing it natively.
That said, if anybody got the pandas/NumPy/IPython workflow going in Go, I'd drop Python in a heartbeat. I would love faster loops (natively, not just through Numba) and better concurrency in Python.
BTW, IPython now runs R code (and Octave!) interactively, so there is an advantage to knowing both from the Python perspective.
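Roughly like this, via the rmagic extension (from memory, so the exact extension/magic names may differ between IPython and rpy2 versions):

    %load_ext rmagic                   # newer setups: %load_ext rpy2.ipython
    import numpy as np
    X = np.array([1.5, 2.0, 3.5, 4.0])
    %R -i X print(summary(X))          # push X into R and run R code on it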
Whilst O'Reilly books just keep getting worse. Seriously, "Scipy and Numpy" is typo ridden and so simplistic it's irrelevant and "Python for Data Analysis" should really be called "Mainly Pandas 'cos I wrote it, plus a chapter about IPython".
This is really a mischaracterization. The point of the book was to address data tooling topics (and the bare essentials: IPython and NumPy and matplotlib); for most data tasks (especially Chapters 7, 9, and 10) I would challenge you to replicate all of the data work without using pandas and then come back and snark on HN about how I'm self-promoting, or whatever. The truth is, it's the only game in town for complex structured data manipulation in Python, unless you want pages and pages of spaghetti nested dicts and lists.
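To make that concrete, here's the flavour of thing I mean (a toy sketch with made-up column names, not an excerpt from the book):

    import pandas as pd

    df = pd.read_csv('sales.csv')      # hypothetical columns: region, month, amount
    summary = df.groupby(['region', 'month'])['amount'].agg(['sum', 'mean', 'count'])
    wide = summary['sum'].unstack('region')   # months as rows, regions as columns

Doing the same with nested dicts and lists is pages of bookkeeping code.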
The sales numbers already show that the book was timely and relevant to a huge number of people (more than 10,000 so far).
In defense of Wes, the book does do a good job of introducing you to IPython and NumPy. I had mainly used Python as a generic scripting language prior to reading much of this book, mainly writing vanilla scripts in emacs. Afterwards (and with the help of online pandas docs), I felt like I was fairly productive at analyzing some new data that had been thrown my way.
I can definitely see why some think that the title is misleading. The book clearly does not aim to survey the field of data analysis in Python. However, it does teach you everything you need to know to start doing data analysis in Python using one particular (good) tool chain.
Wes, there needs to be a great book about Pandas as it's an amazing tool, and you've written it. But Data Analysis is a HUGE subject and passively implying that it can only be tackled with Pandas is misleading. My post was intended to be a snark at O'Reilly's general failing quality - I'm sorry if I put your nose out of joint.
I do data analysis and pandas is a huge part of that. He can't usefully write about the whole universe of data analysis. The part he wrote about is definitely quite relevant.
Have to agree, piqufoh's comment is asinine. Pandas is the first time I've even considered moving work done in R to a new all-python workflow. Exhaustive coverage by some imaginary, disinterested third party would still be forced to settle on pandas as the focus, because it literally is the only game in town.
As to the merits, per se, of having some third party write the book: I'm doubtful. I followed wesm's development through blog posts, twitter updates, etc. Writing pandas forced him to consider a lot of aspects of dealing with data in python, both deep and practical. In the process, I'm confident he became one of the foremost experts in not only his own software but in the current state of Data Analysis in Python. Who better to write this book?
That's a bummer, I just bought Scipy and Numpy and was looking forward to reading it.
Another piece of advice for people: about 80% of Python for Data Analysis is the same as the online documentation for pandas; the other 20% is examples and other notes. So if you're buying it, you should buy it to support Wes McKinney, not to learn a ton of new stuff. Which wouldn't be the worst thing you could do - pandas is pretty awesome, and I don't know what else you would use in Python to do statistical data analysis. (Of course there are separate tools for more specialized analyses and techniques like ML and NLP, but I mean "plain ol' stats".)
I mostly enjoyed Python for Data Analysis, but do agree that there was too much focus on pandas (though that was to be expected with Wes McKinney writing it).
I would have loved if the book had included sections on the statsmodels and scikit libraries, especially since they are integrated with pandas.
It is actually a good book and definitely worth getting, but it really should have been called Pandas for Data Analysis.
While not directly covering the Python libraries mentioned, such as scikit-learn, this is a good intro to some concepts in machine learning.
https://www.udacity.com/course/cs373
As for the integrated stack problem, I can wholeheartedly recommend the Enthought Python Distribution [1]. I've used it on both Windows and Mac. It includes all the important libraries for linear algebra, matrix computation, and visualization (SciPy, NumPy, matplotlib, etc.). So it is a great replacement for Matlab, which probably falls short of any programming language known to mankind.
That would have been my answer a year ago as well; however, we (Continuum Analytics) have now put out our own distribution, Anaconda, which gives you much more - all for free.
How does this distribution play with the outside world? Will I be able to install other third party libraries into the distribution? Does it have its own 'installation procedure'?
So, we ship Anaconda with our own package management system, called conda. Conda is open source and was created because, for scientific Python, we need to manage versions of non-Python libraries (BLAS/ATLAS/MKL, libhdf5, etc.). But you don't need to use conda if you don't want to; conda is just what we use to install stuff. You can use pip on top of Anaconda if you would like, and we even have functions to turn whatever you did to your Anaconda environment into a conda package if you wanted to.
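For example (illustrative; the package names are arbitrary):

    conda install scipy                    # binary packages, Python or not, with dependency resolution
    conda create -n sandbox numpy pandas   # a separate environment with its own package set
    pip install requests                   # plain pip still works on top of Anaconda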
Have you looked at Numba? There are guarantees about memory that can be made with Fortran that can't be expressed with C. Numba is trying to bring the same type of optimizations to NumPy python code, because it does know some of the memory constraints.
So I watched the numba talk at PyCon. What I still don't understand is: does it speed up any Python code, or only code that uses NumPy? How does it know whether you're using NumPy?
It's a NumPy-aware compiler: you tell the jit decorator the types of the arguments the function will be called with, and that information is used in the compilation. They don't have to be NumPy arrays, but the type declaration mechanism does know about them and can optimize around that. Similar to how providing type information can allow Cython to generate efficient C code, providing type information on the decorator allows Numba to generate efficient LLVM bytecode.
There is also an autojit decorator that watches what you're calling the function with, and compiles it for the given type signature.
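Roughly like this, going by the decorator API described above (the exact signature syntax may have changed since):

    import numpy as np
    from numba import jit, autojit

    @jit('f8(f8[:])')                   # explicit signature: double(double[])
    def total(arr):
        s = 0.0
        for i in range(arr.shape[0]):   # a plain Python loop, compiled via LLVM
            s += arr[i]
        return s

    @autojit                            # compiles lazily for whatever argument types it sees
    def scale(arr, factor):
        return arr * factor

    x = np.arange(1e6)
    total(x)                            # first call triggers compilation
    scale(x, 2.0)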
f2py is a remarkable tool. And it's much easier than the documentation would lead you to believe. It's literally as simple as doing
f2py foo.f -c -m foo
which creates a foo Python module with easy access to whatever functions you put in foo.f, with proper type checking and conversion from Python objects to Fortran arrays. It is also clever enough to remove some unnecessary inputs which are common in old Fortran. For example, in Fortran you usually define an extra input argument to be the length of an input array. f2py determines that this variable is unnecessary, since it can be determined from the length of the Python object, and makes it optional.
If you write Fortran 90/95 it's even better, as you can mark a variable as output (intent(out)) and f2py automatically turns it into a return value on the Python side.
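Putting those two points together, a hypothetical example: suppose your Fortran source defines subroutine mean(x, n, res), where n is the length of x and res is marked as output (intent(out) in free-form Fortran, or a Cf2py directive in old fixed-form code). After building the module with the f2py command above you can do:

    import numpy as np
    import foo                # the module f2py built above (contents hypothetical)

    x = np.arange(10.0)
    res = foo.mean(x)         # n is optional (inferred from len(x)); res comes back
                              # as a return value because it was declared intent(out)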
I wish they would update it to support pointers and custom types, though. That's about the only thing it lacks right now.
It's the best and quickest way I know of linking python and native code. Scipy's weave is also nice.
Are there any similar tools available for Ruby? When I've checked it a few years ago, there were some projects at very early stages; hardly comparable to their Python counterparts.
In the bokehjs repo, we have some unit tests which demonstrate the capabilities - thus far we've focused mainly on getting things ready for demos, and we've SORELY neglected documentation and ease of getting others set up. But now that we have some downtime after PyCon, we'll be working on this very soon. So I guess what I'm saying is: wait a week, and those demos will be available. If you're really eager, you could clone the repo, but it's quite hard to get up and running right now.
https://github.com/ContinuumIO/bokehjs,
https://github.com/ContinuumIO/Bokeh
[0] http://deeplearning.net/software/theano/