As someone who uses R for just about all of my ML/data analysis needs, I'm surprised not to see Theano[0] mentioned. SciPy, scikit-learn, pandas etc. are great and all, but there's not much there that's really different from what you get with R (except, of course, having it all in a general-purpose language). But Theano (plus its related deep learning tools) really stands out for me as something the R tool chain can't compete with.
I feel like eventually I should become as fluent with SciPy/Scikit-learn/pandas as I am with R, but learning Theano well is much higher on my list.
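For anyone who hasn't tried it, here's a minimal sketch of the kind of thing that makes Theano different (my own toy example, not from the docs): you describe a computation symbolically, Theano derives the gradient for you, and the whole thing gets compiled to fast native (optionally GPU) code.

    import theano
    import theano.tensor as T

    x = T.dvector('x')                       # symbolic vector
    loss = T.sum(x ** 2)                     # symbolic expression built from it
    grad = T.grad(loss, x)                   # gradient derived automatically
    f = theano.function([x], [loss, grad])   # compiled down to native code
    print(f([1.0, 2.0, 3.0]))                # [array(14.0), array([2., 4., 6.])]

That combination of symbolic differentiation and compilation in one place is what I mean by "can't compete".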
If you're interested in theano primarily because of deep learning, I highly recommend you check out pylearn2 (not too much documentation, but docs here: http://deeplearning.net/software/pylearn2/ and source on github here: github.com/lisa-lab/pylearn2 ).
Pylearn2 is a set of deep learning algorithms implemented with theano. The LISA (deep learning) group at the University of Montreal (same group that created theano) maintains this library and puts a lot of the code they use for their papers in pylearn2. pylearn2 thus makes it quite easy to use a lot of state of the art algorithms, such as maxout.
Thanks! This is exactly the sort of thing I've been looking for. I really want to experiment with some deep learning techniques for some problems I have, but the start-up cost of "oh yeah, you have to build all the tools yourself right now" keeps putting me off.
> except of course having it all in a general purpose language
But that's the point though. A lot of data analysis is just munging numbers / getting things in shape so you can actually do the analysis, and so being able to do that in a general purpose language is a breath of fresh air.
Indeed, R libraries are superior to Python's and R is more of a lingua franca in the data world, so if you are 'just' doing data it is probably the better choice.
But I would never attempt to build a production system in R. So if you want to go from research to production in the same language, or as the same programmer, Python has all the advantages. You also have the rpy2 route for missing libraries, though that is not the same as doing it natively.
That said, if anybody got the pandas/NumPy/IPython workflow going in Go, I'd drop Python in a heartbeat. I would love faster loops (natively, not just through Numba) and better concurrency in Python.
BTW, IPython now runs R code (and Octave!) interactively, so there is an advantage to knowing both from the Python perspective.
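Roughly like this, via the rmagic extension (from memory, so the exact extension/magic names may differ between IPython and rpy2 versions):

    %load_ext rmagic                   # newer setups: %load_ext rpy2.ipython
    import numpy as np
    X = np.array([1.5, 2.0, 3.5, 4.0])
    %R -i X print(summary(X))          # push X into R and run R code on it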
Whilst O'Reilly books just keep getting worse. Seriously, "Scipy and Numpy" is typo ridden and so simplistic it's irrelevant and "Python for Data Analysis" should really be called "Mainly Pandas 'cos I wrote it, plus a chapter about IPython".
This is really a mischaracterization. The point of the book was to address data tooling topics (and the bare essentials: IPython and NumPy and matplotlib); for most data tasks (especially Chapters 7, 9, and 10) I would challenge you to replicate all of the data work without using pandas and then come back and snark on HN about how I'm self-promoting, or whatever. The truth is, it's the only game in town for complex structured data manipulation in Python, unless you want pages and pages of spaghetti nested dicts and lists.
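To make that concrete, here's the flavour of thing I mean (a toy sketch with made-up column names, not an excerpt from the book):

    import pandas as pd

    df = pd.read_csv('sales.csv')      # hypothetical columns: region, month, amount
    summary = df.groupby(['region', 'month'])['amount'].agg(['sum', 'mean', 'count'])
    wide = summary['sum'].unstack('region')   # months as rows, regions as columns

Doing the same with nested dicts and lists is pages of bookkeeping code.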
The sales numbers already show that the book was timely and relevant to a huge number of people (more than 10,000 so far).
In defense of Wes, the book does do a good job of introducing you to IPython and NumPy. I had mainly used Python as a generic scripting language prior to reading much of this book, mainly writing vanilla scripts in emacs. Afterwards (and with the help of online pandas docs), I felt like I was fairly productive at analyzing some new data that had been thrown my way.
I can definitely see why some think that the title is misleading. The book clearly does not aim to survey the field of data analysis in Python. However, it does teach you everything you need to know to start doing data analysis in Python using one particular (good) tool chain.
Wes, there needs to be a great book about Pandas as it's an amazing tool, and you've written it. But Data Analysis is a HUGE subject and passively implying that it can only be tackled with Pandas is misleading. My post was intended to be a snark at O'Reilly's general failing quality - I'm sorry if I put your nose out of joint.
I do data analysis and pandas is a huge part of that. He can't usefully write about the whole universe of data analysis. The part he wrote about is definitely quite relevant.
Have to agree, piqufoh's comment is asinine. Pandas is the first time I've even considered moving work done in R to a new all-python workflow. Exhaustive coverage by some imaginary, disinterested third party would still be forced to settle on pandas as the focus, because it literally is the only game in town.
As to the merits, per se, of having some third party write the book: I'm doubtful. I followed wesm's development through blog posts, twitter updates, etc. Writing pandas forced him to consider a lot of aspects of dealing with data in python, both deep and practical. In the process, I'm confident he became one of the foremost experts in not only his own software but in the current state of Data Analysis in Python. Who better to write this book?
That's a bummer, I just bought Scipy and Numpy and was looking forward to reading it.
Another piece of advice for people: about 80% of Python for Data Analysis is the same as the online documentation for pandas; the other 20% is examples and other notes. So if you're buying it, you should buy it to support Wes McKinney, not to learn a ton of new stuff. Which wouldn't be the worst thing you could do - pandas is pretty awesome, and I don't know what else you would use in Python to do statistical data analysis. (Of course there are separate tools for more specialized analyses and techniques like ML and NLP, but I mean "plain ol' stats".)
I mostly enjoyed Python for Data Analysis, but do agree that there was too much focus on pandas (though that was to be expected with Wes McKinney writing it).
I would have loved if the book had included sections on the statsmodels and scikit libraries, especially since they are integrated with pandas.
It is actually a good book and definitely worth getting, but it really should have been called Pandas for Data Analysis.
While not directly covering the Python libraries mentioned, such as scikit-learn, this is a good intro to some concepts in machine learning.
https://www.udacity.com/course/cs373
As for the integrated stack problem, I can wholeheartedly recommend the Enthought Python Distribution [1]. I've used it on both Windows and Mac. It includes all the important libraries for linear algebra, matrix computation, and visualization (SciPy, NumPy, matplotlib, etc.). So it is a great replacement for Matlab, which probably falls short of any programming language known to mankind.
That would have been my answer a year ago as well; however, we (Continuum Analytics) have now put out our own distribution, Anaconda, which gives you much more - all for free.
How does this distribution play with the outside world? Will I be able to install other third party libraries into the distribution? Does it have its own 'installation procedure'?
So, we ship Anaconda with our own package management system, called conda. Conda is open source and was created because, for scientific Python, we need to manage versions of non-Python libraries (BLAS/ATLAS/MKL, libhdf5, etc.). But you don't need to use conda if you don't want to; conda is just what we use to install stuff. You can use pip on top of Anaconda if you would like, and we even have functions to turn whatever you did to your Anaconda environment into a conda package if you wanted to.
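For example (illustrative; the package names are arbitrary):

    conda install scipy                    # binary packages, Python or not, with dependency resolution
    conda create -n sandbox numpy pandas   # a separate environment with its own package set
    pip install requests                   # plain pip still works on top of Anaconda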
Have you looked at Numba? There are guarantees about memory that can be made with Fortran that can't be expressed with C. Numba is trying to bring the same type of optimizations to NumPy python code, because it does know some of the memory constraints.
So I watched the numba talk at PyCon. What I still don't understand is: does it speed up any Python code, or only code that uses NumPy? How does it know whether you're using NumPy?
It's a NumPy-aware compiler: you tell the jit decorator the types of the arguments the function will be called with, and that information is used in the compilation. They don't have to be NumPy arrays, but the type declaration mechanism does know about them and can optimize around that. Similar to how providing type information can allow Cython to generate efficient C code, providing type information on the decorator allows Numba to generate efficient LLVM bytecode.
There is also an autojit decorator that watches what you're calling the function with, and compiles it for the given type signature.
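Roughly like this, going by the decorator API described above (the exact signature syntax may have changed since):

    import numpy as np
    from numba import jit, autojit

    @jit('f8(f8[:])')                   # explicit signature: double(double[])
    def total(arr):
        s = 0.0
        for i in range(arr.shape[0]):   # a plain Python loop, compiled via LLVM
            s += arr[i]
        return s

    @autojit                            # compiles lazily for whatever argument types it sees
    def scale(arr, factor):
        return arr * factor

    x = np.arange(1e6)
    total(x)                            # first call triggers compilation
    scale(x, 2.0)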
f2py is a remarkable tool. And it's much easier than the documentation would lead you to believe. It's literally as simple as doing
f2py foo.f -c -m foo
which creates a foo Python module with easy access to whatever functions you put in foo.f, with proper type checking and conversion from Python objects to Fortran arrays. It is also clever enough to remove some unnecessary inputs which are common in old Fortran. For example, in Fortran you usually define an extra input argument to be the length of an input array. f2py determines that this variable is unnecessary, since it can be determined from the length of the Python object, and makes it optional.
If you write Fortran 90/95 it's even better, as you can mark a variable as output (intent(out)) and f2py automatically turns it into a return value on the Python side.
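Putting those two points together, a hypothetical example: suppose your Fortran source defines subroutine mean(x, n, res), where n is the length of x and res is marked as output (intent(out) in free-form Fortran, or a Cf2py directive in old fixed-form code). After building the module with the f2py command above you can do:

    import numpy as np
    import foo                # the module f2py built above (contents hypothetical)

    x = np.arange(10.0)
    res = foo.mean(x)         # n is optional (inferred from len(x)); res comes back
                              # as a return value because it was declared intent(out)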
I wish they would update it to support pointers and custom types, though. That's about the only thing it lacks right now.
It's the best and quickest way I know of linking python and native code. Scipy's weave is also nice.
Are there any similar tools available for Ruby? When I've checked it a few years ago, there were some projects at very early stages; hardly comparable to their Python counterparts.
In the bokehjs repo, we have some unit tests which demonstrate the capabilities - thus far we've focused mainly on getting things ready for demos, and we've SORELY neglected documentation and ease of getting others set up. But now that we have some downtime after PyCon, we'll be working on this very soon. So I guess what I'm saying is: wait a week, and those demos will be available. If you're really eager, you could clone the repo, but it's quite hard to get up and running right now.
https://github.com/ContinuumIO/bokehjs,
https://github.com/ContinuumIO/Bokeh
[0] http://deeplearning.net/software/theano/