I love the concept and ease of development, but I can't shake the feeling that the infrastructure is so shaky it almost amounts to instant technical debt (sorry if this offends anyone, I'm just a dumb customer.)
But now Dave is working on mrjob regularly again, hence the pace of recent improvements.
Grandparent is correct about the second-class support for non-EMR production Hadoop usage. Like any open source project, the code only works well if a major stakeholder invests in improving it. Few non-EMR users spend much time contributing, so the situation doesn't improve.
It's free with a permissive license.
It is also capable of native HDFS and YARN integration, and it can express more complex and granular parallel patterns than plain map-reduce. There's also an API for distributed dataframes and arrays with linear algebra ops.
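For a feel of what "more granular than map-reduce" means: Dask internally represents computations as plain dict task graphs, where keys map to values or `(callable, *args)` tuples. This is a toy illustrative sketch of that idea in stdlib Python, not Dask's actual scheduler:

```python
from operator import add, mul

def get(graph, key):
    """Recursively evaluate `key` in a dict-based task graph.
    A task is a (callable, *args) tuple; args may name other keys.
    (No caching or parallelism here; a real scheduler adds both.)"""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        return func(*(get(graph, a) if (isinstance(a, str) and a in graph) else a
                      for a in args))
    return task

graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", "z"),   # w = z * z -- an arbitrary dependency, not a map/reduce stage
}

print(get(graph, "w"))  # -> 9
```

Because the graph is just a dict, dependencies can form any DAG, which is how Dask goes beyond the rigid two-stage map-reduce shape.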
DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because I'm a user and will benefit.
Your performance is going to be complete and utter crap because you're paying for serialization on every single data element.
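To make the per-element serialization tax concrete, here's a minimal stdlib sketch (my own illustration, not from any particular framework) comparing pickling each element individually against pickling one batch:

```python
import pickle

data = list(range(1000))

# Per-element: every value pays protocol overhead (opcode prefix + STOP byte).
per_element_bytes = sum(len(pickle.dumps(v)) for v in data)

# Batched: one container pickle amortizes that overhead across all elements.
batched_bytes = len(pickle.dumps(data))

print(per_element_bytes, batched_bytes)  # batched is substantially smaller
```

The byte counts are only a proxy, but the same fixed cost shows up in CPU time too, which is why streaming one record at a time through a serializer hurts so much.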
Dask is higher performance and more pythonic: http://matthewrocklin.com/blog/work/2016/02/22/dask-distribu...
However http://discoproject.org/ might be worth a look as a standalone alternative.
It's free with a permissive license and actively growing.
But TBH, after a certain scale you should really be asking whether or not you should be using Python.
I'm equally excited for all the suggestions sure to appear in the comments (hinthint). I got a ton from this thread last time, even though they weren't data analysis specific:
There's also reparse, if you want to parse natural language with regular expressions: https://github.com/andychase/reparse
sorted(files, key=lambda x: int(x[4:]))
, and it will do the right thing.
Although with natsort, you don't have to parse the actual strings yourself.
import re

x = ['foo12901', 'fooo900', 'fooooooo980090']
x = sorted(x, key=lambda s: int(re.search(r'\d+', s).group()))
Datetime handling in Python is a really sad state of affairs. I wince every time I have to do it, especially if I've just used Ruby/Rails recently.
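For anyone who hasn't felt the pain: even a simple parse-and-convert requires an exact format string, and the result is timezone-naive until you manually attach an offset. A small stdlib-only example (the timestamp and offset are made up for illustration):

```python
from datetime import datetime, timezone, timedelta

raw = "2016-03-01 14:30:00"

# strptime needs the format spelled out exactly...
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# ...and the result is naive; you must attach a timezone yourself.
dt_utc = dt.replace(tzinfo=timezone.utc)

# Converting means constructing a fixed offset by hand (stdlib has no
# named-zone database before Python 3.9's zoneinfo).
local = dt_utc.astimezone(timezone(timedelta(hours=-5)))

print(local.isoformat())  # -> '2016-03-01T09:30:00-05:00'
```

Compare that with Rails' `Time.zone.parse` one-liner and the wincing makes sense.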
Whilst this isn't a data analysis library per se, PyOpenCL may be of interest for people doing data analysis work in Python:
"Vincent is essetially frozen for development right now, and has been for quite a while. The features for the currently targeted version of Vega (1.4) work fine, but it will not work with Vega 2.x releases. Regarding a rewrite, I'm honestly not sure if it's worth the time and effort at this point."
Here is an interesting ipython notebook with some examples:
The best demo I've seen for generating a PDF report is on Practical Business Python
Edit: I forgot to mention the new pandas Style feature for generating some impressive-looking HTML tables.