Lesser-Known Python Data Analysis Libraries (jyotiska.github.io)
321 points by jyotiska on April 19, 2016 | hide | past | favorite | 68 comments

My 2 cents: I would not recommend basing any new work on mrjob. As someone who inherited and has been maintaining a bunch of code that depends on it: the library seems barely maintained, VPC support is only partial and poorly documented, the auditing tools stopped working quite a while ago, and tracking the progress/status of EMR jobs is extremely painful (to be fair, that last one is more an issue with Elastic MapReduce than with mrjob itself).

I love the concept and the ease of development, but I can't shake the feeling that the infrastructure is so shaky it almost amounts to instant technical debt (sorry if this offends anyone, I'm just a dumb customer).

It looks like mrjob development has been re-started, but there was a disconcerting period (nearly two years) without a release.[1] I used it for rinky-dink projects, and it seemed fragile at the time, so I can understand your inclination to divest from it.

[1]: https://github.com/Yelp/mrjob/releases

In case anyone's curious, what happened was that Dave (@davidmarin) and I (@irskep), the mrjob maintainers, left Yelp within about a month of each other. (There's no story there, just coincidence.) There was never any momentum with new maintainers, going by the release history.

But now Dave is working on mrjob regularly again, hence the pace of recent improvements.

Grandparent is correct about the second-class support for non-EMR production Hadoop usage. Like any open source project, the code only works well if a major stakeholder invests in improving it. Few non-EMR users spend much time contributing, so the situation doesn't improve.

Hey guys, for what it's worth, mrjob has given us around 3 years of working (if sometimes clunky) EMR, so thanks for that :)

I have the opposite experience with mrjob. Classifying it as an inactive project is demonstrably false. The rest are EMR complaints; I use it on my own Hadoop cluster.

Just read the comment from one of the creators: https://news.ycombinator.com/item?id=11528776

Do you know of any good alternatives? Any way to write MapReduces in python?

It's not quite the same (since it doesn't become a Map-Reduce job) but if you're mostly interested in the programming paradigm/scalability the Python API for Apache Spark might be a good alternative
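For a sense of the shape of the paradigm, here is the classic word count spelled out as explicit map/shuffle/reduce phases in plain Python (an illustration of the pattern only, not PySpark's actual API; in Spark the same job is usually written with flatMap and reduceByKey on an RDD):

```python
from collections import defaultdict

# The map/shuffle/reduce pattern that MapReduce frameworks express,
# spelled out in plain Python on a tiny in-memory "dataset".
lines = ["the quick brown fox", "the lazy dog"]

# map: emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: group values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# reduce: sum the counts for each word
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```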

Yes! Check out dask: http://www.slideshare.net/continuumio

It's free, with a permissive license.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just map/reduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

This is likely the best answer for those who wish to code within the map/reduce paradigm by hand and would prefer to use python.


Your performance is going to be complete and utter crap because you're paying for serialization on every single data element.

Dask is higher performance and more pythonic: http://matthewrocklin.com/blog/work/2016/02/22/dask-distribu...

Luigi does a decent job. It is relatively easy to start with and powerful enough to do almost anything.

I've been using Luigi for a few months, with no complaints. It supports running Python jobs on Hadoop and Spark, but it's not really a MapReduce framework unto itself.

However http://discoproject.org/ might be worth a look as a standalone alternative.

I have used Disco extensively in the past, nothing but good things to say about it. Fast job launch, easy to write, the DFS has been stellar. This was only using Python for job code.

Unfortunately, no. We are slowly moving away to a streaming infrastructure, so I've been mostly trying to "keep it running" until we are done replacing it. Sorry.

Check out dask: http://www.slideshare.net/continuumio

It's free, with a permissive license, and actively growing.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just map/reduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

Andrew Montalenti did a great talk about scaling out Python at Parsely at the last PyData conference: https://www.youtube.com/watch?v=gVBLF0ohcrE

But TBH, after a certain scale you should really be asking whether or not you should be using Python.

My favorite new (to me) tool is snakemake[0]: makefiles with Python 3 support. It allows me to both build my workflow and document it in the same place, which is hugely helpful when jumping around between different projects or needing to rerun a pipeline with new data. If interested, I recommend taking a look at this tutorial[1] with lots of different snakemake patterns.

[0] https://bitbucket.org/snakemake/snakemake/wiki/Home

[1] https://github.com/leipzig/SandwichesWithSnakemake
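For a flavor of what that looks like: a minimal Snakefile (file names here are made up) declares rules with inputs and outputs, and snakemake works out the dependency order, make-style:

```
rule all:
    input:
        "results/summary.txt"

rule summarize:
    input:
        "data/samples.txt"
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```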

I use histogram.py from https://github.com/bitly/data_hacks all the time...

It's really cool. I wish it handled missing values (empty strings); I will submit a PR soon. Until then, here is the issue: https://github.com/bitly/data_hacks/issues/34

Wow. This looks really cool.

Interesting. Looks similar to my version, albeit with somewhat fewer features.


I expect if you mentioned a few of the features yours has that the other doesn't you wouldn't have gotten downvoted.

plotly is a fantastic tool for plotting. It has a python API [0], but also works from R, matlab, and Julia. It also has support for pandas dataframes and jupyter notebook[1], which is by far the fastest way I've found to make attractive plots. plotlyjs[2] is a fantastic wrapper around d3. So I can go all the way from plotting something quickly from a dataframe to building a totally custom chart.

[0] https://plot.ly/python/

[1] https://plot.ly/ipython-notebooks/cufflinks/

[2] https://plot.ly/javascript/

I like plotly as well, but I couldn't stand the Python API (or cufflinks, for that matter), so I created my own wrapper. It's not fully featured, but it handles 90% of the cases I want.


Very nice. I like that each chart method returns the figure, so if you need to do something it doesn't implement, the figure is available to edit.

Thanks, I am happy to accept PRs that expose more functionality.

How does it compare with Bokeh?

I prefer the aesthetic of the defaults in plotly over Bokeh. Also, for most of my tasks I can simply use dataframe.iplot() using the library from [1] above, and I value that simplicity. Lastly, I prefer that plotly is built on top of d3js so I have access to that api if I want to do anything crazy, whereas Bokeh reinvented the wheel a bit with BokehJS.

This is a great list.

I'm equally excited for all the suggestions sure to appear in the comments (hint, hint). I got a ton from this thread last time, even though they weren't data-analysis specific:


If you're looking for a simple data pipeline, there's pipeless: https://github.com/andychase/pipeless

Also reparse if you want to parse natural language with regular expressions: https://github.com/andychase/reparse

A number of good open source data analysis projects by the primary developer of the NumPy package and his company are listed here:


Check out dask for distributed and out-of-core parallel programming: http://www.slideshare.net/continuumio

It's free, with a permissive license.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just map/reduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

I think you might be interested by this talk: https://www.youtube.com/watch?v=gVBLF0ohcrE

Thanks...though the whole "GIL being a feature" is sort of a joke.

I believe he meant it as a joke. At least that's how I interpreted it :)

Another option is Agate (http://agate.readthedocs.org) which comes from the journalism community.

Natsort is a lifesaver when working with filenames numbered by humans (like file1, file2 ... file11); those will be sorted correctly. Beats asking people to "please add leading 0's, and when you suspect you will pass 100, add 2 leading 0's."

I dislike how it changes behavior from release to release. For example, foo-1.2: is that foo 1.2 or foo -1.2? The default depends on the release of natsort, with new routines to restore the previous behavior.

FWIW, the sort method (and the sorted built-in) takes a 'key' argument: a function used to compute the sort key for each element of the sequence. So in your file11 case, you can do:

  sorted(files, key=lambda x: int(x[4:]))

and it will do the right thing.

Although with natsort, you don't have to parse the actual strings yourself.
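Roughly, a natural-sort key boils down to something like this sketch (not natsort's actual implementation): split each string into digit and non-digit runs, and compare the digit runs numerically:

```python
import re

def natural_key(s):
    # Split into alternating non-digit/digit runs; r'(\d+)' keeps the
    # digit runs in the result, and each digit run compares as an int.
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', s)]

print(sorted(['file11', 'file2', 'file1'], key=natural_key))
# ['file1', 'file2', 'file11']
```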

That is a neat trick, but it would be incredibly brittle. Kids, don't try this at home!

Pass in a re.match- or re.search-based function; I'd imagine that would be powerful enough to meet most needs.

  import re

  x = ['foo12901', 'fooo900', 'fooooooo980090']
  x = sorted(x, key=lambda s: int(re.search(r'\d+', s).group()))


+1 this is the right way to build a custom sorting function. The only thing worse than relying on ad-hoc heuristics for processing your data is relying on heuristics that somebody else maintains!

I'll have to check delorean out, I usually use http://crsmithdev.com/arrow/ for python date manipulation. It works a lot like the javascript library moment.

Cool, let me know if you have any issues. Thanks.

I use arrow for all my time-related operations. I tried delorean once (very quickly) and found it was missing several elements I needed (which arrow had). Maybe I did not look closely enough; I will try again and report back if there is interest. Thanks.

dataset (https://dataset.readthedocs.org/en/latest/) is pretty good, and for most scenarios as easy as tinydb, but backed by a real SQL database.

I've used this with a Flask project before, great module and super easy.

Off topic: I really like the minimalistic approach to your blog. In Minion (my default serif font) it looks better and more readable than the majority of webpages out there.

> delorean

Datetime in python is a really sad state of affairs. I wince every time I have to do it - especially if you've just used ruby/rails recently..

tinydb looks like it could be useful, thanks for this.

Whilst this isn't a data analysis library per se, PyOpenCL may be of interest for people doing data analysis work in Python:


Vincent has not been properly maintained in a year, and has been broken since the release of Vega 2.0.

Yes, this recommendation puzzled me. It's essentially a dead project.

"Vincent is essentially frozen for development right now, and has been for quite a while. The features for the currently targeted version of Vega (1.4) work fine, but it will not work with Vega 2.x releases. Regarding a rewrite, I'm honestly not sure if it's worth the time and effort at this point."

I'd also add to this list Pandashells https://github.com/robdmc/pandashells - Basically use Pandas in the command line.

https://github.com/turicas/rows is also worth a mention.

I've used PrettyTable on a few projects and found it to be very easy to use. Highly recommended!

Tabulate is also a good alternative, and more recent: https://pypi.python.org/pypi/tabulate

I would consider Tabulate much superior to PrettyTable.

Looks good! It might be time to update some code ...

I hear a lot of talk about using python for data analysis. I gave up after trying to find a library to do cross tabs. Is there something to make custom tables in python other than prettytables?

Perhaps I should have been more clear. I want to present the results in pdf or html. Like xtables, tables and stargazer packages in R.

I haven't used xtables or stargazer in a while, but ipython + pandas can display tables as html.

Here is an interesting ipython notebook with some examples:


Oooo, I'm going to have to look at that qgrid widget. I've been frustrated when I had to dump a df (DataFrame) to Excel to browse a large df.

You can easily export any pandas DataFrame to html using the to_html() method. To generate a full webpage, you'll probably want a templating engine like Jinja2.

The best demo I've seen for generating a PDF report is on Practical Business Python[1]

Edit: I forgot to mention the new pandas Style[2] feature for generating some impressive-looking html tables.

[1] http://pbpython.com/pdf-reports.html

[2] http://pandas.pydata.org/pandas-docs/stable/style.html
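For the crosstab question upthread: pandas covers that directly with pd.crosstab, and the result renders to HTML like any other DataFrame (a sketch with made-up column names):

```python
import pandas as pd

df = pd.DataFrame({'sex':    ['M', 'F', 'M', 'F'],
                   'handed': ['R', 'R', 'L', 'R']})

# Cross-tabulate the two columns into a frequency table.
table = pd.crosstab(df['sex'], df['handed'])

# Render to an HTML <table>, ready to drop into a page or a Jinja2 template.
html = table.to_html()
```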
