
Lesser-Known Python Data Analysis Libraries - jyotiska
http://jyotiska.github.io/blog/posts/python_libraries.html
======
dguaraglia
My 2 cents: I would not recommend basing any new work on MRjob. As someone who
inherited and has been maintaining a bunch of code that depends on it, the
library seems to be barely maintained, support for VPC is only partial and not
very well documented, the auditing tools stopped working quite a while ago and
tracking the progress/status of EMR jobs is extremely painful (to be fair,
this is more of an issue with Elastic MapReduce than MRJob itself.)

I love the concept and ease of development, but I can't shake the feeling that
the infrastructure is so shaky it almost amounts to instant technical debt
(sorry if this offends anyone, I'm just a dumb customer.)

~~~
zfrenchee
Do you know of any good alternatives? Any way to write MapReduces in python?

~~~
_dark_matter_
[https://hadoop.apache.org/docs/r1.2.1/streaming.html](https://hadoop.apache.org/docs/r1.2.1/streaming.html)

~~~
pvnick
This is likely the best answer for those who wish to code within the
map/reduce paradigm by hand and would prefer to use python.
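
For anyone wondering what that looks like in practice, here is a minimal word-count sketch in the streaming style. Hadoop Streaming passes lines to the mapper on stdin, sorts the mapper's tab-separated key/value output by key, and feeds that to the reducer. The function names and the `run_locally` helper are made up for illustration; the helper only imitates the sort/shuffle step with `sorted()` so the logic can be tried without a cluster.

```python
import itertools


def mapper(lines):
    """Emit one "word<TAB>1" line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Hypothetical stand-in for Hadoop's shuffle: sort mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for result in run_locally(["the cat", "The dog the"]):
        print(result)
```

In a real job, the mapper and reducer would each live in their own script that loops over `sys.stdin` and prints its output, and the two scripts would be passed to the streaming jar via its `-mapper` and `-reducer` options.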

~~~
pwang
BUT WHY

Your performance is going to be complete and utter crap because you're paying
for serialization on every single data element.

Dask is higher performance and more pythonic:
[http://matthewrocklin.com/blog/work/2016/02/22/dask-distribu...](http://matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2)

------
elsherbini
My favorite new (to me) tool is snakemake[0], make files with python 3
support. It allows me to both make my workflow and document it in the same
place, hugely helpful for jumping around to different projects or needing to
rerun a pipeline with new data. If interested, I recommend taking a look at
this tutorial[1] with lots of different snakemake patterns.

[0]
[https://bitbucket.org/snakemake/snakemake/wiki/Home](https://bitbucket.org/snakemake/snakemake/wiki/Home)

[1]
[https://github.com/leipzig/SandwichesWithSnakemake](https://github.com/leipzig/SandwichesWithSnakemake)

------
melor
I use histogram.py from
[https://github.com/bitly/data_hacks](https://github.com/bitly/data_hacks) all
the time...

~~~
philovivero
Interesting. Looks similar to my version, albeit with a bit fewer features.

[https://github.com/philovivero/distribution](https://github.com/philovivero/distribution)

~~~
Bjartr
I expect if you mentioned a few of the features yours has that the other
doesn't you wouldn't have gotten downvoted.

------
elsherbini
plotly is a fantastic tool for plotting. It has a python API [0], but also
works from R, matlab, and Julia. It also has support for pandas dataframes and
jupyter notebook[1], which is by far the fastest way I've found to make
attractive plots. plotlyjs[2] is a fantastic wrapper around d3. So I can go
all the way from plotting something quickly from a dataframe to building a
totally custom chart.

[0] [https://plot.ly/python/](https://plot.ly/python/)

[1] [https://plot.ly/ipython-notebooks/cufflinks/](https://plot.ly/ipython-notebooks/cufflinks/)

[2] [https://plot.ly/javascript/](https://plot.ly/javascript/)

~~~
qacek
I like plotly as well but I couldn't stand the python api nor cufflinks for
that matter so I created my own wrapper. It's not fully featured but it
handles 90% of the cases I want.

[https://github.com/jwkvam/plotlywrapper](https://github.com/jwkvam/plotlywrapper)

~~~
elsherbini
Very nice. I like that each chart method returns the figure, so if I need to
do something you didn't implement, the figure is available to edit.

~~~
qacek
Thanks, I am happy to accept PRs that expose more functionality.

------
forgetsusername
This is a great list.

I'm equally excited for all the suggestions sure to appear in the comments
(hinthint). I got a ton from this thread last time, even though they weren't
data analysis specific:

[https://news.ycombinator.com/item?id=10782969](https://news.ycombinator.com/item?id=10782969)

------
qopp
If you're looking for a simple data pipeline, there's pipeless:
[https://github.com/andychase/pipeless](https://github.com/andychase/pipeless)

Also reparse if you want to parse natural language with regular expressions:
[https://github.com/andychase/reparse](https://github.com/andychase/reparse)

------
nxzero
A number of good open source data analysis projects by the primary developer
of the NumPy package and his company are listed here:

[https://www.continuum.io/open-source-core-modern-software](https://www.continuum.io/open-source-core-modern-software)

------
tanlermin
Check out dask for distributed and out-of-core parallel programming:
[http://www.slideshare.net/continuumio](http://www.slideshare.net/continuumio)

It's free, with a permissive license.

It is also capable of native HDFS integration, YARN, etc., and can express more
complex and granular parallel patterns than just map reduce. It also has an API
for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects
succeed because, as a user, I will benefit.

~~~
gshulegaard
I think you might be interested in this talk:
[https://www.youtube.com/watch?v=gVBLF0ohcrE](https://www.youtube.com/watch?v=gVBLF0ohcrE)

~~~
tanlermin
Thanks...though the whole "GIL being a feature" is sort of a joke.

~~~
gshulegaard
I believe he meant it as a joke. At least that's how I interpreted it :)

------
codenberg
Another option is Agate
([http://agate.readthedocs.org](http://agate.readthedocs.org)) which comes
from the journalism community.

------
teekert
Natsort is a lifesaver when working with filenames numbered by humans (like
file1, file2 ... file11); it sorts them correctly. Beats asking people
to "Please add leading 0's oh and when you suspect you will pass 100, add 2
leading 0's."

~~~
herge
FWIW, the sort method (and the sorted built-in) take a 'key' argument, where
you can pass a function that computes the key to sort the sequence by. So in
your file11 case, you can do:

sorted(files, key=lambda x: int(x[4:]))

, and it will do the right thing.

Although with natsort, you don't have to parse the actual strings yourself.

~~~
daveguy
That is a neat trick, but it would be incredibly brittle. Kids, don't try this
at home!

~~~
solaxun
Pass in an re.match or re.search based function, I would imagine that would be
powerful enough to meet most needs.

import re

x = ['foo12901', 'fooo900', 'fooooooo980090']

x = sorted(x, key=lambda s: int(re.search(r'\d+', s).group()))

print(x)
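
The regex idea can be taken a step further so that names with no digits, or with several runs of digits, still sort sensibly. This is only a rough sketch using the standard library, not how natsort itself is implemented, and `natural_key` is a name made up for the example. It splits each string into alternating text and number chunks and compares the number chunks as integers.

```python
import re


def natural_key(name):
    """Build a sort key of alternating text and integer chunks.

    re.split with a capturing group returns text chunks at even indices
    and the captured digit runs at odd indices, so only the odd ones
    are converted to int.
    """
    parts = re.split(r"(\d+)", name)
    return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]


files = ["file11", "file2", "file1", "notes", "file100"]
print(sorted(files, key=natural_key))
```

Because every key alternates text, number, text in the same positions, Python never ends up comparing a string with an integer. natsort covers more cases than this, so the sketch is mainly useful when adding a dependency is not an option.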

~~~
shoyer
+1 this is the right way to build a custom sorting function. The only thing
worse than relying on ad-hoc heuristics for processing your data is relying on
heuristics that somebody else maintains!

------
c17r
I'll have to check delorean out, I usually use
[http://crsmithdev.com/arrow/](http://crsmithdev.com/arrow/) for python date
manipulation. It works a lot like the javascript library moment.

~~~
googletron
cool, let me know if you have any issues. sanx.

~~~
BrandoElFollito
I use arrow for all my time-related operations. I tried delorean once (very
quickly) and found out it was missing several elements I needed (which arrow
had). Maybe I did not look closely enough, I will try again and be back if
there is interest. Thanks.

------
r0muald
dataset
[https://dataset.readthedocs.org/en/latest/](https://dataset.readthedocs.org/en/latest/)
is pretty good and for most scenarios as easy as tinydb, but backed by a real
SQL database.

~~~
xemoka
I've used this with a Flask project before, great module and super easy.

------
Ruud-v-A
Off topic: I really like the minimalistic approach to your blog. In Minion (my
default serif font) it looks better and more readable than the majority of
webpages out there.

------
forgotpwtomain
> delorean

Datetime in python is a really sad state of affairs. I wince every time I have
to do it - especially if you've just used ruby/rails recently..

------
ZenoArrow
tinydb looks like it could be useful, thanks for this.

Whilst this isn't a data analysis library per se, PyOpenCL may be of interest
for people doing data analysis work in Python:

[https://mathema.tician.de/software/pyopencl/](https://mathema.tician.de/software/pyopencl/)

------
jbrambleDC
Vincent has not been properly maintained in a year, and it has been broken
since the release of Vega 2.0.

~~~
Semiapies
Yes, this recommendation puzzled me. It's essentially a dead project.

"Vincent is essentially frozen for development right now, and has been for
quite a while. The features for the currently targeted version of Vega (1.4)
work fine, but it will not work with Vega 2.x releases. Regarding a rewrite,
I'm honestly not sure if it's worth the time and effort at this point."

~~~
merlincorey
The new project is here:
[https://github.com/uwdata/ipython-vega-lite](https://github.com/uwdata/ipython-vega-lite)

------
micah_chatt
I'd also add to this list Pandashells
[https://github.com/robdmc/pandashells](https://github.com/robdmc/pandashells)
- Basically, use Pandas on the command line.

------
diegosouza
[https://github.com/turicas/rows](https://github.com/turicas/rows) is also
worth a mention.

------
metaobject
I've used PrettyTable on a few projects and found it to be very easy to use.
Highly recommended!

~~~
jventura
Tabulate is also a good alternative, and more recent:
[https://pypi.python.org/pypi/tabulate](https://pypi.python.org/pypi/tabulate)

~~~
azag0
I would consider Tabulate much superior to PrettyTable.

------
TheLogothete
I hear a lot of talk about using python for data analysis. I gave up after
trying to find a library to do cross tabs. Is there something to make custom
tables in python other than prettytables?

~~~
jboynyc
What's wrong with using Pandas?
[http://pandas.pydata.org/pandas-docs/version/0.17.0/generate...](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html)

~~~
TheLogothete
Perhaps I should have been more clear. I want to present the results in pdf or
html. Like xtables, tables and stargazer packages in R.

~~~
closed
I haven't used xtables or stargazer in a while, but ipython + pandas can
display tables as html.

Here is an interesting ipython notebook with some examples:

[http://nbviewer.jupyter.org/gist/chris1610/f2f4a2e9181f6ec22...](http://nbviewer.jupyter.org/gist/chris1610/f2f4a2e9181f6ec22e88)

~~~
Notre1
Oooo, I'm going to have to look at that qgrid widget. I've been frustrated
when I had to dump a df (DataFrame) to Excel to browse a large df.

