
Teaching Pandas and Jupyter to Northwestern journalism students - palewire
https://www.californiacivicdata.org/2017/06/07/dc-python-notebook/
======
aldanor
So many people don't realize pandas can be horribly slow if you use it
"wrong" -- i.e., for computations that don't vectorize in the way that's
native to pandas. Also, working with dataframes that contain millions of rows
is like playing Russian roulette: there are usually many ways to do the same
thing in pandas, and if you guess correctly you'll wait a minute or two until
the computation's done, but if you guess wrong it'll run out of RAM, segfault,
or never finish.
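The "many ways, wildly different costs" point shows up even in a trivial row-wise sum; this toy frame (names and sizes are mine, purely illustrative) computes the same thing two ways:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the point is the two spellings of the same computation.
df = pd.DataFrame({"a": np.random.rand(100_000),
                   "b": np.random.rand(100_000)})

# Slow path: .apply with axis=1 calls a Python function once per row.
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Fast path: one vectorized operation over whole columns.
fast = df["a"] + df["b"]
```

The two results are identical, but the vectorized spelling is typically orders of magnitude faster; at millions of rows the gap is the difference between seconds and never finishing.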

For big datasets, I stopped using pandas myself a few years back for anything
other than printing dataframes, datetime-indexed series, doing quick plots, or
working with tiny/toy datasets -- in favor of numpy structured/record arrays.
It's kind of the same thing, without all the groupby/index fluff, but very
fast.

Just last week I helped a colleague speed up her code (a numerical solver for
financial data) by more than 100x; the biggest part of it was ditching pandas
entirely and using numpy.

~~~
Declanomous
So I've been learning Pandas after mostly using standard Python, R, or VB to
do our analysis, and I'm glad I read this because I thought I was going crazy.

I have a data set of about 4 million rows that I routinely analyze. I have 32
GB of memory on my desktop, and the only time I've really run out is when I
write incredibly poor code. Yet in the short while I've been trying to use
Pandas, I've run out of memory and been killed by the OOM killer, or
completely frozen my system for half an hour, while processing what I thought
were simple operations.

I was honestly beginning to believe I was way worse at programming than I
thought, due to all of the issues I was having. I wasn't even doing anything
particularly complex; I was just loading a dataframe from a SQL query and
playing around with basic manipulation.

------
farnsworth

      But pandas’ magical simplicity makes things like computed columns immediately intuitive:
      > data['% of total'] = data.amount / data.amount.sum()
    

Is that immediately intuitive? I'm staring at this trying to understand what
it's doing. Is the / operator overloaded? Is data.amount one particular
amount, and data.amount.sum() the sum of all amounts? Why does the
"computed column" property go on the same data object as the actual data?
Maybe it's immediately intuitive if you've used pandas.
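For what it's worth, the answer to both questions is yes: data.amount is the whole column (a Series), .sum() collapses it to a scalar, and dividing a Series by a scalar broadcasts element-wise. A toy sketch:

```python
import pandas as pd

data = pd.DataFrame({"amount": [10.0, 30.0, 60.0]})

# Series / scalar: the division broadcasts over every element,
# producing a new column aligned with the original rows.
data["% of total"] = data["amount"] / data["amount"].sum()
```

The new column lives on the same object because a DataFrame is a dict-like bag of aligned columns; assigning to a new key just adds one more.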

~~~
kinkrtyavimoodh
OTOH I think it's immediately intuitive if you are not a programmer. :)

When you, a programmer, see amount / sum, you think about how a list can be
divided by what appears to be a scalar.

When non-programmers see it, they parse it as what they naturally understand
a percentage to mean. And all is well.

~~~
david_eads
Exactly this. I'm the author of the post and was a programmer by trade for a
long time before I became a journalist. I _don't_ actually find this more
intuitive than more explicit and fundamental programming techniques. But my
students grokked it immediately, whereas even simple structures like loops
seem to be harder for them to get their heads around.

Given that I had ten weeks to cram in a lot of material but did want to show
them some amount of programming, this worked pretty nicely.

~~~
gravypod
I've been very troubled coming to this stuff as a programmer. I'm having the
same instant dissatisfied response that your students are having with looping
structures.

I've recently started working on some projects where I need to do a lot of
data visualization, storytelling, and investigation "into the data". As a
programmer, getting into this stuff is far worse than I expected. Nothing
works the way I think it should. My biggest problem is that I'm thinking
like a programmer, not like a mathematician. I expect objects, segregation or
elimination of state, application and reduction, reusability, and algorithms.

Are there any good frameworks that allow for processing, caching, data
visualization (layout -> data population -> rendering), then exporting to some
format (PNG/PDF/TeX)?

What follows, below this line, is my griping about the things that have
bothered me. Be warned if you don't like rambling and complaining.

-------

Pandas, one of the biggest "offenders", is trying to be an in-memory database
with only one table, but it ends up having far fewer features and a far
clunkier interface (want to do a simple map/reduce? Welcome to chaining a
strange combination of '.loc', '&', and ':,' "operators"). Matplotlib is
unintuitive and poorly documented for anyone who isn't a mathematician
(.plot(lons, lats, latlons=True) is considered correct). Dealing with
anything more than 100,000 data points is a pain to iterate on. There is
state everywhere it shouldn't be (matplotlib.pyplot).

While I've been working on this project, each iteration I probably spend an
hour or two getting the data out of a format that doesn't make sense from a
programmer's perspective, another 5 to 10 minutes writing an
application/reduction, and then another hour converting back into the strange
data formats that matplotlib will take -- all the while re-running expensive
computations and waiting, because I have no good persistence layer for my
project.
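On the missing persistence layer: a crude stdlib-only disk memoizer covers a surprising amount of this (every name here is mine; joblib's Memory class is the polished version of the same idea):

```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path("analysis_cache")

def cached(fn):
    """Memoize an expensive computation to disk, keyed on function name + args."""
    def wrapper(*args):
        CACHE_DIR.mkdir(exist_ok=True)
        key = hashlib.sha1(pickle.dumps((fn.__name__, args))).hexdigest()
        path = CACHE_DIR / (key + ".pkl")
        if path.exists():
            return pickle.loads(path.read_bytes())
        result = fn(*args)
        path.write_bytes(pickle.dumps(result))
        return result
    return wrapper

@cached
def expensive_model_run(n):
    # Stand-in for a slow numerical computation.
    return sum(i * i for i in range(n))
```

Re-running the notebook then replays cached results from disk instead of recomputing them.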

There are things that are common in this community that I'd never dream of
doing. What follows is a list of them.

1\. Functions with 20-40 arguments are the norm for some reason. They also
love to throw in a few insane defaults, undocumented options, and even magic
flags (not enums).

Something as simple as "draw a line, connect the dots" means you need to know
5 to 7 arguments of a massive function. In C/Java, when I need some flags they
probably look like this:

    
    
        some_operation(some_data, DO_A | DO_C | DO_Z)
    

Or, if someone was feeling really nice and defined an enum & used varargs, it
looks more like this:

    
    
        some_operation(some_data, SomeOperationFeatures.DO_A, SomeOperationFeatures.DO_C, SomeOperationFeatures.DO_Z)
    

All of these have appropriate documentation. My IDE plays nice and can
complete these things. My compiler likes it and can typecheck them. I like it
because I know all of my options are available (SomeOperationFeatures.*).

With matplotlib you have things like `linestyle=""`. You have to go to a
webpage, look through the docs, and figure out what you want. It's worth
reading the docs [1] if you never have. This could very easily have been
LineStyle.DOTTED, LineStyle.DASHED, LineStyle.BLANK. IDEs would have played
nice. The 3.6 runtime's typechecking would have played nice. You would be able
to see what your options are (LineStyle.*).
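The enum wrapper being described is easy enough to build yourself; a sketch (the LineStyle class and plot_line function are mine, not matplotlib's, though the string values are matplotlib's real linestyle codes):

```python
from enum import Enum

class LineStyle(Enum):
    # Values are matplotlib's real linestyle string codes.
    SOLID = "-"
    DASHED = "--"
    DASHDOT = "-."
    DOTTED = ":"
    BLANK = ""

def plot_line(ax, x, y, style=LineStyle.SOLID):
    # Thin wrapper so an IDE can complete LineStyle.<Tab> and a type
    # checker can reject a bare string.
    return ax.plot(x, y, linestyle=style.value)
```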

2\. Non-standard treatment of Python-isms

Pandas, for some reason, cannot stick to Python-isms. I can't do simple
things like...

    
    
        if not df: # Check if DF is empty
            return ...
    
        for row in df: # Iterate through the rows of a DF
            row.date = datetime(row.year, row.month, row.day, ...) # Create a new column in the row based on the row's data.
    
        subset = [a for a in df if some_condition(a)] # Do simple filtering
    

Pandas also implements its own versions of standard Python objects! You need
to know, and go back and forth between, two ways of doing things.
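For the record, each of those three operations does have a pandas spelling; they're just different from the Python ones, which is arguably the point. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"year": [2017, 2017], "month": [6, 7], "day": [7, 1]})

# Emptiness: the truth value of a DataFrame is deliberately ambiguous,
# so pandas wants .empty instead of `if not df`.
assert not df.empty

# Row-derived column: built as a whole column at once, not row by row.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Filtering: a boolean mask instead of a list comprehension.
subset = df[df["month"] == 6]
```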

3\. All these libraries separate logically grouped concepts.

Let's say I have time series data from 10 sensors.

    
    
        class SomeMagicalSample:
            def __init__(self, a, b, c, d, ..., occurred):
                self.a = a
                ...
                self.occurred = occurred
    
    

With this code I can generate very complex filtering, combinations, and
whatnot. Things like extracting "real" meaning from measured values become
easy to express.

    
    
        def get_magical_scalar(self): return ... some interpolation ...
    
        def is_some_magical_type(self): return ... some check ...
    
    

Now I can use my already tried and true reduction and application.

    
    
        sum(map(SomeMagicalSample.get_magical_scalar,
                filter(SomeMagicalSample.is_some_magical_type, samples)))
    

Pandas, matplotlib, numpy, scipy and the lot are designed to make me avoid
this style of organization. I'm instead forced to do something like this.

    
    
        a = [...]
        b = [...]
        c = [...]
        d = [...]
        ....
        occurred = [...]
    

Then I have to jump through hoops to keep all of this data in the same order
and to shift it around together.
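To be fair to the libraries, the row-alignment problem is exactly what a DataFrame (or a numpy structured array) is meant to solve: the parallel lists become columns of one object that moves together. A sketch using the hypothetical fields above:

```python
import pandas as pd

# Hypothetical sensor fields from the example above.
df = pd.DataFrame({"a": [3.1, 1.2, 2.7],
                   "b": [0.5, 0.9, 0.1],
                   "occurred": [20, 10, 30]})

# Sorting reorders whole rows, so the columns cannot fall out of sync.
df = df.sort_values("occurred")
```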

4\. Because everything is a meaningless list of numbers, there is no way to
reuse code.

Most of the code I have written to show a single value over time, or to pull
some data out of other data and visualize it, is never going to be used
again. Unless I want to look at this exact same thing, this code will not be
useful. If there were some way to pass objects around, hide the internals,
and process them independently of their meaning, that would not be the case.

The one case where this was not true in the past few days was when I rendered
a model's prediction into a pcolormesh and drew it onto a basemap. By passing
the model a basemap, it will automatically find the region to generate data
for. This was an undocumented feature that I had to read the source of
basemap to discover (pulling the top-left and bottom-right lat/lons from a
basemap regardless of projection).

Maybe these warts only hurt for a little while? Do they go away? Are there
alternatives that can handle >10 million data points? I don't have a good
analysis framework set up for the work I'm doing. Maybe that is the issue; I
don't even know what a good analysis framework would look like.

[1] -
[https://matplotlib.org/api/lines_api.html#matplotlib.lines.L...](https://matplotlib.org/api/lines_api.html#matplotlib.lines.Line2D.set_linestyle)

~~~
bigger_cheese
> "Are there any good frameworks that allow for processing, caching, data
visualization (layout -> data population -> rendering), then exporting to
some format (PNG/PDF/TeX)?"

I use SAS for this in my day job. It's not a free program, but it is powerful
for this type of stuff.

I typically use SQL queries (via SAS's proc sql command) to manipulate and
process my data, but you can also programmatically manipulate your data sets
using SAS's "datastep" language.

SAS has support for macro expansion, which makes some of your examples (like
manipulating 10 sensors at once) pretty trivial. But this is getting into
programming-language territory; I would not expect someone new to or
unfamiliar with programming to grasp all of this intuitively.

edit: Here's some code I have in production that counts how many (of 8)
sensors are reading high in a given time frame.

    array aads (*) TP_AD1_TOP_STACK_TC1 -- TP_AD1_TOP_STACK_TC8;
    NO_AD1_TEMPERATURES_HIGH = 0;
    do j = 1 to dim(aads);
        if aads(j) gt 160 then NO_AD1_TEMPERATURES_HIGH = NO_AD1_TEMPERATURES_HIGH + 1;
    end;
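For readers on the Python side of the thread, a rough pandas equivalent of that count (toy data; the column names just mirror the SAS array):

```python
import pandas as pd

# Toy data standing in for the 8 sensor columns.
sensor_cols = ["TP_AD1_TOP_STACK_TC%d" % i for i in range(1, 9)]
df = pd.DataFrame({col: [150, 165, 170] for col in sensor_cols})

# Per row, count how many sensors read above 160.
df["NO_AD1_TEMPERATURES_HIGH"] = (df[sensor_cols] > 160).sum(axis=1)
```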

The downside is that SAS is a commercial package and is not free. I have
heard a lot of good things about R, which is supposedly quite similar, but I
have not had the opportunity to use it myself.

~~~
gravypod
I'd like to keep my analysis systems as "inclusive" as possible. I'd just use
my internal SQL server and fall back to Python for my processing if I didn't
care about sharing my work.

SAS looks good, though. I've looked at it many times, and it is a clean
solution if you really are playing in the big leagues.

~~~
bigger_cheese
Yeah, that's a good point about separating the analysis from the database.

My work is going in the opposite direction, unfortunately. We are starting to
use Hadoop, which makes it quite difficult to do things "outside of the
database" -- there is just too much data to work with locally.

------
tkt
For installing Jupyter, Anaconda works well across all platforms, even on
most slightly older OSes.

[http://jupyter.readthedocs.io/en/latest/install.html](http://jupyter.readthedocs.io/en/latest/install.html)

It works better for people to install Jupyter with Anaconda rather than use
virtual environments, because there isn't the overhead of also having to
learn about virtual environments. People tend to think of virtualenvs as
associated only with the class and don't use them much for their own work
outside of the workshop or course.

------
flyaway
I spend about 8 months of the year teaching pandas to journalism students, and
it's a wild ride! Despite some of the iffy syntax and pandas' seeming
inability to standardize parameter names, the students seem to grok the
workflow much more quickly than wrangling lists and dictionaries in the
"normal" world of Python.

I know everyone loves the reproducibility Notebooks supposedly bring to the
table, but without a doubt my favorite part is the ability to export super-
unattractive matplotlib charts as PDF, clean them up in Illustrator, and
suddenly find yourself with publication-quality graphics. Knowing you're
producing something more than just some numbers to toss in a story can be a
strong sell to a lot of folks.

------
thearn4
I really like Jupyter, but somehow I'm not in love with it. Like, every time I
fire it up to use it for quick data analysis, I seem to inevitably end up back
in sublime + bash, sending plots to disk. Am I the odd one out?

~~~
has2k1
If you know what kind of short analysis you want to do, the benefits of
Jupyter are not obvious. If you have to do a lot of exploration, and do longer
analyses then it becomes indispensable.

~~~
david_eads
It's also really valuable for sharing. At NPR, I did an analysis of Trump's
tweets that was used in a digital post and Morning Edition piece. The notebook
was easy to share with the reporter, editor, and readers and accessible enough
for them to understand ([https://github.com/nprapps/trump-tweet-
analysis/blob/master/...](https://github.com/nprapps/trump-tweet-
analysis/blob/master/trump-tweets.ipynb)).

------
bsder
It is hard to overstate just how ferociously bad the experience of getting
Jupyter from blank computer to the equivalent of "Hello world" actually is.

~~~
rjeli
I have a strategy that works pretty consistently: close your eyes, ignore the
best practices like Anaconda, Python 3, and virtualenv (or venv in py3... oh
wait, it's a module?), and just install Python 2.7 with pip into the default
locations (I even run pip with sudo, the horror). It works really well! I run
all sorts of CV, ML, and deep learning notebooks with no problems.

~~~
radarsat1
I agree; I never use virtualenv. I might if I were building a production
system, but for my own laptop I feel perfectly capable of
remembering/tracking/checking what is in my ~/.local. (I always install with
`--user`.)

If I really need to containerize something, I use Docker.

------
jastr
I've found that most of the queries journalists are trying to run are pretty
basic, mostly filtering and histograms. Setting up a virtualenv,
dependencies, etc. can be tough, and RTFM isn't sufficient for someone
getting started. I was surprised that nothing existed for this, so I built
it.

It has the basics of a Jupyter notebook -- filter, sum, average, plot. So far
it's attracted a pretty interesting audience, including journalists but also
lawyers and consultants.

www.CSVExplorer.com

------
farnsworth
Side note: I googled "pandas" and got a lot of results related to the Python
library and very few related to the large mammal. Bing doesn't give me any
related to the Python library. Google knows me too well.

------
koolhead17
Excellent share.

