A Gentle Visual Intro to Data Analysis in Python Using Pandas (jalammar.github.io)
195 points by jalammar on Nov 1, 2018 | 51 comments

A very gentle intro, indeed ;)

Pandas is such a vast monster that even after going through the book by the original author of Pandas (https://www.safaribooksonline.com/library/view/python-for-da...), I was absolutely unprepared for doing real analysis.

Whilst I understood the basics, such as data loading, (simple) cleaning, selections, functions, groupby, indexes etc., I spent most of my time on Stack Overflow looking for solutions to the actual problems I was facing. I reckon that many other users have had the same experience - there is lots of general info out there when it comes to pandas, but every dataset is different and the devil lies in the details. Long story short: learning pandas is all about trial and error, and it will take months (years even) to become efficient in it.

As a daily user of pandas for a few years now, I really must suggest that anyone looking to use it for serious data analysis familiarize themselves with the Split/Apply/Combine paradigm [0].

Lots of data munging has been enabled or sped up by judicious application of those concepts.

[0] https://pandas.pydata.org/pandas-docs/stable/groupby.html
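A minimal split/apply/combine sketch, with made-up sales data: split the frame into groups, apply an aggregation to each, and combine the results back into one object.

```python
import pandas as pd

# Hypothetical sales data, just to illustrate the paradigm.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "sales": [100, 150, 200, 80],
})

# Split by region, apply sum to each group, combine into a Series.
totals = df.groupby("region")["sales"].sum()
print(totals)
```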

I agree. Hadley Wickham (a very prolific author of important R libraries) wrote a great paper about this method using one of his libraries. I'm a Python + pandas user, but his paper really helped me understand the approach better: https://vita.had.co.nz/papers/plyr.pdf

It might take months to be able to use its full power and be efficient (delivering at a high velocity), but as someone who was proficient in Python but had never used pandas, it took me just over a week to write a process to clean data and produce graphs (with seaborn) comparing sets with boxplots. This includes anonymizing the data properly and playing with different Tukey's fence values so the graphs make the most sense. This is after people spent weeks, and failed, trying to get a similar process working in Excel.

It's good to warn people, but let's not scare them.

Serious question: I've tried to use Pandas for some data analysis for my small business. Data sets are on the order of 10,000 data points or less. After struggling for days with Pandas, I've begun to wonder if it wouldn't be easier to code the analyses in raw Python. I wouldn't mind taking longer to complete the task at hand if in the process I was acquiring skills that will pay off down the road, but I wonder if Pandas isn't so esoteric and difficult that I may never reach the point that I can cash in that investment of time and effort as long as I am only a casual user.

In contrast, while I'm not an expert in JS or Python, I find that time spent struggling with those technologies pays dividends since the lessons learned make everything I do in the future easier.

This is highly subjective of course, but in your opinion, should I keep fighting with Pandas? Is it worth it?

Make sure you learn what a Series is and how it relates to the things in the DataFrame and how selection works, specifically .loc and .iloc. Then your life will be much easier. Try starting with this article: https://medium.com/dunder-data/selecting-subsets-of-data-in-...
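A tiny sketch of the distinction, with a toy frame: a single column is a Series, .loc looks things up by label, and .iloc by position.

```python
import pandas as pd

# Toy frame with a non-default (string) index.
df = pd.DataFrame({"price": [10, 20, 30]}, index=["a", "b", "c"])

# Selecting one column gives a Series, not a DataFrame.
col = df["price"]
assert isinstance(col, pd.Series)

# .loc is label-based; .iloc is integer-position-based.
by_label = df.loc["b", "price"]
by_position = df.iloc[1]["price"]
```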

10,000 data points is well within the range of what Excel can handle without needing PowerQuery. What type of analyses are you attempting?

Linear optimization and generating some simple graphs. I would like to be able to at least generate the graphs automatically from my database.

> Linear optimization

Did you mean linear regression? Linear optimization isn't a use case that Pandas covers, but there are other tools that I can recommend.

Yes, that's what I meant.
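For a quick linear regression without extra dependencies, numpy's polyfit is enough (hypothetical data below; statsmodels or scikit-learn are the usual next steps for anything serious):

```python
import numpy as np
import pandas as pd

# Hypothetical data lying exactly on y = 2x + 1.
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                   "y": [1.0, 3.0, 5.0, 7.0]})

# Degree-1 polynomial fit returns [slope, intercept].
slope, intercept = np.polyfit(df["x"], df["y"], 1)
```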

I would recommend you "keep fighting with Pandas". Many of its features seem confusing at first, but later on you see their value.

I think some of the problem is that people who use pandas don't necessarily know how to drop one level in programming. Like, if the file isn't a nicely formatted csv, they don't know how to read and parse the file directly. If they can't use basic filtering or a boolean mask easily with pandas, they don't know how to use lists, loops, and conditionals directly. It's great to use pandas rather than reinventing the wheel, pandas is an excellent library, but if you're going to deal with data at a very intricate level, you do need to know how and when to punt and just write the code yourself.
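For instance, a boolean mask in pandas and a plain list comprehension express the same filter - knowing both means you can drop down a level when pandas gets in the way:

```python
import pandas as pd

values = [3, 8, 1, 9, 4]

# The pandas way: a boolean mask over a Series.
s = pd.Series(values)
big = s[s > 4].tolist()

# One level down: a plain comprehension, no library needed.
big_plain = [v for v in values if v > 4]

assert big == big_plain
```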

I think this comes up particularly in the context of pandas, because it's a common entry point into a programming language for people who don't think of themselves as programmers and may resist the notion that this is actually what they're doing.

I still see myself going to Spark and Spark SQL for some tasks, like stratified sampling, which I haven't been able to do properly with pandas. Somehow the Spark DF API feels more intuitive, and I was able to figure out a lot by myself.

That's interesting, as the Spark API was inspired by Pandas.

Can you use groupby for your stratified sampling work?
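For what it's worth, newer pandas (1.1+) can do a basic stratified sample directly with DataFrameGroupBy.sample - a sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame with a stratum column: 10 rows per stratum.
df = pd.DataFrame({
    "stratum": ["a"] * 10 + ["b"] * 10,
    "value": range(20),
})

# Sample 50% within each stratum (requires pandas >= 1.1).
sample = df.groupby("stratum").sample(frac=0.5, random_state=0)
```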

How does Pandas compare to R's Tidyverse?

Tidyverse was super easy to pick up, and I can do almost anything I want with it. Why would I want to switch to Pandas?

Has anyone tried the Python tidyverse port? How does it compare to the original?

Echoing other comments, Tidyverse is somewhat more coherent (aided significantly by magrittr's %>% operator). Beginners might get tripped up by Non-Standard Evaluation (NSE), which is a little unintuitive, but there are packages to help with that.

The Pandas API is a generalized solution to complicated, variegated use cases, and its syntax reflects that (it was also hemmed in by the strictures of Python). There are several indexing methods, several ways to slice, several ways to do applies, all of which behave slightly differently. Even expert Pandas users have trouble remembering the syntax for all of these, so they typically keep a Pandas API browser window open or a printed cheat sheet pinned to a corkboard. Pandas definitely takes longer to get used to than Tidyverse, but the payoff is that you get to use Python, which is a somewhat "deeper" language than R.

R is great for interactive work, and for data munging jobs that don't interact too much with non-R libraries. However, Python is simply more versatile end-to-end.

I used to start my interactive analysis in R and port to Python for production, but these days I start in Python straight away so there's no impedance mismatch. I've personally found that writing production code in Python (and by extension Pandas) to be much more pleasant than in R, even with Tidyverse.

The Tidyverse is more coherent and is generally bigger than what’s just in Pandas (R’s Tidyverse; I haven’t used the Python port).

If you already have a good grasp of Python, sure why not learn Pandas too? In my case, I’m reasonably ambidextrous in Python and R but find myself not reaching for Python unless there are colleague / deployment considerations that remove R as an option. The reason? R’s Tidyverse is pretty awesome, and reflects one of the better parts of the R language, namely the meta programming that is a holdover from Scheme’s influence on R.

Now, if you don’t already know Python and don’t have some other reason (such as specific deployment considerations or a team of Python collaborators) to learn? I don’t think so. Python is a fine language, just as R is a fine language. You’re already getting things done in R.

If you want a mental challenge, or to get in on the ground floor of something that might be the future, learn Julia, or F#, or (my favorite) Racket. Or heck, learn Spark, or a new modeling method.

Pandas' syntax and conventions are significantly more cumbersome than R, but it does pretty well given the Python syntax and convention that it has to work with. I haven't done a lot with pandas because of how difficult it is to remember the syntax and API, but I feel it's good enough that if you're already a Python user, you can stick to doing your data work in pandas rather than move over to R.

I haven't used tidyverse myself, but I know that pandas is heavily influenced and inspired by R. Most analysis tasks are doable in both platforms. If later stages of your pipeline involve deep learning (or machine learning, generally), then it could pay to be in the python universe given the wide adoption of python ML/DL tools. I generally wouldn't advise switching unless you have a certain pain point, though.

> pandas is heavily influenced and inspired by R.

Is it? How so?

I use both: Python/Pandas for working with production code and pipelining TensorFlow/Keras code, and R/tidyverse/ggplot2 for ad hoc data reports and visualizations. They both have their advantages and disadvantages and it doesn't hurt to know both workflows.

I find pandas far easier to actually program with, whereas the tidyverse is better for quick one-off scripts. The tidyverse, with its obsession with non-standard evaluation, makes writing functions more difficult than it should be, and readability goes out the window when using tidyeval.

Neural net universe is in Python and you can use Python to build production pipelines.

Pandas is inspired by R's dataframes, which I'm told are native.

Native doesn't necessarily mean it's the best option. (tidyverse/dplyr leverages Rcpp for data transformation, which makes it a lot faster at common ETL tasks)

Hello HN, author here. If you've ever wanted to get into data analysis, this is my best attempt at getting you past that first hump. A lot of these concepts are easier than you might think.

Thank you very much. I would certainly love to read more and more about Pandas (or anything) written in this style, going deeper into the subject.

Are you going to write more? Can you (or anybody) recommend where (a book, a YouTube channel, a website or whatever) I should continue from the point where your intro ends? As for now, all I use of Pandas is a datetime-indexed array of real numbers + simple vector operations on its columns, but I feel like I would like to take a learning/career path to becoming a Pandas expert.

You may want to check this out:

> Short hands-on challenges to perfect your data manipulation skills


Also this:

> Things in Pandas I Wish I'd Known Earlier


Hey there, I'm involved with Dataquest and we have a Pandas and NumPy fundamentals course where we dive into more intermediate concepts like vectorization, key data structures, and the key functions.


We use a similar approach to the OP. Lots of diagrams and visual aids and you always work with a real dataset.

A reasonable next step would be to pick up a dataset that interests you (in a domain you're comfortable with) and explore it with pandas. Kaggle has a bunch of data sets (https://www.kaggle.com/datasets) in various domains. You can look at the "Kernels" where other users often use pandas to uncover insights and show you their process.

Thanks for the kind words!

This is a perfect out-of-tutor-session reference for my novice data analysis pupils. Will be sharing with them later today. Thank you!

Thanks for the tutorial, it's really what I was looking for to give to some friends.

Very pleasant to read. I will pass this to my wife who is an accounting professor trying to break ground into using Python/Pandas/Numpy instead of Stata.

I really enjoyed your style of writing and use of visual examples. I wish for such an explainer for SQL. If you made that into a book, you could run off with my money.

Thanks for this. This is very useful for a beginner.

Love this intro! All of the popular dataframe-oriented tools (tidyverse, pandas, etc.) require familiarity with vectorization and the related mental models. I'm involved with Dataquest Labs and we teach data science interactively in the browser. We're pretty big believers in using diagrams and visual aids to help people learn these concepts as well.

We've had a pandas course (https://www.dataquest.io/course/pandas-fundamentals) for a while and we just launched some R courses that teach a lot of vectorization (https://www.dataquest.io/path/data-analyst-r).

What kind of syntax is var['string'] in these examples?

Haven't really used Python for anything and I'm just wondering, since it looks like an array or map, but it clearly seems to have some logic behind it, as it seems to reference the specified column at each row. What is this functionality - something built into Python, or the use of some sort of magic functions?

var['string'] means get the item of object var with key 'string'. Any Python class can define the magic method __getitem__ to define the behavior of/overload the [] operator.

Thanks! Didn't know Python had magic functions, so I was really confused.

It's key-lookup syntax, like for a dict (map or hashmap in other languages). A Pandas DataFrame can be thought of as a dict of columns.
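A minimal sketch of how any class can overload [] the same way, next to the real pandas behaviour:

```python
import pandas as pd

# Any class can define __getitem__ to overload the [] operator.
class Columns:
    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        return self._data[key]

c = Columns({"name": ["ada", "bob"]})
names = c["name"]

# pandas does the same: df["name"] returns that column as a Series.
df = pd.DataFrame({"name": ["ada", "bob"], "age": [36, 41]})
col = list(df["name"])
```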

Note to author:

> We can select one or multiple rows using their numbers (inclusive of both bounding row numbers):

> df[1:3]

That will slice beginning from the row with integer location 1 up to 3, exclusive of the last element. So, just two rows, not three as shown.

Thanks for the heads up! It's df.loc[1:3] that would return three rows; straight-up df[1:3] indeed returns two rows.

Edit: Corrected in the post. Thanks again!

Genuine question, did you run the original before publishing it on the website?

But why design it that way? Seems a sure way to confuse new users.

I'd say that's because df[start:stop] mimics python's builtin list slicing, which includes start but excludes end, so this dictates the behaviour of indexing without .loc. By contrast, df.loc[start:stop] is a label-based indexer, and labels can be anything (integers, strings, datetimes, categories, etc.), so it doesn't always make sense to exclude the right endpoint of the interval.
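A quick demonstration of the two behaviours on a default integer index:

```python
import pandas as pd

# Default integer index 0..3.
df = pd.DataFrame({"x": [10, 20, 30, 40]})

# Positional slice, like a Python list: rows at positions 1 and 2.
positional = df[1:3]

# Label-based slice with .loc: inclusive of both endpoints (labels 1, 2, 3).
labelled = df.loc[1:3]
```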


I wanted to try to analyze application logs (usually a timestamp and some text) but all examples in pandas deal with numbers.

Is this useful for the analysis of such data, with a mid-term machine learning goal of clustering and anomaly detection?

This is great. I hope the author posts more Pandas visual guides.

Curious - for stuff like this, why not just use SQL (SQLite)?

There are some things which Pandas is just better at, such as extracting content via regex and pivoting. However, there are also some situations where you should use SQL, such as UPSERTs or date-range joins.
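For example, regex extraction is a one-liner with Series.str.extract - a sketch with hypothetical log-like lines:

```python
import pandas as pd

# Hypothetical request lines; pull the trailing status code with a regex.
df = pd.DataFrame({"line": ["GET /a 200", "POST /b 404", "GET /c 200"]})
df["status"] = df["line"].str.extract(r"(\d{3})$", expand=False)

# Pivot-style summary: count of lines per status code.
counts = df["status"].value_counts()
```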

The last time I wanted to use Pandas, it ate 32GB of RAM before I killed it, and I did all the analysis in Postgres instead.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact