
A Gentle Visual Intro to Data Analysis in Python Using Pandas - jalammar
https://jalammar.github.io/gentle-visual-intro-to-data-analysis-python-pandas/
======
Bishonen88
A very gentle intro, indeed ;)

Pandas is such a vast monster that even after going through the book by the
original author of Pandas
([https://www.safaribooksonline.com/library/view/python-for-da...](https://www.safaribooksonline.com/library/view/python-for-data/9781491957653/)),
I was absolutely unprepared for doing real analysis.

Whilst I understood the basics, such as data loading, (simple) cleaning,
selections, functions, groupby, indexes etc., I spent most of my time on
Stack Overflow looking for solutions to the actual problems I was facing. I
reckon that many other users have had the same experience: there is lots of
general info out there when it comes to pandas, but every data set is different
and the devil is in the details. Long story short: learning pandas is all
about trial and error, and it will take months (years, even) to become
efficient at it.

~~~
istjohn
Serious question: I've tried to use Pandas for some data analysis for my small
business. Data sets are on the order of 10,000 data points or less. After
struggling for days with Pandas, I've begun to wonder if it wouldn't be easier
to code the analyses in raw Python. I wouldn't mind taking longer to complete
the task at hand if in the process I was acquiring skills that will pay off
down the road, but I wonder if Pandas isn't so esoteric and difficult that I
may never reach the point that I can cash in that investment of time and
effort as long as I am only a casual user.

In contrast, while I'm not an expert in JS or Python, I find that time spent
struggling with those technologies pays dividends since the lessons learned
make everything I do in the future easier.

This is highly subjective of course, but in your opinion, should I keep
fighting with Pandas? Is it worth it?

~~~
rchaud
10,000 data points is well within the range of what Excel can handle without
needing PowerQuery. What type of analyses are you attempting?

~~~
istjohn
Linear optimization and generating some simple graphs. I would like to be able
to at least generate the graphs automatically from my database.

~~~
wenc
> Linear optimization

Did you mean linear regression? Linear optimization isn't a use case that
Pandas covers, but there are other tools that I can recommend.

~~~
istjohn
Yes, that's what I meant.
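
For a data set of this size, a straight-line fit does not strictly need pandas. The following is only a minimal sketch, assuming NumPy is installed; the data is invented for illustration, and `x`, `y`, `slope` and `intercept` are hypothetical names, not anything from the thread.

```python
import numpy as np

# Hypothetical data: y is roughly 2*x + 1 with a little noise added.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# A degree-1 polynomial fit is an ordinary least-squares straight-line fit.
# np.polyfit returns the coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```

The same two arrays could come from a database query or from two DataFrame columns; the fit itself is the one `np.polyfit` call.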

------
Mefis
How does Pandas compare to R's Tidyverse?

Tidyverse was super easy to pick up, and I can do almost anything I want with
it. Why would I want to switch to Pandas?

Has anyone tried the Python tidyverse port? How does it compare to the
original?

~~~
jalammar
I haven't used tidyverse myself, but I know that pandas is heavily influenced
and inspired by R. Most analysis tasks are doable in both platforms. If later
stages of your pipeline involve deep learning (or machine learning,
generally), then it could pay to be in the python universe given the wide
adoption of python ML/DL tools. I generally wouldn't advise switching unless
you have a certain pain point, though.

~~~
wenc
> pandas is heavily influenced and inspired by R.

Is it? How so?

------
jalammar
Hello HN, author here. If you've ever wanted to get into data analysis, this
is my best attempt at getting you past that first hump. A lot of these
concepts are easier than you might think.

~~~
qwerty456127
Thank you very much. I would certainly love to read more about Pandas (or
anything) written in this style and go deeper into the subject.

Are you going to write more? Can you (or anybody) recommend where (a book, a
YouTube channel, a website or whatever) I should continue from the point where
your intro ends? For now, all I use of Pandas is a datetime-indexed array of
real numbers plus simple vector operations on its columns, but I feel like I
would like to take a learning/career path to becoming a Pandas expert.

~~~
happy-go-lucky
You may want to check this out:

> Short hands-on challenges to perfect your data manipulation skills

[https://www.kaggle.com/learn/pandas](https://www.kaggle.com/learn/pandas)

Also this:

> Things in Pandas I Wish I'd Known Earlier

[http://nbviewer.jupyter.org/github/rasbt/python_reference/bl...](http://nbviewer.jupyter.org/github/rasbt/python_reference/blob/master/tutorials/things_in_pandas.ipynb)

------
skadamat
Love this intro! The popular dataframe-oriented tools (tidyverse, pandas,
etc.) all require familiarity with vectorization and thinking with related
mental models. I'm involved with Dataquest Labs and we teach data science
interactively in the browser. We're pretty big believers in using diagrams and
visual aids to help people learn these concepts as well.

We've had a pandas course
([https://www.dataquest.io/course/pandas-fundamentals](https://www.dataquest.io/course/pandas-fundamentals))
for a while and we just launched some R courses that teach a lot of
vectorization
([https://www.dataquest.io/path/data-analyst-r](https://www.dataquest.io/path/data-analyst-r)).

------
NightlyDev
What kind of syntax is var['string'] in these examples?

Haven't really used python for anything and I'm just wondering, since it looks
like an array or map, but clearly seems to have some logic behind it as it
seems to reference the specified column at each row. What is this
functionality, something that's built in to python or use of some sort of
magic functions?

~~~
cgriswald
var['string'] means get the item of object var with key 'string'. Any Python
class can define the magic method __getitem__ to define the behavior
of/overload the [] operator.
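
A minimal sketch of that mechanism in plain Python. The `Table` class is a hypothetical toy invented here to show how `[]` dispatches to `__getitem__`; it is not how pandas itself is implemented.

```python
class Table:
    """Toy column store; a hypothetical stand-in, not the pandas implementation."""

    def __init__(self, columns):
        # columns maps a column name to a list of values, one per row
        self._columns = columns

    def __getitem__(self, key):
        # Python calls this method whenever someone writes table[key]
        return self._columns[key]


table = Table({"name": ["Ann", "Bob"], "age": [31, 45]})
ages = table["age"]  # equivalent to table.__getitem__("age")
print(ages)          # [31, 45]
```

Because the class decides what `key` means, the same bracket syntax can return a whole column, a row, or a slice, depending on what is passed in.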

~~~
NightlyDev
Thanks! Didn't know python had magic functions so I was really confused.

------
happy-go-lucky
Note to author:

> We can select one or multiple rows using their numbers (inclusive of both
> bounding row numbers):

> df[1:3]

That will slice beginning from the row with integer location 1 up to 3,
exclusive of the last element. So, just two rows, not three as shown.

~~~
jalammar
Thanks for the heads up! Indeed it's df.loc[1:3] that would return three rows,
not straight-up df[1:3] which indeed returns two rows.

Edit: Corrected in the post. Thanks again!

~~~
yufeng66
But why design it that way? Seems to be a sure way to confuse new users.

~~~
EForEndeavour
I'd say that's because df[start:stop] mimics python's builtin list slicing,
which includes start but excludes end, so this dictates the behaviour of
indexing without .loc. By contrast, df.loc[start:stop] is a label-based
indexer, and labels can be anything (integers, strings, datetimes, categories,
etc.), so it doesn't always make sense to exclude the right endpoint of the
interval.
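
The difference is easy to check on a small frame. This is a sketch assuming pandas is installed; the column name `value` and the data are made up for illustration.

```python
import pandas as pd

# Default integer labels 0..4, so labels and positions happen to coincide.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

positional = df[1:3]         # position-based, end excluded: rows 1 and 2
by_label = df.loc[1:3]       # label-based, end included: rows 1, 2 and 3
by_position = df.iloc[1:3]   # explicit positional indexer, same rows as df[1:3]

print(len(positional), len(by_label), len(by_position))  # 2 3 2

# With string labels there is no obvious "one past the end" label,
# so an inclusive right endpoint is the natural choice for .loc.
named = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])
print(list(named.loc["b":"d"].index))  # ['b', 'c', 'd']
```

Using `.iloc` for positions and `.loc` for labels, rather than bare `df[...]`, makes the intended behaviour explicit.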

------
BrandoElFollito
I wanted to try to analyze application logs (usually a timestamp and some
text) but all examples in pandas deal with numbers.

Is this useful for the analysis of such data, with a mid-term machine learning
goal (clustering and anomaly detection)?

------
catacombs
This is great. I hope the author posts more Pandas visual guides.

------
Scarbutt
Curious, for stuff like this, why not just use SQL (SQLite)?

~~~
joelschw
There are some things which Pandas is just better at, such as: extracting
content via RegEx and pivoting... However, there are also some situations
where you should use SQL such as UPSERT or date-range joins.
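
A rough sketch of the regex-extraction and pivoting case, assuming pandas is installed. The log lines, the pattern and the column names (`line`, `date`, `level`, `message`) are all invented for illustration.

```python
import pandas as pd

# Hypothetical log lines, invented for illustration.
logs = pd.DataFrame({
    "line": [
        "2018-10-29 10:00:01 ERROR db timeout",
        "2018-10-29 10:00:05 INFO request ok",
        "2018-10-30 09:12:44 ERROR db timeout",
        "2018-10-30 09:15:00 INFO request ok",
        "2018-10-30 09:16:30 INFO request ok",
    ]
})

# Named groups in the pattern become columns of the extracted frame.
pattern = r"^(?P<date>\d{4}-\d{2}-\d{2}) \S+ (?P<level>[A-Z]+) (?P<message>.*)$"
parsed = logs["line"].str.extract(pattern)

# Pivot: one row per date, one column per level, cells hold line counts.
counts = parsed.pivot_table(index="date", columns="level", values="message",
                            aggfunc="count", fill_value=0)
print(counts)
```

Text columns work with the `.str` accessor much as numeric columns work with arithmetic, so timestamp-plus-text data is a reasonable fit once it has been split into fields.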

------
pleasecalllater
The last time I wanted to use Pandas, it ate 32GB of RAM and then I just
killed it, and did all the analysis in Postgres.

