Python Data Science Handbook: Full Text in Jupyter Notebooks (github.com)
396 points by TsukiZombina 3 months ago | 36 comments

An acquaintance once advised me to keep a context file: all the little "notes to self", code snippets, key config elements, etc, in a file. I've tried a few times in Vim but finally really got traction in Jupyter, through a combination of my org's massive Windows dependencies, which is definitely not the Jupyter community's default (needed to document lots of little idiosyncrasies), and actually having interesting data in that world. What I really like about Jupyter for this is that it's trivial to mix it all together: a link to a handbook like this, how to decode and encode Windows environment variables, tips on Vim, python, pandas, plotting, etc.

And I was really struck by how a number of the headings in this handbook mapped exactly to the headings in my context file. I suspect this will not be the last time I click that link.

Hmm, interesting idea. I've been trying all kinds of different tools for documenting my work/code snippets/notes... but nothing stuck. Mind explaining more on that? Would it be stupid to have a wiki-like system inside Jupyter?

I really prefer small text files, preferably org-mode. But, I just made a radical change that is so far working for me: I signed up for G Suite for my personal domain, and manually copied over org-mode, Apple notes, etc. to Keep Notes. I copied all purchased PDF eBooks, ACM Communications PDFs, and important research papers to Google Drive, and copied over all old email. With Cloud Search, I can find any of this stuff instantly.

As a programmer, there is no larger time saver than having notes for code snippets, configuration file examples, etc.

I used to use Evernote, then I wrote a personal version of Evernote in Clojure that worked really well for me, except everything was just on my primary laptop. G Suite is not great from a privacy standpoint (but I can live with it) but for me wins out for convenience - well worth $12/month.

EDIT: I used to keep JupyterLab running on a leased GPU server for machine learning educational projects. If I still did that, as other people here have pointed out, JupyterLab with its new file interface would be a good choice, especially with some customization to implement a global search to find stuff quickly in all notebooks.
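For what it's worth, a global search like that does not strictly need a JupyterLab extension. Here is a minimal sketch, assuming the notebooks are ordinary .ipynb files (which are JSON) somewhere under one directory. The function name search_notebooks and the "notes" directory are made up for illustration, not part of Jupyter.

```python
import json
from pathlib import Path


def search_notebooks(root, needle):
    """Yield (path, cell_index, line) for each cell line containing needle.

    Hypothetical helper: a case-insensitive substring search over the
    source of every cell in every .ipynb file below root.
    """
    needle = needle.lower()
    for path in sorted(Path(root).rglob("*.ipynb")):
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" in path.parts:
            continue
        try:
            nb = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # not a readable notebook, ignore it
        if not isinstance(nb, dict):
            continue
        for idx, cell in enumerate(nb.get("cells", [])):
            source = cell.get("source", "")
            # The notebook format stores source as a string or a list of strings.
            if isinstance(source, list):
                source = "".join(source)
            for line in source.splitlines():
                if needle in line.lower():
                    yield str(path), idx, line.strip()


if __name__ == "__main__":
    # "notes" is an assumed directory name; point it at your own notebooks.
    for path, idx, line in search_notebooks("notes", "environment variable"):
        print(f"{path} [cell {idx}]: {line}")
```

This reads every notebook on each search, which should be fine for a personal notes folder; a large collection would probably want a real index instead.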

Have you tried the new JupyterLab interface yet? It's pretty straightforward to segregate your content into separate notebooks in that setting since you have quick access to a filesystem view. That said, I also have a wiki with 10+ years of my medical notes in it. I love the wiki, but work keeps blocking my domain (I have lots of images from med school that I'm sure have a copyright on them, so I can't just make the wiki public access, therefore the little ladies in tennis shoes at BrightCloud mark it as "personal storage").

I have just started using Jupyter with JupyterLab for notes. I haven't gotten as far as making a full wiki system, but I definitely think it is possible. My killer feature is the ability to drop in code from a lot of different languages. I haven't fully tested it, but I am optimistic about its power.

Disclaimer: I have used todoist, emacs org-mode, wunderlist and trialled a dozen other task management programs.

Is there a way to run notebooks automatically, so you could regenerate notebooks like this after some library code changes or dependency upgrades and check that everything still works?

Yep, Jupyter notebooks have an execution API, you can find more of it here - https://nbconvert.readthedocs.io/en/latest/execute_api.html. Hosted notebooks as a service is a growing area of investment and those services presumably use this API.
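A rough sketch of what using that API can look like, assuming nbformat and nbconvert are installed and a python3 kernel is available. The helper names run_notebook and execution_resources and the file name example.ipynb are mine, chosen for illustration; ExecutePreprocessor is the nbconvert class described on the linked page.

```python
from pathlib import Path


def execution_resources(notebook_path):
    """Build the resources dict telling nbconvert which directory to run in."""
    return {"metadata": {"path": str(Path(notebook_path).resolve().parent)}}


def run_notebook(notebook_path, timeout=600, kernel_name="python3"):
    """Execute all cells in order and return the executed notebook.

    With default settings nbconvert raises CellExecutionError when a cell
    fails, so an exception here means the notebook no longer runs cleanly.
    """
    # Imported lazily so this module loads even where Jupyter is missing.
    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    nb = nbformat.read(str(notebook_path), as_version=4)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name=kernel_name)
    ep.preprocess(nb, execution_resources(notebook_path))
    return nb


if __name__ == "__main__":
    # "example.ipynb" is a placeholder; substitute a real notebook.
    run_notebook("example.ipynb")
    print("notebook executed without errors")
```

Running the notebook from its own directory is a design choice here, so that relative paths to data files inside the notebook keep working.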

I built ReviewNB [1] to see visual diff for Jupyter Notebook changes & do a code review on it (by writing comments on cell changes etc.).

One of the next features for ReviewNB is a CI pipeline for Jupyter Notebooks on GitHub. The idea is to make it easy for users to specify notebook "tests"/"checks" that can then be run on every change.

Given the nature of notebooks, it's a bit hard to design CI for them in a clean way, but I'd appreciate any input or use cases that you might want to see fulfilled.

[1] https://www.reviewnb.com/

There's the nbval plugin for pytest. https://nbval.readthedocs.io/en/latest/ I've used that in one of my packages (https://github.com/qucontrol/krotov) to verify example notebooks on every push, on Travis CI

At the very least you can use the command line tool to run/export them.
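As a sketch of that command-line route, assuming the jupyter nbconvert command is on the PATH: the wrapper below shells out to it once per notebook and collects the ones that fail. The function names nbconvert_command and run_all are made up for this example.

```python
import subprocess


def nbconvert_command(notebook, to="notebook", execute=True, timeout=600):
    """Build the argument list for one `jupyter nbconvert` call."""
    cmd = ["jupyter", "nbconvert", "--to", to]
    if execute:
        # Re-run every cell, allowing up to `timeout` seconds per cell.
        cmd += ["--execute", f"--ExecutePreprocessor.timeout={timeout}"]
    cmd.append(str(notebook))
    return cmd


def run_all(notebooks):
    """Run each notebook and return the list of those that failed."""
    failed = []
    for nb in notebooks:
        # With --to notebook and no --inplace/--output flag, nbconvert
        # writes the result next to the input as <name>.nbconvert.ipynb.
        result = subprocess.run(
            nbconvert_command(nb), capture_output=True, text=True
        )
        if result.returncode != 0:
            failed.append(nb)
    return failed
```

Passing to="html" exports a rendered copy instead of an executed notebook, and a non-empty return value from run_all could be turned into a failing CI step.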

It seems to be a nice introduction to numpy, pandas, and matplotlib

Any reviews on this ?

I have a fair amount of experience with pandas, and find the notebooks very helpful to refer to! I would say it's worth noting that his book is organized by technology (e.g. numpy, then pandas, then plotting), which makes it feel more like a technical reference than a walk-through of basic to advanced DS activities.

It's also worth checking out the notebooks for Wes McKinney's data science book. Daniel Chen doesn't have the code from his DS book on GitHub, but does have some useful notebooks he uses for workshops.



Hands-on Machine Learning with Scikit-Learn and TensorFlow [1] is more ML focused, but highly recommended. Out of the three books (Python for Data Analysis and Python Data Science Handbook) I learned the most from this one by far.

[1] https://github.com/ageron/handson-ml

An incredibly critical review of McKinney's book can be found here: https://medium.com/dunder-data/python-for-data-analysis-a-cr...

Ah, thanks for pointing that out. I mostly agree with his posts (and his minimally sufficient pandas is a great one!), and it's definitely worth reading. A common quirk with a lot of the Python DS books is them being "reference manuals".

(I'm a little concerned with the aggressive way he's come at Wes McKinney in posts and on twitter, considering Wes has given a lot of his time working on open source contributions)

I agree with that review. McKinney's book reads like a reference manual, and an old one at that. I don't understand why it is recommended so often.

Haven't read this yet, but just from Jake Vanderplas' reputation, I think it's probably worth reading.

If you want some more recs, my two favorites are Chris Albon's Machine Learning with Python Cookbook and Joel Grus' Data Science from Scratch: First Principles with Python

This is the first book on the subject matter which I actually finished. When I started using it, I was a complete noob in data analysis/ML.

Personally I'd say this is a good book. Sections dealing with Numpy, Pandas and Matplotlib are great.

However, I am hesitant to say the same about the ML section. I felt like this book assumes some familiarity with general ML concepts. I also felt like the ML chapters progressed a bit fast from the beginning to the core of each chapter.

In all, the book is great. The sections on Numpy, Pandas etc. are great. But as for the ML section, don't use it as an introduction/first course for ML.

"These notebooks are just Python code. They even have #-comments instead of markdown. For awesome Python notebooks, see

http://norvig.com/ipython/README.html "


I have the paperback version and I have read the Jupyter, Numpy and Matplotlib chapters as well as most of the Pandas chapter (I haven't read the scikit-learn chapter at all). So far I like it. It's well written, well edited, overall a good quality book. It's really focused on the tools and it shows you how they work with small, contrived examples. This is good because you can use it as a reference, pick up pretty much any section and understand it. However it doesn't teach you much about the process of data science, which would require larger examples. In other words it's focused on the how but not on the what and why. Maybe a more accurate title would be Python Data Science Tooling Guide. In my opinion it should be perfect for people who've already done some data science in another environment and are switching to Python. Other people might need to seek additional guidance elsewhere.

What would you recommend for someone who has the skills with pandas and numpy but struggles with the what and why?

For me, it is one of the best books for beginners. It covers almost everything, from numpy, pandas, and matplotlib/seaborn to ML algorithms.

When an open source book has 150 open PRs and the last commit is from 4 months ago I am discouraged to spend time on it.

Or maybe you should spend time on it, by creating a fork with all the good PRs applied?

I would probably do that if the time investment was worth it. For example if it was something I was using on a day to day basis but not for leisurely/exploratory reading.

Even for leisurely reading, you expect an author to still be updating a book several years after it was published? What experience has led you to believe that's a reasonable expectation?

I don’t “expect” the authors to do anything. But I am not going to spend many hours of my time reading a book when I see that the book is not maintained because there are many great books on my backlog that ARE being maintained by the authors/community.

All of my books are in pdf and receive zero maintenance - many of them are still incredibly useful.

I did not make a blanket statement about the usefulness of outdated books. But certain topics do get out of date pretty quickly, and a book on them would be less valuable than a maintained one.

Last commit was 3 months ago, and there are 48 open pull requests ...

Which one of us is accessing the wrong repo?

My mistake. You are right, it’s only 50. I still think that 3 months is a bit long for not merging any PR for an open source book.

Anyone else reading this comment should know: I did look through the PRs and many of them are typo fixes.

It's easy to make demands on open source code maintainers' time. Not all PRs and "tickets" need attention. The maintainer does not owe us anything.

I am not making demands nor did I say the maintainers owe me their time. But I do not owe them my time either. And I would rather spend my time on a book that is actively being maintained by the authors or the community.

I also recommend checking out the open source Automunge tool for automated data wrangling at automunge.com

Is this an advertisement?

Some ideas / questions:
- The documentation on GH is unreadable like this
- On GH it says "Patent Pending" so is this not open source after that or is that phrase just a joke?
- How is it related to the mentioned Data Science Handbook?
