
Advanced computing with IPython - leephillips
https://lwn.net/SubscriberLink/756192/ebada7ecad32f3ad/
======
chrxr
At Harvard we've built out an infrastructure to allow us to deploy JupyterHub
to courses with authentication managed by Canvas. It has allowed us to easily
deploy complex set-ups to students so they can do really cool stuff without
having to spend hours walking them through setup.

Instructors are writing their lectures as IPython notebooks, and distributing
them to students, who then work through them in their JupyterHub environment.

Our most ambitious deployment so far has been setting up each student in the
course with a p2.xlarge machine with CUDA and TensorFlow so they could do deep
learning work for their final projects.

We supported 15 courses last year, and got deployment time for an
implementation down to only 2-3 hours.

In conclusion, IPython good, JupyterHub good.

Edit: surfacing the link to the open source repo on GitHub
[https://github.com/harvard/cloudJHub](https://github.com/harvard/cloudJHub)

~~~
chrxr
It is worth noting that there is an argument that it is a worthwhile task for
students to learn how to set up complex computing environments, as it better
prepares them for the real world. However, in reality, there just isn't time
within a single semester to do this for a class of 100+ students. So
implementations such as this one trade off that learning for a greater focus
on computational theory and its implementations.

~~~
w0m
Agreed. At the beginning of class, walk students through the setup. Then for
every project after, let them use the pre-rolled systems.

~~~
kaitai
I'd actually do the opposite. Let them use pre-rolled first, then when they
actually know and care about how the system is set up, have them set it up the
way they like it.

(I am actually leading a machine learning for high-schoolers camp in 2 weeks
and we are using Jupyter notebooks so that all students, with heterogeneous
backgrounds, will start in the same place and get to the fun stuff fast. Many
will never have used Python and will not know or care about 2.7 vs 3, just to
give the most high-level and basic example!)

~~~
meko
I wish I had ML camps growing up! Mine were photography and adobe flash :<

------
bitexploder
Don't overlook all of the % commands, such as %edit. If you are familiar with
emacs keybindings, IPython has a very good built-in editor as well. You can
also load snippets from saved files, save your whole history to a file, or
save individual lines to files using range expressions. In short, it is very
easy to get code in and out of IPython.

Another great trick: anywhere you want to debug or play around in your
scripts, run `import IPython` and then `IPython.embed()`, and your program
drops into an IPython session at that point with all its locals available,
which is nice.
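
A minimal sketch of that pattern (the surrounding function is invented for
illustration; the isatty guard just keeps the script runnable when it isn't
attached to a terminal):

```python
import sys

def process(records):
    total = sum(records)
    # Drop into an IPython session right here, with `records` and
    # `total` in scope, but only when running interactively.
    if sys.stdin.isatty():
        import IPython
        IPython.embed()
    return total

print(process([1, 2, 3]))
```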

~~~
taeric
I'm not sure I would compare it favorably to emacs in much of anything other
than mindshare. And, don't get me wrong, that is huge and not intended as a
vanity comment.

The keyboard shortcuts it has are superficial. My number-one shortcuts in
emacs are compile, jump to next error, grep/occur, index, and magit. Just up
and down? Obviously you use them a lot, but the arrow keys work fine.
Beginning of line and beginning of text are both huge. But really, the screwy
tab behavior kills me in Jupyter.

~~~
bitexploder
Sure. The keybindings for the line editor are emacs-like, but only surface
level. I can often fix up a line quickly, whereas I notice the vi folks just
use %edit. Jupyter I only use when making plots and such.

The one other super-handy shortcut is CTRL-R for shell-style reverse search.
That is pretty sweet.

------
mlthoughts2018
If interested: I spent some time in a comment thread a few days ago describing
how my experience leads me to believe the Notebook environment (not all of
Jupyter / IPython, just the Notebook part) is actually appropriate only for a
tiny subset of pedagogical or throw-away situations, and should be avoided in
most of the cases it’s marketed for (especially anything having to do with
‘reproducibility’ or ‘exploratory analysis’).

[https://news.ycombinator.com/item?id=17202704](https://news.ycombinator.com/item?id=17202704)

~~~
bunderbunder
You're not wrong, though I think the value of pedagogical and throw-away
situations in your work may not be a universal experience.

For me, at least, pedagogic and throw-away situations aren't a tiny subset.
They're most of what I do. It's exploratory work, figuring out how the data
behaves, _if_ the data behaves, where it needs to be cleaned, churning through
great heaps of experiments and iterations before hitting on the ultimate plan,
and putting together a presentation to help explain what I finally settled on
to colleagues and stakeholders.

Only after sinking a whole lot of sweat into that process do I go on to start
building anything that we intend to keep. At which point, forget Jupyter
notebooks, I'm typically not even working in Python anymore for that part of
the job.

~~~
wenc
> At which point, forget Jupyter notebooks, I'm typically not even working in
> Python anymore for that part of the job.

This is what is typically done out there, but I'd suggest it breaks the
feedback loop between the scientist role and the developer role. In rapidly
changing environments those feedback loops can be crucial.

It's similar to what Wall Street folks did (still do?)--quants write models in
Excel/VBA and pass them over to developers who would rewrite them in Java for
production. There's a natural impedance mismatch, and back-and-forths are
difficult.

I think a better approach would be for data scientists to write somewhat
production-ready code, send it to prod (with the help of devs), get feedback
from the production environment as well as get a sense of what tricks are
needed for prod, and then iterate on that code. It also helps to remove the
insulation between data scientists and the real world.

~~~
bunderbunder
Well, emphasis on the pronouns there. I'm not doing the proverbial "throw it
over the wall to engineering", I'm also writing the production version. I also
dislike the "2 teams" approach. Even if you have separate roles for data
scientists and software engineers, better to mix them onto a single team than
force them to communicate across a partition.

For me it's really down to efficiency. Writing somewhat production-ready code
is more expensive and time-consuming than blithely hacking. In the early
stages of a new project, I _know_ that almost everything I'm doing will get
thrown away. For the most interesting projects, there's even a decent chance
that it will be a complete failure and everything gets thrown away. So, at
that stage in the game, I'm inclined to say that any extra effort spent on
production readiness is just a waste of time and money. Fail fast, YAGNI, etc.

~~~
wenc
Oh I understand, I'm one of the few people on my team who does devops + data
engineering + data science (some people on my team only do 1 or 2, but not all
3). My point is more about the impedance mismatch between _roles_ and code
produced by each role, whether or not they are carried out by the same person.
For instance, I find it difficult iterating between my own model code and
production code, especially if the model code was conceived in an interactive
notebook environment.

I do agree that notebooks are good for writing throwaway code, but of n failed
notebooks, typically there's one that we'd like to bring to production. That's
typically the one notebook we'd want to be production ready.

When I say production-readiness, I don't mean writing actual production
boilerplate in the first iteration (maybe in later iterations...). I mean
writing the code in a way that lends itself to easy productionization through
observance of certain constraints, e.g. being cognizant of
environment/scoping/global-state/namespace conflicts, writing model code in
modular units (functions or classes, depending on the use case) rather than
just imperative line-by-line code, etc. These tiny disciplines are almost
effortless but can lower the friction of iterating between model and
production.
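
The "modular units" discipline can be sketched roughly like this (everything
here, names and logic included, is illustrative rather than taken from any
real pipeline):

```python
# Hypothetical sketch: keep model logic in importable units instead of
# top-level notebook cells, so a production job can reuse it unchanged.

def clean(rows):
    """Drop records with missing values (stand-in cleaning step)."""
    return [r for r in rows if r is not None]

def score(rows, weight=2.0):
    """Toy 'model': a weighted sum standing in for real model logic."""
    return weight * sum(rows)

def run_pipeline(rows, weight=2.0):
    """Composable entry point that both notebook and prod can call."""
    return score(clean(rows), weight)

# In a notebook you'd call run_pipeline(...) from a cell; in production,
# a batch job imports run_pipeline from this module instead of copying cells.
if __name__ == "__main__":
    print(run_pipeline([1.0, None, 2.0]))
```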

In data science work, the real proof of the pudding is in production, not in
unit tests. Most people don't want to admit this, but unit testing doesn't
work as well in the mathematical modeling world as it does in the software
development world -- much of the time our inputs aren't discrete/enumerable,
and the state space is large or infinite. So it's really important to be able
to iterate between production and modeling. If I ever need to go back to my
interactive environment to experiment and change the logic, there should be an
easy path to flow that back into production. Right now notebook environments
don't aid in that; I've observed, OTOH, that IDE environments do.

------
AdmiralAsshat
Deep research uses aside, I often prefer to use IPython because it's simply a
better shell than the default Python shell. You get basic niceties like tab
completion and being able to up-arrow to revise an earlier multi-line command
(like a function) without it being an exercise in frustration.

~~~
Bromskloss
Indeed. If I could only figure out how to have it automatically run `from math
import *` (and then present me with the interactive shell), I could use it as
a calculator too.

~~~
spot
[http://ipython.readthedocs.io/en/stable/config/options/terminal.html#configtrait-InteractiveShellApp.exec_lines](http://ipython.readthedocs.io/en/stable/config/options/terminal.html#configtrait-InteractiveShellApp.exec_lines)
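
For a persistent setup, the same option can also go in an IPython profile's
config file instead of on the command line (the path shown is the default
profile location; this is a sketch, not verified against every IPython
version):

```python
# ~/.ipython/profile_default/ipython_config.py
c = get_config()  # provided by IPython's config loader

# Lines executed at startup, before the first prompt:
c.InteractiveShellApp.exec_lines = [
    "from math import *",
]
```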

~~~
Bromskloss
My problem is that I don't know how to make it accept spaces in the command to
be run. This works:

        ipython3 --InteractiveShellApp.exec_lines='["print(2)"]'

This does not work:

        ipython3 --InteractiveShellApp.exec_lines='["from math import *"]'

The latter command results in the following error:

        [TerminalIPythonApp] CRITICAL | The 'exec_lines' trait of a TerminalIPythonApp instance must be a list, but a value of class 'str' (i.e. '["from') was specified.

~~~
spot
idk, it works for me:

        $ ipython3 --InteractiveShellApp.exec_lines='["from math import *"]'
        Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37)
        Type 'copyright', 'credits' or 'license' for more information
        IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

        In [1]: sin(5)
        Out[1]: -0.9589242746631385

~~~
Bromskloss
I see. I have IPython 5.1.0. Maybe there was a bug that has been fixed in
yours.

------
biridir
The content of the article is a bit dated. Jupyter + Numba + Dask is the
direction scientific computing (in Python) is taking. ipyparallel is not
really scalable in my experience.

~~~
semi-extrinsic
This is an extremely limited view of "scientific computing" that seems to only
focus on analytics, which is a tiiiny part of sci comp.

Your "stack" does nothing for solving/including ODEs, PDEs, DAEs, Fourier
analysis, numerical integration, automatic differentiation, linear equation
system solvers, preconditioners, nonlinear equation system solvers, the entire
field of optimization, inverse problems, statistical methods, Monte Carlo
simulations, molecular dynamics, PIC methods, geometric integration, lattice
quantum field theory, molecular dynamics, ab initio methods, density
functional theory, finite difference/volume/element methods, lattice Boltzmann
methods, boundary integral methods, mesh generation methods, error estimation,
uncertainty quantification...

Those are just off the top of my head, the list goes on and on.

~~~
biridir
I completely agree that there are many scientific libraries in Python that
scale up. I was addressing the article, which showed a more advanced way to
use Python with the aim of making it applicable to large datasets. If you were
to implement a method from scratch, or scale up to a larger dataset, you'd end
up using Numba, NumPy, and Dask. This is entirely from a lower-level
programming perspective: implementing and integrating methods rather than
pipelining methods from higher-level scientific libraries.

Just for some context:
[https://www.scipy.org/about.html](https://www.scipy.org/about.html)
[https://www.scipy.org/topical-software.html](https://www.scipy.org/topical-software.html)

~~~
semi-extrinsic
I have yet to see a situation where Numba makes real sense, as compared to
just dropping down into C(++) or Fortran when you need to do the heavy
lifting. Can you give me a good example?

------
giu
The linked slides "Python in HPC" [0] are quite awesome; from them I learned
about mpi4py's [1] existence. Definitely going to give it a try in the future;
using the MPI API with an "easy-to-follow" syntax should be fun.

[0]
[https://hpc.nih.gov/training/handouts/171121_python_in_hpc.pdf](https://hpc.nih.gov/training/handouts/171121_python_in_hpc.pdf)

[1] [http://mpi4py.scipy.org/docs/](http://mpi4py.scipy.org/docs/)

~~~
skierscott
mpi4py is a solid implementation, but the docs aren’t the greatest, especially
if you’re not familiar with MPI.

~~~
ianhowson
And there are different versions with different APIs. Best to just read the
source code for whatever you're running.

------
carc1n0gen
For the last couple of years I had wrongly thought that IPython was short for
IronPython. Now that I know it is not, I understand the hype around it more.

~~~
Waterluvian
Early in my career I thought CPython and Cython were the same. Then PyPy and
PyPI confused me. And then there's Python the language vs. Python the
implementations, and how in some contexts "Python" is assumed to mean
"CPython". Landmines everywhere!

~~~
leephillips
Yup, whenever I write one of these I need to spend some time on terminology.
It's even more confusing now with Jupyter, as there is still much overlap
between that and IPython.

~~~
Waterluvian
Yep. I honestly thought Jupyter was just a rename of IPython until you said
that. Now I have to go learn the difference.

~~~
Asooka
Summary: Jupyter is the evolution of IPython. IPython is deprecated.

~~~
detaro
Jupyter is the language-agnostic parts of IPython (UI, notebook format,
protocols to talk to notebooks, ...) extracted out to be used with many
languages.

IPython remains as the project maintaining the Python-specific parts of that
stack. It's not deprecated, but has been limited in scope.

~~~
Phrodo_00
IPython still provides the nicer-REPL-for-Python part. I don't know if that
has a name beyond just "IPython", which is a bit confusing.

------
vaibhavsagar
There are solutions available for hosted Jupyter notebooks, such as
[https://cocalc.com/](https://cocalc.com/) (which is collaborative!) and
[https://mybinder.org/](https://mybinder.org/).

------
happy-go-lucky
It's a pdf file, which I am trying to open but cannot. Can you guys read it?

Edit: The link to the file:
[https://hpc.nih.gov/training/handouts/171121_python_in_hpc.pdf](https://hpc.nih.gov/training/handouts/171121_python_in_hpc.pdf)

Edit: I can read it now.

------
pankajkumar229
Disclaimer: I am one of the founders of DataCabinet.

We built an online service on top of Jupyter that takes away the effort of
managing JupyterHub. It would be great if we could hear some feedback in the
context of this conversation. We feel that DataCabinet is better in some ways
because it provides:

a. Autoscaling according to the number of users.
b. Easy sharing of full containers between people: you can install pip/conda
binaries and share them with students/users.
c. Shared storage, so nbgrader works seamlessly.

Here is a full comparison:
[https://datacabinet.info/pricing.html](https://datacabinet.info/pricing.html)

Please excuse our landing page; it was just created today and we are fixing
it.

