
Estimating Number of Jupyter Notebooks on Github - eoinmurray92
https://kyso.io/KyleOS/nbestimate
======
julienchastang
In the same spirit as “Effective Java” and “Effective C++” we need to have a
book entitled “Effective Jupyter Notebooks”. Here are some of my items below.
Maybe this sub-thread can come up with an outline for this book.

Item #1 Writing a notebook is foremost an exercise in expository writing.
Making sure the writing is high quality is the first objective when writing a
notebook. This is Knuth’s literate programming idea, where prose takes
precedence over code, which is the reverse of the way we usually program:
code first, comments second.

Item #2 Don't use notebooks for general purpose programming. Notebooks are
supposed to have an audience and clearly explain something.

Item #3 Keep code cells simple and clear. If you need a comment in a code
cell, consider putting that verbiage in a markdown cell instead and
elaborating on the idea the notebook is trying to convey.

Item #4 Don't make notebooks a long series of extended code cells, or even
worse, just one long cell. Explain what is going on or see Item #2.

~~~
dec0dedab0de
I disagree.

While your points are valid for presentations, I believe that notebooks should
first and foremost be used for exploratory computing.

Notebooks are my go-to tool whenever I need to do something with a computer
and any of the following apply:

* I'm not quite sure what or how.

* It will likely be a one-off

* I need it _now_.

* Someone is watching me, to learn how I do it.

I would go a step further and say that notebooks should probably not be shared
directly most of the time, and that you should wrap up functionality into
modules/packages for other people to use in their own notebooks.

Edit: Just thought of a corny pun to make my point: They are notebooks, not
textbooks. Notebooks are personal, and while it may be useful/insightful to
compare notes, it's not the primary function of a notebook.

~~~
bluejay2387
I agree with your disagreement. I've always approached notebooks as a
prototyping tool first and a presentation tool second. I see Jupyter more as
an example of an early Interactive programming tool that just happens to be
useful for presentation and teaching purposes.
([https://en.wikipedia.org/wiki/Interactive_programming](https://en.wikipedia.org/wiki/Interactive_programming))

And regarding sharing -- particularly with notebooks you should be really
cautious about sharing and opening them. The security model is not exactly
bulletproof.

~~~
TeMPOraL
> _I see Jupyter more as an example of an early Interactive programming tool_

Nitpick, but I hope you meant "young", not "early". Interactive programming is
as old as (and today still primarily featured in) Lisp - i.e. twice as old as
most of us here on HN.

------
kbd
If you ever put notebooks in source control, you owe it to yourself to try the
text-based notebooks supported in Visual Studio Code[1]. They're round-
trippable with real (i.e. browser-based) notebooks, yet are much better for
collaboration, diffing, and editing.

[1] [https://code.visualstudio.com/docs/python/jupyter-
support](https://code.visualstudio.com/docs/python/jupyter-support)
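
For illustration, a minimal sketch of what such a text-based notebook looks like: a plain `.py` file in which `# %%` comment lines mark cell boundaries and `# %% [markdown]` marks a markdown cell. The data and variable names are made up for the example.

```python
# %% [markdown]
# # Notebook counts
# Prose lives in the comment lines of a markdown cell.

# %%
# Hypothetical data: (year, number of notebooks) pairs, made up for the example.
counts = [(2016, 100), (2017, 250), (2018, 600)]

# %%
# Each "# %%" line starts a new cell that can be run on its own.
growth = [later / earlier for (_, earlier), (_, later) in zip(counts, counts[1:])]
print(growth)
```

Because the file is ordinary Python, a diff shows only the lines that changed, with no output or metadata noise.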

~~~
claytonjy
How does this compare to Jupytext?

I prefer pipenv to Conda, and I don't like having Jupyter(Lab) installed in
each venv separately, so instead I only add `ipykernel` to each venv and then
use my system-level JupyterLab to access per-project kernels; seems like that
wouldn't work here?
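
For context, the per-project kernel setup comes down to one small `kernel.json` per environment, whose `argv` points at that environment's interpreter. A rough sketch of what such a file contains; the venv path and kernel name are hypothetical:

```python
import json


def kernel_spec(python_path, display_name):
    """Build the contents of a kernel.json that launches ipykernel
    with a specific interpreter (e.g. the one inside a project venv)."""
    return {
        "argv": [python_path, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }


# Hypothetical venv location; substitute the real interpreter path.
spec = kernel_spec("/home/me/.virtualenvs/myproject/bin/python", "Python (myproject)")
print(json.dumps(spec, indent=2))
```

In practice, running `python -m ipykernel install --user --name myproject` inside the venv writes an equivalent file, so there is normally no need to create it by hand.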

~~~
xvilka
Exactly. Conda is the worst of Python world - installing gigabytes of
unnecessary garbage every time.

~~~
linalgmixer
But the purpose of Anaconda is to have an easily installable set of frequently
used tools for a variety of data science tasks. It's not meant to be a
minimalist package.

------
alpb
Why doesn't this just use the GitHub public dataset available on Google
BigQuery to have much more accurate data rather than "scraping GitHub web
search results"? [https://cloud.google.com/bigquery/public-
data/](https://cloud.google.com/bigquery/public-data/)

There are a lot of examples of people analyzing public code on GitHub
efficiently for patterns and usages with BigQuery and getting pretty accurate
data out of it. [https://medium.com/google-cloud/analyzing-go-code-with-
bigqu...](https://medium.com/google-cloud/analyzing-go-code-with-
bigquery-485c70c3b451)

If you use GitHub on a daily basis, you are unlucky enough to know that web
search sadly can't even find words that exist in your repository.

~~~
mmcniece
One limitation of the BigQuery dataset is they only look at repos with a
license on them[1], the scraping approach can look at all public repos.

[1] [https://github.blog/2017-01-19-github-data-ready-for-you-
to-...](https://github.blog/2017-01-19-github-data-ready-for-you-to-explore-
with-bigquery/)

------
manaskarekar
Off topic:

Possibly something in my config, but I've recently got a lot of

    "Sorry, something went wrong. Reload?"

when trying to view Jupyter notebooks on github itself. Seems to be working
right now.

I have used this as an alternate:
[https://nbviewer.jupyter.org/](https://nbviewer.jupyter.org/)

~~~
eoinmurray92
This is what we made Kyso for - the linked post is actually a Jupyter notebook
itself, the code is hidden by default to make it readable to non-technical
people but you can click on the "code hidden" button on the top right to see
the code in full.

If GitHub is not working well for your notebooks, you can try Kyso by
signing up and importing your notebook from GitHub directly on this page:
[https://kyso.io/github](https://kyso.io/github), or upload it using this
page: [https://kyso.io/create/study](https://kyso.io/create/study)

Disclaimer - I am OP and founder of Kyso

~~~
aw3c2
You made Kyso for making GitHub-hosted Jupyter Notebooks render when accessed
with a web browser on github.com?

~~~
KyleOS
Well, it's one of the ways to post to Kyso, but yeah, you can synchronize your
Github repositories to Kyso and choose the notebooks you want to have
rendered. When you push changes to the repo, the sister post on Kyso will be
automatically updated.

------
tincholio
If only more people would use org-babel...

If you're on emacs and like Jupyter, there's [https://github.com/dzop/emacs-
jupyter](https://github.com/dzop/emacs-jupyter) , which is pretty nice. I've
been using it for a few days with Julia, and it works really well. It also
allows you to use different kernels from the same org-mode file, though I
haven't tried to pass data between them yet (should be possible, though, at
least it works in plain org-mode).

~~~
kadal
I agree. emacs-jupyter is great but I'm eagerly waiting for jupyter notebook
server support since my work is on remote clusters now. For local use, it's
pretty great.

------
lmeyerov
We use notebooks heavily for onboarding devs & data scientists to Graphistry,
and I only see that increasing.

Interestingly, for initial use, we increasingly start teams on their existing
internal NB servers, and for new ones, they either start on Jupyter included
in their Graphistry AMI or use Google Colab. So, very little outside of our
quick start notebook skeletons hits GitHub.

So... How many notebooks are actually out there? Probably an even more
interesting growth curve...!

~~~
eoinmurray92
How do you share and collaborate on the notebooks internally - I'd love to get
your thoughts on our Kyso for teams system [1] if you would be willing to
chat?

[1] [https://kyso.io/for-teams](https://kyso.io/for-teams)

~~~
lmeyerov
Google Colab has solved the 90% for us. I wish it'd have context sharing
across users and better default folder management, and probably other things,
but free + usable + sharing has been amazing in + across partners.

Feel free to ping in a ~couple weeks, happy to chat.

------
eoinmurray92
OP and Founder of Kyso here - we built Kyso to make it easier to blog your
notebooks to the public and also to make them easier to share in teams.

The linked post is actually a Jupyter notebook itself - analysing the number
of notebooks on Github.

A key element with Kyso is that the code is hidden by default to make it
readable to non-technical people but you can click on the "code hidden" button
on the top right to see the code in full.

If you want to give Kyso a go - sign up and import from Github directly on
this page: [https://kyso.io/github](https://kyso.io/github), or upload using
this page: [https://kyso.io/create/study](https://kyso.io/create/study)

~~~
aequitas
In my opinion hiding code is an anti-feature, as exposing the code by default
gives you more incentive to write clean, understandable, self-documenting
code, since it's always visible. By hiding it, people might be more
likely to paste in big blobs of ugly code that would be much better put into a
reusable function than in a notebook snippet.

~~~
eoinmurray92
That's true if you are sharing with someone who also understands the code - we
think that the feature allows you to share the notebooks with a completely
non-technical audience.

Imagine you had a notebook to analyse sales data and you needed to present the
results to your CEO (who perhaps cannot code) - this feature lets you present
the notebook as is, without needing to prepare a report in some other format

~~~
mike_ivanov
Why would you share a _notebook_ with somebody who doesn't understand the
code?? What are they supposed to _do_ with them anyways? Reading? Isn't that
what PDF is for? Just give them PDFs, C*O people have got piles of other
things to worry about besides shared notebooks.

~~~
eoinmurray92
Mostly so you don't need to convert to PDF and so that you can host the
reports in a central place where everyone can read them technical or not.

Like an internal wiki for a company's data science where the technical people
can communicate their work to the non-technical people with a pretty seamless
experience

~~~
mike_ivanov
This is not the experience non-technical senior leadership people are looking
for, unless you are a 10-people startup.

~~~
eoinmurray92
I'm not sure - we have large teams using Kyso as a knowledge base for
data-science work, and there's also Airbnb's knowledge-repo, which originally
inspired us, so from my point of view there is decent evidence of the need for
this

~~~
mike_ivanov
Ah, that makes sense. So, basically you replaced a dashboard effort with a
whole bunch of readonly notebooks, thus distributing the information delivery
job among the peers outside of your DS team. Clever.

------
sytse
We're seeing an explosion of Jupyter use as well on GitLab. GitLab already
makes Jupyter easier to install on a Kubernetes cluster
[https://docs.gitlab.com/ee/user/project/clusters/#installing...](https://docs.gitlab.com/ee/user/project/clusters/#installing-
applications) In response to the growing demand we're doing two things:

1\. Adding better Jupyter support to GitLab 12.0 [https://gitlab.com/gitlab-
org/gitlab-ce/issues/47138](https://gitlab.com/gitlab-org/gitlab-
ce/issues/47138) as suggested by my co-founder.

2\. Making it easier to do the entire data lifecycle with Meltano
[https://meltano.com/](https://meltano.com/) which plans to include JupyterHub

------
xchaotic
Python Notebooks will be the Perl of the 2010s - write once, nearly impossible
to maintain long term

------
hodder
Very cool work here. This is a pretty epic post, so please do not take this
the wrong way.

I was under the impression that FB Prophet was optimal for significantly
seasonal time series data.

Honestly, given the fickle nature of this kind of growth pattern beyond the
very near term, an ARIMA with a flat vol or a simple eyeball extrapolation
would, in my experience as a quant, likely generate just as reasonable/reliable
results.

While I understand this is likely intended as a standalone project, it would
be interesting to run a comparison of ARIMA vs FB Prophet on out of sample
trending Github tools/file types, as well as the general performance of these
predictions beyond a one year time frame (especially vs the reported
confidence intervals in Prophet).

I am not that familiar with how Prophet works, so I am absolutely open to
being humbled and corrected. I have a project myself that has a varying
seasonal component and I am looking forward to diving into Prophet for a
deeper understanding. I am attempting to model an Asian 2 asset spread option
with a volume weighted average index price setting mechanism where the
underlying exhibits seasonality in the volume traded over the trading time
window. I am currently running a Monte Carlo on the valuation with a simple
average settlement assumption, as opposed to a volume weighted average
assumption, and I was thinking Prophet could help.

Does anyone have experience in financial time series analysis and option
valuation who would care to chime in?

Also, what are everyone's thoughts on using Prophet on non-seasonal,
vol-clustering time series?

~~~
KyleOS
Hey, I posted the notebook by the OP. Thank you for your feedback! You're
correct in saying that FB Prophet is for forecasting time series with strong
seasonal effects. FB Prophet was the model used in the original script I found
& the main point here was simply to make the notebook more readable on kyso,
which has quite a few non-technical readers. I've worked a lot with ARIMAs
before for financial/economic data and I like the idea of comparing the
results between the two, and maybe even extend the time frame. So 1. I think
that'll be my next project and 2. if your project is public I'd love to give
it a read when published.

~~~
jerednel
Did you try running the forecast with a log ceiling to control for the
trajectory a bit? Or would that only be a concern of yours if you had to
forecast past a couple of years? I find that when I use Prophet to forecast
down to the day I end up creating initial forecasts with heavy log ceilings to
prevent unreasonable estimates of the future and then end up removing the
ceiling once enough history is established to provide a reasonable baseline.

------
martinzugnoni
My two cents: We've recently been working on a FREE hosted version of Jupyter
Lab mainly intended for education. Feel free to check it out.

[https://notebooks.ai/](https://notebooks.ai/)

Would love to hear some feedback.

~~~
jsilence
Wondering how you are planning to keep it free. Also wondering whether you
would possibly consider shifting to Sagemath/CoCalc as a service.

~~~
martinzugnoni
We got support from the local university in my city and a bunch of free
credits at AWS. Costs are very low and we want to keep it that way so we can
support as many students as we can with free access.

~~~
ertemplin
What happens when your funding and AWS credits run out?

~~~
martinzugnoni
We might add bigger paid tiers later if we decide to support business usage of
it. For now it's only educational and we can deal with the costs, even without
the credits. Containers used are small and get shut down on inactivity. So, we
only need to care about concurrent users. Hope it makes sense.

~~~
zeptomu
> Containers used are small and get shut down on inactivity.

How do you define inactivity? If I do

$ nohup ./computational_intense_and_runs_for_100_hours.py &

Do you just kill the process (or stop the container)? In essence Jupyter is a
graphically rich shell, so you are providing free *nix machines - don't
underestimate how this feature can be exploited (e.g. CoCalc limits at least
internet access for free instances).

~~~
martinzugnoni
First, that will use 100% of the CPU quota assigned to your user, which is
really small.

Second, yes. The container will be killed after 10min unless we keep detecting
activity of your user in the platform. So, basically the rule is: If we don't
detect user's activity after 10mins we kill all containers for that user. You
could hack this by doing periodical requests to the API to simulate activity,
but at some point your JWT will be expired and requests will start failing.
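
A minimal sketch of that rule as described, not the actual implementation: given the last time each user was seen, pick the users whose containers would be stopped. The function and variable names are hypothetical.

```python
import time

IDLE_LIMIT_SECONDS = 10 * 60  # the 10-minute window described above


def users_to_reap(last_activity, now, idle_limit=IDLE_LIMIT_SECONDS):
    """Return users whose last detected activity is older than idle_limit.

    last_activity maps a user id to the Unix timestamp of their last request.
    """
    return sorted(user for user, seen in last_activity.items() if now - seen > idle_limit)


now = time.time()
last_activity = {
    "alice": now - 30,       # active 30 seconds ago
    "bob": now - 11 * 60,    # idle for 11 minutes
}
print(users_to_reap(last_activity, now))  # bob's containers would be stopped
```

Note that a background process such as the `nohup` job above does not update `last_activity`, which is why it does not keep a container alive on its own.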

In any case, other students won't be affected at all by the inappropriate usage
and we will end up banning your account at some point when we detect it.

We also limit the amount of parallel running containers to avoid unlimited
containers running at the same time.

Do you see any drawbacks on this implementation? Happy to hear about possible
improvements.

------
KyleOS
I think it would be cool to run the same analysis on the number of R Notebooks
on Github and compare the two.

~~~
turingbike
Aren't Jupyter notebooks R notebooks? Jupyter stands for "Julia, Python, R" I
believe

~~~
KyleOS
I think the file extension scraped on Github is only .ipynb, which is only
python notebooks right?

~~~
rodonn
I think most people save under this extension even if they are using a
different kernel (i.e. they are running R, Julia, Matlab, etc. code in the
notebook).
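
To make that concrete: an `.ipynb` file is JSON, and the kernel it was saved with is recorded in its metadata, so the language can be read back without relying on the extension. A small sketch using only the standard library; the sample notebook is made up:

```python
import json


def notebook_language(nb_json):
    """Return the language recorded in a notebook's metadata, or None."""
    metadata = json.loads(nb_json).get("metadata", {})
    language = metadata.get("kernelspec", {}).get("language")
    # Fall back to language_info, which the kernel fills in after a run.
    return language or metadata.get("language_info", {}).get("name")


# Hypothetical minimal notebook saved from an R kernel.
r_notebook = json.dumps({
    "cells": [],
    "metadata": {"kernelspec": {"name": "ir", "display_name": "R", "language": "R"}},
    "nbformat": 4,
    "nbformat_minor": 2,
})
print(notebook_language(r_notebook))  # R
```

Counting notebooks per language would therefore need the file contents, not just a search on the `.ipynb` extension.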

~~~
eoinmurray92
Yeah, so it's mostly a split between Jupyter and RStudio - but Jupyter can mean
different languages

------
syntaxing
On the topic of Jupyter Notebooks, Is there something similar to a paid
version of Google's CoLab? CoLab is so awesome for creating prototypes and
even better since it's free. However, there is no paid alternative that I have
seen. I do not want to have to deal with setting up my own VM or server. The
way CoLab works is perfect for what I need.

~~~
eoinmurray92
The OP post is a Jupyter notebook itself, and if you sign up to Kyso you can
actually run JupyterLab on our cloud and then post the notebooks to the web, or
make them private on the paid plan - is that what you're looking for?

~~~
syntaxing
Yeah but unfortunately there is no GPU support. I wish there was!

~~~
eoinmurray92
Ah yeah ok - we are not planning GPU support soon - what if we created one of
those one-click deploy a VM to aws/digital-ocean buttons and from there if you
wanted to post to Kyso you could do it with git or our jupyter lab plugin.

You get most of the same experience and you can even customise several of the
steps?

~~~
syntaxing
Interesting! That would be super awesome. Like a one-click deploy with auto-
shutdown after the end of run to save money. I would definitely pay for a
service like that!

------
nsxwolf
So I just learned they're not laptops.

~~~
superdimwit
This comment made my day!

------
randomfool
There's also the GitHub extracts table available in BigQuery which allows
analysis of the contents of the notebooks themselves:
[https://bigquery.cloud.google.com/table/fh-
bigquery:github_e...](https://bigquery.cloud.google.com/table/fh-
bigquery:github_extracts.contents_ipynb?pli=1)

------
airocker
We built a runnable jupyter notebook website. Would someone be able to take a
look and give us some feedback?

[https://datacabinet.systems](https://datacabinet.systems)

We are VM based for now but are moving to be kubernetes based to make sharing
better. Our initial market is classrooms.

~~~
localhost
Something to be aware of (and a comment about k8s in general) is that
k8s is not suitable for use in hostile multi-tenant scenarios like the one
that you're describing. Once an attacker escapes from the container (see HN
archives for lots of examples of this), they can p0wn the entire cluster.
Jessie Frazelle has a great post on this:
[https://blog.jessfraz.com/post/hard-multi-tenancy-in-
kuberne...](https://blog.jessfraz.com/post/hard-multi-tenancy-in-kubernetes/)

There are expensive ways to deal with this today, e.g., running each user
isolated in a separate VM. Hopefully we will have better solutions in the near
future.

~~~
airocker
We were starting to work on disabling kubernetes cluster access. We will try
the steps in the post.

------
hyperbovine
Especially odd because they are so unsuitable for use with git. Someone needs
to find a way to fix this.

~~~
chrisjc
> because they are so unsuitable for use with git

Can you go into a little more depth about this statement?

~~~
pwhitebelt
Diffs primarily, I'm guessing: (a) it's kinda hard to parse the JSON that you
see when you look at a notebook in raw text, and (b) every time I execute a
cell, it shows up in the diff as a change. That being said, there are plugins
and tools that deal with these issues quite well. Check out
[https://nbdime.readthedocs.io/en/latest/](https://nbdime.readthedocs.io/en/latest/)
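
To show where the diff noise comes from, here is a rough sketch (standard library only, not how nbdime works) that clears the two fields that change on every run, `outputs` and `execution_count`, before a notebook is committed. The sample notebook is made up.

```python
import json


def strip_outputs(nb_json):
    """Return notebook JSON with code-cell outputs and execution counts cleared."""
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1, sort_keys=True)


# Hypothetical one-cell notebook, as it might look after being run twice.
executed = json.dumps({
    "cells": [{
        "cell_type": "code",
        "execution_count": 2,
        "metadata": {},
        "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        "source": ["print('hi')"],
    }],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
})
clean = strip_outputs(executed)
# Re-running the cell changes `executed` but leaves `clean` identical.
print(clean)
```

With outputs stripped, only edits to the cell sources show up in a diff, at the cost of losing the rendered results in the repository.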

~~~
hyperbovine
I have not had good luck with nbdiff. Minutes-long runtimes and huge memory
consumption on fairly standard ipynb’s.

------
gus_massa
Isn't the prediction too low? My (unsupported) prediction fitting a smooth
curve in the graphic is
[https://imgur.com/a/ykeIxPm](https://imgur.com/a/ykeIxPm)

------
JBorrow
This seems like an incredibly complicated way to fit an exponential to data...
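
For comparison, a plain exponential fit is a straight-line least-squares fit through the logarithm of the counts. A minimal sketch with made-up numbers, not the data from the post:

```python
import math


def fit_exponential(xs, ys):
    """Least-squares fit of y = a * exp(b * x) via a straight line through log(y)."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    b = (sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_log - b * mean_x)
    return a, b


# Made-up counts that double at every step.
xs = [0, 1, 2, 3, 4]
ys = [100, 200, 400, 800, 1600]
a, b = fit_exponential(xs, ys)
print(a, b)
```

Fitting in log space weights relative errors equally, which is a design choice; it gives no uncertainty interval, which is one thing a tool like Prophet adds.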

------
formalsystem
Maybe it's time to be able to run them implicitly on Azure cloud?

------
trpc
nice marketing, kyso.io team

------
funkythingsss
I hate jupyter notebooks. Joel Grus puts it perfectly:
[https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUh...](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-
dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1)

past hn discussion:
[https://news.ycombinator.com/item?id=17856700](https://news.ycombinator.com/item?id=17856700)

~~~
localhost
What specifically do you hate about Jupyter? Is it out of order execution
exacerbating the "hidden state" problem? If so, and if you already use VS
Code, I encourage you to try out our Python VS Code extension. We have an
"Interactive Python Window" mode
[https://code.visualstudio.com/docs/python/jupyter-
support](https://code.visualstudio.com/docs/python/jupyter-support) that we
showed a lot of folks at pycon last week and even among the "I don't like
Jupyter" crowd, it was quite well received. The key thing about our experience
is that it is an editor focused interactive programming experience vs. trying
to just replicate Jupyter functionality in an editor (though we are also doing
work here because we believe that folks want an experience where they can move
seamlessly back and forth between an editor and a notebook experience).

Disclaimer: I designed this experience in VS Code.

~~~
euler_angles
Please allow me to thank you for designing an experience that hits the sweet
spot between maintaining a good history of work through git while allowing for
interactivity and ease of exploration. I started using VS Code on seeing a
video of the jupyter support within the latest release of the Python
extension. This has eased and sped up my work tremendously!

~~~
localhost
Thanks for the kind words! If you find things that you would like to see
improved, please do open an issue on our Github -
[https://github.com/Microsoft/vscode-
python/issues](https://github.com/Microsoft/vscode-python/issues).

We also need to work on the discoverability of this feature too. Lots of
existing users of our extension had no idea it was there ... suggestions
welcome!

------
jjtheblunt
exponentially?

~~~
sp332
Yeah? [https://imgur.com/a/J93nCXR](https://imgur.com/a/J93nCXR)

