At Harvard we've built out infrastructure that lets us deploy JupyterHub to courses, with authentication managed by Canvas. It has allowed us to hand students complex setups so they can do really cool stuff without spending hours walking them through installation.
Instructors are writing their lectures as IPython notebooks, and distributing them to students, who then work through them in their JupyterHub environment.
Our most ambitious deployment so far has been setting up each student in the course with a p2.xlarge machine with CUDA and TensorFlow so they could do deep learning work for their final projects.
We supported 15 courses last year, and got deployment time for an implementation down to only 2-3 hours.
It is worth noting that there is an argument that it is a worthwhile task for students to learn how to set up complex computing environments, as it better prepares them for the real world. In reality, though, there just isn't time within a single semester to do this for a class of 100+ students. So setups such as this one trade off that learning for a greater focus on computational theory and its application.
It’s a good experience, but a motivated student who’s good at charging through docs can do it on their own. Having to manage that for a class that’s intended to teach conceptual material would be a big time-sink.
It's a valid point to consider that part of the learning experience. I don't think students with limited time should be forced through it, though. If you get them hooked on computing, their natural curiosity will lead them to explore it further. Bogging them down with unnecessary setup would probably only make them think this shit takes way too much effort.
I've had many courses that were bogged down by software setup issues in college; I would rather that not be the case.
I want to say that, if it's pedagogically valuable, then it needs to be made into a small lab course (or part of the lab unit for an intro class), and taught once, in an organized manner.
And then stop letting professors hide behind this lame excuse so that they can get on with teaching the stuff that their course is actually about.
I'd actually do the opposite. Let them use pre-rolled first, then when they actually know and care about how the system is set up, have them set it up the way they like it.
(I am actually leading a machine learning for high-schoolers camp in 2 weeks and we are using Jupyter notebooks so that all students, with heterogeneous backgrounds, will start in the same place and get to the fun stuff fast. Many will never have used Python and will not know or care about 2.7 vs 3, just to give the most high-level and basic example!)
Setup is orthogonal to understanding and learning; leave it for extra credit or an optional follow-up exercise. It will lose or distract many students.
I think at the very least they should be able to see someone do it so they know what steps are involved. Possibly a video of the steps, with a note that it isn't the focus of the class.
Depends what that particular course is supposed to teach. Stats, maybe some sort of intro course, or a programming course intended for non-CS majors are all courses where it could make sense to abstract away setting up complex environments -- just like how, in the business world, companies (SAS, etc) make tons of money abstracting away complex environments so businesses can have their employees focus on what provides value in that context.
The majority of the courses that utilize JupyterHub at the moment involve some kind of stats work. They use JupyterHub to ease people with little technical experience into using stats libraries and coding generally, with the aim of giving them a high-level knowledge of CS principles and techniques.
An exception would be the deep learning project work. In that case JupyterHub was used as an easy way to deploy a centrally managed, cost-effective environment for a large class to use GPU resources without the risk of running up huge AWS costs for each student.
I'm torn. On the one hand that's really cool to get everything configured and up and running so students can get to the interesting parts. On the other hand, learning how to configure your own environment is kind of an essential part of working with any tool that forces you to understand at least some of the structure involved.
If you have 100 students, you will have 100 different mistakes to debug in the setup. A setup is not a program; there is rarely an easy way to pinpoint a problem, so it takes a lot of time to set up just one, let alone 100.
When the number of hours is limited, it's best to skip it entirely and just provide a solid paper tutorial.
Average class size of those using JupyterHub was more like 30, excluding two very large, very popular classes. Worth remembering that all the classes have labs and sections, which are much smaller.
It's almost a waste of time teaching people to set up a deep learning stack.
NVIDIA will break everything you do with every release, and any instructions you write will be outdated within weeks.
For example, the TensorFlow/CUDA/cuDNN installation changes continually because you can't install the default releases of any of them and get a working system.
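As a quick sanity check after any (re)install, you can ask TensorFlow what it was built against and whether it actually sees a GPU. A minimal sketch, assuming a TensorFlow 2.x install (older 1.x versions used tf.test.is_gpu_available() instead):

```python
# Sanity-check a TensorFlow + CUDA + cuDNN install (assumes TensorFlow 2.3+).
import tensorflow as tf

# Versions TensorFlow was *built* against; these must match what the
# NVIDIA driver and toolkit on the machine actually provide.
build = tf.sysconfig.get_build_info()
print("TF:", tf.__version__)
print("CUDA:", build.get("cuda_version"))
print("cuDNN:", build.get("cudnn_version"))

# An empty list here means the GPU stack is misconfigured even though
# TensorFlow itself imports fine, which is the most common failure mode.
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
```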
If you're already exposed to what the working system looks like and you later on have to follow the setup tutorial to get it going yourself, that's much easier than starting from scratch. And it's more motivating too, since you know more of what the end result will give you vs hoping that it helps you solve the problem at hand.
On the third hand, it's helpful (maybe ideal) to see somebody else do it, see what it looks like and how it ought to behave before trying to set it up yourself. But of course, at some point, you should do everything yourself.
I do understand the dilemma. I work at a K-12 and the office next to mine is where they put together the science lab kits for students. It takes a fair bit of understanding to do that correctly sometimes, and that preparation work is some knowledge the students seem to miss out on in order to get to the subject matter. My coworker has mentioned on more than one occasion that with certain modules it feels like she does most of the work and the students just do the final step.
I can appreciate that feeling; my father was an electronics teacher in secondary school in the UK.
It is true that we had to deal with some issues that might not have occurred had students gone through the process of setting up the environment themselves, like having to rebuild the machine of the student who uninstalled CUDA.
>Our most ambitious deployment so far has been setting up each student in the course with a p2.xlarge machine with CUDA and TensorFlow so they could do deep learning work for their final projects.
Wow! How expensive was this? Do you do any sort of shutdown/startup automation, or use preemptible instances?
The answer for this particular course was "very expensive". Each student's machine shuts down after 20 minutes of inactivity (defined as no Jupyter process currently running), which saves money, but in this specific instance many of the students were running extremely long jobs, so that didn't really help us.
The average cost over all of the other courses was something like $2-3 per month per student. The deep learning course ended up being closer to $20 per student. Thanks to Amazon Educate almost the entire cost was covered with credit.
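For anyone who wants the idle-shutdown piece without writing a custom spawner, here is a minimal sketch using the stock jupyterhub-idle-culler service; our own setup keys off running Jupyter processes instead, so this is only an approximation of what we do:

```python
# jupyterhub_config.py: cull idle single-user servers after 20 minutes.
# Sketch based on the jupyterhub-idle-culler README
# (pip install jupyterhub-idle-culler); `c` is JupyterHub's config object.
# Note: load_roles requires a newer JupyterHub; older versions registered
# the culler as an admin service instead.
import sys

c.JupyterHub.load_roles = [
    {
        "name": "idle-culler-role",
        "scopes": ["list:users", "read:users:activity",
                   "read:servers", "delete:servers"],
        "services": ["idle-culler"],
    }
]

c.JupyterHub.services = [
    {
        "name": "idle-culler",
        "command": [sys.executable, "-m", "jupyterhub_idle_culler",
                    "--timeout=1200"],  # 20 minutes, in seconds
    }
]
```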
Interesting. Do you think your use case could be helped by doing only the training remotely, on a single GPU or GPU cluster, and doing the rest of the development work on a cheaper machine? (Basically the equivalent of estimator train() running on a faster machine that quits afterward.)
I wrote a tool that does that with Keras but I'm not sure if it's actually useful for real-world use cases.
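The general shape of the idea, as a sketch (all names here are illustrative, not the actual tool): build and compile the model anywhere, run only fit() on the GPU box, and pull just the weights back to the cheap machine.

```python
# Hypothetical local/remote split for Keras training: only fit() needs the GPU.
from tensorflow import keras

def build_model():
    # Cheap on any machine: defining and compiling the architecture.
    model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# On the remote GPU instance (spun up, run, then shut down):
#   model = build_model()
#   model.fit(x_train, y_train, epochs=10)
#   model.save_weights("weights.h5")

# Back on the cheap development machine:
#   model = build_model()
#   model.load_weights("weights.h5")   # inference and analysis, no GPU needed
```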
I saw your repo a few months ago when looking at implementing something similar! A lot of the things in the development notes seem to have been solved or improved recently.
I do think that Jupyter notebooks are an amazing thing for CS Education. I wish more college level classes would utilize them. It adds a nice layer of interactive experimentation to any program/assignment/project.
What I ended up using was z2jh [0], which is working out great for right now!
We aren't yet allowing students to use GPUs or any libs that would require them, but we may look into that in the future.
If I'm allowed a shameless plug (I'm the developer), there is ElastiCluster [1] which can deploy (among others) JupyterHub on various IaaS cloud infrastructures. We have used it at UZH to provide teaching environments for various courses. By default it only comes with the local "passwd" authenticator, but anything that works with Jupyter or Linux/PAM can be used relatively easily (= write an Ansible playbook).
Feel free to reach out to me if you would like more info.
Most of the implementation is open source. The authentication module is separate, as it's part of our Canvas app work, but it will likely be open sourced soon. We've also done implementations that authenticate via GitHub...
There is a lot of work in the jupyterhub organization providing custom authenticators (GitHub among others); feel free to reach out if you want to migrate your work there. Also curious why the existing GitHub OAuth did not work for you.
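For anyone curious what is involved, a custom authenticator is a small subclass of jupyterhub.auth.Authenticator. A minimal sketch; the Canvas validation here is a hypothetical placeholder, not the Harvard module:

```python
# Minimal custom JupyterHub authenticator, sketching the subclass API.
from jupyterhub.auth import Authenticator
from traitlets import Unicode

async def canvas_validate(url, username, password):
    # Hypothetical placeholder: a real implementation would talk to the
    # LMS here (an API call or OAuth flow).
    return False

class CanvasAuthenticator(Authenticator):
    canvas_url = Unicode(
        "https://canvas.example.edu", config=True,
        help="Base URL of the Canvas instance (hypothetical setting)",
    )

    async def authenticate(self, handler, data):
        # `data` carries the login form fields; return the username on
        # success, or None to reject the login.
        if await canvas_validate(self.canvas_url, data["username"],
                                 data["password"]):
            return data["username"]
        return None
```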
"we had an issue with Oauth when we upgraded JupyterHub version 0.7 to version 0.8. The instance spawner we wrote needed to be updated to fix the issue. The case that opened about this in our repository* fixed with the latest update of the instance spawner"
So it seems we could be using standard GitHub OAuth now. But 95% of our implementations use Canvas auth, reconciled with our university AD.
Yeah, we need to get back in touch with the Jupyter Project folks to see if there is anything we've done that might be useful to feed back into the project. I'm not 100% sure on the OAuth stuff; I am but the lowly PM. I've punted the question over to the dev lead, who may comment further (though I'm not sure he has an HN account).
OK, great, I saw the other comment. JupyterHub is about to be released in version 0.9, so if there are changes you'd like to go in, there is still some time. Feel free to send a PR that adds Canvas to the list of known authenticators[1] so others can find it easily.
I know there's been some work on instructions for deploying on AWS, and work on the k8s helm charts to do so[2], if that can be of help. If any work could be consolidated to decrease the workload for you (and us), that would be good. Are any of you attending JupyterCon in August? In-person feedback is always welcome (Saturday August 25th is an open, free Community Day / hackathon / sprint / open studio, and the Jupyter team will be there).
Edit: surfacing the link to the open source repo on GitHub https://github.com/harvard/cloudJHub