Coursera courses preserved by Archive Team (archive.org)
513 points by mihaitodor on July 9, 2016 | 75 comments

It's great that there's a central place for this, at least once it's organized sensibly.

The stupidest, most counter-productive aspect of all these MOOC systems is the artificial schedule imposed on the course. While I've been able to take a couple to completion, I've had to let some fall by the wayside due to scheduling. Once that happens, you're disincentivized from catching up, because you're behind and missed deadlines affect your score. When I've gone back to finish courses that I had to set aside for the moment, sometimes I've lost access to the materials because the course has shifted to its next "semester". There's absolutely no reason for that. While there are a few courses like music or writing where you are collaborating or cross-reviewing other people's work, most of them are standalone lectures, homework, and tests.

That aspect is frustrating. I took all of Dr. Chuck's "Python for Everybody" courses, and have lost access to the videos. There have been times when I would have liked to have reviewed the videos, because while I did learn the material and perform the work at the time, I hadn't used that knowledge much in the months since. I feel as though I'm losing some of it, and would like to review. Having paid $300+ for the courses, I don't feel I should have lost access to the materials.

My wife works with Dr. Chuck and helps produce the University of Michigan's Coursera videos. She says you should absolutely still have access to those videos, and if you don't, you should contact Coursera's support.

If you paid for it on the old Spark platform (over a year ago), those classes were migrated to their new system, so you might need to sign up again, but you shouldn't need to pay again. The URL to use for requesting is: http://learner.coursera.help

> and have lost access to the videos.

One thing I like about Udemy (disclaimer: I work there) is the lifetime access policy for premium content.

This seems to be what was indicated in https://news.ycombinator.com/item?id=11881767

I was under the impression that they would remove access to old courses by June 30 for those who hadn't paid. Removing access for those who've paid goes too far.

This is why I always (try to) work out a way of downloading an offline copy of the videos/course notes.

So true. One of the worst aspects of school is the imposition of a uniform schedule on everyone. There is a point beyond which learning faster means learning more shallowly. You don't have time to develop a rich mental model, which you test, explore, edit, and integrate with related models, but have to settle for a less refined model. With even less time, you have to resort to memorizing parts you don't have time to figure out to parrot back on a test. The less time, the more you have to rely on memory of isolated facts and general-purpose logic to "just pass the test".

It is tragically common for people (esp. kids) who would be fully capable of understanding the material to be hustled along at a pace that is determined by factors unrelated to their personal needs, forcing them into less-effective learning modes.

These artificially time-limited MOOCs potentially turn us all into such kids. If I'm trying to learn something for which my preparation was 20 years ago, I'll be as good as anyone after I finish the course, but I need more time to do it correctly. If I'm trying to learn something but only have an hour a week to allocate to it, it's my turn to be the "slow learner", but given the necessary time, I'll know the material as well as the fastest learner by the time I finish.

But MOOCs impose a schedule unrelated to my individual circumstances, and I can take it or leave it. It would be so much better if they just told me what to read, and whenever I finished reading, I could watch the corresponding video lecture, then I could go to an open, ongoing forum to read the accumulating Q&A for that lecture and post my own questions or comments, then on to the problem sets with answers and discussion again on the forum, followed by quizzes and tests (multiple versions of each, so I could take them, repeatedly if necessary, for my own guidance, not grades)...and I could plow through it all in a week or chip away at it for a year, as suited my needs.

If they wanted to offer credentials, they could separate the teaching from the ultimate certification testing.

As it is, in most cases a good textbook still meets my needs so much better than all this advanced teaching technology.

In addition to that, I've been disappointed that community building is based around courses and not forms of study. So if you go to a course after it's ended, the forum is dead. It would make sense to have forums/communities set up around specific fields of study, with specific course forums that stay up between sessions (so the forum for a Calc II class wouldn't keep being recreated every time the schedule flips over).

Also, when they take a 1-hour lecture and split it into 5-minute segments, it makes it so tedious. I much prefer OCW's approach.

Precisely. Not all of us suffer from ADHD! I simply abhor this new trend of mini-videos - all it does is break up my flow.

As someone with ADHD, I can assure you that splitting up videos into 5 minute chunks does me no favours whatsoever.

OCW does have some of the 5-minute segments approach if you access it through the MIT website (for some courses at least).

However, OCW coupled with Youtube 1.5x playback is heaven.

Coursera actually has a solution to this, and it involves rolling over from your enrolled session into a later one.

To tell you the truth, I find it a great system because it gives me a sense of the time frame in which I should realistically complete the course.

Although, at the same time I tend to do deep dives on a course and complete them as fast as possible blowing past the schedule anyways.

If you don't want to subscribe to a self-imposed schedule the audit option basically allows you to learn the material on your own time.

Yes and some of the data science courses seem to be offered weekly, at minimum monthly. At least the courses I took.

Something that annoys me is courses that are listed in the course list but were once and done and never coming back, or "maybe someday" classes.

The time limit is enforced discipline. It's arguably their main benefit. Some people study without an online course... most don't.

I don't have data on completion rates for timed vs. untimed courses. The former's popularity is suggestive, but confounded by marketing, more people online, prestige, packaging of courses, lecturer/tutor availability, cohort forum, etc.

But I do have one anecdatum: I took Ullman's Automata, even though all the lectures were already on youtube, and slides, tests and solutions on his homepage. I found the schedule very tough towards the end... I meant to later review some parts I needed more time on, but over a year later, I still haven't...

> The stupidest, most counter-productive aspect of all these MOOC systems is the artificial schedule imposed on the course.

I think that comes from imitating "physical" courses, which are designed to be completed in a predetermined amount of time. As the sibling comment alludes to, it also gives some motivation for actually finishing them.

But even for the meatspace courses in our uni, the course materials (lecture notes and exercise PDFs) remain available on the public course web page for the foreseeable future after the course.

Having a predetermined schedule might be reasonable for stuff like course credit and motivating the students to complete the course on time, but killing the forum community and even restricting access to materials (or making the access an unnecessary hassle) seems very unnecessary if the objective really is to foster learning in general.

Also, you get a community going through the course at the same time, which can be useful for assignments where you grade each other's work, group projects, discussion boards, etc.

Does anybody know the reason why MOOCs impose these artificial schedules?

Aren't these MOOCs just largely loss leaders for the schools?

If memory serves me right, Andrew Ng from Coursera said that there's a stark difference in test completion with and without deadlines. Apparently, if you don't have a hard deadline it's quite easy to just let it slip, because you can always do it later.

I don't think that a schedule is counter-productive because for most people, if you don't accomplish the course in a certain schedule, you would have to start over because you would forget most things... but of course all material should stay available forever in an ideal world.

The schedule can be an issue for those of us with travel and other real life concerns. OTOH, I wonder, if what you want is purely self-paced learning, why YouTube videos plus a book/website don't fit the bill.

For a narrow set of classes, programming autograders and other online aids are probably useful. But then, to the degree people have issues, they're really on their own (absent an explicit "TA" model that you pay for).

For those unaware, Coursera shut down their old platform on Jun 30th [1].

Many of the courses on the old platform are slowly coming back on the new platform. When I built the list [2] of courses on the old platform the course count was 472; now it's around 390. Some of the notables that I was excited to see come back are:

Neural Networks for Machine Learning with Geoffrey Hinton [3]

Computer Architecture from Princeton [4]

Programming Languages from UW by Dan Grossman [5]

Introduction to Natural Language Processing by Dragomir Radev [6]

Many of these courses were last offered a couple of years ago. Hopefully more courses from the list [2] start coming back.

[1] https://www.class-central.com/report/coursera-old-platform-s...

[2] https://www.class-central.com/collection/coursera-old-stack-...

[3] https://www.coursera.org/learn/neural-networks

[4] https://www.coursera.org/learn/comparch

[5] https://www.coursera.org/learn/programming-languages

[6] https://www.coursera.org/learn/natural-language-processing

If you want to help Archive Team in its efforts to preserve disappearing content and have some bandwidth to spare: run an "Archive Team Warrior" appliance! This way you can help download all this content that is about to disappear!

Or do you have some DigitalOcean promotional credits left that are about to expire? Spin up a (few) VMs with Docker containers running the warrior on DigitalOcean!


I started the warrior VM after seeing this message.

However, it seems whatever they are crawling cannot handle this type of mass distributed crawler; for the past few hours my computer hasn't done anything.

You know what would be AMAZING? If there was a Coursera course (or some other MOOC course, books, etc) that explains how archive.org works from the foundational technologies upwards. So, like, you could build your own mini version of archive.org as an exercise. It'd be a fascinating project and could be a great case study in web archiving techniques and information retrieval. Does anything like this exist already?

And then archive.org could archive it, and you would achieve inception.

Not yet, but if you ask, I assure you Archive.org peeps are going to see it in this thread.

Archive.org accepts all kinds of donations.


Credit Card, PayPal, Bitcoin. Brewster is an amazing Steward of the Internet Archive.

Video Tour: https://vimeo.com/59207751

All the names are 'Coursera Curses' instead of 'Coursera Courses'. Someone there must have been really upset with Coursera's approach :) .

Most Archive Team efforts are given a punny name (at least for the associated IRC channel), usually the offending organization's name twisted into something mildly negative. Coursera > #cursera, Soundcloud > #soundclown, etc.

and the title of the page is "Archive Team: The Coursea Curse"

Coursea Curse :p

Yup, totally worthless, token effort.

This is incredible, I was just on the Coursera website trying to go to my old courses to continue from where I've left off, but I couldn't get to them. The link to my previously enrolled courses was taking me to the newer version of those courses which haven't started yet. So I thought I'll search HN because I remember reading that someone was archiving these ... and boom! it's the top link on the front page! Yay HackerNews!!! There is some Voodoo-AI going on at HN!

If anyone wants a quick index of the courses in those 45G webarchive files here is the index of course slugs. Unfortunately it looks like the webarchive slugs don't match the Coursera slugs anymore so I couldn't pair the names with more readable course titles:


Ah, never mind. They just had some extra spaces. Here you go: https://gist.github.com/mihaitodor/b0d8c8dd824ab936c057508ed...

I also included the quick script I used to generate them, for convenience.
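
For anyone hitting the same mismatch, here is a minimal sketch of the whitespace fix described above. It is not the script from the gist; the function names and the sample slugs and titles are hypothetical, and it assumes stray leading/trailing spaces are the only difference between the two slug lists.

```python
def normalize_slug(slug):
    """Trim the stray leading/trailing whitespace from a course slug."""
    return slug.strip()


def pair_slugs_with_titles(archive_slugs, titles_by_slug):
    """Return (matched, unmatched).

    matched maps each cleaned archive slug to its title;
    unmatched lists the cleaned slugs that have no known title.
    """
    lookup = {normalize_slug(s): t for s, t in titles_by_slug.items()}
    matched, unmatched = {}, []
    for raw in archive_slugs:
        slug = normalize_slug(raw)
        if slug in lookup:
            matched[slug] = lookup[slug]
        else:
            unmatched.append(slug)
    return matched, unmatched


# Hypothetical sample data, not taken from the actual archive index.
archive_slugs = ["  example-course-001", "another-course-002 ", "missing-course-003"]
titles_by_slug = {
    "example-course-001": "Example Course",
    "another-course-002": "Another Course",
}
matched, unmatched = pair_slugs_with_titles(archive_slugs, titles_by_slug)
```

Anything left in `unmatched` would still need a manual look, since trimming spaces only fixes one kind of mismatch.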

Those slugs that you extracted don't seem to be correct. I wanted to match them with the ones from here: https://docs.google.com/spreadsheets/d/1kaWxZG3krI83WfdzlExW...

Whilst I hate some aspects of MOOCs, the fact is that I spent about $120 to learn the basics of SAP, whereas if I'd gotten "proper" training on the same subject matter it would have cost me thousands.

I'd love to see a site that specialises in user contributed content along the lines of Wikipedia. It's funny though - take SAP as an example: I'd be just as happy reading a book that explains it all better than what is out there right now! A book that assumes you are into technology but have few skills or little knowledge of the business processes that SAP gets involved in, and which gives you a rundown of these before giving a detailed rundown on how SAP implements these processes.

Sadly, no such thing exists, but happily for me I stumbled upon http://www.accountingverse.com/ and http://www.accountingcoach.com/ (no, I'm not affiliated with them in any way) and it turns out they didn't cost anything and I finally "get" double-entry bookkeeping and financial transaction concepts like the general ledger, journal, accrual method and the fundamental accounting equation. Wish I'd known this earlier to be honest - as I say, I lament that there are no books on SAP core modules that go from concepts to the nuts and bolts of how SAP does things :-(

Archive.org does amazing work, I would highly recommend donating to them if you can.

The actual _work_ done here was by Archive Team, which is not the same thing. Archive.org are just doing the hosting of the end results.

I have taken a couple of Coursera courses on R and Stats. They basically give you a brief outline of some topics you might want to pursue in more depth, and they give you access to a discussion forum. I haven't found this method of learning/teaching very useful. There seems to me to be a huge opportunity waiting for someone who can make a site like that but with more interactive elements AND where the learning/teaching is based on sound educational principles that can be demonstrated to result in skills mastery. As it is now, Coursera is basically skimming cash off the internet's insatiable Google searching for information. For example, someone might google "Learn R" and then fall into the trap of paying $49 for a class that consists of nothing but videos, really, without having a clue whether the videos really work to communicate knowledge or even whether they touch on anything meaningful. If it hadn't been for the "Johns Hopkins Data Science Course" branding on the class I signed up for I wouldn't have fallen for it I am sure.

> If it hadn't been for the "Johns Hopkins Data Science Course" branding on the class I signed up for I wouldn't have fallen for it I am sure.

Just to provide a counter-point.. I've taken 5 or 6 of the classes in that sequence, and have found them well worth every penny I've spent so far. Probably you could argue that the same information is available elsewhere for free, but the classes have worked for me, and the way I study and learn.

Obviously YMMV, but they've been a bargain from my perspective. I think because, if nothing else, they provide some structure, sequencing and a token measure of accountability... whereas if I just said "Hey, I'm going to teach myself R from this book" it would be a lot easier to loaf around, waste time reading HN instead of studying, etc.

That said, I don't argue against the idea that online learning could still be better. In fact, I don't think we've even come close to tapping the full potential of this stuff.

I agree with you. While they are far from the optimal way to consume material, there is always some need for alternative ways of learning, if only to address the variety of people who would like to learn. I find that I retain content better when I learn it in multiple ways (explore using MOOCs and podcasts about it, go deeper into it using books and practical work if applicable) and I'm sure it's the same for most people.

Yeah, same here. I'm doing the Johns Hopkins Data Science classes on Coursera, as well as the Duke "Statistics with R" series, but I've been supplementing that with both dead-tree books (R in Action among others) and other videos (like the Professor Leonard ones on Youtube), reading Wikipedia articles, etc., etc. Gaining a good understanding definitely involves attacking the problem from multiple angles in my approach. :-)

A complex field should be divided into multiple skills and each one should be learned in proper logical order. If a student has problems on some of those skills, appropriate videos and exercises should be presented. As it is now, there is no skill tracking and no adaptation to student needs. I'd especially like to see this done in ML, Stats, Probability and Information Theory. There are a lot of mini-skills there to be mastered, and every student has a different level of experience with each one.

How can I use this? Those files are so big. Is there any way I can download only the courses I want?

Is there a way to torrent/mirror the data from archive.org? Storing all of it in a central repository seems counter-intuitive to me.

Individual pieces have torrent links. (And there probably are scripts to fetch an entire category somewhere...)

This is a collection of torrent links of copyrighted material? Is that right?

I guess I'm asking, how is this legal?

Does anyone know if they archived the webpages, assignments, and quizzes too? Or did they just manage to download the lecture videos? I'll try downloading it myself and checking, but I don't have the fastest internet connection.

I believe they grabbed everything. Do a quick search through this script: https://github.com/ArchiveTeam/coursera-grab/blob/master/cou...

If you manage to download one of the archives, please let me know what exactly is contained in it.

I will now commit to myself and make it known here:

If I am forced to buy one of these new Coursera certs, I will donate every time to the Archive Team.

Is anyone able to navigate this at all? I'm looking for Pedro Domingos's machine learning course.


I know there are others but I really like his method of teaching, and can't seem to be able to find an archive of it.

I would pay several dollars to keep some courses I took in their original form (even archived form, no new edits or posts). I guess many other course participants would too.

That might be a source of income for Coursera, probably enough to cover operational expenses of running the old platform with old content.

Maybe I'm missing something here... but where are the actual course titles?

I think this is just the unorganized data dumps direct from the archive team. Most likely someone will now go through and combine all the dumps and make it more usable and put it at a different URL

I see. That makes sense, thanks. :)

That's something that I'm currently trying to understand myself. I haven't yet found what exactly is contained in those WARC files...

WARC files are raw recordings of crawler runs, including HTTP headers and other metadata. They are the raw, archival result of the downloads, from which you can extract the downloaded files.
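
To make that concrete, here is a minimal sketch of walking the records in such a file using only the Python standard library. It is an illustration, not the Archive Team tooling or any particular WARC library. It assumes WARC/1.0-style records (a `WARC/...` version line, `Name: value` header lines, a blank line, then `Content-Length` bytes of payload) and that `.gz` files are ordinary gzip; the function names are made up for this example.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the record headers
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def split_http_response(payload):
    """Split a 'response' record payload into (raw HTTP header bytes, body bytes)."""
    head, _, body = payload.partition(b"\r\n\r\n")
    return head, body


def iter_warc_file(path):
    """Open a .warc or .warc.gz file and yield its records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        yield from iter_warc_records(stream)
```

With this, looping over `iter_warc_file(path)` and keeping the records whose `WARC-Type` header is `response` gives you the captured HTTP responses, and `split_http_response` separates the stored HTTP headers from the downloaded file itself.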


During my time at the Internet Archive, when we were working on the Wayback Machine and related stuff, we wrote an ARC/WARC Python library to parse and unpack these files. The library is over here: https://github.com/internetarchive/warc (just in case anyone is interested).

That sounds like it could be useful.

Thanks, Noufal.

Yeah, but they claim that "URLs are directly available in the Wayback Machine too" over here: http://www.archiveteam.org/index.php?title=Coursera

What I don't get is which URLs...

I guess if you go directly to the URL for a course/video via the Wayback Machine you get the content as well?

So far, I couldn't find any that work, but maybe I'm missing something :(

Thank god. Geoffrey Hinton's RMSProp for deep neural networks is still cited in papers from his slide on his Coursera course (the only place it was published AFAIK). It would be a shame to lose that forever.

That's really amazing.

How can you tell which class it is, though? The title doesn't give it away.

It is great that someone decided to act on it.

There are many people who acted on this, but I'm glad that http://archive.org exists and hope that it never disappears or gets forced to delete content...

Couldn't sci-hub mirror this?

Welcome to encouraging bitrot.

Has this archive been licensed properly? None of the other comments that I saw brought this up, and the site doesn't appear to mention licensing whatsoever (which is very odd). I like archive.org but they've always seemed to have a "whatever" stance towards licensing and copyright. Is this sanctioned or just a wild west effort?

The Internet Archive is a library, and thus has a bit more flexibility regarding copyright issues. They also tend to have a "save it first and answer takedown notices after" philosophy. The Archive Team (not an official Internet Archive organization) doesn't give a shit about copyright, or rather they ignore it in a pragmatic effort to save more content.
