
That's something that I'm currently trying to understand myself. I haven't yet found what exactly is contained in those WARC files...



WARC files are raw recordings of crawler runs, including HTTP headers and other metadata. They are the raw, archival result of the downloads, from which you can extract the downloaded files.

https://en.wikipedia.org/wiki/Web_ARChive
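
To make that a bit more concrete, here is a rough sketch of how one record is laid out and how you could pull the downloaded payload back out using only the standard library. It assumes an uncompressed WARC 1.0 stream; the sample record, `parse_warc_records` and `http_payload` are made up for illustration and are not taken from any real crawl or library. Real archives are usually gzipped, so you would typically read them through `gzip.open` first.

```python
import io

def parse_warc_records(data: bytes):
    """Yield (headers, block) for each record in uncompressed WARC bytes."""
    stream = io.BytesIO(data)
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank lines separate consecutive records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)
        # Named header fields run until the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block

def http_payload(block: bytes) -> bytes:
    """Drop the recorded HTTP status line and headers, keep the body."""
    return block.partition(b"\r\n\r\n")[2]

# Hypothetical sample: one "response" record wrapping a tiny HTTP reply.
HTTP_BLOCK = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
SAMPLE = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(HTTP_BLOCK)).encode("ascii") + b"\r\n",
    b"\r\n",
    HTTP_BLOCK,
    b"\r\n\r\n",
])

for headers, block in parse_warc_records(SAMPLE):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], http_payload(block))
```

So for a `response` record, the "file" you downloaded is what is left after stripping both the WARC headers and the recorded HTTP headers.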


During my time at the Internet Archive, when we were working on the Wayback Machine and related projects, we wrote an ARC/WARC Python library to parse and unpack these files. The library is here, in case anyone is interested: https://github.com/internetarchive/warc



That sounds like it could be useful.

Thanks, Noufal.


Yeah, but they claim that "URLs are directly available in the Wayback Machine too" over here: http://www.archiveteam.org/index.php?title=Coursera

What I don't get is which URLs...


I guess if you go directly to the URL for a course/video via the Wayback Machine, you get the content as well?
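
For anyone who wants to try that, a small sketch of how a Wayback Machine address is put together: the original URL is appended after `https://web.archive.org/web/`, optionally preceded by a `YYYYMMDDhhmmss` timestamp. The helper name `wayback_url` and the example address are placeholders for illustration; whether a given course or video URL was actually captured is something you would still have to check by hand.

```python
def wayback_url(original_url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine address for original_url.

    timestamp is an optional YYYYMMDDhhmmss string (a prefix such as
    "2014" also works); without it the Wayback Machine picks a capture.
    """
    base = "https://web.archive.org/web"
    if timestamp:
        return "%s/%s/%s" % (base, timestamp, original_url)
    return "%s/%s" % (base, original_url)

# Placeholder URL, not a known-archived Coursera page.
print(wayback_url("http://example.com/some/video.mp4"))
print(wayback_url("http://example.com/some/video.mp4", "20140101000000"))
```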


So far, I haven't found any that work, but maybe I'm missing something :(



