
Python library to scrape Google Code and srchub - nadams
https://github.com/nadams810/codescrape
======
nadams
I wrote this in about a day, for a couple of reasons.

Google Code is shutting down and going into an archive-only mode. But I feel
like it's still useful to have your own offline copy for research and/or
archival purposes (or to archive to another service).

FLOSSmole [1] did write some tools to scrape Google Code and other services -
but the instructions were lacking on exactly what you needed to do to run them.
I wanted a no-nonsense tool that pretty much anyone could use.

One important point about this library is that it does scrape wiki pages and
issues - but for downloads and repo locations it merely presents the URLs to
you, and you need to decide what to do with them (not really hard - either curl
the releases or svn co/hg clone/git clone the repo). I did this to minimize
the crawling load.
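
To make that last step concrete, here is a minimal sketch of turning a
presented URL into the command you would run yourself. This is not part of
codescrape: the function name `fetch_command`, the `kind` labels, and the
example URLs are all made up for illustration, and it only builds the command
rather than running it.

```python
import shlex

# Hypothetical mapping from a version-control type to its checkout command.
CHECKOUT_COMMANDS = {
    "svn": ["svn", "checkout"],
    "hg": ["hg", "clone"],
    "git": ["git", "clone"],
}


def fetch_command(kind, url, dest):
    """Build the command line that would fetch `url` into `dest`.

    `kind` is "svn", "hg", "git", or "download" (a release file).
    """
    if kind == "download":
        # curl: -L follows redirects, -o writes to the named file.
        return ["curl", "-L", "-o", dest, url]
    if kind not in CHECKOUT_COMMANDS:
        raise ValueError("unknown source kind: %r" % (kind,))
    return CHECKOUT_COMMANDS[kind] + [url, dest]


if __name__ == "__main__":
    # Made-up URLs, for illustration only.
    examples = [
        ("svn", "http://example.com/svn/trunk/", "example-svn"),
        ("download", "http://example.com/files/example-1.0.zip", "example-1.0.zip"),
    ]
    for kind, url, dest in examples:
        cmd = fetch_command(kind, url, dest)
        print(" ".join(shlex.quote(part) for part in cmd))
```

To actually fetch something, the returned list can be handed to
`subprocess.run(cmd, check=True)`. Keeping this step separate from the
scraping is what lets you decide which repos and releases are worth pulling.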

Yes, there is also an export to GitHub - but again I am of the mindset that it
is important to archive data rather than just copy it to another service (I
think the Archive Team agrees with that as well [2]).

If you still aren't convinced this could be useful - I think it would be
pretty awesome if someone grabbed all the code repos and created a public
OpenGrok [3] instance to be able to do an advanced search through all the code
(it even supports regex searching). I think that would be pretty amazing to
offer as an online and offline tool.

I expanded it to support srchub - which I run. Feel free to use it against
srchub (all you have to do is replace the URL with that of the production
instance). If there is interest I would be open to expanding it to support
GitLab, GitHub, and Bitbucket.

This tool isn't perfect, as I made it quickly, but I think it does the job. I
am open to hearing about issues or problems with it. And if you think it
sucks - as the Subversion authors once said, "patches welcome".

[1] https://code.google.com/p/flossmole/source/browse/#svn%2FFLOSSmoleGoogleCode%2Fsrc

[2] http://www.archiveteam.org/index.php?title=Google_Code

[3] http://opengrok.github.io/OpenGrok/

