
GH Archive - bluu00
https://www.gharchive.org/
======
ReaLNero
I actually used GH Archive to mine GitHub data! Two notes:

\- The easiest way to access the data is using Google Cloud Platform ->
BigQuery -> githubarchive. Google lets you run SQL queries over 1 TB of the
data for free, so you can filter or aggregate the data you want, then download
it.

\- This is the sad part. GitHub data is notoriously noisy, and not really
valuable for data mining. [1] My work was on predicting GitHub collaborator
skill using open-source collaboration data. Filtering out bots and people who
use GitHub like a version of Google Drive was very difficult.

[1]:
[https://kblincoe.github.io/publications/2014_MSR_Promises_Pe...](https://kblincoe.github.io/publications/2014_MSR_Promises_Perils.pdf)
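
For anyone who would rather work from downloaded event dumps than BigQuery, here is a minimal sketch of the "filter, then aggregate" step. It assumes newline-delimited JSON events with "type" and "actor.login" fields, and the bot check (a "[bot]" login suffix) is only a crude heuristic of my own, nowhere near the filtering described above. The function names are hypothetical.

```python
import json
from collections import Counter


def is_probable_bot(login: str) -> bool:
    """Crude heuristic (assumption): GitHub App accounts end in "[bot]"."""
    return login.lower().endswith("[bot]")


def count_event_types(lines):
    """Count events per type from newline-delimited JSON, skipping probable bots."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        event = json.loads(line)
        login = event.get("actor", {}).get("login", "")
        if is_probable_bot(login):
            continue
        counts[event.get("type", "unknown")] += 1
    return counts


# Tiny made-up sample standing in for one dump file.
sample = [
    '{"type": "PushEvent", "actor": {"login": "alice"}}',
    '{"type": "PushEvent", "actor": {"login": "dependabot[bot]"}}',
    '{"type": "WatchEvent", "actor": {"login": "bob"}}',
]
print(count_event_types(sample))
```

The function accepts any iterable of lines, so a file opened with gzip.open(path, "rt") can be passed in directly. Catching the "Google Drive" style accounts would need far more than a suffix check.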

~~~
lsinger
Oh, hello there, co-author! :D

------
WayToDoor
This is from the folks that make Changelog Nightly/Weekly, a newsletter you
can subscribe to that shows which GitHub repositories were starred the most in
the last day/week. Nice work!

~~~
rcshubhadeep
Can you please give the link for subscription?

~~~
WayToDoor
[https://changelog.com/nightly](https://changelog.com/nightly) for nightly and
[https://changelog.com/weekly](https://changelog.com/weekly) for weekly.

~~~
jlgaddis
If you don't want to subscribe, you can access any "nightly version" by going
directly to:

    http://nightly.changelog.com/YYYY/MM/DD/

For example:
[http://nightly.changelog.com/2020/08/28/](http://nightly.changelog.com/2020/08/28/)
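
If you want to script this, the URL pattern above can be built from a date. This is a small sketch and the helper name nightly_url is made up. It only formats the URL and does not check that an issue exists for that day.

```python
from datetime import date

BASE_URL = "http://nightly.changelog.com"


def nightly_url(day: date) -> str:
    """Build the YYYY/MM/DD URL for a given day (zero-padded month and day)."""
    return f"{BASE_URL}/{day.strftime('%Y/%m/%d')}/"


print(nightly_url(date(2020, 8, 28)))
# -> http://nightly.changelog.com/2020/08/28/
```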

There's a browsable archive of the weekly version:
[https://changelog.com/weekly/archive](https://changelog.com/weekly/archive)

------
rsync
I archive GitHub repos, for my own purposes, into my rsync.net account:

    ssh user@rsync.net "git clone git://github.com/freebsd/freebsd.git freebsd"

~~~
mdaniel
If you haven't already considered it, you might want to add -n to avoid the
checkout (since the .git is the real value of that operation). Depending on
your objective, you might also want -r to pull down submodules, to get the
whole story about what's "in the repo".
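
For a scripted backup, the two options can be treated as independent switches when assembling the command. A rough sketch: build_clone_command is a hypothetical helper, and I'm spelling the flags out in long form (--no-checkout for -n, and --recurse-submodules on the assumption that this is what -r refers to). It only builds the argument list and does not run git.

```python
import shlex


def build_clone_command(repo_url, dest, no_checkout=True, recurse_submodules=False):
    """Assemble a git clone argument list; nothing is executed here."""
    cmd = ["git", "clone"]
    if no_checkout:
        cmd.append("--no-checkout")  # keep only the .git data, skip the working tree
    if recurse_submodules:
        cmd.append("--recurse-submodules")  # also fetch submodules
    cmd += [repo_url, dest]
    return cmd


cmd = build_clone_command("git://github.com/freebsd/freebsd.git", "freebsd")
# shlex.join gives a string suitable for passing to ssh as the remote command.
print(shlex.join(cmd))
```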

------
bethecloud
There is a cool, similar project that stores a mirror of GitHub on the
decentralized cloud (STORJ):
[https://gitbackup.org/#/](https://gitbackup.org/#/)

------
blindm
What I don't understand about this:

> GH Archive is a project to record the public GitHub timeline

What does that even mean? So it's basically a massive mirror of GitHub? Isn't
that an enormous undertaking?

~~~
ReaLNero
There are a lot of events generated by users on GitHub. I don't think they
include the object blobs, so the timeline is actually pretty small (I think
the dataset was in the terabytes).

~~~
blindm
Oh thanks for clarifying. And here I was thinking this was a capture of all
the binary blobs, which would be massive!

