
Making open source data more available - taylorwc
https://github.com/blog/2201-making-open-source-data-more-available
======
minimaxir
The GitHub Archive dataset was updated as well. Example BigQuery query to get
the top repositories from 2015 through 2016 YTD, by the number of stars given
during that time:

    
    
       SELECT repo.id, repo.name, COUNT(*) as num_stars
       FROM TABLE_DATE_RANGE([githubarchive:day.], TIMESTAMP('2015-01-01'), TIMESTAMP('2016-12-31'))
       WHERE type = "WatchEvent"
       GROUP BY repo.id, repo.name
       ORDER BY num_stars DESC
       LIMIT 1000
    

Which results in this output:
[https://docs.google.com/spreadsheets/d/16yDS2wDdDOTxjVsjGvWm...](https://docs.google.com/spreadsheets/d/16yDS2wDdDOTxjVsjGvWmpHVsOIU65wLEjXFHDtDeKU4/edit?usp=sharing)

Since the query only touches 3 columns, it only scans 15.4 GB of data (out of
the 1 TB monthly free allowance).

More information on the GitHub Archive changes:
[https://medium.com/@hoffa/github-archive-fully-updated-
notic...](https://medium.com/@hoffa/github-archive-fully-updated-notice-some-
breaking-changes-64e7e7cd0967)

~~~
bduerst
GitHub Easter Egg hunt query:

    
    
        SELECT
          CONCAT("https://github.com/",repo_name,"/blob/master/",path) AS file_url,
        FROM
          [bigquery-public-data:github_repos.files]
        WHERE
          id IN (SELECT id FROM [bigquery-public-data:github_repos.contents]
          WHERE NOT binary AND LOWER(content) CONTAINS 'easter egg')
          and path not like "%.csv"
        GROUP BY 1
        LIMIT 1000
    

36s elapsed, 1.79 TB (so not free). Using github_repos.sample_files and
github_repos.sample_contents only costs 31 GB (free) but not as many easter
eggs :)

~~~
Goopplesoft
Just looked at their pricing. ~$9 query right there.

[https://cloud.google.com/bigquery/pricing](https://cloud.google.com/bigquery/pricing)
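
At the on-demand rate on that page ($5 per TB scanned, after the first free TB
each month), the 1.79 TB scan above does work out to roughly $9:

    
    
       1.79 TB * $5/TB ≈ $8.95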

~~~
bduerst
Good point. Here's the results if anyone wants them:

[https://raw.githubusercontent.com/bduerst/GithubEasterEgg/ma...](https://raw.githubusercontent.com/bduerst/GithubEasterEgg/master/EasterEggs.csv)

------
cdibona
Please note, if you are wondering where your project is: we only archived open
source projects to BigQuery. So add an open source license, for reals!

~~~
jlgaddis
Out of curiosity, how do you automatically (programmatically) "detect" that a
project (repo) is open source? The presence of a _LICENSE_ file that contains
the text of the BSD|GPL|MIT|... license?

~~~
fhoffa
GitHub licenses API

[https://developer.github.com/v3/licenses/](https://developer.github.com/v3/licenses/)
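
If memory serves, that detection also feeds a licenses table in the public
github_repos dataset itself (repo_name and license columns), so the breakdown
per license is one small query away, along these lines:

    
    
       SELECT license, COUNT(*) AS num_repos
       FROM [bigquery-public-data:github_repos.licenses]
       GROUP BY license
       ORDER BY num_repos DESC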

------
fhoffa
I'm compiling all links and tips I can find at:

\- [https://medium.com/@hoffa/github-on-bigquery-analyze-all-
the...](https://medium.com/@hoffa/github-on-bigquery-analyze-all-the-
code-b3576fd2b150)

The Changelog also invited us to record a podcast with Arfon Smith (GitHub),
Will Curran (Google), and me (Google) -
[https://changelog.com/209/](https://changelog.com/209/)

Happy to answer any questions!

------
chickenbane
Here's the announcement from Google

[http://google-opensource.blogspot.com/2016/06/github-on-
bigq...](http://google-opensource.blogspot.com/2016/06/github-on-bigquery-
analyze-all-code.html)

------
seltzered_
FWIW, the author (Arfon Smith) gave a recent Microsoft Research talk on GitHub
and open collaboration for the scientific community:
[https://www.youtube.com/watch?v=7XOuJFwy270](https://www.youtube.com/watch?v=7XOuJFwy270)

------
sqs
This is super cool. If you want to benefit from this info in your workflow
now, we have analyzed some of this same data at Sourcegraph, and you can see
(e.g.) all the repos that call http.NewRequest in Go
([https://sourcegraph.com/github.com/golang/go/-/info/GoPackag...](https://sourcegraph.com/github.com/golang/go/-/info/GoPackage/net/http/-/NewRequest?refs=all))
or lots of usages of Joda-Time DateTime in Java
([https://sourcegraph.com/github.com/JodaOrg/joda-
time/-/info/...](https://sourcegraph.com/github.com/JodaOrg/joda-
time/-/info/JavaArtifact/joda-time/joda-
time/-/org/joda/time/DateTime:type?refs=all)). You can search for
functions/types/etc. on Sourcegraph to cross-reference them globally. We're
working on an API to provide this information to anyone else who wants it;
email me at sqs@sourcegraph.com if you are interested in using it.

------
kozikow
Top emacs packages used in GitHub repos:
[https://kozikow.wordpress.com/2016/06/29/top-emacs-
packages-...](https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-
used-in-github-repos/)

------
hut8
I made an interface to some of this which can be used for finding a single
user's contributions over time:
[https://githubcontributions.io](https://githubcontributions.io)

------
kozikow
sample_contents only lists the contents of a 10% sample of all files. Scanning
the full data set may be hard for people new to BigQuery. I managed to query
the full data set in [https://kozikow.wordpress.com/2016/06/29/top-emacs-
packages-...](https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-
used-in-github-repos/). "Resources exceeded during query execution" errors are
especially hard to debug, as they can mean any of several things caused
BigQuery to run out of memory.

Some BigQuery tricks to make it work:

    
    
      - TOP/COUNT is faster and more memory-efficient than GROUP BY/ORDER.
      - Filtering data in a sub-query prior to a join reduces memory usage.
      - Regexps and globs are expensive; use LEFT/RIGHT as a faster alternative.
      - Avoid reading all file contents, to stay under the 1 TB free scan limit: only access contents after filtering on paths.
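
As a sketch of the last two tricks combined (same legacy SQL dialect as the
queries above; the .el path filter is only an illustration): filter paths
cheaply with RIGHT in a sub-query before joining against the much larger
contents table:

    
    
       SELECT f.path
       FROM [bigquery-public-data:github_repos.sample_contents] c
       JOIN (
         SELECT id, path
         FROM [bigquery-public-data:github_repos.sample_files]
         WHERE RIGHT(path, 3) = '.el'
       ) f
       ON c.id = f.id
       WHERE NOT c.binary
       LIMIT 100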

------
dikaiosune
I'm curious how frequently this will be updated. It'd be nice to set up
weekly/monthly queries which show updated information.

~~~
fhoffa
Weekly for now (fingers crossed)

------
sergames
I created an open source webservice that serves data in JSON format:
[http://getjson.info](http://getjson.info)

Hope you find it useful!

------
AznHisoka
Is there a way to get all Ask HN posts from the HN dataset?

~~~
minimaxir
Something along the lines of:

    
    
       SELECT id, title
       FROM [bigquery-public-data:hacker_news.full_201510]
       WHERE title CONTAINS "Ask HN" AND url=""
       LIMIT 1000
    

Output:
[https://docs.google.com/spreadsheets/d/12HZ2DqkR_nl380bpxM0B...](https://docs.google.com/spreadsheets/d/12HZ2DqkR_nl380bpxM0BLgfVfwCzb3l7hBSFKjlPDuw/edit?usp=sharing)

------
fiatjaf
[https://scholar.google.com/scholar?hl=en&q=github&btnG=&as_s...](https://scholar.google.com/scholar?hl=en&q=github&btnG=&as_sdt=1%2C32&as_sdtp=)

People take their time "studying" everything these days, don't they?

I can't imagine what that would look like if the State weren't paying them to
do it.

