Hacker News
Making open source data more available (github.com)
254 points by taylorwc 532 days ago | 24 comments



The GitHub Archive dataset was updated as well. Here's an example BigQuery query to get the top repositories from 2015 through 2016 YTD, ranked by the number of stars given during that time (stars appear as WatchEvent in the GitHub event stream):

   SELECT repo.id, repo.name, COUNT(*) AS num_stars
   FROM TABLE_DATE_RANGE([githubarchive:day.], TIMESTAMP('2015-01-01'), TIMESTAMP('2016-12-31'))
   WHERE type = "WatchEvent"
   GROUP BY repo.id, repo.name
   ORDER BY num_stars DESC
   LIMIT 1000
Which results in this output: https://docs.google.com/spreadsheets/d/16yDS2wDdDOTxjVsjGvWm...

Since the query only touches 3 columns, it scans only 15.4 GB of data (out of the 1 TB monthly free allowance).

More information on the GitHub Archive changes: https://medium.com/@hoffa/github-archive-fully-updated-notic...


GitHub Easter egg hunt query:

    SELECT
      CONCAT("https://github.com/", repo_name, "/blob/master/", path) AS file_url
    FROM
      [bigquery-public-data:github_repos.files]
    WHERE
      id IN (SELECT id FROM [bigquery-public-data:github_repos.contents]
      WHERE NOT binary AND LOWER(content) CONTAINS 'easter egg')
      AND path NOT LIKE "%.csv"
    GROUP BY 1
    LIMIT 1000
36 s elapsed, 1.79 TB scanned (so not free). Using github_repos.sample_files and github_repos.sample_contents costs only 31 GB (free), but yields fewer easter eggs :)


Just looked at their pricing. At $5 per TB scanned, that's a ~$9 query right there.

https://cloud.google.com/bigquery/pricing


Good point. Here are the results if anyone wants them:

https://raw.githubusercontent.com/bduerst/GithubEasterEgg/ma...


On my phone, but what if you put a LIMIT 2000 inside the .contents subquery? (BigQuery might exit early and touch less data.)


Changing the return limit does nothing: BigQuery is a columnar store, and queries are billed on the bytes in the columns they reference, regardless of LIMIT.


Why use BigQuery when GitHub code search is free?


Please note, if you are wondering where your project is: we only archived open source projects to BigQuery. So add an open source license, for real.


Out of curiosity, how do you automatically (programmatically) detect that a project (repo) is open source? The presence of a LICENSE file containing the text of a BSD/GPL/MIT/... license?
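For what it's worth, the public dataset itself hints at the answer: it ships a `licenses` table mapping each repo_name to a detected license keyword (my assumption is that it's populated by GitHub's own license detection). A quick tally, as a sketch:

```sql
SELECT license, COUNT(*) AS repos
FROM [bigquery-public-data:github_repos.licenses]
GROUP BY license
ORDER BY repos DESC
```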



I'm compiling all links and tips I can find at:

- https://medium.com/@hoffa/github-on-bigquery-analyze-all-the...

The Changelog also invited us to record a podcast with Arfon Smith (GitHub), Will Curran (Google), and me (Google) - https://changelog.com/209/

Happy to answer any questions!



FWIW, the author (Arfon Smith) gave a recent Microsoft Research talk on GitHub and open collaboration for the scientific community: https://www.youtube.com/watch?v=7XOuJFwy270


This is super cool. If you want to benefit from this info in your workflow now, we have analyzed some of this same data at Sourcegraph, and you can see (e.g.) all the repos that call http.NewRequest in Go (https://sourcegraph.com/github.com/golang/go/-/info/GoPackag...) or lots of uses of Joda-Time DateTime in Java (https://sourcegraph.com/github.com/JodaOrg/joda-time/-/info/...). You can search for functions/types/etc. on Sourcegraph to cross-reference them globally. We're working on an API to provide this information to anyone else who wants it; email me at sqs@sourcegraph.com if you are interested in using it.


Top Emacs packages required in GitHub repos: https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-...


I made an interface to some of this which can be used for finding a single user's contributions over time: https://githubcontributions.io


sample_contents only lists the contents of a 10% sample of all files. Scanning the full data set may be hard for people new to BigQuery; I managed to query it in https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-... . "Resources exceeded during query execution" errors are especially hard to debug, as they can mean any of several things that caused BigQuery to run out of memory.

Some BigQuery tricks to make it work:

  - TOP/COUNT is faster and more memory-efficient than GROUP BY/ORDER BY.
  - Filtering data in a sub-query before a join reduces memory usage.
  - Regexps and globs are expensive; use LEFT/RIGHT as a faster alternative.
  - Avoid reading all file contents, to stay under the 1 TB free scan limit: only access content after filtering on paths.
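Combining the last few tips might look something like this in legacy BigQuery SQL (a sketch over the free sample tables from upthread; the `.el` path filter and the 'use-package' search string are illustrative assumptions):

```sql
SELECT f.repo_name, f.path
FROM [bigquery-public-data:github_repos.sample_contents] c
JOIN (
  -- Filter paths in a sub-query first (using RIGHT instead of a
  -- regexp/glob), so only matching file ids reach the join.
  SELECT id, repo_name, path
  FROM [bigquery-public-data:github_repos.sample_files]
  WHERE RIGHT(path, 3) = '.el'
) f
ON f.id = c.id
WHERE c.content CONTAINS 'use-package'
LIMIT 100
```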


I'm curious how frequently this will be updated. It'd be nice to set up weekly/monthly queries which show updated information.


Weekly for now (fingers crossed)


I created an open source web service that serves the data in JSON format: http://getjson.info

Hope you find it useful.


Is there a way to get all Ask HN posts from the HN dataset?


Something along the lines of:

   SELECT id, title
   FROM [bigquery-public-data:hacker_news.full_201510]
   WHERE title CONTAINS "Ask HN" AND url=""
   LIMIT 1000
Output: https://docs.google.com/spreadsheets/d/12HZ2DqkR_nl380bpxM0B...


Here is a simple query to get titles from the stories table:

  SELECT title
  FROM [bigquery-public-data:hacker_news.stories]
  WHERE title LIKE '%Ask HN%'
  LIMIT 1000


https://scholar.google.com/scholar?hl=en&q=github&btnG=&as_s...

People take their time "studying" everything these days, don't they?

I can't imagine what that would look like if the state weren't paying them to do it.



