SELECT repo.id, repo.name, COUNT(*) as num_stars
FROM TABLE_DATE_RANGE([githubarchive:day.], TIMESTAMP('2015-01-01'), TIMESTAMP('2016-12-31'))
WHERE type = "WatchEvent"
GROUP BY repo.id, repo.name
ORDER BY num_stars DESC
Since the query touches only 3 columns, it scans only 15.4 GB of data (out of the 1 TB free monthly allowance).
More information on the GitHub Archive changes: https://medium.com/@hoffa/github-archive-fully-updated-notic...
SELECT CONCAT("https://github.com/", repo_name, "/blob/master/", path) AS file_url
FROM [bigquery-public-data:github_repos.files]
WHERE id IN (SELECT id FROM [bigquery-public-data:github_repos.contents]
  WHERE NOT binary AND LOWER(content) CONTAINS 'easter egg')
AND path NOT LIKE "%.csv"
GROUP BY 1
The Changelog also invited us to record a podcast with Arfon Smith (GitHub), Will Curran (Google), and me (Google) - https://changelog.com/209/
Happy to answer any questions!
Some BigQuery tricks to make it work:
- TOP/COUNT is faster and more memory-efficient than GROUP BY/ORDER BY.
- Filtering data in a sub-query before the join reduces memory usage.
- Regexps and globs are expensive; use LEFT/RIGHT as a faster alternative.
- Avoid reading all file contents, to stay under the 1 TB free scan limit. Only access file contents after filtering down the paths.
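To illustrate the first and third tricks in legacy BigQuery SQL (the table and the `src/` path prefix here are just examples, not from the thread):

```sql
-- GROUP BY / ORDER BY version:
SELECT repo_name, COUNT(*) AS c
FROM [bigquery-public-data:github_repos.files]
WHERE LEFT(path, 4) = 'src/'   -- cheaper than path LIKE 'src/%' or a regexp
GROUP BY repo_name
ORDER BY c DESC
LIMIT 10

-- TOP/COUNT version: same result, typically faster and lighter on memory:
SELECT TOP(repo_name, 10), COUNT(*)
FROM [bigquery-public-data:github_repos.files]
WHERE LEFT(path, 4) = 'src/'
```

TOP(field, k) in legacy SQL fuses the group-count-sort-limit pipeline into one operation, which is why it tends to beat the explicit GROUP BY/ORDER BY form.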
Hope you find it useful!
SELECT id, title FROM [bigquery-public-data:hacker_news.stories]
WHERE title CONTAINS "Ask HN" AND url=""
SELECT title FROM [bigquery-public-data:hacker_news.stories]
where title like '%Ask HN%' LIMIT 1000
People take their time "studying" everything these days, don't they?
I can't imagine what that would be like if the State weren't paying them to do it.