Here is an example using the playground from this article to calculate pull request latency (i.e. how much time a user needs after forking to submit a PR) -
https://tinyurl.com/github-popular-repo-pr-latency
(I won't post the full query here, since it is not that compact, but it is not complex.)
It does not match the 2012 claim:
>80%+ pull requests come in within 1 day of the fork
But the new query calculates statistics only for popular repos.
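A minimal sketch of the idea (not the exact linked query, which also filters to popular repos), assuming the github_events table and columns (event_type, action, actor_login, repo_name, created_at) described in the article:

    -- For each (repo, user) pair, take the first fork and the first opened PR,
    -- then see how many PRs arrive within one day of the fork.
    SELECT
        quantile(0.5)(pr_time - fork_time) AS median_latency_seconds,
        round(countIf(pr_time - fork_time <= 86400) / count(), 3) AS share_within_one_day
    FROM
    (
        SELECT repo_name, actor_login, min(created_at) AS fork_time
        FROM github_events
        WHERE event_type = 'ForkEvent'
        GROUP BY repo_name, actor_login
    ) AS forks
    INNER JOIN
    (
        SELECT repo_name, actor_login, min(created_at) AS pr_time
        FROM github_events
        WHERE event_type = 'PullRequestEvent' AND action = 'opened'
        GROUP BY repo_name, actor_login
    ) AS prs USING (repo_name, actor_login)
    WHERE pr_time > fork_time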
This is cool! If you hit the pencil link on the query results in the article, it pops you into a browser where you can edit and run GitHub queries yourself.
You write: "all the events in all GitHub repositories since 2011"
I have a couple of questions:
When was the data loaded?
Will the data in https://gh-api.clickhouse.tech/play be updated?
The downloadable dataset was created two days ago and the queries were run on this dataset.
Data on https://gh-api.clickhouse.tech/play (for interactive queries) is updated every hour, as described in the article, but the downloadable .xz datasets are not.
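If you want to check the freshness yourself, a one-liner against the github_events table from the article does it:

    -- The most recent event timestamp shows how fresh the loaded data is.
    SELECT max(created_at) FROM github_events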
Yes, the 80 vCPUs are logical cores, but by default ClickHouse uses a number of threads equal to the number of physical cores (40 threads).
This is a reasonable default: setting the number of threads to the full vCPU count makes a single query slightly faster, but it badly affects the maximum number of concurrent queries.
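For anyone who does want to trade concurrency for single-query speed, the default can be overridden per query with the standard max_threads setting; the query itself is just an illustration:

    -- Let one heavy query use all 80 logical cores instead of the default 40.
    SELECT count() FROM github_events SETTINGS max_threads = 80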