Hacker News
Everything You Always Wanted to Know About GitHub (But Were Afraid to Ask) (clickhouse.tech)
48 points by zX41ZdbW on Dec 9, 2020 | 13 comments



There was a [presentation](https://www.igvita.com/slides/2012/bigquery-github-strata.pd...) from the creator of gharchive.org (a GitHub employee, I guess) about open source metrics based on GitHub data.

Here is an example using the playground from this article to calculate pull request latency (i.e. how much time a user needs after a fork to submit a PR): https://tinyurl.com/github-popular-repo-pr-latency

(I won't post the full query here, since it is not that compact, but it is not complex.)

The result does not match the 2012 figure:

>80%+ pull requests come in within 1 day of the fork

But the new query calculates statistics only for popular repos.
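
For anyone who wants to see the idea without opening the playground, here is a minimal sketch of the same kind of calculation in Python. It is not the query behind the link. The sample rows and the helper names `pr_latencies` and `share_within` are made up for illustration, and it assumes latency means the time from a user's first fork of a repo to their first later pull request event.

```python
from datetime import datetime, timedelta

# Hypothetical sample events: (actor, repo, event_type, created_at).
# Illustrative data only; this is not the schema of the real dataset.
EVENTS = [
    ("alice", "org/repo", "ForkEvent", datetime(2020, 12, 1, 10, 0)),
    ("alice", "org/repo", "PullRequestEvent", datetime(2020, 12, 1, 15, 0)),
    ("bob", "org/repo", "ForkEvent", datetime(2020, 12, 1, 9, 0)),
    ("bob", "org/repo", "PullRequestEvent", datetime(2020, 12, 4, 9, 0)),
    ("carol", "org/repo", "ForkEvent", datetime(2020, 12, 2, 8, 0)),
]


def pr_latencies(events):
    """Time from each user's first fork of a repo to their first later PR."""
    first_fork = {}
    first_pr = {}
    # Walk the events in time order so "first" really means earliest.
    for actor, repo, kind, ts in sorted(events, key=lambda e: e[3]):
        key = (actor, repo)
        if kind == "ForkEvent":
            first_fork.setdefault(key, ts)
        elif kind == "PullRequestEvent" and key in first_fork:
            # Only PRs that come after a fork by the same user are counted.
            first_pr.setdefault(key, ts)
    return {key: first_pr[key] - first_fork[key] for key in first_pr}


def share_within(latencies, limit=timedelta(days=1)):
    """Fraction of fork-to-PR latencies that are at most `limit`."""
    if not latencies:
        return 0.0
    return sum(1 for d in latencies.values() if d <= limit) / len(latencies)


latencies = pr_latencies(EVENTS)
print(share_within(latencies))
```

Users who forked but never opened a PR (carol in the sample) are left out of the denominator, which is one of several choices that could move the percentage.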


"Pull panda" for your repository:

https://gh-api.clickhouse.tech/play?user=play#c2VsZWN0IGNvdW...

(replace the condition on repo_name with your favorite repo)


Something more complex: how much time it takes to merge non-trivial PRs

https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUCiAgIC...


This is cool! If you hit the pencil link on query results in the article, it pops you into a browser where you can edit and run GitHub queries yourself.


Amazing!

You write: "all the events in all GitHub repositories since 2011". I have a question: when was the data loaded? Will the data in https://gh-api.clickhouse.tech/play be updated?


The downloadable dataset was created two days ago and the queries were run on this dataset.

Data on https://gh-api.clickhouse.tech/play (for interactive queries) is updated every hour as described in the article, but downloadable datasets in .xz are not updated.



Most Linux kernel development happens on mailing lists, and pull requests on GitHub are just for lulz.


This is absolutely brilliant. Great work!!!


What hardware does it run on? Is it a single server or a cluster?


It is a single cloud instance (Yandex Cloud, 80 vCPU Intel Cascade Lake, network storage).


The query below says there are 40 threads, not 80, and RAM for ClickHouse is limited to 20GB =)

select * from system.settings where name IN ('max_memory_usage', 'max_threads'); -- 40 threads, 20GB RAM limit


Yes, the 80 vCPUs are logical cores, but by default ClickHouse uses a number of threads equal to the number of physical cores (40 threads).

This is a reasonable default: setting the number of threads to the full vCPU count makes a single query slightly faster, but it badly reduces the maximum number of concurrent queries.
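
A rough way to picture the trade-off, as a back-of-the-envelope sketch in Python rather than a description of how ClickHouse schedules work: assume every query uses exactly its thread limit, and count how many queries fit before threads outnumber logical cores. The function name `queries_without_oversubscription` is hypothetical; the numbers 80 and 40 are the ones from this thread.

```python
def queries_without_oversubscription(logical_cores: int, threads_per_query: int) -> int:
    """How many queries can run at once before threads outnumber logical cores.

    Simplified model: every query is assumed to use its full thread limit.
    """
    if threads_per_query <= 0:
        raise ValueError("threads_per_query must be positive")
    return logical_cores // threads_per_query


LOGICAL_CORES = 80   # vCPU count mentioned in the thread
PHYSICAL_CORES = 40  # default thread count reported by the settings query

for threads in (PHYSICAL_CORES, LOGICAL_CORES):
    # Prints the per-query thread limit and how many such queries fit at once.
    print(threads, queries_without_oversubscription(LOGICAL_CORES, threads))
```

Under this model, 40 threads per query leaves room for two queries at once, while 80 threads per query leaves room for one.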



