“The ‘average person eats 3 spiders a year’ factoid is actually just statistical error. The average person eats 0 spiders per year. Spiders Georg, who lives in a cave & eats over 10,000 each day, is an outlier and should not have been counted.”
Of course, but given that there are 1.2 million repositories with makefiles, the listed average would mean that they contain 1.8 billion makefiles in total. I doubt that’s the case.
Also, according to the page, only 59 million of all files are named “Makefile”.
I suspect that the file language recognition has major false positives for the Makefile language.
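For what it's worth, here is a quick back-of-the-envelope check of those numbers. It uses only the figures quoted in this comment; the variable names are mine, and the comparison is rough because files detected as the Makefile language need not be named “Makefile”.

```python
# Figures as quoted in the comment above (not re-measured).
repos_with_makefiles = 1_200_000          # repositories containing makefiles
implied_total_makefiles = 1_800_000_000   # total implied by the listed average
files_named_makefile = 59_000_000         # files literally named "Makefile"

# Average makefiles per repository that the total implies.
implied_average = implied_total_makefiles / repos_with_makefiles

# How many times larger the implied total is than the count of
# files actually named "Makefile".
overshoot = implied_total_makefiles / files_named_makefile

print(f"implied average per repo: {implied_average:,.0f}")
print(f"implied total vs. files named Makefile: {overshoot:.1f}x")
```

An implied average in the thousands per repository, and a total roughly thirty times the number of files named “Makefile”, is what makes the false-positive explanation plausible.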
Yeah, this clearly seems like a bug or some repo skewing the results. Also a good reminder of why the median should often be shown alongside average values.
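A minimal sketch of that effect, using made-up numbers (the per-repo counts below are hypothetical, not taken from the data set): one outlier repository is enough to drag the mean far away from what a typical repo looks like, while the median barely notices.

```python
import statistics

# Hypothetical makefile counts for ten repositories: nine ordinary
# repos with one makefile each, plus a single "Spiders Georg" repo.
makefiles_per_repo = [1] * 9 + [14_991]

mean_count = statistics.mean(makefiles_per_repo)      # pulled up by the outlier
median_count = statistics.median(makefiles_per_repo)  # reflects the typical repo

print(f"mean:   {mean_count}")
print(f"median: {median_count}")
```

With these numbers the mean comes out at 1500 while the median stays at 1, which is why showing both makes a skewed distribution obvious at a glance.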
I also noticed "knob" and "xxx" near the top of that list (and others). I understand why they are considered "offensive" in some contexts, but in code they are quite inoffensive (unless, of course, the context is offensive, haha).
>>> If someone wants to host the raw files to allow others to download it let me know. It is a 83 GB tar.gz file which uncompressed is just over 1 TB in size.
Does anyone know if it is possible to download a similar data set for YouTube and Reddit? I have ideas for a search engine based on it, but I don't want to write and maintain scraper scripts.
https://news.ycombinator.com/item?id=21121735 (80 comments)