Processing 40 TB of code from ~10M projects with a server and Go for $100 (2019) (boyter.org)
61 points by KolmogorovComp on Oct 3, 2022 | hide | past | favorite | 13 comments




Oops, I linked to an anchor in the body and cannot edit. Correct link: https://boyter.org/posts/an-informal-survey-of-10-million-gi...


Average number of Makefiles is 1518.1769808494607? That can’t be right?


"The 'average person eats 3 spiders a year' factoid is actually just statistical error. The average person eats 0 spiders per year. Spiders Georg, who lives in a cave & eats over 10,000 each day, is an outlier and should not have been counted."


Of course. However, given that there are 1.2 million repositories with Makefiles, the listed average would mean they contain 1.8 billion Makefiles in total. I doubt that's the case.

Also, according to the page, only 59 million of all files are named “Makefile”.

I suspect that the file language recognition has major false positives for the Makefile language.


Yeah, this clearly seems like a bug, or some repo skewing the results. It's also a good reminder of why the median should often be shown alongside average values.


Does anyone know why "pawn" is considered an offensive word? It seems likely it ranks highly on the offensive word list because of chess engines.


I also noticed "knob" and "xxx" near the top of that list (among others). I understand why they are considered "offensive" in some contexts, but in code they are quite non-offensive (unless, of course, the context is offensive, haha).


Phonetically it sounds like "porn".


>>> If someone wants to host the raw files to allow others to download it let me know. It is a 83 GB tar.gz file which uncompressed is just over 1 TB in size.

Does anyone know if it is possible to download a similar data set for YouTube and Reddit? I have ideas for a search engine based on it, but I don't want to write/maintain scraper scripts.


There's a very large dataset of Reddit posts and comments at https://files.pushshift.io/reddit/


I have most of the historical Reddit data (except for about a year) that I use to train ML models. Let me see if I can find a public link for you...


It would be interesting to plot some of the statistics by age of repository.




