
Processing 40TB of code from 10M projects with a dedicated server and Go - boyter
https://boyter.org/posts/an-informal-survey-of-10-million-github-bitbucket-gitlab-projects/
======
reacharavindh
> I actually wrote a Python solution at first, but having to install the pip
> dependencies on my clean varnish box seemed like a bad idea and it kept
> breaking in odd ways which I didn't feel like debugging.

Amen! This is why I am learning Go at the moment and considering using it
instead of Python for admin and data processing tasks on a fleet of servers.
The single binary deployment makes it a lot easier for users to adopt. Python
misses out on a lot of use because of the inability to do this. And No! I do
not want to pip install a lot of stuff on the servers just to be able to run
this script once. Heck, some of these servers don't even have access to public
internet to be able to pip install whatever.

Yes, I have looked into PyInstaller and Nuitka. They threw up errors indicative
of deeper issues that I didn't feel were a good use of my time to debug. I'd
rather choose a language that treats single-binary deployment as a
priority/design goal instead.

~~~
reubenmorais
I don't get it... Can't you just use a virtualenv? Then there's no worry about
namespace pollution or conflicts between dependencies of different projects.

~~~
skocznymroczny
virtualenv still requires internet access. You'd have to create a venv on the
destination machine and pip install your packages into it. You can't scp a
local venv onto the destination machine and have it work.
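
The breakage is mechanical, not incidental: a venv records absolute paths at creation time, so copying it to another box leaves those paths dangling. A minimal stdlib-only sketch of where the machine-specific state lives:

```python
import os
import tempfile
import venv

# A venv is tied to the machine it was created on: pyvenv.cfg records the
# absolute path of the base interpreter, and entry-point scripts (pip, etc.)
# hard-code the venv's python path in their shebang lines.
tmp = tempfile.mkdtemp()
venv.create(tmp, with_pip=False)  # with_pip=False: fast, no network needed

with open(os.path.join(tmp, "pyvenv.cfg")) as f:
    cfg = f.read()
print(cfg)  # "home = /usr/bin" (or similar) -- only valid on this machine
```

After an scp, that `home` path and the shebangs point at wherever the *source* machine kept things, which is why the copied venv falls over.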

~~~
twic
I am an utter noob at Python packaging, but I would like to learn more, so
excuse the basic question: why doesn't it work? And is there anything in the
ecosystem which is like that, but does work?

I had heard that the Python packaging and deployment story had got a lot
better in recent years, but this sounds like it still falls far short of table
stakes.

~~~
weberc2
I'm not sure why the Python community can't get it sorted out. We've been
using Pipenv because it's one of a few tools that supports
lockfiles/reproducible builds, and at the time we chose it, the Python
packaging authority was advertising it as the official solution for Python
packaging; however, after we were already invested in it, they backpedaled
because the community realized it was very buggy and super slow (as in "it
takes literally half an hour to add, remove, or update any dependency").

We're finally getting around to migrating away, and we've settled on Twitter's
github.com/pantsbuild/pants, which is like Google's Bazel, only jankier in
every way _except_ that Pants supports Python while Bazel doesn't (Bazel
advertises Python support, but it doesn't work, as the many Python 3 issues in
their issue tracker attest). The nicest thing about Pants is
that it builds pex files (github.com/pantsbuild/pex) which are executable zip
files that ship all of your dependencies except the Python interpreter and
shared objects.

I'm still not very satisfied with this solution, and it's still far, far worse
than Go's dependency management and packaging model, but it's a dramatic
improvement over Pipenv, virtualenv, etc (gives us both reproducibility and
performance).
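
For anyone unfamiliar with pex files: the underlying trick, a zip archive with a `__main__.py` that an interpreter can execute directly, exists in the stdlib as `zipapp`. A rough sketch of the mechanism (this shows the general idea, not Pants' actual build pipeline):

```python
import os
import subprocess
import sys
import tempfile
import zipapp

# Build a tiny "application" directory, then pack it into a single
# executable zip; zipapp is the stdlib analogue of what pex produces.
src = tempfile.mkdtemp()
with open(os.path.join(src, "__main__.py"), "w") as f:
    f.write("print('hello from a zipapp')\n")

target = os.path.join(tempfile.mkdtemp(), "app.pyz")
zipapp.create_archive(src, target, interpreter="/usr/bin/env python3")

# One file to scp around; it runs anywhere a Python interpreter exists.
out = subprocess.run([sys.executable, target], capture_output=True, text=True)
print(out.stdout.strip())  # hello from a zipapp
```

pex goes further by bundling resolved third-party dependencies into the archive, but the deployment story is the same: one file, interpreter on the target, done.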

------
Groxx
First I've heard of "Taco Bell Programming"[1]. I _love_ it, this needs to
become a normal part of my lexicon. Also heck yes, it completely mirrors my
experience - a few dozen tools combine in fairly simple ways to make absurdly
useful results. The problem is finding the useful N-dozen elemental tools
you'll use.

[1]: [http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html](http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html)

~~~
john-radio
> I could have done the whole thing Taco Bell style if I had only manned up
> and broken out sed, but I pussied out and wrote some Python.

And with one dumb word choice, I suddenly can't share this otherwise good
piece with most audiences.

~~~
homonculus1
Seriously? That's no harsher language than calling someone a dick. Do you have
to keep an ultra-clinical PR image for some reason or are we really that
puritanical?

~~~
john-radio
The networks that I belong to and benefit from are that "puritanical," if you
want to choose a really stupid word for avoiding insulting half the population
for no reason, yeah.

------
robbya
> If someone wants to host the raw files to allow others to download it let me
> know. It is a 83 GB tar.gz file which uncompressed is just over 1 TB in
> size.

Some of the cloud providers have free hosting for public data sets (people who
use the data incur cost to download/process the data). I'm not sure if this
would qualify.

* [https://aws.amazon.com/opendata/public-datasets/](https://aws.amazon.com/opendata/public-datasets/)

* [https://azure.microsoft.com/en-us/services/open-datasets/](https://azure.microsoft.com/en-us/services/open-datasets/)

~~~
lunixbochs
I similarly have a 127GB tar.xz of about 2TB of code/files from the top GitHub
repositories (filtered by highest starred repositories per primary language)
as of around August 2017 if anyone wants it or wants to host it.

------
ojkelly
As much as I love the confirmation that .yml is the correct extension, I’m
most amazed by the fact that there’s over a TRILLION lines of code public on
GitHub.

That’s an astronomical number. And that’s only what we can all see. Can’t
imagine how much more is private.

Also, it looks like 20% of code is comments, which feels just about right.

~~~
brazzledazzle
People actually bike shed over .yaml vs .yml? If you thought tabs vs spaces
was a useless debate...

~~~
pixelbash
Never been comfortable with extensions longer than 3 letters for some
reason...

~~~
wazoox
MS-DOS. The reason is MS-DOS.

~~~
hendi_
Thank you for the EXPLAN~1!

~~~
XJ6
If anyone's curious, this references the 8.3 filename shortening that Windows
was still doing even recently.

Amusingly, that tiny bit of backward compatibility can lead to vulnerabilities
as outlined here: [https://www.acunetix.com/blog/articles/windows-short-8-3-filenames-web-security-problem/](https://www.acunetix.com/blog/articles/windows-short-8-3-filenames-web-security-problem/)
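
For the curious, the familiar `EXPLAN~1` shape comes from truncating the stem to six characters, appending `~N`, and keeping at most three extension characters. A simplified sketch (the real Windows/VFAT algorithm also strips spaces and illegal characters and falls back to a hash after several collisions; `short_name` here is a hypothetical helper, not a Windows API):

```python
def short_name(name, taken=frozenset()):
    """Very simplified 8.3 shortening: 6-char stem + ~N, 3-char extension.

    The real algorithm handles illegal characters and hash fallbacks;
    this only reproduces the familiar EXPLAN~1 shape.
    """
    base, dot, ext = name.upper().rpartition(".")
    if not dot:  # no extension at all
        base, ext = name.upper(), ""
    stem = base.replace(".", "")[:6]
    n = 1
    while True:
        candidate = f"{stem}~{n}" + (f".{ext[:3]}" if ext else "")
        if candidate not in taken:
            return candidate
        n += 1

print(short_name("EXPLANATION"))       # EXPLAN~1
print(short_name("explanation.html"))  # EXPLAN~1.HTM
```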

------
breakerbox
I thought the list of most common filenames had some peculiar results.

> the filename “15” & “s15” made the top 50 most popular filenames

Anyone know why?

[https://boyter.org/posts/an-informal-survey-of-10-million-github-bitbucket-gitlab-projects/#what-are-the-most-common-filenames](https://boyter.org/posts/an-informal-survey-of-10-million-github-bitbucket-gitlab-projects/#what-are-the-most-common-filenames)

~~~
GrumpyNl
I would like to know too, any insights?

~~~
breakerbox
The only thing I've put together from light searching on GitHub is that it
might have to do with the fact that 15 and s15 can describe school terms (s
for the spring '15 semester) and a lot of people post assignments to GitHub.

15 is also a common number in coding problem sets that people post to GitHub.
It’s a stretch but it might be something to do with 2015 being a year that has
a lot of coursework commits, and 15 is a pretty low number (i.e. higher
probability a student posts solutions to #1-15 than #1-19).

The same goes for schoolwork. Many people do online MIT and Stanford courses
from previous years (~2012-2016), but less time has elapsed for students to
post answers from 2016, '17, '18, and '19.

This is mostly conjecture, so I hope someone has a better answer!

~~~
WilliamEdward
Advent of Code also came out in 2015... so, maybe?

------
gitgud
One question that has always interested me is how much duplication there is in
the open-source community - like how many boilerplate files are copied into
projects.

It seems generating a `react/angular` project produces many files that almost
never get changed, so it would be interesting to know the most duplicated
files...

I think this could also give valuable insight into how to make the
language/frameworks better or simpler...
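
Measuring that is straightforward in principle: hash every file's contents and count how many paths share a digest. A minimal sketch (not the article's pipeline; `duplicate_groups` is a hypothetical helper):

```python
import hashlib
from collections import defaultdict

def duplicate_groups(files):
    """Group file paths whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]

# Two untouched boilerplate files and one that was edited:
files = {
    "app1/.gitignore": b"node_modules/\n",
    "app2/.gitignore": b"node_modules/\n",
    "app3/.gitignore": b"node_modules/\ndist/\n",
}
print(duplicate_groups(files))  # [['app1/.gitignore', 'app2/.gitignore']]
```

Run over the whole corpus, the size of each group would directly answer which generated boilerplate files are the most copied.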

------
chewxy
A little bit of promo here: Ben will be giving a talk at GopherconAU later
this year. Tickets are on sale now at
[https://gophercon.com.au](https://gophercon.com.au)

------
mirekrusin
This is great. Btw, "jquery" appearing 20,015,171 times in the most common
filenames probably skews the JS LoC numbers (artificial LoC from vendored
dependencies).

------
jeremyjh
This is a cool project and a great write-up. Some of the complexity numbers
struck me as pretty off though. I know he caveats that you can only compare
files of the same language, but I took a quick look at the code[1], and at
least as of now it looks like the complexity statements were just kind of
copy/pasted: keywords not seen in Java/C, such as "match", are missing from
Rust, OCaml, and Scala; "case" is missing from Haskell and Elixir; etc. This
makes those languages score much lower in complexity than they should based
on standard control flow statements alone.

[1]
[https://github.com/boyter/scc/blob/master/languages.json](https://github.com/boyter/scc/blob/master/languages.json)

~~~
boyter
In all seriousness, if you are a language expert for those languages, please
submit a PR or at least raise an issue saying what keywords should be in
there. I am happy to include it in the next release should it produce more
accurate numbers.

------
gilleain
A couple of the large files are explainable:

- The largest .c file is actually the CATH database; not sure why it has that
extension.

- The large .cpp file is actually C++ but has a kind of 'data as code'
approach, defining bonds between atoms.

------
tylerl
Can't you do all this same research in just a few seconds for like pennies,
just by sending a few SQL queries to Google's BigQuery? I'm pretty sure we've
had stories about that here.

[https://medium.com/google-cloud/github-on-bigquery-analyze-all-the-code-b3576fd2b150](https://medium.com/google-cloud/github-on-bigquery-analyze-all-the-code-b3576fd2b150)

~~~
jmngomes
I partially agree with you; unless the goal was to "do this in Go", the choice
of tools seems odd/inefficient.

Spark would have been a simple option for this kind of processing, with fewer
lines of code, and could also run on "spare compute". Same goes for the "How
does one process 10 million JSON files taking up just over 1 TB of disk space
in an S3 bucket?" question: there are appropriate file formats for storing and
querying big datasets; text/JSON is simply the least efficient option and
likely the cause of the "$2.50 USD per query" number...

------
Scarbutt
Missing content for:

[https://boyter.org/posts/an-informal-survey-of-10-million-
gi...](https://boyter.org/posts/an-informal-survey-of-10-million-github-
bitbucket-gitlab-projects/#the-most-complex-code-is-written-in-what-language)

------
tuananh
When CPU utilization is high for a long period of time like this, dedicated
servers always make more sense than cloud.

------
lowsenberg
> The front-end varnish box for instance is doing the square root of zero most
> of the time.

Nicely written. :)

------
08-15
Why would anyone store 1TB of data in highly redundant JSON?

Stored as protobuf, I estimate it would be 8x smaller. A custom binary format
would be smaller again. Not only is that smaller, which saves storage and
transfer cost, it's also proportionally faster to process.
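
The redundancy is easy to quantify: JSON repeats every key in every record, while a fixed binary layout stores only the values. A quick stdlib comparison, using `struct` as a stand-in for protobuf (the record shape is invented for illustration; protobuf's varint encoding would shrink things further, which is where an estimate like 8x becomes plausible):

```python
import json
import struct

# One per-file record: (lines, code, comments, blanks) as 32-bit ints.
records = [(1200, 900, 200, 100)] * 1000

as_json = json.dumps([
    {"lines": l, "code": c, "comments": m, "blanks": b}
    for l, c, m, b in records
]).encode()

# Fixed binary layout: 16 bytes per record, no repeated keys.
as_binary = b"".join(struct.pack("<4i", *r) for r in records)

print(len(as_json), len(as_binary))  # the JSON is several times larger
```

And since parsing cost scales roughly with bytes read, the smaller format is proportionally cheaper to process, as the comment says.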

------
throwaway77384
Glad to see Go Template make the list of languages with the most curse words
in them.

NOT Go, but Go Template!! A subset of Go is causing more anguish than all of
Python and only a sliver less than all of Rust.

Says it all really ;)

------
ran3824692
plug [https://www.softwareheritage.org/](https://www.softwareheritage.org/)

~~~
trevyn
Is this available to download in bulk?

------
z3t4
Nice to see that Hypecript is far from engulfing JavaScript. Hopefully the
marketing team will not start to delete old JavaScript repos because of this.
:P

------
jtwaleson
Haha, is this a coincidence or did you take inspiration from my 2016 blog
post?

"Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS"

[http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html](http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html)

We used a lot of the same tech: Python, Go, S3, 32-core machines. Anyway,
nice read and good use of Hetzner.

