Processing 40TB of code from 10M projects with a dedicated server and Go (boyter.org)
238 points by boyter on Oct 1, 2019 | 80 comments



> I actually wrote a Python solution at first, but having to install the pip dependencies on my clean varnish box seemed like a bad idea and it keep breaking in odd ways which I didn’t feel like debugging.

Amen! This is why I am learning Go at the moment and considering using it instead of Python for admin and data processing tasks on a fleet of servers. The single binary deployment makes it a lot easier for users to adopt. Python misses out on a lot of use because of the inability to do this. And no, I do not want to pip install a lot of stuff on the servers just to be able to run this script once. Heck, some of these servers don't even have access to the public internet to be able to pip install whatever.

Yes, I have looked into PyInstaller and Nuitka. They threw up some errors that were indicative of deeper issues and that didn't feel like a good use of my time to debug. I'd rather choose a language that has this as a priority/design goal instead.


Yeah, I was writing a Docker image built around Prometheus' jmx_exporter intended to be used as a mixin for our Java apps, and part of what I needed to do was provide a simple script to preprocess separate config files to produce the config used by jmx_exporter.

My initial thought was Python, but it needed a couple of 3rd party dependencies, and there wasn't an overly clean _and_ simple way to copy the script in from the mixin and run it locally.

So I shrugged, rewrote my script in Go, and then used a multi-stage build to copy in the binary and nothing else.

Ending up with a single statically linked binary was cool, even if Go does some stuff that made my eyebrows quirk a tad (I still can't believe that an idiomatic set in Go is map[T]struct{}...)


> I was writing a Docker image built around Prometheus' jmx_exporter intended to be used as a mixin for our Java apps, and part of what I needed to do was provide a simple script to preprocess separate config files to produce the config used by jmx_exporter

Did you consider, and stop me if this suggestion is completely wild, Java?


It's a script that merges YAML objects; there's an overhead to Java for this use case. I did consider Kotlin, but it has that same overhead: a POM, plugins to build capsules/fat jars, etc.


I don't get it... Can't you just use a virtualenv? Then there's no worry about namespace pollution or conflicts between dependencies of different projects.


Installing a full virtualenv with all related libraries just to run one script is kind of ridiculous.

The server might have the wrong Python version available (2.x vs 3.x, multiple evolutions of 3.x with big features added in each point release). Et cetera, et cetera.

Or you could just do it with bash or a static Go binary and be done with it. Portable, works pretty much everywhere.


Why is it ridiculous? You'll need the related libraries in either case, and by default virtualenv uses symlinks. The "full virtualenv" part makes it sound like it's heavy, or something like that.


If you do it in a language that provides static binaries or an equivalent, you don't need to pull dozens of libraries with possibly hundreds of files just to run a single script.

Imagine a situation where you need to perform an operation on a hundred servers: would you rather transfer one static executable, or build a virtualenv on every machine and download the correct versions of all the related libraries?


If I controlled the environment, I'd probably build a Docker image and deploy that. Otherwise, yes. This kind of extreme example isn't very representative of the initial situation, though. And you're also talking about switching languages, which isn't always possible or easy, especially in large teams or codebases.


> The server might have the wrong Python version available

That's the point of using virtual environments, so that you can run the Python version and libraries that you need. Also, as of 3.3, Python ships with venv which means you don't need to separately install virtualenv anymore. It's all very portable.
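
For what it's worth, a minimal sketch of that with nothing but the stdlib (the directory and requirements file names are placeholders, not anything from the article):

    # Sketch: create a venv and install dependencies into it.
    # "env" and "requirements.txt" are placeholder names.
    import subprocess
    import venv

    venv.create("env", with_pip=True)  # venv ships with Python 3.3+, with_pip since 3.4
    subprocess.check_call(["env/bin/pip", "install", "-r", "requirements.txt"])  # POSIX layout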


You'd need to install the right Python version though, and even with something like pyenv you might need to install system packages.

It really isn't as portable as one might wish.


This is my exact situation right now. Our website is made in Django, and we use pyenv to manage our Python versions. We'll likely be going the route of a Docker image / pyenv install script. It takes trial and error to figure out some of the packages you need if they aren't installed already; a single binary would be far easier for sure.


Portable means something else. Portable would mean that I could ship the virtualenv. Python missed the forest for the trees.


I still need to have python installed on the machine and download all of the libraries needed.

Not all operating systems carry Python >3.3 by default, that would mean installing backports or unofficial repositories on production machines.

Or you can use a static binary that just works.


virtualenv still requires internet access. You'd have to create a venv on the destination machine and pip install your packages into it. You can't scp a local venv onto the destination machine and have it work.


I've done it in a way that doesn't.

Basically build a virtualenv and then pack that into a .deb, install the .deb on the target machine and you have a self-contained package that requires no external resources to install.

That said, using golang is WAAAAY simpler. The virtualenv/deb solution, while it works, is very sketch.


You just need a venv with the right architecture, which is fairly trivial. You can then copy the venv directly over.

See this repo: https://github.com/unixtreme/D3Edit. It has Linux/Mac/Windows virtualenvs and it works without any additional setup


I think what people are arguing is deploying a single binary is even more trivial than a "venv with the right architecture". Deploying one file is going to be easier than deploying a set of files.


I am an utter noob at Python packaging, but I would like to learn more, so excuse the basic question: why doesn't it work? And is there anything in the ecosystem which is like that, but does work?

I had heard that the Python packaging and deployment story had got a lot better in recent years, but this sounds like it still falls far short of table stakes.


I'm not sure why the Python community can't get it sorted out. We've been using Pipenv because it's one of a few tools that supports lockfiles/reproducible builds, and at the time we chose it, the Python packaging authority was advertising it as the official solution for Python packaging; however, after we were already invested in it, they backpedaled because the community realized it was very buggy and super slow (as in "it takes literally half an hour to add or remove or update any dependency").

We're finally getting around to migrating away, and we've settled on Twitter's github.com/pantsbuild/pants, which is like Google's Bazel except that it's jankier in every way _except_ that Pants supports Python while Bazel doesn't (Bazel advertises Python support, but it doesn't work, as the many Python 3 issues in their issue tracker attest). The nicest thing about Pants is that it builds pex files (github.com/pantsbuild/pex), which are executable zip files that ship all of your dependencies except the Python interpreter and shared objects.

I'm still not very satisfied with this solution, and it's still far, far worse than Go's dependency management and packaging model, but it's a dramatic improvement over Pipenv, virtualenv, etc (gives us both reproducibility and performance).


It's not that Python packaging doesn't work, or that it is impossible to automate, but whatever you do, it requires nontrivial time and effort to get right.

The point is that the static binary created by go is a single file that you copy into place and run. It is a lot simpler.


So a virtualenv for every script I want to run? Each of them downloading a full Python interpreter and whatever modules that Python script uses? How is this any better than a statically compiled binary, other than useless additional work and bloat on the servers?


This is becoming a parody of software engineering.

People are now irrationally scared of using dynamic libraries, OS packages and even directories, like virtualenvs.

Instead, a simple solution is replaced with containers, or by rewriting tons of code in the new hyped language.

Are we trying to create job security through unnecessary complexity?


> Are we trying to create job security through unnecessary complexity?

As a python/go dev, I can assure you that go is not much more complex. And as a server admin, it's much easier to deploy.

It does have drawbacks when compared to python, but it can often be the optimal solution.


Also a Python dev. I do use virtualenv, but the other thing is that Go gives a huge performance boost for small, performance-intensive projects.


This is the weirdest straw man I've seen in a while. Python, not Go, benefits most from containerization. Installing "Dynamic libraries, OS packages, and even directories like virtualenvs" on target machines is far more complicated than sending a single file. I say all this as the DevOps guy in a Python shop.


Isolation is merely one utility of containers.


Fully agree about Go; however, if you can't move away from Python, github.com/pantsbuild/pex is a pretty good alternative. It's an executable zip file that contains all of your Python dependencies; you still need the Python interpreter and any shared objects that don't come with pip (e.g., for whatever reason, the Python graphviz package depends on the graphviz-devel system package and the latter doesn't ship via PyPI). It's still an order of magnitude worse than Go, but it's an order of magnitude better than pip or pipenv. :)


> The single binary deployment makes it a lot easier for users to adopt.

I've had some success making a Python file an executable. Though, I do understand your gripe with needing to install dependencies.
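
Roughly this, on a POSIX box (the file name is made up):

    #!/usr/bin/env python3
    # hello.py -- after `chmod +x hello.py` you can run ./hello.py directly.
    # It still relies on a system python3 being present, which is the gripe above.
    print("hello from an executable Python file")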


You didn't mention virtualenv. That is what I use. I even made a deb installer that forces itself to use virtualenv and fetches system Python files for those packages available to Debian but not pip (kind of a hack, but it worked! Look mah!). I agree though, I like that with D and Go deployment seems simpler.


zipapp, which comes bundled with python since 3.5, is your friend.

It can produce a single .pyz file which has all dependencies except the interpreter itself inside. Actually this format is compatible all the way back to Python 2.6, it's only the zipapp convenience scripts that are new.

https://docs.python.org/3/library/zipapp.html#creating-stand...
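
A rough sketch of building one, assuming the dependencies were vendored into the app directory with pip first (the directory and entry point names are hypothetical):

    # Sketch only -- "myapp" and "myapp.cli:main" are made-up names.
    # Step 1 (beforehand): pip install -r requirements.txt --target myapp/
    # Step 2: zip it up with a launcher line.
    import zipapp

    zipapp.create_archive(
        "myapp",                             # source directory
        target="myapp.pyz",                  # single-file artifact
        interpreter="/usr/bin/env python3",  # shebang written into the archive
        main="myapp.cli:main",               # entry point (only valid if there is no __main__.py)
    )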


First I've heard of "Taco Bell Programming"[1]. I love it, this needs to become a normal part of my lexicon. Also heck yes, it completely mirrors my experience - a few dozen tools combine in fairly simple ways to make absurdly useful results. The problem is finding the useful N-dozen elemental tools you'll use.

[1]: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...


> I could have done the whole thing Taco Bell style if I had only manned up and broken out sed, but I pussied out and wrote some Python.

And with one dumb word choice, I suddenly can't share this otherwise good piece with most audiences.


Seriously? That's no harsher language than calling someone a dick. Do you have to keep an ultra-clinical PR image for some reason or are we really that puritanical?


The networks that I belong to and benefit from are that "puritanical," if you want to choose a really stupid word for avoiding insulting half the population for no reason, yeah.


The website it's linked on is called widgetsandshit.com. I think that's a bigger problem if you're that worried.


Nothing offensive about shit.


Which word? "manned"?


> If someone wants to host the raw files to allow others to download it let me know. It is a 83 GB tar.gz file which uncompressed is just over 1 TB in size.

Some of the cloud providers have free hosting for public data sets (people who use the data incur cost to download/process the data). I'm not sure if this would qualify.

https://aws.amazon.com/opendata/public-datasets/
https://azure.microsoft.com/en-us/services/open-datasets/


I similarly have a 127GB tar.xz of about 2TB of code/files from the top GitHub repositories (filtered by highest starred repositories per primary language) as of around August 2017 if anyone wants it or wants to host it.


You could also use an S3 or GCS “requester pays” bucket, which means you never get charged for data transfer. Would cost about $2/month for the 83GB storage, but that’s it.

https://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPay...

https://cloud.google.com/storage/docs/requester-pays
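
If you go that route, a minimal boto3 sketch for flipping the flag (bucket name is a placeholder; assumes the bucket already exists and credentials are configured):

    # Sketch: mark an existing S3 bucket as requester-pays.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_request_payment(
        Bucket="my-code-dataset",  # placeholder bucket name
        RequestPaymentConfiguration={"Payer": "Requester"},
    )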


Make a torrent.


I'm very interested in this data; if there is a torrent, I would be able to seed it if needed.


Or you could just put it on some Hetzner box. I would do that but I'm not sure how to contact him.


As much as I love the confirmation that .yml is the correct extension, I’m most amazed by the fact that there’s over a TRILLION lines of code public on GitHub.

That’s an astronomical number. And that’s only what we can all see. Can’t imagine how much more is private.

Also, it look’s like 20% of code is comments. Which feels just about right.


People actually bike shed over .yaml vs .yml? If you thought tabs vs spaces was a useless debate...


Never been comfortable with extension names longer than 3 letters for some reason...


MS-DOS. The reason is MS-DOS.


Thank you for the EXPLAN~1!


If anyone's curious, this references the 8.3 filename shortening that Windows was still doing even recently.

Amusingly, that tiny bit of backward compatibility can lead to vulnerabilities as outlined here: https://www.acunetix.com/blog/articles/windows-short-8-3-fil...


It goes back further than that - DOS got the idea of 8.3 filenames from CP/M and before that DEC used three letter file extensions on its minicomputer operating systems. The idea arguably goes back even further into the mainframes of the 1960s although it gets a bit messier then to trace it.


All I'm saying mate, is that we don't pronounce it Yiml now, do we. :D


.yml only makes sense in a world where every other extension is three letters.

With variable-length extensions, there isn't a good reason not to just use the full name.


Who cares? Just switch to .jsn and forget about it.


I'm sorry, .jon would be more logical, JavaScript being one word and all.


Surely copy & paste played a major role in this.


Yep, that and forking. But if you fork a repo and change a small thing, is it still a duplicate of the other repo?

I don’t think the raw number is what’s so impressive, as much as the fact that there’s more code public on GitHub than I could comprehend in a lifetime.


I am waiting for the results showing how much copy pasta is on the plate.


I thought the most common file names had some peculiar results.

> the filename “15” & “s15” made the top 50 most popular filenames

Anyone know why?

https://boyter.org/posts/an-informal-survey-of-10-million-gi...


s15 seems to be skewed by a bunch of JavaScript projects including Font Awesome, which stores the bathtub icon as s15.js for some reason (note that s1 through s14 are unused, so it's not just consecutive numeric ids). https://fontawesome.com/v4.7.0/icon/bath

Plain 15 is trickier. GitHub's search is too fuzzy to see a pattern at a quick glance, but again it seems that JavaScript is the culprit with a disproportionately high number of filename:15 results compared to filename:14 or filename:16.


I would like to know too, any insights?


The only thing I’ve put together from light searching on GitHub might be to do with the fact that 15 and s15 can be used to describe school years (s for spring 15 semester) and a lot of people post assignments to GitHub.

15 is also a common number in coding problem sets that people post to GitHub. It’s a stretch but it might be something to do with 2015 being a year that has a lot of coursework commits, and 15 is a pretty low number (i.e. higher probability a student posts solutions to #1-15 than #1-19).

The same goes for schoolwork. Many people do online MIT and Stanford courses from previous years (~2012-2016), but less time has elapsed for students to post answers from 2016, 17, 18, and 19.

This is mostly conjecture, so I hope someone has a better answer!


advent of code also came out in 2015... so, maybe?


One query that always interested me is how much duplication there is in the open-source community. Like how many boilerplate files are copied into projects.

It seems generating a `react/angular` project produces many files that almost never get changed, so it would be interesting to know the most duplicated files...

I think this could also give valuable insight into how to make the language/frameworks better or simpler...


A little bit of promo here: Ben will be giving a talk at GopherConAU later this year. Tickets are on sale now at https://gophercon.com.au


This is great. BTW, seeing "jquery" as the most common filename 20,015,171 times probably skews the JS LoC numbers (artificial LoC from dependencies).


This is a cool project and a great write-up. Some of the complexity numbers struck me as pretty off, though. I know he caveats that you can only compare files of the same language, but I took a quick look at the code[1] and, at least as of now, it looks like the complexity statements are just kind of copy/pasted, so keywords not seen in Java/C such as "match" are missing from Rust, OCaml, and Scala, "case" is missing from Haskell and Elixir, etc., causing these languages to come out much lower in complexity than they should just based on standard control flow statements.

[1] https://github.com/boyter/scc/blob/master/languages.json


In all seriousness if you are a language expert for those languages please submit a PR or at least raise an issue saying what keywords should be in there. I am happy to include it into the next release should it produce more accurate numbers.


A couple of the large files are explainable:

- The largest .c file is actually the CATH database, not sure why it has that extension

- The large .cpp file is actually C++ but has a kind of 'data as code' approach defining bonds between atoms


Can't you do all this same research in just a few seconds for like pennies, just by sending a few sql queries to Google's BigQuery? I'm pretty sure we've had stories about that here.

https://medium.com/google-cloud/github-on-bigquery-analyze-a...


I partially agree with you; unless the goal was to "do this in Go", the choice of tools seems odd/inefficient.

Spark would have been a simple option for this kind of processing, with fewer lines of code, and could also run on "spare compute". The same goes for "How does one process 10 million JSON files taking up just over 1 TB of disk space in an S3 bucket?": there are appropriate file formats for storing and querying big datasets; text/JSON is simply the least efficient option and likely the cause of the "$2.50 USD per query" number...



When CPU utilization is high for a long period of time like this, dedicated servers always make more sense than cloud.


> The front-end varnish box for instance is doing the square root of zero most of the time.

Nicely written. :)


Why would anyone store 1TB of data in highly redundant JSON?

Stored as protobuf, I estimate it would be 8x smaller. A custom binary format would be smaller again. Not only is that smaller, which saves storage and transfer cost, it's also proportionally faster to process.


Glad to see Go Template make the list of languages with the most curse words in them.

NOT Go, but Go Template!! A subset of Go is causing more anguish than all of Python and only a sliver less than all of Rust.

Says it all really ;)



Is this available to download in bulk?


Nice to see that Hypecript is far from engulfing JavaScript. Hopefully the marketing team will not start to delete old JavaScript repos because of this. :P


Haha, is this a coincidence or did you take inspiration from my 2016 blog post?

"Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS"

http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m...

We used a lot of the same tech: python, golang, S3, 32 core machines. Anyway, nice read and good use of Hetzner.



