> I actually wrote a Python solution at first, but having to install the pip dependencies on my clean varnish box seemed like a bad idea and it kept breaking in odd ways which I didn't feel like debugging.
Amen!
This is why I am learning Go at the moment and considering using it instead of Python for admin and data processing tasks on a fleet of servers. The single-binary deployment makes it a lot easier for users to adopt. Python misses out on a lot of use because it can't do this. And no, I do not want to pip install a lot of stuff on the servers just to be able to run this script once. Heck, some of these servers don't even have access to the public internet to pip install anything.
Yes, I have looked into PyInstaller and Nuitka. They threw up errors that were indicative of deeper issues, which I didn't feel were a good use of my time to debug. I'd rather choose a language that has this as a priority/design goal instead.
Yeah, I was writing a Docker image built around Prometheus' jmx_exporter intended to be used as a mixin for our Java apps, and part of what I needed to do was provide a simple script to preprocess separate config files to produce the config used by jmx_exporter.
My initial thought was Python, but it needed a couple of third-party dependencies, and there wasn't an overly clean _and_ simple way to copy the script in from the mixin and run it locally.
So I shrugged, rewrote my script in Go, and then used a multi-stage build to copy in the binary and nothing else.
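If anyone's curious, the shape of that build is roughly this (a minimal sketch; the image names and paths are made up, not my actual Dockerfile):

```dockerfile
# Build stage: compile a fully static binary.
FROM golang:1.13 AS build
WORKDIR /src
COPY . .
# CGO_ENABLED=0 avoids linking against libc, so the binary runs anywhere.
RUN CGO_ENABLED=0 go build -o /out/preprocess .

# Final stage: start from the real base image and copy in only the binary.
FROM openjdk:8-jre-slim
COPY --from=build /out/preprocess /usr/local/bin/preprocess
```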
Ending up with a single statically linked binary was cool, even if Go does some stuff that made my eyebrows quirk a tad (I still can't believe that an idiomatic set in Go is map[T]struct{}...)
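For anyone who hasn't run into it, the set idiom looks like this (a minimal, self-contained example):

```go
package main

import "fmt"

func main() {
	// struct{} occupies zero bytes, so the map stores only its keys.
	seen := make(map[string]struct{})
	for _, w := range []string{"go", "python", "go"} {
		seen[w] = struct{}{}
	}
	_, ok := seen["go"]        // membership test
	fmt.Println(ok, len(seen)) // true 2
}
```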
> I was writing a Docker image built around Prometheus' jmx_exporter intended to be used as a mixin for our Java apps, and part of what I needed to do was provide a simple script to preprocess separate config files to produce the config used by jmx_exporter
Did you consider, and stop me if this suggestion is completely wild, Java?
It's a script that merges YAML objects; there's an overhead to Java for this use case. I did consider Kotlin, but it has that same overhead of a POM, plugins to build capsules/fat JARs, etc.
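For context, the core of it is just a recursive map merge. A rough sketch in Go, assuming gopkg.in/yaml.v3 and a "later file wins per key" policy (not my exact code):

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

// merge overlays b onto a, recursing into nested maps; b wins on conflicts.
func merge(a, b map[string]interface{}) map[string]interface{} {
	for k, v := range b {
		if bm, ok := v.(map[string]interface{}); ok {
			if am, ok := a[k].(map[string]interface{}); ok {
				a[k] = merge(am, bm)
				continue
			}
		}
		a[k] = v
	}
	return a
}

func main() {
	var base, override map[string]interface{}
	if err := yaml.Unmarshal([]byte("rules:\n  a: 1\n"), &base); err != nil {
		log.Fatal(err)
	}
	if err := yaml.Unmarshal([]byte("rules:\n  b: 2\n"), &override); err != nil {
		log.Fatal(err)
	}
	out, err := yaml.Marshal(merge(base, override))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out)) // rules now contains both a and b
}
```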
I don't get it... Can't you just use a virtualenv? Then there's no worry about namespace pollution or conflicts between dependencies of different projects.
Installing a full virtualenv with all related libraries just to run one script is kind of ridiculous.
The server might have the wrong Python version available (2.x vs 3.x, multiple evolutions of 3.x with big features added in each point release). Et cetera, et cetera.
Or you could just do it with bash or a static Go binary and be done with it. Portable, works pretty much everywhere.
Why is it ridiculous? You'll need the related libraries in either case, and by default virtualenv uses symlinks. The "full virtualenv" part makes it sound like it's heavy, or something like that.
If you do it in a language that provides static binaries or an equivalent, you don't need to pull dozens of libraries with possibly hundreds of files just to run a single script.
Imagine a situation where you'd need to perform an operation on a hundred servers. Would you transfer one static executable, or build a virtualenv on every machine and download the correct versions of all the related libraries?
If I controlled the environment, I'd probably build a Docker image and deploy that. Otherwise, yes. This kind of extreme example isn't very representative of the initial situation, though. And you're also talking about switching languages, which isn't always possible or easy, especially in large teams or codebases.
> The server might have the wrong Python version available
That's the point of using virtual environments, so that you can run the Python version and libraries that you need. Also, as of 3.3, Python ships with venv which means you don't need to separately install virtualenv anymore. It's all very portable.
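For anyone following along, the whole stdlib flow is short (a sketch; paths hypothetical):

```sh
python3 -m venv .venv            # ships with Python 3.3+, no extra install
. .venv/bin/activate
pip install -r requirements.txt  # deps land in .venv, not the system site-packages
```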
This is my exact situation right now. Our website is made in Django, and we use pyenv to manage our Python versions. We'll likely be going the route of a Docker image / pyenv install script. It takes trial and error to figure out some of the packages you need if they aren't installed already; a single binary would be far easier for sure.
virtualenv still requires internet access. You'd have to create a venv on the destination machine and pip install your packages into it. You can't scp a local venv onto the destination machine and have it work.
Basically, build a virtualenv and then pack that into a .deb; install the .deb on the target machine and you have a self-contained package that requires no external resources to install.
That said, using golang is WAAAAY simpler. The virtualenv/deb solution, while it works, is very sketch.
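Roughly, the trick is that venvs aren't relocatable, so you build the venv at the path it will occupy on the target and ship that whole tree (a sketch; paths and names hypothetical, and you still need a DEBIAN/control file in the package directory):

```sh
# Build the venv at its final path and install deps into it.
python3 -m venv /opt/mytool/venv
/opt/mytool/venv/bin/pip install -r requirements.txt

# Stage the tree into a package directory and build the .deb.
mkdir -p pkg/opt
cp -a /opt/mytool pkg/opt/
dpkg-deb --build pkg mytool.deb
```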
I think what people are arguing is that deploying a single binary is even more trivial than a "venv with the right architecture". Deploying one file is going to be easier than deploying a set of files.
I am an utter noob at Python packaging, but I would like to learn more, so excuse the basic question: why doesn't it work? And is there anything in the ecosystem which is like that, but does work?
I had heard that the Python packaging and deployment story had got a lot better in recent years, but this sounds like it still falls far short of table stakes.
I'm not sure why the Python community can't get it sorted out. We've been using Pipenv because it's one of the few tools that supports lockfiles/reproducible builds, and at the time we chose it, the Python packaging authority was advertising it as the official solution for Python packaging; however, after we were already invested in it, they backpedaled because the community realized it was very buggy and super slow (as in "it takes literally half an hour to add, remove, or update any dependency").
We're finally getting around to migrating away, and we've settled on Twitter's github.com/pantsbuild/pants, which is like Google's Bazel except that it's jankier in every way _except_ that Pants supports Python while Bazel doesn't (Bazel advertises Python support, but it doesn't work, as the many Python 3 issues in their issue tracker can attest). The nicest thing about Pants is that it builds pex files (github.com/pantsbuild/pex), which are executable zip files that ship all of your dependencies except the Python interpreter and shared objects.
I'm still not very satisfied with this solution, and it's still far, far worse than Go's dependency management and packaging model, but it's a dramatic improvement over Pipenv, virtualenv, etc (gives us both reproducibility and performance).
It's not that Python packaging doesn't work, or that it is impossible to automate, but whatever you do, it requires nontrivial time and effort to get right.
The point is that the static binary created by go is a single file that you copy into place and run. It is a lot simpler.
So a virtualenv for every script I want to run? Each of them downloading a full Python interpreter and whatever modules that Python script uses? How is this any better than a statically compiled binary, other than useless additional work and bloat on the servers?
This is the weirdest straw man I've seen in a while. Python, not Go, benefits most from containerization. Installing "Dynamic libraries, OS packages, and even directories like virtualenvs" on target machines is far more complicated than sending a single file. I say all this as the DevOps guy in a Python shop.
Fully agree about Go; however, if you can't move away from Python, github.com/pantsbuild/pex is a pretty good alternative. It's an executable zip file that contains all of your Python dependencies; you still need the Python interpreter and any shared objects that don't come with pip (e.g., for whatever reason, the Python graphviz package depends on the graphviz-devel system package, and the latter doesn't ship via PyPI). It's still an order of magnitude worse than Go, but it's an order of magnitude better than pip or pipenv. :)
You didn't mention virtualenv. That is what I use. I even made a .deb installer that forces itself to use virtualenv and fetches system Python files for those packages available in Debian but not via pip (kind of a hack, but it worked! Look, mah!). I agree, though: I like that with D and Go, deployment seems simpler.
zipapp, which has come bundled with Python since 3.5, is your friend.
It can produce a single .pyz file which has all dependencies except the interpreter itself inside. Actually, this format is compatible all the way back to Python 2.6; it's only the zipapp convenience scripts that are new.
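A minimal sketch of building one (module and entry-point names hypothetical):

```sh
# Vendor the dependencies next to your source, then zip the lot.
pip install -r requirements.txt --target build/
cp -r myapp build/

# -m names the entry point; -p adds a shebang so the file is directly executable.
python3 -m zipapp build -m "myapp.main:main" -p "/usr/bin/env python3" -o myapp.pyz
./myapp.pyz
```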
First I've heard of "Taco Bell Programming"[1]. I love it; this needs to become a normal part of my lexicon. Also, heck yes, it completely mirrors my experience: a few dozen tools combine in fairly simple ways to make absurdly useful results. The problem is finding the useful N-dozen elemental tools you'll use.
Seriously? That's no harsher language than calling someone a dick. Do you have to keep an ultra-clinical PR image for some reason or are we really that puritanical?
The networks that I belong to and benefit from are that "puritanical," if you want to choose a really stupid word for avoiding insulting half the population for no reason, yeah.
> If someone wants to host the raw files to allow others to download it, let me know. It is an 83 GB tar.gz file which uncompressed is just over 1 TB in size.
Some of the cloud providers have free hosting for public data sets (people who use the data incur cost to download/process the data). I'm not sure if this would qualify.
I similarly have a 127GB tar.xz of about 2TB of code/files from the top GitHub repositories (filtered by highest starred repositories per primary language) as of around August 2017 if anyone wants it or wants to host it.
You could also use an S3 or GCS “requester pays” bucket, which means you never get charged for data transfer. Would cost about $2/month for the 83GB storage, but that’s it.
As much as I love the confirmation that .yml is the correct extension, I’m most amazed by the fact that there’s over a TRILLION lines of code public on GitHub.
That’s an astronomical number. And that’s only what we can all see. Can’t imagine how much more is private.
Also, it looks like 20% of code is comments, which feels just about right.
It goes back further than that - DOS got the idea of 8.3 filenames from CP/M and before that DEC used three letter file extensions on its minicomputer operating systems. The idea arguably goes back even further into the mainframes of the 1960s although it gets a bit messier then to trace it.
Yep, that and forking. But if you fork a repo and change a small thing, is it still a duplicate of the other repo?
I don’t think the raw number is what’s so impressive, as much as the fact that there’s more code public on GitHub than I could comprehend in a lifetime.
s15 seems to be skewed by a bunch of JavaScript projects including Font Awesome, which stores the bathtub icon as s15.js for some reason (note that s1 through s14 are unused, so it's not just consecutive numeric ids). https://fontawesome.com/v4.7.0/icon/bath
Plain 15 is trickier. GitHub's search is too fuzzy to see a pattern at a quick glance, but again it seems that JavaScript is the culprit with a disproportionately high number of filename:15 results compared to filename:14 or filename:16.
The only thing I've put together from light searching on GitHub is that it might have to do with the fact that 15 and s15 can be used to describe school years (s for spring '15 semester), and a lot of people post assignments to GitHub.
15 is also a common number in coding problem sets that people post to GitHub. It's a stretch, but it might have something to do with 2015 being a year with a lot of coursework commits, and 15 being a pretty low number (i.e. higher probability a student posts solutions to #1-15 than #1-19).
The same goes for schoolwork. Many people do online MIT and Stanford courses that are from previous years (~2012-2016), but less time has elapsed for students to post answers from 2016, '17, '18, '19.
This is mostly conjecture, so I hope someone has a better answer!
One query that always interested me is how much duplication there is in the open-source community, like how many boilerplate files are copied into projects.
It seems generating a `react/angular` project produces many files that almost never get changed, so it would be interesting to know the most duplicated files...
I think this could also give valuable insight into how to make the languages/frameworks better or simpler...
This is a cool project and a great write-up. Some of the complexity numbers struck me as pretty off, though. I know he caveats that you can only compare files of the same language, but I took a quick look at the code[1], and at least as of now it looks like the complexity statements are just kind of copy/pasted: keywords not seen in Java/C, such as "match", are missing from Rust, OCaml, and Scala; "case" is missing from Haskell and Elixir; etc. This causes those languages to score much lower in complexity than they should, just based on standard control flow statements.
In all seriousness, if you are a language expert for those languages, please submit a PR or at least raise an issue saying which keywords should be in there. I am happy to include it in the next release should it produce more accurate numbers.
Can't you do all this same research in a few seconds, for pennies, just by sending a few SQL queries to Google's BigQuery? I'm pretty sure we've had stories about that here.
I partially agree with you; unless the goal was to "do this in Go", the choice of tools seems odd/inefficient.
Spark would have been a simple option to do this kind of processing, with fewer lines of code, and could also run on "spare compute". The same goes for "How does one process 10 million JSON files taking up just over 1 TB of disk space in an S3 bucket?": there are appropriate file formats for storing and querying big datasets; text/JSON is simply the least efficient option and likely the cause of the "$2.50 USD per query" number...
Why would anyone store 1TB of data in highly redundant JSON?
Stored as protobuf, I estimate it would be 8x smaller. A custom binary format would be smaller again. Not only is that smaller, which saves storage and transfer cost, it's also proportionally faster to process.
Nice to see that HypeScript is far from engulfing JavaScript. Hopefully the marketing team will not start deleting old JavaScript repos because of this. :P