I'm sympathetic about bleeding-edge dependencies, but just handling fairly mundane Python dependencies is really easy with venv, pip-tools, and good standards across projects. Of course you always have containers for actual deployments.
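For concreteness, here's roughly the workflow I mean, written as a tiny Python script that just shells out to the usual tools (a sketch only: POSIX paths assumed, and requirements.in is assumed to hold just the top-level, unpinned dependencies):

    import subprocess
    import sys

    def run(*cmd):
        # Run a command and fail loudly if it fails.
        subprocess.run(cmd, check=True)

    run(sys.executable, "-m", "venv", ".venv")        # one venv per project
    run(".venv/bin/pip", "install", "pip-tools")
    run(".venv/bin/pip-compile", "requirements.in",   # resolve and pin the whole tree
        "--output-file", "requirements.txt")
    run(".venv/bin/pip-sync", "requirements.txt")     # make the venv match the pins exactly

The point is that the pins in requirements.txt are generated and committed, not maintained by hand, so anyone can rebuild the same environment.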
When you say beginners, I think it depends on whether you're referring to programming neophytes in general or professional developers who are new to Python specifically. In the latter case, I actually think it's really important for newcomers to Python to get into best practices like this very early on -- indeed, pretty much immediately. Otherwise they're going to end up either unable to use any interesting dependencies or unable to distribute their work in a way that is easy and convenient for others to hack on. Do this with a bunch of people simultaneously and it's a big problem.
1) There's a world of difference between that and docker, and especially docker with containers for not just postgresql, but a half-dozen specialized data stores, queuing systems, MTAs, etc.
2) There's also a world of difference between having numpy / pandas / etc. in your requirements.txt, and having those pinned to a specific version. I'm okay with one or two pinned dependencies on any specific project (for example, if there's an overall project built on Django).
But if you're using the corners of standard libraries in ways where version 1.65 works and 1.73 doesn't, you're probably doing something wrong. You're probably using features which are too bleeding-edge. I'm okay with a few conditionals in code too (if library is 1.65, do X, and if it's 1.73, do Y).
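To be concrete about the kind of conditional I mean, here's a sketch using numpy as the example (numpy.random.default_rng showed up in 1.17; the packaging helper is assumed to be available; the specific cutoff is just for illustration):

    from packaging.version import Version
    import numpy as np

    if Version(np.__version__) >= Version("1.17"):
        rng = np.random.default_rng(0)               # newer Generator API
        sample = rng.integers(0, 10, size=5)
    else:
        sample = np.random.randint(0, 10, size=5)    # older global-state API

A handful of branches like that is cheap. A codebase that only works under one exact version is not.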
When I've seen systems that depend on the nuances of specific versions, upgrades turn into "migration to [library] 1.73" projects and eat up weeks of developer time. It gets worse when you have cascades (upgrading library X means upgrading Y, etc.).
And goodness help you if you want to integrate two systems built in docker with pinned everything and fine-grained dependencies.
A lot of this also comes back to being willing and able to say "no" to features which take 15 minutes to introduce but cost time down the line to maintain.
Systems which install on Ubuntu without virtualenv or pip (just apt-get installing packages) are an ideal I strive for. It's usually one I don't hit (and it's also not how I develop, obviously -- the ideal isn't for me so much as for my users, and for the discipline it imposes).
I can’t tell if we’re in agreement or disagreement. I don’t disagree that one should avoid exotic dependencies or unstable behaviors from specific versions of libraries. Version pinning is more about just making sure someone else can run the program. It’s not about (or at least shouldn’t be about) creating a reliance on odd corner case behaviors. We almost never manually pin versions — pip-tools does that automatically.
Your argument would probably be that the dependencies used should be so simple and core that the risk of them not working with someone else’s package set is minimal or zero. That’s just a bit too extreme for my taste. I want builds to be 100% reproducible. This is exactly what modern build tools for other languages do.
Re: Docker, I don’t think anyone is claiming pip and virtualenv are somehow a replacement for that.
Re: apt-get, we tend to actually avoid this. It’s really not a good package manager at all and can easily break. We’re going in the direction of nix instead and may even port our entire Python workflow over to it or bazel at some point.
(1) I want builds to be 100% reproducible on deployment servers and on CI/CD pipelines. Otherwise, you can get undebuggable Heisenbugs. On the other hand, I don't want builds to be reproducible between developer machines. If I'm running Python 3.6 on Ubuntu, and another developer is running Python 3.7 on a Mac, and we have slightly different versions of numpy, that variation helps make sure the system isn't too brittle. Come to think of it, if I had infinite resources, I'd have several build machines with different (reproducible) configurations.
(2) I'm a lot more spartan about dependencies than other developers I've met.
(3) I'd never use apt to manage Python packages myself in something I'm working on. The constraint is in the other direction. If I build a tool, a user ought to be able to install it using apt in some future version of Debian, and likewise for other systems. Even if that's an abstract user.
I've found that if I develop this way, the upsides outweigh the downsides, especially over extended periods. A lot of software gets built as a system which can only live in one place. There's a set of AWS machines, code on them, and that's the system. There might be a few copies of it (stage+dev+etc.), but you can't move it somewhere else. I like the systems I build to be portable. Someone can get them up and running on their own machine, ideally in a few minutes. I've always found that to be cheaper in the long term.
I feel like your idea in (1) is not that unachievable with finite resources. It depends how far you took it, but requiring tests to pass in a few mild perturbations of the target environment wouldn’t be that expensive in a lot of cases, and not even that hard to set up. Sounds like this deserves a name like “perturbative testing” to me if it doesn’t already have one.
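A rough sketch of what I have in mind, shelling out from Python; the package names, version spreads, and POSIX venv paths are all illustrative, and tox or nox would do this more robustly:

    import subprocess
    import sys
    from pathlib import Path

    # A few mild perturbations of the target environment (illustrative pins).
    PERTURBATIONS = {
        "pinned": ["numpy==1.24.4", "pandas==2.0.3"],   # what actually deploys
        "older":  ["numpy~=1.23.0", "pandas~=1.5.0"],   # a release or so behind
        "newest": ["numpy", "pandas"],                  # whatever resolves today
    }

    for name, deps in PERTURBATIONS.items():
        env = Path(".perturb") / name
        subprocess.run([sys.executable, "-m", "venv", str(env)], check=True)
        pip = str(env / "bin" / "pip")
        py = str(env / "bin" / "python")
        subprocess.run([pip, "install", "-e", ".", "pytest", *deps], check=True)
        subprocess.run([py, "-m", "pytest", "-q"], check=True)   # suite must pass in every variant

If the suite only passes under the exact deployment pins, that is usually a sign the code is leaning on version-specific behavior.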
In the abstract, it doesn't take a huge amount of time and resources to do that. But there are probably around a hundred other ideas achievable with the same resources which would take priority over this on the projects I'm working on right now.
On projects I've worked on before, I think this would have made sense /technically/, given project priorities, but so did many other things which weren't done. It's a lot easier to make the case for resources for customer-facing features than for technical debt or infrastructure. So there's the political component too, which varies organization-by-organization.
This is already done in a lot of projects with hardware. The Linux kernel gets run on a thousand hardware and software configurations before features are integrated.
If I did this, I'd probably want at least three builds:
* my pinned deployment versions (sometimes a release or two behind, sometimes bleeding-edge)
* latest released version; and
* HEAD
If an upstream project introduced a breaking change, I'd know immediately. That'd be super-helpful, probably both to me and to those projects.
Come to think of it, the right way to do this might be to have three virtualenvs on my local machine, rather than just different targets in CI/CD....
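Something like this, roughly -- the package names and the git URL are placeholders for whichever upstream projects actually matter:

    import subprocess
    import sys
    from pathlib import Path

    # Three local environments, one per target.
    BUILDS = {
        "pinned": ["-r", "requirements.txt"],                       # my pinned deployment versions
        "latest": ["numpy", "pandas"],                              # latest released versions
        "head":   ["git+https://github.com/someorg/somelib@main"],  # upstream HEAD (placeholder URL)
    }

    for name, reqs in BUILDS.items():
        env = Path(f".venv-{name}")
        subprocess.run([sys.executable, "-m", "venv", str(env)], check=True)
        subprocess.run([str(env / "bin" / "pip"), "install", *reqs], check=True)
        # ...then run the same test suite inside each venv.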