Ask HN: Has anyone migrated off containerized infrastructure?
345 points by murkt on Aug 14, 2020 | 358 comments
I'm constantly aggravated by various quirks of containers, and don't really remember any big problems with non-containerized infra.

A random and non-exhaustive list of things that bother me from time to time:

— Must build an image before deploying and it takes time, so deploys are slow (of course we use CI to do it, it's not manual).

— If I need a quick hotfix RIGHT NOW, I can't just log in, change a couple of lines and restart; I must go through the full deploy cycle.

— Must remember that launched containers do not close when ssh breaks connection and they can easily linger for a couple of weeks.

I generally find it harder to change how things work together. It's probably possible to spend lots of effort to fix these things, but I don't remember having to do all this cruft with old school infrastructure.

Hence the question - has anyone migrated off containerized infrastructure? Are you satisfied? Or am I misremembering things, and horrible things wait for me in the old-school ways?

Disclaimer: I'm a container infrastructure consultant at Red Hat, so take all of this with a big grain of salt :-)

What you are complaining about isn't really containers (you could still pretty easily run stuff in a container and set it up/treat it like a "pet" rather than "cattle"), it's the CI/CD and immutable infrastructure best practices you are really sad about. Your complaints are totally valid: but there is another side to it.

Before adopting containers it wasn't unusual to SSH in and change a line of code on a broken server and restart. In fact that works fine while the company/team is really small. Unfortunately it becomes a disaster and huge liability when the team grows.

Additionally in regulated environments (think a bank or healthcare) one person with the ability to do that would be a huge threat. Protecting production data is paramount and if you can modify the code processing that data without a person/thing in your way then you are a massive threat to the data. I know you would never do something nefarious - neither would I. We just want to build things. But I promise you it's a matter of time until you hire somebody that does. And as a customer, I'd rather not trust my identity to be protected because "we trust Billy. He would never do that."

I pine for the old days - I really do. Things are insanely complex now and I don't like it. Unfortunately there are good reasons for the complexity.

Another way of putting this, which largely amounts to the same thing, is that containerization was developed by and for very large organizations. I have seen it used at much smaller companies, most of whom had zero need for it, and in fact it put them into a situation where they were unable to control their own infrastructure, because they had increased the complexity past the point where they could maintain it. Containerization makes deploying your first server harder, but your nth server becomes easier, for values of large n, and this totally makes sense when your organization is large enough to have a large number of servers.

I think containers are great even for really small companies. You boiled it down to `n` servers, but it's `n` servers times `m` services times `k` software updates. That's easier as soon as n * m * k > 2!

First of all, containers can be used with Cloud Run and many other ways to run containers without managing servers at all! (tho if you can use services like Netlify and Heroku to handle all your needs cost-effectively, you probably should).

Setting up a server with docker swarm is pretty easy, because there's basically one piece of software to install. From there on all the software to update and install is in containers.

If your software gets more complex, your server setup stays simple. Even if it doesn't get complex, being able to install software updates for the app independently of the host is great. Ie, I can go from Python 3.7 to Python 3.8 with absolutely zero fuss.

Deploying servers doesn't get more complicated with a few more containers. At some point that's not true but if you want to run, say, grafana as well, the install/maintenance of the server stays constant.

Imagine what you would do without containers... editing an ansible script and having to set up a new server just to test the setup, or horribly likely ssh'ing in and running commands one-off and having no testing, staging or reproducibility.

I vastly prefer Dockerfiles and docker-compose.yml and swarm to ansible and vagrant. There are more pre-built containers than there are ansible recipes as well. So your install/configure time for any off-the-shelf stuff can go down too.

Setting up developer laptops is also improved with Docker, though experiences vary... Run your ruby or python or node service locally if you prefer, set up a testing DB and cache in docker, and run any extra stuff in containers.

Lastly, I think CI is also incredibly worthwhile even for the smallest of companies and containers help keep the effort constant here too. The recipe is always the same.

Having used Docker and Kubernetes, and also spun up new VMs, I can say that Docker and Kubernetes are _not_ easier if you're new at it. Spinning up a new VM on Linode or the like is easier, by far.

Now, this may sound incredible to you, because if you're accustomed to it, Docker and Kubernetes can be way easier. But, and here's the main point, there are tons of organizations for whom spinning up a new server is a once every year or two activity. That is not often enough to ever become adept at any of these tools. Plus, you probably don't even want to reproduce what you did last time, because you're replacing a server that was spun up several years ago, and things have changed.

For a typical devops, this state of affairs is hard to imagine, but it is what most of the internet is like. This isn't to say, by any means, that FAANG and anybody else who spins up servers on a regular basis shouldn't be doing this with the best tools for your needs. I'm just saying, how you experience the relative difficulty of these tasks, is not at all representative of what it's like for most organizations.

But, since these organizations are unlikely to ever hire a full-time sysadmin, you may not ever see them.

Some of us have notes that we can mostly copy-paste to set up a server, and it works well without magic and n·m·k images.

Last time I checked, docker swarm was accepting connections from anywhere (publish really publishes) and messing with the firewall making a least-privilege setup a PITA; docker was building, possibly even running containers as root; and most importantly - the developers thought docker was magically secure, nothing to handle.

How do you handle your security?

An nginx container handles redirects to HTTPS and SSL termination and talks to the other services using unpublished ports. Only 22 (sshd running on server) and 80 and 443 (published ports) are open to the world. Swarm ports open between the swarm servers. That's between AWS security groups.

I don't build on my servers. A service (in a container) makes an outgoing connection to listen to an event bus (Google PubSub) to deploy new containers from CI (Google Cloud Builder).

Config changes (ie, adding a service) are committed then I SSH in, git pull and run a script that does necessary docker stack stuff. I don't mount anything writable to the containers.

I cannot agree that "Containerization universally makes first server deployment harder". Even at single person scale, tools like Docker-Compose etc make my dev life massively simpler.

In 2020, I'd argue the opposite for most people, most of the time!

Also, if your container runtime is preinstalled in your OS as is often the case, the first run experience can be as little as a single command.

One of my favorite things is how it forces config and artifact locations to be explicit and consistent. No more "where the hell does this distro's package for this daemon store its data?" Don't care, it puts it wherever I mapped the output dir in the container, which is prominently documented because it pretty much has to be for the container to be usable.

Hell it makes managing my lame, personal basement-housed server easier, let alone anything serious. What would have been a bunch of config files plus several shell scripts and/or ansible is instead just a one-command shell script per service I run, plus the presence of any directories mapped therein (didn't bother to script creation of those, I only have like 4 or 5 services running, though some include other services as deps that I'd have had to manage manually without docker).

Example: Dockerized Samba is the only way I will configure a Samba server now, period. Dozens of lines of voodoo magic horsecrap replaced with a single arcane-but-compact-and-highly-copy-pastable argument per share. And it actually works when you try it, the first time. It's so much better.

> you could still pretty easily run stuff in a container and set it up/treat it like a "pet" rather than "cattle"

Keep in mind, though, if you've got a pet stateful "container" you can SSH into, it's not really a container any more; it's a VPS.

(Well, yes, it is technically a container. But it's not what people mean to talk about when they talk about containers.)

When $1/mo fly-by-night compute-hosting providers are selling you a "VPS", that's precisely what they're selling you: a pet stateful container you can SSH into.

And it's important to make this distinction, I think, because "a VPS" is a lot more like a virtual machine than it is like a container, in how it needs to be managed by ops. If you're doing "pet stateful containers you can SSH into", you aren't really 'doing containers' any more; the guides for managing containerized deployments and so forth won't help you any more. You're doing VMs—just VMs that happen to be optimized for CPU and memory sharing. (And if that optimization isn't something you need, you may as well throw it away and just do VMs, because then you'll gain access to all the tooling and guidance targeted explicitly at people with exactly your use-case.)

A VPS (which is usually a virtual machine) will be running on top of a hypervisor, and each VM on the host will have their own kernel. Containers on the other hand are different because the kernel is shared among every container running on the host. The separation/isolation of resources is done via kernel features rather than by a hypervisor like a VM. Adding SSH and a stateful filesystem to your container to make it long lived doesn't make it any less of a container. To me that seems like saying "my car is no longer a car because I live in it. Now it's a house (that happens to have all the same features as a car, but I don't use it that way so it's no longer a car)"

If you're defining "container" not by the technology but rather by "how it needs to be managed by ops" then we're working with completely different definitions from the start. We would first need to agree on how we define "container" before we can discuss whether you can treat one like a pet rather than cattle.

Where does an RV fit into your taxonomy?

If you have stateful containers where changes persist across restarts of the container, then I think you can't really call them containers anymore. Just like if you have VMs with read-only filesystem images generated by the CI/CD pipeline, it's not unreasonable to describe them as container-like. Once you throw in containers with a stateful filesystem or a VM with a read-only filesystem into the mix, then 'container' is no longer a good description of what's going on, and more precise terms need to be used, especially as you get into more esoteric technologies like OpenVZ/Virtuozzo, which uses kernel features, and not virtualization, to provide isolation, but its containers are not the same as Docker's.

We could come to an agreement on the definition of container, but that wouldn't even be useful outside this thread, so maybe it's more useful to enumerate where the specific technology is and isn't important. The ops team cares about how the thing needs to be managed, and less so how it goes about achieving isolation. However, the exact technology in use is of critical importance to the security team. (Those may be the same people.) Developers, on the third hand, ideally don't even know that containers are in use; the system is abstracted away from them so they can worry about business logic and UX matters, and not need to worry about how to upgrade the fleet to have the latest version of the openssl libraries/whatever.

Containers were a thing before Docker was invented. LXC, OpenVZ and Solaris Zones should all count as containers. We need a different term for the immutable container style that Docker uses.

OpenVZ "VPS" offerings are, in fact, just containers with a shared kernel.

> A VPS (which is usually a virtual machine)

This is where I disagree. Like I said in my sibling post, the term "VPS" was invented to obscure the difference between VM-backed and container-backed workload virtualization, so that a provider could sell the same "thing" at different price-points, where actually the "thing" they're selling is a VM at higher price-points and a container at lower price-points. "VPS" is like "spam" (the food): it's a way to avoid telling you that you're getting a mixture of whatever stuff is cheapest.

Sure, there's probably some high-end providers who use "VPS" to refer solely to VMs, because they're trying to capture market segments who were previously using down-market providers and are now moving up-market, and so are used to the term "VPS."

But basing your understanding of the term "VPS" on those up-market providers, is like basing your understanding of the safety of tap water on only first-world tap water, and then being confused why people in many places in the world would choose to boil it.

(And note that I referred specifically to down-market VPS providers in my GP post, not VPS providers generally. The ones who sell you $1/mo VPS instances are not selling you VMs.)

> If you're defining "container" not by the technology but rather by "how it needs to be managed by ops" then we're working with completely different definitions from the start.

It seems that you're arguing from some sort of top-down prescriptivist definition of what the word "container" should mean. I was arguing about how it is used: what people call containers, vs. what they don't. (Or rather, what people will first reach for the word "container" to describe; vs. what they'll first reach for some other word to describe.)

Think about this:

• Linux containers running on Windows are still "containers", despite each running isolated in their own VM.

• Amazon Elastic Beanstalk is a "container hosting solution", despite running each container on its own VM.

• Software running under Google's gVisor is said to be running "in a container", despite the containerization happening entirely in userland.

• CloudFlare markets its Edge Workers as running in separate "containers" — these are Node.js execution-context sandboxes. But, insofar as Node.js is an abstract machine with a kernel (native, un-sandboxed code) and system-call ops to interface with that kernel, then those sandboxes are the same thing to Node that containers are to the Linux kernel.

• Are unikernels (e.g. MirageOS) not running as VMs when you build them to their userland-process debugging target, rather than deploying them to a hypervisor?

> To me that seems like saying "my car is no longer a car because I live in it. Now it's a house (that happens to have all the same features as a car, but I don't use it that way so it's no longer a car)"

A closer analogy: I put wheels on my boat, and rigged the motor to it. I'm driving my boat down the highway. My boat now needs to be maintained the way a car does; and the debris from the road is blowing holes in the bottom that mean my boat is no longer sea-worthy. My boat is now effectively a car. It may be built on the infrastructure of a boat—but I'm utilizing it as a car, and I'd be far better served with an actual car than a boat.

> CloudFlare markets its Edge Workers as running in separate "containers" — these are Node.js execution-context sandboxes.

This is inaccurate:

- Cloudflare Workers does not use Node.js at all. It is a new custom runtime built on V8.

- Cloudflare absolutely does not market Workers as using "containers", in fact we market them explicitly as not "containers": https://blog.cloudflare.com/cloud-computing-without-containe...

(Disclosure: I am the lead engineer for Workers.)


In the industry today, the term "container" refers to a hosting environment where:

- The guest is intended to be a single application, not a full operating system.

- The guest can run arbitrary native-code (usually, Linux) binaries, using the OS's standard ABI. That is, existing, off-the-shelf programs are likely to be able to run.

- The guest runs in a private "namespace" where it cannot see anything belonging to other containers. It gets its own private filesystem, private PID numbers, private network interfaces, etc.

The first point distinguishes containers from classic VMs. The latter two points distinguish them from things like isolates as used by Cloudflare Workers.

Usually, containers are implemented using Linux namespaces+cgroups+seccomp. Yes, sometimes, a lightweight virtual machine layer is wrapped around the container as well, for added security. However, these lightweight VMs are specialized for running a single Linux application (not an arbitrary OS), and generally virtualize at a higher level than a classic hardware VM.

Hmm, is this really true? Typically people mean lxd or docker when they say containers, but VPSes run on KVM or OpenVZ and are a different level of abstraction than a container. I could be misunderstanding VPSes but I believe they are true VMs?

OpenVZ is fundamentally a container system, almost exactly equivalent to LXC. (In fact, Linux namespaces and cgroups were effectively created through a refactoring and gradual absorption of OpenVZ-descended code.)

"Virtual Private Server" (VPS) is a generic marketing term used by compute providers to allow them to obscure whether they're backing your node with a true VM or with a container. Down-market providers of the kind I referred to always use it to mean containers.

Yes, these VPS provider containers are wrapped in a VM-like encapsulating abstraction by the compute engine (usually libvirt), but this is a management-layer abstraction, not a fundamental difference in the isolation level. VMs that use OpenVZ or Linux containers as their "hypervisor backend" leave the workloads they run just as vulnerable to cross-tenant security vulnerabilities and resource hogging as they would if said workloads were run on Docker.


But all that's beside my original point. My point was that, when you run a "pet stateful container that you can SSH into", you're Greenspunning a VPS node, without getting any of the benefits of doing so, using tooling (Docker) that only makes your use-case harder.

If you acknowledge what you're really trying to do—to run your workload under VPS-like operational semantics; or maybe even under VM-like operational semantics specifically—then you can switch to using the tooling meant for that, and your life becomes a lot easier. (Also, you'll make the right hires. Don't hire "people who know Docker" to maintain your pseudo-VPS; they'll just fight you about it. Hire VPS/VM people!)

Just to be clear, I don't think anybody is arguing that you should use containers like you would a VPS, merely that you can. I would bet everyone here would agree that just because you can doesn't mean you should :-D

Yeah, I see what you mean (when taking the word 'container' in its technical meaning.) I'm not arguing with that; in fact, that was the same point I was making!

But I think that people don't tend to use the word "container" to describe "a container used as a VPS."

Which points at a deeper issue: we really don't have a term for "the software-artifact product of the Twelve-Factor App methodology." We refer to these things as containers, but they're really an overlapping idea. They're signed/content-hashed tarballs of immutable-infrastructure software that can be automatically deployed, shot in the head, horizontally shared-nothing scaled, etc. These properties all make something very amenable to container-based virtualization; but they aren't the same thing as container-based virtualization. But in practice, people conflate the two, such that we don't even have a word for the type of software itself other than "Docker/OCI image." (But a Google App Engine workload is such a thing too, despite not being an OCI image! Heck, Heroku popularized many of the factors of the Twelve-Factor methodology [and named the thing, too], but their deploy-slugs aren't OCI images either.)

My claim was intended to mean that, if your software meets none of the properties of a [twelve-factor app OCI image workload thing], then you're not "doing [twelve-factor app OCI image workload thing]", and so you shouldn't rely on the basically-synonymous infrastructure that supports [twelve-factor app OCI image workload thing], i.e. containers. :)

Ah ok, cool yeah I think we're in total agreement then. No doubt you are absolutely right, the word container is used commonly to mean all sorts of things that aren't technically related to the technology we call containers :-)

I do think a lot of enterprise marketing and startup product pitching has made this problem so much worse. I see this a lot with Red Hat customers (and Red Hat employees too for that matter). "Containers" are painted as this great solution and the new correct way of doing things, even though much of what is being sold isn't tied to the technical implementation of containers. There indeed isn't a good marketing-worthy buzzword to describe immutable infrastructure/12-factor app and all that at a high level.

No, it isn't true. The OP basically says:

If you use containers like VPSes then you basically have a VPS but in a container.

No, this is dogma ;)

Everything is a host and can be used for anything.

> Before adopting containers it wasn't unusual to SSH in and change a line of code on a broken server and restart. In fact that works fine while the company/team is really small. Unfortunately it becomes a disaster and huge liability when the team grows.

Writing a script to ssh into a bunch of machines and run a common command is the next step. That works far longer than most people acknowledge.
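As a rough illustration of the "ssh into a bunch of machines and run a common command" step, here is a minimal Python sketch. The function names and the injectable `runner` parameter are assumptions made for this example, not anything from the thread; the only external tool it relies on is the standard `ssh` client with its `BatchMode` option.

```python
import subprocess
from typing import Callable, Dict, List, Sequence


def ssh_argv(host: str, command: str) -> List[str]:
    """Build the argv for running one shell command on one host over ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is the behaviour you want in an unattended loop.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def _run(argv: List[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(argv).returncode


def run_on_hosts(
    hosts: Sequence[str],
    command: str,
    runner: Callable[[List[str]], int] = _run,
) -> Dict[str, int]:
    """Run the same command on every host, one after another.

    Returns a mapping of host -> exit code so that failures stay visible
    instead of scrolling past in the terminal.
    """
    return {host: runner(ssh_argv(host, command)) for host in hosts}
```

Calling `run_on_hosts(["web1.example.com", "web2.example.com"], "uptime")` (hypothetical host names) would run `uptime` on each host in turn. Passing a fake `runner` lets you dry-run the loop without touching any server.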

> I pine for the old days - I really do. Things are insanely complex now and I don't like it. Unfortunately there are good reasons for the complexity.


Containers provide solutions to the problems that someone else had. If you don't have those problems, then containers just create complexity and problems for you.

What problems do they solve? They solve, "My codebase is too big to be loaded on one machine." They solve, "I need my code to run in parallel across lots of machines." They solve, "I need to satisfy some set of regulations."

If you do not have any of those kinds of problems, DON'T USE CONTAINERS. They will complicate your life, and bring no benefit that you care about.

Counterpoint: in many ways it's much simpler than 20 years ago. Docker, k8s, etc. are miles beyond the type of automation I used to have to deal with from the operations type people.

We have used chroots + a bunch of perl scripts for 20 years. Besides APIs for adding/deleting nodes or autoscaling nodes, nothing much changed for us. And, as I have remarked here before (as it is one of my businesses), that extra freedom, esp autoscaling, is almost never needed and, for most companies, far more expensive than just properly setting up a few baremetal machines. Most people here probably vastly underestimate how many transactions a modern server can handle and how cheap this is at a non-cloud provider. Of course, badly written software will wreck your perf, and then nothing can save you.

You’re comparing cowboy sysadmin (mutable servers, ssh in and live-edit stuff) to process-heavy devops with CI. These are orthogonal to containers/not-containers.

If you don’t use CI, it’s easy to get fast deploys with containers. Just build the images on your dev box, tag with the branch and commit hash, and push directly to a docker registry (docker push is smart enough to only push the layers the registry doesn’t already have). Code is running 30 seconds after compiling finishes.

(Don’t want to pay for a registry? It’s trivial to run one yourself)
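The "tag with the branch and commit hash, then push" workflow could look roughly like the Python sketch below. The helper names, the `<branch>-<short commit>` tag layout and the example registry are assumptions for illustration. The only hard facts it leans on are that Docker tags are limited to letters, digits, `_`, `.` and `-`, may not start with `.` or `-`, and are at most 128 characters long.

```python
import re
from typing import List


def image_tag(branch: str, commit: str) -> str:
    """Combine a branch name and a commit hash into a valid Docker tag."""
    # Branch names such as "feature/login" contain characters that are not
    # allowed in tags, so anything outside the permitted set becomes '-'.
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch).lstrip(".-") or "branch"
    # Truncate the branch part (not the hash) to stay under the 128-char limit.
    return f"{safe_branch[:100]}-{commit[:12]}"


def build_and_push_commands(
    registry: str, name: str, branch: str, commit: str
) -> List[List[str]]:
    """Return the docker commands that build and push the tagged image."""
    ref = f"{registry}/{name}:{image_tag(branch, commit)}"
    return [
        ["docker", "build", "-t", ref, "."],
        ["docker", "push", ref],
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes it easy to print the commands first and see exactly what would be pushed.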

These aren’t foolproof, fully reproducible builds, but practically they’re pretty close if your tools require your repo to be clean and pushed before building the image, and if your build system is sane. Besides, if you’re used to editing code as it’s running on servers, you don’t care about reproducible builds.

Also, if you’re starting containers manually on the command line, you’re doing it wrong. At least use compose so your setup is declarative and lifetime-managed.

(Edit: s/swarm/compose/)

> Also, if you’re starting containers manually on the command line, you’re doing it wrong. At least use compose so your setup is declarative and lifetime-managed.

As I wrote in a nearby comment, I'm not starting containers manually - we have compose, swarm, it's declarative and lifetime-managed.

However, we often need to do some bespoke data analysis, so we often ssh into a server, type `make shell` to launch a REPL and type/paste some stuff into it.

You can do all that with containers quickly.

I develop my stuff with k8s with all its lifetime management and have 10 second deploys from my dev box.

Still 10 seconds. I have 0 seconds deploy for some ‘cowboy development’ I do; it is now a competitive advantage :) 10 sec (and for almost all setups I have seen it is vastly more than that) is a lot while devving and deploying for test. To each their own, but just fixing bugs live with my client (Zoom + me live on the test server fixing many issues in an afternoon) is vastly more efficient for me. Obviously committing from the test (now dev) server to github will result in ci/cd to the staging server, but workflows where I work on my local, commit, ci/cd to test and then the client tests is vastly slower and I do not like it; it feels like a waste of my time.

In my main business I am forced (regulatory) to do it all by the book; vastly less enjoyable for me.

10 seconds deploys? How is this possible :) Do you have any links that explain your workflow?

It's nothing fancy - I just use a typical k8s Service/Deployment object on GKE. A deploy is:

  1. docker build (most layers cached) - 2s
  2. docker push - 2s
  3. update deploy.yaml
  4. kubectl apply -f deploy.yaml
  5. kubectl rollout status deployments {name} - 6s
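The five steps above can be sketched as a small fail-fast pipeline in Python. This is only an illustration under assumptions: the function names are invented here, the image reference and deployment name are placeholders, and step 3 (editing deploy.yaml to point at the new tag) is left out because the thread does not say how it is done.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, List[str]]


def deploy_steps(image: str, manifest: str, deployment: str) -> List[Step]:
    """List the external commands of the deploy, in order."""
    return [
        ("build", ["docker", "build", "-t", image, "."]),
        ("push", ["docker", "push", image]),
        # The manifest is assumed to already reference `image` at this point.
        ("apply", ["kubectl", "apply", "-f", manifest]),
        ("wait", ["kubectl", "rollout", "status", f"deployment/{deployment}"]),
    ]


def _run(argv: List[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(argv).returncode


def run_steps(
    steps: List[Step], runner: Callable[[List[str]], int] = _run
) -> Optional[str]:
    """Run the steps in order and stop at the first failure.

    Returns the name of the failing step, or None if everything succeeded.
    """
    for name, argv in steps:
        if runner(argv) != 0:
            return name
    return None
```

Stopping at the first non-zero exit code matters here: applying a manifest that points at an image whose push failed would leave the rollout waiting on an image that does not exist.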

Lucky!! With our infrastructure, deploys can take an hour or so. 10 minutes for the build, 10 minutes for the image to get built, plus the rest of the time for terraform to apply infrastructure changes across dev, staging, and prod. Only thing we do have is automated testing after every deploy so issues tend to get caught. But that's still so long for a deploy! I don't know of a good way to get it down faster.

Why is terraform deploying infrastructure for every container deployment? Can't you just rollout onto the existing infrastructure?

Also sounds like there is some low-hanging fruit available by adding some caching/layering in the build process.

Ah, I should have explained more. That's an hour for our ASG + EC2 deployments. The only benefit we get is easy roll backs because it always deploys a new ASG. We're switching over to Spinnaker and started with our EC2 infrastructure. I think container deployments will be faster but still, an hour for EC2 deployments!

Yeah this is exactly what I do too, works just fine. You probably already have something like this, but I hacked a bash+yq script that automatically updates all relevant yaml files with the latest image tag. So getting new code running is two lines:

make image push deploy

kubectl apply -f somewhere/deploy.yaml
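The commenter's tag-bumping script is bash plus yq and is not shown. As a stand-in, here is a minimal Python sketch of the same idea that rewrites `image:` lines with a regular expression. It is an assumption-laden simplification: the function name is invented, it works on plain text rather than parsed YAML, and it does not handle quoted values or digest references such as `@sha256:...`.

```python
import re


def set_image_tag(manifest_text: str, repository: str, new_tag: str) -> str:
    """Point every `image: <repository>[:tag]` line at `new_tag`.

    Lines that reference other repositories are left untouched.
    """
    pattern = re.compile(
        r"^(?P<prefix>[ \t]*(?:-[ \t]+)?image:[ \t]*)"
        + re.escape(repository)
        + r"(?::[\w.\-]+)?(?P<suffix>[ \t]*)$",
        re.MULTILINE,
    )
    return pattern.sub(
        lambda m: f"{m.group('prefix')}{repository}:{new_tag}{m.group('suffix')}",
        manifest_text,
    )
```

A caller would read each manifest file, pass its text through `set_image_tag`, and write the result back before running `kubectl apply`. For anything beyond simple manifests, a real YAML parser is the safer choice.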

Take a look at skaffold. Ours is under 10s

I think people kind of boiled themselves alive with Docker and don't step back to think if they're where they want to be often enough.

Docker first started getting traction when people were building their software with make. Make never quite got caching right (not really its fault), so nobody was really sure that their changes were going to show up in their release unless they ran "make clean" first. And, you had to have 700 helper utilities installed on your workstation for a build to work after "make clean". Docker to the rescue! It gives you a fresh Linux distribution, apt-get installs all the utilities needed for your build, and then builds your thing. It works 100% of the time! Celebrate!

At the same time, programming languages started moving towards being build systems. They want "mylang run foo.ml" to be fast, so they implemented their own caching. But this time they did it right; the "mylang" compiler knows about the effect every input file has on every output file, so it's guaranteed to give you the right answer with or without the cache. Some languages are so confident these days that you can't even disable the cache! They know it works perfectly every time. The result is extremely fast incremental builds that are just as reliable as clean builds, if you have access to that cache.

This, unfortunately, is not something that Docker supports -- layers have one input filesystem and one output filesystem, but now languages are producing two outputs -- one binary for the next layer, one cache directory that should be used as input (opportunistically) to the next build. The result is, to work around people writing makefiles like "foo.o: foo.c" when they actually meant "foo.o: foo.c foo.h helper.h /usr/include/third-party-library.h /usr/lib/libthird-party.so.42", EVERYONE suffers. "mylang run foo.ml" takes a few milliseconds on your workstation, but 5 minutes in Docker. Your critical fix is hamstrung by your CI infrastructure.
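To make the "foo.o: foo.c" problem concrete, here is a small Python sketch of a content-addressed cache key. It is an illustration only, not how any particular compiler or build tool implements its cache; the function name and the in-memory file contents are assumptions for the example.

```python
import hashlib
from typing import Mapping


def cache_key(inputs: Mapping[str, bytes]) -> str:
    """Hash every declared input (name and contents) into one cache key."""
    digest = hashlib.sha256()
    for name in sorted(inputs):
        # Length-prefix each field so that different (name, content) splits
        # of the same bytes cannot produce the same key.
        for field in (name.encode("utf-8"), inputs[name]):
            digest.update(len(field).to_bytes(8, "big"))
            digest.update(field)
    return digest.hexdigest()


# Two versions of a tiny project: only the header changes.
before = {"foo.c": b'#include "foo.h"\n', "foo.h": b"#define N 1\n"}
after = {"foo.c": b'#include "foo.h"\n', "foo.h": b"#define N 2\n"}

# Declaring only foo.c (the "foo.o: foo.c" rule): the key does not change,
# so a cache would wrongly reuse the old object file.
stale_hit = cache_key({"foo.c": before["foo.c"]}) == cache_key({"foo.c": after["foo.c"]})

# Declaring every real input: the key changes and the rebuild happens.
rebuilds = cache_key(before) != cache_key(after)
```

Here `stale_hit` and `rebuilds` are both true. A cache is only as trustworthy as its list of inputs, which is why toolchains that track every input themselves can keep the cache always on, while an under-declared makefile pushes people toward clean builds.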

There are a number of solutions to this problem. You could have language-specific builders that integrate with your CI system, take the source code and a recent cache as inputs, and produce a Docker container as output. (Systems like Bazel even have a protocol for sharing cache between machines, so you don't have to copy gigabytes of data around that you probably don't need.)

But instead of doing that, people are writing articles about how their CI takes 3 hours when a build on their workstation takes 3 seconds, and it's because containers suck! But they don't actually suck -- the underlying problem is not saying "in production, we will only run .tar.gz files that contain everything the application needs". That is actually wonderful. The underlying problem is "the first build step is 'rm -rf /' and the second is 'make world'".

> Docker first started getting traction when people were building their software with make.

A result of this is that containerisation took off hardest in ecosystems without good build and deploy tools. Getting exactly the right version of the application, every library, and the runtime has traditionally been a struggle in Ruby, a nightmare in Python, torture in C, etc, but pretty easy in Java. As a result, most of the Java shops I know are still deploying by copying a zip file or some such onto a server.

I'm astounded that docker still doesn't have the concept of 'build volumes' that can be used for streaming artifacts and caches into build steps.

That said, 'docker build' is not the only game in town. For a long time, Red Hat's source-to-image tool has been able to do incremental builds:

    s2i build --incremental /path/to/mycode mybuilderimage myapp
This creates a new container from the mybuilderimage image, copies the source code from /path/to/mycode into the container at /tmp/src, and runs mybuilderimage's 'assemble' script (which knows how to invoke "mylang run foo.ml" and install the build result into the right place so that the 'run' script will later find it). The result is committed as the myapp container image, which typically uses mybuilderimage's 'run' script as its entry point.

The next time the command is run, the first thing s2i will do is create a new container based on myapp, and invoke its 'save-artifacts' script, which has the job of copying out any build caches and other artifacts to be used in an incremental build. The container is then discarded.

Now, the build runs as before, but with the addition of s2i copying the saved build artifacts into /tmp/artifacts, so that the 'assemble' script can use them to make the build faster.

This isn't perfect: you pay for speedier builds with larger container images, since you can't delete build caches, etc. like you'd normally do at the end of a Dockerfile. But it's a good first step, and you can always have another step in your pipeline that starts with the myapp container, deletes unwanted files and then squashes the result into a single layer above the original base image that mybuilderimage was itself built from.

It does, now. The whole build system was uprooted with the change from legacy build to buildkit. One of the current "experimental" features is build volumes.

Do you have a link with a description? I can only find years old open docker github issues, 3rd party software or some very hacky solutions

This is a great comment - especially the last 2 paragraphs.

Docker is not the same as containers - docker is just one way to build containers. I've not seen one yet that would be easier to use (and as a result faster than docker) but it does not mean we cannot create one!

I personally use Nix to build Docker-compatible containers in production.

If you need a hot fix "RIGHT NOW" you might be doing something wrong in the first place.

Being able to just ssh into a machine is one of the problems that we did solve with containers. We didn't want to allow anyone to SSH into a machine and change state. Everything must come from a well-defined state checked into version control. That's where containers did help us a lot.

Not sure what you mean with lingering containers. Do you launch yours through SSH manually? That's terrible. We had automation that would launch containers for us. Also we had monitoring that did notify us of any discrepancies between intended and actuated state.
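For what it's worth, the "discrepancies between intended and actuated state" check can be as simple as a set difference. A rough Python sketch, with made-up service names and image tags (in practice the two dicts would come from version control and from the container runtime):

```python
def diff_state(intended, actual):
    """Compare intended {service: image} with what is actually running."""
    common = set(intended) & set(actual)
    return {
        "missing": sorted(set(intended) - set(actual)),
        "unexpected": sorted(set(actual) - set(intended)),
        "wrong_version": sorted(n for n in common if intended[n] != actual[n]),
    }

# Hypothetical example: a worker is down, web runs an old image,
# and someone left a manually started shell container running.
intended = {"web": "web:1.4.2", "worker": "worker:2.0.0"}
actual = {"web": "web:1.4.1", "debug-shell": "web:1.4.1"}
report = diff_state(intended, actual)
```

A lingering container started by hand shows up under `unexpected`, which is the kind of thing the monitoring should alert on.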

Maybe containers aren't the right tool for your usecase - but I wouldn't want to work with your setup.

Btw. most of this is possible with VMs, too. So if you prefer normal GCE / EC2 VMs over containers that's fine, too. But then please build the images from checked in configuration using e.g. Packer and don't SSH into them at all.

> If you need a hot fix "RIGHT NOW" you might be doing something wrong in the first place.

Can we please grow up beyond this kind of comment?

I suspect everybody knows that hotfixing production isn't an ideal thing to do, and the many reasons why that's the case, but lots of us nevertheless do it from time to time. Where I work now we've probably hotfixed production a handful of times over the past couple of years, amongst thousands of "proper" deployments. It's really not a big deal.

We know so little of the context behind OP's issues that picking holes isn't helpful or informative. The specific question was about migrating from containerized infrastructure back to a more traditional hosting model. I for one would be interested in reading about any experiences people have in this area so it's quite frustrating to find this as the top comment when it's substantially off-topic.

>Can we please grow up beyond this kind of comment?

No platform that assigns quantitative virtue points to comments based on how many people agree with that comment will ever not have an abundance of people lobbing low-effort quips and dog whistles designed to appeal to the majority.

This should probably be the banner message when you arrive on this site. Well said.

this happens in the workplace as well though

> this happens in the workplace as well though

Exactly this. I'm fortunate in that where I work now it's pretty rare: the culture is fairly collegiate and friendly, and there's a strong sense of "we're all on the same side". Other places I've worked every meeting has been an exercise in point scoring, which is incredibly wearing and - try as one might - the culture does end up influencing one's own behaviour.

You can obviously get preachy about how people should be stronger characters and not so easily influenced but the phrase "Bad company corrupts good character" exists for a reason. Unless you're Gandhi it's incredibly difficult for many people to resist the culture around them day in and day out without some positive reinforcement from the behaviour of others, especially when you're incentivised to do otherwise, and may be penalised for not doing so[0].

[0] The answer is, if you can, find another job. Not always an option, but a good idea if it is.

Same experience here, for the last 4400 deployments over the last 4 years at my current job we had to do 3 live patches on production. This just happens in the real world. Last occurrence was 2 months ago when a deployment step that worked well several thousands of times failed in an unrecoverable state with our deployment pipeline.

Strong disagree, I've worked at/seen so many places that do deployments in reckless and awful ways that I don't think it's obvious to most people.

Even in this thread, the arguments "against" containers, a lot of these things have alternatives that are still better. For instance, ok maybe you need to make a hotfix, you're still better off just compiling a one-off image on a dev machine and deploying it (I used to be able to do this in like, 30 seconds), as compared to trying to edit files directly on the server. Especially if you have multiple servers.

Thank you!

You're absolutely right. There is no context outside of the few lines OP asked and the quote you highlighted shows a lack of real world experience.

Yes it's true the "right now" fixes indicate there is a problem, but in small shops it's generally the most reasonable approach. Now if you're on a team of 50 other people and you need to make "right now" fixes then there is certainly a problem. Neither of which any of us can know from the context.... but besides the point, that it's not even on topic.

I don't even think "right now" fixes necessarily indicate a problem in development approach. Sometimes unexpected things happen and you need a way to fix them quickly, without being bogged down by infrastructure.

A great example of this is using a production repl. It even served NASA well (https://stackoverflow.com/questions/17253459/what-exactly-ha...). Having to change your system on-the-fly is not always an indication of a poorly developed or managed system.

If the rate of unexpected things is "sometimes" and not "very rarely" it probably does indicate that there are problems with development, build process or something else in the infrastructure.

Most of us are building CRUD apps not sending people to the moon.

This is very common and just because it doesn't match your use case doesn't mean some businesses don't need a hot fix "right now." If you work in 24/7 ecommerce and your site is producing $60k per hour and there is a network failure that breaks something you need a hot fix right now, otherwise your 3 hour code review, build, q/a, deploy pipeline will cost the company $180,000.

> otherwise your 3 hour code review, build, q/a, deploy pipeline will cost the company $180,000

Taking code review away, because we all know we're not going to sit around waiting for a review while a patch needs to go out urgently, if your build, test, and deploy pipelines take three hours then you have some serious problems that you need to address, and containers aren't it.

There are methods for handling hotfixes/patches to production quickly that work well in high volume sales website setups.

> Taking code review away, because we all know we're not going to sit around waiting for a review while a patch needs to go out urgently

I worked at a trading firm where you could add a zero to that cost for a 3-hour outage, and I can tell you that the one thing we would absolutely never skimp on was code review. Because the cost of a bad "fix" that actually makes things worse has the potential to be greater still, and because humans are most likely to make silly mistakes when they're working under intense pressure.

What we would do instead is slightly intensify the code review process by pair programming the hotfix, and ensuring that a third developer who was familiar with the system in question was standing by to follow up with an immediate review.

Pair programming anything as urgent as a hotfix works really well. It takes some pressure off of the developer working on it and turns it into a team event.

We will even sometimes keep the video call open until deployment is done and production validation is complete - the devops guys get the information they need, someone else is keeping an eye on the checklist and calling out items if necessary, etc.

Oh yeah, absolutely. I was happiest when I also had the attending ops person looking over my shoulder while I worked on the fix.

I really like that approach. It also takes pressure off the tech lead or whoever is implementing the fix, and transforms the patch into a full-team responsibility. I bet this sort of behavior makes for strong and effective teams.

Please do tell. Our containers never take less than 10 minutes to deploy (a single one).

I have app stacks that deploy a dozen containers in seconds because they are stateless and close to "functional" – just transformations over inputs.

I have app stacks that deploy a dozen containers over an hour because the orchestration takes time: signal the old containers to drain, pause for an app with a very long initialization time to settle, gradually roll traffic to the new one to let caches warm, and then repeat.

In both of these cases, deployment is a function of the application. There's nothing infrastructural that puts a time floor on things.
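A small Python sketch of the slow case, just to show where the time goes. The step names, traffic percentages and timings are assumptions for illustration, not anyone's real orchestration:

```python
import time

def rolling_deploy(replicas, new_version, warmup_seconds=0.0,
                   traffic_steps=(10, 50, 100)):
    """Replace replicas one at a time and return the ordered event log."""
    log = []
    for name in replicas:
        log.append("%s: drain old" % name)
        log.append("%s: start %s" % (name, new_version))
        time.sleep(warmup_seconds)  # stand-in for a long app initialization
        for pct in traffic_steps:   # shift traffic gradually so caches warm up
            log.append("%s: traffic %d%%" % (name, pct))
    return log

def estimated_seconds(replica_count, drain_s, init_s, warm_s):
    """Sequential rollout: total time is per-replica time times replica count."""
    return replica_count * (drain_s + init_s + warm_s)

events = rolling_deploy(["app-1", "app-2"], "v2")
# Hypothetical timings: 12 replicas at 5 minutes each is a one-hour deploy.
hour_long = estimated_seconds(12, drain_s=60, init_s=180, warm_s=60)
```

Every term in that sum is a property of the application (drain time, init time, cache warm-up), not of the container runtime.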

Sure, not contending that. It's just that in my memories blue/green deployments still took less time, although I can't say how much.

Not a serious reply, but it may be interesting: One million containers over 3500 nodes in 2 minutes: https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orl...

Well 10 minutes is way less than 3 hours.

Certainly. But it does add up and it does kill motivation for rapid iteration. :(

10 minutes to deploy to production should be fine, but rapid iteration shouldn't be happening in production. It sounds more like the complaint is related to running a similar setup in dev, and taking 10 minutes to see changes during development (which is understandably too long).

Yep, that's what I meant. I have no choice but to run a local k8s cluster or else I can't test.

and then we're back to the original statement, if you're rapidly iterating hot fixes you have serious problems and likely doing it wrong.

A lot of businesses that operate 24/7 run on containers quite well. When I had to do these sorts of quick hot fix things for a startup, almost all of the issues were caused by lack of testing. Testing lacked because there wasn't an easy way to constantly make sure dev, staging and prod are absolutely the same. Same infrastructure, same code, same packages, etc.

That's an easier solve with docker containers. And testing, including UI testing, can be integrated much more easily with CI/CD tools and docker containers that have code which goes by commit hashes and which ensure every package down to the version is controlled across environments.

Someone fucking directly with prod might also cost $180000.

One will cost and another might, see the difference

Time to deploy is a known value. The impact potential of having a workflow where ssh'ing into production is even possible can cost you buckets as: 1. You're messing with production, obviously. 2. Your infrastructure isn't stateless. 3. Your infrastructure is likely not HA. 4. You likely don't have canaries in place to mitigate the impact of bad production deployments.

And the impact of all of these on direct revenue/productivity can be immense. SSH into prod is a crutch.

They are both a might. Both are a probability distribution.
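To put the "both are a probability distribution" framing into numbers, here is a toy Python calculation. The $60k per hour and the 3-hour pipeline are the figures quoted upthread; the probabilities and outage lengths for the manual fix are entirely made up, so plug in your own.

```python
REVENUE_PER_HOUR = 60_000  # figure quoted earlier in the thread

def outage_cost(hours, revenue_per_hour=REVENUE_PER_HOUR):
    """Lost revenue for an outage of the given length."""
    return hours * revenue_per_hour

def expected_cost(scenarios):
    """scenarios: list of (probability, outage_hours); probabilities sum to 1."""
    return sum(p * outage_cost(h) for p, h in scenarios)

# Waiting for the full 3-hour pipeline: the cost is certain.
pipeline_fix = [(1.0, 3.0)]

# Hypothetical manual fix: usually done in 15 minutes, but 1 time in 10
# it makes things worse and the outage stretches to 6 hours.
manual_fix = [(0.9, 0.25), (0.1, 6.0)]

pipeline_expected = expected_cost(pipeline_fix)
manual_expected = expected_cost(manual_fix)
```

With these invented numbers the manual fix comes out ahead; raise the failure probability or the cost of a botched fix and the conclusion flips, which is the whole disagreement in this subthread.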

In the long run "fucking with production" as well *will* cost a fortune.

Citation needed. I’ve “fucked” with many prod systems over the last 15 years and not caused outages, or extended them.

Those costs need to be amortised annually.

There are other costs from outages which occur infrequently but are highly costly because developers/support are manually touching the servers.

The cost really should be analysed as what is the median cost on revenue (or profit) as a percentage point. One off pricing is pretty meaningless.

you rollback and then make a proper fix.

> If you need a hot fix "RIGHT NOW" you might be doing something wrong in the first place.

Maybe I phrased this point badly. I haven't found myself to need a hot fix RIGHT NOW for a long time. What I had in mind was this: sometimes there is an issue that is very hard to reproduce locally. If a developer doesn't understand it from the get go and has to experiment a bit, it can be frustrating to commit, wait for deploy (even to stage environment), test what happens and repeat.

> Not sure what you mean with lingering containers. Do you launch yours through SSH manually?

Of course the application itself is run automatically, web server, workers, etc.

However, we often need to do some bespoke data analysis, so we often ssh into a server, type `make shell` to launch a REPL and type/paste some stuff into it.

I think you might be falling into a cycle that I often find myself falling in to. It goes like this: Oh, there's a bug. But it's obviously caused by _this line here_, so I won't even bother testing it offline, I'll just fix it and run CI and then... oh no, wait, it was more complex than I thought, but I'm in the change-CI-test cycle now so I'll keep doing that. And then, all of a sudden it's taken me three hours or more to fix what was in reality quite a small bug.

The solution is to force yourself not to use the CI/container system at all during debugging, and instead to build the binary (or whatever) standalone. That's hard because you invariably aren't tooled up to run the component outside its deployment system so you have to do some extra tricks, but in the long run it seems to be the way to go.

Over the years I've learned to do this very rarely. However, I'm not the lone wolf in the woods, I also have other developers working for me. I can talk to them however long I want and motivate them to debug locally to understand the problem, but in the end they have to gain some experience with their own pain. I would prefer that their experience gaining would be faster, so it would cost me less money :)

> I would prefer that their experience gaining would be faster, so it would cost me less money :)

Maybe your current strategy isn't working?

    docker exec -it <container id>
    kubectl exec -it pod -c container

Not sure what you want to say with this command.

you want to be able to "ssh in and make changes to figure out how to fix things". why doesn't that do that for you?

Sorry, but did you not forget the --rm option to remove the container once it dies ?

For both docker and k8s, that command is 'exec into already running container', not 'spin up a new container'.

This is all getting quite heated! Please take it outside.

This was a joke. Humour.

Please contain yourself.

I interpret it as a suggestion that you allow developers access to the running container in production to debug the issue

There are more sane ways to do this than through raw kubectl access, of course: see e.g. telepresence (https://www.telepresence.io/)

By sane, you mean yet another piece of infrastructure that has to be installed, documented, packaged into containers, deployed and updated for a problem that did not exist without containers. For a very large organization (e.g. hundreds of devs) this makes sense, but for a medium sized company (> billion EUR turnover per year outside of software business) this soon becomes just another piece of overhead.

This problem certainly exists without containers.

- How do you provide easy visibility into running applications?

- How do you prevent inspecting an application from affecting the behavior of an application?

You put this fairly succinctly. And I agree wholeheartedly.

I think of the phrase “adding epicycles” when this kind of stuff starts happening.

I’ve worked in a number of areas with basically no fail/delay SLAs. I think it’s naive to think “if you need a hot fix right now, you’re doing it wrong”...the number of times we needed to hot fix because of ourselves was very low. But when you’re in an integration heavy environment and one of the many moving parts (outside of your control) breaks, well thought out “put the fire out” stopgaps on the server consistently save the day (and the company money by not breaching the SLA)

That makes perfect sense and it's definitely true that sometimes the hotfix is not a bug in your code (which can be solved by a rollback) but instead having to patch a problem in a dependent system. But that seems orthogonal to the container issue. Shelling into a live server and changing something only works if you have the entire build toolchain on the production server which hasn't generally been the case in my experience. Even if you aren't using containers you still need to build artifacts and deploy them. It's just that you are deploying binary artifacts instead of containers. It doesn't seem like the container builds are the real long pole in that process.

Redeploy the older working version?

"outside your control" is key here. you're assuming a rollback would work. in many cases, some external system changes without your knowledge, and you're only seeing those changes on production.

I've got a client that has data feeds from multiple vendors. some are pulls, some are... "hey, we'll FTP this file to you". the file format has changed - unannounced - at least 3 times in the past... 15 months. Then something breaks on production, but you don't know what. You need to get on that machine and take a look.

"Redeploy the older working version" doesn't do anything except re-introduce more problems in these instances.

This is a good point. There are probably lots of people on HN working in cloud environments where your dependencies are actually organizationally within your control. If one of your dependencies makes a change that breaks you, you can escalate the problem and compel them to roll back the change. This is the luxury of building the entire world. My service depends on nothing that can't be escalated to my own VP, so "roll back to the old version [of whatever changed]" is a very satisfying answer, but it's not an option when your dependencies aren't obligated to keep you running.

I pity anyone whose system needs less than X downtime per month, but who depends on the constant availability of an external system that is down for more than X per month :)

> Then something breaks on production, but you don't know what. You need to get on that machine and take a look.

If that's a problem you find yourself having at all, much less regularly, you have a serious observability problem.
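As one small example of what that observability could look like for the vendor-file case: check the header of every inbound file against what you expect and report the difference loudly before processing. A Python sketch, where the column names are a hypothetical feed layout:

```python
import csv
import io

EXPECTED_COLUMNS = ["sku", "price", "quantity"]  # hypothetical feed layout

def check_feed(text, expected=EXPECTED_COLUMNS):
    """Return a list of human-readable problems; empty means the header matches."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    problems = []
    missing = [c for c in expected if c not in header]
    extra = [c for c in header if c not in expected]
    if missing:
        problems.append("missing columns: %s" % missing)
    if extra:
        problems.append("unexpected columns: %s" % extra)
    if not missing and not extra and header != expected:
        problems.append("column order changed: %s" % header)
    return problems

ok = check_feed("sku,price,quantity\nA1,9.99,3\n")
changed = check_feed("sku,unit_price,quantity\nA1,9.99,3\n")
```

This won't stop a vendor from changing the format unannounced, but the alert then says which column moved instead of "something broke on production".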

Isn’t the larger issue that your production environment can be brought down by bad user input?

There's breakage outside "brought down". A system that's running but doesn't produce outputs because the input data changed can be "broken" and violating its SLA too. And not really something you can design around outside "we won't promise anything", but then you lose to competitors that do take it on themselves to react quickly enough with hotfixes.

How fast can you grow if you’re constantly putting out fires? It sounds like you are in a B2B world. Businesses always want more. Where are the sales people/customer service managers that can set realistic expectations on what the client requires vs. what they are expected to do? “logging into production and manually correcting stuff” can only go so far and doesn’t scale.

> If you need a hot fix "RIGHT NOW" you might be doing something wrong in the first place.

Without any context that seems like an outrageously ignorant comment. (In saying that most companies probably are doing many things wrong).

> In saying that most companies probably are doing many things wrong

I think that sounds actually very plausible :)

Agreed. This confused the heck out of me.

Almost sounds like script kiddies when they start learning programming for the first time in their lives they assume that the test of successfully learning how-to-program is that your code should compile on first try without errors.

Give me a break. Everyone needs a hotfix occasionally.

If your users are asking for something and your response is “you’re just doing it wrong!”, you’re probably the one who is wrong.

But if your developers are asking for something and your response is "you're doing it wrong" (hopefully not verbatim), you've got a good mentorship opportunity on your hands

I don't know what I'd do without being able to ssh into VM instances. Whether it's for looking at various logs, the occasional core dump, or uploading a custom binary to test something, it's incredibly time saving.

But... you can ssh into a container and change state (depending on config options), can't you? I'm not sure I follow this response or the OP's complaint.

I mostly write python code and one very nice pattern I've found is to run a container somewhere (either locally or on a server somewhere) with an open port. You can SSH into it and use the remote interpreter as your main project interpreter. That way your dev environment is 100% reproducible. VS Code and PyCharm Pro both support doing that.

I think OP is working with a farm of containers that are spawned and destroyed dynamically. Which means SSHing on one and fixing a problem would not really help.

For small deployments, running a Common Lisp web app with remote repl access is great for finding problems. I wouldn't recommend this for high traffic apps but there are many use cases in this world for small user base, focused web apps where maximizing developer productivity is required for profitability.

Depends on how you set up the container.

> Being able to just ssh into a machine is one of the problems that we did solve with containers.

you should be able to ssh into a machine and strace a process to see why something is going wrong. If your only solution is always "restart the container" or "revert to an old container" or "only deploy containers that are known to work" you're not actually debugging extant problems.

We did at feeder.co. We ran into so many quirks with Docker and Kubernetes that we decided to go back to ”bare VM”. It was great! We’re a small team, so any resources saved are a great thing. It always felt like we containerized to be able to ”scale up quickly” or ”iterate faster”. That of course was never needed, what was needed however was to scale up infrastructure to keep up with the resources required by Docker Swarm, k8s or just Docker in general.

Also, running services is hard already. Adding another layer that caused existing assumptions to break (networking, storage, etc) made it even harder.

Bare VMs are crazy fast and running apt install is crazy easy. Not looking back at the moment!

Yep, storage and networking are two things that I haven't mentioned in the post but they definitely annoy. Sometimes (rarely but it happens) network breaks and Docker Swarm falls apart, we then have to restart it.

Storage is ughh.

It's also hard to understand how the network works in K8s and Docker Swarm. Sometimes we'd hit random slowdowns that were impossible to understand (I'm definitely no networking expert). Just restarting the server or moving to another node would fix it. I really want to use K8s, because it's a cool promise, but for us at least, it was too complicated in reality.

If you only have a few VMs, consider ditching swarm and just using docker (with compose) with host mode networking. I’m not sure swarm ever got stable enough to use in prod; I migrated our stuff from swarm to k8s a couple years ago due to similar issues. K8s has been solid but it’s a beast.

For storage, why not just mount host dirs into the container for stuff you want to persist? Then you’re no worse off than you were before.

It depends how you designed your app. My app uses an RDS instance and an S3 bucket for data and file storage. I feel it is a best practice that your containers should be stateless (except perhaps in development). Docker is not very good at storage and I wouldn't recommend using it in that way.

maybe scale is not needed, but how do you achieve resiliency with bare metal VMs without adding LB and watchdog layers (which is what k8s is anyway)?

We use DigitalOcean's loadbalancer product, we've also tried Cloudflare's for pure HTTP loadbalancing. We also use a single Redis instance for job queues. We use Graphite and Grafana for monitoring system metrics (running a Bitnami Graphite/Grafana instance on AWS because we had credits). And for the rest I guess just keeping the services simple? When we do need to scale up and add a new web server or task runner, it takes about an hour of my day.

One thing I realised with going bare-VM is that most services today are insanely stable. MySQL almost never crashes, Redis definitely almost never crashes, Rails/Passenger/Nginx never have any issues. The things that do happen are disks filling up, application bugs causing issues, or actual VM downtime, which is rare but happens when you have 30 VMs. With Docker or K8s it added a super complex layer that is in development and has issues.

The 4 months we ran our web servers on K8s, I spent at least 1 month debugging issues that ended at an existing open ticket on Github.

A lot of it has been like this for a long time. Postgres, MySQL, Redis, Nginx are bulletproof solid. SQLite might as well be a hammer and so many businesses could easily run on it, if only there was a way to drop a column :o
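For anyone who hasn't hit the drop-column limitation: the usual workaround is to rebuild the table without the column. A minimal Python sketch with an in-memory database and a made-up `users` table (SQLite later added `ALTER TABLE ... DROP COLUMN` in version 3.35.0, which makes this unnecessary on newer builds):

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN/COMMIT below are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, legacy_flag INTEGER)")
conn.execute("INSERT INTO users (name, legacy_flag) VALUES ('ada', 1)")
conn.execute("INSERT INTO users (name, legacy_flag) VALUES ('bob', 0)")

# Rebuild the table without legacy_flag in a single transaction.
conn.execute("BEGIN")
conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users_new (id, name) SELECT id, name FROM users")
conn.execute("DROP TABLE users")
conn.execute("ALTER TABLE users_new RENAME TO users")
conn.execute("COMMIT")

columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
```

A real table would also need its indexes, triggers and foreign keys recreated, which is where the pain comes from.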

Amount of $$ that was spent on Docker infra to run couple dozen servers at the last place I used to contract for.. oh and more fun when those machines have GPUs on them(a lot more fun) then they decided to support Singularity as well because.. I don’t know.

> When we do need to scale up and add a new web server or task runner, it takes about an hour of my day.

So you manually create and configure your VMs?

Do you have some kind of HA for your database? If a VM goes offline, how quickly can you replace it?

Maybe you don't need to go back to Docker or Kubernetes, but at least consider using one of the hyperscale cloud providers, with its auto-scaling groups and multiple data centers per region, so you can have a system that heals itself even while you're asleep or on a plane.

Yes, I manually create and configure in the sense that I tweak a number in Terraform templates and manually run the ansible playbook for each new server. It's taken a lot of time to get to that level (I think keeping a setup of bash scripts would suffice in our case...)

We run a read-replica on every database, so in case a hardware error occurs on the main database we can manually switch it over. It might mean up to an hour of downtime if the worst happens. Some data loss is OK and can be solved with manual customer support most of the time. It's also a lot more cost effective than working towards a 100% SLA.

Keeping the read-replicas alive is plenty pain enough! I can't imagine automating everything to auto-heal itself. (Sounds super fun though)

Codifying the setup for auto-scaling would be a massive undertaking. Each new change then requires destroying VMs and bringing up new ones. That would then require a k8s-like layer of infrastructure for secrets, DNS, service discovery, not relying on ephemeral storage (which is a lot faster than volumes/block storage).

I really love doing ops/devops, and would love to have the perfect setup which is 100% automatic and scaleable. Even now I have to stop myself from spending too much time scriptifying things that can just be run manually.

an hour of downtime and some data loss are not metrics acceptable to most businesses i know of

and what if you have 10 customers joining every day? still gonna be running that ansible manually?

I guess you assumed 1 customer = 1 new server? For enterprise purposes where data siloes are important a different approach definitely makes sense. We have 300+ new users per day, so our manual system scales well.

well if you rely on SaaS solutions for LB and HA, thats fine

less so if you're limited to airgapped/onprem or there are some other security or regulatory considerations

All cloud providers offer LBs with backend instance health checks. Custom scaling rules too.

You should add a load balancer, depending on what you do. Most load balancers will check in on the backend node, and disable them if they fail, rerouting traffic to other nodes.
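In case the mechanism isn't obvious, here is a toy Python model of it: a round-robin balancer that periodically health-checks its backends and only routes to the ones that pass. The class, backend names and check function are all invented for illustration; real load balancers do the same thing with HTTP or TCP probes.

```python
import itertools

class LoadBalancer:
    """Round-robin over backends that passed the most recent health check."""

    def __init__(self, backends, health_check):
        self.backends = list(backends)
        self.health_check = health_check   # callable: backend -> bool
        self.healthy = list(self.backends)
        self._counter = itertools.count()

    def run_health_checks(self):
        """Re-probe every backend; failed ones drop out, recovered ones return."""
        self.healthy = [b for b in self.backends if self.health_check(b)]

    def pick(self):
        """Choose the next healthy backend for an incoming request."""
        if not self.healthy:
            raise RuntimeError("no healthy backends")
        return self.healthy[next(self._counter) % len(self.healthy)]

down = {"web-2"}  # pretend this VM just died
lb = LoadBalancer(["web-1", "web-2", "web-3"], lambda b: b not in down)
lb.run_health_checks()
picked = [lb.pick() for _ in range(4)]

down.clear()      # the VM comes back
lb.run_health_checks()
recovered = "web-2" in lb.healthy
```

No request is sent to `web-2` while it is down, and it rejoins the pool on the first check after it recovers.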

The load balancers we utilize will do failover between themselves, and do it really well, as in "you don't notice".

Many seem to underestimate the stability of modern virtualization, the built-in redundancies, failover features and the capabilities of load balancers.

I would guess that most Kubernetes clusters are built on virtual machines, not physical hardware. Meaning that you just now have layers of redundancy.

Instead of fixing something RIGHT NOW, meaning adding another commit to your build, why aren't you instead rolling back to a known good commit?

Image is already built.

C/I already certified it.

The RIGHT NOW fix is just a rollback and deploy. Which takes less time than verifying new code in any situation. I know you don't want to hear it but really, if you need a RIGHT NOW fix that isn't a rollback you need to look at how you got there in the first place. These systems are literally designed around never needing a RIGHT NOW fix again. Blue/Green, canary, auto deploys, rollbacks. Properly designed container infrastructure takes the guesswork and stress out of deploying. Period. Fact. If yours doesn't, it's not set up correctly.

I very much disagree. If your bad deploy included migrations that can't be reversed (drop an old table for example) then rolling back just gave you two problems.

As long as the dev wrote (and tested!) two-way migrations and they are possible, then yes you are correct.

The long-term viable approach is to not make forward-backward incompatible migrations in short period of time.

If your system is truly distributed (multiple machines hosting it), then if one server performs migration that deletes the table, then the other server stops working.

You must have 2 checkins and 2 rollouts: Create a new table while maintaining the old one -- and let it bake in; and then delete the old table a week later (or whatever is your cadence/release cycle).
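Roughly, the two rollouts look like this. A Python/SQLite sketch with hypothetical table names; the function names are mine, not from any migration framework:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts_old (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO accounts_old (email) VALUES ('a@example.com')")
conn.commit()

def rollout_1_expand(conn):
    """Add the new table and backfill it; the old table stays usable."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE accounts_new "
            "(id INTEGER PRIMARY KEY, email TEXT, verified INTEGER DEFAULT 0)"
        )
        conn.execute("INSERT INTO accounts_new (id, email) SELECT id, email FROM accounts_old")

def rollout_2_contract(conn):
    """Run only after the bake-in period, once nothing reads accounts_old."""
    with conn:
        conn.execute("DROP TABLE accounts_old")

def tables(conn):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [r[0] for r in rows]

rollout_1_expand(conn)
after_expand = tables(conn)      # both tables exist: rollback is still safe
rollout_2_contract(conn)
after_contract = tables(conn)    # old table gone: the point of no return
```

Between the two rollouts the application would have to keep both tables up to date; that part is left out here. Rolling back the first release is harmless because servers still running the old code find `accounts_old` untouched.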

IMHO the answer here is to just never do irreversible migrations. (In other words: just leave old tables/columns around--if not indefinitely, then for a few weeks after they stop being used.)

I have done that in the past and I agree, it's a decent solution. The risk is that the old stuff never gets cleaned up and before long your DB is full of all kinds of cruft and nobody knows which part is needed and which isn't.

I saw a heinous bug once because a newer person was using a column that had stopped getting updated years earlier (because it was replaced). This person saw the column and it was exactly what they needed. Then customers started getting billed on old accounts that they had either closed or changed with us. They were really mad.

My solution to this is to rename that table/column to “table_deprecated” or “column_deprecated”. This has the nice property of being reversible, causes nice visible errors if something unanticipated is actually using the column/table, makes it obvious that it shouldn’t be used (well, one would hope), and makes it easy to find for permanent deletion later.
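A rough sketch of the rename approach, again with sqlite3 so it is self-contained. The table name legacy_billing and the helper names are hypothetical. It shows the two properties mentioned: stale readers fail loudly, and the change can be reversed.

```python
import sqlite3

def deprecate(conn, table):
    """Rename instead of dropping, so the change stays reversible."""
    conn.execute(f'ALTER TABLE "{table}" RENAME TO "{table}_deprecated"')

def undeprecate(conn, table):
    """Roll the rename back if something turned out to still need the table."""
    conn.execute(f'ALTER TABLE "{table}_deprecated" RENAME TO "{table}"')

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy_billing (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO legacy_billing (amount) VALUES (42)")

deprecate(conn, "legacy_billing")

# Anything still reading the old name now gets a visible error
# instead of silently using stale data.
try:
    conn.execute("SELECT amount FROM legacy_billing")
    stale_read_failed = False
except sqlite3.OperationalError:
    stale_read_failed = True

# The data is untouched, so reversing the rename restores everything.
undeprecate(conn, "legacy_billing")
restored_amount = conn.execute("SELECT amount FROM legacy_billing").fetchone()[0]
```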

That's a great idea! Although I did work with a person once who would use anything that looked helpful regardless of "deprecated" being all over the place (Eclipse would even yell at the method calls but he didn't care). Granted that's Java and at that point in time APIs were being deprecated without workable replacements, so it's hard to fault him.

At that point you might as well just delete them.

I think you're missing the point. If your migration renames a table or column, then the migration can be rolled back, by simply reversing the rename. If you delete the thing, then there is no way of rolling back other than restoring from backup.

I can see it being useful to rename these things, as long as there is some process in place to delete the renamed items after a period of time has passed.

Would be good if you could schedule a future migration that will happen only after a certain date has passed, which deletes the renamed tables/columns.
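I'm not aware of a specific tool that does this out of the box, but the idea could be sketched as a date-gated cleanup list that the deploy pipeline checks on every run. Everything below (the PENDING_CLEANUPS registry, the names in it, the dates) is invented for illustration.

```python
from datetime import date

# Hypothetical registry: what to drop, and the earliest date it may be dropped.
PENDING_CLEANUPS = [
    {"drop": "orders_deprecated", "not_before": date(2020, 9, 14)},
    {"drop": "users.legacy_flag_deprecated", "not_before": date(2020, 12, 1)},
]

def due_cleanups(today, pending=PENDING_CLEANUPS):
    """Return the deprecated objects whose waiting period has passed."""
    return [item["drop"] for item in pending if today >= item["not_before"]]

# A deploy script would call due_cleanups(date.today()) and run the
# corresponding DROP statements (or just fail the build as a reminder).
```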

This sounds so much like something that I'd do, that I had to stop and think about my personal hall-of-fame screwups to make sure that wasn't one of them.

In terms of things that scare me as a developer, stale data is WAAAAYYY less concerning than having people try to shotgun fix an issue on a production server after an irreversible update took down the entire system... Especially now that the developer and the entire organization are likely in a panic state and not thinking clearly.

That bug you describe, maybe stale data is part of the problem, but ... why are you having a new person working on the billing system presumably without any code review? It sounds like you have a lot of process issues.

Just create an issue in your issue tracker with the highest priority to delete a column (and why) with a deadline in 4 weeks. Stupid simple solution but works perfectly.

"Clearly the answer for any software problem is to just not have any bugs."

You see how that isn't an actual solution?

An irreversible migration isn't a bug, nor is anyone saying the answer for any software problem is to not have bugs.

They're saying the solution to the specific problem of being unable to rollback due to irreversible migrations is to not write irreversible migrations, which is a completely valid solution and indeed the correct one. The whole point of migrations is to track db changes so that undoing them is easy.

That misses the point.

Code and the DB always need to be compatible one version forward and back. That's required engineering discipline.

creating a table is an irreversible migration. Once you have the data, you can't just ... reverse the migration by deleting the whole table.

I would never ever EVER want to do a deploy with migrations that can't be rolled back. That sounds like professional malpractice TBH.

Two way migrations are both possible and SHOULD be done for any real data.

> Properly designed container infrastructure takes the guesswork and stress out of deploying. Period. Fact. If yours doesn't, it's not set up correctly.

Hmmm, I don't have the experience to know if it's setup correctly or not. All I can do is watch it fail and then learn from my mistakes.

Is there a container "framework" that out of the box gives me all of " Blue/Green, canary, auto deploys, rollbacks..." so I don't have to guess if I'm doing it right?

I hate to say it, but yeah that's kind of the point of kubernetes deployments (https://kubernetes.io/docs/concepts/workloads/controllers/de...). Or openshift for more UI and "out of the box" experience.

You deal with all the headache of making your app stateless with a predictable API so that you can reap the benefits of a system like k8s, which can automatically manage all of it for you.

Similarly I'm a bit confused by your comment about SSH dying... in k8s you configure a readiness/liveness probe and the behavior when the probe starts to fail. If SSH is an important thing for a given container, maybe the "liveness" probe is the command "ps aux |grep sshd". Then if it dies, the container can be pruned automatically.

Nomad [1] does it as well, also visualized nicely in their awesome UI.

[1] https://learn.hashicorp.com/tutorials/nomad/job-blue-green-a...

We’ve been using Convox[0] for the last 2 years. I’ve been pretty happy with how simple it is to work with. We’re still on version 2 which uses AWS ECS or Fargate. Version 3 has migrated to k8s and is provider agnostic. We just haven’t had the bandwidth to upgrade yet.

[0] https://convox.com/

We are using Convox v2 too and are happy with it, but I'm hesitant to upgrade and introduce the complexity of kubernetes to our devs. I'm also not sure convox is the right abstraction on top of kubernetes, when k8s itself is already a pile of abstractions (and there are so many other tools to choose from in the k8s universe).

https://github.com/aws/copilot-cli isn't ready for our use cases, but is more or less convox v2 built by AWS.

I remember that... But it was too aws specific. I'll be giving it another look, thank you for pointing that out!

Edit: sadly it's still another layer onto aws/gcp/other... I'll pass again.

This seems overly absolute to me. What about all of the cases where the bug wasn't caused by a recent commit? Some cases of this I've seen are:

* Time bomb bugs. Code gets committed that works fine until some future condition happens, such as a particular subset of dates that aren't handled properly.

* Efficiency issues. Some code might function properly and work fine with low amounts of data, but hit a wall when it has to handle loads beyond a particular size.

* Bugs in code that just hadn't received much traffic yet. A feature having a bug that only affects 0.1% of people using it might not be discovered until the feature gains traction down the line.

Having dealt with all of these in production, I can tell you the strategies I've used to combat these things:

1. Solid code reviews. Any one of our developers can halt a code review for any reason. We require 3 approvers on each review. Sensitive areas require reviews from people familiar with that area. We also have tooling that allows us to generate amounts of test data in dev that are similar to prod loads. This helps us catch a lot of time bombs.

2. Feature toggles to decouple deploy of code from release of code. This allows us to test our code in production before turning it on for customers. It also allows us to slowly rollout a feature and watch how the code behaves. This also gives us a kill switch to turn off the code if it is bad.

3. An incredibly robust testing pipeline. It takes about 50 minutes from commit to production deployment. We can also deploy previous containers very quickly for situations that require it.
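To make the feature toggle point (item 2) concrete, here is a minimal sketch of a toggle with a gradual rollout and a kill switch. This is not the commenter's actual system; the class name, the toggle name and the hash-based bucketing are assumptions for illustration.

```python
import hashlib

class FeatureToggle:
    """Hypothetical toggle: code is deployed, but released only to a percentage of users."""

    def __init__(self, name, rollout_percent=0, enabled=True):
        self.name = name
        self.rollout_percent = rollout_percent  # 0..100
        self.enabled = enabled                  # False acts as the kill switch

    def _bucket(self, user_id):
        # Stable bucket in 0..99, so a given user always gets the same answer.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_on(self, user_id):
        return self.enabled and self._bucket(user_id) < self.rollout_percent

toggle = FeatureToggle("new-billing-flow", rollout_percent=0)
dark_launch = [toggle.is_on(uid) for uid in range(50)]   # deployed, nobody sees it

toggle.rollout_percent = 100
full_release = [toggle.is_on(uid) for uid in range(50)]  # released to everyone

toggle.enabled = False
after_kill = [toggle.is_on(uid) for uid in range(50)]    # turned off without a deploy
```

In practice the percentage and the kill switch would live in config or a toggle service rather than in code, so flipping them needs no deploy.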

This doesn't solve all of our problems. Some changes cannot go behind feature toggles (DB migrations, dependency upgrades, etc). But we do pay a lot of attention to design and rollout plans for database migration changes and such.

All of these things come at an extra cost to us, but it allows us to move quickly when we need to. But we're in a lot better place than we were when we were trying to do weekly releases. We have a good mix of team experience (sr vs jr) - and have a lot of discipline in our software engineering practices. We still have problems like I said, but these strategies have greatly improved our ability to deliver software.

Out of curiosity, how many devs does your org have? I think a lot of the disagreements here come from people at orgs with 3 developers talking to people at orgs with 30,000.

about 40 and growing

Sometimes you have to quickly roll forward.

Those sometimes should be super rare, and you should build testing infrastructure to prevent that from needing to happen. When you release something, you move traffic from the alb over to the new instance, if you have an issue, just move it back. If you are deploying breaking changes and don't provide yourself a stable upgrade and downgrade path, yea, you're gonna have trouble.

I’ve learned to take the time and go through the normal deploy steps for any hot fix. More often than not, rushing the steps leads to longer outages, missing the actual bug, creating a new bug, etc.

Don’t cowboy it, deploy properly and you’ll be more relaxed in the long term.

Yeah, I should have been more clear. I'm 100% for using normal deploy steps and I'm not recommending cowboy-ing updates in a container.

He was asking about using non-containerized infra though. If you can commit a code hotfix and quickly deploy the code package, you can roll forward without the slow container build/deploy.

Incidents _requiring_ rolling forward are extremely rare. In the cases you have to, just build the image and deploy to your cluster with a high max-surge configured.

If your image has correct caching, rebuilding it shouldn't take much time. Most of your time is likely spent in CI and rolling deployments, both of which you can manually skip.

> If your image has correct caching

This is the hangup for most CI/CD systems with containers. Typical configurations (e.g. Gitlab basic setup) don't leverage any caching, so every container is built 100% from scratch, every time.

Adjusting the system to properly utilize caching and ordering your container builds in a way that the most volatile steps are as late as possible in the build will massively speed up container builds.
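Why the ordering matters can be shown with a toy model. This is a deliberate simplification and not Docker's actual cache algorithm: it only assumes that each layer's cache key depends on its own step plus every step before it. The step names are made up.

```python
import hashlib

def layer_keys(steps):
    """Toy model: a layer's key covers its own step and all earlier steps."""
    keys, parent = [], ""
    for step in steps:
        parent = hashlib.sha256((parent + step).encode()).hexdigest()
        keys.append(parent)
    return keys

def cached_layers(old_steps, new_steps):
    """Count the leading layers that could be reused from the previous build."""
    count = 0
    for old, new in zip(layer_keys(old_steps), layer_keys(new_steps)):
        if old != new:
            break
        count += 1
    return count

# Volatile step (source code) early: a code change invalidates everything after it.
code_first_v1 = ["base image", "copy source v1", "install dependencies"]
code_first_v2 = ["base image", "copy source v2", "install dependencies"]

# Volatile step last: the expensive dependency layer stays cached.
code_last_v1 = ["base image", "install dependencies", "copy source v1"]
code_last_v2 = ["base image", "install dependencies", "copy source v2"]

reused_code_first = cached_layers(code_first_v1, code_first_v2)
reused_code_last = cached_layers(code_last_v1, code_last_v2)
```

With the source copied early only the base layer is reused. With the source copied last, the dependency install is reused too.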

Please do not.

My guess is that they are writing code (migrations...) without thought given to rollback.

I'm not a huge fan of containerized infrastructure for the purpose of containerized infrastructure. Typically teams I've seen moving to k8s or a containerized solution don't have strong reasons to do so, and aren't using the pieces of those technologies that provide value.

I have worked with a few companies moving from containers to serverless and a few moving from containers to VMs.

I think that serverless often gives people what they were looking for out of containers in terms of reliable deployments, less infra management, and cloud providers worrying about the hairy pieces of managing distributed compute infrastructure.

Moving to more traditional infrastructure has also often simplified workflows for my customers. Sometimes the containerization layer is just bloat. And an image that can scale up or down is all they really need.

In any of these cases, devops is non-negotiable and ssh to prod should be avoided. Canary deployments should be utilized to minimize the impact of bad production builds and to automate rollback. If prod is truly down, pushing an old image version to production or reverting to an old serverless function version should be baked into process.
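As a rough sketch of what "automate rollback" can mean for a canary: compare the canary's error rate against the current production baseline and decide mechanically. The function name, thresholds and the three outcomes below are assumptions for illustration, not any particular platform's API.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=2.0, min_requests=100, error_floor=0.01):
    """Return "wait", "rollback" or "promote" for a canary release (hypothetical policy)."""
    if canary_requests < min_requests:
        return "wait"  # not enough traffic yet to judge the canary

    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests

    # Roll back when the canary is clearly worse than the baseline. The floor
    # avoids rolling back over a single error when the baseline is near zero.
    if canary_rate > max(baseline_rate * max_ratio, error_floor):
        return "rollback"
    return "promote"

too_early = canary_decision(10, 10_000, 0, 50)
healthy = canary_decision(10, 10_000, 1, 1_000)
broken = canary_decision(10, 10_000, 50, 1_000)
```

The same check works whether the thing being rolled back is a container image, a VM image or a serverless function version.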

The real crux of your issue seems to be bad devops more than bad containers and that's where I'd try to guide your focus.

Serverless (Lambdas, functions) might be ok for some backend trigger type processes but it’s absolute shit for end user facing apis. Also managing deployment of that crap is worse than dealing with K8s.

Asking about non-container systems on HN is mostly going to get you "you're doing it wrong" responses -- HN people are very driven by blog-sexy tooling and architectures.

If you want to deploy fast you need to skip steps and reduce I/O - stateful things are _good_ -- deploying with rsync/ssh and HUPing services is very fast, but people seem to have lost this as an option in a world with Docker.

I consult in this space - first 30m on the phone is free - hmu if interested.

> If you want to deploy fast you need to skip steps and reduce I/O

It's perfectly okay to build your artifacts on a stateful host and then put them into a container as a quick to add layer. A whole lot of work has gone into making applications quicker to build via incremental builds that depend on state, and it's worth taking advantage of.

People want to make containers the solution to every problem, and I think that's where the philosophy is hurting. It's okay to not have perfectly hermetic build / developer environments if your tooling is better for it.

rsync needs constant attention not to send unwanted artifacts. Git based deployments aren't as fast, but much more robust and controllable with pull requests.

Edit: of course this is for source code deployments, not binary output.

All the binary deploys I created at work run over git.

Honestly both of those are non-issues. Rsync is as much set once and forget as git, you'll need to script both and configure it there. And git slowness is a one time per deployment server issue.

The one advantage of git is that it will keep an history of your deployments. Of course, CD tools do that too, but git is way more transparent and reusable. The one disadvantage is that it will make your deploy scripts more complex and stateful.

Yeah, with rsync you get both -- like I can deploy source code (python, et al) and stuff with outputs I don't want to track in git (webpack et al).

I can essentially invoke whatever build steps I need to once on a build host - then let rsync handle moving all the changed things to my deployment environments.

asking the other way round: did you ever have a problem that you don't quite know how to replicate a given setup? for many tasks it saves you time - instead of having to mess with the host environment one can instead create an isolated environment - and give the docker access to the host file system or network.

It takes time to get used to that mode of working (without automated testing it is very hard), but it does have a lot of advantages.

I use other tools - config management and whatnot I do in Saltstack or Ansible, so the VMs are throw-away really, nothing is configured by hand, it's just that they live longer than the lifetime of a single deploy.

I sure have, but not for any of those reasons, and especially not that cowboy "just log in, change couple of lines and restart". The reason is always when the time, cost and complexity of managing the control plane outweighs the benefits.

You can still have a perfectly good, quickly executing, end-to-end CI/CD pipeline for which the deployment step is (gasp) put some files on a server and start a process.

The inflexion point for this varies by organisation, but I've never seen an environment with less than three services where managing container scaffolding is a net positive.

We have a series of customers who I want to move away from Kubernetes, simply because the cost of managing the Kubernetes cluster outweighs the cost of managing the application, compared to running it on a few virtual machines.

I wouldn't even set the limit at three services, but perhaps closer to 10, depending on the type of services and the development flow.

Getting a late night alarm on a virtual machine is still much easier to deal with than an error on a container platform.

One solution that seems to be somewhat simpler, but still managing to retain many of the advantages of having a containerized infrastructure is Nomad, but I still haven't tested it on anything large scale.

This may sound like heresy, but Docker Swarm is perfectly viable for this kind of use case.

Viable yes, annoying, also yes.

Have you tried fixing Docker Swarm when it randomly decides that one worker is missing and it spins up the "missing" containers on the remaining worker while reporting that you're missing a worker, but at the same time your containers are somehow also over-replicated?

Yes, and I've also run into the networking issues more than once.

However, both fail states are fairly rare, and Docker Swarm is far simpler to manage than K8s.

I use K8s at work, but Docker Swarm at home because it is simple to set up and works well.

But that's the operational headache of using Kubernetes, especially when you are dealing with a small number of services. There are other simple and/or managed platforms that will give the same usage experience while hiding the complexity.

Absolutely, just plain Docker or Docker Compose are both wonderful and easy to use tools.

I'm not sure if you have seen a properly built and maintained automated e2e lifecycle.

You write code, you push it, 5 Minutes later it is rolling out, tested, with health checks and health metrics.

Your infrastructure itself is keeping itself up to date (nightly image builds, e2e tests etc.)

It just works and runs. It doesn't make the same mistake twice, it doesn't need an expert to be used.

I'm not saying it's for everyone! Put three/four VMs on AWS, add a managed database and you are good to go with your ansible. Use a Jira plugin to create recurring tickets for doing your maintenance and that should work fine.

Nonetheless, based on your 'random list of things' it does sound like you are not doing it right.

There is something very wrong if you really think it's critical for you to be able to 'hot fix', aka playing hero by jumping on your VMs and hacking around. If you only have one single server for your production env, there is no risk of forgetting a server to patch, but there is still the issue of forgetting to backport it (which is probably the wrong term if you don't hotfix your release branch).

Most mistakes I make are mistakes I make because I was able to make them.

And that might sound unfortunate but there is a feeling of trust for your setup. At least I get that feeling and I get that through automation. Knowing and seeing the deploy process just working day in day out. Knowing that my monitoring and alerting is set up properly, knowing that the systems keep themselves up to date, knowing there are proper tests in place.

> You write code, you push it, 5 Minutes later it is rolling out, tested, with health checks and health metrics.

Yep, I have that. It's more like 15 minutes than 5 for me, but this process has nothing to do with containers - it can be done in the same way without containers.

> It just works and runs. It doesn't make the same mistake twice, it doesn't need an expert to be used.

Except when you misconfigure something on Friday night and it makes the same mistake 100 times per hour until someone notices it.

This will happen once and then there will be a test for it.

My automated system will only get more resilient over time. This is a benefit for the system itself.

Of course when you do it manually, you will learn and gain experience but that's only for YOU. It does not just get transferred to your colleagues and when you are on holiday and shit hits the fan, it will not help.

My biggest reason why I like automation so much is: the company becomes less reliant on me.

It is the same mechanism why the industry is getting more computer logic: Machines are complicated and you need to train people. Make it easier for people to 'just use' and you have more people available which you need to train less.

That's why there is a don't deploy on Friday rule.

Yeah, I don't think that's a new problem with containers.

0. "Builds are slow" -> use multi-stage builds to ensure layer caching works great.

1. "Hot fix now" -> I do just log in and enter the container, change a couple of lines and restart, not sure what's your problem here

2. "containers do not close when ssh breaks" -> I guess that's also going to save you if you run <whatever mysql management command> without screen/tmux !

3. "Harder to change how things work" -> actually, it makes it much easier to add services to a stack: just add them to the compose file and use container names for hostnames in configurations !

4. "Must remember launched containers" -> why not use NetData for monitoring ? it really does all the monitoring/alerting you need out of the box ! And will show containers, tell you before /var/lib/docker runs out of space (use BtrFS to save a lot of time and space)

I'll add that containers make it easy to go from DevOps to eXtreme DevOps which will let you maintain a clean master branch, and that is priceless ! Details -> https://blog.yourlabs.org/posts/2020-02-08-bigsudo-extreme-d...

Where I'm migrating to : replacing Ansible/Docker/Compose with a new micro-framework that lets you do all of these with a single script per repo, instead of a bunch of files, but that's just because I can :)

I've heard a lot of the same complaints from people who almost universally have /not bought into the idea/ of containerisation. If you're using containers but you yearn for the days of being a web master, treating servers as pets rather than cattle, and wanting to edit code on the fly, then you're never going to get along with containers.

It's the same jump, from non-containerisation to containerisation, as it is from non-SCM to SCM. People who upload their files via FTP have a hard time picking up Git (or well, they did, ten years ago or so.) You'd have people complaining that they have to run a whole bunch of commands: git add, git commit, git push, then on the other side git pull, when they used to just drag the files into FileZilla and be done with it.

The thing is though, if you change the way you work, if you change the process and the mindset, you can be in a much better position by utilising the technology. And that requires that you buy in.

But, as for your questions: no, I haven't. I have always taken legacy or new projects and gone containerisation with continuous integration, delivery, and deployment.

Isn’t it quite a jump from treating servers as pets to containerization? There is a middle ground - autoscaling, health checked VMs behind a load balancer where the autoscaling group is using an image.

I tried the middle ground. I used Packer with Ansible to build the VM images, on the theory that the auto-scaling group should use final, ready-to-run images. My image builds took 15-20 minutes. Also, for the services that had only one instance, it was way too tempting to just SSH into the one VM and update things manually rather than suffering a full build and deploy. Do you have any suggestions for a better way to do this middle-ground approach?

yes, Chef Habitat

Stop being mediocre

That advice is impossible to act on. No one can be perfect at everything. So it's necessary to be mediocre at some things in order to accomplish the things one really cares about. So yes, I'm mediocre at operations. I want to get better, but if I tried to be perfect at operations, I wouldn't get other things done.

So, do you have any suggestions that can actually be put into practice?

If you get better at operations it makes other things easier. Being better at operations is the ultimate sharpening the saw move.

I run everything directly on VPSs and deploy via rsync.

Every now and then I have long discussions with friends who swear by containerization.

All the benefits they mention are theoretical. I have never run into one of the problems that containerization would solve.

That sounds like a very small setup with very limited requirements if you run this successfully.

The benefits they are mentioning are theoretical for you, and I personally have not worked in a professional env where VPSs and rsync would be enough at all.

You sound exactly like my friends. Except that they know that my systems are several orders of magnitude bigger than theirs.

So they don't argue with the present "This cannot work" but with the future "This will lead to catastrophic failure at some point!".

This has been going on for years now.

What is your rough setup then?

> That sounds like a very small setup you run with very limited requirements if you run this successful.

No, not necessarily. Computers are fast, and if you don't add complexity until you need it, you can do a hell of a lot with a half-decent VPS and some rsyncing.

For context: a couple of years ago I ran a website that was in the Alexa top 1K for a while (back when that was still relevant), and that was heavily visited and used for the time during which it was relevant. If you worked at any news organization anywhere, it was probably on your daily list to check.

Yet it was relatively crappy PHP, not even very optimized aside from some very naive memcache caching, and ran off a random VPS with 2GB of RAM - and that included the database. The biggest challenge wasn't scaling or deployment processes, but fighting off constant DDoS attacks.

Of course, the key difference between that deployment and a typical startup deployment is that it wasn't built like a startup. It wasn't "measuring engagement", it wasn't doing "big data", it wasn't collecting data for targeted advertising - it just did one thing and it did it well, with the only complexity involved being that which was actually necessary for that purpose.

Over the years I've looked at a lot of complex "devops" setups for other people, and almost without exception the vast majority of the resource requirements and complexities originate from data collection that approaches kleptomania, and their choice of tooling - which ostensibly was chosen to better handle complexity. It's just a self-fulfilling prophecy that way. Most people don't actually have this degree of complexity to manage.

That's not to say that there's no organizations or projects at all that would benefit from automated cluster orchestration (with or without containers). But it's very much a "prove that you need it" kind of thing, not a "you need it unless..." kind of thing.

(I do think that there's inherent value in deterministic deployments. But that's separate from whether you need multi-system orchestration tooling, it can be achieved without containers, and even then the deployment process should be trivial enough to make it worth your while.)

Edit: To be clear, this is not an argument to prioritize performance over everything else or avoid dependencies/tools, at all. Just an argument to not add moving parts that you don't actually need. For anything you add, you should be able to answer "what concrete problem does this solve for me, and why is it worth it?".

These things can all be managed and automated using ansible/salt/etc -- containers just add another layer of abstraction to manage/maintain/understand/etc.

> I have never run into one of the problems that containerization would solve.

You've never had to migrate your app to another host, or manage dependencies for an app?

You have never made an overwriting change that broke your system in a way that made rollback difficult?

> At Viaweb, as at many software companies, most code had one definite owner. But when you owned something you really owned it: no one except the owner of a piece of software had to approve (or even know about) a release. There was no protection against breakage except the fear of looking like an idiot to one's peers, and that was more than enough. I may have given the impression that we just blithely plowed forward writing code. We did go fast, but we thought very carefully before we released software onto those servers. And paying attention is more important to reliability than moving slowly. Because he pays close attention, a Navy pilot can land a 40,000 lb. aircraft at 140 miles per hour on a pitching carrier deck, at night, more safely than the average teenager can cut a bagel.

> This way of writing software is a double-edged sword of course. It works a lot better for a small team of good, trusted programmers than it would for a big company of mediocre ones, where bad ideas are caught by committees instead of the people that had them.


The idea that Navy pilots don't crash because they have minds like steel traps is absurd. They have a ton of process and redundancy to reduce human error to a minimum, and they follow tedious checklists religiously. Even private pilots have this rigor.

If you applied the amount of process pilots use to software deployment, you'd improve ops by orders of magnitude.

And they sometimes can't land those planes on the pitching decks at night, and have to redirect or ditch. Source: old man was in Carrier Ops, talked about it constantly.

Not that I remember, no.

I can opine based on my current position, where I interact with both containerized and non-containerized infra, specifically a docker-compose-like system versus direct installs on AWS EC2 instances. In my opinion, a well made containerized system is far superior an experience:

- Deploy times are certainly slower, up to 50x slower than non-containerized. However, we're talking 30s deploys versus 20 minute deploy times, all-inclusive. The sidenote here is that you can drastically reduce containerized deploy by putting in some effort: make sure the (docker) containers inherit from other containers (preferably self-built) with executable version that you need. For instance, you might inherit version X of program A and version Y of program B before building only a container with version Z of program C, as A and B barely change (and if they do, it's just a version bump in the final container). Even better, just build a code container during deploy (so a container with essentially only code and dependencies), and keep all the executable as separate images/containers that are built during development time;

- Containers do allow high-speed fixes, in the form of extremely simplified rollbacks. It is built into the entire fabric of containers to allow this, as you just change a version number in a config (usually) and can then rollback to a non-broken situation. Worst case, the deploy of fixed code in my case does take only 20 minutes (after the time it takes to fix/mitigate the issue, which is usually much longer);

- Local environment is _so much easier_ with containers. It takes 10 minutes to setup a new machine with a working local environment with containers, versus the literal hours it can take on bare metal, disregarding even supporting multiple OS'es. On top of that, any time production wants a version bump, you can start that game all over again without containers. Most of my devs don't ever worry about the versions of PHP or Node they are running in their containerized envs, whereas the non-container system takes a day to install for a new dev.

Containers can be heavy and cumbersome, but in many cases, a good responsibility split can make them fast and easily usable. In the specific case of docker, I find treating the containers like just the executable they are (which is a fairly default way of dealing with it) works wonders for the easy-and-quick approach.

Local environment specifically isn't fully containerized in my project. DB and similar things (elasticsearch, message queue) locally are inside containers, but the code itself is not. I worked on a project before where I had to have it containerized and it was a slow mess. I'd rather spend a couple more hours on setting up local dev environment for every new hire than deal with code in Docker locally.

In production we have it done the other way - PostgreSQL and Elasticsearch are run directly, but code is in containers.

To be honest, I have a fairly similar situation, I just use a different code-container for local than for production. In production we run some things directly and package the code, whereas in local we package the services, and have the code semi-separate.

In the production environment, I want the code image to be set in stone, that way a deploy or rollback will go to the exact git commit that I expect. So the CI-script for deployment is just a `docker build` command, the dockerfile of which clones a specific commit hash and runs dependency installation (yarn install, etc.), then sets the image version in the production env variables. The code is then encapsulated in 1 image, which is used as a volume for other containers, and the runtimes are each in their own container, connected by docker-compose.

For local, it's a much heavier code image that I've prebuilt that contains our current version of every tool we use, so that the host machine needs nothing but docker installed to be able to do anything. The services that actually display stuff on the screen (Node.js) run as their own container with their own processes, but you can hop into your code container (used as a volume for the services) and try out command line Node stuff there, without fear of killing the procs that show your local environment.

It took a long time to reach this point, lots of experimentation, but it's now pretty lightweight and pretty useful too.

Hmm, not sure I understand why you put code in one image and then use it as a volume for other containers. Why not run directly from the volume with the code?

Composability whilst maintaining a monolith codebase. The code in its single image is used in 3 different environments, without including any runtimes that those environments might not need, keeping the image small. At the same time all code can be kept in a single git repository.

It is slow in your local because you are probably using OSX or Windows. On Linux the speed is near native for me for local development.

Yes, locally we use OSX apart from one developer who uses Linux.

I guess thousands of developers would pay for a fast docker implementation on mac. File access is so slow if you want to mount your sourcecode into the container.

There are solutions like docker-sync but they have a kind of random delay, sometimes the sync happened fast sometimes it takes a few seconds.

I also do this. We have DB and another service running via docker-compose. But our actual Webpack typescript app is done locally. We run on OSX, and due to the Docker file system slowness, the dev-test-run deploy cycle is far too slow. Much better running it outside of Docker.

> a well made containerized system is far superior an experience

Like everything, it depends on context. Is it good to have separate dev, staging, and production environments? For many companies, the answer is "D'oh!". But if you're one guy trying to get your first p.o.c. out, by all means, deploy directly to production.

Containers are sort of the same thing: if you're a small team, and don't have many customers, the disadvantages of containers (however small) may outweigh the advantages.

I largely agree with you, best tool for the job and whatnot. At the same time, I feel that due to the time I've spent investigating and experimenting with containers, I'm almost better at getting containers to work the way I want than to run it plain.

So if I was now starting a new project solo, I would probably go straight for containers and never deal with conflicting versions or difficult, hacked-together rollbacks, and I don't think it would take me, personally, more time to set up.

In other words, I think it's a good investment of time to spend, say, a full work-week on understanding containers and experimenting on how to make it work for your use-case.

regarding local environments,

have you faced performance issues? running 5-6 docker images (app,redis,db,mq etc) at the same time has been causing the machine to lag for us. most of my team is on the 16 inch macbook.

I won't be the first to tell you that docker + macos is a constant struggle. It can be done, and I support about 10 developers using about 10 simultaneous containers on macos. My troubleshooting workflow right now is the following:

- Is the container throwing errors? A running container somehow repeatedly throwing errors on its non-main process restarting will eat up all resources of any machine, any OS;

- Is the container trying to sync a folder or datasource between the container and the host? Especially on macos, using docker for mac, this will hurt your performance. Solutions are in the form of specialized syncing systems such as docker-sync (http://docker-sync.io/) or manual syncing using rsync or unison;

- If you have many running containers, it can be useful to spin up a linux VM (ubuntu, debian) in virtualbox, then run the containers in there, finally using a tool like unison to sync dynamic content (the changing code) to the vm;

- Is 1 container using far more resources than the others? Is this strictly necessary? It is possible, in docker-compose at least, to limit certain resources for containers (it's probably also possible without docker-compose, just using docker run) https://docs.docker.com/config/containers/resource_constrain....
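For the last point, a rough sketch of what capping one greedy container can look like with plain `docker run`, expressed as a Python helper that builds the argument list. `--memory` and `--cpus` are existing `docker run` flags; the default values and the image name in the usage line are just placeholders.

```python
def docker_run_with_limits(image: str, memory: str = "512m", cpus: float = 1.0) -> list[str]:
    """Build a `docker run` argv that caps a container's memory and CPU usage."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return [
        "docker", "run", "--rm",
        "--memory", memory,    # hard memory cap, e.g. "512m" or "2g"
        "--cpus", str(cpus),   # how many CPUs' worth of time the container may use
        image,
    ]


if __name__ == "__main__":
    # Placeholder image name, for illustration only.
    print(" ".join(docker_run_with_limits("redis:6", memory="256m", cpus=0.5)))
```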

Probably the dynamic content synchronisation is the biggest resource hog, and docker-sync has really helped with that in the past, plus it doesn't require different setups between linux and mac hosts, i.e. you could use the same docker-compose.yml file.

It doesn't hurt to inspect whether the host machines are potentially running other heavy services, such as heavy electron apps or a million open tabs in browsers. I've had to tell some devs to perhaps not use their work machines for personal chat applications (but just having their phone open then) because those applications were using >3GBs of RAM each on a 16GB ram machine, leaving very little for any work-related processes.

I was happy with docker-sync only in the first hour or so. I was trying it for a php+js setup (so many small files) and was frustrated after some time. Sometimes the sync happened instantly, sometimes it needed multiple seconds. It's really frustrating to refresh a webpage, see the bug still happening, and not understand what's going on. You change the code again and it's still not fixed, you add some test output and don't see it, and finally you realize that the sync is slow again or that you need to restart it because it silently broke.

File access in Docker for Mac is a known issue. Best solution is to mitigate using docker-sync or similar.

Run it native - binaries plus supervisord - do the same thing on prod. It's fast, observable, easy to debug, etc.

What do you think about systemd and monit?

Maybe monit works - i like something that’s not the system init so you can use the same config on macos that you use on linux...

ah, that makes sense. I don't deploy to macos.

Neither do I, but it's convenient to use the same stuff in dev you use in prod.

I have long felt like containers, and VMs before then, have been abused to the point of absurdity. Most of the primary reasons for people jumping to them are already sufficiently solved problems.

* Resource and process isolation -> capability-based security

* Dependency management conflict resolution -> nix / guix style package management

* Cluster orchestration and monitoring -> ansible, salt, chef, puppet, etc.

If you need all of those things at the same time, maybe containers are the right choice. But I hate the fact that the first thing we do when we run into a pip package conflict is to jump to the overhead of containerization.

Others are talking about process reasons but I have a technical one:

We have an internal tool that listens to a message queue, dumps a database (from a list of permitted databases), encrypts it and sends it to S3 to be investigated by developers.

When running on a container, the process takes 2-3 minutes with small databases, about an hour or more with larger ones. When running on a regular EC2 image, the process takes about 5 minutes in the worst case scenario and is borderline instant with smaller databases.

Mapping internal volumes, external volumes, editing LLVM settings, contacting AWS support etc yielded nothing. Only migrating it to a regular EC2 instance had any results and they were dramatic.

We run Docker containers for local development too but when restoring very large databases we need to use MySQL in a real VM instead of in our Docker setup because Docker crashes when faced with a disk-heavy workload.

So to conclude, the only reason I wouldn't want to containerise a workload is when the workload is very disk IO heavy, whereas most of our apps are very network IO heavy instead.

If your hosts run linux, maybe take a look at disabling the userland-proxy in docker for your development environment, and see if it helps. userland-proxy _really_ slows down certain applications in my experience. Setting userland-proxy=false in daemon.json and restarting docker converts from using the userland-proxy to using iptables. FWIW still considered "beta" and may result in bugs of its own, but has really helped with a few of our ($dayjob) more pokey apps in a few of our environments.
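A small sketch of that daemon.json change, as a Python function that takes the current file contents and returns them with `"userland-proxy": false` merged in. It works on text only and does not touch the file or restart docker; on Linux the file commonly lives at /etc/docker/daemon.json, but check your own install.

```python
import json


def disable_userland_proxy(daemon_json_text: str) -> str:
    """Return daemon.json content with "userland-proxy" set to false.

    Existing keys are preserved; an empty string is treated as an empty config.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    if not isinstance(config, dict):
        raise ValueError("daemon.json must contain a JSON object")
    config["userland-proxy"] = False
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(disable_userland_proxy('{"log-driver": "json-file"}'))
```

Merging rather than overwriting matters here, since daemon.json often already carries logging or registry settings you don't want to lose.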

Isn't that the same as just running docker --net=host?

No, it still forwards to a docker internal/bridge ip, just using iptables forwarding instead of a tcp-proxy. net=host just uses the host context.

Yep, I moved our knowledge management platform[1] from Docker + Docker Swarm to just deploying with Ansible.

I think containerization is another one of those things that you're told is great for everyone, but really you need to have many teams with many services that all need to act in concert in order for containerization to be worth the effort / overhead.

That being said, I conceptually prefer how with tools like K8s you can have fully declarative infra as code, rather than the hybrid imperative/declarative mix of a tool like Ansible.

[1] https://supernotes.app

thank you for giving visitors of your website the choice to completely avoid cookies - without any dark ux patterns involved!

Why didn't anyone bother asking what's this containerized infrastructure for? The size of it? Purpose and redundancy options?

Everyone in HN starts criticizing vague container statements. This really turned into Apple vs PC debate.

The desire to make production changes outside of version control and change management flows is not a complaint about containers. It isn't an Apple vs. PC debate. It is literally reading a question from that person on your team who makes your job extremely difficult and error prone.

isolation, ease of deployment, management of artifacts and network resources

if you're running a single app a lot of this may not apply

Still vague. In my opinion containers are not for everyone, it has to be a very specific scenario. Most internal things can be achieved with VMs easily.

Docker ≠ containers

You can run lxc/nspawn containers as lightweight VMs and save a lot of (runtime, management) overhead without having to worry about any of Docker's or k8s's quirks.

We're quite happy with that approach, Docker isn't production grade IMO and k8s doesn't make sense at our scale.

I felt bamboozled when I first heard about nspawn. It's like docker but well integrated into systemd. How come there is so little discussion about it? Is the tooling lacking?

I love containers! I just hate that so many people assume that means docker and ignore the things you refer to. lxc is so nice and fast... I haven't taken the time to test out nspawn yet though.

We never bothered to migrate our small setup (circa 20 instances) to containers, we just use VMs.

We use Go binaries, so dependencies are compiled in, hosted on VMs on a cloud provider, setup with cloud config, and using systemd to manage services sitting behind a load balancer, one vm per service. Automated test and deploy so it's simple to make updates.

Never really felt the need for containers or the extra layer of abstraction and pain it brings.

Re hot fixes, automate your deploy process so that it's fast, and you won't mind checking in a fix and then deploying because it should take seconds to do this. You don't need containers to do this.

If you can't deploy changes quickly and easily in a reproducible way something is wrong. Having a record of every change and reproducible infrastructure is incredibly important, so we check everything into git then deploy (service config, code changes, data migrations). You don't need containers to get there though, and I'm not really sure they make it easier - perhaps in very big shops they help to standardise setup and infrastructure.

I have migrated to containerized infrastructure recently and I can tell that it has its benefits, quite a lot actually. (Where before I worked only with VPSs.)

But after working with it, it's pretty visible that the abstraction layer is really huge and you need to learn the tools well. When you deploy to linux VPS, you probably have already worked on unix system and know plenty of the commands.

Another thing, I think having a designated person to the infrastructure makes it much less trying for a team. On the other hand, you have 'code' sitting in the repos and everyone feels like they can into devops. I don't think it's exactly true, because e.g. k8s is a pretty complex solution.

> If I need quick hot fix RIGHT NOW, I can't just log in, change couple of lines and restart, must go through full deploy cycle.

If you have the need for that kind of thing, I don't know why you would use containers.

Containers are for organizations that have processes.

Unfortunately nowadays we teach every developer to have containers, ci/cd, terraform, test coverage, ... as a requirement

Logging in to fix something RIGHT NOW also really falls off as soon as you hit moderate scale and have to do the editing on 20-30 boxes.

Who really did that though?

I’m sure a few. But mostly back in the days before we (almost) all had containers we automated that stuff with puppet etc...

It worked ok but it had its own problems, we iterated and moved on to disposable workloads and infrastructure - which has a completely different set of problems. Amongst other things it makes scaling even easier - if you need it and if you do it right.

Before automation frameworks and services we had scripts, which were either nice and simple but limited, or a huge mess and highly complex (and often fragile).

Before that we either had mainframes / centralised computing or we didn’t have scale.

Maybe it was different in the windows world, but that’s pretty much what my experience has been across many clients over the past 16~ years with Linux/BSD etc...

You can have both processes and rare exceptions to processes. And even processes for exceptions (eg a checklist that includes documenting what you did and reimplementing the change eg in vc'd ansible)

You are right to ask, but the points you list (which are opinionated) are things intended to protect you against human error.

Pick the tools that suits your flow.

Nothing wrong with bare metal or virtual servers.

EDIT: to add, a good few years ago I was managing a PHP shop where all production was bare metal and development/staging was replicated in containers. Everybody was happy. Hope it helps.

We use Nix and NixOS with no containers. You get some of the benefits of containers (eg different binaries can declare dependencies on different versions of X), without some of the wonkiness of containers

It has trade offs (eg worse docs), but you might like them better than eg Docker’s
