
You described the layering features of Docker images: you can upgrade the upper layers without touching the lower ones. It helps, sometimes.

The reality always seemed a bit more complex to me: once you ship, you'll have to start maintaining your software. You will definitely want to upgrade the lower layers (base OS? JVM? Runtime of the day?). In that case the statically allocated cache hierarchy of Docker layers will be of little utility.
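
To make the caching point concrete, here is a minimal Dockerfile sketch (the base image, package and paths are illustrative, nothing from the article): each instruction becomes a layer, and changing a line invalidates every layer built after it.

  # Lower layers: change rarely, served from the build cache
  FROM debian:bookworm
  RUN apt-get update && apt-get install -y --no-install-recommends openjdk-17-jre-headless
  # Upper layers: change on every release, cheap to rebuild
  COPY app.jar /opt/app/app.jar
  CMD ["java", "-jar", "/opt/app/app.jar"]

Touching app.jar only rebuilds the last two layers; bumping the FROM line throws away the cached apt layer as well, which is exactly the upgrade scenario I mean.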

On bit-for-bit reproducibility: I have my doubts, too. Are we sure that all the Dockerfiles out there can be executed at arbitrary times, on arbitrary machines, and always generate the same content hash?

Downloading a pre-built image from a registry does not count as reproducibility, to me.

Obviously Docker is a tool, and as such you can use it with wildly varying degrees of competence. I am just skeptical that using a specific tool will magically free us from having to care about software quality.




I think my issue with your comment is the same one I have with the original article: it focuses on some technical details which don't impact most teams in a major way.

> You will definitely want to upgrade the lower layers (base OS? JVM? Runtime of the day?). In that case the statically-allocated cache hierarchy of Docker layers will be of little utility.

My team deploys to production multiple times per day. We upgrade the OS or the language version every couple months. So the layered structure is optimized for our typical workflow.

> Are we sure that all the Dockerfiles out there can be executed at arbitrary times from arbitrary machines and always generate the same content hash?

Again, the content hash doesn't affect my team. It's "bit-for-bit" enough that, in my experience, if the tests pass locally they pass in production. If that weird library is installed correctly in the local container, it's installed correctly in the production container. That's what matters.

> skeptical that using a specific tool will magically free ourselves from having to care about software quality

I never said this and I'm not sure what you mean. At the end of the day I think the benefits of the "Docker system" are obvious to anyone who ships multiple times a day, especially when working on big software systems that are used by lots of people. There are other approaches too, but personally I haven't seen a VM-based solution that offers as good a workflow.

The faster startup time of VMs is cool and I appreciate the work put into the paper... I'm just saying that it doesn't seem to matter in the bigger picture.


> The faster startup time of VMs is cool and I appreciate the work put into the paper... I'm just saying that it doesn't seem to matter in the bigger picture.

I imagine this kind of analysis is aimed at use cases like, say, AWS Lambda, where you're launching containers/VMs on something like a per-request basis.


"Downloading a pre-built image from a registry does not count as reproducibility, to me."

Completely disagree... On my team, the lead devs build the Docker images, push them to a private repo, and everyone else pulls that exact image. Bringing up dev environments is almost instant. If a lead dev adds a dependency that breaks the build, everyone else is fine; they will fix the build and push it up when ready.


It's not reproducible because you didn't produce (i.e. build) anything, you just downloaded (i.e. copied) it.

Calling what you did "reproducible" would be like calling a scientific paper "reproducible" because you're able to copy the table of results at the end into another paper.


I think we are using two different meanings of "reproducible".

To me, saying that a build is reproducible means that anyone is able to independently build a bit-for-bit exact copy of an artifact given only its description (source code + build scripts, Dockerfile, ...), more or less in the sense of [0].

To satisfy this definition, even content-addressability is not enough.

[0] https://reproducible-builds.org/


Well, I think you disagree on the meaning of the word "artifact." GP cares that all production deployments are identical to the development environment that went through system tests. To ops, a reproducible build is one that has been uploaded to the registry, because it can be ported over to the production cluster and it will be bit-for-bit identical.

It is not necessary to reproduce the same build bit-for-bit because you have kept a copy and distributed it.
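
As a rough sketch of that workflow (the registry name and digest are placeholders), the content-addressed digest recorded at push time is what makes the copy you distribute bit-for-bit identical everywhere:

  # CI pushes once; docker push prints the immutable digest of the image
  docker push registry.example.com/myapp:latest
  # production pulls by digest, not by tag, so the bytes are the same everywhere
  docker pull registry.example.com/myapp@sha256:<digest-from-the-push>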

You are not rebuilding the same thing twice because that is an inefficient use of resources.

Nobody really cares that the timestamps are different when the builds started at different times of day; it is therefore evident that you will not produce bit-for-bit identical builds as long as your filesystem includes these kinds of metadata and details.

If you built it again, it is to make a change, so there is no point in insisting on bit-for-bit reproducibility in your build process (unless you are actually trying to optimize for immutable infrastructure, in which case more power to you, and it might be a particularly desirable trait to have. Not for nothing!)


> Nobody really cares that the timestamps are different when the builds started at different times of day; it is therefore evident that you will not produce bit-for-bit identical builds as long as your filesystem includes these kinds of metadata and details.

Do you think different timestamps are all you need to worry about? How about the fact that some source tarballs you depend on may have moved location? Or that your Dockerfile contains 'apt-get upgrade' and an upgraded package now breaks your build when it did not before? Or that your Dockerfile curls some binary from the Internet, but that server now uses an SSL cipher suite that is incompatible with your OpenSSL version? All problems I have encountered.


Hey FooBarWidget! :D

I was not speaking in terms of formal correctness under the definition of reproducible builds; I was speaking of the practical implementation of a production deployment. All of the issues you mentioned are (at least temporarily) resolvable by keeping a mirror of the builds, so that you can reproduce the production environment without rebuilding if it goes away.

I'm just defending the (apparently ops) person who was going by the original dictionary definition of "reproducible" which predates the "reproducible builds" definition of reproducible.

If my manager asks me to make sure that my environment is reproducible, I assure you she is not talking about the formal definition that is being used here. I'm not saying that one shouldn't care about these things, but I am saying that many won't care about them.

If you're doing a new build and the build results in upgrading to a newer version of a package, then that is a new build. It won't match. If you're doing a rebuild as an exercise to see if the output matches the previous build, then you're concerned about different things than the ops person will be.

If my production instances are deleted by mistake, I'll be redeploying from the backup images, I won't be rebuilding my sources from scratch unless the backups also failed.

I agree that it is a sort of problem that people usually don't build from scratch; it's just not one of the kinds of problems that my manager will ever be likely to ask me to solve.


that's a mirror, not a reproducible build


No, you're right, I read the "reproducible builds" page and it's a good formal definition for the term.

I just think that if you ask ten average devops people on the street what "reproducible" means in the context of their devops jobs, at least eight of them are not going to know this definition, or insist on a bit-for-bit artifact that arises directly from pure source code, as defined at http://reproducible-builds.org

We're going to think you mean "has a BDR plan in place." Or maybe I've underestimated the reach of this concept.


I'd agree with zackify here too, unless I'm misunderstanding what was meant by "reproducibility". How is this different from, say, pulling an external dependency from a repo manager, like we do all the time when building software?

The ability to easily deploy pre-built Docker images from a registry is one of my favorite features of a Docker workflow, especially for the time it saves when deploying components of a software stack to a local development environment. I find I have to deal with significantly fewer installation issues if a developer can just run a Docker Compose file or similar on their machine to get going.
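
As a rough sketch (the image names and registry here are made up), the local setup is often nothing more than a compose file pointing at pre-built images:

  # docker-compose.yml: developers pull images built elsewhere instead of building locally
  services:
    api:
      image: registry.example.com/acme/api:1.4.2   # pre-built and pushed by CI or a lead dev
      ports:
        - "8080:8080"
    db:
      image: postgres:15

A single "docker compose up -d" then pulls whatever is missing and starts the stack, with no local build step.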


You are right, giobox: pulling an external dependency without guaranteeing its content impairs reproducibility (assuming anyone else in this world uses this word with the same meaning I do; at this point I am starting to think that I am wrong).

Let me give a short example, limited to the Dockerfile format:

  RUN wget https://somehost/something.tar.gz
is not reproducible: there is no guarantee that everyone is going to end up with the same container contents.

  RUN wget https://somehost/something.tar.gz && echo "123456  something.tar.gz" | md5sum -c
is reproducible, even if there is an external dependency: you can rest assured that the RUN statement will either produce the same result everyone else is getting, or an error.

Similar considerations can be made for other statements (FROM something:latest, RUN apt install something, ...). But, as I was saying, maybe my personal opinion of what is "reproducible" is a bit too strict.
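
For instance, applying the same idea to those statements (the digest and version string are placeholders, not real values):

  # pin the base image by digest instead of a mutable tag like :latest
  FROM debian@sha256:<digest-you-recorded>
  # pin the package version instead of whatever apt resolves today
  RUN apt-get update && apt-get install -y curl=<exact-version>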

I agree that using a registry & Compose is very useful (and personally, I do it all the time). It simply does not fit the definition of reproducibility, for me.



