It’s not common to need that many pulls, nor is it hard to build your own images.
If you’re deploying to a cluster with 200 machines, though, you could easily hit this if you use the public registry. If you’re managing a cluster that size you can probably afford the fee, but more importantly, you should probably pull once to a local registry and deploy to your cluster from that anyway.
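As a rough sketch of that pull-once approach (the internal registry hostname and the image are illustrative, not anything specific to the setup described above):

    # Pull from Docker Hub once, then retag and push to the internal registry
    # that the cluster nodes actually pull from.
    docker pull nginx:1.25
    docker tag nginx:1.25 registry.internal:5000/mirror/nginx:1.25
    docker push registry.internal:5000/mirror/nginx:1.25
    # Deployments then reference registry.internal:5000/mirror/nginx:1.25
    # instead of the Docker Hub image.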
Do you run a local registry? Any high-quality articles/youtube talks to share? I'm about to set one up for our own little cluster (~5 machines, ~75 containers). I know tons about docker engine, and a fair bit about the registry, but it's always nice to watch a "lessons learned from actually doing this in production" talk to know what mistakes to avoid.
I run tens of thousands of docker images in production, or rather, tens of thousands of copies of a few hundred images.
If you do something like this, you absolutely MUST have a local registry.
Harbor [1], JFrog [2], and Quay [3] would be the first ones that I look at.
Harbor is open source, free, and a member of the CNCF. You will need to do a little bit of work to set it up to scale properly. JFrog offers a SaaS registry, but you will pay big $$ based on pull traffic. Their commercial site license is about $3k/year. Quay is older than either of them, stable, and high quality. I'd start with Harbor these days.
Just to add: all the major cloud service providers offer registries as well (ACR/ECR/GCR, etc.).
If you run a k8s service with one of them, in my experience it is best to use the corresponding registry.
I have pulled and run a 1GB image 20k times in less than 10-15 minutes without breaking a sweat.
Finally, GitHub Packages offers a registry out of the box. It is great for CI and for devs to access. For production I generally have the tags mirrored from GitHub to ACR.
GitHub Packages works with GitHub CI out of the box, which makes development a lot easier. Like I mentioned, for the best networking in prod you should always use the registry from your k8s provider; mirroring the GitHub registry to ECR/GCR/ACR is fairly straightforward. Bandwidth costs are eliminated, and the network is a lot more reliable intra-DC.
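Roughly what that mirroring step looks like as a sketch (registry names, org, and tag are illustrative; shown against ghcr.io, while older GitHub Packages pulls went through docker.pkg.github.com; 'az acr import' can also copy the image server-side without a local pull):

    # Pull the CI-built image from GitHub's registry, then push a copy into
    # ACR for production pulls.
    docker login ghcr.io -u USERNAME --password-stdin < github_token.txt
    az acr login --name myregistry
    docker pull ghcr.io/myorg/myapp:v1.2.3
    docker tag ghcr.io/myorg/myapp:v1.2.3 myregistry.azurecr.io/myapp:v1.2.3
    docker push myregistry.azurecr.io/myapp:v1.2.3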
FYI, using ECR with Docker Swarm is something we did try. It was hellish. We never nailed down the exact problems, but we spent about a month with 2-3 experienced engineers trying to fix the edge-case issues.
The main issue was that ECR has a slightly different authentication model than Docker Swarm expects. The whole '--with-registry-auth' flow only partially works when you are using ECR. Unfortunately, it works just enough that you think it's working, until all your tokens time out and a worker suddenly can no longer pull an image.
Our common failure case was an image becoming unhealthy or a node being drained. When the container was rescheduled on a different worker, a worker that did not already have the image would try to pull it from the registry, and if the tokens had expired the pull would fail.
The only "fix" we ever found was to set up a cron job that forcibly deployed a new version of a "replicated globally" image every X minutes (where X was based on the ECR token expiration). It kind of worked, but we still had occasional failures we could not identify.
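For reference, that kind of workaround boils down to something like the script below on a cron schedule shorter than the 12-hour ECR token lifetime (region, account ID, and service name are illustrative, not the original setup):

    #!/bin/sh
    # Refresh the ECR login on a manager node, then force a redeploy so
    # workers receive fresh registry credentials along with the service spec.
    aws ecr get-login-password --region us-east-1 |
      docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

    docker service update --force --with-registry-auth mystack_myservice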
I wish it worked better, because it was nice to use ECR. Frankly, token expiration sounds much more secure too, but without direct support for token refresh inside the Docker engine it's just hard to get everything to work.
Looking at doing GitHub Packages for direct-to-dev and mirroring into ECR over here. Seems sound. But also considering other options as ECR is a pain to work with.
That said, word of warning for anyone looking at GitHub Packages for docker registry: it's broken with containerd and some other similar tools. They (GitHub) are currently working on a fix: https://github.com/containerd/containerd/issues/3291
I was a happy user of JFrog's registries via site license at my last 2 places. Seemed to just work as expected. Didn't have visibility into the cost though (other teams set it up) so I had no idea it was $3k/year.
We have not had good luck with Quay. They are not stable, especially as of late. There was a period last month where for two weeks pulling images was a crapshoot.
If you want to run a local registry to stay below the 100 pulls per 6 hours limit please consider GitLab. The Dependency Proxy https://docs.gitlab.com/ee/user/packages/dependency_proxy/ will cache docker images. This way you stay within the limits Docker set and subsequent pulls should be faster as well.
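Usage is just a prefixed pull path once the proxy is enabled for a group (hostname and group name are illustrative; private groups need a docker login with a personal access token first):

    # Pull an upstream Docker Hub image through the GitLab Dependency Proxy;
    # cached layers are served locally on subsequent pulls.
    docker pull gitlab.example.com:443/mygroup/dependency_proxy/containers/alpine:latest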
Personally, I wish generic caching proxies were still a thing, and easier to set up. I've tried setting up Squid several times in the past and failed miserably every single time. All I want is to use it as a gateway (i.e., make the proxy invisible to the application) for e.g. apt packages, so I just ended up using apt-cache or whatever other appropriate software, but I'd far rather use something generic that just works for 90% of the software I use at home, whether it's reading webcomics, repeatedly installing the same software in a dozen VMs with slightly different configurations, or just browsing remote filesystems via WebDAV.
I use nginx to proxy-cache the Arch Linux package repository transparently. It's fairly easy to set up and enables nice features: it can contact a secondary mirror if the first one is down, and when multiple requests hit the same resource they all block waiting on a single merged download, so the proxy will not fetch the same package multiple times if I run pacman -Syu on my 18 machines in parallel. And it's all just 20-30 lines of nginx config.
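A minimal sketch of that kind of config, assuming illustrative mirror hostnames, cache sizes, and server_name (some mirrors may also want an explicit proxy_set_header Host):

    # /etc/nginx/conf.d/pacman-cache.conf -- illustrative sketch
    proxy_cache_path /var/cache/pacman levels=1:2 keys_zone=pacman:10m
                     max_size=50g inactive=30d use_temp_path=off;

    upstream mirrors {
        server mirror1.example.org;          # primary mirror
        server mirror2.example.org backup;   # only used if the primary fails
    }

    server {
        listen 80;
        server_name mirror.lan;              # point /etc/pacman.d/mirrorlist here

        location / {
            proxy_pass http://mirrors;
            proxy_cache pacman;
            proxy_cache_valid 200 30d;       # package files are immutable
            proxy_cache_lock on;             # merge concurrent downloads of one file
            proxy_cache_lock_timeout 15m;
            # try the next mirror on errors or missing files
            proxy_next_upstream error timeout http_404 http_502 http_503;
        }
    }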
Just drop a Sonatype Nexus instance in a Docker container somewhere on your network. Alternatively, use Squid if you don't push to the public Docker registry, although you might need to mess around with an internal CA for SSL...
Storage in containers has been a long-solved issue. The defaults are unfortunate but make sense for ease of use. Your container root should be read-only, ephemeral storage should live in a tmpfs or dynamic volumes depending on performance and size needs, and persistent storage should live in volumes.
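A minimal sketch of that layout with plain docker run flags (image name, paths, and sizes are illustrative):

    # Root filesystem read-only, ephemeral scratch space in a tmpfs,
    # persistent state in a named volume.
    docker run -d \
      --read-only \
      --tmpfs /tmp:rw,size=256m \
      -v appdata:/var/lib/app \
      example/app:latest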
You can set it up in less than 10 minutes, and the only thing required is to add '--insecure-registry' to the Docker daemon config on your clients. It's not an issue if all your machines are on a private network.
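A minimal sketch, assuming the registry host resolves as registry.local (hostname is illustrative, and the daemon.json line should be merged into any existing config rather than overwriting it):

    # On the registry host: run the stock registry image.
    docker run -d -p 5000:5000 --restart=always --name registry registry:2

    # On each client: trust the plain-HTTP registry, then restart the daemon.
    echo '{ "insecure-registries": ["registry.local:5000"] }' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker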
Expect to spend 1-2 hours the first time you try it, until you get the correct DNS records, API keys, and configuration set up.
Afterwards it's pretty hands-off: every three months you'll receive an email from Let's Encrypt and you'll have to rerun this script to regenerate your certificates. That takes 2-3 minutes max (but of course you still need to distribute your certificates to all the relevant services...)