Docker to rate limit image pulls (docker.com)
355 points by AaronFriel on Aug 24, 2020 | 263 comments



I originally came here to ask how folks use that many docker images in six hours (I'm mostly a Docker n00b, and not being facetious); however, after reading the article, I clicked to see how much unlimited is and it's $5 a month. Now my question has changed to: is $5 a month really a deal breaker for folks to get unlimited? Or what are the use cases where the cost is prohibitive? Open source or community projects?

In most cases it would seem $5/user isn't much as a business or organizational expense, and if it's a personal project 200 images in six hours seems pretty solid?

I'm just sort of shocked it's so cheap. I figured it'd be like $25-$100 a month or something just because of all the bandwidth someone could probably burn building random/broken shit over and over.

I'm mostly curious because I've considered using Docker recently for personal projects and my home server; but I'd rather not invest a bunch of time porting things only to find out it's gonna be way more than $5 (not including my time).

Or is it that hard to just build your own images? To be honest, I've never really understood that part of Docker (using other folks' images)... It always seemed like an enormous security risk[1]. FWIW, I still deploy my personal projects using chef/ansible, shell scripts, and systemd units like some sort of curmudgeon-y monster...

[1] Everything is a security risk (APT, source, etc), I know, I get it. No one need scribe miles of pedantry into the comments explaining it to me. It's what each of us has tolerance for and can accept that matters.


It’s not common to need that many pulls nor is it hard to build your own images.

If you’re deploying to a cluster with 200 machines, you could easily hit this if you use the public registry though. However, if you’re managing that size cluster you can probably afford the fee, but more importantly, you should probably pull once to a local registry and use that to deploy to your cluster anyway.


Do you run a local registry? Any high-quality articles/youtube talks to share? I'm about to set one up for our own little cluster (~5 machines, ~75 containers). I know tons about docker engine, and a fair bit about the registry, but it's always nice to watch a "lessons learned from actually doing this in production" talk to know what mistakes to avoid


I run tens of thousands of docker images in production, or rather, tens of thousands of copies of a few hundred images.

If you do something like this, you absolutely MUST have a local registry.

Harbor [1], JFrog [2], and Quay [3] would be the first ones that I look at.

Harbor is open source, free, and a member of the CNCF. You will need to do a little bit of work to set it up to scale properly. JFrog offers a SaaS registry, but you will pay big $$ based on pull traffic. Their commercial site license is about $3k/year. Quay is older than either of them, stable, and high quality. I'd start with Harbor these days.

[1] https://goharbor.io/ [2] https://www.jfrog.com/confluence/display/JFROG/JFrog+Artifac... [3] https://quay.io/


Just to add: all the major cloud service providers offer registries (ACR/ECR/GCR, etc.). If you run a k8s service with one of them, in my experience it is best to use the corresponding registry.

I have pulled and run a 1GB image 20k times in less than 10-15 minutes without breaking a sweat.

Finally, GitHub Packages offers a registry out of the box. It is great for CI and devs to access. I generally have the production tags mirrored from GitHub to ACR.


Github Docker Registry is a mess and should be avoided at all costs.

1) It is broken and unusable on Kubernetes and Docker Swarm.

2) It is flaky, often returning 500-type errors.

3) It is expensive as the amount of pull bandwidth is very limited.


GitHub Packages works with GitHub CI out of the box, which makes development a lot easier. Like I mentioned, for the best networking in prod you should always use the registry from your k8s provider; mirroring the GitHub registry to ECR/GCR/ACR is fairly straightforward. Bandwidth costs are eliminated, and the network is a lot more reliable intra-DC.


> It is broken and unusable on Kubernetes and Docker Swarm.

Hmm, I've used them on several Kubernetes clusters in the past few months and haven't seen any issues yet.


FYI, using ECR with Docker Swarm is something we did try. It was hellish. We never nailed down the exact problems, but we spent about a month with 2-3 experienced engineers trying to fix the edge-case issues.

The main issue was ECR has a slightly different authentication model than docker swarm. The whole '--with-registry-auth' only partially works when you are using ECR. Unfortunately, it works just enough that you think it's working, until all your tokens time out and a worker can suddenly no longer pull an image.

Our common failure case was a container becoming unhealthy or a node being drained. When the task was rescheduled onto a different worker that did not have the image, it would try to pull it from the registry, and if the tokens were expired it would fail.

The only "fix" we ever found was to setup a cron job that forcibly deployed a new version of a "replicated globally" image every X minutes (where X was based on ECR token expiration). It kind of worked, but we still had occasional failures we could not identify.

I wish it worked better, because it was nice to use ECR. Frankly, token expiration sounds much more secure too, but without direct support for token refresh inside the docker engine it's just hard to get everything to work.


Looking at doing GitHub Packages for direct-to-dev and mirroring into ECR over here. Seems sound. But also considering other options as ECR is a pain to work with.

That said, word of warning for anyone looking at GitHub Packages for docker registry: it's broken with containerd and some other similar tools. They (GitHub) are currently working on a fix: https://github.com/containerd/containerd/issues/3291


I got set up with ECR without any difficulties whatsoever. You do have to authenticate before pulls and pushes, but that can be scripted very easily.


I was a happy user of JFrog's registries via site license at my last 2 places. Seemed to just work as expected. Didn't have visibility into the cost though (other teams set it up) so I had no idea it was $3k/year.


We have not had good luck with Quay. They are not stable, especially as of late. There was a period last month where for two weeks pulling images was a crapshoot.


Thank you very much. This is exactly the type of info I needed.


If you want to run a local registry to stay below the 100 pulls per 6 hours limit please consider GitLab. The Dependency Proxy https://docs.gitlab.com/ee/user/packages/dependency_proxy/ will cache docker images. This way you stay within the limits Docker set and subsequent pulls should be faster as well.


Personally, I wish generic caching proxies were still a thing, and easier to set up. I've tried setting up squid several times in the past and failed miserably every single time. All I want to do is use it as a gateway (i.e., make the proxy invisible to the application) for e.g. apt packages, so I just ended up using apt-cache or whatever other appropriate software. But I'd far rather use something generic that just works on 90% of the software I use at home, whether it's reading webcomics, repeatedly installing the same software in a dozen VMs with slightly different configurations, or just browsing remote filesystems via webdav.


I use nginx to proxy cache the Arch Linux package repository transparently. It's fairly easy to set up, and enables nice features like contacting a secondary mirror if the first one is down, or when multiple requests hit the same resource, all are blocked waiting for a single merged package download, so the proxy will not make the download multiple times if I run pacman -Syu on my 18 machines in parallel. And it's all just 20-30 lines of nginx config.
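Roughly, the core of it looks like this (hostnames, paths, and sizes are placeholders, not my exact config; in practice you'd also avoid long-caching the .db files):

    proxy_cache_path /var/cache/nginx/pacman levels=1:2 keys_zone=pacman:10m
                     max_size=50g inactive=30d use_temp_path=off;

    server {
        listen 80;
        server_name mirror.lan;              # pacman's mirrorlist points here

        location / {
            proxy_cache       pacman;
            proxy_cache_lock  on;            # merge simultaneous downloads of one file
            proxy_cache_valid 200 30d;
            proxy_pass        https://mirror.example.org;
        }
    }

A secondary mirror can be added with an upstream block plus proxy_next_upstream.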

It's not transparent though.



I just use ECR[1] which in many cases costs less and is fully locked down behind my AWS VPC

With ECR you pay for image storage: $0.09 per GB after the first 1 GB which is free

[1] https://aws.amazon.com/ecr/


are you gonna rebuild all the images that you use and push to ECR?


Nope, you don't have to rebuild images to push to different registries.

Pull from Docker Hub once, push to ECR. Then pull from ECR as much as you wish.
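e.g., assuming you've already done the docker login to ECR (account ID, region, and image are placeholders, and the target repository has to exist in ECR first):

    docker pull nginx:1.19
    docker tag nginx:1.19 123456789012.dkr.ecr.us-east-1.amazonaws.com/nginx:1.19
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/nginx:1.19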


Just drop a Sonatype Nexus instance on a Docker container somewhere on your network. Alternatively, use Squid if you don't push to the public Docker registry, although you might need to mess around with internal CA for SSL...


Docker supports proxies (they call them “pullthrough repos”) so you don’t have to be so generic as an http proxy.


I would stay away from Nexus. It has problems with latest tags.


Nexus in a container... because storage in containers is such a good idea? Any vps with a disk is probably a better idea


You can still bind mount a directory into a container...


Storage in containers has been a long-solved issue. The defaults are unfortunate but make sense for ease of use. Your container root should be read-only, ephemeral storage lives in a tmpfs or dynamic volumes depending on performance and size needs, and persistent storage lives in volumes.


https://docs.docker.com/registry/

You can set it up in less than 10 minutes, and the only thing required is to add '--insecure-registry' on your clients. It is not an issue if all your machines are on a private network.
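For reference, that boils down to something like this (the hostname is whatever your registry box resolves to):

    # on the registry host
    docker run -d -p 5000:5000 --restart always --name registry registry:2

    # on each client, in /etc/docker/daemon.json (the daemon equivalent of
    # --insecure-registry), then restart dockerd
    {
      "insecure-registries": ["registry.local:5000"]
    }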


Isn't there no authentication on that registry? I guess that's fine if you don't believe in zero-trust architecture.


you are right. That is what you can get in minutes.


If you cannot get a TLS cert for internal infrastructure in a few minutes, I'd recommend you start looking into why.


There's no good documentation on it, and it is not very important for me (I run it on a homelab).

I still wonder how to do it in minutes.


I use this (in a docker image) to generate certificates automatically: https://github.com/adferrand/dnsrobocert

Expect to spend 1-2 hours the first time you try it until you can set up the correct DNS records, API keys and configuration.

Afterwards it's pretty hands-off; every three months you'll receive an email from Let's Encrypt and you'll have to rerun this script to regenerate your certificates. Takes 2-3 minutes max (but of course you still need to distribute your certificates to all relevant services...)


If you run traefik it's even easier: https://docs.traefik.io/https/acme/


If you run on Kubernetes, the image is (or can be) cached at the node level.


Anything is a deal breaker if the user expects to pay nothing. That is why going from free to $1/month will have a much larger user dropoff than, say, $1 to $10.


Weber's Law. Human perception is logarithmic.


It may be general resistance to a concept. More and more services are becoming subscription-based, so it's $5 here, $5 there, and the cost creeps ever higher.


Exactly.

It's hard enough to save money as it is.

News websites blow my mind with this - if I forked over $5 to every news outlet I occasionally like to read, I'd be spending at least $500, maybe more, per year JUST to get access to some random person's biased recounting of what's happening in the world. If there were a news source that did the opposite of this, and basically provided a bullet list of objective, non-biased events boiled down to exactly what I need to know, that might be something I'd pay for. Hell, it would save you time over filtering the opinionated BS out.


Providing objective, non-biased events would be very hard.

Consider for example the current riots going on in the US. How do you objectively report on that? With bias, on one side you have "peaceful protest disrupted and escalated by the police", on the other side you have "police intervening in riots to maintain order and protect property". There's not really an in-between.


"Protesters say their peaceful assembly has been disrupted and escalated by the police. The police argue they've only been intervening in riots to maintain order and protect property."

Done.


If you have to represent "both sides" (in many cases there'll be more than two sides really), you end up having to give a voice to nutjobs, plus you present both sides as equally valid assessments.

Much as we'd all love an "unbiased" news source, the reality is that bias is a very hard problem to solve well.


Indeed, just repeating what people say about an event may be factual, but without any concept of what is actually true, it can’t be considered objective.

If one side is lying, objective reporting would tell you which side it was.


I explored building exactly that but turns out there’s no money in it. Most people want the narrative with the facts, if not more so.


There's a big market for this, it's just not for individuals. For example, Bloomberg distributes factual news on its terminal. The Bloomberg terminal even highlights important words in news stories so you can absorb the information more quickly. So if there was a earthquake somewhere, it might highlight the word "earthquake," the number of people that died, and the economic cost, for example.

Also there are news wire services that do mostly what you're describing. If you just want to be entertained (most people read news for entertainment), then you don't really care about the facts; you want to hear about so-and-so blasting so-and-so or whatever. But if you're trying to make money from information (traders, journalists, etc.), then you really don't want to be reading the kind of stuff the New York Times is publishing.


I ran the math on this once.

Just a subscription to The Information is $399/yr. Add a subscription to the Times, and you've already blown past your $500 budget.


Mind blowing, imagine paying $5 for each newspaper one wants to read.


You’re missing the point. If there’s a news source that you occasionally read then it’s far more cost effective for you to just buy the paper at the stand for 50¢ the few times you want it. Same with magazines. If you read every newspaper then getting lower cost and delivery in return for the paper getting consistent revenue is a good deal for both parties.

If you get your news like most people, via link aggregators like HN, Reddit, Facebook, or Twitter, then you get linked to dozens of publications that all want a $5/mo. commitment, which is untenable.


It would be neat if news sites would start offering 50 cent day passes instead of difficult to cancel subscriptions.


I sort of half thought Apple News might go that way. It might not be cost effective - bundling larger subscription stuff is probably more revenue/profit.

But... in their news app, there's always a couple of interesting articles I might want to read, but I'm not signing up. They have my info, and a Touch ID device I'm holding tied to my payment info. "Read this article for 50c?" I'd certainly give some a read now and then.


My daily newspaper costs 2€ at the newsstand, not cents. With a monthly subscription of 5€. Pretty much worth it.

I have been subscribing to valuable information sources since 1995, so I do get the point.


Except it’s not $5, it’s $5 to sign up, then an email and a phone call and your firstborn dog to unsubscribe. If it were microtransactions, that’d be one thing...


If the service doesn't play ball there is always the consumer protection agency.


That doesn’t work. I find it better to stay anonymous and avoid spam from these services.

The perpetual spam is worse for me than the $5.


$5/month isn't $5/month. It's convincing your boss you need $5/month, because they need to convince their boss, which eventually makes its way up the chain to C-levels, who don't know Docker from yesterday's rotting tuna casserole and view eating either that or the $5/month with the same level of disdain.

It isn't about the money, it's about the Mommy-May-I up and down the chain with emails and meetings and careful explanations to skeptical glares. It's a psychological and institutional barrier.


If your C suite is personally approving a $5/month charge, your organization is likely nowhere near the size where a change like this from Docker impacts you.


> If your C suite is personally approving a $5/month charge, your organization is likely nowhere near the size where a change like this from Docker impacts you.

Or has a toxic, micromanaged structure. I've had friends who have worked at places that would barf over ongoing $60/year software charges, where anything like that would have to go up to C levels and require justification. Luckily I never worked at one myself, dodged that particular bullet.


My purchase-approval flow is the same for $1 purchases as it is for $1k purchases. If we hadn't finagled a minor workaround, it would be the same as that required for $5k purchases.

Handling each purchase and documenting it in case we are ever audited requires easily $25-50 of people's time.


If your C suite requires approving a $5/month charge, the company has serious issues. This always bugs me about the HN attitude toward spending money, but it's really an extremely frugal developer complex. It's $5 a month, you get insane value out of Docker Hub, just pay it. I come from an entrepreneur attitude: my time is my most precious commodity. I don't optimize minor expenses, I optimize the big picture, my time, outcomes, and revenue coming in.


Yeah, when it's $5/mo blocking good utilisation of multiple $150k people, someone doesn't understand priorities.


Or maybe your organization is a public institution, where each penny spent needs to be authorized/accounted for.


> It was revealed on January 22, 2009 that Thain spent $1.22 million of corporate funds in early 2008 to renovate two conference rooms, a reception area, and his office, spending $131,000 for area rugs, $68,000 for an antique credenza, $87,000 for guest chairs, $35,115 for a gold-plated commode on legs, and $1,100 for a wastebasket. Thain subsequently apologized for his lapse in judgment, and reimbursed the company in full for the costs.

> https://en.wikipedia.org/wiki/John_Thain

"Sorry I got caught. I will work harder to hide next time."


Even in a public institution that has never failed an audit, I have trouble believing $5/mo is a hard thing to get. Technically, it's a $60/year fee because it's one of those annual plans. We pay for Vultr with no drama.


$5 a month also means consideration is given as part of a contract, and legal will need to review to make sure you aren't granting them patent immunity, etc.


Worked at a place where everything went through the VP of finance and they were making $15M/yr profit. At one point we brought in a desktop off of the curb to use as our build server. Sure, we got in trouble with ops later, but those above us didn't care since it didn't cost us anything.


It took 3 months at my old client (fortune 100) to get a signature for an Addendum that would deliver us a service for free in addition to the paid stuff that was already signed.


For me it’s not the c-suite, it’s the admins who make purchases. Or the hassle of maintaining a corporate card.

The headaches to spend $5 cost way more than $5.


>It's convincing your boss you need $5/month, because they need to convince their boss, which eventually makes its way up the chain to C-levels, who don't know Docker from yesterday's rotting tuna casserole and view eating either that or the $5/month with the same level of disdain.

this is where the miracle of enterprise sales happens - the $5 subscription can be sold as a $50K+ deal by smooth enterprise sales who will provide the C-exec with an experience that makes him feel like he did something smart and great for the company.


Just promise to give the exec a keynote time slot where he can describe the ROI of the 50k deal. Helpfully, the vendor will provide an Excel doc that lets you calculate the benefits. You don’t even have to realize them, just aspire to. From there, the C-level gets some fawning press coverage for their LinkedIn profile and a set of job offers at the next-size-up organization where they can also level up on comp...


This "up-and-down chain" should stop pretty quickly at the level where eng mgmt knows loaded costs. It's an eng manager and director's job to compute total-cost-to-execute, and opportunity cost (using loaded-costs for engineer time). They should be able to approve once you've convinced them that it's cheaper to buy than to build.

Engineers can head this off by prepping a total-cost-to-execute analysis for mgmt. This doesn't need to be complicated. It's just some estimates of what's needed and why, and what the alternatives cost. My eng VP used to ask me for these when I'd send up a request. He wanted to know that we thought about total-cost-to-execute. He'd usually only read the exec summary and approve. If these requests are really going that far up the mgmt chain either someone isn't doing their job, or higher level mgmt are micromanagers.


If you work for a company that disconnected from reality, you'd best be looking for a new job. For $5, the guy who decides the budget should be angry that you didn't come to him directly for such a tiny amount of money.


There is no way you need C-level approval to get a $5 a month subscription.


Start looking. That place is going to fail.


>Now my question has changed to: is $5 a month really a deal breaker for folks to get unlimited?

I want to speed forward 5 years and see how well this ages. It reminds me of all the other comments about "of course Facebook will never require an account to use your VR headset"...


I think it will age fine. If every ad-supported website switched to this model, I'd be paying for either zero or one websites.

If it was $5 for access that would be a completely different situation, but a big free tier followed by $5 for unlimited is fine.


I don’t think anyone who’s adopted systemd can call themselves a curmudgeon :)


Sometimes, curmudgeonliness is measured in resistance to changing the default:

"Stock CentOS was good enough before and it's good enough now!"


It is pretty cheap, it is just the free beer crowd that is upset.


Yeah, I'm just going to pay for it. I think I probably already am, but if I'm not, I will. Even the enterprise $7/user isn't that bad at all. GitHub Enterprise is like 3x that.


The limit for unauthenticated users is by IP address. I could imagine a smallish business that has a consistent on-ramp IP for their users that could breach the limit not on any single user but in aggregate.

I do sympathize with Docker though: Storage and bandwidth at that scale isn’t cheap and they need to monetize somehow.


You should be more shocked that everything else is so expensive.

The base costs for a lot of tech stuff (like bandwidth) are so cheap. But you would never know between these bullshit "what is it worth to you" pricing models and the number of middlemen trying to stick their hand into the pot. It's disgusting.


Developers are cheap. “I could build that” is your competitor

Note could, not will


Developers are overly optimistic, undervalue their time (they don't know the loaded-cost of their time), and overvalue their solutions (they rarely account for maintenance costs or the long-tail of time to perfect the solution). So developers think they are cheap. For example, I had an engineer express disgust that we might have to pay $4/month-per-user for gitlab just to have mirroring enabled. He said "we can do our own mirroring." When asked how long it would take he said, "between 2 hours to two-weeks if we have to iron out bugs". So when accounting for his time, his price to deliver ranged between $200-ish in the best-case scenario, to $8000-ish in the worst-case scenario (which far exceeds the cost of paying for the licenses). Needless to say, we paid for the licenses. He also wasn't considering the opportunity-cost, i.e., what will he NOT do that's more valuable while he works on this proposed solution that we could just pay for?


Good!

I'd even welcome much more aggressive limits than what they're proposing; the current culture regarding builds and CI in general is horrifyingly inefficient, wasteful, and in the end just plain slow.

I'm looking forward to developers adjusting their workflows (and caches, etc.) to actual, reasonable limits, not just using the service as if it were an unlimited, cost-free cornucopia of software.


I welcome the end of free beer culture, so I am quite alright with their change.


Everything is inefficient. There is a huge class of "developers" that don't understand what O(n) means, and a subset of them are vocally proud of it.


> the current culture regarding builds and CI in general is horrifyingly ineffficient

I used to be in charge of the website for a company you’ve heard of. We once realized some huge proportion of our traffic originated from a hosted CI company requesting the site thousands and thousands of times (guessing one for each build they hosted) every 5 minutes.

I can’t remember what proportion of traffic it was but I’m pretty sure it was a majority, maybe even more than 80%.


I sure hope the "cloud first" advocates are happy now, because they have managed to create masses of developers who have next to no idea that what they're doing is an incredible waste of resources. These are also the same people who are perplexed why their systems intermittently fail, or are surprised that they do when the Internet connection cuts out for a bit.


There is very little reason for a build node to need to pull 200 images in 6 hours, and here is why:

When a machine issues a ``docker build`` command, the program reads the relevant Dockerfile to check for any base images that need to be pulled (via the "FROM" instruction).

These base images are identified based on the image repository, image name, and image tag. The first thing docker does is check its local image cache and try to find a match for the base image the docker build is requesting. If a matching image is located in the local cache, it uses that one in lieu of downloading the image.

This is significant - if your organization only uses a few dozen base images from DockerHub, those images will only be downloaded by each build node _once_, then never again.

Many docker users erroneously believe that if their Dockerfile requests a "latest" tagged image, docker build will always download the newest version of the image. However, the "latest" tag is literally just a tag, it doesn't have any special functionality built in. If the docker build command finds an image tagged "latest" in the local cache, it stops there.

The only way to get docker build to always use the "actual latest" version of the base image is to add the "--pull" parameter to the docker build command. This arg will tell docker build to check the repository remote to see if the SHA hash of the image tagged "latest" has changed, and if so, re-download and use it. In the absolute worst case, this means each build node will pull 1 copy of each base image when the base image is updated. So unless you use 200 different base images that all have updates deployed to Dockerhub each and every day, you are fine.
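Concretely, the difference is a single flag (the image name is just a placeholder):

    # reuses a locally cached base image, even if the tag moved on Docker Hub
    docker build -t myapp .

    # checks the remote manifest and re-downloads the base only if its digest changed
    docker build --pull -t myapp .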


I don't disagree with what you are saying, _but_:

> Docker defines pull rate limits as the number of manifest requests to Docker Hub.

> For example, if you already have the image, the Docker Engine client will issue a manifest request, realize it has all of the referenced layers based on the returned manifest, and stop. ... <excluded> ... So an image pull is actually one or two manifest requests,

This still implies that even if you are appropriately re-using layers on your machine, with a free plan you can do a maximum of 200 builds (since docker still needs to verify it has the image) per 6 hours?

This change also seems to imply that build steps which previously did not handle/require authentication against Docker Hub (it was only pulling public images, and pushing elsewhere) will now be required to auth against Docker Hub in order to double the number of pulls/checks/builds allowed?


This is an excellent point. Trying to find out if docker build --pull without an accompanying blob download will trigger the rate limiter.

If it does, then this will definitely be a reason to riot. It will effectively mean that anyone who wants to do more than 200 builds every 6 hours using the "right" way will have to get a docker pro subscription.


It sounds like it definitely does trigger the rate limiter.

> There is a small tradeoff – if you pull an image you already have, this is still counted even if you don’t download the layers.

I expect we're just going to see a lot more recycling of build nodes once it has "used up its docker credits".


All reasonable orgs should have had their private docker repo a long time ago.

Everybody else is living the pipe dream where they have externalised their risk and probably deserve the Docker treatment.


Yes, but Docker achieved their goal of making it annoying as hell to not use DockerHub. You can run your own private repo just fine, but what you want is a transparent proxy (like apt-cacher) that will let you pretend you're using DH while actually pulling from either the cache or your private repos. All the pieces are there with private repos and "pullthrough" proxies; they're just not well integrated, seemingly on purpose.

RedHat’s patches to Docker make this possible but Docker has refused to upstream it.


> annoying as hell to not use DockerHub

Why? All you need to do is use your domain name when referencing the image.


Agreed. Third-party package repositories have been a weak point in our CI, and we put all of them behind a self-hosted proxy that we can manage in our own HA fashion. Turns out we get faster pulls from it, as well as being a good internet citizen.


If you do 200 builds every 6 hours, you could probably afford to pay $5 a month.


I admittedly have only used Docker very little, but how exactly does someone manage to build images once every 108 seconds continuously for 6 hours? That sounds extreme.


Easily with CI. Every pull request on the GitHub project will build a dozen Docker images whenever a PR is opened, updated or merged.

Granted, there's only a couple base images involved, so CI pipelines will need updating to be more efficient in terms of `docker build --pull` usage.


> The first thing docker does is it checks its local registry and tries to find a match for the base image the docker build is requesting. If a matching image is located in the local registry, it uses that one in lieu of downloading the image.

While I agree that this is the way it's supposed to work, I have unfortunately worked at companies with "stateless" build/CI servers that download the Docker image each build.


Well, this policy change will force them to be more efficient, and it's a net win for everyone


Or just pony up the $5/mo for Pro... not as fun as re-engineering your CI pipeline, once again.


You have to re-engineer it anyway to authenticate your Pro account.


> While I agree that this is the way it's supposed to work, I have unfortunately worked at companies with "stateless" build/CI servers that download the Docker image each build.

Couldn't they remain stateless but be redirected through a caching proxy? Memoization is not contrary to statelessness.


Sure, now they have to build a proxy...


Super easy to run a docker cache proxy:

    docker run -d -p 6000:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    --restart always \
    --name registry registry:2
That's it. Now fetch docker images from the IP that command is running on. Taken from gitlab: https://docs.gitlab.com/runner/install/registry_and_cache_se...


Is that going to actually help with the manifest-based rate limits? It sounds like it only caches the layers, the manifest metadata for a tag is not cached.

https://docs.docker.com/registry/recipes/mirror/#what-if-the...

> When a pull is attempted with a tag, the Registry checks the remote to ensure if it has the latest version of the requested content. Otherwise, it fetches and caches the latest content.


Hm you're right. I wonder if there's a way to cache a tag's metadata for a while...


I think this addition to the daemon config (daemon.json) should do the trick to make it hit the proxy?

https://docs.docker.com/registry/recipes/mirror/#configure-t...
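i.e. something like this in /etc/docker/daemon.json, using the port from the command above (assuming the mirror runs on the same host; otherwise use its address):

    {
      "registry-mirrors": ["http://localhost:6000"]
    }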


Artifactory is less bad than most of the tools I have to use all day.


Artifactory is the very definition of expensive (even at an enterprise scale) when it comes to docker images though.


Can you tell me more? How expensive are we talking?

Working for the same sized companies for a while has apparently dulled my senses. At a certain size, the capital that matters is the political capital it takes to get a vendor agreement in place to begin with. The monthly costs of the system are something you only feel through pushback on how big the repo gets, or the rate of traffic (experiencing the latter now with a browser testing SaaS)


> This is significant - if your organization only uses a few dozen base images from DockerHub, those images will only be downloaded by each build node _once_, then never again.

You're assuming that the set of build nodes is relatively static.

Plenty of architectures set up autoscaling for the underlying nodes, that terminate servers that aren't being used and relatively soon enough (tens of minutes, hours) spin up new servers to replace them as needed.

Rarely do the machine images used to spin up new servers include the base images of the containers that will be spun up to replace them. Much more often, the base machine image is a base OS image, and container images are downloaded on-the-fly as needed. Essentially, the engineering cost of making image-launching more efficient was externalized onto an external provider willing to pay the price.


If you're doing this, you're in a cloud environment that's also proximate to a blob store, and can trivially host your own registry.


> and can trivially host your own registry

That is far from trivial.


And now you have an alternative option - pay $5/month


If you’re using docker for production distribution of images, you should be paying for it. That’s exactly the behavior that creates the need for a limit.


Or don’t use docker in production ;)


You can just build your own machine images and not use docker at all.


CI/CD systems on AWS, Azure, GCP and others might be running on Kubernetes containers (using kaniko, podman, etc) or using Docker-in-Docker, and there isn't a widely supported or in-cloud-platform tool for sharing cached layers.

And as pointed out below, even if you are intelligently caching layers, manifest requests count as a pull. As far as I know, no caching proxies exist for Docker that support limiting manifest pulls.


Surprised to see Docker-in-Docker mentioned so deeply down here. It’s an extremely valid way of doing things, and non-trivial to implement a caching layer for.


Isn't Docker-in-Docker actually using the host's Docker daemon? I am mounting the docker socket in all my Docker-in-Docker containers, thus all the build tasks running on the same host can share the caches.

I guess one could have docker containers that actually run docker, but I don't see a reason to do that...


No Docker-in-Docker would generally refer to running a new dockerd inside of a container.


I was wondering how Docker-in-Docker works, but I couldn't find it dockermented anywhere. If it's using the host's Docker daemon, why do you need to mount the docker socket?


> If it's using the host's Docker daemon, why do you need to mount the docker socket?

There are 2 components for docker: the daemon and the tool used to send commands to the daemon. In order for said tool to be able to send commands to the daemon, it needs a way to communicate with the daemon. Mounting the socket in the container is the easiest method.

I have a "tooling" image that consists of a set of scripts (python code) to do various things ops related. One of the things is to build new images when required. I have a script that given a git commit will detect the images that need to be build and build them. Having my tooling code in a container makes it easier to deploy and use new versions of the tooling code. I don't need anything on the host apart docker itself. No build scripts, no python.

As I said, i could be running the docker daemon inside the container, but that breaks one of my rules related to containers: containers are not virtual machines, they should only run 1 process and the output of that process should be std out.
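So the tooling container is just started with the host's socket mounted, something like (the image name is illustrative):

    # the docker CLI inside the container talks to the host daemon via the socket
    docker run --rm -it \
        -v /var/run/docker.sock:/var/run/docker.sock \
        my-tooling-image:latest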


Very interesting, thanks for sharing! I found a good article about it: https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-d...

At the end he describes mounting the socket. The tooling image which has all the dependencies needed to build will also have the docker cli installed, which is what I'm assuming you are doing.

I might just use this. Cheers!


Docker-in-Docker (DinD) doesn't piggy back on the host's Docker daemon, but instead runs a stripped-down Docker daemon inside of the container. The major downside is that I/O is quite slow, since you're going through two virtualization layers (the DinD one, plus the host Docker daemon).


This is not true.

There is, effectively, no "virtualization" layer here. There are some things that if needed can cause overhead... such as the bridge networking (really shouldn't be a bottleneck for majority of people), and the CoW filesystem... which docker won't be (or shouldn't be) running on top of since, for example, overlayfs on top of overlayfs is not supported.

There is also nothing stripped down about the daemon inside of the container.


Sure, I was speaking off the cuff based on my experience from a few years ago. Maybe I messed up and somehow had the DinD daemon not use a volume mount, and that's what caused it to build images slowly?


Very well could be since it would have to fallback to the naive graphdriver that just copies stuff around.


Will mounting the socket, as the person I replied to suggested, make it use the host's docker daemon?


Yes, that's the point.


Usually docker outside of docker is used, no? If the image is cached on the host, it would be available to any container having access to the docker daemon socket as well since it's the same daemon.


No that’s only the case if you mount the Docker socket into the container, which is not what Docker-in-Docker is.


> This is significant - if your organization only uses a few dozen base images from DockerHub, those images will only be downloaded by each build node _once_, then never again.

Only if your build nodes have unlimited storage. If the build nodes are spun up on demand or have housecleaning tasks to prevent Tragedy of the Commons disk exhaustion, this is not true.

On the other hand, this is what caching proxies/registries are for.


I'm not sure if this is 100% true any more. I've found that when enabling the DOCKER_BUILDKIT=1 env var that docker will sometimes eagerly re-fetch stale images. I think your argument is generally still true, but was happy to see that some progress is being made on dealing with stale `latest`.


This is significant - if your organization only uses a few dozen base images from DockerHub, those images will only be downloaded by each build node _once_, then never again.

Unless you’re using something like AWS CodeBuild that spins up a Linux/Windows container for your build environment, executes bash commands in a yaml file, and then terminates it when it is done. Nothing is stored locally after the build is finished.

I’m sure there are other similar services. Wouldn’t Azure Devops using hosted builds do basically the same thing? I haven’t used it since they changed the name from Visual Studio Team Services.


What is a solution for the scenarios you have described? Amazon has ECR but it doesn’t support signing and doesn’t function as a proxy so you would miss upstream changes unless someone pushed them. Anything self hosted that supplies that functionality?


Do you really need all your private images to be derived directly from the upstream? Don't you start every image with:

    FROM foo
    RUN apt-get update && apt-get -y upgrade
?

Then why not have a set of base images, derived directly from upstream, that get built every so often, and have your private images be derived from those? This will not only relieve the stress on DockerHub and prevent you from having to pay the $5/month, but also give your security people a hook to run their tests, and make your private images build faster, since all the system updates won't happen every time you change the code.
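A hypothetical sketch of that split (the registry host and tags are placeholders):

    # internal base image, rebuilt on a schedule and pushed to your own registry
    FROM ubuntu:20.04
    RUN apt-get update && apt-get -y upgrade && rm -rf /var/lib/apt/lists/*

    # application images then start from the internal copy instead of Docker Hub:
    # FROM registry.internal/base/ubuntu:20.04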


Pay $5.00 a month. Docker is a business that deserves to get paid if they offer something valuable.


Well that takes all the fun out of it. Looks like Docker itself has a Dockerhub proxy -

https://docs.docker.com/registry/recipes/mirror/

https://hackernoon.com/mirror-cache-dockerhub-locally-for-sp...

https://stackoverflow.com/questions/32531048/docker-pull-thr...

https://docs.docker.com/registry/configuration/

https://www.google.com/amp/s/ops.tips/amp/gists/aws-s3-priva...

If using Alpine, it looks like docker-registry is the needed package and /usr/bin/docker-registry serve /etc/docker-registry/config.yml is the command line. The next-to-last link has information on the config file.


Per your first link:

> What if the content changes on the Hub?

> When a pull is attempted with a tag, the Registry checks the remote to ensure if it has the latest version of the requested content. Otherwise, it fetches and caches the latest content.

If that causes a manifest pull, it counts as a pull and will be rate limited. Yikes! This could lead to wildly nondeterministic behavior.


Yes, it does a pull but caches the response, so subsequent pulls should hit the local cache and not be limited.


I don't think that's what it means.

> When a pull is attempted with a tag, the Registry checks the remote

Checking the remote is a manifest pull.


The difference is a HEAD request or a conditional GET: the server will not send the file if it matches the timestamp and/or ETag of the version you have, so the reply is a few bytes rather than (potentially) dozens or hundreds of megabytes. Same with all CDNs.
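For example, a conditional request returns a 304 and a few headers when nothing changed (the URL and ETag are placeholders):

    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'If-None-Match: "abc123"' \
        https://registry.example.com/v2/library/alpine/manifests/latest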


This still counts as a manifest pull for rate limiting purposes based on what i'm seeing in this thread.


If everyone in your company (plus your CI system) is behind the same firewall/IP address, that's going to be a lot more than 200 pulls.


Luckily though, companies have cash to pay for registrations, and those that won't probably have engineers who can set up Squid proxies.


Lots of tools built on top of Docker do imbue special meaning in latest, however.


I kinda wonder if Docker as a company is struggling. Red Hat made Podman, which is a compatible replacement. Then there's Swarm, but apparently that's no longer recommended or actively developed, and as far as I know they sold off their enterprise clustering product. Kubernetes seems to be the popular thing now, even if a bit complex to set up. I wonder what the current business model is. Pretty neat idea of using containers, but it seems they put it out in the wild, it got popular, and they sort of lost control with so many competing options being released.


> Redhat made Podman which is a compatible replacement

I would actually prefer if they made an incompatible replacement. Docker's CLI is pretty bad in my opinion.

I want to use Docker the same way I use a headless virtual machine running an SSH server. I want starting/exiting containers to be independent from their 'main process'. I want to attach/detach whenever I need to and execute arbitrary processes.

-- Just use /bin/bash as the main process

This seems to be the workaround, but I always have problems with containers exiting when I don't want them to, and it's just harder than it needs to be. I've spent a total of like 6 hours learning Docker and I still don't know exactly how to achieve this simple workflow without my containers quitting on me or attach/detach issues. With VirtualBox I can do this easily. Am I too stupid to use Docker?

-- Then just use VirtualBox

That's what I do, but I would like not to have the overhead of a vm.


> I want to use Docker the same way I use a headless virtual machine running an SSH server. I want starting/exiting containers to be independent from their 'main process'.

If you’re running systemd anyway, check out systemd-nspawn. Your ssh command becomes `machinectl shell user@container`. It’s a more VM-like way of managing containers, without Docker’s image distribution features or philosophy that containers should be ephemeral.
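A minimal sketch of that workflow (the tarball and machine name are just examples):

    # import a root filesystem as a machine, boot it, and get a shell inside
    machinectl import-tar rootfs.tar.xz mycontainer
    machinectl start mycontainer
    machinectl shell root@mycontainer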


They made podman as a fully compatible replacement so people could easily drop-in replace their use of docker with podman, which worked.

To handle spurious interrupts from /bin/bash you can put a small script as the entrypoint containing a while true loop with a sleep infinity in it.
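Something like:

    # keep PID 1 alive regardless of attached shells; get shells via docker exec
    # (swap "sleep infinity" for "sleep 3600" if the image's sleep doesn't support it)
    ENTRYPOINT ["/bin/sh", "-c", "while true; do sleep infinity; done"]

and then "docker exec -it <container> bash" whenever you need a shell.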


> I want to attach/detach whenever I need to and execute arbitrary processes.

Isn't "docker exec -ti container-id /arbitrary/command" enough for that?


And if you want a shell? Just use “bash” as the command.


Sounds like you might want LXD which starts and leaves running a full "machine" container. You can even SSH to it if you want or just use "lxc shell bionic" to get into it.

https://linuxcontainers.org/lxd/getting-started-cli/


There is a key sequence for detaching from the container... default is ctrl-p+q.

But if you want to not deal with attach/detach, perhaps `docker exec` is what you want. It doesn't affect the main process (unless of course your command you run kills the main process).


Have you looked at Toolbox? It's by Red Hat and works with podman under the hood.


Yes Docker seems to be struggling as a company. But I doubt Podman has anything to do with it. The adoption of Docker open-source tools is massive and 99% of its users have never heard of podman or any other clones, and likely never will. The problem is simply that those tools are free, and Docker has failed to convert the success of their free tools into a successful business.


It’s partly open source and partly freeware. They could just make docker for windows/mac a paid software. I would pay for it if they would listen more to the community when a bug is found. They seem to ignore many bugs Docker Desktop on GitHub. I like about the podman Tools that there is a community effort from Red Hat. It’s not 100% compatible with docker and probably never will. So I will just hope that Microsoft or Canonical buys Docker and make it more open to the Community.


They won the container war but lost the orchestration war. Even if docker compose was successful though, I fail to see how the clouds wouldn’t just replicate everything. So I guess they just failed to monetize the technology.


> So I guess they just failed to monetize the technology.

Yes, it's really that simple. All those "container wars" and "orchestration wars" are a distraction from the core issue, which is that all those container and orchestration tools are open-source, and it's very hard to build a viable business on top of them. Docker tried and failed, like most startups involved.


Anyone who wants to make money on developer tools with the free-beer generation can only focus on enterprise customers, while adopting the traditional sales models.

Even here, when commercial projects are shown, there is always an endless thread of free-beer open source alternatives.


> Podman which is a compatible replacement,

Kinda... It doesn't support caching layers for example which makes it very different in practice.


It’s still amazing that there is an alternative. It must be tedious to copy such a bad CLI design over to podman. LXD CLI is far superior.


After Docker Swarm failed it was clear that they could not survive just on the core Docker tech and CLI, which are all becoming less valuable day by day due to the various open container initiatives. In the absence of a killer product they are still a ripe acquisition target, but not a successful business.


Isn't everybody struggling right now?

Except for Zoom, of course.


Yes, good. After reading the comments on their image retention limits [0] talking about simply pulling the images all the time to keep them fresh, this seems like a reasonable response.

I'll repeat what I wrote there [1]:

If people really think this is a problem, they'd contribute a non-abusive solution. Writing cron jobs to pull periodically in order to artificially reset the timer is abusive.

Non-abusive solutions include:

- extending docker to introduce reproducible image builds

- extending docker push and pull to allow discovery from different sources that use different protocols like IPFS, TahoeLAFS, or filesharing hosts

I'm sure you can come up with more solutions that don't abuse the goodwill of people.

-----------------------------

Additionally, hosting a local network docker repo would mitigate this rate limit completely. Or straight up pay. It's not that difficult. Getting mad about a free, open-source service becoming pay to use... I couldn't imagine the gall and conceitedness.

0: https://news.ycombinator.com/item?id=24143588

1: https://news.ycombinator.com/item?id=24144475


> introduce reproducible image builds

This is a great idea in concept, but in practice very challenging.

RUN curl "https://www.random.org/integers/?num=1&min=1&max=99999"

Docker will cache this after the first invocation. The build is not reproducible. Now what?

Replace "curl random.org" with "nondeterministic and really expensive code build/model training/etc operation".

> extending docker push and pull to allow discovery from different sources that use different protocols like IPFS, TahoeLAFS, or filesharing hosts

This is great, if you can solve the image integrity/trust issues therein, which should be just some signing/merkle tree work.


Ugh, I don't envy their position. There are many ways to reduce the size of a docker image. I'm guilty too. Probably the best thing to do is leverage multi-stage builds. Those have the largest effect on repo size. (Like a 10x reduction often).
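For reference, the multi-stage pattern in its simplest form (a hypothetical Go service; the tags are just examples):

    # build stage: full toolchain, hundreds of MB
    FROM golang:1.15 AS build
    WORKDIR /src
    COPY . .
    RUN CGO_ENABLED=0 go build -o /app .

    # runtime stage: ships only the static binary
    FROM alpine:3.12
    COPY --from=build /app /app
    ENTRYPOINT ["/app"]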

The problem is, Docker, the company behind the repo, has no control over what Open Source Joe and Developer Suzy are committing, or over the other developers pulling down their images. They can send out all these notices and announcements, and I think the typical reach of such things is probably, what, 0.05 percent of the developers it needs to reach?

And of those, are any willing to rewrite the image to be smaller?


Well, they could use different rate limits depending on the size of the image. Say, if the image size (or the size of the added layers) is in line with what we want for Docker images, you could offer different rates: unlimited for images <10MiB, high limits for <100MiB, and low limits for everything else. That way they both push for small images and keep everyone happy.

Or people will just add a proxy/imagestream in between instead of directly pulling from docker hub.


Or you just do it the easy way and limit the speed based on the amount of data downloaded so far, with the history being 'lost' after 6 hours. This way you also prevent someone from doing dumb things to get around it, like using multiple connections (or in this case, multiple layers for a project that doesn't actually need them)


That is possible, but it means distributing the running total of downloaded data to all the nodes. Luckily it doesn't require correctness, but it is more complicated to set up regardless.


This is the future of containers:

https://guix.gnu.org/

https://nixos.org/

You can build a Docker-compatible image from a Guix or Nix package. You never have to use Docker or Docker Hub.

The limitation of Docker is that the nice semi-reproducible sandbox you get exists on top of an operating system that was not designed for it, resulting in giant blobs to get it to work. It's inefficient and a band-aid stopgap until we get to the future where the operating system is a pure function (which can be versioned in a tree, diffed, reverted, etc. just like git). If you used NixOS, you wouldn't need Docker. Sure, it's available and you can use it, but you wouldn't need to.
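For example, with nixpkgs' dockerTools you can build a Docker-compatible image from a Nix expression and just load it into the daemon (docker.nix here is a hypothetical file wrapping dockerTools.buildImage):

    nix-build docker.nix      # ./result is a Docker-loadable image tarball
    docker load < result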


It works fine though, and a lot of companies have valuable time and process built on top of it. However, I'm all for natural selection; if these are better products, may the best container win.


Honest question, but why can't docker use something like bittorrent to download images?

Most of us download our OS via torrents only, so we may as well download the images too if there was support for it.


> Most of us download our OS via torrents only

I mean, I only do it to stick it to the people who claim torrents can only be used for piracy; I think most people prefer the simplicity of direct downloads though…


Docker images already need special handling since you download the layers separately and reassemble them. Going from that to full BitTorrent should be transparent to the users.

In fact, there already exist several implementations of it for Docker![0, 1, 2]

[0]: https://coreos.com/blog/torrent-pulls

[1]: https://d7y.io/en-us/

[2]: https://github.com/uber/kraken


I can usually max out my connection speed with torrents. This is rarely the case with direct downloads.


The start-up time of a torrent is typically so long, though, that by the time it has connected to all the peers and is downloading, I have already downloaded the ISO with a direct download anyway.


Soon, when QUIC is widely supported by the swarm (just needs updating to the current version), IPFS should be better than torrents for start-up delay, and much more importantly, it allows for sharing data across "torrents" when using content-defined chunking via rabin or buzhash. This means that things like common larger binaries get shared between images that include them, which should greatly increase the average amount of seeders for the chunks that make up an image.


There's a feature of linx-server that provides uploaded files with a torrent URL as well as a regular download URL. I believe the torrent client tries fetching data from peers as well as from linx (http).

https://github.com/andreimarcu/linx-server


On symmetric gigabit fiber only a few services, mainly steam and battle.net have ever given me 90mbyte/s download speeds. A surprising amount will limit to 500mbit or less, even if you have the download pipe for it.


Guess I just have slow internet :(


At least for Ubuntu and popular distros like Mint you get super high speeds on torrents, maybe because you have a peer in the same region.

Afaik Windows also uses this to install updates, where it shares the download with others in the region (1) using p2p (though I may be wrong since I don't use Windows anymore).

(1) https://www.itproportal.com/amp/news/how-to-stop-windows-10-...


> Honest question, but why can't docker use something like bittorrent to download images?

Docker will be limiting manifest operations, not actual blob transmissions.


How would bittorrent work in companies? Only HTTP traffic is allowed and often only when going through the company proxy.


You could have a mixed mode pretty easily. Heck, it would even be pretty efficient.

For your own company, you'd host a swarm that is firewalled in. Then when someone says "I want image xyz", the first thing you do is look for seeders in the swarm for that file. If none exists, then you initiate an HTTP download from Docker to get the image.

Now you've got fast distribution with low external network traffic.

Not sure how this would play with Cloud provider pricing, though. I don't believe AWS would be too happy seeing their services turned into BT swarms :)


> How would bittorrent work in companies? Only HTTP traffic is allowed and often only when going through the company proxy.

1. Only some companies work like that.

2. I'd expect it to work like a webtorrent; try to download by p2p, but if that fails then fall back to HTTP.


- It's an alternative download solution

- There is no rate limiting for paid accounts or companies from what I see in the article.


The company proxy would have to be modified to allow torrent traffic, I guess?


The hardest part of that would be verifying image authenticity.

Google Cloud uses an adjacent feature called binary authorization. When turned on, only images that are signed by a given authority (usually your ci/cd instruments) can be run inside your Kubernetes cluster.

Binary authorization may be a good starting point for someone trying to make bittorrent distributed images a usable thing.


> The hardest part of that would be verifying image authenticity.

That's exactly what Bittorrent does with its hash tree. You'd get the root hash (extremely tiny) from Docker Hub, and the rest of the metadata, as well as the data blocks, from the swarm. The authenticity is all handled by the TLS that serves you the root infohash from Docker Hub. It's a Merkle tree: the root hash is for the metadata, which is a list of hashes of the blocks.
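For what it's worth, Docker's content addressing already works the same way: pin an image by its manifest digest and the daemon verifies every layer it downloads against the hashes listed in that manifest. A rough sketch (the image/tag is just an example, and the digest placeholder is whatever the first command prints):

# Look up the digest of an image you already have and trust; prints something like ubuntu@sha256:<digest>
docker image inspect --format '{{index .RepoDigests 0}}' ubuntu:20.04

# Re-pull by that digest; every layer is checked against the manifest's hashes
docker pull ubuntu@sha256:<digest-from-above>

The only piece a swarm would need from Docker Hub is that tiny digest; everything below it is self-verifying.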


If this reads like greek, let me dumb it down.

You get the hash of the final result from the trusted server, and the hash is checked. Because of this you will never get an invalid image.

There are also some clever tricks to make sure no one can force you to start over from scratch by sending wrong data. But that's more of a detail.


Why? The website could still host the hash for the image, which is only a few KB versus hundreds of megabytes or even gigabytes. Just have the bit-docker app check the hash before executing.


What is the issue here? Is the torrent checksum (provided by docker) not enough?


Let's say someone hacks a maintainer for the Ubuntu base image. The hacker publishes a new version of the base image with a backdoor.

When the backdoor is detected, you now need a revocation system so the distribution of the malicious image will die. You can theoretically do this on the tracker level, but people may build other trackers that may not propagate the changes.


You still would have a centralized manifest system though, right? It shouldn't hurt Docker much at all to host a few KBs of data describing the hash of each layer, which is still fetched every time, while the big downloads are done over torrents.



I see a lot of people mentioning the low cost, saying that it's no big deal. It's not the cost that I find annoying; it's dealing with credentials and secrets.


And docker hub account management and security management options have always been _horrible_. For example, I can't create a login token exclusive to one organization or repo on docker hub, meaning anywhere I use those credentials is a potential security risk to ALL my orgs and projects.


Yeah for ephemeral work or CI builds especially, that's the most annoying part (especially for orgs where you might not want to put your personal credentials in repo settings... but you also don't want or have the funding structure to pay the $5/month for unique credentials for that org).


Okay, now can you finally accept patches about the default registry?

https://github.com/moby/moby/issues/1988

https://github.com/moby/moby/issues/4324

Or we're still stonewalling?


I set up a local docker cache using a docker image. The transition to HTTPS everywhere makes caching this sort of thing difficult. One has to install, and often manually configure, trusted certificates on every client to maximize the cache.

APK and APT caches often come hand-in-hand for this sort of thing as the logical next step is to add some OS packages to the pulled image. This also benefits from local caching and also means frustrating cache setup, certificate setup, etc.

To maximize local caching there's a lot of manual work to setup a house-of-cards series of proxies that only work on the network. Setting it up on a laptop then traveling means everything breaks in not-so-obvious ways when you leave the network.
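For the APT piece, the per-client setup is usually just a single proxy line - a sketch assuming an apt-cacher-ng instance at apt-cache.internal (hostname made up) on its default port:

# /etc/apt/apt.conf.d/01proxy
Acquire::http::Proxy "http://apt-cache.internal:3142";

Which is exactly the kind of thing that silently breaks once the laptop leaves the network.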


Does anyone have any good ideas on how the Docker Hub could be monetized in a way that's user friendly and makes sense?

AWS, GCP, Azure, DigitalOcean and even GitHub / GitLab all have private container registry offerings.

If your stack is on X provider, chances are you're going to use their private registry service instead of using the Docker Hub because you've gone all-in with that provider. That means private repos alone isn't enough to get folks to pay for Docker Hub.


I think there was a lot of money to be made in the service/support aspect of containerizing applications - configuring appropriate dependencies, etc. Now probably not so much - most people have painstakingly learnt that knowledge by now. Maybe even some kind of container marketplace.

Otherwise, I don't think it's practical to have docker itself as a commercial product.


- Keep the 100 pulls for Docker anon users, but make it monthly.

- Every user gets 600 pulls per month for free.

- Pre/post pay per pull or buy a rate plan. Something like $100USD === 10,000 pulls on "pay-as-you-go" and prepay could reduce the cost per pull.


Take over supply-chain checks?

E.g. make base images for programming languages, systems, etc and guarantee security.

I think many small companies would like it a lot better if they could externalize the cost of running docker images with all their dependencies.


This what Cloud Native Buildpacks do. There are already three major suppliers of these: Heroku, Paketo (by Cloud Foundry folks), and Google Cloud.

I'd be pretty thrilled if Docker encouraged buildpack use. It would be a huge win for them.


Currently the state of GitHub Packages' Docker UX is terrible, and ironically it doesn't integrate well with GitHub Actions (or at least it didn't when I tried it two months ago).


And the pricing is still horrible. $0.25/GB for storage and $0.50/GB for data transfer is pretty rough, since you end up pushing even Docker Hub image layers to GHP.


This is good, just like pruning old images. What's bad, and what Docker isn't saying, is it was a mistake to ever allow unlimited free plans. Independent of scale. Setting up an expectation of unlimited free hosting and bandwidth at any point of a business is bad. Tuning knobs of paid hosting and services at all tier levels, with a limited free tier, should have been baked in from the beginning, and would have led to a much stronger business.


I don’t think it was a mistake per se, it was part of their growth strategy. Their biggest problem is that they never really managed to capitalize on the market (and kubernetes happened), so now their plan B is to effectively reduce costs of docker hub or get some money out of the people using it.

Completely understandable imho.


Want to see if you're affected?

kubectl get pods --all-namespaces -o json | jq -r '.items[].spec | (.containers + (.initContainers // []))[].image' | sort | uniq | less

... and look for anything that's not from a private registry that you control.

Give people two months to migrate? What a nightmare.


Giving people over a certain limit two months to migrate, set up a cache, or toss Docker a few dollars. What a non-nightmare.


well a shitload of stuff from k8s mostly lies in quay.io or k8s.gcr.io


I guess CI services like Github Actions could be easily hitting these limits (100 pulls per IP per 6 hours).


I am anecdotally aware of large Concourse installations that have caused individual companies to be blocked entirely from Dockerhub (and which caused "face melting" of Github Enterprise instances).

Badly-behaved CI is a serious issue at scale.


I'm thinking the same thing. GitLab has shared CI runners that probably do a lot of image pulling. Hopefully they have plans to implement their own docker registry cache.


GitLab PM here - we have a feature called the Dependency Proxy (https://docs.gitlab.com/ee/user/packages/dependency_proxy/) that allows you to cache images from DockerHub. Currently this only works for public projects, but we are working on adding support for private projects now: https://gitlab.com/gitlab-org/gitlab/-/issues/11582.


I would imagine that GitHub would work out something with Docker to either pay themselves or have some way of caching images themselves.

They could just inject their own TLS certificates into their VMs and then intercept Docker requests for images with their own cache.


This seems incredibly likely to break development use cases at both extremes: CI/CD systems and developers just starting out could end up pulling quite a few images per hour. Imagine if NPM, Ruby Gems, and so on rate limited package downloads until you paid!

I'm not sure if there's a better way to monetize the Docker Hub, but this seems so hostile to adoption.


> Imagine if NPM, Ruby Gems, and so on rate limited package downloads until you paid!

Sounds like an entirely reasonable thing to start imagining.

Reliability, safety, determinism and predictability are not thrust upon someone from the commons.

I frankly find it somewhat atrocious and abusive that downstream systems do not adequately cache these assets. The main archive repositories should be the source of truth, but they also don't need to be the fountain.


Do you think NPM, Ruby Gems, or others would last long if they were so user hostile? Would Node have grown into the genuinely useful development environment it is if free users were limited to 10000 package downloads an hour?

It's so unbelievably user hostile that it seems like the result will be people just stop using Docker. The right solution for Docker is probably to spin off or monetize the Hub in a different way.

Imagine if GitHub started charging users for git cloning too many packfiles per hour.


I think npm and ruby gems are quite a lot smaller than the average and upper bound size of docker images, which can easily be multiple GiB. Also the use case is different, where a docker build may incur a few manifest retrievals from the registry, while an npm build may incur hundreds. I don't know how quickly the average user would run into a 10000 download/hour limit for NPM, but if they were able to arrive at an equivalently high limit like 200/hr for docker, then maybe it would be fine to start charging for rate limits above that, and not impact most people.


All these things cost money to run why would it be surprising if they cost money to utilize their resources beyond a point?


You can either monetize the thing that brings you business directly, adversarially impacting users.

Or you can monetize something else that's correlated to those costs to subsidize the main use case and keep new user acquisition frictionless. Like NPM charging large businesses with special needs with NPM Enterprise, or GitHub with teams and CI/CD features and GitHub Enterprise, and so on.

Docker sold off Docker Enterprise. Now they're trying to extract rents from people who are just trying to "docker build", a command whose ease of use is what drove people to use Docker in the first place.

Foot, meet gun.

Docker is breaking their main selling point. "docker build" and "docker run" should _just work_. By breaking that expectation in subtle ways they risk alienating users. By advertising that they're willing to break their main use case, they're going to alienate businesses and early adopter developers like myself.

Now I'm looking for alternative registries and making sure that devops code I manage doesn't depend on Docker. That's surely not what they wanted, right?


Forgive my unfamiliarity. From the article

>We’ve been getting questions from customers and the community regarding container image layers. We are not counting image layers as part of the pull rate limits. Because we are limiting on manifest requests,

> For example, roughly 30% of all downloads on Hub come from only 1% of our anonymous users.

The limits appear to be 100 pulls per 6 hour time frame per IP address for anon users and twice as much for authenticated users. The least favorable reading of this is to assume a rolling period, so let's roll with that. Pun intended.

According to Docker, which I imagine is in a better position to evaluate the situation, this will impact almost no users. Logically, properly caching downloads would also improve local performance. Do you really need, given the possibility of caching downloads, to pull a new image every 1.8 minutes in environments where paying $5 a month for individuals or $25 a month for a team would be prohibitive?

I'm going to assume that orgs manage a variety of one off and recurring expenses. I don't see how this is any different.

What I think is the most salient point is that docker is not a new endeavor. They already have many users. Acquiring new people who consume resources and pay nothing isn't a valuable proposition for them. Why would it be? Do you wish you had more roommates living with you eating your food and paying nothing towards the rent?


> Do you really need, given the possibility to cache downloads, to pull a new image every 1.8 minutes in environments where paying $5 a month for individuals or $25 a month for a team would be prohibitive?

Given that even a "docker build" does a manifest pull, it's not just "new images", but existing ones as well.

Now expand that to a team that, say, builds 10 images in parallel using Docker Compose. Now a build every ~18 minutes will hit the limit. Larger builds, like say a CI system building every time there's a push to a branch? Yikes.

I've expressed my thoughts in detail here about why I think they should find another avenue to monetize Docker: https://twitter.com/AaronFriel/status/1297988737981247488


$5 per month for when you pull so many images it's basically abuse is user hostile?

No developer starting out is going to hit 200 images in 6 hours, and even if they did, they would go "heh" and then either take a break or pony up the 5$.

5$! It's way too little money for those limits, there should be brackets all the way up to 5000$ per month. Same for Rubygems and NPM. It's ridiculous that those are struggling organizations that can barely afford to have professionals work on them, when they're absolutely essential to whole industries.

I wish they would force us to pay them.


Presumably companies that pay 5-6 figure salaries to devs can probably pay those devs to set up caching or pay docker for the privilege of not bothering. The team pro plan with no limits starts at $25 a month for 5 users.


Yeah, someone should come up with an alternative that mirrors it or something.


Built in torrent and/or ipfs?


The latter has content-dependent chunking support. That would allow for cross-image sharing of common data (large binaries, etc.).


MS should just buy them and put them out of their misery.


We have seen that a thousand times with Google products and co.

First, you have unlimited, then you have reasonably large limits. The rationale is "limits are needed to reduce abuses and should not impact normal users".

Then, it starts to be mandatory to be authenticated. Again, officially for reducing abuses.

Once everyone has an account, and are used to limitations, free limits are reduced again, little by little, and finally to the point where you need to take the "pro" offer to have a normal usage.


I'm very conflicted about this. On the one hand, I recognize that there are potentially significant costs to be borne to serve these repositories. On the other hand, making docker part of your infrastructure requires a certain degree of availability.

At some level this seems to me like using my IDE and after 6 hours it would stop working or finding that my CDNJS references to bootstrap stopped working after 6 hours of my site being up. I think it is exactly as if NPM or PIP were to stop working for the day if you included "too many" packages.

I don't really have a good feeling for how these new limits might impact setting up new dev/test environments so I'll likely switch from using specific images to using generic images to limit my exposure.

For example, at the moment, if I want a new dev container for a project in python, I'll FROM python:latest and supplement with pulls from support service containers like nginx:latest and postgres:latest.

Moving forward, it would seem a safer approach will be to pull a single 18.04 image and run the required installs into it. This super bums me out though, as it seems to bypass some of the nicer aspects of getting up and running with docker.
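Roughly what that looks like, as a sketch (the package list is illustrative):

FROM ubuntu:18.04

# Same base for the nginx and postgres containers too, so the 18.04
# layers are pulled from the Hub exactly once
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*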


Well, to play devil's advocate: if docker is part of your infrastructure and you need to make more than 200 pulls in 6 hours, you should already host your own registry and maybe mirror a few Docker Hub repos to it


It really sounds like a problem we just have to solve. Putting all the expectation on one central service is not reliable; it should be a distributed network of content-delivery nodes, of which you run a few yourself.


I like this idea. I would think that the “default” hub should not be managed only by Docker, but eg by docker and other registries. I suspect that docker inc still wants full control over their registry though so I don’t see this happening.

We might decide on another, non dockerhub open registry though, and use that instead.


Do pulls for a "latest" tag bypass the cache?


No, latest is just a tag and is not handled differently. There is a CLI flag to ignore cached layers. This however does not affect the FROM line, so if you have the image, there will be no pull.

Expect this to affect CI systems
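For reference, the two stock flags in play here (image name and build context are placeholders): --no-cache throws away cached layers but keeps using the locally stored base image, while --pull is the one that forces a fresh manifest check and therefore counts against the limit.

# rebuild from scratch, but reuse the local FROM image (no registry hit)
docker build --no-cache -t myapp .

# also ask the registry for a newer FROM image (one manifest request)
docker build --pull -t myapp .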


No they don't, but if you do something silly like spin up a bare VM build node, and then pull your environment every time, then obviously you won't get any caching.


Github Actions? Gitlab CI? Many of those don't do any caching because of the chance of poisoning a tag in the local registry, and don't have a good way to do caching per project.

I can imagine that this affects those sorts of operations.


Yeah, that's what I was getting at - in CI I can see it, but in local dev it would seem surprising to need more than the free-with-authentication quota.

That said, if your CI needs more, it's probably time to invest in a paid account if you can, and certainly if you're commercial.


You can use the GitLab Dependency Proxy for caching images from DockerHub, it's pretty straightforward to pull cached images:

https://docs.gitlab.com/ee/user/packages/dependency_proxy/
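If I'm reading the docs right, pulling through it looks something like this (instance host and group name are placeholders):

docker pull gitlab.example.com/mygroup/dependency_proxy/containers/alpine:latest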


Using `docker-machine`, you can set an environment variable so that new bare VMs pull from a local Docker mirror:

`export ENGINE_REGISTRY_MIRROR=https://mirror.mysite.com`
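For a plain Docker Engine (no docker-machine), the equivalent is the registry-mirrors key in /etc/docker/daemon.json (restart dockerd after editing; the mirror URL is the same placeholder):

{
  "registry-mirrors": ["https://mirror.mysite.com"]
}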


Why don't they move to a P2P model? Something like BitTorrent DHT or IPFS? DockerHub itself can just be used if the image is scarce and to penalize free riders.


IPFS is already implemented. https://blog.bonner.is/docker-registry-for-ipfs/

But it's on you to use it. (And it doesn't solve the metadata queries the way I understand it)


Also worth mentioning they added a 6 month retention policy for unpaid accounts. https://www.docker.com/pricing/resource-consumption-updates If you don't want your images disappearing, you should probably move off of Docker Hub. (That said, I don't think there's any reason to continue using Docker Hub, even with a paid service. I suspect such docker services will be discontinued gradually.)

It's not hard to run a registry yourself. https://docs.docker.com/registry/deploying/ Registry supports all sorts of storage backends (files, S3, GCS ...). It's just a stateless HTTP server. It's not hard to configure for production (it comes with decent defaults). If you deploy this to something like Google Cloud Run, you can get ~free hosting (sans the storage costs) for the "registry" itself plus TLS and autoscaling too (I should probably write a tutorial on this).
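A minimal sketch of the pull-through-cache (mirror) mode with the stock registry:2 image - hostname and storage path are placeholders:

# Run the registry as a mirror of Docker Hub, storing blobs on local disk;
# REGISTRY_* environment variables override keys in the registry's YAML config
docker run -d --restart=always --name hub-mirror -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  -v /srv/registry:/var/lib/registry \
  registry:2

Then point each daemon at it with "registry-mirrors": ["http://hub-mirror.internal:5000"] in daemon.json.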


The real reason this is a problem is that Docker Hub has never focused on implementing standard and secure features for account management. Even with the recently implemented access tokens, there's no way for me to limit the scope of those tokens to a specific organization or image, so they're no different from a password.

That means if I build something from one organization, it's going to impact the ability to build something for another organization.

This seems to me like a thinly disguised last-ditch effort to stay afloat.

Now that GitHub, GitLab, and other great image repository options exist that do have these security/integration features, anyone impacted can easily switch providers at no cost.

This new rate limiting won't help anything for docker, it's just going to kick off the exodus away from docker. Ironically, it's this account/security/integration stuff that they lacked focus on that lost them so much financial opportunity to begin with.


Can’t say I’m surprised. They must be incurring significant data costs. The limits seem reasonable at least.


The sysadmin in me dies a little bit inside every time I realize there are shops that actually pull images for production from the global registry.


Hrm, based on this: https://docs.docker.com/docker-hub/orgs/#add-a-member-to-a-t...

I see that Docker doesn't actually offer an AWS-style enterprise account that one can use to hand authorization to developers without requiring those developers to make individual accounts.

It feels pretty sassy of docker to give everyone 2 months to shove credentials everywhere when docker themselves haven't done the minimum to make enterprise accounts realistic. Instead, they're adopting the github model of "oh, just ask everyone to make personal accounts and then include their personal accounts in the org team". That has problems.

Firstly, it puts employers in the unpleasant position of attempting to compel employees to make legal agreements with third parties (docker, in this case). The correct way to do this is AWS-style, where the org itself makes /one/ agreement and then delegates that agreement via access keys. This is the minimum I expect from enterprise account systems, hard fail for docker.

Secondly, it's a clusterfuck to manage. You end up with an org filled with random-arse account names that you can't really audit, and you don't know who has access to what. If employees leave the org, it's hard to ensure that their access is revoked because the access takes place entirely outside the standard account domains.

Github has recently improved this a shade by adding ADFS authorization to org accounts, but that involves asking employees to tie their personal (and all github and docker accounts /are/ personal) account to their work ADFS account, which is a shitty half-solution.

All things considered, docker made this problem for themselves. They've spent /years/ working hard to get everyone to make docker accounts and push everything to docker hub instead of fostering an ecosystem of registries by different orgs for different purposes. All of a sudden it's now "too expensive" and they're dropping the hammer on everyone to sign up and push credentials everywhere with very little warning, whilst not doing their half of the work by making a proper delegated authority account system.

Doesn't fill me with confidence for their future as a stable platform on which to base a business.


People are also forgetting how newer docker infrastructures use multi-arch and multi-stage builds, meaning a single build could count for dozens of manifest hits, especially with distributed caching.

All the things docker has been working for to enhance the build tooling will now be more difficult to use, even if the user never stores a single image of their own on docker hub.


Numbers seem reasonable. Does anyone have a quick guide for me to use ECR as a transparent cache or something?


I’m hoping GCP/AWS steps up and creates a limitless docker registry. Considering the size of their infrastructure operations, I suspect this would be a small cost but bring a lot of goodwill.

One thing this is bound to do is to make the process of using docker a bit more complex. Explicit registries will probably start to be used everywhere, which is something I welcome. But it seems like a really poor decision by docker, the company to do this: they’re going to drive people off using docker hub.


They have registries already which work with docker after an auth setup. They are generally private to you.

Yes, docker is a struggling company which sold some lines of business and is now trying to reinvent itself again towards developers. Given other recent moves, I'm not sure the new leadership understands how to do this.


They have registries but at least AWS' one is not free at all...


They all cost money, because it's expensive to ship bytes out of the cloud. Docker is likely going the way of the dodo because they funded our free access with VC money.

They also probably made a bad assumption in that they could define the only container format before standards bodies got involved. Blitzscaling was the mantra of the time, and it seems to have come back to bite those who took a bite of that cake.


yeah docker is both the best and the worst build tool

best because it's hermetic

worst because it's wasteful and isn't great at any kind of graph-based step caching (even with buildkit graphs, that's not going to integrate well with your package management and build tool)


They want you to log in, so they have better data for monetization


I think they want you to log in because it's hard to monetize anything while hemorrhaging cash to egress fees.



