Go 1.19 added "go mod download -reuse", which lets the command be told about a previous download result, including the Git commit refs involved and their hashes. If the relevant parts of the server's advertised ref list are unchanged since the previous download, then the refresh fetches nothing more than the ref list, which is very cheap.
The proxy.golang.org service has not yet been updated to use -reuse, but it is on our list of planned work for this year.
On the one hand Sourcehut claims this is a big problem for them, but on the other hand Sourcehut has also told us they don't want us to put in a special case to disable background refreshes (see the comment thread elsewhere on this page).
The offer to disable background refreshes until a more complete fix can be deployed still stands, both to Sourcehut and to anyone else who is bothered by the current load. Feel free to post an issue at https://go.dev/issue/new or email me at email@example.com if you would like to opt your server out of background refreshes.
I think Drew is right that he shouldn't accept a personalized Sourcehut-only exception, because it doesn't address the core issue for any new small providers that pop up.
Between this and the response in the original thread that said, "For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt," it gives the impression that the Go team doesn't care. Sometimes what we _need_ to do to be good netizens is a fair bit of boring technical work but it's essential.
Exactly. We already saw how this ended with Google vs. people running mail servers.
I cannot file an issue; as the article explains I was banned from the Go community without explanation or recourse; and the workaround is not satisfying for reasons I outlined in other HN comments and on GitHub. However, I would appreciate receiving a follow-up via email from someone knowledgeable on the matter, and so long as there is an open line of communication I can be much more patient. These things are easily solved when they're treated with mutual respect and collaboration between engineering teams, which has not been my experience so far. That said, I am looking forward to finally putting this issue behind us.
Why was the author of the post banned without notice from the Go issue tracker, removing what is apparently the only way to get on this list aside from emailing you directly?
Do you, personally, find any of this remotely acceptable?
...but it would be nice as a place to hold a rate-limit recommendation, since it appears that the Git protocol doesn't really have an equivalent of a Cache-Control header.
A crawler has a list of resources it periodically checks to see if they have changed, and if they have, it indexes them for user requests.
Contrast that with this totally-not-a-crawler, with its own database of existing resources, which periodically checks if anything changed, and if it did, caches content and builds checksums.
But yes, it may be the best available solution in this case, even if I would argue that this isn't really its main purpose.
What would be good is respecting `Cache-Control`, which unfortunately many RSS clients don't do; they just pick a schedule and poll on it.
E.g. https://www.robotstxt.org/faq/kinds.html under "What's New" monitoring.
Maybe it should be an opt-in list where the big providers (such as github) can be hit by an army of bots and everyone else is safe by default.
It smells like Go is on its way out.
The question is who will swallow their pride first: Sourcehut or Google.
Hi ddevault, FWIW, in May 2022 on that #44577 issue you had opened, it looks like someone on the core Go team commented there recommending that you email the golang-dev mailing list or email them directly.
Separately, it looks like in July 2022, in one of the issues tracking the new friendlier -reuse flag, there was a mention of the #44577 issue you had opened. In the normal course, that would have triggered an automatic update on your #44577 issue... but I suspect that because the #44577 issue had been locked by one of the community gardeners as "too heated", that automatic update didn't happen. (Edit: It looks like it was locked due to a series of rapid comments from people unrelated to Sourcehut, including about “scummy behavior”.)
Of course, communication on large / sprawling open source projects is never quite perfect, but that's a little extra color...
And given that they banned him for no reason, he is perfectly in the right to tell them that they should email him instead.
We have not been reading the same tickets and articles, it seems.
So yes. The issue they banned him from. Because reality's more complicated than flippant one liners.
The background refresh is meant to prefetch for that situation, to avoid putting that time on an actual user request. It's not perfect but it's far less disruptive than having to set GOPRIVATE.
...but it sounds like disabling background refreshes would have strictly better end-user performance than what the Sourcehut team had been planning as described in their blog post today (GOPRIVATE and whatnot)?
(At least from an environmental perspective.)
Easy to verify; get a report:
go install paepcke.de/fsdd/cmd/fsdd@latest && cd $GOMODCACHE && fsdd .
Warning: Apple users with fixed, restricted, and expensive NVMe space may be very upset. Easily fixable via `fsdd . --hard-link`.
For some reason he did not do this and instead chose an option that causes breakage.
Also, how do you opt out? Imagine a random developer at a startup, running a GitLab instance, pushing a Go module there, and then being left with an inexplicable traffic pattern (and bill). I have no skin in the game, but this default _does not_ sound reasonable to me, whichever way you slice it.
I am very concerned that an "own assessment" of what constitutes a DoS means that source code is expected to be hosted only on large platforms or by large corporations, which is another way of saying "the little guys don't matter".
Self hosting of source code should be an option and the proxy should be there to reduce the traffic load, not amplify or artificially increase that load despite the "level of DoS".
One thing Drew is asking for is to respect robots.txt to allow the operator to determine what a reasonable level is for that operator and not apply a github bias to it.
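For illustration, honoring robots.txt for a crawler-like fetcher doesn't require much machinery. A deliberately minimal Go sketch of a prefix-rule check (the `GoModuleMirror` agent name is made up for this example, and the parser skips `Allow`, wildcards, and `Crawl-delay` from the full spec):

```go
package main

import (
	"fmt"
	"strings"
)

// allowed reports whether agent may fetch path according to a
// robots.txt body. Minimal sketch: it only handles
// User-agent/Disallow prefix rules.
func allowed(robotsTxt, agent, path string) bool {
	applies := false
	for _, line := range strings.Split(robotsTxt, "\n") {
		// Strip comments and surrounding whitespace.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue
		}
		key, val = strings.TrimSpace(key), strings.TrimSpace(val)
		switch strings.ToLower(key) {
		case "user-agent":
			applies = val == "*" || strings.EqualFold(val, agent)
		case "disallow":
			if applies && val != "" && strings.HasPrefix(path, val) {
				return false
			}
		}
	}
	return true
}

func main() {
	robots := "User-agent: GoModuleMirror\nDisallow: /\n"
	fmt.Println(allowed(robots, "GoModuleMirror", "/~user/repo")) // false
	fmt.Println(allowed(robots, "SomeOtherBot", "/~user/repo"))   // true
}
```

This would let each operator, not the crawler, decide what a reasonable level of traffic is.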
We must absolutely reduce our resource usage.
That's per Go repository. That's a non-trivial amount of egress data and probably adds up to thousands of dollars a month.
In this context, that seems like a lot. Of course the module mirror can't know about this context, but there are certainly a lot of scenarios where this is comparatively a lot of bandwidth. Not everyone is running beefy servers.
Seems like an exceedingly poor and unreasonable default, and it doesn't take much imagination to see how this could be improved fairly easily (e.g. scale to number of actual go gets would already be an improvement).
That’s (upper bound) 4 GiB times 2500 per hour. That’s not nothing.
"All this would go away if other people would just do what they're told" is a pretty dystopian policy, but it seems to be a popular choice in Google.
Anyway, things get a bit Kafkaesque when you realise that there is another company doing this WiFi thing, and to opt out from that one you need a different SSID suffix. Since you can't have both, you end up with at least one company data mining you.
Why the GDPR has not put a stop to this is baffling.
I could be wrong, but in the comments on https://github.com/golang/go/issues/44577 there are at least 4 hosters who were forced to manually disable background refresh because of this exact issue.
Holy cow Google! Wouldn't it behoove us to check if any changes occurred before downloading an entire repo?
A clone with these flags can fetch essentially nothing but the refs and hashes:
git clone --depth=1 --filter=tree:0 --no-checkout https://xxxx/repo.git
Of course, I have seen alleged examples of Go modules using tags like branches and force pushing them regularly, but that kind of horror sends shivers down my back, at least, and I don't understand why you'd build an ecosystem supporting that sort of nonsense and which needs to be this paranoid and do full repository clones just for caching tag contents. If anything: lock it down more by requiring tag signatures and throwing errors if a signed tag ever changes. So much of what I read about the Go module ecosystem sounds to me like they want supply chain failures.
I don't understand the Go ecosystem.
Yeah, on third-party code hosting platforms :). And maaaaaybe in some short-lived cache somewhere. I mean, why spend on storage and complicate your life with state management, when you can keep re-requesting the same thing from the third-party source?
Joking, of course, but only a bit. There is some indication Google's proxy actually stores the clones. It just seems to mindlessly, unconditionally refresh them. Kind of like companies whose CI redownloads half of NPM on every build, and who cry loudly when GitHub goes down for a few hours - except at Google scale.
But the whole thing is frankly a little rude.
Could you tell me more about what tool you used to host the repos? And I assume you noticed the traffic in your web logs?
That's some really entitled thinking on the part of the Go team at Google, and it's sad to see people stanning for them.
The entire point of the sumdb (go.sum) is to prevent the need for such a relationship. If Google (or any proxy you use) tries to return questionable packages, it will be detected by that system.
That is exactly the detection of a poisoned module in the ecosystem. It would break builds, issues would get filed, and a new version would be released (and the malicious party may not be so lucky this time since it’s trust on anyone’s first use).
But I guess it's also fairly easy to test: just serve a slightly different version to Google's Go mirror (by the user agent), and see how long until somebody complains to you about it.
I think every company I know of with private Go modules (6-8 or so?) is running a module proxy, which will detect this. The several times we've detected this it's always been within 2-3 days of the upstream mistake. When I go to report a bug we're not always the first either.
No, the error message you get is neutral about which side might be wrong - it says "verifying module: checksum mismatch" and "This download does NOT match the one reported by the checksum server." (I've seen it a lot because it also appears when module authors rebase, which a small but surprisingly high number do...)
I'd be interested to understand why that solution hasn't been implemented yet.
"it would be a fair bit of extra work for us to read robots.txt", so clearly tracking commit hashes would be even more work.
> I was banned from the Go issue tracker without explanation, and was unable to continue discussing the problem with Google.
is completely asinine. But it's also par for the course when it comes to interacting with Google. When is anyone going to hold them to account for their terrible customer service and community interaction?
> I was banned from the Go issue tracker for mysterious reasons [In violation of Go’s own Code of Conduct, by the way, which requires that participants be notified of moderator actions against them and given the opportunity to appeal. I happen to be well versed in Go’s CoC given that I was banned once before without notice — a ban which was later overturned on the grounds that the moderator was wrong in the first place. Great community, guys.]
When a story has two sides and one party chooses to keep silent when accused, I tend to favor the accuser.
Regardless, I don't really want to re-litigate it here. The main issue is that Google has been DoS'ing SourceHut's servers for two years, and I think we can all agree that there is no conduct violation for which DoS'ing your servers is a valid recourse.
edit: Possibly not as much nothing, see the replies to . But the GH search kinda sucks for this.
Can someone explain what this means? After all reading a small text file is plainly easy to do so its meaning must be something other than obvious.
Judging just from the linked post, the issue on which this was discussed, and this thread, it's feeling a lot like this proxy was some kind of proof-of-concept that escaped its cage and got elevated to production.
The other problem is that it's Google so their perception of "not much traffic" is "biblical floods" to other people.
This is a consequence of what IMO is another bad decision by the Go team: having packages be obtained from (and identified by) random github repos, instead of one or more central repositories like Maven Central for Java or crates.io for Rust. The proxy ends up being nothing more than an ad-hoc, informally-specified, bug-ridden re-implementation of half of a central repository, to paraphrase Greenspun's tenth rule.
You can run your own proxy service if you want. There's a large benefit to the go community as a whole for there to be a shared default proxy service.
E.g.: source-based packages on distributions where users may not be Go programmers, or even non-programmers, will compile and install Go software where some nested library dependency is on sr.ht. These packages will now fail and, sadly, this is going to cause widespread disruption. I think it'd be worse if those failures only happened occasionally, and not reliably repeatably.
I am wondering, because even if Google's traffic is unreasonable, it might still be less than without the proxy.
CI configurations are notoriously inefficient with dependency fetching, so I would not be surprised if the actual client traffic is massive and might overwhelm sourcehut if all migrate to direct fetches.
Would it be simple to solve this with an additional layer of the same proxy? Currently, end users request a package from proxy.golang.org as per the default value of the GOPROXY env var. Google runs many of these to handle the traffic, let’s say 1000. They all maintain a mirror of all packages that have been recently requested (note that I expect there’s more nuance here around shared lists of required packages, etc., but the structure should hold true)
The result is that every one of the 1000 proxy instances requests the source data from git.sr.ht every day.
Google could set GOPROXY on the existing instances to internalproxy.golang.org. They could then run 100, or maybe 10, of these internal instances. This would drop the traffic hitting git.sr.ht by one or two orders of magnitude.
I suspect it would require minimal if any change to application code. This might be accomplished entirely within the remit of a sysadmin (SRE?).
Any holes in my reasoning?
Isn't that exactly what "git fetch" is doing already?
> For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt […]
This is coming from one of the biggest, richest, most well-staffed companies on the planet. It’s too much work for them to read a robots.txt file like the rest of the world (and plenty of one-man teams) do before hammering a server with terabytes of requests.
If this is too much for them then no wonder they won’t implement smarter logic like differential data downloads or traffic synchronization among peer nodes.
Furthermore, we try to look past the tip of our own nose when it comes to these kinds of problems. We often reject solutions which are offered to SourceHut and SourceHut alone. This isn't the first time this principle has run into problems with the Go team; to this day pkg.go.dev does not work properly with SourceHut instances hosted elsewhere than git.sr.ht, or even GitLab instances like salsa.debian.org, because they hard-code the list of domains rather than looking for better solutions -- even though they were advised of several.
The proxy has caused problems for many service providers, and agreeing to have SourceHut removed from the refresh would not solve the problem for anyone else, and thus would not solve the problem. Some of these providers have been able to get in touch with the Go team and received this offer, but the process is not easily discovered and is poorly defined, and, again, comes with these implied service considerations. In the spirit of the Debian free software guidelines, we don't accept these kinds of solutions:
> The rights attached to the program must not depend on the program's being part of a Debian system. If the program is extracted from Debian and used or distributed without Debian but otherwise within the terms of the program's license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the Debian system.
Yes, being excluded from the refresh would reduce the traffic to our servers, likely with less impact for users. But it is clearly the wrong solution and we don't like wrong solutions. You would not be wrong to characterize this as somewhat ideologically motivated, but I think we've been very reasonable and the Go team has not -- at this point our move is, in my view, justified.
So, obviously he's supposed to know the even more obscure and annoying method of opting out.
I'm ok with saying "google should do better!" But the compromise solution from the Go team seems reasonable to solve the immediate issue in a way that doesn't harm end users. The author should at least address why they have chosen the more extreme solution.
EDIT: Moreover, sr.ht doing a workaround only for sr.ht, and lubar doing a workaround for lubar, etc... is not what Free Software is about. The point is that we're supposed to act as a community, for the betterment of the collective. Individualism is not a solution.
Wouldn't the effect on sourcehut users be identical?
I've not seen any explanation of why the solution offered by the Go team was unacceptable. It's weird that that is completely left out of the blog post here.
You know, like be good neighbors and respectful of other people's resources, maybe read robots.txt and not make excuses for why you are writing shitty stateless workers that spam the rest of the community.
Edit: Uh, okay, if it's not user traffic then why wasn't the "don't background refresh" offer an option?
Doing shallow clones, which are significantly cheaper.
Google is DDoSing them, by their service design. Why a full git clone, and why not a shallow one? Why do they need to do a full git clone of a repository up to a hundred times an hour? It doesn't need that frequency of refresh.
The likely answer is that the shared state to handle this isn't a trivial addition; it's a lot simpler to just build nodes that only maintain their own state. Instead of doing it on one node and sharing that state across the service, just have every node or small cluster of nodes do its own thing. You don't need to build shared state to run the service, so why bother? That's just needless complexity after all, and all you're costing is bandwidth, right?
That's barely okay laziness when you're interacting with your own stuff and have your own responsibility for scaling and consequences. Google notoriously doesn't let engineers know the cost of what they run, because engineers will over-optimise on the wrong things, but that also teaches them not to pay attention to things like the costs they inflict on other people.
It's unacceptable to act in this kind of fashion when you're accessing third parties. You have a responsibility as a consumer to consume in a sensible and considered fashion. Avoiding this means you're just not costing yourself money through your laziness, you're costing other people who don't have stupid deep pockets like Google.
This is just another way in which operating at big-tech-money scales blinds you to basic good practice (I say this as someone who has spent over a decade now working for big tech companies...)
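A toy Go sketch of the shared state being described above: if the nodes consulted a common record of the last-seen ref-list hash per repository, only a genuine change would trigger a fetch, rather than every node refetching independently. (The structure and names here are illustrative, not anything from the actual proxy.)

```go
package main

import (
	"fmt"
	"sync"
)

// refCache records the last advertised ref hash per repository, shared
// across fetcher nodes, so repeated sightings of the same state are
// deduplicated instead of each node cloning on its own.
type refCache struct {
	mu   sync.Mutex
	seen map[string]string // repo URL -> last advertised HEAD hash
}

// needsFetch reports whether the advertised hash differs from what the
// cluster has already seen, recording it if so.
func (c *refCache) needsFetch(repo, advertisedHash string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[repo] == advertisedHash {
		return false // some node already fetched this state
	}
	c.seen[repo] = advertisedHash
	return true
}

func main() {
	c := &refCache{seen: map[string]string{}}
	fmt.Println(c.needsFetch("https://git.sr.ht/~user/repo", "abc123")) // true: first sighting
	fmt.Println(c.needsFetch("https://git.sr.ht/~user/repo", "abc123")) // false: deduplicated
	fmt.Println(c.needsFetch("https://git.sr.ht/~user/repo", "def456")) // true: refs changed
}
```

In a real deployment this map would live in a shared store rather than process memory, which is exactly the complexity the comment suggests Google chose not to build.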
Huh? I left a few months ago but there was a widely used and well known page for converting between various costs (compute, memory, engineer time, etc).
That's a DDoS.
I could build a system that did this in a week without any support from Google using existing open source tech. It's mind boggling that Google isn't honoring robots.txt, is requesting full clones, and isn't maintaining a local cache.
If you disregard the question who pays for a moment and only look at what makes sense for the "bits", the stateless architecture seems not so bad. Just a pity that in reality somebody else has to foot the bill.
I was not talking about how the nodes store data, but about a central cache. Purely architecture wise, it doesn't make sense to introduce a central storage that just mirrors Sourcehut (and all other Git repositories). Sourcehut is already that central storage. You would just create a detour.
It's also not an easy problem. If the cache nodes try to synchronize writes to the central cache, you are effectively linearizing the problem. Then you might as well just have the one central cache access Sourcehut etc. directly. But then of course you lose total throughput.
I guess the technically "correct" solution would be to put the cache right in front of the Sourcehut server.
If Google is going to blindly hammer something because they must have their Google scale shared nothing architecture pointed at some unfortunate website, then they should deploy a gitea instance that mirrors sr.ht to Google Cloud Storage, and hammer that.
It's unethical to foist the egress costs onto sr.ht when the solution is so so simple.
Some intern could get this going on GCP in their 20% time and then some manager could hook the billing up to the internal account.
It's about the clearest example of bad engineering justified by "developer velocity" - developer time is indeed expensive relative to inefficiency for which you don't pay because you externalize it to your users. Clearest, because there are fewer parties affected in a larger way, so the costs are actually measurable.
I do have a dog in this, in a way, because as one of the paying users of sr.ht, I'm unhappy that Google's indifference is wasting sr.ht budget through bandwidth costs.
It says in the post they'll check the User-Agent for the Go proxy string and return a 429 code.
Now a bit of personal history. The Go project was started, by Rob, Robert, and Ken, as a bottom-up project. I joined the project some 9 months later, on my own initiative, against my manager's preference. There was no mandate or suggestion from Google management or executives that Google should develop a programming language. For many years, including well after the open source release, I doubt any Google executives had more than a vague awareness of the existence of Go. (I recall a time when Google's SVP of Engineering saw some of us in the cafeteria and congratulated us on a release; this was surprising since we hadn't released anything recently, and it soon came up that he thought we were working on the Dart language, not the Go language.)
Go is not only a "side project" at Google, but one of its most trivial side projects.
AdWords is mainly written in Go. YouTube is mainly written in Go. Just because they have strategic reasons for not directly monetizing Go doesn't make it a side project more than any other internal tooling.
It's core to their ability to pull in revenue now. If they were somehow immediately deprived access to Go, the company would go under. That's how you know it's not a side project.
Can you source these claims? Last I checked, YouTube was primarily written in Python, and I doubt that's changed dramatically in the intervening years given the size of YouTube. I assume there's some similar thing going on for AdWords.
> Just because they have strategic reasons for not directly monetizing Go doesn't make it a side project more than any other internal tooling.
Agreed, but all internal tooling is a side project pretty much by definition.
> It's core to their ability to pull in revenue now.
No, it's just the thing that they implemented some of their systems in. I'm a big Go fan, but they could absolutely ship software in other languages for a marginal increased operational overhead.
> If they were somehow immediately deprived access to Go, the company would go under. That's how you know it's not a side project.
I don't know what it means to be "deprived access to Go", but this is a pretty absurd definition of "side project" since it applies to just about everything Google does and a good chunk of the software Google depends on whether first party or third party (Google depends much more strongly on the Linux Kernel; that doesn't mean contributing to the Linux Kernel is Google's primary business activity). It seems you have a bizarre definition of "side project" which hinges on whether or not a business can back out of a given technology on a literal moment's notice irrespective of how likely it is that said technology becomes unavailable on that sort of timeline, and that these unusual semantics are at the root of our disagreement.
Yes, I do imagine that people who are really into Go are more likely than average to join or start Go shops, and then pick GCP over competitors because they have to start with something, and being Go people, Google stuff comes first to mind.
Lots of companies across lots of industries spend a lot of money to achieve more-less this fuzzy, delayed-action effect.
How many such people do you imagine there are? I'm active in the Go community, and I've been a cloud developer for the better part of a decade. It's never occurred to me to pick GCP over AWS because Google develops Go, nor have I ever heard anyone else espouse this temptation. I certainly can't imagine there are so many people out there for whom this is true that it recoups the cost that Google incurs developing Go.
Rather, I'm nearly certain that Google's value proposition RE Go is that developing and operating Go applications is marginally lower cost than for other languages, but that at Google's scale that "marginally lower cost" still dwarfs the cost of Google's sponsorship of the Go language.