Sourcehut will blacklist the Go module mirror (sourcehut.org)
550 points by Tomte 11 months ago | 337 comments



The Go team has been making progress toward a complete fix to this problem.

Go 1.19 added "go mod download -reuse", which lets it be told about the previous download result, including the Git commit refs involved and their hashes. If the relevant parts of the server's advertised ref list are unchanged since the previous download, then the refresh will do nothing more than fetch the ref list, which is very cheap.
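
Roughly, the intended flow looks like this (a sketch, assuming the commands are run inside a Go module):

    go mod download -json > prev.json        # records each module's origin metadata (VCS refs and hashes)
    go mod download -reuse=prev.json -json   # later: skips re-downloading repos whose advertised refs are unchanged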

The proxy.golang.org service has not yet been updated to use -reuse, but it is on our list of planned work for this year.

On the one hand Sourcehut claims this is a big problem for them, but on the other hand Sourcehut also has told us they don't want us to put in a special case to disable background refreshes (see the comment thread elsewhere on this page [1]).

The offer to disable background refreshes until a more complete fix can be deployed still stands, both to Sourcehut and to anyone else who is bothered by the current load. Feel free to post an issue at https://go.dev/issue/new or email me at rsc@golang.org if you would like to opt your server out of background refreshes.

[1] https://news.ycombinator.com/item?id=34311621


I realize in the real world most modules are probably hosted by large providers that can absorb the bandwidth, like GitHub, but it seems incredibly discourteous not to prioritize fixing the hammering of small providers, especially two years on when the response is still "maybe later this year".

I think Drew is right in that he shouldn't take a personalized Sourcehut-only exception because this doesn't address the core issue for any new small providers that pop up.

Between this and the response in the original thread that said, "For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt," it gives the impression that the Go team doesn't care. Sometimes what we _need_ to do to be good netizens is a fair bit of boring technical work but it's essential.


It's super-weird that the Google-side Go folks' responses to this have basically been "we don't have the resources to responsibly run this service that we decided to run and that's now misbehaving". Like... don't, then? Why take on that kind of thing in the first place if urgent fixes to its generating abusive traffic for no good reason take three years?


> he shouldn't take a personalized Sourcehut-only exception because this doesn't address the core issue for any new small providers that pop up

Exactly. We already saw how this ended with Google vs. people running mail servers.


The "exclusion solution" is definitely not personalized and Sourcehut-only; it's available to anyone who requests it, and in the issue tracker you can see several people already using this exclusion. True, an opt-out (which is what this solution boils down to) is not ideal, but it's way better than using your users' quality of service to try to strong-arm your side. And anyway, the Go team has made it clear that they are working on improving the refresh situation and that this opt-out is just a temporary solution until they fix the real issue.


Hi Russ! Thank you for sharing. I am pleased to hear that there is finally some progress towards a solution for this problem. If you or someone working on the issue can reach out via email (sir@cmpwn.com), I would be happy to discuss the issue further. What you described seems like an incomplete solution, and I would like to discuss some additional details with your team, but it is a good start. I'm also happy to postpone or cancel the planned ban on the Go proxy if there's active motion towards a fix from Google's end. I am, however, a bit uneasy that you mentioned that it's only prioritized for "this year" -- another year of enduring a DoS from Google does not sound great.

I cannot file an issue; as the article explains I was banned from the Go community without explanation or recourse; and the workaround is not satisfying for reasons I outlined in other HN comments and on GitHub. However, I would appreciate receiving a follow-up via email from someone knowledgeable on the matter, and so long as there is an open line of communication I can be much more patient. These things are easily solved when they're treated with mutual respect and collaboration between engineering teams, which has not been my experience so far. That said, I am looking forward to finally putting this issue behind us.


Sounds good. I will reach out over email. Thanks.


Why does the Go team and/or Google think that it's acceptable to not respect robots.txt and instead DDoS git repositories by default, unless they get put on a list of "special case[s] to disable background refreshes"?

Why was the author of the post banned without notice from the Go issue tracker, removing what is apparently the only way to get on this list aside from emailing you directly?

Do you, personally, find any of this remotely acceptable?


FWIW I don't think this really fits into robots.txt. That file is mostly aimed at crawlers, not at services loading specific URLs due to (sometimes indirect) user requests.

...but as a place that could hold a rate limit recommendation it would be nice since it appears that the Git protocol doesn't really have the equivalent of a Cache-Control header.


> Not for services loading specific URLs due to (sometimes indirect) user requests.

A crawler has a list of resources it periodically checks to see if it changed, and if it did, indexes it for user requests.

Contrary to this totally-not-a-crawler, with its own database of existing resources, that periodically checks if anything changed, and if it did, caches content and builds checksums.


I'm taking the OP at his word here, but he specifically claims that the proxy service making these requests will also make requests independent of a `go get` or other user-initiated action, sometimes to the tune of a dozen repos at once and 2500 requests per hour. That sounds like a crawler to me, and even if you want to argue the semantic meaning of the word "crawler," I strongly feel that robots.txt is the best available solution to inform the system what its rate limit should be.


When I say crawler I mean something that discovers new pages. Refreshing the same URL isn't really crawling.

But yes, it may be the best available solution in this case, even if I would argue that it isn't really its main purpose.


After reading this and your response to a sibling comment I wholeheartedly disagree with you on both the specific definition of the word crawler and what the "main purpose" of robots.txt is, but glad we can agree that Google should be doing more to respect rate limits :)


What you're thinking about, in my opinion, is best referred to as a spider.


As annoying as it is, there is precedent for this opinion with RSS aggregator websites like Feedly. They discover new feed URLs when their users add them, and then keep auto-refreshing them without further explicit user interaction. They don't respect robots.txt either.


I wouldn't expect or want an RSS aggregator to respect robots.txt for explicitly added feeds. That is effectively a human action asking for that feed to be monitored so robots.txt doesn't apply.

What would be good is respecting `Cache-Control`, which unfortunately many RSS clients don't, and just pick a schedule and poll on it.


robots.txt was originally created to include such bots. That they think they don't need to respect it goes against the original intent.

Eg: https://www.robotstxt.org/faq/kinds.html >"What's New" monitoring


I want my software to obey me, not someone else. If the software is discovering resources on its own, then obeying robots.txt is fair. But if the software is polling a resource I explicitly told it to, I would not expect it to make additional requests to fetch unrelated files such as a robots.txt


I can almost see both sides here... But ultimately when you are using someone else's resources, then not respecting their wishes (within reason) just makes you an asshole.


Going up the stack a bit this feels to me like the same sort of "we know better" mentality that said no one really needs generics.


Why should a git client respect an http standard such as robots.txt?


Google began pushing for it to become an Internet standard—explicitly to be applicable to any URI-driven Internet system, not just the Web—in 2019, and it was adopted as an Internet standard in 2022.

https://developers.google.com/search/blog/2019/07/rep-id


This is true but irrelevant to the parent's question -- in the article, it's made clear that Google's requests are happening over HTTP, which is the most obvious reason why robots.txt should be respected.


It's relevant because it attacks the premise of their objection.


Read the OP; it's obvious based on the references to robots.txt, the User-Agent header, returning a 429 response, etc, that most (all?) of Google's requests are doing git clones over http(s).


Because it uses HTTP.


I suspect they have a problem with this DDoS by default unless you ask to opt out behavior. Why is anyone getting hit with these expensive background refreshes until you have a chance to do it right? Why is it still not done right 2 years after this was first reported?

Maybe it should be an opt-in list where the big providers (such as github) can be hit by an army of bots and everyone else is safe by default.


And why isn't this a priority? These folks are offering a service that benefits the Go ecosystem.

It smells like Go is on its way out.


This smells wildly overdramatic. They've been working on solutions big and small since it was reported; it's just that the big solutions take time, and this was communicated to Drew.


This reminds me a bit of a dysfunctional relationship: clearly Sourcehut wants Google to stop DDoSing their servers; clearly Google doesn't actually want to DDoS Sourcehut; but Sourcehut also doesn't want to ask Google to stop, and Google also wants to be asked to stop. And so nothing gets done.

The question is who will swallow their pride first: Sourcehut or Google.


This isn't true. Sourcehut reported a bug, and since the bug is somewhat involved to fix entirely, we asked what the impact of the bug is to them and offered to make a custom change for the site in the interim. The impact matters: the appropriate response is different for "I saw this in my logs and it looks weird but it's not bothering me" versus "this is causing serious problems for my site". We have been getting mixed signals about which it is, as I noted, but since Sourcehut told us explicitly not to put in a special case, we haven't.


Your comment in this thread is the first time I've seen anyone mention that it was being worked on since... June 2021? This despite repeatedly raising the issue up until I was banned without explanation. I was never told, and still don't know, what disabling the refresh entails, the ban prevents me from discussing the matter further, and I was under the impression that no one was working on it. We have suffered a serious communication failure in this incident. That said, I am looking forward to your follow-up email and seeing this issue resolved in a timely and amicable manner.


> the ban prevents me from discussing the matter further

Hi ddevault, FWIW, in May 2022 on that #44577 issue [0] you had opened, it looks like someone on the core Go team commented there [1] recommending that you email the golang-dev mailing list or email them directly.

Separately, it looks like in July 2022, in one of the issues tracking the new friendlier -reuse flag, there was a mention [2] of the #44577 issue you had opened. In the normal course, that would have triggered an automatic update on your #44577 issue... but I suspect because that #44577 issue had been locked by one of the community gardeners as "too heated", that automatic update didn't happen. (Edit: It looks like it was locked due to a series of rapid comments from people unrelated to Sourcehut, including about “scummy behavior”).

Of course, communication on large / sprawling open source projects is never quite perfect, but that's a little extra color...

[0] https://github.com/golang/go/issues/44577

[1] https://github.com/golang/go/issues/44577#issuecomment-11378...

[2] https://github.com/golang/go/issues/53644#issuecomment-11751...


The offer in [1] was to email the ML to ask for an exclusion, not to continue discussing the general issue which was still being discussed in the GH issue.

And given that they banned him for no reason, he is perfectly in the right to tell them that they should email him instead.


Correct: it was made clear to me in no uncertain terms that the only thing I was allowed to say was "yes" or "no" to this offer.


Now witness the firepower of this fully ARMED and OPERATIONAL Google cloud.


> the appropriate response is different for "I saw this in my logs and it looks weird but it's not bothering me" versus "this is causing serious problems for my site". We have been getting mixed signals about which it is

We have not been reading the same tickets and articles it seems


No problems were ever mentioned, serious or otherwise. Elevated traffic isn't automatically a problem. Drew's played it up quite a lot elsewhere, but the Go team can only be reasonably expected to follow the one issue filed, not Drew's entire online presence.


You mean the issue they banned him from? ;-)


Yes. He had plenty of opportunity to state problems if they existed. Relaying harm caused would have likely accelerated things, and if harm was being done he would have taken up the still-open offer to solve this problem in the interim while the real solution is pushed out instead of writing misrepresentative and openly salty blog posts for years. Even with him being banned, the Go team is still tracking this issue, still brings it up internally, and has pushed a feature that would fix this ahead by an entire release.

So yes. The issue they banned him from. Because reality's more complicated than flippant one liners.


Thanks for the insight, Russ. Would you comment on what the potential consequences of opting out of background refreshes would be? Could there be any adverse effects for users?


Opting out of background refreshes would mean that a module version that (1) no one else had fetched in a few days and (2) does not use a recognized open-source license might not be in the cache, which would make 'go get' take a little extra time while the proxy fetched it on demand. The amount of time would depend on the size of the repo, of course.

The background refresh is meant to prefetch for that situation, to avoid putting that time on an actual user request. It's not perfect but it's far less disruptive than having to set GOPRIVATE.


Some people today raised concerns about disabling background refreshes (the temporary workaround originally suggested by the Go team) as having possibly unacceptable resulting performance for end users...

...but it sounds like disabling background refreshes would have strictly better end-user performance than what the Sourcehut team had been planning as described in their blog post today (GOPRIVATE and whatnot)?


Hey Russ, I got your messages that my emails aren't coming through but I'm not sure why. As an alternative, you can reach me on IRC at ddevault on Libera Chat. I'm in CEST, but my bouncer is always online. Cheers!


If Google is on the way to fixing this, a look at the resources consumed by (everyone's local!) non-deduplicated $GOMODCACHE would be very welcome too!

(At least from an environmental perspective.)

Easy to verify, get a report [1]: go install paepcke.de/fsdd/cmd/fsdd@latest && cd $GOMODCACHE && fsdd .

[1] Warning: Apple users with fixed, restricted, and expensive NVMe space may be very upset. Easily fixable via fsdd . --hard-link.


According to some comments in the linked GitHub issue, including [1] from last May, Drew could have simply asked to be excluded from automatic refresh traffic from the mirror. If I understand correctly, that would still leave traffic from the mirror when it’s acting as a direct proxy for someone’s request, but that is traffic that would be going to sr.ht regardless.

For some reason he did not do this and instead chose an option that causes breakage.

[1] https://github.com/golang/go/issues/44577#issuecomment-11378...


I think it is problematic that we are using GitHub issues as a "support forum" for asking that a git hosting provider be excluded from the refresh list. It should not have come to that. Whatever happened to "reasonable defaults", so that a random person hosting a single Go module doesn't get DoSed - https://github.com/golang/go/issues/44577#issuecomment-86087... ?


Everyone can make their own assessment of what is a reasonable default and what counts as a DoS (and they are welcome to opt-out of any traffic), but note that 4GB per day is 0.3704 Mbps.


That comes to around $8-11 of monthly egress traffic on AWS (4 GB/day is roughly 120 GB/month, at about $0.09/GB). I would think twice before signing up for a service that charges me $10/month - not sure why this should be any different.

Also, how do you opt out? Imagine a random developer in a startup, running a GitLab instance, pushing a Go module there, and being left with an inexplicable traffic pattern (and bill). I have no skin in the game, but this default _does not_ sound reasonable to me, whichever way you slice it.


As an aside, I have done this and I do not have the problem. The most interesting question for me in all of this has always been, what triggers Google's request load?


AWS gouges the crap out of people with their egress fees, though. You can get 20TB of transfer for ~$5 a month from Hetzner plus a server to go with it.


Opt-out is still a manual and undocumented process?

I am very concerned that "own assessment" of what is a DoS means that source code is expected to be hosted only on large platform or by large corporation which is another way to say that "the little guys don't matter".

Self hosting of source code should be an option and the proxy should be there to reduce the traffic load, not amplify or artificially increase that load despite the "level of DoS".

One thing Drew is asking for is to respect robots.txt to allow the operator to determine what a reasonable level is for that operator and not apply a github bias to it.


This way of thinking is the exact opposite of what we need to do in IT to reduce our impact on the environment. 4 GB is an enormous amount of data. It's enough to stream dozens of hours of music, or to hold a small dump of all of French Wikipedia (no pics, main paragraphs only). It's enough for offline maps of all of Germany. It's enough to watch 3 to 4 movies in good resolution. And that's per day.

We must absolutely reduce our resource usage.


> but note that 4GB per day is 0.3704 Mbps

That's per Go repository. That's a non-trivial amount of egress data and probably adds up to thousands of dollars a month.


what? why isn't it only checking the checksum file for changes? :o


That 4GB figure is for a repo at git.lubar.me, a self-hosted git repo where – quoting the person running it – "I am the only person in the world using this Go module".

In this context, that seems like a lot. Of course the module mirror can't know about this context, but there are certainly a lot of scenarios where this is comparatively a lot of bandwidth. Not everyone is running beefy servers.

Seems like an exceedingly poor and unreasonable default, and it doesn't take much imagination to see how this could be improved fairly easily (e.g. scaling to the number of actual go gets would already be an improvement).


> A single module can produce as much as 4 GiB of daily traffic from Google.

That's (upper bound) 4 GiB times 2,500 per hour. That's not nothing.


Per Go module repository, no?


Seems like a weird default policy, until you remember Google's wifi mapping opt-out process: https://support.google.com/maps/answer/1725632?hl=en#zippy=%...

"All this would go away if other people would just do what they're told" is a pretty dystopian policy, but it seems to be a popular choice in Google.


That right there is why I stopped using Google.

Anyway, things get a bit Kafkaesque when you realise that there is another company doing this WiFi thing, and to opt out of that one you need a different SSID suffix. Since you can't have both, you end up with at least one company data-mining you.

Why the GDPR has not put a stop to this is baffling.


I’ve got to ask, which other Wi-Fi mapping provider requires a SSID suffix to opt out? Is it one of the big boys or is it someone like wigle.net or openwifimap.net?


I can confirm asking to be excluded from refreshes (which AFAIK is still a standing offer, but I obviously can't speak for the mirror team because I am not at Google anymore) would stop all automated traffic from the mirror, and that sr.ht could send a simple HTTP GET to proxy.golang.org for new tags if it wished not to wait for users to request new versions.
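
For reference, that GET would just hit the standard GOPROXY protocol endpoints, something like the following (the module path is a made-up example):

    curl https://proxy.golang.org/git.sr.ht/~someone/somemodule/@v/list   # known versions
    curl https://proxy.golang.org/git.sr.ht/~someone/somemodule/@latest   # latest version info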


Should every person that hosts an instance of SourceHut, Gitea, or Forgejo have to opt-in to this? That just doesn't scale at all. Drew is standing up for all independent hosters as much as he is standing up for his own business interests.


No other independent hosters are having issues. I'm on one with no more than a few dozen users. Multiple of us write in Go, including the owner of the service, and yet Gitea and the VPS hosting it have never even blinked. If there were more of an issue, there would be more of a fuss than just two individuals. And one of those individuals was completely satisfied with the temporary hackjob while a more permanent solution has been in the works. The other just declined it with no reason ever given, because he wants to keep litigating the issue.


> No other independent hosters are having issues.

I could be wrong, but in the comments on the https://github.com/golang/go/issues/44577 issue there are at least 4 hosters that were forced to manually disable the background refresh because of this exact issue?


How are you getting that number? There's Drew, who refused all solutions offered. There's Ben, who took the offer and was satisfied. If you're counting Gentoo, which is linked, you shouldn't because that's an unrelated issue caused by a regression test.


They should probably use GOPRIVATE[1] instead. GOPRIVATE doesn't disable the module proxy globally; it just disables it for individual domains or paths on domains. This is mainly used with private repo dependencies on GitHub.

[1]: https://goproxy.io/docs/GOPRIVATE-env.html
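
For instance, for modules hosted on git.sr.ht (a sketch of the usual setup), either of these tells the go command to fetch matching modules directly from their origin and to skip the checksum database for them:

    go env -w GOPRIVATE=git.sr.ht          # persist it in the user's go environment
    GOPRIVATE=git.sr.ht go build ./...     # or set it per invocation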


Thanks for the tip, will update the post.


> It should not be necessary to fetch the same git module up to 2,000 times per day.

Holy cow Google! Wouldn't it behoove us to check if any changes occurred before downloading an entire repo?


I don't understand why they have to do a fresh clone every time.


Or, as mentioned in the post, why they don't do a shallow clone if they have to fetch it every time for whatever reason. Seems like a weird decision either way.


Yep, a shallow clone is enough to get the latest version. And you can even filter the tree to make the download size even smaller, given you only want the hash but not the contents (if the git server supports this feature).

A clone with these options literally fetches nothing but the commit history, no file contents:

    git clone --depth=1 --filter=tree:0  --no-checkout https://xxxx/repo.git
    cd repo
    git log
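
And if all that is needed is to see whether any refs changed at all, listing them is even cheaper than a filtered clone (same placeholder URL as above):

    git ls-remote https://xxxx/repo.git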


People using Go modules should be using git tags, right? They should have at least one hash already that should be infinitely cacheable, the tag commit.

Of course, I have seen alleged examples of Go modules using tags like branches and force pushing them regularly, but that kind of horror sends shivers down my back, at least, and I don't understand why you'd build an ecosystem supporting that sort of nonsense and which needs to be this paranoid and do full repository clones just for caching tag contents. If anything: lock it down more by requiring tag signatures and throwing errors if a signed tag ever changes. So much of what I read about the Go module ecosystem sounds to me like they want supply chain failures.

I don't understand the Go ecosystem.


It sure reads like someone wrote this the dumbest way that could possibly work, without a thought for what the effects would be.


presumably the proxy backend is stateless


At the end of the day it has to store, consolidate, and present that state to the service's consumers, so there is state somewhere.


> there is state somewhere

Yeah, on third-party code hosting platforms :). And maaaaaybe in some short-lived cache somewhere. I mean, why spend on storage and complicate your life with state management, when you can keep re-requesting the same thing from the third-party source?

Joking, of course, but only a bit. There is some indication Google's proxy actually stores the clones. It just seems to mindlessly, unconditionally refresh them. Kind of like companies whose CI redownloads half of NPM on every build, and who cry loudly when GitHub goes down for a few hours - except at Google scale.


could it be a CI service that builds the go-proxy server for testing and that build process does an initial clone of all sorts of go modules?


I can see this happening on my own git hosting too. I've started moving my Go code off GitHub and the module mirror shows up every 25min for each repo it's aware of doing a full clone. Thankfully the few modules I've moved are very small ones with very little history. This won't come anywhere near my egress allocation for that box.

But the whole thing is frankly a little rude.


Yeah I wanted to do the same thing and self host my repos but that will have to wait now until this issue is fixed, it looks like...

Could you tell me more about what tool you used to host the repos? And I assume you noticed the traffic in your web logs?


So either you disable the proxy completely for your site or you get overwhelmed by traffic from a not-too-well-written service, similar to a small DDoS attack, which is run by Google, and they are not planning to fix this? Did I get anything wrong here?


There is a refresh exclusion list which you can request your site to be added to. The proxy will continue to process requests for modules from your site but will not perform the automatic refresh which caused issues for Sourcehut. The Go team extended an offer to add Sourcehut to the list if a request to do so was made. The request never came and instead Sourcehut blocked the proxy.


"Opt out of me DoSing" you is not legit behavior -- especially when the victim has raised it with you multiple times, suggested fixes, and you have then blocked them from communicating in your issue tracker.

That's some really entitled thinking on the part of the Go team at Google, and it's sad to see people stanning for them.


They should have them add a file to their web service at a path of "/i_want_to_live.txt" to indicate not to DoS the server.


If an amount of traffic that nobody else even notices brings you to your knees, you're doing something wrong. Go could be far more efficient but pointing at this and calling it a DDoS is silly.


You are correct, it was more a complaint about resource usage and costs, as opposed to a breakage of the src.ht service.


I called it a DoS, not a DDoS.


Technically it'd be a DDoS, but however distributed it is makes no difference to what we're talking about here. If you could respond to substance, it'd be appreciated.


Does every site need to personally request they not be DoS attacked?


> and implies a trust relationship with Google to return authentic packages.

The entire point of the sumdb (go.sum) is to prevent the need for such a relationship. If Google (or any proxy you use) tries to return questionable packages, it will be detected by that system.


If the proxy tampers with a new version of a package when you update it, there's no way to detect it since you fetch through the cache anyway, so a poisoned sum will be added to the sumdb, and anyone who isn't fetching their packages through Google's proxy will get told that whatever they're using is trying to trick them.


> anyone who isn't fetching their packages through Google's proxy will get told that whatever they're using is trying to trick them.

That is exactly the detection of a poisoned module in the ecosystem. It would break builds, issues would get filed, and a new version would be released (and the malicious party may not be so lucky this time since it’s trust on anyone’s first use).


Considering how few people do so, I'm fairly certain it would take more than a month for somebody to catch that.

But I guess it's also fairly easy to test: just serve a slightly different version to Google's Go mirror (based on the user agent), and see how long until somebody complains to you about it.


> how few people do so

I think every company I know of with private Go modules (6-8 or so?) is running a module proxy, which will detect this. The several times we've detected this it's always been within 2-3 days of the upstream mistake. When I go to report a bug we're not always the first either.


> anyone who isn't fetching their packages through Google's proxy will get told that whatever they're using is trying to trick them.

No, the error message you get is neutral about which side might be wrong - it says "verifying module: checksum mismatch" and "This download does NOT match the one reported by the checksum server." (I've seen it a lot because it also appears when module authors rebase, which a small but surprisingly high number do...)


Wow, that is shocking. There is never a reason to rebase a public git repo, except maybe credentials leak in the past.


Even then, you want to revoke those credentials rather than try to wipe it from history, no?


That is what I think but security people want both.


Strange, I hadn't come across that before. Not sure what they're trying to achieve, deny they ever had a leak?


Third party security consultants, following a check list.


In practice, I've never validated that any of the checksums in go.sum are either correct or consistent.
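
For what it's worth, the go command verifies downloads against go.sum automatically; re-checking what is already in the local module cache against the hashes recorded at download time is a one-liner:

    go mod verify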


I find it annoying that blocking the Go module proxy (on the server side) doesn't cause a graceful fallback. Am I unreasonable in thinking that it should? Doesn't the sum file prevent you from getting a maliciously modified copy?


> Perform a shallow git clone rather than a full git clone; or, ideally, store the last seen commit hash for each reference and only fetch if it has been updated.

I'd be interested to understand why that solution hasn't been implemented yet.
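
For reference, the quoted idea is roughly this small (a sketch; the URL and the refs.old state file are placeholders):

    git ls-remote https://xxxx/repo.git > refs.new   # one cheap request: advertised refs and their hashes
    if ! cmp -s refs.old refs.new; then              # something changed since last time
        git clone --depth=1 https://xxxx/repo.git
        mv refs.new refs.old
    fi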


Because from Google's perspective everything works just fine.


From the github issue,

"it would be a fair bit of extra work for us to read robots.txt", so clearly tracking commit hashes would be even more work.


This:

> I was banned from the Go issue tracker without explanation, and was unable to continue discussing the problem with Google.

is completely asinine. But it's also par for the course when it comes to interacting with Google. When is anyone going to hold them to account for their terrible customer service and community interaction?


It's one side of the story. I don't think he was banned for no reason.


https://drewdevault.com/2022/05/25/Google-has-been-DDoSing-s...

> I was banned from the Go issue tracker for mysterious reasons [ In violation of Go’s own Code of Conduct, by the way, which requires that participants are notified moderator actions against them and given the opportunity to appeal. I happen to be well versed in Go’s CoC given that I was banned once before without notice — a ban which was later overturned on the grounds that the moderator was wrong in the first place. Great community, guys. ]

When a story has two sides and one party chooses to keep silent when accused, I tend to favor the accuser.


I would share their side as well, but I never heard it. This is a violation of their own code of conduct, which requires them to notify the affected person, explain why, and offer the opportunity to mediate the situation. This is not the first time I was banned from the Go community without notice or explanation, and the first time turned out to be frivolous -- the ban was overturned months later with an admission that it was never justified in the first place. Community management in Go strikes me as very insular and unprofessional.

Regardless, I don't really want to re-litigate it here. The main issue is that Google has been DoS'ing SourceHut's servers for two years, and I think we can all agree that there is no conduct violation for which DoS'ing your servers is a valid recourse.


DoSing SourceHut? And the perpetrator is known? Why don't you try to sue them?


Ask Max Schrems how easy it is to sue tech giants.


Because blocking is a lot easier and cheaper?


Could try and do both? Drew lives in The Netherlands nowadays. I don't know where Sourcehut is based legally, and I realize the Dutch court system is also overburdened & expensive but it might be worth a shot. I wonder if Drew considered it, and if he'd been able to use the Dutch court system. If Drew needs a good Dutch lawyer, I can recommend Arnoud Engelfriet of ICTRecht (has a blog on Iusmentis.com, legal advice is free if he's allowed to publicize it).


Are you a lawyer offering your services pro bono?


If he's able to sue in Dutch court I can recommend a good lawyer, see my other reply.


Do you have a legal claim against Google for abusing your systems?


Knowing him, it probably wasn't nothing. But that also doesn't matter - he runs a hosting service that sees a fair amount of traffic, and the onus is on Google, as stewards of the golang community and the brilliant minds behind invisibly proxying requests through Google's servers, to continue to figure this issue out. They have his email address.


The other side: https://github.com/golang/go/issues?q=is%3Aissue+sort%3Aupda...

Spoiler: Nothing.

edit: Possibly not as much nothing, see the replies to [0]. But the GH search kinda sucks for this.

[0]: https://news.ycombinator.com/item?id=34311799


Perhaps, but those reasons should still be communicated.


> "it would be a fair bit of extra work for us to read robots.txt",

Can someone explain what this means? After all, reading a small text file is plainly easy to do, so its meaning must be something other than the obvious.


All I can figure is they're trying really, really hard to keep this very parallel on their side while also avoiding having to coordinate between nodes. It can't possibly be the reading of the robots.txt that's hard, so I think that statement has more to do with applying those policies across all the nodes: they must regard the coordination needed to ensure the system as a whole isn't exceeding e.g. request rate maximums as "a fair bit of extra work".

Judging just from the linked post, the issue on which this was discussed, and this thread, it's feeling a lot like this proxy was some kind of proof-of-concept that escaped its cage and got elevated to production.


Not a good look for Google here.


Mhmmm. Yeah I wish the default were to not use a proxy. Though to be fair I'm not sure exactly what the performance implications would be.


From what I understand, the proxy also helps people make sure that an upstream deleting their GitHub repos doesn't result in builds breaking on new machines that don't have it cached locally. Imagine the problems that could happen if someone new joins your team, runs `go build` and then one of the vital dependencies 404s.

The other problem is that it's Google so their perception of "not much traffic" is "biblical floods" to other people.


If they change the case on their username on the other hand, the Go ecosystem explodes: https://github.com/sirupsen/logrus/issues/570#issuecomment-3...


> the proxy also helps people make sure that an upstream deleting their GitHub repos doesn't result in builds breaking

This is a consequence of what IMO is another bad decision by the Go team: having packages be obtained from (and identified by) random github repos, instead of one or more central repositories like Maven Central for Java or crates.io for Rust. The proxy ends up being nothing more than an ad-hoc, informally-specified, bug-ridden re-implementation of half of a central repository, to paraphrase Greenspun's tenth rule.


But that is what should happen. The build should break. So that someone can fix it.


Why? If some open source maintainer goes off the rails and deletes all their packages, why should that break my builds? I still have a valid license to the code, I don't really care that a maintainer rage quit 4 dependencies down from my application. I certainly don't want to have to scramble to deal with that.


This is an argument for maintaining a local cache of necessary build dependencies, not to rely on a third party.


Before the proxy service that is what people did. Now we don't need to because the proxy service handles that automatically.

You can run your own proxy service if you want. There's a large benefit to the go community as a whole for there to be a shared default proxy service.


Do they change the behavior if the repo was dropped for legal issues, and no valid licenses could have been obtained because the repo owner didn't have one in the first place?


The go proxy won't save modules that don't have a permissive license.


The blast radius of that outcome is another left-pad incident. The Go team decided that they don't want that to happen.


Sr.ht should set up their own graceful rate-limiting anyway, or they're going to have problems with badly coded CI setups as the service becomes more popular.


Wouldn't this just make the failures more mysterious and harder to track down for users? Failing sometimes for no reason that's obvious or apparent to end users is worse than always failing, IMO.

E.g.: source-based packages on distributions, where users may not be Go programmers or even programmers at all, will compile and install Go software where some nested library dependency is on sr.ht. These packages will now fail and, sadly, this is going to cause widespread disruption. I think it'd be worse if those failures only happened occasionally, rather than reliably and repeatably.


Are google's requests done on-demand, or part of some scheduled refresh?

I am wondering, because even if google's traffic is unreasonable, it might still be less than without the proxy.

CI configurations are notoriously inefficient with dependency fetching, so I would not be surprised if the actual client traffic is massive and might overwhelm sourcehut if everyone migrates to direct fetches.


Per the issue, both.


I recently moved to a Docker build pipeline for a project, and it's redownloading all deps on each source file change, unlike the efficient on-disk incremental compilation, because of how Docker layer caching works, so my usage skyrocketed (and my build times went from seconds to minutes).


Since writing this, I have seen the trick - copy just the go.mod and go.sum files into an earlier layer, then download the deps, then copy all the other source code.
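
In Dockerfile terms the trick looks roughly like this (a sketch; the base image and paths are illustrative):

    FROM golang:1.20
    WORKDIR /app
    COPY go.mod go.sum ./
    RUN go mod download      # this layer stays cached until go.mod/go.sum change
    COPY . .
    RUN go build ./...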


The proposed resolutions from Google have been along the lines of changes to the proxy application code. The resistance has been that this is hard because of the way responsibility is divided between the thin layer that provides the proxy web service and the ‘go’ command itself, which actually fetches the packages.

Would it be simple to solve this with an additional layer of the same proxy? Currently, end users request a package from proxy.golang.org as per the default value of the GOPROXY env var. Google runs many of these to handle the traffic, let's say 1000. They all maintain a mirror of all packages that have been recently requested (note that I expect there's more nuance here around shared lists of required packages, etc, but the structure should hold true).

The result is that every one of the 1000 proxy instances requests the source data from git.sr.ht every day.

Google could set GOPROXY on the existing instances to internalproxy.golang.org. They could then run 100, or maybe 10, of these internal instances. This would drop the traffic to git.sr.ht by one or two orders of magnitude.
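
In configuration terms that would amount to pointing the go invocations on the outer instances at the internal tier, e.g. (internalproxy.golang.org being the hypothetical name above):

    export GOPROXY=https://internalproxy.golang.org,direct   # resolve via the internal tier, falling back to origin only when it cannot serve a module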

I suspect it would require minimal if any change to application code. This might be accomplished entirely within the remit of a sysadmin (SRE?).

Any holes in my reasoning?


> or, ideally, store the last seen commit hash for each reference and only fetch if it has been updated.

Isn't that exactly what "git fetch" is doing already ?


Yes, but they do a complete clone on each run.


Vendoring in go mod should be mandatory.


From the GitHub issue, by a Googler:

> For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt […]

This is coming from one of the biggest, richest, most well-staffed companies on the planet. It’s too much work for them to read a robots.txt file like the rest of the world (and plenty of one-man teams) do before hammering a server with terabytes of requests.

If this is too much for them then no wonder they won’t implement smarter logic like differential data downloads or traffic synchronization among peer nodes.


Why did sourcehut not take the offer to be added to the refresh exclusion list like the other two small hosting providers did? It seems like that would have resolved this issue last year.


For a number of reasons. For a start, what does disabling the cache refresh imply? Does it come with a degradation of service for Go users? If not, then why is it there at all? And if so, why should we accept a service degradation when the problem is clearly in the proxy's poor engineering and design?

Furthermore, we try to look past the tip of our own nose when it comes to these kinds of problems. We often reject solutions which are offered to SourceHut and SourceHut alone. This isn't the first time this principle has run into problems with the Go team; to this day pkg.go.dev does not work properly with SourceHut instances hosted elsewhere than git.sr.ht, or even GitLab instances like salsa.debian.org, because they hard-code the list of domains rather than looking for better solutions -- even though they were advised of several.

The proxy has caused problems for many service providers, and agreeing to have SourceHut removed from the refresh would not solve the problem for anyone else, and thus would not solve the problem. Some of these providers have been able to get in touch with the Go team and received this offer, but the process is not easily discovered and is poorly defined, and, again, comes with these implied service considerations. In the spirit of the Debian free software guidelines, we don't accept these kinds of solutions:

> The rights attached to the program must not depend on the program's being part of a Debian system. If the program is extracted from Debian and used or distributed without Debian but otherwise within the terms of the program's license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the Debian system.

Yes, being excluded from the refresh would reduce the traffic to our servers, likely with less impact for users. But it is clearly the wrong solution and we don't like wrong solutions. You would not be wrong to characterize this as somewhat ideologically motivated, but I think we've been very reasonable and the Go team has not -- at this point our move is, in my view, justified.


Supposedly the operator of sourcehut has been banned from posting to the Go issue tracker: https://github.com/golang/go/issues/44577#issuecomment-11378...

So, obviously he's supposed to know the even more obscure and annoying method of opting out.


Why should you have to opt in to a no-DDoS list? Why is it not the default?


You are basically arguing that sr.ht is taking a "principled stand" against google. If that is what they are doing they should just say that and not pretend like there were no other options.

I'm ok with saying "google should do better!" But the compromise solution from the Go team seems reasonable to solve the immediate issue in a way that doesn't harm end users. The author should at least address why they have chosen the more extreme solution.


Or, we should not assume that Google's stance is correct and that sr.ht is expected to explain themselves. We should ask why Google is continuing on this path of DoSing upstream servers by default, unwilling to use standards for properly using network resources, and expecting all of them to do all the work.

EDIT: Moreover, sr.ht doing a workaround only for sr.ht, and lubar doing a workaround for lubar, etc... is not what Free Software is about. The point is that we're supposed to act as a community, for the betterment of the collective. Individualism is not a solution.


The author addressed this by way of their by-line


IIRC, because that would only help the official SourceHut instance, not other instances.


I wondered that too, but then I wondered if that's what sourcehut have actually done. I didn't notice any details about how go module mirror will be blocked.

Wouldn't the effect on sourcehut users be identical?


No. The Go team offered to add sourcehut to a list that would stop background refreshes. It would still allow fetches initiated by end users. The change sourcehut is making is to break end users unless they set some environment variables.

I've not seen any explanation of why the solution offered by the Go team was unacceptable. It's weird that that is completely left out of the blog post here.


They could also just add the site to that list, or better yet, make it opt in for sites instead of slamming them with shitty workers, and shitty defaults.

You know, like be good neighbors and respectful of other people's resources, maybe read robots.txt and not make excuses for why you are writing shitty stateless workers that spam the rest of the community.


I think Google should DDoS noone. Not everyone until they opt out.


But it’s not Google DDoSing them, it’s every user downloading packages. Without the proxy it would just be millions of users hammering their servers.

Edit: Uh okay, if it's not user traffic then why wasn't the "don't background refresh" offer an option?


> Without the proxy it would just be millions of users hammering their servers.

Doing shallow clones, which are significantly cheaper.

Google is DDoSing them, by their service design. Why a full git clone, why not shallow? Why do they need to do a full git clone of a repository up to a hundred times an hour? It doesn't need that frequency of refresh.

The likely answer is that the shared state to handle this isn't a trivial addition, it's a lot simpler to just build nodes that only maintain their own state. Instead of doing it on one node and sharing that state across the service, just have every node or small cluster of nodes do its own thing. You don't need to build shared state to run the service, so why bother? That's just needless complexity after all, and all you're costing is bandwidth, right?

That's barely okay laziness when you're interacting with your own stuff and have your own responsibility for scaling and consequences. Google notoriously doesn't let engineers know the cost of what they run, because engineers will over-optimise on the wrong things, but that also teaches them not to pay attention to things like the costs they inflict on other people.

It's unacceptable to act in this kind of fashion when you're accessing third parties. You have a responsibility as a consumer to consume in a sensible and considered fashion. Avoiding this just means your laziness isn't costing you money; it's costing other people who don't have stupid deep pockets like Google.

This is just another way in which operating at big-tech-money scales blinds you to basic good practice (I say this as someone who has spent over a decade now working for big tech companies...)


> Google notoriously doesn't let engineers know the cost of what they run

Huh? I left a few months ago but there was a widely used and well known page for converting between various costs (compute, memory, engineer time, etc).


Per TFA > More importantly for SourceHut, the proxy will regularly fetch Go packages from their source repository to check for updates – independent of any user requests, such as running go get. These requests take the form of a complete git clone of the source repository, which is the most expensive kind of request for git.sr.ht to service. Additionally, these requests originate from many servers which do not coordinate with each other to reduce their workload. The frequency of these requests can be as high as ~2,500 per hour, often batched with up to a dozen clones at once, and are generally highly redundant: a single git repository can be fetched over 100 times per hour.


The issue isn't from user-initiated requests. It's from the hundreds of automatic refreshes that the proxy then performs over the course of the day and beyond. One person who was running a git server that hosts a Go repo only they use was hit with 4 GB of traffic over the course of a few hours.

That's a DDoS.


That's not how the proxy works. The proxy automatically refreshes its cache extremely aggressively and independently of user interactions. The actual traffic volume generated by users running go get is a minute fraction of the total traffic.


They wouldn't recommend users get the data directly from them if user traffic were the problem.


sourcehut's recommendations seem absolutely reasonable: (1) obey the robots.txt, (2) do shallow clones instead of full clones, (3) maintain a local cache.

I could build a system that did this in a week without any support from Google using existing open source tech. It's mind-boggling that Google isn't honoring robots.txt, is requesting full clones, and isn't maintaining a local cache.


Despite the issue, I'm not convinced that Go doesn't do shallow fetches vs deep clones. Other issues (like Gentoo's issue with the proxy, I don't have a link handy sadly) point to fetches being done normally, not clones.


It's not about what Go allows, it's about what Google's proxy does on its own schedule. If there was a knob sr.ht could use to change this, it would've come up in the two years since this issue was raised with the Go team.


What does a local cache even mean at Googles scale though? Some of the cache nodes are likely closer to Sourcehuts servers than to Google HQ. I guess local would mean here that Google pays for the traffic. But then it is not a technical problem, but a "financial" one.

If you disregard the question who pays for a moment and only look at what makes sense for the "bits", the stateless architecture seems not so bad. Just a pity that in reality somebody else has to foot the bill.


Are you serious? Google Cloud Storage is a service that Google sells to folks using its cloud. If they can't use it for their own project, that would be shocking, no?


They are probably already using something like GCS to store the data at the cache nodes.

I was not talking about how the nodes store data, but about a central cache. Purely architecture wise, it doesn't make sense to introduce a central storage that just mirrors Sourcehut (and all other Git repositories). Sourcehut is already that central storage. You would just create a detour.

It's also not an easy problem. If the cache nodes try to synchronize writes to the central cache, you are effectively linearizing the problem. Then you might as well just have the one central cache access Sourcehut etc. directly. But then of course you lose total throughput.

I guess the technically "correct" solution would be to put the cache right in front of the Sourcehut server.


Go's Proxy service is already a detour for reasons of trust mentioned in the article. They are in a position to return non-authentic modules if necessary (e.g. court order). That settles all architecture arguments about sources-of-record vs sources-of-truth. The proxy service is a source of truth.

If Google is going to blindly hammer something because they must have their Google scale shared nothing architecture pointed at some unfortunate website, then they should deploy a gitea instance that mirrors sr.ht to Google Cloud Storage, and hammer that.

It's unethical to foist the egress costs onto sr.ht when the solution is so so simple.

Some intern could get this going on GCP in their 20% time and then some manager could hook the billing up to the internal account.


[flagged]


Drew has some... strong opinions on some things, but a straight reading of the issue suggests he's being perfectly reasonable here, and it's Google who can't be arsed to implement a caching service correctly - instead, they're subjecting other servers to excessive traffic.

It's about the clearest example of bad engineering justified by "developer velocity" - developer time is indeed expensive relative to inefficiency for which you don't pay because you externalize it to your users. Clearest, because there are fewer parties affected in a larger way, so the costs are actually measurable.

I do have a dog in this, in a way, because as one of the paying users of sr.ht, I'm unhappy that Google's indifference is wasting sr.ht budget through bandwidth costs.


> I didn't notice any details about how go module mirror will be blocked.

It says in the post they'll check the UserAgent for the Go proxy string and return a 429 code.


That's especially silly because 429 is a retriable error


Sure, but with it, the biggest impact will likely be... spam in logs on the Google side. Short-circuiting a request from a specific user agent to a 429 error code is cheap, compared to performing a full git clone instead.


I don't have any particular affinity for Google, but they're still a business and they're already developing the Go language (and relevant infrastructure) at their own expense. It's not like the Go team at Google has access to the entire Alphabet war chest like your "biggest, richest, well-staffed companies on the planet" suggests.


Go since inception has always been well funded. It is authored by some of the biggest names in programming and they are on staff at Google. This is not a side hobby. Not sure why you're suggesting that Go is lacking in resources.


No. It is a much smaller team as far as resources go. Compared to Swift for Apple or Java for Oracle, Go is not a strategic bet for Google. There is absolutely no dependency on Go for developing services on Google's platform. Hell, a large number of Google employees spend time disparaging Go. That does not happen with other company-sponsored languages.


Someone in the Go team (rsc, IIRC) commented on how a Google executive came to him in the cafeteria to congratulate him on the launch. It turns out the executive confused him with someone on the Dart or Flutter teams.


Thanks for this anecdote! This is hilarious but seems very true to me.


I just hope it wasn't Rob Pike.


Found it: Ian Lance Taylor:

https://groups.google.com/g/golang-nuts/c/6dKNSN0M_kg/m/EUzc...

Now a bit of personal history. The Go project was started, by Rob, Robert, and Ken, as a bottom-up project. I joined the project some 9 months later, on my own initiative, against my manager's preference. There was no mandate or suggestion from Google management or executives that Google should develop a programming language. For many years, including well after the open source release, I doubt any Google executives had more than a vague awareness of the existence of Go (I recall a time when Google's SVP of Engineering saw some of us in the cafeteria and congratulated us on a release; this was surprising since we hadn't released anything recently, and it soon came up that he thought we were working on the Dart language, not the Go language.)


Yes, Google staffs its Go team, but the original comment invokes Google's vast wealth as though its entire market cap is available for the development of Go, which is of course absurd. Google probably spends single-digit millions of dollars on Go annually, and it seems they've determined that supporting Drew's use case would require a nontrivial share of that budget which they feel could be spent to greater effect elsewhere.

Go is not only a "side project" at Google, but one of its most trivial side projects.


Knowing that "we only have a few million in funding per year" was a valid excuse for generating abusive traffic and refusing to do anything about it, would definitely have changed a few conversations I've had working at startups. Interesting.


Of course, Google doesn't materially benefit from optimizing the module proxy for Drew's use case, and I doubt your startups would have made traffic optimization its top priority either under similar circumstances (which is to say "no ROI from traffic optimization").


Drew's use case?!


"scenario"? Pick your synonym.


They wouldn't write significant parts of their backend in a side project.


This is obviously untrue because we know that Google does write significant portions of its backend in Go and that Google derives ~0% of its revenue from Go (the very definition of a side project). My guess is that you're assuming that a side project for Google is the same as a side project for a lone developer or a small team, which is (pretty obviously, IMHO) untrue.


> that Google derives ~0% of its revenue from Go

AdWords is mainly written in Go. YouTube is mainly written in Go. Just because they have strategic reasons for not directly monetizing Go doesn't make it a side project any more than any other internal tooling.

It's core to their ability to pull in revenue now. If they were somehow immediately deprived access to Go, the company would go under. That's how you know it's not a side project.


> AdWords is mainly written in Go. YouTube is mainly written in Go

Can you source these claims? Last I checked, YouTube was primarily written in Python, and I doubt that's changed dramatically in the intervening years given the size of YouTube. I assume there's some similar thing going on for AdWords.

> Just because they have strategic reasons for not directly monetizing Go doesn't make it a side project more than any other internal tooling.

Agreed, but all internal tooling is a side project pretty much by definition.

> It's core to their ability to pull in revenue now.

No, it's just the thing that they implemented some of their systems in. I'm a big Go fan, but they could absolutely ship software in other languages for a marginally increased operational overhead.

> If they were somehow immediately deprived access to Go, the company would go under. That's how you know it's not a side project.

I don't know what it means to be "deprived access to Go", but this is a pretty absurd definition of "side project" since it applies to just about everything Google does and a good chunk of the software Google depends on whether first party or third party (Google depends much more strongly on the Linux Kernel; that doesn't mean contributing to the Linux Kernel is Google's primary business activity). It seems you have a bizarre definition of "side project" which hinges on whether or not a business can back out of a given technology on a literal moment's notice irrespective of how likely it is that said technology becomes unavailable on that sort of timeline, and that these unusual semantics are at the root of our disagreement.


Not to mention, it's likely a quite impactful form of marketing / developer relations gain for them. I think so because when I talk to people who start to learn Go, I usually see a transfer of positive feelings and excitement from Go itself to Google as its creator/backer - one of the clearest examples of "halo effect" I've seen first-hand.


Do you really imagine some significant number of Google's search, cloud, etc. customers were driven to Google over a competitor because of "good vibes" derived from Go? Google only develops Go because it's a useful internal tool, and I'm pretty sure neither the marketing team nor the executives spend any meeting minutes discussing Go.


Marketing works in mysterious ways.

Yes, I do imagine that people who are really into Go are more likely than average to join or start Go shops, and then pick GCP over competitors because they have to start with something, and being Go people, Google stuff comes first to mind.

Lots of companies across lots of industries spend a lot of money to achieve more or less this fuzzy, delayed-action effect.


> Yes, I do imagine that people who are really into Go are more likely than average to join or start Go shops, and then pick GCP over competitors because they have to start with something, and being Go people, Google stuff comes first to mind.

How many such people do you imagine there are? I'm active in the Go community, and I've been a cloud developer for the better part of a decade. It's never occurred to me to pick GCP over AWS because Google develops Go, nor have I ever heard anyone else espouse this temptation. I certainly can't imagine there are so many people out there for whom this is true that it recoups the cost that Google incurs developing Go.

Rather, I'm nearly certain that Google's value proposition RE Go is that developing and operating Go applications is marginally lower cost than for other languages, but that at Google's scale that "marginally lower cost" still dwarfs the cost of Google's sponsorship of the Go language.


This problem isn't really specific to Google. If some hobby project was DoSing sites it would get banned. "We don't have the resources to not DoS" is not a valid excuse. The Go team needs to scope their ambitions properly; if they can't make their proxy work safely they should not have bothered to develop it.


But this isn't one of those "we developed fast and did dumb stuff". They put significant effort into doing something dumb.


Surely Google of all places has the most tested, battle-hardened robots.txt library in existence, and they have a company-wide public monorepo to boot. There's no excuse for this.


I'm pretty sure parsing robots.txt isn't the challenge. The Go team asserts that there are technical difficulties to this traffic optimization, and I don't have any reason to disbelieve them (they're clearly not dumb people, and I certainly trust them more than Internet randoms when it comes to maintaining the Go module proxy). It's a bummer for Drew, but he isn't Google's top priority right now (it seems wild to me that you think there is "no excuse" for Google not to prioritize niche use cases like Drew's--how do you imagine large organizations choose what to work on?).


It seems like they're getting tons of bandwidth paid by the war chest if they don't care about this waste at all.


Well, for boring technical reasons, I guess they will remain blocked.


Is the source code of the service behind proxy.golang.org actually open-source?



They're referring to the actual service domain not the public static domain.

https://git.sr.ht/robots.txt



presumably the robots.txt entry they are talking about is https://git.sr.ht/robots.txt

