
The spam is infuriating (not GitLab's fault, of course). At least on our instance at https://git.cloudron.io, we got massive snippet spam. After we disabled snippets, we got massive spam on the issue tracker (!). The way we "fixed" it was by turning on mandatory 2FA for all users.

As a general lesson, what we learnt is that these are not bots. These are real humans working in some poor country, manually creating accounts (always Gmail accounts) and pasting all sorts of random text. Some of these people even set up 2FA and open issues with junk text; it's amazing. Unfortunately, from what I can tell, GitLab cannot make issues read-only for non-project members (i.e. I only want project members to be able to open issues; others can just read and watch them).

Currently, our forum spam (https://forum.cloudron.io) is way worse than our GitLab spam. On the forum we even have a captcha enabled (something we despise), but even that doesn't help when there are real humans at work.


We had one of the "real humans" write to us (in issues) asking us to leave his spam up for "just a few hours".

We implemented a filter anyway.

(This was not GitLab, but a specific form on our own website.)


> asking us to leave his spam up for "just a few hours"

What... why? What is their goal???


Feeding their family?


But like... who's paying for that kind of spam??


In our case, mostly pirate TV streaming services. (Watch NBA games live etc.)


A lot of people unfortunately seem to bite on those "SEO expert" kinds of emails. I had a few clients ask me if they should give it a try, since "why not, it's cheap".


Why are they posting random text in GitLab?


This is a typical spam profile. Usually they contain links, which search engines follow.

https://forum.cloudron.io/user/cardioaseg


The link contains rel=nofollow.


That doesn't matter, see other comments below on Google's changing treatment of this attribute.

Also, you'll find spambots posting to any open form on the internet even if it doesn't do them any good, because much of it is automated; even if you hide the results, the spam will still come in.


I don’t know for sure, but I think our Markdown implementation adds nofollow.
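
For what it's worth, a minimal sketch of what that post-processing step could look like (Python with BeautifulSoup; this is not Cloudron's actual code, and the host name is just an example):

    # Sketch only: tag external links in rendered HTML with rel="nofollow ugc".
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse

    def add_nofollow(html, own_host="forum.cloudron.io"):
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            host = urlparse(a["href"]).netloc
            if host and host != own_host:  # leave internal/relative links alone
                a["rel"] = "nofollow ugc"  # "ugc" flags user-generated content
        return str(soup)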


I used to think that spammers would stop if their spamming didn't win them any results. But they don't care. They spread their spam as widely as possible without trying to prune out the places where it does them no good.


I am not entirely sure. See https://forum.cloudron.io/users; if you go to, say, page 10 or so, you will see all sorts of nonsense. I am still trying to figure out the best way to fight this spam (because a captcha is enabled and required even to create an account). But these are real people and not bots. I know this because they even post new messages all the time.


Definitely the SEO backlinks; for example, one profile I see is linking to an Indian escort service.


Maybe GitLab needs an option to disable external linking and to automatically filter any comment that contains an external link.
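
A rough sketch of the filtering half (hypothetical; the allowed host is just an example):

    import re

    # Sketch: flag any comment that links somewhere other than our own host.
    ALLOWED_HOST = "gitlab.example.com"  # hypothetical instance host
    LINK_RE = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)

    def needs_moderation(comment):
        return any(host.lower() != ALLOWED_HOST for host in LINK_RE.findall(comment))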


Or a nofollow option (add rel=nofollow)


That's a great idea. We have discussed ways of introducing a trust level and enabling this for specific groups; Discourse uses the same system for preventing spam. "Good" bots detect the rel=nofollow and do not come back.

See my proposal here: https://gitlab.com/gitlab-org/gitlab/-/issues/14156#note_258...


Iterating on my original thought, here is a smaller feature request for self-hosted GitLab instances. This can help GitLab.com too: https://gitlab.com/gitlab-org/gitlab/-/issues/273618


I still think an even better path to success is to allow entirely disabling linking for non-admins.

Google no longer treats "nofollow" as strongly as it used to: https://webmasters.googleblog.com/2019/09/evolving-nofollow-...


Thanks for sharing, I have added it to the issue, maybe you want to join the discussion there :) https://gitlab.com/gitlab-org/gitlab/-/issues/273618#note_43...

Just so that I can follow: URLs posted by non-admins should not render as HTML links at all? Wouldn't that be quite limiting for OSS project members, for example?


My opinion on the topic isn't definitive by any means, but I think a lot of projects would do just fine without allowing arbitrary hyperlinks to be added by non-admins.

I think being able to link to related issues and link into the code is still important, for example.

It's certainly a trade-off, but spammers want it to be rendered as a link.


getting that sweet, sweet SEO backlink juice


Isn’t that why we add rel=nofollow to low-friction user-submitted links on our platforms?


Google changed the interpretation of those a year ago. https://webmasters.googleblog.com/2019/09/evolving-nofollow-...


> Looking at all the links we encounter can also help us better understand unnatural linking patterns.

It appears as though they want to mark these links in order to prevent inorganic SEO, not help it.


I don't get it. They post all this spam in the hopes that people click on the links therein, thereby boosting the ranking of those sites? Does that actually work at all?


It doesn’t actually require anyone clicking on the links. Google sees inbound links and uses that as a factor when calculating the ranking of the linked page.


I thought that was how it worked like a decade or more ago, but not today.


Regardless of whether it works, people still pay for it. I have a Facebook ad right now that says "Get over 500,000 backlinks for $29.99". No doubt it's someone with a bot that spams comment forms.


A service like Stop Forum Spam might be a solution to this. It checks IP addresses and email addresses and assigns each a score based on how likely it is to belong to a spammer.

When they have to set up a new email account and maybe even a new IP address for every few accounts, it quickly becomes a lot of work.

https://www.stopforumspam.com/
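
A quick sketch of what that check could look like (Python with requests; the field names follow StopForumSpam's JSON API as I understand it, and the 50% confidence cutoff is an arbitrary choice of mine):

    import requests

    # Sketch: ask StopForumSpam whether an IP/email pair is a known spam source.
    def looks_like_spammer(ip, email):
        resp = requests.get(
            "https://api.stopforumspam.org/api",
            params={"ip": ip, "email": email, "json": "", "confidence": ""},
            timeout=5,
        )
        data = resp.json()
        for key in ("ip", "email"):
            entry = data.get(key, {})
            if entry.get("appears") and entry.get("confidence", 0) > 50:
                return True
        return False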


How do you know they are real humans? I imagine bots doing 2FA would still be cheaper.


I know this because in our forum we have LOTS of "spam" users - https://forum.cloudron.io/users . These users will go into posts and actually make "helpful" comments. Like, "Oh, I tried this solution but I found that my disk was full. Deleting data fixed my problem". It almost seems genuine, but they build reputation, and once they have some votes, they go back and edit all the comments to have links.


Banning entire countries helps a lot. I don't want to name certain countries, but let's assume it's one where it's common to see human corpses floating on a big river.


That doesn't help narrow it down. I live in Seattle, and the first thing I thought of was a popular TikTok of teens finding a corpse in the river last month.


The word you missed was "common."


Apart from the fact that banning an entire country from contributing to their code would be antithetical to the Wikimedia Foundation, if you're implying the country which I think you're implying (which is also where I live, btw), you'll:

1. Ban a burgeoning tech industry which has produced over 20 unicorns, receives billions in funding from across the world, and produces world-class tech talent;
2. Ban millions of other OSS developers from contributing; and
3. Just lead to SEO spammers picking out other impoverished countries to spam from, which means you'll finally end up with only people from the "west" being able to contribute in any way.


Many bots are likely still powered under the hood by humans.

On my backlog of projects is a browser extension that solves the more obnoxious captchas for me, as I'm regularly behind a VPN and fall into ridiculously long solve loops.

On the most popular API I could find, $10 buys you a shockingly large number of solves (not that I've tested it yet). It is automatable but ultimately still powered by humans.


It’s incredibly sad how the open web is being destroyed by Google’s reCAPTCHA.


Without Google's reCAPTCHA, do you think there would be less spam?

Personally, I suspect there would be more without at least some speed bumps to raise the cost of spamming. I would absolutely love for there to be better options than reCAPTCHA that meet the same needs around bot detection, price, implementation effort, and accessibility. It is, sadly, the best option I've seen on offer.

You're right. The scenario we're in is incredibly sad. It would be wonderful if the individual actors involved had better options to meet their needs.


I'd argue that it's equally sad to see the open web get destroyed by massive DDoS attacks and malicious actors. How would you keep your own website up if it was constantly being attacked?


You're barking up the wrong tree. Bad actors create abuse and spam, which they can do because of fundamental weaknesses in the design of the internet. People trying to address that reality with reCAPTCHA (and Cloudflare, for that matter) aren't the ones destroying the internet.


I don't think it's so much the wrong tree as just one tree in a forest to be barking up.

All the maturely developed bot filters frequently throw me into an endless battery of tests that has me giving up in frustration before finally making it through to the content I'm requesting.

> aren't the ones destroying the internet

IMO they are every bit as much destroying it as the abusers they're claiming to fend off.


I'm totally in that camp of opinion, although I'll acknowledge the escalating abuses carried out by both "sides."

In the meantime, I hope to have the savviness to program my own way out of unsolvable captchas.


Already exists: https://github.com/dessant/buster

Edit: on re-read, you meant solving using humans. Buster uses speech-to-text APIs to solve.


Every lead to solve my problems is probably worth a peek. I'll take a look, thank you.


You could add a Disallow rule to your robots.txt and advertise on the signup page that search engines won't find this content.
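
For example, something like this (paths are hypothetical; point them at wherever your user-generated content lives):

    # Hypothetical robots.txt: keep well-behaved crawlers away from UGC pages
    User-agent: *
    Disallow: /user/
    Disallow: /snippets/

Note this only removes the SEO payoff for well-behaved crawlers; it doesn't stop the spam from being posted in the first place.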



