It's annoying for sure. I deal with abuse at a large scale.
I'd recommend:
- Rate-limit everything, absolutely everything. Set sane limits.
- Rate-limit POST requests harder. Preferably dynamically based on geoip.
- Rate-limit login and comment POST requests even harder. Ban IPs that exceed the limit.
- Require TLS. Drop TLSv1.0 and TLSv1.1. Bots certainly break.
- Require SNI. Do not reply without SNI (nginx has a 444 return code for that). Ban IPs on the first hit if they connect without it. There's no legitimate use and you'll also disappear from places like Shodan.
- If you can, require HTTP/2.0. Bots break.
- Ban IPs listed on StopForumSpam, and ban destination e-mail addresses listed there. If possible also contribute back to SFS and AbuseIPDB.
- Collect JA3 hashes, figure out which ones are malicious, and ban IPs that use those hashes (rough sketch below). This blocks a lot of shit trivially because targeting tools instead of behaviour is accurate.
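For illustration, the JA3 step might boil down to something like this Python sketch. It assumes you already log "client_ip ja3_md5" pairs somewhere; the log format, filename and the hashes below are all made up, not from any real blocklist:

    # Ban IPs whose TLS fingerprint (JA3 MD5) matches a known-bad tool.
    # Assumes a log of "client_ip ja3_md5" lines written by your TLS terminator.
    BAD_JA3 = {
        "e7d705a3286e19ea42f587b344ee6865",  # placeholder hash for a spam tool
        "6734f37431670b3ab4292b8f60f29984",  # placeholder hash for a scraper
    }

    def ips_to_ban(log_path="ja3.log"):
        banned = set()
        with open(log_path) as log:
            for line in log:
                parts = line.split()
                if len(parts) == 2 and parts[1] in BAD_JA3:
                    banned.add(parts[0])
        return banned

    if __name__ == "__main__":
        for ip in sorted(ips_to_ban()):
            print(ip)  # pipe into your firewall / fail2ban action of choice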
> - Rate-limit everything, absolutely everything. Set sane limits.
This breaks when multiple users are behind the same IP. I've seen services fail even in a classroom setting, because the prof did something and a few tens of students followed (captchas everywhere).
I do in fact rate-limit everything, and it is good advice, but rate-limiting should be implemented in a way that allows for traffic bursts. It's basically a reverse leaky bucket, where you start out with N allowed requests, which get depleted with each request and refilled slowly over time (sketch below).
Search traffic is fairly bursty: people do a few requests where they tweak the query, and then they go away.
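If it helps, the bucket described above is roughly this toy Python sketch. The capacity and refill rate are made-up numbers, and you'd keep one bucket per client/IP:

    import time

    class TokenBucket:
        # Start with `capacity` allowed requests; each request spends one token,
        # and tokens refill slowly over time, up to the capacity.
        def __init__(self, capacity=30, refill_per_second=0.5):
            self.capacity = capacity
            self.refill_per_second = refill_per_second
            self.tokens = float(capacity)
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            elapsed = now - self.last
            self.last = now
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # over the limit: drop, delay or captcha the request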
Off-topic, but isn't that a normal (non-reverse) leaky bucket? When the bucket gets full the rate limiting engages. An empty bucket allows for a burst without getting full. It slowly leaks over time at a rate that allows a normal amount of traffic without filling up.
To me, it's a bucket that's being filled at a constant rate from a tap until it's full, and the traffic requires taking some water from the bucket. If there's no water, the traffic has to be dropped or wait in a queue.
I like it. I guess it's one of those things where it depends on the example used when you learned about it? For me it was some Nginx guide on rate limiting and I think they described it in the way I see it.
Yup. This happened to us when we had rate limiting turned on for our sites and ran off-site events at hotels, for example - then the hotel's IP got temp banned and our sales engineers would complain, rightfully so.
Black hats always find ways around rate limiting, and that's why they are more prevalent than actual users. People can literally run click farms with cheap 4g cell phones that artificially pump anything they want without consequences, while authentic posters that simply run 2 necessary accounts are penalized if they post regularly.
The only real way to properly police Internet communities is to keep them smaller so that botting is more obvious, and to involve carefully managed moderation. Reddit tried this, but also lost track of the human factors involved and now moderators collect side money and promote their own posts artificially.
The main problems facilitating the surge in bots are scammy creator funds and all the other measures sites take to boost their profit and market dominance. They have grown far too big and can no longer effectively manage their user bases. Things weren't meant to be this way at all; the excessive quest for market dominance and profit has thoroughly corrupted the freedom of information online in business, and now many users are also following the same road map.
For sure! Our site is B2B ecommerce, and any sizeable customer has all their users coming to us from a single NAT or proxy. For major customers it's likely that there are several of their employees using our system at any given time.
The answer needs a whole lot more finesse than this.
A "sane limit" wouldn't be "one person one IP", such a global limit should rather stop one IP (even if it's a nasty CGNAT) from having a negative impact on the entire service.
If such a limit would hinder classroom usage, but that's your target audience, then other solutions should be found; fairly logical.
Or at least rate limit session cookies. If a person does not have a session cookie, rate limit by IP. If they are authenticated as a unique person have different rate limits and different levels of authentication. HAProxy can do different rate limits by ACL conditions.
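As a rough sketch of that tiering in Python (not HAProxy syntax, just the shape of the decision; the field names and limits are hypothetical):

    # Pick the rate-limit key and budget based on how well the client is identified.
    ANON_LIMIT = 30      # requests per window, keyed by IP
    SESSION_LIMIT = 120  # keyed by session cookie
    AUTH_LIMIT = 600     # keyed by authenticated user id

    def rate_limit_key_and_budget(request):
        if request.get("user_id"):                       # authenticated, unique person
            return "user:" + request["user_id"], AUTH_LIMIT
        if request.get("session_cookie"):                # anonymous but has a session
            return "session:" + request["session_cookie"], SESSION_LIMIT
        return "ip:" + request["remote_ip"], ANON_LIMIT  # fall back to per-IP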
Or instead of strictly rate limiting, ask them a question that can't be "looked up" in a table and that requires human thought, philosophy, emotion, ethics. Maybe GPT could eventually adapt to this and in that case fall back to IP rate limiting and grow the set of questions.
I ban IPs from small data centers all the time. For my purposes there is no need to support traffic from small hosting providers that are everywhere all over the world. I do not tend to ban the IPs of commercial ISPs that provide service to end users.
You will probably ban a lot of VPN users as collateral damage. VPN providers often use these small and relatively cheap providers for their endpoints.
You may be fine with banning those VPN users, or even want that - lots of bots will try to hide behind "legitimate" VPNs - but one has to be aware of this consequence at least, especially considering that more and more people seem to use them - probably also thanks to the aggressive "sponsoring" certain providers such as ExpressVPN do on e.g. a wide variety of YouTube videos.
It depends on what you are trying to protect I suppose. Banning OVH IPs (and others) cleared up a lot of issues for me. I don't miss them, but sure you might.
"This spam traffic is all from botnets with IPs all over the world. Tens, maybe hundreds of thousands of IPs, each with a relatively modest query rates, so rate limiting does all of bupkis."
In other words, make your website unusable for people who have to connect through VPNs or public networks, difficult for anyone without a stable Western broadband connection, and unpleasant for everyone else.
With a bit of work any limits can be fine-tuned usually not to impact actual users behind NATs. Some collateral does happen but that's an unfortunate reality. I'd like you to elaborate on the rest of your comment though.
I agree that there's a "that's life" aspect to collateral/tradeoff.
That said, I sympathize somewhat with the parent. "Done right, negative side effects are minimal" is an uncomforting statement. First, because things are often not implemented correctly. There are a lot of details and tuning that will often fail to materialize in practice. Second, because long tail usability issues can often go overlooked. The abuse->anti-abuse feedback loop is pretty tight. Abuse gets identified and counteracted. The anti-abuse-> UX problems loop tends to be noticeably looser. Often, it's just aggregates (revenue/AUD/etc).
I'm not very familiar with all the workings of HTTP/2.0 - why would it break bots? Assuming no CloudFlare type protection, does it somehow stop someone from using curl to get (non-JS generated) content? Does it thwart someone accessing the site from something like playwright/selenium?
Anything that got a significant update in the past ten to twelve years will support TLS 1.2. The window of systems that would support 1.1 but not 1.2 is pretty small. You have to go all the way back to IE on Windows XP before a lack of 1.2 support becomes an issue.
So, yeah. You're absolutely right. In a lot of cases the loss of revenue from users with severely outdated software will be less than the cost decrease from cutting spam and abuse.
This gets back to an old question - to what degree should legacy systems be supported and at what level of expense? There's no one easy answer that works for everyone.
I don't think there's much web you can visit with those browsers anyway. Windows XP with IE and no SNI support, maybe sites from that era without JavaScript would work?
I think you'd be surprised how recent a browser can be and still lack a client/server cipher overlap once you start whittling down which TLS versions you accept. Just at the start of the pandemic, many government sites had to re-enable early TLS because so many people couldn't access their recent-TLS-only sites.
But yeah, corporate employees aren't going to care about those people. Governments have to. And human persons building personal websites should too.
If you're reasonably certain that the request is being made by a malicious bot, you could try responding with a ZIP bomb [1] if HTTP 1.1 or newer is being used and the bot accepts compression. It can cause problems for several browser engines.
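For what it's worth, here's a rough Python sketch of one variant - an oversized gzip response rather than a literal ZIP file. The sizes are arbitrary, and you'd only serve it (with Content-Encoding: gzip set) when you're very confident the client is a malicious bot:

    import gzip
    import io

    def gzip_bomb(uncompressed_mb=100):
        # ~100 MB of zero bytes compresses to roughly 100 KB with gzip,
        # but the client has to inflate the full amount.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            chunk = b"\0" * (1024 * 1024)
            for _ in range(uncompressed_mb):
                gz.write(chunk)
        return buf.getvalue()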
I'm sure these would work, but I'd probably get banned too, just because I often try to poke IP addresses directly. I also often use a VPN, especially when outside, so I'd definitely get banned.
Some old browsers break as well; whether it's worth it depends on the website. It's your prerogative to disable a useful feature - you can also disable JavaScript. But there's little reason for a website operator to cater to that unnecessary edge case if it's mostly used for abuse.
If doing something fends off a lot of bots, but also inconveniences a very small number of people who have significantly non-standard or just out-of-date configurations, I'm likely to favour protecting myself from the former over worrying about the latter. To paraphrase Mr Spock: The inconveniences of the me outweigh the inconveniences of the you!
FWIW I don't block non-HTTP2, but at some point that 4.8% who can't use it will be much smaller and if you are still turning it off you will be part of an increasingly insignificant population.
It is not dissimilar to measures that will have the side effect of stopping apps working on IE - I wouldn't in DayJob¹ but I'd not think twice on a personal project. I don't have time to care that much about people with very non-standard or very out-of-date setups, if accidentally blocking them is a side effect of a benefit I might feel. Added benefit: blocking them effectively removes potentially difficult users from the support queue!
And to some, that 4.8%+4.8%+… doesn't overlap their target audience enough for losing the few in both categories to really be a problem in the grand scheme of things.
----
[1] some of our clients still have users using IE11 even though we no longer officially support it² so we still don't want anything other than minor display & performance issues there ATM
[2] of course we will for more money, if someone wants to pay us to address said minor display & performance issues
To be fair, most of my visitors are not exactly lining up with the expectations of "standard". I get >90% of my [human] traffic from desktop clients, for example.
Because it adds nothing to improve my browsing experience, and reducing the number of protocols supported by my browser from 3 to 1 also reduces the attack surface.
> Your web browsing must be awfully slow sans multiplexing.
And yet it's not slowed down at all. How many different resources must a web page use before it feels slow on a connection pool of keep-alive TCP sockets? Maybe people visit some wild experimental web pages with hundreds of blocking <script src=...> tags that are not bundled/minified?
Either way, my experience is it doesn't slow anything down when I use both websites (forums, resources, youtube, social media) and web apps (banking, maps, food delivery etc).
> They're a major part in killing off web forums, and a significant wet blanket on any sort of fun internet creativity or experimentation.
> The only ones that can survive the robot apocalypse is large web services. Your reddits, and facebooks, and twitters, and SaaS-comment fields, and discords. They have the economies of scale to develop viable countermeasures, to hire teams of people to work on the problem full time and maybe at least keep up with the ever evolving bots.
This is not true at all. There are web forums that are not "web-scale" and don't spend all day fighting bot spam. The solution is real simple: it costs 10 bux to register an account, if you're a nuisance your account is banned and you pay 10bux to get back on.
Even the sites that don't require payment for explicit registration often succeed by gating functionality or content behind paywalls. Requiring a "premium membership" to post in the classifieds forum is an extremely common thing on small interest-based web-boards (photrio, pentaxforums, homebrewtalk, etc). That income supports the site and the anti-bot efforts as a whole. The customer isn't advertisers - it's the community itself, and you're providing the service of high-quality content and access to people with similar interests.
You need to bootstrap a community first, of course, but it doesn't need to be a large community, just a high-value one.
The twitters and facebooks of the world just don't like that solution because they value growth above all other considerations. They'd rather be kings of a billion user website with 200 million bots than a 1k-100k user forum with 100% organic membership and content. And they value engagement over content quality, which is the entire reason comment-tree/vote-based systems have been pushed heavily over web-1.0 threaded forum discussions as well.
This botpocalypse is the inevitable outcome of the systems that social-media giants have created, not inherent outcomes of the internet as a whole.
> This is not true at all. There are web forums that are not "web-scale" and don't spend all day fighting bot spam. The solution is real simple: it costs 10 bux to register an account, if you're a nuisance your account is banned and you pay 10bux to get back on.
That doesn't work at all unless your service is already pretty popular. Who would pay $5 to access a new, empty forum?
You mention "you need to bootstrap a community first," but that's basically an admission that this solution doesn't solve the problem at all, because you have to solve the problem in some other way to use this solution. 10bux was a solution limited to a very specific time.
There are (at least) 2 kinds of spam - "technical" spam such as bots hammering the web service with requests and consuming resources, and the commonly-accepted definition of spam where bots post promotional or other obnoxious content.
I feel like the article here talks more about the first kind. I do agree with your solution for the second kind of spam though.
It's spam from a server owner's point of view in the broader sense, in that it is "junk requests" instead of legitimate requests, they can be sent as a flood at no cost or consequence to the senders, and it's up to you as the recipient to find a way to filter it all to separate the wheat from the chaff.
It's certainly not denial of service, that means something far more specific.
One could call it "scraping", but I'd argue the meaning / emphasis is different (it'd be like describing trolling as 'typing').
The author seems to be referring to this as "spam" so I've reused their definition. In general I agree, what the author is experiencing is more akin to a DoS than a spam attack.
Uhhmmm, I beg to differ and so do a lot of very smart people with many more servers and users than you or I are likely to see.
As with most 'Oh, it's Simple - Just Do XYZ' solutions, there are often very good reasons for not doing the 'Easy/Simple/One-Liner', and here are a few with yours -
Firstly - The '10 bux' could exclude a vast swathe of the poorest. Skipping a couple of Starbucks coffees vs. the local currency equivalent of whatever you are charging equating to a month's worth of food or being able to send at least one of your children to the local village school. I mean - your forum / site, so you can gate it any way you wish; I'm just pointing out that it could and would be exclusionary (perhaps unintentionally so).
Next Problem: Accepting and Processing the 'Good Behavior' deposit. Congratulations, you now need to become a Payment Processor and as such have certain legal requirements regarding payment details and storage, and also tax returns. 'Oh, just Off-Load it to Stripe' someone might suggest. Do-able I guess, but anyone who has taken payments over the internet will tell you that it's a Royal Pain in The Ass. Also, now all a 'Griefer' needs to do is run a few dodgy cards through your registration system and 'Poof' - there goes your payment processor and/or the fees go sky high.
Most 'oh it's simple - why don't they just…' suggestions overlook (or are not aware of) the many, many good reasons why greater minds than yours or mine haven't already implemented it.
Sure - sometimes people do come up with novel solutions to old problems, so there's no harm in spit-balling, and I'm certainly not directing any scorn or ill-intent in my reply.
> Firstly - The '10 bux' could exclude a vast swathe of the poorest. Skipping a couple of Starbucks coffees vs. the local currency equivalent of whatever you are charging equating to a month's worth of food or being able to send at least one of your children to the local village school. I mean - your forum / site, so you can gate it any way you wish; I'm just pointing out that it could and would be exclusionary (perhaps unintentionally so).
It is intentionally exclusionary. Not necessarily of the poorest among us, but of those who expect free service. Botters and spammers are disproportionately likely to look for free service. Pretty much any level of required spending in any currency will have a similar effect. By cutting off the abuse-prone free tier that many bad actors depend on, you dramatically decrease your exposure to abuse.
The point is not to keep out the poor people. The point is to make it far more work to get over the hurdle than it's worth for abusers. If you have a way to do the latter without the former that doesn't hinge on pushing a bunch of extra work onto administrators, I suspect quite a lot of people would be very curious to hear about it.
Providing paid service doesn’t make one exclusionary of the poorest. You are free to give discounts or free service to vulnerable groups in exceptional circumstances, requesting a verification of your choice in place of a financial roadblock. While giving them status otherwise equivalent to that of regular paid accounts, you can still apply more sensitive monitoring procedures to them if you have grounds to expect false positives and abuse—good thing now you don’t have to do it with every account.
What being a paid service does make you is a player in a game with certain agreed-upon rules, by which your incentives are aligned with those of your customers. Giving your paying users the right of the ultimate vote (with their wallets), you in a way paint yourself into a corner where you can either act in their best interests or go broke. It matters less if you are a huge corporation (many revenue streams), and it doesn’t guarantee you won’t cheat and violate the rules (take my money and sell my data, why not!), but still as a user I find it a useful signal.
> Most 'oh it's simple - why don't they just…' suggestions overlook (or are not aware of) the many, many good reasons why greater minds than yours or mine haven't already implemented it.
The forums they are referring to have been operating with the "10 bux" model implemented on top of a highly customized version of vBulletin for over 20 years.
>The solution is real simple: it costs 10 bux to register an account, if you're a nuisance your account is banned and you pay 10bux to get back on.
Many years ago there was a public server called SDF (Super Dimensional Fortress). It was a BSD system and anyone could get a user account for $1. The theory was even the least of us, a kid scrounging for money on the street, could come up with a dollar (and presumably the postage to mail it). To a certain person, access to this kind of server was invaluable - the only situation you could hope to get close to this kind of system. As time went on, the number of people interested in this was dwindling.
Jumping through hoops is a useful gateway, but if your hoops are too complex or arduous, you miss out on people who you genuinely want to include in your community.
> Many years ago there was a public server called SDF (Super Dimensional Fortress). It was a BSD system and anyone could get a user account for $1. The theory was even the least of us, a kid scrounging for money on the street, could come up with a dollar (and presumably the postage to mail it). To a certain person, access to this kind of server was invaluable - the only situation you could hope to get close to this kind of system. As time went on, the number of people interested in this was dwindling.
SDF is still around though, and still operates on the same model - pay once to get in, and you can stay as long as you want unless you become a nuisance.
> Jumping through hoops is a useful gateway, but if your hoops are too complex or arduous, you miss out on people who you genuinely want to include in your community.
It is certainly not impossible for communities with this model to die - that's not what I'm saying at all. Small social media sites die all the time, including with Reddit-style gamification bullshit. Or they turn into cesspits like Digg or Voat.
But yes, increasing the friction of engagement is literally the point, you are losing some users but increasing the quality of the ones who remain. It's the old "fire your bad customers" routine, but for social media.
"Oh no, we are all losing out on your valuable shitposting, how will this community ever go on?"
> As time went on, the number of people interested in this was dwindling.
That's mostly because access to computers has become easier - you can either get your own Linux box or get a proper VPS for extremely cheap (if not free - see cloud provider free tiers) nowadays, so why bother with a non-root account on a BSD system?
IMO it doesn't have anything to do with the barrier to entry.
A few edits I wanted to make but couldn't while HN was down: this comes down to a question of intentions, right? Like, are you trying to build a high-value community, or are you trying to make a billion-dollar company? Photrio or Pentaxforums is never going to sell for a billion dollars like Reddit, and that's not the kind of community that Reddit is trying to build.
The highly-chaotic multithreaded model of Reddit/HN/etc is directly designed to be impenetrable and chaotic, where everyone is just responding to everyone rather than having a "flow of conversation" in which everyone is involved. The "everyone responding to everyone" is literally engagement, and that's what those sites/companies want to drive, not community-building. It's designed to suck as a medium for serious discourse, because making 27 slight variations on the same response to 27 different comments keeps you on the site.
As long as we persist in having engagement be the primary metric, that's what you will get, and as long as we persist in the idea that the objective of social media needs to be making a couple people into billionaires, engagement is going to be the focus.
And again, it's a fundamental shift in "who the customers are". Are the customers the people using the site, who want a great place to discuss the nuances of parrot taxonomy, or are your customers the advertisers? Those lead to different ways you build the community.
And you can still make a six-figure income being a webmaster of a smaller community too. You're just not going to make Reddit money off Pentaxforums.
> The highly-chaotic multithreaded model of Reddit/HN/etc is directly designed to be impenetrable and chaotic, where everyone is just responding to everyone rather than having a "flow of conversation" in which everyone is involved.
I couldn’t disagree more with this characterisation. Part of the reason that sites like Reddit and HN are preferred to traditional fora (which have their own engagement mechanisms) is because it's possible to have a different set of discussions on a topic.
Single-threaded fora result in many conversations on various sub-topics and of varying quality being multiplexed in a single chaotic and incomprehensible comment chain. Comment trees allow sub-topics and sub-conversions to be grouped in a reasonable fashion, and voting allows junk contributions to be pushed to the bottom.
It's not just webscale entrepreneurs - users love comment trees. It's part of why sites like Reddit and HN succeeded in gaining traction, and why subreddits are now the de facto replacement for fora (unfortunately, as it centralises editorial power).
This wasn't always true, though - the first web forums were threaded more often than not, and even vBulletin supported a threaded mode as a user preference well into the 00s (though as it was no longer the default, others were not using linked replies, which hurt its usefulness at that point). This is probably related to imitating the way mailing lists worked, as some of these early UIs were like the older style of mailing list presentation with a few more forms attached.
The flat forums were considered an innovation for a while because "normal users will never understand this nerdy threading model". Arguably to some extent they were right, as mass market products like youtube, facebook, etc. still limit to one level of replies.
The real innovation of the social news sites was the voting and scoring algorithms, which made it manageable by presenting users with the most popular subthreads first, rather than the chronological order the forums had used. And gaining those points had a kind of Skinner-box effect on keeping users hooked on the sites, which helped their growth too - especially when points used to have a much more prominent display.
> I couldn’t disagree more with this characterisation. Part of the reason that sites like Reddit and HN are preferred to traditional fora (which have their own engagement mechanisms) is because it's possible to have a different set of discussions on a topic.
Preferred by whom, and in what contexts?
The Reddit/HN threaded comment styles work well in scenarios that are relatively high-traffic, ephemeral, and mostly anonymous, in the sense that you generally don't notice or care about the username of people with whom you're having a conversation. You're right that it makes it easier to comprehend because you can just read from the parent downwards to get the full context, and that's rarely going to get to double digits.
But this has a lot of drawbacks. It's a method of communication that's built for a burst of posts on a topic for a day or so, then effectively archived as people move onto the next topic.
This doesn't mean that users prefer this method of communication universally, though. Even though forums are dying, Discord continues to grow and offer smaller communities that actually have a "community" aspect to them. Depends on the server, but I pretty much never see extensive use of threads in Discord, if at all; most people are happy to have a constant flow of conversation in one of the channels. It's closer to how people have conversations in person.
You're right that requiring a small fee to register can solve the problem for small communities (and so OP is wrong in the paragraphs that you quoted), but this solution still doesn't work for OP's case, as it requires login and he says that he doesn't want to know who is using his search engine (I guess for privacy reasons). In that case the bots can indeed harm the proliferation of this kind of small-scale service.
> The highly-chaotic multithreaded model of Reddit/HN/etc is directly designed to be impenetrable and chaotic, where everyone is just responding to everyone rather than having a "flow of conversation" in which everyone is involved.
What are examples of websites that maintain a flow of conversation while involving everyone without letting everyone reply to each other? Being able to reply directly to others while discussing a topic has been the norm at forums, and before those, mailing lists and BBSs. The clients/interfaces just got better over time at sorting/collapsing replies.
Who's got the forum where "everyone responding to everyone" doesn't happen?
A bit below, the author talks about how creating any form of account runs counter to their goal of being free and anonymous, so they are looking only for behavioral sifting of bots.
> And they value engagement over content quality, which is the entire reason comment-tree/vote-based systems have been pushed heavily over web-1.0 threaded forum discussions as well.
While they certainly do value engagement over quality, I suspect that the systems are put in place because those sites don't scale in terms of manpower, and they don't trust their users to formalize the structure of the site.
> There are web forums that are not "web-scale" and don't spend all day fighting bot spam. The solution is real simple: it costs 10 bux to register an account, if you're a nuisance your account is banned and you pay 10bux to get back on.
There is a big problem with all sorts of activists, though. Their modus operandi is finding forums they don't like, then posting illegal material and reporting it to the hosting provider. Many forums stopped accepting new users because of that; for instance, the only way to sign up is to find the owner and speak directly to them.
> it costs 10 bux to register an account, if you're a nuisance your account is banned and you pay 10bux to get back on.
You've highlighted its biggest tradeoff which is that it creates an economic incentive to ban people. The only way to make more money, is to have more rules and culture for ostracizing people. It would have been smarter of Something Awful (since that's the site we're talking about) to charge $4/month or something.
> It would have been smarter of Something Awful (since that's the site we're talking about) to charge $4/month or something.
The innovative thing about that model is driving the revenue off the misbehavers instead of good citizens. You don't want to have the shitters around even if they are paying $4/month, and you don't want to drive off good-faith users even if they're mediocre/hapless/etc. So run the site off the backs of the people you don't want to have around.
People don't like paying monthly (this is even true of, say, app store revenue today) and if you apply recurring charges then when people don't think they're getting enough value they'll leave. You have a hard enough time on the user-acquisition side, why make it worse on the retention side by driving away the users who you're actually trying to keep?
Billing good citizens works in some situations where you have some specific value that you provide to them - providing sales listings on classifieds boards inside interest-specific forums is a good example, since you are providing access to interested buyers, which is a value-add, same as ebay taking their fee - but just in terms of operating a forum, you aren't a big enough value-add that people are going to pay Netflix-level subscriptions to the Parrot-Ass Discussion Club. You need the users more than they need you at that point. But a one-time fee is viewed much differently by people. People will pay $5 for an app, they aren't going to pay you $5 a month for it though, or at least far fewer.
I can't remember the last time I saw a ban on SA that wasn't after a string of "stop trashing discussion" probes--sometimes in the dozens--or wasn't some flavor of FYAD-escapee bigot or death-threat-spewing weirdo. Even extremely tedious, thread-killing arguers will often be left alone unless an IK is ignored when they say "drop it" or whatever, and that's usually just a sixer.
There are people who get really mad that they eat probes for, say, misgendering trans people, and I for one would like those people to be madder still. And preferably no longer on the site.
> I can't remember the last time I saw a ban on SA that wasn't after a string of "stop trashing discussion" probes--sometimes in the dozens--or wasn't some flavor of FYAD-escapee bigot or death-threat-spewing weirdo.
Well, you made me curious so I looked at the leper's colony and looks like things haven't changed much:
There were some funny bans in there, though. "No, climate change does not justify pedophilia" and "Do not ever ask how to convert commercial drones to deploy grenades" are my favourites.
Nah, there are other ways to raise money. The forums have you pay for avatar changes (or to change others' avatars! which is a fun chunk of the forums culture), there are various upgrades like no-ads and unlocking private messaging and stuff. And the forums now have a Patreon, too.
Really glad to see someone finally talking about this.
Does anyone know what's going on with that "Duke de Montosier" spam botnet? It accounts for more than half of the botspam attacks on my sites, and I can't find anyone talking about it online anywhere, except one tweet dating back to mid-2021. It's identifiable by several short phrases that it posts:
Duke de Montosier
for Countess Louise of Savoy
Testaru. Best known
And cryptic short posts that can assemble into creepy sequences:
Europe, and in Ancient Russia
Century to a kind of destruction:
Western Europe also formed
and was erased, and on cleaned
only a few survived
number of surviving European
55 thousand Greek, 30 thousand Armenian
Many of the IPs involved seemed to be in Russia, China and Hong Kong, though they're coming from all over (eg European & US VPNs, Tor exit nodes). From tracking the IPs on AbuseIPDB, the weird spam posts seem to be just one layer, while behind the scenes it also attempts SMTP Auth and IMAP attacks on the server.
I'm eager to know more if anyone knows, and especially if anyone is trying to shut this thing down. But I can't find anyone even talking about it. (Maybe there's a reason for that?)
I suspect some of these are "bots sold for hire" where they make money selling the bot to people, many of whom don't know how to use it and run it with the default config.
I've found spam email that certainly is the above, because it has things like PUT_LINK_TO_STORE_HERE and other variables that obviously weren't updated in the config file.
My theory for the phrases above is that they're a "unique seed" used to identify sites that are easily compromised. Do a web search, find a website filled with "Duke de Montosier" comments - bingo, you've identified an easy website to target with your backlink comment spam. Or, more maliciously, a website that is easy to thoroughly compromise with vulnerabilities. But that's just my current theory.
Here's the one tweet I found in Swedish about the comment spam botnet, and it dates back to February 2021. She's the only person I could find who has mentioned it in public. Or maybe my search skills are failing me.
First, of course, you have cloudflare and recaptcha, which are free and very efficient, as the author says.
But even if you don't want to use them (some of my services don't), most bots are very dumb:
- require JS, and you lose half of the bots
- silly tricks like hidden input fields in forms that worked in 2000 still work in 2022. Use a bunch of them, and you can yet again halve the bot traffic.
- many URLs should have impossible-to-guess paths, e.g. just by changing the /admin/ URL to a UUID in Django, or /wp-admin/ in WordPress, you save so many requests (see the sketch after this list)
- bots are usually not tailored to your site, meaning if you require JS, you can actually embed anti-bot measures in the client code and they will work, e.g. exponential backoff + some heavy calculations if there are too many fast consecutive ajax requests
- fail2ban + a few iptables rules (mitigate SYN flood, etc) will help
- Varnish + Redis gets you very far in shaving off excess dummy traffic
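To make the /admin/ trick above concrete, here's roughly what it looks like in a Django urls.py. The UUID is just an example - generate your own and keep it out of public repos, sitemaps and robots.txt:

    from django.contrib import admin
    from django.urls import path

    urlpatterns = [
        # Unguessable admin path instead of the default /admin/
        path("a3f1c9e2-7b4d-4e88-9c21-5d6f0b2e8a17/admin/", admin.site.urls),
        # ... the rest of your routes
    ]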
One of the things I've done before, among the other suggestions, is to put a hidden link like /followmeifyouscraping.html in the landing page to get a bit of info about scraping volume; then you can use fail2ban filters to block anyone who visits it, if you want (rough sketch below).
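If you don't feel like writing a fail2ban filter, a quick-and-dirty Python pass over the access log gets you the same list. The log path and the "client IP is the first field" layout are assumptions about a standard combined-format log:

    HONEYPOT = "/followmeifyouscraping.html"

    def scraper_ips(access_log="/var/log/nginx/access.log"):
        ips = set()
        with open(access_log) as log:
            for line in log:
                if HONEYPOT in line:
                    ips.add(line.split()[0])  # first field is the client IP
        return ips

    # Feed the result into fail2ban, an nginx deny list, or just count them.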
> First, of course, you have cloudflare and recaptcha, which are free and very efficient, as the author says.
Recaptcha has been almost useless, in my experience. If you read the spam logs, you'll quickly learn about the spam software they (claim to) use to bypass Recaptcha, because that's what they end up promoting. I started tagging in the logs whether Recaptcha had validated on messages, and sure enough these spam posts had all successfully passed it. Great opportunity to rip out more Google dependencies from my website.
I've found my own custom written filters to be vastly more effective than Recaptcha.
Lots of the bots are running full Chrome with JS, lots of HeadlessChrome being used lately. The fact that they're using HeadlessChrome is something that makes them easy to detect, ahem.
That really hasn't been my experience, perhaps I'm just getting hit more by the sophisticated bots than the naive ones. I'm glad that it works for some people.
Recaptcha was also filtering out some legit humans (I logged all posts regardless of captcha status to be reviewed later), so it just wasn't worth degrading the user experience when the captcha's bot detection rate was so low.
Please don't make blocking VPNs plan A. Between snooping ISPs and public wifi networks that have indiscriminate content filters, I'm on a VPN about half the time. Many other legitimate users are as well.
> block signups/comments from known throwaway email domains
Throwaway email services have caught up to the point where even bigger players don't manage to block them, i.e. they rotate the domains used for the emails.
Also, attackers are rarely going to try to guess your URLs - they're going to find them via Google or Shodan, or, if you're a good REST citizen, via "/<yourapp>/".
I run a popular blog and can confirm that spam is a massive issue. I am trying to keep the independent web alive with an old-school commenting system, because it helps readers and myself improve outdated posts. My domain is over 20 years old and attracts all sorts of threats, including monthly DDoS and daily spam. Using Cloudflare solved all of these problems. Next, you need to add firewall rules inside Cloudflare WAF to trigger a captcha for /path/to/blog/wp-comments-post.php. That will not get rid of human spam though. For that, you need to use another filtering service called Akismet.
Putting everything 'behind Cloudflare' isn't a panacea. By merely living outside the West, I'm getting geo-blocked from 'normal' news sites and constantly having to solve hCaptcha riddles for some AI algo without compensation. It's such a burden that I find myself giving up pretty often. GeoIP blocking is what prevented me from getting my voter information out of my last domicile. Running everything through Cloudflare or similar also contributes to centralizing the internet around a few choke points that can hurt free speech (both the good and bad kind), and when they go down (which happened recently) a large swath of the internet goes down with them.
Does Cloudflare's "Privacy Pass" browser plugin help at all? It's advertised as reducing the number of hCaptchas you need to solve by a factor of 30, but I rarely see hCaptchas anywhere on my connection so I can't really evaluate myself.
I agree with you. But what solution do you propose for independent solo developers or people who wish to run a blog instead of using FB, Twitter and co to create content? Cloudflare may not be perfect, but it prevented me from having to shut down my solo operation without putting a massive cost burden on me. When the first DDoS hit, I had to beg one of those large cloud companies to reduce bandwidth costs. It took them forever to forgive the charges from that abuse, which was not my fault, and I was given a strong warning not to let such an issue happen again. There is no easy solution to this problem. At least with Cloudflare, people like me can stay online, but it does cause problems for visitors with a bad IP reputation.
TL;DR: I won't expose any of my projects or API directly these days due to spam, ddos and other abuse.
For commenting or forums the best solution IME is to require moderation review for all new users before showing the posts anywhere. You can also add a super simple "CAPTCHA" that is just a text field with a question easily answerable to any human looking at your site (e.g. name of $thing that your site is about without mentioning $thing in the question itself) to reduce load from non-targeted automated spam.
Same job, same problem. I simply don't allow comments anymore.
This is unfortunate, because they're amazing feedback if you write about bureaucracy. People won't take the time to write to you about their experience, but they'll leave a comment.
I came to the same solution, just disabling comments. There's a prompt to email me in the footer, but no one ever has. Shame, but that's the world we live in.
Just like with banners, nobody reads footers, so that is not surprising. Your footer also gives no indication of whether you will publish replies sent via email the way comments would be, so some people won't bother if it's only you who will see their thoughts.
Have you considered an info box where the comments would be to push anyone interested to mail? Or allow comments but make them work more like mail with comments not being publicly visible until you approve them?
If you just want to cut down on non-targeted spam then it might be enough to have your comments work even slightly differently from the standard word-press solution, e.g. by adding a simple text field with a question that real human visitors of your site should be able to answer. This alone has worked for all of my sites that allow user contributions so far - won't stop anyone trying to spam your site specifically ofc but you can always take further measures WHEN that becomes a problem.
PS: Is your site supposed to have a default WordPress favicon?
> Have you considered an info box where the comments would be to push anyone interested to mail?
Not a bad idea, thanks. Maybe I'll try that. I suspect the "problem" is I just have approx. zero readers :)
> If you just want to cut down on non-targeted spam then it might be enough to have your comments work even slightly differently from the standard word-press solution, e.g. by adding a simple text field with a question that real human visitors of your site should be able to answer.
Yeah, I'd like to do that, but I don't have the time/interest in figuring out how to make Wordpress do that. I tried googling for plugins or whatever a while back, but never found anything pre-made.
> Is your site supposed to have a default WordPress favicon?
Hmmm, maybe offload your comments to something else ?
I'm thinking of using GitHub issues/discussions as a comment system.
The website and everything will function normally without CloudFlare, but the comments are based on GitHub, which will deal with spam and hosting for me.
And I personally think using it is better from the internet-centralization point of view, as you don't add to the absolutely crazy 20% of the web that CF already controls.
But that obviously doesn't solve the DDoS problem, which should be solved with big cloud providers that include DDoS protection.
Another workaround is using IPFS, but a normal user will need a gateway, and guess who operates one of the biggest IPFS gateways? Yes, CF.
That's without even considering the tradeoffs of using IPFS itself.
I think a static site + external comment provider like GitHub might help with the attacks and spam without using Cloudflare, but I don't have any website close to your blog's size, so it's all just a prediction.
I've integrated a static site and a Wordpress site with Mastodon for comments. I rarely get any comments, but I never get any spam.
I do generally think federated social networking (ActivityPub) is a good way to handle commenting. Comments are associated with identities separate from your site and there's plenty of opportunity to layer a reputation system on top of that, though I don't know if it's been done yet.
Unfortunately, well-known platforms with known URIs are targeted way more than any custom website. I think if WordPress just allowed rewriting all the URLs, it would reduce spam attacks by a lot.
People here like to say that BigCo has ruined the independent web and that everything is now siloed and blablabla but the truth is that running an independent website fucking sucks in many regards
For my forum with 500k users a month I just added a registration captcha related to my niche. E.g. for a Dark Souls forum it would say "what game is this forum about?" And if you got it wrong the validation would include "tip it's just two words Drk S*ls". This reduced spam by over 99% and didn't annoy people with recaptcha.
If someone was unable to get past that captcha (it still happens I have logs!) I figured they were probably not that valuable a contributor anyway.
If someone wanted to target my site directly they could, but it hasn't happened so far.
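The whole mechanism fits in a few lines; something like this Python sketch (the question, accepted answers and hint are obviously placeholders for your own niche):

    QUESTION = "What game is this forum about?"
    ACCEPTED = {"dark souls", "darksouls"}
    HINT = "Tip: it's just two words, Drk S*ls."

    def check_signup_answer(answer):
        if answer.strip().lower() in ACCEPTED:
            return True, ""
        return False, HINT  # re-render the signup form with the hint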
Reminds me of a guy who implemented a pre-screen on his phone calls to stop spammers. He said he wanted to use something simple at first and that he planned to tweak it depending upon how many spammers got through. So for phase one, it asks "Dial 1 to continue". But that was enough to stop all the spam calls, so he never had to improve it.
re: someone was unable to get past that captcha - this reminded me of a story I heard back in ICQ times about some human who couldn't pass the anti-bot question: "What planet do we live on?"
I remember a friend's con-group forums who had an issue along these lines - the anti-bot question was "What is the brightest thing in the sky at noon?" the expected answer was "the sun", but some guy got stuck because he was answering "Sol". Since they had an IRC channel the issue was relatively quickly resolved, but it was an in-joke for some time.
The key takeaway is that if you have a second line of communication, humans can use it but bots won't - "Issues registering? Contact someemail or see us on IRC/Discord" can do wonders.
I am running a website builder with > 20K sites. I use open contact forms without a captcha. What worked for me is to use a one-line JavaScript snippet that places the current timestamp in a hidden input field that defaults to 0. Then I check on the backend, and if the value is either 0 or the time to fill out and send the form is less than 4 seconds, I block it as spam. This blocks more than 99% of spam and also takes care of most human copy-paste spam as well.
I used to do it with PHP, but the problem is that then you can't cache the HTML (Varnish or other solutions). JavaScript doesn't have that problem and has the added benefit of stopping bots that don't run JavaScript.
In the error message I have a friendly text asking humans to turn on JavaScript if it is off, and a <a href="javascript: history.go(-1);">Go back and try again</a> link so they don't lose the text that they have typed.
People will be able to write programs that circumvent the feature but that's also true for the javascript solution. The point of it was that it gets rid of most spam because most bots fill in the form faster than 4 seconds and are not made to circumvent this feature.
<?php if (empty($_POST['starttime']) || time() - (int) $_POST['starttime'] < 4) die('2fast4me');
Revealing the error condition (that it was submitted too fast) is nice for users and bots alike, of course. Up to you.
I've had websites where I was too fast in submitting a form before. Not any kind of anti-spam, just their server was so fricking slow that I had input the date (iirc it was a reservation system) and clicked next before the JS blobs had finished triggering each other and fully loaded. It would break the page somehow with no visual indication. I found out by looking in the dev console and noticing stuff was still loading in the background. How normal people are able to use the Internet with how often I need the dev console to do entirely ordinary things is a mystery to me.
You could generate a CSRF token or something similar in a hidden field based on a JWT token (yes, I know) on the server side. The JWT token can either contain some timestamp after which it's valid or the time it was created.
I like this solution because spammers are unlikely to try to get around it. A delay eats into their time budget and they can't introduce a human-like waiting time on every site they try to spam, better to just move on to find cheaper targets.
I'm already using this timestamp technique on my website and so far no bot operator has bothered trying to work around this. However even if some bot operator were to specifically target a website using this technique and try to decrease the timestamp, I believe you could still force a bot to wait by just changing the website to use something like a cryptographic nonce that includes a timestamp instead of just a simple timestamp that can be understood easily.
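A sketch of that idea in Python - sign the timestamp server-side so a bot can't just backdate it. The secret and the token layout here are made up:

    import hashlib
    import hmac
    import time

    SECRET = b"change-me"  # server-side only, never sent to the client

    def issue_token():
        ts = str(int(time.time()))
        sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
        return ts + "." + sig  # goes into the hidden form field

    def submitted_at_human_speed(token, min_seconds=4):
        try:
            ts, sig = token.split(".")
        except ValueError:
            return False  # missing or mangled field: treat as spam
        expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return False  # timestamp was tampered with
        return time.time() - int(ts) >= min_seconds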
I also use essentially the same technique (although I have the server generate the timestamp instead of using JavaScript) on my website and concur that this is a highly effective technique for blocking bot submissions.
I experienced this firsthand with government immigration websites. The thing is, there are only so many time slots and people are forced to use a certain web site to apply, so everyone is hunting for available times and generally none are available.
So, some creative people set up bots which check periodically for them. There are paid services which will do that for you. Now we have bots hammering the gatekeeper's website. Perhaps hundreds of bots.
Which results in the website being unavailable, serving serious QPS to the bots. I think it is only a matter of time before someone writes bots that submit multiple applications, hoping that more entries with their information will give them a better probability of success.
This is so dystopian and cruel to the average person, and I don't think there is a good solution besides deep anti-bot expertise within the primary website development team.
> I don't think there is a good solution besides deep anti-bot expertise within the primary website development team
But there is a solution: the website team should get their act together and remove the "first come first served" aspect altogether.
Do you, citizen, want to register? Cool - leave your e-mail and we'll call you. Is the service optional? Then we'll pick at random from the pool of applicants and e-mail them. Is the service mandatory? Then sign up and we'll call you once you reach the top of the queue. Add a quick ID/credit card/whatever check on top (like good concert venues do), regular e-mail updates to let people know they haven't been forgotten, and you're done.
Any second year CS student could write such a system. The difficult part is accepting that the current approach doesn't work and looking for alternatives.
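The core of it really is small; a toy Python sketch of the mandatory-service case (ID-based dedup plus a FIFO queue; e-mailing, accept/decline links and re-queueing would be bolted on around it):

    from collections import deque

    class AppointmentQueue:
        def __init__(self):
            self.queue = deque()
            self.seen_ids = set()

        def register(self, applicant_id, email):
            if applicant_id in self.seen_ids:
                return False  # one entry per ID: no benefit in hammering the site
            self.seen_ids.add(applicant_id)
            self.queue.append((applicant_id, email))
            return True

        def offer_next_slot(self, slot):
            if not self.queue:
                return None
            applicant_id, email = self.queue.popleft()
            # e-mail the offer with accept/decline links; re-queue on decline
            return applicant_id, email, slot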
"Thank you for waiting three weeks for your appointment selection. We are happy to offer you a time slot next Friday, from 1 pm to 1:15 pm. Click here to accept: [button]. If this does not suit you, click here to get sent back to the queue: [button]."
Half the point of these services tends to be giving users some choice in when they have to show up somewhere. Because not everyone can make time in the middle of business hours of an arbitrary day. Not to mention that you might simply be out of town.
The thought did cross my mind, which is why I'd also ask for an ID or equivalent. If you don't show up with that specific ID on the day of your appointment, you lose the appointment. And you can use that ID number to deduplicate requests.
If you don't do that, I 100% agree with you - scalpers could then register with hundreds of accounts for reselling, and we would be back where we started.
>Do you, citizen, want to register? Cool - leave your e-mail and we'll call you. Is the service optional? Then we'll pick at random from the pool of applicants and e-mail them. Is the service mandatory? Then sign up and we'll call you once you reach the top of the queue.
Wait a minute, wouldn't that be more work for "us"?
It gets more fun when there are arbitrary restrictions on process. Fun anecdote: South Africa's Home Affairs website, where you make bookings for passports, only lets you book appointment dates 2 weeks ahead normally. It's effectively permanently booked even without bots this way.
Luckily, if you're technically inclined, editing the value of the input element via dev tools is accepted by the form.
I faced a website like this recently when booking a slot at my own German embassy, went a different route around the embassy instead. What I don't like about the slot system: You won't get a convenient time slot anyways so why do they bother setting it up like this in the first place? Why not just register with your contact details and receive an email with a guaranteed spot at a selection of three days instead. No more need to reload and no need for bots.
The Upper Austrian government did this for covid vaccinations in the early phase and it worked very well that way (early, high vaccination rate amongst the elderly and certain professions).
That reminds me of the chaos that ensued in my state in the early days of Covid-19 vaccination when they were still having centralized systems where the elderly could book an appointment. Of course, they had way more demand than supply but still insisted on First Come, First Served, so you ended up with every member of the extended family being asked to try and book a slot, quickly overwhelming their booking systems.
At least you didn't need to worry about it all day: there wasn't a chance in hell to get something 3 minutes after the booking system opened each day.
Friends described how stressful it was for them, their parents being completely helpless and essentially fearing that their health depended on getting one of those elusive appointments, and being devastated each day they didn't succeed.
In some miracle of competence, my district allotted shots by a decreasing age limit, so that there was no "shortage" or lagfest for those eligible in the booking system.
We have all the data about everyone in the registers, but god forbid we use them to lower friction in case of emergencies! Good to hear that some got it right.
This is pretty much the only way you can book a driving test in the UK at the moment unless you want to take your test in a random place in the Highlands.
In the 1980's, we kept anklebiters off dial-up BBSes with a simple technique: voice validation. To join the forum, you had to fill out an application first, which included your real name and phone number. The sysop would give you a call for a quick chat, and then grant you access if you didn't seem like a twit.
This would be entirely practical for some small-time operator trying to run a forum off residential broadband, while impractical for the reddits, facebooks and twitters.
I guess it's a different time and it also depends on who's your target audience. Some people go crazy if you ask for their email address. Phone numbers and calling is a big no-no.
Someone that far away shouldn't want my direct contact information to join a group.
However there is a medium / large organization case, where each area has local 'chapters' or some other term for a small fragment of the larger group. In that case the local leaders each operate as a small group for their areas.
I remember some BBS registration forms where you would have to give the names of a couple existing users that would vouch for you. Kind of like other sites where you need an invite or referral from an existing member.
It is interesting to watch comments about this dance around the topic of barriers to entry. It wasn't exactly easy for the uninitiated to access various internet fora in the early days and with popularity comes the bots born out of the desire to profit for little work at the expense of the community garden. The recent story about VRchat embracing anticheat DRM is another example of this, as its ascending popularity led to more scammers [0].
Does this extend to societies as well? One can think of a membrane that has selective permeability to ideas but resists antisocial actors and concepts. Alexander Bard has talked a lot about social membranics (it's a bit hard to search for).
As odious as the web3 charlatanry is, I'm starting to yearn for anything that raises the transaction costs for the dumbest bots. I remember reading something about new ideas with distributed moderation at some point - maybe someone can refresh my memory.
What's hard to do now is host a lightly used but broadly interesting service that doesn't require a login.
Although, surprisingly, I host such a service, and while it gets a constant stream of random hits, they're a minor nuisance. Probably because it's just the back end for a web page, and nobody bothers to target it specifically. Random web browsing won't find it, and the API will just return an error if called incorrectly. Even if it is called correctly, it has fair queuing on the service, so hammering on it from a small number of IP addresses won't do much.
That did happen once. Someone from a university was making requests at a high rate and not even reading the results. I noticed after a month, and wrote to their department chair, which stopped the problem.
Yes, this is why I plan to take down my hobby projects. And it's not only bots, real people do it as well. Apparently some people have a passion for screwing up other people's work. Some even email me afterwards asking for money to disclose a bug they found.
Same! I also got like 20 requests every second from a university IP. I tried a few things to make it error out, like returning 404, but no dice. In my case it was my own fault though: a page with a few lines of JS to periodically check for updates got into a crazy state (I never found out how), and they didn't notice because it was a remote desktop system where they left the page open. It went on for months but didn't impact my service (I just noticed it in access logs while looking for something else), so I left it, remembered again a few months later, and then it was gone.
>What's hard to do now is host a lightly used but broadly interesting service that doesn't require a login.
Which other broadly interesting services exist? The owners of those services could come together and offer a VPN that gets preferred treatment for these services. This could be more precise than https://www.abuseipdb.com/.
I get a ton of spam from my contact-me pages even with a captcha in place. I've been experimenting with loading an initial dummy form and replacing it with the real one a few seconds after load, which seems to have cut down on bots submitting stuff.
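For what it's worth, the swap can be a few lines of client-side script. A minimal sketch, assuming a decoy endpoint and made-up field names (this is illustrative, not the parent's actual setup):

```typescript
// Serve a decoy form, then swap in the real endpoint a few seconds after load.
// Naive bots that POST immediately after fetching the page hit the decoy.
function armContactForm(delayMs: number = 3000): void {
  const form = document.querySelector<HTMLFormElement>("#contact-form");
  if (!form) return;

  form.action = "/contact/decoy"; // what the HTML ships with

  window.setTimeout(() => {
    // Only after a human-plausible delay does the form point at the real
    // handler, and a token shows the swap actually ran in a browser.
    form.action = "/contact/real";
    const token = document.createElement("input");
    token.type = "hidden";
    token.name = "armed_at";
    token.value = String(Date.now());
    form.appendChild(token);
  }, delayMs);
}

document.addEventListener("DOMContentLoaded", () => armContactForm());
```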
Rate limit everything you can and use a captcha where acceptable; there are also a load of public IP and email blacklists that you can use to run a quick check. Working in a field where there is a large amount of bots and incentive to abuse, we invest quite a bit of time and money in fraudulent traffic detection using a cornucopia of different services in tandem, and at the end of the day we still see a small percentage of traffic getting through that is fantastically human-like.
With that out of the way, I've been engulfed in AI and GPT-3 functionality lately, and I thought this post was going to be doomsaying the coming apocalypse of bot spam, because the level of human-like quality coming from AI is going to make (and already has made) deciphering human vs bot traffic/posts/emails/comments nearly impossible. It will be fun when we soon see forums entirely dedicated to bots conversing and arguing with each other outside of reddit.
"The rest are forced to build web services with no interactivity, or seek shelter behind something like Cloudflare, which discriminates against specific browser configurations and uses IP reputation to selectively filter traffic."
Interactivity is not a must-have. The world's first general purpose computer, ENIAC, was not built for "interactivity". It was built to calculate ballistic trajectories, which were otherwise calculated manually. Computers exist to allow automation, to reduce manual labour.^1 "Tech" companies need interactivity to support collection of data about www users and paid services related to programmatic online advertising. Generally, users do not need interactivity. Generally, users do not need to spend excessive quantities of time "interacting" with networked computers.
As a user, I want _non-interactive_ www services, whether it is data/information retrieval or e-commerce. I want to use more automation, not less. Automation is not reserved for those providing "services". It also should be available to those using them.
Provide bulk data access. Let others mirror it. Take advantage of "open datasets" hosting if necessary. For example, Common Crawl is hosted for free with Amazon. Upload the data to the Internet Archive.
"The API gateway is another stab at this, you get to choose from either a public API with a common rate limit, or revealing your identity with an API key (and sacrificing anonymity)."
Publish the rate limit for the public API. Do not make users guess. Do not require "sign-in" to use an API to retrieve public data.
1. Some folks consider having to "interact" with a computer as labour, not fun.
>>> Automation is not reserved for those providing "services". It also should be available to those using them.
Yes !
I call this software literacy. And yes - no matter how cool the JS on a major site, the site's goals (keep me there and clicking) and my goals (get what I want with minimal action) are in conflict.
I would suggest that bots are actually not a problem. For most things I would like a bot acting for me. Telling me as and when I need to visit the dentist, who has slots free next Weds and Friday. Friday is best because I am also WFH that day.
The bot apocalypse is only an apocalypse because we are trying to make a "web for humans" when actually a "web for bots, and a bot for a human" is a much better idea :/)
> Telling me as and when that I need to visit the dentist
Isn't that simply your calendar? Sure, you want it automated; but it doesn't need internet access, it doesn't need to crawl or search, I don't know why you refer to it as a 'bot'.
To my mind, the idea of personal 'bots' was that you could give it some general instructions such as "Let me know when the content at any of these URLs changes", and then leave it running. Were they also called agents?
Not really. I want a PA who will help organise my day, monitor my emails and messages, arrange meetings, prepare materials, tell me options for kids' camps: you know, what people who can afford a full-time helper get. But I don't want to spend 80k p.a. or more. I want a bot.
I run a website about immigration. I'd love to reinstate comments and get valuable feedback from people who just tried my advice. Bots just make it too time-consuming.
While clearly bot spam is on the rise, we need to be very careful on how we choose to deal with it. Cloudflare has already introduced "proof-of-Apple" [1], where proven Apple devices get special treatment, bypassing captchas. Later we might see websites that are only accessible via Google, Microsoft, or Apple devices. If we continue down this path, we'll end up with a social credit system ruled by big tech.
I wonder if proof-of-work would help. Suppose every form submission requires an expensive calculation, calibrated to take about 1 second on a typical modern computer/smartphone. For human users, this happens in the background, although it makes the website feel slower. But for bots, it dramatically limits how many submissions each botnet host can make to random websites.
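A minimal hashcash-style sketch of that idea, assuming Node's built-in crypto and an arbitrary difficulty (this is the generic scheme, not any particular product's protocol): the client burns CPU finding a nonce, while the server spends one hash checking it.

```typescript
import { createHash, randomBytes } from "crypto";

// Find a nonce so that sha256(challenge + nonce) starts with `difficulty`
// zero hex digits. Calibrate `difficulty` so a typical browser needs ~1s.
function solve(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty);
  let nonce = 0;
  for (;;) {
    const digest = createHash("sha256").update(challenge + nonce).digest("hex");
    if (digest.startsWith(prefix)) return nonce;
    nonce++;
  }
}

// The server only recomputes a single hash to check the submitted nonce.
function verify(challenge: string, nonce: number, difficulty: number): boolean {
  const digest = createHash("sha256").update(challenge + nonce).digest("hex");
  return digest.startsWith("0".repeat(difficulty));
}

const challenge = randomBytes(16).toString("hex"); // issued per form load
const nonce = solve(challenge, 4);                 // done client-side
console.log(verify(challenge, nonce, 4));          // checked server-side -> true
```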
I'm curious whether this can actually be considered to be a "CAPTCHA" in the true sense of the term. It doesn't seem to be intended to "tell computers and humans apart", but rather to force the client computer (not the human user) to do some work in order to slow down DOS attacks.
Of course slowing down DOS attacks is a great goal in itself, and it's very often what captchas have been (ab)used for, but it doesn't seem to me to replace all or most use cases for a captcha.
In particular, since it can be completed by an automated system at least as easily as by a human, it doesn't seem like it would limit spambot signups or spambot comment or contact form submissions in any meaningful way.
I used "captcha" to simplify mCaptcha's application, calling it a captcha is much simpler to say than calling it a PoW-powered rate limiter :D
That said, yes, it doesn't do spambot form-abuse detection. Bypassing captchas like hCaptcha and reCAPTCHA with computer vision is difficult, but it's stupidly easy to do with services offered by CAPTCHA farms (which employ humans to solve captchas, available via API calls), which are sometimes cheaper than what reCAPTCHA charges.
So IMHO, reCAPTCHA and hCaptcha are only making it difficult for visitors to access web services without hurting bots/spammers in any reasonable way.
Thanks for the reply! That's basically what I thought then – but as you say, traditional captchas are deeply flawed and ineffective anyway, and I totally agree that in many cases the cost to real users outweighs any benefit.
So I'm excited to see alternatives such as mCaptcha popping up. It'll be interesting to see how it works out for people in real-world use.
The results of the PoW are just thrown away, right? I wonder if you could couple that with something useful, e.g. what SETI@home used to do, but the intentionally small size of the work probably makes it difficult to be useful.
It looks great. As a suggestion: instead of an easy mode and an advanced one, I would use a single mode with a calculator; that way it is more transparent to the user and it would make the process of learning the advanced mode and concepts easier.
Also, here: https://mcaptcha.org/, under the "Defend like Castles" section, I think you meant "expensive", not "experience".
> Instead of an easy mode and an advanced one, I would use a single mode with a calculator; that way it is more transparent to the user and it would make the process of learning the advanced mode and concepts easier.
Makes sense, I'll definitely think about it. The dashboard UX needs polishing and this is certainly one area where it can be improved.
> Also, here: https://mcaptcha.org/, under the "Defend like Castles" section, I think you meant "expensive", not "experience".
Fixed! There are a bunch of other typos on the website too, I can't type even if my life depended on it :D
How does that work without becoming a SPOF for taking down the website? Can't a user/botnet with more CPU power than the server simply send more captchas than can be processed?
In addition, using sha256 for this is IMHO a mistake, calling for ASIC abuse.
> How does that work without becoming a SPOF for taking down the website? Can't a user/botnet with more CPU power than the server simply send more captchas than can be processed?
Glad you asked! This is theoretically possible, but the adversary will have to be highly motivated with considerable resources to choke mCaptcha.
For instance, to generate the Proof of Work (PoW), the client has to generate 50k hashes (can be configured for higher difficulty), whereas the mCaptcha server only has to generate 1 hash to validate the PoW. So a really powerful adversary can overwhelm mCaptcha, but at that point there's very little any service can do :D
> In addition, using sha256 for this is IMHO a mistake, calling for ASIC abuse.
Good point! Codeberg raised the same issue before they decided to try mCaptcha. There are protections against ASIC abuse: each captcha challenge has a lifetime, and there's also variable difficulty scaling implemented, which increases difficulty when abuse is detected.
That said, the project is in alpha, I'm willing to wait and see if ASIC abuse is prevalent before moving to more resource-intensive hashing algorithms like Scrypt. Any algorithm that we choose will also impact legitimate visitors so it'll have to be done with care. :)
> the client will have to generate 50k hashes(can be configured for higher difficulty)
I completely forgot how PoW worked, it's clearer now. You should probably add that this is a probabilistic average, so people will have to be ready for much longer (and faster) resolutions.
With what you said, an adversary can probably just DoS mCaptcha without any computation, if verification is stateless (by sending garbage at line rate); if it is stateful (e.g CSRF token), you'll have to do a cache query, which is probably on the same order of magnitude of a single hash.
I’m the co-founder of Friendly Captcha [0]; we’ve offered a proof-of-work-based captcha for two years or so. Happy to answer any questions.
A big part of what makes our captcha successful in fighting abuse is that we scale the difficulty of the proof-of-work puzzle based on the user’s previous behavior and other signals (e.g. minus points if their IP address is a known datacenter IP).
The nice thing about a scaling PoW setup is that it’s not all-or-nothing, unlike other captchas. Most captchas can be solved by “most” humans, but that means that there is still some subset of all humans that you are excluding. In our case, if we do get it wrong and wrongly think the user is a bot, the user may have to solve a puzzle for a while, but after that they are accepted nonetheless.
While your service is of high quality, the pricing is completely unreasonable for private use cases, many times higher than hosting the site in the first place.
I think it depends on what counts as a "request" in terms of pricing. Is it only successful checks? Pricing would be fine then. If it also includes failed checks then there is no point in the service, including the Advanced plan: it would eat through the entire credit in a day.
I'm sorry to hear that. We offer free and small plans for small use-cases, but I also understand that some projects don't have a budget at all.
There is a blessed source-available version of the server that you can self-host [0]. It is more limited in its protection, but it is probably good enough for hobby projects.
Newer variations (such as argon2) are tunable so you can include memory footprint and cpu-parallelism. There also are time-lock puzzles or verifiable delay functions that negate any parallelism because there's a single answer which can't be arrived at sooner by throwing more cores at the problem.
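To show what "tunable" looks like in practice without pulling in an argon2 binding, Node's built-in scrypt exposes the same kind of knobs; a sketch with arbitrary parameter values (this is an illustration of tunable cost, not a VDF or time-lock puzzle, which are different constructions):

```typescript
import { randomBytes, scryptSync } from "crypto";

// scrypt lets you dial in CPU cost, memory footprint and parallelism, so the
// work can't be trivially shrunk onto cheap special-purpose hardware.
const salt = randomBytes(16);

const proof = scryptSync("challenge-or-password", salt, 32, {
  cost: 2 ** 15,            // N: CPU/memory cost, doubles work per increment
  blockSize: 8,             // r: raises the memory footprint (~128 * N * r bytes)
  parallelization: 1,       // p: how many lanes may run concurrently
  maxmem: 64 * 1024 * 1024, // allow the ~32 MiB this parameter set needs
});

console.log(proof.toString("hex"));
```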
For small scale self-hosted forums, bespoke CAPTCHA questions can work quite well in practice. Make it weird enough and it just isn't worth that much for malicious users to break, while most humans can pass easily. Spammers benefit from volume.
I would suggest that bots are actually not the underlying problem. For most things I would like a bot acting for me. Telling me as and when that I need to visit the dentist, who has slots free next weds and friday. Friday is best because I am also WFH that day.
The bot apocalypse is only one because we are trying to make a "web for humans" when actually a "web for bots, and a bot for a human" is a much better idea :/)
We need to redesign a web based on APIs, certificates, rate limits etc. And stop having "engagement" as a goal, and have "getting things done" as a goal
This kind of botspam is usually pretty easy to address with redbean using the finger https://redbean.dev/#finger and maxmind https://redbean.dev/#maxmind modules. The approach I usually recommend isn't so much IP reputation, which can be unfair, but rather finding evidence of clients lying to you. For example, if the User-Agent says it's Windows with a language preference of English, but the TCP SYN packet says it's Linux and MaxMind says it's coming from China, then that means the client is lying (or being MiTM'd) and you can righteously hellban once it's fingered for the crime.
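To make the shape of that check concrete without pretending to be the redbean API: a hedged sketch, where the passive OS fingerprint and geoip country are assumed to come from whatever tooling you already have (p0f-style fingerprinting, a MaxMind lookup). As the replies below note, mismatches like these can also come from legitimate proxies and VPNs, so treat it as one signal, not proof.

```typescript
// Illustrative consistency check only; field names and sources are assumptions.
interface ClientSignals {
  userAgent: string;        // what the client claims
  acceptLanguage: string;   // e.g. "en-US,en;q=0.9"
  osFromSyn: "windows" | "linux" | "macos" | "unknown"; // from TCP fingerprint
  countryFromGeoip: string; // ISO code from the IP database
}

function looksLikeLiar(c: ClientSignals): boolean {
  const claimsWindows = /Windows NT/.test(c.userAgent);
  const claimsEnglish = c.acceptLanguage.toLowerCase().startsWith("en");

  // UA says Windows but the SYN fingerprint says Linux: something in the
  // middle is rewriting or faking headers.
  if (claimsWindows && c.osFromSyn === "linux") return true;

  // English-only "Windows browser" coming from a country you never see
  // legitimate traffic from: weak alone, suspicious in combination.
  if (claimsWindows && claimsEnglish && c.countryFromGeoip === "CN") return true;

  return false;
}
```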
My laptop is lying all the time. I change my UA, preferred language, my ip, mac and so on, because of tracking, terrible dev assumptions and personal preferences.
Yet, I'm a very good web citizen.
Because of this, I often have to solve the same captcha many times before it thinks I'm human.
I don't doubt it. Given how rare people like you are, I'm sure a good citizen like you would also be perfectly fine sending an email to the service asking to be whitelisted, or having a second browser that isn't your daily driver for situations like this that doesn't try to obfuscate its identity by behaving like a bot.
The first solution isn't practical (so many services where you'd have to manually hunt down an email address to send a message to, then interact with a human that might not even exist), and if you do, they don't whitelist you. I tried. Either they don't answer, or they have "no way to have a specific whitelist for a single user in our system".
So the second browser is the solution. But then the site will do all the bad things that I wanted it not to do in the first place. Like serving terrible French results instead of good English ones, or assuming Firefox doesn't work based on UA while their site work fine with it. And of course track me to death, sell my data, and so on.
The only solution that works is to choose services you pay money for: they have your card, so they know you are not a bot. For years now, I have been suspicious of anything free. But it doesn't solve the tracking problem.
Yes I understand the desire for capitalism rather than surveillance capitalism, but that's a derailment. The OP appears to be someone who just wants to build something cool and share it with other human beings. In that case, it's really helpful to be able to have a free practical way to address abuse. Would you really tell someone like the OP to stop expressing themself and shut down their service and put a paid one in its place? How can you charge for search when Google gives it away for free?
I understand all causes and consequences of this problem, and I'm not implying there is an easy solution, only underlying that using "the user is lying" will lead to frustrating false positives.
> The OP appears to be someone who just wants to build something cool and share it with other human beings.
Only thing is, you don't know if that statement is true. Or they could really be wanting to build something cool but take advantage of all those "free" services and basically sell you to Google and Facebook.
I won't lie, if you make an asshole system that bans me for doing perfectly normal things that make the internet work I'm going to assume I don't want to interact with it anyway.
That's actually a terrible heuristic. My requests are often from Windows proxied by Linux, with the language set to my preferred one in a non-matching country. And that's before I start travelling and using a hotspot with a faked TTL to work around telco limitations. That's before you even get to people completely unaware of interesting routing applied to them (like corporate proxies, VPNs) and people with incorrect MaxMind entries.
What's stopping them from clubbing you with a monkey wrench? With bots, to answer your question, it'd probably take another standard deviation in the IQ of the person using it. So you've ruled out all the script kiddies in the world by default. The purpose of this game isn't to have a perfect defense, which is impossible, but rather to make the list of people who can mess with you as short as possible.
I thought this article was referring to the upcoming deluge of GPT-3/DALL-E bots that will eventually flood all of online discourse. And whatever future models that will be even more indistinguishable from people - perhaps even ones that are good at "signup flow".
That's going to be way worse for humanity than spiders and automated scripts sending too much traffic. This article isn't imagining apocalypse creatively enough.
We're certainly heading towards a scenario where internet abuse (due to poor regulation against it, IMHO, it's digital pollution) becomes enough of a nuisance to require increasingly intrusive verification.
Though we can all work against that by securing our own systems and preventing them from being abused. Used or unused domains should have a strict SPF policy, website registration (or newsletter signup forms) should have captchas, and comments should have captchas. Wordpress and other CMS plugins should be kept up to date, and so on. Work on requiring 3DS everywhere: defence in depth.
That way malicious actors would be limited to the services they pay for and that makes their life significantly harder.
We're coming up on the end of open forums and open social media. Everything will require intrusive verification. Anonymous forums could exist but they'll require something else like an anonymous payment, a ton of proof of work on your local machine, etc. to filter out crap.
Have a "CAPTCHA" that gives the IP reputation for some time (cookie+IP=key), but instead of a CAPTCHA make the web page / browser solve and submit a BOINC task from a randomly picked science project. No user interaction needed, it has the benefits of "paying by computation" of cryptocurrencies without the tracing, and if bots solve the problem efficiently, it's good for science.
Usually you have a landing page, and then you enter stuff there, and finally you get output. That gives you about 1 minute before returning first results.
If the user is faster, you can show something like cloudflare does when you visit through Tor.
On later submissions you can reuse the reputation from the cookie.
I work in this space at a company you've heard of - even at our scale and with our resources the proportionally larger attack incentives mean we are constantly firefighting.
> The other alternatives all suck to the extent of my knowledge, they're either prohibitively convoluted, or web3 cryptocurrency micro-transaction nonsense that while sure it would work, also monetizes every single interaction in a way that is more dystopian than the actual skull-crushing robot apocalypse.
I understand the drawback here but I would like to see monetized transactions employed as a defense layer a little more before we make a final decision. It is undemocratic, to be sure, but maybe for those of us who can afford it it's still better than the cesspool we currently sift through on every major platform. Anyone aware of any platforms taking this approach?
Maybe the fediverse will help - by fragmenting networks attackers may have less incentive to attack a particular one.
Anyone aware of any major platforms taking this approach?
The Postal Service?
Sure, there's junk mail, but imagine how much junk mail there would be if it were delivered for free. It wasn't until phone calls became so cheap as to be "unlimited" that we ended up flooded with billions of junk calls.
Microtransactions (non-crypto, thankyouverymuch) would solve a certain number of today's problems.
I do think it would help, like even if a transaction cost 0.05c, it would add up very quickly for a bot operator but stay cheap for everyone else. But I think the problem is it would inevitably introduce the need for a middle man, shaving 0.01c off that 0.05c, with a dubious incentive to increase the amount of money changing hands as much as possible. What you've invented at that point is basically Cloudflare with worse incentives.
Yes for sure, I have thought about making an email "stamp" web-3 service that would implement this. I even wanted to make some fun "pony express" animations whenever a letter was arriving to your inbox.
Wasn't this the basis of bitcoin? Pre bitcoin, I remember some whitepaper suggesting that email clients should spend some minute amount of processing power solving a cryptographic problem for each email sent. The theory was that a legit client would barely notice, but a spammer would expend significant resources.
> The other alternatives all suck to the extent of my knowledge, they're either prohibitively convoluted, or web3 cryptocurrency micro-transaction nonsense that while sure it would work, also monetizes every single interaction in a way that is more dystopian than the actual skull-crushing robot apocalypse.
In the interest of practicality: There's a way to go the web3 route without being laden with transactions:
- Mint a fixed-cost non-transferrable NFT to an address, with an ownership limit of 1 per address.
- Use SIWE (sign-in with Ethereum) to verify ownership of address & therefore NFT.
- If malicious behaviour is detected, mark the NFT as belonging to a malicious actor at the server's end & block the account.
- Require non-malicious-marked NFTs in order to use the site/app.
At most, the user only has to perform 1 transaction (minting the non-transferrable NFT) on any blockchain network where the contract resides, and the costs to do so can be made cheap with Layer 2 networks (Polygon PoS, Arbitrum, Optimism, zkSync 2.0, etc.).
Can this be done entirely without web3? Yes, but the added friction imposed onto malicious actors to generate new addresses & mint new non-transferrable NFTs increases the costs for them considerably.
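A hedged sketch of that flow using ethers v5, with the RPC URL, contract address and denylist as placeholders; enforcing non-transferability would live in the NFT contract itself, which isn't shown here.

```typescript
import { ethers } from "ethers";

const RPC_URL = "https://polygon-rpc.example";                 // hypothetical
const PASS_NFT = "0x0000000000000000000000000000000000000000"; // hypothetical
const denylisted = new Set<string>(); // server-side "marked as malicious"

const provider = new ethers.providers.JsonRpcProvider(RPC_URL);
const passContract = new ethers.Contract(
  PASS_NFT,
  ["function balanceOf(address owner) view returns (uint256)"],
  provider
);

// message: the SIWE-style text the user signed; signature: from their wallet.
async function mayUseService(message: string, signature: string): Promise<boolean> {
  const address = ethers.utils.verifyMessage(message, signature).toLowerCase();
  if (denylisted.has(address)) return false; // previously marked as malicious
  const balance = await passContract.balanceOf(address);
  return balance.gt(0);                      // must hold the access NFT
}
```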
> If anyone could go ahead and find a solution to this mess, that would be great, because it's absolutely suffocating the internet, and it's painful to think about all the wonderful little projects that get cancelled or abandoned when faced with the reality of having to deal with such an egregiously hostile digital ecosystem.
In all honesty, there's no perfect solution, just hard-to-make tradeoffs: the prevention of botspam inherently requires tracking in some form, as there's no immediately-recognizable stateless solution for botspam tracking. Someone has to do the tracking to prevent botspam, which inherently involves state being changed in order to mark an actor as malicious.
Because you don't have experience with it. There's nothing complicated about SIWE, minting an NFT and checking its validity, certainly nothing that merits describing it as "prohibitively convoluted" aside from being scared of web3 keywords. Come on now.
Not commenting on op's solution's validity or effectiveness, just replying to your comment.
Do you mint one new NFT per site? Then you impose an excessive burden on users per new site they visit.
Do you mint one NFT per address? If blocking only applies to the one site, a malicious operator can just spam the next site using that address - after all, they own tons of addresses and have many sites to spam, and they can surely spam for at least a bit before getting caught (per site and per address).
If blocking is a public operation that gets you banned everywhere, well, now one callous server owner can disable your address's ability to access anything.
I fail to see how Web3 NFTs solve any of these problems…
In all honesty, there's limited tooling to resolve the botspam issue: there's nothing special about a human that a curated algorithm/botfarm can't sufficiently mimic. No technology is a panacea here.
The only way out is to increase the costs for bots to such an extent that it becomes costly for them to operate. The methodology for performing this is still up for debate, but tracking of some form will have to exist (be it publicly-collated or privately-monitored or some blend of both).
> - Mint a fixed-cost non-transferrable NFT to an address, with an ownership limit of 1 per address.
> - Use SIWE (sign-in with Ethereum) to verify ownership of address & therefore NFT.
> - If malicious behaviour is detected, mark the NFT as belonging to a malicious actor at the server's end & block the account.
> - Require non-malicious-marked NFTs in order to use the site/app.
In the real world that's called a paywall. Just requires users have an email address (wallet address) and proof of payment (NFT). Web3 contributes nothing but a different lingo for the existing concepts, yet has all of the same problems: nobody likes paywalls.
I run a website for a small company. The site has been around since the mid-1990s, and bots are a minor annoyance, but not a problem.
We also use some simple heuristics to reject obvious bot traffic.
One of the simplest is to have a form field that is hidden via CSS. Humans don't see it and it stays blank. Bots fill it in.
Bots tend to fill in every form field with random garbage, even checkboxes. Validating that checkboxes carry the expected value, rather than just checking they have a value, is another good way to detect bots.
Many bots have a hard time with CSRF tokens in hidden fields.
Many bots also don't handle session cookies properly. If someone submits a registration form without an existing session, we reject it. (So we don't get as far as checking the CSRF token.)
After a certain number of failed attempts to register or login, we block the IP for a period of time.
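Assembled in one place, those checks might look roughly like this Express sketch; field names, limits and the session flag are illustrative, not the parent's actual code.

```typescript
import express from "express";
import session from "express-session";
import { randomBytes } from "crypto";

const app = express();
app.use(express.urlencoded({ extended: false }));
app.use(session({ secret: "change-me", resave: false, saveUninitialized: false }));

const failures = new Map<string, number>(); // IP -> recent failed attempts

app.get("/register", (req, res) => {
  const s = req.session as any;
  s.sawForm = true;                         // proves the form page was loaded
  s.csrf = randomBytes(16).toString("hex"); // goes into a hidden form field
  res.send(`<form method="post">... hidden csrf=${s.csrf} ...</form>`);
});

app.post("/register", (req, res) => {
  const ip = req.ip ?? "unknown";
  const s = req.session as any;
  const reject = () => {
    failures.set(ip, (failures.get(ip) ?? 0) + 1);
    res.status(400).end();
  };

  if ((failures.get(ip) ?? 0) > 5) return res.status(429).end(); // temporary IP block
  if (!s.sawForm) return reject();             // no session: form page never loaded
  if (req.body.website_url) return reject();   // CSS-hidden honeypot field was filled
  if (req.body.tos !== "yes") return reject(); // checkbox must carry its exact value
  if (req.body.csrf !== s.csrf) return reject(); // hidden CSRF token must match session
  res.send("registered");
});

app.listen(8080);
```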
Tangential: two weeks ago (and for a while) our country's twitter-sphere (Chile) was completely and obviously dominated by bots. They were starting and inflating trending topics with absurd lies, spreading fear and chaos in favor of "Rechazo" (the option against our new constitution in the next ballot), or acting as echo chambers for republican and extreme-right-associated politicians. What happened? A self-organized group[1] started doing data analysis of the trending topics and delivering the results to the people, showing who was behind the campaigns and the synthetic likes. After this, prominent public figures from that sector started to cut funding for bot networks (because of the public shaming and media attention they were receiving), and it's so pathetic now: they can't even get more than 100 likes, and often the most popular response is a refutation, or the very same analysis showing the bot network at work, with substantially more organic likes. I think it is a very interesting phenomenon to watch.
Note that this lies/fear/chaos campaign is transversal, from rural AM radio to TikTok, but it is not working at all. People are very aware of these campaigns and know how to defend against them. Truth is stronger than money.
Simple answer is I don't know, but it appears to be happening to other search engines as well.
My best guess is they're assuming it's backed by google, and are attempting to poison its search term suggestions. The queries I've been getting are fairly long and highly specific, often within e-pharma or online casino or similarly sketchy areas.
Many of these bots are exceptionally simple - zero tailoring, just scriptkiddie copy/paste/run stuff.
They could simply iterate through a list of websites inputted by the user. Said list could be some curated list you find on blackhat forums, which is just a collection of websites with traffic over a certain threshold. Doesn't say anything about the content or what type of website, just the URL.
Then it does a simple operation to map the website - as well as checking for input forms. If an input form is detected, it does a POST operation. If there's an error, timeout, or whatever, it simply moves to the next one.
This of course sucks if you try to run a search engine - because even the simplest of bots will succeed when your website is essentially a single page, with the text input field right there on the front, and with no captcha or similar that would dissuade legit users.
I want to run a honeypot for doing more research on bots and the economics for them, but I get bogged down quickly in the planning stages. I should just start with a vulnerable wordpress site or something.
Just make a site with a Contact page, with a comment form that logs the details of every request (IP address, timestamp, message content, email provided). You'll get plenty of data for research, once the page has been indexed into the database the comment form spammers use. For bonus points, put the contact form at the bottom of every page of your website.
A couple of my toy/project websites accidentally became honeypots. Rather than shut down the comment forms, I now have those sites generate summary logfiles that I can upload daily to AbuseIPDB.
EDIT: Forgot to mention, also log the Referer field and User-Agent on each request. Very, very useful information for research and detection.
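In case it saves someone the setup, a minimal version of such a logging honeypot could look like this sketch; paths and field names are illustrative.

```typescript
import express from "express";
import { appendFileSync } from "fs";

const app = express();
app.use(express.urlencoded({ extended: false }));

app.post("/contact", (req, res) => {
  const record = {
    ts: new Date().toISOString(),
    ip: req.ip,
    referer: req.get("referer") ?? "",
    userAgent: req.get("user-agent") ?? "",
    email: String(req.body.email ?? ""),
    message: String(req.body.message ?? "").slice(0, 2000),
  };
  // One JSON object per line; easy to summarise daily for AbuseIPDB reports.
  appendFileSync("honeypot.ndjson", JSON.stringify(record) + "\n");
  res.send("Thanks, we'll be in touch."); // tell the bot what it wants to hear
});

app.listen(3000);
```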
There are multiple reasons - negative SEO, positive SEO, malware distribution, paid clicks, advertising and probably others I've forgotten at the moment.
> There has been upwards of 15 queries per second from bots. There is just no way to deal with that sort of traffic, barely even to reject it.
If the queries are not a megabit each, you're doing way too much processing before applying rate limiting. Rejecting traffic ought not to take more than 1-2 milliseconds, even if you need to look up an api key or IP address in the database.
I, too, host services on a residential connection: 50 mbps shared with other users. My domains must host hundreds of separate scripts, a few of which have a database attached (I can think of six off the top of my head, but there's over a hundred databases in mariadb so I'm sure there's more that I've forgotten about). This is a ten-year-old laptop with a regular "apt install mariadb", no special configs.
Yes, most traffic is bots, and yes sometimes they submit more than 1 q/s. But it comes nowhere near to exhausting resources to a noticeable extent. Load average is about 0.15, main peaks come from my own cronjobs. If you're having this much trouble rejecting traffic, you might want to spend some time looking at bottlenecks. You'll also notice the bots knock it off if it's unsuccessful.
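For the record, "reject before doing anything expensive" can be as dumb as a per-IP counter consulted before the request ever reaches the heavier backend. A sketch, with an arbitrary window and limit:

```typescript
// In-memory fixed-window counter: checking it costs microseconds, so the
// expensive work only runs for requests that pass.
const WINDOW_MS = 10_000;
const MAX_HITS = 20;
const hits = new Map<string, { count: number; windowStart: number }>();

function cheapReject(ip: string, now: number = Date.now()): boolean {
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
    return false; // allow
  }
  entry.count++;
  return entry.count > MAX_HITS; // true = answer with a 429 and stop there
}
```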
That's queries per second (as searches), not requests per second.
You know how blogs sometimes can't handle being on the hacker news front page from all the traffic? Well my search engine has survived that. That was 1-2 QPS.
Unmitigated bot-traffic is roughly 10x the traffic of a hacker news death hug, as a sustained load.
Yeah but WordPress is an extreme example. Every time a WP blog is posted to HN without a static-page-ifier (caching layer that basically turns the dynamic pages into static ones), it dies within minutes. Normal software doesn't seem to have that problem.
I traced it once, and I got to admit there was not an obvious bottleneck (this was 2015 or so). Just millions upon millions of calls into deeper and deeper layers for things like translations or themes. Wrapping mysql_query in a function that caches the result (to avoid doing identical queries) helped a few % I think, but aside from major changes like patching out the entire translation system for single-language sites, I didn't spot an obvious way to fix it. You'd need to spend a lot of time to optimize away the complexity that grew from suiting a million different needs, contributed by thousands of people across many years.
Because I want this search engine to exist, and I'm not a multimillionaire so I can't afford better hardware.
See, when it comes to not being able to find stuff on Google, you can either complain about it on the internet, or you can build a search engine yourself that allows you to find what you are looking for.
I found that my netlify site attracts a lot of spam specifically from the same spammer/group. The messages always start with some variation of “Hi my name is Eric”.
Netlify seem to not really care after I reported it on their support forum. The spammer disables JS, so no client-side protection works. I’ve recently decided to (unfortunately) break the ability for JS-disabled browsers to submit the contact form: the form elements' attributes are all wrong, meaning the form won't submit correctly; instead, some JS on page load sets the attributes to the correct values. I will wait a while and see if this solves it.
While Netlify does correctly mark all of this as spam, legitimate messages can sometimes end up there as false positives. So I have to check the vast amount of spam often.
Why do you consider them to be "the most evil"? Their services seem to be completely fine in almost every regard, and their communication doesn't at all suggest that they might be evil.
It depends on your audience and regions you're most interested in. But if you're aiming for the EU, gcore labs may be interesting. Akamai is not bad, but a bit enterprisey - I don't think they even had an official api the last time I used them?
Their paid one when you go over the limits. Not everything has to be black and white and reduced down to a single, oft repeated catch phrase like that.
Not totally unrelated, but I had to turn off email alerts and come up with a way to summarize things because Fail2Ban and other alert systems were hit quite literally every 15 seconds with port scans/attempted entries on SSH and other ports. Reporting the abuse to ARIN/ICANN didn't help because almost a full 95% of the traffic originated from China, and 90% of the remaining 5% was Russia. The remainder were zombies inside of America, typically on Digital Ocean, and I was able to get those handled quickly and efficiently. When I had a simple (secure) login system hosted on HTTPS, it was getting hit hard enough that my VPS ISP was sending emails to figure out a way to stop it. There are literally 3 people who even know of the existence of these services.
It is actually nuts just how much bot spam there is.
I wonder if a general solution could be to make the visit more computationally demanding to the visitor than to the host, e.g. some form of proof-of-work. I guess captchas already do that in some sense but they require the humans to do the work.
Now the author above has stated they dislike the crypto route, and I agree that the whole web3 idea is BS, but what if, in the case that spam of some form is detected by the server, it required the visitor to show some proof-of-work, combined with the "mining crypto in JS instead of ads" craze? That way the bot would need to put work in, which would slow it down, and at the same time it would pay for its own visit.
Of course no spam detection system is perfect, and it would also hit human users, but in their case it would just mean waiting a few seconds longer for the page to load.
> The other alternatives all suck to the extent of my knowledge, they're either prohibitively convoluted, or web3 cryptocurrency micro-transaction nonsense that while sure it would work, also monetizes every single interaction in a way that is more dystopian than the actual skull-crushing robot apocalypse.
Payment per request is the long term solution, but I completely disagree that it’s in any way dystopian. The trick is to set the fee so low that humans, who make few requests, aren’t really affected while bots, who make a large number of requests, become unprofitable.
It’s exactly the same solution as email spam: at a hundredth of a USD cent per email, spam emails would no longer be profitable while regular consumers would spend 10 cents per year (assuming they send 3 emails per day).
I have considered skipping the regular fingerprinting, geolocation, captcha, hashcash, email verification, payment required, etc... mitigations and instead requiring people to drop into a public chat room (or pm/chat the support team) to have their account activated.
The number of languages supported would be small to match whoever helped moderate this, but it would at least require speaking to someone. A PM thread or live chat would be an instant way to find out if someone can string two sentences together and might be worth allowing into the site. You could even have them create an account and solve a single captcha prior to getting access to the chat.
It's not perfect by any stretch, but might be worth exploring having humans-verify-humans.
> The only ones that can survive the robot apocalypse is large web services. Your reddits, and facebooks, and twitters, and SaaS-comment fields, and discords. They have the economies of scale to develop viable countermeasures, to hire teams of people to work on the problem full time and maybe at least keep up with the ever evolving bots.
I only agree when it comes to the system resources that can keep up with bots. When it comes to fighting spam, these services often do a terrible job because 1) their business model benefits from higher user & engagement numbers and 2) their monopoly status allows them to retain users even if their experience is degraded by the spam, something a small site often won't be able to do.
If one request to the site generates more revenue than it costs in resources, the bot problem is solved.
The author says that he is getting 15 bot requests to his site per second. That is about 36 million requests per month. How much does it cost to serve those? $1000 would seem high.
$1000/36M = $0.00003 per request.
How long would a crypto currency, that is suitable for mining in the browser, need to be mined before $0.00003 is generated?
If it turns out it is a few seconds or so, the solution would be nicely user-friendly. A few seconds of CPU time for access to the site. No ads needed to finance the site.
It is kind of telling that Bitcoin's proof of work started as a spam blocker. The original "hashcash" use case was to use proof of work to prevent email spam.
Satoshi Nakamoto almost certainly isn't Adam Back.
It might be enough for the request to require more resources from the requestor than from the server, even if it doesn't actually give the server any money. I mean the requestor probably isn't going to be willing to dedicate more hardware "horsepower" to taking your search engine down than you are to keeping it up. That was the idea behind Hashcash.
As for coins, the current Bitcoin hashrate is about 200 exahashes per second, down from a high of over 250 a couple of months ago, and the block reward is 6.25 BTC until probably June 02024. At a price of US$24000/BTC that's US$150k per block (plus a much smaller amount in transaction fees) or about US$1.25e-18 per hash. So your suggestion of US$3e-5 would require about 2e13 hashes. https://en.bitcoin.it/wiki/Non-specialized_hardware_comparis... says an overclocked ATI Radeon HD 6990 can do about 800 megahashes per second (8e8) so you're looking at about 3e4 seconds of compute on that card, about 8 hours.
Maybe one of the altcoins that uses a hash function with a smaller ASIC speedup would be a better fit, although I don't know enough about mining to know if there are any where GPUs are still competitive. Still, it seems like it might be more than a few seconds?
I have not seen any arguments yet for why Satoshi is not Adam.
I did say "crypto currency, that is suitable for mining in the browser" for exactly this reason: That Bitcoin is not well suited for it.
One would have to look at what typical consumer hardware is good at. Maybe an algorithm that saturates one CPU core with serial calculations that need fast access to exactly 1GB of RAM. I think consumer hardware is pretty good when it comes to single core performance and RAM access.
You'd only need Lightning integration and to pay a small amount of sats, recycling the Bitcoin proof of work instead of making more. You could even have the server transfer the same sats back and forth as a token, as long as the server side is happy.
Well it would "just" have to be integrated in browsers. :)
You don't have to buy it, you could crunch numbers for an equivalent cost if that's preferable. The advantage is that the effort can be stored and used later, so you need not even add a 2 second latency. Similar to "Privacy Pass".
As much as I hate the whole cryptocurrency hype myself, I think I agree that a proof-of-work requirement on spam detection that pays in the host's favour could help solve spam to some degree.
Coin Hive was an interesting solution before it became synonymous with crypto jacking. In order to post a comment you had to lend your CPU to mine for X seconds. The only true anonymous and frictionless micropayment system I've seen.
The only real solution to the abuse of anonymous protocols is to stop using anonymous protocols and use protocols where clients can be held accountable. But that's politically nonviable in the West.
I don't see any good reason why people can't be allowed to remain anonymous while still allowing website operators from taking measures to stop bot abuse. CAPTCHAs can already stop many bots. Other commenters have also mentioned that things like Proof of Work systems and micro-transactions could also stop bot abuse. These don't necessarily require giving up anonymity.
Website operators could still take measures to stop abuse from troll farms as well while still allowing people to remain anonymous. A website operator like Twitter for instance could perhaps require users to make a small micro-transaction before allowing someone to make a post. Some equilibrium for the cost of a post could probably be found where most legitimate users would still be willing to pay that cost but most troll farms would not.
Are you serious? The problematic troll farms are the ones backed by states and multinational corporations. Gating speech behind money only makes the problem worse.
The correct approach is to deanonymize reasonably "public" online behavior. This is the only way to hold abusers accountable, and, ironically, democratize free speech.
Yeah, I'm serious. Even with Twitter, for example, currently allowing accounts to be created and posts to be made for essentially free, real accounts and posts still outnumber those of troll farms and bots from what I've seen. If those troll farms and bots actually had to pay, I imagine there would be far less. I also imagine that those troll farms and bots are less influential than real people. I believe that the endgame is that if a website operator takes enough measures to stop troll farms and bots, the operators of those troll farms and bots will eventually run out of resources and be forced to curtail their activity.
You're right that gating speech behind money could potentially be bad and make problems worse but I only offered that as one suggestion. Instead of or in addition to using money, you could perhaps make a system that uses some type of karma/reputation for instance. Those could still be done anonymously.
It would be simpler to decentralize and implement webs of trust[1] (that locality would also help community-building / social cohesion).
Secure Scuttlebutt[1] doesn't have a moderation/spam problem, and it is completely decentralized, without monetary fees or proof-of-work. Why can't centralized services do better?
So you actually mentioned web3, but not the technology from it that is most interesting to me for this problem.
I’m hopeful that “decentralized identity” is a solution here.
In theory this would let people cryptographically prove whatever small fact about themselves they would like to share with a service provider like marginalia, without paying you, or sharing their identity, or anything else.
Eg: “I am a human”, or “I have written less than 100k words on the Internet”, or “my comments have an upvote average above zero”
Does anyone know how google/linkedin manage to block bots who are using SSO?
Trying to log in to a LinkedIn account using a Google account from an automated browser (like puppeteer + puppeteer-stealth or fakebrowser) will open an empty white window instead of the normal Google login window. It could be a limitation of those libraries, but I doubt it; it smells like something they detect. Looking into it might yield some interesting insights on how to limit modern bots.
> has been upwards of 15 queries per second from bots
What type of queries are they generating? For what purpose are they querying Marginalia? Scraping and filling internal search engines?
> If anyone could go ahead and find a solution to this mess
I would maybe try to investigate why they are querying your search engine. Is it for the search results? Maybe from there you can create and sell an API service. Is it for the wiki? Is it for research purposes?
I would love to see some data, raw or with some behavior derived from it.
Most of the queries don't seem to be tailored toward my search engine, they're ridiculously over-specified and typically don't return any results at all.
As I've mentioned in another comment, my best guess is they're betting it's backed by google, and are attempting to poison their search term suggestions. The queries I've been getting are fairly long and highly specific, often within e-pharma or online casino or similarly sketchy areas.
Like
> cialis 50mg online pharmacy canada price
Either that, or nonsense like the below, where they appear to be looking for CMSes to exploit (although I don't understand the appendage at the end)
> "Please enter the email address associated with your User account. Your username will be emailed to the email address on file." Finestre Antirumore Torino
> affordable local seo services "Din epostadress delas eller publiceras aldrig Obligatoriska flt r markerade med"
> "You are not logged in. (Login)" Country "City/Town" "Web page" erst
Point is, none of these queries actually return anything at all. I don't offer real full text search, for one. And the queries are much too long.
Tor users are often legitimate good internet citizens.
A lot of (lucky) us have the luxury to live in real democracies.
Some others live in countries that use every single aspect of their private lives (DPI, mass surveillance) to put pressure on them and bend them to the regime's will.
In my opinion, Tor and anonymity should not be killed as a result of silly bots.
> Tor users are often legitimate good internet citizens.
We have had exactly zero traffic from it at any point which was legitimate. Any user who ever showed up with a Tor exit IP ended up being banned eventually, so we just proactively fraud-banned anybody who uses one, and anybody related to them. There is zero value in allowing anonymizer traffic on your service, and a whole lot to lose.
It's funny the author mentions Facebook and Twitter, because the bot spam on both is quite apparent. The spam on the former has risen greatly, seemingly mostly from India and parts of Africa. Scam air duct cleaning posts, posts about hacked account recovery, and random other crap. It really degrades the experience of the Internet, IMHO.
To those who have been recently pondering the history of antivirus companies of the 90s and 00s, and suspiciously wondering how they were always able to so quickly come up with definitions for the newest infections, this all feels so familiar. What a sad world we live in, sometimes.
Captchas were designed to solve this (and only this, as opposed to requiring them merely to view content like modern ignorant web devs like to do [yes, I know some web devs now require it to make sure the people they're datamining are real, but this is a new practice from this year, basically]).
Public services should be implemented via decentralized P2P. Static content is solved by IPFS, Freenet, etc. Dynamic content perhaps can only be solved with smart contracts, which would be less bad than Cloudflare if they weren't expensive, as they still provide protocol conformance (unlike Cloudflare, which requires you to have your packets look like a big-4 browser), anonymity (yeah, pseudonyms, but you can still make one per query), etc. Without smart contracts, many interactive applications are still possible.
> The other alternatives all suck to the extent of my knowledge, they're either prohibitively convoluted, or web3 cryptocurrency micro-transaction nonsense that while sure it would work, also monetizes every single interaction in a way that is more dystopian than the actual skull-crushing robot apocalypse.
Centralized web hosting is and always was unsustainable, and this is the reason most web content is commercial garbage; the problem will only get worse. My concern was always what kind of garbage boomer protocol will become the new standard. I sure as hell don't want something that looks like email, the web, or UN*X.
But you said "barely even to reject it", rejecting 15 QPS should not be heavy on any resource, right? Or is the actual problem identifying the bot traffic?
I’ve noticed that a lot of old popular forums disappeared in recent years, but I didn’t realize it was possibly due to bots. Why is that? I assumed that the admins just got tired of running them and moderating them.
It's more complicated than just bots; competition from Reddit is another factor, but bot traffic was certainly a significant part of the problem. The constant drive-by exploits and ceaseless comment spam drove up the amount of work needed to operate a forum as a hobby to basically a full-time job. With waning visitor numbers, it simply became untenable.
My trick is to have one field that should always be blank and one field that should always have a value; this stops all automated bots. No "CAPTCHA" needed.
End game: everything runs on US-owned services, and all users need to identify to be allowed to even raise a finger, so that "bad actors" can be kept out.
All while we blame Russia and China, and say that their spambots and evil actions forced us to do this.
I disagree with TFA's take on dealing with spam — giving up!
For our app, we don't deal with spam in any novel way. We use honey pot, SFS, and Akismet.
However, by far the easiest way to stop spammers is a post queue. Lots of spammers will just create a burner account, fire off their spam, and start over. Given no actual reputation, give them the trust they deserve — none.
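A sketch of that queue gate, with made-up thresholds, just to show how little machinery it needs:

```typescript
// Accounts with no earned reputation don't publish directly; their posts
// land in a moderation queue until a human approves something of theirs.
interface Account { id: string; reputation: number }
interface Post { author: Account; body: string }

const moderationQueue: Post[] = [];
const publicFeed: Post[] = [];

function submitPost(post: Post): "published" | "queued" {
  // Burner accounts start at zero reputation and never reach the feed.
  if (post.author.reputation < 3) {
    moderationQueue.push(post);
    return "queued";
  }
  publicFeed.push(post);
  return "published";
}
```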
The other factor is building out a fast backend. Besides benefiting your own users, it also means Googlebot or Ahrefsbot won't absolutely cripple your site when they come knocking. Sometimes that is doable, sometimes not.
Specific scenarios require creative solutions. For a search engine, how do you differentiate between robots and legitimate users? It seems a rate limiting step is likely the best solution.
If query rate from the same IP exceeds a threshold, throttle them creatively. +100ms the next time, +250ms the next, etc.
The upside is these bots will adjust their strategies to hit your site slower, which is the whole point, isn't it?
If they spread requests across IPs, perhaps try fingerprinting. I'm not sure how effective that is on the backend though.
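The escalating-delay part could be as simple as this sketch; the schedule and reset policy are illustrative.

```typescript
// Each repeat offence from the same IP waits a bit longer before being served.
const delaySchedule = [0, 100, 250, 500, 1000, 2000]; // milliseconds
const offences = new Map<string, number>();

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function throttled<T>(ip: string, handler: () => Promise<T>): Promise<T> {
  const n = offences.get(ip) ?? 0;
  const delay = delaySchedule[Math.min(n, delaySchedule.length - 1)];
  offences.set(ip, n + 1); // in practice, decay or reset this counter periodically
  if (delay > 0) await sleep(delay);
  return handler();
}
```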
I will reiterate what I had been saying on HN for years:
1) The problem is centralization. Yes DNS is federated but there is a central registry. This means anyone can spam you@yourdomain.com or visit your web server listening for HTTP connections at www.domain.com
2) DNS is a glorified search engine. Human-readable domain names are only needed for dictating a domain name (and listeners often make mistakes anyway). They only map to a small fraction of URLs, namely the ones with a "/" path name. For most others, the human readability adds little benefit.
3) Start using URIs that are not human readable. The titles, favicons and other metadata of resources should simply be cached, and displayed to the user in their own bookmarks, search engines or whatever. For Javascript environments, variables can easily hold non human readable URIs. Also QR codes can resolve to non human readable URIs.
4) There may be some cookie policy for third party hostnames etc. but just make them non human readable also.
5) We should have DHT or other decentralized systems for routing, and here is the key… in this system, you need a capability issued by the website / mailbox owner in order for your message to be routed to them. If the capability is compromised and used to get a ton of SPAM, they simply revoke that specific capability (key).
For HTTP websites you can already implement it on your side by signing the keys / capabilities, i.e. session cookie values, with an HMAC, and there is no need to even do network I/O to verify them; you can upload the whitelist to the edges and check them there easily (see the sketch after this list).
But going further, for new routing protocols, IP addresses should be removed after the first hop in the DHT, because the global routing system will send traffic there otherwise. See how SAFE network does it.
6) I don’t need a “real names policy” or “blue checkmark”. I can know who “The Real Bill Gates (TM)” is through some verified claims by Twitter or someone else. Just because I have the email billgates@microsoft.com doesn’t mean I should be able to email him. There can be many Bill Gateses. The names are just verified claims by some third party. Here on HN we don’t have names or photos, and it works just fine.
7) Most of the celebrity culture, papparazzi, Elon Musk and Donald Trump moving markets and tweeting at 5am to 5 million people at once, are problems of centralization. Both a 1 to many megaphone and a many to 1 inbox. Citizens United is just a symptom of the problem. I have spoken about this (privately owning access to an audience) with Noam Chomsky in an interview I did a year ago:
Fox News (Rupert Murdoch), CNN (Ted Turner), Twitter (Elon or Jack), Facebook (Zuck) are controlled by only a few people. Channels on youtube, telegram, podcasts etc are controlled by a few people. This leads to divisions in society, as outrage clickbait rises to the top. Nonprofit models based on collaboration like Wikipedia, Wikinews, Open Source and Science produce far more balanced and benign information for the public.
In short we need alternatives to celebrity culture, DNS and other systems that centralize decision making in the hands of a few, or create firehoses and megaphones. Neither the celebrity nor the public actually enjoy the results.
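As a minimal sketch of the HMAC-signed capability from point 5 (key handling and revocation are simplified, and the names are made up): the token carries its own proof of validity, so an edge can check it without any network I/O.

```typescript
import { createHmac, timingSafeEqual } from "crypto";

const SECRET = "server-side-secret"; // distribute to your edges, rotate as needed
const revoked = new Set<string>();   // capability ids you've had to burn

function mintCapability(id: string): string {
  const mac = createHmac("sha256", SECRET).update(id).digest("hex");
  return `${id}.${mac}`;
}

function checkCapability(token: string): boolean {
  const [id, mac] = token.split(".");
  if (!id || !mac || revoked.has(id)) return false;
  const expected = createHmac("sha256", SECRET).update(id).digest("hex");
  const a = Buffer.from(mac, "hex");
  const b = Buffer.from(expected, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}

const cap = mintCapability("reader-7421");
console.log(checkCapability(cap)); // true
revoked.add("reader-7421");
console.log(checkCapability(cap)); // false once revoked
```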
So there’s one service keeping this search engine online, and it’s probably doing it for free, and the author can’t even think of a better way to do it.
Yet Cloudflare still gets two paragraphs of complaints in the face? Because the author wants to “own” something instead of “renting”?
I'm doing it for free because I don't want this to be a commercial service. I get that HN is startup city, but I'm not running a startup, it's just a hobby.