I see a lot of traffic I can tell are bots based on the URL patterns they access...

echelon · 2024-12-30T18:03:44 1735581824

You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills.

Maybe you don't even need a full LLM. Just a simple transformer that inverts negative and positive statements, changes nouns such as locations, and subtly nudges the content into an erroneous state.

marcus0x62 · 2024-12-30T18:18:00 1735582680

Self plug, but I made this to deal with bots on my site: https://marcusb.org/hacks/quixotic.html. It is a simple markov generator to obfuscate content (static-site friendly, no server-side dynamic generation required) and an optional link-maze to send incorrigible bots to 100% markov-generated non-sense (requires a server-side component.)

gagik_co · 2024-12-30T20:50:23 1735591823

This is cool! It'd have been funny for this to become mainstream somehow and mess with LLM progression. I guess that's already happening with all the online AI slop that is being re-fed into its training.

gs17 · 2024-12-30T18:40:57 1735584057

I tested it on your site and I'm curious, is there a reason why the link-maze links are all gibberish (as in "oNvUcPo8dqUyHbr")? I would have had links be randomly inserted in the generated text going to "[random-text].html" so they look a bit more "real".

marcus0x62 · 2024-12-30T18:46:13 1735584373

Its unfinished. At the moment, the links are randomly generated because that was an easy way to get a bunch of unique links. Sooner or later, I’ll just get a few tokens from the markov generator and use those for the link names.

I’d also like to add image obfuscation on the static generator side - as it stands now, anything other than text or html gets passed through unchanged.

tivert · 2024-12-30T21:27:55 1735594075

> You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

> Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills.

I agree, and not just to discourage them running up traffic bills. The end-state of what they hope to build is very likely to be extremely for most regular people [1], so we shouldn't cooperate in building it.

[1] And I mean end state. I don't care how much value you say you get from some AI coding assistant today, the end state is your employer happily gets to fire you and replace you with an evolved version of the assistant at a fraction of your salary. The goal is to eliminate the cost that is our livelihoods. And if we're lucky, in exchange we'll get a much reduced basic income sufficient to count the rest of our days from a dense housing project filled with cheap minimum-quality goods and a machine to talk to if we're sad.

danlugo92 · 2025-01-02T18:24:46 1735842286

If your employer can run their companies without employees in the future it also means you can have your own company with no employees.

If anything this will level the playing field, and creativity will prevail.

tivert · 2025-01-02T20:04:51 1735848291

> If your employer can run their companies without employees in the future it also means you can have your own company with no employees.

No, you still need money. Lots of money.

> If anything this will level the playing field, and creativity will prevail.

That's a fantasy. The people that already have money will prevail (for the most part).

tyre · 2024-12-30T18:06:44 1735582004

Their problem is they can’t detect which are bots in the first place. If they could, they’d block them.

echelon · 2024-12-30T18:12:02 1735582322

Then have the users solve ARC-AGI or whatever nonsense. If the bots want your content, they'll have to solve $3,000 of compute to get it.

Tostino · 2024-12-30T18:35:06 1735583706

That only works until The benchmark questions and answers are public. Which they necessarily would be in this case.

EVa5I7bHFq9mnYK · 2025-01-01T10:39:37 1735727977

Or maybe solve a small sha2(sha2()) leading zeroes challenge, taking ~1 second of computer time. Normal users won't notice, and bots will earn you Bitcoins :)

endofreach · 2024-12-30T21:04:06 1735592646

> Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills

Or just wait for after the AI flood has peaked & most easily scrapable content has been AI generated (or at least modified).

We should seriously start discussing the future of the public web & how to not leave it to big tech before it's too late. It's a small part of something i am working on, but not central. So i haven't spend enough time to have great answers. If anyone reading this seriously cares, i am waiting desperately to exchange thoughts & approaches on this.

jorvi · 2025-01-01T04:25:51 1735705551

Very tangential but you should check out the old game “Hacker BS Replay”.

It’s basically about how in 2012, with the original internet overrun by spam, porn and malware, all the large corporations and governments got together and created a new, tightly-controlled clean internet. Basically how modern Apple & Disneyland would envision the internet. On this internet you cannot choose your software, host your own homepage or have your own e-mail server. Everyone is linked to a government ID.

We’re not that far off:

- SaaS

- Gmail blocking self-hosted mailservers

- hosting your own site becoming increasingly cumbersome, and before that MySpace and then Meta gobbled up the idea of a home page a la GeoCities.

- Secure Boot (if Microsoft locked it down and Apple locked theirs, we would have been screwed before ARM).

- Government ID-controlled access is already commonplace in Korea and China, where for example gaming is limited per day.

In the Hacker game, as a response to the new corporate internet, hackers started using the infrastructure of the old internet (“old copper lines”) and set something up called the SwitchNet, with bridges to the new internet.

llm_trw · 2024-12-30T18:25:44 1735583144

You will be burning through thousands of dollars worth of compute to do that.

lazide · 2025-01-02T03:55:34 1735790134

The biggest issue is at least 80% of internet users won’t be capable of passing the test.

araes · 2025-01-03T22:20:27 1735942827

Agree. The bots are already significantly better at passing almost every supposed "Are You Human?" test than the actual humans. "Can you find the cars in this image?" Bots are already better. "Can you find the incredibly convoluted text in this color spew?" Bots are already better. Almost every test these days is the same "These don't make me feel especially 'human'. Not even sure what that's an image of. Are there even letters in that image?"

Part of the issue, the humans all behaved the same way previously. Just slower.

All the scraping, and web downloading. Humans have been doing that for a long time. Just slower.

It's the same issue with a lot of society. Mean, hurtful humans, made mean hurtful bots.

Always the same excuses too. Company / researchers make horrible excrement, knowing full well its going harm everybody on the world wide web. Then claim they had no idea. "Thoughts and prayers."

The torture that used to exist on the world wide web of copy-pasta pages and constant content theft, is now just faster copy-pasta pages and content theft.

kmoser · 2024-12-31T03:22:11 1735615331

My cheap and dirty way of dealing with bots like that is to block any IP address that accesses any URLs in robots.txt. It's not a perfect strategy but it gives me pretty good results given the simplicity to implement.

Capricorn2481 · 2024-12-31T20:12:31 1735675951

I don't understand this. You don't have routes your users might need in robots.txt? This article is about bots accessing resources that other might use.

IncreasePosts · 2024-12-31T21:44:01 1735681441

It seems better to put fake honeypot urls in robots.txt, and block any up that accesses those.

trod1234 · 2025-01-01T23:50:41 1735775441

Blocking will never work.

You need to impose cost. Set up QoS buckets, slow suspect connections down dramatically (almost to the point of timeout).

Capricorn2481 · 2024-12-31T22:32:44 1735684364

Ah I see

Beijinger · 2025-01-01T01:19:37 1735694377

How can I implement this?

kmoser · 2025-01-01T21:16:29 1735766189

Too many ways to list here, and implementation details will depend on your hosting environment and other requirements. But my quick-and-dirty trick involves a single URL which, when visited, runs a script which appends "deny from foo" (where foo is the naughty IP address) to my .htaccess file. The URL in question is not publicly listed, so nobody will accidentally stumble upon it and accidentally ban themselves. It's also specifically disallowed in robots.txt, so in theory it will only be visited by bad bots.

aorth · 2025-01-01T05:55:09 1735710909

Another related idea: use fail2ban to monitor the server access logs. There is one filter that will ban hosts that request non-existent URLs like WordPress login and other PHP files. If your server is not hosting PHP at all it's an obvious sign that the requests are from bots that are probing maliciously.

acheong08 · 2024-12-31T14:09:10 1735654150

TLS fingerprinting still beats most of them. For really high compute endpoints I suppose some sort of JavaScript challenge would be necessary. Quite annoying to set up yourself. I hate cloudflare as a visitor but they do make life so much easier for administrators

petre · 2025-01-02T13:46:23 1735825583

You rate limit them and then block the abusers. Nginx allows rate limiting. You can then block them using fail2ban for an hour if they're rate limited 3 times. If they get blocked 5 times you can block them forever using the recidive jail.

I've had massive AI bot traffic from M$, blocked several IPs by adding manual entries into the recidive jail. If they come back and disregard robots.txt with disallow * I will run 'em through fail2ban.

herbst · 2025-01-02T17:14:03 1735838043

Whatever M$ was doing still baffles me. I still have several azure ranges in my blocklist because whatever this was appeared to change strategie once I implemented a ban method.

petre · 2025-01-02T20:20:19 1735849219

They were hammering our closed ticketing system for some reason. I blocked an entire C block and an individual IP. If needed I will not hesitate banning all their ranges, which means we won't get any mail from Azure, M$ office 365, since this is also our mail server. But scew'em, I'll do it anyway until someone notices, since it's clearly abuse.

newsclues · 2024-12-30T17:52:55 1735581175

The amateurs at home are going to give the big companies what they want: an excuse for government regulation.

throwaway290 · 2024-12-30T17:56:18 1735581378

If it doesn't say it's a bot and it doesn't come from a corporate IP it doesn't mean it's NOT a bot and not run by some "AI" company.

bodantogat · 2024-12-30T18:29:30 1735583370

I have no way to verify this, I suspect these are either stealth AI companies or data collectors, who hope to sell training data to them

datadrivenangel · 2024-12-30T19:22:57 1735586577

I've heard that some mobile SDKs / Apps earn extra revenue by providing an IP address for VPN connections / scraping.

odo1242 · 2025-01-02T12:49:33 1735822173

Chrome extensions too

int_19h · 2025-01-01T01:57:31 1735696651

Don't worry, the governments are perfectly capable of coming up with excuses all on their own.