The big takeaway here is that Google's dominance over the web (and that of advertising in general) is going away.
This is because the only way to stop the bots is with a captcha, and a captcha also blocks the search indexers. If search engines can't index sites, they stop providing any value.
There's probably going to be a lag as the knowledge baked into current LLMs dries up, because no one can scrape the web in an automated fashion anymore.
I actually envision Lotka-Volterra-style predator-prey cycles, like wolf and rabbit populations. In this scenario, we're the rabbits. Human content will increase when the AI population decreases, thus providing more food for the AI, which will then increase. That drowns out human expression, and the humans grow quieter. That provides less fodder for the AI, so it decreases. That means less noise, and the humans grow louder. The cycle repeats ad nauseam.
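For reference, and only as a loose analogy, the standard Lotka-Volterra equations capture exactly that cycle, with x as human content (the prey) and y as AI output (the predator); the Greek letters are just rate constants:

    \frac{dx}{dt} = \alpha x - \beta x y    % prey: grows on its own, gets consumed by the predator
    \frac{dy}{dt} = \delta x y - \gamma y   % predator: grows by consuming prey, otherwise decays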
I've thought along similar lines for art: what ecological niches are there where AI can't participate, where training data is harder or uneconomical to pull, and where humans can flourish?
Agreed, it seems inevitable. Unfortunately I think it will also result in further centralization & consolidation into a handful of "trusted" megacorps.
If you thought browser fingerprinting for ad tracking was creepy, just wait until they're using your actual fingerprint.
Google is already scraping your site and presenting answers directly in search results. If I cared about traffic (hence selling ad space), why would I want my site indexed by Google at all anymore? Lots of advertising-supported sites are going to go dark because only bots will visit them.
It will entrench established search engines even more if sites have to move to auth-based crawling, where the only crawlers allowed are the ones you invite. Most people will do this for Google, Bing, and maybe one or two others, if there is a simple tool to do so.
What about the next generation of AI that can sign up autonomously? Even if we implemented auth walls everywhere right now, what's stopping these companies from hiring some very cheap labor to create accounts on websites and using them to scrape the content?
Is it going to become another arms race, like adblocker -> adblocker detector -> adblocker-detector bypass, and so on?
This couldn't be further from the truth. The ad business is not going anywhere. It will grow even bigger.
OpenAI is going through its initial cycle of enshittification. Google is too big right now. Once they establish dominance you will have to sit through five unskippable ads between prompts, even on a paid plan.
I solved this problem for myself. Most of my web projects use client-side processing. I moved to GitHub Pages, so clients can use my projects with no downtime. The pages use SQLite as the data source: the browser first downloads the SQLite database file, then uses it to display the data client-side.
The stated problem was about indexing, accessing content and advertising in that context.
> I solved this problem for myself. Most of my web projects use client-side processing. I moved to GitHub Pages, so clients can use my projects with no downtime. The pages use SQLite as the data source: the browser first downloads the SQLite database file, then uses it to display the data client-side.
That is not really a solution. Since typical indexing still works for the masses, your approach is currently unusual. But in the end, bots will be able to read any web page content a human can read, and we are back at the original problem of trying to tell bots apart from humans. It's the only way.
Can we not just have a whitelist of allowed crawlers and ban the rest by default? Then places like DuckDuckGo and Google can provide a list of IP addresses that their crawlers will come from, and we simply don't include major LLM providers like OpenAI.
How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.
Only if you operate at the scale of Cloudflare etc. can you see which IP addresses are hitting a large number of servers in a short time span.
(I am pretty sure the next step is that they hand out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)
I fear the only solutions in the end are CDNs, making visits expensive using challenges, or requiring users to log in.
How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.
Or an open, regularly updated list of IPs identified as belonging to AI companies, which firewalls can easily pull in? (Same idea as open-source AV.)
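Whether you treat such a list as an allowlist (as suggested above) or a blocklist, the core primitive on the server side is just matching a visitor's IP against published CIDR ranges. A rough sketch in C, IPv4 only, using placeholder documentation ranges rather than any real crawler IPs:

    #include <arpa/inet.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Placeholder entries -- a real deployment would load the published
     * ranges for the crawlers it trusts (or wants to block). */
    static const char *allowlist[] = { "192.0.2.0/24", "198.51.100.0/24" };

    /* Return true if dotted-quad `ip` falls inside the CIDR block `cidr`. */
    static bool ip_in_cidr(const char *ip, const char *cidr)
    {
        char net[INET_ADDRSTRLEN];
        int prefix;
        if (sscanf(cidr, "%15[^/]/%d", net, &prefix) != 2) return false;
        if (prefix < 0 || prefix > 32) return false;

        struct in_addr a, n;
        if (inet_pton(AF_INET, ip, &a) != 1 || inet_pton(AF_INET, net, &n) != 1)
            return false;

        uint32_t mask = prefix == 0 ? 0 : htonl(~0u << (32 - prefix));
        return (a.s_addr & mask) == (n.s_addr & mask);
    }

    int main(void)
    {
        const char *visitor = "192.0.2.42";   /* would come from the request */
        bool allowed = false;
        for (size_t i = 0; i < sizeof allowlist / sizeof *allowlist; i++)
            if (ip_in_cidr(visitor, allowlist[i]))
                allowed = true;
        printf("%s is %s\n", visitor, allowed ? "on the list" : "not on the list");
        return 0;
    }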
This sort of positive security model with behavioural analysis is the future. We need to get it built into Apache, Nginx, Caddy, etc. The trick is telling crawlers apart from users. It can be done, though.
AI is good at solving captchas. But even if everyone added a captcha, search engines would continue indexing, because it is easy to add authentication that lets a search engine bypass the captcha: Google would just need to publish a public key.
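A sketch of how that could work, assuming (hypothetically) that the crawler signs something like the method, URL, and date of each request with Ed25519 and the site verifies it against the engine's published public key. The payload scheme here is made up; it just shows that verification is cheap with libsodium (link with -lsodium):

    #include <sodium.h>
    #include <stdio.h>
    #include <string.h>

    /* Verify a detached Ed25519 signature over whatever the crawler signed. */
    int crawler_is_authentic(const unsigned char pk[crypto_sign_PUBLICKEYBYTES],
                             const unsigned char sig[crypto_sign_BYTES],
                             const char *signed_payload)
    {
        return crypto_sign_verify_detached(sig, (const unsigned char *)signed_payload,
                                           strlen(signed_payload), pk) == 0;
    }

    int main(void)
    {
        if (sodium_init() < 0) return 1;

        /* Demo only: generate a keypair and sign locally to show the round trip.
         * In reality the public key would be fetched from the engine's site. */
        unsigned char pk[crypto_sign_PUBLICKEYBYTES], sk[crypto_sign_SECRETKEYBYTES];
        crypto_sign_keypair(pk, sk);

        const char *payload = "GET /some/page 2025-01-01";   /* hypothetical scheme */
        unsigned char sig[crypto_sign_BYTES];
        crypto_sign_detached(sig, NULL, (const unsigned char *)payload, strlen(payload), sk);

        printf("verified: %s\n", crawler_is_authentic(pk, sig, payload) ? "yes" : "no");
        return 0;
    }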
This is fine, as Google's utility as a search engine has turned into a hot pile of garbage, at least for my use cases. Where a decade ago I could put in a few keywords and get relevant results, I now have to guide it with several "quoted phrases" and -exclusions just to find what I'm looking for on the second or third results page. It has crumbled under its own weight, and seems to surface irrelevant trash first and foremost because it comes from some big player or content farm. Either their algorithm is tuned for mass manipulation or they lost the arms race with the SEO cretins (or both).
Granted, I'm not looking forward to some LLM condensing all the garbage and handing me a Definitive Answer (TM) based on the information it deems relevant for inclusion.
> I expect the models will continue improving though,
How? They've already been trained on all the code in the world at this point, so that's a dead end.
The only other option I see is increasing the context window, which has diminishing returns already (double the window for a 10% increase in accuracy, for example).
This makes no sense. Claude 3.7 Sonnet is better than Claude 3.5 Sonnet and it’s not because it’s trained on more of the world’s code. The models are improving in a variety of ways, whether by being larger, faster, using the same number of parameters more effectively, better RLHF techniques, better inference-time compute techniques, etc.
> The models are improving in a variety of ways, whether by being larger, faster, using the same number of parameters more effectively, better RLHF techniques, better inference-time compute techniques, etc.
I didn't say they weren't improving.
I said there's diminishing returns.
There's been more effort put into LLMs in the last two years than in the two years prior, but the gains in the last two years have been much much smaller than in the two years prior.
That's what I meant by diminishing returns: the gains we see are not proportional to the effort invested.
One way is mentioned in the article: expanding and improving MCP integrations, giving the models the tools to work more effectively, within their limitations, on problems in the context of the full system.
I don't even bother with `error1`, `error2`, ... `errorN`.
I initialise all pointers to NULL at the top of the function and use `goto cleanup`, which cleans up everything that is not being returned ... because `free(some_ptr)` where `some_ptr` is NULL is perfectly legal.
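For anyone who hasn't seen it, a minimal sketch of that pattern (the struct and field names are made up):

    #include <stdlib.h>
    #include <string.h>

    struct thing { char *name; int *values; };

    /* Returns a fully-built thing, or NULL on any allocation failure. */
    struct thing *thing_create(const char *name, size_t n)
    {
        /* Every pointer starts as NULL so the cleanup path is always safe. */
        struct thing *t = NULL;
        char *name_copy = NULL;
        int *values = NULL;

        t = malloc(sizeof *t);
        if (!t) goto cleanup;

        name_copy = strdup(name);
        if (!name_copy) goto cleanup;

        values = calloc(n, sizeof *values);
        if (!values) goto cleanup;

        t->name = name_copy;
        t->values = values;
        return t;                 /* success: nothing left to clean up */

    cleanup:
        /* free(NULL) is a no-op, so allocations that never happened
         * are harmless to pass here. */
        free(values);
        free(name_copy);
        free(t);
        return NULL;
    }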
What part of the world are you from? I ask because this is the first that I've heard that `greater` and `grader` are pronounced the same and now I am curious what country you are in.
For everything in this list, there is at least one word that is not pronounced the same as the other two.
> greater grater grader
> baron barren bearing
> grisly grizzly gristly
> pedal peddle petal
> I also put since with cense, cents, scents, sense
I'm from South Carolina, USA and I pronounce 'greater' and 'grader' the same. There is a subtle difference and that difference can be more noticeable sometimes, but most of the time I'm saying them the same.
For everything in this list, it's incredibly common for these groupings to have the same pronunciation where I live.
Words that are words backwards, but are not palindromes. My boss is awesome: when I find a new one at 2:00 am and excitedly text him, he congratulates me.
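For the curious, hunting for these (semordnilaps) is a nice little exercise: sort a word list, then for each word check whether its reverse is also present. A quick sketch, assuming the usual /usr/share/dict/words exists on your machine:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp(const void *a, const void *b)
    {
        return strcmp(*(char *const *)a, *(char *const *)b);
    }

    int main(void)
    {
        FILE *f = fopen("/usr/share/dict/words", "r");
        if (!f) return 1;

        char **words = NULL;
        size_t count = 0, cap = 0;
        char line[128];

        /* Slurp the word list into memory, one word per entry. */
        while (fgets(line, sizeof line, f)) {
            line[strcspn(line, "\n")] = '\0';
            if (count == cap) {
                cap = cap ? cap * 2 : 1024;
                words = realloc(words, cap * sizeof *words);
                if (!words) return 1;
            }
            words[count++] = strdup(line);
        }
        fclose(f);

        qsort(words, count, sizeof *words, cmp);

        for (size_t i = 0; i < count; i++) {
            size_t len = strlen(words[i]);
            char rev[128];
            for (size_t j = 0; j < len; j++)
                rev[j] = words[i][len - 1 - j];
            rev[len] = '\0';

            /* Skip palindromes, and print each pair only once. */
            if (strcmp(words[i], rev) >= 0)
                continue;
            char *key = rev;
            if (bsearch(&key, words, count, sizeof *words, cmp))
                printf("%s / %s\n", words[i], rev);
        }
        return 0;
    }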
> But first, ask yourself why you are designing a binary format, unless maybe it's a new media container.
> When would someone ever want a binary file that's not zip, SQLite, or version controllable text?
Maybe I'm not getting the humour here, but in case you are being serious, binary files do have a few advantages over text formats (there's a toy sketch after the list).
1. Quick detection (say, for dispatching to a handler)
2. Rapid serialisation, both into and out of a running program (program state, in-memory data, etc)
3. Better and safer handling of binary data (no clunky roundtrips of binary blobs to text and back again)
4. Much better checksumming.
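To make points 1 and 2 concrete, here's a toy sketch of the sort of format I mean: a four-byte magic for quick detection, a version field, then the record array dumped straight to disk. (A real format would pin down endianness and struct padding; this is just the shape of it.)

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define MAGIC "BLB1"               /* 4-byte magic for quick detection */

    struct header { char magic[4]; uint32_t version; uint32_t count; };
    struct record { double x, y; uint32_t flags; };

    /* Write: one fwrite for the header, one for the whole record array. */
    int save(const char *path, const struct record *recs, uint32_t count)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        struct header h = { {0}, 1, count };
        memcpy(h.magic, MAGIC, 4);
        fwrite(&h, sizeof h, 1, f);
        fwrite(recs, sizeof *recs, count, f);
        return fclose(f);
    }

    /* Read: check the magic before touching anything else (rejection is a
     * 4-byte compare), then slurp the records back in one call. */
    int load(const char *path, struct record *recs, uint32_t max)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        struct header h;
        if (fread(&h, sizeof h, 1, f) != 1 || memcmp(h.magic, MAGIC, 4) != 0) {
            fclose(f);
            return -1;
        }
        uint32_t n = h.count < max ? h.count : max;
        size_t got = fread(recs, sizeof *recs, n, f);
        fclose(f);
        return (int)got;
    }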
Binary files are useful, but binary files that aren't either zip, sqlite, or a media container seem pretty niche.
It makes sense for model weights, media, and opaque blobs where you never need to load just part of the file, but I see a lot of custom binary save files that don't seem to make any sense.
If it's a server, everything is probably in a database, and if it's a desktop app, eventually something is going to make an 8GB file and it's probably going to be slow unless you have indexing.
People are also likely to want to update the file incrementally.
And if you're sure nobody will ever make a giant file, then version-controllability is probably something someone will want anyway.
> This is because the only way to stop the bots is with a captcha, and a captcha also blocks the search indexers. If search engines can't index sites, they stop providing any value.
> There's probably going to be a lag as the knowledge baked into current LLMs dries up, because no one can scrape the web in an automated fashion anymore.
It'll all burn down.