
The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.

This is because the only way to stop the bots is with a captcha, and a captcha also stops search indexers from crawling your site. The result will be search engines that no longer index sites, and hence no longer provide any value.

There's probably going to be a small lag as the knowledge in current LLMs dries up, because no one can scrape the web in an automated fashion anymore.

It'll all burn down.


I actually envision Lyapunov-style stability, like wolf and rabbit populations. In this scenario, we're the rabbits. Human content will increase when AI populations decrease, thus providing more food for the AI, which will then increase. This drowns out human expression, and the humans grow quieter. That provides less fodder for the AI, and they decrease. This means less noise, and the humans grow louder. The cycle repeats ad nauseam.

Until broken by the Butlerian Jihad, "Thou shalt not make a machine in the likeness of the mind of man."

I've thought along similar lines for art: what ecological niches are there where AI can't participate, where training data is harder or uneconomical to pull, and where humans can flourish?

Anything we humans deem private in nature from other humans.


If the logistic driving parameter is large enough it can also lead to complete chaos.
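For the curious, here's a minimal C sketch of that logistic map x' = r*x*(1 - x): with a small driving parameter the population settles down, while for r near 4 the iterates bounce around chaotically (the specific r values below are just illustrative).

    #include <stdio.h>

    /* Iterate the logistic map x_{n+1} = r * x_n * (1 - x_n) for a few
       driving parameters r, then print some iterates after the transient. */
    int main(void)
    {
        const double rs[] = { 2.5, 3.2, 3.9 };

        for (int k = 0; k < 3; k++) {
            double r = rs[k];
            double x = 0.5;

            for (int n = 0; n < 100; n++)   /* let the transient die out */
                x = r * x * (1.0 - x);

            printf("r = %.1f:", r);
            for (int n = 0; n < 6; n++) {   /* a few post-transient iterates */
                x = r * x * (1.0 - x);
                printf(" %.4f", x);
            }
            /* r=2.5 settles to a fixed point, r=3.2 to a 2-cycle, r=3.9 is chaotic */
            printf("\n");
        }
        return 0;
    }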

IMO this was one of the real motives for Web Environment Integrity. Allow Google to index but nobody else.

We're kind of stuck between a rock and a hard place here. Which do you prefer, entrenched incumbents or affordable/open hosting?


I’m supremely confident that attestation will arrive in one form or another in the near future.

Anonymous browsing and potentially-malicious bots look identical. This was sort of OK up until now.


Agreed, it seems inevitable. Unfortunately I think it will also result in further centralization & consolidation into a handful of "trusted" megacorps.

If you thought browser fingerprinting for ad tracking was creepy, just wait until they're using your actual fingerprint.


does indeed sound like we're headed right back to AOL. At least this time it'll be faster? Certainly won't be as charming.

Google is already scraping your site and presenting answers directly in search results. If I cared about traffic (hence selling ad space), why would I want my site indexed by Google at all anymore? Lots of advertising-supported sites are going to go dark because only bots will visit them.

It will entrench established search engines even more if sites have to move to auth-based crawling, so that the only crawlers are those you invite. Most people will do this for Google, Bing, and maybe one or two others if there is a simple tool for it.

What about the next generation of AI that will be able to sign up autonomously? Even if we put auth walls everywhere right now, what's stopping these companies from getting some very cheap labour to create accounts on websites and then using those accounts to scrape the content?

Is it going to become another race like the adblocker -> detect adblocker -> bypass adblocker detector and so on...?


> The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.

The AI companies with the best anti-captcha mechanics will win, and they will inject ads into LLM output in more sophisticated ways.


This could not be further from the truth. The ad business is not going anywhere. It will grow even bigger.

OpenAI is going through its initial cycle of enshittification; Google is too big right now. Once they establish dominance, you will have to sit through 5 unskippable ads between prompts, even on a paid plan.

I solved this problem for myself. Most of my web projects use client-side processing, and I moved to GitHub Pages, so clients can use my projects with no downtime. The pages use SQLite as the source of data: the browser first downloads the SQLite database, then uses it to display data on the client side.

Example 'search' project: https://rumca-js.github.io/search


The stated problem was about indexing, accessing content and advertising in that context.

> I solved this problem for myself. Most of my web projects use client-side processing, and I moved to GitHub Pages, so clients can use my projects with no downtime. The pages use SQLite as the source of data: the browser first downloads the SQLite database, then uses it to display data on the client side.

> Example 'search' project: https://rumca-js.github.io/search

That is not really a solution. Since typical indexing still works for the masses, your approach is currently unique. But in the end, bots will be capable of reading web page content if a human is capable of reading it, and we are back at the original problem of trying to tell bots apart from humans. It's the only way.


Can we not just have a whitelist of allowed crawlers and ban the rest by default? Then places like DuckDuckGo and Google can provide a list of IP addresses that their crawlers will come from. Then simply don't include major LLM providers like OpenAI.

How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.

Only if you operate at the scale of Cloudflare etc. can you see which IP addresses are hitting a large number of servers in a short time span.

(I am pretty sure the next step will be handing out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)

I fear the only solutions in the end are CDNs, making visits expensive using challenges, or requiring users to log in.


How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.

Google publishes the IP addresses that Googlebot uses. If someone claims to be Googlebot but is not coming from one of those addresses, it's a fake.
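A rough C sketch of that check, assuming you've pulled Google's published crawler ranges into a local list (the two CIDR ranges hardcoded below are illustrative, not the full published set):

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    struct cidr { const char *net; int bits; };

    /* Illustrative entries only -- fetch the real, current list from Google. */
    static const struct cidr googlebot_ranges[] = {
        { "66.249.64.0", 19 },
        { "192.178.5.0", 27 },
    };

    /* Does the IPv4 address fall inside the given CIDR range? */
    static int in_range(uint32_t addr_be, const struct cidr *c)
    {
        struct in_addr net;
        if (inet_pton(AF_INET, c->net, &net) != 1)
            return 0;
        uint32_t mask = c->bits == 0 ? 0 : 0xFFFFFFFFu << (32 - c->bits);
        return (ntohl(addr_be) & mask) == (ntohl(net.s_addr) & mask);
    }

    int is_googlebot_ip(const char *ip)
    {
        struct in_addr a;
        if (inet_pton(AF_INET, ip, &a) != 1)
            return 0;
        for (size_t i = 0; i < sizeof googlebot_ranges / sizeof *googlebot_ranges; i++)
            if (in_range(a.s_addr, &googlebot_ranges[i]))
                return 1;
        return 0;
    }

    int main(void)
    {
        printf("%d\n", is_googlebot_ip("66.249.66.1"));  /* inside 66.249.64.0/19 -> 1 */
        printf("%d\n", is_googlebot_ip("203.0.113.7"));  /* not in the list -> 0 */
        return 0;
    }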

And in that case both systems end up in a situation where new entrants can't enter.

I don't see how that helps the case where the UA looks like a normal browser and the source IP looks residential.

How about if they claim to be Google Chrome running on Windows 11, from a residential IP address? Is that a human or an AI bot?


The problem is many crawlers pretend to be humans. So to ban the rest of the crawlers by default, you'll have to ban humans.

I am pretty sure a number of crawlers are running inside the mobile apps of phone users so they can get residential IP pools.

This is scary!

You can have a whitelist for allowed users and ban everyone else by default, which I think is where this will eventually take us.

Or an open, regularly updated list of IPs identified as belonging to AI companies, which firewalls can easily pull from? (Same idea as open-source AV.)

This sort of positive security model with behavioural analysis is the future. We need to get it built in to Apache, Nginx, Caddy, etc. The trick is telling crawlers apart from users. It can be done, though.

AI is good at solving captchas. But even if everyone added a captcha, search engines would continue indexing, because it is easy to add authentication that lets search engines skip the captcha: Google would just need to publish a public key.
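A purely hypothetical C sketch of what that could look like on the site's side, using libsodium's Ed25519 detached signatures; the payload and key here are made up for illustration, and nothing like this is actually standardised:

    #include <sodium.h>
    #include <stdio.h>

    /* The crawler's published public key (all-zero here; purely illustrative). */
    static const unsigned char crawler_pk[crypto_sign_PUBLICKEYBYTES] = { 0 };

    /* Returns 1 if the detached signature over the payload verifies against
       the published key, i.e. the request really came from that crawler. */
    int request_is_signed_crawler(const unsigned char *payload, size_t payload_len,
                                  const unsigned char sig[crypto_sign_BYTES])
    {
        return crypto_sign_verify_detached(sig, payload, payload_len, crawler_pk) == 0;
    }

    int main(void)
    {
        if (sodium_init() < 0)
            return 1;

        const unsigned char payload[] = "GET /page HTTP/1.1";
        unsigned char bogus_sig[crypto_sign_BYTES] = { 0 };

        /* A bogus signature fails, so this request would still get the captcha. */
        printf("verified: %d\n",
               request_is_signed_crawler(payload, sizeof payload - 1, bogus_sig));
        return 0;
    }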

This is fine, as Google's utility as a search engine has turned into a hot pile of garbage, at least for my use cases. Where a decade ago I could put in a few keywords and get relevant results, I now have to guide it with several "quoted phrases" and -exclusions to get the result I'm looking for on the second or third result page. It has crumbled under its own weight, and seems to suggest irrelevant trash to me first and foremost because it's the website of some big player or content farm. Either their algorithm is tuned for mass manipulation or they lost the arms race with SEO cretins (or both).

Granted, I'm not looking forward to some LLM condensing all the garbage and handing me a Definitive Answer (TM) based on the information it deems relevant for inclusion.


> Indeed a “CSS rule” is already a thing and it has nothing to do with lines.

Shouldn't make a difference; we've had the `<hr>` element (horizontal rule) since before CSS, after all.


honestly didn't know hr actually stood for that, huh

Not to worry. CRISPR will give us the talking cows from the Restaurant at the End of the Universe.

> I expect the models will continue improving though,

How? They've already been trained on all the code in the world at this point, so that's a dead end.

The only other option I see is increasing the context window, which has diminishing returns already (double the window for a 10% increase in accuracy, for example).

We're at a local maximum here.


This makes no sense. Claude 3.7 Sonnet is better than Claude 3.5 Sonnet and it’s not because it’s trained on more of the world’s code. The models are improving in a variety of ways, whether by being larger, faster, using the same number of parameters more effectively, better RLHF techniques, better inference-time compute techniques, etc.

> The models are improving in a variety of ways, whether by being larger, faster, using the same number of parameters more effectively, better RLHF techniques, better inference-time compute techniques, etc.

I didn't say they weren't improving.

I said there's diminishing returns.

There's been more effort put into LLMs in the last two years than in the two years prior, but the gains in the last two years have been much much smaller than in the two years prior.

That's what I meant by diminishing returns: the gains we see are not proportional to the effort invested.


You said we're in a local maximum. Your comment was at odds with itself.

One way is mentioned in the article: expanding and improving MCP integrations to give the models the tools to work more effectively within their limitations on problems in the context of the full system.

I don't even bother with `error1`, `error2`, ... `errorN`.

I initialise all pointers to NULL at the top of the function and use `goto cleanup`, which cleans up everything that is not being returned ... because `free(some_ptr)` where `some_ptr` is NULL is perfectly legal.
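A minimal C sketch of that pattern (the function and the allocations are made up for illustration):

    #include <stdlib.h>
    #include <string.h>

    char *build_greeting(const char *name)
    {
        char *tmp    = NULL;   /* every pointer starts out NULL */
        char *buf    = NULL;
        char *result = NULL;

        tmp = malloc(strlen(name) + 1);
        if (!tmp)
            goto cleanup;
        strcpy(tmp, name);

        buf = malloc(strlen(tmp) + sizeof "hello, ");
        if (!buf)
            goto cleanup;
        strcpy(buf, "hello, ");
        strcat(buf, tmp);

        result = buf;          /* ownership passes to the caller ...     */
        buf = NULL;            /* ... so the cleanup below won't free it */

    cleanup:
        free(tmp);             /* free(NULL) is a no-op, nothing to track */
        free(buf);
        return result;
    }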


What part of the world are you from? I ask because this is the first that I've heard that `greater` and `grader` are pronounced the same and now I am curious what country you are in.

For everything in this list, there is at least one word that is not pronounced the same as the other two.

> greater grater grader

> baron barren bearing

> grisly grizzly gristly

> pedal peddle petal

> I also put since with cense, cents, scents, sense

> steal steel still

> peal peel pill


I'm from South Carolina, USA and I pronounce 'greater' and 'grader' the same. There is a subtle difference and that difference can be more noticeable sometimes, but most of the time I'm saying them the same.

For everything in this list, it's incredibly common for these groupings to have the same pronunciation where I live.


Words that are words, backwards, but are not palindromes. My boss is awesome, and when I find a new one at 2:00am and excitedly text him, he congratulates me.

Not, nut, tub, bard, trap, ...
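A quick C sketch for hunting these, assuming a system word list at /usr/share/dict/words (each pair gets printed in both orders):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_WORDS 500000
    #define MAX_LEN   32

    static char words[MAX_WORDS][MAX_LEN];
    static size_t nwords;

    static int cmp(const void *a, const void *b)
    {
        return strcmp((const char *)a, (const char *)b);
    }

    int main(void)
    {
        FILE *f = fopen("/usr/share/dict/words", "r");
        if (!f)
            return 1;
        while (nwords < MAX_WORDS && fgets(words[nwords], MAX_LEN, f)) {
            words[nwords][strcspn(words[nwords], "\n")] = '\0';
            nwords++;
        }
        fclose(f);
        qsort(words, nwords, MAX_LEN, cmp);          /* sort so we can bsearch */

        for (size_t i = 0; i < nwords; i++) {
            char rev[MAX_LEN];
            size_t len = strlen(words[i]);
            for (size_t j = 0; j < len; j++)
                rev[j] = words[i][len - 1 - j];
            rev[len] = '\0';
            /* keep it if the reversal is a *different* word that's also in the list */
            if (len > 1 && strcmp(rev, words[i]) != 0 &&
                bsearch(rev, words, nwords, MAX_LEN, cmp))
                printf("%s / %s\n", words[i], rev);
        }
        return 0;
    }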


I'd do `/aaa<enter>cw<replacement text>`

They're called TLV (Tag, Length, Value) and are used extensively in payment transaction systems.
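For anyone who hasn't seen the format, a toy C sketch of walking a TLV buffer, assuming the simplest possible encoding (1-byte tag, 1-byte length); real EMV/payment TLV uses multi-byte tags and lengths, so this only shows the shape of the idea:

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Walk a buffer of TLV records laid out as: [tag][length][value bytes...]. */
    static void walk_tlv(const uint8_t *buf, size_t len)
    {
        size_t i = 0;
        while (i + 2 <= len) {
            uint8_t tag  = buf[i];
            uint8_t vlen = buf[i + 1];
            if (i + 2 + vlen > len)
                break;                               /* truncated record, stop */
            printf("tag 0x%02X, %u byte(s) of value\n", tag, (unsigned)vlen);
            i += 2 + (size_t)vlen;                   /* jump to the next record */
        }
    }

    int main(void)
    {
        /* two made-up records: tag 0x5A with 3 value bytes, tag 0x57 with 2 */
        const uint8_t sample[] = { 0x5A, 0x03, 0x11, 0x22, 0x33,
                                   0x57, 0x02, 0xAA, 0xBB };
        walk_tlv(sample, sizeof sample);
        return 0;
    }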

> But first, ask yourself why you are designing a binary format, unless maybe it's a new media container.

> When would someone ever want a binary file that's not zip, SQLite, or version controllable text?

Maybe I'm not getting the humour here, but in case you are being serious, binary files do have a few advantages over text formats.

1. Quick detection (say, for dispatching to a handler)
2. Rapid serialisation, both into and out of a running program (program state, in-memory data, etc.)
3. Better and safer handling of binary data (no clunky round-trips of binary blobs to text and back again)
4. Much better checksumming.


Binary files are useful, but binary files that aren't either zip, sqlite, or a media container seem pretty niche.

It makes sense for model weights and media and opaque blobs where you don't need to load just a part of it, but I see a lot of custom binary save files that don't seem to make any sense.

If it's a server, everything is probably in a database, and if it's a desktop app, eventually something is going to make an 8GB file and it's probably going to be slow unless you have indexing.

People are also likely to want to incrementally update the file as well.

If you're sure nobody will ever make a giant file, then VCS-ability is probably something someone will want.


> But if we follow that logic then any compiler-specific feature of either gcc or clang is fair game, even if it's not standard.

Well, yeah...

How do you think Annex K got in?

