NoML Open Letter (noml.info)
4 points by righthand 4 months ago | 5 comments



I'd give this very low odds of becoming relevant. robots.txt works overall because both sides benefit: the author gets to say what shouldn't be indexed, and the search engine gets to skip pages that aren't worth indexing or spending crawl time on.
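To make the mechanism concrete, here's a minimal sketch of how a well-behaved crawler consults robots.txt before fetching, using Python's standard-library parser (the bot name and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Feed the rules directly rather than fetching them over HTTP,
# so the sketch is self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler checks each URL before requesting it.
print(rp.can_fetch("ExampleBot", "https://example.com/public/page"))   # True
print(rp.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
```

Note that nothing enforces this check; the crawler opts in because skipping disallowed pages also saves it work, which is exactly the incentive alignment being discussed.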

For this proposal, the author gets to mark what they don't want processed, and the other side gets... ?


That's the problem with this kind of complaint generally. Some authors (a small but vocal group, anyway) want their work indexed, searchable, and publicly available with no friction, yet they also want to control how it's used. Anyone can run a locked-down site that prohibits ML training (futile against a determined adversary, but it would prevent casual crawling), but they don't want that; they want all the benefits that come from the public internet. Releasing something publicly means giving up some control. That's reality.


The only problem is that the other side already ignores robots.txt. What is the difference between processing and indexing something to return it as the full original result versus returning it in a permutation? The point of robots.txt is to control indexing, and the LLM/ML businesses are already ignoring it completely, because they're "different" and don't have to play by the rules because AGI.
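For context, this is what opting out currently looks like in practice: a robots.txt fragment using the user-agent tokens that some major AI crawlers have published (e.g. OpenAI's GPTBot, Common Crawl's CCBot). As the comment notes, compliance is entirely voluntary:

```
# robots.txt: request that specific AI crawlers skip the whole site.
# Honoring these tokens is voluntary; nothing enforces it.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```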


> What is the difference in processing and indexing something to be returned as a full original result vs being returned in a permutation?

The difference is the work required to implement that system and the cost of losing the data. The request is basically "please make your job harder and your results worse because we don't like it." I'm not making a judgment here about what's right or what should be legal. I'm saying there's zero incentive for the companies to comply.


Exactly, which makes robots.txt effectively useless. You can block the IPs, and the models will probably scrape you through a proxy anyway. All of robots.txt rests on a promise, and there's no incentive to actually comply.




