> And, of course, nobody has known to opt-out by blocking AppleBot-Extended unti...

fotta · 2024-06-10T22:46:14.000000Z

From your own link:

> Controlling data usage

> In addition to following all robots.txt rules and directives, Apple has a secondary user agent, Applebot-Extended, that gives web publishers additional controls over how their website content can be used by Apple.

> With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.

ziml77 · 2024-06-10T22:52:52.000000Z

But it also says that Applebot-Extended doesn't crawl webpages; instead, this marker is only used to determine what can be done with the pages that were visited by Applebot.

Not that I like an opt-out system, but based on the wording of the docs it is true that if you blocked Applebot then blocking Applebot-Extended isn't necessary.

fotta · 2024-06-10T22:55:27.000000Z

Yeah, that is true, but I suspect that most publishers that want their content to appear in search but not be used for model training will not have blocked Applebot to date (hence the original commenter's argument).

threeseed · 2024-06-10T22:49:48.000000Z

Might want to actually read it:

Applebot-Extended does not crawl webpages.

They gave this as an additional control to allow crawling for search but blocking for use in models.
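
For anyone wondering what that looks like in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URL are my own illustration, not copied from Apple's docs, and since Applebot-Extended doesn't crawl, this only shows how the two rule groups parse, not how Apple applies them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: stay in search, opt out of model training.
# The more specific group is listed first because Python's parser
# takes the first group whose agent name matches (by substring).
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: Applebot
Allow: /
"""


def is_allowed(agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the rules above permit `agent` to use `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Applebot", "Applebot-Extended"):
        print(agent, "allowed" if is_allowed(agent) else "blocked")
```

So the search crawler stays allowed while the training marker is blocked, which is the split described above.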

fotta · 2024-06-10T22:50:55.000000Z

> There is no AppleBot-Extended. And if you blocked it in the past it remains blocked.

You said there is no Applebot-Extended. The link says otherwise.

ziml77 · 2024-06-10T22:56:27.000000Z

It's still true that there's no Applebot-Extended if it isn't crawling pages. Rather, it's a marker to ask Applebot to limit what it does with your pages.

thomasahle · 2024-06-10T23:23:07.000000Z

Isn't it still true that if people wanted to have their website show up in search in the past (so they didn't block Applebot), then it's too late to mark it as "no training" now, since it's already been scraped?

I guess it can be useful for data published in the future.