I don’t think we should lump together “AI company scraping a website to train their base model” and “AI tool retrieving a web page because I asked it to”. At least, those should be two different user agents so you have the option to block one and not the other.
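Given two documented tokens, blocking one and not the other is straightforward. A minimal sketch in Express, assuming the vendor publishes separate tokens the way OpenAI does ("GPTBot" for training crawls, "ChatGPT-User" for user-initiated fetches):

    import express from "express";

    const app = express();

    app.use((req, res, next) => {
      const ua = req.headers["user-agent"] ?? "";
      if (ua.includes("GPTBot")) {
        // training crawler: deny
        return res.status(403).send("Training crawls not permitted");
      }
      // "ChatGPT-User" (a user-initiated fetch) and everything else falls through
      next();
    });

    app.get("/", (_req, res) => { res.send("hello"); });
    app.listen(3000);

The same split works in robots.txt, for vendors that actually honour it.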
But my question should have been phrased, “are there any frameworks commonly in use these days that provide different JS payloads to different clients?”
I’ve been out of that part of the biz for a very long time so this could be a naive question.
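If memory serves, the pattern would look roughly like the sketch below; the bundle names are invented, and modern setups more often switch by capability (module/nomodule script tags) than by user-agent sniffing:

    import express from "express";

    const app = express();

    app.get("/app.js", (req, res) => {
      const ua = req.headers["user-agent"] ?? "";
      const legacy = /MSIE|Trident/.test(ua); // crude old-IE sniff
      // serve a different JS payload depending on the client
      res.sendFile(legacy ? "bundle.legacy.js" : "bundle.modern.js", { root: "dist" });
    });

    app.listen(3000);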
What, users won't share anything? I said I wanted Perplexity to identify themselves in the user agent instead of using the generic "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36" they're using right now for the "non-scraper bot".
I don't, because if they do, someone like the author of the article will do the obnoxious thing and ban it. We've been there before, 30 years ago. That's why all browsers' user agent strings start with "Mozilla".
The "scumbag AI company" in question is making money by offering me a way to access information while skipping any and all attention economy bullshit you may have on your site, on top of being just plain more convenient. Note that the author is confusing crawling (which is done with documented User Agent and presumably obeys robots.txt) with browsing (which is done by working as one-off user agent for the user).
As for why this behavior is considered obnoxious, I refer you to 30 years' worth of arguing about it; it has been discussed ever since the User-Agent header was first added and then used by someone to discriminate against visitors based on their browsers.
If you want summaries from my website, go to my website. I want a way to deny any licence to any third-party user agent that will apply machine learning on my content, whether you initiated the request or not.
While Perplexity may be operating against a particular URL based on a direct request from you, they are acting improperly when they "summarize" a website: they have an implicit (and sometimes explicit, if there's a paywall) licence to read and render the content as provided, but not to process and redistribute it.
There needs to be something stronger than robots.txt, where I can specify the uses permitted by indirect user access (in my case, search indexing would be the only permitted use case; no LLM training, no LLM summarization, no proxying, no "sanitization" by parental proxies, etc.).
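To make that concrete, here is a purely hypothetical sketch of such a grant; nothing like this is standardized today, and every field name below is invented for illustration:

    // A machine-readable statement of permitted indirect uses,
    // as a hypothetical successor to robots.txt.
    type ContentPolicy = {
      permittedUses: Array<"search-indexing" | "llm-training" | "llm-summarization" | "proxying">;
      licence: string;
      contact?: string;
    };

    const policy: ContentPolicy = {
      permittedUses: ["search-indexing"], // everything else denied
      licence: "All rights reserved; first-party human browsing only",
      contact: "mailto:owner@example.com",
    };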
> If you want summaries from my website, go to my website.
I will. Through Perplexity. My lifespan is limited, and I have better ways to spend it than digging out information while you make a buck from making me miserable (otherwise there isn't much reason to complain, other than some anti-AI ideology stance).
> I want a way to deny any licence to any third-party user agent that will apply machine learning on my content, whether you initiated the request or not.
That's not how the Internet works. Allowing for that would mean killing user-generated content sites, optimizing proxies, corporate proxies, online viewers and editors, caches, possibly desktop software too.
Also, my browser probably already does some ML on the side anyway. You'd catch a lot of regular browsing this way.
Ultimately, the rules of the road are what they always have been: whatever your publicly accessible web server spouts out on a request is fair game for the requester to consume however they like, in part or entirely. If you want to limit access for particular tools or people, put up a goddamn paywall. All the noise about scraping and such is attention-economy players trying to have their cake and eat it too. As a user in, i.e. a victim of, the attention economy, I don't feel much sympathy for that plight.
Also:
> LLMs — and more importantly the companies that train and operate them — should not be trusted at all, especially for so-called "summarization"
That's not your problem. That's my problem. If I use a shitty tool from a questionable vendor to parse your content, that's on me. You should not care. In fact, being too interested in what I use for my Internet consumption can be seen as surveillance, which is not nice.
I addressed this in a different response: I do not care if your browser does local ML or if there is an extension which takes content that you have already downloaded and applies ML on it (as long as the results of the ML on my licensed content are not stored in third party services without respecting my licence). I do care that an agent controlled by a third party (even if it is on your behalf) browses instead of you browsing.
My goal is to licence my content for first party use, not third party derived use.
Your statement "Ultimately, the rules of the road are what they always have been: whatever your publicly accessible web server spouts out on a request is fair game for the requester to consume however they like" is both logically and legally incorrect in pretty much every single jurisdiction in the world, even if it cannot be controlled as such without expensive legal proceedings.
> > LLMs — and more importantly the companies that train and operate them — should not be trusted at all, especially for so-called "summarization"
> That's not your problem. That's my problem. If I use a shitty tool from a questionable vendor to parse your content, that's on me. You should not care. In fact, being too interested in what I use for my Internet consumption can be seen as surveillance, which is not nice.
Actually, it is my problem, because it's my words that have been badly summarized.
If the LLM provides a so-called summary that is the exact opposite of what I wrote (as the link I shared previously shows happens), and that summary is then used to write something about what I supposedly wrote, then I have been misrepresented at best.
I have a moral right to work that I have created (under Canadian law and most European laws) to ensure that my work is not misrepresented. The best way that I can do that is to forbid its consumption by machine learning companies, including Perplexity.
> The moral rights include the right of attribution, the right to have a work published anonymously or pseudonymously, and the right to the integrity of the work. The preserving of the integrity of the work allows the author to object to alteration, distortion, or mutilation of the work that is "prejudicial to the author's honor or reputation". Anything else that may detract from the artist's relationship with the work even after it leaves the artist's possession or ownership may bring these moral rights into play. Moral rights are distinct from any economic rights tied to copyrights. Even if an artist has assigned his or her copyright rights to a work to a third party, he or she still maintains the moral rights to the work.
Of course, Perplexity operates under the Wild West of copyright law where they and their users truly do not give one whit about the damage they cause. Eventually, this will be their downfall, because they are going to find themselves on the wrong side of legal judgements for their unwillingness to play by rules that have been in place for a fairly long time.
Personally, I don't even think that's the issue. I'd prefer a correct user agent; that's just common decency and shouldn't be an issue for most.
What I do expect the AI companies to do is to check the license of the content they scrape and follow it. Let's say I run a blog under a CC BY-NC 4.0 license. You can train your AI on that content, as long as it's non-commercial. Otherwise you'd need to contact me and negotiate an appropriate license, for a fee. Or you can train your AI on my personal GitHub repo, where everything is ISC; that's fine. But for my work, which is GPLv3, you have to ensure that the code your LLM returns is also under the GPLv3. Do any of the AI companies check the license of ANYTHING?
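To illustrate what checking the license would even mean mechanically, a rough sketch; the SPDX ids are real, but the mapping is my own simplification and obviously not legal advice:

    const nonCommercialOnly = new Set(["CC-BY-NC-4.0"]);
    const copyleft = new Set(["GPL-3.0-only", "GPL-3.0-or-later"]);
    const permissive = new Set(["ISC", "MIT"]);

    function mayIngest(spdxId: string, opts: { commercial: boolean }): boolean {
      if (nonCommercialOnly.has(spdxId)) return !opts.commercial; // NC: non-commercial use only
      if (copyleft.has(spdxId)) return false; // only ok if outputs stay GPL, which nobody verifies
      return permissive.has(spdxId); // permissive: fine either way
    }

    console.log(mayIngest("CC-BY-NC-4.0", { commercial: true })); // false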
What I gathered from the post was that one of the investigations was to ask what was on [some page url] and then, checking the logs moments later, see it fetching the page with a normal user agent.
You can just point it at a webserver and ask it a question like "Summarize the content at [URL]" with a sufficiently unique URL that no one else would hit, maybe one containing a UUID. This is also explored in the article itself.
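Concretely, something like this: the path is minted fresh, so any hit on it in the log has to be the AI tool (the nginx log location is an assumption about a typical setup):

    import { randomUUID } from "node:crypto";

    const canary = `https://example.com/canary/${randomUUID()}`;
    console.log(`Ask the tool: "Summarize the content at ${canary}"`);
    // then check who fetched it, e.g.:
    console.log(`grep "${new URL(canary).pathname}" /var/log/nginx/access.log`);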
In my testing they're using crawlers on AWS that do not parse JavaScript or CSS, so it is sufficient to serve some kind of interstitial challenge page like Cloudflare's, or you can build your own.
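A homegrown version can be as small as this sketch, assuming (as observed) that the crawler never executes JavaScript. The static cookie value is trivially spoofable, so treat it as a demonstration rather than a defence:

    import express from "express";

    const app = express();

    app.use((req, res, next) => {
      if (req.headers.cookie?.includes("challenge=ok")) return next();
      // No cookie yet: serve the challenge. Real browsers run the script and reload.
      res.send(`<script>document.cookie = "challenge=ok; path=/"; location.reload();</script>Checking your browser...`);
    });

    app.get("/", (_req, res) => { res.send("actual content"); });
    app.listen(3000);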
> Is it actually retrieving the page on the fly though?
They are able to do so.
> How do you know this?
The access logs.
> Even if it were - it’s not supposed to be able to.
There is a distinction between data used to train a model, which is gathered by the indexing bot with the custom user-agent string, and the user-query input given to the aforementioned AI model. When you ask an AI some question, you normally input text into a form, and that text is sent to the AI model where the magic happens. In this scenario, instead of inputting a wall of text into a form, the text is coming from a URL.
These forms of user input are equivalent, and yet distinctly different. Therefore it's intellectually dishonest for the OP to claim the AI is indexing them, when the OP is asking the AI to fetch their website to augment or add context to the question being asked.
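Roughly, the flow being described is this (askModel is a stand-in for whatever API the vendor actually uses):

    // stand-in for the vendor's real model API
    async function askModel(prompt: string): Promise<string> {
      return `model answer based on: ${prompt.slice(0, 60)}...`;
    }

    async function answer(question: string, url: string): Promise<string> {
      const page = await (await fetch(url)).text(); // the one-off "browsing" hit in the logs
      // the fetched text is appended to the prompt, exactly as if pasted into the form
      return askModel(`${question}\n\nContext fetched from ${url}:\n${page}`);
    }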
To steel man this, even though I think the article did a fine job already, maybe the author could’ve changed the content on the page so you would know if they were serving a cached response.
Author here. The page I asked it to summarize was posted after I implemented all blocking on the server (and robots.txt). So they should not have had any cached data.