It's funny I posted the inverse of this. As a web publisher, I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.
But when Perplexity crawls my content in response to a user question, they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable. A tool that runs on-device (like Reader mode) is different, because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator, and I will never be able to get people directly on my content.
There are many benefits to having people visit your content on a property that you own. E.g., say you are a SaaS company and you have a bunch of Help docs. You can analyze traffic in this section of your website to get insights to improve your business: what are the top search queries from my users? This might indicate where they are struggling or what new features I could build. In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them, and I would lose all those insights because I never get any traffic.
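For what it's worth, the standard opt-out mechanism here is robots.txt. Perplexity documents "PerplexityBot" as its crawler's user agent, though honoring the file is voluntary on the crawler's part (which is the whole dispute). A minimal sketch:

```
# robots.txt (served at the site root)
# Block Perplexity's documented crawler while leaving other indexing alone.
User-agent: PerplexityBot
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Allow: /
```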
> they are decreasing the probability that this user would come to my content (via Google, for example).
Google has been providing summaries of stuff and hijacking traffic for ages.
I kid you not, in the tourism sector this has been a HUGE issue, we have seen 50%+ decrease in views when they started doing it.
We paid gazillions to write quality content for tourists about the most different places, just so Google could put it on their homepage.
It's just depressing. I'm more and more convinced that the age of regulation and competition is gone; the US does want to have unkillable monopolies in the tech sector, and we are all peons.
> Google has been providing summaries of stuff and hijacking traffic for ages.
Yes, Google hijacked images for some time. But in general there has "always" been the option to tell Google not to display summaries etc. with meta tags.
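For example (these are Google-documented robots directives; which values to use depends on how much you want to allow):

```html
<!-- Opt the page out of text snippets entirely -->
<meta name="robots" content="nosnippet">

<!-- ...or cap snippet length instead (0 also disables snippets) -->
<meta name="robots" content="max-snippet:50">

<!-- Keep the page indexed, but out of the cached copy -->
<meta name="robots" content="noarchive">
```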
I'm curious about the tourism sector problem. In tourism, I would think the goal would be to promote a location. You want people to be able to easily discover the location, get information about it, and presumably arrange to travel there. If Google gets the information to the users but doesn't send the tourist to the website, is that harmful? Is it a problem of ads on the tourism website? Or is it more of a problem of the site creator demonstrating to the site purchaser that the purchase was worthwhile?
We would employ local guides all around the world to craft itinerary plans for visiting places, give tips and tricks, and recommend experiences and places (we made money by selling some of those through our website), and it was a success.
Customers liked the in-depth value of that content, and it converted to buys (we sold experiences and other stuff, sort of like GetYourGuide).
One day all of our content ended up on Google: search "what time is best to visit the Sagrada Familia" and you would get a copy-pasted answer right on Google.
This killed a lot of traffic.
Anyway, I just wanted to point out that the previous user was a bit naive in taking his fight to LLMs, when search engines and OSes have been leeching and hijacking content for ages.
I totally get that it killed your traffic. If a thousand people a day typing in "what time is best to visit the Sagrada Familia" stopped clicking on the link to your page because Google just told them "4 PM on Thursdays" at the top of the page, you lost a bunch of traffic.
But why did you want the traffic? Was your revenue from ad impressions, or were you perhaps being paid by the city of Barcelona to provide useful information to tourists? If the former, I get that this hurt you. If the latter, was this a failure or a success?
Moreover, if it's the former, then good riddance. An ad-backed site is harming users a little on the margin for the marginal piece of information. Getting the same from a search engine is saving users from that harm.
Parent has the right question here: why did you want the traffic? Did you intend for anything good to happen to those people? I'm going to guess not; there's hardly a scenario where people who complain about lost traffic meant that traffic any good.
Now think of the 2nd order effects: they paid money to collect that useful information. If it’s no longer feasible to create such high quality content, it won’t magic itself into existence on its own. It’ll all be just crap and slop in a few years.
If the content was really trash, it would have been dropped by Google in a jiffy after a surge of press mocking Google. That didn't happen. Also, Google Search is ad-backed anyway, so your position does not hold.
> If it’s no longer feasible to create such high quality content, it won’t magic itself into existence on its own. It’ll all be just crap and slop in a few years.
Except it kind of does. Almost all high-quality free content on the Internet has been made by hobbyists just for the sake of doing it, or as some kind of expense (marketing budget, government spending). The free content is not supposed to make money. An honest way of making money with content is putting up a paywall. Monetizing free content creates a conflict of interest, as optimizing for value to the publisher pulls in the opposite direction from optimizing for value to the consumer. Can't serve two masters, and all that. That's why it's an effectively bullet-proof heuristic that the more monetization you see on some free content, the more wrong and more shit it is.
Put another way, monetizing the audience is the hallmark of slop.
>Moreover, if it's the former, then good riddance. An ad-backed site is harming users a little on the margin for the marginal piece of information. Getting the same from a search engine is saving users from that harm.
Of course! It's certainly better to ruin the few sites that support their attempts at high-quality content with ad revenue. Much better to let Google have that money, because of course the tech giant has nothing to do with enshittifying everything through ad revenue of its own and pervasive tracking, or enabling ever worse content through SEO and AI gaming.
You can appreciate that a modest site trying to survive through ads isn't necessarily evil just because it looks for a way to make money off its content?
I mean, what specific harm are you referring to? Particularly compared to the much more obvious harm of Google absorbing ever more of the web in favor of its tentacled surveillance/SEO gaming machine.
> You can appreciate that a modest site trying to survive through ads isn't necessarily evil just because it looks for a way to make money off its content?
It's not necessarily evil, just statistically very likely so :). It's still affected by the conflict of interest, though. Making money off content directly means you either ask readers to pay up, or you extract that payment somehow, whether they want it or not. And since the site isn't asking...
> I mean, what specific harm are you referring to? Particularly compared to the much more obvious harm of Google absorbing ever more of the web in favor of its tentacled surveillance/SEO gaming machine.
At the individual interaction level, think of it as smoking. One cigarette isn't going to kill you. Hell, some smoking might even help you lose weight! But it still affects your behavior short-term in a self-reinforcing way, and long-term, it's gonna ruin your health. A site monetizing content with ads is like a store or library that lets you read for free, if you take a whiff or three of the specific brand of cigarettes they're sponsored by. A couple of interactions may not hurt, but continuous exposure definitely will.
Just because the damage happens to your brain instead of your lungs and immune system, doesn't mean it's OK now. It's still an asshole move to expose your fellow humans to poison.
Comparing ads in content to smoking is some truly iffy, shaky "science". Conjecture, more like it.
And finding a means of funding for content is not a bad thing, even if it involves a conflict of interest. You're painting it as if it were some sort of nefarious activity, when in reality it consists of "here's our content, much of it is authentic and verifiably useful (people want it, after all, and keep reading), and also, we make money off these very visible ads right here." If anything, it's a better model than dishonestly recommending things inside the content itself.
What's more, compared to instead handing that content and those eyeballs over to Google, the top monster itself of online ads, dark patterns and gamed suggestions, it's the much better option.
Your underlying narrative seems to be that people trying to use their efforts at content online to make money is somehow inherently morally wrong, and that's absurd. It's particularly ridiculous when, as in this case, the alternative is a colossal advertising/tech corporation essentially stealing that content to suck away views from these much tinier sites.
> You can appreciate that a modest site trying to survive through ads isn't necessarily evil just because it looks for a way to make money off its content?
If you can't, then you're an emotionally intolerant ideologue about the notion of profit in a digital content landscape, one who isn't willing to entertain criteria such as good-faith arguments, benefit of the doubt, degree, or nuance.
You expect that people should be obligated to conduct their efforts at creating readable information for free, unless they want your moral disdain?
Particularly laughable notions from someone enjoying a site deeply embedded in the ad-funded Silicon Valley parasitic consumer surveillance landscape.
So essentially you created elaborate ads and are now upset that the bigger ad company is better at it than you.
As much as I dislike Google, people who create content FOR google are infinitely worse IMO as they bury all the genuine content created by people without a profit motive. You can always go find a business model that doesn't depend on Google driving traffic to your website.
If your content has a yes/no or otherwise simple, factual answer that can be conveyed in a 1-2 sentence summary, then I don't see this as a problem. You need to adapt your content strategy, as we all do from time to time.
There was never a guarantee -- for anyone in any industry at all -- that what worked in the past will always continue to work. That is a regressive attitude.
However I do have concerns about Google and other monopolies replacing large swaths of people who make their livings doing things that can now be automated. I am not against automation but I don't think the disruption of our entire societal structure and economy should be in the hands of the sociopaths that run these companies. I expect regulation to come into play once the shit hits the fan for more people.
Google snippets are hilariously wrong, absurdly often; I was recently searching for things while traveling and I can easily imagine relying on snippets getting people into actual trouble.
Google has been in trouble for doing so several times in the past and removed key features because of it. Examples: Viewing cached pages, linking directly to images, summarized news articles.
>We paid gazzilions to write quality content for tourists about the most different places just so Google could put it on their homepage. It's just depressing
It's a legitimate complaint, and it sucks for your business. But I think this demonstrates that the sort of quality content you were producing doesn't actually have much value.
That line of thinking makes no sense. If the "content" had no value, why would Google go through the effort of scraping it and presenting it to the user?
>If the "content" had no value, why would google go through the effort of scraping it and presenting it to the user?
They don't present it all, they summarize it.
And let's be serious here, I was being polite because I don't know the OPs business. But 99% of this sort of content is SEO trash and contributes to the wasteland that the internet is becoming. Feel free to point me to the good stuff.
Pedantry aside, let's restate it as "present the core thoughts" to the user, which still implies value. I agree that most of Google's front-page results are SEO garbage these days, but that's a separate issue from claiming that a summary of a piece of information strips the original of its value. I'd even argue that it transfers the value from one entity to the other in this case.
I would also think that the intrinsic value is different. If there is a hotel on a mountain writing "quality content" about the place, to them it really doesn't matter who "steals" their content, the value is in people going to the hotel on the mountain not in people reading about the hotel on the mountain.
Like to society the value is in the hotel, everything else is just fluff around it that never had any real value to begin with.
> Feel free to point me to the good stuff.
Travel bloggers and vloggers, but that is an entirely different unaffected industry (entertainment/infotainment).
I've no doubt some good ones exist, but my instinct is to ignore every word this industry says because it's paid placement and our world is run by advertisers.
It's not that it has no value, it's that there is no established way (other than ad revenue) to charge users for that content. The fact that google is able to monetize ad revenue at least as well as, and probably better than, almost any other entity on the internet, means that big-G is perfectly positioned to cut out the creator -- until the content goes stale, anyway.
This will be quite interesting in the future. One can usually tell if a blog post is stale, or whether it’s still relevant to the subject it’s presenting. But with LLMs they’ll just aggregate and regurgitate as if it was a timeless fact.
This is already a problem. Content farms have realised that adding "in $current_year" to their headlines helps traffic. It's frustrating when you start reading and realise the content is two years out of date.
The Google summaries (before whatever LLM stuff they're doing now) are 2-3 sentences tops. The content on most of these websites is much, much longer than that for SEO reasons.
It sucks that Google created the problem on both ends, but the content OP is referring to costs way more to produce than it adds value to the world because it has to be padded out to show up in search. Then Google comes along and extracts the actual answer that the page is built around and the user skips both the padding and the site as a whole.
Google is terrible, the attention economy that Google created is terrible. This was all true before LLMs and tools like Perplexity are a reaction to the terrible content world that Google created.
It would be a lot better if Google just prioritised concise websites.
If Google preferred websites that cut the fluff, then website operators would have an incentive to make useful websites, and Google wouldn't have as much of an incentive to provide the answer in a snippet, and everyone wins.
I guess it's hard to rank website quality, so Google just prefers verbose websites.
> Google wouldn't have as much of an incentive to provide the answer in a snippet, and everyone wins.
Google has at least two incentives to provide that answer, both of which wouldn't change. The bad one: they want to keep you on their page too, for usual bullshit attention economy reasons. The good one: users prefer the snippets too.
The user searching for information usually isn't there to marvel at the beauty of random websites hiding that information in piles of noise surrounded by ads. They don't care about websites in the first place. They want an answer to their question, so they can get on with whatever it is they're doing. When Google can give them that answer, and this stops them from going from the SERP to any website, then that's just a few seconds or minutes of life the user doesn't have to waste. Lifespans are finite.
The only reason that users prefer snippets is because websites hide the info you are looking for. The problem is that the top ranked search results are ad-infested SEO crap.
If the top ranked website were actually designed with the user in mind, they would not hide the important info. They would present the most important info at the top, and contain additional details below. They would offer the user exactly what they want immediately, and provide further details that the user can read if they want to.
Think of a well written wikipedia article. The summary is probably all that you need, but it's good that the rest of the article with all the detail is there as well. I'm pretty sure that most people prefer a well designed user-centric article to the stupid Google snippet that may or may not answer the question you asked.
Most people looking for info don't look for just a single answer. Often, the answer leads to the next question, or if the answer is surprising, you might want to check whether the source looks credible, etc. Even ads would be helpful, if they were actually relevant (e.g., if I am looking for low-profile graphics cards, I'd appreciate an ad for a local retailer that has them in stock).
But the problem is that website operators (and Google) just want to distract you, capture your attention, and get you to click on completely irrelevant bullshit, because that is more profitable than actually helping you.
I think optimising for that just leads to another kind of SEO slop. I mostly use the summaries for answers to questions like "what's the atomic number of aluminium". The sensible way of laying this out on a website is as a table or something like that, which requires another click, load, and manual lookup in the table. The summaries are useful for that, and if the websites want to answer that question directly, it means they want to make a bunch of tiny pages with a question like that and the answer, which is not something I want to browse through normally. (And indeed, I have seen SEO slop in this vein)
> A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.
If I visit your site from Google with my browser configured to go straight to Reader Mode whenever possible, is my visit more useful to you than a summary and a link to your site provided by Perplexity? Why does it matter so much that visitors be directly on your content?
Well for one thing you visiting his site and displaying it via reader mode doesn't remove his ability to sell paid licenses for his content to companies that would like to redistribute his content. Meanwhile having those companies do so for free without a license obviously does.
Should OP be allowed to demand a license for redistribution from Orion Browser [0]? They make money selling a browser with a built-in ad blocker. Is that substantially different than what Perplexity is doing here?
I asked you this in the other subthread, but what exactly is the moral distinction (I'm not especially interested in the legal one here because our copyright law is horribly broken) between these two scenarios?
* User asks proprietary web browser to fetch content and render it a specific way, which it does
* User asks proprietary web service to fetch content and render it a specific way, which it does
The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?
Traffic numbers, regardless of whether reader mode is used or not, are used as a basic valuation of a website or page. This is why Alexa rankings have historically been so important.
If Perplexity visits the site once and caches some info to give to multiple users, that is stealing traffic numbers (and thus ad value), but it also takes away the site owner's ability to get a realistic idea of how many people are using the information on their site.
Additionally, this is AI we are talking about. Who's to say that the generated summary of the information is actually correct? The only way to confirm that, or to get the correct information in the first place, is to read the original site yourself.
Yeah that's one of the best things about them for me. And then I go to the website and often it's some janky UI with content buried super deep. Or it's like Reddit and I immediately get slammed with login walls and a million annoying pop ups. So I'm quite grateful to have an ability to cut through the noise and non-consistency of the wild west web. I agree the idea that we're somewhat killing traffic to the organic web is kind of sad. But at the same time I still go to the source material a lot, and it enables me to bounce more easily when the website is a bit hostile.
I wonder if it would be slightly less sad if we all had our own decentralized crawlers that simply functioned as extensions of ourselves.
> I wonder if it would be slightly less sad if we all had our own decentralized crawlers that simply functioned as extensions of ourselves.
This is something I'm (slowly) working on myself. I have a local language model server and 30 tb usable storage ready to go, just working on the software :)
>Traffic numbers, regardless of whether reader mode is used or not, are used as a basic valuation of a website.
I have another comment that says something similar, but: is valuing a website based on basic traffic still a thing? Feels very 2002. It's not my wheelhouse, but if I happened to be involved in a transaction, raw traffic numbers wouldn't hold much sway.
If you were considering acquiring a business that had a billion pageviews a month versus 10 pageviews a month, you don't think that would affect the sale price?
The inaccuracy point is particularly problematic: either they cite you as the source despite possibly warping your content to be incorrect, or they don't cite you and more directly steal the content. I'm not sure which is worse.
> But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example).
Perplexity has source references. I find myself visiting the source references. Especially to validate the LLM output. And to learn more about the subject. Perplexity uses a Google search API to generate the reference links. I think a better strategy is to treat this as a new channel to receive visitors.
The browsing experience should be improved. Mozilla had a pilot called Context Graph. Perhaps Context Graph should be revisited?
> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
This seems like a missing feature for analytics products & the LLMs/RAGs. I don't think searching via an LLM/RAG is going away. It's too effective for the end user. We have to learn to work with it the best we can.
>> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
Alternative take: Perplexity is protecting users' privacy by not exposing them to be turned into "insights" by the SaaS.
My general impression is that the subset of complaints discussed in this thread and in the article boils down to a simple conflict of interest: the information supplier wants to exploit the visitor through advertising, upsells, and other time/sanity-wasting things; for that, they need to have the visitor on their site. Meanwhile, the visitors want just the information, without the surveillance, advertising, and other attention-economy dark/abuse patterns.
The content is the bait, and ad-blockers, Google's instant results, and Perplexity, are pulling that bait off the hook for the fish to eat. No surprise fishermen are unhappy. But, as a fish, I find it hard to sympathize.
> I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.
> But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable.
This appears to be self-contradictory. If you let an LLM be trained* on "all the books" (posts, articles, etc.) in the world, the implication is that your potential readers will now simply ask that LLM. Not only will they pay Microsoft for that privilege while you get zilch, but you would not even know they ever read the fruits of your research.
* Incidentally, thinking of information acquisition by an ML model as if it was similar to human reading is a problematic fallacy.
I don't know what the typical usage pattern is, but when I've used Perplexity, I generally do click the relevant links instead of just trusting Perplexity's summary. I've seen plenty of cases where Perplexity's summary says exactly the opposite of the source.
This hits the point exactly, it’s an extension of stuff like Google’s zero click results, they are regurgitating a website’s content with no benefit to the website.
I would say though, it feels like the training argument may ultimately lead to a similar outcome, though it’s a bit more ideological and less tangible than regurgitating the results of a query. Services like chatgpt are already being used a google replacement by many people, so long term it may reduce clicks from search as well.
Ironically, I’ve just started asking LLMs to summarize paywalled content, and if it doesn’t answer my question I’ll check web archives or ask it for the full articles text.
I'm not sure what you mean exactly. If Perplexity is actually doing something with your article in-band (e.g. downloading it, processing it, and present that processed article to the user) then they're just breaking the law.
I've never used that tool (and don't plan to) so I don't know. If they just embed the content in an iframe or something then there's no issue (but then there's no need or point in scraping). If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article <ARTICLE_TEXT>") then that's vanilla infringement, whether they lie about their UA or not.
> If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article <ARTICLE_TEXT>") then that's vanilla infringement, whether they lie about their UA or not.
Except, it can't possibly be like that - that would kill the Internet as you know it. It makes sense to consider scraping for purposes of training as infringement - I personally disagree, I'm totally on the side of AI companies on this one, but there's a reasonable argument there. But in terms of me requesting a summary, and the AI tool doing it server-side before sending it to me, without also adding it to the pile of its own training data? Banning that would mean banning all user-generated content websites, all web viewing or editing tools, web preview tools, optimizing proxies, malware scanners, corporate proxies, hell, maybe even desktop viewers and editing tools.
There are always multiple programs between your website and your user's eyeballs. Most of them do some transformations. Most of them are third-party, usually commercial software. That's how everything works. Software made by "AI company" isn't special here. Trying to make it otherwise is some really weird form of prejudice-driven discrimination.
This is not accurate, and it would not kill the Internet. As much as I would like it to be the case, you are not free to use public internet content for arbitrary purposes. In general (as you probably know), you rely on the idea that a court would consider your processing fair use.
It's not transformative, it uses the entire work, the work copied is not a summary of facts, and insofar as there is a market at all, it circumvents that market. It fails every test.
(training OTOH is inherently transformative, and I suspect likely to turn out to be a fair use)
Well, I guess what I mean is that if the situation is as I describe in my previous comment, then anyone who did have the money to fight it would be a shoo-in. It's a much stronger case than, for example, the ongoing lawsuits by Matthew Butterick and others (https://llmlitigation.com/).
I'm seriously sick of that whole "laundering copyright via AI" grift - and the destruction of the creative industry is already pretty noticeable. All the creatives who brought us those wonderful masterworks with lots of thought and talent behind them - they're going bankrupt and getting fired right now.
It's truly a tragedy - the loss of art is so much more serious than people seem to think it is, considering how integral all kinds of creative works are to modern human life. Just imagine all of that being made without any thought, just statistically optimized for enjoyment... ugh.
Sorry for the late reply, was way too tired yesterday.
The most extreme situation is concept artists right now. Essentially, the entire profession has lost their jobs in the last year. Or casual artists making drawings for commission - they can't compete with AI and mostly had to stop selling their art. Similar is happening to professional translators - with AI, the translations are close enough to native that nobody needs them anymore.
The book market is getting flooded with AI-crap, so is of course the web. Authors are losing their jobs.
Currently, it seems to be creeping into the music market - not sure if people are going to notice/accept AI-made music. All the fantastic artists creating dubs are starting to go away as well, after all you can just synthesize their voices now.