I've been wondering: are the efforts many sites have made to block AI scrapers because the scrapers hit the sites too hard and drive up hosting costs, or because of the potential value in licensing their data (i.e. "If you want this data, you gotta pay me some big AI money for it")?
I don’t think it’s either of those. I think it’s because they don’t like the idea of their content being used to train a model that can compete with their content in the future.
At least search engine crawlers may drive traffic to them later. What’s in it for them to contribute to model training?
I wonder if the next step for AI companies (those in control of search, at least) will be to start refusing to index content from companies that have opted out, in an effort to sway their decisions.
So basically just Google? I don't think other search engines matter that much. I've never heard of people specifically doing SEO for anything that's not Google.
That would probably be very effective so I think there's a good chance Google does take that step at some point. The only reason they might not is that it looks extremely anti-competitive to leverage their search monopoly in that way, but we all know how lax enforcement is in that area these days so I would be surprised if that stopped them.
"This is why we can't have good things". Something that was created for everyone's benefit has been misused/abused and is now being removed.
I am also thinking (more and more lately) about the movie The Congress (https://www.imdb.com/title/tt1821641/) and how easy it will be very soon to make a company without human actors. The mega-big studios can train AI by feeding it all their movies, then feed it the scenario (this is where we, the humans, come in: the famous "prompt engineers"), and a movie will be created.
Now, the above scenario will need a lot of corrections (can't walk on air/water/nails) (can't open with killing a dog unless it's "John Wick" or "Sacred Games") (Sandra Bullock can levitate in "Gravity" but not in "Speed") (etc.)
If you hire 1000 people to watch 2 iterations of the new movie per day and make corrections to each (1000 × 2 × 5 = 10,000), the AI will learn and improve on these unrealistic outputs, and eventually you can produce movies weekly.
Yo Netflix people, if you haven't started doing this already, START! :) (oh and make a follow-up series on Sacred Games while at it)
EDIT: that's quite a tangent, but I can imagine the day where I will be asking my Paramount-AI "hey, make an original Star Trek series based on this-and-that, I want to binge it next weekend!" (or show me the one you created for someone else)
They don't care. By that time the current leaders will flip in an IPO, make a ton of money, and then the deluge. Everyone else in the market has to follow this race to the bottom, unfortunately.
Only for a small group that makes money on ads. For any small business it is either a wash or a boon, since it makes people more likely to know about them. Your hotel does not care where you found them, and will be quite happy not to pay the Expedia fee. Same thing with the ethnic restaurant, the hairdresser or the accountant.
For those who make money on subscriptions it is a non-issue, since the robot can't index the content anyway.
But yeah, if you are an influencer, well, the world is probably strictly better off without you.
It's not just ad revenue though. Others put out content for exposure or as a way to connect with people or a larger community. Having that content laundered through an LLM removes those benefits.
If AI ever becomes a major way to search for businesses it will quickly become pay to play. Letting an AI company scrape your website won't get you much at that point beyond the ability to pay for traffic.
Don't know about other sites, but on my own, the issue is neither of those things. It's that I simply don't want my data to contribute to the training of these models at all.
I wouldn't mind as much if I could then use the best/latest models that trained on my data for free. As it is, there is nothing in it for me.
If they just use my content to answer users directly, without those users needing to come to my site, and get ad revenue from that - no thank you. Not what AI is doing now, but looking at SEO I don't see where else this could be heading...