Hacker News

I've been wondering: are the efforts many sites have made to block AI scrapers because the scrapers hit the sites too hard and drive up hosting costs, or because of the potential value in licensing their data (i.e. "If you want this data you gotta pay me some big AI money for it")?



I don’t think it’s either of those. I think it’s because they don’t like the idea of their content being used to train a model that can compete with their content in the future.

At least search engine crawlers may drive traffic to them later. What’s in it for them to contribute to model training?


I wonder if the next step for AI companies (those in control of search, at least) will be to start refusing to index content from companies that have opted out, in an effort to sway their decisions.


> those in control of search

So basically just Google? I don't think other search engines matter that much. I've never heard of people specifically doing SEO for anything that's not Google.


That would probably be very effective so I think there's a good chance Google does take that step at some point. The only reason they might not is that it looks extremely anti-competitive to leverage their search monopoly in that way, but we all know how lax enforcement is in that area these days so I would be surprised if that stopped them.


That would be a great (and extremely justified) way to bring antitrust actions against those 'in control of search'.


All of it. Site scraping for search used to actually be beneficial for content creators, but now it’s detrimental to them.


"This is why we can't have good things". Something that was created for everyone's benefit has been misused and abused, and is now being removed.

I am also thinking (more and more lately) about the movie The Congress (https://www.imdb.com/title/tt1821641/) and how easy it will be very soon to make a company without human actors. The mega-big studios can train AI by feeding it all their movies, then feed it the scenario (this is where we, the humans, come in as the famous "prompt engineers"), and a movie will be created.

Now, to the above scenario there will be a lot of corrections (can't walk in the air/water/nails)(can't open with killing a dog unless "John Wick" or "Sacred Games")(Sandra Bullock can levitate in "Gravity" but not on "Speed")(etc.)

Eventually, if you hire 1000 people to watch 2 iterations of the new movie per day and make corrections to each (1000*2*5 = ), the AI will learn and improve on these unrealistic outputs, and eventually you can produce movies weekly.

Yo Netflix people, if you haven't started doing this already, START! :) (oh and make a follow-up series on Sacred Games while at it)

EDIT: that's quite a tangent, but I can imagine the day where I will be asking my Paramount-AI "hey, make an original Star Trek series based on this-and-that, I want to binge it next weekend!" (or show me the one you created for someone else)


Yup.

Still wondering where AI is gonna get its raw information from in a few years when 90% of websites have given up updating.


> Still wondering where AI is gonna get its raw information from in a few years when 90% of websites have given up updating.

Facts, opinions, or skills?

Facts only need a single source of truth, skills can be learned in sim.

Plenty of room for a gossip column that keeps robots out, but how much will that matter?


Facts are hard, and AI makes it harder by flooding the zone with "factoids".


Facts are indeed hard, but I don't think AI really makes it harder for anything important, since important facts are either science or on the record.

I do agree with the point in a different way however, as spamming a fake reality is quite capable of undermining the world models of us humans.

Two modes of censorship, banning true things vs. overwhelming the signal with noise.

Gossip is a noise we're very interested in.


They don't care. By that time the current leaders will flip in an IPO, make a ton of money, and then the deluge. Everyone else in the market has to follow this race to the bottom, unfortunately.


They'll scrape it all from "your" cloud storage, getting it right from the source before it even goes up on the internet.


exactly, don't people see if you take the commercial interest out of Art and Creation that people will just stop doing it???? /s


Only for a small group who make money on ads. For any small business it is either a wash or a boon, since it makes people more likely to know about them. Your hotel does not care where you found them, and will be quite happy not to pay the Expedia fee. Same thing with the ethnic restaurant, the hairdresser or the accountant.

For those who make money on subscriptions it is a non-issue, since the robot can't index the content anyway.

But yeah, if you are an influencer, well, the world is probably strictly better off without you.


It's not just ad revenue though. Others put out content for exposure or as a way to connect with people or a larger community. Having that content laundered through an LLM removes those benefits.

If AI ever becomes a major way to search for businesses it will quickly become pay to play. Letting an AI company scrape your website won't get you much at that point beyond the ability to pay for traffic.


Don't know about other sites, but on my own, the issue is neither of those things. It's that I simply don't want my data to contribute to the training of these models at all.


I wouldn't mind as much if I could then use the best/latest models that trained on my data for free. As it is, there is nothing in it for me.

If they just use my content to answer users without them needing to come to my site, and get ad revenue from that - no thank you. Not what AI is doing now, but looking at SEO I don't see where else this could be heading...


It’s the latter. Why give away something that the most valuable public company in the world wants?


I wish users would think more like this and ask to be paid for data collection :)



