> “WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training”
Versus
> “The media company joins The Atlantic, Axel Springer, Vox Media, and a host of other publishers who have partnered with OpenAI.”
I find it quite concerning that we are heading into a situation where some crawlers can access some sites but not others.
I have my own serious ethical concerns about AI, but from a pure searchability standpoint this is likely to cause more problems than it solves.
Information is going to start being siloed. It would be as if I had to go to Google for one thing and Bing for another.
That being said, I am also quite concerned about what you mentioned. I was not happy when I found out that Ars Technica signed a deal with OpenAI (or I assume their parent company did). Did that deal include any requirements about how they talk about OpenAI, AI in general, or their competitors?
The concerns over AI are real, but publishers are trading short-term gain for long-term damage by somehow thinking that licensing their data is going to save them.
The tone of this is a bit weird. Do we have similar "blocking Google", "blocking Anthropic", "blocking Meta" articles? What is really happening is that partnerships were formed, and everything outside those partnerships is being blocked.
There is an argument for banning (or limiting the term of) exclusive licensing arrangements for purposes of training an LLM. Nobody is forced to do business with anyone. But OpenAI or Facebook shouldn't be able to force a publisher to only do business with them, thereby closing the market to new entrants.
If email were built by a VC-backed company, you can be sure that you could receive email from all providers but only send to your own provider. Then you could buy bundles of access: for only $9.99/mo, send emails to @aol.com, @yahoo.com, and @live.com addresses.
The Internet has long existed in a state of tension over where value gets created, how it is consumed, and who pays for it. That tension evolved into an uneasy truce of ad-supported big sites, making the web almost (but not quite) unusable for most people, except for a few adblocking freeloaders who get the good experience.
AI just blows all that up by providing a general-purpose plagiarism engine. It offers a substitute for any content, with a similar-looking but inauthentic version. People might accept not paywalling in order to reach human users, but nobody wants to be fodder for the plagiarism bot.
In some ways this is the "Can Googlebot scrape your content and display it on the home page but without generating any ad views for you?" debate we've seen with news publishers before.
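Publishers did eventually get some knobs in that older fight. A minimal sketch using Google's documented robots meta tags, assuming you want to stay indexed but limit how much of your text shows up on the results page:

```html
<!-- Stay in the index, but show no text snippet in search results. -->
<meta name="robots" content="nosnippet">

<!-- Or cap the snippet at a fixed number of characters instead. -->
<meta name="robots" content="max-snippet:50">
```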
> the world spends their ticket money on hanging out with the robot to hear about the movie.
Yes, but I can already do that by reading plot summaries on Wikipedia.
I've read precisely zero Harry Potter books, and of the films I've seen 1, 3, and 4, plus 2 muted in the background.
Despite this, I know about Dolores Umbridge's personality (oh, and I just recognised the reason for the name), that Snape Kills Dumbledore, and several other events and characters that were also not in any of those films.
The vault number is only mentioned in the UK version of one of the books; the US version was localized and this detail isn't included. This question has shown up in some trivia events, and many consider it a gotcha because they only ever read the US version.
Movie recap channels on YouTube are basically this already. They summarize a 90-180 minute movie into 15 minutes of silent clips and narration. Very popular, and plausibly done end-to-end by a bot within the next few years.
What's next is the bot writing the whole story and generating the visuals. For the cost of one actual movie, Slopflix could turn out thousands of recap-style stories.
I'm already getting personalized spam mail that's probably powered by AI ("use this API" type marketing in my work email). I'm okay with this, even if I find it a little unsettling.
I think the dystopian future we're heading towards is personalized phishing and/or scams that tell your hijacked account's contacts that you're going through a hard time and ask for donations, modeled on training data from crowdfunding sites.
Or, one level more dystopian, hijacking social media accounts and advertising AI-generated Patreon-style content using the actual account owner's likeness.
Imagine how dangerous this will be once we have the technology to send a human with a built-in camera to that theater. Humanity and our poor $100 million films will be ruined!
Neither a robot nor the software that runs it has inherent rights, as it's a non-sentient creation. That's why they can't go into a movie theater without the theater's explicit permission. Plus, the whole "companies can choose who they associate with" thing means not even all humans can go into a movie theater.
And if we want to go down this rabbit hole, I suggest starting with existing animals that we're realizing are sentient but that we still prevent from going into movie theaters. Once octopi can visit the movie theater, let's revisit the robot argument.
> Neither a robot nor the software that runs it has inherent rights, as it's a non-sentient creation.
That's not why. We don't know what test could be performed to determine whether some future (or present) AI is or isn't sentient; we just assert it and get on with our lives.
No, the reason robots and software have no inherent rights is because the law says so.
As a demonstration that this is a purely legal status that has nothing to do with underlying nature: the co-inventor of the automobile did not have many of the rights afforded to her husband: https://en.wikipedia.org/wiki/Bertha_Benz
And conversely many such rights do exist for corporations, which are not themselves sentient.
> No, the reason robots and software have no inherent rights is because the law says so.
This is based on a limited theory of rights as being delivered by a state as sovereign, rather than inherent to every being. There are other valid ways of interpreting rights beyond "because the law says so."
Those other methods are great in a philosophical debate, but have no power until implemented as law.
I'm not saying you're wrong; those kinds of debates are great at telling you what the law should be. But the practicality of it is that a right you can't enforce is not useful.
A view that a sentient entity has no rights but what the law gives is incompatible with that entity's agency. Enumeration of rights exists to limit the laws that govern an entity's behavior; it is not a specification of what the entity may do.
It's worth pointing out that the main reason people are invested in robots, AI, or corporations is that you can own them as capital, either for direct production or for secondhand economic benefits, often replacing human labor, which you cannot own.
This status difference is rooted in significant recognized differences and philosophical beliefs about the value of individuals, and in the idea that the law and society should exist to support that. It's pretty far from "purely legal" and the fact that we're going to let people own robots should be only one of the many clues that there's something different about them.
A couple of hundred years ago, we could own other people.
All that philosophical stuff only mattered because it resulted in a change in the law; in the case of the US, differing philosophies led to a civil war to reject/enforce that law.
I used that more in the context of the "speciesist" accusation; I don't consider robots to be a species, particularly not a species worthy of forcing a theater to allow their entry on its own merits.
You're right that using sapience is a bad way to identify what is and is not a species though.
Fair — "robot" is as vague a term at this point as, oh, "fish", I guess? Even sentience (or sapience) aside, a Roomba, an Optimus, and a vending machine are all importantly different kinds of robot, and a cinema would be relaxed, cross, and confused in that order by finding them in attendance during a screening.
The sentience angle is just dogmatic/philosophical and lacks any good evidence.
A service dog can be brought into a theater just fine, as it is sufficiently intelligent to be trained to behave appropriately.
There's nothing that specifically prevents animals from entering a movie theater; I'ma say they mostly just lack 1) any intellectual desire/intent to watch a movie, 2) appropriate social behavior to cohabit a space with humans, and 3) money to pay for their ticket.
---
I would agree that current ML/NN-based models might seem like a way to more or less circumvent copyright, but I don't feel there should be a blanket ban discriminating against potential intelligent entities that might arise in the future from more sophisticated technologies.
I also wasn't completely serious with my initial post. It is rather enjoyable to take an idea to the extreme and see what people's thoughts on it are.
[A robot is] a non-sentient creation. That's why they can't go into a movie theater without the theater's explicit permission.
Wait, what's the threshold for sentience, and do all humans meet that standard? Does the metric for the threshold depend on cooperation from the subject? Asking due to the recent headline "One-quarter of unresponsive people with brain injuries are conscious." [0]
It seems more straightforward to make the speciesist argument until cyborgs start containing majority biological components.
Why would that be weird? In addition to robots, other entities which are not allowed to buy a ticket to watch a movie include tigers, elephants, tardigrades, tarantulas, seagulls, ants, pelicans, dolphins…
If a dolphin, a tiger, and an elephant walked up to a cinema ticket office, handed over some money, and said they wanted a ticket, it would be interesting to see how the cashier responds.
Ants may not be taken seriously if they were to offer to pay, but then again, nobody will stop them just walking into a theatre without a ticket either.
Fortunately, since we already treat artificial entities like corporations and their interests as if they’re legally people even if it means that humans suffer, we should have no problems ushering in the cyberpocalypse in which we pretend that we’re being equitable to other things we pretend are life forms by sacrificing the interests of some human beings to those of others with the capital to own good stochastic parrots.
What a weird take. Nobody cares about a robot watching a movie; it's the robot reselling that movie and keeping the profits that is objectionable. I don't understand the speciesist angle at all. Humans both inside and outside of tech rate intelligent non-human biological life as much less important than human life. I'm not saying that's wrong, but it's clearly true. Elevating non-biological life to parity with humans makes no sense.
But for the moment, and I say this as someone very impressed by and somewhat scared of AI progress, they're probably sufficiently un-person-like today that it's not unreasonable to ban current models.
I've been wondering: the blocks on AI scrapers that many sites have implemented, are they because AI scrapers hit the sites too hard and drive up hosting costs, or because of the potential value in licensing their data (i.e. "if you want this data you gotta pay me some big AI money for it")?
I don’t think it’s either of those. I think it’s because they don’t like the idea of their content being used to train a model that can compete with their content in the future.
At least search engine crawlers may drive traffic to them later. What’s in it for them to contribute to model training?
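For anyone curious about the mechanics: the opt-outs discussed in the article come down to ordinary robots.txt directives. A minimal sketch using the crawler tokens Apple, OpenAI, and Google have publicly documented for AI-training opt-outs (check each vendor's docs before relying on it):

```
# Opt out of AI training while remaining visible to ordinary search.
User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everything else, including regular search crawlers, stays allowed.
User-agent: *
Allow: /
```

Worth noting that robots.txt is purely advisory; a crawler that chooses to ignore it sees everything anyway.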
I wonder if the next step for AI companies (those in control of search, at least) will be to start refusing to index content from companies that have opted out, in an effort to sway their decisions.
So basically just Google? I don't think other search engines matter that much. I've never heard of people specifically doing SEO for anything that's not Google.
That would probably be very effective so I think there's a good chance Google does take that step at some point. The only reason they might not is that it looks extremely anti-competitive to leverage their search monopoly in that way, but we all know how lax enforcement is in that area these days so I would be surprised if that stopped them.
"This is why we can't have good things". Something that was created for the benefit, has been mis/ab-used and now is being removed.
I am also thinking (more and more lately) of the movie The Congress (https://www.imdb.com/title/tt1821641/) and how easy it will very soon be to run a movie company without human actors. The mega-big studios can train AI by feeding it all their movies, then feed it the scenario (this is where we, the humans, come in: the famous "prompt engineers"), and a movie will be created.
Now, the above scenario will need a lot of corrections (no walking on air/water/nails) (no opening with killing a dog unless it's "John Wick" or "Sacred Games") (Sandra Bullock can levitate in "Gravity" but not in "Speed") (etc.)
Eventually, if you hire 1,000 people to watch 2 iterations of the new movie per day and make corrections to each (1000 × 2 × 5 = 10,000 corrected viewings per work week), the AI will learn and improve on these unrealistic outputs, and eventually you can produce movies weekly.
Yo Netflix people, if you haven't started doing this already, START! :) (oh and make a follow-up series on Sacred Games while at it)
EDIT: that's quite a tangent, but I can imagine the day where I will be asking my Paramount-AI "hey, make an original Star Trek series based on this-and-that, I want to binge it next weekend!" (or show me the one you created for someone else)
They don't care. By that time the current leaders will flip in an IPO, make a ton of money, and then the deluge. Everyone else in the market has to follow this race to the bottom, unfortunately.
Only for the small group who makes money on ads. For any small business it is either a wash or a boon, since it makes people more likely to know about them. Your hotel does not care where you found them, and will be quite happy not to pay the Expedia fee. Same with the ethnic restaurant, the hairdresser, or the accountant.
For those who make money on subscriptions it is a non-issue, since the robot can't index the content anyway.
But yeah, if you are an influencer, well, the world is probably strictly better off without you.
It's not just ad revenue though. Others put out content for exposure or as a way to connect with people or a larger community. Having that content laundered through an LLM removes those benefits.
If AI ever becomes a major way to search for businesses it will quickly become pay to play. Letting an AI company scrape your website won't get you much at that point beyond the ability to pay for traffic.
Don't know about other sites, but on my own, the issue is neither of those things. It's that I simply don't want my data to contribute to the training of these models at all.
I wouldn't mind as much if I could then use the best/latest models that trained on my data for free. As it is, there is nothing in it for me.
If they just use my content to answer users so they never need to come to my site, and get ad revenue from that - no thank you. That's not what AI is doing now, but looking at SEO I don't see where else this could be heading...
News and media sites are not very high value data sources for AI models. There's not that much additional value from having 100 news sites in your data or 100k because so much content is generic with a high degree of overlap.
It'd be much more concerning if academic publishers, code repositories, and Wikipedia refused scraping.
This has gotta be the dumbest “debate” (at least in the USA). If you put something on the internet without protection (i.e. DRM), it is public. Enforcing a data cartel like Apple is doing has gotta be the absolute worst outcome: only the ultra-rich can afford the good and correct information, while the poor get whatever scraps they can.
Apple drank the Kool-Aid and now has to deal with the fact that those bad-faith arguments make no sense in reality.
Data being viewable but subject to some restrictions isn't some new thing; it's how it has always worked. E.g. if I can view a website it's public, but copying/pasting it somewhere else is a violation of copyright. We've never lived in a world where you can just do whatever you want with someone else's content, and there's in fact a whole huge portion of the law, intellectual property, for this exact sort of thing.
Yes, I agree and this annoys me. From a privacy perspective, it's still better to have fifty siloed data stores that aren't allowed to mix than for your data to be in a single large store that can do whatever it wants.
And when you consider the comparison, it's even worse: the data they're refusing to allow Apple to train on is _public_ data. There were several court cases in the early 1900s where owners of big buildings tried to sue photographers for selling pictures of their buildings (or pictures that included their buildings), and it was finally settled that since the building was in public, pictures of it were fair game. If you're putting an article on the internet for anybody to read, "anybody" seems like it would include an AI as well.
Interesting! There's a difference though, to me: if the building has a giant poster of a popular IP on it, afaik that doesn't give the photographer any rights to use that IP in their creations. I see enough series shot in Vancouver that avoid including things like that, or keep them blurred if they can't be kept out of frame...
Content on websites has copyright in some form; just because it isn't behind a paywall/accountwall doesn't make it public domain.
But the AI didn't reproduce it, the AI "read" it and incorporated it into its knowledge base. I can read & cite all the academic papers I want - I can even reproduce their text verbatim, as long as I properly cite them.
I’d argue the way Wired is reporting this as a gotcha is also a conflict of interest because of their deal with an effective competitor.
https://www.wired.com/story/conde-nast-openai-deal/