> “WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training”
Versus
> “The media company joins The Atlantic, Axel Springer, Vox Media, and a host of other publishers who have partnered with OpenAI.”
I find it quite concerning that we are heading into a situation where some crawlers can access some sites but not others.
I have my own serious ethical concerns about AI, but from a pure searchability standpoint this is likely to cause more problems than it solves.
Information is going to start being siloed. It would be as if I had to go to Google for one thing and Bing for another.
That being said, I am also quite concerned about what you mentioned. I was not happy when I found out that Ars Technica signed a deal with OpenAI (or I assume their parent company did). Did that deal include any requirements about how they talk about OpenAI, AI in general, or their competitors?
The concerns over AI are real, but publishers are trading short-term gain for long-term damage by somehow thinking that licensing their data is going to save them.
The tone of this is a bit weird. Do we have similar "blocking Google", "blocking Anthropic", "blocking Meta" articles? What is really happening is that partnerships were formed, and everything outside those partnerships is being blocked.
There is an argument for banning (or limiting the term of) exclusive licensing arrangements for purposes of training an LLM. Nobody is forced to do business with anyone. But OpenAI or Facebook shouldn't be able to force a publisher to only do business with them, thereby closing the market to new entrants.
If email were built by a VC-backed company, you can be sure that you could receive email from all providers but only send to your own provider. Then you could buy bundles of access: for only $9.99/mo, send emails to @aol.com, @yahoo.com, and @live.com addresses.
The Internet has long existed in a state of tension over where value gets created, how it is consumed, and who pays for it. That tension evolved into an uneasy truce of ad-supported big sites, making the web almost (but not quite) unusable for most people, except for a few adblocking freeloaders who get the good experience.
AI just blows all that up by providing a general-purpose plagiarism engine. It offers a substitute for any content, with a similar-looking but inauthentic version. People might accept not paywalling in order to reach human users, but nobody wants to be fodder for the plagiarism bot.
In some ways this is the "Can Googlebot scrape your content and display it on the home page but without generating any ad views for you?" debate we've seen with news publishers before.
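Publishers did eventually get some knobs in that older fight. A minimal sketch using Google's documented robots meta tags, assuming you want to stay indexed but limit how much of your text shows up on the results page:

```html
<!-- Stay in the index, but show no text snippet in search results. -->
<meta name="robots" content="nosnippet">

<!-- Or cap the snippet at a fixed number of characters instead. -->
<meta name="robots" content="max-snippet:50">
```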
> the world spends their ticket money on hanging out with the robot to hear about the movie.
Yes, but I can already do that by reading plot summaries on Wikipedia.
I've read precisely zero Harry Potter books, and of the films I've seen 1, 3, and 4, plus 2 muted in the background.
Despite this, I know about Dolores Umbridge's personality (oh, and I just recognised the reason for the name), that Snape Kills Dumbledore, and several other events and characters that were also not in any of those films.
The vault number is only mentioned in the UK version of one of the books; the US version was localized and this detail isn't included. This question has shown up in some trivia events, and many consider it a gotcha because they only ever read the US version.
Movie recap channels on YouTube are basically this already. They summarize a 90-180 minute movie into 15 minutes of silent clips and narration. Very popular, and plausibly done end-to-end by a bot within the next few years.
What's next is the bot writing the whole story and generating the visuals. For the cost of one actual movie, Slopflix could turn out thousands of recap-style stories.
I'm already getting personalized spam mail that's probably powered by AI ("use this API" type marketing in my work email). I'm okay with this, even if I find it a little unsettling.
I think the dystopian future we're heading towards is personalized phishing and/or scams that tell your hijacked account's contacts that you're going through a hard time and ask for donations, modeled on training data from crowdfunding sites.
Or, one level more dystopian, hijacking social media accounts and advertising AI-generated Patreon-style content using the actual account owner's likeness.
Imagine how dangerous this will be once we have the technology to send a human with a built-in camera to that theater. Humanity and our poor $100 million films will be ruined!
Neither a robot nor the software that runs it has inherent rights, as it's a non-sentient creation. That's why they can't go into a movie theater without the theater's explicit permission. Plus, the whole "companies can choose who they associate with" thing means not even all humans can go into a movie theater.
And if we want to go down this rabbit hole, I suggest starting with existing animals that we're realizing are sentient but that we still prevent from going into movie theaters. Once octopi can visit the movie theater, let's revisit the robot argument.
> Neither a robot nor the software that runs it has inherent rights, as it's a non-sentient creation.
That's not why. We don't know what test could be performed to determine whether some future (or present) AI is or isn't sentient; we just assert it and get on with our lives.
No, the reason robots and software have no inherent rights is because the law says so.
As a demonstration that this is a purely legal status that has nothing to do with underlying nature: the co-inventor of the automobile did not have many of the rights afforded to her husband: https://en.wikipedia.org/wiki/Bertha_Benz
And conversely many such rights do exist for corporations, which are not themselves sentient.
> No, the reason robots and software have no inherent rights is because the law says so.
This is based on a limited theory of rights as being delivered by a state as sovereign, rather than inherent to every being. There are other valid ways of interpreting rights beyond "because the law says so."
Those other methods are great in a philosophical debate, but have no power until implemented as law.
I'm not saying you're wrong; those kinds of debates are great at telling you what the law should be. But the practicality of it is that a right you can't enforce is not useful.
A view that a sentient entity has no rights but what the law gives is incompatible with that entity's agency. Enumeration of rights exists to limit the laws that govern an entity's behavior; it is not a specification of what the entity may do.
It's worth pointing out that the main reason people are invested in robots, AI, or corporations is that you can own them as capital, either for direct production or for secondhand economic benefits, often replacing human labor, which you cannot own.
This status difference is rooted in significant recognized differences and philosophical beliefs about the value of individuals, and in the idea that the law and society should exist to support that. It's pretty far from "purely legal" and the fact that we're going to let people own robots should be only one of the many clues that there's something different about them.
A couple of hundred years ago, we could own other people.
All that philosophical stuff only mattered because it resulted in a change in the law; in the case of the US, differing philosophies led to a civil war to reject/enforce that law.
I used that more in the context of the "speciesist" accusation; I don't consider robots to be a species, particularly not a species worthy of forcing a theater to allow their entry on its own merits.
You're right that using sapience is a bad way to identify what is and is not a species though.
Fair — "robot" is as vague a term at this point as, oh, "fish", I guess? Even sentience (or sapience) aside, a Roomba, an Optimus, and a vending machine are all importantly different kinds of robot, and a cinema would be relaxed, cross, and confused in that order by finding them in attendance during a screening.
The sentience angle is just dogmatic/philosophical and lacks any good evidence.
A service dog can be brought into a theater just fine, as it is sufficiently intelligent to be trained to behave appropriately.
There's nothing that specifically prevents animals from entering a movie theater; I'ma say they mostly just lack 1) any intellectual desire/intent to watch a movie, 2) appropriate social behavior to cohabit a space with humans, and 3) money to pay for their ticket.
---
I would agree that current ML/NN-based models might seem like a way to more or less circumvent copyright, but I don't feel there should be a blanket ban discriminating against potential intelligent entities that might arise in the future from more sophisticated technologies.
I also wasn't completely serious with my initial post. It is rather enjoyable to take an idea to the extreme and see what people's thoughts on it are.
[A robot is] a non-sentient creation. That's why they can't go into a movie theater without the theater's explicit permission.
Wait, what's the threshold for sentience, and do all humans meet that standard? Does the metric for the threshold depend on cooperation from the subject? Asking due to the recent headline "One-quarter of unresponsive people with brain injuries are conscious." [0]
It seems more straightforward to make the speciesist argument until cyborgs start containing majority biological components.
Why would that be weird? In addition to robots, other entities which are not allowed to buy a ticket to watch a movie include tigers, elephants, tardigrades, tarantulas, seagulls, ants, pelicans, dolphins…
If a dolphin, a tiger, and an elephant walked up to a cinema ticket office, handed over some money, and said they wanted a ticket, it would be interesting to see how the cashier responds.
Ants may not be taken seriously if they were to offer to pay, but then again, nobody will stop them just walking into a theatre without a ticket either.
Fortunately, since we already treat artificial entities like corporations and their interests as if they’re legally people even if it means that humans suffer, we should have no problems ushering in the cyberpocalypse in which we pretend that we’re being equitable to other things we pretend are life forms by sacrificing the interests of some human beings to those of others with the capital to own good stochastic parrots.
What a weird take. Nobody cares about a robot watching a movie; it's the robot reselling that movie and keeping the profits that is objectionable. I don't understand the speciesist angle at all. Humans both inside and outside of tech rate intelligent non-human biological life as much less important than human life. I'm not saying that's wrong, but it's clearly true. Elevating non-biological life to parity with humans makes no sense.
But for the moment, and I say this as someone very impressed by and somewhat scared of AI progress, they're probably sufficiently un-person-like today that it's not unreasonable to ban current models.
I've been wondering: the blocks on AI scrapers that many sites have implemented, are they because AI scrapers hit the sites too hard and drive up hosting costs, or because of the potential value in licensing their data (i.e. "if you want this data you gotta pay me some big AI money for it")?
I don’t think it’s either of those. I think it’s because they don’t like the idea of their content being used to train a model that can compete with their content in the future.
At least search engine crawlers may drive traffic to them later. What’s in it for them to contribute to model training?
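For anyone curious about the mechanics: the opt-outs discussed in the article come down to ordinary robots.txt directives. A minimal sketch using the crawler tokens Apple, OpenAI, and Google have publicly documented for AI-training opt-outs (check each vendor's docs before relying on it):

```
# Opt out of AI training while remaining visible to ordinary search.
User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everything else, including regular search crawlers, stays allowed.
User-agent: *
Allow: /
```

Worth noting that robots.txt is purely advisory; a crawler that chooses to ignore it sees everything anyway.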
I wonder if the next step for AI companies (those in control of search, at least) will be to start refusing to index content from companies that have opted out, in an effort to sway their decisions.
So basically just Google? I don't think other search engines matter that much. I've never heard of people specifically doing SEO for anything that's not Google.
That would probably be very effective so I think there's a good chance Google does take that step at some point. The only reason they might not is that it looks extremely anti-competitive to leverage their search monopoly in that way, but we all know how lax enforcement is in that area these days so I would be surprised if that stopped them.
"This is why we can't have good things". Something that was created for the benefit, has been mis/ab-used and now is being removed.
I am also thinking (more and more lately) of the movie The Congress (https://www.imdb.com/title/tt1821641/) and how easy it will very soon be to run a movie company without human actors. The mega-big studios can train AI by feeding it all their movies, then feed it the scenario (this is where we, the humans, come in: the famous "prompt engineers"), and a movie will be created.
Now, the above scenario will need a lot of corrections (no walking on air/water/nails) (no opening with killing a dog unless it's "John Wick" or "Sacred Games") (Sandra Bullock can levitate in "Gravity" but not in "Speed") (etc.)
Eventually, if you hire 1,000 people to watch 2 iterations of the new movie per day and make corrections to each (1000 × 2 × 5 = 10,000 corrected viewings per work week), the AI will learn and improve on these unrealistic outputs, and eventually you can produce movies weekly.
Yo Netflix people, if you haven't started doing this already, START! :) (oh and make a follow-up series on Sacred Games while at it)
EDIT: that's quite a tangent, but I can imagine the day where I will be asking my Paramount-AI "hey, make an original Star Trek series based on this-and-that, I want to binge it next weekend!" (or show me the one you created for someone else)
They don't care. By that time the current leaders will flip in an IPO, make a ton of money, and then the deluge. Everyone else in the market has to follow this race to the bottom, unfortunately.
Only for the small group who makes money on ads. For any small business it is either a wash or a boon, since it makes people more likely to know about them. Your hotel does not care where you found them, and will be quite happy not to pay the Expedia fee. Same with the ethnic restaurant, the hairdresser, or the accountant.
For those who make money on subscriptions it is a non-issue, since the robot can't index the content anyway.
But yeah, if you are an influencer, well, the world is probably strictly better off without you.
It's not just ad revenue though. Others put out content for exposure or as a way to connect with people or a larger community. Having that content laundered through an LLM removes those benefits.
If AI ever becomes a major way to search for businesses it will quickly become pay to play. Letting an AI company scrape your website won't get you much at that point beyond the ability to pay for traffic.
Don't know about other sites, but on my own, the issue is neither of those things. It's that I simply don't want my data to contribute to the training of these models at all.
I wouldn't mind as much if I could then use the best/latest models that trained on my data for free. As it is, there is nothing in it for me.
If they just use my content to answer users so they never need to come to my site, and get ad revenue from that - no thank you. That's not what AI is doing now, but looking at SEO I don't see where else this could be heading...
News and media sites are not very high value data sources for AI models. There's not that much additional value from having 100 news sites in your data or 100k because so much content is generic with a high degree of overlap.
It'd be much more concerning if academic publishers, code repositories, and Wikipedia refused scraping.
This has gotta be the dumbest “debate” (at least in the USA). If you put something on the internet without protection (i.e. DRM), it is public. Enforcing a data cartel like Apple is doing has gotta be the absolute worst outcome: only the ultra-rich can afford the good and correct information, while the poor get whatever scraps they can.
Apple drank the Kool-Aid and now has to deal with the fact that those bad-faith arguments make no sense in reality.
Data being viewable but subject to some restrictions isn't some new thing; it's how it has always worked. E.g. if I can view a website it's public, but copying/pasting it somewhere else is a violation of copyright. We've never lived in a world where you can just do whatever you want with someone else's content, and there's in fact a whole huge portion of the law, intellectual property, for this exact sort of thing.
Yes, I agree and this annoys me. From a privacy perspective, it's still better to have fifty siloed data stores that aren't allowed to mix than for your data to be in a single large store that can do whatever it wants.
And when you consider the comparison, it's even worse: the data they're refusing to allow Apple to train on is _public_ data. There were several court cases in the early 1900s where owners of big buildings tried to sue photographers for selling pictures of their buildings (or pictures that included their buildings), and it was finally settled that since the building was in public, pictures of it were fair game. If you're putting an article on the internet for anybody to read, "anybody" seems like it would include an AI as well.
Interesting! There's a difference though, to me: if the building has a giant poster of a popular IP on it, afaik that doesn't give the photographer any rights to use that IP in their creations. I see enough series shot in Vancouver that avoid including things like that, or keep them blurred if they can't be kept out of frame...
Content on websites has copyright in some form; just because it isn't behind a paywall/accountwall doesn't make it public domain.
But the AI didn't reproduce it, the AI "read" it and incorporated it into its knowledge base. I can read & cite all the academic papers I want - I can even reproduce their text verbatim, as long as I properly cite them.
I’d argue the way Wired is reporting this as a gotcha is also a conflict of interest because of their deal with an effective competitor.
https://www.wired.com/story/conde-nast-openai-deal/