Isn't it a bit hypocritical of them to use other people's copyrighted work or "output" they weren't given license/permission to use, but deny people the same opportunity with their work/output?
I'm genuinely wondering how this is different from them using others' work without consent, but IANAL so maybe I'm just confusing morality and legality
The fun thing is that, IIRC, AI outputs aren’t copyrighted, so this is actually more ethical than what OpenAI did. The only thing is that it was probably against the terms of service.
> aren’t copyrighted, so this is actually more ethical than what OpenAI did
Why is that? All works, as long as they are "original", should be granted copyright. It should belong to the person who made it (in this case, the user, not OpenAI).
If I say to one of my human friends "write me a nursery rhyme", the copyright of the resulting rhyme would obviously belong to my friend - despite me prompting them. Clearly the prompt itself does not universally count as "making" it.
Let's say I made a "NovelSnippetAI", which contains a corpus of prewritten material. You can prompt it, and it will return a page which matches the sentiment of your prompt best. I think we can agree that the copyright of the page will still belong to the original writer - the user only did a query.
What if I did "NovelMixAI", which did exactly the same but alternated lines from the two best matches? What about "NovelTransformAI", which applied a mathematical formula to the best match and fed the output to a fixed Markov Chain? Now we're suddenly at "LLMAI", which does the same using a neural network - what makes it different from the rest?
Long-standing precedent is that any automated work does not qualify for copyright. You can only copyright human work.
Where the line is - i.e. how much human creative input would be needed for a work to be covered by copyright - seems to be legally unclear. There are some interesting parallels to this dispute, I think: https://en.m.wikipedia.org/wiki/Monkey_selfie_copyright_disp...
And what happens when two users manage to get the same result? Does OpenAI need to implement an "/r9k/ algorithm" or something to prevent sufficiently similar results from ever being generated, because once a response is generated it is now copyrighted by the user who prompted it?
> If I say to one of my human friends "write me a nursery rhyme", the copyright of the resulting rhyme would obviously belong to my friend - despite me prompting them. Clearly the prompt itself does not universally count as "making" it.
This is the wrong analogy. The LLM is more like Photoshop and the prompt little more than a filter configuration. A machine cannot copyright its own output but a human guiding that machine can.
No, it's the right analogy for the point I was making. That analogy is there to point out that it is about more than just the prompt, and prompting a machine is covered in the rest of my comment.
It is more like a child's speak-and-say toy. Push a button, a cow noise comes out. Someone else pushes it, the same noise comes out. Even if a billion buttons exist, the child doesn't own the cow noise.
Machine-generated works are not eligible for copyright, and courts have been ruling this way about AI content. https://www.reuters.com/legal/legalindustry/who-owns-ai-crea... Early days and there are still appeals as well as new legislation in progress, but that's where it stands now.
In the Monkey selfie copyright dispute, Slater could not claim copyright because he did not operate the camera. Corporate personhood, or juridical personality, is the legal notion that a juridical person such as a corporation, separately from its associated human beings (like owners, managers, or employees), has at least some of the legal rights and responsibilities enjoyed by natural persons. If Mr. Slater were incorporated, would his copyright ownership be clear? If an AI were a corporation, not an asset of a corporation, and could prompt itself, would the output then be copyrightable? There are a lot of ifs in there, but still interesting.
IANAL, but I would guess it depends on whether the animal is an "employee" of the company. Obviously there are animals that are owned by companies and, if they create something, the company I assume would own it whether or not it was considered a work for hire in the usual sense. But that would presumably not have been the case here.
Generative AI has basically made what were once fairly irrelevant edge cases (what if I tie together a bunch of random number generators to create artwork?) a lot more interesting. And laws will probably have to be adapted.
Obvious new business model: providing a human shill service where you look at AI output and say "yep, I made that". Now it's copyrighted, assigned to the customer of your startup.
Morality and legality aside, there's a substantive difference between use of content and use of a model. Pretraining a GPT 4-class model from raw data requires trillions of tokens and millions of dollars in compute, whereas distilling a model using GPT 4's output requires orders of magnitude less data. Add to that the fact that OpenAI is probably subsidizing compute at their current per-token cost, and it's clearly unsustainable.
The morality of training on internet-scale text data is another discussion, but I would point out that this has been standard practice since the advent of the internet, both for training smaller models and for fueling large tech companies such as Google. Broadly speaking, there is nothing wrong with mere consumption. What gets both morally and legally more complex is production - how much are you allowed to synthesize from the training data? And that is a fair question.
“Content” requires as much effort and expense as pretraining GPT-4, if not more.
All you’re doing is redefining content, i.e. thoughts, ideas, movies, videos, literature, sounds, writing, etc., as “raw data”. But that isn’t raw data. There was a ton of effort that went into creating the “content”. For example, a single Wikipedia page may have many hundreds of people behind it, some of whom have done years of college-level study and original research, to produce a few thousand words of content. Others have done research using primary sources. All of them have had to use effort and ingenuity to craft those into actual high-quality statements, which in itself was only possible in many cases due to years of training and education. Finally, they had to set up a validation process to produce useful output from this collaborative process, which included loads of arguments etc., to generate what you are calling “raw data”.
I’m not sure what makes GPT’s output any less raw than all the effort that went into producing a single Wikipedia page. Further, Wikipedia actually goes out of its way to cite its sources. GPT is designed to go out of its way to obscure its sources.
The only thing GPT does, IOW, that apparently makes the data it uses “raw”, is not citing its sources - something that would at the very least lead to professional disgrace for the people who created the “raw data” GPT uses without a second thought, and would even lead to lawsuits and prosecution in many cases.
So besides going out of its way to obscure the source of its data, what makes GPT’s output less raw than the output people have spent billions of man hours creating?
Except that the content already exists and there is no cost to maintain it.
If GPT incurred a non-negligible cost on the content owners by accessing their resources it may have been different, but that's not the case.
The only thing that content owners may be able to complain about is that potentially ChatGPT/DALL-E may reduce their potential income, and this would have to be proven. I have not stopped buying books or art of any kind since I started using ChatGPT/DALL-E. And low-quality automated content producers existed before OpenAI and were already diluting the attention to more carefully produced content (as can be seen with videos on YouTube).
It seems like you have no idea how much effort it takes to write a book.
Quite often it contains a person's life experience condensed into a few hundred pages.
ChatGPT gives easier access to the knowledge contained in tens of thousands of these books. As for me, I have been reading fewer and fewer books as more wisdom becomes accessible on the internet in better forms (now GPT).
I'm not against what OpenAI is doing as it moves humanity forward, but like you said I won't stop using ChatGPT just because ByteDance scrapes it.
Not what I am saying. I am saying it is much much smaller than inference/model running cost.
Easy exercise:
- How many books can you store in 1 GB?
- How much does it cost per year to store it and have OpenAI gather it once?
- How much does it cost to run a GPT-4-level model that will output 1 GB?
(Rough numbers are sketched below.)
That's my point here that's all. It is a huge cost for OpenAI to run a system that produces dynamic content. And it is not comparable to the cost of storing static content.
I didn't talk about the cost of producing the original data.
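A rough back-of-envelope version of that exercise, as a sketch - every price here is an illustrative assumption, not OpenAI's actual cost structure:

```python
# Illustrative numbers only - assumed prices, not OpenAI's actual costs.
books_per_gb = 1_000                  # ~1 MB of plain text per book (assumption)
storage_per_gb_year = 0.023 * 12      # typical object-storage pricing, USD per GB-year (assumption)

tokens_per_gb = 1e9 / 4               # rule of thumb: ~4 bytes of English text per token
price_per_1k_output_tokens = 0.06     # launch-era GPT-4 output pricing, USD (assumption)
generate_1gb = tokens_per_gb / 1_000 * price_per_1k_output_tokens

print(f"books in 1 GB: ~{books_per_gb}")
print(f"store 1 GB for a year: ~${storage_per_gb_year:.2f}")
print(f"generate 1 GB with a GPT-4-level model: ~${generate_1gb:,.0f}")
```

Under those assumptions, storing a gigabyte of static text costs cents per year, while generating it with a large model costs thousands of dollars - orders of magnitude apart.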
Sure, but your comment said "maintain", not "store". Even if storage were free, and even if you discount the value of the initial creation to zero, there are still nontrivial serving costs associated with many sites.
What I share with people on the Web may look like a static byte sequence to the robots consuming it, but it takes a lot of work to compute those bytes (in the moment, I mean). Aggregated over the whole web, no, that is not smaller than OpenAI's expenditures.
The effort and resources required to train from raw data are nothing compared to the effort and resources that went into producing the "training" input. How much does it cost to produce all the things they scraped from the internet? So morally they are in the wrong - I don't care if it's standard practice since "the beginning of the internet" or not.
It’s also not standard practice since the beginning of the internet. Referencing original input through links is almost foundational to the internet (at least the original internet).
In fact, the power of linking to data sources is what Google is almost entirely built upon.
Others have already pointed out that you’re just shrugging off billions of hours and money that went into the content that is used to (pre-)train a model, so I’ll leave that for what it is.
I’m just curious how you start off with:
> Morality and legality aside
Only to then follow it up immediately with an argument for why one is more moral.
Just because you didn’t end it with “and that’s why I think OpenAI is more moral” doesn’t make it any less obvious, or any less ironic.
Morality and legality are the only relevant questions in the discussion. The two methods are virtually the same... in fact I'd argue that ByteDance's usage is more fair and moral. It really doesn't matter that it's cheaper and more efficient.
The cost of hiring humans to write the trillions of tokens they trained from scratch would surely be much larger than the training cost. Except they avoided that cost by using what's available on the Internet. [1]
Similarly, people are avoiding the cost of pre-training GPT-4 class model by scraping its output.
So I think it's fair to question the moral consistency of their ToS.
[1] Please note that I am not passing a judgement on this, just stating a fact in order to make an argument.
> I would point out that this has been standard practice since the advent of the internet
Maybe it shouldn't have been? We've been frog-boiling toward this point for a long time, from a starting point that was generally good for content creators (your content is made more discoverable) to a point that is not so good for content creators (your content is scraped and digested, programmatically laundered and regurgitated on huge corporations' own platforms with token or no attribution, and no revenue shared).
In a parallel universe where search engines were explicitly opt-in from the beginning, I think these conversations would look very different today. What OpenAI and its peers have done would, I dare say, be uncontroversially (and correctly) regarded as theft. Just as I'm not allowed to distribute^[1] software incorporating somebody else's code in a way that violates the terms of its license (or lack thereof), I shouldn't be able to distribute software that incorporates any intellectual property that I don't have the rights to.
Yes, but the original intent was to help other companies create ethical AI models. If they've already turned their back on those core values, a bit more hypocrisy won't stop them.
Stated intent and actual intent are usually different, I expect they intended the opposite all along but were just riding the open source ethical AI wave to profit.
I've talked to a few people formerly at OpenAI a few years ago - the deviation from the original mission was a “boiling frog” process that is best demarcated by when the Anthropic founders left. The core team doing the actual science and engineering work did legitimately believe in the open research mission. But when funding was hard to find, things kind of broke.
I strongly doubt it. Sam was pretty damn rich before OpenAI. He's written a lot of stuff about AI and how he worried someone would do it. The original plan was roughly build the basis for ethical AI, have someone else build on top of it, which somehow pivoted into everyone using their API.
If they just wanted money, they could have cut out a lot of the ethical filters and choke out competitors. Google didn't stop NSFW. Twitter, Reddit, Tumblr, etc didn't. AI is bound to be used for NSFW among other things, but they've set the standards to make it ethical.
I think eventually they did let loose to try to keep ahead of competitors. This probably pissed off the board and led to the drama recently? Just speculation. Because the new models are nowhere near as anal as the initial release.
> If they just wanted money, they could have cut out a lot of the ethical filters and choke out competitors.
I think you are wrong here; the safety filters (“safety” and “ethics” for AI are labels for the boundaries and concerns of different ideological factions, and Altman and OpenAI are deep in the “safety” side - which has the most money and power behind it, so “safety” is also becoming the generic term for AI boundaries) are an important part of the PR and regulatory lobbying effort, which is key to OpenAI’s long-term moneymaking plan, given the absence of a durable moat for commercial AI.
> If they just wanted money, they could have cut out a lot of the ethical filters and choke out competitors. Google didn't stop NSFW. Twitter, Reddit, Tumblr, etc didn't.
Neither, in practice, has OpenAI - there are whole communities built around using OpenAI's models for very, very NSFW purposes. They've prevented casual NSFW, to present the image they want to the government and interest groups whose support they want as they lobby to shape regulation of AI, avoiding being a target for things like the 404 Media campaign against CivitAI, where the NSFW is more readily visible.
I have not yet seen a definitive conclusion that training a model equals breaching intellectual property rights. There is always the right of citation, and as long as the model is not producing copies of copyrighted material: where then is the violation?
IP rights do not per se give the author an absolute right to determine how their work is used. It does give rights to prevent a reproduction, but that is not what an AI model does.
> It does give rights to prevent a reproduction, but that is not what an AI model does.
Why would you conclude that? While an AI model does not ONLY reproduce, it most certainly can make verbatim reproductions. The things preventing the user from getting copyrighted material from ChatGPT are probably only rules/guardrails. The most prominent example of this is perhaps the Bible, which you could get it to reproduce quote by quote, within the token limit.
That shouldn’t matter, regarding copyright violation. Purchasing a book only makes the physical object your own, it doesn’t change anything with regard to the contained work.
I agree, also my thought, but that is a principally different case than requiring consent from the author. That would suggest that OpenAI downloaded pirated material or hacked paywalls - how did they get this material?
BTW, I noticed that GPT-4 is good at writing legal letters of the sort that are widely available online. But a subpoena ('dagvaarding', the Dutch version, which I have researched) it completely fails to create. Also, there are not many subpoenas available online, and the court (in the Netherlands) only publishes the verdicts, not the other documents. Lawyers, OTOH, have a lot more of this available in their libraries.
So, my impression is that there is still a lot of material out there that is not in the corpus.
Yeah, but they won't get non-public data that way. I'd bet they did get access to a lot of non-public data just by asking and stating they do it for a non-profit mission.
Why should that matter? I have read many borrowed books for free, and can quote many of them. We have huge institutions devoted to letting people borrow, read, and learn from books; they are called libraries and archives.
I've heard this argument before, but I think it's pretty clear that humans and machines are not equal before the law and the fact that you have the right to make a derivative work doesn't necessarily mean that you can make a machine to do that for you.
>What if they just bought dirt cheap used copies meaning the creators saw nothing thanks to the first sale doctrine?
They could without question set up a very nice physical library in Mountain View and even invite the public in. They can probably in general scan those books for their own internal use. What got shut down was scanning the books and making them available to everyone in their entirety.
> I'm genuinely wondering how this is different from them using others' work without consent, but IANAL so maybe I'm just confusing morality and legality
When it comes to cutting edge business vs business decisions, the legality is often defined post-factum, e.g. in courts.
For an outsider to know whether something in a case like this is legal or not is near impossible, considering how opaque such businesses are.
My guess is that using the OpenAI models to generate training data for a bespoke transformer model is extremely taxing on OpenAI's computing usage. If I were to guess, that is why that behavior is proscribed by the TOS, and why ByteDance was banned. Probably has nothing to do with the ethics of how training data is gathered.
Yes, but we're doing the same thing with industrialization. The US is past the stage of factories, so we'll deny everyone else factories because we deem them too polluting now.
Not really and I’m suspicious you know these are not directly analogous.
We’re not trying to curb emissions for the purpose of kneecapping other economies. Short of China, we don’t really have any incentive to do that (bigger markets = better under US industrial policy [contradictory opinions from left wing undergrad students on Twitter don’t count as “US industrial policy”]). What’s actually happening is that we caused a problem and now it’s getting worse and in order to fix it we need to not allow everyone else to continue it.
This is a novel and advanced philosophical argument called, “two wrongs don’t make a right.”
It could be seen as a non-trade tariff barrier if you squint quite hard.
There's another example - intellectual property. The US was fine playing fast and loose with IP (Most famous example is Dickens' attempts to point out he was being pirated left and right in the States and not seeing a penny: https://www.charlesdickenspage.com/copyright1842.html)
Huh I wonder if there have been any substantial changes to international cooperation, investments made thereupon, and agreements made thereupon between the 1850s and 2013.
Countries don’t have to join international trade regimes. They also don’t have to join climate/emission commitments. They do both of them because they come with benefits.
Cursory searching suggests the first real work on international copyright, by contrast, came about in 1886. Even early versions came after Dickens’ story here.
History doesn't repeat, but it does rhyme. Look at the shape of the story. The people on top support rules that completely coincidentally help keep them on top. It's a universal impulse.
"It is difficult to get a man to understand something when his salary depends on his not understanding it." (Upton Sinclair) is another example with a similar shape, but at the scale of individuals.
You must understand that we have to hold China to a 19th century standard while holding western nations to a 21st century standard because... uh.. reasons. Don't question it!
What? Absolutely question it. Let me know if you find an answer that’s significantly more believable than, “trying to balance local quality of life, long term environmental and economic viability, and short term economic prosperity and political stability.”
If you have a different balance to strike that you think is significantly and obviously better, I’m sure the whole world is interested in hearing it.
I’m aware. The implication of the sarcasm is there’s no good reason China and western countries have different standards, so that’s what I was addressing.
It’s a great ideal, but there are interests other than purely environmental (or purely “fairness”) that must be taken into account as a matter of sheer necessity.
> Do we know what data was used? And what the constraints were around it?
The fact that the question/accusation has been raised a great many times and they have not stated "we know we haven't used information without licence because we had procedures to check licensing for all the data used to train our models", would certainly imply that they scraped data for training without reference to licensing, which makes it very likely that the models are based significantly on copyrighted and copyleft covered information.
> Do we know it was used without permission
No, we don't know for sure. But the balance of probabilities is massively skewed in that direction.
There are enough examples of image producing AIs regurgitating obvious parts of unlicensed inputs, as an indication of the common practise of just scraping everything without a care for permission. So asking for those with other models to state how they checked for permission for the input data is reasonable.
Yes, we've known for a long time that they don't shy away from taking any old code on GitHub and regurgitating it without explicit permission. They don't have benefit of the doubt anymore.
It’s not as cut and dried as you’re making it out to be.
For years we’ve accepted that search engines, for example, can grab all the code on GitHub and use it to build a search index.
Google image search, in particular, ‘regurgitates’ all the images it has indexed when it thinks they match a search term.
It has a little disclaimer it shows next to the results saying that ‘images may be copyrighted’ - figuring out if they are copyrighted and if so by whom is left as an exercise for you the searcher. Depending on what you are using the image search for, the copyright of the images may, after all, not be relevant. Like, if you’re using a Google image search to get inspiration for home decor designs, do you care who owns the copyright of each image? Should Google?
GPT poses similar risks to that. It has the explicit disclaimer that things it produces might be subject to copyright. Depending on what you’re using the output for, the copyright may or may not be relevant.
There seems to be a fairly clear distinction between importing copyrighted material to make an index for the narrow purpose of directing people towards the copyrighted material at its original location in its original form, and importing copyrighted material to make an index which improves their own service's ability to generate unattributed derivative work. It's a bit muddier when it comes to things like image searches, but they're not exactly difficult to opt out of.
Google actually pays license fees to News Corp to excerpt their content in Google News following a legal challenge, so it's not exactly conclusively established that search engines have global rights to do what they do anyway. But search engines are mostly beneficial to content creators rather than mostly competitive with them.
Google does its best, but there are limits to “directing people towards the copyrighted material at its original location in its original form” - not everything in the intellectual property world is ‘internet native’. The original form of a song lyric or a movie screenplay doesn’t have an ‘original location’ you can be directed to. You can be directed to various online sources that may or may not accurately reproduce the original, and may or may not be scrupulous about attribution, and may or may not have legitimate copyright license to distribute it in the first place.
Yes, Google will often unwittingly point to other people's copyright violations (it's useful like that!) and will usually only take down the cache/link when requested to do so by the copyright holder.
This is irrelevant to the original point about the purpose of a search engine being to highlight rather than replace existing information sources, and OpenAI's purpose for indexing content and policy of not engaging with copyright holders being completely different.
No, but then, as you're (presumably) a person rather than an information retrieval system, you would be legally responsible for ensuring you had performance rights and paid royalties if you were quoting that script as part of a commercial service. That responsibility rests with you, not whoever gave you access to the film.
Conversely, photocopiers and text-to-speech engines and LLMs don't exercise choice over whether they reproduce copyrighted material and so can't be held responsible, so responsibility for clearing rights to redistribute/transform in that format clearly lies with the people inputting the copyrighted material. Obviously, OpenAI has tended to avoid making any attempts to secure those rights whatsoever
Most libraries have photocopiers for their patrons to use. It’s their patrons’ responsibility to determine if any copying they do is permissible under fair or personal use rules. The library doesn’t know what you’re planning on doing with the information they shared with you.
Aside from copyright, this raises anti-trust issues.
Courts generally uphold provisions against reverse-engineering (as protecting internal, proprietary knowledge) but are more welcoming to copying interfaces (as encouraging market substitutes). So one question would be whether OpenAI can restrict use of the output of their tool in this manner, since the output itself is manifestly open (to the customer). That seems novel. The only analogy I know of is database licensing that prevents customers from publishing comparisons, which seems anti-competitive.
Anti-trust policy is motivated mainly in mature markets, where one player has fairly (by hypothesis) grown to dominate. The law and courts apply special scrutiny to identify ordinarily-acceptable market practices that extend the market power of the dominant player.
But is it the same analysis in a growing market? It seems like even if OpenAI (or especially, OpenAI+Microsoft) is dominant, if the market is growing quickly, the concern might be relaxed since the dominance is uncertain. Conversely, if the market is particularly susceptible to capture, early leaders might warrant heightened scrutiny.
But aside from monopoly's first-order effect on reducing competition, the second-order effect is to reduce investments in competitors, which has anti-competitive effects. That concern could be highest in the early stages of a market.
The terms against using the output to develop a competing product seem the same as reverse engineering to me.
Competitors can't get OpenAI's model weights, but they can use its outputs to produce a functionally similar model.
It's like if you had a competitor's engine and couldn't open it up but could still see the outputs: torque, rpm, ..., and could control the inputs: fuel intake, air mixture, etc... Then you make an engine by inferring back from these measurements.
No, it will be an engine that produces the same output if given the same input. It's like saying Google reverse engineered Yahoo and built a search engine.
I don’t think it counts as reverse engineering if you only treat it as a black box.
In addition, this is much more akin to data exfiltration than to copying an engineered mechanism. The training algorithm could count as an engineered mechanism, but that’s not what is being copied.
The engine analogy is not a great one. Those inputs and outputs are very crude information. The actual design of an engine is quite complicated and subject to very close tolerances.
Your analogy is like going to an airport and looking at departure and arrival times as well as the flight path and then “reverse engineering” an aircraft from that. The chances are very low that you produce anything remotely resembling the aircraft you’re trying to reverse engineer. Same goes for the engine.
In the case at hand we are dealing with a mathematical function. In->Out is all there is. Back-estimating a mathematical function by sampling is as close to reverse engineering as anything is.
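To make that concrete, here's a toy sketch in Python - the black_box function is made up purely for illustration, and a real LLM distillation involves vastly more data and a neural network rather than a polynomial, but the principle of cloning behaviour purely from input/output samples is the same:

```python
import numpy as np

# Stand-in for the model we can only query, never inspect (made up for illustration).
def black_box(x):
    return np.sin(2.5 * x) + 0.3 * x

# All we have is input -> output access, so sample it...
xs = np.linspace(-2, 2, 200)
ys = black_box(xs)

# ...and fit our own approximation to those samples (a degree-9 polynomial here).
clone = np.poly1d(np.polyfit(xs, ys, deg=9))

# The clone mimics the behaviour without ever seeing the internals.
print(np.max(np.abs(clone(xs) - ys)))  # small residual error
```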
Am not justifying what OpenAI did, but nobody is stopping ByteDance from doing what OpenAI did. They can also use the world’s information. Instead, since OpenAI has “cleaned” the data, they are trying to use OpenAI’s cleaned dataset. After OpenAI spending endless amounts of money on that, am not surprised they don’t want others to steal their “cleaned” dataset.
The massive illegal scraping of data on the internet is an "only done once" type of deal. After platforms have learned of the abuse OpenAI has engaged in, content platforms are now gated and under access controls. You can't access NSFW content on Reddit without logging in, for reference [1]. You could before the OpenAI buzz existed. The point of the illegal scraping is the first-mover advantage. Subsequent scrapings will not be as easy. This is also the reason why we could send FBI agents to OpenAI to bust their servers and delete the training data. Afterwards, gathering this data again would be much harder, thus delaying any kind of LLM "progress" in the future. For LLM skeptics, this is a dream. Jail the executives, send in the feds to light the server rooms on fire.
[1] still works on old.reddit.com
Reddit gating NSFW content with login is pretty obviously a play to increase signups and therefore engagement. Making scraping less feasible might just be a bonus, but attributing the whole thing to that is a stretch.
There are stories all over the web of content houses locking down their stuff after they found out OAI was benefitting commercially from harvesting it. This hasn't been true for at least a year. See Reddit.
I think GP is pointing out that someone who spends years building a large online gallery of their artwork, only for it to be smushed into a pool of vector mush, has the same reasoning to prevent OpenAI from using their work as OpenAI does to prevent competitors from using their artisanally laundered dataset.
Doesn’t matter how much endless amounts of money they spend, they’re going to have to contend with the fact that the value they ship is derived from others’ work. It’s just diluted to the point of it becoming “data” rather than “artworks”.
The actual content is the clean stuff. If you disagree then you accept OpenAI could just create all the content themselves instead of scraping, which is comparatively trivial.
>As part of the deal, ChatGPT users will receive summaries of news stories from Axel Springer’s brands, including Politico, Business Insider, Bild and Welt, with attribution and links to the original sources of reporting, the companies said Wednesday. The agreement will allow OpenAI’s models to take advantage of the publisher’s higher quality and more current information in its chatbots’ answers.
> Just look what happened to Federal President of Germany, Christian Wulff, when he declared that Islam is part of Germany.
What happened to him? By reading the Wikipedia article, he seems like a very corrupt individual in a position of power. I don't see how Bild was involved in that.
Nope, he didn't; Axel Springer regularly campaigns against people they don't like, including private people.
Let me quote one of Germany's best cabaret artists, Volker Pispers: "Bild-Zeitung..... this filthy newspaper that is so disgusting that you insult dead fish if you wrap it in it!"
There is a reason Heinrich Böll wrote a book about them:
> Employees involved are well aware of the implications; I’ve seen conversations on Lark, ByteDance’s internal communication platform for employees, about how to “whitewash” the evidence through “data desensitization.” The misuse is so rampant that Project Seed employees regularly hit their max allowance for API access.
If you ever wondered how all these models could seem to catch up to GPT-3.5 so quickly, but then struggled to do noticeably better (much less exceed GPT-4), while not talking about their data or saying they definitely weren't simply training on GPT-3.5 outputs, remember this: they might just be lying.
> While ByteDance’s use of our API was minimal, we have suspended their account while we further investigate
I don't like ByteDance at all, but I hope OpenAI are aware both shady and legit companies are making some serious cash using their API.
> All API customers must adhere to our usage policies to ensure that our technology is used for good.
"For good" sounds like the philosophy of companies like Stripe and Twitch. There always is a grey area with stuff that is good for many people, but other people see as evil.
I have occasionally wondered how easy it would be to bootstrap your own models using models from the big players. I remember first thinking about this when IBM Watson was the newest hottest thing on the block.
Then OpenAI came along and it seemed like it wouldn't be necessary any more because they were releasing their models anyway, except for when they suddenly decided that they wouldn't any more.
But others have been carrying that torch since (Llama and many others). Still it seems an interesting way of enhancing a model.
> I have occasionally wondered how easy it would be to bootstrap your own models using models from the big players.
I guess OpenAI did the same. If you read their API terms one of the first things they prohibit is to use it to train a competing model. Maybe they do the same tricks internally and know how powerful it can be?
Every time someone uses 'the good' to justify their cause, I know that they are lying and are not who they are generally thought to be. And I'm furious and tired that it's always the 'intelligent' people who are guilty of this. I find this hypocrisy very difficult to deal with.
This must be for show. If ByteDance or anyone else is sufficiently motivated to distill OpenAI’s models, it can’t be prevented. You can simply pay other people to collect the data for you.
You just block their IPs; gain intel from internal employees; have the expensive law firm start sending C&Ds and prepping that lawsuit.
If you tell someone that they can’t use your product in a certain manner, and then they go to extra lengths to circumvent your measures to gain a profit for themselves, then there are going to be big civil and criminal legal problems.
State-sponsored or not, ByteDance has a bank account and a business license here in the U.S.
P.S. if I was in the position of OpenAI and Sam Altman, and Bytedance kept playing games after we had slapped their hand; I would just “shadow ban” them based off IP and user history, and serve them back garbage results to throw off their models :]
How do you figure? It would look like people from all over the place running API workloads on a wide range of tasks and topics (if the goal is to distill the model generally, that’s what you’d need). So many people are already using GPT-4 to synthesize fine-tuning datasets, I think it would be invisible.
However, you might wonder what the goal is. This “API distillation” is good for teaching a pretrained model how to do lots of things. But the end result is always constrained by the quality of the pretrained base model, and API outputs don’t help at all with that part.
Can someone here explain how an existing model like ChatGPT, that you only have API access to, can be used to train your own model and somehow copy aspects of it?
That's entirely non-intuitive to me how that would work. Like are they just asking it questions about every topic under the sun, and then creating training materials out of the answers? How would you even begin to assemble a list of prompts that would cover all of knowledge? And how could you ever distinguish useful outputs from nonsense hallucinations?
I feel like I'm missing some key details here that the article and its links don't explain at all.
Not an expert but I suspect it would be most useful for the instruction tuning and fine-tuning phases. They would pre-train the model on some huge dump of stuff from the internet like the Pile -- Wikipedia, books, Reddit scrapes and so on, and the result would be something that can predict next words really well but doesn't know it's meant to answer questions. That's the pre-train. The next step is to train that on transcripts of chats so that it specialises in those, and finally on chats that have the balance of helpfulness and harmlessness that you want. Those latter two need significantly less data, but still more than you can cheaply produce with humans, so they could use OpenAI to generate it. But perhaps it's a small enough amount that even if you need an LLM to generate it economically, it's affordable to have humans read through it and throw away anything excessively hallucinated. Or perhaps you can restrict the scope of the conversations to ones where hallucinations are unlikely, and trust that fine tuning on them won't prevent the LLM from digging into its deeper knowledge in use.
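For what it's worth, the generation step could look roughly like this minimal sketch - the seed prompts are made up, the output format is just one plausible choice, and the exact client call depends on the library version, so treat it as an assumption rather than a description of what ByteDance actually did:

```python
import json
from openai import OpenAI  # assumes the official Python client; the call shape varies by version

client = OpenAI()

# Hypothetical seed prompts covering the behaviours you want the smaller model to learn.
seed_prompts = [
    "Explain recursion to a beginner.",
    "Summarise the plot of Hamlet in three sentences.",
]

with open("synthetic_chats.jsonl", "w") as f:
    for prompt in seed_prompts:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content
        # Store prompt/response pairs for a later instruction-tuning step.
        f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")
```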
Bytedance will still have access to the same open internet as OpenAI had. But it does not have the weights and the parameters.
However by asking ChatGPT questions and recording their answers, you can use that to fine-tune the model they are creating. They can tell their own model to answer the way ChatGPT answered.
In short, they are copying how ChatGPT would answer, and not finding out for themselves by creating their own question-and-answer (or completion) datasets.
That is what RLHF is.
But anyway, Chinese companies will be able to compete here, as this will be commoditized quickly and China is the king of creating and selling commodities.
> How would you even begin to assemble a list of prompts that would cover all of knowledge?
They don't need to, all that knowledge is already in public training sets from scraping the internet.
What is harder to get is the answer patterns that ensure a good user experience. You want a lot of such answer patterns so the model knows what structure to use for different kinds of questions. That structure isn't just to make the result formatted for humans, but contains the reasoning paths the LLM takes in order to arrive at reasonable answers. Since an LLM's thinking is the words it writes, the structure of how it responds corresponds to thinking patterns, and you want the LLM to learn a lot of those; internet data won't contain that, but ChatGPT responses will.
TLDR: The words an LLM writes are also what it thinks, since it doesn't hide thoughts, so by training on what ChatGPT writes you also train on how it thinks. It isn't the facts you want but the thinking patterns.
"OpenAI suspends ByteDance's account after it used GPT to train its own AI model"
That's truly hilarious... It's akin to a child who stole a candy bar in a candy store complaining about their sister stealing that same candy bar from them.
They also have their own web scraper called ByteSpider that scrapes websites with lots of text very aggressively and ignores robots.txt. I've had to block it by useragent on one of my sites.
I don't think it ignores robots.txt, I think it just doesn't have a very good parser and you need to give them their own user-agent block. I had a similar level of frustration.
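Something like this in robots.txt is what I mean - assuming the crawler honours the Bytespider token, which may differ from the exact user agent string it sends:

```
User-agent: Bytespider
Disallow: /
```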
After all, if they wanted to completely ignore the wishes of the website owners they probably would not announce their spider as such in the user agent. They’d just pretend to be a web browser.
Some of them yes. But not all. Try for example to browse a Cloudflare protected site from Tor and you will be hit with a constant barrage of captchas even though you are only doing GET requests.
Yes, heuristically, a Tor browser is more likely to be nefarious than a regular browser user. Note the use of heuristics - such as IP address - not related to user agent.
> Pillar II: Stem the Flow of U.S. Capital and Technology Fueling the PRC’s Military Modernization and Human Rights Abuses
> U.S. export controls have been slow to adapt to rapid changes in technology and attempts by adversaries to blur the lines between private and public sector entities, particularly the PRC’s strategy of Military-Civil Fusion.
> “The military is for civilian use, the civilian is military, and the military and civilian are fused.” In other words, no line exists between civilian and military technological development. U.S. export controls have yet to adapt to this reality
I guess they could figure it out via pattern repetition right? I thought they leave API users alone, especially corporate users… Or are they still monitoring corporate users? Sounds like a major security issue if I’m a CEO. Does Microsoft read and ban people from Microsoft Word/365? Etc
It depends on which perspective we want to use to analyze this: philosophical, religious, ethical, legal, business, or something else?
1. Philosophical: who gives a damn - it's meaningless at the cosmic level, with no impact on our species' trajectory.
2. Religious and ethical: wrong.
3. Legal: OpenAI took internet and public content not protected by any licensing agreement and got away with it! ByteDance took OpenAI's content, which is protected by their license terms, so those terms are applicable here.
4. Business view: OpenAI used others' data and got away with it; ByteDance tried and was caught!
5. Other dimensions of analysis are possible!
Your view = f(your perspective) - make your pick!
What exactly did they do? Just transform data or generate completely new data for training? I've seen plenty of people that have used GPT-4 to help transform data, i.e. using GPT-4 to generate summaries of texts and then using that to train smaller models. Not sure why this wouldn't be allowed, as it's not technically data that is coming from GPT-4.
If they don't allow this it just seems like they are trying to prevent people from building smaller cheaper models that will perform better for a specific use case and gobbling up the market for as long as they can.
My view: which perspective do we want to use to judge - philosophical, religious, ethical, legal, or business?
1. Philosophical: who gives a damn - it's meaningless in the cosmic scheme, with no meaningful impact on our species!
2. Religious and ethical: wrong.
3. Legal: our/internet content isn't protected by any usage terms (license); OpenAI's content is. So if ByteDance violated the terms, OpenAI is right.
4. Business: OpenAI took content unethically and got away with it; ByteDance tried and was caught!
There could be other dimensions of analysis - make your pick!
Related anecdote: last week a web server of mine was under very heavy load. The site was flapping and I could barely even SSH to the host. I did some basic analysis of the httpd logs and found that there were 13,000 unique IPs within the past week and 12,000 belonged to one subnet on Amazon EC2. The user agent for all those requests was ByteSpider. Millions of requests over a few days.
I'm into freedom of information and what not but come on!!!
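For anyone curious, the "basic analysis" was nothing fancy - roughly something like this (the log file name and the combined log format are assumptions about my setup; adjust the regex if yours differs):

```python
import re
from collections import Counter

# Combined access-log format is an assumption; group 1 = client IP, group 2 = user agent.
line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

unique_ips = set()
agents = Counter()
with open("access.log") as f:
    for line in f:
        m = line_re.match(line)
        if not m:
            continue
        ip, ua = m.groups()
        unique_ips.add(ip)
        agents[ua] += 1

print(f"{len(unique_ips)} unique IPs")
for ua, count in agents.most_common(5):
    print(f"{count:>10}  {ua}")
```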
Oh this is priceless: “ All API customers must adhere to our usage policies to ensure that our technology is used for good.”
Not violating terms. Not abiding by terms of use. But good vs evil.
It screeches “we occupy the moral high ground and are the arbiters of what is ethical and what is evil.”
The entire brand game plan around “Open”AI is to position themselves as first doing no evil, just like Google did in the early days, to enable doing a lot of evil.
Considering the extent to which YouTube is blatantly encouraging people to copy their videos, and the fact that OpenAI has no doubt used TikTok content for their models, this is like slapping a child.
You can use it to produce the fine-tuning examples. LLMs are very good at learning styles and answer flows, so you ask ChatGPT a million standard questions and problems and then record the outputs; that data will be very similar to the fine-tuning data used by OpenAI. Then put that data into your own LLM's fine-tuning step and voila, you have an LLM that behaves very similarly to ChatGPT!
OpenAI knows this which is why they put that TOS there, they spent a lot of money to create that finetuning data so they don't want to give all of that away for cheap.
They likely didn't use (just) ChatGPT, but the GPT-3.5/GPT-4 API.
Both can be used to create training data quite successfully. This technique has been used in the past to create synthetic post-training (fine tuning) datasets like Orca, Samantha, and so on.
What’s a competing model? Only LLMs, or images too? Kinda vague. I hope only big spenders get pinged because I’ve been using the vision api to help auto-annotate some training data.
The details on how exactly they may have used it to train their model is vague. I believe transfer learning or knowledge distillation are valid techniques based on the inference from other models.
You store the output from ChatGPT, you don't run it again every time you do a training step. Generating millions of examples to add to your own training won't cost much at all, relatively.
Interesting to see that capitalism is inherently inefficient. Freely sharing a trained model would be the most efficient and beneficial to progress, but the investment in training was done by private company to turn profits. In this light the wasted energy of cryptocurrency is just another inefficiency.
I still don't understand how they can keep a straight face claiming that training on all human-written material (copyrighted or not) that can be found on the Internet is perfectly fine, but training on ChatGPT output is not (or in other words, that human writers cannot have a choice on whether their output is used, but bot owners can).
> if OpenAI prevented them from using the same resources, which ChatGPT is not.
It kind of is. Just think of how many services changed their data sharing policies and closed APIs due to ChatGPT training (Twitter, Stack Overflow, Reddit). Maybe the analogy is that instead of pulling the ladder, they set fire to it so it’s burning and making it harder for others to climb. Even if they didn’t set it alight on purpose, I don’t imagine they’re losing sleep over it.
> Their ladder (using public data, and hiring humans to classify) is still available I believe.
Not really. Once chatGPT came out, many sites changed their terms and/or significantly increased their API access costs to prevent/limit/make cost prohibitive future scraping.
Good point. Though that affects OpenAI too for new data.
I had assumed most of their web content was from Common Crawl, and the older pre-ChatGPT Common Crawl datasets used would still be available. But it looks like Twitter, for one, was not in Common Crawl.
It’s not like OpenAI trained their model using someone else’s and now won’t allow it done to them. This seems more like saying “get your own content and do the work like everyone else”.
It’s pretty simple and not hypocritical to hold these two positions simultaneously:
- It’s legal and moral to train on data you have access to, regardless of copyright.
- Nobody is obligated to provide services to you so you can obtain that data from them.
It would be hypocritical if, say, ByteDance obtained synthetic data generated from GPT-4 and then OpenAI tried to prevent them from training on the data they already obtained. But all they are doing at the moment is temporarily pausing generating new data for them. OpenAI aren’t obligated to do this and OpenAI have never argued that other people are obligated to do it for them. So no hypocrisy.
I don't think they are making a legal statement, but just doing a business maneuver. Something could be perfectly legal, but just against company policy.
Just like Google is crawling the whole internet (hitting your server a million times a day) and then, with a straight face, will plug a captcha in your face if you dare search more than 3 times with a quoted string or non-trivial terms. Forget about doing a few million searches to bootstrap your dataset; Google was always hostile - both in access controls and pricing. You want results past the 100 or 1000 mark? Never possible, not even 20 years ago. But they say 1 bazillion web pages in their index match your search.
tl;dr Google crawls you, you can't crawl Google. How is that fair? They built their empire on our brain outputs but won't share theirs.
My memory of the early ish days is that Facebook heavily leveraged Google contacts (which was allowed by Google) to discover friends, and then blocked others from doing the same. Is that correct or can someone offer better info?
Google respects robots.txt though, which is more than they need to. If you put data out in public, what do you expect? If you don’t like Google crawling you though, just restrict them. That’s usually to your own detriment though, but I won’t judge you if you’re into self-flagellation. Just don’t think you’re holier than me because you are.
It's ok, the wheel is turning and now reality has come to bite them in the ass. CommonCrawl supplied the text and LLMs replace their index, for a large number of requests. A new crop of search engines like phind.com and perplexity.ai have better than Google results.
They certainly don't have the moral high ground, but in any other context this is usually considered an inference attack--sampling a model somebody else spent a lot of money to train in order to build a similar one at a much reduced training cost.
So while OpenAI absolutely lacks the moral high ground, ByteDance still seems to be engaging in adversarial behavior.
Just so I understand the argument, OpenAI would be claiming that anyone using their model outputs to distill i.e. train a smaller model on their model is a violation of copyright, but them training on the entirety of the internet (including copyrighted material) is not a violation?
I don't see any problem here. Their TOS to access the service is their right. Unless they used a bot to accept the TOS of some other site, scraping is completely legal.
Also, breaking OpenAI's TOS is likely completely legal, and everyone I know is collecting data for their own model. The worst they could do is ban the account.
It was a direct order from Paul Graham. He keeps mum about it but I have trusted sources who know the truth. Additionally, it's sort of public knowledge:
I don't have a full view into exactly why he was fired from both OpenAI and Y Combinator. But from what I hear the reasoning is a bit similar. Sam Altman is a bit of a political snake. He lacks ethics and he's not honest either. The last part is just me speculating on a lot of the anecdotes and quips I've heard over the years from people who know Sam.
Sam's public persona is very different. And I think a lot of HN viewers worship that public persona. But Sam being a darling of HN and Y Combinator themselves? No way. They fired him, so it's unlikely.
Nope, this is wrong. HN is a site run by Y Combinator and heavily moderated, but the comments are coming from the site's users and not Y Combinator itself.
The work on the internet is on the internet, freely available to all; the output of their API is the output of their API, only available after registration and agreeing to their terms and conditions.
OpenAI are free to block anyone from using their API if they want. Just like anyone hosting their content on a website is free to block the OpenAI web crawler.
It’s like company A collects everyone’s phone number and then publishes it as a phone book. And then company B copies the phone book and publishes it as their own.
It’s not a straightforward copyright issue but in many jurisdictions that is not allowed. Company A did the work, they should be allowed to profit.
In the US, company B would probably be in the clear--at least for the list of names and numbers. You don't necessarily get copyright protection just because something was a lot of work ("sweat of the brow"). The most relevant US Supreme Court case is Feist.
There is protection in a few notable jurisdictions so a violation would make the product illegal in those jurisdictions, which is a problem if it’s an online product.
In the same way, say, Putin can spew all that fantasy bullshit over and over to the Russian population, even though it doesn't make any sense even at a glance. And most folks back there do know it to a certain extent, yet he keeps up the show and the whole power dance instead of simply stating the truth that he is the current dictator and the rest can bow down and suck it up. Same with most if not all other dictators.
Not equating those two situations at all, just pointing out that the dynamics of communication between normal people don't really happen in many other situations, or if they do, it's just a shallow charade. Or... just don't expect fairness and good behavior when tons of money, power and legacies are at stake.
Google does the same. It wants the pages that it indexes to be original; if a page just copy-pastes another page's content, its score is affected negatively or it is even de-indexed - they call it spam.
Google itself on the other hand does just this, for example Wikipedia text and lyrics are taken from other pages and copy pasted onto Google's page.
Try telling a "googler" this and they'll go "noooo, but for Google it's different because Google has determined that's optimal and good for user experience". It's difficult to get someone to understand something when their paycheck depends on them not understanding it.
> Try telling a "googler" this and they'll go "noooo, but for Google it's different because Google has determined that's optimal and good for user experience". It's difficult to get someone to understand something when their paycheck depends on them not understanding it.
In my experience googlers are very capable of saying “Google is bad but my salary is good”. Plenty of people understand things that are contrary to their paycheck.
That said, Google generally respects Wikipedia’s and others license to the data. And it is generally in the users best interest to get to desired content/information in less steps, regardless of the data’s provenance.
What I said is: if I copy-paste things into my page, Google will kill it because it's spam. If I say "but I'm actually paying for this content"... that's irrelevant.
I'm not saying that Google is stealing content; I'm saying they're hypocritical in applying an argument when the conclusions benefit them, but not otherwise.
Alternatively, they caught the action promptly and kept use to a minimum
Also, what does “minimal” even mean? I’m sure they monitor accounts that max out their API request limits, or even just request programmatically (i.e. request patterns that don’t match a natural human use pattern, like slowing down during a time zone’s lunch hours). Maybe this was a couple of days’ worth of traffic.
What You Cannot Do. You may not use our Services for any illegal, harmful, or abusive activity. For example, you may not:
Use our Services in a way that infringes, misappropriates or violates anyone’s rights.
Modify, copy, lease, sell or distribute any of our Services.
Attempt to or assist anyone to reverse engineer, decompile or discover the source code or underlying components of our Services, including our models, algorithms, or systems (except to the extent this restriction is prohibited by applicable law).
Automatically or programmatically extract data or Output (defined below).
Represent that Output was human-generated when it was not.
Interfere with or disrupt our Services, including circumvent any rate limits or restrictions or bypass any protective measures or safety mitigations we put on our Services.
Use Output to develop models that compete with OpenAI.
It's going to be stupid for OpenAI to argue that those terms are binding when they've already argued in court that such terms are non-binding when they scraped other people's data.
It makes sense it'd say that. Of course, GPT is built on everyone's output itself.
So throwing around statements like "we suspended ByteDance to ensure GPT is used for good" are hypocritical at best. They're not the pope, they have no monopoly on good.