OpenAI suspends ByteDance's account after it used GPT to train its own AI model (theverge.com)
387 points by webmaven on Dec 16, 2023 | 267 comments



Isn't it a bit hypocritical of them to use other people's copyrighted work or "output" they weren't given license/permission to use, but deny people the same opportunity with their work/output?

I'm genuinely wondering how this is different from them using others' work without consent, but IANAL, so maybe I'm just confusing morality and legality.


The fun thing is that, IIRC, AI outputs aren't copyrighted, so this is actually more ethical than what OpenAI did. The only thing is that it was probably against the terms of service.


> aren't copyrighted so this is actually more ethical than what OpenAI did

Why is that? All works, as long as they are "original", should be granted copyright. It should belong to the person who made it (in this case, the user, not OpenAI).


It's not that cut-and-dried.

If I say to one of my human friends "write me a nursery rhyme", the copyright of the resulting rhyme would obviously belong to my friend - despite me prompting them. Clearly the prompt itself does not universally count as "making" it.

Let's say I made a "NovelSnippetAI", which contains a corpus of prewritten material. You can prompt it, and it will return a page which matches the sentiment of your prompt best. I think we can agree that the copyright of the page will still belong to the original writer - the user only did a query.

What if I did "NovelMixAI", which did exactly the same but alternated lines from the two best matches? What about "NovelTransformAI", which applied a mathematical formula to the best match and fed the output to a fixed Markov Chain? Now we're suddenly at "LLMAI", which does the same using a neural network - what makes it different from the rest?

Long-standing precedent is that any automated work does not qualify for copyright. You can only copyright human work.


Where the line is - i.e., how much human creative input is needed for a work to be covered by copyright - seems to be legally unclear. There are some interesting parallels to this dispute, I think: https://en.m.wikipedia.org/wiki/Monkey_selfie_copyright_disp...


And what happens when two users manage to get the same result? Does OpenAI need to implement an "/r9k/ algorithm" or something to prevent sufficiently similar results from ever being generated, because once a response is generated it is now copyrighted by the user who prompted it?


> If I say to one of my human friends "write me a nursery rhyme", the copyright of the resulting rhyme would obviously belong to my friend - despite me prompting them. Clearly the prompt itself does not universally count as "making" it.

This is the wrong analogy. The LLM is more like Photoshop and the prompt little more than a filter configuration. A machine cannot copyright its own output but a human guiding that machine can.


No, it's the right analogy for the point I was making. That analogy is there to point out that it is about more than just the prompt, and prompting a machine is covered in the rest of my comment.


It is more like a child's See 'n Say: push a button and a cow noise comes out. Someone else pushes it and the same noise comes out. Even if a billion buttons exist, the child doesn't own the cow noise.


Machine-generated works are not eligible for copyright, and courts have been ruling this way about AI content. https://www.reuters.com/legal/legalindustry/who-owns-ai-crea... Early days and there are still appeals as well as new legislation in progress, but that's where it stands now.


In the monkey selfie copyright dispute, Slater could not claim copyright because he did not operate the camera. Corporate personhood (juridical personality) is the legal notion that a juridical person such as a corporation, separately from its associated human beings (owners, managers, or employees), has at least some of the legal rights and responsibilities enjoyed by natural persons. If Mr. Slater were incorporated, would his copyright ownership be clear? If an AI were a corporation rather than an asset of a corporation, and could prompt itself, would the output then be copyrightable? There are a lot of ifs in there, but it's still interesting.


IANAL, but I would guess it depends on whether the animal is an "employee" of the company. Obviously there are animals that are owned by companies and, if they create something, the company I assume would own it whether or not it was considered a work for hire in the usual sense. But that would presumably not have been the case here.


Generative AI has basically made what were once fairly irrelevant edge cases (what if I tie together a bunch of random number generators to create art work?) a lot more interesting. And laws will probably have to be adapted.


> It should belong to the person who made it

That's the contention: it's not a person who made it.


In the US, Copyright requires creative input by a person. Just because "work" was done doesn't make something copyrightable.


What if the AI is fully autonomous, pays its own hosting bill etc? Who gets the copyright?


No one, because according to current legislation, copyright only applies to works by humans.


I'm asking about a hypothetical scenario where the law reflects the ideal outcome. What is the ideal outcome?


Obvious new business model: provide a human shill service where you look at AI output and say "yep, I made that". Now it's copyrighted, assigned to the customer of your startup.


So you're suggesting that people commit fraud, and base their business around committing fraud? I don't see that being super scalable.


Of course, people can always lie, but they also can be caught lying.


no AI output is "original"


Should have read the small print.


I asked ChatGPT and it said it was fine.


The irony. AI would help with that too! :D


Morality and legality aside, there's a substantive difference between use of content and use of a model. Pretraining a GPT-4-class model from raw data requires trillions of tokens and millions of dollars in compute, whereas distilling a model using GPT-4's output requires orders of magnitude less data. Add to that the fact that OpenAI is probably subsidizing compute at their current per-token prices, and it's clearly unsustainable.

The morality of training on internet-scale text data is another discussion, but I would point out that this has been standard practice since the advent of the internet, both for training smaller models and for fueling large tech companies such as Google. Broadly speaking, there is nothing wrong with mere consumption. What gets both morally and legally more complex is production - how much are you allowed to synthesize from the training data? And that is a fair question.


“Content” requires as much effort and expense as pretraining GPT-4, if not more.

All you're doing is redefining content - i.e., thoughts, ideas, movies, videos, literature, sounds, writing, etc. - as "raw data". But that isn't raw data. A ton of effort went into creating the "content". For example, a single Wikipedia page may have had many hundreds of contributors, some of whom have done years of college-level study and original research, to produce a few thousand words of content. Others have done research using primary sources. All of them had to use effort and ingenuity to craft that into actual high-quality statements, which in itself was only possible in many cases due to years of training and education. Finally, they had to set up a validation process to produce useful output from this collaborative process, which included loads of arguments etc., to generate what you are calling "raw data".

I'm not sure what makes GPT's output any less raw than all the effort that went into producing a single Wikipedia page. Further, Wikipedia actually goes out of its way to cite its sources. GPT is designed to go out of its way to obscure its sources.

The only thing GPT does, IOW, that apparently makes the data it uses "raw" is not citing its sources - something that would at the very least lead to professional disgrace for the people who created the "raw data" GPT uses without a thought, and would even lead to lawsuits and prosecution in many cases.

So besides going out of its way to obscure the source of its data, what makes GPT’s output less raw than the output people have spent billions of man hours creating?


Except that the content already exists and there is no cost to maintain it.

If GPT incurred a non-negligible cost on the content owners by accessing their resources it might have been different, but that's not the case.

The only thing content owners may be able to complain about is that ChatGPT/DALL-E may potentially reduce their income, and this would have to be proven. I have not stopped buying books or art of any kind since I started using ChatGPT/DALL-E. And low-quality automated content producers existed before OpenAI and were already diluting the attention given to more carefully produced content (as can be seen with videos on YouTube).


It seems like you have no idea how much effort it takes to write a book.

Quite often it contains the life experience of a person condensed into a few hundred pages.

ChatGPT gives easier access to the knowledge contained in tens of thousands of these books. As for me, I have been reading fewer and fewer books as more wisdom becomes accessible on the internet in better forms (now GPT).

I'm not against what OpenAI is doing as it moves humanity forward, but like you said I won't stop using ChatGPT just because ByteDance scrapes it.


That's great to hear that there's no cost to maintaining content! I'll tell AWS they've been overcharging me :)


Not what I am saying. I am saying it is much, much smaller than the inference/model-running cost. Easy exercise:

How many books can you store in 1 GB? How much does it cost per year to store them and have OpenAI gather them once? How much does it cost to run a GPT-4-level model that will output 1 GB?

That's my point here, that's all. It is a huge cost for OpenAI to run a system that produces dynamic content, and it is not comparable to the cost of storing static content.

I didn't talk about the cost of producing the original data.

And I do not talk about training costs.
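
Spelled out roughly in Python, with every number below an assumption for illustration rather than a real price:

    GB = 1_000_000_000                # bytes
    BOOK_SIZE = 500_000               # assume ~500 KB of plain text per book
    books_per_gb = GB // BOOK_SIZE    # ~2,000 books fit in 1 GB

    STORAGE_PER_GB_YEAR = 0.30        # assumed object-storage cost, $/GB/year
    BYTES_PER_TOKEN = 4               # rough rule of thumb for English text
    PRICE_PER_1K_TOKENS = 0.03        # assumed GPT-4-class output price

    tokens_per_gb = GB / BYTES_PER_TOKEN
    generation_cost = tokens_per_gb / 1000 * PRICE_PER_1K_TOKENS

    print(books_per_gb)               # ~2000 books
    print(STORAGE_PER_GB_YEAR)        # storing 1 GB for a year: cents
    print(round(generation_cost))     # generating 1 GB: ~$7,500 under these assumptions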


Sure, but your comment said "maintain", not "store". Even if storage were free, and even if you discount the value of the initial creation to zero, there are still nontrivial serving costs associated with many sites. What I share with people on the Web may look like a static byte sequence to the robots consuming it, but it takes a lot of work to compute those bytes (in the moment, I mean). Aggregated over the whole web, no, that is not smaller than OpenAI's expenditures.


If cost is your primary concern, shouldn't you support ByteDance's efforts to reduce inference costs by distilling the model?

(while at the same time reducing future costs for everyone by distributing the capability more widely to prevent monopolization)


At no point did I say I did not support that.


The effort and resources required to train from raw data are nothing compared to the effort and resources that went into producing the "training" input. How much does it cost to produce all the things they scraped from the internet? So morally they are in the wrong - I don't care whether it's been standard practice since "the beginning of the internet" or not.


It’s also not standard practice since the beginning of the internet. Referencing original input through links is almost foundational to the internet (at least the original internet).

In fact, the power of linking to data sources is what Google is almost entirely built upon.


Others have already pointed out that you’re just shrugging off billions of hours and money that went into the content that is used to (pre-)train a model, so I’ll leave that for what it is.

I’m just curious how you start off with:

> Morality and legality aside

Only to then follow it up immediately with an argument for why one is more moral. Just because you didn't end it with "and that's why I think OpenAI is more moral" doesn't make it any less obvious or any less ironic.


Morality and legality are the only relevant questions in the discussion. The two methods are virtually the same... in fact, I'd argue that ByteDance's usage is more fair and moral. It really doesn't matter that it's cheaper and more efficient.


The cost of hiring humans to write the trillions of tokens they trained from scratch would surely be much larger than the training cost. Except they avoided that cost by using what's available on the Internet. [1]

Similarly, people are avoiding the cost of pre-training GPT-4 class model by scraping its output.

So I think it's fair to question the moral consistency of their ToS.

[1] Please note that I am not passing a judgement on this, just stating a fact in order to make an argument.


> Pretraining a GPT 4-class model from raw data requires trillions of tokens and millions of dollars in compute,

And millions of documents authored by people that weren't compensated.

The difference is consolidating all of that value into a single company.


And they largely put their work online for free, where anyone can read it, not expecting any kind of direct compensation from the reader.


> I would point out that this has been standard practice since the advent of the internet

Maybe it shouldn't have been? We've been frog-boiling toward this point for a long time, from a starting point that was generally good for content creators (your content is made more discoverable) to a point that is not so good for content creators (your content is scraped and digested, programmatically laundered and regurgitated on huge corporations' own platforms with token or no attribution, and no revenue shared).

In a parallel universe where search engines were explicitly opt-in from the beginning, I think these conversations would look very different today. What OpenAI and its peers have done would, I dare say, be uncontroversially (and correctly) regarded as theft. Just as I'm not allowed to distribute^[1] software incorporating somebody else's code in a way that violates the terms of its license (or lack thereof), I shouldn't be able to distribute software that incorporates any intellectual property that I don't have the rights to.

^[1] Broadly speaking.


Yes, but the original intent was to help other companies create ethical AI models. If they've already turned their back on those core values, a bit more hypocrisy won't stop them.


Stated intent and actual intent are usually different, I expect they intended the opposite all along but were just riding the open source ethical AI wave to profit.


I've talked to a few people who were at OpenAI a few years ago - the deviation from the original mission was a "boiling frog" process, best demarcated by the point when the Anthropic founders left. The core team doing the actual science and engineering work did legitimately believe in the open research mission. But when funding was hard to find, things kind of broke.


I strongly doubt it. Sam was pretty damn rich before OpenAI. He's written a lot about AI and how he worried someone would do it. The original plan was roughly: build the basis for ethical AI and have someone else build on top of it, which somehow pivoted into everyone using their API.

If they just wanted money, they could have cut out a lot of the ethical filters and choke out competitors. Google didn't stop NSFW. Twitter, Reddit, Tumblr, etc didn't. AI is bound to be used for NSFW among other things, but they've set the standards to make it ethical.

I think eventually they did let loose to try to keep ahead of competitors. This probably pissed off the board and led to the recent drama? Just speculation, because the new models are nowhere near as anal as the initial release.


> If they just wanted money, they could have cut out a lot of the ethical filters and choke out competitors.

I think you are wrong here. The safety filters ("safety" and "ethics" for AI are labels for the boundaries and concerns of different ideological factions, and Altman and OpenAI are deep in the "safety" side - which has the most money and power behind it, so "safety" is also becoming the generic term for AI boundaries) are an important part of the PR and regulatory lobbying effort, which is key to OpenAI's long-term moneymaking plan, given the absence of a durable moat for commercial AI.

> If they just wanted money, they could have cut out a lot of the ethical filters and choke out competitors. Google didn't stop NSFW. Twitter, Reddit, Tumblr, etc didn't.

Neither, in practice, has OpenAI - there are whole communities built around using OpenAI's models for very, very NSFW purposes. They've prevented casual NSFW use, to present the image they want to the government and interest groups whose support they want as they lobby to shape regulation of AI, avoiding being a target for things like the 404 Media campaign against CivitAI, where the NSFW is more readily visible.


I have not yet seen a definitive conclusion that training a model equals breaching intellectual property rights. There is always the right of citation, and as long as the model is not producing copies of copyrighted material, where then is the violation?

IP rights do not per se give the author an absolute right to determine how their work is used. It does give rights to prevent a reproduction, but that is not what an AI model does.


This is currently untested (though, trials are in progress) and really could go either way in the courts.


> It does give rights to prevent a reproduction, but that is not what an AI model does.

Why would you conclude that? While an AI model does not ONLY reproduce, it most certainly can make verbatim reproductions. The things preventing the user from getting copyrighted material out of ChatGPT are probably only rules/guardrails. The most prominent example of this is perhaps the Bible, which you could get from it quote by quote, within the token limit.


But did they purchase every non public domain book they used? Highly doubt it.


That shouldn’t matter, regarding copyright violation. Purchasing a book only makes the physical object your own, it doesn’t change anything with regard to the contained work.


I agree - also my thought - but that is a principally different case than requiring consent from the author. That suggests that OpenAI would have downloaded pirated material or hacked paywalls; how did they get this material?

BTW, I noticed that GPT-4 is good at writing legal letters of the sort that are widely available online. But a subpoena ('dagvaarding', the Dutch version, which I have researched) it completely fails to create. There are also not many subpoenas available online, and the court (in the Netherlands) only publishes the verdicts, not the other documents. Lawyers, OTOH, have a lot more of this available in their libraries.

So, my impression is that there is still a lot of material out there that is not in the corpus.


> how did they get this material?

"Hello, we are a non-profit that wants to make AI models to benefit humanity. Can you give us access to your data to help us in our work?"


This is not what they did. They fed off enormous databases of pirated material.


Yeah, but they won't get non-public data that way. I'd bet they did get access to a lot of non-public data just by asking and stating they were doing it for a non-profit mission.


Why should that matter? I have read many borrowed books for free, and can quote many of them. We have huge institutions devoted to letting people borrow, read, and learn from books; they are called libraries and archives.


I've heard this argument before, but I think it's pretty clear that humans and machines are not equal before the law and the fact that you have the right to make a derivative work doesn't necessarily mean that you can make a machine to do that for you.


Did Google Books?

What if they just bought dirt cheap used copies meaning the creators saw nothing thanks to the first sale doctrine?


>What if they just bought dirt cheap used copies meaning the creators saw nothing thanks to the first sale doctrine?

They could without question set up a very nice physical library in Mountain View and even invite the public in. They can probably in general scan those books for their own internal use. What got shut down was scanning the books and making them available to everyone in their entirety.


Bytedance violated the contractual arrangement.

That’s not the same thing as general copyright.


There's a saying, "thief who steals from a thief has a hundred years of forgiveness".


> I'm genuinely wondering how this is different from them using others' work without consent, but IANAL, so maybe I'm just confusing morality and legality

When it comes to cutting edge business vs business decisions, the legality is often defined post-factum, e.g. in courts.

For an outsider to know whether something in a case like this is legal or not is near impossible, considering how opaque such businesses are.


My guess is that using the OpenAI models to generate training data for a bespoke transformer model is extremely taxing on OpenAI's computing usage. If I were to guess, that is why that behavior is proscribed by the TOS, and why ByteDance was banned. Probably has nothing to do with the ethics of how training data is gathered.


Yes, but we're doing the same thing with industrialization. The US is past the stage of factories, so we'll deny everyone else factories because we deem them too polluting now.


Not really and I’m suspicious you know these are not directly analogous.

We’re not trying to curb emissions for the purpose of kneecapping other economies. Short of China, we don’t really have any incentive to do that (bigger markets = better under US industrial policy [contradictory opinions from left wing undergrad students on Twitter don’t count as “US industrial policy”]). What’s actually happening is that we caused a problem and now it’s getting worse and in order to fix it we need to not allow everyone else to continue it.

This is a novel and advanced philosophical argument called, “two wrongs don’t make a right.”


It could be seen as a non-tariff trade barrier if you squint quite hard.

There's another example - intellectual property. The US was fine playing fast and loose with IP (Most famous example is Dickens' attempts to point out he was being pirated left and right in the States and not seeing a penny: https://www.charlesdickenspage.com/copyright1842.html)

Now America is on top of the IP pile, it sees other nations as playing fast and loose: https://www.forbes.com/sites/johnlamattina/2013/04/08/indias...


Huh I wonder if there have been any substantial changes to international cooperation, investments made thereupon, and agreements made thereupon between the 1850s and 2013.

Countries don’t have to join international trade regimes. They also don’t have to join climate/emission commitments. They do both of them because they come with benefits.

Cursory searching suggests the first real work on international copyright, by contrast, came about in 1886. Even early versions came after Dickens’ story here.


Well obviously, change is constant.

History doesn't repeat, but it does rhyme. Look at the shape of the story: the people on top support rules that completely coincidentally help keep them on top. It's a universal impulse.

"It is difficult to get a man to understand something when his salary depends on his not understanding it." (Upton Sinclair) is another example with a similar shape, but at the scale of individuals.

John Rawls had the right idea.


You must understand that we have to hold China to a 19th century standard while holding western nations to a 21st century standard because... uh.. reasons. Don't question it!


What? Absolutely question it. Let me know if you find an answer that’s significantly more believable than, “trying to balance local quality of life, long term environmental and economic viability, and short term economic prosperity and political stability.”

If you have a different balance to strike that you think is significantly and obviously better, I’m sure the whole world is interested in hearing it.


He was being sarcastic.


I’m aware. The implication of the sarcasm is there’s no good reason China and western countries have different standards, so that’s what I was addressing.


We share the same planet, so we need to share the same environmental standards. The self-touted "world's oldest civilization" no less than the rest.


They're aiming to hit net zero before 2060. It's.... ambitious (which I use here as a synonym for unlikely). USA and EU are aiming for 2050.

Will anybody meet their goals? The planet is the ultimate Commons. Personally, I think we're boned.


It’s a great ideal, but there are interests other than purely environmental (or purely “fairness”) that must be taken into account as a matter of sheer necessity.


You appealed to fairness, not me.


It’d be easier to have productive conversations if you just plainly stated your position on the topic at hand, if you have one.


Did I not?

> We share the same planet, so we need to share the same environmental standards

All of the excuses to hold China to a different standard are horse shit. Is that stated plainly enough for you?


Huh, so now I’m confused as to how you claim not to be invoking fairness


>“two wrongs don’t make a right.”

One wrong gets punished and the other - "we are exceptional" so we do as we please and do not fuck with us or else.


Since when have corporations ever been ashamed of hypocrisy? Corporations can't feel shame about anything.


Nope. OpenAI is bearing all the associated legal risk, not ByteDance.


it's not a bit hypocritical, it's extremely hypocritical


well nothing is "hypocritical" in the Ayn Randian world of "I have the marbles so now you pay me"


Yes


> they weren't given license/permission to use

Do we know what data was used? And what the constraints were around it?

Do we know it was used without permission or are we just jumping in the “ai bad” bandwagon?


> Do we know what data was used? And what the constraints were around it?

The fact that the question/accusation has been raised a great many times and they have not stated "we know we haven't used information without licence because we had procedures to check licensing for all the data used to train our models", would certainly imply that they scraped data for training without reference to licensing, which makes it very likely that the models are based significantly on copyrighted and copyleft covered information.

> Do we know it was used without permission

No, we don't know for sure. But the balance of probabilities is massively skewed in that direction.

There are enough examples of image producing AIs regurgitating obvious parts of unlicensed inputs, as an indication of the common practise of just scraping everything without a care for permission. So asking for those with other models to state how they checked for permission for the input data is reasonable.


Yes, we've known for a long time that they don't shy away from taking any old code on GitHub and regurgitating it without explicit permission. They don't have benefit of the doubt anymore.


It’s not as cut and dried as you’re making it out to be.

For years we’ve accepted that search engines, for example, can grab all the code on GitHub and use it to build a search index.

Google image search, in particular, ‘regurgitates’ all the images it has indexed when it thinks they match a search term.

It has a little disclaimer it shows next to the results saying that ‘images may be copyrighted’ - figuring out if they are copyrighted and if so by whom is left as an exercise for you the searcher. Depending on what you are using the image search for, the copyright of the images may, after all, not be relevant. Like, if you’re using a Google image search to get inspiration for home decor designs, do you care who owns the copyright of each image? Should Google?

GPT poses similar risks to that. It has the explicit disclaimer that things it produces might be subject to copyright. Depending on what you’re using the output for, the copyright may or may not be relevant.


There seems to be a fairly clear distinction between importing copyrighted material to make an index for the narrow purpose of directing people towards the copyrighted material at its original location in its original form, and importing copyrighted material to make an index which improves their own service's ability to generate unattributed derivative work. It's a bit muddier when it comes to things like image searches, but they're not exactly difficult to opt out of.

Google actually pays license fees to News Corp to excerpt their content in Google News following a legal challenge, so it's not exactly conclusively established that search engines have global rights to do what they do anyway. But search engines are mostly beneficial to content creators rather than mostly competitive with them.


Google does its best, but there are limits to “directing people towards the copyrighted material at its original location in its original form” - not everything in the intellectual property world is ‘internet native’. The original form of a song lyric or a movie screenplay doesn’t have an ‘original location’ you can be directed to. You can be directed to various online sources that may or may not accurately reproduce the original, and may or may not be scrupulous about attribution, and may or may not have legitimate copyright license to distribute it in the first place.


Yes, Google will often unwittingly point to other people's copyright violations (it's useful like that!) and will usually only take down the cache/link when requested to do so by the copyright holder.

This is irrelevant to the original point: the purpose of a search engine is to highlight rather than replace existing information sources, while OpenAI's purpose for indexing content, and its policy of not engaging with copyright holders, is completely different.


No, but their products regurgitate copyrighted material so I guess they either come clean about the data they used or we have to assume they stole it.


I can quote the entire script of Monty Python and the Holy Grail - would you assume I stole it?


No, but then as you're (presumably) a person rather than an information retrieval system you would be legally responsible for ensuring you had performance rights and paid royalties if you were quoting that script as part of a commercial service. That responsibility rests with you, not whoever gave you access to the film

Conversely, photocopiers and text-to-speech engines and LLMs don't exercise choice over whether they reproduce copyrighted material and so can't be held responsible, so responsibility for clearing rights to redistribute/transform in that format clearly lies with the people inputting the copyrighted material. Obviously, OpenAI has tended to avoid making any attempts to secure those rights whatsoever


Most libraries have photocopiers for their patrons to use. It’s their patrons’ responsibility to determine if any copying they do is permissible under fair or personal use rules. The library doesn’t know what you’re planning on doing with the information they shared with you.


Few libraries use the scanner themselves to make a digital copy of every single book to import into their proprietary information retrieval service.

The ones that do secure permission where the works involved are subject to copyright.


If you're presenting it as anything like original work, or giving it to someone else without appropriate credit/attribution so that they do, yes.


Aside from copyright, this raises anti-trust issues.

Courts generally uphold provisions against reverse-engineering (as protecting internal, proprietary knowledge) but are more welcoming to copying interfaces (as encouraging market substitutes). So one question would be whether OpenAI can restrict use of the output of their tool in this manner, since the output itself is manifestly open (to the customer). That seems novel. The only analogy I know of is database licensing that prevents customers from publishing comparisons, which seems anti-competitive.

Anti-trust policy is motivated mainly in mature markets, where one player has fairly (by hypothesis) grown to dominate. The law and courts apply special scrutiny to identify ordinarily-acceptable market practices that extend the market power of the dominant player.

But is it the same analysis in a growing market? It seems like even if OpenAI (or especially, OpenAI+Microsoft) is dominant, if the market is growing quickly, the concern might be relaxed since the dominance is uncertain. Conversely, if the market is particularly susceptible to capture, early leaders might warrant heightened scrutiny.

But aside from monopoly's first-order effect on reducing competition, the second-order effect is to reduce investments in competitors, which has anti-competitive effects. That concern could be highest in the early stages of a market.

Are there any good lawyer blogs on point?


The terms against using the output to develop a competing product seem the same as reverse engineering to me.

Competitors can't get OpenAI's model weights but use its outputs to produce a functionally similar model.

It's like if you had a competitor's engine and couldn't open it up but could still see the outputs: torque, rpm, ..., and could control the inputs: fuel intake, air mixture, etc... Then you make an engine by inferring back from these measurements.

Wouldn't that be reverse engineering?


No, it will be an engine that produces the same output if given the same input. It's like saying Google reverse engineered Yahoo and built a search engine


I don’t think it counts as reverse engineering if you only treat it as a black box.

In addition, this is much more akin to data exfiltration than to copying an engineered mechanism. The training algorithm could count as an engineered mechanism, but that’s not what is being copied.


The engine analogy is not a great one. Those inputs and outputs are very crude information. The actual design of an engine is quite complicated and subject to very close tolerances.

Your analogy is like going to an airport and looking at departure and arrival times as well as the flight path and then “reverse engineering” an aircraft from that. The chances are very low that you produce anything remotely resembling the aircraft you’re trying to reverse engineer. Same goes for the engine.


OK.

In the case at hand we are dealing with a mathematical function. In->Out is all there is. Back-estimating a mathematical function by sampling is as close to reverse engineering as anything is.
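
To make that concrete, a minimal sketch of back-estimating a black box by sampling (the function here is a made-up stand-in, nothing OpenAI-specific):

    import numpy as np

    # Pretend this is a proprietary black box we can only query, not inspect.
    def black_box(x):
        return 2.0 * x**3 - x + 0.5

    # Sample input/output pairs, the way a competitor sampling an API would.
    xs = np.linspace(-1, 1, 200)
    ys = black_box(xs)

    # Fit our own model purely from the observed samples.
    clone = np.poly1d(np.polyfit(xs, ys, deg=3))

    print(np.max(np.abs(clone(xs) - ys)))  # ~0: the clone reproduces the box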


OpenAI can use the world's images and text as training data, but as soon as its in their system, it can't be used by competitors.

Incredible. The most broken version of copyright imaginable.


Am not justifying what OpenAI did, but nobody is stopping ByteDance from doing what OpenAI did. They can also use the world’s information. Instead, since OpenAI has “cleaned” the data, they are trying to use OpenAI’s cleaned dataset. After OpenAI spending endless amounts of money on that, am not surprised they don’t want others to steal their “cleaned” dataset.


The massive illegal scraping of data on the internet is an "only done once" type of deal. After platforms learned of the abuse OpenAI has engaged in, content platforms are now gated and under access controls. You can't access NSFW content on Reddit without logging in, for reference [1]. You could before the OpenAI buzz existed. The point of the illegal scraping is the first-mover advantage; subsequent scrapings will not be as easy. This is also the reason why we could send FBI agents to OpenAI to bust their servers and delete the training data: gathering that data again would be much harder, thus delaying any kind of LLM "progress" in the future. For LLM skeptics, this is a dream. Jail the executives, send in the feds to light the server rooms on fire.

[1] Still works on old.reddit.com


Reddit gating NSFW content with login is pretty obviously a play to increase signups and therefore engagement. Making scraping less feasible might just be a bonus, but attributing the whole thing to that is a stretch.


> You can't access NSFW content on Reddit without logging in

Sorry, what? You think reddit is trying to prevent openai from scraping the porn subreddits???


There is quite a bit of content that is not porn marked as nsfw on Reddit.


There are stories all over the web of content houses locking down their stuff after they found out OAI was benefitting commercially from harvesting it. This hasn't been true for at least a year. See Reddit.


I think GP is pointing out that someone that spends years building a large online gallery of their artwork, only for it to be smushed into a pool of vector mush, has the same reasoning to prevent openAI from using their work as openAI does to prevent competitors from using their artisanaly laundered dataset.

Doesn't matter how much endless money they spend; they're going to have to contend with the fact that the value they ship is derived from others' work. It's just diluted to the point of it becoming "data" rather than "artworks".


>"After OpenAI spending endless amounts of money on that, am not surprised they don’t want others to steal their “cleaned” dataset."

And let's say I do not want them to clean up and then use my data for profit.


This makes no sense lol. The information openAI is using is cleaned to begin with


Raw text from a website including header text and footers and links and images etc is very dirty stuff.


The actual content is the clean stuff. If you disagree then you accept OpenAI could just create all the content themselves instead of scraping, which is comparatively trivial.


Since Google is using the world's images and text in their search index, are they obligated to share it with competitors as well?


It does not matter. OpenAI has the power to suspend access; it can suspend ByteDance's account simply because it can.

Not sure if a Dribbble user can suspend OpenAI.


Blocking competitors, cooperating with Germany's worst newspaper, MS with a seat on OpenAI's board.

Altman's return is really working out


>cooperating with Germany's worst newspaper

>As part of the deal, ChatGPT users will receive summaries of news stories from Axel Springer’s brands, including Politico, Business Insider, Bild and Welt, with attribution and links to the original sources of reporting, the companies said Wednesday. The agreement will allow OpenAI’s models to take advantage of the publisher’s higher quality and more current information in its chatbots’ answers.

https://www.cnn.com/2023/12/13/tech/open-ai-axel-springer-ch...


Matthias Döpfner, the CEO of Axel Springer, lied to get the Leistungsschutzrecht, a law forcing search engines to pay for links.

The newspapers of AS regularly publish articles to push their agendas and remove politicians they don't like.

Just look at what happened to the Federal President of Germany, Christian Wulff, when he declared that Islam is part of Germany.

https://en.m.wikipedia.org/wiki/Causa_Wulff


> Just look at what happened to the Federal President of Germany, Christian Wulff, when he declared that Islam is part of Germany.

What happened to him? By reading the Wikipedia article, he seems like a very corrupt individual in a position of power. I don't see how Bild was involved in that.


This is all corporate news everywhere


> publisher’s higher quality

Lol is this a joke? Those are all tabloids.


Worse, it is the Fox News of Germany. Actively trying to create dissent and undermining democracy.


You forgot the /s on your post. I hope at least.


Nope, he didn't. Axel Springer regularly campaigns against people they don't like, including private individuals.

Let me quote one of Germany's best cabaret artists, Volker Pispers: "Bild-Zeitung... this filthy newspaper that is so disgusting that you insult dead fish if you wrap them in it!"

There is a reason Heinrich Böll wrote a book about them:

https://en.m.wikipedia.org/wiki/The_Lost_Honour_of_Katharina...


Thanks for this comment. I am a big fan of Böll and I still keep discovering interesting things about him.


Volker Pispers is radical left-wing. In his narrow-minded world the left makes no mistakes and capitalism is the worst thing that ever happened to humanity.

If you consider him "one of Germany's best cabaret artists", you could not have proven better how right I am.


Don't forget the sweet contract OpenAI signed to buy $50m of equipment from a company (Rain) where Altman just happens to be an investor.

Perhaps new subscription plans for ChatGPT will be payable in Worldcoin too.

https://www.wired.com/story/openai-buy-ai-chips-startup-sam-...


[flagged]


The number of public reprimands by the German Press Council does not confirm this view. For 2023 there were 20 reprimands for Bild and 0 for Die Zeit.

https://www.presserat.de/ruegen-presse-uebersicht.html


You forgot the /s on your post. I hope at least.


> Employees involved are well aware of the implications; I’ve seen conversations on Lark, ByteDance’s internal communication platform for employees, about how to “whitewash” the evidence through “data desensitization.” The misuse is so rampant that Project Seed employees regularly hit their max allowance for API access.

If you ever wondered how all these models could seem to catch up to GPT-3.5 so quickly, but then struggled to do noticeably better (much less exceed GPT-4), while not talking about their data or saying they definitely weren't simply training on GPT-3.5 outputs, remember this: they might just be lying.


> While ByteDance’s use of our API was minimal, we have suspended their account while we further investigate

I don't like ByteDance at all, but I hope OpenAI are aware both shady and legit companies are making some serious cash using their API.

> All API customers must adhere to our usage policies to ensure that our technology is used for good.

"For good" sounds like the philosophy of companies like Stripe and Twitch. There always is a grey area with stuff that is good for many people, but other people see as evil.


I have occasionally wondered how easy it would be to bootstrap your own models using models from the big players. I remember first thinking about this when IBM Watson was the newest hottest thing on the block.

Then OpenAI came along and it seemed like it wouldn't be necessary any more because they were releasing their models anyway, except for when they suddenly decided that they wouldn't any more.

But others have been carrying that torch since (Llama and many others). Still it seems an interesting way of enhancing a model.


> I have occasionally wondered how easy it would be to bootstrap your own models using models from the big players.

I guess OpenAI did the same. If you read their API terms, one of the first things they prohibit is using it to train a competing model. Maybe they do the same tricks internally and know how powerful it can be?


Every time someone uses 'the good' to justify their cause, I know that they are lying and are not who they are generally thought to be. And I'm furious and tired that it's always the 'intelligent' people who are guilty of this. I find this hypocrisy very difficult to deal with.


This must be for show. If ByteDance or anyone else is sufficiently motivated to distill OpenAI’s models, it can’t be prevented. You can simply pay other people to collect the data for you.


You just block their IPs, gain intel from internal employees, and have the expensive law firm start sending C&Ds and prepping that lawsuit.

If you tell someone that they can't use your product in a certain manner, and then they go to extra lengths to circumvent your measures to gain a profit for themselves, there are going to be big civil and criminal legal problems.

State-sponsored or not, ByteDance has a bank account and a business license here in the U.S.


They can outsource their API requests to a third-party "AI startup" that proxies their requests to OpenAI for a small fee.


P.S. If I were in the position of OpenAI and Sam Altman, and ByteDance kept playing games after we had slapped their hand, I would just "shadow ban" them based on IP and user history, and serve them back garbage results to throw off their models :]

Garbage in = Garbage Out


“Simply”? It would be quite a project to do that at sufficient scale undetected, no?


How do you figure? It would look like people from all over the place running API workloads on a wide range of tasks and topics (if the goal is to distill the model generally, that’s what you’d need). So many people are already using GPT-4 to synthesize fine-tuning datasets, I think it would be invisible.

However, you might wonder what the goal is. This “API distillation” is good for teaching a pretrained model how to do lots of things. But the end result is always constrained by the quality of the pretrained base model, and API outputs don’t help at all with that part.


With state power backing them, they don't really need to pay anyone.


Can someone here explain how an existing model like ChatGPT, that you only have API access to, can be used to train your own model and somehow copy aspects of it?

That's entirely non-intuitive to me how that would work. Like are they just asking it questions about every topic under the sun, and then creating training materials out of the answers? How would you even begin to assemble a list of prompts that would cover all of knowledge? And how could you ever distinguish useful outputs from nonsense hallucinations?

I feel like I'm missing some key details here that the article and its links don't explain at all.


Not an expert, but I suspect it would be most useful for the instruction-tuning and fine-tuning phases. They would pre-train the model on some huge dump of stuff from the internet like the Pile - Wikipedia, books, Reddit scrapes and so on - and the result would be something that can predict next words really well but doesn't know it's meant to answer questions. That's the pre-train.

The next step is to train that on transcripts of chats so that it specialises in those, and finally on chats that have the balance of helpfulness and harmlessness that you want. Those latter two steps need significantly less data, but still more than you can cheaply produce with humans, so they could use OpenAI to generate it.

But perhaps it's a small enough amount that, even if you need an LLM to generate it economically, it's affordable to have humans read through it and throw away anything excessively hallucinated. Or perhaps you can restrict the scope of the conversations to ones where hallucinations are unlikely, and trust that fine-tuning on them won't prevent the LLM from digging into its deeper knowledge in use.
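
As a sketch of the "use OpenAI to generate it" step, assuming the OpenAI Python client, with a hypothetical model name, seed prompts, and output format:

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical seed questions; a real effort would use a huge, varied pool.
    seed_prompts = [
        "Explain the difference between a process and a thread.",
        "Summarize the plot of Hamlet in three sentences.",
    ]

    with open("chat_transcripts.jsonl", "w") as f:
        for prompt in seed_prompts:
            resp = client.chat.completions.create(
                model="gpt-4",  # the teacher model
                messages=[{"role": "user", "content": prompt}],
            )
            answer = resp.choices[0].message.content
            # Save (prompt, answer) pairs as fine-tuning transcripts.
            f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")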


ByteDance still has access to the same open internet as OpenAI had, but it does not have the weights and the parameters.

However, by asking ChatGPT questions and recording its answers, they can use that data to fine-tune the model they are creating. They can teach their own model to answer the way ChatGPT answered.

In short, they are copying how ChatGPT would answer, rather than finding out for themselves by creating their own question-and-answer (or completion) datasets.

That is essentially distilling the behavior OpenAI built up through RLHF. But anyway, Chinese companies will be able to compete here, as this will be commoditized quickly, and China is the king of creating and selling commodities.


> How would you even begin to assemble a list of prompts that would cover all of knowledge?

They don't need to; all that knowledge is already in public training sets from scraping the internet.

What is harder to get is the answer patterns that ensure a good user experience. You want a lot of such answer patterns so the model knows what structure to use for different kinds of questions. That structure isn't just to make the result formatted for humans; it contains the reasoning paths the LLM takes in order to arrive at reasonable answers. Since an LLM's thinking is the words it writes, the structure of how it responds corresponds to thinking patterns, and you want the LLM to learn a lot of those. Internet data won't contain that, but ChatGPT responses will.

TLDR: The words an LLM writes are also what it thinks, since it doesn't hide thoughts, so by training on what ChatGPT writes you also train on how it thinks. It isn't the facts you want but the thinking patterns.


They should fire the CEO and become open, or rename themselves to ClosedAI.

It's like if Monsanto was called Organic Cooperative.


I heard it told as "it's not OpenAI, it's ClosedAF"


I usually write ClosedAi (OpenAi) whenever I mention them, just to clarify.


I use this firefox extension that converts `OpenAI` to `"Open"AI` as part of my sanity-preserving toolkit.

https://addons.mozilla.org/en-US/firefox/addon/openai-is-not...


Does it replace with 'Open'AI or "Open"AI?


Double quotes because of the back ticks


BTW, Monsanto got rid of their infamous name by being bought and rebranded by Bayer.


"OpenAI suspends ByteDance's account after it used GPT to train its own AI model"

That's truly hilarious... It's akin to a child who stole a candy bar in a candy store complaining about their sister stealing that same candy bar from them.


Silly, they should have hooked it to a different model that gives harmful training data instead :D


They also have their own web scraper called ByteSpider that scrapes websites with lots of text very aggressively and ignores robots.txt. I've had to block it by user agent on one of my sites.
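
For anyone wanting to do the same, roughly what the block looks like for nginx (a minimal sketch; the match string and status code are just my choices, and this goes inside a server block):

    # nginx: refuse anything whose User-Agent contains "bytespider"
    if ($http_user_agent ~* "bytespider") {
        return 403;
    }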


I don't think it ignores robots.txt, I think it just doesn't have a very good parser and you need to give them their own user-agent block. I had a similar level of frustration.

https://www.feitsui.com/en/article/32


After all, if they wanted to completely ignore the wishes of the website owners they probably would not announce their spider as such in the user agent. They’d just pretend to be a web browser.


It is trivial to detect a spider from human traffic based on requests alone. Lying about the UA would just be bad press for them.


If it's really trivial as you say, Google's reCAPTCHA and similar products like hCAPTCHA would instantly have no reason to exist.


A bot intentionally trying to look human != a spider.

A spider will generally have a pretty predictable route through a web site.


The various CAPTCHA implementations are primarily designed to prevent bot submissions, not spiders.


Some of them yes. But not all. Try for example to browse a Cloudflare protected site from Tor and you will be hit with a constant barrage of captchas even though you are only doing GET requests.


Yes, heuristically, a Tor browser is more likely to be nefarious than a regular browser user. Note the use of heuristics - such as IP address - not related to the user agent.


Have they suspended Elon for the same? Stanford for Alpaca?


I wonder if we’ll eventually see an open training set composed entirely of AI generated materials by OAI and others’ models.

This API restriction is only an issue until that point, I assume, or is the distilling being done in these situations much more dynamic in nature?


My guess is that they don't want their technology to be used to modernize the PRC's military.

Just the other day (PDF warning): https://selectcommitteeontheccp.house.gov/sites/evo-subsites...

> Pillar II: Stem the Flow of U.S. Capital and Technology Fueling the PRC’s Military Modernization and Human Rights Abuses

> U.S. export controls have been slow to adapt to rapid changes in technology and attempts by adversaries to blur the lines between private and public sector entities, particularly the PRC’s strategy of Military-Civil Fusion.

> “The military is for civilian use, the civilian is military, and the military and civilian are fused.” In other words, no line exists between civilian and military technological development. U.S. export controls have yet to adapt to this reality


I guess they could figure it out via pattern repetition right? I thought they leave API users alone, especially corporate users… Or are they still monitoring corporate users? Sounds like a major security issue if I’m a CEO. Does Microsoft read and ban people from Microsoft Word/365? Etc


It depends which perspective we want to use to analyze this: philosophical, religious, ethical, legal, or business?

1. Philosophical: who gives a damn; it's meaningless at the cosmic level, with no impact on our species' trajectory.

2. Religious and ethical: wrong.

3. Legal: OpenAI took internet and public content not protected by any licensing agreement and got away with it. ByteDance took OpenAI's content, which is protected by their terms, so those terms are applicable here.

4. Business: OpenAI used others' data and got away with it; ByteDance tried and was caught.

5. Other dimensions of analysis are possible.

Your view = f(your perspective). Make your pick!


What exactly did they do? Just transform data or generate completely new data for training? I've seen plenty of people that have used GPT-4 to help transform data, i.e. using GPT-4 to generate summaries of texts and then using that to train smaller models. Not sure why this wouldn't be allowed, as it's not technically data that is coming from GPT-4.

If they don't allow this it just seems like they are trying to prevent people from building smaller cheaper models that will perform better for a specific use case and gobbling up the market for as long as they can.




Related anecdote: last week a web server of mine was under very heavy load. The site was flapping and I could barely even SSH into the host. I did some basic analysis of the httpd logs and found that there were 13,000 unique IPs within the past week, and 12,000 belonged to one subnet on Amazon EC2. The user agent for all those requests was ByteSpider. Millions of requests over a few days.
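
The analysis itself was nothing fancy - roughly this, where the log path and combined-log format are assumptions that will vary per setup:

    from collections import Counter

    # Count requests per client IP in a combined-format access log.
    hits = Counter()
    with open("/var/log/httpd/access_log") as log:   # path varies per distro
        for line in log:
            hits[line.split()[0]] += 1               # first field is the client IP

    for ip, count in hits.most_common(10):
        print(f"{count:>10}  {ip}")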

I'm into freedom of information and what not but come on!!!


I hope ByteDance sues so this can be debated in court.


Why?


I think because OpenAI took advantage of everyone and now doesn't want anyone doing the same thing.


New Pied Piper


Phi-2 was trained on synthetic datasets from OpenAI. The "Textbooks Are All You Need" paper also talks about using output from ChatGPT to build training data.

So ByteDance gets its account revoked, but tons of other models are allowed.

China bans US companies, US does same to China.

We are both almost the same. Sure they have a more authoritarian government but ours isn’t quite a liberal democracy either.

It’s competition all the way down.


Phi2 was created by Microsoft, that's probably why that was acceptable.


Oh this is priceless: “ All API customers must adhere to our usage policies to ensure that our technology is used for good.”

Not violating terms. Not abiding by terms of use. But good vs evil.

It screeches “we occupy the moral high ground and are the arbiters of what is ethical and what is evil.”

The entire brand game plan around "Open"AI is to position themselves as first doing no evil, just like Google did in the early days, to enable doing a lot of evil.


OpenAI: closed in almost every possible way.


How smart is this, anyway? Isn't this what causes model collapse?


Considering the extent to which YouTube blatantly encourages people to copy their videos, and given that OpenAI has no doubt used TikTok content for its models, this is like slapping a child.


When is OpenAI going to suspend or fire Tal Broda?


I suspect they'll see a new account setup quite quickly called "NotByteDance" or something similar.


Seems everyone is trying to catch up with OpenAI. The other day Grok was spitting out OpenAI terms and conditions


Now someone else just needs to make a company in the middle that will generate the training data for you.


Can someone walk me through how you'd use what open ai offers to train your own model?


Use the model's output as training data. For better performance, you can also request some top logprobs and minimize the KL divergence against them.
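
Concretely, that loss term is simple; a minimal sketch in PyTorch, assuming you've already saved the teacher's top-k token ids and logprobs per position from the API (the function name and shapes here are just for illustration):

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_ids, teacher_logprobs):
        # student_logits:   (seq_len, vocab) from your model
        # teacher_ids:      (seq_len, k) top-k token ids saved from the API
        # teacher_logprobs: (seq_len, k) the teacher's logprobs for those ids
        student_lp = F.log_softmax(student_logits, dim=-1)
        s = student_lp.gather(-1, teacher_ids)        # student logprobs at the teacher's tokens
        t = teacher_logprobs
        # KL(teacher || student), restricted to the top-k support the API exposes
        return (t.exp() * (t - s)).sum(dim=-1).mean()

In practice you'd add this on top of the ordinary cross-entropy loss on the sampled text; the top-k restriction is forced on you because the API only returns a handful of logprobs per token.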


How can ChatGPT be used to do this? Did they just ask it to produce training data?


You can use it to produce the fine-tuning examples. LLMs are very good at learning styles and answer flows, so you ask ChatGPT a million standard questions and problems and record the outputs; that data will be very similar to the fine-tuning data used by OpenAI. Then put that data into your own LLM's fine-tuning step and voila, you have an LLM that behaves very similarly to ChatGPT!

OpenAI knows this, which is why they put that clause in the TOS; they spent a lot of money to create that fine-tuning data, so they don't want to give it all away for cheap.
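
Mechanically it's just a harvesting loop. A minimal sketch with the official Python client, where the seed questions, model name, and output file are placeholder assumptions (and note that doing this to build a competing model is exactly what the TOS forbids):

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    seed_questions = ["Explain TCP slow start.", "Summarize the French Revolution."]  # hypothetical seeds

    with open("distill.jsonl", "w") as f:
        for q in seed_questions:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": q}],
            )
            answer = resp.choices[0].message.content
            # store one (question, answer) pair per line in a common fine-tuning format
            f.write(json.dumps({"messages": [
                {"role": "user", "content": q},
                {"role": "assistant", "content": answer},
            ]}) + "\n")

From there it's the standard supervised fine-tuning recipe on the collected JSONL.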


They likely didn't use (just) ChatGPT, but the GPT-3.5/GPT-4 API.

Both can be used to create training data quite successfully. This technique has been used in the past to create synthetic post-training (fine-tuning) datasets like Orca, Samantha, and so on.


What's a competing model? Only LLMs, or images too? Kinda vague. I hope only big spenders get pinged, because I've been using the vision API to help auto-annotate some training data.


It's probably a technical term for models "we don't like".


If I were ByteDance, I'd retaliate by open-sourcing what I trained off them, along with the training data I milked from their model.

What are they gonna do, double-ban you?


Wouldn't it result in overfitting?


The details on how exactly they may have used it to train their model are vague. I believe transfer learning and knowledge distillation are valid techniques based on inference from other models.


I would also think it'd be an incredibly expensive way to train a model.


Depends. I wonder what the minimum reasonable number of distinct tokens is that you'd need to meaningfully lift the weights.


You store the output from ChatGPT; you don't run it again every time you do a training step. Generating millions of examples to add to your own training won't cost much at all, relatively speaking.


Interesting to see that capitalism is inherently inefficient. Freely sharing a trained model would be the most efficient and most beneficial to progress, but the investment in training was made by a private company to turn a profit. In this light, the wasted energy of cryptocurrency is just another inefficiency.


That's so cliché


rules for thee, but not for me


So. What about Grok?


I still don't understand how they can keep a straight face claiming that training on all human-written material (copyrighted or not) that can be found on the Internet is perfectly fine, but training on ChatGPT output is not (or in other words, that human writers cannot have a choice on whether their output is used, but bot owners can).


Easily. Pulling up the ladder behind you is classic self-interested, antisocial behavior.

It doesn't have to be internally consistent, it just has to make them money.


Indeed. Setting up "barriers to entry" is one of the fundamental things taught in an MBA program.


> Pulling up the ladder behind you is classic self-interested, antisocial behavior.

Not the case. That would be the case if OpenAI prevented them from using the same resources, which it is not doing.


Fair.

To keep that analogy going, they essentially used the ladder so hard, it broke.


> if OpenAI prevented them from using the same resources, which it is not doing.

It kind of is. Just think of how many services changed their data sharing policies and closed APIs due to ChatGPT training (Twitter, Stack Overflow, Reddit). Maybe the analogy is that instead of pulling the ladder, they set fire to it so it’s burning and making it harder for others to climb. Even if they didn’t set it alight on purpose, I don’t imagine they’re losing sleep over it.


Does Bytedance allow OpenAI to train ChatGPT using TikTok?


To be fair, this isn't a ladder OpenAI used themselves, right? (The seemingly extreme shortcut of training from an external LLM.)

Their ladder (using public data, and hiring humans to classify to taste) is still available I believe.


> Their ladder (using public data, and hiring humans to classify) is still available I believe.

Not really. Once ChatGPT came out, many sites changed their terms and/or significantly increased their API access costs to prevent, limit, or make cost-prohibitive future scraping.


Good point. Though that affects OpenAI too for new data.

I had assumed most of their web content was from Common Crawl, and the older pre-ChatGPT Common Crawl datasets used would still be available. But it looks like Twitter, for one, was not in Common Crawl.


> Once ChatGPT came out, many sites changed their terms

Which is not OpenAI “pulling up the ladder behind them”.


No, it is them spoiling the pitch.

There is literally no way for them to avoid looking like assholes once they enact barriers that they themselves did not have to overcome.


It’s not like OpenAI trained their model using someone else’s and now won’t allow it to be done to them. This seems more like saying “get your own content and do the work like everyone else”.


Except they made it nearly impossible to do so.


> using public data

As far as we know.


It’s pretty simple and not hypocritical to hold these two positions simultaneously:

- It’s legal and moral to train on data you have access to, regardless of copyright.

- Nobody is obligated to provide services to you so you can obtain that data from them.

It would be hypocritical if, say, ByteDance obtained synthetic data generated from GPT-4 and then OpenAI tried to prevent them from training on the data they already obtained. But all they are doing at the moment is temporarily pausing generating new data for them. OpenAI aren’t obligated to do this and OpenAI have never argued that other people are obligated to do it for them. So no hypocrisy.


However, it might be a tad hypocritical to name your company "OpenAI" in such arrangements.


Well, they are open for business, so...


I don't think they're making a legal statement, just doing a business maneuver. Something could be perfectly legal but still against company policy.


Just like Google crawls the whole internet (hitting your server a million times a day) and then, with a straight face, will plug a captcha in your face if you dare search more than 3 times with a quoted string or non-trivial terms. Forget about doing a few million searches to bootstrap your dataset; Google was always hostile, both in access controls and pricing. You want results past the 100 or 1000 mark? Never possible, not even 20 years ago. But they say a bazillion web pages in their index match your search.

tl;dr Google crawls you, you can't crawl Google. How is that fair? They built their empire on our brain outputs but won't share theirs.


My memory of the early-ish days is that Facebook heavily leveraged Google Contacts (which was allowed by Google) to discover friends, and then blocked others from doing the same. Is that correct, or can someone offer better info?


Google respects robots.txt though, which is more than they need to. If you put data out in public, what do you expect? If you don't like Google crawling you, just restrict them. That's usually to your own detriment though, but I won't judge you if you're into self-flagellation. Just don't think you're holier than me because you are.


It's ok, the wheel is turning and now reality has come to bite them in the ass. CommonCrawl supplied the text, and LLMs replace their index for a large number of requests. A new crop of search engines like phind.com and perplexity.ai have better-than-Google results.


They certainly don't have the moral high ground, but in any other context this is usually considered an inference attack: sampling a model somebody else spent a lot of money to train in order to build a similar one at a much reduced training cost.

So while OpenAI absolutely lacks the moral high ground, ByteDance still seems to be engaging in adversarial behavior.



Just so I understand the argument, OpenAI would be claiming that anyone using their model outputs to distill i.e. train a smaller model on their model is a violation of copyright, but them training on the entirety of the internet (including copyrighted material) is not a violation?


> OpenAI would be claiming that anyone using their model outputs to distill i.e. train a smaller model on their model is a violation of copyright

OpenAI haven’t claimed this. They are refusing to generate new data for ByteDance by suspending their account.


I was on my phone when I clicked this, so the full link was obscured, and while loading I totally expected the page to be “hypocrisy”.


Google is useful thanks to websites letting Google crawl them. But Google doesn't allow itself to be crawled.



I don't see any problem here. Putting a TOS on access to their service is their right. Unless they used a bot to accept the TOS of some other site, scraping is completely legal.

Also, breaking OpenAI's TOS is likely completely legal, and everyone I know is collecting data for their own model. The worst they could do is ban the account.


Ask Sam, HN's darling.


You know he got fired from Y Combinator, right?

It was a direct order from Paul Graham. He keeps mum about it but I have trusted sources who know the truth. Additionally, it's sort of public knowledge:

https://www.washingtonpost.com/technology/2023/11/22/sam-alt...

I don't have a full view into exactly why he was fired from both OpenAI and Y-combinator. But from what I hear the reasoning is a bit similar. Sam Altman is a bit of a political snake. He lacks ethics and he's not honest either. The last part is just me speculating on a lot of the anecdotes from quips I've heard over the years from people who know Sam.

Sam's public persona is very different. And I think a lot of HN viewers worship that public persona. But Sam being a darling of HN and Y Combinator themselves? No way. They fired him, so it's unlikely.


I never said anything about Y Combinator. Only that recently HN felt like Sam Altman's circle jerk. It was surreal to read some comments praising him.


the overwhelming support of him across HN/Twitter (circle jerk) really rubs me the wrong way and I just can't take Sam Altman seriously anymore


You know HN is Y Combinator, right?

That when you come to this site, you have to type the words "ycombinator" in the URL, right?

You said Sam is HN's darling. That's different from saying "HN viewers". So you literally did not say what you now claim you said.


> You know HN is ycombinator right?

Nope, this is wrong. HN is a site run by Y Combinator and heavily moderated, but the comments come from the site's users, not Y Combinator itself.


Well, if you want to be this pedantic: how can a website have opinions?

You most obviously must be referring to the owners of the website.

Or are you referring to the comments?

Because the users and the owners of said website have vastly different opinions here.

Additionally, at one point Sam Altman was literally a "darling" of the company, before he got fired.

Your statement is ambiguous and therefore wrong.


Yeah, it was really strange. And the simultaneous "outrage" from some Twitter users. The praise felt very artificial, but the drama was real.


The work on the internet is freely available to all; the output of their API is only available after registration and agreeing to their terms and conditions.

OpenAI are free to block anyone from using their API if they want, just like anyone hosting their content on a website is free to block the OpenAI web crawler.


It's like company A collects everyone's phone numbers and then publishes them as a phone book. And then company B copies the phone book and publishes it as their own.

It's not a straightforward copyright issue, but in many jurisdictions that is not allowed. Company A did the work; they should be allowed to profit.


In the US, company B would probably be in the clear, at least for the list of names and numbers. You don't necessarily get copyright protection just because something was a lot of work ("sweat of the brow"). The most relevant US Supreme Court case is Feist Publications v. Rural Telephone Service (1991).


https://en.wikipedia.org/wiki/Database_right

There is protection in a few notable jurisdictions so a violation would make the product illegal in those jurisdictions, which is a problem if it’s an online product.


Have they complained after getting their Amazon Unlimited account disabled or something like that?


In the same way, Putin can spew all that fantasy bullshit over and over to the Russian population even though it doesn't make any sense even at a glance. Most folks back there do know it to a certain extent, yet he keeps up the show and the whole power dance instead of simply stating the truth: that he is the current dictator and the rest can bow down and suck it up. Same with most, if not all, other dictators.

Not equating those two situations at all, just pointing out that the dynamics of communication between normal people don't really happen in many other situations, or if they do, it's just a shallow charade. Or... just don't expect fairness and good behavior when tons of money, power, and legacies are at stake.


Google does the same. It wants the pages it indexes to be original; if a page just copy-pastes another page's content, its score is affected negatively, or it's even de-indexed. They call it spam.

Google itself, on the other hand, does just this: for example, Wikipedia text and song lyrics are taken from other pages and copy-pasted onto Google's page.

Try telling a "googler" this and they'll go "noooo, but for Google it's different, because Google has determined that's optimal and good for user experience". It's difficult to get someone to understand something when their paycheck depends on them not understanding it.


> Try telling a "googler" this and they'll go "noooo, but for Google it's different, because Google has determined that's optimal and good for user experience". It's difficult to get someone to understand something when their paycheck depends on them not understanding it.

In my experience googlers are very capable of saying “Google is bad but my salary is good”. Plenty of people understand things that are contrary to their paycheck.

That said, Google generally respects Wikipedia's and others' licenses to the data. And it is generally in the user's best interest to get to the desired content/information in fewer steps, regardless of the data's provenance.


These may have been more valid criticisms in the past, but today, Google does indeed pay for both Wikipedia content and song lyric licensing:

https://www.theverge.com/2022/6/22/23178245/google-paying-wi...

https://www.theverge.com/2019/6/18/18684211/google-song-lyri...


So? I didn't say they don't pay.

What I said is: if I copy-paste things into my page, Google will kill it because it's spam. If I say "but I'm actually paying for this content"... that's irrelevant.

I'm not saying that Google is stealing content; I'm saying they're hypocritical in applying an argument when the conclusions benefit them, but not otherwise.


I generally don't get many results pointing to Google search results, so I guess Google search ranks pretty low.


It's about those "information blocks" next to the search results that are often copied verbatim from Wikipedia or Stack Overflow


> While ByteDance’s use of our API was minimal

Aka... some dev fired off a handful of test queries and never used the account again, so OpenAI decided to suspend it to look good in the US press.


Alternatively, they caught the activity promptly and kept the use to a minimum.

Also, what does "minimal" even mean? I'm sure they monitor accounts that max out their API request limits, or even just request programmatically (i.e. request patterns that don't match natural human use, like slowing down during a time zone's lunch hours). Maybe this was a couple of days' worth of traffic.


Or the media reported on their usage, and the “minimal” clause is to reassure investors that their model is still protected and special.


From OpenAI's terms of use: https://openai.com/policies/terms-of-use

======

What You Cannot Do. You may not use our Services for any illegal, harmful, or abusive activity. For example, you may not:

    Use our Services in a way that infringes, misappropriates or violates anyone’s rights.

    Modify, copy, lease, sell or distribute any of our Services.

    Attempt to or assist anyone to reverse engineer, decompile or discover the source code or underlying components of our Services, including our models, algorithms, or systems (except to the extent this restriction is prohibited by applicable law).

    Automatically or programmatically extract data or Output (defined below).

    Represent that Output was human-generated when it was not.

    Interfere with or disrupt our Services, including circumvent any rate limits or restrictions or bypass any protective measures or safety mitigations we put on our Services.

    Use Output to develop models that compete with OpenAI.
======


It's going to be stupid for OpenAI to argue that those terms are binding when they've already argued in court that such terms are non-binding when they scraped other people's data.


It makes sense it'd say that. Of course, GPT is built on everyone's output itself.

So throwing around statements like "we suspended ByteDance to ensure GPT is used for good" is hypocritical at best. They're not the pope; they have no monopoly on good.


> Automatically or programmatically extract data or Output (defined below).

What does that mean? Isn’t that just the definition of API use?


I thought OpenAI's mission was to ensure technology benefits all humanity, not that it falls into the hands of big governments and big business.


Making better AIs with it should benefit humanity. They should change the name already.


I wonder how this affects US-China relations



