“We can steal people’s copyrighted content but we can’t let you see it for yourself.”
Outside of privacy (leaking PII), the above is likely the main reason. Someone could have invested a lump of money to scrape as much as they can and then go to town in the courts.
The terms that prohibit it are under “2. Usage Requirements” that restrict reverse engineering the underlying model structure.
Leaking original data would expose the company to direct copyright violation lawsuits. Changing T&S is simplest way to stave the legal risk exposure, buying time to implement technical remedies.
As ridiculous as it may seem, they're doing the right thing.
I always find it amusing when criminals threaten legal action. Happens all the time. They steal your property then they cower behind their legal rights.
I've had no problem using Microsoft's toil for free by downloading free windows ISOs all my life, so if they want to pirate my Github code it's not bad enough to care about. Besides the bad practices the model might internalize as a result that is
Good for you, don't post your code on GitHub then, as they have an express terms of service about being able to use code submitted for business purposes, including AI model training.
I am not sure what you mean by "my culture," what are you basing that off of? Why would you knowingly and willingly add an ad hominem? Very strange behavior in discourse.
GitHub has always had such clauses, even if they were not explicit about AI model training in particular. It is best to self host your own git instance if you are so worried.
I made it my mission to get the lot of them mad. There are plenty of legitimate ai companies out there but YC seems fond of those unethical, which explains the infusion of ip stealing startups on here and their simps.
It will be interesting to see how it plays out. I can imagine Wiley, McGraw Hill, Pearson and other publishers[0] of educational content OpenAI used could sell the rights to their material to be used for training GPT, but the price would be high enough we would be paying $100/month instead of $20.
[0] Heck, they could even unite and found an LLM startup themselves training the models legally and making it available for users at various tiers.
“don’t touch the unsecured, loaded firearm that is sitting on the counter, that might be stolen, maybe even got a body on it, don’t look too close, or you can be kicked out of the club for not following the rules”
Disagree. It’s not the model outputting this message. It is a hard coded checked by open ai. It seems to very clearly be openai responding to the specific attack used by deepmind as explained by the article.
It is a TOS violation. It’s not a big one. But the weakness of the model is the story here.
At this point there are so many checks and rules applied to ChatGPT one wonders just how much performance is being sacrificed. Is it totally miniscule? Could it be significant?
Publicly available PII isn’t very sensitive, I think.
So I feel like it’s important to distinguish between sensitive PII (my social or bank number) and non-sensitive PII (my name and phone number scraped from my public web site).
The former is really bad, both to train on and to divulge. The latter is not bad at all and not even remarkable, unless tied to something else making it sensitive (eg, hiv status from a medical record).
It was my naïve understanding that the training data no longer existed, having been absorbed in aggregate. (Like how a simple XOR neutral net can't reproduce its training data.) But a) I don't know how this stuff actually works, and b) apparently it does exist.
Has anyone figured out why asking it to repeat words forever makes the exploit work?
Also, I've gotten it into infinite loops before without asking. I wonder if that would eventually reveal anything.
Outside of privacy (leaking PII), the above is likely the main reason. Someone could have invested a lump of money to scrape as much as they can and then go to town in the courts.
The terms that prohibit it are under “2. Usage Requirements” that restrict reverse engineering the underlying model structure.