Stanford’s Alpaca shows that OpenAI may have a problem (the-decoder.com)
172 points by 101008 on March 21, 2023 | 101 comments



> the team points to the OpenAI GPT-3.5 terms of use, which state that the model may not be used to develop AI models that compete with OpenAI.

And yet I bet OpenAI was trained on data for which they never bothered negotiating or honoring "terms of use". What's the difference?


Unfortunately under US law that's a significant difference.

Companies can take away a dizzying variety of your rights if they simply say so. It's how they won most of the anti-scraping cases.


Yes, but the point is that GPT-3/4's training data must include at least one website that prohibited scraping; thus, by their own logic, OpenAI should not be using their GPT models.


But OpenAI never clicked that "I agree" checkbox, whereas you entered into an agreement with them.


Neither have most of the people who used this data to train their models[1]. "I agree" only applies to those who pressed that button.

AI output is strictly not copyrightable under current law, so standard tricks to limit what people can do generally don't work [2].

[1]: https://github.com/pointnetwork/point-alpaca [2]: https://www.federalregister.gov/documents/2023/03/16/2023-05...


If we can scrape linkedin and amazon, why can't they scrape our online content?


That's a great question, I'm wondering that myself. I know that from a legal standpoint anti-scraping has been a messy debate, but something is different this time. Amazon and LinkedIn won't die if their data leaks; in fact, in LinkedIn's case all the data was volunteered by the users. These users won't up and leave for another place that scraped LinkedIn. I'm not sure I agree with the lawsuit outcomes, but it didn't look like an existential threat to LinkedIn or Amazon.

Now consider scraping _all_ of the public's useful information. Take Stack Overflow as an example of community-produced content. It gets 90% of its traffic from Google; people show up, read, ask and answer questions (and click on jobs and ads to keep SO's servers running). If a scraper comes by that inhales all of that data and regurgitates it under Microsoft's name, giving no credit, no references, and absolutely nothing back to the authors of those questions and answers, what reason does SO have to exist anymore? What is the motivation for people to keep answering and refining questions? 90% of Stack Overflow will evaporate. The way it's currently set up, this is an existential threat to any meaningful content on the web. Maybe there have been some legal precedents weakening anti-scraping protections, but this is a very big problem now.


So think about it. If Morgan Stanley is right and OpenAI is currently training GPT-5 for hundreds of millions of dollars, they might not make it directly available. They could use Alpaca-style distillation to boost GPT-4 or GPT-3. This way they could have super-optimized models that fit in less memory and compute their tokens faster than any of the competition. This would be the opposite of OpenAI having a problem. It would be a huge economic moat that lasts until more efficient hardware and more capital spending by the competition degrades it. It would even solve the problem of superhuman GPT-5 cognition: that problem would be moot if GPT-5 is never released and only used to Alpaca-ize their own smaller models.


That's a valid perspective IMO, but corporations are too slow to react and keep up. Another thing: if there are models good enough for the general use case that I can fit on my mid-range hardware, then I definitely don't need to pay for the more perfect one. A couple of such good-enough models are already out there, and a lot of us are no longer following or waiting for GPT-10. That's the point of the article, and why OpenAI is visibly insecure about those models.


> It would even solve the problem of superhuman GPT-5 cognition: that problem would be moot if GPT-5 is never released and only used to Alpaca-ize their own smaller models.

They would not have a moat from this strategy.

At the end of the day, they still make available a model with high performance. The whole point is that anyone, for very little money, can fine-tune another model to match the performance of OpenAI's models.
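For anyone unfamiliar with how Alpaca did this, here's a rough sketch of the idea in Python, assuming the openai library as it existed at the time; the seed instructions and output file name are made up for illustration, and the real Alpaca pipeline (self-instruct) is considerably more elaborate:

    import json
    import openai

    openai.api_key = "sk-..."  # placeholder key

    # Hypothetical seed instructions. Alpaca bootstrapped ~52k of these from a
    # small human-written seed set, using text-davinci-003 to generate both
    # new instructions and their answers.
    seed_instructions = [
        "Explain what a binary search tree is.",
        "Write a haiku about the ocean.",
    ]

    with open("distilled_training_data.jsonl", "w") as f:
        for instruction in seed_instructions:
            resp = openai.Completion.create(
                model="text-davinci-003",
                prompt=instruction,
                max_tokens=256,
            )
            # Each (instruction, response) pair becomes one fine-tuning
            # example for a smaller open model such as LLaMA 7B.
            pair = {"instruction": instruction,
                    "output": resp["choices"][0]["text"].strip()}
            f.write(json.dumps(pair) + "\n")

The punchline is the cost: a dataset like this runs a few hundred dollars of API calls, which is exactly why the moat argument is shaky.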


That's true. Anything open-sourced to the public is obviously available to a company too. They can create public, miniaturized versions of their models and release capabilities only when it suits them, using Alpaca-style training to minimize costs for their API.


Source on Morgan Stanley talking about GPT-5?


I hope my analysis makes sense regardless of when or how OpenAI ends up training GPT-5, but the Morgan Stanley thing is an internet rumor; see for example https://news.ycombinator.com/item?id=35222180


Thanks


Even without the Morgan Stanley claims, Greg Brockman and Satya Nadella have both said that GPT-3.5 was already in internal use since the summer of last year, and GPT-4 since December. Greg mentions it in https://youtu.be/YtJEfTTD_Y4, for example.


The most frustrating thing about the LLaMA "release" is that no one can make an actual product around it. Meta's legal department is silent, and it's anyone's guess how far the terms of use can be pushed and bent before an army of lawyers buries your business in the ground.

I've seen an awful number of fishy URLs, magnet links, and diff files, and I guess that works well enough for open-source communities. This buzz and the flow of amazing innovations building around LLaMA will later be repurposed once a truly open LLM + weights is released, so not much is lost there. But it could have maybe 10x the impact if startups could build their new products on the fine-tuned LLaMA models.

It just sucks that the price was already paid: 2.6 million kWh of electricity and 1,000 tons of CO2 emitted into the air. Now it all needs to be done again to get to the same result. Why?


Indeed they have a problem there.

For that matter, I don't see how "OpenAI" could even try to legally enforce its terms against competitors training their models on the output of "OpenAI" models… at least not without being laughed out of the courtroom at best, or ending up having to pay enormously more themselves at worst, given how "OpenAI" blatantly disregards any licenses on the content and data they use to train their models.


I think it's still up for debate whether this tech is really a winner-take-all sort of thing. Even with the image-generation AI craze, we saw a bunch of different implementations come out in succession. Maybe one might be slightly better, but it's still possible to get something close pretty quickly.


IANAL, but this is a clear violation of the OpenAI Terms of Use 2(c)(ii): https://openai.com/policies/terms-of-use

Doesn’t mean it’s enforceable, though. That being said, given the prominence of this clause in their ToS, I suspect they’ll be hearing from lawyers.

Edit: to be clear, I don’t support this clause in their ToS; this is just something I noticed, having had to study their ToS and privacy policy within the past week.


I would be curious to hear a lawyer's take on this, though I suspect we'll be hearing the Supreme Court's soon enough. This raises the question of who owns the content produced by the software. So far, the courts have stated that model-generated content cannot be copyrighted [1]. That ruling referred to images, and text seems likely to be on even shakier ground.

I don't see how a company could license the usage of something they don't have the legal rights to, the output text in particular. Obviously OpenAI can terminate people's accounts for whatever reason they want, but that's largely meaningless. They have zero chance of deterring anything unless they can secure substantial damages.

[1] - https://www.smithsonianmag.com/smart-news/us-copyright-offic...


I suspect Microsoft will start lobbying to protect the "AI economy" pretty soon, if they haven't already started.

I mean, they didn't care where all that training data came from, but (in their minds) that doesn't mean anybody else gets to do the same.

That's them pulling up the ladder behind OpenAI.


It says in the article that it's a violation of the TOS.


Scraping images from Getty is also violating the terms of service.


It is. And they heard from Getty’s lawyers.


How is that related to anything?


There's nothing wrong with violating a ToS. Worst case, your account gets banned.


Recent and related:

The genie escapes: Stanford copies the ChatGPT AI for less than $600 - https://news.ycombinator.com/item?id=35238338 - March 2023 (145 comments)

Stanford Alpaca web demo suspended “until further notice” - https://news.ycombinator.com/item?id=35200557 - March 2023 (77 comments)

Stanford Alpaca, and the acceleration of on-device LLM development - https://news.ycombinator.com/item?id=35141531 - March 2023 (66 comments)

Alpaca: An Instruct Tuned LLaMA 7B – Responses on par with txt-DaVinci-3 - https://news.ycombinator.com/item?id=35139450 - March 2023 (11 comments)

Alpaca: A strong open-source instruction-following model - https://news.ycombinator.com/item?id=35136624 - March 2023 (296 comments)


OpenAI has done the reasonable thing of not exposing the probability distribution per generated token, so it's very hard to use their output to fully replicate their models. Ultimately, you still need a very large base model to compete.


"not exposing the probability distribution per generated token"

Can you elaborate on what that means?


A language model takes in a sequence of tokens and outputs a probability (0-1) for each token in the vocabulary (the set of all tokens the model knows). Based on this probability distribution, there are various sampling strategies that can be employed to choose which token to actually show to the user.
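A minimal sketch of that last step in Python, with a toy vocabulary and made-up logits standing in for the model's real output; temperature and top-k are two common sampling knobs:

    import numpy as np

    # Toy vocabulary and hypothetical raw model outputs (logits) for the next token.
    vocab = ["the", "cat", "sat", "on", "mat"]
    logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])

    def sample_next_token(logits, temperature=0.8, top_k=3):
        # Temperature < 1 sharpens the distribution; > 1 flattens it.
        scaled = logits / temperature
        # Softmax turns logits into a probability distribution (0-1, summing to 1).
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # Top-k: keep only the k most likely tokens, then renormalize.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

    print(vocab[sample_next_token(logits)])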


OpenAI's previous completions endpoint for text-davinci-003 and older models included a "logprobs" return option: https://platform.openai.com/docs/api-reference/completions/c...

Their newer chat-style endpoint for the GPT-3.5-turbo and GPT-4 models no longer supports this: https://platform.openai.com/docs/api-reference/chat
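For reference, a rough sketch of what that option looked like with the openai Python library of that era; the API key and prompt below are placeholders:

    import openai

    openai.api_key = "sk-..."  # placeholder API key

    # The legacy completions endpoint could return log-probabilities for the
    # most likely tokens at each position via the "logprobs" parameter.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt="The capital of France is",
        max_tokens=1,
        logprobs=5,  # return the 5 most likely tokens and their logprobs
    )
    print(resp["choices"][0]["logprobs"]["top_logprobs"])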


I was wondering. They do give you an embeddings endpoint. Can’t that theoretically be used to reconstruct the model’s weights?


Fair use to scrape together their dataset, then cry foul when their outputs are used to create a better competitor that runs locally, lol. They shouldn't have given a free tier, but then how else would they have captured casual users?


Live by the scrape, die by the scrape.


"The only way to compete against OpenAI is an open-source version of ChatGPT" [0]

It could also be a free binary-only model, but I would prefer a transparent one. Either way, it seems like Stanford Alpaca and LLaMA have taken off in cloning ChatGPT and making it good enough to compete at an affordable cost.

[0] https://news.ycombinator.com/item?id=34643589


So far it's nowhere near "good enough", though.


ChatGPT also isn't good enough for ... almost anything, but here we are.

Might as well make it cheap, easily portable, and fine-tunable.


I think that last part is what the "product" is. Train it with a large data set, learn to constrain it, and give it a more fixed scope, AKA customer support. Retain the best of your existing staff to monitor, review, and train it on an ongoing basis. Use them to deal with "outliers" and "new features".

People may lose jobs... but do companies want this? Can you imagine Google/FB/Twitter/insert-company-name-here no longer being able to hide behind "no one can reach a human"?


If it weren't against their ToS, ChatGPT would be a great tool to help build ChatGPT.


It's not a meaningful restriction. Even if they were good at catching accounts doing this, the community that wants an open-source chat assistant is so huge that you could easily find many people to share their API keys or install a browser extension to share their chats.


Even that is not needed. There are enough people who want an open-source ChatGPT clone to create the dataset from scratch themselves. The Open Assistant community has actually already created enough training data for initial model training, and the data will be released under an open-source license. From what I've played with of the initial model, it looks promising (though nowhere close to GPT-3.5 yet).


It's been a week. The first instruction-tuned versions of the 7B model came out 48 hours ago.

Be patient.


Have you seen any of the 65B versions yet? I think that will make it a lot more useful.

I haven't seen any, but the news is moving so quickly now lol


It would be a lot better if you could use the Alpaca modifications on the 65B LLaMA model :(


Does anyone think this wasn’t Meta’s plan all along?


It's a lovely plan to kill the industry if you're not already deep in it; no amount of lawyers will help take down the torrents already available.


I mean, isn't MS trying to kill search (Google) by pouring millions into AI?


I think so, yes, and IMO that's also nice. Search probably won't be about receiving 10,000 results anymore, but about one concise, justified answer backed by actual sources.


Mass adoption of AI chatbots seems self-defeating. Who is publishing the information that the chatbot uses to give you 1 concise answer? Depending on the query, it could be Wikipedia or an academic journal, but for many topics the chatbot would need to draw from for-profit ad-supported websites.

The chatbot user benefits from having one great answer pulled from the best sources (and no ads), but the websites that underpin the chatbot’s usefulness will no longer have monetizable traffic. In the long-term this disincentivizes people from publishing online, which would reduce the quality of not only the chatbot output but the web as a whole.

Imagine local news being totally unavailable online by any means because the rise of chatbots means that nobody can make any money writing about local news.

Edit: A first reaction to this might be “have the chatbot show ads and share revenue with its sources.” This probably wouldn’t solve the problem. Journalism (and many kinds of writing) would be a less attractive career if your readership consists mostly of people getting second-hand summaries via chatbot. If chatbots do become popular, I worry about a bleak future where journalism and other writing is replaced by an anonymous blob of underpaid foreign laborers whose only job is to shovel up-to-date facts into chatbot databases.


tldr: It's not my problem

That's a valid perspective, but remember that quality degraded when we started having paid memberships and ads on websites.

What news outlet is worth paying for? Who actually pays for content online? Who knows how to block all ads and didn't do it?

That business model really proved to be worthless; it dragged quality down with ever more desperate pay-to-read prompts. Nobody will miss it, and we will again have people with real-world experience writing down knowledge or opinions in their free time. So it goes back to that: quoting scientific papers, books, and blog posts written by actually reputable and knowledgeable people.

With the news being mostly propaganda, I don't know what to quote there or how many outlets still have a reputation.

I personally write in my free time on my personal ad-free, minimalistic website and keep a list of blogs I follow for content.

Things change; waiting for full-time "authors" to come up with a profitable plan before we can progress is also not an option.


> What news outlet is worth paying for? Who actually pays for content online? Who knows how to block all ads and didn't do it?

I don’t know how far we’re going to get if we not only expect volunteers to supply everything the model needs in order to be up-to-date and useful, but also expect access to the model to be free of charge and ad-free.

That’s a pretty massive amount of man-hours, compute, and R&D effort to expend for literally no return on investment.

> we will again have people with real-world experience writing down knowledge or opinions in their free time

We already have this now in the form of a bored knowledge-worker blogger class. It's not nearly enough to provide up-to-date info for a chatbot, and I don't see how the implosion of journalism as an industry will lead to more people spending their free time writing for no pay. If anything, it will drive more knowledge behind paywalls like Substack, which will be inaccessible to the chatbot anyway.


Right up until the ad and SEO people get their grubby hands on it.

Either Microsoft is successful in charging people for it, or it's going to become just as infested as Google is right now.

Maybe even then. Double dipping is not exactly unheard of.


Couldn't agree more, and we're back to square one, wondering how we can find the good stuff again after it gets buried under mass-SEO-optimized random garbage.



What does toxic mean in "hallucinations, toxicity, and stereotyping"?


Probably stuff like this (note: possibly misleading URL) https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bin...

Early Bing AI had a tendency to be rather unhinged.


If we continue on this trajectory, I have a suspicion that the big players will increasingly cry “danger!” and, as Sam Altman has done already, call for government regulation of AI. Having potential upstarts buried in red tape is how monopolies and oligopolies sustain their positions in a lot of industries.


Altman and OpenAI have been stoking AI safety fears and using it as a selling point for narrow control and regulation since the same time they stopped pretending the “Open” in their name was meaningful. It’s clearly central to their business strategy. The best way to keep a head start is to use the government to put up roadblocks for your competitors.


I think their realisation was that the models themselves aren't actually that difficult to replicate, even in the absence of a patent or description.

I.e., they can't adequately defend their business with trade secrets.

Patents probably wouldn't work either, because the structures are too easily recombined to bypass any conceivable patent that would be enforceable.

That said, I think all of this is emblematic of a deeper problem with the space, which is that none of the recent LLM work has been groundbreaking, just continual refinement of a given branch. We aren't seeing evolution, just increases in the number of parameters or their quality, plus additional context. Which is why it was so easy for other folks to make the same progress in similar time periods.

Time will tell if we are about to slam into a local maximum, or if someone finds a significant evolution, or better yet stumbles on a way to properly combine LLMs for context + NLP with traditional AI/logic/expert systems to engineer something that actually thinks and learns rather than regurgitating statistics.


We can also start suing NotOpenAI in many countries for copyright infringement in their training data. They need to come up with a 'social contract' for AI models, in the same way that Google built a synergistic relationship with publishers in the past.


Actually, using public but copyrighted data is explicitly allowed by the digital single market directive in the EU, precisely to allow new entrants to enter the market and to keep big tech from gatekeeping the access to competitive data.


Some of those industries are heavily regulated for very good reasons.


Although there are also typically very good reasons not to regulate the industry, which will never be discovered if the industry is regulated.

For example, imagine that engines had been regulated earlier, because steam engines could blow up and kill people. That's a good reason to regulate steam engines. But then we probably would never have invented other engine types like gasoline engines and jet engines. With that, we'd never have invented planes or powered flight, because a regulation-compliant steam engine would have been too heavy.


Steam engines were in a way self-regulated by industry in quite a few countries from about mid-1800s (because of the boiler explosion risks).


Why would regulating steam engines have led to never inventing other types of engines?


Because the engines would have to conform to the strict mandates of the regulations and any innovations would have to seek regulatory approval with the usual ‘regulatory capture’ rules in effect.


See: nuclear power


The problem with "heavy regulation" for anything to do with tech is that it puts you at a technological disadvantage.

The only thing AI regulation would do is hand AI supremacy to China. This is the case with almost every other technological development we kneecap ourselves with: nuclear power, high-speed rail, etc. We waste endless energy on bureaucracy while China is building.


And Moloch laughs in the background.


These are not the good ones


Oh no, someone said something mean on Twitter. However will we survive?

The only reason propaganda is so effective is that life is so terrible. No one bought USSR propaganda in the '60s that the US was terrible, because people remembered growing up without electricity. A majority believe Russian propaganda about the US today because life expectancy in Thailand is higher than in the US.


The folks at OpenAI have already started doing this. Just the other day, their chief scientist was making noises along those lines.


Can AI really be regulated when China could just release an app?


I don't think China would release an app. It's a country that highly values censorship and hiding information; why would they want ChatGPT-like systems that aren't heavily censored/broken entering the public sphere?


If they were to train their own, wouldn’t they only train it on "approved" content?

In which case, it wouldn’t even need to be censored, would it?


Wouldn't it go the other way? Train it on dissident content, so that it's able to detect it and alert the censors, who can then send the police to re-educate the user?


Well, you could use both. One model for content generation, one for moderation and censorship.

But in this case, as already discussed in other threads, why not simply transform everything people write or say? They still get to see what they wrote; others only see a cleaned-up, government-approved version. And do that on texts, chat, social networks… everywhere you can.

And suddenly, you actually are in 1984. Except you don't even have to send the police or beat people up: they'll all be deeply convinced they are part of a very small minority, if not alone.

See https://xkcd.com/2015/


Isn't ChatGPT heavily censored?


Quite poorly. Look up “chatGPT jailbreaking”.


Irrelevant. ChatGPT is heavily censored.


Hilarious that you asked a question and then immediately changed it to a statement 20 minutes later when challenged. Why didn't you just make it a statement to begin with?


It's a statement. Google "rhetorical question".


Irrelevant. Google is heavily censored/s


If you have to ask, you don’t have any idea what actual censorship is.


I don't have to ask. ChatGPT is heavily censored. Rhetorical question.


The point your rhetorical question tries to drive home is bad, which is why other commenters didn't let it sit. Thanks to dang, Hacker News has rules of conduct that prevent this place from degenerating into a cesspool, but now that another company has applied similar rules to its AI, the free-speech absolutists cry foul at the podium.

If you want a generative AI without any decency, then make it yourself, but don't act like this AI is censored just because it communicates the way everyone in society does when they are not protected by the veil of online anonymity.


Rhetorical questions don't work the way you think they do, then. Asking a question whose answer is controversial, and which is phrased as an innocent inquiry, is not rhetorical. At best it's a tired in-group membership signal rather than an honest attempt at discourse.


It's not controversial, it's factual.


So not necessarily China; any other company in any country.


It sounds like regulatory capture!


[flagged]


“You don't need a formal conspiracy when interests converge. These people went to the same universities and fraternities, they're on the same boards of directors, they're in the same country clubs, they have like interests. They don't need to call a meeting; they know what's good for them and they're getting it.” - George Carlin


I feel like this thread is about to get disappeared lol.


[flagged]


> This is the reality, not some "conspiracy".

It is the absolute textbook definition of conspiracy thinking. You seriously think some shadowy elite started "a lockdown, then a war, centralizing the banking system and unleashing A.I." in furtherance of some unspecified nefarious plan?

Wild claims presented with zero evidence can and should simply be dismissed without further thought.


It's the expected plan. And if you want to think critically, do research on these topics. If you expect the media to give you the facts, good luck. The strange part is that everything I mentioned is in the public domain, as fact or policy.


Everything you claim makes sense, but so would aliens, or that we're inside a computer game, etc. Inventing explanations for facts is a trivial skill. Formulating a theory that can predict future events is an actual achievement.


Should have known your reply would be "do your own research". How about no, you go and do it, and wow the entire world with your incredible story.

The reason the media isn't covering this nonsense isn't because of a cover-up, it's because it's nothing but fantasy.


I don't think this is a place for unfounded conspiracy theories.


Literally read the book The Great Reset by Klaus Schwab, it's all discussed there. They literally have a big meeting every year where they invite a bunch of elites to discuss ways they can have more control over the global populace. There's nothing unfounded about it; except for the fact that it maybe doesn't technically count as a conspiracy if it's all done in the open.


> Literally read the book The Great Reset by Klaus Schwab, it's all discussed there

There's no way you read this book. Because I have.

Davos is mentioned once in passing in the entire book. And the whole book is about how we can build a more resilient, more equitable world for everyone, where we respect nature instead of plundering it.


Funny, the other day I had to deal with a former school board trustee making claims about the supposedly sexual and harmful content of a specific book about a transsexual person. My gut told me that what they were saying probably wasn't true, so I read the book and voila, none of what the person claimed to be in the book was actually in there. It does say something about a person when they cite a study or book (anything that takes time and energy to evaluate) to make their point without actually reading it.


It’s actually not that uncommon for people with the same type and level of job to congregate and discuss - or conference - together to learn and socialise with each other.



