Yes, but the point is that GPT-3/4's training data must include at least one website that prohibited scraping, so by their own logic OpenAI should not be using their own GPT models.
That's a great question, I'm wondering that myself. I know that from a legal standpoint anti-scraping has been a messy debate, but what is different this time? Amazon and LinkedIn won't die if their analytics leak; in fact, in LinkedIn's case all the data was volunteered by the users. These users won't up and leave to another place that scraped LinkedIn. I'm not sure I agree with the lawsuit outcomes, but it didn't look like an existential threat to LinkedIn or Amazon.
Now consider the case of scraping _all_ useful public information. Let's take Stackoverflow as an example of community-produced content. It gets 90% of its traffic from Google; people show up, read, ask and answer questions (and click on jobs and ads to keep SO servers running). If a scraper comes by, inhales all of that data, and regurgitates it under Microsoft's name, giving no credit, no references, and absolutely nothing back to the authors of those questions and answers, what reason is there for SO to exist anymore? What is the motivation for people to keep answering and refining the questions? 90% of Stackoverflow will evaporate. The way it's currently set up, it's an existential threat to any meaningful content on the web. Maybe there have been some legal precedents against anti-scraping rules, but this is a very big problem now.
So think about it. If Morgan Stanley is right and OpenAI is currently training GPT-5 for hundreds of millions of dollars, they might not make it directly available. They could use it, Alpaca-style, to boost GPT-4 or GPT-3. This way they could have super-optimized models that fit in less memory and compute their tokens faster than any of the competition. This would be the opposite of OpenAI having a problem. It would be a huge economic moat that lasts for a while, until more efficient hardware and more capital spending by their competition degrades it. It would even solve the problem of superhuman GPT-5 cognition: that would be moot if it's never released and only used to Alpaca-ize their own smaller models.
That's a valid perspective IMO, but corporations are too slow to react and keep up. Another thing is, if models that are good enough for the general use case can fit on my mid-range hardware, then I definitely don't need to pay for the more perfect one. A couple of such good-enough models are already out there, and a lot of us are already not following along or waiting for GPT-10. That's the point of the article, and why OpenAI are visibly insecure about those models.
> It would even solve the problem of superhuman GPT-5 cognition: that would be moot if it's never released and only used to Alpaca-ize their own smaller models.
They would not have a moat from this strategy.
At the end of the day, they still make available a model with high performance. The whole point is that anyone, for very little money, can fine-tune another model to match the performance of OpenAI's models.
That's true. Anything open-sourced to the public is obviously also available to a company. They can create public, miniaturized versions of their models and release capabilities only when it suits them, using Alpaca-style training to minimize costs for their API.
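For anyone unfamiliar with what "Alpaca-style training" involves in practice, here is a rough, minimal sketch in Python. It assumes the 2023-era openai client; the seed instructions, model choice, and file name are purely illustrative placeholders, not anyone's actual pipeline. The idea is to collect instruction/response pairs from a strong teacher model and then fine-tune a smaller open model on the resulting dataset.

```python
import json
import openai  # assumes the 2023-era openai client; API key read from OPENAI_API_KEY

# Illustrative seed instructions; a real run would use thousands of diverse prompts.
seed_instructions = [
    "Explain what a hash map is to a beginner.",
    "Write a one-line Python function that reverses a string.",
]

dataset = []
for instruction in seed_instructions:
    # Ask the "teacher" model to answer each instruction.
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": instruction}],
    )
    dataset.append({
        "instruction": instruction,
        "output": reply["choices"][0]["message"]["content"],
    })

# The resulting JSONL is what a smaller open model (e.g. a LLaMA checkpoint)
# would then be fine-tuned on with a standard training loop.
with open("alpaca_style_data.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")
```

The expensive part is the teacher's original training run; generating the dataset and fine-tuning the student is comparatively cheap, which is exactly why this worries OpenAI.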
I hope my analysis makes sense regardless of when or how OpenAI ends up training GPT-5, but note that the Morgan Stanley thing is an internet rumor; see for example https://news.ycombinator.com/item?id=35222180
Even without the Morgan Stanley claims, Greg Brockman and Satya have both said that GPT-3.5 had already been in internal use since the summer of last year, and GPT-4 since December. Greg mentions it in https://youtu.be/YtJEfTTD_Y4, for example.
The most frustrating thing about the LLaMA "release" is that no one can make an actual product around it. Meta's legal department is silent, and it's anyone's guess how far the terms of use can be pushed and bent before an army of lawyers buries your business in the ground.
I've seen an awful lot of fishy URLs, magnet links, and diff files, and I guess that works well enough for open source communities. This buzz and flow of amazing innovations building around LLaMA can later be repurposed when a truly open LLM + weights is released, so not much is lost there. But it could have maybe 10x the impact if startups could build their new products with the fine-tuned LLaMA models.
It just sucks that the price was already paid: 2.6 million kWh of electricity and 1,000 tons of CO2 emitted into the air. Now it needs to be done again to get to the same result. Why?
For that matter, I don't see how "OpenAI" could even try to legally enforce its terms against competitors training their models on the output of "OpenAI" models… at least not without being laughed out of the courtroom at best, or ending up having to pay enormously more themselves at worst, given how blatantly "OpenAI" disregards any licenses on the content and data they use to train their models.
I think it's still up for debate whether this tech is really a winner-takes-all sort of thing. Even with the image generation AI craze, we saw a bunch of different implementations come out in succession. Maybe one might be slightly better, but it's still possible to get something close pretty quickly.
Doesn’t mean it’s enforceable though. That being said, given the priority this is given in their ToS, I suspect they’ll be hearing from lawyers.
Edit: to be clear I don’t support this clause in their ToS, this is just something I noticed having had to study their ToS and privacy policy within the past week.
I would be curious to hear a lawyer's take on this, though I suspect we'll be hearing the Supreme Court's on it soon enough. This raises the question of who owns the content produced by the software. So far, the courts have stated that model-generated content cannot be copyrighted [1]. That was referring to images, and text seems likely to be on even shakier ground.
I don't see how a company could license the usage of something they don't have the legal rights to - the output text in particular. Obviously OpenAI can terminate people's accounts for whatever reason they want, but that's largely meaningless. They have zero chance of deterring anything unless they can secure substantial damages.
OpenAI has done the reasonable thing of not exposing the per-token probability distribution, so it's very hard to use the API output to fully replicate their models. Ultimately, you still do need a very large base model to compete.
A language model takes in a sequence of tokens and outputs a probability (0-1) for each token in the vocabulary (the set of all tokens the model knows). Based on this probability distribution, there are various sampling strategies that can be employed to choose which token to actually show to the user.
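To make that concrete, here is a minimal sketch in Python of that last sampling step. The vocabulary and logits are toy placeholders (a real model would produce the scores); it just shows temperature scaling plus top-k sampling, two of the common strategies:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]          # toy vocabulary
fake_logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])  # stand-in scores a model might emit

def softmax(logits, temperature=1.0):
    # Turn raw scores into a probability distribution over the vocabulary.
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

def sample_top_k(logits, k=3, temperature=0.8):
    # Keep only the k most likely tokens, renormalize, then sample one of them.
    probs = softmax(logits, temperature)
    top = np.argsort(probs)[-k:]
    top_probs = probs[top] / probs[top].sum()
    return np.random.choice(top, p=top_probs)

next_token = vocab[sample_top_k(fake_logits)]
print(next_token)  # greedy decoding would always pick "the"; sampling adds variety
```

Temperature and top-p are the kinds of knobs the API does expose, while the full per-token distribution stays hidden, which is the parent's point.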
Using fair use to scrape together their dataset and then crying wolf when their outputs are used to create a better competitor that runs locally, lol. They shouldn't have given a free tier, but then how else would you have captured casual users?
"The only way to compete against OpenAI is an open-source version of ChatGPT" [0]
It could also be a free binary-only model, but I would prefer a transparent one. Either way, it seems like Stanford Alpaca and LLaMA have taken off in cloning ChatGPT and making it good enough to compete against it at an affordable cost.
I think that last part is what the "product" is. Train it with a large data set, learn to constrain it, and give it a more fixed scope AKA customer support. Retain the best of your existing staff to monitor, review and train it in an ongoing way. Use them to deal with "outliers" and "new features".
People may lose jobs... but do companies want this? Can you imagine Google/FB/Twitter/[insert company name here] no longer being able to hide behind "No one can reach a human"?
It's not a meaningful restriction. Even if they were good at catching accounts doing this, the community that wants an open-source chat assistant is so huge that you could easily find many people to share their API keys or install a browser extension to share their chats.
Even that is not needed. There are enough people who want an open-source ChatGPT clone to create the dataset from scratch themselves. The Open Assistant community has actually already created enough training data for the initial model training, and the data will be released under an open source license. And from what I've played with of the initial model, it looks promising (though not anywhere close to GPT-3.5 yet).
I think so yes, and it's also nice IMO. Probably search won't be about receiving 10,000 search results, but about 1 concise, justified answer backed by actual sources.
Mass adoption of AI chatbots seems self-defeating. Who is publishing the information that the chatbot uses to give you 1 concise answer? Depending on the query, it could be Wikipedia or an academic journal, but for many topics the chatbot would need to draw from for-profit ad-supported websites.
The chatbot user benefits from having one great answer pulled from the best sources (and no ads), but the websites that underpin the chatbot’s usefulness will no longer have monetizable traffic. In the long-term this disincentivizes people from publishing online, which would reduce the quality of not only the chatbot output but the web as a whole.
Imagine local news being totally unavailable online by any means because the rise of chatbots means that nobody can make any money writing about local news.
Edit: A first reaction to this might be “have the chatbot show ads and share revenue with its sources.” This probably wouldn’t solve the problem. Journalism (and many kinds of writing) would be a less attractive career if your readership consists mostly of people getting second-hand summaries via chatbot. If chatbots do become popular, I worry about a bleak future where journalism and other writing are replaced by an anonymous blob of underpaid foreign laborers whose only job is to shovel up-to-date facts into chatbot databases.
That's a valid perspective, but remember that quality degraded when we started having paid memberships and ads on websites.
What news outlet is worth paying for? Who actually pays for content online? Who knows how to block all ads and didn't do it?
That business model really proved to be worthless; it dragged quality down with ever more desperate pay-to-read prompts. Nobody will miss it, and we will again have people with real-world experience writing up knowledge or opinions in their free time. So it goes back to that: quoting scientific papers, books, and blog posts by actually reputable and knowledgeable people.
With the news being mostly propaganda, I don't know what to quote there and how many outlets still have a reputation.
I personally write in my free time on my personal ad-free, minimalistic website, and I keep a list of blogs I follow for content.
Things change; waiting for full-time "authors" to come up with a profitable plan before we can make progress is also not an option.
> What news outlet is worth paying for? Who actually pays for content online? Who knows how to block all ads and didn't do it?
I don’t know how far we’re going to get if we not only expect volunteers to supply everything the model needs in order to be up-to-date and useful, but also expect access to the model to be free of charge and ad-free.
That’s a pretty massive amount of man-hours, compute, and R&D effort to expend for literally no return on investment.
> we will again have people with real-world experience writing up knowledge or opinions in their free time
We already have this now in the form of a bored knowledge-worker blogger class. It’s not nearly enough to provide up-to-date info for a chatbot, and I don’t see how the implosion of journalism as an industry will lead to more people spending their free time writing for no pay. If anything, it will drive more knowledge behind paywalls like Substack, which will be inaccessible to the chatbot anyway.
Couldn't agree more, and we are back to square one, wondering how we can find the good stuff again after it gets buried under the mass-SEO-optimized random garbage.
If we continue on this trajectory, I have a suspicion that the big players will increasingly cry “danger!” and, as Sam Altman has done already, call for government regulation of AI. Having potential upstarts buried in red tape is how monopolies and oligopolies sustain their positions in a lot of industries.
Altman and OpenAI have been stoking AI safety fears and using it as a selling point for narrow control and regulation since the same time they stopped pretending the “Open” in their name was meaningful. It’s clearly central to their business strategy. The best way to keep a head start is to use the government to put up roadblocks for your competitors.
I think their realisation was that the models themselves aren't actually that difficult to replicate, even in the absence of a patent or description.
i.e. they can't adequately defend their business with trade secrets.
Patents probably wouldn't work either because the structures are too easily recombined to bypass any conceivable patent that would be enforceable.
That said, I think all of this is actually emblematic of a deeper problem with the space, which is that none of the recent LLM stuff has been groundbreaking, just continual refinement of a given branch. We aren't seeing evolution, just increases in either the number of parameters or their quality, plus additional context. Which is why it was so easy for other folks to make the same progress in similar time periods.
Time will tell whether we are about to slam into a local maximum, or whether someone finds a significant evolution or, better yet, stumbles on a way to properly combine LLMs for context + NLP with traditional AI/logic/expert systems to engineer something that actually thinks and learns rather than regurgitating statistics.
We can also start suing NotOpenAI in many countries for the copyright infringement in their training data. They need to come up with a 'social contract' for AI models, in the same way that Google built a synergistic relationship with publishers in the past.
Actually, using public but copyrighted data is explicitly allowed by the Digital Single Market directive in the EU, precisely to allow new entrants into the market and to keep big tech from gatekeeping access to competitive data.
Although there are also typically very good reasons not to regulate the industry, which will never be discovered if the industry is regulated.
For example, imagine that engines had been regulated earlier, because steam engines could blow up and kill people. That's a good reason to regulate steam engines. But then we probably would never have invented other engine types like gasoline engines and jet engines. And with that, we'd never have invented planes or flight, because a regulation-compliant steam engine would have been too heavy.
Because the engines would have to conform to the strict mandates of the regulations and any innovations would have to seek regulatory approval with the usual ‘regulatory capture’ rules in effect.
The problem with "heavy regulation" for anything to do with tech is it puts you at a technological disadvantage.
The only thing AI regulation would do is hand AI supremacy to China. This is the case with almost every other technological development we've kneecapped ourselves on: nuclear power, high-speed rail, etc. We waste endless energy on bureaucracy while China is building.
Oh no, someone said something mean on Twitter. However will we survive?
The only reason propaganda is so effective is that life is so terrible. No one bought USSR propaganda in the '60s that the US was terrible, because people remembered growing up without electricity. A majority believe Russian propaganda about the US today because life expectancy in Thailand is higher.
I don't think China would release such an app; it's a country which highly values censorship and hiding information. Why would they want ChatGPT-like systems that aren't heavily censored / broken entering the public sphere?
Wouldn't it go the other way? Train it on dissident content, so that it's able to detect it and alert the censors, who can then send the police to re-educate the user?
Well, you could use both. One model for content generation, one for moderation and censorship.
But, in this case, as already discussed in other threads, why not simply transform everything people write or say? They still get to see what they wrote. Others only see a cleaned-up, government-supportive version. And do that to texts, chat, social networks… everywhere you can.
And suddenly, you actually are in 1984. Except you don’t even have to send the police, or beat up people: they’ll all be deeply convinced they are part of a very small minority. If not alone.
Hilarious that you asked a question and then immediately changed that to a statement 20 minutes later when challenged. Why didn’t you just make it a statement to begin with?
The point your rhetorical question tries to drive home is bad, which is why other commenters didn't let it sit. Thanks to dang, Hacker News has rules of conduct that prevent this place from degenerating into a cesspool, but now that another company has applied similar rules to its AI, the free speech absolutists cry foul at the podium.
If you want a generative AI without any decency, then make it yourself, but don't act like this AI is censored just because it communicates in the way that everyone in society does when they are not protected by the veil of online anonymity.
Rhetorical questions don’t work the way you think they do, then. Asking a question whose answer is controversial and which is phrased as an innocent inquiry is not rhetorical. At best it’s a tired ingroup membership signal rather than an honest attempt at discourse.
“You don't need a formal conspiracy when interests converge. These people went to the same universities and fraternities, they're on the same boards of directors, they're in the same country clubs, they have like interests. They don't need to call a meeting; they know what's good for them and they're getting it.” - George Carlin
It is the absolute textbook definition of conspiracy thinking. You seriously think some shadowy elite started "a lockdown, then a war, centralizing the banking system and unleashing A.I." in furtherance of some unspecified nefarious plan?
Wild claims presented with zero evidence can and should simply be dismissed without further thought.
It's the expected plan. And if you want to think critically, do your own research on these topics. If you expect the media to give you the facts, good luck.
The strange part is that everything that I mentioned is in the public domain, as a fact or policy.
Everything you claim makes sense, but so would aliens, or the idea that we're inside a computer game, etc. Inventing explanations for facts is a trivial skill. Formulating a theory that can predict future events is an actual achievement.
Literally read the book The Great Reset by Klaus Schwab; it's all discussed there. They literally have a big meeting every year where they invite a bunch of elites to discuss ways they can have more control over the global populace. There's nothing unfounded about it, except for the fact that it maybe doesn't technically count as a conspiracy if it's all done in the open.
> Literally read the book The Great Reset by Klaus Schwab, it's all discussed there
There's no way you read this book. Because I have.
Davos is mentioned once in passing in the entire book. And the whole book is about how we can build a more resilient, more equitable world for everyone, where we respect nature instead of plundering it.
Funny, the other day I had to deal with a former school board trustee making claims about the supposedly sexual and harmful content of a specific book about a transsexual person. My gut told me that what they were saying probably wasn't true, so I read the book and, voilà, none of what the person claimed to be in the book was actually in there. It does say something about a person when they use a study or book, anything that takes time and energy to evaluate, to make their point without actually having read it.
It’s actually not that uncommon for people with the same type and level of job to congregate and discuss - or conference - together to learn and socialise with each other.
And yet I bet OpenAI was trained on data for which they did not bother negotiating and honouring "terms of use". What's the difference?